# 654 Class Note for STAT 59800 with Professor Neville at Purdue

This 18-page set of class notes was uploaded by an elite notetaker on Friday, February 6, 2015, for a course at Purdue University taught in the Fall. Since its upload, it has received 18 views.

Date Created: 02/06/15

Data Mining, STAT 59800-024, Purdue University, February 19, 2009

## Predictive modeling: evaluation

Score functions:

- Zero-one loss / accuracy: $S(M) = \frac{1}{N}\sum_{i=1}^{N} I(\hat{y}_i = y_i)$
- Sensitivity / specificity
- Precision / recall / F1 (computed from the confusion matrix of predicted vs. actual classes: TP, FP, FN, TN)
- Absolute loss
- Squared loss
- Root mean-squared error
- Likelihood / conditional likelihood
- Area under the ROC curve

## Cost-sensitive models

Define a score function based on a cost matrix. If $\hat{y}$ is the predicted class and $y$ is the true class, then we need to define a matrix of costs $C(\hat{y}, y)$ that reflects the severity of classifying an instance with true class $y$ as class $\hat{y}$.

## ROC curves

- Receiver Operating Characteristic curve
- Plots the true positive rate against the false positive rate for different classification thresholds
- Evaluates performance over varying costs and class distributions
- Can be summarized with the area under the curve (AUC)

*(Figure: ROC curves, TP rate vs. FP rate.)*

## Bias-variance analysis

$$E_D[(y-\hat{y})^2] = \underbrace{E[(y - E[y \mid \mathbf{x}])^2]}_{\text{noise}} + \underbrace{(E_D[\hat{y}] - E[y \mid \mathbf{x}])^2}_{\text{bias}^2} + \underbrace{E_D[(\hat{y} - E_D[\hat{y}])^2]}_{\text{variance}}$$

- Noise: loss incurred independent of the learning algorithm
- Bias: loss incurred by the mean prediction relative to the optimal prediction
- Variance: average loss of the predictions compared to the mean prediction

*(Figure: training samples, learned models, and model predictions on a test set.)*

## Findings

- Bias is often related to the size of the model space: more complex models tend to have lower bias.
- Variance is often related to the size of the dataset: when the data is large enough to estimate the parameters well, models have lower variance.
- Simple models can perform surprisingly well due to lower variance.

## Bias-variance tradeoff

*(Figure: expected loss vs. size of parameter space.)*

## Ensemble methods

- Motivation: to improve performance over a single model that optimizes performance (why?).
- Approach: construct many models on different versions of the training set and combine them into one prediction.
- Goal: reduce bias and/or variance.

General idea: repeatedly alter the training data, apply the learning algorithm to each altered dataset, and aggregate the resulting M models into a single prediction.
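The general ensemble recipe just described (alter the training data, learn M models, aggregate their predictions) can be sketched in plain Python. This is an illustrative sketch, not code from the notes; the base learner is a deliberately simple threshold stump and the dataset is a made-up toy example:

```python
import random
from collections import Counter

def learn_stump(data):
    """Trivial base learner: pick the threshold on x that best splits the labels."""
    best = None
    for t in sorted({x for x, _ in data}):
        # predict 1 if x > t, else 0; count training errors for this threshold
        errors = sum((1 if x > t else 0) != y for x, y in data)
        if best is None or errors < best[1]:
            best = (t, errors)
    t = best[0]
    return lambda x: 1 if x > t else 0

def ensemble(data, learn, M=25, rng=random.Random(0)):
    """Learn M models on altered (here: bootstrapped) versions of the data."""
    models = []
    for _ in range(M):
        boot = [rng.choice(data) for _ in range(len(data))]  # draw with replacement
        models.append(learn(boot))
    # aggregate the M models by majority vote
    return lambda x: Counter(m(x) for m in models).most_common(1)[0][0]

# Toy data: label is 1 exactly when x > 5
data = [(x, int(x > 5)) for x in range(11)]
predict = ensemble(data, learn_stump)
print(predict(1), predict(7), predict(9))  # classify a few test points
```

Swapping in a different `learn` function changes the ensemble's base model; majority voting over models trained on resampled data is the aggregation step that bagging, described next, makes precise.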
## Bagging (bootstrap aggregating)

- Main assumption: combining many unstable predictors in an ensemble produces a stable predictor. An unstable predictor is one for which small changes in the training data produce large changes in the model (e.g., trees).
- Model space: non-parametric; can model any function if an appropriate base model is used.

Algorithm: given a training data set $D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$, for $m = 1, \ldots, M$:

1. Obtain a bootstrap sample $D_m$ by drawing $N$ instances with replacement from $D$.
2. Learn model $M_m$ from $D_m$.

To classify a test instance $t$, apply all models to $t$ and take the majority vote. The models have partially uncorrelated errors due to the differences in their training sets; each bootstrap sample contains about 63% of the distinct instances in $D$.

## Boosting

- Main assumption: combining many weak but stable predictors in an ensemble produces a strong predictor. A weak predictor only weakly predicts the correct class of instances (e.g., tree stumps, 1R).
- Model space: non-parametric; can model any function if an appropriate base model is used.

Algorithm: assign every example in $D$ an equal weight $1/N$. For $m = 1, \ldots, M$:

1. Learn model $M_m$ with $D_m$.
2. Calculate the error $err_m$ of $M_m$ and up-weight the examples that are incorrectly classified to form $D_{m+1}$.
3. Normalize the weights in $D_{m+1}$ to sum to 1.
4. Set $\alpha_m = \log[(1 - err_m)/err_m]$.

To classify a test instance $t$, apply all models to $t$ and take the weighted vote of their predictions (i.e., weighting by the $\alpha_m$).

## Pathologies of induction algorithms

- Overfitting: adding components to models that reduce performance or leave it unchanged.
- Oversearching: selecting models with lower performance as the size of the search space grows.
- Attribute selection errors: preferring attributes with many possible values despite lower performance.

(Jensen and Cohen, 2000)

## Overfitting

*(Figure: accuracy and tree size vs. sample size; Oates & Jensen, 1997, 1998, 1999.)*

## Overfitting (cont.)

*(Figure: tree size vs. number of instances for several pruning methods; Oates & Jensen, 1999.)*
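The overfitting pattern in these plots (training performance keeps improving while test performance does not) can be reproduced with a toy experiment. This sketch is not from the notes: it compares a maximally complex model that memorizes every training example against a trivially simple majority-class model, on data whose labels are pure noise:

```python
import random

rng = random.Random(42)

# Labels are coin flips: no attribute carries any signal,
# so no model can beat 50% accuracy on new data.
train = [(i, rng.randint(0, 1)) for i in range(200)]
test  = [(i, rng.randint(0, 1)) for i in range(200, 400)]

# "Complex" model: a lookup table that memorizes the training set.
table = dict(train)
memorize = lambda x: table.get(x, 0)

# "Simple" model: always predict the training-set majority class.
majority = max((0, 1), key=lambda c: sum(y == c for _, y in train))
simple = lambda x: majority

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

print(accuracy(memorize, train))  # 1.0 on the training set
print(accuracy(memorize, test))   # near chance on the test set
print(accuracy(simple, test))     # also near chance: the complexity bought nothing
```

The memorizer's extra components fit noise, which is exactly the "adding components that leave performance unchanged (or worse)" pathology above.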
## Oversearching

*(Figure: training- and test-set accuracy for heuristic vs. exhaustive search; Quinlan & Cameron-Jones, 1995; Murthy & Salzberg, 1995.)*

## Attribute selection errors

*(Figure: training- and test-set accuracy for attributes with few vs. many possible values; Quinlan, 1998; Liu & White, 1994.)*

## Evaluation functions are estimators

- Evaluation functions are functions $f(m, D)$ on models $m$ and data samples $D$.
- Samples vary in their representativeness: $f(m, D_1) = x_1 \neq x_2 = f(m, D_2)$.
- Each score $x_i$ is an estimate of some population parameter $\mu$.

## How do we use statistical inference?

- Parameter estimates ("What is the accuracy of $m$?"): evaluate accuracy on many samples to empirically estimate the sampling distribution, and use the distribution mean as the estimate of the population parameter.
- Hypothesis tests ("Does $m$ perform better than chance?"): evaluate accuracy on a sample and compare it to the sampling distribution under the null hypothesis $H_0$, to assess the probability that the accuracy would be achieved by chance.

*(Diagrams: population, possible samples, derived statistic values, sampling distribution.)*

## Multiple comparison procedures

1. Generate multiple items: generate $n$ models.
2. Estimate scores: using the training set and an evaluation function, calculate a score for each model.
3. Select the max-scoring item.

The sampling distribution of $X_{max}$ is different from the sampling distribution of $X_i$.

## Example: dice rolling

For a fair die with six outcomes, $H_0$: all outcomes are equally likely. What is the sampling distribution of $X_i$?

$$E[X \mid H_0] = 3.5 \qquad p(X > 5 \mid H_0) = 0.167$$

For the maximum of ten dice under the same $H_0$, the sampling distribution of $X_{max}$ is very different:

$$E[X_{max} \mid H_0] = 5.8 \qquad p(X_{max} > 5 \mid H_0) = 0.838$$

## Using the right sampling distribution

The sampling distribution of $X_{max}$ differs from the sampling distribution of $X_i$. A direct analogy exists between dice rolling and searching over multiple models (or model components, attributes, etc.).
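The dice numbers above can be checked exactly: for the maximum of $n$ fair dice, $P(X_{max} \le k) = (k/6)^n$. A quick verification (not part of the original notes):

```python
def max_die_stats(n):
    """Exact E[X_max] and P(X_max > 5) for the maximum of n fair six-sided dice."""
    # P(max <= k) = (k/6)**n, so P(max == k) = (k/6)**n - ((k-1)/6)**n
    pmf = [(k, (k / 6) ** n - ((k - 1) / 6) ** n) for k in range(1, 7)]
    expected = sum(k * p for k, p in pmf)
    p_gt_5 = 1 - (5 / 6) ** n  # only k = 6 exceeds 5
    return expected, p_gt_5

# One die: the ordinary sampling distribution (E = 3.5, p about 0.167)
print(max_die_stats(1))

# Maximum of ten dice: the distribution shifts sharply upward
e, p = max_die_stats(10)
print(round(e, 2), round(p, 3))  # 5.82 0.838
```

This reproduces the 5.8 and 0.838 quoted in the notes, and makes the point concrete: the same die, judged by its maximum over ten rolls, looks far better than chance.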
The evaluation of any given score varies with the number of models (or components, attributes, etc.) compared during search.

## Multiple comparisons are ubiquitous in learning

Multiple comparisons are used to select:

- Settings: A>1, A>2, A>4
- Components: A>3, B=4, C>563
- Models: Tree 1, Tree 2, Tree 3
- Methods: trees, rules, networks
- Parameters: depth=4, depth=5, depth=6

## Explaining pathologies

### Incorrect hypothesis tests

Under $H_0$ there is a non-zero probability that any model's score $x_i$ will exceed some critical value $x_{crit}$. The probability that the maximum of $n$ scores, $x_{max}$, will exceed $x_{crit}$ is uniformly equal or higher:

$$p(X_{max} > x_{crit} \mid H_0) \geq p(X_i > x_{crit} \mid H_0)$$

### Overfitting

- Many components are available to use in a given model, and algorithms select the component with the maximum score.
- The correct sampling distribution depends on the number of components evaluated.
- Most learning algorithms do not adjust for the number of components.

### Biased parameter estimates

- Sample scores are routinely used as estimates of population parameters.
- Any $x_i$ is often an unbiased estimator of the population score, but $x_{max} = \max_i x_i$ is almost always a biased estimator.

### Oversearching

- Two or more search spaces contain different numbers of models, so the maximum scores in each space are biased to differing degrees, yet most algorithms directly compare these scores.
- Attribute selection errors can be explained in an analogous way.

## Adjusting for multiple comparisons

- Remove bias by testing on withheld data: new data (e.g., Oates & Jensen, 1999) or cross-validation (e.g., Weiss and Kulikowski, 1991).
- Estimate the sampling distribution accurately: randomization tests (e.g., Jensen, 1992).
- Adjust the probability calculation: Bonferroni adjustment (e.g., Jensen & Schmill, 1997).
- Alter the evaluation function to incorporate a complexity penalty (MDL, BIC, etc.).

## Summary

Classification is an extensively studied problem:

- Probably the most widely used data mining technique
- Clearly defined task
- Well-defined evaluation functions to guide search
- A framework to understand model performance
- Other issues: scalability, feature selection, understandability, non-standard data representations

## Next class

- Due: HW3
- Reading: Chapter 6, PDM
- Topic: descriptive modeling
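The selection bias behind these pathologies is easy to demonstrate by simulation: give $n$ "models" identical true accuracy, score each on a finite evaluation sample, and compare a single score with the maximum score. This closing sketch is not from the notes; the numbers (true accuracy 0.70, 50 test instances, 10 models) are arbitrary choices for illustration:

```python
import random

rng = random.Random(7)
TRUE_ACC = 0.70   # every "model" has the same true accuracy
N_TEST = 50       # size of the evaluation sample
N_MODELS = 10     # number of models compared
TRIALS = 2000

def sample_score():
    """Observed accuracy of one model on a finite evaluation sample."""
    return sum(rng.random() < TRUE_ACC for _ in range(N_TEST)) / N_TEST

single, maximum = [], []
for _ in range(TRIALS):
    scores = [sample_score() for _ in range(N_MODELS)]
    single.append(scores[0])     # one score: unbiased estimate of 0.70
    maximum.append(max(scores))  # max of ten scores: biased upward

print(sum(single) / TRIALS)   # close to 0.70
print(sum(maximum) / TRIALS)  # noticeably above 0.70
```

Even though no model is better than any other, the max-scoring model's observed accuracy substantially overstates its true accuracy, which is exactly why $x_{max}$ must be evaluated against the sampling distribution of the maximum, or re-estimated on withheld data.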
