# Data Mining & Statistical Learning (ISYE 7406)


These class notes for ISYE 7406 at Georgia Institute of Technology - Main Campus, taught by Staff in Fall, were uploaded by Maryse Thiel on Monday, November 2, 2015. Since their upload, they have received 12 views. For similar materials see /class/234211/isye-7406-georgia-institute-of-technology-main-campus in Industrial Engineering at Georgia Institute of Technology - Main Campus.


Date Created: 11/02/15

## Tree-Based Models

Kwok-Leung Tsui, Industrial & Systems Engineering, Georgia Institute of Technology

### Regression Models

1. Classical linear model: $E(Y \mid X_1,\dots,X_p) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$
2. Generalized linear model: $g(E(Y \mid X_1,\dots,X_p)) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$
3. Additive model: $E(Y \mid X_1,\dots,X_p) = \beta_0 + \sum_{j=1}^{p} f_j(X_j)$, where each $f_j$ is a smooth nonparametric function
4. Generalized additive model: $g(E(Y \mid X_1,\dots,X_p)) = \beta_0 + \sum_{j=1}^{p} f_j(X_j)$

### Classification and Regression Trees (CART)

Original data (input) → Algorithm → Output. Starting from the root node, split the data into two or more subsets based on the values of the input variables $X_1, X_2, \dots$ (e.g., $x_1 > c$: yes/no). Continuously split each subset (intermediate nodes) into finer subsets, and stop growing the tree, or prune it, to obtain the terminal nodes.

### Weather Data: Play or Not?

| Outlook | Temperature | Humidity | Windy | Play |
|---|---|---|---|---|
| sunny | hot | high | false | No |
| sunny | hot | high | true | No |
| overcast | hot | high | false | Yes |
| rain | mild | high | false | Yes |
| rain | cool | normal | false | Yes |
| rain | cool | normal | true | No |
| overcast | cool | normal | true | Yes |
| sunny | mild | high | false | No |
| sunny | cool | normal | false | Yes |
| rain | mild | normal | false | Yes |
| sunny | mild | normal | true | Yes |
| overcast | mild | high | true | Yes |
| overcast | hot | normal | false | Yes |
| rain | mild | high | true | No |

[Figure: an example tree for "Play", splitting first on outlook (sunny / overcast / rain), with further splits below the sunny and rain branches.]

### Tree-Based Models: Basic Idea

- Partition the feature space into a set of rectangles (for simplicity, by recursive binary partition).
- Fit a simple model (e.g., a constant) in each one.

A binary partition of $(X_1, X_2)$ into regions $R_1,\dots,R_5$ corresponds to a binary tree.

### The Model

With $c_m$ the prediction value corresponding to region $R_m$:

$$f(x) = \sum_m c_m\, 1\{(x_1,x_2)\in R_m\} = c_1 1\{(x_1,x_2)\in R_1\} + c_2 1\{(x_1,x_2)\in R_2\} + \dots + c_5 1\{(x_1,x_2)\in R_5\}.$$

Two types: regression trees and classification trees.

### Fundamental Issues in Tree-Based Models

- How to decide the splitting point (tree growing).
- How to control the size of the tree (tree pruning).

### Regression Trees

For each of $N$ observations the input is $x_i = (x_{i1}, x_{i2}, \dots, x_{ip})$ and the output $y_i$ is continuous. Partition the space into $M$ regions $R_1, R_2, \dots, R_M$ and set $f(x) = \sum_{m=1}^{M} c_m 1\{x \in R_m\}$. The best partition minimizes the sum of squared errors $\sum_{i=1}^{N} (y_i - f(x_i))^2$, which gives $\hat c_m = \text{average}(y_i \mid x_i \in R_m)$.

### Classification Trees

For regression trees, the impurity measure is
$$Q_m(T) = \frac{1}{N_m}\sum_{x_i \in R_m} (y_i - \hat c_m)^2, \qquad \hat c_m = \frac{1}{N_m}\sum_{x_i \in R_m} y_i.$$
For classification trees, with $\hat p_{mk}$ the proportion of class $k$ cases in node $m$ and $k(m) = \arg\max_k \hat p_{mk}$, common impurity measures $Q_m(T)$ are:

- Misclassification error: $\frac{1}{N_m}\sum_{i \in R_m} 1\{y_i \neq k(m)\} = 1 - \hat p_{m k(m)}$
- Gini index: $\sum_{k \neq k'} \hat p_{mk}\hat p_{mk'} = \sum_k \hat p_{mk}(1 - \hat p_{mk})$
- Cross-entropy (deviance): $-\sum_k \hat p_{mk}\log \hat p_{mk}$

### CART Fundamentals: Notation

$X$: input vector; $\mathcal{X}$: input space; $J$: number of classes; $C$: set of classes, e.g., $C = \{1, 2, \dots, J\}$.

Two definitions:

- A classifier (classification rule) is a function $d(X)$ defined on $\mathcal{X}$ such that for every $X \in \mathcal{X}$, $d(X) \in C$.
- A classifier is a partition of the input space into $J$ exhaustive, disjoint subsets $R_1, \dots, R_J$ such that for every $X \in R_j$ the predicted class is $j$.

### Accuracy Estimation

Question: how good is a classifier in prediction, i.e., how accurate is it? Given the learning sample $L = \{(x_n, j_n),\ n = 1, \dots, N\}$, $x_n \in \mathcal{X}$, $j_n \in C$, construct $d(X)$. Let $(X, j)$ be a new sample from the same population as $L$. The true misclassification rate of $d(X)$ is
$$R^*(d) = P(d(X) \neq j).$$

Three types of estimates of $R^*(d)$: the training error $R(d)$, the test error $R^{ts}(d)$, and the cross-validation error $R^{cv}(d)$.

- Training error: construct $d(X)$ using $L$ and test it on $L$:
$$R(d) = \frac{1}{N}\sum_{(x_n, j_n) \in L} 1\{d(x_n) \neq j_n\}.$$
- Test error: randomly split $L$ into $L_1$ and $L_2$, construct $d(X)$ using $L_1$, and test it on $L_2$:
$$R^{ts}(d) = \frac{1}{N_2}\sum_{(x_n, j_n) \in L_2} 1\{d(x_n) \neq j_n\},$$
where $N_2$ is the number of cases in $L_2$.
- Cross-validation error: randomly split $L$ into $V$ subsets $L_1, \dots, L_V$ of as nearly equal size as possible. For $v = 1, \dots, V$, construct $d_v(X)$ using $L - L_v$; its test error is
$$R^{ts}(d_v) = \frac{1}{N_v}\sum_{(x_n, j_n) \in L_v} 1\{d_v(x_n) \neq j_n\},$$
where $N_v$ is the number of cases in $L_v$. The $V$-fold cross-validation error is
$$R^{cv}(d) = \frac{1}{V}\sum_{v=1}^{V} R^{ts}(d_v).$$

### Classification Trees: Construction

Classification trees are tree-structured classifiers. Construction: beginning with the input space, repeatedly split it into two descendant subsets. Splits are formed by conditions on the input vector $X = (X_1, X_2, \dots)$.
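The three classification impurity measures defined above can be computed directly from a node's class proportions. A minimal sketch (the function name is mine, not from the notes):

```python
import math

def impurities(props):
    """Node impurity from class proportions p_1, ..., p_J (summing to 1).

    Returns (misclassification error, Gini index, cross-entropy):
      1 - max_k p_k,   sum_k p_k (1 - p_k),   -sum_k p_k log p_k.
    """
    misclass = 1.0 - max(props)
    gini = sum(p * (1.0 - p) for p in props)
    entropy = -sum(p * math.log(p) for p in props if p > 0)  # 0*log(0) := 0
    return misclass, gini, entropy
```

A uniform node such as (1/3, 1/3, 1/3) maximizes all three measures, while a pure node such as (1, 0) gives zero impurity under each of them.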
For example, a split could be a condition such as $X_2 + X_4 \le 7$.

### An Example Tree

An example tree $T$ partitions $\mathcal{X}$ into regions formed from its terminal nodes (e.g., $R_1 = \tilde t_6 \cup \tilde t_9$), so that $\mathcal{X} = R_1 \cup R_2 \cup R_3$ and for every $X \in R_j$ the assigned class is $j$. Each terminal node is assigned a class label (the number below the terminal node); terminal nodes with the same class label are joined together to obtain a partition of the input space.

### Tree Construction: Three Issues

- How to grow, i.e., how to split?
- Which class should a terminal node be assigned to?
- When to stop?

### Split Selection

Steps for growing a tree: starting from the root node $t_1$, search through all candidate splits to find the optimal one using a criterion; split $t_1$ into $t_2$ and $t_3$ using that optimal split; repeat the procedure on both $t_2$ and $t_3$ separately.

Question: how to choose the optimal split at each step, i.e., what criterion should be used? Key idea: splitting a node should make the data in the descendant nodes purer than the data in the parent node.

An illustration: for a three-class problem, denote by $(p_1, p_2, p_3)$ the proportions of the three classes in a node. If node $t_1$ has proportions $(1/3, 1/3, 1/3)$ and its children $t_2$ and $t_3$ have proportions such as $(0.9, 0.1, 0)$ and $(0.1, 0, 0.9)$, then $t_2$ and $t_3$ are purer than $t_1$.

### More Notation

- $\pi(j)$: prior probability of class $j$, $j = 1, \dots, J$
- $N(t)$: number of cases in node $t$; $N_j(t)$: number of class-$j$ cases in node $t$
- $N_j$: number of class-$j$ cases in the learning sample; $N$: number of cases in the learning sample
- $p(j, t) = \pi(j)\, N_j(t)/N_j$: probability that a class-$j$ case falls into node $t$
- $p(t) = \sum_{j=1}^{J} p(j, t)$: probability that a case falls into node $t$
- $p(j \mid t) = p(j, t)/p(t)$: probability that a case is in class $j$ given that it falls into node $t$

### Impurity Function

An impurity function $\phi$ is a function defined on $(p_1, \dots, p_J)$ with $p_j \ge 0$ and $\sum_{j=1}^{J} p_j = 1$, satisfying:

- $\phi$ achieves its maximum only at $(1/J, \dots, 1/J)$;
- $\phi$ achieves its minimum only at $(1, 0, \dots, 0), (0, 1, 0, \dots, 0), \dots, (0, \dots, 0, 1)$;
- $\phi$ is a symmetric function of $p_1, \dots, p_J$.

### Three Impurity Functions

- Gini index: $\phi(p_1, \dots, p_J) = \sum_{j=1}^{J} p_j(1 - p_j)$
- Entropy: $\phi(p_1, \dots, p_J) = -\sum_{j=1}^{J} p_j \log p_j$
- Misclassification error: $\phi(p_1, \dots, p_J) = 1 - \max_j p_j$

[Figure: the three impurity functions for the two-class case, with entropy scaled to pass through (0.5, 0.5).]

### Impurity Measure for a Node

Given an impurity function $\phi$, define the impurity measure for a node $t$ as $i(t) = \phi(p(1 \mid t), \dots, p(J \mid t))$. If a split $s$ sends a proportion $p_L$ of the cases in $t$ to $t_L$ and a proportion $p_R$ to $t_R$, the goodness of split $s$ is defined as
$$\Delta i(s, t) = i(t) - p_L\, i(t_L) - p_R\, i(t_R).$$

### Impurity Measure for Trees

For a tree $T$ with set of terminal nodes $\tilde T$, define its impurity as $I(T) = \sum_{t \in \tilde T} i(t)\, p(t)$. For $t \in \tilde T$, split it into $t_L$ and $t_R$ using a split $s$; the new tree $T'$ has impurity
$$I(T') = \sum_{t' \in \tilde T,\ t' \neq t} i(t')\, p(t') + i(t_L)\, p(t_L) + i(t_R)\, p(t_R).$$
The decrease in impurity is
$$\Delta I(s, t) = I(T) - I(T') = \bigl[i(t) - p_L\, i(t_L) - p_R\, i(t_R)\bigr] p(t) = \Delta i(s, t)\, p(t),$$
where $p_L = p(t_L)/p(t)$ and $p_R = p(t_R)/p(t)$.

Decision rule: the best split is selected to split a node so as to minimize the overall tree impurity (equivalently, to maximize $\Delta I$).

### Standard Set of Splits $\mathcal{S}$

Suppose $X = (X_1, \dots, X_M)$, where $M$ is the dimension of the input space. $\mathcal{S}$ is defined as follows:

- Each split depends on only a single input variable.
- For a continuous or ordinal variable $X_m$, $\mathcal{S}$ includes all splits of the form $X_m \le c$, where the values of $c$ are restricted to midpoints between consecutively ordered data values in the learning sample.
- If $X_m$ is categorical, taking values in $\{b_1, \dots, b_L\}$, $\mathcal{S}$ includes all splits of the form $X_m \in A$, where $A$ ranges over all subsets of $\{b_1, \dots, b_L\}$.

Steps for splitting a node: at each node, search over all input variables, beginning with $X_1$ and continuing up to $X_M$; for each variable find the best split; then compare these $M$ best splits and select the optimal one to split the node.

[Figure: tree vs. partitioning, with class-1 and class-2 cases scattered over the $(X_1, X_2)$ plane.] When all input variables can be ordered, the tree construction procedure can be treated as recursive partitioning of $\mathcal{X}$ into rectangles such that the cases within one rectangle become more and more class-homogeneous.

### Class Assignment

Class assignment rule 1: let $j^*(t)$ denote the class assigned to node $t \in \tilde T$; then $j^*(t) = \arg\max_{i \in C} p(i \mid t)$. If the maximum is achieved by more than one class, assign $j^*(t)$ randomly as any one of the maximizing classes.

Misclassification cost: let $C(i \mid j)$ denote the cost of misclassifying a class-$j$ case as a class-$i$ case, satisfying $C(i \mid j) \ge 0$ for $i \neq j$ and $C(i \mid i) = 0$. Given a node $t$ with estimated $p(1 \mid t), \dots, p(J \mid t)$: if a randomly selected case of unknown class falls into $t$ and is classified as $i$, the estimated expected misclassification cost is $\sum_j C(i \mid j)\, p(j \mid t)$.

Class assignment rule 2: $j^*(t) = \arg\min_{i \in C} \sum_j C(i \mid j)\, p(j \mid t)$. If the minimum is achieved by more than one class, assign $j^*(t)$ randomly as any one of the minimizing classes.

### When to Stop

A simple stop-splitting rule: set a threshold $\beta > 0$ and declare a node $t$ terminal if $\max_s \Delta I(s, t) < \beta$. Problems:

- $\beta$ too low → too much splitting → the tree is too large → overfitting.
- $\beta$ too high → splitting stops too soon: there may exist a node $t$ with $\max_s \Delta I(s, t) < \beta$ whose descendant nodes $t_L$ and $t_R$ would have splits with large decreases in impurity.

Conclusion: stop-splitting rules are not the right answer. Solution: grow a large tree, then prune upward to find the right-sized tree.

### Get Ready to Prune

Grow a sufficiently large tree $T_{\max}$ by continuing to split until each terminal node either is pure (its cases are all in one class), or contains cases with identical input vectors (these cases may belong to different classes), or satisfies $N(t) \le N_{\min}$.

### Definitions

- A branch $T_t$ of a tree $T$ with root node $t$ consists of the node $t$ and all descendants of $t$ in $T$.
- Pruning a branch $T_t$ from a tree $T$ consists of deleting all descendants of $t$; the pruned tree is denoted $T - T_t$.
- If $T'$ is obtained from $T$ by successively pruning off branches, then $T'$ is called a pruned subtree of $T$, denoted $T' \preceq T$.

### Tree Pruning

Problem: even for a moderate-sized tree with, say, 30 to 40 nodes, there is a large number of subtrees and an even larger number of ways of pruning up to the root node. One solution: minimal cost-complexity pruning.
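Before turning to pruning, the split-selection criterion above, maximizing $\Delta i(s,t) = i(t) - p_L\, i(t_L) - p_R\, i(t_R)$ over midpoint cutpoints, can be sketched for a single continuous variable with Gini impurity (all names are mine, not from the notes):

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    props = [labels.count(c) / n for c in set(labels)]
    return sum(p * (1 - p) for p in props)

def best_split(x, y):
    """Search all midpoint splits x <= c and return (c, goodness) maximizing
    the impurity decrease  Di(s,t) = i(t) - p_L*i(t_L) - p_R*i(t_R).
    Returns (None, -1.0) if the variable has a single distinct value."""
    parent = gini(y)
    best_c, best_gain = None, -1.0
    xs = sorted(set(x))
    for a, b in zip(xs, xs[1:]):           # midpoints between ordered values
        c = (a + b) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= c]
        right = [yi for xi, yi in zip(x, y) if xi > c]
        p_l = len(left) / len(y)
        gain = parent - p_l * gini(left) - (1 - p_l) * gini(right)
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c, best_gain
```

For instance, `best_split([1, 2, 3, 4], ['a', 'a', 'b', 'b'])` finds the midpoint 2.5 with the maximal decrease 0.5, since both children are pure.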
### Cost-Complexity Measure

Recall the class assignment rule $j^*(t) = \arg\max_{i \in C} p(i \mid t)$, and that $p(t)$ is the probability that a case falls into node $t$. The probability of misclassification given that a case falls into node $t$ is $r(t) = 1 - \max_{i \in C} p(i \mid t)$, and we write $R(t) = r(t)\, p(t)$.

Proposition: for any split of a node $t$ into $t_L$ and $t_R$, $R(t) \ge R(t_L) + R(t_R)$.

The misclassification cost of a tree $T$ is $R(T) = \sum_{t \in \tilde T} r(t)\, p(t) = \sum_{t \in \tilde T} R(t)$, where $\tilde T$ denotes the set of terminal nodes in $T$. The complexity of $T$ is $|\tilde T|$, the number of terminal nodes. Let $\alpha > 0$ be a real number, called the complexity parameter; the cost-complexity measure of $T$ is
$$C_\alpha(T) = R(T) + \alpha\, |\tilde T|.$$

### Minimal Cost-Complexity Pruning (MCCP)

Objective: for each value of $\alpha$, find the subtree $T(\alpha) \preceq T_{\max}$ such that $C_\alpha(T(\alpha)) = \min_{T \preceq T_{\max}} C_\alpha(T)$, and such that $C_\alpha(T) = C_\alpha(T(\alpha))$ implies $T(\alpha) \preceq T$; $T(\alpha)$ is then called the smallest-sized minimizing subtree for the complexity parameter $\alpha$.

Remarks: for each value of $\alpha$ there may be several minimizing subtrees, but there is one and only one smallest-sized minimizing subtree. The smaller $\alpha$, the larger $T(\alpha)$: for $\alpha = 0$, $T(\alpha)$ is the smallest subtree with the smallest misclassification cost; for $\alpha$ sufficiently large, $T(\alpha)$ consists of only the root node.

### MCCP Procedure

Start with $T_{\max}$. For $\alpha = 0$, get the smallest-sized minimizing subtree $T(0) = T_1$; note $C_0(T_1) = R(T_1) = R(T_{\max})$. For any nonterminal node $t \in T_1$, let $T_{1t}$ be the branch of $T_1$ rooted at $t$. Set
$$C_\alpha(\{t\}) = R(t) + \alpha, \qquad C_\alpha(T_{1t}) = R(T_{1t}) + \alpha\, |\tilde T_{1t}|.$$
Setting $C_\alpha(\{t\}) = C_\alpha(T_{1t})$ and solving gives
$$\alpha(t) = \frac{R(t) - R(T_{1t})}{|\tilde T_{1t}| - 1}.$$
Note $R(t) > R(T_{1t})$, and for $\alpha < \alpha(t)$ we have $C_\alpha(\{t\}) > C_\alpha(T_{1t})$. Find the nonterminal node $\bar t_1 \in T_1$ such that $\alpha(\bar t_1) = \min_{t} \alpha(t) \equiv \alpha_2$. The node $\bar t_1$ is the weakest link in $T_1$, in the sense that as $\alpha$ increases, $\bar t_1$ is the first node at which $C_\alpha(\{t\}) = C_\alpha(T_{1t})$; the single node $\{\bar t_1\}$ is then preferred to the branch $T_{1\bar t_1}$, and $\alpha_2$ is the value of $\alpha$ at which equality happens.

Prune away the branch $T_{1\bar t_1}$ to get a new tree $T_2 = T_1 - T_{1\bar t_1} \preceq T_1$. Using $T_2$ instead of $T_1$, calculate $\alpha(t) = (R(t) - R(T_{2t}))/(|\tilde T_{2t}| - 1)$ for each nonterminal $t \in T_2$, find the weakest link $\bar t_2$ with $\alpha(\bar t_2) = \min_t \alpha(t) \equiv \alpha_3$, and prune away the branch $T_{2\bar t_2}$ to obtain $T_3 \preceq T_2$.
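One weakest-link step of the procedure above can be sketched by representing a tree as nested dicts holding each node's own cost $R(t)$, computing $\alpha(t) = (R(t) - R(T_t))/(|\tilde T_t| - 1)$ for every internal node, and taking the minimizer. This is an illustrative sketch (names and data structure are mine), not the CART implementation:

```python
def leaves_and_cost(node):
    """Return (number of terminal nodes, total cost R(T_t)) of the branch
    rooted at `node`; a node without children is terminal."""
    if not node.get("children"):
        return 1, node["R"]
    n, cost = 0, 0.0
    for child in node["children"]:
        nl, cl = leaves_and_cost(child)
        n += nl
        cost += cl
    return n, cost

def weakest_link(node, best=None):
    """Find the internal node minimizing alpha(t) = (R(t) - R(T_t)) / (|T~_t| - 1)."""
    if not node.get("children"):
        return best
    n_leaves, branch_cost = leaves_and_cost(node)
    alpha = (node["R"] - branch_cost) / (n_leaves - 1)
    if best is None or alpha < best[0]:
        best = (alpha, node)
    for child in node["children"]:
        best = weakest_link(child, best)
    return best
```

Pruning then replaces the returned node's branch by the node itself, and the search repeats on the smaller tree, yielding the increasing sequence of $\alpha$ values.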
Repeating this procedure gives a decreasing sequence of subtrees
$$T_1 \succeq T_2 \succeq T_3 \succeq \dots \succeq \{t_{\text{root}}\}$$
and an increasing sequence of values $0 = \alpha_1 < \alpha_2 < \dots < \alpha_K$. If at any stage there is more than one weakest link, for instance if $\alpha(\bar t_k) = \alpha(\bar t_k')$, then define $T_{k+1} = T_k - T_{k\bar t_k} - T_{k\bar t_k'}$.

### Cross-Validation for the Optimal Tree

Question: which subtree has the best generalization performance, i.e., what value of $\alpha$ should be chosen? One solution: cross-validation. Can we use $R(T)$ only? Yes with independent validation data; no with cross-validation, because each grown tree is different, as it is based on different training data.

$V$-fold cross-validation: randomly split the learning sample $L$ into $V$ subsets $L_v$, $v = 1, \dots, V$, of as nearly equal size as possible.

- For each $v$: grow a sufficiently large tree $T_{\max}^{(v)}$ using $L - L_v$. Applying MCCP to $T_{\max}^{(v)}$ gives a decreasing sequence of subtrees $T_1^{(v)} \succeq T_2^{(v)} \succeq \dots$ and an increasing sequence of values $0 = \alpha_1^{(v)} < \alpha_2^{(v)} < \dots < \alpha_{K_v}^{(v)}$.
- Using the whole sample $L$: grow a sufficiently large tree $T_{\max}$. Applying MCCP to $T_{\max}$ gives a decreasing sequence of subtrees $T_1 \succeq T_2 \succeq \dots$ and an increasing sequence of values $0 = \alpha_1 < \alpha_2 < \dots < \alpha_K$.

Let $\alpha'_k = \sqrt{\alpha_k\, \alpha_{k+1}}$. Calculate the cross-validation error of tree $T_k$, $k = 1, \dots, K$:
$$R^{cv}(T_k) = \frac{1}{V}\sum_{v=1}^{V} R^{ts}\bigl(T^{(v)}(\alpha'_k)\bigr).$$
Note that $T_k = T(\alpha)$ for all $\alpha \in [\alpha_k, \alpha_{k+1})$. The optimal tree $T_{k_0}$, i.e., the tree with the smallest cross-validation error, is determined by $R^{cv}(T_{k_0}) = \min_k R^{cv}(T_k)$.

### Standard Errors and the 1-SE Rule

The test error of a tree $T_k$ on a sample $L_2$ with $N_2$ cases is
$$R^{ts}(T_k) = \frac{1}{N_2}\sum_{n=1}^{N_2} 1\{T_k(x_n) \neq j_n\},$$
and its standard error is
$$SE\bigl(R^{ts}(T_k)\bigr) = \sqrt{\frac{R^{ts}(T_k)\bigl(1 - R^{ts}(T_k)\bigr)}{N_2}}.$$
Define $v(n)$, $n = 1, \dots, N$, by $v(n) = v$ if $(x_n, j_n) \in L_v$. The cross-validation error of a tree $T_k$ is
$$R^{cv}(T_k) = \frac{1}{N}\sum_{n=1}^{N} 1\bigl\{T^{(v(n))}(x_n; \alpha'_k) \neq j_n\bigr\},$$
with standard error
$$SE\bigl(R^{cv}(T_k)\bigr) = \sqrt{\frac{R^{cv}(T_k)\bigl(1 - R^{cv}(T_k)\bigr)}{N}}.$$

The 1-SE rule chooses $T_{k_1}$, where $k_1$ is the largest $k$ such that
$$R^{est}(T_{k_1}) \le R^{est}(T_{k_0}) + SE\bigl(R^{est}(T_{k_0})\bigr),$$
where $R^{est}$ can be $R^{ts}$ or $R^{cv}$ and $k_0 = \arg\min_k R^{est}(T_k)$. In words, the 1-SE rule chooses the simplest tree whose accuracy is comparable to that of $T_{k_0}$.

### Variable Combinations

Up to now, all splits are on a single variable. If linear structure is suspected, the set of candidate splits can be extended to include all linear-combination splits of the form $\sum_m a_m x_m \le c$. However, this is too expensive computationally and not very useful in many cases.

### Missing Values

Two circumstances: missing values in the learning sample, and missing values in the testing sample. Three ways to handle the missing-value problem:

- Discard any cases with missing values (may lead to serious depletion of the learning sample).
- For a continuous variable, fill in the missing values with averages; for a categorical variable, create a new category "missing".
- Surrogate splits: for a node, rank the optimal splits on each variable; if the variable used by the best split is missing, use the second-best split, and so on.

### Other Issues

- Variable selection is biased in favor of variables having more distinct values and thus offering more splits; "end cuts" (splits that tend to separate a node into one small and one large subset) are favored over middle cuts.
- The tree procedure is only one-step optimal, not overall optimal. For instance, suppose the procedure produces a tree with 12 terminal nodes; if one could search over all possible trees with 12 terminal nodes for the optimal one, the two results might be very different (similar to stepwise regression).

### Example: Email or Spam (Hastie et al.)

Objective: predict whether a message is genuine email or spam. Data: (i) 4601 email messages with the true outcome (email or spam); (ii) the relative frequencies of the 57 most commonly occurring words and punctuation marks in each message. The 57 predictors are:

- 48 quantitative predictors: the percentage of words that match a given word (e.g., business, address, internet, free, george);
- 6 quantitative predictors: the percentage of characters that match a given character (e.g., `;`, `(`, `[`, `!`, `$`, `#`);
- the average length of uninterrupted sequences of capital letters (CAPAVE);
- the length of the longest uninterrupted sequence of capital letters (CAPMAX);
- the sum of the lengths of uninterrupted sequences of capital letters (CAPTOT).

[Figure: the fitted classification tree for the spam data; the split variables and cutpoints are not recoverable from the source.]

Tree prediction results on test data (percentages of all test cases):

| True \ Predicted | email | spam |
|---|---|---|
| email | 57.3 | 4.0 |
| spam | 5.3 | 33.4 |

Sensitivity: the probability of predicting disease given true disease. Specificity: the probability of predicting non-disease given no disease. In the email example:
$$\text{Sensitivity} = 100 \times \frac{33.4}{33.4 + 5.3} = 86.3\%, \qquad \text{Specificity} = 100 \times \frac{57.3}{57.3 + 4.0} = 93.4\%.$$

[Figure: performance comparison, sensitivity vs. specificity (ROC-style curves); areas under the curves: Tree 0.95, GAM 0.98, Weighted Tree 0.90.]

### Frontier-Based Tree Pruning (FBP) Algorithm

Huo, Kim, Tsui, and Wang, Journal of Computing, 2006.

Complexity penalized loss function (CPLF; Breiman et al., 1984):
$$L_\lambda(T_b) = \underbrace{R(T_b)}_{\text{cost (error rate)}} + \lambda \underbrace{|\tilde T_b|}_{\text{complexity (number of terminal nodes)}},$$
where $\lambda$ is a penalizing parameter. Objective: find the subtree $\hat T = \arg\min_{T_b} L_\lambda(T_b)$ minimizing the CPLF.

[Figure: a worked example of the FBP algorithm, propagating lists of penalized losses of the form (number of error cases at the node) plus multiples of $\lambda$ up the tree; the individual numbers are not recoverable from the source.]

FBP utilizes the list of CPLFs at the root node, $(c_1, c_2, \dots)$, where $c_i$ is the value of the minimized loss function when the size of the tree is $i$; the minimized loss $\min_i (c_i + \lambda\, i)$ is a piecewise linear function of $\lambda$. For the algorithm, see Huo, Kim, Tsui, and Wang (2004); its complexity is $O(N)$, where $N$ is the number of lines.

Frontier lines and admissible tree sizes: plotting the lines $c_i + \lambda\, i$ against $\lambda$, the lower envelope of the lines is analogous to an efficient frontier in investment theory. The tree sizes whose lines appear on the frontier are the admissible tree sizes based on $\lambda$; a line that never attains the envelope corresponds to an inadmissible tree size (in the slides' example, 4 possible tree sizes but only 3 admissible ones). Next question: which $\lambda$ (tree size) produces an optimal tree?
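The frontier idea can be sketched numerically: each candidate size $i$ contributes a line $c_i + \lambda\, i$, and the admissible sizes are those whose line touches the lower envelope for some $\lambda \ge 0$. The sketch below approximates the envelope on a grid of $\lambda$ values, whereas the actual FBP algorithm computes it exactly; all names are mine:

```python
def admissible_sizes(costs):
    """costs: dict mapping tree size i -> minimized error c_i.
    A size is admissible if its line  c_i + lam*i  attains the lower
    envelope  min_j (c_j + lam*j)  for some penalty lam >= 0.
    Scans a grid of lam values (a numerical approximation of the envelope)."""
    sizes = sorted(costs)
    admissible = set()
    for k in range(2001):
        lam = k * 0.001                     # grid over lam in [0, 2]
        best = min(sizes, key=lambda i: costs[i] + lam * i)
        admissible.add(best)
    return sorted(admissible)
```

For example, with `costs = {1: 4.0, 2: 2.5, 3: 2.4, 4: 1.0}`, size 3 is inadmissible: its line never reaches the lower envelope, mirroring the inadmissible-line example above.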
### Determining the Optimal λ: 10-Fold Cross-Validation

[Figure: the ten cross-validation rounds; in each round one-tenth of the data is held out for testing and the remainder is used for training.]

Average the ten per-round error curves to obtain the entire cross-validation error (CVE) curve as a function of $\lambda$, and find the range of $\lambda$ that produces the minimum error rate.

### Computational Complexity of the FBP Algorithm

Theorem: let $N$ and $|\tilde T|$ denote the numbers of nodes and terminal nodes in a binary tree. Expressed in terms of $N$, it takes no more than $N(N/2 + 1)$ operations to generate the lists of linear functions at all nodes.

### Numerical Study 1

Twelve real data sets from the UCI repository are used (Blake, C.L. and Merz, http://www.ics.uci.edu/~mlearn/MLRepository.html), comparing the possible number of trees with the admissible number of trees. [Table: for each data set (Australian Credit Approval, Cleveland Heart Disease, Congressional Voting Records, Wisconsin Breast Cancer, Iris Plants, BUPA Liver Disorder, PIMA Indian Diabetes, Image Segmentation, German Credit, Vehicle Silhouette, Waveform, Satellite Image), the number of admissible tree sizes and the largest tree size; the individual counts are not fully recoverable from the source.]

### Numerical Study 2

Cross-validation error rates (%):

| Name of Dataset | CCP | FBP | Winner |
|---|---|---|---|
| Australian Credit Approval | 14.13 | 14.01 | FBP |
| Cleveland Heart Disease | 21.15 | 20.89 | FBP |
| Congressional Voting Records | 4.16 | 4.12 | FBP |
| Wisconsin Breast Cancer | 4.56 | 4.47 | FBP |
| Iris Plants | 5.20 | 5.07 | FBP |
| BUPA Liver Disorder | 31.27 | 31.03 | FBP |
| PIMA Indian Diabetes | 24.37 | 23.97 | FBP |
| Image Segmentation | 3.84 | 3.83 | FBP |
| German Credit | 24.61 | 24.48 | FBP |
| Vehicle Silhouette | 28.02 | 27.90 | FBP |
| Waveform | 22.86 | 22.83 | FBP |
| Satellite Image | 12.67 | 12.65 | FBP |

Testing error rates (%):

| Name of Dataset | CCP | FBP | Winner |
|---|---|---|---|
| Australian Credit Approval | 14.84 | 14.58 | FBP |
| Cleveland Heart Disease | 26.82 | 27.19 | CCP |
| Congressional Voting Records | 5.50 | 5.60 | CCP |
| Wisconsin Breast Cancer | 5.00 | 4.04 | FBP |
| Iris Plants | 7.73 | 7.33 | FBP |
| BUPA Liver Disorder | 35.20 | 34.74 | FBP |
| PIMA Indian Diabetes | 25.34 | 25.26 | FBP |
| Image Segmentation | 5.80 | 5.81 | CCP |
| German Credit | 27.60 | 27.06 | FBP |
| Vehicle Silhouette | 30.25 | 30.27 | CCP |
| Waveform | 27.47 | 27.47 | FBP/CCP |
| Satellite Image | 12.85 | 12.85 | FBP/CCP |

### Numerical Study 3

Tree size (total number of nodes, averaged over runs):

| Name of Dataset | CCP | FBP | Winner |
|---|---|---|---|
| Australian Credit Approval | 10.60 | 8.00 | FBP |
| Cleveland Heart Disease | 8.20 | 6.80 | FBP |
| Congressional Voting Records | 7.00 | 5.80 | FBP |
| Wisconsin Breast Cancer | 14.20 | 10.60 | FBP |
| Iris Plants | 7.20 | 6.60 | FBP |
| BUPA Liver Disorder | 16.60 | 13.80 | FBP |
| PIMA Indian Diabetes | 15.20 | 9.00 | FBP |
| Image Segmentation | 80.40 | 87.20 | FBP |
| German Credit | 20.20 | 20.20 | FBP/CCP |
| Vehicle Silhouette | 63.40 | 62.60 | FBP |
| Waveform | 60.00 | 69.00 | FBP/CCP |
| Satellite Image | 249.00 | 249.00 | FBP/CCP |

### Software for Tree-Based Models

- Free CART evaluation: http://www.salfordsystems.com/cart.php
- MATLAB
- S-PLUS
- R (`ctree`, `rpart`, etc.)
- Numerous commercial data-mining packages (expensive!)

### Generalization of CART

- For regression trees, can we replace the constant function by other models, e.g., linear regression models or polynomials?
- For classification trees, can we replace Bayes rules by other classification methods, e.g., LDA or logistic regression?

These generalizations have been studied in the literature.

### Dummy Variables: Global vs. Local Models

It is sometimes necessary to determine how a dependent variable is related to an independent variable when a qualitative factor is influencing the situation. Example (job performance rating $y$ vs. aptitude test score $x_1$): fitting the Female and Male groups separately gives two lines with different intercepts, while a single overall fit gives a third line that represents neither group well. Coding the dummy variable $x_2$ (Female = 0, Male = 1) combines the two group models into one regression, $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$.

### Used Car Auction Example

Objective: predict used car price based on mileage, age, color, condition, etc. (Figure 2: hierarchical structure of segments for the used car problem.)

### Generalization of CART (continued)

- Should we fit a single global model to all data, with category as dummy variables? (Less data per category, e.g., BMW.)
- Should we fit local models at the lowest-level regions? (More data, e.g., Civic.)
- How about at intermediate levels, clustering by how well the model fits?
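The dummy-variable model above, $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ with $x_2 \in \{0, 1\}$, can be fit by ordinary least squares. A self-contained sketch via the normal equations; the data and all names are mine, chosen so the fit is exact:

```python
def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y,
    solved by Gaussian elimination with partial pivoting.
    X: list of rows; y: list of responses."""
    p = len(X[0])
    A = [[sum(X[i][j] * X[i][k] for i in range(len(X))) for k in range(p)]
         for j in range(p)]
    b = [sum(X[i][j] * y[i] for i in range(len(X))) for j in range(p)]
    for col in range(p):                       # forward elimination
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for k in range(col, p):
                A[r][k] -= f * A[col][k]
            b[r] -= f * b[col]
    coef = [0.0] * p                           # back substitution
    for r in range(p - 1, -1, -1):
        coef[r] = (b[r] - sum(A[r][k] * coef[k] for k in range(r + 1, p))) / A[r][r]
    return coef

# One regression with a dummy x2 (0 = group A, 1 = group B) replaces two
# separate group regressions that share a common slope:
data = [(x, 0) for x in range(5)] + [(x, 1) for x in range(5)]
y = [1.0 + 0.5 * x + 2.0 * g for x, g in data]   # noiseless illustration
X = [[1.0, x, g] for x, g in data]
coef = ols(X, y)
```

Here `coef` recovers the generating coefficients (1.0, 0.5, 2.0) up to rounding: a common slope, with the dummy shifting the intercept between the two groups.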
- Should we split the levels of quantitative variables?
- How much past time-series data should we use?

### Generalization of CART: Classification/Prediction by Clustering

Known cluster structure (defined by categorical variables):

- Use only the categorical variables for splitting, and use the other variables for fitting models from a chosen family. (Should we include the categorical variables in the fitting as well?)
- Identify clusters based on how well the model fits the data, i.e., how well the predictor variables predict $y$.

Unknown cluster structure:

- Non-overlapping clusters: use (hierarchical) clustering methods to define clusters and use the result as the fully grown tree; then use tree-pruning methods to find the optimal pruned tree within a family of models.
- Overlapping clusters: extend the $k$-nearest-neighbor method from fitting a constant function or Bayes rule to more general regression models or classification methods.
- Here clusters are identified based on the distance of the data, i.e., how close the predictor variables are to each other.

### Used Car Auction Example (continued)

Objective: find out how to group the segments for the best prediction, which is a clustering problem. (Figure 2: hierarchical structure of segments for the used car problem.)

### Generalization of CART: Clustering by Classification/Prediction

Question: can we identify clusters based on how well a prediction model fits the data (how well some predictor variables predict other variables) instead of the distance of the data, e.g., when clustering multiple time series from hotel customers?

Clusterwise regression model (Lau et al., 1998): $n$ subjects with a dependent variable $Y$ (e.g., sales) and independent variables $X$ (e.g., income). The objective is to divide the sample into two mutually exclusive segments (two clusters) according to the impact of $X$ on $Y$; the slopes and intercepts of $Y$ on $X$ are expected to differ between the two clusters (e.g., income elasticity). This problem may be solved by GUIDE with $X$ as splitting variables, but it is difficult with single-variable splits. Maybe we can include $X$ and $y$ as splitting variables and consider linear combinations of splitting variables. We can also extend the regression model to other prediction models ("clustering by prediction").

- Known cluster structure defined by categorical variables: the problem can be solved by GUIDE, e.g., the time-series example with categorical variables (regions, hotel classes, etc.) that are related to the time-series clustering.
- Unknown cluster structure: this is very challenging.

### Generalization of CART: Loh (U. of Wisconsin)

- GUIDE: http://www.stat.wisc.edu/~loh/guide.html
- QUEST: a binary classification tree.
- CRUISE: a classification tree that splits each node into two or more subnodes.
- LOTUS: a logistic regression tree.

### GUIDE Regression Tree

- Negligible bias in split-variable selection, which is critical for correct tree interpretation (no other algorithm has this property).
- Sensitivity to local interactions between pairs of predictor variables (no other algorithm has this property).
- Ability to use ordered (continuous) and unordered (categorical) predictor variables.
- Choice of weighted least squares, Gaussian, Poisson, quantile (including median), or proportional hazards regression.
- Choice of piecewise constant, best simple polynomial, multiple linear, or stepwise regression models.
- Choice of roles for predictor variables: splitting only, node modeling only, both, or none.
- Choice of using categorical variables for splitting only, or for both splitting and fitting through dummy 0/1 vectors.
- Choice of stopping rules: no pruning, pruning by cross-validation, or pruning with a test sample.
- Free executables for Windows, Macintosh, and Linux computers.
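The clusterwise-regression idea above (Lau et al., 1998) can be illustrated with a toy sketch: alternate between fitting a least-squares line within each of two segments and reassigning every point to the line with the smaller squared residual. This is my illustrative sketch, with no safeguards for empty or degenerate segments, not the authors' algorithm:

```python
def fit_line(pts):
    """Least-squares (slope, intercept) for points [(x, y), ...]."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    b = sum((x - mx) * (y - my) for x, y in pts) / sxx
    return b, my - b * mx

def clusterwise_regression(pts, n_iter=20):
    """Alternate between (1) fitting a line in each of two segments and
    (2) reassigning each point to the line with the smaller squared residual."""
    assign = [i % 2 for i in range(len(pts))]      # arbitrary start
    for _ in range(n_iter):
        lines = []
        for c in (0, 1):
            seg = [p for p, a in zip(pts, assign) if a == c]
            lines.append(fit_line(seg))
        assign = [min((0, 1), key=lambda c: (y - (lines[c][0] * x + lines[c][1])) ** 2)
                  for x, y in pts]
    return lines, assign
```

On data generated from two lines (e.g., $y = 2x$ and $y = 10 - x$), the alternation separates the two segments and recovers both slopes, even from an arbitrary initial assignment.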
### Related Papers

- Loh, W.-Y. (2006). Regression by parts: fitting visually interpretable models with GUIDE. Handbook of Computational Statistics, vol. III, Springer, in press.
- Kim, H., Loh, W.-Y., Shih, Y.-S., and Chaudhuri, P. (2006). A visualizable and interpretable regression model with good prediction power. IIE Transactions, Special Issue on Data Mining and Web Mining, in press.
- Loh, W.-Y. (2006). Regression tree models for designed experiments. The Second Erich L. Lehmann Symposium: Optimality, Institute of Mathematical Statistics Lecture Notes-Monograph Series, vol. 49, 210-228.
- Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, vol. 12, 361-386. (The principal reference for GUIDE.)
- Chaudhuri, P. and Loh, W.-Y. (2002). Nonparametric estimation of conditional quantiles using quantile regression trees. Bernoulli, vol. 8, 561-576. (Extends GUIDE to quantile regression.)
- Chaudhuri, P., Lo, W.-D., Loh, W.-Y., and Yang, C.-C. (1995). Generalized regression trees. Statistica Sinica, vol. 5, 641-666. (The first paper on Poisson and logistic regression trees.)
- Chaudhuri, P., Huang, M.-C., Loh, W.-Y., and Yao, R. (1994). Piecewise-polynomial regression trees. Statistica Sinica, vol. 4, 143-167. (The first paper on polynomial regression trees.)

### A Visualizable and Interpretable Regression Model with Good Prediction Power (Kim et al.)

[Table 5: the data sets used in the comparison. $N$ denotes the number of training cases, $N'$ the number of test cases (if any), $Q$ the number of quantitative predictors, and $C$ the number of categorical predictors (category sizes in parentheses). Roughly fifty real data sets are listed with their sources, including Abalone, Alcohol, Amenity, Attend, Baseball, Boston, Budget, Cane, Cardio, College, CPS95, Deer, Diabetes, Diamond, Engel, Enroll, Fame, Fat, Fishery, Hatco, Houses, Insur, Labor, Medicare, Mpg2001, Mussels, Ozone, Pole, Price, Rate, Rice, Scenic, Smsa, Strike, TA, Tecator, Tree, Triazine, and Wage; the individual entries are not recoverable from the source.]

Table 6: labels of the twenty-seven algorithms compared.

| Label | Algorithm |
|---|---|
| Cart | CART, piecewise constant |
| Cr | CUBIST, rule-based model |
| Ci | CUBIST, composite model (Cr combined with nearest neighbor) |
| Crb | Boosted (committee) Cr |
| Gc | GUIDE, piecewise constant |
| Gl | GUIDE, piecewise simple linear |
| Gq | GUIDE, piecewise simple quadratic |
| Gm | GUIDE, piecewise multiple linear |
| Gs | GUIDE, piecewise stepwise linear |
| Gs2 | GUIDE, two-regressor stepwise linear |
| Gf2 | GUIDE, two-regressor forward linear |
| gam | Generalized additive model |
| lad | Least absolute deviations regression |
| lr | Least squares linear regression |
| Mc | M5, piecewise constant |
| Mm | M5, piecewise multiple linear |
| Mcb | Bagged Mc |
| Mmb | Bagged Mm |
| mars | MARS |
| mart | MART |
| nnet | Neural network |
| pol | POLYMARS |
| ppr | Projection pursuit regression |
| Rc | RT, piecewise constant |
| Rm | RT, piecewise multiple linear |
| Rp | RT, piecewise partial linear |
| rreg | Huber's robust regression |

Algorithm details:

- CART: piecewise constant regression tree (Breiman et al., 1984). CART is a registered trademark of California Statistical Software, Inc. Version 4 of the Windows implementation (Steinberg and Colin, 1995) is used, with 10-fold cross-validation and the default 1-SE rule.
where the value is changed to 100 because of the program s memory limitations CUBIST A rulebased algorithm due to R Quinlan Ifwmlmulsquost comcubistinfoitml l r39e use Release 110 Three type of models are studied mile based only Cr composite Cijl and committee 3er with n memheb The Ci model combines Cr with a nearest neighbor method Crb is n boosted version of Cr GAIV39I Generalized additive model Hastie and Tilishirnni 1990 We use the SPlus Function gem with the Gaussian 111111in and nonparametric smoothing splines option GUIDE Generalized regression tree I L011 200239 Gc and En denote piecewise constant and piece wise multiple linear models Categorical variables n1 e used for splitting and for regression modeling via durumr vectors in Gm Our proposed piecewise simple linear and simple quadratic models are denoted h 61 and Gq respectively Gs denotes the method where forward and backward stepwise regression is used in each node If the number of regressors is limited to two the method is denoted by G52 Finally Gf2 denotes the method using two regressm forvmrd 0rdr stepwise regression at each node All the trees are pruned with the default NJ SE rule 90 A visualizable and interpretable regression model with good prediction power Kim et al Least absolute deviations regression We use the SPlus function 11f it Least squares linear regression We use the R Function In NNET Neural network using the R function met with size3 decay0001 linoutTRUE skip39ITiU39E nnd naxit200 1H5 Piecewise constant and linear regression tree We use the implementation in version 32 of VEKA rittetn and Frank 2000 llc denotes piecewise constant and in piecewise multiple linear If bagging is employed we use ten iterations The resulting methods are denoted by Mcb and limb respectively h IARS Multivariate adaptive regression splines Friedman 1991 We use the R inction mars in the inda library with pai39umeter Values degree1 penalty2 and thresh0001 lVIART Multivariate adaptive regression tree Friedman 
2001). This is a stochastic gradient boosting algorithm applied to regression trees. We use the software from www-stat.stanford.edu/…/mart.html with 10-fold cross-validation and 200 boosting iterations.
POLYMARS: An adaptive regression procedure using piecewise linear splines (Kooperberg, Bose and Stone 1997). We use the R function polymars in the polspline library. The gcv option is used for model selection. The maximum number of basis functions and the maximum number of knots per predictor are set as functions of the sample size n (the program defaults).
PPR: Projection pursuit regression (Friedman and Stuetzle 1981). We use the R function ppr with optlevel=2 in the modreg library.
91 A visualizable and interpretable regression model with good prediction power (Kim et al.)
rreg: Robust regression with Huber's M-estimate (Huber 1981, p. 194). We use the R function rlm with k2=1.345 and acc=1e-4 in the MASS library (Venables and Ripley 1999).
RT: A regression tree algorithm due to L. Torgo (1999). Like M5', it first grows a piecewise constant tree and then fits various linear models to the leaf nodes during pruning. We use version 4 of the software. Rc denotes piecewise constant, Rm piecewise multiple linear, and Rp piecewise partial linear with bandwidth 10.
92 A visualizable and interpretable regression model with good prediction power (Kim et al.)
Table: Geometric means of prediction RMSE relative to least squares linear regression, in increasing order; algorithms to the left of the vertical bar are not significantly different from the best. The order is approximately: Mmb, Mm, Ci, Cr, Gs, Gm, gam, Gf2, Gs2, mars, Mcb, ppr | Gl, …, nnet, mart, Rp, Cart, Rm, Gq, Mc, rreg, Gc, Rc, lad, pol, with values increasing from roughly 0.80 to 1.2.
93 A visualizable and interpretable regression model with good prediction power (Kim et al.)
[Figure: dot plot of the prediction RMSE of each algorithm relative to that of linear regression, one row per algorithm, on a log-scaled axis running 0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0.]
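The tables and figure above rank algorithms by prediction RMSE relative to least squares linear regression. As a minimal sketch of how such a relative RMSE is computed (pure Python, not from the slides; the one-split "stump" model is a stand-in for a pruned regression tree, and the toy data are invented for illustration):

```python
import math, random

def fit_linear(xs, ys):
    # Closed-form simple least squares: y ~ a + b*x.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x

def fit_stump(xs, ys):
    # Piecewise-constant model with one split: predict the mean of y on
    # each side of the best threshold (a two-leaf regression tree).
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # skip duplicate x values (would give an empty side)
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for x, y in pairs if x <= t) + \
              sum((y - mr) ** 2 for x, y in pairs if x > t)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def rmse(model, xs, ys):
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

random.seed(1)
# Toy data: a step-shaped response, which favors the tree-style model.
xtr = [random.uniform(0, 1) for _ in range(200)]
ytr = [(1.0 if x > 0.5 else 0.0) + random.gauss(0, 0.1) for x in xtr]
xte = [random.uniform(0, 1) for _ in range(200)]
yte = [(1.0 if x > 0.5 else 0.0) + random.gauss(0, 0.1) for x in xte]

lin, stump = fit_linear(xtr, ytr), fit_stump(xtr, ytr)
rel = rmse(stump, xte, yte) / rmse(lin, xte, yte)
print(round(rel, 2))  # a value below 1 means the stump beats linear regression here
```

On data with linear structure the ratio would instead exceed 1, which is why the tables report a full spread of values around 1.0.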
94 A visualizable and interpretable regression model with good prediction power (Kim et al.)
[Figure: geometric mean of each algorithm's prediction RMSE relative to that of linear regression, plotted on a horizontal axis running from about 0.85 to 1.00.]
Bagging. Motivation: combine the predictors from the bootstrap samples to improve prediction power. 96
Bootstrap Methods. Main idea: (1) sampling with replacement; (2) procedure: see Figure 7.10.
Example: We consider the training set Z = {z1, ..., zn}, zi = (xi, yi), as the "population". We generate bootstrap samples Z*1, ..., Z*B by taking samples with replacement from this population. Then, for an estimate S(Z) of an unknown parameter θ, we can estimate the variance of the statistic S(Z) by
Var-hat[S(Z)] = (1/(B−1)) Σ_{b=1}^{B} (S(Z*b) − S-bar*)², where S-bar* = (1/B) Σ_{b=1}^{B} S(Z*b). 97
Bagging Method: (1) fit a model to each bootstrap sample; (2) take the average of the predictors from the B fitted models, with one of the following methods: (a) take the average of the indicator vectors, or (b) take the average of the conditional probabilities; (3) apply the Bayes classification rule to determine the class:
f-hat_bag(x) = (1/B) Σ_{b=1}^{B} f-hat*b(x). 98
[Figure: bagging trees on simulated data; the original tree and the trees grown on five bootstrap samples b = 1, ..., 5, whose chosen splits differ from sample to sample.] 99
[Figure 8.10: error curves for the bagging example; the test error of the original tree (about 0.35) and of the bagged trees as a function of the number of bootstrap samples. The green points correspond to majority vote, while the purple points average the probabilities.]
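The bootstrap-and-average recipe above can be sketched in a few lines. This is a pure-Python illustration (not from the slides) using one-split stump classifiers as the base model and majority vote as the combination rule; the toy data and helper names are invented for the example:

```python
import random
from collections import Counter

def fit_stump(data):
    # Weak learner: one threshold on the single feature x,
    # predicting the majority class on each side.
    best = None
    xs = sorted({x for x, _ in data})
    for i in range(1, len(xs)):
        t = (xs[i - 1] + xs[i]) / 2
        left = Counter(y for x, y in data if x <= t).most_common(1)[0][0]
        right = Counter(y for x, y in data if x > t).most_common(1)[0][0]
        err = sum(y != (left if x <= t else right) for x, y in data)
        if best is None or err < best[0]:
            best = (err, t, left, right)
    if best is None:  # degenerate sample: fall back to the overall majority class
        maj = Counter(y for _, y in data).most_common(1)[0][0]
        return lambda x: maj
    _, t, left, right = best
    return lambda x: left if x <= t else right

def bagged_classifier(data, B=25, seed=0):
    rng = random.Random(seed)
    # (1) Draw B bootstrap samples (sampling with replacement), fit a model to each.
    models = [fit_stump([rng.choice(data) for _ in data]) for _ in range(B)]
    # (2)-(3) Combine by majority vote over the B predictions
    # (the average-of-indicator-vectors rule from the slide).
    return lambda x: Counter(m(x) for m in models).most_common(1)[0][0]

# Toy training set: class 1 when x > 0.6, with one mislabeled point.
train = [(i / 20, int(i / 20 > 0.6)) for i in range(21)]
train[3] = (train[3][0], 1)  # label noise
f_bag = bagged_classifier(train)
print(f_bag(0.1), f_bag(0.9))
```

Averaging over bootstrap fits smooths out the instability of any single stump, which is exactly the effect shown in the error-curve figure above.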
Boosting and MART. Motivation: combine the outputs of many weak learners to produce a powerful prediction rule.
AdaBoost: the most popular boosting algorithm, introduced by Freund and Schapire in 1996. Key idea: train weak learners on weighted versions of the training sample, giving higher weights to the cases which are currently misclassified.
AdaBoost: the final model is F(x) = sign(Σ_{m=1}^{M} a_m f_m(x)), a weighted combination of the weak learners f_1(x), f_2(x), ..., f_M(x). Schematic of AdaBoost: train weak learners on weighted versions of the data, then combine into a final prediction.
AdaBoost. Given (x1, y1), ..., (xN, yN), with x a vector and y ∈ {−1, 1}:
1. Start with weights w_i = 1/N, i = 1, ..., N.
2. For m = 1 to M:
(a) Fit f_m(x) ∈ {−1, 1} using the weights w_i on the training data.
(b) Compute err_m = Σ_i w_i I(y_i ≠ f_m(x_i)) / Σ_i w_i.
(c) Compute a_m = log((1 − err_m)/err_m).
(d) Set w_i ← w_i exp(a_m I(y_i ≠ f_m(x_i))), i = 1, ..., N.
3. Output F(x) = sign(Σ_{m=1}^{M} a_m f_m(x)).
AdaBoost: a synthetic example. Ten independent standard normal predictor variables; the deterministic response is Y = 1 if Σ_{j=1}^{10} X_j² exceeds the median of a chi-squared variable with 10 degrees of freedom (9.34), and −1 otherwise. 2000 training cases, 10000 test cases. Weak learner: a two-terminal-node classification tree, i.e. a stump.
[Figure: test error of AdaBoost as a function of the number of boosting iterations; a single stump has error about 0.46 and a 400-node tree about 0.26, while boosting stumps drives the error well below both.]
MART: a gradient-based boosting algorithm with small trees as weak learners, introduced by Friedman in 1999. Key idea: at each iteration, a small tree is fitted to the gradient of the loss function.
MART. Given (x1, y1), ..., (xN, yN):
1. Initialize f_0(x) = argmin_a Σ_i L(y_i, a).
2. For m = 1 to M:
(a) Compute the pseudo-residuals r_im = −[∂L(y_i, f(x_i)) / ∂f(x_i)] evaluated at f = f_{m−1}, i = 1, ..., N.
(b) Fit a tree to the r_im, giving terminal regions R_jm, j = 1, ..., Jm.
(c) Compute a_jm = argmin_a Σ_{x_i ∈ R_jm} L(y_i, f_{m−1}(x_i) + a).
(d) Update f_m(x) = f_{m−1}(x) + Σ_j a_jm I(x ∈ R_jm).
3. Output f-hat(x) = f_M(x).
Comparison of AdaBoost, Bagging and CART. Dataset descriptions (training cases, testing cases if any, variables, classes): breast cancer 699 9 2; ionosphere 351 34 2; glass 214 9 6; soybean 683 35 19; letters 15000 5000 16 26; satellite 4435 2000 36 6; shuttle 43500 14500 9 7; DNA 2000 1186 60 3; digit 7291
2007 256 10.
Comparison of AdaBoost, Bagging and CART. Misclassification rate (%), dataset / AdaBoost / Bagging / CART: breast cancer 3.2 / 3.7 / 5.9; ionosphere 6.4 / 7.9 / 11.2; glass 22.0 / 23.2 / 30.4; soybean 5.8 / 6.8 / 8.6; letters 3.4 / 6.4 / 12.4; satellite 8.8 / 10.3 / 14.8; shuttle 0.007 / 0.014 / 0.062; DNA 4.2 / 5.0 / 6.2; digit 6.2 / 10.5 / 27.1.
Some Characteristics of Different Learning Methods (Hastie et al.; entries good / fair / poor, compared across neural nets, SVMs, trees, MARS, and k-NN/kernel methods): (a) natural handling of data of mixed type; (b) handling of missing values; (c) robustness to outliers in input space; (d) insensitivity to monotone transformations of inputs; (e) computational scalability (large N); (f) ability to deal with irrelevant inputs; (g) ability to extract linear combinations of features; (h) interpretability; (i) predictive power. [The symbol entries of the table are not recoverable in this copy.]
R Software: Classification and Regression Tree. Load the tree package: click Packages, then Install Package(s), choose a mirror (USA (CA1) or another city), and choose tree.
> library(tree)
> iris.tr <- tree(Species ~ ., data=iris, subset=samp)
node), split, n, deviance, yval, (yprob); * denotes terminal node
1) root 75 164.800 setosa ( 0.33333 0.33333 0.33333 )
2) Petal.Length < 2.45 25 0.000 setosa ( 1.00000 0.00000 0.00000 ) *
3) Petal.Length > 2.45 50 69.310 versicolor ( 0.00000 0.50000 0.50000 )
6) Petal.Length < 4.85 24 8.314 versicolor ( 0.00000 0.95833 0.04167 )
12) Sepal.Length < 5.55 5 5.004 versicolor ( 0.00000 0.80000 0.20000 ) *
13) Sepal.Length > 5.55 19 0.000 versicolor ( 0.00000 1.00000 0.00000 ) *
7) Petal.Length > 4.85 26 14.100 virginica ( 0.00000 0.07692 0.92308 )
14) Petal.Length < 5.05 6 7.638 virginica ( 0.00000 0.33333 0.66667 ) *
15) Petal.Length > 5.05 20 0.000 virginica ( 0.00000 0.00000 1.00000 ) *
Classification and Regression Tree (Cont.)
> summary(iris.tr)
Classification tree: tree(formula = Species ~ ., data = iris, subset = samp)
Variables actually used in tree construction: "Petal.Length" "Sepal.Length"
Number of terminal nodes: 5
Residual mean deviance: 0.1806 = 12.64 / 70
Misclassification error rate: 0.04 = 3 / 75
> table(iris$Species[samp], predict(iris.tr, iris[samp,], type='class'))
setosa
versicolor virginica
setosa 25 0 0
versicolor 0 23 2
virginica 0 2 23
> test.error <- (2+2)/75
> test.error
[1] 0.05333333
Classification and Regression Tree (Cont.)
> plot(iris.tr); text(iris.tr)
[Tree plot: root split Petal.Length < 2.45 gives setosa; then Petal.Length < 4.85, Sepal.Length < 5.55, and Petal.Length < 5.05 give the versicolor and virginica leaves.]
Classification and Regression Tree (Cont.): HMEQ example
> d <- hmeq1
> dim(d)
[1] 3067 13
> names(d)
[1] "BAD" "LOAN" "MORTDUE" "VALUE" "REASON" "JOB" "YOJ"
[8] "DEROG" "DELINQ" "CLAGE" "NINQ" "CLNO" "DEBTINC"
> set.seed(12)
> id <- sample(1:3067, size=2045, replace=F)
> d1 <- d[id,]
> d2 <- d[-id,]
> dim(d1)
[1] 2045 13
> dim(d2)
[1] 1022 13
Classification and Regression Tree (Cont.)
> ctree <- tree(BAD ~ ., data=d1, method="class")
node), split, n, deviance, yval; * denotes terminal node
1) root 2045 162.5000 0.08704
2) DEBTINC < 45.439 2009 132.8000 0.07118
4) DELINQ < 2.5 1988 114.5000 0.06137
8) DEROG < 1.5 1973 101.2000 0.05423
16) CLAGE < 62.1048 53 10.7500 0.28300 *
17) CLAGE > 62.1048 1920 87.5900 0.04792
34) VALUE < 47501 72 13.9900 0.26390
68) LOAN < 9650 24 5.6250 0.62500
136) CLAGE < 175.971 15 0.9333 0.93330 *
137) CLAGE > 175.971 9 0.8889 0.11110 *
69) LOAN > 9650 48 3.6670 0.08333 *
35) VALUE > 47501 1848 70.1200 0.03950
70) DEBTINC < 43.6007 1828 63.6200 0.03611 *
71) DEBTINC > 43.6007 20 4.5500 0.35000
142) CLAGE < 233.316 8 1.5000 0.75000 *
143) CLAGE > 233.316 12 0.9167 0.08333 *
9) DEROG > 1.5 15 0.0000 1.00000 *
5) DELINQ > 2.5 21 0.0000 1.00000 *
3) DEBTINC > 45.439 36 0.9722 0.97220 *
Classification and Regression Tree (Cont.)
> summary(ctree)
Regression tree: tree(formula = BAD ~ ., data = d1, method = "class")
Variables actually used in tree construction: "DEBTINC" "DELINQ" "DEROG" "CLAGE" "VALUE" "LOAN"
Number of terminal nodes: 10
Residual mean deviance: 0.04091 = 83.25 / 2035
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-9.722e-01 -3.611e-02 -3.611e-02 2.171e-18 3.611e-02 9.639e-01
Compute the testing error:
> pr <- predict(ctree, d2)
> pr <- (pr > 0.5)
> table(d2$BAD, pr)
pr
FALSE TRUE
0
918 6
1 53 45
> test.error <- (53+6)/1022
> test.error
[1] 0.05772994
Classification and Regression Tree (Cont.)
> plot(ctree); text(ctree)
[Tree plot with the splits DEBTINC < 45.439, DELINQ < 2.5, DEROG < 1.5, CLAGE < 62.1048, VALUE < 47501, etc., and the fitted values at the leaves.]
Comparison of Logistic Regression and Tree. Logistic regression model:
> lreg1 <- glm(BAD ~ DEROG+DELINQ+CLAGE+NINQ+DEBTINC, data=train, family=binomial)
> summary(lreg1)
Call: glm(formula = BAD ~ DEROG + DELINQ + CLAGE + NINQ + DEBTINC, family = binomial, data = train)
Regression tree: tree(formula = BAD ~ ., data = d1, method = "class")
Variables actually used in tree construction: "DEBTINC" "DELINQ" "DEROG" "CLAGE" "VALUE" "LOAN"
Number of terminal nodes: 10
Residual mean deviance: 0.04091 = 83.25 / 2035
Comparison of Logistic Regression and Tree. Logistic regression model:
> table(pr, test$BAD)
pr 0 1
FALSE 919 71
TRUE 5 27
> test.error <- (71+5)/1022
> test.error
[1] 0.07436399
Regression tree:
> table(d2$BAD, pr)
pr
FALSE TRUE
0 918 6
1 53 45
> test.error <- (53+6)/1022
> test.error
[1] 0.05772994
Classification Trees
For regression trees, the impurity measure is Qm(T) = Σ_{x_i ∈ Rm} (y_i − c-hat_m)², where c-hat_m = (1/Nm) Σ_{x_i ∈ Rm} y_i.
For classification trees, impurity measures Qm(T):
Misclassification error: (1/Nm) Σ_{i ∈ Rm} I(y_i ≠ k(m)) = 1 − p-hat_{mk(m)}.
Gini index: Σ_{k ≠ k'} p-hat_{mk} p-hat_{mk'} = Σ_{k=1}^{K} p-hat_{mk}(1 − p-hat_{mk}).
Cross-entropy (deviance): −Σ_{k=1}^{K} p-hat_{mk} log p-hat_{mk}.
Classification and Regression Trees: Fundamentals. Notation: X input vector; X the input space; J the number of classes; C the set of classes, e.g. C = {1, 2, ..., J}. 11
Two Definitions: a classifier (or classification rule) is a function d(X) defined on the input space X such that for every X ∈ X, d(X) ∈ C. Equivalently, a classifier is a partition of the input space into J exhaustive, disjoint subsets R1, ..., RJ such that for every X ∈ Rj the predicted class is j. 12
Accuracy Estimation. Question: how good is a classifier in prediction, i.e. how accurate is a classifier? True misclassification rate: given the learning sample L = {(x1, j1), ..., (xN, jN)}, xn ∈ X, jn ∈ C, construct d(X). Let (X, j), X ∈ X, j ∈ C, be a new case from the same population as L. The true misclassification rate of d(X), R*(d), is defined as R*(d) = P(d(X) ≠
j). 13
How to Estimate R*(d)? Three types of estimates of R*(d): the training error R(d), the test error Rts(d), and the cross-validation error Rcv(d). 14
How to Estimate R*(d)? Given the sample L = {(x1, j1), ..., (xN, jN)}:
Training error R(d): construct d(X) using L; testing d(X) on L gives the training error R(d) = (1/N) Σ_{n=1}^{N} I(d(xn) ≠ jn).
Test error Rts(d): randomly split L into two sets L1 and L2, and construct d(X) using L1. Testing d(X) on L2 gives the testing error Rts(d) = (1/N2) Σ_{(xn,jn) ∈ L2} I(d(xn) ≠ jn), where N2 is the number of cases in L2. 15
How to Estimate R*(d)? (Cont.) Cross-validation error Rcv(d): randomly split L into V subsets of as nearly equal size as possible, L1, ..., LV. For v = 1, ..., V, construct d(v)(X) using L − Lv. The testing error of d(v)(X) is Rts(d(v)) = (1/Nv) Σ_{(xn,jn) ∈ Lv} I(d(v)(xn) ≠ jn), where Nv is the number of cases in Lv. The V-fold cross-validation error is defined as Rcv(d) = (1/V) Σ_{v=1}^{V} Rts(d(v)). 16
Classification Trees are tree-structured classifiers. Construction: beginning with the input space, repeatedly split into two descendant subsets. Splits are formed by conditions on the input vector X = (X1, X2, ...); e.g. a split could be of the form {X : X2 + X4 ≤ 7}. 17
An Example Tree T: the terminal nodes are grouped into regions R1, R2, R3 (e.g. R1 = t6 ∪ t9), so that X = R1 ∪ R2 ∪ R3, and for every X ∈ Rj the assigned class is j. Each terminal node is assigned a class label (the number below the terminal node); terminal nodes with the same class label are joined together to get a partition of the input space. 18
Tree Construction: three issues. How to grow (i.e. how to split)? Which class should a terminal node be assigned to? When to stop? 19
Split Selection: steps for growing a tree. Starting from the root node t1, search through all candidate splits to find the optimal one using a criterion; split t1 into t2 and t3 using the optimal split; repeat the procedure on t2 and t3 separately. 21
Split Selection. Question: how to choose the optimal split at each step, and what criterion should be used? Key idea: splitting a node should make the data in the descendant nodes purer than the data in the parent node. 22
An Illustration: the root node t1 has class proportions (p1, p2, p3) = (1/3, 1/3, 1/3),
while the child nodes have proportions concentrated on fewer classes (e.g. one class proportion near 1, or 0.9). For a three-class problem, denote by p1, p2, p3 the proportions of the three classes in a node: nodes t2 and t3 are purer than t1. 23
More Notations: π(j) = class j prior probability, j = 1, ..., J; N(t) = number of cases in node t; Nj(t) = number of class j cases in node t; Nj = number of class j cases in the learning sample; N = number of cases in the learning sample; p(j, t) = π(j) Nj(t)/Nj, the probability that a class j case falls into node t; p(t) = Σ_{j=1}^{J} p(j, t), the probability that a case falls into node t; p(j|t) = p(j, t)/p(t), the probability that a case is in class j given that it falls into node t. 24
Impurity Function: an impurity function φ is a function defined on (p1, ..., pJ), pj ≥ 0, Σ_{j=1}^{J} pj = 1, satisfying: φ achieves its maximum only at (1/J, ..., 1/J); φ achieves its minimum only at (1, 0, ..., 0), (0, 1, 0, ..., 0), ..., (0, ..., 0, 1); φ is a symmetric function of p1, ..., pJ. 25
Three Impurity Functions: Gini index, φ(p1, ..., pJ) = Σ_{j=1}^{J} pj (1 − pj); entropy, φ(p1, ..., pJ) = −Σ_{j=1}^{J} pj log pj; misclassification error, φ(p1, ..., pJ) = 1 − max_j pj. 26
[Figure: the three impurity functions for the two-class case as functions of p, with entropy scaled to pass through (0.5, 0.5); all are maximized at p = 0.5.] 27
Impurity Measure for Node t: given an impurity function φ, define the impurity measure of a node t as i(t) = φ(p(1|t), ..., p(J|t)). If a split s sends a proportion pL of the cases in t to tL and a proportion pR to tR, the goodness of split s is defined as Δi(s, t) = i(t) − pL i(tL) − pR i(tR). 28
Impurity Measure for Trees: for a tree T with the set of terminal nodes T~, define its impurity as I(T) = Σ_{t ∈ T~} i(t) p(t). For t ∈ T~, split it into tL and tR using a split s; the new tree T' has impurity I(T') = Σ_{t'' ∈ T~, t'' ≠ t} i(t'') p(t'') + i(tL) p(tL) + i(tR) p(tR). The decrease in impurity is defined as ΔI(s, t) = I(T) − I(T') = i(t) p(t) − i(tL) p(tL) − i(tR) p(tR) = [i(t) − pL i(tL) − pR i(tR)] p(t) = Δi(s, t) p(t), where pL = p(tL)/p(t) and pR = p(tR)/p(t). 29
Split Selection. Decision rule: the best split is selected so as to minimize the overall tree impurity. 30
Standard Set of Splits S. Suppose X = (X1, ..., XM), where M is the dimension of the input space. S is defined as follows: each split depends on only a single input variable. For a continuous or
ordinal variable Xm, S includes all splits of the form {Xm ≤ c}, where the values for c are restricted to midpoints between consecutively ordered data values in the learning sample. If Xm is categorical, taking values in {b1, ..., bL}, S includes all splits of the form {Xm ∈ A}, where A ranges over all subsets of {b1, ..., bL}. 31
Steps for Splitting a Node: at each node, search over all input variables, beginning with X1 and continuing up to XM; for each variable, find the best split; compare these M best splits and select the optimal one to split the node. 32
Tree vs. Partitioning: [Figure: a two-class scatter of points in the (X1, X2) plane, overlaid with the rectangular regions induced by the tree's splits.] When all input variables can be ordered, tree construction can be treated as recursive partitioning of X into rectangles such that the cases within each rectangle become more and more class-homogeneous. 33
Class Assignment. Class assignment rule 1: let j*(t) denote the class assigned to node t ∈ T~: j*(t) = argmax_{i ∈ C} p(i|t). If the maximum is achieved by more than one class, assign j*(t) randomly as any one of the maximizing classes. 35
Class Assignment: misclassification cost. Let c(i|j) denote the cost of misclassifying a class j case as a class i case, satisfying c(i|j) ≥ 0 for i ≠ j and c(i|i) = 0. Given a node t with estimated p(1|t), ..., p(J|t): if a randomly selected case of unknown class falls into t and is classified as i, the estimated expected misclassification cost is Σ_j c(i|j) p(j|t). 36
Class Assignment. Class assignment rule 2: j*(t) = argmin_{i ∈ C} Σ_j c(i|j) p(j|t). If the minimum is achieved by more than one class, assign j*(t) randomly as any one of the minimizing classes. 37
When to Stop? Stop-splitting rule: a simple rule sets a threshold β > 0 and declares a node t terminal if max_s ΔI(s, t) < β. Problems: if β is too low, there is too much splitting, the tree is too large, and it overfits; if β is too high, splitting stops too soon; there may exist a node t with max_s ΔI(s, t) < β whose descendant nodes tL and tR nevertheless have splits with large decreases in impurity. Conclusion: stop-splitting rules are
not the right answer. Solution: grow a large tree, then prune upward to find the right-sized tree. 39
Get Ready to Prune: grow a sufficiently large tree Tmax by continuing to split until each terminal node either is pure (i.e. the node's cases are all in one class), or contains cases with identical input vectors (these cases may belong to different classes), or satisfies N(t) ≤ Nmin. 40
Definitions: a branch Tt of a tree T with root node t consists of the node t and all descendants of t in T. Pruning a branch Tt from a tree T consists of deleting all descendants of t; the pruned tree is denoted by T − Tt. If T' is obtained from T by successively pruning off branches, then T' is called a pruned subtree of T, denoted T' ≤ T. 41
An Illustration: [Figure: pruning the branch Tt2 from T yields the subtree T − Tt2.] 42
Tree Pruning. Problem: even for a moderately sized tree with, say, 30 to 40 nodes, there is a large number of subtrees, and an even larger number of ways of pruning up to the root node. One solution: minimal cost-complexity pruning. 43
Cost-Complexity Measure. Recall the class assignment rule j*(t) = argmax_{i ∈ C} p(i|t); p(t) is the probability that a case falls into node t. The probability of misclassification given that a case falls into node t is r(t) = 1 − max_{i ∈ C} p(i|t), and R(t) = r(t) p(t). Proposition: for any split of a node t into tL and tR, R(t) ≥ R(tL) + R(tR). 44
Cost-Complexity Measure (Cont.): the misclassification cost of a tree T is defined as R(T) = Σ_{t ∈ T~} r(t) p(t) = Σ_{t ∈ T~} R(t), where T~ denotes the set of terminal nodes in T. The complexity of T is defined as |T~|, the number of terminal nodes in T. Let α ≥ 0 be a real number, called the complexity parameter; the cost-complexity measure of T is defined as Cα(T) = R(T) + α |T~|. 45
Minimal Cost-Complexity Pruning (MCCP). Objective: for each value of α, find the subtree T(α) ≤ Tmax such that Cα(T(α)) = min_{T ≤ Tmax} Cα(T); if Cα(T) = Cα(T(α)), then T(α) ≤ T. T(α) is called the smallest-sized minimizing subtree for the complexity parameter α. Remarks: for each value of α there may be several minimizing subtrees, but there is one and only one smallest-sized minimizing subtree. The
smaller the α, the larger the T(α). For α = 0, T(α) is the smallest subtree with the smallest misclassification cost; for α sufficiently large, T(α) consists of only the root node. 46
MCCP Procedure: start with Tmax. For α = 0, get the smallest-sized minimizing subtree T(0) = T1; note C0(T1) = R(T1) = R(Tmax). For any non-terminal node t ∈ T1, let T1t be the branch of T1 rooted at t. Set Cα({t}) = R(t) + α and Cα(T1t) = R(T1t) + α |T~1t|. By setting Cα({t}) = Cα(T1t), calculate α(t) = [R(t) − R(T1t)] / (|T~1t| − 1). Note R(t) > R(T1t), so for small α, Cα({t}) > Cα(T1t). Find the non-terminal node t1 ∈ T1 such that α(t1) = min_{t ∈ T1, t ∉ T~1} α(t) ≡ α2. The node t1 is the weakest link in T1, in the sense that as α increases, t1 is the first node s.t. Cα({t}) = Cα(T1t); then the single node {t1} is preferred to the branch T1t1, and α2 is the value of α at which equality happens. 47
Prune away the branch T1t1 to get a new tree T2 ≤ T1: T2 = T1 − T1t1. Using T2 instead of T1, calculate α(t) = [R(t) − R(T2t)] / (|T~2t| − 1) for t ∈ T2, t ∉ T~2; find the weakest link t2 ∈ T2 such that α(t2) = min α(t) ≡ α3; prune away the branch T2t2 to get T3 ≤ T2: T3 = T2 − T2t2. Repeat this procedure to get a decreasing sequence of subtrees T1 ≥ T2 ≥ T3 ≥ ... ≥ {t_root} and an increasing sequence of values of α: 0 = α1 < α2 < ... < αK. If at any stage there is more than one weakest link, for instance if α(tk) = α(tk'), then define Tk+1 = Tk − Tk,tk − Tk,tk'. 48
Cross-Validation for the Optimal Tree. Question: which subtree has the best generalization performance, i.e. what value of α should be chosen? One solution: cross-validation. Question: can we use R(T) only? Yes with independent validation data; no with cross-validation, because each grown tree is different, being based on its own training data. 49
V-Fold Cross-Validation: randomly split the learning sample L into V subsets of as nearly equal size as possible, Lv, v = 1, ..., V. For each v: grow a sufficiently large tree T(v)max using L − Lv; applying MCCP to T(v)max gives a decreasing sequence of subtrees T(v)1 ≥ T(v)2 ≥ ... and an increasing sequence of values of α: 0 = α(v)1 < α(v)2 < ... < α(v)Kv. Using the whole sample L: grow a sufficiently large tree Tmax. Applying MCCP to
Tmax gives a decreasing sequence of subtrees T1 ≥ T2 ≥ ... ≥ {t_root} and an increasing sequence of values of α: 0 = α1 < α2 < ... < αK. 50
V-Fold Cross-Validation (Cont.): let α'k = (αk αk+1)^(1/2). Calculate the cross-validation error of tree Tk, k = 1, ..., K: Rcv(Tk) = (1/V) Σ_{v=1}^{V} Rts(T(v)(α'k)). Note Tk = T(α) for all α ∈ [αk, αk+1). The optimal tree Tk0, i.e. the tree with the smallest cross-validation error, is determined by Rcv(Tk0) = min_k Rcv(Tk). 51
Standard Errors and the 1-SE Rule: the testing error of a tree Tk on a sample L2 with N2 cases is Rts(Tk) = (1/N2) Σ_{(xn,jn) ∈ L2} I(Tk(xn) ≠ jn). The standard error of Rts(Tk) is SE(Rts(Tk)) = [Rts(Tk)(1 − Rts(Tk)) / N2]^(1/2). 52
Standard Errors and the 1-SE Rule: define v(n) ∈ {1, ..., V} by v(n) = v if (xn, jn) ∈ Lv. The cross-validation error of a tree Tk is Rcv(Tk) = (1/N) Σ_{n=1}^{N} I(T(v(n))(xn; α'k) ≠ jn). The standard error of Rcv(Tk) is SE(Rcv(Tk)) = [Rcv(Tk)(1 − Rcv(Tk)) / N]^(1/2). 53
Standard Errors and the 1-SE Rule: the 1-SE rule chooses Tk1, where k1 is the largest k such that Rest(Tk1) ≤ Rest(Tk*) + SE(Rest(Tk*)), where Rest can be Rts or Rcv and Tk* = argmin_k Rest(Tk). The 1-SE rule thus chooses the simplest tree whose accuracy is comparable to min_k Rest(Tk). 54
Variable Combination: up to now, all splits are on a single variable. If linear structure is suspected, the set of candidate splits can be extended to include all linear-combination splits of the form {Σ_m am Xm ≤ c}. This is computationally expensive and not very useful in many cases. 55
Missing Values. Two circumstances: missing values in the learning sample; missing values in the testing sample. Three ways to handle the missing-value problem: (1) discard any cases with missing values, which may lead to serious depletion of the learning sample; (2) for a continuous variable, fill in the missing values with averages; for a categorical variable, create a new category "missing"; (3) surrogate splits: for a node, rank the optimal splits on each variable; if the variable used by the best split is missing, use the second-best split, and so on. 56
Other Issues: variable selection is biased in favor of variables having more distinct values and thus offering more splits. End cuts, or
splits that tend to separate a node into one small and one large subset, are favored over middle cuts. The tree procedure is only one-step optimal, not overall optimal: for instance, suppose the tree procedure produces a tree with 12 terminal nodes; if one could search over all possible trees with 12 terminal nodes for the optimal one, the two results might be very different (similar to stepwise regression). 57
Example: Email or Spam (Hastie et al.). Objective: predict whether a message is legitimate email or junk email (spam). Data: (i) 4601 email messages with the true outcome (email or spam); (ii) the relative frequencies of the 57 most commonly occurring words and punctuation marks in each message. 58
Example: Email or Spam (Hastie et al.), 57 predictors: 48 quantitative predictors, the percentage of words that match a given word, e.g. business, address, internet, free, george, etc.; 6 quantitative predictors, the percentage of characters that match a given character, e.g. ;, (, [, !, $, and #; the average length of uninterrupted sequences of capital letters (CAPAVE); the length of the longest uninterrupted sequence of capital letters (CAPMAX); the sum of the lengths of uninterrupted sequences of capital letters (CAPTOT). 59
[Figure: the pruned classification tree for the spam data, with splits on variables such as ch$, remove, ch!, hp, george, and CAPAVE.]
Tree Prediction Results on Test Data (true vs. predicted): email predicted as email 573, as spam 40; spam predicted as email 53, as spam 334.
Sensitivity and Specificity: sensitivity = probability of predicting disease given true disease; specificity = probability of predicting non-disease given no disease. In the email example: sensitivity = 100 × 334/(334 + 53) = 86.3%; specificity = 100 × 573/(573 + 40) = 93.4%. 62
Performance Comparison: [Figure: ROC curves (sensitivity vs. specificity) for the spam data; the areas under the curves are about 0.95 for the tree, 0.98 for GAM, and 0.90 for the weighted tree.] 63
Frontier-Based Tree Pruning Algorithm (Huo, Kim, Tsui and Wang, Journal of Computing, 2006). 64
Complexity Penalized Loss Function (CPLF) (Breiman et al. 1984): L(λ, Tb) = R(Tb) + λ|T~b|, i.e. cost = error rate + λ × complexity, where complexity is the tree size (the number of terminal
nodes) and λ is a penalizing parameter. Objective: find the subtree Tb that minimizes the CPLF, T-hat(λ) = argmin_{Tb} L(λ, Tb). 65
Frontier-Based Tree-Pruning Algorithm: [Figure: a worked example propagating the minimized CPLFs, as piecewise linear functions of λ, from the leaves up to the root; the numbers of error cases at the nodes determine lines such as 2 + λ and 3 + 2λ, and the minimum of these lines is taken at each internal node.] 66
Frontier-Based Tree-Pruning Algorithm (Cont.): utilize the list of CPLFs at the root node, where ci is the value of the minimized loss function when the size of the tree is i, so that f(λ) = min_i (ci + i λ) is a piecewise linear function of λ. Algorithm: see Huo, Kim, Tsui and Wang (2004). Complexity: O(N), where N is the number of lines. 67
Frontier Lines and Admissible Tree Sizes: [Figure: the lines ci + i λ and their lower envelope, the frontier, analogous to an efficient frontier in investment theory; the tree sizes whose lines touch the frontier are the admissible tree sizes, and each range of λ selects one of them.] Next question: which λ (tree size) produces an optimal tree? 68
Inadmissible Tree Size: [Figure: a line lying strictly above the lower envelope corresponds to an inadmissible tree size; here the number of possible tree sizes is 4, while the number of admissible tree sizes is 3.] 69
Determining the Optimal λ: 10-fold cross-validation. [Figure: in each of the 10 rounds, one fold is held out and an error curve as a function of λ is computed from the remaining folds; find the λ that minimizes the average error.] 70
Determining the Optimal λ: [Figure: the 10 error curves and their average, the cross-validation error (CVE) curve; the range of λ which produces the minimum error rate is selected.] 71
Computational Complexity of the FBP Algorithm. Theorem: let N and T denote the numbers of nodes and terminal nodes in a binary tree. If we use N to express the complexity, it takes no more than N(N/2 + 1) operations to generate the lists of linear functions at all nodes. 72
Numerical Study 1: twelve real data sets from the UCI repository are used (Blake, C.L. and Merz, http://www.ics.uci.edu/~mlearn/MLRepository.html). Possible number of tree sizes vs. admissible number of tree sizes: [Table: for each of the twelve datasets (listed below under Numerical Study 2), the number of admissible tree sizes is much smaller than the number of possible tree sizes; the numeric entries are largely illegible in this copy.]
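The 10-fold cross-validation loop described above for picking the penalty λ can be sketched generically. This is a pure-Python illustration, not the FBP software; `error_for` is an assumed stand-in for fitting/pruning on the training folds and evaluating on the held-out fold:

```python
def ten_fold_cv(data, lambdas, error_for, V=10):
    """For each candidate lambda, average the held-out error over V folds;
    return the lambda with the smallest cross-validation error."""
    folds = [data[v::V] for v in range(V)]   # V near-equal subsets
    cve = {}
    for lam in lambdas:
        errs = []
        for v in range(V):
            held_out = folds[v]
            train = [z for u in range(V) if u != v for z in folds[u]]
            errs.append(error_for(lam, train, held_out))
        cve[lam] = sum(errs) / V             # the CV error curve at lam
    return min(cve, key=cve.get), cve

# Toy check with a made-up error function minimized at lambda = 0.3.
data = list(range(100))
best, curve = ten_fold_cv(data, [0.1, 0.2, 0.3, 0.4],
                          lambda lam, tr, te: (lam - 0.3) ** 2 + 1 / len(tr))
print(best)  # → 0.3
```

In the FBP setting, `error_for` would prune the tree grown on the training folds at penalty λ and report the held-out misclassification rate, giving one point on each of the 10 error curves shown in the figures above.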
Numerical Study 2. Cross-validation error rates (%), dataset / CCP / FBP / winner: Australian Credit Approval 14.13 / 14.01 / FBP; Cleveland Heart Disease 21.15 / 20.89 / FBP; Congressional Voting Records 4.16 / 4.12 / FBP; Wisconsin Breast Cancer 4.56 / 4.47 / FBP; Iris Plants 5.20 / 5.07 / FBP; BUPA Liver Disorder 31.27 / 31.03 / FBP; PIMA Indian Diabetes 24.37 / 23.97 / FBP; Image Segmentation 3.84 / 3.83 / FBP; German Credit 24.61 / 24.48 / FBP; Vehicle Silhouette 28.02 / 27.90 / FBP; Waveform 22.86 / 22.83 / FBP; Satellite Image 12.67 / 12.65 / FBP.
Testing error rates (%), dataset / CCP / FBP / winner: Australian Credit Approval 14.84 / 14.58 / FBP; Cleveland Heart Disease 26.82 / 27.19 / CCP; Congressional Voting Records 5.50 / 5.60 / CCP; Wisconsin Breast Cancer 5.09 / 4.94 / FBP; Iris Plants 7.73 / 7.33 / FBP; BUPA Liver Disorder 35.20 / 34.74 / FBP; PIMA Indian Diabetes 25.34 / 25.26 / FBP; Image Segmentation 5.80 / 5.81 / CCP; German Credit 27.60 / 27.06 / FBP; Vehicle Silhouette 30.25 / 30.27 / CCP; Waveform 27.47 / 27.47 / tie; Satellite Image 12.85 / 12.85 / tie.
Numerical Study 3. Tree size (total number of nodes), dataset / CCP / FBP / winner: Australian Credit Approval 10.60 / 8.00 / FBP; Cleveland Heart Disease 8.20 / 6.80 / FBP; Congressional Voting Records 7.00 / 5.80 / FBP; Wisconsin Breast Cancer 14.20 / 10.60 / FBP; Iris Plants (CCP entry illegible) / 6.60 / FBP; BUPA Liver Disorder 16.60 / 13.80 / FBP; PIMA Indian Diabetes 15.20 / 9.00 / FBP; Image Segmentation 89.40 / 87.20 / FBP; German Credit 29.20 / 29.20 / tie; Vehicle Silhouette 63.40 / 62.60 / FBP; Waveform 69.00 / 69.00 / tie; Satellite Image 249.00 / 249.00 / tie. 75
Software for Tree-Based Models: free CART evaluation (http://www.salford-systems.com/cart.php); MATLAB; S-PLUS; R (tree, rpart, etc.); numerous (expensive) data mining packages. 76
Generalization of CART. For regression trees, can we replace the constant function by other models, e.g. a linear regression model or polynomials? For classification trees,
can we replace the Bayes rule by other classification methods, e.g. LDA or logistic regression? These have been studied in the literature. 78
Dummy Variables: global vs. local models. It is sometimes necessary to determine how a dependent variable is related to an independent variable when a qualitative factor is influencing the situation. Example: separate regressions of job performance rating (y) on aptitude test score (x1) for females, y = 2.09 + 0.12 x1 + ε, and for males, y = 3.89 + 0.12 x1 + ε, versus an overall fit y = 1.44 + 0.10 x1 + ε. Coding the dummy variable x2 (female = 0, male = 1) gives the single model y = β0 + β1 x1 + β2 x2 + ε, fitted as y = 1.96 + 0.12 x1 + 2.18 x2 + ε. [Figure: job performance rating vs. aptitude test score (0 to 120), showing the separate, overall, and dummy-variable fits.] 79
Used Car Auction Example. Objective: predict used car price based on mileage, age, color, condition, etc. [Figure 2: hierarchical structure of segments for the used car problem, e.g. make (BMW, Honda, ...), then model, then trim.] 80
Generalization of CART: should we fit a single global model to all data, with category as dummy variables (less data, e.g. BMW)? Should we fit local models at the lowest-level regions (more data, e.g. Civic)? How about at intermediate levels (clustering segments by how well the model fits)? Should we split the levels of quantitative variables? How much past time-series data should we use? 81
Generalization of CART (Loh, U. of Wisconsin): GUIDE (http://www.stat.wisc.edu/~loh/guide.html); QUEST, a binary classification tree; CRUISE, a classification tree that splits each node into two or more subnodes; LOTUS, a logistic regression tree. 82
GUIDE Regression Tree: negligible bias in split variable selection, critical for correct tree interpretation (no other algorithm has this property); sensitivity to local interactions between pairs of predictor variables (no other algorithm has this property); ability to use ordered (continuous) and unordered (categorical) predictor variables; choice of weighted least squares, Gaussian, Poisson, quantile (including median), or proportional hazards regression; choice of piecewise constant, best simple polynomial, multiple linear, or stepwise regression models; choice of roles for predictor variables (splitting only, node modeling only, both, or none); choice of using categorical variables for splitting only, or for both splitting and fitting through dummy 0-1 vectors; choice of stopping rules (no pruning, pruning by cross-validation, or pruning with a test sample); free executables for Windows, Macintosh and Linux computers. 83
Related Papers:
Loh, W.-Y. (2006), Regression by parts: fitting visually interpretable models with GUIDE, Handbook of Computational Statistics, vol. III, Springer, in press.
Kim, H., Loh, W.-Y., Shih, Y.-S. and Chaudhuri, P. (2006), A visualizable and interpretable regression model with good prediction power, IIE Transactions, Special Issue on Data Mining and Web Mining, in press.
Loh, W.-Y. (2006), Regression tree models for designed experiments, The Second Erich L. Lehmann Symposium: Optimality, IMS Lecture Notes-Monograph Series, vol. 49, 210-228.
Loh, W.-Y. (2002), Regression trees with unbiased variable selection and interaction detection, Statistica Sinica, vol. 12, 361-386. (The principal reference for GUIDE.)
Chaudhuri, P. and Loh, W.-Y. (2002), Nonparametric estimation of conditional quantiles using quantile regression trees, Bernoulli, vol. 8, 561-576. (Extends GUIDE to quantile regression.)
Chaudhuri, P., Lo, W.-D., Loh, W.-Y. and Yang, C.-C. (1995), Generalized regression trees, Statistica Sinica, vol. 5, 641-666. (The first paper on Poisson and logistic regression trees.)
Chaudhuri, P., Huang, M.-C., Loh, W.-Y. and Yao, R. (1994), Piecewise-polynomial regression trees, Statistica Sinica, vol. 4, 143-167. (The first paper on polynomial regression trees.) 84
A visualizable and interpretable regression model with good prediction power (Kim et al.)
Table 5. Datasets: N denotes the number of training cases, N' the number of test cases (if any), Q the number of quantitative predictors and C the number of categorical predictors (category sizes in parentheses). Name / N / N' / Q / C / Source: Abalone; Alcohol (Kenkel and Terza); Amenity (Chattopadhyay); Attend (Cochran 2002); Baseball; Basketball (Simonoff 1996); Boston (Belsley et al. 1980); Boston2 (Belsley et a
both or none Choice of using categorical variables for splitting only or both splitting and tting through dummy Ol vectors Choice of stopping rules no pruning pruning by crossvalidation or pruning with a test sample Free executables for Windows Macintosh and Linux computers 83 Related Papers Loh WY 2006 Regression by parts Fitting Visually interpretable models with GUIDE Handbook of Computational Statistics vol III Springer in press Kim H Loh WY Shih YS and Chaudhuri P 2006 A Visualizable and interpretable regression model With good prediction power HE Transactions Special Issue on Data Mining and Web Mining in press Loh WY 2006 Regression tree models for designed experiments The Second Erich L Lehmann SymposiumOptimality Institute of Mathematical Statistics Lecture Notes Monograph Series vol 49 210228 Loh WY 2002 Regression trees with unbiased variable selection and interaction detection Statistica Sinica vol 12 361386 This is the principal reference for GUIDE Chaudhuri P and Loh WY 2002 Nonparametric estimation of conditional quantiles using quantile regression trees Bernoulli vol 8 561576 This paper extends GUIDE to quantile regression Chaudhuri P Lo WD Loh WY and Yang CC 1995 Generalized regression trees Statistica Sinica vol 5 641666 This is the rst paper on Poisson and logistic regression trees Chaudhuri P Huang MC Loh WY and Yao R 1994 Piecewisepolynomial regression trees Statistica Sinica vol 4 143167 This is the rst paper on polynomial regression trees 84 A visualizable and interpretable regression model with good prediction power Kim et al Table 5 Datasets N denotes the number of training cases N the number of test cases if any Q the number of quantitativo predictors and C be number of categorical predictors category sizes in parenthesgs Name N C I Source Abalone 1 1 11c Ass 1 k and VVElsbsrg 1991 Alcohol 61 E kE39 and Tana 39 111 Amenity Chatlupadhya 2031 Attend C39othran 2002 Basel 1 i Baskhall Simon 111946 Boston Eelsley a a1 1930 13930112 Belsley em a 
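The dummy-variable model from the start of this section can be checked numerically. This is a minimal Python sketch on invented data (all coefficients and sample sizes below are made up for illustration): fitting y = b0 + b1*x1 + b2*x2 by least squares recovers a common slope and a per-group intercept shift.

```python
import numpy as np

# Hypothetical data for the dummy-variable model
# y = b0 + b1*x1 + b2*x2 + noise, with x2 = 0 (female) / 1 (male).
rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(50, 120, n)           # aptitude test score
x2 = rng.integers(0, 2, n)             # gender dummy
y = 20.0 + 0.1 * x1 + 18.0 * x2 + rng.normal(0, 1.0, n)

# Least-squares fit of the combined (global) model with an intercept column.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# beta[0]: female intercept; beta[0] + beta[2]: male intercept;
# beta[1]: common slope shared by both groups.
print(beta)
```

The single global fit uses all the data to estimate the shared slope, which is the point of the dummy-variable coding versus fitting two separate local models.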
[Table 6: labels of the twenty-five algorithms compared.]

- Cart: CART, piecewise constant
- Cr: CUBIST, rule-based model; Ci: CUBIST composite (Cr plus a nearest-neighbor method); Crb: boosted Cr (committee model)
- Gc: GUIDE, piecewise constant; Gl: GUIDE, piecewise simple linear; Gq: GUIDE, piecewise simple quadratic; Gm: GUIDE, piecewise multiple linear; Gs: GUIDE, piecewise stepwise linear; Gs2: GUIDE, two-regressor stepwise linear; Gf2: GUIDE, two-regressor forward linear
- gam: generalized additive model
- lad: least absolute deviations regression
- lr: least squares linear regression
- Mc: M5, piecewise constant; Mcb: bagged Mc; Mm: M5, piecewise multiple linear; Mmb: bagged Mm
- mars: MARS; mart: MART
- nnet: neural network
- pol: POLYMARS
- ppr: projection pursuit regression
- Rc: RT, piecewise constant; Rm: RT, piecewise multiple linear; Rp: RT, piecewise partial linear
- rreg: robust regression

Algorithm details:

CART: piecewise constant regression tree (CART is a registered trademark of California Statistical Software, Inc.). Version 4 of the Windows implementation (Steinberg and Colla, 1995) is used, with 10-fold cross-validation and the default SE pruning rule. The minimal node size is 10, except for the largest dataset, where it is raised to 100 because of the program's memory limitations.

CUBIST: a rule-based algorithm due to R. Quinlan (www.rulequest.com/cubist-info.html). Three types of models are studied: rule-based only (Cr), composite (Ci), and committee (Crb). The Ci model combines Cr with a nearest-neighbor method; Crb is a boosted version of Cr.

GAM: generalized additive model (Hastie and Tibshirani, 1990), via the S-Plus function gam with the Gaussian family and the nonparametric smoothing-spline option.

GUIDE: generalized regression tree (Loh, 2002). Gc and Gm denote piecewise constant and piecewise multiple linear models; categorical variables are used for splitting, and for regression modeling via dummy vectors in Gm. The proposed piecewise simple linear and simple quadratic models are denoted by Gl and Gq, respectively. Gs uses forward and backward stepwise regression in each node; when the number of regressors is limited to two, the method is denoted Gs2. Gf2 uses two-regressor forward-only stepwise regression at each node. All trees are pruned with the default SE rule.

lad: least absolute deviations regression, via the S-Plus function l1fit.

lr: least squares linear regression, via the R function lm.

NNET: neural network, via the R function nnet with size=3, decay=0.001, linout=TRUE, skip=TRUE, and maxit=200.

M5: piecewise constant (Mc) and piecewise multiple linear (Mm) regression trees, as implemented in version 3.2 of WEKA (Witten and Frank, 2000). When bagging is employed, ten iterations are used, giving Mcb and Mmb.

MARS: multivariate adaptive regression splines (Friedman, 1991), via the R function mars in the mda library with degree=1, penalty=2, and thresh=0.001.

MART: multivariate adaptive regression tree (Friedman, 2001), a stochastic gradient boosting algorithm applied to regression trees, run with 10-fold cross-validation and 200 boosting iterations.

POLYMARS: an adaptive regression procedure using piecewise linear splines (Kooperberg, Bose, and Stone, 1997), via the R function polymars in the polspline library with the gcv option for model selection; the maximum numbers of basis functions and of knots per predictor are set as functions of the sample size n.

ppr: projection pursuit regression (Friedman and Stuetzle, 1981), via the R function ppr with optlevel=2 in the modreg library.

rreg: robust regression with M-estimation (Huber, 1981), via the R function rlm in the MASS library (Venables and Ripley, 1999), with k2 = 1.345 and acc = 1e-4.

RT: a regression tree algorithm due to L. Torgo (1999). Like M5, it first grows a piecewise constant tree and then fits various linear models to the leaf nodes during pruning. Rc denotes piecewise constant, Rm piecewise multiple linear, and Rp piecewise partial linear with bandwidth 10.
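The benchmark results that follow report each algorithm's prediction RMSE relative to that of least squares linear regression, so a ratio below 1 means the method beats the linear baseline. A hedged Python sketch on invented data, comparing a one-split piecewise-constant predictor (a stand-in for a stump-sized regression tree) against the linear fit:

```python
import numpy as np

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 500)
y = 1.5 * x + rng.normal(0, 0.5, 500)   # a truly linear target

# Least-squares linear fit: the baseline that all ratios are taken against.
X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
pred_lin = X @ b

# A crude piecewise-constant predictor with a single split at x = 0.
left, right = y[x < 0].mean(), y[x >= 0].mean()
pred_tree = np.where(x < 0, left, right)

# Relative RMSE > 1 means the method does worse than linear regression.
rel = rmse(y, pred_tree) / rmse(y, pred_lin)
print(round(rel, 2))
```

On a linear target the piecewise-constant tree is badly biased, so its relative RMSE lands well above 1, mirroring why piecewise constant trees sit at the bottom of the benchmark ranking.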
A visualizable and interpretable regression model with good prediction power (Kim et al.)

[Table 7: geometric means of prediction RMSE relative to least squares linear regression, in increasing order; algorithms whose means are not significantly different from the best are flagged. Ensemble methods and multiple-linear tree methods cluster below 1.0, while piecewise constant trees fall above it.]

[Figure: dot plot of each algorithm's prediction RMSE relative to linear regression across the datasets, on a log scale from roughly 0.1 to 10.]

[Figure: geometric mean prediction RMSE relative to linear regression, roughly 0.85 to 1.00, by algorithm.]

Bagging

Motivation: combine the predictors fitted to bootstrap samples to improve predicting power.

Bootstrap Methods

Main idea: (1) sampling with replacement; (2) the procedure below.

Example: treat the training set Z = {z1, ..., zn}, zi = (xi, yi), as the "population." Generate B bootstrap samples Z*1, ..., Z*B by drawing samples of size n with replacement from this population. Then, for a statistic S(Z) that estimates an unknown parameter, estimate the variance of S(Z) by

Var_hat[S(Z)] = (1/(B-1)) * sum_{b=1..B} (S(Z*b) - S_bar*)^2,

where S_bar* = (1/B) * sum_{b=1..B} S(Z*b).

Bagging Method

1. Fit a model to each bootstrap sample.
2. Average the predictions from the B fitted models. For classification, either average the indicator (vote) vectors, or average the conditional class probabilities and apply the Bayes classification rule to determine the class.
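The bootstrap variance estimate above is mechanical enough to sketch directly. A minimal Python version (the data, the choice of the median as the statistic, and B are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(10.0, 2.0, 100)     # the training sample, treated as the "population"

B = 1000
stats = np.empty(B)
for b in range(B):
    zb = rng.choice(z, size=z.size, replace=True)   # sample with replacement
    stats[b] = np.median(zb)                        # statistic S(Z*b)

# Bootstrap variance estimate: (1/(B-1)) * sum_b (S(Z*b) - S_bar*)^2
s_bar = stats.mean()
var_hat = np.sum((stats - s_bar) ** 2) / (B - 1)
print(var_hat)
```

The same resampling loop is the first step of bagging: instead of recording a scalar statistic per bootstrap sample, one would fit a model per sample and average the B predictions.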
In formulas, the bagged estimate is

f_hat_bag(x) = (1/B) * sum_{b=1..B} f_hat*b(x).

[Figure (Hastie et al.): error curves for the bagging example; test error of the original tree and of bagged trees as a function of the number of bootstrap samples. Voting the classifications and averaging the probabilities give similar curves, both below the original tree's error.]

[Figure (Hastie et al.): bagging a classification tree on simulated data; the original tree and the trees fitted to bootstrap samples b = 1, ..., 11 split on slightly different variables and cutpoints.]

Boosting and MART

Motivation: combine the outputs of many weak learners to produce a powerful prediction rule.

AdaBoost

The most popular boosting algorithm, introduced by Freund and Schapire in 1996. Key idea: train weak learners on weighted versions of the training sample, giving higher weights to the cases that are currently misclassified. The final model is the sign of a weighted sum of the weak learners, F(x) = sign(sum_m alpha_m fm(x)).

AdaBoost algorithm. Given (x1, y1), ..., (xN, yN) with x a vector and y in {-1, 1}:

1. Start with weights wi = 1/N, i = 1, ..., N.
2. For m = 1 to M:
   a. Fit fm(x) in {-1, 1} using weights wi on the training data.
   b. Compute err_m = sum_i wi 1(yi != fm(xi)) / sum_i wi.
   c. Compute alpha_m = log((1 - err_m) / err_m).
   d. Set wi <- wi * exp(alpha_m * 1(yi != fm(xi))), i = 1, ..., N.
3. Output F(x) = sign(sum_{m=1..M} alpha_m fm(x)).

A synthetic example:
- 10 independent standard normal predictor variables.
- Deterministic response: Y = 1 if sum_j Xj^2 > the median of the chi-squared(10) distribution, and -1 otherwise.
- 2000 training cases, 10000 test cases.
- Weak learner: a two-terminal-node classification tree (a "stump").

[Figure: test error versus number of boosting iterations, falling from the single-stump error of 0.46 to well below the 400-node tree's error of 0.26.]
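The AdaBoost loop above is short enough to sketch directly. A minimal Python version with exhaustive decision stumps as the weak learners (the data, the threshold grid, and M are invented for illustration):

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: best (feature, threshold, sign) by exhaustive search."""
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for t in np.quantile(X[:, j], np.linspace(0.05, 0.95, 19)):
            for s in (1, -1):
                pred = np.where(X[:, j] > t, s, -s)
                err = np.sum(w * (pred != y))
                if err < best_err:
                    best_err, best = err, (j, t, s)
    j, t, s = best
    return lambda Z: np.where(Z[:, j] > t, s, -s)

def adaboost(X, y, M=20):
    n = len(y)
    w = np.full(n, 1.0 / n)                  # step 1: uniform weights
    learners, alphas = [], []
    for _ in range(M):
        f = fit_stump(X, y, w)               # step 2a: weighted weak learner
        miss = f(X) != y
        err = np.sum(w * miss) / np.sum(w)   # step 2b: weighted error
        err = np.clip(err, 1e-10, 1 - 1e-10)
        a = np.log((1 - err) / err)          # step 2c: alpha_m
        w = w * np.exp(a * miss)             # step 2d: upweight misclassified cases
        learners.append(f)
        alphas.append(a)
    return lambda Z: np.sign(sum(a * f(Z) for a, f in zip(alphas, learners)))

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.4, 1, -1)   # nonlinear boundary
F = adaboost(X, y, M=40)
acc = np.mean(F(X) == y)
print(acc)
```

Each individual stump barely beats coin-flipping on this circular boundary, yet the weighted vote of 40 of them fits the training set well, which is the point of the stump example on the slide.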
MART

A gradient-based boosting algorithm with small trees as weak learners, introduced by Friedman in 1999. Key idea: at each iteration, a small tree is fitted to the gradient of the loss function.

MART algorithm. Given (x1, y1), ..., (xN, yN):

1. Initialize f0(x) = argmin_gamma sum_i L(yi, gamma).
2. For m = 1 to M:
   a. Compute the pseudo-residuals r_im = -[dL(yi, f(xi)) / df(xi)] evaluated at f = f_{m-1}, i = 1, ..., N.
   b. Fit a tree to the r_im, giving terminal regions R_jm, j = 1, ..., Jm.
   c. Compute gamma_jm = argmin_gamma sum_{xi in R_jm} L(yi, f_{m-1}(xi) + gamma).
   d. Update fm(x) = f_{m-1}(x) + sum_j gamma_jm 1(x in R_jm).
3. Output f_hat(x) = fM(x).

Comparison of AdaBoost, Bagging, and CART

Dataset descriptions:

| Dataset | Training | Testing | Variables | Classes |
|---|---|---|---|---|
| breast cancer | 699 | - | 9 | 2 |
| ionosphere | 351 | - | 34 | 2 |
| glass | 214 | - | 9 | 6 |
| soybean | 683 | - | 35 | 19 |
| letters | 15000 | 5000 | 16 | 26 |
| satellite | 4435 | 2000 | 36 | 6 |
| shuttle | 43500 | 14500 | 9 | 7 |
| DNA | 2000 | 1186 | 60 | 3 |
| digit | 7291 | 2007 | 256 | 10 |

Misclassification rates (%):

| Dataset | AdaBoost | Bagging | CART |
|---|---|---|---|
| breast cancer | 3.2 | 3.7 | 5.9 |
| ionosphere | 6.4 | 7.9 | 11.2 |
| glass | 22.0 | 23.2 | 30.4 |
| soybean | 5.8 | 6.8 | 8.6 |
| letters | 3.4 | 6.4 | 12.4 |
| satellite | 8.8 | 10.3 | 14.8 |
| shuttle | 0.007 | 0.014 | 0.062 |
| DNA | 4.2 | 5.0 | 6.2 |
| digit | 6.2 | 10.5 | 27.1 |

Some Characteristics of Different Learning Methods (Hastie et al.)

[Table: good/fair/poor ratings of neural networks, SVMs, trees, MARS, and k-NN/kernel methods on: natural handling of mixed-type data, handling of missing values, robustness to outliers in input space, insensitivity to monotone transformations of inputs, computational scalability for large N, ability to deal with irrelevant inputs, ability to extract linear combinations of features, interpretability, and predictive power. Trees rate well on everything except predictive power and extracting linear combinations; neural networks and SVMs show roughly the opposite pattern.]

R Software: Classification and Regression Tree

Load the tree package: click "Packages," then "Install package(s)," choose a CRAN mirror (e.g., "USA (CA 1)"), and select "tree."

```
> library(tree)
> iris.tr <- tree(Species ~ ., data = iris, subset = samp)
> iris.tr
node), split, n, deviance, yval, (yprob)
      * denotes terminal node
 1) root 75 164.800 setosa ( 0.33333 0.33333 0.33333 )
   2) Petal.Length < 2.45 25   0.000 setosa ( 1.00000 0.00000 0.00000 ) *
   3) Petal.Length > 2.45 50  69.310 versicolor ( 0.00000 0.50000 0.50000 )
     6) Petal.Length < 4.85 24   8.314 versicolor ( 0.00000 0.95833 0.04167 )
      12) Sepal.Length < 5.55 5   5.004 versicolor ( 0.00000 0.80000 0.20000 ) *
      13) Sepal.Length > 5.55 19  0.000 versicolor ( 0.00000 1.00000 0.00000 ) *
     7) Petal.Length > 4.85 26  14.100 virginica ( 0.00000 0.07692 0.92308 )
      14) Petal.Length < 5.05 6   7.638 virginica ( 0.00000 0.33333 0.66667 ) *
      15) Petal.Length > 5.05 20  0.000 virginica ( 0.00000 0.00000 1.00000 ) *
```
```
> summary(iris.tr)
Classification tree:
tree(formula = Species ~ ., data = iris, subset = samp)
Variables actually used in tree construction:
[1] "Petal.Length" "Sepal.Length"
Number of terminal nodes:  5
Residual mean deviance:  0.1806 = 12.64 / 70
Misclassification error rate: 0.04 = 3 / 75
> table(iris$Species[samp], predict(iris.tr, iris[samp, ], type = "class"))
             setosa versicolor virginica
  setosa         25          0         0
  versicolor      0         23         2
  virginica       0          2        23
> test.error <- (2 + 2)/75
> test.error
[1] 0.05333333
> plot(iris.tr); text(iris.tr)
```

[Figure: the fitted tree, splitting first on Petal.Length < 2.45 (setosa), then on Petal.Length < 4.85, Sepal.Length < 5.55, and Petal.Length < 5.05, with leaves setosa, versicolor, versicolor, virginica, virginica.]

HMEQ example:

```
> d <- hmeq
> dim(d)
[1] 3067   13
> names(d)
 [1] "BAD"     "LOAN"    "MORTDUE" "VALUE"   "REASON"  "JOB"     "YOJ"
 [8] "DEROG"   "DELINQ"  "CLAGE"   "NINQ"    "CLNO"    "DEBTINC"
> set.seed(12)
> id <- sample(1:3067, size = 2045, replace = F)
> d1 <- d[id, ]
> d2 <- d[-id, ]
> dim(d1)
[1] 2045   13
> dim(d2)
[1] 1022   13
> ctree <- tree(BAD ~ ., data = d1, method = "class")
> ctree
node), split, n, deviance, yval
      * denotes terminal node
  1) root 2045 162.5000 0.08704
    2) DEBTINC < 45.439 2009 132.8000 0.07118
      4) DELINQ < 2.5 1988 114.5000 0.06137
        8) DEROG < 1.5 1973 101.2000 0.05423
         16) CLAGE < 62.1048 53 10.7500 0.28300 *
         17) CLAGE > 62.1048 1920 87.5900 0.04792
           34) VALUE < 47501 72 13.9900 0.26390
             68) LOAN < 9650 24 5.6250 0.62500
              136) CLAGE < 175.971 15 0.9333 0.93330 *
              137) CLAGE > 175.971 9 0.8889 0.11110 *
             69) LOAN > 9650 48 3.6670 0.08333 *
           35) VALUE > 47501 1848 70.1200 0.03950
             70) DEBTINC < 43.6007 1828 63.6200 0.03611 *
             71) DEBTINC > 43.6007 20 4.5500 0.35000
              142) CLAGE < 233.316 8 1.5000 0.75000 *
              143) CLAGE > 233.316 12 0.9167 0.08333 *
        9) DEROG > 1.5 15 0.0000 1.00000 *
      5) DELINQ > 2.5 21 0.0000 1.00000 *
    3) DEBTINC > 45.439 36 0.9722 0.97220 *
```
```
> summary(ctree)
Regression tree:
tree(formula = BAD ~ ., data = d1, method = "class")
Variables actually used in tree construction:
[1] "DEBTINC" "DELINQ"  "DEROG"   "CLAGE"   "VALUE"   "LOAN"
Number of terminal nodes:  10
Residual mean deviance:  0.04091 = 83.25 / 2035
Distribution of residuals:
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
-9.722e-01 -3.611e-02 -3.611e-02  2.171e-18  3.611e-02  9.639e-01
```

Compute the testing error:

```
> pr <- predict(ctree, d2)
> pr <- pr > 0.5
> table(d2$BAD, pr)
   pr
    FALSE TRUE
  0   918    6
  1    53   45
> test.error <- (53 + 6)/1022
> test.error
[1] 0.05772994
> plot(ctree); text(ctree)
```

[Figure: the fitted HMEQ tree, with splits on DEBTINC, DELINQ, DEROG, CLAGE, VALUE, and LOAN.]

Comparison of Logistic Regression and Tree

Logistic regression model:

```
> lreg1 <- glm(BAD ~ DEROG + DELINQ + CLAGE + NINQ + DEBTINC,
               data = train, family = binomial)
> summary(lreg1)
Call:
glm(formula = BAD ~ DEROG + DELINQ + CLAGE + NINQ + DEBTINC,
    family = binomial, data = train)
```

Regression tree (from above): the variables actually used are DEBTINC, DELINQ, DEROG, CLAGE, VALUE, and LOAN; 10 terminal nodes; residual mean deviance 0.04091 = 83.25/2035.

Test errors:

```
> table(pr, test$BAD)            # logistic regression
pr        0  1
  FALSE 919 71
  TRUE    5 27
> test.error <- (71 + 5)/1022
> test.error
[1] 0.07436399
> table(d2$BAD, pr)              # regression tree
   pr
    FALSE TRUE
  0   918    6
  1    53   45
> test.error <- (53 + 6)/1022
> test.error
[1] 0.05772994
```

Model Assessment and Selection
Kwok-Leung Tsui, Industrial & Systems Engineering, Georgia Institute of Technology (2/2/2009)

Data Mining (KDD) Process: Determine Business Objectives -> Data Preparation -> Mining & Modeling -> Consolidation and Application.

Data Mining & Modeling: partition the sample into Train Data, Validation Data, and Test Data (with separate Evaluation Data and Score Data for deployment).
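The train/validation/test partition above can be sketched end to end: fit candidate models on the training set, select among them with the validation set, and assess only the selected model on the test set. The course examples use R; this is a Python sketch with invented data and model family:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 600)
y = np.sin(3 * x) + rng.normal(0, 0.2, 600)

# Partition the sample: 50% train, 25% validation, 25% test.
idx = rng.permutation(600)
tr, va, te = idx[:300], idx[300:450], idx[450:]

def fit_poly(deg):
    return np.polyfit(x[tr], y[tr], deg)

def mse(coef, rows):
    return float(np.mean((np.polyval(coef, x[rows]) - y[rows]) ** 2))

# Model selection: pick the polynomial degree with the lowest validation error.
val_err = {d: mse(fit_poly(d), va) for d in range(1, 10)}
best = min(val_err, key=val_err.get)

# Model assessment: report the test error of the selected model only.
test_err = mse(fit_poly(best), te)
print(best, round(test_err, 3))
```

Keeping the test set out of both fitting and selection is what makes its error an honest estimate of generalization error.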
Modeling loop: Start -> Choose Models -> Build/Fit Model -> Refine/Tune Model (model size and diagnosis) -> Evaluate Model (e.g., prediction error) -> does it meet the accuracy requirement? If NO, collect more data or consider alternative models; if YES, proceed to prediction and decision making.

Supervised & Unsupervised Learning

- Supervised learning: learning with a teacher. Example: classification, e.g., classifying online shoppers as buyers vs. non-buyers.
- Unsupervised learning: learning without a teacher. Example: clustering, e.g., segmenting the non-buyers among online shoppers.
- Other related terms: machine learning (analogies to human learning), neural networks (biological analogies to the brain).

Supervised Learning

- Inputs (predictors, independent variables) x: a set of variables which are measured or preset.
- Outputs (responses, dependent variables) y: a set of measurable variables which are influenced by the inputs.
- Steps: establish a model y_hat = f_hat(x) based on collected inputs and outputs (x, y); then predict the outputs from the established model for a new set of specified inputs.

Training, Validation, and Test Error

Two basic objectives:
- Model selection: choose the best model based on certain performance measures.
- Model assessment: estimate the prediction error of the final model.

Error types:
- Training error: fitting the model.
- Validation error: selecting the model.
- Test (generalization) error: assessing the model.

Methods: analytical (Cp, AIC, BIC, etc.) and resampling (cross-validation, bootstrap).

[Figure: prediction error versus model complexity; training error decreases monotonically while test error is U-shaped, so low training error at high complexity signals overfitting.]

Training error and cross-validation error are computed from the training data (the CV error from held-out folds of it); the test error is computed by applying the fitted model to separate testing data.

Bias-Variance Decomposition

[Figure 2.11 (Hastie et al.): test and training error as a function of model complexity; high bias and low variance at low complexity, low bias and high variance at high complexity.]
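Cross-validation, listed above as a resampling method for estimating prediction error, can be sketched as follows (a Python sketch; the data and polynomial models are invented for illustration):

```python
import numpy as np

def kfold_mse(x, y, deg, K=5, seed=0):
    """K-fold cross-validation estimate of prediction MSE
    for a degree-`deg` polynomial fit."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[i] for i in range(K) if i != k])
        coef = np.polyfit(x[train], y[train], deg)
        errs.append(np.mean((np.polyval(coef, x[test]) - y[test]) ** 2))
    return float(np.mean(errs))

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, 300)
y = x ** 2 + rng.normal(0, 0.1, 300)

# CV error is high for an underfit line and near the noise floor at degree 2.
cv1 = kfold_mse(x, y, deg=1)
cv2 = kfold_mse(x, y, deg=2)
print(round(cv1, 3), round(cv2, 3))
```

Because every fold's error is computed on data held out from that fold's fit, the CV average estimates test error rather than training error.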
[Figure (Hastie et al.): expected prediction error for k-NN regression and classification as a function of the number of neighbors k, and for linear-model regression and classification as a function of subset size p; each panel shows the expected prediction error together with its squared-bias and variance components.]

Bias-Variance Decomposition

[Figure: schematic of model space showing the truth, the closest fit in the restricted model class, model bias, estimation bias, estimation variance, and the shrunken fit relative to the least-squares fit.]

For squared error loss, the expected prediction error at an input point x0 decomposes as

Err(x0) = E[(Y - f_hat(x0))^2 | X = x0]
        = sigma_eps^2 + [E f_hat(x0) - f(x0)]^2 + E[f_hat(x0) - E f_hat(x0)]^2
        = sigma_eps^2 + Bias^2(f_hat(x0)) + Var(f_hat(x0))
        = irreducible error + squared bias + variance.
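The bias-variance decomposition can be verified by simulation: repeatedly refit a deliberately underfit model on fresh training sets, then compare sigma^2 + bias^2 + variance with a direct Monte-Carlo estimate of the prediction error at a fixed point (a Python sketch; the target function, noise level, and sample sizes are invented):

```python
import numpy as np

# Monte-Carlo check of Err(x0) = sigma^2 + Bias^2 + Variance for a
# degree-1 polynomial fit to f(x) = x^2, evaluated at x0 = 0.8.
rng = np.random.default_rng(6)
f = lambda x: x ** 2
sigma, x0, reps = 0.1, 0.8, 2000

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(-1, 1, 50)
    y = f(x) + rng.normal(0, sigma, 50)
    coef = np.polyfit(x, y, 1)        # deliberately underfit (linear model)
    preds[r] = np.polyval(coef, x0)

bias2 = (preds.mean() - f(x0)) ** 2
var = preds.var()
total = sigma ** 2 + bias2 + var       # the three-term decomposition

# Direct estimate: average squared error against fresh noisy responses at x0.
y0 = f(x0) + rng.normal(0, sigma, reps)
direct = np.mean((y0 - preds) ** 2)
print(round(total, 4), round(direct, 4))
```

For this underfit model the squared-bias term dominates the variance term, which is exactly the low-complexity end of the test-error curve discussed above.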
