Stat 648 Outline

Steve Vardeman*
Iowa State University
May 10, 2009

*Course development supported by NSF grant DMS 0502347 (EMSW21-RTG) awarded to the Department of Statistics, Iowa State University.

Abstract

This outline summarizes the main points of lectures based on the book The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. Some of the later material here benefits from Izenman's Modern Multivariate Statistical Techniques.

Contents

1 Introduction (Ch. 1 and Ch. 2 of HTF)
  1.1 Some Very Simple Probability Theory and Decision Theory and Prediction Rules
  1.2 Two Conceptually Simple Ways One Might Approximate a Conditional Mean
    1.2.1 Nearest Neighbor Rules
    1.2.2 Rules Based on a Linear Assumption About the Form of the Conditional Mean
  1.3 The Curse of Dimensionality
  1.4 Variance-Bias Considerations in Prediction
  1.5 Some Fancier Ways One Might Approximate a Conditional Mean
    1.5.1 Non-Linear Regression Methods / Least Squares Fitting
    1.5.2 Methods Based on Local Smoothing/Averaging
  1.6 Complexity, Fit, and Choice of Tuning Parameters
2 Linear Methods (Ch. 3 of HTF)
  2.1 The Gram-Schmidt Process and the QR Decomposition
  2.2 The Singular Value Decomposition of a Full Rank X
  2.3 Ridge Regression and the Lasso, and Some Generalizations Thereof
    2.3.1 Ridge Regression
    2.3.2 The Lasso and Generalizations
  2.4 Two Methods With Derived Input Variables
    2.4.1 Principal Components Regression
    2.4.2 Partial Least Squares Regression
3 Linear and a Bit on Quadratic Methods of Classification (Ch. 4 of HTF)
  3.1 Linear (and a bit on Quadratic) Discriminant Analysis
  3.2 Logistic Regression
  3.3 Separating Hyperplanes
4 Basis Expansions and Regularization (Ch. 5 of HTF)
  4.1 Piecewise Polynomials and Regression Splines
  4.2 Smoothing Splines
  4.3 Generalizations to p-Dimensional Inputs
    4.3.1 Multi-Dimensional Regression Splines (Tensor Bases)
    4.3.2 Multi-Dimensional Smoothing Splines
  4.4 Reproducing Kernel Hilbert Spaces and Fitting Functions in $\Re^p$
    4.4.1 Some Special Cases
  4.5 Wavelet Bases
5 Kernel Smoothing Methods (Ch. 6 of HTF)
  5.1 One-dimensional Kernel Smoothers
  5.2 Kernel Smoothing in p Dimensions
    5.2.1 Structured Kernels
    5.2.2 Structured Regression Functions
  5.3 Kernel Density Estimation
6 Understanding and Predicting Predictor Performance (Ch. 7 of HTF)
  6.1 More on Predictor Squared Bias
  6.2 Optimism of the Training Error
  6.3 Cp, AIC and BIC
    6.3.1 Cp and AIC
    6.3.2 BIC
  6.4 Cross-Validation Estimation of Err
  6.5 Bootstrap Estimation of Err
7 Inference and Model Averaging (Ch. 8 of HTF)
  7.1 Bayes Predictors
  7.2 Predictors Built on Bootstrap Samples
    7.2.1 Bagging
    7.2.2 Bumping
  7.3 Model Averaging and Stacking
    7.3.1 Bayesian Model Averaging
    7.3.2 Stacking
8 Additive Models, Trees, and Related Methods (Ch. 9 of HTF)
  8.1 Additive Models
  8.2 Tree-Based Methods
  8.3 PRIM (Patient Rule Induction Method)
  8.4 MARS (Multivariate Adaptive Regression Splines)
9 Boosting (Ch. 10 of HTF)
  9.1 The AdaBoost Algorithm
  9.2 Why AdaBoost Might Be Expected to Work
    9.2.1 Expected "Exponential Loss" for a Function of X
    9.2.2 Sequential Search for a g With Small Expected Exponential Loss and the AdaBoost Algorithm
  9.3 Other Forms of Boosting
    9.3.1 SEL
    9.3.2 AEL and Binary Regression Trees
    9.3.3 K-Class Classification
  9.4 Issues More or Less Related to Boosting and Use of Trees as Base Predictors
10 Neural Networks (Ch. 11 of HTF)
  10.1 Projection Pursuit Regression
  10.2 Neural Networks
    10.2.1 The Back-propagation Algorithm
    10.2.2 Formal Regularization of Fitting
11 Support Vector Machines (Ch. 12 of HTF)
  11.1 The Linearly Separable Case
  11.2 The Linearly Non-separable Case
  11.3 SVM's and Kernels
    11.3.1 Heuristics
    11.3.2 An Optimality Argument
  11.4 Other Support Vector Stuff
12 Prototype Methods and Nearest Neighbors (Ch. 13 of HTF)
1 Introduction (Ch. 1 and Ch. 2 of HTF)

The course is mostly about "supervised" machine learning for statisticians. The fundamental issue here is prediction, usually with a fairly large number of predictors and a large number of cases upon which to build a predictor. For $X \in \Re^p$ an input vector (a vector of predictors, with $j$th coordinate $X_j$) and $Y$ some output value (usually univariate, and when it's not we'll indicate so with some form of bold face), the object is to use $N$ observed $(X, Y)$ pairs

$$(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N),$$

called training data, to create an effective prediction rule $\hat{Y}(X) = \hat{f}(X)$. "Supervised" learning means that the training data include values of the output variable.

A primary difference between a "machine learning" point of view and that common in basic graduate statistics courses is an ambivalence toward making statements concerning parameters of any models used in the invention of prediction rules, including those that would describe the amount of variation in observables one can expect the model to generate. Standard statistics courses often conceive of those parameters as encoding scientific knowledge about some underlying data-generating mechanism or real-world phenomenon. Machine learners don't seem to care much about these issues, but rather are concerned about "how good" a prediction rule is.

An output variable might take values in $\Re$ or in some finite set $\mathcal{G} = \{1, 2, \ldots, K\}$. In this latter case it is often convenient to replace a $\mathcal{G}$-valued output $Y^*$ with $K$ indicator variables

$$Y_j = I[Y^* = j], \quad j = 1, 2, \ldots, K,$$

of course each taking real values. This replaces the original $Y^*$ with the vector output $Y$ taking values in $\Re^K$.

1.1 Some Very Simple Probability Theory and Decision Theory and Prediction Rules

For purposes of appealing to general theory that can suggest sensible forms for prediction rules, suppose that the random pair $(X, Y)$ has joint distribution $P$. Until further notice, all expectations refer to distributions and conditional distributions derived from $P$. Suppose further that $L(\hat{y}, y) \ge 0$ is some loss function quantifying how unhappy I am with prediction $\hat{y}$ when in fact $Y = y$. I might consider trying to choose the form of $f$ to minimize

$$E\,L(f(X), Y).$$

This is in theory easily done, since

$$E\,L(f(X), Y) = E\left[E\left[L(f(X), Y) \mid X\right]\right]$$

and the inner conditional expectation can be minimized separately for each possible value of $X$. That is, an optimal choice of prediction rule is defined by

$$f(x) = \arg\min_a E\left[L(a, Y) \mid X = x\right]. \qquad (1)$$

In the simple case where $L(\hat{y}, y) = (\hat{y} - y)^2$, the optimal prediction rule (1) is well known to be

$$f(x) = E[Y \mid X = x].$$

In a simple two-class classification context where $Y^*$ takes values in $\mathcal{G} = \{1, 2\}$, I might set up a decision problem with action space $\mathcal{A} = \mathcal{G}$ and loss function

$$L(a, Y^*) = I[a \ne Y^*] = I[a = 1]I[Y^* = 2] + I[a = 2]I[Y^* = 1].$$

An optimal decision rule corresponding to (1) is

$$f(x) = \begin{cases} 1 & \text{if } E\left[I[Y^* = 1] \mid X = x\right] \ge \tfrac12 \\ 2 & \text{if } E\left[I[Y^* = 1] \mid X = x\right] < \tfrac12. \end{cases}$$

Further, if I define $Y = I[Y^* = 1]$, this is

$$f(x) = \begin{cases} 1 & \text{if } E[Y \mid X = x] \ge \tfrac12 \\ 2 & \text{if } E[Y \mid X = x] < \tfrac12. \end{cases}$$

So in both the squared error loss context and in the 0-1 loss 2-state classification context, the $P$-optimal choice of predictor involves the conditional mean of an output variable (albeit the second is a 0-1 output variable). These strongly suggest the importance of finding ways to approximate $E[Y \mid X = x]$ in order to either directly or indirectly produce good prediction rules.
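As a small numerical illustration of these two facts (this sketch is mine, not HTF's, and the particular discrete distribution is made up for the example), the following Python code simulates a 0/1 output with known conditional means and checks that the conditional mean approximately minimizes average squared error and that thresholding it at 1/2 beats a fixed competing classification rule.

```python
# A minimal simulation sketch checking that f(x) = E[Y|X=x] minimizes expected squared
# error and that thresholding E[Y|X=x] at 1/2 gives a good 0-1 loss rule.  The three-point
# distribution for X and the conditional probabilities are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.integers(0, 3, size=n)                     # X takes values 0, 1, 2
p = np.array([0.2, 0.5, 0.8])[x]                   # P[Y = 1 | X = x]
y = rng.binomial(1, p)                             # 0/1 output

cond_mean = np.array([y[x == v].mean() for v in range(3)])   # estimates of E[Y|X=v]

def sq_loss(f):          # average squared error of a predictor f : {0,1,2} -> R
    return np.mean((y - f[x]) ** 2)

print(sq_loss(cond_mean))                          # near the minimum achievable
print(sq_loss(cond_mean + 0.1))                    # perturbing the rule does worse

def err_rate(rule):      # 0-1 loss of a classifier rule : {0,1,2} -> {0,1}
    return np.mean(y != rule[x])

print(err_rate((cond_mean >= 0.5).astype(int)))    # threshold-at-1/2 rule
print(err_rate(np.array([1, 1, 1])))               # a fixed competitor does worse
```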
1.2 Two Conceptually Simple Ways One Might Approximate a Conditional Mean

HTF introduce two standard and conceptually appealing ways of approximating $E[Y \mid X = x]$ for $(X, Y)$ with distribution $P$. The tacit assumption here is that the training data are realizations of

$$(X_1, Y_1), (X_2, Y_2), \ldots, (X_N, Y_N) \ \text{iid} \ P.$$

1.2.1 Nearest Neighbor Rules

One idea is to think that if $N$ is "big enough," perhaps something like

$$\hat{E}[Y \mid X = x] = \frac{\sum_{i \text{ with } x_i = x} y_i}{\#\{i \text{ with } x_i = x\}}$$

might work as an approximation to $E[Y \mid X = x]$. But almost always (unless $N$ is huge and the distribution of $X$ is discrete) the number of $x_i = x$ is only 0 or 1. So some modification is surely required. Perhaps the condition $x_i = x$ might be replaced with $x_i \approx x$. One form of this is to define, for each $x$, the $k$-neighborhood

$$N_k(x) = \text{the set of the } k \text{ inputs } x_i \text{ in the training set closest to } x \text{ in } \Re^p.$$

A $k$-nearest neighbor approximation to $E[Y \mid X = x]$ is then

$$\hat{f}(x) = \frac{1}{k}\sum_{i \text{ with } x_i \in N_k(x)} y_i,$$

suggesting the prediction rule $\hat{Y}(X) = \hat{f}(X)$. One might hope that, upon allowing $k$ to increase with $N$ (provided that $P$ is not too bizarre), this could be an effective predictor.

1.2.2 Rules Based on a Linear Assumption About the Form of the Conditional Mean

$k$-nearest neighbor prediction rules potentially provide very flexible forms, but also potentially require huge $N$ to produce reliable results. Something far simpler, far less flexible, but (when appropriate) also more reliably fit to a training data set is a linear form. Letting $\mathbf{X}$ be the $N \times p$ matrix of training inputs (with $i$th row $x_i'$) and $Y$ the corresponding vector of output values in the training set, the usual Stat 511 material says that least squares fitting of a linear form of conditional mean to the training data can be accomplished as

$$\hat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'Y$$

with corresponding predictor

$$\hat{Y}(X) = \hat{\beta}'X.$$

Regarding notation: as we must repeatedly refer to both the vector of inputs and the matrix of training values for the inputs, and considerable confusion is possible if we use the same face for both, I have decided to adopt the dual notation of $X$ for the first and $\mathbf{X}$ for the second. Since it is rare that one wants to refer to vector outputs, the notation $Y$ will be mostly unambiguous as the column vector of output values in the training data.
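To make the two rules of this section concrete, here is a short Python sketch (illustrative only; the data-generating mechanism and all names are made up) that contrasts a $k$-nearest-neighbor average with a least squares linear fit as approximations to the same conditional mean.

```python
# Sketch contrasting the two Section 1.2 rules on simulated data: a k-nearest-neighbor
# average and a least squares linear fit.
import numpy as np

rng = np.random.default_rng(1)
N, p = 500, 3
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=N)

def knn_predict(x0, k=15):
    """k-nearest-neighbor estimate of E[Y|X=x0]: average y over the neighborhood N_k(x0)."""
    d = np.sum((X - x0) ** 2, axis=1)
    return y[np.argsort(d)[:k]].mean()

# Least squares fit of a linear form (intercept column added by hand).
Xa = np.column_stack([np.ones(N), X])
beta_hat = np.linalg.lstsq(Xa, y, rcond=None)[0]

x0 = np.zeros(p)
print(knn_predict(x0), np.r_[1.0, x0] @ beta_hat)   # two approximations of E[Y|X=0]
```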
1.3 The Curse of Dimensionality

Roughly speaking, it's clear that one wants to allow the most flexible kinds of prediction rules that a given training data set will adequately support fitting. Naively, one might hope that with what seem like the "large" sample sizes in machine learning contexts, even if $p$ is fairly large, highly flexible prediction rules like $k$-nearest neighbor rules might be practically useful, as points "fill up" $\Re^p$ and allow one to figure out how $Y$ behaves as a function of $X$ operating quite locally. But our intuition about how sparse data sets in $\Re^p$ actually must be is pretty weak and needs some education. There is the "curse of dimensionality" to be reckoned with.

There are many ways of framing this inescapable sparsity. Some simple ones involve facts about uniform distributions on the unit ball in $\Re^p$,

$$\{x \in \Re^p \mid \|x\| \le 1\},$$

and on a unit cube centered at the origin, $[-\tfrac12, \tfrac12]^p$. For one thing, "most" of these distributions is very near the surface of the solids. The centered cube $[-c, c]^p$ capturing, for example, half the volume of the unit cube (half the probability mass of the distribution) has

$$c = \tfrac12\left(\tfrac12\right)^{1/p},$$

which converges to $\tfrac12$ as $p$ increases. Essentially the same story holds for the uniform distribution on the unit ball, except that the radius capturing half the probability mass has

$$r = \left(\tfrac12\right)^{1/p},$$

which converges to 1 as $p$ increases. Points uniformly distributed in these solids are mostly near the surface or boundary of the spaces.

Another interesting calculation concerns how large a sample must be in order for points generated from the uniform distribution on the ball (or cube) in an iid fashion to tend to "pile up" anywhere. Consider the problem of describing the distance from the origin to the closest of $N$ iid points uniform in the unit ball. With $R$ the distance from the origin to a single random point, $R$ has cdf

$$F_R(r) = \begin{cases} 0 & r < 0 \\ r^p & 0 \le r \le 1 \\ 1 & r > 1. \end{cases}$$

So if $R_1, R_2, \ldots, R_N$ are iid with this distribution, $M = \min(R_1, R_2, \ldots, R_N)$ has cdf

$$F_M(m) = \begin{cases} 0 & m < 0 \\ 1 - (1 - m^p)^N & 0 \le m \le 1 \\ 1 & m > 1. \end{cases}$$

This distribution has, for example, median

$$\left(1 - \left(\tfrac12\right)^{1/N}\right)^{1/p}.$$

For, say, $p = 100$ and $N = 10^6$, the median of the distribution of $M$ is about .87. In addition to these kinds of considerations of sparsity, there is the fact that the potential complexity of functions of $p$ variables explodes exponentially in $p$.

1.4 Variance-Bias Considerations in Prediction

Just as in model parameter estimation, the "variance-bias trade-off" notion is important in a prediction context. Here we consider a decomposition of what HTF in their Section 2.9 call the "test/generalization error" that can help us think about that trade-off.

In what follows, assume that $(X_1, Y_1), (X_2, Y_2), \ldots, (X_N, Y_N)$ are iid $P$ and independent of $(X, Y)$ that is also $P$-distributed. Write $E_N$ for expectation with respect to the joint distribution of the training data, and $E[\cdot \mid X = x]$ and $\mathrm{Var}[Y \mid X = x]$ for conditional expectation and conditional variance based on the conditional distribution of $Y \mid X = x$. For a fixed $x$, suppose that $\hat{f}(x)$ is a function of $(X_1, Y_1), (X_2, Y_2), \ldots, (X_N, Y_N)$. A measure of the effectiveness of the prediction $\hat{f}(x)$ for $Y$ at $X = x$ is

$$E_N E\left[\left(\hat{f}(x) - Y\right)^2 \,\middle|\, X = x\right] = E_N\left[\left(\hat{f}(x) - E[Y \mid X = x]\right)^2\right] + E\left[\left(Y - E[Y \mid X = x]\right)^2 \,\middle|\, X = x\right]$$
$$= \mathrm{Var}_N\,\hat{f}(x) + \left(E_N\hat{f}(x) - E[Y \mid X = x]\right)^2 + \mathrm{Var}[Y \mid X = x].$$

The first quantity, $\mathrm{Var}_N\,\hat{f}(x)$, is the variance of the prediction at $x$. The second term, $\left(E_N\hat{f}(x) - E[Y \mid X = x]\right)^2$, is a kind of (squared) "bias" of the prediction at $x$. And $\mathrm{Var}[Y \mid X = x]$ is an unavoidable variance in output at $X = x$.

1.5 Some Fancier Ways One Might Approximate a Conditional Mean

There are many important tools for producing prediction rules/approximating conditional means. HTF sketch a number of them at the end of their Chapter 2. In some ways, the methods listed fit between the extremes (in terms of flexibility/complexity of a prediction rule) of least squares fitting of a linear form and nearest neighbor rules.

1.5.1 Non-Linear Regression Methods / Least Squares Fitting

One basic form of prediction rule is

$$\hat{Y}(x) = \sum_{m=1}^M \hat{\beta}_m h_m(x)$$

for some set of "basis" functions $h_m$ on $\Re^p$. These basis functions $h_m$ might be

1. powers and products of powers of coordinates of $x$ (the result is polynomial regression),
2. sines and cosines of known multiples of coordinates of $x$ (the result is Fourier analysis),
3. of the form $1/(1 + \exp(\alpha_m + \beta_m'x))$, where the $\alpha_m$ and $\beta_m$ are unknown parameters that are included in the fitting (the result is a kind of "neural network," and fitting the parameters is a non-linear regression problem),
4. "spline" functions: for example, in $p = 1$ dimension with "knots" $t_1 < t_2 < \cdots < t_{M-2}$ one might have $h_m(x) = (x - t_m)_+$ for $m = 1, 2, \ldots, M-2$, $h_{M-1}(x) = x$, and $h_M(x) = 1$, so that predictions are piecewise linear in $x$ (products of such things, with arguments the coordinates of $x \in \Re^p$, can be used for $p > 1$; where knots are fixed, fitting is a linear regression problem, and if choosing knots is part of the fitting process, this is a nonlinear regression problem), or
5. "radial basis functions," radially symmetric functions like $h_m(x) = \exp(-\|x - \mu_m\|^2/\lambda_m)$ for some parameters $\lambda_m$ and $\mu_m$ (where these are fixed, fitting is a linear regression problem; if choosing the $\lambda_m$ and $\mu_m$ is part of the fitting process, this is a nonlinear regression problem — see the sketch following this list).
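The following short Python sketch (my own illustration, with made-up data and fixed, arbitrarily chosen centers and scale) shows the point of item 5: with the $\mu_m$ and $\lambda_m$ held fixed, fitting a radial-basis expansion is just an ordinary linear least squares problem in the coefficients $\beta_m$.

```python
# Sketch of least squares fitting of a fixed radial-basis expansion (item 5 above).
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=200)

centers = np.linspace(0, 1, 10)                    # fixed mu_m
lam = 0.05                                         # fixed common lambda_m
H = np.exp(-((x[:, None] - centers[None, :]) ** 2) / lam)   # N x M basis matrix h_m(x_i)
H = np.column_stack([np.ones_like(x), H])          # include a constant basis function

beta = np.linalg.lstsq(H, y, rcond=None)[0]        # linear least squares in beta
fitted = H @ beta                                  # y_hat(x_i) = sum_m beta_m h_m(x_i)
print(np.round(beta, 2))
```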
1.5.2 Methods Based on Local Smoothing/Averaging

One might try choosing a function $\hat{f}$ by balancing some measure of "fit" against a penalty associated with how "wiggly"/bizarre the function is. For $p = 1$ it is possible to solve, for $\lambda > 0$, the optimization problem

$$\hat{f}_\lambda = \arg\min_{\text{functions } h \text{ with 2 derivatives}}\left[\sum_{i=1}^N (y_i - h(x_i))^2 + \lambda\int (h''(x))^2\,dx\right],$$

and the optimizing function is a so-called "cubic smoothing spline." The integral above might be called $J(h)$, and the larger is $\lambda$, the less wiggly is $\hat{f}_\lambda$.

For $p > 1$ one might try considering fitting functions of the form $h(x) = \sum_{j=1}^p h_j(x_j)$, using a penalty like $\sum_{j=1}^p J(h_j)$, and hope to do something generalizing the $p = 1$ method.

Similarly, one might consider a "projection pursuit regression" model that fits

$$\hat{y}(x) = \sum_m g_m(\omega_m'x),$$

where the $\omega_m$ are unknown parameters and the real-valued functions of a single real variable $g_m$ are to be chosen as well, presumably using some penalty function idea and an iterative scheme that alternately chooses $\omega_m$'s and $g_m$'s.

Another possibility is the local use of weighted regression ideas. For some kernel function $K_\lambda(x_0, x)$, like for example

$$K_\lambda(x_0, x) = \frac{1}{\lambda}\exp\left(-\frac{1}{2\lambda}\|x_0 - x\|^2\right),$$

and some usually fairly simple parametric form like $h_\theta(x) = \theta_0$ or $h_\theta(x) = \theta_0 + \theta_1'x$, one might choose for each $x_0$

$$\hat{\theta}(x_0) = \arg\min_\theta \sum_{i=1}^N K_\lambda(x_0, x_i)\left(y_i - h_\theta(x_i)\right)^2$$

and then employ the prediction rule

$$\hat{f}(x_0) = h_{\hat{\theta}(x_0)}(x_0).$$

This produces smoothers like the "loess" smoother.

1.6 Complexity, Fit, and Choice of Tuning Parameters

The methods presented at the end of Chapter 2 all have complexity parameters (the number of basis functions employed, the penalty weight, and the bandwidth/neighborhood size parameter) that must be chosen by a user. If a choice of complexity parameter doesn't allow enough flexibility in the form of a predictor, under-fit occurs and there is large bias in prediction. On the other hand, if the choice allows too much flexibility, bias may be reduced, but the price paid is large variance and over-fit. It is a theme that runs through this material that complexity must be chosen in a way that balances variance and bias for the particular combination of $N$ and $p$ and general circumstance one faces. Specifics of methods for doing this are presented later in the book.
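As a small illustration of the locally weighted regression idea of Section 1.5.2 (and of a bandwidth playing exactly the complexity-parameter role just discussed), here is a loess-flavored Python sketch; the Gaussian kernel and the particular bandwidth are my own illustrative choices, not prescriptions from the outline.

```python
# A minimal sketch of locally weighted linear regression: at each x0, solve a weighted
# least squares problem with kernel weights K_lambda(x0, x_i) and report h_thetahat(x0).
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 150))
y = np.cos(3 * x) + rng.normal(scale=0.15, size=150)

def local_linear(x0, bandwidth=0.1):
    w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)     # kernel weights K(x0, x_i)
    B = np.column_stack([np.ones_like(x), x - x0])      # local linear basis at x0
    W = np.diag(w)
    theta = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)   # weighted least squares
    return theta[0]                                     # fitted value at x0

smoothed = np.array([local_linear(x0) for x0 in x])
print(np.round(smoothed[:5], 3))
```

A smaller bandwidth gives a wigglier (lower bias, higher variance) fit; a larger one gives a smoother (higher bias, lower variance) fit.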
2 Linear Methods (Ch. 3 of HTF)

There is more to say about the development of a linear predictor $\hat{Y}(x) = \hat{\beta}'x$ for an appropriate $\hat{\beta} \in \Re^p$ than what is said in Stat 500 and Stat 511, where least squares is used to fit the linear form to all $p$ input variables or to some subset of $M$ of them. Some of that involves more linear algebra background than is assumed or used in Stat 511. We begin with some linear theory that is useful both here and later, and then go on to consider other fitting methods besides ordinary least squares.

Consider again the $N \times p$ matrix of training data inputs $\mathbf{X}$ (with $i$th row $x_i'$) and the corresponding vector of training data outputs $Y = (y_1, y_2, \ldots, y_N)'$. It is standard Stat 511 fare that least squares projects $Y$ onto $C(\mathbf{X})$, the column space of $\mathbf{X}$, in order to produce the vector of fitted values $\hat{Y} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_N)'$.

2.1 The Gram-Schmidt Process and the QR Decomposition

For many purposes it would be convenient if the columns of the matrix $\mathbf{X}$ were orthogonal. In fact, it would be useful to replace the $N \times p$ matrix $\mathbf{X}$ with an $N \times p$ matrix $Z$ with orthogonal columns, having the property that for each $l$, if $\mathbf{X}_l$ and $Z_l$ are $N \times l$ consisting of the first $l$ columns of (respectively) $\mathbf{X}$ and $Z$, then $C(Z_l) = C(\mathbf{X}_l)$. Such a matrix can in fact be constructed using the so-called Gram-Schmidt process. This process generalizes beyond the present application (to $\Re^N$) to general Hilbert spaces, and in recognition of that important fact we'll adopt standard Hilbert space notation for the inner product in $\Re^N$, namely

$$\langle u, v\rangle \equiv \sum_{i=1}^N u_i v_i = u'v,$$

and describe the process in these terms.

In what follows, vectors are $N$-vectors, and in particular $x_j$ is the $j$th column of $\mathbf{X}$. Notice that this is in potential conflict with earlier notation that made $x_i$ the $p$-vector of inputs for the $i$th case in the training data. We will simply have to read the following in context and keep in mind the local convention.

With this notation, the Gram-Schmidt process proceeds as follows.

1. Set the first column of $Z$ to be $z_1 = x_1$.
2. Having constructed $Z_{l-1} = (z_1, z_2, \ldots, z_{l-1})$, let

$$z_l = x_l - \sum_{j=1}^{l-1}\frac{\langle x_l, z_j\rangle}{\langle z_j, z_j\rangle}z_j.$$

It is easy enough to see that $\langle z_l, z_j\rangle = 0$ for all $j < l$ (building up the orthogonality of $z_1, z_2, \ldots, z_l$ by induction), since

$$\langle z_l, z_j\rangle = \langle x_l, z_j\rangle - \langle x_l, z_j\rangle = 0,$$

as at most one term of the sum in step 2 above contributes a non-zero inner product with $z_j$. Further, assume that $C(Z_{l-1}) = C(\mathbf{X}_{l-1})$; then $z_l \in C(\mathbf{X}_l)$, and so $C(Z_l) \subset C(\mathbf{X}_l)$. And since any element of $C(\mathbf{X}_l)$ can be written as a linear combination of an element of $C(\mathbf{X}_{l-1}) = C(Z_{l-1})$ and $x_l$, we also have $C(\mathbf{X}_l) \subset C(Z_l)$. Thus, as advertised, $C(Z_l) = C(\mathbf{X}_l)$.

Since the $z_j$ are perpendicular, for any $N$-vector $w$,

$$\sum_{j=1}^{l}\frac{\langle w, z_j\rangle}{\langle z_j, z_j\rangle}z_j$$

is the projection of $w$ onto $C(Z_l) = C(\mathbf{X}_l)$. (To see this, consider minimization of the quantity $\langle w - \sum_j c_j z_j,\ w - \sum_j c_j z_j\rangle$ by choice of the constants $c_j$.) In particular, the sum on the right side of the equality in step 2 of the Gram-Schmidt process is the projection of $x_l$ onto $C(\mathbf{X}_{l-1})$. And the projection of $Y$ onto $C(\mathbf{X})$ is

$$\hat{Y} = \sum_{j=1}^{p}\frac{\langle Y, z_j\rangle}{\langle z_j, z_j\rangle}z_j.$$

This means that for a full $p$-variable regression, $\langle Y, z_p\rangle/\langle z_p, z_p\rangle$ is the regression coefficient for $z_p$ — and, since only $z_p$ involves it, for the last variable in $\mathbf{X}$, $x_p$. So, in constructing a vector of fitted values, fitted regression coefficients in multiple regression can be interpreted as weights to be applied to that part of the input vector that remains after projecting onto the space spanned by all the others.

The construction of the orthogonal variables $z_j$ can be represented in matrix form as

$$\underset{N\times p}{\mathbf{X}} = \underset{N\times p}{Z}\ \underset{p\times p}{\Gamma},$$

where $\Gamma$ is upper triangular with $\gamma_{kj}$, the value in the $k$th row and $j$th column of $\Gamma$, equal to $\langle z_k, x_j\rangle/\langle z_k, z_k\rangle$. Defining

$$D = \mathrm{diag}\left(\|z_1\|, \|z_2\|, \ldots, \|z_p\|\right)$$

and letting

$$Q = ZD^{-1} \quad \text{and} \quad R = D\Gamma,$$

one may write

$$\mathbf{X} = QR, \qquad (3)$$

the so-called QR decomposition of $\mathbf{X}$. In (3), $Q$ is $N \times p$ with

$$Q'Q = D^{-1}Z'ZD^{-1} = I.$$

That is, $Q$ has for columns perpendicular unit vectors that form a basis for $C(\mathbf{X})$. $R$ is upper triangular, and that says that only the first $l$ of these unit vectors are needed to create $x_l$. The decomposition is computationally useful in that, for $q_j$ the $j$th column of $Q$, the projection of $Y$ onto $C(\mathbf{X})$ is (since the $q_j$ comprise an orthonormal basis for $C(\mathbf{X})$)

$$\hat{Y} = \sum_{j=1}^{p}\langle Y, q_j\rangle q_j = QQ'Y \qquad (4)$$

and

$$\hat{\beta}^{\mathrm{OLS}} = R^{-1}Q'Y.$$

The fact that $R$ is upper triangular implies that there are efficient ways to compute its inverse.
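A short Python sketch of the construction just described (an illustrative verification, assuming a full-rank $\mathbf{X}$ with an intercept column; the random data are made up) builds $Z$ and $\Gamma$, forms $Q$ and $R$, and checks $\hat{\beta}^{\mathrm{OLS}} = R^{-1}Q'Y$ against a library least squares fit.

```python
# Gram-Schmidt construction of Z and Gamma, the resulting QR objects, and beta = R^{-1} Q'Y.
import numpy as np

rng = np.random.default_rng(4)
N, p = 30, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])
Y = rng.normal(size=N)

Z = np.zeros_like(X)
Gamma = np.eye(p)                      # upper triangular, gamma_jl = <z_j, x_l>/<z_j, z_j>
for l in range(p):
    Z[:, l] = X[:, l]
    for j in range(l):
        Gamma[j, l] = Z[:, j] @ X[:, l] / (Z[:, j] @ Z[:, j])
        Z[:, l] -= Gamma[j, l] * Z[:, j]

D = np.diag(np.linalg.norm(Z, axis=0))
Q, R = Z @ np.linalg.inv(D), D @ Gamma              # X = Z Gamma = Q R
print(np.allclose(X, Q @ R))                        # the QR decomposition reproduces X
beta_ols = np.linalg.solve(R, Q.T @ Y)              # beta_OLS = R^{-1} Q'Y
print(np.allclose(beta_ols, np.linalg.lstsq(X, Y, rcond=None)[0]))
```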
2.2 The Singular Value Decomposition of a Full Rank X

The matrix $\mathbf{X}$ has a so-called singular value decomposition as

$$\underset{N\times p}{\mathbf{X}} = \underset{N\times p}{U}\ \underset{p\times p}{D}\ \underset{p\times p}{V'},$$

where $U$ has orthonormal columns spanning $C(\mathbf{X})$, $V$ has orthonormal columns spanning $C(\mathbf{X}')$ (the row space of $\mathbf{X}$), and $D = \mathrm{diag}(d_1, d_2, \ldots, d_p)$ for

$$d_1 \ge d_2 \ge \cdots \ge d_p > 0.$$

The $d_j$ are the "singular values" of $\mathbf{X}$.

An interesting property of the singular value decomposition is this: if $U_l$, $V_l$ and $D_l$ are matrices consisting of the first $l$ columns of $U$, $V$ and $D$ respectively, then $\mathbf{X}_l \equiv U_lD_lV_l'$ is the best (in the sense of squared distance from $\mathbf{X}$ in $\Re^{N\times p}$) rank-$l$ approximation to $\mathbf{X}$.

Since the columns of $U$ are an orthonormal basis for $C(\mathbf{X})$, the projection of $Y$ onto $C(\mathbf{X})$ is

$$\hat{Y}^{\mathrm{OLS}} = \sum_{j=1}^{p}\langle Y, u_j\rangle u_j = UU'Y. \qquad (5)$$

This is of course completely parallel to (4) and is a consequence of the fact that the columns of both $U$ and $Q$ (from the QR decomposition of $\mathbf{X}$) form orthonormal bases for $C(\mathbf{X})$. In general the two bases are not the same.

Now, using the singular value decomposition of $\mathbf{X}$,

$$\mathbf{X}'\mathbf{X} = VDU'UDV' = VD^2V',$$

which is the eigen decomposition of the symmetric and positive definite $\mathbf{X}'\mathbf{X}$. The eigenvectors $v_1, v_2, \ldots, v_p$ (the columns of $V$) are called the "principal components directions" of $\mathbf{X}$ in $\Re^p$. The vector $z_1 \equiv \mathbf{X}v_1$ is the product $\mathbf{X}w$ with the largest sample variance of entries (or largest squared length in $\Re^N$) subject to the constraint that $\|w\| = 1$. This sample variance is

$$\frac{1}{N}z_1'z_1 = \frac{1}{N}v_1'\mathbf{X}'\mathbf{X}v_1 = \frac{1}{N}v_1'VD^2V'v_1 = \frac{d_1^2}{N}.$$

A second representation of $z_1$ is

$$z_1 = \mathbf{X}v_1 = UDV'v_1 = UD(1, 0, \ldots, 0)' = d_1u_1,$$

and this vector is called the "first principal component" of $\mathbf{X}$ in $\Re^N$. In general, the $j$th principal component of $\mathbf{X}$ is

$$z_j = \mathbf{X}v_j = d_ju_j,$$

which is the vector of the form $\mathbf{X}w$ with the largest sample variance of entries subject to the constraints that $\|w\| = 1$ and $\langle z_j, z_l\rangle = 0$ for all $l < j$. This sample variance is $z_j'z_j/N = d_j^2/N$.

Notice that when $\mathbf{X}$ has standardized columns (i.e., each column $x_j$ of $\mathbf{X}$ has $\langle 1, x_j\rangle = 0$ and $\langle x_j, x_j\rangle = N$), the $p \times p$ matrix

$$\frac{1}{N}\mathbf{X}'\mathbf{X}$$

is the sample covariance (and correlation) matrix for the $p$ input variables $X_1, X_2, \ldots, X_p$.

2.3 Ridge Regression and the Lasso, and Some Generalizations Thereof

An alternative to seeking to find a suitable level of complexity in a linear prediction rule through subset selection (and least squares fitting of a linear form to the selected variables) is to employ a shrinkage method, based on a penalized version of least squares, to choose a vector $\hat{\beta} \in \Re^p$ to employ in a linear prediction rule. Here we consider several such methods. The implementation of these methods is not equivariant to the scaling used to express the input variables $x_j$. So that we can talk about properties of the methods that are associated with a well-defined scaling, we assume here that the output variable has been centered (i.e., that $\langle Y, 1\rangle = 0$) and that the columns of $\mathbf{X}$ have been standardized (and, if originally $\mathbf{X}$ had a constant column, it has been removed).

2.3.1 Ridge Regression

For a $\lambda > 0$, the ridge regression coefficient vector $\hat{\beta}^{\mathrm{ridge}} \in \Re^p$ is

$$\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta\in\Re^p}\left[\sum_{i=1}^N\left(y_i - \sum_{j=1}^p\beta_jx_{ij}\right)^2 + \lambda\sum_{j=1}^p\beta_j^2\right]. \qquad (6)$$

Here $\lambda$ is a penalty/complexity parameter that controls how much $\hat{\beta}^{\mathrm{OLS}}$ is shrunken towards 0. The unconstrained minimization problem expressed in (6) has an equivalent constrained minimization description as

$$\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta \text{ with } \sum_{j=1}^p\beta_j^2 \le t}\ \sum_{i=1}^N\left(y_i - \sum_{j=1}^p\beta_jx_{ij}\right)^2 \qquad (7)$$

for an appropriate $t > 0$. Corresponding to $\lambda$ used in (6) is $t = \sum_j\left(\hat{\beta}_j^{\mathrm{ridge}}\right)^2$ for use in (7). Conversely, corresponding to $t$ used in (7), one may use a value of $\lambda$ in (6) producing the same error sum of squares.

The unconstrained form (6) calls upon one to minimize

$$(Y - \mathbf{X}\beta)'(Y - \mathbf{X}\beta) + \lambda\beta'\beta,$$

and some vector calculus leads directly to

$$\hat{\beta}^{\mathrm{ridge}} = (\mathbf{X}'\mathbf{X} + \lambda I)^{-1}\mathbf{X}'Y.$$

So then, using the singular value decomposition of $\mathbf{X}$,

$$\hat{Y}^{\mathrm{ridge}} = \mathbf{X}\hat{\beta}^{\mathrm{ridge}} = UDV'(VD^2V' + \lambda I)^{-1}VDU'Y = UD(D^2 + \lambda I)^{-1}DU'Y = \sum_{j=1}^{p}\frac{d_j^2}{d_j^2 + \lambda}\langle Y, u_j\rangle u_j. \qquad (8)$$

Comparing to (5), and recognizing that

$$\frac{d_j^2}{d_j^2 + \lambda} < 1 \quad\text{(decreasing in } \lambda \text{ and increasing in } d_j\text{)},$$

we see that the coefficients of the orthonormal basis vectors $u_j$ employed to get $\hat{Y}^{\mathrm{ridge}}$ are shrunken versions of the coefficients applied to get $\hat{Y}^{\mathrm{OLS}}$. The most severe shrinking is enforced in the directions of the smallest principal components of $\mathbf{X}$, the $u_j$ least important in making up low rank approximations to $\mathbf{X}$.

The function

$$df(\lambda) \equiv \mathrm{tr}\left(\mathbf{X}(\mathbf{X}'\mathbf{X} + \lambda I)^{-1}\mathbf{X}'\right) = \mathrm{tr}\left(UD(D^2 + \lambda I)^{-1}DU'\right) = \sum_{j=1}^{p}\frac{d_j^2}{d_j^2 + \lambda}$$

is called the effective degrees of freedom associated with the ridge regression. Note that if $\lambda = 0$, ridge regression is ordinary least squares and this is $p$, the usual degrees of freedom associated with projection onto $C(\mathbf{X})$, i.e., the trace of the projection matrix onto this column space. As $\lambda \to \infty$, this goes to 0.
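The following Python sketch (illustrative; random standardized data) verifies display (8) numerically: the ridge fitted values shrink the principal-component coordinates of $Y$ by $d_j^2/(d_j^2 + \lambda)$, and $df(\lambda) = \sum_j d_j^2/(d_j^2 + \lambda)$.

```python
# Ridge regression via the SVD: shrinkage factors and effective degrees of freedom.
import numpy as np

rng = np.random.default_rng(5)
N, p, lam = 100, 5, 3.0
X = rng.normal(size=(N, p))
X = (X - X.mean(0)) / X.std(0)                     # standardized columns
Y = rng.normal(size=N)
Y = Y - Y.mean()                                   # centered output

U, d, Vt = np.linalg.svd(X, full_matrices=False)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
yhat_direct = X @ beta_ridge
yhat_svd = U @ ((d ** 2 / (d ** 2 + lam)) * (U.T @ Y))   # shrunken coordinates of Y
print(np.allclose(yhat_direct, yhat_svd))                # display (8)
print((d ** 2 / (d ** 2 + lam)).sum())                   # effective degrees of freedom
```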
2.3.2 The Lasso and Generalizations

The "lasso" and the other relatives of ridge regression are the result of generalizing the optimization criteria (6) and (7) by replacing $\sum_{j=1}^p\beta_j^2$ with $\sum_{j=1}^p|\beta_j|^q$ for a $q > 0$. That produces

$$\hat{\beta}^{q} = \arg\min_{\beta\in\Re^p}\left[\sum_{i=1}^N\left(y_i - \sum_{j=1}^p\beta_jx_{ij}\right)^2 + \lambda\sum_{j=1}^p|\beta_j|^q\right] \qquad (9)$$

generalizing (6), and

$$\hat{\beta}^{q} = \arg\min_{\beta \text{ with } \sum_{j=1}^p|\beta_j|^q \le t}\ \sum_{i=1}^N\left(y_i - \sum_{j=1}^p\beta_jx_{ij}\right)^2 \qquad (10)$$

generalizing (7). The so-called "lasso" is the $q = 1$ case of (9) or (10). That is, for $t > 0$,

$$\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta \text{ with } \sum_{j=1}^p|\beta_j| \le t}\ \sum_{i=1}^N\left(y_i - \sum_{j=1}^p\beta_jx_{ij}\right)^2. \qquad (11)$$

Because of the shape of the constraint region

$$\left\{\beta\in\Re^p \,\middle|\, \sum_{j=1}^p|\beta_j| \le t\right\},$$

in particular its sharp "corners" at the coordinate axes, some coordinates of $\hat{\beta}^{\mathrm{lasso}}$ are often 0, and the lasso automatically provides simultaneous shrinking of $\hat{\beta}^{\mathrm{OLS}}$ toward 0 and rational subset selection. The same is true of cases of (10) with $q < 1$.

In their Table 3.4, HTF provide explicit formulas for fitted coefficients for the case of $\mathbf{X}$ with orthonormal columns. These translate, in the present case of $\mathbf{X}$ with standardized and orthogonal columns, to the following (the exact shrinkage/threshold constants depend on the scaling convention adopted; with $\langle x_j, x_j\rangle = N$ and the penalized forms (6) and (9), they are as shown):

  Method of Fitting       Fitted Coefficient for $x_j$
  OLS                     $\hat{\beta}_j^{\mathrm{OLS}}$
  Best Subset of Size M   $\hat{\beta}_j^{\mathrm{OLS}}\cdot I\left[|\hat{\beta}_j^{\mathrm{OLS}}| \text{ is among the } M \text{ largest}\right]$
  Ridge Regression        $\hat{\beta}_j^{\mathrm{OLS}}/(1 + \lambda/N)$
  Lasso                   $\mathrm{sign}\left(\hat{\beta}_j^{\mathrm{OLS}}\right)\left(|\hat{\beta}_j^{\mathrm{OLS}}| - \lambda/2N\right)_+$

Best subset regression provides a kind of "hard thresholding" of the least squares coefficients, ridge regression provides shrinking of all coefficients toward 0, and the lasso provides a kind of "soft thresholding" of the coefficients.

Chapter 3 of the 2nd Edition of HTF provides some modifications of the ridge/lasso idea. One is the "elastic net" idea, a kind of compromise between the ridge and lasso. For an $\alpha \in (0, 1)$ and some $t > 0$, this is defined by

$$\hat{\beta}^{\mathrm{elastic\ net}} = \arg\min_{\beta \text{ with } \sum_{j=1}^p\left(\alpha\beta_j^2 + (1-\alpha)|\beta_j|\right) \le t}\ \sum_{i=1}^N\left(y_i - \sum_{j=1}^p\beta_jx_{ij}\right)^2.$$

The constraint is some kind of compromise between the ridge and lasso constraints. The constraint regions have "corners" like the lasso regions, but are more rounded than the lasso regions.

2.4 Two Methods With Derived Input Variables

Another possible approach to the problem of finding an appropriate level of complexity in a fitted linear prediction rule is to consider regression on some number $M < p$ of predictors derived from the original inputs $x_j$. Two such methods discussed by HTF are the methods of Principal Components Regression and Partial Least Squares. Here we continue to assume that the columns of $\mathbf{X}$ have been standardized and $Y$ has been centered.

2.4.1 Principal Components Regression

The idea here is to replace the $p$ columns of predictors in $\mathbf{X}$ with the first few ($M$) principal components of $\mathbf{X}$ from the singular value decomposition of $\mathbf{X}$,

$$z_j = \mathbf{X}v_j = d_ju_j.$$

Correspondingly, the vector of fitted predictions for the training data is

$$\hat{Y}^{\mathrm{PCR}} = \sum_{j=1}^{M}\frac{\langle Y, z_j\rangle}{\langle z_j, z_j\rangle}z_j = \sum_{j=1}^{M}\langle Y, u_j\rangle u_j. \qquad (12)$$

Comparing this to (5) and (8), we see that ridge regression shrinks the coefficients of the principal components $u_j$ according to their importance in making up $\mathbf{X}$, while principal components regression "zeros out" those least important in making up $\mathbf{X}$.

Notice too that $\hat{Y}^{\mathrm{PCR}}$ can be written in terms of the original inputs as

$$\hat{Y}^{\mathrm{PCR}} = \sum_{j=1}^{M}\langle Y, u_j\rangle\frac{1}{d_j}\mathbf{X}v_j,$$

so that

$$\hat{\beta}^{\mathrm{PCR}} = \sum_{j=1}^{M}\langle Y, u_j\rangle\frac{1}{d_j}v_j \qquad (13)$$

(where obviously the sum has only $M$ terms; the $M = p$ version of (13) is $\hat{\beta}^{\mathrm{OLS}}$). As the $v_j$ are orthonormal, it is therefore clear that

$$\left\|\hat{\beta}^{\mathrm{PCR}}\right\| \le \left\|\hat{\beta}^{\mathrm{OLS}}\right\|.$$

So principal components regression shrinks both $\hat{Y}^{\mathrm{OLS}}$ toward 0 in $\Re^N$ and $\hat{\beta}^{\mathrm{OLS}}$ toward 0 in $\Re^p$.
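A short Python sketch of displays (12) and (13) (illustrative only, on made-up standardized data): regress $Y$ on the first $M$ principal components and recover the corresponding coefficient vector on the original inputs.

```python
# Principal components regression via the SVD of the standardized training matrix.
import numpy as np

rng = np.random.default_rng(6)
N, p, M = 100, 6, 2
X = rng.normal(size=(N, p)); X = (X - X.mean(0)) / X.std(0)
Y = rng.normal(size=N);      Y = Y - Y.mean()

U, d, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T[:, :M]                               # first M principal components z_j = X v_j
coef = (Y @ Z) / (d[:M] ** 2)                     # <Y, z_j>/<z_j, z_j>, since <z_j, z_j> = d_j^2
yhat_pcr = Z @ coef                               # display (12)
beta_pcr = Vt.T[:, :M] @ coef                     # display (13), back on the original inputs
print(np.allclose(yhat_pcr, X @ beta_pcr))
```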
2.4.2 Partial Least Squares Regression

The shrinking methods mentioned thus far have taken no account of $Y$ in determining directions or amounts of shrinkage. Partial least squares specifically employs $Y$. In what follows, we continue to suppose that the columns of $\mathbf{X}$ have been standardized and that $Y$ has been centered.

The logic of partial least squares is this. Suppose that $z_1 = \mathbf{X}\varphi_1$ is the linear combination of columns of $\mathbf{X}$ maximizing

$$\left|\langle Y, \mathbf{X}w\rangle\right| \qquad (14)$$

(which is essentially the sample covariance between the variables $Y$ and $\mathbf{X}w$), subject to the constraint that $\|w\| = 1$. Then for $j > 1$, let $z_j = \mathbf{X}\varphi_j$ be the linear combination of columns of $\mathbf{X}$ maximizing

$$\left|\langle Y, \mathbf{X}w\rangle\right| \qquad (15)$$

subject to the constraints that $\|w\| = 1$ and $\langle z_j, z_l\rangle = 0$ for all $l < j$. (Note that upon replacing $|\langle Y, \mathbf{X}w\rangle|$ with $|\langle\mathbf{X}w, \mathbf{X}w\rangle|$ one gets the principal components of $\mathbf{X}$ in place of the PLS predictors; principal components regression uses the first $M$ of those variables $z_j$ as input variables.)

HTF are not terribly clear about an algorithm for accomplishing the computation of the PLS predictors. I believe that the PLS algorithm is as follows. Take

$$z_1 = \sum_{j=1}^p\langle Y, x_j\rangle x_j$$

and define $\mathbf{X}^{(0)} = \mathbf{X}$. For $l \ge 1$, define $\mathbf{X}^{(l)}$ by orthogonalizing the columns of $\mathbf{X}^{(l-1)}$ with respect to $z_l$. That is, define the $j$th column of $\mathbf{X}^{(l)}$ by

$$x_j^{(l)} = x_j^{(l-1)} - \frac{\langle x_j^{(l-1)}, z_l\rangle}{\langle z_l, z_l\rangle}z_l.$$

Then take

$$z_{l+1} = \sum_{j=1}^p\langle Y, x_j^{(l)}\rangle x_j^{(l)}.$$

So constructed, the PLS predictors $z_j$ are orthogonal. Using the first $M$ of these as regressors, one has the vector of fitted output values

$$\hat{Y}^{\mathrm{PLS}} = \sum_{j=1}^{M}\frac{\langle Y, z_j\rangle}{\langle z_j, z_j\rangle}z_j.$$

Since the PLS predictors are (albeit recursively-computed) data-dependent linear combinations of columns of $\mathbf{X}$, it is possible to find a $p$-vector $\hat{\beta}_M^{\mathrm{PLS}}$ such that

$$\hat{Y}^{\mathrm{PLS}} = \mathbf{X}\hat{\beta}_M^{\mathrm{PLS}}$$

and thus produce a corresponding prediction rule

$$\hat{Y}(x) = x'\hat{\beta}_M^{\mathrm{PLS}}. \qquad (16)$$

It seems like in (16) the number of components $M$ should function as a complexity parameter. But then again, there is the following. When the $x_j$ are orthogonal, it's fairly easy to see that $z_1$ is a multiple of $\hat{Y}^{\mathrm{OLS}} = \mathbf{X}\hat{\beta}^{\mathrm{OLS}}$, and all steps of partial least squares after the first are simply providing a basis for the orthogonal complement of the 1-dimensional subspace of $C(\mathbf{X})$ generated by $\hat{Y}^{\mathrm{OLS}}$, without improving fitting at all. That is, here changing $M$ doesn't change flexibility of the fit at all. Presumably, when the $x_j$ are nearly orthogonal, something similar happens.
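The following Python sketch is a literal transcription of the recipe given above (which the author only offers as his belief about the PLS algorithm), not a reference implementation of partial least squares; the data are made up.

```python
# Sketch of the PLS construction described in Section 2.4.2: build successive z_l from Y
# and the progressively orthogonalized columns of X, then fit on the first M components.
import numpy as np

rng = np.random.default_rng(7)
N, p, M = 60, 4, 2
X = rng.normal(size=(N, p)); X = (X - X.mean(0)) / X.std(0)
Y = rng.normal(size=N);      Y = Y - Y.mean()

Xl = X.copy()
Z = []
for l in range(M):
    z = Xl @ (Xl.T @ Y)                       # z_{l+1} = sum_j <Y, x_j^(l)> x_j^(l)
    Z.append(z)
    Xl = Xl - np.outer(z, z @ Xl) / (z @ z)   # orthogonalize current columns w.r.t. z
Z = np.column_stack(Z)

# Fitted values using the first M (orthogonal) PLS components, as in the text.
yhat_pls = sum(((Y @ Z[:, j]) / (Z[:, j] @ Z[:, j])) * Z[:, j] for j in range(M))
print(np.round(yhat_pls[:5], 3))
```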
3 Linear and a Bit on Quadratic Methods of Classification (Ch. 4 of HTF)

Suppose now that an output variable $Y^*$ takes values in $\mathcal{G} = \{1, 2, \ldots, K\}$, or equivalently that $Y$ is a $K$-variate set of indicators $Y_k = I[Y^* = k]$. The object here is to consider methods of producing prediction/classification rules $\hat{Y}^*(x)$ taking values in $\mathcal{G}$, and mostly ones that have sets $\{x \in \Re^p \mid \hat{Y}^*(x) = k\}$ with boundaries that are defined (at least piece-wise) by linear equalities

$$a'x = c. \qquad (17)$$

The most obvious/naive potential method here — namely, to regress the $K$ indicator variables $Y_k$ onto $x$, producing least squares regression coefficient vectors $\hat{\beta}_k$, and then to employ

$$\hat{Y}^*(x) = \arg\max_k\ x'\hat{\beta}_k$$

— fails miserably because of the possibility of "masking" if $K > 2$. One must be smarter than this. Three kinds of smarter alternatives are Linear (and Quadratic) Discriminant Analysis, Logistic Regression, and direct searches for separating hyperplanes. The first two of these are "statistical" in origin, with long histories in the field.

3.1 Linear (and a bit on Quadratic) Discriminant Analysis

Suppose that for $(X, Y^*) \sim P$, $\pi_k = P[Y^* = k]$ and the conditional distribution of $X$ on $\Re^p$ given that $Y^* = k$ is $\mathrm{MVN}_p(\mu_k, \Sigma)$, i.e., the conditional pdf is

$$f_k(x) = (2\pi)^{-p/2}\det(\Sigma)^{-1/2}\exp\left(-\tfrac12(x - \mu_k)'\Sigma^{-1}(x - \mu_k)\right).$$

Then it follows that

$$\ln\left(\frac{P[Y^* = k \mid X = x]}{P[Y^* = K \mid X = x]}\right) = \ln\left(\frac{\pi_k}{\pi_K}\right) - \tfrac12\mu_k'\Sigma^{-1}\mu_k + \tfrac12\mu_K'\Sigma^{-1}\mu_K + x'\Sigma^{-1}(\mu_k - \mu_K), \qquad (18)$$

so that an optimal predictor/decision rule is

$$\hat{Y}^* = \arg\max_k\left[\ln\pi_k - \tfrac12\mu_k'\Sigma^{-1}\mu_k + x'\Sigma^{-1}\mu_k\right],$$

and boundaries between regions in $\Re^p$ where $\hat{Y}^*(x) = k$ and $\hat{Y}^*(x) = l$ are subsets of the sets

$$\left\{x\in\Re^p \,\middle|\, x'\Sigma^{-1}(\mu_k - \mu_l) = \ln\left(\frac{\pi_l}{\pi_k}\right) + \tfrac12\left(\mu_k'\Sigma^{-1}\mu_k - \mu_l'\Sigma^{-1}\mu_l\right)\right\},$$

i.e., are defined by equalities of the form (17).

This is dependent upon all $K$ conditional normal distributions having the same covariance matrix $\Sigma$. In the event these are allowed to vary (conditional distribution $k$ with covariance matrix $\Sigma_k$), an optimal predictor/decision rule is

$$\hat{Y}^*(x) = \arg\max_k\left[\ln\pi_k - \tfrac12\ln\det\Sigma_k - \tfrac12(x - \mu_k)'\Sigma_k^{-1}(x - \mu_k)\right],$$

and boundaries between regions in $\Re^p$ where $\hat{Y}^*(x) = k$ and $\hat{Y}^*(x) = l$ are subsets of the sets

$$\left\{x\in\Re^p \,\middle|\, (x - \mu_k)'\Sigma_k^{-1}(x - \mu_k) - (x - \mu_l)'\Sigma_l^{-1}(x - \mu_l) = 2\ln\left(\frac{\pi_k}{\pi_l}\right) - \ln\det\Sigma_k + \ln\det\Sigma_l\right\}.$$

Unless $\Sigma_k = \Sigma_l$, this kind of set is a quadratic surface in $\Re^p$, not a hyperplane. One gets not Linear but Quadratic Discriminant Analysis.

Of course, in order to use LDA or QDA one must estimate the vectors $\mu_k$ and the covariance matrix $\Sigma$ (or matrices $\Sigma_k$) from the training data. Estimating $K$ potentially different matrices $\Sigma_k$ requires estimation of a very large number of parameters. So, thinking about QDA versus LDA, one is again in the situation of needing to find the level of predictor complexity that a given data set will support. QDA is a more flexible/complex method than LDA, but using it in preference to LDA increases the likelihood of over-fit and poor prediction.

One idea that has been offered as a kind of continuous compromise between LDA and QDA is, for $\alpha\in[0,1]$, to use

$$\hat{\Sigma}_k(\alpha) = \alpha\hat{\Sigma}_k + (1-\alpha)\hat{\Sigma}_{\mathrm{pooled}}$$

in place of $\hat{\Sigma}_k$ in QDA. This kind of thinking even suggests, as an estimate of a covariance matrix common across $k$,

$$\hat{\Sigma}(\gamma) = \gamma\hat{\Sigma}_{\mathrm{pooled}} + (1-\gamma)\hat{\sigma}^2I$$

for $\gamma\in[0,1]$ and $\hat{\sigma}^2$ an estimate of variance pooled across groups $k$ and then across coordinates of $x$, for use in LDA. Combining these two ideas, one might even invent a two-parameter set of fitted covariance matrices

$$\hat{\Sigma}_k(\alpha,\gamma) = \alpha\hat{\Sigma}_k + (1-\alpha)\left(\gamma\hat{\Sigma}_{\mathrm{pooled}} + (1-\gamma)\hat{\sigma}^2I\right)$$

for use in QDA. Employing these in LDA or QDA provides the flexibility of choosing a complexity parameter (or parameters) and potentially improving prediction performance.

Returning specifically to LDA, let $\hat{\Sigma} = \hat{\Sigma}_{\mathrm{pooled}}$ and note that one is free to replace $x$ and all $K$ means $\hat{\mu}_k$ with, respectively,

$$x^* = \hat{\Sigma}^{-1/2}(x - \bar{x}) \quad\text{and}\quad \hat{\mu}_k^* = \hat{\Sigma}^{-1/2}(\hat{\mu}_k - \bar{x}).$$

This produces

$$\ln\left(\frac{P[Y^* = k \mid X = x]}{P[Y^* = l \mid X = x]}\right) = \ln\left(\frac{\pi_k}{\pi_l}\right) - \tfrac12\|x^* - \hat{\mu}_k^*\|^2 + \tfrac12\|x^* - \hat{\mu}_l^*\|^2,$$

and the optimal predictor/decision rule can be described as

$$\hat{Y}^* = \arg\max_k\left[\ln\pi_k - \tfrac12\|x^* - \hat{\mu}_k^*\|^2\right].$$

That is, in terms of $x^*$, optimal decisions are based on ordinary Euclidean distances to the transformed means $\hat{\mu}_k^*$.

The $\hat{\mu}_k^*$ typically span a subspace of $\Re^p$ of dimension $\min(p, K-1)$. For $M = (\hat{\mu}_1^*, \hat{\mu}_2^*, \ldots, \hat{\mu}_K^*)$ ($p\times K$), let $P_M$ be the $p\times p$ projection matrix projecting onto the column space of $M$ in $\Re^p$. Then

$$\|x^* - \hat{\mu}_k^*\|^2 = \|P_Mx^* - \hat{\mu}_k^*\|^2 + \|(I - P_M)x^*\|^2,$$

the last equality coming because $P_Mx^* - \hat{\mu}_k^* \in C(M)$ and $(I - P_M)x^* \in C(M)^\perp$. Since $\|(I - P_M)x^*\|^2$ doesn't depend upon $k$, the optimal predictor/decision rule can be described as

$$\hat{Y}^* = \arg\max_k\left[\ln\pi_k - \tfrac12\|P_Mx^* - \hat{\mu}_k^*\|^2\right],$$

and optimal decision rules can be described in terms of the projection of $x^*$ onto $C(M)$ and its distances to the $\hat{\mu}_k^*$.

Now, $\frac{1}{K}MM'$ is the (typically rank $K-1$) sample covariance matrix of the $\hat{\mu}_k^*$ and has an eigen decomposition as

$$\frac{1}{K}MM' = V^*D^*V^{*\prime}, \quad D^* = \mathrm{diag}(d_1^*, d_2^*, \ldots, d_p^*),$$

where $d_1^* \ge d_2^* \ge \cdots \ge d_p^* \ge 0$ are the eigenvalues and the columns $v_k^*$ of $V^*$ are orthonormal eigenvectors corresponding, in order, to the successively smaller eigenvalues of $\frac{1}{K}MM'$. The $v_k^*$ with $d_k^* > 0$ specify linear combinations of the coordinates of the $\hat{\mu}_k^*$ with the largest possible sample variances, subject to the constraints that $\|v_k^*\| = 1$ and $\langle v_k^*, v_l^*\rangle = 0$ for all $l < k$. These $v_k^*$ are unit vectors in directions of most important unaccounted-for spread of the $\hat{\mu}_k^*$.

This suggests the possibility of "reduced rank LDA." That is, define $V_l^* = (v_1^*, v_2^*, \ldots, v_l^*)$ and let $P_l = V_l^*V_l^{*\prime}$ be the matrix projecting onto $C(V_l^*)$ in $\Re^p$. A possible "reduced rank" version of the LDA classification rule is

$$\hat{Y}_l^* = \arg\max_k\left[\ln\pi_k - \tfrac12\|P_lx^* - P_l\hat{\mu}_k^*\|^2\right],$$

and $l$ becomes a complexity parameter that one might optimize to tune or regularize the method. Note also that for $w\in\Re^p$,

$$\|P_lw\|^2 = \sum_{k=1}^{l}\langle v_k^*, w\rangle^2.$$

For purposes of graphical representation of what is going on in these computations, one might replace the $p$ coordinates of $x^*$ and of the means $\hat{\mu}_k^*$ with the $\min(p, K-1)$ coordinates

$$\left(\langle v_1^*, x^*\rangle, \langle v_2^*, x^*\rangle, \ldots, \langle v_{\min(p,K-1)}^*, x^*\rangle\right) \qquad (19)$$

and

$$\left(\langle v_1^*, \hat{\mu}_k^*\rangle, \langle v_2^*, \hat{\mu}_k^*\rangle, \ldots, \langle v_{\min(p,K-1)}^*, \hat{\mu}_k^*\rangle\right). \qquad (20)$$

It seems to be essentially ordered pairs of these coordinates that are plotted by HTF in their Figures 4.8 and 4.11. In this regard, we need to point out that since any eigenvector $v_k^*$ could be replaced by $-v_k^*$ without any fundamental effect on the above development, the vector (19) and all of the vectors (20) could be altered by multiplication of any particular set of coordinates by $-1$. Whether a particular algorithm for finding eigenvectors produces $v_k^*$ or $-v_k^*$ is not fundamental, and there seems to be no standard convention in this regard. It appears that the pictures in HTF might have been made using the R function lda and its choice of signs for eigenvectors.
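A minimal Python sketch of the plug-in LDA rule (my own illustration on simulated Gaussian classes, not HTF's code): estimate $\pi_k$, $\mu_k$, and a pooled $\Sigma$, then classify by the largest linear discriminant score derived above.

```python
# Plug-in LDA: estimated priors, class means, pooled covariance, and linear scores.
import numpy as np

rng = np.random.default_rng(8)
means = np.array([[0.0, 0.0], [2.0, 1.0], [0.0, 3.0]])
X = np.vstack([rng.multivariate_normal(m, np.eye(2), size=50) for m in means])
y = np.repeat([0, 1, 2], 50)

K, p = 3, 2
pi_hat = np.bincount(y) / len(y)
mu_hat = np.vstack([X[y == k].mean(0) for k in range(K)])
Sigma_hat = sum((X[y == k] - mu_hat[k]).T @ (X[y == k] - mu_hat[k]) for k in range(K))
Sigma_hat /= (len(y) - K)                          # pooled covariance estimate
Sinv = np.linalg.inv(Sigma_hat)

def lda_scores(x):
    """Linear discriminant scores ln(pi_k) - 0.5 mu_k' Sinv mu_k + x' Sinv mu_k."""
    return np.array([np.log(pi_hat[k]) - 0.5 * mu_hat[k] @ Sinv @ mu_hat[k]
                     + x @ Sinv @ mu_hat[k] for k in range(K)])

pred = np.array([np.argmax(lda_scores(x)) for x in X])
print((pred == y).mean())                          # training accuracy of the fitted rule
```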
3.2 Logistic Regression

A generalization of the MVN conditional distribution result (18) is an assumption that for all $k < K$,

$$\ln\left(\frac{P[Y^* = k \mid X = x]}{P[Y^* = K \mid X = x]}\right) = \beta_{k0} + \beta_k'x. \qquad (21)$$

Here there are $K-1$ constants $\beta_{k0}$ and $K-1$ $p$-vectors $\beta_k$ to be specified, not necessarily tied to class mean vectors or a common within-class covariance matrix for $X$. In fact, the set of relationships (21) does not fully specify a joint distribution for $(X, Y^*)$. Rather, they only specify the nature of the conditional distributions of $Y^*\mid X = x$. In this regard the situation is exactly analogous to that in ordinary simple linear regression. A bivariate normal distribution for $(X, Y)$ gives one normal conditional distributions for $Y$ with a constant variance and mean linear in $x$. But one may make those assumptions conditionally on $X$ without assuming anything about the marginal distribution of $X$ (that in the bivariate normal model is univariate normal).

Using $\theta$ as shorthand for a vector containing all the constants $\beta_{k0}$ and the vectors $\beta_k$, the linear log odds assumption (21) produces the forms

$$P[Y^* = k \mid X = x] = p_k(x, \theta) = \frac{\exp\left(\beta_{k0} + \beta_k'x\right)}{1 + \sum_{l=1}^{K-1}\exp\left(\beta_{l0} + \beta_l'x\right)}$$

for $k < K$, and

$$P[Y^* = K \mid X = x] = p_K(x, \theta) = \frac{1}{1 + \sum_{l=1}^{K-1}\exp\left(\beta_{l0} + \beta_l'x\right)},$$

and the corresponding optimal predictor/classification rule is

$$\hat{Y}^* = \arg\max_k\ P[Y^* = k \mid X = x].$$

Not only does assumption (21) generalize the "mixture of MVNs" assumption of LDA, but the standard methods of fitting the corresponding parameters based on training data are necessarily fundamentally different. That is, using maximum likelihood in LDA, the $K$ probabilities $\pi_k$, the $K$ means $\mu_k$, and the covariance matrix $\Sigma$ are chosen to maximize the likelihood

$$\prod_{i=1}^N\pi_{y_i^*}f_{y_i^*}(x_i).$$

This is a mixture model, and the complete likelihood is involved, i.e., a joint density for the $N$ pairs $(x_i, y_i^*)$. Logistic regression methodology maximizes

$$\prod_{i=1}^N p_{y_i^*}(x_i, \theta)$$

over choices of $\theta$. This is not a full likelihood, but rather one conditional on the $x_i$ observed.

3.3 Separating Hyperplanes

In the $K = 2$ group case, if there is a $\beta\in\Re^p$ and real number $\beta_0$ such that in the training data

$$y_i^* = 2 \ \text{exactly when}\ \beta'x_i + \beta_0 > 0,$$

a "separating hyperplane" $\{x\in\Re^p \mid \beta'x + \beta_0 = 0\}$ can be found via logistic regression. The conditional likelihood will not have a maximum, but if one follows a search path far enough toward a limiting value of 0 for the log-likelihood (or 1 for the likelihood), satisfactory $\beta\in\Re^p$ and $\beta_0$ from an iteration of the search algorithm will produce separation.

A famous older algorithm for finding a separating hyperplane is the so-called "perceptron" algorithm. It can be defined as follows. Temporarily treat $Y^*$ (taking values in $\mathcal{G} = \{1, 2\}$) as real-valued and define

$$Y^{**} = 2Y^* - 3,$$

so that $Y^{**}$ takes real values $\pm 1$. From some starting points $\beta^0$ and $\beta_0^0$, cycle through the training data cases in order, repeatedly as needed. At any iteration $l$, using training case $i$, take

$$\beta^l = \beta^{l-1} + y_i^{**}x_i \quad\text{and}\quad \beta_0^l = \beta_0^{l-1} + y_i^{**} \qquad\text{if } y_i^{**}\left(\beta^{l-1\prime}x_i + \beta_0^{l-1}\right) \le 0\ \text{(case } i\ \text{currently misclassified)},$$

and otherwise take $\beta^l = \beta^{l-1}$ and $\beta_0^l = \beta_0^{l-1}$. This will eventually identify a separating hyperplane, namely when a series of $N$ iterations fails to change the values of $\beta$ and $\beta_0$. If there is a separating hyperplane, it will not be unique. One can attempt to define and search for "optimal" such hyperplanes that, e.g., maximize the distance from the plane to the closest training vector. See HTF Section 4.5.2 in this regard.
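Here is a Python sketch of the perceptron-style search just described, run on linearly separable made-up data; the update shown (add $y_i^{**}x_i$ to $\beta$ and $y_i^{**}$ to $\beta_0$ when case $i$ is misclassified) is the classical perceptron rule, written here purely as an illustration.

```python
# Perceptron search for a separating hyperplane on separable two-class data.
import numpy as np

rng = np.random.default_rng(9)
N = 100
X = rng.normal(size=(N, 2))
ystar2 = np.where(X @ np.array([1.5, -1.0]) + 0.3 > 0, 1, -1)   # separable by construction

beta, beta0 = np.zeros(2), 0.0
changed = True
while changed:                      # stop when a full pass through the N cases changes nothing
    changed = False
    for i in range(N):
        if ystar2[i] * (X[i] @ beta + beta0) <= 0:   # case i currently misclassified
            beta += ystar2[i] * X[i]
            beta0 += ystar2[i]
            changed = True

print(np.all(ystar2 * (X @ beta + beta0) > 0))       # a separating hyperplane was found
```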
4 Basis Expansions and Regularization (Ch. 5 of HTF)

A fundamental way of moving beyond prediction rules that are functions of a linear form in $x$ (i.e., that depend upon $x$ through $\beta'x = \sum_j\beta_jx_j$) is to consider some set of basis functions $\{h_m\}$ and predictors of the form (or depending upon the form)

$$f(x) = \sum_m\beta_mh_m(x). \qquad (22)$$

We next consider some flexible methods employing this idea. Until further notice, we restrict attention to a one-dimensional input variable $x$. Notice that fitting of the form (22) can be done using OLS or any of the methods of HTF Chapter 3, based on the $N\times M$ matrix of inputs $\left(h_j(x_i)\right)$ ($i$ indexing rows and $j$ indexing columns).

4.1 Piecewise Polynomials and Regression Splines

Consider $K$ "knots" $\xi_1 < \xi_2 < \cdots < \xi_K$ and forms for $f$ that

1. are polynomials of order $M$ or less on all intervals between knots, and, potentially at least,
2. have derivatives of some specified order at the knots, and, potentially at least,
3. are linear outside $(\xi_1, \xi_K)$.

If we define $K+1$ functions $I_1(x) = I[x < \xi_1]$, $I_j(x) = I[\xi_{j-1}\le x < \xi_j]$ for $j = 2, \ldots, K$, and $I_{K+1}(x) = I[\xi_K\le x]$, one can have property 1 in the list above using the basis functions

$$I_1(x), I_2(x), \ldots, I_{K+1}(x),\quad xI_1(x), xI_2(x), \ldots, xI_{K+1}(x),\quad \ldots,\quad x^MI_1(x), x^MI_2(x), \ldots, x^MI_{K+1}(x).$$

Further, one can enforce continuity and differentiability at the knots (conditions on a form $f = \sum_{m=1}^{(M+1)(K+1)}\beta_mh_m$) by enforcing some linear relations among appropriate ones of the $\beta_m$. While this is conceptually simple, it is messy. It is much cleaner to simply begin with a set of basis functions that are tailored to have the desired continuity/differentiability properties.

A set of $M + 1 + K$ basis functions for piecewise polynomials of degree $M$ with derivatives of order $M-1$ at all knots is easily seen to be

$$1, x, x^2, \ldots, x^M, (x - \xi_1)_+^M, (x - \xi_2)_+^M, \ldots, (x - \xi_K)_+^M,$$

since the value of, and first $M-1$ derivatives of, $(x - \xi_j)_+^M$ at $\xi_j$ are all 0. The choice of $M = 3$ is fairly standard.

Since extrapolation with polynomials typically gets worse with order, it is common to impose the restriction that outside $(\xi_1, \xi_K)$ a form $f$ be linear. For the case of $M = 3$, this can be accomplished by beginning with the basis functions

$$1, x, (x - \xi_1)_+^3, (x - \xi_2)_+^3, \ldots, (x - \xi_K)_+^3$$

and imposing the restrictions necessary to force 2nd and 3rd derivatives to the right of $\xi_K$ to be 0. Notice that, considering $x > \xi_K$,

$$\frac{d^2}{dx^2}\left[a_0 + a_1x + \sum_{j=1}^Kb_j(x - \xi_j)^3\right] = 6\sum_{j=1}^Kb_j(x - \xi_j) \qquad (23)$$

and

$$\frac{d^3}{dx^3}\left[a_0 + a_1x + \sum_{j=1}^Kb_j(x - \xi_j)^3\right] = 6\sum_{j=1}^Kb_j. \qquad (24)$$

So linearity for large $x$ requires, from (24), that $\sum_jb_j = 0$. Further, substituting this into (23) means that linearity also requires that $\sum_jb_j\xi_j = 0$. Using the first of these to conclude that

$$b_K = -\sum_{j=1}^{K-1}b_j$$

and substituting into the second yields

$$b_{K-1} = -\sum_{j=1}^{K-2}b_j\frac{\xi_K - \xi_j}{\xi_K - \xi_{K-1}}.$$

These then suggest the set of basis functions consisting of $1$, $x$, and, for $j = 1, 2, \ldots, K-2$,

$$\frac{(x - \xi_j)_+^3 - (x - \xi_K)_+^3}{\xi_K - \xi_j} - \frac{(x - \xi_{K-1})_+^3 - (x - \xi_K)_+^3}{\xi_K - \xi_{K-1}}.$$

These are essentially the basis functions that HTF call their $N_j$. Their use produces so-called "natural" (linear outside $(\xi_1, \xi_K)$) cubic regression splines. There are other, harder to motivate but in the end more pleasing and computationally more attractive, sets of basis functions for natural polynomial splines. See the B-spline material at the end of HTF Chapter 5.
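A short Python sketch (illustrative only; the knot locations are arbitrary) constructing the natural cubic spline basis functions just described:

```python
# Construct the natural cubic spline basis (1, x, and the K-2 functions N_{j+2} above)
# for an arbitrary knot set xi_1 < ... < xi_K.
import numpy as np

knots = np.array([0.1, 0.3, 0.5, 0.7, 0.9])        # xi_1 < ... < xi_K
K = len(knots)

def d(j, x):
    """Helper d_j(x) = [(x - xi_j)_+^3 - (x - xi_K)_+^3] / (xi_K - xi_j)."""
    return ((np.clip(x - knots[j], 0, None) ** 3
             - np.clip(x - knots[K - 1], 0, None) ** 3) / (knots[K - 1] - knots[j]))

def natural_spline_basis(x):
    """Columns: 1, x, and d_j(x) - d_{K-1}(x) for j = 1, ..., K-2 (0-based j = 0..K-3)."""
    cols = [np.ones_like(x), x]
    cols += [d(j, x) - d(K - 2, x) for j in range(K - 2)]
    return np.column_stack(cols)

x = np.linspace(0, 1, 7)
B = natural_spline_basis(x)
print(B.shape)                                      # (7, K) basis functions in all
```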
4.2 Smoothing Splines

A way of avoiding the direct selection of knots is to instead, for a smoothing parameter $\lambda > 0$, consider the problem of finding

$$\hat{f}_\lambda = \arg\min_{\text{functions } h \text{ with 2 derivatives}}\left[\sum_{i=1}^N(y_i - h(x_i))^2 + \lambda\int(h''(t))^2\,dt\right].$$

Amazingly enough, this optimization problem has a solution that can be fairly simply described: $\hat{f}_\lambda$ is a natural cubic spline with knots at the distinct values $x_i$ in the training set. That is, for a set of basis functions for such splines, $h_1, h_2, \ldots, h_N$ (here we're tacitly assuming that the $N$ values of the input variable in the training set are all different),

$$\hat{f}_\lambda = \sum_{j=1}^N\hat{\theta}_{\lambda j}h_j,$$

where the $\hat{\theta}_{\lambda j}$ are yet to be identified. So consider the function

$$g(x) = \sum_{j=1}^N\theta_jh_j(x); \qquad (25)$$

this has second derivative

$$g''(x) = \sum_{j=1}^N\theta_jh_j''(x),$$

and so

$$\int(g''(x))^2\,dx = \sum_{j}\sum_{l}\theta_j\theta_l\int h_j''(x)h_l''(x)\,dx = \theta'\Omega\theta$$

for $\theta = (\theta_1, \ldots, \theta_N)'$ and the $N\times N$ matrix

$$\Omega = \left(\int h_j''(t)h_l''(t)\,dt\right).$$

In fact, with the notation

$$H = \left(h_j(x_i)\right) \quad (i\ \text{indexing rows and } j \text{ indexing columns}),$$

the criterion to be optimized in order to find $\hat{f}_\lambda$ can be written, for functions of the form (25), as

$$(Y - H\theta)'(Y - H\theta) + \lambda\theta'\Omega\theta,$$

and some vector calculus shows that the optimizing $\theta$ is

$$\hat{\theta}_\lambda = (H'H + \lambda\Omega)^{-1}H'Y, \qquad (26)$$

which can be thought of as some kind of vector of generalized ridge regression coefficients. Corresponding to (26) is a vector of smoothed output values

$$\hat{Y}_\lambda = H(H'H + \lambda\Omega)^{-1}H'Y,$$

and the matrix

$$S_\lambda \equiv H(H'H + \lambda\Omega)^{-1}H'$$

is called a smoother matrix.

Contrast this to a situation where some fairly small number, $p$, of basis functions are employed in a regression context. That is, for basis functions $b_1, b_2, \ldots, b_p$, suppose

$$B = \left(b_j(x_i)\right) \quad (N\times p).$$

Then OLS produces the vector of fitted values $\hat{Y} = B(B'B)^{-1}B'Y$, and the projection matrix onto the column space of $B$, $C(B)$, is

$$P_B = B(B'B)^{-1}B'.$$

$S_\lambda$ and $P_B$ are both $N\times N$ symmetric non-negative definite matrices. While $P_BP_B = P_B$ (i.e., $P_B$ is idempotent), $S_\lambda S_\lambda \preceq S_\lambda$, meaning that $S_\lambda - S_\lambda S_\lambda$ is non-negative definite. $P_B$ is of rank $p = \mathrm{tr}\,P_B$, while $S_\lambda$ is of rank $N$. In a manner similar to what is done in ridge regression, we might define an "effective degrees of freedom" for $S_\lambda$, or for the smoothing, as

$$df(\lambda) \equiv \mathrm{tr}\,S_\lambda. \qquad (27)$$

We proceed to develop motivation and a formula for this quantity.

Notice that it is possible to write $S_\lambda^{-1} = I + \lambda K$ for

$$K \equiv (H^{-1})'\Omega H^{-1},$$

so that

$$S_\lambda = (I + \lambda K)^{-1}. \qquad (28)$$

Some vector calculus shows that $\hat{v} = S_\lambda Y$ is a solution to the minimization problem

$$\min_{v\in\Re^N}\ (Y - v)'(Y - v) + \lambda v'Kv, \qquad (29)$$

so that this matrix $K$ can be thought of as defining a "penalty" in fitting a smoothed version of $Y$.

Then, since $S_\lambda$ is symmetric non-negative definite, it has an eigen decomposition as

$$S_\lambda = UDU' = \sum_{j=1}^Nd_ju_ju_j', \qquad (30)$$

where the columns of $U$ (the eigenvectors $u_j$) comprise an orthonormal basis for $\Re^N$ and $D = \mathrm{diag}(d_1, d_2, \ldots, d_N)$ for eigenvalues of $S_\lambda$

$$d_1 \ge d_2 \ge \cdots \ge d_N > 0.$$

It turns out to be guaranteed that $d_1 = d_2 = 1$.

Consider how the eigenvalues and eigenvectors of $S_\lambda$ are related to those of $K$. An eigenvalue of $K$, say $\eta$, solves $\det(K - \eta I) = 0$, and since

$$\det\left(I + \lambda K - (1 + \lambda\eta)I\right) = \lambda^N\det(K - \eta I),$$

$1 + \lambda\eta$ must be an eigenvalue of $I + \lambda K$, and $1/(1 + \lambda\eta)$ must be an eigenvalue of $S_\lambda = (I + \lambda K)^{-1}$. So for some $j$ we must have

$$d_j = \frac{1}{1 + \lambda\eta_{N-j+1}}, \qquad (31)$$

where (observing that $1/(1 + \lambda\eta)$ is decreasing in $\eta$) $\eta_1 \ge \eta_2 \ge \cdots \ge \eta_{N-2} \ge \eta_{N-1} = \eta_N = 0$ are the eigenvalues of $K$, which themselves do not depend upon $\lambda$.

So, for example, in light of (27), (30) and (31), the smoothing effective degrees of freedom are

$$df(\lambda) = \mathrm{tr}\,S_\lambda = \sum_{j=1}^Nd_j = \sum_{j=1}^N\frac{1}{1 + \lambda\eta_{N-j+1}},$$

which is clearly decreasing in $\lambda$, with minimum value 2 in light of the fact that $S_\lambda$ has two eigenvalues that are 1.

Further, consider $u_j$, the eigenvector of $S_\lambda$ corresponding to eigenvalue $d_j$. $S_\lambda u_j = d_ju_j$, so that $u_j = d_jS_\lambda^{-1}u_j = d_j(I + \lambda K)u_j$, so that $u_j = d_ju_j + \lambda d_jKu_j$, and thus

$$Ku_j = \frac{1 - d_j}{\lambda d_j}u_j = \eta_{N-j+1}u_j.$$

That is, $u_j$ is an eigenvector of $K$ corresponding to the $(N-j+1)$st largest eigenvalue. That is, for all $\lambda$, the eigenvectors of $S_\lambda$ are eigenvectors of $K$ and thus do not depend upon $\lambda$. Then, for any $\lambda$,

$$\hat{Y}_\lambda = S_\lambda Y = \sum_{j=1}^Nd_ju_ju_j'Y = \sum_{j=1}^Nd_j\langle u_j, Y\rangle u_j, \qquad (32)$$

and we see that $\hat{Y}_\lambda$ is a shrunken version of $Y$. The larger is $\lambda$, the more severe the shrinking overall. Further, the larger is $j$, the smaller is $d_j$ and the more severe is the shrinking in the $u_j$ direction. In the context of cubic smoothing splines, large $j$ correspond to $u_j$ "wiggly" as functions of coordinate $i$ (or value of the input $x_i$), and the prescription (32) calls for suppression of "wiggly" components of $Y$.
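A numerical Python sketch of displays (28)-(32): here a simple squared-second-difference penalty matrix is used only as a stand-in for the spline-derived $K$ (an assumption of the example, not the outline's construction), but the algebra is the same — $S_\lambda = (I + \lambda K)^{-1}$ shrinks $Y$ along the eigenvectors of $K$, and $df(\lambda) = \sum_j 1/(1 + \lambda\eta_{N-j+1})$.

```python
# Generic "penalty matrix" smoother: S_lambda = (I + lambda*K)^{-1} and its degrees of freedom.
import numpy as np

rng = np.random.default_rng(10)
N, lam = 40, 5.0
D2 = np.diff(np.eye(N), n=2, axis=0)          # second-difference operator (stand-in penalty)
K = D2.T @ D2                                 # symmetric, non-negative definite K
Y = np.sin(np.linspace(0, 3, N)) + rng.normal(scale=0.2, size=N)

S = np.linalg.inv(np.eye(N) + lam * K)        # smoother matrix S_lambda, display (28)
Yhat = S @ Y                                  # shrunken version of Y, display (32)

eta = np.linalg.eigvalsh(K)                   # eigenvalues of K (two of them are 0)
print(np.isclose(np.trace(S), np.sum(1 / (1 + lam * eta))))   # df(lambda) identity
print(np.round(Yhat[:5], 3))
```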
Notice that large $j$ correspond to early/large eigenvalues of the penalty matrix $K$ in (29). Letting $u_j^* = u_{N-j+1}$, so that

$$U^* \equiv (u_1^*, u_2^*, \ldots, u_N^*) = (u_N, u_{N-1}, \ldots, u_1),$$

the eigen decomposition of $K$ is

$$K = U^*\,\mathrm{diag}(\eta_1, \eta_2, \ldots, \eta_N)\,U^{*\prime},$$

and the criterion (29) can be written as

$$\min_{v\in\Re^N}\ (Y - v)'(Y - v) + \lambda v'U^*\,\mathrm{diag}(\eta_1, \ldots, \eta_N)\,U^{*\prime}v,$$

or equivalently as

$$\min_{v\in\Re^N}\ (Y - v)'(Y - v) + \lambda\sum_{j=1}^N\eta_j\langle u_j^*, v\rangle^2,$$

and we see that the eigenvalues of $K$ function as penalty coefficients applied to the $N$ orthogonal components of $v = \sum_j\langle u_j^*, v\rangle u_j^*$ in the choice of an optimizing $v$. From this point of view, the $u_j$ (or $u_j^*$) provide the natural (alternative to the columns of $H$) basis for $\Re^N$ for representing or approximating $Y$, and the last equality in (32) provides an explicit form for the optimizing smoothed vector $\hat{Y}_\lambda$.

In this development, $K$ has had a specific meaning, derived from the $H$ and $\Omega$ matrices connected specifically with smoothing splines. But in the end, perhaps the most interesting possibility brought up by the whole development is that of forgetting the origins (from $K$) of the $\eta_j$ and $u_j$, and beginning with any interesting/intuitively appealing orthonormal basis and set of non-negative penalties $\eta_j$ for use in (29). Working backwards through (32) and (31), one is then led to the corresponding smoothed vector $\hat{Y}_\lambda$ and smoothing matrix $S_\lambda$.

It is also worth remarking that since $\hat{Y}_\lambda = S_\lambda Y$, the rows of $S_\lambda$ provide weights to be applied to the elements of $Y$ in order to produce predictions/smoothed values corresponding to $\hat{Y}$. These can, for each $i$, be thought of as defining a corresponding "equivalent kernel" for an appropriate "kernel-weighted" average of the training output values. See Figure 5.8 of HTF2 in this regard.

4.3 Generalizations to p-Dimensional Inputs

4.3.1 Multi-Dimensional Regression Splines (Tensor Bases)

If $p = 2$ and the vector of inputs, $X$, takes values in $\Re^2$, one might proceed as follows. If $h_{11}, h_{12}, \ldots, h_{1M_1}$ is a set of spline basis functions based on $x_1$, and $h_{21}, h_{22}, \ldots, h_{2M_2}$ is a set of spline basis functions based on $x_2$, one might consider the set of $M_1M_2$ basis functions based on $x$ defined by

$$g_{jk}(x) = h_{1j}(x_1)h_{2k}(x_2)$$

and corresponding regression splines

$$f(x) = \sum_{j,k}\beta_{jk}g_{jk}(x).$$

The biggest problem with this potential method is the explosion in the size of a tensor product basis as $p$ increases. For example, using $K$ knots for cubic regression splines in each of $p$ dimensions produces $(4 + K)^p$ basis functions for the $p$-dimensional problem. Some kind of forward selection algorithm or shrinking of coefficients will be needed to produce any kind of workable fits with such large numbers of basis functions.

4.3.2 Multi-Dimensional Smoothing Splines

If $p = 2$ and the vector of inputs, $X$, takes values in $\Re^2$, one might propose to seek

$$\hat{f}_\lambda = \arg\min_{\text{functions } h \text{ with 2 derivatives}}\left[\sum_{i=1}^N(y_i - h(x_i))^2 + \lambda J(h)\right]$$

for

$$J(h) = \int\int\left[\left(\frac{\partial^2h}{\partial x_1^2}\right)^2 + 2\left(\frac{\partial^2h}{\partial x_1\partial x_2}\right)^2 + \left(\frac{\partial^2h}{\partial x_2^2}\right)^2\right]dx_1\,dx_2.$$

An optimizing $\hat{f}_\lambda : \Re^2 \to \Re$ can be identified and is called a "thin plate spline." As $\lambda \to 0$, $\hat{f}_\lambda$ becomes an interpolator; as $\lambda \to \infty$, it defines the OLS plane through the data in 3-space. In general, it can be shown to be of the form

$$\hat{f}_\lambda(x) = \beta_0 + \beta'x + \sum_{i=1}^N\alpha_ig_i(x) \qquad (34)$$

where $g_i(x) = g(\|x - x_i\|)$ for $g(r) = r^2\ln r^2$. The $g_i$ are "radial basis functions" (radially symmetric basis functions), and fitting is accomplished much as for the $p = 1$ case: the form (34) is plugged into the optimization criterion, and a discrete penalized least squares problem emerges (after taking account of some linear constraints that are required to keep $J(f) < \infty$). HTF seem to indicate that in order to keep computations from exploding with $N$, it usually suffices to replace the $N$ functions $g_i$ in (34) with $K < N$ functions $g(\|x - x_j^*\|)$, for $K$ potential input vectors $x_j^*$ placed on a rectangular grid covering the convex hull of the $N$ training data input vectors $x_i$.

For large $p$, one might simply declare that attention is going to be limited to predictors of some restricted form, and for $h$ in that restricted class seek to optimize

$$\sum_{i=1}^N(y_i - h(x_i))^2 + \lambda J(h)$$

for $J(h)$ some appropriate penalty on $h$ intended to regularize/restrict its wiggling. For example, one might assume that a form

$$g(x) = \sum_{j=1}^pg_j(x_j)$$

is appropriate and set $J(g) = \sum_jJ(g_j)$, and be led to additive splines. Or one might assume that

$$g(x) = \sum_{j=1}^pg_j(x_j) + \sum_{j,k}g_{jk}(x_j, x_k) \qquad (35)$$

and invent an appropriate penalty function. It seems like a sum of 1-dimensional smoothing spline penalties on the $g_j$ and 2-dimensional thin plate spline penalties on the $g_{jk}$ is the most obvious starting point. Details of fitting are a bit murky, though I am sure that they can be found in books on generalized additive models. Presumably one cycles through the summands in (35), iteratively fitting functions to sets of residuals defined by the original $y_i$ minus the sums of all other current versions of the components, until some convergence criterion is satisfied. (35) is a kind of "main effects plus 2-factor interactions" decomposition, but it is (at least in theory) possible to also consider higher order terms in this kind of expansion.
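Since the outline only gestures at the fitting ("presumably one cycles through the summands"), here is a hedged Python sketch of that cycling (backfitting-style) idea for a purely additive fit $g(x) = \sum_j g_j(x_j)$; the univariate smoother used is a crude kernel average chosen just as a placeholder, and the whole block is a guess at the scheme described, not the outline's method.

```python
# Cycling ("backfitting"-style) sketch for an additive fit: repeatedly smooth partial
# residuals against each coordinate, recentering components for identifiability.
import numpy as np

rng = np.random.default_rng(11)
N, p = 300, 2
X = rng.uniform(-1, 1, size=(N, p))
y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.2, size=N)

def smooth(x, r, bandwidth=0.2):
    """Kernel-weighted average of residuals r at each observed x (placeholder smoother)."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
    return (w @ r) / w.sum(axis=1)

g = np.zeros((p, N))                          # current component fits g_j(x_ij)
for _ in range(10):                           # a fixed number of cycles, for simplicity
    for j in range(p):
        partial_resid = y - y.mean() - g[np.arange(p) != j].sum(axis=0)
        g[j] = smooth(X[:, j], partial_resid)
        g[j] -= g[j].mean()                   # center each component

fitted = y.mean() + g.sum(axis=0)
print(np.round(np.mean((y - fitted) ** 2), 3))
```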
4.4 Reproducing Kernel Hilbert Spaces and Fitting Functions in $\Re^p$

A general framework that encompasses many regularized fitting methods is that of reproducing kernel Hilbert spaces. This begins with $C$, a compact subset of $\Re^p$, and a symmetric kernel function

$$K : C\times C \to \Re.$$

Ultimately we will consider as predictors, for $x\in C$, linear combinations of sections of the kernel function, $\sum_ib_iK(x, x_i)$, where the $x_i$ are the input vectors in the training set. But to get there in a rational way, and to incorporate use of a complexity penalty into the fitting, we will restrict attention to those kernels that have nice properties. In particular, we require that $K$ be continuous and non-negative definite. We mean by this latter that for any set of $M$ elements of $\Re^p$, say $\{z_j\}$, the $M\times M$ matrix $\left(K(z_i, z_j)\right)$ is non-negative definite. Then, according to what is known as Mercer's Theorem (look it up in Wikipedia), $K$ may be written in the form

$$K(z, x) = \sum_{i=1}^\infty\gamma_i\phi_i(z)\phi_i(x) \qquad (36)$$

for linearly independent (in $L_2(C)$) functions $\phi_i$ and constants $\gamma_i \ge 0$, where the $\phi_i$ comprise an orthonormal basis for $L_2(C)$ (so any function in $L_2(C)$ has an expansion in terms of the $\phi_i$), and the ones corresponding to positive $\gamma_i$ may be taken to be continuous on $C$. Further, the $\phi_i$ are "eigenfunctions" of the kernel $K$ corresponding to the "eigenvalues" $\gamma_i$, in the sense that in $L_2(C)$,

$$\int_C\phi_i(z)K(z, x)\,dz = \gamma_i\phi_i(x).$$

Our present interest is in a function space $\mathcal{H}_K$ that is a subset of $L_2(C)$, with a different inner product and norm, with members of the form

$$f(x) = \sum_{i=1}^\infty c_i\phi_i(x) \quad \text{for } \{c_i\} \text{ with } \sum_{i=1}^\infty\frac{c_i^2}{\gamma_i} < \infty, \qquad (37)$$

called the "primal" form of functions in the space. (Notice that all elements of $L_2(C)$ are of the form $\sum_ic_i\phi_i$ with $\sum_ic_i^2 < \infty$.) More naturally, our interest centers on

$$f(x) = \sum_{i=1}^\infty b_iK(x, z_i) \quad \text{for some set of } \{z_i\}, \qquad (38)$$

supposing that the series converges appropriately, called the "dual" form of functions in the space. The former is most useful for producing simple proofs, while the second is most natural for application, since how to obtain the $\phi_i$ and corresponding $\gamma_i$ for a given $K$ is not so obvious.

Notice that

$$K(\cdot, z) = \sum_i\gamma_i\phi_i(z)\phi_i(\cdot),$$

and, letting $c_i = \gamma_i\phi_i(z)$, since

$$\sum_i\frac{c_i^2}{\gamma_i} = \sum_i\gamma_i\phi_i^2(z) = K(z, z) < \infty,$$

the function $K(\cdot, z)$ is of the form (37), so that we can expect functions of the form (38) (with suitably convergent coefficients $b_i$) to be of form (37).
In the space of functions (37), we define an inner product (for our Hilbert space)

$$\left\langle\sum_ic_i\phi_i,\ \sum_id_i\phi_i\right\rangle_{\mathcal{H}_K} \equiv \sum_i\frac{c_id_i}{\gamma_i},$$

so that

$$\left\|\sum_ic_i\phi_i\right\|^2_{\mathcal{H}_K} = \sum_i\frac{c_i^2}{\gamma_i}.$$

Note then that for $f = \sum_ic_i\phi_i$ belonging to our Hilbert space $\mathcal{H}_K$,

$$\langle f, K(\cdot, x)\rangle_{\mathcal{H}_K} = \sum_i\frac{c_i\gamma_i\phi_i(x)}{\gamma_i} = \sum_ic_i\phi_i(x) = f(x),$$

and so $K(\cdot, x)$ is called the representer of evaluation at $x$. Further,

$$\langle K(\cdot, z), K(\cdot, x)\rangle_{\mathcal{H}_K} = \sum_i\frac{\gamma_i\phi_i(z)\gamma_i\phi_i(x)}{\gamma_i} = \sum_i\gamma_i\phi_i(z)\phi_i(x) = K(z, x),$$

which is the "reproducing property" of the RKHS. Notice that this reproducing property implies that for $f = \sum_{i=1}^Mb_iK(\cdot, z_i)$ for some set of $\{z_i\}$,

$$\|f\|^2_{\mathcal{H}_K} = \langle f, f\rangle_{\mathcal{H}_K} = \sum_{i=1}^M\sum_{j=1}^Mb_ib_jK(z_i, z_j).$$

That is, a function that is a linear combination of some set of slices of the kernel function has squared norm in $\mathcal{H}_K$ that is the variance of that linear combination of coordinates of a random vector with covariance matrix obtained by consulting the kernel at the pairs of points $z_i$ that specify which slices are used to make up $f$.

For applying this material to the fitting of training data, for $\lambda > 0$ and a loss function $L(y, \hat{y}) \ge 0$, define an optimization criterion

$$\min_{f\in\mathcal{H}_K}\left[\sum_{i=1}^NL\left(y_i, f(x_i)\right) + \lambda\|f\|^2_{\mathcal{H}_K}\right]. \qquad (39)$$

As it turns out, an optimizer of this criterion must, for the training vectors $x_i$, be of the form

$$f(x) = \sum_{i=1}^Nb_iK(x, x_i), \qquad (40)$$

and the corresponding $\|f\|^2_{\mathcal{H}_K}$ is then

$$\|f\|^2_{\mathcal{H}_K} = \sum_{i=1}^N\sum_{j=1}^Nb_ib_jK(x_i, x_j).$$

The criterion (39) is thus

$$\min_{b\in\Re^N}\left[\sum_{i=1}^NL\left(y_i, \sum_{j=1}^Nb_jK(x_i, x_j)\right) + \lambda b'\mathbf{K}b\right] \qquad (41)$$

for the $N\times N$ matrix $\mathbf{K} = \left(K(x_i, x_j)\right)$. Letting $\mathbf{K}^-$ be a symmetric generalized inverse of $\mathbf{K}$, and defining

$$L_N(Y, \mathbf{K}b) = \sum_{i=1}^NL\left(y_i, \sum_{j=1}^Nb_jK(x_i, x_j)\right),$$

the optimization criterion (41) is

$$\min_{b\in\Re^N}\left[L_N(Y, \mathbf{K}b) + \lambda b'\mathbf{K}b\right] = \min_{b\in\Re^N}\left[L_N(Y, \mathbf{K}b) + \lambda(\mathbf{K}b)'\mathbf{K}^-(\mathbf{K}b)\right],$$

that is,

$$\min_{v\in C(\mathbf{K})}\left[L_N(Y, v) + \lambda v'\mathbf{K}^-v\right]. \qquad (42)$$

That is, the function space optimization problem (39) reduces to the $N$-dimensional optimization problem (42). A $v\in C(\mathbf{K})$ (the column space of $\mathbf{K}$) minimizing $L_N(Y, v) + \lambda v'\mathbf{K}^-v$ corresponds to a $b$ minimizing $L_N(Y, \mathbf{K}b) + \lambda b'\mathbf{K}b$ via

$$\mathbf{K}b = v. \qquad (43)$$

For the particular special case of squared error loss, $L(y, \hat{y}) = (y - \hat{y})^2$, this development has a very explicit punch line. That is,

$$L_N(Y, \mathbf{K}b) + \lambda b'\mathbf{K}b = (Y - \mathbf{K}b)'(Y - \mathbf{K}b) + \lambda b'\mathbf{K}b.$$

Some vector calculus shows that this is minimized over choices of $b$ by

$$\hat{b}_\lambda = (\mathbf{K} + \lambda I)^{-1}Y \qquad (44)$$

and corresponding fitted values are

$$\hat{Y}_\lambda = \mathbf{K}\hat{b}_\lambda = \mathbf{K}(\mathbf{K} + \lambda I)^{-1}Y.$$

Then, using (44), under squared error loss the solution to (39) is, from (40),

$$\hat{f}_\lambda(x) = \sum_{i=1}^N\hat{b}_{\lambda i}K(x, x_i). \qquad (45)$$

To better understand the nature of (42), consider the eigen decomposition of $\mathbf{K}$ (in a case where it is non-singular) as

$$\mathbf{K} = U\,\mathrm{diag}(\eta_1, \eta_2, \ldots, \eta_N)\,U'$$

for eigenvalues $\eta_1 \ge \eta_2 \ge \cdots \ge \eta_N > 0$, where the eigenvector columns of $U$ comprise an orthonormal basis for $\Re^N$. The penalty in (42) is

$$\lambda v'\mathbf{K}^-v = \lambda v'U\,\mathrm{diag}\left(\frac{1}{\eta_1}, \frac{1}{\eta_2}, \ldots, \frac{1}{\eta_N}\right)U'v = \lambda\sum_{j=1}^N\frac{1}{\eta_j}\langle v, u_j\rangle^2.$$

Now, remembering that the $u_j$ comprise an orthonormal basis for $\Re^N$,

$$v = \sum_j\langle v, u_j\rangle u_j \quad\text{and}\quad \|v\|^2 = \sum_j\langle v, u_j\rangle^2,$$

so we see that in choosing a $v$ to optimize $L_N(Y, v) + \lambda v'\mathbf{K}^-v$, we penalize highly those $v$ with large components in the directions of the late eigenvectors of $\mathbf{K}$ (the ones corresponding to its small eigenvalues), thereby tending to suppress those features of a potential $v$.

HTF seem to say that the eigenvalues and eigenvectors of the data-dependent $N\times N$ matrix $\mathbf{K}$ are somehow related, respectively, to the constants $\gamma_i$ and functions $\phi_i$ in the representation (36) of $K$ as a weighted series of products. That seems hard to understand, and even if true, certainly not obvious.

HTF also make some discussion of fitting problems that involve candidate functions of the form

$$\sum_{j=1}^Mc_j\psi_j + f,$$

where $f\in\mathcal{H}_K$ and the $\psi_j$'s are somehow not penalized by the $\lambda$ penalty, since they belong to a "null space" of the kernel. This is not so easy to understand. I think what is meant is that, in a properly constructed larger RKHS containing all $f\in\mathcal{H}_K$ and the $\psi_j$, there is an inner product that reduces to the $\mathcal{H}_K$ inner product for $f\in\mathcal{H}_K$ and under which each $\psi_j$ is perpendicular to every $\phi_i$. Then, in this larger space, the projection of $\sum_jc_j\psi_j + f$ onto $\mathcal{H}_K$ is $f$, and the penalty is set in terms of the norm of this projection alone.
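The following Python sketch illustrates the squared error loss "punch line" (44)-(45) with a Gaussian radial basis kernel (one of the special cases described next); the particular kernel parameter, penalty weight, and data are illustrative assumptions.

```python
# Kernel fit under squared error loss: b_hat = (K + lambda I)^{-1} Y and
# f_hat(x) = sum_i b_hat_i K(x, x_i).
import numpy as np

rng = np.random.default_rng(12)
N, lam, nu = 80, 0.5, 10.0
x = rng.uniform(0, 1, size=N)
Y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

def kernel(a, b):
    """Gaussian radial basis kernel K(a_i, b_j) = exp(-nu * (a_i - b_j)^2)."""
    return np.exp(-nu * (a[:, None] - b[None, :]) ** 2)

Kmat = kernel(x, x)                                   # N x N matrix (K(x_i, x_j))
b_hat = np.linalg.solve(Kmat + lam * np.eye(N), Y)    # display (44)

x_new = np.linspace(0, 1, 5)
f_hat = kernel(x_new, x) @ b_hat                      # display (45) evaluated at new inputs
print(np.round(f_hat, 3))
```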
in this larger space the projection of CjiZJj f onto HK is f and the penalty is set in terms of the norm of this projection alone Some special cases of this general RKHS development can be brie y de scribed 441 Some Special Cases One possible kernel function in 17 dimensions is K 2 z 1 zzgtd 39 For xed z the basis functions K m are dth order polynomials in the entries of mi So the tting is in terms of such polynomials Note that since K 2 z is relatively simple here there seems to be a good chance of explicitly deriving a representation 36 and perhaps working out all the details of What is above in a very concrete setting Another standard kernel function in 17 dimensions is K 2 z exp iv Hz 7 ml and the basis functions K m are essentially spherically symmetric normal pdfs With mean vectors mi These are quotGaussian radial basis functions A kernel function in 2 dimensions is 2 K9790 HZ 90H 111HZ 90H and this for squared error loss Ly17 y172 is the case of thin plate splines The standard development of socalled quotsupport vector classi ers in a 27 class context With Yquot taking values i1 uses some kernel K 2 z and predictors f z be Z biK z m 2391 in combination With L01quot f 90 l1 yquotf mm the sign of f providing the classi cation associated With 45 Wavelet Bases Return now for exposition purposes to the case of a oneedimensional input varie able X and in fact suppose that X takes values in 0 1 One might consider a set of basis function for use in the form 22 that is big enough and rich enough to approximate essentially any function on 0 1 In particular various orthoe normal bases for the square integrable functions on this interval the space of functions L2 0 1 come to mind One might for example consider using some number of functions from the Fourier basis for L2 0 1 oo 7 oo sinj2739rt U V2 cos j2739rt UH j1 j1 For example using M m N2 sinecos pairs and the constant one could consider the tting of predictors M M f x e 2 51mm mum Z zmcosltm2wxgt 46gt 7711 7111 If one has training at on an appropriate regular gird the use of form 46 leads to orthogonality in the N X 2M 1 matrix of values of the basis functions X and simplefast calculations Unless however one believes that EYX ac is periodic form 46 has its serious limitations In particular unless M is very very large a trigonometric series like 46 will typically provide a poor approximation for a function that varies at different scales on different parts of 0 1 and in any case the coeffk cients necessary to provide such localized variation at different scales have no obvious simple interpretationsconnections to the irregular pattern of variation being described Soeca led quotwavelet bases are much more useful in providing parsimonious and interpretable approximations to such functions The simplest wavelet basis for L2 0 1 is the Haar basis It may be described as follows De n 50 I 0 lt at S 1 the socalled quotfatherquot wavelet and 1 1 I 0 lt at S a 7 I i lt at S 1 the soecalled quotmotherquot wavelet Linear combinations of these functions provide all elements of L2 0 1 that are constant on 0 and on G 1 Write 1 0907 Nextde ne 1501 ltI0ltxSiililtxSgt and 1106 ltIEltS IElt 1gt 1 1 14107 1411 Using the set of functions 110 U 111 one can build as linear combinations all elements of L2 0 1 that are constant on 0 i and on i and on g and and let 3 on I 1 The story then goes on as one should expect One de nes W Moms 1 ltxi 1 Qflliq illl ms l W lel q illl wsil W lel lt lll ltwlgt and lets 1 2 107 117 227 23 In general m jac27m 2m forj122m71 and Wm 
1pm07 m17 7 m2m71 The Haar basis of L2 0 1 is then Uo 1 m Then one might entertain use of the Haar basis functions through order M in constructing a predictor M 27quot 71 f X e Z Z 5me X 47 with the understanding that 04 11 a form that in general allows building of functions that are constant on consecutive intervals of length 12M1 This form can be t by any of the various regression methods especially involving thresholdingselection as a typically very large number 2M1 of basis funce tions is employed in 47 Large absolute values of coe eicients mj encode scales at which important variation in the value of the index m and location in 01 where that variation occurs in the value jQm Where perhaps afe ter model selection thresholding only a relatively few tted coe eicients are important the corresponding scales and locations provide an informative and compact summary 0 t e In special situations where N 2K and 1 forz39122K and one uses the Haar basis functions through order K 7 1 the tting of 47 is computationally clean since the vectors 71 951 min together with the column vector of 1 s are orthogonal In fact upon division by N 2K they form an orthonormal basis for ERN The Haar wavelet basis functions are easy to describe and understand But they are discontinuous and from some points of view that is unappealing Other sets of wavelet basis functions have been developed that are smooth The construction begins with a smooth quotmother wavelet in place of the step function used above HTF make some discussion of the smooth quotsymmletquot wavelet basis 42 5 Kernel Smoothing Methods Ch 6 of HTF The central idea of this material is that when nding fz0 I might weight points in the training set according to how close they are to 130 do some kind of tting around we and ultimately read off f from the value of the t at 130 51 Onedimensional Kernel Smoothers For the time being suppose that X takes values in 01 Invent weighting schemes for points in the training set by de ning a usually symmetric about 0 nonenegative realevalued function D t that is noneincreasing for t 2 0 and nonedecreasing for t S 0 Often D t is taken to have value 0 unless M S 1 Then a kernel function is KA 110 D 48 where A is a quotbandwidthquot parameter that controls the rate at which weights drop off as one moves away from 0 and indeed in the case that D t 0 for M gt 1 how far one moves away from 10 before no weight is assigned Common choices for D are 1 the Epanechnikov quadratic kernel D t 2 1 7 t2 I S 1 3 2 the quottriecubequot kernel D t lt1 7 M3 I S 1 and 3 the standard normal density D t ab Using weights 48 to make a weighted average of training responses one arrives at the NadarayaeWatson kerneleweighted prediction at 10 N Zi1K WWW f A 0 N 221 KA 9007 90239 This typically smooths training outputs y in a more pleasing way than does a kenearest neighbor average but it has obvious problems at the ends of the interval 01 and at places in the interior of the interval where training data are dense to one side of 0 and sparse to the other if the target EY X ac has nonezero derivative at 10 For example at 0 1 only as lt 1 get weight and 49 if EY X ac is decreasing at 0 1 f 1 will be positively biased That is with usual symmetric kernels 49 will fail to adequately follow an obvious trend at 0 or 1 or at any point between where there is a sharp change in the density of input values in the training set A way to x this problem with the NadarayaeWatson predictor is to replace the locallye tted constant with a locallye tted line That is at 
$x_0$ one might choose $\alpha(x_0)$ and $\beta(x_0)$ to minimize
\[
\sum_{i=1}^N K_\lambda(x_0,x_i)\left(y_i-\alpha(x_0)-\beta(x_0)\,x_i\right)^2 \tag{50}
\]
and then employ the prediction
\[
\hat{f}(x_0)=\hat{\alpha}(x_0)+\hat{\beta}(x_0)\,x_0 . \tag{51}
\]
Now the weighted least squares problem (50) has an explicit solution. Let $b(x)'=(1,x)$,
\[
B=\begin{pmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_N\end{pmatrix}_{N\times 2},
\]
and take $W(x_0)=\mathrm{diag}\left(K_\lambda(x_0,x_1),\ldots,K_\lambda(x_0,x_N)\right)$ ($N\times N$); then (51) is
\[
\hat{f}(x_0)=b(x_0)'\left(B'W(x_0)B\right)^{-1}B'W(x_0)\,Y=l(x_0)'\,Y
\]
for the $1\times N$ vector $l(x_0)'=b(x_0)'\left(B'W(x_0)B\right)^{-1}B'W(x_0)$. It is thus obvious that locally weighted linear regression is an (albeit $x_0$-dependent) linear operation on the vector of outputs. The weights in $l(x_0)$ combine the original kernel values and the least squares fitting operation to produce a kind of "equivalent kernel" for a Nadaraya-Watson type weighted average.

Recall that for smoothing splines we represented smoothed values as $\hat{Y}=S_\lambda Y$ (where the parameter $\lambda$ is the penalty weight) and employed $df=\mathrm{tr}\,S_\lambda$. We may do something parallel in the present context. We may take
\[
L_\lambda=\begin{pmatrix} l(x_1)'\\ l(x_2)'\\ \vdots\\ l(x_N)'\end{pmatrix},
\]
where now the parameter $\lambda$ is the bandwidth, write $\hat{Y}=L_\lambda Y$, and define $df=\mathrm{tr}\,L_\lambda$. HTF suggest that matching degrees of freedom for a smoothing spline and a kernel smoother produces very similar equivalent kernels, smoothers, and predictions.
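To make (49), (50), and (51) concrete, here is a minimal sketch in Python. The Epanechnikov kernel, the simulated training data, and the bandwidth $\lambda=0.2$ are hypothetical choices, not anything prescribed by HTF.

```python
import numpy as np

def epanechnikov(t):
    # D(t) = 0.75 (1 - t^2) for |t| <= 1, and 0 otherwise
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def nw_smooth(x0, x, y, lam):
    # Nadaraya-Watson kernel-weighted average (49) at x0 with bandwidth lam
    w = epanechnikov((x0 - x) / lam)
    return np.sum(w * y) / np.sum(w)

def local_linear(x0, x, y, lam):
    # locally weighted linear fit (50); returns alpha-hat(x0) + beta-hat(x0) * x0 as in (51)
    w = epanechnikov((x0 - x) / lam)
    B = np.column_stack([np.ones_like(x), x])
    W = np.diag(w)
    coef = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)
    return coef[0] + coef[1] * x0

# small illustration on simulated (hypothetical) data
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(4 * x) + rng.normal(scale=0.3, size=100)
grid = np.linspace(0.05, 0.95, 10)
fits = [(round(nw_smooth(g, x, y, 0.2), 3), round(local_linear(g, x, y, 0.2), 3)) for g in grid]
print(fits)
```

Evaluating both smoothers near the ends of $(0,1)$ is one simple way to see the boundary-bias issue discussed above.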
5.2 Kernel Smoothing in $p$ Dimensions

A direct generalization of 1-dimensional kernel smoothing to $p$ dimensions might go roughly as follows. For $D$ as before and $z\in\Re^p$, one might set
\[
K_\lambda(x_0,z)=D\left(\frac{\left\|z-x_0\right\|}{\lambda}\right) \tag{52}
\]
and fit linear forms locally by solving the problem
\[
\underset{\alpha(x_0),\,\beta(x_0)}{\text{minimize}}\ \sum_{i=1}^N K_\lambda(x_0,x_i)\left(y_i-\alpha(x_0)-\beta(x_0)'x_i\right)^2,
\]
choosing $\hat{\alpha}(x_0)\in\Re$ and $\hat{\beta}(x_0)\in\Re^p$ and predicting as $\hat{f}(x_0)=\hat{\alpha}(x_0)+\hat{\beta}(x_0)'x_0$. This seems to be typically done after standardizing the coordinates of $X$, and it can be effective as long as $N$ is not too small and $p$ is not more than 2 or 3. However, for $p>3$ the curse of dimensionality comes into play, and $N$ points just usually aren't dense enough in $p$-space to make direct use of kernel smoothing effective. If the method is going to be successfully used in $\Re^p$, it will need to be applied under appropriate structure assumptions. HTF suggest several clever versions of such assumptions in Sections 6.3 and 6.4.

5.2.1 Structured Kernels

One way to apply additional structure to the $p$-dimensional kernel smoothing problem is to essentially reduce input variable dimension by replacing the kernel (52) with
\[
K_{\lambda,A}(x_0,z)=D\left(\frac{\sqrt{(z-x_0)'A(z-x_0)}}{\lambda}\right)
\]
for an appropriate non-negative definite matrix $A$. For the eigen decomposition of $A$, $A=VDV'$, write
\[
(z-x_0)'A(z-x_0)=\left(D^{1/2}V'(z-x_0)\right)'\left(D^{1/2}V'(z-x_0)\right).
\]
This amounts to using not $z$ and $\Re^p$ distance from $z$ to $x_0$ to define weights, but rather $D^{1/2}V'z$ and $\Re^p$ distance from $D^{1/2}V'z$ to $D^{1/2}V'x_0$. In the event that some entries of $D$ are 0 (or are nearly so), we basically reduce dimension from $p$ to the number of large eigenvalues of $A$ and define weights in a space of that dimension, spanned by eigenvectors corresponding to non-zero eigenvalues, where the curse of dimensionality may not preclude effective use of kernel smoothing. The "trick" is, of course, identifying the right directions onto which to project. Searching for such directions is part of the Friedman "projection pursuit" ideas.

5.2.2 Structured Regression Functions

Another way to apply additional structure to the $p$-dimensional kernel smoothing problem is through assumptions on the form of the predictor fit. For example, if one assumes additivity in a predictor
\[
f(X)=\alpha+\sum_{j=1}^p g_j\left(X_j\right), \tag{53}
\]
one might apply kernel smoothing in succession to the partial residuals $y_i-\hat{\alpha}-\sum_{l\neq j}\hat{g}_l(x_{il})$, as functions of the 1-dimensional variables $X_j$, in order to produce a new version of $\hat{g}_j(X_j)$, iterating until convergence to, say, $\hat{f}$. This might be termed a "backfitting" algorithm. And where form (53) might be termed a "main effects" model, the same kind of algorithm might be applied to use kernel smoothing to fit a "main effects and two factor interactions" model, at some iterations doing smoothing of appropriate "residuals" as functions of 1-dimensional $X_j$ and at other iterations doing smoothing as functions of 2-dimensional $(X_j,X_{j'})$.

Another possibility for introducing structure assumptions and making use of low-dimensional kernel smoothing in a large-$p$ situation is to make strong global assumptions about the forms of the influence of some input variables on the output, but to allow the parameters of those forms to vary in a flexible fashion with the values of some small number of coordinates of $X$. For the sake of example, suppose that $p=4$. One might consider predictors of the form
\[
f(X)=\alpha\left(X_3,X_4\right)+\beta_1\left(X_3,X_4\right)X_1+\beta_2\left(X_3,X_4\right)X_2 .
\]
That is, one might assume that for fixed $(X_3,X_4)$ the form of the predictor is linear in $(X_1,X_2)$, but that the coefficients of that form may change in a flexible way with $(X_3,X_4)$. Fitting might then be approached by locally weighted least squares with only $(X_3,X_4)$ involved in the setting of the weights. That is, one might for each $(x_{30},x_{40})$ minimize, over choices of $\alpha(x_{30},x_{40})$, $\beta_1(x_{30},x_{40})$, and $\beta_2(x_{30},x_{40})$, the weighted sum of squares
\[
\sum_{i=1}^N K_\lambda\left(\left(x_{30},x_{40}\right),\left(x_{i3},x_{i4}\right)\right)\left(y_i-\alpha\left(x_{30},x_{40}\right)-\beta_1\left(x_{30},x_{40}\right)x_{i1}-\beta_2\left(x_{30},x_{40}\right)x_{i2}\right)^2
\]
and then employ the predictor
\[
\hat{f}(x_0)=\hat{\alpha}\left(x_{30},x_{40}\right)+\hat{\beta}_1\left(x_{30},x_{40}\right)x_{10}+\hat{\beta}_2\left(x_{30},x_{40}\right)x_{20}.
\]
This kind of device keeps the dimension of the space where one is doing smoothing down to something manageable.

5.3 Kernel Density Estimation

The balance of Chapter 6 of HTF seems to be a potpourri of topics whose names might include the word "kernel" without any necessary real connection to the "kernel smoothing" that is the chapter's main topic. The principal one of these topics that deserves mention here is that of "kernel density estimation." The problem here is no longer smoothing to produce a fitted $\mathrm{E}[Y|X=x]$, but rather a solution to the problem: given $X_1,X_2,\ldots,X_N$ iid with unknown pdf $f(x)$, how does one estimate $f(x)$? For the time being suppose that $p=1$. For some fixed pdf $g$ (like, for example, the standard normal pdf), invent a location-scale family of densities on $\Re$ by defining, for $\lambda>0$,
\[
g_\lambda\left(x-x_0\right)=\frac{1}{\lambda}\,g\!\left(\frac{x-x_0}{\lambda}\right).
\]
One may, if one then wishes, think of a corresponding "kernel" $K_\lambda(x_0,x)=g_\lambda(x-x_0)$. The Parzen estimate of $f(x_0)$ is then
\[
\hat{f}\left(x_0\right)=\frac{1}{N}\sum_{i=1}^N g_\lambda\left(x_0-X_i\right)=\frac{1}{N}\sum_{i=1}^N K_\lambda\left(x_0,X_i\right).
\]
This is not a kernel-weighted average of outputs, but an average of kernel values. A standard choice of univariate density $g$ is $\phi$, the standard normal pdf. The natural generalization of this to $p$ dimensions is to let $g$ be the $\mathrm{MVN}_p\left(0,\lambda^2 I\right)$ density. One should expect that unless $N$ is huge, this methodology will be reliable only for fairly small $p$ (say 3 at most) as a means of estimating a general $p$-dimensional pdf.
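A minimal sketch of the Parzen estimate with $g=\phi$ follows; the two-component sample and the bandwidth below are hypothetical choices used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
X = np.concatenate([rng.normal(-1.0, 0.5, 150), rng.normal(2.0, 1.0, 150)])
lam = 0.3                                   # bandwidth lambda

def phi(u):
    # standard normal density, the choice g = phi above
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def parzen(x0, X, lam):
    # f-hat(x0) = (1/N) * sum_i (1/lam) * g((x0 - X_i) / lam)
    return np.mean(phi((x0 - X) / lam) / lam)

grid = np.linspace(-3, 5, 9)
print([round(parzen(g, X, lam), 3) for g in grid])
```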
There is some discussion in HTF of classification based on estimated multivariate densities, and in particular on "naive" estimators. That is, for purposes of defining classifiers (not for the purpose of real density estimation), one might treat the elements $X_j$ of $X$ as if they were independent and multiply together kernel estimates of the marginal densities. Apparently, even where the assumption of independence is clearly "wrong," this naive methodology can produce classifiers that work pretty well.

6 Understanding and Predicting Predictor Performance (Ch. 7 of HTF)

There are a variety of theoretical and empirical quantities that might be computed to quantify predictor performance. Those that are empirical and reliably track important theoretical ones might potentially be used to select an effective predictor. We'll here consider some of those theoretical and empirical measures.

We will do so in the by-now-familiar setting where training data $(X_1,Y_1),(X_2,Y_2),\ldots,(X_N,Y_N)$ are assumed to be iid $P$ and independent of $(X,Y)$ (that is also $P$-distributed), and are used to pick a prediction rule $\hat{f}$ to be used under a loss $L(\hat{y},y)\geq 0$. What is very easy to think about and compute is the empirical quantity
\[
\overline{\mathrm{err}}=\frac{1}{N}\sum_{i=1}^N L\left(\hat{f}\left(x_i\right),y_i\right)
\]
that HTF call the training error. This decreases with complexity in the form of $\hat{f}$, and is no reliable indicator of predictor performance off the training set.

Measures of prediction rule performance off the training set must have a theoretical basis, or be somehow based on data held back from the process of prediction rule development. In Section 1.4 the quantity $E_N E[(\hat{f}(X)-Y)^2|X=x]$ was considered and called a "test error" or "generalization error" under squared error loss. (Recall that $E_N$ was used to stand for expectation with respect to the joint distribution of the training data, and $E[\cdot|X=x]$ for conditional expectation and $\mathrm{Var}[Y|X=x]$ for conditional variance based on the conditional distribution of $Y$ given $X=x$.) A variance-bias decomposition of this was provided, one value of $x$ at a time. A shorthand adopted in HTF for this kind of version of test error is
\[
\mathrm{Err}(x)=E_N E\left[L\left(\hat{f}(X),Y\right)\Big|X=x\right].
\]
In Chapter 7, HTF seem to prefer to average this kind of $\mathrm{Err}(x)$ according to the marginal distribution of $X$, say $P_X$. They first define a measure of prediction error that holds the training set fixed and averages out over the distribution of $(X,Y)$. That is, letting $T$ stand for $\{(X_1,Y_1),(X_2,Y_2),\ldots,(X_N,Y_N)\}$, they let
\[
\mathrm{Err}_T=E\,L\left(\hat{f}(X),Y\right) \tag{54}
\]
and call this a "test error" or "generalization error" or "prediction error." Ideally, one would like this kind of measure to be guaranteed to be small for the training set in hand. But control of $\mathrm{Err}_T$ is typically not possible. One has a better chance of controlling the mean of this over the distribution of the training set,
\[
\mathrm{Err}=E_N\,\mathrm{Err}_T,
\]
an expected prediction or test error. Notice that this is also
\[
\mathrm{Err}=E\,\mathrm{Err}(X). \tag{55}
\]
A slightly different and semi-empirical version of this expected prediction error (55) is the "in-sample" test error (7.12) of HTF,
\[
\frac{1}{N}\sum_{i=1}^N \mathrm{Err}\left(x_i\right).
\]

6.1 More on Predictor Squared Bias

The variance-bias decomposition of Section 1.4 is
\[
\mathrm{Err}(x)=\mathrm{Var}_N\,\hat{f}(x)+\left(E_N\hat{f}(x)-E[Y|X=x]\right)^2+\mathrm{Var}[Y|X=x].
\]
Here consider the expected value of the squared bias (the mean of the second term of this decomposition),
\[
E\left(E_N\hat{f}(X)-E[Y|X]\right)^2.
\]
Suppose that the training data $T$ are used to select a prediction rule $\hat{f}$ from some linear subspace of $L_2(P_X)$, say $\mathcal{G}$; that is, $\hat{f}=\hat{g}_T\in\mathcal{G}$. Notice that since subspaces are convex, $\bar{g}\equiv E_N\hat{f}=E_N\hat{g}_T\in\mathcal{G}$. Further suppose that
\[
g^*=\arg\min_{g\in\mathcal{G}} E\left(g(X)-E[Y|X]\right)^2 .
\]
That is, $g^*$ is the $L_2(P_X)$ projection of $E[Y|X]$ onto $\mathcal{G}$. Then write $h^*(x)=E[Y|X=x]-g^*(x)$, so that $E[Y|X=x]=g^*(x)+h^*(x)$. Now $h^*\perp\mathcal{G}$ in $L_2(P_X)$, and thus $h^*\perp g^*$ and $h^*\perp\bar{g}$, so that
\[
E\left(E_N\hat{f}(X)-E[Y|X]\right)^2 = E\left(\bar{g}(X)-g^*(X)-h^*(X)\right)^2 = E\left(\bar{g}(X)-g^*(X)\right)^2+E\,h^*(X)^2,
\]
the cross term vanishing by the orthogonality. The term $E\,h^*(X)^2$ is an average squared "model bias." It is controlled by the size of the class of models fit, i.e. the size of $\mathcal{G}$. On the other hand, $E\left(\bar{g}(X)-g^*(X)\right)^2$ is an average squared "estimation bias." This is controlled by the fitting method. One may be willing to trade an increase in it off against a decrease in $E\,\mathrm{Var}_N\hat{f}(X)$ in the search for a good predictor.

6.2 Optimism of the Training Error

Typically $\overline{\mathrm{err}}$ is less than $\mathrm{Err}_T$. Part of the difference in these is potentially due to the fact that $\mathrm{Err}_T$ is an "extra-sample" error, in that the averaging in (54) is potentially over values $x$ outside the set of values in the training data. We might consider instead
\[
\mathrm{Err}_{\mathrm{in}}=\frac{1}{N}\sum_{i=1}^N E_{Y'}\,L\!\left(Y_i',\hat{f}\left(x_i\right)\right), \tag{56}
\]
where the expectations indicated in (56) are over $Y_i'\sim P_{Y|X=x_i}$; the entire training sample used to choose $\hat{f}$ (both inputs and outputs) is being held constant in the averaging in (56). The difference
\[
\mathrm{op}=\mathrm{Err}_{\mathrm{in}}-\overline{\mathrm{err}}
\]
is called the "optimism" of the training error. HTF use the notation
\[
\omega=E_Y\,\mathrm{op}=E_Y\left(\mathrm{Err}_{\mathrm{in}}-\overline{\mathrm{err}}\right), \tag{57}
\]
where the averaging indicated by $E_Y$ is over the outputs in the training set (using the conditionally independent $Y_i\sim P_{Y|X=x_i}$). HTF say that for many losses
\[
\omega=\frac{2}{N}\sum_{i=1}^N \mathrm{Cov}\left(\hat{y}_i,Y_i\right). \tag{58}
\]
For example, consider the case of squared error loss. There
\[
\omega=E_Y\,\mathrm{Err}_{\mathrm{in}}-E_Y\,\overline{\mathrm{err}}
= E_Y\frac{1}{N}\sum_{i=1}^N E_{Y'}\left(Y_i'-\hat{f}\left(x_i\right)\right)^2 - E_Y\frac{1}{N}\sum_{i=1}^N\left(Y_i-\hat{f}\left(x_i\right)\right)^2
= \frac{2}{N}\sum_{i=1}^N\left(E_Y\!\left[Y_i\hat{f}\left(x_i\right)\right]-E\left[Y|X=x_i\right]E_Y\hat{f}\left(x_i\right)\right)
= \frac{2}{N}\sum_{i=1}^N \mathrm{Cov}\left(\hat{y}_i,Y_i\right).
\]
The quantity $\sum_{i=1}^N\mathrm{Cov}\left(\hat{y}_i,Y_i\right)$ turns out to have an important connection to earlier developments. Consider linear predictors for which $\hat{Y}=MY$. Then, under a model that says that $\mathrm{Var}_Y\,Y=\sigma^2 I$,
\[
\mathrm{Var}_Y\begin{pmatrix}Y\\ \hat{Y}\end{pmatrix}=\mathrm{Var}_Y\begin{pmatrix}Y\\ MY\end{pmatrix}=\sigma^2\begin{pmatrix} I & M'\\ M & MM'\end{pmatrix},
\]
and in this context the terms $\mathrm{Cov}\left(\hat{y}_i,Y_i\right)$ are the diagonal elements of the off-diagonal block $\sigma^2 M$ of this covariance matrix. But that means that in the context of linear predictors
\[
\sum_{i=1}^N\mathrm{Cov}\left(\hat{y}_i,Y_i\right)=\sigma^2\,\mathrm{tr}\,M .
\]
For linear predictors we have called $\mathrm{tr}\,M$ the degrees of freedom of the fit. This suggests that a plausible general definition, for any fitting method producing values $\hat{Y}$, is
\[
df\left(\hat{Y}\right)=\frac{1}{\sigma^2}\sum_{i=1}^N\mathrm{Cov}\left(\hat{y}_i,Y_i\right).
\]
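To make this general definition concrete, here is a small Monte Carlo sketch (a hypothetical setup, not anything in HTF) estimating $df(\hat{Y})=\frac{1}{\sigma^2}\sum_i\mathrm{Cov}(\hat{y}_i,Y_i)$ for a $k$-nearest-neighbor fit by simulating many training output vectors at a fixed set of inputs; for $k$-NN the answer should come out near the familiar $N/k$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, k, sigma, reps = 50, 5, 1.0, 2000
x = np.sort(rng.uniform(0, 1, N))
mu = np.sin(3 * x)                          # true E[Y | X = x_i], held fixed

def knn_fit(y, k):
    # k-nearest-neighbor fitted values at the training inputs x (each point is its own neighbor)
    d = np.abs(x[:, None] - x[None, :])     # pairwise input distances
    nbrs = np.argsort(d, axis=1)[:, :k]     # each point's k nearest inputs
    return y[nbrs].mean(axis=1)

Y = mu + sigma * rng.normal(size=(reps, N))         # reps independent training output vectors
Yhat = np.array([knn_fit(yr, k) for yr in Y])       # corresponding fitted values
# empirical Cov(yhat_i, Y_i), summed over i, then divided by sigma^2
cov_sum = np.sum(np.mean((Yhat - Yhat.mean(axis=0)) * (Y - mu), axis=0))
print(cov_sum / sigma**2)                           # should be near N / k = 10
```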
6.3 $C_p$, AIC, and BIC

The fact that $\mathrm{Err}_{\mathrm{in}}=\overline{\mathrm{err}}+\mathrm{op}$ suggests the making of estimates of $\omega=E_Y\,\mathrm{op}$ and the use of
\[
\overline{\mathrm{err}}+\hat{\omega} \tag{59}
\]
as a guide in model selection. This idea produces consideration of the model selection criteria $C_p$, AIC, and BIC.

6.3.1 $C_p$ and AIC

For the situation of least squares fitting with $p$ predictors (or basis functions) and squared error loss,
\[
df=p=\mathrm{tr}\left(X\left(X'X\right)^{-1}X'\right),
\]
so that $\sum_i\mathrm{Cov}(\hat{y}_i,Y_i)=p\,\sigma^2$. Then, if $\hat{\sigma}^2$ is an estimated error variance based on a low-bias (high number of predictors) fit, a version of (59) suitable for this context is Mallows'
\[
C_p=\overline{\mathrm{err}}+\frac{2p}{N}\hat{\sigma}^2 .
\]
In a more general setting, if one can appropriately evaluate or estimate $\sum_i\mathrm{Cov}(\hat{y}_i,Y_i)=df(\hat{Y})\,\sigma^2$, a general version of (59) becomes the Akaike information criterion
\[
AIC=\overline{\mathrm{err}}+\frac{2}{N}\sum_{i=1}^N\mathrm{Cov}\left(\hat{y}_i,Y_i\right)=\overline{\mathrm{err}}+\frac{2}{N}\,df\left(\hat{Y}\right)\hat{\sigma}^2 .
\]

6.3.2 BIC

For situations where fitting is done by maximum likelihood, the Bayesian Information Criterion of Schwarz is an alternative to AIC. That is, where the joint distribution $P$ produces density $p(y|\theta,x)$ for the conditional distribution of $Y$ given $X=x$, and $\hat{\theta}$ is the maximum likelihood estimator of $\theta$, a maximized log-likelihood is
\[
\mathrm{loglik}=\sum_{i=1}^N\log p\left(y_i|\hat{\theta},x_i\right)
\]
and the so-called Bayesian information criterion is
\[
BIC=-2\,\mathrm{loglik}+\left(\log N\right)df .
\]
For $Y|X$ normal with variance $\sigma^2$, up to a constant this is
\[
BIC=\frac{N}{\hat{\sigma}^2}\left(\overline{\mathrm{err}}+\frac{\log N}{N}\,df\left(\hat{Y}\right)\hat{\sigma}^2\right),
\]
and after switching 2 for $\log N$, BIC is a multiple of AIC. The replacement of 2 with $\log N$ means that when used to guide model selection, BIC will typically favor simpler models than will AIC.

The Bayesian origins of BIC can be developed as follows. Suppose that $M$ models are under consideration, the $m$th of which has parameter vector $\theta_m$ and corresponding density for the training data $f_m(T|\theta_m)$, with prior density $g_m(\theta_m)$ for $\theta_m$ and prior probability $\pi(m)$ for model $m$. With this structure the posterior distribution of the model index is
\[
\pi\left(m|T\right)\propto \pi(m)\int f_m\left(T|\theta_m\right)g_m\left(\theta_m\right)d\theta_m .
\]
Under 0-1 loss and uniform $\pi(m)$, one wants to choose the model $m$ maximizing
\[
f_m(T)=\int f_m\left(T|\theta_m\right)g_m\left(\theta_m\right)d\theta_m,
\]
the $m$th marginal of $T$. Apparently the so-called Laplace approximation says that
\[
\log f_m(T)=\log f_m\left(T|\hat{\theta}_m\right)-\frac{d_m}{2}\log N+O(1),
\]
where $d_m$ is the (real) dimension of $\theta_m$. Assuming that the marginal of $X$ doesn't change model to model or parameter to parameter, $\log f_m(T|\hat{\theta}_m)$ is $\mathrm{loglik}+C_N$, where $C_N$ is a function of only the input values in the training set. Then
\[
-2\log f_m(T)\approx -2\,\mathrm{loglik}+\left(\log N\right)d_m+O(1)-2C_N = BIC+O(1)-2C_N,
\]
and at least approximately, choosing $m$ to maximize $f_m(T)$ is choosing $m$ to minimize BIC.

6.4 Cross-Validation Estimation of Err

This is an attempt to directly estimate
\[
\mathrm{Err}=E_N\,\mathrm{Err}_T=E_N E\,L\left(\hat{f}(X),Y\right).
\]
$K$-fold cross-validation consists of

1. breaking the training set into $K$ disjoint, roughly equal-sized pieces, say $T_1,T_2,\ldots,T_K$,

2. training on each of the reduced training sets $T-T_k$ to produce $K$ predictors $\hat{f}_k$, and

3. letting $k(i)$ be the index of the piece $T_k$ containing training case $i$ and computing the cross-validation error
\[
CV\left(\hat{f}\right)=\frac{1}{N}\sum_{i=1}^N L\left(y_i,\hat{f}_{k(i)}\left(x_i\right)\right).
\]
This is roughly the same as fitting on each $T-T_k$, correspondingly evaluating on $T_k$, and then averaging. (When $N$ is a multiple of $K$ these are exactly the same.) One hopes that $CV(\hat{f})$ approximates $\mathrm{Err}$. The choice $K=N$ is called "leave one out" cross-validation, and sometimes in this case there are slick computational ways of evaluating $CV(\hat{f})$. In model selection, say where the predictor $\hat{f}$ has a tuning parameter $\alpha$, we look at $CV(\hat{f}_\alpha)$ as a function of $\alpha$, try to optimize, and then refit with the chosen $\alpha$ on the whole training set.

$K$-fold cross-validation can be expected to estimate $\mathrm{Err}$ not for training sets of size $N$, but for training sets of size $(1-1/K)N<N$. The question of how cross-validation might be expected to do is thus related to how $\mathrm{Err}$ changes with $N$, the size of the training sample. Typically $\mathrm{Err}$ decreases monotonically in $N$, approaching some limiting value as $N$ goes to infinity. The "early" (small $N$) part of the "$\mathrm{Err}$ vs $N$" curve is steep, and the "late" part (large $N$) is relatively flat. If $(1-1/K)N$ is large enough that at such a training-set size the curve is flat, then the effectiveness of cross-validation is limited only by the noise inherent in estimating it, and not by the fact that training sets of size $(1-1/K)N$ are not of size $N$. Operationally, $K=5$ or $10$ seem standard, and a common rule of thumb is to select for use the predictor of minimum complexity whose cross-validation error is within one standard error of the minimum $CV$ for the predictors $\hat{f}$ under consideration. Presumably "standard error" here refers to $K^{-1/2}$ times a sample standard deviation of the $K$ values $\frac{K}{N}\sum_{i:\,k(i)=k}L\left(y_i,\hat{f}_k(x_i)\right)$, though it's not completely obvious which complexity's standard error is under discussion when one says this.
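Here is a minimal sketch of $K$-fold cross-validation under squared error loss for a ridge-style linear predictor indexed by a tuning parameter $\lambda$; the simulated data, the $\lambda$ grid, and the ridge fit itself are hypothetical choices meant only to illustrate the bookkeeping, including the standard error figuring in the rule of thumb just described.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, K = 100, 5, 10
X = rng.normal(size=(N, p))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5]) + rng.normal(size=N)

def ridge_coef(Xtr, ytr, lam):
    # a simple ridge-style linear fit indexed by the tuning parameter lam
    return np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(Xtr.shape[1]), Xtr.T @ ytr)

folds = np.arange(N) % K          # k(i): which of the K pieces case i belongs to
rng.shuffle(folds)

for lam in [0.01, 0.1, 1.0, 10.0]:
    fold_losses = []
    for k in range(K):
        train, test = folds != k, folds == k
        b = ridge_coef(X[train], y[train], lam)                    # f-hat_k fit on T - T_k
        fold_losses.append(np.mean((y[test] - X[test] @ b) ** 2))  # evaluated on T_k
    cv = np.mean(fold_losses)                                      # CV(f-hat_lambda)
    se = np.std(fold_losses, ddof=1) / np.sqrt(K)                  # rough "one standard error" scale
    print(f"lambda={lam}: CV={cv:.3f}  se={se:.3f}")
```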
HTF say that for many linear fitting methods that produce $\hat{Y}=MY$ (including least squares projection and cubic smoothing splines), the $K=N$ leave-one-out cross-validation error is, for $\hat{f}^{-i}$ produced by training on $T$ with case $i$ removed,
\[
CV\left(\hat{f}\right)=\frac{1}{N}\sum_{i=1}^N\left(y_i-\hat{f}^{-i}\left(x_i\right)\right)^2=\frac{1}{N}\sum_{i=1}^N\left(\frac{y_i-\hat{f}\left(x_i\right)}{1-M_{ii}}\right)^2
\]
for $M_{ii}$ the $i$th diagonal element of $M$. The so-called generalized cross-validation approximation to this is the much more easily computed
\[
GCV\left(\hat{f}\right)=\frac{1}{N}\sum_{i=1}^N\left(\frac{y_i-\hat{f}\left(x_i\right)}{1-\mathrm{tr}(M)/N}\right)^2 .
\]
It is worth noting (per HTF Exercise 7.7) that since $1/(1-x)^2\approx 1+2x$ for $x$ near 0,
\[
GCV\left(\hat{f}\right)\approx\frac{1}{N}\sum_{i=1}^N\left(y_i-\hat{f}\left(x_i\right)\right)^2+2\,\frac{\mathrm{tr}(M)}{N}\cdot\frac{1}{N}\sum_{i=1}^N\left(y_i-\hat{f}\left(x_i\right)\right)^2,
\]
which is close to AIC, the difference being that here $\sigma^2$ is effectively being estimated based on the model being fit (as opposed to being estimated from a low-bias/large model).

6.5 Bootstrap Estimation of Err

Suppose that the values of the input vectors in the training set are unique. One might make $B$ bootstrap samples of $N$ (random samples with replacement, of size $N$, from the training set $T$), say $T^{*1},\ldots,T^{*B}$, and train on these bootstrap samples to produce predictors, say predictor $\hat{f}^{*b}$ based on $T^{*b}$. Let $C_i$ be the set of indices $b\in\{1,2,\ldots,B\}$ for which $(x_i,y_i)\notin T^{*b}$. A possible bootstrap estimate of $\mathrm{Err}$ is then
\[
\widehat{\mathrm{Err}}^{(1)}=\frac{1}{N}\sum_{i=1}^N\frac{1}{\left|C_i\right|}\sum_{b\in C_i}L\left(y_i,\hat{f}^{*b}\left(x_i\right)\right).
\]
It's not completely clear what to make of this. For one thing, the $T^{*b}$ do not have $N$ distinct elements. In fact, it's fairly easy to see that the expected number of distinct cases in a bootstrap sample is about $.632N$. That is, note that
\[
P\left[\text{case }i\text{ is in a particular bootstrap sample}\right]=1-\left(1-\frac{1}{N}\right)^N\approx 1-\exp(-1)=.632,
\]
so the expected number of distinct cases in $T^*$ is $N$ times the probability that a given case is
in T So roughly speaking we might expect EU to estimate Err at 632N not at N So unless Err as a function of training set size is fairly at to the right of 632N one might expect substantial positive bias in it as an estimate of Err at N HTF argue for A 632 A0 Err E 368 632 Err as a rst order correction on the biased bootstrap estimate but admit that this is not perfect either and propose a more complicated x that they call A 632 Err for classi cation problems It is hard to see What real advantage the bootstrap has over Kefold cross validation especially since one Will typically want B to be substantially larger than usual values for K 7 Inference and Model Averaging Ch 8 of HTF Much of Chapter 8 of HTF is generalities about ordinary statistical inference common in any Statistics graduate program We ll not say much about them both because they are already in your background and also because they are really off the basic machine learning path of prediction One part of this that does perhaps merit a bit of exposition is Bayes development of predictors The balance of the chapter is then about quotmodel averaging 71 Bayes Predictors Consider rst a very simple context Where P has parameter 0 and therefore the conditional mean of the output is a parametric in 0 function of at E9 le w m m Suppose that the conditional distribution of Y X a has density f9 Then for a prior distribution on H With density 9 H the posterior has density 9 H T Olt f9 H For any xed a this posterior distribution induces a corresponding distribution on m Then a sensible predictor based on the Bayes structure is Hz ElleH m ml999lTd9 A second more speci c and complicated application of essentially Bayesian thinking to the development of a predictor might be based the use of a Gaussian process as a more or less noneparametric quotprior distribution for the function EYlX That is for purposes of developing a predictor suppose that one assumes that y 7190 6 56 where 7190 w V E6 0 Vare 0392 the function y is known it could be taken to be identi7 cally 0 and plays the role of a prior mean for was 7 mix 7 w and independent of the errors 6 y is a realization of a mean 0 stationary Gaussian process on SW this Gaussian process describing the prior uncertainty for r around p More explicitly the assumption on y is that E y 0 and Var y 7392 for all z and for some appropriate correlation function p Cov y 7 72 z 7 z for all a and z p 0 1 and the function of two variables p z 7 z must be positive de nite The quotGaussianquot part of the assumptions is then that we assume that for any nite set of element of ERP say 21 22 2M the corresponding values 7 are multivariate normal There are a number of standard forms that have been suggested for the correlation function p The simplest ones are of a product form ie if pj is a valid one7dimensional correlation function then the product is a valid correlation function for a Gaussian process on SW Standard forms for correlation functions in one dimension are p A exp 70A2 and p A exp 7clAl The rst produces quotsmootherquot realizations than does the second and in both cases the constant c governs how fast realizations vary One may then consider the joint distribution conditional on the m and assuming that for the training values 31 the 6239 are iid independent of the y of the training output values and a value of r From this one can nd the conditional mean for r given the training data To that end let 2 7392p13i7 2712 N NgtltN j1 2 N Then for a single value of a 111 061 212 II m2 2 021 2 z N MVNN1 7 E 72 
ZIN m 7196 w for Tp z 7 m1 T P 90 i 902 E Ngtlt1 T299090N Then standard multivariate normal theory says that the conditional mean of r given Y is fz p2m 2021 60 luv 7 l1 061v Write 111 7 061 N131 2 021 1 112 7 7 61 211v I LVEN and then note that 60 implies that f 90 Na ZWMZP 90 i 90239 62 and we see that this development ultimately produces p plus a linear com7 bination of the quotbasis functionsquot 72 z 7 as a predictor Remembering that 72 z 7 z must be positive de nite and seeing the ultimate form of the predictor7 we are reminded of the RKHS material in Section 44 In fact7 consider the case where p E 0 If one has some non7zero prior mean for r z7 arguably that mean function should be subtracted from the raw training outputs before beginning the development of a predictor At a mini7 mum7 output values should probably be centered before attempting development of a predictor Compare 61 and 62 to 44 and 45 for the p 0 case What is then clear is that the present quotBayesquot Gaussian process development of a predictor under squared error loss based on a covariance function 72 z 7 z and error variance 0392 is equivalent to a RKHS regularized t of a function to training data based on a kernel K a z 72 z 7 z and penalty weight A 0392 72 Predictors Built on Bootstrap Samples One might make B bootstrap samples of N random samples with replacement of size N from the training set T say 717 T51 and train on these boot7 strap samples to produce7 say7 predictor fb based on T Now rather than using these to estimate the prediction error as in Section 65 consider a couple of ways of using these to build a predictor under squared error loss 721 Bagging The possibility considered in Section 87 of HTF is quotbootstrap aggregation or quotbaggingquot This is use of the predictor f z 2 2 f 90 Notice that even for xed training set T and input a this is random varying with the selection of the bootstrap samples One might let E denote averaging over the creation of a single bootstrap sample and fquot be the predictor derived from such a bootstrap sample and think of Ef as the quottruequot bagging predictor that has the simulationebased approximation fbag One is counting on a law of large numbers to conclude that fbag gt Ef as B A 00 Note too that unless the operations applied to a training set to produce f are linear Ef will differ from the predictor computed from the training data f Where a loss other than squared error is involved exactly how to quotbagquot bootstrapped versions of a predictor is not altogether obvious and apparently even what might look like sensible possibilities can do poorly 722 Bumping Anotherdifferent thing one might do with bootstrap versions of a predictor is to quotpick a winner based on performance on the training data This is the quotbumpingquotstochastic perturbation idea of HTF s Section 89 That is let f0 f be the predictor computed from the training data and de ne Z 11239 7 f 002 and take A fbump 90 f 90 The idea here is that if a few cases in the training data are responsible for making a basically good predictor perform poorly eventually a bootstrap sample will miss those cases and produce an effective predictor 73 Model Averaging and Stacking The bagging idea averages versions of a single predictor computed from differ ent bootstrap samples An alternative might be to somehow weight together different predictors potentially even based on different models 731 Bayesian Model Averaging One theoretically straightforward way to justify this kind of enterprise and idene tify 
candidate predictors is through the Bayes quotmultiple model scenario of Sece tion 632 That is again suppose that M models are under consideration the mth of which has parameter vector Hm and corresponding density for training data f m Tlgm with prior density for Hm gm 9m and prior probability for model m 77 m As before the posterior distribution of the model index is speci ed by W 21 7T f fm Tlgm 9m Hm dam Now suppose that for each model there is a quantity ym Hm that is of in terest and having a common interpretation across models Notice that under the current assumption that P depends upon the parameter Hm in model in ym Hm could be the conditional mean of Y given X a That is one impore tant instance of this structure is 77 Hm z Eg Y X at Since in model in the density of Hm conditional on T is fm TM 9m 9m mamfg the conditional mean of ym Hm in model in is 7 f Ym Hm fm Tlgm 9m Hm dam Emmi W77 7 f fm T Hmwm 07 de and a sensible predictor of ym Hm is 77m T Emmi lTl Z 7TmlTEVm 9m W77 7111 a modelposterior probability weighted average of the predictors Eh Hm m T individually appropriate under the M di ereht models In the important special case where ym Hm z Eg Y X z this is M memm ZWW MWW W m1 60 732 Stacking What might be suggested like this7 but With a less Bayesian avor Supe pose that M predictors are available all based on the same training data7 f17f27fM Under squared error loss7 I might seek a weight vector 11 for Which the predictor f z Z wmfm 90 is effective I might even let w depend upon at producing f z 2 mm as fm z HTF pose the problem of perhaps nding M 2 17 arg minENE Y 7 Z wmfm 7111 or even M 2 17 arg minENE Y 7 Z wmfm lX m w m1 Using either the PN X P joint distribution of T X7 Y or the PN gtlt PY XI conditional distribution of T7Y given X a one may consider the random vector f1 7 f2 7fM Y or the random vector f1 7 f2 7fM 7Y Let f7 Y stand for either one and f 7 Ef E lt Y 7 BY M1gtlt1 be the corresponding mean vector and E W M X M be the matrix of expected products of the predictions Upon Writing out the expected square to be minimized and doing some matrix calculus7 it s possible to see that optimal weights are of the form a EY E ff 1 Ef or a z is of this form We observe that HTF s notation in 857 could be interpreted to incorrectly place the inverse sign inside the expectation They call this quotthe population linear regression of Y on f Of course7 none of this is usable in practice7 61 as typically the mean vector and expected cross product matrix are unknown And apparently the simplest empirical approximations to these do not work well What instead HTF seem to recommend is to de ne N M 2 argmin 7 2 mm am quot i1 m1 for the mth predictor t to T 7 the training set with the ith case removed Then ultimately the quotstackedquot predictor of Y is M f z Z wiff kfm z m1 8 Additive Models Trees and Related Methods Ch 9 of HTF 8 1 Additive Models Section 91 of HTF is the book s formal discussion of additive models The authors have already alluded to them repeatedly The main points of the section are 1 Additive models are highly exible One may a predict continuous outputs Y using squared error loss and 071 clas si cation variables using a negative logelikelihood loss an 9 P Y llX an additive form ie P Y llX min at S 11 S max at 9 1 an additive form i12 N i12 N for some known link function g b mix types of predictors continuous categorical and types of funce tions of them in the additive form see page 297 to produce all sorts of interesting models including semieparametric ones and 
ones with low order interactions 2 Fitting is possible via the backe tting algorithm That is if I am to t under SEL fa2fzl for all some part of a I might a set a yi cycle through112L12 7 tting via some appropriate often linear operation I z 2 fl to data mi7yii12 N for y yi 7 azjfm90 711 where the fm are the current versions of the tted summands 7 setting f the newly tted versionithe sample mean of this newly tted version across all in theory this is not necessary but it is here to prevent numericalrounde off errors from causing the fm to drift up and down by additive constants summing to 0 across m until convergence There is a logistic regression version of this on page 300 of HTF 3 This material may not be all that helpful for large p as there is nothing here that does any thresholding or automatic variable selection 8 2 TreeBased Methods This is something genuinely new to our discussion We ll rst consider the SELregression version This begins with a forwardeselectionquotgreedyquot algoe rithm for inventing predictions constant on pedimensional rectangles by succese sively looking for an optimal binary split of a single one of an existing set of rectangles For example with p 2 one begins with a rectangle de ned by a min at S11 S max 1b i12 N i12 N and 0 min 12239 S 12 S max 12 d 112 N 112 N and looks for a way to split it at 1 51 or 2 51 so that the resulting two rectangles minimize the training error SSE Z Z yr receeeglf rectangles 239 With a m t e rectangle One then splits optimally one of the now two rectangles at 1 52 or 2 52 etc Where I rectangles R1 R2 R in SW have been created and m the index of the rectangle to which a belongs the corresponding predictor is 1 m M fl m 239 With a m mm and in this notation the training error is N 2 SSE 2 y 7 fl am 2391 If one is to continue beyond I rectangles one then looks for a value 5 to split one of the existing rectangles R1 R2 R on 1 or 2 or or app and thereby produce the greatest reduction in SSE We note that there is no guarantee that after I splits one will have the best in terms of SSE possible set of l 1 rectangles Any series of binary splits of rectangles can be represented graphically as a binary tree each split represented by a node where there is a fork and each nal rectangle by an end node It is convenient to discuss rectangleesplitting and quotunsplittingquot in terms of operations on corresponding binary trees and henceforth we adopt this language It is of course possible to continue splitting rectanglesadding branches to a tree until every distinct a in the training set has its own rectangle But that is not helpful in a practical sense in that it corresponds to a very low biashigh variance predictor So how does one nd a tree of appropriate size The standard recommendation seems to be this 1 Grow a tree on a given data set until the cell with the fewest training mi contains 5 or less such points then N quotprunequot the tree in 1 back by for each a gt 0 a complexity parameter weighting SSE against complexity de ned in terms of tree size minimize ing over choices of subetrees the quantity Ca T m a SSE T for in the obvious way lTl the number of nal nodes in the candidate tree and SSE T the error sum of squares for the corresponding predictor 03 pick a good value of a the smallest one that has average performance quotclosequot to the best SSE by crossevalidation call this amp and r then operating on the entire training set apply 1 and nd the subtree say Ta optimizing 03 T and use it 64 The question of how to execute 2 short of an 
exhaustive search over sub7 trees for every different value of a has a workable answer There is a relatively small number of nested candidate sub7trees of an original tree that are the only ones that are possible minimizers of Ca T7 and as a decreases one moves through that nested sequence of sub7trees from the largestoriginal tree to the smallest Suppose that one begins with a large tree7 T0 with lTOl nal nodes One may quickly search over all quotprunedquot versions of T0 sub7trees T created by removing a node where there is a fork and all branches that follow below it and nd the one with minimum SSE T 7 SSE T0 W 7 W This IS the per node of the lopped off branch of the rst tree increase in Call that sub7tree T1 T0 is the optimizer of Ca T over sub7trees of T0 for every a 2 SSE T1 7 SSE T0 lTol 7 lTll7 but at that value of a the optimizing sub7tree switches to T1 ne then may search over all quotprunedquot versions of T1 for the one with minimum SSE T 7 SSE T1 lTll 7 W and call it T2 T1 is the optimizer of of Ca T over sub7trees of T0 for every SSE T2 7 SSEltT1gtgtltT1 7 szl s a SSE T1 7 SSETolTol7lT1l but at a SSE T2 7 SSE T1 lTll 7 ngl the optimizing sub7tree switches to T27 and so on In this process7 if there ever happens to be a tie among sub7 trees in terms of a minimizing a ratio of increase in error sum of squares per decrease in number of nodes7 one chooses the sub7tree with the smaller For Ta optimizing Ca T7 the function of a 7 Ca Ta mainCa T is piecewise linear in a and both it and the optimizing nested sequence of sub7trees can be computed very e eiciently in this fas ion The continuous Y7 SEL version of this stuff is known as the making of quotregression trees The quotclassi cation trees version of it is very similar One needs only to de ne an empirical loss to associate with a given tree parallel to SSE used above To that end7 we rst note that in a K class problem where Yquot takes values in 9 12 7K corresponding to a particular rectangle Rm is the fraction of training vectors with classi cation k 1 A 1 t k pmk training input vectors in Rm V Z yz 2 With 11 m Rm and a plausible g7valued predictor based on I rectangles is fz 96 arg mama keg 65 the class that is most heavily represented in the rectangle to Which a belongs The empirical miseclassi cation rate for this predictor is 2 1 1 l A err i lyi fz Ml NW Nm 1 Pmmm 1 Where Nm training input vectors in Rm7 and k arg maxm Two related alternative forms for a training error are quotthe Gini indlex 1 l K A A err W EN gpmk 1 717mm and the soecalled quotcross entropy l K m 7 Z Nm Zmlnltmgtgt m1 k1 Upon adopting one of these forms for m and using it to replace SSE in the regression tree discussion7 one has a classi cation tree methodology HTF suggest using the Gini index or cross entropy for tree growing and any of the indices but most typically the empirical miseclassi cation rate for tree pruning according to costecomplexity 83 PRIM Patient Rule Induction Method This is another rectangleebased method of making a predictor on SW It could be termed a type of quotbumpehuntingquot For a series of rectangles or boxes in pispace R17R27H7Rz one de nes a predictor 3R1 if z E R1 rm if m e R2 7 R1 fz w yRKUEle if z e Rm 7 ukmngk 3U21Rkc if z U1Rk The boxes or rectangles are de ned recursively in a way intended to catch quotthe remaining part of the input space With the largest output values That is7 to nd R1 1 identify a rectangle l l 1 S appltup that includes all input vectors in the training set N identify a dimension j and either lj or uj 
so that by reducing uj or increasing lj just enough to remove a fraction a say a 1 of the training vectors currently in the rectangle the largest value of yrectangle possible is produced and update that boundary of the rectangle 03 repeat 2 until some minimum number of training inputs mi remain in the rectangle say at least 10 r expand the rectangle in any direction increase a u or decrease an 1 adding a training input vector that provides a maximal increase in Erectangle and 01 repeat 4 until no increase is possible by adding a single training input vector This produces R1 For what it is worth step 2 is called quotpeelingquot and step 4 is called quotpastingquot Upon producing R1 one removes from consideration all training vectors with m E R1 and repeats 1 through 5 to produce R2 This continues until a desired number of rectangles has been created One may pick an appropriate number of rectangles l is a tuning parameter by crossevalidation and then apply the procedure to the whole training set to produce a set of rectangles and predictor on pespace that is pieceewise constant on regions built from boolean operations on rectangles 84 MARS Mulitvariate Adapative Regression Splines This is a highedimensional regression methodology based on the use of quothockey stick functions and their products as basis functions That is with input space ERP we consider basis functions built on the Np pairs of functions hm 90 90139 WM and 1127290 902739 WM 63 acij is the jth coordinate of the ith input training vector and both hijl and hijg depend on a only through the jth coordinate of MARS builds predictors sequentially making use of these quotre ected pairs roughly as follows 67 1 Begin With f0 50 2 Identify a pair 63 so that 50 511 1271 m 12hij1m has the best SSE possible Call the selected functions 911 hm and 912 hij2 and set f1 50 511911 512912 At stage I of the predictorebuilding process With predictor 03 121 m 30 mwml z 5m29m2 an in hand consider for addition to the model pairs of basis functions that are either of the basic form 63 or of the form hij19m1 and hij2gm1 or of the form hij19m2 and hi 9 gm2 90 for some m lt 1 subject to the constraints that no acj appears in any candidate product more than once maintaining the pieceewise linearity of sections of the predictor Additionally one may decide to put an upper limit on the order of the products considered for inclusion in the predictor The best candidate pair in terms of reducing SSE gets called say 911 and 912 and one sets fl 50 Z 5m19m1 5m29m2 One might pick the tuning parameter I by crossvalidation but the standard implementation of MARS apparently uses instead a kind of generalized cross validation error N 2 GOV 211 11239 fl 0 lt1 7 L N Where M l is some kind of degrees of freedom gure One must take account of both the tting of the coef cients in this and the fact that knots values acij have been chosen The HTF recommendation is to use M l 21 2 or 3 the number of different knots chosen Where presumably the knot count refers to different acij appearing in at least one gml or gmg 9 Boosting Cth of HTF The basic idea here is to take even a fairly simple and poorly performing form of predictor and by sequentially modifyingperturbing it and reeweighting or modifying the training data set to creep toward an effective predictor 91 The AdaBoost Algorithm Consider a 27class 11 loss classi cation problem We ll suppose that output Yquot takes values in g 71 1 this coding of the states will be more convenient than 0 1 or 1 2 codings for our present 
purposes. The AdaBoost.M1 algorithm is built on some base classifier $\hat{f}$ that could be, for example, the simplest possible tree classifier, possessing just 2 final nodes. It proceeds as follows.

1. Initialize weights on the training data at $w_i^{(1)}=1/N$ for $i=1,2,\ldots,N$.

2. Fit a $\mathcal{G}$-valued predictor/classifier $\hat{f}_1$ to the training data to optimize $\sum_{i=1}^N w_i^{(1)} I\left[y_i\neq \hat{f}_1(x_i)\right]$; let
\[
\overline{\mathrm{err}}_1=\frac{\sum_{i=1}^N w_i^{(1)} I\left[y_i\neq\hat{f}_1\left(x_i\right)\right]}{\sum_{i=1}^N w_i^{(1)}} \quad\text{and}\quad \alpha_1=\ln\left(\frac{1-\overline{\mathrm{err}}_1}{\overline{\mathrm{err}}_1}\right).
\]

3. Set new weights on the training data,
\[
w_i^{(2)}=w_i^{(1)}\exp\left(\alpha_1 I\left[y_i\neq\hat{f}_1\left(x_i\right)\right]\right)\ \text{ for }i=1,2,\ldots,N .
\]
This up-weights misclassified observations by a factor of $(1-\overline{\mathrm{err}}_1)/\overline{\mathrm{err}}_1$.

4. For $m=2,\ldots,M$:

(a) fit a $\mathcal{G}$-valued predictor/classifier $\hat{f}_m$ to the training data to optimize $\sum_{i=1}^N w_i^{(m)} I\left[y_i\neq\hat{f}_m(x_i)\right]$,

(b) let
\[
\overline{\mathrm{err}}_m=\frac{\sum_{i=1}^N w_i^{(m)}I\left[y_i\neq\hat{f}_m\left(x_i\right)\right]}{\sum_{i=1}^N w_i^{(m)}},
\]

(c) set
\[
\alpha_m=\ln\left(\frac{1-\overline{\mathrm{err}}_m}{\overline{\mathrm{err}}_m}\right),\ \text{and}
\]

(d) update weights as
\[
w_i^{(m+1)}=w_i^{(m)}\exp\left(\alpha_m I\left[y_i\neq\hat{f}_m\left(x_i\right)\right]\right)\ \text{ for }i=1,2,\ldots,N .
\]

5. Output a resulting classifier that is based on "weighted voting" by the classifiers $\hat{f}_m$, namely
\[
\hat{f}(x)=\mathrm{sign}\left(\sum_{m=1}^M\alpha_m\hat{f}_m(x)\right)=1-2I\left[\sum_{m=1}^M\alpha_m\hat{f}_m(x)<0\right].
\]
Classifiers with small $\overline{\mathrm{err}}_m$ get big positive weights in the "voting." This apparently turns out to be a very effective way of developing a classifier, and has been called the best existing "off-the-shelf" methodology for the problem.
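A minimal sketch of AdaBoost.M1 with two-node "stumps" as the base classifier follows; the simulated two-class data, the exhaustive stump search, and $M$ are all hypothetical choices (and the sketch assumes each weighted error rate $\overline{\mathrm{err}}_m$ falls strictly between 0 and 1).

```python
import numpy as np

def fit_stump(X, y, w):
    # weighted-error-minimizing split: classify sgn on one side of x_j <= s, -sgn on the other
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            for sgn in (1, -1):
                pred = np.where(X[:, j] <= s, sgn, -sgn)
                err = np.sum(w * (pred != y))
                if err < best[0]:
                    best = (err, j, s, sgn)
    return best[1:]                       # (j, s, sgn)

def stump_predict(X, stump):
    j, s, sgn = stump
    return np.where(X[:, j] <= s, sgn, -sgn)

def adaboost(X, y, M=20):
    N = len(y)
    w = np.full(N, 1.0 / N)               # initial weights 1/N
    stumps, alphas = [], []
    for m in range(M):
        stump = fit_stump(X, y, w)        # f-hat_m chosen to minimize the weighted error
        pred = stump_predict(X, stump)
        err = np.sum(w * (pred != y)) / np.sum(w)
        alpha = np.log((1 - err) / err)
        w = w * np.exp(alpha * (pred != y))   # up-weight misclassified cases
        stumps.append(stump)
        alphas.append(alpha)
    def classify(Xnew):                   # the weighted "voting" classifier of step 5
        votes = sum(a * stump_predict(Xnew, s) for a, s in zip(alphas, stumps))
        return np.sign(votes)
    return classify

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.4, 1, -1)
f_hat = adaboost(X, y, M=50)
print(np.mean(f_hat(X) == y))             # training accuracy of the voted classifier
```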
9.2 Why AdaBoost Might Be Expected to Work

HTF offer some motivation for the effectiveness of AdaBoost. This is related to the optimization of an expected "exponential loss."

9.2.1 Expected "Exponential Loss" for a Function of $X$

For $g$ an arbitrary function of $X$, consider the quantity
\[
E\exp\left(-Y^*g(X)\right)=E\,E\left[\exp\left(-Y^*g(X)\right)|X\right]. \tag{64}
\]
If $g$ were typically very large positively when $Y^*=1$ and very large negatively when $Y^*=-1$, this quantity (64) could be expected to be small. So if one were in possession of such a function of $X$, one might expect to produce a sensible classifier as
\[
f(x)=\mathrm{sign}\left(g(x)\right)=1-2I\left[g(x)<0\right]. \tag{65}
\]
A bit differently put, a $g$ constructed to make (64) small should also produce a good classifier via (65).

It's a bit of a digression, but HTF point out that it's actually fairly easy to see what $g$ optimizes (64). One needs only to identify, for each $x$, what value $a$ optimizes $E\left[\exp\left(-aY^*\right)|X=x\right]$. That is,
\[
E\left[\exp\left(-aY^*\right)|X=x\right]=\exp(-a)\,P\left[Y^*=1|X=x\right]+\exp(a)\,P\left[Y^*=-1|X=x\right],
\]
and an optimal $a$ is easily seen to be half the log odds ratio, i.e. the $g$ optimizing (64) is
\[
g(x)=\frac{1}{2}\ln\left(\frac{P\left[Y^*=1|X=x\right]}{P\left[Y^*=-1|X=x\right]}\right).
\]

9.2.2 Sequential Search for a $g$ With Small Expected Exponential Loss, and the AdaBoost Algorithm

Consider "base classifiers" $h(x,\gamma)$ taking values in $\mathcal{G}=\{-1,1\}$, with parameters $\gamma$, and functions built from them,
\[
g_m(x)=\sum_{l=1}^m\beta_l\,h\left(x,\gamma_l\right).
\]
We might think of successive ones of these as perturbations of the previous ones by addition of a term $\beta_m h(x,\gamma_m)$ to $g_{m-1}$. We might further think about how such functions might be successively defined in an effort to make them effective at producing small values of (64). Of course, since we typically do not have available a complete probability model for $(X,Y^*)$, we will only be able to seek to optimize an empirical version of (64). As it turns out, taking these points of view produces a multiple of the AdaBoost voting function $\sum_{m=1}^M\alpha_m\hat{f}_m$ as $g_M$. It then follows that AdaBoost might well produce effective classifiers, by the logic that a $g$ producing small (64) ought to translate to a good classifier via (65).

So consider an empirical loss that amounts to an empirical version of $N$ times (64) for $g_m$. This is
\[
E_m=\sum_{i=1}^N\exp\left(-y_i\left[g_{m-1}\left(x_i\right)+\beta_m h\left(x_i,\gamma_m\right)\right]\right)=\sum_{i=1}^N\exp\left(-y_i g_{m-1}\left(x_i\right)\right)\exp\left(-y_i\beta_m h\left(x_i,\gamma_m\right)\right).
\]
Let
\[
v_i^m=\exp\left(-y_i g_{m-1}\left(x_i\right)\right)
\]
and consider optimal choice of $\gamma_m$ and $\beta_m>0$ for purposes of making $g_m$ the best possible perturbation of $g_{m-1}$. (We may assume that $\beta>0$, provided the set of classifiers defined by $h(x,\gamma)$ as $\gamma$ varies includes both $h$ and $-h$ for any version of the base classifier $h$.) Write
\[
E_m=\sum_{i:\,h(x_i,\gamma_m)=y_i}v_i^m\exp\left(-\beta_m\right)+\sum_{i:\,h(x_i,\gamma_m)\neq y_i}v_i^m\exp\left(\beta_m\right)
=\exp\left(-\beta_m\right)\sum_{i=1}^N v_i^m+\left(\exp\left(\beta_m\right)-\exp\left(-\beta_m\right)\right)\sum_{i=1}^N v_i^m I\left[h\left(x_i,\gamma_m\right)\neq y_i\right].
\]
So, independent of the value of $\beta_m>0$, $\gamma_m$ needs to be chosen to optimize the $v_i^m$-weighted error rate of $h(x,\gamma_m)$. Call the optimized (using weights $v_i^m$) version of $h(x,\gamma_m)$ by the name $h_m(x)$, and notice that the prescription for choice of this base classifier is exactly the AdaBoost prescription of how to choose $\hat{f}_m$ based on weights $w_i^{(m)}$.

So now consider minimization of
\[
\sum_{i=1}^N v_i^m\exp\left(-y_i\beta h_m\left(x_i\right)\right)
\]
over choices of $\beta$. This is
\[
\exp(-\beta)\sum_{i:\,h_m(x_i)=y_i}v_i^m+\exp(\beta)\sum_{i:\,h_m(x_i)\neq y_i}v_i^m
=\exp(-\beta)\sum_{i=1}^N v_i^m+\left(\exp(\beta)-\exp(-\beta)\right)\sum_{i=1}^N v_i^m I\left[h_m\left(x_i\right)\neq y_i\right],
\]
and minimization of it is equivalent to minimization of
\[
\exp(-\beta)+\left(\exp(\beta)-\exp(-\beta)\right)\overline{\mathrm{err}}_m^*\quad\text{for}\quad \overline{\mathrm{err}}_m^*=\frac{\sum_{i=1}^N v_i^m I\left[h_m\left(x_i\right)\neq y_i\right]}{\sum_{i=1}^N v_i^m}.
\]
A bit of calculus shows that the optimizing $\beta$ is
\[
\beta_m=\frac{1}{2}\ln\left(\frac{1-\overline{\mathrm{err}}_m^*}{\overline{\mathrm{err}}_m^*}\right). \tag{66}
\]
Notice that the prescription (66) calls for a coefficient that is exactly half of $\alpha_m$, the coefficient of $\hat{f}_m$ in the AdaBoost voting function computed from the weighted error rate in step 4(b) of the AdaBoost algorithm. And note that, as regards sign, the $\frac{1}{2}$ is completely irrelevant.

So, to finish the argument that successive choosing of the $g_m$'s amounts to AdaBoost creation of a voting function, it remains to consider how the weights $v_i^m$ get updated. For moving from $g_m$ to $g_{m+1}$ one uses weights
\[
v_i^{m+1}=\exp\left(-y_i g_m\left(x_i\right)\right)=\exp\left(-y_i\left[g_{m-1}\left(x_i\right)+\beta_m h_m\left(x_i\right)\right]\right)=v_i^m\exp\left(-y_i\beta_m h_m\left(x_i\right)\right)
=v_i^m\exp\left(\beta_m\left(2I\left[h_m\left(x_i\right)\neq y_i\right]-1\right)\right)=v_i^m\exp\left(2\beta_m I\left[h_m\left(x_i\right)\neq y_i\right]\right)\exp\left(-\beta_m\right).
\]
Since $\exp(-\beta_m)$ is constant across $i$, it is irrelevant to the weighting; and since the prescription for $\beta_m$ produces half what AdaBoost prescribes for $\alpha_m$, the weights used in the choice of $\gamma_{m+1}$ and $h_{m+1}(x)=h(x,\gamma_{m+1})$ are derived exactly according to the AdaBoost algorithm for updating weights. So, at the end of the day, the $g_m$'s are built from base classifiers to creep toward a function optimizing an empirical version of $E\exp\left(-Y^*g(X)\right)$. Since it is easy to see that $g_1$ corresponds to the first AdaBoost step and $g_M$ is further $\frac{1}{2}$ of the AdaBoost voting function, they generate the same classifier as the AdaBoost algorithm.

9.3 Other Forms of Boosting

One might in general think about strategies for sequentially modifying/perturbing a predictor (perhaps based on a modified version of a data set) as a way of creeping toward a good predictor. Here is a general "gradient boosting" algorithm (taken from Izenman) aimed in this direction, for a general loss function and a general class of potential updates for a predictor. (I don't see that HTF ever get to the point of actually laying out this logic, though they say lots of related things.)

1. Start with $\hat{f}_0(x)=\arg\min_c\sum_{i=1}^N L\left(y_i,c\right)$.

2. For $m=1,2,\ldots,M$:

(a) let $\tilde{y}_{im}=-\left[\dfrac{\partial L\left(y_i,u\right)}{\partial u}\right]_{u=\hat{f}_{m-1}(x_i)}$,

(b) fit, for some class of functions $h_m(x,\gamma)$ with parameters $\gamma$,
\[
\gamma_m=\arg\min_\gamma\sum_{i=1}^N\left(\tilde{y}_{im}-h_m\left(x_i,\gamma\right)\right)^2,
\]

(c) let
\[
\rho_m=\arg\min_\rho\sum_{i=1}^N L\left(y_i,\hat{f}_{m-1}\left(x_i\right)+\rho\,h_m\left(x_i,\gamma_m\right)\right),\ \text{and}
\]

(d) set $\hat{f}_m(x)=\hat{f}_{m-1}(x)+\rho_m h_m\left(x,\gamma_m\right)$.

3. Finally, set $\hat{f}=\hat{f}_M$.

In this algorithm, the set of $N$ values $h_m(x_i,\gamma_m)$ produced in 2(b) functions as an approximate negative gradient of the total loss at the set of values $\hat{f}_{m-1}(x_i)$, and the least squares criterion at 2(b) might be replaced by others (like least absolute deviations). The "minimizations" need to be taken with a grain of salt, as they are typically only "approximate" minimizations, like those associated with greedy algorithms used to fit binary trees, e.g. Steps 2(c)-(d) amount to a line search in a direction of (approximate) steepest descent for the total loss. Despite the fact that HTF Section 10.9 is titled "Boosting Trees," there is nothing here that requires that the functions $h_m(x,\gamma)$ come from classification or regression trees.
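As a concrete (and hypothetical) instance of the algorithm above, here is a minimal sketch for squared error loss — where, as the next subsection notes, the negative gradient is just (twice) the residual and the line search contributes only a constant factor — using one-split regression stumps as the class of updates $h_m(x,\gamma)$.

```python
import numpy as np

def fit_reg_stump(x, r):
    # least squares search over split points s for 1-d inputs x and working responses r
    best = (np.inf, None)
    for s in np.unique(x)[:-1]:
        left, right = r[x <= s], r[x > s]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, (s, left.mean(), right.mean()))
    return best[1]

def stump_value(x, stump):
    s, cl, cr = stump
    return np.where(x <= s, cl, cr)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(6 * x) + rng.normal(scale=0.3, size=200)

M = 100
f_hat = np.full_like(y, y.mean())           # f_0 = ybar minimizes the total SEL
stumps = []
for m in range(M):
    resid = y - f_hat                       # working response: the current residuals
    stump = fit_reg_stump(x, resid)         # step 2(b): a least squares fit to the residuals
    f_hat = f_hat + stump_value(x, stump)   # steps 2(c)-(d): for SEL with an LS-fitted stump, rho_m = 1
    stumps.append(stump)
print(np.mean((y - f_hat) ** 2))            # training error after M boosting steps
```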
When the problem is a regression problem with squared error loss, some substantial simplifications of this occur.

9.3.1 SEL

Suppose for now that $L(y,\hat{y})=(y-\hat{y})^2$. Then, in the gradient boosting algorithm above,

1. $\hat{f}_0=\bar{y}$,

2. $\tilde{y}_{im}=-\left[\dfrac{\partial\left(y_i-u\right)^2}{\partial u}\right]_{u=\hat{f}_{m-1}(x_i)}=2\left(y_i-\hat{f}_{m-1}\left(x_i\right)\right)$, so that step 2(b) of the algorithm says "fit, using SEL, a model for the residuals from the previous iteration," and

3. the optimization of $\rho$ in step 2(c) of the algorithm produces $\rho_m$, so that step 2(d) of the algorithm sets $\hat{f}_m(x)=\hat{f}_{m-1}(x)+\rho_m h_m\left(x,\gamma_m\right)$ (the factor of 2 in $\tilde{y}_{im}$ is simply absorbed by $\rho_m$),

and ultimately one is left with the prescription "fit, using SEL, a model for the residuals from the previous iteration and add it to the previous predictor to produce the $m$th one" — a very intuitively appealing algorithm. Notice that still there is nothing here that really requires that the $h_m(x,\gamma)$ come from regression trees. This notion of gradient boosting could be applied to any kind of base regression predictor.

9.3.2 AEL and Binary Regression Trees

Suppose now that $L(y,\hat{y})=|y-\hat{y}|$. Then, in the gradient boosting algorithm,

1. $\hat{f}_0=\mathrm{median}\left(y_i\right)$, and

2. $\tilde{y}_{im}=\mathrm{sign}\left(y_i-\hat{f}_{m-1}\left(x_i\right)\right)$, so that step 2(b) of the algorithm says "fit a model for the $\pm1$'s coding the signs of the residuals from the previous iteration."

In the event that the base predictors being fit are regression trees, the $h_m(x,\gamma_m)$ fit at step 2(b) will have values between $-1$ and $1$, constant on rectangles, if the fitting is done using least squares exactly as indicated in 2(b). If the alternative of least absolute error were used in place of least squares in 2(b), one would have the median of a set of $-1$'s and $1$'s, and thus values either $-1$ or $1$, constant on rectangles. In this latter case the optimization for $\rho_m$ is of the form
\[
\rho_m=\arg\min_\rho\sum_{i=1}^N\left|y_i-\hat{f}_{m-1}\left(x_i\right)-\rho\,h_m\left(x_i,\gamma_m\right)\right|,
\]
and while there is no obvious closed form for the optimizer, it is at least the case that one is minimizing a piecewise linear continuous function of $\rho$.

9.3.3 $K$-Class Classification

HTF allude to a procedure that uses gradient boosting in fitting a $K$-dimensional predictor $f=\left(f_1,\ldots,f_K\right)$, under the constraint that $\sum_{k=1}^K f_k(x)=0$, that sets corresponding class probabilities for $k\in\mathcal{G}=\{1,2,\ldots,K\}$ as
\[
p_k(x)=\frac{\exp\left(f_k(x)\right)}{\sum_{l=1}^K\exp\left(f_l(x)\right)}
\]
and has the implied classifier
\[
\hat{f}(x)=\arg\max_{k\in\mathcal{G}}p_k(x).
\]
An appropriate and convenient loss is
\[
L\left(y,f(x)\right)=-\sum_{k=1}^K I[y=k]\ln p_k(x)=-\sum_{k=1}^K I[y=k]\,f_k(x)+\ln\left(\sum_{l=1}^K\exp\left(f_l(x)\right)\right).
\]
This leads to
\[
-\frac{\partial L}{\partial f_k}=I[y=k]-\frac{\exp\left(f_k(x)\right)}{\sum_{l=1}^K\exp\left(f_l(x)\right)}
\]
and thus
\[
\tilde{y}_{ikm}=I\left[y_i=k\right]-p_{k,m-1}\left(x_i\right).
\]
Algorithm 10.4 on page 387 then provides a version of the gradient boosting algorithm for this loss and $K$-class classification.

9.4 Issues More or Less Related to Boosting and Use of Trees as Base Predictors

The balance of Ch. 10 of HTF concerns a number of issues that might come up in the use of boosting, particularly if regression or classification trees are the base predictors.

One issue is the question of how large regression or classification trees should be allowed to grow if they are used as the functions $h_m(x,\gamma)$ in a boosting algorithm. The answer seems to be "Not too large, maybe to about 6 or so terminal nodes."

Section 10.12 concerns regularization/control of complexity parameters/avoiding overfit in boosting. Where trees are used as the base predictors, limiting their size as above provides some regularization. Further, there is the parameter $M$ in the boosting algorithm that can/should be limited in size, very large values surely producing overfit. Cross-validation here looks typically pretty unworkable because of the amount of computing involved. HTF seem to indicate that, for better or worse, holding back a part of the training sample and watching performance of a predictor on that single test set as $M$ increases is a commonly used method of choosing $M$.

"Shrinkage" is an idea that has been suggested for
regularization in boosting The basic idea is that instead of setting fm w fm7190 MM 067 as in 2d of the gradient boosting algorithm7 one chooses V E 07 1 and sets fm w fir171w Vthm 067 ie one doesn t make the quotfull correctionquot to fm1 in producing fm This will typically require larger M for good nal performance of a predictor than the noneshrunken version7 but apparently can in combination with larger M improve performance quotSubsamplingquot is some kind of bootstrapbagging idea The notion is that at each iteration of the boosting algorithm7 instead of Choosing an update based on the whole training set7 one chooses a fraction 7 of the training set at random7 and ts to it using a new random selection at each iteration This reduces the computation time per iteration and apparently can also improve performance HTF show its performance on an example in combination with shrinkage Consider the matter of assigning measures of quotimportancequot of input variables for a regression tree predictor In the spirit of ordinary linear models assessment of the importance of a predictor in terms of some reduction it provides in some error sum of squares7 Breiman suggested the following Suppose that in a regression tree7 predictor Xj provides the rectangle splitting criterion for nodes nodelj nodemmj and that before splitting at nodezj the relevant rectangle Rnodelj has associated sum of squares 7 2 SSEnodelj Z 7 ynodelj ivuth Lemma 76 and that after splitting Rnodelj on variable Xj to create rectangles Rnodeljl and Rnodel one has sums of squares associated with those two rectangles 2 SSEnodeljl Z grimdeljl imth meRnodu 2 and Z grimdel ivuth maizan The reduction in error sum of squares provided by the split on Xj at nodezj is thus D nodezj SSEnodelj 7 S39S39E odelj1 SSEnodel One might then call m0 1 2 D nodezj 21 a measure of the importance of Xj in tting the tree and compare the various I s Further7 if a predictor is a sum of regression trees produced by boosting and Jim measures the importance of Xj in the mth tree7 then 1 M is a sensible measure of the importance of Xj in the overall predictor One can then compare the various 1 as a means of comparing the importance of the input variables Another issue brie y discussed in HTF Ch 10 is the making and plotting of functions a few of the coordinates of X in an attempt to understand the nature of the in uence of these in a predictor What they say is really perfectly general7 not at all special to trees or boosted predictors lf7 for example7 I want to understand the in uence the rst two coordinates of X have on Y f X7 I might think of somehow averaging out the remaining coordinates of X One theoretical implementation of this idea would be 12 x1 x2 Ef x1 3527X37X47 7Xp ie averaging according to the marginal distribution of the excluded input variables An empirical version of this is N l N E f17273174177 vpi i1 This might be plotted and the plot called a partial dependence plot HTF s language is that this details the dependence of the predictor on 1 12 quotafter accounting for the average effects of the other variables 77 Something different is 171mm EU 00 1X1 1129 m This is by the way the function of X1 X2 closest to f in L2 This is obtained by averaging not against the marginal of the excluded variables but against the conditional distribution of the excluded variables given X1 1X2 2 No workable empirical version of g 1352 can typically be de ned And it should be clear that this is not the same as 12 1 2 HTF say this in some sense describes the effects 
The difference between $\bar{f}_{12}$ and $\tilde{g}_{12}$ is easily seen through resort to a simple example. If, for example, $\hat{f}$ is additive of the form
$$\hat{f}\left(x\right)=h_{1}\left(x_{1},x_{2}\right)+h_{2}\left(x_{3},\ldots,x_{p}\right),$$
then
$$\bar{f}_{12}\left(x_{1},x_{2}\right)=h_{1}\left(x_{1},x_{2}\right)+Eh_{2}\left(X_{3},\ldots,X_{p}\right)=h_{1}\left(x_{1},x_{2}\right)+\text{constant},$$
while
$$\tilde{g}_{12}\left(x_{1},x_{2}\right)=h_{1}\left(x_{1},x_{2}\right)+E\left[h_{2}\left(X_{3},\ldots,X_{p}\right)\mid X_{1}=x_{1},X_{2}=x_{2}\right],$$
and the $\tilde{g}_{12}$ "correction" to $h_{1}\left(x_{1},x_{2}\right)$ is not necessarily constant in $\left(x_{1},x_{2}\right)$.

10 Neural Networks (Ch. 11 of HTF)

10.1 Projection Pursuit Regression

For $\omega_{1},\omega_{2},\ldots,\omega_{M}$ unit $p$-vectors of parameters, we consider predictors of the form
$$f\left(X\right)=\sum_{m=1}^{M}g_{m}\left(\omega_{m}'X\right).\qquad(67)$$
This is an additive form in the derived variables $V_{m}=\omega_{m}'X$. The functions $g_{m}$ and the directions $\omega_{m}$ are to be fit from the training data. The $M=1$ case of this form is the "single index model" of econometrics.

How does one fit a predictor of the form (67)? Consider first the $M=1$ case. Given $\omega$, we simply have pairs $\left(\omega'x_{i},y_{i}\right)$, $i=1,\ldots,N$, and a 1-dimensional smoothing method can be used to estimate $g$. On the other hand, given $g$, we might seek to optimize $\omega$ via an iterative search. A Gauss–Newton algorithm can be based on the first-order Taylor approximation
$$g\left(\omega'x_{i}\right)\approx g\left(\omega_{old}'x_{i}\right)+g'\left(\omega_{old}'x_{i}\right)\left(\omega-\omega_{old}\right)'x_{i},$$
so that
$$\sum_{i=1}^{N}\left(y_{i}-g\left(\omega'x_{i}\right)\right)^{2}\approx\sum_{i=1}^{N}g'\left(\omega_{old}'x_{i}\right)^{2}\left[\left(\omega_{old}'x_{i}+\frac{y_{i}-g\left(\omega_{old}'x_{i}\right)}{g'\left(\omega_{old}'x_{i}\right)}\right)-\omega'x_{i}\right]^{2}.$$
We may then update $\omega_{old}$ to $\omega$ using the closed form for the no-intercept regression of
$$\omega_{old}'x_{i}+\frac{y_{i}-g\left(\omega_{old}'x_{i}\right)}{g'\left(\omega_{old}'x_{i}\right)}\quad\text{on}\quad x_{i},$$
weighted by $g'\left(\omega_{old}'x_{i}\right)^{2}$. Presumably one must normalize the updated $\omega$ in order to preserve the unit-length property of the $\omega$'s (in order to maintain a stable scaling in the fitting). The $g$ and $\omega$ steps are iterated until convergence. When $M>1$, terms $g_{m}\left(\omega_{m}'x\right)$ are added to a sum of such in a forward stage-wise fashion. HTF provide some discussion of details like readjusting previous $g$'s (and perhaps $\omega$'s) upon adding $g_{m}$ to a fit, and the choice of $M$.

10.2 Neural Networks

A multi-layer feed-forward neural network is a nonlinear map of $X\in\mathbb{R}^{p}$ to one or more real-valued $Y$'s, built through the use of functions of linear combinations of functions of linear combinations of $\ldots$ of functions of linear combinations of coordinates of $X$. Figure 1 is a network diagram representation of a single-hidden-layer feed-forward neural net with 3 inputs, 2 hidden nodes, and 2 outputs. This diagram stands for a function of $X$ defined by setting
$$Z_{1}=\sigma\left(\alpha_{01}+\alpha_{11}X_{1}+\alpha_{21}X_{2}+\alpha_{31}X_{3}\right),\qquad Z_{2}=\sigma\left(\alpha_{02}+\alpha_{12}X_{1}+\alpha_{22}X_{2}+\alpha_{32}X_{3}\right),$$
and then
$$Y_{1}=g_{1}\left(\beta_{01}+\beta_{11}Z_{1}+\beta_{21}Z_{2}\right),\qquad Y_{2}=g_{2}\left(\beta_{02}+\beta_{12}Z_{1}+\beta_{22}Z_{2}\right).$$

[Figure 1: A network diagram representation of a single hidden layer feed-forward neural net with 3 inputs, 2 hidden nodes, and 2 outputs (input layer, hidden layer, output layer).]

Of course, much more complicated networks are possible, particularly ones with multiple hidden layers and many nodes on all layers. The common choices of functional form $\sigma$ at hidden nodes are ones with sigmoidal shape, like
$$\sigma\left(u\right)=\frac{1}{1+\exp\left(-u\right)}\quad\text{or}\quad\sigma\left(u\right)=\tanh\left(u\right)=\frac{\exp\left(u\right)-\exp\left(-u\right)}{\exp\left(u\right)+\exp\left(-u\right)};$$
statisticians seem to favor the former (probably because of familiarity with logistic forms), while computer scientists seem to favor the latter. Such functions are really quite linear near $u=0$, so that for small $\alpha$'s the functions of $X$ entering the $g$'s in a single-hidden-layer network are nearly linear. For large $\alpha$'s the functions are nearly step functions in $\omega'X$. In light of the latter, it is not surprising that there are universal approximation theorems that guarantee that any continuous function on a compact subset of $\mathbb{R}^{p}$ can be approximated to any degree of fidelity with a single-hidden-layer feed-forward neural net with enough nodes in the hidden layer. This is both a blessing and a curse. It promises that these forms are quite flexible. It also promises that there must be both overfitting and identifiability issues inherent in their use (the latter in addition to the identifiability issues already inherent in the symmetric nature of the functional forms assumed for the predictors).
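For concreteness, here is a minimal numpy sketch of the map that the Figure 1 diagram stands for, with logistic $\sigma$, identity output functions, and arbitrary made-up weights (all values are illustrative only).

```python
import numpy as np

def sigma(u):
    return 1.0 / (1.0 + np.exp(-u))            # logistic activation

def forward(x, alpha0, alpha, beta0, beta):
    """Single hidden layer forward pass: Z = sigma(alpha_0 + alpha'X),
    Y = beta_0 + beta'Z, with identity output functions g_k."""
    z = sigma(alpha0 + alpha.T @ x)            # hidden-node values Z_1, ..., Z_M
    return beta0 + beta.T @ z                  # outputs Y_1, ..., Y_K

# arbitrary weights for the 3-input, 2-hidden-node, 2-output network of Figure 1
alpha0 = np.array([0.1, -0.2])                 # alpha_{0m}
alpha = np.array([[0.5, -1.0],                 # alpha_{lm}, shape (3 inputs, 2 hidden)
                  [0.3,  0.4],
                  [-0.7, 0.2]])
beta0 = np.array([0.0, 1.0])                   # beta_{0k}
beta = np.array([[1.2, -0.3],                  # beta_{mk}, shape (2 hidden, 2 outputs)
                 [0.8,  0.5]])
print(forward(np.array([1.0, 2.0, -1.0]), alpha0, alpha, beta0, beta))
```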
In regression contexts, identity functions are common for the output functions $g_{k}$, while in classification problems (in order to make estimates of class probabilities) the functions $g_{k}$ often exponentiate one linear combination of $Z$'s and divide by a sum of such terms over all outputs.

There are various possibilities for regularization of the ill-posed fitting problem for neural nets, ranging from the fairly formal and rational to the very informal and ad hoc. Possibly the most common is to simply use an iterative fitting algorithm and "stop it before it converges."

10.2.1 The Back-propagation Algorithm

The most common fitting algorithm is something called the back-propagation algorithm, or the "delta rule." In the case where one is doing SEL fitting for a single-hidden-layer feed-forward neural net with $M$ hidden nodes and $K$ output nodes, the logic goes as follows. There are $M\left(p+1\right)$ parameters $\alpha$ to be fit. In addition to the notation $\alpha_{0m}$, use the notation $\alpha_{m}=\left(\alpha_{1m},\alpha_{2m},\ldots,\alpha_{pm}\right)'$. There are $K\left(M+1\right)$ parameters $\beta$ to be fit. In addition to the notation $\beta_{0k}$, use the notation $\beta_{k}=\left(\beta_{1k},\beta_{2k},\ldots,\beta_{Mk}\right)'$. Let $\theta$ stand for the whole set of parameters, and consider the SEL fitting criterion
$$R\left(\theta\right)=\sum_{i=1}^{N}\sum_{k=1}^{K}\left(y_{ik}-f_{k}\left(x_{i}\right)\right)^{2}.\qquad(68)$$
Let $z_{0i}=1$ and for $m=1,2,\ldots,M$ define
$$z_{mi}=\sigma\left(\alpha_{0m}+\alpha_{m}'x_{i}\right)\quad\text{and}\quad z_{i}=\left(z_{1i},z_{2i},\ldots,z_{Mi}\right)',$$
and write
$$R_{i}\left(\theta\right)=\sum_{k=1}^{K}\left(y_{ik}-f_{k}\left(x_{i}\right)\right)^{2},$$
so that $R\left(\theta\right)=\sum_{i}R_{i}\left(\theta\right)$. Partial derivatives with respect to the parameters are, by the chain rule,
$$\frac{\partial R_{i}}{\partial\beta_{mk}}=-2\left(y_{ik}-f_{k}\left(x_{i}\right)\right)g_{k}'\left(\beta_{0k}+\beta_{k}'z_{i}\right)z_{mi}$$
for $m=0,1,\ldots,M$ and $k=1,2,\ldots,K$, and we will use the abbreviation
$$\delta_{ki}=2\left(y_{ik}-f_{k}\left(x_{i}\right)\right)g_{k}'\left(\beta_{0k}+\beta_{k}'z_{i}\right),\qquad(69)$$
so that
$$\frac{\partial R_{i}}{\partial\beta_{mk}}=-\delta_{ki}z_{mi}.\qquad(70)$$
Similarly, letting $x_{i0}=1$,
$$\frac{\partial R_{i}}{\partial\alpha_{lm}}=-\sum_{k=1}^{K}2\left(y_{ik}-f_{k}\left(x_{i}\right)\right)g_{k}'\left(\beta_{0k}+\beta_{k}'z_{i}\right)\beta_{mk}\,\sigma'\left(\alpha_{0m}+\alpha_{m}'x_{i}\right)x_{il}$$
for $l=0,1,2,\ldots,p$ and $m=1,2,\ldots,M$, and we will use the abbreviations
$$s_{mi}=\sum_{k=1}^{K}2\left(y_{ik}-f_{k}\left(x_{i}\right)\right)g_{k}'\left(\beta_{0k}+\beta_{k}'z_{i}\right)\beta_{mk}\,\sigma'\left(\alpha_{0m}+\alpha_{m}'x_{i}\right)=\sigma'\left(\alpha_{0m}+\alpha_{m}'x_{i}\right)\sum_{k=1}^{K}\beta_{mk}\delta_{ki}\qquad(71)$$
(equations (71) are called the "back-propagation" equations), so that
$$\frac{\partial R_{i}}{\partial\alpha_{lm}}=-s_{mi}x_{il}.\qquad(72)$$
An iterative search to make $R\left(\theta\right)$ small may then proceed by setting
$$\beta_{mk}^{\left(r+1\right)}=\beta_{mk}^{\left(r\right)}-\gamma_{r}\sum_{i=1}^{N}\left.\frac{\partial R_{i}}{\partial\beta_{mk}}\right|_{\theta^{\left(r\right)}}=\beta_{mk}^{\left(r\right)}+\gamma_{r}\sum_{i=1}^{N}\delta_{ki}^{\left(r\right)}z_{mi}^{\left(r\right)}\qquad(73)$$
and
$$\alpha_{lm}^{\left(r+1\right)}=\alpha_{lm}^{\left(r\right)}-\gamma_{r}\sum_{i=1}^{N}\left.\frac{\partial R_{i}}{\partial\alpha_{lm}}\right|_{\theta^{\left(r\right)}}=\alpha_{lm}^{\left(r\right)}+\gamma_{r}\sum_{i=1}^{N}s_{mi}^{\left(r\right)}x_{il}\qquad(74)$$
for some "learning rate" $\gamma_{r}$. So, operationally,

1. in what is called the "forward pass" through the network, using $r$th iterates of $\alpha$'s and $\beta$'s one computes $r$th iterates of fitted values $f_{k}^{\left(r\right)}\left(x_{i}\right)$;

2. in what is called the "backward pass" through the network, the $r$th iterates $\delta_{ki}^{\left(r\right)}$ are computed using the $r$th iterates of the $z$'s, $\beta$'s, and the $f_{k}^{\left(r\right)}\left(x_{i}\right)$ in equations (69), and then the $r$th iterates $s_{mi}^{\left(r\right)}$ are computed using the $r$th $\delta$'s, $\alpha$'s, and $\beta$'s in equations (71);

3. the $\delta_{ki}^{\left(r\right)}$ and $s_{mi}^{\left(r\right)}$ then provide partials via (70) and (72), and $\left(r+1\right)$st updates for the parameters $\alpha$ and $\beta$ come from (73) and (74).

10.2.2 Formal Regularization of Fitting

Suppose that the various coordinates of the input vectors in the training set have been standardized and one wants to regularize the fitting of a neural net. One possible way of proceeding is to define a penalty function like
$$J\left(\theta\right)=\sum\beta_{mk}^{2}+\sum\alpha_{lm}^{2}\qquad(75)$$
(it is not absolutely clear whether one really wants to include $l=0$ and $m=0$ in the sums (75)) and seek not to partially optimize $R\left(\theta\right)$ in (68), but rather to fully optimize
$$R\left(\theta\right)+\lambda J\left(\theta\right)\qquad(76)$$
for a $\lambda>0$. By simply adding $-2\gamma_{r}\lambda\beta_{mk}^{\left(r\right)}$ on the right side of (73) and $-2\gamma_{r}\lambda\alpha_{lm}^{\left(r\right)}$ on the right side of (74), one arrives at an appropriate gradient descent algorithm for trying to optimize the penalized error sum of squares (76). Potentially, an appropriate value for $\lambda$ might be chosen based on cross-validation. A small sketch of this penalized gradient descent appears below.
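The following is a minimal numpy sketch (mine) of batch gradient descent via the forward and backward passes above, for identity output functions $g_{k}$ and (for simplicity) the penalty (75) applied to all weights including intercepts; the learning rate, $\lambda$, $M$, and the number of iterations are arbitrary illustrative choices.

```python
import numpy as np

def sigma(u):
    return 1.0 / (1.0 + np.exp(-u))

def fit_backprop(X, Y, M=5, steps=2000, gamma=0.01, lam=0.0, seed=0):
    """Back-propagation / gradient descent for a single hidden layer net with
    identity outputs and SEL, penalized by lam * (sum of squared weights)."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    K = Y.shape[1]
    A = rng.normal(scale=0.1, size=(p + 1, M))    # alpha_{lm}, row 0 = intercepts alpha_{0m}
    B = rng.normal(scale=0.1, size=(M + 1, K))    # beta_{mk}, row 0 = intercepts beta_{0k}
    X1 = np.hstack([np.ones((N, 1)), X])
    for _ in range(steps):
        # forward pass
        Z = sigma(X1 @ A)                         # z_mi, an N x M array
        Z1 = np.hstack([np.ones((N, 1)), Z])
        F = Z1 @ B                                # fitted values f_k(x_i) (identity g_k)
        # backward pass: delta_ki as in (69), s_mi as in (71)
        delta = 2.0 * (Y - F)                     # g_k' = 1 for identity outputs
        S = (delta @ B[1:, :].T) * Z * (1.0 - Z)  # sigma'(u) = sigma(u)(1 - sigma(u))
        # updates (73)-(74), plus the weight-decay terms from (76)
        B += gamma * (Z1.T @ delta) - 2.0 * gamma * lam * B
        A += gamma * (X1.T @ S) - 2.0 * gamma * lam * A
    return A, B
```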
Something that may at first seem quite different would be to take a Bayesian point of view. For example, with a model
$$y_{ik}=f_{k}\left(x_{i}\right)+\epsilon_{ik}\quad\text{for the }\epsilon_{ik}\text{ iid }N\left(0,\sigma^{2}\right),$$
a likelihood is simply
$$L\left(\theta,\sigma^{2}\right)=\prod_{i=1}^{N}\prod_{k=1}^{K}h\left(y_{ik}\mid f_{k}\left(x_{i}\right),\sigma^{2}\right)$$
for $h\left(\cdot\mid\mu,\sigma^{2}\right)$ the normal pdf. If then $g\left(\theta,\sigma^{2}\right)$ specifies a prior distribution for $\theta$ and $\sigma^{2}$, a posterior for $\left(\theta,\sigma^{2}\right)$ has density proportional to $L\left(\theta,\sigma^{2}\right)g\left(\theta,\sigma^{2}\right)$. For example, one might well assume that a priori the $\alpha$'s and $\beta$'s are iid $N\left(0,\tau^{2}\right)$, where small $\tau^{2}$ will provide regularization (and it is again unclear whether one wants to include the $\alpha_{0m}$ and $\beta_{0k}$ in such an assumption, or to instead provide more diffuse priors for them, like improper "Uniform$\left(-\infty,\infty\right)$" or at least large-variance normal ones). And a standard improper prior for $\sigma^{2}$ would be to take $\ln\sigma$ to be Uniform$\left(-\infty,\infty\right)$. In any case, whether improper or proper, let's abuse notation and write $g\left(\sigma^{2}\right)$ for the prior density for $\sigma^{2}$. Then with independent mean-0, variance-$\tau^{2}$ priors for all the weights (except possibly the $\alpha_{0m}$ and $\beta_{0k}$, which might be given Uniform$\left(-\infty,\infty\right)$ priors), one has
$$\ln\left(L\left(\theta,\sigma^{2}\right)g\left(\theta,\sigma^{2}\right)\right)=\text{const}-NK\ln\sigma-\frac{1}{2\sigma^{2}}R\left(\theta\right)-\frac{1}{2\tau^{2}}J\left(\theta\right)+\ln g\left(\sigma^{2}\right)=\text{const}-NK\ln\sigma+\ln g\left(\sigma^{2}\right)-\frac{1}{2\sigma^{2}}\left(R\left(\theta\right)+\frac{\sigma^{2}}{\tau^{2}}J\left(\theta\right)\right)\qquad(77)$$
(flat improper priors for the $\alpha_{0m}$ and $\beta_{0k}$ correspond to the absence of terms for them in the sums for $J\left(\theta\right)$ in (75)). This recalls (76) and suggests that an appropriate $\lambda$ for regularization can be thought of as a variance ratio of "observation variance" to "prior variance" for the weights.

It's fairly clear how to define Metropolis–Hastings-within-Gibbs algorithms for sampling from the posterior proportional to $L\left(\theta,\sigma^{2}\right)g\left(\theta,\sigma^{2}\right)$. But it seems that typically the high dimensionality of the parameter space, combined with the symmetry-derived multi-modality of the posterior, will prevent one from running an MCMC algorithm long enough to fully detail the posterior. It also seems unlikely, however, that detailing the posterior is really necessary or even desirable. Rather, one might simply run the MCMC algorithm, monitoring the values of $L\left(\theta,\sigma^{2}\right)g\left(\theta,\sigma^{2}\right)$ corresponding to the successively randomly generated MCMC iterates. An MCMC algorithm will spend much of its time where the corresponding posterior density is large, and we can expect that a long MCMC run will identify a nearly modal value for the posterior. Rather than averaging neural nets according to the posterior, one might instead use as a predictor a neural net corresponding to a parameter vector (at least locally) maximizing the posterior. Notice that one might even take the parameter vector in an MCMC run with the largest $L\left(\theta,\sigma^{2}\right)g\left(\theta,\sigma^{2}\right)$ value and, for a grid of $\sigma^{2}$ values around the empirical maximizer, use the back-propagation algorithm modified to fully optimize $R\left(\theta\right)+\lambda J\left(\theta\right)$ over choices of $\theta$. This in turn could be used with (77) to perhaps improve somewhat the result of the MCMC "search."

11 Support Vector Machines (Ch. 12 of HTF)

Consider a 2-class classification problem. As for the AdaBoost derivation, we'll suppose that output $Y$ takes values in $\mathcal{G}=\left\{ -1,1\right\}$. Our present concern is a further treatment of linear classification beyond that provided in Section 3. For $\beta\in\mathbb{R}^{p}$ and $\beta_{0}\in\mathbb{R}$ we'll consider the form
$$g\left(x\right)=x'\beta+\beta_{0}\qquad(78)$$
and a predictor/classifier
$$f\left(x\right)=\mathrm{sign}\left(g\left(x\right)\right).\qquad(79)$$
We will approach the problem of choosing $\beta$ and $\beta_{0}$ so as to, in some sense, provide a maximal cushion around a hyperplane separating the $x_{i}$ with corresponding $y_{i}=-1$ from the $x_{i}$ with corresponding $y_{i}=1$.

11.1 The Linearly Separable Case

In the case that there is a classifier of form (79) with 0 training error rate, we consider the optimization problem
$$\underset{u\text{ with }\left\Vert u\right\Vert =1\text{ and }\beta_{0}\in\mathbb{R}}{\text{maximize}}\;M\quad\text{subject to}\quad y_{i}\left(x_{i}'u+\beta_{0}\right)\geq M\;\forall i.\qquad(80)$$
This can be thought of in terms of choosing a unit vector $u$ (a direction) in $\mathbb{R}^{p}$ so that, upon projecting the training input vectors $x_{i}$ onto the subspace of multiples of $u$, there is maximum separation between the convex hull of the projections of the $x_{i}$ with $y_{i}=-1$ and the convex hull of the projections of the $x_{i}$ with corresponding $y_{i}=1$.
The sign on $u$ is chosen to give the latter larger values than the former. Notice that if $u$ and $\beta_{0}$ solve this maximization problem,

1. $M\leq\frac{1}{2}\left(\min_{x_{i}\text{ with }y_{i}=1}x_{i}'u-\max_{x_{i}\text{ with }y_{i}=-1}x_{i}'u\right)$, and

2. $\beta_{0}=-\frac{1}{2}\left(\min_{x_{i}\text{ with }y_{i}=1}x_{i}'u+\max_{x_{i}\text{ with }y_{i}=-1}x_{i}'u\right)$.

For purposes of applying standard optimization theory and software, it is useful to reformulate the basic problem (80) several ways. First note that (80) may be rewritten as
$$\underset{u\text{ with }\left\Vert u\right\Vert =1\text{ and }\beta_{0}\in\mathbb{R}}{\text{maximize}}\;M\quad\text{subject to}\quad y_{i}\left(\frac{x_{i}'u+\beta_{0}}{M}\right)\geq1\;\forall i.\qquad(81)$$
Then if we let $\beta=u/M$, it's the case that $\left\Vert \beta\right\Vert =1/M$ (or $M=1/\left\Vert \beta\right\Vert$), so that (81) can be rewritten
$$\underset{\beta\in\mathbb{R}^{p}\text{ and }\beta_{0}\in\mathbb{R}}{\text{minimize}}\;\frac{1}{2}\left\Vert \beta\right\Vert ^{2}\quad\text{subject to}\quad y_{i}\left(x_{i}'\beta+\beta_{0}\right)\geq1\;\forall i.\qquad(82)$$
This formulation (82) is that of a convex (quadratic criterion, linear inequality constraints) optimization problem, for which there exists standard theory and algorithms. The so-called primal functional corresponding to (82) is, for $\alpha\in\mathbb{R}^{N}$,
$$F_{P}\left(\beta,\beta_{0},\alpha\right)=\frac{1}{2}\left\Vert \beta\right\Vert ^{2}-\sum_{i=1}^{N}\alpha_{i}\left[y_{i}\left(x_{i}'\beta+\beta_{0}\right)-1\right]\quad\text{for }\alpha\geq0.$$
To solve (82), one may for each $\alpha\geq0$ choose $\left(\beta,\beta_{0}\right)$ to minimize $F_{P}\left(\beta,\beta_{0},\alpha\right)$, and then choose $\alpha\geq0$ to maximize $F_{P}\left(\beta\left(\alpha\right),\beta_{0}\left(\alpha\right),\alpha\right)$. The Karush–Kuhn–Tucker conditions are necessary and sufficient for solution of such a constrained optimization problem. In the present context they are the gradient conditions
$$\frac{\partial F_{P}\left(\beta,\beta_{0},\alpha\right)}{\partial\beta_{0}}=-\sum_{i=1}^{N}\alpha_{i}y_{i}=0\qquad(83)$$
and
$$\frac{\partial F_{P}\left(\beta,\beta_{0},\alpha\right)}{\partial\beta}=\beta-\sum_{i=1}^{N}\alpha_{i}y_{i}x_{i}=0,\qquad(84)$$
the feasibility conditions
$$y_{i}\left(x_{i}'\beta+\beta_{0}\right)-1\geq0\;\forall i,\qquad(85)$$
the non-negativity conditions
$$\alpha_{i}\geq0\;\forall i,\qquad(86)$$
and the orthogonality conditions
$$\alpha_{i}\left[y_{i}\left(x_{i}'\beta+\beta_{0}\right)-1\right]=0\;\forall i.\qquad(87)$$
Now (83) and (84) are
$$\sum_{i=1}^{N}\alpha_{i}y_{i}=0\quad\text{and}\quad\beta=\sum_{i=1}^{N}\alpha_{i}y_{i}x_{i}\equiv\beta\left(\alpha\right),\qquad(88)$$
and plugging these into $F_{P}\left(\beta,\beta_{0},\alpha\right)$ gives a function of $\alpha$ only,
$$F_{D}\left(\alpha\right)=\frac{1}{2}\left\Vert \beta\left(\alpha\right)\right\Vert ^{2}-\sum_{i}\alpha_{i}\left[y_{i}\left(x_{i}'\beta\left(\alpha\right)+\beta_{0}\right)-1\right]=\sum_{i}\alpha_{i}-\frac{1}{2}\sum_{i}\sum_{j}\alpha_{i}\alpha_{j}y_{i}y_{j}x_{i}'x_{j}=\mathbf{1}'\alpha-\frac{1}{2}\alpha'H\alpha\qquad(89)$$
for $H=\left[y_{i}y_{j}x_{i}'x_{j}\right]_{N\times N}$ (the $\beta_{0}$ term drops because $\sum_{i}\alpha_{i}y_{i}=0$). Then the "dual" problem is
$$\text{maximize}\;\mathbf{1}'\alpha-\frac{1}{2}\alpha'H\alpha\quad\text{subject to}\quad\alpha\geq0\text{ and }\alpha'y=0,\qquad(90)$$
and apparently this problem is easily solved. Now condition (87) implies that if $\alpha_{i}>0$, then $y_{i}\left(x_{i}'\beta\left(\alpha_{opt}\right)+\beta_{0}\right)=1$, so that

1. by (85), the corresponding $x_{i}$ has minimum $x_{i}'\beta\left(\alpha_{opt}\right)$ among training vectors with $y_{i}=1$ (or maximum $x_{i}'\beta\left(\alpha_{opt}\right)$ among training vectors with $y_{i}=-1$), so that $x_{i}$ is a support vector for the "slab" of thickness $2M$ around a separating hyperplane;

2. $\beta_{0}\left(\alpha_{opt}\right)$ may be determined using the corresponding $x_{i}$ from $y_{i}\left(x_{i}'\beta\left(\alpha_{opt}\right)+\beta_{0}\left(\alpha_{opt}\right)\right)=1$, i.e., $\beta_{0}\left(\alpha_{opt}\right)=y_{i}-x_{i}'\beta\left(\alpha_{opt}\right)$ (apparently, for reasons of numerical stability it is common practice to average the values $y_{i}-x_{i}'\beta\left(\alpha_{opt}\right)$ over support vectors in order to evaluate $\beta_{0}\left(\alpha_{opt}\right)$); and

3. $1=y_{i}\left(\beta_{0}\left(\alpha_{opt}\right)+\sum_{j=1}^{N}\alpha_{opt,j}\,y_{j}x_{j}'x_{i}\right)$ for each such support vector $x_{i}$.

The fact (88) that $\beta\left(\alpha\right)=\sum_{i}\alpha_{i}y_{i}x_{i}$ implies that only the training cases with $\alpha_{i}>0$ (typically corresponding to a relatively few support vectors) determine the nature of the solution to this optimization problem. Further, for $\mathcal{S}$ the set of indices of support vectors in the problem,
$$\left\Vert \beta\left(\alpha_{opt}\right)\right\Vert ^{2}=\sum_{i\in\mathcal{S}}\sum_{j\in\mathcal{S}}\alpha_{opt,i}\,\alpha_{opt,j}\,y_{i}y_{j}x_{i}'x_{j}=\sum_{i\in\mathcal{S}}\alpha_{opt,i}\,y_{i}x_{i}'\beta\left(\alpha_{opt}\right)=\sum_{i\in\mathcal{S}}\alpha_{opt,i}\left(1-y_{i}\beta_{0}\left(\alpha_{opt}\right)\right)=\sum_{i\in\mathcal{S}}\alpha_{opt,i},$$
the last of these following from 3 above (and the fact that $\sum_{i}\alpha_{opt,i}\,y_{i}=0$). Then the "margin" for this problem is simply
$$M=\frac{1}{\left\Vert \beta\left(\alpha_{opt}\right)\right\Vert }=\frac{1}{\sqrt{\sum_{i\in\mathcal{S}}\alpha_{opt,i}}}.$$
A small numerical sketch of solving the dual (90) appears below.
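The dual (90) is a small quadratic program. The following toy sketch (made-up data of mine, and a general-purpose SLSQP routine rather than a dedicated QP solver) recovers $\beta\left(\alpha_{opt}\right)$, $\beta_{0}\left(\alpha_{opt}\right)$, and the margin for a linearly separable training set.

```python
import numpy as np
from scipy.optimize import minimize

# toy linearly separable training set (made-up data)
X = np.array([[1.0, 1.0], [2.0, 2.5], [2.5, 1.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
N = len(y)

H = (y[:, None] * y[None, :]) * (X @ X.T)        # H = [y_i y_j x_i'x_j]

# dual (90): maximize 1'a - (1/2) a'Ha  subject to  a >= 0, a'y = 0
res = minimize(lambda a: 0.5 * a @ H @ a - a.sum(),
               np.zeros(N),
               jac=lambda a: H @ a - 1.0,
               method="SLSQP",
               bounds=[(0.0, None)] * N,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x

beta = (alpha * y) @ X                           # beta(alpha_opt) = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                                # support vectors
beta0 = np.mean(y[sv] - X[sv] @ beta)            # average y_i - x_i'beta over support vectors
margin = 1.0 / np.sqrt(alpha[sv].sum())          # M = 1/||beta|| = 1/sqrt(sum of alpha_i)
print(beta, beta0, margin)
```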
11.2 The Linearly Nonseparable Case

In a linearly nonseparable case, the convex optimization problem (82) does not have a solution: no pair $\beta\in\mathbb{R}^{p}$ and $\beta_{0}\in\mathbb{R}$ provides $y_{i}\left(x_{i}'\beta+\beta_{0}\right)\geq1\;\forall i$. We might therefore, in looking for good choices of $\beta\in\mathbb{R}^{p}$ and $\beta_{0}\in\mathbb{R}$, try to relax the constraints of the problem slightly. That is, suppose that $\xi_{i}\geq0$ and consider the set of constraints
$$y_{i}\left(x_{i}'\beta+\beta_{0}\right)+\xi_{i}\geq1\;\forall i;$$
the $\xi_{i}$ are called "slack" variables and provide some "wiggle room" in the search for a hyperplane that "nearly" separates the two classes with a good margin. We might try to control the total amount of slack allowed by setting a limit
$$\sum_{i=1}^{N}\xi_{i}\leq C$$
for some positive $C$. Note that if $\xi_{i}=0$, case $i$ is correctly classified in the training set, and so if for some pair $\beta\in\mathbb{R}^{p}$ and $\beta_{0}\in\mathbb{R}$ this holds for all $i$, we have a separable problem. So any nonseparable problem must have at least one negative $y_{i}\left(x_{i}'\beta+\beta_{0}\right)$, and hence at least one $\xi_{i}>1$, for any $\beta\in\mathbb{R}^{p}$ and $\beta_{0}\in\mathbb{R}$ pair. This in turn requires that the budget $C$ must be at least 1 for a nonseparable problem to have a solution, even with the addition of slack variables. In fact, this reasoning implies that a budget $C$ allows for at most $C$ misclassifications in the training set (and in a nonseparable case, $C$ must be allowed to be large enough so that some choice of $\beta\in\mathbb{R}^{p}$ and $\beta_{0}\in\mathbb{R}$ produces a classifier with training error rate no larger than $C/N$).

In any event, we consider the optimization problem
$$\underset{\beta\in\mathbb{R}^{p}\text{ and }\beta_{0}\in\mathbb{R}}{\text{minimize}}\;\frac{1}{2}\left\Vert \beta\right\Vert ^{2}\quad\text{subject to}\quad y_{i}\left(x_{i}'\beta+\beta_{0}\right)+\xi_{i}\geq1\;\forall i\text{ for some }\xi\geq0\text{ with }\sum\xi_{i}\leq C,\qquad(91)$$
that can be thought of as generalizing the problem (82). HTF prefer to motivate (91) as equivalent to
$$\underset{u\text{ with }\left\Vert u\right\Vert =1\text{ and }\beta_{0}\in\mathbb{R}}{\text{maximize}}\;M\quad\text{subject to}\quad y_{i}\left(x_{i}'u+\beta_{0}\right)\geq M\left(1-\xi_{i}\right)\;\forall i\text{ for some }\xi\geq0\text{ with }\sum\xi_{i}\leq C,$$
generalizing the original problem (80), but it's not so clear that this is any more natural than simply beginning with (91). A more convenient version of (91) is
$$\underset{\beta\in\mathbb{R}^{p}\text{ and }\beta_{0}\in\mathbb{R}}{\text{minimize}}\;\frac{1}{2}\left\Vert \beta\right\Vert ^{2}+C^{*}\sum_{i=1}^{N}\xi_{i}\quad\text{subject to}\quad y_{i}\left(x_{i}'\beta+\beta_{0}\right)+\xi_{i}\geq1\;\forall i\text{ for some }\xi\geq0.\qquad(92)$$
A nice development on pages 376–378 of Izenman's book provides the following solution to this problem (92), parallel to the development in Section 11.1. Generalizing (90), the "dual" problem boils down to
$$\text{maximize}\;\mathbf{1}'\alpha-\frac{1}{2}\alpha'H\alpha\quad\text{subject to}\quad0\leq\alpha\leq C^{*}\mathbf{1}\text{ and }\alpha'y=0,\qquad(93)$$
for
$$H=\left[y_{i}y_{j}x_{i}'x_{j}\right]_{N\times N}.\qquad(94)$$
The constraint $0\leq\alpha\leq C^{*}\mathbf{1}$ is known as a "box constraint," and the feasible region prescribed in (93) is the intersection of a hyperplane (defined by $\alpha'y=0$) and a "box" in the positive orthant. The $C^{*}=\infty$ version of this reduces to the "hard margin" separable case. Upon solving (93) for $\alpha_{opt}$, the optimal $\beta\in\mathbb{R}^{p}$ is of the form
$$\beta\left(\alpha_{opt}\right)=\sum_{i\in\mathcal{S}}\alpha_{opt,i}\,y_{i}x_{i}\qquad(95)$$
for $\mathcal{S}$ the set of indices of "support vectors" $x_{i}$, which have $\alpha_{opt,i}>0$. The points with $0<\alpha_{opt,i}<C^{*}$ lie on the edge of the margin (have $\xi_{i}=0$), and the ones with $\alpha_{opt,i}=C^{*}$ have $\xi_{i}>0$. Any of the support vectors on the edge of the margin (with $0<\alpha_{opt,i}<C^{*}$) may be used to solve for $\beta_{0}\in\mathbb{R}$ as
$$\beta_{0}\left(\alpha_{opt}\right)=y_{i}-x_{i}'\beta\left(\alpha_{opt}\right),\qquad(96)$$
and again, apparently for reasons of numerical stability, it is common practice to average the values $y_{i}-x_{i}'\beta\left(\alpha_{opt}\right)$ for such support vectors in order to evaluate $\beta_{0}\left(\alpha_{opt}\right)$. Clearly, in this process the constant $C^{*}$ functions as a regularization/tuning parameter, and large $C^{*}$ in (92) corresponds to small $C$ in (91). Identification of a classifier requires only solution of the dual problem (93) and evaluation of (95) and (96) to produce the linear form (78) and classifier (79).

11.3 SVM's and Kernels

The form (78) is, of course (and by design), linear in the coordinates of $x$. A natural generalization of this development would be to consider forms that are linear in some nonlinear functions of the coordinates of $x$. There is nothing really new or special to SVM classifiers associated with this possibility if it is applied by simply defining some basis functions $h_{m}$ and considering the form
$$g\left(x\right)=\beta_{0}+\sum_{m}\beta_{m}h_{m}\left(x\right)$$
for use in a classifier. Indeed, HTF raise this kind of possibility already in their Chapter 4, where they introduce linear classification methods. However, the fact that in both the linearly separable and linearly nonseparable cases optimal SVM's depend upon the training input vectors $x_{i}$ only through their inner products (see again (89) and (94)), together with our experience with RKHS's and the computation of inner products in function spaces in terms of kernel values, suggests another way in which one might employ linear forms of nonlinear functions in classification.
11.3.1 Heuristics

Let $K$ be a continuous non-negative definite kernel, and consider the possibility of using the functions $K\left(x,x_{1}\right),K\left(x,x_{2}\right),\ldots,K\left(x,x_{N}\right)$ to build new $N$-dimensional feature vectors
$$k\left(x\right)=\left(K\left(x,x_{1}\right),K\left(x,x_{2}\right),\ldots,K\left(x,x_{N}\right)\right)'$$
for any input vector $x$ (including the $x_{i}$ in the training set), and, rather than defining inner products for the new feature vectors of input vectors $x$ and $z$ in terms of $\mathbb{R}^{N}$ inner products
$$k\left(x\right)'k\left(z\right)=\sum_{l=1}^{N}K\left(x,x_{l}\right)K\left(z,x_{l}\right),$$
we instead consider using the RKHS inner products of the corresponding functions $K\left(x,\cdot\right)$ and $K\left(z,\cdot\right)$,
$$\left\langle K\left(x,\cdot\right),K\left(z,\cdot\right)\right\rangle _{\mathcal{H}_{K}}=K\left(x,z\right).$$
Then, in place of (89) or (94), define
$$H^{*}=\left[y_{i}y_{j}K\left(x_{i},x_{j}\right)\right]_{N\times N}\qquad(97)$$
and let $\alpha_{opt}$ solve either (90) or (93) with $H^{*}$ in place of $H$. Then, with
$$\beta\left(\alpha_{opt}\right)=\sum_{i=1}^{N}\alpha_{opt,i}\,y_{i}k\left(x_{i}\right)$$
as in the developments of the previous sections, we replace the $\mathbb{R}^{N}$ inner product of $\beta\left(\alpha_{opt}\right)$ and a feature vector $k\left(x\right)$ with
$$\left\langle \sum_{i=1}^{N}\alpha_{opt,i}\,y_{i}K\left(x_{i},\cdot\right),K\left(x,\cdot\right)\right\rangle _{\mathcal{H}_{K}}=\sum_{i=1}^{N}\alpha_{opt,i}\,y_{i}\left\langle K\left(x_{i},\cdot\right),K\left(x,\cdot\right)\right\rangle _{\mathcal{H}_{K}}=\sum_{i=1}^{N}\alpha_{opt,i}\,y_{i}K\left(x_{i},x\right).$$
Then, for any $i$ for which $\alpha_{opt,i}>0$ (an index corresponding to a "support feature vector" in this context), we set
$$\beta_{0}\left(\alpha_{opt}\right)=y_{i}-\sum_{j=1}^{N}\alpha_{opt,j}\,y_{j}K\left(x_{i},x_{j}\right)$$
and have the analogue of (78)
$$g\left(x\right)=\sum_{i=1}^{N}\alpha_{opt,i}\,y_{i}K\left(x_{i},x\right)+\beta_{0}\left(\alpha_{opt}\right)\qquad(98)$$
with corresponding classifier
$$f\left(x\right)=\mathrm{sign}\left(g\left(x\right)\right)\qquad(99)$$
as an analogue of (79). It remains to argue that this classifier, developed simply by analogy, is a solution to any well-posed optimization problem.

11.3.2 An Optimality Argument

The heuristic argument for the use of kernels in the SVM context to produce form (98) and classifier (99) is clever enough that some authors simply let it stand on its own as "justification" for using "the kernel trick" of replacing $\mathbb{R}^{N}$ inner products of feature vectors with $\mathcal{H}_{K}$ inner products of basis functions. A more satisfying argument would be based on an appeal to optimality/regularization considerations. HTF's treatment of this possibility is not helpful. The following is based on a 2002 Machine Learning paper of Lin, Wahba, Zhang, and Lee.

As in Section 4.4, consider the RKHS $\mathcal{H}_{K}$ associated with the non-negative definite kernel $K$, and the optimization problem involving the "hinge loss"
$$\underset{g\in\mathcal{H}_{K}\text{ and }\beta_{0}\in\mathbb{R}}{\text{minimize}}\;\frac{1}{N}\sum_{i=1}^{N}\left(1-y_{i}\left(\beta_{0}+g\left(x_{i}\right)\right)\right)_{+}+\lambda\left\Vert g\right\Vert _{\mathcal{H}_{K}}^{2}.\qquad(100)$$
(In the discussion of AdaBoost, the exponential loss was convenient when it came time to make optimality arguments; here the hinge loss is convenient. For $\left(X,Y\right)\sim P$ it is reasonably straightforward to argue that the function $g$ minimizing the expected hinge loss $E\left(1-Yg\left(X\right)\right)_{+}$ is $g\left(x\right)=\mathrm{sign}\left(P\left[Y=1\mid X=x\right]-\frac{1}{2}\right)$, so one might hope to build good classifiers as $f\left(x\right)=\mathrm{sign}\left(g\left(x\right)\right)$.)

Exactly as noted in Section 4.4, an optimizing $g\in\mathcal{H}_{K}$ above must be of the form
$$g\left(x\right)=\sum_{j=1}^{N}\beta_{j}K\left(x,x_{j}\right),$$
so the minimization problem is
$$\underset{\beta\in\mathbb{R}^{N}\text{ and }\beta_{0}\in\mathbb{R}}{\text{minimize}}\;\frac{1}{N}\sum_{i=1}^{N}\left(1-y_{i}\left(\beta_{0}+\sum_{j=1}^{N}\beta_{j}K\left(x_{i},x_{j}\right)\right)\right)_{+}+\lambda\sum_{i}\sum_{j}\beta_{i}\beta_{j}K\left(x_{i},x_{j}\right),$$
that is (for $\mathbf{K}=\left[K\left(x_{i},x_{j}\right)\right]_{N\times N}$ the Gram matrix),
$$\underset{\beta\in\mathbb{R}^{N}\text{ and }\beta_{0}\in\mathbb{R}}{\text{minimize}}\;\frac{1}{N}\sum_{i=1}^{N}\left(1-y_{i}\left(\beta_{0}+\left(\mathbf{K}\beta\right)_{i}\right)\right)_{+}+\lambda\beta'\mathbf{K}\beta.$$
Now apparently this is equivalent to the optimization problem
$$\underset{\beta\in\mathbb{R}^{N}\text{ and }\beta_{0}\in\mathbb{R}}{\text{minimize}}\;\sum_{i=1}^{N}\xi_{i}+N\lambda\beta'\mathbf{K}\beta\quad\text{subject to}\quad y_{i}\left(\beta_{0}+\left(\mathbf{K}\beta\right)_{i}\right)+\xi_{i}\geq1\;\forall i\text{ for some }\xi\geq0,$$
which also, apparently, for $H^{*}$ as in (97), has a dual problem of the form
$$\text{maximize}\;\mathbf{1}'\eta-\frac{1}{4N\lambda}\eta'H^{*}\eta\quad\text{subject to}\quad0\leq\eta\leq\mathbf{1}\text{ and }\eta'y=0,\qquad(101)$$
i.e. (after the rescaling $\alpha=\eta/\left(2N\lambda\right)$),
$$\text{maximize}\;\mathbf{1}'\alpha-\frac{1}{2}\alpha'H^{*}\alpha\quad\text{subject to}\quad0\leq\alpha\leq\frac{1}{2N\lambda}\mathbf{1}\text{ and }\alpha'y=0.\qquad(102)$$
That is, the function-space optimization problem (100) has a dual that is the same as (93) for the choice $C^{*}=1/\left(2N\lambda\right)$ and the kernel matrix $H^{*}$ produced by the heuristic argument in Section 11.3.1. Then, if $\eta_{opt}$ is a solution to (101), Lin et al. say that an optimal $\beta\in\mathbb{R}^{N}$ is
$$\beta_{opt}=\frac{1}{2N\lambda}\mathrm{diag}\left(y_{1},\ldots,y_{N}\right)\eta_{opt},$$
this producing coefficients to be applied to the functions $K\left(\cdot,x_{j}\right)$. On the other hand, the heuristic of Section 11.3.1 prescribes that for $\alpha_{opt}$ the solution to (102), the coefficients in the vector $\mathrm{diag}\left(y_{1},\ldots,y_{N}\right)\alpha_{opt}$ get applied to the functions $K\left(\cdot,x_{j}\right)$. Upon recognizing that $\eta_{opt}=2N\lambda\,\alpha_{opt}$, it becomes evident that for the choice of $C^{*}=1/\left(2N\lambda\right)$ and kernel $K$, the heuristic in Section 11.3.1 produces a solution to the optimization problem (100).
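In practice one rarely codes the kernelized dual by hand. As a purely illustrative sketch (the data, the kernel choice, and the identification of scikit-learn's `C` parameter with the $C^{*}$ of the reconstruction above are my assumptions), `sklearn.svm.SVC` solves the box-constrained dual for a supplied kernel and returns a classifier of the form (98)–(99):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # nonlinearly separated toy classes

lam = 0.01
N = len(y)
clf = SVC(kernel="rbf", gamma=1.0, C=1.0 / (2 * N * lam))  # C plays the role of C* above
clf.fit(X, y)

# the fitted classifier is a kernel expansion over the support vectors
print(len(clf.support_), "support vectors")
print(clf.predict([[0.1, 0.2], [2.0, 0.0]]))
```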
11.4 Other Support Vector Stuff

Several other issues related to the kind of arguments used in the development of SVM classifiers are discussed in HTF and Izenman. One is the matter of multi-class problems. That is, where $\mathcal{G}=\left\{ 1,2,\ldots,K\right\}$, how might one employ machinery of this kind? There are both heuristic and optimality-based methods in the literature. A heuristic "one-against-all" strategy might be the following. Invent $K$ 2-class problems, the $k$th based on
$$y_{ik}=\begin{cases}1 & \text{if }y_{i}=k\\-1 & \text{otherwise.}\end{cases}$$
Then, for a single $C^{*}$ and $k=1,2,\ldots,K$, solve the (possibly linearly nonseparable) 2-class optimization problems to produce functions $g_{k}\left(x\right)$ that would lead to one-versus-all classifiers $f_{k}\left(x\right)=\mathrm{sign}\left(g_{k}\left(x\right)\right)$. A possible overall classifier is then
$$f\left(x\right)=\arg\max_{k\in\mathcal{G}}g_{k}\left(x\right).$$
A second heuristic strategy is to develop a voting scheme based on pairwise comparisons. That is, one might invent problems of classifying "class $l$ versus class $m$" for $l<m$, choose a single $C^{*}$, and solve the (possibly linearly nonseparable) 2-class optimization problems to produce functions $g_{lm}\left(x\right)$ and corresponding classifiers $f_{lm}\left(x\right)=\mathrm{sign}\left(g_{lm}\left(x\right)\right)$. For $m>l$ define $f_{ml}\left(x\right)=-f_{lm}\left(x\right)$, and define an overall classifier by
$$f\left(x\right)=\arg\max_{k\in\mathcal{G}}\sum_{m\neq k}f_{km}\left(x\right)$$
or, equivalently,
$$f\left(x\right)=\arg\max_{k\in\mathcal{G}}\sum_{m\neq k}I\left[f_{km}\left(x\right)=1\right].$$
Another type of question related to the support vector material is the extent to which similar methods might be relevant in regression-type prediction problems. As a matter of fact, there are loss functions (alternative to squared error or absolute error) that lead naturally to the use of the kind of technology needed to produce the SVM classifiers. That is, one might consider so-called "$\epsilon$-insensitive" losses for prediction, like
$$L\left(y,\hat{y}\right)=\max\left(\left|y-\hat{y}\right|-\epsilon,0\right)\quad\text{or}\quad L\left(y,\hat{y}\right)=\max\left(\left(y-\hat{y}\right)^{2}-\epsilon,0\right),$$
and be led to the kind of optimization methods employed in the SVM classification context. See Izenman, pages 398–401, in this regard.

12 Prototype Methods and Nearest Neighbors (Ch. 13 of HTF)

We saw when looking at "linear" methods of classification in Section 3 that these can reduce to classification to the class with fitted mean "closest" (in some appropriate sense) to an input vector $x$. A related notion is to represent each class by several "prototype" vectors of inputs and to classify to the class with the closest prototype. In their Chapter 13, HTF consider such classifiers and related nearest neighbor classifiers.

So consider a $K$-class classification problem where $Y$ takes values in $\mathcal{G}=\left\{ 1,2,\ldots,K\right\}$, and suppose that the coordinates of input $X$ have been standardized according to training means and standard deviations. For each class $k=1,2,\ldots,K$, represent the class by prototypes $z_{k1},z_{k2},\ldots,z_{kR}$ belonging to $\mathbb{R}^{p}$, and consider the classifier/predictor
$$\hat{f}\left(x\right)=\arg\min_{k}\,\min_{l}\left\Vert x-z_{kl}\right\Vert ,$$
that is, one classifies to the class that has a prototype closest to $x$. The most obvious question in using such a rule is "How does one choose the prototypes?" One standard (admittedly ad hoc, but not unreasonable) method is to use the so-called "$K$-means clustering algorithm" one class at a time. (The "$K$" in the name of this algorithm has nothing to do with the number of classes in the present context; in fact, here the "$K$" naming the clustering algorithm is our present $R$, the number of prototypes used per class.) And the point in applying the algorithm is not so much to see exactly how training vectors aggregate into "homogeneous" groups/clusters as it is to find a few vectors to represent them. For $T_{k}=\left\{ x_{i}\text{ with corresponding }y_{i}=k\right\}$, an "$R$"-means algorithm might proceed by

1. randomly selecting $R$ different elements from $T_{k}$, say $z_{k1}^{\left(1\right)},z_{k2}^{\left(1\right)},\ldots,z_{kR}^{\left(1\right)}$, and then

2. for $m=2,3,\ldots$, letting $z_{kl}^{\left(m\right)}$ be the mean of all $x_{i}\in T_{k}$ with $\left\Vert x_{i}-z_{kl}^{\left(m-1\right)}\right\Vert <\left\Vert x_{i}-z_{kl'}^{\left(m-1\right)}\right\Vert$ for all $l'\neq l$,

iterating until convergence. A minimal sketch of this prototype construction and the associated classifier appears below.
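Here is a minimal sketch of the prototype idea, using scikit-learn's `KMeans` within each class to produce $R$ prototypes and then classifying to the class with the closest prototype; the function names and parameter values are my own illustrative choices, and the inputs are assumed already standardized.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_prototypes(X, y, R=3, seed=0):
    """Run R-means within each class; the cluster centers are the prototypes z_kl."""
    protos = {}
    for k in np.unique(y):
        km = KMeans(n_clusters=R, n_init=10, random_state=seed).fit(X[y == k])
        protos[k] = km.cluster_centers_
    return protos

def classify_by_prototype(protos, xnew):
    """Classify to the class with the prototype closest to xnew."""
    dists = {k: np.min(np.linalg.norm(Z - xnew, axis=1)) for k, Z in protos.items()}
    return min(dists, key=dists.get)
```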
This way of choosing prototypes for class $k$ ignores the "locations" of the other classes (and the eventual use to which the prototypes will be put). A potential improvement on this is to employ some kind of algorithm (again ad hoc, but reasonable) that moves prototypes toward training input vectors in their own class and away from training input vectors from other classes. One such method is known by the name "LVQ" ("learning vector quantization"). This proceeds as follows. With a set of prototypes (chosen randomly, or from an $R$-means algorithm, or some other way)
$$z_{kl}\quad\text{for }k=1,2,\ldots,K\text{ and }l=1,2,\ldots,R$$
in hand, at each iteration $m=1,2,\ldots$ (for some sequence of "learning rates" $\epsilon_{m}$ with $\epsilon_{m}\geq0$ and $\epsilon_{m}\rightarrow0$):

1. sample an $x_{i}$ at random from the training set and find $\left(k,l\right)$ minimizing $\left\Vert x_{i}-z_{kl}\right\Vert$;

2. if $y_{i}=k$ (from 1.), update $z_{kl}$ as
$$z_{kl}^{new}=z_{kl}+\epsilon_{m}\left(x_{i}-z_{kl}\right),$$
and if $y_{i}\neq k$ (from 1.), update $z_{kl}$ as
$$z_{kl}^{new}=z_{kl}-\epsilon_{m}\left(x_{i}-z_{kl}\right),$$

iterating until "convergence."

As early as Section 1 we considered nearest neighbor methods, mostly for regression-type prediction. Consider here their use in classification problems. As before, define for each $x$ the neighborhood
$$N_{l}\left(x\right)=\text{the set of }l\text{ inputs }x_{i}\text{ in the training set closest to }x\text{ in }\mathbb{R}^{p}.$$
An $l$-nearest-neighbor method is to classify $x$ to the class with the largest representation in $N_{l}\left(x\right)$, possibly breaking ties at random. That is, define
$$f\left(x\right)=\arg\max_{k}\sum_{x_{i}\in N_{l}\left(x\right)}I\left[y_{i}=k\right].\qquad(103)$$
$l$ is a complexity parameter that might be chosen by cross-validation. Apparently, properly implemented, this kind of classifier can be highly effective in spite of the curse of dimensionality. This depends upon clever/appropriate/application-specific choice of feature vectors/functions, definition of an appropriate "distance" in order to define "closeness" and the neighborhoods, and appropriate local or global dimension reduction. HTF provide several attractive examples, particularly the LandSat example and the handwritten character recognition example, that concern feature selection and specialized distance measures. They then go on to consider more or less general approaches to dimension reduction for classification in high-dimensional Euclidean space. A minimal sketch of a plain $l$-nearest-neighbor classifier appears below.
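The sketch below fits an $l$-nearest-neighbor rule of the form (103) with standardized inputs and chooses $l$ by cross-validation; the stand-in data and all parameter values are my own illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# stand-in training data for illustration only
X, y = make_classification(n_samples=300, n_features=6, n_classes=3,
                           n_informative=4, random_state=0)

# choose the complexity parameter l by cross-validation, as suggested in the text
for l in (1, 3, 5, 11, 21):
    clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=l))
    print(l, cross_val_score(clf, X, y, cv=5).mean())
```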
A possibility for "local" dimension reduction is this. At $x\in\mathbb{R}^{p}$ one might use regular Euclidean distance to find, say, 50 neighbors of $x$ among the training inputs, and use them to identify an appropriate local distance to employ in actually defining the neighborhood $N_{l}\left(x\right)$ used in (103). The following is what I think is a correction of what HTF write in describing a DANN (discriminant adaptive nearest neighbor) squared metric at $x\in\mathbb{R}^{p}$. Let
$$D^{2}\left(z,x\right)=\left(z-x\right)'Q\left(z-x\right)\quad\text{for}\quad Q=W^{-1/2}\left[W^{-1/2}BW^{-1/2}+\epsilon I\right]W^{-1/2}=W^{-1/2}\left[B^{*}+\epsilon I\right]W^{-1/2}\qquad(104)$$
for some $\epsilon>0$, where $W$ is a pooled within-class sample covariance matrix
$$W=\sum_{k=1}^{K}\hat{\pi}_{k}W_{k}$$
(the $\hat{\pi}_{k}$ being class relative frequencies and $W_{k}$ the sample covariance matrix of the $x_{i}$ from class $k$ among the 50 points used to create the local metric), $\bar{x}_{k}$ is the average $x_{i}$ from class $k$ among those 50, $B$ is a weighted between-class covariance matrix of sample means
$$B=\sum_{k=1}^{K}\hat{\pi}_{k}\left(\bar{x}_{k}-\bar{x}\right)\left(\bar{x}_{k}-\bar{x}\right)',$$
$\bar{x}$ is a (probably weighted) average of the $\bar{x}_{k}$, and $B^{*}=W^{-1/2}BW^{-1/2}$. Notice that in (104) the "outside" $W^{-1/2}$ factors "sphere" the $\left(z-x\right)$ differences relative to the within-class covariance structure; $B^{*}$ is then the between-class covariance matrix of the sphered sample means. Without the $\epsilon I$, the distance would discount differences in the directions of the eigenvectors corresponding to large eigenvalues of $B^{*}$, allowing the neighborhood defined in terms of $D^{2}$ to be severely elongated in those directions. The effect of adding the $\epsilon I$ term is to limit this elongation to some degree, preventing $x_{i}$ "too far" (in terms of Euclidean distance) from $x$ from being included in $N_{l}\left(x\right)$.

A global use of the DANN kind of thinking might be the following. At each training input vector $x_{i}\in\mathbb{R}^{p}$, one might again use regular Euclidean distance to find, say, 50 neighbors and compute a weighted between-class-mean covariance matrix $B_{i}$ (as above) for that $x_{i}$. These might be averaged to produce
$$\bar{B}=\frac{1}{N}\sum_{i=1}^{N}B_{i}.$$
Then, for eigenvalues of $\bar{B}$, say $\lambda_{1}\geq\lambda_{2}\geq\cdots\geq\lambda_{p}\geq0$, with corresponding unit eigenvectors $u_{1},u_{2},\ldots,u_{p}$, one might do nearest neighbor classification based on the first few features $v_{j}=u_{j}'x$ and ordinary Euclidean distance. I'm fairly certain that this is a nearest neighbor version of the reduced rank classification idea first met in the discussion of linear classification in Section 3.

