NONPARAMETRIC REGRESSION TECHNIQUES

C&PE 940, 28 November 2005

Geoff Bohling
Assistant Scientist, Kansas Geological Survey
geoff@kgs.ku.edu, 864-2093

Overheads and other resources available at http://people.ku.edu/~gbohling/cpe940

Modeling Continuous Variables: Regression

In regression-style applications, we are trying to develop a model for predicting a continuous-valued numerical response variable, Y, from one or more predictor variables, X; that is, Y = f(X). In more classical statistical terminology, the X variables would be referred to as independent variables and Y as the dependent variable. In the language of the machine learning community, the X's might be referred to as inputs and the Y as an output. We will limit our discussion to:

- Supervised learning: The response function, f, is learned based on a set of training data with known X & Y values.

- Deterministic X: The noise or measurement error is considered to be strictly in the Y values, so that regression in the general sense of Y on X is the appropriate approach, rather than an approach aiming to minimize error in both directions (e.g., RMA regression).

In other words, we are looking at the case where each observed response, y_i, is given by

    y_i = f(\mathbf{x}_i) + \varepsilon_i

where x_i is the corresponding vector of observed predictors and \varepsilon_i is the noise or measurement error in the i-th observation of y.

In typical statistical modeling, the form of f(X) is not known exactly, and we instead substitute some convenient approximating function, f(X; \theta), with a set of parameters \theta that we can adjust to produce a reasonable match between the observed and predicted y values in our training dataset. What constitutes a reasonable match is measured by an objective function: some measure of the discrepancy between y_i and f(x_i; \theta) averaged over the training dataset. Ideally, the form of the objective function would be chosen to correspond to the statistical distribution of the error values \varepsilon_i, but usually this distribution is also unknown, and the form of the objective function is chosen more as a matter of convenience. By far the most commonly used form is the sum of squared deviations between observed and predicted responses, or residual sum of squares,

    R(\theta) = \sum_{i=1}^{N} [y_i - f(\mathbf{x}_i; \theta)]^2

If the errors are independent and distributed according to a common normal distribution, N(0, \sigma), then the residual sum of squares (RSS) is in fact the correct form for the objective function, in the sense that minimizing the RSS with respect to \theta yields the maximum likelihood estimates for the parameters. Because of its computational convenience, least-squares minimization is used in many settings regardless of the actual form of the error distribution.

A fairly common variant is weighted least-squares minimization, with the objective function

    R(\theta) = \sum_{i=1}^{N} \left( \frac{y_i - f(\mathbf{x}_i; \theta)}{\sigma_i} \right)^2

which would be appropriate if each error value were distributed according to \varepsilon_i ~ N(0, \sigma_i). However, we rarely have external information from which to evaluate the varying standard deviations \sigma_i. One approach to this problem is iteratively reweighted least squares (IRLS), where the \sigma_i values for each successive fit are approximated from the computed residuals of the previous fit. IRLS is more robust to the influence of outliers than standard least-squares minimization, because observations with large residuals (far from the fitted surface) will be assigned large \sigma_i values and thus downweighted with respect to the other observations.

Probably the only other form of objective function for regression-style applications that sees much use is the sum of absolute (rather than squared) deviations,

    R(\theta) = \sum_{i=1}^{N} |y_i - f(\mathbf{x}_i; \theta)|

which is appropriate when the errors follow a double-exponential (Laplace) distribution rather than a normal distribution.
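As a concrete illustration, here is a minimal IRLS sketch in R. The weighting scheme (weights inversely proportional to the absolute residuals of the previous fit, which steers the fit toward the L1 solution) and all of the names are illustrative assumptions, not taken from the notes; in practice one would typically use a packaged routine such as rlm in the MASS package, which performs robust M-estimation via IRLS.

    # Iteratively reweighted least squares: repeatedly refit a linear model,
    # with each observation's weight derived from its residual in the previous
    # fit, so that points far from the fitted surface are downweighted.
    irls_fit <- function(x, y, n_iter = 25, eps = 1e-6) {
      fit <- lm(y ~ x)                    # start from an ordinary least-squares fit
      for (i in seq_len(n_iter)) {
        r <- residuals(fit)               # residuals from the previous fit
        w <- 1 / pmax(abs(r), eps)        # large residual -> small weight
        fit <- lm(y ~ x, weights = w)     # weighted least-squares refit
      }
      fit
    }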
Minimizing the absolute residuals (the L1 norm) is also more robust to outliers than minimizing the squared residuals (the L2 norm), since squaring enhances the influence of the larger residuals. However, L1 minimization is more difficult, due to discontinuities in the derivatives of the objective function with respect to the parameters.

So, the basic ingredients for supervised learning are:

- A training dataset: Paired values of y_i and x_i for a set of N observations.

- An approximating function: We must assume some form for the approximating function, f(X; \theta), preferably one that is flexible enough to mimic a variety of true functions and that has a reasonably simple dependence on the adjustable parameters \theta.

- An objective function: We must also assume a form for the objective function, which we will attempt to minimize with respect to \theta to obtain the parameter estimates. The form of the objective function implies a distributional form for the errors, although we may often ignore this implication.

Even if we focus solely on the least-squares objective function, as we will here, the variety of possible choices for f(X; \theta) leads to a bewildering array of modeling techniques. However, all of these techniques share some common properties, elaborated below.

Another important ingredient for developing a reliable model is a test dataset, independent of the training dataset but with known y_i values, on which we can test our model's predictions. Without evaluating performance on a test dataset, we will almost certainly be drawn into overfitting the training data, meaning we will develop a model that reproduces the training data well but performs poorly on other datasets. In the absence of a truly independent test dataset, people often resort to cross-validation: withholding certain subsets from the training data and then predicting on the withheld data.

We will return to the Chase Group permeability example presented earlier by Dr. Doveton, except with a somewhat expanded and messier set of data. Here are the permeability data plotted versus porosity and uranium concentration. The circle sizes are proportional to the base-10 logarithm of permeability in millidarcies, with the permeability values ranging from 0.014 md (LogPerm = -1.85) to 32.7 md (LogPerm = 1.52). The contours represent LogPerm predicted from a linear regression on porosity and uranium. The residuals from the fit, represented by the gray scale, range from -2.39 to 1.27, with a standard deviation of 0.61.

[Figure: LogPerm data and linear-regression contours versus Porosity (%) and Uranium (ppm).]

For this dataset, the linear regression explains only 28% of the variation in LogPerm. Here is a perspective view of the fitted surface, followed by the prediction results versus depth.

[Figure: measured and least-squares predicted Log10 permeability (md) versus depth (feet).]

To start thinking about different approaches for estimating Y = f(X), imagine sitting at some point x in the predictor space, looking at the surrounding training data points and trying to figure out something reasonable to do with the y_i values at those points to estimate y at x.

Nearest-neighbor averaging: For each estimation point x, select the k nearest neighbors in the training dataset and average their responses:

    \hat{Y}(\mathbf{x}) = \frac{1}{k} \sum_{\mathbf{x}_i \in N_k(\mathbf{x})} y_i

where N_k(x) is the neighborhood surrounding x containing k training data points. This approach is extremely simple. It is also very flexible, in the sense that we can match any training dataset perfectly by choosing k = 1 (assuming there are no duplicate x values with different y_i values). Using one nearest neighbor amounts to assigning the constant value y_i to the region of space that is closer to x_i than to any other training data point (the Thiessen polygon of x_i).
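A minimal sketch of nearest-neighbor averaging in base R (the function and variable names are illustrative, not from the notes):

    # k-nearest-neighbor averaging: for each estimation point, average the
    # responses of the k closest training points (Euclidean distance).
    knn_average <- function(x_train, y_train, x_new, k = 5) {
      apply(x_new, 1, function(x0) {
        d <- sqrt(rowSums(sweep(x_train, 2, x0)^2))  # distances to all training points
        mean(y_train[order(d)[1:k]])                 # average the k nearest responses
      })
    }

With k = 1 this reproduces the training data exactly, illustrating both the flexibility and the overfitting risk described above.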
Why do we not just always use nearest-neighbor averaging? Because it does not generalize well, particularly in higher dimensions (larger numbers of predictor variables). It is very strongly affected by the curse of dimensionality.

The Curse of Dimensionality

We are trying to map out a surface in the space of predictor variables, which has a dimension, d, equal to the number of predictor variables. Imagine that all the variables have been standardized to a common scale and that we are considering a hypercube with a side length of c on that common scale. The volume of this hypercube is given by

    V = c^d

So, the volume of the hypercube increases as a power of the dimension. This means that it becomes harder and harder for the training data to fill the volume as the dimensionality increases, and the probability that an estimation point will fall in empty space, far from any training data point, increases as a power of the number of predictor variables. So, even a seemingly huge training dataset could be inadequate for modeling in high dimensions.

Another way of looking at the curse of dimensionality is that if n training data points seem adequate for developing a one-dimensional model (a single predictor variable), then n^d data points are required to achieve a comparable training data density for a d-dimensional problem. So, if 100 training data points were adequate for a 1-dimensional estimation problem, then 100^10 = 1 x 10^20 data points would be required to give a comparable density for a 10-dimensional problem. For any realistic training dataset, almost all of the predictor-variable space would be far from the nearest data point, and nearest-neighbor averaging would give questionable results.

A number of methods can be used to bridge the gap between a rigid global linear model and an overly flexible nearest-neighbor regression. Most of these methods use a set of basis or kernel functions to interpolate between training data points in a controlled fashion. These methods usually involve a large number of fitting parameters, but they are referred to as nonparametric because the resulting models are not restricted to simple parametric forms. The primary danger of nonparametric techniques is their ability to overfit the training data at the expense of their ability to generalize to other datasets. All these techniques offer some means of controlling the tradeoff between localization (representation of detail in the training data) and generalization (smoothing) through the selection of one or more tuning parameters. The specification of the tuning parameters is usually external to the actual fitting process, but optimal tuning-parameter values are often estimated via cross-validation.

Global regression with higher-order basis functions: The approximating function is represented as a linear expansion in a set of global basis or transformation functions, h_m(x):

    f(\mathbf{x}; \theta) = \sum_{m=1}^{M} \theta_m h_m(\mathbf{x})

The basis functions could represent polynomial terms, logarithmic transformations, trigonometric functions, etc. As long as the basis functions do not have any fitting parameters buried in them in a nonlinear fashion, this is still linear regression; that is, the dependence on the parameters is linear, and we can solve for them in a single step using the standard approach for multivariate linear regression. The tuning here is in the selection of basis functions, as the sketch below illustrates. For example, we could fit the training data arbitrarily well by selecting a large enough set of polynomial terms, but the resulting surface would flap about wildly away from the data points.
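As a hypothetical sketch (the variable names logperm, phi, and u are illustrative stand-ins for the Chase variables), a global regression on polynomial and interaction terms is still an ordinary one-step least-squares fit in R:

    # Linear regression on a fixed set of global basis functions: the basis
    # (here quadratic terms and an interaction) is chosen by the analyst, and
    # the coefficients are solved for in a single least-squares step.
    fit  <- lm(logperm ~ phi + u + I(phi^2) + I(u^2) + phi:u, data = train)
    pred <- predict(fit, newdata = test)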
Kernel Methods and Locally Weighted Regression

This involves weighting each neighboring data point according to a kernel function, giving a decreasing weight with distance, and then computing a weighted local mean, or a weighted local linear or polynomial regression model. So, this is basically the same as smoothing interpolation in geographic space, but now in the space of the predictor variables. The primary tuning parameter is the bandwidth of the kernel function, which is generally specified in a relative fashion so that the same value can be applied along all predictor axes. Larger bandwidths result in smoother functions. The form of the kernel function is of secondary importance.

Smoothing Splines

This involves fitting a sequence of local polynomial basis functions to minimize an objective function involving both model fit and model curvature, as measured by the second derivative; expressed in one dimension as

    R(f, \lambda) = \sum_{i=1}^{N} [y_i - f(x_i)]^2 + \lambda \int [f''(t)]^2 \, dt

The smoothing parameter \lambda controls the tradeoff between data fit and smoothness, with larger values leading to smoother functions (but larger residuals on the training data). The smoothing parameter can be selected through automated cross-validation, choosing the value that minimizes the average error on the withheld data. The approach can be generalized to higher dimensions. The natural form of the smoothing spline in two dimensions is referred to as a thin-plate spline.

The figures below show the optimal thin-plate spline fit for the Chase LogPerm data. The residuals from the fit range from -2.39 to 1.42, with a standard deviation of 0.48, compared to -2.39 to 1.27 with a standard deviation of 0.61 for the linear regression fit. The spline fit accounts for 56% of the total variation about the mean, compared to 28% for the least-squares fit.

[Figure: thin-plate spline fit to LogPerm versus Porosity (%) and Uranium (ppm).]

Here are the measured and thin-plate spline predicted LogPerms versus depth.

[Figure: measured and thin-plate spline predicted Log10 permeability (md) versus depth (feet).]
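The notes do not say which software produced the fits shown. As one hedged possibility in R, the mgcv package fits thin-plate regression splines (the default basis for its s() smooths), with the smoothing parameter chosen automatically by generalized cross-validation; the variable names are again illustrative:

    # Thin-plate spline surface for LogPerm as a function of porosity and
    # uranium; the smoothing parameter is selected automatically by GCV.
    library(mgcv)
    fit  <- gam(logperm ~ s(phi, u), data = train)  # s() defaults to a thin-plate basis
    pred <- predict(fit, newdata = test)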
Neural Networks

There is a variety of neural network types, but the most commonly applied form builds up the fitted surface as a summation of sigmoid (S-shaped) basis functions, each oriented along a different direction in variable space. A direction in variable space corresponds to a particular linear combination of the predictor variables. That is, for a set of coefficients \alpha_j, j = 1, ..., d, where j indexes the set of variables, plus an intercept coefficient \alpha_0, the linear combination

    \alpha_0 + \sum_{j=1}^{d} \alpha_j X_j

represents a combined variable that increases in a particular direction in predictor space, essentially a rotated axis in that space. For notational simplicity, we often add a column of 1's to the set of predictor variables, that is, X_0 = 1 for every data point, so that the linear combination can be represented more compactly as

    \sum_{j=0}^{d} \alpha_j X_j = \boldsymbol{\alpha}' \mathbf{X}

The second expression shows the summation as the inner product of the coefficient vector and the variable vector. A typical neural network will develop some number, M, of linear combinations like the one above, meaning M different coefficient vectors \alpha_m, and pass each one through a sigmoid transfer function to form a new variable (basis function):

    Z_m = \frac{1}{1 + \exp(-\boldsymbol{\alpha}_m' \mathbf{X})}

[Figure: the sigmoid basis function Z = 1/(1 + exp(-\alpha' x)).]

For continuous-variable prediction, the predicted output is then typically just a linear combination of the Z values,

    \beta_0 + \sum_{m=1}^{M} \beta_m Z_m = \sum_{m=0}^{M} \beta_m Z_m = \boldsymbol{\beta}' \mathbf{Z}

where again we have collapsed the intercept term into the coefficient vector by introducing Z_0 = 1. The complete expression for the approximating function is then

    f(\mathbf{X}; \theta) = \beta_0 + \sum_{m=1}^{M} \beta_m \frac{1}{1 + \exp(-\boldsymbol{\alpha}_m' \mathbf{X})}

If we chose to use M = 4 sigmoid basis functions for the Chase example, the network could be represented schematically as:

[Figure: network schematic with an input layer (bias X_0 = 1, porosity, uranium), a hidden layer of four sigmoid nodes plus a bias node, and an output layer.]

The input layer contains three nodes, representing the bias term X_0 = 1 and the two input variables, porosity and uranium. The hidden layer in the middle contains the four nodes that compute the sigmoid transfer functions, Z_m, plus the bias term Z_0 = 1. The lines connecting the input and hidden-layer nodes represent the coefficients, or weights, \alpha_m. The input to each hidden-layer node (excluding the hidden-layer bias node) is one of the linear combinations of input variables, and the output is one of the Z_m values. The lines connecting the hidden-layer nodes to the output node represent the coefficients, or weights, \beta_m, so the output node computes the linear combination \sum_{m=0}^{M} \beta_m Z_m, our estimate for LogPerm.

Training the network means adjusting the network weights to minimize the objective function measuring the mismatch between predicted and observed response variables in a training dataset. For continuous-variable prediction, this is usually the least-squares objective function introduced above. Adjusting the weights is an iterative optimization process involving the following steps:

0. Scale the input variables (e.g., to zero mean and unit standard deviation) so that they have roughly equal influence.
1. Guess an initial set of weights (usually a set of random numbers).
2. Compute predicted response values for the training dataset using the current weights.
3. Compute the residuals, or errors, between the observed & predicted responses.
4. Backpropagate the errors through the network, adjusting the weights to reduce the errors on the next go-round.
5. Return to step 2 and repeat until the weights stop changing significantly or the objective function is sufficiently small.

Any of a number of algorithms suitable for large-scale optimization can be used, meaning suitable for problems with a large number of unknown parameters, which are the network weights in this case. Given d input variables and M hidden-layer nodes (actually M + 1, including the bias node), the total number of weights to estimate is

    N_W = (d + 1) M + (M + 1)

(the \alpha's plus the \beta's), which can get to be quite a few parameters fairly easily.

Because we are trying to minimize the objective function in an N_W-dimensional space, it is very easy for the optimization process to get stuck in a local minimum rather than finding the global minimum, and typically you will get different results (find a different local minimum) starting from a different set of initial weights. Thus, the neural network has a stochastic aspect to it, and each set of estimated weights should be considered one possible realization of the network, rather than the correct answer.

The most fundamental control on the complexity of the estimated function is the choice of the number of hidden-layer nodes, M. (Usually the bias node is left out of the count of hidden-layer nodes.) We will refer to M as the size of the network. A larger network allows a richer, more detailed representation of the training data, while a smaller network yields a more generalized representation. Here is the result of using a network with a single hidden-layer node (just one sigmoid basis function) for the Chase permeability example:

[Figure: fitted surface for a network with one hidden-layer node.]

Here the network finds a rotated axis in Phi-U space, oriented in the direction of the most obvious LogPerm trend, from lower values in the northwest to higher values in the southeast, and fits a very sharp sigmoid function, practically a step function, in this direction.
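To make the structure concrete, here is a minimal sketch (in R, with illustrative names) of the forward pass of such a single-hidden-layer network; alpha holds one row of hidden-layer weights per hidden node and beta holds the output weights, matching the weight count N_W = (d + 1)M + (M + 1) above:

    # Forward pass of a single-hidden-layer sigmoid network.
    sigmoid <- function(u) 1 / (1 + exp(-u))
    nn_predict <- function(alpha, beta, X) {  # alpha: M x (d+1); beta: length M+1
      X1 <- cbind(1, X)                       # prepend the bias column X0 = 1
      Z  <- sigmoid(X1 %*% t(alpha))          # hidden-layer basis functions Z_m
      drop(cbind(1, Z) %*% beta)              # output: beta0 + sum of beta_m * Z_m
    }

Training amounts to adjusting alpha and beta to minimize the least-squares objective, via the iterative steps listed above.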
If we use three basis functions, we might get a representation like the following:

[Figure: fitted surface for a network with three hidden-layer nodes, no weight decay.]

This looks like our first basis function, plus one oriented roughly along the uranium axis, centered at about U = 12, and one oriented along an axis running from lower left to upper right, helping to separate out the low values in the southwest corner.

Quite often, a term involving the magnitudes of the network weights is added to the objective function, so that the weights are now adjusted to minimize

    R(\theta) = \sum_{i=1}^{N} [y_i - f(\mathbf{x}_i; \theta)]^2 + \lambda \sum_{w} w^2

where the second sum runs over the network weights. Minimizing this augmented function forces the network weights to be smaller than they would be in the absence of the second term, increasingly so as the damping or decay parameter \lambda increases. Forcing the weights to be smaller generally forces the sigmoid functions to be spread more broadly, leading to smoother representations overall. Here is the fit for a network with three hidden-layer nodes, using a decay parameter of 0.01:

[Figure: fitted surface for a three-node network with decay parameter 0.01.]

You can see that the basis functions developed here are much less step-like than those developed before with no decay (\lambda = 0).

Thus, the primary tuning parameters for this kind of neural network (a single-hidden-layer neural network with sigmoid basis functions) are the size of the network, M, and the decay parameter, \lambda. One strategy for choosing these might be to use a fairly large number of hidden-layer nodes, M, allowing the network to compute a large number of directions in predictor-variable space along which the response might show significant variation, and a large decay parameter, \lambda, forcing the resulting basis functions to be fairly smooth and also tending to reduce the weights associated with directions of less significant variation.

I have used cross-validation to attempt to estimate the optimal values of M and \lambda for the Chase example, using an R-language script to run over a range of values of both parameters and, for each parameter combination:

- Split the data at random into a training set (two-thirds of the data) and a test set (the remaining one-third).
- Train a network on the training set.
- Predict on the test set.
- Compute the root-mean-squared residual on the test set.

To account for the random variations due to the selection of training and testing data, and due to the stochastic nature of the neural network, I have run the splitting-training-testing cycle 100 times for each parameter combination, with a different random split and a different set of initial weights each time. One such cycle is sketched below.
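A minimal sketch of one splitting-training-testing cycle, assuming the nnet package (which fits exactly this kind of single-hidden-layer network with weight decay) and an illustrative data frame named chase; the full experiment would wrap this in loops over M, \lambda, and the 100 random repetitions:

    # One cross-validation cycle: random 2/3 - 1/3 split, train a network,
    # predict on the withheld third, and compute the rms residual.
    # (Inputs would typically be scaled to zero mean and unit sd first.)
    library(nnet)
    n     <- nrow(chase)
    idx   <- sample(n, size = round(2 * n / 3))  # random two-thirds for training
    train <- chase[idx, ]
    test  <- chase[-idx, ]
    fit   <- nnet(logperm ~ phi + u, data = train,
                  size = 10, decay = 0.01,       # M = 10 hidden nodes, lambda = 0.01
                  linout = TRUE, trace = FALSE)  # linear output for regression
    resid <- test$logperm - predict(fit, newdata = test)
    rmsr  <- sqrt(mean(resid^2))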
The results are summarized in the figure below.

[Figure: rms residual on the test data versus Log10(decay), in panels for several network sizes (3 to 100). The lines follow the median and the upper and lower quartiles of the rms residual values for each parameter combination.]

The lowest median rmsr, 0.341, occurs for a network size of 3 and a decay of 1 (a Log10(decay) of 0), but the results for M = 10 and \lambda = 0.01 are slightly better, in that the median rmsr is almost the same (0.342) and the upper and lower quartiles are a little lower. Using these values for M and \lambda, and training on the whole dataset four different times, leads to the following four realizations of the fitted surface:

[Figure: four realizations of the fitted surface versus porosity and uranium.]

The R^2 values (percent variation explained) for the above fits are 63% (upper left), 58% (upper right), 63% (lower left), and 62% (lower right). As with geostatistical stochastic simulation, the range of variation in the results for different realizations of the network could be taken as a measure of the uncertainty in your knowledge of the true surface, leading to an ensemble of prediction results rather than a single prediction for each value of X. You could, of course, average the predictions from a number of different network realizations.

Despite the varied appearance of the fitted surfaces versus Phi and U, the results do not vary greatly at the data points and look much the same plotted versus depth:

[Figure: measured and predicted Log10 permeability (md) versus depth (feet) for the four network realizations.]

We would expect more variation at points in Phi-U space at some distance from the nearest training data point. However, in this case, predictions versus depth at a nearby prediction well also look quite similar for the four different networks.

Reference

T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2001.