Note 12 for GEOS 585A with Professor Meko at UA
12 Validating the Regression Model

Regression R-squared, even if adjusted for the loss of degrees of freedom due to the number of predictors in the model, can give a misleading, overly optimistic view of the accuracy of prediction when the model is applied outside the calibration period. Application outside the calibration period is the rule rather than the exception in dendroclimatology. The calibration-period statistics are typically biased because the model is tuned for maximum agreement in the calibration period. Sometimes too large a pool of potential predictors is used in automated procedures to select final predictors. Another possible problem is that the calibration period itself may be anomalous in terms of the relationships between the variables: modeled relationships may hold up for some periods of time but not for others. It is advisable, therefore, to validate the regression model by testing it on data not used to fit the model.

Several approaches to validation are available. Among these are cross-validation and split-sample validation. In cross-validation, a series of regression models is fit, each time deleting a different observation from the calibration set and using the model to predict the predictand for the deleted observation. The merged series of predictions for the deleted observations is then checked for accuracy against the observed data. In split-sample calibration, the model is fit to some portion of the data (say, the second half), and accuracy is measured on the predictions for the other half of the data. The calibration and validation periods are then exchanged and the process repeated.

In any regression problem it is also important to keep in mind that modeled relationships may not be valid for periods when the predictors are outside their ranges for the calibration period: the multivariate distribution of the predictors for some observations outside the calibration period may have no analog in the calibration period. The distinction of predictions as extrapolations versus interpolations is useful in flagging such occurrences.

12.1 Validation

Validation strategies. Several alternative strategies for validation are available, and some may be better than others depending on the data and the purpose of the analysis. Three different ways of validating are:

1. Compare predictions made by the model with records of some proxy for the predictand.
   a. Calibrate on the entire available length of the overlap of predictors and predictand.
   b. Apply the model to predict outside the calibration period.
   c. Compare the predictions outside the calibration period with observations of some proxy for the predictand.
   d. Pro: uses all available data for calibration; a long calibration time series generally gives a more stable model.
   e. Con: the validation is semi-qualitative.

2. Validate the model with a time series segment of the predictand withheld from calibration.
   a. Calibrate on just a part of the period of overlap of predictors and predictand.
   b. Apply the model to generate predictions for the data withheld from calibration.
   c. Compare the predicted and observed predictand for the period withheld from calibration.
   d. Use the model from (a) for the final prediction.
   e. Pro: the model validated is the same model used for the final prediction.
   f. Con: requires a long time series of the predictand if some data are to be sacrificed to validation.

3. Cross-validation.
   a. Divide the period of overlap of predictors and predictand into two or more subsets.
   b. At each step in cross-validation, omit a subset and calibrate on the remaining data.
   c. Use the sub-period model from (b) to predict for the omitted sub-period.
   d. Repeat steps (b) and (c), each time omitting a different subset from calibration.
   e. Aggregate the predictions from the various steps into a single predicted series.
   f. Compare the aggregated predictions with the observed predictand.
   g. Recalibrate using the full-length predictand data for the final prediction model.
   h. Pro: optimum use of a relatively short predictand series.
   i. Con: the model validated is not exactly the same as the final model used for predictions.

Notes_12, GEOS 585A, Spring 2009
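Cross-validation with one-year subsets (strategy 3 above, in its leave-one-out form) can be sketched in a few lines of code. The sketch below is only an illustration, not software from the course: it assumes a simple linear regression on synthetic data, and the function and variable names are my own.

```python
import numpy as np

def loocv_predictions(x, y):
    """Leave-one-out cross-validation for a simple linear regression.

    Each year i is omitted in turn; the model is refit on the remaining
    n - 1 observations and used to predict the predictand for the
    deleted year.  The aggregated predictions form a single series that
    can be compared with the observed predictand.
    """
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                  # omit observation i from calibration
        slope, intercept = np.polyfit(x[keep], y[keep], 1)
        preds[i] = intercept + slope * x[i]       # predict for the deleted year
    return preds

# Synthetic calibration data: predictand linearly related to predictor plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)
yhat_cv = loocv_predictions(x, y)                 # one prediction per year
```

Validation statistics are then computed by comparing `yhat_cv` with the observed `y`; recalibrating on the full record gives the final prediction model (step g).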
These methods might be referred to as "leave-n-out." At one extreme, n is half the sample length. This type of cross-validation is split-sample validation. In split-sample validation, the model is calibrated on some fraction (say, the first half) of the data and validated on the other fraction (Snee 1977). Then the calibration and validation periods are exchanged and the calibration and validation done again. The final prediction model is then calibrated using the full available length of predictand data.

At the other extreme is leave-one-out cross-validation, which is equivalent to cross-validation as described by Michaelsen (1987) and to the predicted-residual-sum-of-squares procedure, or PRESS procedure, as described by Weisberg (1985). Say the full available period for calibration is n years. Models are repeatedly estimated using data sets of n - 1 years, each time omitting a different observation from calibration and using the estimated model to generate a predicted value of the predictand for the deleted observation. At the end of this procedure, a time series of n predictions assembled from the deleted observations is compared with the observed predictand to compute validation statistics of model accuracy and error.

Validation statistics

Validation statistics measure the error or accuracy of the prediction for the validation period. The statistics can generally be expressed as functions of just a few simple terms, or building blocks. We begin by defining the building blocks.

Validation errors. All of the statistics described here are computed as some function of the validation error, which is the difference of the observed and predicted values,

  e_(i) = y_i - yhat_(i) ,      (1)

where y_i and yhat_(i) are the observed and predicted values of the predictand in year i, and the subscript (i) indicates that data for year i were not used in fitting the model that generated the prediction.

Sum of squares of errors, validation (SSE_V). SSE_V is the sum of the squared differences of the observed and predicted values,

  SSE_V = sum[ e_(i)^2 ] ,      (2)

where the summation is over the n_V years making up the validation period.

Mean squared error of validation (MSE_V). MSE_V is the average squared error for the validation data, or the sum of squares of errors divided by the length of the validation period,

  MSE_V = SSE_V / n_V .      (3)

The closer the predictions to the actual data, the smaller the MSE_V. Recall that the calibration-period equivalent of MSE_V is the residual mean square, MSE, which was listed in the ANOVA table in the previous notes.

Root mean squared error of validation (RMSE_V). RMSE_V is a measure of the average size of the prediction error for the validation period, and is computed as the square root of the mean squared error of validation,

  RMSE_V = sqrt(MSE_V) = [ sum( e_(i)^2 ) / n_V ]^(1/2) .      (4)

RMSE_V has the advantage over MSE_V of being in the original units of the predictand. The calibration equivalent of RMSE_V is the standard error of the estimate, s. RMSE_V will generally be greater than s, because s reflects the tuning of the model to the data in the calibration period. The difference between RMSE_V and s is a practical measure of the importance of this tuning of the model. If the difference is small, the model is said to be validated, or to verify well. What is meant by "small" is somewhat subjective. For example, in a reconstruction of annual precipitation for agriculture, a difference of 0.2 inches between RMSE_V and s might be judged inconsequential if an error of 0.2 inches makes no appreciable difference to the health of the crop.

Reduction of error (RE). RE measures the skill of a regression model, defined as its accuracy relative to a prediction based on no knowledge. In defining RE it is first necessary to specify the no-knowledge prediction. Frequently this prediction is simply the calibration-period mean of the predictand, ybar_c. In other words, with no other knowledge about the predictand than its calibration-period data, it makes sense simply to substitute the calibration-period mean of the predictand as the predicted value for any year outside the calibration period.
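These building blocks, and the no-knowledge benchmark just described, are simple to compute. The following sketch is illustrative only; the function name, the toy numbers, and the assumed calibration mean are all mine, not from the course.

```python
import numpy as np

def validation_stats(y_obs, y_pred):
    """SSE_V, MSE_V, and RMSE_V for a validation period."""
    e = y_obs - y_pred                 # validation errors, observed minus predicted
    sse_v = float(np.sum(e ** 2))      # sum of squared errors
    mse_v = sse_v / len(e)             # mean squared error
    rmse_v = float(np.sqrt(mse_v))     # root mean squared error
    return sse_v, mse_v, rmse_v

# Toy 5-year validation period
y_obs = np.array([10.0, 12.0, 9.0, 11.0, 13.0])     # observed predictand
y_pred = np.array([11.0, 11.0, 10.0, 10.0, 12.0])   # model predictions
ybar_cal = 10.0                                     # assumed calibration-period mean

sse_v, mse_v, rmse_v = validation_stats(y_obs, y_pred)
# No-knowledge benchmark: predict the calibration mean in every year
sse_null, mse_null, rmse_null = validation_stats(y_obs, np.full(5, ybar_cal))
```

Here the model errors are 1 unit in every year, so SSE_V = 5.0, while the mean-only prediction gives a sum of squares of 15.0; the model clearly beats the no-knowledge benchmark, which is the comparison RE formalizes.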
Following Fritts et al. (1990), RE is then given by

  RE = 1 - SSE_V / SSE_null ,      (5)

where SSE_V is the sum of squares of validation errors as defined previously, and

  SSE_null = sum over the n_V validation years of ( y_i - ybar_c )^2 .      (6)

RE has a possible range of minus infinity to 1. An RE of 1 indicates perfect prediction for the validation period, and can be achieved only if all the residuals are zero (i.e., SSE_V = 0). On the other hand, the minimum possible value of RE cannot be specified, as RE can be negative and arbitrarily large in magnitude if SSE_V is much greater than SSE_null. As a rule of thumb, a positive RE is accepted as evidence of some skill of prediction. In contrast, if RE <= 0, the prediction, or reconstruction, is deemed to have no skill.

Recall that the equation for computing the regression R^2 is

  R^2 = 1 - SSE / SST .      (7)

The similarity in form of the equations for R^2 and RE (equations 7 and 5) suggests that RE be used as a validation equivalent of the regression R^2, and that a value of RE close to the value of R^2 be considered as evidence of validation. The rationale for this comparison is easily seen for leave-one-out cross-validation. In both equations the numerator is a sum of squares of prediction errors, and the denominator is the sum of squares of departures of the observed values of the predictand from a constant. For leave-one-out cross-validation, the constant is equal to the calibration-period mean for both (5) and (7). This is so because for leave-one-out cross-validation the aggregate validation period is essentially the same as the calibration period: each year of the calibration period is individually and separately used as a validation period in the iterative cross-validation, and the aggregate of these validation years is the validation period.

PRESS statistic. PRESS is an acronym for "predicted residual sum of squares" (Weisberg 1985, p. 217). The PRESS procedure is equivalent to leave-one-out cross-validation as described previously. The PRESS statistic is defined as

  PRESS = sum over i = 1, ..., n of e_(i)^2 ,      (8)

where e_(i) is the residual for observation i, computed as the difference between the observed value of the predictand and the prediction from a regression model calibrated on the set of n - 1 observations from which observation i was excluded. The PRESS statistic is therefore identical to the sum of squares of residuals for validation, SSE_V, defined in equation (2).

12.2 Cross-validation stopping rule

As described earlier, the automated entry of predictors into the regression equation runs the risk of overfitting, as R^2 is guaranteed to increase with each predictor entering the model. The adjusted R^2 is one alternative criterion to identify when to halt entry of predictors (e.g., Meko et al. 1980), but the adjusted R^2 has two major drawbacks. First, the theory behind adjusted R^2 assumes the predictors are independent, while in practice the predictors are often inter-correlated. Consequently, entry of an additional predictor does not necessarily mean the loss of one degree of freedom for estimation of the model. Second, the adjusted R^2 does not address the problem of selecting the predictors from a pool (sometimes a large pool) of potential predictors. If the pool of potential predictors is large, R^2 can be seriously biased high, and the bias will not be accounted for by the adjustment for the number of variables in the model used by the algorithm for adjusted R^2 (Rencher and Pun 1980).

An alternative method of guarding against overfitting the regression model is to use cross-validation as a guide for stopping the entry of additional predictors (Wilks 1995). By evaluating the performance of the model on data withheld from calibration at every step of the stepwise procedure, the level of complexity (number of predictors) above which the model is overfit can be estimated. Graphs of the change in calibration and validation accuracy statistics as a function of step in forward stepwise entry of predictors can be used as a guide for cutting off entry of predictors into the model. For example, in a graph of RMSE_V against step in a model run out to many steps (e.g., 10 steps), the step at
which the RMSE_V is minimized, or approximately so, can be set as the final step for the model. The same result would be achieved from a plot of RE against step, except that the maximum in RE indicates the best model.

Extending the entry of predictors beyond the indicated step amounts to overfitting the model. Overfitting refers to the tuning of the model to noise rather than to any real relationship between the variables. In the extreme, overfitting is illustrated by a model whose number of predictors equals the number of observations for calibration: the model will explain 100 percent of the variance of the predictand even if the predictor data are merely random noise.

12.3 Prediction (Reconstruction)

Predictions are the values of the predictand obtained when the prediction equation

  yhat = b_0 + b_1 x_1 + b_2 x_2 + ... + b_K x_K      (9)

is applied outside the period used to fit the model. For example, in dendroclimatology the tree-ring indices (the x's) for the long-term record are substituted into (9) to get estimates of past climate. The prediction is called a reconstruction in this case because the estimates are extended into the past rather than the future.

Once the regression model has been estimated, the generation of the reconstruction is a trivial mathematical step, but important assumptions are made in taking the step. First, the multivariate relationship between predictand and predictors in the calibration period is assumed to have applied in the past. This assumption might be violated for many possible reasons. For example, in a tree-ring reconstruction the climate for the calibration period may have been much different than for the earlier period, such that a threshold of response was exceeded in the earlier period. Or the quality of the tree-ring data might have decreased back in time because of a drop-off in sample size (number of trees in the chronologies). Many other data-dependent scenarios could be envisioned that would invalidate the application of the regression model to reconstruct past climate. For time series in general, regardless of the physical system, it is important to statistically check the ability of the model to predict outside its calibration period, or to validate the model as described in the preceding section.

12.4 Error bars for predictions

A reconstruction should always be accompanied by some estimate of its uncertainty. The uncertainty is frequently summarized by error bars on a time series plot of the reconstruction. Error bars can be derived by different methods.

1. Standard error of the estimate, s. Recall that s is computed as the square root of the mean squared residuals, MSE. Following Wilks (1995, p. 176), the Gaussian assumption leads to an expected 95 percent confidence interval of roughly

  CI = yhat +/- 2 s .      (10)

Confidence bands by this method are the same width for all reconstructed values. The +/- 2s rule of thumb is often a good approximation to the 95 percent confidence interval, especially if the sample size for calibration is large (Wilks 1995, p. 176). But because of uncertainty in the sample mean of the predictand and in the estimates of the regression coefficients, the prediction variance for data not used to fit the model is somewhat larger than indicated by MSE, and is not the same for all predicted values. This consideration gives rise to a slightly more precise estimate of prediction error, called the standard error of prediction (see next section). Also note that the 2 in (10) is a rounded-off value of the 0.975 probability point on the cdf of the standard normal distribution (1.96 rounded to 2). Strictly speaking, the appropriate multiplier should come from a t distribution with n - K - 1 degrees of freedom, where n is the sample size for calibration and K is the number of predictors in the model (Weisberg 1985). The distinction will be important only for small sample sizes, or for models in which the number of predictors is dangerously close to the number of observations for calibration.

2. Standard error of prediction, s_pred. This improved estimate of
prediction error is proportional to s, but in addition takes into account the uncertainty in the estimated mean of the predictand and in the estimates of the regression coefficients. Because of these additional factors, the prediction error is larger when the predictor data are far from their calibration-period means, and vice versa. For simple linear regression, the standard error of the estimate and the standard error of prediction are related as follows:

  s_pred,i = s [ 1 + 1/n + ( x_i - xbar )^2 / sum_j ( x_j - xbar )^2 ]^(1/2) ,      (11)

where s is the standard error of the estimate, n is the sample size (number of years) for the calibration period, xbar is the calibration-period mean of the predictor, x_i is the value of the predictor in the reconstruction year in question, s_pred,i is the standard error of prediction for that year, and the summation in the denominator is over the n years of the calibration period.

Note first that s_pred,i > s, and that the difference has contributions from the two right-hand terms inside the square root. The first source of difference is uncertainty due to the fact that the estimated mean of the predictand will not equal its expectation; this contribution can be made smaller by increasing the sample size. The second source is the uncertainty in the estimates of the regression constant and coefficient. The consequence of this term is that the prediction error is greater when the predictor is farther from its calibration-period mean. This feature is what causes the flaring out of the prediction intervals in a plot of the predicted values against the predictor values. More on this topic can be found in Weisberg (1985) and Wilks (1995, p. 176).

The equation for the standard error of prediction in MLR is more complicated than (11), which applies to simple linear regression, as s_pred depends on the variances and covariances of the estimated regression coefficients. The equation for s_pred in the multivariate case is best expressed in matrix terms. The MLR model, following Weisberg (1985, p. 229), can be written in vector-matrix form as

  Y = X B + e ,      (12)

where
  Y is the column vector of the predictand for the calibration period,
  X is the matrix of predictors for the calibration period,
  B is the vector of regression coefficients, with the regression constant first,
  e is the column vector of regression residuals.

If the model is used to predict data outside the calibration period, and the predictor data for some year to be predicted are given by the row vector x_i, the predicted value for that year is

  yhat_i = x_i Bhat .      (13)

Assuming the linear model is correct, the estimate is an unbiased point estimate of the predictand for the year in question, and the variance of the prediction is

  var(pred_i) = sigma^2 [ 1 + x_i (X'X)^(-1) x_i' ] = sigma^2 ( 1 + h_i ) ,      (14)

where sigma^2 is generally estimated as the residual mean square, s^2. The estimated standard error of prediction is the square root of the above conditional variance,

  s_pred,i = s ( 1 + h_i )^(1/2) .      (15)

3. Root mean squared error of validation, RMSE_V. Another possible way of assigning a confidence interval to predictions is to use the validation error as an estimate of the expected error of reconstruction or prediction. For example, with leave-one-out cross-validation, or the PRESS procedure, RMSE_V = ( PRESS / n_V )^(1/2) is the validation equivalent of the standard error of prediction and, if normality is assumed, can be used in the same way as described for s or s_pred to place confidence bands at a desired significance level around the predictions. For example, an approximate 95 percent confidence interval is yhat +/- 2 RMSE_V. Weisberg (1985, p. 230) recommends this approach as a sensible estimate of average prediction error.

12.5 Interpolation vs. extrapolation

A regression equation is estimated on a data set called the construction data set, or calibration data set. For this construction set the predictors have a defined range. For example, in regressing annual precipitation on tree-ring indices, perhaps the tree-ring data for the calibration period range between 0.4 and 1.8, or 40 to 180 percent of normal growth. The relationship between the predictand and predictors expressed by the
regression equation applies strictly only when the predictors are similar to their values in the calibration period. If the form of the regression equation is not known a priori, then we have no information on the relationship outside the observed range of the predictor in the calibration period. When the model is applied to generate predictions outside the calibration period, an important question is how different the predictor data can be from their values in the calibration period before the predictions are considered invalid. When the predictors are acceptably similar to their values in the calibration period, the predictions are called interpolations. Otherwise, the predictions are called extrapolations. Extrapolations in a dendroclimatic reconstruction model present a dilemma: the most interesting observations are often extrapolations, while the regression model is strictly valid only for interpolations. A compromise to simply tossing out extrapolations is to flag them in the reconstruction.

Algorithm for identifying extrapolations. Extrapolations are identified by locating the predictor data for any given prediction year relative to the multivariate cloud of the predictor data for the calibration period. Identification is trivial for simple linear regression, as any prediction year for which the predictor is outside its range for the calibration period can be regarded as an extrapolation. For MLR, any prediction for which the predictor data fall outside the predictor cloud for the calibration period can be regarded as an extrapolation. In MLR, extrapolations can be defined more specifically as observations that fall outside an ellipsoid that encloses the predictor values for the calibration period. This is an ellipsoid in p-dimensional space, where p is the number of predictors. For the simple case of one predictor, the ellipsoid is one-dimensional, and any value of x outside the range of x for the calibration period would lead to an extrapolation. For MLR with two variables, the ellipsoid is an ellipse in the space defined by the axes for variables x_1 and x_2. For the general case of an MLR regression with p predictors and a calibration period of n years, Weisberg (1985, p. 236) suggests an ellipsoid defined by constant values of the diagonal of the "hat matrix" H, defined in matrix algebra as

  H = X (X'X)^(-1) X' ,      (16)

where X is the n by (p + 1) time series matrix of predictors, with ones in the first column to allow for the constant of regression. For each prediction year with predictor values in the vector x (including the leading 1 for the constant), the scalar quantity

  h = x' (X'X)^(-1) x      (17)

is computed, and any prediction for which

  h > h_max ,      (18)

where h_max is the largest h in the diagonal of the hat matrix H, is regarded as an extrapolation.

12.6 References

Fritts, H.C., Guiot, J., and Gordon, G.A., 1990, Verification, in Cook, E.R., and Kairiukstis, L.A., eds., Methods of Dendrochronology: Applications in the Environmental Sciences: Kluwer Academic Publishers, p. 178-185.
Fritts, H.C., 1976, Tree Rings and Climate: London, Academic Press, 567 p.
Meko, D.M., Stockton, C.W., and Boggess, W.R., 1980, A tree-ring reconstruction of drought in southern California: Water Resources Bulletin, v. 16, no. 4, p. 594-600.
Michaelsen, J., 1987, Cross-validation in statistical climate forecast models: Journal of Climate and Applied Meteorology, v. 26, p. 1589-1600.
Rencher, A.C., and Pun, F.C., 1980, Inflation of R^2 in best subset regression: Technometrics, v. 22, no. 1, p. 49-53.
Snee, R.D., 1977, Validation of regression models: Methods and examples: Technometrics, v. 19, p. 415-428.
Weisberg, S., 1985, Applied Linear Regression (2nd ed.): New York, John Wiley, 324 p.
Wilks, D.S., 1995, Statistical Methods in the Atmospheric Sciences: Academic Press, 467 p.
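As a closing illustration of the hat-matrix rule in Section 12.5, the screening of prediction years might be coded as follows. This is a sketch under my own names (flag_extrapolations, etc.) with synthetic predictor data, not software from the course; it assumes the calibration predictors are supplied without the constant column, which the function adds itself.

```python
import numpy as np

def flag_extrapolations(X_cal, X_pred):
    """Flag prediction years as extrapolations via the hat matrix.

    X_cal  : (n, p) calibration-period predictors (no constant column)
    X_pred : (m, p) predictors for the prediction years
    Returns a boolean array, True where h exceeds h_max and the year is
    therefore regarded as an extrapolation.
    """
    n = X_cal.shape[0]
    Xc = np.column_stack([np.ones(n), X_cal])    # prepend ones for the regression constant
    XtX_inv = np.linalg.inv(Xc.T @ Xc)
    # Diagonal of H = X (X'X)^(-1) X' for the calibration years
    h_cal = np.einsum('ij,jk,ik->i', Xc, XtX_inv, Xc)
    h_max = h_cal.max()
    Xp = np.column_stack([np.ones(X_pred.shape[0]), X_pred])
    # h = x' (X'X)^(-1) x for each prediction year
    h_pred = np.einsum('ij,jk,ik->i', Xp, XtX_inv, Xp)
    return h_pred > h_max

# Calibration predictors uniform on [0.4, 1.8]; two hypothetical prediction years
rng = np.random.default_rng(1)
X_cal = rng.uniform(0.4, 1.8, size=(60, 2))
X_pred = np.array([[1.0, 1.1],     # well inside the calibration cloud
                   [3.0, 0.1]])    # far outside the calibration cloud
flags = flag_extrapolations(X_cal, X_pred)
```

Years flagged True fall outside the ellipsoid enclosing the calibration-period predictor cloud and would be marked as extrapolations in the reconstruction rather than discarded.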