# Linear Regression and Time Series STAT 5100

Utah State University

This 149-page set of class notes was uploaded by Anita Hettinger on Wednesday, October 28, 2015. The notes belong to STAT 5100 at Utah State University, taught by Staff in Fall.


## Stat 5100 Notes, Spring 2009 — Unit 4: Advanced Linear Regression

| Section | Topic | Pages (Hamilton) |
|---|---|---|
| 4.1 | Multicollinearity | 83, 133, 136, 338–339 |
| 4.2 | Variable Selection Procedures | 82–84 |
| 4.3 | Ridge Regression | 136 |
| 4.4 | Influence & Outliers | 125, 133, 158–161 |
| 4.5 | Conditional Effects; Robust & Nonlinear Regression | 158–161, 163–173, 183–212 |

### 4.1 Multicollinearity

Data on 19 mountain basins in New Zealand (see Hamilton pp. 139, 253, and the SAS program 4AdvMultReg.sas).

[Scatterplots of log(yield) against runoff and precip omitted.]

| variable | interpretation |
|---|---|
| Y = yield | mean sediment yield (tons/km²) |
| runoff | mean annual runoff (mm); water outflow |
| precip | mean annual precipitation (mm); water inflow |
| glacier | percentage of basin that is glacierized |
| area | drainage area (km²) |

Regression of log(yield) on four predictors:

```
                     Analysis of Variance
                             Sum of       Mean
Source            DF        Squares     Square   F Value   Pr > F
Model              4       54.43094   13.60774     13.78   <.0001
Error             14       13.82664    0.98762
Corrected Total   18       68.25759

Root MSE         0.99379   R-Square   0.7974
Dependent Mean   7.36864   Adj R-Sq   0.7396
Coeff Var       13.48675

                     Parameter Estimates
            Parameter   Standard
Variable DF  Estimate      Error   t Value   Pr > |t|
Intercept 1   7.36883    0.22799     32.32     <.0001
runoff    1   0.26654    1.05343      0.25     0.8039
precip    1   1.78612    1.04254      1.71     0.1087
glacier   1   0.30260    0.28359      1.07     0.3040
area      1   0.23248    0.27641      0.84     0.4144
```

Pearson correlation coefficients, N = 19:

```
          logyield   runoff   precip  glacier     area
logyield   1.00000  0.84327  0.86478  0.44793  0.41926
runoff     0.84327  1.00000  0.97383  0.33864  0.28731
precip     0.86478  0.97383  1.00000  0.30279  0.28304
glacier    0.44793  0.33864  0.30279  1.00000  0.51213
area       0.41926  0.28731  0.28304  0.51213  1.00000
```

These are symptoms of multicollinearity (sometimes just "collinearity"): strong linear associations exist among the predictor variables. The problems are not so much with $\hat{Y}$ as with the $b_j$'s, the estimates of the $\beta_j$'s.

How to detect multicollinearity:

- pairwise correlations are sometimes enough;
- for more complicated associations among multiple predictors, consider something more sophisticated.

Recall the multiple linear regression model:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{k-1} X_{i,k-1} + \varepsilon_i.$$
If the $X_j$'s were highly related:

- interpretation of the $\beta_j$'s becomes difficult (we can't "hold the other $X$'s constant");
- the $b_j$'s may vary widely across samples, giving imprecise information about the $\beta_j$'s (the SEs of the estimates are higher; the variance is inflated);
- the estimates $b_j$ could be counterintuitive;
- the model's predictive ability would not be inhibited;
- significance tests on individual predictors could contradict the model F-test or subset F-tests.

Useful now: an elegant formulation of the multiple linear regression model (see Hamilton, Appendix 3). Let

- $\underline{Y}$ be the vector $(Y_1, \ldots, Y_n)^T$,
- $\underline{\beta}$ be the vector $(\beta_0, \beta_1, \ldots, \beta_{k-1})^T$,
- $\underline{X}_j$ be the vector $(X_{1j}, X_{2j}, \ldots, X_{nj})^T$ for $j = 1, \ldots, k-1$,
- $\underline{1}$ be the vector $(1, 1, \ldots, 1)^T$ of length $n$,
- $\mathbf{X}$ be the matrix $[\,\underline{1}\ \underline{X}_1\ \cdots\ \underline{X}_{k-1}\,]$.

Then

$$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{k-1} X_{i,k-1} + \varepsilon_i, \quad i = 1, \ldots, n, \quad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$$

is equivalent to

$$\underline{Y} = \mathbf{X}\underline{\beta} + \underline{\varepsilon}, \qquad \underline{\varepsilon} \sim N_n(\underline{0}, \sigma^2 \mathbf{I}).$$

Why is this formulation useful?

$$\underline{b} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\underline{Y} \text{ by OLS}, \qquad \operatorname{Var}(\underline{b}) = (\mathbf{X}^T\mathbf{X})^{-1}\sigma^2.$$

Key idea: the closer $\underline{X}_j$ and $\underline{X}_h$ are, the larger the $j$th and $h$th diagonal elements of $(\mathbf{X}^T\mathbf{X})^{-1}$ will be, so the variance is higher for $b_j$ and $b_h$.

Aside: $\mathbf{X}^T\mathbf{X}$ must be invertible; if not, a generalized inverse $(\mathbf{X}^T\mathbf{X})^-$ is used instead of $(\mathbf{X}^T\mathbf{X})^{-1}$. Generalized inverses are not unique, so different $\underline{b}$ vectors are possible, but they would all be unbiased and have minimum variance.

Recall from linear algebra: $\lambda$ is an eigenvalue of a symmetric square matrix $\mathbf{A}$ iff there exists a vector $\underline{v}$ (the eigenvector for $\lambda$) such that $\mathbf{A}\underline{v} = \lambda\underline{v}$. Let $\lambda_1, \ldots, \lambda_k$ be the eigenvalues of $\mathbf{X}^T\mathbf{X}$, and define

$$\text{Condition Index}_i = \left(\frac{\lambda_{\max}}{\lambda_i}\right)^{1/2}.$$

Associated with each condition index (or eigenvalue) is a sample principal component: uncorrelated linear combinations of the (standardized) predictors that account for the variation in the predictors,

$$p_1 = a_{11} X_1^* + \cdots + a_{1,k-1} X_{k-1}^*, \qquad p_2 = a_{21} X_1^* + \cdots + a_{2,k-1} X_{k-1}^*, \ \ldots$$

In general, multicollinearity is problematic (potentially damaging) if both of the following occur:

- there is a high condition index (usually meaning more than 10 or so), and
- the associated principal component contributes strongly to the variance of two or more predictors (usually meaning more than about 50% of the predictor's variability).
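As a quick illustration of the key idea above, the following numpy sketch (synthetic data and invented variable names, not the notes' basin data or SAS code) shows (1) how a nearly collinear predictor pair inflates the diagonal of $(\mathbf{X}^T\mathbf{X})^{-1}$, and hence $\operatorname{Var}(b_j)$, and (2) the VIF diagnostic $VIF_j = 1/(1 - R_j^2)$ discussed next:

```python
import numpy as np

# Illustrative sketch: near-collinear predictors inflate the diagonal of
# (X'X)^{-1}, hence Var(b_j) = sigma^2 [(X'X)^{-1}]_{jj}; the VIF comes
# from regressing each predictor on the others.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)               # unrelated predictor

def slope_var_factors(*cols):
    """Diagonal of (X'X)^{-1} for the slope entries (intercept dropped)."""
    X = np.column_stack([np.ones(n), *cols])
    return np.diag(np.linalg.inv(X.T @ X))[1:]

def vif(X):
    """VIF_j = 1/(1 - R_j^2), R_j^2 from regressing X_j on the others."""
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b = np.linalg.lstsq(Z, y, rcond=None)[0]
        r = y - Z @ b
        r2 = 1 - (r @ r) / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

v_coll = slope_var_factors(x1, x2)    # inflated: collinear pair
v_ok = slope_var_factors(x1, x3)      # well-behaved pair
vifs = vif(np.column_stack([x1, x2, x3]))
```

With these synthetic data the collinear pair's variance factors and VIFs come out far above 10, while the unrelated predictor's VIF stays near 1.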
Another multicollinearity diagnostic: the VIF. Let $R_j^2$ be the coefficient of multiple determination (the $R^2$ value) when predictor $X_j$ is regressed on the other predictors; $R_j^2$ is related to the pairwise correlations among the standardized predictors. Then

$$VIF_j = \frac{1}{1 - R_j^2}, \qquad j = 1, \ldots, k-1.$$

This is the Variance Inflation Factor for the estimate of $\beta_j$. In general, multicollinearity is problematic if at least one of the following occurs:

- the largest VIF is much more than 10;
- the mean VIF is much more than 1.

(Other criteria are sometimes used.)

Three ways to diagnose multicollinearity:

1. combination of condition index and proportion of variation;
2. variance inflation factors;
3. model F-test vs. individual t-tests.

Do not confuse these:

- Interaction between predictors $X_1$ and $X_2$: the effect of $X_1$ on $Y$ depends on $X_2$.
- Multicollinearity involving $X_1$ and $X_2$: $X_1$ is linearly related to $X_2$ (no mention of $Y$).

Regression of log(yield) on four predictors; look for signs of multicollinearity:

```
[Analysis of Variance as before: F = 13.78, Pr > F < .0001]

                   Parameter Estimates
            Parameter   Standard                       Variance
Variable DF  Estimate      Error  t Value  Pr > |t|   Inflation
Intercept 1   7.36883    0.22799    32.32    <.0001           0
runoff    1   0.26654    1.05343     0.25    0.8039    20.22951
precip    1   1.78612    1.04254     1.71    0.1087    19.80997
glacier   1   0.30260    0.28359     1.07    0.3040     1.46613
area      1   0.23248    0.27641     0.84    0.4144     1.39256

              Collinearity Diagnostics
                            Condition
   Number     Eigenvalue        Index
        1        2.39187      1.00000
        2        1.09551      1.47761
        3        1.00000      1.54657
        4        0.48731      2.21547
        5        0.02531      9.72190
```

The largest condition index (9.72) is associated with a principal component whose variance proportions are concentrated on runoff (0.98604) and precip (0.98491): those two predictors carry the collinearity. [Full proportion-of-variation columns garbled in source.]

### 4.2 Variable Selection Procedures

Possible remedial measures for multicollinearity:

1. Standardize the predictors with the correlation transformation,
   $$X_{ij}^* = \frac{1}{\sqrt{n-1}}\left(\frac{X_{ij} - \bar{X}_j}{s_{X_j}}\right)$$
   (recall the correlation between variables $X$ and $Y$ is $r = \sum_i X_i^* Y_i^*$).
2. Collect more data, maybe much more; not always feasible.
3. Latent root regression: use a principal-components approach to identify/create composite variables. This can reduce the number of predictors and create uncorrelated predictors, but these usually lack good interpretability.
4. Ridge regression (later, in 4.3).
5. Choose a subset of variables to use as predictors (here, 4.2).

How to choose the best subset of variables for predictors:

- best to eliminate some on contextual grounds;
- automatic statistical procedures are available, but with no guarantee of a "right" subset.

Two main classes of variable selection procedures here:

1. All possible regressions: fit a model using each possible subset of predictors and select the best based on some criterion.
2. Stepwise methods: take a structured approach to building a good subset of predictors.

All possible regressions: choose the best of all possible subsets of predictors.

- From $X_1, \ldots, X_{k-1}$ there are $\binom{k-1}{p}$ subsets of size $p$, $1 \le p \le k-1$.
- Total number of possible subsets: $\sum_{p=1}^{k-1} \binom{k-1}{p} = 2^{k-1} - 1$.
- Three main ways to determine the "best":
  1. $R^2$; but which model will always have the highest $R^2$? (The full model.)
  2. $R_a^2$: balances $R^2$ against the number of predictors.
  3. Mallows' $C_p$ for a certain subset with $p$ parameters (intercept included):
     $$C_p = \frac{RSS(\text{subset model})}{MSE(\text{model with all } k-1 \text{ predictors})} - (n - 2p).$$
     Recall $MSE = RSS/df$. Look for the model with the smallest $p$ such that $C_p \le p$.

R-square selection (REG procedure, dependent variable: logyield):

```
Number in
Model      R-Square   Variables in Model
1            0.7478   precip
1            0.7111   runoff
1            0.2006   glacier
1            0.1758   area
2            0.7860   precip glacier
2            0.7809   precip area
2            0.7479   runoff precip
2            0.7452   runoff area
2            0.7409   runoff glacier
2            0.2495   glacier area
3            0.7965   precip glacier area
3            0.7872   runoff precip glacier
3            0.7810   runoff precip area
3            0.7550   runoff glacier area
4            0.7974   runoff precip glacier area
```

The adjusted-R-square selection listing ranks the same fifteen models by $R_a^2$ (values from 0.7592 down to 0.1273); the best by adjusted $R^2$ is precip + glacier ($R_a^2 = 0.7592$, $R^2 = 0.7860$). [Remainder of that listing garbled in source.]
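The all-possible-regressions search scored by Mallows' $C_p$ can be sketched as follows (a minimal numpy sketch on synthetic data, not the basin data; the subset search and the $C_p$ formula match the definition above):

```python
import numpy as np
from itertools import combinations

# All-subsets regression scored by Mallows' Cp:
#   Cp = RSS_p / MSE_full - (n - 2p),  p = number of fitted parameters
# (intercept included). Synthetic data: only the first two predictors matter.
def rss(X, y):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ b
    return r @ r

rng = np.random.default_rng(2)
n, k = 60, 4
Z = rng.normal(size=(n, k))
y = 2.0 + 1.5 * Z[:, 0] - 1.0 * Z[:, 1] + rng.normal(size=n)

ones = np.ones((n, 1))
mse_full = rss(np.hstack([ones, Z]), y) / (n - (k + 1))

results = []
for size in range(1, k + 1):
    for subset in combinations(range(k), size):
        X = np.hstack([ones, Z[:, subset]])
        p = X.shape[1]
        cp = rss(X, y) / mse_full - (n - 2 * p)
        results.append((cp, subset))

best_cp, best_subset = min(results)
cp_full = [c for c, s in results if len(s) == k][0]
```

By construction the full model's $C_p$ equals its parameter count ($k + 1$), and the best subset retains the two predictors with real signal.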
Mallows' Cp selection (REG procedure, dependent variable: logyield):

```
      Cp   R-Square   Variables in Model
  1.7932     0.7860   precip glacier
  2.1402     0.7809   precip area
  2.4279     0.7478   precip
  2.4262     0.7479   runoff precip
  2.6067     0.7452   runoff area
  2.9084     0.7409   runoff glacier
  3.0640     0.7965   precip glacier area
  3.7074     0.7872   runoff precip glacier
  4.1386     0.7810   runoff precip area
  4.9663     0.7111   runoff
  5.0000     0.7974   runoff precip glacier area
  5.9352     0.7550   runoff glacier area
 38.8692     0.2495   glacier area
 40.2464     0.2006   glacier
 41.9645     0.1758   area
```

Misc. notes on all possible regressions:

- Need sample size $n$ greater than the maximum number of parameters; sound results require $n$ substantially larger than $k$.
- Big problem for a large number of predictors: for $k - 1 > 14$ or so, $2^{k-1} - 1$ models becomes cost-prohibitive.
- Other criteria exist: AIC (Akaike's Information Criterion) and BIC (Schwarz's Bayesian Information Criterion); choose the model with the smallest AIC or BIC (the derivation of these is beyond the scope of this class, but just barely). Also $R_p^2$, $PRESS_p$, and more.

Stepwise methods:

- automatically select a model based on some criterion; convenient;
- less satisfactory: they do not guarantee the "right" model;
- best used as confirmatory approaches.

Three main strategies considered:

1. Backward Elimination (okay)
2. Forward Selection (worst)
3. Stepwise Selection (hybrid)

Stepwise method 1: Backward Elimination. Basic algorithm:

1. Fit the model with all $k-1$ predictors.
   a. Compare each predictor's individual P-value to some threshold `slstay` (default in SAS is 0.10).
   b. If any predictor's P-value exceeds `slstay`, drop the predictor with the largest P-value.
2. Repeat with $k-2$ predictors.
3. Continue until all remaining predictors have P-values below `slstay`.
Backward elimination output:

```
Backward Elimination: Step 0
            Parameter   Standard
Variable     Estimate      Error   Type II SS   F Value   Pr > F
Intercept     7.36883    0.22799   1031.69442   1044.63   <.0001
runoff        0.26654    1.05343      0.06323      0.06   0.8039
precip        1.78612    1.04254      2.89885      2.94   0.1087
glacier       0.30260    0.28359      1.12446      1.14   0.3040
area          0.23248    0.27641      0.69866      0.71   0.4144

Backward Elimination: Step 1
            Parameter   Standard
Variable     Estimate      Error   Type II SS   F Value   Pr > F
Intercept     7.36882    0.22076   1031.69094   1114.15   <.0001
precip        1.52996    0.24094     37.33705     40.32   <.0001
glacier       0.28822    0.26903      1.06281      1.15   0.3010
area          0.23577    0.26735      0.72013      0.78   0.3918

            Summary of Backward Elimination
      Variable   Number   Partial     Model
Step   Removed  Vars In  R-Square  R-Square       Cp   F Value   Pr > F
   1    runoff        3    0.0009    0.7965   3.0640      0.06   0.8039
   2      area        2    0.0106    0.7860   1.7932      0.78   0.3918
   3   glacier        1    0.0381    0.7478   2.4279      2.85   0.1108
```

Stepwise method 2: Forward Selection. Basic algorithm:

1. Find the predictor with the highest correlation with the response.
   a. Regress the response on this predictor.
   b. Leave the predictor in the model if its P-value is below some threshold `slentry` (default in SAS is 0.50).
2. Given the previously entered predictor, find the predictor with the highest partial correlation with the response.
   a. Add this predictor to the model.
   b. Leave it in the model if its P-value is below `slentry`.
3. Continue until no more predictors warrant inclusion (the P-value of the next predictor is above the threshold).

Big problem here: the best 2-variable model does not necessarily contain the best 1-variable model, so the first steps can throw everything off.

Forward selection output:

```
Forward Selection: Step 1
Variable precip Entered: R-Square = 0.7478 and Cp = 2.4279
            Parameter   Standard
Variable     Estimate      Error   Type II SS   F Value   Pr > F
Intercept     7.36881    0.23084   1031.68855   1018.98   <.0001
precip        1.68397    0.23716     51.04549     50.42   <.0001

No other variable met the 0.1000 significance level for entry into the model.

            Summary of Forward Selection
      Variable   Number   Partial     Model
Step   Entered  Vars In  R-Square  R-Square       Cp   F Value   Pr > F
   1    precip        1    0.7478    0.7478   2.4279     50.42   <.0001
```
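The backward-elimination idea above can be sketched without any statistics library. This is a simplified sketch, not SAS's procedure: it drops the predictor with the smallest |t| until every |t| clears a threshold, which orders candidates the same way as dropping on the largest p-value (p-values are monotone in |t| for a common df):

```python
import numpy as np

# Simplified backward elimination: repeatedly drop the predictor with the
# smallest |t| statistic until all remaining |t| exceed t_min.
def backward_eliminate(X, y, names, t_min=2.0):
    X = X.copy()
    names = list(names)
    while X.shape[1] > 0:
        n = len(y)
        M = np.column_stack([np.ones(n), X])      # add intercept
        XtX_inv = np.linalg.inv(M.T @ M)
        b = XtX_inv @ M.T @ y
        resid = y - M @ b
        mse = resid @ resid / (n - M.shape[1])
        t = b / np.sqrt(mse * np.diag(XtX_inv))
        j = int(np.argmin(np.abs(t[1:])))          # weakest predictor
        if abs(t[1:][j]) >= t_min:
            break                                  # everything is "significant"
        X = np.delete(X, j, axis=1)
        names.pop(j)
    return names

rng = np.random.default_rng(3)
n = 80
Z = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * Z[:, 0] + rng.normal(size=n)   # only the first predictor matters
kept = backward_eliminate(Z, y, ["x1", "x2", "x3"])
```

On these synthetic data the strong predictor `x1` always survives the elimination.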
Stepwise method 3: Stepwise Selection. Basic algorithm:

1. Take a forward step: add the best predictor with P-value below `slentry` (default 0.15).
2. Take a backward step: evaluate all predictors in the model and drop the variable with the highest P-value above `slstay` (default 0.15).
3. Iterate forward and backward steps until the model stays the same.

Note: in all these automatic stepwise procedures (backward, forward, stepwise), the `slentry` and `slstay` thresholds are deceptive. After the first step (really a hypothesis test), they are not significance levels $\alpha$ but conditional significance levels, which are harder to interpret.

Stepwise selection output:

```
Stepwise Selection: Step 1
Variable precip Entered: R-Square = 0.7478 and Cp = 2.4279
            Parameter   Standard
Variable     Estimate      Error   Type II SS   F Value   Pr > F
Intercept     7.36881    0.23084   1031.68855   1018.98   <.0001
precip        1.68397    0.23716     51.04549     50.42   <.0001

All variables left in the model are significant at the 0.1000 level.
No other variable met the 0.1000 significance level for entry into the model.

            Summary of Stepwise Selection
      Variable  Variable   Number   Partial     Model
Step   Entered   Removed  Vars In  R-Square  R-Square       Cp   F Value   Pr > F
   1    precip                  1    0.7478    0.7478   2.4279     50.42   <.0001
```

### 4.3 Ridge Regression

Recall from 4.1 (and Hamilton, Appendix 3):

$$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{k-1} X_{i,k-1} + \varepsilon_i, \quad i = 1, \ldots, n, \quad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2),$$
$$\underline{Y} = \mathbf{X}\underline{\beta} + \underline{\varepsilon}, \quad \underline{\varepsilon} \sim N_n(\underline{0}, \sigma^2\mathbf{I}), \quad \underline{b} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\underline{Y} \text{ by OLS}, \quad \underline{b} \sim N(\underline{\beta}, (\mathbf{X}^T\mathbf{X})^{-1}\sigma^2).$$

So each $b_j$ has an expected value of $\beta_j$: it's unbiased. The current data are one sample; consider repeating the sampling many times, each time fitting the model and obtaining an estimate $b_j$ for each $j$. The distribution of these is the sampling distribution for the estimator $b_j$, with mean $\beta_j$ and variance the corresponding diagonal element of $(\mathbf{X}^T\mathbf{X})^{-1}\sigma^2$ (normality assumed).

We could transform the $X_j$'s and $Y$ using the correlation transformation (4.2):

$$X_{ij}^* = \frac{1}{\sqrt{n-1}}\left(\frac{X_{ij} - \bar{X}_j}{SD_j}\right).$$

Then

$$Y_i^* = \beta_1^* X_{i1}^* + \cdots + \beta_{k-1}^* X_{i,k-1}^* + \varepsilon_i^*, \quad i = 1, \ldots, n$$

(no intercept), and in matrix form $\underline{Y}^* = \mathbf{X}^*\underline{\beta}^* + \underline{\varepsilon}^*$, where $(\mathbf{X}^*)^T\mathbf{X}^* = r_{XX}$, the correlation matrix of the $X$'s (the element in row $i$ and column $j$ of $r_{XX}$ is $\operatorname{Corr}(X_i, X_j)$), and $(\mathbf{X}^*)^T\underline{Y}^* = r_{XY}$, the vector whose $j$th element is $\operatorname{Corr}(X_j, Y)$. OLS gives

$$\underline{b}^* = r_{XX}^{-1}\, r_{XY};$$

unstandardizing each element gives the same $\underline{b}$ as $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\underline{Y}$. Note that $r_{XX}^{-1} r_{XY}$ gives the same unbiased estimates as OLS on the untransformed data.
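The correlation-transformation identities above can be checked numerically. A minimal numpy sketch (synthetic data; `star` is an invented helper name): after standardizing, $(\mathbf{X}^*)^T\mathbf{X}^*$ really is the correlation matrix, and unstandardizing $\underline{b}^* = r_{XX}^{-1} r_{XY}$ recovers the ordinary OLS slopes:

```python
import numpy as np

# Correlation transformation: X*_ij = (X_ij - mean) / (SD * sqrt(n-1)).
# Then X*'X* = r_XX, X*'Y* = r_XY, and b* = r_XX^{-1} r_XY, which
# unstandardizes to the usual OLS slopes via b_j = b*_j * s_Y / s_Xj.
rng = np.random.default_rng(9)
n = 60
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(size=n)

def star(v):
    return (v - v.mean()) / (v.std(ddof=1) * np.sqrt(len(v) - 1))

Xs = np.column_stack([star(X[:, j]) for j in range(3)])
ys = star(y)
r_xx = Xs.T @ Xs                       # predictor correlation matrix
r_xy = Xs.T @ ys                       # predictor-response correlations
b_star = np.linalg.solve(r_xx, r_xy)   # standardized OLS estimates

# Unstandardize and compare against OLS on the raw data.
b_unstd = b_star * y.std(ddof=1) / X.std(axis=0, ddof=1)
b_direct = np.linalg.lstsq(np.column_stack([np.ones(n), X]), y, rcond=None)[0][1:]
```

Both equalities hold to machine precision, which is the point of the "same unbiased estimates" remark above.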
Recall the symptom of multicollinearity: inflated variance of the $b_j$ estimates.

- Can we reduce the variance of the estimates by biasing them?
- Think of two sampling distributions (sketch): (1) an unbiased estimate with inflated variance; (2) a biased estimate with reduced variance.

Allowing the estimates of the $\beta_j$'s to be slightly biased could reduce their standard errors, avoiding the major problem of multicollinearity. How to introduce bias and reduce variance:

- recall the unbiased estimate $\underline{b}^* = r_{XX}^{-1} r_{XY}$ with $\operatorname{Var}(\underline{b}^*) = r_{XX}^{-1}\sigma^{*2}$;
- take a biasing constant $c \ge 0$ and the identity matrix $\mathbf{I}$;
- biased estimate: $\underline{b}^R = (r_{XX} + c\,\mathbf{I})^{-1} r_{XY}$, with $\operatorname{Var}(\underline{b}^R) = \sigma^{*2}\,(r_{XX} + c\,\mathbf{I})^{-1}\, r_{XX}\, (r_{XX} + c\,\mathbf{I})^{-1}$;
- the $j$th diagonal element of $\operatorname{Var}(\underline{b}^R)$ gives the SE of $b_j^R$;
- estimate $\sigma^{*2}$ by the MS for error.

We call $c$ the ridge parameter. What does $c = 0$ do? It gives the unbiased (OLS) estimates. What does larger $c$ do? Larger bias, smaller variance.

Two graphical summaries to choose the "right" ridge parameter $c$ (these are guides; there is no optimal decision):

1. Ridge Trace Plot: a simultaneous plot of $b_1^R, \ldots, b_{k-1}^R$ for different ridge parameters $c$, usually from 0 to 1 or 2. As $c$ increases from 0, the $b_j^R$ may fluctuate wildly and even change signs; eventually they move slowly toward 0.
2. VIF Plot: a simultaneous plot of the variance inflation factors for the $k-1$ predictors for different ridge parameters. As $c$ increases from 0, the VIFs drop toward 0.

Pulse data on 31 people:

| variable | interpretation |
|---|---|
| Y = oxygen | oxygen intake rate (ml per kg body weight per minute) |
| age | age of subject (years) |
| weight | weight of subject (kg) |
| runtime | time taken to run 1.5-mile course (in the output, recorded as seconds) |
| rstpulse | rest pulse rate (beats per minute) |
| runpulse | average pulse rate while running |
| maxpulse | maximum pulse rate while running |
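The ridge estimator $\underline{b}^R = (r_{XX} + c\,\mathbf{I})^{-1} r_{XY}$ can be sketched directly (a numpy sketch on synthetic collinear data, not the pulse data; the SAS RIDGE= option does this on the original scale automatically):

```python
import numpy as np

# Ridge regression on correlation-transformed data:
#   b_R(c) = (r_XX + c I)^{-1} r_XY.
# c = 0 reproduces the (unstable) OLS estimates; c > 0 shrinks them.
rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)        # nearly collinear pair
y = x1 + x2 + rng.normal(size=n)

def standardize(v):
    # correlation transformation: unit sum of squares, mean zero
    return (v - v.mean()) / (v.std(ddof=1) * np.sqrt(len(v) - 1))

Z = np.column_stack([standardize(x1), standardize(x2)])
ystar = standardize(y)
r_xx = Z.T @ Z
r_xy = Z.T @ ystar

def ridge(c):
    return np.linalg.solve(r_xx + c * np.eye(2), r_xy)

b_ols = ridge(0.0)
b_03 = ridge(0.3)      # same ridge parameter as the notes' pulse example
```

Because each eigencomponent of the estimate is multiplied by $\lambda_i/(\lambda_i + c)$, the ridge estimate always has a smaller norm than the OLS estimate.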
Regression output (REG procedure, dependent variable: oxygen):

```
                     Parameter Estimates
            Parameter   Standard                       Variance
Variable     Estimate      Error   t Value  Pr > |t|  Inflation
Intercept   102.93448   12.40326      8.30    <.0001          0
age          -0.22697    0.09984     -2.27    0.0322    1.51284
weight       -0.07418    0.05459     -1.36    0.1869    1.15533
seconds      -0.04381    0.00641     -6.84    <.0001    1.59087
rstpulse     -0.02153    0.06605     -0.33    0.7473    1.41559
runpulse     -0.36963    0.11985     -3.08    0.0051    8.43727
maxpulse      0.30322    0.13650      2.22    0.0360    8.74385
```

The collinearity diagnostics show a largest condition index near 196 (smallest eigenvalue 0.0001795), with the associated principal component loading heavily on runpulse (proportion 0.9128) and maxpulse (0.9836): the collinearity involves those two pulse measurements. [Full eigenvalue/proportion table garbled in source.]

[Ridge trace plot: coefficient estimates for age, weight, seconds, rstpulse, runpulse, and maxpulse against ridge parameter c from 0 to 2.]
[VIF plot: VIF for each predictor against ridge parameter c from 0 to 2.]

In general:

1. choose the smallest ridge parameter $c$
2. where the $b_j^R$ have first become stable (their approach toward 0 has slowed),
3. and the VIFs have become small enough (close to 1, or less than 1).

Why is it called ridge regression?

- $c$ is added to the diagonal of $r_{XX}$, making a sort of "ridge" there;
- the ridge trace plot shows how each $b_j^R$ follows a "ridge" for increasing $c$.

Ridge estimates for the coefficients, with ridge parameter c = 0.3 (magnitudes as printed):

```
Obs  _TYPE_     Intercept       age    weight   seconds  rstpulse  runpulse  maxpulse
 1   PARMS      102.934     0.22697   0.07418   0.04381   0.02153   0.36963   0.30322
 2   SEB         12.403     0.09984   0.05459   0.00641   0.06605   0.11985   0.13650
 3   RIDGEVIF                0.62851   0.60615   0.62741   0.63315   0.45717   0.44876
 4   RIDGE      104.256     0.23386   0.05019   0.03463   0.06524   0.11466   0.01778
 5   RIDGESEB    10.077     0.07586   0.04661   0.00474   0.05207   0.03289   0.03645
```

Ridge regression is a type of shrinkage method:

1. Impose a penalty on the sizes of the $b_j^*$'s.
2. Find $\underline{b}^*$ to minimize
   $$\sum_{i=1}^{n} \left(Y_i^* - \hat{Y}_i^*\right)^2 + \lambda \sum_{j=1}^{k-1} \left(b_j^*\right)^2.$$
   The first part is OLS; the second part is the penalty.
3. Which penalty $\lambda \ge 0$?
Use the same approach as with selecting the ridge parameter $c$.

4. The penalty $\lambda$ (or $c$) effectively shrinks the $b_j^*$'s toward 0.

Another popular shrinkage method, the Lasso:

$$\text{minimize } \sum_{i=1}^{n}\left(Y_i^* - \hat{Y}_i^*\right)^2 \text{ subject to } \sum_{j} |b_j^*| \le t.$$

A few extra notes on ridge regression:

- the choice of ridge parameter is somewhat subjective, but must be defendable;
- given the ridge parameter $c$, we can get the resulting parameter estimates on the unstandardized (original) data scale; SAS gives these automatically;
- ridge regression estimates tend to be more robust against small changes to the data than are OLS estimates;
- ridge regression tends to produce more precise predicted values $\hat{Y}$ than OLS when the predictors are highly related; it is also better at slight extrapolation;
- predictors with a very unstable ridge trace that tends toward zero without any plateau or slowing down may be dropped from the model; also consider dropping predictors with a very small ridge trace.

Final notes on ridge regression:

- major limitation: traditional inference is not directly applicable to ridge regression estimates; be cautious;
- bootstrapping is needed to evaluate the precision of ridge regression estimates (computationally intensive; beyond the scope of this course);
- still need to check the other model assumptions after selecting the ridge parameter.

[Residual plot with ridge parameter c = 0.3: residual value vs. predicted oxygen.]

Why do ridge regression? Maybe we want to adjust for multicollinearity but still keep certain predictors in the model for mechanistic theory.

### 4.4 Influential Observations & Outliers

Example data: savings data averaged over 1960–1970 (SAS program: 4SavingsInfluence.sas).

| variable | interpretation |
|---|---|
| Country | country name |
| Y = SavRatio | average savings ratio for country |
| AvIncome | average income in US dollars |
| PopU15 | proportion of population under 15 years of age |
| PopO75 | proportion of population over 75 years of age |
| GrowRate | average growth rate in per-capita income |

Recall the model $Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{k-1} X_{i,k-1} + \varepsilon_i$. There may be points (individual observations) that are not well explained by the model; these may be called outliers.
There may also be points (individual observations) that are unduly influencing the model fit (the $b_j$ estimates) or the $\hat{Y}$ predicted values; these may be called influential observations. Based only on a consideration of the residuals, one is not necessarily a subset of the other; it depends on the nature of the influence and the sample size. Use both numerical and graphical diagnostics.

Main diagnostics for influential observations:

1. Hat matrix diagonals
2. DFBETAS
3. DFFITS
4. Cook's Distance

Main diagnostics for outliers:

5. Residuals
6. Studentized Residuals
7. Studentized Deleted Residuals

NOTE: all of these diagnostics should enhance, not replace, scatterplots.

#### 4.4.1 Hat matrix diagonals

Recall from 4.1 (and Appendix 3 of Hamilton): $\underline{Y} = \mathbf{X}\underline{\beta} + \underline{\varepsilon}$, with $\underline{b} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\underline{Y}$ by OLS. Then

$$\underline{\hat{Y}} = \mathbf{X}\underline{b} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\underline{Y} = \mathbf{H}\underline{Y}, \qquad \mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T.$$

- The hat matrix $\mathbf{H}$ "hats" (puts a hat on) $\underline{Y}$.
- It is a projection matrix: $\mathbf{H}$ projects $\underline{Y}$ down to the column space of $\mathbf{X}$ (sketch).

Let $h_{ii}$ be the element in row $i$ and column $i$ of $\mathbf{H}$, sometimes called the leverage: the influence of observation $i$ on its fitted value. Since $\underline{\hat{Y}} = \mathbf{H}\underline{Y}$, we have $\hat{Y}_i = \sum_{l=1}^{n} h_{il} Y_l$.

What would a larger diagonal element $h_{ii}$ mean? $Y_i$ is more influential in determining $\hat{Y}_i$. How large must $h_{ii}$ be to declare observation $i$ influential?

- Rule of thumb: $h_{ii} > 2k/n$ (or $h_{ii} > 3k/n$, with $k$ the number of parameters) → observation $i$ is influential.
- One possible graphical check: plot $h_{ii}$ against observation number, adding reference lines at $2k/n$ and $3k/n$.

[Simple residual plot and plot of hat diagonals against observation number omitted.]

Savings data regression (dependent variable: SavRatio; 4SavingsInfluence.sas):

```
                  Parameter Estimates
            Parameter    Standard
Variable     Estimate       Error   t Value   Pr > |t|
Intercept    28.56609     7.35452      3.88     0.0003
PopU15       -0.46119     0.14464     -3.19     0.0026
PopO75       -1.69150     1.08360     -1.56     0.1255
AvIncome     -0.00033690  0.00093111  -0.36     0.7192
GrowRate      0.40969     0.19620      2.09     0.0425
```

Another graphical diagnostic with $h_{ii}$: recall leverage plots (partial regression plots) for $X_1$:

1. Regress $X_1$ on $X_2, \ldots, X_{k-1}$ and obtain residuals $e_{X_1 \mid X_2, \ldots, X_{k-1}}$.
2. Regress $Y$ on $X_2, \ldots, X_{k-1}$ and obtain residuals $e_{Y \mid X_2, \ldots, X_{k-1}}$.
3. Plot $e_{Y \mid X_2, \ldots, X_{k-1}}$ vs. $e_{X_1 \mid X_2, \ldots, X_{k-1}}$, and add the regression line (with slope $b_1$ from the multiple regression model).

Modification here: make the size of each point in the leverage plot proportional to the corresponding $h_{ii}$; this is then called a proportional leverage plot. Influential observations will be the points with big bubbles that appear to pull the regression line in their direction.

[Proportional leverage plots (bubble size proportional to hat diagonal) for PopU15, PopO75, AvIncome, and GrowRate omitted.]

Observations with large hat diagonals:

```
Observation   Country         Hat Diagonal
         21   Ireland              0.2122
         23   Japan                0.2233
         44   United States        0.3337
         49   Libya                0.5315
```

Possible concerns from the hat-diagonal graphical diagnostics: Canada, Japan, South Rhodesia, the United States, Jamaica, and Libya. [Partial-regression residual values garbled in source.]
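The hat-matrix leverage screen of 4.4.1 can be sketched as follows (a numpy sketch on synthetic data, not the savings data; one predictor value is made extreme on purpose):

```python
import numpy as np

# Hat matrix H = X (X'X)^{-1} X'. Its diagonal h_ii is the leverage of
# observation i; flag h_ii > 2p/n (p = number of parameters).
rng = np.random.default_rng(5)
n = 30
x = rng.normal(size=n)
x[0] = 8.0                               # one extreme predictor value
X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
p = X.shape[1]

flagged = np.where(h > 2 * p / n)[0]     # rule-of-thumb cutoff
```

A useful sanity check: the leverages always sum to the number of parameters, $\sum_i h_{ii} = \operatorname{trace}(\mathbf{H}) = p$, and the planted extreme point is flagged.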
#### 4.4.2 DFBETAS

"DF" means "difference" here: how different would the estimates of the $\beta_j$'s be without observation $i$ in the data? Let

- $b_j$ = estimate of $\beta_j$ using the full data;
- $b_{j(i)}$ = estimate of $\beta_j$ when observation $i$ is ignored;
- $s_{e(i)}$ = SD of the residuals when observation $i$ is ignored;
- $RSS_{X_j}$ = residual SS from regressing $X_j$ on all other predictors using the full data;
- $MSE_{(i)}$ = mean SS for error when observation $i$ is ignored;
- $c_{jj}$ = $j$th diagonal element of $(\mathbf{X}^T\mathbf{X})^{-1}$ (for a slope, $c_{jj} = 1/RSS_{X_j}$).

Then

$$DFBETAS_{ij} = \frac{b_j - b_{j(i)}}{s_{e(i)}\sqrt{c_{jj}}} = \frac{b_j - b_{j(i)}}{\sqrt{MSE_{(i)}\, c_{jj}}}.$$

Interpreting DFBETAS:

- $DFBETAS_{ij}$ positive: observation $i$ pulls $b_j$ up.
- $DFBETAS_{ij}$ negative: observation $i$ pulls $b_j$ down.
- How large to declare observation $i$ influential on $b_j$? Rough rule of thumb: $|DFBETAS_{ij}| > 1$ for $n \le 30$; $|DFBETAS_{ij}| > 2/\sqrt{n}$ for $n > 30$.
- Graphical diagnostics are probably better for DFBETAS: histograms or boxplots of the $DFBETAS_{ij}$ for each $j$; a proportional leverage plot with bubble size proportional to $|DFBETAS_{ij}|$; a plot of $DFBETAS_{ij}$ against observation number for each $j$.

[Boxplots and plots of DFBETAS against observation number for PopU15, AvIncome, PopO75, and GrowRate omitted.]
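DFBETAS can be computed by direct leave-one-out refitting, exactly as the definition above reads (a numpy sketch on synthetic data with one contaminated response; SAS's INFLUENCE option computes this without refitting):

```python
import numpy as np

# DFBETAS_{ij} = (b_j - b_{j(i)}) / sqrt( MSE_{(i)} * c_jj ),
# c_jj from the FULL-data (X'X)^{-1}, MSE_{(i)} with obs i removed.
def dfbetas(X, y):
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    c = np.diag(XtX_inv)
    out = np.empty((n, p))
    for i in range(n):
        keep = np.arange(n) != i
        Xi, yi = X[keep], y[keep]
        bi = np.linalg.lstsq(Xi, yi, rcond=None)[0]
        resid = yi - Xi @ bi
        mse_i = resid @ resid / (n - 1 - p)   # error df after deleting obs i
        out[i] = (b - bi) / np.sqrt(mse_i * c)
    return out

rng = np.random.default_rng(6)
n = 40
x = rng.normal(size=n)
y = 1 + 2 * x + 0.5 * rng.normal(size=n)
y[0] += 6.0                                  # contaminate one response
X = np.column_stack([np.ones(n), x])
D = dfbetas(X, y)
```

With $n = 40 > 30$, the rule-of-thumb cutoff is $2/\sqrt{n} \approx 0.316$, and the contaminated observation exceeds it while typical observations stay well below.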
Observations with large DFBETAS (cutoff is $2/\sqrt{50} = 0.283$; magnitudes as printed):

```
Observation   Country       DFB PopU15   DFB PopO75   DFB AvIncome   DFB GrowRate
         10   Costa Rica        0.2843       0.1424         0.0564         0.0328
         21   Ireland           0.2962       0.4816         0.2573         0.0933
         23   Japan             0.6561       0.6739         0.1461         0.3886
         33   Peru              0.1467       0.0915         0.0858         0.2872
         46   Zambia            0.0792       0.3390         0.0941         0.2282
         47   Jamaica           0.1002       0.0572         0.0070         0.2955
         49   Libya             0.4832       0.3797         0.0194         1.0245
```

Possible concerns from the DFBETAS graphical diagnostics: Ireland (21), Japan (23), and Libya (49).

#### 4.4.3 DFFITS

Similar to DFBETAS: how different would $\hat{Y}_i$ be if observation $i$ were not used to fit the model?

$$DFFITS_i = \frac{\hat{Y}_i - \hat{Y}_{i(i)}}{\text{SE of } \hat{Y}_{i(i)}} = \frac{\hat{Y}_i - \hat{Y}_{i(i)}}{\sqrt{MSE_{(i)}\, h_{ii}}}.$$

How large must $DFFITS_i$ be to declare observation $i$ influential on $\hat{Y}_i$?

- Rough rule of thumb: $|DFFITS_i| > 1$ for $n \le 30$; $|DFFITS_i| > 2\sqrt{k/n}$ for $n > 30$.
- Good graphical diagnostics for DFFITS: plot $DFFITS_i$ vs. observation number; plot residuals vs. predicted values with point sizes proportional to the corresponding $|DFFITS_i|$.

[Plot of DFFITS against observation number and bubble residual plot using DFFITS omitted.]

Observations with large DFFITS:

```
Observation   Country   DFFITS
         23   Japan     0.8597
         46   Zambia    0.7482
         49   Libya     1.1601
```

Possible concern from the DFFITS graphical diagnostics: Libya (49, DFFITS 1.1601).

Aside: DFFITS can be rewritten as

$$DFFITS_i = e_i^{(del)} \sqrt{\frac{h_{ii}}{1 - h_{ii}}}, \qquad e_i^{(del)} = \frac{e_i}{s_{e(i)}\sqrt{1 - h_{ii}}}.$$

This $e_i^{(del)}$ is of interest in its own right and can be used to detect outliers (coming up in 4.4.7).

DFBETAS vs. DFFITS vs. $h_{ii}$:

- somewhat related; conclusions will quite often agree;
- BUT if two or more points exert influence together, then the "drop one" diagnostics (DFBETAS and DFFITS) may not detect them. These are leverage points; we need to look at $h_{ii}$ (also called leverage) as well (sketches).

#### 4.4.4 Cook's Distance

Kind of an overall measure of the effect of observation $i$ on all of the $\hat{Y}$ values:

$$D_i = \frac{\sum_{l=1}^{n} \left(\hat{Y}_l - \hat{Y}_{l(i)}\right)^2}{k \cdot MSE}.$$

Diagnostics:

- Numerical: simple — compare $D_i$ with $4/n$; more useful — compare $D_i$ with the $F_{k, n-k}$ distribution (percentile at or below 10–20%: little influence; percentile at or above 50%: major influence).
- Graphical: plot $D_i$ (or its percentile in the $F_{k, n-k}$ distribution) vs. observation number $i$, with a reference line.
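Cook's distance has a closed form that avoids the $n$ refits implied by the definition above; a numpy sketch (synthetic data with a planted high-leverage, off-the-line point; `cooks_distance` is an invented helper name):

```python
import numpy as np

# Closed form: D_i = e_i^2 h_ii / ( p * MSE * (1 - h_ii)^2 ),
# flagged against the simple 4/n rule used in the notes.
def cooks_distance(X, y):
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y                         # residuals
    mse = e @ e / (n - p)
    return e**2 * h / (p * mse * (1 - h) ** 2)

rng = np.random.default_rng(7)
n = 50
x = rng.normal(size=n)
x[0] = 5.0                                # high leverage...
y = 1 + x + 0.5 * rng.normal(size=n)
y[0] += 8.0                               # ...and pulled off the line
X = np.column_stack([np.ones(n), x])
D = cooks_distance(X, y)
```

The planted point dominates the diagnostic and clears the $4/n$ cutoff by a wide margin.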
[Cook's distance graphical diagnostic: $D_i$ and its F-distribution percentile against observation number, omitted.]

Observations with large Cook's Distance (based on the simple rule $4/n = 0.08$):

```
Observation   Country   Cook's D   percentile in F
         23   Japan        0.143          0.018820
         46   Zambia       0.097          0.007745
         49   Libya        0.268          0.071805
```

#### 4.4.5 Residuals

$e_i = Y_i - \hat{Y}_i$. Sometimes a large $|e_i|$ indicates an outlier, a point not well explained by the fitted model, but how large it needs to be depends on the residuals themselves. Recall $\varepsilon_i \sim N(0, \sigma^2)$; because $\underline{\hat{Y}} = \mathbf{H}\underline{Y}$, the residuals satisfy $e_i \sim N(0, \sigma^2(1 - h_{ii}))$. We could compare $e_i$ with normal critical values, but we'd need to estimate the variance (including $\sigma^2$), so the normal approximation is not appropriate; we need Student's t.

#### 4.4.6 Studentized Residuals

$$e_i^{stud} = \frac{e_i}{\sqrt{\text{estimated variance of } e_i}} = \frac{e_i}{s_e\sqrt{1 - h_{ii}}}, \qquad s_e = \sqrt{MSE}.$$

If the $\varepsilon_i$ are iid $N(0, \sigma^2)$, then the studentized residuals follow the $t_{n-k}$ distribution.

- Numerical diagnostic: compare |studentized residual| with the upper $\alpha/2$ critical value of $t_{n-k}$.
- Graphical diagnostic: plot studentized residuals against predicted values, with reference lines at the upper $\alpha/2$ critical values of $t_{n-k}$.

[Plot of studentized residuals against predicted values omitted.]

```
Critical values for studentized residual:
t(.95) = 2.01410       t(.95, Bonferroni) = 3.52025
```

Possible concerns from the studentized-residuals graphical diagnostic:

```
Observation   Country   Student Residual
          7   Chile          2.209
         46   Zambia         2.651
```

#### 4.4.7 Studentized Deleted Residuals

If observation $i$ really is an outlier, then including it in the data will inflate $s_e = \sqrt{MSE}$. So consider dropping it and re-calculating the studentized residual:

$$e_i^{(del)} = \frac{e_i}{s_{e(i)}\sqrt{1 - h_{ii}}},$$

where $s_{e(i)} = \sqrt{MSE_{(i)}}$ is computed after dropping observation $i$. Diagnostics are similar to studentized residuals: plot against predicted values; compare to the $t_{n-k}$ critical values.
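Both residual diagnostics of 4.4.6–4.4.7 can be computed in one pass; the deleted version uses the standard identity that avoids $n$ refits (a numpy sketch on synthetic data with one planted outlier; `studentized_residuals` is an invented helper name):

```python
import numpy as np

# Internally studentized residuals: e_i / ( s_e sqrt(1 - h_ii) ).
# Studentized DELETED residuals, via the no-refit identity:
#   t_i = e_i * sqrt( (n - p - 1) / ( SSE (1 - h_ii) - e_i^2 ) ).
def studentized_residuals(X, y):
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    sse = e @ e
    mse = sse / (n - p)
    internal = e / np.sqrt(mse * (1 - h))
    deleted = e * np.sqrt((n - p - 1) / (sse * (1 - h) - e**2))
    return internal, deleted

rng = np.random.default_rng(8)
n = 40
x = rng.normal(size=n)
y = 2 + x + 0.4 * rng.normal(size=n)
y[0] += 4.0                        # one outlier
X = np.column_stack([np.ones(n), x])
internal, deleted = studentized_residuals(X, y)
```

For a genuine outlier the deleted version is larger in magnitude than the internal one, because dropping the point deflates the error estimate it is compared against.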
CAUTION: comparing $|e_i^{(del)}|$ to the upper $\alpha/2$ critical value of $t_{n-k}$ makes $\alpha$ the probability of calling observation $i$ an outlier when it's not (the probability of a type I error, falsely calling the observation an outlier). BUT after doing this for $i = 1, \ldots, n$, we've done $n$ tests. What if we want $\alpha$ to be the probability of falsely calling at least one observation an outlier, a family-wise error rate (the probability of making at least one false call in $n$ tests)? We need to adjust the critical value; there are many ways to do this. Here we'll use the Bonferroni correction: compare $|e_i^{(del)}|$ to the upper $\alpha/(2n)$ critical value of $t_{n-k}$.

[Plot of studentized deleted residuals against predicted values omitted.]

```
Critical values for studentized deleted residual:
t(.95) = 2.01410       t(.95, Bonferroni) = 3.52025
```

Possible concerns from the studentized-deleted-residual graphical diagnostics:

```
Observation   Country   RStudent
          7   Chile       2.3134
         46   Zambia      2.8536
```

#### 4.4.8 Remedial Measures for Influential Observations or Outliers

1. Look for typos in the data, or fundamental differences in the observations (including very skewed distributions of predictors).
2. Look at potential changes to the model:
   a. Will a transformation bring in the observations?
   b. Should a curvilinear or other predictor be added? (Look at the leverage plot for the possible predictor; any trend suggests adding it to the model.)
3. Could give the observation less weight: least absolute deviation instead of OLS; consider IRLS.

[Histograms of PopU15, AvIncome, PopO75, and GrowRate omitted.]
100000120000140000 Predicted Value Plot of DFBEI39AS against Observation Number after remedial measures POpU15 DFBEI39AS o o 0 10 20 30 40 50 Observation Number Plot of DFBErAs against Obsenration Number alter remedial measures 05 04 03 02 39 01 39 o0 u u 01 logAvlnoome DFBEI39AS 02 o3 04 i i i i 0 10 20 30 40 50 ObseNation Number Plot of DFBEI39AS against Observation Number alter remedial measures 04 03 2 39 39 4 71 01 ooi 39 o1 7 Pop075 DFBEI39AS 02 03 04 r i i 0 10 20 30 40 50 Observation Number Plot of DFBErAS against Obsenration Number alter remedial measures 04 i 03 027 017 on 39 o o I o1 o27 39 03 IogGrawRate DFBEI39AS o4 05 06 i i i i O 10 20 30 4o 50 ObseNation Number Variable Intercept PopU15 Pop075 logAvIncome logGrowRate DF I I I I H Predict SavRatio after remedial measures Dependent Variable SavRatio Parameter Estimate 2625118 033837 068558 071860 133042 Standard Error 1052632 015791 113571 097492 072528 t Value 249 214 060 074 183 Pr gt t 00165 00377 05492 04650 00734 4 72 Variable Intercept PopU15 logGrowRate DF I bv hv h Final Model Dependent Variable SavRatio Parameter Estimate 1427955 018046 145209 Standard Error 240166 005915 071058 t Value 595 305 204 Pr gt t lt0001 00038 00468 4 73 45 Additional Topics 451 Conditional Effect Plots After certain remedial measures transformations resulting model may be dif cult to interpret Example regress cube root Water81 on square root Income and cube root Water80 and log People80 see 4Concord1additionalsas Parameter Standard Variable DF Estimate Error t Value Pr gt t Intercept 1 228915 035717 641 lt0001 sqrtIncome 1 000755 000178 424 lt0001 crtWater80 1 062448 002991 2088 lt0001 logPeople80 1 088964 015739 565 lt0001 Water8113 228915 l 000755V Income 062448Water8013 088964 log P60pl 80 Conditional effect plot convenient summary of single predictor s effect 4 74 1 substitute means of other predictors into tted model 2 inverse transform the predicted response variable 3 plot this predicted response against the 
original predictor of interest.

Variable      Mean
crtWater80    1.33739864
logPeople80   0.99381

Water81 = [2.28915 + 0.00755 sqrt(Income) + (0.62448)(1.33739864) + (0.88964)(0.99381)]^3

[Figure: Conditional Effect Plot for Income, with the other predictor variables held at their means: predicted Water Usage in 1981 vs. Income]

4.5.2 Robust Regression

Recall Y = Xβ + ε, ε ~ N(0, σ²I), Yhat = Xb. When the model assumptions are met, OLS is best, and we can often make the assumptions be met. When not (specifically, when σ² is not constant), robust regression is often helpful.

OLS: b = (X'X)^(-1) X'Y
WLS: b = (X'WX)^(-1) X'WY, with W a matrix of weights, often (but not necessarily) with off-diagonal elements zero; OLS: W = I.

IRLS (iteratively reweighted least squares):
1. Obtain b (maybe from OLS), then calculate Yhat = Xb and e = Y - Yhat.
2. Calculate weights W based on e (lots of weight functions available).
3. Calculate WLS b = (X'WX)^(-1) X'WY and the resulting Yhat = Xb.
4. Iterate steps 2 & 3 to convergence of b.

How to calculate the weights? Usually chosen to optimize some criterion; the choice of criterion determines the method of weight calculation.

M-estimation: if u_1, ..., u_n are iid from some distribution with parameter θ, then the type M estimate of θ is of the form

θhat = argmin_θ Σ_i ρ(u_i; θ),

where ρ is some scalar objective function. Example: ρ(u; θ) = -log f(u; θ), where f is the pdf of the distribution of u_1, ..., u_n. Then

θhat = argmax_θ Σ_i log f(u_i; θ) = arg max likelihood.

What is this called? (maximum likelihood estimation)

W-estimation approach in IRLS:
1. Calculate a robust estimate of σ, such as s = MAD/0.6745 (MAD = median absolute deviation of the residuals).
2. Let u_i = e_i / s.
3. Calculate diagonal weights w_i = ψ(u_i)/u_i, where ψ(u) = ρ'(u) for some scalar objective function ρ.

Examples:
1. Huber: ρ(u) = u²/2 if |u| <= c; c|u| - c²/2 otherwise; default c = 1.345.
2. Tukey biweight (bisquare): ρ(u) = (c²/6)[1 - (1 - (u/c)²)³] if |u| <= c; c²/6 otherwise; default c = 4.685.

Bisquare weight function: w(u) = [1 - (u/c)²]² for |u| <= c; 0 otherwise.

[Figures: bisquare weight function vs. standardized residual; histogram summary of Water81]

Example: here, force outliers (i.e., contaminate the data) and then compare robust vs. OLS. Note: M-estimation works well for
outliers; for leverage points, use MM-estimation (see SAS help).

[Figure: Comparison of Methods: Water81 vs. Water80, showing the original fit and the contaminated OLS, M, and MM fits]

Estimates and P-values (Intercept, Water80, Income) for the original fit vs. the contaminated-data OLS, M, and MM fits. Original: Intercept 203.82169 (p = 0.0313), Water80 0.59313 (p < .0001), Income 0.02055 (p < .0001). With the contaminated data, the OLS estimates move far from the original fit, while the M and MM estimates remain close to it.

4.5.3 Nonlinear Models

Usually need mechanistic theory.

Example 1: Y = β0 + β1 X1^β2 + β3 exp(β4 X2) + ε

[Figures: plots of the nonlinear data, Y vs. X1 and Y vs. X2, with simulated data]

proc nlin fits the nonlinear model (default squared-error loss):

proc nlin data=temp1 noitprint maxiter=500;
  /* default squared-error loss: equivalent to
     pred = b0 + b1*X1**b2 + b3*exp(b4*X2); _loss_ = (Y - pred)**2; */
  model Y = b0 + b1*X1**b2 + b3*exp(b4*X2);
  parameters b0=100 b1=8 b2=3 b3=-20 b4=4;
  output out=out1 r=resid p=pred;
run;

Parameters are estimated by an iterative process to reduce the SSE at each iteration, until convergence. Keys to useful convergence:
- form of the nonlinear equation
- initial parameter estimates

Truth: b0=50, b1=10, b2=2, b3=-16, b4=2.

The NLIN Procedure
NOTE: Convergence criterion met.
Estimation Summary: Method Gauss-Newton, Iterations 17.

                          Sum of      Mean                   Approx
Source            DF      Squares     Square      F Value    Pr > F
Model             4       1.3361E8    33403686    4037.54    <.0001
Error             45      37230       827.327
Corrected Total   49      1.3362E8

             Approx                   Approximate 95% Confidence
Parameter    Estimate    Std Error    Limits
b0           32.9411     23.1548      -13.6949    79.5771
b1           10.1254     0.6771       8.7618      11.4891
b2           1.9970      0.0207       1.9554      2.0387
b3           -15.5777    0.2049       -15.9904    -15.1650
b4           2.0090      0.00450      1.9999      2.0180

Example 2: a nonlinear curve to describe sand compression, from Lagioia et al. (1996), Computers and Geotechnics 19(3):171-191, with yield surface (as encoded in the proc model code below)

f = p/pc - (1 + q/(p M K2))^(K2/((1-μ)(K1-K2))) / (1 + q/(p M K1))^(K1/((1-μ)(K1-K2))),

where

K_{1,2} = [μ(1-α)/(2(1-μ))] [1 ± sqrt(1 - 4α(1-μ)/(μ(1-α)²))].

f          yield surface (response)
q          deviatoric stress (predictor)
p          mean effective stress (predictor)
pc         hardening/softening constant defining current size of surface (known)
η = q/p    stress ratio
M          parameter defining value of η with no strain increment
μ          parameter defining general slope of d vs. η curve
α          parameter defining how close to the η = 0 axis the curve bends towards d = ∞ (dilatancy)

Goal: find μ, α, and M to make f ≈ 0, and look at the relationship between these three parameters.

[Figure: compare deviatoric (q) and mean effective (p) stresses from a system with true values μ = 1.7, α = 0.1, M = 0.68, and pc = 1607.123]

proc model estimates such nonlinear systems (and can do multiple equations):

proc model data=ex2;
  parms mu 1.7 alpha 0.2 M 0.7;
  bounds M mu > 0;
  control pc 1607.123;
  k1 = mu*(1-alpha)/(2*(1-mu)) * (1 + sqrt(1 - 4*alpha*(1-mu)/(mu*(1-alpha)**2)));
  k2 = mu*(1-alpha)/(2*(1-mu)) * (1 - sqrt(1 - 4*alpha*(1-mu)/(mu*(1-alpha)**2)));
  eq.f = p/pc - ((1 + q/(p*M*k2))**(k2/((1-mu)*(k1-k2))))
              / ((1 + q/(p*M*k1))**(k1/((1-mu)*(k1-k2))));
  fit f / method=marquardt prl=lr corrb;  /* prl= gives CIs */
run;

Sand stress example, truth: μ = 1.7, α = 0.1, M = 0.68.
NOTE: At OLS Iteration 4, CONVERGE=0.001 Criteria Met.

Nonlinear OLS Summary of Residual Errors: Model DF 3, Error DF 15, SSE 0.000165, MSE 0.000011, Root MSE 0.00331 (R-Square and Adj R-Sq also reported).

Nonlinear OLS Parameter Estimates
                          Approx                 Approx
Parameter    Estimate     Std Err    t Value     Pr > |t|
mu           1.67184      0.0181     92.49       <.0001
alpha        0.110909     0.00762    14.56       <.0001
M            0.677976     0.00215    314.83      <.0001

Likelihood Ratio 95% Confidence Intervals
Parameter    Value     Lower     Upper
mu           1.6718    1.6352    1.7061
alpha        0.1109    0.0967    0.1267
M            0.6780    0.6736    0.6821

Correlations of Parameter Estimates
         mu        alpha     M
mu       1.0000    0.9117    0.7978
alpha    0.9117    1.0000    0.8644
M        0.7978    0.8644    1.0000

Statistics 5100, Spring 2009: Midterm Review Notes

Section  Topic                   Important Ideas
1.1      Graphical Summaries     stem-leaf plots, histograms
1.2      Numerical Summaries     center, spread, percentiles, robustness, symmetry
1.3      Symmetry & Boxplots     building & interpreting
1.4      Normality               build & interpret normal quantile plot
1.5      Transformations         ladder of powers, Box-Cox, interpret output, why
transform, which power (based on 4-plot summary)
2.1      Scatterplots & Correlation    linear association, cluster about SD line
2.2      Linear Regression Model       regression line & graph of averages, model assumptions, OLS basic idea, parameters vs. estimates
2.3      Inference                     R² and r, components of ANOVA table, hypothesis testing, P-value, interpret CI for β's, sampling distribution, 1-sided vs. 2-sided tests (when/how), conclusion in context, confidence vs. prediction intervals (response Y), interpret predictions
2.4      Residual Diagnostics          residual plots: what they are, how & why to use
2.5      Remedial Measures             transformations, outliers
2.6      Cautions                      assumptions, interpretability, abuse/misuse
3.1      Partial Correlation           interpretation, controlling, & calculation
3.2      Multiple Regression & Leverage Plots    model, OLS, parameters vs. estimates, interpret coefficient estimates, leverage plot, partial regression and slope
3.3      Inference                     prediction, components of ANOVA table, interpret hypothesis tests (individual t, model F, subset F), R² vs. adjusted R² and r
3.4      Interactions                  interpret additive model, interpret interaction effect and test significance, know why/how
3.5      Qualitative Predictors        dummy variables (indicators) and interpretation of coefficients in regression model, interaction vs. fit separate models, regression vs. ANOVA vs. ANCOVA: when and why
4.1      Multicollinearity             what it is, why a problem, how to detect (VIF, Cond. Index, F test vs. t tests, why not correlations), remedial measures
4.2      Variable Selection Procedures why use them, all possible regressions (Cp & adjusted R²), stepwise methods (backward, stepwise): how they work
4.3      Ridge Regression              why use it, what it does to b's & SE's, choose ridge parameter (VIF & Ridge Trace)
4.4      Influence and Outliers        why a problem, numerical & graphical diagnostics (hat diagonals, DFFITS, DFBETAS, Cook's Distance, Studentized Deleted Residuals): what they measure, interpretation, remedial measures, Bonferroni correction
4.5      Additional Topics             robust regression & nonlinear models: why
4.6      Summary                       reasonable strategy with necessary components
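The Unit 4.4 influence diagnostics summarized above (studentized deleted residuals, checked against a Bonferroni-corrected critical value) can be sketched outside SAS. The following is a minimal Python/NumPy illustration with made-up data; the function name and data are hypothetical, not from the course materials. It uses the standard shortcut SSE_(i) = SSE - e_i²/(1 - h_ii), which avoids actually refitting the model n times.

```python
import numpy as np

def studentized_deleted_residuals(X, y):
    """t_i = e_i / (s_(i) * sqrt(1 - h_ii)), where s_(i)^2 is the MSE
    computed with observation i deleted (via the no-refit shortcut)."""
    n, k = X.shape                                   # k columns, incl. intercept
    H = X @ np.linalg.inv(X.T @ X) @ X.T             # hat matrix
    h = np.diag(H)                                   # leverages h_ii
    e = y - H @ y                                    # ordinary residuals
    sse = e @ e
    s2_del = (sse - e**2 / (1 - h)) / (n - k - 1)    # deleted MSE for each i
    return e / np.sqrt(s2_del * (1 - h))

# Small synthetic example: simple regression with one planted outlier.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 2 + 0.5 * x + rng.normal(0, 1, 30)
y[0] += 8.0                                          # plant an outlier at obs 0
X = np.column_stack([np.ones_like(x), x])
t = studentized_deleted_residuals(X, y)
```

Each |t_i| would then be compared to the upper α/(2n) critical value of the reference t distribution, per the Bonferroni correction discussed in the notes.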
Types of questions to expect (all dealing with one data set):

1. Interpret SAS output and basic code
   - procs univariate, corr, means, reg, glm
   - graphics: what they do, when to use them (4-plot summaries: boxplot, histogram, symmetry plot, normal quantile plot; scatterplots; leverage plots; residual plots)
2. Fill in the blank for SAS output
   - univariate & corr output
   - ANOVA tables, confidence & prediction intervals
   - parameter estimates, R-square, correlation
   - key: how are things related?
3. Interpretation and rationale
   - what do components of output really mean, and how should they be used?
   - why/when would you use certain models, procedures, or diagnostics?
4. Model assumptions & remedial measures (plus which transformation does what)
5. Group project: group name & names of group members

Misc. notes:
- test is open book & open note, but should only need them sparingly (front page)
- bring a calculator; no cell phones or other wireless devices
- be concise and pace yourself
- common courtesy arriving & leaving

Stat 5100 Notes, Spring 2009
Unit 3: Multiple Linear Regression

Section   Topic
3.1       Partial Correlation (pp. 65-72)
3.2       Mult. Reg. & Leverage Plots (pp. 70-72, 85)
3.3       Inference in Mult. Reg. (pp. 77-81)
3.4       Interactions (p. 84)
3.5       Qualitative Predictors (pp. 84-101)

3.1 Partial Correlation

Previously: 2 variables => scatterplot & correlation coefficient indicate relationship.
Now: more than 2 variables => assessing relationships becomes more complicated. Still look at pairwise scatterplots & correlations. But what do these pairwise relationships mean?

Example: X2 = Income, Y = Water81, X1 = Water80 (see 3Concord1MultReg.sas).

Pair                  Correlation
Water80 & Income      positive, 0.34
Water80 & Water81     positive, 0.76
Income & Water81      positive, 0.42

But what is the real relationship between Income & Water81?
- real effect, or is r large only because Water80 is a confounding factor?
- want to look at Income vs. Water81, controlling for Water80

How to control for the effect of Water80? Could look at Income vs. Water81 for different Water80
groups:
- 0-2500, 2501-5000, etc.
- but these estimated correlations could be noisy (highly variable) due to low sample sizes
- this approach could get worse as we try to control for more factors

Instead, use regression:
- essentially chop out the parts of Income & Water81 explained by Water80
- then compare what's left of Income & Water81

General algorithm for examining the relationship between variables X2 & Y, controlling for variable X1:
1. Regress X2 on X1; get residuals e_{X2|X1} (residual of X2 given X1).
2. Regress Y on X1; get residuals e_{Y|X1}.
3. Plot residuals e_{X2|X1} vs. e_{Y|X1}.
4. Compute r_{e_{Y|X1}, e_{X2|X1}}, the correlation between e_{X2|X1} & e_{Y|X1}.

This gives the partial correlation between X2 & Y, controlling for X1.

Note that e_{X2|X1} = X2 - X2hat(X1) and e_{Y|X1} = Y - Yhat(X1). What is being subtracted, and why? In both cases, adjusting for the effect of X1.

Example: X2 = Income, Y = Water81, X1 = Water80. X2 vs. Y:

Pearson Correlation Coefficients, N = 496
       X2         Y          X1
X2     1.00000    0.41779    0.33705
Y      0.41779    1.00000    0.76479
X1     0.33705    0.76479    1.00000

Partial correlation between X2 & Y, after controlling for the effect of X1:

Pearson Correlation Coefficients, N = 496
             e_{X2|X1}   e_{Y|X1}
e_{X2|X1}    1.00000     0.26379
e_{Y|X1}     0.26379     1.00000

[Figures: X2 vs. Y (correlation 0.42); X2 vs. Y after controlling for X1, i.e. the partial residuals (partial correlation 0.26)]

The partial correlation is smaller (closer to 0) because of the adjustment for the effect of X1 on both X2 and Y.

What if we have more than three variables? Ex: Water81, Water80, Income, People81, Education, etc. Need to compute the partial correlation between two variables while controlling for more than one other variable:
- regress each variable of primary interest on more than one other variable
- this is multiple regression

3.2 Multiple Regression & Leverage Plots

Consider a single response variable Y (or dependent variable) that we want to relate to several predictor variables X1, X2, ..., X_{k-1} (or independent, carrier, or explanatory variables). Ex: Y = Water81, X1 = Water80, X2 = Income, etc.

Multiple linear regression model for predicting Y based on X1, ..., X_{k-1}, for observations i = 1, ..., n:

Y_i = β0 + β1 X_{i,1} + β2 X_{i,2} + ... + β_{k-1} X_{i,k-1} + ε_i

Assumptions, similar to simple linear regression:
- ε_1, ..., ε_n iid N(0, σ²): independent, normally distributed, constant variance
- ε_1, ..., ε_n unrelated to the Xj's
- no outliers

Why "linear"?
- Y_i = β0 + β1 sin(X_{i1}) + β2 X_{i2} + β3 e^{X_{i3}} + ε_i: let X*1 = sin(X1), X*2 = X2, X*3 = e^{X3}; then Y_i = β0 + β1 X*_{i1} + β2 X*_{i2} + β3 X*_{i3} + ε_i.
- Y_i = β0 + β1^{X_{i1}} + ε_i: problem here, cannot be made linear.
- Y_i = β0 β1^{X_{i1}} β2^{X_{i2}} ε_i => log Y_i = log β0 + X_{i1} log β1 + X_{i2} log β2 + log ε_i => Y'_i = β'0 + β'1 X_{i1} + β'2 X_{i2} + ε'_i.

NOTE: do not confuse these:
- multiple regression: regress a single response variable Y on multiple explanatory variables X1, ..., X_{k-1}
- multivariate regression: regress multiple response variables (Y1 & Y2, for example) on one or more explanatory variables (X's)

Estimating parameters:
- How many are there? Y_i = β0 + β1 X_{i,1} + β2 X_{i,2} + ... + β_{k-1} X_{i,k-1} + ε_i, with ε_i ~ N(0, σ²).
- There are (k - 1) + 1 = k unknown regression parameters. Let β = (β0, β1, ..., β_{k-1}).
- Ordinary Least Squares (OLS) => b is the minimizer of

  RSS (Residual Sum of Squares) = Σ_i e_i² = Σ_i (Y_i - b0 - b1 X_{i,1} - ... - b_{k-1} X_{i,k-1})²

Estimate σ² = Var(ε):
- Use e_i = Y_i - Yhat_i to estimate ε_i.
- Sample variance of data: Σ_i (e_i - ebar)²/(n - 1); divide by n - 1 and not n here because we've lost 1 df by computing the mean (we've constrained one data point).
- But in OLS, how many df have we really lost? 1 df for each βj estimated => lose k; residual mean ebar = 0 => lose no additional df.
- So divide by n - k: MSE = RSS/(n - k).

Formulae for individual estimates bj:
- Simple regression (single predictor X1) => simple formulae.
- Multiple regression (multiple predictors X1, ..., X_{k-1}) => no simple formulae for individual estimates. We'll have
to trust the estimates from the computer SAS similarly for standard errors and prediction intervals 0 We Will return later to an elegant formulation of the linear model and estimates in terms of matrices and vectors 41 3 16 Coefficient Interpretation 0 Y 90 b1X1 bk1Xk1 o What does 5939 bj mean 1 How much of an increase in Y you would expect for every unit increase in Xj While holding all the other X s constant relationships among the X s might make this condition hard to satisfy changing one Xj may affect other X s 2 The average effect of X j on Y the slope When regressing Y on Xj after removing the effects of the other X variables on both Xj and Y Example effect of X2 Income and X1 Water80 on Y Water81 Regressing Water81 on Income 11000 10000 Water Consumph39on in 1981 1 1 1 1 1 0 20000 40000 000 80000 100000 Household Income in 1981 Water Consumption in 1981 post shortage Regressing Water81 on Waterao 3 17 11000 10000 What about the joint effect of X2 Income and X1 Water80 on Y Water81 Previously we looked at the plot of the partial residuals between X2 Income and Y Water81 controlling for X1 Water80 Residual of Walem WaterSO This 3 18 Partial Regression Leverage Plot of Water81 on Income given Water o o Regress chXl on cX2X1 5000 D O 4000 a 0 Correlation of these residuals 3000 partial correlation m m o Slope the average effect of X2 on 0 Y after removing the effect of X1 1009 on both X2 and Y o Slope here is same as 92 in multiple m o regression model 5000 o Yb0b1X1b2X2 40000 20000 0 20000 4W0 60000 800 Residual of IncomeWater80 is called a partial regression residual plot or a leverage plot Regressing residual of Water81Water80 on residual of IncomeWater80 Partial regression of Water81 on Income after adjusting for Water80 319 Parameter Estimates Parameter Standard Variable Label DF Estimate Error t Value Pr gt t Intercept Intercept 1 12742E13 4151085 000 10000 rIncW80 Residual 1 002055 000338 608 lt0001 Regressing Water81 on Income and Water80 Note same slope 
for Income here as in the partial regression above:

                    Parameter    Standard
Variable      DF    Estimate     Error       t Value   Pr > |t|
Intercept     1     203.82169    94.36129    2.16      0.0313
Income        1     0.02055      0.00338     6.07      <.0001
Water80       1     0.59313      0.02505     23.68     <.0001

[Figure: a "cheap" leverage plot, a character-cell scatterplot of the partial regression residuals against Income: not good enough]

Why use leverage plots?
- Visualize the effects of individual Xj variables on Y.
- Assess the linearity/nonlinearity of relationships.

Why are they called leverage plots?
- Very useful in detecting data points that have high leverage (overly influential).
- We will return to this idea later (Unit 4.4).

3.3 Inference in Multiple Regression

Recall Y_i = β0 + β1 X_{i,1} + ... + β_{k-1} X_{i,k-1} + ε_i, i = 1, ..., n. OLS => estimates b0, b1, ..., b_{k-1} => predicted value Yhat_i = b0 + b1 X_{i,1} + ... + b_{k-1} X_{i,k-1}.

How good is the fitted model? Look at how the variability is partitioned in the ANOVA table:

Source   df      Sum of Squares               Mean SS             F ratio                     P value
Model    k - 1   ESS = Σ_i (Yhat_i - Ybar)²   ESS/(k - 1)         [ESS/(k-1)] / [RSS/(n-k)]   P(F >= F_obs)
Error    n - k   RSS = Σ_i (Y_i - Yhat_i)²    RSS/(n - k) = MSE
Total    n - 1   TSS = Σ_i (Y_i - Ybar)²

Testing an individual predictor Xj:
- H0: βj = C vs. H1: βj ≠ C (usually we have C = 0)
- Use test statistic t = (bj - C)/SE(bj)
- Under H0, t ~ t_{n-k} (t distribution with n - k df)

Model F test in the ANOVA table:
- H0: β1 = β2 = ... = β_{k-1} = 0
- H1: at least one βj ≠ 0
- Use test statistic F = [ESS/(k-1)] / [RSS/(n-k)]
- Under H0, F ~ F_{k-1, n-k} (F distribution with k - 1 numerator and n - k denominator df)

Example: regress Water81 on Income, Water80, Education & People81.

Full Model. Dependent Variable: Water81

Analysis of Variance
                        Sum of        Mean
Source          DF      Squares       Square       F Value   Pr > F
Model           4       736580499     184145125    253.51    <.0001
Error           491     356658211     726391
Corrected Total 495     1093238710

Root MSE         852.28602    R-Square   0.6738
Dependent Mean   2298.38710   Adj R-Sq   0.6711
Coeff Var        37.08192

                    Parameter    Standard
Variable      DF    Estimate     Error        t Value   Pr > |t|
Intercept     1     399.64803    188.85176    2.12      0.0348
Income        1     0.01960      0.00336      5.84      <.0001
Water80       1     0.48440      0.02613      18.54     <.0001
Education     1     -43.98044    13.23258     -3.32     0.0010
People81      1     240.50194    27.58814     8.72      <.0001

Example: testing a subset of predictors. Regress Water81 on Income, Water80, Education & People81: Y = Water81, X1 = Income, X2 = Water80, X3 = Education, X4 = People81,

Y_i = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + ε_i

Do Water80 and People81 both have no effect? H0: β2 = β4 = 0.

Testing a subset of H predictors (re-ordered here wlog):
- H0: β_{k-H} = ... = β_{k-1} = 0
- Full model: Y_i = β0 + β1 X_{i,1} + ... + β_{k-H-1} X_{i,k-H-1} + β_{k-H} X_{i,k-H} + ... + β_{k-1} X_{i,k-1} + ε_i
- Reduced model: Y_i = β0 + β1 X_{i,1} + ... + β_{k-H-1} X_{i,k-H-1} + ε_i
- Test statistic: F = [(RSS_reduced - RSS_full)/H] / [RSS_full/(n - k)]
- Under H0, F ~ F_{H, n-k}

Reduced Model. Dependent Variable: Water81

Analysis of Variance
                        Sum of       Mean
Source          DF      Squares      Square       F Value   Pr > F
Model           2       204328570    102164285    56.66     <.0001
Error           493     888910139    1803063
Corrected Total 495     1093238710

Full vs. Reduced Model, automatically. Test Water80 AND People81, Results for Dependent Variable Water81: Numerator DF 2, Denominator DF 491, Mean Square 266125964, F Value 366.37, Pr > F <.0001.

Confidence / Prediction Intervals:
- Obtain similarly to simple linear regression.
- Difference: change t_{n-2} to t_{n-k} for critical values (done automatically by SAS).

Parameter Estimates: 90% Confidence Limits
Variable      Lower        Upper
Intercept     88.42733     710.86873
Income        0.01406      0.02513
Water80       0.44134      0.52747
Education     -65.78725    -22.17363
People81      195.03771    285.96617

Predicted values and confidence and prediction intervals (Water81 ≈ 2200 for 20000 < Income < 22000; note that these are 90% intervals):

Obs  Income   Predict    lPred90    uPred90    lConf90    uConf90
1    21000    1575.18    162.39     2987.97    1422.68    1727.67
2    21000    1169.81    -238.06    2577.68    1073.03    1266.60
3    21000    1310.67    -96.44     2717.78    1225.60    1395.75
4    21000    2119.08    712.45     3525.71    2042.36    2195.80
5    21000    3903.52    2492.99    5314.05    3773.63    4033.40

Recall the coefficient of determination:

R² = ESS/TSS = 1 - RSS/TSS = proportion of the variability in Y
explained by the model.

What will adding more predictor variables to the model do to R²? (TSS stays the same.) Danger here: overparameterization, having too many variables in the model, even spurious ones. How to balance? Adjust R² for the number of predictor variables:

Adjusted R²: R²_a = 1 - [(n - 1)/(n - k)] (RSS/TSS)

As k (# of vars + 1) increases, R²_a first increases, then decreases.

3.4 Interactions in Multiple Regression

Assumed until now: the effects of the explanatory variables are additive:

Y_i = β0 + β1 X_{i,1} + β2 X_{i,2} + ε_i

What if the real effect of X2 on Y actually depends on X1 as well?

Example: Y = Water81, X2 = Water80, X1 = People80.

Additive Model
                    Parameter    Standard
Variable      DF    Estimate     Error       t Value   Pr > |t|
Intercept     1     122.85756    89.77782    1.37      0.1718
Water80       1     0.52465     0.02709     19.36     <.0001
People80      1     238.56894    28.82049    8.28      <.0001

What would it mean for the effect of Water80 on Water81 to depend on People80?
- Whole is worth more than the sum of its parts (synergy).
- We know higher Water80 => higher Water81, and higher People80 => higher Water81.
- But maybe higher Water80 and higher People80 => much higher Water81.
- "Much higher" here: significantly more than could be attributed to the sum of the effects of Water80 and People80 only.

This is an interaction effect. Define an interaction term as a new predictor variable X3 = X1 X2:

Y_i = β0 + β1 X_{i,1} + β2 X_{i,2} + β3 X_{i,1} X_{i,2} + ε_i

Interaction Model
                    Parameter    Standard
Variable      DF    Estimate     Error        t Value   Pr > |t|
Intercept     1     444.06772    147.52184    3.01      0.0027
Water80       1     0.41307     0.04889     8.45      <.0001
People80      1     131.91163    48.39600    2.73      0.0066
Int1          1     0.03136     0.01147     2.73      0.0065

Coefficient interpretation in the model Y_i = β0 + β1 X1 + β2 X2 + β3 X1 X2 + ε_i:
- Interaction interpretation: if X2 increases by 1 unit, then we expect an average change of β2 + β3 X1 in Y; so the effect of X2 on Y depends on X1.
- Main effect interpretation: β2 = if X2 increases by 1 unit and X1 = 0, then we expect an average change of β2 in Y; not necessarily meaningful all by itself; we can't separate out the effect of X1
from the effect of X2 3 34 Final notes on interactions 0 Two way interactions X 1X2 are good When interpretable o Higher order interactions like X1X2X3 become dif cult to in terpret only use When clearly interpretable o If a higher order interaction like X1X2X3 is used then the model must also include all lower order interactions X1X27 X1X37 X2X37 X17 X2 X3 good form mathematically consistent otherwise those coef cients are forced to be zero want a exible response surface want to maintain correct interpretation 3 35 35 Qualitative Predictor Variables One of these things is not like the others 1 cars length weight cylinders Wheelbase 2 students height weight age innieoutie Quantitative variables measured on a meaningful numeric scale 0 Continuous values could fall in a continuum 0 Discrete can only take on certain values 3 36 Qualitative variables cannot be measured on a meaningful numeric scale 0 sometimes called categorical or class variables 0 can use in linear regression if coded as dummy or indicator variables Indicator variable examples 1 X I 1 if innie TRUE innie Z TRUE 0 otherwise 2 X3I X4I X5I cylinders3 cylinders4 cylinders5 3 37 351 Indicator Variable in Regression Example 13 turkeys in three states 2 South 1 North have data on their age in weeks weight in pounds and color Possible model before now Weight g 1Age 6 Suppose we want to predict average turkey weight given a turkey s age and location De ne indicator variable LOG I location North 1 if North 0 otherwise 3TurkeyQualitativePredictorssas Then could use the linear regression model Weight g l lAge l gLOC l 6 3 38 Note What this assumes 0 relationship for each location is it different North Loo 1 Weight g l g l lAge l e South Loo 0 Weight g l 1Age l e Additive Model Regression lines for Nonh top and Soulh bottom weight 20212223242526272829303132 399 mil00001 Sourc Model Error Corre linear regression with dummy variable Dependent Variable weight Analysis of Variance e DF 2 3 10 cted Total 12 3 Root MSE 
Dependent Mean 1 Coeff Var Parameter Variable DF Estimate Intercept 1 042292 age 1 047917 Loc 1 204375 Sum of Squares 846442 095250 941692 030863 278462 241404 Mean Square F Value 1923221 201 009525 RSquare 09758 Adj RSq 09710 Standard Error t Value 069023 061 002572 1863 018012 1135 Pr gt F 91 lt0001 Pr gt It 05537 lt0001 lt0001 3 39 352 Interaction with Indicator Variable in Regression What would it mean for the two lines to not have the same slope 3 40 0 effect of Age on Weight depends on Location Weight g l lAge gLoe gAge gtllt L00 l e North L00 1 Weight g l g l l l 3Age l e South L00 0 Weight g lAge 6 Interactive Model Regression lines for Nonh top and Soulh bottom weight 2021222324252522329303132 age LwIIIOOOO1 3 41 353 Interaction vs Separate Fits What if we split the data and t two separate lines 0 North L00 1 Weight N imAge e 0 South L00 0 Weight gs ESUlge 6 These will be the same as the two lines from the interaction model 5N50 2 7 gs o iN 1 l 3 7 9951 But the signi cance tests will be different because of the different sample sizes linear regression with dummy interaction Number of Observations Used 13 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr gt It Intercept 1 077115 082368 094 03736 age 1 049231 003080 1598 lt0001 Loc 1 324615 149652 217 00582 AgeLoc 1 004731 005844 081 04391 Two lines separately Loc0 Number of Observations Used 8 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr gt It Intercept 1 077115 076371 101 03516 age 1 049231 002856 1724 lt0001 Loc1 Number of Observations Used 5 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr gt It Intercept 1 247500 141394 175 01783 age 1 044500 005620 792 00042 3 43 A few comments 0 Best to use model With interaction to take advantage of larger sample size and to test for signi cance of the interaction term 0 How many indicator dummy variables are needed We used 1 for Location With 2 levels North and South What if we 
wanted to use State 3 levels GA VA WI De ne 2 indicator variables 31 and 2 1 SQ State 1 0 GA 0 1 VA 0 0 WI In general for a qualitative predictor variable With q levels need q 1 indicator variables 3 44 354 Indicators Only What if we had only qualitative predictor variables 0 Use State to predict Weight ignore Age Weight 30 lsl 3252 6 39 GA 517 521703 Weight e i 6 1 6 O 5132012 Weight g g E M2 E 0 WI 313200 Weight g e ch e 0 Comparing state averages M1M2M3 how much of variance in Weight can be attributed to the levels of the variable State 0 This is Analysis of Variance ANOVA Side by side boxplots T E VA l i WI 3 45 3 46 One Way ANOVA 0 using only one qualitative predictor here State Ym39 LLi l em39 i1qj1m ulanglEm39 iWIO 7L1 em z39z39d N0 02 0 Can recast this as a regression model using q 1 indicator variables 0 Could instead t ANOVA model directly proc glm Source Model Error Corrected Total Variable Intercept S1 S2 Source Model Error Corrected Total Parameter Intercept state GA state VA state WI State as a dummy variable The REG Procedure Sum of Mean DF Squares Square F Value Pr gt F 2 638192 319096 097 04135 10 3303500 330350 12 3941692 Parameter Standard DF Estimate Error t Value Pr gt It 1 1360000 081283 1673 lt0001 1 167500 121925 137 01995 1 097500 121925 080 04425 State in oneway ANOVA The GLM Procedure Sum of DF Squares Mean Square 2 638192308 319096154 10 3303500000 330350000 12 3941692308 Standard Estimate Error t Value 1360000000 B 081283455 1673 167500000 B 121925182 137 097500000 B 121925182 080 000000000 B F Value Pr gt F 097 04135 Pr gt It lt0001 01995 04425 3 47 3 48 One way ANOVA vs Regression With dummy variables 0 Similarities ANOVA table SS df MS R square etc Predicted values 0 Differences Regression gives more easily interpretable coef cient esti mates Regression gives direct standard errors and t tests ANOVA automatically tests equivalence among variable lev els 3 49 355 TwoWay ANOVA What if we wanted to use more than one qualitative 
predictor 0 Example 3 level State and 2 level Color 0 Regression approach 31 32 as before C I Color Black Weight 30 lsl 252 330 6 o Two way ANOVA tvvo qualitative predictors 04 level State 7 level Color 1 GA 1 black 2 VA 2 White 3 WI 3 50 Two way ANOVA example State and Color as qualitative predictors Yzjk MO 7j jk i1Ij1Jk31n 7j Z Z o 73 j em z39z39d N002 Could write out every u l 047 l 7939 combination in terms of the s Same similarities amp differences between regression and two way ANOVA as between regression and one way ANOVA State and color Source Model Error Corrected Total Root MSE Variable DF Intercept 1 S1 1 S2 1 C 1 Source Model Error Corrected Total RSquare 0171847 Source state color Source state color as dummy variables in regression Sum of Mean DF Squares Square F Value Pr gt F 3 677366 225789 062 06181 9 3264326 362703 12 3941692 190448 RSquare 01718 Parameter Standard Estimate Error t Value Pr gt It 1338136 108075 1238 lt0001 163856 128236 128 02333 102966 128834 080 04447 036441 110883 033 07499 State and color in twoway ANOVA Sum of DF Squares Mean Square F Value Pr gt F 3 677366037 225788679 062 06181 9 3264326271 362702919 12 3941692308 Coeff Var Root MSE weight Mean 1489662 1904476 1278462 DF Type I SS Mean Square F Value Pr gt F 2 638192308 319096154 088 04477 1 039173729 039173729 011 07499 DF Type III SS Mean Square 2 620473729 310236864 1 039173729 039173729 F Value Pr gt F 086 04570 011 07499 356 Interactions in TwoWay ANOVA 3 52 What if the effect of State on Weight depends on Color 0 gt interaction term 0 Regression need to de ne products of every pair of indicator variables 10 320 then model Weight 81 S2 C SlC S2C Many levels of qualitative predictors gt messy quickly also the signi cance tests Will be misleading except for the model F test 0 ANOVA syntax and interpretation is much easier proc glm Yijk MO 7j 06j 6 ZOHZZW ZOWM ZOWM 0 7 j 7 j State and color and interaction as dummy variables in regression Note that other than the overall 
model Ftest the significance tests here are wrong Parameter Standard Variable DF Estimate Error t Value Pr gt It Intercept 1 128 00 148075 lt0001 S1 1 1 00000 209410 0 48 06475 S2 1 0 35000 256473 0 14 08953 C 1 1 25000 191164 0 65 05341 S1C 1 1 10000 283542 0 39 07096 S2C 1 2 01667 308243 0 65 05338 State and color and interaction in twoway ANOVA Sum of Source DF Squares Mean Square F Value Pr gt F Model 5 872025641 174405128 040 08363 Error 7 3069666667 438523810 Corrected Total 12 3941692308 Source DF Type III SS Mean Square F Value Pr gt F state 2 524659605 262329802 060 05756 color 1 012666667 012666667 003 08699 statecolor 2 194659605 097329802 022 08064 Standard Parameter Estimate Error t Value Pr gt It Intercept 1285000000 B 148074949 868 lt0001 state GA 100000000 B 209409601 048 06475 state VA 035000000 B 256473335 014 08953 state WI 000000000 B color black 1 25000000 B 191163937 065 05341 color white 0 00000000 B statecolor GA black 110000000 B 283541940 039 07096 statecolor GA white 000000000 B statecolor VA black 201666667 B 308242586 065 05338 statecolor VA white 000000000 B statecolor WI black 000000000 B statecolor WI white 000000000 B 3 53 3 54 A few notes about ANOVA 0 Some interactions cannot be estimated because not enough obser vations at every 04 7 combination 0 Different SS tables Type I looks at contributions made by the variable after previous variables have been accounted for so order in model statement matters here Type III looks at contributions made by the variable after all other variables have been accounted for so order in model statement does not matter here 0 Type III SS closer to What multiple linear regression does gt primarily interested in Type III SS 3 55 357 Quantitative Predictors and ANOVA What if we have both quantitative and qualitative predictors Example Weight depends on both Age and Location 0 Previously convert qualitative predictor to indicator variables gt regression model 0 Consider ANOVA based approach called 
ANCOVA (Analysis of Covariance):

    Y_ij = μ0 + α_i + β1*X_ij + ε_ij

Here α_i is qualitative and X_ij is quantitative. Can add an interaction as well. ANCOVA yields the same results as the regression approach.

Location and age and interaction in ANCOVA:

                           Sum of
    Source           DF    Squares        Mean Square    F Value   Pr > F
    Model             3    38.52907692    12.84302564    130.19    <.0001
    Error             9    0.88784615     0.09864957
    Corrected Total  12    39.41692308

    Source          DF   Type III SS    Mean Square    F Value   Pr > F
    age              1   25.38020940    25.38020940    257.28    <.0001
    location         1   0.46415855     0.46415855     4.71      0.0582
    age*location     1   0.06465385     0.06465385     0.66      0.4391

Linear regression with dummy interaction:

                           Sum of       Mean
    Source           DF    Squares      Square      F Value   Pr > F
    Model             3    38.52908     12.84303    130.19    <.0001
    Error             9    0.88785      0.09865
    Corrected Total  12    39.41692

                    Parameter   Standard
    Variable   DF   Estimate    Error      t Value   Pr > |t|
    Intercept   1   0.77115     0.82368    0.94      0.3736
    age         1   0.49231     0.03080    15.98     <.0001
    Loc         1   3.24615     1.49652    2.17      0.0582
    Age*Loc     1   0.04731     0.05844    0.81      0.4391

    Type of predictor      Type of           SAS procedure
    variables in model     analysis          to use
    -------------------------------------------------------------------
    quantitative           linear            proc reg only
                           regression        (but glm works too)
    qualitative            ANOVA or          proc glm (with qual vars in
                           regression with   class statement), or proc reg
                           dummy vars        if no interactions are of
                                             interest
    both quantitative      ANCOVA or         proc glm (with qual vars in
    and qualitative        regression with   class statement), or proc reg
                           dummy vars        if no qual×qual interactions
                                             are of interest

A few final notes on models with qualitative variables:
- Why use dummy vars if the syntax for ANOVA/ANCOVA is easier?
    1. easily interpretable coefficient estimates
    2. procedures are not well defined for determining whether to add or drop qualitative predictors; need dummy vars as quantitative predictors in regression
- All this assumes the response Y is quantitative and at least roughly continuous; later we will address discrete or qualitative responses.
- The scope of this course is limited to linear regression and time series models; the treatment of models for qualitative variables here is
relatively light. ANOVA, ANCOVA, etc. are covered in more depth in Design of Experiments (STAT 5200) and Categorical Data Analysis (STAT 5120).
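Although these notes use SAS (proc reg / proc glm), the Type I vs. Type III distinction can be checked numerically in any language by fitting nested OLS models and differencing their error sums of squares. Below is a hedged sketch in Python with simulated stand-in data: the notes' 13-observation State/Color weight data set is not reproduced here, so all values are hypothetical, but the dummy coding (S1, S2 for State with WI as baseline, C for Color with white as baseline) follows the notes.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from an OLS fit of y on the columns of X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return r @ r

# Hypothetical stand-in data with the same layout as the notes' example.
rng = np.random.default_rng(0)
n = 13
state = np.array(["GA", "VA", "WI"])[rng.integers(0, 3, size=n)]
color = np.array(["black", "white"])[rng.integers(0, 2, size=n)]
weight = 12.0 + rng.normal(scale=2.0, size=n)

# Dummy coding as in the notes: S1 = 1 for GA, S2 = 1 for VA (WI is the
# baseline level); C = 1 for black (white is the baseline).
S1 = (state == "GA").astype(float)
S2 = (state == "VA").astype(float)
C = (color == "black").astype(float)
one = np.ones(n)

# Full main-effects model: Weight = b0 + b1*S1 + b2*S2 + b3*C + error
X_full = np.column_stack([one, S1, S2, C])

# Type I (sequential) SS for state: state enters first, after the intercept.
ss_state_type1 = (sse(np.column_stack([one]), weight)
                  - sse(np.column_stack([one, S1, S2]), weight))

# Type III SS for state: state enters last, with color already in the model.
ss_state_type3 = sse(np.column_stack([one, C]), weight) - sse(X_full, weight)

print(ss_state_type1, ss_state_type3)
```

Because color is the last term in the notes' model statement, its Type I and Type III SS coincide (0.39173729 in both SAS tables), while the state rows differ; the same pattern shows up in this sketch whenever the simulated design is unbalanced.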

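In the same spirit, the ANCOVA-as-regression equivalence can be sketched with simulated data. The design matrix below mirrors the Weight = β0 + β1*Age + β2*Loc + β3*(Age×Loc) model from the notes; the data values are hypothetical stand-ins, not the notes' Age/Location data set.

```python
import numpy as np

# Hypothetical data: quantitative Age plus a 2-level Location dummy.
rng = np.random.default_rng(1)
n = 13
age = rng.uniform(10.0, 40.0, size=n)
loc = rng.integers(0, 2, size=n).astype(float)  # dummy: 1 for one location
weight = 0.8 + 0.5 * age + 3.0 * loc + rng.normal(scale=0.3, size=n)

# ANCOVA with interaction, written as a dummy-variable regression:
# Weight = b0 + b1*Age + b2*Loc + b3*(Age*Loc) + error
X = np.column_stack([np.ones(n), age, loc, age * loc])
b, *_ = np.linalg.lstsq(X, weight, rcond=None)
resid = weight - X @ b

print(b)              # b1: Age slope at baseline location; b1 + b3: other slope
print(resid @ resid)  # error SS, as in the ANCOVA table
```

Fitting this design matrix by least squares reproduces what proc glm's ANCOVA reports, which is the point of the notes: ANCOVA and dummy-variable regression are the same model in different notation.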