This 19 page Class Notes was uploaded by Orval Funk on Monday September 28, 2015. The Class Notes belongs to STAT102 at University of Pennsylvania taught by Staff in Fall. Since its upload, it has received 7 views. For similar materials see /class/215434/stat102-university-of-pennsylvania in Statistics at University of Pennsylvania.
Date Created: 09/28/15
Lecture 13: Model Building in Multiple Regression
Stat 102

• We will illustrate the practice of model building, and discuss the theory, through examination of a model building exercise.
• Data set pollution.JMP provides information about the relationship between pollution and mortality for 60 cities between 1959 and 1961.
• Goal: Build a multiple regression model that can be used to examine the effect on mortality of several pollution-related variables. Build the most useful model based on the available data.

The Data

• The dependent variable (Y) is MORT(ality): total age-adjusted mortality, in deaths per 100,000 population.
• The x-variables that could be used are:
    PRECIP: mean annual precipitation (in inches)
    EDUC: median number of school years completed for persons 25 and older
    NONWHITE: percentage of 1960 population that is nonwhite
    NOX: relative pollution potential of N2O (related to the amount, in tons, of Nitrous Oxide (N2O) emitted per day per square kilometer)
    SO2: relative pollution potential of SO2
• Among these, the pollution-related descriptors are NOX, SO2 and (indirectly) PRECIP.
• The remaining 2 variables are included as controls. Controls help answer whether pollution is important after controlling for other relevant non-pollution factors.

Choosing variables to include

1. Which form of the variable? For each variable, decide whether it is appropriate to transform it (e.g. use the log or square root of the variable).
   • For the Y variable, the main reason for a transformation is to attain homoscedasticity of the final model.
   • There are two main reasons to make a transformation of the x-variables:
       (1) the relationship between the explanatory variable and the response variable is nonlinear;
       (2) the explanatory variable is crunched together, with a few outliers and/or some influential points.
2. Which of the explanatory variables to include in the model?

The Model Building Process is Interactive

• The goal is the best end product.
• There is no uniquely correct order in which to proceed.
• Decisions as to which variables to use and how to transform them may need to be revisited as the analysis progresses.

Scatterplot Matrix

• This is often a useful first step.
• Examine the simple regression plots of Y on the potential x-variables. See whether they seem mostly linear, and see whether the Y variable looks as if it will be homoscedastic in future analyses. If not, try a transformation (such as Log(Y)) before going further.
• Our data show no worrying heteroscedasticity.
• They also look pretty good in terms of nonlinearity, but we will still find some useful corrective actions.
• To get the Scatterplot Matrix in JMP, click Multivariate and then put the Y variable and all x-variables into the Columns box.

[Figure: Scatterplot Matrix]

Transformation to Correct Non-Linearities

• If nonlinearity is present in the relationship between Y and some x, it is usually a good idea to try to correct it now, before going further. We'll want to use only transformations of x. We probably don't want to transform Y, since we seem to already have desirable homoscedasticity.
• It's not very evident from the small-size plot in the Scatterplot Matrix, but there is some nonlinearity in the plot of MORT on SO2. We'll work on that first.

[Figure: plot of MORT on SO2 with three fitted curves]

    Linear Fit:                R^2 = 0.181
    Transformed Fit to Log:    R^2 = 0.163
    Transformed Fit to Sqrt:   R^2 = 0.199

Sqrt(SO2) is the best, if only by a modest amount. It also has another advantage that we'll look at soon. The residual plot also looks good.

Transformations to Correct Crunching and bad influence situations

• All the scatterplots involving NOX as independent variable look odd, including that for MORT on NOX.
• The problem is that the values of NOX are crunched.
• They mostly lie near one end (the left), with a few high-influence points hanging out.
• High-influence points are usually undesirable, as we've seen before.
• This type of situation can sometimes be improved by transforming the offending x-variable. Use log(x) for right skewness, e^x for left skewness.
• Log(x) works well for our data.

[Figure: scatterplot of MORT by Log(NOX)]

• The values of this x-variable are nicely spread out.
• The Y on x relationship looks not too nonlinear. It's not perfect, but it's not clear how we could get it to be better.

The Bonus in Using Sqrt(SO2)

• SO2 is also a little crunched.
• Sqrt(SO2) uncrunches it.

[Figure: scatterplot of MORT by Sqrt(SO2)]

There's even a nicely linear pattern here.

Forward Selection

• It is important to use only the variables which are useful predictors. Using other variables will result in worse predictions and higher standard errors of coefficient estimates, because degrees of freedom must be used to estimate the coefficients of the not-useful variables.
• So we usually don't keep variables which are insignificant (P-value > 0.05).
• Proceed in steps:
  1. Choose the explanatory variable having the highest R^2 with Y. Include it if its P-value is < 0.05 (approx.).
  2. Compute residuals from the simple linear regression.
  3. For the remaining x-variables, calculate their values of R^2 with these residuals. Include the variable with the highest such R^2 if it has P < 0.05.
  4. There are now two x-variables in the model being built. Compute the residuals from the multiple regression of Y on these two variables.
  5. For the remaining x-variables, calculate their values of R^2 with these residuals. Include the variable with the highest such R^2 if it has P < 0.05 (approx.). ETC.
  6. Stop when the relevant P-value is > 0.05.

Forward Selection for Pollution Data

1. Multivariate Correlations

            MORT     PRECIP   EDUC      NONWHITE   Log(NOX)   Sqrt(SO2)
   MORT     1.0000   0.5095   -0.5110   0.6437     0.2920     0.4458

   • NONWHITE has the largest R^2: R^2 = 0.6437^2 = 0.4143.
   • We can find its P-value from the Parameter Estimates table.

   Parameter Estimates
   Term        Estimate   Std Error   t Ratio   Prob>|t|
   Intercept   887.1      10.37       85.53     <.0001
   NONWHITE    4.488      0.7006      6.41      <.0001

   • It has P < 0.0001, so we include this variable in our model.
2. In JMP, save the residuals from this model.
3. Repeat Step 1, but using these residuals instead of the original Y values.

Forward Selection for the Data (cont.)

3. (cont.) Multivariate Correlations with the Residuals

                    Residuals MORT   PRECIP   EDUC      NONWHITE   Log(NOX)   Sqrt(SO2)
   Residuals MORT   1.0000           0.3182   -0.4921   0.0000     0.2220     0.4616

   • EDUC now has the largest R^2: R^2 = (-0.4921)^2 = 0.2422.
   • We can find its P-value from the Parameter Estimates table.

   Parameter Estimates (with Y = Residuals MORT)
   Term        Estimate   Std Error   t Ratio   Prob>|t|
   Intercept   304.08     70.84       4.29      <.0001
   EDUC        -27.71     6.44        -4.31     <.0001

   • It has P < 0.0001, so we include this variable in our model.
   • Additional Note: The correlation between Residuals MORT and NONWHITE is 0.0000. It is a property of linear regression (simple or multiple) that the correlation between the residuals and any explanatory variable in the model that produced them is 0.

Final Model

• When this process is continued, the FINAL MODEL has the 4 explanatory variables PRECIP, EDUC, NONWHITE and Sqrt(SO2).
• Log(NOX) is not part of that model.
• Note: If we consider a variable substantively important, we might want to include it in the model even if the model building process does not select it. Since we are interested in studying the effect of Log(NOX) on mortality, we might want to include it here for that reason.

Automatic Model Building in JMP

1. Click Analyze, Fit Model and add all variables under consideration to the Construct Model Effects box. Change the personality to Stepwise and click Run Model.
2. If there are variables which you would like to include in the model for substantive reasons, regardless of their significance, check Lock next to the variable.
3. Enter into the model the variable with the largest F ratio, if Prob>F is less than 0.05 for this variable (do this by clicking the Enter box).
4. Enter into the model the variable, among those not already entered, with the largest F ratio, if Prob>F is less than 0.05. The F ratio for a variable X_j that has not been included in the model is the F statistic for testing the reduced model (which includes only the variables already in the model) versus the full model (which includes X_j in addition to the variables already in the model).
5. Repeat Step 4 until no more variables can be entered into the model.

Here are the results of the model building process for the Pollution data.

Stepwise Fit
Response: MORT

Stepwise Regression Control
Prob to Enter   0.050
Prob to Leave   0.100

Current Estimates
SSE     DFE   MSE    RSquare   Cp     AIC
72504   55    1318   0.6824    4.79   435.8

Lock   Entered   Parameter   Estimate   nDF   SS      F Ratio   Prob>F
X      X         Intercept   956.7      1     0       0.000     1.0000
       X         PRECIP      1.725      1     10167   7.713     0.0075
       X         EDUC        -14.00     1     5404    4.100     0.0478
       X         NONWHITE    3.048      1     34372   26.074    0.0000
                 Log(NOX)    0          1     1051    0.795     0.3767
       X         Sqrt(SO2)   5.83       1     24858   18.857    0.0001

Step History
Step   Parameter   Action    "Sig Prob"   Seq SS   RSquare   Cp
1      NONWHITE    Entered   0.0000       94595    0.4144    45.03
2      EDUC        Entered   0.0000       33848    0.5627    21.45
3      Sqrt(SO2)   Entered   0.0012       17157    0.6378    10.48
4      PRECIP      Entered   0.0075       10167    0.6824    4.79

Click Make Model to fit the model with the chosen variables.

Parameter Estimates
Term        Estimate   Std Error   t Ratio   Prob>|t|
Intercept   956.7      92.8        10.31     <.0001
PRECIP      1.725      0.621       2.78      0.0075
EDUC        -13.97     6.900       -2.02     0.0478
NONWHITE    3.047      0.597       5.11      <.0001
Sqrt(SO2)   5.829      1.342       4.34      <.0001

Notes:
1. The final R^2 is R^2 = 0.6824.
2. The final P-value for EDUC is P = 0.0478, even though its P-value when first entered at Step 2 of this analysis was P = 0.0000. This is the same as the value P < 0.0001 that occurred in step 3 on p. 14.
3. The Stepwise Fit process produces exactly the same sequence of choices and of P-values as did the earlier process described on pp. 12-14. You should be able to explain WHY.

Tables from the Full 5-factor Model

Summary of Fit
RSquare        0.687
Observations   60

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model      5    156820           31364         23.70     <.0001
Error      54   71452            1323
C. Total   59   228273

Parameter Estimates
Term        Estimate   Std Error   t Ratio   Prob>|t|
Intercept   950.35     93.23       10.19     <.0001
PRECIP      2.01       0.698       2.88      0.0058
EDUC        -14.71     6.963       -2.11     0.0392
NONWHITE    2.825      0.6485      4.36      <.0001
Log(NOX)    6.708      7.526       0.89      0.3767
Sqrt(SO2)   4.389      2.102       2.09      0.0415

Effect Tests
Source      DF   Sum of Squares   F Ratio   Prob > F
PRECIP      1    10940            8.27      0.0058
EDUC        1    5907             4.46      0.0392
NONWHITE    1    25120            18.99     <.0001
Log(NOX)    1    1051             0.7955    0.3767
Sqrt(SO2)   1    5767             4.36      0.0415

Log(NOX) is not statistically significant (as was also claimed on p. 15, and shown in the table on p. 17).
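
The manual forward selection procedure described in these notes (pick the candidate with the highest R^2 against the current residuals, enter it if it is significant, refit, repeat) can be sketched in a few lines of Python. This is only an illustrative sketch, not JMP's implementation. The data are a small made-up toy set, not the pollution data; the names x1, x2, x3 and the helpers corr, residuals and forward_select are hypothetical; and the P < 0.05 rule is replaced by a rough "F ratio of at least 4" stand-in, an assumption that is only approximately equivalent when there are many error degrees of freedom.

```python
import math

def corr(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)

def solve(A, b):
    """Solve the linear system A z = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def residuals(y, columns):
    """Residuals from least squares of y on an intercept plus the given columns."""
    rows = [[1.0] + [c[i] for c in columns] for i in range(len(y))]
    k = len(rows[0])
    XtX = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    Xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(XtX, Xty)
    return [yi - sum(bj * xj for bj, xj in zip(beta, r)) for r, yi in zip(rows, y)]

def forward_select(y, candidates, f_to_enter=4.0):
    """Forward selection on residuals; f_to_enter is an assumed stand-in for P < 0.05."""
    n = len(y)
    chosen = []
    while len(chosen) < len(candidates):
        # Residuals of Y from the model built so far (just Y minus its mean at step 1).
        resid = residuals(y, [candidates[name] for name in chosen])
        r2 = {name: corr(resid, candidates[name]) ** 2
              for name in candidates if name not in chosen}
        best = max(r2, key=r2.get)
        # F ratio of the simple regression of the residuals on the best candidate.
        f_ratio = r2[best] / (1.0 - r2[best]) * (n - 2)
        if f_ratio < f_to_enter:
            break
        chosen.append(best)
    return chosen

# Toy data (made up for illustration): y depends on x1 and x2 but not on x3.
x1 = [-1, -1, -1, -1, 1, 1, 1, 1]
x2 = [-1, -1, 1, 1, -1, -1, 1, 1]
x3 = [-1, 1, -1, 1, -1, 1, -1, 1]
y = [100 + 5 * a + 3 * b + 0.5 * a * b * c for a, b, c in zip(x1, x2, x3)]
X = {"x1": x1, "x2": x2, "x3": x3}

print(forward_select(y, X))                      # ['x1', 'x2']
print(abs(corr(residuals(y, [x1]), x1)) < 1e-9)  # True
```

The second printed line checks the property stated in the Additional Note: residuals are uncorrelated with any explanatory variable already in the model that produced them, which is why a variable that has been entered can never look attractive again at a later step.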