Elementary Statistical Methods
Elementary Statistical Methods STAT 30100
These 15 pages of class notes were uploaded by Bailey Macejkovic on Saturday, September 19, 2015. The notes belong to STAT 30100 at Purdue University, taught by Christa Sorola in Fall. Since the upload, they have received 11 views. For similar materials see /class/207931/stat-30100-purdue-university in Statistics at Purdue University.
Date Created: 09/19/15
CHAPTER 11 Multiple Regression

With multiple linear regression, more than one explanatory variable is used to explain or predict a single response variable. Introducing several explanatory variables leads to additional considerations. We will not be able to address all these issues, but will outline some basic facts about multiple regression.

Equation of the Multiple Regression Model

We have data on several explanatory variables $x_1, x_2, x_3, \ldots, x_p$, where $p$ is the number of explanatory variables in the model, and a response variable $y$.

o The regression model for the population is $y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon_i$.
o The sample prediction equation is $\hat{y}_i = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$.
o The $i$th residual is $e_i = y_i - \hat{y}_i$ = observed response - predicted response.
o The estimate for the variability of the response about the regression equation is $s^2 = \dfrac{\sum e_i^2}{n - p - 1}$. The degrees of freedom associated with $s^2$ are $n - p - 1$.

As with all models, there are some assumptions that need to be met with multiple regression.

Multiple Regression Assumptions

1. LINEARITY: The regression equation must be of the right form to describe the true underlying relationship among the variables. To check for linearity, make a scatterplot of y against each predictor variable.
2. CONSTANT VARIANCE: The variability of the residuals must be the same for all values of the x variables. To check for constant variance, make scatterplots of the residuals against the predicted values.
3. INDEPENDENCE: The residual at one data value must be independent of the residuals at any other data values.
4. NORMALITY: The distribution of the residuals must be Normal for the t test on the coefficients to follow Student's t distribution exactly. To check the normality assumption, make a probability plot of the residuals.

Collinearity

A multiple regression has a collinearity problem when any of the predictors has a strong linear relationship with any of the other predictors. The standard error of the coefficient of any predictor that is collinear with the others is inflated, leading to a smaller t statistic and a correspondingly larger (less significant) P-value. One clue that collinearity might be a problem is a regression with a large overall R-square but with small t ratios for the coefficients.

Detecting collinearity: Regress one predictor on the others. If $R^2$ is high for any of these regressions, you know that predictor is collinear with the others.

THE WHOLE POINT

You may have several explanatory (x) variables; however, not all of them are important enough to make it into your model. Your goal is to find a model which has low bias and low variability. There is, however, a trade-off between bias and variability in your model. Adding variables to the model will decrease the bias if the variables are helpful in predicting the response. However, adding variables could also increase the variability of your model. Consequently, you only want to add variables that will be helpful.

How do we know which variables should be included in our model and which should not? There are two procedures that you can use.

Procedure 1: Start with a model that contains all your explanatory variables and remove them one at a time until you find that your bias starts to increase too rapidly.
Procedure 2: Start with a model that contains only one explanatory variable and add one variable at a time until you find that your bias is no longer decreasing.

How do you check the bias in your model?

Look at the overall $R^2$ (the squared multiple correlation) for the model. $R^2$ is the proportion of the variation of the response variable y that is explained by the explanatory variables $x_1, x_2, \ldots, x_p$ in a multiple linear regression. Basically, $R^2$ should be as high as possible, or at least not drop drastically when you remove a variable.

Any variables left in the equation ideally should have a significant P-value from the individual t tests of the coefficients. Furthermore, the confidence intervals for these coefficients should not contain 0.

Confidence interval for an individual $\beta_j$: $b_j \pm t^* \, SE_{b_j}$, where $t^*$ comes from the t distribution with $n - p - 1$ degrees of freedom. (SPSS gives you a 95% CI.)

Significance Test for $\beta_j$ Format
o State the null and alternative hypotheses: $H_0: \beta_j = 0$ against $H_a: \beta_j \neq 0$, $H_a: \beta_j < 0$, or $H_a: \beta_j > 0$.
o Find the test statistic on the printout or by using the formula $t = b_j / SE_{b_j}$.
o Find the P-value from the printout.
o Compare the P-value to the $\alpha$ level. If P-value $\le \alpha$, then reject $H_0$. If P-value $> \alpha$, then fail to reject $H_0$.
o State your conclusions in terms of the problem.

How do you check the variability in your model?

1. Look at the standard deviation s for the model. Recall that $s^2 = \dfrac{\sum e_i^2}{n - p - 1}$; s can be found in the regression output. The smaller s is, the better.
2. Look at the widths of the confidence intervals for the $\beta$'s.

Note: Individual regression coefficients, their standard errors, and significance tests are meaningful only when interpreted in the context of the other explanatory variables in the model.

Another test that is useful is the F test. It is an overall test that will tell you whether you want to proceed. If you fail to reject the null in the F test, then there is no evidence that any of the explanatory variables in your model helps explain the changes in the response, so there is no point continuing. This test is helpful if you start with the overall model.

Analysis of Variance F Test Format
o State the null and alternative hypotheses: $H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$ against $H_a$: not all $\beta_j$ are equal to 0.
o Find the test statistic on the printout.
o Find the P-value from the printout.
o Compare the P-value to the $\alpha$ level. If P-value $\le \alpha$, then reject $H_0$. If P-value $> \alpha$, then fail to reject $H_0$.
o State your conclusions in terms of the problem.

What are some of the other things that will be helpful when trying to find the overall best model?

1. Look at the variables individually. Find their means, standard deviations, minimums, maximums, and outliers. Look at a histogram or stemplot of the variables.
2. Look at the relationship between variables using the correlation and scatterplots.

Note: We want the explanatory variables to have a high correlation with the response variable, but do not want two explanatory variables to have a high correlation with each other (this could cause collinearity problems).

We will now look at an example in SPSS. Our goal today will be to find the best model to predict a STAT 301 student's test 2 score based on their scores for test 1, lab grades, homework grades, attendance, and whether or not they handed in the review for exam 2. The grades are taken from three of Joan Brenneman's STAT 301 classes last semester.

1. For each of the variables in the data set, find the mean, median, standard deviation, and IQR. Display each distribution with a histogram.

SPSS: To get the descriptive statistics:
> Analyze > Descriptive Statistics > Explore
Pull all variables into the Dependent List box. Click OK. Do the histograms individually for each variable.

[SPSS Descriptives output: for each of attendance, Labt, hwt, review, Test 1, and Test 2, the table reports the mean with its standard error and 95% confidence interval, the 5% trimmed mean, median, variance, standard deviation, minimum, maximum, range, interquartile range, skewness, and kurtosis. The numeric entries are garbled in this copy.]

2. Make a scatterplot for each pair of variables in the data set. Describe the relationships for each. Calculate the correlation for each pair of variables and report the P-value for the test of zero population correlation in each case.

SPSS: A quick way to get all the scatterplots in a matrix is as follows:
> Graphs > Scatterplots
Select Matrix and click Define. Pull all variables into the Matrix Variables box.

Below are the steps to get all the correlations:
> Analyze > Correlate > Bivariate
Move all variables into the Variables box and click OK.

[Figure: matrix of all possible scatterplots for attendance, Labt, hwt, review, Test 1, and Test 2.]

[SPSS Correlations output: Pearson correlation, two-tailed significance, and N = 90 for each pair of variables. The individual entries are garbled in this copy. Table footnote: correlation is significant at the 0.01 level (2-tailed).]

3. Perform a multiple regression using all the explanatory variables and answer the questions that follow the output.

SPSS: To get the multiple regression output below:
> Analyze > Regression > Linear
Move Test 2 to the Dependent box. Move the remaining variables to the Independent(s) box. Select Statistics and click on Confidence Intervals. Then click Continue followed by OK.

Model Summary
Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .808(a)  .653       .632                9.681
a. Predictors: (Constant), Test 1, review, hwt, attendance, Labt

ANOVA(b)
Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   14803.030        5    2960.606      31.591   .000(a)
Residual     7872.092         84   93.715
Total        22675.122        89
a. Predictors: (Constant), Test 1, review, hwt, attendance, Labt
b. Dependent Variable: Test 2

Coefficients(a)
(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient; the bounds are the 95% confidence interval for B.)
Model 1      B       Std. Error    Beta    t       Sig.   Lower Bound   Upper Bound
(Constant)   6.355   [illegible]           1.024   .309   [illegible]   [illegible]
attendance   -.287   .393          -.076   -.730   .468   -1.069        .495
Labt         1.908   .606          .352    3.151   .002   .704          3.112
hwt          .565    .381          .149    1.484   .142   -.192         1.322
review       4.623   2.639         .136    1.752   .083   -.625         9.872
Test 1       .393    .078          .419    5.028   .000   .237          .548
a. Dependent Variable: Test 2

a. State your hypotheses for an ANOVA F test, give the test statistic and its P-value, and state your conclusions.
b. Report the t statistics and P-values for the tests of the regression coefficients of your explanatory variables. Which regression coefficients give you non-significant results? What conclusions do you draw from these tests?
c. Give 95% confidence intervals for the regression coefficients of your explanatory variables. Do any of the intervals contain the point 0?
d. What is the value of s, the estimator for the standard deviation?
e. What is the percent of the variability in test 2 that is explained by this regression?

4. One variable looks like a good candidate to be dropped. Which is it? Try running the regression again without this variable.

Model Summary
Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .807(a)  .651       .634                9.654
a. Predictors: (Constant), review, Test 1, hwt, Labt

ANOVA(b)
Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   14753.108        4    3688.277      39.574   .000(a)
Residual     7922.014         85   93.200
Total        22675.122        89
a. Predictors: (Constant), review, Test 1, hwt, Labt
b. Dependent Variable: Test 2

Coefficients(a)
Model 1      B       Std. Error   Beta   t       Sig.   Lower Bound   Upper Bound
(Constant)   5.463   6.065               .901    .370   -6.596        17.522
Labt         1.745   .561         .322   3.110   .003   .629          2.860
Test 1       .400    .077         .428   5.193   .000   .247          .554
hwt          .465    .354         .123   1.312   .193   -.239         1.168
review       3.905   2.442        .115   1.599   .114   -.950         8.761
a. Dependent Variable: Test 2

a. Give the fitted regression equation.
b. Report the t statistics and P-values for the tests of the regression coefficients of your explanatory variables. Which regression coefficients give you non-significant results? What conclusions do you draw from these tests?
c. What is the value of s, the estimator for the standard deviation?
d. What is the percent of the variability in test 2 that is explained by this regression?

Now let's see what happens when we remove hwt.

Model Summary
Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .802(a)  .644       .631                9.694
a. Predictors: (Constant), Labt, review, Test 1

ANOVA(b)
Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   14592.579        3    4864.193      51.756   .000(a)
Residual     8082.543         86   93.983
Total        22675.122        89
a. Predictors: (Constant), Labt, review, Test 1
b. Dependent Variable: Test 2

Coefficients(a)
Model 1      B       Std. Error   Beta   t       Sig.   Lower Bound   Upper Bound
(Constant)   4.395   6.035               .728    .468   -7.603        16.393
Test 1       .413    .077         .441   5.374   .000   .260          .566
review       4.533   2.405        .133   1.885   .063   -.248         9.314
Labt         2.132   .479         .394   4.451   .000   1.180         3.084
a. Dependent Variable: Test 2

a. Give the fitted regression equation.
b. What is the value of s, the estimator of the standard deviation?
c. Has $R^2$ changed drastically with hwt removed?
d. Are any of the explanatory variables in the model still not significant at the 5% significance level? How about the 10% significance level?

5. Now let's look at the model with review removed.

Model Summary
Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .793(a)  .629       .620                9.836
a. Predictors: (Constant), Test 1, Labt

ANOVA(b)
Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   14258.661        2    7129.330      73.695   .000(a)
Residual     8416.461         87   96.741
Total        22675.122        89
a. Predictors: (Constant), Test 1, Labt
b. Dependent Variable: Test 2

Coefficients(a)
Model 1      B       Std. Error   Beta   t       Sig.   Lower Bound   Upper Bound
(Constant)   2.947   6.073               .485    .629   -9.125        15.018
Labt         2.479   .449         .458   5.526   .000   1.587         3.371
Test 1       .398    .078         .425   5.130   .000   .244          .552
a. Dependent Variable: Test 2

a. Has $R^2$ changed drastically with the review variable removed?
b. What is the value of s, the estimator of the standard deviation?
c. Which model do you think is best and why?
d. What are some additional things we can look at when deciding which model would be best?
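The summary numbers SPSS prints for each model can be rebuilt from the sums of squares and degrees of freedom in its ANOVA table. The sketch below is only an illustration of the formulas in these notes ($R^2$, adjusted $R^2$, $s$, and $F$), using the values from the four printouts; the function name `summarize` and the model labels are made up for this example.

```python
import math

# Sums of squares and degrees of freedom copied from the four ANOVA tables.
# Each entry: (label, regression SS, regression df, residual SS, residual df)
MODELS = [
    ("all five predictors", 14803.030, 5, 7872.092, 84),
    ("attendance removed", 14753.108, 4, 7922.014, 85),
    ("attendance and hwt removed", 14592.579, 3, 8082.543, 86),
    ("only Test 1 and Labt", 14258.661, 2, 8416.461, 87),
]


def summarize(ss_reg, df_reg, ss_res, df_res):
    """Return (R^2, adjusted R^2, s, F) for one ANOVA table."""
    ss_total = ss_reg + ss_res
    df_total = df_reg + df_res
    r2 = ss_reg / ss_total                      # proportion of variation explained
    adj_r2 = 1 - (ss_res / df_res) / (ss_total / df_total)
    s = math.sqrt(ss_res / df_res)              # s^2 = sum(e_i^2) / (n - p - 1)
    f = (ss_reg / df_reg) / (ss_res / df_res)   # mean square ratio
    return r2, adj_r2, s, f


for label, *anova in MODELS:
    r2, adj_r2, s, f = summarize(*anova)
    print(f"{label:28s} R2={r2:.3f} adjR2={adj_r2:.3f} s={s:.3f} F={f:.2f}")
```

Laying the four models side by side this way makes the trade-off in the questions easier to see: as attendance, hwt, and review are removed, $R^2$ falls from .653 to .629 while s moves from 9.681 to 9.836.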
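The individual t tests and confidence intervals can also be checked by hand from the B and Std. Error columns. This is a minimal sketch for the model with all five explanatory variables; the critical value `T_STAR` is an assumption (an approximate t table value for 84 degrees of freedom, not something printed by SPSS), and the helper names are hypothetical.

```python
# Estimates and standard errors copied from the first Coefficients table
# (model with all five explanatory variables; residual df = 84).
COEFFICIENTS = [
    ("attendance", -0.287, 0.393),
    ("Labt", 1.908, 0.606),
    ("hwt", 0.565, 0.381),
    ("review", 4.623, 2.639),
    ("Test 1", 0.393, 0.078),
]

# ASSUMPTION: approximate upper 2.5% critical value of t with 84 degrees of
# freedom, taken from a t table; it is not part of the SPSS output.
T_STAR = 1.989


def t_statistic(b, se):
    """Test statistic for H0: beta_j = 0, computed as t = b_j / SE_bj."""
    return b / se


def confidence_interval(b, se, t_star=T_STAR):
    """Interval b_j +/- t* SE_bj."""
    margin = t_star * se
    return b - margin, b + margin


for name, b, se in COEFFICIENTS:
    t = t_statistic(b, se)
    lower, upper = confidence_interval(b, se)
    contains_zero = lower <= 0 <= upper
    print(f"{name:10s} t={t:6.3f}  95% CI=({lower:.3f}, {upper:.3f})  contains 0: {contains_zero}")
```

Because the printout rounds B and Std. Error to three decimals, the results agree with the SPSS columns only to about two decimals. Under these assumptions, only the Labt and Test 1 intervals exclude 0, which matches the two Sig. values below .05.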
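The collinearity check described earlier (regress each predictor on the others and look at $R^2$) can be sketched without SPSS. The data below are synthetic and purely hypothetical, not the class grades: `x3` is built to be almost a multiple of `x1`, while `x2` is unrelated. The least-squares solver is a plain normal-equations implementation written for this illustration only.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def r_squared(y, predictors):
    """R^2 from the least-squares regression of y on the given predictor columns."""
    rows = [[1.0] + list(vals) for vals in zip(*predictors)]  # leading 1 = intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# HYPOTHETICAL data: x3 is nearly 2 * x1, so x1 and x3 are collinear; x2 is not.
rng = random.Random(0)
n = 90
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
x3 = [2 * a + rng.gauss(0, 0.1) for a in x1]
data = {"x1": x1, "x2": x2, "x3": x3}

aux_r2 = {}
for name, column in data.items():
    others = [col for other, col in data.items() if other != name]
    aux_r2[name] = r_squared(column, others)
    print(f"{name} regressed on the others: R2 = {aux_r2[name]:.3f}")
```

In this made-up data set the regressions for `x1` and `x3` should give an $R^2$ near 1, flagging them as collinear, while the regression for `x2` should give an $R^2$ near 0. With the class data, the same idea would be applied to attendance, Labt, hwt, review, and Test 1.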