REGRESSION ANALYSIS STA 4210
This 63-page set of class notes was uploaded by Golden Bernhard on Friday, September 18, 2015. The notes belong to STA 4210 (Regression Analysis) at the University of Florida, taught by Staff in Fall.
Chapter 1: Linear Regression With 1 Predictor

Statistical Model

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$  $i = 1, \ldots, n$

where:
- $Y_i$ is the random response for the $i$th case
- $\beta_0, \beta_1$ are parameters
- $X_i$ is a known constant, the value of the predictor variable for the $i$th case
- $\varepsilon_i$ is a random error term, such that $E(\varepsilon_i) = 0$, $\sigma^2(\varepsilon_i) = \sigma^2$, and $\sigma(\varepsilon_i, \varepsilon_j) = 0$ for all $i \neq j$

The last point states that the random errors are independent (uncorrelated), with mean 0 and variance $\sigma^2$. This also implies that:

$E(Y_i) = \beta_0 + \beta_1 X_i$  $\sigma^2(Y_i) = \sigma^2$  $\sigma(Y_i, Y_j) = 0$ for $i \neq j$

Thus, $\beta_0$ represents the mean response when $X = 0$ (assuming that is a reasonable level of $X$) and is referred to as the Y-intercept. Also, $\beta_1$ represents the change in the mean response as $X$ increases by 1 unit, and is called the slope.

Least Squares Estimation of Model Parameters

In practice, the parameters $\beta_0$ and $\beta_1$ are unknown and must be estimated. One widely used criterion is to minimize the error sum of squares:

$Q = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2$

This is done by calculus, by taking the partial derivatives of $Q$ with respect to $\beta_0$ and $\beta_1$ and setting each equation to 0. The values of $\beta_0$ and $\beta_1$ that set these equations to 0 are the least squares estimates, and are labelled $b_0$ and $b_1$. First, take the partial derivatives of $Q$ with respect to $\beta_0$ and $\beta_1$:

$\frac{\partial Q}{\partial \beta_0} = -2\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 X_i)$  (1)

$\frac{\partial Q}{\partial \beta_1} = -2\sum_{i=1}^{n} X_i(Y_i - \beta_0 - \beta_1 X_i)$  (2)

Next, set these two equations to 0, replacing $\beta_0$ and $\beta_1$ with $b_0$ and $b_1$, since these are the values that minimize the error sum of squares:

$\sum_{i=1}^{n} Y_i = n b_0 + b_1 \sum_{i=1}^{n} X_i$  (1a)

$\sum_{i=1}^{n} X_i Y_i = b_0 \sum_{i=1}^{n} X_i + b_1 \sum_{i=1}^{n} X_i^2$  (2a)

These two equations are referred to as the normal equations (although note that we have said nothing YET about normally distributed data). Solving these two equations yields:

$b_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \sum_{i=1}^{n} k_i Y_i$ where $k_i = \frac{X_i - \bar{X}}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$

$b_0 = \bar{Y} - b_1 \bar{X} = \sum_{i=1}^{n}\left[\frac{1}{n} - \bar{X} k_i\right] Y_i$

where the $k_i$ (and $1/n - \bar{X}k_i$) are constants, and the $Y_i$ are random variables with mean and variance given above. The fitted regression line, also known as the prediction equation, is:

$\hat{Y} = b_0 + b_1 X$

The fitted values for the individual observations are obtained by plugging the corresponding level of the predictor variable, $X_i$, into the fitted equation: $\hat{Y}_i = b_0 + b_1 X_i$.
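The closed-form solution of the normal equations can be checked numerically. The following is a minimal Python sketch, using the LSD concentration (X) and math score (Y) data from the worked example later in these notes:

```python
# Least squares estimates from the closed-form solution of the normal
# equations: b1 = SS_XY / SS_XX, b0 = Ybar - b1 * Xbar.
# Data: LSD tissue concentration (X) and math score (Y), n = 7.
X = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
Y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n

ss_xy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))
ss_xx = sum((x - xbar) ** 2 for x in X)
b1 = ss_xy / ss_xx       # slope
b0 = ybar - b1 * xbar    # intercept

print(round(b0, 2), round(b1, 2))  # 89.12 -9.01
```

This matches the fitted equation $\hat{Y} = 89.12 - 9.01X$ reported in the example.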
The residuals are the vertical distances between the observed values, $Y_i$, and their fitted values, $\hat{Y}_i$, and are denoted $e_i$:

$e_i = Y_i - \hat{Y}_i = Y_i - (b_0 + b_1 X_i)$

Properties of the residuals (these can be derived via their definitions and the normal equations):
- $\sum_{i=1}^{n} e_i = 0$: the residuals sum to 0
- $\sum_{i=1}^{n} X_i e_i = 0$: the sum of the residuals, weighted by $X_i$, is 0
- $\sum_{i=1}^{n} \hat{Y}_i e_i = 0$: the sum of the residuals, weighted by $\hat{Y}_i$, is 0
- The regression line goes through the point $(\bar{X}, \bar{Y})$

Estimation of the Error Variance

Note that, for a random variable, its variance is the expected value of the squared deviation from the mean. That is, for a random variable $W$ with mean $\mu_W$, its variance is $\sigma^2(W) = E[(W - \mu_W)^2]$. For the simple linear regression model, the errors have mean 0 and variance $\sigma^2$. This means that, for the actual observed values $Y_i$, their mean and variance are as follows:

$E(Y_i) = \beta_0 + \beta_1 X_i$  $\sigma^2(Y_i) = \sigma^2$

First, we replace the unknown mean $\beta_0 + \beta_1 X_i$ with its fitted value $\hat{Y}_i = b_0 + b_1 X_i$; then we take the "average" squared distance from the observed values to their fitted values. We divide the sum of squared errors by $n - 2$ to obtain an unbiased estimate of $\sigma^2$ (recall how you computed a sample variance when sampling from a single population). Common notation is to label the numerator as the error sum of squares (SSE):

$SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} e_i^2$

Also, the estimated variance is referred to as the error (or residual) mean square (MSE):

$s^2 = MSE = \frac{SSE}{n-2}$

To obtain an estimate of the standard deviation, which is in the units of the data, we take the square root of the error mean square: $s = \sqrt{MSE}$. A shortcut formula for the error sum of squares (which can cause problems due to round-off errors) is:

$SSE = \sum_{i=1}^{n} Y_i^2 - b_0\sum_{i=1}^{n} Y_i - b_1\sum_{i=1}^{n} X_i Y_i$

Some notation makes life easier when writing out elements of the regression model:

$SS_{XX} = \sum_{i=1}^{n}(X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}$

$SS_{XY} = \sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - \frac{\left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n}$

$SS_{YY} = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n}$

Note that we will be able to obtain most all of the simple linear regression analysis from these quantities, the sample means, and the sample size.
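The residual identities and the error-variance estimate can be verified numerically. A minimal sketch, again on the LSD example data from later in these notes:

```python
# Verifies sum(e_i) = 0 and sum(X_i * e_i) = 0 (consequences of the normal
# equations), and computes SSE and MSE = SSE / (n - 2). LSD example data.
X = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
Y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / \
     sum((x - xbar) ** 2 for x in X)
b0 = ybar - b1 * xbar

e = [y - (b0 + b1 * x) for x, y in zip(X, Y)]  # residuals e_i = Y_i - Yhat_i
sse = sum(ei ** 2 for ei in e)                 # error sum of squares
mse = sse / (n - 2)                            # unbiased estimate of sigma^2

print(abs(sum(e)) < 1e-9, round(mse, 2))       # True 50.78
```

The MSE value agrees with $s^2 = 50.78$ reported for this data set in the example below.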
Normal Error Regression Model

If we add further that the random errors follow a normal distribution, then the response variable also has a normal distribution, with mean and variance given above. The notation we will use for the errors and the data is:

$\varepsilon_i \sim N(0, \sigma^2)$  $Y_i \sim N(\beta_0 + \beta_1 X_i,\ \sigma^2)$

The density function for the $i$th observation is:

$f_i = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(Y_i - \beta_0 - \beta_1 X_i)^2}{2\sigma^2}\right]$

The likelihood function is the product of the individual density functions (due to the independence assumption on the random errors):

$L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n}\frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left[-\frac{(Y_i - \beta_0 - \beta_1 X_i)^2}{2\sigma^2}\right] = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 X_i)^2\right]$

The values of $\beta_0, \beta_1, \sigma^2$ that maximize the likelihood function are referred to as the maximum likelihood estimators, denoted $\hat{\beta}_0, \hat{\beta}_1, \hat{\sigma}^2$. Note that the natural logarithm of the likelihood is maximized by the same values of $\beta_0, \beta_1, \sigma^2$ that maximize the likelihood function, and it's easier to work with the log-likelihood function:

$\log L = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 X_i)^2$

Taking partial derivatives with respect to $\beta_0, \beta_1, \sigma^2$ yields:

$\frac{\partial \log L}{\partial \beta_0} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 X_i)$  (4)

$\frac{\partial \log L}{\partial \beta_1} = \frac{1}{\sigma^2}\sum_{i=1}^{n} X_i(Y_i - \beta_0 - \beta_1 X_i)$  (5)

$\frac{\partial \log L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 X_i)^2$  (6)

Setting these three equations to 0, and placing hats on the parameters (denoting the maximum likelihood estimators), we get equations (4a), (5a), and (6a). From equations (4a) and (5a), we see that the maximum likelihood estimators of $\beta_0$ and $\beta_1$ are the same as the least squares estimators (these are the normal equations). However, from equation (6a) we obtain the maximum likelihood estimator for the error variance as:

$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n}(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2}{n} = \frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n}$

This estimator is biased downward. We will use the unbiased estimator $s^2 = MSE$ throughout this course to estimate the error variance.
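The downward bias of the ML variance estimator is easy to see numerically: dividing the same SSE by $n$ instead of $n-2$ gives a visibly smaller value. A sketch on the LSD example data:

```python
# ML estimate SSE/n versus the unbiased MSE = SSE/(n-2): the ML estimator
# of the error variance is biased downward. LSD example data.
X = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
Y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / \
     sum((x - xbar) ** 2 for x in X)
b0 = ybar - b1 * xbar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
sigma2_mle = sse / n     # maximum likelihood estimator (biased downward)
mse = sse / (n - 2)      # unbiased estimator used throughout the course

print(round(sigma2_mle, 2), round(mse, 2))  # 36.27 50.78
```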
Example: LSD Concentration and Math Scores

A pharmacodynamic study was conducted at Yale in the 1960's to determine the relationship between LSD concentration and math scores in a group of volunteers. The independent (predictor) variable was the mean tissue concentration of LSD in a group of 5 volunteers, and the dependent (response) variable was the mean math score among the volunteers. There were $n = 7$ observations, collected at different time points throughout the experiment. (Source: Wagner, J.G., Agahajanian, G.K., and Bing, O.H. (1968), "Correlation of Performance Test Scores with Tissue Concentration of Lysergic Acid Diethylamide in Human Subjects," Clinical Pharmacology and Therapeutics, 9:635-638.)

The data are as follows:

Time (i)   Score (Y)   Conc (X)
   1         78.93       1.17
   2         58.20       2.97
   3         67.47       3.26
   4         37.47       4.69
   5         45.65       5.83
   6         32.92       6.00
   7         29.97       6.41

The fitted equation is $\hat{Y} = 89.12 - 9.01X$, and the estimated error variance is $s^2 = MSE = 50.78$, with corresponding standard deviation $s = 7.13$. A plot of the data and the fitted equation (Math Score vs LSD Concentration) can be obtained from EXCEL.

Output from various software packages is summarized below; rules for the standard errors and tests are given in the next chapter. We will mainly use SAS, EXCEL, and SPSS throughout the semester. EXCEL (built-in Data Analysis package), SAS (PROC REG), SPSS, STATVIEW, S-Plus (also available in R), and STATA all report the same regression coefficients portion:

Variable     Estimate   Std Error   t Stat    P-value    95% CI
Intercept     89.124      7.048      12.65    <0.0001    (71.008, 107.240)
Conc (X)      -9.009      1.503      -5.99     0.0019    (-12.873, -5.146)

The residual standard error is 7.126 on 5 degrees of freedom.
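The fitted values and residuals reported in the software output can be reproduced directly from the fitted equation. A minimal sketch:

```python
# Reproduces the fitted values Yhat_i = b0 + b1 * X_i and the residuals
# e_i = Y_i - Yhat_i for the LSD example.
X = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
Y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / \
     sum((x - xbar) ** 2 for x in X)
b0 = ybar - b1 * xbar

yhat = [b0 + b1 * x for x in X]
e = [y - yh for y, yh in zip(Y, yhat)]
for yh, ei in zip(yhat, e):
    print(round(yh, 4), round(ei, 4))
# first row: 78.5828 0.3472 ; last row: 31.3732 -1.4032
```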
SAS's output statistics also list the fitted values and residuals for the individual observations:

Obs   Score (Y)   Predicted   Residual
 1      78.93      78.5828      0.3472
 2      58.20      62.3658     -4.1658
 3      67.47      59.7530      7.7170
 4      37.47      46.8695     -9.3995
 5      45.65      36.5987      9.0513
 6      32.92      35.0671     -2.1471
 7      29.97      31.3732     -1.4032

Each package also produces a graphics-quality plot of the data with the fitted regression line included (e.g., SPSS labels its plot with Score Y $= 89.12 - 9.01\cdot$Conc and $R^2 = 0.878$).

Chapter 2: Inferences in Regression Analysis

Rules Concerning Linear Functions of Random Variables (pp. 13-18)

Let $Y_1, \ldots, Y_n$ be $n$ random variables. Consider the function $\sum_{i=1}^{n} a_i Y_i$, where the coefficients $a_1, \ldots, a_n$ are constants. Then we have:

$E\left(\sum_{i=1}^{n} a_i Y_i\right) = \sum_{i=1}^{n} a_i E(Y_i)$
$\sigma^2\left(\sum_{i=1}^{n} a_i Y_i\right) = \sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j \sigma(Y_i, Y_j)$

When $Y_1, \ldots, Y_n$ are independent (as in the model in Chapter 1), the variance of the linear combination simplifies to:

$\sigma^2\left(\sum_{i=1}^{n} a_i Y_i\right) = \sum_{i=1}^{n} a_i^2 \sigma^2(Y_i)$

When $Y_1, \ldots, Y_n$ are independent, the covariance of two linear functions, $\sum_{i=1}^{n} a_i Y_i$ and $\sum_{i=1}^{n} c_i Y_i$, can be written as:

$\sigma\left(\sum_{i=1}^{n} a_i Y_i,\ \sum_{i=1}^{n} c_i Y_i\right) = \sum_{i=1}^{n} a_i c_i \sigma^2(Y_i)$

We will use these rules to obtain the distributions of the estimators $b_0$, $b_1$, and $\hat{Y} = b_0 + b_1 X$.

Inferences Concerning $\beta_1$

Recall that the least squares estimate of the slope parameter, $b_1$, is a linear function of the observed responses $Y_1, \ldots, Y_n$:

$b_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \sum_{i=1}^{n} k_i Y_i$  $k_i = \frac{X_i - \bar{X}}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$

Note that $E(Y_i) = \beta_0 + \beta_1 X_i$, so that the expected value of $b_1$ is:

$E(b_1) = \sum_{i=1}^{n} k_i E(Y_i) = \frac{1}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\left[\beta_0\sum_{i=1}^{n}(X_i - \bar{X}) + \beta_1\sum_{i=1}^{n}(X_i - \bar{X})X_i\right]$

Note that $\sum_{i=1}^{n}(X_i - \bar{X}) = 0$ (why?), so that the first term in the brackets is 0, and that we can subtract $\beta_1\bar{X}\sum_{i=1}^{n}(X_i - \bar{X}) = 0$ from the last term to get:

$E(b_1) = \beta_1\frac{\sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \beta_1$

Thus, $b_1$ is an unbiased estimator of the parameter $\beta_1$. To obtain the variance of $b_1$, recall that $\sigma^2(Y_i) = \sigma^2$. Thus:

$\sigma^2(b_1) = \sum_{i=1}^{n} k_i^2\sigma^2(Y_i) = \sigma^2\sum_{i=1}^{n}\left[\frac{X_i - \bar{X}}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right]^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$

Note that the variance of $b_1$ decreases when we have larger sample sizes (as long as the added $X$ levels are not placed at the sample mean $\bar{X}$). Since $\sigma^2$ is unknown in practice and must be estimated from the data, we obtain the estimated variance of the estimator $b_1$ by replacing the unknown $\sigma^2$ with its unbiased estimate $s^2 = MSE$:

$s^2(b_1) = \frac{MSE}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$  with estimated standard error  $s(b_1) = \sqrt{\frac{MSE}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}$

Further, the sampling distribution of $b_1$ is normal, that is:

$b_1 \sim N\left(\beta_1,\ \frac{\sigma^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)$

since, under the current model, $b_1$ is a linear function of the independent, normal random variables $Y_1, \ldots, Y_n$. Making use of theory from mathematical statistics, we obtain the following result that allows us to make inferences concerning $\beta_1$:

$\frac{b_1 - \beta_1}{s(b_1)} \sim t(n-2)$

where $t(n-2)$ represents Student's t-distribution with $n-2$ degrees of freedom.
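These quantities can be computed directly for the LSD example. A minimal sketch of $s(b_1)$, the t statistic for $H_0: \beta_1 = 0$, and (anticipating the confidence interval derived next) 95% limits; the value $t(0.975; 5) = 2.571$ is taken from the t table:

```python
# s(b1) = sqrt(MSE / SS_XX), t* = b1 / s(b1), and a 95% CI for beta1,
# using the LSD example data (n = 7, so n - 2 = 5 df).
import math

X = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
Y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
ss_xx = sum((x - xbar) ** 2 for x in X)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / ss_xx
b0 = ybar - b1 * xbar
mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y)) / (n - 2)

s_b1 = math.sqrt(mse / ss_xx)      # estimated standard error of b1
t = b1 / s_b1                      # test statistic for H0: beta1 = 0
t975 = 2.571                       # t(0.975; 5), from the t table
lo, hi = b1 - t975 * s_b1, b1 + t975 * s_b1

print(round(s_b1, 3), round(t, 2))    # 1.503 -5.99
print(round(lo, 2), round(hi, 2))     # -12.87 -5.15
```

These match the standard error (1.503), t statistic (-5.99), and 95% interval reported by the software packages in the example.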
Confidence Interval for $\beta_1$

As a result of the fact that $\frac{b_1 - \beta_1}{s(b_1)} \sim t(n-2)$, we obtain the following probability statement:

$P\left(t(\alpha/2; n-2) \leq \frac{b_1 - \beta_1}{s(b_1)} \leq t(1-\alpha/2; n-2)\right) = 1 - \alpha$

where $t(\alpha/2; n-2)$ is the $(\alpha/2)100$th percentile of the t-distribution with $n-2$ degrees of freedom. Note that, since the t-distribution is symmetric around 0, we have that $t(\alpha/2; n-2) = -t(1-\alpha/2; n-2)$. Traditionally, we obtain the table value corresponding to $t(1-\alpha/2; n-2)$, which is the value that leaves an upper tail area of $\alpha/2$. The following algebra results in obtaining a $(1-\alpha)100\%$ confidence interval for $\beta_1$:

$1-\alpha = P\left(-t(1-\alpha/2; n-2) \leq \frac{b_1 - \beta_1}{s(b_1)} \leq t(1-\alpha/2; n-2)\right)$

$= P\left(-t(1-\alpha/2; n-2)\,s(b_1) \leq b_1 - \beta_1 \leq t(1-\alpha/2; n-2)\,s(b_1)\right)$

$= P\left(b_1 - t(1-\alpha/2; n-2)\,s(b_1) \leq \beta_1 \leq b_1 + t(1-\alpha/2; n-2)\,s(b_1)\right)$

This leads to the following rule for a $(1-\alpha)100\%$ confidence interval for $\beta_1$:

$b_1 \pm t(1-\alpha/2; n-2)\,s(b_1)$

Some statistical software packages print this out automatically (e.g. EXCEL and SPSS); other packages simply print out estimates and standard errors only (e.g. SAS).

Tests Concerning $\beta_1$

As with means and proportions (and differences of means and proportions), we can conduct one-sided and two-sided tests, depending on whether, a priori, a specific directional belief is held regarding the slope. More often than not (but not necessarily), the null value for $\beta_1$ is 0 (the mean of $Y$ is independent of $X$), and the alternative is that $\beta_1$ is positive (1-sided), negative (1-sided), or different from 0 (2-sided). The alternative hypothesis must be selected before observing the data. We make use of the fact that, under the null hypothesis, $\frac{b_1 - \beta_{10}}{s(b_1)} \sim t(n-2)$.

2-sided tests:
- Null hypothesis: $H_0: \beta_1 = \beta_{10}$
- Alternative (research) hypothesis: $H_A: \beta_1 \neq \beta_{10}$
- Test statistic: $t^* = \frac{b_1 - \beta_{10}}{s(b_1)}$
- Decision rule: conclude $H_A$ if $|t^*| \geq t(1-\alpha/2; n-2)$; otherwise conclude $H_0$
- P-value: $2P(t(n-2) > |t^*|)$

All statistical software packages (to my knowledge) will print out the test statistic and P-value corresponding to a 2-sided test with $\beta_{10} = 0$.

1-sided tests (upper tail):
- Null hypothesis: $H_0: \beta_1 = \beta_{10}$
- Alternative (research) hypothesis: $H_A: \beta_1 > \beta_{10}$
- Test statistic: $t^* = \frac{b_1 - \beta_{10}}{s(b_1)}$
- Decision rule: conclude $H_A$ if $t^* \geq t(1-\alpha; n-2)$; otherwise conclude $H_0$
- P-value: $P(t(n-2) > t^*)$

A test for positive association between $Y$ and $X$ ($H_A: \beta_1 > 0$) can be obtained from standard statistical software by first checking that $b_1$ (and thus $t^*$) is positive, and cutting the printed P-value in half.
1-sided tests (lower tail):
- Null hypothesis: $H_0: \beta_1 = \beta_{10}$
- Alternative (research) hypothesis: $H_A: \beta_1 < \beta_{10}$
- Test statistic: $t^* = \frac{b_1 - \beta_{10}}{s(b_1)}$
- Decision rule: conclude $H_A$ if $t^* \leq -t(1-\alpha; n-2)$; otherwise conclude $H_0$
- P-value: $P(t(n-2) < t^*)$

A test for negative association between $Y$ and $X$ ($H_A: \beta_1 < 0$) can be obtained from standard statistical software by first checking that $b_1$ (and thus $t^*$) is negative, and cutting the printed P-value in half.

Inferences Concerning $\beta_0$

Recall that the least squares estimate of the intercept parameter, $b_0 = \bar{Y} - b_1\bar{X}$, is a linear function of the observed responses $Y_1, \ldots, Y_n$. Recalling that $E(Y_i) = \beta_0 + \beta_1 X_i$:

$E(b_0) = E(\bar{Y}) - \bar{X}E(b_1) = \frac{1}{n}\sum_{i=1}^{n}(\beta_0 + \beta_1 X_i) - \bar{X}\beta_1 = \beta_0 + \beta_1\bar{X} - \beta_1\bar{X} = \beta_0$

Thus, $b_0$ is an unbiased estimator of the parameter $\beta_0$. The variance of $b_0$ follows from the rules for linear functions of the $Y_i$:

$\sigma^2(b_0) = \sigma^2\left[\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right]$

Note that the variance will decrease as the sample size increases, as long as the $X$ values are not all placed at the mean. Further, the sampling distribution is normal under the assumptions of the model. The estimated standard error of $b_0$ replaces $\sigma^2$ with its unbiased estimate $s^2 = MSE$, and takes the square root of the variance:

$s(b_0) = \sqrt{MSE\left[\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right]}$

Note that $\frac{b_0 - \beta_0}{s(b_0)} \sim t(n-2)$, allowing for inferences concerning the intercept parameter when it is meaningful, namely when $X = 0$ is within the range of observed data.

Confidence Interval for $\beta_0$: $b_0 \pm t(1-\alpha/2; n-2)\,s(b_0)$

It is also useful to obtain the covariance of $b_0$ and $b_1$, as they are only independent under very rare circumstances. Applying the covariance rule for linear functions to $b_1 = \sum k_i Y_i$ and $b_0 = \sum(1/n - \bar{X}k_i)Y_i$ yields:

$\sigma(b_0, b_1) = \frac{-\bar{X}\sigma^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$

In practice, $\bar{X}$ is usually positive, so that the intercept and slope estimators are usually negatively correlated. We will use this result shortly.
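A minimal sketch of the intercept standard error and the (estimated) covariance of the two estimators, on the LSD example data; note that $\bar{X} > 0$ here, so the estimated covariance comes out negative:

```python
# s(b0) = sqrt(MSE * (1/n + Xbar^2 / SS_XX)) and the estimated covariance
# of (b0, b1) = -Xbar * MSE / SS_XX, for the LSD example data.
import math

X = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
Y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
ss_xx = sum((x - xbar) ** 2 for x in X)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / ss_xx
b0 = ybar - b1 * xbar
mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y)) / (n - 2)

s_b0 = math.sqrt(mse * (1 / n + xbar ** 2 / ss_xx))
cov_b0_b1 = -xbar * mse / ss_xx    # negative here since Xbar > 0

print(round(s_b0, 3), cov_b0_b1 < 0)   # 7.048 True
```

The standard error matches the intercept's reported value (7.048) in the example output.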
Considerations on Making Inferences Concerning $\beta_0$ and $\beta_1$

Normality of Error Terms: If the data are approximately normal, simulation results have shown that using the t-distribution will provide approximately correct significance levels and confidence coefficients for tests and confidence intervals, respectively. Even if the distribution of the errors (and thus $Y$) is far from normal, in large samples the estimators $b_0$ and $b_1$ have sampling distributions that are approximately normal, as a result of central limit theorems. This is sometimes referred to as asymptotic normality.

Interpretations of Confidence Coefficients and Error Probabilities: Since the $X$ levels are treated as fixed constants, these refer to the case where we repeat the experiment many times, at the current set of $X$ levels in this data set. In this sense, it's easier to interpret these terms in controlled experiments, where the experimenter has set the levels of $X$ (such as time and temperature in a laboratory-type setting), as opposed to observational studies, where nature determines the $X$ levels and we may not be able to reproduce the same conditions repeatedly. This will be covered later.

Spacing of X Levels: The variances of $b_0$ and $b_1$ (for given $n$ and $\sigma^2$) decrease as the $X$ levels become more spread out, since their variances are inversely related to $\sum_{i=1}^{n}(X_i - \bar{X})^2$. However, there are reasons to choose a diverse range of $X$ levels for assessing model fit. This is covered in Chapter 4.

Power of Tests: The power of a statistical test refers to the probability that we reject the null hypothesis. Note that, when the null hypothesis is true, the power is simply the probability of a Type I error ($\alpha$). When the null hypothesis is false, the power is the probability that we correctly reject the null hypothesis, which is $1 - \beta$, where $\beta$ denotes the probability of a Type II error (failing to reject the null hypothesis when the alternative
hypothesis is true). The following procedure can be used to obtain the power of the test concerning the slope parameter, with a 2-sided alternative:

1. Write out the null and alternative hypotheses: $H_0: \beta_1 = \beta_{10}$, $H_A: \beta_1 \neq \beta_{10}$.
2. Obtain the noncentrality measure, the standardized distance between the true value of $\beta_1$ and the value under the null hypothesis: $\delta = \frac{|\beta_1 - \beta_{10}|}{\sigma(b_1)}$.
3. Choose the probability of a Type I error: $\alpha = 0.05$ or $\alpha = 0.01$.
4. Determine the degrees of freedom for error: $df = n - 2$.
5. Refer to Table B.5 (pages 1346-7), identifying the page ($\alpha$), row ($\delta$), and error degrees of freedom (column). The table provides the power of the test under these parameter values.

Note that the power increases, within each table, as the noncentrality measure increases for a given degrees of freedom, and as the degrees of freedom increase for a given noncentrality measure.

Confidence Interval for $E(Y_h) = \beta_0 + \beta_1 X_h$

When we wish to estimate the mean at a hypothetical $X$ value (within the range of observed $X$ values), we can use the fitted equation at that value of $X = X_h$ as a point estimate, but we have to include the uncertainty in the regression estimators to construct a confidence interval for the mean.

Parameter: $E(Y_h) = \beta_0 + \beta_1 X_h$  Estimator: $\hat{Y}_h = b_0 + b_1 X_h$

We can obtain the variance of the estimator, as a function of $X_h$, as follows:

$\sigma^2(\hat{Y}_h) = \sigma^2(b_0 + b_1 X_h) = \sigma^2(b_0) + X_h^2\sigma^2(b_1) + 2X_h\sigma(b_0, b_1) = \sigma^2\left[\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right]$

The estimated standard error of the estimator is:

$s(\hat{Y}_h) = \sqrt{MSE\left[\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right]}$

Further, $\frac{\hat{Y}_h - E(Y_h)}{s(\hat{Y}_h)} \sim t(n-2)$, which can be used to construct confidence intervals for the mean response at specific $X$ levels (tests concerning the mean response are rarely conducted).

$(1-\alpha)100\%$ Confidence Interval for $E(Y_h)$: $\hat{Y}_h \pm t(1-\alpha/2; n-2)\,s(\hat{Y}_h)$

Predicting a Future Observation When X is Known

If $\beta_0$, $\beta_1$, and $\sigma$ were known, we'd know that the distribution of responses when $X = X_h$ is normal, with mean $\beta_0 + \beta_1 X_h$ and standard deviation $\sigma$. Thus, making use of the normal distribution (equivalently, the empirical rule), we know that if we took a sample item from this distribution, it is very likely that the value would fall within 2 standard deviations of the mean. That is, we
would know that the probability that the sampled item lies within the range $\beta_0 + \beta_1 X_h \pm 2\sigma$ is approximately 0.95. In practice, we don't know the mean $\beta_0 + \beta_1 X_h$ or the standard deviation $\sigma$. However, we just constructed a $(1-\alpha)100\%$ confidence interval for $E(Y_h)$, and we have an estimate of $\sigma$, namely $s$. Intuitively, we can approximately use the logic of the previous paragraph, with the estimate of $\sigma$, across the range of believable values for the mean. Then our prediction interval spans from the lower tail of the normal curve centered at the lower bound for the mean, to the upper tail of the normal curve centered at the upper bound for the mean. See Figure 2.5 on page 64 of the text book.

The prediction error for the new observation is the difference between the observed value and its predicted value: $Y_{h(new)} - \hat{Y}_h$. Since the data are assumed to be independent, the new (future) value is independent of its predicted value, since it wasn't used in the regression analysis. The variance of the prediction error can be obtained as follows:

$\sigma^2(pred) = \sigma^2(Y_{h(new)} - \hat{Y}_h) = \sigma^2(Y_{h(new)}) + \sigma^2(\hat{Y}_h) = \sigma^2 + \sigma^2\left[\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right] = \sigma^2\left[1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right]$

and an unbiased estimator is:

$s^2(pred) = MSE\left[1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right]$

$(1-\alpha)100\%$ Prediction Interval for a New Observation When $X = X_h$:

$\hat{Y}_h \pm t(1-\alpha/2; n-2)\sqrt{MSE\left[1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right]}$

It is a simple extension to obtain a prediction interval for the mean of $m$ new observations when $X = X_h$. The sample mean of $m$ observations has variance $\sigma^2/m$, and we get the following variance for the error in predicting the mean:

$s^2(predmean) = MSE\left[\frac{1}{m} + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right]$

and the obvious adjustment to the prediction interval for a single observation:

$(1-\alpha)100\%$ Prediction Interval for the Mean of $m$ New Observations When $X = X_h$:

$\hat{Y}_h \pm t(1-\alpha/2; n-2)\sqrt{MSE\left[\frac{1}{m} + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right]}$

Confidence Band for the Entire Regression Line (Working-Hotelling Method):

$\hat{Y}_h \pm W\,s(\hat{Y}_h)$  where  $W^2 = 2F(1-\alpha; 2, n-2)$
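A minimal sketch of the confidence interval for the mean response and the (wider) prediction interval for a single new observation, on the LSD example data. The level $X_h = 4.0$ is a hypothetical concentration chosen here for illustration (it lies within the observed $X$ range), and $t(0.975; 5) = 2.571$ is taken from the t table:

```python
# CI for E(Y_h) versus prediction interval for a new observation at X_h = 4.0
# (hypothetical illustrative level). LSD example data, n - 2 = 5 df.
import math

X = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
Y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
ss_xx = sum((x - xbar) ** 2 for x in X)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / ss_xx
b0 = ybar - b1 * xbar
mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y)) / (n - 2)

xh = 4.0                                   # hypothetical X level
yh = b0 + b1 * xh                          # point estimate of E(Y_h)
se_mean = math.sqrt(mse * (1 / n + (xh - xbar) ** 2 / ss_xx))
se_pred = math.sqrt(mse * (1 + 1 / n + (xh - xbar) ** 2 / ss_xx))
t975 = 2.571                               # t(0.975; 5), from the t table
ci = (yh - t975 * se_mean, yh + t975 * se_mean)
pi = (yh - t975 * se_pred, yh + t975 * se_pred)

print(round(yh, 2))                        # 53.09
```

The extra "1 +" term in $s(pred)$ makes the prediction interval strictly wider than the confidence interval for the mean.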
Analysis of Variance Approach to Regression

Consider the total deviations of the observed responses from the mean: $Y_i - \bar{Y}$. When these terms are all squared and summed up, this is referred to as the total sum of squares (SSTO):

$SSTO = \sum_{i=1}^{n}(Y_i - \bar{Y})^2$

The more spread out the observed data are, the larger SSTO will be. Now consider the deviations of the observed responses from their fitted values based on the regression model: $Y_i - \hat{Y}_i = Y_i - (b_0 + b_1 X_i) = e_i$. When these terms are squared and summed up, this is referred to as the error sum of squares (SSE); we've already encountered this quantity, and used it to estimate the error variance:

$SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$

When the observed responses fall close to the regression line, SSE will be small; when the data are not near the line, SSE will be large. Finally, there is a third quantity, representing the deviations of the fitted values from the mean: $\hat{Y}_i - \bar{Y}$. When these deviations are squared and summed up, this is referred to as the regression sum of squares (SSR):

$SSR = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2$

The error and regression sums of squares sum to the total sum of squares: $SSTO = SSR + SSE$, which can be seen as follows:

$\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n}(Y_i - \hat{Y}_i + \hat{Y}_i - \bar{Y})^2 = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 + \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 + 2\sum_{i=1}^{n}(Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y})$

The cross-product term is 0, since (by the normal equations) $\sum e_i\hat{Y}_i = 0$ and $\sum e_i = 0$:

$\sum_{i=1}^{n}(Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}) = \sum_{i=1}^{n} e_i\hat{Y}_i - \bar{Y}\sum_{i=1}^{n} e_i = 0 - 0 = 0$

Each sum of squares has degrees of freedom associated with it. The total degrees of freedom is $df_T = n-1$. The error degrees of freedom is $df_E = n-2$. The regression degrees of freedom is $df_R = 1$. Note that the error and regression degrees of freedom sum to the total degrees of freedom: $(n-1) = 1 + (n-2)$. Mean squares are the sums of squares divided by their degrees of freedom:

$MSR = \frac{SSR}{1}$  $MSE = \frac{SSE}{n-2}$

Note that MSE was our estimate of the error variance, and that we don't compute a total mean square. It can be shown that the expected values of the mean squares are:

$E(MSE) = \sigma^2$  $E(MSR) = \sigma^2 + \beta_1^2\sum_{i=1}^{n}(X_i - \bar{X})^2$

Note that these expected mean squares are the same if and only if $\beta_1 = 0$. The Analysis of Variance is reported in tabular form:

Source       df    SS      MS                F
Regression    1    SSR     MSR = SSR/1       F = MSR/MSE
Error        n-2   SSE     MSE = SSE/(n-2)
C Total      n-1   SSTO
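The decomposition $SSTO = SSR + SSE$ and the F statistic can be verified numerically on the LSD example data; the sketch also checks the identity $F^* = (t^*)^2$ discussed next:

```python
# ANOVA decomposition SSTO = SSR + SSE and F* = MSR/MSE = (t*)^2,
# on the LSD example data.
import math

X = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
Y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
ss_xx = sum((x - xbar) ** 2 for x in X)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / ss_xx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * x for x in X]

ssto = sum((y - ybar) ** 2 for y in Y)
sse = sum((y - yh) ** 2 for y, yh in zip(Y, yhat))
ssr = sum((yh - ybar) ** 2 for yh in yhat)
msr, mse = ssr / 1, sse / (n - 2)
f = msr / mse                              # F statistic, 1 and n-2 df
t = b1 / math.sqrt(mse / ss_xx)            # t statistic for H0: beta1 = 0

print(abs(ssto - (ssr + sse)) < 1e-6, round(f, 2))  # True 35.93
```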
F-Test of $H_0: \beta_1 = 0$ versus $H_A: \beta_1 \neq 0$

As a result of Cochran's Theorem (stated on page 76 of the text book), we have a test of whether the dependent variable $Y$ is linearly related to the predictor variable $X$. This is a very specific case of the t-test described previously; its full utility will be seen when we consider multiple predictors. The test proceeds as follows:

- Null hypothesis: $H_0: \beta_1 = 0$
- Alternative (research) hypothesis: $H_A: \beta_1 \neq 0$
- Test statistic: $F^* = \frac{MSR}{MSE}$
- Rejection region: $F^* \geq F(1-\alpha; 1, n-2)$
- P-value: $P(F(1, n-2) \geq F^*)$

Critical values of the F-distribution, indexed by numerator and denominator degrees of freedom, are given in Table B.4, pages 1340-1345. Note that this is a very specific version of the t-test regarding the slope parameter (specifically, a 2-sided test of whether the slope is 0); mathematically, the tests are identical. Writing $\hat{Y}_i - \bar{Y} = b_1(X_i - \bar{X})$ gives $SSR = b_1^2\sum_{i=1}^{n}(X_i - \bar{X})^2$, so that:

$F^* = \frac{MSR}{MSE} = \frac{b_1^2\sum_{i=1}^{n}(X_i - \bar{X})^2}{MSE} = \left[\frac{b_1}{\sqrt{MSE/\sum_{i=1}^{n}(X_i - \bar{X})^2}}\right]^2 = \left[\frac{b_1}{s(b_1)}\right]^2 = (t^*)^2$

Further, the critical values are equivalent: $[t(1-\alpha/2; n-2)]^2 = F(1-\alpha; 1, n-2)$ (check this from the two tables). Thus, the tests are equivalent.

General Linear Test Approach

This is a very general method of testing hypotheses concerning regression models. We first consider the simple linear regression model, and testing whether $Y$ is linearly associated with $X$. We wish to test $H_0: \beta_1 = 0$ vs $H_A: \beta_1 \neq 0$.

Full Model: This is the model specified under the alternative hypothesis, also referred to as the unrestricted model. Under simple linear regression with normal errors, we have $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$. Using least squares (and maximum likelihood) to estimate the model parameters, with $\hat{Y}_i = b_0 + b_1 X_i$, we obtain the error sum of squares for the full model:

$SSE(F) = \sum_{i=1}^{n}\left(Y_i - (b_0 + b_1 X_i)\right)^2 = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = SSE$

Reduced Model: This is the model specified by the null hypothesis, also referred to as the restricted model. Under simple linear regression with normal errors, we have $Y_i = \beta_0 + 0\cdot X_i + \varepsilon_i = \beta_0 + \varepsilon_i$. Using least squares and
maximum likelihood to estimate the model parameter, we obtain $\bar{Y}$ as the estimate of $\beta_0$, and have $b_0 = \bar{Y}$ as the fitted value for each observation. We then get the following error sum of squares under the reduced model:

$SSE(R) = \sum_{i=1}^{n}(Y_i - b_0)^2 = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 = SSTO$

Test Statistic: The error sum of squares for the full model will always be less than or equal to the error sum of squares for the reduced model, by definition of least squares. The test statistic is:

$F^* = \frac{\left[\dfrac{SSE(R) - SSE(F)}{df_R - df_F}\right]}{\left[\dfrac{SSE(F)}{df_F}\right]}$

where $df_R$ and $df_F$ are the error degrees of freedom for the reduced and full models. We will use this method throughout the course. For the simple linear regression model, we have the following quantities:

$SSE(F) = SSE$, $df_F = n-2$;  $SSE(R) = SSTO$, $df_R = n-1$

Thus, the F statistic for the general linear test can be written:

$F^* = \frac{\left[\dfrac{SSTO - SSE}{(n-1)-(n-2)}\right]}{\left[\dfrac{SSE}{n-2}\right]} = \frac{SSR/1}{MSE} = \frac{MSR}{MSE}$

Thus, for this particular null hypothesis, the general linear test reduces to the F-test.

Descriptive Measures of Association

Along with the slope, Y-intercept, and error variance, several other measures are often reported.

Coefficient of Determination ($r^2$): The coefficient of determination measures the proportion of the variation in $Y$ that is explained by the regression on $X$. It is computed as the regression sum of squares divided by the total (corrected) sum of squares. Values near 0 imply that the regression model has done little to explain variation in $Y$, while values near 1 imply that the model has explained a large portion of the variation in $Y$. If all the data fall exactly on the fitted line, $r^2 = 1$. The coefficient of determination will lie between 0 and 1:

$r^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}$  $0 \leq r^2 \leq 1$

Coefficient of Correlation ($r$): The coefficient of correlation is a measure of the strength of the linear association between $Y$ and $X$. It will always be the same sign as the slope estimate $b_1$, but it has several advantages:

- In some applications, we cannot identify a clear dependent and independent variable; we just wish to determine how two variables vary together in a population (people's heights and
weights, closing stock prices of two firms, etc.). Unlike the slope estimate, the coefficient of correlation does not depend on which variable is labeled as $Y$ and which is labeled as $X$.
- The slope estimate depends on the units of $X$ and $Y$, while the correlation coefficient does not.
- The slope estimate has no bound on its range of potential values. The correlation coefficient is bounded by -1 and +1, with higher values (in absolute value) implying stronger linear association. (It is not useful in measuring nonlinear association, which may exist, however.)

$r = \mathrm{sgn}(b_1)\sqrt{r^2} = b_1\frac{s_X}{s_Y} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}$

where $\mathrm{sgn}(b_1)$ is the sign (positive or negative) of $b_1$, and $s_X, s_Y$ are the sample standard deviations of $X$ and $Y$, respectively.

Issues in Applying Regression Analysis

- When using regression to predict the future, the assumption is that the conditions are the same in the future as they are now. Clearly, any future predictions of economic variables such as tourism made prior to September 11, 2001 would not be valid.
- Often, when we predict in the future, we must also predict $X$, as well as $Y$, especially when we aren't controlling the levels of $X$. Prediction intervals using the methods described previously will be too narrow (that is, they will overstate confidence levels).
- Inferences should be made only within the range of $X$ values used in the regression analysis. We have no means of knowing whether a linear association continues outside the range observed; that is, we should not extrapolate outside the range of $X$ levels observed in the experiment.
- Even if we determine that $X$ and $Y$ are associated, based on the t-test and/or F-test, we cannot conclude that changes in $X$ cause changes in $Y$. Finding an association is only one step in demonstrating a causal relationship.
- When multiple tests and/or confidence intervals are being made, we must adjust our confidence levels. This is covered in Chapter 4.
- When $X$ is a random variable and not being controlled, all methods described thus far hold, as long as the $X_i$ are independent and their probability distribution does not depend on $\beta_0, \beta_1, \sigma^2$.
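The descriptive measures for the LSD example can be verified numerically; the software output in the example reports $R^2 = 0.878$. A minimal sketch:

```python
# Coefficient of determination r^2 = SSR/SSTO and coefficient of correlation
# r = sgn(b1) * sqrt(r^2), on the LSD example data.
import math

X = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
Y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / \
     sum((x - xbar) ** 2 for x in X)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * x for x in X]

ssto = sum((y - ybar) ** 2 for y in Y)
ssr = sum((yh - ybar) ** 2 for yh in yhat)
r2 = ssr / ssto
r = (1 if b1 > 0 else -1) * math.sqrt(r2)   # takes the sign of b1

print(round(r2, 3), round(r, 3))   # 0.878 -0.937
```

Note that $r$ is negative here, matching the negative slope: math scores decrease with LSD concentration.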
Chapter 3: Diagnostics and Remedial Measures

Diagnostics for the Predictor Variable

Levels of the independent variable $X$, particularly in settings where the experimenter does not control the levels, should be studied. Problems can arise when:
- One or more observations have $X$ levels far away from the others.
- When data are collected over time or space, $X$ levels that are close together in time or space are more similar than the overall set of $X$ levels.

Useful plots of the $X$ levels include histograms, boxplots, stem-and-leaf diagrams, and sequence plots (versus time order). Also, a useful measure is simply the z-score for each observation's $X$ value. We will discuss remedies for these problems later, in Chapter 9.

Residuals

True error term: $\varepsilon_i = Y_i - E(Y_i) = Y_i - (\beta_0 + \beta_1 X_i)$
Observed residual: $e_i = Y_i - \hat{Y}_i = Y_i - (b_0 + b_1 X_i)$

Recall the assumption on the true error terms: they are independent and normally distributed, with mean 0 and variance $\sigma^2$, i.e. $\varepsilon_i \sim NID(0, \sigma^2)$. The residuals have mean 0 (since they sum to 0), but they are not independent, since they are based on the fitted values from the same observations; as $n$ increases, this becomes less important. Ignoring the non-independence for now, we have the following for the residuals $e_1, \ldots, e_n$:

$\bar{e} = \frac{\sum_{i=1}^{n} e_i}{n} = 0$  $s^2(e_i) = \frac{\sum_{i=1}^{n}(e_i - \bar{e})^2}{n-2} = \frac{\sum_{i=1}^{n} e_i^2}{n-2} = MSE$

Semistudentized Residuals: We are accustomed to standardizing random variables by centering them (subtracting off the mean) and scaling them (dividing through by the standard deviation), thus creating a z-score. While the theoretical standard deviation of $e_i$ is a complicated function of the entire set of sample data (we will see this after introducing the matrix approach to regression), we can approximate the standardized residual as follows, which we call the semistudentized residual:

$e_i^* = \frac{e_i - \bar{e}}{\sqrt{MSE}} = \frac{e_i}{\sqrt{MSE}}$

In large samples, these can be treated approximately as t statistics with $n-2$ degrees of freedom.
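A minimal sketch of the semistudentized residuals $e_i^* = e_i/\sqrt{MSE}$ on the LSD example data; like the raw residuals, they sum to 0:

```python
# Semistudentized residuals e_i* = e_i / sqrt(MSE), LSD example data.
import math

X = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
Y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / \
     sum((x - xbar) ** 2 for x in X)
b0 = ybar - b1 * xbar

e = [y - (b0 + b1 * x) for x, y in zip(X, Y)]
mse = sum(ei ** 2 for ei in e) / (n - 2)
s = math.sqrt(mse)
e_star = [ei / s for ei in e]              # semistudentized residuals

print(round(max(abs(es) for es in e_star), 2))   # 1.32
```

None of the semistudentized residuals here is especially extreme (all are well below 2 in absolute value).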
mean of the errors is 0; (iv) the variance of the errors is constant and equal to σ²; (v) the errors are independent; (vi) the model contains all predictors related to E{Y}; and (vii) the model fits for all data observations. These can be visually investigated with various plots.

Linear Relationship Between E{Y} and X: Plot the residuals versus either X or the fitted values. This will appear as a random cloud of points centered at 0 under linearity, and will appear U-shaped or inverted U-shaped if the relationship is not linear.

Normally Distributed Errors: Obtain a histogram of the residuals and determine whether it is approximately mound-shaped. Alternatively, a normal probability plot can be obtained as follows:

1. Order the residuals from smallest (large negative) to largest (large positive). Assign the ranks as k.
2. Compute the percentile for each residual: (k − 0.375) / (n + 0.25).
3. Obtain the z value from the standard normal distribution corresponding to each percentile: z[(k − 0.375) / (n + 0.25)].
4. Multiply the z values by s = √MSE; these are the expected values of the k-th smallest residuals under the normality assumption.
5. Plot the observed residuals on the vertical axis versus the expected residuals on the horizontal axis. This should be approximately a straight line with slope 1.

Errors Have Mean 0: Since the residuals sum to 0, and thus have mean 0, we have no need to check this assumption.

Errors Have Constant Variance: Plot the residuals versus X or the fitted values. This should appear as a random cloud of points centered at 0 if the variance is constant. If the error variance is not constant, the plot may instead show a funnel shape.

Errors Are Independent (When Data Are Collected Over Time): Plot the residuals versus time order. If the errors are independent, they should appear as a random cloud of points centered at 0. If the errors are positively correlated, they will tend to approximate a smooth (not necessarily monotone) functional form.
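The residual computations above (the semistudentized residuals and the expected residuals for the normal probability plot) can be sketched in Python. This is a sketch only; the data are the calculator maintenance data (n = 18) analyzed in the worked example later in these notes.

```python
# Residual diagnostics for simple linear regression: semistudentized
# residuals e_i / sqrt(MSE) and the expected residuals under normality.
from math import sqrt
from statistics import NormalDist

# Calculator maintenance data (X = machines serviced, Y = minutes)
X = [1, 1, 2, 2, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 7, 7, 7, 8]
Y = [10, 17, 33, 25, 39, 62, 53, 49, 78, 75, 65, 71, 68, 86, 97, 101, 105, 118]
n = len(X)

# Least squares fit
xbar, ybar = sum(X) / n, sum(Y) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / \
     sum((x - xbar) ** 2 for x in X)
b0 = ybar - b1 * xbar

# Residuals and semistudentized residuals (approximate z-scores)
resid = [y - (b0 + b1 * x) for x, y in zip(X, Y)]
MSE = sum(e ** 2 for e in resid) / (n - 2)
semistud = [e / sqrt(MSE) for e in resid]

# Normal probability plot: k-th smallest residual vs. its expected value
ordered = sorted(resid)
pct = [(k - 0.375) / (n + 0.25) for k in range(1, n + 1)]
expected = [sqrt(MSE) * NormalDist().inv_cdf(p) for p in pct]
# plotting 'ordered' against 'expected' should give a line of slope ~1
```

Plotting `ordered` against `expected` reproduces the normal probability plot constructed step by step in EXCEL later in these notes.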
No Predictors Have Been Omitted: Plot the residuals versus omitted factors, or against X separately for each level of a categorical omitted factor. If the current model is correct, these should be random clouds of points centered at 0. If patterns arise, the omitted variables may need to be included in the model (multiple regression).

Model Fits for All Observations: Plot the residuals versus the fitted values. As long as no residuals stand out (either much higher or much lower than the others), the model fits all observations. Any residuals that are very extreme are evidence of data points called outliers. Any outliers should be checked as possible data entry errors. We will cover this problem in detail in Chapter 9.

Tests Involving Residuals

Several of the assumptions stated above can be formally tested with statistical tests.

Normally Distributed Errors (Correlation Test): Using the expected residuals (denoted ẽ_i) obtained to construct a normal probability plot, we can compute the correlation coefficient between the observed residuals and their expected residuals under normality:

r_{e,ẽ} = Σ e_i ẽ_i / √( Σ e_i² · Σ ẽ_i² )

The test is conducted as follows:
- H0: Error terms are normally distributed
- HA: Error terms are not normally distributed
- Test statistic: r_{e,ẽ}
- Rejection region: r_{e,ẽ} ≤ tabled value in Table B.6 (page 1348), indexed by α and n

Note that this is a test where we do not wish to reject the null hypothesis. Another test that is more complex to compute manually, but is automatically reported by several software packages, is the Shapiro-Wilk test. Its null and alternative hypotheses are the same as for the correlation test, and P-values are computed for the test.

Errors Have Constant Variance (Modified Levene Test): There are several ways to test for equal variances. One simple-to-describe approach is a modified version of Levene's test, which tests for equality of variances without depending on the errors being normally distributed. Recall that, due to Central Limit Theorems, lack of normality causes us no problems in large samples as long as the other assumptions hold. The procedure can be described as follows:

1.
Split the data into 2 groups: one group with low X values containing n1 of the observations, the other group with high X values containing n2 observations (n1 + n2 = n).
2. Obtain the median of the residuals for each group, labeling them ẽ1 and ẽ2, respectively.
3. Obtain the absolute deviation of each residual from its group median: d_i1 = |e_i1 − ẽ1|, i = 1, ..., n1; d_i2 = |e_i2 − ẽ2|, i = 1, ..., n2.
4. Obtain the sample mean absolute deviation from the median for each group: d̄1 = Σ d_i1 / n1 and d̄2 = Σ d_i2 / n2.
5. Obtain the pooled variance of the absolute deviations: s² = [ Σ (d_i1 − d̄1)² + Σ (d_i2 − d̄2)² ] / (n − 2).
6. Compute the test statistic: t_L* = (d̄1 − d̄2) / ( s √(1/n1 + 1/n2) ).
7. Conclude that the error variance is not constant if |t_L*| ≥ t(1 − α/2; n − 2); otherwise conclude that the error variance is constant.

Errors Are Independent (When Data Are Collected Over Time): When data are collected over time, one common departure from independence is that the error terms are positively autocorrelated; that is, errors close to each other in time are similar in magnitude and sign. This can happen when learning or fatigue is occurring over time in physical processes, or when long-term trends are occurring in social processes. A test that can be used to determine whether positive autocorrelation (non-independence of errors) exists is the Durbin-Watson test (see Section 12.3; we will consider it in more detail later). The test is conducted as follows:
- H0: The errors are independent
- HA: The errors are not independent (positively autocorrelated)
- Test statistic: D = Σ_{t=2}^{n} (e_t − e_{t−1})² / Σ_{t=1}^{n} e_t²
- Decision rule: (i) reject H0 if D ≤ d_L; (ii) conclude H0 if D ≥ d_U; (iii) withhold judgment if d_L < D < d_U

Here d_L and d_U are bounds indexed by α, n, and p − 1 (the number of predictors, which is 1 for now). These bounds are given in Table B.7 (pages 1349-1350).
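The Durbin-Watson statistic is simple to compute directly. A minimal sketch (the residuals below are illustrative, not from the notes' example):

```python
# Durbin-Watson statistic D = sum_{t=2..n}(e_t - e_{t-1})^2 / sum(e_t^2)
# for residuals listed in time order.  D near 2 is consistent with
# independent errors; small D suggests positive autocorrelation.
def durbin_watson(e):
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    return num / sum(et ** 2 for et in e)

# Slowly drifting (positively autocorrelated) residuals give a small D.
smooth = [1.0, 1.2, 1.1, 0.9, -0.2, -0.8, -1.1, -1.0]
D = durbin_watson(smooth)   # well below 2; compare with d_L, d_U in Table B.7
```

The conclusion still requires the tabled bounds d_L and d_U; the statistic alone does not decide the test in the inconclusive region.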
F Test for Lack of Fit (Testing the Linear Relation Between E{Y} and X)

A test can be conducted to determine whether the true regression function is the form currently being specified. For the test to be conducted, the following conditions must hold: the observations Y, conditional on their X levels, are independent, normally distributed, and have the same variance σ². Further, the X levels in the sample must have repeat observations at a minimum of one X level (preferably more). Repeat trials at the same level of the predictor variable are called replications; the actual observations are referred to as replicates.

The null and alternative hypotheses for the simple linear regression model are:

H0: E{Y} = β0 + β1 X    HA: E{Y} ≠ β0 + β1 X

The null hypothesis states that the mean structure is a linear relation; the alternative says that the mean structure is any structure except linear (this is not simply a test of whether β1 = 0). The test, which is a special case of the general linear test, is conducted as follows:

1. Begin with n total observations at c distinct levels of X, with n_j observations at the j-th level of X (n1 + ... + nc = n).
2. Let Y_ij be the i-th replicate at the j-th level of X, j = 1, ..., c; i = 1, ..., n_j.
3. Fit the Full model (HA): Y_ij = μ_j + ε_ij. The least squares estimate of μ_j is Ȳ_j.
4. Obtain the error sum of squares for the Full model, also known as the Pure Error sum of squares: SSE(F) = SSPE = Σ_j Σ_i (Y_ij − Ȳ_j)².
5. The degrees of freedom for the Full model are df_F = n − c. This follows from the fact that at the j-th level of X we have n_j − 1 degrees of freedom, and these sum to n − c; equivalently, we have estimated c parameters μ_1, ..., μ_c.
6. Fit the Reduced model (H0): Y_ij = β0 + β1 X_j + ε_ij. The least squares estimate of β0 + β1 X_j is Ŷ_j = b0 + b1 X_j.
7. Obtain the error sum of squares for the Reduced model, which is the usual Error sum of squares: SSE(R) = SSE = Σ_j Σ_i (Y_ij − Ŷ_j)².
8. The degrees of freedom for the Reduced model are df_R = n − 2, since we have estimated two parameters (β0, β1) in this model.
9. Compute the test statistic:

F* = [ (SSE(R) − SSE(F)) / (df_R − df_F) ] / [ SSE(F) / df_F ] = [ (SSE − SSPE) / (c − 2) ] / MSPE

10. Obtain the rejection region: RR: F* ≥ F(1 − α; c − 2, n − c).

Note that the numerator sum of squares of the F statistic is also known as the Lack of Fit sum of squares:

SSLF = SSE − SSPE = Σ_j Σ_i (Ȳ_j − Ŷ_j)² = Σ_j n_j (Ȳ_j − Ŷ_j)²    df = c − 2

The degrees of freedom can be thought of intuitively as resulting from fitting a simple linear regression model to the c sample means.
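The partition of SSE into pure error and lack of fit can be sketched numerically. The data are the calculator maintenance data (n = 18, with c = 8 distinct X levels) from the worked example later in these notes:

```python
# Lack-of-fit F test sketch: SSE = SSPE + SSLF, F* = MSLF / MSPE.
X = [1, 1, 2, 2, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 7, 7, 7, 8]
Y = [10, 17, 33, 25, 39, 62, 53, 49, 78, 75, 65, 71, 68, 86, 97, 101, 105, 118]
n = len(X)

# Reduced model: simple linear regression
xbar, ybar = sum(X) / n, sum(Y) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / \
     sum((x - xbar) ** 2 for x in X)
b0 = ybar - b1 * xbar
SSE = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))

# Full model: a separate mean at each distinct X level
levels = sorted(set(X))
c = len(levels)
means = {lv: sum(y for x, y in zip(X, Y) if x == lv) /
             sum(1 for x in X if x == lv) for lv in levels}
SSPE = sum((y - means[x]) ** 2 for x, y in zip(X, Y))

SSLF = SSE - SSPE
Fstar = (SSLF / (c - 2)) / (SSPE / (n - c))
# compare Fstar with F(1 - alpha; c - 2, n - c)
```

For these data F* is well below 1, so there is no evidence of lack of fit; the formal comparison still uses the F table with c − 2 and n − c degrees of freedom.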
The F statistic can then be written as:

F* = [ (SSE(R) − SSE(F)) / (df_R − df_F) ] / [ SSE(F) / df_F ] = [ SSLF / (c − 2) ] / [ SSPE / (n − c) ] = MSLF / MSPE

Thus we have partitioned the Error sum of squares for the linear regression model into Pure Error (based on deviations of individual responses from their group means) and Lack of Fit (based on deviations of group means from the fitted values of the regression model). The expected mean squares for MSPE and MSLF are:

E{MSPE} = σ²    E{MSLF} = σ² + Σ n_j (μ_j − (β0 + β1 X_j))² / (c − 2)

Under the null hypothesis (the relationship is linear), the second term of the lack-of-fit expected mean square is 0; under the alternative (the relationship is not linear), the second term is positive. Thus large values of the F statistic are consistent with the alternative hypothesis.

Remedial Measures

Nonlinearity of Regression Function: Several options apply. A quadratic regression function, E{Y} = β0 + β1 X + β2 X², places a bend in the fit. An exponential regression function, E{Y} = β0 (β1)^X, allows for multiplicative increases.

Nonconstant Error Variance: Often transformations can solve this problem. Another option is weighted least squares (Chapter 10, not covered in this course).

Nonindependent Error Terms: One option is to work with a model permitting correlated errors. Other options include working with differenced data, or allowing previously observed Y values to serve as predictors.

Nonnormality of Errors: Nonnormal errors and errors with nonconstant variances tend to occur together. Some of the transformations used to stabilize variances often normalize the errors as well. The Box-Cox transformation can (but does not necessarily) cure both problems.

Omission of Important Variables: When important predictors have been omitted, they can be added in the form of a multiple linear regression model (Chapter 6).

Outliers: When an outlier has been determined not to be due to a data entry or recording error, and should not be removed from the model for other reasons, indicator variables may be used to classify these observations away from the others (Chapter 11), or robust methods may be used (Chapter 10, not covered in this class).
Transformations: See Section 3.9 (pages 126-132) for prototype plots and transformations of Y and/or X that are useful in linearizing the relation and/or stabilizing the variance. Many times, simply taking the logarithm of Y can solve the problems.

Chapter 4: Simultaneous Inference and Other Topics

Joint Estimation of β0 and β1

We have obtained (1 − α)100% confidence intervals for the slope and intercept parameters in Chapter 2. Now we would like to construct a range of values that we believe contains BOTH parameters with the same overall level of confidence. One way to do this is to construct each individual confidence interval at a higher level of confidence, namely (1 − α/2)100%, for β0 and β1 separately. The resulting ranges are called Bonferroni joint (simultaneous) confidence intervals:

Joint confidence level (1 − α)100%:      90%    95%    99%
Individual confidence level (1 − α/2)100%:  95%    97.5%  99.5%

The resulting simultaneous confidence intervals, with a joint confidence level of at least (1 − α)100%, are:

b0 ± B s{b0}    b1 ± B s{b1}    where B = t(1 − α/4; n − 2)

Simultaneous Estimation of Mean Responses

Case 1: Simultaneous (1 − α)100% bounds for the entire regression line (Working-Hotelling approach):

Ŷ_h ± W s{Ŷ_h}    where W² = 2 F(1 − α; 2, n − 2)

Case 2: Simultaneous (1 − α)100% bounds at g specific X levels (Bonferroni approach):

Ŷ_h ± B s{Ŷ_h}    where B = t(1 − α/(2g); n − 2)

Simultaneous Prediction Intervals for New Observations

Sometimes we wish to obtain simultaneous prediction intervals for g new outcomes.

Scheffé's method: Ŷ_h ± S s{pred}, where S² = g F(1 − α; g, n − 2) and

s{pred} = √( MSE [ 1 + 1/n + (X_h − X̄)² / Σ(X_i − X̄)² ] )

is the estimated standard error of the prediction.

Bonferroni's method: Ŷ_h ± B s{pred}, where B = t(1 − α/(2g); n − 2).

Both S and B can be computed before observing the data, and the smaller of the two should be used.
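The building blocks of all of these intervals are the estimated standard errors s{Ŷ_h} and s{pred}; the multipliers W, B, and S then come from the F and t tables. A sketch of the standard errors, computed for the calculator maintenance data at X_h = 6 (an arbitrary illustrative level):

```python
# Standard errors for the mean response and for prediction at X_h,
# used by the Working-Hotelling, Bonferroni, and Scheffe intervals.
from math import sqrt

X = [1, 1, 2, 2, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 7, 7, 7, 8]
Y = [10, 17, 33, 25, 39, 62, 53, 49, 78, 75, 65, 71, 68, 86, 97, 101, 105, 118]
n = len(X)

xbar, ybar = sum(X) / n, sum(Y) / n
Sxx = sum((x - xbar) ** 2 for x in X)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / Sxx
b0 = ybar - b1 * xbar
MSE = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y)) / (n - 2)

Xh = 6
Yhat = b0 + b1 * Xh
s_mean = sqrt(MSE * (1 / n + (Xh - xbar) ** 2 / Sxx))      # for E{Y_h}
s_pred = sqrt(MSE * (1 + 1 / n + (Xh - xbar) ** 2 / Sxx))  # for a new obs.
# intervals: Yhat +/- (W or B) * s_mean, and Yhat +/- (S or B) * s_pred
```

Note that s{pred} exceeds s{Ŷ_h} by the extra "1" inside the bracket, reflecting the variability of the new observation itself on top of the uncertainty in the fitted line.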
Regression Through the Origin

Sometimes it is desirable to have the mean response be 0 when the predictor variable is 0 (this is not the same as saying Y must be 0 when X is 0). Even though it can cause extra problems, it is an interesting special case of the simple regression model:

Y_i = β1 X_i + ε_i    ε_i ~ NID(0, σ²)

We obtain the least squares estimate of β1 (which also happens to be the maximum likelihood estimate) as follows:

Q = Σ e_i² = Σ (Y_i − β1 X_i)²    dQ/dβ1 = −2 Σ X_i (Y_i − β1 X_i)

Setting the derivative to 0 and replacing β1 with b1: Σ X_i Y_i − b1 Σ X_i² = 0, so

b1 = Σ X_i Y_i / Σ X_i²

The fitted values and residuals (which no longer necessarily sum to 0) are:

Ŷ_i = b1 X_i    e_i = Y_i − Ŷ_i

An unbiased estimate of the error variance σ² is:

s² = MSE = Σ (Y_i − Ŷ_i)² / (n − 1) = Σ e_i² / (n − 1)

Note that we have estimated only one parameter in this regression function, hence the n − 1 degrees of freedom. Note also that b1 is a linear function of Y_1, ..., Y_n:

b1 = Σ X_i Y_i / Σ X_i² = Σ a_i Y_i    where a_i = X_i / Σ X_i²

E{b1} = Σ a_i E{Y_i} = Σ (X_i / Σ X_i²) β1 X_i = β1

σ²{b1} = Σ a_i² σ² = σ² Σ X_i² / (Σ X_i²)² = σ² / Σ X_i²

Thus b1 is an unbiased estimate of the slope parameter β1, and its variance (and thus standard error) can be estimated as follows:

s²{b1} = MSE / Σ X_i²    s{b1} = √( MSE / Σ X_i² )

This can be used to construct confidence intervals for, or conduct tests regarding, β1. The mean response at X_h for this model is E{Y_h} = β1 X_h, and its estimate is Ŷ_h = b1 X_h, with mean and variance:

E{Ŷ_h} = X_h E{b1} = β1 X_h    σ²{Ŷ_h} = X_h² σ²{b1} = σ² X_h² / Σ X_i²    s²{Ŷ_h} = MSE X_h² / Σ X_i²

This can be used to obtain a confidence interval for the mean response when X = X_h. The estimated prediction error for a new observation at X = X_h is:

s²{pred} = s²{Y_h(new) − Ŷ_h} = MSE ( 1 + X_h² / Σ X_i² )

This can be used to obtain a prediction interval for a new observation at this level of X.

Comments Regarding Regression Through the Origin:
- You should test whether the true intercept is 0 before forcing the regression through the origin.
- Remember the notion of constant variance: if you are forcing Y to be 0 when X is 0, you are saying that the variance of Y at X = 0 is 0.
- If X = 0 is not an important value of X in practice, there is no reason to put this constraint into the model.
- r² is no longer constrained to be nonnegative: the error sum of squares from the regression can exceed the total corrected sum of squares. The coefficient of determination loses its interpretation as the proportion of variation in Y that is explained by X.
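The computations for the no-intercept model are compact. A sketch, run on the calculator maintenance data purely for illustration (the true intercept for those data is not 0, so this fit is for demonstration only):

```python
# Regression through the origin: b1 = sum(XY)/sum(X^2), with
# MSE = sum(e^2)/(n - 1) since only one parameter is estimated.
from math import sqrt

X = [1, 1, 2, 2, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 7, 7, 7, 8]
Y = [10, 17, 33, 25, 39, 62, 53, 49, 78, 75, 65, 71, 68, 86, 97, 101, 105, 118]
n = len(X)

b1 = sum(x * y for x, y in zip(X, Y)) / sum(x * x for x in X)
resid = [y - b1 * x for x, y in zip(X, Y)]
MSE = sum(e * e for e in resid) / (n - 1)        # n - 1, not n - 2
s_b1 = sqrt(MSE / sum(x * x for x in X))
# note: sum(resid) need not equal 0 in this model
```

The nonzero residual sum here illustrates the comment above: forcing the line through the origin removes the constraint Σe_i = 0 that the intercept normally enforces.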
Effects of Measurement Errors

Measurement errors can take on one of three forms; two of the three forms cause no major problems, but one does.

Measurement Errors in Y: This causes no problems, as the measurement error in Y becomes part of the random error term, which represents the effects of many unobservable quantities. This holds as long as the measurement errors are independent, unbiased, and not correlated with the level of X.

Measurement Errors in X: Problems do arise when the predictor variable is measured with error. This is particularly the case when the observed (reported) level X_i* is the true level X_i plus a random error term. In this case, the random error terms are not independent of the reported levels of the predictor variable, causing the estimated regression coefficients to be biased and not consistent. (See the textbook for a mathematical development.) Certain methods have been developed for particular forms of measurement error; see Measurement Error Models by W. A. Fuller for a theoretical treatment of the problem, or Applied Regression Analysis by J. O. Rawlings, S. G. Pantula, and D. A. Dickey for a brief description.

Measurement Errors with Fixed Observed X Levels: When working in engineering and behavioral settings, a factor such as temperature may be set by controlling a level on a thermostat; that is, you may set an oven's cooking temperature at 300, 350, 400, etc. When this is the case, and the actual physical temperatures vary at random around the observed (set) temperatures, the least squares estimators are unbiased. Further, when normality and constant variance assumptions are applied to the new errors (which reflect the random actual temperatures), the usual tests and confidence intervals can be applied.

Inverse Predictions

Sometimes, after we fit (or calibrate) a regression model, we observe Y values and wish to predict the X levels that generated the outcomes. Let Y_h(new) represent a new value of Y we have just observed, or a desired level of Y we wish to attain. In neither case was this observation part of the sample. We wish to
predict the X level that led to our observation, or the X level that will lead to our desired level. Consider the estimated regression function Ŷ = b0 + b1 X. Now that we observe a new outcome Y_h(new), we can predict the X value corresponding to it by solving the fitted equation for X. The estimator and its approximate estimated standard error are:

X̂_h(new) = (Y_h(new) − b0) / b1

s²{predX} = (MSE / b1²) [ 1 + 1/n + (X̂_h(new) − X̄)² / Σ(X_i − X̄)² ]

Then an approximate (1 − α)100% prediction interval for X_h(new) is:

X̂_h(new) ± t(1 − α/2; n − 2) s{predX}

Choosing X Levels

Issues arising in the choice of X levels and sample sizes include: the range of X values of interest to the experimenter; the goal of the research (inference concerning the slope, predicting future outcomes, or understanding the shape of the relationship, linear versus curved); and the cost of collecting measurements. Note that all of our estimated standard errors depend on the number of observations and the spacing of the X levels: the more spread out the X levels, the smaller the standard errors, generally. However, if we wish to truly understand the shape of the response curve, we must space the observations throughout the set of X values. See the quote by D. R. Cox on page 170 of the textbook.
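The inverse-prediction estimator above can be sketched numerically. This uses the calculator maintenance fit; Y_new = 80 minutes is an arbitrary illustrative value, not from the notes:

```python
# Inverse prediction: estimate the X that produced a newly observed Y,
# with the approximate standard error (MSE/b1^2)[1 + 1/n + (Xhat-xbar)^2/Sxx].
from math import sqrt

X = [1, 1, 2, 2, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 7, 7, 7, 8]
Y = [10, 17, 33, 25, 39, 62, 53, 49, 78, 75, 65, 71, 68, 86, 97, 101, 105, 118]
n = len(X)

xbar, ybar = sum(X) / n, sum(Y) / n
Sxx = sum((x - xbar) ** 2 for x in X)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / Sxx
b0 = ybar - b1 * xbar
MSE = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y)) / (n - 2)

Y_new = 80                          # hypothetical newly observed Y
x_hat = (Y_new - b0) / b1           # solve Yhat = b0 + b1*X for X
s_predx = sqrt((MSE / b1 ** 2) * (1 + 1 / n + (x_hat - xbar) ** 2 / Sxx))
# approximate interval: x_hat +/- t(1 - alpha/2; n - 2) * s_predx
```

As the formula shows, the interval is narrowest when b1 is large relative to √MSE, i.e., when the calibration line is steep and tight.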
Example of Model Diagnostics: Calculator Maintenance Data, Using EXCEL

First we begin with the original data. I have sorted it with respect to the predictor variable X (number of machines serviced). Note that in this case we wish to preserve the pairs (X_i, Y_i). To do this:
- Move the cursor into the field of data.
- Click on Data on the main toolbar, then Sort.
- Select Column 2 (X) and Ascending. If you have already placed headers on the columns, make sure you click on the correct option regarding headers.

[Data table: Y (minutes) and X (machines serviced); the values are listed with the residuals later in this example.]

Diagnostics for the Predictor Variable (Section 3.1)

X values that are far away from the rest can exert a lot of influence on the least squares regression line. A histogram or bar chart of the X values can identify any potential extreme values. The following steps in EXCEL can be used to obtain a histogram of the X values; a copy of the histogram is given below the instructions.
- Select Tools on the header bar, then Data Analysis (you may need to add it in from Add-Ins), then Histogram.
- For the Input Range, highlight the column containing X; if you have included the header cell, click on Labels.
- Click Chart Output, then OK.
- You may experiment and make the chart more visually appealing if preparing reports, but for investigating the model assumptions this is fine.

[Histogram of the X values, with bin boundaries at 1, 2.75, 4.5, 6.25, More.]

Residuals (Section 3.2)

The model assumptions are that the error terms are independent and normally distributed, with mean 0 and variance σ² that is constant with respect to the levels of X. The errors are:

ε_i = Y_i − E{Y_i} = Y_i − (β0 + β1 X_i)

Since the model parameters are unknown, we cannot observe the actual errors. However, if we replace the unknown parameters with their estimates, we obtain an estimate of each error by taking the difference between the actual and fitted values. These are the residuals:

e_i = Y_i − Ŷ_i = Y_i − (b0 + b1 X_i)

The residuals should approximately demonstrate the same behavior as the true error terms; the approximation improves as the sample size increases. Some important properties of the residuals:
- Mean: Σ e_i = 0, so ē = 0 (shown in Chapter 1). Thus the residuals have mean 0.
- Variance: s² = MSE = Σ (e_i − ē)² / (n − 2) = SSE / (n − 2).
- Independence: the residuals are not independent, due to the constraints Σ e_i = 0 and Σ X_i e_i = 0. For samples that are large relative to the number of model parameters, the dependency is unimportant.

Note that, under the model assumptions, if we standardize the errors by subtracting off their mean (which is 0) and dividing through by their standard deviation, they have a standard normal distribution:

(ε_i − 0) / σ = ε_i / σ ~ N(0, 1)

Semistudentized residuals are quantities that approximate these standardized errors, based on the fitted equation. They are based on the estimates of the unknown errors (the residuals) and the estimate of the error standard deviation. These can be used to identify outlying
observations, since they are like z-scores:

e_i* = e_i / √MSE

Note that the residuals actually have complicated standard deviations that are not constant (we will pursue this later in the course), so this is an approximation. EXCEL produces "Standardized Residuals," which appear to be computed as:

e_i / √( Σ e_i² / (n − 1) )

The denominator is the square root of the average variance of the residuals. As the sample size increases, these become very similar quantities; for purposes of identifying outlying observations, either is useful.

Obtaining Residuals in EXCEL:
- Choose Tools, Data Analysis, Regression.
- Highlight the column containing Y, then the column containing X, then the appropriate Labels option.
- Click on Residuals and Standardized Residuals.
- Click OK.
- The residuals will appear on a worksheet below the ANOVA table and parameter estimates. Also printed are the observation number, predicted (fitted) values, and standardized residuals.

Regression Statistics: Multiple R 0.990215; R Square 0.980526; Adjusted R Square 0.979309; Standard Error 4.481880; Observations 18.

ANOVA: Regression df 1, SS 16182.6, MS 16182.6, F 805.6, Significance F 4.097E-15; Residual df 16, SS 321.4, MS 20.1; Total df 17, SS 16504.

Observation / Predicted Y (minutes) / Residual / Standardized Residual:
1    12.4161   −2.4161   −0.5557
2    12.4161    4.5839    1.0542
3    27.1544    5.8456    1.3444
4    27.1544   −2.1544   −0.4955
5    41.8926   −2.8926   −0.6653
6    56.6309    5.3691    1.2348
7    56.6309   −3.6309   −0.8351
8    56.6309   −7.6309   −1.7550
9    71.3691    6.6309    1.5250
10   71.3691    3.6309    0.8351
11   71.3691   −6.3691   −1.4648
12   71.3691   −0.3691   −0.0849
13   71.3691   −3.3691   −0.7749
14   86.1074   −0.1074   −0.0247
15   100.8456  −3.8456   −0.8844
16   100.8456   0.1544    0.0355
17   100.8456   4.1544    0.9555
18   115.5839   2.4161    0.5557

Diagnostics for Residuals (Section 3.3)

Obtaining a Plot of Residuals Against X (e_i versus X_i):
- Copy and paste the column of residuals into Column C of the original spreadsheet.
- Highlight Columns B and C and click on the Chart Wizard icon.
- Click on XY (Scatter), then click through the dialog boxes.
- Using all default options, your plot will appear as below.

Data (Y minutes, X machines, residual): (10, 1, −2.4161), (17, 1, 4.5839), (33, 2, 5.8456), (25, 2, −2.1544), (39, 3, −2.8926), (62, 4, 5.3691), (53, 4, −3.6309), (49, 4, −7.6309), (78, 5, 6.6309), (75, 5, 3.6309), (65, 5, −6.3691), (71, 5, −0.3691), (68, 5, −3.3691), (86, 6, −0.1074), (97, 7, −3.8456), (101, 7, 0.1544), (105, 7, 4.1544), (118, 8, 2.4161)

[Plot of residuals versus X.]

Plots of residuals versus predicted values, and of residuals versus time order (when data are collected over time), are obtained in similar manners: simply copy and paste the columns of interest to new columns, placing the variable to go on the horizontal (X) axis to the left of the variable to go on the vertical (Y) axis.

Normality of Errors

The simplest way to check for normality of the error terms is to obtain a histogram of the residuals. There are several ways to do this, the simplest being as follows:
- Choose Tools, Data Analysis, Histogram.
- Highlight the column containing the residuals.
- Choose the appropriate Labels choice.
- Click Chart Output, then OK.

A crude histogram will appear, which is fine for our purposes. You may wish to experiment with EXCEL to obtain more elegant plots.

[Histogram of the residuals, default bins.]

Note that you can choose bin upper values that are more satisfactory:
- Type the desired upper endpoints of the bins into a new range of cells; here: −7.5, −2.5, 2.5, 7.5.
- Choose Tools, Data Analysis, Histogram.
- Highlight the column containing the residuals.
- For Bin Range, highlight the range of values you have entered (include a label).
- Choose the appropriate Labels choice.
- Click on Chart Output, then OK.

The ranges will be (−∞, −7.5], (−7.5, −2.5], (−2.5, 2.5], (2.5, 7.5], (7.5, ∞).

[Histogram of the residuals with bins at −7.5, −2.5, 2.5, 7.5, More.]

Computing Expected Residuals Under Normality:
- Copy the cells containing Observation and Residuals to a new worksheet, in Columns A and B respectively.
- Highlight the column
of Residuals, then select Data and Sort, click on Continue with Current Selection, then OK. Note that the residuals are now in ascending order, and the observation number represents the rank k (as opposed to i).
- Compute the percentile representing each residual in the empirical distribution. Go to Cell C2 (assuming you have a header row with labels) and type =(A2-0.375)/(18+0.25), where 18 is the sample size n (type the number).
- Highlight Cell C2, then Copy it; highlight the next n − 1 cells in Column C, then Paste.
- Compute the z values from the standard normal distribution corresponding to the percentiles in Column C. Go to Cell D2 and type =NORMSINV(C2).
- Highlight Cell D2, then Copy it; highlight the next n − 1 cells in Column D, then Paste.
- Compute the expected residuals under normality by multiplying the elements of Column D by √MSE. This can be done in Column E.

The results of the steps are shown below.

First, put the observation number and residuals in a new worksheet:

Obs 1-18; residuals: −2.4161, 4.5839, 5.8456, −2.1544, −2.8926, 5.3691, −3.6309, −7.6309, 6.6309, 3.6309, −6.3691, −0.3691, −3.3691, −0.1074, −3.8456, 0.1544, 4.1544, 2.4161

Second, sort (only) the residuals:

Rank 1-18; sorted residuals: −7.6309, −6.3691, −3.8456, −3.6309, −3.3691, −2.8926, −2.4161, −2.1544, −0.3691, −0.1074, 0.1544, 2.4161, 3.6309, 4.1544, 4.5839, 5.3691, 5.8456, 6.6309

Third, compute the percentiles (notice that they are symmetric around 0.5; here n = 18):

0.0342, 0.0890, 0.1438, 0.1986, 0.2534, 0.3082, 0.3630, 0.4178, 0.4726, 0.5274, 0.5822, 0.6370, 0.6918, 0.7466, 0.8014, 0.8562, 0.9110, 0.9658

Fourth, compute the z values from the standard normal distribution corresponding to the percentiles for the ordered
residuals, that is, the z such that P(Z ≤ z) equals the percentile:

z: −1.8217, −1.3467, −1.0632, −0.8465, −0.6638, −0.5009, −0.3504, −0.2075, −0.0687, 0.0687, 0.2075, 0.3504, 0.5009, 0.6638, 0.8465, 1.0632, 1.3467, 1.8217

Fifth, multiply the z values by the residual standard error √MSE to obtain the expected residuals under normality:

Expected: −8.1614, −6.0331, −4.7633, −3.7924, −2.9736, −2.2441, −1.5699, −0.9296, −0.3079, 0.3079, 0.9296, 1.5699, 2.2441, 2.9736, 3.7924, 4.7633, 6.0331, 8.1614

Obtaining a Normal Probability Plot:
- Copy the (sorted) Residuals column to the right-hand side of the Expected column.
- Highlight these 2 columns.
- Click on Chart Wizard, then XY (Scatter), then click through the dialog boxes.

[Normal probability plot: observed residuals versus expected residuals.]

As always, you can make the plot more attractive with plot options, but that is unnecessary for our purpose of assessing normality. For this example, the residuals appear to fall on a reasonably straight line, as would be expected under the normality-of-errors assumption.

Correlation Test for Normality (Section 3.5)
- H0: Error terms are normally distributed
- HA: Error terms are not normally distributed
- Test statistic: the correlation coefficient r_{e,ẽ} between the observed and expected residuals
- Rejection region: r_{e,ẽ} ≤ tabled value in Table B.6 (page 1348), indexed by α and n

We can obtain the correlation coefficient between the observed and expected residuals as follows:
- Select Tools, Data Analysis, Correlation.
- Highlight the columns for Residuals and Expected.
- Click on Labels if they are included.
- Click OK.

The result is r = 0.9808. For this example, with α = 0.05 we obtain a critical value of 0.946. Since the correlation coefficient (0.981) is larger than the critical value, we conclude in favor of the null hypothesis: the errors are normally distributed.

Modified Levene Test for Constant Variance (Section 3.6)

To conduct this test in EXCEL, do the following steps:
- Split the data into two groups with respect to levels of X. Use best judgment in terms of balance and closeness of X levels. For our example, a natural split is group 1: X = 1 to 4, and group 2: X = 5 to 8.
- Obtain the residuals from the regression. In a new worksheet, put the residuals from group 1 in one column (say Column A) and the residuals from group 2 in another column (say Column B). For this example, the group sizes are n1 = 8 and n2 = 10.
- Obtain the median residual for each group. In Cell A15 type =MEDIAN(A2:A9) (since n1 = 8 plus a header row); in Cell B15 type =MEDIAN(B2:B11) (since n2 = 10 plus a header row).
- Obtain the absolute
values of the differences between the residuals and their group medians, in the next two columns. In Cell C2 type =ABS(A2-A$15) (the dollar sign makes copy-and-paste work correctly); then Copy Cell C2 and Paste it into Cells C3:C9. In Cell D2 type =ABS(B2-B$15); then Copy Cell D2 and Paste it into Cells D3:D11.
- Obtain the mean and the sum of squared deviations of the absolute differences from the previous step. In Cell F2 type =AVERAGE(C2:C9) (this computes d̄1); in Cell F3 type =DEVSQ(C2:C9) (this computes Σ(d_i1 − d̄1)²). In Cell G2 type =AVERAGE(D2:D11) (this computes d̄2); in Cell G3 type =DEVSQ(D2:D11) (this computes Σ(d_i2 − d̄2)²).
- Compute s². In Cell H2 type =(F3+G3)/(18-2).
- Compute t_L*. In Cell I2 type =(F2-G2)/SQRT(H2*(1/8+1/10)), since n1 = 8 and n2 = 10.

The results of these steps on the calculator maintenance data are as follows.

First, the residuals are separated into the two groups:
Group 1 (X = 1 to 4): −2.4161, 4.5839, 5.8456, −2.1544, −2.8926, 5.3691, −3.6309, −7.6309
Group 2 (X = 5 to 8): 6.6309, 3.6309, −6.3691, −0.3691, −3.3691, −0.1074, −3.8456, 0.1544, 4.1544, 2.4161

Second, the group medians are ẽ1 = −2.28523 and ẽ2 = 0.02349.

Third, the absolute deviations from the group medians:
d_1: 0.1309, 6.8691, 8.1309, 0.1309, 0.6074, 7.6544, 1.3456, 5.3456
d_2: 6.6074, 3.6074, 6.3926, 0.3926, 3.3926, 0.1309, 3.8691, 0.1309, 4.1309, 2.3926

Fourth, the summary statistics are d̄1 = 3.77685 with Σ(d_i1 − d̄1)² = 88.5585, and d̄2 = 3.10470 with Σ(d_i2 − d̄2)² = 50.6019.

Fifth, the pooled variance is s² = (88.5585 + 50.6019) / 16 = 8.6975, so s = 2.9492.

Sixth, the test statistic is t_L* = (3.77685 − 3.10470) / (2.9492 √(1/8 + 1/10)) = 0.48048.

Finally, we can conduct the test. For α = 0.05 we obtain t(1 − α/2; n − 2) = t(0.975; 16) = 2.120. Since our test statistic |t_L*| = 0.48 does not exceed 2.120, we fail to reject the hypothesis of equal variances. We have no reason to believe that the error variance is not constant.
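Both formal tests worked through above (the correlation test for normality and the modified Levene test) can be verified in a few lines of Python. This is a sketch; the residuals are transcribed from the worksheet above, and the scale factor √MSE is omitted from the expected residuals since multiplying by a constant does not change a correlation.

```python
# Correlation test for normality and modified Levene test, applied to
# the 18 residuals of the calculator maintenance example.
from math import sqrt
from statistics import NormalDist, median

resid = [-2.416107, 4.583893, 5.845638, -2.154362, -2.892617, 5.369128,
         -3.630872, -7.630872, 6.630872, 3.630872, -6.369128, -0.369128,
         -3.369128, -0.107383, -3.845638, 0.154362, 4.154362, 2.416107]
n = len(resid)

# --- Correlation test: r between ordered residuals and normal quantiles ---
ordered = sorted(resid)
z = [NormalDist().inv_cdf((k - 0.375) / (n + 0.25)) for k in range(1, n + 1)]
mo, mz = sum(ordered) / n, sum(z) / n
num = sum((o - mo) * (zi - mz) for o, zi in zip(ordered, z))
r = num / sqrt(sum((o - mo) ** 2 for o in ordered) *
               sum((zi - mz) ** 2 for zi in z))
# r ~ 0.981 > 0.946 tabled value, so normality is not rejected

# --- Modified Levene test: group 1 = X <= 4 (first 8), group 2 = X >= 5 ---
g1, g2 = resid[:8], resid[8:]
d1 = [abs(e - median(g1)) for e in g1]
d2 = [abs(e - median(g2)) for e in g2]
db1, db2 = sum(d1) / len(d1), sum(d2) / len(d2)
s2 = (sum((d - db1) ** 2 for d in d1) +
      sum((d - db2) ** 2 for d in d2)) / (n - 2)
tL = (db1 - db2) / sqrt(s2 * (1 / len(d1) + 1 / len(d2)))
# |tL| ~ 0.48 < t(0.975; 16) = 2.120, so constant variance is not rejected
```

Running this reproduces the worksheet quantities above: r ≈ 0.981, group medians −2.28523 and 0.02349, pooled variance 8.6975, and t_L* ≈ 0.480.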