# 296 Class Note for STAT 51200 at Purdue

This 24-page set of class notes was uploaded by an elite notetaker on Friday, February 6, 2015. The notes belong to a course at Purdue University taught by a professor in Fall. Since upload, they have received 17 views.

Date Created: 02/06/15

## Stat 512: Topic 1

Topic01.doc, 8/22/05 11:34 AM, 24 pages

### Topic Overview

This topic will cover:

- Course Overview & Policies
- SAS
- KNNL Chapter 1 (much should be review): simple linear regression
- KNNL Chapter 2: inference in simple linear regression, prediction intervals and confidence bands, ANOVA tables, General Linear Test

Class website: http://web.ics.purdue.edu/~simonsen/stat512

Class policies: refer to handout.

### Overview

We will cover

- simple linear regression (SLR): Chapters 1-5
- multiple regression (MLR): Chapters 6-11
- analysis of variance (ANOVA): Chapters 16-25

The emphasis will be placed on using selected practical tools (using SAS) rather than on the mathematical manipulations. We want to understand the theory so that we can apply it appropriately. Some of the material on SLR will be review, but our goal with SLR is to be able to generalize the methods to MLR.

### References

- Text: *Applied Linear Statistical Models* (4th ed.) by Neter, Kutner, Nachtsheim, and Wasserman (cited in these notes as KNNL or NKNW)
- *SAS System for Regression* by Freund and Little
- *SAS/STAT User's Guide*, Vol. 1 and 2
- *SAS System for Elementary Analysis* by Schlotzhauer
- SAS Help menus

### Getting Help with SAS

- Statistical Consulting Service, Math B5. Hours: 10-4, M through F. http://www.stat.purdue.edu/consulting
- MW Room: help with SAS and Excel for multiple Stat courses. Hours: 7-9, M through Th, starting the second week of classes; staffed with graduate student TAs.

### SAS

SAS is the program we will use to perform data analysis for this class. I will often give examples from SAS in class. The commands are meant to be all together as one program, but it will be easier to understand if I show each command followed by its output. The programs will be available for you to download from the website.

I will use the following font convention in the lecture notes: SAS input will look like this (Courier New), and SAS output will look like this (SAS Monospace). I will usually have to edit the output somewhat to get it to fit on the page of notes. My own comments will be in regular Times (printout) or Arial (lecture) fonts like this. Let me know if you get confused about what is input, output, or my comments.

You should run the SAS programs yourself to see the real output and experiment with changing the commands to learn how they work. I will tell you the names of all SAS files I use in these notes. If the notes differ from the SAS file, take the SAS file to be correct, since there may be cut-and-paste errors.

There is a tutorial in SAS to help you get started: Help → Getting Started with SAS Software. You should spend some time before next week getting comfortable with SAS (see HW 0). For today, don't worry about the detailed syntax of the commands. Just try to get a sense of what is going on.

### Example: Price Analysis for Diamond Rings in Singapore

Variables:

- response variable: price in Singapore dollars (Y)
- explanatory variable: weight of diamond in carats (X)

Goals:

- create a scatterplot of the data
- fit a regression line
- predict the price of a sale for a 0.43 carat diamond ring

#### SAS Data Step

File `diamond.sas` on website. One way to input data in SAS is to just type or paste it in. In this case, we have a sequence of ordered pairs (weight, price). The last pair has a missing price (`.`) for the 0.43 carat ring we want to predict.

    data diamonds;
      input weight price @@;   * @@ reads several pairs per line;
      cards;
    .17 355 .16 328 .17 350 .18 325 .25 642 .16 342
    .15 322 .19 485 .21 483 .15 323 .18 462 .28 823
    .16 336 .20 498 .23 595 .29 860 .12 223 .26 663
    .25 750 .27 720 .18 468 .16 345 .17 352 .16 332
    .17 353 .18 438 .17 318 .18 419 .17 346 .15 315
    .17 350 .32 918 .32 919 .15 298 .16 339 .16 338
    .23 595 .23 553 .17 345 .33 945 .25 655 .35 1086
    .18 443 .25 678 .25 675 .15 287 .26 693 .15 316
    .43 .
    ;
    data diamonds1;
      set diamonds;
      if price ne .;

Syntax notes:

- Each line must end with a semicolon.
- There is no output from this statement, but information does appear in the log window.
- Often you will obtain data from an existing SAS file or import it from another file, such as a spreadsheet. Examples showing how to do this will come later.

#### SAS Proc Print

Now we want to see what the data look like.

    proc print data=diamonds;
    run;

    Obs    weight    price
      1     0.17      355
      2     0.16      328
      3     0.17      350
    ...
     47     0.26      693
     48     0.15      316
     49     0.43        .

#### SAS Proc Gplot

We want to plot the data as a scatterplot, using circles to represent data points and adding a smoothing curve to see if it looks linear. The symbol statement `v=circle` (`v` stands for "value") lets us do this. The symbol statement `i=sm70` will add a smooth line using splines (interpolation = smooth). These are options which stay on until you turn them off. In order for the smoothing to work properly, we need to sort the data by the X variable.

    proc sort data=diamonds1; by weight;
    symbol1 v=circle i=sm70;
    title1 'Diamond Ring Price Study';
    title2 'Scatter plot of Price vs. Weight with Smoothing Curve';
    axis1 label=('Weight (Carats)');
    axis2 label=(angle=90 'Price (Singapore $$)');
    proc gplot data=diamonds1;
      plot price*weight / haxis=axis1 vaxis=axis2;
    run;

[Figure: Diamond Ring Price Study, scatter plot of Price vs. Weight with Smoothing Curve. Horizontal axis: Weight (Carats); vertical axis: Price (Singapore dollars).]

Now we want to use simple linear regression to fit a line through the data. We use the symbol option `i=rl`, meaning "interpolation regression line" (that's an "L", not a one).

    symbol1 v=circle i=rl;
    title2 'Scatter plot of Price vs. Weight with Regression Line';
    proc gplot data=diamonds1;
      plot price*weight / haxis=axis1 vaxis=axis2;
    run;

[Figure: Diamond Ring Price Study, scatter plot of Price vs. Weight with Regression Line. Horizontal axis: Weight (Carats); vertical axis: Price (Singapore dollars).]

#### SAS Proc Reg

We use `proc reg` (regression) to estimate a regression line and calculate predictors and residuals from the straight line. We tell it what the data are, what the model is, and what options we want.

    proc reg data=diamonds;
      model price=weight / p r;
      output out=diag p=pred r=resid;
      id weight;
    run;

                          Analysis of Variance
                                 Sum of        Mean
    Source             DF       Squares      Square    F Value    Pr > F
    Model               1       2098596     2098596    2069.99    <.0001
    Error              46         46636  1013.81886
    Corrected Total    47       2145232

    Root MSE          31.84052    R-Square    0.9783
    Dependent Mean   500.08333    Adj R-Sq    0.9778
    Coeff Var          6.36704

                          Parameter Estimates
                       Parameter     Standard
    Variable    DF      Estimate        Error    t Value    Pr > |t|
    Intercept    1    -259.62591     17.31886     -14.99      <.0001
    weight       1    3721.02485     81.78588      45.50      <.0001

The predicted values and residuals are then listed:

    proc print data=diag;
    run;

                            Output Statistics
                     Dep Var   Predicted     Std Error               Std Error
    Obs   weight       price       Value  Mean Predict    Residual    Residual
      1     0.17    355.0000    372.9483        5.3786    -17.9483      31.383
      2     0.16    328.0000    335.7381        5.8454     -7.7381      31.299
      3     0.17    350.0000    372.9483        5.3786    -22.9483      31.383
      4     0.18    325.0000    410.1586        5.0028    -85.1586      31.445
      5     0.25    642.0000    670.6303        5.9307    -28.6303      31.283
    ...
     46     0.15    287.0000    298.5278        6.3833    -11.5278      31.194
     47     0.26    693.0000    707.8406        6.4787    -14.8406      31.174
     48     0.15    316.0000    298.5278        6.3833     17.4722      31.194
     49     0.43           .        1340       19.0332           .           .

### Simple Linear Regression: Why Use It?

- Descriptive purposes (cause/effect relationships)
- Control (often of cost)
- Prediction of outcomes

### Data for Simple Linear Regression

- Observe $i = 1, 2, \ldots, n$ pairs of variables (explanatory, response).
- Each pair is often called a case or a data point.
- $Y_i$ = $i$th response variable
- $X_i$ = $i$th explanatory variable

### Simple Linear Regression Model

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \quad \text{for } i = 1, 2, \ldots, n$$

#### Simple Linear Regression Model Parameters

- $\beta_0$ is the intercept.
- $\beta_1$ is the slope.
- $\varepsilon_i$ are independent, normally distributed random errors with mean 0 and variance $\sigma^2$, i.e. $\varepsilon_i \sim N(0, \sigma^2)$.

#### Features of Simple Linear Regression Model

- Individual observations: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
- Since $\varepsilon_i$ are random, $Y_i$ are also random and $E(Y_i) = \beta_0 + \beta_1 X_i + E(\varepsilon_i) = \beta_0 + \beta_1 X_i$.
- $\mathrm{Var}(Y_i) = 0 + \mathrm{Var}(\varepsilon_i) = \sigma^2$
- Since $\varepsilon_i$ is normally distributed, $Y_i \sim N(\beta_0 + \beta_1 X_i, \sigma^2)$ (see A.36, p. 1319).

### Fitted Regression Equation and Residuals

We must estimate the parameters $\beta_0, \beta_1, \sigma^2$ from the data. The estimates are denoted $b_0, b_1, s^2$. These give us the fitted or estimated regression line

$$\hat{Y}_i = b_0 + b_1 X_i,$$

where

- $b_0$ is the estimated intercept.
- $b_1$ is the estimated slope.
- $\hat{Y}_i$ is the estimated mean for $Y$ when the predictor is $X_i$ (i.e., the point on the fitted line).
- $e_i$ is the residual for the $i$th case (the vertical distance from the data point to the fitted regression line). Note that $e_i = Y_i - \hat{Y}_i = Y_i - (b_0 + b_1 X_i)$.

#### Using SAS to plot the residuals (Diamond Example)

When we called PROC REG earlier, we assigned the residuals to the name `resid` and placed them in a new data set called `diag`. We now plot them vs. X.

    symbol1 v=circle i=NONE;
    title2 color=blue 'Residual Plot';
    axis2 label=(angle=90 'Residual');
    proc gplot data=diag;
      plot resid*weight / haxis=axis1 vaxis=axis2 vref=0;
      where price ne .;
    run;

[Figure: Diamond Ring Price Study, Residual Plot. Horizontal axis: Weight (Carats); vertical axis: Residual, with a reference line at 0.]

Notice there does not appear to be any obvious pattern in the residuals. We'll talk a lot more about diagnostics later, but for now, you should know that looking at residual plots is an important way to check assumptions.

### Least Squares

- We want to find "best" estimates $b_0$ and $b_1$.
- They will minimize the sum of the squared residuals $\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(Y_i - (b_0 + b_1 X_i)\right)^2$.
- Use calculus: take the derivative with respect to $b_0$ and with respect to $b_1$, set the two resulting equations equal to zero, and solve for $b_0$ and $b_1$ (see NKNW pp. 19-20).

#### Least Squares Solution

These are the best estimates for $\beta_1$ and $\beta_0$:

$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{SS_{XY}}{SS_X}, \qquad b_0 = \bar{Y} - b_1 \bar{X}$$

These are also maximum likelihood estimators (see NKNW pp. 30-35). This estimate is the "best" because it is unbiased and minimum variance.

#### Maximum Likelihood

$$Y_i \sim N(\beta_0 + \beta_1 X_i, \sigma^2)$$

$$f_i = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2}\left(\frac{Y_i - \beta_0 - \beta_1 X_i}{\sigma}\right)^2\right\}$$

$$L = f_1 \times f_2 \times \cdots \times f_n \quad \text{(likelihood function)}$$

Find values for $\beta_0$ and $\beta_1$ which maximize $L$. These are the SAME as the least squares estimators $b_0$ and $b_1$.

### Estimation of $\sigma^2$

We also need to estimate $\sigma^2$ with $s^2$. We use the sum of the squared residuals, SSE, divided by the degrees of freedom $n-2$:

$$s^2 = \frac{\sum (Y_i - \hat{Y}_i)^2}{n-2} = \frac{\sum e_i^2}{n-2} = \frac{SSE}{df_E} = MSE, \qquad s = \sqrt{s^2} = \text{Root MSE},$$

where $SSE = \sum e_i^2$ is the sum of squared residuals or "errors", and MSE stands for "mean squared error".

There will be other estimated variances for other quantities, and these will be denoted $s^2\{\text{other quantity}\}$, e.g. $s^2\{b_1\}$. Without any $\{\,\}$, $s^2$ refers to the value above, that is, the estimated variance of the residuals.

#### Identifying these things in the SAS output

                          Analysis of Variance
                                 Sum of        Mean
    Source             DF       Squares      Square    F Value    Pr > F
    Model               1       2098596     2098596    2069.99    <.0001
    Error              46         46636  1013.81886
    Corrected Total    47       2145232

    Root MSE          31.84052    R-Square    0.9783
    Dependent Mean   500.08333    Adj R-Sq    0.9778
    Coeff Var          6.36704

                          Parameter Estimates
                     Parameter    Standard
    Variable   DF     Estimate       Error   t Value   Pr > |t|    95% Confidence Limits
    Intercept   1   -259.62591    17.31886    -14.99     <.0001   -294.48696   -224.76486
    weight      1   3721.02485    81.78588     45.50     <.0001   3556.39841   3885.65129

### Review of Statistical Inference for Normal Samples

This should be review! In 503/511 you learned how to construct confidence intervals and do hypothesis tests for the mean of a normal distribution, based on a random sample. Suppose we have an iid random sample $W_1, \ldots, W_n$ from a normal distribution. (Usually, I would use the symbol X or Y, but I want to keep it general and not use the symbols we use for regression.)

We have $W_i \sim_{iid} N(\mu, \sigma^2)$, where $\mu$ and $\sigma^2$ are unknown.

- $\bar{W} = \frac{\sum W_i}{n}$ = sample mean
- $SS_W = \sum (W_i - \bar{W})^2$ = sum of squares for $W$
- $s^2\{W\} = \frac{\sum (W_i - \bar{W})^2}{n-1} = \frac{SS_W}{n-1}$ = sample variance
- $s\{W\} = \sqrt{s^2\{W\}}$ = sample standard deviation
- $s\{\bar{W}\} = \frac{s\{W\}}{\sqrt{n}}$ = standard error of the mean

From these definitions, we obtain

$$\bar{W} \sim N\left(\mu, \frac{\sigma^2}{n}\right), \qquad T = \frac{\bar{W} - \mu}{s\{\bar{W}\}} \text{ has a t-distribution with } n-1 \text{ df (in short, } T \sim t_{n-1}\text{)}.$$

This leads to inference:

- confidence intervals for $\mu$
- significance tests for $\mu$

#### Confidence Intervals

We are $100(1-\alpha)\%$ confident that the following interval contains $\mu$:

$$\bar{W} \pm t_c\, s\{\bar{W}\} = \left[\bar{W} - t_c\, s\{\bar{W}\},\; \bar{W} + t_c\, s\{\bar{W}\}\right],$$

where $t_c = t_{n-1}(1 - \alpha/2)$, the upper $(1 - \alpha/2)$ percentile of the t distribution with $n-1$ degrees of freedom, and $1 - \alpha$ is the confidence level (e.g. 0.95 = 95%, so $\alpha = 0.05$).

#### Significance tests

To test whether $\mu$ has a specific value, we use a t-test (one-sample, non-directional).

- $H_0: \mu = \mu_0$ vs. $H_a: \mu \neq \mu_0$
- $t = \frac{\bar{W} - \mu_0}{s\{\bar{W}\}}$ has a $t_{n-1}$ distribution under $H_0$.
- Reject $H_0$ if $|t| \geq t_c$, where $t_c = t_{n-1}(1 - \alpha/2)$.
- P-value $= \mathrm{Prob}_{H_0}(|T| > |t|)$, where $T \sim t_{n-1}$.

The p-value is twice the area in the upper tail of the $t_{n-1}$ distribution above the observed $|t|$. It is the probability of observing a test statistic at least as extreme as what was actually observed, when the null hypothesis is really true. We reject $H_0$ if $P \leq \alpha$.

**Important notational comment.** The text says "conclude $H_A$" if $t$ is in the rejection region ($|t| \geq t_c$), otherwise "conclude $H_0$". This is shorthand for:

- "conclude $H_A$" means "there is sufficient evidence in the data to conclude that $H_0$ is false, and so we assume that $H_A$ is true."
- "conclude $H_0$" means "there is insufficient evidence in the data to conclude that either $H_0$ or $H_A$ is true or false, so we default to assuming that $H_0$ is true."

Notice that a failure to reject $H_0$ does not mean that there was any evidence in favour of $H_0$.

NOTE: In this course, $\alpha = 0.05$ unless otherwise specified.

### 2.1 Inference about $\beta_1$

$$b_1 \sim N\left(\beta_1, \sigma^2\{b_1\}\right), \quad \text{where } \sigma^2\{b_1\} = \frac{\sigma^2}{SS_X}$$

$$t = \frac{b_1 - \beta_1}{s\{b_1\}} \sim t_{n-2}, \quad \text{where } s\{b_1\} = \sqrt{\frac{s^2}{SS_X}}$$

In particular, $t = b_1 / s\{b_1\} \sim t_{n-2}$ if $\beta_1 = 0$.

According to our discussion above for $W$, you therefore know how to obtain CIs and t-tests for $\beta_1$. (I'll go through it now but not in the future.) There is one important difference: the df here are $n-2$, not $n-1$, because we are also estimating $\beta_0$.

#### Confidence Interval for $\beta_1$

- $b_1 \pm t_c\, s\{b_1\}$
- where $t_c = t_{n-2}(1 - \alpha/2)$, the upper $100(1 - \alpha/2)$ percentile of the t distribution with $n-2$ degrees of freedom
- $1 - \alpha$ is the confidence level.

#### Significance tests for $\beta_1$

- $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$
- $t = \frac{b_1 - 0}{s\{b_1\}}$
- Reject $H_0$ if $|t| \geq t_c$, $t_c = t_{n-2}(1 - \alpha/2)$.
- $P = \mathrm{Prob}(|T| > |t|)$, where $T \sim t_{n-2}$.

### 2.2 Inference about $\beta_0$

$$b_0 \sim N\left(\beta_0, \sigma^2\{b_0\}\right), \quad \text{where } \sigma^2\{b_0\} = \sigma^2\left[\frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2}\right]$$

For $s\{b_0\}$, replace $\sigma^2$ by $s^2$ and take the square root:

$$s\{b_0\} = s\sqrt{\frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2}}, \qquad t = \frac{b_0 - \beta_0}{s\{b_0\}} \sim t_{n-2}$$

#### Confidence Interval for $\beta_0$

- $b_0 \pm t_c\, s\{b_0\}$
- where $t_c = t_{n-2}(1 - \alpha/2)$ and $1 - \alpha$ is the confidence level.

#### Significance tests for $\beta_0$

- $H_0: \beta_0 = 0$ vs. $H_A: \beta_0 \neq 0$
- $t = \frac{b_0 - 0}{s\{b_0\}}$
- Reject $H_0$ if $|t| \geq t_c$, $t_c = t_{n-2}(1 - \alpha/2)$.
- $P = \mathrm{Prob}(|T| > |t|)$, where $T \sim t_{n-2}$.

#### Notes

- The normality of $b_0$ and $b_1$ follows from the fact that each of these is a linear combination of the $Y_i$, each of which is an independent normal. For $b_1$ see NKNW p. 46. For $b_0$ try this as an exercise.
- Usually the CI and significance test for $\beta_0$ is not of interest.
- If the $\varepsilon_i$ are not normal but are approximately normal, then the CIs and significance tests are generally reasonable approximations.
- These procedures can easily be modified to produce one-sided confidence intervals and significance tests.
- Because $\sigma^2\{b_1\} = \frac{\sigma^2}{\sum (X_i - \bar{X})^2}$, we can make this quantity small by making $\sum (X_i - \bar{X})^2$ large, i.e. by spreading out the $X_i$'s.

#### SAS Proc Reg

Here is how to get the parameter estimates in SAS (still using `diamond.sas`). The option `clb` asks SAS to give you confidence limits for the parameter estimates $b_0$ and $b_1$.

    proc reg data=diamonds;
      model price=weight / clb;

                          Parameter Estimates
                     Parameter    Standard
    Variable   DF     Estimate       Error   t Value   Pr > |t|    95% Confidence Limits
    Intercept   1   -259.62591    17.31886    -14.99     <.0001   -294.48696   -224.76486
    weight      1   3721.02485    81.78588     45.50     <.0001   3556.39841   3885.65129

#### Points to remember

- What is the default value of $\alpha$ that we use in this class?
- What is the default confidence level that we use in this class?
- Suppose you could choose the X's. How would you choose them if you wanted a precise estimate of the slope? intercept? both?

### Summary of Inference

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$ are independent, random errors.

#### Parameter Estimators

- For $\beta_1$: $b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$
- For $\beta_0$: $b_0 = \bar{Y} - b_1 \bar{X}$
- For $\sigma^2$: $s^2 = \frac{\sum (Y_i - b_0 - b_1 X_i)^2}{n-2}$

#### 95% Confidence Intervals for $\beta_0$ and $\beta_1$

- $b_1 \pm t_c\, s\{b_1\}$
- $b_0 \pm t_c\, s\{b_0\}$
- where $t_c = t_{n-2}(1 - \alpha/2)$, the $100(1 - \alpha/2)$ upper percentile of the t distribution with $n-2$ degrees of freedom.

#### Significance tests for $\beta_0$ and $\beta_1$

- $H_0: \beta_0 = 0$, $H_a: \beta_0 \neq 0$: $t = \frac{b_0}{s\{b_0\}} \sim t_{n-2}$ under $H_0$.
- $H_0: \beta_1 = 0$, $H_a: \beta_1 \neq 0$: $t = \frac{b_1}{s\{b_1\}} \sim t_{n-2}$ under $H_0$.
- Reject $H_0$ if the P-value is small (< 0.05).

### 2.3 Power

The power of a significance test is the probability that the null hypothesis will be rejected when, in fact, it is false. This probability depends on the particular value of the parameter in the alternative space. When we do power calculations, we are trying to answer questions like the following:

"Suppose that the parameter $\beta_1$ truly has the value 1.5, say, and we are going to collect a sample of a particular size $n$ and with a particular $SS_X$. What is the probability that, based on our (not yet collected) data, we will reject $H_0$?"

#### Power for $\beta_1$

- $H_0: \beta_1 = 0$, $H_a: \beta_1 \neq 0$
- $t = \frac{b_1}{s\{b_1\}}$, $t_c = t_{n-2}(1 - \alpha/2)$
- For $\alpha = 0.05$, we reject $H_0$ when $|t| \geq t_c$,
- so we need to find $P(|t| \geq t_c)$ for arbitrary values of $\beta_1 \neq 0$.
- When $\beta_1 = 0$, the calculation gives $\alpha$ ($H_0$ is true).
- $t \sim t_{n-2}(\delta)$, a noncentral t distribution: a t-distribution not centered at 0.
- $\delta = \frac{\beta_1}{\sigma\{b_1\}}$ is the noncentrality parameter: it represents, on a "standardized" scale, how far from true $H_0$ is (kind of like "effect size").
- We need to assume values for $\sigma^2\{b_1\} = \frac{\sigma^2}{\sum (X_i - \bar{X})^2}$ and $n$.
- NKNW use tables (see pp. 55-56); we will use SAS.

#### Example of Power for $\beta_1$

- Response variable: Work Hours
- Explanatory variable: Lot Size
- See page 21 for details of this study, pp. 55-56 for details regarding power.
- We assume $\sigma^2 = 2500$, $n = 25$, and $SS_X = 19800$, so we have $\sigma^2\{b_1\} = \frac{\sigma^2}{\sum (X_i - \bar{X})^2} = 0.1263$.
- Consider $\beta_1 = 1.5$.
- We now can calculate $\delta = \frac{\beta_1}{\sigma\{b_1\}}$.
- With $t \sim t_{n-2}(\delta)$, we want to find $P(|t| \geq t_c)$.
- We use a function that calculates the cumulative distribution function for the noncentral t distribution.

See program `nknw055.sas` for the power calculations.

    data a1;
      n=25; sig2=2500; ssx=19800; alpha=.05;
      sig2b1=sig2/ssx; df=n-2;
      beta1=1.5;
      delta=abs(beta1)/sqrt(sig2b1);
      tstar=tinv(1-alpha/2,df);
      power=1-probt(tstar,df,delta)+probt(-tstar,df,delta);
      output;
    proc print data=a1; run;

    Obs   n   sig2    ssx   alpha   sig2b1   df   beta1    delta     tstar     power
      1  25   2500  19800    0.05  0.12626   23     1.5  4.22137   2.06866   0.98121

The same calculation over a range of slopes gives the power curve:

    data a2;
      n=25; sig2=2500; ssx=19800; alpha=.05;
      sig2b1=sig2/ssx; df=n-2;
      do beta1=-2.0 to 2.0 by .05;
        delta=abs(beta1)/sqrt(sig2b1);
        tstar=tinv(1-alpha/2,df);
        power=1-probt(tstar,df,delta)+probt(-tstar,df,delta);
        output;
      end;
    proc print data=a2; run;
    title1 'Power for the slope in simple linear regression';
    symbol1 v=none i=join;
    proc gplot data=a2; plot power*beta1; run;

[Figure: Power for the slope in simple linear regression. Horizontal axis: beta1; vertical axis: power.]

### 2.4 Estimation of $E(Y_h)$

- $E(Y_h) = \mu_h = \beta_0 + \beta_1 X_h$, the mean value of $Y$ for the subpopulation with $X = X_h$.
- We will estimate $E(Y_h)$ by $\hat{Y}_h = \hat{\mu}_h = b_0 + b_1 X_h$.
- KNNL use $\hat{Y}_h$ to denote this estimate; we will use the symbols $\hat{Y}_h = \hat{\mu}_h$ interchangeably.
- See equation (2.28) on p. 57.

#### Theory for Estimation of $E(Y_h)$

$\hat{Y}_h$ is normal with mean $\mu_h$ and variance

$$\sigma^2\{\hat{Y}_h\} = \sigma^2\left[\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right].$$

- The normality is a consequence of the fact that $b_0 + b_1 X_h$ is a linear combination of the $Y_i$'s.
- The variance has two components: one for the intercept and one for the slope. The variance associated with the slope depends on the distance $X_h - \bar{X}$. The estimation is more accurate near $\bar{X}$.
- See NKNW pp. 56-59 for details.

#### Application of the Theory

We estimate $\sigma^2\{\hat{Y}_h\}$ by

$$s^2\{\hat{Y}_h\} = s^2\left[\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right].$$

It follows that $t = \frac{\hat{Y}_h - E(Y_h)}{s\{\hat{Y}_h\}} \sim t_{n-2}$; proceed as usual.

#### 95% Confidence Interval for $E(Y_h)$

$\hat{Y}_h \pm t_c\, s\{\hat{Y}_h\}$, where $t_c = t_{n-2}(0.975)$.

NOTE: significance tests can be performed for $\hat{Y}_h$, but they are rarely used in practice.

#### Example

See program `nknw060.sas` for the estimation of subpopulation means. The option `clm` to the model statement asks for confidence limits for the mean $\hat{Y}_h$.

    data a1;
      infile 'H:\Stat512\Datasets\Ch01ta01.dat';
      input size hours;
    data a2; size=65; output;
             size=100; output;
    data a3; set a1 a2;
    proc print data=a3; run;
    proc reg data=a3;
      model hours=size / clm;
      id size;
    run;

                   Dep Var   Predicted     Std Error
    Obs   size       hours       Value  Mean Predict        95% CL Mean
     25     70    323.0000    312.2800        9.7647   292.0803   332.4797
     26     65           .    294.4290        9.9176   273.9129   314.9451
     27    100           .    419.3861       14.2723   389.8615   448.9106

### 2.5 Prediction of $Y_{h(new)}$

We wish to construct an interval into which we predict the next observation (for a given $X_h$) will fall.

- The only difference between this and $E(Y_h)$ is that the variance is different.
- In prediction, we have two variance components: (1) variance associated with the estimation of the mean response $\hat{Y}_h$ and (2) variability in a single observation taken from the distribution with that mean.
- $Y_{h(new)} = \beta_0 + \beta_1 X_h + \varepsilon$ is the value for a new observation with $X = X_h$.

We estimate $Y_{h(new)}$ starting with the predicted value $\hat{Y}_h$. This is the centre of the confidence interval, just as it was for $E(Y_h)$. However, the width of the CI is different because they have different variances.

$$\sigma^2\{pred\} = \mathrm{Var}(Y_{h(new)} - \hat{Y}_h) = \mathrm{Var}(\hat{Y}_h) + \mathrm{Var}(\varepsilon)$$

$$s^2\{pred\} = s^2\{\hat{Y}_h\} + s^2 = s^2\left[1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right]$$

$$\frac{Y_{h(new)} - \hat{Y}_h}{s\{pred\}} \sim t_{n-2}$$

$s\{pred\}$ denotes the estimated standard deviation of a new observation with $X = X_h$. It takes into account variability in estimating the mean $\hat{Y}_h$ as well as variability in a single observation from a distribution with that mean.

#### Notes

- The procedure can be modified for the mean of $m$ observations at $X = X_h$ (see 2.39a on page 66).
- Standard error is affected by how far $X_h$ is from $\bar{X}$ (see Figure 2.3). As was the case for the mean response, prediction is more accurate near $\bar{X}$.

See program `nknw065.sas` for the prediction interval example. The `cli` option to the model statement asks SAS to give confidence limits for an individual observation (cf. `clb` and `clm`).

    data a1;
      infile 'H:\Stat512\Datasets\Ch01ta01.dat';
      input size hours;
    data a2; size=65; output;
             size=100; output;
    data a3; set a1 a2;
    proc reg data=a3;
      model hours=size / cli;
    run;

                   Dep Var   Predicted     Std Error
    Obs   size       hours       Value  Mean Predict      95% CL Predict     Residual
     25     70    323.0000    312.2800        9.7647   209.2811   415.2789    10.7200
     26     65           .    294.4290        9.9176   191.3676   397.4904          .
     27    100           .    419.3861       14.2723   314.1604   524.6117          .

#### Notes

- The standard error (Std Error Mean Predict) given in this output is the standard error of $\hat{Y}_h$, not $s\{pred\}$. (That's why the word "Mean" is in there.) The CL Predict label tells you that the confidence interval is for the prediction of a new observation.
- The prediction interval for $Y_{h(new)}$ is wider than the confidence interval for $\hat{Y}_h$ because it has a larger variance.

### 2.6 Working-Hotelling Confidence Bands for the Entire Regression Line

- This is a confidence limit for the whole line at once, in contrast to the confidence interval for just one $\hat{Y}_h$ at a time.
- The regression line $b_0 + b_1 X_h$ describes $E(Y_h)$ for given $X_h$.
- We have a 95% CI for $E(Y_h) = \hat{Y}_h$ pertaining to a specific $X_h$.
- We want a 95% confidence band for all $X_h$.
- The confidence limit is given by $\hat{Y}_h \pm W s\{\hat{Y}_h\}$, where $W^2 = 2F_{2,\,n-2}(1 - \alpha)$. Since we are doing all values of $X_h$ at once, it will be wider at each $X_h$ than CIs for individual $X_h$.
- The boundary values define a hyperbola.
- The theory for this comes from the joint confidence region for $(\beta_0, \beta_1)$, which is an ellipse (see Stat 524).
- We are used to constructing CIs with t's, not W's. Can we fake it?
- We can find a new, smaller alpha for $t_c$ that would give the same results: kind of an "effective alpha" that takes into account that you are estimating the entire line.
- We find $W^2$ for our desired true $\alpha$, and then find the effective $\alpha_t$ to use with $t_c$ that gives $W(\alpha) = t_c(\alpha_t)$.

#### Confidence Band for Regression Line

See program `nknw067.sas` for the regression line confidence band.

    data a1;
      n=25; alpha=.10; dfn=2; dfd=n-2;
      w2=2*finv(1-alpha,dfn,dfd);
      w=sqrt(w2);
      alphat=2*(1-probt(w,dfd));
      tstar=tinv(1-alphat/2,dfd);
      output;
    proc print data=a1; run;

Note: `1-probt(w,dfd)` gives the area under the t-distribution to the right of `w`. We have to double that to get the total area in both tails.

    Obs   n   alpha   dfn   dfd       w2         w     alphat     tstar
      1  25     0.1     2    23  5.09858   2.25800   0.033740   2.25800

Since the effective $\alpha_t$ is about 0.03, the band is drawn with a 97% level:

    data a2;
      infile 'H:\Stat512\Datasets\Ch01ta01.dat';
      input size hours;
    symbol1 v=circle i=rlclm97;
    proc gplot data=a2;
      plot hours*size;

[Figure: Working-Hotelling Confidence Bands for Toluca Company Example. Horizontal axis: size; vertical axis: hours.]

#### Estimation of $E(Y_h)$ compared to Prediction of $Y_h$

$$\hat{Y}_h = b_0 + b_1 X_h$$

$$s^2\{\hat{Y}_h\} = s^2\left[\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right], \qquad s^2\{pred\} = s^2\left[1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right]$$

See program `nknw067x.sas` for the `clm` (mean) and `cli` (individual) plots.

    data a1;
      infile 'H:\Stat512\Datasets\Ch01ta01.dat';
      input size hours;

Confidence intervals:

    symbol1 v=circle i=rlclm95;
    proc gplot data=a1;
      plot hours*size;
    run;

[Figure: 95% Confidence bands for the mean. Horizontal axis: size; vertical axis: hours.]

Prediction intervals:

    symbol1 v=circle i=rlcli95;
    proc gplot data=a1;
      plot hours*size;
    run;

[Figure: 95% Prediction bands. Horizontal axis: size; vertical axis: hours.]

### 2.7 Analysis of Variance (ANOVA) Table

- Organize results arithmetically.
- Total sum of squares in $Y$ is $SS_Y = \sum (Y_i - \bar{Y})^2$.
- Partition this into two sources: Model (explained by regression) and Error (unexplained / residual).

$$Y_i - \bar{Y} = (Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})$$

$$\sum (Y_i - \bar{Y})^2 = \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2 \quad \text{(cross terms cancel: see p. 72)}$$

#### Total Sum of Squares

- Consider ignoring $X_h$ to predict $E(Y_h)$. Then the best predictor would be the sample mean $\bar{Y}$.
- SST is the sum of squared deviations from this predictor: $SST = SS_Y = \sum (Y_i - \bar{Y})^2$.
- The total degrees of freedom is $df_T = n - 1$.
- $MST = SST/df_T$
- MST is the usual estimate of the variance of $Y$ if there are no explanatory variables, also known as $s^2\{Y\}$.
- SAS uses the term Corrected Total for this source. "Uncorrected" is $\sum Y_i^2$. The term "corrected" means that we subtract off the mean $\bar{Y}$ before squaring.

#### Model Sum of Squares

- $SSM = \sum (\hat{Y}_i - \bar{Y})^2$
- The model degrees of freedom is $df_M = 1$, since one parameter (slope) is estimated.
- $MSM = SSM/df_M$
- KNNL uses the word regression for what SAS calls model. So SSR (KNNL) is the same as SS Model (SAS). I prefer to use the terms SSM and $df_M$ because R stands for regression, residual, and reduced (later), which I find confusing.

#### Error Sum of Squares

- $SSE = \sum (Y_i - \hat{Y}_i)^2$
- The error degrees of freedom is $df_E = n - 2$, since estimates have been made for both slope and intercept.
- $MSE = SSE/df_E$
- $MSE = s^2$ is an estimate of the variance of $Y$ taking into account (or conditioning on) the explanatory variable(s).

#### ANOVA Table for SLR

| Source | df | SS | MS |
|---|---|---|---|
| Model (Regression) | $1$ | $\sum (\hat{Y}_i - \bar{Y})^2$ | $SSM/df_M$ |
| Error | $n-2$ | $\sum (Y_i - \hat{Y}_i)^2$ | $SSE/df_E$ |
| Total | $n-1$ | $\sum (Y_i - \bar{Y})^2$ | $SST/df_T$ |

#### Note about degrees of freedom

Occasionally, you will run across a reference to "degrees of freedom" without specifying whether this is model, error, or total. Sometimes it will be clear from context, and although that is sloppy usage, you can generally assume that if it is not specified, it means error degrees of freedom.

#### Expected Mean Squares

- MSM and MSE are random variables.
- $E(MSM) = \sigma^2 + \beta_1^2 SS_X$
- $E(MSE) = \sigma^2$
- When $H_0: \beta_1 = 0$ is true, then $E(MSM) = E(MSE)$. This makes sense, since in that case $E(Y_i) = \beta_0$ does not depend on $X_i$.

#### F test

- $F = MSM/MSE \sim F_{df_M,\, df_E} = F_{1,\, n-2}$
- See NKNW pp. 76-77.
- When $H_0: \beta_1 = 0$ is false, MSM tends to be larger than MSE, so we would want to reject $H_0$ when $F$ is large.
- Generally, our decision rule is to reject the null hypothesis if $F \geq F_c = F_{df_M,\, df_E}(1 - \alpha) = F_{1,\, n-2}(0.95)$.
- In practice, we use p-values (and reject $H_0$ if the p-value is less than $\alpha$).
- Recall that $t = b_1/s\{b_1\}$ tests $H_0: \beta_1 = 0$. It can be shown that $t^2_{df} = F_{1,\, df}$. The two approaches give the same P-value; they are really the same test.
- Aside: when $H_0: \beta_1 = 0$ is false, $F$ has a noncentral F distribution; this can be used to calculate power.

#### ANOVA Table

| Source | df | SS | MS | F | P |
|---|---|---|---|---|---|
| Model | $1$ | SSM | MSM | MSM/MSE | p-value |
| Error | $n-2$ | SSE | MSE | | |
| Total | $n-1$ | | | | |

See program `nknw073.sas` for the program used to generate the other output used in this lecture.

    data a1;
      infile 'H:\Stat512\Datasets\Ch01ta01.dat';
      input size hours;
    proc reg data=a1;
      model hours=size;
    run;

                        Sum of         Mean
    Source      DF     Squares       Square    F Value    Pr > F
    Model        1      252378       252378     105.88    <.0001
    Error       23       54825   2383.71562
    C Total     24      307203

                     Parameter     Standard
    Variable    DF    Estimate        Error    t Value    Pr > |t|
    Intercept    1    62.36586     26.17743       2.38      0.0259
    size         1     3.57020      0.34697      10.29      <.0001

Note that $t^2 = 10.29^2 = 105.88 = F$.

### 2.8 General Linear Test

- A different view of the same problem (testing $\beta_1 = 0$). It may seem redundant now, but the concept is extremely useful in MLR.
- We want to compare two models: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ (full model) and $Y_i = \beta_0 + \varepsilon_i$ (reduced model).
- Compare using the error sum of squares.

Let $SSE(F)$ be the SSE for the full model, and let $SSE(R)$ be the SSE for the reduced model.

$$F = \frac{\left(SSE(R) - SSE(F)\right) / \left(df_{E(R)} - df_{E(F)}\right)}{SSE(F) / df_{E(F)}}$$

Compare to the critical value $F_c = F_{df_{E(R)} - df_{E(F)},\; df_{E(F)}}(1 - \alpha)$ to test $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$.

#### Test in Simple Linear Regression

- $SSE(R) = \sum (Y_i - b_0)^2 = \sum (Y_i - \bar{Y})^2 = SST$
- $SSE(F) = SST - SSM$ (the usual SSE)
- $df_{E(R)} = n - 1$, $df_{E(F)} = n - 2$, $df_{E(R)} - df_{E(F)} = 1$

$$F = \frac{(SST - SSE)/1}{SSE/(n-2)} = \frac{MSM}{MSE} \quad \text{(same test as before)}$$

This approach ("full" vs. "reduced") is more general, and we will see it again in MLR.

### Pearson Correlation

$r$ is the usual correlation coefficient. It is a number between $-1$ and $1$ and measures the strength of the linear relationship between two variables.

$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$$

Notice that

$$r = b_1 \sqrt{\frac{\sum (X_i - \bar{X})^2}{\sum (Y_i - \bar{Y})^2}}.$$

Test $H_0: \beta_1 = 0$ is similar to $H_0: \rho = 0$.

### $R^2$ and $r^2$

- $R^2$ is the ratio of explained and total variation: $R^2 = SSM/SST$.
- $r^2$ is the square of the correlation between $X$ and $Y$:

$$r^2 = b_1^2\, \frac{\sum (X_i - \bar{X})^2}{\sum (Y_i - \bar{Y})^2} = \frac{SSM}{SST}$$

In SLR, $r^2$ and $R^2$ are the same thing. However, in MLR they are different (there will be a different $r$ for each X variable, but only one $R^2$). $R^2$ is often multiplied by 100 and thereby expressed as a percent. In MLR, we often use the adjusted $R^2$, which has been adjusted to account for the number of variables in the model (more in chapter 6).

                        Sum of         Mean
    Source      DF     Squares       Square    F Value    Pr > F
    Model        1      252378       252378     105.88    <.0001
    Error       23       54825         2383
    C Total     24      307203

    R-Square    0.8215   (= SSM/SST = 1 - SSE/SST = 252378/307203)
    Adj R-Sq    0.8138   (= 1 - MSE/MST = 1 - 2383/(307203/24))
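
The least squares formulas in these notes ($b_1 = SS_{XY}/SS_X$, $b_0 = \bar{Y} - b_1\bar{X}$, $s^2 = SSE/(n-2)$) can be checked against the diamond data without SAS. The following is a minimal plain-Python sketch, not one of the course programs; the function name `fit_slr` and the dictionary keys are hypothetical names chosen for illustration. The results should agree with the PROC REG output quoted in the notes (slope about 3721.02, intercept about -259.63, Root MSE about 31.84).

```python
import math

# Diamond data as listed in the notes: (weight in carats, price in Singapore dollars).
DIAMONDS = [
    (0.17, 355), (0.16, 328), (0.17, 350), (0.18, 325), (0.25, 642), (0.16, 342),
    (0.15, 322), (0.19, 485), (0.21, 483), (0.15, 323), (0.18, 462), (0.28, 823),
    (0.16, 336), (0.20, 498), (0.23, 595), (0.29, 860), (0.12, 223), (0.26, 663),
    (0.25, 750), (0.27, 720), (0.18, 468), (0.16, 345), (0.17, 352), (0.16, 332),
    (0.17, 353), (0.18, 438), (0.17, 318), (0.18, 419), (0.17, 346), (0.15, 315),
    (0.17, 350), (0.32, 918), (0.32, 919), (0.15, 298), (0.16, 339), (0.16, 338),
    (0.23, 595), (0.23, 553), (0.17, 345), (0.33, 945), (0.25, 655), (0.35, 1086),
    (0.18, 443), (0.25, 678), (0.25, 675), (0.15, 287), (0.26, 693), (0.15, 316),
]


def fit_slr(pairs):
    """Least squares fit of y = b0 + b1*x (hypothetical helper, not from the notes)."""
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    ssx = sum((x - xbar) ** 2 for x, _ in pairs)
    ssxy = sum((x - xbar) * (y - ybar) for x, y in pairs)
    b1 = ssxy / ssx                      # estimated slope
    b0 = ybar - b1 * xbar                # estimated intercept
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in pairs)
    s = math.sqrt(sse / (n - 2))         # Root MSE, with n - 2 error df
    se_b1 = s / math.sqrt(ssx)           # s{b1}
    return {"n": n, "b0": b0, "b1": b1, "s": s, "se_b1": se_b1}


fit = fit_slr(DIAMONDS)
pred_043 = fit["b0"] + fit["b1"] * 0.43  # predicted price of a 0.43 carat ring
print(fit)
print("predicted price at 0.43 carats:", round(pred_043, 2))
```

Dividing `fit["b1"]` by `fit["se_b1"]` gives the t statistic for the slope, which can be compared with the "t Value" column of the SAS output.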
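
The power calculation for the slope (section 2.3) relies on the noncentral t distribution, which SAS provides through `probt`. As an illustration only, the sketch below reproduces the same probability in plain Python by writing $T = (Z + \delta)/\sqrt{V/df}$ with $Z$ standard normal and $V$ chi-square, and integrating over $V$ with Simpson's rule. The helper names, the integration limit, and the step count are assumptions made for this sketch, and the critical value 2.06866 is copied from the notes' SAS output rather than computed.

```python
import math


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def chi2_pdf(v, df):
    """Chi-square density with df degrees of freedom."""
    if v <= 0.0:
        return 0.0
    log_pdf = ((df / 2.0 - 1.0) * math.log(v) - v / 2.0
               - (df / 2.0) * math.log(2.0) - math.lgamma(df / 2.0))
    return math.exp(log_pdf)


def power_two_sided_t(delta, df, tcrit, upper=400.0, steps=4000):
    """P(|T| >= tcrit) for T ~ noncentral t(df, delta), by Simpson's rule.

    Conditional on V = v, T >= tcrit means Z >= tcrit*sqrt(v/df) - delta,
    and T <= -tcrit means Z <= -tcrit*sqrt(v/df) - delta.
    """
    def integrand(v):
        cut = tcrit * math.sqrt(v / df)
        reject = (1.0 - normal_cdf(cut - delta)) + normal_cdf(-cut - delta)
        return reject * chi2_pdf(v, df)

    h = upper / steps                    # steps must be even for Simpson's rule
    total = integrand(0.0) + integrand(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(k * h)
    return total * h / 3.0


# Values assumed in the notes' example.
n, sig2, ssx, beta1 = 25, 2500.0, 19800.0, 1.5
df = n - 2
tstar = 2.06866                          # copied from the SAS output: tinv(0.975, 23)
delta = abs(beta1) / math.sqrt(sig2 / ssx)
power = power_two_sided_t(delta, df, tstar)
power_at_null = power_two_sided_t(0.0, df, tstar)
print("delta =", round(delta, 5), "power =", round(power, 4))
print("rejection probability when beta1 = 0:", round(power_at_null, 4))
```

The second printed value illustrates the remark in the notes that the calculation gives $\alpha$ when $\beta_1 = 0$.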
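
The Working-Hotelling multiplier $W^2 = 2F_{2,\,n-2}(1-\alpha)$ from section 2.6 can be computed by hand, because an F distribution with 2 numerator degrees of freedom has the closed-form CDF $1 - (1 + 2x/\nu)^{-\nu/2}$, where $\nu$ is the denominator df. The sketch below uses that fact in place of the SAS `finv` call; the function name is hypothetical, and the predicted value and standard error for size = 65 are copied from the `clm` output in the notes.

```python
import math


def f_quantile_num_df2(p, dfd):
    """p-th quantile of F(2, dfd), inverting CDF(x) = 1 - (1 + 2x/dfd)**(-dfd/2)."""
    return (dfd / 2.0) * ((1.0 - p) ** (-2.0 / dfd) - 1.0)


n, alpha = 25, 0.10                      # same settings as the notes' data step
dfd = n - 2
w2 = 2.0 * f_quantile_num_df2(1.0 - alpha, dfd)
w = math.sqrt(w2)

# Band at size = 65, using Yhat_h and s{Yhat_h} copied from the notes' output.
yhat_65, se_65 = 294.4290, 9.9176
band_65 = (yhat_65 - w * se_65, yhat_65 + w * se_65)
print("W^2 =", round(w2, 5), "W =", round(w, 5))
print("90% Working-Hotelling band at size 65:", band_65)
```

Because $W$ exceeds the pointwise critical value $t_c$ at the same $\alpha$, the band at any single $X_h$ is wider than the corresponding confidence interval for $E(Y_h)$.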
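
The relationships among the ANOVA quantities in sections 2.7 and 2.8 ($F = MSM/MSE$, the general linear test, $t^2 = F$, $R^2$, and adjusted $R^2$) can be verified with simple arithmetic on the sums of squares printed in the Toluca output. This is a small sketch for checking the identities, with inputs copied from the SAS output in the notes; the variable names are arbitrary, and small discrepancies from the printed values come from rounding in the output.

```python
# Sums of squares and estimates copied from the PROC REG output in the notes.
ssm, sse, n = 252378.0, 54825.0, 25
b1, se_b1 = 3.57020, 0.34697

df_m, df_e, df_t = 1, n - 2, n - 1
sst = ssm + sse                          # total = model + error
msm, mse, mst = ssm / df_m, sse / df_e, sst / df_t

f_stat = msm / mse                       # ANOVA F test

# General linear test: reduced model has SSE(R) = SST with n - 1 df,
# full model has SSE(F) = SSE with n - 2 df.
f_glt = ((sst - sse) / (df_t - df_e)) / (sse / df_e)

t_stat = b1 / se_b1                      # t test for the slope
r2 = ssm / sst                           # R-Square
adj_r2 = 1.0 - mse / mst                 # Adj R-Sq

print("F =", round(f_stat, 2), " GLT F =", round(f_glt, 2),
      " t^2 =", round(t_stat ** 2, 2))
print("R^2 =", round(r2, 4), " adj R^2 =", round(adj_r2, 4))
```

The general linear test reduces to exactly the same ratio as the ANOVA F statistic here, which is the point made in the notes: in SLR the two tests coincide, and the full-versus-reduced view only becomes a separate tool in MLR.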
