# 201 Class Note for STAT 51200 with Professor Zhang at Purdue

This 102 page Class Notes was uploaded by an elite notetaker on Friday February 6, 2015. The Class Notes belongs to a course at Purdue University taught by a professor in Fall. Since its upload, it has received 16 views.

Date Created: 02/06/15

Statistics 512: Applied Regression Analysis, Purdue University, Professor Dabao Zhang, Spring 2009 (January 12, 2009)

## Overview

We will cover

- simple linear regression (SLR)
- multiple linear regression (MLR)
- analysis of variance (ANOVA)

Emphasis will be placed on using selected practical tools (such as SAS) rather than on mathematical manipulations. We want to understand the theory so that we can apply it appropriately. Some of the material on SLR will be review, but our goal with SLR is to be able to generalize the methods to MLR.

## Course Information

- Class: Section 3, MWF 2:30-3:20pm at REC 121
- Text: Applied Linear Statistical Models, 5th edition, by Kutner, Nachtsheim, Neter, and Li (KNNL)
- Recommended: Applied Statistics and the SAS Programming Language, 5th edition, by Cody and Smith
- Professor: Dabao Zhang, MATH 534
- Office hours: MW 3:30pm-4:30pm, or by appointment, by phone (46046), or by email (zhangdb@stat.purdue.edu)
- Evaluation: Problem sets will be assigned (more or less) weekly. They will typically be due on Friday. Refer to the handout about specific evaluation policies.

## Lecture Notes

- Available as MS Word or PDF
- Usually (hopefully) prepared a week in advance
- Not comprehensive. (Be prepared to take notes.)
- One or two chapters per week
- Ask questions if you're confused

## Webpage

http://www.stat.purdue.edu/~zhangdb/stat512

- Announcements
- Lecture Notes
- Homework Assignments
- Data Sets and SAS files
- General handouts (please
see immediately): Course Information

## Mailing List

I will very occasionally send reminders or announcements through email.

## Blackboard Vista

- Holds solutions documents
- Monitor grades
- Information restricted to enrolled students
- Discussion groups

One midterm exam has been scheduled on March 5, 2009, 8-10pm. Please check your schedule and make sure that it works for you. Please notify me one week in advance of any conflict.

If the lecture viewing schedule is not realistic for homework deadlines, please let me know as soon as possible.

In class, please try to make sure I hear your question. Chatting with your neighbors may disturb others; please be courteous to your classmates.

## SAS

SAS is the program we will use to perform data analysis for this class. Learning to use SAS will be a large part of the course.

## Getting Help with SAS

Several sources for help:

- SAS Help Files (not always best)
- World Wide Web (look up the syntax in your favorite search engine)
- SAS Getting Started (in the SAS Files section of the class website) and Tutorials
- Statistical Consulting Service: Math B5, hours 10-4 M through F, http://www.stat.purdue.edu/scs
- Wednesday Evening Help Sessions
- Applied Statistics and the SAS Programming Language, 5th edition, by Cody and Smith; most relevant material in Chapters 1, 2, 5, 7, and 9
- Your instructor
- Off-campus students: If DACS doesn't work for you, fill out a license
agreement online (in the SAS folder) and mail or fax it to Pro Ed. Disks will be sent to you. I need the license agreements, or notification that you're sending a license agreement, by the end of the first week of classes.

## Evening Computer Labs

- SC 283
- Help with SAS for multiple Stat courses
- Hours: 7pm-9pm Wednesdays
- Starting second week of classes
- Staffed with graduate student TA

I will often give examples from SAS in class. The programs used in lecture (and any other programs you should need) will be available for you to download from the website.

I will usually have to edit the output somewhat to get it to fit on the page of notes. You should run the SAS programs yourself to see the real output and experiment with changing the commands to learn how they work. Let me know if you get confused about what is input, output, or my comments. I will tell you the names of all SAS files I use in these notes. If the notes differ from the SAS file, take the SAS file to be correct, since there may be cut-and-paste errors.

There is a tutorial in SAS to help you get started: Help > Getting Started with SAS Software.

You should spend some time before next week getting comfortable with SAS. For today, don't worry about the detailed syntax of the commands. Just try to get a sense of what is going on.

## Example: Price Analysis for Diamond Rings in Singapore

Variables

- response variable: price in Singapore dollars (Y)
- explanatory variable: weight of diamond in carats (X)

Goals

- Create a scatterplot
- Fit a regression line
- Predict the price of
a sale for a 0.43 carat diamond ring

## SAS Data Step

File diamond.sas on website.

One way to input data in SAS is to type or paste it in. In this case, we have a sequence of ordered pairs (weight, price). Only the first pairs and the final entry are legible in this copy of the notes; the full list of 48 pairs is in diamond.sas.

    data diamonds;
      input weight price @@;
      cards;
    .17 355 .16 328 .17 350 ...
    .43 .
    ;
    data diamonds1;
      set diamonds;
      if price ne .;

Syntax Notes

- Each line must end with a semicolon.
- There is no output from this statement, but information does appear in the log window.
- Often you will obtain data from an existing SAS file or import it from another file, such as a spreadsheet. Examples showing how to do this will come later.

## SAS proc print

Now we want to see what the data look like.

    proc print data=diamonds; run;

    Obs    weight    price
      1     0.17      355
      2     0.16      328
      3     0.17      350
    ...
     47     0.26      693
     48     0.15      316
     49     0.43        .

We want to plot the data as a scatterplot, using circles to represent data points and adding a curve to see if it looks linear. The symbol statement `v=circle` (`v` stands for "value") lets us do this. The symbol statement `i=sm70` will add a smooth line using splines (interpolation = smooth). These are options which stay on until you turn them off. In order for the smoothing to work properly, we need to sort the data by the X variable.

    proc sort
    data=diamonds1; by weight;
    symbol1 v=circle i=sm70;
    title1 'Diamond Ring Price Study';
    title2 'Scatter plot of Price vs. Weight with Smoothing Curve';
    axis1 label=('Weight (Carats)');
    axis2 label=(angle=90 'Price (Singapore $)');
    proc gplot data=diamonds1;
      plot price*weight / haxis=axis1 vaxis=axis2;
    run;

[Figure: Diamond Ring Price Study. Scatter plot of Price vs. Weight with Smoothing Curve; horizontal axis: Weight (Carats).]

Now we want to use simple linear regression to fit a line through the data. We use the symbol option `i=rl`, meaning interpolation = regression line (that's an "L", not a one).

    symbol1 v=circle i=rl;
    title2 'Scatter plot of Price vs. Weight with Regression Line';
    proc gplot data=diamonds1;
      plot price*weight / haxis=axis1 vaxis=axis2;
    run;

[Figure: Diamond Ring Price Study. Scatter plot of Price vs. Weight with Regression Line; horizontal axis: Weight (Carats).]

We use proc reg (regression) to estimate a regression line and calculate predicted values and residuals from the straight line. We tell it what the data are, what the model is, and what options we want.

    proc reg data=diamonds;
      model price=weight / clb p r;
      output out=diag p=pred r=resid;
      id weight;
    run;

                          Analysis of Variance
                             Sum of         Mean
    Source             DF   Squares       Square    F Value
    Model               1   2098596      2098596    2069.99
    Error              46     46636   1013.81886
    Corrected Total    47   2145232

    Root MSE          31.84052    R-Square    0.9783
    Dependent Mean   500.08333    Adj R-Sq    0.9778
    Coeff Var          6.36704

Parameter Estimates (columns: Variable, DF, Parameter Estimate, Standard Error, t
Value, Pr > |t|):

    Intercept    1    -259.62591    17.31886    -14.99    <.0001
    weight       1    3721.02485    81.78588     45.50    <.0001

To look at the predicted values and residuals saved in `diag`:

    proc print data=diag; run;

Output Statistics (the standard-error columns are not legible in this copy and are omitted):

    Obs    weight    price    Predicted Value    Residual
      1     0.17      355        372.9483        -17.9483
      2     0.16      328        335.7381         -7.7381
      3     0.17      350        372.9483        -22.9483
      4     0.18      325        410.1586        -85.1586
      5     0.25      642        670.6303        -28.6303
    ...
     46     0.15      287        298.5278        -11.5278
     47     0.26      693        707.8406        -14.8406
     48     0.15      316        298.5278         17.4722

## Simple Linear Regression: Why Use It?

- Descriptive purposes (cause/effect relationships)
- Control (often of cost)
- Prediction of outcomes

## Data for Simple Linear Regression

- Observe $i = 1, 2, \ldots, n$ pairs of variables (explanatory, response)
- Each pair is often called a case or a data point
- $Y_i$ = $i$th response
- $X_i$ = $i$th explanatory variable

## Simple Linear Regression Model

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \quad \text{for } i = 1, 2, \ldots, n$$

Simple Linear Regression Model Parameters

- $\beta_0$ is the intercept.
- $\beta_1$ is the slope.
- $\epsilon_i$ are independent, normally distributed random errors with mean 0 and variance $\sigma^2$, i.e., $\epsilon_i \sim N(0, \sigma^2)$.

## Features of Simple Linear Regression Model

- Individual observations: $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$
- Since the $\epsilon_i$ are random, the $Y_i$ are also random, and

$$E(Y_i) = \beta_0 + \beta_1 X_i + E(\epsilon_i) = \beta_0 + \beta_1 X_i$$

$$\mathrm{Var}(Y_i) = 0 + \mathrm{Var}(\epsilon_i) = \sigma^2$$

Since $\epsilon_i$ is normally distributed, $Y_i \sim N(\beta_0 + \beta_1 X_i, \sigma^2)$. (See A.4, page 1302.)
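One way to get a feel for these features is to simulate data from the model and fit a line to it. The sketch below is illustrative only: the data set name `sim` and the parameter values (intercept 10, slope 2, error standard deviation 3, 50 cases) are invented for this example and do not come from the course files.

```sas
/* Simulate Y = beta0 + beta1*X + epsilon with epsilon ~ N(0, sigma^2). */
/* All numbers here are made-up illustration values.                    */
data sim;
  call streaminit(512);            /* fix the seed so reruns match     */
  beta0 = 10; beta1 = 2; sigma = 3;
  do i = 1 to 50;
    x = i / 5;
    e = sigma * rand('normal');    /* rand('normal') is N(0,1)         */
    y = beta0 + beta1 * x + e;
    output;
  end;
  keep x y;
run;

/* Fit the line; the estimates should land near the values used above. */
proc reg data=sim;
  model y = x;
run;
```

If the model features hold, the fitted intercept and slope should be in the neighborhood of 10 and 2, and Root MSE should be near 3. Changing `sigma` or the number of cases shows how the scatter around the line and the precision of the estimates respond.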
## Fitted Regression Equation and Residuals

We must estimate the parameters $\beta_0, \beta_1, \sigma^2$ from the data. The estimates are denoted $b_0, b_1, s^2$. These give us the fitted or estimated regression line $\hat{Y}_i = b_0 + b_1 X_i$, where

- $b_0$ is the estimated intercept.
- $b_1$ is the estimated slope.
- $\hat{Y}_i$ is the estimated mean for $Y$ when the predictor is $X_i$ (i.e., the point on the fitted line).
- $e_i$ is the residual for the $i$th case (the vertical distance from the data point to the fitted regression line). Note that $e_i = Y_i - \hat{Y}_i = Y_i - (b_0 + b_1 X_i)$.

## Using SAS to Plot the Residuals (Diamond Example)

When we called proc reg earlier, we assigned the residuals to the name `resid` and placed them in a new data set called `diag`. We now plot them vs. X.

    proc gplot data=diag;
      plot resid*weight / haxis=axis1 vaxis=axis2 vref=0;
      where price ne .;
    run;

[Figure: Diamond Ring Price Study, residual plot; residuals plotted against Weight (Carats).]

Notice there does not appear to be any obvious pattern in the residuals. We'll talk a lot more about diagnostics later, but for now, you should know that looking at residual plots is an important way to check assumptions.

## Least Squares

- Want to find "best" estimators $b_0$ and $b_1$.
- Will minimize the sum of the squared residuals $\sum_{i=1}^n e_i^2 = \sum_{i=1}^n \left(Y_i - (b_0 + b_1 X_i)\right)^2$.
- Use calculus: take the derivative with respect to $b_0$ and with respect to $b_1$, set the two resulting equations equal to zero, and solve for $b_0$ and $b_1$ (see KNNL, pages 17-18).

## Least Squares Solution

- These are the best estimates for $\beta_1$ and $\beta_0$:

$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{SS_{XY}}{SS_X}, \qquad b_0 = \bar{Y} - b_1 \bar{X}$$

- These are also maximum likelihood estimators (MLE); see KNNL, pages 26-32.
- This estimate is
the "best" because it is unbiased (its expected value is equal to the true value) with minimum variance.

## Maximum Likelihood

$$Y_i \sim N(\beta_0 + \beta_1 X_i, \sigma^2)$$

$$f_i = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2} \left( \frac{Y_i - \beta_0 - \beta_1 X_i}{\sigma} \right)^2 \right\}$$

$$L = f_1 \times f_2 \times \cdots \times f_n \quad \text{(likelihood function)}$$

Find values for $\beta_0$ and $\beta_1$ which maximize $L$. These are the SAME as the least squares estimators $b_0$ and $b_1$.

## Estimation of $\sigma^2$

We also need to estimate $\sigma^2$ with $s^2$. We use the sum of squared residuals, SSE, divided by the degrees of freedom $n - 2$:

$$s^2 = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - 2} = \frac{\sum e_i^2}{n - 2} = \frac{SSE}{df_E} = MSE, \qquad s = \sqrt{s^2} = \text{Root MSE},$$

where $SSE = \sum e_i^2$ is the sum of squared residuals or "errors", and MSE stands for "mean squared error".

There will be other estimated variances for other quantities, and these will also be denoted $s^2$, e.g. $s^2\{b_1\}$. Without any $\{\}$, $s^2$ refers to the value above, that is, the estimated variance of the residuals.

## Identifying these things in the SAS output

                          Analysis of Variance
                             Sum of         Mean
    Source             DF   Squares       Square    F Value
    Model               1   2098596      2098596    2069.99
    Error              46     46636   1013.81886
    Corrected Total    47   2145232

    Root MSE          31.84052    R-Square    0.9783
    Dependent Mean   500.08333    Adj R-Sq    0.9778
    Coeff Var          6.36704

                          Parameter Estimates
                     Parameter     Standard
    Variable   DF     Estimate        Error    t Value    Pr > |t|
    Intercept   1   -259.62591     17.31886     -14.99      <.0001
    weight      1   3721.02485     81.78588      45.50      <.0001

## Review of Statistical Inference for Normal Samples

This should be review! In Statistics 503/511 you learned how to construct confidence intervals and do hypothesis tests for the mean of a normal
distribution, based on a random sample. Suppose we have an iid random sample $W_1, \ldots, W_n$ from a normal distribution. (Usually, I would use the symbol $X$ or $Y$, but I want to keep the context general and not use the symbols we use for regression.)

We have $W_i \sim_{iid} N(\mu, \sigma^2)$, where $\mu$ and $\sigma^2$ are unknown.

- $\bar{W} = \frac{\sum W_i}{n}$ = sample mean
- $SS_W = \sum (W_i - \bar{W})^2$ = sum of squares for $W$
- $s^2\{W\} = \frac{\sum (W_i - \bar{W})^2}{n - 1} = \frac{SS_W}{n - 1}$ = sample variance
- $s\{W\} = \sqrt{s^2\{W\}}$ = sample standard deviation
- $s\{\bar{W}\} = \frac{s\{W\}}{\sqrt{n}}$ = standard error of the mean

From these definitions we obtain

$$\bar{W} \sim N\left(\mu, \frac{\sigma^2}{n}\right), \qquad T = \frac{\bar{W} - \mu}{s\{\bar{W}\}} \text{ has a } t\text{-distribution with } n - 1 \text{ df (in short, } T \sim t_{n-1}).$$

This leads to inference:

- confidence intervals for $\mu$
- significance tests for $\mu$

## Confidence Intervals

We are $100(1 - \alpha)\%$ confident that the following interval contains $\mu$:

$$\bar{W} \pm t_c\, s\{\bar{W}\} = \left[\bar{W} - t_c\, s\{\bar{W}\},\ \bar{W} + t_c\, s\{\bar{W}\}\right],$$

where $t_c = t_{n-1}(1 - \frac{\alpha}{2})$, the upper $(1 - \frac{\alpha}{2})$ percentile of the $t$ distribution with $n - 1$ degrees of freedom, and $1 - \alpha$ is the confidence level (e.g. $0.95 = 95\%$, so $\alpha = 0.05$).

## Significance Tests

To test whether $\mu$ has a specific value, we use a t-test (one sample, non-directional).

- $H_0: \mu = \mu_0$ vs. $H_a: \mu \neq \mu_0$
- $t = \frac{\bar{W} - \mu_0}{s\{\bar{W}\}}$ has a $t_{n-1}$ distribution under $H_0$.
- Reject $H_0$ if $|t| \ge t_c$, where $t_c = t_{n-1}(1 - \frac{\alpha}{2})$.
- p-value $= \mathrm{Prob}_{H_0}(|T| > |t|)$, where $T \sim t_{n-1}$.

The p-value is twice the area in the upper tail of the $t_{n-1}$ distribution above the observed $|t|$. It is the probability of observing a test statistic at least as extreme as what was actually observed, when the null hypothesis is really true. We reject $H_0$ if $p \le \alpha$. (Note that this is basically the same, more general
actually, as having $|t| \ge t_c$.)

## Important Notational Comment

The text says "conclude $H_A$" if $t$ is in the rejection region ($|t| \ge t_c$), otherwise "conclude $H_0$". This is shorthand for

- "conclude $H_A$" means there is sufficient evidence in the data to conclude that $H_0$ is false, and so we assume that $H_A$ is true.
- "conclude $H_0$" means there is insufficient evidence in the data to conclude that either $H_0$ or $H_A$ is true or false, so we default to assuming that $H_0$ is true.

Notice that a failure to reject $H_0$ does not mean that there was any evidence in favor of $H_0$.

NOTE: In this course, $\alpha = 0.05$ unless otherwise specified.

## Section 2.1: Inference about $\beta_1$

$$b_1 \sim N(\beta_1, \sigma^2\{b_1\}), \quad \text{where } \sigma^2\{b_1\} = \frac{\sigma^2}{SS_X}$$

$$t = \frac{b_1 - \beta_1}{s\{b_1\}}, \quad \text{where } s\{b_1\} = \sqrt{\frac{s^2}{SS_X}}, \qquad t \sim t_{n-2}$$

According to our discussion above for $W$, you therefore know how to obtain CIs and t-tests for $\beta_1$. (I'll go through it now but not in the future.) There is one important difference: the degrees of freedom (df) here are $n - 2$, not $n - 1$, because we are also estimating $\beta_0$.

## Confidence Interval for $\beta_1$

- $b_1 \pm t_c\, s\{b_1\}$
- where $t_c = t_{n-2}(1 - \frac{\alpha}{2})$, the upper $100(1 - \frac{\alpha}{2})$ percentile of the $t$ distribution with $n - 2$ degrees of freedom
- $1 - \alpha$ is the confidence level.

## Significance Tests for $\beta_1$

- $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$
- $t = \frac{b_1 - 0}{s\{b_1\}}$
- Reject $H_0$ if $|t| \ge t_c$, $t_c = t_{n-2}(1 - \alpha/2)$
- p-value $= \mathrm{Prob}(|T| > |t|)$, where $T \sim t_{n-2}$

## Inference for $\beta_0$

- $b_0 \sim N(\beta_0, \sigma^2\{b_0\})$, where $\sigma^2\{b_0\} = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{X}^2}{SS_X} \right]$
- $t = \frac{b_0 - \beta_0}{s\{b_0\}}$, where $s\{b_0\}$ is obtained by replacing $\sigma^2$
by $s^2$ and taking the square root: $s\{b_0\} = s \sqrt{\frac{1}{n} + \frac{\bar{X}^2}{SS_X}}$, and $t \sim t_{n-2}$.

## Confidence Interval for $\beta_0$

- $b_0 \pm t_c\, s\{b_0\}$
- where $t_c = t_{n-2}(1 - \frac{\alpha}{2})$, and $1 - \alpha$ is the confidence level.

## Significance Tests for $\beta_0$

- $H_0: \beta_0 = 0$ vs. $H_A: \beta_0 \neq 0$
- $t = \frac{b_0 - 0}{s\{b_0\}}$
- Reject $H_0$ if $|t| \ge t_c$, $t_c = t_{n-2}(1 - \frac{\alpha}{2})$
- p-value $= \mathrm{Prob}(|T| > |t|)$, where $T \sim t_{n-2}$

## Notes

- The normality of $b_0$ and $b_1$ follows from the fact that each is a linear combination of the $Y_i$, themselves each independent and normally distributed.
- For $b_1$, see KNNL, page 42.
- For $b_0$, try this as an exercise.
- Often the CI and significance test for $\beta_0$ are not of interest.
- If the $\epsilon_i$ are not normal but are approximately normal, then the CIs and significance tests are generally reasonable approximations.
- These procedures can easily be modified to produce one-sided confidence intervals and significance tests.
- Because $\sigma^2\{b_1\} = \frac{\sigma^2}{\sum (X_i - \bar{X})^2}$, we can make this quantity small by making $\sum (X_i - \bar{X})^2$ large, i.e. by spreading out the $X_i$'s.

Here is how to get the parameter estimates in SAS. (Still using diamond.sas.) The option `clb` asks SAS to give you confidence limits for the parameter estimates $b_0$ and $b_1$.

    proc reg data=diamonds;
      model price=weight / clb;

                          Parameter Estimates
                     Parameter     Standard
    Variable   DF     Estimate        Error      95% Confidence Limits
    Intercept   1   -259.62591     17.31886    -294.48696    -224.76486
    weight      1   3721.02485     81.78588    3556.39841    3885.65129

## Points to Remember

- What is the default value of $\alpha$ that we use in this class?
- What is the default confidence level that we use in this class?
- Suppose you could choose the $X$'s. How would you choose them if you wanted a precise estimate of the slope? The intercept? Both?
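The confidence limits that `clb` reports for the slope can be rebuilt by hand from the estimate, its standard error, and the $t$ critical value. The data step below is a minimal sketch of that check: the numbers are copied from the diamond output above, while the data set name `ci_slope` and the variable names are made up for illustration.

```sas
/* Rebuild the 95% CI and t-test for the slope from the reported output. */
data ci_slope;
  b1    = 3721.02485;              /* slope estimate (proc reg output)  */
  sb1   = 81.78588;                /* standard error of the slope       */
  df    = 46;                      /* n - 2, with n = 48 diamonds       */
  alpha = 0.05;
  tc    = tinv(1 - alpha/2, df);   /* upper 100(1 - alpha/2) percentile */
  lower = b1 - tc*sb1;
  upper = b1 + tc*sb1;
  tstat = b1 / sb1;                /* test of H0: beta1 = 0             */
  pvalue = 2*(1 - probt(abs(tstat), df));
run;

proc print data=ci_slope; run;
```

Up to rounding, `lower` and `upper` should agree with the limits 3556.39841 and 3885.65129 in the `clb` output, and `tstat` with the t Value of 45.50. Swapping in the intercept's estimate and standard error gives the other row of the table.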
## Summary of Inference

- $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$
- $\epsilon_i \sim N(0, \sigma^2)$ are independent, random errors

Parameter Estimators

$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X}, \qquad s^2 = \frac{\sum (Y_i - b_0 - b_1 X_i)^2}{n - 2}$$

## 95% Confidence Intervals for $\beta_0$ and $\beta_1$

- $b_1 \pm t_c\, s\{b_1\}$
- $b_0 \pm t_c\, s\{b_0\}$
- where $t_c = t_{n-2}(1 - \frac{\alpha}{2})$, the $100(1 - \frac{\alpha}{2})\%$ upper percentile of the $t$ distribution with $n - 2$ degrees of freedom

## Significance Tests for $\beta_0$ and $\beta_1$

- $H_0: \beta_0 = 0$, $H_a: \beta_0 \neq 0$; $t = \frac{b_0}{s\{b_0\}} \sim t_{n-2}$ under $H_0$
- $H_0: \beta_1 = 0$, $H_a: \beta_1 \neq 0$; $t = \frac{b_1}{s\{b_1\}} \sim t_{n-2}$ under $H_0$

Reject $H_0$ if the p-value is small ($< 0.05$).

## KNNL Section 2.3: Power

The power of a significance test is the probability that the null hypothesis will be rejected when, in fact, it is false. This probability depends on the particular value of the parameter in the alternative space. When we do power calculations, we are trying to answer questions like the following:

"Suppose that the parameter $\beta_1$ truly has the value 1.5, and we are going to collect a sample of a particular size $n$ and with a particular $SS_X$. What is the probability that, based on our (not yet collected) data, we will reject $H_0$?"

## Power for $\beta_1$

- $H_0: \beta_1 = 0$, $H_a: \beta_1 \neq 0$
- $t = \frac{b_1}{s\{b_1\}}$
- $t_c = t_{n-2}(1 - \frac{\alpha}{2})$
- For $\alpha = 0.05$, we reject $H_0$ when $|t| \ge t_c$,
- so we need to find $P(|t| \ge t_c)$ for arbitrary values of $\beta_1 \neq 0$.
- When $\beta_1 = 0$, the calculation gives $\alpha$ ($H_0$ is true).
- $t \sim t_{n-2}(\delta)$: noncentral $t$ distribution ($t$-distribution not centered at 0)
- $\delta = \frac{\beta_1}{\sigma\{b_1\}}$ is
the noncentrality parameter: it represents on a "standardized" scale how far from true $H_0$ is (kind of like "effect size").
- We need to assume values for $\sigma^2\{b_1\} = \frac{\sigma^2}{\sum (X_i - \bar{X})^2}$ and $n$.
- KNNL uses tables, see pages 50-51.
- We will use SAS.

## Example of Power for $\beta_1$

- Response Variable: Work Hours
- Explanatory Variable: Lot Size

See page 19 for details of this study, pages 50-51 for details regarding power.

We assume $\sigma^2 = 2500$, $n = 25$, and $SS_X = 19800$, so we have $\sigma^2\{b_1\} = \frac{\sigma^2}{\sum (X_i - \bar{X})^2} = \frac{2500}{19800} \approx 0.1263$.

Consider $\beta_1 = 1.5$. We can now calculate $\delta = \frac{\beta_1}{\sigma\{b_1\}}$; with $t \sim t_{n-2}(\delta)$, we want to find $P(|t| \ge t_c)$. We use a function that calculates the cumulative distribution function (cdf) for the noncentral $t$ distribution.

See program knnl050.sas for the power calculations.

    data a1;
      n=25; sig2=2500; ssx=19800; alpha=.05;
      sig2b1=sig2/ssx; df=n-2;
      beta1=1.5;
      delta=abs(beta1)/sqrt(sig2b1);
      tstar=tinv(1-alpha/2,df);
      power=1-probt(tstar,df,delta)+probt(-tstar,df,delta);
      output;
    proc print data=a1; run;

    Obs   n   sig2    ssx   alpha   sig2b1   df   beta1    delta    tstar    power
      1  25   2500  19800    0.05  0.12626   23     1.5  4.22137  2.06866  0.98131

To see how the power changes with $\beta_1$, compute it over a grid of values and plot:

    data a2;
      n=25; sig2=2500; ssx=19800; alpha=.05;
      sig2b1=sig2/ssx; df=n-2;
      do beta1=-2.0 to 2.0 by .05;
        delta=abs(beta1)/sqrt(sig2b1);
        tstar=tinv(1-alpha/2,df);
        power=1-probt(tstar,df,delta)+probt(-tstar,df,delta);
        output;
      end;
    proc print data=a2; run;
    title1 'Power for the slope in simple linear regression';
    symbol1 v=none i=join;
    proc gplot data=a2; plot power*beta1; run;

[Figure: Power for the slope in simple linear regression; power plotted against beta1.]

## Section 2.4: Estimation of $E(Y_h)$

- $E(Y_h) = \mu_h = \beta_0 + \beta_1 X_h$, the mean value of $Y$ for the subpopulation with $X = X_h$.
- We will estimate $E(Y_h)$ with $\hat{Y}_h = \hat{\mu}_h = b_0 + b_1 X_h$.
- KNNL uses $\hat{Y}_h$ to
denote this estimate; we will use the symbols $\hat{Y}_h = \hat{\mu}_h$ interchangeably.
- See equation (2.28) on page 52.

## Theory for Estimation of $E(Y_h)$

$\hat{Y}_h$ is normal with mean $\mu_h$ and variance

$$\sigma^2\{\hat{Y}_h\} = \sigma^2 \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right].$$

- The normality is a consequence of the fact that $b_0 + b_1 X_h$ is a linear combination of the $Y_i$'s.
- The variance has two components: one for the intercept and one for the slope. The variance associated with the slope depends on the distance $X_h - \bar{X}$. The estimation is more accurate near $\bar{X}$.
- See KNNL, pages 52-55, for details.

## Application of the Theory

We estimate $\sigma^2\{\hat{Y}_h\}$ with

$$s^2\{\hat{Y}_h\} = s^2 \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right].$$

It follows that $t = \frac{\hat{Y}_h - E(Y_h)}{s\{\hat{Y}_h\}} \sim t_{n-2}$; proceed as usual.

95% Confidence Interval for $E(Y_h)$: $\hat{Y}_h \pm t_c\, s\{\hat{Y}_h\}$, where $t_c = t_{n-2}(0.975)$.

NOTE: Significance tests can be performed for $\hat{Y}_h$, but they are rarely used in practice.

## Example

See program knnl054.sas for the estimation of subpopulation means. The option `clm` to the model statement asks for confidence limits for the mean $\hat{Y}_h$.

    data a1;
      infile 'H:\Stat512\Datasets\Ch01ta01.dat';
      input size hours;
    data a2; size=65; output;
             size=100; output;
    data a3; set a1 a2;
    proc print data=a3; run;
    proc reg data=a3;
      model hours=size / clm;
      id size;
    run;

                  Dep Var    Predicted    Std Error
    Obs   size      hours        Value    Mean Predict         95% CL Mean
     25     70   323.0000     312.2800       9.7647      292.0803    332.4797
     26     65          .     294.4290       9.9176      273.9129    314.9451
     27    100          .     419.3861      14.2723      389.8615    448.9106

## Section 2.5: Prediction of $Y_{h(new)}$

We wish to construct an interval into which we predict the next observation (for a given $X_h$) will fall.

- The only difference (operationally)
between this and $E(Y_h)$ is that the variance is different.
- In prediction, we have two variance components: (1) variance associated with the estimation of the mean response $\hat{Y}_h$ and (2) variability in a single observation taken from the distribution with that mean.
- $Y_{h(new)} = \beta_0 + \beta_1 X_h + \epsilon$ is the value for a new observation with $X = X_h$.

We estimate $Y_{h(new)}$ starting with the predicted value $\hat{Y}_h$. This is the center of the interval, just as it was for $E(Y_h)$. However, the width of the interval is different because they have different variances.

$$\mathrm{Var}(Y_{h(new)}) = \mathrm{Var}(\hat{Y}_h) + \mathrm{Var}(\epsilon)$$

$$s^2\{pred\} = s^2\{\hat{Y}_h\} + s^2$$

$$s^2\{pred\} = s^2 \left[ 1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right]$$

$$\frac{Y_{h(new)} - \hat{Y}_h}{s\{pred\}} \sim t_{n-2}$$

$s\{pred\}$ denotes the estimated standard deviation of a new observation with $X = X_h$. It takes into account variability in estimating the mean $\hat{Y}_h$ as well as variability in a single observation from a distribution with that mean.

## Notes

- The procedure can be modified for the mean of $m$ observations at $X = X_h$ (see 2.39a on page 60).
- Standard error is affected by how far $X_h$ is from $\bar{X}$ (see Figure 2.3). As was the case for the mean response, prediction is more accurate near $\bar{X}$.

See program knnl059.sas for the prediction interval example. The `cli` option to the model statement asks SAS to give confidence limits for an individual observation (cf. `clb` and `clm`).

    data a1;
      infile 'H:\Stat512\Datasets\Ch01ta01.dat';
      input size hours;
    data a2; size=65; output;
             size=100; output;
    data a3; set a1 a2;
    proc reg data=a3;
      model hours=size / cli;
    run;
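The relationship $s^2\{pred\} = s^2\{\hat{Y}_h\} + s^2$ can be tried out by hand for the case `size = 65`. The sketch below takes the predicted value and its standard error from the `clm` output shown earlier; the value used for `mse` is an assumption made for illustration (the MSE for this data set is not printed in these notes), and the data set and variable names are hypothetical.

```sas
/* Hand calculation of the mean CI and the prediction interval at size=65. */
data pi_check;
  yhat  = 294.4290;    /* Predicted Value at size = 65 (clm output)     */
  syhat = 9.9176;      /* Std Error Mean Predict at size = 65           */
  df    = 23;          /* n - 2, with n = 25                            */
  mse   = 2384;        /* ASSUMED s^2; not shown in these notes         */
  tc    = tinv(0.975, df);

  /* confidence interval for the mean response E(Yh) */
  clm_lower = yhat - tc*syhat;
  clm_upper = yhat + tc*syhat;

  /* prediction interval for one new observation */
  spred     = sqrt(mse + syhat**2);
  cli_lower = yhat - tc*spred;
  cli_upper = yhat + tc*spred;
run;

proc print data=pi_check; run;
```

The `clm_` limits should reproduce the 95% CL Mean values for `size = 65` up to rounding. The `cli_` limits are only as good as the assumed `mse`, but whatever value is used, they come out wider than the `clm_` limits because `spred` always exceeds `syhat`.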
                  Dep Var    Predicted    Std Error
    Obs   size      hours        Value    Mean Predict        95% CL Predict      Residual
     25     70   323.0000     312.2800       9.7647      209.2811    415.2789      10.7200
     26     65          .     294.4290       9.9176      191.3676    397.4904            .
     27    100          .     419.3861      14.2723      314.1604    524.6117            .

## Notes

- The standard error (Std Error Mean Predict) given in this output is the standard error of $\hat{Y}_h$, not $s\{pred\}$. (That's why the word "mean" is in there.) The CL Predict label tells you that the confidence interval is for the prediction of a new observation.
- The prediction interval for $Y_{h(new)}$ is wider than the confidence interval for $\hat{Y}_h$ because it has a larger variance.

## Working-Hotelling Confidence Bands (Section 2.6)

- This is a confidence limit for the whole line at once, in contrast to the confidence interval for just one $\hat{Y}_h$ at a time.
- The regression line $b_0 + b_1 X_h$ describes $E(Y_h)$ for a given $X_h$.
- We have a 95% CI for $E(Y_h) = \mu_h$ pertaining to a specific $X_h$.
- We want a confidence band for all $X_h$.
- The confidence limit is given by $\hat{Y}_h \pm W s\{\hat{Y}_h\}$, where $W^2 = 2 F_{2, n-2}(1 - \alpha)$. Since we are doing all values of $X_h$ at once, it will be wider at each $X_h$ than CIs for individual $X_h$.
- The boundary values define a hyperbola.
- The theory for this comes from the joint confidence region for $(\beta_0, \beta_1)$, which is an ellipse (see Stat 524).
- We are used to constructing CIs with $t$'s, not $W$'s. Can we fake it?
- We can find a new, smaller $\alpha$ for $t_c$ that would give the same result, kind of an "effective alpha" that takes into account that you are estimating the entire line.
- We find $W^2$ for our desired $\alpha$, and then find the
effective of to use with tc that gives WOz 15004t January 12 2009 Page 79 swam 512 Appiied egmsdon Pm University mmamewzmu amumw Con dence Band for Regmsion Line 833 pmgram knnl 03961 sas for the regressien line con dence band data a1 n25alpha10 dfn2 dfdn2 w22finvlalphadfnfdfd wsqrtiw2 alphat21probtwdfd tstartinvlalphat2 did output proc print dataa1run hmmnzmm 1 89080 Statistics 512 Applied Regression Analysis Purdue University Professor Dabao Zhang Spring 2009 Note 1 probtw dfd gives the area underthe tdistribution to the right of 10 We have to double that to get the total area in both tails Obs n alpha dfn dfd w2 w l 25 01 2 23 509858 225800 alphat tstar 0033740 225800 January 12 2009 Page 81 Statistics 512 Applied Regression Analysis Purdue University Professor Dabao Zhang Spring 2009 Estimation of EYh Compared to Prediction of Yh Yh be b1Xh 2 A39 82 ll Xh X2 8 Yh in 39 zoo Xvi vol XV January 12 2009 Page 83 swam 512 Ammnegmdon Purdue University mmammMmzmu amumm See the program knnl 0 61 sas for the clmmean and cl i individual plots data a1 infile 39HSystemDesktopCHOlTA01DAT39 input size haurs Con dence intervals symboll vcircle irlclm95 proc gplot dataal plot hourssize lanuary39izm Pagem Statistics 512 Applied Regression Analysis Purdue University Professor Dabao Zhang Spring 2009 January 12 2009 Page 85 statistics 512 Appl md Regression Analysis Purdue University Professor Dabao Zhang Spring 2009 A Section 27 Analysis of Variance ANOVA Table I Organizes results arithmetically Total sum ofsquares in Y is 88y 37 i Partitien this into twa sources Model explained by regressian Error unexplained I residual 14 n Hm Y39 Em Y 2 2m 13 203 37 cross terms cancel see page 65 II II January 122009 P8908 Statistics 512 Applied Regression Analysis Purdue University Professor Dabao Zhang Spring 2009 Total Sum of Squares 0 Consider ignoring Xh to predict EYh Then the best predictor would be the sample mean 37 o SST is the sum of squared deviations from this predictor SST SSy 37 
- The total degrees of freedom is $df_T = n - 1$.
- $MST = SST / df_T$
- MST is the usual estimate of the variance of $Y$ if there are no explanatory variables, also known as $s^2\{Y\}$.
- SAS uses the term Corrected Total for this source. "Uncorrected" is $\sum Y_i^2$. The term "corrected" means that we subtract off the mean $\bar{Y}$ before squaring.

#### Model Sum of Squares

- $SSM = \sum (\hat{Y}_i - \bar{Y})^2$
- The model degrees of freedom is $df_M = 1$, since one parameter (slope) is estimated.
- $MSM = SSM / df_M$
- KNNL uses the word regression for what SAS calls model.
- So SSR (KNNL) is the same as SS Model (SAS). I prefer to use the terms SSM and $df_M$ because R stands for regression, residual, and reduced (later), which I find confusing.

#### Error Sum of Squares

- $SSE = \sum (Y_i - \hat{Y}_i)^2$
- The error degrees of freedom is $df_E = n - 2$, since estimates have been made for both slope and intercept.
- $MSE = SSE / df_E$
- $MSE = s^2$ is an estimate of the variance of $Y$ taking into account (or conditioning on) the explanatory variable(s).

#### ANOVA Table for SLR

| Source | df | SS | MS |
|---|---|---|---|
| Model (Regression) | $1$ | $\sum (\hat{Y}_i - \bar{Y})^2$ | $SSM / df_M$ |
| Error | $n - 2$ | $\sum (Y_i - \hat{Y}_i)^2$ | $SSE / df_E$ |
| Total | $n - 1$ | $\sum (Y_i - \bar{Y})^2$ | $SST / df_T$ |

Note about degrees of freedom: Occasionally you will run across a reference to "degrees of freedom" without specifying whether this is model, error, or total. Sometimes it will be clear from context, and although that is sloppy usage, you can generally assume that if it is not specified, it means error degrees of freedom.

#### Expected Mean Squares

- MSM and MSE are random variables.
- $E(MSM) = \sigma^2 + \beta_1^2 SS_X$
- $E(MSE) = \sigma^2$
- When $H_0: \beta_1 = 0$ is true, then $E(MSM) = E(MSE)$.
- This makes sense, since in that case the model reduces to $Y_i = \beta_0 + \varepsilon_i$, so both mean squares estimate $\sigma^2$.
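The bookkeeping behind the ANOVA table can be sketched in a few lines. This is an illustrative Python sketch only (the helper name `anova_slr` is hypothetical), using the sums of squares SAS reports for the example data (SSM = 252378, SSE = 54825, n = 25). It shows how the degrees of freedom, the mean squares, and the ratio MSM/MSE used in the F test follow from the two sums of squares.

```python
def anova_slr(ssm, sse, n):
    """ANOVA quantities for simple linear regression from SSM, SSE and n."""
    df_m, df_e, df_t = 1, n - 2, n - 1
    sst = ssm + sse            # cross terms cancel, so SST = SSM + SSE
    msm = ssm / df_m
    mse = sse / df_e
    mst = sst / df_t
    return {"SST": sst, "dfM": df_m, "dfE": df_e, "dfT": df_t,
            "MSM": msm, "MSE": mse, "MST": mst, "F": msm / mse}

# Sums of squares as printed by SAS for the example (rounded to integers).
table = anova_slr(ssm=252378, sse=54825, n=25)
for name, value in table.items():
    print(f"{name:>3}: {value:.4f}")
```

Because the printed sums of squares are rounded, MSE comes out as about 2383.70 here rather than the 2383.71562 that SAS prints; the ratio MSM/MSE still rounds to 105.88.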
#### F test

- $F = MSM / MSE \sim F_{df_M,\,df_E} = F_{1,\,n-2}$
- See KNNL pages 69-70.
- When $H_0: \beta_1 = 0$ is false, MSM tends to be larger than MSE, so we would want to reject $H_0$ when $F$ is large.
- Generally our decision rule is to reject the null hypothesis if $F \ge F_c = F_{df_M,\,df_E}(1 - \alpha) = F_{1,\,n-2}(0.95)$.
- In practice, we use p-values (and reject $H_0$ if the p-value is less than 0.05).
- Recall that $t = b_1 / s\{b_1\}$ tests $H_0: \beta_1 = 0$. It can be shown that $t^2_{df} = F_{1,\,df}$. The two approaches give the same p-value; they are really the same test.
- Aside: When $H_0: \beta_1 = 0$ is false, $F$ has a noncentral $F$ distribution; this can be used to calculate power.

#### ANOVA Table

| Source | df | SS | MS | F | p |
|---|---|---|---|---|---|
| Model | $1$ | SSM | MSM | MSM/MSE | p |
| Error | $n - 2$ | SSE | MSE | | |
| Total | $n - 1$ | | | | |

See the program knnl067.sas for the program used to generate the other output used in this lecture.

    data a1;
      infile 'H:\System\Desktop\CH01TA01.DAT';
      input size hours;
    proc reg data=a1;
      model hours=size;
    run;

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 1 | 252378 | 252378 | 105.88 | <.0001 |
| Error | 23 | 54825 | 2383.71562 | | |
| Corrected Total | 24 | 307203 | | | |

Parameter Estimates

| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | 1 | 62.36586 | 26.17743 | 2.38 | 0.0259 |
| size | 1 | 3.57020 | 0.34697 | 10.29 | <.0001 |

Note that $t^2 = 10.29^2 = 105.88 = F$.

#### Section 2.8: General Linear Test

- A different view of the same problem (testing $\beta_1 = 0$). It may seem redundant now, but the concept is extremely useful in MLR.
- We want to compare two models:
  - $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ (full model)
  - $Y_i = \beta_0 + \varepsilon_i$ (reduced model)
- Compare using the error sum of squares.
Let $SSE(F)$ be the SSE for the Full model, and let $SSE(R)$ be the SSE for the Reduced model.

$$F = \frac{\left(SSE(R) - SSE(F)\right) / \left(df_E(R) - df_E(F)\right)}{SSE(F) / df_E(F)}$$

Compare to the critical value $F_c = F_{df_E(R) - df_E(F),\,df_E(F)}(1 - \alpha)$ to test $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \ne 0$.

#### Test in Simple Linear Regression

$$SSE(R) = \sum (Y_i - \bar{Y})^2 = SST$$

$$SSE(F) = SST - SSM \quad \text{(the usual SSE)}$$

$$df_E(R) = n - 1, \quad df_E(F) = n - 2, \quad df_E(R) - df_E(F) = 1$$

$$F = \frac{(SST - SSE) / 1}{SSE / (n - 2)} = \frac{MSM}{MSE} \quad \text{(same test as before)}$$

This approach (full vs. reduced) is more general, and we will see it again in MLR.

#### Pearson Correlation

$\rho$ is the usual correlation coefficient (estimated by $r$).

- It is a number between $-1$ and $1$ that measures the strength of the linear relationship between two variables.

$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$$

- Notice that $r = b_1 \sqrt{\dfrac{\sum (X_i - \bar{X})^2}{\sum (Y_i - \bar{Y})^2}} = b_1 \dfrac{s_X}{s_Y}$, so the test of $H_0: \beta_1 = 0$ is similar to the test of $H_0: \rho = 0$.

#### $R^2$ and $r^2$

- $R^2$ is the ratio of explained and total variation: $R^2 = SSM / SST$.
- $r^2$ is the square of the correlation between $X$ and $Y$:

$$r^2 = b_1^2 \, \frac{\sum (X_i - \bar{X})^2}{\sum (Y_i - \bar{Y})^2} = \frac{SSM}{SST}$$

In SLR, $r^2$ and $R^2$ are the same thing. However, in MLR they are different (there will be a different $r$ for each $X$ variable, but only one $R^2$).

$R^2$ is often multiplied by 100 and thereby expressed as a percent.

In MLR, we often use the adjusted $R^2$, which has been adjusted to account for the number of variables in the model (more in Chapter 6).

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 1 | 252378 | 252378 | 105.88 | <.0001 |
| Error | 23 | 54825 | 2383 | | |
| C Total | 24 | 307203 | | | |

| Statistic | Value | Computation |
|---|---|---|
| R-Square | 0.8215 | $SSM/SST = 1 - SSE/SST = 252378/307203$ |
| Adj R-sq | 0.8138 | $1 - MSE/MST = 1 - 2383/(307203/24)$ |
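As a check on the arithmetic above, here is a small Python sketch (the helper names `general_linear_f` and `r_squared` are hypothetical, not from the course files). It applies the general linear test formula with the reduced model's SSE equal to SST, and computes $R^2$ and adjusted $R^2$ from the same printed sums of squares.

```python
def general_linear_f(sse_r, df_r, sse_f, df_f):
    """F = [(SSE(R) - SSE(F)) / (df_R - df_F)] / [SSE(F) / df_F]."""
    return ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)

def r_squared(ssm, sse, n):
    """Return (R^2, adjusted R^2) for SLR, where error df = n - 2."""
    sst = ssm + sse
    r2 = 1 - sse / sst
    adj_r2 = 1 - (sse / (n - 2)) / (sst / (n - 1))  # 1 - MSE/MST
    return r2, adj_r2

SSM, SSE, N = 252378, 54825, 25   # sums of squares as printed by SAS
SST = SSM + SSE

# Reduced model (intercept only) has SSE(R) = SST with n - 1 error df.
f_stat = general_linear_f(sse_r=SST, df_r=N - 1, sse_f=SSE, df_f=N - 2)
r2, adj_r2 = r_squared(SSM, SSE, N)
print(round(f_stat, 2), round(r2, 4), round(adj_r2, 4))
```

Up to rounding of the printed sums of squares, this should reproduce the F value of 105.88, the R-Square of 0.8215 and the Adj R-sq of 0.8138 from the SAS output, which illustrates that the full-versus-reduced comparison and the ANOVA F test are the same calculation in SLR.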
