# Linear Regression and Time Series STAT 5100

Utah State University

This 195-page set of class notes was uploaded by Geovanny Lakin on Wednesday, October 28, 2015. The notes are for STAT 5100 at Utah State University, taught by John Stevens in the Fall. Since its upload, the document has received 48 views.


## Stat 5100 Notes, Spring 2009 — Unit 6: Regression with Discrete Response

| Section | Topic | Reading |
| --- | --- | --- |
| 6.1 | Logistic Regression | Hamilton, pp. 217–233 |
| 6.2 | Diagnostics in Logistic Regression | Hamilton, pp. 233–242 |
| 6.3 | Nominal/Ordinal Logistic Regression | Dobson, *An Introduction to Generalized Linear Models* (2002, 2nd ed.), pp. 135–150 |
| 6.4 | Poisson Regression | Dobson, pp. 151–170 |

### 6.1 Logistic Regression

Example: How likely is it that a male beetle with body weight 1.3 ounces would be killed by a pesticide at dosage 0.03 mg/L?

- Collect data on many beetles and record:
  - Y = alive (0) or dead (1)
  - X1 = sex (male = 1, female = 0)
  - X2 = body weight in ounces
  - X3 = dosage in mg/L
- Want to say something about p_i = P(Y_i = 1 | X_i1, X_i2, X_i3). Can we do this with our existing tools?

Linear regression model: Y_i = β0 + β1 X_i1 + … + β_{k-1} X_{i,k-1} + ε_i, with ε_i ~ N(0, σ²), so Y is assumed normal (or from a mixture of normals). Is our Y normal? Continuous?

The simplest discrete response is a dichotomous response:
- a two-level classification of each observation: male/female, live/die, left/right, yes/no, launch/explode, absent/present
- each observation i has response Y_i, dummy-coded 0/1, and predictor profile X_i1, …, X_{i,k-1}
- p_i = P(Y_i = 1 | X_i1, …, X_{i,k-1})
- Could consider p_i = β0 + β1 X_i1 + … + β_{k-1} X_{i,k-1} + ε_i, but it is hard to constrain 0 ≤ p̂_i ≤ 1
- A bigger range would be nice for the response in the model
- Make a transformation to keep 0 ≤ p̂_i ≤ 1: a link function
  - most common: the logit function, L_i = log(p_i / (1 − p_i)) (sketch)
  - another common choice: the probit function (sketch)
- Other S-shaped curves exist and tend to reach similar conclusions; the logit has a nice interpretation and computational advantages

Use the logit link → logistic regression:
- odds that Y_i = 1: p_i / (1 − p_i)
- L_i = log(odds_i)
- Fit a linear regression model to the logit: L_i = β0 + β1 X_i1 + … + β_{k-1} X_{i,k-1}
- Parameter estimates b0, b1, …, b_{k-1} come from an MLE-based iterative procedure: L̂_i = b0 + b1 X_i1 + … + b_{k-1} X_{i,k-1}
- Transform back to the probability scale: odds_i = e^{L̂_i}, and p̂_i = e^{L̂_i} / (1 + e^{L̂_i})

SAS output (6BeetlesLogisticRegression.sas):

    The LOGISTIC Procedure

    Response Profile
    Ordered              Total
    Value       Y    Frequency
    1           0           12
    2           1           18

    Probability modeled is Y=1.
    Analysis of Maximum Likelihood Estimates
                            Standard        Wald
    Parameter  DF  Estimate    Error   ChiSquare   Pr > ChiSq
    Intercept   1    1.7148   2.1722      0.6232       0.4299
    S           1    0.1552   1.0840      0.0205       0.8861
    weight      1   -5.3083   2.3596      5.0611       0.0245
    dosage      1  149.6     57.3314      6.8093       0.0091

    Wald Confidence Interval for Parameters
    Parameter  Estimate   95% Confidence Limits
    Intercept    1.7148   -2.5426     5.9722
    S            0.1552   -1.9694     2.2799
    weight      -5.3083   -9.9330    -0.6836
    dosage     149.6      37.2369   262.0

Interpretation of estimates:
- L̂_i = b0 + b1 X_i1 + …; if all X_ij = 0, then odds_i = e^{b0}
- Hold X_i2 = … = X_{i,k-1} = 0 and increase X_i1 from 0 to 1: L̂_i = b0 + b1, so odds_i = e^{b0+b1} = e^{b0} · e^{b1}
- So an increase of one unit in X_j, while holding the other predictors constant, multiplies the odds in favor of Y = 1 by a factor of e^{b_j}
- Or: the odds of Y = 1 change by 100(e^{b_j} − 1)% per unit increase in X_j, while holding the other predictors constant

Inference with estimates:
- Single-variable test, H0: β_j = 0 (X_j has no effect on P(Y = 1)). Test statistic t = b_j / SE(b_j); under H0, t is approximately standard normal for large n, so t² ~ χ² with 1 df (the Wald chi-square). Confidence intervals here: est ± SE × critical value.
- Subset-of-variables test (full vs. reduced models), H0: β_h = … = β_{k-1} = 0. Let ℒ_full be the likelihood associated with the full model, ℒ_red with the reduced. Test statistic X² = −2 log(ℒ_red / ℒ_full); under H0, X² ~ χ² with df = number of parameters set to zero (sketch).
- Overall model test (analogous to the model F test): the model X² test.

Goodness-of-fit measures:
- pseudo R-square, based on the X² from the model test
- others exist

More useful pieces in the SAS output:
- Odds ratio for X_j: the odds of Y = 1 when X_j = x + 1 vs. the odds of Y = 1 when X_j = x:
  e^{b0 + b1 X1 + … + b_j (X_j + 1) + … + b_{k-1} X_{k-1}} / e^{b0 + b1 X1 + … + b_j X_j + … + b_{k-1} X_{k-1}} = e^{b_j}
- Model X² = [−2 Log L, Intercept Only] − [−2 Log L, Intercept and Covariates]: the Likelihood Ratio Chi-Square.

    Model Fit Statistics
                  Intercept    Intercept and
    Criterion          Only       Covariates
    AIC              42.381           29.569
    SC               43.782           35.174
    -2 Log L         40.381           21.569

    Testing Global Null Hypothesis: BETA=0
    Test                ChiSquare   DF   Pr > ChiSq
    Likelihood Ratio      18.8120    3       0.0003
    Score                 14.9301    3       0.0019
    Wald                   7.8106    3       0.0501

    Linear Hypotheses Testing Results
                         Wald
    Label           ChiSquare   DF   Pr > ChiSq
    SexAndWeight       5.0613    2       0.0796

    Odds Ratio Estimates
                 Point        95% Wald
    Effect    Estimate   Confidence Limits
    S            1.168    0.140      9.775
    weight       0.005   <0.001      0.505
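The logit back-transformation and the e^{b_j} odds interpretation can be checked numerically. Below is a Python sketch (not part of the course's SAS materials) that plugs in the fitted beetle coefficients from the output above; the 1.3 oz and 0.03 mg/L profile values are the assumed values from the example.

```python
import math

# Fitted coefficients from the SAS output above (Intercept, S = sex, weight, dosage)
b = {"Intercept": 1.7148, "S": 0.1552, "weight": -5.3083, "dosage": 149.6}

def logit_to_prob(L):
    """Inverse logit: p = e^L / (1 + e^L)."""
    return math.exp(L) / (1.0 + math.exp(L))

def predicted_prob(sex, weight, dosage):
    L = b["Intercept"] + b["S"] * sex + b["weight"] * weight + b["dosage"] * dosage
    return logit_to_prob(L)

# Example profile from the slides: male, 1.3 oz, 0.03 mg/L
p = predicted_prob(sex=1, weight=1.3, dosage=0.03)

# A one-unit increase in a predictor multiplies the odds by e^{b_j}:
odds0 = p / (1 - p)
p1 = predicted_prob(sex=1, weight=2.3, dosage=0.03)  # weight + 1
odds1 = p1 / (1 - p1)
print(p, odds1 / odds0, math.exp(b["weight"]))  # last two agree by construction
```

Note that e^{b_weight} = e^{-5.3083} ≈ 0.005, matching the weight row of the Odds Ratio Estimates table.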
Odds ratio for dosage: >999.999 (95% Wald limits >999.999, >999.999).

Another goodness-of-fit measure: concordance.
- Look at all pairs of observations with different Y:
  - a pair is concordant if the Y = 1 obs has the larger p̂
  - a pair is discordant if the Y = 1 obs has the smaller p̂
  - a pair is tied if the Y = 1 and Y = 0 obs have the same p̂
- Higher percent concordant → better predictive ability.
- Let n_c = # concordant pairs, n_d = # discordant pairs, n_t = # tied pairs, and t = n_c + n_d + n_t. These are rank-correlation indices; in general, a model fit with larger values here has better predictive ability:
  - Somers' D = (n_c − n_d) / t
  - Gamma = (n_c − n_d) / (n_c + n_d)
  - Tau-a = (n_c − n_d) / (½ n (n − 1))
  - c = (n_c + ½ n_t) / t

*[Figure: conditional effect plot at weight 1.3 oz — predicted probability of mortality vs. dosage, 0.01–0.05 mg/L; circles are male, dots are female.]*

So what about p̂ = P(Y_i = 1 | X_i1 = 1, X_i2 = 1.3, X_i3 = 0.03)?

    Association of Predicted Probabilities and Observed Responses
    Percent Concordant   92.6      Somers' D   0.852
    Percent Discordant    7.4      Gamma       0.852
    Percent Tied          0.0      Tau-a       0.423
    Pairs               216        c           0.926

### 6.2 Diagnostics in Logistic Regression

Multicollinearity in logistic regression:
- Problem: relationships among predictors can inflate the SEs of the b_j's.
- Diagnostics: condition index and VIF (in proc reg, for example; the response Y isn't considered, only the X's).
- Remedial measures: similar to before, looking for a best subset. In proc logistic: backward elimination, stepwise selection, and score selection (similar to all-possible-regressions; displays the best models containing a certain number of predictors).

Collinearity check (The REG Procedure, Dependent Variable Y):

    Parameter Estimates
                    Parameter    Standard
    Variable   DF    Estimate       Error   t Value   Pr > |t|
    Intercept   1     0.76305     0.31399      2.43     0.0223
    S           1     0.02183     0.13640      0.16     0.8741
    weight      1     0.76140     0.26107      2.92     0.0072
    dosage      1    19.96193     4.81505      4.15     0.0003

    Collinearity Diagnostics
                          Condition   ------ Proportion of Variation ------
    Number   Eigenvalue       Index   Intercept        S    weight   dosage
    1           3.40995     1.00000     0.00385  0.02824   0.00496  0.01359
    2           0.42631     2.82821     0.00493  0.93284   0.00676  0.04735
    3           0.13506     5.02462     0.03173  0.03199   0.12982  0.83070
    4           0.02868    10.90468     0.95949  0.00693   0.85846  0.10837
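The VIF diagnostic comes from regressing each predictor on the others and using VIF_j = 1/(1 − R_j²), where R_j² is from that auxiliary regression. A stdlib-only Python sketch on toy, hypothetical numbers (not the beetle data):

```python
# VIF for one predictor: regress x2 on x1 and use VIF = 1 / (1 - R^2).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 8.1, 9.8]   # nearly collinear with x1 (hypothetical data)
n = len(x1)

m1, m2 = sum(x1) / n, sum(x2) / n
sxx = sum((a - m1) ** 2 for a in x1)
sxy = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
b1 = sxy / sxx                 # slope of the auxiliary regression
b0 = m2 - b1 * m1

# R^2 from regressing x2 on x1
tss = sum((b - m2) ** 2 for b in x2)
rss = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x1, x2))
r2 = 1 - rss / tss
vif = 1 / (1 - r2)
print(round(r2, 4), round(vif, 1))   # near-collinear -> large VIF
```

In the beetle output below the VIFs are all near 1, i.e., essentially no collinearity among S, weight, and dosage.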
Variance Inflation factors from the same run: S 1.00302, weight 1.00303, dosage 1.00001 (0 for the Intercept).

Backward Elimination (The LOGISTIC Procedure):

    Summary of Backward Elimination
            Effect          Number        Wald
    Step   Removed    DF        In   ChiSquare   Pr > ChiSq
    1      S           1         2      0.0205       0.8861

    Analysis of Maximum Likelihood Estimates
                            Standard        Wald
    Parameter  DF  Estimate    Error   ChiSquare   Pr > ChiSq
    Intercept   1    1.7766   2.1067      0.7112       0.3991
    weight      1   -5.2945   2.3380      5.1279       0.0235
    dosage      1  149.5     57.2610      6.8187       0.0090

Variable Selection (best by score, The LOGISTIC Procedure):

    Regression Models Selected by Score Criterion
    Number of        Score
    Variables    ChiSquare   Variables Included in Model
    1              10.0000   dosage
    1               4.9532   weight
    2              14.9152   weight dosage
    2              10.0000   S dosage

Outliers and influential observations in logistic regression:
- Problem: same as in linear regression.
- Diagnostics: similar to linear regression (hat diagonals, DFBETAS). Best for graphical checks:
  1. Δχ²_i vs. p̂_i
  2. ΔD_i vs. p̂_i
  3. ΔB̂_i vs. p̂_i
  4. bubble plot of Δχ²_i vs. p̂_i, with bubble size ∝ ΔB̂_i
- Remedial measures: similar to linear regression — (1) look for typos, (2) consider transformations or dropping points.

Δχ²_i (formula in text, p. 237):
- measures the decrease in misfit when obs i is deleted or ignored: "poorness of fit" for obs i
- large Δχ²_i → obs i (or any obs with the same X profile) is not well represented by the model
- SAS: DIFCHISQ, the one-step difference in the Pearson chi-square
- in the Δχ²_i vs. p̂_i plot, look for points with low p̂_i but Y_i = 1 (upper left corner), or with high p̂_i but Y_i = 0 (upper right corner)

ΔD_i (formula in text, p. 237):
- similar to Δχ²_i: poorness of fit for obs i
- look at a similar plot, ΔD_i vs. p̂_i
- SAS: DIFDEV, the one-step difference in deviance

ΔB̂_i (formula in text, p. 236):
- similar to Cook's distance: measures the influence of obs i (and obs with the same X profile) on the estimates b_j
- SAS: C (confidence interval displacement)
- in the ΔB̂_i vs. p̂_i plot, look for (1) points with ΔB̂_i ≥ 1 or so, or (2) points much different from the overall pattern
- bubble plot of Δχ²_i vs. p̂_i with size ∝ ΔB̂_i: look for big bubbles in the upper left (low p̂_i but Y_i = 1) or upper right (high p̂_i but Y_i = 0)

*[Figure: graphical checks for influential observations and outliers — one-step difference in Pearson chi-square vs. estimated probability.]*
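The one-step quantities DIFCHISQ and DIFDEV can be sketched for a single observation with the standard textbook approximations Δχ²_i = r_i²/(1 − h_i) and ΔD_i = d_i² + r_i² h_i/(1 − h_i), where r_i is the Pearson residual, d_i the deviance residual, and h_i the leverage. The numbers below are hypothetical:

```python
import math

# One-step influence diagnostics for a single logistic-regression observation
# (textbook approximations; y, p_hat, and leverage h are hypothetical numbers).
y, p_hat, h = 1, 0.15, 0.10   # observed Y = 1 despite a low fitted probability

r_pearson = (y - p_hat) / math.sqrt(p_hat * (1 - p_hat))

# DIFCHISQ-style one-step change in the Pearson chi-square when obs i is dropped
delta_chisq = r_pearson ** 2 / (1 - h)

# Deviance residual, and DIFDEV-style one-step change in deviance
d = math.copysign(
    math.sqrt(-2 * (y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))),
    y - p_hat,
)
delta_dev = d ** 2 + r_pearson ** 2 * h / (1 - h)
print(round(delta_chisq, 3), round(delta_dev, 3))
```

A point like this one (low p̂ but Y = 1) lands in the "upper left corner" of the diagnostic plots described above.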
*[Figures: graphical checks for influential observations and outliers — one-step differences in Pearson chi-square and in deviance vs. estimated probability; two labeled points: Obs 5 (sex M, weight 1.03) and Obs 30 (sex F, weight 2.00).]*

Suspect points:

    dosage   Y     phat      deltaChi   deltaD    deltaB
    0.05     0     0.97982    5.13748   1.06278   2.98564
    0.05     0     0.19441    0.7614    0.9525    1.64117

### 6.3 Nominal/Ordinal Logistic Regression

Example: How important are AC and power steering to different types of people?

| Variable | Description |
| --- | --- |
| sex | men (S = 1) or women (S = 0) |
| age | age class: 18–23 (reference); A2 = 1 for 24–40; A3 = 1 for over 40 |
| response Y | importance of AC and power steering in cars: 1 = little, 2 = important, 3 = very important |

Before: dichotomous response. Now: polytomous.

Types of polytomous response:
- multi-level classification of each individual:
  - order not meaningful → nominal logistic regression
    - Engineer / Poet / Other
    - Pine / Fir / Deciduous / Fern
    - Car / Truck / Van / SUV / Motorcycle
  - order meaningful → ordinal logistic regression
    - Disagree / Agree / Strongly Agree
    - Low / Medium / High
    - Bland / Mild / Medium / Hot / Fuego
- category count → Poisson regression (# of events at each covariate profile or exposure level)

#### 6.3.1 Nominal Logistic Regression

(no natural order of response levels)

- Before, we compared P(Y = 1 | covariates) with P(Y = 0 | covariates):
  L_i = log[ P(Y = 1 | covariates_i) / P(Y = 0 | covariates_i) ]
- That is, we looked at the probability of one category (Y = 1) vs. the probability of a reference category (Y = 0).
- Here, do similar: pick the reference category arbitrarily and compare the probabilities of the other categories to the reference category.

Notation:
- (X_i1, …, X_{i,k-1}) is the covariate pattern (profile), covariates_i
- Y = 0, 1, …, R are the possible responses (non-meaningful numeric representation)
- Y = 0 is the reference category
- p_ri = P(Y_i = r | covariates_i) for r = 0, …, R, with Σ_r p_ri = 1
- generalized logit (glogit) link function: L_ri = log(p_ri / p_0i)

Fit the model L_ri = β_0r + β_1r X_i1 + … + β_{k-1,r} X_{i,k-1}, where β_jr is the coefficient on X_j for Y = r vs. Y = 0.
- parameter estimates by an iterative procedure with MLE

After fitting the model:
L̂_ri = b_0r + b_1r X_i1 + … + b_{k-1,r} X_{i,k-1} = predicted log[ P(Y = r | covariates_i) / P(Y = 0 | covariates_i) ]

Estimate interpretation is similar to logistic regression: an increase of one unit in X_j, while holding other predictors constant, multiplies the odds of Y = r vs. Y = 0 by a factor of e^{b_jr}.

Diagnostics: similar to logistic regression.

Note: when the predictors X_1, …, X_{k-1} are all categorical, the data can be represented more efficiently in a table.

SAS output (6PolytomousResponse.sas) — nominal logistic regression:

    Response Profile
    Ordered                  Total
    Value     response   Frequency
    1         3                101
    2         2                 94
    3         1                105

    Logits modeled use response=1 as the reference category.

    Model Fit Statistics
                  Intercept    Intercept and
    Criterion          Only       Covariates
    AIC             662.544          596.702
    SC              669.952          626.332
    -2 Log L        658.544          580.702

    Testing Global Null Hypothesis: BETA=0
    Test                ChiSquare   DF   Pr > ChiSq
    Likelihood Ratio      77.8419    6       <.0001
    Score                 74.9761    6       <.0001
    Wald                  62.9703    6       <.0001

    Type 3 Analysis of Effects
                        Wald
    Effect   DF    ChiSquare   Pr > ChiSq
    S         2       6.4173       0.0404
    A2        2      17.5366       0.0002
    A3        2      47.5933       <.0001

    Analysis of Maximum Likelihood Estimates
                                       Standard        Wald
    Parameter  response  DF  Estimate     Error   ChiSquare   Pr > ChiSq
    Intercept  3          1   -1.0391    0.3305      9.8843       0.0017
    Intercept  2          1    0.5908    0.2840      4.3286       0.0375
    S          3          1   -0.8129    0.3210      6.4122       0.0113
    S          2          1   -0.3881    0.3005      1.6677       0.1966
    A2         3          1    1.4780    0.4009     13.5912       0.0002
    A2         2          1    1.1283    0.3416     10.9059       0.0010
    A3         3          1    2.9165    0.4229     47.5594       <.0001
    A3         2          1    1.5876    0.4029     15.5270       <.0001

    Odds Ratio Estimates
                           Point        95% Wald
    Effect    response  Estimate   Confidence Limits
    S         3            0.444    0.236      0.832
    S         2            0.678    0.376      1.223
    A2        3            4.384    1.998      9.620
    A2        2            3.090    1.582      6.037
    A3        3           18.477    8.066     42.327
    A3        2            4.892    2.221     10.776

#### 6.3.2 Ordinal Logistic Regression

(a natural order exists in the response levels)

- take advantage of the ordering
- R + 1 possible levels, order meaningful
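Before moving to the ordinal model: the generalized-logit back-transformation of Section 6.3.1 can be checked numerically. The Python sketch below uses hypothetical logit values, and recovers category probabilities via p_r = e^{L_r} / (1 + Σ_s e^{L_s}), with p_0 = 1 / (1 + Σ_s e^{L_s}) for the reference category.

```python
import math

def category_probs(logits):
    """Map generalized logits (L_1, ..., L_R), each vs. reference category 0,
    back to the category probabilities (p_0, p_1, ..., p_R)."""
    denom = 1.0 + sum(math.exp(L) for L in logits)
    return [1.0 / denom] + [math.exp(L) / denom for L in logits]

# Hypothetical fitted logits for two non-reference categories
L = [0.3, -0.5]
p = category_probs(L)
print(p, sum(p))   # probabilities sum to 1
```

By construction, log(p_r / p_0) reproduces each input logit exactly.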
Notation:
- p^C_ri = P(Y_i ≤ r | covariates_i), "C" for cumulative
- the logit link here gives the proportional odds model:
  L_ri = log[ p^C_ri / (1 − p^C_ri) ] = log[ P(Y_i ≤ r) / P(Y_i > r) ], for r = 1, …, R

With covariates, fit the model L_ri = β_0r + β_1 X_i1 + … + β_{k-1} X_{i,k-1}, where β_j is the coefficient on X_j for Y ≤ r vs. Y > r. An iterative procedure with MLE gives the estimates b_01, …, b_0R, b_1, …, b_{k-1}: one intercept per cumulative logit and common slopes (the "proportional odds" part).

After fitting the model:
L̂_ri = b_0r + b_1 X_i1 + … + b_{k-1} X_{i,k-1} = predicted log[ P(Y ≤ r | covariates_i) / P(Y > r | covariates_i) ]

Estimate interpretation is similar to logistic regression:
- An increase of one unit in X_j, while holding other predictors constant, multiplies the odds of Y ≤ r vs. Y > r by a factor of e^{b_j}.
- So b_j > 0 → increasing X_j makes Y ≤ r more likely.

Ordinal logistic regression SAS output:

    Response Profile
    Ordered                  Total
    Value     response   Frequency
    1         3                101
    2         2                 94
    3         1                105

    Probabilities modeled are cumulated over the lower Ordered Values.

    Score Test for the Proportional Odds Assumption
    ChiSquare   DF   Pr > ChiSq
    0.7139       3       0.8699

    Model Fit Statistics
                  Intercept    Intercept and
    Criterion          Only       Covariates
    AIC             662.544          591.296
    SC              669.952          609.814
    -2 Log L        658.544          581.296

    Testing Global Null Hypothesis: BETA=0
    Test                ChiSquare   DF   Pr > ChiSq
    Likelihood Ratio      77.2485    3       <.0001
    Score                 70.0452    3       <.0001
    Wald                  68.0278    3       <.0001

    Analysis of Maximum Likelihood Estimates
                                Standard        Wald
    Parameter    DF   Estimate     Error   ChiSquare   Pr > ChiSq
    Intercept 3   1     1.6546    0.2536     42.5742       <.0001
    Intercept 2   1    -0.0433    0.2303      0.0354       0.8508
    S             1    -0.5762    0.2261      6.4936       0.0108
    A2            1     1.1468    0.2773     17.1079       <.0001
    A3            1     2.2322    0.2904     59.0806       <.0001

    Odds Ratio Estimates
                 Point        95% Wald
    Effect    Estimate   Confidence Limits
    S            0.562    0.361      0.875
    A2           3.148    1.828      5.421
    A3           9.320    5.275     16.467

### 6.4 Poisson Regression

Example: the famous 1951–1961 study of British doctors by Sir Richard Doll (1912–2005), who first confirmed the smoking / lung cancer link. All male British doctors were surveyed in 1951.

| Variable | Description |
| --- | --- |
| Deaths | # dead from coronary heart disease after 10 years |
| Person-Years | sum of the time each person was observed ("exposure"); if a person drops out, is lost, etc., only their time involved is counted |
| Age | age class in 1951: 1 = 35–44, 2 = 45–54, 3 = 55–64, 4 = 65–74, 5 = 75–84 |
| Smoker | smoking status: 1 = yes, 0 = no |

Data:

                  Smokers                 Non-smokers
    Age        Deaths   Person-Years   Deaths   Person-Years
    35-44          32          52407        2          18790
    45-54         104          43248       12          10673
    55-64         206          28612       28           5710
    65-74         186          12663       28           2585
    75-84         102           5317       31           1462

Why include Person-Years (exposure)?
- a higher # of young smokers → a higher # of deaths, even if age and smoking have no effect
- size of the population at risk: higher person-years → more deaths recorded

Here the response Y_i = # who die from heart disease at covariate profile i; Y = # of events at each exposure level → Poisson regression.

The Poisson distribution is often used to model counts (see Ross, *A First Course in Probability*, 5th ed.):
- # misprints on each page in a book
- # wrong telephone calls made each day
- # widgets sold at a store each day
- # vacancies on the Supreme Court each year
- # students to visit office hours each week

(sketch) Here, investigate what predictors contribute to the counts.

- Y_i = # events at exposure n_i, with covariate pattern covariates_i = (X_i1, …, X_{i,k-1})
- p_i = P(event occurs | covariates_i)
- Assume log p_i = β_0 + β_1 X_i1 + … + β_{k-1} X_{i,k-1}
- E[Y_i] = n_i p_i, so

  log E[Y_i] = log n_i + β_0 + β_1 X_i1 + … + β_{k-1} X_{i,k-1}

  where log n_i is the offset (just need to account for it; not interested in it) and the rest is the linear model we are interested in.
- An iterative MLE-based procedure gives the estimates b_0, b_1, …, b_{k-1} and associated confidence intervals.

*[Figures: death rate (deaths per 100,000 person-years) and log of death rate vs. age class (1 = 35–44, 2 = 45–54, etc.); circles mark smokers.]*

Poisson regression (The GENMOD Procedure):

    Data Set              WORK.SMOKING
    Distribution          Poisson
    Link Function         Log
    Dependent Variable    deaths
    Offset Variable       lyears

    Criteria For Assessing Goodness Of Fit
    Criterion            DF      Value    Value/DF
    Deviance              5     1.6354      0.3271
    Scaled Deviance       5     1.6354      0.3271
    Pearson Chi-Square    5     1.5503      0.3101
    Scaled Pearson X2     5     1.5503      0.3101
    Log Likelihood              2727.6433
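The role of the offset can be illustrated directly: with log E[Y] = log(n) + b0 + b1·x, the expected count scales linearly in the exposure n, while a one-unit change in x multiplies it by e^{b1}. A Python sketch with hypothetical coefficients (not the fitted smoking model):

```python
import math

# Poisson regression with an offset: log E[Y] = log(n) + b0 + b1*x,
# i.e., E[Y] = n * exp(b0 + b1*x). Coefficients are hypothetical.
b0, b1 = -6.0, 0.9

def expected_count(exposure, x):
    return exposure * math.exp(b0 + b1 * x)

# Doubling the exposure doubles the expected count at the same x:
e1 = expected_count(exposure=10_000, x=1)
e2 = expected_count(exposure=20_000, x=1)

# A one-unit increase in x multiplies the expected count by exp(b1):
e3 = expected_count(exposure=10_000, x=2)
print(e2 / e1, e3 / e1)
```

This is why the offset coefficient is fixed at 1 rather than estimated: it simply converts counts into rates per unit of exposure.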
    Analysis Of Parameter Estimates
                              Standard      Wald 95%           Chi-
    Parameter  DF   Estimate     Error  Confidence Limits    Square  Pr > ChiSq
    Intercept   1   -10.7918    0.4501  -11.6739  -9.9096    574.92      <.0001
    smoker      1     1.4410    0.3722    0.7115   2.1705     14.99      0.0001
    age         1     2.3765    0.2079    1.9689   2.7841    130.60      <.0001
    agesq       1    -0.1977    0.0274   -0.2513  -0.1440     52.17      <.0001
    smkage      1    -0.3075    0.0970   -0.4977  -0.1174     10.04      0.0015
    Scale       0     1.0000    0.0000    1.0000   1.0000

    NOTE: The scale parameter was held fixed.

Interpretation of estimates:
- log μ̂_i = b_0 + b_1 X_i1 + … + b_{k-1} X_{i,k-1}
- μ̂_i = n_i e^{b_0 + b_1 X_i1 + … + b_{k-1} X_{i,k-1}} = expected count
- An increase of one unit in X_j, while holding the other predictors constant, multiplies the expected count by a factor of e^{b_j}.

Goodness-of-fit measures:
- Deviance = 2 Σ_i Y_i log(Y_i / μ̂_i), and Pearson X² = Σ_i (Y_i − μ̂_i)² / μ̂_i; a Taylor series expansion shows that these are approximately equal.
- Compare with the χ²_{N−k} distribution, where N = # distinct covariate patterns and k = # parameters estimated; large X² → conclude poor fit.

Note (slide 6-37):
- the Poisson distribution has E[Y] = Var[Y]
- let Var[Y_i] = σ² E[Y_i]; σ² is the "scale", assumed to be 1
- if scale > 1, then this is overdispersion: a problem with the model assumptions
- the scale σ² can be estimated by several different methods, including X² / (N − k); see a categorical data analysis course or a generalized linear models course for more

A final example: are sex and job type independent in a company?

                     Men   Women
    Administrative     8       7
    Clerical           4      17
    Janitorial         2      16
    Professional      13      14

How do you answer this in Stat 1040?
- H0: Sex and Type are independent
- Under H0, expected count = (row total × column total) / (table total)
- Test statistic X² = sum of (obs − exp)² / exp, compared with χ² on (# rows − 1)(# cols − 1) df

This X² test is fitting a saturated model with Y = observed count:

log E[Y] = μ + Sex_i + Type_j + SexType_ij,  with H0: SexType_ij = 0

The X² test statistic is the goodness-of-fit X², because the alternative to a poorly fitting additive model is to saturate it with the interactions SexType_ij. This is called a log-linear model:
- similar to Poisson regression
- exposure is constant: n_i = n for all i
- fit a Poisson regression model, but don't include the offset, because it's constant
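The Stat 1040 computation above can be carried out directly for the sex-by-job-type table; the resulting X² matches the Pearson chi-square reported for the log-linear fit.

```python
# Pearson chi-square test of independence for the sex-by-job-type table above.
table = [
    [8, 7],    # Administrative: men, women
    [4, 17],   # Clerical
    [2, 16],   # Janitorial
    [13, 14],  # Professional
]

row_tot = [sum(r) for r in table]
col_tot = [sum(c) for c in zip(*table)]
n = sum(row_tot)

# Under H0 (independence): expected = row total * column total / table total
expected = [[rt * ct / n for ct in col_tot] for rt in row_tot]
chi_sq = sum(
    (table[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(4) for j in range(2)
)
df = (4 - 1) * (2 - 1)
print(round(chi_sq, 4), df)   # 11.2952 3
```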
Log-linear model: look at the independence of sex and type, tested by the Pearson chi-square statistic (The GENMOD Procedure):

    Criteria For Assessing Goodness Of Fit
    Criterion            DF      Value    Value/DF
    Deviance              3    11.9864      3.9955
    Scaled Deviance       3    11.9864      3.9955
    Pearson Chi-Square    3    11.2952      3.7651
    Scaled Pearson X2     3    11.2952      3.7651
    Log Likelihood              113.0124

    Analysis Of Parameter Estimates
                                Standard    Wald 95%           Chi-
    Parameter       DF Estimate    Error  Conf Limits         Square Pr > ChiSq
    Intercept        1   2.8904   0.2079   2.4830   3.2978    193.34     <.0001
    sex    Men       1  -0.6931   0.2357  -1.1551  -0.2312      8.65     0.0033
    sex    Women     0   0.0000   0.0000   0.0000   0.0000
    type   Administ  1  -0.5878   0.3220  -1.2190   0.0434      3.33     0.0680
    type   Clerical  1  -0.2513   0.2910  -0.8216   0.3190      0.75     0.3877
    type   Janitori  1  -0.4055   0.3043  -1.0019   0.1909      1.78     0.1827
    type   Professi  0   0.0000   0.0000   0.0000   0.0000
    Scale            0   1.0000   0.0000   1.0000   1.0000

    Obs   IndepPval
    1       0.01023

Some final remarks on Poisson regression:
- When the response Y is category counts, the counts Y follow a Poisson distribution; as the mean → ∞, Poisson → Normal. It is not only for rare events; the key is that E[Y] = Var[Y] (see slide 6-37).
- Why not logistic regression here? A covariate pattern describes a category, not an individual.

Final aside: why "logistic" regression? If T ~ logistic(μ, s), the cdf is F_T(t) = e^{(t−μ)/s} / (1 + e^{(t−μ)/s}). This cdf is the solution to a particular differential equation:
- simplest form: P′ = P(1 − P), with P(0) = 0.5
- related to population growth and carrying capacity (sketch)

This cdf is called the logistic equation:
- from the French "logistique" (Verhulst): especially, providing the necessary support for ongoing, growing military operations
- describes self-limiting growth of a population

---

## Statistics 5100 — SAS Crash Course

To open SAS in AgSci 119: Start → Programs → Reserve → SAS → SAS 9.1.

SAS is a software package used by many companies and researchers. We will focus in this course on the statistical tools available in SAS. This document will give you a quick crash course in some of the basics of SAS operation.

SAS has four main windows:
- Editor: type and run commands here (color matters); .sas files open here
- Log: tells what was done; helps with debugging (color matters)
- Output: see results
- Graph: shows nice graphs (only created by certain commands)
To run a SAS program, or a highlighted section of SAS code, click on the running-man icon, or hit F8.

Read in Data:
- By hand:

      data a1;
        input x y;
        cards;
      1 2
      1 4
      2 3
      3 2
      ;
      run;

- infile statement (for a text file such as C:\folder\datafile.txt containing the same four rows):

      data a1;
        infile 'C:\folder\datafile.txt';
        input x y;
      run;

- Import facilities for specially formatted data files (more later).

Necessary components to run a program:
- a semi-colon at the end of every line (except for datalines)
- a data statement that names your data set (unless you import the data set)
- an input statement (unless you import the data set)
- at least one space between each word or statement
- a run statement

Mathematical Operators:

| Function | Operator | Example |
| --- | --- | --- |
| addition | + | x+y |
| subtraction | - | x-y |
| multiplication | * | x*y |
| division | / | x/y |
| power | ** | x**2 |
| log | log | log(x) |
| exponential | exp | exp(x) |

Logical Operators:

| Function | Operator | Example |
| --- | --- | --- |
| equal | = or eq | if x = 1 |
| unequal | ne | if x ne 1 |
| less than | < or lt | if x < 1 |
| less than or equal to | <= or le | if x <= 1 |
| greater than | > or gt | if x > 1 |
| greater than or equal to | >= or ge | if x >= 1 |

Sample SAS code to create and view a new data set:

    data a2; set a1;        * New Data Set;
      xy = x*y;
      xsq = x**2;
      xeq1 = 0;
      if x = 1 then xeq1 = 1;
    run;
    proc print data=a2;
      var x y xeq1 xy xsq;
      title1 'New Data Set';
    run;

    New Data Set
    Obs  x  y  xeq1  xy  xsq
    1    1  2  1      2   1
    2    1  4  1      4   1
    3    2  3  0      6   4
    4    3  2  0      6   9

Sample SAS code to only keep certain observations in a new data set:

    data a3; set a2;        * Smaller Set;
      if y < 3.5;           * default "then" is keep;
    run;
    * Same as:
    data a3; set a2;
      if y > 3.5 then delete;
    run;
    proc print data=a3;
      var x y xeq1;
      title1 'Smaller Set';
    run;

    Smaller Set
    Obs  x  y  xeq1
    1    1  2  1
    2    2  3  0
    3    3  2  0

Procedures: data steps are used to create, read in, and manipulate data sets. Procedures (procs) are used to perform specific analyses or to create specific types of output (usually more than you need). Some basic procedures:

    proc print data=a1;  var y x;  title1 'Title here';  run;
    proc univariate data=a1;  var x;  title1 'Title here';  run;
    proc corr data=a1;  var x y;  title1 'Title here';  run;
    proc sort data=a1 out=a4;  by x;  run;      * Note: no output;
    proc means data=a4 mean median stddev clm alpha=0.05;  * Note: options;
      by x;  var y;  title1 'Title here';  run;
    proc plot data=a1;
      plot x*y;
      title1 'Ugly Plot';
    run;
    proc gplot data=a1;
      plot x*y;           * CIRCLE plotting symbol;
      title1 'Nicer Plot';
    run;

Exporting output: for text output, just copy and paste into a word-processing document, paying attention to font (use a font that gives equal width to every character, like SAS Monospace or Courier New). For graphics output, right-click on the graphics page, select Edit → Copy, and then paste (Paste Special into a report document; best quality as a Device Independent Bitmap). You can also send graphics output to pdf, ps, or jpeg files.

Closing SAS: save the code in your Editor window as a .sas file in a location of your choice. Close SAS as any other Windows application (red X, for example); SAS will ask if you're sure.

Miscellaneous SAS Notes:
- Help in SAS: the help facility (drop-down menu, or the question-mark book icon) can be useful if you have a good idea of what you want to do. Some examples:
  - see appropriate syntax and possible options for a specific procedure
  - see examples of SAS code for specific analyses or graphics
  - learn more about what a specific procedure option does (see title1 below)
- Code Continuity: code can be written across lines; SAS only looks for semicolons to break up code (except for data lines). To read in data continuously, use a trailing @@ on the input statement.
- Missing Values: SAS procedures will completely ignore an observation if one of the called variables is missing; to code a value as missing in SAS, use the period character (.).
- Strings: read in character variables with $ after the name in the input line.
- Comment Lines: to comment out a line (up to the next semi-colon), put an asterisk (*) before it. To comment out an entire section, start it with /* and end it with */.
- Output Options: you can make the SAS output window a little more friendly with judicious use of the options statement.
- Selective Output: SAS will usually give you more than you want, so you'll need to know what you want in order to do anything useful with the output.
- Save Code: save your code from the Editor window as a .sas file in whatever location (thumb drive?) you choose;
Double-clicking the saved file icon opens SAS; run the code to recreate all results.

Example, and portions of output:

    options ls=80 nodate pageno=1 formdlim='"';
    data a1;
      input x y z $;
      cards;
    1 2 alpha
    1 4 .
    2 3 gamma
    3 . delta
    ;
    run;
    proc print data=a1;
      var y x z;
      title1 "Data Set";
    run;
    proc means data=a1;
      var y;
      title1 "Means Output";
    run;
    proc print data=a1;
      var x z;
      title2 "Subtitle";
    run;

    Data Set
    Obs   y   x   z
    1     2   1   alpha
    2     4   1
    3     3   2   gamma
    4     .   3   delta

(The "Means Output" and "Subtitle" sections of the output follow similarly.)

## Stat 5100 Notes, Spring 2009 — Unit 2: Simple Linear Regression

| Section | Topic | Reading |
| --- | --- | --- |
| 2.1 | Scatterplots & Correlation | pp. 34–37, 113–115 |
| 2.2 | Linear Regression Model | pp. 30–34, 37–38 |
| 2.3 | Inference for Coefficients | pp. 42–49 |
| 2.4 | Residual Diagnostics | pp. 51–53, 116–117 |
| 2.5 | Residuals & Remedial Measures | pp. 53–58 |
| 2.6 | Problems with Regression | pp. 51, 109–113 |

### 2.1 Scatterplots & Correlation

Relationship between two variables: x = midterm & y = final.

| Student | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Midterm | 50 | 66 | 72 | 75 | 92 |
| Final | 65 | 65 | 89 | 79 | 97 |

*[Figures: two scatterplots of final vs. midterm, with (ave ± SD) marked on each axis.]*

SD measures the variability of x and y separately. What about how much they vary together?

Recall that, for data x_1, …, x_n, SD_x = √(ave of [deviations from ave × deviations from ave]). To measure the covariability, or covariance, of two lists x and y, do something similar:

Cov = ave of [ (deviations from ave for x) × (deviations from ave for y) ] = [1/(n−1)] Σ_{i=1}^n (x_i − x̄)(y_i − ȳ)

Sometimes rewritten, and approximated by:

Cov ≈ (ave of products x·y) − (ave of x)(ave of y)

Back to the question of association strength: how much covariability is there, relative to the variability of x and y? Here, "relative to" leads us to divide:

(Covariability of x and y) / (Variability of x and y) = Cov(x, y) / (SD_x · SD_y)

Because this represents the strength of the linear relation between x and y — the linear co-relation of x and y — we will call this measure the correlation (or correlation coefficient) of x and y.
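The covariance and correlation recipes above can be verified on the midterm/final data with a short stdlib-only Python sketch:

```python
import math

# Midterm (x) and final (y) scores from the table above
x = [50, 66, 72, 75, 92]
y = [65, 65, 89, 79, 97]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Covariance with the (n - 1) divisor, as in the notes
cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

# SDs with the matching (n - 1) divisor
sd_x = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
sd_y = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))

r = cov / (sd_x * sd_y)
print(round(cov, 1), round(r, 3))   # 188.0  0.866
```

So the midterm and final scores are strongly, positively, linearly associated (r ≈ 0.87).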
We use the symbol r to represent it, and use the following shortcut to calculate it:

r = [1/(n−1)] Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / (SD_x · SD_y)

Correlation measures linear association, or the degree of clustering about the SD line. A scatterplot shows the relationship between two variables; correlation tells how strong it is (linearly).

SAS output (2Concord1Regression.sas):

*[Figures: scatter plots of pairs of variables for the Concord water data — Water81 vs. Water80, Water81 vs. Water79, Water81 vs. Income, Water81 vs. Education.]*

    Pearson Correlation Coefficients
    Prob > |r| under H0: Rho=0 / Number of Observations

              Water81    Water80    Water79    Income
    Water81   1.00000    0.76479    0.73244    0.41779
                         <.0001     <.0001     <.0001
              496        496        449        496
    Water80   0.76479    1.00000    0.72721    0.33705
              <.0001                <.0001     <.0001
              496        496        449        496
    Water79   0.73244    0.72721    1.00000    0.32110
              <.0001     <.0001                <.0001
              449        449        449        449
    Income    0.41779    0.33705    0.32110    1.00000
              <.0001     <.0001     <.0001
              496        496        449        496

Graphical counterpart to the pairwise correlation matrix: the scatterplot matrix. In SAS the code is quite cumbersome (one of SAS's great weaknesses). Instead, use Interactive Data Analysis:
1. Select Solutions → Analysis → Interactive Data Analysis
2. Select the WORK library and the desired data set, then Open
3. Select Analyze → Scatter Plot
4. Highlight the desired variables and select Y (use the Ctrl key for multiple selections), repeat for X, then select OK

When done clicking and double-clicking, close the IDA window.
*[Figure: scatterplot matrix of Water79, Water80, Water81, Income — requires the right Java Runtime Environment (JRE).]*

### 2.2 Linear Regression Model

Recall: the SD line represents the whole cloud of data (sketch), with slope ±SD_y/SD_x. A more useful line: represent the average y for each value of x (sketch). Look at the average y over intervals of x — the "graph of averages". The regression line smooths this graph (sketch).

Simple linear regression model: consider a response variable Y (say Water81) and a predictor variable X (say Income). Measure X and Y on individuals i = 1, …, n. An overall linear relationship:

Y_i = β0 + β1 X_i + ε_i,  i = 1, …, n

Why "linear"? Compare:
1. Y_i = β0 + β1 log(X_i) + ε_i
2. Y_i = β0 X_i^{β1} ε_i

"Linear" means linear in the parameters β0 and β1, so model 1 still counts. Equivalent phrasings: regress Y on X; predict Y from X; predict the average Y for a given X.

Major assumptions of the linear regression model:
1. linearity of Y vs. X (transformations)
2. ε_1, …, ε_n iid N(0, σ²):
   a. error terms independent
   b. error terms have a symmetric (normal) distribution
   c. error terms have constant variance σ² (homogeneity)
3. no outliers

Fit this model: estimate the parameters β0 & β1 (unknown). Call these estimates β̂0 & β̂1, or b0 & b1. The estimated Y (predicted average for a value X) is Ŷ = β̂0 + β̂1 X = b0 + b1 X.

How to estimate the parameters?
- many methods exist
- in general, want Ŷ_i close to Y_i

Ordinary Least Squares (OLS): let Q = Σ_i (Y_i − Ŷ_i)².
- What does Q measure? (sketch)
- Pick b0 and b1 to minimize Q.

Minimize Q = Σ_{i=1}^n (Y_i − (b0 + b1 X_i))² with respect to b0 and b1. Using calculus and re-arranging terms:

b1 = r · SD_y / SD_x,  b0 = Ȳ − b1 X̄

What is the predicted Y for X = X̄? (calculate & sketch)

Scatterplot: does it look linear? *[Figure: Water81 vs. Income scatterplot.]*

    The CORR Procedure
    Variable   N     Mean    Std Dev   Sum        Minimum   Maximum
    Water81    496   2298    1486      1140000    100       10100
    Income     496   23077   13058     11446000   2000      100000

    Pearson Correlation Coefficients, N = 496
               Water81    Income
    Water81    1.00000    0.41779
    Income     0.41779    1.00000
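The closed-form OLS solution above (b1 = r·SD_y/SD_x, which algebraically equals S_xy/S_xx, and b0 = ȳ − b1·x̄) can be checked on the midterm/final data from Section 2.1:

```python
# OLS slope and intercept for the midterm/final data.
x = [50, 66, 72, 75, 92]
y = [65, 65, 89, 79, 97]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Direct least-squares formulas: b1 = S_xy / S_xx
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# The fitted line passes through the point of averages (xbar, ybar)
yhat_at_xbar = b0 + b1 * xbar
print(round(b1, 4), round(b0, 2))
```

This also answers the "predicted Y for X = X̄" question above: by construction, Ŷ at X̄ is exactly Ȳ.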
Example: Concord1 data, regressing Water81 on Income. Simple linear regression (The REG Procedure):

    Model: MODEL1
    Dependent Variable: Water81
    Number of Observations Read   496
    Number of Observations Used   496

    Analysis of Variance
                           Sum of        Mean
    Source           DF    Squares       Square       F Value   Pr > F
    Model              1   190820566     190820566     104.46   <.0001
    Error            494   902418143       1826757
    Corrected Total  495  1093238710

    Root MSE         1351.57589    R-Square   0.1745
    Dependent Mean   2298.38710    Adj R-Sq   0.1729
    Coeff Var          58.80541

    Parameter Estimates
                     Parameter     Standard
    Variable   DF     Estimate        Error   t Value   Pr > |t|
    Intercept   1   1201.12436    123.32451      9.74     <.0001
    Income      1      0.04755      0.00465     10.22     <.0001

*[Figure: scatterplot with the regression (dashed) and SD (solid) lines — Water81 vs. Household Income in 1981.]*

Fitted regression line: Water81-hat = 1201.12436 + 0.04755 × Income.

Get the same slope estimate using the correlation output (b1 = r · SD_y/SD_x = 0.41779 × 1486/13058 ≈ 0.04755):

    The CORR Procedure
    Variable   N     Mean    Std Dev   Sum        Minimum   Maximum
    Water81    496   2298    1486      1140000    100       10100
    Income     496   23077   13058     11446000   2000      100000

    Pearson Correlation Coefficients, N = 496
               Water81    Income
    Water81    1.00000    0.41779
    Income     0.41779    1.00000

Predicted values for 20000 < Income < 22000:

    Obs   Income   Water81   Predict
          21000    2200      2199.65
          21000    1100      2199.65
          21000    1500      2199.65
          21000    2600      2199.65
          21000    4300      2199.65

### 2.3 Inference for Coefficients

Recall the linear model Y_i = β0 + β1 X_i + ε_i, with predicted values Ŷ_i = b0 + b1 X_i and parameter estimates by OLS. How good is the model?
- how much variability is explained by the model?
- how confident are we in the estimates?

Example: Concord1 data, regressing Water81 on Income (The REG Procedure):

    Analysis of Variance
                           Sum of        Mean
    Source           DF    Squares       Square       F Value   Pr > F
    Model              1   190820566     190820566     104.46   <.0001
    Error            494   902418143       1826757
    Corrected Total  495  1093238710

    Root MSE         1351.57589    R-Square   0.1745
    Dependent Mean   2298.38710    Adj R-Sq   0.1729
    Coeff Var          58.80541
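The ANOVA identities can be checked directly against the printed sums of squares (the one-unit discrepancy in the total is rounding in the SAS output):

```python
# Check the ANOVA identities using the sums of squares from the REG output above.
ess = 190_820_566      # model (explained) SS
rss = 902_418_143      # residual SS
tss = 1_093_238_710    # corrected total SS (= ESS + RSS, up to rounding)

r2 = ess / tss                 # R-square = ESS/TSS = 1 - RSS/TSS
mse = rss / 494                # MS(Residual)
f = (ess / 1) / mse            # F = MS(Model) / MS(Residual)
print(round(r2, 4), round(f, 2))   # 0.1745  104.46
```

Both values reproduce the R-Square and F Value lines of the output.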
MSMOdelMSResidual Residual RSS SS df Total TSS n 1 TSS total sum of squares 2106 l72 n 1 var of Y RSS residual sum of squares 2106 16 sometimes SSE sum of squared errors residual e7 Y7 Y7 ESS explained sum of squares TSS RSS sometimes model sum of squares Coef cient of determination m U 2S 39R W m o R2 of variation in Y explained by model higher R2 gt stronger linear t for simple model With one predictor X R2 r 2 2 26 2 27 How good is the estimate 91 Not usually concerned about 90 0 how useful or signi cant is that term variable X parameter l in predicting Y accounting for variability in Y 0 how much worse would the model t the data without using these terms Basically we have a hypothesis test null H0 l 0 alternative H1 l 7E 0 2 28 Under H1 ie if H1 is true model is Y g 1X7 Q Under H0 ie if H0 is true model is Y 30 67 l 0 Notice that we get both 91 and SE1 the standard error of 1 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr gt t Intercept 1 120112436 12332451 974 lt0001 Income 1 004755 000465 1022 lt0001 De ne test statistic t b1 0SE1 Under H0 t N tn2 What does this mean 2 29 Consider testing H0 l c for some value 0 Test statistic t b51516 depends on our sample of data Under H0 t N tn2 sampling distribution t with n 2 degrees of freedom If H0 is true and we drew many samples of size n from this popula tion calculating t for each sample this would be the distribution of the t s sketch 2 30 Degrees of freedom basic idea 0 Suppose I have sampled ve observations from some population With mean u and the sample mean is 4 Four of the observations are 4 6 3 2 What is the other observation 0 degrees of freedom of unconstrained data points 0 Estimating a single parameter like u costs one df 0 Recall sample variance 82 231x7 32 O Yi 0 1X i 6 6 N002i1n b0 30191 31 gt COSt 2 gt have n 2 df to estimate 02 used in SE of 91 2 62 22106 m2 gt tn2 sampling distribution 1 sided vs 2 sided tests 0 H12 l gtC 0139 H1 l ltC Get P Value from a 1 sided test 
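The t statistic just defined can be computed end-to-end from a tiny data set. A minimal pure-Python sketch (the toy data and the function name `slope_t_test` are illustrative, not from the notes):

```python
import math

def slope_t_test(x, y):
    """OLS fit of y = b0 + b1*x, returning (b0, b1, SE(b1), t, df).

    t = b1 / SE(b1) is compared against a t distribution with n-2 df:
    estimating the two parameters b0 and b1 costs 2 degrees of freedom."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                       # slope: r * SDy/SDx in disguise
    b0 = ybar - b1 * xbar                # intercept: line passes through (xbar, ybar)
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    df = n - 2
    sigma2_hat = rss / df                # MSE = RSS/(n-2) estimates sigma^2
    se_b1 = math.sqrt(sigma2_hat / sxx)
    return b0, b1, se_b1, b1 / se_b1, df

b0, b1, se_b1, t, df = slope_t_test([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

Here b1 = 0.6 with SE ≈ 0.283, so t ≈ 2.12 on 3 df; with so few points this does not exceed the two-sided 5% critical value t(3, 0.975) = 3.182, so H0: β1 = 0 would not be rejected.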
(sketch)
  - H1: β1 ≠ c: get the P-value from a 2-sided test (sketch)

Interpretation of the P-value: the probability of observing a difference at least as extreme as what was seen, just by chance, when the null H0 is true. Usually reject H0 when the P-value is below some threshold:
  - common: 5% level of significance
  - more conservative: 1%

Why 5%? Historical note: to Ronald Fisher, the significance test only made sense in the context of a sequence of experiments, all aimed at clarifying the same effect. The closest he ever came to defining a specific p-value cut-off was in a 1929 article to the Society for Psychical Research:
  - "An observation is judged significant, if it would rarely have been produced, in the absence of a real cause of the kind we are seeking."
  - "It is a common practice to judge a result significant, if it is of such a magnitude that it would have been produced by chance not more frequently than once in twenty trials. This is an arbitrary, but convenient, level of significance for the practical investigator."
  - "He should only claim that a phenomenon is experimentally demonstrable when he knows how to design an experiment so that it will rarely fail to give a significant result. Consequently, isolated significant results which he does not know how to reproduce are left in suspense pending further investigation."
As on p. 99 of The Lady Tasting Tea (2001) by David Salsburg; similar discussion in "Truth, Damn Truth, and Statistics" by Paul F. Velleman in the July 2008 Journal of Statistics Education: http://www.amstat.org/publications/jse/v16n2/velleman.pdf

Model F test:
  - ANOVA table columns: Source, df, SS, MS, F
  - F value = MS_Model / MS_Residual
  - What does this test? H0: every predictor X has no effect; H1: at least one predictor has an effect. Under H0, F ~ F(df1, df2), where df1 = df for Model and df2 = df for Residual.
  - We will return to this with multiple regression; for now, with a single predictor, this is H0: β1 = 0. Note that with a single predictor, F = t².

Confidence Intervals (CI):

1. A 100(1−α)% CI for the coefficient β1:

   b1 ± SE_b1 × t(n−2, 1−α/2)
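The CI formula for β1 can be sketched numerically. In this illustration (toy data, not from the notes), the critical value t(3, 0.975) = 3.182 is hard-coded from a t table; in practice it would come from software:

```python
import math

def slope_ci(x, y, t_crit):
    """100(1-alpha)% CI for beta1: b1 +/- t_crit * SE(b1),
    where t_crit = t(n-2, 1-alpha/2)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(rss / (n - 2) / sxx)   # SE uses MSE = RSS/(n-2)
    return b1 - t_crit * se_b1, b1 + t_crit * se_b1

# t(3, 0.975) = 3.182 from a t table (n = 5 observations -> 3 df)
lo, hi = slope_ci([1, 2, 3, 4, 5], [2, 4, 5, 4, 5], t_crit=3.182)
```

The resulting interval (−0.30, 1.50) contains 0, which agrees with the t test on the same toy data failing to reject H0: β1 = 0 at the 5% level.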
Interpretation:
  (a) We are 100(1−α)% confident that the true value of β1 lies within this interval.
  (b) If we drew many samples of size n from this population, and each time fit the model and obtained a CI, then 100(1−α)% of these intervals would contain the true value of β1.
  (c) NOT: "there is a 100(1−α)% chance the interval is correct."

2. A 100(1−α)% CI for Y when X = X*:

   Ŷ ± SE_Ŷ × t(n−2, 1−α/2)

SE_Ŷ depends on what we're looking for when X = X*:
  (a) expected Y (group mean) ⇒ confidence interval
  (b) predicted Y (individual observation) ⇒ prediction interval
Do this for many X* to get a CI band about the regression line.

Parameter Estimates
Variable   DF  95% Confidence Limits
Intercept  1   958.81911   1443.42961
Income     1   0.03841     0.05669

Predicted values with 95% confidence and prediction intervals for 20000 < Income < 22000 (with extra observations):
Obs  Income  Water81  Predict  l95conf  u95conf  l95pred   u95pred
8    21000   2200     2199.65  2078.91  2320.39  −458.643  4857.94
55   21000   1100     2199.65  2078.91  2320.39  −458.643  4857.94
102  21000   1500     2199.65  2078.91  2320.39  −458.643  4857.94
296  21000   2600     2199.65  2078.91  2320.39  −458.643  4857.94
462  21000   4300     2199.65  2078.91  2320.39  −458.643  4857.94
497  20500   .        2175.87  2054.33  2297.41  −482.454  4834.20
498  21500   .        2223.42  2103.32  2343.53  −434.840  4881.68

[Figure: scatterplot of Water81 vs. Income with 95% confidence and prediction bands.]

Which is wider — confidence or prediction intervals? Why?
  - more variability in individuals than in group means ⇒ less precision in predicting individuals ⇒ larger SE_Ŷ for prediction
  - ⇒ wider intervals for prediction

What about confidence levels other than 95% (α = 0.05)?

Predicted values with 90% confidence and prediction intervals for 20000 < Income < 22000 (with extra observations). Note that these are 90% intervals:
Obs  Income  Water81  Predict  lPred90   uPred90  lConf90  uConf90
8    21000   2200     2199.65  −29.9755  4429.27  2098.38  2300.92
55   21000   1100     2199.65  −29.9755  4429.27  2098.38  2300.92
102  21000   1500     2199.65  −29.9755  4429.27  2098.38  2300.92
296  21000   2600     2199.65  −29.9755  4429.27  2098.38  2300.92
462  21000   4300     2199.65  −29.9755  4429.27  2098.38  2300.92
497  20500   .        2175.87  −53.7805  4405.53  2073.93  2277.81
498  21500   .        2223.42  −6.1771   4453.02  2122.68  2324.16

2.4 Residual Diagnostics

Recall the model Yi = β0 + β1·Xi + εi, i = 1, ..., n, and its assumptions:
  - linearity
  - independence
  - constant variance
  - normality
  - no outliers

If the model assumptions don't hold, then any inferences (significance tests, predictions, etc.) are meaningless.

How to assess model shortcomings?
  - shortcomings are reflected in the error terms εi
  - estimate εi with the residual ei = Yi − Ŷi
  - graphical check: the residual plot (e versus Ŷ) can help diagnose several violations of model assumptions — it is a diagnostic plot

[Figures: scatterplots and residual plots for five simulated data sets — (1) model assumptions met, (2) linearity assumption not met, (3) constant error variance assumption not met, (4) normality assumption not met, (5) "no outliers" assumption not met. See 2SimulatedResiduals.sas, but don't worry too much about the simulation code.]

How to use residual diagnostics:
  1. residual plot: linearity, constant variance, normality (symmetry, loosely, here), independence (e vs. time)
  2. residual summary plots ("four-plot summary"): normality/symmetry more directly

2.5 Residuals and Remedial Measures

Recall residual diagnostics:
  - residual
plot e vs Y o residual summary plots histogram boxplot normality symme try help diagnose Violations of model assumptions What if model assumptions are not met 0 it depends on Which assumptions is are violated 0 some are more important than others and should be addressed rst 2 48 Nonlinearity 0 make relationship linear or more linear 0 consider tting a non linear model 0 transform X rst When possible because transforming Y Will affect the residuals more Which could introduce new model assumption violations 0 may need to transform Y also 2 49 sketch relation here appears quadratic Consider model Y aX c2 l b 0 try transformations X X c2 for various 0 values especially for meaningful c 0 try to make Y vs X more linear Y aX b o if possible keep model bivariate just one predictor variable 2 50 Non constant error variance 0 variance stabilization transformation on Y 0 usually consider a ladder of powers or Box Cox Yq 0 can also use WLS weighted least squares regression essentially puts different weights on each point weights related to error variance at that point we may return to this topic 0 may need to transform X also Non independent error terms 0 time series model 0 could work With rst differences transformation 0 we Will return to this problem in Unit 5 Time Series Non symmetry Non normality 0 Transformation on Y o ladder of povvers or Box Cox Yq 0 may need to transform X also 2 52 Outliers 0 Consider dropping points competing ideals 1 Valid data shouldn t be ignored 2 One or two data points shouldn t determine signi cance 0 we will return to this problem more in Unit 4 o Robust Estimation see Chapter 6 IRLS Iteratively Reweighted Least Squares use but de emphasize outliers we will return to this topic 45 After any remedial measure go back and re check all the model assump tions again Scatterplot for Revised Data Set 2 nonlinearity xed o o 20 4o 60 so 100 120 140 160 x2ne v Scatterplot for Revised Data Set 3 helaroscedaslioity xed new a 7 6 5 4 3 Resi Residual Plot 
for Revised Data Set 2 nonlinearity xed Predicted Value of 2 Residual Plot for Revised Data Set 3 h dual sleroscedastidty lixed Predicted Value of Yanew 2 53 Scatterplot for Revised Data Set 4 Residual Plot for Revised Data Set 4 non symmetry non symmetry xed y4new Residual 005 00 39005 39 006 39 005 2 54 004 0037 002 0017 39 UIW g 39 o n oo17 39 39 015 o 016 002 017 003 o391871 l l l l l l l o39 47 l l l l l l 0 6 8 10 12 14 16 18 20 O14 013 012 O11 010 009 008 x4 Predicted Value of y4new Scatterplot for Data Set 5 Residual Plot for Revised Data Set 5 With and wilhoul outliers in regon No Outliers Assumption is Met VS 47 50 If quoti 3 39 407 2 30 g 1 e 39 E 207 I O o 10 39 1 O on 1 1 1 1 1 1 1 1 2 1 139 1 1 0 2 8 10 12 14 16 18 20 0 1O 20 30 40 50 Predicted Value of y5 Summary of Residual Diagnostic 2 55 and Remedial Measures Model Graphical Diagnostic Remedial Assumption amp Statistical Test Measure Linearity e vs Y transformation X of Y Vs X F test lack of t nonlinear model Constant Error Variance Independent Error Terms e vs Y modi ed Levene test Breusch Pagan test e vs time Durbin Watson test transformation rst on Y WLS time series model Summary of Residual Diagnostic 256 and Remedial Measures Model Graphical Diagnostic Remedial Assumption amp Statistical Test Measure No outliers e vs Y ignore points boxplot e transformation rst on Y Symmetric symmetry plot 6 transformation Error Terms rst on Y Normal normal quantile plot 6 transformation Error Terms Kolmogorov Smirnoff test rst on Y Lilliefors test Last concern x others rst Example Concordl data BoxCox Transformation Regressing Water81 on Income regressmg Water81 on Income Lambda RSquare Log Like 10 006 404160 09 007 395675 2 57 2 Concordl Regression sas 3908 0 08 quot3878 07 07 010 380612 06 011 374140 05 012 368424 04 014 363479 Residual Plot for Concord1 Data O 3 0 15 9 593 02 02 016 355870 80 01 017 353149 7000 39 00 017 351093 E 6000 o 01 018 349654 5 02 018 348782 5 50 07 39 03 019 348431 lt E 4000 39 
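The λ ladder in the Box–Cox output above comes from maximizing a profile log-likelihood over a grid of λ values. A simplified, mean-only sketch with made-up positive data (the notes' actual search is done in SAS and involves the regression of Water81 on Income, not just a single sample):

```python
import math

def boxcox_loglik(y, lam):
    """Profile log-likelihood of the Box-Cox parameter lambda for a
    mean-only model: transform y, then score -n/2*log(sigma^2_hat)
    plus the Jacobian term (lam - 1) * sum(log y)."""
    n = len(y)
    if lam == 0:
        z = [math.log(v) for v in y]              # lambda = 0 means log
    else:
        z = [(v ** lam - 1) / lam for v in y]     # usual Box-Cox transform
    zbar = sum(z) / n
    sigma2 = sum((v - zbar) ** 2 for v in z) / n
    return -n / 2 * math.log(sigma2) + (lam - 1) * sum(math.log(v) for v in y)

y = [1.0, 2.0, 4.0, 8.0, 16.0]                    # made-up geometric response
grid = [k / 10 for k in range(-10, 11)]           # lambda ladder from -1 to 1
best = max(grid, key=lambda lam: boxcox_loglik(y, lam))
```

For an exactly geometric series the log transform makes the values evenly spaced, so the grid search picks λ = 0, mirroring how the SAS output reports a "Best Lambda" (and then a nearby "Convenient Lambda" for interpretability).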
04 019 348559 3000 3 o 05 019 349127 5 2000 0 39 06 019 350103 g 7 354 ha 07 018 351459 g 1000 395 1 08 018 353170 3 of p 09 01 35212 10 01 35 568 E 1000 39139fr 539 g E39 u a ow lt Best Lambda 3000 1 1 1 1 1 1 Confidence Interval 100 200 3 00 400 500 6 00 Convenient Lambda Predicted Value of Water81 Residual for transfode Waler81 regressed on Income Residual for trans Waler81 regressed on trans Income Residual Plot for Concord1 Data Amms mmv BoxCox on Income X try lambda5 6 39 39 Lambda RSquare Log Like a xviquot 020 000 465681 2 5 quot3quot 39 022 000 465575 139 339Eg 39 39 024 000 465481 1 39 quot 39h f if 2 026 000 465399 27 gg 39 39 028 000 465328 j 39 quot 030 000 465268 5 032 000 465220 6 O 034 000 465184 7v 036 000 465158 8 9 1 Pedwue1fm13 14 15 038 000 465144 040 000 465142 lt Resldual Plot for Concord1 Data 042 000 46513950 mmmmmtmm 044 000 465170 6 046 000 465201 5 n 048 000 465243 2 I 39 050 000 465295 2 z 7 052 000 465359 1 315 054 000 465434 54 R s5 056 000 465519 2 uigm 058 000 465616 3 13 5 060 000 465723 4 2 lt Best Lambda 7 39 Confidence Interval 7 5 9 1 1 12 13 4 Convenient Lambda Predicted Value of Ynew Hlstogram of Resld after Iransform Y and X Boxplot of Resld after u39ansform Y and X Wllh normal curve superimposed 7 5 30 50 D 25 9 g 25 20 gt E E g 15 g 0 g 10 25 5 11 50 E u 39 K m D ss 54 42 a 10 oa 06 19 3 42 54 5 Raid alter transform Y and X 15 Symmetry Plot of Redd alter transform Y and X Normal Quantlle Plat of Resld after transform Y and X 7 15 6 50 quot0quot Q z 25 3 5 5 9 1 25 39339 393 n 4 50 Lquot 75 1 1 1 1 1 1 7 011510255075909599 Distance Below the Median Normal Percentiles 2 59 Regressing Ynew Water8103 on Xnew Income05 The REG Procedure Model MDDEL1 Dependent Variable Ynew Analysis of Variance Sum of Mean Source DF Squares Square Model 1 36601404 36601404 Error 494 145373973 294279 Corrected Total 495 181975376 Root MSE 171546 RSquare 0 Dependent Mean 977698 Adj RSq 0 Coeff Var 1754587 Parameter Estimates Parameter Standard 
Variable DF Estimate Error t Value Intercept 1 683635 027470 2489 Xnew 1 002017 000181 1115 Final model With assumptions met PredictedWater810393 683635 002017Income0395 2 60 F Value 12438 Pr gt F lt0001 2011 1995 Pr gt t lt0001 lt0001 26 Problems with Regression Simple linear regression is based on some formal theory 0 ordinary least squares o Gauss Markov Theorem 0 more o in practice there can be cause for concern 0 so need to think critically about using regression 2 62 1 Model Assumptions a must be satis ed for conclusions to be valid 0 Linearity of X vs Y o Constant Error Variance 0 Independent Error Terms 0 No outliers o Symmetric amp Normal Error Terms b sometimes no transformation Will x the problems then linear regression must be abandoned 2 Interpret ability 2 63 a slope for every unit increase in X the average or expected change in Y Example For every unit increase in Income0395 the expected change in Water810393 is 002017 b model amp parameter estimates should make sense 0 compare With mechanistic theory c X vs Y relationship could look counterintuitive 0 possible wrong model 0 possibility of some omitted variable that affects both X and Y c we Will return to this problem multicollinearity 2 64 3 Simple model assumes X xed a but the observations in X could have measurement error b X variables could be related to random error terms 4 R2 can be abused a R2 of variability in Y explained by linear relation With X b a high B2 does not guarantee predictive ability or model appropriateness o a low B2 does not mean there is no linear relationship 2 65 5 Linear Regression is easily misused a just because you t a model doesn t make it the right model b look carefully at interpretations and conclusions c beware extrapolation beyond data range 9304 Nature paper 5 0 Stat 5100 Notes Spring 2009 Unit 5 Time Series Section Topic 50 51 52 53 54 55 56 Summary Overview Autooorrelation Hamilton pp 118 124 Stationarity Bowerinan pp 437 441450 451 AR amp MA Models pp 467 470 
442 457 ARIMA Models pp 474 476 Forecasting 85 Goodness of Fit pp 462 467 496 504 Seasonal Modeling Table 121 5 1 50 Summary Overview Homework 5 intro httpwwwleftbusinessobservercomBushNGashtml Response Y collected in some sequential manner time space Want to make useful forecasts short term predictions Want to understand What influences Y o recurring patterns in Y 0 effect of other variables X1 Xk1 on Y o dependence among observations due to sequential nature 5 2 Box Jenkins ARIMA models account for dependence structures for forecasts to be useful need model assumptions to be met stationarity graphical diagnostics to tentatively identify appropriate model graphical and numerical diagnostics to assess model adequacy make forecasts point amp interval With adequate model 5 1 Autocorrelation Example Concord2 data 0 Response monthly ave water consumption from 1970 1981 0 Predictors monthly ave precipitation and temperature 0 Conservation campaign started 1980 successful Original data Look at effect of time 6 Daily Water Use millions of gals 0 20 40 60 80 100 120 140 Tlme months since Dec 1969 51Concord2AutocorrDWsas 5 3 All clear Residual plot from simple regression approach m Mum The REG Procedure Parameter Estimates Standard Parameter Variable DF Estimate Intercept 1 382800 temp 1 001286 precip 1 004743 campaign 1 024698 107 08 067 047 c n l 39 39 a 3 027 39 00 I L f39 539 39 39 o2 n 39 o47 39 39 39 oe 39 39 08 36 38 40 42 44 46 48 Predicted WaterUse 0000 Error 10064 00170 02123 11348 t Value Pr gt t 3803 lt0001 757 lt0001 223 00271 218 00313 Residuals Look at effect of time 10 08 06 04 02 00 02 04 06 081 Reddum 0 20 40 60 80 100 120 140 Time months since Dec 1969 5 4 5 5 Linear regression model Y 30 3139 k1X7k1 i1n Assumption 61 en iid N0 02 What if not independent 1 bj estimates unbiased but not minimum variance inef cient 2 MSE can severely underestimate 02 gt var 0f bj underestimated gt usual inferences not applicable When could error terms be dependent o 
observations collected serially in time closing price of GE stock every day rainfall every month population every census 0 observations collected in geographic sequence air quality at each mile marker of freeway water pH every km along river soil richness at points along throughout a soy eld 0 others data collected observed in some sequence 5 6 5 7 Sequential or serial dependence is autocorrelation of error terms Graphical detection diagnostic plot residuals vs time any clear trend could be problematic sketches cyclic sawtooth 5 8 Numerical diagnostic for autooorrelation Durbin Watson statistic Yt g le l k1Xtk1 l 615 t index for time Assume 6t 16154 at l 1llt1a1aniid N002 First order autooorrelation 0 1 autooorrelation parameter sometimes p c at error term distribution free of t sometimes called random shook O H02 1Ovs H12 1gt0 39I L 2 A 2t2 6t 6t Y39t y d 2211 6 7 sometimes n T 5 9 For signi cance level oz sample size n and k 1 predictors get critical values d L and dU from table like A44 on p 355 356 of Hamilton text d lt d gt reject H0 at level 04 d gt dU gt fail to reject H0 at level 04 d L g d g dU gt test inconclusive at level 04 Test for negative autocorrelation H1 o lt 0 calculate d as above then compare 4 d to critical values Back to C0nc0rd2 data Look at autocorrelation in DurbinWatson test DurbinWatson D 0535 Pr lt DW lt0001 Pr gt DW 10000 Number of Observations 137 1st Order Autocorrelation 0730 NOTE PrltDW is the pvalue for testing positive autocorrelation and PrgtDW is the pvalue for testing negative autocorrelation d 042001 1 Table n137 dL k 13 dU Conclusion here 5 10 Was the campaign successful What s different here Estimates of Autocorrelations Lag Covariance Correlation 1 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 1 0 01261 1000000 I 1 00921 0730231 I Estimates of Autoregressive Parameters Standard Lag Coefficient Error t Value 1 0730231 0059465 1228 Standard Approx Variable DF Estimate Error t Value Pr gt t Intercept 1 38226 01222 3129 lt0001 temp 1 
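The Durbin–Watson statistic defined above is easy to compute from the residuals. Note also the handy approximation d ≈ 2(1 − r̂1), which matches the Concord2 output (d = 0.535, r̂1 = 0.730, and 2(1 − 0.730) = 0.54). A minimal sketch with made-up residual series:

```python
def durbin_watson(e):
    """Durbin-Watson statistic d = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_t e_t^2.

    d near 2: no first-order autocorrelation;
    d well below 2: positive autocorrelation (neighbors track each other);
    d well above 2: negative autocorrelation (neighbors alternate)."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    return num / sum(et ** 2 for et in e)

d_pos = durbin_watson([1, 1, 1, -1, -1, -1])  # residuals drift together
d_neg = durbin_watson([1, -1, 1, -1])         # residuals alternate
```

The drifting series gives d = 2/3 (well below 2, like the Concord2 residuals), while the alternating series gives d = 3; to test for negative autocorrelation, compare 4 − d to the same dL/dU critical values.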
00119 0002103 566 lt0001 precip 1 00358 00113 318 00018 campaign 1 01901 01938 098 03284 5 12 First order autooorrelation lag1 6t 1 ed at What about relationships in addition to 615 amp et1 lag other than 1 To estimate 1 With lagm 6t 1 tm l at n m 1 151 6t tm 6 23316t 2 correlation Auto r p 09 00IO O39AgtOOIDHO Covariance 1261 0921 0597 0435 0580 0706 0744 0582 0354 0194 0302 0535 0612 0430 0193 0164 0272 0428 0466 0343 0108 000238 000370 00250 00329 00204 0 OOOOOOOOOOOOOOOOOOOO Estimates of Autocorrelations Correlation O H 0000 OOOOOOOOOOOOOOOOOOOO 000000 730231 473521 345115 459549 559339 589674 461684 280885 153916 239722 424193 485392 340738 152932 129701 215724 339531 369486 271671 085549 018880 029327 198215 260869 161734 1 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 1 5 13 Autocorrelation Estimate times 100 100 90 80 70 60 50 4O 3O 20 10 Correlogram O 0123456789111111111122222 012345678902345 Number of lags 5 14 Remedial measures for autocorrelation 0 add predictor variables trend 0 transform predictors and or response 0 account for error dependence structure Box Jenkins ARIMA models iterative process 1 identify tentative model 2 use historical data to t model 3 diagnostic checking 4 forecast future time series values model assumptions homogeneity stationarity invertibility next section 5 15 5 16 52 Stationarity Linear model revised Yt e iXm k iXm i 613 Time series Y1Y2 lt Yn n T sometimes First order stationary at E ILL for all t Second order stationary if Va Yt of E 02 for all t homogeneity Intuitive diagnostic looks the same mean and variance in every time Window Graphical diagnostics for stationarity plot residuals ct vs t Not rst order Not second order General remedial measures 1 transform Yt 2 add time based predictors 3 differencing for stubborn trends 5 17 5 18 1 Transform Yt usually to eliminate heteroscedasticity powers 2 Add time based predictors to remove time trends a linear or curvilinear trends sketches 5 19 b cyclic trends ii small obs per cycle 
add dummy variables quarterly 4 quarters gt dummy vars monthly 12 months gt dummy vars large obs L per cycle too many for dummy vars consider trigonometric functions of t as predictors 27rt X sin X cos 1 L 2 L What kinds of cycles would these remove sketch 27rt 27Tt X t 39n X t 3 81 I 4 COS What kinds of cycles would these remove sketch 3 5 20 Differencing for stubborn trends First differences Zt Yt Yt1t 2 7 wt Zt Zt l Y22Yf 1Yf 27t37 7n Second differences Algebraically What do rst differences do to linear effect of time gt ZtZE E1 Do second differences remove quadratic time effect Ytabtct2 gt thYt Yt1 gt WtZZt Zt1 Higher order differences rare in practice remove higher order time effects But differencing can destroy cyclic behavior 0 gt hurts ability to forecast loss of information o a remedial measure of last resort 5 22 Example Monthly averages of number of occupied rooms in four hotels Operated by Traveler s Rest Inc in Central City from 197 7 1990 plot of Monthly Hotel Room Averages Plot of Residuals for Monthly Hotel Room Averages Raw Data after removing linear time effect 1150 I 300 e x I 1010 I g 200 gt I l 5 1 5 870 v a 100 9 w a g 730 5quot 0 c 2 5 n I 39 5 590 n 1007 E 450 200 0 43 86 129 172 0 43 86 129 172 Plot of Residuals for Monthly Hotel Room Averages Square Root of Data and removing linear time effect Residual lllllllllllll vllllllll ill Plot of Residuals for Monthly Hotel Room Averages Fourlh Root of Data and removing linear time effect Residual 05 04 03 llllllllllllll lllllllllllll 01 o27 03 l l l l 0 43 86 129 172 Time Plot of Residuals for Monthly Hotel Room Averages Cube Root of Data and removing linear time effect Residual 10 08 0 a 04 0 N 00 02 o4 O6 0 llllll lll l l 43 will l l 86 129 172 Time Plot of Residuals for Monthly Hotel Room Averages Log of Data and removing linear time effect Residual 032 027 022 990 o L L INV 002 003 008 013 018 023 l llllll ill 0 43 ll lull l llllllll l ll l 172 Time Trends remain Plot of Residuals for Monthly 
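The claim above — first differences remove a linear time effect, and second differences remove a quadratic one — can be verified directly. A minimal sketch (the helper name `difference` is illustrative):

```python
def difference(y, d=1):
    """Apply first differences d times: z_t = y_t - y_{t-1}.
    Each pass shortens the series by one and knocks the degree of any
    polynomial time trend down by one."""
    z = list(y)
    for _ in range(d):
        z = [z[t] - z[t - 1] for t in range(1, len(z))]
    return z

quadratic = [t * t for t in range(8)]   # y_t = t^2, a "stubborn" trend
# difference(quadratic, 1) -> [1, 3, 5, 7, 9, 11, 13]  (still trending)
# difference(quadratic, 2) -> [2, 2, 2, 2, 2, 2]       (trend removed)
```

This also illustrates the cost mentioned in the notes: each difference discards an observation, and over-differencing can destroy cyclic structure you would rather model.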
Hotel Room Averages Log of Data after removing linear time effect 0327 0277 0227 0177 0127 0077 002 i i i i 003w 1 llquot 008 013 018 n I 0 43 86 129 172 Time Residual Plot of Residuals for Monthly Hotel Room Averages Log of Data and removing linear time effect Hm Residual 0327 0277 0227 0177 0127 0077 0027 o037 ooar o137 v 018 o23 l l I l r n I r I 1 Plot of Residuals for Monthly Hotel Room Averages Log of Data w Month Dummy Vars how better 005 004 003 5 25 Coo o o 09 3 F be is by 001 002 003 004 005 006 Residual 0 86 129 172 Time 172 0 43 Predicted Values from Regression Model 1200 1100 1000 900 8007 l 700 I I r 6007 400 Monthly hotel room averages with Dummy Variables for Months 0 30 60 90 120 150 180 Time 5 26 Bonus material Generalized differencing Zt Yt th1 Methods to estimate p a Differencing Will return to this n E 1 b Cochrane Orcutt primitive Yule Walker be cautious With small n or large p c Hildreth Lu primitive ULS be cautious With small n or large p 5 27 53 Autoregressive amp Moving Average Processes Example 531 57 consecutive daily overshorts from an underground gasoline tank in Colorado overshort for day t is Zt amount of fuel at end of day t amount of fuel at end of day t 1 amount of fuel delivered during day t AAAA amount of fuel sold during day t With no measurement error and no tank leaks Zt 0 Let Zt be the stationary time series after transforming including es timating out time trends and other covariates the original time series Y1 Yn Daily Overshorts 200 5 28 100 Overshort CD AMAMM W VVVVW V 100 2001 Day 5 3 Overshort GE Gas ARMA sas Stationary Dependence structure gt forecast Graphical checks for dependence structure r p 0010301AgtCDMHOOQ O Covariance 3415718 1719956 416701 723256 273522 66440027 396723 742192 861426 655675 192042 Lag OOO103039phCOlol L O Corre Correl lation 50354 17625 31556 26367 15282 04200 19396 11850 05068 06969 ation 00000 50354 12200 21174 08008 01945 11615 21729 25219 19196 05622 The ARIMA Procedure Autocorrelations 1 
9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 1 Partial Autocorrelations 1 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 1 5 29 5 30 Autocorrelation function ACF or ACF 0 measure linear association between time series observations separated by a lag of m time units Emmi ZZtm Z Z 215 Zt Tm Z Z2 39 b 1 Ztb t n w12z r2 SE of rm is STm Vn b1 b 1 unless use differencing 0 call rm the sample autocorrelation function SAC39F or ACTF o sometimes used trm rmSTm Autocorrelation plot or SAC o bar plot rm vs m for various lags m sketch 0 lines often added to represent 2 SE s sketch rough 95 con dence intervals if rm is more than 2 SE s away from zero consider it signi cant rough non zero compare trmi to 2 for lags m g 3 use 16 because low lags most important to pick up 77 0 determine stationarity and identify MAq structure 5 32 MAq moving average process of order q Zt 6 at 61at1 62CLt2 QqCLtq Zt stationary transformed time series 67 unknovvn parameters at random shocks 6 unknovvn parameter Note include 6 in model only if Z is statistically different from 0 for ARp as well MAq Value of response Zt at time t depends on random shock values at previous q times not so intuitive SAC terminology o spike rm is signi cant 533 0 cuts off no signi cant spikes after rm 0 dies down decreases in steady fashion SAC amp stationarity 1 Z stationary if SAC either cuts off fairly quickly or dies down fairly quickly sometimes dyes down in damped exponential fashion sketches 2 Z nonstationary if SAC dies down extremely slowly sketch 3 Ch 9 of Bowerman amp O Connell if the SAC cuts off fairly quickly it Will often do so after a lag k that is less than or equal to 2 SAC amp MAq process 0 rst q terms of SAC Will be non zero then drop to zero sketch r p 0010301AgtCDMHOOQ O Covariance 3415718 1719956 416701 723256 273522 66440027 396723 742192 861426 655675 192042 Lag OOO103J39lphCOlol L O Corre Correl 1 0 lation 50354 17625 31556 26367 15282 04200 19396 11850 05068 06969 ation 00000 50354 12200 21174 08008 01945 11615 
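The sample autocorrelations r_m behind these correlograms can be computed directly. A pure-Python sketch (illustrative, not the PROC ARIMA implementation):

```python
def sample_acf(z, max_lag):
    """Sample ACF for a (stationary) series z:
    r_m = sum_{t=1..n-m} (z_t - zbar)(z_{t+m} - zbar) / sum_t (z_t - zbar)^2."""
    n = len(z)
    zbar = sum(z) / n
    denom = sum((zt - zbar) ** 2 for zt in z)
    return [
        sum((z[t] - zbar) * (z[t + m] - zbar) for t in range(n - m)) / denom
        for m in range(1, max_lag + 1)
    ]

# A sawtooth (alternating) series: adjacent values disagree, lag-2 values agree
acf = sample_acf([1, -1, 1, -1, 1, -1], max_lag=2)  # r1 = -5/6, r2 = 2/3
```

As in the SAC plots, a spike beyond roughly ±2/√n (two standard errors) is treated as significantly non-zero when identifying MA(q) structure.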
21729 25219 19196 05622 Look at ACF MA1 The ARIMA Procedure Autocorrelations 1 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 1 Partial Autocorrelations 1 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 1 5 34 MA1 model fit to Overshort data The ARIMA Procedure Unconditional Least Squares Estimation Standard Approx Parameter Estimate Error t Value Pr gt t MU 512443 035073 1461 lt0001 MA11 099999 026992 370 00005 Constant Estimate 512443 Variance Estimate 1996541 Std Error Estimate 4468267 AIC 6009357 SBC 6050218 Number of Residuals 57 Autocorrelation Check of Residuals To Chi Pr gt Lag Square DF ChiSq Autocorrelations 6 482 5 04379 0119 0131 0054 0102 0130 12 1318 11 02817 0090 0079 0210 0161 0178 18 2947 17 00304 0098 0141 0273 0173 0207 24 3294 23 00821 0084 0071 0068 0057 0086 Lag 0123 0041 0151 0095 5 35 5 36 Partial Autocorrelation Function PACF 0r PACF o autocorrelation of time series Observations separated by a lag of m With the effects of the intervening Observations eliminated T1 7amm m l 7 7 T UT l m Zlm11m m 1fllt222 1 251 Tm lJrl Where rm SACFm Tmal Tmamrm17ml 17 7m 1 SE of rm m is 3mm 1n 91 9 1 unless use differencing 7amm 0 call rm m the sample partial autocorrelation function SPAC39F 0r PACFm o sometimes used trm m rmamSrm m 5 37 Partial Autocorrelation Plot or SPAC o bar plot rm m vs m for various lags m sketch 0 lines often added to represent 2 SE s compare t to 2 sketch Tm n 0 Use SPAC to identify ARp structure SPAC terminology spike cuts off and dies down 7 72mm N SAC SPAC amp ARp process 0 rst p terms of SPAC will be non zero then drop to zero sketch 5 38 Recall rst order autocorrelation Yt o 1Xm k 1Xtk 1 t7 t177n 6t t 1at lag 1 Correlogram plot d vs lag m cyclical pattern gt relationship in addition to 615 amp et1 exists ARp autoregressive process of order p 0 account for error dependence structure Where current time series value depends on past values 0 model 615 16134 26134 petp at More common representation for ARp 5 39 39 Zt 5 1Zt 1 2Zt 2 pZtp l at 7 are 
unknown parameters random shock at iid N 0 02 6u1 1 p MZEthl Z are residuals gt u E 0 gt common to assume 6 0 Special case Random Walk Model 39 Zt Zt l at o AR1 is a discrete time continuous Markov Chain probability at time t depends only on state at time t 1 ARp value of response Zt at time t depends on response values at previous 19 times Example 532 General Electric s gross investment in millions of dollars for years 1935 1954 GE gross investment 1940 1945 year 1950 GE gross Investment after accounting for time uslng log 04 03 02 g 01 E 0 f I o1 U 02 03 04 1940 1945 1950 Year 1955 GE gross investment after accounting for time Residual 50 5 40 AM UVV Year 1955 r m m m N m m h w M H 009 O Covariance 0073518 0021289 0038026 0039301 00051560 0022797 0016539 00022313 00093295 00029266 00011643 Lag Correlation OOO103J39lphCOlol L 0 Correlation 100000 028957 51723 53458 07013 31009 22497 03035 12690 03981 01584 The ARIMA Procedure Autocorrelations 1 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 1 I I Partial Autocorrelations 28957 65610 18504 23526 05568 18820 01776 03811 04283 13331 I 1 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 1 I I 5 41 AR2 model fit to log of GE data The ARIMA Procedure Unconditional Least Squares Estimation Variable logGEinv logGEinv logGEinv year 0018 0065 Shift 000 0078 0026 Standard Approx Parameter Estimate Error t Value Pr gt t Lag MU 13517006 1484188 911 lt0001 0 AR11 051014 018639 274 00146 1 AR12 071635 017516 409 00009 2 NUMl 007183 00076327 941 lt0001 0 Constant Estimate 163042 Variance Estimate 0044281 Std Error Estimate 0210431 AIC 051718 SBC 3465754 Number of Residuals 20 Autocorrelation Check of Residuals To Chi Pr gt Lag Square DF ChiSq Autocorrelations 6 311 4 05395 0176 0019 0086 0269 12 944 10 04910 0122 0032 0094 0343 18 1423 16 05815 0140 0189 0037 0005 0074 0004 5 42 5 43 Some convenient notation backshift operator BZt Zt l BZZt 3th BZt1 2 2H Zt 6 1Zt1 pZtp at 6 1BZt poZt l at 6 1B pBPZtat gt 1 1B poZt 6at Zt 6 at 610434 QqCLtq 6at QlBat 
8quat 51 913 9q3qat I 61B 6qu1Zt at 5 44 Inverse Autocorrelation Function IACF or SIACF 0 similar to PACF and rarely discussed Autoregressive Process 0 current amp future values of Zt depend on historical values of same time series Z O 1 1B pBPZt 6Cbt Moving Average Process 0 current amp future values of Zt depend on past random shocks at 1 61B anq1Zt 6 at Annual Average Price of Unleaded Example 533 gas prices since 1976 Gas Price Data 360 330 300 270 240 210 180 150 120 90 60 00 O O 250 N O O L 01 o 1974 1986 1998 year 2010 1 O O Jeuoa 946L lo Jemod Suiins maleuan 5 45 Gas Price Data adjusted for inflation Price of Unleaded Current Dollars 40 35 30 25 20 15 10 1974 1992 2001 2010 year 1983 The REG Procedure Dependent Variable price Root MSE 3881984 RSquare 05864 5 46 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr gt t Intercept 1 149939 2153252 007 09449 infl76 1 056355 008367 674 lt0001 Gas prices regressed on inflation 120 100 80 60 40 20 0 2o 4or 607 Note extra output 1974 1983 1992 2001 2010 on next slide year What is H0 Residual Autocorrelations Lag Covariance Correlation 1 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 1 0 1418334 100000 1 1025385 072295 2 752621 053064 3 485828 034253 I I I 5 47 4 242655 017108 5 67981247 004793 6 65624165 04627 I 7 108660 07661 8 187771 13239 9 309936 21852 10 315535 22247 Partial Autocorrelations Lag Correlation 1 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 1 1 072295 2 001672 I I 3 009799 I I 4 010148 I I 5 003818 I I 6 004516 I I 7 003327 I I 8 009931 9 016668 I I 10 003957 I I Autocorrelation Check for White Noise To Chi Pr gt Lag Square DF ChiSq Autocorrelations 6 3617 6 lt0001 0723 0531 0343 0171 0048 0046 AR1 model fit to gas data The ARIMA Procedure Name of Variable resid 548 Unconditional Least Squares Estimation Standard Approx Parameter Estimate Error t Value Pr gt t Lag MU 084911 1572996 005 09573 0 AR11 072909 012202 598 lt0001 1 Std Error Estimate 2669538 MA1 model fit to gas data The ARIMA 
MA(1) model fit to the gas-data residuals (PROC ARIMA, unconditional least squares):

  Parameter   Estimate   Std Error   t      Pr > |t|   Lag
  MU          3.78200    9.05471     0.42   0.6790     0
  MA1,1       0.99998    0.35495     2.82   0.0082     1

  Std Error Estimate 26.76369

What about a composite AR and MA model? ARMA(1,1) model fit to the gas-data residuals:

  Parameter   Estimate   Std Error   t      Pr > |t|   Lag
  MU          2.32029    14.86111    0.16   0.8769     0
  MA1,1       0.23685    0.56261     0.42   0.6767     1
  AR1,1       0.62825    0.28952     2.17   0.0378     1

  Constant Estimate 0.86256, Variance Estimate 728.6837, Std Error Estimate 26.99414, AIC 324.2865, SBC 328.8656, Number of Residuals 34.

  Autocorrelation Check of Residuals: to lag 6/12/18/24, χ² = 0.96/2.66/3.68/7.78, p = 0.9151/0.9884/0.9994/0.9977 -- no evidence of leftover autocorrelation.

5-50: ARMA(p,q) -- Mixed Autoregressive Moving Average Model:
  Z_t = δ + φ_1 Z_{t-1} + ... + φ_p Z_{t-p} + a_t - θ_1 a_{t-1} - ... - θ_q a_{t-q}
        [AR(p) part]                          [MA(q) part]
In backshift notation:
  (1 - φ_1 B - φ_2 B² - ... - φ_p B^p) Z_t = δ + (1 - θ_1 B - θ_2 B² - ... - θ_q B^q) a_t
  (1 - θ_1 B - ... - θ_q B^q)^(-1) (1 - φ_1 B - ... - φ_p B^p) Z_t = δ + a_t

Estimation procedures:
- need to estimate the φ's, θ's, and δ
- how to deal with the initial lag? several approaches exist:
  - ULS (unconditional least squares; MA(q) & AR(p)): also called nonlinear least squares; minimize SS[Error]
  - YW (Yule-Walker; AR(p)): generalized least squares, using OLS residuals to estimate the covariance across observations
- invertibility: an underlying assumption here; intuitively, the weights (φ & θ) on past observations decrease as we move further into the past

5-52:
  Model       SAC                    SPAC
  AR(p)       dies down              cuts off after lag p
  MA(q)       cuts off after lag q   dies down
  ARMA(p,q)   dies down              dies down

Notes:
- this table is for non-seasonal, stationary time series
- column headings here should really be TAC and TPAC (theoretical); SAC and SPAC are estimates
- a common structure is ARMA(1,1), where SAC and SPAC both die down in damped-exponential fashion
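The ARMA(1,1) recursion can be traced by hand from a known shock sequence, which makes the separate roles of φ_1 and θ_1 concrete. A sketch (illustrative Python; the start-up values Z_0 = a_0 = 0 are a simplification -- SAS's ULS method handles the initial lag differently, as noted above):

```python
def simulate_arma11(shocks, phi, theta, delta=0.0):
    """Z_t = delta + phi * Z_{t-1} + a_t - theta * a_{t-1}, for a given shock list."""
    z, z_prev, a_prev = [], 0.0, 0.0
    for a in shocks:
        z_t = delta + phi * z_prev + a - theta * a_prev
        z.append(z_t)
        z_prev, a_prev = z_t, a
    return z

# One unit shock at t = 1, then silence: the AR part propagates it geometrically,
# while the MA part subtracts theta exactly once, at t = 2.
print(simulate_arma11([1.0, 0.0, 0.0], phi=0.5, theta=0.2))  # -> [1.0, 0.3, 0.15]
```

Tracing the arithmetic: Z_1 = 1, Z_2 = 0.5·1 - 0.2·1 = 0.3, Z_3 = 0.5·0.3 = 0.15 -- after the shock leaves the MA window, only the damped-exponential AR decay remains.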
5-54: Autoregressive Integrated Moving Average (ARIMA) Model. Recall differencing:
  First difference:  Z_t = Y_t - Y_{t-1},  t = 2, ..., n
  Second difference: W_t = Z_t - Z_{t-1} = Y_t - 2 Y_{t-1} + Y_{t-2},  t = 3, ..., n
Pros: helps make a time series with stubborn trends more stationary.
Cons: can destroy cyclic behavior; harder to forecast.
Useful when transformations and the addition of time-related predictors (low-order polynomial, trigonometric, dummy) do not make the time series stationary.

5-55: After differencing, AR and MA dependence structures may still exist.
Autoregressive Integrated Moving Average process, ARIMA(p,d,q):
- p: AR(p) -- value at time t depends on the previous p values
- d: # of differences -- need to take the d-th difference to make the series stationary
- q: MA(q) -- value at time t depends on the previous q random shocks

How to select p and q? How to select d?
- usually look at plots of the time series
- choose the lowest d that makes the series stationary (also check the SAC)
ARIMA(p,d,q) is a very flexible family of models => useful prediction.

Recall backshift notation:
  d = 1: Z_t = Y_t - Y_{t-1} = (1 - B) Y_t
  general d: Z_t = (1 - B)^d Y_t

5-56: Model summary -- model Y in terms of the predictors X_1, ..., X_{k-1}, with ARIMA(p,d,q) dependence structure. But in what order does SAS do this?
  (1 - B)^d (Y_t - β_0 - β_1 X_{t,1} - ... - β_{k-1} X_{t,k-1})   [differencing; linear model]
then applying
  (1 - θ_1 B - ... - θ_q B^q)^(-1) (1 - φ_1 B - ... - φ_p B^p)    [moving average; autoregressive]
to that quantity gives a_t iid N(0, σ²)                            [independence]
Given p, d, and q, SAS estimates the β_j's, φ's, and θ's.

5-57: SAS code (schematic):

  proc arima data=a1;
    identify var=Y(d) crosscorr=(X1 ... Xk-1);
    estimate p=... q=... input=(X1 ... Xk-1) method=uls plot;
  run;

AR model settings:
  p=2:      Z_t = δ + φ_1 Z_{t-1} + φ_2 Z_{t-2} + a_t
  p=(1,3):  Z_t = δ + φ_1 Z_{t-1} + φ_3 Z_{t-3} + a_t   (φ_2 = 0)
MA model settings:
  q=3:        Z_t = δ + a_t - θ_1 a_{t-1} - θ_2 a_{t-2} - θ_3 a_{t-3}
  q=(1,3,7):  Z_t = δ + a_t - θ_1 a_{t-1} - θ_3 a_{t-3} - θ_7 a_{t-7}
Differencing settings:
  First:   Z_t = Y_t - Y_{t-1} = (1 - B) Y_t     -> var=Y(1)
  Second:  Z_t = (1 - B)(1 - B) Y_t              -> var=Y(1,1)
  Lagged:  Z_t = Y_t - Y_{t-7}                   -> var=Y(7)

5-58: Common Models for Stationary Time Series
  Model      SAC                                     SPAC
  MA(1)      cuts off after lag 1                    dies down, dominated by damped exponential decay
  MA(2)      cuts off after lag 2                    dies down in a mixture of damped exponential decay & sine waves
  AR(1)      dies down in damped exponential decay   cuts off after lag 1
  AR(2)      dies down in a mixture of damped        cuts off after lag 2
             exponential decay & sine waves
  ARMA(1,1)  dies down in damped exponential decay   dies down in damped exponential decay
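Differencing itself is a one-liner. A sketch of the (1 - B)^d operator (illustrative Python, not the SAS `identify var=Y(d)` machinery):

```python
def difference(y, d=1):
    """Apply the first-difference operator (1 - B) d times: Z_t = Y_t - Y_{t-1}."""
    z = list(y)
    for _ in range(d):
        z = [z[t] - z[t - 1] for t in range(1, len(z))]
    return z

# A quadratic trend needs d = 2 to become constant (stationary in the mean):
print(difference([1, 4, 9, 16, 25], d=1))  # -> [3, 5, 7, 9]
print(difference([1, 4, 9, 16, 25], d=2))  # -> [2, 2, 2]
```

Note that each pass loses one observation, matching the index ranges t = 2, ..., n and t = 3, ..., n in the definitions above.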
5.5 Forecasting & Goodness of Fit (5-59)

The full model:
  (1 - φ_1 B - ... - φ_p B^p)(1 - B)^d (Y_t - β_0 - β_1 X_{t,1} - ... - β_{k-1} X_{t,k-1}) = (1 - θ_1 B - ... - θ_q B^q) a_t,  a_t iid N(0, σ²)

The ARIMA(p,d,q) model can be rewritten, for t = 1, ..., n, as
  Y_t = g_1(Y_1, ..., Y_{t-1}) + g_2(X_{t,1}, ..., X_{t,k-1}) + g_3(a_1, ..., a_t)
where
  g_1 = linear combination (LC) of previous observations             [differencing]
  g_2 = LC of predictors at time t, in terms of the parameters β_j   [linear model]
  g_3 = function of random shocks, in terms of the parameters φ & θ  [AR & MA dependence structures]
Fit the model => estimates & standard errors for the β's, φ's & θ's.

5-60: Predicted values (point forecasts) from the Box-Jenkins model, even for times t > n:
  Ŷ_t = ĝ_1(Y_1, ..., Y_{t-1}) + ĝ_2(X_{t,1}, ..., X_{t,k-1}) + ĝ_3(â_1, ..., â_t)
- estimate Y_l with Ŷ_l if there is no observation at time l (l > n)
- estimate β_j with b_j; estimate φ & θ with φ̂ & θ̂
- note: â_l = Y_l - Ŷ_l for l ≤ n, and â_l = 0 for l > n

Multicollinearity:
- the predictors (1, t, X_{t,1}, ..., X_{t,k-1}) may be related
- need diagnostics for goodness of fit

Measure of overall fit -- standard error:
  S = sqrt( SS[Error] / (n - n_p) ),  n_p = # of parameters in the model
In SAS: "Std Error Estimate". Smaller S means a better fit.

Diagnostic checking -- Ljung-Box statistic:
- do the residuals reflect the model assumptions?
- checks the adequacy of the overall Box-Jenkins model for these data
- in SAS, look at the lag-6 χ² in the "Autocorrelation Check of Residuals" (what is H0?)

5-62:
  Q = n'(n' + 2) Σ_{m=1}^{M} r_m²(â) / (n' - m),  n' = n - d
  d = degree of differencing
  r_m(â) = RSAC (sample autocorrelation of the residuals) at lag m
  M = somewhat arbitrary number of lags to consider, usually a multiple of 6
Basic idea: look at local dependence among the residuals in the first M sample autocorrelations.
Under H0 (model is adequate), Q is approximately χ² with M - n_p degrees of freedom. This is the Ljung-Box statistic; in SAS it is the lag-6 χ² of the "Autocorrelation Check of Residuals".
So what are S and Q like for a better model? [smaller]

Interval forecasting:
- get the point forecast Ŷ_t, t = n + τ; most interested in τ > 0
- SE(n,τ) = standard error of Ŷ_{n+τ}; depends on S (the "Std Error Estimate" for overall fit)
- smaller S => smaller SE(n,τ)
Recall: Estimate ± CriticalValue × StandardError, with the critical value from the sampling distribution:
  Ŷ_{n+τ} ± t(n - n_p; 1 - α/2) × SE(n,τ)
Based on the historical data and the selected Box-Jenkins model, we are (1 - α)100% confident that the true value of Y at time n + τ will be inside this interval.
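The Q statistic is simple to compute once the residual autocorrelations r_m(â) are in hand. A sketch (illustrative Python; compare the result to a χ² table with M - n_p degrees of freedom, as above):

```python
def ljung_box_q(residual_acf, n, d=0):
    """Ljung-Box Q = n'(n' + 2) * sum_{m=1}^{M} r_m^2 / (n' - m), with n' = n - d.

    residual_acf holds r_1(a-hat), ..., r_M(a-hat). Under H0 (model adequate),
    Q is approximately chi-square with M - n_p degrees of freedom."""
    nprime = n - d  # n' in the notes
    return nprime * (nprime + 2) * sum(
        r ** 2 / (nprime - m) for m, r in enumerate(residual_acf, start=1)
    )

# Small residual autocorrelations give a small Q, consistent with model adequacy:
q = ljung_box_q([0.05, -0.02, 0.01], n=100)
```

Note how each squared autocorrelation is weighted by 1/(n' - m): higher lags are computed from fewer residual pairs, so they get slightly more weight.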
5-64: General SAS code for ARIMA(p,d,q), Y in terms of X_1, ..., X_{k-1} (schematic):

  proc arima data=a1;
    identify var=Y(d) crosscorr=(X1 ... Xk-1);
    estimate p=... q=... input=(X1 ... Xk-1) method=uls plot;
    forecast lead=L alpha=... noprint out=fout;
  run;

  option      description
  d, p, q     differencing, AR & MA settings, as before
  plot        adds RSAC & RSPAC plots
  lead=L      forecast L times after the last observed
  alpha=      sets the confidence limit (alpha=.10 -> 90% conf. limits)
  noprint     (optional) suppresses output
  out=fout    (optional) sends forecast data to the fout data set

For alpha = .10, the data set fout will contain the columns (variables): Y, forecast, std, l90, u90, residual. What about time, or X_1, ..., X_{k-1}?

5-65: Summary -- choosing a good model:
- choice of p, d & q
- RSAC & RSPAC die down quickly (should have nothing left)
- small standard error S
- small Ljung-Box statistic Q
- tight (narrow) confidence / prediction / forecasting intervals -- how far into the future? (t = n + τ, τ > 0)
- good summary comparison plot: overlay the forecast and confidence limits

5-66: Example 5.3.4 -- gas prices since 1976, revisited.

[Plots: gas price data vs. year; first differences of the gas prices vs. year]

Gas prices, removing linear and quadratic time (decimal points reconstructed from the t-ratios; signs of the estimates were lost in extraction):

  Variable   DF   Estimate    Std Error   |t|    Pr > |t|
  Intercept  1    112.11046   18.11369    6.19   <.0001
  year1      1    4.54221     2.38624     1.90   0.0663
  year2      1    0.25789     0.06614     3.90   0.0005

[Plot: gas prices after removing time effects -- residuals vs. year]

Look at the behavior of the SAC and SPAC after removing time effects. SAC (magnitudes; signs lost in extraction): 0.444 at lag 1, then all |r_m| ≤ 0.25 -- dies down quickly.
SPAC (magnitudes; signs lost in extraction): 0.444 at lag 1, then all |partials| ≤ 0.18 -- cuts off after lag 1.

Tentative model: ARIMA(1,0,0) with year1 and year2 as covariates (ULS estimation; decimal points reconstructed from the t-ratios, signs lost in extraction):

  Parameter     Estimate   Std Error   |t|    Pr > |t|   Lag
  MU            88.38384   32.51518    2.72   0.0108     0
  AR1,1         0.60015    0.16145     3.72   0.0008     1
  NUM1 (year1)  0.82421    4.34859     0.19   0.8510     0
  NUM2 (year2)  0.14764    0.12170     1.21   0.2345     0

  Std Error Estimate 28.80927

  Autocorrelation Check of Residuals: to lag 6/12/18/24, χ² = 2.01/3.35/4.10/8.18 on 5/11/17/23 DF, p = 0.8475/0.9853/0.9994/0.9981.

So what is the fitted model equation?

RSAC and RSPAC of the residuals (Autocorrelation Plot of Residuals): all autocorrelations and partial autocorrelations are small (|r| ≤ 0.19) -- nothing left.

[Plot: gas data, ARIMA(1,0,0) forecast with 90 percent confidence intervals]

Some forecasts from the ARIMA(1,0,0) model (decimal points reconstructed; the 90% limits are symmetric about the forecasts, which confirms the reconstruction):

  Obs   year   price   FORECAST   L90       U90
  32    2007   280.1   245.712    198.325   293.099
  33    2008   326.6   262.120    214.733   309.507
  34    2009   188.6   293.830    246.443   341.217
  35    2010   201.2   214.931    167.544   262.318
  36    2011     .     234.772    179.506   290.038
  37    2012     .     250.837    192.996   308.678

Consider another model: ARIMA(1,1,1) with one covariate, year1.
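For a plain AR(1) with mean μ, the multi-step point forecasts have a simple closed form: each extra step pulls the forecast a factor φ_1 closer to μ. A sketch (illustrative Python; it ignores the year1/year2 covariates in the fitted model above, so these are not the SAS forecast numbers):

```python
def ar1_point_forecasts(z_n, mu, phi, horizon):
    """Point forecasts Zhat_{n+tau} = mu + phi^tau * (Z_n - mu), tau = 1..horizon.

    As tau grows, phi^tau -> 0 (for |phi| < 1), so the forecast reverts to mu --
    which is why long-horizon intervals widen toward the unconditional spread."""
    return [mu + phi ** tau * (z_n - mu) for tau in range(1, horizon + 1)]

print(ar1_point_forecasts(z_n=10.0, mu=4.0, phi=0.5, horizon=3))  # -> [7.0, 5.5, 4.75]
```

This mean-reversion is visible in the forecast table above: the out-of-sample forecasts drift back toward the fitted trend rather than extrapolating the last observed swing.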
ARIMA(1,1,1) model with covariate (ULS estimation; decimal points reconstructed from the t-ratios, signs lost in extraction):

  Parameter     Estimate   Std Error   |t|    Pr > |t|   Lag
  MU            3.42746    12.56701    0.27   0.7870     0
  MA1,1         0.99998    0.40637     2.46   0.0201     1
  AR1,1         0.94832    0.40709     2.33   0.0270     1
  NUM1 (year1)  0.31682    0.62440     0.51   0.6157     0

  Std Error Estimate 32.59335

  Obs   year   year1   price   FORECAST
  32    2007   32      280.1   271.696
  33    2008   33      326.6   295.024
  34    2009   34      188.6   340.990
  35    2010   35        .     197.215
  36    2011   36        .     217.644
  37    2012   37        .     227.487

5-74: Forecasting equation for the ARIMA(1,1,1) model with one covariate. Write W_t = Y_t - β_0 - β_1 year1_t. Then
  (1 - φ_1 B)(1 - B) W_t = (1 - θ_1 B) a_t
  (1 - (1 + φ_1) B + φ_1 B²) W_t = a_t - θ_1 a_{t-1}
  W_t = (1 + φ_1) W_{t-1} - φ_1 W_{t-2} + a_t - θ_1 a_{t-1}
so
  Y_t = β_0 + β_1 year1_t + (1 + φ_1)(Y_{t-1} - β_0 - β_1 year1_{t-1}) - φ_1(Y_{t-2} - β_0 - β_1 year1_{t-2}) + a_t - θ_1 a_{t-1}

5.6 Seasonal Modeling (5-76)

Recall, for estimating out cyclic trends:
- add time-related variables as predictors
- make the time series stationary
Occasionally, even after using these regression methods, a seasonal effect remains:
- correlation (dependence) among the residuals at the seasonal level
- detectable using SAC and SPAC plots
- determine the appropriate error structure based on how the plots die down
What if the SAC & SPAC plots don't die down, but have a recurring pattern (e.g., spikes at lags L, 2L, 3L)? => a seasonal time series, with seasons of length L observations.

5-77: What to do? First, try using L - 1 dummy predictors (the most interpretable model). Otherwise, consider Box-Jenkins seasonal models:
1. Seasonal moving average model of order q (SAC spikes and SPAC dies down at lags L, 2L, ..., qL):
     Z_t = δ + a_t - θ_{1,L} a_{t-L} - θ_{2,L} a_{t-2L} - ... - θ_{q,L} a_{t-qL}
   SAS: estimate q=(L, 2L, ..., qL)
2. Seasonal autoregressive model of order p (SAC dies down and SPAC spikes at lags L, 2L, ..., pL):
     Z_t = δ + φ_{1,L} Z_{t-L} + φ_{2,L} Z_{t-2L} + ... + φ_{p,L} Z_{t-pL} + a_t
   SAS: estimate p=(L, 2L, ..., pL)

5-78: Some notation for seasonal models:

  Notation   Description                        Example
  B          backshift operator                 B Y_t = Y_{t-1}, B^k Y_t = Y_{t-k}
  Δ          nonseasonal operator               Δ Y_t = (1 - B) Y_t = Y_t - Y_{t-1}
  Δ_L        seasonal operator                  Δ_L Y_t = (1 - B^L) Y_t = Y_t - Y_{t-L}
  L          seasons per year (obs. per cycle)  quarterly data L = 4; monthly data L = 12
  *          pre-differencing transformation    Y*_t = f(Y_t), e.g. a log transformation

5-79: General stationarity transformation:
  Z_t = Δ_L^D Δ^d Y*_t
  d = degree of nonseasonal differencing; D = degree of seasonal
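The seasonal operator Δ_L in the notation table is ordinary differencing at lag L. A sketch (illustrative Python):

```python
def seasonal_difference(y, L):
    """Seasonal operator Delta_L: Z_t = Y_t - Y_{t-L}, for seasons of length L."""
    return [y[t] - y[t - L] for t in range(L, len(y))]

# Quarterly data (L = 4): a repeating within-year pattern plus a trend of
# +4 per year differences away to a constant series.
y = [10, 20, 30, 40, 14, 24, 34, 44, 18, 28, 38, 48]
print(seasonal_difference(y, L=4))  # -> [4, 4, 4, 4, 4, 4, 4, 4]
```

As with nonseasonal differencing, each pass loses observations -- here the first L of them.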
differencing. Examine the SAC & SPAC of the stationary time series Z_t (see Table 12.1 of Bowerman & O'Connell) to identify a tentative general Box-Jenkins model (next slide).

5-80: General Box-Jenkins model of order (p, P, q, Q):
  φ_p(B) φ_P(B^L) Z_t = δ + θ_q(B) θ_Q(B^L) a_t
where
  φ_p(B)   = 1 - φ_1 B - φ_2 B² - ... - φ_p B^p
  φ_P(B^L) = 1 - φ_{1,L} B^L - φ_{2,L} B^{2L} - ... - φ_{P,L} B^{PL}
  θ_q(B)   = 1 - θ_1 B - θ_2 B² - ... - θ_q B^q
  θ_Q(B^L) = 1 - θ_{1,L} B^L - θ_{2,L} B^{2L} - ... - θ_{Q,L} B^{QL}
  Z_t = Δ_L^D Δ^d Y*_t is the stationary time series
  φ_1, ..., φ_p, φ_{1,L}, ..., φ_{P,L}, δ, θ_1, ..., θ_q, θ_{1,L}, ..., θ_{Q,L} are unknown parameters to be estimated from the data
  a_t, a_{t-1}, ... are iid N(0, σ²) (independent and identically distributed)

5.0 Summary, revisited.
Response Y collected in some sequential manner (time, space):
- want to make useful forecasts (short-term predictions)
- want to understand what influences Y:
  - the obvious effects: recurring patterns in Y; the effect of other variables X_1, ..., X_{k-1} on Y
  - the less obvious: dependence among observations -- on previous values (autoregressive, AR(p)), on previous errors (moving average, MA(q)), or both (ARMA(p,q))
Box-Jenkins (ARIMA) models:
- account for dependence structures
- for useful forecasts, meet the model assumptions (stationarity: add dummy variables, transform the response, difference)
- graphical diagnostics (SAC & SPAC) to tentatively identify an appropriate model (ARIMA structure)
- graphical (RSAC & RSPAC) and numerical (Q & S) diagnostics to assess model adequacy
- make forecasts (point & interval) with an adequate model
- may need to consider seasonal models, based on the SAC & SPAC (or RSAC & RSPAC)
Now: a case study.
