Applied Linear Regression
Applied Linear Regression 22S 152
Popular in Course
Popular in Natural Sciences and Mathematics
This 86 page Class Notes was uploaded by Cullen Conn on Friday October 23, 2015. The Class Notes belongs to 22S 152 at University of Iowa taught by Staff in Fall. Since its upload, it has received 19 views. For similar materials see /class/228085/22s-152-university-of-iowa in Natural Sciences and Mathematics at University of Iowa.
Reviews for Applied Linear Regression
Report this Material
What is Karma?
Karma is the currency of StudySoup.
Date Created: 10/23/15
225152 Applied Linear Regression Appendix Anotation and Review Notation o Lowercase italic letters Known scalar ie not vector constants These are xed none random values 7 An observed covariate x o Uppercase italic letters Observable scalar random variables e If an observation is to be draw from a standard nor mal distribution X N NO7 l 7 After it is drawn it is known and shown as x 0 Greek letters Scalar parameters may 7 Y g lX E o Summation Z ZLXFREampXmaHM 0 Product H 7H1XiX1gtltX2gtltX3gtltgtltXn 0 Estimates of parameters Maa o Boldface characters Vectors and matrices 7 Vector of covariates for an individual a 7 Vector of regression parameters 5 g l 7 Matrix of covariates for all subjects X e Hopefully the context of the problem will make the use clear 0 Symbol means is distributed as iwamu 0 Expected value operator E 7 Expected value of X EX o Variance operator V e Variance of X VX Review Some probability distributions 0 Normal distribution Parameters p 02 N 7 02 o Zedistribution Standard Normal N 07 1 o X2 Chiesquared distribution XV Shape de ned by df V Domain 07 00 Some statistics and their distributions 1 Let Y N NW 02 Then for a independent Y1Yg andY 7 2 Y N N 7 7 Y is less variable than Yz o tedistribution Centered at zero Bell shaped like normal HeaVier fatter tails than the Normal Degrees of freedom df de ne fatness7 o Fedistribution Shape de ned by numerator df V1 FV17 V2 and denominator df V2 Domain 07 00 2 Let Y g lXi 61 with el indepene dently and identically distributed N 002 forz391n Let 31 b1 Z vri yr then 20624 7 bi N Nlt 17varltblll where 2 WW Erika more sp39readeout Xz s gives better estimate of l alternative hypothesis 7 Usually the more interesting statement 7 What the researcher would like to show 7 Symbolized by HA or H1 Possible Hypothesis tests H0p70 vs HAp7 70 H0p 70 vs HAtgt70 H0p270 vs HAtlt70 Under H0 true t N tg4 Under HA true the distribution oft will not be centered at zero it will be shifted The Pvalue of the test is the probability come puted assuming that H0 true that the observed outcome would take a value as extreme or more extreme than what we actually observed 7416770 The observedtim 3877 How often would a number this far from 70 occur if H0 were true Pt 2 3877 00004 Because this is a 27sided hypothesis test the Pevalue 200004 00008 15 Test statistic and Pvalue The appropriate test statistic for testing a hy7 pothesis depends on the parameters of inter7 est In the example from heart rate example assume ing heart rates are normally distributed we can perform a one sample t test for it H0p70 vs HAp7 70 The estimator for p is 17 observed 117416 Y N NltM727 so E07 7 M 2 Test statistic is atestatistic as 039 is unknown 15 Pvalues 0 Small pevalues are evidence against the null hypothesis 0 A small pevalue says we would have been very unlikely to have gotten a sample with data like this if H0 were true 0 The pevalue is not the probability that H0 is true 0 We use the calculated pevalue to make a deci7 sion on the hypothesis test based on a chosen signi cance level a 7 Reject the null hypothesis 7 Fail to reject the null hypothesis or accept the null hypothesis 0 We do not prove the null hypothesis true this is not how things are setup In the heart rate example if a 005 what is the decision P7value 00008 lt 005 i Reject H0 o Smaller signi cance levels means youIII need stronger evidence to reject H0 One sample ttests If we dont know the population standard devi7 ation 039 then we 0 estimate 039 with the sample standard devia7 tion 5 o compute a t7statistic 0 compare the t7statistic to a table of t7vaIues under the appropriate degrees of freedom 0 make an appropriate decision based on 04 Other tests In linear regression analysis we will extend the principles of statistical inference to examining the relationship between two or more variables Types of errors in hypothesis testing 0 We could reject H0 when it was actually true 7type I error 7 false positive 7thought you found something interesting but there really wasn7t c We could fail to reject H 0 when the null was actually false 7type II error 7 false negative 7 there was something interesting but YOU MISSED ITI Remember when computing t we assume H0 true so we use the value of p from the null hy7 pothesis or 0 t EO sw Computing pvaIues 7 Our book does not have distribution tables in the back but most introductory statistical books have these tables 7 Most of the p7values in this course will be generated from statistical software 7 You can also use R to get p7values directIy Pt 2 3877 R gt pt387724lowertailFALSE 1 00003592863 R gt helppt R gt pt Table showing possible outcomes H0 is faIse H0 is true Reject H0 Correct Type I error Do not reject H0 Type II error Correct aprobabiIity of making a Type I error aPreject H0 I H0 is true 7 We set a so that we don t make a Type I mistake very often maybe 5100 times probability of making a Type II error PFaiI to reject H0 I H0 is false 7 We wouId Iike to be smaII 7 SmaII means we have a good chance of detecting something interesting 7 depends on sampIe size and the true state of the situation 7 Power I7 22s152 Applied Linear Regression Ch 14 sec 1 and Ch 15 sec 1 amp 4 Logistic Regression Logistic Regression c When the response variable is a binary varie able7 such as 7 0 or 1 7 live or die 7 fail or succeed then we approach our modeling a little dife ferently o What is wrong with our previous modeling 7 Let7s look at an example The dependent variable plotted against the independent variable mmooo on o highbld i i i an 02 IA 06 as in mm D l l l l l l H mm ZUDEI annn ADDEI EUDEI suil Original E 7mm as o o o E ltr E mo i i i i i i El 1UEE ZUUU BUUU AUUU EUUU suil J1ttered highhld SampleOuanllles m an m 0 Example Study on lead levels in children Binary response called highbld Either high lead blood level1 or low lead blood level0 One continuous predictor variable called soil Measure of the level of lead in the soil in the subjects backyard gt slreadcsV soilleadcsV gt attachltslgt gt headltslgt highbld 011 1290 90 894 193 1410 1 O 1 O 1 1 410 mmewmex Consider the usual regression model Yi 0 ii6i The tted model and residuals Y 03178 0000395 Fitted model iittered poims Residuals vs Fitted values mmusooa lm nul ireslduals an 02 DA me as in a D o i i i i i i l i i i i i i H mm mm 5mm n A n E i 2 i E Sull lmuummedvalues Normal QQ Plot 33 A Theulellcal Ouamlles All sorts of violations here A Violations 7 Our usual 6 N0 02 isn7t reasonable The errors are not normal and they don7t have the same variability across all Xevalues e predictions for observations can be outside 01 For X3000 Y 114 and this is not possibly an average ofthe 0s and 1s at X3000 like a conditional mean given X Yx3000 114 which is not in 01 0 Conditioning on a given X we have PY 1 and PY 1 6 01 0 We will consider this probability of getting a 1 given the Xevalues in our modeling 7UPYi1lXi The previous plot shows that PY 1 de pends on the value of X o It is actually a transformation of this prob ability that we will use as our response in the regression model 0 What is a better way to model a 071 response using regression 0 At each X value there7s a certain chance of a 0 and a certain chance of a 1 WWW 680 O o 0 mghbid l l l 00 02 04 05 08 10 We 1 l l l l 0 1000 2000 3000 4000 5000 SOll o For example in this data set it looks more likely to get a 1 at higher values of X 0 Or PY 1 changes with the Xevalue 6 Odds of an Event 0 First let7s discuss the odds of an event When two fair coins are ipped Ptwo heads14 Pnot two heads34 The odds in favor of getting two heads is P2 heads 7 L l Pn0t 2 heads 7 34 or sometimes referred to as 1 to 3 odds odds 13 You7re 3 times as likely to not get 2 heads as you are to get 2 heads o For a binary variable Y 2 possible outcomes odds in favor of Y1 is o For example if Pheart attack00018 then the odds of a heart attack is 00018 00018 447k mwommm o The ratio of the odds for two different groups is also a quantity of interest For example consider heart attacks for male nonsmoker vs male smoker Suppose Pheart attack00036 for a male smoker and Pheart attack00018 for a male nonsmoker Back to Logistic Regression o The response variable we will model is a trans7 formation of PYi 1 for a given Xi o The transformation is the logit transforma7 tion l39 l 0 z a n 7 g 1 7 a o The response variable we will use This is the loge of the odds that Y 1 Notice PYi 1 E 01 P Y1 e lt07 00gt and 700 lt ln lt 00 11 Then the odds ratio OR for a heart at7 tack in nonsmoker vs smoker is 7 odds of a heart attack for non7smoker O R 7 Odds of a heart attack for smoker P heart attacklnon7smoker 17PTheart attacklnon7smoker P heart attack smoker 17PTheart attacklsmoker ll 04991 0 Interpretation of the odds ratio for binary 071 7 OR 2 0 7 If OR 10 then PY 1 is the same in both samples 7 If OR lt 10 then PY 1 is less in the numerator group than in the denominator group 7OR 0 if and only if PY 1 0 in numerator sample 10 o The logistic regression model Pm1 1 0 1X1i kai 7This response on the left isn7t bounded7 by 01 eventhough the Y7values them7 selves are having the response bounded by 01 was a problem before 7The response on the left can feasibly be any positive or negative quantity 7 This a nice characteristic because the right side of the equation can potentially7 give any possible predicted value 700 to 00 o The logistic regression model is a GENERALIZED LINEAR MODEL Linear model on the right something other than the usual continuous Y on the left 12 o Let7s look at the tted logistic regression model for the lead level data with one X covariate lt PY1 m 0 1X 71516000027 X WWWEMR W a 7 Muf n67 quot The curve represents the And EllHM 0P0 Oliz Hl POi 1l96i Poc 71m 13 0 In the general model with one covariate PY 1 7 m o 61Xz which means ewwo r X2 PlYZ 1 1 ammo 1X2 1 1 epo o 1X2l o lf l is positive7 we get the Seshape that goes to 1 as X goes to 00 and goes to 0 as X goes to 700 as in the lead example 0 lf l is negative7 we get the opposite Seshape that goes to 0 as X goes to 00 and goes to 1 as X goes to 700 c We can manipulate the regression model to put it in terms of 1 o If we take this regression model PY 1 7 Zn 7 715160 00027 X and solve for 1 we get exp715160 00027 X P Y 1 l 1 exp715160 00027 X 7 The value on the right is bounded to 01 7 Because our l is positive7 As X gets larger7 PYi 1 goes to 1 As X gets smaller7 1 goes to 0 7 The tted curve7 ie the function of X on the right7 is Seshaped sigmoidal 0 Another way to think of logistic regression is that were modeling a Bernoulli random vari7 able occurring for each Xi and the Bernoulli parameter 7r depends on the covariate value Then7 YilXi Bernoullzlm where 7139 represents PYi 1lXi Ema1X0 m and Vlyz39lXi 7011 77139 We re thinking in terms of the conditional distribu7 tion of YlX we don t have constant variance across the X values mean and variance are tied together Writing 7139 as a function of Xi 690plt 0 r Xi 1 690plt 0 r Xi 7W Interpretation of parameters For the lead levels example 0 lntercept g When X 0 there is no lead in the soil in the backyard Then So g is the logeodds that a randomly se lected child with no lead in their backyard has a high lead blood level Or 7139 6050 is the chance that a kid with e no lead in their backyard has high lead blood level For this data g 715160 or 7 01800 17 Or unedoing the log e l is the OR comparing the two groups For this data 01 00027 or 5010027 10027 A 1 unit increase in X increases the odds of having a high lead blood level by a factor of 10027 Though this value is small the range of the X values is large40 to 5255 so it can have a substantial impact when you consider the full spectrum of x values Interpretation of parameters For the lead levels example 0 Coef cient 1 Consider the loge of the odds ratio OR of having high lead blood levels for the following 2 groups 1 those with exposure level of x 2 those with exposure level of I 1 loge loge lt5 05 1x1gt 1771M g Og w loge em 1 1 is the loge of OR for a 1 unit increase in 06 It compares the groups with 1 exposure and x 1 exposure Prediction 0 What is the predicted probability of high lead blood level for a child with X 500 exp15160 00027 X p 12 1 l 1 exp15160 00027 X 1 1 exp15160 00027 X 1 04586 1 exp15160 00027 x 500 o What is the predicted probability of high lead blood level for a child with X 4000 1 PK1 09999 1 exp15160 7 00027 x 4000 o What is the predicted probability of high lead blood level for a child with X 0 PY 1 01800 1 1 expl5160 So7 there7s still a chance of having high lead blood level even when the backde doesn7t have any lead This is why we don7t see the loweend of our tted Securve go to zero Testing Signi cance of Covariate gt glmoutglmh1ghbldquotsoilfam11yb1nomiallog1t gt summary glm out Coefficients Estimate Std Error 2 value Prgtlzl Intercept 15160568 03380483 4485 730606 WW 5011 00027202 00005385 5051 439eio7 WW Signif codes 0 0001 M 001 005 01 1 Soil is a signi cant predictor in the logistic re gression model A couple things 0 Method of Maximum Likelihood is used to t the model Estimates j are asymptotically normal7 so R uses Zetests or Wald tests for covariate signi cance in the output NOTE squaring a zrstatistic gets you a chiesquared statistic with 1 df 0 General concepts easily extended to logistic regression with many covariates 225152 More on ANOVA Lecture 18 Nov 4 2002 Kate Cowles 374 SH kcowles statui0waeclu 3 If we use ANOVA to compare two populations7 the overall F statistic is equal to t2 7 7 2 t2 aft y 8 o numerator measures variation between groups as the square of the difference between the sample means7 scaled by the sample size 0 denominator measures variation within groups 0 general idea of F statistic for assessing whether more than 2 populations all have same mean 7 compare variation among group means with variation within groups 7 hence analysis of variance 2 Recall twosample t statistics 0 used to compare the means of two popula tions 0 if i the two populations are assumed to have equal but unknown standard deviations lt7 i the sample sizes are both equal to n i g is the pooled sample estimate of the common variance 02 0 then the t statistic is l ms 8 MH l Ql 33quot En U 4 The Oneway ANOVA model 0 For testing the hypothesis that the means are equal in l different populations 7 take random samples from each of l pop ulations ni is sample size in ith population 7 327 is jth observation from ith population 7 W is unknown population mean ofith pop ulation 702 is unknown common variance of all I populations 0 then the onerway ANOVA model is yij Mi 6239 fori177Iandj177n eij assumed to be from N07 lt72 distribur tion 7 unknown model parameters to be estimated M17M27 1702 5 Assessing the assumptions of oneway ANOVA 0 normal distribution of response variable in all populations of interest histograms boxplots for sample data from each population 7 normal qq plot for sample data from each population 0 same standard deviation 0 in all populations 7 possible to use Levenels test for homogener ity of variance but it is less robust to none normality than the ANOVA procedurel rule of thumb if largest sample standard deviation isnlt more than tWice as large as smallest sample standard deviation as sumption probably is met closely enough that ANOVA results Will be OK i also ANOVA is less sensitive to violation of equality of variance assumption if ms 7 Estimating the population parameters 0 For each population 239 the estimate of W is V 7 22711923 Z n 7 ml 0 pooled estimator of the common variance 02 82 n1713n27133n1 7 Us p 1 1n2 1 n1 1 Where 32 is the sample variance in the 2th sample 0 pooled estimator of the common standard deviation 0 2 8p 8p are close to equal 0 lfone or both assumptions are violated often a transformation of the response variable Will improve both E Hypothesis testing in oneway ANOVA THe null and alternative hypothesis for onerway ANOVA are H01M1M2MI Ha not all they are equal The ANOVA table Degrees Sum of Mean Source of freedom squares s uare F Groups I 7 1 zgmupsn y 7 2 SSGDFG MSGMSE Error N 7 l 270 an 7 Us SSEDFE Total N 7 1 2055le 7 g SSTDFT where o g is the overall mean from all samples come bined o SSG MSG and DFG are sum of squares mean square and degrees of freedom for groups 7 SAS calls this for Model 0 SSE MSE and DFE are sum of squares mean square and degrees of freedom for er ror o SST MST and DFT are total sum of squares mean square and degrees of freedom TWOway ANOVA 0 used to compare population means when the populations are classified according to two factors 0 advantages compared to doing two onerway ANOVAs more ef cient to study two factors simulr taneously rather than separately can reduce residual variation in a model by including a second factor believed to affect the response 7 can study interaction between factors 10 Coef cient of determination in ANOVA 33G 2 7 R T SST This is the proportion of the total variation in the data that is accounted for the the model 0 that is by the differences among the group means 12 Assumptions 0f twoway ANOVA 0 But first some notation first categorical variable factor A with l levels 7 second categorical variable factor B with J levels 7 22239 sample size for level 2 of A and level 339 of B 7 total number of obs N 2239 2239 222739 i 222739 unknown mean of response variable in population defined by level 2 of A and level 339 of B iyijk is kth observation from population having factor A at level 2 and factor B at level 339 o assumptions 7 independent simple random samples of size 222739 from each of I X J normal populations 13 i all these populations have same standard deviation 0 unknown 7 model yijk M 6m eijkls follow a N07 lt72 distribution 15 In twoway ANOVA7 however7 SSM and DFM are each split up 3 ways 0 SSM SSA SSB SSAB o DFM DFA DFB DFAB producing 3 distinct F statistics for factor A7 for factor B7 and for the interaction 14 The ANOVA table for twoway ANOVA As in onerway AN OVA7 both the sums of squares and the degrees of freedom add 0 total sum of squares model sum of squares error sum of squares or SST SSM SSE 0 total degress of freedom model degrees of freedom error degrees of freedom or DFT DFM DFE 16 The general ANOVA table for twoway ANOVA Degrees Sum of Mean Source of freedom squares square F A lrl SSA SSA DFA MSAME B J 71 SSB SSB DFB MSBME AB 1 r 1J71 SSAB SSAB DFAB MSABM Error N r U SSE SSEDFE Total N r 1 SST SSTDPT 17 13 Examples of What a signi cant interac Example the depression data Without tion can mean interaction term data de re p ss lnIlle groupItppubkcowlesdatasetsdepressdat lnput eff age trt trta0 rtb0 trta 1 else 11 trt eq 8 then trtb 1 then do n agegt40 0 11 age gt 40 then agegt40 1 roc m class trt agegt40 del eff trt agegt40 means trt agegt40 bon run 19 20 Class Level Informatlon agegt40 1 2597294935 2597294935 44 43 Class Levels Values Source Pr gt F 3 A B C trt 0 0016 agegt40 2 0 1 agegt40 lt 0001 Number or observatlons 36 Source DF Type 111 SS Mean Square F Value The ClM Procedure trt 2 1055682381 527V841190 93903 Dependent Varlable err agegt40 1 2597294935 2597294935 44 43 Sum of Source Pr gt F Source DF Squares Mean Square F Value 00008 Model 3 3524 461601 1174820534 20 10 agegt40 lt 0001 Error 32 1870538399 58454325 Corrected Total 35 5395 000000 Source Pr gt F Model lt 0001 Error Corrected Total R Square Coerr Var Root MSE err Mean 0 653283 13 85899 739645543 553916667 Source DF Type 1 SS Mean Square F Value CIC 2 927V166667 463 Y 583333 7 V 93 Tne ClM Procedure 005 Error Degrees of Freedom 32 Bonferronl ltDunngt t Tests 1or e11 Error Mean Square 5845432 Mlnlmum Slgnlflcant Dl erence 5393243 NOTE ans test controls the Type I experrmentwrse error rate but 1t Harmonrc Mean of Cell Slzes 173911111 generally has a nrgner Type II error rate than RECVQ NOTE Cell slzes are not equalv Alpna 005 Error Degrees o1 Freedom 32 Means wrtn the same letter are not Slgnlflcantly dl erent Error Mean Square 5845432 Crrtrcal Value or t 252843 Don Crouprng Mean M agegt40 Mlnlmum Slgnlflcant leference 78857 A 81773 22 1 B 44788 14 0 Means wrtn the same letter are not Slgnlflcantly dlfferent Don Crouprng Mean M trt A 82333 12 A B 51917 12 B B 51250 12 C The ClM Procedure Bonferronl ltDunngt t Tests for e11 NOTE ans test controls the Type I experrmentwrse error rate but 1t generally has a nrgner Type ll error rate than REGWQ Z3 24 With interaction 0783071 1132184 8245884 5518887 pr Elm S DF T lSS M S FVl class trt agegt40 ource ype ean quare a ue model e11 trt agegt40 trt agegt40 2 927188887 483583333 1188 means trt agegt40 trt agegt40 bon run I agegt40 1 2597294935 2597294935 6668 trtagegt40 2 700208258 350103128 897 Class Level Informatlon source Pr gt F Class Levels Values 03900 agegt40 lt V 0001 6 v trt 3 A B C trt agegt40 0 0009 agegt40 2 0 1 Source DF Type lll SS Mean Square F Value Dependent Varrable e11 trt 2 1162147432 581073718 1439 90 Sum of agegt40 1 2551877357 2551877357 6541 Source BF Squares Mean Square F Value trt agegt40 2 700 208258 350 103128 8 97 gt Model 5 4224887857 844933571 2166 source Pr F Error 30 1170332143 39011071 lt 0001 Corrected Total 35 5395000000 39 agegt40 lt V 0001 Source Pr gt F trtagegt40 00009 Model 0001 Bonferronl ltDunngt t Tests for e11 Err MOTETn t t t l tn T l t t bt t Corrected Total V 1s es con ro s e ype experlmen wlse error ra e u 1 generally has a nrgner Type ll error rate than REGWQ R Square Coeff Var Root MSE eff Mean Alpha 005 Error Degrees or Freedom 30 Error Mean Square 393901107 Crltlcal Value of E 253574 Mlnlmum Slgnlflcant leference 64658 Means wrth the same letter are not srgnrrreantly drrrerent Eon Grouprng Mean M trt A 62 V 333 12 A B 51917 12 B B B 51250 12 c The GLM Procedure Bonferronl ltDunn t Tests for eff NOTE Thls test Controls the Type I experrmentwrse error rate but rt generally has a hrgher Type II error rate than REGWQ Alpha 005 Error Degrees or Freedom Error Mean Square Crrtreal Value or t Mlnlmum Srgnrrreant Drrrerenee Harmoan Mean or Cell Slzes 30 3901107 204227 4361 17 V 11111 NOTE Cell slzes are not equal Means wrth the same letter are not srgnrrreantly drrrerent Eon Grouprng Level of oommrr Level of agegt4o HOHOHO Mean 61773 44786 agegt4o The GLM Procedure 7142857 Std Dev 439 08656335 439 46147531 3 V 51188458 662112420 8 V 13633824 7 V 91021040 22s152 Applied Linear Regression Chapter 15 Section 2 Poisson Regression with zeroin ation 0 Example Ear infection in swimmers Count response Infections Number of ear infections Categorical predictors Swimmer Frequent 0 or Occasional 1 Location Beach 0 or Nonebeach1 gt contrasts Sw1mmer 1 gt contrasts LOCaElOD NonBeach Beach 0 NonBeach 1 neqlaeach neqmonheach usually usually mtecnons mtecnons OccasIBeach oucasmonheach g 3 3 c a W H 39 l V mover V l V mover There is an overabundance of zeros which gives the Poisson regression a poor t There are 4 groups or cells7 and swimmers that swim frequently at the beach are the baseline group gt glmout2 glmltlnfectionsquotSw1mmer LOCBElOD gt glmbut 2coeff1c1ents familyrpoisson Intercept Sw1mmer0ccas LocationNonBeach 0 3058324 0 6130355 0 5087305 Group Poisson mean Frequent Swimmers at the Beach A e o 074 Frequent Swimmers not at the Beach A elt o n 122 Occasional Swimmers at the beach A elt o n 136 Occasional Swimmers not at the beach 5 o 1 2 226 ZeroIn ated Poisson ZIP Regression This type of model accounts for the increased number of zeros Its not uncommon when you have low counts7 to have an overabundance of zeros 7 Tumor counts in experimental mice 7 Number of lled teeth in children cavities 7 Shoot counts on a plant In terms of goodnesseofe t7 you could use a LR test to see if the extra parameters used in the ZIP model were really needed in this case three extra parameters In a ZIP model7 we think the covariates may explain why some groups have an in ated nume ber zeros7 and that these covariates may also explain why some groups have larger counts ZIP model with probability pl 07 N Poz ssoni7 with probability 1 7 pl And we model the probability of being a per fect7 zero as a logistic regression IOgitW 70 mm mm and we model the Poisson part as a Poisson re gression 10gi n lm zm It7s like there7s two generalized linear models working together to explain the data 5 gt zip outcoefficients count Intercept SWimmerOccas LocationNonBeach 06079255 06154344 zero Intercept SWimmerOccas LocationNonBeach 03868486 O1956552 07543500 Hand D J Daly F Lunn AD McConway Kl Ostrowski E 1994 A Handbook of Small Data Sets London Chapman amp Hall Data set 328 The R contributed package pscl contains the function zeroin o which will t this ZIP model to our data gt library mvtnorm gt librarycoda gt librarypscl gt zipout zeroinflInfections quotSWimmer LOCBEIOD dataearinf dist poisson gt summaryzipout Count model coefficients poisson With log link Estimate Std Error 2 Value Prgtlzl Intercept 06079 01224 4965 687e707 M SWimmerOccas 05154 01229 4194 274e705 M LocationNonBeach 01221 01151 1061 0289 Zeroiinflation model coefficients binomial With logit link Estimate Std Error 2 Value Prgtlzl Intercept 03868 02441 1585 011295 SWimmerOccas 41957 02748 O712 047640 LocationNonBeach O7543 02695 799 000513 M Signif codes O OOO1 O01 700577017 Number of iterations in BFGS optimization 13 Logilikelihood 4752 on 6 Df Fitted probabilities for the ZIP models FreqHench FreoNonbeach E E 39 3 c 7 3 c 7 a u n c u c 7 C39 I I I I C39 I l l l n 5 1D 15 n 5 1D 15 mums mums OccasEeach OccasNonbeach Denle 4 Martians WEEHDHS A much better t than the Poisson Regression c When faced with count data7 the rst plan of attack is to use a transformation that may get you nearenormality and constant variance and then classical regression could be used c The x y 38 transformation is recommended for count data 0 Sometimes a transformation doesn7t x eve erything7 and a Poisson regression model or a ZIP regression model may be a better t 0 The ZlP model has a speci c interpretation of the parameters which may be more meane ingful than the usual OLS parameters more relevant to the context at hand 0 For Poisson regression7 if the variance is not equal to the mean7 you may need to allow for the exibility of overdz39spersz39on or underdz39s person in your model 9 Code for plotting ZlP tted probabilities overe laid with the observed relative frequencies ero coeffmp youtcoeff1c1eutszero Pons V coeffmp outcoeff1c1eutscouut xseqlt0gt17 parmfrowc2gt2 bbreaksseqlt0gt18 395 a freqbeach pVO11expzerovcoeff1W lambd pltPonsscoeff1gt ftted1p0dposxgtlambdacpV0gtrep0gt17 hstearvJuflSmmmer Freq amp Locatio Beach gt3 gt col grey80 gtbreaksbgty11mc0gt 85 gtmaiu FreqBeach gt xlab 1ufectious gtprobT boxO pouts017gtf1ttedgtpch16 a freqNoubeach pO11exp sumzeroVcoeffc1gt3l lambd pltsumltPonsscoeffclt1gt3gtgtgt ftted1pV0dpolsxgtlambdacplV0gtrep0gt17 hstearuflSmmmer Freq amp Locatlo NouBeach gt3lgt col grey80 gtbreaksbgty11md0gt 85 gtmaiu FreqNoubeach gt xlab ufectous gtprob gt boxO pouts017gtf1ttedgtpch16 Code for plotting Poisson Regression tted probe abilities overlaid with the observed relative fre quencies xseqlt0gt17 parmfrowc2gt2 bbreaksseqlt0gt18 V5 hstearvJuflSwmmer Freq amp LocamoquBeachmsr col grey80 gtbreaksbgt lemCOgt 65 gtmaJu FreqBeach gt xlab 1ufectlous gtcexmau2gtcex 1ab2gt probT boxO pouts017gtdposxgt0 V74gtPCh16gtcex2 hstearuflSmmmer Freq amp Locatlou NouBeach gt3gt col grey80 gtbreaksbgt lemCOgt 65 gtmaJu FreqNoubeach gt xlab 1ufectlous gtcexmau2gtcex 1ab2gt probT boxO pouts017gtdpolsxgt122gtPch16gtcex2 hsearmf3mmer 0ccas amp LocamouxHBeach usr col grey80 gtbreaksbgty11mc0gt 65 gtmaJu 0ccaSBeach gt x1ab ufectous gtcexmau2gtcexvlab2gtprobT boxO pouts017gtdpolsxgt136gtPch16gtcex2 hsearmf3mmer 0ccas amp Locatiou NouBeach gt3gt col grey80 gtbreaksbgty11mc0gt 65 gtmalu 0ccasNoubeach gt xlab 1ufectlous gtcexvmalu2gtcexvlab2gtprobT boxO pouts017gtdposxgt2 26gtPCh16gtcex2 10 a OccaSBeach pO11exp sumzeroVcoeffc1gt2l lambdaexpltsumltPonsVcoeff C13 gtgt ftted1pV0dpolsxgtlambdacplV0gtrep0gt17 hsearmf3mmer 0ccas amp LocamouxHBeach ua gt col grey80 gtbreaksbgty11md0gt 85 gtmaJu 0ccasBeach gt xlab 1ufectlous gtprobT boxO pouts017gtf1ttedgtpch16 a OccasNoubeach pO11exp sumzerocoeffc1gt2gt3l lambdaexpltsumPonscoeffC1gt2gt3l ftted1pV0dpolsxgtlambdacplV0gtrep0gt17 hsearmf3mmer 0ccas amp Locatiou NouBeach gt3gt col grey80 gtbreaksbgtyhmd0gt 85 gtmalu 0ccaSNoubeach gt xlab 1ufectlous gtprobT boxO pouts017gtf1ttedgtpch16 22s152 Applied Linear Regression Chapter 8 TwoWay ANOVA 0 Example A balanced twoeway ANOVA without signi cant interaction Fishing Data 7 Objective to see if sinker weight andor rod length affect the distance the line is cast and if the rod effect if present is the same for differing sinker weight ie see if there is interaction 7 Variables Distance distance line cast in meters Sinker factor with levels 8 0212 oz Rod factor with levels 6 ft7 ft 1 w Reeass1gu the variables to be factors gt rodasfactorrod gt sinkerasfactorsmker gt 1s1accors1uker 1 TRUE gt 1s1accorrod 1 TRUE gtSl ker 111111112222221122 Levels12 gtrod 111111222222112211 Levels12 gt urQWCrodsiuk 1 16 gt tableCsiukenrod sinker 1 2 1 4 4 2 4 4 This is a balanced twoeway ANOVA s 0 Example Fishing data gt rodsiukread csV rodaudsiuker csV gt attachCrodsmk headltrods1uk rod sinker distance order 1 2 v 30 31 mmwaH was 030 00 owoowo ommmmw 1 1 1 1 1 1 2 1 2 1 as gt isfactorrod 1 FALSE gt 1s1accors1uker 1 FALSE The variables rod and sinker were entered as numbers in the data le rather than letters7 so we have to tell R that these are factors or it will see them as continuous data lf rodl 6 foot rod lf sinkerl 8 oz sinker 2 Get the interaction plot of the means gt interaction plot Sinker rod dlstauce 72 an nquot 0 4mm 34 m gt meanstapplydistaucedistCsiukenrod mean gt means 2 1 29875 3575 2 35625 3950 First factor listed is represented in the rows 4 Get a more sophisticated plot gt plotc 075225 rangedistance type n xlab Sinker ylab effectiveness axesF gt axis2 axis1at12c 8 ounces 12 ounces XO v gt pointsjitteras numericsinkerrod1 factor15 distance r0 1 pch 1 col1 gt pointsjitteras numericsinkerro 2 factor15 distance rod2 pch 2 col2 gt linesc12 means1 col1 gt linesc12means21 col2 gt legend142c Rod 6 feet Rod 7 feet colc12 lty1 Before tting the linear model we will change our dummy regressor coding to the sum t0 zem constraint gt contrasts rod 2 10 21 W Reassign as sumstoszero contraint dummy regressors gt contrasts rodcontr sumlevelsrod gt contrasts rod 1 1 1 2 1 gt contrasts sinker b OM W Reassign as sumstoszero contraint dummy regressors gt contrastssinkercontrsumlevelssinker gt contrastssinkergt 1 1 2 1 alsrance m 4 l a nun25 2 nun25 S nker As the primitive interaction plot also sug gested there may be some slight interaction we7ll test for this But even if there is in teraction having a rod of 7 feet is better at both levels of sinker weights Fit the linear model with interaction gt 1mout1mdistance gt summarylmout sinker rod sinkerrod Coefficients Estimate Std Error t value Prgtltl Intercept 351875 05 526 lt Zeile WW sinkerl 23750 0 5135 41625 0000585 WW rod1 24375 05135 41747 0000475 WW sinker1rod1 05000 05135 0974 0349411 Residual standard error 2054 on 12 degrees of freedom Multiple Risquared O789Adjusted Risquared 07363 Fistatistic 1496 on 3 and 12 DF psvalue 00002341 ltems provided by the summary7 output 1 Overall Fetest reject thus something in the model is useful for explaining the vari ability in distance 2 Test for dummy regressors In this case since there is only 1 dummy regressor for each factor this gives us the tests for main effects of sinker and rod 8 We can also use the An00alm0uttype 11177 command to get the appropriate tests for main e ects and interaction gt Anovalmouttype TTl Anova Table Type III tests Response di stance Sum Sq Df F value PrgtF Intercept 198105 1 46958370 lt 22eile WW sinker 9 3 1 213926 00005849 WW rod 951 1 225333 00004746 WW sinkerrod 40 1 09481 03494114 6 12 Residuals 50 Signif codes 0 0001 001 005 01 1 Items provided by the Anovalmouttype lll 7 output 1F7tests for the main e ects and interac7 tion You dont get an overall F7test here 2 Type III sums of squares for each term 9 i there is NOT signi cant interaction Since there was not signi cant interaction we can use the simpler additive7 model for any fur7 ther analysis gt Anovalmoutredtype TTT Anova Table Type III tests Response distance 5 q f F value PrgtF Intercept 49208 1 1171071 4003e714 WM 903 sinker 1 21478 00004675 M97 rod 951 1 22624 00003751 M Residuals 546 13 Signif codes 0 0001 H 001 005 01 1 Both sinker weight and rod length have a sig7 ni cant e ect on distance cast What are the e ects of these factors In other words is it better to have a 6 or 7 foot rod Or an 8 or 12 oz sinker 11 As we mentioned earlier in atwo7way ANOVA that includes interaction a common rst step is to test for interaction We can get this from the Anova7 output or we can also do a partial F7test Using partial F7test for interaction term gt lmoutredlmdistance quot sinker rod gt anovalmoutredlmout Analysis of Variance Table Model 1 distance quot sinker rod Model 2 distance quot sinker rod sinkerrod ResDf RSS Df Sum of Sq F PrgtF 1 13 54625 2 12 50625 1 4000 09481 03494 This is the same p7value as for the dummy re7 gressor called sinkergxmdg in the summary output just square the t 709742 m 0948 to get the F And the same p7value for inter7 action using the Anova7 command 10 We can plot the tted model by getting the 4 predicted cell means Recall how is R coding the dummy regressors contrastsrod 1 1 2 1 contrastssinker 1 1 2 1 lmout redcoefficients Intercept sinker1 rod1 351875 23750 24375 The coef cients are given for sinkerl and rodl rod 6 7 sinker 8 351875 7 23750 7 24375l351875 7 23750 24375l 12 351875 23750 7 24375 l 351875 23750 24375l rod 6 7 sinker 8 35 250 12 40 000 The four t ted values gt uniqueltlmoutredfitted 1 30375 35250 40000 35125 Aligned with the data gt cbindltrodsink lmout redfittedgt rod sinker distance order lmoutredfitted 1 1 3 1 1 280 037 2 1 1 305 2 30375 3 1 1 310 6 30375 4 1 1 300 5 30375 5 2 1 335 9 35250 6 2 1 350 10 35250 7 2 2 380 15 40000 8 2 2 375 16 40000 9 2 2 420 11 40000 10 2 2 405 12 40 000 11 1 2 330 7 35125 12 1 2 340 8 35125 13 2 1 360 13 35250 14 2 1 385 14 35250 15 1 2 380 4 35125 16 1 2 375 3 35125 13 e7 2 2 2 U 12 8 57 2 x g c g y 5 RI 1 La 24m Sinker CONCLUSIONS 7 The heavier sinker has a positive e ect on distance 7 The longer rod has a positive e ect on dis7 tance The tted model shows parallel pro le plots gt plotc075225 rangedistance type n xlab Sinker ylab di stance axesF cex Iab2 cexaxi s2 gt axis2 gt 8X1518E112CH8 ounces 12 ounces gt box gt pointsjitterasnumeric sinker rod1l factor 15 stance rod1l pch 1 col1cex1 5 gt pointsjitterasnumeric sinker rod2l factor 15 distance rod2l pch 2 col2cex1 5 in lines connecting the relevant fitted values linesc12c3037535125col1 gt IineSC12c3525040000col2 v gt Iegend142c Rod 6 feet Rod 7 feet colc12 Ity1cex15 7 The positive e ects are seen in the plot7 and also in the estimated coef cients in the ef7 fects model Yijk M04i j6ijk 32 23250 and an 723250 32 24375 and 31 724375 n 351375 gt meandistance Wt Y7bar muhat because balanced 1 351875 7 As there was no interaction7 the rod e ect is the same for both levels of sinker 7 The tted model shows parallel lines 7 RECOMMENDATION FROM THIS DATA For the longest distance7 get a rod of length 7 ft and sinker that weighs 12 oz 225152 Applied Linear Regression Chapter 8 ANOVA o Twoway ANOVA 7 Used to compare populations means when the populations are classi ed by two face tors or categorical variables 7 For example 361 and occupation for pre dicting salary 7 advantages compared to doing two sepa7 rate7 oneeway ANOVAs more ef cient to study two factors sie multaneously rather than separately can reduce residual variation in a model by including a second factor believed to affect the response more predictors can study interaction between factors 1 Cells7 in a twoway ANOVA Example Poison data The data give the survival times in 10 hour units in a 3 X 4 factorial experiment the factors being a three poisons and b four treatments Each combination of the two factors is used for four animals the allocation to animals being completely randomized The variables in the dataset are Time Survival time in 10 hour units Poison Factor with levels 123 Treatment Factor with levels 172374 Notation 7 rst categorical variable factor A with at levels 7 second categorical variable factor B with 12 levels 7 n2 is the sample size for level i of A and level j of 7 N is total observations Cell means model d ll Yijk theijk Wlth Eijk N N010 7 p is population mean for level i of A and level j of B there are ab distinct cell means 7 Allows for freely t7 mean in each cell 7 It will take ab parameters to describe the mean structure 7 Yzjk is kth observation from population with level i of A and level j of B E ects model Yijk M04i j39j Ez jk 7 042 is the main e ect parameter for level i of A 7 is the main e ect parameter for level j of B 7 72 is the interaction parameter for level i of A and level j of B 7 There are 1abab parameters to describe the mean structure this is an overpammete39rization 7 We ll need to use constraints to make the parame ters estimable 7 But7 the eiTects model can be useful when there are only main eiTects present no in teraction because its a much simpler model to interpret more on this later 7 Yzik is kth observation from population With level i of A and level j of B gt levels Poison L Seen as numeric redefine as categorical gt POison7asfactorPOison gt isfactorPoison 11 TRUE gt leve1sPoison 1 ml HZ Us Get the 12 cell means gt meanstapplyltTimelist POlSOD Treatment mean means 1 2 3 4 1 04125 0880 05675 06100 2 03200 0815 03750 06675 3 02100 03335 02350 03250 Get the number of observations in each cell gt tablePoison Treatment Treatment Poison1234 14444 24444 34444 0 Example Poison data Time Survival time in 10 hour units Poison Factor With levels 123 Treatment Factor With levels 172374 gt poisondatareaddelim poisondata gt headltpoisondata Poison Treatment Time quantuwm gt gt gt gt gt MMHHHH o o 5 w gt levelsltTreatment NULL w R sees it as numeric redefine as categorical gt TreatmentasfactorTreatment gt isfactorTreatment 11 TRUE gt levelsltTreatment 1 ml HQH Us H4 This is a balanced Qeway ANOVA See how R is creating the dummy regressors gt contrastsltPOison 2 3 100 210 301 gt contrasts Treatment 000 100 010 001 prH i Poison level 1 and Treatment level 1 is the baseline You can get a quick plot of the means using the interactionplot function gt interactionplot Poison TreatmentTime Tvealmem wean or Time Puisun More sophisticated plot gt poisonnumasnumericPoison gt piotco535gtrangemmetypea n xiaba POisson ylab Survival Time axesFmain Survival Time vs Poison by Treatment group Jittered gt axis2 gt axis1at13Cll1 2 all Fit the model that includes interaction and get the overall Fetest gt lmoutlmTime Poison Treatment PoisonzTreatmeut gt summarylt1moutgt Coefficients Estimate Std Error t value Prgtltl Intercept 041250 007457 55 294e06 Poison2 009250 010546 0877 03862 Poison3 020250 010546 1920 00628 Treatment2 046750 010546 4433 837e05 Treatment3 015500 010546 70 01503 Treatment4 19750 010546 1873 0 0692 Poison2zTreatment2 002750 014914 0184 0 8547 Poison3 Treatment2 0 34250 014914 2297 0 0276 Poison2 Treatment3 0 10000 014914 0671 05068 Poison3 Tre tme t3 0 13000 014914 0872 03892 Poison2Treatment4 0 15000 014914 1006 0 3212 Poison3zTreatment4 008250 014914 0553 0 5836 Signif codes 0 0001 001 005 01 1 Residual standard error 01491 on 36 degrees of freedom Multiple Risquared O7335Adusted Risquared 06521 Festatistic 901 on 11 and 36 DF pevaiue 1986eeo7 The overall Fetest pevalue is quite small and we reject the overall Fetest At least one of the terms Poison Treatment or PoisonTreatment in the model is signi cant gt box gt pointSitterpoisonnumTreatment lfactor05gt T l pch 1 col1 lfactor05gt 2 ime Treatment 1 gt points J i tter poi son num Treatment Time Treatment 2 l pch 2 col gt points J i tter poi son num Treatment Time Treatment 3 l pch gt pointsitterpoisonnumTreatment factor051 Time TreatmentH l pch 4 gtcol4 linesC1 2 a means 1 col1 linesC123 means2l co linesC123 gtmeans3l col3 linesC123 means4l col4 vvvv gt legend312c TRT1 TRT me S W TRT 4 colc1234lty1 Survlval Tlme vs Polson by Treatment group umiereai Puissun The summary command also provides a test for each dummy regressor for each factor but were probably more interested in testing for each factor as a Whole initially The function Anova7 notice the capital A is a function in the car library We can use this to compute type III sums of squares and do appropriate tests for the effects in the model gt Anova1mouttype TTl Anova Table Type III tests Re sponse Time Sum Sq Df F value PrgtF Intercept 068062 1 306004 2937e706 008222 2 18482 01721570 Treatment 045395 3 68031 00009469 WW PoisonTreatment 025014 6 18743 01122506 Residuals 080073 36 Signif codes 0 0001 H 001 1 005 01 1 NOTE We need to be careful using the anova function for unbalanced data because it computes type T sums of squaies sequential sums of squaies for testing the signi cance of the teims We want type III sums of squaies 12 ln models that include interaction as this one does a common rst step is to test for the sig ni cance of the interaction If there is no interaction we perceive the true pro le plot of the means to be parallel meanm Time as as The interaction is NOT signi cant in this model Sum Sq Df F value PrgtF PoisonTreatment 025014 6 18743 01122506 So these lines connecting the cell means are not signi cantly di erent than parallel7 13 When the interaction is not signi cant we can go back and t the reduced model without in teraction and use this reduced model for the rest of the analysis gt Anovaltlmnointeractiontype TTT Anova Table Type III tests Response Time um Sq Df F value PrgtF Intercept 163654 1 4224e710 M 103301 2 20643 57O4e707 M Treatment 092121 3 12273 6697e706 M Residuals 105086 42 m 07 Jgt o oo Signif codes 0 0001 H 001 005 01 1 E ects model with only Main E ects Yijk Mai j 6m Both main e ects are signi cant here A model like this with no signi cant interace tion is said to have only main e ects7 We can plot the main effects7 tted model which shows parallel lines Mam effects model Survlval Tlme vs Folson by Treatment 12 Sumalva no n5 me in n2 Puissun Since there is no signi cant interaction we can say The Treatment e ect is the same for all levels of Poison Or at each Poison level the Treatment e ect is the samequot We can get the parameter estimates like usual gt lmuoJuteractloucoeff1c1euts Interce 1 P0150212 P0150213 Treatmeut2 TreatmentS Treatmeut4 045229167 007312500 034125000 066250000 007833333 022000000 You can also get the tted values using the lm output gt cbudPonouTreaCmeutlmmo1uteractouftted sou Treatment 1 4522917 HROOOxlwv PWIOH O O O 0 08147917 0 O O O O WWNMMMHHHH 1 1 1 1 1 1 1 1 1 5306250 There will be 12 tted values in the 27way ANOVA To plot the main e ects tted model we can plugein the estimated parameters to the main e ects model Yijk M 04139 j to get the 12 tted means and then connect the relevant tted values 3 Main effects lot connecting fitted means mtnin same treatment gt CoefflmvDOVinteractio coefficieuts coeff Intercept Poison2 Poison3 Treatment2 Treatment3 Treatment4 045229167 007312500 034125000 036250000 007833333 022000000 gt a Sum the relevant parameter estiamates to get the 12 fitted values gt mn11coeff1i 3 Baseline group gt mu34sumcoeffc136l a Create vector of means for each treatment gt TRT1meanscmu11mu21mu 1 gt TRT2meansCmu12mu22mu32 gt TRT3meanscmu13mu23mu33 gt T39RT4meansCmu14mu24mu34 If there IS signi cant interaction we shouldn7t consider the tests for main e ects ie the tests for Poison or Treatment in the Anova output Instead we should consider the Treatment efe fect separately at each Poison level because the Treatment e ect depends on the level of Poison and there isn7t a common global7 Treatment efe fect to be testing all at once meanm rvve p msun gt plotc0535rangeTimetype n xlab Poisson ylab Survival Time axesFmain Main effects model Survival Time vs Poison by Treatment gt gt axis2 gt axis1at13c 1 2 3 gt boxO sonnumasnumeriCPoisou gt pointsitterpoisonnumTreatment 1 factoro5 Time Treatment 1 l pch col1 gt pointsitterpoisonnumTreatmen 2 lfactor05 Time Treatment 2 l pch 2 col gt pointsitterpoisonnumTreatmen 3 factor05 Time Treatment 3 l pch 01 gt pointsitterpoisonnumTreatmen 4 factor05 Time Treatment 4 l pcl 41 gt col4gt gt linesC123TRT1col1 gt linesC123T39RT2col2 gt linesC123TRT3col3 gt linesC123T39RT4col4 gt legend312c TRT1 TRT 2wnzr S W TRT 4 colc1234lty1 mm mm mm Sunmul I39m vs Palm by Y39alnull N7 2 2 2 2 Ege a 2 it K 7 Another option for testing for interaction is to use a Partial Fetest gt lmnointeractionlmTime quot Poison Treatment gt anovalmnointeractionlmout Analysis of Variance Ta Model 1 Time quot Poison Treatment Model 2 Time quot Poison Treatment PoisonTreatment ResDf RSS Df Sum of Sq F PrgtF 1 42 105086 2 36 080073 6 025014 18743 01123 This pevalue is the same as the one seen in the gt Anova mout type quot111 output Different types of interaction 2 factors2 levels Cells No interaction parallel lines Assumptions 0f twoway ANOVA 0 independent simple random samples of size 711 from each of a X b normal populations 0 constant variance across all populations c 61770 are normally distributed as N07 02 Sums of squares We still have the same fundamental relationship that TSSRegSS RSS The RegSS represents the variability explained by the model7 which includes the terms of main e ects for factor A and B and the interaction between A and B7 in the full model Different types of interaction 2 factors2 levels Factor B is in same direction for all levels of Factor A but not parallel lines Factor B is in di erent direction for di erent levels of Factor A lines cross A general ANOVA table for twoway ANOVA Source Sum of Squares df Mean Square F A SSA a4 MSA B 33 b4 M SB AB ssAB a71b71 quot751 MSAB g Residuals RSS neab MSE Total TSS N71 If we have equal sample sizes in each cell a balanced design then TSS SSA 553 SSAB RSS But if we have an unbalanced design then this equation does not hold in the case we can still test for effects using the Type lll sums of squares as with the Anova function looking for a signi cant effect given all other factors are accounted for 225152 Weighted Least Squares Lecture 16 October 28 2002 Kate Cowles 374 SH kcoWlesstatuiowaedu Recall ordinary least square OLS 0 used to compute the estimated regression co ef cients when assumptions of linear regres sion are adequately met 0 chooses B07 317 7B1 to minimize Elk239 30 31 t BMW 2 Cases Where the errors are not iid 0 Recall assumptions 7 errors are independent 7 errors have normal distribution with equal variance at all values of the predictors gtk identically distributed 0 we check these assumptions be examining the residuals 7 our estimates of the true errors 0 exceptions that call for special handling 7 unequal variance correlated errors 0 weighted least squares WLS is appropriate for certain types of unequal variance 4 Weighted least squares o useful when errors have different variances but are all uncorrelated 0 assumes that we have some way of knowing the relative variances associated with each observation 0 associates a weight 10239 with each observation i weight are inversely proportional to varir ances of errors 0 chooses 307317 73p to minimize quot A A A 2 1 wi 3239 50 51 pfvpill o examples 7 if ith response is average of 72239 equally varir able observations7 then gtk Var 0 2 ni gtk w m i if variance is proportional to some predicr tor mi gtk Vary 202 1 wiix l 7 the difference between long and short shoots of Mclntosh apple trees Using healthy trees of clonal stock planted in 1933 and 1934 he took samples of long and short shoorts from the trees every few days throughout the 1971 growing sea son about 106 days The shoots sampled are presumed to be a sample of available shoots at the sampling dates The sampled shoots were removed from the tree marked and taken to the laboratory for analysis Among the many measurements taken Bland counted the number of stem units in each shoot The long and the short shoots could differ be cause of the number of stem units the average size of stem units or both The dataset ape pledat contains an abstract of Blandls data for both long and short shoots For now we will consider only the long shoots leaving the short shoots for homework 6 Example apple shoots data Weisberg 1985 Many types of trees produce two types of more phologically different shoots Some branches re main vegetative year after year and contribute considerably to the size of the tree Called long shoots or leaders they may grow as much as 15 or 20 cm over a single growing season On the other hand other shoots will seldom exr ceed 1 cm in total length Called short dwarf or spur shoots these usually produce owers from which fruit may arise To complicate the is sue further long shoots occasionally change to short in a new growing season and vice versa The mechanism that the tree uses to control the long and short shoots is not well understood Bland 1978 has done a descriptive study of B Our goal is to find an equation that can ader quately describe the relationship between day days from dormancy y number of stem units For each sampled day the dataset reports day number of shoots sampled ybar average number of stem units on that day sd withiniday standard deviation long 1 if long shoots 0 if short shoots day n ybar sd long 0 5 10 20 83 1 3 5 10 40 54 1 7 5 10 60 54 1 13 6 12 50 83 1 18 5 12 00 1 41 1 24 4 15 00 82 1 25 6 15 17 76 1 32 5 17 00 72 1 38 7 18 71 74 1 42 9 19 22 84 1 44 10 20 00 126 1 49 19 20 32 100 1 52 14 2207 120 1 55 11 2264 176 1 58 9 2278 84 1 61 14 2393 116 1 69 10 2550 98 1 73 12 2508 1 94 1 76 9 26 67 1 23 1 88 7 28 00 1 01 1 100 10 3167 142 1 106 732142281 11 Plot of ybarday Symbol used is 11 ybarl 40 30 20 10 7 7777777777 77 7777777777 77 7777777777 77 0 50 100 150 NOTE 1 obs hidden to applesas options linesize 75 data apple infile groupftppubkcowlesdatasetsappledat input day n ybar sd long run proc plot plot ybar day vpos 20 hpos 40 where long 1 run proc reg lp model ybar day plot residual pred weight n where long 1 run w The REG Procedure Model MODELl Dependent Varrable ybar Wergnt n Analysrs 01 Varrance o Mean Source DF Squares Square F Value Pr gt F Model 1 616427627 616427627 165724 lt0001 Error 20 2 60 Corrected Total 21 623866835 Root MSE 192863 R Square 09881 Dependent Mean 2142212 Adj R Sq 09875 Coerr v r 900297 Parameter Estrmates Parameter Standard Varlable DF Estrmate Erro t Value Pr gt t Intercept 1 997375 031427 3174 lt0001 day 1 021733 000534 4071 lt0001 Ignpxsea Fred 1c ed Value of ybar RESIDUAL Dependent Vanable ybar The REG Procedure 225152 Multiple Regression Checking Assumptions and Diagnostics Lecture 9 Oct 2 200 2 Kate Cowles 374 SH kcowles statuiowaedu Scatterplot Matrix preliminary screen for o collinearity o outliers 0 not very informative about nonlinearity in multiple regression setting Assumptions 1 Existence 2 The relationship between the response varir able and each of the predictor variables is linear after controlling for the effects of the other predictor variables EY X17 7XK 0 1X1 KXK 3 The variance gml XX is constant for any set of values of X17 XK homoscedasticity 4 The values of Y are independent of each other 5 The errors are distributed normally 4 data cnme mfile gtgouprcppubkcouiesaacasecscymedav c rate agei south educ polex60 poiexsg labor males pop nu unempi unemp2 with pov swlth south wlth run Can obtain scatterplot matrix in SAS lnr sight using Analyze 7 gt Multivariate 7 gt Output p c m m nu am am mm m Conu mn mums Putsm Certainm cmmm m gt m as n my 4 a my m A 00000 70 one 0 mm o m A7902 0 o 0 491 0 m o om o m mm a mu A 00000 a 3924 a a 0 mm 0 491 0 o 0 cool 0 cool 0 cool mic 0 mm 70 3024 A 00000 0 moo m o om 0 cool 0 o 0 cool 0 cool mu 0 m 70 c7004 0 moo A 00000 um 000m 000m 000m 00 000 my mm 0 mm a m moo A 00000 0 m 0 cool 0 cool 0 cool 0 0 Correlation Matrix a preliminary screen for collinearity 7 but cannot diagnose multicollinear re lationships involving more than two Varir ables a large correlations between two predictor riabl s indicate collinearity a large correlations between response V81 able an any predictors are good news a only continuous Variables should be in cluded After tting model Check for nonlinearity and heteroscedosr ticity using 0 added Variable plots 0 plots of residuals Vs predicted Values Check for nonrnormolity using 0 plots or summary statistics of residuals or studentized residua s Patina 25mm m e m Etc 2 m m m o m 0 us A W 0 ma 0 mo 0 m o 9 0 2m S 3 u um um 43912 em gtm to Added variable or partial lavaraga plots he ubtamed 11 SAS Inslght from the On ut submenuuf the Flt menu or by rst attng the desked e e1 end then accessmg the Graphs menu Chednngtet nennemehty Plots of reslduals vs predmted values 53 39 I R R B 39 I 39 39 T E 39 53 we ISEI 39 PRATE Checkmg tel nennelnlellty Nelnlel QQ plat of leslduels mHgtxI 1 vellehle en the ethel pledletels Detection of Multicullinearity a er Fitti Model 1 ls 33 lngh Wlth law Lsmres7 2 Ale slnlple mlleletlen caef ments h1gh7 3 telelenseW end Wanance ln etlen tentel mm ls Mme sham lsele nwgtm 4 Rggess the suspected elllpllt predlctur Multiple correlation cua ciant oThe nlllltlple mlleletlen mef ment ls e nleesule of the evelell llneel assuclatlun of one dependent vellehle Y Wlth sevelel el mdependentvaxlables X1X2 0 1t ls the slnlple mlleletlen mef clent he tween the obsemedvelues lg end the p757 dleted values lg lenl enlllltlple rages slen nledel o wef clent of determma un R2 ln multl le regesslun ls the squaxe of the nlllltlple mlleletlen mef ment Xk 17 IE Tolerance and the variance in ation Remedies for Multicollinearity factor 1 do nothing 0 tolerance of predictor Xk is the proporr tion of Xk7s variance that is not shared With the other X variables 2 drop at least 1 of the multicollinear varir tolerance 1 7 Bi ables Where Rg is the coef cient of determina tion for a mOdel in WhiCh Xk is regressed 3 transform the multicollinear variables on all the other predictor variables oWith perfect multicollinearity7 tolerance i 1 1 1 1 4 increase the sample size 7 0 and regression is impossible o variance in action factor VIP is m estimated variance of k oc VIFk rule of thumb worry if VIP for a pre dictor is gt 10 In uential Observations Output Statistics Cook7s distance Cook s D 0 measures the in uence of each observar 1 1 1 1 1 1 0001 tion on all the model coef cients as a 2 W 1 0032 Whole 3 1 11 1 02095 4 4 1 11 1 0084 V 5 1 1 1 0001 o unusually in uential case DZ gt n 6 W M41 7 1 1 1 0003 0 based On 8 1 11 02176 9 1 1 1 0000 7 how far a values are from mean of as 10 1 1 1 0005 11 1 111 1 0081 7 how far is from the regression line 12 1 1 1 0001 13 1 M 1 02047 14 1 M 1 0018 15 1 1 1 0003 16 1 1 1 0009 17 1 1 1 0001 18 1 1 1 0012 19 1 1 1 0004 20 1 1 1 0005 21 1 1 1 0011 22 1 1 1 0000 23 1 1 1 0012 24 1 1 1 0003 25 1 1 1 0017 26 1 1111 1 0108 27 1 1 1 0021 OOOOOOOOOOOOOOOOOOOO 228152 Multiple Regression7 continued Lecture 11 Oct 97 2002 Kate Cowles 374 SH kcowles stat uiowaedu mput rate unemp2 with with soutn with m medatgt age14 soutn educ polex60 polex59 labor males pop nu unempi 0v proc reg data c rme model rate south with with run v Dependent varlable RATE Anain is of Varlance Sum of Mean Source DF Squares Square F Value ProbgtP Model 2101448072 700482691 6302 00012 Error 43 4779479588 111150688 c total 46 6880927660 Root MSE 3333927 Rrsquare 03054 Dep Mean 9050851 Adj Rrsq 02569 cv 368355 Parameter Estmates Parameter Standard r for H0 Varlable DF Estmate Error Parameter0 Prob gt ltl IMTERCEP 1 r119107478 5311334607 r2243 00301 1 142595446 6792520023 2099 00417 ma 1 0 372730 0 09273946 4 019 00002 swua 1 r0 231622 013218119 r1 752 00868 Z If we assume that the effect of community wealth on crime rates could differ in southern states vs other parts of the country7 then we need slope dummy variables to represent inter actions Slope dummy variables or interaction terms are obtained by multiplying one predictor variable times another south szth southern 1 Wlth other 0 0 Regression equation southern mth 41911 14260 037 023 X wlthz39 5239 other ratezv r11911 037 X wlthz39 5 6 Do south and szth together substantially improve our ability to predict crime rate7 coma pared to a model using only Wlth 7 Variables with more than 2 categories To code C categories7 use 0 7 1 dummy varir ables Full model rate g l south g wlth g swlth south northeast midwest south 1 0 0 Reduced model northeast 0 1 0 rate wlth midwest 0 0 1 west 0 0 0 H0 1 30 Why not 1 dummy variable for each HA 3 either 51 or 53 or bOth 7g 0 category To avoid perfect multicollinearity With the intercept proc reg data crlue proc reg data crme uodel rate wealth uodel rate south wealth sulth run Dependent Varlable RArE Dependent Varlable RArE Analysls of Variance Analysls of Variance um of ean Source DE Squares Square P Value ProbgtE Sum of Mean Source DF Squares Square P Value ProbgtE Model 1 1340152160 1340152160 10884 00019 Error 45 5540775500 123128344 Model 3 2101448072 700482691 6302 00012 c total 46 6880927660 Error 43 4779479588 1111 50688 c total 46 6880927660 Root MSE 3508965 Rrsquare 01948 Dep Mean 9050851 Adj Rrsq 01769 Root MSE 3333927 Rrsquare 03054 cV Dep Mean 9050851 Adj Rrsq 02569 cV 3683551 Paraueter Estmates Paraueter Estmates Paraueter Standard r for H0 Varlable DE Estmate Error Paraueter0 Prob gt W Parameter Standard E for H0 Varlable DE Estmate rro Paraueter0 Prob gt ltl erERcEP 1 r2428261 2863138637 r0085 09328 wlra 1 0176893 005361836 3299 00019 erERcm 1 119107478 5311334607 2243 00301 south 1 142595446 6792520023 2099 00417 WL EH 1 0372730 009273946 4019 00002 SWL EH 1 0231622 013218119 1752 00868 H 2 how many fewer predictor variables in reduced model than in full model K 4 number of s in full model RSSTeduced model 7 RSSqu model H FHK n RSSfull model n 7 K F2 55407754779479 2 47 4 4779479 43 380648 111151 3 42 More on Interactions If there is strong evidence of interaction7 then it no longer makes sense to test other individ ual partial effects7 since welve already concluded that the effect of each variable exists and differs according to the level of other variables 0 Types of interactions 7 between a binary or categorical variable and a continuous variable common 7 between 2 binary or categorical variables Example Does the effect of gender on length of hospital stay differ by marital status 10 Compare to F423 distribution in Table B in Weisr berg 05 cutoff for F420 323 for F620 315 So 05 cutoff for F423 iis between 315 and 323 342 is gt either one So the two variables south and wlth taken together are significant at the 05 level 7 between 2 continuous variables less common Example Does the effect of weight on sysr tolic blood pressure differ with age P er OMH 5 225 152 Regression and Design Multiple Regression continued Lecture 10 Oct 77 2002 Kate Cowles 374 SH7 3350727 kcowles statuiowaedu Consequences of Severe Multicollinearity estimates of coef cients will remain unbiased variances of the coef cient estimate will increase trscores will decrease estimates are very sensitive to change in speci cation overall t of the equation will be largely unaffected estimation of nonrmulticollinear variables will be unr affected severity of multicollinearity worsens its consequences Multicollinearity o Collinearity r linear relationship between 2 predictor variables perfect collinearity r correlation between 2 pre dictor variables is 1 or r1 0 Multicollinearity r linear relationship among 2 or more predictor variables perfect multicollinearity r one predictor variable is perfectly predicted by 2 or more other predicr tor variables if we regressed X k on X l and X m we would get an R of 1 0 Perfect collinearity or multicollinearity rarely occurs7 but imperfect is common 4 Example dataset How 1960 crime rate is predicted by various variables in 47 US states rate if of offenses b r o reported to police per million population age14 the num e mal ion f es of age 14r24 per 1000 populat rnment labor labor force participation rate per 1000 Civilian urban males age 14r24 on unemp1 Unemployment rate of urban males per 1000 of age 14r24 n males per 1000 of age ssrsg ultn Median value of transferable goods and assets or family income in tens of p the r of families per 1000 earning below 12 the median income proc reg data crlme model rate polex60 run The REG Procedure Model MDDEL1 Dependent Varlable rate Analysls of Varlance Sum of Mean Source DE Squares Square P Value Pr gt E Model 1 32533 32533 40 36 lt 0001 Error 45 36276 306 13907 Corrected total 46 63309 Root MSE 2339259 Rquuare 04723 Dependent Mean 9050351 Adj 3qu 04611 Coeff Var 3137003 Parameter Est 1mates Parameter Standard Varlable DE Estlmate Error t Value Pr gt M lntercept 1 1444640 12 66926 114 0 2602 polex60 1 039435 014036 6 35 lt 0001 7 proc reg data crlme model rate polex60 polex59 run Sourc e DE Mo del 2 Error 44 Corrected total 46 Root MSE Dependent Coeff Var Varlable DE lntercept 1 po lex60 1 po lex59 1 Dependent Varlable rate Analys 1s of Varlance Sum of Mean Squares Square F Value Pr gt F 33995 16997 2143 lt0001 34315 79124070 09 2312900 Rquuare 04940 Mean 9050351 Adj 3qu 04710 3107335 Parameter Est mates Parameter Standard Estlmate Error t Value Pr gt M 1532646 1259264 126 02155 256153 123417 203 00433 173233 131175 r136 01310 proc reg data crume model rate polex59 run D ependent Var lable rat e Analys 1s of Varlance of Mean Source DE Squares Square P Value Pr gt E Model 1 30536 30536 3601 lt0001 rror 45 33223 34940045 Corrected total 46 6330 SE 2914443 Rquuare 04445 Dependent Mean 9050351 Adj 3qu 04322 oeff Var 3220031 Parameter Estumates Parameter Standard Varlable DE Estumate Error t Value Pr gt M Intercept 1 1651642 1304270 127 02119 polex59 1 092220 015363 600 lt0001 D proc corr data crlme var age14 educ ultn pov polex59 polex60 run Pearson Correlatlon Coe lclentsgt M 47 Prob gt r under HO Rhor age14 educ ultn pov polex59 polex60 age14 100000 053024 067006 063921 r051317 r050574 00001 lt0001 lt0001 00002 00003 educ r053024 100000 073600 r076366 049941 043295 00001 lt0001 lt0001 00004 00006 ultn r067006 073600 100000 r033400 079426 073723 lt0001 lt0001 lt0001 lt0001 lt0001 pov 063921 076366 033400 100000 r064315 r063050 lt0001 lt0001 lt0001 lt0001 lt0001 polex59 r051317 049941 079426 r064315 100000 099359 00002 00004 lt0001 lt0001 lt0001 polex60 r050574 043295 073723 r063050 099359 100000 00003 00006 lt0001 lt0001 lt0001 Model Speci cation DUMMY VARIABLES Regression models can use binary and nominal variables as predictors o eg gender marital status type of practice health insurance status region of country To incorporate a nominal predictor variable into R we must quantify it by using dummy varir a es Dummy variables assume only a nite number of values often 0 or 1 to identify different catr egories We use the word dummy to signify that the values of the variable have no quanr titative meanin ose we were interested in determining whether 1960 crime rates were different in southern states vs the rest of the country proc prmt data crlme obs 15 var soutn proc reg data crlme model rate soutn run odel MODEM Dependent Variable rate Analys 15 of Varlance um of ean Source DF Squares Square F Value P gt F Model 1 565 27222 565 27222 0 37 05446 Error 45 682 151653343 Corrected total 46 Root MSE 3894269 Rquuare 00082 Dependent Mean 9050851 d3 13qu 00138 Coeff Var 4302656 Parameter Estumates Parameter Standard Varlable DF 5 mate E ror t Value Pr gt M Intercept 1 9300000 699431 1330 lt0001 south 1 r7 31875 11 98765 061 Dbs south HOOOOOHHHOOOHOH Choice of Values for Dummy Variables Because the choice of coding scheme affects the interpretation of the coefficient for the dummy variable its important to specify which coding method is being used BUT the interpretation of the relationship among the variables doesn7t depend upon the coding scheme Coding Schemes Example 1 rate BO 31 south 1 if southern state 0 otherwise south Example 2 rate Bo 31 south2 1 not southern state soumg 0 otherwise 15 Now welll adjust for wlth We have gone back to the original coding of south proc reg data crime modei rate south with Anaiys is of Variance of Mean Source DE Squares Square E Vaiue ProbgtE Modei 2 1760150314 330075157 7562 00015 Err r 44 5120777345 116331303 c rotai 46 63309 27660 Root MSE 3411470 Rrsquare 02553 Dep Mean 9050351 Adj Rrsq 02220 cv 3769226 Parameter Estimates Parameter Standard r for H0 Variabie DE Estimate Error Parameter0 Prob gt ltl IM EERCEP 1 54224332 3396472143 r1392 01710 SOUTH 1 25377332 1362213096 1900 00640 WL EH 1 0253713 006761963 3326 00004 14 data crime Inflle gtckatecrime3datgt input rate age 4 south educ polex60 poiex59 iabor maies w un p1 unemp2 with pov south2 1 r sout swith south 3 with run Dependent Variabie RATE Anaiysis of Variance Sum of Mean Source DF Squares Square E Vaiue ProbgtE Modei 6527222 56527222 0373 05446 Error 45 63244 00437 1516 53343 c rotai 46 6330927660 Root MSE 3394269 Rrsquare 0032 Dep Mean 9050351 Adj Rrsq r00133 cv 4302656 Parameter Estimates Parameter Standard r for H0 Variabie DE Esti t Erro Parameter0 Prob gt ltl iMrERcm 1 35631250 973567355 3301 00001 SDUrH2 1 7313750 1193765034 0611 05446 16 In this model south was an intercept dummy variable We assumed that the intercepts were different for the 2 groups of states but that the slope on wlth was the same For southern states the intercept is 75422 2588 X 1 72834 For other states the intercept is 75422 2588 X 0 75422 17 Methods we Will study this semester Method Response variable Predictor variable Simple linear Continuous Usually continuous regression and only 1 Multiple regression Continuous Usually at least one continuous variab may also be discrete binary nominaland or 0 ANOVA Continuous Nominal or ordinal Logistic regression Binary Continuous binary nominal and or ordiI Cox proportional Censored Continous binary hazards regression nominal and or ordiI 22s152 Applied Linear Regression Chapter 7 Dummy Variable Regression So far7 we7ve only considered quantitative vari ables in our models We can integrate categorical predictors by con structing arti cial variables known as dummy variables or indicator variables We7ll illustrate here with a binary predictor eg MaleFemale 0 Pick one category as the default say Fee male 0 De ne 21 0 if observation i is a Female7 otherwise 21 1 denotes Male 0 Example Apple yield and tree size A botany students wants to model the rela tionship between apple tree size diameter and yield bushels She also has informa tion on the pruning method used on the trees Pyramid or Flattop The dependent variable Bushels is quantitae tive7 as is the independent variable Diame ter but Pruning is a categorical or qualitae tive variable Consider the model we previously described Yi o m QZi 5i where 21 0 if observation i was a Flattop pruning7 otherwise 21 1 denotes Pyramid pruning o A simple model with 1 covariate and 1 dummy variable Yi o m QZi 5i where Y and al are continuous variables7 and 6 lg N07 02 7 note g lm for females g 32 lm for males 7 Info from both the males and females were used to estimate m 039 7 g is interpreted as the female Yeintercept 7 2 is interpreted as the difference in ex pected value of Y for identical X units same X in the two groups 2 The data gt botany dataread cSV botany csv gt attachltbotany data gt headltbotanydata Diameter Bushels Pruning 1 1056 Pyramid 2 14 614 Pyramid 3 16 630 Pyramid 4 13 638 Pyramid 5 18 865 Pyramid 6 17 702 Pyramid gt uniqueltPruninggt 11 Pyramid Flattop Levels Flattop Pyramid gt is factorltPrun1nggt 11 TRUE gt isnumeric Pruning 11 FALSE gt plot Diameter ushels pch16 Eushels There does appear to be a linear relationship between tree diameter and yield in bushels Does Pruning Method also make a signi cant impact on yield First7 we7ll create the dummy variable in Allocate space for the new Vector gt pruningdummyrep0nrow botany datagt gt pruningdu 1 0 Bl am eter my 0000000000000000000 in If Pruning equals Pyramid code it as 1 gt pruning dummy PruningH Pyramid 1 gt pruning dumm Y 1111111111111000000000 gt dataframe Pruningpruningdummy Pruning pruning dummy Pyramid Pyramid Pyramid Pyramid Pyramid Pyramid Pyramid Pyramid Pyramid Pyramid Flattop woodman me gt gt gt MHO OHHHHHHHHHH The simple linear regression gt lmoutlmBushels Diameter gt summarylmout Coefficients Estimate Std Error t Value Prgtltl Intercept 218886 075898 2884 000988 M Di eter 062361 005185 12028 486e710 WW Signif codes 0 0001 H 001 005 01 1 Residual standard error 1133 on 18 degrees of freedom Multiple RSquared 08894Adjusted Risquared 08832 Fistatistic 1447 on 1 and 18 DF pivalue 486e10 gt ablineltlmout 13 Flattop 18 Flattop 19 Flattop 20 Flattop oooooooo Fit a model with both Diameter and Pruning as a dummy variable gt lmout2lmBushels quot Diameter pruningdummy gt summarylmout 2 Coefficients Estimate Std Error t Value Prgtltl Intercept 190616 00217 Diameter 0633 2 005038 12574 491eilo WW pruningdummy 076259 049468 1542 01416 Signif codes 0 0001 H 001 005 01 1 Residual standard error 1092 on 17 degrees of freedom Multiple RSquared 09029Adjusted Risquared 08915 Fistatistic 7906 on 2 and 17 DF pivalue 2458e09 Pruning is not a signi cant regressor given Die ameter is already accounted for But Diameter is a signi cant regressor given Pruning is already accounted for Create plot distinguishing the Pruning Method pIotDiameterBusheIstype D in Don t plot the points pointsDiameter111 BusheIsI111 pch1col1 pointsDiameter1220 Bushels 1220 pch9 coI4 Iegend810c Pyramid FIattop coIc14 pchc19 aims 5 o o o o mm Dametzv This shows the tted lines but the di erence between them g was not found to be signi cantly di erent than 0 given Diameter is in the model so there7s really only one line H0 g 0 was not rejected Plot tted models for the two Pruning Methods in The estimated parameters gt Imout2 coefficients Intercept Diameter pruningdummy 19061600 06335207 O7625916 Wt 810 e is the same for both it s the second parameter gt sIopeImout2coefficients2l gt sIo e Diameter 06335207 in Intercept is different gt interceptFIattopImout 2coefficients1l gt interceptFIattop Intercept gt interceptPyramidImout2coefficientsll Imout2coefficients3l gt interceptPyramid Intercept 2668752 in Add the lines on the previous plot gt ainneinterceptFIattopsIopecoI4 gt ainneinterceptPyramidsIopecoI1 This model that we t with 1 covariate and 1 dummy variable Yi n m 22i 5i where Y and x are continuous variables 2 0 if observation 239 is a Flattop method 2 1 if observation 239 is a Pyramid method and eud N002 assumed the same Diameter e ect slope for both groups but allowed for di erent intercepts Inclusion of Interaction Returning to the earlier model with a male female binary variable A slightly more complicated model Yi o m QZi aMZi 5i 0 note g lm for females o 2 1 gx for males 0 The 12 interaction term allows for a di erent slope for each group 0 x2 may be called a slope dummy variable 0 This model allows for two separate regression lines for each group Let7s t the interaction model to the tree data gt lmout3lmBushels quot Diameter pruningdummy Diameterpruningdummy gt summarylmout 3 Coefficients Estimate Std Error t Value Prgtltl Intercept 7200130 098989 72022 00603 064078 006988 9170 905e708 M pruningdummy 7053930 152761 70353 07287 Diameterpruningdummy 7001618 010434 70155 08787 Signif codes 0 0001 H 001 005 01 1 Residual standard error 1124 on 16 degrees of freedom Multiple R7Squared O9031Adjusted R7squared 08849 F7statistic 4969 on 3 and 16 DF p7Value 2487e708 gt lmout3coefficients Diameter Intercept Diameter pruningdummy Liningdumm 7200129614 064077682 7063929972 7001617838 o i 2 g 0 What is the bene t to bringing the data for the two groups together 7 If there is constant variance well have a better estimate for 02 by pooling the data may not do it if you didn7t think they had a common 02 7 We can run tests comparing the two groups 7 More degrees of freedom 0 To test if the lines are parallel test if the in7 teraction term is signi cant H0 g 0 o If there7s no signi cant interaction we can test if there7s two separate lines for the two groups or if one is suf cient to describe the data H0 g 0 o If there is signi cant interaction the e ect of covariate x on the response is di erent for di erent values of 2 in Model allows for different intercepts gt interceptFlattoplmout 2coefficients1l gt interceptFlattop Intercept gt interceptPyramidlmout2coefficients1l lmout2coefficients31 gt interceptPyramid Intercept 72668752 in Model allows for different slopes gt slopeFlattoplmout3coefficients2 gt slopeFlattop Diameter 06407768 gt slopePyramidlmout3coefficients2 lmout3coefficients4l gt slopePyramid Diameter 06245984 The separate tted line for each group plotDiameterBushelstype n W Don t plot the points pointSltDiameter1111Bushels1111pch1col1 pointSltDiameter112201Bushels12201pch9col4 1egend810c Pyram1d Flattop colc14 pchc19 ablinehnterceptFlattopslopeFlattopcol4 gt gt gt gt gt gt ablinehntercept PyramidslopePyram1dcol1 You can 7t see much di erence but the tted line for Flattop has a slightly steeper slope 17 o If there is signi cant interaction we do not consider the tests for main e ects7 If there is NOT signi cant interaction we can con sider the tests for main e ects7 We look at the interaction signi cance rst and then proceed from there 0 When interpreting models with dummy vari ables its important to keep in mind the cod ing scheme that was used which variable is associated with 1 and which with 0 for ex ample 0 Numerous covariates can be used in conjunce tion with dummy variables though we only showed 1 covariate in this example 0 A model with no interaction terms is said to be additive o If an interaction term between two variables is included in the model then the main eff fects7 for those variables should also be in cluded So if x2 is in the model so should you include a and 2 Categorical Variables with Many Cat egories Consider the categorical variable of Rank for a professor which has the categories of Assistant Associate and Full We can represent this categorical variable with 2 dummy variables Category D1 D2 Assistant 1 0 Associate 0 1 Full 0 0 Suppose we wish to predict Salary based on Bank and the quantitative variable Grants which tells how much grant money a professor has brought in The additive model Yr n lm 2D1i SDm 5i where a is the amount of grant money in dole lars for observation 239 The di ering models for each Rank Assistant K n z 611 2 62 Associate YZ g g ler 62 Full Y u le 62 There were no interactions terms included so Grants was assumed to have the same effect on Salary for all Rank categories This model gives three parallel lines The 7Full7 professor model serves as the baseline model but this is arbitrarily chosen 21 In general it takes m 7 1 dummy variables to represent a categorical variable with m levels When interactions are included it makes in terpretation a bit more complex but interace tions are often needed to appropriately model the data Sometimes there is a natural baseline choice This choice can mallte desired comparisons ease ier For example H0 g 0 tests if the model for Associate is di cerent than the model for Full And H0 g 0 tests if the model for Assisi tant is di cerent than the model for Full To compare Assistant to Associate use H0 1 2 g 0 To test if Bank has any e ect on Salary after accounting for Grants use H0 g g 0 This can be done with Partial Fetest 225 152 Regression and Design Describing continuous variables Review of Statistical Inference Lecture 2 August 28 2002 Kate Cowles 374 SH 33570727 kcowles statuiowaedu Describing continuous data 0 Measures of center 7 mean 7 median 0 Measures of spread 7 variance or standard deviation 7 interquartile range 0 overall summary of distribution 57number summary 2 Data checking and screening 0 important to do prior to any multivariate analyses 0 purpose to identify incorrect invalid or otherwise suspect data 0 types of checks for binary nominal and ordinal data 7 frequency and proportion of invalid catr egories 7frequency and proportion of missing classi cations 7adequate representation of categories of interest 0 types of checks for continuous data 7 range screens 7 consistency screens accuracy Of measurement 4 Describing continuous data contin ued Symmetry or skewness 0 positive skew skew to the right 7 mean gt median 0 symmetry 0 negative skew skew to the left 7 mean lt median 5 Describing continuous data contin ued Normality 0 check only if data distribution is approx imately symmetric o heavierethanenormal tails standard dev gt IQR 135 o lighterethanenormal tails standard dev lt IQR 135 7 Example Indoor BaP measurements Investigators suspected that Benzoapyrene7 or BaP7 from a pipe foundry in thillipsburg7 NJ7 might be contaminating household air This dataset presents data from 14 different days on samples of indoor air from a house near the foundry and samples of outdoor air collected at the same times The measures are concentrations of BaPecontaining partie cles no larger than 10 micrograms The two variables are 0 indoor air BaP 0 outdoor air BaP Reference Lioy7 PL7 Walman7 JM7 Greenberg7 A7 Harkov7 R and Pietarninen7 C 1988 The total human environmental exposure study THEES to Benzoapyrene Comparison of the inhalation and food pathways Archives of Environmental Health7 43 3047312 6 Describing continuous data contin ued Outliers o outliers in data of any distribution 7 smaller than rst quartile e 15 IQR7 or 7 larger than third quartile 15 IQR o outliers in normallyedistributed data 7 minor outlier gt 196 standard deviae tions away from mean 7 major outlier gt 258 standard deviae tions away from mean 3 SAS code data 83 mnie groupftppubkcow1esdatasetsBaPdat input indoor outdoor run I proc print data 83 run proc univariate data Ba var indoor run I SAS output OBS INDOOR OUTDOOR 1 10 24 2 10 35 3 25 41 4 40 65 5 40 27 6 45 56 7 45 67 8 55 50 9 55 25 10 70 20 11 75 78 12 90 25 13 220 38 14 285 40 mum 2121121125 12 2222 5111 Wgts 11 12 2222 21m 7s 2m 5111 1255 2222 521 Dev 7 we 1132151121 525 m 51121111255 2 2212 121222515 2 5115 Ilss 152375 222 55 21252 ms 103 9M3 521 212511 21 1222 Unantnes 100 Max 225 2222 99 OX 25 2222 m as 75 2222 97 51 225 2222 50 2121 52 2222 95 OX 25 2222 251 D1 22 2222 90 OZ 2 2222 0 12 2222 0 12 2222 x 2 2 2222 5 A 12 2222 term 35 2222 2 51 12 2222 2121 12 22 1 A 12 2222 11 Buxplut of mduar Bap data 25o n zoo 1 15a 2 we so Indnnr m 111522311111 of 11111011 Bap data so I20 use 240 300 ndoor 12 Normal quant e plot I n d o o r I l O I Ninduor 13 Symmetrizing transformations 0 commonly applied to dependent variable rarely to independent variables 0 Power transformations often can reduce skew pull outliers in 15 More on log transformations 0 2 commonly used bases for logarithms base 10 gtrlt 100 102 so 0910100 2 gt1lt preferred by medical people base 6 gtrlt 7389 62 so 0947389 2 2 gt1lt also called 77natural log gtllt preferred by mathematicians and quantita tive scientists 0 Both bases require that raw data be strictly gt 0 Add 1 to all data points if some are 0 14 Tukey s ladder of powers 0 Y3 reduces negative skew and pulls in low outliers 0 Y2 reduces negative skew and pulls in low outliers 0 Y1 the raw data 0 Y5 sqrtY reduces positive skew and pulls in high outliers can be applied only if all data values are 2 0 0 log Y reduces positive skew and pulls in high outliers o Y 5 18qrtY reduces positive skew and pulls in high outliers o Y1 1Y reduces positive skew and pulls in high outliers 16 More plots and descriptive statistics from SAS The SAS System indoore 5b 160 iso 260 250 indoor 0 005 39ltTHmDDU 180 240 300 gt indoor 4L Moments N 140000 Sum Wgts 140000 Mean 760714 Sum 10650000 Std Dev 791099 Variance 62583791 Skewness 20340 Kurtosis 36115 USS 162375000 CSS 813589286 CV 1039943 Std Mean 211430 All Quantiles 100 Max 2850000 990 2850000 75 Q3 750000 975 2850000 50 Med 500000 950 2850000 25 Q1 400000 900 2200000 0 Min 100000 100 100000 Range 2750000 50 100000 Q3 Ql 350000 25 100000 Mode 100000 10 100000 Natural log of indoor BaP The SAS System 5 gt Lindoor o a D e n o 4 Equot l t Y o z 0 o 8 3 6 4 4 5 2 6 o Lindoor 4L Moments N 140000 Sum Wgts 140000 Mean 39244 Sum 549417 Std Dev 09465 Variance 08959 Skewness 00311 Kurtosis 04001 USS 2272600 CSS 116463 CV 241184 Std Mean 02530 kj Quantiles 100 Max 56525 990 56525 75 Q3 43175 975 56525 50 Med 39070 950 56525 25 Q1 36889 900 53936 0 Min 23026 100 23026 Range 33499 50 23026 Q3 Q1 06286 25 23026 Mode 23026 10 23026 Populations and samples a The population is the entire set of items 17 18 Square root of indoor BAP The SAS System S indoors 4 s 1b is gt Sindoor o 15 D e 301 i t Y o 05 9 12 15 18 Sindoor All Moments N 140000 Sum Wgts 140000 Mean 78893 Sum 1104505 Std Dev 38593 Variance 148939 Skewness 12803 Kurtosis 16508 USS 10650000 CSS 1936205 CV 489175 Std Mean 10314 4 Quantiles 100 Max 168819 990 168819 75 Q3 86603 975 168819 50 Med 70622 950 168819 25 Q1 63246 900 148324 0 Min 31623 100 31623 Range 137197 50 31623 Q3 Q1 23357 25 31623 Mode 31623 10 31623 19 20 Example We wish to draw conclusions about the average amount of money spent by Ul students on textbooks this semester about which we wish to draw conclusions 0 We usually cannot measure or evaluate the variables of interest for all members of the population 0 Therefore we draw a sample from the population and use the data from it to draw inferences about the population o It is not practical to interview all of them to nd out what each one spent 0 We will draw a simple random sample of 50 students interview them and use statistical inference to draw conclusions about the population 0 We want to estimate the population mean a and to determine how much uncer tainty there is in our estimate 0 We also want to do a hypothesis test re garding whether the population mean a is 100 21 Point estimates and con dence in tervals o A statistic is a numerical quantity that can be computed from sample data It does not involve unknown population pa rameters 0 We use sample statistics to estimate pop ulation parameters 0 We will use the sample mean i of expenr ditures on textbooks by the 50 students we interviewed to estimate the unknowr able population mean a o i is our point estimate of the unknown M 23 Preparing to compute a con dence interval for n 0 To compute an interval in which we have some speci ed level of con dence that the unknowable n lies we must consider the sampling distribution of i o The sampling distribution of a statisr tic is the distribution of values taken by the statistic in all possible samples of the same size from the same population 0 Then the mean of the sampling distribur tion of i is a oThe standard deviation is of the same pling distribution of 22 0 We might also wish to estimate the varir ability among expenditures in the pop ulation The relevant parameter here is the population standard deviation 0 0 Again we will use the corresponding same ple statistic s the sample standard der viation as the point estimate of the pop ulation parameter If we knew 0 Imagine that we know that the true popula tion standard deviation of textbook expenr ditures among Ul students is 0 15 Then the standard deviation of i is 15 i 21 Em lmagine also that we could choose many samples of size 50 and nd the mean text book expenditure from each one If we collect all these different is and dis play their distribution we get the normal distribution with 0 mean equal to the unknown n 0 standard deviation 21 Statistical con dence 0 Recall that for a normal distribution 95 of all values are within 2 standard deviar tions of the population mean oThus in 95 of all samples the mean score i for the sample will be within two standard deviations of the population mean score M 7 So the i will be within 42 points of M in 95 of samples of 50 students7 textbook costs 0 But if i is within 42 points of the uni known M then M also has to be within 42 points of the observed i i This will happen in 95 of all samples 0 That is in 95 of all possible samples of size 50 from this population 27 Constructing con dence intervals for M When 0 is unknown To construct a level C con dence interval for M 0 Draw a simple random sample of size n from the population The population is assumed to be normal 0 Compute the sample mean i and the same ple standard deviation 3 0 Then the level C con dence interval for M is s 2 i 5 7 where t cuts off the upper Lag area uni der the density curve for a t distribution with n 7 1 degrees of freedom 0 Use Table A41 at the back of your text book to nd t the unknown M lies between i 7 42 and i 42 Standard errors But we don7t know 0 any more than we know M 0 We will estimate 0 by the sample stanr dard deviation 3 0 Then we estimate the standard deviation of i by When we use the data to estimate the stanr dard deviation of a statistic the result is called the standard error of the statistic The standard error of the sample mean i is S W 23 Example Suppose in our sample of n 50 students i 104 s 16 and we want a 95 con dence interval for the population mean expenditure M The degrees of freedom is 50 r 1 49 Find the appropriate t in Table A41 It will be approximately 201 Our con dence interval is s 2 i t 104 i 201 104 i 455 H Agov 9945 10855 225 152 Applied Linear Regression Introduction Lecture 1 August 267 2002 Kate Cowles 374 SH7 33570727 kcowles statuiowaedu The Challenger How understanding of regression methods might have prevented a tragedy References Dalal7 SR Fowlkes7 EB7 Hoadley7 B 1989 Risk Anal7 ysis of the Space Shuttle PreChallenger Prediction of Failure77 Journal of the American Statistical Associa7 tion 84 9457957 Tufte Edward R 1997 The Decision to Launch the Space Shuttle Challenger in Visual and Statistical Think ing Displays of Evidence for Making Decisions7 Graph7 ics Press 2 What is regression analysis 0 a class of statistical methods for 7 studying relationships between variables that can be measured on subjects 7 using known values of certain variables to predict the values of other variables for the same subjects 4 On 12886 space shuttle Challenger exploded during launc o 7 astronauts killed 0 reason gas leak through a joint that should have been sealed by two rubber O7rings 7 O7rings had lost resiliency due to cold temperature 5 On the previous day7 extensive discussions of whether or not it would be safe to launc o predicted temperature for launch time 26290 o no shuttle had ever been launched at temperature lower than 530 o engineers who designed rocket faxed to NASA a rec ommendation not to launch due to risk of Oiring fail ure at low temperatures 0 NASA of cials pointed out weaknesses of engineers7 evidence 0 after lengthy discussion7 managers of rocket mak ing company changed their minds and recommended launch 7 The engineers7 plot of data from previous shuttle launches joint temperature vs number of Oirings having some temperaturerelated problems 6 The engineers7 evidence 0 history of serious but nonicatastropic Oiring damage during previous cooliweather launches 0 physics of resiliency of rubber 0 experimental data 3 What was missing from the engineers argu ment 0 quanti cation of the relationship between joint tem7 perature and Oiring failure 0 prediction of the probability of Oiring failure at 290 with assessment of degree of uncertainty An appropriate statistical method logistic re gression o Dalal et al carried out such an analysis after the fact using data from the 23 shuttle launches prior to the Challenger 0 found strong statistical evidence of a temperature ef fect on Oirings 0 we will analyze these data later in the semester 9 A plot showing data from all 23 previous launches in7 cluding those in which no O7rings were damage 11 For analysis by a computer a set of data collected for a study is often organized as a table with a row for each subject and a column for each variable Pat id age seX diagnosis 101 25 F hepatitis A 102 38 F cirrhosis 103 76 M hepatitis C Each row in such a table corresponding to the data for a single subject is called an observation 10 Subjects observations and variables In statistical studies we generally choose a set of sub jects on whom data is collected We usually are interested in collected a number of differ7 ent kinds of information to describe each subject A variable is a particular characteristic that may take on different values for different subjects For example 0 age 0 gender 0 diagnosis are three variables that might be included in a study of length of hospital stays of hospital patients 12 Dependent and independent variables 0 Dependent variable 7 also called response or outcome variable 7 what we want to explain or predict 0 Independent variables 7 also called predictor variables or covariates Different regression methods are required for different types of response variables Review of types of data 0 Qualitative 7 Nominal categorical values fall into unordered categories numbers may be used to represent categories but they are just labels example variable called occupational area coded as r l education r 2 business r 3 service r 4 industry r etc etc special case binary data which can take on only 2 possible values 7 Ordinal data representing ordered categories example variable called prognosis taking on possible values poor fair good 15 Describing binary nominal and ordinal data 0 tables of frequencies and percents 0 bar graphs 14 0 Quantitative 7 Discrete both order and magnitude are important numbers represent measurable quantities possible values are restricted often to be inte7 gers example count of number of homicides in John7 son County in 1998 7 Continuous numbers represent measurable quantities and are not restricted to a set of speci ed values examples temperature blood pressure annual pro t Special case censored data r continuous data in which values for some sub7 jects are not observable r some values are known only to be larger or smaller than some observed value r example time7tofailure data 16 Example data from a 1992 poll of Montana residents 0 AGE 1 under 35 2 35754 3 55 and over 0 SEX 0 male 1 female 0 INC yearly income 1 under 20K 2 2035K 3 over 35K o POL l Democrat 2 Independent 3 Republican 0 AREA 1 Western 2 Northeastern 3 Southeastern Montana 0 FIN Financial status 1 worse 2 same 3 better than a year ago 0 STAT State economic outlook 1 better 2 not better than a year ago SAS code 0 at yalue sexth 0 W 1 gtFgt w 2 gtMFgt 3 gtSFgt gtworsegt 2 gtSamegt 3 gtBettergt data montana mflle gtckatemontanadatgt Input age sex mc pol area fm stat label mc gtrncomegt pol e gt olrtrcal arrrlratrongt fm gtpersonal flnzmclal atts fma format sex sexfmt area arearmt fln fmfmt run proc contents run SAS code proc freq data montana tables sex fin run Output The FREQ Procedure Cumulat1Ve Cumulat1Ve Sex Frequency Percent Frequency Percent M 107 51720 107 51720 F 102 48780 209 100700 personal rrnancralstatus Cumulatlve Cumulatl nn Frequency Percent Frequency Percen Worse 61 29733 61 2933 Same 76 36 V 54 137 65 V 87 Better 71 34713 208 100 V 00 Frequency MlSSlng 1 13 SAS output the SAS System 13 the CONTENTS Procedure Data Set Mam WDRKMDM EAMA Dbseryatrons 209 Member type DAtA Varrables 7 E lne39 7 Indexes 0 Cr ated 2 Frrdaygt August 13gt 1999 Dbseryatron Len n 56 last Modlfled 2 Frrdaygt August 13gt 1999 Deleted Dbseryatrons o Prot tron Comp d D Data Set type So MD EngmeHost Dependent Infomatlorr Data Set Page Slze 8192 Number of Data Set Pages 2 Frrst Data Page 145 b t Pa Number of Data Set Reparrs o Flle Mame39 Cwmdows EEWSAS temporary F rlestD 28937montana sas7bdat Release Created 7ooooP Host Created wrm95 rAlpnabetrc Lrst of Varrables and Attrrbutes Varrable type Len Pos Format label 1 age Mum 3 o 5 area Mum 3 32 ARFAFMt 6 1n Mum 3 40 FIMFM E personal fmanclalstatus 3 lnc Mum 3 16 mcome 4 pol Mum 3 24 polrtrcal afflllatlon 2 ex Mum 3 3 SFxFMt 7 stat Mum 3 48 state fmanc lal stath 20 goptlons devrce wln proc gchart vbar nn drscrete trtle Montana Poll run 1 Z 22S1152 Survival analysis timetoevent analysis o 0 assumes Surv1val AnalySIS Lecture 21 7 event Whose rst or only occurrence time November 29 1999 can be measured 7 some de ned follow7up time 7 well7de ned starting time Kate Cowles o examples 374 SH starting time event followup kcowles statuiowaedu admission to discharge 3 years hospital discharge from drug drug use 18 mos treatment program mastectomy for death 10 years breast cancer 0 censoring occurs if not all study subjects reach the event during study period 3 4 Required survival analysis data Schematic of a timeto7event study ti 7 time 1th subject was observed 1 7 event occurred at time ti div 0 7 censored ti is last follow7up time and event had not occurred yet may also have additional variables for Which we Wish to assess the relationship to survival time 4 Vetezan s Adnlnlstzatlon Lung Cancez Illa Taken fzom Kalbeelsch and Pzentlce pages 223224 a a a Vazlables a Tzeatment 1standazd 2test a Celltype 15quamous 2snallcell 3adeno AJazge a Sumyal 1n ay 4 Status 1 ead 0censozed a Kamofsky scoze 4 Months fzom Dlaglosls 4 Age in yeazs 4 leoz thezapy 0no 10yes a w optlons llneslze 75 pageslze s0 data yetlung nfe gzoupftppubkcowesdatasetsvetJungdat npu z s status kaznof nont age 2 oz cell1 cell2 ceJJS at cell eq 1 then cel 1 else cell1 0 at cell eq 2 then cell2 else cell2 1 cell ed 3 then ceJJS 1 else ceJJS 0 un pzoc contents zun The CDMTEMTS onceduze Data Set Name wDDKVETLUMe Dbsezvatlons 137 Membez Type DATA Vazlables 11 Englne VD Indexes Dbsezvatlon Le t Cze t d B wednesday Decembez A 2002 ng h Deleted Dbsezvatlons 0 Compzessed MD a e D0 Last Modlfled D0D wednesday Decembez A 2002 Plotectlon Data Set Type Sozted MD Label eeeeeEnglneHost Dependent lnfoznatlonm Data Set Page Slze D152 describing survival characteristics of a population 0 survival function 375 7 proportion of subjects Who have not had the event yet at time t it is counted from each individualls starting time 7 30 is always 10 7 example What proportion of breast can cer patients have no recurrence of cancer for at least one year 0 hazard function Mt r proportion of popula tion Who Will have the event at time t given they have not had it before time t iexample If a breast cancer patient has gone for one year after a mastectomy with out a recurrence of cancer7 What is her probability of having a recurrence at one year 6 Mumbez of Data Set Pages 2 e 1 ax b5 pez Page 52 n a a Page SB Mumbez of Data Set Repalzs 0 Flle Mame usztnpSASuoszSAE0000SAEc mouseyetlungsasdeat Release Czeated D0202M0 Host Created HPeUx lnod M mbez 44450 ce s Pezmlsslon zu 77777 W Dunez Mane UMKNDWM Fllle 5128 bytes 24576 quotWellphabetlc Llst of Vanables and Attnbutese 7 age m D 48 2 cell Mu D D B cell1 Mum D s4 10 cell2 Mum D 72 11 113 M m D D0 3 s u D 15 5 kaznof Mum D 32 s onths Mu D 40 D pzloz m D SS A status Mum D 24 1 um D 0 D Estimating the survival function 375 the Kaplanr Meier estimator 0 also called product limit 0 step function 7 decreases at each time a sub ject has event 0 assumes probability of censoring is not re lated to probability of event 0 example ti e failed at risk censored ItPCOMHO gt kgt kOgt O gt KDCOgtJgtCTKCTK OOHOH 1 CH gtlt l l The LIFETES E Procedure Survlval Funct 10h Estmates 10 Graphical comparison of 2 survival curves proc llfetest data vetlung plots s tune dayskstatm0 strata tr where cell eq 1 run n A BB 00 A B 0 100 200 300 400 500 600 700 300 900 1000 13 Statistical comparison of survival curves the logarank test 0 at each failure time7 compares observed numa ber of events in each group to number that would be expected if hazards were equal in all groups 0 assumes nonainformative censoring 0 SAS and Stata use slightly different formulas rne LIFETES E Procedure Summary of the Number of Censored and Uncensored Values Percent Stratum trt total Failed Censored Censored 1 1 15 13 2 13 33 2 2 20 18 2 10 00 total 35 31 4 1143 the SAS System 0809 wednesdaygt December 4gt 2002 the LIFETES E Procedure resting Homogeneity of Suryiyai Curves for days over Strata Rank Statistics Hazards trt LogRank wiicoxon 1 37754 48 000 2 r3 7754 r48 000 Covariance Matrix for the LogrRank Statistics trt 1 2 1 580858 r5 80858 2 r5 80858 5 80858 Covariance Matrix for the wiicoxon Statistics trt 1 2 1 312635 r312635 2 r312635 312635 rest of Equallty over Strata Pr gt rest Cniquuare DF Cniquuare of 24539 1 01172 wiicoxon 0 7370 1 0 3906 r2LogLR 38531 1 00497 16 Accommodating multiple explanatory variables the Cox proportional hazards model 0 We cant use multiple linear regression if any of the survival times are censored because the coefficient estimates would be biased o Assumes that there is some baseline hazard7 h0t if all predictor variables were 0 somewhat analogous to intercept g in line ear regression may change over time of study 0 Assumes that effect of predictor variables is multiplicative on baseline hazard example if age is a predictor ht 1 age h0te31W o Assumes nonrinformative censoring Interpreting coef cients 0 A 1eunit increase in Xi causes the hazard to be multiplied by 65339 o 651 is the relative risk for a leunit increase in Xi 0 Compute the 95 con dence interval for the relative risk by exponentiating the endpoints of the 95 CI for the coef cient 19 naeanetee Standaed naaaed Vanatne De aetanate eneaaaee n 2 naea Rana tat i c 23994 c 20721 i 9579 o 1317 lees eeni 1 ed 39983 c 2e2ee i 9933 c 157A 0 s71 een2 i o Asses o 2ss27 2 9433 o ossz i 579 eene i o 7 o 302s s 7399 o 0092 2 200 name 1 ed 03232 c 00551 35 112A lt 0001 o 983 antne 1 ed 0000913 0 00913 o 0001 o 9920 i 000 age 1 ed 00355 0 med c me o eee2 o 991 pnae i 0 W23 o 0222 o 0971 o 7554 i 007 13 pm phreg data vetlung nddei days statuslto tn eem een2 tens karnof months age pnor as 09 Wednesday December 4 2002 The mm pteeeduee Mode rnxemataen Data Set mm mums Dependent Vanable da 5 Densorlng Vanable status Densorlng Vauelts c has Handhng Emisan Summary e the unmet 21 Event and senseted Values Fe tent Iota Event senseted censeted 137 12s 9 s 57 Convergence Status Convergence cntenon ltscnwmem satasxaed Mode Fat Statlstlcs Watneut Watn Untenon Unvanates Dovanates 2 LBS L 1011 ms 950 359 AID 1011 ms ass 359 star 1011 ms Testlng Global Mull Hypothesls mamo Test DhIVSquare DF pt gt Dh18q leehhood Ratlo st 4091 s lt 0001 stete 5 9173 s lt 0001 Wald 1 475 3 lt 0001 Analysls ex Maxlmum leehhood Estlmates 1 t d d Treatment is coded S an ar 2 new So the relative risk for the new treatment vs standard therapy7 with other factors held cone stant7 is 02241 079 95 or 21 0447e0 561 035271752 This means that a person on the new treat ment who had survived up to any particular time point would be only 79 as likely to die at that time as a person on standard treatment who had also survived up to that time and had the same values of the other predictors The relative risk for each additional month between diagnosis and treatment is 202044 1045 95 or e0 00747e0 0802 1007471084 Testing for signi cance of a set of predictors Lung cancer example 0 Remove insigni cant predictors one at a time based on individual trtests 0 Check that nal model isnlt signi cantly poorer than full model Partial X2 test 2 7 W i 72 log likelihoodedwed mode 7 log likelihoodfuu model Proportional hazards assumption 0 The relative effects of the predictor variables are the same over time 0 Check graphically With survival curve 0 Example of violation 7 Treatment A is e ective only very early in the course of the illness7 so the relative risk of Treatment A vs B changes over time Time dependent covariates 0 Example smoker nonsmoker 0 Value for the same subject may be different at different time points in the study 0 SAS proc phreg can handle STATA canlt I L I L3 I L I L c x regresslon Log kaehhood 772 232915 tune l statusl Deaf sca Ex 2 mm 952 Deni Interval mam l 7 1397405 4030419 70 437 0 340 r 553 perf l 7 0343013 011213 73 057 0 002 7 0532305 7 0123127 months l 0339001 013335 2 11s 0 034 0023331 0749341 X3 727228 7 4157 24 Sample size calculation for log rank test 0 depends more on the number of events than on the number of subjects 0 factors that have to be considered 7 either expected event rates or expected median survival times in the groups being compared 7 length of accrual duration of follovvrup get help from a statistician 0 Reference Lachin7 JM 1981 Introduction to Sample Size Determination and Power Analysis for Clinir cal Trials7 Controlled Clinical Trials7 2 937 113 Hints on presenting survival analysis data 0 univariate data description of survival time itself idonlt report mean7 median7 standard der viation7 range7 etc obtained by summarize or proc univariate if any times are censored do report median survival time from Kaplanr Meier curve gtk or smaller percentile if less than half of the people have had the event 0 bivariate analysis of relationship between sure vival time and one predictor variable at a time i log rank test for categorical predictors gtk one variable7 not dummies Cox model for continuous predictors 26 o to put more than 1 predictor at a time in the model 7 Cox dummy variables for categorical predictors o KaplanrMeier curves 7 for main comparison of interest 7 for any other important individual predicr tors gtk break continuous predictor into categories to depict it using Kaplan Meier 225152 Introduction to OneWay ANOVA Lecture 17 Nov 1 2002 Kate Cowles 374 SH kcowles statuiowaeclu 3 Goal to compare population means under three different treatments 0 a threerindependentrsample problem 0 Call the population mean heart rates M1 for when pets are present M2 for when friends are present and M3 for when women perform task alone then HOWQM2M3 47mmwmmmmwm gtk not onesided or Qrsided 2 Comparing more than two population means Example Does the presence of pets or friends affect responses to stress 0 Allen Blascovich Tomaka and Kelsey 1988 Journal of Personality and Social Psychol 09y 0 subjects 45 women who described theme selves as dog lovers 0 randomly assigned to three groups to do a stressful task 1 alone 2 with a good friend present 3 with their dog present 0 Subjects7 mean heart rate during the task was one measure of the effect of stress 4 SAS descriptive statistics data pets infile petdat input group rate grpf 0 if group eq F then grpf 1 grpp 0 if group eq P then grpp 1 run proc sort by group run proc means var rate class group run 5 6 To infer about the three population means7 we The MEANS Procedure might use the tworindependentrsample t test 3 Analysis Variable rate tunes N 0 Test H0 M1 M2 to see if mean heart rate group Obs N MeaI1 Std Dev when pet is present differs from mean when friend is present C 15 15 82 5240667 9 2415747 0 Test H0 M1 M3 to see if mean heart rate F 15 15 913251333 83411341 when pet is present differs from mean when alone P 15 15 73 4830667 9 9698202 oTest H0 M2 M3 to see if mean heart rate when friend is present differs from mean when alone 7 3 Problem with this approach Multiple comparisons procedures in statis tics o 3 prvalues for 3 different tests dont tell us how likely it is that three sample means are issue how to do many comparisons at once Spread apart as far as these are with some overall measure of con dence in might be that 21 7348 and 22 9132 3110 conclusions are significantly different if we look at just 0 two steps 2 groups but not signi cantly different if we know they are the smallest and largest means in 3 groups ioverall test of whether there is good eye idence of any differences among parame ters we wish to compare 7 As more and more groups are considered7 we expect gap between smallest and largest sample mean to get larger followrup analysis to decide which of pa rameters differ and to estimate size of dif ferences lmagine comparing heights of shortest and tallest person in larger and larger groups of people 0 the probability of Type I error for the whole set of trtests will be much bigger than the 04 level set for each one 9 Step one OneWay Analysis of Vari ance ANOVA ostep one overall test for some difference among 3 or more population means 0 uses an F test to compute a prvalue Main idea of ANOVA What matters is how far apart sample means are relative to variability of individual obser vations o F statistic variation among the sample means F variation among individuals in the same sample 0 compare to a cutoff value in an F distribu tion Notation o I number of different populations Whose means we are studying o n number of observations in sample from ith population 0 N total number of observations in all same ples combined 10 Dogs friends and stress example The ANOVA Procedure Dependent Variable rate Sum of Source DF Squares Mean Square F Value Model 2 2387 688992 1193 844496 Error 42 3561 299492 84 792845 Corrected Total 44 5948988484 Fquuare Coeff Var Root MSE 0401360 1116915 9208303 Source DF Anova SS Mean Square F Value group 2 2387 688992 1193 844496 12 F distributions 0 many different F distributions identified by two parameters numerator degrees of freedom l r 1 denominator degrees of freedom N r l Example Do four varieties of tomato plant differ in mean yield7 Agronomists grew 10 plants of each va riety and recorded the yield of each plant in pounds of tomatoes What are 0 the populations of interest 0 the variable of interest 0 I 0 each 72239 o the degrees of freedom for the ANOVA F statistic 15 o All of the populations have the same stanr dard deviation 0 unknown funlike tetests7 there is no general proce dure when population standard deviations are not assumed to be equal rough rule of thumb if largest sample stanr dard deviation is no more than twice the smallest sample standard deviation7 then population standard deviations probably are close enough to equal that ANOVA procedure is OK falso may use Levene7s test7 but it is not powerful 14 Assumptions for OneWay ANOVA 0 We have I independent simple random same ples7 one from each of l populations 0 Each population 239 has a normal distribution with unknown mean 11239 As with tetests7 if sample sizes are large enough in each sample7 Central Limit The orem says inference based on sample means is OK even if population distributions are not exactly normal proc anova class group model rate group means group hovtest run Levene s Test for Homogeneity of rate Variance ANOVA of Squared Deviations from Group Means Sum of Mean Source DF Squares Square F Value Pr gt group 2 58181 29091 023 079 Error 42 538094 1281 18 17 Step two individual ttests With cor rection for multiple comparisons This is the follow up test 0 should be carried out only if the F test from onerway ANOVA is signi cant at the chosen signi cance leve Goal to set the overall probability of commitr ting a type I error at or when doing pairwise comparisons of k different means 0 we will perform k 2 tworindependentrsample tetests 0 we will conduct each one at the signi cance level or 7 2 o This is called the Bonferroni correction Dogs friends and stress example 0 There are k 3 samples so there are 3 different pairs to compare 0 To get an overall signi cance level or 05 on all 3 tests considered together we conduct each one at 05 0167 0 3 i That is we would consider the difference between two population means to be sig ni cantly different from zero at the 05 level only if the prvalue for the the tetest for that pair was less than 0167 7 very conservative Equivalently we could multiply the prvalue from each tetest by 3 gtk If the result was less than 05 we would consider the difference between two pop ulation means to be signi cantly differ ent from zero at the 05 level SAS handles the appropriate computations The ANOVA Procedure proc anova class group model rate group means group alpha 05 bon cldiff Bonferroni Dunn t Tests for rate NOTE This test controls the Type I experimentwise error rate but it generally has a higher Type 11 error rate than Tukey s run for all pairWise comparisons Alpha 005 Error Degrees of Freedom 42 Error Mean Square 8479285 Critical Value of t 249367 Minimum Significant Difference 83847 Comparisons significant at the 005 level are indicated by 96 Difference group Between Simultaneous 95 Comparison Means Confidence Limits F i C 8801 0416 17186 F i P 17842 9457 26227 C i F 8801 17186 0416 C i P 9041 0656 17426 P i F 17842 26227 9457 P i C 9041 17426 0656 22s152 Applied Linear Regression Chapter 7 Dummy Variable Regression Consider the categorical variable Class repre senting the class in high school with categories of freshman sophomore junior or senior l7ve mentioned that a categorical variable with 4 categories would require 3 dummy variables Why can7t I just use the following coding sys tem Category D1 D2 Freshmen 1 1 Sophomore 1 0 Junior 0 1 Senior 0 0 This would not t the best tting7 intercept for each respective group which is usually what we want One possible correct coding Category D1 D2 D3 Freshmen 1 0 0 Sophomore 0 1 0 Junior 0 0 1 Senior 0 0 0 The overall model Yi 0 1i 2D1i 3D2i 4D3i6i The tted model for each group Freshmen Y2 e z m 62 Sophomore Y e s wz 62 Junior Y e 4 1962 62 Senior Y o 8le 52 D1 and D2 do de ne 4 distinct categories Let7s look at the tted models for each group if we also have 1 quantitative covariate X Yi o lm 2D1i aDm 5i Freshmen Y g 2 g z 62 Sophomore Y g g 8le 62 Junior Y g 3 1962 62 Senior Y o 8le 52 Let7s assume all the parameters are not 0 There are 4 distinct intercepts here but they are not all free7 to be any value There is a restriction such that the Freshmen intercept is dependent on the Sophomore intere cept and the Junior intercept 2 This coding would allow for a separately t in tercept for each group Use m 7 1 dummy variables for a categorical variable with m categories Recall that this model would t 4 separate parallel lines because there is no interaction between Class and Xin the model A model that includes interaction Yi o lm 2D1i sDzi 4D3i 5D1m 6D2m 7D3m 5i The tted model for each group Freshmen K 50 5251 551952 t 52 Sophomore K u 5351 9962 62 Junior YZ 50 545157952 t 52 Senior 16 u 51962 62 The model ts separate intercepts AND sepa rate slopes The test if any of the interaction variables are signi cant H015556570 HAnotH0 this can be accomplished with a partial Fetest Plotting in R gt pioclt1oglt1ccgt1og3a1gtcype u gt a Don t plot polutsv gt poutslogACG Rauk odezai 1ogSa1 Rauk odexa gtcol2 gt ponucs1ogAcGRaukCOd i1ogSa1gtRaukCod 2lgtcol3 gt poutslogACG Rauk od 1ogSa1 Rauk o lgtcol4 gt lmvbothvoutcoeff1c1euts Intercept logACG rankdummyd raukvdumm 392 1122791103 001652546 7038142946 7026632286 50 5405 317 317 gt coeff1mbothvoutcoeff1c1euts M for easler codmg gt abluecoeff1l coerfm coeff2lgtcol2 2 1wd gt abluecoeff1l coerfm coeff2 1 gt abimeltcoerr1 coefrmrcokri gt legumes1o7cmssnsc Assoc Null coiclt234gt1wd2gt n mm Returning to the Salary Data from Lab We found that loggmnts X and rank as sistar1t7 associate7 or full were both signi cant predictors of logsalary in the model Y1 50 510006139 5D1Dli 5D2D21 6139 with the coding dummy 1 dummy 2 Assistant 1 0 Associate 0 1 Full 0 0 What do the tted models look like Assistant K 50 501 51400962 t 52 Associate K 50 502 Aoaxz 62 FUHC Y2 50 5ACG2 62 Do we need to include an interaction between loggmnts and rank H 0 i slopedummyl slopedummyZ 0 The Partial Fetest gt 1moucbochmc1mlt1oglt3a1gt logACG raukvdummyd raukdummyg 10gACGraukdummy1 10gACGraukdummy2 gt auova mvoutVbothgtlmoutbothut Analysls of Varlauce Table Model 1 10gSal 10gACG raukvdummyd raukdummyg Model 2 10gSal 10gACG raukvdummyd raukdummyg logACG raukvdummyd logACG raukdummya Resva RSS Df Sum of Sq F PrgtF 1 420 17394912 2 418 17394337 2 00575 06889 06027 i Fail to reject the null The parallel lines appear to be suf cient for ex plaining the relationship between loggmnts and logsalary ie there7s a similar linear re lationship between logsalary and loggmnts for the di ering rank groups same slopes 8 225152 Applied Linear Regression Chapter 8 ANOVA NOTE We will meet in the lab on Monday October 13 o Oneway ANOVA 7 Focuses on testing for di cerences among group means 7 Take random samples from each of m pop ulations 7 m is the sample size in the 2th population forz391m 7 yij is the jth observation in the 2th pope ulation 1 Estimators 7 11 Y M11i in The estimated for a group is just the sample group mean 02 is estimated using a pooled estimate because constant variance is assumed A2 2 039 75137 where 52 is the sample variance in the 2th group 7 2 Pooled est1mate of a Sp 7 5P n17 1s ng71s nm7 1s 7 There are a couple commonly used models for a oneway ANOVA with m groups 7 The cell means model Yij Mi EU With EU iiil N00392 So p1 and all observations from group 1 have the same mean 1 The mean of group 239 is M The mean parameters to be estimated are M17M27w7Mm There is 1 noise parameter to estimate 02 2 7 The e ects model Eij lyNlt00392 m 12ni So p 041 and all observations from group 1 have the same mean 11 041 In this model there are m groups esti7 mated means and were using m 1 pa7 rameters to de ne the mean structure This is an overeparameterizationl Di erent sets of parameter values p 041 am can give the same tted values ie can give the same estimated group means For example suppose m 3 and 171 10 20 and 30 In the overeparameterized e ects model YU na fori123 many di cerent combinations of n 041 042 043 estimates will give me these same estimated group means of 10 20 30 for example 710 20 30 40 20 710 0 10 This means we have to use a constraint or restriction to make the parameters in the model identi able7 uniquely deter7 mined Dummy Regressor Coding for the am 0 constraint with m 3 group 1 group 2 Regression Model Yi M alDli a2D2i 5i Model by group Group 1 Y na1ei Group 2 Y nagei Group 3 ne This is what we7ve been using so far with our dummy regressor coding we7ve had a baseline group 7 The e ects model Yij n 04139 61 The am 0 constraint Set the last group parameter to zero Essentially delete the parameter for the last category Under this constraint the last group group in is seen as the baseline group am0soEYmnamn n represents the mean of the mth group under this constraint 04 is the distance of group i from group in The 04 7s give distance from baseline group This may or may not be a useful interpre7 tation for your situation 6 7 The e ects model Yij n 04139 61 There is another often used constraint that produces easily interpretable parameters The sumtozero constraint Elli 04139 0 am 7a1a2am71 m 7 l dummy variables needed a is seen as the grand mean or the average of the pop7n means nice interpretation If you have balanced data l Y the overall mean of the sample If you have unbalanced data a the mean of the sample means 04 represents the distance of group i from the grand mean Thus 04139 is the e ect of being in group i tells us if the mean of group i is up or down from the grand mean Dummy Regressor Coding for sum to zem constraint with m 3 Category D1 D2 group 1 1 0 group 2 0 1 group 3 e1 71 Regression Model Yi M a1D1i a2D2i 5i 0 Example Deviation regressors 7 Back to the Pet and Stress data well now use a di cerent dummy regressor coding of the same situation and well use the deviation regressors for the dummy vari ables gt petsreadcsv petscsv gt attachltpetsgt gt namesltpetsgt 11 groupH rate gt levelsltgrou 1 MC HF upu Create the deviation regressors 11 Model by group Group 1 Yipoqei Group2 Yipagei Group 3 pa1a2ei No baseline group in this interpretation You still only need 2 dummy variables as 043 041 a2 That7s the restrice tion we7ve imposed These 1071 dummy regressors are called deviation regressors because the inter pretation gives values as distances or de viations from the grand mean nnrowpetsgt E a a H l H m a 8 3 dummy1grou C 1 1 dummy1group P 1 1 dummy 2rep0n dummy2group F 1 1 dummy2group P 1 1 vvvvvvv gt dataframegroupdummy1dummy2 group dummy1 dummy 1 P 1 1 2 F O 1 3 P 1 1 4 C 1 O 5 C 1 0 Regression Model Yi M a1D1i a2D2i 5i gt lmoutlmrate quot dummy1 dummy2 gt lmoutcoeff1c1ents Interce t dummy1 dummy 8244408889 007997778 888104444 M 041 042 Since this is balanced data7 overall mean 1 gt meanltrategt 1 8244409 All three group means 1717172 gt tapplyrategroupmean c F P 8252407 9132513 7348307 Control treatment group M 041 gt lmoutcoefficients 1 lmoutcoefficients 2 1 8252407 Friend treatment group 0 2 gt lmoutcoefficients1lmoutcoefficients3 1 9132513 Pet treatment group M 7 041 042 gt lmoutcoefficients1l7lmoutcoefficients21 lmoutcoefficients 3 1 7348307 Hypothesis testing in oneway ANOVA 0 Cell means model H01M1M2 Mm HA at leastlul different 0 E ects model H0 Ozg04m0 04 HAatleast1al7 0 Both hypotheses are testing the same thing You can use the summary statement to get an overall Fetest gt lmoutlmrate quot dummy1 dummy2 gt summary lm out Coefficients Estimate Std Error t value Prgtltl Intercept 8244409 137269 60060 lt Zeile WW dummy1 007998 194128 0041 0967 dummy2 888104 194128 4575 418e05 WW Signif codes 0 M 0001 H 001 005 01 1 Residual standard error 9208 on 42 degrees of freedom Multiple RSquared 04014Adjusted Risquared 03729 Fistatistic 1408 on 2 and 42 DF pivalue 2092e05 Festatistic is 1408 and pevalue is 000002 We reject the null and conclude that there is statis tically signi cant evidence that at least one of the group levels is di cerent from the others This Festatistic and pevalue are EXACTLY the same as when we t the model using the 043 0 constraint in the part 1 notes on p16 1A 0 ANOVA table and overall Fetest When we represent group by dummy regrese sors7 R sees each dummy variable as a sepa rate covariate notice how it performs a test for each dummy regressor in the summary In the pet example7 we can test the signife icance of group by lumping the two covarie ates together and doing a partial Fetest or an overall Fetest in this case because they were the only predictors in the model What about the ANOVA table and the sums of squares Regss 22142 7 Y R88 222112 7 i2 TSS 22102 7 Y 16 gt RegSSsumltlmoutfittedvalues7meanrategt 2 gt Re SS 1 2387689 gt RSS7sumrate7lmoutfittedvalues 2 gt RSS 1 561299 Source Sum of Squares df Mean Square F Regr sion 2387 689 2 1193 844 14 07954 Residuals 3561299 42 84 79285 Total 5948 988 44 When R sees group as a factor categorical variable and its the ONLY predictor we can get the RegSS from the anova statement gt lmoutlmratequotgroup gt anovalmout Analysis of Variance Table Response rate Df Sum Sq Mean Sq F value PrgtF group 2 23877 11938 14079 2092e705 M Residuals 42 35618 848 Signif codes 0 0001 001 7 005 01 17 Assessing the assumptions of oneway ANOVA 0 normal distribution of response variable in all populations of interest 7 histograms boxplots for sample data from each population 7 normal qq plot for sample data from each population 0 same standard deviation in all popula7 tions 7 possible to use Levene7s test for homogene ity of variance but it assumes normality of observations 7 rule of thumb if largest sample standard deviation isn7t more than twice as large as smallest sample standard deviation as7 sumption is probably met close enough for ANOVA to be OK 7 if nis are equal the ANOVA is less sensi7 tive to violation of equal variance 19 1 0 Classical ANOVA sums of squares Notation for sums of squares in a 17way ANOVA m R5933 ssgmp 2 7107 7 Y i1 where is the group 239 mean and l7 is the overall sample mean Note that 17 Residual sum of squares m 7 Jess 202 7 w i1j1 Source Sum of Squares df Mean Square quot11 SSW mm 7W m 2553719259143 Residuals 2211274541 7 KY urm MSE Total 21210437162 n71 o If one or both assumptions are violated try transformation If none seems suf cient try noneparametric procedure such as Kruskale Wallis test