THEORY STAT INFERENCE
THEORY STAT INFERENCE STAT 513
Popular in Course
Popular in Statistics
This 187 page Class Notes was uploaded by Shane Marks on Monday October 26, 2015. The Class Notes belongs to STAT 513 at University of South Carolina - Columbia taught by Staff in Fall. Since its upload, it has received 6 views. For similar materials see /class/229674/stat-513-university-of-south-carolina-columbia in Statistics at University of South Carolina - Columbia.
Reviews for THEORY STAT INFERENCE
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 10/26/15
STAT 513 THEORY OF STATISTICAL INFERENCE Fan 2007 Lecture Notes Joshua M Tebbs Department of Statistics University of South Carolina TABLE OF CONTENTS STAT 5137 J TEBBS Contents 1 Hypothesis Testing 1 11 Introduction 1 12 The elements of a hypothesis test 3 13 Common large sample tests 9 131 One population mean 11 132 One population proportion 12 133 Difference of two population means 13 134 Difference of two population proportions 14 14 Sample size determination 14 15 Relationship between con dence intervals and hypothesis tests 18 16 Probability values 20 17 Small sample hypothesis tests using the t distribution 23 171 One sample test 23 172 Two sample test 25 18 Hypothesis tests for variances 26 181 One sample test 26 182 Two sample test 27 19 Power7 Neyman Pearson Lemma7 and UMP tests 28 191 Simple and composite hypotheses 31 192 Uniformly most powerful UMP tests 33 110 Likelihood ratio tests 38 2 An Introduction to Bayesian Inference 45 21 The Bayesian approach 45 22 Prior model selection 56 221 Conjugate priors 57 TABLE OF CONTENTS STAT 5137 J TEBBS 222 Noninformative priors 58 23 Bayesian point estimation 61 24 Bayesian interval estimation 65 25 Bayesian hypothesis tests 69 3 Regression Models 73 31 Introduction 73 32 Linear statistical models 76 33 Simple linear regression and least squares 78 34 Properties of least squares estimators from simple linear regression 81 35 Estimating the error variance 84 36 Inference for 60 and 61 85 37 The analysis of variance for simple linear regression 87 38 Con dence intervals for linear functions of 60 and 61 91 39 Prediction intervals for a future Y using simple linear regression 94 310 Correlation 96 311 Matrix algebra results for linear models 101 3111 Basic de nitions and results 101 3112 Linear independence and rank 104 3113 Column and row spaces 106 3114 Quadratic forms 107 3115 Means and variances of random vectors 108 3116 The multivariate normal distribution 109 312 Introduction to multiple linear regression models 112 313 Multiple linear regression models using matrices 115 314 Sampling distributions 120 315 Inference for parameters in multiple regression 123 TABLE OF CONTENTS STAT 5137 J TEBBS 316 Simple linear regression in matrix notation 125 317 The hat matrix and geometric considerations 126 318 The analysis of variance for multiple linear regression 127 319 Con dence intervals for EYlw in multiple linear regression 131 320 Prediction intervals for a future value of Y in multiple linear regression 133 321 Reduced versus full model testing 135 4 A Brief Introduction to Survival Analysis 139 41 Introduction 139 42 Describing the distribution of time to an event 141 43 Censoring and life table estimates 147 44 The Kaplan Meier estimator 151 45 Two sample tests 159 46 Power and sample size considerations for two sample tests 168 461 Proportional hazards 168 462 Design speci cations 172 r I k sample tests 177 CHAPTER 1 STAT 5137 J TEBBS 1 Hypothesis Testing Complementary reading Chapter 10 11 Introduction TERMINOLOGY Statistical inference is the part of statistics that deals with making statements about population or model parameters The two main areas of statistical inference are estimation point estimation and con dence intervals and hypothesis tests We discussed estimation in Chapters 8 and 9 This chapter covers some of the theory behind hypothesis tests Example 91 The lifetime of an air conditioning units run capacitor a device in the unit which drives the compressor denoted by Y is modeled by an exponential distribution with mean 6 gt 0 Suppose that we observe an independent and identically distributed iid sample of n capacitor lifetimes denoted by Y1Y2 Y o A good77 point estimator for 6 is the maximum likelihood estimator MLE 5 e 3 i Y e 7 T n i1 l T t We know that 3 is an unbiased estimator of 0 ie t9 and is a function of the suf cient statistic T 21 Yi ie gis UMVUE 0 To write a 1001 7 a percent con dence interval for 0 we have some different options We could use a largesample con dence interval ie i S Y i a 7 2 H w This interval arises from the fact that for 71 large 7 i 0 ZS AV01 PAGE 1 CHAPTER 1 STAT 5137 J TEBBS ie Z is an asymptotic pivot see 86 This large sample result follows from the Central Limit Theorem and Slutsky7s Theorem Alternatively we could use an exact 1001 7 a percent interval given by 2T 2T Xgnpz27 Xgn iaZ 7 where xganZ and xgn Z denote the lower and upper 042 quantiles of a X22n distribution respectively This interval arises from the fact that 2T Q j N 9622 ie Q is a an exact pivot which follows from the fact that T N gamman0 proved using an mgf argument If the sample size n is large we would expect the two intervals to be similar in terms of coverage probability When the sample size is small these two intervals could be quite different HYPO THESIS TEST Suppose that the manufacturer of a new capacitor claims that its new model has a longer life than the known mean lifetime of the standard model say 00 How can we determine statistically if there is evidence to support this claim Here it makes sense to think of two competing hypotheses H0 3 6 60 versus Ha6gt60 The state under H0 represents the situation where the mean lifetime of the new model is not different than the mean lifetime of the standard model The state under Ha represents the situation that the mean lifetime of the new model is greater than the mean lifetime of the standard model ie the manufacturers claim is true How should we decide between H0 and Ha using the available data Y1Y2 Yn7 This question can be answered by performing a hypothesis test PAGE 2 CHAPTER 1 STAT 5137 J TEBBS 12 The elements of a hypothesis test IMPORTANT The hypothesis test is probably the most widely used statistical infer ence procedure The four parts to a hypothesis test are a the null hypothesis denoted by H0 b the alternative hypothesis denoted by Ha in some books it is denoted by H1 c the test statistic and d the rejection region REMARK The alternative hypothesis Ha is sometimes also called the researchers hypothesis since it is often but not always the hypothesis that the researcher wants to conclude is true TERMINOLOGY A test statistic is a rule that is used to decide between H0 and Ha It will always be a function of Y1Y2 Yn the data hence the term statistic TERMINOLOGY The rejection region denoted by RR speci es the values of the test statistic for which H0 will be rejected The rejection region is often located on a probability distribution of some kind usually in tails of that distribution If we reject the null hypothesis H0 we say that the result or test is statistically signi cant THE STATES OF NATURE Table 11 succinctly summarizes the four different states we could encounter in performing a hypothesis test of the form H0 3 6 60 versus Ha0 90 TERMINOLOGY Type I Error Rejecting H0 when H0 is true The probability of Type 1 Error will be denoted by a and is called the signi cance level for the test PAGE 3 CHAPTER 1 STAT 5137 J TEBBS Decision Reject H0 Decision Do not reject H0 Truth H0 Type I Error correct decision Truth Ha correct decision Type II Error Table 11 The states of nature in a hypothesis test TERMINOLOGY Type II Error Not rejecting H0 when Ha is true The probability of Type II Error sometimes generically denoted by 6 depends on which value of 6 31 00 is being considered that is this probability will be different for different values of 0 PREVAILING RESULT In any hypothesis test if the test statistic falls in rejection region then we reject H0 A test statistic is constructed in a way so that we can determine its distribution under H077 ie when H0 is true This construction will allow us to calculate Type I and II Error probabilities REMARK In any testing situation we would like the error probabilities oz and B to be small In general the error probabilities are inversely related that is as the Type I Error probability 04 increases decreases Type II Error probability 6 decreases increases Example 92 Suppose that Y1Y2 Y25 is an iid sample of size n 25 from aN6 02 distribution where 02 100 known To test H0 3 6 versus Ha0gt75 suppose that we decide to use the rejection region RR1 y y gt 76 that is we will reject H0 when the sample mean 73 exceeds 76 For this decision rule what is a the probability of Type I Error SOLUTION Recall that if Y1Y2 Yn is an iid N002 sample then 7 N N602n Thus when H0 is true and n 25 it follows that i 100 y N N 75 E PAGE 4 CHAPTER 1 STAT 5137 J TEBBS Thus with RR1 y 17 gt 76 PType 1 Error PReject HolHo is true P7 gt 76l6 75 76775 PltZgt i PZ gt 05 03085 In practice one would never want the signi cance level 04 to be this large Let7s try a different rejection region With RRZ y 17 gt 80 the Type 1 Error probability is PType 1 Error PReject HOlHO is true P7 gt 80l6 75 P Z gt U PZ gt 25 00062 xZl Obviously this is much better However note the tradeoff the second rejection region has a smaller Type 1 Error probability but requires much more evidence against H0 ie to reject H0 we need the sample mean 17 gt 80 with RRZ whereas we only needed gt 76 with RRl For RRZ let7s compute the probability of Type ll Error when 6 77 B PType ll Error PDo not reject H0l6 77 P7 lt sole 77 PZlt80777 PZ lt15 09332 Thus if 6 77 in which case Ha is true we would likely not reject H0 The rejection region RRZ has a small Type 1 Error probability Oz 00062 but the probability of Type ll Error for this value of 6 may be too high D STRATEGY In most hypothesis testing situations we x 04 at something we can live with77 say 04 001 Oz 005 Oz 010 etc Then with our chosen value of 04 we try to make 6 as small as possible this will often force us to think about the sample size PAGE 5 CHAPTER 1 STAT 5137 J TEBBS Example 93 Suppose that Y1Y2Y5 is an iid sample of exponential observations with mean 6 gt 0 We would like to test H0 3 6 1 versus Ha6gt1 a Find k so that RR y 17 gt k is an exact level 04 005 rejection region b With this rejection region7 what is the Type ll Error probability when 6 3 SOLUTION a Working with the general form of the rejection region7 we note that 5 5 RRyygtkyZygt5k y2Zylgt10k i1 i1 ln general7 we recall that if YhYZ Yn is an iid exponential6 sample7 then Yi X22n M 29 H Thus7 when H0 is true and n 57 it follows that 2 21 Y N X210 therefore7 we can specify k so that a 005 PX210 gt10k From Table 67 pp 794 WMS7 we see that 10k o w 18307 Thus7 taking k 183077 it follows that RR y 17 gt k is an exact level 04 005 rejection region b To nd the Type ll Error probability when 6 37 we compute B PType ll Error PDo not reject Hol6 3 P7 lt183076 3 5 P lt18307 6 3 i1 PX210 lt 6102 01934 I used the pchisq610210 command in R to calculate the last probability D EXERCISE Repeat Example 937 except take 71 10 What happens to the Type ll Error probability in part b Note that the rejection region changes PAGE 6 CHAPTER 1 STAT 5137 J TEBBS Example 94 Suppose that Y1Y2Y10 is an iid sample of Poisson observations with mean 6 gt 0 We would like to test H0 3 6 2 versus Ha6lt2 a Find k so that RR y 17 S k is an exact level 04 005 rejection region SOLUTION a Working with the general form of the rejection region7 we note that 10 RRfyiyltkyZyilt10k i1 ln general7 we recall that if Y17Y2 Yn is an iid sample of Poisson0 observations7 then the suf cient statistic T 2 Y N Poisson7u97 proved using an mgf argument Thus7 when H0 is true and n 107 it follows that T N Poisson20 Therefore7 we can specify k so that 10k aPT10kZ j0 Pois20 pmf From Table 37 pp 791 WMS7 we note the following 04 10k k 0011 10 10 0021 11 11 0039 12 12 0066 13 13 Thus7 RR1 y g S 12 is an exact level 04 0039 rejection region similarly7 RRZ y g S 13 is an exact level 04 0066 rejection region We see that because of the discreteness of the Poisson distribution7 it is not possible to obtain an exact level 04 005 test This is usually the case with tests based on discrete distributions rarely will we be able to get an exact prespeci ed level 04 test PAGE 7 CHAPTER 1 STAT 5137 J TEBBS b Using RR1 y 17 S 127 what is the Type 11 Error probability when 6 1 SOLUTION To nd the Type 11 Error probability when 6 17 we compute B PType 11 Error PDo not reject HOW 1 P 1219 1 g 10 P 12 0 1 PlPois10 12 07916 i1 1 used the ppois12 10 command in R to calculate the last probability D EXERCISE Repeat Example 947 except take 71 20 What happens to the Type 11 Error probability in part b Note that the rejection region changes Example 95 Suppose that Y1Y2Y100 is an iid sample of n 100 Bernoullip observations We would like to test H0 p 010 versus Ha p lt 010 Find k so that RR y yi S k is an exact level 04 005 rejection region SOLUTION ln general7 we recall that if Y1Y2 Yn is an iid Bernoullip sample7 then the suf cient statistic T 2116 bnp Thus7 when H0 is true and n 1007 it follows that T N b1007 010 therefore7 we can specify k so that k 100 a m PT k lt 01073917 010 0 7 0 j 7 b100010 pmf EXACT APPROACH The R command pbinomk100 01 gives the following k4 i 04 00237 k5 i 04 00576 k 6 i 04 01172 As we see again7 we cannot get an exact level 04 005 rejection region PAGE 8 CHAPTER 1 STAT 5137 J TEBBS APPROXIMATE APPROACH Using a normal approximation we know from the Gen tral Limit Theorem that when p 010 ie when H0 is true the sample proportion 010090 pAN 010 100 Thus 010090100 k100 i 010 010090100 005PB k100 m p ltZlt k100 7010 gt 7165 gt k 505 w 5 In this case the two approaches exact and approximate provide similar answers They will not always agree however especially if n is small We7ll further discuss approximate approaches in the next section B 13 Common largesample tests TERMINOLOGY The terms large sample7 andor asymptotic are used to describe hypothesis tests that are constructed from asymptotic theory In this section we present large sample hypothesis tests for 1 one population mean M 2 one population proportion p 3 the difference of two population means 1 7 pg 4 the difference of two population proportions pl 7 p2 TEST STATISTIC In each of these situations we will use a point estimator say 0 which satis es A 6 7 6 Z AN01 U1 9 where 7g denotes the standard error of In some cases the estimated standard error g must be used PAGE 9 CHAPTER 1 STAT 5137 J TEBBS area 095 standard normal density l l l 196 196 Figure 11 Standard normal pdf with two sided rejection region when 04 005 TWO SIDED TEST ln general7 suppose that we would like to test H0 3 6 60 versus 1660 This is called a twosided test because Ha does not specify a direction indicating a departure from H0 We choose the value of k so that when H0 l9 00 is true7 aw gtw However7 note that since Z N AN01 it follows by standardizing that 70 70 amPOZl gt 0 gt 0 wza2gtk90zaZU U U Because RR y W 2 Ho Huang gt 2 M 2 242 the approximate level 04 rejection region is often stated in terms of the statistic 2 For exarnple7 if 04 0057 then 2042 20025 196 see Figure 11 PAGE 10 CHAPTER 1 STAT 5137 J TEBBS ONE SIDED TEST In general suppose that we would like to test H0 3 6 60 versus Ha6gt60 This is called a onesided test because Ha speci es a speci c direction indicating a departure from H0 We choose the value of k so that when H0 t9 00 is true 04 P6 gt However note that since Z N AN0 1 it follows by standardizing that kit kit awPltZgt70gt Omzagtk60zaa U U Because RRy 260zaU ltgtz52220 again we often state the approximate level 04 upper tail rejection region in terms of the statistic 2 By an analogous argument the large sarnple test of H0 3 6 60 versus H 6 lt 60 uses RR 3 g3 00 izaag ltgt z 2 S 720 as an approximate level 04 lower tail rejection region 131 One population mean SITUATION Suppose that Y1Y2 Yn is an iid sample with mean M and variance 02 and that interest lies in testing H0 M 0 For large n the Central Limit Theorern says that when H0 is true i Y M0 UW PAGE 11 Z AN0 1 CHAPTERI STAT51ampJTEBBS Thus if the population variance 02 is known we can use this as a test statistic lf 02 is not known we can replace 02 with 2 the sample variance Because 2 is a consistent estimator of 02 Slutsky7s Theorem allows us to conclude that 7 7 0 Z Sm AN0 1 when n is suf ciently large 132 One population proportion SITUATION Suppose that Y1Y2 Yn is an iid Bernoullip sample where 0 lt p lt 1 and that interest lies in testing H0 p p0 For large n the Central Limit Theorem says that when H0 is true 2167 100NAN 0717 P0110071 where denotes the sample proportion For theoretical reasons this is called a score statistic Other textbooks may present 15 100 171 n as a large sample test statistic for H0 p p0 This statistic which is sometimes referred to as a Wald statistic is valid for large 71 since 151 7 is a consistent estimator of p1 7 p However the former statistic has better properties in nite samples ie it possesses a signi cance level which is often closer to the nominal level a The latter statistic is often liberal possessing a signi cance level larger than the nominal level Example 96 ESl is a procedure that places a small amount of powerful anti in ammatory medication cortisone around in amed spinal nerves in the epidural space A single injection is known to deliver notable relief 7 in 80 percent of all subjects Man ufacturers of a new type of anti in ammatory injection treatment are seeking evidence that their treatment is more successful In a medical study with n 56 subjects 49 ex perience notable relief Does this new treatment provide a statistically signi cant result at the 04 005 level PAGE 12 CHAPTER 1 STAT 5137 J TEBBS SOLUTION To answer this question7 treat the 56 subjects as an iid Bernoulli sample7 let p denote the true success probability for the new treatment7 and consider testing H0 p 080 versus Ha p gt 080 Here7 the sample proportion is 15 4956 0875 and the value of the score statistic is Z 157100 0875 7 080 1 p0110071 08017 08056 At the 04 005 level7 2005 1657 so we can not reject H0 that is7 z 140 does not fall in the rejection region 2 z 2 165 This is not a statistically signi cant result We can not conclude that the new treatment is signi cantly better than the existing ESl treatment at the 04 005 level D 133 Difference of two population means SITUATION Suppose that we have two independent samples ie7 Sample 1 Y117Y12 H714 1 are iid with mean 1 and variance of Sample 2 161162712n2 are iid with mean 2 and variance 0 and that interest lies in testing H0 M1 7 p2 0 When H0 is true7 the Central Limit Theorem says that for large 711 and n2 7 77 ZMANO1 0 0 Lita lf7 for some improbable reason7 the population variances of and a are known7 then this serves as a test statistic More likely7 we must estimate the population variances with S and 8 the respective sample variances A detailed argument shows that when H03M1M201St1 ue7 i i Y1 7 Y2 s 3 71 Z AN01 for large m and n2 PAGE 13 CHAPTER 1 STAT 5137 J TEBBS 134 Difference of two population proportions SITUATION Suppose that we have two independent samples ie7 Sample 1 Y117Y1277Y1n1 are iid Bernoullip1 Sample 2 Y21Y22 Y2n2 are iid Bernoullipg7 and that interest lies in testing H0 pl 7 p2 0 A detailed argument shows that when H0 is true7 Z amp N LLV1071 Z711I31 1321iz32 n1 n2 for large 711 and 712 where 151 and 152 are the respective sample proportions This is a valid large sample test statistic it is of the Wald type7 similarly to that observed for the one sample problem For this situation7 other books may present the score statistic 151 i 52 x l f y 711151 712152 711 712 z where 6 denotes the pooled sample proportion A detailed asymptotic argument shows that this is also a bona de large sample statistic The latter statistic often performs better than the former in small sample situations This is often true in general score tests are often better than their Wald test analogues 14 Sample size determination IMPORTANCE We now address the important problem of sample size determination7 restricting attention to one population settings For practicing statisticians7 sample size determination is a ubiquitous exercise For our discussion7 we will assume that we have an estimator gwhere 719 Z AN01 U PAGE 14 CHAPTER 1 STAT 5137 J TEBBS where U denotes the standard error of Recall that the estimator 6 and consequently its standard error Hg depends on n the sample size We focus on the onesided test H0 3 6 60 versus H 6 gt 60 that employs a level 04 rejection region exact or approximate RRy62kltgtz222a where k 60 zaag so that PHO Z 6 RR a PRACTICALLY IMPORTANT DIFFERENCE Our goal is to determine the sample size n that confers a chosen value of B the Type ll Error probability However because 6 depends on 6 we must specify a particular value of 6 to consider denote this value by 61 Because we are considering the alternative Ha 6 gt 60 it is likely that we are interested in the value 6 gt 60 ie 6a 60 i A7 where A gt 0 The value A will be referred to as the practically important difference that we wish to detect IMPORTANT To derive a general formula for the sample size in a particular problem we exploit the following two facts 0 When H0 is true our level 04 rejection region RR y 62 k implies that m U 0 When Ha is true and 6a 60 A then for a speci ed value of B it follows that k 76 W iz see Figure 105 pp 478 These two formulae provide the basis for deter mining the necessary sample size n recalling that Ug depends on PAGE 15 CHAPTER 1 STAT 5137 J TEBBS SINGLE POPULATION MEAN For a one sample test regarding a population mean ie7 for H0 M M07 it follows that k 0 am where 039 denotes the population standard deviation When Ma M0 A then for a 2047 speci ed value of B we have k 7 1 am 2 quot Solving these two equations simultaneously gives 2 2 n 7 2a z 039 7 A2 7 where A Ma 7 M0 As we can see7 the population variance 02 must be speci ed in advance In practice7 we must provide a guess or an estimate of its value Sometimes this guess may be available from preliminary studies or other historical information Example 97 Anchovies are small silvery sh with blue green backs A marine biologist7 interested in the distribution of the size of a particular breed of anchovy7 would like to test the following hypothesis H0 M 20 versus HaMgt207 where M denotes the mean anchovy length measured in cm She would like to perform this test using 04 005 Furthermore7 when M Ma 227 she would like to detect this difference ie7 reject Hg with probability 090 What sample size should she use She has supplied a guess that 039 w 25 SOLUTION Here7 we have M0 20 and Ma 227 so that A 27 Oz 0057 and B 010 We have 2a 2005 165 and z 2010 128 Thus7 the desired sample size is a 2 2 165 128 2 25 2 2 z H m1341 A2 22 Rounding up7 she should collect 14 anchovies D PAGE 16 CHAPTER 1 STAT 5137 J TEBBS SINGLE POPULATION PROPORTION For a one sample test regarding a population proportion ie7 for H0 1 p0 it follows that k i 100 i i 2047 1001 i 100 71 When pa p0 A then for a speci ed value of B we have k 7 pa pa1i pan izg39 Eliminating the common k in these two equations produces 1 Ilia 2a P0 Polz P PliA039 n n where A pa 7100 Solving this equation for 71 gives the desired sample size In practice7 this should be done numerically Example 98 Researchers are planning a Phase III clinical trial to determine the probability of response7 p7 to a new drug treatment It is believed that the standard treatment produces a positive response in 35 percent of the population To determine if the new treatment increases the probability of response7 the researchers would like to test7 at the 04 005 level7 H0 p 035 versus Ha p gt 035 In addition7 they would like to detect a clinically important increase in the response probability to p pa 040 with probability 080 so that the Type ll Error probability 6 020 The clinically important difference A 005 is a value that represents a practically important increase77 for the manufacturers of the new drug What is the minimum sample size that should be used in the Phase III trial SOLUTION We are left to solve the equation 035 1 7 035 040 1 7 040 2005 2020 7 005 0 for n I wrote the following simple R program to solve this equation numerically PAGE 17 CHAPTERI STAT51ampJTEBBS Author Joshua M Tebbs Date 6 Aug 2007 Purpose This program provides the sample size necessary to detect the alternative pa for a fixed Type III Error prob alphalt 005 betalt O2O pOlt O35 palt O4O deltalt pa pO rootfunc lt functionx qnorm1 alphasqrtpO1 pOX qnorm1 betasqrtpa1 paX delta unirootrootfunc C11000O tol0000lroot 1 5729835 Thus7 the researchers will need to recruit n 573 patients for the Phase III trial D 15 Relationship between con dence intervals and hypothesis tests CONFIDENCE INTERVAL There is an elegant duality between con dence intervals and hypothesis tests In a mathematical sense7 they are essentially the same thing7 as we now illustrate Suppose that we have a point estimator7 say7 0 which satis es i ee Z N01 Using Z as a pivot7 it follows that Ri zagag is a 1001 7 a percent con dence interval for 0 REMARKS a This discussion is essentially unchanged if Z N AN01 we will not dwell on this subtlety7 as it is rather unimportant b In practice7 the standard error 7g could depend on parameters that are unknown We assume that these parameters are known here and that as is free of 0 This is probably the not the case in practice7 but7 for our purposes here7 this is irrelevant PAGElS CHAPTER 1 STAT 5137 J TEBBS HYPOTHESIS TEST The two sided level 04 hypothesis test H0 3 6 60 versus Ha y o employs the rejection region 376 RRzzgtza2y 0A0 gtza2 9 which of course means that H0 is not rejected when 6760 ZaZ lt lt 2042 Ur However algebraically the last inequality can be rewritten as 7 zaZa lt 00 lt zagag which we recognize as the set of all values of 00 that fall between the 1001 7 04 percent con dence interval limits PUNCHLINE The hypothesis H0 t9 00 is not rejected in favor of Ha t9 31 00 at signi cance level a whenever 00 is contained in the 1001704 percent con dence interval for 0 Equivalently if 00 is not contained in the 1001 7 04 percent con dence interval for 0 then H0 is rejected at level a REMARK This argument illustrates the connection between con dence intervals and two sided hypothesis tests A similar relationship holds for one sided con dence intervals and one sided tests see pp 481 The obvious question then becomes Do we really need hypothesis tests77 as seemingly running them gleans no additional information than that conveyed from constructing the correct con dence interval which arguably is far more informative There are other situations in statistical practice where hypothesis tests can be performed and con dence interval estimation does not apply eg model buildingchecking in regression analysis overall F tests in ANOVA models etc Thus hypothesis tests do have their place PAGE 19 CHAPTER 1 STAT 5137 J TEBBS 16 Probability values REMARK When performing a hypothesis test simply saying that we reject H077 or that we do not reject H077 is a largely uninformative binary categorization On the other hand a probability value provides a numerical measure of how much evidence we have against the null hypothesis H0 TERMINOLOGY The probability value for a hypothesis test speci es the smallest value of Oz for which H0 would rejected Thus if the probability value is smaller than 04 we reject H0 If the probability value is larger than or equal to 04 we do not reject H0 REMARK Understanding how to compute probability values is easy all we need to do is examine the alternative hypothesis Since the probability value is viewed as a measure of how much evidence we have against H0 it is always computed under the assumption that H0 is true Example 99 Suppose that Y1Y2 Y100 is an iid NM02 sample where 02 100 is known and that we want to test H0 u 75 versus Haugt75 Suppose that the sample mean is 17 7642 and thus the one sample 2 statistic is 7177 7 7642775 7142 Ux 1mm 39 39 When would we reject H0 Since our alternative is one sided we would reject H0 when ever 2 gt 20 where 20 denotes the upper 04 quantile of the standard normal distribution In this light consider the following table 04 Test statistic Critical value Reject H0 04 005 z 142 2005 165 no 04 010 z 142 2010 128 yes PAGE 20 CHAPTER 1 STAT 5137 J TEBBS Figure 12 N01 density with me sided probability value P 00778 From the table7 we note that 2010 lt Z lt 2005 and7 thus7 it should be clear that the probability value is somewhere between 005 and 010 In fact7 observing that our alternative is one sided7 we see that7 with P denoting probability value7 P PZ gt142 00778 see Figure 12 Thus7 for all oz lt 007787 we would not reject H0 On the other hand7 for all oz 2 007787 we would reject H0 Rernernber7 we can think of the probability value as the borderline77 value of oz for which H0 is rejected PREVAILING THEME Probability values provide an alternative but equivalent way of deciding between H0 and Ha ln general7 Test statistic in RR ltgt probability value 3 a PAGE 21 CHAPTER 1 STAT 5137 J TEBBS Example 910 It has been suggested that less than 20 percent of all individuals who sign up for an extended gym membership continue to use the gym regularly six months after joining Suppose that Y denotes the number of members who use a certain gym at least 3 times per week on average six months after joining7 to be observed from a sample of n 50 members Assume that Y N b50p and that we are to test H0 p 02 versus Haplt02 If Y y 67 the exact probability value is 6 V y0 145002 pmf computed using the pbinom65002 command in R This is fairly strong evidence against H07 although it is not enough77 at the standard 04 005 level of signi cance Instead of using the exact probability value7 one commonly uses the normal approxima tion to the binomial7 in which case the approximate probability value is 012 7 02 P Z lt PZ lt 7141 00793 0217 0250 As you can see7 there is a mild discrepancy here in the exact and approximate probability values Approximate results should always be interpreted with caution D REMARK In a profound sense7 a probability value7 say7 P7 is really a random variable This should be obvious since it depends on a test statistic7 which7 in turn7 is computed from a sample of random variables Y17Y2Yn In the light of this7 it seems logical to think about the distribution of P Clearly7 the support of P is over 017 so a beta family comes to nd As it turns out7 mathematics can show that when H0 is true7 a probability value P N 107 1 This theoretical result7 which would be proven in a more advanced course7 has important consequences in statistical practice7 including applications in drug discovery and in genetics PAGE 22 CHAPTER 1 STAT 5137 J TEBBS 17 Smallsample hypothesis tests using the t distribution RECALL We now focus on small sample tests for a a single population mean M and b the difference of two population means say M1 7 M2 In the one sample problem we have learned that we can use a Z type statistic 7 7 Z Mo7 S which has an approximate standard normal distribution when H0 M M0 is true and n is large The key point here is that the sample variance S2 is a consistent estimator of 02 the population variance so that when n is large Z N AN0 1 from the Central Limit Theorem and Slutsky7s However since consistency is a large sample property one would not expect this statistic to perform well when n is small This is true because 52 is not necessarily a precise estimator of 02 when n is small For example recall that when Y1Y2Yn is an iid NM02 sample VSZn71 that is the precision of 52 depends directly on 02 If 02 is large and n is not large then VSZ will be large 171 Onesample test SETTING Suppose that Y1 Y2 Yn is an iid NM 02 sample where both parameters M and 02 are unknown and that we want to test H0 3 M 0 versus Ha M7 M0 In this situation we know verify that iv o 5 PAGE 23 tn 71 CHAPTER 1 STAT 5137 J TEBBS Table 12 Crab temperature data These observations are modeled as n 8 iid realiza tions from a Naoz distribution 258 246 261 249 251 253 240 245 The quantity t is called the onesample tstatistic When H0 is true7 we would not expect t to fall in the tails of the tn 7 1 distribution7 so the rejection region is located there Speci cally7 for an a level test7 the rejection region is RR 3 Z tnil Z Probability values are also computed from the tn 7 1 reference distribution One sided tests use a suitably adjusted rejection region Example 911 In Wilmington7 NC7 a marine biologist observes a random sample of n 8 crabs and records the body temperature of each in degrees C see Table 12 She models these observations as an iid sample from a NW 02 distribution She would like to test7 at level oz 0057 H0 M 254 versus Ha M lt 254 From the data in Table 127 we compute ij 250 and s 069 thus7 the value of the one sample t statistic is 7 177 7 2507254 7 5N7 069 The appropriate critical value is 7t70V05 71895 from Table 57 WMS7 and7 thus7 the t 7164 rejection region is given by RR t t lt 71895 We do not have suf cient evidence to reject H0 at the oz 005 level The probability value is Pt7 7164 m 0073 computed using the pt 1647 command in R D PAGE 24 CHAPTER 1 STAT 5137 J TEBBS 172 Twosample test SETTING Suppose that we have two independent samples Sample 1 Y117Y127Y1n1 iid NULLUf Sample 2 Y217Y22 quot7ng iid NltM2U and that the goal is to compare the means 1 and 2 When the population variances are equal that is7 when a a 02 say7 we have shown that 71 72 1 i 2 Sp Q71 172 where 7 represents the 2th sample mean and t Ntlt711712727 82 n1 1S12n2 Us p 711 712 7 2 denotes the pooled sample variance7 which is an unbiased estimator of the common 02 verifyl Suppose that we want to test7 at level 04 H0 3 1 i 2 0 versus Ha3M1M20 When H0 M1 7 p2 0 is true7 it follows that 71 7 72 1 1 SP 771 M t The quantity t is called the twosample t statistic When H0 is true7 we would not expect t to fall in the tails of the tn1 712 7 2 distribution7 so the rejection region is located there Speci cally7 for an a level test7 the rejection region is RR ti ltl Ztn1n272a2 Probability values are also computed from the tn1 712 7 2 reference distribution One sided tests use a suitably adjusted rejection region PAGE 25 CHAPTER 1 STAT 5137 J TEBBS IMPORTANT When a 31 0 that is7 when the population variances are not equal7 we can use the modi ed t statistic given by tiY17Y2 5 3339 Lita Under H07 the distribution of this modi ed t is approximated by a tV distribution where S 3amp2 71 Tz V 132 52 Tb1 Tb2 nlil ngil the degrees of freedom This formula for V is known as Satterwaite7s formula ROB USTNESS Recall that a statistical procedure is said to be robust if the underlying assumptions do not have to exactly hold for the procedure to yield mostly accurate results One and two sample t tests are robust to moderate departures from normality For larger sample sizes7 the procedures are more robust We discussed robustness at great length when presenting one and two sample con dence intervals based on the t distribution 18 Hypothesis tests for variances 181 Onesample test SETTING Suppose that Y1 Y2 Yn is an iid NW 02 sample7 where both parameters are unknown7 and that interest lies in testing H0 02 0393 versus Ha 02 03 where 0393 is a xed value Recall from Chapter 7 WMS that when H0 is true ie7 when a a3 71 7 1 52 X2 2 X201 1 70 PAGE 26 CHAPTER 1 STAT 5137 J TEBBS Thus7 when H0 is true7 we would not expect X2 to fall in the tails of the X201 7 1 distribution Thus7 for an a level test7 the rejection region is RR X2 3 X2 S Xi7ll7oz2 or X2 Z Xian2 Probability values are also computed from the X20171 reference distribution One sided tests use a suitably adjusted rejection region 182 Twosample test SETTING Suppose that we have two independent samples Sample 1 Y117Y12 Y1m iid Nu1af Sample 2 Y21Y22 iid NM27U 7 and that the goal is to compare the variances a and a by performing the test H0 a 0 versus Ha a 31 0 Recall from Chapter 7 WMS that7 in general7 c7 7 n271S n2 7 1 c7 7lt 1 gtSltn1 7 1 1 NFlt71171771271 However7 note that when H0 a a is true7 F reduces algebraically to 52 F 71 5 Thus7 when H0 is true7 we would not expect F to fall in the tails of the Fn1 71712 71 distribution Thus7 for an a level test7 the rejection region is RR F 3 F S Fn171n27117uz2 or F 2 Fm71n271a2 Probability values are also computed from the Fn1 7 1712 7 1 reference distribution One sided tests use a suitably adjusted rejection region PAGE 27 CHAPTER 1 STAT 5137 J TEBBS 19 Power Neyman Pearson Lemma and UMP tests REMARK The concept of power is important in statistics Loosely speaking we say that a hypothesis test has good77 power properties if it is likely that we will reject H0 when H0 is not true In practice statistical investigations are often designed to provide a suf ciently high power to detect a difference deemed important77 by the investigator TERMINOLOGY Suppose that Y1 Y2 Yn is an iid sample from fyy 6 and that we use a level 04 rejection region RR to test H0 6 60 versus a suitable alternative The power function of the test is denoted by K6 and is given by K6 PReject H0l6 That is the power function is a function that gives the probability of rejecting H0 as a function of 6 Of course if 6 60 then H0 is true and K60 04 the probability of Type 1 Error For values of 6 that are close to 60 one would expect the power to be smaller than say when 6 is far away from 60 This makes sense intuitively namely it is more dif cult to detect a small departure from H0 than it is to detect a large departure from H0 The shape of the power function always depends on the alternative hypothesis FACT Under the scenario described in the last paragraph if 6 is a value of 6 in the alternative space that is when Ha is true then where 66a is the Type ll Error probability when 6 6a Proof This fact follows directly from the complement rule that is K6a PReject H0l6 6a 1 7 PDo not reject H0l6 6a 1 7 D Thus tests with small Type ll Error probability have high power We would like to use tests with high power PAGE 28 CHAPTER 1 STAT 5137 J TEBBS Example 912 Suppose that Y1 Y2 Yn is an iid N0Ug sarnple7 where 03 is known7 and that we would like to test H0 3 6 60 versus Ha6gt60 Suppose that we decide to use the level 04 rejection region RR z z gt 204 where Z g i 90 0390 7 and 2a denotes the upper 04 quantile of the standard normal distribution The power function for the test7 for values of 6 gt 00 is given by K09 PReject Holt PZgtza6 7 0390 P Ygt60ZD lt7 9 w l 90 Li 79 P Zgt ltfgt ToW 90 2 7 9 ToW 7 where denotes the N01 cumulative distribution function cdf It is easy to see that K09 is an increasing function of 6 verifyl Note additionally that K090 17 za 04 the probability of Type I Error Figure 13 displays the graph of K09 for the case wherein Oz 0057 00 67 03 4 and n 10 The R code I used to generate the plot follows E EXERCISE Reproduce the power function for different values of 71 What happens to the shape of the power function as 71 gets larger What would the power function look like if the rejection region was two sided PAGE 29 CHAPTERI STAT51ampJTEBBS Ktheta theta Figure 13 Power function K09 in Example 912 with a 005 00 6 03 4 and n10 Author Joshua M Tebbs Date 31 Aug 2007 Purpose Plot the power function in Example 912 thetalt seq51000l thetaOlt 6 nlt 1O alphalt 005 sigmaOlt 2 power lt functiontheta l pnormthetaOqnorm1 alphasigmaOsqrtn theta sigmaOsqrtn plotthetapowerthetatypequotlquotxlab quotthetaquot ylabquotKthetaquotlty1 REMARK It is common in the planning phase of an experiment or observational study to specify a desired power to detect a practically important difference of interest to the investigator and then determine the sample size n to provide that power PAGESO CHAPTER 1 STAT 5137 J TEBBS 191 Simple and composite hypotheses TERMINOLOGY In this course7 we will usually take the null hypothesis to be sharp7 or simple that is7 the value of 6 under H0 is equal to a speci c value The alternative may be simple or composite Here is an example of a simpleversussimple test H0 3 versus Ha 6 6 Here is an example of a simpleversuscomposite test H0 3 6 5 versus Ha6gt5 There are an in nite number of values of 6 speci ed in the composite alternative hypoth esis that is7 any value of 6 that is larger than 5 GOAL For a simple versus simple hypothesis test7 we seek the most powerful rejec tion region ie7 the rejection region that maximizes the probability of rejecting H0 NEYMAN PEARSON LEMMA Suppose that Y17Y2 Yn is an iid sample from fyy 67 let L6 denote the likelihood function7 and consider the following simple versus simple hypothesis test H0 3 6 60 versus Ha 6 6a For xed a the test that maximizes the power when 6 61 uses the rejection region mag682 where k is chosen so that the test has level a that is7 a PReject HO HO true This is called the mostpowerful level a test for H0 versus Ha PAGE 31 CHAPTER 1 STAT 5137 J TEBBS Example 913 Suppose that Y1 Y2 Y10 is an iid Poisson0 sample and that we want to test H0 3 6 1 versus Ha t9 2 Find the most powerful level 04 test SOLUTION The likelihood function for 6 is given by 1 0 631679 yi67109 6u67109 L 9 y E L 0 7 7 77 i1 yd H221 yil H221 yil where the suf cient statistic u 231 yi Now7 form the ratio lu67101 1 L09 2ue 102H1yil 271039 Thus7 the Neyman Pearson Lemma says that the most powerful level 04 test is created 1 P 2116710 lt k 1 a This is not a friendly probability calculation to make7 so lets rewrite the event inside Note that by choosing k such that 1 U710 1 Wltk ltgt 2 e gtE ltgt 2Ugt510k ltgt Uln2gt107lnk 10ilnki ltgt Ugt ln2 k say Thus7 the problem has now changed to choosing k so that PU gt M6 1 a This is an easier probability to handle We know that when H0 is true7 U 2221 Y N Poisson10 Thus7 as k is likely not an integer7 we need to solve the equation 1076 10 j PUgtk 01 f jk 1 PAGE 32 CHAPTER 1 STAT 5137 J TEBBS for k where denotes the greatest integer function or equivalently lk l PU S k l6 1 j0 We could use Table 3 WMS which catalogues the Poisson cdf I used the ppoism10 10716710 7 7 a jl command in R where 111 plays the role of k to produce the following table 111 17 Oz 04 14 0917 0083 15 0951 0049 16 0973 0027 17 0986 0014 18 0993 0007 From the table we can say for example that the most powerful level 04 0049 test uses the rejection region 10 RR 31Zyi 2 15 i1 that is to reject H0 6 1 in favor of Ha 6 2 when the sum 231 y is greater than or equal to 15 D QUESTION What is the power of this test when 6 6a 2 10 K2 PReject H0l6 2 P 2 15 i1 9 2 PPois20 2 15 0843 192 Uniformly most powerful UMP tests REVIEW For a simple versus simple test the Neyman Pearson Lemma provides the means necessary to derive the most powerful level 04 rejection region We now switch gears and discuss simple versus composite tests eg H0 6 60 versus H 6 gt 60 or H0 660 versus Ha 6 lt 60 TERMINOLOGY When a test maximizes the power for all 6 in the alternative space it is called the uniformly most powerful UMP leveloz test PAGE 33 CHAPTER 1 STAT 5137 J TEBBS DISCLAIMER The Neyman Pearson Lemma pertains only to simple versus simple hy potheses Thus if a simple versus composite test is of interest we can not appeal to Neyman Pearson Lemma directly to nd the UMP test However its use still will come into play in nding the UMP test as we demonstrate shortly INTERPRETATION Suppose that KU6 denotes the power function for the UMP level 04 test for H0 versus Ha and let KO6 denote the power function for some other level 04 test Then KU6 2 KO6 for all 6 in the alternative space that is for all values of 6 which satisfy Ha A STRATEGY TO FIND UMP TESTS Instead of considering the simple versus composite test H0 3 6 60 versus Ha6gt60 pretend like you have the simple versus simple test H0 3 6 60 versus H s 9 where 6 is some value larger than 60 If you can show that the choice of the rejection region for the simple versus simple test does not depend on the particular value of 6a speci ed in the simple alternative Ha 6 6a then the test will be UMP for the simple versus composite test H0 6 60 versus Ha 6 gt 60 CURIOSITY Why does this approach work Essentially what you are doing here is showing that for a given 6a the simple versus simple test for that particular 6 is most powerful using Neyman Pearson in a simple versus simple context However since the value 6 is arbitrary the test must be most powerful for every value of 6 gt 60 ie it must be the uniformly most powerful test for all 6 gt 60 PAGE 34 CHAPTER 1 STAT 5137 J TEBBS Example 914 Suppose that Y17Y2 Y10 is an iid geometricp sample and that we want to test H0 p 14 versus Ha p gt 14 Find the UMP level 04 test SOLUTION We begin by using the NP Lemma to nd the most powerful level 04 test for H0 1 14 versus Ha 310 pm where pa gt 14 Recall that the geometricp pmf is given by 1 W lp y 172 07 otherwise7 fyyp andl7 thus7 the likelihood function for p is 10 W 10 WW 2 MP H fYltyi p 1 pLi1yZ 1op10 1 p 1op107 i1 where the suf cient statistic u 231 yi The Neyman Pearson Lemma says that the most powerful level 04 test uses the rejection region L14 RR y 7 lt k Mpg where k is chosen such that 04 PReject HolHO true PReject Help 14 Consider the ratio mm W lt1gpam in Pa Mm 1 Pa 10 a P Thus7 the Neyman Pearson Lemma says the most powerful level 04 test uses the value of Pmi 3f imltkw PAGE 35 k satisfying CHAPTER 1 STAT 5137 J TEBBS This is not a friendly calculation However note that l lUlu md m Mmlzkl f ll m ltgt Uln ltlnk710ln lnk710ln13 7pa k pa 3 In mam Thus the problem has now changed to choosing k so that ltgtUlt PU lt k lp 14 a This is an easier probability to handle We know that when H0 is true U 231 Y N nib10 14 Thus as k is likely not an integer we need to solve the equation lk l j 7 1 I M lt k lp 14gt Z lt34gt710lt14gt10 a F10 10 i 1 mb1014 pmf for k where denotes the greatest integer function I used the pnbinomm 1O 10 025 command in R where 111 plays the role of k to produce the following table m 04 21 0021 22 0030 23 0041 24 0055 From the table we can say for example that the most powerful level 04 0055 test uses the rejection region 10 RR 31Zyi lt24 i1 that is to reject H0 p 14 in favor of Ha p pa whenever the sum 2221 y is less than or equal to 24 PAGE 36 CHAPTER 1 STAT 5137 J TEBBS Kp Figure 14 Power function Kp in Example 914 with oz 0055 and n 10 IMPORTANT In this example note that neither the test statistic U 222156 nor the rejection region the lower ozth quantile of the nib1014 distribution depends on the speci c value of pa Thus we have shown that 10 RR 3 S 24 i1 is the UMP level oz 0055 rejection region for testing H0 p 14 versus Ha p gt 14 19 TERMINOLOGY A level oz test for H0 t9 00 versus the alternative Ha whose power based on the data 3 The power function 10 Kp P g 24 i1 is depicted in Figure 14 D function is denoted by K09 is said to be unbiased if K021 2 oz for every 0 6 Ha REMARK In many situations such as when one considers a two sided alternative no UMP test exists The reason that a UMP test may not exist is that the class of level PAGE 37 CHAPTER 1 STAT 5137 J TEBBS 04 tests say C is a very large class and nding one rejection region that uniformly beats all other level 04 rejection regions is dif cult When a UMP test does not exist one can often nd a UMP unbiased UMPU test that is within the class of all level 04 tests C restrict attention to those tests that are unbiased This is analogous to restricting attention to unbiased estimators when trying to determine the UMVUE Example 915 Suppose that Y1 Y2 Yn is an iid NM 02 sample where 02 is known From Example 1023 pp 511 3 WMS we know that the UMP level 04 test of H0 M M0 versus Ha M gt MO is given by the test with RR y 17 gt MO zapW It can also be shown verify that the UMP level 04 test of H0 M M0 versus Ha M lt M0 uses RR y y lt M0 7 zap When M gt M0 the rst test beats the second test in terms of power the opposite is true when M lt M0 This shows that neither test is UMP for the two sided test H0 M M0 versus Ha M 31 M0 In fact for the two sided test no UMP rejection region exists For this test the alternative space is too large that is we can not nd a test that is uniformly better for all M 31 M0 The UMPU level 04 test uses RR 3 3 l gt Mo 2a2U 39 D REMARK The theory needed to derive UMPU tests will be covered in a more advanced course in statistical inference The theory involves the generalized NP Lemma 110 Likelihood ratio tests TERMINOLOGY The likelihood ratio test approach is very useful in statistics It is especially useful when there are nuisance parameters in the problem that is parame ters that are unknown but are not of interest Suppose that Y1 Y2 Yn is an iid sample from fyy6 where the parameter 6 E Q We call 9 the parameter space that is 9 represents the set of all values that the parameter 6 scalar or vector can assume Suppose that we partition 9 into two sets 90 and 9a ie Q 90 U 9a where 90 and 9a are disjoint A hypothesis test can be stated generally as H0 6 E 90 versus Ha 6 6 9a In this statement we call 90 the null space and 9a the alternative space PAGE 38 CHAPTER 1 STAT 5137 J TEBBS LIKELIHOOD RATIO TEST Suppose that Y1Y2 Yn is an iid sample from fyg67 where 6 E Q The likelihood ratio test of H0 3 6 E 90 Versus Ha 6 6 9a employs A as a test statistic and uses the rejection region given by RR A A S k where A L90 Sup6e o 19 supeen 19 and k is chosen such that 04 PReject HOlHO true that is7 so that RR is a level 04 rejection region L630 is the likelihood function evaluated at the MLE over 90 the restricted parameter space LQ is the likelihood function evaluated at the MLE over 9 the unrestricted parameter space It is obvious that 0 S A S 1 TECHNICAL NOTE If H0 is a composite hypothesis7 we de ne Oz sup PReject H0l6 sup K67 6690 9690 where K6 denotes the power function Example 916 Suppose that Y1Y2 Yn is an iid exponential6 sample and that we want to nd the level 04 likelihood ratio test for H0 3 6 60 versus Ha67 60 Here7 90 607 9a 6 gt 0 6 31 607 and Q 6 6 gt 0 The likelihood function for 6 is given by 1 1 7 m L9ly E LU H fyy 6 H aKW 675 Liam9 i1 i1 PAGE 39 CHAPTER 1 STAT 5137 J TEBBS Clearly supgenO L6 L60 since 90 contains only the singleton 60 For the alternative space we need to nd supeen L6 For which value of 6 is the likelihood function L6 maximized This is not a trick question L6 is maximized when 6 6 the MLE of 6 this is the de nition of a maximum likelihood estimatorl Thus we rst need to nd the MLE of 6 The log likelihood is given by In L6ly in 1116 i Z Thus W in 66 9 02 gt 7n6iyi Solving this last equation for 6 yields 6 7 Checking second order conditions shows that this solution is actually a maximizer verify so that supeen L6 The likelihood ratio test statistic is given by Llt ogt L090 Bowen90 L My 7 WWW3 where the suf cient statistic u ELI yi Clearly if 60 Q then A 1 Otherwise 7 noting that u 2 y m we can rewrite A as 7 n uQO n y e e w 47 70 Gina27 70 y 6 We E 99 say Sketching a graph of g versus 17 we see that H0 will be rejected when AEQlt gtSkltgtSCiorgZCz where the constants 01 and 02 are chosen so that the probability of Type 1 Error is a ie 01 and 02 are chosen so that 04 To satisfy this equality we can for simplicity require that P7 S 01l6 60 042 and P7 2 02l6 60 042 that is just split the 04 Type 1 Error probability equally Now when H0 6 60 is true it is possible to show using an mgf argument that 7 gamman60n PAGE 40 CHAPTER 1 STAT 5137 J TEBBS but it is easier to work with the distributional result n 290 2 Yi 96227 07 i1 because this allows us to nd expressions for 01 and 02 in terms of quantiles of the X22n distribution Note that 137 Clio 60 P g 7101 660 i1 2 i S 27101 2 0 00 2 X 2 incl 1 P 1Y 0 P n S it 042 00 2710 7 2 60 X2n17042 90 gt 01 Xgn iaZ39 A completely analogous argument try it shows that 7 60 2 02 Xmaz Thus the level 04 likelihood ratio test uses the rejection region 7 60 2 7 60 2 RR 3 y lt X2n17042 or y gt X2na2 For added concreteness suppose for example that 00 1 n 20 and Oz 005 Then the critical values 1 1 cl mxiwgm 061 and 02 mxio oz5 148 For these values of n and 00 a level 04 005 likelihood ratio test employs the rejection region 3117 3 061 or 172148 Figure 15 displays the graph of K09 PReject Holt for the case wherein 00 1 n 20 and Oz 005 The R code I used to generate the plot follows E PAGE 41 CHAPTERI STAT51ampJTEBBS Ktheta theta Figure 15 Power function K09 in Example 916 with a 005 n 20 and 00 1 Author Joshua M Tebbs Date 2 Sep 2007 Purpose Plot the power function in Example 916 thetaltseq000l4000000l nlt 2O thetaOlt 1 alphalt 005 powerlt functiontheta pchisqthetaOthetaqchisqalpha22n2n 1 pchisqthetaOthetaqchisq1 alpha22n2n plotthetapowerthetatypequotlquotylimc01Xlabquotthetaquot ylabquotKthetaquotlty1 NOTE In the last example we were fortunate enough to know the distribution of 7 or equivalently 200 21 Y when Hg was true In other situations the distribution of the test statistic or some function of it may be intractable or may not even be known In this event the following large sample result can prove to be useful PAGE42 CHAPTER 1 STAT 5137 J TEBBS ASYMPTOTIC39 RESULT Suppose that Y1Y2Yn is an iid sample from fyyt9 where 6 E Q and that we are to test H0 3 6 E 90 Versus Ha 6 6 91 Under certain regularity conditions which we will omit7 it follows that 72 lnA igt X2V7 as n a 007 where V denotes the difference in the number of free parameters speci ed in 90 and the number of free parameters speci ed in Q The term free parameters will become clear in the next example Example 917 Among those individuals who are infected with a certain virus7 we would like to know whether the prevalence p varies across a socioeconomic or demographic variable eg7 race7 income class7 etc To be speci c7 suppose that we observe Y Y17Y2Y3 multnp1p2p3 21131 17 a multinomial response vector that counts the number of infecteds in each of k 3 classes We want to test7 at level 04 H0 3101 102 103 13 versus Ha H0 not true In the context of our general parameter space notation7 9 10171027103 6 071l3 3101 102 103 1 Q0 10171027103 6 071l3 3101 102 103 13 91 Q i 90 Under H07 each of the parameters 101102 and p3 is equal to 13 that is7 none are allowed to vary freely ln 9 we know p1 102 103 1 ie7 if we know two of the ps7 the last p is determined uniquely that is7 only two of the ps are free to vary Thus7 V 2 For PAGE 43 CHAPTER 1 STAT 5137 J TEBBS this problem the likelihood ratio statistic is given by L 0 Supee o 10 LQ supeen 10 7 where 0 191102133 and the likelihood function 71 L09 2 Lanny mph221230 1211232033 919293 A multivariable maximization argument shows that the unrestricted maximum likelihood estimators are A Y1 A Y2 A Y3 1 0171027103 n n n ie the sample proportions verifyll Thus the likelihood ratio test statistic is given by L13713713 C X 13y113y213y3 131 n Z 3 7 i Millmyznwsn i C X ylnly1yzny2ysny3 where c nlyllyglygl Finding the distribution of 13102 i1 i1 is dif cult analyticallyl However if n is large we can use the 72 lnA X22 approxi mation When H0 is true a multivariate Taylor series expansion shows that 3 2 3 yi 7 3 Obs 7 Exp2 21 A x 3 E l n 713 Exp 11 11 where Obs denotes the observed category count and Exp denotes the expected cat egory count A large sample level 0 rejection region is given by RRAA kA721nAZk where k is chosen such that 1372 lnA 2 klHO true a Assuming that 0 005 then since 72ln has an approximate X22 distribution k xiow 599 For example if we observed 71 100 yl 30 yg 40 and y3 30 then 3 yi 2730 2 4072 30A2 721mg n3 1003 1003 1003 200 and we would not reject H0 p1 pg pg 13 at the 0 005 level D PAGE 44 CHAPTER 2 STAT 5137 J TEBBS 2 An Introduction to Bayesian Inference Complementary reading Chapter 16 WMS 7th Ed 21 The Bayesian approach REMARK Statistical inference is concerned with drawing conclusions from numerical data about quantities that are not observed eg model parameters etc For example if Y1Y2 Yn is an iid Np02 sample then we typically are interested in using the data Y y to make conclusions about the value of 6 M02 the parameter which describes the distribution ie the population from which the data 3 are taken We can make these conclusions using point estimates con dence interval set estimates or by performing hypothesis tests that are pertinent to the problem at hand CLASSICAL APPROACH Up until now in your exposure to statistics you have most likely been taught exclusively the classical or frequentist approach to inference that is you have been taught to regard the model parameter 6 scalar or vector valued as a xed but unknown value and to use the data Y y to make some statement about 6 This classical approach can be succinctly summarized as follows H Treat the parameter 6 as a xed but unknown quantity E0 Assume that Y1 Y2 Yn is a sample perhaps an iid sample from the probability model fyy 6 where 6 E 9 9 Observe the data Y 3 Hr Draw inference about 6 based on the observed data 3 Example 101 In a public health study researchers are planning to investigate the prevalence of HIV in Houston TX involving heterosexual male intravenous drug users not receiving treatment for their addiction In this study the goal is to estimate p the PAGE 45 CHAPTER 2 STAT 5137 J TEBBS unknown proportion of positives in this population A sample of 71 individuals will be obtained from the population and the positivenegative statuses of the individuals Y1Y2 Yn will be modeled as iid Bernoullip observations where 0 lt p lt 1 Under the classical approach the value ofp is regarded as xed but unknown and the data 3 will be used to make inferential statements regarding the prevalence p THE BA YESIAN PARADIGM Instead of treating model parameters as xed quantities and modeling the data Y the Bayesian sets up a full probability model that is ajoint probability distribution for all observables eg the data Y and unobservables eg model parameters in 6 The model for the 6 should be consistent with our prior knowledge of the underlying scienti c problem and the data collection process KEY DIFFERENCE In the Bayesian approach to inference unobserved model parame ters are not treated as xed quantities rather they themselves are modeled as random quantities which vary according to a probability distribution this distribution is called the prior distribution It re ects or models our prior beliefs about 6 Taking this approach allows us to incorporate our a priori knowledge into the inferential procedure BA YESIAN APPROACH We can summarize the Bayesian approach as follows 1 Treat the parameter 6 as random and model 6 as a random variable with pdf 96 the so called prior distribution for 6 D Conditional on 6 posit that Y1Y2 Yn is a sample from fyy6 In a Bayesian context assuming that Y1 Y2 Yn is an iid sample the distribution n MAElg H fYQM 9 i1 is called the conditional distribution of the data Y given 6 Mathematically fy 9yl6 is the same as the likelihood function L6ly 9 Observe the data Y y 4 Construct the posterior distribution g6ly All inference regarding 6 eg estimation testing etc is conducted using the posterior distribution g6ly PAGE 46 CHAPTER 2 STAT 5137 J TEBBS REMARK The Bayesian approach allows the researcher to incorporate prior information about 0 Clearly in many problems this would be desirable For example if we are talking about HIV in Houston we know that at least the prevalence p is not large In fact a wide variety of estimates of HIV prevalence have appeared in the literature ranging up to about 3 million infected in the United States this is about one percent nationwide lf Houston follows the pattern77 of this nationwide estimate taking a Bayesian approach affords the researcher the exibility to incorporate this prior information into the analysis On the other hand the classical approach seemingly ignores the prior information FINDING THE POSTERIOR DISTRIBUTION We present a general algorithm on how to nd the posterior distribution 90ly H Start by assuming a prior distribution for 0 say 6 90 3 Construct the conditional distribution fY 9yl0 Recall that if Y1 Y2 Yn is an iid sample from fyy t9 the conditional distribution is given by n MAElg H fYQM 9 i1 this is simply the likelihood function of 0 C40 Construct the joint distribution of Y and 0 this distribution is simply fY9379 ineQlt X 99 this follows from the de nition of conditional distributions remembering that 6 is regarded as a random variable 7 Compute hyy the marginal distribution of the data Y this is given by hYy 9 fY9y we Cf Construct the posterior distribution 90ly as fY937939 My My PAGE 47 CHAPTER 2 STAT 5137 J TEBBS Example 102 In our Houston HIV example see Example 1017 suppose that we model the data Y17Y2 Yn7 conditional on p7 as iid BernoulIip observations7 where 0 lt p lt 1 Since we are considering HIV prevalence7 we know p is most likely small7 so we decide to model p as a realization from a betaoz distribution7 where 04 lt 6 this is the prior distribution This beta prior distribution is reasonable since 0 the support of a beta is R 07 I which coincides with the Bernoullip parameter space ie7 0 lt p lt1 0 the betaoz family is very exible that is7 the pdf can assume many different shapes by changing Oz and g furthermore7 taking 04 lt B provides a pdf that is concentrated closer to p 0 than to p 1 o the betaoz distribution turns out to be a conjugate prior for the Bernoulli likelihood we7ll explain this later Following the previously mentioned steps7 we now derive gpy7 the posterior distribu tion of p 1 The prior distribution for p is p N betaoz thus7 for 0 lt p lt 17 91 pm1 71 2 The conditional distribution of the data Y7 given p7 is7 for yi 017 inpyIp H fyyp pri ipf yi pill 17p m1yi p 1p i1 i1 where the suf cient statistic u 21 yi 3 The joint distribution of Y and p7 for values of yi 01 and 0 lt p lt 17 is fYpyp fyinIp mm M niu NONLB 0471 71 M1719 X Par p 172M PAGE 48 CHAPTER 2 STAT 5137 J TEBBS 4 The marginal distribution of the data Y for y 01 is Pa ua711 7 pn 7u71dp 1 2 Po hyy fypy7pdp 113 WW 1 u H n H MW 0 pl HM d Pa PuozPn 7u WWW Pna 7 where u 21 y 5 The posterior distribution gply is for 0 lt p lt 1 given by ma a 051 7 n 7u71 l fYgtP37P FaF p 1 p g p y hYZ Ra W lquotuzlquot Fna a pua711 7 pn 7u717 Nu aPn B 7 u which is the betaua n6 7u pdf that is the posterior distribution ofp given the data 3 is distributed as betau 0471 B 7 u where u 21 yi UPDATE To update what we have done so far we start by modeling the unknown prevalence p as a betaoz random variable we then observe the data 3 from the study and we nally update our prior beliefs based on the data 3 to arrive at the posterior distribution of p Schematically p N betaoz gt Observe data 3 gt p N betau 0471 B 7 u prior distribution posterior distribution We see that the posterior distribution depends on a the prior distribution through the values of Oz and B and on b the data 3 through the suf cient statistic u 21 yi It is also interesting to note that both the prior and posterior distributions are members of the same beta family ILLUSTRATION In our Houston HIV example suppose that n 100 that is we observe 100 subjects and that our prior distribution is p N beta119 that is a beta distribution with 04 1 and B 19 This may be a reasonable prior distribution since Ep 005 which is small77 consistent with our prior belief that p is likely not large PAGE 49 CHAPTER 2 STAT 5137 J TEBBS 0 1 Q 90 2 00 005 010 015 020 025 030 00 005 010 015 020 025 030 QW 0 Mil 15 0 2 4 B 8 1012 00 005 010 015 020 025 030 00 005 010 015 020 025 030 Figure 26 Binomial beta Bayesian prior and posteriors in Example 102 Upper left p N beta1 19 prior Upper right Posterior distribution ofp when u 1 beta2 118 Lower left Posterior distribution ofp when u 5 beta6 114 Lower right Posterior distribution ofp when u 15 beta16104 The su eient statistic is u yi In Figure 26 we depict this prior distribution upper left and posterior distributions based on three different values of u yi Consider the following table Prior gp Observed data Posterior gply beta1 19 u 1 beta2 118 beta1 19 u 5 beta6 114 beta1 19 u 15 beta16 104 NOTE Figure 26 illustrates the effect that the observed data 3 can have on the posterior distribution gply For it 1 the posterior is left shifted from the prior For it 5 the posterior rernains located in a similar position as the prior although the variability has been reduced For it 15 the posterior is right shifted from the prior An R program to construct the posterior distribution follows E PAGE 50 CHAPTER2 STAT51ampJTEBBS Name Joshua M Tebbs Date 16 Sep 2007 Purpose Binomial beta posterior construction p is the binomial proportion max p is 03 on plots because p is close to zero plt seq0030001 parameters in the betaab prior alt 1 blt 19 prior p is a function that constructs the prior distribution lgamma is the internal loggamma function priorplt functionpabexplgammaab lgammaa lgammab p a 11p b 1 f0lt priorppab n of trials u of successes nlt 100 u1lt 1 u2lt 5 u3lt 15 postp is a function that constructs the posterior distribution postplt functionnpuabexplgammanab lgammaua lgammanb u p ua 11 p nb u 1 f1lt postpnpu1ab f2lt postpnpu2ab f3lt postpnpu3ab plot prior and three posteriors use three different u s parmfrowc22 plotpf0typequotlquotXlabquotpquotylabquotgpquot plotpf1typequotlquotXlabquotpquotylabquotgpu1quot plotpf2typequotlquotXlabquotpquotylabquotgpu5quot plotpf3typequotlquotXlabquotpquotylabquotgpu15quot PAGE51 CHAPTER 2 STAT 5137 J TEBBS Example 103 An animal biologist is interested in modeling the number of rat pups per mother7 Y7 for Rattus rattus commonly known as the black rat77 Suppose that Y1 Y2 Yn denote litter sizes for a sample ofn rat mothers and assume that Y1 Y2 Y 7 conditional on 67 is an iid sample from a Poisson distribution with mean 6 ln turn7 6 is modeled as a realization from a gammaoz distribution A Bayesian approach is taken to exploit the information from previous rat fecundability studies it is known that the mean number of pups per litter is around 5 7 but can be as high as 20 A gamma prior distribution is reasonable since 0 the support of a gamma is R 000 which coincides with the Poisson6 para meter space ie7 6 gt 0 o the gammaoz family is very exible that is7 the pdf can assume many different shapes by changing Oz and 6 right skewed consistent with prior knowledge 0 the gammaoz distribution turns out to be a conjugate prior for the Poisson likelihood We now derive 96ly7 the posterior distribution of 6 1 The prior distribution for 6 is 6 N gammaoz thus7 for 6 gt 07 1 0471 7619 9 6 76 e lt WW 2 The conditional distribution of the data Y7 given 67 is7 for yi 017 fY 9y 6 fYli 9 6yi 6 9 6213 2396 7 6ue 9 W T H1yil THEEW where the suf cient statistic u 21 yi 3 The joint distribution of Y and 67 for values of yi 017 and 6 gt 07 is fY9379 MAE 9 X909 676677 1 n7 gtlt 760 45 9 9 Hi1 yil NEW 1 n 6ua71676n1 139 NEW Hi1 yil PAGE 52 CHAPTER 2 STAT 5137 J TEBBS 4 The marginal distribution of the data Y for y 01 is co 1 7 7 n 7 my Armmew PCWH WW 15 WW 1dr i1 l39 l 00 71 n 6u0 71676n1 Pa a Hi1 0 4mm romanme mm 7 where u 2 y 5 The posterior distribution g6ly is for 6 gt 0 given by u 047 7 n 1 ham79 i Fa ah1yi6 15 9 1 h 7 n0 y Fa l3911y ru 5 mil gu0471676n1 17 90 1 n0 Nu 04 rd which is the gammau a 71 16 1 pdf that is the posterior distribution of 6 given the data 3 is distributed as gammau 04 n 16 1 where u ELI yi UPDATE To update what we have done so far we start by modeling the unknown mean 6 as a gammaoz random variable we then observe the data 3 from the study and we nally update our prior beliefs based on the data 3 to arrive at the posterior distribution of 6 Schematically 6 gammaoz gt Observe data 3 gt 6 gammau 04 n 16 1 We see that the posterior distribution depends on a the prior distribution through the values of Oz and B and on b the data 3 through the suf cient statistic u ELI yi It is also interesting to note that both the prior and posterior distributions are members of the same gamma family ILLUSTRATION In our rat pup example suppose that n 10 that is we observe the litter sizes of 10 rat mothers and that our prior distribution is 6 gamma23 that is a gamma distribution with 04 2 and B 3 These Choices of Oz and 6 provide PAGE 53 CHAPTER 2 STAT 5137 J TEBBS 012 W ale 0 08 gmeau32 004 57 90 goneiau goneiau 00 01 02 03 04 05 Figure 27 Poisson gamma Bayesian prior and posteriors in Example 103 Up per left 6 gamma23 prior Upper right Posterior distribution of 6 when u 32 gamma347 00968 Lower left Posterior distribution of 6 when u 57 gamma5900968 Lower right Posterior distribution of 6 when u 90 gamma9200968 The su eient statistic is u 221 yi 7 E6 67 which is consistent with our prior knowledge In Figure 277 we depict this prior distribution upper left and posterior distributions based on three different values of u 231 yi Consider the following table Prior7 96 Observed data Posterior7 g6ly garnrna27 3 u 32 garnrna347 00968 garnrna27 3 u 57 garnrna597 00968 garnrna27 3 u 90 garnrna927 00968 NOTE Figure 27 illustrates the effect that the observed data 3 can have on the posterior distribution g6ly Similarly to Example 1027 we see the posterior distributions are much less variable than the prior7 with locations depending on the observed data 3 D PAGE 54 CHAPTER2 STAT51ampJTEBBS Name Joshua M Tebbs Date 16 Sep 2007 Purpose Poisson gamma posterior construction theta is the Poisson mean thetalt seqO2000l parameters in the gammaab prior alt 2 blt 3 prior theta is a function that constructs the prior distribution lgamma is the internal loggamma function priorthetalt functionthetaab1explgammaab a theta a 1eXp thetab fOlt priorthetathetaab n sample size u sample sum nlt 1O u1lt 32 u2lt 57 u3lt 9O bnewlt 1n1b posterior scale parameter postp is a function that constructs the posterior distribution postthetalt functionnthetauabnew1explgammaua bnew uatheta ua 1eXp thetabnew f1lt postthetanthetau1abnew f2lt postthetanthetau2abnew f3lt postthetanthetau3abnew plot prior and three posteriors use three different u s parmfrowc22 plotthetafOtypequotlquotXlabquotthetaquotylabquotgthetaquot plotthetaf1typequotlquotXlabquotthetaquotylabquotgthetau32quot plotthetaf2typequotlquotXlabquotthetaquotylabquotgthetau57quot plotthetaf3typequotlquotXlabquotthetaquotylabquotgthetau90quot PAGE55 CHAPTER 2 STAT 5137 J TEBBS REMARK Before we proceed one observation is worthy of a remark The process of Bayesian inference involves passing from a prior distribution 96 to a posterior distri bution g6ly The previous two examples have suggested that because the posterior distribution incorporates the information from the data it will be less variable than the prior distribution This notion can be formalized through the following results 1379 ElE9lYl V6 EV6lYl VE6lY That is the prior mean of 6 is the average of all possible posterior means over the distribution of the data Y In addition the posterior variance is on average smaller than the prior variance by an amount that depends on the variation in the posterior means over the distribution of the data Y 22 Prior model selection Example 104 Suppose that in Example 103 we took 6 to have a NMoag prior distribution where both 0 and 03 are known For this choice it is easy to show do it that the joint distribution of Y and 6 is for values of y 01 and for foo lt 6 lt oo given by gugiln onZv l fY9376 7 x 2mm H2 yil where the suf cient statistic u ELI yi The marginal distribution of the data Y is computed by 00 6u67n667p402203 Iiiill 9fY6376d6 W 00 H which unfortunately does not exist in closed form Thus the posterior distribution will gm not exist in closed form either Why is it that when 6 gammaoz the posterior g6ly does exist in closed form but when 6 N NM0 03 it does not B ANOTHER QUESTION ln Examples 102 and 103 respectively why is that when we started with a beta gamma prior we ended up with a beta gamma posterior This happens because the beta and gamma priors were conjugate priors in those examples PAGE 56 CHAPTER 2 STAT 5137 J TEBBS 221 Conjugate priors TERMINOLOGY Let f fyy t9 6 E 9 denote a class of probability density mass functions indexed by the parameter 0 A class Q of prior distributions is said to be a conjugate family for f if the posterior distribution 90ly E g for all fyy0 E f and for all priors 90 6 9 Table 23 Some common conjugate families Family Parameter Conjugate family binomialnp p betaoz7 B Poisson A gammaoz7 6 M6 03 6 Mu 72 NM002 02 lnverse gammaoz exponential10 t9 gammaoz7 B multinomialnp p Dirichleta1a2ozk1 TERMINOLOGY The parameters that index the prior distribution are called hyper parameters For example7 the beta prior has two hyperparameters7 Oz and B The Bayesian approach we have outlined so often called the classical Bayesian approach requires that the researcher speci es the values of all hyperparameters There are related Bayesian approaches that do not require prior model hyperparameter selection OONJUGAOY The basic justi cation for the use of conjugate prior distributions is that they simplify the computations and that one can write out a closed form expression for the posterior distribution 90ly Thus7 in single parameter problems7 conjugate priors are chosen often for convenience However7 in multiple parameter problems7 conjugate priors may not exist Conceptually7 there is nothing to prevent one from using a nonconjugate prior in the general Bayesian approach In this case7 although it is not generally possible to write out a closed form expression for 90ly it may be possible to approximate it numerically or through the use of computer simulation PAGE 57 CHAPTER 2 STAT 5137 J TEBBS 222 Noninformative priors TERMINOLOGY When there is a general lack of a priori knowledge about the parame ters of interest7 prior models can be dif cult to choose It might also be desired for the prior distribution 90 to play a minimal role in determining the posterior distribution 90ly7 and7 hence7 the resulting inference Such distributions are called noninforma tive priors they are also referred to as vague diffuse or flat77 priors The rationale for using a noninformative prior is to let the data speak for themselves77 and to have the prior distribution contribute only minimally Example 105 Suppose that a researcher is interested in reporting an estimate for p7 the proportion of individuals who experience an allergic reaction to a new drug Since the drug is new7 we might have a genuine lack of knowledge as to where the distribution of p is likely located In this case7 one could take p N beta11 that is7 a beta prior with parameters 04 B 1 Recall that the beta11 distribution is the same as a 107 1 distribution see Figure 28 D Example 106 A medical investigation is undertaken to learn about the relationship between brain lesion frequency for patients with advanced multiple sclerosis A Poisson model is assumed for Y7 the number of brain lesions per subject A largely noninformative prior for 0 EY is a gammaa 126 100 distribution7 depicted in Figure 28 This distribution is relatively at over 07 oo7 so it will not have a large effect in determining the posterior 90ly D JEFFREYS PRIOR One approach that is sometimes used to elicit a noninformative prior distribution is to use the approach due to Jeffreys whose general principle states that any rule for specifying a prior distribution 90 should yield an equivalent result if applied to the transformed parameter b h07 for any one to one function ln 12 particular7 Jeffrey7s principle leads to specifying 90 x J0 azlog yltY6gt 6 0 92 and fyy 0 denotes the conditional model for Y7 given 0 7 where m 7 PAGE 58 CHAPTER 2 STAT 5137 J TEBBS 00 02 04 05 08 H 00 0102 03 04 05 m eta Figure 28 Examples of noninformative priors Top p N U01 from Example 105 Bottom 6 gamma12100 from Example 106 Example 107 Derive Jeffreys7 prior for the Bernoulli success parameter p SOLUTION The probability mass function for Y N Bernoullip7 for y 017 is given by fyy p py17 MW and thus 10E fyyp y logp 17 y10g17 p The rst and second derivatives of log fyyp are Mosh11m g 7171 6p p 17p and azlogf ym 7g7 17y 6p p2 1719 Thus7 02 10E fyY p Jp 7 7E 6132 p E51 Y PAGE 59 CHAPTER 2 STAT 5137 J TEBBS Figure 29 Je reys noninformative priorp beta1212 in Example 107 Thus7 Jeffreys7 prior is taken as 12 7 1 12 7 0471 1971 91 0lt lJpl m p 171 7 where 04 B 12 Of course7 we recognize this as the beta1212 kernel thus7 the beta1212 distribution is the Jeffreys7 noninformative prior for p D CRITICISMS The classical ie7 non Bayesian statistician seems to be bothered by the fact that the prior distribution for the parameters of interest needs to speci ed beforehand by the researcher Of course7 classical statisticians are ne with choosing a model for Y7 but they cringe when an additional model is attached to 0 I think this is duplicitous ln fact7 I would argue that in almost every real problem7 the researcher will have some information regarding 6 that can be conveniently included in the statistical model Furthermore7 if the amount of data observed is large that is7 large enough for the likelihood to dominate the prior model7 the prior7s contribution will be small If this occurs7 then the posterior distribution is likely not to be affected too greatly by the chosen prior model7 unless the prior is severely misspeci ed PAGE 60 CHAPTER 2 STAT 5137 J TEBBS 23 Bayesian point estimation REMARK We have seen that the Bayesian is primarily interested in the posterior distri bution g6ly This is true because by combining the prior model and information from the likelihood function the posterior distribution contains all the information regarding 6 However in practice numerical summaries of the posterior distribution 96ly are often desired POSTERIOR POINT ESTIMATION As in the classical framework we now consider constructing a point estimator for 6 focusing on measures of central location To describe the location ofthe posterior distribution g6ly the commonly used numerical summaries of location include the mean mode and median o The posterior mean of 6 is given by 33 E6lY y 69mm That is 63 is simply the rst moment of 6 computed with respect to the posterior distribution 0 The posterior mode of 6 is given by 52 are mgxgwlyl That is 6 is simply the value of 6 that maximizes the function g6ly o The posterior median is denoted by 63 and solves the equation 9713 05 139 my g6lyd6 That is 63 is the value of 6 that splits the area under the posterior density function into two halves NOTE If g6ly is perfectly symmetric a rarity in real life the three estimates 63 6g and 63 will be equal PAGE 61 CHAPTER 2 STAT 5137 J TEBBS 0 1 WW 2 00 005 010 015 020 025 030 00 005 010 015 020 025 030 QW 0 Mil 15 0 2 4 B 8 1012 00 005 010 015 020 025 030 00 005 010 015 020 025 030 Figure 210 Binomial beta Bayesian prior and posteriors in Example 102 Upper left p N beta1 19 prior Upper right Posterior distribution ofp when u 1 beta2 118 Lower left Posterior distribution ofp when u 5 beta6114 Lower right Posterior distribution ofp when u 15 beta16 104 The su eient statistic is u yi Example 108 Recall Example 102 see pp 48 50 where we considered the prevalence of HIV infection among male lVDU subjects In that example the prior distribution was p N beta1 19 Based on 100 male positivenegative statuses modeled as an iid Bernoulli sample we found the posterior distribution to be p N betau 1119 7 u where u y denotes the total number of positives among the 100 male subjects The table below gives the values of the posterior mean 153 mode 16 and median 153 for three different values of u The gure above depicts the three posterior distributions Prior gp Data Posterior 9p1y 153 16g 153 MLE beta1 19 u 1 beta2 118 00167 00085 00141 00100 beta1 19 u 5 beta6 114 00500 00424 00475 00500 beta119 u15 beta16104 01333 01271 01313 01500 PAGE 62 CHAPTER2 STAT51ampJTEBBS COMPUTATIONS To compute the posterior mean 153 note that the posterior distri bution is p N betau 1119 7 u so that u1 12039 173 EplY y To compute the posterior mode g we need to nd the value of p that maximizes the betau 1119 7 u posterior density ie the value of p that maximizes N120 U17 1187M Fu1P1207up p 9ply for xed u Because gply is unimodal the value ofp that maximizes gply is the same value of p that maximizes log gply log N120 7 log Nu 1 7 log N120 7 u u logp 118 7 ulog17 p This is true because logs is an increasing function To nd the maximizer we set the derivative of the log posterior equal to zero and solve for p that is we solve the following equation for p Oggloggplyg7llsiu pip 6p p 17p 118 Bquot To nd the posterior median 153 we must solve 173 r 120 u u 05 Pp S 19313 N p 1710118 dn 0 u 1P120 7 u for 153 This is just the 50th percentile of the betau 1119 7u distribution which can be easily found using software COMPUTING USING R I wrote the following R program to compute the three posterior estimators this code would be attached at the end of the binomial beta R code given previously see pp 51 notes Name Joshua M Tebbs Date 25 Sep 2007 Purpose Binomial beta Bayes point estimation PAGE63 CHAPTER2 STAT51ampJTEBBS Posterior mean p mean1lt u1a nab p mean2lt u2a nab p mean3lt u3a nab Posterior mode mode1lt u1a 1 nab 2 mode2lt u2a 1nab 2 pmode3lt u3a 1nab 2 quotU U Posterior median med1lt qbeta05 u1anb u1 med2lt qbeta0 5 u2anb u2 p med3lt qbeta0 5 u3anb u3 quotU U REMARK In the general binomial beta Bayesian setting outlined in Example 102 since the posterior distribution p N betau 0471 7 u B the posterior mean is easily computed that is 1 sawing pgltplygtdp 0 ua na It is insightful to note however that we can express 153 as A7 ua i n u a a P3mmgtm verifyl a weighted average of the maximum likelihood estimate 15 un and the prior mean aa B where the weights are a B and a a B respectively Note that the prior mean receives less weight as 71 increases This makes sense intuitively namely if we have a larger sample size we should weight the maximum likelihood estimate more and the prior mean less On the other hand if n is small then the maximum likelihood estimate may not possess enough information about p in this case we would want to weigh our prior beliefs more heavily PAGE 64 CHAPTER 2 STAT 5137 J TEBBS 24 Bayesian interval estimation DIATRIBE From a classical point of view we have at great length mostly in STAT 512 discussed the construction and interpretation of con dence intervals Recall that a 1001 7 a percent con dence interval for 6 includes this xed parameter 1001 7 a percent of the time in repeated sampling The word con dence77 is carefully but perplexingly chosen so that we dont haphazardly say the probability that our realized interval contains 6 is l 7 04 Remember that if 6 is regarded as a xed constant and if the observed con dence interval computed from the data 3 is lu the statement Pl lt 6 lt u 095 makes no sense because the event inside the probability symbol does not contain anything random Instead we remember that if we were to repeat the experimentstudy over and over again each time under identical conditions our interval estimation procedure would produce intervals that contain 6 approximately 1001 7 a percent of the time and the computed interval we obtained from the actual study is just an example of one of these intervals Personally I think this notion is rather confusing since in practice we only get to see the one realized con dence interval not a large number of them In addition novice statistics students and even experienced scientists nd the notion of repeated sampling7 inherent to the notion of a sampling distribution to be both frustrating and unintuitive For what it is worth I nd probability values which are too based on the notion of repeated sampling to be equally confusing RECALL We have already studied point estimation within a Bayesian context Bayesian point estimators are often a measure of central tendency of the posterior distribution 96ly such as a mean median mode or perhaps even some other functional deciding on which one is preferred requires a more theoretical discussion which we will eschew However while point estimates are nice it is important to report posterior uncertainty This can be done by using credible intervals also known as posterior probability intervals Such intervals are the Bayesian analogues of classical frequentist con dence intervals However their interpretation is quite different PAGE 65 CHAPTER 2 STAT 5137 J TEBBS CREDIBLE INTERVALS lf g6ly represents the posterior distribution of 67 given the data Y 117 then for any interval A 6L7 6U7 the credible probability of A is 9U P9 Aly 991yd9 9091306197 A 6L and A 6L76U is called a credible interval for 6 If 96ly is a discrete posterior distribution7 we simply replace the integral with a sum lf P6L lt 6 lt 6Uly 1 7 04 then we call 6L76U a 1001 7 04 percent credible interval INTERPRETATION The interpretation of a 1001 7 04 percent credible interval A 6L76U is the following quotThe probability that 6 is between 6L and 6U is 1 7 04 This is true since g6ly is the updated probability distribution of 6 given the data 3 Note that the interpretation of a Bayesian credible interval is far more straightforward than the interpretation of a classical con dence interval However7 the ease of interpreta tion comes with additional assumptions The Bayesian model requires more input than the classical model ie7 it requires the speci cation of a prior distribution 96 FINDING CREDIBLE INTERVALS A 10017a percent credible interval results when the credible probability of A 6L7 6U is 1 7 a Here are two popular ways to construct 1001 7 04 percent credible intervals 0 Simply take the endpoints of A to be the lower and upper 042 quantiles of g6ly this is called an equal tail ET credible interval 0 Take A to be the region of values that contain 1001 7 04 percent of the posterior probability in such a way that the posterior density 96ly within the region A is never lower than outside A this is called a highest posterior density HPD credible interval REMARK lntuitively7 if g6ly is unimodal and symmetric7 then the ET and HPD intervals are the same interval The ET interval is often preferred in practice because it is easier to compute PAGE 66 CHAPTER 2 STAT 5137 J TEBBS 0 1 Q 90 2 00 005 010 015 020 025 030 00 005 010 015 020 025 030 QW 0 Mil 15 0 2 4 B 8 1012 00 005 010 015 020 025 030 00 005 010 015 020 025 030 Figure 211 Binomial beta Bayesian prior and posteriors in Example 102 Upper left p N beta1 19 prior Upper right Posterior distribution ofp when u 1 beta2 118 Lower left Posterior distribution ofp when u 5 beta6114 Lower right Posterior distribution ofp when u 15 beta16104 The su eient statistic is u 11 Example 109 Recall Example 102 see pp 48 50 where we considered the prevalence of HIV infection among male lVDU subjects In that example the prior distribution was p N beta1 19 Based on 100 male positivenegative statuses modeled as an iid Bernoulli sample we found the posterior distribution to be p N betau11197u where u y The table below gives the 95 percent equal tail ET credible interval and the large sample 95 percent Wald interval for three different values ofu The gure above depicts the three posterior distributions Prior gp Data Posterior 9p1y 95 ET interval 95 Wald interval beta1 19 u 1 beta2 118 00020 00459 700010 00295 beta1 19 u 5 beta6114 00187 00953 00073 00927 beta1 19 u 15 beta16104 00788 01994 00800 02200 PAGE 67 CHAPTER 2 STAT 5137 J TEBBS COMPUTATIONS Credible intervals come directly from the posterior distribution gply Because the posterior distribution of p N betau 1119 7 u7 to compute the 95 percent equal tail ET interval7 we need to nd the values of pL and pU that satisfy the following integral equations 17L P 120 0025 1310 S PLly N p 17p118 dp 0 u 1P120 7 u betau11197u density and pU P 120 0975 Pp g pU y m pu1ip1187udp7 0 u 1P120 7 u respectively In other words7 we need to nd the lower and upper 0025 quantiles of the betau 1119 7 u distribution This type of calculation7 while seemingly intirnidating7 is easily done using software The classical 95 percent Wald interval for p is computed 15i1964M 39 100 39 as usual ie7 POINTS OF DISCUSSION 0 Perhaps the most glaring de ciency with the classical Wald interval is that its lower endpoint can be negative that is7 the interval itself can extend outside the parameter space 01 As a matter of principle7 I rarely report negative estimates for positive quantities Classical statisticians will clurnsily x this problem by truncating negative endpoints at 077 an ad hoc remedy that I nd to be somewhat illusory On the other hand7 Bayesian credible intervals can never extend beyond the parameter space because the posterior distribution is guaranteed to never fall outside of it It is also interesting to note how the prior information pulls the ET interval in the direction of the prior rnean here7 Ep 005 This is evident by looking at the ET intervals based on u 17 which is pulled to the right7 and u 157 which is pulled to the left PAGE 68 CHAPTER2 STAT51ampJTEBBS COMPUTING USING R I wrote the following R program to construct 95 percent cred ible intervals and the Wald intervals in this example this code would be attached at the end of the binomial beta R code given previously see pp 51 notes Name Joshua M Tebbs Date 2 Oct 2007 Purpose Binomial beta Bayes interval estimation Lower endpoint of 95 percent ET interval pL1lt qbeta0025u1anb u1 pL2lt qbeta0025u2anb u2 pL3lt qbeta0025u3anb u3 Upper endpoint of 95 percent ET interval pU1lt qbeta0975u1anb u1 pU2lt qbeta0975u2anb u2 pU3lt qbeta0975u3anb u3 Lower endpoint of 95 percent Wald interval waldL 1lt u1n qnorm0 975 sqrt uln 1 u1nn waldL 2lt u2n qnorm0 975 sqrt u2n 1 u2nn waldL 3lt u3n qnorm0 975 sqrt u3n 1 u3nn Upper endpoint of 95 percent ET interval waldU 1lt u1nqnorm0 975 sqrt uln 1 u1nn waldU 2lt u2nqnorm0 975 sqrt u2n 1 u2nn waldU 3lt u3nqnorm0 975 sqrt u3n 1 u3nn results1lt dataframepL1pU1waldL1waldU1 results2lt dataframepL2pU2waldL2waldU2 results3lt dataframepL3pU3waldL3waldU3 25 Bayesian hypothesis tests REMARK On an elementary level7 hypothesis testing is far less formal within a Bayesian framework than it is within the classical framework In fact7 many Bayesians do not even advocate their use7 and7 instead7 simply summarize the posterior distributions eg7 with PAGE69 CHAPTER 2 STAT 5137 J TEBBS pointinterval estimates without applying a rigid test Thus our discussion of hypothesis tests will not be as rigorous as it was within a classical mind set last chapter It is possible to make hypothesis testing much more rigorous by embedding it within a decision theoretic construct but that is not the approach we take here SETTING Suppose that Y1Y2Yn is an iid sample from fyy6 where 6 E Q and where the prior distribution 6 96 Within a Bayesian framework suppose that we would like to test the hypothesis H0 3 6 E 90 VSI SUS Ha e a where Q 90 U 91 As we have already learned for the Bayesian all inference is carried out using the posterior distribution g6ly This is a valid probability distribution so the probabilities PH0 is truely 139 6 my g6lyd6 no and PHa is truely 139 6 my g6lyd6 m make perfect sense and can be computed exactly As for a decision rule the Bayesian can choose to reject H0 when P9 E oly lt P9 E Qaly ie reject H0 when P6 E Qoly lt 05 If one wants to heavily guard against erroneously rejecting H0 one can choose this threshold probability to be very small say 01 or 001 REMARK It is worth noting that statements like PH0 is truely and PHa is truely make no sense to the classical statistician because 6 is a nonrandom quantity Taking this perspective the classical statistician is interested in using tests with good size and power properties On the other hand concepts like Type 1 Error probability77 ie size and power do not make sense to the Bayesian PAGE 70 CHAPTER 2 STAT 5137 J TEBBS 0 1 WW 2 00 005 010 015 020 025 030 00 005 010 015 020 025 030 QW 0 Mil 15 0 2 4 B 8 1012 00 005 010 015 020 025 030 00 005 010 015 020 025 030 Figure 212 Binomial beta Bayesian prior and posteriors in Example 102 Upper left p N beta119 prior Upper right Posterior distribution ofp when u 1 beta2 118 Lower left Posterior distribution ofp when u 5 beta6114 Lower right Posterior distribution ofp when u 15 beta16104 The su eient statistic is u 11 Example 1010 Recall Example 102 see pp 48 50 where we considered the prevalence of HIV infection among male lVDU subjects In that example the prior distribution was p N beta119 Based on 100 positivenegative statuses modeled as an iid Bernoulli sample we found the posterior distribution to be p N betau 1119 7 u where u 23 11 Suppose that we would like to test H0 p S 005 versus Ha p gt 005 The table below gives the posterior probabilities 131 0 S 00513 for three different values of u The gure above depicts the three posterior distributions Prior gp Data Posterior 9p1y 131 0 S 00513 beta119 u 1 beta2 118 09837 beta119 u 5 beta6114 05502 beta1 19 u 15 beta16104 00003 PAGE 71 CHAPTER2 STAT51ampJTEBBS COMPUTATIONS The posterior probability 1310 S 005ly is computed using the pos terior distribution gply Because the posterior distribution of p N betau 1119 7 u7 it follows that 005 lt 39 u 7 11871 131 7 0 0513 0 WM Drum imp 1 p dp betau11 197M density which7 again7 is easily computed using software COMPUTING USING R I wrote the following R program to compute the posterior probability 1310 S 005ly in this example this code would be attached at the end of the binomial beta R code given previously see pp 51 notes Name Joshua M Tebbs Date 2 Oct 2007 Purpose Binomial beta Bayes hypothesis tests PHO is true computed with respect to gpu prob1lt pbeta005u1anb u1 prob2lt pbeta005u2anb u2 prob3lt pbeta005u3anb u3 SUMMARY The purpose of this chapter has been to expose you to the Bayesian para digm Having done this7 it should be noted that while the Bayesian approach appears to be markedly more dif cult computationally especially when one does not use conjugate priors7 this turns out to be not true as long as one has suf cient computing resources There is no right77 approach to statistics7 and we are way beyond the frequentist Bayesian showdown In some problems7 a Bayesian approach is very sensible In other problems7 a Bayesian approach may not be necessary It suf ces to say that we have just scratched the Bayesian surface77 with our discussion here REFERENCE For a very readable exposition on the analysis of Bayesian data7 I highly recommend the following text 0 Gelman et al 2004 Bayesian Data Analysis7 2nd Edition7 Chapman and HallCR07 Boca Raton PAGE72 CHAPTER 3 STAT 5137 J TEBBS 3 Regression Models Complementary reading Chapter 11 and Appendix A 31 Introduction IMPORTANCE A problem that often arises in economics engineering and biological settings is that of investigating the mathematical relationship between two or more quantitative variables Depending on the nature of the variables the methods of re gression analysis or correlation analysis are appropriate We start by examining regression models SETTING We want to model a response variable Y as a function of one or more independent variables say 1z2xk Towards this end consider the following regression model Y g1x2 6 where Ee 0 and Ve 02 You see that this model consists of two parts namely 1 the deterministic part Y g12 2 the stochastic part 6 where Ee 0 and Ve 02 REMARK It is unreasonable to think that Y will be perfectly related to 12 through 9 The random error E conveys the fact that there will not be a perfect relation ship lt is important to realize that in this regression model the independent variables 12 zk are xed they are not random and they are measured Without error RESTATED Because 12 zk are xed and not random it follows that Ele1x2xk g12 and Vle1z2 02 That is the conditional expectation of Y given 12 xk is equal to 9z1 2 zk and the conditional variance of Y does not depend on 12 xk PAGE 73 CHAPTER 3 STAT 5137 J TEBBS STRAIGHT LINE MODEL Suppose that k 1 ie we only have one independent variable say s and that the relationship between Y and x is in fact a straight line We may write this as Y 50 51 67 where Ee 0 and Ve 02 Here g 60 n is a straight line with intercept 60 and slope 61 The slope 61 quanti es the change in the mean of Y brought about by a one unit change in x TERMINOLOGY We call 60 and 61 regression parameters These parameters are assumed to be xed and unknown thus they must be estimated with data Recall that in this setting z is xed and that E is random Clearly Y is then also random and it follows that EYl 60 6196 Vle 02 Thus for a given value of x we would expect Y to equal g 60 61 however due to random variation eg biologicalmeasurement error etc we see Y values dispersed about g 60 61x where the amount of dispersion depends on the value of 02 ADDITIONALLY ln regression models it is common to assume that E N N0 02 that is the error term is normally distributed with mean zero and variance 02 It follows immediately that for a given s Y N Nlt O zl Example 111 Many shes have a lateral line system enabling them to experience mechanoreception the ability to sense physical contact on the surface of the skin or movement of the surrounding environment The frequency number per second of elec trical impulses El emitted from one particular sh is measured at several temperatures measured in Celcius and the data are plotted in Figure 313 PAGE 74 CHAPTER 3 STAT 5137 J TEBBS 310 290 Frequency 270 250 Temperature Figure 313 E frequency at di erent temperatures with a straight me t OTHER MODELS Nowhere does it say that we have to restrict ourselves to straight line models For example7 it could be that the true model is Y o 2 6 or Y 50 51522 533 6 9w 9W Even though quadratic and cubic equations do not represent straight lines7 they still fall into a general class of models called linear models NONLINEAR MODEL In some situations7 a nonlinear model may be appropriate For example7 suppose that Y is a measure of plant growth and x denotes time and that we want to study how Y relates to x Here7 we would eventually expect the relationship to level Oh7 when x gets large7 as plants can not continue to grow forever A popular model for this situation is the logistic model given by Y 501 le rl 6 9w as this 9 function would begin to atten out77 for large values of x if g lt 0 PAGE 75 CHAPTER 3 STAT 5137 J TEBBS Frequency N x O l Temperature Figure 314 E frequency at di erent temperatures with a quadratic t NOTE In the quadratic model with g 60 61x gzz note that7 although g is no longer a straight line7 the regression parameters 6061 and g enter in a linear fashion Contrast this with the logistic growth model with g 601 le 1 This function is not linear as a function of 6061 and g rather7 it is better described as nonlinear 32 Linear statistical models TERMINOLOGY Denoting 606162 k and w 17x17z2 zk the model Y o 1 239 kk w 6 is called a linear statistical model Here7 1727 W denote k independent variables also known as covariates7 predictors7 or explanatory variables7 Y denotes the response7 E is a random error term7 and 60 61 Bk are regression parameters Throughout PAGE 76 CHAPTER 3 STAT 5137 J TEBBS our discussion we will assume Ee 0 and Ve 02 so that 590 5051152239 5kk VYlw i 02 MATHEMATICAL DESCRIPTION The term linear statistical model77 refers to how each covariate term xi j 12 k enters the regression equation It does not refer to the shape of the true regression function A linear statistical model77 means that the true regression function g is linear in the parameters Mathematically this means that the k 1 partial derivatives 59507517 75k 55 do not depend on 6061 k For example each of the models is a linear model Y 50 51 6 HP 91 Y 50 51 522 6 W W Y 50 51 52 5312 6 91112 Y 60 61 log zl zzl sinzg 6 V 9112 However the logistic regression model with g 601 6165264 is not linear since 1L 350 1515f92m 71516 l which is neither free of 61 nor g GOALS For the linear regression model Y w 6 we want to 0 make inference about the parameters 606162 k o diagnose the t77 of the model 0 make predictions about future values of Y We will start with the simple linear regression setting where k 1 and g BO lz and later extend our ndings to multiple linear regression settings where k gt 1 PAGE 77 CHAPTER 3 STAT 5137 J TEBBS 33 Simple linear regression and least squares TERMINOLOGY When we say t a regression model7 we mean that we would like to estimate the parameters in the model with observed data The method of least squares provides a way to do this For each individual in a sample of 71 individuals we collect mlYi 239 1 2 n and postulate the simple linear regression model Yi 50 51 613 where 6 iid N0 02 We wish to t this model by estimating the intercept and slope parameters 60 and 61 respectively THE METHOD OF LEAST SQUARES ln regression the most widely accepted method of estimating the model parameters 60 and 61 is that of least squares For each mlYi and for given values of 60 and 61 note that the quantity 339 Yi 50 51 measures the vertical distance from the observation Y to the line 60 61 see Figure 315 The method of least squares selects the values of 60 and 61 that minimize Wm Zac 7 60 61 i1 We denote the least squares estimators by BO and 31 respectively OALOUL US A two variable minimization argument is used to nd the form of the estimators BO and 31 Taking partial derivatives of Q o l we obtain 6Q o l 2 5676076196020 1 550 1 QQW 75 n 55 2 i 50 i i t 0 Solving for 60 and 61 and verifying second order conditions one can show that A 7 31 2me m2 PAGE 78 and BO 7 7 31 CHAPTER 3 STAT 5137 J TEBBS l l l l l 100 110 120 130 140 150 160 X Figure 315 A seatterplot with straight lirie arid residuals associated with the straight lirie NOTE Algebraically7 it follows that A TL i i 7 Yi i TL i i 7 Yi n 61 2149 zgtlt 7 2 gt Elia 2 2 Zi1i 7 m Zi1xi n 11 where the constant 1139 39 7 i That is7 31 can be expressed as a linear combination of Y1 Y2 Yn ADDITIONALLY The following facts are useful for calculating the least squares esti mates by hand both facts also follow from simple algebra rt 2 TXYZ39 7 i1 t Yi 1 i li rt i H and 2 2ziif22zlzi i1 i1 i1 Of course7 in practice7 we use statistical software instead of hand calculation PAGE 79 CHAPTER 3 STAT 5137 J TEBBS OZRate Temperature Figure 316 Bird omygen rate data for dz erent temperatures Example 112 The following data are rates of oxygen consumption of birds Y mea sured at different temperatures The temperatures were set by the investigator7 and the 02 rates7 Y7 were observed for these particular ternperatures x deg C 718 715 710 75 O 5 10 19 ym1ghr 52 47 45 36 34 31 27 18 The scatterplot ofthe data appears in Figure 316 The least squares line is superimposed over the data The calculations which follow show us how to compute by hand the least squares line HAND CALCULATION We have n 8 8 8 2y 29 y3625 2y 11404 i1 i1 8 8 8 95 714 2 7175 95 1160 71504 i1 i1 i1 PAGE 80 CHAPTER 3 STAT 5137 J TEBBS H H H H Thus7 we obtain A 79965 7 700878 ELM 7 22 11355 and BO y i 312 3625 7 7008787175 34714 Finally7 the least squares regression line is given by 34714 7 0053753 This tted line is superimposed onto the scatterplot in Figure 316 D 34 Properties of leastsquares estimators from simple linear regression INTEREST We wish to investigate the properties of 30 and 31 as estimators of the true regression coef cients 60 and 61 Throughout7 we assume that K 60 61 6139 for 239 127717 where 61 N iid N0702 FACT The least squares estimators BO and 31 are unbiased estimators of 60 and 61 respectively that is7 EBO 60 and 61 Proof Recall that 31 can be written as 31 21 aili where the constant iif 321 i f 1139 Thus7 Z MEGi Zai60 51 50 20139 51 201 i1 i1 i1 i1 PAGE 81 CHAPTER 3 STAT 5137 J TEBBS It is easy to show that 2 a 0 and 21 am 1 Thus E l 61 as claimed Now EltBOgt E7 i 31 E7 EE3150 51 i 51 50 The last step is true because E7 60 61 verify and because 31 is an unbiased estimator of 61 Thus the least squares estimators are unbiased It is important to note that normality on the errors was never assumed in this argument If the simple linear regression model is the correct model a suf cient condition for the least squares estimators to be unbiased is that 0 D VARIANCE The variances of the least squares estimators are given by A 7 x A 2 1 V o 02 and V l 039 n Zi1xi 7 W Zi1i 7 m2 LP rom these formulae it is obvious that precise estimators result when SLAM 7 E2 is large this fact can be exploited before data are collected Proof Recall that 31 can be written as 31 ELI aiYi where the constant Thus Eleg mwzgl l as claimed Now we can write Who w 7 312 w 2mg 7 220m 31 Because 02 for each 239 it follows that V7 0271 Also it can be shown verify that Cov7 Al 0 Thus 2 72 72 A 7 L 2 95 7 2 l 95 mo n H lmmzrwl quot ln mel 2 ELK TV 7112 039 n ELK E 2 21 n ELK 7 EV 039 PAGE 82 CHAPTER 3 STAT 5137 J TEBBS as claimed It is important to note that normality on the errors was never assumed in this argument If the simple linear regression model is the correct model7 these results follow simply from the assumptions that 0 and VEi 02 D COVARIANC39E The covariance between 30 and 31 is given by COV307 1 72 39 Proof Simply write 30 7 7 31 so that COVBO Bl Cov7 7 31231 Cov7 1 7 ail The result follows because Cov7Bl 0 and VBl UZZ1xi 7 E D NORMALITY The least squares estimators BO and 31 are normally distributed Proof Recall that 31 can be written as 31 ELI aili where the constant M 7 ZLW T In the simple linear regression model7 K is a linear combination of 6139 which is assumed 1139 to be normally distributed Thus7 K is normally distributed Thus7 31 is normally distributed since it is a linear combination of Y1 Y2 Y That 30 is normally distributed follows because 30 7731 that is7 30 is a linear combination of 7 and 31 The sample mean 7 is normally distributed because it is a linear combination of YhYZ Yn7 and we have just argued that 31 is normally distributed D SUMMARY In the simple linear regression model K 60 61 6139 for 239 127717 where 61 iid J07 02 we have shown that 30 N Nlt o7 C0002 and 31 N N617011027 where 21 1 COO 7 and on 7 n ZLW 96 ZLW 96 PAGE 83 CHAPTER 3 STAT 5137 J TEBBS 35 Estimating the error variance PROBLEM In the simple linear regression model Y 60 61 6 for 239 12 n where 6 iid N002 we have shown that the least squares estimators BO and 31 are unbiased estimators Now we turn our attention to 02 the error variance FACT For the simple linear regression model an unbiased estimator for 02 is given by qgt 7 H M lt l ltgt K where 30 31 That is E67 02 Proof See WMS pp 548 9 E NOTES 0 Your authors denote the unbiased estimator of 02 by 82 I dont like this notation because it is easily confused with our usual de nition of S2 as the sample variance 0 The quantity is called the tted value for the 2th observation The quantity Y is called the observed value The quantity Y 7 is called the residual for the 2th observation Refer to Figure 315 o For reasons that will become obvious later we call the quantity V L SSE 2 204 7 Y i1 the error sum of squares 0 An argument beyond the scope of this course can be used to show that when the simple linear regression model is the correct model SSE i n 7 2 FiTNXZW Ql An argument beyond the scope of this course can be used to show that when the simple linear regression model is the correct model SSE and hence 32 is independent of both 30 and 31 PAGE 84 CHAPTER 3 STAT 5137 J TEBBS 36 Inference for g and 51 INTEREST Because 30 and 31 are point estimators for the true model parameters 60 and 61 respectively we might also want to write con dence intervals for 60 andor 61 or perform the corresponding hypothesis tests Recall that the slope estimator 51 N Nlt 17011027 where 011 1 SLAM 7 E Standardizing we have 31 1 Z V 01102 N01 Recall also that n 7 2 2TX2n72 In addition because SSE 7172 is independent of 31 Z and X2 are also independent Thus the quantity 3151 B1i 1VC1102 N t A V 01102 n702272 7 tn 7 2 is pivotal so we can write 31 i 51 i P Maud2 lt W lt tn720 2 1 5 where tnq g denotes the upper 042 quantile of the tn 7 2 distribution Rearranging the event inside the probability symbol we can trap 61 between two random endpoints with probability 1 7 a These endpoints delimit a con dence interval in particular the 1001 7 04 percent con dence interval for 61 is 51 i tn720 2 V 01132 If our interest was to test at level a H0 3 51 510 versus Ha 3 51 7 5107 PAGE 85 CHAPTER 3 STAT 5137 J TEBBS where 619 is a xed value we would use It Bl 510 V 61132 as a test statistic and reject H0 whenever It falls in the two sided rejection region RR 3 gt Eliza2 One sided tests are also possible Probability values are computed as areas under the tn 7 2 distribution NOTE In practice inference for the slope parameter 61 is of primary interest because of its connection to the predictor variable x in the model Inference for 60 is usually less meaningful unless you are explicitly interested in the mean of Y when x 0 ANALOGOUSLY A completely analogous argument can be used to show that where 600 ELI Sign 7 EV and that a 1001 7 04 percent con dence interval for 60 is 50 i tn72oz2 V 00032 In addition a level 04 test of H0 3 50 500 versus Ha 3 50 7 500 can be performed using A M V 60032 as a test statistic and and rejecting H0 whenever It falls in the two sided rejection region It RR 3 gt Eliza2 One sided tests are also possible Probability values are computed as areas under the tn 7 2 distribution PAGE 86 CHAPTER 3 STAT 5137 J TEBBS Example 113 For the bird oxygen rate data in Example 1127 we have n 87 BO 347147 and 31 700878 Additional hand calculation shows that COO 012787 011 m 1113557 and 32 m 0028 The necessary critical value is 1239025 2447 A 95 percent con dence interval for 61 is given by 31 i 1239025 V 01132 gt 700878 i 2447x0028113557 or 700999700755 In this example7 61 has units of the rate of change of oxygen consumption per unit change in temperature ie7 rnlghr per degree Celsius Hence7 given a one degree increase in temperature7 we are 95 percent con dent that the change in mean oxygen consumption rate is between 700999 and 700755 rnlghr Note that this interval does not include 077 so H0 61 0 would not be rejected at the 04 005 level that is7 temperature is important in describing oxygen consumption A 95 percent con dence interval for 60 is given by 30 i t60V025 01132 gt 34714 i 2447 0127800287 or 332167 36185 Thus7 we are 95 percent con dent that the mean oxygen consumption rate when the temperature z 0 is between 33216 and 36195 rnlghr D 37 The analysis of variance for simple linear regression IDENTITY Algebraically7 it can be shown that n 20 e 7 07 e 7 20 e 17 i1 i1 i1 SST SSR SSE o SST denotes the total sum of squares SST measures the total variation in the data Y1Y2Yn o SSR denotes the regression sum of squares SSR measures the variation in the data explained by the straight line model PAGE 87 CHAPTER 3 STAT 5137 J TEBBS Table 34 Analysis of variance table for simple linear regression Source df SS MS F SSR MSR Regression 1 SSR MSR i T F i m Error 71 i 2 SSE MSE n72 Total n 7 1 SST o SSE denotes the error sum of squares SSE measures the variation in the data not explained by the straight line model ANOVA TABLE We can combine all of this information to produce the analysis of variance ANOVA table Such tables are standard in regression analysis NOTES ON THE GENERAL ANOVA TABLE STRUCTURE o The degrees of freedom df add down SST can be viewed as a statistic that has lost77 a degree of freedom for having to estimate the overall mean of Y with the sample mean 7 There is only one degree of freedom associated with SSR since there is only one independent variable The degrees of freedom for SSE can be thought of as the divisor needed to create an unbiased estimator of 02 Recall that SSE EZEMSE n72 is an unbiased estimator of 02 o The sum of squares also add down This follows from the algebraic identity noted earlier 0 Mean squares are the sums of squares divided by their degrees of freedom 0 The F statistic is formed by taking the ratio of MSR and MSE F STATISTIC In the ANOVA table for a regression model7 the F statistic is used to test whether or not at least one independent variable adds to the model In the simple PAGE 88 CHAPTER 3 STAT 5137 J TEBBS linear regression setting of course there is only one independent variable Thus the F statistic is used to test H0 3 81 0 versus Ha 3 81 7 JUSTIFICATION More advanced linear model arguments that we will not discuss show the following 0 When H0 61 0 is true 8723 21 o Regardless of whether or not H0 61 0 is true SS E V N X201 2 SSR and SSE are independent CONCLUSION Putting this all together we can say that when H0 is true MSR 731 FMWNFLTL72 Thus we can use the F1n 7 2 distribution to test H0 61 0 versus H 61 31 0 Even though this is a two sided Ha we place the entire rejection region in the upper tail we do this because of the way the F statistic is motivated That is the rejection region is given by RR F F gt quy where FUR denotes the upper 04 quantile of the F1n 7 2 distribution Probability values are computed as right tail areas under the F1 71 7 2 density EQUIVALENC39E The F test of H0 61 0 versus Ha 61 31 0 based on the ANOVA table is equivalent to the twosided t test of H0 3 51 510 versus Ha 3 51 7 510 PAGE 89 CHAPTER 3 STAT 5137 J TEBBS when 619 0 This follows because A 2 t2 7 61 i MSR i F V Elli72 verifyl The t test is a more exible procedure than the ANOVA based F test if interest lies in drawing inference about 61 With a t procedure one can specify a nonzero value of 619 in H0 and use a one sided alternative The F test on the other hand only allows a two sided alternative with 619 0 COEFFICIENT OF DETERMINATION Since SST SSR SSE it follows that he proportion of the total variation in the data explained by the model is 2 7 SST7 this statistic R2 is called the coe icient of determination Clearly 0 S R2 S 1 The larger the R2 the better the deterministic part of the regression model here 60 61 explains the variability in the data IMPORTANT It is critical to understand what R2 does and does not measure Its value is computed under the assumption that the simple linear regression model is correct and assesses how much ofthe variation in the data may be attributed to that relationship rather than just to inherent variation If R2 is small it may be that there is a lot of random inherent variation in the data so that although the straight line is a reasonable model it can only explain so much of the observed overall variation Alternatively R2 may be close to 1 but the straight line model may not be the best model In fact R2 could be very high but not relevant because it assumes the straight line model is correct In reality a better model may exist eg a quadratic rnodel nonlinear logistic rnodel etc HAND COMPUTATION Algebraically it can be shown that SSR Bf z i 2 n H verifyl This makes the ANOVA and R2 hand calculation easier PAGE 90 CHAPTER 3 STAT 5137 J TEBBS Table 35 Analysis of variance for the bird orygen rate data in Example 112 Source df SS MS F Regression 1 8745 8745 308927 Error 6 0170 0028 Total 7 8915 Example 114 With the bird oxygen rate data of Example 1127 we nd the associated ANOVA table It follows that SSR 7 7008782 gtlt 11355 8745 8 SST 2m 7 y 8915 11 and SSE SST 7 SSR 8915 7 8745 01707 by subtraction The complete ANOVA table appears above If the researcher wants to test H0 61 0 no linear trend versus Ha 61 74 07 it takes neither a genius nor an F table to see that there is overwhelming evi dence against H0 that is7 oxygen rate and temperature are strongly linearly related The R statement 1 pf308927 16 gives the right tail probability value P 00000022 The coef cient of determination is given by 2 SSR 8745 7 i 7 0981 R SST 8915 That is7 981 of the variability in the oxygen data is explained by temperature D 38 Con dence intervals for linear functions of g and 51 GOAL Consider our simple linear regression model K 60 61 6139 where 6 iid 07 02 In many settings7 it is of interest to make statements about 9 0050 01517 a linear combination of the regression parameters a0 and al are xed constants PAGE 91 CHAPTER 3 STAT 5137 J TEBBS POINT ESTIMATOR Using the least squares estimators of 30 and 31 as point estimators for 60 and 61 respectively the point estimator for 6 becomes 9 0050 0151 PROPERTIES First it is easy to see that 6 is an unbiased estimator for 6 since E09 00E30 alei 0050 0161 9 lt is also possible to show verify that 2 A E TL 2a272aaf V6EUU2 nzl1nl 17201 39 3195139 T Finally since 6is a linear combination of 30 and 31 both of which are normally distrib uted we have that 6 N6U INFERENCE The variance of 6 that is 0 depends on the unknown 02 An estimate of a is given by a 82 ELI a 7 2a0a1z 9 ELW EV 7 where recall 32 MSE Now whereas Lo 2 0 1 W N gt you can quickly convince yourself that 6 i 9 t A tn 7 2 6 Since t is a pivotal quantity a 1001 7 04 percent con dence interval for 6 is 6 t mkgyg g Tests of hypotheses concerning 6 use the tn 7 2 probability density function SPECIAL CASE CONFIDENCE INTERVALS FOR THE MEAN OF Y One very useful application of the preceding result is in the problem of estimating the mean value of Y at a xed value of x say f In our simple linear regression setting we know that 80 lif PAGE 92 CHAPTER 3 STAT 5137 J TEBBS However EYl o l is just a linear combination itself with 10 1 and a1 f With 10 1 and a1 f straightforward algebra shows that 175 21 a 7 2a0a1 i 1 f 7 EV ELK 7 32 n ELK E Thus a 1001 7 04 percent con dence interval for EYl 60 l the mean response of Y for a xed f is given by 30 31f i tTLintZ gtlt E g standard error INTERVAL LENGTH The con dence interval for EYl BO l will be different depending on the value of f In fact the expression for 3 above will be smallest when f E and will get larger the farther f is from E in either direction This implies that the precision with which we estimate EYl 60 l decreases the farther we get away from E This makes intuitive sense7we would expect to have the most con dence77 in our tted line near the center77 of the observed data Thus if the tted regression line is used to estimate means for values of z besides those used in the experiment it is important to use a range of x values which contains the future values of interest EXTRAPOLATION It is sometimes desired to estimate EYl 60 l based on the t of the straight line for values of f outside the range of x values used in the experiment This is called extrapolation and can be very dangerous In order for our inferences to be valid we must believe that the straight line relationship holds for x values outside the range where we have observed data In some situations this may be reasonable in others we may have no theoretical basis for making such a claim without data to support it Example 115 In Example 112 suppose that our researcher desires to get a 95 percent con dence interval for the mean oxygen rate of birds in an environment at f 25 degrees Celcius She is thus interested in the linear combination 60 2561 PAGE 93 CHAPTER 3 STAT 5137 J TEBBS Using the calculations in Examples 112 and 1147 the desired con dence interval is 1 25 7 7175 2 34714 7 0087825 i 2447 gtlt 0028 7 M 8 11355 or 3097334065 mlghr Thus7 we are 95 percent con dent that Ele 25 60 256L7 the mean oxygen rate of birds living in a 25 degree Celcius environment7 is between 30973 and 34065 mlghr D 39 Prediction intervals for a future Y using simplelinear re gression A DIFFERENT PROBLEM Sometimes7 depending on the context7 we may not be interested in the mean EYl 60 lxi but rather in the actual value of Y we may observe when x f On the surface7 this may sound like the same problem7 but they are very different For example7 consider a stockbroker who would like to learn about the value of a stock based on previous data In this setting7 the stockbroker would like to predict or forecast the actual value of the stock7 say7 Y7 that might be observed at time x f The stockbroker likely does not care about what may happen on the average77 that is7 he is probably not concerned with estimating EYl 60 l REMARK In this kind of situation7 we are interested not in the population mean EYl BO lz but rather the actual value that might be taken on by the random variable Y In the context of our model7 we are interested in the future77 observation Yl 50 51f 617 where 6 is the error77 associated with Y that makes it differ from the mean 60 l It is important to see that Y is not a parameter but rather is a random variable We do not wish to estimate a xed parameter instead7 we wish to predict a random quantity POINT PREDICTOR Our point predictor for Y is given by 57 Bo Bif PAGE 94 CHAPTER 3 STAT 5137 J TEBBS This is the same quantity as before when we were estimating the mean Ele 60 61 However we use a different symbol in this context to remind ourselves that we are interested in predicting Y not estimating Ele We call 57 a prediction or forecast to make the distinction clear Of course just as we do in the estimation of xed parameters we would still like to have some idea of how well we can forecast To get an idea we would like to characterize the uncertainty that we have about 57 as a point prediction for Y lntuitively there will be two sources of error 0 part of the error in 57 arises from the fact that we do not know 60 and 61 and that they must estimated from the observed data 0 additional error arises from the fact that Y itself is a random variable so its value varies itself even if we were to know the exact values of 60 and 61 Thus additional uncertainty is introduced because we are trying to characterize a quantity that itself is uncertain You will note that this additional uncertainty did not arise when we were trying to estimate Ele ERROR IN PREDICTION The argument on pp 563 WMS shows that the error in prediction 57 7 Y has the following sampling distribution 577YN0021Z i LAW 7 EV Comparing the variance 57 7 Y to the variance of R 30 Elf in the last section we see that there is an extra 177 added on This accounts for the additional variation arising from the fact that Y itself is a random quantity and we have to predict its value Standardizing Ytyl N01 ww 2 l 7 7 1 n 231m752 Using the fact that n 7 33202 X201 7 2 it is not dif cult to argue consequently that 57 7 Y tn 7 2 A2 l 1 7 1 n f macaw PAGE 95 CHAPTER 3 STAT 5137 J TEBBS Using It as a pivot7 it follows that is a 1001 7 04 percent prediction interval for Y Example 116 In Example 1127 suppose that our researcher desires to get a 95 percent prediction interval for a particular bird in an environment at f 25 degrees Celcius compare this with Example 115 This prediction interval is given by 1 25 i 7175 2 34714 7 0087825 i 2447 gtlt 0028 1 g or 28123369187 Thus7 we are 95 percent con dent that the oxygen rate for this bird will be between 28123 and 36918 One will note that the prediction interval for a single bird at f 25 is wider than the con dence interval for Ele 25 This is because of the additional variability arising from having to predict a random variable7 Y7 rather than estimating the mean Ele D 310 Correlation EXPERIMENTAL DATA In our discussion of regression models up until now7 the in dependent variable z has been best regarded as xed This assumption is plausible in designed experiments7 say7 where the investigator has control over which values of x will be included in the experiment For example7 o z dose of a drug7 Y change in blood pressure for a human subject 0 z concentration of toxic substance7 Y number of mutant offspring observed for a pregnant rat 0 z temperature7 Y frequency of electrical impulses IMPORTANT When the independent variable x is best regarded as xed7 the methods of regression are appropriate PAGE 96 CHAPTER 3 STAT 5137 J TEBBS OBSERVATIONAL DATA In many research problems we do not get to choose the values of z to be analyzed rather we merely observe X Y for each individual in the study In this case the independent variable X is best regarded as random For example 0 X weight Y height of a human subject 0 X average heights of plants in a plot Y yield 0 X SAT score Y rst year college GPA IMPORTANT When the independent variable X is best regarded as random the meth ods of correlation are appropriate In this case the observation that we record for an individual subject is best regarded as a realization of the random vector X Y A popular model for X Y is the bivariate normal distribution RECALL The random vector X Y is said to have a bivariate normal distribution if its joint pdf is given by 6 62 9571 E R2 4 n 2 17 2 fxy9c7yi My 0 otherwise Q 12 xux22pltzux yiuy gm2 17p 0X 0X Uy UY ESTIMATION Suppose that X1 Y1 X2 Y2 Xn Y is an iid sample of size n from where a bivariate normal distribution w1th marginal means x and My marginal variances TX and 032 and correlation p The likelihood function is given by LltMX7 MY7 0 0327 Pl y H fXY95i7yi i1 1 57 mica27 27T039XU39yxl 7 p2 where w 12 sn 3 y1y2yn and 1 ii 2 ii 1quot zquot 2 Q1 2 MX 72p MX 21 My 2 My 17p 0X 0X Uy UY PAGE 97 CHAPTER 3 STAT 5137 J TEBBS A ve variable maximization argument shows that the maximum likelihood estimators are given by MY m7 A 1 i 0 ZX1X2 i1 A 1 7 Uf ZYi YZ i1 Z1Xi iyzquot7 2109 XV 2100 7 7 HYPOTHESIS TESTS Under the bivariate normal model a complicated conditioning argument can be used to show that EYlX 96 60 6196 where UY 51 Plti UX 50 MY BIMX This result is helpful and here is why Suppose that it is desired to test H0 p 0 versus Ha p 71 0 Since TX and Uy are both positive it follows that p 0 ltgt 61 0 Thus we can perform a hypothesis test involving p by using the test statistic 31 Wm V0117 1 i 32 and comparing t to the tn 7 2 distribution REMARK It may also be of interest to test H0 p p0 versus an alternative where p0 71 0 In this case one is forced to appeal to a large sample result based on the Fisher Z transformation The method is based on the result that 1 1 3 1 1 p 1 WA fl 7A fl 7 P 2nlp AN2H1ip 7173 PAGE 98 CHAPTER 3 STAT 5137 J TEBBS when n is large This asymptotic result allows us to test H0 p p0 by computing 1 1 1 1P0 ZA 73 71 7 771 7 p n 1211175 211170 7 and comparing Z to the standard normal distribution CONFIDENCE INTERVAL Because is an estimator of the population parameter p it would be desirable to report along with the estimate itself a con dence interval for 1 1 1 1p 1 Willi3 MlEWfJlml when n is large we can rst construct a large sample 1001 7 04 con dence interval for 1 W 2411 2 17p Denote this interval is given by Wi W where W and WL are given by 1 1 W W 7za21n73 and W W za2 n73 We can then transform the endpoints W and WL to obtain the endpoints for the con p Since dence interval for p Applying the necessary inverse transformation Wfl tanhp the hyperbolic tangent function we obtain eXp2W 7 1 eXp2W 7 1 t hW t hW an Man U 0r expam y pww as a large sample 1001 7 04 percent con dence interval for p Example 117 The following data are measurements on wing length X and tail length Y for a sample of n 12 birds Both measurements are in centimeters I 104 108 111 102 103 102 107 105 108 112 106 114 y 74 76 79 72 74 71 74 72 78 77 78 83 Straightforward calculations show that 7 EV 1717 7 17V 1347 and 7 7 y 1323 Thus the maximum likelihood estimate of p based on the bivariate normal model and the observed data is 1323 08704 17171347 70 PAGE 99 CHAPTER 3 STAT 5137 J TEBBS O 82 i S 78 e 0 g o n 4 i I m 74 r o o o O 0 g 70 e 1 1 1 1 1 1 1 100 102 104 106 108 110 112 114 angLength Figure 317 Tail length and wing length for 12 birds We can compute an approximate con dence interval for p using the Fisher Z transfor mation method we discussed We rst compute 1 1 08704 14437111 1335 2 1 08704 Thus7 a 95 percent con dence interval for WP is equal to 1335 j 196 12737 06811988 Transforming this interval into an interval for p we obtain exp2 gtlt 0681 71 exp2 gtlt 1988 71 exp2 gtlt 0681 17exp2 gtlt 1988 1 7 or 05927 0963 as a large sample con dence interval for p Note that this interval does not contain zero Thus7 H0 p 0 would be rejected in favor of Ha p 71 0 at the 04 005 signi cance level That is7 there is an signi cant association between wing and tail lengths Of course7 this conclusion is under the assumption that the bivariate normal model is the correct model for the data D PAGE 100 CHAPTER 3 STAT 5137 J TEBBS 311 Matrix algebra results for linear models IMPORTANCE In this subsection we review some basic concepts from linear algebra for a complementary review see Appendix A It is hard to overemphasize how important linear algebra is to statisticians The presentation of regression and ANOVA models which are very popular models can be streamlined greatly using matrix notation 3111 Basic de nitions and results TERMINOLOGY A matrix A is a rectangular array of elements eg 3 5 4 A 1 2 8 Elements of A are denoted by llj eg an 3 112 5 123 8 In some instances we may write A calj The 239 refers to the row the j refers to the column In general the dimensions of A are n the number of rows by m the number of columns If we want to emphasize the dimension of A we may write Anxm If n m we call A a square matrix For example 0 l 72 B l 73 2 72 2 71 is a square matrix with n m 3 TERMINOLOGY If A calj is an n gtlt m matrix the transpose of A denoted A is the m gtlt 71 matrix 17 eg 3 l A 5 2 4 8 TERMINOLOGY For any square matrix A if A A we say that A is symmetric that is aij a for all values OH and j The matrix B above is symmetric PAGE 101 CHAPTER 3 STAT 5137 J TEBBS TERMINOLOGY A vector is a matrix consisting of one column or one row A column vector is denoted by 1W1 and a row vector is denoted by alxm We will assume that all vectors written as 1 b7 c7 etc7 are column vectors All vectors written as a U7 c 7 e 07 are I OW vec OI S eg 7 0 i 39 a 11 12 an gt39 TERMINOLOGY Two vectors a and b are said to be orthogonal if a b 0 MATRIX MULTIPLIOATION If A is n gtlt m and B is m gtlt p7 then AB is an n gtlt p matrix lf matrix multiplication makes sense77 ie7 the number of columns in A equals the number of rows in B7 we say that the matrices are conformable RESULTS The following matrix results are helpful a A A b AB BA7 if A and B are conformable c AA and AA are symmetric NOTATION The determinant of an n gtlt 71 matrix A is denoted by lAl or detA The determinant of the 2 gtlt 2 matrix is given by lAl ad 7 b0 RESULT The following matrix results are helpful a for any square matrix A7 lAl b for any n gtlt 71 upper lower triangular matrix this includes diagonal matrices7 n i1 PAGE 102 CHAPTER 3 STAT 5137 J TEBBS TERMINOLOGY A popular matrix in linear models is the n gtlt 72 identity matrix I given by 1 0 0 0 1 0 I 0 0 1 that is7 aij 1 for 2 j and aij 0 when 2 79739 TERMINOLOGY Another popular matrix is the n gtlt 72 matrix of ones J given by 111 111 J 111 that is7 aij 1 for all 2 and j Note that J 11 7 where 1 is an n gtlt 1 vector of 1s TERMINOLOGY A popular vector is the n gtlt 1 zero vector 0 given by TERMINOLOGY If A is an n gtlt n matrix7 and there exists a matrix C such that AC CA I7 then A is said to be nonsingular7 and C is called the inverse of A7 henceforth denoted as A71 If A is nonsingular7 the inverse matrix A71 is unique If A is not nonsingular7 we say that A is singular7 in which case A71 does not exist SPECIAL CASE The inverse of the 2 gtlt 2 matrix a b 71 1 d 7b A is given by A 7 C d ad 7 be as long as ad 7 be 31 0 PAGE 103 CHAPTER 3 STAT 5137 J TEBBS SPECIAL OASE The inverse of the n gtlt 71 diagonal matrix all 0 0 ail 0 0 0 a 0 0 a 1 A t 3922 t is given by A71 t 2392 0 0 am 0 0 1 RESULT The following matrix results are helpful a A is nonsingular if and only if lAl 31 0 b if A and B are conformable and nonsingular7 AB 1L B IA I c if A is nonsingular7 A 1 A71 TERMINOLOGY An 71 gtlt 71 matrix A is said to be idempotent if A2 A TERMINOLOGY If A is an n gtlt n matrix7 the trace of A is de ned as follows trA i a i1 that is7 trA is the sum of the diagonal elements of A RESULT For any conformable matrices A and B7 a traA bB atrA btrB7 for any constants a and b b trAB trBA 3112 Linear independence and rank TERMINOLOGY The n gtlt 1 vectors a1 a2 am are said to be linearly dependent if and only if there exist scalars 0102 cm such that m E CiOl39 0 i1 PAGE 104 CHAPTER 3 STAT 5137 J TEBBS and at least one of the cs is not zero On the other hand7 if m ECiaigtcl52Cm07 i1 then we say that 11 a2 am are linearly independent RESULT Suppose that 11 a2 am is a set of n gtlt 1 vectors Then7 o the vectors are linearly dependent if and only if it is possible to express at least one vector as a linear combination of the others 0 the vectors are linearly independent if and only if it is not possible to express one vector as a linear combination of the others TERMINOLOGY The rank of any matrix A is de ned as rA number of linearly independent columns of A number of linearly independent rows of A RESULT The number of linearly independent rows of any matrix is equal to the number of linearly independent columns TERMINOLOGY Suppose that A is an n gtlt p matrix Then rA S minnp lf rA minnp7 then A is said to have full rank lf rA 717 we say that A is of full row rank lf rA p7 we say that A is of full column rank lf rA lt minnp7 we say that A is less than full rank Example 118 The rank of 1721 5217 A is 2 because the two rows are linearly independent neither is a multiple of the other Thus7 A has full row rank Furthermore7 by the de nition of rank7 the number of linearly PAGE 105 CHAPTER 3 STAT 5137 J TEBBS independent columns is also 2 Therefore the columns are linearly dependent that is there exist constants 0102 and 03 such that 1 72 1 0 C1 62 Ca 5 2 17 0 Taking cl 3 02 1 and 03 71 are examples of such constants that is as the third of column of A is equal to 311 12 where 11 and 12 are the rst and second columns of A respectively B CONNECTION For any n gtlt n matrix A the following statements are equivalent rA n ltgt A71 exists ltgt lAl 31 0 FACT For any matrix A rAA rA FACT For any n gtlt n idempotent matrix A rA trA 3113 Column and row spaces TERMINOLOGY For the matrix Anxm a1 12 am where 17 is ngtlt 1 the column space of A CA m 1 R vchaJ 07 ER j1 v R vAc ceRm is the set of all n gtlt 1 vectors spanned by the columns of A that is CA is the set of all vectors that can be written as a linear combination of the columns of A Example 1 19 De ne 112 A 103 103 PAGE 106 CHAPTER 3 STAT 5137 J TEBBS The column space of A is the set of 3 gtlt 1 vectors that are linear combinations of the columns of A ie vectors of the form 01 62 203 C1 11 6212 03 01 303 7 Ci 303 where c E R Thus the column space CA is the set of all 3 gtlt 1 vectors of the form abb where 11 6 R The dimension of CA ie the rank of A is 2 D TERMINOLOGY Similarly the row space of A denoted RA is the set of all vectors that can be written as a linear combination of the rows of A RESULT For any matrix A a CA A CA and RA A RA b 724 CA and CA RA 31 14 Quadratic forms TERMINOLOGY Let y y1y2 yn denote an n gtlt 1 vector and suppose that A is an n gtlt n symmetric matrix A quadratic form is a function q R a R of the form VL VL qy Z Z aijyiyj y Ay i1 j1 The matrix A is called the matrix of the quadratic form Note that this sum will involve both squared terms and the cross product terms yiyj With 3 31 0 o if y Ay 2 0 the quadratic form and the matrix A are said to be nonnegative de nite o if y Ay gt 0 the quadratic form and the matrix A are said to be positive de nite PAGE 107 CHAPTER 3 STAT 5137 J TEBBS 3115 Means and variances of random vectors TERMINOLOGY Suppose that Z17 Z2 Z are random variables We call a random vector The multivariate pdf of Z is denoted by fZz It describes mathematically how Z1Z2Zn are distributed jointly and7 hence7 is called a joint distribution If Z17 Z2 Z are iid from fzz7 the joint pdf of Z is given by n H fZ2239 21 fzz THE MEAN AND VARIANCE OF A RANDOM VECTOR Suppose that Mi VZl 2712 for 2 127 727 and COVltZ 7 Zj 271 for 2 31 j We de ne the mean of a random vector to be 01 012 Um 021 0 0 VZ V 2 Unl U722 39 39 39 U We may also refer to V as the variancecovariance matrix of Z7 because it contains the variances 0120 mm on the diagonal and the covariance terms COVltZ 7 Zj7 for 2 lt j Since COVltZ 7 Zj COVZ7 Z7 it follows that the variance covariance matrix V is symmetric PAGE 108 CHAPTER 3 STAT 5137 J TEBBS THE COVARIANCE OF TWO RANDOM VECTORS Suppose that Ynxl and mel are random vectors The covariance between Y and Z is given by CovY1Z1 CovY1Z2 CovY1Zm C YZ C YZ C YZm COVY7Z 0V392 1 0V392 2 39 0V392 CovYn7 Z1 CovYn7 Z2 CovYn7 Zm nXm MEAN AND VARIANCE OF LINEAR FUNCTIONS OFA RANDOM VECTOR Sup pose that Z7 Z1 and Z2 are all n gtlt 1 random vectors and that amxl Amm and Bmm are nonrandom ie7 they contain only non varying constants ln addition7 suppose that EZ u and VZ V The following facts are easily shown 1 EaBZ aBEZ aBu 2 Va BZ BVZB BVB 3 COVlt14Z17 14COVZ17 Z2B MEAN OF A QUADRATIC FORM Let Y be an n dimensional random vector with mean u and variance covariance matrix V Then7 YAY is a quadratic form and EY AY uAM trAV 3116 The multivariate normal distribution RECALL We have talked about the univariate normal distribution for a random variable Y and the bivariate normal distribution for a two dimensional random vector Y1Y2 We can naturally extend the notion of joint normality of Y Y17Y27Yn to 71 dimensions TERMINOLOGY The random vector Y Y1L7 Y2 Yn is said to have a multivari ate normal distribution with mean u and variance covariance matrix V if its joint PAGE 109 CHAPTER 3 STAT 5137 J TEBBS pdf is given by f 1 1lt gt V1lt gt Yy i 2Wn2 V 1Zexp 2 y u y u 7 for all y E R Shorthand notation for this statement is Y N Nnu V FACT If Y Y1Y2 Y N Nnu V then marginally each Y N ARM0 LINEAR COMBINATIONS Suppose that Y N Nnu V and that amxl and BMW are nonrandom Then U a BY N Nma By BVB Of course if m 1 then amxl is a scalar BMW is a row vector and U a BY has a univariate normal distribution Example 1110 With Y Y1Y2Y3 suppose that 4 8 5 0 YN3 6 7 512 4 10 0 4 9 a Find the distribution of U1 Y1 7 Y2 1 Y3 b Let U2 Y1 7 3Y2 2Y3 Find the joint distribution of U1 and U2 SOLUTION a Writing then using the fact above we identify a 0 and leg 1 711 Straightforward calculations show that 4 BMlt1711 6 8 10 and 8 5 0 1 BVB 1711 512 4 i1 11 0 4 9 1 PAGE 110 CHAPTER 3 STAT 5137 J TEBBS Thus U1 N N8 11 That is U1 is univariate normal with mean 8 and variance 11 b Writing 33 U1 1 71 1 U Y2 U2 1 73 2 Y3 then using the fact above we identify a 0 and 1 71 1 B 1 73 2 Straightforward calculations show that 4 1 71 1 8 By 6 1 73 2 6 10 and 8 5 0 1 1 1 71 1 11 22 BVB 5 12 4 71 73 1 73 2 22 74 0 4 9 1 2 Thus U1 8 11 22 U N2 7 U2 6 22 74 That is U U1U2 has a bivariate normal distribution with mean vector 86 and covariance matrix BVB given above D NOTE To do the BVB calculation I used the MAPLE commands gt withlinalg gt Barray12131 111 32 gt Varray131385o5124I049 gt variance multiplyBVtranspose B NOTE MAPLE is great for tedious matrix calculations PAGE 111 CHAPTER 3 STAT 5137 J TEBBS 312 Introduction to multiple linear regression models REVIEW We have already considered the simple linear regression model Yi 50 51 613 for 239 1 2 n where 6 iid N002 We talked about three main issues a least squares estimation and inference for 60 and 61 b partitioning the variability in the data using an ANOVA table and c estimating mean values of Y and predicting future values of Y While the straight line model serves as an adequate description for many situations more often than not researchers engaged in model building consider more than just one predictor variable x In fact it is often the case that the researcher has a set of k candidate predictor variables say 12 xk and desires to model Y as a function of one or more of these k variables To accommodate this situation we must extend our linear regression model to handle more than one predictor variable Example 1111 The taste of matured cheese is related to the concentration of several chemicals in the nal product In a study of cheddar cheese from the LaTrobe Valley of Victoria Australia samples of cheese were analyzed for their chemical composition and were subjected to taste tests Overall taste scores were obtained by combining the scores from several tasters Data were collected on the following variables Y taste test score TASTE 1 concentration of acetic acid ACETIC 2 concentration of hydrogen sul de H25 3 concentration of lactic acid LACTIC Variables ACETIC and H25 are both on the natural log scale The variable LACTIC has not been transformed Table 36 contains concentrations of the various chemicals in n 30 specimens of mature cheddar cheese and the observed taste score FULL MODEL Suppose that the researchers postulate that each of the three chemical composition covariates 12 and 3 are important in describing the taste In this case PAGE 112 CHAPTER 3 STAT 5137 J TEBBS Table 36 Cheese data TASTE ACETIC H28 LACTIC TASTE ACETIC H28 LACTIC 123 4543 3135 0 86 409 6365 9588 1 74 209 5159 5043 1 53 159 4787 3912 1 16 390 5366 5438 1 57 64 5412 4700 149 479 5759 7496 1 81 180 5247 6174 1 63 56 4663 3307 0 99 389 5438 9064 1 99 259 5697 7601 1 09 140 4564 4949 1 15 373 5892 8726 1 29 152 5298 5220 133 219 6078 7966 1 78 320 5455 9242 144 181 4398 3350 1 29 567 5855 1020 2 01 210 5242 4174 1 58 168 5366 3664 131 349 5740 6142 1 68 116 6043 3219 146 572 6446 7908 1 90 265 6458 6962 172 07 4477 2996 106 07 5328 3912 1 25 259 5236 4942 130 134 5802 6685 1 08 549 6151 6752 1 52 55 6176 4787 1 25 they might initially consider the following regression model Yz39 50 51 52 533 613 911gt213 for 239 12 30 Before using this model we should ask ourselves o Are there other predictor variables that in uence taste not considered here 0 Alternatively what if not all of 12 and 3 are needed in the model After all it may be that one or more of 12 and 3 are not helpful in describing taste For example if the acetic acid concentration 1 is not helpful in describing taste then we might consider a smaller model which excludes it ie for 239 1 2 30 Yi 50 52 533 613 91213 PAGE 113 CHAPTER 3 STAT 5137 J TEBBS IMPORTANT The goal of any regression modelling problem should be to identify each of the important predictors and then nd the smallest model that does the best job MULTIPLE REGRESSION SETTING Consider an experiment in which n observa tions are collected on the response variable Y and k predictor variables covariates 12 Schematically we can envision data from such an experiment as follows lndividual Y 1 2 wk 1 Y1 11 12 39 39 39 z1k 2 Y2 21 22 39 39 39 Mk n Yn zn1 znZ 39 39 39 znk To describe Y as a function of the k independent variables x12k we posit the multiple linear regression model Yi 50 51 52 39 39 39 ik 613 for 239 12n where n gt k 1 and 6 iid N002 The values 60 61 Bk as before are called regression coef cients and since we are talking about regression models we assume that x1z2zk are all xed measured without error Here the random errors 6 are still assumed to be independent and have a normal distribution with mean zero and a common variance 02 Just as before in the straight line case our main goals in regression analysis will be to 0 estimate the regression parameters 6061 k o diagnose the t ie perform model diagnostics and 0 estimate mean responses and make predictions about future values PREVIEW To estimate the regression parameters 60 61 Bk we will still use least squares Simple computing formulae for parameter estimators are no longer available We can however nd closed form solutions in terms of matrices and vectors PAGE 114 CHAPTER 3 STAT 5137 J TEBBS 313 Multiple linear regression models using matrices MATRIX REPRESENTATION Repeatedly writing the multiple linear regression model Y7 50 51 522 39 39 39 kiik 673 for 239 127717 where 61 iid J0Uz7 is not necessary We can more succinctly ex press the model7 and7 hence7 greatly streamline our presentation7 with the use of matrices and vectors Let p k 1 denote the number of regression parameters 60 61 Bk and de ne Bo Y1 1 11 z12 zlk Y 1 51 2 21 z22 z2k YnXl 39 7 anp 39 39 39 39 39 7 pxl 62 7 Yn 1 zn1 n2 nk Bk and 61 62 enxl 6n The model Y7 50 51 522 39 39 39 kiik 673 for 239 127717 where 61 iid J07 02 can be equivalently expressed as Y X e where e N Nn0702I NOTE The matrix X is called the design matrix It contains all of the predictor variable information and is assumed to be xed ie7 not random The vector contains regression coef cients which are also assumed to be xed The only random quantities in the model are e and Y Since Ee 0 and Ve 02L it follows that EY X and PAGE 115 CHAPTER 3 STAT 5137 J TEBBS that VY 03921 Since 6 has a normal distribution so does Y thus Y N NMX 03921 Summarizing Y has a multivariate normal distribution with mean 50 ii 522 ik EY X 60 5121 62z22 BWZk 50 511 522 BMW and covariance matrix 02 0 0 2 0 02 0 VY U I 0 0 02 LEAST SQUARES The notion of least squares estimation is the same as it was in the simple linear regression model In a multiple regression setting we want to nd the values of 6061 k that minimize the sum or squared deviations V L Q o75177 k Z i 50 51 522 39 39 39 ik27 i1 or in matrix notation the value of that minimizes Q E QW Y X WY X3 Because Y 7 X Y 7 X Y 7 X IY 7 X is a quadratic form ie it is just a scalar function of the p k 1 elements of it is possible to use calculus to determine the values of the p elements that minimize it Formally one would take the p partial derivatives with respect to each of 6061 k respectively and set these expressions equal to zero ie 3Q 370 0 3Q 0Qlt gt 371 set 0 an 3Q m 0 PAGE 116 CHAPTER 3 STAT 5137 J TEBBS These are called the normal equations Solving the normal equations for 6061 k gives the least squares estimators which are denoted by 3031 NORMAL EQUATIONS Using the calculus of matrices makes this much easier in particular we can show the normal equations can be expressed as 72XY 2X X 0 or equivalently X X XY Provided that X X is full rank the unique value of that solves this minimization problem is given by 3 X X 1X Y This is called the leastsquares estimator of Each entry in the vector is the least squares estimator of the corresponding value of 6 for 239 01k ie NOTE For the least squares estimator 3 to be unique we need X to be of full column rank ie rX p k 1 That is there are no linear dependencies among the columns of X If rX lt p then since rX rXX the matrix X X does not have a unique inverse In this case the normal equations can not be solved uniquely WORKING ASSUMPTION To avoid the more technical details of working with non full rank matrices we will assume unless otherwise stated that X is full rank In this case we know that XX 1 exists uniquely NOTE Computing 3 for general p is not feasible by hand of course particularly nasty is the inversion of X X when p is large Software for multiple regression analysis includes routines for inverting a matrix of any dimension thus estimation of for a general multiple regression model is best carried out in this fashion PAGE 117 CHAPTER 3 STAT 5137 J TEBBS ERROR SUM OF SQUARES We de ne the error sum of squares by SSE E SSE Y 7 X Y 7 X Y 7 TAUY 7 66 The vector 1 E X3 contains the n tted values The vector 6 E Y 7 1A Y 7 X3 contains the n least squares residuals ESTIMATION OF 02 An unbiased estimator of 02 is given by 32 7 SSE 7 Y 7 X Y 7 X3 10 10 Proof It may be demonstrated that for the n gtlt 71 identity matrix I SSE Y 7 X3 Y 7 X3 Y I 7 MY where M XXX 1X Since EY X and VY 03921 it follows that EY I7 MY X I 7 MX trI 7 M039ZI recall the mean of a quadratic form However it is easy to see that MX X in which case the rst term X I 7 MX 0 As for the second term note that trI 7 M02I 02trI 7 trM 7 024m 7 trXXX 1X Now since trAB trBA for any conformable matrices A and B taking A X and B XX 1X we can write the last expression as 02n7trXX X 1X 02n7trX X 1X X 024m WU 0201 10 since I XX 1XX is p gtlt p We have shown that ESSE 0271 7 p Thus SSE E32 E U2 n 19 showing that 32 is an unbiased estimator of 02 D PAGE 118 CHAPTER 3 STAT 5137 J TEBBS Example 1112 With the cheese data in Example 1111 consider the full model Y1 50 51 52 533 617 for 239 12 30 or in matrix notation Y X 6 From Table 36 we have 123 1 4543 3135 086 209 1 5159 5043 153 yam 390 7 and X30 1 5366 5438 157 55 1 6176 4787 125 Our parameter vector to be estimated is We compute 30 164941 178254 43260 164941 916302 1001806 240879 XX 7 178254 1001806 1190343 269113 4326 240879 269113 65052 3795 70760 0087 70071 XX71 70760 0194 70020 70128 0087 70020 0015 70046 70071 70128 70046 0726 and 736000 4194442 5130932 1162065 PAGE 119 CHAPTER 3 STAT 5137 J TEBBS Thus7 the least squares estimate of for these data is given by B x Xrlx y 3795 70760 0087 40071 736000 7 70760 0194 40020 70128 4194442 7 0087 40020 0015 70046 5130932 40071 70128 70046 0726 1162065 728877 30 i 0328 7 El 7 3912 7 32 19670 BS Our least squares regression equation becomes 728877 03281 39122 196702137 or7 in terms of the variable names7 i 728877 0328ACETIC1 3912H2Si19670LACTICl An unbiased estimate of the error variance 02 is given by 32 W 102630 314 Sampling distributions GOAL Consider our multiple linear regression model Y X e where e N Nn0UZI We now investigate the sampling distribution of the least squares estimator This will help us construct con dence intervals for individual ele ments of and perform hypothesis tests which compare different regression models PAGE 120 CHAPTER 3 STAT 5137 J TEBBS SAMPLING DISTRIBUTION OF 3 Recall that the least squares estimator for is given by B X X 1X Y The mean of 3 is given by E EKX XYIX Yl X X 1X EY X X 1X X 3 Thus 3 is an unbiased estimator of The variance of 3 is given by V VlX X 1X Yl 7 X X 1X VYX X 1X X X 1X UZIXX X 1 02X X 1 Recall that XX 1 is a p gtlt p matrix where p k 1 Thus is a p gtlt p matrix of the true variances and covariances it has the following structure VBo COVWAo Ai COVltBOvBZgt COVWAo Ak WEI Combs Comma VB v0 2 COVBZB VltBkgt PX Notice that I only gave the upper triangle of the V matrix since it is symmetric Of course rarely will anyone ever tell us the value of 02 For practical use we must estimate 02 with its unbiased estimate 32 Thus the estimated variancecovariance matrix is given by VE 32X X 1 This is a p gtlt p matrix of the estimated variances and covariances it has the following PAGE 121 CHAPTER 3 STAT 5137 J TEBBS structure v75 Cof n 00 Cof n A VBl COVltBhBZgt COWBth Vw VBZ 5 COVltBZ7BIQ v73 Computing packages will give us this estimated variance covariance matrix upon request This matrix will be helpful when we want to get con dence intervals for EYlw or prediction intervals for a new Y coming upl Example 1113 With the cheese data in Example 11117 consider the full model Yi 50 51 52 533 613 for 239 17 27 307 or in matrix notation Y X 6 Recall that we have computed 32 102630 The estimated variance covariance matrix of B is given by 398480 777977 8960 77333 gt A2 1 777977 19889 72089 713148 V 0 X X 8960 72089 1558 74670 77333 713148 74670 74461 NORMALITY Since 3 XX 1XY is just a linear combination of Y17Y2Yn all of which are normally distributed under our model assumptions7 it follows that 3 is normally distributed as well More precisely7 it follows a p dimensional multivariate normal distribution Summarizing7 9 Mira 02X X 1l IMPLICATIONS In our multiple linear regression setting7 we would like to do many of the things that we did in the simple linear case eg7 obtain con dence intervals for regression coef cients7 prediction intervals7 etc The following facts provide the basis for many of these objectives and are direct consequences of our previous discussion PAGE 122 CHAPTER 3 STAT 5137 J TEBBS 1 E j 67 for j 01k that is the least squares estimators are unbiased 2 V j 020711j11 where Cj1j1 XX11j17 for j 01k The value XXJZLLJJr1 represents the j1th diagonal element of XX 1 An estimate of is given by V8j 320741441 3 COV 7 7BS 02071 17 Where Cr1s1 XX1151 The value XX LLSJr1 is the entry in the r 1th row and s 1th column of the XX 1 matrix An estimate of Cov is given by COVltB 7 BS Ezcr1 1 4 Marginally N N jazcj1j1 for j 01k SAMPLING DISTRIBUTION OF 32 Under our model assumptions n 7p2 SSE F U2 96271 10 315 Inference for parameters in multiple regression INFERENCE FOR INDIVIDUAL REGRESSION PARAMETERS Consider our mul tiple linear regression model Yi 50 51 52 39 39 39 kiik 613 for 239 1 2 n where 6 iid N002 Writing a con dence interval or performing a hypothesis test for single regression parameter 6 can help us assess the importance of using the independent variable Q in the full model PAGE 123 CHAPTER 3 STAT 5137 J TEBBS CONFIDENCE INTERVALS Since N NW U26j1j17 forj 0 1 2 h it follows that A z N Nlt071 02011j1 B a 639 tltn 7 m V0 741741 Con dence intervals are based on this pivotal quantity Speci cally a 1001 704 percent and that tj con dence interval for 67 is given by 6739 i Erna2V 320741441 HYPOTHESIS TESTS To test H0 67 6730 versus a one or two sided alternative we use the test statistic A 6739 50 V 320741441 The rejection region is located in the appropriate tails on the tn 7 p distribution tj NOTE To assess whether or not Q is useful in describing Y with the inclusion of the other covaiiates in the model we can test H0 67 0 versus Ha 67 31 0 It is important to recognize that tests of this form are conditional77 on there being the other covariates present in the model Example 1114 With the cheese data in Example 1111 consider the full model Y 60 61x 1 62mg 63mg 6139 To assess the importance of the hydrogen sul de concentration H25 and its in uence on taste we test H0 g 0 versus Ha g 31 0 With 32 3912 cs3 0015 and 32 102630 our test statistic becomes 32 i 0 3912 V7733 102630015 which is larger than t260025 2056 At the 04 005 signi cance level we have signi cant t2 3133 evidence that the hydrogen sul de concentration after adjusting for the e ects of acetic and lactic concentration is important in describing taste D PAGE 124 CHAPTER 3 STAT 5137 J TEBBS 316 Simple linear regression in matrix notation BACKTRA OK The simple linear regression model Y1 60 61 61 for 239 127717 is just a special case of the multiple linear regression model when k 1 If we de ne Y1 1 1 Y2 1 2 50 Ynxl t 7 anz t t 7 le 7 61 Yn l xn and 61 62 engtlt1 7 67L the model Y1 60 61 61 for 239 127717 where 61 iid J0Uz7 can also be equivalently expressed as YX e where e N Nn0702I Now7 straightforward calculations verify show that 2 1239 7 X7X 7quot 27 and X7X71 772 1817732 Eglmiw 2 i 27 7 4 X1i52 39 Thus7 it can also be shown verify that Y1 A 7 7 A E A X Y 1 and 3 X X 1X Y kiwi 30 Zl MYi m 2 51 32 NOTE Variances and covariances for 30 and 31 found previously are identical to those given by V 02XX 1 We estimate this variance covariance matrix by using VB 32X X 1 The elements ofthis matrix are used to con dence intervals and perform hypothesis tests for 60 and 61 PAGE 125 CHAPTER 3 STAT 5137 J TEBBS 317 The hat matrix and geometric considerations TERMINOLOGY We call M XXX 1X the hat matrix Observe that MY XX X 1X Y X3 1 so that in a sense the hat matrix M puts the hat on77 Y Here are some facts regarding the hat matrix 0 The hat matrix M is symmetric and idempotent ie M M and M2 M and MX X o CM CX and rM Since MY Y X3 6 CX geometrically M projects Y onto CX the estimation space Furthermore since 3 is the value that minimizes Y 7 X Y 7 X HY 7 X ll which is the squared distance between Y and X we call M a perpendicular projection matrix 0 Let mij denote the z39jth element of M The diagonal elements of M ie mi XX71i where denotes the 2th row of the design matrix X are called the leverages Leverages can be useful in classifying observations as outliers or not ANOTHER IMPORTANT MATRIX Like M the matrix I 7 M is also symmetric and idempotent verifyl In addition note that 17MYY7Ye Geometrically the matrix I 7 M projects Y into a space called the error space This space is denoted by CI 7 PAGE 126 CHAPTER 3 STAT 5137 J TEBBS ORTHOGONALITY It is easy to show that I 7 MX 0 This results from the fact that CX and CI 7 M are orthogonal spaces ie any vector in CX is orthogonal to any vector in CI 7 Also note that Y MY I 7 MY I Y 7 if l 6 RESULT Any data vector Y can be uniquely decomposed into two parts namely 1 E CX and e E CI 7 Since 1 and e are orthogonal it follows that 6 0 That is the tted values and residuals from the least squares t are orthogonal The end result is that if the model is correct the tted values and residuals are independent 318 The analysis of variance for multiple linear regression PREVIEW Just as we did in the simple linear case we aim to summarize in tabu lar form the amount of variability due to the regression model and the amount of variability due to the error RESULT Consider the quadratic form YY YIY and note that we can write it as Y Y Y MI7 MY Y MYYI7MY Y MMYYI7MI7MY lAff 66 Algebraically this means that Y 7 22 2m 7 i1 i1 i1 ln words the uncorrected total sum of squares 21 Y equals the uncorrected re gression model sum of squares 21 572 plus the error sum of squares 2106 7 PAGE 127 CHAPTER 3 STAT 5137 J TEBBS CORRECTED SUMS OF SQUARES Since interest often lies in those regression coef cients attached to predictor variables ie 6162 k we remove the effects of tting the intercept term 60 ie the overall mean of Y ignoring the covariates This removal is accomplished by subtracting 7172 from both sides of the last equation to get in 77172 nY 7171 11 11 11 or equivalently 201 7 7 2071 7 04 7 171 11 11 11 SST SSR SSE The terms SST and SSR here are called the corrected total and corrected regression sums of squares respectively for the reasons just described This is the partitioning of the sums of squares that all software uses SUMS OF SQUARES AS QUADRATIC FORMS To enhance the understanding of the partitioning of sums of squares we express the SST SSR SSE partition in terms of quadratic forms The basic uncorrected partition is given by Y Y Y MY Y I MY ln words the uncorrected total sum of squares YY equals the uncorrected regression model sum of squares YMY plus the error sum of squares YI MY CORRECTED SUMS OF SQUARES As before we can remove the effects of tting the intercept term 60 This removal is accomplished by subtracting n72 Yn lJY from both sides of the last equation to get Y Y Y n lJY Y MY Y n lJY Y I MY or equivalently Y I n lJY Y M n lJY Y I MY SST SSR SSE We can combine all of this information into an ANOVA table It is worth noting that I n lJ M n lJ and I M are all idempotent verifyl PAGE 128 CHAPTER 3 STAT 5137 J TEBBS Table 37 The general form of an analysis of variance table for multiple linear regression All sums of squares are assumed to be of the corrected type Source df SS MS F Regression p 71 SSR MSR 23 F Error n 7p SSE MSE 3 Total n 7 1 SST INTERPRETION The ANOVA table here has the same structure as the ANOVA table for simple linear regression model The only difference is the degrees of freedom 0 The corrected SST has n 7 1 degrees of freedom Again7 as before7 we can view this as a statistic that has lost77 a degree of freedom for having to estimate the overall mean of Y7 60 with 7 Also7 note that rI 7 n lJ n 71 o The corrected SSR has p71 degrees of freedom7 the number of predictor variables in the model Since rM rX 107 one will note that rM 7 n lJ p 71 o SSE has n 7p degrees of freedom7 obtained by subtraction Also7 one will note that rI 7 M n 7 p USING THE F STATISTIC The F statistic in the ANOVA table above is used to test whether or not at least one of the independent variables 1 2 xk adds to the model 1e7 it is used to test Hoi i z39 k0 versus Ha at least one of the 61 is nonzero in the full model K 60 61 62 Brim 61 Note that if H0 is rejected7 we do not know which one or how many of the fs are nonzero only that at least one is JUSTIFICATION When H0 is true SSRUZ X2p 717 SSE02 X201 7p7 and that SSR and SSE are independent These facts would be proven in a more advanced PAGE 129 CHAPTER 3 STAT 5137 J TEBBS linear models course Thus under H0 F p71 Fp71n7p 701 7 p MSE As in the simple linear case H0 is rejected whenever the F statistic gets too large that is this is a onesided uppertail test with rejection region RR F F gt Fp11 1py where Fp11 1py denotes the upper 04 quantile ofthe F distribution with p71 numerator and n 7 p denominator degrees of freedom Probability values are computed as areas to the right of F on the Fp 7 1 n 7 p distribution THE COEFFICIENT OF DETERMINATION Since SST SSR SSE it follows that the proportion of the total variation in the data explained by the model is 2 SSR 7 SST39 The statistic R2 is called the coe icient of determination It has the analogous interpretation in multiple linear regression settings as it did in the straight line case Example 1115 With the cheese data in Example 1111 consider the full model Y 60 61 62mg 63mg 6 for 239 1 2 30 The ANOVA table for these data obtained using SAS is shown below Source DF SS MS F Pr gt F Model 3 4994509 1664836 1622 lt0000l Error 26 2668378 102629 Corrected Total 29 7662887 The F statistic is used to test H0 351 52530 versus Ha not H0 Based on the F statistic and the corresponding probability value we conclude that at least one of ACETIC H25 and LACTIC is important in describing taste The coef cient of PAGE 130 CHAPTER 3 STAT 5137 J TEBBS determination is R2 m 0652 Thus7 about 65 percent of the variability in the taste data is explained by the covariates If you analyze these data using R7 you get the following ANOVA table Df Sum Sq Mean Sq F value PrgtF acetic 1 231414 231414 225481 6528e05 h2s 1 214702 214702 209197 00001035 lactic 1 53332 53332 51964 00310795 Residuals 26 266841 10263 The protocol used by R is to split up77 the regression sum of squares SSR 4994509 into sums of squares for each of the three covariates ACETIC7 H257 and LACTIC7 as they are added sequentially to the model these are sometimes called sequential sums of squares You will note that these sums of squares add to SSR up to rounding error 319 Con dence intervals for in multiple linear regres sion RECALL In the simple linear regression model K BO 61i6i7 forz39 127 n7 where 61 iid J0Uz7 we learned how to obtain con dence intervals for the mean response EYl 60 l Extending this to a multiple regression setting is straightforward SETTING AND GOALS Consider our multiple regression model Yi 50 51 52 39 39 39 kiik 613 for 239 127717 where 61 iid J0Uz7 or7 equivalently7 Y X e where e N n07 03921 Our goal is to write a 1001 7 04 percent con dence interval for EYl 50 i l i 5295 i 39 39 9527 the mean of the response Y when w a E xix Mai PAGE 131 CHAPTER 3 STAT 5137 J TEBBS POINT ESTIMATOR We rst note that EYl 50 l 52953 39 39 39 3 13 E 97 33 where a 1xf 2 and 606162 k Note that 6 1 is a scalar parameter since a is 1 gtlt p and is p gtlt 1 To estimate 1 we will use 6 1 3 where B is the least squares estimator of To nd the con dence interval we need the sampling distribution of 6 1 3 The mean of 6is given by EEwwmw a thus 6 is an unbiased estimator of 6 The variance of 6is given by W 7 Va B av m maxer U2a X X1a Furthermore 6 1 3 is just a linear combination of 3031 Bk all of which are nor mally distributed under our model assumptions Thus 6 1 3 has a normal distribution as well Summarizing we have that 6 1 N N602aXX 1a where 6 a Standardizing it follows that 1 Z 9 02a XX 1a N01 Using the fact that n 7 p fZUZ X201 7 p we can argue that 6 i 9 t f N Wquot i 10 UZOX X 1a Thus we can use It as a pivotal quantity to derive con dence intervals and hypothesis tests involving 6 1 A 1001 7 04 percent con dence interval for 6 a EYlw 6i MW 32aX X71a PAGE 132 is given by CHAPTER 3 STAT 5137 J TEBBS and a level 04 test of H0 6 60 versus a one or two sided alternative uses i 0 i 00 32a XX 1a as a test statistic The rejection region is located in the appropriate tails of the tn 7p reference distribution Probability values come from this distribution as well Example 1116 With our cheese data from Example 11117 suppose that we want to nd a 95 percent con dence interval for EYlw 60 561 662 g Here7 we have sf 57 6J3 17 so that a 1561 We have 728877 A A 0328 6 a 1561 15905 3912 19670 and 3795 70760 0087 70071 1 71 70760 0194 70020 70128 5 aXX a1561 0181 0087 70020 0015 70046 6 70071 70128 70046 0726 1 A 95 percent con dence interval for 6 a 60 561 662 g is given by 15905 i 2056 10263001817 or 73107 24497 Thus7 when sf 57 x 67 and x3 17 we are 95 percent con dent that the mean taste rating EYlw is between 7310 and 24497 D 320 Prediction intervals for a future value of Y in multiple linear regression RECALL In the simple linear regression model K 60 61 6139 for 239 127717 where 61 iid J0UZ7 we learned how to obtain prediction intervals for a new Y5 Extending this notion to a multiple regression setting is also straightforward PAGE 133 CHAPTER 3 STAT 5137 J TEBBS POINT PREDICTOR Our point predictor for Y at a given value of 17 3 Z 7 is given by 57 30 31 32 39 39 39 37 where a 1 Z and 3 303132Bk Analogously to the simple lin ear regression setting7 the error in prediction 17 7 Y has the following sampling distribution 17 7 Y N0702 1aXX 1a Using the fact that n 7 p fZUZ N X201 7 p7 we can argue that 177Y t 1 twp 021a XX 1a Using this as pivot7 we can conclude that 57 1 Mpg2 32 1 aX X71a is a 1001 7 04 percent prediction interval for Y at the point Z Comparing this to the 10017 04 percent con dence interval for EYl7 we see7 again7 that there is an extra 177 in the estimated standard error As in the straight line case7 this results from the extra variability which arises from having to predict the random quantity Y instead of having to estimate the mean EYl Example 1117 With our cheese data from Example 11117 we want to nd a 95 percent prediction interval for a particular taste rating score Y when 1 57 3 67 and 3 17 so that a 1 5 6 1 The prediction interval is given by 15905 i 2056 1026301 01817 or 7662438431 Thus7 when 1 57 3 67 and 3 1 we are 95 percent con dent that the taste rating for the new Y will be between 76624 and 38431 One will note that7 again7 this prediction interval is much wider than the corresponding con dence interval for EYl Also7 note that the lower limit for the prediction interval is negative7 which may not even make sense in this application D PAGE 134 CHAPTER 3 STAT 5137 J TEBBS 321 Reduced versus full model testing SETTING Consider our full multiple regression model Yi 50 51 522 ik 613 for 239 12 n where 6 iid N002 or equivalently Y X e where e N Nn0 021 Sometimes it is of interest to determine whether or not a smaller model is adequate for the data That is can we throw out77 some of the independent variables write a smaller model and have that smaller model do just as well77 as the full model In terms of the regression parameters themselves we basically are asking are some of the parameters 606162 k not different statistically from zero77 TERMINOLOGY A well known principle in science is the Parsimony Principle which states loosely speaking that the simplest of two competing theories is to be preferred Applying this principle to regression modeling leads us to choose models that are as simple as possible but yet do an adequate job of describing the response Seldom ie never will there be a model that is exactly correct but hopefully we can nd a model that is reasonable and does a good job summarizing the true relationship between the response and the available predictor variables REMARK Besides their ease of interpretation smaller models confer statistical bene ts too Remember that for each additional predictor variable we add to the model there is an associated regression parameter which has to be estimated For each additional regression parameter that we have to estimate we lose a degree of freedom for error Why is this important Remember that MSE our estimator for the error variance 02 uses the degrees of freedom for error in its computation Thus the smaller this degree of freedom value is the fewer observations we are using to estimate 02 With an imprecise estimate of 02 hypothesis tests con dence intervals and prediction intervals are likely to be uninformative PAGE 135 CHAPTER 3 STAT 5137 J TEBBS Example 1118 With our cheese data from Example 1111 the full model is given by Yi39 50 51 52 533 Suppose that we consider the smaller model ie the reduced model Yi39 Yo Y is 613 for i 1 2 30 In the reduced model we are excluding the independent variables 1 ACETIC and 2 H2S Does the smaller model do just as well at describing the data as the full model How can we test this One will note that in this example we are essentially asking ourselves whether or not H0 61 g 0 is supported by the data REDUCED VERSUS FULL MODEL TESTING Consider testing Y X07 6 reduced model versus Y X 6 full model where X0 is a submatrix of X that is X0 is the matrix obtained from taking X and removing some of its columns De ne the hat matrices for the full and reduced models by M XX X 1X and M0 X0X6X0 1X6 respectively so that M projects Y onto CX and that M0 projects Y onto CXO Clearly CXo Q REGRESSION SUMS OF SQUARES We know that SSRF Y M i n lJY is the corrected regression sum of squares for the full model Similarly SSRR Y M0 i n lJY is the corrected regression sum of squares for the reduced model Since the regression sum of squares SSR can never decrease by adding predictor variables it follows that SSRF Y M i n lJY 2 Y M0 i n lJY SSRR whenever CXO Q CX ie the sums of squares for regression in the reduced model is always less than or equal to than the sums of squares for regression in the full model In the light of this our intuition should suggest the following PAGE 136 CHAPTER 3 STAT 5137 J TEBBS o if SSRF YM 7 n lJY and SSRR YMO 7 n lJY are close77 then the additional predictor variables are not adding too much in the form of an increase in sums of squares for the regression and the reduced model does just as well at describing the data as the full model does if SSRF YM 7 n lJY and SSRR YMO 7 n lJY are not close77 then the additional predictor variables are adding a lot in the form of an increase in sums of squares for the regression This suggests that the reduced model does an insuf cient job of describing the data when compared to the full model REMARK As you might suspect we base our decision on whether or not the reduced model is adequate by examining the size of SSRF i SSRR Y M i n lJY i Y M0 i n lJY Y M i M0Y If this difference is large then the reduced model does not do a good job of describing the data when compared to the full model You should be aware that in this presentation we are assuming that the full model already does a good job of describing the data we are trying to nd a smaller model that does just as well F STATISTIC Theoretical arguments in linear models show that when Y X07 e is the correct model Y M 7 M0YdfeR 7 dfep F MSEF N Fdf R 7 deFdf6F where dfeR and dfep denote the error degrees of freedom associated with the reduced and full models respectively and MSEF denotes the MSE from the full model Thus to perform an 04 level test we reject the reduced model Y X07 e in favor of the full model Y X 6 when F is too large That is this is one sided upper tail test with rejection region RR F 3 F gt FdfeR7dfeFdfeFav where FdfeRdf5Fdf5Fa is the upper 04 quantile of the FdfeR 7 dfepdfep distribution PAGE 137 CHAPTER4 STAT51ampJTEBBS Example 1118 continued We want to test7 using 04 0057 H03 Yi Y0 Yi393Ei Ha 3 Yi 50 51 52 533 Ei That is7 we want to know whether or not the variables 1 ACETIC and 2 H2S should be added to the model Here7 the submatrix X0 corresponding to the reduced model is the full model matrix X with the second and third columns removed ie7 the columns for 1 and 2 Note that we can t the reduced model7 in SAS7 for example7 by changing the model statement to MODEL TASTE LACTIC Here are the ANOVA tables for the reduced and full models Analysis of Variance Reduced Model Source DF SS MS F Pr gt F Model 1 3800398 3800398 2755 lt00001 Error 28 3862 489 137946 Corrected Total 29 7662887 Analysis of Variance Full Model Source DF SS MS F Pr gt F Model 3 4994509 1664836 1622 lt00001 Error 26 2668378 102629 Corrected Total 29 7662887 Thus7 with YM i M0Y SSRF i SSRR 4994509 7 3800398 11941117 dfeR 287 dfep 267 and MSEF 1026297 we compute the F statistic to be F i Y M i M0YdfeR i dfep 711941112 i 2 7 MSEF 7 102629 7 Since F 582 gt F2260V05 33697 we would reject H0 and conclude that the reduced model is not as good as the full model at describing these data That is7 ACETIC and H2S signi cantly add to a model that already includes LACTIC D PAGE 138 CHAPTER 4 STAT 5137 J TEBBS 4 A Brief Introduction to Survival Analysis 41 Introduction NOTE The statistical analysis of lifetime data is an important topic in many areas including biomedical applications eg clinical trials etc engineering and the social sciences The term lifetime77 is understood to mean time to event77 where an even 77 may refer to death machine failure eradication of infection completion of an assigned task natural disaster etc CLINICAL TRIALS In chronic disease clinical trials eg trials involving cancer AIDS diabetes cardiovascular disease etc the primary endpoint of interest may be the time to an event such as time to death time to relapse of disease after a response etc For such clinical trials we may be interested in comparing the distribution of the time to event among competing treatments CENSORING Typically clinical trials occur over a nite period of time and conse quently the time to event is not measured on all patients in the study This results in what is referred to as censored data Also since patients generally enter a clinical trial at different calendar times staggered entry the amount of follow up varies for the different individuals SURVIVAL ANALYSIS The combination of censoring and staggered entry creates some unusual dif culties in the analysis of such data that do not allow standard statistical techniques to be used Because of this a whole new research area in statistics has emerged for the study of such problems This area has been called censored survival analysis We will provide a brief introduction to some problems in survival analysis Example 41 A randomized clinical trial involved 64 patients with severe aplastic anemia Aplastic anemia occurs when the bone marrow stops making enough blood cells Prior to the trial all patients were treated with a high dose of cyclophosphamide PAGE 139 CHAPTER 4 STAT 5137 J TEBBS Table 48 Time to diagnosis of severe AGVHD for bone marrow patients Starred subjects represent censored observations CS MTX MTX only 3 65 324 528 9 25 104 395 8 77 356 547 11 28 106 428 10 82 378 691 12 28 156 469 12 98 408 769 20 31 218 602 16 155 411 1111 20 35 230 681 17 189 420 1173 22 35 231 690 22 199 449 1213 25 46 316 1112 64 247 490 1357 25 49 393 1180 followed by an infusion of bone marrow from an HLA identical family member Patients were then assigned to one of two treatment groups cyclosporine and methotrexate CSP MTX or methotrexate alone MTX An important endpoint in the trial was the time from assignment until the diagnosis of a life threatening stage of acute graft versus host disease AGVHD a frequent complication where the donors bone marrow cells attacks the patients organs and tissue Table 48 presents these times in days for the 64 patients In this trial only 20 of the 64 patents actually reached the endpoint the remaining 44 patients were right censored Q UESTIONS OF INTEREST o How should we model the times to diagnosis 0 How should we compare the treatments 0 What effect if any do the censored observations have on any resulting analysis which compares the two treatments 0 Other covariates eg incubation status age etc were also recorded Might the use of these covariates sharpen the analysis between the two treatments PAGE 140 CHAPTER 4 STAT 5137 J TEBBS 42 Describing the distribution of time to an event TERMINOLOGY To start we introduce some notation on concepts for describing the distribution of time to an event for a population of individuals Let the random variable T denote the time to an event It is understood to mean that T is a positive random variable for which there is an unambiguous start point of infection start of treatment etc and end death diagnosis etc with the period in between corresponding to T Random variables T with positive support are called lifetime random variables 0 survival time from birth to death 0 the time from treatment of lung cancer to death this may be a bit tricky if indi viduals die from causes other than lung cancer more about this later 0 the time to diagnosis of a more severe condition eg AIDS etc 0 days to vaginal cancer mortality in rats NOTE The time of interest may not always correspond to something deleterious such as time to death For example we may consider the time to the eradication of an infection measured from the initiation of some antibiotic used to treat patients with the infection In this situation it is preferable to shorten the distribution of times whereas in the other examples it is desirable to lengthen time DESCRIPTION We now describe some different but equivalent ways of de ning the distribution of T In our discussion we will assume that T is continuous o The cumulative distribution function cdf FTt PT S t o The survivor function Sm PT gtt1i Fm PAGE 141 CHAPTER 4 STAT 5137 J TEBBS time time Figure 418 An eccponential cumulative distribution function with mean A 1 left and the corresponding survivor function right 0 The probability density function fTt EFTW d 75 t dt To dt Also7 note that E I m uyjmf mmi Example 42 Suppose that failure times T in years7 say follow an exponential dis tribution with mean A 1 In this case7 fTt e tIt gt 0 Furthermore7 for t gt 07 FTt 1 7 e 7 and STt equot Graphs of the cdf and survivor function appear in Figure 418 Note that ST1 0367 ie7 only 367 percent of the population survives one year Also7 STm 05 s m S1051n2 m 0693 ie7 the median survival is 0693 years PAGE 142 CHAPTER 4 STAT 5137 J TEBBS Figure 419 Survivor functions for two stoohastioally ordered distributions TERMINOLOGY We say that the distribution of a survival time T1 is stochastically larger than another survival time T2 and write T1 Est T2 if the survival distribution of T1 is greater than or equal to the survival distribution of T2 for all t that is For example T1 may denote the survival time of a population if subjects were given treatment 1 and T2 is the corresponding survival time if subjects were given treatment 2 At any point in time a greater proportion of the population will survive if given treatment 1 as opposed to treatment 2 TERMINOLOGY The mortality rate at time t is the proportion of the population who fail between times t and t 1 among individuals alive at time t Usually t is taken to be an integer in terms of some unit of time eg day month year etc ie mTt Pt T ltt1lT 2 t The mortality rate for the human population might look like Figure 420 PAGE 143 CHAPTER 4 STAT 5137 J TEBBS 015 I mt 010 I 005 I 0 20 40 60 80 100 age in years Figure 420 A typical mortality pattcm for an animal population TERMINOLOGY The hazard rate is a useful way of describing the distribution of the time to an event because it has a neutral interpretation that relates to the aging of a population A hazard rate is just a continuous version77 of a mortality rate lnformally the hazard rate ATt is the limit of the mortality rate if the interval of time is taken to be arbitrarily small ie the mortality rate is the instantaneous rate of failure at time t given that the individual is alive at time t In symbols ATQ haO PtTlttth2t h NOTE The hazard rate is not a probability rather it is a probability rate Therefore it is possible that a hazard rate may exceed one REMARK The hazard rate or hazard function is very important characteristic of a lifetime distribution It indicates the way the risk of failure varies with time and this is of interest in most applications Distributions with increasing hazard functions are seen for individuals for whom some kind of aging or wear out77 takes place Certain types of electronic devices actually display a decreasing hazard function PAGE 144 CHAPTER 4 STAT 5137 J TEBBS NOTE It is insightful to note that PtTlttth2t AT i PtTltthh fTt igs t d PT 2 t Sm Tc 7 10gSTt39 Thus integrating both sides of the last equation we get t flogSTt mum 2AM 0 The function ATt is called the cumulative hazard function Consequently t STt exp 7 ATudu 0 exp 7ATt RELATIONSHIPS Because ofthe one to one relationships that we have just derived we can summarize the distribution of the continuous survival time T by using fTt FTt STt OI ATt Example 43 Suppose that failure times T follow a Weibull distribution with pdf fTt WWW1 exp M al 1it gt 07 where the parameters A and B are parameters larger than zero This model is commonly used in engineering applications but has also been used in biological and medical applica tions for example in studies on the time to occurrence of tumors in human populations Note that when 6 1 the Weibull distribution reduces to an exponential distribution with mean ET 1A In general the Weibull hazard rate function is given by A t MONA for t gt 0 Plots of Weibull hazard functions are given in Figure 421 It is easy to show 0 ATt is increasing if B gt 1 population gets weaker with aging o ATt is constant if B 1 constant hazard exponential distribution and o ATt is decreasing if lt 1 population gets stronger with aging PAGE 145 CHAPTER 4 STAT 5137 J TEBBS beta05 a ambdat 2 I Figure 421 Wcibull hazard functions with A 1 OTHER PARAMETRIC MODELS o Lognormal mt expp aogt 7 w w gt o o Gamma 7 1 071 7 fTt 7 Rama 6 t t gt 0 o GompertZ Makeham ATt 6 6677 0 Others Extreme value log logistic inverse Gaussian Burr Pareto etc REMARK In most clinical trials applications and research in survival analysis it has become common to use nonparametric and semiparametric models where the shape of the distribution function is left unspeci ed This is the approach we take henceforth PAGE 146 CHAPTER 4 STAT 5137 J TEBBS 43 Censoring and life table estimates REMARK Two important issues arise in clinical trials when time to event data are being considered 0 Some individuals are still alive the event of interest has not occurred at the time of analysis resulting in right censored data 0 The length of follow up varies due to staggered entry over calendar time Patient time is measured from entry into the study In addition to censoring occurring because of insuf cient follow up it may also occur for other reasons For example 0 loss to follow up ie the patient stops coming to the clinic or moves away 0 death from other causes competing risks These different forms of censoring are referred to as random right censoring Random censoring instead of right censoring due to staggered entry creates certain dif culties in the analysis as is illustrated by the following example Example 44 Data from 146 individuals who previously had myocardial infarction Ml ie a heart attack and participated in a clinical trial for a antihypertensive treat ment are given in Table 49 The data have been grouped into one year intervals and all time is measured in terms of patient time How should we estimate the ve year survival rate 85 Two naive answers are given by 76 deaths in 5 years 0521 E 5 0479 146 individuals 5 l 76 deaths in 5 years 0650 E 5 0350 146 29 withdrawn in 5 years gt o The rst estimate would be appropriate if all subjects withdrawn in the rst 5 years were censored exactly at the ve year mark ie at time t 5 However this is not PAGE 147 CHAPTER 4 STAT 513 J TEBBS Table 49 Myocardial infarction data Measured in patient time Number alive and Year since under observation at Number dying Number censored entry into study beginning of interval during interval or withdrawn 0 1 146 27 3 1 2 116 18 10 2 3 88 21 10 3 4 57 9 3 4 5 45 1 3 5 6 41 2 11 6 7 28 3 5 7 8 20 1 8 8 9 11 2 1 9 10 8 2 6 the case so this estimate is overly optimistic ie this overestimates 55 This corresponds to censoring on the right The second estimate would be appropriate if all individuals censored in the 5 years were censored immediately upon entering the study ie at time t 0 This is also not the case so this estimate is overly r 39 39 Hr ie this under 55 This corresponds to censoring on the left LIFE TABLE ESTIMATES Note that 55 can be expressed as 5 55 H 113 i1 where q1iPi71 TltiT2i71 mortality rate at year t i 7 1 for i 1 2 5 So we just need to estimate 1 Note that 1 7 q is the mortality rate mt at year t i 7 1 PAGE 148 CHAPTER 4 STAT 513 J TEBBS RIGHT CENSORING Suppose that anyone censored in an interval of time is censored at the end of that interval right censoring Our table then looks like ti7172 not dlttgt was alttgt 177 Rlttigtn1wlttgt 01 146 27 3 0185 0815 0815 12 116 18 10 0155 0845 0689 23 88 21 10 0239 0761 0524 34 57 9 3 0158 0842 0441 45 45 1 3 0022 0972 0432 Thus if right censoring was used our estimate of the 5 year survival probability based on the life table would be 8135 0432 LEFT CENSORING Suppose that anyone censored in an interval of time is censored at the beginning of that interval left censoring Our table then looks like mitt nlttgt dos we mm limos Llttigtn1wlttgt 01 146 27 3 0189 0811 0811 1 2 116 18 10 0170 0830 0673 23 88 21 10 0269 0731 0492 34 57 9 3 0167 0833 0410 4 5 45 1 3 0024 0976 0400 Thus if left censoring was used our estimate of the 5 year survival probability based on the life table would be L5 0400 SUMMARY 0 Our naive estirnates ranged from 0350 to 0479 0 Our life table estirnates ranged from 0400 to 0432 depending on whether we assumed censoring occurred on left or right of each interval 0 However it is likely that censoring occurs at a time inside the interval Thus 8132 and Lt are still too optimistic and pessirnistic respectively PAGE 149 CHAPTER 4 STAT 5137 J TEBBS Survival probability 0 39m l 0 a l l l 4 6 Time in years Figure 422 Myocardial infarction data Life table estimate of the survival distribution COMPROMISE A compromise is to use the following table ti1722 not do was mew was lttigtn1ealttgt 071 146 27 3 0187 0813 0813 17 2 116 18 10 0162 0838 0681 273 88 21 10 0253 0747 0509 37 4 57 9 3 0162 0838 0426 47 5 45 1 3 0023 0977 0417 Based on this table7 our estimate would be 04177 which is7 of course7 between 8132 and Lti This is usually referred to as the lifetable estimate The value nt 7 wt2 is sometimes called the effective sample size A plot of the estimated survival probabilities is in Figure 422 INFERENCE Of course7 the life table estimates are computed from a sample of data7 and7 hence7 are subject to variability Theoretical arguments show that is approxi PAGE 150 CHAPTER 4 STAT 5137 J TEBBS mately normal with mean St and variance which is consistently estimated by t oa k43 V Gf WE5i where 71 nj dj dj and 10 wj Consider the following table m4 we aw we an 7375 601 01 146 27 3 0813 000159 0032 1 2 116 18 10 0681 000327 0039 2 3 88 21 10 0509 000735 0044 3 4 57 9 3 0426 001084 0044 4 5 45 1 3 0417 001138 0044 The estimator S t is commonly referred to as Greenwood7s Formula An ap proximate large sample con dence interval for St is given by wwia ewoi where 7ar t12 Thus an approximate 95 percent con dence interval for 55 is given by 0417 i1960044 or 03310503 That is we would expect that 331 to 503 percent of the patients survive 5 years after the Ml episode NOTE The life table estimator for a survival function St dates back to the 1920s The life table method is designed primarily for situations in which actual failure and censoring times are unavailable and only the ds and ms are given for the jth interval 44 The Kaplan Meier estimator NOTE In Example 44 we see that the bias from estimating the survival distribution when incorrectly assuming that censoring occurs at the left or right of each interval PAGE 151 CHAPTER 4 STAT 5137 J TEBBS decreases when the interval is taken to be smaller eg 1 year as opposed to 5 year intervals Thus if the data are not grouped ie we know the exact times we may want to apply the life table estimator using intervals with very small lengths THE KAPLAN MEIER ESTIMATOR The limit of the life table estimator ie when the interval lengths are taken so small that at most one observation occurs within any interval is called the productlimit estimator or the Kaplan Meier estimator Kaplan and Meier 1958 derived this limiting estimator using likelihood theory not as the limit of the life table estimator However it is instructive and intuitive to consider the KM estimator as a limit of the life table estimator NON INFORMATIVE CENSORING In order for life table estimators to give unbiased results there is an implicit assumption that individuals who are censored are at the same risk of subsequent failure as those who are still alive and uncensored The risk set at any point in time individuals still alive and uncensored should be representative of the entire population alive at the same time so that the estimated mortality rates re ect the true population mortality rates REMARK lf censoring only occurs because of staggered entry then the assumption of independent censoring seems plausible However when censoring results from loss to follow up or death from a competing risk then this assumption is suspect This follows because the censoring processes depends on the survival time If at all possible censoring from these latter situations should be kept to a minimum Example 45 An illustration of the Kaplan Meier estimator In this example we will work with a small ctitious data set to illustrate how the KM estimator is computed Suppose we have the following death and censoring times for n 10 patients Time 45 75 85 115 135 155 165 175 195 215 Censored 1 1 0 1 0 1 1 0 1 0 Here 177 means the observation was a death and 077 means the observation was censored Thus we have 6 deaths and 4 censored observations out of the 10 patients Consider PAGE 152 CHAPTER4 STAT51ampJTEBBS the following calculations 17mm 1 1 1 1 50 1 1 1 1 0 ml s MW 5 if W i 5 00 aim gig rim w m o w m o 4 o o dt number of deaths in an interval i nt 7 number at risk at beginning of the interval RESULT In the limit the Kaplan Meier or product limit estimator will be a step function taking jumps at time where a failure occurs Thus since there is at most one occurrence in any interval of time at any time t the KM estimator of the survival distribution is computed by A l S 7 7 t H 1 77 719 where 71 is the number of individual still at risk at time t By convention the KM estimator is taken to be right continuous The KM estimator for the data in Example 45 is given in Figure 423 An R program to t censored survival data follows Name Joshua M Tebbs Date 21 Nov 2007 Purpose Fit Kaplan Meier estimator in Example 45 notes input the failurecensoring times survtime lt c457585115135155167175195215 input the failurecensoring indicators 1 failure 0 censored status lt c1101011010 Use internal survfit function Use conftypequotplainquot to place confidence band about the survivor function using Greenwood s formula fit lt survfitSurvsurvtimestatus conftypequotnonequot returns a summary of the fit summaryfit plotfitxlabquotPatient time in yearsquot ylabquotsurvival probabilityquot PAGE153 CHAPTER 4 STAT 5137 J TEBBS survival probability I I I I 0 5 10 15 20 Patient time in years Figure 423 Kaplaa Meter estimate for the data in Example 45 OUTPUT Here is the output for the program Note that to get the con dence lirnits7 I used the option conf typequotplainquot Otherwise7 no con dence are given in the output R will also plot the con dence limits on the KM estimate plot See Figure 424 on the next page Call survfitformula Survsurvtime status conftype quotplainquot time nrisk nevent survival stderr lower 95 o CI upper 95 o CI 45 10 1 0900 00949 07141 1000 75 9 1 0800 01265 05521 1000 115 7 1 0686 0 1515 03888 0983 155 5 1 0549 0 1724 02106 0887 167 4 1 0411 0 1756 00673 0756 195 2 1 0206 0 1699 00000 0539 CONFIDENCE BANDS The con dence bands are formed by taking St i 196 gtlt St where the estimated standard error comes from Greenwood7s Formula PAGE154 CHAPTER 4 STAT 5137 J TEBBS survival probability I I I I 0 5 10 15 20 Patient time in years Figure 424 Kaplaa Meier estimate for the data in Example 45 with con dence limits DESCRIPTION In describing censored survival data7 it is useful to conceptualize the existence of two latent variables corresponding to the failure time and censoring time o For the ith individual7 denote the failure time by T1 and the censoring time by 01 These are referred to as latent variables since they are not always observed 0 The random variable Tl corresponds to the ith individual7s survival time if that individual were observed until death7 whereas 01 corresponds to the time that the ith individual would have been censored assurning death did not intervene o For example7 Cl rnay be the time from entry into the study until the time of analysis If censoring were to occur for other reasons eg7 loss to follow up7 cornpeting risks7 etc this would have to be accounted for in the analysis OBSERVABLES ln actuality7 we get to observe the minimum of T1 and 01 which we denote by the random variable PAGE 155 CHAPTER 4 STAT 5137 J TEBBS and whether the individual failed died or was censored ie we get to see the failure indicator 17 if Ti S Ci 0 ifT gt 0 Ai The variables XiA are the observables in an actual experirnent whereas T and C are latent variables which are useful in conceptualizing the problem GOAL Although it is not always observed the main objective in a clinical trial involving survival analysis is to make inference about the probability distribution of the latent variable T For example in the one sarnple problem we are interested in estimating the survival distribution STt PT gt t The data available are XtAi z3912n If we de ne the number of individuals at risk at time t in our sample by ie nt is the number of individuals in the sample who neither died nor were censored by time t then the KM estimator for St is given by This is the KM estirnator when there are no tied survival times in our sample DEALING WITH TIES Let dt denote the number of observed deaths in the sample at time t that is Generally dt is equal to 0 or equal to 1 with continuous survival data where there are no ties More generally however dt may be greater than 1 when the survival data allows ties In this situation we can write the KM estirnator as KMtg1i PAGE 156 CHAPTER4 STAT51ampJTEBBS where Au is the set of all death times u less than or equal to t A consistent estimator of the variance of the KM estimator is also taken as the limit of Greenwood7s Formula in particular KMW KMWZ g lnltugtnltugt e dltugt Thus in general because KMt is approximately normal in large samples a 1001 7 04 percent con dence interval for St is given by KMlttgt i meme where wa KMwlZ Example 46 In this example we will simulate a set of censored survival data for n 50 patients We assume that the true distribution which generates the true survival times is exponential with mean ve years ie A 02 per year The true median survival time thus is given by the value of m that solves m 02502 05 0 Straightforward calculations show that m 5 log2 3468 years We shall also assume that the potential censoring time is independent from the survival time and has an exponential distribution with mean ten years ie A 01 per year ie the censor ing time distribution is stochastically larger than the survival time distribution An R program to perform this simulation along with the results is on the next page Name Joshua M Tebbs Date 22 Nov 2007 Purpose Simulation exercise Example 46 generate the data n50 failure mean 5 censoring mean 10 survtime lt reXp5002 censtime lt reXp5001 creates indicator variable Delta 1 if surv lt cens 0 otherwise status lt survtime lt censtime PAGE 157 CHAPTER4 STAT51ampJTEBBS obstime are the observed simulated data we would see obstime lt survtimestatus censtime1 status Compute KM estimator fit lt survfitSurvobstimestatus conftype quotplainquot summaryfit plotfitxlabquotPatient time in yearsquot ylabquotsurvival probabilityquot ablineh05 lty1 Call survfitformula Survobstime status conftype quotplainquot time nrisk nevent survival stderr lower 95 CI upper 95 CI 0097 50 1 0980 00198 094119 1000 0229 48 1 0960 00280 090470 1000 0403 46 1 0939 00343 087151 1000 0548 45 1 0918 00394 084070 0995 0686 43 1 0897 00439 081056 0982 0777 42 1 0875 00477 078163 0969 0819 41 1 0854 00511 075364 0954 0889 40 1 0832 00541 072643 0939 1492 37 1 0810 00571 069800 0922 1524 36 1 0787 00598 067025 0905 1647 35 1 0765 00622 064309 0887 1708 32 1 0741 00647 061430 0868 1860 29 1 0716 00673 058360 0847 1973 27 1 0689 00698 055214 0826 1985 26 1 0663 00720 052139 0804 2046 25 1 0636 00738 049129 0781 2400 24 1 0610 00754 046180 0757 2767 23 1 0583 00766 043286 0733 3458 21 1 0555 00778 040271 0708 3568 20 1 0527 00787 037318 0682 4321 17 1 0496 00800 033970 0653 4624 16 1 0465 00808 030711 0624 4806 15 1 0434 00811 027538 0593 5708 12 1 0398 00821 023738 0559 5791 11 1 0362 00822 020091 0523 6533 9 1 0322 00823 016045 0483 7699 6 1 0268 00843 010298 0433 10816 5 1 0215 00827 005235 0377 11194 4 1 0161 00775 000897 0313 14821 3 1 0107 00677 0 0240 15389 2 1 0054 00508 0 0153 19674 1 1 0000 NA NA NA PAGElM CHAPTER 4 STAT 5137 J TEBBS survival probability Patient time in years Figure 425 KM estimate for simulated data in Example 46 with con dence limits NOTE Our estimate ofthe median based on these simulated data would be KM 105 4321 years However we see from Figure 425 that the approximate 95 percent con dence interval for the median survival time includes in 3468 45 Twosample tests GOAL In many applications especially in Phase III clinical trials we are often interested in comparing two or more treatments If the primary endpoint is time to an event then the preliminary focus is to determine whether one treatment increases or decreases the distribution of this time Let Z denote treatment group If there are two treatments of interest then Z 6 1 2 TWO SAMPLE PROBLEM The problem of comparing two treatments can be posed as a hypothesis test The null hypothesis is that the distribution of time to death is the same for both treatments If we denote by Slt and Sgt the survival distributions for PAGE 159 CHAPTER 4 STAT 5137 J TEBBS treatments 1 and 2 respectively the null hypothesis of no treatment difference is H0 Slt 520 for all t gt 0 in terms of the survivor functions Sift PT 2 tlZ j for j 1 2 or equivalently in terms of the hazard functions H0 A1t A2t for all t gt 0 where Ajt 7l0gSJt for j 12 A possible alternative hypothesis is one specifying that the survival time for one treatment is stochastically larger or smaller than the other treatment If for example treatment 1 is the standard treatment then a onesided alternative may be of interest ie H1 Sgt 2 510 for all t gt 0with strict equality for some t However more often the twosided alternative is considered ie H1 3 Z OI S 52t with strict equality for some t gt 0 NONPARAMETRIC BASED INFERENC39E It has become standard practice in clinical trials to use nonparametric tests that is to use test statistics whose distribution under the null hypothesis does not depend on the shape of the underlying survival distributions at least not asymptotically The most widely used test in censored survival analysis is the logrank test which will now be described DATA AND NOTATION Data from a two sample clinical trial with survival data can be expressed as a sample of triplets namely Ai 1 2 n where X minTC Here recall that T is the latent failure time and C is the latent censoring time for the 2th individual Also recall that the failure indicator 17 ileSOi 0 ifTgto Ai PAGE 160 CHAPTER 4 STAT 5137 J TEBBS and 17 for treatment group 1 Zl39 27 for treatment group 2 De ne 711 to be the number of patients assigned to treatment 1 ie7 where7 of course7 n 711 712 The number at risk at time u from treatment 1 is denoted by 711 ie7 n Similarly7 n 71211 2109 1121 2 i1 is the number at risk at time u from treatment 2 The number of deaths at time u from treatment 1 is denoted by d1 ie7 Similarly7 d2u Z3109 uAi 121 2 i1 is the number of deaths at time u from treatment 2 The number of deaths at time u for both treatments is du d1u d2u This notation allows for the possibility of haVing more than one death occurring at the same time tied survival times The logrank test is based on the statistic Z ldlltugt 7 Warm AW 71W PAGE 161 CHAPTER 4 STAT 5137 J TEBBS where Au is the set of all death times for both treatments This is not the test statistic itself we now derive77 its particular form INFORMAL DERIVATION OF THE LOGRANK TEST A formal derivation of the logrank test statistic7 as well as asymptotic considerations would facilitate the need to discuss martingale theory We7ll avoid these details and will take the following informal approach The statistic above can be viewed as the sum over the distinct death times of the observed number of deaths from treatment 1 ie7 d1u7 minus the expected number of deaths from treatment 1 Here7 the expected77 number of deaths n u ll dltugt Mu refers to the expected number of deaths when H0 is true At any point in time u cor responding to a time where a death was observed ie7 when du 2 17 the data at that point in time can be summarized by the following 2 gtlt 2 table Treatment 1 Treatment 2 Total Number of deaths d1 d2u du Number alive 711u 7 d1u 712u 7 d2u 7 du Total n1u 712u Thus7 from this point of view7 the data can be envisioned as a sequence of r 2 gtlt 2 tables7 where 7 denotes the number of distinct death times Now7 if H0 were true7 we would expect to be equal to zero7 as well as the sum n1u Z d1u 7 W du Au over all death times in However7 PAGE 162 CHAPTER 4 STAT 5137 J TEBBS o if the hazard rate for treatment 1 was greater than the hazard rate for treatment 2 over all u then we would expect o if the hazard rate for treatment 1 was less than the hazard rate for treatment 2 over all u then we would expect This suggests that the null hypothesis of treatment equality should be rejected if the statistic 711u d1u 7 7du 21 t is too large or too small depending on the type of alternative we are interested in In order to gauge the strength of evidence against the null hypothesis we must be able to evaluate the distribution of the test statistic at least approximately under the null hypothesis Therefore the test statistic needs to be standardized appropriately Speci cally this standardized version is the logrank test statistic and is given by zit d1 u 7 mm Tn TH W udunudul SAW 1 52unu71 SAMPLING DISTRIBUTION We now argue that under H0 ie the survival distri butions for treatments 1 and 2 are the same Tn N AN0 1 To see why this is true consider the 2 gtlt 2 table Treatment 1 Treatment 2 Total Number of deaths d1u du Number alive 7 du Total 711u 712u PAGE 163 CHAPTER 4 STAT 5137 J TEBBS Conditional on the marginal counts d1u has the following hypergeometric probability Pd1u d W7 at where each of these binomial coef cients is well de ned Thus the conditional mean and distribution variance of d1u is given by and n1un2udunu du n2unu 1 7 respectively The sum of these variances over the death times in Au is the variance estimate of the test statistic 2 d1 u 7 15 dltugt AW under the null hypothesis this is not an intuitive result and it follows that the logrank test statistic Tn is approximately distributed as a standard normal if the null hypothesis is true ie Tn Lg AN0 1 THEORETICAL NOTE The key here is recognizing that the last expression is a sum of conditionally uncorrelated terms each with mean zero The asymptotic normality result for the test statistic Tn then follows from the Martingale Central Limit Theorem this is a special CLT for dependent type data REJECTION REGION Since Tn Lg AN0 1 an 04 level test two sided would reject H0 Slt 520 whenever lTnl 2 2042 If for example we were interested in showing that treatment 1 is better ie longer survival times than treatment 2 then we would reject H0 when Tn S 720 since under H1 we would expect the observed number of deaths from treatment 1 to be less than that expected under H0 If we want to show that treatment 2 is better than treatment 1 then we would reject H0 when Tn 2 20 since under H1 we would expect the observed number of deaths from treatment 1 to be larger than that expected under H0 PAGE 164 CHAPTER4 STAT51ampJTEBBS Table 410 Time to death in patients with advanced AIDS Starred subjects represent censored observations Standard treatment HAART 14 333 706 1730 64 863 1873 2380 17 444 909 1834 178 998 1993 2680 128 558 1213 2244 478 1205 1999 2696 129 568 1216 2246 533 1232 2140 2896 164 677 1420 2565 742 1232 2204 3223 228 702 1527 3004 756 1433 2361 3344 NOTE All the arguments we made above were based on summarizing the data by 2 gtlt 2 tables at the distinct death times In constructing our logrank test statistic we never made any assumptions regarding the shape of the underlying survival distributions in deriving the numerator or its variance This intuitively explains why this test is non parametric in nature Example 47 Highly active antiretroviral therapy HAART is the combination of several antiretroviral medications used to slow the rate at which HlV makes copies of itself multiplies in the body A combination of three or more antiretroviral medications is more effective than using just one medication monotherapy in the treatment of HIV In a two arm clinical trial involving patients with advanced AIDS 24 patients receive a standard monotherapy treatment 1 and 24 patients receive a new HAART treatment 2 Deathcensoring times measured in days are given Table 410 The Kaplan Meier estimates for these data by treatment are given in Figure 426 Here is the R program I used to analyze these data Name Joshua M Tebbs Date 23 Nov 07 Purpose Illustration of logrank test in Example 47 PAGE1amp CHAPTER4 STAT51ampJTEBBS survival probability I I I I 0 500 1000 1500 2000 2500 3000 Patient time in days Figure 426 Kaplan Mez39er estimates for AIDS patients in a two arm trial input treatment 1 data survtime1 lt c1417128129164228333444558568677702706909 1213 142015271730183422462565300412162244 input treatment 2 data survtime2 lt c641784785337427568639981205123212321433 187319931999214023612380268026962896322322043344 combines data into one vector survtime lt csurvtime1survtime2 create deathcensoring indicator repab creates a vector of length b consisting entirely of a s status lt crep12200rep12200 treatment indicator 1 s and 2 s treat lt crep124rep224 fit KM estimates for each treatment fit1 lt survfitSurvsurvtimestatus treat PAGElM CHAPTER4 STAT51ampJTEBBS logrank test for difference in survivor distributions fit2 lt survdiff Survsurvtimestatus treat plot KM estimates plotfit1XlabquotPatient time in daysquot ylabquotsurvival probabilityquot ltyc14 create legend click where you want legend to go names lt cquottrt1quotquottrt2quot legendlocator1 legendnames ltyc14 btyquotnquot OUTPUT The fit1 output gives con dence intervals for the median survival times based on the KM based estimate using Greenwood7s formula The fit2 output gives the value of the test statistic T3 ie7 the square of the logrank test statistic Recall that T is approximately distributed as a x random variable when H0 is true gt fit1 Call survfitformula Survsurvtime status treat conftype valainn n events mean semean median 095LCL 095UCL treat1 24 22 1077 189 706 444 1527 treat2 24 22 1671 199 1873 998 2361 gt fit2 Call survdiff formula Survsurvtime status treat N Dbserved Expected D E 2E D E 2V treat1 24 22 158 244 398 treat2 24 22 282 136 398 Chisq 4 on 1 degrees of freedom p 00461 ANALYSIS The test statistic is given by T 3987 with a probability value ie7 the area to the right of 398 under the x density curve equal to 00461 At the a 005 level7 we have suf cient evidence to reject H0 Slt 520 in favor of the two sided alternative That is7 we have signi cant evidence to conclude that the two survivor functions corresponding to the two treatments are different It is worth noting that R PAGE 167 CHAPTER 4 STAT 5137 J TEBBS and SAS gives the square of the logrank test statistic Anyway this is not prohibitive as lTnl 2 2042 ltgt T3 2 xia NOTE Suppose that we a priori speci ed a one sided alternative H1 51 t S 82t ie patients on treatment two HAART lived longer on average Noting that the observed number of deaths for treatment 1 22 is larger than the expected number of deaths 158 we know that Tn er 1995 that is Tn is positive not negative Thus at the 04 005 level we would reject H0 in favor of the one sided H1 since Tn 2 165 the upper 005 quantile of the standard normal distribution 46 Power and sample size considerations for twosample tests IMPORTANT Thus far we have only considered the distribution of the logrank test statistic Tn under the null hypothesis However we know that in order to assess statistical sensitivity we must also consider the power of the test or the probability of rejecting the null hypothesis under some feasible clinically important77 alternative hypothesis 461 Proportional hazards PROPORTIONAL HAZARDS One popular way of specifying an alternative hypothesis is by the use of a proportional hazards assumption That is if we denote the hazard function at time t for treatments 1 and 2 by A1t and A2t respectively then the proportional hazards assumption corresponds to the case wherein A10 A20 We parameterize through the use of exp77 since a hazard ratio must be positive and exp77 for all t 2 0 the 77 0 case would correspond to equal hazards for both treatments which corresponds to the null hypothesis of no treatment difference Using the above parameterization o 77 gt 0 gt individuals on treatment 1 have worse survival ie they die faster PAGE 168 CHAPTER 4 STAT 5137 J TEBBS o 77 0 gt null hypothesis is true 0 77 lt 0 gt individuals on treatment 1 have better survival ie7 they live longer VISUAL ASSESSMENT Under a proportional hazards assurnption7 it follows that ver ifyl log7 log S1t log7 log Sgt 77 This relationship suggests that if we plot two survival curve estirnates eg7 the Kaplan Meier estimates on a log7 log scale7 then we can assess the suitability of a proportional hazards assurnption lf7 in fact7 proportional hazards was reasonable7 we would expect to see something approxirnately like that in Figure 427 090931 time Figure 427 Two survivor functions plotted on the log7log scale These survivor functions satisfy the proportional hazards condition SPECIAL CASE For the special case where the distributions are exponential7 so that the hazard functions are constant7 we automatically have proportional hazards since 7 A1 Mt A PAGE 169 CHAPTER 4 STAT 5137 J TEBBS The median survival time of an exponential distribution with hazard A is m ln2 verifyl Therefore7 the ratio of the median survival times for two treatments that are exponentially distributed with hazard rates A1 and A2 is E ln21 A2 m2 i That is7 the ratio of the medians for two exponentially distributed random variables is inversely proportional to the ratio of the hazards PRACTICAL IMPORTANCE This last result may be useful when one is trying to elicit clinically important differences in medical collaborations lf survival distributions are well approximated by exponential distributions then asking clinicians to determine clinically important differences in median survival is easy for them to relate to compared to hazard ratios This can then be easily translated into a clinically important hazard ratio through the inverse relationship derived above LINK TO THE LOGRANK TEST Theoretical arguments show that the logrank test is the most powerful test among all nonparametric tests to detect alternatives which follow a proportional hazards relationship Therefore7 the proportional hazards assumption not only has a simple interpretation for describing departures from the null hypothesis but also has some nice statistical properties associated with the use of the logrank test POWER CONSIDERATIONS In order to compute the power and the necessary sample sizes for a survival study7 we need to know the distribution of the logrank test under the alternative hypothesis For a proportional hazards alternative Mt H1 3 expW for t gt 07 the logrank test statistic is distributed approximately as Tn N Nd91 9l1277A717 where d is the total number of deaths from both treatments during the whole trial and 6 is the proportion of subjects randomized to treatment 1 For the usual two arm clinical trial where 6 05 ie7 randomization to treatments with equal probability7 theoretical PAGE 170 CHAPTER 4 STAT 5137 J TEBBS arguments show that for a level 04 test two sided to have power 1 7 B to detect the alternative of interest we must have dlZ se HAT t zag z Solving for d we get 42 2102 773 39 As you can see the desired power is directly related to the number of deaths d during d the course of the trial For example if 04 005 17 B 090 and t9 05 then 4196 1282 d 773 Consider the following table of hazard ratios eXp77A Hazard ratio exp77A No of deaths d 200 88 150 256 125 844 110 4623 NOTE As exp77A becomes closer to one we are trying to detect smaller differences between the hazard functions A1t and A2t thus we7ll intuitively need a larger sample size to detect these smaller departures from H0 SAMPLE SIZE CONSIDERATIONS During the design stage we must ensure that a suf cient number of patients are entered into a study and are followed long enough so that the requisite numbers of deaths are attained One straightforward approach is to just continue the trial until we obtain the required number of deaths Example 48 Suppose that patients with advanced lung cancer historically have a median survival of six months We have a new treatment treatment 2 which if it increases median survival to nine months would be considered clinically important We would like to detect such a difference with 90 percent power using a level 04 005 PAGE 171 CHAPTER 4 STAT 5137 J TEBBS test two sided If both survival distributions are approximately exponential then the clinically important hazard ratio is 410 A2t m1 Thus 77A log32 04055 With these criteria we would need to observe 4196 71282 04055 9 732 6 d m 256 deaths One way of accomplishing this goal for example is to enter say 350 patients randomize 175 to each treatment arm and follow these patients until we have 256 deaths Note that l have chosen 35077 rather arbitrarily here we now give more careful consideration to some of the design aspects of a clinical trial with a survival endpoint 462 Design speci cations NOTE In most clinical trials involving time to death endpoints arbitrarily picking a number of patients and waiting for d deaths will not be adequate for the proper planning of the trial It is more often the case that we need to specify to the investigators the following o the number of patients 0 the accrual period and o the length of follow up time KNOWN RESULT It has been shown that to obtain reasonable approximations for the power we need the expected number of events deaths computed under the alternative hypothesis to be equal to 4za2 2 9 ie we must compute the expected number of deaths separately for each treatment arm d under the assumption that the alternative is true The sum of these expected values from both treatments should be equal to d PAGE 172 CHAPTER 4 STAT 5137 J TEBBS NOTE To compute the expected number of deaths we will assume that censoring is due to lack of follow up resulting from staggered entry If we additionally have other forms of censoring eg competing risks loss to follow up etc then the computations which follow would have to be modi ed NOTATION In order to more thoroughly plan a trial with a survival endpoint we de ne the following notation o A is the accrual period that is the calendar period of time that patients are entering the study eg January 1 2008 through December 31 2013 F is the followup period that is the calendar period of time after accrual has ended before the nal analysis is conducted 0 L A F denotes the total calendar time of the study from the time the study opens until the nal analysis 0 au is the accrual rate at calendar time u more precisely au lim expected number of patients entering between uu h haO h The total expected number of patients in the clinical trial is then given by 0A audu If we have a constant accrual rate this is a common assumption made in the design stage then au a and the total expected number of patients is aA DESIGNING THE TRIAL Suppose we have a clean77 clinical trial where there is no loss to follow up and no competing risks If au is the accrual rate onto a study randomized equally to two treatments then the expected number of deaths for treatment 1 is A a u d1 7 udu 0 where is the cumulative distribution function for the survival time for treatment 1 To see why this makes sense note that PAGE 173 CHAPTER 4 STAT 5137 J TEBBS 0 we would expect du patients to enter in the interval of time from u to u du 0 Of these patients7 the proportion FL 7 u are expected to die by the end of the study ie7 at time L o This number summed ie7 integrated over u7 for values u 6 07A yields the expected number of deaths on treatment 1 Similarly7 then the expected number of deaths for treatment 2 is A a u d2 7 udu 0 2 where is the cumulative distribution function for the survival time for treatment 2 The sum of these expected values7 from both treatments7 should equal d1 d2 thus7 we are to set 4za2 z 2 2 A d1d2 Note that the number of deaths can be affected by the accrual rate7 the accrual period sample size7 the follow up period7 and the failure rate survival distribution Some or all of these factors can be controlled by the investigator and have to be considered during the design stage Example 49 Suppose that the accrual rate is constant at a patients per year7 and that we randomize equally to two treatments7 so that the accrual rate is 12 patients per year for each treatment Also7 suppose that the survival distribution for treatment j is exponential with hazard ratio Aj j 12 Then7 we have that A dj g 1 7 e AjL du 0 a 57AM AA 7A7 M e 71 Suppose that during the design stage7 we expect a 100 patients per year to be recruited into the study Median survival for treatment 1 is 4 years thus7 4 ln21 A1 m 0173 PAGE 174 CHAPTER4 STAT51ampJTEBBS We desire the new treatment 2 to increase median survival to 6 years so that A2 0116 If this happens we want to have 90 percent power to detect it using a logrank test at the 04 005 two sided level of signi cance The hazard ratio is AgAl 32 and 77A ln32 Thus the total number of deaths must be d 4za222 2 4196 1282 256 77A 1H322 so that d1A L d2AL 256 Note that l have emphasized that d1 and d2 depend on our choice of the accrual period A and the length of the study L According to our calculations we need A and L to satisfy 670173L 7 7 0173 670116L 0173147 7 6 lll 2 A 0116 5011 i 1 256 NOTE There are many A L combinations that satisfy this equation To nd a solution one possibility is to take the accrual period and the length of follow up to be equal ie take A L this will minimize the total length of the study When A L the equation above only has at most one solution and I wrote an R program to nd it From the program we obtain A L m 698 years Such a design would require 698 patients DESIGN ALTERNATIVES We can experiment with different combinations of A and L that best suits the needs of the study Table 411 provides different accrual periods A and the resulting values of L Here is the R program that I wrote to nd the A L solution To nd the other values in Table 411 simply substitute A into the root func function accordingly Name Joshua M Tebbs Date 24 Nov 07 Purpose Find the accrual period and length of study assumed equal for Example 49 notes Assume constant au and exponential distribution PAGElB CHAPTER 4 STAT 5137 J TEBBS Table 411 Design of the clinical trial in Example 49 using di erent values of A A No of patients to accrue Total length of study L 257 257 4391 300 300 1536 400 400 931 500 500 769 600 600 711 698 698 698 constant accrual rate no patients per unit of time a lt 100 hazard rates constant since exponential assumption lambda1 lt 0173 lambda2 lt 0116 no of deaths d lt 256 rootfunc lt functionx a2X eXp lambda1Xlambda1eXplambda1X 1 a2X eXp lambda2Xlambda2explambda2X 1 d rootfunc is a function of X find the value of X which solves rootfunc option root returns the root solution unirootrootfunc cO50 tol00000lroot REMARK Other factors that may affect the power that we will not discuss here include loss to follow up patient withdrawal7 competing risks7 and patient non compliance An excellent account on how to deal with these issues during the design stage is given by Lakatos 19887 Biometrics 447 229 241 and more recently by Jiang et al 20047 Biometrics 607 800 806 PAGE 176 CHAPTER 4 STAT 5137 J TEBBS 4 7 ksample tests SETTING Suppose that we randomize patients to k gt 2 treatments and that we are interested in testing the null hypothesis that the survival distributions are the same for all k treatments versus the alternative of some treatment difference NOTATION With right censoring the data from such a clinical trial can be represented as a sample of triplets namely Ai 1 2 n where X minT 0 the failure indicator 17 ifTiSCi 01fTgtO Al and Z j for j E 1 2 k corresponding to the treatment to which the 2th individ ual was assigned HYPOTHESIS TEST Denoting by EliIt PT 2 Hz j the survival distribution for the jth treatment the null hypothesis can be represented as H0 Slt Sgt Ska for t 2 0 or equivalently by H0 A1t A2t Amt for t 2 0 LOGRANK TEST The test of the null hypothesis will be a direct generalization of the logrank test used for the two sample problem At any time u where du 2 1 we consider our data as a 2 gtlt k contingency table like the one below where recall and dju denote the number of individuals at risk and the number of deaths at time u from treatment group j respectively PAGE 177 CHAPTER 4 STAT 5137 J TEBBS Treatment 1 Treatment 2 Treatment k Total Number of deaths d1 d2 dku du Number alive n1 7 d1u 712u 7 d2u 7 dku 7 du Total 711u 712u GENERALIZATION We now consider a vector of observed number of deaths minus the expected number under the null hypothesis for each treatment group j ie d1ltugt 7 mam MW 7 706101 n u du 7 emu 7 353 Note that the sum of the elements in this vector is zero verifyl If we condition on the kgtlt1 marginal counts in the 2 gtlt k table then the k gtlt 1 vector d1u7d2u7 7dku is distributed as a multivariate version of a hypergeometric Of particular interest for us is that conditional on the marginal counts for j 1 2 k n 39u E d 7 d o M M u where l have used the notation to denote conditional expectation and dunu 7 dunjunu 7 WW n2unu 71 Vcdju and that for j 31 j dunu 7 dunjunjlu C d d39 7 OVC 7u7 J 7 Now consider the k 7 1 gtlt 1 vector Sn made up by the sum of the observed minus eXpected counts of treatment j j 1 2 k 7 1 summed over time u ie zit d1ltugt 7 7dltugt 2W mo 7 mm 2AM 5117100 7 ZL du k71gtlt1 PAGE 178 CHAPTER 4 STAT 5137 J TEBBS where Au is the set of death times u for all treatments Note that we only take the k 71 dimensional vector since the sum of all k elements of du is zero and hence one of the elements is extraneous if this was included in Sn we would create singularities in what follows The corresponding k 7 1 gtlt k 7 1 covariance matrix of 5 is given by V vjjz for jj 12k 7 1 where 1 dltugtnltugt 7 dltugtnjltugtnltugt 7 mu n2ltugtnltugt 71 ii for j 1 2 k 7 1 these are the diagonal elements of V and dltugtnltugt 7 dltugtnjltugtnj ltugt n2ltugtnltugt71 for j 31 j these are the covariance terms TEST STATISTIC The statistic used to test the k sample null hypothesis is given by T 5V1S Under H0 Tn has an approximate central X2 distribution with 71 degrees of freedom This quadratic form is the ksample logrank test If Hg was true then we would expect the elements in the vector 5 to be near zero in this case the quadratic form Tn would also be near zero so that under H0 Tn would be small If however Hg was not true then we would expect some of the elements of 5 to perhaps greatly deviate from zero in this case the quadratic form Tn would be large A level 04 test would reject the null hypothesis whenever Tn 2 xiim These tests are available in most statistical packages Example 410 A clinical trial CALGB 8541 includes female patients with positive stage ll breast cancer The endpoint of interest is time to death and there are three chemotherapy treatments 0 Intensive CAF 0 Low dose CAF and 0 Standard dose CAF PAGE 179 CHAPTER4 STAT51ampJTEBBS Table 412 Time to death in patients with stage II breast cancer Starred subjects repre sent censored observations Intensive CAF Low dose CAF Standard CAF 501 4610 1959 357 4067 974 1721 665 354 1666 3494 4205 4280 3660 1157 1464 1323 2734 3350 2067 95 3146 1992 3634 3142 3260 2729 76 1482 3302 4167 653 2385 1199 1305 3436 3266 3684 625 3750 3230 4372 894 4197 1716 1206 71 1853 4454 2320 2574 391 4117 989 2360 3905 1169 1847 4002 1712 where CAF stands for cylclophospharnide adriarnycin and 5 uorouracil Data from the trial are given in Table 412 The n 60 patient times in Table 412 are a subset of the total data set Below is the R program I used to analyze these data Name Joshua M Tebbs Date 24 Nov 07 Purpose Compare three survivor distributions using logrank test using data in Example 410 input treatment 1 data survtime1 lt C501 1721 4280 3350 3142 4167 3266 894 4454 2360 4610 665 3660 2067 3260 653 3684 4197 2320 3905 input treatment 2 data survtime2 lt C1959 354 1157 95 2729 2385 625 1716 2574 1169 357 1666 1464 3146 76 1199 3750 1206 391 1847 input treatment 3 data survtime3 lt C4067 3494 1323 1992 1482 1305 3230 71 4117 4002 974 4205 2734 3634 3302 3436 4372 1853 989 1712 PAGE 180 CHAPTER4 STAT51ampJTEBBS intensive 39 low quotquotquot quot standard survival probability I I I I 0 1000 2000 3000 4000 Patient time in days Figure 428 Kaplan Meter estimates for breast cancer patients in a three arm trial combines data into one vector survtime lt csurvtime1survtime2survtime3 create deathcensoring indicator status lt crep1160000rep11800rep117000 treatment indicator 1 s 2 s and 3 s treat lt crep120rep220rep320 fit KM estimates for each treatment fit1 lt survfitSurvsurvtimestatus treat conftypequotplainquot logrank test for difference in survivor distributions fit2 lt survdiffSurvsurvtimestatus treat plot KM estimates plotfit1xlabquotPatient time in daysquot ylabquotsurvival probabilityquot ltyc148 names lt cquotintensivequotquotlowquotquotstandardquot legendlocator1legendnames ltyc148btyquotnquot PAGE 181 CHAPTER4 STAT51ampJTEBBS gt fit1 Call survfitformula Survsurvtime status treat conftype valainn numeric matrix 3 rows 7 columns n events mean semean median 095LCL 095UCL treat1 20 16 3006 319 3266 2067 4280 treat2 20 18 1618 240 1464 1157 2385 treat3 20 17 2900 292 3302 1992 4002 gt fit2 Call survdiff formula Survsurvtime status treat N Dbserved Expected D E 2E D E 2V treat1 20 16 2421 2784 5784 treat2 20 18 793 12805 16812 treat3 20 17 1886 0184 0303 Chisq 177 on 2 degrees of freedom p 0000141 ANALYSIS The test statistic is Tn 177 with a probability value ie area to the right of 177 under the x density curve equal to 0000141 At a 005 we have suf cient evidence to reject H0 510 Sgt Sgt There is strong evidence that these three treatments affect the survival rates differently To see where the differences are7 we look at the pairwise tests Here is the output to test H0 Slt 520 H0 Slt 530 and H0 Sgt 530 respectively Using a Bonferroni correction7 we declare the pairwise tests to be signi cant if the probability value was smaller than 043 0053 00167 gt fit 12 Call survdiff formula Survsurvtime 12 status 12 treat 12 N Dbserved Expected D E 2E D E 2V treat121 20 16 2438 288 117 treat122 20 18 962 730 117 Chisq 117 on 1 degrees of freedom p 000061 gt fit13 Call survdiffformula Survsurvtime13 status13 treat13 PAGE 182 CHAPTER 4 STAT 5137 J TEBBS smwalprahammy 00 02 04 05 08 10 00 02 04 05 08 10 0 1000 2000 3000 4000 0 1000 2000 3000 4000 Pauem ume 1n days Pauent Urne 1n days 00 02 04 05 08 10 0 1000 2000 3000 4000 Pauem ume 1n days Figure 429 Kaplan Meier estimates for breast cancer patients Upper left intensive versus low Upper right intensive versus standard Lower left low versus standard N Observed Expected D E 2E D E 2V treat131 20 16 19 0469 118 treat133 20 17 14 0635 118 Chisq 12 on 1 degrees of freedom p 0277 gt fit23 Call survdiffformula Survsurvtime23 status23 treat23 N Observed Expected D E 2E D E 2V treat232 20 18 981 684 111 treat233 20 17 2519 266 111 Chisq 111 on 1 degrees of freedom p 000087 ANALYSIS The intensive low 1 versus 2 and low standard 2 versus 3 comparisons are signi cant at the 043 0053 00167 level However7 the intensive standard 1 versus 3 comparison is not signi cant There is not enough evidence to conclude intensive high CAF treatment prolongs survival when compared to the standard treatment PAGE 183
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'