# Nonlinear Statistical Models for Univariate and Multivariate Response (ST 762)

These 409 pages of class notes were uploaded by Jordane Kemmer on Thursday, October 15, 2015. The notes are for ST 762 at North Carolina State University, taught by Staff in Fall. Since upload, they have received 19 views.


CHAPTER 14 — ST 762, M. DAVIDIAN

## 14 Estimating equation methods for marginal models

### 14.1 Introduction

In this chapter, we focus on methods for estimation of the parameters in a population-averaged (marginal) model of the general form discussed in Chapter 13. In particular, we assume that the pairs $(Y_i, x_i)$, $i = 1, \ldots, m$, are independent, where each $Y_i$ is $n_i \times 1$, and we consider the general mean–covariance model

$$E(Y_i \mid x_i) = f_i(x_i, \beta), \qquad \mathrm{var}(Y_i \mid x_i) = V_i(\beta, \xi, x_i), \quad n_i \times n_i, \tag{14.1}$$

where $f_i(x_i, \beta)$ is the $n_i \times 1$ vector with $j$th element $f(x_{ij}, \beta)$. The covariance model $V_i(\beta, \xi, x_i)$ is taken to have the form

$$V_i(\beta, \xi, x_i) = T_i^{1/2}(\beta, \theta, x_i)\, \Gamma_i(\alpha)\, T_i^{1/2}(\beta, \theta, x_i), \tag{14.2}$$

where $T_i(\beta, \theta, x_i)$ is the diagonal matrix whose diagonal elements are the models for $\mathrm{var}(Y_{ij} \mid x_i)$, e.g., involving a variance function, $\mathrm{var}(Y_{ij} \mid x_i) = \sigma^2 g^2(\beta, \theta, x_{ij})$, depending on possibly unknown variance parameters $\theta$; as in the previous chapter, we will generally absorb $\sigma^2$ into $\theta$ for brevity and just refer to $\theta$ as "all the variance parameters." The $n_i \times n_i$ matrix $\Gamma_i(\alpha)$ is a correlation matrix that generally depends on $x_i$ only through the within-individual times or other conditions $t_{ij}$ at which observations in $Y_i$ are taken. Here, $\alpha$ is a vector of unknown correlation parameters. The vector of variance and correlation parameters $\xi = (\theta^T, \alpha^T)^T$ may be entirely unknown, or it may be that only $\alpha$ is unknown in models where the form of the variance function is entirely specified.

As discussed in Chapter 13, the correlation model $\Gamma_i(\alpha)$ is likely not to be correct. Rather, it is specified as a "working" model that hopefully captures some of the main features of the overall pattern of correlation. This is acknowledged when inference is carried out under model (14.1) with assumed covariance structure (14.2); we will discuss this explicitly in Section 14.5.

Model (14.1) may be viewed as a multivariate analog to the univariate mean–variance models discussed in Chapters 2–12. Thus, it should come as no surprise that inferential strategies for (14.1) exploit some of the same ideas as in the univariate case. In particular, estimation of $\beta$ and $\xi$ is typically carried out by solution of linear or quadratic estimating equations that are similar in spirit to those used for univariate response. Of necessity, these equations are more complicated in the multivariate setting, as we will see in Sections 14.2, 14.3, and 14.4, although the basic principles are the same.

The terminology "generalized estimating equations" (GEEs), first coined by Liang and Zeger (1986), has come to refer broadly to the body of techniques for inference on $\beta$ and $\xi$ based on solution of appropriate estimating equations.

- As suggested by the title of Liang and Zeger (1986), "Longitudinal data analysis using generalized linear models," this paper cast the idea of posing estimating equations for multivariate response in the context where the response is collected longitudinally for each experimental unit. The development was also restricted to responses such as binary data, counts, and so on, and thus to mean models $f$ and models for $\mathrm{var}(Y_{ij} \mid x_i)$ of the generalized linear model type. However, this restriction is unnecessary; solving estimating equations for any model of the form (14.1) is feasible more generally. This restriction does explain why, in much of the literature, there are no unknown variance parameters $\theta$, and interest focuses exclusively on estimating correlation parameters $\alpha$. For example, for binary response, the model for $\mathrm{var}(Y_{ij} \mid x_{ij})$ might be taken to be the usual model $f(x_{ij}, \beta)\{1 - f(x_{ij}, \beta)\}$ and is assumed to be correctly specified. A working correlation model might be postulated depending on unknown $\alpha$, so that only $\alpha$ remains to be estimated.
- In our development here, we will allow the possibility that the model $V_i(\beta, \xi, x_i)$ may involve both unknown variance parameters $\theta$ and unknown correlation parameters $\alpha$. We will note simplifications that would occur in the case where the variance function does not depend on any unknown parameters $\theta$.

There are numerous references that cover the types of estimating equations we are about to discuss. Some of the key references are Prentice (1988); Zhao and Prentice (1990); Prentice and Zhao (1991); and Liang, Zeger, and Qaqish (1992). See also Section 7.5 and Chapter 8 of Diggle, Liang, and Zeger (1995) and Chapter 9 of Vonesh and Chinchilli (1997).

### 14.2 Linear estimating equations for β

Just as with univariate response, it is natural to start by considering the normal likelihood as a basis to derive an estimation method for $\beta$ in (14.1). Thus, analogous to the case of known weights in the univariate case, assume that the matrices $\mathrm{var}(Y_i \mid x_i) = V_i$, say, are known. Under these conditions, assuming that the $Y_i \mid x_i$ are normally distributed, the normal loglikelihood has the form

$$\log L = -\frac{1}{2} \sum_{i=1}^m \left[ \log |V_i| + \{Y_i - f_i(x_i, \beta)\}^T V_i^{-1} \{Y_i - f_i(x_i, \beta)\} \right]. \tag{14.3}$$

Writing

$$X_i(\beta) = \frac{\partial}{\partial \beta}\, f_i(x_i, \beta), \quad n_i \times p,$$

taking derivatives of (14.3) with respect to the $p \times 1$ vector $\beta$ and setting the result equal to zero yields

$$\sum_{i=1}^m X_i^T(\beta)\, V_i^{-1}\, \{Y_i - f_i(x_i, \beta)\} = 0, \tag{14.4}$$

which follows by using the following standard matrix differentiation results:

- For the quadratic form $q = x^T A x$ with $A$ symmetric, $\partial q / \partial x = 2Ax$. Note that this is a vector.
- The chain rule gives $\partial q / \partial \beta = (\partial x / \partial \beta)^T (\partial q / \partial x)$. Note that if both $x$ and $\beta$ are vectors, then $\partial x / \partial \beta$ is a matrix.

See Section 2.4 for a refresher. Section 1.4 of Vonesh and Chinchilli (1997) is an excellent source for many useful and sometimes difficult-to-find results on matrices and matrix differentiation.

The equation (14.4) has the form of a multivariate analog to the usual univariate WLS equation, where here the response and mean are now vectors, and the "weights" ($V_i^{-1}$) and gradient are matrices. In fact, we may write (14.4) in a way that makes it clear that there is really no fundamental difference in the general forms in the univariate and multivariate cases. Let $Y = (Y_1^T, \ldots, Y_m^T)^T$ be the vector of length $N = \sum_{i=1}^m n_i$ (the total number of observations). Define

$$f(\beta) = \{f_1^T(x_1, \beta), \ldots, f_m^T(x_m, \beta)\}^T, \quad N \times 1; \qquad X(\beta) = \{X_1^T(\beta), \ldots, X_m^T(\beta)\}^T, \quad N \times p; \qquad V = \mathrm{block\ diag}(V_1, \ldots, V_m), \quad N \times N.$$

Note that $E(Y \mid x) = f(\beta)$ and $\mathrm{var}(Y \mid x) = V$, where we use conditioning on $x$ as shorthand to denote that conditioning for each component $Y_i$ is with respect to $x_i$. Note further that we may rewrite (14.4) as

$$X^T(\beta)\, V^{-1}\, \{Y - f(\beta)\} = 0. \tag{14.5}$$

This has the same form as the matrix representation of the usual WLS estimating equations in the univariate case, with the exception that the weight matrix $V^{-1}$ in (14.5) is not necessarily diagonal. This is "not a big deal," so that the form of the equations is exactly that of WLS.

In fact, note that the summands for $i = 1, \ldots, m$ in the equation written in the form of (14.4) are independent. Moreover, regardless of whether the constant matrices $V_i$ are actually equal to the true $\mathrm{var}(Y_i \mid x_i)$, the expectation of a summand conditional on $x_i$ is clearly equal to zero.

- Thus, the estimating equation is unbiased. We would thus expect estimation of $\beta$ by solving (14.4) to lead to a consistent estimator.

Of course, if the matrices $V_i$ are not known but depend on the parameters $\beta$ and $\xi$, then, following the analogy to the linear case, it is natural to consider replacing them by the postulated covariance model in (14.1). This leads to the linear estimating equation

$$\sum_{i=1}^m X_i^T(\beta)\, V_i^{-1}(\beta, \xi, x_i)\, \{Y_i - f_i(x_i, \beta)\} = 0, \tag{14.6}$$

which may be rewritten, defining $V(\beta, \xi) = \mathrm{block\ diag}\{V_1(\beta, \xi, x_1), \ldots, V_m(\beta, \xi, x_m)\}$, as

$$X^T(\beta)\, V^{-1}(\beta, \xi)\, \{Y - f(\beta)\} = 0. \tag{14.7}$$

As in the case where the covariance matrix is known, the summands in (14.6) are independent across $i$. Hence, we expect that this estimating equation is also unbiased and would lead to a consistent estimator for $\beta$. We will discuss this in more detail in Section 14.5.

IN FACT: If the model for the covariance matrix $\mathrm{var}(Y_i \mid x_i)$ is correct (we of course assume that the model for the mean is correct), note that (14.7) has exactly the form of the optimal linear estimating equation for $\beta$ as in Section 9.6, except for the detail that the weight matrix $V^{-1}(\beta, \xi)$ is not diagonal.

- As we will see in Section 14.5, a similar "folklore" result as for the univariate case obtains for the multivariate generalization of GLS that we discuss momentarily. Thus, the same argument used in Section 9.6 to show asymptotic optimality among all linear equations in the univariate case would be applicable to the solution of (14.7).

The main point we emphasize now is that estimating equation (14.6), or equivalently (14.7), is the one to which we are led by applying the same ideas to the multivariate case that we did in the univariate setting. The equations have the same form as those for generalized least squares in the univariate case.

Carrying the analogy further, an obvious strategy for solving (14.6) is suggested via a three-step algorithm:

(i) Estimate $\beta$ by $\hat\beta^{(0)}$ and set $k = 0$. A natural choice for $\hat\beta^{(0)}$ would be the OLS estimator, treating all the elements of $Y$ as if they were mutually independent with the same conditional variance; i.e., replacing $V$ in (14.5) by an $N \times N$ identity matrix. That the OLS estimator is consistent under these conditions follows from the arguments in Section 14.5.

(ii) Estimate $\xi$ somehow by $\hat\xi^{(k)}$ and form estimated "weight matrices" $\hat V_i^{(k)} = V_i(\hat\beta^{(k)}, \hat\xi^{(k)}, x_i)$.

(iii) Re-estimate $\beta$ by solving in $\beta$

$$\sum_{i=1}^m X_i^T(\beta)\, \{\hat V_i^{(k)}\}^{-1}\, \{Y_i - f_i(x_i, \beta)\} = 0$$

to obtain $\hat\beta^{(k+1)}$. Set $k = k + 1$ and return to (ii).

Iterating $C = \infty$ times would correspond to solving (14.6) jointly with another equation for $\xi$.

IMPLEMENTING STEP (ii): By analogy to the univariate case, a natural approach to estimating $\xi$ would be to use a quadratic estimating equation. We will discuss this in Section 14.3 shortly. In the early papers on GEEs in the biostatistical literature by Liang and Zeger, estimation of $\xi$ was advocated based on simple moment-based approaches. Actually, as this early work was in the context of generalized-linear-model-type problems, this only involved estimation of correlation parameters $\alpha$, as in this setting the variance function does not depend on unknown parameters, except perhaps a scale parameter $\sigma^2$.

For example, consider the exponential correlation model given in (13.28), reparameterized here as

$$\mathrm{corr}(Y_{ij}, Y_{ij'} \mid x_i) = \alpha^{|t_{ij} - t_{ij'}|},$$

where $Y_{ij}$ and $Y_{ij'}$ are observed at times $t_{ij}$ and $t_{ij'}$. Under this model, $\Gamma_i(\alpha)$ depends on the scalar parameter $\alpha$. Assume that $\mathrm{var}(Y_{ij} \mid x_i) = \sigma^2 g^2(\beta, \theta, x_{ij})$ with $\theta$ known. Given an estimate $\hat\beta^{(k)}$ at the $k$th iteration of the above algorithm, the weighted residuals

$$w_{ij} = \frac{Y_{ij} - f(x_{ij}, \hat\beta^{(k)})}{g(\hat\beta^{(k)}, \theta, x_{ij})}$$

have approximate mean 0 and satisfy

$$E(w_{ij} w_{ij'} \mid x_i) \approx \sigma^2 \alpha^{|t_{ij} - t_{ij'}|}.$$
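This pooled-moment idea can be illustrated numerically. Below is a minimal sketch using simulated residuals satisfying $E(w_{ij} w_{ij'}) = \sigma^2 \alpha^{|t_{ij} - t_{ij'}|}$; all data and names are our own construction, and we average the pooled products within each lag before taking logs (so the logarithm is defined), a slight variant of the pooled log-product regression described here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (our own construction): weighted residuals w_ij with
# E(w_ij w_ij') = sigma^2 * alpha**|t_ij - t_ij'| (exponential correlation).
alpha_true, sigma2 = 0.8, 1.0
m, n = 200, 5                                   # m individuals, n times each
t = np.arange(n, dtype=float)
G = sigma2 * alpha_true ** np.abs(np.subtract.outer(t, t))
W = rng.standard_normal((m, n)) @ np.linalg.cholesky(G).T

# Pool lagged products across individuals and average within each lag
# (averaging first keeps the log well defined), then regress
# log(mean product) on the lag: slope estimates log(alpha).
lags = sorted({t[k] - t[j] for j in range(n) for k in range(j + 1, n)})
mean_prod = [np.mean([W[:, j] * W[:, k]
                      for j in range(n) for k in range(j + 1, n)
                      if t[k] - t[j] == d])
             for d in lags]

log_alpha, log_sigma2 = np.polyfit(lags, np.log(mean_prod), 1)
alpha_hat, sigma2_hat = np.exp(log_alpha), np.exp(log_sigma2)
print(alpha_hat, sigma2_hat)
```

Exponentiating the fitted slope and intercept recovers estimates of $\alpha$ and $\sigma^2$, mirroring the exponentiation step described in the notes.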
Taking logarithms of both sides of this expression yields the approximate relationship

$$\log(w_{ij} w_{ij'}) \approx \log \sigma^2 + |t_{ij} - t_{ij'}| \log \alpha;$$

thus, the suggestion was to form all pairs of lagged residuals for each $i$, pool them together, and estimate $\log \alpha$ by simple linear regression of the $\log(w_{ij} w_{ij'})$ on the $|t_{ij} - t_{ij'}|$. The resulting estimator may be exponentiated to yield an estimator for $\alpha$.

Given our experience in the univariate case, it seems likely that a potentially more efficient way of estimating $\alpha$ could be identified. Moreover, this approach does not seem to take account of the fact that if observations from the same individual are correlated, then residuals are likely to be so, too.

In any event, following estimation of $\alpha$, in the general model with a scale parameter $\sigma^2$ (and no other unknown variance parameters), the obvious multivariate analog for estimation of $\sigma^2$ is

$$\hat\sigma^2 = N^{-1} \sum_{i=1}^m \{Y_i - f_i(x_i, \hat\beta^{(k)})\}^T\, T_i^{-1/2}(\hat\beta^{(k)}, \theta, x_i)\, \Gamma_i^{-1}(\hat\alpha^{(k)})\, T_i^{-1/2}(\hat\beta^{(k)}, \theta, x_i)\, \{Y_i - f_i(x_i, \hat\beta^{(k)})\}. \tag{14.8}$$

Of course, $N$ might be replaced by $N - p$. In the next section, we will see that an estimator of this form (when a scale parameter is in the model) arises naturally from a quadratic estimating equation approach, as is not unexpected.

### 14.3 Quadratic estimating equations for variance and correlation parameters

To develop a class of quadratic estimating equations for $\xi$, it is natural, as we did in the univariate case, to start with the normal loglikelihood. We thus first consider the form of the loglikelihood under the assumption that the conditional distributions of $Y_i$ given $x_i$ are normal to deduce the form of an estimating equation, and then consider generalizations. As before, for simplicity, we suppress explicit mention of a possible scale parameter, absorbing it into $\theta$ for convenience. The normal loglikelihood for model (14.1) is

$$\log L = -\frac{1}{2} \sum_{i=1}^m \left[ \log |V_i(\beta, \xi, x_i)| + \{Y_i - f_i(x_i, \beta)\}^T V_i^{-1}(\beta, \xi, x_i) \{Y_i - f_i(x_i, \beta)\} \right]. \tag{14.9}$$

Assume that $\theta$ is $q \times 1$ and $\alpha$ is $s \times 1$. Consider differentiation of (14.9) with respect to the $k$th scalar element of $\xi$, $k = 1, \ldots, q + s$, $\xi_k$ say. This is straightforward using the following well-known matrix differentiation results. Suppose here that $V(\xi)$ is an $n \times n$ nonsingular matrix depending on a vector $\xi$.

- If $\xi_k$ is the $k$th element of $\xi$, then $\partial V(\xi)/\partial \xi_k$ is the $n \times n$ matrix whose $(l, l')$ element is the partial derivative of the $(l, l')$ element of $V(\xi)$ with respect to $\xi_k$. Then

$$\frac{\partial}{\partial \xi_k} \log |V(\xi)| = \mathrm{tr}\left\{ V^{-1}(\xi)\, \frac{\partial}{\partial \xi_k} V(\xi) \right\}, \qquad \frac{\partial}{\partial \xi_k} V^{-1}(\xi) = -V^{-1}(\xi) \left\{\frac{\partial}{\partial \xi_k} V(\xi)\right\} V^{-1}(\xi).$$

- For the quadratic form $q(\xi) = x^T V^{-1}(\xi)\, x$, $\partial q/\partial \xi_k = x^T \{\partial V^{-1}(\xi)/\partial \xi_k\}\, x$. Thus, from the previous result,

$$\frac{\partial}{\partial \xi_k}\, x^T V^{-1}(\xi)\, x = -x^T V^{-1}(\xi) \left\{\frac{\partial}{\partial \xi_k} V(\xi)\right\} V^{-1}(\xi)\, x.$$

Applying these results to (14.9), treating $\beta$ as fixed, and setting each partial derivative of (14.9) equal to zero, we obtain the following set of $q + s$ estimating equations:

$$\frac{1}{2} \sum_{i=1}^m \left[ \{Y_i - f_i(x_i,\beta)\}^T V_i^{-1}(\beta,\xi,x_i) \left\{\frac{\partial}{\partial\xi_k} V_i(\beta,\xi,x_i)\right\} V_i^{-1}(\beta,\xi,x_i) \{Y_i - f_i(x_i,\beta)\} - \mathrm{tr}\left\{ V_i^{-1}(\beta,\xi,x_i)\, \frac{\partial}{\partial\xi_k} V_i(\beta,\xi,x_i) \right\} \right] = 0, \quad k = 1, \ldots, q + s. \tag{14.10}$$

As with (14.6), the summands for $i = 1, \ldots, m$ are independent. Moreover, analogous to the univariate case, if $V_i(\beta, \xi, x_i)$ is correctly specified, then, at the true values, the conditional expectation of a summand is zero. This may be seen by using the following result:

- If $x$ is a random vector with mean zero and covariance matrix $V$, and $A$ is a square matrix, then $E(x^T A x) = \mathrm{tr}\{E(x x^T) A\} = \mathrm{tr}(VA) = \mathrm{tr}(AV)$.

Using this, we have, assuming expectation is under the true parameter values $\beta$ and $\xi$,

$$E\left[ \{Y_i - f_i(x_i,\beta)\}^T V_i^{-1}(\beta,\xi,x_i) \left\{\frac{\partial}{\partial\xi_k} V_i(\beta,\xi,x_i)\right\} V_i^{-1}(\beta,\xi,x_i) \{Y_i - f_i(x_i,\beta)\} \,\Big|\, x_i \right] = \mathrm{tr}\left[ V_i^{-1}(\beta,\xi,x_i) \left\{\frac{\partial}{\partial\xi_k} V_i(\beta,\xi,x_i)\right\} V_i^{-1}(\beta,\xi,x_i)\, V_i(\beta,\xi,x_i) \right] = \mathrm{tr}\left\{ V_i^{-1}(\beta,\xi,x_i)\, \frac{\partial}{\partial\xi_k} V_i(\beta,\xi,x_i) \right\},$$

from whence unbiasedness of (14.10) follows. Note, of course, that if $V_i$ were incorrectly specified, the equation would not be unbiased.

JOINT ESTIMATING EQUATIONS: Analogous to the univariate case, iterating the three-step algorithm with $C = \infty$ would correspond to jointly solving in $\beta$ and $\xi$ the $(p + q + s)$-dimensional system of estimating equations given as follows:

$$\sum_{i=1}^m X_i^T(\beta)\, V_i^{-1}(\beta, \xi, x_i)\, \{Y_i - f_i(x_i, \beta)\} = 0, \tag{14.11}$$

$$\frac{1}{2} \sum_{i=1}^m \left[ \{Y_i - f_i(x_i,\beta)\}^T V_i^{-1}(\beta,\xi,x_i) \left\{\frac{\partial}{\partial\xi_k} V_i(\beta,\xi,x_i)\right\} V_i^{-1}(\beta,\xi,x_i) \{Y_i - f_i(x_i,\beta)\} - \mathrm{tr}\left\{ V_i^{-1}(\beta,\xi,x_i)\, \frac{\partial}{\partial\xi_k} V_i(\beta,\xi,x_i) \right\} \right] = 0, \quad k = 1, \ldots, q + s. \tag{14.12}$$

SCALE PARAMETER: When the covariance model includes a scale parameter $\sigma^2$, say, the equation in (14.12) corresponding to differentiation with respect to $\sigma^2$ takes a simple form. Specifically, if we write $\mathrm{var}(Y_i \mid x_i)$, in a slight abuse of notation, as $\sigma^2 V_i(\beta, \theta, \alpha, x_i)$ with $\xi = (\theta^T, \alpha^T, \sigma^2)^T$, then, with $\xi_{q+s} = \sigma^2$,

$$\frac{\partial}{\partial \xi_{q+s}}\, \sigma^2 V_i(\beta, \theta, \alpha, x_i) = V_i(\beta, \theta, \alpha, x_i),$$

so that (14.12) in the case $k = q + s$ reduces to

$$\frac{1}{2} \sum_{i=1}^m \left[ \sigma^{-4} \{Y_i - f_i(x_i,\beta)\}^T V_i^{-1}(\beta,\theta,\alpha,x_i) \{Y_i - f_i(x_i,\beta)\} - \sigma^{-2} n_i \right] = 0,$$

which is easily seen to lead to an expression for $\hat\sigma^2$ of the form in (14.8). The multiplicative factor of $1/2$ in the equations for $\xi$ in (14.12) could be disregarded, but we maintain it for now, as it proves important for the developments of Section 14.6, the motivation for which we now discuss.

ALTERNATIVE FORM OF THE QUADRATIC ESTIMATING EQUATION FOR ξ: Recall that, in the univariate case, we expressed the summand in joint estimating equations in the form

$$(\text{gradient of mean function}) \times (\text{covariance matrix})^{-1} \times (\text{response} - \text{mean}). \tag{14.13}$$

From this representation, we were immediately able to identify generalizations of the equations and determine the optimal equation under a set of assumptions on the moments.

- For example, in the case where quadratic equations were motivated by normality, this representation allowed us to replace normal moments by those corresponding to the assumptions.

It is in fact possible to represent estimating equations such as (14.11) and (14.12) in the form (14.13), although it is a little more involved than in the univariate case.

- Writing estimating equations in the form (14.13) has become the standard way to represent such equations and was first made popular in papers by Prentice (1988), Zhao and Prentice (1990), and Prentice and Zhao (1991).

We now consider how this might be accomplished. For definiteness, we first focus only on the quadratic equation for estimating $\xi$ given in (14.12). Rather than try to show that (14.12) may be written in this alternative form directly, we instead start with the idea that one would want to write an equation in the form (14.13) and demonstrate how this would be done. The equivalence between this approach and (14.12), derived from normality, is presented in Section 14.6.

We will consider $\beta$ fixed for now and consider estimation of $\xi$. If we are interested in estimating the elements of $\xi$, which describe an entire covariance structure (variances and correlations, so variances and covariances), we must consider the variances of each element of a data vector and all pairwise associations among elements of a data vector. Of course, if there are no unknown variance parameters, we need only consider associations. In particular, for $Y_i = (Y_{i1}, \ldots, Y_{in_i})^T$, if we let $v_{ijk}$ be the $(j, k)$ element of $V_i(\beta, \xi, x_i)$, which is of course equal to the $(k, j)$ element by symmetry, then we know that, if $V_i(\beta, \xi, x_i)$ is specified correctly,

$$E[\{Y_{ij} - f(x_{ij}, \beta)\}^2 \mid x_i] = v_{ijj}(\beta, \xi), \qquad E[\{Y_{ij} - f(x_{ij}, \beta)\}\{Y_{ik} - f(x_{ik}, \beta)\} \mid x_i] = v_{ijk}(\beta, \xi).$$

In the univariate case, the quadratic estimating equation was based on treating the squared deviations as the "response" and equating them to their mean, the variance function. By analogy, it is natural to think that the quadratic equation in the multivariate case should involve a "response" that includes both squared deviations, equated to the variance function, and crossproducts of deviations, equated to their means, the covariances.

- By symmetry, for $Y_i$ $(n_i \times 1)$, there are $n_i$ squared deviations, one for each entry of $Y_i$, and $n_i(n_i - 1)/2$ distinct crossproducts (i.e., the number of covariances), for a total of $n_i(n_i + 1)/2$ distinct terms. Explicitly, the $n_i(n_i + 1)/2$ distinct terms are the $n_i$ squared deviations $\{Y_{i1} - f(x_{i1}, \beta)\}^2, \ldots, \{Y_{in_i} - f(x_{in_i}, \beta)\}^2$ and the $n_i(n_i - 1)/2$ crossproduct terms $\{Y_{i1} - f(x_{i1}, \beta)\}\{Y_{i2} - f(x_{i2}, \beta)\}$, $\{Y_{i1} - f(x_{i1}, \beta)\}\{Y_{i3} - f(x_{i3}, \beta)\}, \ldots, \{Y_{i,n_i-1} - f(x_{i,n_i-1}, \beta)\}\{Y_{in_i} - f(x_{in_i}, \beta)\}$.
- Thus, the "response" and "mean" vectors here should be of length $n_i(n_i + 1)/2$, assuming that the model contains unknown variance parameters. If it does not, then only the $n_i(n_i - 1)/2$ crossproduct terms would be required.
- Of course, as the quadratic estimating equation (14.12) depends on the quadratic form in $Y_i - f_i(x_i, \beta)$, it also depends on squared deviations and crossproducts.

To formalize, define

$$u_{ijk} = \{Y_{ij} - f(x_{ij}, \beta)\}\{Y_{ik} - f(x_{ik}, \beta)\}, \tag{14.14}$$

where the notation suppresses the dependence on $\beta$ for brevity. Then we may collect the distinct $u_{ijk}$ defined in (14.14) in a vector $u_i$ of length $n_i(n_i + 1)/2$ (with unknown variance parameters) or $n_i(n_i - 1)/2$ (with no unknown variance parameters), in some order. In the former case, we will define

$$u_i = (u_{i11}, u_{i12}, u_{i13}, \ldots, u_{i22}, u_{i23}, \ldots, u_{i,n_i-1,n_i-1}, u_{i,n_i-1,n_i}, u_{i n_i n_i})^T.$$

- Recall that, for an $n \times r$ matrix $A$, $\mathrm{vec}(A)$ is defined as the $nr \times 1$ vector consisting of the $r$ columns of $A$ stacked in the order $1, \ldots, r$.
- If, furthermore, $A$ is $n \times n$ and symmetric, then $\mathrm{vec}(A)$ contains redundant entries. The vech operator yields the column vector containing all the distinct entries of $A$ by stacking the lower-diagonal elements; e.g., for $n = 3$,

$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{12} & a_{22} & a_{23} \\ a_{13} & a_{23} & a_{33} \end{pmatrix} \quad \text{and} \quad \mathrm{vech}(A) = (a_{11}, a_{12}, a_{13}, a_{22}, a_{23}, a_{33})^T.$$

- Thus, our convention here is to use the order imposed by the definition

$$u_i = \mathrm{vech}\left[ \{Y_i - f_i(x_i, \beta)\}\{Y_i - f_i(x_i, \beta)\}^T \right].$$

If there are no unknown variance parameters in $\xi$, $u_i$ would be defined by deleting the squared components. We may define a corresponding "mean" vector

$$v_i(\beta, \xi) = \{v_{i11}(\beta, \xi), v_{i12}(\beta, \xi), v_{i13}(\beta, \xi), \ldots, v_{i22}(\beta, \xi), v_{i23}(\beta, \xi), \ldots, v_{i n_i n_i}(\beta, \xi)\}^T;$$

i.e., $v_i(\beta, \xi) = \mathrm{vech}\{V_i(\beta, \xi, x_i)\}$. If $u_i$ contains no squared components (i.e., no unknown variance parameters), then the corresponding elements of $v_i$ would be deleted.

Then we have that the "response" vector $u_i$ satisfies

$$E(u_i \mid x_i) = v_i(\beta, \xi),$$

so that $v_i(\beta, \xi)$ is the "mean" vector. It is important to recognize, in reading the literature, that there are variations on the construction we describe here. For example, some authors instead base the equations on the response $\mathrm{vech}(Y_i Y_i^T)$ and/or may stack things in a different order.

Treating $v_i(\beta, \xi)$ as the "mean" vector and as a function of $\xi$, we may define the gradient matrix of the mean to be

$$E_i(\beta, \xi) = \frac{\partial}{\partial \xi}\, v_i(\beta, \xi);$$

$E_i$ will have $n_i(n_i + 1)/2$ or $n_i(n_i - 1)/2$ rows, depending on the form of $V_i(\beta, \xi, x_i)$, and $q + s$ columns. Let the covariance matrix of $u_i$ be

$$\mathrm{var}(u_i \mid x_i) = Z_i(\beta, \xi, x_i),$$

say. A little thought about the form of this matrix makes it clear that it is fairly complex. In particular, in order to specify this matrix, it is clear that we would have to be willing to make assumptions about quantities of the general form

$$\mathrm{cov}(u_{ijk}, u_{ilp} \mid x_i) = E(u_{ijk} u_{ilp} \mid x_i) - v_{ijk} v_{ilp}. \tag{14.15}$$

To investigate what this entails, and to make the analogy to the univariate case transparent, consider the particular model for $\mathrm{var}(Y_i \mid x_i)$ that involves a scale parameter $\sigma^2$ and additional variance parameters $\theta$ such that

$$\mathrm{var}(Y_{ij} \mid x_i) = \sigma^2 g^2(\beta, \theta, x_{ij})$$

for a variance function $g$ and correlation parameters $\alpha$, where $\xi = (\theta^T, \alpha^T, \sigma^2)^T$. Define

$$\epsilon_{ij} = \frac{Y_{ij} - f(x_{ij}, \beta)}{\sigma\, g(\beta, \theta, x_{ij})}.$$

Then, of course, $E(\epsilon_{ij} \mid x_i) = 0$ and $\mathrm{var}(\epsilon_{ij} \mid x_i) = 1$, where expectation here and subsequently is under the true parameter values $\beta$ and $\xi$. Clearly, the elements of $\epsilon_i = (\epsilon_{i1}, \ldots, \epsilon_{in_i})^T$ are correlated, with correlation matrix equal to $\Gamma_i(\alpha)$, assuming, as we are, that this matrix is correctly specified. Under this model, for $j, k = 1, \ldots, n_i$,

$$u_{ijj} = \sigma^2 g^2(\beta, \theta, x_{ij})\, \epsilon_{ij}^2, \qquad u_{ijk} = \sigma^2 g(\beta, \theta, x_{ij})\, g(\beta, \theta, x_{ik})\, \epsilon_{ij}\epsilon_{ik}.$$

Thus, of course,

$$v_{ijj} = \sigma^2 g^2(\beta, \theta, x_{ij})\, E(\epsilon_{ij}^2 \mid x_i) = \sigma^2 g^2(\beta, \theta, x_{ij}),$$

$$v_{ijk} = \sigma^2 g(\beta, \theta, x_{ij})\, g(\beta, \theta, x_{ik})\, E(\epsilon_{ij}\epsilon_{ik} \mid x_i) = \sigma^2 g(\beta, \theta, x_{ij})\, g(\beta, \theta, x_{ik})\, \mathrm{corr}(\epsilon_{ij}, \epsilon_{ik} \mid x_i),$$

where $\mathrm{corr}(\epsilon_{ij}, \epsilon_{ik} \mid x_i)$ is the $(j, k)$ element of the correlation matrix $\Gamma_i(\alpha)$. In general, using the shorthand notation $g_{ij} = g(\beta, \theta, x_{ij})$, we may rewrite (14.15) in terms of the $\epsilon_{ij}$ as

$$\mathrm{cov}(u_{ijk}, u_{ilp} \mid x_i) = \sigma^4 g_{ij} g_{ik} g_{il} g_{ip} \left\{ E(\epsilon_{ij}\epsilon_{ik}\epsilon_{il}\epsilon_{ip} \mid x_i) - E(\epsilon_{ij}\epsilon_{ik} \mid x_i)\, E(\epsilon_{il}\epsilon_{ip} \mid x_i) \right\}. \tag{14.16}$$

The representation (14.16) highlights how much more complicated the multivariate case is relative to the univariate! Recall that, in the univariate case, the "response" corresponding to a single independent observation is the scalar squared deviation, so that, in constructing the "covariance matrix," we needed only to be concerned about the variance of a squared deviation, which boils down to concern about the variance of the square of a standardized "error" $\epsilon$, and hence the need to make an assumption about excess kurtosis. Here, in contrast, the "response" corresponding to a single independent observation (response vector) is itself a complex vector, so we must be concerned with numerous variances and covariances of the general form (14.16). To illustrate the complexity, note some special cases of (14.16):

$$\mathrm{var}(u_{ijj} \mid x_i) = \sigma^4 g_{ij}^4\, \mathrm{var}(\epsilon_{ij}^2 \mid x_i),$$

$$\mathrm{var}(u_{ijk} \mid x_i) = \sigma^4 g_{ij}^2 g_{ik}^2 \left[ E(\epsilon_{ij}^2 \epsilon_{ik}^2 \mid x_i) - \{E(\epsilon_{ij}\epsilon_{ik} \mid x_i)\}^2 \right],$$

$$\mathrm{cov}(u_{ijj}, u_{ikk} \mid x_i) = \sigma^4 g_{ij}^2 g_{ik}^2 \left\{ E(\epsilon_{ij}^2 \epsilon_{ik}^2 \mid x_i) - 1 \right\},$$

$$\mathrm{cov}(u_{ijk}, u_{ijp} \mid x_i) = \sigma^4 g_{ij}^2 g_{ik} g_{ip} \left\{ E(\epsilon_{ij}^2 \epsilon_{ik}\epsilon_{ip} \mid x_i) - E(\epsilon_{ij}\epsilon_{ik} \mid x_i)\, E(\epsilon_{ij}\epsilon_{ip} \mid x_i) \right\},$$

$$\mathrm{cov}(u_{ijj}, u_{ijp} \mid x_i) = \sigma^4 g_{ij}^3 g_{ip} \left\{ E(\epsilon_{ij}^3 \epsilon_{ip} \mid x_i) - E(\epsilon_{ij}\epsilon_{ip} \mid x_i) \right\}.$$

Of course, (14.16) represents the general case for $j \ne k \ne l \ne p$.

RESULT: In order to specify the covariance matrix $Z_i(\beta, \xi, x_i)$, we must be prepared to make assumptions about numerous higher moments involving the elements of $Y_i$, or equivalently $\epsilon_i$, up to four-way associations; i.e., $E(\epsilon_{ij}\epsilon_{ik}\epsilon_{il}\epsilon_{ip} \mid x_i)$. Note that if we have $n_i \equiv 1$, so that $Y_i$ is a scalar, all of this reduces to the situation first discussed in Chapter 5.

ESTIMATING EQUATION: Putting aside for the moment the troublesome issue of specifying $Z_i(\beta, \xi, x_i)$, the preceding developments suggest the following approach. Given some assumption on $Z_i(\beta, \xi, x_i)$ (thus, some assumption on higher moments of the $\epsilon_{ij}$), to estimate $\xi$ (treating $\beta$ as fixed for now), one would solve an estimating equation of the form

$$\sum_{i=1}^m E_i^T(\beta, \xi)\, Z_i^{-1}(\beta, \xi, x_i)\, \{u_i - v_i(\beta, \xi)\} = 0. \tag{14.17}$$

This, of course, is a $(q + s)$-dimensional system of equations in $q + s$ unknowns.

- This is a quadratic estimating equation in a much more general sense than in the univariate case. A convenient way to think of this is that the squared deviations and crossproducts that make up the "response" here are components of a quadratic form.
- By analogy to the univariate case, it seems likely that the equation will be optimal for estimating $\xi$ if $Z_i(\beta, \xi, x_i)$ is chosen correctly, so that $\mathrm{var}(u_i \mid x_i) = Z_i(\beta, \xi, x_i)$.
- In step (ii) of the three-step algorithm, one could implement this by replacing $\beta$ by $\hat\beta^{(k)}$ everywhere, including in $u_i$, where the dependence on $\beta$ has been suppressed here.
- It is straightforward to verify that, in the univariate case ($n_i \equiv 1$, so that there is no parameter $\alpha$), the estimating equation (14.17) reduces to the general quadratic equation for variance parameters in the second row of equation (10.2).

There are two issues to be resolved:

- Clearly, getting $Z_i(\beta, \xi, x_i)$ correct involves major moment assumptions. The chance that we would be able to specify all of the relevant moments correctly seems slim, if not impossible. What is a practical strategy for specifying $Z_i(\beta, \xi, x_i)$ in practice?
- How does (14.17) compare to the equation in the form (14.12)? The latter equation may be thought of as a multivariate generalization of pseudolikelihood (PL), as it is based on the normal theory likelihood. Intuition and the analogy to the univariate case suggest that choosing $Z_i(\beta, \xi, x_i)$ to correspond to the matrix that would obtain if the $Y_i \mid x_i$ were multivariate normal should yield (14.12). Is this in fact true?

SPECIFYING $Z_i(\beta, \xi, x_i)$: The difficulty associated with specifying $Z_i(\beta, \xi, x_i)$ with confidence is widely acknowledged. Thus, the standard approach is to "face up" to the difficulty and, instead of attempting to "get it right," adopt a working assumption that is likely incorrect but might at least capture some of the predominant features of associations among the elements of $u_i$. For instance, it might be better than simply taking $Z_i(\beta, \xi, x_i)$ to be an identity matrix for each $i$ (i.e., no weighting).

- A working assumption involves adopting, for the purposes of deducing a form for $Z_i(\beta, \xi, x_i)$, a distributional specification for $Y_i \mid x_i$, or at least for aspects of the distribution.
- As discussed in Section 14.5, standard errors for the estimator of $\beta$ may be adjusted to take into account that the working assumption may be incorrect.

Some popular working assumptions are as follows:

(i) Independence working assumption. Take $Z_i(\beta, \xi, x_i)$ to be the covariance matrix for $u_i$ that would be obtained if all the $Y_{ij}$ (equivalently, the $\epsilon_{ij}$) were assumed to be independent across $j$. It may be deduced from the general expressions on page 382 that this implies that $Z_i(\beta, \xi, x_i)$ will be a diagonal matrix with diagonal elements

$$\mathrm{var}(u_{ijk} \mid x_i) = \sigma^4 g_{ij}^2 g_{ik}^2 \; (j \ne k) \qquad \text{and} \qquad \mathrm{var}(u_{ijj} \mid x_i) = \sigma^4 g_{ij}^4\, \mathrm{var}(\epsilon_{ij}^2 \mid x_i).$$

The analyst must specify $\mathrm{var}(\epsilon_{ij}^2 \mid x_i)$ but no other higher moments of the $\epsilon_{ij}$.

(ii) Gaussian working assumption. Take $Z_i(\beta, \xi, x_i)$ to be the covariance matrix for $u_i$ that would be obtained by assuming that the distribution of $Y_i \mid x_i$ is normal, with the first two moments correctly specified according to the assumed mean–covariance model (14.1). Equivalently, assume that $e_i = Y_i - f_i(x_i, \beta)$, given $x_i$, is normal with mean 0 and covariance matrix $V_i(\beta, \xi, x_i)$. It may be shown under this condition that

$$\mathrm{cov}(u_{ijk}, u_{ilp} \mid x_i) = v_{ijl} v_{ikp} + v_{ijp} v_{ikl} = \sigma^4 g_{ij} g_{ik} g_{il} g_{ip} \left\{ E(\epsilon_{ij}\epsilon_{il} \mid x_i)\, E(\epsilon_{ik}\epsilon_{ip} \mid x_i) + E(\epsilon_{ij}\epsilon_{ip} \mid x_i)\, E(\epsilon_{ik}\epsilon_{il} \mid x_i) \right\}. \tag{14.18}$$

The entries of $Z_i(\beta, \xi, x_i)$ may then be determined from the simplified relationship (14.18). Note that, as $E(\epsilon_{ij}\epsilon_{ik} \mid x_i)$ is equal to the conditional correlation between $\epsilon_{ij}$ and $\epsilon_{ik}$, from (14.18), all of the needed entries of $Z_i(\beta, \xi, x_i)$ depend only on the assumed correlation model $\Gamma_i(\alpha)$. Moreover, (14.18) also yields $\mathrm{var}(u_{ijj} \mid x_i) = 2\sigma^4 g_{ij}^4$, where the "2" is as expected.

Although these choices may indeed be misspecifications, the hope is that they will produce estimators closer to being optimal than simply ignoring the pattern of association among the elements of $u_i$ altogether.

RELATIONSHIP TO PSEUDOLIKELIHOOD: As we have mentioned, the PL estimating equation (14.12) deduced from the normality assumption, i.e.,

$$\frac{1}{2} \sum_{i=1}^m \left[ \{Y_i - f_i(x_i,\beta)\}^T V_i^{-1}(\beta,\xi,x_i) \left\{\frac{\partial}{\partial\xi_k} V_i(\beta,\xi,x_i)\right\} V_i^{-1}(\beta,\xi,x_i) \{Y_i - f_i(x_i,\beta)\} - \mathrm{tr}\left\{ V_i^{-1}(\beta,\xi,x_i)\, \frac{\partial}{\partial\xi_k} V_i(\beta,\xi,x_i) \right\} \right] = 0, \quad k = 1, \ldots, q + s, \tag{14.19}$$

is quadratic, as it depends on a quadratic form in $Y_i - f_i(x_i, \beta)$. A quadratic form may, of course, be written as a linear combination of squared deviations and crossproducts; e.g.,

$$\{Y_i - f_i(x_i,\beta)\}^T V_i^{-1}(\beta,\xi,x_i)\, \{Y_i - f_i(x_i,\beta)\} = \sum_{j=1}^{n_i} \sum_{k=1}^{n_i} \{Y_{ij} - f(x_{ij},\beta)\}\{Y_{ik} - f(x_{ik},\beta)\}\, v_i^{jk},$$

where $v_i^{jk}$ is the $(j, k)$ element of $V_i^{-1}(\beta, \xi, x_i)$. That is, the relevant quadratic form in the summand of the PL equation (14.19) is a linear combination of the elements of $u_i$.

- Of course, the summands of the estimating equation in (14.17),

$$\sum_{i=1}^m E_i^T(\beta, \xi)\, Z_i^{-1}(\beta, \xi, x_i)\, \{u_i - v_i(\beta, \xi)\} = 0, \tag{14.20}$$

are also linear combinations of the elements of $u_i$. If the distribution of $Y_i \mid x_i$ really is normal, and we choose $Z_i(\beta, \xi, x_i)$ according to the Gaussian working assumption, then, by analogy to the developments in previous chapters (e.g., Section 9.6), (14.20) must be the asymptotically optimal quadratic estimating equation, as the "weight" matrix $Z_i(\beta, \xi, x_i)$ is correctly specified.
- But the quadratic PL equation (14.19) is also the normal theory ML estimating equation for $\xi$. Thus, it is also asymptotically optimal if the data really are normally distributed. Moreover, it is also quadratic.
- Both equations cannot be the optimal quadratic equation and be different; hence, intuition suggests that estimating equations (14.19) and (14.20) must be the same.

It turns out that it is possible to show analytically the equivalence between the quadratic PL equation deduced from normality (14.19) and the equation (14.20) with $Z_i$ chosen according to the Gaussian working assumption. We defer demonstration of this to Section 14.6, where the argument is carried out in detail.

RESULT: Using a quadratic equation of the form (14.20) to estimate $\xi$, along with the linear equation for $\beta$, it is clear that iteration of the three-step algorithm on page 375 with $C = \infty$ solves the $(p + q + s)$-dimensional system of equations

$$\sum_{i=1}^m \begin{pmatrix} X_i(\beta) & 0 \\ 0 & E_i(\beta, \xi) \end{pmatrix}^T \begin{pmatrix} V_i(\beta, \xi, x_i) & 0 \\ 0 & Z_i(\beta, \xi, x_i) \end{pmatrix}^{-1} \begin{pmatrix} Y_i - f_i(x_i, \beta) \\ u_i - v_i(\beta, \xi) \end{pmatrix} = 0. \tag{14.21}$$

Compare these equations to, for example, the univariate equations (6.12), which incorporate the univariate version of the Gaussian working assumption to yield the $2\sigma^4 g_{ij}^4$ term. It is straightforward to observe that, with $n_i \equiv 1$ for all $i$ and the Gaussian working assumption, (14.21) reduces to (6.12).

Continuing the analogy, writing $\gamma = (\beta^T, \xi^T)^T$ to represent all the unknown parameters being estimated, it is clear that we can write (14.21) in the form

$$\sum_{i=1}^m D_i^T(\beta, \xi)\, \mathcal{V}_i^{-1}(\beta, \xi, x_i)\, \{s_i - \mu_i(\beta, \xi)\} = 0, \tag{14.22}$$

where

$$D_i(\beta, \xi) = \begin{pmatrix} X_i(\beta) & 0 \\ 0 & E_i(\beta, \xi) \end{pmatrix}, \quad \{n_i + n_i(n_i+1)/2\} \times (p + q + s); \qquad \mathcal{V}_i(\beta, \xi, x_i) = \begin{pmatrix} V_i(\beta, \xi, x_i) & 0 \\ 0 & Z_i(\beta, \xi, x_i) \end{pmatrix};$$

$$s_i = \begin{pmatrix} Y_i \\ u_i \end{pmatrix}, \quad \{n_i + n_i(n_i+1)/2\} \times 1; \qquad \mu_i(\beta, \xi) = \begin{pmatrix} f_i(x_i, \beta) \\ v_i(\beta, \xi) \end{pmatrix}.$$

Once the equations are written in the form (14.22), it is clear that, by analogy to the developments in Section 6.4, (14.22) may be solved by a Gauss–Newton updating scheme.

- Because the dimension of $s_i$ may be large, the full Gauss–Newton updating scheme based on (14.22) may be unwieldy in practice.
- But, as the equations "separate," just as in the univariate case, one may carry out the update in two steps: for a given update of $\xi$, solve the $\beta$ equation first (the first $p$ rows), and then, for given $\beta$, solve the $\xi$ equation (the last $q + s$ rows). This, of course, is exactly what is done in steps (ii) and (iii) of the three-step algorithm.

### 14.4 Quadratic estimating equations for β

Just as in the univariate case, it is natural in the multivariate context to think about trying to increase efficiency for estimation of $\beta$ by extracting information about $\beta$ from the covariance matrix $V_i(\beta, \xi, x_i)$. As we saw in the univariate case (Chapter 5), one way to obtain a quadratic estimating equation for $\beta$, and to motivate a general class of such equations, is to take the normal loglikelihood as a starting point.

Taking the normal loglikelihood (14.9) as a starting point and differentiating with respect to $\beta$, it is straightforward to deduce, by the same matrix differentiation operations leading to the linear equation (14.4) and the quadratic equation (14.19) for $\xi$, that the resulting equation for $\beta$ is

$$\sum_{i=1}^m X_i^T(\beta)\, V_i^{-1}(\beta, \xi, x_i)\, \{Y_i - f_i(x_i, \beta)\} + \frac{1}{2} \sum_{i=1}^m \left(\!\left( \{Y_i - f_i(x_i,\beta)\}^T V_i^{-1}(\beta,\xi,x_i) \left\{\frac{\partial}{\partial\beta_l} V_i(\beta,\xi,x_i)\right\} V_i^{-1}(\beta,\xi,x_i) \{Y_i - f_i(x_i,\beta)\} - \mathrm{tr}\left\{ V_i^{-1}(\beta,\xi,x_i)\, \frac{\partial}{\partial\beta_l} V_i(\beta,\xi,x_i) \right\} \right)\!\right) = 0 \quad (p \times 1), \tag{14.23}$$

where the double parentheses indicate $p$ terms of the form inside them, stacked for $l = 1, \ldots, p$. Of course, underlying estimating equation (14.23) is the assumption of normality. This equation would be solved jointly with the quadratic equation (14.19) for $\xi$ under the normal theory ML approach.

Alternatively, following the same reasoning as in the previous section for estimation of $\xi$, it should be clear that a more general quadratic equation is possible. In particular, define $v_i(\beta, \xi)$ and $u_i$ as before, and let

$$B_i(\beta, \xi) = \frac{\partial}{\partial \beta}\, v_i(\beta, \xi), \quad \{n_i(n_i+1)/2\} \times p,$$

so that $B_i(\beta, \xi)$ is the gradient matrix of the "mean" $v_i$ with respect to $\beta$ under the model. Then consider the joint estimating equations for $\beta$ and $\xi$ given by

$$\sum_{i=1}^m \begin{pmatrix} X_i(\beta) & 0 \\ B_i(\beta, \xi) & E_i(\beta, \xi) \end{pmatrix}^T \begin{pmatrix} V_i(\beta, \xi, x_i) & 0 \\ 0 & Z_i(\beta, \xi, x_i) \end{pmatrix}^{-1} \begin{pmatrix} Y_i - f_i(x_i, \beta) \\ u_i - v_i(\beta, \xi) \end{pmatrix} = 0. \tag{14.24}$$

Clearly, because of the presence of the gradient matrix $B_i(\beta, \xi)$ in the first $p$ rows of (14.24), this leads to an estimating equation for $\beta$ that is quadratic.

- If $Z_i(\beta, \xi, x_i)$ were chosen according to the Gaussian working assumption, then, by the same reasoning as at the end of Section 14.3, the resulting equation corresponding to the first $p$ rows of (14.24) should be the optimal such equation if normality really holds, and thus, intuitively, should be identical to the normal theory ML equation (14.23). That this is indeed the case follows from arguments like those in Section 14.6.
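The Gaussian working assumption (14.18) is concrete enough to compute directly. Below is a minimal sketch (the helper names are our own, not from the notes) that builds $Z_i$ from a given $V_i$ using $\mathrm{cov}(u_{jk}, u_{lp}) = v_{jl} v_{kp} + v_{jp} v_{kl}$, in the vech ordering used in this chapter:

```python
import numpy as np

def vech_indices(n):
    """(j, k) pairs with j >= k, stacked by columns -- the vech order above."""
    return [(j, k) for k in range(n) for j in range(k, n)]

def gaussian_working_Z(V):
    """Covariance of u = vech{(Y - f)(Y - f)^T} when Y - f is normal with
    covariance V: cov(u_jk, u_lp) = V_jl V_kp + V_jp V_kl, as in (14.18)."""
    idx = vech_indices(V.shape[0])
    Z = np.empty((len(idx), len(idx)))
    for a, (j, k) in enumerate(idx):
        for b, (l, p) in enumerate(idx):
            Z[a, b] = V[j, l] * V[k, p] + V[j, p] * V[k, l]
    return Z

# Example: an AR(1)-type working covariance for n_i = 3
V = np.array([[1.0, 0.5, 0.25],
              [0.5, 1.0, 0.5],
              [0.25, 0.5, 1.0]])
Z = gaussian_working_Z(V)
print(Z[0, 0])   # var(u_11) = 2 * V_11**2 = 2.0
```

For $n_i = 3$ this yields the $6 \times 6$ working covariance of $u_i$; note the first diagonal entry is $\mathrm{var}(u_{11}) = 2 V_{11}^2$, the "2" remarked on above.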
Note from (14.24) that the off-block-diagonal elements of the covariance matrix diag{V_i(β, ξ, x_i), Z_i(β, ξ)} are all equal to zero. Now, it is easy to show that, under normality of Y_i | x_i, we have C_i(β, ξ) = cov(Y_i, u_i | x_i) = 0. So, in fact, this observation reinforces the notion that (14.24) with the Gaussian working assumption is the optimal joint equation if normality really holds, as in this case diag{V_i(β, ξ, x_i), Z_i(β, ξ)} is exactly the covariance matrix of the response s_i = (Y_i^T, u_i^T)^T.

By analogy to the univariate case, if we believe that the distribution of Y_i | x_i is not normal, we could in principle arrive at a more general equation by specifying the covariance matrix of the response s_i = (Y_i^T, u_i^T)^T to embody more realistic assumptions about the necessary moments of s_i. If we specify var(u_i | x_i) = Z_i(β, ξ) and cov(Y_i, u_i | x_i) = C_i(β, ξ), say, we would arrive at the estimating equation

$$ \sum_{i=1}^m \begin{pmatrix} X_i(\beta) & 0 \\ B_i(\beta,\xi) & E_i(\beta,\xi) \end{pmatrix}^T \begin{pmatrix} V_i(\beta,\xi,x_i) & C_i(\beta,\xi) \\ C_i^T(\beta,\xi) & Z_i(\beta,\xi) \end{pmatrix}^{-1} \begin{pmatrix} Y_i - f_i(x_i,\beta) \\ u_i - \sigma_i(\beta,\xi) \end{pmatrix} = 0. \qquad (14.25) $$

Provided that the assumptions for Z_i(β, ξ) and C_i(β, ξ) were correct, we would expect to have constructed the optimal such quadratic equation.

• In order to do this, we must not only feel confident in the moment assumptions on var(u_i | x_i) embodied in Z_i(β, ξ); we must also make moment assumptions corresponding to C_i(β, ξ), which involve specifying terms of the form

$$ \mathrm{cov}(Y_{ij}, u_{ikl} \,|\, x_i) = \sigma^3 g_{ij}\, g_{ik}\, g_{il}\, E(\epsilon_{ij}\epsilon_{ik}\epsilon_{il} \,|\, x_i). $$

That is, we need to be willing to specify not only the skewness, as in the univariate case, but also the "three-way" associations.

• Of course, the chance that we would be able to specify the matrices Z_i and C_i completely correctly in practice is slim. Indeed, specifying V_i(β, ξ) correctly in itself is difficult enough, as we have already discussed.

Putting this issue aside for the moment, it should be clear that implementation of estimation via solution of equations of the general form in (14.25) may also be carried out via a Gauss-Newton updating scheme, redefining the matrices D_i(β, ξ) and V_i(β, ξ) in the obvious way.

• Note, however, that it is no longer possible to separate estimation of β from estimation of ξ as in the linear
estimating equation case: solution of the equation for β here, even for fixed ξ, is considerably more complicated.

WORKING ASSUMPTIONS: Obviously, as noted above, it is well recognized that correct specification of all of the necessary moments involved in an equation like (14.25) is pretty hopeless in practice. Thus, just as with the quadratic equation for estimation of β alone, the practical strategy is to make a working assumption about the entire matrix V_i. This of course involves an assumption on the matrix Z_i(β, ξ) as before, along with one on the matrix C_i. Popular working assumptions for V_i in this context are as follows, analogous to the previous discussion:

• Independence working assumption: Pretending that the elements of Y_i are mutually independent leads to the choice for Z_i(β, ξ) discussed previously on page 384 and C_i = 0.

• Gaussian working assumption: Pretending that the distribution of Y_i | x_i is normal leads to the choice for Z_i(β, ξ) given in (14.18) and C_i = 0.

TERMINOLOGY: Equations like those in (14.21) and (14.25) have come to be referred to as generalized estimating equations of specific types. Equations of the form (14.21), which involve solving a linear estimating equation for β jointly with a quadratic one for ξ, have been called GEE-1. Equations of the form (14.25), which involve solving a quadratic estimating equation for β jointly with a quadratic one for ξ, have been called GEE-2.

• This terminology was evidently coined in a paper by Liang, Zeger, and Qaqish (1992).

• The unqualified term "GEE" is used both to refer to the general approach of specifying estimating equations for mean-covariance models of the form (14.1) and to the particular case where the linear equation for β in (14.6) is solved along with simple moment-type equations for the elements of ξ. Some authors insist that the term "GEE" also embodies the notion that working assumptions are involved, including those on the correlation matrix of Y_i | x_i, that are likely to be incorrect. These authors thus also imply, when they use the term "GEE," that standard errors for the
estimator for β must be corrected to account for the possibility that the assumptions may be incorrect. We will discuss the basis for this in the context of linear estimating equations in the next section. Other authors use "GEE" more loosely and generally.

REMARKS: It should be clear from the preceding development that the same issues involving trade-offs between linear and quadratic estimating equations for β that we discussed for the univariate case extend to the multivariate setting. In fact, because the problem here is more complex, the implications may be even more profound.

• There is a choice between linear and quadratic equations.

• The linear equation is clearly unbiased even if the covariance model V_i(β, ξ) is misspecified. Such misspecification is more likely in the multivariate case, as the analyst must model not only variance but also correlation structure. The latter is probably more difficult, so most of the focus on misspecification in this context is on incorrect modeling of the correlation matrix Γ_i(α).

• The quadratic equation for β is obviously also unbiased as long as the covariance model V_i is correctly specified. Thus, if the analyst is confident in the form of this matrix, this equation offers an alternative to the linear equation that may increase efficiency. Of course, the optimal such equation would require the analyst to also specify Z_i(β, ξ) and C_i(β, ξ) correctly. If this is the case (unlikely in practice), then the resulting estimator for β will be more efficient than even the optimal linear estimator. If these matrices are not correct, although the estimating equation would remain unbiased, whether it offers an improvement in efficiency over the linear equation is no longer clear, by analogy to the results in Chapter 10.

• Of course, the scary thing about quadratic equations for β is that they will not be unbiased if V_i is not correctly specified, leading to potentially inconsistent estimation of β. In the univariate case, all that is required is correct
modeling of variance, something in which the analyst may often have great confidence. In the multivariate case, the analyst is generally much less confident, as described above. Thus, the potential for increased efficiency may be greatly offset by fear of inconsistency in the multivariate case.

Thus, it is generally agreed that GEE-1 estimation, based on solving equations of the form (14.21), is the safer choice for routine use. It is unusual to see quadratic estimating equations for β used in practice for multivariate response, although they are discussed extensively in the literature because of the theoretical potential for increased efficiency.

14.5 The "folklore theorem" and robust covariance matrix

Not surprisingly, the theoretical results for estimation of β via solution of equations of the form (14.21) and (14.25) parallel those for the univariate case. Because there are two "sample sizes" here, m (the number of individuals) and n_i (the number of observations on individual i), it is important to qualify this statement by stating explicitly the asymptotic framework in which it is valid.

• In the literature on multivariate response, the usual perspective is that the m individuals have been sampled from a population of interest. From the population-averaged modeling perspective, we model the first two moments of the relevant conditional distribution in order to learn about features of this population. The natural asymptotic framework under these conditions is one in which m → ∞ (so we learn more about the population as m increases, as in the univariate cases discussed earlier), while the n_i are regarded as fixed.

Regarding the n_i as fixed often makes practical sense. In many real situations, the number of observations on each of the experimental units may be rather small; recall the wheezing example discussed in Section 13.4, where at most four measurements were available on each of 300 children.

• Sometimes, however, there may be many observations on each unit. In this case, it may be possible to treat not only
m but also the n_i as → ∞. This requires slightly more delicacy in specifying how this might happen. In fact, a scenario that has sparked a good deal of recent interest is that in which m is not that large, but rather the n_i are. Clearly, if inference on the population of units is of interest, small m is a liability, as it limits the information available on the population. In this situation, recent work has focused on small-sample corrections to improve the reliability of inference (e.g., Mancl and DeRouen, 2001). Still other work is devoted to a different asymptotic framework in which the n_i → ∞ while m stays fixed; the scope of relevance for this is clearly different from that above.

• In many situations, the n_i may be dictated by design. But note that in some contexts the n_i may in fact be themselves a response; e.g., in the developmental toxicology setting in Example 1.7, the size of the litter may well change in response to increasing toxicity. In everything we have done so far, we have not highlighted this issue; rather, we have implicitly conditioned on the n_i. We will not delve into this issue further here, but be aware that, if circumstances dictate, the n_i themselves may in fact contain important information on scientific questions of interest, and a modeling strategy that does not condition on them may make more sense.

Here, we will focus on the framework in which m → ∞ while the n_i remain fixed. It should be clear that it would be possible to carry out large-sample theory arguments for the estimators for β and ξ in model (14.1) that would be obtained from both (14.21) and (14.25). These arguments would be analogous to those for linear and quadratic equations for β coupled with solution of an additional estimating equation for ξ, as discussed in Chapters 9 and 10.

In this section, we will confine our attention to the properties of estimators for β found by solving the linear GEE estimating equation in (14.21), along with estimation of covariance parameters by solution of an additional
equation, e.g., a quadratic equation. The famous results we derive are analogous to the "folklore theorem" for GLS estimators in the univariate case, as we now demonstrate. As the argument allowing for misspecification of the covariance matrix V_i(β, ξ, x_i) includes the case where this matrix is in fact correctly specified, we present only the more general argument under misspecification. This argument is analogous to that for GLS in the univariate case with misspecified variance function given in Section 9.3; as we saw in that case, the results when in fact the variance function is given correctly (presented in Section 9.2) were really just a special case of these.

To formalize the argument, suppose that, although the conditional mean model f_i(x_i, β) is correctly specified, the covariance model is misspecified as

$$ \mathrm{var}(Y_i \,|\, x_i) = U_i(\beta, \gamma, x_i), \qquad (14.26) $$

where it is most likely that the model depends on β through the components of the mean vector, and γ is a vector of variance and correlation parameters specifying the chosen model. As we have discussed, most often the misspecification would be in the correlation matrix. If there are no additional variance parameters (i.e., the chosen variance function is known as a function of β), then γ would represent correlation parameters in the misspecified correlation model. More generally, if the model additionally contains unknown variance parameters, some elements of γ may be the variance parameters in a correctly or incorrectly specified variance function.

Suppose that in truth the covariance model is

$$ \mathrm{var}(Y_i \,|\, x_i) = V_i(\beta_0, \xi_0, x_i), \qquad (14.27) $$

where β_0 and ξ_0 are the true values of these parameters. Define

$$ e_i = V_i^{-1/2}(\beta_0, \xi_0, x_i)\{ Y_i - f_i(x_i, \beta_0) \}, $$

where V_i^{1/2}(β_0, ξ_0, x_i) is a symmetric square root of V_i(β_0, ξ_0, x_i) and V_i^{-1/2}(β_0, ξ_0, x_i) is its inverse, such that

$$ V_i^{-1/2}(\beta_0, \xi_0, x_i)\, V_i(\beta_0, \xi_0, x_i)\, V_i^{-1/2}(\beta_0, \xi_0, x_i) = I. $$

Then it is clear that E(e_i | x_i) = 0 and var(e_i | x_i) = I.

Assuming the incorrect model (14.26), by analogy to the discussion in Section 9.3, suppose there is a value γ* such that, if we estimate γ by solving an appropriate
estimating equation, the resulting estimator γ̂ satisfies m^{1/2}(γ̂ − γ*) = O_p(1). As in the univariate case, γ̂ is likely to have been obtained by solving this estimating equation jointly with the linear GEE equation in (14.21). That is, the resulting estimator β̂ solves

$$ \sum_{i=1}^m X_i^T(\hat\beta)\, U_i^{-1}(\hat\beta^{*}, \hat\gamma, x_i)\{ Y_i - f_i(x_i, \hat\beta) \} = 0. \qquad (14.28) $$

Analogous to the arguments in Sections 9.2 and 9.3, we denote the estimator for β in the "weights" as β̂* to allow the possibility of a finite number of iterations of a three-step algorithm and to keep track of its influence.

• Clearly, even with the covariance matrix misspecified, the estimating equation in (14.28) is unbiased, so that β̂ should be consistent under regularity conditions. We also assume, as in the univariate case, that also m^{1/2}(β̂* − β_0) = O_p(1).

Expanding (14.28) in a Taylor series of (β̂^T, γ̂^T)^T about (β_0^T, γ*^T)^T, we obtain

$$ 0 \approx C_m + \{ A_m^{(1)} - A_m \}\, m^{1/2}(\hat\beta - \beta_0) + D_m\, m^{1/2}(\hat\beta^{*} - \beta_0) + E_m\, m^{1/2}(\hat\gamma - \gamma^{*}), \qquad (14.29) $$

where

$$ C_m = m^{-1/2}\sum_{i=1}^m X_i^T(\beta_0)\, U_i^{-1}(\beta_0,\gamma^{*},x_i)\, V_i^{1/2}(\beta_0,\xi_0,x_i)\, e_i, $$

$$ A_m^{(1)} = m^{-1}\sum_{i=1}^m \Big\{\frac{\partial}{\partial\beta^T} X_i^T(\beta)\Big\}\Big|_{\beta_0}\, U_i^{-1}(\beta_0,\gamma^{*},x_i)\, V_i^{1/2}(\beta_0,\xi_0,x_i)\, e_i, $$

$$ A_m = m^{-1}\sum_{i=1}^m X_i^T(\beta_0)\, U_i^{-1}(\beta_0,\gamma^{*},x_i)\, X_i(\beta_0), $$

$$ D_m = m^{-1}\sum_{i=1}^m X_i^T(\beta_0)\Big\{\frac{\partial}{\partial\beta^T} U_i^{-1}(\beta,\gamma^{*},x_i)\Big\}\Big|_{\beta_0}\, V_i^{1/2}(\beta_0,\xi_0,x_i)\, e_i, $$

$$ E_m = m^{-1}\sum_{i=1}^m X_i^T(\beta_0)\Big\{\frac{\partial}{\partial\gamma^T} U_i^{-1}(\beta_0,\gamma,x_i)\Big\}\Big|_{\gamma^{*}}\, V_i^{1/2}(\beta_0,\xi_0,x_i)\, e_i. $$

Clearly, as m → ∞, A_m^{(1)} →p 0, D_m →p 0, and E_m →p 0; thus, as expected from the univariate case, the effects of β̂* and γ̂ used to form the "weight" matrices are negligible.

We thus have, applying these results to (14.29) and rearranging,

$$ m^{1/2}(\hat\beta - \beta_0) \approx A_m^{-1} C_m. $$

Writing X_i = X_i(β_0), U_i = U_i(β_0, γ*, x_i), and V_i = V_i(β_0, ξ_0, x_i), we have that

$$ A_m \to_p A, \quad A = \lim_{m\to\infty} m^{-1}\sum_{i=1}^m X_i^T U_i^{-1} X_i, \qquad C_m \to_d \mathcal{N}(0, B), \quad B = \lim_{m\to\infty} m^{-1}\sum_{i=1}^m X_i^T U_i^{-1} V_i U_i^{-1} X_i. $$

Combining, we obtain m^{1/2}(β̂ − β_0) →d N(0, A^{-1} B A^{-1}), so that

$$ \hat\beta\ \dot\sim\ \mathcal{N}\Big(\beta_0,\ \Big(\sum_{i=1}^m X_i^T U_i^{-1} X_i\Big)^{-1}\Big(\sum_{i=1}^m X_i^T U_i^{-1} V_i U_i^{-1} X_i\Big)\Big(\sum_{i=1}^m X_i^T U_i^{-1} X_i\Big)^{-1}\Big). \qquad (14.30) $$

REMARKS:

• Obviously, if in fact the true model (14.27) and the assumed model (14.26) coincide, so that the covariance structure has been correctly specified, then U_i = V_i and the covariance matrix in (14.30) reduces to

$$ \Big(\sum_{i=1}^m X_i^T V_i^{-1} X_i\Big)^{-1}, $$

which, adopting the definitions on page 373, may be written as (X^T V^{-1} X)^{-1}. This, of course, is of the identical form to the corresponding covariance matrix in the univariate case. Similarly, the
covariance matrix in (14.30) may be written as (defining U analogous to V)

$$ (X^T U^{-1} X)^{-1} (X^T U^{-1} V U^{-1} X) (X^T U^{-1} X)^{-1}. $$

Thus, note that the comparison between the large-sample covariance matrices that obtain when var(Y_i | x_i) is correctly and incorrectly specified is identical to that in the univariate case, and the same discussion applies. In particular, misspecification of the covariance matrix of the response var(Y_i | x_i) will result in a loss of efficiency relative to modeling this moment correctly.

CAUTION: This argument requires that m^{1/2}(γ̂ − γ*) = O_p(1) for some γ*. Thus, for example, if we solve the linear estimating equation for β along with the quadratic equation for the covariance parameters where the model is misspecified, we assume that such a γ* exists. Recall that it is most likely that misspecification will be in the correlation matrix, so that this issue is the source of the difficulty. In this case, the components of γ* in the misspecified model that are correlation parameters have no real meaning.

Crowder (1995) points out that there may be situations where, if there are no unknown variance parameters, the estimator γ̂ of the correlation parameters for the chosen working correlation matrix may not have a well-defined limit in probability, required for the above argument to go through. There are configurations where the true correlation matrix is such that, for certain working assumptions that are incorrect, γ* may not exist. Details are beyond the scope of our treatment here; however, the overall message is that there is no general large-sample argument to guarantee the above "folklore" result; we must assume that a γ* such that γ̂ →p γ* exists.

QUADRATIC ESTIMATING EQUATIONS: Arguments similar to those in Chapter 10 may be used to deduce the general form of the approximate joint covariance matrix of estimators for β and ξ using quadratic equations (14.25) under different working assumptions. We do not present this here.

ROBUST ("SANDWICH") COVARIANCE MATRIX: Although modeling the covariance matrix correctly
is of course desirable because of the potential for increased precision in estimation of β, as we have discussed previously, it may be difficult to have confidence that this is the case. In particular, correct modeling of the marginal (conditional) correlation structure is likely to be difficult, so that the specified model is ordinarily viewed in most applications as only a working assumption.

Given this, it is usually unreasonable to construct estimated measures of uncertainty, such as standard errors or confidence intervals, on the basis of the "correct" folklore result. Instead, it is assumed up front that the specified model is incorrect, and the estimated standard errors are constructed using the general result (14.30). More specifically, analogous to the discussion in Section 9.4, the large-sample approximate covariance matrix we would like to estimate to use as a basis for standard errors is

$$ (X^T U^{-1} X)^{-1} (X^T U^{-1} V U^{-1} X) (X^T U^{-1} X)^{-1}. $$

This involves the matrix (X^T U^{-1} X)^{-1} that one would construct under the assumption that U_i(β, γ) is the correct model. Of course, m^{-1} times the matrix X^T U^{-1} X may be estimated by

$$ \hat{A} = m^{-1} \sum_{i=1}^m X_i^T(\hat\beta)\, U_i^{-1}(\hat\beta, \hat\gamma, x_i)\, X_i(\hat\beta). \qquad (14.31) $$

Defining R̂_i = {Y_i − f_i(x_i, β̂)}{Y_i − f_i(x_i, β̂)}^T, m^{-1} times the "middle" matrix may be estimated by

$$ \hat{B} = m^{-1} \sum_{i=1}^m X_i^T(\hat\beta)\, U_i^{-1}(\hat\beta, \hat\gamma, x_i)\, \hat{R}_i\, U_i^{-1}(\hat\beta, \hat\gamma, x_i)\, X_i(\hat\beta). \qquad (14.32) $$

The estimated matrices (14.31) and (14.32) may be combined to obtain an estimator for the covariance matrix of β̂, under the assumption that var(Y_i | x_i) may be incorrectly specified, as

$$ m^{-1}\, \hat{A}^{-1} \hat{B} \hat{A}^{-1}. \qquad (14.33) $$

Note that the "m" and "m^{-1}" terms cancel. The use of (14.33) in this context was suggested by Liang and Zeger (1986). Software packages that implement this type of GEE-1 estimation generally calculate standard errors based on (14.33) by default or allow the user to request this calculation. It is generally considered prudent in practice to use the so-called robust ("sandwich") standard errors to protect against the likely misspecification of the correlation structure.

14.6 Equivalence of pseudolikelihood and GEE estimating equations for covariance parameters

On
page 385, we noted that it is possible to show formally the equivalence between the quadratic PL equation for ξ deduced from normality, (14.19),

$$ \sum_{i=1}^m \frac{1}{2}\Big[ \{Y_i - f_i(x_i,\beta)\}^T V_i^{-1}(\beta,\xi,x_i)\Big\{\frac{\partial}{\partial\xi_k} V_i(\beta,\xi,x_i)\Big\} V_i^{-1}(\beta,\xi,x_i)\{Y_i - f_i(x_i,\beta)\} - \mathrm{tr}\Big\{ V_i^{-1}(\beta,\xi,x_i)\frac{\partial}{\partial\xi_k} V_i(\beta,\xi,x_i)\Big\} \Big] = 0, \quad k = 1,\ldots,q, \qquad (14.34) $$

and the alternative form (14.20) in the case where the matrix Z_i(β, ξ) is chosen according to the Gaussian working assumption; namely,

$$ \sum_{i=1}^m E_i^T(\beta,\xi)\, Z_i^{-1}(\beta,\xi)\{ u_i - \sigma_i(\beta,\xi) \} = 0. \qquad (14.35) $$

Here, u_i = vech[{Y_i − f_i(x_i, β)}{Y_i − f_i(x_i, β)}^T] and σ_i(β, ξ) = vech{V_i(β, ξ, x_i)}. In practice, squared terms are deleted if the model contains no unknown variance parameters; we will not note this explicitly in the following argument.

It suffices to show that (14.34) and (14.35) are in fact the same estimating equation under the conditions above; specifically, it suffices to show that their kth rows coincide. The kth row of (14.34) may be written, using the identity for quadratic forms a^T A a = tr(A a a^T), as

$$ \sum_{i=1}^m \Big[ \frac{1}{2}\,\mathrm{tr}\Big\{\frac{\partial V_i}{\partial\xi_k}\, V_i^{-1}\,(Y_i - f_i)(Y_i - f_i)^T\, V_i^{-1}\Big\} - \frac{1}{2}\,\mathrm{tr}\Big\{\frac{\partial V_i}{\partial\xi_k}\, V_i^{-1}\Big\}\Big] = 0. \qquad (14.36) $$

Noting that E_i(β, ξ) has kth row {∂/∂ξ_k vech V_i(β, ξ, x_i)}^T, we may write the kth row of (14.35) as

$$ \sum_{i=1}^m \Big\{\frac{\partial}{\partial\xi_k}\,\mathrm{vech}\, V_i(\beta,\xi,x_i)\Big\}^T Z_i^{-1}(\beta,\xi)\{ u_i - \sigma_i(\beta,\xi) \} = 0. \qquad (14.37) $$

We may show the result by showing that the ith summand in (14.36) is equal to that in (14.37). The following results are available in many advanced texts on matrix algebra, such as Chapter 16 of Harville (1997), or in Appendix 4.A of Fuller (1987). For matrices A (a × a), B, C, D:

(i) tr(AB) = {vec(A^T)}^T vec(B) = {vec(B^T)}^T vec(A).

(ii) tr(A B D^T C^T) = {vec(A^T)}^T (B ⊗ C) vec(D), where ⊗ denotes the Kronecker product.

(iii) For A symmetric, there is a relationship between vec(A) and vech(A). In particular, there exists a unique matrix Φ of dimension a² × a(a+1)/2 such that vec(A) = Φ vech(A). Clearly, Φ is unique and of full column rank, as there is only one way to write the distinct elements of A in a full, redundant vector. There also exist many (not unique) linear transformations of vec(A) into vech(A); it should be clear that there cannot be a unique such transformation. Consider a transformation matrix Π of dimension a(a+1)/2 × a² such that vech(A) = Π vec(A). One particular choice of Π is the Moore-Penrose generalized inverse of
Φ, Φ^+ = (Φ^T Φ)^{-1} Φ^T; Fuller (1987, page 383) gives the actual form of Π.

Because we are addressing equivalency under normality assumptions, take Y_i | x_i to be normally distributed for the purposes of the argument. Under these conditions, it is possible to show (see, for example, Fuller, 1987, Lemma 4.A.1) that

$$ \mathrm{var}(u_i \,|\, x_i) = Z_i(\beta,\xi) = 2\,\Phi^{+}\{ V_i(\beta,\xi,x_i) \otimes V_i(\beta,\xi,x_i) \}\Phi^{+T}. \qquad (14.38) $$

In fact, (14.38) is a compact way of expressing (14.18). Result 4.A.3.1 of Fuller (1987, page 385) then yields that

$$ \{\mathrm{var}(u_i \,|\, x_i)\}^{-1} = \frac{1}{2}\,\Phi^T\{ V_i^{-1}(\beta,\xi,x_i) \otimes V_i^{-1}(\beta,\xi,x_i) \}\Phi. \qquad (14.39) $$

Armed with these results, we are now in a position to show the desired correspondence. For brevity, we will suppress the arguments of all matrices and vectors. The estimating equation in (14.36) has two parts:

$$ \frac{1}{2}\,\mathrm{tr}\Big\{\frac{\partial V_i}{\partial\xi_k}\, V_i^{-1}(Y_i - f_i)(Y_i - f_i)^T V_i^{-1}\Big\} \qquad (14.40) $$

$$ -\frac{1}{2}\,\mathrm{tr}\Big\{\frac{\partial V_i}{\partial\xi_k}\, V_i^{-1}\Big\}. \qquad (14.41) $$

Consider (14.40). By result (ii) on page 397, identifying A = ∂V_i/∂ξ_k, B = V_i^{-1}, D^T = (Y_i − f_i)(Y_i − f_i)^T, and C^T = V_i^{-1}, we have

$$ \frac{1}{2}\,\mathrm{tr}\Big\{\frac{\partial V_i}{\partial\xi_k} V_i^{-1}(Y_i-f_i)(Y_i-f_i)^T V_i^{-1}\Big\} = \frac{1}{2}\Big\{\mathrm{vec}\Big(\frac{\partial V_i}{\partial\xi_k}\Big)\Big\}^T (V_i^{-1} \otimes V_i^{-1})\, \mathrm{vec}\{(Y_i-f_i)(Y_i-f_i)^T\} $$
$$ = \Big\{\mathrm{vech}\Big(\frac{\partial V_i}{\partial\xi_k}\Big)\Big\}^T \Big[\frac{1}{2}\,\Phi^T (V_i^{-1} \otimes V_i^{-1})\,\Phi\Big] u_i, \qquad (14.42) $$

using the definition of u_i. Now, from the definition of Φ, we have vech(∂V_i/∂ξ_k) = ∂/∂ξ_k vech(V_i). Moreover, the middle term in (14.42), in brackets, equals Z_i^{-1} by (14.39), as we are doing these calculations under normality. Substituting these developments into (14.42) yields

$$ \Big\{\frac{\partial}{\partial\xi_k}\,\mathrm{vech}(V_i)\Big\}^T Z_i^{-1}\, u_i. \qquad (14.43) $$

Now consider (14.41). Applying result (ii) on page 397 again (with D^T = V_i, so that tr{(∂V_i/∂ξ_k) V_i^{-1}} = tr{(∂V_i/∂ξ_k) V_i^{-1} V_i V_i^{-1}}) gives

$$ -\frac{1}{2}\,\mathrm{tr}\Big\{\frac{\partial V_i}{\partial\xi_k}\, V_i^{-1} V_i V_i^{-1}\Big\} = -\Big\{\mathrm{vech}\Big(\frac{\partial V_i}{\partial\xi_k}\Big)\Big\}^T \Big[\frac{1}{2}\,\Phi^T (V_i^{-1} \otimes V_i^{-1})\,\Phi\Big] \mathrm{vech}(V_i) = -\Big\{\frac{\partial}{\partial\xi_k}\,\mathrm{vech}(V_i)\Big\}^T Z_i^{-1}\, \sigma_i. \qquad (14.44) $$

Combining (14.43) and (14.44), we obtain that the kth row of the PL summand in (14.36) is in fact equal to the kth row of the GEE summand in (14.37), namely

$$ \Big\{\frac{\partial}{\partial\xi_k}\,\mathrm{vech}(V_i)\Big\}^T Z_i^{-1}( u_i - \sigma_i ), $$

as desired. Of course, it is in fact possible to carry out the argument in the reverse direction, starting from (14.37). Note that the same type of argument may be applied to the second term in the quadratic estimating equation for β, so that in fact the joint normal ML equations may be written in the GEE-2 form with the Gaussian working assumption employed.

14.7
Implementation in SAS and R

For fitting marginal (population-averaged) models using the GEE-1 and GEE-2 approaches, there are few convenient options. Popular available software implements only the linear estimating equation for β, allowing only unknown correlation parameters in the covariance model; i.e., no unknown variance parameters except for a scale parameter σ². These parameters are estimated using the simple moment methods advocated by Liang and Zeger (1986) rather than the quadratic estimating equation methods used in the GEE-1 approach. We demonstrate use of SAS proc genmod and the R/S-Plus function gee() below.

Implementation of the GEE-1 approach may be carried out in SAS using a macro that was originally designed for another purpose. In particular, the nlinmix macro was originally targeted for fitting of nonlinear subject-specific models with random effects (nonlinear mixed effects models) by an approximate technique; this is discussed in Chapter 15. It turns out, as discussed in Chapter 12 of Littell, Milliken, Stroup, and Wolfinger (1996), that this macro may also be used to fit marginal models in the manner we have discussed. In particular, in this approach, the linear estimating equation for β is solved jointly with the quadratic estimating equation for the covariance parameters under the Gaussian working assumption. As with proc genmod and gee(), only correlation parameters may be estimated; it is not possible to estimate unknown parameters θ in a variance function, except for a scale parameter σ². SAS proc mixed, which invokes normal theory maximum likelihood for very general linear models, including those for multivariate response, is used to estimate the covariance parameters, where a linear approximation to the nonlinear mean model is used. Iteration between estimation of β and estimation of the covariance parameters continues until convergence.

To illustrate the use of these packages, we consider a famous data set first reported by Thall and Vail (1990). A clinical
trial was conducted in which 59 people with epilepsy suffering from simple or partial seizures were assigned at random to receive either the anti-epileptic drug progabide or an inert substance (a placebo), in addition to a standard chemotherapy regimen all were taking. Because each individual might be prone to different rates of experiencing seizures, the investigators first tried to get a sense of this by recording the number of seizures suffered by each subject over the 8-week period prior to the start of administration of the assigned treatment. It is common in such studies to record such baseline measurements, so that the effect of treatment for each subject may be measured relative to how that subject behaved before treatment.

Following the commencement of treatment, the number of seizures for each subject was counted for each of four consecutive two-week periods. The age of each subject at the start of the study was also recorded, as it was suspected that the age of the subject might be associated with the effect of the treatment somehow. The goal of an analysis was to determine whether administration of progabide results in fewer seizure episodes, on average. The data for the first 5 subjects in each treatment group are summarized in Table 14.1. The response, number of seizures, is in the form of a count, for which the Poisson distribution might be considered an appropriate model.

Table 14.1: Seizure counts for 5 subjects assigned to placebo (0) and 5 subjects assigned to progabide (1).

                   Period
  Subject    1    2    3    4    Trt   Baseline   Age
        1    5    3    3    3     0         11     31
        2    3    5    3    3     0         11     30
        3    2    4    0    5     0          6     25
        4    4    4    1    4     0                36
        5    7   18    9   21     0         66     22
       29   11   14    9    8     1         76     18
       30    8    7    9    4     1         38     32
       31    0    4    3    0     1         19     20
       32    3    6    1    3     1         10     30
       33    2    6    7    4     1         19     18

Many authors who have considered this data set have chosen to treat the baseline response as a covariate rather than as one of the repeated measurements of the response (at "time 0"). Thall and Vail did this in their original report of analysis of these
data. This is partly because the baseline response was collected over 8 weeks, while the responses post-treatment were collected over two weeks; but in reality the baseline response is a response, so it probably should be treated as one of the responses, suitably scaled to be put on a two-week basis. Although this would be a better way to view these data, we follow the original source and fit the model that they did. Thus, we regard the data as having repeated measurements n_i = 4 for all i on each of m = 59 subjects. Because we have data from many subjects, we expect the marginal distribution to exhibit overdispersion.

Let Y_ij be the seizure count for subject i at his/her visit to the clinic for the jth period following initiation of treatment, where the corresponding times of measurement are biweekly visits coded as t_ij = 1, 2, 3, 4 for j = 1, …, 4 = n_i. Let δ_i be the treatment indicator for the ith patient: δ_i = 0 for placebo subjects, δ_i = 1 for progabide subjects. Following Thall and Vail (1990), for subject i, let a_i be the logarithm of age, and let b_i be the logarithm of the number of baseline seizures experienced in the eight-week pre-treatment period divided by 4, to place the baseline count on the same basis as the post-treatment counts taken at two-week intervals.

Summary of the means at each time point post-treatment shows that they basically do not change until the last visit (period). Accordingly, we consider the model that Thall and Vail did that accommodates this. Define v_ij = 0 if observation j for subject i is prior to the 4th and last visit, and v_ij = 1 if observation j is the last one (4th visit). We define x_i = (b_i, a_i, δ_i, v_i1, …, v_i n_i)^T.

A popular model in these circumstances for the mean number of seizures would be a loglinear model. Consider the particular such model where the conditional mean for Y_ij is taken to depend only on x_ij = (b_i, a_i, δ_i, v_ij)^T, given by

$$ E(Y_{ij} \,|\, x_i) = \exp(\beta_1 + \beta_2 b_i + \beta_3 a_i + \beta_4 \delta_i + \beta_5 v_{ij} + \beta_6 b_i \delta_i), \quad \beta = (\beta_1, \ldots, \beta_6)^T. \qquad (14.45) $$

A natural model for variance is

$$ \mathrm{var}(Y_{ij} \,|\, x_i) = \sigma^2\, E(Y_{ij} \,|\, x_i), \qquad (14.46) $$

where σ² allows for the possibility
of overdispersion. In the following programs, for illustration, we consider three working correlation assumptions:

• Completely unstructured, so that the correlation pattern is assumed to have no particular form;

• Compound symmetry (exchangeability);

• Autoregressive of order 1, as the visits are equally spaced in time.

Full information on SAS proc genmod is available in the SAS/STAT documentation, and a detailed discussion of the nlinmix macro may be found in Chapter 12 of Littell, Milliken, Stroup, and Wolfinger (1996). The version discussed there is outdated; newer versions and some minimal documentation and examples are given on the Technical Support pages at http://www.sas.com. There is less detailed information on the gee() function; it is available in R using the help function, i.e., type help(gee), for a basic accounting of syntax and operations.

The following programs do not have accompanying detailed descriptions; they are fairly well documented. See the above references for more details and further options. It is important to note that proc genmod and gee() impose limitations on the form of the mean model, as it is specified through the standard "link functions" used in popular generalized linear models. This requires that the mean be a function of β through a linear predictor. Similarly, the form of the variance function is restricted to correspond to one of the scaled exponential family types. There are ways around these restrictions, but they require specialized programming on the part of the user. The nlinmix macro allows more general mean and variance functions to be specified by the user.

PROGRAM 14.1: Implementing the linear estimating equation with simple moment methods using SAS proc genmod.

The following program calls proc genmod several times, the first three calls fitting the mean-variance model in (14.45) and (14.46) with the three different working correlation models noted above. The fourth call fits a different mean model, similar to that fitted by Thall and Vail (1990).
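To make concrete the computations that such packages carry out, here is a rough numpy sketch of the GEE-1 scheme they implement: the linear estimating equation for β under a Poisson-type mean-variance model with exchangeable working correlation, simple moment estimates of the scale σ² and correlation α from Pearson residuals in the spirit of Liang and Zeger (1986), and robust "sandwich" standard errors as in (14.33). The data are simulated, not the seizure data, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 200, 4, 2                          # subjects, obs/subject, dim(beta)
beta_true = np.array([0.5, 0.3])
X = [np.column_stack([np.ones(n), rng.uniform(0, 2, n)]) for _ in range(m)]
Y = [rng.poisson(np.exp(Xi @ beta_true)) for Xi in X]   # simulated counts

def exch(n, a):                              # exchangeable working correlation
    return (1 - a) * np.eye(n) + a * np.ones((n, n))

beta = np.zeros(p)
for _ in range(25):
    # moment estimates of sigma^2 (phi) and alpha from Pearson residuals
    res = [(Y[i] - np.exp(X[i] @ beta)) / np.sqrt(np.exp(X[i] @ beta))
           for i in range(m)]
    phi = sum(r @ r for r in res) / (m * n - p)
    npairs = m * n * (n - 1) / 2
    alpha = sum((r.sum() ** 2 - r @ r) / 2 for r in res) / (phi * (npairs - p))
    # one Fisher-scoring step for the linear GEE:
    #   sum_i D_i' V_i^{-1} (Y_i - f_i) = 0, with D_i = diag(f_i) X_i
    #   for the loglinear mean, V_i = phi T^{1/2} R(alpha) T^{1/2}, T = diag(f_i)
    A = np.zeros((p, p)); b = np.zeros(p)
    for i in range(m):
        fi = np.exp(X[i] @ beta)
        Di = fi[:, None] * X[i]
        Vinv = np.linalg.inv(phi * np.sqrt(np.outer(fi, fi)) * exch(n, alpha))
        A += Di.T @ Vinv @ Di
        b += Di.T @ Vinv @ (Y[i] - fi)
    step = np.linalg.solve(A, b)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-8:
        break

# robust sandwich covariance (14.33): Ahat^{-1} Bhat Ahat^{-1}, with Bhat
# built from residual outer products, per Liang and Zeger (1986)
A = np.zeros((p, p)); B = np.zeros((p, p))
for i in range(m):
    fi = np.exp(X[i] @ beta)
    Di = fi[:, None] * X[i]
    Vinv = np.linalg.inv(phi * np.sqrt(np.outer(fi, fi)) * exch(n, alpha))
    ri = Y[i] - fi
    A += Di.T @ Vinv @ Di
    B += Di.T @ Vinv @ np.outer(ri, ri) @ Vinv @ Di
Ainv = np.linalg.inv(A)
robust_se = np.sqrt(np.diag(Ainv @ B @ Ainv))
print(np.round(beta, 2), np.round(robust_se, 3))
```

Swapping exch() for an AR(1) or unstructured working matrix, with the corresponding moment estimator of its parameters, changes only the construction of the correlation; the β update and the sandwich calculation are unchanged. This is precisely the structure behind the TYPE= option in proc genmod below.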
PROGRAM STATEMENTS:

/* Fit a loglinear regression model to the epileptic seizure data
   first reported in a paper by Thall and Vail (1990). We use the
   Poisson mean-variance assumptions. This model is fitted with
   different working correlation matrices. */

options ls=80 ps=59 nodate;
run;

/* The data look like (first 8 records, on the first 2 subjects):

   104 5 1 0 11 31
   104 3 2 0 11 31
   104 3 3 0 11 31
   104 3 4 0 11 31
   106 3 1 0 11 30
   106 5 2 0 11 30
   106 3 3 0 11 30
   106 3 4 0 11 30

   column 1   subject
   column 2   number of seizures
   column 3   visit (1-4, biweekly visits)
   column 4   0 if placebo, 1 if progabide
   column 5   baseline number of seizures in 8 weeks prior to study
   column 6   age                                                    */

data seizure;
  infile 'seize.dat';
  input subject seize visit trt base age;
run;

We fit the loglinear regression model using PROC GENMOD and three different working correlation matrix assumptions: unstructured, compound symmetry (exchangeable), and AR(1).

Subject 207 has what appear to be very unusual data; for this subject, both baseline and study-period numbers of seizures are huge, much larger than for any other subject. We leave this subject in for our analysis; in some published analyses this subject is deleted. See Diggle, Heagerty, Liang, and Zeger (2002) and Thall and Vail (1990) for more on this subject.

We fit a modified model exactly like the one in Thall and Vail (1990). We define logbase = log(base/4), logage = log(age), and basetrt = logbase*trt. Here, visit4 is an indicator of the last visit, and basetrt is logbase for treated subjects and 0 for placebo subjects. visit4 allows that the mean response changed over the study period, but not gradually; rather, the change only showed up toward the end of the study.

The DIST=POISSON option in the model statement specifies
given in the TYPE 0 tion The MODELSE 0E tandard error estimates printed for t e elemen s of etahat are based on assumi the correlation matrix is correctly specified By default the ones based on the quotrobustquot version of the sampling covariance matrix are printed o The dispersion arameter is estimated rather then bein hel fixed at 1 t is allows for the possibility of quotovergispersionquot The V6CORR option asks that the estimator be computed using the tes method discussed in the no data seizure set seizure if subject207 then delete logbaselogbase4 logagelogage basetrtlogbasetrt if vis tlt if visit4 then visit41 r title quotUNSTRUCTURED CORRELATIONquot proc enmo dataseizure class subject logbase logage trt visit4 basetrt dist p i n nk lo 0 model seize y repeated subjectsubject typeun corrw modelse v6corr run title quotEXCHANGEABLE COMPOUND SYMMETRY CORRELATIONquot dataseizure class subjec model seize logbase logage trt vi dist p repeated subjectsubject typecs corrw modelse v6corr n sit4 basetrt 10 title quotAR1 CORRELATIONquot proc genmod ataseizure class subject model seize logbase logage trt visit4 basetrt dist isson lo repeated subjectsubject typear1 corrw modelse v6corr n OUTPUT UNSTRUCTURED CORRELATION 1 The GENMOD Procedure Model Information Data Set WORKSEIZURE Distribution Poiss Link Function 0 Dependent Variable seize Number of Observations Read 236 Number of Observations Used 236 Class Level Information Class Levels Values subject 59 101 102 103 104 106 107 108 110 111 112 113 114 2 123 124 126 128 129 130 135 2 221 222 225 226 227 228 230 232 234 236 2 Parameter Information Parameter Effect PU E404 CHAPTER14 ST 762 M DAVDDIAN Prm1 Intercept rm 0 a e Prm4 trg g Prm5 visit4 Prm6 basetrt Criteria For Assessing Goodness Of Fit Criterion DF Value Deviance 230 8693236 ca d Dev ce 230 8 93236 Pearson ChiSquare 230 10147112 Scaled Pearson X2 230 10147112 Log Likelihood 29941327 Algorithm converged UNSTRUCTURED CORRELATION The GENMOD Procedure ValueDE Analysis 
Of Initial Parameter Estimates Standard Wald 95 Parameter DF Estimate Error Confidence Limits Intercept 1 27576 04075 35562 19590 logbase 1 09495 00436 08641 10349 logage 1 08971 01164 06688 1253 trt 1 13411 01567 16483 10339 visit4 1 01611 00546 02681 00541 basetrt 1 05622 00635 04378 06867 0 10000 00000 10000 1 0000 NOTE The scale parameter was held fixed GEE Model Information Correlation Structure Sub ect Effect Num er of Clusters Correlation Matrix Dimension Maximum Cluster Size Minimum Cluster Size Unstr subject 59 Algorithm converged Working Correlation Matrix Col1 Col2 Col3 Row1 1 0000 0 2846 02608 Row2 0 2846 1 0000 06520 Row3 0 2608 0 6520 10000 Row4 0 1564 0 3480 04618 Analysis Of GEE Parameter Estimates Empirical Standard Error Estimates Standard 95 Confidence Parameter Estimate Error Limits Intercept 30705 09449 49225 12184 3 09371 00929 07550 11192 10 logage 10009 02737 04644 15374 trt 14974 04223 23251 06697 3 visit4 01560 00783 03095 00025 1 basetrt 06281 01699 02952 09610 3 UNSTRUCTURED CORRELATION The GENMOD Procedure Anal sis Of GEE Parameter Estimates Mode Based Standard Error Estimates Standard 95 Confidence Parameter Estimate Error Limits Intercept 30705 12019 54261 07148 2 ogbase 09371 01275 06873 11869 7 3 Chi Square uctured levels 59 4 4 4 C014 01564 03480 04618 10000 Z Pr gt IZI Z Pr gt IZI 55 35 00106 lt0001 lt lt lt lt O lt Pr gt ChiSq PU4E405 CHAPTER14 logage trt visit4 basetrt ca e NOTE The scale p root 0 arameter for GEE estimation was computed e normalized Pearson s c isquare 10009 14974 01560 06281 2 0836 03433 04617 00938 01866 03281 16737 2 24023 05925 3 03398 00278 1 02624 09937 3 92 00035 24 00012 66 00963 37 00008 as the sduare EXCHANGEABLE COMPOUND SYMMETRY CORRELATION 4 The GENMOD Procedure Model Information WORKSEIZURE Distribution oisson Li unction og Dependent Variable seize Number of Observations Read 236 Number of Observations Used 236 Class Level Information Class Levels Values subject 59 101 102 103 104 106 107 108 110 
111 112 113 114 116 117 118 121 122 123 124 126 128 129 130 135 137 139 141 143 145 147 201 202 203 204 205 206 207 208 209 210 211 213 214 215 217 218 219 220 221 222 225 226 227 228 230 232 234 236 238 Parameter Information Parameter Effect Prm1 Intercept rm 0 a e Prm4 trg g Prm5 visit4 Prm6 basetrt Criteria For Assessing Goodness Of Fit Criterion DF Devi ce 230 Sca d Devi ce 230 Pearson ChiSquare 230 Sca d Pearson X2 230 e Log Likelihood Algorithm converged Value 8693236 8 6 10147112 29941327 ValueDF EXCHANGEABLE COMPOUND SYMMETRY CORRELATION The GENMOD Procedure Analysis Of Initial Parameter Estimates Standard ald 95 Chi Parameter DF Estimate Error Confidence Limits Square Pr gt ChiSq Intercept 1 27576 04075 35562 19590 4580 lt 0001 logbase 09495 00436 08641 10349 47511 lt 0001 logage 1 08971 01164 06688 11253 5935 lt0001 trt 1 13411 01567 16483 10339 7321 lt0001 visit4 1 01611 00546 0268 0054 871 0 0032 basetrt 1 05622 00635 04378 06867 7840 lt 0001 Scale 0 10000 00000 10000 1 0000 NOTE The scale parameter was held fixed GEE Model Information Correlation Structure xchangeable Sub ect fect subject 59 levels Num e of Clu 59 Correlation Matrix Dimension 4 Maximum Cluster Size 4 Minimum Cluster Size 4 ST 762 M DAVDDIAN P54E406 CHAPTER 14 ST 762 M DAVDDIAN Algorithm converged Working Correlation Matrix C011 C012 Col3 Col4 Row1 1 0000 0 3582 0 3582 0 3582 Row2 0 3582 1 0000 0 3582 0 3582 Row3 0 3582 0 3582 1 0000 0 3582 Row4 0 3582 0 3582 0 3582 1 0000 Exchangeable Working Correlation Correlation 03582063731 Analysis Of GEE Parameter Estimates Empirical Standard Error Estimates Standard 95 Confidence Parameter Estimate Error Limits Z Pr gt IZI Intercept 27939 09561 46678 09199 292 00035 logbase 0 9504 00987 07569 11439 963 lt0001 logage 09066 02772 03633 14499 327 00011 trt 13386 04296 21805 04967 312 00018 EXCHANGEABLE COMPOUND SYMMETRY CORRELATION 6 The GENMOD Procedure Analysis Of GEE Parameter Estimates Empirical Standard Error Estimates Standard 95 Confidence 
Parameter Estimate Error Limits Z Pr gt IZI visit4 01611 00656 02896 00325 246 00140 basetrt 05633 01749 02205 09060 322 00013 Anal sis Of GEE Parameter Estimates Mode Based Standard Error Estimates Standard 95 Confidence Parameter Estimate Error Limits Z Pr gt IZI Intercept 27939 12158 51767 04110 230 0 0216 10 base 09504 01301 06954 12055 730 lt0001 logage 09066 03475 0 2256 15876 261 0 0091 trt 13386 04678 2 2554 04218 286 00042 visit4 01611 00908 03391 0 0169 1 77 0 0761 basetrt 05633 01895 01919 039347 2397 030030 1 2 742 ca e NOTE The scale parameter for GEE estimation was computed as the square root of the normalized Pearson s chisquare AR1 CORRELATION 7 The GENMOD Procedure Model Information WORKSEIZURE Poiss Distribution i unctio 0g Dependent Variable seize Number of Observations Read 236 Number of Observations Used 236 Class Level Information Class Levels Values subject 59 101 102 103 104 106 107 108 110 111 112 113 114 2 221 222 225 226 227 228 230 232 234 236 238 Parameter Information PAE 407 CHAPTER14 Parameter Effect Prm1 Intercept 1 3 1 rm 0 a e Prm4 g g Prm5 visit4 Prm6 basetrt Criteria For Assessing Goodness Of Fit Criterion DF Value Deviance 230 8693236 Scaled Deviance 230 9323 Pearso hiSqu re 30 10147112 Sca ed Pearson X2 230 10147112 Log Likelihood 29941327 Algorithm converged AR1 CORRELATION The GENMOD Procedure Analysis Of Initial Parameter Estimates ValueDF AOAAAA Standard Wald 95 Chi Parameter DF Estimate Error Confidence Limits Square Pr gt ChiSq Intercept 1 27576 04075 35562 19590 580 logbase 09495 00436 08641 10349 47511 logage 1 08971 01164 06688 11253 5935 1 13411 01567 16483 10339 7321 1 01611 00546 02681 00541 871 basetrt 1 05622 00635 04378 06867 7840 0 10000 00000 10000 1 0000 NOTE The scale parameter was held fixed GEE Model Information Correlation Structure AR1 Sub ect ect subject 59 levels um er of Clusters 59 Correlation Matrix Dimension 4 Maximum Cluster ize 4 Minimum Cluster Size 4 Algorithm converged Working Correlation 
Matrix

              Col1      Col2      Col3      Col4
    Row1    1.0000    0.4661    0.2173    0.1013
    Row2    0.4661    1.0000    0.4661    0.2173
    Row3    0.2173    0.4661    1.0000    0.4661
    Row4    0.1013    0.2173    0.4661    1.0000

                Analysis Of GEE Parameter Estimates
                 Empirical Standard Error Estimates
                        Standard    95% Confidence
  Parameter  Estimate      Error        Limits            Z   Pr > |Z|
  Intercept   -3.0526     0.9385   -4.8920  -1.2132   -3.25     0.0011
  logbase      0.9445     0.0928    0.7627   1.1263   10.18     <.0001
  logage       0.9908     0.2737    0.4543   1.5272    3.62     0.0003
  trt         -1.4828     0.4168   -2.2998  -0.6659   -3.56     0.0004
  visit4      -0.1552     0.0892   -0.3301   0.0197   -1.74     0.0820
  basetrt      0.6193     0.1692    0.2876   0.9509    3.66     0.0003

                          AR(1) CORRELATION

                        The GENMOD Procedure

                Analysis Of GEE Parameter Estimates
                Model-Based Standard Error Estimates
                        Standard    95% Confidence
  Parameter  Estimate      Error        Limits            Z   Pr > |Z|
  Intercept   -3.0526     1.1818   -5.3690  -0.7362   -2.58     0.0098
  logbase      0.9445     0.1254    0.6988   1.1902    7.53     <.0001
  logage       0.9908     0.3375    0.3292   1.6523    2.94     0.0033
  trt         -1.4828     0.4545   -2.3737  -0.5920   -3.26     0.0011
  visit4      -0.1552     0.0950   -0.3414   0.0311   -1.63     0.1025
  basetrt      0.6193     0.1835    0.2595   0.9790    3.37     0.0007

  NOTE: The scale parameter for GEE estimation was computed as the
        square root of the normalized Pearson's chi-square.

PROGRAM 14.2: Implementing the linear estimating equation with simple moment methods using the R function gee().

The following program calls the function gee() three times to fit the mean-variance model in (14.45) and (14.46) with the three different working correlation models noted above. Note that there are some slight differences in the results from those obtained from SAS proc genmod, even though the two programs are supposed to be carrying out the same calculations. These are likely due to differences in the implementation.

PROGRAM STATEMENTS:

  #  Fit gee with linear estimating equation for beta and moment
  #  methods as in Liang and Zeger (1986) for the correlation
  #  parameter, using the R function gee()

  #  load the gee library
  library(gee)

  #  output data set
  outfile <- "gee.seizure.R.out"

  #  read in the data set
  thedata <- matrix(scan("seize.dat"),ncol=6,byrow=T)
  subj <- thedata[,1]
  seize <- thedata[,2]
  visit <- thedata[,3]
  trt <- thedata[,4]
  base <- thedata[,5]
  age <- thedata[,6]

  #  log transform baseline and age
  logbase <- log(base/4)
  logage <- log(age)

  #  other variables that could be used in more complicated models
  basetrt <- logbase*trt
  visit4 <- as.numeric(visit==4)

  #  calls to gee function
  unfit <- gee(seize ~ logbase+logage+trt+visit4+basetrt,id=subj,
               family=poisson,corstr="unstructured")
  cat("UNSTRUCTURED WORKING CORRELATION","\n","\n",file=outfile,append=F)
  sink(file=outfile,append=T)
  print(summary(unfit))
  sink()
  cat("\n","\n",file=outfile,append=T)

  csfit <- gee(seize ~ logbase+logage+trt+visit4+basetrt,id=subj,
               family=poisson,corstr="exchangeable")
  cat("EXCHANGEABLE WORKING CORRELATION","\n","\n",file=outfile,append=T)
  sink(file=outfile,append=T)
  print(summary(csfit))
  sink()
  cat("\n","\n",file=outfile,append=T)

  #  note demonstration of use of user-supplied starting values here
  ar1fit <- gee(seize ~ logbase+logage+trt+visit4+basetrt,id=subj,
                family=poisson,b=c(-2,1,1,-1,-0.1,0.5),
                corstr="AR-M",Mv=1)
  cat("AR(1) WORKING CORRELATION","\n","\n",file=outfile,append=T)
  sink(file=outfile,append=T)
  print(summary(ar1fit))
  sink()

OUTPUT:

UNSTRUCTURED WORKING CORRELATION

 GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
 gee S-function, version 4.13 modified 98/01/27 (1998)

 Model:
  Link:                      Logarithm
  Variance to Mean Relation: Poisson
  Correlation Structure:     Unstructured

 Call:
 gee(formula = seize ~ logbase + logage + trt + visit4 + basetrt,
     id = subj, family = poisson, corstr = "unstructured")

 Summary of Residuals:
         Min          1Q      Median          3Q         Max
 -14.1473158  -2.9156034  -0.5648066   1.6966299  59.7187817

 Coefficients:
               Estimate   Naive S.E.   Naive z  Robust S.E.  Robust z
 (Intercept) -3.0703916  1.21744843 -2.521989   0.94492946 -3.249334
 logbase      0.9370925  0.12911640  7.257734   0.09289341 10.087825
 logage       1.0008969  0.34772442  2.878420   0.27372135  3.656627
 trt         -1.4973787  0.46765660 -3.201877   0.42230224 -3.545751
 visit4      -0.1559873  0.09499776 -1.642010   0.07831289 -1.991847
 basetrt      0.6280492  0.18899638  3.323076   0.16985140  3.697639

 Estimated Scale Parameter:  4.45469
 Number of Iterations:  4

 Working Correlation
           1         2         3         4
 1 1.0000000 0.2846098 0.2608401 0.1563989
 2 0.2846098 1.0000000 0.6519718 0.3479656
 3 0.2608401 0.6519718 1.0000000
04617796 4 01563989 03479656 04617796 10000000 EXCHANGEABLE WORKING CORRELATION GEE GENERALIZED LINEAR MODELS FOR DEPENDENT DATA gee Sfunction version 413 modified 980127 1998 Model L nk i Logarithm Variance to Mean Relation Poisson n rr 1i n quot Call geeformula seize 39 logbase logage trt visit4 basetrt i subj family poisson corstr quotexchangeablequot Summary of Residuals in 1Q Median 3Q Max 142804674 29024753 04940558 17757319 598699344 Coefficients Estimate Naive SE Naive z Robust SE Robust z Intercept 27933674 122877062 2273303 095596337 2922044 logbase 09504074 013152324 7226156 009869915 9629338 logage 09064447 035118728 2581086 027715935 3270482 trt 13386276 047277353 2831435 042951137 3116629 visit4 01610871 009219968 1747155 006558188 2456275 basetrt 05632558 019152488 2940902 017485489 3221276 Estimated Scale Parameter 4414352 um er of Iterations 2 Working Correlation PU E410 CHAPTER 14 ST 762 M DAVDDIAN 1 2 3 4 1 10000000 03551212 03551212 03551212 2 03551212 10000000 03551212 03551212 3 03551212 03551212 10000000 03551212 4 03551212 03551212 03551212 10000000 AR1 WORKING CORRELATION GEE GENERALIZED LINEAR MODELS FOR DEPENDENT DATA gee Sfunction version 413 modified 980127 1998 Model Link Logarithm Variance to Mean Relation Poisson Correlation Structure ARM 1 seize 39 logbase logage trt visit4 subJ b c2 1 1 1 01 05 corstr quotARMquot Mv 1 Call geeformula basetrt i family poisson Summary of Residuals in 1Q Median 3Q Max 141147621 29201518 05479669 16786673 596736888 Coefficients Estimate Naive SE Naive z Robust SE Robust z Intercept 30525967 119717709 2549829 093848186 3252697 logbase 09 447 4 62 012699946 7436852 009276876 10180973 logage 09907881 034191424 2897768 027370217 361 950 trt 14828350 046042681 3220566 041682111 3557485 visit4 01551820 009626332 1612057 008922172 1739285 basetrt 06192620 018592401 3330727 016921026 3659719 Estimated Scale Parameter 4462371 Num er of Iterations 5 Working Correlation 2 3 4 1 10000000 04661711 02173155 01013062 
 2 0.4661711 1.0000000 0.4661711 0.2173155
 3 0.2173155 0.4661711 1.0000000 0.4661711
 4 0.1013062 0.2173155 0.4661711 1.0000000

PROGRAM 14.3: Implementing the linear estimating equation with the quadratic estimating equation for the correlation parameters using the SAS macro nlinmix.

The following program invokes the macro nlinmix to implement the three working correlation assumptions. We use the most recent version of the macro, available on the SAS technical support web site.

Comparison of the results to those from Programs 14.1 and 14.2 shows that use of the more sophisticated quadratic estimating equation with the Gaussian working assumption to estimate the correlation parameters in each case yields similar but different estimates for β.

The nlinmix macro requires starting values for β, unlike proc genmod and gee(), which use standard methods for deriving starting values in generalized linear models to obtain these internally (as they restrict attention to such models). Starting values here may be obtained in the usual ways we have discussed for nonlinear models, perhaps by pooling the data from all m units together and treating them as mutually independent. For mean-variance models of the generalized linear model type, starting values can be obtained from a preliminary fit (e.g., using proc genmod) of the model assuming all observations are independent, which requires no starting values from the user.

The user should be aware that the log file generated when SAS is run contains a log of the intermediate calculations over each iteration. This may be examined in the event that convergence is difficult to achieve. For brevity, this file is not reproduced for the example here.

PROGRAM STATEMENTS:

  Fit gee with linear estimating equation for beta and quadratic
  equation for covariance parameters with Gaussian working assumption
  using macro NLINMIX ("Method" in Chapter 12 of the book SAS System
  for Mixed Models by Littell et al., 1996). This program uses the
  newest version of the macro
available on the SAS web site.

  options ps=59 ls=80 nodate;

  Include the SAS macro code; the user should of course substitute
  the appropriate path:

  %inc "/afs/unity.ncsu.edu/lockers/dept/stat/info/st762/info/www/davidian/nlinmix/nlmm801.sas";

  Read in the data: seizure data from Thall and Vail (1990).
  The data look like (first 8 records on first 2 subjects):

  104 5 1 0 11 31
  104 3 2 0 11 31
  104 3 3 0 11 31
  104 3 4 0 11 31
  106 3 1 0 11 30
  106 5 2 0 11 30
  106 3 3 0 11 30
  106 3 4 0 11 30

  column 1   subject
  column 2   number of seizures
  column 3   visit (1-4, biweekly visits)
  column 4   0 if placebo, 1 if progabide
  column 5   baseline number of seizures in 8 weeks prior to study
  column 6   age

  data seizure; infile 'seize.dat';
    input subject seize visit trt base age;
  data seizure; set seizure;
    if subject=207 then delete;
    logbase=log(base/4); logage=log(age); basetrt=logbase*trt;
    if visit<4 then visit4=0; if visit=4 then visit4=1;
  run;

  title "Seizure data of Thall and Vail (1990)";
  proc print; run;

  Invoke the NLINMIX macro; use the variable name "predv" to define
  the mean function.  If you do not wish to include the analytical
  derivatives db1 through db6 here, SAS will calculate them
  automatically.  The parms statement gives starting values.  The
  stmts portion basically contains elements of a call to proc mixed
  with the linearized version of the mean model.  We use the repeated
  statement to specify the working correlation structure.  The
  procopt statement allows us to specify "empirical," which will
  produce the robust (sandwich) standard errors.

  title2 "Unstructured working correlation";
  %nlinmix(data=seizure,
     model=%str(
       predv=exp(b1+b2*logbase+b3*logage+b4*trt+b5*visit4+b6*basetrt);
     ),
     derivs=%str(
       db1=predv; db2=logbase*predv; db3=logage*predv;
       db4=trt*predv; db5=visit4*predv; db6=basetrt*predv;
       wt=1/predv;
     ),
     parms=%str(b1=-2.0 b2=1 b3=1 b4=-1 b5=-0.1 b6=0.5),
     stmts=%str(
       class subject;
       model pseudo_seize = db1 db2 db3 db4 db5 db6 / noint notest solution;
       repeated subject=subject type=un rcorr;
       weight wt;
     ),
     expand=zero,
     procopt=%str(empirical method=ml)
  );

  title2 "Exchangeable working correlation";
  %nlinmix(data=seizure,
     model=%str(
       predv=exp(b1+b2*logbase+b3*logage+b4*trt+b5*visit4+b6*basetrt);
     ),
     derivs=%str(
       db1=predv; db2=logbase*predv; db3=logage*predv;
       db4=trt*predv; db5=visit4*predv; db6=basetrt*predv;
       wt=1/predv;
     ),
     parms=%str(b1=-2.0 b2=1 b3=1 b4=-1 b5=-0.1 b6=0.5),
     stmts=%str(
       class subject;
       model pseudo_seize = db1 db2 db3 db4 db5 db6 / noint notest solution;
       repeated subject=subject type=cs rcorr;
       weight wt;
     ),
     expand=zero,
     procopt=%str(empirical method=ml)
  );

  title2 "AR(1) working correlation matrix";
  %nlinmix(data=seizure,
     model=%str(
       predv=exp(b1+b2*logbase+b3*logage+b4*trt+b5*visit4+b6*basetrt);
     ),
     derivs=%str(
       db1=predv; db2=logbase*predv; db3=logage*predv;
       db4=trt*predv; db5=visit4*predv; db6=basetrt*predv;
       wt=1/predv;
     ),
     parms=%str(b1=-2.0 b2=1 b3=1 b4=-1 b5=-0.1 b6=0.5),
     stmts=%str(
       class subject;
       model pseudo_seize = db1 db2 db3 db4 db5 db6 / noint notest solution;
       repeated subject=subject type=ar(1) rcorr;
       weight wt;
     ),
     expand=zero,
     procopt=%str(empirical method=ml)
  );

OUTPUT:

              Seizure data of Thall and Vail (1990)
                Unstructured working correlation

                      The Mixed Procedure

                       Model Information
        Data Set                     WORK.NLINMIX
        Dependent Variable           pseudo_seize
        Weight Variable              wt
        Covariance Structure         Unstructured
        Subject Effect               subject
        Estimation Method            ML
        Residual Variance Method     None
        Fixed Effects SE Method      Empirical
        Degrees of Freedom Method    Between-Within

                    Class Level Information
  Class      Levels    Values
  subject        59    101 102 103 104 106 107 108 ...
                       ... 228 230 232 234 236 238

                           Dimensions
            Covariance Parameters              10
            Columns in X                        6
            Columns in Z                        0
            Subjects                           59
            Max Obs Per Subject                 4

                     Number of Observations
        Number of Observations Read             236
        Number of Observations Used             236

                       Iteration History
   Iteration    Evaluations    -2 Log Like       Criterion
           0              1    1412.07968151
           1              2    1342.07974235      0.00000424
           2              1    1342.07780093      0.00000000

              Seizure data of Thall and Vail (1990)
                Unstructured working correlation

                      The Mixed Procedure
Convergence criteria met.

     Estimated R Correlation Matrix for subject 101 (Weighted by wt)

   Row      Col1      Col2      Col3      Col4
     1    1.0000    0.3250    0.2295    0.2584
     2    0.3250    1.0000    0.5006    0.4962
     3    0.2295    0.5006    1.0000    0.4947
     4    0.2584    0.4962    0.4947    1.0000

               Covariance Parameter Estimates
               Cov Parm     Subject    Estimate
               UN(1,1)      subject      3.1138
               UN(2,1)      subject      1.2302
               UN(2,2)      subject      4.6019
               UN(3,1)      subject      1.1119
               UN(3,2)      subject      2.9477
               UN(3,3)      subject      7.5349
               UN(4,1)      subject      0.6923
               UN(4,2)      subject      1.6164
               UN(4,3)      subject      2.0618
               UN(4,4)      subject      2.3058

                       Fit Statistics
          -2 Log Likelihood                1342.1
          AIC (smaller is better)          1374.1
          AICC (smaller is better)         1376.6
          BIC (smaller is better)          1407.3

              Null Model Likelihood Ratio Test
                DF    Chi-Square    Pr > ChiSq
                 9         70.00        <.0001

                 Solution for Fixed Effects
                        Standard
   Effect   Estimate       Error     DF    t Value    Pr > |t|
   db1       -3.1773      0.9421     59      -3.37
   db2        0.9219     0.08483     59      10.87
   db3        1.0506      0.2775     59       3.79
   db4        -1.675      0.4193     59      -4.00
   db5       -0.1633     0.06303     59      -2.59
   db6        0.6890      0.1693     59       4.07

              Seizure data of Thall and Vail (1990)
                Exchangeable working correlation

                      The Mixed Procedure

                       Model Information
        Data Set                     WORK.NLINMIX
        Dependent Variable           pseudo_seize
        Weight Variable              wt
        Covariance Structure         Compound Symmetry
        Subject Effect               subject
        Estimation Method            ML
        Residual Variance Method     Profile
        Fixed Effects SE Method      Empirical
        Degrees of Freedom Method    Between-Within

                    Class Level Information
  Class      Levels    Values
  subject        59    101 102 103 104 106 107 108 110 111 112 113 114
                       116 117 118 121 122 123 124 126 128 129 130 135
                       137 139 141 143 145 147 201 202 203 204 205 206
                       207 208 209 210 211 213 214 215 217 218 219 220
                       221 222 225 226 227 228 230 232 234 236 238

                           Dimensions
            Covariance Parameters               2
            Columns in X                        6
            Columns in Z                        0
            Subjects                           59
            Max Obs Per Subject                 4

                     Number of Observations
        Number of Observations Read             236
        Number of Observations Used             236
        Number of Observations Not Used           0

                       Iteration History
   Iteration    Evaluations    -2 Log Like       Criterion
           0              1    1412.13918577
           1              2    1376.70189404      0.00000000

              Seizure data of Thall and Vail (1990)
                Exchangeable working correlation

                      The Mixed Procedure

Convergence criteria met.

     Estimated R Correlation Matrix for subject 101 (Weighted by wt)

   Row      Col1      Col2      Col3      Col4
     1    1.0000    0.3582    0.3582    0.3582
     2    0.3582    1.0000    0.3582    0.3582
     3    0.3582    0.3582    1.0000    0.3582
     4    0.3582    0.3582    0.3582    1.0000

               Covariance Parameter Estimates
               Cov Parm     Subject    Estimate
               CS           subject      1.5411
               Residual                  2.7611

                       Fit Statistics
          -2 Log Likelihood                1376.7
          AIC (smaller is better)          1392.7
          AICC (smaller is better)         1393.3
          BIC (smaller is better)          1409.3

              Null Model Likelihood Ratio Test
                DF    Chi-Square    Pr > ChiSq
                 1         35.44        <.0001

                 Solution for Fixed Effects
                        Standard
   Effect   Estimate       Error     DF    t Value    Pr > |t|
   db1       -2.7939      0.9561    171      -2.92      0.0039
   db2        0.9504     0.09873    171       9.63      <.0001
   db3        0.9066      0.2772    171       3.27      0.0013
   db4       -1.3386      0.4296    171      -3.12      0.0021
   db5       -0.1611     0.06558    171      -2.46      0.0150
   db6        0.5633      0.1749    171       3.22      0.0015

              Seizure data of Thall and Vail (1990)
               AR(1) working correlation matrix

                      The Mixed Procedure

                       Model Information
        Data Set                     WORK.NLINMIX
        Dependent Variable           pseudo_seize
        Weight Variable              wt
        Covariance Structure         Autoregressive
        Subject Effect               subject
        Estimation Method            ML
        Residual Variance Method     Profile
        Fixed Effects SE Method      Empirical
        Degrees of Freedom Method    Between-Within

                    Class Level Information
  Class      Levels    Values
  subject        59    101 102 103 104 106 107 108 110 111 112 113 114
                       116 117 118 121 122 123 124 126 128 129 130 135
                       137 139 141 143 145 147 201 202 203 204 205 206
                       207 208 209 210 211 213 214 215 217 218 219 220
                       221 222 225 226 227 228 230 232 234 236 238

                           Dimensions
            Covariance Parameters               2
            Columns in X                        6
            Columns in Z                        0
            Subjects                           59
            Max Obs Per Subject                 4

                     Number of Observations
        Number of Observations Read             236
        Number of Observations Used             236
        Number of Observations Not Used           0

                       Iteration History
   Iteration    Evaluations    -2 Log Like       Criterion
           0              1    1411.84577442
           1              2    1378.41267102      0.00001238
           2              1    1378.40681354      0.00000000

              Seizure data of Thall and Vail (1990)
               AR(1) working correlation matrix

                      The Mixed Procedure

Convergence criteria
met.

     Estimated R Correlation Matrix for subject 101 (Weighted by wt)

   Row      Col1      Col2      Col3      Col4
     1    1.0000    0.3757    0.1411   0.05302
     2    0.3757    1.0000    0.3757    0.1411
     3    0.1411    0.3757    1.0000    0.3757
     4   0.05302    0.1411    0.3757    1.0000

               Covariance Parameter Estimates
               Cov Parm     Subject    Estimate
               AR(1)        subject      0.3757
               Residual                  4.2126

                       Fit Statistics
          -2 Log Likelihood                1378.4
          AIC (smaller is better)          1394.4
          AICC (smaller is better)         1395.0
          BIC (smaller is better)          1411.0

              Null Model Likelihood Ratio Test
                DF    Chi-Square    Pr > ChiSq
                 1         33.44        <.0001

                 Solution for Fixed Effects
                        Standard
   Effect   Estimate       Error     DF    t Value    Pr > |t|
   (db1-db6: entries illegible in the source listing)

CHAPTER 11                                          ST 762, M. DAVIDIAN

11  The role of estimating weights in GLS: second-order theory

11.1  Introduction

The folklore theorem for GLS discussed in Chapter 9 is used routinely as the basis for assessing uncertainty of GLS estimation (e.g., standard errors, confidence intervals, etc.) for estimation of β with true value β0 in the model

E(Y_j | x_j) = f(x_j, β),   var(Y_j | x_j) = σ² g²(β, θ, x_j),   j = 1, ..., n.   (11.1)

Recall that the main implications are as follows:

• If the variance function g is correctly specified, how one estimates β and θ appearing in the weights does not affect the large-sample properties of β̂_GLS, in the sense that

n^{1/2}(β̂_GLS − β0) →d N(0, σ0² Σ_WLS),   (11.2)

where Σ_WLS is the same as the matrix that would arise if the weights were in fact known.

• Regardless of the number of iterations C of the GLS algorithm, (11.2) holds. Thus, the theorem offers no insight into how one should select C in practice; according to the theorem, the choice of C does not matter.

Recall also that it has been observed that using (11.2) as the basis for deriving estimated standard errors may result in unreliable inferences. In particular, it has often been noted that the standard errors obtained this way tend to understate the variability associated with β̂_GLS. The bottom line is that the folklore theorem, despite its widespread use and "folklore" status, may not be that useful of a result in practice.

• The folklore
theorem, and in fact all the large-sample distributional results we have discussed so far, are first-order results. That is, usual asymptotic normality results represent only a certain level of approximation with respect to n. In terms of the covariance matrix of n^{1/2}(β̂_GLS − β0), the result tells us only that

var{n^{1/2}(β̂_GLS − β0)} ≈ σ0² Σ_WLS,   (11.3)

where Σ_WLS is the well-behaved limit of a quantity in the form of an average, and where this moment is understood, as in all our arguments, to be conditional on the x_j.

Written another way, (11.3) only tells us that

var(β̂_GLS − β0) ≈ n^{-1} σ0² Σ_WLS.   (11.4)

• For the model in (11.1) and GLS estimation, the level of first-order approximation in (11.3) is not sufficiently refined so that the effect of estimation of β and θ in the weights and the number of iterations C shows up.

• Moreover, because of this, the approximation (11.4) may not yield a reliable representation of the true uncertainty when n is not too large.

RESULT: A more refined approximation is needed. Loosely speaking, what is needed is something along the lines of

var{n^{1/2}(β̂_GLS − β0)} ≈ σ0² Σ_WLS + "other stuff,"   (11.5)

where, intuitively, "other stuff" must depend on n in such a way that it is small when n is large; in fact, smaller than the leading term σ0² Σ_WLS, so that the leading term dominates. This must be true; otherwise, "other stuff" would have shown up in the first-order results. However, if n is not too large, "other stuff" might be nontrivial. In particular, (11.5) may also be written as

var(β̂_GLS − β0) ≈ n^{-1} σ0² Σ_WLS + n^{-1} ("other stuff"),   (11.6)

so it must be that the second term involving "other stuff" in (11.6) is o(n^{-1}). Hopefully, "other stuff" is in part determined by the effects of estimation of β and θ in the weights and the choice of C, effects that do not show up otherwise, so understanding its form may yield insight into the consequences of estimating weights and how to choose C in practice. Presumably, if we could obtain the form of "other stuff," we could use it to calculate more reliable estimated standard errors for practical use.
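The practical import of "other stuff" can be illustrated by simulation. The following sketch (Python, not from the notes; the linear-mean model, the variance function g(θ, x) = x^θ, and the crude log-absolute-residual regression estimator of θ are all assumptions made purely for illustration) compares, over Monte Carlo replications at small n, the variance of the WLS slope computed with known θ to that of a one-step GLS slope computed with estimated θ:

```python
import numpy as np

# Monte Carlo comparison: WLS with known theta vs one-step GLS with
# estimated theta, in the model y = b0 + b1*x + sigma * x**theta * eps.
rng = np.random.default_rng(42)
n, M = 10, 4000                      # small n is where the effect shows
beta0, beta1, theta0, sigma = 1.0, 2.0, 1.0, 0.5
x = np.linspace(1.0, 5.0, n)
X = np.column_stack([np.ones(n), x])

def wls(X, y, w):
    # weighted least squares with weights w_j
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ y)

w_true = x ** (-2.0 * theta0)        # ideal weights 1/g^2(theta0, x_j)
slope_known, slope_est = [], []
for _ in range(M):
    y = beta0 + beta1 * x + sigma * x ** theta0 * rng.standard_normal(n)
    slope_known.append(wls(X, y, w_true)[1])
    # crude (illustrative) theta estimate: slope of log|OLS residual| on log x
    r = y - X @ wls(X, y, np.ones(n))
    theta_hat = np.polyfit(np.log(x), np.log(np.abs(r)), 1)[0]
    slope_est.append(wls(X, y, x ** (-2.0 * theta_hat))[1])

v_known = np.var(slope_known)        # Monte Carlo variance, known weights
v_est = np.var(slope_est)            # Monte Carlo variance, estimated weights
```

At this small n, the estimated-weights slope is typically more variable than the known-weights slope, which is exactly the effect the first-order results (11.3)-(11.4), under which the two are equivalent, cannot capture; the gap shrinks as n grows.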
SECOND-ORDER RESULTS: What we will find is that "other stuff" for our problem turns out to be O(n^{-1}), so that in (11.6) we have

var(β̂_GLS − β0) ≈ n^{-1} σ0² Σ_WLS + n^{-2} (stuff),

where we will determine the form of "stuff." Such a representation is called a second-order result; here, we have an approximation to the covariance matrix of β̂_GLS that involves not only the leading term, which is of O(n^{-1}), but one of O(n^{-2}) as well (the second-order term).

Generally, arguments to establish such second-order results are very tedious. Consequently, in Section 11.2 we will pursue such an argument in a simple special case of (11.1). Even in this situation, the calculations are rather involved. In Section 11.3 we will simply state the results of a more general argument given by Carroll, Wu, and Ruppert (1988). Throughout, we will assume that the variance function g in (11.1) is not misspecified, as our focus is on understanding the performance of the first-order results, and how it might be improved upon, when the model is correct.

It turns out that such second-order results, although providing some theoretical insight, do not translate into improvements that may be used in practice, as the necessary calculations are much too difficult to be implemented easily. In Section 11.4 we will consider use of the bootstrap for our model (11.1) as an alternative way of effecting the same sort of improvement automatically under certain circumstances.

11.2  Covariance matrix of β̂_GLS when g does not depend on β

The argument in this section is based on that of Rothenberg (1984) and involves some restrictive assumptions, which we will adopt for this section only. As usual, let β0, θ0, and σ0 denote the true values of these parameters. We will suppress conditioning on the x_j in this section for simplicity.

• g does not depend on β, so g only depends on an unknown variance parameter θ to be estimated.

• The variance parameter θ is a scalar.

• f(x_j, β) = x_jᵀβ; that is, the mean model is linear in β, so that the full mean-variance model is

E(Y_j | x_j) = x_jᵀβ,   var(Y_j | x_j) = σ² g²(θ, x_j),   j = 1, ..., n.   (11.7)

• The conditional distribution of Y_j given x_j is normal, so that

Y_j = x_jᵀβ0 + ε_j σ0 g(θ0, x_j),   ε_j = (Y_j − x_jᵀβ0)/{σ0 g(θ0, x_j)} ~ N(0, 1).

The assumptions of linearity and normality actually may be relaxed, but simplify the argument substantially; in particular, these assumptions allowed Rothenberg (1984) to make clever use of sufficiency, as we will demonstrate momentarily.

Let θ̂ be any estimator for θ such that

(a) the distribution of θ̂ does not depend on β;

(b) θ̂ is an even function of ε1, ..., εn (see below).

These assumptions are actually quite reasonable for estimators based on transformations of absolute residuals, as discussed in Chapter 6; (b) may be shown to always hold, and (a) will hold if g does not depend on β. We will gain insight into these properties in Chapter 12; for now, we will assume that such a θ̂ exists. For completeness, we recall the following.

DEFINITION 11.1 (Even and odd functions): Suppose q(ε1, ..., εn) is a real-valued function of ε1, ..., εn. Then

• q(ε1, ..., εn) is an odd function of ε1, ..., εn if q(ε1, ..., εn) = −q(−ε1, ..., −εn);

• q(ε1, ..., εn) is an even function of ε1, ..., εn if q(ε1, ..., εn) = q(−ε1, ..., −εn).

Products of even and odd functions satisfy the following properties: (even × even) = even, (odd × odd) = even, and (even × odd) = odd.

Thus, note that the definition implies that, if θ̂ depends on ε1, ..., εn in large samples through a term like Σ_{j=1}^n (ε_j² − 1) a_j, say, for some constants a_j, then θ̂ satisfies the assumption that θ̂ is even in large samples. Recall that the PL estimator solves a quadratic estimating equation from Chapter 10; such an estimator would have this property under certain conditions. It turns out that the distinction between odd and even functions of ε1, ..., εn is quite useful in showing the main result, which we now state.

ROTHENBERG'S RESULT: Under the above conditions,

var{n^{1/2}(β̂_GLS − β0)} = σ0² Σ_WLS + n^{-1} V + o(n^{-1}),   (11.8)

where, for θ̂ satisfying n^{1/2}(θ̂ − θ0) →d N(0, τ²), V is an increasing function of τ², where of course τ² > 0.

Before we show the argument, we note some implications of the result.

• The number of iterations C appears
not to matter here; it turns out to matter when g does depend on β.

• However, the result shows that the precision of estimation of β by β̂_GLS is dictated by the precision of estimation of θ by θ̂. In particular, the more precise θ̂ is, the more precise β̂_GLS is, to second order. Thus, the result shows that the properties of θ̂ do play a role in the properties of β̂_GLS.

• This role only shows up in the second-order term n^{-1} V; for large n, this term is dominated by the leading term, and its effect is negligible. For small n, however, the effect may be more pronounced.

• The result suggests that, if we can write down the explicit form of V, which is presumably a function of the parameters and the design, we could obtain more reliable standard errors for β̂_GLS as the square roots of the diagonal elements of

n^{-1} σ̂² Σ̂_WLS + n^{-2} V̂,

where σ̂² Σ̂_WLS is the usual expression from the folklore theorem evaluated at β̂_GLS and θ̂, and V̂ is V approximated at the n data points and evaluated at the estimates. If n is not too large, the magnitude of n^{-2} V̂ may be sufficiently large relative to that of the first term to result in a noticeable difference in estimated standard errors. Presumably, adding this term would correct the folklore standard errors for the effect of estimation of θ. Unfortunately, the form of V may be very difficult to derive, so this may not be a practical alternative.

ARGUMENT: We first state a classical result of which Rothenberg's proof makes clever use.

BASU'S LEMMA: Let S be a sufficient statistic for η in a family of distributions P_η indexed by η. Let T be a statistic whose distribution does not depend on η. Then S and T are independent.

The assumptions of linearity of the mean model and normality allow Basu's lemma to be exploited, simplifying things considerably, as we now demonstrate. Define Y as in previous chapters, and let X = (x_1, ..., x_n)ᵀ be the design matrix. Write W(θ) = diag{w_1(θ), ..., w_n(θ)} and

Y = Xβ + σ W^{-1/2}(θ) ε;   (11.9)

we suppress evaluation at true values here to emphasize that the following argument applies to the model indexed by (β, σ, θ).

Let β̂_WLS be the WLS estimator where θ is known; that is, if θ is known, the weights w_j(θ) are themselves known constants. Then, using (11.9),

β̂_WLS = β + σ {XᵀW(θ)X}^{-1} XᵀW^{1/2}(θ) ε.   (11.10)

The GLS estimator uses the weights ŵ_j = w_j(θ̂); thus,

β̂_GLS = β + σ {XᵀW(θ̂)X}^{-1} XᵀW(θ̂) W^{-1/2}(θ) ε.

Under these conditions, it is possible to show that β̂_WLS is independent of (β̂_GLS − β̂_WLS), using Basu's lemma, as follows; this fact will come in handy momentarily. It is well known from standard linear models theory (so we do not show it here) that β̂_WLS is a complete sufficient statistic for β. Thus, if we can show that the distribution of (β̂_GLS − β̂_WLS) does not depend on β, we may conclude the result by Basu's lemma. From the form of β̂_WLS and β̂_GLS above, because the distribution of ε does not depend on β, neither does the distribution of

β̂_GLS − β̂_WLS = σ [{XᵀW(θ̂)X}^{-1} XᵀW(θ̂) − {XᵀW(θ)X}^{-1} XᵀW(θ)] W^{-1/2}(θ) ε,   (11.11)

which does not involve β.

Now, as ε ~ N(0, I), we have from (11.10) that, at the true values,

var{n^{1/2}(β̂_WLS − β0)} = σ0² {n^{-1} XᵀW(θ0)X}^{-1} = σ0² Σ_WLS

exactly, for all n. Using this and the independence of β̂_WLS and (β̂_GLS − β̂_WLS), we have

var{n^{1/2}(β̂_GLS − β0)} = var[n^{1/2}{(β̂_WLS − β0) + (β̂_GLS − β̂_WLS)}]
  = var{n^{1/2}(β̂_WLS − β0)} + var{n^{1/2}(β̂_GLS − β̂_WLS)}
  = σ0² {n^{-1} XᵀW(θ0)X}^{-1} + var{n^{1/2}(β̂_GLS − β̂_WLS)}
  = σ0² Σ_WLS + var{n^{1/2}(β̂_GLS − β̂_WLS)}.   (11.12)

Thus, from (11.12), we see that, to compute the covariance matrix of interest, we need only consider the second term, the covariance matrix var{n^{1/2}(β̂_GLS − β̂_WLS)}. To elucidate the form of this covariance matrix, we write (11.11) at the true values, using summation rather than matrix notation, as

n^{1/2}(β̂_GLS − β̂_WLS)/σ0 = (n^{-1} Σ_{j=1}^n ŵ_j x_j x_jᵀ)^{-1} n^{-1/2} Σ_{j=1}^n ŵ_j g_{0j} x_j ε_j
  − (n^{-1} Σ_{j=1}^n w_{0j} x_j x_jᵀ)^{-1} n^{-1/2} Σ_{j=1}^n w_{0j} g_{0j} x_j ε_j.   (11.13)

Here, g_{0j} = g(θ0, x_j) and w_{0j} = w_j(θ0). The assumption n^{1/2}(θ̂ − θ0) →d N(0, τ²) implies that n^{1/2}(θ̂ − θ0) = O_p(1), so that (θ̂ − θ0) = O_p(n^{-1/2}). This rate will be important in keeping track of nonnegligible terms.

Now, as E(ε_j) = 0, assume that terms of the form n^{-1/2} Σ_{j=1}^n g_{0j} ε_j a_j converge in distribution to a normal random vector by the central limit theorem. This implies that n^{-1/2} Σ_{j=1}^n g_{0j} ε_j a_j = O_p(1); i.e., such terms are bounded in probability.

Let w_{θ0j} = (∂/∂θ) w_j(θ) |_{θ=θ0}, and similarly for higher-order derivatives; and, more generally, use such subscripting to denote partial derivatives. Then, considering the second term in (11.13), expanding ŵ_j = w_j(θ̂) about θ0, we have, for some θ1 between θ̂ and θ0,

n^{-1/2} Σ_{j=1}^n ŵ_j g_{0j} x_j ε_j
  = n^{-1/2} Σ_{j=1}^n g_{0j} x_j ε_j {w_{0j} + w_{θ0j}(θ̂ − θ0) + 2^{-1} w_{θθ0j}(θ̂ − θ0)² + 6^{-1} w_{θθθj}(θ1)(θ̂ − θ0)³}
  = n^{-1/2} Σ_{j=1}^n w_{0j} g_{0j} x_j ε_j + n^{-1/2} Σ_{j=1}^n g_{0j} w_{θ0j} x_j ε_j (θ̂ − θ0) + 2^{-1} n^{-1/2} Σ_{j=1}^n g_{0j} w_{θθ0j} x_j ε_j (θ̂ − θ0)²
  + 6^{-1} n^{-1/2} Σ_{j=1}^n g_{0j} w_{θθθj}(θ1) x_j ε_j (θ̂ − θ0)³.   (11.14)

By the above, the final term in (11.14) may be seen to be the product of two terms of O_p(1) and O_p(n^{-3/2}), respectively; thus, the entire term is O_p(n^{-3/2}), and we have

n^{-1/2} Σ_{j=1}^n ŵ_j g_{0j} x_j ε_j = A_{n1} + A_{n2}(θ̂ − θ0) + A_{n3}(θ̂ − θ0)² + O_p(n^{-3/2}),   (11.15)

where

A_{n1} = n^{-1/2} Σ_{j=1}^n w_{0j} g_{0j} x_j ε_j,   A_{n2} = n^{-1/2} Σ_{j=1}^n w_{θ0j} g_{0j} x_j ε_j,   A_{n3} = 2^{-1} n^{-1/2} Σ_{j=1}^n w_{θθ0j} g_{0j} x_j ε_j

are all odd functions of ε1, ..., εn. Thus, (11.15) provides an approximation to n^{-1/2} Σ_{j=1}^n ŵ_j g_{0j} x_j ε_j that is accurate up to terms of order n^{-3/2}. By an entirely similar argument, it may be shown that

n^{-1} Σ_{j=1}^n ŵ_j x_j x_jᵀ = B_{n1} + B_{n2}(θ̂ − θ0) + B_{n3}(θ̂ − θ0)² + O_p(n^{-3/2}),   (11.16)

where

B_{n1} = n^{-1} Σ_{j=1}^n w_{0j} x_j x_jᵀ,   B_{n2} = n^{-1} Σ_{j=1}^n w_{θ0j} x_j x_jᵀ,   B_{n3} = 2^{-1} n^{-1} Σ_{j=1}^n w_{θθ0j} x_j x_jᵀ.
880 Lug190 and similarly for higher order derivatives and more generally use such sub scripting to denote partial derivatives Then considering the second term in 1113 expanding about 190 we have for some 191 between and 190 n A 7712 2 wj 07907113767 j1 n A A A 77712 ZQOjjEjw0739 1090109 i 00 1271099073909 0072 167w999700 0073 j1 71 71 A 71 A 77712 29071007113767 77712 Egoa39weo jejw 00 12777 12 Emmoij 0072 j1 j1 j1 16n 12 Egojw999emje 7 00 1114 73971 By the above the nal term in 1114 may be seen to be the product of two terms of Op1 and Opn 32 respectively thus the entire term is Opn 32 and we have TL 7142 Z wj0go39j j A17 An20 0o A7309 0072 0pn 327 1115 j1 where Anl 77712 Z 100 739Ej Anz 77712 Zweojqu Any 77712 Z weeojillej 771 71 71 are all odd functions of 61 en Thus 1115 provides an approximation to n lZ 21 wj gojmj6j that is accurate up to terms of order 71 32 By an entirely similar argument it may be shown that n A A A 7171 21mm 7 31 anw 7 00 Bngw 7 00 7 0n 32 1116 j1 where n n n 712 T 712 T 712 T Bln 77 wojilljillj Bgn 77 weojilljillj Bgn 77 weeojilljillj 7 1 71 71 PAGE 278 CHAPTER 11 ST 762 M DAVDDIAN We would like to substitute 1115 and 1116 into the rst term of 1113 to obtain an approximation to n12EGLS7EWLSao We would thus like an expression for the inverse assumed to exist of 1116 It is possible to nd CM and Oz not depending on the 6739 to satisfy BM7an 7ooBn3 70020pn321 BfCm 70002n 70020pn32 1117 To see this multiply BM 7 Bn2 7 00 Bn3 7 do 7 077r32 by the right hand side of 1117 keeping track of terms up to Opn 32 to obtain 1 6 00gtltBln0m 32431 7 002Bia0a Bach BaaBi 017W Here7 we have absorbed all terms depending on 7 003 and higher into Opn 32 lf 1117 is to hold7 then the coef cients of 7 00 and 7 002 must be equal to 07 so that 1117 is the inverse up to Opn 32 Thus7 CM 7BljiB2nB1 n1 and Bylan 7B2nCM 7 BgnBl nlg upon substitution and rearrangement 0 BEHBZTLBEBM 330312 It is not relevant what these terms are what is important is that a representation 
of the inverse in 1117 is possible and CM and Oz are well behaved for large 71 Substituting into 11137 we obtain n12EGLS EWLSV TO BfilAln H313 01749 t90 02 i 002 Opn 32A1n A2749 t90 A3n 002 Opn 32 1118 Here7 note that only the terms AM A2 7 and Agm depend on the 6739 Multiplying out the right hand side of 1118 using the facts that 7 00 0M4 and 0pnk20pn 2 0pnltklgt2 we may write 1118 as n12EGLS EWLSVUO D1n t90 D2n t902 01271732 ll19 where Dln 31231427 CinAin7 DZn BEAM ClnAZn CZnAln The important issue is not what Dln and Dzn are7 but the fact that they are both linear functions of the Am k 17 237 which are all odd functions of the 6739 Thus7 as BM CM and Oz do not depend on the 6739 and are obviously well behaved depending on terms of the form 71 1 21 aj and so of 01 and as the Aim satisfy n lZ 2291 lej Op17 it is also reasonable to suppose that Dln 0121 Dzn 0121 PAGE 279 CHAPTER 11 ST 762 M DAVDDIAN Multiplying 1119 by 7112 we have 71EGLS EWLSVUO Dlnnl2 t90 nilZDanla t902 t 0120171 as n 12D2nn12 7 00 7 nil20p1op1 7 071712 we obtain nltI GLS EWLS039O Dlnn12 7 0o 0170171 1120 From 1120 then if we can show that Dlnn12 7 00 A Z say for some random vector Z then we may conclude that nfiGLS 7 AimLS i Z as well Now in general the fact that a sequence of random vectors Zn converges in distribution to some Z does not necessarily imply that the corresponding moments converge However under suitable conditions on the moments of Zn and Z this does hold Assuming these conditions hold here we would thus expect that VarD1nnl2 t90 VarZ WWW530w WLS39 We will now see the signi cance of BM being an odd function Recall that 6739 N N0 1 and that n12 7 00 is even by assumption For any symmetric density eg the normal the density function itself is even moreover the product of even functions is itself even Thus the joint density of 61 en is an even function for all 71 Using this it is clear that DlnnlZ 7 00 times the normal density is an odd function oddxevenx even Because the integral of 
an odd function is zero we may thus conclude that ED1nn12 7 00 0 Moreover we know that ED1n 0 Thus the covariance between Dln and 7112 7 00 ED1nn12 7 00 7 ED1nEn12 7 00 7 0 so that D1 and 7112 7 00 are uncorrelated for any 71 Further note that it is reasonable by virtue of the form of D1 to expect that L D1 7 N0 ED for some ED and by assumption 71120 7 00 L N07392 Because D1 and 71120 7 00 are approximately normal for large n and uncorrelated for all n it is reasonable to expect them to be independent for large 71 Using this and the fact that ED1nn12 7 00 0 the covariance matrix of their product is thus varD1nn12 7 00 7 EDT PAGE 280 CHAPTER 11 ST 762 M DAVDDIAN Thus we expect VarD1nn12 00 JLHgOVaF EGLS WLS ED727 which we may write as WNW530w WLS EDTZ t 017 where 01 represents a deterministic term that approaches zero for large n We may rewrite this equivalently as varn12EaGLS 7 Ems 714237 owl 1121 Returning to 1112 and using 1121 Varfnla aw 50 032W Varn1ZEGLS EWLS7 we obtain the nal result that Varn12BGLS 50 032WLS n lV 0nillv where V 2137 2 Obviously V is an increasing function of 7392 as claimed To make the above argument rigorous would entail considerably more work The heuristic nonrigorous version we have presented here highlights the steps that are needed Evidently even a nonrigorous argument is rather involved 113 Covariance matrix of 30 When 9 depends on Rothenberg s result given in Section 112 is rather limited in its applicability as it required normality linearity of the mean model and that 9 not depend on 6 Carroll Wu and Ruppert 1988 present very detailed arguments in the more general case where o 9 may depend on 6 so that varY7l1j 0292 01117 0 the distribution of need not be normal Their presentation also requires that mgr but as their argument does not make use of suf ciency as did Rothenberg s relaxation to nonlinear mean is straightforward Here we simply state their main result a detailed statement of the result and proofs may be found in Carroll Wu 
and Ruppert 1988 PAGE 281 CHAPTER 11 ST 762 M DAVDDIAN EFFECT OF ESTIMATING WEIGHTS IN GENERAL PROBLEMS Let 30 be the GLS estimator at iteration C of the threestep GLS algorithm7 where 30 is the initial estimator for 6 used in step i Let 0 be a reasonable estimator for 0 obtained in step ii using the previous iterate 3071 where reasonable estimators include those based on transformations of absolute residuals Then marWC 7 o 032m nilvlt0gt owl where a In general7 V3 V4 Voo b V2 V3 m Voo if either i 9 does not depend on 6 and 67quotle have a symmetric distribution ii 30 EOLS c If both and ii hold7 then V1 V2 Voo V for all C Note that this subsumes Rothenberg s result d When V1 7 V37 there is no general ordering that is7 either V1 gt V3 or V1 lt V3 is possible e lfg does not depend on 6 and var6jl1j 2 a for all j eg the 6739 are iid7 and we estimate 0 by PL based on 3071 then VC 2 7 ampV for all C for some matrix V REMARKS Some implications for practice are as follows 0 From a7 there is no optimal number of iterations of the GLS algorithm in a second order sense from d7 it could well be that iterating past C 1 could be detrimental This suggests that the usual practice of taking C 00 could be suboptimal in some situations After C 37 whatever the ordering7 additional iteration has no effect to second order Of course7 it is important to keep in mind that this is a largesample approximate result7 albeit more re ned7 so it does not imply that estimates will be identical for C 2 3 c From b7 no additional iteration after C 2 is required under or ii If both and ii hold7 then from c7 no iteration is required at all in principle PAGE 282 CHAPTER 11 ST 762 M DAVDDIAN o e seems counterintuitive If V is a positive de nite matrix then the larger amp is the greater the reduction in largesample variancel That is if the underlying distribution of the data has heavy tails so that extreme observations are more likely than they would be under normality there is a greater reduction in 
variance than if the data were normally distributed. Carroll, Wu, and Ruppert conjecture that perhaps GLS with PL (which is in fact normal-theory ML in (e), as g does not depend on β) has a built-in "outlier protection" mechanism.

The form of V^(C) is very complicated in general. Thus, although the above results lend insight, they are of little use for practical applications, as we have pointed out earlier. Under certain conditions, Carroll, Wu, and Ruppert suggest use of the bootstrap to achieve a second-order correction to standard errors, which we will discuss in Section 11.4.

Despite the simplicity of the "folklore" result, which says that estimating weights does not matter, in practical applications evidence suggests that it does. Second-order theory like that here and in Section 11.2 represents a way to gain insight into this phenomenon. Unfortunately, second-order theory is both involved and of little practical use. A main point that emerges is that how one estimates θ does matter. This has motivated research into determining the best way to estimate θ under different circumstances, which is the topic of Chapter 12.

11.4 Use of the bootstrap to obtain corrected standard errors

The bootstrap is a resampling scheme that is employed when one wants to obtain estimates of the sampling distribution of an estimator, and hence its moments, e.g., standard errors. First, we review the procedure for a single sample, and then we discuss its use in regression. We consider linear mean models, as in Carroll, Wu, and Ruppert (1988, Section 6); however, the technique we discuss applies approximately in the nonlinear case as well, as we will note. A brief discussion of the use of the bootstrap for this purpose is given in Carroll and Ruppert (1988, pp. 27-28). The book by Efron and Tibshirani (1993) provides a comprehensive treatment of the bootstrap.

BOOTSTRAPPING A SINGLE SAMPLE: Suppose that Y₁, …, Y_n are iid from some distribution with cdf F, which depends on a parameter β we would like to estimate. Let F_n be the empirical distribution
function for the sample ie PAGE 283 CHAPTER 11 ST 762 M DAVDDIAN In this section we will use notation that is standard in the context of the bootstrap We think of the parameter as a function of the distribution function F ie 6 At the true distribution F this function is equal to the true value of the parameter 60 say The idea of the bootstrap is to substitute an appropriate estimator for the true F into the function 6F as follows thus the bootstrap may be used even if the true F may be unknown Suppose we estimate 6F by 3 where F is the true cdf We would like to assess the properties of 3 eg obtain an approximation to its sampling distribution One way is to use largesample approximation eg obtain the rst order asymptotic distribution for From this we can approximate the true mean and covariance matrix of 3 which we write as and varfilF the exact moments of the sampling distribution of 3 under F by the corresponding attributes of the asymptotic distribution of 3 This is done routinely in practice Alternatively a different approach is to instead use and varfian to approximate the true exact sampling moments and varfi lF where by this notation we mean the mean and covariance matrix of 3 calculated by replacing the true distribution F by the empirical distribution Fn when taking expectation and covariance Sometimes it is possible to calculate these quantities directly For example if 6F is the mean of F and i is the sample mean then the variance of the sample mean with respect to Fn is just the sample variance n l 22910 7 l72 divided by n the n 1 can be replaced by n71 1 as is conventional In more exotic problems it may not be possible to calculate and varfian explicitly The bootstrap is a procedure to approximate these quantities The basic procedure is as follows For I 1 B for B large a Obtain a bootstrap sample77 by sampling with replacement from F Thus from the original sample 11 Yn choose a sample of size n Y1b Y by sampling from the original sample with probability 
171 for each entry A b A b Obtain 6 based on the bootstrap sample by the same procedure used to obtain the estimator 6 from the original sample A 17 Repeat a and b B times resulting in B estimates 6 b 1 B Note that for each I a sample is drawn from F and this is repeated many B times Thus it is obvious that the rationale is to replace moments with respect to the distribution Fn by estimates based on sampling from F PAGE 284 CHAPTER 11 ST 762 M DAVDDIAN To approximate E an then one could use the sample mean A B Ab moot B l Z 121 To approximate var8an the obvious choice is the sample covariance matrix B A b A A b A B 1 12 7 mom 7 lm 1122 ba Variations are possible one may divide by B instead of B 7 1 or replace 31300 by The idea is that if B is large the b represent a large sample from F so that the sample mean and covariance matrix 1122 of the sample should be reasonable estimates of and varfian by the law of large numbers In practice it is conventional to obtain the estimator i from the original sample and then use the sample covariance matrix 1122 of the bootstrap replicates to estimate the true sampling covariance matrix of 3 from which bootstrap estimated standard errors may be deduced BOOTSTRAPPING LINEAR CONSTANT VARIANCE REGRESSION MODELS It is instructive to rst discuss how one might apply the bootstrap idea to obtain estimated standard errors for the classical form of the regression model Assume that we have the model Boam mfg magm a 1123 Before we discuss the bootstrap in this context it is important to recall what we mean by a model such as 1123 The model is for the rst two moments of the conditional distribution of given 211739 Thus the parameters in the model 1123 pertain to the conditional distribution As we have noted previously in some circumstances it makes sense to think of the pairs 11739 as having been drawn from a joint distribution so that the 11 are simply observed along with From this perspective the 11 vary as do the In other situations eg 
designed experiments the 11 may be xed by an investigator at certain values and the observed This suggests that there are two ways one might consider using the bootstrap idea The rst adopts the perspective that the pairs Yj1j are iid draws from some joint distribution To implement the bootstrap then one would construct B bootstrap data sets by drawing with replacement from Y111 Yn B times and following the procedure above In this approach the joint variation in the population of both and 11 values is taken into account PAGE 285 CHAPTER 11 ST 762 M DAVDDIAN The second idea seems especially relevant when the 111739 are xed but also seems appealing when one recognizes that the model is a conditional one This idea requires as is the case in the classical regression model that 6739 7 are iid Under this condition it would be natural to regard the distribution of interest F as that of the 6739 and as 039FEj one could resample from the 6739 j 1 n which are iid and then form bootstrap data ij at each xed 111739 that would have the implied distribution of Of course the 6739 are not observed but one can substitute residuals as demonstrated below Which method is better The rst method seems to protect against violation of the assumption that the errors 6739 are iid while the second method is predicated on this If the iid assumption is not correct the rst method would be safer On the other hand if we have con dence in the iid assumption the second approach seems to get more directly to the heart of things A discussion of this choice is given in Efron and Tibshirani 1993 Chapter 9 Here we will adopt the latter approach as is conventional under the classical perspective and consider F to be the distribution of the 6739 The true values of 6 and 0 correspond to those under F which we write as 6F and 0F here so that EOG39WF wfmF varlej7F 0207 Let i BOLS and 72 n 7 p 1 22910 7 mgfi and de ne the standardized residuals as T1 nn7p12197mf j Let Fn be the empirical distribution function 
of 7y so that n rn are iid Fn Then as long as the model has an intercept term 7L 7L Erlen n71 77 O varrlen n71 62 j1 73971 Thus the moments of Ti under Fn emulate those of 039FEj under F To implement the bootstrap the following steps would be repeated for b 1 B a Draw a sample of size n with replacement from Fn to obtain 7117 rf b Form the bootstrap responses ij T2 j 1 n n Ab c Obtain 6 by OLS on the bootstrap data Ya17111 j 1 PAGE 286 CHAPTER 11 ST 762 M DAVDDIAN i A A b i i i As in the onesample case one could then form boot B 1 251 6 and estimate the covariance matrix of i by the bootstrap sample covariance matrix B Ab A Ab A Bilflzw e bmx who ba It turns out that this leads us to exactly what we would expect We are interested in obtaining WWW 7 olF moms where EOLS n lXTX for the true F Thinking of this as a function of F if we let 5 039FEj then the q are emulating the 5739 for which 02F fedF5j Moreover 02Fn n l 2291 r 62 Thus WWW 7 oan 622m as EOLS is a constant Thus using the empirical distribution leads one exactly to the usual estimator for the sampling covariance matrix of EOLS under these conditions Now we also have that Eijl1j F varY7bl1j F 62 so that a bootstrap sample follows a linear model with conditional mean and variance 62 with iid errors having distribution Fn Thus from standard linear model theory it follows that Ab A Ab E lmjFn 6 var6 lmjFn 620539 Thus the bootstrap mimics the underlying model so we expect that the bootstrap covariance estimator and the usual estimator will be fairly similar In this case the bootstrap leads one to do approximately what one would do anyway so it is hard to see the advantage However when the variance is noncon stant the advantage of the bootstrap is much more evident BO OTS TBA PPING REGRESSION MODELS WITH NONC ONSTANT VA RIANC E Now assume that we have the model fmj7 7 varOjlmj 029 707mj7 and assume that 6739 7 f1j FaFg F 0F that are iid with distribution F so that true values are 6F 0F and 0F and EltEjgt O var6j 
1 Write 57 039FEj as before Thus we have EMliijF fj7 F7 vaerliIIij 02F92 F79F7 113739 PAGE 287 CHAPTER 11 ST 762 M DAVDDIAN In the case of a linear model f1j mgr one proceeds similarly to the constant variance case nding a bootstrap model77 the mimics the true one Let 30er be the update of the GLS algorithm A C using 0 to form weights and de ne Ac1 Tlilt 71 gt1 1972135 7 7 7 AC1 WC 39 n p 93 79 7 13739 Thus the usual estimator 62 n l 217 By analogy to the constant variance case we could resample from T1 rn however this does not directly mimic the underlying model In particular note that 71 77 7171 Z77 7 0 j1 as we would hope so we will not have Erlen 0 If we instead consider 87 lt17 izWlZm e a then if we take Fn to be the empirical distribution function of the 37 we have n n 62 n E F 4 0 F 4 2 1 172 AZ 37 n n 7223 var87l n n 23 6272 71 Zr 7 a mimicking the situation under F In the more general case where f is nonlinear it is standard to simply allow these relationships to A c A c A 0 hold approximately in particular take 57 7 8 1g 1 0 1 1988 p 27 7 see Carroll and Ruppert With the appropriate de nition of the 57 for a linear or nonlinear model the bootstrap procedure is as follows For I 1 B 0 Obtain the bth bootstrap sample 311 3 by sampling with replacement from Fn ie from 31 0 Form the bootstrap data set Y1b Yf where W gltialt01 lt0gtwjgts Y m a 0 Obtain b via the GLS algorithm using the C 1 iterate We now have a sample of B GLS estimates PAGE 288 CHAPTER 11 ST 762 M DAVDDIAN What we would really like of course is to obtain the true sampling covariance matrix varn12fi01 7 60 lF where again 60 6F and we use this notation to emphasize this is under the true distribution F From Section 113 we know that this satis es 12 AC1i 7 2 71 732 varm 3 ow 7 a ltFgtEWLSltFgt n We F am gt 1124 where we use the notation EWL5F and VC F to emphasize that these quantities are evaluated at F Using this notation the rst order asymptotic theory says that viiWale e e A 
N002ltFgtEWLSltFgt The usual estimator for 02FEWLSF is based on replacing the parameters appearing in this quantity by estimators Now at least in the linear case 02Fn n l 21 3 6 in the nonlinear case this is approximately true Moreover EWLSF z n 1 21 f mj FfgmJ Fg 2 F0Fmj so that the usual estimator for 02FEWLSF may be seen to be 02FnEWL5Fn According to the bootstrap approach we would like to use varn12ltialt0 gt7 oan 7 02ltFngt2mltm 7 n 1VCFn 07r32 to approximate 1124 As we now show the sample covariance matrix of the 317 A B b 2B B 171 Z 1 mgt Ab A Ab A T A 71 B 8 e bmx who acacia 2 171 H is an estimator for this quantity From the way the bootstrap samples are constructed we have at least approximately in the nonlinear case A 01 A A c 1 A c E32blmj7Fnfj7 gt varnbleFngtanglt 10ij so that the bootstrap data mimic the original model at least approximately with the parameter values replaced by estimators and with error distribution Fn Consequently we know that Ab A 01 A 7 varn126 7 al ME UEWLSW n 1V0F 0971 32 by analogy to 1124 PAGE 289 CHAPTER 11 ST 762 M DAVDDIAN Thus under F by the law of large numbers 33 is an approximation to this expression that is EB z 62WL5F n 1VC F 0n 32 1125 Now if we assume further that VC F VC F Opn 12 which seems reasonable as VC F is just VC F with estimators substituted then 1125 becomes 23 z 62WLSFn 1VCF0pn32 usual covariance estimator second order correction Opn 32 That is unlike in the linear constant variance case the bootstrap covariance estimator is different from the usual rst order covariance estimator in that it contains the necessary correction term of order n l In short the bootstrap automatically corrects for the effect of estimating the weights Carroll Wu and Ruppert 1988 provide simulation evidence supporting the claim that when estima tion of the weights has a nonnegligible effect using the bootstrap estimator for the sampling covariance matrix of ECLS represents and improvement over the usual rst order estimator in the 
sense that the bootstrap estimator is closer to the true sampling variation, thus offering more reliable performance.

Of course, a potential drawback of the bootstrap is computational burden. To derive estimated standard errors using the bootstrap approach, the data analyst would need to carry out the GLS algorithm with C iterations B times. It has been observed that, in the mean-variance models of interest here, B = 250 to 500 is required to obtain reliable estimates of the true sampling variation; thus, this could become computationally intensive. In the current age of computing power, however, the burden is not as severe as it was when Carroll, Wu, and Ruppert first investigated the approach, so the bootstrap approach is gaining ground as a routine method of choice.

CHAPTER 3 ST 762 M DAVIDIAN

3 Implementation of generalized least squares

We have indicated that, for the general model

    E(Y_j | x_j) = f(x_j, β),   var(Y_j | x_j) = σ² g²(β, θ, x_j),   (3.1)

a popular method for estimating β in the mean specification is generalized least squares (GLS). We motivated the approach from the standpoint of solving an estimating equation of the form

    Σ_{j=1}^n w_j {Y_j − f(x_j, β)} f_β(x_j, β) = 0,

where the weights w_j are replaced by estimates. The weighting takes into account the differing precision of each response Y_j, giving this approach an "omnibus" appeal. In fact, as we will see, the GLS approach corresponds to maximum likelihood estimation when the Y_j have distributions in a certain class.

Before we tackle these issues, it is worthwhile to discuss how this very popular approach may be implemented in practice. This will serve both to reinforce its generality and to introduce us to the computational strategy used to solve very general sets of estimating equations that may not be solved in closed form.

We will assume for now that θ is known in the sense discussed in Chapter 2, so that the focus will be on estimation of β and σ² only. However, we will continue to highlight dependence of the variance function g on θ, as later we will consider adding estimation
of θ to the model-fitting task.

3.1 GLS algorithm

The conceptual scheme we will call the GLS algorithm, in the case that θ is known, may be written more precisely as follows:

(i) Estimate β by β̂^(0), where β̂^(0) is some initial estimate; for example, OLS, solving Σ_{j=1}^n {Y_j − f(x_j, β)} f_β(x_j, β) = 0. Set k = 0.

(ii) Form weights ŵ_j = g^{-2}(β̂^(k), θ, x_j).

(iii) Re-estimate β by solving

    Σ_{j=1}^n ŵ_j {Y_j − f(x_j, β)} f_β(x_j, β) = 0

to obtain β̂^(k+1). Set k = k + 1 and return to (ii).

Continue through C iterations, and adopt the Cth iterate β̂^(C) as the estimator. Intuitively, we might expect (hope) that, if C were large, successive iterates β̂^(k) would be more and more similar. If we could iterate forever, we would hope that successive iterates would coincide, so that the algorithm could be said to have converged. We will denote this as the case "C = ∞."

If C = ∞, then the β value appearing in the weights and that in the rest of the equation must coincide. Thus, the case C = ∞ corresponds to the case where we are solving

    Σ_{j=1}^n g^{-2}(β, θ, x_j) {Y_j − f(x_j, β)} f_β(x_j, β) = 0   (3.2)

in β. As we will see, solving (3.2) may in fact be implemented using an approach different from, and more direct than, the GLS algorithm given above. However, we will continue for now to think of the general approach conceptually in terms of the GLS algorithm in steps (i)-(iii), as this will prove convenient when we generalize to the case where θ is also taken to be unknown and estimated.

3.2 Implementing steps (i) and (iii)

Assuming β̂^(0) is β̂_OLS in step (i), both steps (i) and (iii) in the GLS algorithm require solution of a p × 1 set of estimating equations of the form

    Σ_{j=1}^n w_j {Y_j − f(x_j, β)} f_β(x_j, β) = 0,   (3.3)

where the w_j are a set of fixed, known constants. In the case of OLS, w_j ≡ 1 for all j; of course, in step (iii), the w_j are the current estimated values from step (ii), which are held fixed in (iii). In general, solving (3.3) in the case of a set of fixed, known weights w_j, j = 1, …, n, corresponds to the method of WLS. Thus, implementation of the GLS algorithm requires the ability to solve estimating equations of the WLS
form. We thus focus first on how this may be carried out.

Note that if f(x_j, β) were a linear function of β, i.e., f(x_j, β) = x_j^T β, then f_β(x_j, β) = x_j, and it is easy to see that (3.3) may be solved in closed form for β. In particular, under these conditions, it is easy to verify that the solution is

    β̂_WLS = (Σ_{j=1}^n w_j x_j x_j^T)^{-1} Σ_{j=1}^n w_j x_j Y_j.

When linearity of f does not hold, and f is a general nonlinear function of β, then it is clear that a closed-form solution is no longer possible in general. In some special cases, the forms of f and f_β may fortuitously admit an analytical solution, but this is very unusual. Accordingly, (3.3) must be solved numerically.

The basic method for numerical solution of the equation may be derived in different ways. Here is one way, a variant of an idea called the Gauss-Newton method in the nonlinear regression literature. We will discuss another way to motivate this method shortly.

By a Taylor series expansion, we may approximate f(x_j, β) and f_β(x_j, β) by linear functions of β. Taking the expansions about some value β* "close to" β, we have

    f(x_j, β) ≈ f(x_j, β*) + f_β^T(x_j, β*)(β − β*),   (3.4)

    f_β(x_j, β) ≈ f_β(x_j, β*) + f_ββ(x_j, β*)(β − β*)   (p × 1).   (3.5)

See Section 2.4 for an overview of this notation; here, f_ββ(x_j, β) is a p × p matrix. The underlying assumption behind the linear approximation is that, for β "close to" β*, the subsequent terms (quadratic and higher) in the Taylor series are sufficiently small as to be negligible. Note also that these expressions carry implicit assumptions about the existence of partial derivatives of f and f_β required for the relevance of Taylor's theorem.

• For nonlinear models, such approximations are used routinely. It must be kept in mind that these approximations involve assumptions that must hold for them to be relevant to a particular problem.

Substituting these expressions into (3.3), we get

    0 ≈ Σ_{j=1}^n w_j {Y_j − f(x_j, β*) − f_β^T(x_j, β*)(β − β*)} {f_β(x_j, β*) + f_ββ(x_j, β*)(β − β*)}

      = Σ_{j=1}^n w_j {Y_j − f(x_j, β*)} f_β(x_j, β*) − Σ_{j=1}^n w_j f_β(x_j, β*) f_β^T(x_j, β*)(β − β*)

        + Σ_{j=1}^n w_j {Y_j − f(x_j, β*)} f_ββ(x_j, β*)(β − β*) + quadratic terms in (β − β*).

This expression forms the basis for a linear
approximation to the nonlinear estimating equation.

• If β and β* are close, we would expect the quadratic terms to be small relative to the other terms, which involve (β − β*) in only a linear way.

• Furthermore, under these conditions, we also expect E{Y_j − f(x_j, β*)} ≈ 0. Thus, the third term on the right-hand side of the approximation, which involves the product of (β − β*) and Y_j − f(x_j, β*), might also be expected to be small.

Considering, then, the third and fourth terms to be negligible, we have the approximation

    0 ≈ Σ_{j=1}^n w_j {Y_j − f(x_j, β*)} f_β(x_j, β*) − {Σ_{j=1}^n w_j f_β(x_j, β*) f_β^T(x_j, β*)}(β − β*),

which may be rearranged to yield

    {Σ_{j=1}^n w_j f_β(x_j, β*) f_β^T(x_j, β*)}(β − β*) ≈ Σ_{j=1}^n w_j {Y_j − f(x_j, β*)} f_β(x_j, β*).   (3.6)

Defining X(β) as the n × p matrix with jth row f_β^T(x_j, β), Y = (Y₁, …, Y_n)^T, f(β) = {f(x₁, β), …, f(x_n, β)}^T, and W = diag(w₁, …, w_n), we may write (3.6) compactly in the suggestive form

    {X^T(β*) W X(β*)}(β − β*) ≈ X^T(β*) W {Y − f(β*)}.

Assuming that the matrix product in braces is positive definite, we may rearrange this as

    β − β* ≈ {X^T(β*) W X(β*)}^{-1} X^T(β*) W {Y − f(β*)}.   (3.7)

The approximation (3.7) suggests an iterative scheme for solving the estimating equation (3.3). Starting from some value β̂₀, from the ath iteration obtain the (a+1)th iterate as

    β̂_{a+1} = β̂_a + {X_a^T W X_a}^{-1} X_a^T W (Y − f_a).   (3.8)

Here, we have used the shorthand notation X_a = X(β̂_a), f_a = f(β̂_a).

Note that we may also write this as

    β̂_{a+1} = {X_a^T W X_a}^{-1} X_a^T W (Y − f_a) + {X_a^T W X_a}^{-1} X_a^T W X_a β̂_a

             = {X_a^T W X_a}^{-1} X_a^T W {X_a β̂_a + (Y − f_a)} = {X_a^T W X_a}^{-1} X_a^T W Z_a,   Z_a = X_a β̂_a + Y − f_a.   (3.9)

Now, the matrix X(β) depends on the x_j through the partial derivatives of f. Writing x̃ to denote all the x_j, j = 1, …, n, and ignoring the dependence of the "random vector" Z_a on this information, from (3.9) we have that

    E(Z_a | x̃) ≈ X_a β,   var(Z_a | x̃) ≈ σ² W^{-1}.   (3.10)

Thus, β̂_{a+1} represents a WLS estimator for the approximate linear model given in (3.10), where the model has been expressed in matrix form. If the model (3.10) were exact, the WLS estimator would be the best linear unbiased estimator (BLUE) for β. From this development, the (a+1)th iterate in solving the nonlinear model problem may be viewed as an attempt to emulate the desirable properties in the linear case.

SUMMARY: This argument
suggests that, to implement solution of the estimating equation (3.3) with known weights, one would begin with a starting value β̂₀ (a = 0) and would obtain successive updates β̂_a, declaring a solution to the equation to be reached when two successive iterates β̂_a and β̂_{a+1} are sufficiently "close" in some sense. We will discuss selection of starting values, convergence criteria for declaring the solution has been reached, and other issues momentarily. The success of this procedure obviously depends on the relevance of the approximations made.

REALISTIC IMPLEMENTATION: The algorithm as described above is simplistic relative to how things are actually implemented in practice.

• Several modifications to the basic algorithm, choices of convergence criteria, and so on have been suggested to improve performance; that is, to offer better assurance that the true solution is found and to decrease the computation time (number of iterations and function evaluations). One such modification is discussed in Section 3.6. Modifications to the basic algorithm given here, and alternative algorithms that go about finding the solution in other ways, are available in software such as SAS proc nlin and the R/S-Plus function nls. These are discussed further and illustrated in Section 3.7. Such software generally offers the user a choice among modifications and other approaches, selection of the convergence criterion to be used, and so on.

• As our focus is on statistical inference and not on numerical methods, when discussing implementation of (3.3) and other estimation methods in the sequel, we will usually speak of the basic algorithm for conceptual simplicity. But keep in mind that realistic implementation, as in available software, will usually be more sophisticated.

• More on modifications and alternative procedures may be found in Bates and Watts (1988, Sections 2.2 and 3.5) and Seber and Wild (1989, Chapter 14). See also the documentation for SAS proc nlin (SAS Institute Inc., 1999) and discussions of the
R/S-Plus function nls (Chambers and Hastie, 1993; Venables and Ripley, 1999).

INCORPORATION IN THE GLS ALGORITHM: From this discussion, in principle, to implement the GLS algorithm one would do the following. Note that each of steps (i) and (iii) involves using the Gauss-Newton method above to solve the relevant estimating equation. Thus, the GLS algorithm, which is itself an iterative process, involves internal iterations to find the required solutions at steps (i) and (iii). It is important to understand the distinction between the "outer" and "inner" iterations. Here, superscript (k) indexes the outer GLS iterations, and subscript a indexes the inner Gauss-Newton iterations used to solve the equation at each step; these inner iterations are nested within the outer ones.

(i) To obtain the initial estimate β̂^(0) = β̂_OLS (k = 0), one would use the Gauss-Newton method above to solve (3.3) with all w_j ≡ 1, so that W = I, an n × n identity matrix. The Gauss-Newton algorithm would be started with some suitable starting value β̂₀ (a = 0) and iterated according to

    β̂_{a+1} = (X_a^T X_a)^{-1} X_a^T Z_a,

where X_a and Z_a are defined as above, until two successive iterates differ negligibly, as described in Section 3.3. Declare the final iterate to be the OLS estimate β̂^(0) = β̂_OLS.

(ii) Form estimated weights ŵ_j = g^{-2}(β̂^(k), θ, x_j); recall we are treating θ as known.

(iii) Form the matrix Ŵ = diag(ŵ₁, …, ŵ_n) and regard it as fixed. Use the Gauss-Newton method to solve (3.3) with the w_j set to their estimated values from (ii), by starting from some suitable starting value β̂₀ (a = 0) and iterating according to

    β̂_{a+1} = (X_a^T Ŵ X_a)^{-1} X_a^T Ŵ Z_a,

until two successive iterates differ negligibly, as described in Section 3.3. Declare the final iterate to be the solution β̂^(k+1), the GLS estimate after k + 1 iterations of the GLS algorithm.

If the total number of iterations planned for the algorithm is C, then return to (ii) and repeat C − 1 more times. If C = ∞, iteration between steps (ii) and (iii) would continue until two successive outer iterates β̂^(k) and β̂^(k+1) differ by
This will be discussed further in Section 3.4.

3.3 Practical issues

We now discuss several issues to be faced in practical implementation of the GLS algorithm.

STARTING VALUES: In linear regression, estimators are available in closed form, and no starting values are needed. In contrast, because the estimating equation for nonlinear models, even with w_j ≡ 1, cannot be solved in closed form, starting values for the Gauss-Newton or other algorithm are required.

• The choice of starting value can be critical. It is generally not the case that choosing an arbitrary value will lead to the Gauss-Newton algorithm converging to the genuine solution to the estimating equation. Sometimes, the algorithm may "bomb." This may happen if, at the current iterate β^(a), the matrix X^T(a) Ŵ X(a) is not positive definite, so that it may not be inverted. In this case, the iterations may have gotten "confused" by starting from a value far from the part of the parameter space where the solution resides. Alternatively, and more insidiously, the algorithm may appear to converge, in that the difference between two successive iterates is small, but in reality the solution has not been found at all. When this happens, the supposed "solution" is often implausible.

• Choice of suitable starting values is somewhat of an "art form." Because nonlinear models can have many different forms, there is no one all-purpose or automatic approach for identifying a sensible choice. However, for models with particular features, ad hoc methods are possible. For example, consider the four-parameter logistic model discussed in Example 1.4, namely

    f(x, β) = β1 + (β2 − β1) / {1 + (x/β3)^{β4}}.

Recall that the parameters have the following interpretations: β1 is the response at x = ∞; β2 is the "background" response at x = 0; β3 is the concentration giving response halfway between β2 and β1, often called the ED50; and β4 is a "shape" parameter governing steepness. This suggests guessing at
plausible starting values for β = (β1, β2, β3, β4)^T by examining a plot of the data, as follows. Take the starting value for β1 to be the value about equal to where the data appear to "asymptote" as x increases. Similarly, take β2 to be the value where the data seem to suggest the curve intersects the Y axis. Choose β3 as equal to the x value that seems to correspond roughly to the response halfway between the chosen β1 and β2. Finally, take β4 to be about 1; this is often a reasonable first choice, as success seems to depend much more critically on the chosen values for the other three parameters.

• Bates and Watts (1988) discuss choice of starting values for different models in Section 3.2; see also Draper and Smith (1981, Chapter 10) and various places in Seber and Wild (1989). As we will discuss in Chapter 4, for certain nonlinear models popular in the analysis of binary and count data, an almost "automatic" procedure is in fact available, but for more general nonlinear models it is more tricky and specific.

• Experience and perseverance are often required, but most problems may generally be resolved successfully, unless the information in the data is simply not sufficient to identify all the parameters in the model.

IDENTIFIABILITY: As with any statistical model, it must be possible to identify the components of β; that is, the model must be identifiable in the sense that there is only a single value of β leading to the same model. This is not an issue for linear models (unless n < p, of course), but it arises often in nonlinear settings. As an example, consider the biexponential model

    f(x, β) = β1 exp(−β2 x) + β3 exp(−β4 x).

It is straightforward to observe that the roles of (β1, β2) may be swapped with those of (β3, β4), so that there is more than one way to choose β and end up with the same value of f(x, β) for any particular x. Thus, technically, the model is not identified.

In this particular case, this is not a big practical issue. If it is decided that (β1, β2) will represent the first part of the curve and (β3, β4) the second, the starting values may be chosen accordingly.
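The exchangeability of (β1, β2) and (β3, β4) in the biexponential model is easy to verify numerically. The following Python snippet (illustrative only; the parameter values are made up) shows that the swapped parameter vector produces exactly the same mean function at every x, so both vectors yield the same value of any least squares objective:

```python
import numpy as np

def biexp(x, b1, b2, b3, b4):
    """Biexponential mean f(x, beta) = b1 exp(-b2 x) + b3 exp(-b4 x)."""
    return b1 * np.exp(-b2 * x) + b3 * np.exp(-b4 * x)

x = np.linspace(0.0, 8.0, 50)
beta = (3.0, 1.2, 0.8, 0.1)        # hypothetical parameter values
swapped = (0.8, 0.1, 3.0, 1.2)     # roles of the two terms exchanged

# identical fitted values, hence two equivalent "solutions"
same = np.allclose(biexp(x, *beta), biexp(x, *swapped))
```

Fixing the convention that the first pair governs the faster-decaying term (and choosing starting values consistent with it) is what selects one of the two equivalent solutions in practice.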
If this is done, it is rare that the algorithm will become "confused." Thus, in a sense, by appropriate choice of starting values, the user may "identify" a technically unidentifiable model.

When nonlinear models are adapted from theoretical considerations, the situation sometimes does arise where the model may contain too many parameters to be identified from the data. For example, one parameter may turn out to be able to be expressed in terms of others. The user should be on the lookout for such situations, and subject-matter considerations may need to be used to reduce the number of parameters.

CONVERGENCE CRITERIA: Routines for implementing the Gauss-Newton or other methods use a number (sometimes several at once) of criteria for assessing whether two successive iterates are sufficiently close that the algorithm may be declared to have converged to the solution. The most common criterion is to insist that the maximum relative change in all elements of β from iteration a to a + 1 be smaller than some prespecified small tolerance; a common default choice is the tolerance tol = 10^{-8}. Formally, the criterion is

    max_{1≤ℓ≤p} |β_ℓ^(a+1) − β_ℓ^(a)| / |β_ℓ^(a)| < tol.    (3.11)

Bates and Watts (1988) offer extensive discussion of a variety of convergence criteria; the one we give here is a popular default in many available software routines.

PARAMETER RESTRICTIONS: Some situations may impose natural restrictions on the values that parameters in a nonlinear model may take on.

• For instance, recall Examples 1.1, 1.2, and 1.8, all having to do with modeling of pharmacokinetics. The models involved in this application involve parameters that may have physical meaning; e.g., a fractional rate of removal from a compartment in a compartmental model must be a positive value. Indeed, the pattern exhibited by the data implies that some parameters must be positive in order for the model to make sense; e.g., for the indomethacin data in Examples 1.1 and 1.2, the model was the biexponential

    f(x, β) = β1 exp(−β2 x) + β3 exp(−β4 x).

In order that the model both make physical sense and characterize the pattern, we must have all of β1, ..., β4 > 0.
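The relative-change criterion (3.11) discussed above takes only a few lines to code. Here is a hypothetical Python helper (the function name and the tolerance default are illustrative, not from the notes):

```python
import numpy as np

def converged(beta_old, beta_new, tol=1e-8):
    """Criterion (3.11): declare convergence when the maximum relative
    change  max_l |beta_new_l - beta_old_l| / |beta_old_l|  falls below tol."""
    beta_old = np.asarray(beta_old, dtype=float)
    beta_new = np.asarray(beta_new, dtype=float)
    return bool(np.max(np.abs(beta_new - beta_old) / np.abs(beta_old)) < tol)

tiny_change = converged([1.0, 2.0], [1.0 + 1e-10, 2.0])   # below tolerance
big_change = converged([1.0, 2.0], [1.1, 2.0])            # 10% change in one element
```

Note that the criterion is scale-free in each component, which is why it is a sensible default when elements of β differ in magnitude; it does assume no component of the current iterate is exactly zero.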
The fitting algorithm contains no restriction that forces iterates, and indeed the final declared estimate, to obey a positivity constraint. Luckily, it often turns out that, with appropriate choice of starting values, this is a non-issue, as under these circumstances the algorithm almost never wanders into areas of the parameter space where components of β are negative. However, in some situations, e.g., where the value of one of the exponential parameters may be very close to zero, this can happen.

• One can impose such restrictions by a reparameterization of the model. For example, the biexponential model may be reparameterized as

    f(x, β) = exp(β1) exp{−exp(β2) x} + exp(β3) exp{−exp(β4) x};

in this parameterization, the parameters are the logarithms of those in the original parameterization. The values appearing in each position of the model must be positive for values of the components of β in (−∞, ∞). One must remember the meaning of the parameters when interpreting the results. We will discuss methods for making inference on functions of parameters in such models in later chapters.

• In addition to imposing constraints, the adoption of an alternative parameterization may have other benefits; in fact, different parameterizations may be preferable even if no constraints are present. In one parameterization, a model may be difficult to fit, as the linear approximation underlying the algorithm may not be very good. A new parameterization may help with this and make convergence more speedy. Another benefit may be that the parameter estimates themselves may be less correlated, leading to more precise inferences; we will discuss this in later chapters.

SCALING: Sometimes, if the true values of the elements of β are several orders of magnitude different from one another, convergence may be difficult to achieve. For example, consider the monoexponential model

    f(x, β) = β1 exp(−β2 x),  β = (42651, 0.013)^T.

The fact that the true values are many
orders of magnitude different can create problems in computing the update, as the matrix X^T(a) Ŵ X(a) may become close to singular because of problems in representing very small or large numbers and the tolerances for declaring matrices singular that underlie the software. A remedy is to rescale the parameters; in our example, one could fit instead the model

    f(x, β) = β1 × 10000 exp(−β2 x),

for which β = (4.2651, 0.013)^T. Another way is to analyze Y/10000.

DERIVATIVES: The iterative Gauss-Newton procedure requires the first partial derivatives of f(x, β) with respect to β.

• Available software allows the user to specify analytical expressions for these derivatives. Alternatively, in the absence of these specifications, the software computes numerical approximations. Almost always, the solution will be achieved more quickly, with less chance for problems, if the analytical expressions are used. Finding analytical derivatives is not difficult with the advent of software such as Maple and Mathematica. It is generally recommended that analytical derivatives be used whenever possible. This is because, depending on the function and the values of x and β, numerical approximations to true derivatives can sometimes fail.

• However, one must be careful! Incorrect specification of analytical derivatives is the most common source of problems in fitting nonlinear models. Often, this is because of the carelessness (or ineptitude) of the user.

Helpful Hint (1): Check the expressions for the analytical derivatives. A quick, simple way to do this is to evaluate the expressions for some choices of x and β and compare them to simple one-sided numerical derivatives. That is, for ℓ = 1, ..., p, compare your analytical expression for (∂/∂β_ℓ) f(x, β) to

    {f(x, β + δ e_ℓ) − f(x, β)} / δ,

where δ is some small number, e.g., δ = 0.0001, and e_ℓ is a (p × 1) vector with a 1 in the ℓth position and zeroes elsewhere.

Helpful Hint (2): …

The above, albeit brief, discussions of implementation issues are relevant in general for fitting any model nonlinear in its parameters. The best way to get a feel for the issues is through practical experience.
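The derivative check of Helpful Hint (1) can be sketched as follows; this Python fragment is illustrative only (the function names and the monoexponential example are invented, not code from the notes):

```python
import numpy as np

def check_derivatives(f, fbeta, x, beta, delta=1e-4):
    """Compare analytical derivatives fbeta against the one-sided numerical
    derivatives {f(x, beta + delta*e_l) - f(x, beta)} / delta, l = 1, ..., p,
    as in Helpful Hint (1).  Returns the largest absolute discrepancy."""
    beta = np.asarray(beta, dtype=float)
    analytic = fbeta(x, beta)                      # n x p gradient matrix
    numeric = np.empty_like(analytic)
    for l in range(beta.size):
        e = np.zeros_like(beta)
        e[l] = delta                               # perturb the l-th component
        numeric[:, l] = (f(x, beta + e) - f(x, beta)) / delta
    return float(np.max(np.abs(analytic - numeric)))

# Check the monoexponential f(x, beta) = b1 exp(-b2 x) at illustrative values
f = lambda x, b: b[0] * np.exp(-b[1] * x)
fbeta = lambda x, b: np.column_stack((np.exp(-b[1] * x),
                                      -b[0] * x * np.exp(-b[1] * x)))
err = check_derivatives(f, fbeta, np.linspace(0.5, 5.0, 10), [2.0, 0.5])
```

A correctly coded gradient should agree with the one-sided difference up to the truncation error of the difference quotient (order δ); a sign error or a dropped chain-rule factor shows up immediately as an O(1) discrepancy.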
3.4 The case C = ∞ and iteratively reweighted least squares

Returning now to our focus on the outer GLS algorithm, recall that the case C = ∞ corresponds to solving the estimating equation in (3.2), repeated here:

    Σ_{j=1}^n g^{-2}(β, θ, x_j) {Y_j − f(x_j, β)} f_β(x_j, β) = 0,    (3.12)

where f_β(x_j, β) is (p × 1). In previous decades, use of the GLS algorithm with steps (i)-(iii) as presented in Section 3.1, with C chosen to be some fixed, typically small value, was common, in part because the required computations were time-consuming. Today, with advances in computation, this is no longer an issue. Thus, what is typically done is to solve (3.12); that is, take C = ∞.

In this case, our discussion of the GLS algorithm with three steps (i)-(iii) is mainly a device to allow us to think about the problem conceptually. It turns out that the case C = ∞ is more easily implemented directly rather than via a three-step approach, as we will now discuss. It is important to keep in mind, however, that when θ is unknown, a problem we will tackle later, the approach we are about to discuss becomes problematic, and incorporating estimation of θ into the three-step algorithm is convenient.

Intuition suggests that, if we want to solve (3.12), iterating between steps (ii) and (iii), each time performing a Gauss-Newton iteration with fixed weights, is computationally inefficient. An alternative approach is to attack the problem directly; that is, derive a linear approximation based on starting with (3.12). Consider linear approximation of (3.12) about a value β* "close to" β, as before.

• In addition to the approximations to f(x_j, β) and f_β(x_j, β) in (3.4) and (3.5), we may approximate g^{-2}(β, θ, x_j) in a similar manner.

• Substituting the approximations for f(x_j, β), f_β(x_j, β), and g^{-2}(β, θ, x_j) into (3.12), multiplying out, and collecting terms, we may disregard those that are quadratic in (β − β*) and involve products of "small" expressions.

• This is left as an exercise for the interested reader!

Following this approach, it may be shown that a linear approximation may be
derived:

    Σ_{j=1}^n g^{-2}(β*, θ, x_j) f_β(x_j, β*) f_β^T(x_j, β*) (β − β*) ≈ Σ_{j=1}^n g^{-2}(β*, θ, x_j) {Y_j − f(x_j, β*)} f_β(x_j, β*).

Defining W(β*) = diag{g^{-2}(β*, θ, x_1), ..., g^{-2}(β*, θ, x_n)}, we may write this compactly as

    β − β* ≈ {X^T(β*) W(β*) X(β*)}^{-1} X^T(β*) W(β*) {Y − f(β*)}.

This suggests the following iterative scheme. Writing W(a) = W(β^(a)), from the ath iterate β^(a), obtain the (a + 1)th as

    β^(a+1) = β^(a) + {X^T(a) W(a) X(a)}^{-1} X^T(a) W(a) {Y − f(β^(a))} = {X^T(a) W(a) X(a)}^{-1} X^T(a) W(a) Z(a),    (3.13)

where Z(a) = X(a) β^(a) + Y − f(β^(a)), as before.

Because the weight matrix W(β) changes at each iteration, this method of solving the estimating equation (3.12) is called iteratively reweighted least squares, usually abbreviated as IRWLS or IWLS.

• One would iterate (3.13) until two successive iterates differ negligibly; i.e., until a stated convergence criterion is satisfied.

• Combining the "reweighting" and approximate linear least squares into a single procedure and iterating to convergence has the same effect as iterating between steps (ii) and (iii) in the conceptual GLS algorithm until convergence. In particular, if one were to implement the GLS algorithm for C = ∞, one would continue to iterate between steps (ii) and (iii) until two successive outer iterates β̂^(k) and β̂^(k−1) differ by some negligible amount. This could be formalized by implementing a convergence criterion like (3.11), applied at the end of step (iii) to the new and previous iterates.

• Using similar convergence criteria, the IRWLS procedure and the GLS algorithm should yield the same estimated value for β, as they are both approaches to solving the same estimating equation. The former is obviously more computationally efficient, as it involves only one iterative algorithm. In contrast, the GLS algorithm requires that the Gauss-Newton procedure with fixed weights be performed each time step (iii) is carried out. With computing as powerful as it is these days, in most problems there is not a practically significant time difference between the two ways of obtaining the solution to (3.2). The GLS algorithm will generally converge within about 5 to 10 iterations in most problems.
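The IRWLS update (3.13) differs from the fixed-weight Gauss-Newton step only in that the weight matrix is recomputed from the current iterate on every pass. Here is a hypothetical Python sketch (names, data, and starting value invented for illustration), assuming the power-of-the-mean variance function g(β, θ, x_j) = f^θ(x_j, β) with θ = 1:

```python
import numpy as np

def irwls(f, fbeta, y, x, beta0, theta=1.0, tol=1e-8, max_iter=100):
    """Solve (3.12) by iteratively reweighted least squares, update (3.13),
    with working weights w_j = g^{-2} = f(x_j, beta)^{-2*theta} recomputed
    from the current iterate at each pass."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        mu = f(x, beta)
        W = np.diag(mu ** (-2.0 * theta))      # weights change every iteration
        X = fbeta(x, beta)
        beta_new = beta + np.linalg.solve(X.T @ W @ X, X.T @ W @ (y - mu))
        if np.max(np.abs(beta_new - beta) / np.abs(beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Monoexponential toy example with noise-free data, so the GLS solution
# coincides with the generating value
f = lambda x, b: b[0] * np.exp(-b[1] * x)
fbeta = lambda x, b: np.column_stack((np.exp(-b[1] * x),
                                      -b[0] * x * np.exp(-b[1] * x)))
x = np.linspace(0.25, 4.0, 12)
y = f(x, np.array([2.0, 0.5]))
beta_gls = irwls(f, fbeta, y, x, beta0=[1.5, 0.7])
```

Replacing the single line that builds `W` with a fixed matrix recovers the ordinary fixed-weight Gauss-Newton iteration, which is exactly the sense in which IRWLS merges steps (ii) and (iii) into one loop.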
We will see in Chapter 4 that, for a certain class of problems, the IRWLS procedure arises naturally from an entirely different perspective from that given here.

TERMINOLOGY: Regardless of what computational method is used to obtain the solution to (3.12), by IRWLS or the three-step algorithm, we will refer to the final value as the GLS estimator β̂_GLS. In fact, we will refer to that obtained from the three-step algorithm for fixed C as a GLS estimator as well. This emphasizes that the key issue is the estimating equation being solved, not the numerical method used to obtain the solution. The terminology "GLS" refers to the general approach of solving a weighted linear estimating equation, not to the computational strategy used.

3.5 Estimation of σ²

So far, we have focused exclusively on estimation of the regression parameter β.

• This is in fact because the other parameter characterizing the model (3.1), σ², acts as a constant of proportionality as far as variance is concerned, and hence as far as weighting is concerned. As σ² is the same for all j, weighting to adjust for differences in precision of the response need not incorporate σ².

• This is very convenient; although both β and σ² fully characterize our conditional moment assumptions, we may estimate β without regard to σ², rather than having to estimate the two parameters jointly. We will see later that other approaches to inference for (3.1) do not necessarily enjoy this feature and are thus considerably more complicated to implement.

• Of course, in practice, we may well be interested in the value of σ², both because it fully characterizes our assumption about the variance of the response and, as we will see, because it plays a key role in the large-sample sampling distribution of the estimator for β. For the latter, an estimator for σ² is needed in order that estimates of uncertainty (e.g., standard errors and confidence intervals) regarding elements of β be obtained.

A natural strategy is to base the estimator
for σ² on the GLS estimator for β. That is, form an estimate for σ² using the final GLS estimate β̂_GLS. It seems reasonable to emulate what one would do with known weights w_j, whether these weights are known or not. When the weights w_j are known, a natural estimator for σ² is

    σ̂² = n^{-1} Σ_{j=1}^n w_j {Y_j − f(x_j, β̂_WLS)}²,    (3.14)

where β̂_WLS is the WLS estimator, which may be computed using the Gauss-Newton procedure, of course. It is straightforward to show (and left as an exercise) that this estimator is the maximum likelihood estimator for σ² under the assumption of normality and known, fixed weights. Analogous to the case of linear regression, this is often replaced by

    σ̂² = (n − p)^{-1} Σ_{j=1}^n w_j {Y_j − f(x_j, β̂_WLS)}².    (3.15)

In the linear case, it is straightforward to show that the maximum likelihood estimator (3.14) is biased downward by the factor (n − p)/n, motivating the division by n − p rather than n. In the nonlinear case, a similar bias is often seen in practice, although it is not possible to derive an exact analytical expression for the bias as it is in the linear case. Thus, the same "bias correction" of division by n − p is made, as at least an approximate adjustment, as in (3.15) in the nonlinear case.

In any event, because estimators for σ² are computed after estimation of β is finalized, this bias is often viewed, in both linear and nonlinear situations, as a consequence of failure to account for the degrees of freedom (p) "lost" due to having to estimate β rather than knowing it. Thus, the modified estimators with divisor n − p are often said to be "adjusted for loss of degrees of freedom" for estimating β. In Chapter 7, we will give another interpretation.

In model (3.1) with var(Y_j | x_j) = σ² g²(β, θ, x_j), so that w_j = g^{-2}(β, θ, x_j), θ known, the natural analogs to (3.14) and (3.15) are

    σ̂² = n^{-1} Σ_{j=1}^n g^{-2}(β̂_GLS, θ, x_j) {Y_j − f(x_j, β̂_GLS)}²  and
    σ̂² = (n − p)^{-1} Σ_{j=1}^n g^{-2}(β̂_GLS, θ, x_j) {Y_j − f(x_j, β̂_GLS)}²,    (3.16)

respectively. The same bias phenomenon is observed to persist in this more complicated situation, so the estimator in (3.16) is generally preferred. It is straightforward to show that, for fixed θ, the first estimator is the maximum likelihood estimator for σ² when the distribution of Y_j given x_j, with first two moments (3.1), is assumed normal.
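For concreteness, the estimators (3.14)-(3.16) can be written in a few lines; this Python fragment is an illustrative sketch only (the function name, data, and fitted values are invented, not taken from the notes):

```python
import numpy as np

def sigma2_hat(y, fitted, weights, p, adjust=True):
    """Weighted residual estimator of sigma^2.  With adjust=True this is the
    degrees-of-freedom-corrected version, divisor n - p, as in (3.15)/(3.16);
    with adjust=False it is the ML-type version with divisor n, as in (3.14)."""
    wres2 = weights * (y - fitted) ** 2     # w_j {Y_j - f(x_j, beta-hat)}^2
    n = y.size
    return float(np.sum(wres2) / (n - p if adjust else n))

# Toy illustration with invented numbers; weights g^{-2} = 1/f^2 (theta = 1)
y = np.array([2.1, 1.0, 0.5, 0.26])
fitted = np.array([2.0, 1.1, 0.55, 0.3])
w = 1.0 / fitted ** 2
s2_adj = sigma2_hat(y, fitted, w, p=2)                 # divisor n - p = 2
s2_ml = sigma2_hat(y, fitted, w, p=2, adjust=False)    # divisor n = 4
```

The two versions differ only by the fixed factor n/(n − p), which is exactly the downward bias factor discussed above; here, with n = 4 and p = 2, the adjusted estimate is twice the ML-type one.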
We will discuss this further in later chapters.

3.6 Improving performance of the basic algorithm

As we remarked in Section 3.2, implementing solution of the types of estimating equations we consider via the basic Gauss-Newton iterative procedure may be modified to improve performance. The "bare-bones" Gauss-Newton algorithm we discussed is often modified by various "fine-tuning" devices. Here, we discuss one such fine-tuning device, referred to as step-halving. Step-halving is really only applicable in the case of fixed, known weights; thus, it is not used to modify the basic IRWLS approach.

In the case of fixed weights, the basic Gauss-Newton iterative update is given in (3.8) as

    β^(a+1) = β^(a) + {X^T(a) W X(a)}^{-1} X^T(a) W {Y − f(β^(a))}.

This may also be expressed in terms of Z(a), as in (3.9). Thus, the proposed increment by which β^(a) is updated to β^(a+1) is given by

    d^(a) = β^(a+1) − β^(a) = {X^T(a) W X(a)}^{-1} X^T(a) W {Y − f(β^(a))}.

In the basic implementation, d^(a) is adopted without question. In the step-halving approach, before this proposed increment is adopted, a check is performed to see whether it may be "improved." There are several variations on this theme; here, we discuss just one such approach.

The procedure we discuss relies on the fact that, with known, fixed weights, solving the usual estimating equation for WLS, which may be written in matrix notation as X^T(β) W {Y − f(β)} = 0, is equivalent to minimizing the weighted sum of squares, again written in matrix notation,

    R(β) = {Y − f(β)}^T W {Y − f(β)}.    (3.17)

• This may be seen by differentiating (3.17) with respect to β, of course. The motivation for step-halving comes from being able to express the problem as a minimization.

• Note that, if W were to depend on β, as in the estimating equation (3.12), this would no longer be true in general. This may be seen by differentiating (3.17) with respect to β — if W(β) depends on β, the derivative turns out to be X^T(β) W(β) {Y −
f(β)} plus an additional term due to this dependence. Thus, minimizing {Y − f(β)}^T W(β) {Y − f(β)} does not correspond to solving X^T(β) W(β) {Y − f(β)} = 0. This is a very important point that we will discuss again later.

• For now, note that it is not apparent that we can pose the solution of the estimating equation (3.12) as a minimization problem. When the weights are fixed, we can. Software such as SAS proc nlin and R/S-Plus nls use step-halving as the default, as the main focus of these programs is to address problems with fixed weights. SAS proc nlin can also be made to solve (3.12), where the weights depend on β, using the IRWLS approach; in this case, the user is instructed to "turn off" step-halving for this reason.

Considering the problem as one of minimizing the objective function R(β) in (3.17), we would hope that the new increment d^(a) at the (a + 1)th iteration serves to decrease the value of R(β). Thus, rather than just adopting β^(a+1), the step-halving approach checks to see whether a fine-tuning of d^(a) will result in a greater decrease in R(β).

We will focus on two of the programs most readily accessible to statisticians:

• SAS proc nlin

• R function nls. A version of this function, with slightly different syntax, is also available in S-Plus.

Full details on the use of these programs are available in the associated documentation. Here, we simply give some basic information and illustrate their use. As with almost any software, a good way to learn how to use these programs is to follow an example. Thus, we demonstrate their application to fitting a model for the data on pharmacokinetics of indomethacin in Examples 1.1 and 1.2. The raw data are given in Table 3.1, where Y is concentration of indomethacin (μg/ml) and x is time (hours); the data are plotted in Figure 1.1.

Table 3.1: Indomethacin concentration-time data for a single subject.

      x      Y
    0.25   2.05
    0.50   1.04
    0.75   0.81
    1.00   0.39
    1.25   0.30
    2.00   0.23
    3.00   0.13
    4.00   0.11
    5.00   0.08
    6.00   0.10
    8.00   0.06

We consider the following model:

    f(x_j, β) = exp(β1) exp{−exp(β2) x_j} + exp(β3) exp{−exp(β4) x_j},  var(Y_j | x_j) = σ²
f²(x_j, β), so that the biexponential model is parameterized to enforce positivity and the variance function is g(β, θ, x_j) = f(x_j, β). It would, of course, be straightforward to fit the model in its original parameterization as well. A more general variance model specification is var(Y_j | x_j) = σ² f^{2θ}(x_j, β); thus, we have adopted this model with the value θ = 1.0 known. We remark on this here, as some of the programs will be modified in later chapters to allow estimation of a variance parameter θ.

In all cases, we use the following starting value for solving all estimating equations:

    β^(0) = log(2.0, 2.0, 0.2, 0.2)^T.

SAS proc nlin: SAS proc nlin is the basic SAS procedure to implement nonlinear model fitting, including weighting. The procedure uses a syntax similar to other regression and linear model procedures in SAS. See the most recent documentation for SAS statistical procedures for full information. Here, we simply summarize the basic syntax; these may be seen "in action" in the programs that follow. Required statements are as follows:

proc nlin <options>;   Invoke the procedure. Options include choice of solution algorithm, control of maximum number of iterations, convergence criteria.
model                  Specify the model f and identify the response variable Y.
parameters (or parms)  Identify the parameters β and give starting values.

The following statements are optional (and are not the only optional statements):

der.     Specify expressions for analytic derivatives.
bounds   Constrain the range of parameter values (use with caution).

The following pages contain extensively documented SAS programs to fit the model by solving (3.12) two ways:

PROGRAM 3.1: By using IRWLS.
PROGRAM 3.2: Using the three-step GLS algorithm.

DISCLAIMER: The author is not a good SAS programmer; it is likely that these programs could be made "slicker" and more efficient!

PROGRAM 3.1: Implementing IRWLS using SAS proc nlin.

PROGRAM STATEMENTS:

*  GLS analysis of the subject 5 indomethacin data, with theta = 1
   for variance; i.e., variance is a power (2*theta) of the mean.
   Here, use IRWLS as
   implemented automatically in SAS PROC NLIN;

options ps=55 ls=80 nodate;

*  Enter the data (may also be done from a file, of course);

data indo;
  input time conc;
  cards;
0.25 2.05
0.50 1.04
0.75 0.81
1.00 0.39
1.25 0.30
2.00 0.23
3.00 0.13
4.00 0.11
5.00 0.08
6.00 0.10
8.00 0.06
;

*  Invoke PROC NLIN.  The "method=gauss" statement asks for the
   Gauss-Newton algorithm; other method options are given in the
   SAS/STAT documentation.  The GN algorithm as implemented in PROC
   NLIN automatically employs a version of step-halving.  Because
   step-halving is not appropriate when the weights are iteratively
   recomputed (i.e., not held fixed), the "nohalve" option instructs
   the program to "turn off" step-halving;

proc nlin data=indo method=gauss nohalve;

*  Specify starting values in the "parms" statement; see the
   documentation for fancy options in specifying starting values;

  parms b1=0.69 b2=0.69 b3=-1.6 b4=-1.6;

*  We will parameterize the model to enforce positivity; thus, define
   b1-b4 to be the logarithms of the parameters of interest.  It is
   possible to use general programming statements, including "if"
   statements and "do" loops, in specifying models and derivatives in
   PROC NLIN;

  eb1=exp(b1); eb2=exp(b2); eb3=exp(b3); eb4=exp(b4);

*  The "model" statement specifies the model; put the name of the
   response variable on the LHS, the form of the model on the RHS.
   As above, the model may be specified in terms of things defined in
   very general programming statements, so that very complex models
   are possible;

  model conc = eb1*exp(-eb2*time)+eb3*exp(-eb4*time);

*  We are going to weight by the inverse of the predicted values
   squared, corresponding to the power variance model with theta=1.
   Thus, we define the variable conc2 to be the square of the
   predicted value at each iteration internal to the program.  In
   general, such predicted values are specified by the syntax
   model."LHS of model statement";

  conc2=model.conc*model.conc;

*  Specify the derivatives of the model with respect to each
   parameter.  If no "der." statements are present, the program will
   automatically
   default to using a derivative-free method of computation,
   regardless of what is requested in the "method" option.  Again,
   these may be quite complex, and the use of general programming
   statements to define the components of these formulae is allowed;

  der.b1=eb1*exp(-eb2*time);
  der.b2=-eb1*eb2*time*exp(-eb2*time);
  der.b3=eb3*exp(-eb4*time);
  der.b4=-eb3*eb4*time*exp(-eb4*time);

*  To get PROC NLIN to do weighted least squares, include a "_weight_"
   statement.  The weights may be fixed quantities, in which case
   PROC NLIN will do plain old WLS.  If the weights depend on
   predicted values, as here, PROC NLIN is smart enough to know to do
   IRWLS.  Thus, specifying a _weight_ statement as below
   automatically invokes the IRWLS computation;

  _weight_ = 1/conc2;

run;

OUTPUT:

                         The SAS System                          1
                        The NLIN Procedure
                          Iterative Phase
                     Dependent Variable conc
                      Method: Gauss-Newton
                                                        Weighted
  Iter       b1        b2        b3        b4             SS
    0      0.6900    0.6900   -1.6000   -1.6000         1.2071
    1      1.2238    0.9390   -1.5229   -1.8705         0.2127
    2      1.2100    0.9470   -1.4533   -1.7641         0.2071
    3      1.2136    0.9494   -1.4552   -1.7715         0.2063
    4      1.2144    0.9502   -1.4542   -1.7705         0.2063
    5      1.2146    0.9504   -1.4541   -1.7703         0.2063
    6      1.2146    0.9504   -1.4541   -1.7703         0.2063
    7      1.2146    0.9504   -1.4540   -1.7703         0.2063

  NOTE: Convergence criterion met.

                       Estimation Summary
            Method                    Gauss-Newton
            R                             4.412E-6
            PPC(b2)                       1.932E-6
            RPC(b2)                       8.742E-6
            Object                        2.717E-7
            Objective                     0.206319
            Observations Read                   11
            Observations Used                   11
            Observations Missing                 0

  NOTE: An intercept was not specified for this model.

                             Sum of       Mean               Approx
  Source              DF    Squares     Square    F Value    Pr > F
  Regression           4    11.0000     2.7500      93.30    <.0001
  Residual             7     0.2063     0.0295
  Uncorrected Total   11    11.2063
  Corrected Total     10     4.5334

                         The SAS System                          2
                        The NLIN Procedure
                                Approx     Approximate 95% Confidence
  Parameter   Estimate    Std Error              Limits
  b1            1.2146       0.2380        0.6519      1.7773
  b2            0.9504       0.1599        0.5724      1.3284
  b3           -1.4540       0.2135       -1.9589     -0.9492
  b4           -1.7703       0.2361       -2.3285     -1.2121

              Approximate Correlation Matrix
             b1           b2           b3           b4
  b1    1.0000000    0.8466629    0.2669316    0.2283554
  b2    0.8466629    1.0000000    0.5700623    0.5050704
  b3    0.2669316    0.5700623    1.0000000    0.9379593
  b4    0.2283554    0.5050704    0.9379593
  1.0000000

PROGRAM 3.2: Implementing the three-step GLS algorithm using SAS proc nlin.

PROGRAM STATEMENTS:

*  GLS analysis of the subject 5 indomethacin data, with theta = 1
   for variance; i.e., variance is a power (2*theta) of the mean.
   Here, define a SAS macro to perform the 3-step GLS algorithm;

options ps=55 ls=80 nodate;

*  Define a SAS macro called "glsalg" to control the GLS iteration.
   The macro takes as input the name of the SAS working data set, the
   names of the x and y variables (here, x is a scalar, but the
   program may be modified easily to pass more than one explanatory
   variable), and starting values for each WLS calculation;

%macro glsalg(dset,xvar,yvar,b1,b2,b3,b4);

*  Set up the data set for the first pass through the algorithm,
   which will compute the OLS estimate.  The variable "pred1"
   containing the predicted values from the previous iteration is set
   equal to 1 here, so that the first pass, with weights based on
   pred1**theta, will actually be OLS (weights all = 1);

data &dset; set &dset;
  pred1=1;
  if &xvar=. or &yvar=. then delete;
  b1new=&b1; b2new=&b2; b3new=&b3; b4new=&b4;

*  Set the value of theta (= 1.0 here);

  theta1=1.0;

*  Call the macro "step3" (defined below) that actually implements
   the call to PROC NLIN to do WLS with the current set of fixed
   weights.  The argument sets options for suppressing the printing
   of the results of each call to PROC NLIN;

%step3(1) %step3(1) %step3(1) %step3(1) %step3(1)
%step3(1) %step3(1) %step3(1) %step3(1) %step3(1)
%step3(0)

*  After the final iteration, compute the final estimate of sigma
   using the newest values of the weights;

data sigma; set outnlin(keep = resid pred);
data sigma2; set new(keep = theta1);
data sigma; merge sigma sigma2;
data sigma; set sigma;
  wresid=resid/pred**theta1;
proc means data=sigma noprint;
  var wresid;
  output out=sigout uss=rss;
data sigout; set sigout;
  sigma=sqrt(rss/7);
data sigout; set sigout(keep = sigma);
  if _n_>1 then delete;
proc print data=sigout;
  title2 "Final estimate of sigma";
%mend glsalg;

*  Define the macro "step3".  The argument "foo" controls options in
   calling PROC NLIN,
   see below;

%macro step3(foo);

*  PROC NLIN prints out a summary by default.  We print out this full
   output only for the final GLS fit; the "noprint" option is invoked
   for all other fits;

%if &foo>0 %then %do;
  %let opt=noprint;
%end;
%else %do;
  %let opt= ;
%end;

*  The call to PROC NLIN to implement the current WLS fit;

proc nlin data=&dset method=gauss &opt;
  parms b1=&b1 b2=&b2 b3=&b3 b4=&b4;
  if _iter_=-1 then do;
    b1=b1new; b2=b2new; b3=b3new; b4=b4new;
  end;

*  Programming statements to define the mean function and its
   derivatives;

  eb1=exp(b1); eb2=exp(b2); eb3=exp(b3); eb4=exp(b4);
  f=eb1*exp(-eb2*&xvar)+eb3*exp(-eb4*&xvar);
  db1=eb1*exp(-eb2*&xvar);
  db2=-eb1*eb2*&xvar*exp(-eb2*&xvar);
  db3=eb3*exp(-eb4*&xvar);
  db4=-eb3*eb4*&xvar*exp(-eb4*&xvar);

*  The model statement and derivative statements to tell PROC NLIN
   the form of the model and derivatives;

  model &yvar = f;
  der.b1=db1;
  der.b2=db2;
  der.b3=db3;
  der.b4=db4;

*  The "_weight_" statement defines the values of the weights to use.
   Here, we use weights computed using the fixed value of theta and
   the predicted values from the previous iteration.  Thus, the
   weights are FIXED (do not depend on current predicted values);
   PROC NLIN thus knows to do WLS with FIXED weights rather than
   IRWLS;

  _weight_ = 1/abs(pred1)**(2*theta1);

*  Form an output data set containing everything in the input data
   set, plus the predicted values (to set up weights for next time),
   the residuals (so we can calculate sigma on the final iteration),
   and the parameter estimates from this iteration (so we can print
   them);

  output out=outnlin p=pred r=resid parms=b1 b2 b3 b4;
run;

*  Set up the data set containing the new predicted values, etc., for
   use on the next iteration.  Also create the data set "parms"
   containing only the estimates of beta from the iteration just
   completed, for printing;

data new; set outnlin(keep = &xvar &yvar b1 b2 b3 b4 pred theta1);
  rename b1=b1new b2=b2new b3=b3new b4=b4new pred=pred1;
data parms; set new(drop = &xvar &yvar pred1 theta1);
  if _n_>1 then delete;
proc print data=parms;
data &dset; set new;
%mend step3;

*  End of the macro
   definitions.  Now the program to read the data and call the macros
   begins;

data indo;
  input time conc;
  cards;
0.25 2.05
0.50 1.04
0.75 0.81
1.00 0.39
1.25 0.30
2.00 0.23
3.00 0.13
4.00 0.11
5.00 0.08
6.00 0.10
8.00 0.06
;

*  Call the macro "glsalg" with the name of the indomethacin data
   set, the names of x (time) and y (conc), and the starting values;

title1 "3-STEP GLS ALGORITHM APPLIED TO THE INDOMETHACIN DATA";
%glsalg(indo,time,conc,0.69,0.69,-1.6,-1.6)

OUTPUT:

  3-STEP GLS ALGORITHM APPLIED TO THE INDOMETHACIN DATA

  Obs    b1new      b2new      b3new      b4new
   1    1.27147    1.04077   -1.23272   -1.50684

  Obs    b1new      b2new      b3new      b4new
   1    1.21779    0.94888   -1.47265   -1.79326

  Obs    b1new      b2new      b3new      b4new
   1    1.21449    0.95081   -1.45208   -1.76792

  Obs    b1new      b2new      b3new      b4new
   1    1.21464    0.95041   -1.45422   -1.77053

  Obs    b1new      b2new      b3new      b4new
   1    1.21461    0.95043   -1.45402   -1.77029

  Obs    b1new      b2new      b3new      b4new
   1    1.21461    0.95043   -1.45404   -1.77031

  Obs    b1new      b2new      b3new      b4new
   1    1.21461    0.95043   -1.45404   -1.77031

  Obs    b1new      b2new      b3new      b4new
   1    1.21461    0.95043   -1.45404   -1.77031

  Obs    b1new      b2new      b3new      b4new
   1    1.21461    0.95043   -1.45404   -1.77031

  Obs    b1new      b2new      b3new      b4new
   1    1.21461    0.95043   -1.45404   -1.77031

                        The NLIN Procedure
                          Iterative Phase
                     Dependent Variable conc
                      Method: Gauss-Newton
                                                        Weighted
  Iter       b1        b2        b3        b4             SS
    0      1.2146    0.9504   -1.4540   -1.7703         0.2063

  NOTE: Convergence criterion met.

                       Estimation Summary
            Method                    Gauss-Newton
            PPC(b1)                       2.241E-6
            RPC(b1)                       7.246E-7
            Objective                     0.206319
            Observations Read                   11
            Observations Used                   11
            Observations Missing                 0

  NOTE: An intercept
was not specified for this model.

                              Sum of        Mean               Approx
Source                 DF    Squares      Square    F Value    Pr > F
Regression              4    11.0000      2.7500      93.30    <.0001
Residual                7     0.2063      0.0295
Uncorrected Total      11    11.2063
Corrected Total        10     4.5334

                           Approx       Approximate 95% Confidence
Parameter   Estimate    Std Error                Limits
b1            1.2146       0.2380        0.6519       1.7773
b2            0.9504       0.1599        0.5724       1.3284
b3           -1.4540       0.2135       -1.9589      -0.9492
b4           -1.7703       0.2361       -2.3285      -1.2121

The NLIN Procedure
Approximate Correlation Matrix
            b1           b2           b3           b4
b1   1.0000000    0.8466631    0.2669317    0.2283555
b2   0.8466631    1.0000000    0.5700620    0.5050700
b3   0.2669317    0.5700620    1.0000000    0.9379592
b4   0.2283555    0.5050700    0.9379592    1.0000000

Obs      b1new       b2new       b3new       b4new
  1    1.21461     0.95043    -1.45404    -1.77031

Final estimate of sigma

Obs      sigma
  1    0.17168

R FUNCTION nls:  This is the basic function in R and S-plus implementing nonlinear model fitting.  It uses a syntax similar to that of other R/S-plus regression/linear model functions.  Full details for the S-plus implementation may be found in Chambers and Hastie (1993, Chapter 10).  The R implementation is slightly different in some of the syntax, but the basic syntax is the same.  We demonstrate the R implementation.  The basic syntax is of the form

object <- nls(formula, data, start)

where each entry is as follows:

object   The R object that is created from the model fit, containing the parameter estimates and other summaries.

formula  Specify the response and the model.

data     Optional -- specifies the data frame.

start    A list of starting values that also identifies the parameters.

Other optional arguments may be added to control the number of iterations, the convergence criterion, etc.  For more information, one may also type help(nls) at the R prompt.  The command summary(object), issued after running nls, will produce a summary of the results in the created object.

The method given in Chambers and Hastie (1993, Chapter 10) to do
"Weighted Least Squares" is not correct -- it does not perform IRWLS as the authors seem to mistakenly believe.  Thus, it is not clear if it is possible to implement IRWLS internally within nls.

Here, we give a program that implements the three-step GLS algorithm by calling nls to carry out the solution with fixed weights in steps (i) and (iii).  In step (iii), this is accomplished by transforming the problem to a constant-variance problem, as on page 31, and then using nls to find the OLS solution for the transformed problem.

nls uses a different convergence criterion than does SAS proc nlin, so there may be slight discrepancies in results from the identical algorithm implemented in the two languages.  However, these should be very slight, if not unnoticeable.

PROGRAM 3.3:  Implementing the three-step GLS algorithm using R nls.

PROGRAM STATEMENTS:

#  Program to implement the 3-step GLS algorithm with cmax total
#  iterations, theta known, using the function nls().  Details on the
#  nls() function may be found in Chapter 10 of the book "Statistical
#  Models in S," edited by J.M. Chambers and T.J. Hastie (1993, Chapman
#  and Hall).
#
#  Applied to the indomethacin data (subject 5), assuming variance is
#  proportional to a power 2*theta of the mean ("power-of-mean" model)
#  with theta = 1.0.  The mean model f is the biexponential model,
#  parameterized in order to enforce positivity.
#
#  The program may be used for any problem by changing the code defining
#  the mean function and "weights."

#  Define the biexponential mean function f and the gradient matrix of
#  its partial derivatives with respect to beta (n x p) as an attribute.
#  The nls() function will know to use analytic derivatives when it
#  spots the presence of the attribute "gradient" defined along with the
#  function.

indofunc <- function(time,b1,b2,b3,b4){
    eb1 <- exp(b1)
    eb2 <- exp(b2)
    eb3 <- exp(b3)
    eb4 <- exp(b4)
    indo.f <- eb1*exp(-eb2*time)+eb3*exp(-eb4*time)

#  compute analytical derivatives -- create the gradient matrix X

    indo.grad <-
array(0,c(length(time),4),list(NULL,c("b1","b2","b3","b4")))
    indo.grad[,"b1"] <- eb1*exp(-eb2*time)
    indo.grad[,"b2"] <- -eb1*eb2*time*exp(-eb2*time)
    indo.grad[,"b3"] <- eb3*exp(-eb4*time)
    indo.grad[,"b4"] <- -eb3*eb4*time*exp(-eb4*time)
    attr(indo.f,"gradient") <- indo.grad
    indo.f
}

#  To implement step (iii), we wish to do weighted least squares with
#  known weights.  This is accomplished by transforming the response and
#  mean function to a problem with constant variance; nls() is then
#  called to do OLS on the transformed problem, thereby doing WLS.  The
#  transformed mean function will depend on the current estimated
#  weights, which are found by evaluating the mean function at the
#  current estimate.  Because of the nature of R attributes, if
#  calculated using the function "indofunc" above, the weights will
#  carry along the gradient attribute, which will confuse things when we
#  wish to calculate the gradient of the transformed mean function
#  assuming the weights are constant.  There are more elegant ways
#  around this, but doing it this way here is meant to highlight the
#  issue.

unweight.func <- function(time,b1,b2,b3,b4){
    eb1 <- exp(b1)
    eb2 <- exp(b2)
    eb3 <- exp(b3)
    eb4 <- exp(b4)
    unwt.f <- eb1*exp(-eb2*time)+eb3*exp(-eb4*time)
    unwt.f
}

#  The transformed mean function -- multiplied by the square root of the
#  current estimated weights, which are considered fixed

weight.func <- function(time,b1,b2,b3,b4,wt){
    eb1 <- exp(b1)
    eb2 <- exp(b2)
    eb3 <- exp(b3)
    eb4 <- exp(b4)
    pred <- unweight.func(time,b1,b2,b3,b4)
    w12 <- sqrt(wt)
    weight.f <- pred*w12

#  compute analytical derivatives -- create the gradient matrix X

    weight.grad <- array(0,c(length(time),4),list(NULL,c("b1","b2","b3","b4")))
    weight.grad[,"b1"] <- eb1*exp(-eb2*time)*w12
    weight.grad[,"b2"] <- -eb1*eb2*time*exp(-eb2*time)*w12
    weight.grad[,"b3"] <- eb3*exp(-eb4*time)*w12
    weight.grad[,"b4"] <- -eb3*eb4*time*exp(-eb4*time)*w12
    attr(weight.f,"gradient") <- weight.grad
    weight.f
}

#  The data; alternatively, we could read them from a file, of course

time <- c(0.25,0.50,0.75,1.00,1.25,2.00,3.00,4.00,5.00,6.00,8.00)
conc <- c(2.05,1.04,0.81,0.39,0.30,0.23,0.13,0.11,0.08,0.10,0.06)
n <- length(conc)
p <- 4

#  Create the data frame for nls

indo.dat <- data.frame(time=time,conc=conc)

#  Specify the
max number of iterations of GLS, cmax.
#  Alternatively, we could check for convergence after step (iii).

cmax <- 10

#  Step (i): initial fit by OLS.  A call to nls() is pretty
#  self-explanatory.  The first argument specifies the model: on the LHS
#  of the "~" is the response variable; on the RHS is the mean function
#  (this may also be just an expression and need not be a function
#  call).  The second argument is the name of the data frame where the
#  data reside, and the third is a list containing the starting values.
#  nls() uses a form of the Gauss-Newton algorithm; additional options
#  are described in the Chambers-Hastie book.  The algorithm employs a
#  form of step-halving; this is also described in the book.  The call
#  creates an object containing the parameter estimates and other
#  summary information from the fit.

indo.olsfit <- nls(conc ~ indofunc(time,b1,b2,b3,b4),indo.dat,
                   list(b1=0.69,b2=0.69,b3=-1.6,b4=-1.6))

#  Extract the estimate from the object indo.olsfit

bols <- coef(indo.olsfit)

#  Print out the results to a file; we round the results to 6 places to
#  the right of the decimal

cat("FIT OF THE INDOMETHACIN DATA BY GLS",file="indo.gls1.Rout",
    "\n","\n","\n",append=F)
cat("OLS estimate = ",round(bols,6),file="indo.gls1.Rout",
    "\n","\n","\n",append=T)

#  Use the OLS estimator as the preliminary estimator

bgls <- bols

#  Begin iterating between steps (ii) and (iii)

for (k in 1:cmax){

#  Step (ii): calculate the weights and the transformed response to use
#  in the WLS calculation in (iii)

    mu <- unweight.func(time,bgls[1],bgls[2],bgls[3],bgls[4])
    wt <- 1/mu^2
    conc.wt <- conc*sqrt(wt)

#  Step (iii): update estimation of beta by WLS with the weights held
#  fixed.  First create the updated data frame of transformed responses
#  and weights for use by nls()

    indo.dat2 <- data.frame(time=time,conc.wt=conc.wt,wt=wt)
    indo.glsfit <- nls(conc.wt ~ weight.func(time,b1,b2,b3,b4,wt),indo.dat2,
                       list(b1=0.69,b2=0.69,b3=-1.6,b4=-1.6))

#  Get the updated GLS estimate to use for constructing weights on the
#  next iteration

    bgls <- coef(indo.glsfit)

#  Print results of this iteration to the output file

    cat("Iteration ",k,"\n",file="indo.gls1.Rout",append=T)
    cat("GLS
estimate of beta = ",round(bgls,6),"\n","\n",file="indo.gls1.Rout",append=T)
}

#  Finished iteration loop; now compute the estimate of sigma^2 based on
#  the final GLS estimate.  Use the "adjusted" version

mu <- unweight.func(time,bgls[1],bgls[2],bgls[3],bgls[4])
resid <- conc-mu
g <- mu
sigma2 <- sum((resid/g)^2)/(n-p)
sigma <- sqrt(sigma2)

#  Print out the final estimate of sigma and the summary provided by the
#  nls() function

cat("Final estimate of sigma = ",round(sigma,6),"\n","\n",
    file="indo.gls1.Rout",append=T)
sink("indo.gls1.Rout",append=T)
print(summary(indo.glsfit))
sink()

OUTPUT:

FIT OF THE INDOMETHACIN DATA BY GLS

OLS estimate =  1.271474 1.040768 -1.232717 -1.506841

Iteration  1
GLS estimate of beta =  1.217788 0.948885 -1.472645 -1.793258

Iteration  2
GLS estimate of beta =  1.214496 0.950814 -1.452073 -1.767919

Iteration  3
GLS estimate of beta =  1.214645 0.950411 -1.454219 -1.770534

Iteration  4
GLS estimate of beta =  1.214612 0.950432 -1.454021 -1.770288

Iteration  5
GLS estimate of beta =  1.214614 0.950429 -1.454042 -1.770313

Iteration  6
GLS estimate of beta =  1.214613 0.950429 -1.45404 -1.770311

Iteration  7
GLS estimate of beta =  1.214613 0.950429 -1.45404 -1.770311

Iteration  8
GLS estimate of beta =  1.214613 0.950429 -1.45404 -1.770311

Iteration  9
GLS estimate of beta =  1.214613 0.950429 -1.45404 -1.770311

Iteration  10
GLS estimate of beta =  1.214613 0.950429 -1.45404 -1.770311

Final estimate of sigma =  0.17168

Formula: conc.wt ~ weight.func(time, b1, b2, b3, b4, wt)

Parameters:
    Estimate  Std. Error  t value  Pr(>|t|)
b1    1.2146      0.2380    5.104  0.001393
b2    0.9504      0.1599    5.945  0.000573
b3   -1.4540      0.2135   -6.810  0.000251
b4   -1.7703      0.2361   -7.499  0.000137

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1717 on 7 degrees of freedom

Correlation of Parameter Estimates:
        b1      b2      b3
b2  0.8467
b3  0.2669  0.5701
b4  0.2284  0.5051  0.9380

7  Detection and modeling of nonconstant variance

7.1  Introduction

So far, we have focused on approaches to inference in mean-variance models of the form

E(Y_j | x_j) = f(x_j, β),   var(Y_j | x_j) = σ²g²(β, θ, x_j),   (7.1)

under the assumption that we
have already specified such a model.

• Often, a model for the mean may be suggested by the nature of the response (e.g., binary or count), by subject-matter theoretical considerations (e.g., pharmacokinetics), or by the empirical evidence (e.g., models for assay response).

• A model for variance may or may not be suggested by these features.  When the response is binary, the form of the variance is indeed dictated by the relevance of the Bernoulli distribution, while for data in the form of counts or proportions, where the Poisson or binomial models may be appropriate, the form of the variance is again suggested.  One may wish to consider the possibility of over- or underdispersion in these situations; this may reasonably be carried out by fitting a model that accommodates these features and determining whether an improvement in fit is apparent, using methods for inference on variance parameters we will discuss in Chapter 12.

Alternatively, when the response is continuous or approximately continuous, it is often the situation that there is not necessarily an obvious, relevant distributional model.  As we have discussed in some of the examples we have considered, several sources of variation may combine to produce patterns that are not well described by the kinds of variance models dictated by popular distributional assumptions, such as the gamma or lognormal distributions.

In fact, it may be unclear whether heterogeneity of variance is even an issue at all.  In some applications it is expected, and popular models may be available; in others, whether variance changes with the mean or covariate values may need to be deduced from the data.

In these situations, methods are required for detecting nonconstant variance, determining whether it changes smoothly across the range of the response or covariates, and identifying an appropriate model to characterize the change.  To address these issues, both formal and informal approaches have been proposed.

• Graphical techniques:  Both for detection
and modeling, these often have a subjective flavor.  In this chapter, we will focus on these procedures.

• Formal hypothesis testing:  Formal procedures are mainly used for detection.  We will defer discussion of these until after we have covered the large-sample theoretical developments on which they are based.  Because of the complexity of (7.1), no finite-sample, exact methods are available in general.

COMMON THEME:  Most graphical approaches are based on the OLS residuals

r_j = Y_j − f(x_j, β̂_OLS)

and functions thereof, or on related constructs.  Our main focus will be on detection and modeling in situations where the response is continuous or nearly continuous, such as in the case of moderate to large counts.  A complementary treatment of some of the approaches we will discuss may be found in Carroll and Ruppert (1988), Sections 2.7 and 2.8.

7.2  Plots based on residuals

We begin by first reviewing the basic rationale for the use of residuals as a tool for detecting nonconstant variance in regression.  The usual residual plots described in a first course in linear regression analysis apply equally well in the nonlinear model situation.  Specifically, one usually plots the r_j or the standardized residuals r_j/σ̂_OLS, where

σ̂²_OLS = (n − p)⁻¹ Σ_{j=1}^n r_j²,

versus one or more of the following:

• Predicted values f̂_j = f(x_j, β̂_OLS)

• Covariates (elements of x_j)

• log f̂_j, in cases where many responses tend to be clustered in a very narrow range, in order to "stretch things out" so that any patterns might be more readily discernible.  We will see the value of this for some nonlinear models and designs later.

If the plots exhibit an apparent pattern, with the magnitude of residuals changing with the level of predicted value or covariate, this is taken as evidence of potential nonconstant variance.  In particular, for the plot of residuals vs. predicted values or their logarithms, a "fan" shape is accepted as evidence that variance increases smoothly with the level of the response mean.  More generally, any nonhaphazard, systematic pattern
may well be evidence that variance does not remain constant over the range of the response.  One must be careful, however:

• A systematic pattern may also be the result of an ill-fitting mean model.  The nature of the pattern must be critically assessed by the data analyst to determine a reasonable explanation for it, given the particular mean model and circumstances.  For example, for the indomethacin pharmacokinetic data in Examples 1.1 and 1.2, the model was the sum of two exponential terms.  If a simple model containing only a single exponential term were fitted to these data, one would expect to see a systematic pattern in the residuals, reflecting the lack of fit of this model.  There is certainly subjectivity involved in this endeavor.

• When responses are collected in time order, e.g., repeated measurements on the same individuals, one often plots the residuals against time to look for temporal patterns that may suggest possible serial correlation.  Alternatively, more sophisticated plots for investigating this are available.  We defer discussion of serial correlation until later chapters, as our current focus is on detecting and modeling nonconstant variance when the assumption of independence is reasonable.  It is important to recognize, however, that this is an assumption that should be considered carefully in practice.

MOTIVATION:  The obvious motivation for the usual plots is that r_j is a proxy for the true deviation e_j = Y_j − f(x_j, β).  If the data are normally, or at least symmetrically, distributed with constant variance, we would expect the r_j to be roughly symmetrically distributed about 0 and to have approximately constant variance.

• We would thus expect a haphazard pattern, with approximately equal numbers of positive and negative residuals with approximately the same magnitude across their entire range.

• Even if the variance were nonconstant, if the data were at least normally or symmetrically distributed, we would still expect approximately equal numbers of positive and negative residuals.
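These qualitative expectations are easy to reproduce in a small simulation.  The following sketch is purely illustrative and not part of the notes: the straight-line mean model, sample size, and error scales are invented for demonstration.  It contrasts the behavior of the magnitudes of OLS residuals under constant variance and under standard deviation proportional to the mean.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
x = np.linspace(1.0, 10.0, n)
mu = 1.0 + x                                  # true mean E(Y|x) = 1 + x
X = np.column_stack([np.ones(n), x])          # design matrix for the OLS fit

def abs_resid_trend(y):
    """Fit a straight line by OLS; return corr(|residual|, fitted value)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    fit = X @ beta
    r = y - fit
    return np.corrcoef(np.abs(r), fit)[0, 1]

# Constant variance: residual magnitudes show essentially no trend
y_const = mu + rng.normal(0.0, 1.0, n)
# Standard deviation proportional to the mean: the classic "fan" shape
y_fan = mu + rng.normal(0.0, 0.3 * mu, n)

print(abs_resid_trend(y_const))   # small in magnitude
print(abs_resid_trend(y_fan))     # clearly positive
```

In the heteroscedastic scenario the magnitudes |r_j| trend upward with the fitted values, which is exactly the "fan" pattern the plots described here are designed to reveal; under constant variance no such trend appears.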
However, we would expect changing magnitude across the range.

PROBLEMS WITH THE USUAL PLOTS:  The OLS residuals r_j may not have exactly the same properties as the true deviations, because β is replaced by the OLS estimator β̂_OLS.  We will tackle this issue shortly.  Some more immediate problems that may make the usual plots difficult to interpret are as follows:

• The data may not be normally or even symmetrically distributed, but may instead arise from a skewed, asymmetric distribution.

• The design (the settings of the x_j) may be such that an unusual pattern of residuals may be due to something other than nonconstant variance.

• Furthermore, although the usual plots may be sufficient for detection, they may not be very helpful for modeling of nonconstant variance.

We thus consider refinements of the usual plots.

REFINEMENT 1:  A common idea is to base plots on transformations of absolute residuals or other residuals, in order to account for sample size or asymmetry.  A seminal reference for some of these ideas in the context of linear regression is Cook and Weisberg (1983).  We have already discussed estimation of variance parameters based on transformations of absolute residuals, so it should come as no surprise that diagnostic plots would also be based on them.

IDEA 1:  "Visually double the sample size."  The usual plots may be difficult to interpret because the sample size is small.  Under such conditions, a change in the placement of just a single residual in the plot can change the apparent pattern substantially.  Thus, each observation may be very influential "to the eye" in gauging the pattern.

A simple remedy is to plot r_j² or r_j²/σ̂²_OLS instead.  In this plot, the magnitude, but not the sign, of the residuals is emphasized.  Because the contribution of all residuals is positive, this has the effect of creating a "larger" sample size for the purpose of spotting changes in magnitude.  Moreover, the visual influence of any single observation in dictating the pattern is reduced.

Recall the data on
the pharmacokinetics of indomethacin in Examples 1.1 and 1.2.  Here, n = 11 concentration responses were collected over time on a single subject.

The data are plotted again in Figure 7.3 in Section 7.5; a usual residual plot was given in Figure 1.3 and exhibits a fan-shaped pattern that appears roughly symmetric about zero.  Note that the residuals have been plotted against the logarithm of predicted values; because the response "tails off" rather quickly, there are many residuals at very small values of the response, so that residuals plotted against predicted values themselves are "bunched up" near zero, making the pattern difficult to assess.  Figure 7.4 in Section 7.5 shows a plot of squared standardized residuals against log predicted values and shows a "wedge" shape, indicating the increase in magnitude across the range.

One could substitute absolute residuals |r_j| for squared ones and make similar plots.  A purported advantage of squared over absolute residuals themselves is that squaring tends to highlight residuals large in magnitude and downplay those that are small, thus drawing attention to changes in magnitude over the range.  A potential drawback is that squaring may artificially accentuate residuals corresponding to outlying, anomalous observations.  A further drawback is that, although one may gain better ability to spot a trend, any asymmetry of the pattern is obscured.  Thus, such plots should not be made in lieu of the usual ones, but rather should be supplementary.  In fact, the squaring operation may be misleading in another way.

IDEA 2:  Refine Idea 1.  McCullagh and Nelder (1989, Section 2.4.2) expand on this idea.  Squaring residuals can cause a problem, which we now discuss heuristically.  Suppose the data were exactly normally distributed.  Then, at least approximately, the r_j² would have a χ² distribution.  Of course, the χ² distribution is a special case of a gamma distribution and is skewed.  Thus, if we plot squared residuals under these conditions, some of the observed
pattern in the plot may well be due to expected asymmetry of the r_j² and not to underlying nonconstant variance in the response.

The proposed remedy is to consider other transformations t(r_j) of residuals, such that the transformed residuals would be expected to be "as normal as possible," and hence symmetrically distributed.  That is, find a transformation t(·) satisfying this condition.  If plots were based on the t(r_j) instead, presumably any observed pattern could be attributed only to nonconstant variance and not to asymmetry.

ANSCOMBE RESIDUALS:  This is based on consideration of so-called Anscombe residuals.  If variance depends on the mean through some function g(μ), e.g., one of the "scaled exponential family" models, then define

A(y) = ∫^y g^{−2/3}(u) du.   (7.2)

The transformation makes the distribution of the transformed variable A(Y) "as close to normal as possible."

• If Y has Poisson-like variance, g(μ) = μ^{1/2}, (7.2) implies A(y) = ∫^y u^{−1/3} du = (3/2)y^{2/3} ∝ y^{2/3}.

• If Y has gamma-like variance, g(μ) = μ, then A(y) = ∫^y u^{−2/3} du = 3y^{1/3} ∝ y^{1/3}.

This may be used in the context of residual plots as follows.  We noted that, for normally distributed data Y_j, the residuals r_j² are approximately χ²-distributed.  The χ² distribution is a gamma distribution; thus, the above suggests that the (r_j²)^{1/3} = r_j^{2/3} should be approximately normally distributed.  The suggestion is thus to plot the r_j^{2/3} rather than the r_j² against predicted values or covariates.  These transformed residuals still "visually double the sample size"; moreover, if the data truly are normally distributed, we would expect the pattern they exhibit to not be the result of their asymmetry, but rather to reflect nonconstant variance, if it exists.  Carroll and Ruppert (1988, pp. 30-31) advocate this kind of plot in the case where the original response may be approximately normal.

For the indomethacin data, the usual residual plot seems fairly symmetric, suggesting that the normality assumption for the Y_j may not be unreasonable.  Figure 7.5(a) in Section 7.5 shows the original residual plot, and (d) shows the plot of 2/3-root residuals against log
predicted values.  Comparing to Figure 7.4, note that the pattern of increase with predicted value is not nearly as dramatic, suggesting that the impression in Figure 7.4 may in part be due to asymmetry of the squared residuals.

This transformation idea may be used in another way; the technique we are about to discuss is the basis for the term Anscombe residual.  Here, we form a different kind of residual based on (7.2) and use it to assess the validity of a distributional assumption for the original data that may in turn dictate a particular variance model.  To do this, if one suspects that the data themselves may not be normally distributed, but rather may follow, or be closely approximated by, a distribution in the "scaled exponential family" class, then one might form residuals on an appropriate transformed scale instead of the usual ones.  E.g., continuous data may be skewed at each x_j, so something like a gamma distribution may be a closer representation of the truth than the normal.

To illustrate, suppose that we suspect that the Y_j may be Poisson distributed at each x_j.  From (7.2), the residual

Y_j^{2/3} − f̂_j^{2/3}

would be expected to be close to normally distributed if the Poisson assumption were valid.  However, although the distribution of this residual for each j may be "more normal," its variance across j may not be constant.  Now, if the Poisson assumption is valid, a first-order Taylor series approximation,

A(Y) ≈ A(μ) + A′(μ)(Y − μ),   A′(y) = (d/dy)A(y),

yields, upon rearrangement,

var{A(Y)} ≈ var(Y){A′(μ)}²,

so that, in the Poisson case, var{A(Y)} ≈ μ(μ^{−1/3})² = μ^{1/3}.  Thus, the suggestion, if one wishes to verify graphically the appropriateness of the Poisson assumption, or at least the Poisson-like variance model, is to plot

(Y_j^{2/3} − f̂_j^{2/3})/f̂_j^{1/6}

vs. predicted values.  If the assumption is reasonable, we would expect to see symmetry about 0, as these residuals should be approximately normal, with haphazard scatter about 0, as we have scaled each residual by an estimate of its standard deviation, so that these standardized
residuals should be of approximately equal magnitude across the range.  If such a pattern does not emerge, it may suggest that the Poisson variance conjecture is not correct, and further investigation is required.  Of course, the same idea could be used with other distributions, e.g., the gamma.

Distributional considerations are not the only issue one must think about when constructing and interpreting plots.  The issue of design also plays a key role.  This is most clearly understood by first restricting attention to linear models.  Thus, we consider the linear model

E(Y_j | x_j) = x_jᵀβ   (7.3)

first, then generalize to the nonlinear case.

REFINEMENT 2:  We now consider the idea of studentization.  Consider the linear mean model (7.3), and write var(Y_j | x_j) = σ²w_j for some values w_j, which we will treat as fixed constants.  For the purposes of the following arguments, we will take the perspective that the x_j are fixed constants, as is conventional in this setting.
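Before working through the algebra that follows, it may help to see the key matrix quantities computed numerically.  The sketch below is illustrative only (the design values are invented, with one remote point at x = 15 in the spirit of Figure 7.1): it forms the hat matrix H = X(XᵀX)⁻¹Xᵀ and checks the properties used in the derivation (symmetry, idempotency, and trace(H) = p).

```python
import numpy as np

# A scalar-covariate design with one remote point, as in Figure 7.1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 15.0])
X = np.column_stack([np.ones_like(x), x])    # intercept + slope, so p = 2

# Hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)                               # leverage values h_jj

assert np.allclose(H, H.T)                   # H is symmetric
assert np.allclose(H @ H, H)                 # H is idempotent
assert np.isclose(h.sum(), 2.0)              # trace(H) = p, so 0 <= h_jj <= 1

# The remote design point x = 15 dominates: its leverage is about 0.90,
# while the leverages for the bulk of the points are modest
print(np.round(h, 3))
```

With these values in hand, the result derived next, that var(r_j) = σ²(1 − h_jj) under constant variance, says the residual at the remote point has much smaller variance than the others, purely as an artifact of the design.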
thjEe 7 210 Z hjkEEjEk Z hgkwglmei j k1 Z hjkhjlwglszlZEkke k7 Using the independence of the 6739 and 1 we obtain 7L mm a2w117 2h 7 Z hgkwgl a2w117 1m Z hgkwgl 75 k1 177739 PAGE 162 CHAPTER 7 ST 762 M DAVDDIAN Now consider the case of constant variance so that wj E 1 for all j Under this condition 75 becomes varm 02m e haygt2 27 167 Because H is a symmetric idempotent matrix 0 hjj 1 and V L 2 hjj Z hm k1 from whence it follows that 2 hit 1 i 7m 1w Thus we obtain the nal result that under constant variance we expect mm 0217 71 76 IMPLICATION If all the hjj are of approximately the same magnitude for all j then varrj is approximately constant across j On the other hand if the hjj are quite different across j then varrj will be expected to vary That is under these conditions the Ti will have nonconstant variance even if the original data do not Thus inspecting plots based on the 77 could lead one to conclude erroneously that there is evidence of nonconstant variance even when there is not This is possible in the event the hjj vary across j LEVERAGE When do the hjj vary The hjj are called the leverage values corresponding to each design point 211739 Loosely speaking leverage is a measure of how remote77 an observation is from the remaining observations in the design space77 The simplest example of this is when zj is scalar Figure 71 exempli es the situation Note that the design point z 15 is far removed from the rest of the x values which are in the range from 0 to 5 The gure shows the effect of the placement of an observed response at z 15 the dashed lines are OLS ts of a simple linear regression to the data sets containing all the responses at z 0 to 5 along with either one of the two depicted responses at z 15 and show the dramatic in uence the response at this design point has on the tted model A point such as x 15 in this example will turn out to have a large77 value of hjj relative to those for the other design points Such a point is called a high leverage point 
and has a potentially influential role in determining the fit of the model.  In the situation of Figure 7.1, x = 15 almost entirely dictates the fit.

[Figure 7.1:  A "high leverage" point.]

RESULT:  The magnitude of h_jj in a linear model is dictated by the design.  Thus, even when variance really is constant, a design with high leverage points will yield OLS residuals whose variances are nonconstant in a nontrivial way.  The observed pattern of r_j in the usual or refined residual plots may appear to reflect nonconstant variance, when in reality it is an artifact of the design.

REMEDY:  An obvious modification is to calculate the h_jj values, which in a linear model depend only on the known values x_j, and replace the r_j in residual plots by the so-called studentized residuals

b_j = r_j/{σ̂_OLS(1 − h_jj)^{1/2}},

which clearly are such that var(b_j) ≈ 1.  Most linear regression software, such as SAS proc reg, computes studentized residuals automatically.  Hopefully, if variance really is homogeneous, we will not be misled by a pattern that is actually due to design if we use studentized rather than ordinary residuals.  Thus, the suggestion is to replace r_j in all of the plots we have discussed previously by b_j in the linear case, so that one would plot, for example, b_j² rather than r_j²/σ̂²_OLS, and use b_j^{2/3} rather than r_j^{2/3}.

EXTENSION:  What if the variance is in fact nonconstant?  That is, suppose we suspect that variance may not be constant, but instead follows a smooth relationship dictated by a variance function g(β, θ, x_j).  What should we plot to investigate this, taking potential issues of leverage into account?

Cook and Weisberg (1983) suggest the following approach.  Note that if, in reality, var(Y_j) = σ²w_j, then still E(b_j) ≈ 0, but now

var(r_j) = σ²{w_j(1 − h_jj)² + Σ_{k≠j} h_jk² w_k}.   (7.7)

Suppose we suspect that variance is of the form var(Y_j) = σ²g²(β, θ, x_j), where g(β, θ, x_j) is such that θ is a scalar and g(β, 0, x_j) = 1.  This is satisfied by many popular variance functions, e.g., the power-of-the-mean
model.  Note that, under this condition, writing w_j = g²(β, θ, x_j), a Taylor series about θ = 0 yields

w_j = g²(β, θ, x_j) ≈ g²(β, 0, x_j) + 2g(β, 0, x_j) v(β, x_j)(θ − 0) ≈ 1 + 2θ v(β, x_j),

where v(β, x_j) = (∂/∂θ) g(β, θ, x_j) evaluated at θ = 0.  Since b_j = r_j/{σ̂_OLS(1 − h_jj)^{1/2}}, consider var(σ̂_OLS b_j) = var(r_j)/(1 − h_jj).  Replacing the w's in (7.7) by the approximation above yields

var(σ̂_OLS b_j) ≈ σ²[(1 − h_jj){1 + 2θ v(β, x_j)} + (1 − h_jj)⁻¹ Σ_{k≠j} h_jk²{1 + 2θ v(β, x_k)}]
             = σ²[1 + 2θ(1 − h_jj)v(β, x_j) + 2θ(1 − h_jj)⁻¹ Σ_{k≠j} h_jk² v(β, x_k)],

where we have used the fact that Σ_{k≠j} h_jk² = h_jj(1 − h_jj).  Under the further assumption that the h_jk, j ≠ k, are small, Cook and Weisberg (1983) approximated this as

var(σ̂_OLS b_j) ≈ σ²{1 + 2θ v(β, x_j)(1 − h_jj)}.

RESULT:  This approximation suggests plotting b_j² versus v(β̂_OLS, x_j)(1 − h_jj), ignoring the fact that β̂_OLS is random, as a diagnostic for nonconstant variance thought to have the form σ²g²(β, θ, x_j).  This plot should offer protection against design-induced residual patterns that may mislead the analyst if the variance really is constant, and should have nonzero slope, approximately equal to 2θ, in the event that the variance really is nonconstant with variance function g(β, θ, x_j).  Thus, not only does this plot allow for detection of nonconstant variance, but it also gives information on the relevance of a particular model.

• One could construct the plot for different candidate models and compare.  The plot that appears most like a linear relationship might be adopted on empirical grounds if no variance model is naturally suggested by subject-matter considerations.

• The plot also gives information on the likely value of θ.

Although Cook and Weisberg (1983) considered only the model g²(θ, x_j) = exp(θ z_j), other models could also be considered.  For example, the model g(β, θ, x_j) = exp{θ x_jᵀβ} leads to v(β, x_j) = x_jᵀβ, and g(β, θ, x_j) = |x_jᵀβ|^θ leads to v(β, x_j) = log|x_jᵀβ|.  In this case of the power-of-the-mean variance function, then, the suggestion would be to plot b_j² versus (1 − h_jj) log Ŷ_j,OLS.

EXTENSION TO NONLINEAR MODELS:  One may extend the notions of leverage and studentization, at least approximately, to nonlinear mean models as follows.  Continuing to
regard the x_j as fixed constants (so suppressing conditioning), suppose we have E(Y_j) = f(x_j, β).  As before, define f(β) as the (n × 1) vector with jth element f(x_j, β), and

X(β) = (∂/∂βᵀ) f(β)   (n × p).

By a linear approximation, for β "close to" β̂_OLS, we may write the vector of residuals as

r = Y − f(β̂_OLS) ≈ Y − f(β) − X(β)(β̂_OLS − β),

and we know that β̂_OLS satisfies, again by a linear approximation,

0 = Xᵀ(β̂_OLS){Y − f(β̂_OLS)} ≈ Xᵀ(β){Y − f(β)} − Xᵀ(β)X(β)(β̂_OLS − β) + (remainder).

Ignoring the last term, as it involves the product of {Y − f(β)} and (β̂_OLS − β), which should be small relative to the others, we obtain

β̂_OLS − β ≈ {Xᵀ(β)X(β)}⁻¹Xᵀ(β){Y − f(β)},

which is just a result we have seen previously (e.g., in Chapter 3).  Combining, we arrive at the approximation

r ≈ [I_n − X(β){Xᵀ(β)X(β)}⁻¹Xᵀ(β)]{Y − f(β)} = {I_n − H(β)}{Y − f(β)}.

Here, H(β) = X(β){Xᵀ(β)X(β)}⁻¹Xᵀ(β) is the approximate "hat matrix."  Thus, if var(Y_j) = σ²w_j, by analogy to the linear case, we have, approximately, that

r ≈ σ{I_n − H(β)}W^{1/2}e,

where e is defined in the obvious way.  The implication is that one may regard the diagonal elements of H(β) as approximate "leverage values."  This makes some intuitive sense: in a nonlinear model, the ramifications of design will be felt not only through the actual design points, but also through the behavior of the function f at those points.  Unlike a linear model, a nonlinear model allows the changes in f at different x_j settings to be different, as the derivative of f depends on both x_j and β in general.  Consequently, depending on how f changes in different parts of the design space, different observations will exert different amounts of influence on the fit.  Of course, here, H(β) depends on β, so for practical implementation we would need to substitute a likely value for β to obtain approximate leverage values, e.g., β̂_OLS.

RESULT:  In the nonlinear case, one may apply the same ideas as in the linear case to take into account the effects of leverage, using as approximate leverage values the values ĥ_jj, the diagonal elements of the matrix H(β̂_OLS).  In practice, for linear or nonlinear models, it is often the case that analysts ignore the Cook-Weisberg correction
In practice, for linear or nonlinear models, it is often the case that analysts ignore the Cook-Weisberg correction and plot $r_j^2$ versus $x_j$ or $\log f(x_j, \hat\beta_{OLS})$, for example, as diagnostics for the exponential and power models, respectively.

OTHER PLOTS: Carroll and Ruppert (1988, Chapters 2 and 3) advocate plotting other transformations of (studentized) residuals. For example:

- If $g(\beta, \theta, x_j) = \exp\{\theta f(x_j, \beta)\}$, then $\log \mathrm{var}^{1/2}(Y_j) = \log\sigma + \theta f(x_j, \beta)$. Thus, the suggestion is to plot $\log|r_j|$ or $\log|b_j|$ versus $\hat Y_j = f(x_j, \hat\beta_{OLS})$, using the absolute residuals as a proxy for $\mathrm{var}^{1/2}(Y_j)$.
- Similarly, if $g(\beta, \theta, x_j) = f^\theta(x_j, \beta)$, then $\log \mathrm{var}^{1/2}(Y_j) = \log\sigma + \theta \log f(x_j, \beta)$, and the suggestion is to plot $\log|r_j|$ or $\log|b_j|$ versus $\log f(x_j, \hat\beta_{OLS})$.

[Figure 7.2: The data density issue. Plot of absolute residuals versus predicted values.]

A PRACTICAL ISSUE FOR ALL PLOTS: Data density. Varying degrees of data density along the horizontal axis of any plot may give a misleading impression. For example, consider the plot of absolute residuals versus predicted values in Figure 7.2. One may be tempted to interpret this plot as having some evidence of a "wedge" shape, as the residuals in the range 10 to 15 are mostly small, while the range 20 to 30 seems to contain many that are of greater magnitude. Such a plot might tempt the analyst to suspect nonconstant variance. However, because the first data segment is so much sparser than the second, it is not surprising that we might end up seeing only a few large residuals in the first segment by chance, even if the data really do have constant variance. Thus, the varying degrees of density of observations may yield illusions of patterns that may not reflect a real phenomenon at all. Carroll and Ruppert (1988, p. 154) cite a real-data example of this issue.

SUMMARY: Plots based on residuals may be useful for both detection and model selection. In the latter case, the presence of nonconstant variance may be acknowledged a priori by virtue of the application, and default variance models may even be available. In this situation, the plots may be useful
for verifying the relevance of the model or for identifying departures from it that may call for a different one. It is important that the data analyst be well aware that interpretation of the plots is somewhat of an "art form," owing to their approximate and ad hoc nature.

7.3 Did it work?

Suppose we construct diagnostic plots, review them, identify nonconstant variance, and select a model. We then refit the model taking this into account, e.g., by GLS, PL, or another method. Can we check graphically for evidence that the assumed variance model accounts adequately for the form of the nonconstant variance?

IDEA: Construct the same plots using weighted residuals that take into account the form of the variance. That is, if the chosen variance model is $\sigma^2 g^2(\beta, \theta, x_j)$, and we estimate $\beta$ (and perhaps an unknown $\theta$, too) by a method that takes nonconstant variance into account, the standardized weighted residuals are

$r_{w,j} = \frac{Y_j - f(x_j, \hat\beta)}{g(\hat\beta, \hat\theta, x_j)},$

where $\hat\theta$ and $\hat\beta$ are the estimates. These standardized weighted residuals should have the same properties as standardized ordinary residuals would if constant variance were valid, as they are weighted for each $j$ by the appropriate factor. A studentized version of weighted residuals is possible, defined by analogy to the unweighted case to account for leverage. These may be constructed by considering the transformed problem based on the estimated weights. That is, define

$\hat W = \mathrm{diag}\{g^{-2}(\hat\beta, \hat\theta, x_1), \ldots, g^{-2}(\hat\beta, \hat\theta, x_n)\},$

and let $Y_w = \hat W^{1/2} Y$, $X_w(\beta) = \hat W^{1/2} X(\beta)$, $f_w(\beta) = \hat W^{1/2} f(\beta)$. Consider the particular case of GLS estimation of $\beta$. Then $\hat\beta$ satisfies the estimating equation $0 = X_w^T(\hat\beta)\{Y_w - f_w(\hat\beta)\}$, and, moreover, the vector of weighted residuals is $r_w = \hat W^{1/2}\{Y - f(\hat\beta)\}$. By an argument analogous to that leading to the form of $r$ on page 167, we may obtain

$r_w \approx \{I_n - H_w(\beta)\}\hat W^{1/2}\{Y - f(\beta)\},$

where $H_w(\beta) = X_w(\beta)\{X_w^T(\beta)X_w(\beta)\}^{-1}X_w^T(\beta)$, the approximate hat matrix for the transformed problem. Studentized weighted residuals may be constructed in the obvious way on the transformed scale, where $\hat\beta$ would be substituted.
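The construction just described can be sketched numerically; everything below (the function name, the QR route to the transformed-problem hat matrix, the simulated data) is our own scaffolding around the formulas above, not the notes' implementation:

```python
import numpy as np

def studentized_weighted_residuals(y, fitted, jac, g):
    """Standardized weighted residuals (y_j - f_j)/g_j and an approximately
    studentized version dividing out sqrt(1 - h_w,jj), with h_w,jj the
    diagonal of the hat matrix of the transformed problem W^{1/2} X(beta)."""
    rw = (y - fitted) / g
    Xw = jac / g[:, None]                 # W^{1/2} X(beta), W = diag(1/g^2)
    Q, _ = np.linalg.qr(Xw)
    hw = np.sum(Q * Q, axis=1)            # weighted leverage values
    n, p = Xw.shape
    sigma = np.sqrt(np.sum(rw ** 2) / (n - p))
    return rw, rw / (sigma * np.sqrt(1.0 - hw))
```
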
If the approach to taking account of nonconstant variance worked, one would expect to see plots that show no systematic patterns.

7.4 Restricted maximum pseudolikelihood

In Chapter 6, we discussed estimating equation approaches to estimation of variance function parameters. As was evident from that discussion, although such equations may be based on different transformations of absolute residuals, the most popular approach is to use squared residuals, and hence to solve a quadratic estimating equation, as in the PL method. This is in part driven by the fact that the estimating equation will be unbiased by construction; moreover, squared residuals seem a natural choice for estimating variances.

The foregoing discussion suggests that plots based on ordinary residuals may be misleading due to failure to account for leverage. An obvious concern is thus whether methods for estimation of variance parameters based on ordinary residuals might not also be subject to the same problem. This is one way to motivate consideration of a modification of the PL technique known as restricted maximum likelihood, which we might more aptly term "restricted pseudolikelihood." The usual abbreviation, which we will adopt, is REML.

Recall that, for some fixed value of $\beta$, $\hat\beta$, the PL estimators for $\sigma$ and $\theta$ in the general model (7.1) solve

$\sum_{j=1}^n \left[ \frac{\{Y_j - f(x_j, \hat\beta)\}^2}{\sigma^2 g^2(\hat\beta, \theta, x_j)} - 1 \right] \begin{pmatrix} 1/\sigma \\ \nu(\hat\beta, \theta, x_j) \end{pmatrix} = 0,$

where the weights depend on the unknown $\theta$ to be estimated. Now, from the arguments of the last section, we have that $r_w \approx \{I_n - H_w(\hat\beta)\}W^{1/2}\{Y - f(\beta)\}$, so that $\mathrm{var}(r_w) \approx \sigma^2\{I_n - H_w(\hat\beta)\}$, where it is understood here that $H_w$ depends on the unknown $\theta$. Thus,

$\mathrm{var}(r_{w,j}) \approx E(r_{w,j}^2) \approx \sigma^2(1 - \hat h_{w,jj}),$

where $\hat h_{w,jj}$ is the $j$th diagonal element of $H_w(\hat\beta)$, which also depends on the unknown $\theta$. Rewrite the estimating equation as

$\sum_{j=1}^n \frac{\{Y_j - f(x_j, \hat\beta)\}^2}{g^2(\hat\beta, \theta, x_j)} \begin{pmatrix} 1/\sigma \\ \nu(\hat\beta, \theta, x_j) \end{pmatrix} = \sigma^2 \sum_{j=1}^n \begin{pmatrix} 1/\sigma \\ \nu(\hat\beta, \theta, x_j) \end{pmatrix}.$  (7.8)

Note that, if $\hat\beta$ were replaced by the truth, then the expectation of the left-hand side of (7.8) would be exactly equal to the right-hand side. Thus, solving the PL equation
in $\sigma$ and $\theta$ may be viewed as equating a function of weighted squared deviations to its expectation, ignoring the fact that $\beta$ must be replaced by an estimator in practice.

Suppose we were not to ignore the fact that $\hat\beta$ has been substituted. From above, the left-hand side of (7.8) would have approximate expectation

$\sigma^2 \sum_{j=1}^n (1 - \hat h_{w,jj}) \begin{pmatrix} 1/\sigma \\ \nu(\hat\beta, \theta, x_j) \end{pmatrix},$

depending on the "leverage values." This observation suggests a modification of the PL estimating equation to take account of leverage; namely, instead of solving (7.8), one would instead solve

$\sum_{j=1}^n \frac{\{Y_j - f(x_j, \hat\beta)\}^2}{g^2(\hat\beta, \theta, x_j)} \begin{pmatrix} 1/\sigma \\ \nu(\hat\beta, \theta, x_j) \end{pmatrix} = \sigma^2 \sum_{j=1}^n (1 - \hat h_{w,jj}) \begin{pmatrix} 1/\sigma \\ \nu(\hat\beta, \theta, x_j) \end{pmatrix}.$  (7.9)

FACT: Because $H_w(\beta)$ is a symmetric, idempotent matrix, assuming that $X_w(\beta)$ has rank $p$ (so that $X_w^T X_w$ is invertible), $H_w(\beta)$ has rank $p$, and it is true that

$\mathrm{trace}\{H_w(\beta)\} = \sum_{j=1}^n \hat h_{w,jj} = p,$

where the $\hat h_{w,jj}$ are the diagonal elements. Using this, it is straightforward to show that (7.9) may be written as

$\sum_{j=1}^n \frac{\{Y_j - f(x_j, \hat\beta)\}^2}{g^2(\hat\beta, \theta, x_j)} = \sigma^2 (n - p), \qquad \sum_{j=1}^n \left[\frac{\{Y_j - f(x_j, \hat\beta)\}^2}{\sigma^2 g^2(\hat\beta, \theta, x_j)} - (1 - \hat h_{w,jj})\right]\nu(\hat\beta, \theta, x_j) = 0,$  (7.10)

from which it follows that $\hat\sigma^2$ satisfies

$\hat\sigma^2 = \frac{1}{n-p}\sum_{j=1}^n \frac{\{Y_j - f(x_j, \hat\beta)\}^2}{g^2(\hat\beta, \hat\theta, x_j)}.$

RESULT: Solving (7.10) instead of the usual PL estimating equation automatically yields the bias-adjusted estimator for $\sigma^2$ we have discussed previously, where division is by $n - p$ rather than $n$. Recall from Section 3.5 that the estimator using the divisor $n$ rather than $n - p$ is often viewed as failing to account for estimation of $\beta$.

- Because $\theta$ is also a variance parameter, like $\sigma^2$, we might expect that the usual PL estimator for $\theta$ might be subject to a similar kind of bias.
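Both the FACT and the resulting $n - p$ divisor are easy to verify numerically. In this sketch, the weight function values and the "weighted residuals" are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.uniform(1.0, 2.0, n)])
g = rng.uniform(0.5, 2.0, n)            # assumed values g(betahat, theta, x_j)
Xw = X / g[:, None]                     # transformed-problem "design" W^{1/2} X
Q, _ = np.linalg.qr(Xw)
h_w = np.sum(Q * Q, axis=1)             # diagonal elements of H_w(betahat)

# trace(H_w) = sum of the leverages = p, since H_w is a rank-p projection
trace_ok = abs(h_w.sum() - p) < 1e-8

wres = rng.normal(size=n)               # illustrative weighted residuals
sigma2_pl = np.sum(wres ** 2) / n       # usual PL divisor n
sigma2_reml = np.sum(wres ** 2) / (n - p)   # bias-adjusted (REML-type) divisor
```
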
- It appears that solving (7.10) instead might somehow result in an estimator that is less biased. In practice, this is indeed the case.

The resulting estimators for $\sigma$ and $\theta$ obtained by solving (7.10) are referred to as the REML estimators.

IMPLEMENTATION: It turns out that solving (7.10) in $\gamma = (\sigma, \theta^T)^T$ for fixed $\beta$ (fixed at $\hat\beta$, for example) is equivalent to maximizing a certain objective function, just as PL is equivalent to maximizing the normal loglikelihood with $\beta$ held fixed. Recall that, evaluated at $\hat\beta$, the PL objective function (the normal loglikelihood, disregarding constant terms) is given by

$PL(\hat\beta, \sigma, \theta) = -\sum_{j=1}^n \frac{\{Y_j - f(x_j, \hat\beta)\}^2}{2\sigma^2 g^2(\hat\beta, \theta, x_j)} - n\log\sigma - \sum_{j=1}^n \log g(\hat\beta, \theta, x_j).$

It is possible to show that the objective function corresponding to (7.10) turns out to be

$REML(\hat\beta, \sigma, \theta) = PL(\hat\beta, \sigma, \theta) + p\log\sigma - \tfrac{1}{2}\log|X^T(\hat\beta)W(\hat\beta, \theta)X(\hat\beta)|,$  (7.11)

where $W(\beta, \theta) = \mathrm{diag}\{g^{-2}(\beta, \theta, x_1), \ldots, g^{-2}(\beta, \theta, x_n)\}$; we have added the $\theta$ argument to make clear the dependence on $\theta$, and $X(\beta)$ is as defined previously. The last term in (7.11) involves the determinant of $X^T(\beta)W(\beta, \theta)X(\beta)$, evaluated at $\hat\beta$.

It is not at all obvious that maximizing (7.11) in $(\sigma, \theta^T)^T$ is equivalent to solving (7.10). This may in fact be shown by some clever matrix manipulations and is left as an exercise. In particular, letting $N(\beta, \theta) = X^T(\beta)W(\beta, \theta)X(\beta)$, it may be shown that taking the derivative of (7.11) with respect to $\theta$ and setting it equal to zero yields the equation

$\sum_{j=1}^n \left[\frac{\{Y_j - f(x_j, \hat\beta)\}^2}{\sigma^2 g^2(\hat\beta, \theta, x_j)} - 1\right]\nu(\hat\beta, \theta, x_j) = \frac{1}{2}\frac{\partial}{\partial\theta}\log|N(\hat\beta, \theta)|.$  (7.12)

Of course, taking derivatives with respect to $\sigma$ gives the bias-adjusted estimator above. The equivalence follows by showing that the right-hand side of (7.12) is in fact equal to $-\sum_{j=1}^n \nu(\hat\beta, \theta, x_j)\,\hat h_{w,jj}$.

The fact that solving (7.10) is equivalent to maximizing (7.11) may be used to advantage in practice. It is possible to go through the same type of argument leading up to the "trick" on page 130 to derive a method for estimating $\theta$ using nonlinear regression software. It is again not possible to use just any such software (e.g., SAS proc nlin) because of the complex form of the "regression model," but it is not too difficult to write a program for general variance models. The details of this implementation approach are left as an exercise.

TERMINOLOGY: The terminology "restricted maximum likelihood" arises from the perspective that the objective function (7.11) has the form of the usual normal loglikelihood plus a penalty term that has the effect of imposing a restriction on the solution. From the above developments, the penalty term for our model has the effect of taking into account leverage, thus
incorporating the effect of having to estimate $\beta$, rather than knowing it, using the given design and mean model. Basically, the result is to use studentized rather than ordinary residuals.

7.5 Examples

We now consider two examples to illustrate the use of residual plots for detecting and modeling nonconstant variance.

EXAMPLE 7.1: Pharmacokinetics of indomethacin. Recall the data on the pharmacokinetics of indomethacin discussed in Examples 1.1 and 1.2. The data are concentrations (μg/ml) of indomethacin taken at $n = 11$ time points $x_j$ (hours) post dose at time 0. The model we consider is the biexponential, parameterized to enforce positivity:

$f(x_j, \beta) = e^{\beta_1}\exp(-e^{\beta_2} x_j) + e^{\beta_3}\exp(-e^{\beta_4} x_j).$

In this application, it is well established that variance tends to increase with the level of the response. A popular model for representing variance is the power-of-the-mean model $\sigma^2 f^{2\theta}(x_j, \beta)$. Often, $\theta = 1.0$ is a reasonable choice, yielding constant coefficient of variation; this value for $\theta$ is sometimes adopted by default with no validation. Sometimes, however, other values of $\theta$ provide a better characterization.

Figure 7.3 shows the raw data with the OLS fit and a GLS-PL fit, where $\hat\theta = 0.82$ (see Section 6.8 for full details). Note that the fits themselves are discernibly different. This is actually not terribly surprising. The tail of the curve at larger time points is determined by only a few observations. The OLS fit treats these as being of equal quality to those at earlier times, while the GLS fit regards them as more precise. Thus, the latter fit places more emphasis on these later observations in determining the fit. Note that the GLS fit goes through the last (presumably most precise) observation, while the OLS fit seems to compromise over where to place the fit at these later observations.

Figure 7.4 shows a plot of squared standardized ordinary (OLS) residuals versus the logarithm of predicted values. The plot shows a pronounced wedge shape, suggesting a rather severe increase in variance with level of the response. The evidence strongly supports the contention that
variance is not constant.

[Figure 7.3: Concentration-time data for a subject receiving intravenous indomethacin at time 0. The solid line is the OLS fit, and the dashed line is the GLS fit with $\theta$ estimated. Axes: concentration (μg/ml) versus time (hours).]

Of course, as discussed previously, this pattern may in part be due to asymmetry of the distribution of squared residuals. Figure 7.5 shows several residual plots. Panels (a) and (b) show the usual plot of residuals versus predicted values, where that in (b) replaces the ordinary residuals by studentized versions. The pattern is similar in both plots; note, however, that the magnitudes are somewhat different for some observations, reflecting the adjustment for leverage. The pattern appears fairly symmetric in these plots, especially (b), demonstrating that the common assumption of approximate normality may not be unreasonable for pharmacokinetic data. Panel (c) shows the logarithms of the absolute studentized residuals, $\log|b_j|$, and appears to follow an approximate linear trend, supporting the contention that the power model is reasonable. A simple linear regression fit to the observations in (c) gives a crude estimate of $\theta$ as the slope, equal to 0.42. Panel (d) shows the 2/3-root studentized residuals $|b_j|^{2/3}$ versus log predicted values and shows a pattern that is wedge-shaped, but not quite as profound as that in Figure 7.4. Presumably this reflects the fact that the residuals on this scale are more symmetrically distributed, so the pattern reflects only nonconstant variance and not asymmetry.

[Figure 7.4: Plot of squared standardized residuals $r_j^2/\hat\sigma^2$ versus log predicted values.]

Figure 7.6 shows the same plots as in Figure 7.5, but applied to the weighted residuals following the GLS-PL fit of the power model. In all panels, the pattern is haphazard, suggesting that weighting according to this variance model takes appropriate account of the nonconstant variance. The PL estimate of $\theta$ is 0.82, which
is close to 1.0. This estimate is likely preferable to that found by the simple linear regression applied to Figure 7.5(c), as will be made clear by the theoretical developments in Chapter 12.

An important implication of getting the variance "right" may be demonstrated in this example by considering estimation of the terminal half-life, a parameter of great physical interest to pharmacologists. The terminal half-life is the time that it takes the mean response in the second "phase" of the curve to decrease by half and is useful in determining appropriate dosing regimens. The terminal half-life is given by $(\log 2)\, e^{-\beta_4}$ hours here. Substituting the estimate of $\beta_4$ yields an estimated half-life of 3.13 hours based on the OLS fit and 3.96 hours based on the GLS-PL fit. The difference in point estimates is nearly one hour, which in a clinical sense is quite a big difference, as a difference of this magnitude could lead to establishing very different dosing regimens for a drug that is eliminated from the system rapidly, as is indomethacin. Of course, we have not yet discussed how to construct standard errors for these point estimates, so whether this difference is of importance is not clear. The point estimates do, however, suggest the potential for misleading interpretations if variance is not taken into appropriate account.

[Figure 7.5: Residual plots based on the OLS fit to the indomethacin data. (a) Usual plot of residuals vs. log predicted values; (b) studentized residuals vs. log predicted values; (c) log absolute studentized residuals vs. log predicted values; (d) 2/3-root studentized residuals vs. log predicted values.]

EXAMPLE 7.2: Oxidation of benzene. These data are also discussed by Carroll and Ruppert (1988, Section 2.8). An experiment was conducted to determine the relationship
between the initial rate of oxidation of benzene over a vanadium oxide catalyst at three different reaction temperatures and several benzene and oxygen concentrations. In particular, $n = 54$ observations on the following are available for the $j$th observation:

- initial rate of oxidation (disappearance of benzene), $\times 10^8$ gmole/(g sec)
- oxygen concentration, $\times 10^4$ gmole/L
- benzene concentration, $\times 10^4$ gmole/L
- $2000(1/T - 1/648)$, where $T$ is the absolute temperature in degrees Kelvin
- moles oxygen consumed per mole benzene

Thus, $x_j = (x_{j1}, x_{j2}, x_{j3}, x_{j4})^T$. The model for this reaction is the steady-state adsorption model

$f(x_j, \alpha) = \frac{100\,\alpha_1\alpha_2\, x_{j1} x_{j2}}{\alpha_1 x_{j1}\exp(\alpha_4 x_{j3}/2000) + \alpha_2\, x_{j4}\, x_{j2}\exp(\alpha_3 x_{j3}/2000)}.$

[Figure 7.6: Residual plots based on the GLS fit to the indomethacin data. (a) Usual plot of weighted residuals vs. log predicted values; (b) studentized weighted residuals vs. log predicted values; (c) log absolute studentized weighted residuals vs. log predicted values; (d) 2/3-root studentized weighted residuals vs. log predicted values.]

Here, the parameters are $\alpha_1 = A_1\exp\{-\Delta E_1/(R_g T_0)\}$, $\alpha_2 = A_2\exp\{-\Delta E_2/(R_g T_0)\}$, $\alpha_3 = \Delta E_1/R_g$, and $\alpha_4 = \Delta E_2/R_g$, where $T_0 = 648$ degrees Kelvin, $A_1$ and $A_2$ are constants, $\Delta E_1$ and $\Delta E_2$ are activation energies, and $R_g$ is the gas constant. Background on the scientific considerations underlying this kinetic model is given in Pritchard, Downie, and Bacon (1977). The objective of an analysis is to estimate the parameters of this model in order to characterize the rate of oxidation of benzene.

It turns out that, computationally, a reparameterization of the kinetic model is more stable; this reparameterization, given by Carroll and Ruppert (1988, Section 2.8), expresses the mean in terms of $\beta = (\beta_1, \beta_2, \beta_3, \beta_4)^T$. Of course, even this parameterization is highly nonlinear in the unknown parameters. Note that the model, in either parameterization, is a physical, theoretical one,
dictated by scientific considerations; thus, the parameters, or transformations of them, have physical interpretations.

[Figure 7.7: Raw benzene data. (a) Rate of oxidation vs. oxygen concentration; (b) rate of oxidation vs. benzene concentration; (c) rate of oxidation vs. transformed temperature; (d) rate of oxidation vs. moles oxygen consumed per mole benzene.]

Like Carroll and Ruppert (1988), we have deleted observation 38, which these authors found to be an extreme outlier that causes problems for PL estimation; recall our discussion of potential sensitivity to outliers for the quadratic PL method in Chapter 6. Although we deleted this observation in the fitting, we have included it in Figure 7.7, which shows the raw data, including this observation, with plots of the response versus each covariate. The plots suggest informally that variance is not constant across the range of response and appears to change with changing values of the covariates.

Carroll and Ruppert (1988) and Pritchard et al. (1977) found that a variance model where variance changes as a function of the mean response is a reasonable characterization. The former authors considered the power-of-the-mean variance model with power parameter $\theta$. Table 7.1 summarizes an OLS fit of the mean model and the GLS-PL fit ($C = \infty$). The estimate of the power parameter, 1.15, seems to support the contention of nonconstant variance. Note that the meaning of $\hat\sigma$ is different in each fit; for OLS, it is the estimate of the assumed common standard deviation of the response, while for GLS, it is the scale factor. From the table, the failure to weight the observations appropriately to account for nonconstant variance seems
to have a nontrivial effect on the numerical estimates of the parameters and the assessments of their precision (standard errors).

Table 7.1: Results for OLS and GLS-PL fits to the benzene data. The method used to calculate the standard error estimates (in parentheses) will be discussed in Chapter 9.

  Parameter    OLS (SE)        GLS-PL (SE)
  $\alpha_1$   0.97 (0.10)     0.86 (0.04)
  $\alpha_2$   3.20 (0.20)     3.45 (0.13)
  $\alpha_3$   7.33 (1.14)     5.99 (0.61)
  $\alpha_4$   5.02 (0.64)     5.76 (0.41)
  $\sigma$     0.56            0.09
  $\theta$     --              1.15

Figures 7.8 and 7.9 each show several residual plots based on these fits; observation 38 is included here and appears as the most extreme positive OLS residual and the second most extreme positive GLS-PL residual. The OLS residual plots show convincing evidence of nonconstant variance, and that in Figure 7.8(c) seems to support the power variance model. The GLS-PL residual plots suggest that the nonconstant variance is taken into adequate account by the fitted variance model.

An interesting feature of both the OLS and GLS-PL residual plots is that the two observations with the smallest predicted values have larger-than-expected residuals; the residuals for both observations are above the zero line in both cases, and so do not seem to fit the pattern one would expect to see. This may indicate a misspecification of the variance model. The power model $\mathrm{var}(Y_j|x_j) = \sigma^2 f^{2\theta}(x_j, \beta)$, of course, supposes that variance is small where the mean response is small. Here, however, it seems that a few observations with small mean response have variance larger than that represented by this model. Perhaps another "component of variation" is present at very low levels of the response, which might suggest the alternative model

$\mathrm{var}(Y_j|x_j) = \sigma^2\{\theta_1 + f^{2\theta_2}(x_j, \beta)\}.$

Another explanation for this phenomenon is that the mean model is not a good fit across the entire range of the response. Perhaps the theoretical behavior it represents breaks down for small response values, or at the settings of the covariates corresponding to these two observations. From this perspective, the large residuals may be a
consequence of failure of the model to center the response appropriately.

[Figure 7.8: Residual plots based on the OLS fit to the benzene data. (a) Usual plot of residuals vs. log predicted values; (b) studentized residuals vs. log predicted values; (c) log absolute studentized residuals vs. log predicted values; (d) 2/3-root studentized residuals vs. log predicted values.]

Still another possibility is that this feature is simply a matter of chance. To pursue any of these explanations further would require access to subject-matter expertise that would help to determine which is most plausible. Note, however, that the residual plots, in addition to highlighting the presence and nature of nonconstant variance, also may be valuable for bringing such potential anomalies to the attention of the data analyst for further consideration.

[Figure 7.9: Residual plots based on the GLS fit to the benzene data. (a) Usual plot of weighted residuals vs. log predicted values; (b) studentized weighted residuals vs. log predicted values; (c) log absolute studentized weighted residuals vs. log predicted values; (d) 2/3-root studentized weighted residuals vs. log predicted values.]

CHAPTER 6, ST 762, M.
DAVIDIAN

6 Unknown parameters in the variance function

6.1 Introduction

So far, we have considered the general mean-variance model

$E(Y_j|x_j) = f(x_j, \beta), \qquad \mathrm{var}(Y_j|x_j) = \sigma^2 g^2(\beta, \theta, x_j)$  (6.1)

in the case where $\theta$ is known; that is, the form of the variance function $g$ is known up to $\beta$, so that it contains no additional unknown parameters. In this chapter, we consider situations where such parameters $\theta$ are present and discuss inference under these conditions.

In Chapter 2, we discussed situations in which the data analyst may wish to consider a model of the form (6.1), but may be unable to specify the model fully. In particular, although it may be possible to identify an appropriate functional form to describe the pattern of variance, it may not be possible to specify values for some of the model components a priori. We recall some of the examples we discussed in Chapter 2, and some new ones:

- Assay data. Common models are $\mathrm{var}(Y_j|x_j) = \sigma^2 f^{2\theta}(x_j, \beta)$ or $\mathrm{var}(Y_j|x_j) = \sigma^2\{\theta_1 + f^{2\theta_2}(x_j, \beta)\}$. In either case, although the functional form of variance as a function of mean response is sensible for the situation, it may not be possible to identify suitable numerical values for parameters such as $\theta$ in the first model and $\theta = (\theta_1, \theta_2)^T$ in the second. Clearly, the data may contain information on likely values for these parameters.

- Exponential model. A profound increase in variance with the mean may suggest adoption of a model of the form $\mathrm{var}(Y_j|x_j) = \sigma^2\exp\{\theta f(x_j, \beta)\}$; again, a value for $\theta$ may not be readily apparent.

- Over- or underdispersed Poisson data. An additional example follows from the discussion of overdispersion, and the reverse phenomenon of underdispersion, in Section 4.5. A simple way to handle this in the case of count data, for which the Poisson distribution would ordinarily be a plausible model, is to allow for an over/underdispersion parameter $\sigma^2$ and assume $\mathrm{var}(Y_j|x_j) = \sigma^2 f(x_j, \beta)$.

For the Poisson, $\sigma^2 \equiv 1$, of course. The parameter $\sigma^2$ allows the magnitude of variance relative to the mean and usual Poisson
variance to be large or small. However, this model may sometimes not be an adequate representation. An alternative model is often motivated by considering the negative binomial distribution as a model for the data; this distribution may be derived as a mixture of the Poisson with a gamma distribution and leads to a variance model of the form

$\mathrm{var}(Y_j|x_j) = f(x_j, \beta) + \theta f^2(x_j, \beta)$  (6.2)

for some parameter $\theta$. A generalization would be to multiply this by a scale parameter $\sigma^2$. It is straightforward to show that the negative binomial with mean $f(x_j, \beta)$ and variance (6.2) is a member of the scaled exponential family class; see McCullagh and Nelder (1989, Sections 6.2 and 11.2). Clearly, under this model, it is unlikely that the data analyst would be able to specify an appropriate value for $\theta$. Again, if this model is correct, the data contain information on the value of $\theta$.

RESULT: Situations where a sensible functional model for variance may be identified that depends on unknown parameters $\theta$ are common. An obvious approach is to estimate $\theta$ from the data. Such estimation would likely need to be carried out jointly with that of $\beta$ and $\sigma$. This clearly introduces an additional level of complication; along with $\beta$ and $\sigma$, we wish to estimate a parameter that exclusively describes the pattern of variance and enters the variance model in a nonlinear way through the variance function $g(\beta, \theta, x_j)$. An obvious concern is whether estimating all of these parameters from a single set of data is even feasible.

In Chapter 7, we tackle the issue of how to identify an appropriate variance function model $g$ in situations where there is no apparent guidance from subject-matter or approximate distributional considerations. In this chapter, we assume such a model has been specified and, to complete our review of popular inferential strategies for the general mean-variance model (6.1), we indicate how the approaches for estimation of $\beta$ and $\sigma$ we have discussed so far may be augmented to include estimation of the variance parameters $\theta$.

-
We will discuss an approach for implementation of such strategies in practice using standard nonlinear regression software.

6.2 Normal ML and pseudolikelihood

We have already seen that, in some applications, the assumption of normality, along with that of a mean-variance model of the form (6.1), is often reasonable. Even if normality is not a correct assumption, considering it is nonetheless a way to motivate a class of alternative procedures for estimation of $\beta$: quadratic estimating equations. Moreover, the assumption of normality motivates common approaches to estimation of the scale parameter $\sigma^2$, when applicable, i.e., solving a quadratic estimating equation containing no linear terms. From this last observation, if we now consider estimation of additional unknown parameters $\theta$ appearing in the variance function, it seems that something similar should apply for estimating these parameters. Thus, as a starting point, we consider the normal distribution as a way to motivate estimating equations for $\theta$. Recall that the normal distribution has the property that the mean and variance are not related in any particular way, so this approach allows any arbitrary variance function to be considered.

- It will turn out that, once we observe the form of the resulting estimating equation for $\theta$, a general class of such estimating equations will be suggested.

QUADRATIC ESTIMATING EQUATION FOR $\theta$: An obvious strategy, if we are willing to assume that the $Y_j$ are conditionally normally distributed, is to estimate $\beta$, $\sigma$, and $\theta$ by maximum likelihood. We have already derived the form of the resulting estimating equations for $\beta$ and $\sigma$ in Chapter 5, so we need only consider here the final equation for $\theta$, obtained by differentiation of the normal loglikelihood

$-n\log\sigma - \sum_{j=1}^n \log g(\beta, \theta, x_j) - \frac{1}{2}\sum_{j=1}^n \frac{\{Y_j - f(x_j, \beta)\}^2}{\sigma^2 g^2(\beta, \theta, x_j)}$

with respect to $\theta$, where $\theta$ is $q \times 1$, say. Define

$\nu(\beta, \theta, x_j) = \frac{\partial}{\partial\theta}\log g(\beta, \theta, x_j) = g^{-1}(\beta, \theta, x_j)\, g_\theta(\beta, \theta, x_j), \qquad g_\theta(\beta, \theta, x_j) = \frac{\partial}{\partial\theta} g(\beta, \theta, x_j).$

Then, differentiating the loglikelihood with respect to $\theta$ and setting the resulting $q \times 1$ set of equations
equal to zero gives

$\sum_{j=1}^n \left[\frac{\{Y_j - f(x_j, \beta)\}^2}{\sigma^2 g^2(\beta, \theta, x_j)} - 1\right]\nu(\beta, \theta, x_j) = 0.$  (6.3)

This may be rewritten suggestively as

$\sum_{j=1}^n \frac{2\sigma^2 g^2(\beta, \theta, x_j)\,\nu(\beta, \theta, x_j)}{2\sigma^4 g^4(\beta, \theta, x_j)}\left[\{Y_j - f(x_j, \beta)\}^2 - \sigma^2 g^2(\beta, \theta, x_j)\right] = 0.$  (6.4)

Recall from Section 5.4 that the estimating equation for $\sigma$ resulting from differentiation of the normal loglikelihood is

$\sum_{j=1}^n \left[\frac{\{Y_j - f(x_j, \beta)\}^2}{\sigma^2 g^2(\beta, \theta, x_j)} - 1\right]\frac{1}{\sigma} = 0,$

which may also be written as

$\sum_{j=1}^n \frac{2\sigma g^2(\beta, \theta, x_j)}{2\sigma^4 g^4(\beta, \theta, x_j)}\left[\{Y_j - f(x_j, \beta)\}^2 - \sigma^2 g^2(\beta, \theta, x_j)\right] = 0.$  (6.5)

Now, $\sigma$ and $\theta$ are parameters that both pertain to the variance only. Thus, it is sensible to combine the estimating equations for both. Using (6.3), we may write these equations as the $(q+1) \times 1$ system

$\sum_{j=1}^n \left[\frac{\{Y_j - f(x_j, \beta)\}^2}{\sigma^2 g^2(\beta, \theta, x_j)} - 1\right]\begin{pmatrix} 1/\sigma \\ \nu(\beta, \theta, x_j) \end{pmatrix} = 0,$  (6.6)

which, combining (6.4) and (6.5), may be expressed suggestively as

$\sum_{j=1}^n \begin{pmatrix} 2\sigma g^2(\beta, \theta, x_j) \\ 2\sigma^2 g^2(\beta, \theta, x_j)\,\nu(\beta, \theta, x_j) \end{pmatrix}\frac{\{Y_j - f(x_j, \beta)\}^2 - \sigma^2 g^2(\beta, \theta, x_j)}{2\sigma^4 g^4(\beta, \theta, x_j)} = 0.$  (6.7)

Thus, from Chapter 5 and these developments, joint normal-theory ML estimation of $\beta$, $\sigma$, and $\theta$ would involve solving (6.7) jointly with the $p \times 1$ quadratic estimating equation for $\beta$,

$\sum_{j=1}^n \left[\frac{f_\beta(x_j, \beta)\{Y_j - f(x_j, \beta)\}}{\sigma^2 g^2(\beta, \theta, x_j)} + \nu_\beta(\beta, \theta, x_j)\frac{\{Y_j - f(x_j, \beta)\}^2 - \sigma^2 g^2(\beta, \theta, x_j)}{\sigma^2 g^2(\beta, \theta, x_j)}\right] = 0,$

where $\nu_\beta(\beta, \theta, x_j) = (\partial/\partial\beta)\log g(\beta, \theta, x_j)$; this may, of course, also be expressed in a suggestive form.

- This joint estimation involves the solution of a $(p+q+1) \times 1$ system of estimating equations. From a practical standpoint, this could be messy.
- Intuitively, as noted above, it is natural to think of estimation of $\sigma$ and $\theta$ together. In fact, this seems essential, as even if $\beta$ were known, estimation of $\sigma$ and $\theta$ would involve joint solution of (6.7); estimation of $\sigma$ and $\theta$ does not appear to separate.
- Of course, if $\beta$ is not known, all three equations must be solved jointly to obtain the ML estimators of all three parameters.

However, as we discuss shortly, (6.7) suggests a way to estimate the variance parameters $\sigma$ and $\theta$ regardless of how we might obtain an estimator for $\beta$. Thus, we focus now on just the $(q+1) \times 1$ set of estimating equations (6.7) as an approach to estimating the variance parameters $\sigma$ and $\theta$.
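For the power-of-the-mean model $g = f^\theta$ (so that $\nu = \log f$), the system (6.6) with $\beta$ held fixed can be solved directly: the first component profiles out $\sigma^2$, leaving a scalar equation in $\theta$. The crude grid search below is our own illustration of this, not the implementation used in the notes:

```python
import numpy as np

def pl_sigma_theta(resid, mean, theta_grid=None):
    """Solve (6.6) in (sigma, theta) with beta fixed, for g = f^theta.
    resid: Y_j - f(x_j, betahat); mean: f(x_j, betahat) > 0.
    For each trial theta, sigma^2 = mean of resid^2 / f^{2 theta} solves the
    first component exactly; theta is then chosen on a grid to (nearly)
    zero the second component."""
    if theta_grid is None:
        theta_grid = np.linspace(0.0, 2.0, 401)
    logf = np.log(mean)
    best = (np.inf, None, None)
    for th in theta_grid:
        w2 = resid ** 2 / mean ** (2.0 * th)
        s2 = w2.mean()                            # profiled sigma^2
        score = abs(np.sum((w2 / s2 - 1.0) * logf))  # theta component of (6.6)
        if score < best[0]:
            best = (score, np.sqrt(s2), th)
    return best[1], best[2]   # (sigma_hat, theta_hat)
```
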
This is a quadratic estimating equation with no linear component. With the equation written in the form (6.7), it is possible to deduce that it has the familiar form

$\sum_{j=1}^n D_j^T V_j^{-1}(s_j - m_j) = 0,$  (6.8)

where the terms $D_j$, $V_j$, $s_j$, and $m_j$ have the same interpretations as before, as the gradient, (working) variance, response, and mean function, and the "regression" parameter is $(\sigma, \theta^T)^T$. Here, the response is evidently $s_j = \{Y_j - f(x_j, \beta)\}^2$, so $s_j$ is scalar, with mean function $m_j = \sigma^2 g^2(\beta, \theta, x_j)$. It is straightforward to observe that

$\frac{\partial}{\partial\sigma}\,\sigma^2 g^2(\beta, \theta, x_j) = 2\sigma g^2(\beta, \theta, x_j), \qquad \frac{\partial}{\partial\theta}\,\sigma^2 g^2(\beta, \theta, x_j) = 2\sigma^2 g^2(\beta, \theta, x_j)\,\nu(\beta, \theta, x_j).$

Defining $\epsilon_j = \{Y_j - f(x_j, \beta)\}/\{\sigma g(\beta, \theta, x_j)\}$ as before, we have that $\mathrm{var}(s_j|x_j) = \mathrm{var}(\epsilon_j^2|x_j)\,\sigma^4 g^4(\beta, \theta, x_j)$. As before, under normality, $\mathrm{var}(\epsilon_j^2|x_j) = \mathrm{var}(\epsilon_j^2) = 2$. Putting this all together, we see that (6.7) is indeed of the form (6.8), with, under the motivating normality assumption,

$V_j = 2\sigma^4 g^4(\beta, \theta, x_j) \quad \text{and} \quad D_j = \{2\sigma g^2(\beta, \theta, x_j),\; 2\sigma^2 g^2(\beta, \theta, x_j)\,\nu^T(\beta, \theta, x_j)\}.$

In fact, note that the 2 in $V_j$ could be replaced by any constant with respect to $j$, and the equation would remain unchanged, in the sense that the solution would be the same, as this is just a multiplicative factor. So, for example, if it were believed instead that $\mathrm{var}(\epsilon_j^2|x_j) = 2 + \kappa$ for some excess kurtosis value $\kappa$, the equation would be unaffected. Of course, if it were instead believed that $\mathrm{var}(\epsilon_j^2|x_j)$ depends on $j$ somehow, then this would not be the case. We defer discussion of this issue until later.

RESULT: We have identified a set of estimating equations for the parameters $\sigma$ and $\theta$ that arise from consideration of normal-theory ML. These equations have the same interesting form as those we have considered previously. We may summarize the entire set of equations to be solved to obtain the normal-theory ML estimators for $\beta$, $\sigma$, and $\theta$ as follows, using the same shorthand notation we have used previously:

$\sum_{j=1}^n \begin{pmatrix} f_\beta(x_j, \beta) & 2\sigma^2 g^2(\beta, \theta, x_j)\,\nu_\beta(\beta, \theta, x_j) \\ 0 & 2\sigma g^2(\beta, \theta, x_j) \\ 0 & 2\sigma^2 g^2(\beta, \theta, x_j)\,\nu(\beta, \theta, x_j) \end{pmatrix}\begin{pmatrix} \sigma^2 g^2(\beta, \theta, x_j) & 0 \\ 0 & 2\sigma^4 g^4(\beta, \theta, x_j) \end{pmatrix}^{-1}\begin{pmatrix} Y_j - f(x_j, \beta) \\ \{Y_j - f(x_j, \beta)\}^2 - \sigma^2 g^2(\beta, \theta, x_j) \end{pmatrix} = 0.$  (6.9)

Equation (6.9) does not really provide any new insight, but it does demonstrate that the same form of estimating
equations carries over to the case where theta is also estimated.

That the equations for estimating sigma^2 and theta in their own right have the form in (6.7) suggests that they might be exploited more generally than just in the context of joint normal theory ML. We now consider how this might be carried out.

6.3 Incorporating estimation of theta in the GLS algorithm

We have not yet discussed the tradeoffs between the linear (GLS) approach to estimating beta and solving quadratic estimating equations for beta like those discussed in Chapter 5, of which the normal theory equation is a special case. We have noted that it seems that solving the linear, GLS-type equation would be less burdensome in practice.

Thus, suppose that we wish to take a GLS approach to estimation of beta, solving a linear estimating equation. For example, in the case of the three-step GLS algorithm with C = infinity, this entails solving the equation

$$ \sum_{j=1}^n g^{-2}(\hat\beta,\theta,x_j)\,\{Y_j - f(x_j,\beta)\}\, f_\beta(x_j,\beta) = 0. \qquad (6.10) $$

Note that, with theta unknown, this is no longer a simple matter, as theta does not separate from beta as sigma^2 does.

- Thus, if we are to take the GLS approach, we must integrate estimation of beta and theta together.
- Even though sigma^2 does not play a role in estimation of beta, note that it is linked to estimation of theta. Thus, we really must also incorporate estimation of sigma^2 as well.

To incorporate estimation of theta and sigma^2, we modify the GLS algorithm as follows.

(i) Estimate beta by $\hat\beta^{(0)}$, where $\hat\beta^{(0)}$ is some initial estimate; for example, OLS, solving $\sum_{j=1}^n \{Y_j - f(x_j,\beta)\} f_\beta(x_j,\beta) = 0$. Set k = 0.

(ii) Estimate theta by $\hat\theta^{(k)}$, and form weights $\hat w_j = g^{-2}(\hat\beta^{(k)}, \hat\theta^{(k)}, x_j)$.

(iii) Re-estimate beta by solving

$$ \sum_{j=1}^n \hat w_j\, \{Y_j - f(x_j,\beta)\}\, f_\beta(x_j,\beta) = 0 $$

to obtain $\hat\beta^{(k+1)}$. Set k = k + 1 and return to (ii). Continue through C iterations and adopt the Cth as the estimator.

Note that we have represented step (iii) by holding beta fixed at the current value in the weights, as discussed previously. A variation would be to let beta vary here as well, and use the IRWLS algorithm to solve the equation holding theta fixed at the current value. In practice, there is little difference in performance between the two approaches.
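The modified algorithm is easy to prototype. The following is a minimal sketch in Python (the notes themselves use SAS and R; the data, the one-exponential mean function, and all starting values here are hypothetical), with the power-of-the-mean variance function $g(\beta,\theta,x_j) = f^\theta(x_j,\beta)$. The pseudolikelihood step maximizes the normal loglikelihood in (sigma^2, theta) with beta held fixed, profiling out sigma^2 via its explicit maximizer $n^{-1}\sum_j r_j^2/g_j^2$ for fixed theta.

```python
import numpy as np
from scipy.optimize import least_squares, minimize_scalar

rng = np.random.default_rng(0)

def f(x, beta):
    # hypothetical exponential mean; stands in for a generic f(x_j, beta)
    return beta[0] * np.exp(-beta[1] * x)

# simulated data with var(Y|x) = sigma^2 * f(x,beta)^(2*theta), true theta = 1
x = np.linspace(0.1, 4.0, 60)
beta_true, sigma_true = np.array([5.0, 0.8]), 0.05
y = f(x, beta_true) * (1.0 + sigma_true * rng.standard_normal(x.size))

def pl_step(beta_hat):
    """Step (ii): maximize the normal loglikelihood in (sigma^2, theta) with
    beta held fixed (pseudolikelihood), profiling out sigma^2 analytically."""
    r = y - f(x, beta_hat)
    logf = np.log(f(x, beta_hat))
    def negprofile(theta):
        s2 = np.mean(r**2 * np.exp(-2.0 * theta * logf))
        return x.size * np.log(s2) + 2.0 * theta * logf.sum()
    theta = minimize_scalar(negprofile, bounds=(0.0, 3.0), method="bounded").x
    s2 = np.mean(r**2 * np.exp(-2.0 * theta * logf))
    return theta, s2

# (i) OLS starting estimate
beta = least_squares(lambda b: y - f(x, b), x0=[2.0, 0.5]).x
for _ in range(3):                        # C = 3 passes through (ii)-(iii)
    theta, s2 = pl_step(beta)
    w = f(x, beta) ** (-theta)            # square roots of fixed weights g^{-2}
    beta = least_squares(lambda b: w * (y - f(x, b)), x0=beta).x   # (iii)
theta, s2 = pl_step(beta)
print(beta, theta, np.sqrt(s2))
```

With the simulated data above, the printed estimates should be close to the true values (5, 0.8), 1, and 0.05.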
To estimate theta at step (ii), the developments of the preceding section suggest that we could solve

$$ \sum_{j=1}^n \left\{ \frac{\{Y_j - f(x_j,\hat\beta)\}^2}{\sigma^2 g^2(\hat\beta,\theta,x_j)} - 1 \right\} \begin{pmatrix} 1/\sigma^2 \\ \nu_\theta(\hat\beta,\theta,x_j) \end{pmatrix} = 0 \qquad (6.11) $$

jointly in theta and sigma^2 to obtain $\hat\theta^{(k)}$ and $\hat\sigma^{2(k)}$. Thus, the suggestion is to solve (6.6) for fixed beta, held fixed at the current estimate.

- If we were to iterate between steps (ii) and (iii) until convergence (i.e., C = infinity), conceptually we would be solving the (p + q + 1) x 1 set of equations given by (6.10) and (6.6) (equivalently, (6.7)).

We may summarize this last situation by writing the set of equations to be solved, in shorthand notation, as

$$ \sum_{j=1}^n \begin{pmatrix} \dfrac{f_\beta(x_j,\beta)}{\sigma^2 g^2} & 0 \\[1ex] 0 & \dfrac{1}{2\sigma^4 g^2} \\[1ex] 0 & \dfrac{g_\theta(\beta,\theta,x_j)}{\sigma^2 g^3} \end{pmatrix} \begin{pmatrix} Y_j - f(x_j,\beta) \\[0.5ex] \{Y_j - f(x_j,\beta)\}^2 - \sigma^2 g^2(\beta,\theta,x_j) \end{pmatrix} = 0. \qquad (6.12) $$

Comparing (6.12) to (6.9), we see that the only difference is that the element in the upper right corner of the leading matrix in the summand is set equal to zero. As we have discussed earlier, this has the effect of "decoupling" estimation of beta and $(\sigma^2, \theta^T)^T$ and, of course, leads to the linear rather than the quadratic estimating equation for beta.

- Note that, if in fact g does not depend on beta, then $\nabla_\beta\, g(\beta,\theta,x_j) = 0$, and these two estimating equations coincide. In this case, solving (6.12) corresponds to normal theory ML estimation of all parameters.
- If g does depend on beta, this is not the case. Indeed, it is not clear that solving (6.12) may be identified with estimation under any particular distributional assumption.

TERMINOLOGY: The equations (6.6) are motivated by considering maximum likelihood under the normal distribution. However, we are using the equations in a more general context, coupling them with estimation of beta that is not via ML in general. (If we were to solve (6.6) along with the corresponding quadratic equation for beta for normal ML, this would be ML.) The term pseudolikelihood has been used in the statistical literature to refer to the practice of solving estimating equations derived from a maximum likelihood perspective, but replacing "nuisance parameters" (parameters not of direct interest) by estimators that are not maximum likelihood but are
consistent. This is precisely what we are doing in the three-step GLS algorithm above: at step (ii), the ML equations for sigma^2 and theta are solved with beta replaced by the most recent GLS estimator. Because of this, estimation of sigma^2 and theta by solving (6.6) is referred to, in the particular context of nonlinear models, as the method of pseudolikelihood, which we will abbreviate henceforth as PL. We will refer in the sequel to the general approach of incorporating PL estimation of variance parameters into step (ii) of the GLS algorithm as GLS-PL estimation.

Of course, we could also pair PL estimation of sigma^2 and theta with another equation for beta than the linear GLS equation. For example, we could estimate beta via a quadratic equation that is not the normal ML equation, such as (5.15).

QUESTIONS:

- Is it better to use a linear or quadratic equation for beta?
- Are there other approaches to estimating theta and sigma^2?

The first question will be addressed when we discuss the large-sample theory for these estimators in later chapters. We will address the second question in part shortly.

6.4 Implementation

Before we discuss other possibilities, we take up the issue of how one might implement these approaches in practice. We discuss two strategies.

IRWLS: The first approach we discuss is in fact broadly applicable to any set of estimating equations of the special form we have identified. Many of the schemes we have discussed for estimation of beta, sigma^2, and theta may be represented as solving equations of the form

$$ \sum_{j=1}^n D_j^T(\alpha)\, V_j^{-1}(\alpha)\, \{s_j(\alpha) - m_j(\alpha)\} = 0, \qquad D_j(\alpha) = \frac{\partial}{\partial \alpha^T}\, m_j(\alpha), \qquad (6.13) $$

where alpha is the entire set of parameters to be estimated, of dimension v, say; for us, $\alpha = (\beta^T, \sigma^2, \theta^T)^T$ and v = p + q + 1. Here, $s_j$ and $m_j$ are of dimension s x 1, where s >= 1; we have considered cases where s = 1 or 2. In the case where $D_j(\alpha) = (\partial/\partial\alpha^T)\, m_j(\alpha)$, we may proceed directly as follows.

From a conceptual point of view, even if s > 1, there is really no difference between solving (6.13) and the equations discussed in Chapter 3. The main increase in complexity is that $D_j$ may
be an s x v matrix rather than a row vector, and the "weight" $V_j^{-1}$ is now a matrix rather than a scalar. For alpha "close to" $\alpha^*$, we may obtain a linear Taylor series approximation to (6.13) and follow the same type of arguments we carried out in Sections 3.2 and 3.4 to arrive at the approximation

$$ \alpha \approx \alpha^* + \left\{ \sum_{j=1}^n D_j^T(\alpha^*) V_j^{-1}(\alpha^*) D_j(\alpha^*) \right\}^{-1} \sum_{j=1}^n D_j^T(\alpha^*) V_j^{-1}(\alpha^*)\, \{s_j(\alpha^*) - m_j(\alpha^*)\}. \qquad (6.14) $$

Note that, for this to hold, some additional conditions are needed; this is left as an exercise.

We can write this in an even simpler form, so that it more closely resembles the IRWLS updating scheme in Section 3.4. Define

$$ s(\alpha) = \{s_1^T(\alpha), \ldots, s_n^T(\alpha)\}^T, \qquad m(\alpha) = \{m_1^T(\alpha), \ldots, m_n^T(\alpha)\}^T, $$

which are both ns x 1,

$$ V(\alpha) = \mathrm{block\ diag}\{V_1(\alpha), \ldots, V_n(\alpha)\} \quad (ns \times ns), \qquad D^T(\alpha) = \{D_1^T(\alpha), \ldots, D_n^T(\alpha)\} \quad (v \times ns). $$

With these definitions, it is straightforward to deduce that (6.14) can be written more succinctly as

$$ \alpha \approx \alpha^* + \{D^T(\alpha^*) V^{-1}(\alpha^*) D(\alpha^*)\}^{-1} D^T(\alpha^*) V^{-1}(\alpha^*)\, \{s(\alpha^*) - m(\alpha^*)\}. \qquad (6.15) $$

The approximation in (6.15) is exactly the form of that derived in Section 3.4, corresponding to the method of IRWLS, with the matrix $V^{-1}$ playing the role of the matrix of weights. Note that $V^{-1}$ is not a diagonal matrix, but rather is block diagonal; this does not involve any conceptual difficulty. This suggests, by analogy, that we could solve the estimating equations by a series of iterative updates, where at the (a+1)th iteration the update is

$$ \alpha^{(a+1)} = \alpha^{(a)} + \{D_{(a)}^T V_{(a)}^{-1} D_{(a)}\}^{-1} D_{(a)}^T V_{(a)}^{-1} \{s_{(a)} - m_{(a)}\} = \{D_{(a)}^T V_{(a)}^{-1} D_{(a)}\}^{-1} D_{(a)}^T V_{(a)}^{-1} Z_{(a)}, \qquad (6.16) $$

where $D_{(a)} = D(\alpha^{(a)})$, and similarly for the other components, and $Z_{(a)} = D_{(a)} \alpha^{(a)} + s_{(a)} - m_{(a)}$. Comparing this to (3.13) shows that this is exactly the form of the general IRWLS update.

RESULT: Under certain conditions, in principle one could use the IRWLS algorithm to solve this system of estimating equations. Iteration would continue until some prespecified convergence criterion was met, as usual. However, this implementation could become computationally unwieldy if v (the total number of parameters, dim alpha) is large. In fact, in practice, this approach is known to suffer computational problems under these conditions.

- An alternative approach would be to separate the computations into two parts, thereby
decreasing the dimensionality of the problem for each part:

(a) For a given value of $(\sigma^2, \theta^T)^T$, estimate beta by solving the beta equation.

(b) For a given value of beta, estimate $(\sigma^2, \theta^T)^T$ by solving the $(\sigma^2, \theta^T)^T$ equation.

It should be clear that each of these steps involves solving an equation of the form (6.13), so each may be implemented by the IRWLS strategy described above.

- A variation would be to alternate between a single IRWLS update for (a) followed by a single update for (b), cycling between the two until convergence.

In fact, consider solving the GLS-PL system of equations summarized in (6.12). Here, the matrix corresponding to "$D_j$" in (6.13) is not exactly the gradient matrix of the "mean function" when g depends on beta. Thus, technically, the argument as given above does not apply to the entire set of equations. However, because the matrix $V_j$ is diagonal, as we have noted previously, (6.12) separates into two distinct pieces: the beta equation (6.10) and the $(\sigma^2, \theta^T)^T$ equation (6.7). Both of these equations are exactly of the form (6.13) with a "proper" gradient matrix. Thus, solution of (6.12) could also proceed by the alternating strategy (a)-(b) above. This is of course exactly the conceptual strategy of the three-step algorithm: step (ii) carries out (b), and step (iii) carries out (a). Of course, the current estimate of sigma^2 is unnecessary for step (iii), i.e., for (a), in this particular case.

As written, step (iii) would be implemented by a WLS updating algorithm; note that (ii) could be done analogously if we held theta in the weights fixed at the previous estimate. Alternatively, the IRWLS update above could be used at each step. Thus, according to the above developments, both equations would be solved via a WLS or IRWLS updating scheme whose dimension is equal to that of the parameter under consideration.

THE TRICK: Although the above is appealing and straightforward in principle, in practice computational difficulties may be encountered.

- Solving the system (6.9) using the IRWLS updating scheme as given above often does not converge. Recall that the p + q + 1
equations here do not separate, so the dimensionality of the problem can be large, and both the quadratic beta and $(\sigma^2, \theta^T)^T$ equations are fairly complex.

- For solving the system (6.12), WLS/IRWLS updating for the beta equation generally works fine. However, solving the $(\sigma^2, \theta^T)^T$ equation for fixed beta by using a WLS/IRWLS update is usually more problematic, even if the dimension is not too large.

The bottom line is that the ML and PL quadratic equations are simply more unstable and ill-behaved in practice than are the linear ones. Thus, for implementing joint normal theory ML estimation of all parameters, or PL estimation of variance parameters, an alternative strategy is desirable.

It turns out that there is a fairly standard computational "trick" that may be used to circumvent these problems. This trick has been found, in this particular context and in other statistical models more generally, to be an effective way to make the problem better behaved in practice.

- For the particular case of fitting models of the form (6.1), an advantage of this approach is that it may be implemented using nonlinear regression software that carries out OLS estimation!

The "trick" relies on the specific form of the variance model, i.e., $\mathrm{var}(Y_j \mid x_j) = \sigma^2 g^2(\beta,\theta,x_j)$, to reduce the dimension of the estimation problem by using a technique known as "profiling," which we now demonstrate for our problem. For mean-variance models of the form (6.1), this "trick" is discussed by Carroll and Ruppert (1988, Section 3.2) and Giltinan and Ruppert (1989), among others.

Both the ML and PL approaches are motivated by considering the normal loglikelihood

$$ \log L = -n \log\sigma - \sum_{j=1}^n \log g(\beta,\theta,x_j) - \frac{1}{2}\sum_{j=1}^n \frac{\{Y_j - f(x_j,\beta)\}^2}{\sigma^2 g^2(\beta,\theta,x_j)}. \qquad (6.17) $$

Recall that we may obtain an explicit expression for the maximizer in sigma^2, solving (6.5), as a function of the other parameters; namely,

$$ \hat\sigma^2 = n^{-1} \sum_{j=1}^n g^{-2}(\beta,\theta,x_j)\, \{Y_j - f(x_j,\beta)\}^2. \qquad (6.18) $$

Now, because (6.18) must be satisfied by the full set of parameters maximizing (6.17), if we substitute (6.18) into (6.17) wherever sigma^2 appears, to obtain an
expression just in beta and theta, then the PL estimator of theta (with beta held fixed) or the joint ML estimators of $(\beta^T, \theta^T)^T$ must maximize this expression. Thus, consider "profiling out" sigma^2 from the loglikelihood (6.17) by such a substitution. This yields the function

$$ \log L_{\max} = -\frac{n}{2} \log\left[ n^{-1} \sum_{j=1}^n \frac{\{Y_j - f(x_j,\beta)\}^2}{g^2(\beta,\theta,x_j)} \right] - \sum_{j=1}^n \log g(\beta,\theta,x_j) + \text{constant}. \qquad (6.19) $$

To find the PL estimator $\hat\theta$ (holding beta fixed) or to find the joint ML estimator $(\hat\beta^T, \hat\theta^T)^T$, we would maximize (6.19) with respect to theta or $(\beta^T, \theta^T)^T$, respectively. Note that this is equivalent to minimizing (in theta, holding beta fixed, or jointly in beta and theta), ignoring constants,

$$ \log\left[ \sum_{j=1}^n \frac{\{Y_j - f(x_j,\beta)\}^2}{g^2(\beta,\theta,x_j)} \right] + \frac{2}{n}\sum_{j=1}^n \log g(\beta,\theta,x_j). $$

Defining

$$ \hat g(\beta,\theta) = \left\{ \prod_{j=1}^n g(\beta,\theta,x_j) \right\}^{1/n}, $$

which is the geometric mean of the $g(\beta,\theta,x_j)$ over j, it is straightforward to see that we wish to minimize

$$ \log \sum_{j=1}^n \{Y_j - f(x_j,\beta)\}^2\, \frac{\hat g^2(\beta,\theta)}{g^2(\beta,\theta,x_j)}. $$

Thus, by the monotonicity of the logarithm, we wish to minimize

$$ \sum_{j=1}^n \{Y_j - f(x_j,\beta)\}^2\, \frac{\hat g^2(\beta,\theta)}{g^2(\beta,\theta,x_j)}. \qquad (6.20) $$

Depending on whether we wish to carry out PL or ML estimation, let

$$ F_j(\theta) = \{Y_j - f(x_j,\hat\beta)\}\, \frac{\hat g(\hat\beta,\theta)}{g(\hat\beta,\theta,x_j)} \qquad \text{or} \qquad F_j(\beta,\theta) = \{Y_j - f(x_j,\beta)\}\, \frac{\hat g(\beta,\theta)}{g(\beta,\theta,x_j)}. $$

Then note that we may write (6.20) as, depending on our objective,

$$ \sum_{j=1}^n \{0 - F_j(\theta)\}^2 \qquad \text{or} \qquad \sum_{j=1}^n \{0 - F_j(\beta,\theta)\}^2. \qquad (6.21) $$

Here, we have written the summand as the constant 0 minus the appropriate function to emphasize that (6.21) may be interpreted as a sum of squared deviations. In particular, viewing 0 as the "data" corresponding to the jth observation and $F_j(\theta)$ (PL) or $F_j(\beta,\theta)$ (ML) as the "regression function," with "regression parameter" theta or $(\beta^T, \theta^T)^T$, respectively, it is clear that the problem of maximizing (6.17) (so solving the resulting estimating equation), either holding beta fixed or not, is equivalent to an OLS problem with "data" 0 for all j and mean function $F_j(\theta)$ or $F_j(\beta,\theta)$.

To implement these results to obtain estimates of $(\sigma^2, \theta^T)^T$ (PL, holding beta fixed) or of $(\beta^T, \sigma^2, \theta^T)^T$ (ML), one may:

- Solve the "fake" OLS problem by constructing "dummy data" all identically equal to zero for each j, and defining the mean function as indicated above, to obtain estimators for theta or $(\beta^T, \theta^T)^T$ using standard nonlinear regression techniques, and
- Obtain a final estimate of sigma by substituting these estimates into (6.18) at the end.
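The two bullets above can be sketched directly. Below is a minimal Python illustration of the "fake" OLS problem for PL estimation of theta under the power-of-the-mean model (the residuals and fitted values are hypothetical, standing in for quantities from a preliminary GLS fit; a general-purpose least squares routine stands in for proc nlin or nls):

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(1)

# hypothetical preliminary fit: fitted means fhat_j and residuals r_j,
# generated so that var(Y_j|x_j) = sigma^2 * f_j^(2*theta) with theta = 1
x = np.linspace(0.1, 4.0, 50)
fhat = 5.0 * np.exp(-0.8 * x)                    # f(x_j, betahat), held fixed
r = fhat * 0.05 * rng.standard_normal(x.size)    # r_j = Y_j - f(x_j, betahat)

# "fake" OLS: dummy responses identically 0 and regression function
#   F_j(theta) = r_j * (fdot / fhat_j)**theta, fdot = geometric mean of fhat_j,
# so that sum_j {0 - F_j(theta)}^2 is the profiled objective (6.20)-(6.21)
fdot = np.exp(np.mean(np.log(fhat)))

def F(theta):
    return r * (fdot / fhat) ** theta

theta_pl = least_squares(lambda t: 0.0 - F(t[0]), x0=[0.5]).x[0]

# final estimate of sigma by substituting into (6.18)
sigma = np.sqrt(np.mean((r / fhat**theta_pl) ** 2))
print(theta_pl, sigma)
```

With the hypothetical residuals above, the estimates should come out near the generating values theta = 1 and sigma = 0.05.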
EXAMPLE 6.1 (Power-of-the-mean variance model). In the particular case of the power-of-the-mean model,

$$ \mathrm{var}(Y_j \mid x_j) = \sigma^2 f^{2\theta}(x_j,\beta), \qquad g(\beta,\theta,x_j) = f^{\theta}(x_j,\beta), $$

the geometric mean is

$$ \hat g(\beta,\theta) = \left\{ \prod_{j=1}^n f(x_j,\beta) \right\}^{\theta/n}. $$

- For PL estimation, in step (ii) one would substitute $\hat\beta$; in this case, the term in braces would be a constant with respect to the "regression parameter" theta. Writing

$$ \dot f = \left\{ \prod_{j=1}^n f(x_j,\hat\beta) \right\}^{1/n} $$

to denote the geometric mean of the $f(x_j,\hat\beta)$, the general problem becomes one of minimizing

$$ \sum_{j=1}^n \left[ \{Y_j - f(x_j,\hat\beta)\} \left\{ \frac{\dot f}{f(x_j,\hat\beta)} \right\}^{\theta} \right]^2. $$

- For PL estimation holding beta fixed, this is especially simple: the form of the "regression model" is $r_j a_j^\theta$, where $r_j$ and $a_j$ are constants. In this case, any nonlinear regression program (e.g., SAS proc nlin or R/S-plus nls) may be used to carry out the minimization and hence estimate theta.

- For ML estimation, treating both beta and theta as "regression parameters," the problem is somewhat more complicated, but is a nonlinear OLS problem in principle nonetheless. Unfortunately, some nonlinear regression programs do not allow ready implementation of this, while others do, as we now discuss.

SLIGHT WRINKLE: In situations where the geometric mean $\hat g$ is a constant with respect to the regression parameters of interest, as in the power-of-the-mean case for PL above, the OLS problem is straightforward to implement in any nonlinear regression program. When it does depend on the regression parameters, as in the case of ML for the power-of-the-mean model above, some programs, such as SAS proc nlin, are infeasible. This is because the value of the geometric mean must be updated at every internal Gauss-Newton iteration, which requires the program to make a pass through all n data elements at each iteration. Unfortunately, proc nlin does not allow specification of regression models that require such a pass through all n observations; the model must be defined on an observation-by-observation basis. In contrast, R/S-plus nls can accommodate this easily: the user can write his or her own regression function that manually makes the pass through the entire set of n observations at each evaluation.
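In any language where the regression function can be an arbitrary function of the full parameter and data vectors, the ML version poses no difficulty. A hedged Python sketch (hypothetical data; the geometric mean is recomputed from the current beta at every evaluation, which is exactly the "pass through all n observations" at issue):

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(2)
x = np.linspace(0.1, 4.0, 50)
y = 5.0 * np.exp(-0.8 * x) * (1.0 + 0.05 * rng.standard_normal(x.size))

def F(params):
    # joint ML via "fake" OLS: F_j(beta,theta) = (Y_j - f_j)*(fdot/f_j)**theta,
    # where the geometric mean fdot is recomputed from the current beta,
    # i.e., a pass through all n observations at every evaluation
    b0, b1, theta = params
    fj = b0 * np.exp(-b1 * x)
    fdot = np.exp(np.mean(np.log(fj)))
    return (y - fj) * (fdot / fj) ** theta

fit = least_squares(F, x0=[2.0, 0.5, 0.5],
                    bounds=([1e-6, 1e-6, -5.0], [100.0, 10.0, 5.0])).x
print(fit)
```

The fitted values should be close to the generating values (5, 0.8, 1); the bounds simply keep the mean positive so the geometric mean is well defined.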
Alternatively, the user can write his or her own Gauss-Newton program. Note that this issue arises more generally: even for PL, if the form of the variance function is such that the geometric mean $\{\prod_{j=1}^n g(\hat\beta,\theta,x_j)\}^{1/n}$ is not constant with respect to the regression parameter theta, the same difficulty arises.

6.5 A general class of estimators for theta

A question we posed earlier is whether there are other approaches to estimation of theta and sigma^2 besides solving the quadratic PL equation. Of course, any procedure for estimating the variance parameters treating beta as fixed may be paired with an equation for estimating beta to determine an overall set of equations. Thus, we focus here on estimation of $(\sigma^2, \theta^T)^T$ treating beta as fixed, e.g., as would be the case in step (ii) of the GLS algorithm. The general class of estimators we consider may be motivated by the following.

- As with PL, it is natural to suppose that estimating equations for variance parameters would be based on the deviations, or residuals, $Y_j - f(x_j,\hat\beta)$. Of course, the obvious function of these residuals to consider is the squared function $\{Y_j - f(x_j,\hat\beta)\}^2$, as in PL, as it has expectation equal to the variance model.

- A problem with squared deviations is that the effect of "outlying" or unusual observations can be magnified. Specifically, a deviation that is large in magnitude for this reason is even larger when squared. Intuition suggests that estimation of variance parameters (both sigma^2 and theta) based on a quadratic function might thus be sensitive to the presence of anomalous observations.

- These points suggest that a natural approach is to still consider residuals, but to consider functions of them that may not be so sensitive.

These observations suggest the approach we now describe: the formation of estimating equations of the general form (6.13) based on transformations of absolute residuals $|Y_j - f(x_j,\hat\beta)|$. That is, we identify the "response" $s_j$ as some transformation
h(u), say, of absolute residuals $|Y_j - f(x_j,\hat\beta)|$. For convenience, we restate the definition

$$ \epsilon_j = \frac{Y_j - f(x_j,\beta)}{\sigma\, g(\beta,\theta,x_j)}. $$

EXAMPLE 6.2 (The identity transformation). The quadratic PL equation uses the transformation h(u) = u^2. An alternative is to consider the identity transformation h(u) = u. Thus, the "response" would be $|Y_j - f(x_j,\hat\beta)|$.

To form an estimating equation, we need the mean and variance of the "response." We have

$$ E(|Y_j - f(x_j,\beta)| \mid x_j) = \sigma\, g(\beta,\theta,x_j)\, E(|\epsilon_j| \mid x_j), $$

$$ \mathrm{var}(|Y_j - f(x_j,\beta)| \mid x_j) = \sigma^2 g^2(\beta,\theta,x_j)\, \mathrm{var}(|\epsilon_j| \mid x_j) = \sigma^2 g^2(\beta,\theta,x_j)\, \{1 - E(|\epsilon_j| \mid x_j)^2\}, $$

using $E(\epsilon_j^2 \mid x_j) = 1$. Suppose that we are willing to believe that $E(|\epsilon_j| \mid x_j)$ is a constant for all j, so not depending on $x_j$.

- This would certainly be true if the $\epsilon_j$ were independent of the $x_j$ and iid, as when $\epsilon_j \sim N(0,1)$.
- This is certainly not true in general. For example, if $Y_j$ is distributed as Poisson or gamma, it may be shown that this moment does indeed depend on $x_j$.

Under this assumption, writing

$$ E(|\epsilon_j| \mid x_j) = E|\epsilon_j|, \qquad e^{\tau} = \sigma\, E|\epsilon_j|, \qquad (6.22) $$

a constant, note that we thus have

$$ E(|Y_j - f(x_j,\beta)| \mid x_j) = e^{\tau} g(\beta,\theta,x_j), \qquad \mathrm{var}(|Y_j - f(x_j,\beta)| \mid x_j) = (\sigma^2 - e^{2\tau})\, g^2(\beta,\theta,x_j) \propto g^2(\beta,\theta,x_j); $$

recall that estimating equations of the form (6.13) will not be altered by the addition or removal of a multiplicative constant in the "weights," so we need only consider that the variance of the "response" is proportional to $g^2(\beta,\theta,x_j)$.

Having identified the mean function and weights, we need only identify the "gradient." Note that, with the above considerations, the only parameters are theta and tau, where tau is defined as in (6.22), so that we have a different parameterization. We have

$$ \frac{\partial}{\partial\tau}\, e^{\tau} g(\beta,\theta,x_j) = e^{\tau} g(\beta,\theta,x_j), \qquad \frac{\partial}{\partial\theta}\, e^{\tau} g(\beta,\theta,x_j) = e^{\tau} g(\beta,\theta,x_j)\, \nu_\theta(\beta,\theta,x_j). $$

Thus, we may construct the estimating equation as

$$ \sum_{j=1}^n \frac{|Y_j - f(x_j,\hat\beta)| - e^{\tau} g(\hat\beta,\theta,x_j)}{g^2(\hat\beta,\theta,x_j)}\; e^{\tau} g(\hat\beta,\theta,x_j) \begin{pmatrix} 1 \\ \nu_\theta(\hat\beta,\theta,x_j) \end{pmatrix} = 0, \qquad (6.23) $$

where $\nu_\theta(\beta,\theta,x_j) = \nabla_\theta\, g(\beta,\theta,x_j)/g(\beta,\theta,x_j)$.

ASIDE: Note that if we were to write $e^{2\tau} = \sigma^2 E(\epsilon_j^2) = \sigma^2$, so that $e^{\tau} = \sigma$, then we could write the quadratic PL equation (ignoring multiplicative constants) equivalently as

$$ \sum_{j=1}^n \frac{\{Y_j - f(x_j,\hat\beta)\}^2 - e^{2\tau} g^2(\hat\beta,\theta,x_j)}{g^4(\hat\beta,\theta,x_j)}\; e^{2\tau} g^2(\hat\beta,\theta,x_j) \begin{pmatrix} 1 \\ \nu_\theta(\hat\beta,\theta,x_j) \end{pmatrix} = 0. \qquad (6.24) $$

Thus, if we
reparameterize the variance model this way, writing $\sigma = e^{\tau}$, the quadratic PL equation (6.24) and the equation based on the identity transformation (6.23) have a similar form. This suggests that we might generalize the idea to still other transformations by an appropriate definition of a parameter tau.

EXAMPLE 6.3 (Logarithmic transformation). Consider basing an estimation procedure on the "response"

$$ \log|Y_j - f(x_j,\beta)| = \log\sigma + \log g(\beta,\theta,x_j) + \log|\epsilon_j|. \qquad (6.25) $$

Again assuming that the $\epsilon_j$ are iid, we have that $E(\log|\epsilon_j| \mid x_j) = E(\log|\epsilon_j|)$ and $\mathrm{var}(\log|\epsilon_j| \mid x_j) = \mathrm{var}(\log|\epsilon_j|)$ are constants. From (6.25), writing $\tau = \log\sigma + E(\log|\epsilon_j|)$, we then have that the mean function and variance are

$$ E(\log|Y_j - f(x_j,\beta)| \mid x_j) = \tau + \log g(\beta,\theta,x_j), \qquad \mathrm{var}(\log|Y_j - f(x_j,\beta)| \mid x_j) = \text{constant}, $$

and it follows that the "gradient" is $(1, \nu_\theta^T(\beta,\theta,x_j))^T$. The resulting estimating equation may be written as

$$ \sum_{j=1}^n \big\{ \log|Y_j - f(x_j,\hat\beta)| - \tau - \log g(\hat\beta,\theta,x_j) \big\} \begin{pmatrix} 1 \\ \nu_\theta(\hat\beta,\theta,x_j) \end{pmatrix} = 0. \qquad (6.26) $$

In the particular case of the power-of-the-mean variance model, note that $\log g(\beta,\theta,x_j) = \theta \log f(x_j,\beta)$ and $\nu_\theta(\beta,\theta,x_j) = \log f(x_j,\beta)$. Thus, if beta is set to some preliminary estimate $\hat\beta$, e.g., in the GLS algorithm, the $\log f(x_j,\hat\beta)$ are constants with respect to tau and theta, and note that solving (6.26) reduces to a simple linear regression that may be carried out in closed form! This feature has made this method for estimating theta (to be used to form weights for fitting the original mean model) very popular among assay scientists: one need only obtain predicted values from a fit of the mean model to implement the method in this case. Of course, note that this requires that $f(x_j,\hat\beta) > 0$ for all j, which will virtually always be true in this application.

IN GENERAL: The foregoing special cases suggest a general class of estimating equations. For known lambda, define

$$ h(u,\lambda) = u^{\lambda},\ \lambda \neq 0; \qquad h(u,\lambda) = \log u,\ \lambda = 0. $$

This is the "discontinuous" version of the Box-Cox transformation discussed in Chapter 2; it is easier to treat the case lambda = 0 separately from the others. Consider forming estimating equations based on general power transformations $h(|Y_j - f(x_j,\hat\beta)|, \lambda)$ of absolute residuals. With lambda not equal to 0, assume that the $\epsilon_j$ are such that
their appropriate moments, given below, are not dependent on $x_j$ and are constant for all j, as would be the case if the $\epsilon_j$ are iid. Formally, we will need that

$$ E(|\epsilon_j|^{k\lambda} \mid x_j) = E|\epsilon_j|^{k\lambda} = \text{constant for all } j, \qquad k = 1, 2. $$

For lambda not equal to 0, define tau such that

$$ e^{\lambda\tau} = \sigma^{\lambda}\, E|\epsilon_j|^{\lambda}; $$

note that for lambda = 2, $E|\epsilon_j|^2 = 1$, so this is always possible with no other assumptions. Then identify $|Y_j - f(x_j,\hat\beta)|^{\lambda}$ as the "response." It follows that

$$ E(|Y_j - f(x_j,\hat\beta)|^{\lambda} \mid x_j) = \sigma^{\lambda} E|\epsilon_j|^{\lambda}\, g^{\lambda}(\hat\beta,\theta,x_j) = e^{\lambda\tau} g^{\lambda}(\hat\beta,\theta,x_j), $$

$$ \mathrm{var}(|Y_j - f(x_j,\hat\beta)|^{\lambda} \mid x_j) = \sigma^{2\lambda} g^{2\lambda}(\hat\beta,\theta,x_j)\, \big\{ E|\epsilon_j|^{2\lambda} - (E|\epsilon_j|^{\lambda})^2 \big\}. \qquad (6.27) $$

Note that in (6.27) the leading term in braces is a constant with respect to j under our assumptions, so that

$$ \mathrm{var}(|Y_j - f(x_j,\hat\beta)|^{\lambda} \mid x_j) \propto g^{2\lambda}(\hat\beta,\theta,x_j). $$

Calculation of the derivatives of the mean function $e^{\lambda\tau} g^{\lambda}(\hat\beta,\theta,x_j)$ with respect to tau and theta then yields the estimating equation

$$ \sum_{j=1}^n \frac{|Y_j - f(x_j,\hat\beta)|^{\lambda} - e^{\lambda\tau} g^{\lambda}(\hat\beta,\theta,x_j)}{g^{2\lambda}(\hat\beta,\theta,x_j)}\; \lambda e^{\lambda\tau} g^{\lambda}(\hat\beta,\theta,x_j) \begin{pmatrix} 1 \\ \nu_\theta(\hat\beta,\theta,x_j) \end{pmatrix} = 0; \qquad (6.28) $$

note that the multiplicative constant $\lambda e^{\lambda\tau}$ could also be disregarded.
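As a numerical illustration of this family, the sketch below (Python, with hypothetical residuals generated under true theta = 1) computes two of its members for the power-of-the-mean model: lambda = 0, for which (6.26) is a simple linear regression of log absolute residuals on log fitted values, and lambda = 2, the PL estimator, computed via its profiled "fake OLS" form:

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(3)
x = np.linspace(0.1, 4.0, 200)
fhat = 5.0 * np.exp(-0.8 * x)                    # fitted means, held fixed
r = fhat * 0.05 * rng.standard_normal(x.size)    # residuals; true theta = 1
logf = np.log(fhat)

# lambda = 0: regress log|r_j| on log f(x_j, betahat); for the power-of-the-
# mean model the slope of this ordinary linear regression estimates theta
slope = np.polyfit(logf, np.log(np.abs(r)), 1)[0]

# lambda = 2: the PL member of the family, via its profiled "fake OLS" form
fdot = np.exp(logf.mean())
theta_pl = least_squares(lambda t: r * (fdot / fhat) ** t[0], x0=[0.5]).x[0]

print(slope, theta_pl)
```

Both estimates should land near the generating value theta = 1; with heavier-tailed residuals, the lambda = 0 estimate would typically be the less affected of the two.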
use the usual quadratic estimator calculated using the nal estimates of 6 and 0 o The preceding developments are predicated on the assumption that E Ejlk j is the same for all j for k 1 2 What is the consequence of violation of this assumption To x ideas suppose that El6jlk l1j 39ykj Then we may not de ne 7 as before and in general EUYj imelAlW V1j0A9A 7077 7 Varltle fj7 lAlj V2j ij0392A92 707 113739 Thus note that if we erroneously assumed El6jl l1j to be constant to form the estimating equation 628 assuming existence of a common parameter 7 then EM 7 fagMAW y ecuAw 0 mi and the estimating equation would not be unbiased Of course this is not a problem for A 2 as E Ej 2 j 1 regardless but will be an issue for any other choice of A Note that as far as the weights are concerned the effect of the erroneous assumption of constancy is to yield incorrect weights speci cally 628 uses values proportional to 1gz 0 217 when the appropriate choice would be ygj 7 39ylzjgz 0211j 1 In the case of A 2 note that PAGE 138 CHAPTER 6 ST 762 M DAVDDIAN although one will always have an unbiased estimating equation failure of the assumption to be met will result in use of inappropriate weights In later chapters we will gain insight into the consequences of choosing incorrect weights in general estimating equations theoretically 0 Finally even if we believe the constancy assumption is reasonable how should we choose among competing estimators choices of A We will study this theoretically in Chapter 12 IMPLEMENTATION As it turns out the same pro ling trick for implementing PL in the A 2 case may be generalized to any A This requires the user to de ne a suitable objective function to be maximize that yields 628 upon differentiation with respect to 7 and A and is left as an exercise In the particular case A 1 it may be shown that in fact the equation 628 corresponds to those that would be obtained by considering the double exponential Laplace distribution rather than the normal this is also 
although one will always have an unbiased estimating equation, failure of the assumption to be met will result in the use of inappropriate weights. In later chapters, we will gain theoretical insight into the consequences of choosing "incorrect" weights in general estimating equations.

- Finally, even if we believe the constancy assumption is reasonable, how should we choose among competing estimators (choices of lambda)? We will study this theoretically in Chapter 12.

IMPLEMENTATION: As it turns out, the same "profiling" trick for implementing PL in the lambda = 2 case may be generalized to any lambda. This requires the user to define a suitable objective function, to be maximized, that yields (6.28) upon differentiation with respect to tau and theta; this is left as an exercise. In the particular case lambda = 1, it may be shown that, in fact, the equation (6.28) corresponds to those that would be obtained by considering the double exponential (Laplace) distribution rather than the normal; this is also left as an exercise.

6.6 Extended quasilikelihood

Yet another approach to estimating parameters in a variance function arises from the quasilikelihood (QL) perspective. Recall that QL was an attempt to give a distributional justification and framework for GLS-type estimation of beta in models in which the variance depends on beta through the mean, but where there are no additional unknown variance parameters (theta known). When theta is unknown in such models, there have been attempts to do something similar: the idea is to define a "scaled exponential family"-like loglikelihood to be used as a basis for joint estimation of beta, sigma^2, and theta. The objective is to end up with the linear GLS estimating equation for beta and some other equation for $(\sigma^2, \theta^T)^T$ by differentiating this "extended quasilikelihood."

Here, we just give a sketch of the idea. Some relevant references are Efron (1986); Nelder and Pregibon (1987); Jorgensen (1987); McCullagh and Nelder (1989, Chapter 9); and Nelder and Lee (1992), as well as several follow-up papers by Nelder and colleagues in the 1990s.

The idea of extended quasilikelihood is to extend the definition of QL to define a family of "likelihoods" indexed by theta such that

$$ E(Y_j \mid x_j) = \mu_j = f(x_j,\beta), \qquad \mathrm{var}(Y_j \mid x_j) = \sigma^2 g^2\{f(x_j,\beta), \theta\}. $$

Dependence of the variance function on beta must be through the mean. The family should have the following properties:

- Maximum likelihood estimation yields GLS estimation of beta.
- The family includes the normal likelihood with constant variance, and the Poisson, gamma, and inverse Gaussian distributions, for appropriate choices of g and theta.

HEURISTIC JUSTIFICATION: Recall that in Section 4.6 we wrote the "scaled exponential family" loglikelihood in the form

$$ \ell(\mu, \sigma^2, \theta; y) = \frac{1}{\sigma^2} \int_y^{\mu} \frac{y - u}{g^2(u,\theta)}\, du + c(y, \sigma^2, \theta), \qquad (6.29) $$

where here we have "cheated" a bit and included mention of theta in anticipation of the upcoming arguments. Now consider any "density" of the form (6.29), not just the ones of the scaled exponential family, where g must be of a particular form. Such a "density" would have to
satisfy the following.

- In general, the function c could well depend on the mean mu, say. It just so happens, for the choices of g corresponding to the scaled exponential family, where c is fixed and known, that c does not depend on mu. In general, for (6.29) to be a density for "any old" g, c might well have to depend on mu.

- Regardless of c, we know that, for any g, differentiation of

$$ \frac{1}{\sigma^2} \int_y^{\mu} \frac{y - u}{g^2(u,\theta)}\, du $$

with respect to beta will yield the GLS estimating equation for beta.

- Thus, if c depends on mu, differentiation of the "density" with respect to beta would yield some additional contribution to the estimating equation for beta, which would be undesirable if we wish to preserve the GLS-type estimation of beta.

- But we cannot simply ignore c, because it must contain information on theta, too, which we would need to estimate theta.

APPROACH: Consider "densities" of the form (6.29) and determine a general approximation to the form of c so that

- the "density" integrates to 1;
- then, further approximate c so that it does not depend on mu.

In our notation, the approximation that is used is to consider the situation where "the amount of information in each observation is large." To quantify this idea, write

$$ y = \mu + \sigma\, g(\mu,\theta)\, \epsilon. $$

Now, if sigma is close to 0, then y is close to mu, so that the deviations are small relative to the size of the observation. Under this condition, note that

$$ \int_y^{\mu} \frac{y - u}{g^2(u,\theta)}\, du = \frac{1}{g^2(c^*,\theta)} \int_y^{\mu} (y - u)\, du = -\frac{(y - \mu)^2}{2\, g^2(c^*,\theta)} \approx -\frac{(y - \mu)^2}{2\, g^2\{(y+\mu)/2, \theta\}} \approx -\frac{(y - \mu)^2}{2\, g^2(\mu,\theta)} $$

by the Mean Value Theorem. Here, we have invoked the Mean Value Theorem for some $c^*$ between y and mu and, because y and mu are "close," have approximated $c^*$ by their midpoint (y + mu)/2 in the second-to-last expression, and then further approximated (y + mu)/2 by mu in the denominator of the last expression. Thus, for sigma close to 0, so that y is close to mu, in order for (6.29) to be a density in general (integrate to 1), we must have

$$ c(y, \sigma^2, \theta, \mu) = -\tfrac{1}{2}\log\{2\pi\sigma^2 g^2(\mu,\theta)\}, $$

where we emphasize that c must depend on mu. This follows, as we have just seen above, because

$$ \frac{1}{\sigma^2}\int_y^{\mu} \frac{y - u}{g^2(u,\theta)}\, du \approx -\frac{(y - \mu)^2}{2\sigma^2 g^2(\mu,\theta)}, $$

which just looks like the "exponential part" of a normal density. But this c depends on mu, which violates one of our objectives;
in fact, the "density" is approximately normal, which we know does not yield a linear estimating equation for beta. Thus, a further approximation to remove this dependence is made by replacing mu by y in c; i.e., take

$$ c(y, \sigma^2, \theta) = -\tfrac{1}{2}\log\{2\pi\sigma^2 g^2(y,\theta)\}. $$

Combining these developments, we have the following definition.

DEFINITION 6.1 (Extended quasilikelihood). The extended quasilikelihood (EQL) function is given by

$$ Q^{+}(\mu, \sigma^2, \theta; y) = \frac{1}{\sigma^2}\int_y^{\mu} \frac{y - u}{g^2(u,\theta)}\, du - \frac{1}{2}\log\{2\pi\sigma^2 g^2(y,\theta)\}. $$

The suggestion is to estimate beta, sigma^2, and theta jointly by maximizing

$$ \sum_{j=1}^n Q^{+}\{f(x_j,\beta), \sigma^2, \theta;\, Y_j\}. $$

REMARKS:

- What this essentially does is replace the "troublesome" part of the normal loglikelihood, i.e., that giving rise to the quadratic term in the estimating equation for beta, by the QL, furthermore ensuring that the "log" piece is also free of beta.
- Of course, the EQL is only an approximation to whatever the true loglikelihood may be, and it is not the same as the normal loglikelihood.
- A potential problem in implementation arises when y = 0, for example if $g(y,\theta) = y^{\theta}$. More generally, replacing mu by y in the "log" term is problematic for several reasons, as described by Davidian and Carroll (1988).

SUMMARY: The EQL approach attempts to incorporate estimation of unknown theta into the QL approach, trying to maintain the GLS-type estimation of beta while giving a distributional perspective. This is not always an easy transition, as discussed by Davidian and Carroll (1988).

6.7 Update

We may now update our 2 x 2 table of possible inferential approaches. Note that the distinction of whether g depends on beta has the effect of making GLS-PL and normal ML coincide or differ.

theta known:
- g does not depend on beta. Weights known: WLS. Linear equations: GLS (C fixed), GLS (C = infinity) = IRWLS = QL. Quadratic equations: normal ML and generalizations.
- g depends on beta. Linear equations: GLS (C fixed), GLS (C = infinity). Quadratic equations: normal ML and generalizations.

theta unknown:
- g does not depend on beta. Linear equations: GLS (C fixed), GLS (C = infinity), with different transformations of absolute residuals; GLS-PL = normal ML; EQL. Quadratic equations: normal ML and generalizations.
- g depends on beta. Linear equations: GLS (C fixed), GLS (C = infinity), with different transformations of absolute residuals; GLS-PL (not equal to normal ML); EQL. Quadratic equations: normal ML and generalizations.
6.8 Implementation in SAS and R

Recall the data on the pharmacokinetics of indomethacin in Examples 1.1 and 1.2; the raw data are given in Section 3.7, where we used them to illustrate implementation of IRWLS and the three-step GLS algorithm with theta known. See that section for a description of the model and data ($Y_j$ is concentration at time $x_j$).

Here, we use these same data to demonstrate implementation of the three-step GLS algorithm in the case of theta unknown, where theta is estimated at the second step via the pseudolikelihood method. This is implemented using the "trick" described in Section 6.4. As in Section 3.7, we assume the model

$$ E(Y_j \mid x_j) = f(x_j,\beta) = e^{\beta_1}\exp(-e^{\beta_2}x_j) + e^{\beta_3}\exp(-e^{\beta_4}x_j), \qquad \mathrm{var}(Y_j \mid x_j) = \sigma^2 f^{2\theta}(x_j,\beta), $$

where theta is taken to be an unknown parameter to be estimated, and the elements of beta appear exponentiated to enforce positivity.

For this model, the "mean function" for implementation of the "trick" (called trk or trkfunc in the programs below) is

$$ F_j(\theta) = \{Y_j - f(x_j,\hat\beta)\}\, \frac{\dot f^{\theta}}{f^{\theta}(x_j,\hat\beta)}, $$

where $\hat\beta$ is the current GLS estimate of beta and $\dot f$ is the geometric mean of the $f(x_j,\hat\beta)$. Both $Y_j - f(x_j,\hat\beta)$ and $\dot f/f(x_j,\hat\beta)$ are constants with respect to fitting the "regression" for estimating theta. Each program thus computes the residuals and predicted values from the previous fit of $f(x_j,\beta)$ before invoking the nonlinear regression routine to estimate theta using this approach. Note that

$$ \frac{\partial}{\partial\theta}\, F_j(\theta) = F_j(\theta)\, \log\left\{ \frac{\dot f}{f(x_j,\hat\beta)} \right\}. $$

The following programs are heavily documented and mostly self-explanatory.

PROGRAM 6.1: Implementing the three-step GLS algorithm with pseudolikelihood using SAS proc nlin. Here, proc means is used to calculate $\dot f = \exp\{n^{-1}\sum_{j=1}^n \log f(x_j,\hat\beta)\}$. When proc nlin is called to carry out estimation of theta using the "trick," note that the "Regression" sum of squares is negative.
search77 feature of proc nlin Rather than specify a single starting value we ask proc nlin to evaluate the objective function 29107 here over the given range of 9 values to nd the most promising77 starting value ie the 9 giving the smallest objective function value This may be more time consuming initially than just picking a single starting value as the objective function must be evaluated at each point on the grid However once the best one is found the number of required Gauss Newton iterations is usually smaller PROGRAM STATEMENTS GLS analysis of the subject 5 indomethacin data for variance 2 times a power theta of the mean Here define a SAS macro to perform the 3step GLS algorithm options ps55 ls80 nodate Define a sas macro called quotglsalgquot to control the GLS iteration The macro takes as input the name of the SAS working data set the names 0 the x and y variables here x is a scalar but the program may be modified easily to pass more than one explanatory variable and starting values for each WLS calculation Xmacro glsalgdsetxvaryvarb1b2b3b4 Set up the data set for the first pass through the algorithm which will compute the OLS estimate e ariable quot red1quot containing the predicted values from the revious iteration is se equa o 1 here so that the first pass wit weights based on pred1 theta will actually be OLS weights all 1 data ampdset set ampdset p e 11 if ampxvar or ampyvar then delete b1newampb1 b2newampb2 b3newampb3 b4newampb4 Set the initial value for theta OLS theta100 Call the macro quotstep3quot defined below that actually implements the call to PROC NLIN to do WLS with the current set of fixed wei ts The argument sets options for supressing the printing of the results of eac all to PRO NLIN Xstep31 Xstep31 PAGE 144 CHAPTER 6 ST 762 M DAVDDIAN Xstep31 Xstep31 Xstep30 After the final iteration compute the final estimate of sigma using the newest values of the weights data sigma set outnlinkeep resid pred data sigma2 set newkeep theta1 data sigma 
merge sigma sigma2 data sigma set sigma wresidresidpredtheta1 proc means datasigma noprint var wresi output outsigout ussrss data sigout set sigout sigmasqrtrss7 data sigout set sigoutkeep sigma if ngt1 then delete proc print datasigout title2 IIFinal estimate of sigmaquot Xmend glsalg calling PROC NLIN see below Define the macro quotstep3quot The argument quotfooquot controls options in Xmacro step3foo r all ot PROC NLIN prints out a summary by default We rint out this full gutput oan fo the final GLS fit the IInoprintI option is invoked er ts Zif ampfoogt0 Zthen Zdo Xle optnoprint Xend Zelse Zdo Xlet opt Xend equal to theta1 the previous estima The call to PROC NLIN to implement the current WLS fit with theta te proc nlin dataampdset methodgauss ampopt parms b1ampb1 b2ampb2 b3ampb3 b4ampb4 f iter1 t en 0 d1b1new b2b2new b3b3new b4b4new en Programming statements to define the mean function and its derivatives eb1exp b1 eb2exp b2 eb3exp b3 eb4exp b4 feb1expeb2ampxvar eb3expeb4ampxvar db1eb1expeb2ampxvar db2eb1eb2ampxvarexpeb2ampxvar db3eb3expeb4ampxvar db4eb3eb4ampxvarexpeb4ampxvar PAGE 145 ST 762 M DAVDDIAN CIAI FEI6 The model statement and derivative statements to tell PROC NLIN the form of the model and derivatives model ampyvar f derb1db1 derb2db2 derb3db3 derb4db4 The quotweightquot statement defines the values of the wei ts to use for ere we use wei ts computed using the fixed va ue of theta1 and the values of the predicted va ues from e previous iteration Thus the weights are FIXED do not depend on current predicted values PROC NLIN thus knows to do WLS with FIXED weights rather than IRWLS weight 1abspred12theta1 Form an output data set containing everything in the input data set plus the predicted values to set up weights for next time the residuals so we can calculate sigma on the final iteration and the parameter estimates from this iteration so we can print them output outoutnlin ppred rresid ssesigma parmsb1 b2 b3 b4 run Now estimate theta by PL with 
the IITrickquot defining a quotdummyquot nonlinear regression problem Here ot geometric mean of the current estimated means dummy dummy responses all O data outnlin set outnlin if predltO then delete sigmasqrtsigma7 lpredlogpred proc means dataoutnlin noprint var lpre output outoutmeans meanmlog data outnlin merge outnlin outmeansdroptype freq retain xx 0 i n1 then xxmlog if mlog then mlogxx data outnlin set outnlin dumm 0 fdotexpmlog relax the convergence criterion for variance parameters proc nlin dataoutnlin methodgauss converge000001 ampopt parms t eta to 15 O 5 trk residfdotpredtheta mo e trk dertheta residfdotpredthetalogfdotpred output outnew parmstheta run Set up the data set containing the new predicted values etc for use on the next iteration Also create the data set quotparmsquot containing only the parameter estimates from the iteration just complete for printing data new set newkeep ampxvar ampyvar b1 b2 b3 b4 pred theta sigma renameb1b1new b2b2new b3b3new b4b4new predpred1 thetatheta1 sigmasigma1 PAGE 146 CIIAI FEIl6 ST 762 M DAVDDIAN data parms set newdrop if n ampxvar ampyvar gt1 then delete proc print dataparms data ampdset set new Xmend step3 End of the macro definitions Now the program to read the data and call the macros begins data indo input time conc cards 025 205 050 104 075 081 100 039 125 030 200 023 300 013 400 011 500 008 600 010 800 006 Call the macro quotglsalgquot with the name of the indomethacin data set the names of x time and y conc and the starting values title1 II3STEP GLS ALGORITHM APPLIED TO THE INDOMETHACIN DATAquot Xglsalgindotimeconc0690691616 OUTPUT 3STEP GLS ALGORITHM APPLIED TO THE INDOMETHACIN DATA Obs pred1 b1new b2new b3new b4new sigma1 theta1 1 203282 127147 104077 123272 150684 0067932 069275 3STEP GLS ALGORITHM APPLIED TO THE INDOMETHACIN DATA Obs pred1 b1new b2new b3new b4new sigma1 theta1 1 201477 124128 097256 143246 174779 011240 081920 3STEP GLS ALGORITHM APPLIED TO THE INDOMETHACIN DATA Obs pred1 b1new b2new 
b3new b4new sigma1 theta1 1 201015 123673 097036 142826 174176 013147 081882 3STEP GLS ALGORITHM APPLIED TO THE INDOMETHACIN DATA Obs pred1 b1new b2new b3new b4new sigma1 theta1 1 201007 123659 097009 142895 174255 013133 081892 3STEP GLS ALGORITHM APPLIED TO THE INDOMETHACIN DATA Obs pred1 b1new b2new b3new b4new sigma1 theta1 1 201006 123658 097009 142891 174249 013135 081891 3STEP GLS ALGORITHM APPLIED TO THE INDOMETHACIN DATA Obs pred1 b1new b2new b3new b4new sigma1 theta1 1 201006 123658 097009 142891 174249 013134 081891 3STEP GLS ALGORITHM APPLIED TO THE INDOMETHACIN DATA Obs pred1 b1new b2new b3new b4new sigma1 theta1 1 201006 123658 097009 142891 174249 013134 081891 3STEP GLS ALGORITHM APPLIED TO THE INDOMETHACIN DATA 1 2 PAGE 147 CHAPTER 6 ST 762 M DAVDDIAN Obs pred1 b1new b2new b3new b4new sigma1 theta1 1 201006 123658 097009 142891 174249 013134 081891 3STEP GLS ALGORITHM APPLIED TO THE INDOMETHACIN DATA Obs pred1 b1new b2new b3new b4new sigma1 theta1 1 201006 123658 097009 142891 174249 013134 081891 3STEP GLS ALGORITHM APPLIED TO THE INDOMETHACIN DATA 10 Obs pred1 b1new b2new b3new b4new sigma1 theta1 1 201006 123658 097009 142891 174249 013134 081891 3STEP GLS ALGORITHM APPLIED TO THE INDOMETHACIN DATA 11 The NLIN Procedure Iterative P ase De endent Variable conc et od GaussNewton Weighted Iter b1 b2 b3 b4 SS 0 12366 09701 14289 17425 01208 NOTE Convergence criterion met Estimation Summary Method GaussNewton Iterations R 2549E6 PPCb4 7078E7 RPC Ject ObJective 0120759 Observations Read 11 Observations Used 11 Observations Missing 0 NOTE An intercept was not specified for this model um of S Mean Approx Source DF Squares Square F Value Pr gt F Regression 4 73194 18298 10607 lt0001 Residual 7 01208 00173 Uncorrected Total 11 74401 Corrected Total 10 38688 AEprox Approximate 95 Confidence Parameter Estimate Std rror Limits b1 12366 01768 08184 16548 b2 09701 01379 06440 12962 b3 14289 02322 19779 8799 b4 17425 02632 23648 11202 3STEP GLS ALGORITHM 
APPLIED TO THE INDOMETHACIN DATA 12 The NLIN Procedure Approximate Correlation Matrix b1 b2 b4 b1 10000000 08246022 02715809 02276416 b2 08246022 10000000 06209805 05451138 b3 02715809 06209805 10000000 09316278 b4 02276416 05451138 09316278 10000000 3STEP GLS ALGORITHM APPLIED TO THE INDOMETHACIN DATA 13 The NLIN Procedure Gri earch Dependent Variable dummy Sum of theta Squares 0 00335 PAGE 148 CIIAI FEIl6 ST 762 M DAVDDIAN 0500 0303 1000 0276 1500 0252 2000 0231 2500 0213 3000 0198 3500 0184 4000 0173 4500 0163 5000 0155 5500 0148 6000 0142 6500 0138 7000 0135 7500 0133 8000 0132 8500 0132 9000 0133 9500 0136 0000 0139 0500 0143 1000 0148 1500 0154 2000 0162 2500 0170 3000 0180 3500 0191 4000 0204 4500 0218 5000 0234 3STEP GLS ALGORITHM The NLIN Procedure a ive hase De endent Variable dummy ethod GaussNewton Iter theta 08000 Sum of Squares 00132 00132 00132 00132 00132 00132 00132 00132 00132 00132 oooo oumpmmwxo o 00 gt x 00 o H 00132 NOTE Convergence criterion met Estimation Summary GaussNewton 1 Read Used Missing 0 APPLIED TO THE INDUMETHACIN DATA NOTE An intercept was not specified for this model S Mean Source DF Squares Square F Value Regression 1 00132 00132 1000 Residual 10 00132 000132 Uncorrected Total 11 0 Corrected Total 10 0 3STEP GLS ALGORITHM APPLIED TO THE INDOMETHACIN DATA The NLIN Procedure AEprox Parameter Estimate Std rror A rox P Pgt F Approximate 95 Confidence Limits PAGE 149 CHAPTER 6 ST 762 M DAVDDIAN theta 08189 03620 00122 16256 A proximate Corre ation Matrix theta theta 10000000 3STEP GLS ALGORITHM APPLIED TO THE INDOMETHACIN DATA 16 Obs pred1 b1new b2new b3new b4new sigma1 theta1 1 201006 123658 097009 142891 174249 013134 081891 3STEP GLS ALGORITHM APPLIED TO THE INDOMETHACIN DATA 17 Final estimate of sigma Obs sigma 1 013134 PROGRAM 62 Implementing the three step GL3 algorithm with pseudolikelihood using R nls Here7 we use the same considerations as in the SAS program7 except that we use a single starting value of 9 10 for each 
instance of estimation of 9 using the trick PROGRAM STATEMENTS Program to implement the 3step GLS algorithm with cmax total iterations theta unknown using the function nls Details on the nls function may be found in Chapter 10 of the book IIStatistical Models in Squot edited by JM Chambers g and T Hastie 1993 Chapman and Hall Applied here to the indomethacin data assuming variance proprotional to a power 2theta of t an resp se 39Powerofme quot model Theta is estim ted fro the data using the pseudo likelihood meth implemented by quottrickin quot t e onlinea re ression rou ne m an mode f t e biexponential el parameterized to enforce positivi The ro ram ma be used for an roblem b chan in the codepde ining the mean functioh End quotweightsquot g g g Define the bioexponential mean function f and the gradient matrix of its partial derivatives with respect to beta n x p attribute The nls function will know to use analytic derivatives n it s ots the presence of the attribute quotgradientquot defined along wit the function indofunc lt functiontimeb1b2b3b4 eb1 lt expb1 eb2 lt expb2 eb3 lt expb3 eb4 lt expb 4 indof lt eb1expeb2timeeb3expeb4time compute analytical dervivatives create the gradient matrix X indograd lt array0clengthtime4listNULLcquotb1quotquotb2quotquotb3quotquotb4quot indogradquotb1quot lt eb1exp eb2time eb1eb2timeexpeb2time eb3expeb4time 3eb4timeexpeb4time attrindofquotgradientquot lt indograd I m 0quot PAGE 150 CHAPTER 6 ST 762 M DAVDDIAN indof To im lement step iii we wish to do weighted least squares wit nown weights This is accomplished y ransforming the the response and mean func ion 0 a roblem with cons ant variance nls is then called to do OLS on the transformed problem thereby doing S g Because the wei ts in this case are a function of the mean I a so ine t e mean function with no gradient attribute T is is because the transformed mean unc io wi 1 de on the rent estimated w ghts which are foun by evaluating the ean ction at the ent estimate Becau e of 
the nature of R attributes if calculated using th on quotindofuncquot above the wei ts will carry along the gradient attribute which will co use things when we wish to calculate t e radien of the transforme mean unc ion assumin e weig ts are constant ere are more elegant ways around this but doing it this way here is meant to highlight the issue so you will be aware of t unweightfunc lt functiontimeb1b2b3b4 b 4 uwtf lt eb1expeb2timeeb3expeb4time t The transformed mean function multiplied by the square root of the current estimated weights which are considered fixed weightfunc lt functiontimeb1b2b3b4wt b1 eb1 lt exp eb2 lt expb2 eb3 lt expb3 pred lt unweightfunctimeb1b2b3b4 w12 lt sqrtwt weightf lt predw12 compute analytical dervivatives create the gradient matrix X weightgrad lt array0clengthtime4listNULLcquotb1quotquotb2quotquotb3quotquotb4quot weightgradquotb1quot lt eb1expeb2timew12 lt eb1eb2timeexpeb2timew12 weightgradquotb3quot lt eb3expeb4timew12 4 l lt eb3eb4timeexpeb4timew12 attrweightfquotgradientquot lt weightgrad eightf To estimate theta we use the quottrickquot of turning the PL estimation problem into a IInonlinear regressionquot pro em Here the unc ion trkfunc is the mean function for the quotdummyquot regression problem that is solved to estimated theta trkfunc lt functionresidmudotmu theta trk lt residmudotmutheta trkgrad lt arrayOclengthmu1listNULLcquotthetaquot trkgradquotthetaquot lt trklogmudot mu at rtrkquotgradientquot lt trkgrad analytic derivative tr The data alternatively we could read them from a file of course Note time has already been put into hours subject 5 time lt c0 250 500751 001 252 OO300400500600800 conc lt c2051040810 390300 23013011008010006 PAGE 151 CHAPTER 6 ST 762 M DAVDDIAN n lt lengthconc plt4 Create the data frame for nls indodat lt data frametimeconc Specify the max number of iterations of GLS C cmax lt 10 Step 1 initial fit by OLS A call to nls s pretty selfexplanatory The first argument on the HS of th 39 i specifies 
the model e is t e response variable on the RHS is the mean function may also be just an expression ject containing e parameter estimates an 0 her summary information rom the fit a u m m n u o d E d H o u n m P 4 m m n o 5 SL m 0 Dquot m d r m E m aaaaaaaaaaaaaa indo olsfit lt nlsconc 39 indofunctime b1b2b3b4indodat listb1069 b2069 b316 b4 165 Extract the estimate from the object indo olsfit bols lt coefindo olsfit Print out the results to a file we round the results to 6 decimal places to the right of the decimal agatata catquotFIT OF THE INDOMETHACIN DATA BY GLSquotfilequotindogls2 Routquotquotnquotquotnquotquotnquot appendF catquot0LS estimate quotroundbols6filequotindogls2RoutquotquotnquotquotnquotquotnquotappendT Use the OLS estimator as the preliminary estimator bgls lt bols Begin iterating between steps ii and iii for k in 1cmax Step ii estimate theta and calculate the weights and the transformed response to use in the WLS calculation in iii a a compute the geometric mean and predicted values mu lt unweightfunctimebgls1bgls2bgls3bgls4 mudot lt prodmu1n mudot lt repmudotn PAGE 152 CHAPTER 6 ST 762 M DAVDDIAN resid lt concmu dummy lt rep0lengthtime pldat lt dataframeresidmumudot Now estimate theta using the PL trick indoplfit lt nlsdummy 39 trkfuncresidmudotmuthetapldat listtheta1 theta lt coefindoplfit catquotIteration quotkquotnquotfilequotindogls2RoutquotappendT catquotEstimate of theta quotroundtheta4quotnquot filequotindogls2RoutquotappendT mu lt unweightfunctimebgls1bgls2bgls3bgls4 wt lt 1mu 2theta concwt lt concsqrtwt Step iii update estimation of beta by WLS with the weights held fixed First create the updated data frame of transformed responses and weights for use by nls indodat2 lt dataframetimeconcwtWt indoglsfit lt nlsconcwt 39 weightfunctimeb1b2b3b4wtindodat2 listb1069 b2069 b316 b416 Get the updated GLS estimate to use for constructing weights on the next iteration bgls lt coefindoglsfit Print results of this iteration to the output file 
catquotGLS estimate of beta quotroundbgls6quotnquotquotnquotfilequotindogls2Routquot appendT Finished iteration loop now com ute the estimate of sigma 2 based on the final GLS estimate se the quotadjustedquot version mu lt unweightfunctimebgls1bgls2bgls3bgls4 resi lt concmu g lt mu theta sigma2 lt sumresidg2n p sigma lt sqrtsigma2 Print out the final estimate of sigma and the summary provided by g the nls function catquotFinal estimate of sigma quotroundsigma6quotnquotquotnquot filequotindogls2RoutquotappendT sinkquotindogls2RoutquotappendT printsummaryindoglsfit sink OUTPUT FIT OF THE INDOMETHACIN DATA BY GLS OLS estimate 1271474 1040768 1232717 1506841 PAGE 153 CIIAI FEIl6 Iteration 1 Estimate of theta 06928 GLS estimate of beta 1241283 0972565 1432457 1747787 Iteration 2 Estimate of theta 08 192 GLS estimate of beta 1236734 0970361 1428257 1741751 Iteration 3 Estimate of theta 0 8188 GLS estimate of beta 1236594 0970094 1428944 174254 Iteration 4 Estimate of theta 08189 GLS estimate of beta 1236587 0970098 1428899 1742481 Iteration 5 Estimate of theta 08189 LS estimate of beta 1236587 0970097 1428904 1742486 Iteration 6 Estimate of theta 08189 GLS estimate of beta 1236587 0970097 1428904 1742486 Iteration 7 Estimate of theta 08 189 GLS estimate of beta 1236587 0970097 1428904 1742486 Iteration 8 Estimate of theta 0818 9 GLS estimate of beta 1236587 0970097 1428904 1742486 Iteration 9 Estimate of theta 08189 GLS estimate of beta 1236587 0970097 1428904 1742486 Iteration 10 Estimate of theta 08189 GLS estimate of beta 1236587 0970097 1428904 1742486 Final estimate of sigma 0131344 Formula concwt 39 weightfunctime b1 b2 b3 b4 wt Parameters Estimate Std Error t value Prgtt b1 12366 01768 6993 0000213 b2 09701 01379 7034 0000205 b3 14289 02322 6155 0000465 b4 17425 02632 6621 0000298 Signif codes 0 0001 001 005 01 Residual standard error 01313 on 7 degrees of freedom Correlation of Parameter Estimates b2 b3 b2 0 8246 b3 0 2716 06210 b4 02276 05451 09316 7 1 ST 762 M 
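Both programs implement the PL step by "tricking" a nonlinear least squares routine into minimizing S(θ) = Σ_j F_j(θ)² = Σ_j {r_j (ḟ/f_j)^θ}². For the power-of-the-mean model, log S(θ) differs from the PL objective in θ only by a constant (because ḟ is the geometric mean of the f_j), so the dummy regression really is carrying out PL. The following minimal Python sketch is not part of the course programs; it isolates this one step with a simple golden-section search standing in for proc nlin/nls, and the "fitted values" below are illustrative numbers only (the indomethacin concentrations reused as stand-ins).

```python
import math

def pl_trick_theta(resid, fitted, lo=0.0, hi=3.0, tol=1e-8):
    """Estimate theta in var(Y|x) = sigma^2 f^(2 theta) by minimizing the
    'trick' sum of squares S(theta) = sum_j { r_j (fdot/f_j)^theta }^2,
    where fdot is the geometric mean of the fitted values f_j."""
    logf = [math.log(f) for f in fitted]
    logfdot = sum(logf) / len(logf)  # log of the geometric mean fdot
    def S(theta):
        return sum((r * math.exp(theta * (logfdot - lf))) ** 2
                   for r, lf in zip(resid, logf))
    # Golden-section search; S(theta) is convex in theta for this model
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if S(c) < S(d):
            b, d = d, c
            c = b - g * (b - a)
        else:
            a, c = c, d
            d = a + g * (b - a)
    return 0.5 * (a + b)

# If the residuals follow the power law exactly, the minimizer is exactly
# theta0 (differentiate S and use that log fdot is the mean of the log f_j)
fitted = [2.05, 1.04, 0.81, 0.39, 0.30, 0.23, 0.13, 0.11]
resid = [0.1 * f ** 0.8 for f in fitted]
print(round(pl_trick_theta(resid, fitted), 4))  # -> 0.8
```

In the actual algorithm, this minimization would be carried out at each pass of step (ii), with resid and fitted taken from the current GLS fit.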
CHAPTER 8, ST 762, M. DAVIDIAN

8 Large sample theory: A casual approach

So far, we have discussed a number of approaches to inference in the general mean-variance model

E(Y_j | x_j) = f(x_j, β),  var(Y_j | x_j) = σ² g²(β, θ, x_j).  (8.1)

In particular, we have proposed several methods for estimation of β, θ, and σ.

We have focused on characterizing the various approaches in terms of estimating equations to be solved in the unknown parameters. Because of the nonlinearity of the mean and variance functions, as well as the complexity of some of these equations, it is in general not possible to express the proposed estimators in closed form. Rather, they may only be represented as the implicit solutions to the equations.

When such closed-form expressions are available, it is in fact often possible to derive exact results regarding the properties of estimators in finite samples. For example, in classical linear regression with the normality assumption, we may express the estimator for the regression parameter β as a closed-form linear function of the design matrix X and the data Y, namely β̂ = (XᵀX)⁻¹XᵀY. Because of this pleasant form, it is straightforward to show that the estimator is unbiased for any fixed sample size n; i.e., E(β̂) = β. Moreover, it is possible to derive directly an expression for the sampling covariance matrix of β̂ as σ²(XᵀX)⁻¹. This covariance matrix, of course, may be estimated by substituting the estimator for σ², and the square roots of the diagonal elements of the resulting matrix provide estimated standard errors for the components of β̂. In fact, with the normality assumption, it is possible to conclude that β̂ ∼ N(β, σ²(XᵀX)⁻¹) for fixed n, and that replacement of σ² by the estimator changes this to a t distribution. Thus, an exact sampling distribution is available, upon which confidence intervals, hypothesis testing procedures, and so on may be based.

Unfortunately, a consequence of the complexity of the estimators for (8.1) and the lack of closed-form expressions is that, in contrast to the case of
classical linear regression, it is no longer possible to obtain finite-sample (exact) results for the sampling properties of the estimators.

RESULT: We cannot derive exact results, but we may at least obtain approximate ones. The usual approach to obtaining approximate results under these circumstances, which we will adopt, is to appeal to large sample theory.

• In particular, as results for fixed sample size n are clearly not tractable, the idea is to gain insight theoretically by simplifying the problem. The usual simplification is to evaluate what happens when the sample size n becomes large, formally, under the condition n → ∞.

• Results derived under the condition n → ∞ are technically only exactly valid in the limit; however, it has been found that in practice they are often a good approximation in fixed sample size situations. That is, standard errors and sampling distributions derived via large sample approximation are often found to be reasonable reflections of the true, but intractable, properties of the estimators for fixed n.

Our approach will thus be to investigate the properties of the estimators we have discussed by appealing to large sample theory arguments. This will allow us to gain insight into several issues:

• Although the estimators cannot be shown to be unbiased, nor can we identify the one with minimum variance exactly, we will be able to deduce analogous large sample properties.

• Recall that we are not necessarily willing to specify a distributional assumption under (8.1). The classical regression normal sampling distribution follows from the assumption of a normally distributed response. We will see that approximate sampling results may be deduced in a large sample framework that do not require a distributional assumption on the response.

• We will in fact be able to gain insight into the consequences of failure to take nonconstant variance into account. Moreover, the complexity of (8.1) suggests the possibility that certain components of the
model (e.g., the variance function) could be misspecified in practice. We will be able to carry out arguments to evaluate, approximately, the consequences.

• We will also be able to gain insight into the tradeoffs between linear and quadratic estimating equations for β, as well as the best way to estimate θ and the effect of knowing its value versus the need to estimate it.

We will cover all this and more in subsequent chapters. Before we can undertake these investigations, we must lay the basic foundation for large-sample theory arguments. Here, we review some of the formal concepts involved and then (Section 8.3) see how we might begin to apply some of these tools to a simple version of (8.1) with known variance (weights). The fundamental concepts may be summarized as follows; we will state these formally later.

• Consistency: Does the estimator estimate the "right stuff"? That is, for larger and larger sample sizes, does the estimator tend to approach the true underlying value of the parameter in question, in some sense?

• Asymptotic distribution: May we approximate the true, unknown sampling distribution of the estimator somehow, to use as a basis for inference and to gain an understanding of precision of estimation?

• Asymptotic relative efficiency: There are different definitions of this concept; we will consider a standard one. Can we compare the performance of two or more competing estimators for the same quantity? Such a comparison would obviously involve the relative precision of estimation; e.g., if both are consistent, which one is better in terms of precision?

IMPORTANT: Whenever large sample or other asymptotic theory arguments are invoked, it is understood that the results are only approximations to true finite sample behavior. Thus, although the insights to be gained and approximate sampling properties are quite useful, it must be kept in mind that they may not be relevant in the small sample sizes often encountered in practice. Thus, quantities like p-values, confidence limits,
and so on should be interpreted very carefully.

DISCLAIMER: This is not a course in large sample theory. Thus, this chapter is not meant to give a rigorous treatment of these concepts. Rather, it is meant only to serve as a casual introduction to the basic concepts. We will see that this casual level of theory will be sufficient for us to gain a great amount of insight and information for addressing the questions we have given above.

8.1 Basic concepts and notation

In this section, we review some basic concepts in probability and large sample theory. Rather than discuss these in the particular context of (8.1), we introduce the concepts generically. We will relate these developments to the specific model (8.1) in subsequent sections.

CONSISTENCY AND ORDER IN PROBABILITY: In order to evaluate the notion of whether an estimator approaches the "right stuff," we must define precisely what we mean by this. Along with this concept is a convenient notation that summarizes the behavior of relevant quantities in this sense.

STOCHASTIC CONVERGENCE: To discuss the notion of consistency, we need a basic understanding of convergence of random variables. The following concepts are usually introduced in a probability course, but often their practical usefulness is not elucidated.

• Estimators are functions of random variables, so that they are themselves random variables (vectors). For our particular model, the estimators we have considered are very complicated functions of random draws from a population (so of the random variables/vectors Y_1, …, Y_n) that are in fact only implicitly defined. Thus, convergence of random variables (vectors) in a probabilistic sense can be seen to be directly relevant to defining consistency, as we now explain.

For the purposes of this discussion, let Y_n be a generic random variable (scalar or vector) that depends on n. Let Y be another random variable (or vector). We will relate these to estimators momentarily.

DEFINITION 8.1 (Almost sure convergence). Y_n →as Y; i.e., Y_n
converges to Y with probability one (or almost surely) if

P( lim_{n→∞} Y_n = Y ) = 1.

DEFINITION 8.2 (Convergence in probability). Y_n →p Y; i.e., Y_n converges to Y in probability if

lim_{n→∞} P( |Y_n − Y| < ε ) = 1 for all ε > 0.

For random vectors, the definitions extend element by element.

FACTS: The following may be derived from the above definitions.

• Y_n →as Y implies that Y_n →p Y.

• If h is a continuous function in its argument, then if Y_n →as Y, it follows that h(Y_n) →as h(Y). Similarly, if Y_n →p Y, then h(Y_n) →p h(Y).

We will make routine use of the second fact in the sequel.

Taken alone, the definitions do not seem to be relevant to a study of practical issues in estimation. However, if we identify them with the generic estimation problem, their importance becomes clear.

• n is the sample size.

• Y_n represents an estimator of some parameter of interest in a statistical model, η say. The estimator is obviously a function of n through its dependence on the n (assumed randomly sampled) observations, so we could write it as η̂_n. Ordinarily, we do not include a subscript n in standard notation for estimators, but it is important to recognize that they do indeed depend on the sample size. In particular, if we view an estimator properly as a function of the sample, then the sample size serves as an index for a sequence of estimators {η̂_n}, one for each n.

• Y represents the "thing" Y_n (and hence an estimator η̂_n) approaches. In the estimation problem, we obviously hope that η̂_n → η_0, where, assuming that the statistical model in which η appears is correct, η_0 is the true value generating the data.

• Thus, in Definitions 8.1 and 8.2 of modes of stochastic convergence above, the estimator η̂_n plays the role of Y_n, while the value η_0, which in this case is a fixed constant, plays the role of Y. So, in the case of applying these definitions to consistency of estimators, which we define formally momentarily, the random variable (or vector) Y is degenerate.

TERMINOLOGY: With these identifications, special terminology is used to describe how
η̂_n approaches η_0:

• Strong consistency: η̂_n →as η_0.

• Weak consistency: η̂_n →p η_0.

WHAT DOES THIS MEAN? Both types of consistency state that the estimator approaches the quantity to be estimated in a probabilistic sense.

• From the definition of almost sure convergence, the interpretation of strong consistency is that, if the sample size n is sufficiently large, the probability that η̂_n will assume values outside an arbitrarily small neighborhood of η_0 is zero. This follows from the fact that the limit appears inside the probability statement in Definition 8.1. Recall that, for a deterministic sequence, a_n has limit a if for each ε > 0 there is a value n_ε such that |a_n − a| < ε for all n > n_ε. This may be applied to the probability.

• From the definition of convergence in probability, the interpretation of weak consistency is that, for n large, the probability is small that η̂_n assumes a value outside an arbitrarily small neighborhood of η_0. This again follows from the definition of a limit: the difference between 1 and the probability that η̂_n is within ε of η_0 is less than ε if n is greater than some n_ε.

• The names seem to imply that "strong" is better than "weak."

PRACTICAL DIFFERENCE: Here is a popular argument in favor of strong consistency. Suppose that one were to collect data sequentially, and periodically we re-estimate η by η̂_n, where n is the number of observations collected so far. Thus, with this scheme, n is increasing to infinity. A sequence of estimators indexed by n, {η̂_n}, is thus generated.

• One would like to be assured that a point in time (so a value of n) may be reached at which the current estimate is sufficiently close to the true value and will never "wander away" again after further data collection.

• Strong consistency ensures this: for n large enough, the probability that η̂_n will stay arbitrarily close to η_0 is 1.

• Weak consistency does not: it only states that the probability that η̂_n will "wander away" again is small.

This argument would seem to suggest that
we should always prefer strong consistency. However, statisticians are usually willing to settle for weak consistency; as statistical thinking is in terms of probability rather than absolutes, this is not surprising. Most statisticians are content that an estimator is "good" if we can make the probability of η̂_n being close to the true value large, rather than equal to 1.

The unqualified term "consistency" in most statistical literature almost always refers to weak consistency. In our further developments, we will be satisfied with weak consistency.

TECHNICAL NOTES:

• We have presented consistency here under the conditions that there is a statistical model involving a parameter η, and this model is correctly specified. We may thus think of this model as indexed by values of η, and there is a true value of η, η_0, that is responsible for the data we have seen. Interest in statistical problems is, of course, in estimating this true value. Under these conditions, usually the term "consistency" is meant to imply that this is the situation, i.e., that the true value of some parameter generating the data is correctly identified in the probabilistic sense.

• Of course, it is not always the case that the model that has been specified is correct, although we may not be aware of this. In this situation, we may still consider the model to be indexed by a parameter η, and we may still deduce estimators for η. Now, however, there is not a "true" value for η, as the model does not coincide with reality. However, the estimator η̂_n may still be defined for each n, and it may still converge (in probability or almost surely) to some quantity η*, say. It is sometimes the case in this situation that η̂_n will be referred to as being "consistent" for the value η*. This can be confusing.

Alternatively, even in the context of a correctly specified model, it is possible to come up with estimators η̂_n that do not estimate the "right stuff," i.e., that are not consistent, in the sense that η̂_n →p η*, where η* is some value not equal
to the true value η_0. In this case, η̂_n would be said to be inconsistent.

In general, then, the goal in statistical modeling is (a) to identify a statistical model that is correct and (b) to identify a consistent estimator for the true value of a parameter η indexing this model. If (a) is not carried out, (b) is generally not possible. It is thus important to note that, in most studies of properties of estimators in statistical models, that the model is correctly specified is often taken as a starting point. We will take this perspective initially; however, we will also investigate what happens when certain components of models are not correctly specified.

ORDER IN PROBABILITY: The notation we now discuss may appear confusing initially but, once mastered, is useful for streamlining presentation of large sample results. Again, let Y_n denote a generic sequence of random variables/vectors indexed by n.

DEFINITION 8.3 (Big O_p). Y_n is at most of order n^k in probability if, for all ε > 0, there exist constants n_ε, M_ε > 0 such that

P( n^{−k} ‖Y_n‖ < M_ε ) > 1 − ε for all n > n_ε.

Here, ‖·‖ is some appropriate norm to measure magnitude in the case of vector Y_n; if Y_n is scalar, then this is just absolute value. The notation is Y_n = O_p(n^k).

• The definition says that the magnitude of n^{−k}Y_n stays bounded with high probability if n is large enough. It will turn out that the cases of most interest to us are when k is nonpositive.
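Definition 8.3 can be made concrete with a quick Monte Carlo sketch. This is an illustrative aside, not from the notes; it assumes i.i.d. Uniform(0,1) observations, for which Ȳ_n − 1/2 = O_p(n^{−1/2}) by the central limit theorem. With k = −1/2, a high quantile of n^{1/2}|Ȳ_n − 1/2| plays the role of the bound M_ε and stays essentially constant as n grows:

```python
import random

def dev_quantile(n, scale_power, reps=1000, seed=12345):
    """90th percentile, over Monte Carlo replications, of
    n^scale_power * |Ybar_n - 1/2| for Ybar_n the mean of n iid
    Uniform(0,1) draws (true mean 1/2)."""
    rng = random.Random(seed)
    devs = sorted(
        n ** scale_power * abs(sum(rng.random() for _ in range(n)) / n - 0.5)
        for _ in range(reps)
    )
    return devs[int(0.9 * reps)]

for n in (100, 400, 1600):
    # column 1: scaled by n^(1/2), roughly constant (bounded in probability)
    # column 2: unscaled, shrinking toward zero
    print(n, round(dev_quantile(n, 0.5), 3), round(dev_quantile(n, 0.0), 4))
```

The scaled column hovers near the normal-approximation value 1.645/√12 ≈ 0.475 for every n, while the unscaled column is that value divided by n^{1/2}; this is the numerical face of Ȳ_n − 1/2 = O_p(n^{−1/2}).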
high probability. In this case, $Y_n$ is said to be "bounded in probability."

- Practically speaking, this says that, as $n$ gets large, $Y_n$ does not become negligible, nor does it "blow up." Instead, it is "nicely behaved."

DEFINITION 8.4 (Little $o_p$). $Y_n$ is said to be of smaller order in probability than $n^k$ if
$$n^{-k} Y_n \xrightarrow{p} 0 \quad \text{as } n \to \infty.$$
The notation is $Y_n = o_p(n^k)$.

- The practical case of most interest to us is $k = 0$. It is immediate that this is the same as $Y_n \xrightarrow{p} 0$. Thus, writing $Y_n = o_p(1)$ is a shorthand way of saying that $Y_n$ converges in probability to zero; that is, for $n$ sufficiently large, $Y_n$ stays arbitrarily close to zero with high probability. This notation will be of convenience in later arguments: if we can show an expression is $o_p(1)$, it can be ignored as negligible.

- More generally, the case $k \leq 0$ is the most interesting. For example, if $k = -1/2$, $Y_n = o_p(n^{-1/2})$, then $n^{1/2} Y_n \xrightarrow{p} 0$. This is convenient shorthand notation that effectively says that we can multiply $Y_n$ by a factor that acts like $n^{1/2}$ and the entire product will still be negligible. Thus, this notation is a useful way of expressing how quickly $Y_n \xrightarrow{p} 0$; i.e., if $Y_n = o_p(n^{-1/2})$, then the convergence is faster than that of $n^{-1/2}$.

FACTS: The following facts may be deduced from the definitions.

- $Y_n = O_p(n^{-\delta})$ for $\delta > 0$ implies that $Y_n = o_p(1)$. This is intuitively obvious: if $Y_n$ acts like $n^{-\delta}$ when $n$ is large, with high probability, then it must go to zero.

- Stating that $Y_n = O_p(n^{-\delta})$ is more informative than just saying $Y_n = o_p(1)$; the former not only tells us that $Y_n$ becomes negligible with high probability, but at what rate.

- If $Y_n = O_p(n^j)$ and $X_n = O_p(n^k)$, then $X_n Y_n = O_p(n^{j+k})$. The same holds true if $O_p$ is replaced by $o_p$, and for combinations of $O_p$ and $o_p$. Thus, if we know something about the order in probability of each of two quantities, we immediately may deduce how their product behaves.

- A useful special case of this is when $Y_n = O_p(1)$ (bounded in probability) and $X_n = o_p(1)$ ($X_n \xrightarrow{p} 0$). Then $X_n Y_n = o_p(1)$. Intuitively, this makes sense: $Y_n$ is "well behaved," neither getting small nor blowing up, and is multiplied by something that is
getting small. Thus, the product would be expected to get small. Of course, this means that $X_n Y_n \xrightarrow{p} 0$.

We will exhibit the use of this notation shortly. Of course, it is important to keep in mind that $Y_n = O_p(n^{-1/2})$, say, only means that the magnitude of $n^{1/2} Y_n$ is bounded by some constant. That constant could be huge, so that $n$ must be very, very large for $Y_n$ to become negligible for practical purposes. This kind of issue in part explains why sometimes large-sample results do not seem relevant in practice.

We now turn to concepts useful in deducing approximations to sampling distributions of estimators. Continue to regard $Y_n$ and $Y$ as generic random variables/vectors.

DEFINITION 8.5 (Convergence in distribution). Suppose $Y_n$ has cumulative distribution function (cdf) $F_n$ and that $Y$ has cdf $F$. $Y_n$ is said to converge in distribution (or law) to $Y$ if and only if, for each continuity point $y$ of $F$,
$$\lim_{n \to \infty} F_n(y) = F(y).$$
The standard notation is $Y_n \xrightarrow{d} Y$ or $Y_n \xrightarrow{L} Y$; we will use the latter.

PRACTICAL INTERPRETATION: If $Y_n \xrightarrow{L} Y$, this implies, roughly, that for large $n$, except at a few points, the distribution of $Y_n$ (and hence probabilities associated with $Y_n$) is the same as that of $Y$. Thus, if we are interested in probability and distributional statements about $Y_n$, we may approximate these with statements about $Y$. In the context of estimation, if we are interested in approximating the sampling distribution of an estimator, we will be interested in the convergence in distribution of the estimator, or some function thereof.

FACTS: The following may be deduced from Definition 8.5 and previous definitions.

- If $Y_n \xrightarrow{p} Y$, then $Y_n \xrightarrow{L} Y$. This says that if, for large $n$, the probability that $Y_n$ differs from $Y$ is small, then we would expect the probability with which they take on values to be close, and hence expect them to have distributions that are close.

- However, $Y_n \xrightarrow{L} Y$ DOES NOT imply $Y_n \xrightarrow{p} Y$ in general. For example, suppose that $Y_n$ and $Y$ have the same distribution for each $n$, but $Y_n$ and $Y$ are independent for each $n$. In this
situation, a realization of $Y_n$ is totally unrelated to a realization of $Y$.

- $Y_n \xrightarrow{L} y$, where $y$ is a constant, does imply $Y_n \xrightarrow{p} y$. Intuitively, because the distribution of $Y_n$ collapses to a single point, a realization of $Y_n$ must also approach that point.

Of course, if $Y_n \xrightarrow{L} y$, a constant, then the limit distribution is degenerate, which is not particularly interesting if one seeks to deduce a sampling distribution to be used for constructing confidence intervals and hypothesis tests. This observation is the basis for why the study of approximate sampling distributions for estimators in parametric models like (8.1) does not quite follow straightforwardly from Definition 8.5. We now elaborate.

SAMPLING DISTRIBUTION OF AN ESTIMATOR: Return now to our situation of interest, where $\hat{\eta}_n$ is the estimator for a parameter $\eta$ with true value $\eta_0$. Showing that $\hat{\eta}_n \xrightarrow{p} \eta_0$ thus implies that $\hat{\eta}_n \xrightarrow{L} \eta_0$. However, this knowledge, that the distribution of $\hat{\eta}_n$ collapses to the single point $\eta_0$, is not very useful for the usual inferential goals described above. In particular, this result does not even give information on precision of estimation. To gain insight, and to provide a basis for the standard inferential objectives, we must pursue a "more exotic" assessment of large-sample behavior. Instead of considering the properties of $\hat{\eta}_n$ itself, we instead consider a suitable function of $\hat{\eta}_n$ whose properties are more interesting and relevant. For most estimators, and certainly those we have considered for model (8.1) (solving estimating equations), a standard approach to deriving an approximate sampling distribution that is more useful applies.

DEFINITION 8.6 (Asymptotic normality). We present this definition in the scalar case; the vector case is similar. Classically speaking, a random variable $Y_n$ is said to be asymptotically normal if we can find sequences $a_n$ and $c_n$ such that
$$c_n^{-1}(Y_n - a_n) \xrightarrow{L} \mathcal{N}(0, 1).$$

By this notation, we mean that the right-hand side of this expression is a standard normal random variable. Definition 8.6 implies that, although the
limit distribution of $Y_n$ itself may be uninteresting, if we center and scale $Y_n$ appropriately, this standardized version of $Y_n$ has an interesting limit distribution.

- In particular, $a_n$ is called the asymptotic mean and $c_n^2$ is called the asymptotic variance, and Definition 8.6 may be interpreted to mean that, approximately, for large $n$, $Y_n \sim \mathcal{N}(a_n, c_n^2)$.

- The usefulness of this result for approximating a sampling distribution is thus evident.

How is this applied in estimation situations of interest to us? As we will see, because the estimators of interest to us are not even available in closed form, things are not as simple as immediately identifying $\hat{\eta}_n$ with $Y_n$ and then determining appropriate centering and scaling constants. Instead, what is done is to find an approximation to an appropriate centered and scaled version of $\hat{\eta}_n$ by applying a Taylor series to the estimating equation that defines $\hat{\eta}_n$ implicitly. This approximation then forms the basis for deducing behavior like that in Definition 8.6. Some important tools for deducing this behavior are the following; after we state these important results, we will give a sketch of how they are used in this way.

There are numerous versions of central limit theorems that characterize the convergence in distribution of appropriately standardized sums of independent random variables. These may be extended to random vectors in a number of ways to allow generalization of univariate results to multivariate ones. We will not discuss the technicalities behind this. Instead, we will now just state a particular multivariate central limit theorem that will be useful for our purposes.

MULTIVARIATE CENTRAL LIMIT THEOREM: Let $Z_j$ be independent random vectors with $E(Z_j) = \mu_j$ and $\mathrm{var}(Z_j) = \Sigma_j$, $j = 1, \ldots, n$, such that
$$n^{-1} \sum_{j=1}^n \mu_j \to \mu, \qquad n^{-1} \sum_{j=1}^n \Sigma_j \to \Sigma,$$
say. Letting $F_j$ be the cdf of $Z_j$, suppose that, for all $\epsilon > 0$,
$$n^{-1} \sum_{j=1}^n \int_{\|z - \mu_j\| > \epsilon n^{1/2}} \|z - \mu_j\|^2 \, dF_j(z) \to 0 \quad \text{as } n \to \infty. \tag{8.2}$$
Then
$$n^{-1/2} \sum_{j=1}^n (Z_j - \mu_j) \xrightarrow{L} \mathcal{N}(0, \Sigma).$$

The Lindeberg condition (8.2) effectively restricts the tail behavior of the $Z_j$ and does not appear
particularly intuitive. It turns out that (8.2) may be shown to hold if the third moments of the $Z_j$ exist and are finite (the so-called Lyapunov condition). For our purposes, as we will generally assume that higher moments of the response exist and are finite, it will turn out that the moments of relevant quantities to which this theorem will be applied can be assumed to exist and be finite. Thus, the condition (8.2) will be assumed without comment when we apply the multivariate central limit theorem.

Another important result of which we will make heavy use is the following.

SLUTSKY'S THEOREM: Suppose that $Y_n \xrightarrow{L} Y$ and $V_n \xrightarrow{p} c$, where $c$ is a constant. Then the following hold:
$$Y_n + V_n \xrightarrow{L} Y + c, \qquad V_n Y_n \xrightarrow{L} cY, \qquad Y_n / V_n \xrightarrow{L} Y / c,$$
where in the last expression $c \neq 0$ is required.

These extend readily to random vectors $Y_n \xrightarrow{L} Y$ and $V_n \xrightarrow{p} C$, where $V_n$ and $C$ are matrices. In this case, the results become
$$Y_n + V_n \xrightarrow{L} Y + C, \qquad V_n Y_n \xrightarrow{L} CY, \qquad V_n^{-1} Y_n \xrightarrow{L} C^{-1} Y.$$
Thus, when $E(Y) = \mu$ and $\mathrm{var}(Y) = \Sigma$, say, then we have that $V_n Y_n$ converges in distribution to a random vector with mean $C\mu$ and covariance matrix $C \Sigma C^T$. Slutsky's theorem may be invoked repeatedly, so that, if $U_n \xrightarrow{p} D$ as well, then $V_n Y_n U_n \xrightarrow{L} C Y D$ (for conformable dimensions).

We need one final result.

WEAK LAW OF LARGE NUMBERS: We state this here in the scalar case, but it extends straightforwardly to vectors. Suppose $Z_j$ are independent (or uncorrelated) random variables and the $a_j$ are constants. Then, if $n^{-2} \sum_{j=1}^n a_j^2 \, \mathrm{var}(Z_j) \to 0$,
$$n^{-1} \sum_{j=1}^n a_j Z_j - n^{-1} \sum_{j=1}^n a_j E(Z_j) \xrightarrow{p} 0.$$

- The condition that $n^{-2} \sum_{j=1}^n a_j^2 \, \mathrm{var}(Z_j) \to 0$ is satisfied if $n^{-1} \sum_{j=1}^n a_j^2 \, \mathrm{var}(Z_j) \to c$, say, for some constant $c$, which is often reasonable and similar to the requirement for the central limit theorem.

- Note that, if we furthermore know that $n^{-1} \sum_{j=1}^n a_j E(Z_j) \to d$, say, then we may conclude that $n^{-1} \sum_{j=1}^n a_j Z_j \xrightarrow{p} d$, as
$$n^{-1} \sum_{j=1}^n a_j Z_j - d = \left\{ n^{-1} \sum_{j=1}^n a_j Z_j - n^{-1} \sum_{j=1}^n a_j E(Z_j) \right\} + \left\{ n^{-1} \sum_{j=1}^n a_j E(Z_j) - d \right\} \xrightarrow{p} 0.$$

We are now in a position to describe how all of this is used in more detail. We will now drop the $n$ subscript on our generic estimator and treat it and the parameter of interest as vectors, writing $\hat{\eta}$ and $\eta$. It will turn out, as we will exhibit in the
next section, that, for estimating equations for a parameter $\eta$ of interest with solution $\hat{\eta}$, we will be able to deduce, using Taylor series and some additional conditions, that
$$n^{1/2}(\hat{\eta} - \eta_0) = A_n^{-1} C_n + o_p(1),$$
where $C_n = n^{-1/2} \sum_{j=1}^n (\text{function of data})$, $A_n = n^{-1} \sum_{j=1}^n (\text{function of data})$, and $o_p(1)$ represents terms that converge in probability to zero. We will then:

- Apply the central limit theorem to $C_n$ to show that it converges in distribution to a normal random vector.

- Apply the weak law of large numbers to $A_n$ to show that it converges in probability to a constant matrix.

- Apply Slutsky's theorem to $A_n^{-1} C_n$ to conclude that $n^{1/2}(\hat{\eta} - \eta_0)$ converges in distribution to a normal random vector with mean zero and some covariance matrix; i.e.,
$$n^{1/2}(\hat{\eta} - \eta_0) \xrightarrow{L} \mathcal{N}(0, \Sigma), \tag{8.3}$$
say. This is often interpreted as $\hat{\eta} \,\dot{\sim}\, \mathcal{N}(\eta_0, n^{-1} \Sigma)$; i.e., $\hat{\eta}$ is asymptotically normal with mean $\eta_0$ and covariance matrix $n^{-1} \Sigma$.

From these steps, we may then deduce an approximate normal distribution for $\hat{\eta}$. We will show these steps in more detail in Section 8.2.

Suppose that we have established results like (8.3) for two competing estimators for $\eta$ ($k \times 1$), $\hat{\eta}_1$ and $\hat{\eta}_2$, say, so that we have
$$n^{1/2}(\hat{\eta}_1 - \eta_0) \xrightarrow{L} \mathcal{N}(0, \Sigma_1) \quad \text{and} \quad n^{1/2}(\hat{\eta}_2 - \eta_0) \xrightarrow{L} \mathcal{N}(0, \Sigma_2)$$
for some matrices $\Sigma_1$ and $\Sigma_2$.

- Both $\hat{\eta}_1$ and $\hat{\eta}_2$ are consistent. It is a general fact that, if a random vector converges in distribution, then it is bounded in probability. Thus, we have that $n^{1/2}(\hat{\eta}_\ell - \eta_0) = O_p(1)$ for $\ell = 1, 2$. This may be expressed equivalently as $\hat{\eta}_\ell - \eta_0 = O_p(n^{-1/2})$. Thus, both estimators "estimate the right stuff" and approach it at the same rate; on this basis, then, they are entirely comparable.

- As this does not distinguish the two estimators from one another, consider their precision. In finite-sample (exact) theory, the estimator that is more precise is to be preferred. Here, we may approximate the covariance matrices of the estimators by $n^{-1} \Sigma_1$ and $n^{-1} \Sigma_2$, respectively. This suggests comparing $\Sigma_1$ and $\Sigma_2$. In the case $k = 1$, so that $\Sigma_1$ and $\Sigma_2$ are scalar variances, this suggests preferring the estimator with the smaller variance.
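As an illustrative aside (not part of the original development; the estimators, sample size, and replication count below are arbitrary choices), the precision comparison just described can be made concrete by simulation. For iid $\mathcal{N}(0,1)$ data, the sample mean and sample median are two consistent, $O_p(n^{-1/2})$ estimators of the same center, with asymptotic variances $1$ and $\pi/2$, respectively, so the Monte Carlo variances of $n^{1/2} \times (\text{estimator})$ should be near these values:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two consistent, root-n estimators of the center (0) of a N(0,1)
# population: the sample mean and the sample median.  Their asymptotic
# variances are 1 and pi/2, so the Monte Carlo variances of
# sqrt(n) * (estimator - 0) should be near 1 and pi/2, respectively.
n, reps = 2000, 4000
means, medians = np.empty(reps), np.empty(reps)
for r in range(reps):
    y = rng.standard_normal(n)
    means[r] = y.mean()
    medians[r] = np.median(y)

var_mean = n * means.var()      # estimates asymptotic variance of the mean
var_median = n * medians.var()  # estimates asymptotic variance of the median
print(var_mean, var_median, var_mean / var_median)
```

The ratio `var_mean / var_median` estimates $\Sigma_2 / \Sigma_1 = 2/\pi \approx 0.64$, anticipating the asymptotic relative efficiency idea.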
That is, prefer $\hat{\eta}_2$ to $\hat{\eta}_1$ if $\Sigma_2 < \Sigma_1$. If $\Sigma_2 = \Sigma_1$, then the two estimators are of equal precision.

DEFINITION 8.7 (Asymptotic relative efficiency). For scalars, the asymptotic relative efficiency of $\hat{\eta}_1$ to $\hat{\eta}_2$ is defined as
$$ARE = \Sigma_2 / \Sigma_1.$$
With this definition, if $ARE = 1$, the estimators are equally precise. If $ARE < 1$, then $\hat{\eta}_1$ is inefficient relative to $\hat{\eta}_2$; and, if $ARE > 1$, then $\hat{\eta}_1$ offers a gain in efficiency relative to $\hat{\eta}_2$.

Generally, one constructs the ratio so that the potentially "better" estimator's variance is in the numerator, so that $ARE < 1$ is "good" for showing another estimator is inefficient relative to it. However, be aware that some texts and authors may do this in the reverse, so that larger-than-one values are preferred.

The extension of the definition to $k > 1$ is that $\hat{\eta}_2$ is preferable to $\hat{\eta}_1$ if the covariance matrix $\Sigma_2$ is "smaller" than $\Sigma_1$ in some sense. To formalize this, if $\Sigma_1 - \Sigma_2$ is nonnegative definite, then this means that, for all $k \times 1$ vectors $\lambda$,
$$\lambda^T \Sigma_2 \lambda \leq \lambda^T \Sigma_1 \lambda.$$
By choosing $\lambda$ in turn to be the vector with a 1 in one position and zeroes elsewhere, we see that this implies that the variances on the diagonal of $\Sigma_2$ must be smaller than those on the diagonal of $\Sigma_1$, so that the approximate variance of each component of $\hat{\eta}_2$ is smaller than that of $\hat{\eta}_1$.

Now, if $\Sigma_1 - \Sigma_2$ is nonnegative definite, it follows that $|\Sigma_2| \leq |\Sigma_1|$. Thus, the asymptotic relative efficiency of $\hat{\eta}_1$ to $\hat{\eta}_2$ is generally defined for $k > 1$ as
$$ARE = \left\{ |\Sigma_2| / |\Sigma_1| \right\}^{1/k}.$$
As we will see later, the comparison is sometimes simplified in that it turns out that $\Sigma_1 = \alpha_1 \Sigma$ and $\Sigma_2 = \alpha_2 \Sigma$ for some scalars $\alpha_\ell$, $\ell = 1, 2$, and common matrix $\Sigma$. In this case, $ARE$ reduces to $\alpha_2 / \alpha_1$, as $|\alpha \Sigma| = \alpha^k |\Sigma|$.

We will often just argue that one estimator is more efficient than another by examining the difference $\Sigma_1 - \Sigma_2$. However, simply noting that this difference is nonnegative definite does not give insight into "how much" more efficient; the calculation of $ARE$ quantifies "how much better." Unfortunately, in very complicated models, $ARE$ may depend on the design, parameter values, and
functions involved, so that a global statement of relative efficiency may not be made.

SUMMARY: Evaluating the properties of estimators for $\beta$, $\sigma^2$, and $\theta$ in the model (8.1) will involve the following steps:

- Establish consistency.
- Derive the large-sample distribution.
- Compare using asymptotic relative efficiency.

8.2 M-estimation

It turns out that almost all of the estimators we will study, both those for the regression parameter $\beta$ and those for the variance parameters $(\sigma^2, \theta^T)^T$, are of a certain general type for which a standard argument that makes use of the developments of the previous section may be carried out to derive large-sample properties. This argument could be said to be the "bread and butter" of the formal study of regression parameter estimation in nonlinear models. Such estimators fall into the class of so-called M-estimators, which we will define momentarily.

REFERENCES: Our discussion of M-estimation will be nonrigorous. More rigorous and extensive treatments of M-estimation are available; here is a list of useful references.

- Serfling (1980, Chapter 7) and Huber (1981, Chapter 6) offer rigorous discussions and derivations.

- A treatment similar to that presented here is given by Carroll and Ruppert (1988, Section 7.1).

- A more recent discussion is provided by van der Vaart (1998, Chapter 5).

- Stefanski and Boos (2001) provide an overview of M-estimation.

THEME: The estimators we have discussed for the model (8.1) cannot be expressed in closed form. Rather, they are defined implicitly as the solution to a complex problem. We have focused on the estimators as solutions to a set of estimating equations; some of the estimators we have considered may also be viewed as the minimizer or maximizer of a real-valued objective function (e.g., a loglikelihood). We will be concerned in the sequel with estimators defined as solutions to equations. For completeness, however, we give a definition of an M-estimator that includes both kinds. This definition is generic; more general ones are
possible, but this one will suffice for introducing the concept.

DEFINITION 8.8 (M-estimator). Let $Z_j$, $j = 1, \ldots, n$, be independent random vectors with cdfs $F_j$. (We may not know the $F_j$; perhaps we know only the first few moments.) Let $\eta$ ($k \times 1$) be a parameter of interest in a statistical model involving the $Z_j$. An M-estimator for $\eta$, $\hat{\eta}$, may be defined in two ways:

1. $\hat{\eta}$ minimizes $\sum_{j=1}^n \rho_j(Z_j, \eta)$, where the $\rho_j$ are real-valued functions.

2. $\hat{\eta}$ is the root of a $k \times 1$ set of estimating equations, such that
$$\sum_{j=1}^n \psi_j(Z_j, \hat{\eta}) = 0,$$
where the $\psi_j$ are vector-valued functions taking values in $k$-dimensional space.

The subscript $j$ for the functions $\rho_j$ and $\psi_j$ may or may not be relevant, depending on the situation.

- In particular, we may view our problem in different ways. We may view the $(Y_j, x_j)$ as iid draws from some joint distribution and identify $Z_j = (Y_j, x_j^T)^T$. In this case, $\rho_j \equiv \rho$ and $\psi_j \equiv \psi$.

- Alternatively, we may view the $x_j$ as fixed constants, or consider the situation conditional on the $x_j$. In this case, we may view the $Y_j$ given the $x_j$ as independent, and the $j$ subscript is meant to emphasize dependence on $x_j$ and possibly other $j$-dependent fixed quantities. This perspective is the one usually adopted when developing theoretical arguments for regression models.

REMARKS:

- If $\rho_j$ is differentiable with respect to $\eta$ for each $j$, then a problem of type 1 implies one of type 2, with
$$\psi_j(Z_j, \eta) = \frac{\partial}{\partial \eta} \rho_j(Z_j, \eta).$$
However, a problem of type 2 may be posed without a corresponding problem of type 1. Thus, the class of estimating equation approaches of type 2 is more general.

- Note that, if $p_j$ is the assumed density of $Z_j$, then choosing $\rho_j(Z_j, \eta) = -\log p_j(Z_j, \eta)$ for type 1 gives maximum likelihood estimation under the assumption that $p_j$ is the true density of the $Z_j$. Thus, maximum likelihood estimation is just a special case of the more general class of M-estimation. In fact, the "M" was chosen by the originator, Huber, to mean "of maximum likelihood type."

- Although we have motivated some estimators for $\beta$, $\sigma^2$, and $\theta$ from a maximum likelihood perspective, we have seen that these estimators are regarded
more broadly as solutions to estimating equations, regardless of the distributional assumptions. Thus, we will focus on type 2 M-estimators. In fact, nowadays, 2 has become the popular definition.

CONSISTENCY OF M-ESTIMATORS: Formal, rigorous proof of the consistency of M-estimators is rather technical; see the treatments given by Serfling (1980), Huber (1981), and van der Vaart (1998). However, in "nice" problems where certain conditions hold, it turns out that $\hat{\eta}$ obtained by solving a problem of type 2 will be a consistent estimator of $\eta_0$, the true value.

In general, and certainly for the $\psi_j$ functions corresponding to the estimators we have considered, M-estimators are usually based on $\psi_j$ such that
$$E_\eta\{\psi_j(Z_j, \eta)\} = 0 \quad \text{for each } j \text{ and all } \eta \text{ in the parameter space,}$$
where $E_\eta$ refers to expectation under the assumption that the parameter is equal to $\eta$. It turns out that consistency will generally be the case if
$$E\{\psi_j(Z_j, \eta_0)\} = 0,$$
where the expectation is with respect to the true distribution of the $Z_j$, and where $\eta_0$ is the unique value satisfying this requirement. Thus, it is reasonable to expect that the estimators we have considered in earlier chapters are consistent.

Estimating equations that satisfy these conditions are often referred to as unbiased. We have noted this term previously, and now we get some insight into the name: the expectation condition refers to the estimating equations rather than the estimator itself.

In general, although the solutions to such complex estimating equations are not likely to be unbiased for finite $n$, the above conditions ensure that we can at least expect the best we can hope for: consistency.

As an illustration, consider the WLS estimating equation with known weights $w_j$; i.e.,
$$\sum_{j=1}^n w_j \{Y_j - f(x_j, \beta)\} f_\beta(x_j, \beta) = 0. \tag{8.4}$$
Here, $\psi_j(Y_j, \beta) = w_j \{Y_j - f(x_j, \beta)\} f_\beta(x_j, \beta)$, where we treat the $w_j$ as fixed constants. Clearly, $E_\beta\{\psi_j(Y_j, \beta)\} = 0$, and thus $E\{\psi_j(Y_j, \beta_0)\} = 0$, where the latter expectation is with respect to the true distribution, as $E(Y_j \mid x_j) = f(x_j, \beta_0)$.

Note that the notion of unbiasedness being related to consistency makes intuitive
sense here. If $f(x_j, \beta)$ really is the conditional mean of $Y_j$, then it seems reasonable that, as each summand must have expectation 0, the solution to (8.4), $\hat{\beta}$ say, must be "close to" the true value of $\beta$ in order that the $n$ zero-expectation conditions (under the true distribution, which is all that is really required here) be satisfied.

In the sequel, as we are solving unbiased estimating equations, we will in general assume that a consistent solution exists.

LARGE-SAMPLE DISTRIBUTIONAL PROPERTIES OF M-ESTIMATORS: Assuming that $\hat{\eta} \xrightarrow{p} \eta_0$, the standard argument to approximate the sampling distribution of $\hat{\eta}$ is as follows. We give a heuristic sketch here; obviously, there are technical conditions that are required to validate the use of Taylor's theorem and the other approximations we make. This development will demonstrate the rationale underlying the steps outlined earlier.

By a Taylor series expansion about the true value $\eta_0$, the estimating equation satisfies
$$0 = n^{-1/2} \sum_{j=1}^n \psi_j(Z_j, \hat{\eta}) = n^{-1/2} \sum_{j=1}^n \psi_j(Z_j, \eta_0) + \left\{ n^{-1} \sum_{j=1}^n \frac{\partial}{\partial \eta^T} \psi_j(Z_j, \eta^*) \right\} n^{1/2}(\hat{\eta} - \eta_0),$$
where $\eta^*$ is a value between $\hat{\eta}$ and $\eta_0$, and multiplication of the equation by $n^{-1/2}$ will prove convenient shortly. Of course, with $\eta$ ($k \times 1$), $(\partial/\partial \eta^T)\, \psi_j(Z_j, \eta)$ is the $k \times k$ matrix of all partial derivatives of $\psi_j$ with respect to the elements of $\eta$, evaluated at $\eta$.

As $\hat{\eta}$ is consistent, it is possible to define technical conditions such that $\eta^*$ may be replaced by $\eta_0$ in the partial derivative matrix, and the weak law of large numbers gives that
$$n^{-1} \sum_{j=1}^n \frac{\partial}{\partial \eta^T} \psi_j(Z_j, \eta_0) - n^{-1} \sum_{j=1}^n E\left\{ \frac{\partial}{\partial \eta^T} \psi_j(Z_j, \eta_0) \right\} \xrightarrow{p} 0.$$

Here, the expectation is with respect to the true distribution of $Z_j$. Thus, we have
$$0 = n^{-1/2} \sum_{j=1}^n \psi_j(Z_j, \eta_0) + \left[ n^{-1} \sum_{j=1}^n E\left\{ \frac{\partial}{\partial \eta^T} \psi_j(Z_j, \eta_0) \right\} \right] n^{1/2}(\hat{\eta} - \eta_0) + o_p(1). \tag{8.5}$$
Assuming the inverse exists, (8.5) may be rearranged as
$$n^{1/2}(\hat{\eta} - \eta_0) = -\left[ n^{-1} \sum_{j=1}^n E\left\{ \frac{\partial}{\partial \eta^T} \psi_j(Z_j, \eta_0) \right\} \right]^{-1} n^{-1/2} \sum_{j=1}^n \psi_j(Z_j, \eta_0) + o_p(1). \tag{8.6}$$
Assume that
$$n^{-1} \sum_{j=1}^n E\left\{ \frac{\partial}{\partial \eta^T} \psi_j(Z_j, \eta_0) \right\} \to A,$$
say, some constant matrix that is invertible. Now we may apply the central limit theorem to the other term on the right-hand side of (8.6). As
each summand has mean 0, we have that this term converges in distribution to a multivariate normal random vector with mean 0 and covariance matrix
$$\lim_{n \to \infty} n^{-1} \sum_{j=1}^n E\left\{ \psi_j(Z_j, \eta_0) \psi_j^T(Z_j, \eta_0) \right\} = B,$$
say. We may now apply Slutsky's theorem to conclude that
$$n^{1/2}(\hat{\eta} - \eta_0) \xrightarrow{L} \mathcal{N}\{0, A^{-1} B (A^{-1})^T\}. \tag{8.7}$$

In fact, this argument is usually made even more heuristically, as follows. From the above, we have
$$0 \approx C_n + A_n \, n^{1/2}(\hat{\eta} - \eta_0),$$
which may be rearranged as
$$n^{1/2}(\hat{\eta} - \eta_0) \approx -A_n^{-1} C_n,$$
where $A_n = n^{-1} \sum_{j=1}^n E\{(\partial/\partial \eta^T)\, \psi_j(Z_j, \eta_0)\}$ and $C_n = n^{-1/2} \sum_{j=1}^n \psi_j(Z_j, \eta_0)$. Now, by the central limit theorem, $C_n \,\dot{\sim}\, \mathcal{N}(0, B_n)$, where $B_n = n^{-1} \sum_{j=1}^n E\{\psi_j(Z_j, \eta_0) \psi_j^T(Z_j, \eta_0)\}$. From these developments, it follows that $\hat{\eta}$ is asymptotically normal and satisfies
$$\hat{\eta} \,\dot{\sim}\, \mathcal{N}\{\eta_0, \, n^{-1} A_n^{-1} B_n (A_n^{-1})^T\}. \tag{8.8}$$

To use a result of the form (8.8) for practical purposes, one would want to obtain the covariance matrix $n^{-1} A_n^{-1} B_n (A_n^{-1})^T$, which would be an approximation to the sampling covariance matrix of the estimator. There are two strategies; $\eta_0$ would be replaced by $\hat{\eta}$ in each.

- If it is possible to calculate the necessary expectations in $A_n$ and $B_n$, (8.8) is "used as is." As we will see later, this might not always be possible, in particular for matrices like $B_n$, unless one is willing to make some assumptions beyond just those in the model (8.1). This approach is often called the model-based expectation method.

- Another approach is not to even try to evaluate the expectations in $A_n$ and $B_n$ in (8.8). Instead, these matrices are approximated by their "sample analogs"; that is, $A_n$ and $B_n$ in (8.8) are replaced by
$$\hat{A}_n = n^{-1} \sum_{j=1}^n \frac{\partial}{\partial \eta^T} \psi_j(Z_j, \hat{\eta}), \qquad \hat{B}_n = n^{-1} \sum_{j=1}^n \psi_j(Z_j, \hat{\eta}) \psi_j^T(Z_j, \hat{\eta}),$$
respectively. This approach is usually referred to as the sandwich or robust covariance estimator (the term "sandwich" is most common).

If the expectations are known, it is generally better to use the model-based rather than the sandwich estimator, as the former will be a more precise estimator of the true covariance matrix that is really required, as in (8.7).

REMARKS:

- Note that, if the estimating equations correspond to maximum likelihood, then $\psi_j$ is the score vector. In
this case, it is well known that $-A = B$, and (8.7) reduces to
$$n^{1/2}(\hat{\eta} - \eta_0) \xrightarrow{L} \mathcal{N}(0, B^{-1}).$$
This explains why the covariance matrix in the more general case is often called the "information sandwich."

- In (8.6), we have that
$$n^{1/2}(\hat{\eta} - \eta_0) = n^{-1/2} \sum_{j=1}^n \left\{ -A^{-1} \psi_j(Z_j, \eta_0) \right\} + o_p(1).$$
The expression $-A^{-1} \psi_j(Z_j, \eta_0)$ is sometimes referred to as the influence function of the estimator $\hat{\eta}$. There is a correspondence between influence functions and estimators that may be exploited to deduce properties of estimators and relative efficiencies of competing estimators under very general conditions. The book on semiparametric theory by Tsiatis (2006) introduces this point of view. In this course, we will derive many results by "brute force" rather than through a general semiparametric theory approach.

8.3 WLS as an M-estimator and preliminary results

The results of the preceding section were presented generically. As we noted earlier, when applying these developments to our regression setting, there are two possible ways to proceed:

(i) Identify $Z_j = (Y_j, x_j^T)^T$ as iid. In this case, for almost all the estimating equations we have considered, the function $\psi_j$ would depend on these random quantities and the parameters of interest and, in fact, as a function of these, would be the same for all $j$. Results would be established with respect to the joint distribution of the $(Y_j, x_j)$.

(ii) Identify $Z_j = Y_j$ and treat the $x_j$ as fixed constants. This is, in effect, the same as regarding the situation conditional on the covariates $x_j$, $j = 1, \ldots, n$. Under this perspective, the function $\psi_j$ is in fact different for each $j$, ordinarily because of the dependence on $x_j$. Results would be established with respect to the distributions of the $Y_j$ given $x_j$, $j = 1, \ldots, n$.

Approach (ii) is the conventional way to address large-sample theory for regression settings. This is probably because regression models, and hence the parameter $\beta$ of usual interest, are defined with respect to the conditional moments of the distributions of $Y_j$ given $x_j$. As we have noted previously, assuming that
dependence on $\beta$ is only through the distributions of $Y_j$ given $x_j$, the distribution of the $x_j$ does not play a role in, for example, maximum likelihood estimation of $\beta$.

For approach (ii), it is necessary to make an additional assumption. We assume that, given the entire set of $n$ covariates $x_1, \ldots, x_n$, the distribution of $Y_j$ given $x_1, \ldots, x_n$ is the same as that of $Y_j$ given $x_j$ only. This assumption is actually implicit in the development of most regression models; heuristically, it says that, for observation $j$, the only relevant covariate is $x_j$. Certainly, if we regard the $(Y_j, x_j)$ pairs as iid draws from some common distribution, this assumption is valid.

It is important to recognize that either approach is valid and may be taken. Under (i), things simplify in the sense that things are iid across $j$, so that a simpler version of the central limit theorem applies, whereas under (ii) we must appeal to more complex results, such as the more general central limit theorem stated above. Likewise, the results under (i) will obviously involve quantities like expectations over the joint distribution of $(Y_j, x_j)$, while results under (ii) will involve expectations with respect to the conditional distribution of $Y_j$ given $x_1, \ldots, x_n$, which, by the above assumption, are the same as those with respect to the distribution of $Y_j$ given $x_j$ only. As it turns out, (ii) is more general; if results under (ii) can be deduced, then results under (i) also can be found.

A more extensive discussion of this issue is beyond our scope here. In the sequel, because of convention and generality, we will adopt approach (ii) for the derivation of large-sample results. It is important to keep in mind that, in more complicated problems (e.g., where components of the $x_j$ are missing or error-prone), this approach must be modified; see, for example, the book by Carroll, Ruppert, and Stefanski (1995).

For simplicity, in the sequel, we will not go to the trouble to note explicitly that expectations of functions of $Y_j$ conditional on $x_1, \ldots, x_n$ are the same as those conditional on $x_j$ alone each time we need this
assumption. Instead, we will immediately write, for example, $E(Y_j \mid x_j)$ rather than $E(Y_j \mid x_1, \ldots, x_n)$ when evaluating such expectations in applying these theorems. Thus, it is important to keep in mind during our arguments that we are indeed conditioning on all $n$ of the $x_j$, but that required moments of individual summands conditional on $x_1, \ldots, x_n$ (when applying the weak law of large numbers or the central limit theorem) reduce to those conditional on $x_j$ only.

Under this perspective, we now consider estimators for the general mean-variance model (8.1). Actually, we consider a simpler setting; that is, we focus on the model
$$E(Y_j \mid x_j) = f(x_j, \beta), \qquad \mathrm{var}(Y_j \mid x_j) = \sigma^2 / w_j, \tag{8.9}$$
where the $w_j > 0$ are fixed constants. Suppose that (8.9) represents the form of the true first two moments of the true conditional distribution generating the data. The true value of $\beta$ leading to the data for these moments is $\beta_0$, say (similarly, write $\sigma_0^2$ for the true value of $\sigma^2$). Now suppose that we specify a model of the form
$$E(Y_j \mid x_j) = f(x_j, \beta), \qquad \mathrm{var}(Y_j \mid x_j) = \sigma^2 / u_j \tag{8.10}$$
for some set of fixed constants $u_j > 0$. That is, we are correct in our assumption about the conditional mean but, depending on the values of the constants $u_j$, we may be incorrect about our specification of the variance. We will discuss possible choices of $u_j$ one might make in practice momentarily. It is important to appreciate that (8.9) gives the true first two moments, while (8.10) is the model that we assume; of course, the two models coincide if $w_j = u_j$ for each $j$.

Suppose we adopt (8.10) and propose to estimate $\beta$ by solving
$$\sum_{j=1}^n u_j \{Y_j - f(x_j, \beta)\} f_\beta(x_j, \beta) = 0. \tag{8.11}$$

Note that if we take:

- $u_j \equiv 1$ for all $j$, then solving (8.11) corresponds to (incorrectly) assuming that the variance is constant for all $j$ and estimating $\beta$ by OLS;

- $u_j = w_j$ for all $j$, then solving (8.11) corresponds to assuming the correct model for mean and variance and estimating $\beta$ by WLS with the correct weights;

- $u_j$ arbitrary, so potentially not all equal to $w_j$ for all $j$, corresponds to the situation where we assume nonconstant variance and use WLS to estimate $\beta$, but we have chosen incorrect
weights.

We may view (8.11) as an M-estimating equation for $\beta$ for some general set of fixed constants $u_j$. We now deduce the behavior of $\hat{\beta}$ solving (8.11) for arbitrary $u_j$ by applying the arguments of the last section. We may then consider each of the cases above to deduce:

- The behavior of the OLS estimator $\hat{\beta}_{OLS}$ when the variance is really nonconstant.

- The behavior of the WLS estimator $\hat{\beta}_{WLS}$ when we have correctly specified the $u_j$ to be equal to the true values $w_j$.

- The consequences of using the WLS approach, but with the "wrong" weights.

Note that
$$\psi_j(Y_j, \beta) = u_j \{Y_j - f(x_j, \beta)\} f_\beta(x_j, \beta),$$
where the subscript $j$ indicates that the summand of (8.11) depends on the fixed constants $u_j$ and $x_j$. Under the true model (8.9), we have
$$E\{\psi_j(Y_j, \beta_0) \mid x_j\} = u_j \, E\{Y_j - f(x_j, \beta_0) \mid x_j\} \, f_\beta(x_j, \beta_0) = 0.$$
That is, even though we have possibly misspecified the weights in (8.10), we still have specified the mean correctly; thus, $E[u_j \{Y_j - f(x_j, \beta_0)\} f_\beta(x_j, \beta_0) \mid x_j] = 0$, with expectation taken with respect to the distribution whose first two moments are given in (8.9), because the $u_j$ are just fixed constants. Hence, the estimating equation (8.11) is unbiased, even though it depends on a possibly incorrect second conditional moment assumption. By the reasoning outlined in the previous section, we thus are willing to believe that the estimator $\hat{\beta}$ obtained by solving (8.11) is consistent for the true value $\beta_0$, even though the variance weights may be misspecified.

IMPLICATION: Even if we ignore nonconstant variance ($u_j \equiv 1$ for all $j$) or choose the weights incorrectly, the resulting estimator for $\beta$ will still be consistent for the true value.

- Thus, the OLS estimator will be consistent even if the variance really is nonconstant!

- So why bother with worrying about the variance at all by choosing weights, if the OLS estimator will produce a consistent estimator anyway?

- The reason has to do with asymptotic relative efficiency, as we now demonstrate.

We apply the M-estimator argument of Section 8.2; rather than just "plug in" to the generic argument, we derive it from scratch, so that we can see exactly how it
works in this context. Expanding about the true value $\beta_0$, and using $\hat{\beta} \xrightarrow{p} \beta_0$,
$$0 = n^{-1/2} \sum_{j=1}^n u_j \{Y_j - f(x_j, \hat{\beta})\} f_\beta(x_j, \hat{\beta}) = n^{-1/2} \sum_{j=1}^n u_j \{Y_j - f(x_j, \beta_0)\} f_\beta(x_j, \beta_0)$$
$$+\; n^{-1} \sum_{j=1}^n \left[ u_j \{Y_j - f(x_j, \beta_0)\} f_{\beta\beta}(x_j, \beta_0) - u_j f_\beta(x_j, \beta_0) f_\beta^T(x_j, \beta_0) \right] n^{1/2} (\hat{\beta} - \beta_0) + o_p(1). \tag{8.12}$$

In our future arguments, we will find it convenient to rewrite terms such as those in (8.12) in terms of the standardized variable
$$\epsilon_j = \frac{w_j^{1/2} \{Y_j - f(x_j, \beta_0)\}}{\sigma_0};$$
note that $\epsilon_j$ is defined in terms of the true values of $\beta$ and the true variance, so that $E(\epsilon_j \mid x_j) = 0$ and $\mathrm{var}(\epsilon_j \mid x_j) = 1$. Using this and rearranging, we may write
$$n^{1/2}(\hat{\beta} - \beta_0) = A_n^{-1} C_n + o_p(1),$$
where
$$C_n = \sigma_0 \, n^{-1/2} \sum_{j=1}^n u_j w_j^{-1/2} \epsilon_j f_\beta(x_j, \beta_0), \qquad A_n = A_{n1} + A_{n2},$$
$$A_{n1} = -\sigma_0 \, n^{-1} \sum_{j=1}^n u_j w_j^{-1/2} \epsilon_j f_{\beta\beta}(x_j, \beta_0), \qquad A_{n2} = n^{-1} \sum_{j=1}^n u_j f_\beta(x_j, \beta_0) f_\beta^T(x_j, \beta_0).$$

Consider $A_{n1}$ first. By the weak law of large numbers, as $E(\epsilon_j \mid x_j) = 0$, the random vectors making up each summand all have mean zero, so that $A_{n1} \xrightarrow{p} 0$. (Recall that we are carrying out all arguments conditional on the $x_j$, so application of the weak law of large numbers is under this condition.)

Note that $A_{n2}$ depends only on fixed quantities (conditional on the $x_j$) and has the form of an average. Assume that $\lim_{n \to \infty} A_{n2} = A$, say, where $A$ is a positive definite, symmetric matrix. Combining the results for $A_{n1}$ and $A_{n2}$, then, we have that $A_n \xrightarrow{p} A$.

For $C_n$, note that $E\{\sigma_0 u_j w_j^{-1/2} \epsilon_j f_\beta(x_j, \beta_0) \mid x_j\} = 0$ and
$$\mathrm{var}\{\sigma_0 u_j w_j^{-1/2} \epsilon_j f_\beta(x_j, \beta_0) \mid x_j\} = \sigma_0^2 u_j^2 w_j^{-1} f_\beta(x_j, \beta_0) f_\beta^T(x_j, \beta_0).$$
Thus, by the multivariate central limit theorem, we may deduce that
$$C_n \xrightarrow{L} \mathcal{N}(0, \sigma_0^2 B), \qquad B = \lim_{n \to \infty} n^{-1} \sum_{j=1}^n u_j^2 w_j^{-1} f_\beta(x_j, \beta_0) f_\beta^T(x_j, \beta_0).$$
Combining, then, we conclude by Slutsky's theorem that
$$n^{1/2}(\hat{\beta} - \beta_0) \xrightarrow{L} \mathcal{N}(0, \sigma_0^2 A^{-1} B A^{-1}); \tag{8.13}$$
note that, in contrast to the generic results before, for convenience we have momentarily isolated the multiplicative factor $\sigma_0^2$ in our definition of the "center" matrix $B$. By analogy to (8.8), with $B_n$ and $A_{n2}$ as above, we may write this alternatively as
$$\hat{\beta} \,\dot{\sim}\, \mathcal{N}\{\beta_0, \, \sigma_0^2 n^{-1} A_{n2}^{-1} B_n A_{n2}^{-1}\}. \tag{8.14}$$
If we write $X = X(\beta_0)$, where $X(\beta)$ is as defined earlier, and if we define $U = \mathrm{diag}(u_1, \ldots, u_n)$ and $W = \mathrm{diag}(w_1, \ldots, w_n)$, then it is straightforward to verify that we may rewrite the result in (8.14) as
$$\hat{\beta} \,\dot{\sim}\, \mathcal{N}\{\beta_0, \, \sigma_0^2 (X^T U X)^{-1} X^T U W^{-1} U X (X^T U X)^{-1}\}; \tag{8.15}$$
note that we have factored out the $n^{-1}$ terms in each component. Writing the approximate covariance matrix in the form in (8.15) will prove convenient momentarily.

Armed with these general results, we may now consider the specific cases discussed earlier.

- $w_j \equiv 1$, $u_j \equiv 1$. Suppose first that, in truth, $w_j \equiv 1$ for all $j$, so that the variance is truly constant, and we specify $u_j \equiv 1$ for all $j$, corresponding to OLS estimation of $\beta$.
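As a numerical aside (a sketch, not part of the notes; the linear mean, weight choices, and constants below are arbitrary illustrative assumptions), the sandwich-form covariance in (8.15) can be checked by simulation. For a linear mean $f(x_j, \beta) = \beta_1 + \beta_2 x_j$, the formula $\sigma_0^2 (X^T U X)^{-1} X^T U W^{-1} U X (X^T U X)^{-1}$ with $U = I$ (OLS) is in fact exact, so the Monte Carlo covariance of $\hat{\beta}_{OLS}$ under nonconstant true variance should reproduce it:

```python
import numpy as np

rng = np.random.default_rng(5)

# Linear mean f(x, beta) = beta1 + beta2*x, so f_beta = (1, x)' and X is
# the usual design matrix.  True variance sigma0^2 / w_j with w_j = 1/x_j^2
# (sd proportional to x), but beta is estimated by OLS, i.e., u_j = 1.
n = 300
x = rng.uniform(1.0, 3.0, n)
X = np.column_stack([np.ones(n), x])
beta0 = np.array([1.0, 2.0])
sigma0 = 0.5
w = 1.0 / x**2                  # true weights: var(Y_j | x_j) = sigma0^2 * x_j^2
U = np.eye(n)                   # chosen (incorrect) weights u_j = 1
Winv = np.diag(1.0 / w)

# Covariance from the formula corresponding to (8.15):
XtUX_inv = np.linalg.inv(X.T @ U @ X)
cov_theory = sigma0**2 * XtUX_inv @ X.T @ U @ Winv @ U @ X @ XtUX_inv

# Monte Carlo covariance of the OLS estimator:
reps = 4000
betas = np.empty((reps, 2))
for r in range(reps):
    y = X @ beta0 + sigma0 * rng.standard_normal(n) / np.sqrt(w)
    betas[r] = np.linalg.lstsq(X, y, rcond=None)[0]
cov_mc = np.cov(betas.T)

print(cov_theory)
print(cov_mc)   # the two matrices should agree closely
```

Note that OLS remains centered at $\beta_0$ here (consistency under misspecified weights), while its covariance is governed by the sandwich form rather than the naive $\sigma_0^2 (X^T X)^{-1}$.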
each component. Writing the approximate covariance matrix in the form in (8.15) will prove convenient momentarily.

Armed with these general results, we may now consider the specific cases discussed earlier.

• w_j ≡ 1, u_j ≡ 1: Suppose first that in truth w_j ≡ 1 for all j, so that the variance is truly constant, and we specify u_j ≡ 1 for all j, corresponding to OLS estimation of β. Then we are in the situation where we would be correctly using OLS, and β̂ = β̂_OLS. Here, U = W = I_n, and the covariance matrix in (8.15) reduces to σ₀²(XᵀX)^{-1}. In fact, it is straightforward to note that, since B_n = A_{n2} in this case, (8.13) reduces to

n^{1/2}(β̂_OLS − β₀) →d N(0, σ₀² B^{-1}),

where B = lim_{n→∞} n^{-1}XᵀX = lim_{n→∞} n^{-1} Σ_{j=1}^n f_β(x_j, β₀) f_βᵀ(x_j, β₀). We will write this more suggestively as

n^{1/2}(β̂_OLS − β₀) →d N(0, σ₀² Σ_OLS^{-1}),   Σ_OLS = lim_{n→∞} n^{-1}XᵀX = lim_{n→∞} n^{-1} Σ_{j=1}^n f_β(x_j, β₀) f_βᵀ(x_j, β₀).   (8.16)

Thus, we have

β̂_OLS ∼̇ N(β₀, σ₀² (XᵀX)^{-1}).   (8.17)

The form of (8.17) resembles the form one obtains in the classical linear regression situation; the difference is that the matrix X is not a fixed design matrix, but is rather a complicated function of the true value β₀, whose form depends on that of the nonlinear mean model f. In fact, the above large-sample results are exactly those one would encounter in a basic text on nonlinear regression under the classical assumptions of constant variance, independence, and normality; e.g., Bates and Watts (1988). Note that nowhere in the above did we use a distributional assumption; all that was needed was the assumption on the first two (conditional) moments. Furthermore, the above applies in the case f(x_j, β) = x_jᵀβ, of course. Thus, (8.16) and (8.17) hold, at least approximately in a large-sample sense, for the linear regression model with constant variance, even if the original response is not conditionally normally distributed.

• w_j arbitrary, u_j ≡ 1: In this case, variance is in truth nonconstant and changes with j according to the values of w_j; however, we incorrectly specify constant variance, u_j ≡ 1. Thus, we would use OLS estimation, β̂ = β̂_OLS, in a situation where the true variance is not constant. Now
U = I_n but W = diag(w₁, …, w_n), and the covariance matrix in (8.15) reduces to σ₀² (XᵀX)^{-1} XᵀW^{-1}X (XᵀX)^{-1}, so that under these conditions (8.15) becomes

β̂_OLS ∼̇ N(β₀, σ₀² (XᵀX)^{-1} XᵀW^{-1}X (XᵀX)^{-1}).   (8.18)

Of course, we may write this result more formally in a form like (8.13) as

n^{1/2}(β̂_OLS − β₀) →d N(0, σ₀² Σ_OLS^{-1} Σ_W Σ_OLS^{-1}),   (8.19)

where Σ_W = lim_{n→∞} n^{-1} XᵀW^{-1}X.

• w_j, u_j both arbitrary: In this situation, we estimate β by WLS, but we use incorrect weights. Here, of course, the result (8.15) holds as is.

• u_j = w_j for all j: Now we estimate β by WLS using the correct weights. Under these conditions, U = W = diag(w₁, …, w_n), and (8.15) reduces to

β̂_WLS ∼̇ N(β₀, σ₀² (XᵀWX)^{-1}).   (8.20)

Again, we may write this also as

n^{1/2}(β̂_WLS − β₀) →d N(0, σ₀² Σ_WLS^{-1}),   Σ_WLS = lim_{n→∞} n^{-1}XᵀWX = lim_{n→∞} n^{-1} Σ_{j=1}^n w_j f_β(x_j, β₀) f_βᵀ(x_j, β₀).   (8.21)

COMPARISONS: We are now in a position to make some efficiency comparisons. We begin by considering the question on page 205: why not just use OLS, even if the variance is nonconstant? Suppose that we specify u_j ≡ 1, and so estimate β using OLS, when in truth the data follow (8.9) with some arbitrary w_j not all equal to 1. If we were clever enough to know the true w_j values, we could use these to form weights instead and estimate β by WLS. From our discussion of asymptotic relative efficiency, the advantage may be seen by comparing large-sample covariance matrices. In particular, we would like to compare the covariance matrices in (8.19) and (8.21); that is, determine whether Σ_OLS^{-1} Σ_W Σ_OLS^{-1} − Σ_WLS^{-1} is nonnegative definite (note that we can disregard the common multiplicative constant σ₀²). Under regularity conditions, instead of looking at these limiting matrices, it suffices to consider their approximate values for large n and evaluate whether the difference

(XᵀX)^{-1} XᵀW^{-1}X (XᵀX)^{-1} − (XᵀWX)^{-1}   (8.22)

is nonnegative definite.

IMPORTANT: Recognize that, although we are comparing OLS and WLS estimation, the covariance matrix of β̂_OLS given in (8.16) and (8.17) is not the one of interest here. This covariance matrix corresponds to the sampling variation of β̂_OLS when it is used under the
condition that the true variance is constant. If the true variance is not constant, the calculation of the OLS estimator stays the same, but its properties change to those given in (8.19) and (8.18), reflecting the fact that the true variance of the data is not constant but depends on the w_j.

Thus, recognize that, in general, an estimator has the same form regardless of the underlying characteristics of the data. It may still be used even if the assumptions used to derive it are not valid; however, its properties will change depending on the underlying truth. This will be a recurring theme in later chapters.

Continuing with the comparison, we will now show that the difference in (8.22) is nonnegative definite; that is, that

λᵀ { (XᵀX)^{-1} XᵀW^{-1}X (XᵀX)^{-1} − (XᵀWX)^{-1} } λ ≥ 0 for all λ.

We may rewrite this as

dᵀ { XᵀW^{-1}X − XᵀX (XᵀWX)^{-1} XᵀX } d,   d = (XᵀX)^{-1} λ.

In fact, we may write this further as

cᵀ { I − W^{1/2} X (XᵀWX)^{-1} Xᵀ W^{1/2} } c,   c = W^{-1/2} X d.   (8.23)

Here, W^{1/2} = diag(w₁^{1/2}, …, w_n^{1/2}), and W^{-1/2} is its inverse. Note that, if we define X₁ = W^{1/2}X, then the middle matrix in (8.23) can be written as I − P, P = X₁(X₁ᵀX₁)^{-1}X₁ᵀ, which is a symmetric, idempotent matrix. Thus, (8.23) may be written as

cᵀ(I − P)ᵀ(I − P)c = ‖(I − P)c‖² ≥ 0.

RESULT: It follows that the ARE of β̂_OLS to β̂_WLS, when the variance truly is nonconstant as in (8.9), is ≤ 1. Thus, if one estimates β using OLS under these circumstances, the resulting estimator will be consistent but inefficient (less precise) relative to the WLS estimator using the correct weights, at least approximately (recall that these are large-sample results). Of course, this result resembles a similar exact one for linear models.

REMARKS:

• The preceding developments do not, however, quantify the extent of loss of efficiency. This would depend on the relative magnitudes of the two covariance matrices, which clearly depend on the particular function f, the value of β₀, the covariate settings x_j, and the weights w_j. The numerical value may thus be different in different problems.

• The result does suggest that it is generally advantageous to use the correct weights.
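The nonnegative definiteness of the difference (8.22) is easy to check numerically. The following is a minimal sketch (not part of the original notes), assuming a hypothetical straight-line mean, so that the matrix of gradients X is just the usual design matrix, and hypothetical true weights w_j = x_j²:

```python
import numpy as np

# Numerical check of (8.22): the large-sample OLS covariance (8.18) minus the
# correct-weight WLS covariance (8.20), each without the common factor
# sigma0^2, should be nonnegative definite.
x = np.arange(1.0, 9.0)                      # 8 hypothetical covariate settings
X = np.column_stack([np.ones_like(x), x])    # n x p gradient ("design") matrix
W = np.diag(x**2)                            # hypothetical true weights w_j

XtX_inv = np.linalg.inv(X.T @ X)
cov_ols = XtX_inv @ X.T @ np.linalg.inv(W) @ X @ XtX_inv   # sandwich form (8.18)
cov_wls = np.linalg.inv(X.T @ W @ X)                       # form (8.20)

# All eigenvalues of the difference are >= 0 (up to rounding), so OLS is
# never more precise than WLS with the correct weights.
print(np.linalg.eigvalsh(cov_ols - cov_wls))
```

In particular, the diagonal entries of the difference are nonnegative, so each component of β is estimated at least as precisely by correctly weighted WLS as by OLS.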
Intuitively, then, it is a good idea to understand as well as possible the form of the variance. Failure to do this could potentially result in a great loss of precision for estimating β, at least approximately.

The foregoing argument may be applied more generally. Suppose that, rather than estimate β by OLS, we use WLS, but with an incorrect set of weights u_j. This represents the situation where we have erroneously specified the form of the variance. In this case, the relevant comparison is

(XᵀUX)^{-1} XᵀUW^{-1}UX (XᵀUX)^{-1} − (XᵀWX)^{-1}.

As U and W are both diagonal matrices here, we may define X₁ = U^{1/2}X and W₁ = U^{-1/2}WU^{-1/2} and rewrite this as

(X₁ᵀX₁)^{-1} X₁ᵀW₁^{-1}X₁ (X₁ᵀX₁)^{-1} − (X₁ᵀW₁X₁)^{-1}.

The same argument as above may then be used to show that the difference is nonnegative definite. In fact, even if U and W are not diagonal, as long as they are positive definite, the same reasoning may be applied for appropriate definitions of U^{1/2} and W^{1/2}; this will be of interest in Chapter 14.

RESULT: Using WLS with the correct weights is better, in the sense of precision of estimation, than using incorrect weights. The obvious implication is again that trying to gain an understanding of the true form of the variance is advantageous.

OBVIOUS QUESTION: Of course, in practice, it is unlikely that we would be able to identify a set of fixed constants w_j to characterize the variance. What we have argued that we would be able to do in many situations is deduce an appropriate variance function to represent the form of the variance, as in the general mean-variance model (8.1). The preceding results suggest that failure to get the weights right can result in loss of efficiency, with how much depending on the particular problem. It is natural to wonder whether some of the implications in fact carry over to the more general setting of (8.1). This will be investigated in Chapter 9.

1 Introduction and Motivation

1.1 Scope and objectives

OBJECTIVE: The goal of this course is to provide a comprehensive treatment of modern regression models and
associated inferential methods for univariate and multivariate response.

• By univariate response, we mean the case where the response is a single scalar value.
• By multivariate response, we mean the case where several scalar responses may be thought of together as a group.

We will clarify these designations through a series of examples momentarily. Our emphasis will be recent developments in the statistical literature that have become commonplace tools for the practicing statistician. We will discuss the models and methods from two perspectives:

• We will derive relevant theoretical results and study their implications for practical use, and
• We will discuss practical implementation of the methods using available software and demonstrate them on data from a variety of applications.

THEME: As we will see, statistical models and associated inferential methods for univariate and multivariate response share common features. Thus, we will first study the univariate case in some detail. Many of the concepts and results will then carry over readily to the case of multivariate response.

We begin by reviewing the form of the classical linear regression model, which provides a convenient point of departure for discussing the need for more sophisticated and broadly applicable methods.

1.2 Classical linear regression model

In the usual regression framework, we consider a scalar response Y, say, and an associated vector of covariates x. Formally, let

Y = response (dependent variable); scalar
x = covariate (predictor, independent variable); p × 1 (x may include the constant 1).

Here, we have noted some of the common terminology used in the regression literature.

ASSUME: The values of x may be set by an experimenter or may be observed. In either case, the values are known without error. For example:

• x is a planned dose of a drug given to a rat by injection in a pharmacological experiment. The actual dose received is exactly equal to the planned value; no errors were committed
in preparing the injection solution.

• x contains values of age, weight, height, and race of subjects sampled from a population of interest in an epidemiological study. None of the values of age or race of the subjects are recorded incorrectly. Errors of measurement committed in ascertaining weight and height are negligible.

In the first example, x is often thought of as fixed, as it is set by the experimenter. In the second, as the subjects are sampled from a population, x may be thought of as a random vector taking its values according to some multivariate probability distribution; in this case, values sampled from this population may be ascertained perfectly, without error. Alternatively, in the first example, if the dose scale is continuous and there are n different doses in the study, one could also view the doses as being drawn from a population of doses along the continuum of possible doses.

NOTATION: Although x may be a random vector, we use the lower-case symbol.

USUAL PERSPECTIVE: The usual way of thinking is that Y is a random variable. Given x, the values of Y that might be observed vary, so that different values of Y may be seen with the same value of x. This may be because of one or more of the following:

• Error in measurement of Y. Here, ideally, there is a unique value of Y corresponding to each x, so that in truth a deterministic relationship between Y and x does exist. However, Y cannot be measured exactly. For example, in an idealized physics experiment, if Y is force and x is acceleration, then, for an object with a particular mass m, Y = mx. However, with available devices, it may only be possible to measure Y for a given x with some uncertainty, so that different values of Y may be recorded for the same x, reflecting the error in the device.

• "Sampling variation," or variation among individuals. This may happen in several ways. In the epidemiological study example above, suppose Y is total cholesterol. If we consider all subjects in the population having a particular age, weight, height, and race
combination, total cholesterol will not be identical for all of them, reflecting natural biological variation across humans. Of course, total cholesterol is likely to be measured with error as well.

As another example, consider the pharmacological study above: x may be dose of drug given to a rat, and Y = concentration 5 minutes after injection, where Y is determined by drawing a blood sample from the rat. If the drug is not well mixed within the rat, then different values of Y might be seen from different samples. Of course, again, Y may be measured with error.

• Features of the biological or physical process under study. For example, in the epidemiological study, for a given subject, total cholesterol may fluctuate over time about some typical value. Thus, the value of Y recorded for a subject will reflect this fluctuation.

In many applications, it would indeed be likely for these features to occur simultaneously, although the effects of some might be negligible relative to those of others.

FORMALLY: We may conceptualize, for each value of x, a probability distribution characterizing possible values of the response Y that might be observed.

• This is easy to think of in the case where the values of x are fixed constants.
• When the values of x vary, for example in a population, it is natural to think of a joint distribution of values (Y, x) that might be observed. Under this perspective, the probability distribution characterizing possible values of Y for each value of x would be the conditional distribution.

We will not concern ourselves with the distinction between fixed and random x. The important point is that the basis for regression modeling is to consider a statistical model for the values of Y conditional on x values. We will adopt this perspective throughout.

DATA: We observe pairs (Y_j, x_j), j = 1, …, n.

• We may think of these pairs as draws from the joint probability distribution of (Y, x), in the event x is observed rather than set by the experimenter.
• Alternatively, we may think of Y_j as
being a draw from the distribution of Y corresponding to the fixed value x_j.

We summarize the data as

Y = (Y₁, Y₂, …, Y_n)ᵀ,   X = (x₁, x₂, …, x_n)ᵀ,

where Y is an n × 1 vector and X is an n × p matrix.

NOTE: Our notation here blurs the distinction between random variables and observed values, as is often conventional, although not very precise.

USUAL LINEAR REGRESSION MODEL: This model is a statistical model for the pairs (Y_j, x_j), usually written in terms of deviations ε_j = Y_j − x_jᵀβ as

Y_j = x_jᵀβ + ε_j,   j = 1, …, n,   or   Y = Xβ + ε,   ε = (ε₁, ε₂, …, ε_n)ᵀ,

in matrix notation. Here, the Y_j are assumed to be related to the x_j (approximately) through a function linear in a parameter vector β (p × 1). For example:

• x_jᵀβ = β₀ + β₁z_j, a straight line, where x_j = (1, z_j)ᵀ, β = (β₀, β₁)ᵀ, p = 2.
• x_jᵀβ = β₀ + β₁z_j + β₂z_j², a quadratic function in z_j, with x_j = (1, z_j, z_j²)ᵀ, β = (β₀, β₁, β₂)ᵀ, p = 3.
• x_jᵀβ = β₀ + β₁z_{1j} + β₂z_{2j} + β₃z_{1j}z_{2j}, a simple response surface model, with x_j = (1, z_{1j}, z_{2j}, z_{1j}z_{2j})ᵀ, β = (β₀, β₁, β₂, β₃)ᵀ, p = 4.

The models may be a characterization of what is believed to be the truth, or an approximation to a more complicated model (e.g., a nonlinear model). The deviations ε_j represent the fact that the relationship between observed values for Y and x does not exactly follow the smooth relationship dictated by x_jᵀβ. Rather, observed values deviate from this relationship, presumably due to one or more of the reasons given above (e.g., measurement error, sampling variation, etc.). As the Y_j are viewed as random variables, the ε_j are also random variables.

ASIDE: The ε_j are often referred to as "errors" in the regression literature. We prefer the term "deviations," as use of the former term can be misleading: it might suggest that the fact that observed values do not follow a smooth relationship with x is due entirely to the effects of measurement error. In reality, the Y_j deviate from the relationship for one or more of the reasons given.

CLASSICAL REGRESSION ASSUMPTIONS: In a first course on regression analysis, inferential methods for this model are developed under a set of assumptions. For lack of a better term, we
will call these the "classical" assumptions. A good part of this course will be devoted to relaxation of these assumptions and study of the consequences when they are incorrect.

Usually, the classical assumptions are stated assuming the x_j are fixed constants; that is, the conditioning on x_j is not made explicit. Here, we will be a little more careful.

(0) The expected value of ε_j is equal to zero. Such an assumption implies no systematic tendency or bias in the way the Y_j deviate from x_jᵀβ. Note that we could interpret this as E(ε_j | x_j) = 0, so that the expected value is conditional on the value of x_j, or as E(ε_j) = 0, so that this is an unconditional statement. Technically, these are two different statements; the first implies the second, of course, but not vice versa. Under the usual assumption that the x_j are fixed, this distinction is usually not discussed. Given (2) below, the assumption E(ε_j) = 0 is made directly.

(1) The model x_jᵀβ is correct. Along with (0), treating the x_j as fixed constants, this implies that E(Y_j | x_j) = x_jᵀβ.

(2) The ε_j are identically distributed for all j, independently of x_j, and var(ε_j) = σ² = var(ε_j | x_j) by independence.

(3) The ε_j are independent.

(4) The ε_j are normally distributed.

The assumption that ε_j and x_j are independent is usually not stated explicitly. Note that (4) implies that the response may be viewed, at least approximately, as continuous. Often, (3) and (4) are combined into a single statement: the ε_j are independent normal random variables. Usually, what is really meant is that the pairs (Y_j, x_j) are independent. In any event, the assumptions taken all together imply

E(Y_j | x_j) = x_jᵀβ,   var(Y_j | x_j) = σ²,   (1.1)

so that Y ∼ N(Xβ, σ²I_n).

REGRESSION PARAMETER β: If β were known, then the model (1.1) would provide a complete characterization of the response "on average." Usually, β is unknown and must be estimated. The approach that is advocated under the assumptions above is that of ordinary least squares (OLS). The OLS estimator β̂_OLS is defined as
β̂_OLS = (XᵀX)^{-1}XᵀY = ( Σ_{j=1}^n x_j x_jᵀ )^{-1} Σ_{j=1}^n x_j Y_j.

• Under all assumptions (0)–(4), β̂_OLS is in fact the maximum likelihood estimator for β; i.e., β̂_OLS maximizes the loglikelihood

log L = −(n/2) log(2π) − (n/2) log σ² − {1/(2σ²)} Σ_{j=1}^n (Y_j − x_jᵀβ)².

Thus, if the assumptions are all correct, β̂_OLS is a natural estimator to consider. Moreover, for large samples, β̂_OLS enjoys all the usual optimality properties associated with maximum likelihood estimation. Note that β̂_OLS satisfies

minimize Σ_{j=1}^n (Y_j − x_jᵀβ)²   ⟺   solve Σ_{j=1}^n x_j (Y_j − x_jᵀβ) = 0.   (1.2)

• The expression to be minimized in (1.2) may be interpreted as a sensible distance criterion. Even if we do not adopt (4) (normality), β̂_OLS still has the interpretation as being the minimizer of "distance" between data and model; in fact, it is often motivated this way in elementary regression texts. Here, all observations receive equal weight in the distance measure, reflecting assumption (2), which implies that all data are of equal quality (precision). Note further that all observations are treated separately in a sum, reflecting the independence assumption.

• In addition, β̂_OLS is a linear function of the Y_j and has expectation β under (1.1), whether assumption (4) holds or not, so is unbiased. In fact, it may be shown that, under (0)–(3), β̂_OLS has the smallest sampling variance among all linear functions of the Y_j used to estimate β: it is the Best Linear Unbiased Estimator (BLUE) of β.

• More generally, β̂_OLS solves the set of p estimating equations in (1.2) that are linear in the Y_j. If we adopt the sum of squared deviations above as the criterion to minimize, then this is so regardless of whether (4) holds.

Thus, an implication to keep in mind is that, even without the assumption of normality, β̂_OLS may be viewed as a sensible estimator for β with a nice interpretation and nice properties.

1.3 Violation of the classical assumptions

All of the classical assumptions are violated routinely in practice. Moreover, several of them may be violated simultaneously. We illustrate by considering a series of examples.

EXAMPLE 1.1:
Pharmacokinetics of indomethacin. A common objective is to investigate the pharmacokinetics of a drug. A known dose of the drug is given to a subject (human or animal), and then, at several subsequent time points, blood samples are drawn and the concentration of drug in blood or plasma (the response) is determined. Interest focuses on characterizing the concentration-time profile.

[Figure 1.1: Concentration-time data for a subject receiving intravenous indomethacin at time 0. Axes: Concentration (mcg/ml) versus Time (hours).]

Figure 1.1 shows plasma concentrations plotted against time for a human subject receiving an intravenous dose of the drug indomethacin at time 0 (the time scale is defined so that 0 represents dose time). Here, x_j is the jth time point (in hours) and Y_j is the measured concentration of indomethacin.

• From the plot, one might be tempted to approximate the relationship by a polynomial in time, which is, of course, a linear model. However, this is not a good approximation in general.

• In fact, theoretical considerations suggest a more scientifically relevant way to represent the relationship. Standard practice in pharmacology is to represent the body as a system of "compartments" corresponding to parts of the system, such as blood and "deeper tissues." For indomethacin, pharmacologists believe a two-compartment model as follows provides a reasonable representation:

[Diagram: dose D enters the blood compartment, with amount X(t); the blood compartment exchanges with a deeper tissue compartment, with amount X_tis(t), at fractional rates k₁₂ and k₂₁; drug is eliminated from the blood compartment at rate k_e.]

Here, dose D is given intravenously into the blood compartment at time t = 0, and the amount of drug present in blood is denoted by X(t). Transfer is assumed to take place between the blood compartment and the "deeper tissues" compartment according to fractional rates of transfer k₁₂ and k₂₁, and the amount in the deeper tissue compartment at t is denoted by X_tis(t). The rate k_e corresponds to removal of drug from the blood by elimination from the system (e.g., via the kidneys). From this model, one can write down a set of differential equations describing
amounts of drug in each compartment:

dX(t)/dt = k₂₁ X_tis(t) − k₁₂ X(t) − k_e X(t),
dX_tis(t)/dt = k₁₂ X(t) − k₂₁ X_tis(t),

subject to the initial conditions that X_tis(0) = 0 (no drug in the deeper tissues when the dose is given) and X(0) = D (the dose "fills" the blood compartment instantaneously when it is given). Solution of the differential equations yields an expression for X(t). This can be divided by a parameter corresponding to the volume of the blood compartment to give concentration (amount per unit volume) at time t. The form of the expression is

β₁ exp(−β₂t) + β₃ exp(−β₄t),

where the elements of β = (β₁, β₂, β₃, β₄)ᵀ are functions of k₁₂, k₂₁, k_e, and the volume; that is, functions of physically meaningful quantities with respect to the compartment model. In fact, pharmacokineticists are often more interested in the values of these parameters than they are in the actual profile itself. This is because the values of k₁₂, k₂₁, k_e, and volume tell them something about the individual's inherent biological features in terms of processing this drug. Other meaningful quantities, for example the drug "terminal half-life"

t_{1/2} = log 2 / β₄,

may also be derived from these elements.

The above suggests that an appropriate regression model to describe blood concentration Y_j at time x_j would be

E(Y_j | x_j) = β₁ exp(−β₂x_j) + β₃ exp(−β₄x_j).   (1.3)

This model is a function of a "covariate" x_j (time) and parameters β = (β₁, β₂, β₃, β₄)ᵀ, but is nonlinear in some of the elements of β. In particular, the dependence in (1.3) on the parameters β₂ and β₄ appearing in the exponential terms is nonlinear.

• The model has a theoretical interpretation. Rather than just being an empirical representation for the profile, the form of, and parameters in, the model have physical meaning, and the model itself is derived from theoretical considerations.

CLASSICAL ASSUMPTIONS VIOLATED: (1). A linear model does not provide a scientifically relevant (or very good) representation for these data.

ASIDE: We put the word "covariate" in quotes above because time is really not a covariate in the strict sense in this setting. If we think
rigorously about this problem, we may envision an entire process in continuous time. In particular, we can think of Y(t), the concentration we would observe at time t. The deterministic compartmental model provides a description of what the amount of drug X(t), and hence exact concentration of drug (if we divide by volume), would look like at any time t, if we believed this model and if there were no measurement error or other source of variation (such as fluctuations within a person) operating. If we think of Y(t) as the concentration we would observe, including these effects, we can think of Y(t) as a stochastic process in time; i.e.,

Y(t) = β₁ exp(−β₂t) + β₃ exp(−β₄t) + e(t),

say, where e(t) is the "deviation" at time t that takes into account the effects of fluctuations and measurement error. From this perspective, then, time is a fundamental part of the response, not a covariate in the traditional sense. When we take blood samples at specific times x₁, …, x_n, say, we are seeing realizations of the process Y(t) at these time points.

It is conventional in much of the literature on univariate regression to sweep this distinction under the rug and to treat time as a covariate in the usual way. In the sequel, we will do the same, as, for our purposes, how we regard time in problems like this will not have implications for the concepts and results we will study. When we discuss multivariate response later in the course, this issue becomes somewhat more important; we defer further comment until then.

EXAMPLE 1.2: Pharmacokinetics of indomethacin, continued. As we will discuss, model (1.3) may be fitted to the indomethacin data using a nonlinear version of ordinary least squares. In fact, there are alternative fitting methods that we will also discuss. Figure 1.2 shows the data plotted on the log concentration scale, with the model (1.3) fitted to the data using nonlinear ordinary least squares (solid line) and another method, called generalized least squares (dashed line), that we will introduce in Chapter 2. Note that these fits are
different, with the difference being most pronounced for the largest time points, where the response is smallest.

[Figure 1.2: Log concentration-time data with fits of model (1.3) superimposed. Axes: Concentration (mcg/ml, log scale) versus Time (hours).]

From the nonlinear ordinary least squares fit, it is possible to construct a familiar residual plot. Figure 1.3 shows a plot of residuals, which have been "studentized" (standardized in a certain way, in a manner we will discuss in Chapter 7), against predicted values, where predicted values are constructed in the usual way by evaluating the model (1.3) for each time point at the estimated value of β.

• The plot shows clear evidence of a fan shape, such that the magnitude of the studentized residuals is larger for larger predicted values, so for larger values of the response. Just as in linear regression, this pattern suggests that variability in the response Y increases with the level of Y: the larger Y, the larger the variation. The increase seems to take place in a smooth fashion, as suggested by the regularity of the fan shape.

[Figure 1.3: Plot of studentized residuals versus predicted values for the nonlinear ordinary least squares fit of (1.3). Axes: Studentized residual versus Predicted value (log scale).]

• In fact, this phenomenon is commonplace with pharmacokinetic data. In particular, it is widely noted that the variability in measured drug concentrations about a smooth concentration-time profile, as dictated by a model like (1.3), increases with increasing concentration.

CLASSICAL ASSUMPTIONS VIOLATED: The assumption (2) of constant variance σ² for all j is obviously not appropriate. In fact, we can identify still another possible violation. Concentration measurements are taken over time on the same subject. As we will discuss in greater detail later in the course, there is a possibility that, if responses are ascertained sufficiently close together in time (separated by only a small time interval), they may be more alike than those that are farther apart in time.
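To make this idea concrete, here is a small sketch (an illustration, not part of the original notes) of one hypothetical way such serial correlation is often represented: an AR(1)-type structure in which the correlation between deviations at times s and t is ρ^{|s−t|}, so that it decays as the time separation grows.

```python
import numpy as np

# Hypothetical AR(1)-type serial correlation: corr{e(s), e(t)} = rho**|s - t|,
# so deviations at nearby sampling times are more alike than deviations far
# apart in time. The sampling times and rho below are illustrative choices.
def serial_corr_matrix(times, rho):
    times = np.asarray(times, dtype=float)
    return rho ** np.abs(np.subtract.outer(times, times))

times = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0]   # blood-sampling times (hours)
R = serial_corr_matrix(times, rho=0.6)

# Correlation between the two closest samples vs. the two farthest apart;
# the first is much larger than the second.
print(R[0, 1], R[0, 5])
```

The matrix R has unit diagonal and is symmetric, as a correlation matrix must be; only its off-diagonal decay pattern encodes the "more alike when closer in time" feature described above.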
That is, they may tend to be large or small (larger or smaller than the fitted profile would suggest) together. This tendency for serial correlation would clearly violate the independence assumption. As it turns out, pharmacokinetic data are often reasonably assumed to be approximately normally distributed at any time x, so assumption (4) may be reasonable.

EXAMPLE 1.3: Soybean growth study. Figure 1.4 depicts data from an experiment carried out by researchers in the Department of Crop Science at North Carolina State University, reported in Davidian and Giltinan (1995, p. 7). In a series of field experiments over several years, plots in a field were planted with two genotypes of soybean, and the goal was to compare the growth patterns across the two genotypes. Each plot was sampled over the course of the growing season at approximately weekly intervals. At each sampling time, six plants were randomly selected from each plot, their leaves were removed and mixed together, and the mixture was weighed. The weight divided by 6 was the "average leaf weight per plant." The figure shows average leaf weight per plant versus time for the eight plots planted with one of the genotypes in one of the years.

[Figure 1.4: Soybean growth data for 8 plots. Axes: Average leaf weight per plant versus Days after planting.]

• From the figure, one might again be tempted to use a polynomial to represent the growth profile for a given plot. However, as with the pharmacokinetic example, this will not lead to a very good approximation. For one thing, intuitively, as time goes on, growth cannot continue to increase forever; rather, it must "level off" as the plants in the plot reach maturity. Moreover, as the plants are measured over the season, where growth is taking place, it is unlikely that they will begin to "shrink," so that the profile begins to decrease. A polynomial model, such as a quadratic function of time, would not be able to capture the asymptotic behavior expected toward the end of the season; moreover, it
allows the possibility of decrease.

• Several theoretical models have been postulated to describe growth processes. The most common such model is the logistic growth model, which says that growth rate relative to present size declines linearly with increasing size. Formally, letting Y be the growth value (average leaf weight per plant here) and x be time, this may be expressed by the deterministic relationship

(1/Y) dY/dx = k (1 − Y/a),

where the right-hand side is a linear function of present size Y, and k > 0 and a > 0. Upon integration, this model leads to

Y = β₁ / {1 + β₂ exp(−β₃x)},

where β₁ = a, β₃ = k, and β₂ is the value such that β₁/(1 + β₂) represents size at time x = 0. Note also that, as time gets large, i.e., x → ∞, the function approaches β₁. Thus, β₁ is a physically meaningful parameter characterizing the asymptotic behavior, and, together with β₂, it characterizes the physically meaningful feature of "starting growth" at x = 0. As β₃ > 0, this parameter describes the change of growth with time. A plot of this function over a range of x for particular choices of β = (β₁, β₂, β₃)ᵀ reveals that it has an "S" shape. Thus, this model appears to be a reasonable way to represent the growth profile for a given plot.

• A common phenomenon for growth data is that variability in the response about an "S"-shaped pattern increases with size. Furthermore, for a given plot, the measurements of growth are taken over time. These features are similar to those for the pharmacokinetic example.

CLASSICAL ASSUMPTIONS VIOLATED: (1), (2), and (3) are clearly violated: an appropriate model is not linear in parameters, and the assumptions of constant variance and independence are suspect. Whether or not the normality assumption (4) is reasonable is not immediately clear.

EXAMPLE 1.4: Assay data. An assay is a procedure used to determine the level of drug or other substance in a sample. For example, pharmacokinetic analysis requires that the concentration of drug be determined for the blood samples drawn from a particular subject
at each time point. Ordinarily, such levels cannot be determined directly. Under the assay paradigm, the level is inferred by observing a response that is related to level. An assay is conducted as follows:

• Samples with known levels x_j of the substance are prepared, the so-called "standards." Typically, replicate samples with the same x_j value are prepared. For each standard, a response Y_j is measured. The resulting pairs (Y_j, x_j) may be used to establish a relationship between level (e.g., concentration of drug) and the response.

• Responses for samples of interest with unknown levels are also obtained. The relationship may be used to infer the unknown level. This procedure is known as calibration and is often referred to in regression texts as inverse regression.

Different kinds of assays are performed, depending on the substance to be calibrated. Radioimmunoassay (RIA) is one type of assay procedure. Here, the response Y is generally a radioactive (gamma) count whose value decreases as level increases. Figure 1.5 shows standards data for a RIA developed at Becton Dickinson Research Center in Research Triangle Park, NC, to determine the concentration of a certain drug in porcine blood serum for an experiment involving the pharmacokinetic properties of the drug in pigs; the data are reported in Belanger, Davidian, and Giltinan (1996).

Alternatively, Figure 1.6 shows standards data from a type of assay known as an ELISA, or enzyme-linked immunosorbent assay. The response for an ELISA is generally something like an absorbance or optical density (color) reading, or a rate of change of this, that increases with level. This assay was developed in the context of a large epidemiological study of childhood asthma. The study involved relating the levels of common allergens in samples of house dust to development of childhood asthma. To determine levels of allergen in house dust samples, an assay was required. The data in Figure 1.6 are standards data from a run of an assay developed to measure levels of
Dc 1 39J r 39 roach allergen and are reported in Higgins Davidian and Giltinan 1998 Both of these examples exhibit some common features of assay data an S shaped relationship between concentration and response and variability that tends to increase with the level of the response This latter feature is easily deduced directly from a plot of the data because of the replication present at each standard concentration in each case Thus variance is obviously not constant in either case A standard model to represent the relationship between response Y and concentration or level z is the so called four parameter logistic function given by 5251 51 1m g 4 PAGE 14 CHAPTER 1 ST 762 M DAVDDIAN Figure 15 Standards data for a BIA for drug in porcine serum Response 50 100 500 1000 Concentration n gml Figure 16 Standards data for a ELISA for roach allergen Response 100 150 200 250 300 50 10 Concentration n gml PAGE 15 CHAPTER 1 ST 762 M DAVDDIAN Here 31 is the response at m oo 32 is the background response at z 0 33 is the concentration giving response halfway between g and 31 often called the ED50 and B4 is a shape parameter governing the steepness of the increasing or decreasing part of the curve This model provides an excellent representation of most assay concentration response relationships CLASSICAL ASSUMPTIONS VIOLATED 1 and 2 are clearly violated as before Typically the responses are determined separately for each sample so are considered independent In the case of RlA the response is a count Thus one might question the relevance of the normality assumption Usually however the counts are quite large Recall for large mean the Poisson distribution often a model for count data approaches the normal Table 11 Data for 4 children in the Six C ities study Subject City Mother s smoking status Wheezing status 5 D on m d O 5 Hook 1 0 2 0 3 Portage 0 4 Portage 1 EXAMPLE 15 Six C ities study 7 Binary data Often the response is not continuous but rather is discrete The most profound 
example is that where the response takes on only two possible values, generally denoted as Y = 0 or 1, corresponding to "absence" or "presence" of a condition, or "failure" or "success." An example is provided by data from a large public health study called the Six Cities Study, which was undertaken in six small American cities to investigate a variety of public health issues. The full situation is reported in Lipsitz, Laird, and Harrington (1992). The data in Table 1.1 are those for the first 4 of 300 children in a study focused on the association between maternal smoking and child respiratory health at age 10. The response of interest was "wheezing status," a measure of the child's respiratory health, which was coded as either "no" (0) or "yes" (1), where "yes" corresponds to respiratory problems. Also recorded at each examination was a code to indicate the mother's level of smoking: 0 = none, 1 = moderate, 2 = heavy.

Formally, for the jth child, the response Y_j is the binary variable taking on values 0 and 1. Mother's smoking status is a categorical variable with three levels, so we represent it by two dummy variables defined as

    x_0j = 1 if none, 0 otherwise;    x_1j = 1 if moderate, 0 otherwise.

Let x_j = (1, x_0j, x_1j)^T. Following the same reasoning as in classical regression modeling, we would like to specify a model for E(Y_j|x_j). Now here, as Y_j is binary,

    E(Y_j|x_j) = P(Y_j = 1|x_j) = π(x_j) = π_j, say.

So, from the usual perspective, we would like to write a model for the association between the probability of presence of wheezing and the covariate mother's smoking status. Obviously, as probabilities must be between 0 and 1, a model that doesn't respect this restriction is a poor candidate. A linear model π(x_j) = β0 + β1 x_0j + β2 x_1j, say, has no property that forces its value to be between 0 and 1, so is not an appropriate choice. Alternatively, a common approach is to postulate a model for π_j that does respect this restriction. The most popular such model is the so-called logistic regression model. This model is often motivated as follows. The odds is the
ratio of the probability of seeing the event of interest to not seeing it, i.e., π_j/(1 - π_j). The logistic regression model models the logarithm of the odds, or logit, as a linear function of x_j. In particular, in the case of the Six Cities data, the model would be

    log{π_j/(1 - π_j)} = x_j^T β,    or    π_j = exp(x_j^T β)/{1 + exp(x_j^T β)}.

Note from the second expression that 0 < π_j < 1.

• The interpretation is that an increase in one of the elements of x_j by one unit changes the log odds by the value of the corresponding coefficient in β; for example, going from x_0j = 0 to x_0j = 1 changes the log odds by the additive amount β1. Thus, the odds is changed by the multiplicative factor exp(β1). Note that this model is nonlinear in the elements of β.

Now, with E(Y_j|x_j) = π_j, we have immediately that var(Y_j|x_j) = π_j(1 - π_j), so that under the logistic regression model

    var(Y_j|x_j) = exp(x_j^T β)/{1 + exp(x_j^T β)}^2.

Obviously, the variance is a function of the mean, and hence of x_j, so varies with j. In fact, if we write

    ε_j = Y_j - exp(x_j^T β)/{1 + exp(x_j^T β)},

then, because Y_j can only take on the values 0 and 1 for each j, ε_j can also take on only two possible values.

CLASSICAL ASSUMPTIONS VIOLATED: (1), (2), and (4) are clearly violated. For (4), as Y_j is a Bernoulli random variable for each j, the normality assumption is not even a good approximation.

Table 1.2: Data on damage to cargo ships (6 ships).

    Ship type   Year constructed   Period of operation   Months service   Damage incidents
    A           1960-64            1960-74               127              0
    A           1970-74            1975-79               1512             6
    B           1960-64            1960-74               44882            39
    B           1960-64            1975-79               17176            29
    C           1960-64            1960-74               1179             1
    D           1970-74            1975-79               1208             11

EXAMPLE 1.6: Wave damage to cargo ships. The data in Table 1.2 are a subset of those reported in McCullagh and Nelder (1989, p. 205) concerning a type of damage caused by waves to the forward section of cargo ships. Each of a number of classes of ships, indexed by j, is represented by the ship type, year of construction, period of operation, and months of aggregate service. The first three variables are categorical, defining appropriate dummy variables; the
information on the jth combination of ship type, construction period, and period of operation, along with the aggregate months of service for that combination, may be summarized in a vector x_j. The response Y_j is the number of damage incidents suffered by combination j. This response is in the form of a count; note from Table 1.2 that the value varies from very small (including no incidents, 0) to rather large. A natural distributional model for data in the form of counts is the Poisson distribution. In the regression context, we envision a Poisson distribution describing possible values of the number of damage incidents for each ship type represented by x_j. That is,

    P(Y_j = k|x_j) = exp(-μ_j) μ_j^k / k!,    k = 0, 1, 2, ...,

where μ_j = E(Y_j|x_j) and, as dictated by the Poisson distribution, var(Y_j|x_j) = μ_j.

Note that μ_j appears to be quite small for some combinations, as indicated by the very small observed counts. As μ_j > 0 must hold, it might be dangerous to postulate a linear model such as μ_j = x_j^T β, as fitting such a model to data by, say, ordinary least squares could lead to negative predicted values in these cases. Rather, a model that respects the requirement of positive mean would be more sensible.

• A popular such model is the so-called loglinear model, in which it is assumed that

    log E(Y_j|x_j) = x_j^T β,    or    E(Y_j|x_j) = exp(x_j^T β).

CLASSICAL ASSUMPTIONS VIOLATED: (1), (2), and (4) are clearly violated. Because of the presence of rather small counts, using the normal distribution as an approximation to the Poisson would not be reasonable, as such an approximation is only good when the mean is large.

1.4 Further violation: multivariate response

So far in our examples, we have restricted attention to situations where the response may be viewed as univariate, in the sense that each Y_j is a scalar quantity. In some of the cases, responses were observed on a single subject or plot, so that the scope of inference of necessity is restricted to observations on this individual or unit. Thus, even though the observations are repeated
measurements on the same unit, we are only interested in modeling and inference for that unit. In other cases, like the Six Cities or cargo ship data, a single observation was available on each of several units, and interest focuses on inference about the population of units. Despite this distinction, in all of these cases the data structure is the same: pairs (Y_j, x_j) consisting of scalar responses and associated covariates.

We now consider situations where there are several units, each with several responses. This puts us in the realm of what we will refer to as multivariate response. We describe some examples to illustrate.

EXAMPLE 1.7: Developmental toxicology studies. Developmental toxicology studies in rodents (rats, mice) are used in testing and regulation of potentially toxic substances that may pose danger to developing fetuses. An overview is given by Ryan (1992) and references therein. As described by Ryan, a typical study includes a control group of pregnant dams exposed to 0 dose of the toxic agent and several additional groups of pregnant dams exposed to different doses of the agent; usually, each group involves 20 to 30 dams. Exposure to non-zero doses typically leads to malformations of the fetuses, prenatal deaths, decreased birthweight, and so on, depending on the agent, which are considered responses to the exposure. The issues involved in risk assessment based on such observations are rather complicated; here, we do not attempt to discuss these issues, but rather just use this situation to discuss the notion of clustered data.

For example, suppose that the response of interest is the continuous variable birthweight, and consider a study in which a total of m pregnant rats are exposed to different doses of a toxic agent. The objective is to characterize the effect on birthweight of different doses of the agent across the population of all exposed mothers and their pups.

Formally, index the pregnant rats by i = 1, ..., m, and suppose that rat i is exposed to dose x_i and gives birth to n_i pups, where the jth pup (j = 1, ..., n_i) has birthweight Y_ij. Several (e.g., 20) pregnant rats may be exposed to the same dose x_i. For each mother rat, then, we have observations Y_i1, ..., Y_in_i on her n_i pups. These responses may be collected into a vector Y_i = (Y_i1, ..., Y_in_i)^T. Thus, each mother yields a multivariate response, the vector of birthweights of each of her pups, with possibly different lengths n_i.

• Suppose that we believe that, on average, birthweight is a simple linear function of dose. That is, suppose we believe that, for mother rats receiving dose x_i, the average birthweight across all such mothers and their pups is β0 + β1 x_i. Suppose further that it is reasonable to assume that, for all doses x_i, associated birthweights vary across all mothers and their pups in a similar way, so that var(Y_ij|x_i) = σ², say, constant for all i and j. It may even be that continuous responses like birthweights for a given x_i may be reasonably thought to be normally distributed.

• An important issue, however, is that birthweights for pups born to the same mother may well be expected to be "more alike" than those compared across different mothers. In particular, some mothers may have a tendency to have heavier pups than other mothers, so all pups from a "heavy"-type mother given dose x will tend to deviate in a positive direction from the average birthweight for pups across all mothers given x. On the other hand, we would expect that the way in which birthweights for pups from mother i turn out would have nothing to do with how those for mother i′ turn out, once we have accounted for the possibility that both mothers may have received the same dose.

More formally, we may summarize these observations as follows. Clearly, the above reasoning indicates that we believe that birthweights for pups from the same mother may be correlated, simply because they may tend to be heavy or light "together." However, birthweights for pups from different mothers would be reasonably assumed to be independent, as they may turn out heavy or light in a completely
unrelated way. We may write this as

    corr(Y_ij, Y_i′j′) = ρ,  i = i′ (same mother);    = 0,  i ≠ i′ (different mothers).

• Summarizing, we may write a model incorporating all of these considerations in terms of assumptions about the vectors Y_i, i = 1, ..., m:

    E(Y_i|x_i) = (β0 + β1 x_i) 1_{n_i},    var(Y_i|x_i) = V_i,

where 1_{n_i} is the (n_i × 1) vector of ones, and

    V_i = σ² [ 1  ρ  ...  ρ ]
              [ ρ  1  ...  ρ ]
              [ .  .  ...  . ]
              [ ρ  ρ  ...  1 ]    (n_i × n_i).

Note that, although the elements of the covariance matrix V_i do not depend on i, the subscript i is necessary, as it denotes dependence of its dimension on n_i.

Notational note: Throughout, we will use var(·) to denote both the variance of a scalar random variable and the covariance matrix of a random vector. The meaning should be clear from the context.

• This has the flavor of a regression model, but with multivariate response, so pertaining to a set of pairs (Y_1, x_1), ..., (Y_m, x_m), where the response is a random vector. The assumption on variance is replaced with an assumption about the covariance matrix of such random vectors.

• In fact, the response need not be continuous. For example, suppose that instead the pregnant dams were sacrificed prior to giving birth and the fetuses of each examined. For dam i, the status of each of the n_i fetuses is either normal or malformed, which may be summarized as Y_ij = 0 (normal) or Y_ij = 1 (malformed) for the jth fetus from dam i. In this case, the vector Y_i for dam i would be a vector whose elements are binary, taking on only the values 0 or 1. Nonetheless, the same issues arise. We may wish to model the probability of malformation as a function of dose, taking into account that the way in which fetuses "turn out" for a given dam i will tend to be more alike (correlated) than across dams.

EXAMPLE 1.8: Pharmacokinetics of theophylline. As another example of multivariate response, we consider data from a study of the anti-asthmatic agent theophylline. Figure 1.7 shows concentration-time data for four of the 12 subjects in the study, each of whom received a single oral dose of theophylline, given in units of mg/kg, so scaled to each individual's
body weight in kg. For each subject, 10 blood samples were drawn following the dose and assayed for theophylline concentration.

[Figure 1.7: Theophylline concentration-time profiles for 4 subjects (Subjects 1, 6, 10, and 12) receiving an oral dose of theophylline at time 0, with fits of the model superimposed; concentration versus time (hr).]

In Examples 1.1 and 1.2, we considered pharmacokinetic data from a single subject, and the objective was to characterize the pharmacokinetic behavior (the concentration-time profile, underlying parameters) for that subject only. Although this is sometimes done to aid in the selection of an appropriate dosing regimen for a particular subject, it is far more common that data are obtained on several subjects, with the broader objective of understanding pharmacokinetic behavior in the entire population of subjects. In fact, the data for the single subject in Examples 1.1 and 1.2 were taken from such a study. Understanding pharmacokinetic behavior in the population means understanding how individual concentration-time profiles, and the parameters that characterize them, vary across the population of individuals.

To appreciate this, recall the usual compartmental modeling strategy discussed in Example 1.1. For an oral dose given at time 0, a standard model that appears to represent well theophylline concentration-time profiles is the one-compartment open model with first-order absorption, represented pictorially as a diagram of two compartments, the gut (absorption) and the blood (elimination). Here, X(t) is the amount of drug in the blood compartment at time t. The dose D, given at time 0, is absorbed through the gut into the blood at a fractional absorption rate of k_a and is eliminated by the kidneys at fractional elimination rate k_e. A system of differential equations with appropriate initial conditions may be written down and solved for X(t); see,
for example, Gibaldi and Perrier (1982). Dividing by V, the volume of the blood compartment, yields the following expression for the concentration of drug at time t:

    C(t) = {k_a F D / V(k_a - k_e)} {exp(-k_e t) - exp(-k_a t)},    k_e = Cl/V.    (1.4)

Now, the compartment model and the ensuing model for concentration C(t) in (1.4) pertain to individual subject behavior; that is, the model is a theoretical description of biological processes taking place over time within a given subject as that subject processes the drug. The parameter F, the "bioavailability," is usually taken to be equal to 1. The parameter Cl, the clearance rate, is a measure of the volume of blood cleared of drug per unit time and is of primary importance in understanding how the drug is eliminated from the system.

• If we were interested only in a particular subject's behavior, then, if that subject received dose D, letting t_j denote the jth observation time for that subject where concentration was measured and x_j = (D, t_j)^T, we could postulate a regression model

    E(Y_j|x_j) = {D β3 / (β2 β3 - β1)} {exp(-β1 t_j/β2) - exp(-β3 t_j)},    (1.5)

where β1 = Cl, β2 = V, β3 = k_a, and β = (β1, β2, β3)^T. As before, interest would focus on estimation of β.

• However, we are interested in the population of subjects. Indexing subjects by i = 1, ..., m, let Y_ij denote the jth concentration measurement for subject i receiving dose D_i at time 0, where the time points at which concentration is measured are t_ij, j = 1, ..., n_i (the times may possibly be different for different subjects). Letting x_ij = (D_i, t_ij)^T, we have pairs (Y_i1, x_i1), ..., (Y_in_i, x_in_i) for subject i. The responses on subject i may be collected into the vector Y_i = (Y_i1, ..., Y_in_i)^T (n_i × 1). From our previous discussion, note that including time in the covariate vector x_ij may not be entirely satisfactory, as what we have are realizations of a stochastic process for each subject i at times t_ij; but we do not dwell on this for now.

• From the reasoning above, each subject could be thought of as having a regression relationship of the form (1.5); however, as pharmacokinetic behavior is a within-individual process, it would be expected that each subject would have his or her own pharmacokinetic parameters β_i governing his or her individual behavior. We could thus think of a model for subject i depending on his or her individual-specific set of pharmacokinetic parameters β_i = (β_1i, β_2i, β_3i)^T:

    E(Y_ij|x_ij, β_i) = {D_i β_3i / (β_2i β_3i - β_1i)} {exp(-β_1i t_ij/β_2i) - exp(-β_3i t_ij)}.

• Of course, the elements of Y_i would be expected to be correlated, as concentrations from the same individual might tend to be "more alike." Moreover, as discussed in Example 1.2, the variance of observations on a given subject tends to increase with concentration level, and there is the potential for within-subject serial correlation, which is different from correlation due to differences across subjects. We defer discussion of these issues until later chapters.

Returning to the objective of the study, understanding how pharmacokinetic behavior varies in the population of subjects would clearly involve understanding how the parameters β_i vary across subjects; more precisely, understanding their distribution. Of course, these parameters cannot be observed; rather, information on them is only available through the multivariate response vectors Y_i available from each of m subjects drawn from the population. Clearly, the modeling and analysis associated with this objective is considerably more complicated than that involved in the usual, classical regression framework.

Note that the data structure in this example is similar to that in the soybean growth study, Example 1.3. In the discussion of that example, we focused on modeling and analysis for a single plot and postulated the logistic growth model as a way to represent the individual plot growth process. Of course, in this study, the real objective was to understand soybean growth patterns across the entire populations of plots planted with the two genotypes. Thus, it should be evident that, to address this objective, a similar type of model framework as that for Example 1.8 would be required.
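To make the individual-level model concrete, here is a small numerical sketch of the one-compartment mean function above. The parameter values, time points, and function name are purely illustrative assumptions; they are not taken from the theophylline study.

```python
import numpy as np

def one_compartment_oral(t, dose, cl, v, ka):
    """Mean concentration under the one-compartment open model with
    first-order absorption and bioavailability F = 1:
    C(t) = ka*D/{V*(ka - ke)} * {exp(-ke*t) - exp(-ka*t)}, with ke = Cl/V."""
    ke = cl / v  # fractional elimination rate
    return (ka * dose) / (v * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

# Hypothetical subject-specific parameters (illustrative values only)
times = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 12.0, 24.0])   # hours after dose
conc = one_compartment_oral(times, dose=4.0, cl=0.04, v=0.5, ka=1.5)
```

Because k_a > k_e here, the curve rises during absorption and then decays at rate k_e, the qualitative shape seen in Figure 1.7; a different subject would carry its own (Cl, V, k_a), which is exactly the idea of subject-specific parameters β_i.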
In these last two examples, we see that the extension of regression-type modeling to multivariate response obviously requires a greater level of complexity. Moreover, it appears that different ways of thinking may be more natural for different problems.

• In Example 1.7, on developmental toxicity, we wrote down a model to describe dose-response in the population of rats directly. This made sense: each mother rat was seen at a single dose.

• In contrast, in Example 1.8, the theophylline pharmacokinetic study, a model for individual behavior was postulated, where the population came into the model through the idea of individual-specific parameters β_i governing individual processes.

We will formalize and discuss both types of modeling strategies in later chapters. In any event, a general issue relevant in all multivariate response situations is the theme that observations from the same unit (subject, rat, plot) may be correlated, while responses taken from different units may be reasonably assumed independent.

• Thus, it is important to recognize that, although it may be tempting to simply think of all the N = Σ_{i=1}^m n_i observations altogether as a single vector Y (N × 1), it would not be appropriate to assume that all the elements of Y are independent, as would be the case if one adopted the classical regression assumptions without careful thought. That is, it is not appropriate to try to "force" situations like those in Examples 1.7 and 1.8 into the classical regression framework. In situations such as these, the assumption (3) of independence is surely violated. Indeed, the other assumptions of linearity, constant variance, and normality are also violated.

1.5 Summary and a look ahead

The preceding series of examples illustrates the issues associated with regression modeling when one moves away from the classical regression framework that is generally introduced in a first course on regression analysis. The examples emphasize that more complex and flexible models and associated inferential methods are required.

It is important to recognize that, although not emphasized in a typical first course in regression, a regression model is, in broad generality, nothing more than a postulated description of the conditional expectation E(Y|x), where Y and x are viewed as random variables/vectors. This, of course, is a function of x, so part of the "art" of regression modeling is positing a plausible functional form for this conditional expectation. A regression model may also embody further assumptions, such as the "classical" assumption that var(Y|x) does not depend on x, or even assumptions on the entire conditional distribution of Y given x. As we discuss at the beginning of the next chapter, we will take this point of view throughout. The extent to which one is willing (or not) to make such assumptions will be an important focus of this course.

We will begin our study by confining attention to univariate response problems. We will begin in Chapter 2 with an introduction to a general univariate nonlinear regression model framework that will form the basis for the material in Chapters 3-12, all of which focus on the model, its implementation, and theoretical results that form the groundwork for commonly used inferential procedures. In terms of the classical assumptions, the models and methods we will discuss accommodate violation of assumptions (1) (linearity), (2) (constant variance and independence of ε_j and x_j), and (4) (normality). We will maintain assumption (3) (independence).

In Chapters 13-15, we will take up the multivariate response case. This will also involve violation of assumption (3). The issues are much more complex than those in univariate regression modeling, as we will see. In fact, a critical issue will be the fact that, in contrast to univariate modeling, there are different approaches, of which we will discuss the two most popular. Which approach is most suitable for a given problem will depend on the context and the scientific objectives.

Working Variances

• Recall the general
mean-variance specification:

    E(Y|x) = f(x, β),    var(Y|x) = σ² g²(θ, x).

• Suppose we use the GLS scheme with the correct mean specification but a working variance specification var(Y|x) = τ² h²(γ, x).

• What do γ̂ and τ̂² estimate, and what is the consequence for β̂?

• We solve the GLS estimating equation for β, with weights based on the working variance function h, together with the corresponding estimating equations for γ and τ².

• Suppose that there exist γ_n* and τ_n*² such that the working-variance estimating equations, with mean f(x_j, β0), have expectation zero when evaluated at (γ_n*, τ_n*²). Then we can view γ̂ and τ̂² as estimators of γ_n* and τ_n*². Also, if γ_n* → γ* and τ_n*² → τ*², then consistency: γ̂ →p γ* and τ̂² →p τ*². More strongly, asymptotic normality: n^{1/2}(γ̂ - γ_n*) = O_p(1) and n^{1/2}(τ̂² - τ_n*²) = O_p(1).

Note: in general, if we fit a parametric model f_θ(x) to data from another density f_0(x), the MLE estimates the parameter values θ* that minimize the Kullback-Leibler distance from f_0 to f_θ.

• Using similar methods to those used earlier, we show that

    β̂ ~ AN(β0, σ0² (X^T U X)^{-1} (X^T U W^{-1} U X) (X^T U X)^{-1}).

• Note that this is the same asymptotic distribution as in the fixed working weights case.

• Here, as before, X is the matrix with rows f_β^T(x_1, β0), ..., f_β^T(x_n, β0); W = diag(w_1, ..., w_n), w_j = 1/g²(θ0, x_j); and, in addition, U = diag(u_1, ..., u_n), u_j = 1/h²(γ*, x_j).

Notes:

• The original folklore theorem is the special case where the working variance function h is the same as the true variance function g.

• The efficiency discussion carries over immediately from the fixed weights case.

Corrected Standard Errors

• If we know (or suspect) that the working variance specification is not the truth, we need to estimate the asymptotic variance matrix

    σ0² (X^T U X)^{-1} (X^T U W^{-1} U X) (X^T U X)^{-1}.

• X and U can be estimated by plugging in the sample estimates β̂ and γ̂.

• Also, σ0² W^{-1} = E[diag{(Y_j - f(x_j, β0))²}], so we can estimate σ0² W^{-1} by R = diag(r_1², ..., r_n²), where r_j is the unweighted residual, r_j = Y_j - f(x_j, β̂).

• Note: the diagonal entries of R are squares of single residuals, so R is a poor estimator of σ0² W^{-1}, even in large samples.

• However, n^{-1} X^T U R U X - n^{-1} σ0² X^T U W^{-1} U X = o_p(1), so X^T U R U X is a good large-sample estimator of σ0² X^T U W^{-1} U X.

• We are led to the sandwich variance estimator

    (X^T U X)^{-1} (X^T U R U X) (X^T U X)^{-1}.

Wald Inference

• The asymptotic distribution β̂ ~ AN(β0, Σ) may be used to construct confidence intervals for individual parameters or linear combinations of parameters, and hypothesis tests about individual parameters or groups of parameters.

• Inference is based on the asymptotic normal distribution of an individual parameter estimator or, to test a hypothesis such as H0: Lβ = Lβ0, the corresponding asymptotic χ² distribution of a quadratic form such as

    (β̂ - β0)^T L^T (L Σ̂ L^T)^{-1} L (β̂ - β0).

• To allow for estimation of σ², the normal distribution is often replaced by the t distribution, and the χ² distribution by the (scaled) F distribution.

• This replacement also gives the usual statistics in the special case of a linear model with known variances.

• Σ may be estimated either by assuming that the working variances are correct or by using the sandwich estimator.

• Advantage of Wald inference: extension of familiar methods.

• Disadvantages: the large-sample distribution may give a poor approximation in small samples; not invariant to reparametrization of the mean model.

Likelihood Inference

• Simplest: assume a normal distribution for Y_j, then use likelihood ratio methods for tests and profile likelihood methods for confidence intervals.

• Alternative approach, less sensitive to assumptions:

    -(1/2) Σ_{j=1}^n ŵ_j {Y_j - f(x_j, β)}²

is, up to constant terms, the log likelihood in the linear model.

• We can use it to construct confidence intervals and to test hypotheses.

Optimality of GLS

• The GLS estimator β̂ satisfies

    X^T W {Y - f(β̂)} = 0.

• Here we hold θ at its true value θ0; replacing it by its estimator does not change asymptotic distributions.

• We assume correct variance specification, so that

    β̂ ~ AN(β0, σ0² (X^T W X)^{-1}).

Consider β̃ satisfying the more general equation A^T {Y - f(β̃)} = 0. Similar arguments show that

    β̃ ~ AN(β0, σ0² (A^T X)^{-1} (A^T W^{-1} A) (X^T A)^{-1}).

Using an argument outlined before,

    (A^T X)^{-1} (A^T W^{-1} A) (X^T A)^{-1} - (X^T W X)^{-1}

can be shown to be nonnegative definite. Conclusion: the GLS estimator is asymptotically optimal in the class of linear estimating equations.

CHAPTER 9 ST 762, M. DAVIDIAN

9 The folklore theorem and optimality of GLS

9.1
Introduction

In Section 8.3, we used standard M-estimator arguments to examine the large-sample properties of estimators for β in the situation where we specify a set of constant weights to represent potentially nonconstant variance and estimate β by WLS, i.e., a linear estimating equation. This, of course, includes specifying weights all equal to one, so that the estimation procedure is that of OLS. The main insights were:

• Even if the weights are chosen incorrectly, the estimators will still be consistent. So, for example, using OLS when variance is nonconstant yields a consistent estimator for the true value of β.

• If the weights are correctly specified, i.e., chosen so that they equal w_j, where, in truth, var(Y_j|x_j) = σ² w_j^{-1}, then the resulting estimator for β is more efficient than others using incorrect weights. Thus, it is the optimal choice among all such estimators solving linear estimating equations with some specified fixed weights.

In fact, it was notable that we did not need to use any information about the distribution of Y_j beyond that of the first two (true) moments.

Of course, it would be very unusual in practice to be able to specify a set of known weights, as we have noted previously. However, as we have discussed at length, it is often feasible to adopt a variance model that provides a good representation of the form of the variance, e.g., as a function of the level of mean response. In this chapter, we will consider the model

    E(Y_j|x_j) = f(x_j, β),    var(Y_j|x_j) = σ² g²(β, θ, x_j),    (9.1)

and investigate the same issues as in Section 8.4, focusing on estimation of β via solution of linear estimating equations. In particular, we will consider estimation of β by the GLS approach, where in fact we might have to also estimate unknown variance parameters θ. The GLS scheme may be viewed as "WLS with estimated weights," so it is natural to wonder if the results in the case of fixed weights in Section 8.4 extend somehow to the estimated weights situation. To be more specific, we will investigate the following:

• Suppose we have correctly specified the form of the variance function g, and we estimate β using the three-step GLS algorithm, possibly estimating additional variance parameters θ at step (ii). What are the properties of the resulting estimator β̂_GLS?

• Suppose we specify a model of the form (9.1) with the correct model f for the conditional mean, but we misspecify the variance function, postulating instead a model that does not accurately characterize the true variance. What are the properties of the GLS estimator for β? How do they compare to those of the GLS estimator where the variance function is correctly specified?

Throughout this chapter, we will assume that the mean model in (9.1) is correctly specified; our focus will be on the variance function. Once we have addressed these issues, we will move to practical considerations of how to use the results: how to obtain approximate standard errors for the elements of the estimator for β, how to construct confidence intervals and hypothesis tests about the true value β0, and so on. Finally, in Section 9.6, we will address an outstanding motivation for the GLS approach: its role as the optimal (in a large-sample sense) estimator within the class of linear estimating equation estimators when the variance function is correctly specified. This development will in fact provide insight into optimality of more general estimation schemes.

9.2 The "folklore theorem" of GLS

The result we refer to as the "folklore theorem" establishes the asymptotic normality of the GLS estimator of β in the model (9.1) when the form of the variance function g(β, θ, x_j) is correctly specified; that is, the functional form describes accurately the form of the variance.

WHY THE "CUTE" NAME: The result we are about to demonstrate has been known for decades, shown by rigorous proof or deduced by informal arguments. The econometricians were among the first to realize the result in the 1970s, and biostatisticians determined it again in the 1980s. It has been "discovered" over and over again, to the point that now it is so well known that it has achieved the status of statistical "folklore." See Carroll and Ruppert (1988, Sections 2.2 and 7.3.1) for more.

We will state and demonstrate the result in the case where θ is unknown and estimated by a "well-behaved" estimator. We will in fact show in Chapter 12 that the estimators for θ we discussed in Chapter 6, based on transformations of absolute residuals, are all "well-behaved" under reasonable conditions in this sense.
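Before the formal development, the flavor of the three-step GLS algorithm can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not the general algorithm of the notes: the mean is linear in β, so each weighted fit is closed-form; the variance function is the power-of-the-mean form g(β, θ, x) = f^θ(x, β); and θ is estimated by regressing log absolute residuals on log fitted values, one simple transformation-of-residuals estimator. All names and simulated values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: E(Y|x) = b0 + b1*x, var(Y|x) = sigma^2 * f(x, beta)^(2*theta)
n = 500
x = rng.uniform(1.0, 10.0, n)
beta_true, theta_true, sigma = np.array([2.0, 1.5]), 0.75, 0.3
mean = beta_true[0] + beta_true[1] * x
y = mean + sigma * mean**theta_true * rng.standard_normal(n)

X = np.column_stack([np.ones(n), x])

def wls(X, y, w):
    """Solve the linear estimating equation X^T W (y - X beta) = 0."""
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

beta = wls(X, y, np.ones(n))          # step (i): OLS, ignoring nonconstant variance
for _ in range(3):                    # C iterations of steps (ii) and (iii)
    fit = X @ beta
    # step (ii): estimate theta by regressing log|residual| on log(fitted value)
    r = y - fit
    Z = np.column_stack([np.ones(n), np.log(fit)])
    theta = np.linalg.lstsq(Z, np.log(np.abs(r) + 1e-12), rcond=None)[0][1]
    # step (iii): GLS with estimated weights w_j = g^{-2} = fit^(-2*theta)
    beta = wls(X, y, fit ** (-2.0 * theta))
```

The point of the folklore theorem is that, when g is correctly specified, the estimator of β produced this way has the same large-sample behavior as WLS with the true weights, for any number of iterations.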
estimator is consistent even when the true variance is nonconstant. Although in (9.1) we do not consider fixed weights, from a theoretical standpoint we may still think of the "true" weights $w_j = g^{-2}(\beta_0,\theta_0,x_j)$; conditional on $x_j$, the value of $g$ evaluated at the true parameter values and $x_j$ may be regarded as a constant. The OLS estimator solves
\[ \sum_{j=1}^{n} \{Y_j - f(x_j,\beta)\}\, f_\beta(x_j,\beta) = 0; \]
clearly, under (9.1), $E[\{Y_j - f(x_j,\beta_0)\}\,|\,x_j] = 0$, so that it is still reasonable to assume that $\hat\beta_{OLS}$ is consistent.

Suppose that we are at the beginning of iteration $C$ and $\hat\beta_*$ is the "current" estimator for $\beta$. Then we would solve the PL estimating equation
\[ \sum_{j=1}^{n} \nu(\hat\beta_*,\theta,x_j)\left[\frac{\{Y_j - f(x_j,\hat\beta_*)\}^2}{\sigma^2 g^2(\hat\beta_*,\theta,x_j)} - 1\right] = 0 \qquad (9.2) \]
(where $\nu(\beta,\theta,x_j)$ is the $(q+1)\times 1$ vector whose first element is equal to one and whose remaining $q$ elements are the derivatives of $\log g(\beta,\theta,x_j)$ with respect to the elements of $\theta$) to obtain the updated estimator $(\hat\sigma,\hat\theta^T)^T$ at step (ii), and then solve for $\beta$ at step (iii):
\[ \sum_{j=1}^{n} g^{-2}(\hat\beta_*,\hat\theta,x_j)\{Y_j - f(x_j,\beta)\}\, f_\beta(x_j,\beta) = 0. \qquad (9.3) \]
$\hat\beta$ is the resulting estimator for iteration $C$. We could modify (9.3), replacing the $\hat\beta_*$ in the weights by $\beta$, so that solving (9.3) would be carried out by IRWLS for the fixed value $\hat\theta$; but, as we will see shortly, this will not matter. In the case $C=\infty$, $\hat\beta_* = \hat\beta$ in both (9.2) and (9.3). Thus, to check consistency of $\hat\beta$, we need to consider the entire set of equations, as $\hat\beta_*$ and $\hat\theta$ are involved in the weights and solve their own equations.

As an example, suppose $C=1$ and $\hat\beta_*$ is the OLS estimator. Then to obtain $\hat\beta$ we are really solving the entire system of equations given by
\[ \sum_{j=1}^{n} \{Y_j - f(x_j,\hat\beta_*)\}\, f_\beta(x_j,\hat\beta_*) = 0, \]
(9.2), and (9.3). For this entire system of $2p+q+1$ equations, the parameter being estimated is really $(\beta_*^T,\sigma,\theta^T,\beta^T)^T$. Note that all three equations are clearly unbiased. Thus it is reasonable to be assured that the entire process leads to consistent estimators for the true values of all elements of this extended parameter vector, and hence of the last $p$ elements, solved by the GLS equation (9.3). This perspective would of course extend to any $C$, where $\hat\beta_*$ is the previous GLS estimator. In the case $C=\infty$, $\hat\beta_* = \hat\beta$ in both (9.2) and (9.3). Based on these observations, we are willing to assume that the GLS estimator for any $C$ is consistent, as is the "previous" estimator for $\beta$ and those for the variance
parameters $\sigma$ and $\theta$.

We now focus on the estimating equation (9.3) and apply the M-estimator argument to deduce the large-sample distribution of $n^{1/2}(\hat\beta-\beta_0)$. As in previous chapters, it will again be convenient to define
\[ \epsilon_j = \frac{Y_j - f(x_j,\beta_0)}{\sigma_0\, g(\beta_0,\theta_0,x_j)}, \]
so that $E(\epsilon_j|x_j)=0$ and $\mathrm{var}(\epsilon_j|x_j)=1$.

THE FOLKLORE THEOREM: Assume that the model (9.1) is correctly specified and that $\hat\beta_*$ and $\hat\theta$ are preliminary estimators for $\beta$ and $\theta$ such that
\[ n^{1/2}(\hat\beta_*-\beta_0) = O_p(1), \qquad n^{1/2}(\hat\theta-\theta_0) = O_p(1). \qquad (9.4) \]
Under suitable regularity conditions, all GLS estimators $\hat\beta$, for any choice of $C$ including $C=\infty$, satisfy
\[ n^{1/2}(\hat\beta-\beta_0) \xrightarrow{d} N(0,\ \sigma_0^2\,\Sigma_{WLS}^{-1}), \qquad (9.5) \]
where
\[ \Sigma_{WLS} = \lim_{n\to\infty} n^{-1}\sum_{j=1}^{n} w_j\, f_\beta(x_j,\beta_0) f_\beta^T(x_j,\beta_0) = \lim_{n\to\infty} n^{-1} X^T W X, \]
$w_j = g^{-2}(\beta_0,\theta_0,x_j)$, $W = \mathrm{diag}(w_1,\dots,w_n)$, and $X = X(\beta_0)$ is the $n\times p$ matrix with rows $f_\beta^T(x_j,\beta_0)$.

Before we carry out the argument, we make a few remarks.

• The condition in (9.4) says that these quantities do not "blow up" but are well behaved for large $n$. As we discussed on page 195, such a result follows if the estimator is asymptotically normal with mean equal to the true value; thus we may regard this condition as simply stating that $\hat\beta_*$ and $\hat\theta$ are "usual" M-estimators. Actually, any estimators $\hat\beta_*$ and $\hat\theta$ satisfying (9.4) are sufficient to demonstrate the theorem. Note that if $\hat\beta_*$ is the OLS estimator, for example, the condition is satisfied, as we showed the asymptotic normality of $\hat\beta_{OLS}$ in the last chapter.

• Comparing (9.5) to (8.21), we see that the large-sample distribution of the GLS estimator has the same form as for WLS in the case where the weights $w_j$ are known and correctly specified. As we noted above, conditional on the $x_j$, the values $w_j$ are just a set of constants. Thus (9.5) seems to say that, whether we actually know the constants $w_j$ themselves or simply know the function $g(\beta,\theta,x_j)$, we end up with the same large-sample distribution. We will elaborate on this after we complete the argument.

We now carry out the argument. Consider the GLS estimating equation. By the usual Taylor series expansion, and under suitable conditions, we expand $n^{-1/2}$ times (9.3) in $(\beta^T,\hat\beta_*^T,\hat\theta^T)^T$ about $(\beta_0^T,\beta_0^T,\theta_0^T)^T$; we have elected to write the linear term in the expansion as three separate pieces rather than keeping it stacked, for reasons that shall become obvious momentarily. Using the definitions of $\epsilon_j$, $w_j$, and $X$ given previously, we may write the result succinctly as
\[ 0 \approx C_n + (A_{n1} - A_{n2})\, n^{1/2}(\hat\beta-\beta_0) + D_n\, n^{1/2}(\hat\beta_*-\beta_0) + E_n\, n^{1/2}(\hat\theta-\theta_0). \qquad (9.6) \]
Here,
\[ C_n = \sigma_0\, n^{-1/2} \sum_{j=1}^{n} w_j^{1/2}\, f_\beta(x_j,\beta_0)\,\epsilon_j, \]
\[ A_{n1} = \sigma_0\, n^{-1} \sum_{j=1}^{n} w_j^{1/2}\, f_{\beta\beta}(x_j,\beta_0)\,\epsilon_j, \qquad A_{n2} = n^{-1} \sum_{j=1}^{n} w_j\, f_\beta(x_j,\beta_0)\, f_\beta^T(x_j,\beta_0), \]
\[ D_n = -2\sigma_0\, n^{-1} \sum_{j=1}^{n} w_j\, f_\beta(x_j,\beta_0)\, g_\beta^T(\beta_0,\theta_0,x_j)\,\epsilon_j, \qquad E_n = -2\sigma_0\, n^{-1} \sum_{j=1}^{n} w_j\, f_\beta(x_j,\beta_0)\, g_\theta^T(\beta_0,\theta_0,x_j)\,\epsilon_j, \]
where $f_{\beta\beta}$ is the $p\times p$ matrix of second partial derivatives of $f$ with respect to $\beta$, and $g_\beta$ and $g_\theta$ are the vectors of partial derivatives of $g$ with respect to $\beta$ and $\theta$.

We now deduce the behavior of each of these terms. As $E(\epsilon_j|x_j)=0$, clearly, by the weak law of large numbers, all of $A_{n1}$, $D_n$, and $E_n$ converge in probability to zero; we will discuss the consequences of this shortly. From these results, we may rewrite (9.6) as
\[ n^{1/2}(\hat\beta-\beta_0) \approx A_{n2}^{-1} C_n. \]
Now, conditional on the $x_j$, the term $A_{n2}$ depends only on constants. By the assumptions of the theorem, this term satisfies
\[ A_{n2} \to \Sigma_{WLS}, \qquad (9.7) \]
and, combining the results for $A_{n1}$ and $A_{n2}$, we have $A_{n2}-A_{n1} \xrightarrow{p} \Sigma_{WLS}$. Moreover, we may apply the multivariate central limit theorem to $C_n$: clearly $E\{w_j^{1/2} f_\beta(x_j,\beta_0)\,\epsilon_j\,|\,x_j\} = 0$ and
\[ \mathrm{var}\{w_j^{1/2} f_\beta(x_j,\beta_0)\,\epsilon_j\,|\,x_j\} = w_j\, f_\beta(x_j,\beta_0)\, f_\beta^T(x_j,\beta_0) \]
(recall that $w_j$ is a function of $x_j$). Thus
\[ C_n \xrightarrow{d} N(0,\ \sigma_0^2\,\Sigma_{WLS}). \]
Applying these results with (9.7) and using Slutsky's theorem, we thus may conclude that
\[ n^{1/2}(\hat\beta-\beta_0) \xrightarrow{d} N(0,\ \sigma_0^2\,\Sigma_{WLS}^{-1}), \]
as claimed in (9.5).

REMARKS: Carrying out the argument explicitly, rather than just "plugging in" to the generic M-estimator calculations, allows us to make several important observations.

• As we noted previously, the result here is identical to that in the case of WLS with the weights $w_j$ regarded as known constants that have been correctly specified. The
implication is that, as long as we specify the functional form of the weights through the variance function $g$ correctly and estimate the unknown $\beta$ and $\theta$ that fully characterize them, conditional on the $x_j$ we can estimate $\beta_0$ using GLS as well as if we in fact knew the weights $w_j = g^{-2}(\beta_0,\theta_0,x_j)$. That is, having to estimate the weights, by substituting the current estimates $\hat\beta_*$ and $\hat\theta$ rather than knowing them, exacts no penalty in the sense of large-sample precision.

• In the argument, the terms $D_n$ and $E_n$, corresponding to the effects of $\hat\beta_*$ and $\hat\theta$ respectively, are $o_p(1)$. This implies that the estimators we substitute for $\beta$ and $\theta$ in the weights play no role in determining the large-sample properties of the resulting GLS estimator $\hat\beta$, and leads to the "no penalty" phenomenon noted above. This feature is the main "folklore" message: the large-sample precision is unaffected not only by the need to estimate the parameters in the weights but also by how these parameters are estimated, as long as they are estimated "sensibly."

• In the case $C=\infty$ ($\hat\beta_* = \hat\beta$), the term $D_n$ also corresponds to a contribution from estimation of $\hat\beta$ in (9.6), but it is negligible, as $D_n = o_p(1)$. This shows that, even in the case where we iterate the GLS algorithm to convergence, the properties of $\hat\beta$ are unaffected by the appearance of $\hat\beta$ itself in the weights.

• In fact, the result holds for any $C$. Nowhere in the argument does $C$ appear; all we required was that $\hat\beta_*$ be the current estimate and $\hat\theta$ be the estimator for $\theta$ based on it. The important implication is that all GLS estimators (any $C$) have the same large-sample distribution. Thus the theory provides no insight into whether it is necessary to iterate to convergence ($C=\infty$) or how large $C$ should be.

• Because the effect of $\hat\theta$ also is negligible, as $E_n = o_p(1)$, the result also implies that how one estimates $\theta$ does not matter in determining the properties of the resulting GLS estimator. Thus, whether we use PL or some other approach (e.g., transformation of absolute residuals), as long as the estimator for $\theta$ satisfies (9.4), the properties of $\hat\beta$
are unaltered.

• Of course, it is important to recognize that the folklore theorem is a large-sample, approximate result. Intuition suggests that the implications, although theoretically interesting, might be a bit optimistic in practice, i.e., for small sample sizes. That is, whether $\beta$ and $\theta$ in the weights are known or estimated might in fact have some effect on the properties of the resulting $\hat\beta$ in sample sizes that are not sufficiently large for the large-sample result to have "kicked in." As we will discuss later, this is often the case. In Chapter 11 we will offer some more refined theoretical arguments that support this observation.

STANDARD ERRORS FOR THE COMPONENTS OF $\hat\beta$: As we have discussed, a main by-product of a large-sample distributional result like (9.5) is a way to construct approximate estimates of uncertainty. We may express the folklore result as
\[ \hat\beta\ \dot\sim\ N\big[\beta_0,\ \sigma_0^2\{X^T(\beta_0)\, W(\beta_0,\theta_0)\, X(\beta_0)\}^{-1}\big], \qquad (9.8) \]
where $X(\beta)$ is defined as before and $W(\beta,\theta) = \mathrm{diag}\{g^{-2}(\beta,\theta,x_1),\dots,g^{-2}(\beta,\theta,x_n)\}$. The covariance matrix in (9.8) may also be written as
\[ \sigma_0^2 \left\{ \sum_{j=1}^{n} g^{-2}(\beta_0,\theta_0,x_j)\, f_\beta(x_j,\beta_0)\, f_\beta^T(x_j,\beta_0) \right\}^{-1}. \]
From now on, and in subsequent chapters, we will move between writing this type of matrix in matrix or summation notation without comment. As $\beta_0$, $\theta_0$, and $\sigma_0$ are unknown, it is natural to substitute estimated values. For $\sigma_0^2$, the obvious estimator, given $\hat\beta$ and $\hat\theta$ from the final iteration of GLS, is the "bias-corrected" estimator
\[ \hat\sigma^2 = (n-p)^{-1} \sum_{j=1}^{n} g^{-2}(\hat\beta,\hat\theta,x_j)\{Y_j - f(x_j,\hat\beta)\}^2. \]
Thus, to obtain estimated approximate standard errors in practice for the components of $\hat\beta$, one would take the square roots of the diagonal elements of the matrix
\[ \hat\sigma^2 \{X^T(\hat\beta)\, W(\hat\beta,\hat\theta)\, X(\hat\beta)\}^{-1}. \qquad (9.9) \]
The standard errors in the output in Sections 3.7 and 6.8 are in fact derived from formulae like (9.9). The SAS proc nlin and R/S-Plus nls software provide standard error estimates in the case where known weights are involved, using the asymptotic covariance matrix for WLS given in (8.20). As the form of this covariance matrix and the one for GLS is the same, and the programs have been given estimated weights
computed at the final values of the estimators, they simply use these estimated weights as if they were known to compute the standard errors.

In the case of fixed $C$, the formula (9.9) is used in a slightly different form; in particular, as $\hat\beta_*$ in the weights would be treated as fixed at step (iii), the standard errors emerging from the final call to the nonlinear regression program would have $\hat\beta_*$, rather than the final estimate $\hat\beta$, in the weight matrix $W$. Presumably, if $\hat\beta_*$ and $\hat\beta$ are similar, the standard errors calculated this way should be very close to those that would be obtained if $\hat\beta$ were used instead, so most people do not bother to update them. In the case $C=\infty$, of course, at convergence $\hat\beta_* = \hat\beta$, and the standard errors from the final invocation of step (iii) would be effectively derived from (9.9). In the case of IRWLS in SAS proc nlin, where $\beta$ in the weights is estimated with a (possibly estimated) value for $\theta$ treated as fixed, the standard errors are calculated after the estimated value $\hat\beta$ is determined, so again (9.9) is used as is, with $\hat\theta$ equal to the current estimate.

As noted above, standard errors based on the folklore theorem may sometimes be "optimistic"; that is, they may be smaller than the true sampling variation (this has been gauged through simulation). This is because the uncertainty due to estimation of $\beta$ and $\theta$ in the weights is not taken into account; the folklore theorem says this uncertainty is negligible, but in finite samples it may not be. Thus it is usually advisable, in problems where the sample size is not large, to interpret these standard errors with caution.

9.3 Misspecification of the variance function and GLS

In Section 8.3 we saw that, in the case where the weights used in estimation of $\beta$ are chosen to be a set of fixed constants, incorrect choice of these constants does not lead to an inconsistent estimator for $\beta$, as the estimating equation is still unbiased. However, such an incorrect choice will lead to potentially inefficient estimation relative to the
precision that could be achieved using the correct weights. In the GLS approach, the weights are dictated by the choice of variance function. Thus, by analogy, it is natural to consider what happens in the event that the variance function has not been correctly specified, which would presumably lead to incorrect estimated weights at each $x_j$. An obvious complication when the variance function is not correctly specified is that the incorrect model may depend on parameters for which values may not be specified, so that these parameters must be estimated. Because these parameters appear in a model that does not characterize the truth, it is not even clear what they represent, or what is being estimated when the incorrect model is fitted, for example, by the PL approach.

To formalize, suppose that in truth the data follow the mean-variance model
\[ E(Y_j|x_j) = f(x_j,\beta), \qquad \mathrm{var}(Y_j|x_j) = \sigma^2 g^2(\beta,\theta,x_j), \qquad (9.10) \]
but we incorrectly specify the model as
\[ E(Y_j|x_j) = f(x_j,\beta), \qquad \mathrm{var}(Y_j|x_j) = \tau^2 h^2(\beta,\gamma,x_j). \qquad (9.11) \]
Thus, although the data come from a model of the form (9.10), we fit a model where the variance function dictates some relationship other than the true one, $g$. In (9.11) the parameters are $\tau$ and $\gamma$, where $\tau$ is a scale parameter. In the correct model (9.10), $\sigma$ and $\theta$ represent parameters that fully and correctly characterize the variance, with true values $\sigma_0$ and $\theta_0$ at which evaluation of the variance model (along with $\beta_0$) yields exactly the value of the conditional variance at $x_j$. The parameters $\tau$ and $\gamma$, on the other hand, do not have "true values" in the sense that the variance model in (9.11), evaluated at these values, will give precisely the true variance at any $x_j$. It may well be that $\gamma$ is of a different dimension, $r$ say, from that of $\theta$. In fact, it is not even clear how $\beta$ in the mean model enters into the picture.

Suppose we have available under these conditions an estimator $\hat\beta_*$ satisfying the condition (9.4), $n^{1/2}(\hat\beta_*-\beta_0) = O_p(1)$, where $\beta_0$ is the true value of $\beta$ in (9.10). We know immediately of one such estimator,
the OLS estimator. Thus, suppose we estimate $\beta$ in (9.11) by OLS to obtain $\hat\beta_*$; as this does not involve the incorrect variance function, we know that this estimator will be consistent. Then, unknowingly taking the misspecified variance model in (9.11) as correct, suppose at step (ii) of the GLS algorithm we decide to estimate $\tau$ and $\gamma$ by the PL approach. From Chapter 6, we would solve for $\hat\tau$ and $\hat\gamma$ in
\[ \sum_{j=1}^{n} \nu_\gamma(\hat\beta_*,\tau,\gamma,x_j)\left[\frac{\{Y_j - f(x_j,\hat\beta_*)\}^2}{\tau^2 h^2(\hat\beta_*,\gamma,x_j)} - 1\right] = 0, \qquad (9.12) \]
where $\nu_\gamma(\beta,\tau,\gamma,x_j)$ is the $(r+1)\times 1$ vector whose first element is equal to one and whose remaining $r$ elements are the derivatives of $\log h(\beta,\gamma,x_j)$ with respect to the elements of $\gamma$.

What does solving (9.12) yield? To gain insight, consider the general situation of M-estimation.
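Before turning to the general theory, a small numerical sketch of a (9.12)-type fit may make the setup concrete. Everything here is invented for illustration and is not from the notes: a linear mean $f(x,\beta)=\beta_0+\beta_1 x$, a true power-of-the-mean variance $\sigma_0^2 f^{2\theta_0}$, and a deliberately wrong working variance $\tau^2 h^2(\gamma,x)$ with $h(\gamma,x)=e^{\gamma x}$ (a special case of $h$ with no $\beta$-dependence). We estimate $\beta$ by OLS, maximize the normal pseudolikelihood in $(\tau,\gamma)$ by profiling out $\tau^2$, and then take one GLS step with the resulting (wrong) weights.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)

# Hypothetical data-generating model (illustrative values, not from the notes):
# true mean f = beta0 + beta1*x, true variance sigma0^2 * f^(2*theta0).
n = 5000
beta0_true, beta1_true, sigma0, theta0 = 2.0, 1.0, 0.3, 0.5
x = rng.uniform(0.5, 3.0, n)
f_true = beta0_true + beta1_true * x
y = f_true + sigma0 * f_true**theta0 * rng.standard_normal(n)

# Step (i): OLS for beta (consistent even under nonconstant variance).
X = np.column_stack([np.ones(n), x])
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ beta_ols

# Step (ii): PL for (tau, gamma) under the *wrong* model h = exp(gamma*x).
# For fixed gamma, the PL maximizer is tau^2(gamma) = mean(r^2 / h^2), so we
# minimize the profiled normal negative log-pseudolikelihood in gamma alone.
def neg_pl(gamma):
    h2 = np.exp(2.0 * gamma * x)
    tau2 = np.mean(r**2 / h2)
    return 0.5 * np.sum(np.log(tau2 * h2) + r**2 / (tau2 * h2))

gamma_hat = minimize_scalar(neg_pl, bounds=(-2.0, 2.0), method="bounded").x
tau2_hat = np.mean(r**2 / np.exp(2.0 * gamma_hat * x))

# Step (iii): one GLS update with the (wrong) estimated weights 1/h^2.
w = 1.0 / np.exp(2.0 * gamma_hat * x)
WX = X * w[:, None]
beta_gls = np.linalg.solve(X.T @ WX, WX.T @ y)
print(gamma_hat, tau2_hat, beta_gls)
```

Running sketches like this suggests what the theory below confirms: $\hat\gamma$ settles down to a stable value even though no "true" $\gamma$ exists, and the GLS update for $\beta$ stays close to the truth despite the wrong weights.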
Now is some function of the random vector Zj Usually there exists a value 7 such that V L ZENmam o 914 j1 where expectation here is still with respect to the true distribution of Zj By analogy to 913 it turns out that under some conditions if 914 holds solving the incorrect estimating equation will result in an estimator 4 that satis es 7 PAGE 221 CHAPTER 9 ST 762 M DAVDDIAN Of course 7 does not necessarily have any meaning as representing a quantity that pertains to the true mechanism generating the data However it is a xed quantity dictated by 914 Thus it is often said perhaps misleadingly that 7 is consistent for 7 see page 188 One may in fact go on to pursue an argument exactly like that in Section 82 to establish that n127 7 7 converges in distribution to a mean zero multivariate normal random vector with a covariance matrix that may be derived It thus follows that M WVVOAU TERMINOLOG Y The value 7 may be thought of as the value that tries to get closest to representing the truth within the con nes of a misspeci ed model It has consequently sometimes been called the least false parameter The important conceptual point is that even with a misspeci ed model if we estimate a parameter in the model we may still deduce the behavior of the estimator even if the parameter has no real meaning Now consider the particular situation of 912 We may interpret this as a case where the mean 72h2 7 11739 of the response inf1j 62 has been misspeci ed and of course some weighting is also taking place Now if El is a consistent estimator for the true value of 60 eg OLS and held xed when the solution is found then we would expect that the same conclusions in the generic case of M estimation with a misspeci ed model above Clearly it is no longer the case that the expectation of response 7 mean conditional on 11739 under the true variance model 910 at some value of 7 and 7 is zero even if the correct value 60 were substituted It is usually said that such an estimating equation is 
biased However it is not far fetched to think that there are values 7 and 7 that make things average out to zero over 71 conditional on all n 1117 Viewing the problem as one of M estimation with a misspeci ed model then we may conclude that solving the biased estimating equation 912 will yield estimators such that n127 7 7 Op1 and M WVVOAD am for some values 7 and 7 see page 195 Of course we have used PL estimation as an example here It should be clear that the same sort of argument would apply in the case of other estimators for variance parameters We will discuss the properties of the general class of variance parameter estimators in Chapter 12 Now return to considering the GLS algorithm Suppose that we solve for i in step iii the GLS equation h2 7 7 111M 7 H1172 3f 72 f3 0 916 7L 11 PAGE 222 CHAPTER 9 ST 762 M DAVDDIAN As in the previous section and using the above discussion if we consider solving both the OLS equation the PL equation 912 and the GLS equation 916 we have that El L 80 the true value of 6 in 910 and 4 L 7 Thus if we focus on 916 evaluated at these values we have Eh 2 o77i HIM fj7 of j7 olj 07 so that we may conclude that 916 is indeed an unbiased estimating equation Thus despite the use of the wrong variance function we still expect the GLS estimator 6 solving 916 to be consistent for 60 In fact it should be clear that repeating the process with El as a consistent GLS estimator using the incorrect model will yield an update at step iii that is also consistent Thus heuristically solving the GLS equation 916 with the wrong variance function should result in a consistent estimator for 6 for any C Assuming consistency of 6 711265 7 80 Op1 and 71120 7 7 Op1 as in 915 we have AT T the same situation as in the folklore theorem We may thus expand 916 in 6 4T6 T about 35 7 OTT as in that argument The steps are identical so we do not repeat them all here De ne E YJ 1137350 1 7 Tow507907131 note that this is with respect to the true mean and variance 
functions dictated by the correct model 910 so that EEjl1j 0 and var6jl1j 1 where of course expectation is with respect to the true conditional distributions of the Write as before wj g 2 0 001117 and let uj h 2 07 1117 Then the expansion yields 0 3 C A21 AZZWIZb quot10 Dim2 quot10 EZnIZW 6 917 where 7L 0 Hon lZ Z ijglZf Ww 5067397 j1 7L 7L A21 7071 1 wa lZf j o w A22 i771 wa wjv o g jv 30 j1 j1 32 D 400771 19 f ltillj7 ogth lt o i 1131M j1 71 E 2907171 ZugZf ah 0h35077715j7 j1 and h and h represent the vectors of partial derivatives of h with respect to 6 and 7 PAGE 223 CHAPTER 9 ST 762 M DAVDDIAN As in the argument for the folklore result A21 E2 and Di all converge in probability to zero that this last term is negligible is especially interesting as it shows that there is no effect of the wrong estimator77 4 for the incorrect variance modell Rewriting 917 as nlZt e AZEICW n7 letting X X60 and de ning as in Section 83 U diagu1 un we have that 142 a 7A A lim n lXTUX WK 0 LNmJgB B nliango n lXTUW lUX Combining we obtain that WW3 7 e A MO 0314419144 so that 9 NW0 03XTUX 1XTUW lUXXTUX 1 918 We may compare the result in 918 to that obtained in Section 83 in the case of known weights In particular note that 918 is identical to the result in 815 when the wj and 14 were treated as known constants Thus just as in the folklore result we obtain the interesting conclusion that even if we estimate weights by substituting parameter estimators rather than knowing their values we will obtain the same large sample distribution for the estimator for 6 here this is seen to hold in the case of an incorrect modelincorrect constants 0 Note in fact that the folklore theorem may be regarded as just a special case of this general result where h and g are the same o It should be obvious that taking the function h to be identically equal to 1 for all j would thus yield the large sample properties of the OLS estimator when the variance is really nonconstant which we have already derived Here 
of course there would be no 6 or 397 to be estimated in weights 0 The ef ciency comparisons carried out in Section 84 in the case of xed weights thus carry over unaltered to the setting of estimation of variance functionsl IMPLICATION Using an incorrect variance function will not affect the consistency of the GLS estima tor for 6 but it will affect the ef ciency of the resulting estimator Using an incorrect variance function may result in a less precise estimator for 6 than if the correct function is used PAGE 224 CHAPTER 9 ST 762 M DAVDDIAN o The same issues discussed on page 217 carry over to this more general setting How one estimates weights even using an incorrect model does not play a role in the largesample properties of the GLS estimator for 6 in a correctly speci ed mean model Of course this may be optimistic in nite samples A version of this result is discussed in the case of multivariate response by Liang and Zeger 1986 and it is often attributed more generally to these authors although it has been known for considerably longer 94 Correction of standard errors The results of the previous section indicate that if we have incorrectly speci ed the variance model the usual formula for obtaining approximate standard errors for the elements of the GLS estimator ii is not appropriate This could potentially result in erroneous inferences a misleading assessment of the precision of i may result and test and con dence intervals to be discussed next in Section 95 could be compromised However although we may be concerned that we have selected an incorrect variance model the result in 918 is not really helpful 7 to use this result we must know the true variance function in order to deduce the values wj appearing in the middle piece of the asymptotic covariance matrix Of course if we knew this we would have used it for estimation of 6 Although the result is not immediately useful it does give insight into how one might protect against a potentially misspeci ed 
variance model when calculating standard error estimates IDEA To formalize this consider again the situation of the last section where the true mean variance model is at 6 and varY7l1j 0292 0 217 as in 910 but we have misspeci ed the variance model as in 911 instead assuming varle1j 72h2 7 111739 If we specify this incorrect model and are unaware that we have done so then we would presume that the folklore result holds with the weight matrix dictated by the variance model h Thus we would conclude that the GLS estimator based on this wrong model satis es 9 N Nl m21XTEUE7 7XETll here we have substituted the estimated values into the matrices X and U as would be the case for obtaining estimated standard errors in practice PAGE 225 CHAPTER 9 ST 762 M DAVDDIAN As this result is based on the incorrect assumption that h is the correct variance function the standard errors so obtained may be misleading Under these circumstances to obtain standard errors that give an accurate assessment of uncertainty we would rather base them on the result in 918 As noted above we cannot do this directly but we can do something close The covariance matrix we would like to estimate is from 918 XTUX 1agXTUW lUXXTUX 1 919 From above we can estimate the two end pieces by simply substituting the estimators for 6 and 397 from the t of the incorrect model The middle piece is the troublesome one Note that n a n leUW lUX 7 M1 Zuimwj ofaTiIIj7 ogt0392lt 0 00 m j1 as wj g 260 001117 We do not know the wj however we do know that in truth EH19 7 m 30 1227 0392307 so 113739 Thus by the weak law of large numbers we know that n n 7 p n 1Zuimwmov wj ogt197fltwj ogt27n 1214ij ofgjy 00392 o7907119a 0 j1 j1 suggesting that we could substitute the squared deviations 7 f1j 602 and get something close to the middle term Of course we do not know these deviations but we do have a consistent estimator for 60 with which we may estimate them Letting Ti Y1 7 111733 denote the unweighted GLS residual the obvious 
suggestion is thus to estimate the middle matrix in 919 by XT U IRUEWXE7 R 7 diam 7763 It may be shown that in fact n n 7171 7174053377 illjf j7 Ef e7 3ng n71 71 137397 o gww 00392507 907 1137 i 0 a a by expanding the rst term about i 60 and 4 7 and using the facts that i 7 60 Opn 12 and 4 7 7 Opn 12 and the weak law of large numbers PAGE 226 CHAPTER 9 ST 762 M DAVDDIAN RESULT To protect against possible misspeci cation of the variance model in GLS estimation of 6 it is suggested to derive standard errors for the elements of 0 by the square roots of the diagonal elements of the estimated covariance matrix XT U 7WX 1XT U WRUtfiWX XT U WXi 3 1 920 0 Even if it has been derived using an incorrect variance model 0 is still consistent Thus it is reasonable to use as an estimator for 6 recognizing that it is inef cient Calculation of standard errors using 920 is an attempt to ensure the assessments of precision are correct This idea is a special case of a general technique that has many different names In the particular case of h E 1 so that we use OLS when the variance may in fact be nonconstant U I In the econometrics literature the estimated covariance matrix 920 under this condition has been called the 7 covariance matrix This is because it is itself a consistent estimator of the true covariance matrix of 0 given in 919 with U I in the case where the true variance may be nonconstant heteroscedasticity In fact in situations where one s main interest is to estimate 6 and obtain realistic standard errors and where modeling the variance could be quite complicated this approach has been advocated for simplicity just estimate 6 by OLS and x up the standard errors The obvious drawback is that the OLS estimator may be very inef cient o More recently 920 has been called the robust sandwich estimator of the sampling covariance matrix of 0 this term is usually attributed to Liang and Zeger 1986 The term robust refers to the hope that as an estimator of sampling variation 920 is 
insensitive to misspeci cation of the variance model The term sandwich refers to the form of the estimator a correction term A sandwiched between two copies of the covariance matrix one would naively use if one believed the variance model were correct See also Moore and Tsiatis 1991 for an example in the univariate case WARNING Several authors have expressed concerns over the practical performance of this approach Most of the discussion has been with respect to the multivariate generalization of the approach which we will discuss in Chapter 14 but the implications are similar 0 920 may be shown to be a consistent estimator of the true covariance matrix of the estimator PAGE 227 CHAPTER 9 ST 762 M DAVDDIAN 95 However several authors have reported that in nite samples it may produce rather unreliable estimates of the true sampling variation as deduced by simulation studies See for example Rotnitzky and Jewell 1990 Part of this may be due to the estimator s apparent sensitivity to unusual observations In particular using the squared residuals r72 as essentially a proxy for the true variance at 211739 thus basing this on a single data value may be sensitive to an outlying value of Y7 and this effect could be noticeable in small samples Thus robustness to an incorrect variance model could be offset by lack of robustness to unusual data values Some authors have suggested that in some circumstances one might be better off just using the usual folklore result and not attempt to correct for possible misspeci cation of variance Alternatively other authors advocate always using the correction for protection as it may be optimistic to expect that one has modeled variance perfectly In the case of univariate response that we have been discussing it is actually quite reasonable to expect to be able to model variance well so it is routine not to use the sandwich correction However in the case of multivariate response which we will discuss in Chapter 14 the reasons for using the 
sandwich correction are much more compelling as we will discuss Recently there have been attempts to improve the estimator via small sample corrections see for example Mancl and DeRouen 2001 Inference for Once 6 has been estimated and standard errors obtained we may wish to construct con dence intervals for 60 carry out hypothesis tests and so on Here we discuss several approaches all based on large sample theory approximations of the type used in the folklore theorem Throughout this section we assume that the variance model has been correctly speci ed unless otherwise indicated so that in truth the model varY7l1j 0292 0 217 is correct WALD INFERENC E A natural and easy approach to these objectives is to use the usual Z statistic method that is applied in situations where exact nite sample results are available From the folklore theorem we have ia we amp2XTltiagtWltia mmw i n eprl i 9 mm 7 mama 71 921 92 922 PAGE 228 CHAPTER 9 ST 762 M DAVDDIAN To form con dence intervals and test statistics regarding one of the elements of 6 Bk say k 1 p a scalar let be the square root of the kth diagonal element of the estimated covariance matrix in 921 o The usual symmetric equal tailed con dence interval for the true value of 3 with con dence coef cient 1001 7 a is t i Ca25E3k7 where caZ is a critical value from a symmetric distribution Following the folklore result caZ is often chosen to be the appropriate quantile of the standard normal distribution caZ lt1gt 11 7 042 21704239 Recall that it has been observed that inferences based on the folklore result can be optimistic in practical nite sample situations Thus it is common to replace the standard normal critical values by something else By analogy to the classical linear regression case a routine approach is to instead choose caZ tn uw where 25 1042 is the quantile of the 25 distribution with n71 717 degrees of freedom with area 17042 to the left The rationale is the same as in the classical case to account for the degrees 
of freedom lost in estimating However it is customary to make no attempt to take into account the fact that 6 and 0 have also been replaced by estimates Test statistics would be constructed in the analogous way Eg to test H0 Bk 310 vs a one or two sided alternative the test statistic 7 k0SEBk would be compared to the relevant standard normal or t critical value 0 Programs such as SAS proc nlin and RSplus nls when used at step iii of the GLS algo rithm print out such Wald statistics for each component of 6 along with a p value based on the normal or t critical values See the discussion on page 218 for more on how the standard errors are calculated For more complicated questions of interest the above extends in the obvious way Suppose we are interested in a con dence region for or test concerning a subset of the elements of 6 Partitioning 6 as 8 31 1ltrx1gt 2ltp7m1gt 52 suppose we are interested in 62 we can of course always reorder the elements of 6 to group those of interest together Similarly partition the estimated covariance matrix of i as 11 12 3 A2XTEW37 A 121 A22 PAGE 229 CHAPTER 9 ST 762 M DAVDDIAN Thus 32 IN NWZO 322 p 7 r x 1 and it follows that 32 7 20T22 1 2 7 ago 39 x24 This approximation may be used as the basis for con dence regions and hypothesis tests in the usual way More generally suppose we are interested in a linear contrast L of the elements of 6 where L is r Xp and of full rank Then treating the largesample results as exact Lia amp NL 0 LSDLT and it follows that 5 50TLTL LT71LE 50 amp X3 ADVANTAGES OF WALD INFERENCE o Wald inference is often the default choice because it is easy to implement and is based on familiar ideas used in simpler problems where exact nitesample results are available 0 Another advantage is that it is straightforward to replace the folklore standard error estimates by the robust sandwich77 versions to protect against misspeci cation of the variance model DISAD VANTA GES OF WALD INFERENC E The large sample distributional 
results may be unreliable in practice. For instance, the asymptotic standard normal or chi-square approximations above may be poor in small samples, resulting in erroneous conclusions.

- Wald inference is not invariant to reparameterization of the model. Thus, if we reparameterize the mean model and then attempt to make inference on the same feature in the new parameterization, we may be led to different conclusions.

In general, the inadequacies of Wald inference are widely recognized; however, because of ease of implementation and ready availability in the output of common software packages, it is widely used.

"LIKELIHOOD-BASED" APPROACHES: There are a number of alternative approaches to constructing confidence intervals and hypothesis tests that are meant to circumvent some of the disadvantages of Wald inference, at the expense of increased complexity. We do not attempt to demonstrate all of these here; rather, we give a flavor for the types of approaches that have been suggested.

For definiteness, we will consider again the problem where we may partition $\beta$ as $\beta = (\beta_1^T, \beta_2^T)^T$, $\beta_1 \ (r \times 1)$, $\beta_2 \ (p - r \times 1)$. Suppose we are interested in testing $H_0: \beta_2 = \beta_{20}$ for some specified value $\beta_{20}$.

- We may wish to compare two nested generalized linear models, for example, where $\beta_2$ contains the coefficients of the linear predictor corresponding to a group of covariates whose joint importance is to be assessed. In this case $\beta_{20} = 0$.
- For nonlinear models where the mean function is dictated by theoretical considerations, $\beta_2$ may correspond to certain physical parameters, and we may be interested in whether there is evidence that these differ from default values that may be different from zero.

We first consider a normal "likelihood ratio test" based on pretending that the GLS weights are fixed. Recall that, when the weights $w_j$ are known, solving the GLS equation

$$\sum_{j=1}^n w_j \{Y_j - f(x_j, \beta)\}\, f_\beta(x_j, \beta) = 0$$

corresponds exactly to maximum likelihood estimation of $\beta$ under the assumption that $Y_j$ is normal with $\text{var}(Y_j | x_j) = \sigma^2 w_j^{-1}$ for each $j$. The idea is
to use the normal likelihood as the basis for a test, even if the data themselves are not really normally distributed. We now show by a heuristic argument that the "likelihood ratio test" statistic derived from these considerations has a large-sample chi-square distribution regardless of whether normality holds.

Ignoring $\sigma^2$, the "important part" of the normal loglikelihood is, as a function of $\beta$ and holding the weights fixed,

$$L(\beta) = -(1/2)\sum_{j=1}^n w_j \{Y_j - f(x_j, \beta)\}^2,$$

where the weights $w_j = g^{-2}(\cdot, \cdot, x_j)$ are evaluated at preliminary estimates and held fixed; maximizing $L(\beta)$ in $\beta$ for fixed weights gives the usual GLS estimating equation.

Suppose that $\beta_0$ is the true value of $\beta$ under $H_0$. Expanding $L(\beta_0)$ about $L(\hat{\beta})$ yields

$$n^{-1}L(\beta_0) = n^{-1}L(\hat{\beta}) + n^{-1}L_\beta^T(\hat{\beta})(\beta_0 - \hat{\beta}) + (1/2)(\beta_0 - \hat{\beta})^T n^{-1}L_{\beta\beta}(\beta^*)(\beta_0 - \hat{\beta}),$$

where $\beta^*$ lies between $\beta_0$ and $\hat{\beta}$. Now $L_\beta(\hat{\beta}) = 0$, as it corresponds to the equation that we solve to obtain $\hat{\beta}$ for fixed weights. Moreover,

$$n^{-1}L_{\beta\beta}(\beta^*) = -n^{-1}\sum_{j=1}^n w_j f_\beta(x_j, \beta^*) f_\beta^T(x_j, \beta^*) + n^{-1}\sum_{j=1}^n w_j \{Y_j - f(x_j, \beta^*)\} f_{\beta\beta}(x_j, \beta^*).$$

The second term can be disregarded as negligible by the weak law of large numbers, as $\beta^* \approx \beta_0$ under $H_0$, so that the term has mean (approximately) zero. Combining all of this, we have

$$2\{L(\hat{\beta}) - L(\beta_0)\} \approx n^{1/2}(\hat{\beta} - \beta_0)^T \Big\{ n^{-1}\sum_{j=1}^n w_j f_\beta(x_j, \beta^*) f_\beta^T(x_j, \beta^*) \Big\} n^{1/2}(\hat{\beta} - \beta_0).$$

Now, from the argument leading to the folklore theorem, under $H_0$ with $\beta_0$ the true value,

$$n^{1/2}(\hat{\beta} - \beta_0) \approx \sigma_0 \Sigma_{WLS}\, n^{-1/2}\sum_{j=1}^n w_j^{1/2} f_\beta(x_j, \beta_0)\, \epsilon_j = \sigma_0 \Sigma_{WLS} C_0,$$

say, where $\epsilon_j = w_j^{1/2}\{Y_j - f(x_j, \beta_0)\}/\sigma_0$. Thus we have

$$\frac{2\{L(\hat{\beta}) - L(\beta_0)\}}{\sigma_0^2} \approx C_0^T \Sigma_{WLS} \Big\{ n^{-1}\sum_{j=1}^n w_j f_\beta(x_j, \beta^*) f_\beta^T(x_j, \beta^*) \Big\} \Sigma_{WLS} C_0.$$

Because $\hat{\beta}$ and $\hat{\theta}$ are consistent for $\beta_0$ and the true value $\theta_0$ under $H_0$, and the middle term is continuous in $\beta$ and $\theta$, it follows from the definition of $\Sigma_{WLS}$ that the middle term converges in probability to $\Sigma_{WLS}^{-1}$. Thus we obtain

$$\frac{2\{L(\hat{\beta}) - L(\beta_0)\}}{\sigma_0^2} \approx C_0^T \Sigma_{WLS} C_0$$

under $H_0$. In fact, even if we were to replace $\sigma_0^2$ by its estimator, because the estimator is consistent, the result would be unaltered.

Now consider the restricted estimator under $H_0$, where we hold $\beta_2$ fixed at the null value $\beta_{20}$ and estimate the first $r$ components $\beta_1$ only. This is an $r < p$ dimensional problem. If we were to use GLS to estimate $\beta_1$, it is straightforward to realize that, applying the folklore theory treating $\beta_2$ as fixed at $\beta_{20}$, we would have under $H_0$ (so with $\beta_1 = \beta_{10}$ the true
value under $H_0$),

$$n^{1/2}(\hat{\beta}_1 - \beta_{10}) \dot\sim \mathcal{N}(0, \sigma_0^2 \Sigma_{WLS,11}),$$

where $\Sigma_{WLS,11}$ is the inverse of the upper left $r \times r$ submatrix $(\Sigma_{WLS}^{-1})_{11}$ of $\Sigma_{WLS}^{-1}$ ($p \times p$). Moreover, from the argument leading to the folklore result, we may conclude that

$$n^{1/2}(\hat{\beta}_1 - \beta_{10}) \approx \sigma_0 \Sigma_{WLS,11}\, n^{-1/2}\sum_{j=1}^n w_j^{1/2} f_{\beta_1}(x_j, \beta_0)\,\epsilon_j = \sigma_0 \Sigma_{WLS,11} C_{10},$$

say, where $f_{\beta_1}$ represents the $r \times 1$ vector of partial derivatives of $f$ with respect to the elements of $\beta_1$, and $C_{10}$ is equal to the first $r$ elements of $C_0$. Let $\tilde{\beta}_0 = (\hat{\beta}_1^T, \beta_{20}^T)^T$ be the $p \times 1$ vector of the estimator under the restriction that $\beta_2 = \beta_{20}$ for $H_0$.

Then $L(\tilde{\beta}_0)$ is just the loglikelihood evaluated for the restricted problem, and by an argument similar to that above we may conclude that

$$\frac{2\{L(\tilde{\beta}_0) - L(\beta_0)\}}{\sigma_0^2} \approx C_{10}^T \Sigma_{WLS,11} C_{10}.$$

Now consider the usual likelihood ratio test statistic $2\{L(\hat{\beta}) - L(\tilde{\beta}_0)\}$:

$$\frac{2\{L(\hat{\beta}) - L(\tilde{\beta}_0)\}}{\sigma_0^2} = \frac{2\{L(\hat{\beta}) - L(\beta_0)\}}{\sigma_0^2} - \frac{2\{L(\tilde{\beta}_0) - L(\beta_0)\}}{\sigma_0^2} \approx C_0^T \Sigma_{WLS} C_0 - C_{10}^T \Sigma_{WLS,11} C_{10} = C_0^T \left\{\Sigma_{WLS} - \begin{pmatrix} \Sigma_{WLS,11} & 0 \\ 0 & 0 \end{pmatrix}\right\} C_0,$$

where the last expression follows from the fact that the first $r$ elements of $C_0$ are $C_{10}$.

We are now in a position to exploit the following generic result: if $Z \sim \mathcal{N}(0, \Sigma)$ and $A\Sigma A = A$ with $k = \text{tr}(\Sigma A)$, then $Z^T A Z \sim \chi^2_k$. Note that $C_0 \dot\sim \mathcal{N}(0, \Sigma_{WLS}^{-1})$, and it is straightforward to verify from the definition of $\Sigma_{WLS,11}$ that

$$\left\{\Sigma_{WLS} - \begin{pmatrix} \Sigma_{WLS,11} & 0 \\ 0 & 0 \end{pmatrix}\right\} \Sigma_{WLS}^{-1} \left\{\Sigma_{WLS} - \begin{pmatrix} \Sigma_{WLS,11} & 0 \\ 0 & 0 \end{pmatrix}\right\} = \Sigma_{WLS} - \begin{pmatrix} \Sigma_{WLS,11} & 0 \\ 0 & 0 \end{pmatrix}.$$

Moreover,

$$\text{tr}\left[\Sigma_{WLS}^{-1}\left\{\Sigma_{WLS} - \begin{pmatrix} \Sigma_{WLS,11} & 0 \\ 0 & 0 \end{pmatrix}\right\}\right] = p - r,$$

again using the definition of $\Sigma_{WLS,11}$. We may thus conclude that the "likelihood ratio statistic" satisfies

$$\frac{2\{L(\hat{\beta}) - L(\tilde{\beta}_0)\}}{\sigma_0^2} \dot\sim \chi^2_{p-r}.$$

This result may be used as the basis for tests and confidence regions in the usual ways, replacing $\sigma_0^2$ by the estimator $\hat{\sigma}^2$.

Variations on this theme are possible. Carroll and Ruppert (1988, p. 25) discuss constructing the "likelihood ratio test" statistic based on "profiling out" $\sigma^2$ (see page 130) and basing the statistic on the "profiled" or concentrated loglikelihood. This idea is also discussed in Seber and Wild (1989, section 5.2.3).

Modifications to the basic likelihood ratio test idea have been advocated to improve reliability in small samples. One such modification is to instead construct an
asymptotically equivalent version of the test statistic by analogy to the ordinary F test in classical linear regression. In particular, with $S(\beta) = \sum_{j=1}^n w_j\{Y_j - f(x_j,\beta)\}^2$, so that $2\{L(\hat{\beta}) - L(\tilde{\beta}_0)\} = S(\tilde{\beta}_0) - S(\hat{\beta})$, an alternative F test is based on the result

$$\frac{\{S(\tilde{\beta}_0) - S(\hat{\beta})\}/(p - r)}{S(\hat{\beta})/(n - p)} \dot\sim F_{p-r,\,n-p}.$$

Seber and Wild (1989, section 5.3) discuss this idea further.

As we have noted, all of the aforementioned ideas treat $\theta$ as if it were a fixed, known constant, and thus do not attempt to adjust the statistics for the fact that $\theta$ is estimated. Indeed, the presence of $\beta$ in the weights is not accounted for, either. In Chapter 11, we will discuss the use of the bootstrap to take this extra uncertainty into account.

As mentioned in Section 4.6, in our discussion of quasilikelihood, in the case where there are no unknown parameters $\theta$ in the variance function and variance depends on $\beta$ through a function of the mean response, it has been proposed to use the quasilikelihood in the same way as a loglikelihood to construct so-called quasilikelihood ratio test statistics. Specifically, with the quasilikelihood $L_{QL}(\beta)$ defined as in Section 4.6 (suppressing dependence on the data), it may be shown by an argument similar to the one above that

$$\frac{2\{L_{QL}(\hat{\beta}) - L_{QL}(\tilde{\beta}_0)\}}{\sigma_0^2} \dot\sim \chi^2_{p-r};$$

see, for example, McCullagh (1983). A problem with the quasilikelihood approach is that $L_{QL}$ may be difficult to derive for general variance functions. As we saw in Section 4.6, in the case of the power-of-the-mean variance model, the derivation is relatively straightforward.

REMARKS:

- Inference based on likelihood approaches is thought to be more reliable in small samples than Wald inference; this has been deduced through simulations. Because every nonlinear problem is different, however, it is impossible to say that this is always the case.
- Likelihood-based inference is obviously more complicated to implement than Wald inference. Thus, use of these techniques is much less common in practice.

MORE COMPLICATED HYPOTHESES: For nonlinear models that arise from theoretical or empirical considerations in applications like growth analysis or pharmacokinetics, it is not
uncommon for interest to focus on nonlinear functions of the elements of $\beta$.

For example, recall the data on the pharmacokinetics of indomethacin, most recently discussed in Section 7.5, for which, with $z$ representing time (hours) after the dose, the biexponential model

$$f(z, \beta) = e^{\beta_1}\exp(-e^{\beta_2} z) + e^{\beta_3}\exp(-e^{\beta_4} z)$$

is reasonable; we have used the parameterization that enforces positivity here. In this model, the quantities $e^{\beta_2}$ and $e^{\beta_4}$ are rate constants with units of 1/hour. As discussed in Section 7.5, a quantity that is of some interest is the so-called terminal half-life. If $e^{\beta_4} < e^{\beta_2}$, then the second exponential term in the model dictates the "terminal phase" of the elimination of drug, which manifests itself as the "second part" of the decay. The half-life of this phase is the time it takes for the mean response in this phase to decrease by half, given by $t_{1/2} = \log 2 / e^{\beta_4}$, which has units of hours. It is of general interest to estimate the terminal half-life and provide some assessment of the uncertainty in the estimate, e.g., approximate estimated standard errors and confidence intervals. Here, then, the quantity of interest is itself a nonlinear function of the regression parameters.

A standard approach to deriving estimates of uncertainty for such quantities is via the Wald approach. Consider a general, real-valued nonlinear function of $\beta$, $a(\beta)$, say; the following argument is easily extended by analogy to the case of a vector-valued function. The obvious estimator for $a(\beta_0)$, the function evaluated at the true value of $\beta$, is $a(\hat{\beta})$. By a standard Taylor series expansion to linear terms, we have

$$a(\hat{\beta}) \approx a(\beta_0) + a_\beta^T(\beta_0)(\hat{\beta} - \beta_0), \quad (9.23)$$

where $a_\beta$ is the vector of partial derivatives of $a$ with respect to the elements of $\beta$. As

$$\hat{\beta} \dot\sim \mathcal{N}\left[\beta_0,\ \hat{\sigma}^2\{X^T(\hat{\beta})W(\hat{\beta},\hat{\theta})X(\hat{\beta})\}^{-1}\right],$$

(9.23) suggests that

$$a(\hat{\beta}) \dot\sim \mathcal{N}\left[a(\beta_0),\ \hat{\sigma}^2 a_\beta^T(\hat{\beta})\{X^T(\hat{\beta})W(\hat{\beta},\hat{\theta})X(\hat{\beta})\}^{-1} a_\beta(\hat{\beta})\right].$$

This result may be used as the basis for Wald-type confidence intervals for $a(\beta_0)$, i.e.,

$$a(\hat{\beta}) \pm z_{\alpha/2}\,\hat{\sigma}\left[a_\beta^T(\hat{\beta})\{X^T(\hat{\beta})W(\hat{\beta},\hat{\theta})X(\hat{\beta})\}^{-1} a_\beta(\hat{\beta})\right]^{1/2},$$

and a test statistic regarding $a(\beta)$. See Seber and Wild (1989),
Chapter 5, for more on alternative approaches to inference for nonlinear functions of parameters. As with the regression parameters themselves, Wald inference is routine in this context because of the relative ease of implementation.

9.6 Optimality of GLS and extensions

We have noted previously that:

- GLS ($C = \infty$) is maximum likelihood estimation in the class of scaled exponential family distributions for $Y_j$. Here, there are no additional variance parameters $\theta$.
- GLS ($C = \infty$) estimation of $\beta$ is equivalent to normal-theory ML when the variance function does not depend on $\beta$. If the variance function also depends on unknown parameters $\theta$, jointly solving the GLS and PL estimating equations leads to joint normal ML for $(\beta^T, \theta^T)^T$.

Thus, there are several situations in which the GLS approach is optimal, in the sense that, in general, maximum likelihood estimation yields the most precise estimator in terms of asymptotic relative efficiency. Of course, this is only true if the mean-variance model and the distributional assumption are exactly correct.

We have also argued that, even in the case where we are unwilling to make distributional assumptions, the GLS approach is sensible, as it weights the linear contributions of responses in accordance with their quality, as dictated by the assumed variance function. It turns out that we may be more formal about this. In particular:

- We will show momentarily that GLS arises naturally as the optimal approach among the class of all possible linear estimating equations for $\beta$.
- Because linear estimating equations depend on the data in a fairly simple way and are reasonably easy to solve, they are an obvious practical choice relative to more complicated quadratic or other equations. Thus, the optimality result ensures that using GLS will yield the most precise estimation within this practical class. In fact, the argument we are about to carry out has broader implications for other types of equations, e.g., quadratic equations; we will discuss this in Chapter 10.
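The optimality claim, that among all linear estimating equations $A^T(\beta)\{Y - f(\beta)\} = 0$ the GLS choice $A = WX$ minimizes the large-sample covariance $\sigma_0^2(A^T X)^{-1} A^T W^{-1} A (X^T A)^{-1}$, can also be checked numerically. The following sketch (hypothetical dimensions and randomly generated $X$, $W$, and $A$; numpy assumed available) verifies that the difference of the two covariance expressions is nonnegative definite:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))                 # gradient matrix (hypothetical values)
W = np.diag(rng.uniform(0.5, 2.0, size=n))  # true (diagonal) weight matrix

def avar(A, X, W):
    """Large-sample covariance (up to sigma_0^2) of the solution of A^T {Y - f(beta)} = 0."""
    M = np.linalg.inv(A.T @ X)
    return M @ (A.T @ np.linalg.inv(W) @ A) @ M.T

# the GLS choice A = W X attains (X^T W X)^{-1}
V_gls = np.linalg.inv(X.T @ W @ X)
assert np.allclose(avar(W @ X, X, W), V_gls)

for _ in range(200):
    A = rng.normal(size=(n, p))             # an arbitrary linear estimating equation
    diff = avar(A, X, W) - V_gls
    # the difference must be nonnegative definite (up to rounding error)
    eigs = np.linalg.eigvalsh(diff)
    assert eigs.min() >= -1e-8 * max(1.0, abs(eigs).max())
```

This check is of course no substitute for the argument that follows; it simply illustrates the inequality for a few hundred arbitrary choices of $A$.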
ASYMPTOTIC GAUSS-MARKOV PROPERTY: We have already derived some results related to the one we now consider, showing in Section 9.3 that GLS yields a more precise estimator for $\beta$, in a large-sample sense, than estimators constructed with an incorrect variance model, including OLS. The following argument subsumes that one. In particular, we now show that the GLS equation leads to the estimator for $\beta$ with the smallest large-sample covariance matrix among all linear estimating equations.

To simplify the calculations, we will consider $\theta$ as known and thus write $\theta_0$; because, from the folklore theorem, the effect of $\hat{\theta}$ in the weights is negligible asymptotically, replacing $\theta_0$ by $\hat{\theta}$ will not alter the result. Suppose that the true variance model is $\text{var}(Y_j | x_j) = \sigma_0^2 g^2(\beta_0, \theta_0, x_j)$, and define the matrix $W(\beta, \theta)$ as before. Then we may write the GLS equation with the correct variance model in matrix form as

$$X^T(\beta) W(\beta, \theta_0)\{Y - f(\beta)\} = 0. \quad (9.24)$$

Consider the general class of linear estimating equations for $\beta$ of the form

$$A^T(\beta)\{Y - f(\beta)\} = 0, \quad (9.25)$$

of which it is clear that (9.24) is a special case. Let $\hat{\beta}$ denote the estimator for $\beta$ satisfying (9.24), and let $\tilde{\beta}$ denote the estimator solving (9.25). Obviously, (9.25) is an unbiased estimating equation, so we expect that $\tilde{\beta} \overset{p}{\to} \beta_0$.

We will now show that $\hat{\beta}$ is "best" among all possible $\tilde{\beta}$. That is, we will show that all linear functions of $\tilde{\beta}$ have asymptotic covariance at least as great as that of $\hat{\beta}$, in the sense of nonnegative definiteness we have described previously. Thus, in a large-sample sense, $\hat{\beta}$ is optimal among all estimators solving linear estimating equations in the class (9.25).

Let $X = X(\beta_0)$, $W = W(\beta_0, \theta_0)$, $A = A(\beta_0)$, and $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T = \sigma_0^{-1} W^{1/2}\{Y - f(\beta_0)\}$. From the folklore argument, multiplying through by $n^{-1/2}$ and using matrix notation, we may represent $\hat{\beta}$ solving (9.24) as

$$\hat{\beta} - \beta_0 \approx \sigma_0 (X^T W X)^{-1} X^T W^{1/2} \epsilon. \quad (9.26)$$

Now $\tilde{\beta}$ satisfies $A^T(\tilde{\beta})\{Y - f(\tilde{\beta})\} = 0$; thus, letting $a_j^T(\beta)$ denote the $j$th row of $A(\beta)$, by a Taylor series approximation we have

$$0 \approx n^{-1/2}\sum_{j=1}^n a_j(\beta_0)\{Y_j - f(x_j, \beta_0)\} - \Big[ n^{-1}\sum_{j=1}^n a_j(\beta_0) f_\beta^T(x_j, \beta_0) - n^{-1}\sum_{j=1}^n \{Y_j - f(x_j, \beta_0)\}\,\frac{\partial a_j}{\partial \beta^T}(\beta_0) \Big]\, n^{1/2}(\tilde{\beta} - \beta_0).$$

Analogous to the
folklore argument, the second part of the linear term converges in probability to zero; thus, rearranging and multiplying through by $n^{-1/2}$, we obtain, in matrix notation,

$$\tilde{\beta} - \beta_0 \approx \sigma_0 (A^T X)^{-1} A^T W^{-1/2}\epsilon. \quad (9.27)$$

From (9.26) and (9.27), we thus have that

$$\text{var}(\hat{\beta} - \beta_0) \approx \sigma_0^2 (X^T W X)^{-1}, \qquad \text{var}(\tilde{\beta} - \beta_0) \approx \sigma_0^2 (A^T X)^{-1} A^T W^{-1} A (X^T A)^{-1}.$$

We now would like to show that the approximation to $\text{var}(\hat{\beta} - \beta_0)$ is "smaller" than that to $\text{var}(\tilde{\beta} - \beta_0)$; that is, to show that the matrix difference

$$(A^T X)^{-1} A^T W^{-1} A (X^T A)^{-1} - (X^T W X)^{-1}$$

is nonnegative definite. The argument is entirely similar to that on page 209. We wish to show that

$$\lambda^T \{(A^T X)^{-1} A^T W^{-1} A (X^T A)^{-1} - (X^T W X)^{-1}\}\lambda \geq 0 \quad \text{for all } \lambda.$$

Letting $d = (X^T A)^{-1}\lambda$, we may write this as

$$d^T \{A^T W^{-1} A - A^T X (X^T W X)^{-1} X^T A\} d \geq 0.$$

Defining $c = W^{-1/2} A d$ and $X^* = W^{1/2} X$, we may rewrite this as

$$c^T \{I - X^*(X^{*T} X^*)^{-1} X^{*T}\} c \geq 0,$$

which holds, as the middle matrix is symmetric and idempotent.

The result is sometimes shown alternatively as follows. Writing for convenience $L = (A^T X)^{-1} A^T$ and $f = f(\beta_0)$, we have

$$\text{var}(\tilde{\beta} - \beta_0) \approx \text{var}\{L(Y - f)\} = \text{var}\{L(Y - f) - (\hat{\beta} - \beta_0) + (\hat{\beta} - \beta_0)\} = \text{var}\{L(Y - f) - (\hat{\beta} - \beta_0)\} + \text{var}(\hat{\beta} - \beta_0) + C + C^T,$$

where $C = \text{cov}\{L(Y - f) - (\hat{\beta} - \beta_0),\ \hat{\beta} - \beta_0\}$. Using (9.26) and (9.27), the covariance term satisfies

$$C = \sigma_0^2\,\text{cov}\{L W^{-1/2}\epsilon - (X^T W X)^{-1} X^T W^{1/2}\epsilon,\ (X^T W X)^{-1} X^T W^{1/2}\epsilon\} = \sigma_0^2\{L W^{-1/2} - (X^T W X)^{-1} X^T W^{1/2}\} W^{1/2} X (X^T W X)^{-1} = \sigma_0^2\{L X (X^T W X)^{-1} - (X^T W X)^{-1}\} = 0,$$

using $LX = I$ and $\text{var}(\epsilon) = I$. Thus

$$\text{var}(\tilde{\beta} - \beta_0) \approx \text{var}\{L(Y - f) - (\hat{\beta} - \beta_0)\} + \text{var}(\hat{\beta} - \beta_0).$$

As the first term is a covariance matrix, it must be nonnegative definite. Thus, we may conclude that

$$\text{var}(\tilde{\beta} - \beta_0) \geq \text{var}(\hat{\beta} - \beta_0)$$

in the sense of nonnegative definiteness.

RESULT:

- Among all estimators solving linear estimating equations for $\beta$, not just those involving the gradient matrix $X(\beta)$ and incorrect weighting, the GLS estimator is optimal in the sense of being the most precise in a large-sample sense. Thus, the GLS approach may be motivated based on this appealing property, with no mention of distributions or an intuitive connection to the sensible idea of WLS.

This result is consistent with what we already know. When the variance function $g$ does not depend on $\beta$, the normal ML equation and the GLS equation are identical. Thus, if the data truly are conditionally normally distributed, then the
general maximum likelihood theory tells us that solving this linear estimating equation is the optimal approach in a large-sample sense. Hence, in this situation, we would expect that the optimal linear estimator would correspond to maximum likelihood, which also involves a linear equation.

It is, of course, important to remember that the above result is predicated on having specified the variance model correctly. Thus, the optimal linear estimating equation involves the true conditional variance of the response. As we have already seen, misspecification of this variance leads to inefficiency.

CONJECTURE: The GLS equation depends on the data through the form we highlighted first in Chapter 5, of

(gradient of mean function) × (covariance matrix)$^{-1}$ × (response − mean).

Recall that we have cast a number of equations in this general form, as

$$\sum_{j=1}^n D_j^T(\alpha) V_j^{-1}(\alpha)\{s_j(\alpha) - \mu_j(\alpha)\} = 0;$$

see Section 6.4. This may be written more compactly in matrix notation, using the definitions in Section 6.4, as

$$D^T(\alpha) V^{-1}(\alpha)\{s(\alpha) - \mu(\alpha)\} = 0, \quad (9.28)$$

where $D(\alpha)$ is the gradient matrix of the mean vector $\mu(\alpha)$ of the "response" vector $s(\alpha)$, with covariance matrix $\text{var}\{s(\alpha)\} = V(\alpha)$. Here, $s(\alpha)$ is some function of the data and parameters; in Chapters 5 and 6 we considered, e.g., $s_j(\alpha) = [Y_j, \{Y_j - f(x_j, \beta)\}^2]^T$ for quadratic estimating equations.

It is natural to wonder whether a similar result holds for equations of the form (9.28) more generally. That is, with the components of (9.28) defined as above, if we considered any other equation, in obvious notation,

$$A^T(\alpha)\{s(\alpha) - \mu(\alpha)\} = 0,$$

it would be interesting to find that the estimator solving (9.28) is at least as precise asymptotically. It turns out that, under certain conditions, this is indeed the case: the optimal estimating equation based on a linear combination of some function of the data, $s(\alpha)$, minus its mean is of the form (9.28), with the matrices $D(\alpha)$ and $V(\alpha)$ correctly specified.

ILLUSTRATION: Consider joint normal theory ML estimation of $\sigma^2$ and $\beta$ under the mean-variance model (9.1), with $\theta$ known. Then we identify $\alpha = (\beta^T, \sigma^2)^T$, and from (5.19) the estimating
equation is, in the previous shorthand notation,

$$\sum_{j=1}^n \begin{pmatrix} f_\beta(x_j, \beta) & 2\sigma^2 g_j^2 \nu_{\beta j} \\ 0 & g_j^2 \end{pmatrix} \begin{pmatrix} \sigma^2 g_j^2 & 0 \\ 0 & 2\sigma^4 g_j^4 \end{pmatrix}^{-1} \begin{pmatrix} Y_j - f(x_j, \beta) \\ \{Y_j - f(x_j, \beta)\}^2 - \sigma^2 g_j^2 \end{pmatrix} = 0,$$

where $g_j = g(\beta, \theta, x_j)$ and $\nu_{\beta j}$ is the vector of partial derivatives of $\log g_j$ with respect to $\beta$. If the $Y_j$ given $x_j$ are truly normally distributed, these equations yield the asymptotically optimal estimator for $\beta$ on the basis of general maximum likelihood theory. The above argument suggests that the equations are also optimal among all such equations under the conditions that the mean and variance models are correct and that the third and fourth moment assumptions for the $\epsilon_j$ that appear in the covariance matrix are correct, even if the data are not normal.

Similarly, if we believe that the $\epsilon_j$ are iid and symmetrically distributed, but that $\text{var}(\epsilon_j^2) = \kappa \neq 2$ for all $j$, then the argument would suggest that the above equation would lead to an inefficient estimator, and that the estimating equation

$$\sum_{j=1}^n \begin{pmatrix} f_\beta(x_j, \beta) & 2\sigma^2 g_j^2 \nu_{\beta j} \\ 0 & g_j^2 \end{pmatrix} \begin{pmatrix} \sigma^2 g_j^2 & 0 \\ 0 & \kappa\,\sigma^4 g_j^4 \end{pmatrix}^{-1} \begin{pmatrix} Y_j - f(x_j, \beta) \\ \{Y_j - f(x_j, \beta)\}^2 - \sigma^2 g_j^2 \end{pmatrix} = 0,$$

discussed on page 115, would lead to the most precise estimation under these circumstances.

In Chapter 10, we will consider these issues more carefully. A major interest will be to investigate potential gains in efficiency by using quadratic rather than linear estimating equations, and to appreciate the tradeoffs involved.

15 Nonlinear mixed effects models

15.1 Introduction

In this chapter, we focus on methods for estimation of the parameters in subject-specific nonlinear mixed effects models, which were introduced in Section 13.3. As noted in our motivation of these models, they are often the preferred framework when interest focuses on inter-individual variation in the population, where the population may be characterized in terms of scientifically meaningful parameters. This is exactly the case in studies of pharmacokinetics, growth, and other areas where a theoretical model for within-individual behavior is available whose parameters describe biologically or physically relevant intra-individual processes.

As we have already
discussed in Section 13.3, inference in these models is complicated by the hierarchical nature of the model. Marginal quantities, such as the mean and covariance of a response vector conditional on covariates, are not specified explicitly, so that inference on model parameters cannot be based directly on, say, estimating equation or maximum likelihood approaches similar to those we have discussed for population-averaged models in Chapter 14. As a consequence, it is necessary to consider alternative strategies. In this chapter, we will provide an overview of the issues involved and discuss some of the most popular approaches to inference in nonlinear mixed effects models.

15.2 General nonlinear mixed effects model and likelihood

Unlike in the case of population-averaged models covered in Chapter 14, in this setting we will note explicitly the distinction between within-individual covariates for individual $i$, summarized for each $j = 1, \ldots, n_i$ as $z_i = (z_{i1}^T, \ldots, z_{in_i}^T)^T$, and individual-level covariates $a_i$, in order to state the general form of the model. Recall throughout this chapter that, although for notational simplicity we include time in the "within-individual covariates," time has a special role. Thus, we consider the data, as described in Chapter 13, as the triplets $(Y_i, z_i, a_i)$, $i = 1, \ldots, m$, where the $(Y_i, z_i, a_i)$ are assumed independent. In particular:

- For individual $i$, pairs $(Y_{i1}, z_{i1}), \ldots, (Y_{in_i}, z_{in_i})$ are collected; i.e., the $j$th response $Y_{ij}$ is collected under conditions $z_{ij}$, $j = 1, \ldots, n_i$.
- Individual $i$ also has associated individual-level covariates $a_i$ that do not change within $i$.
- As before, all covariates associated with individual $i$ may be summarized as $x_i = (z_i^T, a_i^T)^T$.

We will consider a model slightly more general than that introduced in (13.9) and (13.10). Specifically, we will allow the second-stage population model to have a more general form, as we now describe. The general model we consider may be expressed as follows.

Stage 1: Individual model. The random vectors $Y_i$, $i = 1, \ldots, m$, are assumed to satisfy $E(Y_i | z_i, a_i, b_i) = E(Y_i | x_i, b_i)$
$= f_i\{x_i, d(a_i, \beta, b_i)\} = f_i(x_i, \beta_i)$ and

$$\text{var}(Y_i | z_i, a_i, b_i) = R_i(\beta_i, \gamma, z_i), \quad i = 1, \ldots, m. \quad (15.1)$$

Here, $\beta_i$ is a $b \times 1$ individual-specific regression parameter characterizing the model $f(z_{ij}, \beta_i)$ for individual behavior, and $\gamma$ is a $q$-dimensional vector of within-individual covariance parameters. The forms of $f_i$ and $R_i$, the within-individual conditional mean and covariance matrix as functions of $x_i$, $b_i$, and $\beta$, are obtained by substituting the model for $\beta_i$ in Stage 2 below.

Stage 2: Population model. The individual-specific parameter $\beta_i$ is assumed to be a function of individual-level covariates $a_i$, a fixed effect $\beta$ ($p \times 1$), and random effects $b_i$ ($k \times 1$), given by

$$\beta_i = d(a_i, \beta, b_i), \quad (15.2)$$

where $d$ is a $b$-dimensional vector of possibly nonlinear functions of $a_i$, $\beta$, and $b_i$. The random effects $b_i$ are usually assumed to be iid, independent of $z_i$ and $a_i$, with

$$E(b_i) = 0, \quad \text{var}(b_i) = D.$$

The most popular assumption is that the $b_i$ are normally distributed with these moments.

The second-stage model (15.2) allows the possibility of nonlinear dependence of the $\beta_i$ on the fixed and random effects, and allows the possibility that the dimensions of the random effects and $\beta_i$ may not coincide. This is in contrast to the simple linear model $\beta_i = A_i\beta + b_i$ presented in (13.10), where $A_i$ is a $b \times p$ design matrix depending on elements of $a_i$.

NONLINEAR DEPENDENCE: An example of the need for a more general, nonlinear model of the form (15.2) arises in the setting of pharmacokinetics. In this application, the elements of $\beta_i$ are pharmacokinetic parameters characterizing features of the drug disposition process within individual $i$. Pharmacokineticists have come to realize that the distribution of parameters such as drug clearance and volume of distribution in the population does not appear to be normal, as is often true with biological entities. Rather, it seems more likely that such parameters, which are of course constrained to be positive, have a skewed distribution in the population of individuals.

- If $b_i$ were taken to be normally distributed,
and if $\beta_i$ were assumed to be linearly related to $b_i$, then it would follow that the $\beta_i$ would be assumed normally distributed, which would be a potentially unrealistic model for the distribution of pharmacokinetic parameters in the population.

Thus, pharmacokineticists have favored modeling the components of $\beta_i$ in such a way that, if the $b_i$ were approximately normally distributed, the components of $\beta_i$ would have skewed distributions with positive support. For example, suppose that $Cl_i$ represents drug clearance. Then, if $a_i = (w_i, C_i)^T$, where $w_i$ is weight and $C_i$ is creatinine clearance, they might specify a model like

$$Cl_i = \exp(\beta_{Cl,0} + \beta_{Cl,w} w_i + \beta_{Cl,C} C_i + b_{Cl,i}) \quad (15.3)$$

or

$$Cl_i = (\beta_{Cl,0} + \beta_{Cl,w} w_i + \beta_{Cl,C} C_i)\exp(b_{Cl,i}).$$

In both cases, the associated random effect $b_{Cl,i}$ enters the model in a multiplicative fashion; if $b_{Cl,i}$ were normally distributed, then $Cl_i$ would be lognormally distributed. The dependence on the covariates takes a different functional form in each case; the dependence on fixed effects in (15.3) is also nonlinear.

REPARAMETERIZATION: A linear alternative for a nonlinear model like (15.3) is as follows. Note that (15.3) may be written as

$$\log Cl_i = \beta_{Cl,0} + \beta_{Cl,w} w_i + \beta_{Cl,C} C_i + b_{Cl,i}, \quad (15.4)$$

which is of course linear in both fixed and random effects. Recall that, in previous chapters, we have mentioned that models for pharmacokinetics, such as the simple biexponential model or more complicated models such as the one-compartment model with first-order absorption and elimination used to model the theophylline data in Section 13.3, might be reparameterized. E.g., the biexponential model, which is commonly written

$$\beta_1\exp(-\beta_2 t) + \beta_3\exp(-\beta_4 t),$$

might be re-expressed as

$$e^{\beta_1}\exp(-e^{\beta_2} t) + e^{\beta_3}\exp(-e^{\beta_4} t),$$

so that $\beta_2$ in the second parameterization is the logarithm of the corresponding rate constant in the first. Similarly, the theophylline model

$$f = \frac{k_a D}{V(k_a - Cl/V)}\left\{\exp(-(Cl/V)t) - \exp(-k_a t)\right\} \quad (15.5)$$

might be re-expressed, with $\beta_1 = \log Cl$, $\beta_2 = \log V$, and $\beta_3 = \log k_a$, as

$$f = \frac{e^{\beta_3} D}{e^{\beta_2}(e^{\beta_3} - e^{\beta_1 - \beta_2})}\left\{\exp(-e^{\beta_1 - \beta_2} t) - \exp(-e^{\beta_3} t)\right\}. \quad (15.6)$$

Such reparameterizations do not only serve to make fitting of the model to individual data more stable. In the context of multivariate data, they also may be introduced
to incorporate both skewness in the population of parameters and the desire for a simpler, linear second-stage population model. For definiteness, consider (15.6). In this parameterization, for individual $i$, the parameter $\beta_{i1} = \log Cl_i$, where $Cl_i$ is drug clearance. Thus, the model has been parameterized directly in terms of the logarithms of the pharmacokinetic parameters.

- The model parameterized in the original way (15.5), in terms of the pharmacokinetic parameters directly, with a second-stage model of the form (15.3) for each parameter, and the model parameterized as (15.6), with a linear second-stage model of the form (15.4), are two different ways to achieve the same objective.
- In practice, pharmacokineticists tend to prefer the first approach, while statisticians analyzing these data have tended toward the second. The rationale in the latter case is that the individual model is a more stable parameterization in the individual case, the benefits of which might carry over to the inferential methods we will discuss in Sections 15.3-15.5. Also, the linear second-stage model is simpler, which also results in simpler implementation for some methods (e.g., Section 15.3).

DIFFERENT DIMENSIONS FOR $\beta_i$ AND $b_i$: A second issue is that of relative magnitudes of inter-individual variation in the population among the elements of $\beta_i$. From a biological or physical standpoint, it is likely that, after systematic variation due to association with covariates is taken into account, a scientifically meaningful parameter will still exhibit variation. This variation is, of course, represented by a corresponding random effect.

For example, consider again drug clearance in model (15.4). Unless $b_{Cl,i}$ has variance zero, the model states that $\log Cl_i$, and hence clearance, for the part of the population with weight $w$ and creatinine clearance $C$, is not constant, but rather varies. Under similar models for $\log V_i$ and $\log k_{a,i}$, these parameters are also assumed to vary after relevant covariate effects are taken into account.

Suppose that this remaining, unexplained
variation in the pharmacokinetic parameters is of considerably different magnitudes. In loglinear models like (15.4) for $\log Cl_i$, the standard deviation of the random effect corresponds roughly to the coefficient of variation in the population on the original scale, i.e., among $Cl_i$ values. Thus, suppose that the CV in $Cl_i$ and $V_i$ values is about 30%, while that in $k_{a,i}$ values is less than 5%.

Of course, the pharmacokinetic parameters for each individual are not observable; rather, information about them is available only through the $Y_i$. The consequence is that, with different magnitudes of variation, it is often very difficult to characterize accurately their distribution through the parameters $\beta$ and $D$ in practice.

- In these circumstances, it is thus common to invoke an approximation to reduce the dimension of the distribution that must be characterized. The approximation is to assume that the variation in elements of $\beta_i$ for which variation is very small, relative to that in other components, is negligible, and to treat it as effectively zero. Operationally, this is accomplished by specifying a second-stage model in which these elements are taken to have no associated random effect. For example, consider the model

$$\begin{pmatrix} \log Cl_i \\ \log V_i \\ \log k_{a,i} \end{pmatrix} = \begin{pmatrix} 1 & w_i & C_i & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & w_i & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & w_i \end{pmatrix} \begin{pmatrix} \beta_{Cl,0} \\ \beta_{Cl,w} \\ \beta_{Cl,C} \\ \beta_{V,0} \\ \beta_{V,w} \\ \beta_{ka,0} \\ \beta_{ka,w} \end{pmatrix} + \begin{pmatrix} b_{Cl,i} \\ b_{V,i} \\ 0 \end{pmatrix}.$$

Here, there is no random effect corresponding to $\log k_{a,i}$. Technically, the model says that all variation in $\log k_{a,i}$ in the population of all individuals may be explained by a systematic relationship with weight. In a practical sense, we may not believe this precisely, but it may serve as an approximation when the variation in $\log Cl_i$ and $\log V_i$ is much larger by comparison.

- Note that we may rewrite the random-effect part of this model as

$$\begin{pmatrix} b_{Cl,i} \\ b_{V,i} \\ 0 \end{pmatrix} = B_i b_i, \qquad b_i = \begin{pmatrix} b_{Cl,i} \\ b_{V,i} \end{pmatrix}. \quad (15.7)$$

Thus, we define the random effect $b_i$ as having only $k = 2$ elements, while $\beta_i$ has dimension $b = 3$. The matrix

$$B_i = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}$$

may be thought of as a design matrix that specifies to which elements of $\beta_i$ the two random-effects components correspond.
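As a concrete sketch of this reduced specification, the following builds $\beta_i = A_i\beta + B_i b_i$ for the model above, where $b_i$ has $k = 2$ components and $\log k_{a,i}$ receives no random effect. All numeric values (fixed effects, covariates, variances) are hypothetical and for illustration only; numpy is assumed available:

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical fixed effects: (beta_Cl0, beta_Clw, beta_ClC, beta_V0, beta_Vw, beta_ka0, beta_kaw)
beta = np.array([0.5, 0.01, 0.004, -0.7, 0.012, -0.2, 0.003])
D = np.diag([0.09, 0.09])   # var(b_Cl), var(b_V): sd 0.3, i.e. roughly 30% CV on original scale

B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])  # design matrix B_i: log ka gets no random effect

def stage2(w, C, b):
    """beta_i = A_i beta + B_i b_i for (log Cl_i, log V_i, log ka_i)."""
    A = np.array([[1.0, w,   C,   0.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0, w,   0.0, 0.0],
                  [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, w  ]])
    return A @ beta + B @ b

# two individuals with identical covariates (weight 70, creatinine clearance 100)
b1 = rng.multivariate_normal(np.zeros(2), D)
b2 = rng.multivariate_normal(np.zeros(2), D)
phi1, phi2 = stage2(70.0, 100.0, b1), stage2(70.0, 100.0, b2)

# log Cl and log V differ across individuals, but log ka is identical:
# all of its population variation is attributed to weight
assert phi1[2] == phi2[2]
assert phi1[0] != phi2[0]
```

On the original scale, $Cl_i = \exp(\beta_{i1})$ and so on, so the lognormal population model for clearance and volume is recovered while $k_{a,i}$ is treated as fully explained by weight.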
Another situation in which certain elements of $\beta_i$ may be taken to have no corresponding random effect, and thus not to vary in the population after accounting for relevant covariates, is the case where the element may correspond to something like a "universal constant"; this would be implausible in a biological application like pharmacokinetics, but might be more likely in physical science applications, such as chemical kinetics, where certain aspects of a chemical process might be thought to be governed by deterministic mechanisms that do not change from run to run.

SUMMARY: The nonlinear second-stage specification (15.2) in our more general nonlinear mixed effects model accommodates these possibilities. In the sequel, we will discuss inferential approaches in the context of this model, or a less general version of it, given as follows.

SPECIAL CASE: A special case of (15.2) that is more general than the model considered in Section 13.3 allows $k$ and $b$ to differ, but restricts attention to linear functions of $\beta$ and $b_i$. In particular, this special case is usually written as

$$\beta_i = A_i\beta + B_i b_i. \quad (15.8)$$

Here, $A_i$ ($b \times p$) is a design matrix depending on the individual-level covariates $a_i$, as before. The $b \times k$ matrix $B_i$ is also a design matrix, whose elements would typically be either 0 or 1, and serves to specify which elements of $\beta_i$ have associated random effects, as in the example above. As we will discuss in Section 15.6, some software packages implementing the methods we will discuss are restricted to population models of the form (15.8), while others allow more general models of the form (15.2) but require more specialized programming by the user.

LIKELIHOOD-BASED INFERENCE: The usual objectives of an analysis in these models, as discussed in Section 13.3, are to estimate $\beta$ and $D$, as these parameters characterize the "typical" behavior of individual-specific parameters and how these parameters vary in the population, including both systematic variation due to relationships with covariates and unexplained, inherent
population variation. Because nonlinear mixed effects models incorporate underlying random effects whose distribution is meant to characterize explicitly inter-individual variation in the population, as we have seen in Section 13.3, even specification of the marginal mean and covariance matrix of the distribution of $Y_i | x_i$ requires that some assumption about the random-effects distribution be made. In particular, with $F$ denoting the distribution function of the $b_i$, assumed independent of $x_i$,

$$E(Y_i | x_i) = \int f_i(x_i, \beta, b_i)\, dF(b_i).$$

Clearly, under the model, we would need to know, or be able to estimate somehow, $dF(b_i)$ in order just to specify the marginal mean; even if we are willing to do this, the integral is likely to be analytically intractable. Thus, the advantage of making only a few moment assumptions, rather than an assumption on the entire conditional distribution of the $Y_i$ given the covariates, is not apparent for these models. Accordingly, for nonlinear mixed effects models, rather than seek inferential techniques based only on an assumption on the first two marginal moments, it is standard to consider likelihood inference as a starting point. Inference on $\beta$ and $D$, as well as $\gamma$, is based on treating the joint density of $Y_i | x_i$ as a likelihood in these parameters.

In what follows, we will assume, as is usually the case in practice, that the random effects $b_i$ are continuous. Using the independence, the joint density is given by

$$\prod_{i=1}^m p(Y_i | x_i; \beta, \gamma, D), \quad (15.9)$$

where $p(Y_i | x_i; \beta, \gamma, D)$ is the marginal density of $Y_i | x_i$. Now, in general, using obvious notation,

$$p(Y_i | x_i; \beta, \gamma, D) = \int p(Y_i, b_i | x_i; \beta, \gamma, D)\, db_i = \int p(Y_i | b_i, z_i, a_i; \beta, \gamma)\, p(b_i | z_i, a_i; D)\, db_i = \int p(Y_i | b_i, z_i, a_i; \beta, \gamma)\, p(b_i; D)\, db_i, \quad (15.10)$$

where the last equality follows by the usual assumption that the random effects are iid and independent of all the covariates. The integral in (15.10) thus represents the form of the contribution of individual $i$ to (15.9). This depends on two distributions:

- $p(Y_i | b_i, z_i, a_i; \beta, \gamma)$, which characterizes the within-individual distribution conditional on covariates;
p(b_i; D), which characterizes the population.

Evidently, in order to base inference on β, D, and γ on (15.9), we must be willing to specify these distributions. We shall discuss this issue in subsequent sections.

The assumption that b_i is independent of x_i could in fact be weakened to allow dependence on a_i. For example, if a_i contains a dichotomous variable, e.g., smoker or nonsmoker, we might wish to allow that variation in the population of smokers differs from that in the population of nonsmokers. A modification of the model in (15.1) and (15.2) to accommodate this would be to assume that the b_i are not independent of a_i and to revise the assumption on the second-stage model in the following spirit:

• E(b_i | a_i) = 0 and var(b_i | a_i) = D(a_i), where D(a_i) depends on the value of elements in a_i.

• For example, if in fact the dependence is really only through the component s_i, say, of a_i, such that s_i = 1 or 0 depending on whether i smokes or not, then E(b_i | a_i) = E(b_i | s_i) = 0 and var(b_i | a_i) = var(b_i | s_i) = (1 − s_i) D_0 + s_i D_1. That is, the covariance matrix is allowed to differ in the populations of smokers and nonsmokers. There would of course thus be two k × k covariance matrices, D_0 and D_1, for which estimation of parameters would be required.

In the sequel, for simplicity, we will restrict attention to the model in which the b_i are taken to be iid. All of the methods we discuss are easily modified to allow for more complicated dependence on covariates.

The obvious practical problem with treating (15.9) as a likelihood function for the parameters (β, γ, D) is that evaluating it involves m k-dimensional integrations.

• The required integrals are almost always intractable, even when the random effects distribution is normal (but see below). Thus, evaluation of the integrals must be carried out via some numerical approximation.

• For example, one might implement a numerical integration approach, such as Gaussian quadrature, to do the integrals numerically. Quadrature rules rely on a deterministic approximation to an integral as a weighted sum of
the integrand evaluated at a specially chosen set of values, or abscissae, where the weights are also specially chosen; a full description is given in Monahan (2001, Chapter 10). The approximation thus requires the integrand to be evaluated at each abscissa, and then these values are weighted and summed. The accuracy of the rule for approximating the true value of the integral is predicated on the number of abscissae, L, say, which may be chosen by the user. The more abscissae, the better the approximation. For k = 1-dimensional integrals, it is not too computationally burdensome to carry out such numerical integration, as the abscissae need only be chosen in one dimension. The approximation often works well for L as small as 5 or 10. However, for k > 1, abscissae must be chosen in each dimension, and the integrand must be evaluated at each combination; for example, for k = 3 and L = 10, there are 10³ = 1000 function evaluations to perform. Thus, for larger k, the computational challenge increases greatly.

This feature is not so bad if all one wants to do is evaluate an integral once, and one knows β, γ, and D. However, in the context of maximizing (15.9) in the parameters, some sort of optimization scheme, such as a Newton–Raphson iterative approach, would have to be invoked. This would of course require that the likelihood be evaluated at each iteration of the maximization routine, which in turn would require m k-dimensional integrals to be evaluated at the current iterates for β, γ, and D. It should be clear that the computational burden could become overwhelming. If one reduced L to address this burden, accuracy of the integration, and hence accuracy of the evaluated likelihood, would be compromised. This problem, wherein computational challenges become overwhelming in higher dimensions, is often called the "curse of dimensionality."

Another approach to doing the integral is by Monte Carlo techniques. In particular, an integral of the form (15.10) may be considered as the expected value of
the function p(Y_i | b_i, z_i, a_i; β, γ) with respect to the probability distribution corresponding to p(b_i; D). Thus, a natural approximation would be to draw a large sample of size L, say, from p(b_i; D) and approximate the integral by the average of p(Y_i | b_i, z_i, a_i; β, γ) over all the L sampled values. In the context of maximizing the likelihood (15.9), this would require that, for each i = 1, …, m, this sampling scheme be carried out at the current values of β, γ, and D for each internal iteration of the optimization procedure. Again, the curse of dimensionality often limits the practical feasibility of this approach.

• In many problems, the computational burden for carrying out numerical integration may actually not be that great. But yet another issue may arise. The likelihood (15.9) is itself typically a rather ill-behaved, highly nonlinear function of the parameters. The result is that, even if the integration can be carried out accurately and efficiently, the optimization problem may be difficult to solve, as the likelihood surface may be replete with local maxima.

The bottom line is that, for practical reasons, some approaches to carrying out likelihood-based inference have focused on attempting to approximate the required integrals analytically rather than numerically, with the goal of achieving an analytical approximation expression that has a closed form. Alternatively, other approaches have been based on more ad hoc considerations. We will discuss both types of methods in Sections 15.3 and 15.4.

EXCEPTION — THE LINEAR CASE: Before we begin our discussion of these procedures, it is worth pointing out explicitly that the issue of integration can be irrelevant in a special case of the model in (15.1)–(15.2). In particular, suppose that the within-individual model is linear in the individual-specific parameters β_i; e.g., suppose that the model is f(z_i, β_i) = Z_i β_i for some design matrix Z_i depending on z_i. Further, suppose that the population model is β_i = A_i β + b_i, where A_i depends on a_i. Then note that
substituting the population model for β_i, we obtain

f_i(x_i, β, b_i) = X_i β + Z_i b_i,  (15.11)

where X_i = Z_i A_i. The model in (15.11) is a special case of a linear mixed effects model, which is characterized by linear dependence of the intra-individual model on β and b_i. This model usually supposes further that R_i(β_i, γ, z_i) does not depend on β_i; take as an example R_i(γ, z_i) = σ² I_{n_i}. Recall that we discussed such linear models in Section 13.5. Similar to the discussion there, we have that

E(Y_i | x_i) = E(X_i β + Z_i b_i | x_i) = X_i β.

Moreover, under the above assumption, it is straightforward to observe that (see 13.12)

var(Y_i | x_i) = σ² I_{n_i} + Z_i D Z_i^T.

Under the further conditions that p(Y_i | b_i, z_i, a_i; β, γ) is a normal density and b_i ~ N(0, D), it in fact follows that p(Y_i | x_i; β, γ, D) is also a normal density with this mean and variance. Thus, in this special case, the integral, and hence the likelihood, are in fact available in closed form; specifically, the likelihood has the form of the product of i = 1, …, m normal densities for Y_i | x_i, with

E(Y_i | x_i) = X_i β,  var(Y_i | x_i) = V(ξ, x_i) = σ² I_{n_i} + Z_i D Z_i^T,  (15.12)

where here ξ consists of the distinct elements of D and σ². SAS proc mixed and R/Splus lme() are devoted to fitting of such linear models (linear in fixed and random effects). Note that, although the model came about from a subject-specific perspective, (15.12) is a population-averaged (marginal) model; because of the linearity, the marginal moments are available in a closed form.

The loglikelihood under normality is, ignoring constants,

−1/2 Σ_{i=1}^m log |V(ξ, x_i)| − 1/2 Σ_{i=1}^m {Y_i − X_i β}^T V^{-1}(ξ, x_i) {Y_i − X_i β}.  (15.13)

Differentiation of (15.13) with respect to β and ξ yields the estimating equations

Σ_{i=1}^m X_i^T V^{-1}(ξ, x_i) (Y_i − X_i β) = 0,  (15.14)

(1/2) Σ_{i=1}^m [ {Y_i − X_i β}^T V^{-1}(ξ, x_i) {∂/∂ξ_k V(ξ, x_i)} V^{-1}(ξ, x_i) {Y_i − X_i β} − tr{ V^{-1}(ξ, x_i) ∂/∂ξ_k V(ξ, x_i) } ] = 0,  (15.15)

where k indexes elements of ξ. Equations (15.14) and (15.15) may be solved using the methods discussed in Chapter 14. Unfortunately, this is of course not possible in more general models.

PREDICTION OF RANDOM EFFECTS: Finally, we note a concept that will be exploited in subsequent sections. In a general subject-
specific model, the b_i represent effects that correspond to particular individuals, and the within-subject model f(x_i, β, b_i) describes the response for particular individuals. Thus, from a subject-specific point of view, to characterize the behavior of individuals, interest focuses not only on β but on the b_i as well.

Estimation of fixed parameters such as β may be carried out, in principle, via maximum likelihood as outlined above. The b_i are, of course, random vectors, so inference on them is a bit more complicated. From the standpoint of the population of individuals, the b_i represent random draws from the population. Thus, characterizing b_i is akin to predicting the value taken on by a random vector representing a random draw from the population. Inference on b_i is thus a prediction problem. Specifically, we would like to say something about the value taken on by the random vector b_i. Of course, we have some information about b_i through the response Y_i. Thus, it is natural to characterize this prediction problem as characterizing b_i given Y_i.

The usual approach to this problem is, in some sense, to use as the predictor for b_i the value that is "most likely" given the response that has been observed. This may be accomplished by finding the value of b_i that maximizes the posterior density, conditional on the covariates,

p(b_i | Y_i, x_i; β, γ, D) = p(Y_i | b_i, x_i; β, γ) p(b_i; D) / p(Y_i | x_i; β, γ, D),  (15.16)

where we have used the assumed independence of b_i and x_i in writing (15.16). That is, one would find the posterior mode. If β, γ, and D were known, (15.16) could be maximized as a function of b_i; clearly, the maximizing value would be a function of Y_i. Of course, it should be clear that the posterior mode may not be available in a closed form, so the maximization of (15.16) would have to be carried out numerically. The fixed parameters β, γ, and D are not known, but estimators for them are in principle available, e.g., from maximizing the marginal likelihood (15.10). A natural suggestion is to substitute the estimated values in (15.16) and then
maximize in b_i.

From a Bayesian point of view, the fixed parameters β, γ, and D and the random effects b_i are on equal footing. A Bayesian approach to estimation of all of these values would be to find the mode of the posterior distribution for the particular parameter, conditional only on the data. In particular, for b_i, conditioning on the covariates, the Bayesian approach would find the mode of

p(b_i | Y_i, x_i) = ∫ p(b_i | Y_i, x_i; β, γ, D) p(β, γ, D) dβ dγ dD.

Note that this approach requires that some very complicated integrations be carried out, and that a prior distribution for (β, γ, D) jointly be specified. The frequentist approach we have described above substitutes estimates for the fixed parameters instead. Thus, finding the mode of the posterior (15.16), treating β, γ, and D as fixed and substituting estimates, is often referred to as an empirical Bayes approach. The posterior mode for b_i so obtained is often referred to as the empirical Bayes estimator for b_i.

LINEAR CASE: In the linear model discussed on page 427, with f_i(x_i, β, b_i) = X_i β + Z_i b_i and normality assumed for both p(Y_i | b_i, x_i; β, γ) and p(b_i; D), it is straightforward to show that the posterior p(b_i | Y_i, x_i; β, γ, D) is also normal. Because the normal distribution is unimodal and symmetric, it is in fact true that the posterior mode, treating β, γ, and D as fixed, is exactly equal to the posterior mean conditional on the covariates, E(b_i | Y_i, x_i). It is a straightforward exercise to show that, under these conditions,

E(b_i | Y_i, x_i) = D Z_i^T V^{-1}(ξ, x_i) (Y_i − X_i β).  (15.17)

Note that this is a linear function of Y_i. In fact, if ξ were known, then V(ξ, x_i) would be known. In this case, (15.14) may be solved explicitly to yield the estimator for β given by

β̂ = { Σ_{i=1}^m X_i^T V^{-1}(ξ, x_i) X_i }^{-1} Σ_{i=1}^m X_i^T V^{-1}(ξ, x_i) Y_i.

Substituting β̂ into (15.17) still yields a linear function of the Y_i. It may be shown that, under these conditions, (15.17) with β̂ substituted has the smallest variance of any predictor that is a linear function of the Y_i. Thus, this expression is often referred to as the Best Linear Unbiased Predictor, or BLUP, for b_i.

• Of course, it is
unlikely that ξ would be known. In practice, the posterior mode would be found by substituting estimators in (15.17). The resulting expression is no longer exactly the BLUP, but it is approximately; it is often referred to as the estimated BLUP, or "EBLUP," to acknowledge that γ and D have been replaced by estimates.

Thus, in the case of a within-individual model that is linear in β and b_i, prediction of b_i is carried out by finding the EBLUP. In more general models that are nonlinear in b_i, it is not necessarily the case that the conditional expectation E(b_i | Y_i, x_i) and the posterior mode coincide. As mentioned above, the posterior mode with estimates substituted for the fixed parameters, i.e., the empirical Bayes estimator for b_i, would have to be evaluated numerically. We will see that empirical Bayes estimators/posterior modes play a role in some of the approximate methods for inference on β, γ, and D we now discuss.

15.3 Approximate methods based on individual estimates

Because implementation of likelihood inference poses difficulties, a promising alternative to consider is to use summaries of the responses Y_i on each individual to simplify the problem. An obvious such approach is to obtain, for each i, an estimator for the individual-specific regression parameter β_i based on Y_i, summarizing the responses on i in terms of the apparent information they contain on β_i, and then to use these estimators as "data" to estimate β and D characterizing the population model. Indeed, in the absence of more formal thinking, this approach is natural.

Note that this approach requires that sufficient information be available on each individual to allow β_i to be estimated from the data on i. That is, n_i must be sufficiently large to estimate the b-dimensional parameter β_i. In the following, we assume that this is the case.

The basic idea is as follows. For each i = 1, …, m, from the data (Y_i1, z_i1), …, (Y_{i n_i}, z_{i n_i}), fit the within-individual model (15.1), which we write here as

E(Y_i | z_i, β_i) = f_i(z_i, β_i),  var(Y_i | z_i, β_i) = R_i(β_i, γ, z_i).

Obtain
the individual estimators β̂_i, i = 1, …, m. We will discuss momentarily how this might be done in an advantageous manner; however, note that one may certainly fit the within-individual model to i's data using the methods in Chapters 2–12.

If we restrict attention to i only, then β_i is a fixed parameter. For n_i large, then, the usual asymptotic theory for univariate response models dictates that, conditional on β_i, β̂_i is approximately normal with mean β_i and a covariance matrix that depends on β_i and other parameters but may be estimated. Letting C_i denote this estimated covariance matrix, we may state this formally as, approximately,

β̂_i | β_i, x_i ∼ N(β_i, C_i).  (15.18)

This is of course independent of a_i, so we may write this equivalently as being conditional on x_i. More generally, we have E(β̂_i | β_i, x_i) ≈ β_i and var(β̂_i | β_i, x_i) ≈ C_i.

Now, the second-stage population model (15.2) relates β_i to the parameters β and D. Consider the special case of (15.8),

β_i = A_i β + b_i, where E(b_i) = 0 and var(b_i) = D.

Note that if in fact b_i ~ N(0, D), then

β_i | a_i ~ N(A_i β, D).  (15.19)

In any event, E(β_i | a_i) = A_i β and var(β_i | a_i) = var(b_i | a_i) = D. Combining (15.18) and (15.19), it is clear that, approximately,

β̂_i | x_i ∼ N(A_i β, D + C_i).  (15.20)

This is easy to see by writing (15.18) as

β̂_i = β_i + e_i* = A_i β + b_i + e_i*,  e_i* | β_i, x_i ∼ N(0, C_i).  (15.21)

More generally, we have

E(β̂_i | x_i) = E{ E(β̂_i | β_i, x_i) | x_i } = E(β_i | x_i) = A_i β  (15.22)

and

var(β̂_i | x_i) = var{ E(β̂_i | β_i, x_i) | x_i } + E{ var(β̂_i | β_i, x_i) | x_i } = var(β_i | x_i) + E(C_i | x_i) = D + C_i,  (15.23)

where the last equality follows from treating C_i as known, as a function of the covariates.

• The distributional approximation (15.20), and more generally the mean-covariance approximation in (15.22) and (15.23), have the form of a population-averaged (marginal) model for the "data" β̂_i, conditional on covariates. This suggests that the regression parameter β and the covariance parameters in D may be estimated using generalized estimating equation methods. We will say more about this shortly.

In the event that the second-stage model for β_i is nonlinear in b_i, this idea may be extended by making a further linear approximation to the function d(a_i, β, b_i). In this section, we restrict
attention to linear second-stage population models β_i = A_i β + b_i, as above, which may often be specified in lieu of a nonlinear model by suitable reparameterization of the function f, as we discussed in the previous section. See Davidian and Giltinan (1995, Section 5.3.4) for more on this.

To implement the above scheme, we need:

1. A method for estimating the β_i based on the individual data that acknowledges that the parameter γ characterizing intra-individual variation may be the same across i. As we have discussed previously, this is a reasonable assumption when, for example, the main source of intra-individual variation is due to error in a common measuring device or method.

2. An approach to fitting the approximate mean-covariance model (15.22) and (15.23) to estimate β and D.

1. POOLED GLS ALGORITHM: Assuming that γ represents parameters characterizing within-individual variation that are common across i, a natural approach to estimating the β_i, i = 1, …, m, is to estimate γ by pooling the information on γ from all subjects, and then to use the common estimate for γ to estimate the β_i. For definiteness, suppose that R_i(β_i, γ, z_i) is a diagonal matrix, as would be the case if we believed that correlation due to within-individual sources is negligible. Assuming, then, that the Y_ij | z_ij are independent conditional on β_i, consider the stage 1 within-individual model given by

E(Y_ij | z_ij, β_i) = f(z_ij, β_i),  var(Y_ij | z_ij, β_i) = σ² g²(β_i, θ, z_ij),  (15.24)

where, as in the univariate problem, we regard the Y_ij as conditionally independent given z_ij and β_i, and γ = (σ, θ^T)^T. Note that γ is common to all i. Extension of the method we now describe to nondiagonal R_i(β_i, γ, z_i) is discussed in Section 5.2 of Davidian and Giltinan (1995).

If θ were known, then β_i could be estimated for each i using one of the methods based on linear or quadratic estimating equations discussed in previous chapters. For definiteness, consider GLS for this purpose. For model (15.24), GLS could be carried out via the three-step algorithm, or IRWLS. If θ
is not known, then an obvious approach is to estimate it from the data. Because all m individuals have information on σ² and θ, consider the following approach. Recall that we motivated the PL methods by considering the normal loglikelihood. If we condition on all of β_1, …, β_m, then, assuming that the Y_i | z_i, β_i are normal, we may write the joint loglikelihood for σ and θ across all m individuals as the sum of the individual loglikelihoods; i.e., ignoring constants,

− Σ_{i=1}^m Σ_{j=1}^{n_i} { log σ + log g(β_i, θ, z_ij) } − 1/2 Σ_{i=1}^m Σ_{j=1}^{n_i} {Y_ij − f(z_ij, β_i)}² / {σ² g²(β_i, θ, z_ij)}.  (15.25)

Differentiating (15.25) with respect to σ² and θ yields the estimating equation for σ² and θ given by

Σ_{i=1}^m Σ_{j=1}^{n_i} [ {Y_ij − f(z_ij, β_i)}² / {σ² g²(β_i, θ, z_ij)} − 1 ] ν_θ(β_i, θ, z_ij) = 0,  (15.26)

where ν_θ is defined as previously. The estimating equation (15.26) pools across the data from all m individuals to estimate the common σ and θ.

These developments suggest that PL estimation of σ and θ could be incorporated into the three-step GLS algorithm as follows:

(i) For each i = 1, …, m, estimate β_i by β̂_i^{(0)}, where β̂_i^{(0)} is some initial estimate, e.g., OLS. That is, estimate β_i for each i using OLS applied to i's data. Set k = 0.

(ii) Estimate θ by substituting β̂_i^{(k)} for β_i, for each i = 1, …, m, into (15.26) and solving in σ and θ. Call the resulting estimator for θ θ̂^{(k)}. Form weights ŵ_ij = g^{-2}(β̂_i^{(k)}, θ̂^{(k)}, z_ij) for i = 1, …, m, j = 1, …, n_i.

(iii) For each i = 1, …, m, reestimate β_i by solving in β_i

Σ_{j=1}^{n_i} ŵ_ij {Y_ij − f(z_ij, β_i)} f_β(z_ij, β_i) = 0;

that is, for each i, estimate β_i using WLS with estimated weights applied to i's data, where the weights are constructed using the previous estimate of β_i and the common estimate for θ from (ii). Set k = k + 1 and return to (ii).

As in the case of a single individual, the above algorithm could be iterated C times or to convergence (C = ∞). At the conclusion, the resulting estimators are β̂_1, …, β̂_m for each individual, the final estimator θ̂, say, and an estimator for σ² that may be obtained from (15.26) as

σ̂² = N^{-1} Σ_{i=1}^m Σ_{j=1}^{n_i} g^{-2}(β̂_i, θ̂, z_ij) {Y_ij − f(z_ij, β̂_i)}²,  (15.27)

where N = Σ_{i=1}^m n_i is the total number of observations. In (15.27), by analogy to the bias correction in the univariate case, the divisor N
could be replaced by Σ_{i=1}^m (n_i − b) = N − mb; here, mb parameters (m b-dimensional β_i) have been estimated from the N observations.

Of course, the PL approach at step (ii), where (15.26) is solved to obtain the estimator for θ, could be replaced by the analogous pooled version of any of the other estimating equations discussed in Section 6.5, provided one is willing to make the necessary assumptions. Alternatively, the pooled version of the restricted maximum pseudolikelihood equations discussed in Section 7.4 could be used, which can be shown to yield the estimator for σ² in (15.27) with N replaced by N − mb.

Implementation is no more difficult than when estimating variance parameters from data for a single individual. In particular, one may use the same computational trick introduced on page 130. The loglikelihood objective function (15.25) leads to the pooled PL estimation method; thus, intuition suggests that profiling out the parameter σ² would lead to a similar approach. Indeed, if the estimators β̂_i are substituted in (15.25) as a function of θ and σ, this objective function is of exactly the same form as for a single individual; the only difference is that the summation is over all N observations. The same manipulations used to derive the trick in the individual case may be carried through unchanged. This is left as an exercise for the reader; details are given in Davidian and Giltinan (1993) and Davidian and Giltinan (1995, Section 5.2.2).

An alternative to the above pooled algorithm when θ is unknown would be to carry out the usual three-step GLS-PL algorithm to estimate β_i separately for each i, estimating θ using only the data from i.

• Intuition suggests that, if we believe that the variance parameters are indeed common across i, pooling information from all m individuals should result in a more precise characterization of σ and θ.

• However, according to the first-order "folklore" theory, how the variance parameters are estimated should make no difference in the precision of the resulting
estimators for the β_i. Nonetheless, for small n_i, given the implications of second-order theory discussed in Chapter 11, intuition suggests that the pooled algorithm may yield estimators for β_i that are more efficient than those obtained by using only the data on each individual to estimate σ and θ. Intuition would suggest further that, if we use the estimates of the β_i as "data" in order to estimate β and D, more efficient estimation of β and D would be expected if the estimates were more precise (i.e., so that the "data" are less variable). Davidian and Giltinan (1993) provide simulation evidence verifying this intuition.

In order to use the β̂_i as "data" to estimate β and D, the developments leading up to (15.22) and (15.23) show that we also need the estimated covariance matrices C_i. In the case of the pooled GLS scheme given above, we know from the large-sample results in Section 9.2 that we may estimate the covariance matrix of β̂_i, conditional on β_i and the covariates, using the formula

C_i = σ² [ Σ_{j=1}^{n_i} g^{-2}(β_i, θ, z_ij) f_β(z_ij, β_i) f_β^T(z_ij, β_i) ]^{-1}.  (15.28)

Here we are assuming that the variance function is believed to be correctly specified; if this were in doubt, one could use the robust "sandwich" correction formula given in Section 9.4. Thus, we may obtain Ĉ_i by substitution of the pooled estimates σ̂² and θ̂ for σ² and θ, and of β̂_i for β_i, in (15.28).

• If one uses a standard nonlinear regression package to carry out step (iii), the resulting estimated sampling covariance matrix output by the software will be based on (15.28). The weights g^{-2}(β_i, θ, z_ij) in (15.28) will be replaced by the values ŵ_ij input from the most recent invocation of step (ii). However, software such as SAS proc nlin and R/Splus nls will automatically estimate σ² using only the data available; that is, these programs will use the estimator

σ̂_i² = (n_i − b)^{-1} Σ_{j=1}^{n_i} g^{-2}(β̂_i, θ̂, z_ij) {Y_ij − f(z_ij, β̂_i)}²

rather than the pooled estimate.

• Thus, if one believes that σ² is common across i, the estimated covariance matrix output from the software for β̂_i should be modified by multiplying by σ̂²/σ̂_i² to obtain
the more appropriate Ĉ_i. Of course, the resulting estimate Ĉ_i will still use the previous estimates of β_i and θ in the weights; in the case C = ∞, this should not be of concern.

2. ESTIMATION OF β AND D: Given the pairs (β̂_i, Ĉ_i) obtained from the approach outlined above, we now wish to estimate β and D in the approximate mean-covariance model given in (15.22) and (15.23); namely,

E(β̂_i | x_i) = A_i β,  var(β̂_i | x_i) = D + C_i.  (15.29)

By analogy to the methods for population-averaged (marginal) models in Chapter 14, a natural approach to estimating β and D would be to solve the estimating equations

Σ_{i=1}^m A_i^T (D + C_i)^{-1} (β̂_i − A_i β) = 0,  (15.30)

(1/2) Σ_{i=1}^m [ (β̂_i − A_i β)^T (D + C_i)^{-1} { ∂/∂ω_k (D + C_i) } (D + C_i)^{-1} (β̂_i − A_i β) − tr{ (D + C_i)^{-1} ∂/∂ω_k (D + C_i) } ] = 0,  k = 1, …, b(b+1)/2,  (15.31)

where ω = (ω_1, …, ω_{b(b+1)/2})^T is the vector of distinct elements of the symmetric matrix D (b × b).

• The equation (15.30) for β is linear in the "data." Note that it would be fruitless to consider a quadratic such equation, as the covariance model does not depend on β. In fact, it follows that the estimator for β satisfies

β̂ = { Σ_{i=1}^m A_i^T (D + C_i)^{-1} A_i }^{-1} Σ_{i=1}^m A_i^T (D + C_i)^{-1} β̂_i.

• The equation (15.31) for the b(b+1)/2 distinct elements of D is the quadratic such equation corresponding to the Gaussian "working" assumption. In the case where in fact one is willing to make the normality assumption (15.19) for β_i | a_i, along with the approximate normality result (15.18), which together lead to the normal approximation (15.20) for β̂_i | x_i, the equations (15.30) and (15.31) are the estimating equations that would be obtained by differentiating the assumed normal loglikelihood (ignoring constants)

−(1/2) Σ_{i=1}^m log |D + C_i| − (1/2) Σ_{i=1}^m (β̂_i − A_i β)^T (D + C_i)^{-1} (β̂_i − A_i β)  (15.32)

with respect to β and D. This follows by the same manipulations used in Sections 14.2 and 14.3.

In the literature on this topic, the equations (15.31) are often expressed in an alternative form, which is possible because of the simple structure of the covariance matrix D + C_i. The form is obtained by using the following results for a symmetric matrix Σ:

∂/∂Σ log |Σ| = Σ^{-1},  ∂/∂Σ (w − μ)^T Σ^{-1} (w − μ) = −Σ^{-1} (w − μ)(w − μ)^T Σ^{-1}.

With D + C_i playing the role of
Σ, and noting that, by the chain rule, differentiation with respect to D + C_i and D are the same, it is straightforward to obtain that setting the derivative of (15.32) with respect to D equal to zero yields

Σ_{i=1}^m (D + C_i)^{-1} − Σ_{i=1}^m (D + C_i)^{-1} (β̂_i − A_i β)(β̂_i − A_i β)^T (D + C_i)^{-1} = 0.

Using the relationships

(D + C_i)^{-1} = D^{-1} − D^{-1} (D^{-1} + C_i^{-1})^{-1} D^{-1},  (D + C_i)^{-1} = C_i^{-1} − C_i^{-1} (D^{-1} + C_i^{-1})^{-1} C_i^{-1},

it is possible to show (and is left as an exercise) that this may be reexpressed as

D = m^{-1} Σ_{i=1}^m (D^{-1} + C_i^{-1})^{-1} + m^{-1} Σ_{i=1}^m (D^{-1} + C_i^{-1})^{-1} C_i^{-1} (β̂_i − A_i β)(β̂_i − A_i β)^T C_i^{-1} (D^{-1} + C_i^{-1})^{-1}.  (15.33)

Solving the equations (15.30) and (15.31)/(15.33) may be carried out in several ways that allow the user to exploit popular software; here we discuss two of these.

MIXED MODEL INTERPRETATION: As noted in Section 15.2, SAS proc mixed and R/Splus lme() may be used to fit so-called linear mixed models. These programs consider models of the general form (see page 427)

E(Y_i | x_i, b_i) = X_i β + Z_i b_i,  var(Y_i | x_i, b_i) = R_i(γ, z_i),  b_i ~ N(0, D).

Under the additional assumption that the distribution of Y_i | x_i, b_i is normal with these moments, it is straightforward to see that the distribution of Y_i | x_i is also normal, with moments

E(Y_i | x_i) = X_i β,  var(Y_i | x_i) = Z_i D Z_i^T + R_i(γ, z_i).  (15.34)

Thus, the loglikelihood for β, γ, and D is given by (ignoring constants)

−1/2 Σ_{i=1}^m log |Z_i D Z_i^T + R_i(γ, z_i)| − 1/2 Σ_{i=1}^m (Y_i − X_i β)^T {Z_i D Z_i^T + R_i(γ, z_i)}^{-1} (Y_i − X_i β).  (15.35)

Both of these programs solve the estimating equations for β, γ, and D obtained by differentiating this loglikelihood. It is easy to recognize that the model for the "data" β̂_i in (15.22) and (15.23) is of this same form, identifying X_i = A_i, Z_i = I_b, and R_i(γ, z_i) = C_i, and that the loglikelihood (15.32) is of the same form as (15.35) above. Thus, it should be possible, in principle, to use proc mixed or lme() to solve the equations (15.30) and (15.31)/(15.33). A slight twist is that the matrix C_i contains no unknown parameters but rather is completely known; this requires some finessing to "trick" the software. We demonstrate this in Section 15.6.

An alternative way of fitting linear mixed models is through the use of the so-called Expectation–Maximization (EM) algorithm.
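Whatever software route is taken, the fixed point being sought is easy to exhibit directly: alternate between solving (15.30) for β at the current D and updating D via (15.33). The sketch below (Python, not one of the two approaches discussed here) does this in the simplest scalar case (b = 1, A_i = 1); the individual estimates β̂_i and sampling variances C_i are hypothetical numbers chosen so the answer can be checked in closed form.

```python
# Direct iteration of (15.30) and (15.33) in the scalar case (b = 1, A_i = 1).
# bhat[i] plays the role of the individual estimate beta_hat_i, C[i] its
# estimated sampling variance C_i; both are hypothetical illustrative values.

def gts_scalar(bhat, C, D0=1.0, n_iter=2000):
    """Iterate (15.30) for beta and (15.33) for D; returns (beta_hat, D_hat)."""
    m = len(bhat)
    D = D0
    for _ in range(n_iter):
        # (15.30): weighted least squares with weights (D + C_i)^{-1}
        w = [1.0 / (D + Ci) for Ci in C]
        beta = sum(wi * bi for wi, bi in zip(w, bhat)) / sum(w)
        # (15.33): with V_i = (D^{-1} + C_i^{-1})^{-1},
        # D <- m^{-1} sum_i { V_i + [V_i C_i^{-1} (bhat_i - beta)]^2 }
        V = [1.0 / (1.0 / D + 1.0 / Ci) for Ci in C]
        D = sum(Vi + (Vi / Ci * (bi - beta)) ** 2
                for Vi, Ci, bi in zip(V, C, bhat)) / m
    return beta, D

bhat = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical individual estimates
C = [0.5] * 5                      # hypothetical (equal) sampling variances
beta, D = gts_scalar(bhat, C)
print(round(beta, 4), round(D, 4))  # prints: 3.0 1.5
```

With equal C_i = c, the loglikelihood (15.32) is that of iid N(β, D + c) observations, so the solution is β̂ = sample mean of the β̂_i and D̂ = (ML sample variance) − c = 2.0 − 0.5 = 1.5, which the iteration reproduces. With unequal C_i, no closed form is available, and the iteration (or the software above) is needed.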
The EM algorithm is nothing more than a computational method to maximize an objective function, so it may be applied to the normal loglikelihood for the linear mixed effects model. The background for and derivation of the EM algorithm are beyond our scope here; the seminal paper describing the idea is Dempster, Laird, and Rubin (1977), and Laird and Ware (1982) and Davidian and Giltinan (1995, Section 3.4) describe its implementation in the particular case of linear mixed effects models to maximize (15.35).

Given this interpretation of our problem as a linear mixed model with "data" β̂_i, an alternative is thus to maximize (15.32), i.e., solve (15.30) and (15.31)/(15.33), using the EM algorithm. The algorithm is as follows:

(i) Obtain starting values as

β̂_{(0)} = m^{-1} Σ_{i=1}^m β̂_i,  D̂_{(0)} = (m − 1)^{-1} Σ_{i=1}^m (β̂_i − A_i β̂_{(0)})(β̂_i − A_i β̂_{(0)})^T.

Set k = 0.

(ii) E-step: Produce current empirical Bayes estimates of the β_i, i = 1, …, m, given by

β̃_{i(k+1)} = (D̂_{(k)}^{-1} + C_i^{-1})^{-1} (C_i^{-1} β̂_i + D̂_{(k)}^{-1} A_i β̂_{(k)}).

It may be shown that this expression is in fact of the form of the EBLUP for β_i = A_i β + b_i under the model for the "data" β̂_i; that is, of the form of E(β_i | β̂_i) with substitution of estimators, hence the designation "empirical Bayes."

(iii) M-step: Obtain updated estimates as

β̂_{(k+1)} = Σ_{i=1}^m W_{ik} β̃_{i(k+1)},  W_{ik} = ( Σ_{i=1}^m A_i^T D̂_{(k)}^{-1} A_i )^{-1} A_i^T D̂_{(k)}^{-1},

D̂_{(k+1)} = m^{-1} Σ_{i=1}^m (D̂_{(k)}^{-1} + C_i^{-1})^{-1} + m^{-1} Σ_{i=1}^m (β̃_{i(k+1)} − A_i β̂_{(k+1)})(β̃_{i(k+1)} − A_i β̂_{(k+1)})^T.

Set k = k + 1 and return to (ii).

The algorithm is iterated to convergence (e.g., until the relative change in all parameters is below some prespecified tolerance). The results should be identical to those obtained by direct maximization, e.g., using the software mentioned above. We demonstrate the use of the EM algorithm to estimate β and D based on these "data" in Section 15.6.

STANDARD ERRORS FOR β̂: Viewing the approximate model in (15.29) as a population-averaged (marginal) mean-covariance model, it follows from the "folklore" theory of Section 14.5 that, for large m, approximately,

β̂ ∼ N( β, [ Σ_{i=1}^m A_i^T (D + C_i)^{-1} A_i ]^{-1} ).  (15.36)

Standard errors for the components of β̂ may of course be estimated by the square roots of
the diagonal elements of the covariance matrix of the approximate sampling distribution (15.36).

TERMINOLOGY: The method for estimating the population parameters β and D based on using individual estimators for the β_i as "data" is often referred to as a two-stage approach, for obvious reasons.

• In stage 1 of the estimation, the individual estimates and their estimated covariance matrices are obtained using individual-specific estimation techniques, pooling information on common intra-individual covariance parameters where appropriate.

• In stage 2, these "data" are the basis for estimation of the population parameters via population-averaged (marginal) model estimating equation techniques. Here, we have discussed the particular choice of estimating equations involving a linear equation for β and a quadratic equation for D using the Gaussian "working" assumption.

In the pharmacokinetics literature, this basic two-stage approach has been called the Global Two-Stage method; see Davidian and Giltinan (1995, Section 5.3.2). The rationale for this name is unknown, but one function of the name is to distinguish this approach from a simpler, but possibly misleading, one that has been referred to in this literature as the Standard Two-Stage method. Treating (15.21),

β̂_i = A_i β + b_i + e_i*,

as a multivariate linear regression model, but ignoring the heterogeneity of the covariance matrices var(e_i* | β_i, x_i) = C_i, one would obtain the estimator

β̂_STS = ( Σ_{i=1}^m A_i^T A_i )^{-1} Σ_{i=1}^m A_i^T β̂_i.

Note that in the case A_i = I_b, this reduces to the sample mean of the β̂_i. The covariance matrix D could be estimated by the "residual mean square,"

D̂_STS = (m − 1)^{-1} Σ_{i=1}^m (β̂_i − A_i β̂_STS)(β̂_i − A_i β̂_STS)^T,

which reduces to the sample covariance matrix of the β̂_i when A_i = I_b. It is straightforward to show that, under the approximate model (15.29), the estimator β̂_STS is expected to be inefficient relative to that discussed above. More disturbingly, D̂_STS will be a biased estimator for D. Demonstration of these features is left as an exercise.

15.4
Approximate methods based on linearization

Methods based on individual estimators, discussed in Section 15.3, are attractive because they break the problem of estimation in the rather complicated subject-specific nonlinear mixed effects model down into two stages, each of which may be carried out using standard methods, with possible modifications such as pooling to estimate within-individual covariance parameters. The potential drawback of these methods is that they require n_i to be large enough on each of the m individuals so that estimation of the individual β_i is not only feasible, but so that the analyst would have faith that the large-sample approximation for the distribution of β̂_i | β_i, x_i is reasonable, or at least that its mean and covariance matrix are good approximations.

It is often the case in practice that the n_i are not large enough for these conditions to be met. In some instances, although a number of the m individuals have large enough n_i to allow individual estimation, some do not. Disregarding these individuals opens the possibility of biased estimation, as the remaining individuals may no longer be a random sample from the population. In other cases, the sampling design for all individuals yields n_i that are too small in all cases. In these situations, the methods of Section 15.3 are not an option.

An alternative class of methods is motivated instead by returning to the marginal likelihood (15.9), which involves the product of the terms

p(Y_i | x_i; β, γ, D) = ∫ p(Y_i | b_i, z_i, a_i; β, γ) p(b_i; D) db_i.  (15.37)

The idea behind the methods that we consider in this section is to approximate the integral in (15.37) by a closed-form expression. As the likelihood for β, γ, and D is the product of such integrals, this would then provide an analytical, closed-form expression for an objective function to be maximized. Differentiation of the corresponding approximate loglikelihood can then lead to estimating equations. Further, the approximate moments corresponding to the closed-form
approximate loglikelihood may also be used to form estimating equations in the spirit of Chapter 14.

FIRST-ORDER LINEARIZATION METHODS: The simplest such methods attempt to approximate (15.37) for each $i$ by referring directly to the stage 1 individual model
$$E(Y_i \mid x_i, b_i) = f_i(x_i, \beta, b_i), \qquad \mathrm{var}(Y_i \mid x_i, b_i) = R_i(\gamma, x_i, b_i).$$
Assuming that $R_i(\gamma, x_i, b_i)$ is positive definite, letting $R_i^{1/2}(\gamma, x_i, b_i)$ be its square root matrix (e.g., Cholesky decomposition), and defining
$$e_i = R_i^{-1/2}(\gamma, x_i, b_i)\{Y_i - f_i(x_i, \beta, b_i)\},$$
we may write the model as
$$Y_i = f_i(x_i, \beta, b_i) + R_i^{1/2}(\gamma, x_i, b_i)\, e_i. \quad (15.38)$$
Note that $e_i \mid x_i, b_i$ has mean 0 and identity covariance matrix.

Now the difficulty with the integration in (15.37) is the fact that $b_i$ enters in a nonlinear fashion. This is similarly true even if we are only interested in obtaining the marginal mean and covariance matrix of $Y_i$. Thus, the idea is to approximate the model (15.38) by one that is linear in the $b_i$. The simplest way to do this is to approximate (15.38) about the mean of the $b_i$, 0. This approach was advocated in the pharmacokinetics literature in the early 1980s; see, for example, Beal and Sheiner (1985). Such an approximation has also been discussed, in the particular context of models where the stage 1 individual model for $Y_i \mid x_i, b_i$ is of the generalized linear model type, by Breslow and Clayton (1993), Zeger, Liang, and Albert (1988), and others. In this case, the nonlinear mixed effects model may be called a generalized linear mixed model, although this latter term has been used to refer more broadly to a wider class of models than the ones we consider here.

By Taylor series expansion of (15.38) about $b_i = 0$, we have
$$Y_i \approx f_i(x_i, \beta, 0) + \frac{\partial}{\partial b_i} f_i(x_i, \beta, 0)(b_i - 0) + R_i^{1/2}(\gamma, x_i, 0)\, e_i + \left\{\frac{\partial}{\partial b_i} R_i^{1/2}(\gamma, x_i, 0)(b_i - 0)\right\} e_i$$
$$\approx f_i(x_i, \beta, 0) + Z_i(x_i, \beta, 0)\, b_i + R_i^{1/2}(\gamma, x_i, 0)\, e_i. \quad (15.39)$$
In (15.39), $Z_i(x_i, \beta, 0) = \partial/\partial b_i\, f_i(x_i, \beta, b_i)\big|_{b_i=0}$. The cross-product term involving $b_i e_i$ is disregarded as small relative to the leading three terms, as both $b_i$ and $e_i$ have mean 0 conditional on $x_i$.

The approximate model in (15.39) suggests immediately the following approximate marginal moments:
$$E(Y_i \mid x_i) \approx f_i(x_i, \beta, 0), \qquad \mathrm{var}(Y_i \mid x_i) \approx Z_i(x_i, \beta, 0)\, D\, Z_i^T(x_i, \beta, 0) + R_i(\gamma, x_i, 0). \quad (15.40)$$
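As a concrete sketch of the approximate moments (15.40), the following Python fragment forms $f_i(x_i, \beta, 0)$ and $Z_i D Z_i^T + R_i$ for a hypothetical exponential-decay model; the model, parameter values, and the finite-difference computation of $Z_i$ are all illustrative assumptions, not part of the notes.

```python
import numpy as np

# Hypothetical individual-level mean, nonlinear in the random effects:
# f_ij = exp(beta1 + b1) * exp(-exp(beta2 + b2) * t_ij)
def f(beta, b, t):
    return np.exp(beta[0] + b[0]) * np.exp(-np.exp(beta[1] + b[1]) * t)

def Z_matrix(beta, t, eps=1e-6):
    """n_i x k matrix of derivatives of f with respect to b, at b = 0,
    computed by central finite differences."""
    k = 2
    Z = np.zeros((len(t), k))
    for j in range(k):
        e = np.zeros(k); e[j] = eps
        Z[:, j] = (f(beta, e, t) - f(beta, -e, t)) / (2 * eps)
    return Z

beta = np.array([1.0, -1.5])        # fixed effects (illustrative)
D = np.diag([0.09, 0.04])           # between-individual covariance (illustrative)
t = np.array([0.5, 1.0, 2.0, 4.0])  # within-individual time points
R = 0.01 * np.eye(len(t))           # working intra-individual covariance at b = 0

# First-order approximate marginal moments (15.40): mean f(x, beta, 0) and
# covariance Z D Z^T + R, everything evaluated at b = 0.
mean_approx = f(beta, np.zeros(2), t)
Z = Z_matrix(beta, t)
V_approx = Z @ D @ Z.T + R
```

For this particular model, $\partial f/\partial b_1 = f$, so the first column of $Z$ equals the approximate mean itself, a useful check on the finite-difference computation.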
Comparison of the approximate marginal moments (15.40) to those for the linear mixed effects model given in (15.34) reveals that they have a similar form, which is expected, as the moments in (15.40) follow from a linear approximation to the nonlinear mixed effects model. Note that the marginal mean approximation involves evaluation of the function $f_i$ at $b_i = 0$ for all individuals. Clearly, this approximation ignores the "individuality" of the mean response.

The approximation (15.40) suggests a number of ways to estimate $\beta$, the distinct elements of $D$, and $\gamma$.

• Under the assumptions of normality of the conditional distributions $Y_i \mid x_i, b_i$ and $b_i \sim N(0, D)$, it follows from (15.39) that the approximate marginal distribution for $Y_i \mid x_i$ is normal, with mean and covariance matrix as in (15.40). Under these conditions, then, one approach would be to approximate the integral (15.37) by the normal density with these moments. In turn, the approximate marginal likelihood for $\beta$, $D$, and $\gamma$ would be the product of these densities. Note from (15.40) that the approximate marginal covariance matrix depends on $\beta$. Thus, from the developments in Section 14.3, differentiation of the approximate marginal loglikelihood would result in a quadratic estimating equation for $\beta$ and a quadratic estimating equation for $\gamma$ and the distinct elements of $D$, incorporating the Gaussian working assumption. In particular, this would lead to estimating equations of the GEE-2 form with the Gaussian working assumption. This is exactly the strategy implemented in the software package nonmem, a suite of Fortran programs that is heavily focused on pharmacokinetic analysis. This method is also available in SAS proc nlmixed.

• Alternatively, one could consider the GEE-1 approach, using instead the appropriate linear estimating equation for $\beta$ coupled with the quadratic equations for $D$ and $\gamma$. This is implemented in the SAS macro nlinmix.

• Standard errors for the estimator for $\beta$ would be obtained by
applying the usual GEE theory.

Note that, technically speaking, estimating equations of the form given above cannot be expected to be unbiased. For example, the approximate marginal mean is clearly not equal to the true marginal mean, given by
$$E(Y_i \mid x_i) = \int f_i(x_i, \beta, b_i)\, p(b_i; D)\, db_i.$$
Replacing $b_i$ by its mean 0 obviously will not yield this integral, but rather is a crude approximation to it. Hence, as with the methods based on individual estimates in Section 15.3, which also rely on approximations, there is no reason to expect that the estimators so derived need be consistent.

• Amazingly, it turns out that in some applications where the magnitude of the inter-individual variation (represented by $D$) is not too great, estimators that are nearly unbiased in finite samples (finite $m$) may result. This has been observed via extensive simulation studies in the area of pharmacokinetics. Although this is fortuitous, it is by no means necessary.

MORE REFINED APPROXIMATIONS: As noted, the foregoing approximation in some sense removes the "individual" aspect of the model by replacing $b_i$ by 0 for all $i$. Approximations that may be more accurate may be deduced and may be motivated in different ways. It is beyond the scope of the discussion here to give a full, detailed treatment of the various strategies that have been proposed. Rather, we just give a heuristic motivation for the general approach and refer the reader to the literature for more details, alternative derivations, and variations on this theme.

One such approximation was advocated by Lindstrom and Bates (1990). A simple motivation for their idea follows from a linearization argument similar to that above. Begin again with the model (15.38), repeated here:
$$Y_i = f_i(x_i, \beta, b_i) + R_i^{1/2}(\gamma, x_i, b_i)\, e_i.$$
Instead of approximating this model by expansion about $b_i = 0$, Lindstrom and Bates argued that expansion about a value "closer to" $b_i$ might result in a more accurate approximation.
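The inadequacy of replacing $b_i$ by 0 in the marginal mean can be seen in a one-dimensional Monte Carlo sketch; the lognormal example and all numbers below are illustrative assumptions, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scalar model, nonlinear in the random effect b:
# f(beta, b) = exp(beta + b), with b ~ N(0, d).
beta, d = 1.0, 0.25

b = rng.normal(0.0, np.sqrt(d), size=200_000)

first_order_mean = np.exp(beta)          # first-order approximation: evaluate at b = 0
true_mean_mc = np.exp(beta + b).mean()   # Monte Carlo estimate of E exp(beta + b)

# For this lognormal case, E exp(beta + b) = exp(beta + d/2) exactly, so the
# first-order approximation is biased low by the factor exp(d/2).
exact_mean = np.exp(beta + d / 2)
```

The gap factor $\exp(d/2)$ grows with the inter-individual variance $d$, consistent with the observation above that the crude approximation behaves best when the magnitude of $D$ is not too great.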
By analogy to the steps above, expanding about $\hat{b}_i$ "close to" $b_i$, we obtain
$$Y_i \approx f_i(x_i, \beta, \hat{b}_i) + Z_i(x_i, \beta, \hat{b}_i)(b_i - \hat{b}_i) + R_i^{1/2}(\gamma, x_i, \hat{b}_i)\, e_i. \quad (15.41)$$
Here, the cross-product term involving $(b_i - \hat{b}_i)e_i$ has been disregarded as negligible. Treating $\hat{b}_i$ as a fixed constant, (15.41) suggests the following approximate marginal moments:
$$E(Y_i \mid x_i) \approx f_i(x_i, \beta, \hat{b}_i) - Z_i(x_i, \beta, \hat{b}_i)\hat{b}_i,$$
$$\mathrm{var}(Y_i \mid x_i) \approx Z_i(x_i, \beta, \hat{b}_i)\, D\, Z_i^T(x_i, \beta, \hat{b}_i) + R_i(\gamma, x_i, \hat{b}_i). \quad (15.42)$$
From the approximate moments given in (15.42), if $\hat{b}_i$ were known, we could proceed as for the cruder approximation about 0 to deduce estimating equations of types GEE-1 or GEE-2 that could be solved to yield estimators for $\beta$, $D$, and $\gamma$.

Clearly, to be practically useful, this approach requires some means of identifying a suitable value $\hat{b}_i$ to substitute in (15.42). Lindstrom and Bates (1990) focus on the situation where $b_i \sim N(0, D)$ and $p(Y_i \mid x_i, b_i; \beta, \gamma)$ is a normal density. Under these conditions, they suggest that a suitable choice would be the posterior mode for $b_i$. In general, the posterior density $p(b_i \mid Y_i, x_i; \beta, \gamma, D)$ is proportional to $p(Y_i \mid x_i, b_i; \beta, \gamma)\, p(b_i; D)$. Under the normality assumptions for both of these, the posterior mode maximizes the logarithm of this product, which, ignoring constants, has the form
$$-\tfrac{1}{2}\log|R_i(\gamma, x_i, b_i)| - \tfrac{1}{2}\{Y_i - f_i(x_i, \beta, b_i)\}^T R_i^{-1}(\gamma, x_i, b_i)\{Y_i - f_i(x_i, \beta, b_i)\} - \tfrac{1}{2} b_i^T D^{-1} b_i. \quad (15.43)$$
Actually, Lindstrom and Bates (1990) restricted attention to models in which the matrix $R_i(\gamma, x_i)$ does not depend on $b_i$, so that the within-individual covariance matrix does not depend on $b_i$. Under these conditions, the first term in (15.43) is constant with respect to $b_i$ and may be disregarded.

Under the conditions where both $p(Y_i \mid x_i, b_i; \beta, \gamma)$ and $p(b_i; D)$ are normal, these considerations suggest the following basic iterative strategy for estimation of $\beta$, $\gamma$, and $D$. Let $\xi$ denote the vector of covariance parameters consisting of $\gamma$ and the distinct elements of $D$.

(i) Obtain initial estimators $\hat{\beta}^{(0)}$ and $\hat{\xi}^{(0)}$. For example, these could be obtained from fitting the approximate marginal model with moments given in (15.40), that is, obtained by the approximation setting $b_i = 0$. In almost all the literature, a GEE-1 approach with the Gaussian working assumption
would be used here.

Then find the initial empirical Bayes estimators $\hat{b}_i^{(0)}$, $i = 1, \ldots, m$, by substituting $\hat{\beta}^{(0)}$ and $\hat{\xi}^{(0)}$ in (15.43) for each $i$ and, holding these fixed, maximizing in $b_i$. Thus, $m$ maximizations, one for each $i$, are performed. Set $k = 0$.

(ii) Substitute $\hat{b}_i^{(k)}$ for $\hat{b}_i$ in the approximate marginal moment expressions (15.42). Treating the $\hat{b}_i^{(k)}$ as fixed, update estimation of $\beta$ and $\xi$ by solving a set of estimating equations. The GEE-1 equations with the Gaussian working assumption are usually used here. Call the updated estimators $\hat{\beta}^{(k+1)}$ and $\hat{\xi}^{(k+1)}$.

(iii) Substituting $\hat{\beta}^{(k+1)}$ and $\hat{\xi}^{(k+1)}$ in (15.43) and holding these fixed, maximize (15.43) in $b_i$ for each $i$, in $m$ separate maximizations, to obtain $\hat{b}_i^{(k+1)}$, $i = 1, \ldots, m$. Set $k = k+1$ and go to (ii).

Iteration between steps (ii) and (iii) would proceed until convergence, where this might be defined as relative change in successive estimates of all components of $\beta$ and $\xi$ being less than some tolerance. Various versions of this scheme are implemented in the SAS macro nlinmix and the R/Splus function nlme(). These software packages focus specifically on the case where $p(Y_i \mid x_i, b_i; \beta, \gamma)$ is assumed to be normally distributed. Moreover, there are a few subtleties.

• These implementations ignore the first term in (15.43) when performing the update in step (iii).

• Both implementations in fact invoke a further approximation, to allow algorithms for fitting linear mixed effects models to be exploited. In particular, at step (ii) of the algorithm, the approximate marginal moments are
$$E(Y_i \mid x_i) \approx f_i(x_i, \beta, \hat{b}_i^{(k)}) - Z_i(x_i, \beta, \hat{b}_i^{(k)})\hat{b}_i^{(k)},$$
$$\mathrm{var}(Y_i \mid x_i) \approx Z_i(x_i, \hat{\beta}^{(k)}, \hat{b}_i^{(k)})\, D\, Z_i^T(x_i, \hat{\beta}^{(k)}, \hat{b}_i^{(k)}) + R_i(\gamma, x_i, \hat{b}_i^{(k)}).$$
If one substitutes the previous iterate $\hat{\beta}^{(k)}$ for $\beta$ in the expression for $\mathrm{var}(Y_i \mid x_i)$, then note that the matrices $Z_i(x_i, \hat{\beta}^{(k)}, \hat{b}_i^{(k)})$ and $R_i(\gamma, x_i, \hat{b}_i^{(k)})$ are constant with respect to $\beta$, just as in a linear mixed model. Moreover, by a further approximation to the marginal mean, expanding $f_i(x_i, \beta, \hat{b}_i^{(k)})$ and $Z_i(x_i, \beta, \hat{b}_i^{(k)})\hat{b}_i^{(k)}$ to linear terms about $\hat{\beta}^{(k)}$ and ignoring negligible terms, we obtain
$$E(Y_i \mid x_i) \approx f_i(x_i, \hat{\beta}^{(k)}, \hat{b}_i^{(k)}) + X_i(x_i, \hat{\beta}^{(k)}, \hat{b}_i^{(k)})(\beta - \hat{\beta}^{(k)}) - Z_i(x_i, \hat{\beta}^{(k)}, \hat{b}_i^{(k)})\hat{b}_i^{(k)},$$
where $X_i(x_i, \beta, b_i) = \partial/\partial\beta\, f_i(x_i, \beta, b_i)$.

Note that if we define the "pseudo-response vector"
$$w_i^{(k)} = Y_i - f_i(x_i, \hat{\beta}^{(k)}, \hat{b}_i^{(k)}) + X_i(x_i, \hat{\beta}^{(k)}, \hat{b}_i^{(k)})\hat{\beta}^{(k)} + Z_i(x_i, \hat{\beta}^{(k)}, \hat{b}_i^{(k)})\hat{b}_i^{(k)},$$
then, approximately, from above we have
$$E(w_i^{(k)} \mid x_i) \approx X_i(x_i, \hat{\beta}^{(k)}, \hat{b}_i^{(k)})\, \beta. \quad (15.44)$$
The approximate mean model for $w_i^{(k)}$ is linear in $\beta$; moreover, from above,
$$\mathrm{var}(w_i^{(k)} \mid x_i) \approx Z_i(x_i, \hat{\beta}^{(k)}, \hat{b}_i^{(k)})\, D\, Z_i^T(x_i, \hat{\beta}^{(k)}, \hat{b}_i^{(k)}) + R_i(\gamma, x_i, \hat{b}_i^{(k)}). \quad (15.45)$$
Together, (15.44) and (15.45) represent an approximate linear mixed effects model with "constant" design matrices
$$X_i^{(k)} = X_i(x_i, \hat{\beta}^{(k)}, \hat{b}_i^{(k)}) \quad \text{and} \quad Z_i^{(k)} = Z_i(x_i, \hat{\beta}^{(k)}, \hat{b}_i^{(k)}).$$
This model may thus be fitted (i.e., $\beta$ and $\xi$ estimated) by techniques for linear mixed effects models.

• In fact, under this perspective, step (iii) may also be approximated. Under the approximate linear mixed model (15.44) and (15.45), the posterior mode for $b_i$ may be updated using the formula in (15.17). In particular, the update may be obtained as
$$\hat{b}_i^{(k+1)} = \hat{D}^{(k+1)} Z_i^{(k)T} \hat{V}_i^{(k)-1}\{w_i^{(k)} - X_i^{(k)}\hat{\beta}^{(k+1)}\}, \quad (15.46)$$
where
$$\hat{V}_i^{(k)} = Z_i^{(k)} \hat{D}^{(k+1)} Z_i^{(k)T} + R_i(\hat{\gamma}^{(k+1)}, x_i, \hat{b}_i^{(k)}).$$

• The SAS macro nlinmix takes full advantage of these further approximations. In particular, steps (ii) and (iii) are carried out as indicated above, by forming the $w_i^{(k)}$ and the matrices $X_i^{(k)}$ and $Z_i^{(k)}$ for the current iteration and calling proc mixed to fit the approximate linear mixed effects model. The approximate posterior modes (15.46), which are approximate "EBLUPs," are a byproduct of calling proc mixed (proc mixed calculates EBLUPs for linear mixed models using (15.17), evaluated at the final estimates of $\beta$, $\gamma$, and $D$, by default). The R/Splus function nlme() does not use the further approximation (15.46), but rather maximizes (15.43) disregarding the leading term.

Standard errors for the estimator for $\beta$ obtained via this approach are generally computed by using the usual GEE expressions, with the final estimates of the fixed parameters and the final values of the $\hat{b}_i$ substituted.
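One pass of the pseudo-response update can be sketched numerically. Everything below, the model, the data, and the current iterates, is a hypothetical illustration of (15.44)-(15.46), not the nlinmix implementation.

```python
import numpy as np

# Hypothetical model f_ij = exp(beta1 + b_i1) * exp(-exp(beta2) * t_ij),
# with two fixed effects and a scalar random effect; all numbers assumed.
def f(beta, b, t):
    return np.exp(beta[0] + b[0]) * np.exp(-np.exp(beta[1]) * t)

t = np.array([0.5, 1.0, 2.0, 4.0])
Y = np.array([2.2, 1.7, 1.1, 0.45])
beta_k = np.array([1.0, -1.0])   # current iterate of the fixed effects
b_k = np.array([0.1])            # current posterior-mode iterate, this subject
D = np.array([[0.04]])
R = 0.01 * np.eye(len(t))

fk = f(beta_k, b_k, t)
# design matrices at the current iterates:
#   X columns: df/dbeta1 = f and df/dbeta2 = -exp(beta2) t f;  Z column: df/db1 = f
Xk = np.column_stack([fk, -np.exp(beta_k[1]) * t * fk])
Zk = fk[:, None]

# pseudo-response w = Y - f + X beta_k + Z b_k, with E(w | x) ~ X beta  (15.44)
w = Y - fk + Xk @ beta_k + Zk @ b_k

# GLS step for beta under var(w | x) ~ Z D Z^T + R  (15.45)
V = Zk @ D @ Zk.T + R
Vinv = np.linalg.inv(V)
beta_new = np.linalg.solve(Xk.T @ Vinv @ Xk, Xk.T @ Vinv @ w)

# EBLUP-type update (15.46) for the random effect
b_new = D @ Zk.T @ Vinv @ (w - Xk @ beta_new)
```

In the full algorithm, this pair of updates would be alternated (with $D$, $R$ also re-estimated) until the estimates stabilize.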
ALTERNATIVE DERIVATION: It is in fact possible to provide an alternative derivation that also motivates the three-step scheme for estimating $\beta$, $\gamma$, and $D$ given above. This derivation relies on the so-called Laplace approximation to an integral of the form
$$\int \exp\{-n\,\ell(\tau)\}\, d\tau \approx (2\pi/n)^{k/2}\, |\ell''(\hat{\tau})|^{-1/2} \exp\{-n\,\ell(\hat{\tau})\}. \quad (15.47)$$
Here, $\tau$ is $k \times 1$, $\ell(\tau)$ is a real-valued function of $\tau$ such that the integrand is maximized at $\hat{\tau}$, and $\ell''(\tau) = \partial^2 \ell/\partial\tau\,\partial\tau^T$. The Laplace approximation is valid when $n$ is large; in particular, the approximation is $O(n^{-1})$.

Wolfinger (1993) and Vonesh (1996) discuss how the Laplace approximation (15.47) may be applied in the setting of nonlinear mixed effects models when $p(Y_i \mid x_i, b_i; \beta, \gamma)$ and $p(b_i; D)$ are normal densities. In particular, one may identify $b_i$ with $\tau$ and $n$ with $n_i$ for individual $i$ in the integral in (15.37). Both authors consider the specific situation in which $R_i$ does not depend on $b_i$, which we write as $R_i(\gamma, x_i)$. In this case, the integral (15.37) is given by
$$p(Y_i \mid x_i; \beta, \gamma, D) = (2\pi)^{-n_i/2} |R_i(\gamma, x_i)|^{-1/2} (2\pi)^{-k/2} |D|^{-1/2}$$
$$\times \int \exp\left[-\tfrac{1}{2}\{Y_i - f_i(x_i, \beta, b_i)\}^T R_i^{-1}(\gamma, x_i)\{Y_i - f_i(x_i, \beta, b_i)\} - \tfrac{1}{2} b_i^T D^{-1} b_i\right] db_i. \quad (15.48)$$

Consider approximating the integral in (15.48) by (15.47). We may identify
$$n_i\,\ell(b_i) = \tfrac{1}{2}\{Y_i - f_i(x_i, \beta, b_i)\}^T R_i^{-1}(\gamma, x_i)\{Y_i - f_i(x_i, \beta, b_i)\} + \tfrac{1}{2} b_i^T D^{-1} b_i.$$
Then, using results for matrix differentiation, it is straightforward to show that $\ell'(b_i) = \partial\ell/\partial b_i$ satisfies
$$\ell'(b_i) = n_i^{-1} D^{-1} b_i - n_i^{-1} Z_i^T(x_i, \beta, b_i)\, R_i^{-1}(\gamma, x_i)\{Y_i - f_i(x_i, \beta, b_i)\} \quad (15.49)$$
and
$$\ell''(b_i) = n_i^{-1} D^{-1} + n_i^{-1} Z_i^T(x_i, \beta, b_i)\, R_i^{-1}(\gamma, x_i)\, Z_i(x_i, \beta, b_i) - n_i^{-1} \frac{\partial}{\partial b_i^T}\left\{Z_i^T(x_i, \beta, b_i)\right\} R_i^{-1}(\gamma, x_i)\{Y_i - f_i(x_i, \beta, b_i)\}. \quad (15.50)$$
Note that the third term on the right-hand side of (15.50) has conditional expectation zero. It is standard to disregard this term, effectively making a further approximation to the Laplace approximation (15.47) by replacing $\ell''$ by its conditional expectation on the right-hand side of (15.47).

Substituting the conditional expectation of (15.50) in (15.47), we obtain
$$p(Y_i \mid x_i; \beta, \gamma, D) \approx (2\pi)^{-n_i/2} |R_i(\gamma, x_i)|^{-1/2} |D|^{-1/2} \left| D^{-1} + Z_i^T(x_i, \beta, \hat{b}_i)\, R_i^{-1}(\gamma, x_i)\, Z_i(x_i, \beta, \hat{b}_i) \right|^{-1/2}$$
$$\times \exp\left[-\tfrac{1}{2}\{Y_i - f_i(x_i, \beta, \hat{b}_i)\}^T R_i^{-1}(\gamma, x_i)\{Y_i - f_i(x_i, \beta, \hat{b}_i)\} - \tfrac{1}{2}\hat{b}_i^T D^{-1} \hat{b}_i\right]. \quad (15.51)$$
Now, as the posterior mode $\hat{b}_i$ maximizes the integrand, it must be such that $\ell'(\hat{b}_i) = 0$. From (15.49), we thus have that $\hat{b}_i$ must satisfy
$$\hat{b}_i = D\, Z_i^T(x_i, \beta, \hat{b}_i)\, R_i^{-1}(\gamma, x_i)\{Y_i - f_i(x_i, \beta, \hat{b}_i)\}. \quad (15.52)$$
Via tedious matrix algebra, it may be shown, using the representation of $\hat{b}_i$ in (15.52) and defining
$$h_i(x_i, \beta, \hat{b}_i) = f_i(x_i, \beta, \hat{b}_i) - Z_i(x_i, \beta, \hat{b}_i)\hat{b}_i,$$
that (15.51) may be rewritten as
$$p(Y_i \mid x_i; \beta, \gamma, D) \approx (2\pi)^{-n_i/2} \left| R_i(\gamma, x_i) + Z_i(x_i, \beta, \hat{b}_i)\, D\, Z_i^T(x_i, \beta, \hat{b}_i) \right|^{-1/2}$$
$$\times \exp\left[-\tfrac{1}{2}\{Y_i - h_i(x_i, \beta, \hat{b}_i)\}^T \left\{R_i(\gamma, x_i) + Z_i(x_i, \beta, \hat{b}_i)\, D\, Z_i^T(x_i, \beta, \hat{b}_i)\right\}^{-1}\{Y_i - h_i(x_i, \beta, \hat{b}_i)\}\right]. \quad (15.53)$$

• Note that (15.53) has the form of a normal density with mean $h_i(x_i, \beta, \hat{b}_i) = f_i(x_i, \beta, \hat{b}_i) - Z_i(x_i, \beta, \hat{b}_i)\hat{b}_i$ and covariance matrix $R_i(\gamma, x_i) + Z_i(x_i, \beta, \hat{b}_i) D Z_i^T(x_i, \beta, \hat{b}_i)$. Comparing these moments to the form of the approximate moments in (15.42), we see they are the same. Thus, this derivation leads to the same marginal moments obtained earlier; in fact, here we are led naturally to replacing $\hat{b}_i$ by the posterior mode.

• The approximation (15.53) may be substituted for the $i$th integral in the marginal likelihood (15.9), yielding an expression for the marginal likelihood that has a closed form. It should be clear that maximization of this approximation to the marginal likelihood (15.9) should result in a set of estimating equations for $\beta$, $\gamma$, and $D$ that is of the GEE-2 form with the Gaussian working assumption. As noted above, and following the recommendations discussed in Chapter 14, it is standard to use the GEE-1 approach instead. The form of (15.53) suggests naturally that an iterative strategy like the one we have already described may be used.

REMARKS:

• When the matrix $R_i$ does depend on $\beta$, and hence on $b_i$, the above argument no longer applies, as pointed out by Vonesh (1996). Ko and Davidian (2000) have shown that if $R_i(\gamma, x_i, b_i)$ has the form of a scale parameter $\sigma^2$ times a matrix, then, if $\sigma$ is small, the same approximation as in (15.53) may be obtained. Because in many applications, such as pharmacokinetics, intra-individual variation is indeed small, this further approximation is often relevant in practice.

• As with the simpler first-order approximation about $b_i = 0$, it is not clear that the estimators for $\beta$, $\gamma$, and $D$ obtained via the iterative strategy need be consistent.
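The quality of the Laplace approximation (15.47) is easy to check numerically in one dimension; the choice $\ell(\tau) = \cosh\tau$ below is an arbitrary smooth test function, not from the notes.

```python
import numpy as np

# One-dimensional sanity check of the Laplace approximation (15.47):
# approximate the integral of exp(-n l(tau)) for l(tau) = cosh(tau),
# which is minimized at tau_hat = 0 with l''(tau_hat) = 1, and compare
# with brute-force quadrature on a fine grid.
n = 30.0
tau = np.linspace(-3.0, 3.0, 120001)
dtau = tau[1] - tau[0]
brute_force = np.sum(np.exp(-n * np.cosh(tau))) * dtau

tau_hat = 0.0
laplace = np.sqrt(2.0 * np.pi / (n * np.cosh(tau_hat))) * np.exp(-n * np.cosh(tau_hat))

# the relative error should be of order 1/n, here well under one percent
rel_err = abs(laplace - brute_force) / brute_force
```

Consistent with the $O(n^{-1})$ claim, halving $n$ roughly doubles the relative error, which is why the approximation is trusted only when the $n_i$ are reasonably large.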
For that matter, it is not clear that the procedure need even converge to a solution. Luckily, in practice, it usually does. Note that the Laplace approximation requires $n_i$ to be large. This suggests that if both $n_i$ for all $i = 1, \ldots, m$ and $m \to \infty$, consistency might obtain; Vonesh (1996) discusses this in more detail. In fact, it should be evident that one would need both $n_i$ and $m \to \infty$ in order that the two-stage approach discussed in Section 15.3, based on individual estimators for the $\beta_i$, be consistent, as this relies on the relevance of the individual-level asymptotic theory approximation. It turns out that, under these conditions, the method we have discussed here and the two-stage method are virtually identical.

• In practice, it is often the case that the approximation we have discussed here works quite well, even when $n_i$ is not too large for all $i$. Simulation evidence shows that the estimator for $\beta$ obtained via the three-step algorithm outlined above is virtually unbiased for moderate $m$, even when the $n_i$ are not large.

• As we have mentioned previously, it is often assumed that the $Y_{ij}$, conditional on $x_i$, $b_i$, are independent, as within-individual correlation may be difficult to identify from intermittent measurements. In this situation, the matrix $R_i(\gamma, x_i, b_i)$ is diagonal, with diagonal elements equal to the assumed intra-individual variance function. Under these conditions, implementation of the above strategy is simplified somewhat, but the principle is the same.

GENERALIZED LINEAR MIXED EFFECTS MODELS: As mentioned on page 441, when the stage 1 individual model is of the form of a generalized linear model, so that the distribution of $Y_{ij} \mid z_{ij}, a_i, b_i$ is not reasonably assumed normal, then the nonlinear mixed effects model is referred to as a generalized linear mixed effects model.

• In this case, more appropriate models for the distribution of $Y_{ij} \mid z_{ij}, a_i, b_i$ are members of the scaled exponential family class, e.g., the Poisson or Bernoulli distributions.

• In contrast to the case where
$Y_{ij} \mid z_{ij}, a_i, b_i$ is assumed normal, there is no multivariate generalization of the other scaled exponential family distributions, such as the Poisson, for the joint conditional distribution of $Y_i$. As noted above, it is common to assume that the $Y_{ij}$ are conditionally independent. In this case, the density $p(Y_i \mid x_i, b_i; \beta, \gamma)$ may be represented as the product of the individual conditional densities. Under the generalized linear mixed effects model, these individual conditional densities are assumed to arise from the scaled exponential family class of univariate densities. We will make this assumption in what follows.

As noted above, ordinarily the conditional mean $E(Y_{ij} \mid z_{ij}, a_i, b_i)$ is assumed to depend on $\beta$ and $b_i$ through a linear combination of these and the covariates. In particular, as an example, we might have an individual generalized linear model of the form
$$E(Y_{ij} \mid z_{ij}, a_i, b_i) = f(z_{ij}^T \beta_i), \qquad \text{where } \beta_i = A_i\beta + B_i b_i,$$
$A_i$ depends on the individual-level covariates $a_i$, and $B_i$ is a design matrix of 0s and 1s. By substitution, we have
$$E(Y_{ij} \mid z_{ij}, a_i, b_i) = f(u_{ij}^T\beta + v_{ij}^T b_i), \quad j = 1, \ldots, n_i,$$
where $u_{ij} = A_i^T z_{ij}$ and $v_{ij} = B_i^T z_{ij}$. Of course, as discussed in Chapter 4, there is no reason why this linear dependence cannot be relaxed; we will continue to write things in the more general nonlinear form below. Note that, for the scaled exponential family, there may or may not be intra-individual variance parameters $\gamma$; in any event, the only such parameter would be a scale parameter $\sigma^2$. The variance function must be a function of the mean $\mu$ only, which we will write as $g^2(\mu)$, as in Chapter 4.

• In the usual formulation of the generalized linear mixed model, it is still assumed that $b_i \sim N(0, D)$, which we will maintain in the developments below.

Letting $p(Y_{ij} \mid z_{ij}, a_i, b_i; \beta, \gamma)$ be the assumed conditional density of $Y_{ij}$, a member of the scaled exponential family class, then
$$p(Y_i \mid x_i, b_i; \beta, \gamma) = \prod_{j=1}^{n_i} p(Y_{ij} \mid z_{ij}, a_i, b_i; \beta, \gamma).$$
From Chapter 4, we know that the "important part" of $\log p(Y_{ij} \mid z_{ij}, a_i, b_i; \beta, \gamma)$ may be written as
$$\sigma^{-2} \int^{f(z_{ij}, a_i, \beta, b_i)} \frac{Y_{ij} - u}{g^2(u)}\, du.$$
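For example, in the Poisson case $g^2(u) = u$, and the "important part" integral reduces to the Poisson loglikelihood $Y_{ij}\log\mu - \mu$ up to an additive constant. A quick numerical check of this claim (in Python, purely illustrative; the lower integration limit is an arbitrary assumption, since only differences in $\mu$ matter):

```python
import math

# Midpoint-rule evaluation of  integral_{lower}^{mu} (y - u)/u du,
# which for the Poisson variance function g^2(u) = u should equal
# y*log(mu) - mu + (a constant depending only on the lower limit).
def quasi_integral(y, mu, lower=0.5, steps=200_000):
    h = (mu - lower) / steps
    total = 0.0
    for k in range(steps):
        u = lower + (k + 0.5) * h
        total += (y - u) / u
    return total * h

y = 3.0
# compare differences in mu so that the arbitrary constant cancels
diff_numeric = quasi_integral(y, 2.5) - quasi_integral(y, 1.5)
diff_closed = (y * math.log(2.5) - 2.5) - (y * math.log(1.5) - 1.5)
```

Differentiating $Y_{ij}\log\mu - \mu$ in $\mu$ recovers $(Y_{ij}-\mu)/g^2(\mu)$, the quasi-score familiar from Chapter 4.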
Thus, the contribution of individual $i$ (15.37) to the marginal likelihood (15.9) may be written as
$$p(Y_i \mid x_i; \beta, \gamma, D) \propto |D|^{-1/2} \int \exp\left[\sigma^{-2} \sum_{j=1}^{n_i} \int^{f(z_{ij}, a_i, \beta, b_i)} \frac{Y_{ij} - u}{g^2(u)}\, du - \tfrac{1}{2} b_i^T D^{-1} b_i\right] db_i. \quad (15.54)$$
Breslow and Clayton (1993) suggested approximating (15.54) by using the Laplace approximation (15.47). Identifying
$$n_i\,\ell(b_i) = -\sigma^{-2} \sum_{j=1}^{n_i} \int^{f(z_{ij}, a_i, \beta, b_i)} \frac{Y_{ij} - u}{g^2(u)}\, du + \tfrac{1}{2} b_i^T D^{-1} b_i,$$
we have
$$\ell'(b_i) = n_i^{-1} D^{-1} b_i - n_i^{-1}\sigma^{-2} \sum_{j=1}^{n_i} \frac{\partial}{\partial b_i} f(z_{ij}, a_i, \beta, b_i)\, \frac{Y_{ij} - f(z_{ij}, a_i, \beta, b_i)}{g^2\{f(z_{ij}, a_i, \beta, b_i)\}}.$$
Letting $R_i(\gamma, x_i, b_i) = \sigma^2\,\mathrm{diag}\left[g^2\{f(z_{i1}, a_i, \beta, b_i)\}, \ldots, g^2\{f(z_{in_i}, a_i, \beta, b_i)\}\right]$ and $Z_i(x_i, \beta, b_i)$ be the $n_i \times k$ matrix with $j$th row $\partial/\partial b_i\, f(z_{ij}, a_i, \beta, b_i)^T$, we may write this as
$$\ell'(b_i) = n_i^{-1} D^{-1} b_i - n_i^{-1} Z_i^T(x_i, \beta, b_i)\, R_i^{-1}(\gamma, x_i, b_i)\{Y_i - f_i(x_i, \beta, b_i)\}. \quad (15.55)$$
This looks identical to (15.49) in the case of conditionally normal $Y_i$, with the exception that $R_i$ depends on $b_i$. Differentiating (15.55) again with respect to $b_i$ and, as with (15.50), ignoring the terms with expectation zero, we have
$$\ell''(b_i) \approx n_i^{-1} D^{-1} + n_i^{-1} Z_i^T(x_i, \beta, b_i)\, R_i^{-1}(\gamma, x_i, b_i)\, Z_i(x_i, \beta, b_i).$$
Substituting these expressions into (15.47), we obtain, ignoring constants,
$$p(Y_i \mid x_i; \beta, \gamma, D) \approx |D|^{-1/2} \left| D^{-1} + Z_i^T(x_i, \beta, \hat{b}_i)\, R_i^{-1}(\gamma, x_i, \hat{b}_i)\, Z_i(x_i, \beta, \hat{b}_i) \right|^{-1/2}$$
$$\times \exp\left[\sigma^{-2} \sum_{j=1}^{n_i} \int^{f(z_{ij}, a_i, \beta, \hat{b}_i)} \frac{Y_{ij} - u}{g^2(u)}\, du - \tfrac{1}{2}\hat{b}_i^T D^{-1} \hat{b}_i\right]. \quad (15.56)$$
It turns out, through some further manipulations that are left as an exercise, that (15.56) leads to a further approximation in terms of a linear mixed model representation, as in the case of normal conditional distributions. Thus, essentially the same iterative strategy discussed in that case is applicable here; it is implemented in the SAS macro glimmix. Wolfinger and O'Connell (1993) discuss a slightly different derivation; see also Schall (1991). In the context of generalized linear mixed effects models, the iterative scheme has been referred to as penalized quasi-likelihood (PQL); see Breslow and Clayton (1993).

Of course, the same caveats about the accuracy of the approximation given in the normal case apply here as well. Unfortunately, it turns out that under some circumstances the approximation underlying these developments can be very poor. In particular, when the $Y_{ij}$ are binary and the $n_i$ are small, it has been observed that the resulting estimators for $\beta$, $\gamma$, and $D$, and
particularly the latter, can be subject to nontrivial bias in practice. This phenomenon is discussed by Breslow and Lin (1995) and Lin and Breslow (1996), who also discuss analytical approaches to correcting this bias. Alternatively, a number of authors have suggested that the only way around this problem is to try to do the integral in (15.37) more directly.

OPERATIONAL NOTE: Issues of performance aside, there is no reason why all of the methods we have discussed in this section cannot be implemented when the $n_i$ are sufficiently small that individual estimators for the $\beta_i$ would be impossible to obtain. Of course, the quality of the approximation may depend on the $n_i$ being large, which is analogous to the case where estimation of the $\beta_i$ may be of questionable quality. However, if some $n_i$ are sufficiently small to make estimation of $\beta_i$ impossible (e.g., $n_i = 3$ while $\beta_i$ has dimension 4), there is no operational barrier to the methods presented here.

15.5 Exact likelihood methods

The approximate methods of the last two sections often work remarkably well in practice, in the sense that the resulting estimators for $\beta$, $\gamma$, and $D$ are approximately unbiased in finite samples in many situations. However, as noted for binary data at the end of Section 15.4, sometimes these approximations do fail. Thus, there has been considerable recent interest in moving away from such approximations and instead attempting to carry out the integrations involved in evaluating the marginal likelihood for $\beta$, $\gamma$, and $D$ via some numerical technique. We have already discussed the potential use of the usual quadrature techniques for numerical integration in this context; alternatively, other approaches have been suggested. To give a full accounting of these is beyond the scope of our treatment here. Instead, we simply summarize the ideas and refer the reader to the literature for more details.

USE OF THE EM ALGORITHM: As we have mentioned previously, the EM algorithm is a numerical technique that may be used to maximize an objective function.
As discussed by Laird and Ware (1982) and Davidian and Giltinan (1995, section 3.4), this approach may be used to maximize the likelihood associated with a linear mixed effects model. The EM algorithm is often used in the context of missing data problems, where specification of the likelihood requires integration over the missing data. With random effects models, the EM algorithm is used by treating the random effects as "missing data" that are not observed. In the linear mixed effects model, it turns out that, because of the linearity of the model in the $b_i$, the E-step, in which the missing data are taken into account, is calculable in a closed form; hence the attractiveness of this approach. In more general nonlinear mixed effects models, the E-step is unfortunately not available in a closed form. Several authors, including McCulloch (1997) and Booth and Hobert (1999), have suggested using the EM algorithm in generalized linear mixed effects models, where the intractable E-step is calculated by carrying out a Monte Carlo simulation. This approach appears to work well, but can be computationally intensive.

REFINED INTEGRATION: Alternatively, another approach is to use refined numerical integration methods to carry out the integration more directly. Pinheiro and Bates (1995) suggested a modification to the usual quadrature rules for numerical integration in the context of nonlinear mixed effects models that they referred to as adaptive Gaussian quadrature. This strategy seems to improve greatly the approximation of the integral while using many fewer abscissae than would be required with ordinary quadrature rules. The SAS procedure nlmixed contains a facility to allow the user to attempt to maximize the "exact" marginal likelihood for a nonlinear/generalized linear mixed model by such direct integration. The procedure implements both ordinary and adaptive Gaussian quadrature; the latter is the default method for this software. proc nlmixed also implements other methods for calculating the integral directly, which we do not mention here; see the documentation and Pinheiro and Bates (1995) for more details.
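The idea of ordinary Gaussian quadrature in this setting can be sketched in one dimension: for $b \sim N(0, d)$, Gauss-Hermite nodes and weights approximate $E\{g(b)\}$, the scalar analog of the integrals in (15.37). The example below is an illustration only (the function $g$ and the value of $d$ are assumptions), not the nlmixed implementation.

```python
import numpy as np

# Ordinary Gauss-Hermite quadrature for E{g(b)} with b ~ N(0, d):
#   E g(b) = (1/sqrt(pi)) * sum_j w_j * g(sqrt(2*d) * x_j),
# where (x_j, w_j) are Gauss-Hermite nodes/weights for the weight exp(-x^2).
def gh_expectation(g, d, nodes=10):
    x, w = np.polynomial.hermite.hermgauss(nodes)
    return np.sum(w * g(np.sqrt(2.0 * d) * x)) / np.sqrt(np.pi)

d = 0.25
approx = gh_expectation(np.exp, d)   # quadrature approximation of E exp(b)
exact = np.exp(d / 2.0)              # available in closed form for this g
```

With a smooth integrand, a handful of nodes already gives high accuracy here; the "adaptive" variant of Pinheiro and Bates (1995) additionally recenters and rescales the nodes around the posterior mode of each $b_i$, which is what makes few abscissae suffice in the mixed model setting.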
BAYESIAN APPROACH: Zeger and Karim (1991) suggested placing the generalized linear mixed model in a Bayesian framework and using Markov chain Monte Carlo techniques to carry out the necessary integrations. This work has been followed by numerous papers refining this idea. Many standard models are now implemented in the software package bugs.

15.6 Implementation in SAS and R

In this section, we demonstrate several of the approaches we have discussed. These demonstrations are not meant to be exhaustive of all the possible implementations, but rather are intended to serve as examples of what one might do in practice using SAS, R (Splus), or other software. We consider two examples. The first is a case of continuous response that might reasonably be assumed to be approximately normally distributed; for this example, we consider use of the SAS macro nlinmix, SAS proc nlmixed, and R/Splus nlme() (we demonstrate the version in R). The second example is a case where the response is in the form of a count, suggesting that a Poisson distribution may be a reasonable model for $p(Y_{ij} \mid z_{ij}, a_i, b_i; \beta, \gamma)$; here, we consider use of the SAS macro glimmix and SAS proc nlmixed.

The two-stage approach discussed in Section 15.3, based on individual estimates for the $\beta_i$, may be used in principle with any type of assumption on the stage 1 individual-level model. For example, regardless of whether the response is continuous, binary, in the form of a count, etc., as long as sufficient data exist on each individual ($n_i$ sufficiently large) to estimate the individual-specific $\beta_i$ using, for example, GLS, then this method may be used. If there are unknown but common intra-individual covariance parameters, then the pooled approach discussed in Section 15.3 is relevant. If there are no such parameters, or if they are thought to vary by individual, then individual estimation techniques such as those discussed in Chapters
2-12 may be used to obtain the $\hat{\beta}_i$ and $\hat{C}_i$ for each individual. Unfortunately, there is no "canned" implementation of the two-stage method available in statistical software such as SAS or R. We thus demonstrate this method for the first example only; implementation for the second example would be entirely similar.

EXAMPLE 1, Pharmacokinetics of argatroban: This example is taken from Davidian and Giltinan (1995, section 9.5) and concerns a study of the pharmacokinetics and pharmacodynamics of the anticoagulant agent argatroban; we will consider the pharmacokinetic data. In the study, $m = 37$ subjects each received a four-hour (240-minute) intravenous infusion of one of several doses of argatroban. For each infusion rate from 1 $\mu$g/kg/min to 5 $\mu$g/kg/min, in increments of 0.5 $\mu$g/kg/min, 4 subjects were randomized to receive that infusion rate; a 37th subject received a rate of 4.37 $\mu$g/kg/min. Serial blood samples were taken from each patient at several time points over the 360 minutes (6 hours) following the start of the infusion and were assayed for argatroban concentration. Figure 15.1 shows concentration-time profiles for 4 subjects at different doses, with a GLS-PL fit of the pharmacokinetic model given below superimposed.

[Figure 15.1: Concentration-time data for four subjects from the argatroban pharmacokinetic study; panels show argatroban concentration versus time (minutes) for individual subjects, including Subjects 13, 17, and 24.]

A standard model for concentration at time $t$, during and following administration by constant-rate infusion of amount per unit time $R$ and duration $t_{inf}$, is
$$f(z, \beta) = \frac{R}{Cl}\left\{1 - \exp\left(-\frac{Cl}{V}\, t_1\right)\right\} \exp\left(-\frac{Cl}{V}\, t_2\right), \quad (15.57)$$
where $t_1 = \min(t, t_{inf})$; $t_2 = 0$ for $t \le t_{inf}$ and $t_2 = t - t_{inf}$ for $t > t_{inf}$; $z = (R, t)^T$; and $\beta = (\beta_1, \beta_2)^T$ is defined so that $Cl = \exp(\beta_1)$ and $V = \exp(\beta_2)$. Here, $Cl$ is the clearance rate and $V$ is the volume of distribution. From Figure 15.1, this model appears to provide an adequate representation of the concentration-time relationship.

Note that $Cl$ and $V$ are not dose (infusion rate) dependent: they do not depend on $R$. This assumption, which is embedded in the pharmacokinetic model (15.57), means that an individual's clearance and volume characteristics do not change depending on the dose administered. Thus, in principle, information on the values of $Cl$ and $V$ for a particular individual may be obtained from concentration-time data at any dose $R$.

For each individual $i$ with infusion rate $R_i$, we have concentration measurements $Y_{i1}, \ldots, Y_{in_i}$ at times $t_{i1}, \ldots, t_{in_i}$, so that $z_{ij} = (R_i, t_{ij})^T$. Here, $n_i$ is in the range of 10 to 14 for each subject, so that individual fitting of (15.57) is feasible.
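A direct transcription of (15.57) (in Python for illustration; the parameter values below are arbitrary assumptions, not fitted estimates) confirms the basic shape of the model: zero at $t = 0$, rising toward the steady-state level $R/Cl$ during the infusion, continuous at $t = t_{inf}$, and decaying exponentially thereafter.

```python
import math

# One-compartment infusion model (15.57): concentration at time t for
# infusion rate R of duration t_inf, with Cl = exp(beta1) and V = exp(beta2).
def conc(t, beta1, beta2, R, t_inf=240.0):
    cl, v = math.exp(beta1), math.exp(beta2)
    t1 = min(t, t_inf)          # time "on infusion"
    t2 = max(t - t_inf, 0.0)    # time since the infusion ended
    return (R / cl) * (1.0 - math.exp(-cl * t1 / v)) * math.exp(-cl * t2 / v)

# illustrative parameter values only
c_end = conc(240.0, beta1=-6.0, beta2=-2.0, R=4.5)
c_after = conc(240.0 + 1e-6, beta1=-6.0, beta2=-2.0, R=4.5)
```

The single expression covers both phases because $t_1$ stops increasing and $t_2$ starts increasing exactly at $t = t_{inf}$, which is what makes individual fitting by nls straightforward.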
Cl and V are not dose infusion rate dependent they do not depend on B This assumption7 which is embedded in the pharmacokinetic model 15577 means that an individual s clearance and volume characteristics do not change depending on the dose administered Thus7 in principle7 information on the values of Cl and V for a particular individual may be obtained from concentration time data at any dose R For each individual i with infusion rate Ri we have concentration measurements Y 7Ymi at times ti17tim7 so that zij Ri7tl jT Here7 m is in the range of 10 to 14 for each subject7 so that individual tting of 1557 is feasible PAGE 455 CHAPTER 15 ST 762 M DAVDDIAN We thus assume the stage 1 individual model where 1557 describes the concentration response rela tionship for each subject so that each individual has parameters 8 31 320T As it is well known i r i is i i that the variance of pharmacokinetic we assume that VaFOia39lzm t 02f29zij7 507 the power of themean variance model Here 02 and 9 are assumed to be common across subjects this is reasonable if the major source of intra individual variation is the assay used to determine argatroban concentrations Here there are no individual level covariates 1139 the infusion rate R is considered a within individual covariate as it is a condition of measurement We also assume that the Yij are conditionally independent given 11 zi With no individual level covariates the second stage population model is t 5 bi Here the elements of 6 represent the mean values of log Cl and log V in the population of patients from which these patients are assumed to arise As in the generic model bi N NO D so that D11 and D22 represent the variances of log Cl and logV values in the population On the original scale observe that D11 and D22 represent roughly the squares of the coef cients of variation of these parameters in the population The following six programs illustrate the use of various methods and software to t the model for the argatroban data 
given above.

PROGRAM 15.1: Two-stage approach, using the EM algorithm to fit the stage 2 model. This program, written in R, implements the pooled GLS-PL algorithm to estimate the $\beta_i$, $\sigma$, and $\theta$. The resulting estimates $\hat{\beta}_i$ and their estimated covariance matrices $\hat{C}_i$ are then fed into the EM algorithm.

PROGRAM STATEMENTS:

   #  Stage 1:  GLS-PL pooled estimation
   #  Stage 2:  Estimation of population parameters using the GTS (EM) algorithm

   #  function to print out matrices pretty

   write.matrix <- function(x, file = "", sep = " ") {
      x <- as.matrix(x)
      p <- ncol(x)
      cat(dimnames(x)[[2]], format(t(x)), file = file,
          sep = c(rep(sep, p - 1), "\n"), append = T)
   }

   #  put the mean function you want here -- the model is parameterized so
   #  that the values of all parameters are in the same "ballpark"
   #  (tinf, the infusion duration, is assumed defined globally)

   meanfunc <- function(x, b1, b2, dose) {
      cl <- exp(b1)
      v <- exp(b2)
      t1 <- x < tinf
      t2 <- tinf * (1 - t1) + t1 * x
      f1 <- dose/cl * (1 - exp(-cl * t2/v)) * exp(-cl * (1 - t1) * (x - tinf)/v)

   #  compute analytical derivatives, create the gradient matrix X

      t3 <- (1 - t1) * (x - tinf)
      temp1 <- dose/cl * exp(-cl * t3/v)
      temp2 <- dose/v^2 * exp(-cl * t3/v)
      meangrad <- array(0, c(length(x), 2), list(NULL, c("b1", "b2")))
      meangrad[, "b1"] <- temp1 * (-(1 - exp(-cl * t2/v)) * (1 + cl * t3/v)
                                   + cl * t2/v * exp(-cl * t2/v))
      meangrad[, "b2"] <- temp2 * ((1 - exp(-cl * t2/v)) * t3
                                   - exp(-cl * t2/v) * t2) * v
      attr(f1, "gradient") <- meangrad
      f1
   }

   meanfunc2 <- function(x, b1, b2, dose) {
      cl <- exp(b1)
      v <- exp(b2)
      t1 <- x < tinf
      t2 <- tinf * (1 - t1) + t1 * x
      dose/cl * (1 - exp(-cl * t2/v)) * exp(-cl * (1 - t1) * (x - tinf)/v)
   }

   #  weight function (mean raised to the power theta) and its gradient

   weightfunc <- function(x, b1, b2, dose, theta) {
      mu <- meanfunc2(x, b1, b2, dose)
      weightf <- mu^theta
      cl <- exp(b1)
      v <- exp(b2)
      t1 <- x < tinf
      t2 <- tinf * (1 - t1) + t1 * x
      t3 <- (1 - t1) * (x - tinf)
      temp1 <- dose/cl * exp(-cl * t3/v)
      temp2 <- dose/v^2 * exp(-cl * t3/v)

   #  compute analytical derivatives, create the gradient matrix X

      weightgrad <- array(0, c(length(x), 2), list(NULL, c("b1", "b2")))
      weightgrad[, "b1"] <- temp1 * (-(1 - exp(-cl * t2/v)) * (1 + cl * t3/v)
                                     + cl * t2/v * exp(-cl * t2/v)) * theta * mu^(theta - 1)
      weightgrad[, "b2"] <- temp2 * ((1 - exp(-cl * t2/v)) * t3
                                     - exp(-cl * t2/v) * t2) * v * theta * mu^(theta - 1)
      attr(weightf, "gradient") <- weightgrad
      weightf
   }

   #  set start values, etc.

   #  max number of iterations for fitting algorithm

   cmax <- 20

   #  start values

   bstart <- list(b1 = 6.0, b2 = 2.0)
   tstart <- list(theta = 0.5)
   p <- 2

   #  name of output file

   outfile <- "arg.gts.sout"
   cat("ARGATROBAN DATA, GTS ALGORITHM", file = outfile, "\n", "\n", "\n",
       append = F)

   #  read in data

   thedata <- scan("argconc.dat")
   thedata <- matrix(thedata, ncol = 5, byrow = T)
   indiv <- thedata[, 2]     #  individual indicator
   ds <- thedata[, 3]        #  dose
   xs <- thedata[, 4]        #  time
   ys <- thedata[, 5]        #  concentration
   n <- length(xs)
   uindiv <- unique(indiv)
   m <- length(uindiv)

   #  form individual 2nd-stage covariate (design) matrices Ai -- in this
   #  case, these are all identity matrices

   aimat <- NULL
   r <- p
   a1 <- diag(p)
   for (i in 1:m) {
      aimat <- rbind(aimat, a1)
   }

   #  "mean" function for the "trick"

   trkfunc <- function(resid, mudot, mu, theta) {
      trk <- resid * (mudot/mu)^theta
      trkgrad <- array(0, c(length(mu), 1), list(NULL, c("theta")))
      trkgrad[, "theta"] <- trk * log(mudot/mu)    #  analytic derivative
      attr(trk, "gradient") <- trkgrad
      trk
   }

   #  Stage 1:  Pooled GLS-PL estimation

   #  Step 1:  initial fit by OLS for each individual

   cat("INITIAL OLS ESTIMATION", file = outfile, "\n", "\n", append = T)

   bolsmat <- NULL       #  matrix to contain OLS estimates
   muvec <- NULL         #  vector to contain all predicted values
   residvec <- NULL      #  vector to contain all residuals

   for (i in 1:m) {
      id <- uindiv[i]
      y <- ys[indiv == id]
      x <- xs[indiv == id]
      d <- ds[indiv == id]
      ols.dat <- data.frame(x, y, d)
      ols.fit <- nls(y ~ meanfunc(x, b1, b2, d), ols.dat, bstart)
      bols <- coef(ols.fit)
      mu <- meanfunc2(x, bols[1], bols[2], d)
      resid <- y - mu
      bols <- matrix(bols, ncol = p, byrow = T)
      bolsmat <- rbind(bolsmat, bols)
      muvec <- c(muvec, mu)
      residvec <- c(residvec, resid)
   }

   bols <- cbind(uindiv, bolsmat)
   muvec <- matrix(muvec, ncol = 1)
   residvec <- matrix(residvec, ncol = 1)

   cat("OLS estimates", file = outfile, "\n", "\n", append = T)
   write.matrix(round(bols, 6), file = outfile)

   #  pooled OLS estimate of sigma

   sigma <- sqrt(sum(residvec * residvec)/(n - m * p))
   cat(file = outfile, "\n", "\n", append = T)
   cat("OLS pooled estimate of sigma = ", sigma, "\n", "\n", file = outfile,
       append = T)

   #  begin iteration loop for GLS-PL

   cat("GLS-PL POOLED ESTIMATION", file = outfile, "\n", "\n", append = T)

   for (k in 1:cmax) {

   #  compute the geometric mean and predicted values

      mudot <- exp((1/n) * sum(log(muvec)))
      mudot <- rep(mudot, length(muvec))
      dummy <- rep(0, length(muvec))
pldat lt data frameresidvecmuvecmudot Step 2 estimate theta using the PL trick pl fit lt nlsdummy 39 ltrkfuncresidvecmudotmuvecthetapldat tstartcontrollistmaxiter500 theta lt coefplfit catquotIteration quotkquotnquotfileoutfileappendT catquotEstimate of theta quotthetaquotnquotfileoutfileappendT Step 3 update estimates of beta for each individual bglsmat lt NULL new muvec lt NU L new residvec lt NULL for i in 1m id lt uin y lt ysind x lt xs mu lt muvecindivid mut lt mu theta ymut lt ymut glsdat lt data framexymutd glspl fit lt nlsymut 39 weightfuncxb1b2dmutglsdat bstartcontrollistmaxiter100 bgls lt coefglspl fit mu lt meanfunc2xbgls1bgls2d resid lt ymu bgls lt matrixbglsncolpbyrowT bglsmat lt rbindbglsmatbgls new muvec lt cnew muvecmu new residvec lt cnew residvecres1d bgls lt cbinduindivbglsmat muvec lt matrixnew muvecncol1 residvec lt matrixnew residvecncol1 Compute final estimate of sigma 2 g lt muvectheta sigma2 lt sumresidvecg2nmp sigma lt sqrtsigma2 catfileoutfilequotnquotquotnquotquotnquotappendT catquotFinal GLS estimates after cmaxquot iterationsquot fileoutfilequotnquotquotnquotappendT writematrixroundbgls6fileoutfile catfileoutfilequotnquotquotnquotappendT catquotFinal pooled estimate of sigma quotsigmaquotnquotquotnquotfileoutfileappendT construct asymptotic theory covariance matrices for each indiv and stack them cmat lt NULL for i in 1m id lt uin y lt ysind x lt xs bglsi lt bglsi2p1 mui lt meanfuncxbglsi1bglsi2d muit lt mui theta PU E459 CHAPTER 15 ST 762 M DAVDDIAN xmati lt attrweightfuncxbglsi1bglsi2dmuitquotgradientquot cmati lt sigma 2solvetxmati Xmati cmat lt rbindcmatasmatrixcmati 339 Stage 2 start the GTS algorithm function to evaluate convergence of EM algorithm converged lt functiongtoldbddgitergmax bmax lt maxabsdb dmax lt maxabsdd ax lt itergmax max lt maxltgtol a axltgtol converged lt gmax I bmax amp dmax converged gtstol lt 0001 gtsmax lt 5000 starting values use sample mean and covariance aa lt matrix0rr ab lt matrix0r1 get individual 
design matrices for i in 1m ai lt aimati1p1ip aa lt aa taiai ab lt ab taibglsi2p1 bpop lt solveaaab dpop lt matrixrep0ppncolpbyrowT for i in 1m ai lt aimati1p1ip bdiff lt asmatrix bglsi2p1aibpop dpop lt dpop bdiff tbdiff dpop lt dpopm Initialize EM algorithm increments giter lt 1 dd 0 lt 1 dbgog lt 1 Start EM algorithm while convergedgtstolddpopdbpopgitergtsmax E step bi lt NULL dpopinv lt solvedpop for i in 1m row1 lt i1p1 row2 lt i cmatiinv lt solvecmatrow1row2 bistar lt asmatrixbglsi2p15 ai lt aimati1p1 ip betai lt solvecmatiinvdpopinv cmatiinv bistar dpopinv ai asmatrixbpop bi lt cbindbibetai matrix with columns the current emp Bayes ests M step winv lt matrix0rr PU E46O CHAPTER 15 ST 762 M DAVDDIAN for i in 1m ai lt aimati1p1ip Winv lt Winv taidp0piuvai winv lt solvewinv bpopnew lt matrix0r1 for i in 1m ai lt aimati1p1ip bpopnew lt bpopnew winvtaidpopinvbii dpopnew lt matrixrep0ppncolpbyrowT for i in 1m row1 lt i1p1 row2 lt ip cmati inv lt solvecmatrow1row2 ai lt aimati1p1ip bdiff lt asmatrixbiiaibpopnew dpop new lt dpop new bdiff tbdiffsolvedpop invcmati inv dpopnew lt dpopnewm relative increments for checking convergence dbpop lt bpopbpopnewbpop ddpop lt dpopdpopnewdpop bpop lt bpopnew dpop lt dpopnew giter lt giter1 get final asymptotic covariance matrix of bpop amat lt matrix0rr for i in 1m ai lt aimati1p1ip rowl lt i1p1 ow2 lt 1 cmati lt cmatrow1row2 amat taisolvecmatidpopai amat lt solveamat catquotRESULTS 0F GTS ALGORITHMquotfileoutfilequotnquotquotnquotappendT catquotNumber of iterations required quotgiterlfileoutfilequotnquotquotnquotappendT catquotEstimate of population beta and SEsquotfileoutfilequotnquotquotnquotappendT bout lt cbindbpopsqrtdiagamat writematrixroundbout6fileoutfile catfileoutfilequotnquotquotnquotappendT catquotEstimate of population DquotfileoutfilequotnquotquotnquotappendT writematrixdpopfileoutfile catfileoutfilequotnquotquotnquotappendT catquotIndividual empirical Bayes estimatesquotfileoutfilequotnquotquotnquotappendT tbi lt 
t(bi)
writematrix(round(tbi, 6), file=outfile)

OUTPUT:

ARGATROBAN DATA: GTS ALGORITHM

INITIAL OLS ESTIMATION

OLS estimates
[Table of individual OLS estimates for the 37 subjects; the numerical values are garbled in this transcription and are omitted.]

OLS pooled estimate of sigma: [value lost in transcription]

GLSPL POOLED ESTIMATION

Iteration 1   Estimate of theta: 0.1519031
Iteration 2   Estimate of theta: 0.1986853
Iteration 3   Estimate of theta: 0.2122114
Iteration 4   Estimate of theta: 0.2161920
Iteration 5   Estimate of theta: 0.2173757
Iteration 6   Estimate of theta: 0.2177274
Iteration 7   Estimate of theta: 0.2178323
Iteration 8   Estimate of theta: 0.2178634
Iteration 9   Estimate of theta: 0.2178727
Iteration 10  Estimate of theta: 0.2178755
Iteration 11  Estimate of theta: 0.2178763
Iteration 12  Estimate of theta: 0.2178766
Iteration 13  Estimate of theta: 0.2178766
Iterations 14-20  Estimate of theta: 0.2178767

Final GLS estimates after 20 iterations
[Table of individual GLS estimates for the 37 subjects; values garbled in this transcription and omitted.]

Final pooled estimate of sigma: 2.346912

RESULTS
OF GTS ALGORITHM

Number of iterations required: 67

Estimate of population beta and SEs
-5.433146   0.062089
-1.927125   0.026328

Estimate of population D
            b1            b2
b1   0.137447600   0.006104642
b2   0.006104642   0.006277064

Individual empirical Bayes estimates
[Table of empirical Bayes estimates for the 37 subjects; most values are garbled in this transcription and are omitted.]

PROGRAM 15.2: Two-stage approach using linear mixed model software to fit the stage 2 model.

Here, the code to implement the first stage of estimation of the betahat_i, sigma, theta, and the Chat_i is the same as in Program 15.1. Here, however, rather than use the EM algorithm to implement the second-stage population model, we use standard linear mixed model software (here, SAS proc mixed). The rationale for this is as follows. As discussed on page 438, we may regard the betahat_i as "data" with known covariance matrices Chat_i. In general, if the stage 2 population model is of the form beta_i = A_i beta + b_i, these "data" follow the approximate linear mixed model

betahat_i = A_i beta + b_i + e_i*,   e_i* ~ N(0, Chat_i).   (15.58)

For the argatroban data, A_i is an identity matrix for all i.

The representation (15.58) suggests an alternative way to fit the second-stage model. Note that if Chat_i^(-1/2) is a matrix such that Chat_i^(-1/2) Chat_i Chat_i^(-1/2)T = I, then Chat_i^(-1/2) e_i* has identity covariance matrix. A matrix Chat_i^(-1/2) satisfying this property is the Cholesky decomposition of the matrix Chat_i^(-1); in R, Chat_i^(-1/2) is an upper triangular matrix satisfying Chat_i^(-1/2)T Chat_i^(-1/2) = Chat_i^(-1), as demonstrated in the first program below. Premultiplying (15.58) by Chat_i^(-1/2), we obtain

Chat_i^(-1/2) betahat_i = Chat_i^(-1/2) A_i beta + Chat_i^(-1/2) b_i + Chat_i^(-1/2) e_i*.   (15.59)

This has the form of a linear mixed effects model with response vector Chat_i^(-1/2) betahat_i, design matrix for fixed effects X_i = Chat_i^(-1/2) A_i, design matrix for random effects Z_i = Chat_i^(-1/2) (multiplying b_i), and error vector with mean 0 and identity covariance matrix. It follows that this model can be fitted with standard linear mixed model software to estimate beta and D, as long as the software allows the user to constrain the covariance matrix of the error vector to be an identity matrix. That is
writing vare 1 bi 73921 which is the default for SAS proc mixed the software must allow 7392 to be xed to be equal to 1 rather than estimated The use of the parms statement in proc mixed to do this is demonstrated in the second program below PAGE 464 CHAPTER 15 ST 762 M DAVDDIAN Unfortunately the lme function for tting linear mixed effects models in R or Splus does not allow the user to constrain the values of covariance parameters as does proc mixed PROGRAM STATEMENTS The rst R program carries out the stage 1 pooled GLS PL estimation and creates the data for tting the model 1559 These data are then read into the second SAS program7 which invokes proc mixed to estimate 6 and D Stage 1 GLSPL pooled estimation Stage 2 Estimation of population parameters using linear mixed model software This program outputs a data set that may then be input to linear mixed model software like SAS proc mixed or RSplus lme to carry out stage 2 function to print out matrices pretty write matrix lt functionxfilequotquotsepquot quot x lt as matrixx p lt nco x cat dimnamesx 2 format t x filefile sepc repsep pl quotnquotappendT put the mean function you want here meanfunc lt functionxb1b2dose x tinf t2 lt tinf1t1t1x f1 lt dosecl1expclt2vexpcl1t1Xtinfv compute analytical dervivatives create the gradient matrix X t3 lt 1t1xtinf templ lt doseclexpclt3v temp2 lt dosev 2expclt3v meangrad lt array0clengthx2listNULLcquotb1quotquotb2quot meangradquotb1quot lt temp1 1expclt2v1cl t3vt2vexpclt2vcl meangradquotb2quot lt temp21expclt2vt3expclt2vt2v Etrf1quotgradientquot lt meangrad 339 meanfunc2 lt functionxb1b2dose tinf lt 240 cl lt expb1 v lt expb2 t1 lt x inf t2 lt tinf1t1t1x i lt dosecl1expclt2vexpcl1t1Xtinfv c s lt a r M H do a mu f1 lt meanfunc2xb1b2dose weightf lt f1mut ti 240 cl lt expb1 v lt expb2 lt xltt nf t2 lt tinf1t1t1x t3 lt 1t1xtinf templ lt doseclexpclt3v temp2 lt dosev 2expclt3v compute analytical dervivatives create the gradient matrix X weightgrad lt 
array0clengthx2listNULLcquotb1quotquotb2quot weightgradquotb1quot lt temp1 1expclt2v1cl t3vt2vexpclt2vclmut weightgradquotb2quot lt temp21expclt2vt3expclt2vt2vmut PU E465 CHAPTER 15 ST 762 M DAVDDIAN attrweightfquotgradientquot lt weightgrad ghtf set start values etc max number of iterations for fitting algorithm cmax lt 20 start values bstart lt listb16 Ob22 0 tstart lt listtheta05 p lt 2 name of output file tfile lt quotargstage1soutquot catquotARGATROBAN DATA STAGE 1quotfileoutfilequotnquotquotnquotquotnquotappendF read in data thedata lt scanquotargconcdatquot thedata lt matrixthedatancol5byrowT indiv lt thedata2 ds lt thedata3 xs lt thedata4 ys lt thedata5 individual indicator dose time concentration n lt lengthxs uindiv lt uniqueindiv m lt lengthuindiv form individual 2nd stage covariate design matrices Ai these are all identity matrices in this case aimat lt NULL rlt p a1 lt diagp fori in 1m aimat lt rbindaimata1 quotmeanquot function for the quottrickquot pltrkfunc lt functionresidmudotmutheta trk lt residmudotmutheta trkgrad lt array0clengthmu1listNULLcquotthetaquot trkgradquotthetaquot lt trklogmudot mu at rtrkquotgradientquot lt trkgrad analytic derivative tr Stage 1 Pooled GLSPL estimation Step 1 initial fit by OLS for each indiv catquotINITIAL OLS ESTIMATIONquotfileoutfilequotnquotquotnquotappendT matrix to contain OLS estimates bolsmat lt NULL muvec lt NU L vector to contain all predicted values residvec lt NULL vector to contain all residuals for i in 1m id lt uindivi y lt ysindi i olsdat lt data framexyd ols fit lt nlsy 39 meanfuncxb1b2dolsdatbstart bols lt coefols fit mu lt meanfunc2xbols1bols2d PU E466 CHAPTER 15 ST 762 M DAVDDIAN resid lt ymu bols lt matrixbolsncolpbyrowT bolsmat lt rbindbolsmatbols muvec lt cmuvecmu residvec lt cresidvecresid bols lt cbinduindivbolsmat uvec lt matrixmuvecncol1 residvec lt matrixresidvecncol1 catquot0LS estimatesquotfileoutfilequotnquotquotnquotappendT writematrixroundbols6fileoutfile pooled OLS estimate of sigma 
sigma lt sqrtsumresidvecresidvecnmp catfileoutfilequotnquotquotnquotappendT catquot0LS pooled estimate of sigma quotsigmaquotnquotquotnquotfileoutfileappendT begin iteration loop for GLSPL catquotGLSPL POULED ESTIMATIONquotfileoutfilequotnquotquotnquotappendT for k in 1 cmax compute the geometric mean and predicted values mudot lt prodmuvec1n mudot lt exp1nsumlogmuvec mudot lt repmudotlengthmuvec pldat lt data frameresidvecmuvecmudot Step 2 estimate theta using the PL trick pl fit lt nlsdummy 39 ltrkfuncresidvecmudotmuvecthetapldat tstartcontrollistmaxiter500 theta lt coefpl fit catquotIteration quotkquotnquotfileoutfileappendT catquotEstimate of theta quotthetaquotnquotfileoutfileappendT Step 3 update estimates of beta for each individual bglsmat lt NULL new muvec NUL new residvec lt NULL for i in 1m id lt uindivi4 l y lt ysindi d x lt xsindi id d lt dsindi id mu lt muvecindivid mut lt mu theta ymut lt ymut glsdat lt data framexymutd glspl fit lt nlsymut 39 weightfuncxb1b2dmutglsdat bstartcontrollistmaxiter100 bgls lt coefglspl fit mu lt meanfunc2xbgls1bgls2d resid lt ymu bgls lt matrixbglsncolpbyrowT bglsmat lt rbindbglsmatbgls new muvec lt cnew muvecmu new residvec lt cnew residvecresid bgls lt cbinduindivbglsmat m vec lt matrixnew muvecncol1 residvec lt matrixnew residvecncol1 PAE 467 CHAPTER 15 ST 762 M DAVDDIAN Compute final estimate of sigma 2 g lt muvectheta sigma2 lt sumresidvecg2nmp sigma lt sqrtsigma2 catfileoutfilequotnquotquotnquotquotnquotappendT catquotFinal GLS estimates after 39cmaxquot iterationsquot fi eoutfilequotnquotquotnquotappendT writematrixroundbgls6fileoutfile catfileoutfilequotnquotquotnquotappendT catquotFinal pooled estimate of sigma quotsigmaquotnquotquotnquotfileoutfileappendT now transform problem to allow use of mix model software to fit 2nd stage mixdat lt NULL for i in 1m id lt uin y lt ysind x lt xs thisid lt repi p get covariance matrix for bglsi bglsi lt bglsi2p1 mui lt meanfunc2xbglsi1bglsi2d muit lt mui theta xmati lt 
attr(weightfunc(x, bglsi[1], bglsi[2], d, muit), "gradient")
    cmati <- round(sigma^2 * solve(t(xmati) %*% xmati), 6)

#  cholesky decomp of inverse

    cinvhalf <- chol(round(solve(cmati), 6))

#  Create the "data" for stage 2: response, "Xi", and "Zi"

    ai <- aimat[((i-1)*p+1):(i*p), ]
    respi <- cinvhalf %*% bglsi
    thisxi <- cinvhalf %*% ai
    thiszi <- cinvhalf
    thisdat <- cbind(thisid, respi, thisxi, thiszi)
    mixdat <- rbind(mixdat, thisdat)
  }
  writematrix(mixdat, file="argmixsig.dat")

*  read in the "data" from the first stage analysis and fit the "mixed
   model" for the second stage, with the variance of the error constrained
   to be equal to 1;

options ps=59 ls=80 nodate;

data arg;
  infile "argmixsig.dat";
  input id y x1 x2 z1 z2 @@;
run;

proc mixed data=arg method=ml;
  class id;
  model y = x1 x2 / noint solution chisq;
  random z1 z2 / subject=id type=un g gcorr gc;
  parms (0.14) (0.006) (0.006) (1) / eqcons=4;
run;

OUTPUT:

The output of the first R program implementing the first stage is identical to that from Program 15.1, so it is not presented. Compare the estimated values for beta and D from SAS proc mixed here to those obtained from Program 15.1 using the EM algorithm; they are almost identical. Slight differences likely reflect differences in the computational methods and convergence criteria used in each implementation.

The SAS System
The Mixed Procedure

Model Information
Data Set                     WORK.ARG
Dependent Variable           y
Covariance Structures        Variance Components
Subject Effect               id
Estimation Method            ML
Residual Variance Method     Parameter
Fixed Effects SE Method      Model-Based
Degrees of Freedom Method    Containment

Class Level Information
Class  Levels  Values
id     37      1 2 3 ... 37

Dimensions
Covariance Parameters     4
Columns in X              2
Columns in Z Per Subject  2
Subjects                  37
Max Obs Per Subject       2

Number of Observations
Read 74; Used 74; Not Used 0

Parameter Search
CovP1    CovP2     CovP3     CovP4   Log Like    -2 Log Like
0.1400   0.006000  0.006000  1.0000  -184.7702   369.5404
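The whitening step in the R code above (cinvhalf <- chol(solve(cmati)), followed by premultiplying the response and design matrices) is easy to check numerically. The sketch below is a Python/NumPy translation of that one idea, not of the course program: it forms an upper-triangular U with U'U = C^(-1) and verifies that U C U' = I, so that premultiplying (15.58) by U yields errors with identity covariance.

```python
import numpy as np

def whitening_factor(C):
    """Upper-triangular U with U.T @ U = inv(C), hence U @ C @ U.T = I.

    Mirrors R's chol(solve(cmati)): R's chol() returns the upper-triangular
    Cholesky factor, so we transpose NumPy's lower-triangular factor.
    """
    Cinv = np.linalg.inv(C)
    return np.linalg.cholesky(Cinv).T

# A toy 2x2 matrix standing in for an estimated covariance Chat_i of betahat_i
C = np.array([[0.14, 0.006],
              [0.006, 0.006]])
U = whitening_factor(C)

print(np.round(U @ C @ U.T, 8))  # 2x2 identity, up to rounding
```

Premultiplying betahat_i, A_i, and the implicit random-effects design by U is exactly how the "data" for the stage 2 mixed model fit are constructed.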
Iteration History
Iteration  Evaluations  -2 Log Like    Criterion
1          2            369.53348968   0.00000008
2          1            369.53348047   0.00000000

The SAS System
The Mixed Procedure

Convergence criteria met.

Estimated G Matrix
Row  Effect  id  Col1      Col2
1    z1      1   0.1374    0.006061
2    z2      1   0.006061  0.006172

Estimated Chol(G) Matrix
Row  Effect  id  Col1     Col2
1    z1      1   0.3707
2    z2      1   0.01635  0.07684

Estimated G Correlation Matrix
Row  Effect  id  Col1    Col2
1    z1      1   1.0000  0.2081
2    z2      1   0.2081  1.0000

Covariance Parameter Estimates
Cov Parm  Subject  Estimate
UN(1,1)   id       0.1374
UN(2,1)   id       0.006061
UN(2,2)   id       0.006172
Residual           1.0000

Fit Statistics
-2 Log Likelihood          369.5
AIC (smaller is better)    379.5
AICC (smaller is better)   380.4
BIC (smaller is better)    387.6

PARMS Model Likelihood Ratio Test
DF  ChiSquare  Pr > ChiSq
3   0.01       0.9998

The SAS System
The Mixed Procedure

Solution for Fixed Effects
                  Standard
Effect  Estimate  Error    DF  t Value  Pr > |t|
x1      -5.4331   0.06209  0   -87.51
x2      -1.9271   0.02625  0   -73.42

Type 3 Tests of Fixed Effects
        Num  Den
Effect  DF   DF   ChiSquare  F Value  Pr > ChiSq  Pr > F
x1      1    0    7657.74    7657.74  <.0001
x2      1    0    5390.67    5390.67  <.0001

PROGRAM 15.3: Fitting a nonlinear mixed effects model using the first-order linearization method via the SAS macro nlinmix.

Details on this macro may be found in Chapter 12 of Littell et al. (1996), although these are pretty much out of date, as new versions of the macro have since appeared. See the macro itself for information on syntax, options, and required statements. Examples of the use of the macro are available at http://www.sas.com. The macro does not allow estimation of within-subject variance parameters. We thus form our own weights corresponding to the power variance function, with theta fixed and equal to 0.22, the value obtained from pooled GLS-PL estimation.

PROGRAM STATEMENTS:

*  Fit using the SAS macro NLINMIX.  This program uses the newest version
   of the macro, available on the SAS web site;

options ps=55 ls=80 nodate;

*  include the macro from file nlmm801.sas;

%inc
7afsunityncsuedulockersdeptstatinfost762infowwwdavidiannlinmixnlmm801sas osource input the data data arg infile ar concdat input obsno indiv ose time conc un data ar set ar tinfE40 E if timegttinf then t1O t2tinf1t1t1time run for first order approxiation use expandzero or exansion around empirical Bayes estiamtes use expandeblup aninmixdataarg modelstr clexpbeta1b1 vexpbeta2b2 predvdosecl1expclt2vexpcl1t1timetinfv derivsstr wt1predv2022 parmsstrbeta160 beta220 V model pseudoconc dbeta1 dbeta2 noint notest solution random db1 db2 subjectindiv typeun solution weight wt exp zero procoptstrmaxiter500 methodml run OU TP U T Note that these results differ nontriVially from those obtained using two stage methods This is not surprising7 as the rst order linearization is a different and cruder approximation The SAS System 1 The Mixed Procedure Model Information WORKNLINMIX ependent Variable eu onc eight Variable t novariance Structure Unstructured t Lndiv ro e ethod odelBased egrees of Freedom Method Containment Class Level Information Class Levels Values indiv 37 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 2O 21 22 23 PAE 471 CHAPTER15 Number of Observations Read Number of Observations Used 34 35 24 25 26 27 28 29 30 31 32 33 36 37 Dimensions Covariance Parameters o umns in X Columns in Z Per Subject Subjects Max Obs Per Subject ptm pummp Number of Observations Number of Observations Not Used 0 CovP1 01578 CovP2 000308 Iteration 1 Effect dbeta1 dbeta2 Effect indiv db1 1 db2 1 Parameter Search CovP3 001676 CovP4 69980 Variance 69980 The SAS System The Mixed Procedure Iteration History Evaluations 2 Log Like 1 587062357427 Convergence criteria met Covariance Parameter Estimates Subject Estimate Fit Statistics 2 Log Likelihood AIC smaller is better AICC smaller is better BIC smaller is better PARMS Model Likelihood Ratio Test DF ChiSquare Pr gt ChiSq 3 000 10000 Solution for Fixed Effects Standard Estimate DE 401 401 t Value 8280 5330 54889 006629 18277 003429 Solution 
for Random Effects
                        Std Err
Effect  indiv  Estimate Pred    DF   t Value  Pr > |t|
db1     1      0.2729   0.1368  401  1.99     0.0467
db2     1      0.01884  0.1229  401  0.15     0.8782

[The rows of the Solution for Random Effects table for the remaining subjects, together with interleaved fragments of other output tables, are garbled in this transcription and are omitted.]

PROGRAM 15.4: Fitting a nonlinear mixed effects model using the more refined linear approximation method (expanding about empirical Bayes estimates) via the SAS macro nlinmix.

The program is identical to the one in Program 15.3, with the statement expand=zero replaced by expand=eblup. Thus, we only give the output.

OUTPUT:

These results differ from those obtained from the first-order linearization. They are much closer to, but not exactly equal to, those obtained via two-stage methods. The biggest discrepancy is in the estimate of D, which is of course the most difficult parameter to estimate.

The SAS System
The Mixed Procedure

Model Information
Data Set                     WORK.NLINMIX
Dependent Variable           pseudo_conc
Weight Variable              wt
Covariance Structure         Unstructured
Subject Effect               indiv
Estimation Method            ML
Residual Variance Method     Profile
Fixed Effects SE Method      Model-Based
Degrees of Freedom Method    Containment

Class Level Information
Class  Levels  Values
indiv  37      1 2 3 ... 37

Dimensions
Covariance Parameters     4
Columns in X              2
Columns in Z Per Subject  2
Subjects                  37
Max Obs Per Subject       14

Number of Observations
Read 475; Used 475; Not Used 0

Parameter Search
CovP1    CovP2     CovP3     CovP4   Variance  Log Like     -2 Log Like
0.1378   0.005669  0.004761  5.4908  5.4908    -2862.6476   5725.2952

The SAS System The Mixed
Procedure Iteration History Iteration Evaluations 2 Log Like Criterion 1 1 572529518878 0 00000000 Convergence criteria met Covariance Parameter Estimates Cov Parm Subject Estimate PAE 474 CHAPTER15 Effect db1 db2 Effect uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu Effect dbeta1 dbeta2 indiv indiv UN11 indiv 01378 UN21 indiv 0005669 UN22 indiv 0004761 Residual 54908 Fit Statistics 2 Log Likelihood AIC sma FARMS Model Likelihood Ratio Test DF ChiSquare 3 000 ller is better AICC smaller is better BIC smaller is better Solution for Fixed Effects Pr gt ChiSq 10000 Standard Estimate E DF t Value Pr gt Itl 54325 006212 401 8746 lt0001 19256 002527 401 7619 lt0001 Solution for Random Effects Std Err Estimate Pred DF t Value Pr gt Itl 1 02637 01452 401 182 00700 1 001700 006717 401 025 08004 The SAS System 3 The Mixed Procedure Solution for Random Effects Std Err Estimate Pred DF t Value Pr gt Itl 010 1311 04240 0 003189 0 06692 09620 3 1484 00178 0 0101 0 06727 08800 02 1 01391 00716 0 0190 0 06707 07766 02 3 01135 00129 0 0199 0 06642 07642 05 5 01349 lt0001 0 0267 006715 06909 383 0 08600 lt0001 0 0306 0 06309 06272 05 1311 lt0001 0 0288 06704 06673 0 0321 08801 07155 054 6433 03951 031 08926 07206 0 008431 06445 08960 4337 08142 lt0001 001724 06038 07754 0002182 08813 09803 001907 06421 07666 0551 7200 lt0001 0 0264 05656 06399 062 08316 04500 0 00332 06339 09582 76 07003 1 lt0001 057 5508 4 02976 0 05165 8218 3 05301 0 01466 06276 3 08154 06954 06870 1 2 lt0001 0 00726 05201 4 08890 1277 08014 9 01119 0 001505 06288 2 09809 2241 08277 1 00071 01173 06388 18 08544 0 05604 07652 7 04644 0 06830 06179 1 02697 08008 07583 0 02916 0 03468 06137 5 05723 02265 07495 30 07627 006328 06114 04 03013 008156 07313 12 02654 ST 762 M DAVDDIAN PAE 475 CHAPTER 15 ST 762 M DAVDDIAN 2 23 002913 005921 401 049 06230 d b1 24 008301 007313 401 114 02570 2 24 000273 005911 401 005 09632 The SAS System 4 The Mixed Procedure Solution for Random Effects Std Err Effect indiv Estimate Pred DF t 
Value Pr gt Itl 3 80 06883 40 8 lt0001 3 0 07712 05423 40 2 01558 3 78 07851 40 3 lt0001 3 0 01209 06296 40 9 08478 3 0 5640 06704 40 lt0001 3 0 02384 04972 40 06318 3 441 07097 40 lt0001 3 0 008292 05432 40 08787 3 777 07316 40 00156 3 0 02939 05997 40 06243 3 0 2666 07462 40 00004 3 0 01870 06118 40 07600 3 0 4557 07830 40 lt0001 3 03375 06304 40 05927 3 0 04167 07030 40 05537 3 02924 05675 40 06067 3 0 7385 06509 40 1 lt0001 3 0 01240 04346 401 07756 3 0 06433 07025 401 03604 3 0 003406 057 7 40 09526 3 0 607 06913 40 05055 3 0 06350 05520 40 02506 3 0 5114 08021 40 lt0001 3 0 05595 06361 40 03797 3 04989 07870 40 4 lt0001 3 7 000973 06310 40 5 08775 PROGRAM 155 Fitting a nonlinear mixed e ects model using the more re ned linear approximation method expanding about empirical Bayes estimates uia the Splus function nlme Details about the RSplus function nlme may be found in Pinheiro and Bates 2000 The nlme function allows estimation of within subject variance parameters We thus allow 9 to be estimated rather than held xed PROGRAM STATEMENTS Fit nonlinear effects model with intraplot power variance function using nlme agatata librarynlme Read in the data from the data frame thedata lt scanquotargconcdatquot thedata lt matrixthedatancol5byrowT indiv lt thedata2 individual indicator ds thedata3 dose xs lt thedata4 time ys lt thedata5 concentration A I sdat lt dataframeindivxsysds PAE 476 CHAPTER 15 ST 762 M DAVDDIAN Define the mean response function Here we program both the function and its derivatives with respect to each parameter I analytic derivatives are no provided by the user nlme will use numeric derivatives meanfunc lt functionxb1b2dose 1 lt xlttinf t2 lt tinf1t1t1x f1 lt dosecl1expclt2vexpcl1t1Xtinfv compute analytical dervivatives create the gradient matrix X t3 lt 1t1xtinf temp1 lt doseclexpclt3v temp2 lt doseW2expclt3v meangrad lt array0clengthx2listNULLcquotb1quotquotb2quot meangradquotb1quot lt temp1 1expclt2v1cl t3vt2vexpclt2vcl meangradquotb2quot lt 
temp21expclt2vt3expclt2vt2v Etrf1quotgradientquot lt meangrad H Fit the model with ML estimation variance components verboseT prints out history of each iteration Because the model is nonlinear starting values for the fixed effects the vector beta are provided power variance function with power parameter estimated from the data starting value 05 agatatata agatatatatatsr agatatata agatatatata outfile lt IIargnlme Routquot catquotARGATROBAN PK DATAquotfileoutfilequotnquotquotnquotappendF catquotML fit with estimation of thetaquotfileoutfilequotnquotquotnquotappendT sinkoutfileappendT argmlfit lt nlmeys 39 meanfuncxsb1b2ds fixed listb1 39 1b2 1 random listb1 39 1b2 39 1 roups indiv at start listfixed c6020 methodquotMLquotverboseTweightsvarPowerO5 catquotnquotquotnquotquotnquotfileoutfileappendT printsummaryarg mlfit catquotEstimate of sigmaquotarg mlfitsigmafileoutfilequotnquotappendT catquotnquotquotnquotfileoutfileappendT sink OUTPUT Note that the results are close to those obtained Via SAS proc nlinmix with expandeblup7 although they are not identical This is likely due to differences in implementation and the fact that 9 is estimated here ML fit with estimation of theta Iteration 1 LME step Loglik 2911 786 nlm iterations 10 reStruct arameters indiv indiv2 indivS 6237879 6354590 870406418 varStruct parameters PAE 477 CHAPTER 15 ST 762 M DAVDDIAN ower 003828819 PNLS step RSS 2220579 fixed e fects543318 19864 iterations Convergence ixed reStruct varStruct 01043257 129128790 119909972 Iteration 2 LME step Loglik 2866087 nlm iterations 10 ameters in iv1 indiv2 indiv3 4302554 6675258 73064398 varStruct over 019 0856 PNLS step RSS 34564 fe 8 fixed e cts54327 192693 iterations 5 Convergence ixed reStruct varStruct 003086151 182543060 080569923 Iteration 3 LME step Loglik 2862693 nlm iterations 10 reStruct parameters indiv1 indiv2 indiv3 4081142 5929969 88516752 varStruct parameters over 023 8328 PNLS ste RSS 2220562 fixed egfects543257 192031 4 iterations Conver ence 
ixed reStruct varStruct 0003450617 0066300763 0152874701 Iteration 4 LME step Loglik 2862309 nlm iterations 10 reStruct parameters indiv1 indiv2 indiv3 4049980 5721158 83012918 varStruct parameters ower 024 5995 PNLS ste RSS 2022085 fixed egfects543255 191829 iterations Conver ence ixed reStruct varStruct 0001052599 0015257246 0032147402 Iteration 5 LME step Loglik 2862228 nlm iterations 8 reStruct parameters indiv1 indiv2 indiv3 4044804 5713681 84298986 varStruct parameters over 024 0759 PNLS step RSS 1985808 fixed e fects543255 191799 iterations 2 Convergence ixed reStruct varStruct 00001527312 00021381005 00060735501 Iteration 6 LME step Loglik 2862214 nlm iterations 6 reStruct parameters indiv1 indiv2 indiv3 4043671 5713053 84138894 varStruct parameters over 024 2481 PNLS step RSS 198157 fixed e fects543255 191799 PAE 478 CHAPTER 15 ST 762 M DAVDDIAN iterations 1 Convergence fixed reStruct varStruct 0 0000000000 00002550833 00007078780 Iteration 7 LME step Loglik 2862214 nlm iterations 2 reStruct parameters indiv1 indiv2 indiv3 4043694 5712481 84136193 varStruct parame ers over 024 2503 PNLS step RSS 1981501 fixed e fects543255 191799 iterations 1 Convergence ixed reStruct varStruct 0000000e00 3680926e04 9032974e06 Iteration 8 LME step Loglik 2862214 nlm iterations 2 reStruct parameters indiv1 indiv2 indiv3 4043670 5712186 84105234 varStruct parameters over 024 2528 PNLS step RSS 1981432 fixed e fects543255 191799 iterations 1 Convergence fixed reStruct varStruct 0000000e00 1977769e04 1047465e05 Iteration 9 LME step Loglik reStruc paramete 2862214 nlm iterations 2 rs indiv1 indiv2 indiv3 4043649 5712022 84088604 varStruct parameters ower 02432552 PNLS ste RSS 1981371 fixed egfects543255 191799 iterations Convergence fixed reStruct varStruct 0000000e00 1038129e04 9917301e06 Iteration 10 LME step Loglik 2862214 nlm iterations 2 reStruct parameters indiv1 indiv2 indiv3 4043633 5711929 84079875 varStruct parame ers over 024 2578 PNLS step RSS 1981309 fixed e 
fects543255 191799 iterations 1 Convergence ixed reStruct varStruct 0000000e00 5166518e05 1058012e05 Iteration 11 LME step Loglik 2862214 nlm iterations 2 reStruct parameters indiv1 indiv2 indiv3 4043617 5711873 84075531 varStruct parameters over 024 2599 PNLS step RSS 1981260 fixed e fects543255 191799 iterations 1 Convergence fixed reStruct varStruct 0000000e00 2731288e05 8505028e06 PAE 479 CHAPTER 15 ST 762 M DAVDDIAN Iteration 12 LME step Loglik 2862214 nlm iterations 3 reStruct parameters indiv1 indiv2 indiv3 4043605 5711839 84073235 varStruct parameters over 02 3261 PNLS step RSS 1981232 fixed e fects543255 191799 iterations 1 Convergence ixed reStruct varStruct 0000000e00 1744778e05 4678971e06 Iteration 13 LME step Loglik 2862214 nlm iterations 3 reStruct parameters indiv1 indiv2 indiv3 4043598 5711818 84071768 varStruct parameters over 024 2619 PNLS step RSS 1981211 fixed e fects543255 191799 iterations 1 Convergence fixed reStruct varStruct 0000000e00 9281553e06 3698841e06 Nonlinear mixedeffects model fit by maximum likelihood Model ys 39 meanfuncxs b1 b2 ds Data sdat AIC BIC logLik 5738429 5767572 2862214 Random effects Formula listb1 39 1 b2 39 1 Leve indiv Structure General positivedefinite LogCholesky parametrization ev 0 b1 037168333 b1 b2 006753254 0268 Residual 2042295300 Variance function Struc ure Power of variance covariate Formula 39fitted Parameter estimates ower 02432619 Fixed effects listb1 39 1 b2 39 1 Value S d rror DF tvalue pvalue b1 5432546 006230325 437 8719522 0 b2 1917993 002513039 437 7632165 0 Correlation b1 b2 0156 Standardized WithinGroup Residuals Min Q Med Q3 Max 248365331 042422129 003291951 057930497 938429344 Number of Observations 475 Number of Groups Estimate of sigma 2042295 PROGRAM 156 Fitting a nonlinear mixed e ects model using the exact likelihood integration by adaptive quadrature using SAS proc nlmixed Details on the use of proc nlmixed are available in the SASSTAT documentation The program supports different models 
for p(yij | xij, ai, bi; β) by default. The default normal model does not
allow nonconstant variance. It is possible for the user to specify his/her
own such model. Rather than do that here, we simply exploit the
correspondence between the transform-both-sides model with the Box-Cox
transformation, noted in Chapter 2, and the power variance model (recall
that θ = 1 - λ).

PROGRAM STATEMENTS

* Fit using SAS PROC NLMIXED;
options ps=55 ls=80 nodate;

* input the data;
data arg; infile 'argconc.dat';
  input obsno indiv dose time conc;
run;

* proc nlmixed does not allow variance functions, so we use
  transform-both-sides instead, with theta = 0.25, lambda = 1 - 0.25 = 0.75;
data arg; set arg;
  tinf=240; t1=1;
  if time>tinf then t1=0;
  t2=tinf*(1-t1)+t1*time;
  conctrans=conc**0.75;
run;

* put method=firo in the proc nlmixed statement to get the first-order
  approximation; the default is adaptive Gaussian quadrature;
proc nlmixed data=arg;
  parms beta1=-6.0 beta2=-2.0 s2b1=0.14 cb12=0.006 s2b2=0.006 s2=23;
  cl=exp(beta1+b1); v=exp(beta2+b2);
  pred=((dose/cl)*(1-exp(-cl*t2/v))*exp(-cl*(1-t1)*(time-tinf)/v))**0.75;
  model conctrans ~ normal(pred,s2);
  random b1 b2 ~ normal([0,0],[s2b1,cb12,s2b2]) subject=indiv;
run;

OUTPUT: Note that these results differ somewhat from those obtained from the
approximate methods. Again, the main difference is in estimation of the
covariance parameters. Note in particular that the estimate of σ here is
about 4.39 = (19.2727)^{1/2}, which is considerably smaller.

                        The SAS System
                    The NLMIXED Procedure

                        Specifications
Data Set                              WORK.ARG
Dependent Variable                    conctrans
Distribution for Dependent Variable   Normal
Random Effects                        b1 b2
Distribution for Random Effects       Normal
Subject Variable                      indiv
Optimization Technique                Dual Quasi-Newton
Integration Method                    Adaptive Gaussian Quadrature

                         Dimensions
Observations Used                475
Observations Not Used              0
Total Observations               475
Subjects                          37
Max Obs Per Subject               14
Parameters                         6

                         Parameters
 beta1  beta2  s2b1   cb12   s2b2  s2   NegLogLike
  -6.0   -2.0  0.14  0.006  0.006  23            .

                      Iteration History
[the iteration-by-iteration trace is omitted; the negative log-likelihood
decreased from 3098.58474 at iteration 1]
[remaining iterations omitted: the negative log-likelihood decreased to
2003.92213]
NOTE: GCONV convergence criterion satisfied.

                       Fit Statistics
-2 Log Likelihood               4007.8
AIC (smaller is better)         4019.8
AICC (smaller is better)        4020.0
BIC (smaller is better)         4029.5

                     Parameter Estimates
                      Standard
Parameter  Estimate   Error     DF  t Value  Pr > |t|  Alpha     Lower
beta1      -5.4237    0.06277   35   -86.40    <.0001   0.05   -5.5511
beta2      -1.9238    0.02972   35   -64.73    <.0001   0.05   -1.9841
s2b1        0.1411    0.03389   35     4.16    0.0002   0.05   0.07230
cb12        0.006562  0.0102    35     0.64    0.5242   0.05  -0.01415
s2b2        0.006010  0.006141  35     0.98    0.3345   0.05  -0.00646
s2         19.2720    1.36128   35    14.16    <.0001   0.05    16.508

             Parameter Estimates (continued)
Parameter      Upper    Gradient
beta1        -5.2963    0.000145
beta2        -1.8635    0.00138
s2b1          0.2099    0.0203
cb12          0.02727   0.008466
s2b2          0.01848   0.006348
s2           22.036            .

EXAMPLE 2: Seizure data. As an example of a response that is not continuous,
and that might be taken to follow a generalized linear mixed model, we
consider the data from Thall and Vail (1990) on epileptic seizures described
in Section 14.7. Rather than use a marginal model, as was done in Section
14.7, we consider a subject-specific model, as follows.

Here, recall that Yij is the number of seizures experienced in the two weeks
prior to the jth visit. We may identify zij = (1, v4ij)', where, as in
Section 14.7, v4ij = 0 if observation j for subject i is prior to the 4th and
last visit, and v4ij = 1 if observation j is the last one (4th visit), and we
write βi = (ai, bi)'. Consider the following first-stage model:

  E(Yij | zij, ai, bi) = exp(ai + bi v4ij),  j = 1, ..., 4.   (15.60)

We consider two second-stage (population) models of the form
βi = Ai β + Bi bi, with β = (β0, β1, β2, β3, β4, β5)'. In the first, we take

  Ai = [ 1  log(base_i/4)  log(age_i)  trt_i  0  trt_i log(base_i/4) ]
       [ 0        0             0        0    1           0          ],

  Bi = I2 (the 2 x 2 identity matrix), bi = (b1i, b2i)'.   (15.61)

In the second, we simplify the first by keeping Ai and β the same, but
instead taking

  Bi = (1, 0)'.   (15.62)

The model in (15.61) allows the effect of final visit to be different for
each individual, while (15.62) only allows subject-specific variation to
manifest itself through a multiplicative effect exp(b1i) on the individual
conditional mean response.

PROGRAM 15.7: Fitting a generalized linear mixed effects model using the more
refined linear approximation method expanding about empirical Bayes estimates
(PQL) via the SAS macro glimmix.

The macro glimmix is set up to deduce starting values for β in a way similar
to that used by proc genmod; thus, the user need not specify starting values.
See Littell et al. (1996, Chapter 11) for more on the use of the macro,
although this is a bit outdated; see http://www.sas.com for more recent
examples.

PROGRAM STATEMENTS

* Fit a loglinear mixed
  model to the epileptic seizure data of Thall and Vail (1990) using the SAS
  macro glimmix. These are count data, so we use the Poisson mean-variance
  assumptions. First access the glimmix macro;

options ls=80 ps=59 nodate; run;

%inc '/afs/eos.ncsu.edu/info/st/st762_info/www/davidian/nlinmix/glmm800.sas'
  / nosource;

* The data look like (first 8 records on first 2 subjects)

  104  5 1 0 11 31
  104  3 2 0 11 31
  104  3 3 0 11 31
  104  3 4 0 11 31
  106  3 1 0 11 30
  106  5 2 0 11 30
  106  3 3 0 11 30
  106  3 4 0 11 30

  column 1  subject
  column 2  number of seizures
  column 3  visit (1-4, biweekly visits)
  column 4  0 if placebo, 1 if progabide
  column 5  baseline number of seizures in the 8 weeks prior to the study
  column 6  age;

data seizure; infile 'seize.dat';
  input subject seize visit trt base age;
run;
data seizure; set seizure;
  if subject=207 then delete;
  logbase=log(base/4);
  logage=log(age);
  basetrt=logbase*trt;
  if visit<4 then visit4=0;
  if visit=4 then visit4=1;
run;

* Fit two models: (a) a model that contains only a random "intercept" in the
  linear predictor, to capture inter-subject variation and induce
  correlation; (b) a fancier model with random effects both for "intercept"
  and time (visit4), to allow the dependence with time to be
  subject-specific. See the macro for information on syntax, required
  statements, and options; see the proc mixed documentation for information
  on specifying the model using the "stmts" statement;

title "RANDOM INTERCEPT ONLY";
%glimmix(data=seizure, procopt=%str(method=reml),
  stmts=%str(
    model seize = logbase logage trt visit4 basetrt / solution;
    random intercept / subject=subject type=un;
  ),
  error=poisson, link=log);
run;

title "RANDOM INTERCEPT AND TIME";
%glimmix(data=seizure, procopt=%str(method=reml),
  stmts=%str(
    model seize = logbase logage trt visit4 basetrt / solution;
    random intercept visit4 / subject=subject type=un g gcorr;
  ),
  error=poisson, link=log);
run;

OUTPUT: The results differ somewhat from those obtained under a marginal
model, given in Section 14.7. This is to be expected: the meanings of the
components of β are different under the marginal and subject-specific
models.

RANDOM INTERCEPT ONLY

                     The Mixed Procedure

                     Model Information
Data Set                        WORK._DS
Dependent Variable              _z
Weight Variable                 _w
Covariance Structure            Unstructured
Subject Effect                  subject
Estimation Method               REML
Residual Variance Method        Profile
Fixed Effects SE Method         Model-Based
Degrees of Freedom Method       Containment

                  Class Level Information
Class     Levels  Values
subject       59  101 102 103 104 106 107 108 110 111 112 113 114 116
                  117 118 121 122 123 124 126 128 129 130 135 137 139
                  141 143 145 147 201 202 203 204 205 206 207 208 209
                  210 211 213 214 215 217 218 219 220 221 222 225 226
                  227 228 230 232 234 236 238

                        Dimensions
Covariance Parameters             2
Columns in X                      6
Columns in Z Per Subject          1
Subjects                         59
Max Obs Per Subject               4

                  Number of Observations
Number of Observations Read             236
Number of Observations Used             236
Number of Observations Not Used           0

                      Parameter Search
 CovP1   CovP2  Variance  Res Log Like  -2 Res Log Like
0.2242  1.9704    1.9704     -259.6176         519.2351

                     Iteration History
Iteration  Evaluations  -2 Res Log Like       Criterion
        1            1     519.23513864      0.00000000
Convergence criteria met.

             Covariance Parameter Estimates
Cov Parm    Subject    Estimate
UN(1,1)     subject      0.2242
Residual                 1.9704

                      Fit Statistics
-2 Res Log Likelihood           519.2
AIC (smaller is better)         523.2
AICC (smaller is better)        523.3
BIC (smaller is better)         527.4

             PARMS Model Likelihood Ratio Test
  DF    Chi-Square    Pr > ChiSq
   1          0.00        1.0000

                Solution for Fixed Effects
                      Standard
Effect     Estimate   Error      DF   t Value   Pr > |t|
Intercept   -1.4437   1.2200     54     -1.18     0.2419
logbase      0.8810   0.1337    176      6.59     <.0001
logage       0.5285   0.3577    176      1.48     0.1413
trt         -0.9111   0.4251    176     -2.14     0.0333
visit4      -0.1611   0.07661   176     -2.10     0.0369
basetrt      0.3380   0.2100    176      1.61     0.1083

               Type 3 Tests of Fixed Effects
            Num    Den
Effect       DF     DF    F Value    Pr > F
logbase       1    176      43.39    <.0001
logage        1    176       2.18    0.1413
trt           1    176       4.60    0.0333
visit4        1    176       4.42    0.0369
basetrt       1    176       2.61    0.1083

RANDOM INTERCEPT ONLY
                  GLIMMIX Model Statistics
Description                            Value
Deviance                            428.6978

RANDOM INTERCEPT AND TIME

                     The Mixed Procedure

                     Model Information
Data Set                        WORK._DS
Dependent Variable              _z
Weight Variable                 _w
Covariance Structure            Unstructured
Subject Effect                  subject
Estimation Method               REML
Residual Variance Method        Profile
Fixed Effects SE Method         Model-Based
Degrees of Freedom Method       Containment

                  Class Level Information
Class     Levels  Values
subject       59  (same subject IDs as for the previous fit)

                        Dimensions
Covariance Parameters             4
Columns in X                      6
Columns in Z Per Subject          2
Subjects                         59
Max Obs Per Subject               4

Number of Observations Read             236
Number of Observations Used             236
Number of Observations Not Used           0

                      Parameter Search
 CovP1    CovP2  CovP3   CovP4  Variance  Res Log Like  -2 Res Log Like
0.2710  0.08879      .  1.9402    1.9402     -256.7671         513.5342

                     Iteration History
Iteration  Evaluations  -2 Res Log Like       Criterion
        1            1     513.53424057      0.00000000
Convergence criteria met.

                   Estimated G Matrix
Row  Effect      subject      Col1      Col2
  1  Intercept   101        0.2710   0.08879
  2  visit4      101       0.08879         .

              Estimated G Correlation Matrix
Row  Effect      subject      Col1      Col2
  1  Intercept   101        1.0000         .
  2  visit4      101             .    1.0000

             Covariance Parameter Estimates
Cov Parm    Subject    Estimate
UN(1,1)     subject      0.2710
UN(2,1)     subject     0.08879
UN(2,2)     subject           .
Residual                 1.9402

                      Fit Statistics
-2 Res Log Likelihood           513.5
AIC (smaller is better)         519.5
AICC (smaller is better)        519.6
BIC (smaller is better)         525.8

             PARMS Model Likelihood Ratio Test
  DF    Chi-Square    Pr > ChiSq
   2          0.00        1.0000

                Solution for Fixed Effects
                      Standard
Effect     Estimate   Error      DF   t Value   Pr > |t|
Intercept   -1.2917   1.1710     55     -1.10      0.276
logbase      0.8870   0.1266    117      7.01     <.0001
logage       0.4740   0.3437    117      1.38     0.1705
trt         -0.9962   0.3935    117     -2.53     0.0127
visit4      -0.07502  0.07383    58     -1.02     0.3138
basetrt      0.3771   0.1843    117      2.05     0.0430

               Type 3 Tests of Fixed Effects
            Num    Den
Effect       DF     DF    F Value    Pr > F
logbase       1    117      49.08    <.0001
logage        1    117       1.90    0.1705
trt           1    117       6.41    0.0127
visit4        1     58       1.03    0.3138
basetrt       1    117       4.19    0.0430

RANDOM INTERCEPT AND TIME
                  GLIMMIX Model Statistics
Description                            Value
Deviance                            436.4705
Scaled Deviance                     224.9584
Pearson Chi-Square                  386.6099
Scaled Pearson Chi-Square           199.2601
Extra-Dispersion Scale                1.9402

PROGRAM 15.8: Fitting a generalized linear mixed effects model using the
exact likelihood (integration by adaptive Gaussian quadrature) using SAS
proc nlmixed.

We inspect the results of fitting the model using glimmix to deduce rough
starting values for fitting models (15.60) with (15.61) and with (15.62) in
proc nlmixed. Here, we use the default adaptive Gaussian quadrature approach
to the necessary integration.

PROGRAM STATEMENTS

* Fit a loglinear mixed model to the epileptic seizure data of Thall and
  Vail (1990) using the SAS procedure nlmixed. These are count data, so we
  use the Poisson mean-variance assumptions;

options ls=80 ps=59 nodate; run;

* The data look like (first 8 records on first 2 subjects)

  104  5 1 0 11 31
  104  3 2 0 11 31
  104  3 3 0 11 31
  104  3 4 0 11 31
  106  3 1 0 11 30
  106  5 2 0 11 30
  106  3 3 0 11 30
  106  3 4 0 11 30

  column 1  subject
  column 2  number of seizures
  column 3  visit (1-4, biweekly visits)
  column 4  0 if placebo, 1 if progabide
  column 5  baseline number of seizures in the 8 weeks prior to the study
  column 6  age;

data seizure; infile 'seize.dat';
  input subject seize visit trt base age;
run;
data seizure; set seizure;
  if subject=207 then delete;
  logbase=log(base/4);
  logage=log(age);
  basetrt=logbase*trt;
  if visit<4 then visit4=0;
  if visit=4 then visit4=1;
run;

* Fit two models: (a) a model that contains only a random "intercept" in the
  linear predictor, to capture inter-subject variation and induce
  correlation; (b) a fancier model with random effects both for "intercept"
  and time (visit4), to allow the dependence with time to be
  subject-specific. Use the results from running the glimmix macro to deduce
  starting values. See the proc nlmixed documentation for information on
  syntax, required statements, and options;
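The integral that the quadrature must approximate here is, for each subject i, the marginal likelihood contribution ∫ Πj p(yij | xij, bi; β) p(bi; σb²) dbi. As a rough illustration of the idea only (proc nlmixed actually uses an adaptive rule, recentered and rescaled at empirical Bayes estimates subject by subject), the following Python sketch applies a fixed 5-point Gauss-Hermite rule to the Poisson random-intercept model; the function names and the example numbers are invented for illustration:

```python
import math

# Standard 5-point Gauss-Hermite nodes/weights for integrals of the
# form  int g(x) exp(-x^2) dx.
GH_NODES = [-2.020182870456086, -0.958572464613819, 0.0,
            0.958572464613819, 2.020182870456086]
GH_WEIGHTS = [0.019953242059046, 0.393619323152241, 0.945308720482942,
              0.393619323152241, 0.019953242059046]

def poisson_pmf(y, mu):
    return math.exp(-mu) * mu ** y / math.factorial(y)

def marginal_lik(ys, etas, sigma):
    # One subject's marginal likelihood contribution under a Poisson
    # model with a single normal random intercept b ~ N(0, sigma^2):
    #   L = int prod_j Poisson(y_j; exp(eta_j + b)) N(b; 0, sigma^2) db.
    # Substituting b = sqrt(2)*sigma*x turns this into
    #   (1/sqrt(pi)) * sum_k w_k * prod_j Poisson(y_j; exp(eta_j + b_k)).
    total = 0.0
    for x, w in zip(GH_NODES, GH_WEIGHTS):
        b = math.sqrt(2.0) * sigma * x
        total += w * math.prod(poisson_pmf(y, math.exp(eta + b))
                               for y, eta in zip(ys, etas))
    return total / math.sqrt(math.pi)

# one hypothetical subject with counts (3, 2) and linear predictors (1.0, 0.8)
lik = marginal_lik([3, 2], [1.0, 0.8], 0.5)
```

A fixed rule like this works well only when the integrand is well centered near zero; the adaptive version recenters and rescales the nodes for each subject to ensure this, which is why it can get away with very few quadrature points.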
proc nlmixed data=seizure;
  parms b0=-2 b1=1 b2=1 b3=-1 b4=-0.1 b5=0.5 sb2=0.2;
  eta=b0+b1*logbase+b2*logage+b3*trt+b4*visit4+b5*basetrt+bi;
  f=exp(eta);
  model seize ~ poisson(f);
  random bi ~ normal(0,sb2) subject=subject;
run;

proc nlmixed data=seizure;
  parms b0=-2 b1=1 b2=1 b3=-1 b4=-0.1 b5=0.5 sb12=0.3 cb12=0.02 sb22=0.009;
  eta=b0+b1*logbase+b2*logage+b3*trt+b4*visit4+b5*basetrt+b1i+b2i*visit4;
  f=exp(eta);
  model seize ~ poisson(f);
  random b1i b2i ~ normal([0,0],[sb12,cb12,sb22]) subject=subject;
run;

OUTPUT: These results differ somewhat from those obtained using the
approximate method in glimmix.

                        The SAS System
                    The NLMIXED Procedure

                        Specifications
Data Set                              WORK.SEIZURE
Dependent Variable                    seize
Distribution for Dependent Variable   Poisson
Random Effects                        bi
Distribution for Random Effects       Normal
Subject Variable                      subject
Optimization Technique                Dual Quasi-Newton
Integration Method                    Adaptive Gaussian Quadrature

                         Dimensions
Observations Used                236
Observations Not Used              0
Total Observations               236
Subjects                          59
Max Obs Per Subject                4
Parameters                         7
Quadrature Points                  1

                         Parameters
 b0   b1   b2   b3    b4   b5   sb2   NegLogLike
 -2    1    1   -1  -0.1  0.5   0.2   896.856116

                      Iteration History
[iterations 1-19 omitted: the negative log-likelihood decreased from 896.86
to 665.358734]
Iter  Calls   NegLogLike        Diff    MaxGrad
  20     41   665.358734    5.281E-8   0.000567
NOTE: GCONV convergence criterion satisfied.

                       Fit Statistics
-2 Log Likelihood               1330.7
AIC (smaller is better)         1344.7
AICC (smaller is better)        1345.2
BIC (smaller is better)         1359.3

                     Parameter Estimates
                     Standard
Parameter  Estimate  Error     DF  t Value  Pr > |t|  Alpha     Lower     Upper
b0         -1.3387   1.1800    58    -1.13    0.2613   0.05   -3.7008    1.0234
b1          0.8845   0.1310    58     6.75    <.0001   0.05    0.6224    1.1466
b2          0.4846   0.3466    58     1.40    0.1674   0.05   -0.2091    1.1783
b3         -0.9332   0.4001    58    -2.33    0.0232   0.05   -1.7341   -0.1324
b4         -0.1611   0.05458   58    -2.95    0.0046   0.05   -0.2703  -0.05184
b5          0.3384   0.2029    58     1.67    0.1008   0.05  -0.06782    0.7446
sb2         0.2516   0.05855   58     4.30    <.0001   0.05    0.1344    0.3688
(the final gradients were all smaller than 6E-4 in absolute value)

                    The NLMIXED Procedure

                        Specifications
Data Set                              WORK.SEIZURE
Dependent Variable                    seize
Distribution for Dependent Variable   Poisson
Random Effects                        b1i b2i
Distribution for Random Effects       Normal
Subject Variable                      subject
Optimization Technique                Dual Quasi-Newton
Integration Method                    Adaptive Gaussian Quadrature

                         Dimensions
Observations Used                236
Observations Not Used              0
Total Observations               236
Subjects                          59
Max Obs Per Subject                4
Parameters                         9
Quadrature Points                  1

                         Parameters
 b0   b1   b2   b3    b4   b5   sb12  cb12   sb22   NegLogLike
 -2    1    1   -1  -0.1  0.5   0.3   0.02  0.009    836.71658

                      Iteration History
[iterations 1-18 omitted: the negative log-likelihood decreased from 836.72
to 660.53]
[iterations 19-20: the negative log-likelihood decreased to 660.514855]
NOTE: GCONV convergence criterion satisfied.

                       Fit Statistics
-2 Log Likelihood               1321.0
AIC (smaller is better)         1339.0
AICC (smaller is better)        1339.8
BIC (smaller is better)         1357.7

                     Parameter Estimates
                     Standard
Parameter  Estimate  Error     DF  t Value  Pr > |t|
b0         -1.1760   1.1128    57    -1.06    0.2951
b1          0.8978   0.1215    57     7.39    <.0001
b2          0.4264   0.3270    57     1.30    0.1974
b3         -0.9293   0.3777    57    -2.46    0.0169
b4         -0.07542  0.07410   57    -1.02    0.3131
b5          0.3170   0.1854    57     1.71    0.0922
sb12        0.2949   0.06933   57     4.25    <.0001
cb12        0.1048   0.04530   57     2.31    0.0244
sb22        0.05934  0.04544   57     1.31    0.1968

PROC GLIMMIX: New in version 9 of SAS is a procedure, also called glimmix
(not to be confused with the glimmix macro demonstrated earlier). As of May
2005, this procedure was not available in the standard SAS distribution but
could be downloaded from http://support.sas.com/rnd/app/da/glimmix.html. At
this time, the procedure is only available for the Windows implementation of
SAS.

CHAPTER 2                                               ST 762, M. DAVIDIAN

2  Introduction to nonlinear models

2.1  Introduction

In this chapter, we will discuss the model that will be our central focus in
Chapters 3-12. Through the course of our discussion, we will identify
different approaches to inference in the model, setting the stage for these
future chapters.

SITUATION: Assume that we have independent pairs of observations (Yj, xj),
j = 1, ..., n. The xj may be "fixed" or random, as discussed in Chapter 1.
We will assume that the pairs, and hence the random variables Yj, are
independent.

BASIC MODEL: Rather than state the model in the form of "response = model +
deviation" and a series of assumptions about the deviations, we will instead
write the model in terms of what we are willing to say about the first two
moments of the distribution of Yj given xj. We will begin with a basic form
of the model. As
our discussion progresses, we will modify this basic form.

  E(Yj | xj) = f(xj, β),   var(Yj | xj) = σj².   (2.1)

- In model (2.1), f(xj, β) is a real-valued function of the vector of
  covariates xj (r x 1, say) and the vector of regression parameters β
  (p x 1). The dependence of f on β need not be in a linear fashion; as in
  the models discussed in the examples of Chapter 1, f may depend on some or
  all of the components of β in a complicated, nonlinear way. Note that r
  need not be equal to p, as in the examples in Chapter 1.

- The assumption var(Yj | xj) = σj² is left deliberately vague at this
  point. What is important right now is the idea that the values σj² are
  j-dependent. This implies that the variances of the conditional
  distributions of Y values at different xj are not the same across j. The
  values σj² may be known constants, or, more generally, the expression
  allows the possibility that they may depend on xj.

- If we define ej = Yj - E(Yj | xj) = Yj - f(xj, β), we do not necessarily
  assume that ej is independent of xj, as in the "classical" assumptions.
  Given the way we have defined the model, we do have that E(ej | xj) = 0,
  which is similar to the classical assumption. Thus, we do assume that the
  chosen model form f(xj, β) is a correct specification of E(Yj | xj). This
  may be interpreted as saying that the data analyst is well equipped to
  identify an appropriate model form. In the case where there is a
  theoretical basis for choosing a model, as in the case of
  pharmacokinetics, this is certainly a reasonable assumption.

- Note that we make no assumption about the distributions of the Yj's or,
  more directly, the conditional distributions of Yj given xj. Major themes
  will be the ability to develop inferential strategies that have nice
  properties without making such assumptions, and the robustness of
  inferential methods to violation of distributional assumptions that might
  be made.

2.2  Inferential approaches

Generally, as in the classical regression set-up, the scientific objective
may be stated in terms of questions about the value of the
parameter β, or at least some of its elements. That is, questions of
interest focus on the mean response as a function of xj, e.g.:

- To obtain the most accurate characterization of the mean response;
- To determine whether the model may be modified to exclude consideration of
  some components of xj.

Thus, at least initially, when we speak of inference within the framework of
our basic model (2.1), we interpret this to mean estimation of, and testing
with respect to, the parameter β. We will see that other parameters may also
be involved in carrying this out most effectively and that, indeed, other
parameters in modifications of (2.1) may also be of interest.

APPROACH 1: Except for the fact that f is nonlinear in β, pretend that some
of the other classical assumptions hold. In particular, whether we believe
variance is constant or not, suppose we proceed as if it is, so that
var(Yj | xj) = σ² is a constant. We might even adopt the normality
assumption; this clearly would be erroneous for binary data or data in the
form of small counts, but might be a reasonable approximation for continuous
responses.

Under this perspective, a natural approach would then be ordinary least
squares (OLS); that is, minimizing in β the sum of squared deviations

  Σ_{j=1}^n {Yj - f(xj, β)}².   (2.2)

Just as in the linear case, this approach can be motivated in different
ways.

- If we adopt the normality assumption, maximum likelihood estimation of β
  and σ² involves jointly maximizing the loglikelihood

    logL = -(n/2) log(2π) - (n/2) log σ²
           - 1/(2σ²) Σ_{j=1}^n {Yj - f(xj, β)}².   (2.3)

  Maximization of this in β is equivalent to minimizing (2.2).

- With or without the normality assumption, one may view minimizing (2.2) in
  β as a sensible thing to do, as discussed in Chapter 1. The sum of squared
  deviations (2.2) may be viewed as a distance criterion that, in accordance
  with the assumption of constant variance, treats all n observations as if
  they were of equal quality.

ASIDE: It is important to recognize that, in discussing maximum likelihood,
we are implicitly conditioning on xj
when writing the likelihood. To appreciate this, suppose the xj (r x 1) are
random and themselves normally distributed with some mean μ and covariance
matrix Σ. So, if we consider the (Yj, xj) as independent draws from a
distribution of possible (Y, x) pairs, ideally the loglikelihood of the
observed data, the pairs (Yj, xj), j = 1, ..., n, would be

  logL* = logL - (rn/2) log(2π) - (n/2) log|Σ|
          - (1/2) Σ_{j=1}^n (xj - μ)ᵀ Σ⁻¹ (xj - μ),   (2.4)

where logL is defined in (2.3) and is the logarithm of the product of
individual normal densities for Yj given xj. Note that, as the part of the
loglikelihood due to the xj does not involve β, maximizing the full
loglikelihood (2.4) in β is the same as maximizing logL alone. This also
shows that, in the context of regression modeling, where the distribution of
Yj given xj is of central interest, the distribution of random covariates is
not directly relevant. A word of warning, however: this observation applies
only if the xj are observed without error and are not missing. In these more
complex cases, which are beyond our scope here, the distribution of the xj
values does enter into the picture, complicating matters considerably.

Using the notation described in Section 2.4, minimizing (2.2) is equivalent
to solving the p-dimensional estimating equation

  Σ_{j=1}^n f_β(xj, β){Yj - f(xj, β)} = 0,   (2.5)

where f_β(xj, β) is the p x 1 vector whose elements are the partial
derivatives of f with respect to each component of β. That is, minimizing
(2.2), or, for that matter, maximizing (2.3), is equivalent to solving a set
of equations in β that is linear in the data Yj. Thus, whether we adopt
normality, or we do not assume normality but adopt minimization of (2.2) as
a sensible thing to do, the resulting estimation method of OLS has the
intuitively appealing property of yielding a linear estimating equation.

Of course, from either perspective, we have pretended that var(Yj | xj) is a
constant when it may not be, in order to get ourselves back into "classical"
territory so that we may do the nonlinear version of the familiar OLS. Thus,
this approach seems
unsatisfying, as it ignores the possibility that this variance is something
else.

- As we will see later, making this simplifying assumption when it is
  incorrect leads to inefficient estimation of β, in a sense we will make
  precise.

There is a further complication as well. Regardless of the variance, note
that, for an arbitrary function f nonlinear in β, it is not at all clear
that (2.5) may be solved for β in closed form. That is, it may not be
possible to obtain an explicit expression for the value β̂_OLS solving
(2.5). Thus, obtaining the OLS estimator for β with a nonlinear model is no
longer a straightforward analytical calculation, as it was in the linear
case. For now, we will not dwell on this; in Chapter 3, we will discuss
methods of solving such estimating equations numerically.

APPROACH 2: If we are unwilling to pretend that var(Yj | xj) = σ², and
accept that var(Yj | xj) = σj² as in (2.1), an alternative approach would be
to make a further assumption about the nature of the σj². A very strong such
assumption would be to assume we knew the actual numerical values of the
σj², at least up to a constant of proportionality. That is, for
j = 1, ..., n, we might assume that

  σj² = σ²/wj   (2.6)

for some known set of values w1, ..., wn. Note that xj has not entered into
the picture at all here; we are making the very strong assumption that, even
if σj² is a function of xj, we already know the values of that function at
each xj, at least up to a constant of proportionality σ². As will be clear
momentarily, the constant of proportionality will not be important.

An example of a situation where (2.6) might be realistic is the case where
each Yj is the average of rj independent replicate responses at xj, Yjk,
k = 1, ..., rj, say, so that

  Yj = rj⁻¹ Σ_{k=1}^{rj} Yjk.

Suppose we are willing to believe that var(Yjk | xj) = σ² for each j. Then
we have that var(Yj | xj) = σ²/rj, so that wj = rj.

In any event, if we adopt the assumption (2.6) for some set of known
constants wj, we could do one of the following:

- Make the additional assumption of normality, and find
the maximum likelihood estimator for β; that is, maximize

    logL = -(n/2) log(2π) - (n/2) log σ² + (1/2) Σ_{j=1}^n log wj
           - 1/(2σ²) Σ_{j=1}^n wj {Yj - f(xj, β)}².   (2.7)

- Even without the normality assumption, we could argue that a reasonable
  approach would be to minimize a sensible distance criterion that takes
  into account the fact that some observations are of higher quality
  (smaller variance) than others. To do this, we could consider a criterion
  like (2.2), but weight each observation in accordance with its precision,
  letting responses of lower (higher) quality receive less (more) weight.
  With this perspective, a natural extension of (2.2) is to minimize

    Σ_{j=1}^n wj {Yj - f(xj, β)}².   (2.8)

Note that, as σ² is just a constant of proportionality, it does not affect
this. In fact, from (2.7), minimizing this criterion is equivalent to
maximizing the loglikelihood in β.

Call the estimator for β minimizing (2.8) the weighted least squares (WLS)
estimator, for obvious reasons, denoted by β̂_WLS. Note that β̂_WLS solves

  Σ_{j=1}^n wj f_β(xj, β){Yj - f(xj, β)} = 0.   (2.9)

This is also a linear estimating equation, i.e., linear in the responses Yj;
thus, like OLS, WLS also yields a linear equation to solve. An obvious
question is: given the two different equations, which one is "better," in
the sense of yielding a better (more precise) estimator for β? Intuition
would suggest that, if we do not believe var(Yj | xj) is constant and we are
lucky enough to know the wj, the weighted (WLS) approach seems more sensible
than the unweighted (OLS) approach.

Note that, under (2.6), as the wj are known constants, we may in fact
transform the situation as follows. Let

  Ỹj = wj^{1/2} Yj,   f̃(xj, β) = wj^{1/2} f(xj, β).

Then, under (2.6), it is easy to show that, treating the wj as fixed, known
constants (so not conditioning on them, but regarding them as known),

  E(Ỹj | xj) = f̃(xj, β),   var(Ỹj | xj) = σ².

Following this transformation, it would seem one could estimate β by OLS,
minimizing (2.2). It is straightforward to show that OLS under this model
and WLS under (2.8) are identical methods.

APPROACH 3: In practice, it is highly unlikely that we would know such
values wj. If we believe that var(Yj | xj) = σj² for some σj², which may be
functions of xj (of course, we are still being vague here), one possibility
would be to try to estimate the σj² values for each j. The WLS approach has
a nice intuitive appeal; perhaps, if known weights were replaced by
estimated weights, it would be better than ignoring the issue altogether.

A natural approach arises if we are lucky enough to have replicate
observations (which are independent) at each xj. To discuss this explicitly,
If we believe that varY7l1j 072 for some 072 which may be functions of 1117 of course we are still being vague here one possibility would be to try to estimate the 072 values for each j The WLS approach has a nice intuitive appeal perhaps if known weights were replaced by estimated weights it would be better than ignoring the issue altogether A natural approach arises if we are lucky enough to have replicate observations which are independent at each 2117 To discuss this explicitly for the moment modify the notation to let denote the kth observation at 211739 k 1 rj and de ne r r 8 W i 1 1 2ij 37739 37739 7771 k1 k1 where s is thus the sample variance of the for given j The 3 would be expected to be close to the relevant values of 072 This suggests adopting the WLS idea estimating 6 by minimizing n Tj Z 2 Sim6 7 MM 210 7391 k1 0 Although this preserves the spirit of WLS s is likely to not be a very good representation of the true value of 072 if W is small 0 As we saw in the examples in Chapter 1 a common feature is that variance tends to exhibit a systematic pattern eg increasing in a rather smooth fashion as the level of the response increases This approach does not acknowledge this but rather only represents the variance at each distinct j 0 Moreover a natural question is whether a penalty must be paid for estimating weights rather than being lucky enough to know them APPROACH 4 Under the same conditions as Approach 3 an alternative approach is possible One could make the assumption of normality and treat the 072 as unknown parameters to be estimated more formally along with 6 by maximum likelihood Differentiating the loglikelihood n n 7 739 7012 10g 2 Z7710ng 12 Z 28 7 fivj7 2U F1 j1 k1 with respect to 6 0f 03 yields the system of equations 2 ZUJZUjk fltj7 f j7 07 07239 Q71 29 fj7 27 j 17 771 j1k1 k1 PAGE 32 CHAPTER 2 ST 762 M DAVDDIAN This seems to be similar in spirit to Approach 3 except that now the sample variance is replaced by an expression that depends 
on the assumed mean model f. Note that this system cannot be solved in
closed form. One simplification to get around this problem would be to
replace β in the equations for the σj² by β̂_OLS from a preliminary OLS fit,
to obtain estimates σ̂j², which could then be substituted in the equation
for β. In any event, this suffers the same drawbacks as Approach 3 and, in
addition, is more difficult to implement.

COMMENTS ON APPROACHES SO FAR:

- Approach 1 (OLS) seems inappropriate if we believe variances are not
  constant.

- Approach 2 (WLS) attempts to address this, but it seems unlikely that the
  wj would be known in practice.

- Approaches 3 and 4 use this same premise but require replication, which
  may be unavailable or impossible; e.g., pharmacokinetic data would never
  be available in replicate at the same time point.

- All of these approaches disregard the possible evidence, or subject-matter
  knowledge, of a systematic pattern of nonconstant variance, e.g., changing
  with response level, as noted above. Indeed, such a pattern may well be
  expected; e.g., count data may follow a Poisson distribution, for which
  the variance is equal to the mean.

If there is evidence of a systematic relationship between variance and mean,
or if variance seems systematically related to xj itself, it seems that a
better approach would be to attempt to characterize the relationship
directly.

- More precisely, if the evidence suggests that var(Yj | xj) is a smooth
  function of xj, either through the mean function f or directly, then
  Approaches 2-4 represent trying to characterize this smooth relationship
  by a "connect the dots" approach. That is, these approaches attempt only
  to evaluate the value of var(Yj | xj) at specific points xj, rather than
  try to describe the entire function. The result is that the smooth
  function is represented by "connecting the dots" (estimates of σj²) to
  form a jagged profile.

- One would never estimate a smooth mean function this way; i.e., we would
  not use replicates at each xj to characterize the mean function;
instead we use the smooth function filj So why do this for variances RESULT Think explicitly of varle1j as a smooth function of 111739 as the notation suggests and exploit this 0 This is a natural approach for certain types of data eg counts binary 0 And it may be realistic in other settings when suggested by empirical evidence HISTORICALLY Rather than address this issue directly early approaches tried to transform the problem back to the familiar classical assumption of constant variance APPROACH 5 Invoke a standard transformation of the data and carry out modeling and inference on the transformed scale Depending on the nature of the data select an appropriate transformation hY say such that varhY7l1j is constant and assume a regression model for Such a transformation would generally be suggested by a relevant distributional assumption eg for count data a Poisson distribution might be assumed Table 21 shows some common transformations based on distributional assumptions Table 21 Well known variance stabilizing transformations aj a varle1j Distribution Mean variance Transformation Poisson 072 aj leZl 2 Binomial a x M17 W atria1 Lognormal 072 X a long Early work in this vein usually assumed that the transformation also served to induce a linear relation ship eg for count Poisson data the transformed model would be 1 2 T 1 2 2 E02 While87 my mum That is the same transformation was assumed both to stabilize the variance and make the relationship linear Later work dropped the latter assumption 0 Main point in this approach the transformation is assumed to return the problem to the clas sical set up Thus normality on the transformed scale may also be assumed as well PAGE 34 CHAPTER 2 ST 762 M DAVDDIAN APPROACH 6 An extension of this idea was discussed in the seminal paper by Box and Cox 1964 The idea let the data determine the transformation 0 Assume there exists a transformation indexed by a parameter A hY A say 0 Assume further that EhYj7 Willi mgr57 varhYj7 W113i 02 
• h(Y_j, λ) | x_j ~ Normal.

(The linear model could be replaced by a nonlinear model here, but the original formulation assumed linearity as well as normality.)

• The authors discussed a family of power transformations,

  h(y, λ) = (y^λ − 1)/λ for λ ≠ 0;   h(y, 0) = log y,

now known as the Box-Cox transformation. The representation for λ ≠ 0 is made in such a way that h(y, λ) is continuous in λ and equal to log y at λ = 0. Although the Box-Cox transformation is very popular, the next suggestion applies to general transformations h.

• Under these assumptions, estimate β, σ², and λ jointly by maximum likelihood, thus letting the data inform on the appropriate value for λ. This would involve maximizing the loglikelihood

  log L = −(n/2) log(2π) − (n/2) log σ² + Σ_{j=1}^n log J_j(λ) − (1/2σ²) Σ_{j=1}^n { h(Y_j, λ) − x_j^T β }²,

where J_j(λ) is the Jacobian of the transformation,

  J_j(λ) = (∂/∂y) h(y, λ) |_{y = Y_j}.

For the Box-Cox transformation, J_j(λ) = Y_j^{λ−1}.

REMARKS: Approaches 5 and 6 assume that a single transformation achieves all of:

• normality, constant variance, and linearity of the mean (although this last could be relaxed).

This seems like a pretty tall order for one transformation! Moreover, some additional conceptual issues arise:

• What about inference on the original scale?

• If we estimate the transformation, how do we take this into account? Do we need to?

• What is the meaning of β?

• For problems where theoretical considerations suggest a model for E(Y_j|x_j) on the original scale, so that the form of the model f(x_j, β), say, is scientifically meaningful, this approach ignores this and destroys the role of this model.

• Are there situations where no transformation can remove nonconstant variance?

APPROACH 7: An approach that addresses the issue of a theoretical model was suggested by Carroll and Ruppert (1984), although the basic idea had been used previously. The suggestion is to preserve the theoretical model f(x_j, β) by transforming both Y_j and f(x_j, β). These authors wrote the model, in a way similar to the classical framework, as

  h(Y_j, λ) = h( f(x_j, β), λ ) + ε_j,   (2.11)

where the ε_j are assumed to satisfy the classical assumptions (2)-(4) (constant variance, independence, normality).
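As a small numerical illustration (a sketch, not part of the notes: the data vector, design matrix, and parameter values below are invented for the example), the Box-Cox family and the Approach 6 loglikelihood can be coded directly, and the continuity of h(y, λ) at λ = 0 checked:

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox transformation h(y, lambda): (y^lam - 1)/lam for lam != 0, log(y) at lam = 0."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0.0 else (y ** lam - 1.0) / lam

def box_cox_loglik(y, X, beta, sigma2, lam):
    """Approach 6 loglikelihood: normal linear model x_j' beta on the transformed
    scale, plus the Jacobian term sum_j log J_j(lam), with J_j(lam) = Y_j^(lam - 1)."""
    n = len(y)
    resid = box_cox(y, lam) - X @ beta
    return (-0.5 * n * np.log(2.0 * np.pi) - 0.5 * n * np.log(sigma2)
            + (lam - 1.0) * np.sum(np.log(y))         # Jacobian term for Box-Cox
            - 0.5 * np.sum(resid ** 2) / sigma2)

y = np.array([0.5, 1.0, 2.0, 4.0])
X = np.column_stack([np.ones(4), np.arange(4.0)])
# continuity of the family at lambda = 0: difference is essentially zero
print(np.max(np.abs(box_cox(y, 1e-8) - box_cox(y, 0.0))))
```

In practice one would maximize `box_cox_loglik` jointly in (β, σ², λ), e.g., by profiling over a grid of λ values; note that for fixed λ the maximizing β is simply OLS applied to the transformed responses, since the Jacobian term does not involve β.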
This implies that

  h(Y_j, λ) | x_j ~ N( h(f(x_j, β), λ), σ² ).

This approach is called "Transform Both Sides" (TBS), for obvious reasons.

• Under this model, for a transformation h monotone in Y, h(f(x_j, β), λ) is both the conditional mean and the median of h(Y_j, λ), so that, by the monotonicity, f(x_j, β) is the median of the conditional distribution of Y_j given x_j (which is not the same as the mean, in general, of course). Thus, the consequence of this approach is that the theoretical model f(x_j, β) is used as a model for the median, rather than the mean, on the original scale.

• Carroll and Ruppert (1984) suggested that β, σ², and λ could be estimated jointly by maximizing the corresponding loglikelihood

  log L = −(n/2) log(2π) − (n/2) log σ² + Σ_{j=1}^n log J_j(λ) − (1/2σ²) Σ_{j=1}^n { h(Y_j, λ) − h(f(x_j, β), λ) }².

• Of course, the assumption is still rather strong: the transformation induces both normality and constant variance. Moreover, there is a technical issue, in that it may be impossible for such a transformation even to exist for certain choices of h.

Putting the difficulties aside for a moment, it is instructive to note in more detail what this model implies about the properties of Y_j on the original scale. To gain insight, we consider an approximation to (2.11) for small deviations ε_j = h(Y_j, λ) − h(f(x_j, β), λ), using a Taylor series expansion (Section 2.4). That is, we apply Taylor's theorem to (2.11), expanding about ε_j = 0. Assuming that h is a one-to-one transformation with unique inverse h⁻¹, we have from (2.11)

  Y_j = h⁻¹( h(f(x_j, β), λ) + ε_j, λ )
      ≈ h⁻¹( h(f(x_j, β), λ), λ ) + (∂/∂u) h⁻¹(u, λ) |_{u = h(f(x_j, β), λ)} · ε_j
      = f(x_j, β) + (∂/∂u) h⁻¹(u, λ) |_{u = h(f(x_j, β), λ)} · ε_j.

Thus, note that we have E(Y_j|x_j) ≈ f(x_j, β). Now, for monotone h, it may be shown that

  (∂/∂u) h⁻¹(u, λ) = [ (∂/∂y) h(y, λ) |_{y = h⁻¹(u, λ)} ]⁻¹;

thus, the approximation may be written as

  Y_j ≈ f(x_j, β) + [ (∂/∂y) h(y, λ) |_{y = f(x_j, β)} ]⁻¹ ε_j.

From this approximation, we may thus deduce that

  var(Y_j|x_j) ≈ σ² [ (∂/∂y) h(y, λ) |_{y = f(x_j, β)} ]⁻².   (2.12)

This result is known as Bartlett's transformation. For the particular case of the Box-Cox transformation, (∂/∂y) h(y, λ) = y^{λ−1}, so under this transformation we have from (2.12) that

  var(Y_j|x_j) ≈ σ² f(x_j, β)^{2(1−λ)}.

The result is that the Box-Cox power transformation used in the TBS fashion corresponds roughly to a regression model for Y_j for which variance is proportional to a smooth function of the approximate mean f(x_j, β); in particular, to the power 2(1 − λ) of f(x_j, β). Other transformations will correspond to other functions.

RESULT: For the TBS model, and more generally, the choice of transformation family implies, at least approximately, a specific assumption about var(Y_j|x_j) on the original scale.

• Transformation of data has an effect not only on variance but also on the distribution on the new scale. It may be unrealistic to expect that a single transformation can achieve both constant variance and normality, or even just symmetry. In fact, a transformation that achieves normality on the transformed scale could induce a complicated nonconstant variance, or vice versa.

• In the above approximation, "ε_j small" corresponds roughly to the case where the shape of the conditional distribution of Y_j is not affected much by the transformation, so that most of the effect is on variance. Even in this situation, where stabilizing variance is the main focus, note that a particular transformation family allows only certain approximate variance patterns on the original scale.

We will not consider transformations further as a way of handling departures from the constant variance assumption. Instead, we will focus on a more direct modeling approach. A good reference to read more about transformations in regression is Carroll and Ruppert (1988, Chapter 4).

The approach we will take instead is to model variance directly, much as one models the mean. The first such approach we consider is one we have already discussed briefly.

APPROACH 8: Assume an appropriate distributional model, thus assuming something about variance, and estimate β by maximum likelihood.

• To illustrate, recall Example 1.6, where the response is a count. As discussed there, a natural distributional model is the Poisson distribution. If we have in mind a model
f(x_j, β) for E(Y_j|x_j), e.g., a loglinear model, then under the Poisson assumption var(Y_j|x_j) = f(x_j, β). Thus, we can write down the likelihood for the data,

  L = Π_{j=1}^n exp{ −f(x_j, β) } f(x_j, β)^{Y_j} / Y_j!,

which leads to maximizing in β the loglikelihood

  log L = Σ_{j=1}^n { Y_j log f(x_j, β) − f(x_j, β) − log Y_j! }.

• Alternatively, suppose we have continuous data where the response takes on positive values only. Sometimes a normal distribution can be a reasonable approximation for such data, even though it has support over the whole real line; for example, when the values of the response are very far from 0. However, in many biological applications, for a particular choice of x_j, the distribution of Y_j does not seem symmetric; rather, it appears to be skewed, with a long right tail. Moreover, the variance of the distribution increases with increasing values of the response in such a way that the coefficient of variation stays constant; i.e.,

  CV = var(Y_j|x_j)^{1/2} / E(Y_j|x_j) is a constant.

Two possible distributional models for data with these features are the gamma and lognormal distributions. The gamma(a, b) distribution has density

  f_Y(y) = y^{a−1} exp(−y/b) / { Γ(a) b^a },  y > 0,

with E(Y) = ab and var(Y) = ab², so that CV = a^{−1/2}. Thus, if we assume Y_j is gamma with a = 1/σ² and b = σ² f(x_j, β), then

  E(Y_j|x_j) = f(x_j, β),  var(Y_j|x_j) = σ² f(x_j, β)²,

where CV = σ, a constant. As above, one could write down the corresponding loglikelihood and maximize in β and σ².

Alternatively, the lognormal distribution is a competitor to the gamma with the same mean-variance relationship. A random variable Y has a lognormal distribution with mean μ and variance σ²μ² (so CV = σ) if log Y has a normal distribution with some mean and variance that are particular functions of μ and σ². Thus, maximum likelihood based on this distributional assumption could be entertained. Gamma and lognormal distributions with the same mean and variance often are quite similar.

It turns out that there is a unified framework, and a huge body of associated literature, on this type of maximum likelihood approach. We will consider this in detail in Chapter 4.

• Note that this approach only allows certain relationships between mean and variance: those dictated by the distributional model chosen. In some situations, the empirical evidence may suggest that the pattern of variance is different from those corresponding to candidate distributions. Indeed, the data analyst may be reluctant to adopt a particular, rigid distributional assumption.

MODELS FOR VARIANCE: An alternative approach is to model the variance directly. This is a standard approach in numerous applications, and there is an enormous statistical literature devoted to the topic. Much of this has to do with the fact that variance models that arise in the context of distributions, or via transformation, are simply not appropriate to represent the features of the data. To illustrate, we consider some examples.

• Recall the case of immunoassay data discussed in Example 1.4. A standard phenomenon in this application is that the variance increases with response level, as we have noted. In the case of RIA data, for example, where the response is a radioactive count, the Poisson distribution might seem like an appropriate model. However, in addition to the inherent Poisson variability afflicting counts, there are other sources of variation that contribute to the overall pattern; for example, lab procedures in the preparation of samples, errors in assaying or storing samples, and so on, may combine to make the pattern of variation present in the data appear different from, and typically larger in magnitude than, that one would expect from "pure" Poisson counts. Thus, a common approach is to adopt a model that is more flexible, allowing this possibility. A very popular model is

  var(Y_j|x_j) = σ² f(x_j, β)^{2θ},   (2.13)

where θ > 0 is typically the case. Note that θ = 0.5 corresponds to the Poisson variance structure and θ = 1.0 to a gamma-like, constant coefficient of variation model. These common choices may be inappropriate to describe the pattern of variation in the data, however. Assay scientists have found that values in the range of θ = 0.7 to 0.9 do a much better job. An obvious interpretation is that θ = 0.5, which would likely hold if lab procedures were perfect, must be inflated somewhat to do a better job of representing the variation. It may not be possible to specify a value for θ a priori without vast experience with the assay, as its value depends on the particular assay procedure. In this case, no handy distributional model applies. This suggests that, to implement (2.13), a possible approach would be to estimate θ from the data.

A further common feature of assay data is that of two distinct components of variation. At low response levels, the limits of the measuring device may be approached; for example, a radioactive counter may be unable to detect accurately very small counts. Thus, at such response levels, there may be inherent "noise" due to the breakdown of the device's ability to produce an accurate result. At higher response levels, this is not a problem, and the data exhibit the usual increasing variance pattern. A model that attempts to characterize this phenomenon is

  var(Y_j|x_j) = σ² { θ₁² + f(x_j, β)^{2θ₂} }.   (2.14)

At small values of f, the first term in (2.14) dominates, and var(Y_j|x_j) ≈ σ² θ₁², representing constant "noise": the noise associated with breakdown of the device swamps the inherent variation in response. At larger values of f, the second term dominates, and var(Y_j|x_j) ≈ σ² f(x_j, β)^{2θ₂}. Of course, a complication is that the values of θ₁ and θ₂ are unlikely to be known. Note further that it is not likely that one would be able to identify a transformation that would imply a variance pattern like (2.14) on the original scale.

• Sometimes a power model like (2.13) may be insufficient to characterize a pattern in which variance increases very profoundly. A popular model in this situation is the exponential model

  var(Y_j|x_j) = σ² exp{ 2θ f(x_j, β) }.

It is not clear that a distribution corresponding to the implied mean-variance relationship is available.

• In econometrics, study of phenomena such as stock prices is of great
interest. In fact, what is of primary interest is not necessarily the price itself, but how it varies: one would like, ideally, to "buy low and sell high." For a risk-taking investor, a stock that exhibits wide price swings, or volatility, would be a prime candidate. Thus, understanding the variance of stock prices, and how it may be related to other information (e.g., the value of the dollar against a foreign currency, the season, consumer price indices, and so on), is of keen interest. Here, modeling of the variance is in itself an important concern.

In this application, such modeling is often done on the basis of empirical evidence, and involves incorporating covariates that seem to describe the pattern of variation. (The actual model framework is quite a bit more involved than the regression model with data independent over j considered here, because of the time-ordered nature of data collection, but the issues are the same.) For example, if the available covariates are x_j, such empirical models include forms such as

  var(Y_j|x_j) = σ² (1 + x_{j2})^{2θ},  var(Y_j|x_j) = σ² exp(θ x_{j2}),  var(Y_j|x_j) = σ² x_{j2}^{2θ},

where x_{j2} is the second component of x_j.

IMPORTANT BUT SUBTLE POINT: So far, we have denoted the available covariates as x_j and have noted, as in (2.1), that we are interested in modeling E(Y_j|x_j) and var(Y_j|x_j). In the econometrics literature, a distinction is made between covariates that are relevant in modeling the mean, and covariates that may be "exogenous" to the mean model but that may be used in modeling the variance. That is, the volatility may be thought to be related to phenomena that do not affect the stock price. Thus, for example, suppose there were additional covariates z_j that might be considered for modeling variance, but are deemed irrelevant for modeling the mean. Then the available covariates are x_j and z_j. If the z_j are irrelevant for modeling the mean, we might be willing to express this as

  E(Y_j | x_j, z_j) = E(Y_j | x_j) = f(x_j, β).

This says that, once x_j is known, knowing z_j does not add anything. Note that this is not the same as saying that E(Y_j|x_j) = f(x_j, β) directly, as integrating E(Y_j|x_j, z_j), which in general would be a function of z_j and x_j, over the distribution of z_j will also give a function of x_j only. What is important is that we believe this conditional expectation is the same as E(Y_j|x_j). Under this perspective, we would model var(Y_j|x_j, z_j). Such a model might be

  E(Y_j | x_j, z_j) = f(x_j, β),  var(Y_j | x_j, z_j) = σ² exp(θ₁ + θ₂ z_j).

With these considerations, we might as well consider x_j and z_j together. Thus, in the sequel, we will continue to write E(Y_j|x_j) and var(Y_j|x_j), where x_j represents all available covariates, and we will not make the distinction between covariates relevant to mean response and those "exogenous" to it. It is important to keep this in mind.

REMARK: Clearly, such variance models, which are based on empirical modeling considerations, are unlikely to arise from a distributional assumption! Thus, under such conditions, appealing to a distributional framework is not possible.

APPROACH 9: Summarizing the above, a broadly applicable approach is simply to postulate a model for variance based on the available evidence and subject-matter considerations. We have seen situations where it is reasonable to write down such a model as a function of f(x_j, β) and other parameters θ; more generally, the model may depend on x_j and other parameters directly. A model that encompasses these possibilities is

  E(Y_j|x_j) = f(x_j, β),  var(Y_j|x_j) = σ² g²(β, θ, x_j).   (2.15)

• g is a function of the parameters β appearing in the characterization of E(Y_j|x_j), additional parameters θ, and covariates x_j. This function need not depend on β through the function f(x_j, β), but in almost all situations this will be the case. We will refer to g as the variance function, as it characterizes the variance as a function of covariates and parameters (note that "standard deviation function" might be a better term).

• The model (2.15) subsumes the distributional models that arise from a perspective like Approach 8; in these cases, g is a known, specified function, usually of f(x_j, β).

• The approach is flexible: one may contemplate and accommodate
explicitly features of the data through different choices for g.

• The parameter θ may be known or unknown. To clarify what we mean by this, consider the following. In the RIA examples, the value of a parameter θ would likely be unknown: we may be willing to specify a functional form for variance, but only up to the values of some parameters that must be determined. In contrast, in situations where we are willing to specify a distributional model, as in Approach 8, the functional form of g is known; e.g., for count data modeled by a Poisson distribution,

  g²(β, θ, x_j) = f(x_j, β) and σ² = 1.

Here, there are no unknown parameters in the variance model save β, which enters through dependence on the mean function. We refer to such a case as "θ known."

A popular inferential method for the general model (2.15) may be deduced by writing

  var(Y_j|x_j) = σ² / w_j,  w_j = g⁻²(β, θ, x_j),

so that "weights" are identified for each j. This representation suggests a natural approach: that of adapting the WLS paradigm to this situation, where the weights are not known entirely, but instead are known up to the values of the parameters β and, perhaps, θ. That is, the functional form that dictates the values of the weights is known; only parameters are not. This perspective leads to what we will call generalized least squares (also called estimated generalized least squares), or GLS. The basic GLS approach may be thought of in terms of the following algorithm:

(i) Estimate β by β̂_OLS.

(ii) Estimate θ (if it is not known) somehow, obtaining θ̂, and form estimated weights ŵ_j = g⁻²(β̂_OLS, θ̂, x_j).

(iii) Re-estimate β by solving

  Σ_{j=1}^n ŵ_j { Y_j − f(x_j, β) } f_β(x_j, β) = 0,   (2.16)

treating the ŵ_j as fixed constants, to obtain the GLS estimator β̂_GLS.

Note that minimizing Σ_{j=1}^n ŵ_j { Y_j − f(x_j, β) }² would lead to solving (2.16). In the sequel, we will focus on estimators obtained by solving estimating equations, rather than necessarily minimizing an objective function; thus, we prefer to express (iii) in terms of the solution of such an equation.

Note also that a natural extension of this algorithm would be to take the estimator β̂_GLS from (iii) and return to (ii), forming new estimated weights ŵ_j = g⁻²(β̂_GLS, θ̂, x_j), possibly re-estimating θ somehow. Then go back to (iii) and update the estimation of β. In fact, intuition would suggest that this process could be iterated multiple times; perhaps, after a sufficient number of iterations, the estimator for β might "stabilize," so that the iterative algorithm converges to a solution. It will turn out that this is often the case. We will defer further discussion of this idea until later.

Finally, note that, except for the implicit presence of β values in the ŵ_j (through the plugged-in estimators β̂_OLS or β̂_GLS and θ̂), the equation (2.16) is again linear in the Y_j. This feature will be very important.

As we will see, the basic approach of GLS is a fundamental one for regression models of the form (2.15), with some very nice properties. It is not, however, the only one.

APPROACH 10: An alternative approach to (2.15) would be to add a distributional assumption and estimate β, along with σ² and θ if they are not known, by maximum likelihood. We have already talked about this idea (Approach 8) from the perspective that a distributional assumption dictates a mean-variance relationship. However, we may do it the other way around: given a postulated mean-variance relationship of the form (2.15), also impose a distributional model. A popular such approach is to assume, as in the classical set-up, that the distribution of Y_j is normal, with mean and variance given by (2.15).

• In the case of nonconstant variance and known weights w_j, the WLS estimator may be motivated as the maximum likelihood estimator under normality. Similarly, with w_j ≡ 1 for all j, OLS is maximum likelihood under normality. Thus, by analogy, one might be tempted to use normal-theory maximum likelihood to estimate β in the more complicated model (2.15).

In fact, as we have discussed, the assumption that the distribution of Y_j is normal is often relevant in practice; e.g., for RIA data, where the response is a large count, the normal distribution may be a good approximation to Poisson-like behavior. This assumption is depicted in Figure 2.1 in the case where f(x_j, β) is a straight line (x_j scalar). The normal distribution has the very general property that its mean and variance are unrelated; thus, it is perfectly reasonable to suppose that data satisfying (2.15), with

  E(Y_j|x_j) = f(x_j, β),  var(Y_j|x_j) = σ² g²(β, θ, x_j),

might be normally distributed. Under this assumption, we would maximize in β, σ², and θ (if necessary)

  log L = −(n/2) log(2π) − n log σ − Σ_{j=1}^n log g(β, θ, x_j) − (1/2σ²) Σ_{j=1}^n g⁻²(β, θ, x_j) { Y_j − f(x_j, β) }².   (2.17)

Figure 2.1: Normality assumption with nonconstant variance.

• One might speculate that, given the relationship between maximum likelihood under normality and WLS with known weights, this would lead to a method very similar to Approach 9. However, this is not the case. Note that β enters into the objective function (2.17) in a complicated way, rather than just through f(x_j, β) in the squared deviations in the last term. This feature makes the resulting estimator for β different from something akin to WLS/GLS, as we will see. In particular, the resulting estimating equation, which would be solved jointly with additional equations for σ² and θ, turns out to be a special case of a general class of such equations. In this class, the equation depends on the Y_j in a quadratic, rather than a linear, fashion as in (2.16). Thus, taking this normality-based approach, or, more generally, considering solving such quadratic equations, leads to a competing method to the GLS approach for fitting the general model (2.15). We will spend considerable time elucidating the differences in these approaches and the trade-offs between them.

2.3 Preview

We have gone through a variety of approaches to modeling and estimation of the regression parameter β for univariate response data for which the usual, classical regression assumptions do not apply. Within the framework of the basic model (2.1), it seems that the set of potentially interesting approaches
narrows down to the last three:

• Assume a particular, appropriate distributional model (e.g., the Poisson for count data) for the conditional distribution of Y_j given x_j. Under this model, the specification of var(Y_j|x_j) is accounted for naturally; in particular, var(Y_j|x_j) = a specified function of f(x_j, β). Estimate β and any other parameters jointly by maximum likelihood (Approach 8).

• Do not attempt to make a distributional assumption. Rather, postulate a model for variance depending on the application and empirical evidence. Because such a model is not likely to correspond to a particular distribution, and because we may be unwilling to make a distributional assumption, adopt a weighted least squares approach to fitting the model on the basis of its "sensible," omnibus appeal. Call this approach generalized least squares (Approach 9).

• Alternatively, under such a model, make an additional distributional assumption; in particular, adopt a normality assumption and fit by maximum likelihood. From another perspective: do something different from a weighted least squares approach (Approach 10).

ISSUES TO BE ADDRESSED/RESOLVED:

• We will find that, for a certain class of distributions, Approach 8 leads to the same estimation technique as Approach 9 when θ is known. We will describe this class of distributions and gain an understanding of why this is the case.

• The GLS estimation method may be motivated from an entirely different perspective: neither maximum likelihood for certain distributions, nor weighted least squares ideas. We will exhibit this perspective.

• It will turn out that Approach 10 may or may not lead to an approach similar to GLS. We will determine under what conditions this is the case, and discuss why we should care.

• For models where θ is unknown, a way to estimate θ is required. We will discuss approaches to this problem.

• In fact, if one decides to model variance based on empirical and other evidence, determining an appropriate model seems very important. We will discuss just how important this is, as well as tools for identifying an appropriate model in practical situations.

• A key point about models of the form (2.15) is that the parameter β potentially appears both in the mean and variance specifications, which makes this situation much more complicated than that of classical regression, nonlinearity of the mean model notwithstanding. Note, of course, that g need not depend on β, as in the empirical models for variance discussed in the context of econometrics. It will turn out that whether g has dependence on the β appearing in the mean specification is a crucial issue.

As we address these issues, it is useful to keep the following 2 × 2 table in mind. As one moves from the left column to the right, and from the top row to the bottom, things become more complicated:

                               θ known               θ unknown
  g does not depend on β       weights known (WLS)
  g depends on β               we will start here

2.4 Appendix: Notation and multivariate Taylor series

Throughout, we will need to differentiate real-valued functions like f and g with respect to vector-valued parameters like β (p × 1). In fact, we will often need to differentiate vector-valued functions with respect to vector-valued arguments. It is advantageous, in order to streamline the presentation, to adopt a shorthand notation to represent such derivatives. We will use this notation in subsequent chapters; similar notation is used routinely in the literature.

REAL-VALUED FUNCTIONS: Let h(x, α) be some real-valued function of a vector x (which will be irrelevant for the developments here) and an (r × 1) vector α = (α₁, …, α_r)^T. We write

  h_α(x, α) = (∂/∂α) h(x, α) = [ (∂/∂α₁) h(x, α), …, (∂/∂α_r) h(x, α) ]^T  (r × 1).   (2.18)

The vector h_α(x, α) of partial derivatives of h(x, α) with respect to the elements of α is referred to as the gradient vector. Of course, h_α^T(x, α) denotes its transpose, a (1 × r) vector.

We may extend this notation to second partial derivatives with respect to the elements of α. This is done by writing the (r × r) symmetric matrix of second partial derivatives of h(x, α) as follows:

  h_αα(x, α) = (∂²/∂α ∂α^T) h(x, α) =
  [ ∂²h/∂α₁²      ∂²h/∂α₁∂α₂   …   ∂²h/∂α₁∂α_r
    ∂²h/∂α₂∂α₁    ∂²h/∂α₂²     …   ∂²h/∂α₂∂α_r
    ⋮                               ⋮
    ∂²h/∂α_r∂α₁   ∂²h/∂α_r∂α₂  …   ∂²h/∂α_r²  ].   (2.19)

This is often called the Hessian. More generally, suppose that h(x, α, β) is a real-valued function of two vectors, α (r × 1) and β (s × 1). Then define

  h_αβ(x, α, β) = (∂²/∂α ∂β^T) h(x, α, β) =
  [ ∂²h/∂α₁∂β₁    …   ∂²h/∂α₁∂β_s
    ∂²h/∂α₂∂β₁    …   ∂²h/∂α₂∂β_s
    ⋮                  ⋮
    ∂²h/∂α_r∂β₁   …   ∂²h/∂α_r∂β_s ],   (2.20)

where we have written h = h(x, α, β) for brevity. This is an (r × s) matrix; it follows that, by reversing the roles of α and β, h_βα is an (s × r) matrix and is the transpose of h_αβ in (2.20).

VECTOR-VALUED FUNCTIONS: Now suppose that h(x, β) is a vector-valued function of dimension n of a vector x and a parameter β (s × 1); i.e., h(x, β) = { h₁(x, β), …, h_n(x, β) }^T. Thus, the vector-valued function h has real-valued component functions h₁, …, h_n. We will write

  h_β(x, β) = (∂/∂β^T) h(x, β) =
  [ ∂h₁/∂β₁   …   ∂h₁/∂β_s
    ⋮               ⋮
    ∂h_n/∂β₁  …   ∂h_n/∂β_s ],

an (n × s) matrix. In particular, note that if we have a real-valued function h(x, α, β), for α (r × 1) and β (s × 1), then

  h_α(x, α, β) = [ (∂/∂α₁) h(x, α, β), …, (∂/∂α_r) h(x, α, β) ]^T.

Thus, h_α(x, α, β) is a vector-valued function. Applying the above definition of the partial derivative of a vector-valued function with respect to a vector parameter to h_α(x, α, β), we may conclude that the result of differentiating h_α(x, α, β) with respect to β is the matrix h_αβ(x, α, β) defined in (2.20); that is,

  (∂/∂β^T) h_α(x, α, β) = (∂²/∂α ∂β^T) h(x, α, β).

This notation will be very helpful in streamlining the expression of the rather complicated equations we will derive in connection with estimation of β in (2.15). It will also prove very useful for simplifying the derivation of the large-sample approximations we will require in deducing the properties of estimators. These will follow from application of Taylor's theorem, which we now consider.

UNIVARIATE TAYLOR'S THEOREM: Assume h(α) has (k − 1) continuous derivatives in [a, b] and finite kth derivative in (a, b), where α is univariate and h is a real-valued function. Let α₀ ∈ [a, b]. For each α ∈ [a, b], α ≠ α₀, there exists α* interior
to the interval joining 040 and 04 such that 1 71 1 z z z 1 k k k Ma ha0 7 2 KW Ma haaaoa 7 a0 39 Ma haama 7 a0 1 Taylor s theorem is used heavily in making large sample arguments as we will see Because we will deal with functions of vector valued parameters and vector valued functions of such parameters we will need a multivariate version of Taylor s theorem MULTIVARIA TE TAYLOR S THEOREM First consider a real valued function 711 where a is TX 1 We will state the theorem for general k but we will write the form of the representation of 111 for k 2 only as things get messy pretty quickly One should be able to deduce the form for larger k by analogy to the univariate version Assume that ha on R7 has continuous partial derivatives of order k at each point in an open set S C 72 Let a0 6 S For each a 10 such that the line segment joining a and a0 lies in S there exists 1 in the interior of this line segment such that in the case k 2 7 7 7 hltagt ha0Z88Whaaaoal7a0z12 Z Z828az8athaam az7a0lat7a0t 1 1 t1 Inspection of this expression reveals the usefulness of our shorthand notation above it is straightforward to show that we may rewrite this more succinctly as h04 h040 7 h aoa 7 0407 1211 7 040Thumzm0t 7 10 Here we have used a further shorthand representation that we will use routinely ie hM040 88ahaaao PAGE 49 CHAPTER 2 ST 762 M DAVDDIAN We will use similar notation for hma and other expressions If h is in fact a function of two vectors a r x 1 and 6 s x 1 we can de ne the single stacked vector aT 6TT r s x 1 and apply the theorem to obtain a representation of ha6 about some value 136 This comes in handy when we want to maintain the distinction between two arguments of a function and treat them separately Using the above it is easy to show that for k 2 we may write 710175 71010750 7150107 50Ot 7 a0 h5TOt07 505 7 50 12a 7 a0Thma6a 7 a0 2a 7 a0Tha5a 66 7 60 6 7 60Th55a 66 7 60 We will often wish to apply Taylor s theorem to vector valued functions Things can get a bit com 
plicated but it is mainly a matter of notation Suppose 1a h1a hvaT v x 1 where again a is r x 1 If we apply the multivariate Taylor theorem to each component of h which are real valued functions it may be shown that the expansion of ha about a 10 for k 2 may be written compactly as hfamo Ma h010 3 0t 7 010 12Iv 0t 7 a0THZaa 7 10 7130010 where H20 is the m x r matrix consisting of the r x r matrices hlma1 hma u stacked vertically Iv is a v x 12 identity matrix and X denotes Kronecker product We will rarely use multivariate Taylor series beyond k 1 one place where they are needed is in the derivation of second order theoretical properties as in Chapter 11 For completeness we have given these expressions here Finally if ha 6 v x 1 is a vector valued function depending on a r x 1 and 6 s x 1 and we wish to separate explicitly the terms involving a and 6 the above extends in the obvious way for k 1 we have using obvious notation hgoxatv af hf6a76 ha6 ha0 60 a 7 a0 6 7 60 haa7 6 h36a76 This could of course be written more compactly as hiTaWh L hf6a76 39 t Cl 7 a0 6760 ha6 ha060 haa7 6 h6a76 PAGE 50 CHAPTER 13 ST 762 M DAVDDIAN 13 Approaches to modeling multivariate response 131 Introduction In the previous chapters we have focused on the situation in which the responses may be viewed as independent We thought about this in different ways 0 In some circumstances units may be randomly sampled from some population of interest and unit j gives rise to 1117 Thus we have iid pairs 1117 j 1 n and interest focuses on characterizing the association between and 111739 by modeling Here we may View the conditional on the 11 as independent as is customary in the classical regression set up and more generally In other situations an experiment may be carried out in which the 111739 are xed by the investigator For each setting 2117 is observed independently of the observations for the other 1 settings eg each j involves observing a certain chemical reaction at temperature 7 or zj is the 
dose of a drug given to animal j in a random sample of such animals. Here again, the assumption that the Y_j, conditional on the x_j, are independent is reasonable.

We also discussed examples (e.g., Examples 1.1, 1.2, and 1.3) in which repeated measurements are taken on a single unit. In Examples 1.1 and 1.2, for instance, a single subject is given a dose of a drug at time 0, drug concentrations at times x_j post-dose are derived from blood samples taken at those times, and the objective is to characterize the concentration-time relationship for this particular subject. In this setting, we noted the possibility that observations taken very close together in time might be expected to be "more alike" than those far apart in time; that is, we might expect such repeated measurements to be serially correlated. In Section 13.2, we will discuss this in more detail. Often, the blood samples are taken sufficiently far apart in time that concern over this issue is negligible, so it is routine to assume that the Y_j, j = 1, ..., n, are independent over time. We noted in Chapter 1 that, in this situation, thinking of time as a "covariate" in the usual sense may not be quite appropriate; it may be more reasonable to consider a stochastic process over time, realizations of which we observe at times x_j. We will for the most part downplay this distinction, but discuss it further shortly.

There are many circumstances, however, in which assuming independence among all observations in a regression setting is not reasonable.

• Example 1.7 is an example of a situation in which the data fall into natural "clusters" due to the way in which the observations arise. In this example, pregnant mother rats, indexed by i = 1, ..., m, receive a dose x_i of a toxic agent; each mother gives birth to n_i pups, where the jth pup has birthweight Y_ij, j = 1, ..., n_i. As discussed previously, it is natural to think that the birthweights of pups from the same mother might be "more alike" than those from different mothers; a "high birthweight" mother's pups
will tend to be heavier than the average pup across all such mothers at dose x_i, so might all tend to be above this average together. Of course, we would expect that the way birthweights turn out for pups from different mothers would have nothing to do with one another. Under these circumstances, it is natural to think of the observations from mother i together as a group or cluster. Letting Y_i = (Y_i1, ..., Y_in_i)^T be the vector of birthweights for mother i, we observe pairs (Y_i, x_i), i = 1, ..., m. The Y_i, given the x_i, may be reasonably viewed as independent from above; however, the observations within Y_i for any i are correlated. So, clearly, if we consider the entire data vector Y = (Y_1^T, ..., Y_m^T)^T, it would be unreasonable to assume independence among all elements of Y. Thus, trying to view this problem from the perspective we have considered previously, as N = Σ_{i=1}^m n_i independent pairs (Y_ij, x_i), i = 1, ..., m, j = 1, ..., n_i, would be erroneous.

• Example 1.8 reflects the same issue, with some additional complication. Here, the ith of m subjects receives dose D_i of a drug, and drug concentrations Y_ij are taken at times t_ij post-dose for subject i, j = 1, ..., n_i. Thus, letting z_ij = (D_i, t_ij)^T, subject i has n_i pairs of observations (Y_ij, z_ij), j = 1, ..., n_i. Here, we include time as part of the covariate z_ij for notational convenience. Again, it is natural to think of the observations in clusters by individuals. Letting Y_i = (Y_i1, ..., Y_in_i)^T, it is clearly reasonable to consider the m Y_i as independent, as each arises from a different individual. In particular, as pharmacokinetic behavior is a within-individual process, there is no reason to expect that the way in which drug concentrations arise for one person would be related to that for another. However, it is natural to be concerned about correlation of the observations within a given data vector Y_i from two sources:

(a) As discussed above, the time-ordered nature of the observations raises concern over serial correlation. This concern is an "individual" one: even if the available data were only from a single subject, and interest focused on that subject, such
correlation would still be an issue.

(b) As in the developmental toxicology example, from the perspective of the entire population of individuals, it is natural to be concerned that concentrations on the same individual would be "more alike" than concentrations from different individuals. E.g., a certain subject may be a "high concentration" subject, whose concentrations tend to be inherently higher than the average concentration across all subjects if they were all to receive the same dose and be measured at a particular time. Thus, two concentrations from such a subject might be expected to be "high" in this way together.

We will discuss these ideas further in Section 13.2. It is clear, however, that regarding the elements of Y = (Y_1^T, ..., Y_m^T)^T as independent would be erroneous.

WHY WORRY? We have made the case above that, under these more complex data structures, it is not legitimate to view all N observations as independent.

• As regression methods are available under the independence assumption, whose properties are well established and for which implementation is widely available, it is natural to wonder about the consequences of using these methods anyway.

• On the other hand, it seems intuitively clear that, if in truth the data are not all mutually independent, failure to account for this could compromise inferences on parameters of interest.

As it turns out, treating the observations as independent when they are not can lead to several problems:

(i) Inferences may be inefficient in a large-sample sense, analogous to the way that failure to take into account nonconstant variance in the independence setting leads to inefficiency; recall the asymptotic Gauss-Markov result in Section 9.6.

(ii) Failure to take account of correlation among elements of Y represents inaccurate characterization of the variation in the data. As in any problem, standard errors deduced under an incorrect assumption on variation will not appropriately represent the true sampling variation. Thus, a second
problem will be that estimated standard errors and associated inferences will be biased. This is a more complicated version of the issue we saw in Section 9.4, where incorrect specification of the variance in the independence case leads to erroneous standard errors.

(iii) Finally, the usual regression model may not be an appropriate framework in which to state and address the scientific objectives.

We will exhibit problems (i) and (ii) formally in Chapter 14; problem (iii) will become clear shortly. There is thus ample motivation to consider regression modeling when all observations cannot be viewed as independent. As we have seen in the examples above, when independence cannot be assumed, it may be the case that groups of observations may be thought of naturally, by design, as falling into clusters, each with an associated response vector. Although it may not be legitimate to think of the observations within a given vector as independent, it is reasonable to view the vectors as independent. Hence, it is natural to think of data in the form of vectors, where these vectors are a set of m independent multivariate responses.

It is important to note that things could be even more complicated.

• In the examples we have considered, the vectors of independent observations are obvious; each is from a different unit. In some experiments, the vectors that may be treated as independent are more difficult to identify. For instance, a famous data set reported by McCullagh and Nelder (1989, Section 14.5) involves a number of male and female salamanders of two different varieties. The males and females were placed together to mate with one another, across and within varieties, according to a rather complex design. The response was the number of observed matings that took place during each pairing. It is obvious that responses involving the same female or male could potentially be correlated, as particular males or females may be more or less interested in mating! Clearly, in this crossed experiment, identifying
independent data vectors is complicated.

• When observations are taken over a physical area in two-dimensional space, it is reasonable to be concerned about correlation due to physical proximity: observations close together may be "more alike" than those far apart. Hopefully, the correlation "dies off" as they become farther apart in space, but there is no natural way to decide if some observations could be treated as independent from others, as in this setting there are no individual "units." Thus, there is no obvious way to represent the data as independent vectors. Frameworks for this situation are the subject of study of spatial statistics.

Here, we will limit our consideration of multivariate response to the situation where identification of independent response vectors may be made unambiguously. The primary setting where this occurs is that of repeated measurement data, where repeated measurements over time (or some other factor) are taken on independent units. Most of our examples, and the terminology we use, will follow from this. It is important to recognize that the models and methods we will discuss are relevant to any situation where such independent vectors may be identified.

INFERENTIAL OBJECTIVES: It is often the case that the independent vectors are collected on each of a number of individual units (mother rats, subjects), where these units are randomly sampled from some populations of interest.

• Thus, scientific interest generally focuses on saying something about these populations.

• Accordingly, modeling these data must take this into account. In the univariate case, questions of interest could be specified in terms of parameters in the mean-variance model. In the multivariate setting, useful models will involve parameters that characterize features of the populations of interest in some way.

We now formally describe the data structure we will consider.

DATA STRUCTURE AND FRAMEWORK: The data consist of m response vectors Y_i, i = 1, ..., m, where Y_i = (Y_i1, ..., Y_in_i)^T, so
that the Y_i need not be of the same dimension. Covariate information is also collected. Here, in contrast to the univariate case, covariates may be of two types, and, as we will see, it will be important to distinguish between them. To make the distinction clear, we will adopt a specific notation.

• Within-individual covariates: These are covariates that describe conditions under which the Y_ij were collected on individual i. E.g., in the pharmacokinetic example above, dose D_i, given at time 0, describes a condition for subject i. Such covariates would be important to know even if the focus of inference was restricted to individual i only. Generically, we use the notation u_i to denote such within-individual covariates. For repeated measurement data, we also have time (or other condition of measurement) that may change value for i over j = 1, ..., n_i; e.g., in the pharmacokinetic example, the times t_ij at which blood samples were drawn for subject i. As we have noted, viewing the observations as realizations of a stochastic process for individual i, it may not be completely appropriate to view such values as "covariates." Time (or other condition) is tied up with serial correlation and hence would reasonably be regarded separately when discussing this issue.

For notational convenience, we will use a single notation to refer to both true within-individual covariates u_i and the repeated measurement condition (time) in most cases, and write z_ij (r x 1) to denote all such conditions associated with collecting Y_ij on i; e.g., z_ij = (D_i, t_ij)^T in the pharmacokinetic example. Later, we will be careful to distinguish time from other conditions u_i when it is necessary to do so.

• Individual-level or among-individual covariates: These are covariates that ordinarily do not change value over j = 1, ..., n_i and that often may be viewed as characteristics of i or how i was "treated." For example, in the developmental toxicology example, the dose x_i given to the mother rat does not change value over the pups, whose responses represent the
Y_ij. As such, these are covariates that would be of no interest if one were only interested in modeling for unit i alone, as they do not change. Clearly, such covariates can only be of interest when interest focuses on the population from which units i = 1, ..., m were drawn, as they are features of the entire unit, not what happened within that unit. We will denote these covariates as a_i.

Thus, the available data for i consist of pairs (Y_i1, z_i1), ..., (Y_in_i, z_in_i), along with the associated individual-level covariates a_i. Writing z_i = (z_i1^T, ..., z_in_i^T)^T (n_i r x 1) to denote the collection of within-individual covariates over j (including time), we may think of the data as the triplets (Y_i, z_i, a_i), i = 1, ..., m. As above, we will assume that the (Y_i, z_i, a_i) are independent; however, independence among the components of Y_i is not assumed. As shorthand, we will define x_i = (z_i^T, a_i^T)^T to denote the full set of all covariates associated with Y_i. Thus, we may represent the data more succinctly as independent pairs (Y_i, x_i), i = 1, ..., m.

MODELING MULTIVARIATE RESPONSE: Unlike in the case of univariate response, we cannot immediately continue by writing down a general model, e.g., a mean-covariance matrix model. This is because the modeling issues become considerably more complex in this setting.

• There is no longer one obvious way to proceed. Modeling of these data to a large extent depends on the situation and scientific objectives.

• In Sections 13.3 and 13.4, we will discuss two of the most popular and widely used modeling approaches. It will turn out that both approaches involve modeling of the first and second moments of response vectors, as we did in the univariate case; however, they go about this in different ways.

Before we discuss the modeling strategies, in Section 13.2 we take a closer look at how correlation in such data may be thought to arise. In the approaches we will consider, the overall pattern of correlation is represented in different ways. In either case, a popular approach is to use models for the nature of correlation, either at
the individual or population level. We will review some of the more common models, and their features and implicit assumptions, in Section 13.6.

13.2 Sources of correlation

As we noted previously, with the type of data we are considering, it is possible to imagine correlation among the elements of a response vector Y_i arising at two levels:

(a) due to individual-level sources;

(b) due to population-level sources.

We discuss each of these issues in turn.

INDIVIDUAL-LEVEL SOURCES: An example of (a) is correlation that may be thought to be the result of time-ordered data collection. We have noted that such correlation would be relevant at the level of the individual; i.e., if our only interest were to develop a univariate regression model for the pairs (Y_i1, z_i1), ..., (Y_in_i, z_in_i), this feature might cause us to question the usual independence assumption. Thus, we refer to this as individual-level correlation, because it is due to phenomena that arise at the individual (unit) level and would be an issue regardless of whether interest focuses on the particular individual or the entire population of such individuals.

To allow us to speak more concretely about this, consider Figure 13.1, which depicts a possible representation of how data collected over time on each of several individuals might arise. Figure 13.1(a) shows three hypothetical observed response vectors from different individuals; the plotting symbols represent the actual responses observed at each of several time points for each, and thus correspond to what we might see in practice if we were to plot such data. Figure 13.1(b) is a conceptual representation of the underlying mechanism that might give rise to the actual responses (which are all we get to see, as in Figure 13.1(a)). Focus on the top-most response vector to consider the mechanism for a single individual.

Figure 13.1: Conceptual model for sources of correlation in data collected over time. (Two panels, (a) and (b); response vs. time in each.)

The solid line may be thought of as the
"inherent" response profile for the individual. If we were to consider regression modeling of the data for this individual, this line represents the "mean response" about which responses for this individual vary. Here, we have taken it to be a straight line for simplicity, although the same considerations would apply to a more complicated form that might be represented by a nonlinear model. Of course, as usual, the responses observed in Figure 13.1(a) do not lie exactly on this line. Rather, they deviate from it in a positive or negative fashion. Such deviation may be due to two possible sources.

For definiteness, suppose the response is blood pressure. As is well established, an individual's blood pressure varies throughout the day, quite a bit. If we could continually monitor blood pressure for a given subject with a "perfect" measuring device (so with no error in measurement at all), we might see something like the dashed, continuous line in Figure 13.1(b). The dashed, continuous line thus depicts the actual underlying response process taking place over continuous time; that is, a stochastic process of actual blood pressure going on inside an individual. Of course, in practice, we only might observe the process at intermittent points in time.

This dashed line can be thought of as representing the fact that "real" phenomena tend to not behave entirely smoothly. E.g., a person's blood pressure may follow an "inherent," smooth trend over the long term (in our picture, a straight line), but the actual process of blood pressure second-to-second fluctuates about such a trend, perhaps as a natural biological phenomenon or in response to things the person does, such as drink a cup of coffee. Note that, if we were to observe the process at two points in time very close together, it is very likely that the two observations would tend to be on the same side (i.e., fluctuate on the same side) of the mean response trajectory, e.g., both positive or negative together. However, if we consider two observations on
the process very far apart in time, they would be just as likely to be negative or positive. Thus, in the context of time-ordered data, this is what we mean by two observations being "alike." More generally, the correlation between observations very close together might be likely to be positive and "damp out" to zero as observations become more and more separated.

Now, as the process varies about the line, we could conceive of it as having a variance at each point in time. This process variance has to do with how the process varies about the mean. (The mean can be thought of conceptually as the mean of all such processes that could have occurred.) The field of time series analysis is devoted to modeling of such processes. In the most common form, observations may be taken on a single realization of the process at equally spaced time points, and the goal is to understand the mean and correlation pattern of the process. Typically, the number of observations is very large in this setting. Spatial statistics extends this idea to more than one dimension.

Although we may wish to ascertain blood pressure perfectly, the device used may be subject to measurement error, so that the responses we actually observe deviate from the process somewhat, as depicted in Figure 13.1(b), where the plotting symbols do not necessarily lie on the dashed line. Thus, there is an additional measurement error variance associated with the device. It is generally assumed in time series and spatial statistics analysis that the magnitude of measurement error is very small relative to that of the variation of the process, so it is ignored. In the spatial statistics literature, this measurement error component is called the "nugget" effect.

Summarizing, then, the conceptual representation says that the response vector for a single individual is intermittent observations on a stochastic process whose realizations fluctuate to some extent about a smooth, inherent trend, possibly subject to additional measurement error.
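This summary can be made concrete with a small numerical sketch. The Python/NumPy fragment below is not from the notes; all variance components, the "range" of the correlation, and the observation times are hypothetical values chosen for illustration. It builds the covariance matrix implied by a process deviation whose correlation damps exponentially with time separation, plus an independent measurement error, and shows that the implied correlation between observed responses is large for observations close in time and essentially zero for observations far apart.

```python
import numpy as np

# Hypothetical within-individual setup: observation times for one subject,
# process-deviation SD, measurement-error SD, and a correlation "range"
# -- all illustrative values only, not quantities from the notes.
t = np.array([0.0, 0.25, 0.5, 2.0, 6.0, 12.0])
sigma_P, sigma_M, corr_range = 2.0, 0.5, 1.5

# Exponentially "damping" process covariance:
#   cov{eP(t), eP(s)} = sigma_P^2 * exp(-|t - s| / corr_range)
D = np.abs(t[:, None] - t[None, :])
Sigma_P = sigma_P**2 * np.exp(-D / corr_range)

# Measurement errors are independent of each other and of the process,
# so the variances add: var(Y_ij) = var(eP_ij) + var(eM_ij).
Sigma = Sigma_P + sigma_M**2 * np.eye(len(t))

# Implied correlation matrix of the observed responses
sd = np.sqrt(np.diag(Sigma))
R = Sigma / np.outer(sd, sd)

print(np.round(R[0, 1], 3))   # 0.25 apart: large correlation -> 0.797
print(np.round(R[0, 5], 3))   # 12 apart: essentially zero -> 0.0
```

Note that even as the time separation shrinks to zero, the implied correlation of the observed responses stays below 1 (here 4/4.25 ≈ 0.94); the gap is exactly the measurement-error contribution, i.e., the "nugget" effect mentioned above.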
We can be a bit more formal about this. Focusing on individual i, suppose we believe that the inherent response profile, or mean response, for i is a straight line with intercept β_0i and slope β_1i. Note that, because we are talking about individual i only, we subscript the intercept and slope by i to emphasize that we refer only to individual i's "inherent" or personal mean trajectory. If we were to consider a regression model for i, we could start with the stochastic process

Y_i(t) = β_0i + β_1i t + e_i(t);

here, Y_i(t) is the response we would observe on individual i at time t if we could monitor him in continuous time. The deviation e_i(t) = Y_i(t) − β_0i − β_1i t has mean zero and represents the fact that the stochastic process fluctuates about the smooth, straight-line trend, plus the effects of measurement error. Thus, we could consider writing

e_i(t) = e_P,i(t) + e_M,i(t),

where e_P,i(t) represents the part of the overall deviation e_i(t) due to variation in the process, and e_M,i(t) represents the part of the overall deviation due to measurement error, each with mean 0. Now, of course, we obtain observations at intermittent times t_ij. Thus, let Y_ij = Y_i(t_ij), e_ij = e_i(t_ij), e_P,ij = e_P,i(t_ij), and e_M,ij = e_M,i(t_ij). We now discuss assumptions one might make on the e_P,ij and e_M,ij.

It is usually reasonable to suppose that the deviations due to error in the measuring device are independent, no matter how close together in time they are; the errors made by the device are unlikely to be systematic over time. Formally, this is represented by the assumption that the e_M,ij are independent for all j. It furthermore makes sense that e_M,ij is independent of e_P,ij for a typical measuring device (although it is possible that the magnitude of the error committed could be different depending on the magnitude of the deviation of the process for some devices). Under these assumptions,

var(Y_ij) = var(e_P,ij) + var(e_M,ij).

Note here that we do not condition on t_ij as a covariate, as time is an inherent part of the process under study. In the usual regression literature, as we have discussed, it is customary to include time as
a covariate and then condition on all covariates. We will abuse notation similarly later; however, it is important to note what is really meant when we discuss such processes. Thus, even in the usual regression situation, if we conceptualize things this way, it is clear that the variance being modeled potentially has components due to both sources!

On the other hand, although e_M,ij and e_M,ij' are reasonably assumed independent for j ≠ j', e_P,ij and e_P,ij' may well be correlated. The magnitude of this correlation is likely to be nonnegligible if t_ij and t_ij' are close. It is this correlation that concerns us with data collected over time. As we will discuss in Section 13.6, there are several popular models for this type of correlation that "damps out" over time.

• The important point here is that correlation arising because of time-ordered data collection is a within-individual effect.

OTHER FORMS OF CLUSTERING: The above discussion pertains to responses collected over time. In the developmental toxicity example, for which the elements of a response vector are in no particular order, we did not mention individual-level sources of correlation, but it is natural to wonder about this. To place the problem in an individual regression context, suppose instead that Y_i consists of responses Y_ij on each pup, where each pup j receives a different dose z_ij of a drug after birth. Here, it is a bit hard to envision the analogue of a "process" over time, but we can conceptualize something similar by thinking that the mother rat has potentially an infinite number of pups who could be exposed to a particular dose, and we have seen one of them at each dose. As all the pups are not identical, their responses vary about the average of all of that mother's pup responses at that dose. (There may also be an additional component of measurement error, which we will assume is negligible for the purposes of the following demonstration.)

What would be a reasonable way to represent correlation? In the
case of time-ordered responses, models that allow correlation to "damp out" over time seem sensible. Here, however, there is no such ordering. Writing the model for mother i as

Y_ij = β_0i + β_1i z_ij + e_P,ij,

where, as before, β_0i and β_1i describe the dose-response relationship for mother i, say, it is reasonable to suppose that e_P,ij and e_P,ij' are correlated within the individual mother rat in an identical fashion for all j and j', as there is no natural ordering. For simplicity, assume that the e_P,ij are independent of the z_ij, as in the classical regression set-up, and suppose that var(e_P,ij) = σ² and corr(e_P,ij, e_P,ij') = α for all j ≠ j'.

It turns out that such within-individual correlation is uninteresting if our interest focuses on deducing the dose-response relationship for mother i, as we now show. Under the conditions above, it is straightforward to see that

var(Y_i) = σ² Γ_i(α),  Γ_i(α) = (1 − α) I_{n_i} + α J_{n_i},    (13.1)

where J_{n_i} is an (n_i x n_i) matrix of all 1's; i.e., Γ_i(α) has 1's on the diagonal and α everywhere off the diagonal. As we will discuss in Section 13.6, this correlation structure is called the compound symmetric or exchangeable correlation model.

Assume that the z_ij have been centered, so that z_i1 + ··· + z_in_i = 0. Suppose we ignore the correlation and estimate β_i = (β_0i, β_1i)^T by OLS:

β_i,OLS = (Z_i^T Z_i)^{-1} Z_i^T Y_i,

where Z_i is the obvious design matrix. It follows that

var(β_i,OLS) = σ² (Z_i^T Z_i)^{-1} Z_i^T Γ_i(α) Z_i (Z_i^T Z_i)^{-1}.

Instead, suppose that we decide to take the correlation into account, and α is known. Then the matrix Γ_i(α) is a known matrix, and, from standard "linear model theory," a natural approach would be the WLS estimator

β_i,WLS = {Z_i^T Γ_i^{-1}(α) Z_i}^{-1} Z_i^T Γ_i^{-1}(α) Y_i,

which has covariance matrix

var(β_i,WLS) = σ² {Z_i^T Γ_i^{-1}(α) Z_i}^{-1}.

It is a straightforward exercise in matrix algebra to show that

var(β_i,OLS) = σ² diag[ {1 + (n_i − 1)α}/n_i , (1 − α)/Σ_{j=1}^{n_i} z_ij² ] = var(β_i,WLS).

That is, whether one takes the correlation into account or not does not affect the quality of estimation of β_i. This result illustrates a general concept: if interest focuses on inference for a single individual only, then, if the correlation among elements of
Y_i is of the exchangeable type, it is fruitless to take it into account. An intuitive explanation (e.g., Diggle, Heagerty, Liang, and Zeger, 2002) is that, with a common correlation among all observations, there is no reason to weight observations differently. This explains why this suspected pattern of correlation is generally not accounted for at the individual level; it is not necessary.

REMARKS:

• Again, we emphasize that what we are calling correlation due to individual-level sources would be of interest even if our objective was modeling and inference for a single individual or unit (except in the case where this correlation is exchangeable/compound symmetric). In our treatment of regression models for univariate response, we made the assumption of independence, so we did not address within-individual correlation in that development. However, note that we could have discussed this if we had not restricted attention to the independence assumption; such correlation would be relevant at the individual level.

• When data are in the form of repeated measurements over time (or some other factor) for each of a sample of m units from a population of interest, it is usually the case that the number of observations on each unit, n_i for the ith such unit, is small. Moreover, this small number of observations is often taken quite intermittently, so that the observations are far apart in time, space, or whatever the repeated measurement factor is. Consequently, information on the pattern of correlation over time (or other factor) is generally fairly sparse. Further, because observations are often far apart, correlation among the available observations is often negligible. In fact, it turns out that, for such data, correlation due to population-level sources (discussed next) dominates the contribution of that due to individual-level sources in relative magnitude.

• As a result, it is quite often the case that correlation attributable to individual-level sources is treated as
negligible. We will exhibit this more precisely in Section 13.3.

POPULATION-LEVEL SOURCES: As we have remarked, if we consider the response vectors as drawn from some population(s), we would expect responses from the same vector to be "more alike" than those from different vectors. E.g., in the rat example, all pups from a particular mother might tend to have high birthweight relative to the rest of the population. This phenomenon is depicted pictorially, in the case of repeated measurements over time, in Figure 13.1. In Figure 13.1(a), the three observed response vectors from three different units span the range of the response. If we consider the first two observations from the top-most vector, they are clearly "high together," while the first two observations from two different vectors vary quite a bit.

Figure 13.1(b) offers an underlying conceptual representation of how this might arise. Each of the three vectors, each corresponding to a different unit, has an associated solid line, which we may think of, as above, as the mean of all possible response vectors over time for that unit. This mean response may be thought of as the inherent trajectory for the unit; that is, from the perspective of the population of all units, it dictates the general "location" and trend that observed response vectors for that unit would have, relative to those for other units. Thus, responses for the top unit, which vary about its inherent trend, would tend to be "high," because its inherent trend is high relative to other units; similarly, responses would be "low" for the bottom-most unit. Note that, because some trends are steep and some shallow, how they vary among themselves in fact changes over time.

To summarize, population-level correlation can be thought to arise because all responses within a given vector vary about a shared, underlying, inherent trend that dictates where responses for that unit are likely to be observed relative to those for other units, which each have their own underlying trends. We will
formalize this idea in Section 13.3.

If we conceive of the population of units as the population of all such inherent trends, then the heavy solid line in Figure 13.1(b) represents the mean of all such trends at each time point; the open symbols denote the means at each time point where there are observed responses.

• Hence, we may think of this solid line as the population average over all units in the population. Note that, at any time point, as these trends represent the mean of possible observed responses for a particular unit (resulting from the realized process and possible measurement errors), we may also think of this solid line as the population average over all possible observations.

• Likewise, we can think of how individual observed response vectors vary about this average. Clearly, this will depend on how the inherent trends vary about it, as well as how responses vary about these trends and are subject to measurement error within an individual.

It may in fact be the case that some of this variation, particularly that in inherent trends, may be attributable to systematic sources; e.g., perhaps units were assigned to different treatments, and units receiving treatment A tend to have higher profiles than those receiving B. Of course, some of the variation may be due to unexplainable sources, as in any regression problem. Again, we will formalize all of this in Section 13.3.

REMARKS:

• Note that, if inference focuses only on a single unit, population-level sources of correlation are irrelevant. For instance, in Example 1.1, interest is on the pharmacokinetic behavior of the particular subject. Whether this behavior is "high" or "low" relative to that for other subjects is of no importance to this objective. However, if interest focuses on the population from which this subject was drawn, this correlation does become relevant.

• This brief discussion of sources of correlation is not meant to be technically precise, nor is the conceptual model we have considered the only way
to think about the issues. The overall message is that, with such multivariate data, one must consider both the individual and the population of individuals in modeling. In the two main approaches we will consider, this is carried out in different ways.

13.3 Subject-specific modeling

Zeger, Liang, and Albert (1988) coined the terms subject-specific and population-averaged modeling to describe the two competing approaches we now discuss. The rationale for these terms will become clear shortly.

To motivate our discussion of subject-specific models, recall the data on the pharmacokinetics of theophylline in Example 1.8. Theophylline concentrations were determined on each of m = 12 subjects at several times following a single oral dose D_i of theophylline (the dose was given in units of mg/kg according to the subject's weight in kg, so that all subjects received the same dose on a per-weight basis). The times for each subject were not necessarily the same. Thus, each subject i has an associated vector of concentrations Y_i = (Y_i1, ..., Y_in_i)^T, where Y_ij was observed at time t_ij following dose D_i. Let z_ij = (D_i, t_ij)^T. Here, D_i is a within-individual covariate u_i in the strict sense, which for notational convenience we group together with time t_ij. For convenience, Figure 1.7, showing data from four subjects, is reproduced in Figure 13.2. The objective of the study is to understand pharmacokinetic behavior in the population of subjects. More precisely, the objective is:

Figure 13.2: Theophylline concentration-time profiles for four subjects (Subjects 1, 6, 10, and 12) receiving an oral dose of theophylline at time 0, with fits of the model superimposed (concentration vs. time, 0-25 hr, in each panel).

• To learn about variability in how pharmacokinetic processes (absorption, distribution, elimination) take place in the population, and the "typical" or average such behavior.

A framework that
allows this objective to be addressed in a scientifically meaningful way is required.

As discussed in Example 1.8, a model that is thought to describe the concentration at time t for a particular subject following a single oral dose D of theophylline is the one-compartment model with first-order absorption, given by

f(t) = [k_a D / {V(k_a − k_e)}] {exp(−k_e t) − exp(−k_a t)},  k_e = Cl/V,   (13.2)

where k_a, Cl, and V are the fractional absorption rate, clearance, and volume of distribution.

• We emphasize that (13.2) is a model for the within-individual process of pharmacokinetics within a single subject. The model arises from theoretical considerations about this process taking place within the subject. Accordingly, it makes sense only as a model for individual behavior. This model dictates a smooth trajectory over time; thus, in the context of the conceptual model we discussed earlier, we can think of (13.2) as describing the inherent trajectory that concentrations might follow.

• From Figure 13.2, which has individual regression fits of (13.2) superimposed on the data, although the general form of the model seems to describe the pattern well for all subjects, the form is not identical for them all. For example, Subject 1 has a much higher peak concentration than the others, and Subject 12's pattern of decay at the later time points seems steeper. As discussed in Example 1.8, these different manifestations of the common pattern can be thought to be due to differences in the values of the parameters (k_a, Cl, V) governing the underlying individual pharmacokinetic processes across subjects.

• Thus, variation in pharmacokinetic behavior, and the typical behavior, in the population can be conceptualized in terms of these parameters. Specifically, the population mean values of each of the parameters characterize "typical" behavior, while the pattern of variation and covariation of parameter values across the population characterizes inter-subject variation in pharmacokinetic behavior.

An appropriate statistical model
should thus:

• Preserve the notion that pharmacokinetics "happens" at the individual subject level; i.e., model (13.2) describes within-individual behavior.

• Allow this behavior to vary across subjects, by allowing the parameters to vary across subjects.

• From the previous section, account for correlation from individual- and population-level sources.

A model that does this is the following. Let β_i = (β_1i, β_2i, β_3i)^T denote the pharmacokinetic parameters specific to subject i, where β_1i = clearance, β_2i = volume, and β_3i = absorption rate for individual i. Letting u_i = D_i, we can think of things in continuous time as

f(t, u_i, β_i) = [β_3i D_i / {β_2i(β_3i − β_1i/β_2i)}] {exp(−(β_1i/β_2i)t) − exp(−β_3i t)}.

For the particular times at which we have measurements, we can write

f(z_ij, β_i) = f(t_ij, u_i, β_i) = [β_3i D_i / {β_2i(β_3i − β_1i/β_2i)}] {exp(−(β_1i/β_2i)t_ij) − exp(−β_3i t_ij)}.

Hence, the function f is the one-compartment model thought to govern theophylline kinetics for any subject; exactly how this happens for subject i is determined by that subject's specific parameters β_i.

The function f(z, β_i) = f(t, u, β_i), say, characterizes inherent pharmacokinetic behavior for subject i; as in the usual regression setting, responses observed for i will vary about f(z, β_i) at each time.

• As discussed in Section 13.2, part of this variation may be due to the fact that the actual realization of the pharmacokinetic process taking place over time within i does not follow exactly the smooth path f(z, β_i). Additional variation may be due to measurement error, e.g., introduced by the assay used to determine drug concentrations from blood samples at each t_ij.

• It is thus natural to think of f(z, β_i) as the "mean response" for subject i, as in the discussion of Figure 13.1. Analogous to our previous discussion, we can think of the stochastic process

Y_i(t) = f(t, u_i, β_i) + e_i(t),

where e_i(t) is a mean-zero deviation that we will discuss further shortly. Letting Y_ij = Y_i(t_ij), these considerations suggest that, at the individual subject level, we have

E(Y_ij | z_ij, β_i) = f(z_ij, β_i).   (13.3)

• The mean is conditional on β_i, as the model describes
individual-level behavior, where β_i may be viewed as a fixed parameter from the perspective of individual i.

• We have abused notation by also conditioning on z_ij = (u_i, t_ij), which includes time; given the special role of time, it would be more precise to condition on u_i only. For notational convenience, we will continue to write things by conditioning on z_ij, but emphasize that it is important to keep in mind what we really mean here.

• Clearly, we could also write a model for the conditional variance, taking into account the sources described above. We will discuss this in more detail momentarily.

With this representation, a way to characterize "typical" pharmacokinetic behavior, and how it varies in the population, presents itself. Each subject i has his/her own pharmacokinetic parameters, so a natural way to formalize this is to think of the β_i as random vectors arising from some multivariate distribution. Of course, the β_i are unobservable, so we cannot use what we see to model them, but we can make some plausible assumptions.

• For example, we might assume that

β_i ~ N(β, D).   (13.4)

This model thus characterizes the population of subjects in terms of the distribution of their underlying, inherent pharmacokinetic parameters. Here, β represents the mean, or "typical," parameter values in the population, while the covariance matrix D describes variation and covariation of the parameters in the population. The normality assumption as in (13.4) is commonly made, but is not necessary. We may write

β_i = β + b_i,   (13.5)

where the b_i are random vectors, usually taken to be independent of x_i, such that E(b_i) = 0, var(b_i) = D. Of course, under (13.4), b_i ~ N(0, D). The b_i are usually referred to as random effects; they describe how the parameter vector for a randomly chosen subject i deviates from the population mean β.

In fact, (13.5) has the look of a very simple "regression model" for the parameter vectors β_i and may be extended readily. As a simple example, whether a subject is a smoker has been shown to affect pharmacokinetic behavior
for some drugs; e.g., drug clearance may be accelerated in the presence of nicotine. Thus, under such circumstances, part of the overall variation across the β_i may in fact be due to the systematic effect of smoking status. This may be incorporated by adopting the model

β_i = β_0 + β_1 δ_i + b_i,

where δ_i = 1 if i is a smoker and 0 if not. Thus, a_i = δ_i is an individual-level covariate. With var(b_i) = D for all i, this model says that mean ("typical") pharmacokinetic parameter values may differ for smokers and nonsmokers, although the variation about the mean in each population (smokers and nonsmokers) is similar. This is analogous to the usual classical assumption in analysis of variance that, although factors might affect the mean, variance is similar across groups. Of course, the b_i could be taken to have different distributions for each population, in which case they would not be independent of a_i.

More generally, a number of baseline characteristics of individuals may be thought to influence pharmacokinetic behavior, and thus account for some of the variation in the population in a systematic way. E.g., drug elimination, as measured by clearance, is often associated with weight, age, kidney function status (the kidneys are directly involved in elimination), and so on. These characteristics are at the level of the individual, so are components of a_i.

Extending (13.5) more generally, then, we may consider

β_i = A_i β + b_i,   (13.6)

where A_i is a design matrix that is a function of the components of a_i. For example, if w_i is the weight of subject i and c_i is creatinine clearance for i (a measure of kidney function), a_i = (w_i, c_i)^T, we might postulate a model that says that β_1i (drug clearance) depends on w_i and c_i, while β_2i (volume) and β_3i (absorption rate) depend only on w_i (these would be unlikely to be affected by kidney function). Such a model is

[ β_1i ]   [ 1  w_i  c_i  0  0    0  0   ]
[ β_2i ] = [ 0  0    0    1  w_i  0  0   ] β + b_i,   β = (β_Cl,0, β_Cl,w, β_Cl,c, β_V,0, β_V,w, β_ka,0, β_ka,w)^T.
[ β_3i ]   [ 0  0    0    0  0    1  w_i ]

This model allows mean clearance, volume, and absorption rate to change systematically according to
weight and creatinine clearance in a linear fashion. Thus, the elements of β describe mean pharmacokinetic parameter behavior. As in usual regression, whether clearance varies with subject weight systematically in this fashion is addressed by the parameter β_Cl,w.

As with ordinary regression modeling, one could consider introducing interactions, higher-order polynomial terms, etc., which could be accommodated by an appropriate matrix A_i. The model A_i β could even be replaced by a vector of functions nonlinear in β. For definiteness, we will consider the linear model (13.6) in the sequel; of course, (13.5) is a special case with A_i = I.

SUMMARY: We may write the basic features of the model we have described for the pairs of observed data (Y_1, x_1), ..., (Y_m, x_m) in two stages, as follows:

Stage 1: Individual model. From (13.3), E(Y_ij | z_ij, β_i) = f(z_ij, β_i).

Stage 2: Population model. From (13.6), β_i = A_i β + b_i.

WITHIN-INDIVIDUAL VARIATION: The model is not complete, however. We still must consider individual-level variation about f(z_ij, β_i). Analogous to the discussion of Section 13.2, writing e_ij = Y_ij − f(z_ij, β_i), we have E(e_ij | z_ij, β_i) = 0, where technically conditioning is on u_i and β_i only. We may think of

e_ij = e_P,ij + e_M,ij,

where e_P,ij represents the part of the deviation associated with the biological "fluctuations" of the realized process, and e_M,ij that associated with measurement error. In the particular application of pharmacokinetics, the e_M,ij are reasonably assumed independent across j, as each blood sample is assayed separately. It is also reasonable to assume that the e_M,ij are independent of the e_P,ij, as the magnitude of deviations caused by assay error is likely to depend more on the overall magnitude of the response, as represented by f(z_ij, β_i), than on a local fluctuation away from it. Correlation between e_P,ij and e_P,ij′, j ≠ j′, may be nonnegligible if t_ij and t_ij′ are close in time. It is standard to assume that within-individual fluctuations are sufficiently "local" in time, and that observation times are far enough apart, so that this correlation is negligible.
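To make the two within-individual components concrete, here is a small numerical sketch (in Python; all model choices and parameter values are illustrative assumptions, not from the notes). It combines process fluctuations e_P having an assumed common variance and exponential correlation in time with independent measurement errors e_M whose standard deviation follows an assumed power-of-the-mean function, using the one-compartment mean (13.2):

```python
import numpy as np

def one_compartment(t, D, Cl, V, ka):
    """One-compartment model with first-order absorption; ke = Cl/V (eq. 13.2)."""
    ke = Cl / V
    return (ka * D) / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

def within_individual_cov(t, f, sigma_P, sigma_M, theta, alpha):
    """Covariance of e_i = e_P,i + e_M,i under illustrative assumptions:
    e_P has common variance sigma_P^2 and exponential correlation in time;
    e_M is independent, with variance sigma_M^2 * g^2, where g = f^theta."""
    # corr{e_P(t_j), e_P(t_k)} = exp(-alpha |t_j - t_k|): "local in time"
    Gamma = np.exp(-alpha * np.abs(t[:, None] - t[None, :]))
    W = np.diag(f ** (2.0 * theta))  # diagonal of measurement-error variances
    return sigma_P**2 * Gamma + sigma_M**2 * W

# Hypothetical subject: dose 4 mg/kg, Cl = 0.04 L/hr/kg, V = 0.5 L/kg, ka = 1.5 /hr
t = np.array([0.5, 1.0, 2.0, 6.0, 12.0])               # sampling times (hr)
f = one_compartment(t, D=4.0, Cl=0.04, V=0.5, ka=1.5)  # within-subject mean
R = within_individual_cov(t, f, sigma_P=0.1, sigma_M=0.15, theta=1.0, alpha=0.8)

# Off-diagonal entries come only from e_P and shrink as |t_j - t_k| grows,
# which is the "local in time" behavior described above.
print(np.round(R, 4))
```

Setting sigma_P = 0 recovers the diagonal, measurement-error-only specification often adopted in pharmacokinetics, as discussed below.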
However, in principle, this need not be the case.

Combining, assuming the e_P,ij and e_M,ij are independent for all j, we have in general that

var(Y_ij | z_ij, β_i) = var(e_P,ij | z_ij, β_i) + var(e_M,ij | z_ij, β_i).

This shows that, even for individual-level modeling, the assumed variance model is an attempt to characterize the sum of variances due to both within-individual sources. Thus, the most general model would have two components. For example, suppose that fluctuations about f(z, β_i), represented by e_P,ij, are of similar magnitude, so have common variance var(e_P,ij | z_ij, β_i) = σ_P². Suppose further that var(e_M,ij | z_ij, β_i) = σ_M² g²(β_i, θ, z_ij) for some variance function g, reflecting the idea that variation due to the assay is of different magnitude depending on the magnitude of the response. Then these assumptions imply the model

var(Y_ij | z_ij, β_i) = σ_P² + σ_M² g²(β_i, θ, z_ij).   (13.7)

• Note that the variance parameters σ_P², σ_M², and θ do not have a subscript i. A more general model could adopt this, thus allowing the magnitudes of variation to be subject-specific. Here, allowing σ_M² and θ to be the same for all i reflects the reasonable assumption that errors due to the common assay used to process blood samples for all m subjects are probably similar. Allowing σ_P² to be the same for all i implies the belief that fluctuations from the individual-specific inherent trajectories f(z, β_i) are of similar magnitudes for all individuals; this may or may not be biologically plausible.

• (13.7) is more complicated than variance functions we have discussed for individual regression modeling in previous chapters. It would certainly have been possible in previous chapters to entertain such models; however, with limited data from a single individual, it is often quite challenging to estimate all the variance parameters. Thus, it is routine to make simplifying assumptions.

• In the context of pharmacokinetics, it is often thought that the measurement error due to the assay is the predominant source of variation about f(z_ij, β_i); biological fluctuations about the trajectory f(z, β_i) are very
small by comparison. In (13.7), these considerations would lead to the popular model var(Y_ij | z_ij, β_i) = σ_M² g²(β_i, θ, z_ij).

• Another, less precise, perspective is that variance specifications popularly adopted at the individual level that are not in the form of a sum of two expressions (one for each source) do not attempt to distinguish between the two sources, but rather only try to approximate the overall, aggregate pattern of intra-individual variation that is admittedly more complicated, as the sum of two components.

In any event, the above discussion indicates the considerations involved in thinking about intra-individual sources of variance. A fuller discussion may be found in Davidian and Giltinan (2003).

We now consider correlation. Defining f_i(z_i, β_i) = {f(z_i1, β_i), ..., f(z_in_i, β_i)}^T, e_i = (e_i1, ..., e_in_i)^T, and e_P,i and e_M,i similarly, we may write

Y_i = f_i(z_i, β_i) + e_i = f_i(z_i, β_i) + e_P,i + e_M,i.

Under our above assumptions, e_M,i has identity correlation matrix and is independent of e_P,i, whose elements may in fact be correlated. If it is believed that this correlation is nonnegligible, we may consider a model to describe its pattern; popular models that may be used are discussed in Section 13.6. Write Γ_i(α, z_i) to denote the correlation matrix of e_P,i, where α is a parameter characterizing the model.

• Again, we have taken the correlation parameter α to be the same regardless of i, which implies the assumption that the pattern is similar across subjects.

• We have also allowed the model to depend on z_i; as we will discuss in Section 13.6, some correlation models may depend on the actual times of the observations (the "exchangeable" correlation model does not). Note that the correlation matrix is taken not to depend on i here.

Combining all of this together leads to specification of a model for the intra-individual covariance matrix var(Y_i | z_i, β_i).

• In our particular example, we have

var(Y_i | z_i, β_i) = σ_P² Γ_i(α, z_i) + σ_M² W_i(β_i, θ, z_i),   (13.8)

where W_i(β_i, θ, z_i) = diag{g²(β_i, θ, z_i1), ..., g²(β_i, θ, z_in_i)}, as usual.

• Often, correlation is assumed negligible, and this matrix is taken to be
diagonal.

GENERAL SUBJECT-SPECIFIC MODEL: We are now in a position to state the full model in general. Let a_i denote the covariates for subject i that are involved in constructing the matrix A_i.

Stage 1: Individual model. The model for individual behavior, in its conventional form, may be written as

E(Y_i | z_i, a_i, b_i) = E(Y_i | x_i, b_i) = f_i(z_i, β_i) = f_i(z_i, A_i β + b_i) = f_i(x_i, β, b_i),
var(Y_i | z_i, a_i, b_i) = R_i(β_i, ξ, z_i) = R_i(x_i, β, b_i, ξ).   (13.9)

Stage 2: Population model.

β_i = A_i β + b_i,   (13.10)

where the b_i are usually assumed to be iid, independent of z_i and a_i, with E(b_i) = 0, var(b_i) = D.

REMARKS:

• It is customary to write the conditional mean and variance in Stage 1 with conditioning on b_i rather than β_i, as the b_i are random vectors taken in Stage 2 to be independent of everything else. Substitution of (13.10) in Stage 1 means that, with respect to covariates, the conditional mean and variance are conditional on both the within-individual covariates z_i and the individual-level covariates in A_i (a_i), and thus on x_i, as indicated.

• In (13.9), R_i(β_i, ξ, z_i) is the covariance matrix implied by the assumptions on within-individual sources of variation and correlation, e.g., as in (13.8). Here, ξ represents all parameters characterizing these sources; e.g., ξ = (σ_P, σ_M, θ^T, α^T)^T in the example.

• Regardless of what is believed about variation within individuals, note that observations in the same data vector Y_i share the same random effect b_i. Thus, the model naturally takes into account the population-level phenomenon that observations on the same individual tend to be "more alike" than observations from different individuals. We will exhibit this formally momentarily.

RESULT: The model in (13.9) and (13.10) is a subject-specific model, where modeling is in two stages: individual level (13.9) and population level (13.10).

• This model is ideally suited to the scientific objectives stated in the case of pharmacokinetics: β and D characterize mean and variation in the population of parameters, which is of direct scientific interest.

• Moreover, the model
acknowledges that individual-level behavior is well understood, in the sense that a theoretical model for it can be incorporated directly.

MARGINAL (CONDITIONAL) MOMENTS: Recall that in regression modeling of univariate response, we model "what we see," in the sense that the general mean-variance model is a model for the marginal moments of the response conditional on covariates. Here, we observe (Y_i, x_i), i = 1, ..., m. The model in (13.9)-(13.10) does not directly yield an analogous model for the multivariate case; that is, a model for E(Y_i | x_i) and var(Y_i | x_i). We now consider what the model implies about these marginal moments of the Y_i, conditional only on the covariates.

From (13.9), we have in general that

E(Y_i | x_i) = E{E(Y_i | x_i, b_i) | x_i} = E{f_i(x_i, β, b_i) | x_i},   (13.11)

where the expectation is over the distribution of the b_i given x_i. That is, to obtain the average of all response vectors with covariates x_i = (z_i^T, a_i^T)^T, we average over the population (the distribution) of the b_i.

• Written another way, letting F be the distribution function of the b_i, E(Y_i | x_i) = ∫ f_i(x_i, β, b_i) dF(b_i). Note that here we have used the assumed independence of b_i and x_i.

• For applications like pharmacokinetics, f_i is a highly nonlinear function of the b_i, as it is nonlinear in the individual parameters β_i. Thus, it is likely that (13.11) cannot be evaluated analytically; that is, the integral cannot likely be obtained in closed form.

• Thus, this modeling strategy leads to the unpleasant feature that a mean model analogous to the usual regression models for univariate responses cannot be written directly!

Also from (13.9), using the relationship var(Z) = E{var(Z | V)} + var{E(Z | V)} for random vectors Z and V, we have

var(Y_i | x_i) = E{R_i(x_i, β, b_i, ξ) | x_i} + var{f_i(x_i, β, b_i) | x_i},   (13.12)

where the expectation and variance are again with respect to the distribution of the b_i given x_i in general.

• As with the mean, because the components of R_i and f_i may be nonlinear in b_i, it is not necessarily the case that this represents a closed-form expression for var(Y_i | x_i), as both terms in (13.12) involve likely intractable integrals over the
distribution of b_i.

• Moreover, note that because b_i appears in each element of f_i(x_i, β, b_i), the second term in (13.12) is almost certainly not a diagonal matrix.

• Thus, even if the within-individual covariance matrix R_i is diagonal (i.e., if we believe that correlation due to individual-level sources is negligible), although the first term in (13.12) is diagonal, because the second term need not be, it follows that var(Y_i | x_i) is unlikely to be a diagonal matrix in general. This exhibits how the model allows for the possibility that responses on the same individual are "more alike": the second term in (13.12) automatically incorporates this.

• It is often the case in practice that analysts will assume R_i is a diagonal matrix, without much thought, because the second term in (13.12) is not; the hope is that this term will capture any correlation among the elements of Y_i. However, from the point of view of identifying the separate contributions of individual- and population-level sources to the overall pattern of correlation induced when averaging across the population, this is erroneous. But because it is sometimes difficult to identify, with limited data on each unit, the nature of individual-level correlation, this strategy is often used as an approximation. We will discuss this further in Chapter 15.

SUMMARY: The model in (13.9) and (13.10) is attractive because it allows features of individual-level and population-level variation to be explicitly represented. For applications such as pharmacokinetics, the formulation allows questions of scientific interest to be addressed directly.

• In particular, in such applications, primary interest focuses on the β_i and how they vary in the population. Part of this variation may be systematic, represented by A_i β, and part may be unexplained, represented by b_i.

• As we will discuss in the next section, an alternative approach to modeling pairs (Y_i, x_i), i = 1, ..., m, is indeed to write down a model for the marginal (conditional) mean and covariance matrix, E(Y_i | x_i) and var(Y_i | x_i), directly. Such
a model represents already having averaged across the population (and hence over the b_i in the model here), and hence does not explicitly acknowledge individual behavior through individual-specific parameters like β_i. With such a model, we would not be able to address the scientific issue satisfactorily. Here, interest does not focus directly on the moments of response vectors per se, conditional on covariates; rather, these vectors are of interest only in that they are the available source of information about the β_i, which are not observable directly.

TERMINOLOGY: Models of the form given in (13.9) and (13.10) are known as nonlinear mixed effects models: "nonlinear" for obvious reasons, and "mixed effects" to recognize the presence of both fixed parameters (β, D, etc.) and random effects in the model. We will consider inferential strategies for nonlinear mixed effects models in Chapter 15.

13.4 Population-averaged (marginal) modeling

As mentioned at the end of the previous section, an alternative strategy is one analogous to that adopted in the univariate case: to model the first two moments of the response vectors, conditional on covariates, directly. Although this is not the best strategy for applications like pharmacokinetics, where questions of interest focus on variation in individual-specific features like pharmacokinetic parameters, it is a reasonable strategy in other settings where such inference is not the focus.

To fix ideas, consider the following example. Recall Example 1.5, which described part of the Six Cities study, in which data were collected on mother-child pairs from different cities. Example 1.5 discussed a subset of the data where maternal smoking status, city, and "wheezing status" (a measure of child respiratory health, the binary response) were recorded at child age 10. In reality, the study was longitudinal: mother-child pairs were recruited into the study at child's age 9, and the mother and child were observed every year from age 9 to 12. The objective of the study
was to understand the association between maternal smoking behavior and child respiratory health, taking into account other baseline factors such as city, gender, etc. Thus, the data ideally consist of a response vector Y_i for each mother-child pair i = 1, ..., m, with n_i = 4, where Y_ij = 1 if child i was experiencing wheezing at age t_ij = t_j (the same for all i) and 0 otherwise, with t_j = 9, 10, 11, 12 as j = 1, 2, 3, 4. The t_ij may be regarded as within-individual covariates characterizing "time," the conditions under which responses were observed. Also available are a_i, the vector of baseline factors including city, gender, etc. (an individual-level covariate), and δ_ij, the maternal smoking indicator. We will take δ_ij to be dichotomous for simplicity: δ_ij = 1 if the mother is smoking at t_j and 0 if not. Now δ_ij is also a within-individual-level covariate, in that it changes within i and represents conditions under which i was observed, so in this sense we incorporate it in z_ij = (t_j, δ_ij). However, note that, unlike the t_j, δ_ij is not set by the investigators, but instead is only observed.

This application is an example of a cohort study, in which a group of individuals is followed and information is recorded on each, with the objective of understanding behavior over time. Such studies are common in epidemiology, which seeks to understand the interplay between different potential "risk factors" and outcomes that are important to public health. On the basis of observation of associations among risk factors and responses, epidemiologists would like to make public policy recommendations. Thus, in such circumstances, individual behavior is not the focus; rather, the objective is to understand the phenomenon of interest at the population level, so that broad recommendations can be made. This is most conveniently done, as is customary, on the basis of average behavior across the population of subjects; that is, the nature of the response, on average, for randomly chosen subjects with certain characteristics is of direct interest.

To
give a specific example, in the Six Cities study, a potential maternal smoking pattern for child ages 9-12 might be δ_i = (1, 1, 0, 0)^T; i.e., the mother quit smoking after her child was 10. For all mother-child pairs with baseline characteristics a_i and this smoking pattern, let x_i consist of the times of observations, a_i, and δ_i. Then E(Y_i | x_i) represents the average response vector over all pairs with baseline characteristics a_i and smoking pattern δ_i, so is the "population average" response for the population of all pairs with x_i. To get a sense of how response is associated with smoking, this average could be compared to that for the same baseline covariates a_i and a different smoking pattern.

Contrast this with the situation in pharmacokinetics. There, average response (drug concentration) was not of direct interest; rather, the underlying individual pharmacokinetic behavior, as represented by β_i, was. In fact, in pharmacokinetics there is often a theoretical model describing within-individual behavior; here, there is no theory to characterize how wheezing changes over time for a given subject. The objectives for this kind of study thus have to do with understanding the "typical (average) response vector" as a function of covariates. Thus, as it is not routine for a theoretical model for individual behavior to be available, the most straightforward approach is to focus on modeling this feature directly.

• That is, postulate a model for E(Y_i | x_i) directly from the observed data.

TRICKY BUSINESS: Before we discuss this more formally, it is worth noting that this enterprise can be straightforward or quite tricky, depending on the context.

• Recall the developmental toxicology example in Example 1.7, mentioned at the beginning of this chapter, where mother rat i receives dose d_i of a toxic agent and gives rise to n_i fetuses, whose malformation status (Y_ij = 0 or 1) is recorded. Interest focuses on the relationship between the probability of having a malformation and dose. It seems sensible to represent this as follows. In this situation, d_i is the sole
individual-level covariate, and thus the objective, more precisely stated, is to understand the relationship E(Y_ij | d_i). For the binary response, each component of this conditional expectation is equal to the conditional probability of malformation for a fetus given dose d_i, across all mothers. It would be natural to represent the probability by an empirical model such as the logistic; i.e.,

E(Y_ij | d_i) = exp(β_0 + β_1 d_i) / {1 + exp(β_0 + β_1 d_i)}.

In this case, then, how to write down a model seems unambiguous. This is partly because the data arise from a controlled experiment in which the covariates are determined by the investigator.

• Contrast this with the longitudinal cohort study situation. Here, x_i contains both individual-level baseline characteristics a_i and within-individual covariates (t_j, δ_ij), j = 1, ..., 4; the latter are not controlled, but rather can only be observed. Under these conditions, what is a sensible model for E(Y_i | x_i) (4 × 1, with individual components E(Y_ij | x_i))? Note that (t_j, δ_ij, a_i) is the information that is available only at j, along with the baseline characteristics, available (and the same) for all j. One possibility is to write a model for E(Y_ij | x_i) as depending on x_i only through (t_j, δ_ij, a_i); as Y_ij is binary, we might use a logistic regression model for this purpose; e.g.,

E(Y_ij | x_i) = E(Y_ij | t_j, δ_ij, a_i) = exp(β_0 + β_1 t_j + β_2 δ_ij + a_i^T β_3) / {1 + exp(β_0 + β_1 t_j + β_2 δ_ij + a_i^T β_3)},  β = (β_0, β_1, β_2, β_3^T)^T.   (13.13)

Model (13.13) makes an assumption: namely, that the probability of wheezing at t_j, given all the covariates, depends only on the mother's smoking status at j, and not on her previous smoking behavior at earlier ages. Alternatively, it might be plausible that a child's respiratory health may take longer to "recover" from being in the presence of a smoking mother. Under these circumstances, a model that allows E(Y_ij | x_i) to depend not only on δ_ij but also on δ_i,j−1 for j > 1 might be more appropriate. In this case, we might make the assumption that

E(Y_ij | x_i) = E(Y_ij | t_j, δ_ij, a_i, δ_i,j−1),

and again use a logistic regression model. In fact, it is plausible that a child's respiratory status at t_j might be
associated not only with his/her mother's previous and current smoking status, but with his/her own previous wheezing behavior. This would suggest models where the relevant conditional expectations involve conditioning on previous observations. We will not contemplate such modeling here, but be aware that it may be a realistic alternative in some settings.

The wheezing example thus illustrates that modeling of the mean of multivariate (longitudinal) responses in such a setting involves considerable complexity! What constitutes a realistic model for the association between Y_i and x_i obviously depends on the context. We will not pursue this issue further now, but it is important to recognize that modeling of (conditional on covariates) moments in the multivariate setting is potentially much more complicated than in the univariate setting.

POPULATION-AVERAGE MEAN MODEL: We will write

E(Y_i | x_i) = f_i(x_i, β)   (13.14)

to denote a generic conditional (mean) model of this type. Here, β is a parameter that characterizes the model, analogous to the univariate case.

• Although we use the same notation β as in the subject-specific model, the interpretation of β here is different from that in the subject-specific case. We will discuss this in detail in Section 13.5.

• In order to discuss the components of E(Y_i | x_i), and hence those of f_i(x_i, β), we will adopt the following notation. Denote by x_ij all covariates, both within-individual and individual-level, that are included in the jth component of f_i(x_i, β). We assume that the x_ij are of the same dimension for all i and j. Then let f be a real-valued function used to model E(Y_ij | x_i), which is assumed to equal E(Y_ij | x_ij); e.g., the logistic model, as in the examples above. We will write

f_i(x_i, β) = {f(x_i1, β), f(x_i2, β), ..., f(x_in_i, β)}^T, where E(Y_ij | x_ij) = f(x_ij, β).

Note that we are thus using the function f to denote a model for the marginal (conditional) mean directly. Again, as with β, f here has a different interpretation from f in the context of subject-specific models. We will discuss this further in Section 13.5.

MARGINAL VS. POPULATION
AVERAGE: The cohort study example and the above description highlight an important issue regarding terminology. So far, we have used the terms "marginal" and "population average" almost interchangeably. It turns out that, in this area, they are often taken to have specific meanings.

As the cohort study example emphasized, when one posits a model for E(Y_ij | x_i), one must consider carefully what one is willing to assume about the relationship between elements of Y_i and covariate information. In (13.13), the assumption was made that the conditional expectation, which is a function of all covariate information across all j = 1, ..., n_i (some of which may change with j), depends on x_i only through the elements of x_i that are either time-independent or are associated with time j.

In general, a model like (13.14) in which the dependence of E(Y_ij | x_i) on x_i is taken to be only through covariate information associated with time j (possibly including information that does not change with j) is often referred to as a marginal model. That is, the term "marginal" is used specifically to refer to a population-average model embodying this feature. The term "population-average model" is used to indicate models like (13.14) in much greater generality. It is important to recognize when reading the literature that "marginal" is often used in this way. Our definition of x_ij above is more general, but it is critical to be aware that such notation is often used to imply that x_ij includes only the information associated with j, as described above. We will continue to use the terms interchangeably. The distinction raises some important issues; Pepe and Anderson (1994) is one paper that discusses these issues explicitly.

WHAT ABOUT VARIANCE AND CORRELATION? As in the univariate case, we also wish to specify a model for the second (conditional) moment of Y_i, var(Y_i | x_i) (n_i × n_i). From our discussion in Sections 13.2 and 13.3, there are two sources of variation that combine to lead to the overall pattern of variance and correlation: sources at the individual
level and sources at the population level.

As demonstrated in (13.12), if we adopt the perspective of these two sources, then the marginal (conditional) covariance matrix of Y_i is the sum of two matrices, each arising from one of the sources.

• Thus, the model for var(Y_ij | x_i) is meant to represent the aggregate of variance arising from both sources.

• Similarly, the correlation matrix associated with the model for var(Y_i | x_i) is meant to characterize the overall pattern of correlation resulting from the two sources.

• Thus, in contrast to the subject-specific approach, the two sources are not represented explicitly and separately; rather, only their aggregate is modeled. There is no "automatic" feature that induces a marginal correlation model in this approach; the aggregate correlation is modeled directly.

The approach is thus to postulate a model of the form

E(Y_i | x_i) = f_i(x_i, β),  var(Y_i | x_i) = V_i(β, ξ, x_i) (n_i × n_i),   (13.15)

where β is the regression parameter and ξ is a vector of variance and correlation parameters, to be made precise shortly.

This is generally carried out as follows. For a random vector Z = (Z_1, ..., Z_n)^T with covariance matrix V with elements v_jj′, j, j′ = 1, ..., n, it is straightforward to verify that V may be written as

V = T^{1/2} P T^{1/2},   (13.16)

where T = diag(v_11, v_22, ..., v_nn) is the diagonal matrix with the variances of the Z_j on the diagonal, and P is the correlation matrix of Z, with (j, j′) element v_jj′/(v_jj v_j′j′)^{1/2}. This expression represents the covariance matrix equivalently in terms of the variances of the individual elements and the correlations among them, and is the motivation for the popular approach to specifying V_i in (13.15) proposed by Liang and Zeger (1986).

• In particular, (13.16) suggests deducing a model for V_i by specifying models for the variances and correlations separately. Clearly, writing down a model for the overall variance and pattern of correlation from all sources could be challenging on the basis of complicated multivariate (but limited) data. In particular, specifying a suitable correlation matrix that
accurately represents the aggregate pattern of cor relation may be especially dif cult as this involves considering all possible associations among elements of response vectors In fact as we will discuss in Section 137 depending on the type of data the correlation pattern may be very complicated and may be subject to certain restrictions Understanding the pattern of variance may be a bit easier as this may be examined on a component by component basis Liang and Zeger 1986 suggested that to represent the variance varYijl1i one use models similar to those that would be used in the univariate case 0 For example if the Y are counts the distribution of Y for a given i would be expected to be Poisson so we might consider a model for variance as f1ij6 equal to the mean PAGE 351 CHAPTER 13 ST 762 M DAVDDIAN However if we have counts from m different units there is Poisson variation within units plus additional variation among units ie some units maybe higher or lower than others Thus if we consider the variance averaged across units the resulting variance might be more profound in magnitude than that expected from ordinary Poisson counts alone because it involves both the within unit Poisson variation and among unit variation in the population This is exactly the feature of overdispersion discussed in Section 45 This suggests that a model for varYjl1 where Y is a count should be something that allows for overdispersion due to this phenomenon For example one might postulate varOijlilri 02fij7 7 where 02 most likely gt 1 is an overdispersion parameter Of course alternative models would also be possible In general in light of 1316 such considerations would lead to speci cation of the x diagonal matrix Ti 0 113139 say with diagonal elements varYijl1i that may be parameterized in terms of 6 appearing in the marginal mean model and additional variance parameters 0 As above this model may likely include an unknown scale parameter 0 although this is not re quired Here for brevity 
we have absorbed a into 0 In the sequel we may explicitly acknowledge the scale parameter or not depending on the context As mentioned above the aggregate correlation pattern may be dif cult to specify Diggle Heagerty Liang and Zeger 2002 discuss diagnostic tools for investigating this correlation structure which we mention in Section 136 Recognizing that correctly specifying the correlation structure might be dif cult but also that failure to take correlation into account might result in inef cient and potentially misleading inferences Liang and Zeger 1986 suggested that the analyst instead try only to specify a working model for correlation c That is they advocated trying to select a model that hopefully captures some of the main features of the overall pattern of correlation in the hope that at least acknowledging correlation is better than ignoring it altogether 0 They then suggested carrying out inference allowing for the possibility that the correlation model is only a working model that may be incorrect We will discuss how this latter objective is achieved in Chapter 14 now we simply state the approach to adopting a working correlation model PAGE 352 CHAPTER 13 ST 762 M DAVDDIAN The idea is to select a correlation matrix model 131 that attempts to represent correlation in a relatively simple way 0 The most popular models some of which are discussed in Section 136 are such that F depends on 11 only through the 251739 Pi does not depend on 6 or other covariates In Section 137 we will see that this could be a misspeci cation for some underlying distributions Putting this together the marginal conditional covariance matrix of a data vector is speci ed as 7 7 12 12 MFGHIM 7 Vita 7 111139 7 Ti 57 07 1110114047 111012 57 011 m X 7107 1317 where T and F would be based on the considerations given above and thus 0T aTT SUMMARY The population averaged approach involves writing down directly a model for the rst two marginal conditional on covariates moments of a 
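The decomposition (13.16) and the working specification (13.17) are easy to carry out numerically. The following minimal sketch (in Python, with hypothetical values for the means, overdispersion parameter, and working correlation) assembles a covariance matrix from a vector of variances and a correlation matrix, and verifies that the correlation can be recovered from the result:

```python
import numpy as np

def cov_from_var_and_corr(variances, corr):
    """Assemble a covariance matrix as V = T^{1/2} P T^{1/2}, per (13.16)."""
    s = np.sqrt(np.asarray(variances, dtype=float))
    return corr * np.outer(s, s)  # elementwise: v_jk = s_j * s_k * p_jk

# Hypothetical overdispersed-count variances and a working correlation
means = np.array([2.0, 5.0, 9.0])   # f(x_ij, beta) for one unit
sigma2 = 1.8                        # overdispersion parameter
T_diag = sigma2 * means             # var(Y_ij | x_i) = sigma^2 f(x_ij, beta)
P = np.array([[1.0, 0.4, 0.4],
              [0.4, 1.0, 0.4],
              [0.4, 0.4, 1.0]])     # exchangeable working correlation
V = cov_from_var_and_corr(T_diag, P)

# Recover the correlation matrix back from V
s = np.sqrt(np.diag(V))
print(np.allclose(V / np.outer(s, s), P))  # True
```

Because the variances enter only through T^{1/2}, the same working correlation matrix can be paired with any mean-variance specification, which is exactly the modularity (13.17) exploits.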
From (13.14) and (13.17), the general form of the model may be stated, analogous to the univariate case, as

E(Y_i | x_i) = f_i(x_i, beta), var(Y_i | x_i) = V_i(beta, xi, x_i).   (13.18)

13.5 Comparing subject-specific and population averaged models

As we have seen, depending on the scientific objectives, one of these approaches may make more sense than the other. We now seek to clarify the differences between the two strategies. To begin, we summarize the key features.

SUBJECT-SPECIFIC MODEL: We will abbreviate this as SS.

• This approach tends to be preferred in situations where interest focuses on the distribution of particular aspects of individual behavior.
• More precisely, a model for individual behavior, perhaps based on theoretical considerations, is available, and questions of interest may focus on the distribution of parameters in such a model.
• However, this approach may be used in other situations, too. Here, interest may well focus on the population of responses. The SS approach may be adopted solely as a mechanism for modeling correlation; the random effects induce a correlation structure due to population-level sources.
• Although the approach does not directly present a model for E(Y_i | x_i), such a model is implied; in particular, if a model f is postulated at Stage 1 for individual behavior, then

E(Y_i | x_i) = E{f_i(x_i, beta_i) | x_i}.   (13.19)

Moreover, a correlation structure is naturally induced, as shown in (13.12).

POPULATION AVERAGED MODEL: We will use the abbreviation PA.

• This approach is used when interest focuses on the population of responses. The population average of the response is modeled directly by postulating a model of the form

E(Y_i | x_i) = f_i(x_i, beta).   (13.20)

Here, the function f in f_i is meant to model the population average directly.
• The aggregate pattern of variance and correlation over both individual- and population-level sources is modeled explicitly, as in (13.17).

COMPARISON: The two approaches may or may not lead to the same model and inferences. To appreciate this, suppose that in each approach we use the same model f.

• In the SS approach, we use f to represent E(Y_ij | x_i, b_i), j = 1, ..., n_i, as in (13.19).
• In the PA approach, we use f to represent E(Y_ij | x_i) directly.

It turns out that if f is a linear function, the two strategies can lead to the same model for the marginal mean. To see this, suppose that in the SS approach we have the simplest second-stage model beta_i = beta + b_i, E(b_i | x_i) = 0, and suppose that we take

f_i(x_i, beta_i) = X_i beta_i,

where X_i is a design matrix depending on the elements of x_i. Then

E(Y_i | x_i) = E(X_i beta_i | x_i) = E(X_i beta + X_i b_i | x_i) = X_i beta.

On the other hand, if we take a PA approach, we would write directly E(Y_i | x_i) = X_i beta.

Thus, in the case where f is a linear model, whether one takes a SS or PA approach leads to the same model for the marginal mean. Of course, the models for var(Y_i | x_i) may well be different. However, the fact that the marginal mean model is the same from both perspectives allows the fixed parameter beta to be interpreted two different ways.

• From the SS perspective, beta has the interpretation as the mean of the population of individual regression parameters beta_i that dictate individual-specific mean models. Thus, beta may be interpreted as the "typical" parameter value.
• From the PA perspective, beta has the interpretation as the parameter producing the "typical" response vector.

Because the marginal mean model is the same in the linear case, the analyst is correct in interpreting beta either way.

NONLINEAR f: It is clear that this pleasing feature does not carry over to the more general case of nonlinear f. For a general nonlinear function f(x_ij, beta), say, the marginal mean under a SS model will not equal that under the PA model with the same f. To see this, take beta_i = beta + b_i for the second-stage SS model for simplicity, so that there are no individual-level covariates, and assume b_i is independent of x_i. Then

E(Y_ij | x_i) = INT f(x_ij, beta + b_i) p(b_i) db_i != f(x_ij, beta)

in general. Thus, under each approach, the fixed parameter beta has a different interpretation:

• Under SS, beta is the "typical" parameter value.
• Under PA, beta is the parameter value leading to the "typical" response vector.

EXAMPLE: Recall the wheezing example in Section 13.4. For simplicity, suppose we have only a single within-individual covariate z_ij. Under a PA approach, we might adopt a logistic model of the form

E(Y_ij | x_i) = exp(beta_0 + beta_1 z_ij) / {1 + exp(beta_0 + beta_1 z_ij)}   (13.21)

for the marginal mean. From a SS perspective, we might instead use the logistic model to represent individual behavior. For example, suppose we wrote a model in which the log odds for individual i is linear in z_ij, with each individual having his/her own individual-specific intercept beta_0i and slope beta_1i on the logit scale, satisfying beta_i = beta + b_i, where beta_i = (beta_0i, beta_1i)^T, beta = (beta_0, beta_1)^T, and similarly for b_i. Then the model would take the form

E(Y_ij | x_i, b_i) = exp(beta_0 + beta_1 z_ij + b_0i + b_1i z_ij) / {1 + exp(beta_0 + beta_1 z_ij + b_0i + b_1i z_ij)}.   (13.22)

Suppose further that b_i ~ N(0, D), independent of x_i. It is straightforward to appreciate that, under the SS model, the marginal mean is

E(Y_ij | x_i) = INT [exp(beta_0 + beta_1 z_ij + b_0i + b_1i z_ij) / {1 + exp(beta_0 + beta_1 z_ij + b_0i + b_1i z_ij)}] (2 pi)^{-1} |D|^{-1/2} exp(-b_i^T D^{-1} b_i / 2) db_i
!= exp(beta_0 + beta_1 z_ij) / {1 + exp(beta_0 + beta_1 z_ij)}.

SUMMARY: By simply considering the implications for models for the marginal (conditional) mean of a response vector Y_i, it is clear that the SS and PA approaches lead to very different representations of the first two (conditional) moments. Which approach is to be preferred in a given situation usually depends on the application and scientific objectives. Because the PA approach involves writing models for the first two marginal (conditional) moments directly, it turns out that methods for fitting these models and the associated theoretical developments are direct extensions of those for univariate models discussed in Chapters 2-12. Thus, it makes sense to discuss these first, and we do so in the next chapter. Because of the more complicated nature of SS models, for which the implied marginal moments must be obtained via likely intractable integration, methods and theoretical developments for these models are much harder. The most popular strategies for fitting these models rely on approximations to the marginal moments that seek to avoid the integration and allow exploitation of methods for PA models.
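The inequality of the SS and PA marginal means is easy to verify numerically. The following sketch (hypothetical parameter values; a random intercept only, for simplicity, rather than the full bivariate b_i) approximates the SS marginal mean by Gauss-Hermite quadrature and compares it to the PA logistic mean evaluated at beta:

```python
import numpy as np

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

# Hypothetical values for illustration
beta0, beta1, z = -1.0, 0.5, 1.0
sd_b0 = 1.5  # SD of a random intercept b0 ~ N(0, sd_b0^2); random slope omitted

# PA logistic mean evaluated at beta
pa_mean = expit(beta0 + beta1 * z)

# SS marginal mean: integrate expit(beta0 + beta1*z + b0) over N(0, sd_b0^2)
# via Gauss-Hermite quadrature (change of variable b0 = sqrt(2)*sd_b0*x)
x, w = np.polynomial.hermite.hermgauss(50)
ss_mean = np.sum(w * expit(beta0 + beta1 * z + np.sqrt(2) * sd_b0 * x)) / np.sqrt(np.pi)

print(pa_mean, ss_mean)  # the two differ
```

The SS marginal mean is attenuated toward 1/2 relative to expit(beta_0 + beta_1 z): averaging the logistic curve over the random effects flattens it, which is exactly why beta carries different interpretations under the two approaches.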
We thus defer discussion of inference in these models to Chapter 15, until after our discussion of PA models in Chapter 14.

13.6 Models for correlation

Now that we have discussed the two main approaches, we review several models for correlation that are used routinely in regression modeling of multivariate response.

• In subject-specific modeling, the main consideration for modeling correlation is at the individual level, as the population-level sources of correlation are taken into account automatically by the presence of the random effects. Thus, in this setting, modeling correlation pertains to selection of a correlation matrix to represent possible within-individual correlation due to the component e_{P,ij} (e.g., the matrix Gamma_i(alpha) in (13.8)), which, along with a model for the variances of e_{P,ij} and e_{M,ij}, leads to an overall specification for the within-individual covariance matrix R_i(theta, x_i, b_i) at the first stage. As we have noted, in many practical situations where observations on each i are taken relatively far apart in time, information on this correlation may be scant, and it is standard to assume that it is negligible. Thus, modeling of this correlation in the context of SS models is often not even an issue.

• In the population averaged approach, models for correlation are meant to capture the aggregate pattern of association due to both individual- and population-level sources. Thus, in this setting, modeling correlation pertains to selection of a working correlation matrix Gamma_i(alpha, x_i) in (13.17) that is meant to approximate this overall pattern of correlation from both sources. Along with a model for variance, this correlation matrix leads to the marginal second-moment specification var(Y_i | x_i) = V_i(beta, xi, x_i). In this context, it is routine to contemplate working correlation models even when the information from each individual is scant; even if within-individual sources of correlation are negligible, population-level sources are most likely not.

We first introduce several popular correlation models, followed by a brief discussion of diagnostic methods for deducing the suitability of these models in practice. In this section only, we will suppress conditioning on covariates, as the models and considerations we discuss are meant to be applicable both to SS models at Stage 1 and to PA models. We will write Gamma_i throughout this section to denote a correlation matrix of dimension n_i x n_i, suppressing possible dependence on covariates. Some of the models we will discuss will not depend on the times of observation t_ij; others will.

UNSTRUCTURED CORRELATION MODEL: Certainly the most general model is one that makes no assumptions at all about the pattern of association. In particular, the matrix

Gamma_i(alpha) = [ 1          alpha_12     alpha_13   ...  alpha_1,n_i
                   alpha_21   1            alpha_23   ...  alpha_2,n_i
                   ...                                ...
                   alpha_n_i,1  alpha_n_i,2  ...  alpha_n_i,n_i-1  1 ],   (13.23)

where, of course, alpha_jj = 1 and alpha_jj' = alpha_j'j for all j, j', allows the correlation between any pair of observations to be different. Thus, this matrix depends on n_i(n_i - 1)/2 arbitrary correlation parameters. This model is usually referred to as "unstructured," for obvious reasons.

This is not a very parsimonious model and, moreover, does not take into account the way in which the data were collected. For example, in modeling within-individual correlation due to time-ordered data collection in a SS model, it would make sense that correlations between observations far apart in time might be less strong than those close together in time. The model (13.23) does not impose any such restriction but rather allows the correlations to be anything. As a model in the PA setting, (13.23) might be more plausible, as the aggregate of correlation from both sources may well result in a haphazard rather than systematic pattern of association. Even here, however, the issue of parsimony is relevant; it may well be that a simpler model with fewer parameters can do an adequate job as a working model in capturing the predominant features of the overall pattern of correlation. Thus, it is standard in both SS and PA settings to use models that attempt to represent correlation in terms of a small number (maybe one or two) of parameters.

EXCHANGEABLE OR COMPOUND SYMMETRIC MODEL: As discussed in Section 13.2, the exchangeable or compound symmetric model, given by

Gamma_i(alpha) = [ 1      alpha  alpha  ...  alpha
                   alpha  1      alpha  ...  alpha
                   ...                  ...
                   alpha  alpha  ...    alpha  1 ],   (13.24)

is ordinarily not used as a model for within-individual correlation in SS modeling, for the reasons discussed in that section. This model is most often used in the PA setting to represent the aggregate pattern of correlation in the case of clustered data, such as in the developmental toxicology example, where there is no natural ordering to the observations within a response vector. In contrast to the situation discussed in Section 13.2 for a single individual, when we have m response vectors with marginal exchangeable correlation structure, whether we take this into account or not leads to different inferences. This model is certainly parsimonious, as it depends on only a single scalar parameter alpha.

Many of the models that are used both for modeling within-individual correlation in SS models and for modeling overall correlation in PA models have their roots in time series analysis. For within-individual correlation due to time-ordered data collection in a SS model, such correlation models are a natural choice. As we have discussed, in the PA setting it is often the case that correlation due to population-level sources (among-individual variation) dominates that from within-individual sources, but this is not always true. When the data are collected over time and neither source dominates, it is popular to use models that emphasize the time-ordered data collection aspect as working models, as these models are relatively well understood and parsimonious. Here are some popular correlation models from standard time series analysis. As basic time series analysis is predicated on the observations being equally spaced in time, the first two models we discuss are probably appropriate only in situations where the observation times are approximately equidistant.
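Before turning to the time series models, here is a minimal sketch of the exchangeable structure (13.24). The positive-definiteness range noted in the comment is a standard linear algebra fact, not stated in the text above:

```python
import numpy as np

def exchangeable_corr(n, alpha):
    """Compound symmetric / exchangeable correlation, as in (13.24):
    ones on the diagonal, a common alpha off the diagonal.
    Positive definite for -1/(n-1) < alpha < 1."""
    return (1.0 - alpha) * np.eye(n) + alpha * np.ones((n, n))

G = exchangeable_corr(4, 0.3)
print(G)  # a single scalar parameter governs every off-diagonal entry
```

The eigenvalues are 1 + (n - 1) alpha (once) and 1 - alpha (n - 1 times), which is why a mildly negative common correlation is admissible only for small cluster sizes.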
ONE-DEPENDENT MODEL: This model may be thought of as representing the situation where observations close in time may be correlated, but correlation among those farther apart is negligible. For equally spaced data, one could imagine that observations adjacent in time might have nonnegligible correlation, while the correlation between those more than one interval apart might reasonably be thought to have "damped out." This situation is represented by the general one-dependent model

Gamma_i(alpha) = [ 1        alpha_1  0        ...  0
                   alpha_1  1        alpha_2  ...  0
                   0        alpha_2  1        ...  0
                   ...                        ...
                   0        ...      0  alpha_n_i-1  1 ],   (13.25)

where 0 <= alpha_j <= 1 for j = 1, ..., n_i - 1 represent the correlations between adjacent observations at times t_ij and t_i,j+1, and alpha = (alpha_1, ..., alpha_n_i-1)^T. A special case of this matrix is to assume that alpha_j = alpha for all j, resulting in the model

Gamma_i(alpha) = [ 1      alpha  0      ...  0
                   alpha  1      alpha  ...  0
                   0      alpha  1      ...  0
                   ...                  ...
                   0      ...    0      alpha  1 ].   (13.26)

The models in (13.25) and (13.26) may be extended to two- and higher-order dependency in the obvious way.

AUTOREGRESSIVE MODEL OF ORDER 1: The AR(1) model assumes that the correlations among observations farther apart in time decay to zero. For equally spaced measurements, this decay happens according to the number of time intervals separating two observations. In particular, if t_ij and t_i,j+1 are the times at which Y_ij and Y_i,j+1 are observed, then the time interval |t_i,j+1 - t_ij| is a constant for all j. The model is

Gamma_i(alpha) = [ 1            alpha   alpha^2  ...  alpha^{n_i-1}
                   alpha        1       alpha    alpha^2  ...
                   alpha^2      alpha   1        alpha    ...
                   ...                           ...
                   alpha^{n_i-1}  ...   alpha^2  alpha    1 ],   (13.27)

where 0 <= alpha <= 1. Clearly, as observations become farther apart (so that the number u of time intervals separating them increases), alpha^u approaches 0 rather quickly. This model depends only on the scalar correlation parameter alpha, so it may be a parsimonious representation when this approximate pattern of decay is thought to hold.

A correlation structure such as the AR(1) structure (13.27) embodies a certain assumption about the underlying process. In particular, a process for which the correlation (and hence covariance) between any two observations depends only on the distance (e.g., time separation) between them, and not on the actual observation times themselves except through their difference, is called stationary. Stationarity may or may not be a reasonable assumption, but its appeal is obvious: if correlation depends only on distance and not on the actual time points, pairs of observations at different time points all carry information about the entire correlation structure.

A problem with models such as (13.27) is that it is not always the case that observations tend to be equally spaced. In situations where observations are taken over time, whether the responses are equally spaced is generally a function of the application. In many epidemiological studies, observations are taken at regular intervals for convenience; the same is often true in clinical trials, where data are collected longitudinally on participants over time. In contrast, in pharmacokinetics it is traditional, and sensible, to make unequally spaced observations. For drugs that are administered orally, as for the theophylline example in Figure 13.2, it is routine that the absorption phase, where concentration is increasing, is rather quick, while the elimination phase, where concentrations are decaying, is long. It is standard to take observations close together shortly after the dose, in order to have hope of capturing the nature of the absorption pattern, and to take fewer measurements later, when the decay pattern is fairly well determined. A model like (13.27) would not be appropriate for representing possible within-individual correlation in this case.

Generalizations of models like the AR(1) to the case of unequally spaced responses are popular in both SS and PA situations. These models continue to embody the assumption of stationarity, so that the correlation between Y_ij and Y_ij', say, depends only on the distance |t_ij - t_ij'| but not on the particular values t_ij and t_ij'. It is standard to focus on particular choices of the autocorrelation function rho(u), u >= 0, with corr(Y_ij, Y_ij') = rho(|t_ij - t_ij'|), which describes the correlation as a function of distance.
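These stationary structures are straightforward to construct numerically. The sketch below (Python; parameter values hypothetical) builds the AR(1) matrix (13.27) from integer lags and a stationary matrix with entries exp(-alpha |t_j - t_k|), the exponential model discussed next in the notes, and confirms that the two coincide for equally spaced times with unit interval:

```python
import numpy as np

def exp_corr(times, alpha):
    """Stationary exponential working correlation:
    (j, k) entry exp(-alpha * |t_j - t_k|)."""
    t = np.asarray(times, dtype=float)
    return np.exp(-alpha * np.abs(t[:, None] - t[None, :]))

def ar1_corr(n, rho):
    """AR(1) working correlation, as in (13.27): (j, k) entry rho**|j - k|."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

alpha = 0.5
G1 = exp_corr([0, 1, 2, 3], alpha)       # equally spaced, unit interval
G2 = ar1_corr(4, np.exp(-alpha))         # AR(1) with rho = exp(-alpha)
print(np.allclose(G1, G2))  # True
```

With unequally spaced times (e.g., the dense early sampling typical of pharmacokinetics), only `exp_corr` applies; the correlation then depends on the actual time separations rather than on the observation indices.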
EXPONENTIAL CORRELATION MODEL: The exponential correlation model is represented by the autocorrelation function

rho(u) = exp(-alpha u), alpha > 0.   (13.28)

(13.28) yields corr(Y_ij, Y_ij') = exp(-alpha |t_ij - t_ij'|). This results in the correlation matrix Gamma_i with (j, j') element

alpha*^{|t_ij - t_ij'|}, where alpha* = exp(-alpha).   (13.29)

Comparing this matrix to (13.27), we see that, in the case of equally spaced times, (13.29) reduces to the AR(1) correlation structure. Thus, the exponential correlation model is often viewed as a generalization of the AR(1) to unequally spaced observation times.

GAUSSIAN CORRELATION MODEL: An alternative to (13.28) is the so-called Gaussian correlation model

rho(u) = exp(-alpha u^2), alpha > 0,   (13.30)

which yields corr(Y_ij, Y_ij') = exp{-alpha (t_ij - t_ij')^2}. More extensive discussion of these models is given in Chapters 3-5 of Diggle, Heagerty, Liang, and Zeger (2002).

The above account of various models is by no means exhaustive; rather, we have just reviewed some of the more popular, representative models that are used either to model pure within-individual serial correlation in the SS case or are chosen as empirical approximations to the overall correlation in the PA case. In both situations, the hope is that such models may do a reasonable job at capturing the salient features of associations among observations with only a low-dimensional parameter alpha, which usually must be estimated.

IDENTIFYING AN APPROPRIATE CORRELATION STRUCTURE: Just as there are ad hoc graphical and other procedures for aiding the analyst in assessing and modeling nonconstant variance in the univariate case (Chapter 7), there are methods in a similar spirit for evaluating and modeling the pattern of correlation. Two main areas in which understanding and modeling of correlation is a central focus are time series analysis and spatial statistics. In the former, the data are usually in the form of a single series of observations over time; in the latter, the data are typically gathered at different locations in two-dimensional space.
In both of these applications, the number of observations in the time series, or over a region of interest, is generally large. This provides the data analyst with a good deal of information for assessing the nature of correlation among observations separated in time or space. The observations in this context are usually continuous and often assumed to be normally distributed. In contrast, in the typical "repeated measurement" or "clustered data" situations that are the focus here, a relatively small number of observations n_i is available (over time or another condition of measurement) on each of a large number m of units.

When the observations are continuous, many authors have advocated applying procedures from time series and spatial statistics to residuals from a fit of the data for all m units by an appropriate method that assumes independence; here, the residuals from all units would be pooled together under the assumption that the pattern of correlation under investigation is similar across all units.

• Although, under this assumption, one has good information from the population with only a small number of observations per individual, the ability to assess correlation patterns may nonetheless be limited.
• Consequently, it is often the case that analysts do not even attempt to investigate the nature of correlation. Rather, particularly in the PA case, they may assume a "working" model and then account for the possibility that it may be incorrect, as discussed in Chapter 14.

Here we only mention briefly a few techniques that are used in both the SS and PA contexts. A more in-depth discussion may be found in Diggle, Heagerty, Liang, and Zeger (2002).

SS MODEL: Here, interest focuses on the pattern of within-individual correlation. Thus, if one considers the first-stage model

E(Y_ij | x_i, b_i) = f(x_ij, beta_i),

written here deliberately in terms of the individual-specific regression parameter beta_i, along with a candidate variance model var(Y_ij | x_i, b_i) = sigma^2 g^2(beta_i, theta, x_ij), say, one could fit this model separately for each i, using GLS or other methods for univariate data assuming independence within each individual, to obtain individual-specific estimates beta_i-hat, say. Note that this assumes that n_i is large enough for this to be feasible, which may not always be the case. One may then form weighted residuals for each i at each j; i.e.,

w_ij = {Y_ij - f(x_ij, beta_i-hat)} / {sigma-hat g(beta_i-hat, theta-hat, x_ij)}.

If within-individual correlation is negligible, we would expect the w_ij to be of approximately similar magnitude, with no association across j; otherwise, we would expect to see a pattern. When the observations are equally spaced (in time or other set of conditions), at the same time points for all m units (so n_i = n for all i), a simple way to assess possible correlation is to construct the estimated correlation matrix of the w_ij, with (j, j') entry the usual estimate, across all m units, of corr(Y_ij, Y_ij').

If one is further willing to assume stationarity, a more refined version of this is to estimate the autocorrelation function rho(u), where here u is in units of the common time interval. This may be carried out for u = 1, 2, 3, ..., n - 1 as follows: for a given u, calculate the sample correlation among all possible pairs w_ij and w_i,j+u; that is, all residuals separated by a lag of u. This leads to estimates of rho(u) at each u = 1, 2, 3, ..., n - 1. Obviously, as u gets larger, fewer and fewer observations may be used, so that estimates for large lags are probably fairly unreliable. One can plot the estimated autocorrelation function and look for a smooth pattern. As a further diagnostic, it is also commonplace to plot the lagged residuals against one another. For example, one would create the "lag 1" residual plot by plotting w_ij against w_i,j+1 for all j = 1, ..., n - 1, and similarly for larger lags. The degree of association in each plot may be interpreted as giving information on the strength of within-individual correlation at different lags.
When the observations are not equally spaced, but the analyst is still willing to assume stationarity, the autocorrelation function may be replaced by the variogram. The generic definition of the variogram for a stochastic process Z(t) is

gamma(u) = (1/2) E{Z(t) - Z(t - u)}^2, u >= 0.

Under stationarity, the variogram is related to the autocorrelation function by the relationship

gamma(u) = var{Z(t)} {1 - rho(u)}.

In our context, the sample variogram is estimated from the weighted residuals by first computing

v_i,jj' = (1/2)(w_ij - w_ij')^2 and u_i,jj' = t_ij - t_ij'

for all i, j, j'. The (v_i,jj', u_i,jj') pairs for j < j' = 1, ..., n_i, over all i = 1, ..., m, may be plotted and related back to the autocorrelation function.

PA MODEL: For a PA model, interest focuses on the overall, aggregate pattern of correlation. To obtain weighted residuals, one can fit the assumed marginal model for E(Y_ij | x_i) and var(Y_ij | x_i) by assuming that all N = sum_{i=1}^m n_i observations are mutually independent and using GLS or other methods for univariate response. One may then apply the same ideas as for the SS case above to these weighted residuals. In Section 13.8, we illustrate a few of these methods for assessing within-individual correlation pattern in a SS model.

13.7 Distributional considerations

We have noted previously that, depending on the true distribution of the data, some correlation structures that one might contemplate as "working models" may be implausible. We now demonstrate this potential difficulty. In our discussion of independent data, for the most part, distributional considerations were not necessarily central to modeling and fitting methods but were relevant mainly to the properties of estimators. However, there were some instances where distributional considerations were important. Recall that, in our discussion of methods for estimating variance functions for independent data in Chapters 6 and 12, we noted that some methods rely on certain moment assumptions that may be implausible when the data follow certain distributions.
For example, use of the identity transformation of absolute residuals as a basis for estimating variance function parameters required that E(|epsilon_j| | x_j) be constant (eta_0, say) for all j, which may be shown to be impossible for underlying distributions like the Poisson. It turns out that similar restrictions arise for correlation for some distributions. These restrictions may make separate modeling of variance and correlation in the PA approach, as implied by (13.17) (where, for example, the correlation matrix is taken to not depend on the mean model), problematic. To investigate, we consider two special cases in the context of PA modeling.

MULTIVARIATE NORMAL DATA: Suppose that the Y_i given x_i are reasonably assumed to arise from an n_i-variate normal distribution. Analogous to the univariate case, the multivariate normal density is fully characterized by its first two moments, that is, the mean vector and covariance matrix. Thus, in marginal modeling of multivariate normal response vectors, the conditional mean and variance may be thought of separately, and so indeed may be anything. In this situation, postulating a model of the form (13.17) does not violate any underlying restrictions imposed by the distribution. Using obvious shorthand notation, for a PA model of the form E(Y_ij | x_i) = f_ij and var(Y_ij | x_i) = sigma^2 g_ij^2, say, we have

corr(Y_ij, Y_ij' | x_i) = alpha_jj', so that E(Y_ij Y_ij' | x_i) = alpha_jj' sigma^2 g_ij g_ij' + f_ij f_ij'.

It is clear that taking any -1 <= alpha_jj' <= 1 would not violate any characteristic of jointly normally distributed random variables. Thus, for approximately normally distributed data, we may reasonably contemplate models for variance and correlation separately, without concern that the resulting covariance structure violates some distributional requirement.

CORRELATED BINARY DATA: The situation is much different for other types of data. Here, we focus on the situation of binary Y_ij, where the Y_ij take on only the values 0 or 1 and are assumed to be correlated. If we postulate a marginal model, then we are equivalently postulating a model for P(Y_ij = 1 | x_i) = f_ij, say, which, as a probability, is restricted to lie between 0 and 1. We have mentioned the logistic model as a reasonable choice under these conditions. The variance var(Y_ij | x_i), with no overdispersion, is then f_ij(1 - f_ij). Consider the correlation (conditional on x_i) between Y_ij and Y_ij'; this is given by

corr(Y_ij, Y_ij' | x_i) = {E(Y_ij Y_ij' | x_i) - f_ij f_ij'} / {f_ij(1 - f_ij) f_ij'(1 - f_ij')}^{1/2} = alpha_jj',

where E(Y_ij Y_ij' | x_i) = P(Y_ij = 1 and Y_ij' = 1 | x_i). Clearly, P(Y_ij = 1 and Y_ij' = 1 | x_i) <= P(Y_ij = 1 | x_i) = f_ij, and similarly P(Y_ij = 1 and Y_ij' = 1 | x_i) <= f_ij'. Thus, it must be that

E(Y_ij Y_ij' | x_i) <= min(f_ij, f_ij').

Furthermore,

P(Y_ij = 1 and Y_ij' = 1 | x_i) = 1 - P(Y_ij = 0 or Y_ij' = 0 | x_i)
>= 1 - P(Y_ij = 0 | x_i) - P(Y_ij' = 0 | x_i) = 1 - (1 - f_ij) - (1 - f_ij') = f_ij + f_ij' - 1.

Thus, it may be deduced that max(0, f_ij + f_ij' - 1) <= E(Y_ij Y_ij' | x_i), which follows from noting that the events {Y_ij = 1} and {Y_ij' = 1} are either disjoint or not. Combining the above, we have

max(0, f_ij + f_ij' - 1) <= E(Y_ij Y_ij' | x_i) <= min(f_ij, f_ij').

Arbitrarily taking f_ij <= f_ij' without loss of generality, we thus have that

corr(Y_ij, Y_ij' | x_i) <= (f_ij - f_ij f_ij') / {f_ij(1 - f_ij) f_ij'(1 - f_ij')}^{1/2} = {f_ij(1 - f_ij') / f_ij'(1 - f_ij)}^{1/2};

that is, the largest this correlation may be is the square root of the odds ratio. Similarly, the smallest this correlation may be is, when f_ij + f_ij' <= 1,

corr(Y_ij, Y_ij' | x_i) >= (0 - f_ij f_ij') / {f_ij(1 - f_ij) f_ij'(1 - f_ij')}^{1/2} = -{f_ij f_ij' / (1 - f_ij)(1 - f_ij')}^{1/2},

or, when f_ij + f_ij' >= 1,

corr(Y_ij, Y_ij' | x_i) >= (f_ij + f_ij' - 1 - f_ij f_ij') / {f_ij(1 - f_ij) f_ij'(1 - f_ij')}^{1/2} = -{(1 - f_ij)(1 - f_ij') / f_ij f_ij'}^{1/2}.

The result is that the fact that the data are binary imposes natural restrictions on the correlations that are possible between two binary random variables. The correlations must satisfy a constraint that depends on the means in a complicated way. Thus, in contrast to the situation of normal data, correlations cannot be "anything." In particular, here, assuming that the correlations do not depend on the mean may be inappropriate. The working model chosen may allow violations of these restrictions for certain choices of the correlation parameter, rendering the ensuing inferences suspect. A similar phenomenon may be exhibited for other distributions, such as the Poisson. Thus, when postulating a working correlation model in the PA setting, one must appreciate that the model is almost certainly not correct and could in fact represent an implausible correlation structure.
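The bounds derived above are easy to compute. The following sketch (hypothetical means f1 and f2) evaluates the attainable correlation range for a pair of binary responses from the bounds on P(Y_ij = 1, Y_ij' = 1):

```python
import numpy as np

def binary_corr_bounds(f1, f2):
    """Attainable (lower, upper) correlation between two Bernoulli variables
    with means f1 and f2, from the bounds
    max(0, f1 + f2 - 1) <= P(Y1 = 1, Y2 = 1) <= min(f1, f2)."""
    denom = np.sqrt(f1 * (1 - f1) * f2 * (1 - f2))
    lo = (max(0.0, f1 + f2 - 1.0) - f1 * f2) / denom
    hi = (min(f1, f2) - f1 * f2) / denom
    return lo, hi

lo, hi = binary_corr_bounds(0.2, 0.7)
# With unequal means, correlations near +1 or -1 are simply unattainable
print(lo, hi)
```

For f1 = 0.2 and f2 = 0.7, the upper bound equals the square root of the odds ratio, {0.2(0.3)/0.7(0.8)}^{1/2}, roughly 0.33, so a working exchangeable model with alpha = 0.5, say, would already imply an impossible joint distribution for this pair.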
13.8 Example: assessing within-individual correlation

To illustrate some of the diagnostic techniques for assessing correlation discussed in Section 13.6, we consider data from a forest resource study reported by Davidian and Giltinan (1995, Section 11.3). It is of interest in forestry to derive models to predict the volume of the bole (i.e., trunk) of a tree without having to fell the tree; this is useful in assessing the extent of usable wood product available in the tree. The prediction should be easy to obtain, using only measurements that may be made without difficulty on standing trees. Common practice in the forestry literature is to model cumulative bole volume Y up to the point where the tree is of diameter d in terms of tree height H and diameter of the tree at breast height D, where d <= D. Diameter of the tree is reasonably assumed to become smaller as one moves up the bole. For each of m = 39 sweetgum trees, the data available for tree i are:

• D_i, diameter at breast height;
• H_i, height of the tree;
• pairs (d_ij, Y_ij), j = 1, ..., n_i, where the d_ij, j = 1, ..., n_i, are the diameters of the tree measured at 3-foot intervals up the bole, and the Y_ij are the corresponding determinations of cumulative bole volume from breast height up to d_ij, calculated by assuming that the bole has a conical shape.

It is natural to suspect that observations Y_ij on a given tree i that are close together along the bole might be correlated. Figure 13.3 shows Y_ij plotted against z_ij = D_i - d_ij for four of the m = 39 trees and suggests that the relationship of volume to difference in diameter within tree i may be reasonably represented by the logistic model

E(Y_ij | z_ij, beta_i) = f(z_ij, beta_i) = beta_1i / [1 + exp{-beta_3i(log z_ij - beta_2i)}].   (13.31)

It turned out that variance var(Y_ij | z_ij, beta_i) within a tree is reasonably assumed to be constant and similar for all trees.
study, the goal was to investigate the relationship of the parameters β_i, describing shape and apparent degree of tapering of the bole, to tree height H_i and other factors. Here, we only focus on within-tree correlation.

[Figure 13.3: Cumulative bole volume plotted against the difference z_{ij} = D_i − d_{ij} for four sweetgum trees, with individual OLS fits of (13.31) superimposed. Recoverable panel titles: Tree 10 (D = 26.7, H = 107.4); Tree 1 (D = 12.7, H = 89.5); Tree 24 (D = 16.8, H = 92.2). Each panel plots diameter difference (horizontal axis) against cumulative bole volume in cubic feet (vertical axis).]

Even though the values z_{ij} are in fact different for each tree, because the bole volume responses were taken at equally spaced 3-foot intervals along the bole starting at breast height, it is not unreasonable to treat the observations, starting at breast height (z = 0), as equally spaced as one moves up the bole.

From the individual OLS fits of (13.31), assuming within-tree independence, for all m = 39 trees, one may compute weighted residuals (which are in fact unweighted here). Figure 13.4 shows these residuals plotted against each other at lags of u = 1, 2, 3, 4, representing observations separated by 3, 6, 9, and 12 feet along the bole, respectively. The visual impression is that volume observations taken 3 feet apart may be mildly positively correlated, but observations farther apart along the bole appear less so. As described in Section 13.6, the autocorrelation function ρ(u) may be estimated by the sample correlation of the residuals in each panel.

[Figure 13.4: Lagged residuals from individual OLS fits of the sweetgum tree data; the four panels plot residuals against residuals at lags u = 1, 2, 3, 4.]

The values obtained are

    u       1       2       3        4
    ρ̂(u)   0.305   0.108   −0.152   −0.192

It is somewhat hard to interpret these values, as no standard
errors are available. Taking them in conjunction with the plot, a reasonable conclusion is that there is some evidence of within-tree correlation, but it is not strong.

Davidian and Giltinan (1995, Section 11.3) fit both a one-dependent (13.25) and an AR(1) (13.27) correlation structure, assuming in each case that the scalar parameter α is the same across all trees. These authors experienced difficulty in obtaining a stable fit of the AR(1) model; for the one-dependent model, they obtained an estimate of α̂ = 0.25 using estimating-equation-type methods, which is close to the sample autocorrelation function value. These authors found that taking this correlation into account in this example had virtually no effect on further inferences of interest, so they concluded that the correlation may be sufficiently low as to be negligible for the purposes of making inference.
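The restrictions on binary correlations derived earlier in this chapter are easy to explore numerically. The following sketch (our own illustration in Python/NumPy, not part of the original notes; the function name `binary_corr_bounds` is an assumption) computes the attainable correlation range for a pair of binary responses with given means, directly from the inequality max(0, f_1 + f_2 − 1) ≤ E(Y_1 Y_2) ≤ min(f_1, f_2).

```python
import numpy as np

def binary_corr_bounds(f1, f2):
    """Attainable correlation range for two Bernoulli variables with
    means f1 and f2 (0 < f1, f2 < 1), using the bound
    max(0, f1 + f2 - 1) <= E(Y1*Y2) <= min(f1, f2) derived in the text."""
    denom = np.sqrt(f1 * (1 - f1) * f2 * (1 - f2))
    upper = (min(f1, f2) - f1 * f2) / denom
    lower = (max(0.0, f1 + f2 - 1.0) - f1 * f2) / denom
    return lower, upper

# For means 0.2 and 0.6, the upper bound equals the square root of the
# odds ratio, {f1(1 - f2) / [f2(1 - f1)]}^{1/2}, as in the derivation.
lo, up = binary_corr_bounds(0.2, 0.6)
```

For these means the correlation is confined to roughly (−0.61, 0.41); a working structure implying a common correlation of, say, 0.5 for such a pair would already violate the restriction, illustrating how a working model can represent an implausible correlation structure.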
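As a concrete sketch of the diagnostic and the working structures discussed in this section (again our own illustration in Python/NumPy; names such as `lag_corr` are assumptions, not from the source), ρ(u) may be estimated by pooling lagged residual pairs across individuals, and one-dependent and AR(1) working correlation matrices may be constructed as follows.

```python
import numpy as np

def lag_corr(residuals, u):
    """Estimate rho(u) as the sample correlation of pooled pairs
    (e_{ij}, e_{i,j+u}) across individuals; u >= 1.  `residuals` is a
    list with one (equally spaced) residual vector per individual."""
    x, y = [], []
    for e in residuals:
        e = np.asarray(e, dtype=float)
        if e.size > u:
            x.append(e[:-u])   # residuals at positions j
            y.append(e[u:])    # residuals at positions j + u
    x, y = np.concatenate(x), np.concatenate(y)
    return np.corrcoef(x, y)[0, 1]

def one_dependent_corr(n, alpha):
    """One-dependent working correlation: alpha on the first off-diagonals."""
    G = np.eye(n)
    i = np.arange(n - 1)
    G[i, i + 1] = G[i + 1, i] = alpha
    return G

def ar1_corr(n, alpha):
    """AR(1) working correlation: (j, k) entry equal to alpha^{|j-k|}."""
    j = np.arange(n)
    return alpha ** np.abs(j[:, None] - j[None, :])
```

For example, with the one-dependent estimate α̂ = 0.25 reported for the sweetgum data, `one_dependent_corr(n_i, 0.25)` would give the fitted working correlation matrix for a tree with n_i volume determinations.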