Applied Longitudinal Data Analysis
This 37-page set of class notes was uploaded by Jordane Kemmer on Thursday, October 15, 2015. The notes belong to ST 732 at North Carolina State University, taught by Marie Davidian in Fall. Since upload, they have received 14 views. For similar materials, see /class/223938/st-732-north-carolina-state-university in Statistics at North Carolina State University.
Date Created: 10/15/15
CHAPTER 4    ST 732, M. DAVIDIAN

4 Introduction to modeling longitudinal data

We are now in a position to introduce a basic statistical model for longitudinal data. The models and methods we discuss in subsequent chapters may be viewed as modifications of this model to incorporate specific assumptions on sources of variation and the form of mean vectors.

We restrict our discussion here to the case of balanced data; i.e., where all units have repeated measurements at the same $n$ time points. Later, we will extend our thinking to handle the case of unbalanced data.

4.1 Basic Statistical Model

Recall that the longitudinal (or more general repeated measurement) data situation involves observation of the same response repeatedly over time (or some other condition) for each of a number of units (individuals).

• In the simplest case, the units may be a random sample from a single population.

• More generally, the units may arise from different populations. Units may be randomly assigned to different treatments, or units may be of different types (e.g., male and female).

• In some cases, additional information on individual unit characteristics, like age and weight, may be recorded.

We first introduce a fundamental model for balanced longitudinal data for a single sample from a common population and then discuss how it may be adapted to incorporate these more general situations.

MOST BASIC MODEL FOR BALANCED DATA: Suppose the response of interest is measured on each individual at $n$ times $t_1 < t_2 < \cdots < t_n$. The dental study ($n = 4$; $(t_1, \ldots, t_4) = (8, 10, 12, 14)$) and the guinea pig diet data ($n = 6$; $(t_1, \ldots, t_6) = (1, 3, 4, 5, 6, 7)$) are balanced data sets (with units coming from more than one population).

Consider the case where all the units are from a single population first. Corresponding to each $t_j$, $j = 1, \ldots, n$, there is a random variable $Y_j$, $j = 1, \ldots, n$, with a probability distribution that summarizes the way in which responses at time $t_j$ among all units in the population take on their possible values. As we discuss in detail shortly, values of
the response at any time $t_j$ may vary due to the effects of relevant sources of variation.

We may think of the generic random vector

$$Y = (Y_1, Y_2, \ldots, Y_n)', \tag{4.1}$$

where the variables are arranged in increasing time order.

• $Y$ in (4.1) has a multivariate probability distribution summarizing the way in which all responses at times $t_1, \ldots, t_n$ among all units in the population take on their possible values jointly.

• This probability distribution has mean vector $E(Y) = \mu$ with elements $\mu_j = E(Y_j)$, $j = 1, \ldots, n$, and covariance matrix $\mathrm{var}(Y) = \Sigma$.

CONVENTION: Except when we discuss "classical" methods in the next two chapters, we will use $i$ as the subscript indexing units and $j$ as the subscript indexing responses in time order within units. We will also use $m$ to denote the total number of units (across groups where relevant). E.g., for the dental study and guinea pig diet data, $m = 27$ and $m = 15$, respectively.

Thus, in thinking about a random sample of units from a single population of interest, just as we do for scalar response, we may think of $m$ $(n \times 1)$ random vectors

$$Y_1, Y_2, \ldots, Y_m,$$

corresponding to each of $m$ individuals, each of which has features (e.g., multivariate probability distribution) identical to $Y$ in (4.1). For the $i$th such vector,

$$Y_i = (Y_{i1}, \ldots, Y_{in})',$$

such that $E(Y_i) = \mu$, $\mathrm{var}(Y_i) = \Sigma$.

• It is natural to be concerned that components $Y_{ij}$, $j = 1, \ldots, n$, are correlated.

• In particular, this may be due to the simple fact that observations on the same unit may tend to be "more alike" than those compared across different units; e.g., a guinea pig with "low" weight at any given time relative to other pigs will likely be "low" relative to other pigs at any other time.

• Alternatively, correlation may be due to biological "fluctuations" within a unit, as in the pine seedling example of the last chapter.

We will discuss these sources of variation for longitudinal data shortly. For now, it is realistic to expect that $\mathrm{cov}(Y_{ij}, Y_{ik}) \neq 0$ for any $j \neq k = 1, \ldots, n$ in general, so that $\Sigma$ is unlikely to be a diagonal matrix.

INDEPENDENCE ACROSS UNITS: On the other hand, if each $Y_i$ corresponds to a different individual, and individuals are not related in any way (e.g., different children or guinea pigs, treated and handled separately), then it seems reasonable to suppose that the way any observation may turn out at any time for unit $i$ is unrelated to the way any observation may turn out for another unit $\ell \neq i$; that is, observations from different vectors are independent.

• Under this view, the random vectors $Y_1, Y_2, \ldots, Y_m$ are all mutually independent.

• It follows that if $Y_{ij}$ is a response from unit $i$ and $Y_{\ell k}$ is a response from unit $\ell$, then $\mathrm{cov}(Y_{ij}, Y_{\ell k}) = 0$ even if $j = k$ (same time point but different units).

BASIC STATISTICAL MODEL: Putting all this together, we have $m$ mutually independent random vectors $Y_i$, $i = 1, \ldots, m$, with $E(Y_i) = \mu$ and $\mathrm{var}(Y_i) = \Sigma$.

• We may write this model equivalently, similarly to the univariate case; specifically,

$$Y_i = \mu + \epsilon_i, \quad E(\epsilon_i) = 0, \quad \mathrm{var}(\epsilon_i) = \Sigma, \tag{4.2}$$

where the $\epsilon_i$, $i = 1, \ldots, m$, are mutually independent.

• The $\epsilon_i$ are random vector deviations such that $\epsilon_i = (\epsilon_{i1}, \ldots, \epsilon_{in})'$, where each $\epsilon_{ij}$, $j = 1, \ldots, n$, with $E(\epsilon_{ij}) = 0$, represents how $Y_{ij}$ deviates from its mean $\mu_j$ due to aggregate effects of sources of variation.

• In addition, the $\epsilon_{ij}$ are correlated, but the $\epsilon_i$ are mutually independent across $i$.

Questions of scientific interest are characterized as questions about the elements of $\mu$, as will be formalized in later chapters.

MULTIVARIATE NORMALITY: If the response is continuous, it may be reasonable to assume that the $Y_{ij}$ and $\epsilon_{ij}$ are normally distributed. In this case, adding the further assumption that $\epsilon_i \sim N_n(0, \Sigma)$, (4.2) implies

$$Y_i \sim N_n(\mu, \Sigma), \quad i = 1, \ldots, m,$$

where the $Y_i$ are mutually independent.

EXTENSION TO MORE THAN ONE POPULATION: Suppose that individuals may be thought of as sampled randomly from $q$ different populations; e.g., $q = 2$ (males and females) in the dental study.

• We may again think of $Y_i$, $m$ independent random vectors, where, if $Y_i$ corresponds to a unit from group $\ell$, $\ell = 1, \ldots, q$, then $Y_i$ has a multivariate probability distribution with

$$E(Y_i) = \mu_\ell, \quad \mathrm{var}(Y_i) = \Sigma_\ell.$$

That is, each population may have a different mean vector and covariance matrix.

• Equivalently, we may express this as $Y_i = \mu_\ell + \epsilon_i$, $E(\epsilon_i) = 0$, $\mathrm{var}(\epsilon_i) = \Sigma_\ell$ for $i$ from
group $\ell = 1, \ldots, q$.

• We might also assume $\epsilon_i \sim N_n(0, \Sigma_\ell)$ for units in group $\ell$, so that $Y_i \sim N_n(\mu_\ell, \Sigma_\ell)$ for $i$ from group $\ell$.

• If, furthermore, it is reasonable to assume that all sources of variation act similarly in each population, we might assume that $\Sigma_\ell = \Sigma$, a common covariance matrix for all populations. With univariate responses, it is often reasonable to assume that population membership may imply a change in mean response but not affect the nature of variation; e.g., the primary effect of a treatment may be to shift responses on average relative to those for another, but to leave variability unchanged. This reduces to the assumption of equal variances. For the longitudinal case, such an assumption may also be reasonable, but is more involved, as assuming the same variation in all groups must take into account both variance and covariation.

• Under this assumption, the model becomes

$$Y_i = \mu_\ell + \epsilon_i, \quad E(\epsilon_i) = 0, \quad \mathrm{var}(\epsilon_i) = \Sigma$$

for $i$ from group $\ell = 1, \ldots, q$, for a covariance matrix $\Sigma$ common to all groups.

• Note that, even though $\Sigma$ is common to all populations, the diagonal elements of $\Sigma$ may be different across $j = 1, \ldots, n$, so that variance may be different at different times; however, at any given time, the variance is the same for all groups.

• Similarly, the covariances in $\Sigma$ between the $j$th and $k$th elements of $Y_i$ may be different for different choices of $j$ and $k$, but for any particular pair $(j, k)$, the covariance is the same for all groups.

EXTENSION TO INDIVIDUAL INFORMATION: We may extend this thinking to take into account other individual covariate information besides population membership by analogy to regression models for univariate response.

• E.g., suppose age $a_i$ at the first time point is recorded for each unit $i = 1, \ldots, m$. We may envision for each age $a_i$ a multivariate probability distribution describing the possible values of $Y_i$. The mean vector of this distribution would naturally depend on $a_i$. We write this for now as $E(Y_i) = \mu_i$, where $\mu_i$ is the mean of random vectors from the population corresponding to age $a_i$, and the subscript $i$
implies that the mean is unique to $i$ in the sense that it depends on $a_i$ somehow.

• Assuming that variation is similar regardless of age, we may write

$$Y_i = \mu_i + \epsilon_i, \quad E(\epsilon_i) = 0, \quad \mathrm{var}(\epsilon_i) = \Sigma.$$

We defer discussion of how dependence of $\mu_i$ on $a_i$ and other factors might be characterized to later chapters.

All of the foregoing models represent random vectors $Y_i$ in terms of a mean vector plus a random deviation vector $\epsilon_i$ that captures the aggregate effect of all sources of variation. This emphasizes the two key aspects of modeling longitudinal data:

(1) Characterizing mean vectors in these models in a way that best captures how mean response changes with time and depends on other factors, such as group or age, in order to address questions of scientific interest.

(2) Taking into account important sources of variation by characterizing the nature of the random deviations $\epsilon_i$, so that these questions may be addressed by taking faithful account of all variation in the data.

Models we discuss in subsequent chapters may be viewed as particular cases of this representation, where (1) and (2) are approached differently.

We first take up the issue in (2), that of the sources of variation that $\epsilon_i$ may reflect.

4.2 Sources of variation in longitudinal data

For longitudinal data, potential sources of variation usually are thought of as being of two main types:

• Among-unit variation

• Within-unit variation

It is useful to conceptualize the way in which longitudinal response vectors may be thought to arise. There are different perspectives on this; here, we consider one popular approach. For simplicity, consider the case of a single population and the model

$$Y_i = \mu + \epsilon_i.$$

The ideas are relevant more generally.

Figure 1 provides a convenient backdrop for thinking about the sources that might make up $\epsilon_i$.

• Panel (a) shows the values actually observed for $m = 3$ units; these values include the effects of all sources of variation.

• Panel (b) is a conceptual representation of possible underlying features of the situation. The open circles
on the thick solid line represent the elements $\mu_j$ of $\mu$ at each of the $n = 9$ time points. E.g., the leftmost circle represents the mean $\mu_1$ of all possible values that could be observed at $t_1$, thus averaging all deviations $\epsilon_{i1}$ due to all among- and within-unit sources over all units $i$. The means over time lie on a straight line, but this need not be true in general. The solid diamonds represent the actual observations for each individual. If we focus on the first time point, for example, it is clear that the observations for each $i$ vary about $\mu_1$.

Figure 1: (a) Hypothetical longitudinal data from $m = 3$ units at $n = 9$ time points. (b) Conceptual representation of sources of variation. The open circles connected by the thick solid line represent the means $\mu_j$, $j = 1, \ldots, n$, for the populations of all possible observations at each of the $n$ time points. The thin solid lines represent "trends" for each unit. The dotted lines represent the pattern of error-free responses for the unit over time, which fluctuate about the trend. The diamonds represent the observations of these responses, which are subject to measurement error.
[Panels (a) and (b): response versus time.]

• For each individual, we may envision a "trend," depicted by the solid lines (the trend need not follow a straight line in general). The trend places the unit in the population. The vertical position of this trend at any time point dictates whether the individual is "high" or "low" relative to the corresponding mean in $\mu$. Thus, these trends highlight (biological) variation among units. Some units may be consistently high or low; others may be high at some times and low at others relative to the mean.

• The dotted lines represent "fluctuations" about the smoother (straight-line) trend, representing variation in how responses for that individual may evolve. In the pine seedling example cited earlier, with response height of a growing plant over time, although the overall pattern of growth may "track" a smooth trend, natural variation in the growth process may
cause the responses to fluctuate about the trend. This phenomenon necessarily occurs within units; (biological) fluctuations about the trend are the result of processes taking place only within that unit. Note that values on the dotted line that are very close in time tend to be larger or smaller than the trend together, while those farther apart seem just as likely to be larger or smaller than the trend, with no relationship.

• Finally, the observations for a unit (diamonds) do not lie exactly on the dotted lines, but vary about them. This is due to measurement error. Again, such errors take place within the unit itself, in the sense that the measuring process occurs at the specific-unit level.

We may formalize this thinking by refining how we view the basic model $Y_i = \mu + \epsilon_i$. The $j$th element of $Y_i$, $Y_{ij}$, may be thought of as being the sum of several components, each corresponding to a different source of variation; i.e.,

$$Y_{ij} = \mu_j + \epsilon_{ij} = \mu_j + b_{ij} + e_{ij} = \mu_j + b_{ij} + e_{1ij} + e_{2ij}, \tag{4.3}$$

where $E(b_{ij}) = 0$, $E(e_{1ij}) = 0$, and $E(e_{2ij}) = 0$.

• $b_{ij}$ is a deviation representing among-unit variation at time $t_j$ due to the fact that unit $i$ "sits" somewhere in the population relative to $\mu_j$ due to biological variation. We may think of $b_{ij}$ as dictating the "inherent trend" for $i$ at $t_j$.

• $e_{1ij}$ represents the additional deviation due to within-unit fluctuations about the trend.

• $e_{2ij}$ is the deviation due to measurement error (within units).

• The sum $e_{ij} = e_{1ij} + e_{2ij}$ denotes the aggregate deviation due to all within-unit sources.

• The sum $\epsilon_{ij} = b_{ij} + e_{1ij} + e_{2ij}$ thus represents the aggregate deviation from $\mu_j$ due to all sources.

Stacking the $\epsilon_{ij}$, $b_{ij}$, and $e_{ij}$, we may write

$$\epsilon_i = b_i + e_i = b_i + e_{1i} + e_{2i},$$

which emphasizes that $\epsilon_i$ includes components due to among- and within-unit sources of variation.

SOURCES OF CORRELATION: This representation provides a framework for thinking about assumptions on among- and within-unit variation and how correlation among the $Y_{ij}$ (equivalently, among the $\epsilon_{ij}$) may be thought to arise.

• The $b_{ij}$ determines
the inherent trend in the sense that $\mu_j + b_{ij}$ represents the position of the inherent trajectory for unit $i$ at time $j$. The $Y_{ij}$ thus all tend to be in the vicinity of this trend across time ($j$) for unit $i$. As can be seen from Figure 1, this makes the observations on $i$ "more alike" relative to observations from other units. Accordingly, we expect that the elements of $\epsilon_i$ (and hence those of $Y_i$) are correlated due to the fact that they share this common, underlying trend. We may refer to correlation arising in this way as correlation due to among-unit sources. In subsequent chapters, we will see that different longitudinal data models may make specific assumptions about terms like $b_{ij}$ that represent among-unit variation, and hence about this source of correlation.

• Because the $e_{1ij}$ are deviations due to the fluctuation process, it is natural to think that the $e_{1ij}$ might be correlated across $j$. If the process is "high" relative to the inherent trend at time $t_j$ (so $e_{1ij}$ is positive), it might be expected to be "high" at times $t_{j'}$ close to $t_j$ ($e_{1ij'}$ positive) as well. Thus, we might expect the elements of $\epsilon_i$, and thus $Y_i$, to be correlated as a consequence of such fluctuations (because the elements of $e_{1i}$ are correlated). We may refer to correlation arising in this way as correlation due to within-unit sources. Note that if the fluctuations occur in a very short time span relative to the spacing of the $t_j$, whether the process is high at $t_j$ may have little or no relation to whether it is high at adjacent times. In this case, we might believe such within-unit correlation is negligible. As we will see, this is a common assumption, often justified by noting that the $t_j$ are far apart in time.

• The overall pattern of correlation for $\epsilon_i$ (and hence $Y_i$) may be thought of as resulting from the combined effects of these two sources (among and within units).

• As measuring devices tend to commit "haphazard" errors every time they are used, it may be reasonable to assume that the $e_{2ij}$ are independent across $j$. Thus, we expect no contribution to
the overall pattern of correlation.

To complete the thinking, we must also consider the variances of the $b_{ij}$, $e_{1ij}$, and $e_{2ij}$. We defer discussion of this to later chapters in the context of specific models.

4.3 Exploring mean and covariance structure

The aggregate effect of all sources of variation, such as those identified in the conceptual scheme of Section 4.2, dictates the form of the covariance matrix of $\epsilon_i$ and hence that of $Y_i$. As was emphasized earlier in our discussion of weighted least squares, if observations are correlated and have possibly different variances, it is important to acknowledge this in estimating parameters of interest such as population means, so that differences in data quality and associations are taken into adequate account. Thus, an accurate representation of $\mathrm{var}(\epsilon_i)$ is critically important.

A first step in an analysis is often to examine the data for clues about the likely nature of the form of this covariance matrix as well as the structure of the means and how they change over time.

Consider first the model for a single population,

$$Y_i = \mu + \epsilon_i, \quad E(\epsilon_i) = 0, \quad \mathrm{var}(\epsilon_i) = \Sigma.$$

Based on observed data, we would like to gain insight into the likely forms of $\mu$ and $\Sigma$.

• We illustrate with the data for the 11 girls in the dental study, so for now take $m = 11$ and $n = 4$.

• Thus, the elements $\mu_j$, $j = 1, \ldots, 4$, of $\mu$ are the population mean distances for girls at ages 8, 10, 12, and 14; the diagonal elements of $\Sigma$ are the population variances of distance at each age; and the off-diagonal elements of $\Sigma$ represent the covariances among distances at different ages.

Spaghetti plots for both the boys and girls are given in Figure 2.

Figure 2: Spaghetti plots of the dental data. The open circles represent the sample mean distance at each age; these are connected by the thick line to highlight the relationship among means over time.
[Panels: Girls (left) and Boys (right); horizontal axis: age.]

SAMPLE MEAN VECTOR: As we have discussed, the natural estimator for the mean $\mu_j$ at the $j$th time point is the
sample mean

$$\bar{Y}_{\cdot j} = m^{-1} \sum_{i=1}^{m} Y_{ij},$$

where the "dot" subscript indicates averaging over the first index $i$ (i.e., across units). The sample mean may be calculated for each time point $j = 1, \ldots, n$, suggesting that the obvious estimator for $\mu$ is the vector whose elements are the $\bar{Y}_{\cdot j}$, the sample mean vector given by

$$\bar{Y} = m^{-1} \sum_{i=1}^{m} Y_i = (\bar{Y}_{\cdot 1}, \ldots, \bar{Y}_{\cdot n})'.$$

• It is straightforward to show that the random vector $\bar{Y}$ is an unbiased estimator for $\mu$; i.e., $E(\bar{Y}) = \mu$.

We may apply this estimator to the dental study data on girls to obtain the estimate (rounded to three decimal places)

$$\bar{y} = (21.182, 22.227, 23.091, 24.091)'.$$

In the left panel of Figure 2, these values are plotted for each age by the open circles.

• The thick solid line, which connects the $\bar{Y}_{\cdot j}$, gives a visual impression of a "smooth," indeed straight-line, relationship over time among the $\mu_j$.

• Of course, we have no data at ages intermediate to those in the study, so it is possible that mean distance in the intervals between these times deviates from a straight-line relationship. However, from a biological point of view, it seems sensible to suppose that dental distance would increase steadily over time, at least on average, rather than "jumping" around.

Graphical inspection of sample mean vectors is an important tool for understanding possible relationships among means over time. When there are $q > 1$ groups, an obvious strategy is to carry this out separately for the data from each group, so that possible differences in means can be evaluated. For the dental data on the 16 boys, the estimated mean turns out to be $\bar{y} = (22.875, 23.813, 25.719, 27.469)'$; this is shown as the thick solid line with open circles in the right panel of Figure 2. This estimate also seems to look like a "straight line," but with steepness possibly different from that for girls.

SAMPLE COVARIANCE MATRIX: Gaining insight into the form of $\Sigma$ may be carried out both graphically and through an unbiased estimator for $\Sigma$ and its associated correlation matrix $\Gamma$.

• The diagonal elements of $\Sigma$ are simply the variances $\sigma_j^2$ of the
distributions of values at each time $j = 1, \ldots, n$. Thus, based on $m$ units, the natural estimator for $\sigma_j^2$ is the sample variance at time $j$,

$$S_j^2 = (m - 1)^{-1} \sum_{i=1}^{m} (Y_{ij} - \bar{Y}_{\cdot j})^2,$$

which may be shown to be an unbiased estimator for $\sigma_j^2$.

• The off-diagonal elements of $\Sigma$ are the covariances $\sigma_{jk} = E\{(Y_j - \mu_j)(Y_k - \mu_k)\}$. Thus, a natural estimator for $\sigma_{jk}$ is

$$S_{jk} = (m - 1)^{-1} \sum_{i=1}^{m} (Y_{ij} - \bar{Y}_{\cdot j})(Y_{ik} - \bar{Y}_{\cdot k}),$$

which may also be shown to be unbiased.

The obvious estimator for $\Sigma$ is thus the matrix in which the variances $\sigma_j^2$ and covariances $\sigma_{jk}$ are replaced by $S_j^2$ and $S_{jk}$. It is possible to represent this matrix succinctly (verify) as

$$\hat{\Sigma} = (m - 1)^{-1} \sum_{i=1}^{m} (Y_i - \bar{Y})(Y_i - \bar{Y})'.$$

This is known as the sample covariance matrix. The sum $\sum_{i=1}^{m} (Y_i - \bar{Y})(Y_i - \bar{Y})'$ is often called the sum of squares and cross-products (SS&CP) matrix, as its entries are the sums of squared deviations and cross-products of deviations from the sample mean.

The sample covariance matrix is exactly as we would expect; recall that the covariance matrix itself is defined as

$$\Sigma = E\{(Y - \mu)(Y - \mu)'\}.$$

The sample covariance matrix may be used to estimate the covariance matrix. However, although the diagonal elements may provide information on the true variances at each time point, the off-diagonal elements may be difficult to interpret. Given the unitless nature of correlation, it may be more informative to learn about associations from estimates of correlation.

SAMPLE CORRELATION MATRIX: If $\hat{\Sigma}$ is an estimator for a covariance matrix $\Sigma$ with elements $\hat{\Sigma}_{jk}$, $j, k = 1, \ldots, n$, then the natural estimator for the associated correlation matrix $\Gamma$ is $\hat{\Gamma}$, the $(n \times n)$ matrix with ones on the diagonal (as required for a correlation matrix) and $(j, k)$ off-diagonal element

$$\frac{\hat{\Sigma}_{jk}}{\sqrt{\hat{\Sigma}_{jj} \hat{\Sigma}_{kk}}}.$$

For a single population, where $\hat{\Sigma}$ is the sample covariance matrix, the off-diagonal elements are

$$\frac{S_{jk}}{S_j S_k}, \tag{4.4}$$

which are obvious estimators for the correlations.

• In this case, the estimated matrix $\hat{\Gamma}$ is called the sample correlation matrix, as it is an estimate of the correlation matrix corresponding to the sample covariance matrix for the single population.
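
As a minimal sketch of the estimators just described, the following pure-Python code computes the sample mean vector, the sample covariance matrix as $(m-1)^{-1}$ times the SS&CP matrix, and the corresponding sample correlation matrix. The function names and the small data set ($m = 4$ units, $n = 3$ time points) are hypothetical and chosen only for illustration; they are not taken from the dental study.

```python
from math import sqrt


def sample_mean_vector(Y):
    """Average each time point (column) over the m units (rows)."""
    m, n = len(Y), len(Y[0])
    return [sum(row[j] for row in Y) / m for j in range(n)]


def sample_covariance(Y):
    """Sample covariance matrix: the SS&CP matrix divided by (m - 1)."""
    m, n = len(Y), len(Y[0])
    ybar = sample_mean_vector(Y)
    # Deviations of each unit's responses from the sample mean vector.
    dev = [[row[j] - ybar[j] for j in range(n)] for row in Y]
    return [[sum(d[j] * d[k] for d in dev) / (m - 1) for k in range(n)]
            for j in range(n)]


def sample_correlation(S):
    """Scale each covariance by the product of the two standard deviations."""
    n = len(S)
    return [[S[j][k] / sqrt(S[j][j] * S[k][k]) for k in range(n)]
            for j in range(n)]


# Hypothetical balanced data: rows are units, columns are time points.
Y = [[1.0, 2.0, 4.0],
     [2.0, 4.0, 5.0],
     [3.0, 6.0, 9.0],
     [4.0, 8.0, 6.0]]

ybar = sample_mean_vector(Y)   # estimate of the mean vector
S = sample_covariance(Y)       # estimate of the covariance matrix
R = sample_correlation(S)      # estimate of the correlation matrix
```

In this made-up data set the second column is exactly twice the first, so the estimated correlation between the first two time points is one, while the diagonal of `R` is one by construction. The divisor $m - 1$ rather than $m$ matches the unbiased estimators in the text.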
• The expression in (4.4) is known as the sample correlation coefficient between the observations at times $t_j$ and $t_k$, as it estimates the correlation coefficient $\rho_{jk}$.

Shortly, we shall see how to estimate common covariance and correlation matrices based on data from several populations.

For the 11 girls in the dental study, we obtain the estimated covariance and correlation matrices (rounded to three decimal places)

$$\hat{\Sigma}_G = \begin{pmatrix} 4.514 & 3.355 & 4.332 & 4.357 \\ 3.355 & 3.618 & 4.027 & 4.077 \\ 4.332 & 4.027 & 5.591 & 5.466 \\ 4.357 & 4.077 & 5.466 & 5.941 \end{pmatrix}, \quad \hat{\Gamma}_G = \begin{pmatrix} 1.000 & 0.830 & 0.862 & 0.841 \\ 0.830 & 1.000 & 0.895 & 0.879 \\ 0.862 & 0.895 & 1.000 & 0.948 \\ 0.841 & 0.879 & 0.948 & 1.000 \end{pmatrix}.$$

• The diagonal elements of $\hat{\Sigma}_G$ suggest that the aggregate variance in dental distances roughly increases over time from age 8 to 14. However, keep in mind that the values shown are estimates of the corresponding parameters based on only $m = 11$ observations; thus, they are subject to the usual uncertainty of estimation. It is thus sensible not to over-interpret the numbers but rather to examine them only for suggestive features.

• The off-diagonal elements of $\hat{\Gamma}_G$ represent the aggregate pattern of correlation due to among- and within-girl sources. Here, the estimate of this correlation for any pair of time points is positive and close to one, suggesting that "high" values at one time are strongly associated with "high" values at another time, regardless of how far apart in time the observations occur. In light of Figure 2, this is really not surprising. The data for individual girls in the figure show pronounced trends that for the most part place a girl's trajectory above or below the estimated mean profile (thick line). Thus, a girl such as the topmost one is "high" throughout time, suggesting a strong component of among-girl variation in the population, and the estimates of correlation are likely reflecting this.

• Again, it is not prudent to attach importance to the numbers and differences among them, as they are estimates from a rather small sample, so the observed difference between 0.948 and
0.830 may or may not reflect a real difference in the true correlations.

SCATTERPLOT MATRICES: A useful supplement to numerical estimates is a graphical display of the observed data known as a scatterplot matrix. As correlation reflects associations among observations at different time points, initially, one would think that a natural way of graphically assessing these associations would be to make the following plot.

• For each pair of times $t_j$ and $t_k$, graph the observed data values $(y_{ij}, y_{ik})$ for all $i = 1, \ldots, m$ units, with $y_{ij}$ values on the horizontal axis and $y_{ik}$ values on the vertical axis. The observed pattern might be suggestive of the nature of association among responses at times $t_j$ and $t_k$.

• This is not exactly correct; in particular, if the means $\mu_j$ and $\mu_k$ and variances $\sigma_j^2$ and $\sigma_k^2$ are not the same, the patterns in the pairwise plots will in part be a consequence of this. It would make better sense to plot the "centered" and "scaled" versions of these; i.e., plot the pairs

$$\left( \frac{y_{ij} - \mu_j}{\sigma_j}, \; \frac{y_{ik} - \mu_k}{\sigma_k} \right).$$

• Given that we do not know the $\mu_j$ or $\sigma_j$, a natural strategy is to replace these by estimates and instead plot the pairs

$$\left( \frac{y_{ij} - \bar{y}_{\cdot j}}{s_j}, \; \frac{y_{ik} - \bar{y}_{\cdot k}}{s_k} \right).$$

Following this reasoning, it is common to make these plots for all pairs $(j, k)$, where $j \neq k$.

Figure 3 shows the scatterplot matrix for the girls in the dental study.

Figure 3: Scatterplot matrix for the girls in the dental study.
[Panels labeled Age 8, Age 10, Age 12, Age 14.]

In each panel, the apparent association among centered and scaled distance observations appears strong. The fact that the trend is from lower left to upper right in each panel, so that large centered and scaled values at one time correspond to large ones at another time, indicates that the association is positive for each pair of time points. Moreover, the nature of the association seems fairly similar regardless of the separation in time; i.e., the pattern of the plot corresponding to ages 8 and 14 shows a similar qualitative trend to those corresponding
to ages 8 and 10, ages 8 and 12, and so on. The evidence in the plots coincides with the numerical summary provided by the sample correlation matrix, which suggests that correlation is of similar magnitude and direction for any pair of times.

Some remarks:

• Visual display offers the data analyst another perspective on the likely pattern of aggregate correlation in the data, in addition to that provided by the estimated correlation matrix. This information, taken with that on variance in the sample covariance matrix, can help the analyst to identify whether the pattern of variation has systematic features. If such systematic features are identified, it may be possible to adopt a model for $\mathrm{var}(\epsilon_i)$ that embodies them, allowing an accurate characterization. We take up this issue shortly.

• The same principles may be applied in more complicated settings; e.g., with more than one group. Here, one could estimate the covariance matrix $\Sigma_\ell$ and associated correlation matrix $\Gamma_\ell$, say, for each group $\ell$ separately and construct a separate scatterplot matrix.

• In the case of $q > 1$ groups, a natural objective would be to assess whether in fact it is reasonable to assume that the covariance matrix is the same for all groups.

POOLED SAMPLE COVARIANCE AND CORRELATION MATRICES: To illustrate this last point, consider the data for boys in the dental study. It may be shown that the sample covariance and correlation matrices are

$$\hat{\Sigma}_B = \begin{pmatrix} 6.017 & 2.292 & 3.629 & 1.613 \\ 2.292 & 4.563 & 2.194 & 2.810 \\ 3.629 & 2.194 & 7.032 & 3.241 \\ 1.613 & 2.810 & 3.241 & 4.349 \end{pmatrix}, \quad \hat{\Gamma}_B = \begin{pmatrix} 1.000 & 0.437 & 0.558 & 0.315 \\ 0.437 & 1.000 & 0.387 & 0.631 \\ 0.558 & 0.387 & 1.000 & 0.586 \\ 0.315 & 0.631 & 0.586 & 1.000 \end{pmatrix}.$$

• Comparing to $\hat{\Sigma}_G$ for girls, aggregate variance does not seem to increase over time and seems larger than that for girls at all but the last time. These estimates are based on small samples (11 and 16 units), so should be interpreted with care.

• Comparing to $\hat{\Gamma}_G$ for girls suggests that correlation for boys, although positive, is of smaller magnitude. Moreover, the estimated correlations for boys
tend to "jump around" more than those for girls.

Figure 4 shows the scatterplot matrix for boys.

Figure 4: Scatterplot matrix for the boys in the dental study.
[Panels labeled Age 8, Age 10, Age 12, Age 14.]

Comparing this figure to that for girls in Figure 3 reveals that the trend in each panel seems less pronounced for boys, although it is still positive in every case.

Overall, there seems to be informal evidence that both the mean and pattern of variance and correlation in the populations of girls and boys may be different. We will study longitudinal data models that allow such features to be taken into account.

Although this seems to be the case here, in many situations the evidence may not be strong enough to suggest a difference in variation across groups, or scientific considerations may dictate that an assumption of a common pattern of overall variation is reasonable. Under these conditions, it is natural to combine the information on variation across groups in order to examine the features of the assumed common structure. Since ordinarily interest focuses on whether the $\mu_\ell$ are the same, as we will see, such an assessment continues to assume that the $\mu_\ell$ may be different.

The assumed common covariance matrix $\Sigma$ and its corresponding correlation matrix $\Gamma$ from data for $q$ groups may be estimated as follows. Assume that there are $r_\ell$ units from the $\ell$th population, so that $m$, the total number of units, is such that $m = r_1 + \cdots + r_q$.

• As we continue to believe the $\mu_\ell$ are different, estimate these by the sample means $\bar{Y}_\ell$, say, for each group. Let $\hat{\Sigma}_\ell$ denote the sample covariance matrix calculated for each group separately (based on $\bar{Y}_\ell$).

• A natural strategy, if we believe that there is a common covariance matrix $\Sigma$, is then to use as an estimator for $\Sigma$ a weighted average of the $\hat{\Sigma}_\ell$, $\ell = 1, \ldots, q$, that takes into account the differing amount of information from each group:

$$\hat{\Sigma} = (m - q)^{-1} \{ (r_1 - 1)\hat{\Sigma}_1 + \cdots + (r_q - 1)\hat{\Sigma}_q \}.$$

This matrix is referred to as the pooled sample
covariance matrix. If the number of units from each group is the same, so that $r_\ell \equiv r$, say, then $\hat{\Sigma}$ reduces to a simple average; i.e., $\hat{\Sigma} = (1/q) \sum_{\ell=1}^{q} \hat{\Sigma}_\ell$.

• The quantity in braces is often called the Error SS&CP matrix, as we will see later.

• The pooled sample correlation matrix estimating the assumed common correlation matrix $\Gamma$ is naturally defined as the estimated correlation matrix corresponding to $\hat{\Sigma}$.

From the definition, the diagonal elements of the pooled sample covariance matrix are weighted averages of the sample variances from each group. That is, if $S_j^{2(\ell)}$ is the sample variance of the observations from group $\ell$ at time $j$, then the $(j, j)$ element of $\hat{\Sigma}$, $\hat{\Sigma}_{jj}$, say, is equal to

$$\hat{\Sigma}_{jj} = (m - q)^{-1} \{ (r_1 - 1) S_j^{2(1)} + \cdots + (r_q - 1) S_j^{2(q)} \},$$

the so-called pooled sample variance at time $t_j$.

If the analyst is willing to adopt the assumption of a common covariance matrix for all groups, then inspection of the pooled estimate may be carried out as in the case of a single population. Similarly, a pooled scatterplot matrix would be based on centered and scaled versions of the $y_{ij}$, where the "centering" continues to be based on the sample means for each group, but the "scaling" is based on the common estimate of variance for $y_{ij}$ from $\hat{\Sigma}$. In particular, one would plot the observed pairs

$$\left( \frac{y_{ij} - \bar{y}^{(\ell)}_{\cdot j}}{\sqrt{\hat{\Sigma}_{jj}}}, \; \frac{y_{ik} - \bar{y}^{(\ell)}_{\cdot k}}{\sqrt{\hat{\Sigma}_{kk}}} \right)$$

for all units $i = 1, \ldots, m$ from all groups $\ell = 1, \ldots, q$ on the same graph for each pair of times $t_j$ and $t_k$.

DENTAL STUDY: Although we are not convinced that it is appropriate to assume a common covariance matrix for boys and girls in the dental study, for illustration we calculate the pooled sample covariance and correlation matrices to obtain

$$\hat{\Sigma} = (1/25)(10\hat{\Sigma}_G + 15\hat{\Sigma}_B) = \begin{pmatrix} 5.415 & 2.717 & 3.910 & 2.710 \\ 2.717 & 4.185 & 2.927 & 3.317 \\ 3.910 & 2.927 & 6.456 & 4.131 \\ 2.710 & 3.317 & 4.131 & 4.986 \end{pmatrix},$$

$$\hat{\Gamma} = \begin{pmatrix} 1.000 & 0.571 & 0.661 & 0.522 \\ 0.571 & 1.000 & 0.563 & 0.726 \\ 0.661 & 0.563 & 1.000 & 0.728 \\ 0.522 & 0.726 & 0.728 & 1.000 \end{pmatrix}.$$

• Inspection of the diagonal elements shows that the pooled estimates seem to be a "compromise" between the two group-specific estimates. This in fact illustrates how the pooled estimates combine
information across groups.

• For brevity, we do not display the combined scatterplot matrix for these data. Not surprisingly, the pattern is somewhere in between those exhibited in Figures 3 and 4.

We have assumed throughout that we have balanced data. When the data are not balanced, either because some individuals are missing observations at intended times or because the times are different for different units, application of the above methods can be misleading. Later in the course, we consider methods for unbalanced data.

4.4 Popular models for covariance structure

As we have noted previously, if estimated covariance and correlation matrices show systematic features, the analyst may be led to consider models for covariance and associated correlation matrices. We will see later in the course that common models and associated methods for longitudinal data either explicitly or implicitly involve adopting particular models for $\mbox{var}(\epsilon_i)$.

In anticipation of this, we introduce here some popular covariance models that embody different systematic patterns often seen with longitudinal data. Each covariance model has a corresponding correlation model. We consider these models for balanced data only; modification for unbalanced data is discussed later.

UNSTRUCTURED COVARIANCE MODEL: In some situations, there may be no evidence of an apparent systematic pattern of variance and correlation. In this case, the covariance matrix is said to follow the unstructured model. The unstructured covariance model was adopted in the discussion of the last section as an initial assumption to allow assessment of whether a model with more structure could be substituted.

The unstructured covariance matrix allows $n$ different variances, one for each time point, and $n(n-1)/2$ distinct off-diagonal elements representing the possibly different covariances for each pair of times, for a total of
\[
n + \frac{n(n-1)}{2} = \frac{n(n+1)}{2}
\]
variances and covariances. Because a covariance matrix is symmetric, the off-diagonal elements at positions $(j,k)$ and $(k,j)$ are the same, so we need only count each covariance once in totaling up the number of variances and covariances involved.

Thus, if the unstructured model is assumed, there are numerous parameters describing variation that must be estimated, particularly if $n$ is large. E.g., if $n = 5$, which does not seem that large, there are $5(6)/2 = 15$ parameters involved. If there are $q$ different groups, each with a different covariance matrix, there will be $q$ times this many variances and covariances.

If the pattern of covariance does show a systematic structure, then not acknowledging this by maintaining the unstructured assumption involves estimation of many more parameters than might otherwise be necessary, thus making inefficient use of the available data. We now consider models that represent the covariance matrix in terms of far fewer parameters.

As we will see in the following, it is sometimes easier to discuss the correlation model first and then discuss the covariance matrix models to which it may correspond.

COMPOUND SYMMETRIC COVARIANCE MODELS: For both the boys and girls in the dental study, the correlation between observations at any times $t_j$ and $t_k$ seemed similar, although the variances at different times might be different.

These considerations suggest a covariance model that imposes equal correlation between all time points but allows variance to differ at each time, as follows. Suppose that $\rho$ is a parameter representing the common correlation for any two time points. For illustration, suppose that $n = 5$. Then the correlation matrix is
\[
\Gamma = \begin{pmatrix}
1 & \rho & \rho & \rho & \rho \\
\rho & 1 & \rho & \rho & \rho \\
\rho & \rho & 1 & \rho & \rho \\
\rho & \rho & \rho & 1 & \rho \\
\rho & \rho & \rho & \rho & 1
\end{pmatrix};
\]
the same structure generalizes to any $n$. Here, $-1 < \rho < 1$. This is often referred to as the compound symmetric or exchangeable correlation model, where the latter term emphasizes that the correlation is the same even if we "exchange" two time points for two others.

Two popular covariance models with this correlation matrix are as follows.

• If $\sigma_j^2$ and $\sigma_k^2$ are the overall variances at
$t_j$ and $t_k$ (possibly different at different times), and $\sigma_{jk}$ is the corresponding covariance, then it must be that
\[
\rho = \frac{\sigma_{jk}}{\sigma_j \sigma_k} \quad \mbox{or} \quad \sigma_{jk} = \rho\, \sigma_j \sigma_k.
\]
We thus have a covariance matrix of the form, in the case $n = 5$,
\[
\Sigma = \begin{pmatrix}
\sigma_1^2 & \rho\sigma_1\sigma_2 & \rho\sigma_1\sigma_3 & \rho\sigma_1\sigma_4 & \rho\sigma_1\sigma_5 \\
\rho\sigma_1\sigma_2 & \sigma_2^2 & \rho\sigma_2\sigma_3 & \rho\sigma_2\sigma_4 & \rho\sigma_2\sigma_5 \\
\rho\sigma_1\sigma_3 & \rho\sigma_2\sigma_3 & \sigma_3^2 & \rho\sigma_3\sigma_4 & \rho\sigma_3\sigma_5 \\
\rho\sigma_1\sigma_4 & \rho\sigma_2\sigma_4 & \rho\sigma_3\sigma_4 & \sigma_4^2 & \rho\sigma_4\sigma_5 \\
\rho\sigma_1\sigma_5 & \rho\sigma_2\sigma_5 & \rho\sigma_3\sigma_5 & \rho\sigma_4\sigma_5 & \sigma_5^2
\end{pmatrix},
\]
which of course generalizes to any $n$. This covariance matrix is often said to have a heterogeneous compound symmetric structure: compound symmetric because it has corresponding correlation as above, and heterogeneous because it incorporates the assumption of different, or heterogeneous, variances at each time point. Note that this model may be described with $n + 1$ parameters, the correlation $\rho$ and the $n$ variances.

• In some settings, the evidence may suggest that the overall variance at each time point is the same, so that $\sigma_j^2 = \sigma^2$ for some common value $\sigma^2$ for all $j = 1, \ldots, n$. Under this condition, $\rho = \sigma_{jk}/\sigma^2$, so that $\sigma_{jk} = \sigma^2 \rho$ for all $j \neq k$. Under these conditions, the covariance matrix is, in the case $n = 5$,
\[
\Sigma = \begin{pmatrix}
\sigma^2 & \rho\sigma^2 & \rho\sigma^2 & \rho\sigma^2 & \rho\sigma^2 \\
\rho\sigma^2 & \sigma^2 & \rho\sigma^2 & \rho\sigma^2 & \rho\sigma^2 \\
\rho\sigma^2 & \rho\sigma^2 & \sigma^2 & \rho\sigma^2 & \rho\sigma^2 \\
\rho\sigma^2 & \rho\sigma^2 & \rho\sigma^2 & \sigma^2 & \rho\sigma^2 \\
\rho\sigma^2 & \rho\sigma^2 & \rho\sigma^2 & \rho\sigma^2 & \sigma^2
\end{pmatrix} = \sigma^2 \Gamma.
\]
This covariance matrix for any $n$ is said to have the compound symmetric or exchangeable structure with no qualification. This model involves only two parameters, $\sigma^2$ and $\rho$, for any $n$.

Remarks:

• From the diagnostic calculations and plots for the dental study data, the heterogeneous compound symmetric covariance model seems like a plausible model for each of the boys and girls, although the values of $\rho$ and the variances at each time may be potentially different in each group.

• The unstructured and compound symmetric models do not emphasize the fact that observations are collected over time; neither has "built-in" features that really only make sense when the $n$ observations are in a particular order. Recall the two sources of correlation that contribute to the overall pattern: that arising from among-unit sources (e.g., units
being "high" or "low") and those due to within-unit sources (e.g., fluctuations about a smooth trend and measurement error). The compound symmetric models seem to emphasize the among-unit component.

The models we now discuss instead may be thought of as emphasizing the within-unit component through structures that are plausible when correlation depends on the times of observation in some way. As fluctuations determine this source of correlation, these models may be thought of as assuming that the variation attributable to these fluctuations dominates that from other sources (among units or measurement error). These models have roots in the literature on time series analysis.

ONE-DEPENDENT: Correlation due to within-unit fluctuation would be expected to be stronger the closer observations are taken in time on a particular unit, as observations close in time would be "more alike" than those far apart. Thus, we expect correlation due to within-unit sources to be largest in magnitude among responses that are adjacent in time, that is, are at consecutive observation times, and to become less pronounced as observations become farther apart. Relative to this magnitude of correlation, that between two nonconsecutive observations might be, for all practical purposes, negligible. A correlation matrix that reflects this, shown for $n = 5$, is
\[
\Gamma = \begin{pmatrix}
1 & \rho & 0 & 0 & 0 \\
\rho & 1 & \rho & 0 & 0 \\
0 & \rho & 1 & \rho & 0 \\
0 & 0 & \rho & 1 & \rho \\
0 & 0 & 0 & \rho & 1
\end{pmatrix}.
\]
Here, the correlation is the same, equal to $\rho$, $-1 < \rho < 1$, for any two consecutive observations. This model is referred to as the one-dependent correlation structure, as dependence is nonnegligible only for adjacent responses. Alternatively, such a matrix is also referred to as a banded Toeplitz matrix.

The one-dependent correlation model seems to make the most sense if observation times are equally spaced (separated by the same time interval).

If the overall variances $\sigma_j^2$, $j = 1, \ldots, n$, are possibly different at each time $t_j$, the corresponding covariance matrix ($n = 5$) looks like
\[
\Sigma = \begin{pmatrix}
\sigma_1^2 & \rho\sigma_1\sigma_2 & 0 & 0 & 0 \\
\rho\sigma_1\sigma_2 & \sigma_2^2 & \rho\sigma_2\sigma_3 & 0 & 0 \\
0 & \rho\sigma_2\sigma_3 & \sigma_3^2 & \rho\sigma_3\sigma_4 & 0 \\
0 & 0 & \rho\sigma_3\sigma_4 & \sigma_4^2 & \rho\sigma_4\sigma_5 \\
0 & 0 & 0 & \rho\sigma_4\sigma_5 & \sigma_5^2
\end{pmatrix}
\]
and is called a heterogeneous one-dependent or banded Toeplitz matrix, for obvious reasons. Of course, this structure may be generalized to any $n$.

If overall variance at each time point is the same, so that $\sigma_j^2 = \sigma^2$ for all $j$, then this becomes
\[
\Sigma = \sigma^2 \begin{pmatrix}
1 & \rho & 0 & 0 & 0 \\
\rho & 1 & \rho & 0 & 0 \\
0 & \rho & 1 & \rho & 0 \\
0 & 0 & \rho & 1 & \rho \\
0 & 0 & 0 & \rho & 1
\end{pmatrix},
\]
which is usually called a one-dependent or banded Toeplitz matrix without qualification.

It is possible to extend this structure to a two-dependent or higher model. For example, two-dependence implies that observations one or two intervals apart in time are correlated, but those farther apart are not.

The one-dependent correlation model implies that correlation "falls off" as observations become farther apart in time in a rather dramatic way, so that only consecutive observations are correlated. Alternatively, it may be the case that correlation "falls off" more gradually.

AUTOREGRESSIVE STRUCTURE OF ORDER 1: Again, this model makes sense when the observation times are equally spaced. The autoregressive, or AR(1), correlation model formalizes the idea that the magnitude of correlation among observations "decays" as they become farther apart. In particular, for $n = 5$, the AR(1) correlation matrix has the form
\[
\Gamma = \begin{pmatrix}
1 & \rho & \rho^2 & \rho^3 & \rho^4 \\
\rho & 1 & \rho & \rho^2 & \rho^3 \\
\rho^2 & \rho & 1 & \rho & \rho^2 \\
\rho^3 & \rho^2 & \rho & 1 & \rho \\
\rho^4 & \rho^3 & \rho^2 & \rho & 1
\end{pmatrix},
\]
where $-1 < \rho < 1$.

• As $\rho$ is less than 1 in magnitude, as we take it to higher powers, the result is values closer and closer to zero. Thus, as the number of time intervals between pairs of observations increases, the correlation decreases toward zero.

• With equally spaced data, the time interval between $t_j$ and $t_{j+1}$ is the same for all $j$; i.e., $|t_j - t_{j+1}| = d$ for $j = 1, \ldots, n-1$, where $d$ is the length of the interval. Note then that the power of $\rho$ corresponds to the number of intervals by which a pair of observations is separated.

As with the compound symmetric and one-dependent models, both heterogeneous and "standard" covariance matrices with corresponding AR(1) correlation matrix are
possible. In the case of overall variances $\sigma_j^2$ that may differ across $j$, the heterogeneous covariance matrix in the case $n = 5$ has the form
\[
\Sigma = \begin{pmatrix}
\sigma_1^2 & \rho\sigma_1\sigma_2 & \rho^2\sigma_1\sigma_3 & \rho^3\sigma_1\sigma_4 & \rho^4\sigma_1\sigma_5 \\
\rho\sigma_1\sigma_2 & \sigma_2^2 & \rho\sigma_2\sigma_3 & \rho^2\sigma_2\sigma_4 & \rho^3\sigma_2\sigma_5 \\
\rho^2\sigma_1\sigma_3 & \rho\sigma_2\sigma_3 & \sigma_3^2 & \rho\sigma_3\sigma_4 & \rho^2\sigma_3\sigma_5 \\
\rho^3\sigma_1\sigma_4 & \rho^2\sigma_2\sigma_4 & \rho\sigma_3\sigma_4 & \sigma_4^2 & \rho\sigma_4\sigma_5 \\
\rho^4\sigma_1\sigma_5 & \rho^3\sigma_2\sigma_5 & \rho^2\sigma_3\sigma_5 & \rho\sigma_4\sigma_5 & \sigma_5^2
\end{pmatrix}.
\]
When the variance is assumed equal to the same value $\sigma^2$ for all $j = 1, \ldots, n$, the covariance matrix has the form ($n = 5$)
\[
\Sigma = \begin{pmatrix}
\sigma^2 & \rho\sigma^2 & \rho^2\sigma^2 & \rho^3\sigma^2 & \rho^4\sigma^2 \\
\rho\sigma^2 & \sigma^2 & \rho\sigma^2 & \rho^2\sigma^2 & \rho^3\sigma^2 \\
\rho^2\sigma^2 & \rho\sigma^2 & \sigma^2 & \rho\sigma^2 & \rho^2\sigma^2 \\
\rho^3\sigma^2 & \rho^2\sigma^2 & \rho\sigma^2 & \sigma^2 & \rho\sigma^2 \\
\rho^4\sigma^2 & \rho^3\sigma^2 & \rho^2\sigma^2 & \rho\sigma^2 & \sigma^2
\end{pmatrix} = \sigma^2 \Gamma.
\]

The one-dependent and AR(1) models really only seem sensible when the observation times are spaced at equal intervals, as in the dental study data. This is not always the case; for instance, for longitudinal data collected in clinical trials comparing treatments for disease, it is routine to collect responses frequently at the beginning of therapy but then to take them at wider intervals later. The following offers a generalization of the AR(1) model to allow the possibility of unequally spaced times.

MARKOV STRUCTURE: Suppose that the observation times $t_1, \ldots, t_n$ are not necessarily equally spaced, and let
\[
d_{jk} = |t_j - t_k|
\]
be the length of time between times $t_j$ and $t_k$ for all $j, k = 1, \ldots, n$. Then the Markov correlation model has the form, shown here for $n = 5$,
\[
\Gamma = \begin{pmatrix}
1 & \rho^{d_{12}} & \rho^{d_{13}} & \rho^{d_{14}} & \rho^{d_{15}} \\
\rho^{d_{12}} & 1 & \rho^{d_{23}} & \rho^{d_{24}} & \rho^{d_{25}} \\
\rho^{d_{13}} & \rho^{d_{23}} & 1 & \rho^{d_{34}} & \rho^{d_{35}} \\
\rho^{d_{14}} & \rho^{d_{24}} & \rho^{d_{34}} & 1 & \rho^{d_{45}} \\
\rho^{d_{15}} & \rho^{d_{25}} & \rho^{d_{35}} & \rho^{d_{45}} & 1
\end{pmatrix}.
\]

• Here, we must have $\rho \geq 0$ (why?).

• Comparing this to the AR(1) structure, the powers of $\rho$, and thus the degree of decay of correlation, are also related to the length of the time interval between observations. Here, however, because the time intervals $d_{jk}$ are of unequal length, the powers are the actual lengths.

Corresponding covariance matrices are defined similarly to those in the one-dependent and AR(1) cases. E.g., under the assumption of common variance $\sigma^2$, we have ($n = 5$)
\[
\Sigma = \begin{pmatrix}
\sigma^2 & \sigma^2\rho^{d_{12}} & \sigma^2\rho^{d_{13}} & \sigma^2\rho^{d_{14}} & \sigma^2\rho^{d_{15}} \\
\sigma^2\rho^{d_{12}} & \sigma^2 & \sigma^2\rho^{d_{23}} & \sigma^2\rho^{d_{24}} & \sigma^2\rho^{d_{25}} \\
\sigma^2\rho^{d_{13}} & \sigma^2\rho^{d_{23}} & \sigma^2 & \sigma^2\rho^{d_{34}} & \sigma^2\rho^{d_{35}} \\
\sigma^2\rho^{d_{14}} & \sigma^2\rho^{d_{24}} & \sigma^2\rho^{d_{34}} & \sigma^2 & \sigma^2\rho^{d_{45}} \\
\sigma^2\rho^{d_{15}} & \sigma^2\rho^{d_{25}} & \sigma^2\rho^{d_{35}} & \sigma^2\rho^{d_{45}} & \sigma^2
\end{pmatrix}.
\]
This model has two parameters, $\sigma^2$ and $\rho$, for any $n$.

These are not the only such models available, but they give a flavor of the types of considerations involved. The documentation for the SAS procedure proc mixed, the use of which we will demonstrate in subsequent chapters, offers a rich catalog of possible covariance models.

If one believes that one of the foregoing models or some other model provides a realistic representation of the pattern of variation and covariation in the data, then intuition suggests that a "better" estimate of $\mbox{var}(\epsilon_i)$ could be obtained by exploiting this information. We will see this in action shortly. We will also see that these models may be used not only to represent $\mbox{var}(\epsilon_i)$ but also to represent covariance matrices of components of $\epsilon_i$ corresponding to among- and within-unit variation.

4.5 Diagnostic calculations under stationarity

The one-dependent, AR(1), and Markov structures are popular models when it is thought that the predominant source of correlation leading to the aggregate pattern is from within-individual sources. All of these models are such that the correlation between $Y_{ij}$ and $Y_{ik}$ for any $j \neq k$ depends only on the time interval $|t_j - t_k|$ and not on the specific times $t_j$ or $t_k$ themselves. This property is known as stationarity.

• If stationarity is thought to hold, the analyst may wish to investigate which correlation structure (e.g., one-dependent, AR(1), or other model for equally spaced data) might be the best model.

• Variance at each $t_j$ may be assessed by examining the sample covariance matrix.

• If one believes in stationarity, an investigation of correlation that takes this into account may offer more refined information than one that does not, as we now demonstrate.

The rationale is as follows:

• When the $t_j$, $j = 1, \ldots, n$, are equally spaced with time interval $d$, under stationarity, all pairs of observations corresponding to times whose subscripts differ by 1, e.g., $j$ and $j+1$, are $d$ time units apart
and are correlated in an identical fashion.

• Similarly, all pairs with subscripts differing by 2, e.g., $j$ and $j+2$, are $2d$ time units apart and correlated in the same way. In general, pairs with subscripts $j$ and $j+u$ are $ud$ time units apart and share the same correlation.

• The value of subscripts for $n$ time points must range between 1 and $n$. Thus, when we write $j$ and $j+u$, it is understood that the values of $j$ and $u$ are chosen so that all possible distinct pairs of unequal subscripts in this range are represented. E.g., if $j = 1$, then $u$ may take on the values $1, \ldots, n-1$ to give all pairs corresponding to time $t_1$ and all other times $t_2, \ldots, t_n$. If $j = 2$, then $u$ may take on values $1, \ldots, n-2$, and so on. If $j = n-1$, then $u = 1$ gives the pair corresponding to times $t_{n-1}, t_n$.

• For example, under the AR(1) model, for a particular $u$, pairs at times $t_j$ and $t_{j+u}$ satisfy
\[
\mbox{corr}(Y_{ij}, Y_{i,j+u}) = \rho^u,
\]
suggesting that the correlation between observations $u$ time intervals apart may be assessed using information from all such pairs.

AUTOCORRELATION FUNCTION: The autocorrelation function is just the correlation corresponding to pairs of observations $u$ time intervals apart, thought of as a function of the number of intervals. That is, for all $j = 1, \ldots, n-1$ and appropriate $u$,
\[
\rho(u) = \mbox{corr}(Y_{ij}, Y_{i,j+u}).
\]

• This depends only on $u$ and is the same for all $j$ because of stationarity.

• The value of $\rho(0)$ is taken to be equal to one, as with $u = 0$, $\rho(0)$ is just the correlation between an observation and itself.

• The value $u$ is often called the lag. The total number of possible lags is $n - 1$ for $n$ time points.

• The autocorrelation function describes how correlation changes as the time between observations increases, i.e., as $u$ increases. As expected, the value of $\rho(u)$ tends to decrease in magnitude as $u$ increases, reflecting the usual situation in which within-unit correlation "falls off" as observations become more separated in time.

In practice, we may estimate the autocorrelation function if we are willing to assume that stationarity holds.
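As a point of reference when inspecting such an estimate, the autocorrelation functions implied by some of the models of Section 4.4 can be tabulated for equally spaced times. The following is a minimal Python sketch, not part of the course's SAS programs; the function name `model_acf` and the model labels are illustrative assumptions.

```python
def model_acf(model, rho, n):
    """Return [rho(1), ..., rho(n-1)] implied by a correlation model.

    Assumes n equally spaced time points and -1 < rho < 1.
    `model` is one of the illustrative labels handled below.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must satisfy -1 < rho < 1")
    lags = range(1, n)
    if model == "compound symmetric":
        # same correlation at every lag
        return [rho for _ in lags]
    if model == "one-dependent":
        # nonzero correlation only for adjacent observations
        return [rho if u == 1 else 0.0 for u in lags]
    if model == "AR(1)":
        # correlation decays as the power of the number of intervals
        return [rho ** u for u in lags]
    raise ValueError("unknown model: " + model)


if __name__ == "__main__":
    for name in ("compound symmetric", "one-dependent", "AR(1)"):
        print(name, model_acf(name, 0.5, 5))
```

A sample autocorrelation function that stays roughly flat across lags resembles the compound symmetric pattern, one that is negligible beyond lag 1 resembles the one-dependent pattern, and one that declines geometrically resembles AR(1).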
Inspection of the estimate can help the analyst decide which model might be appropriate; e.g., if correlation falls off gradually with lag, it may suggest that an AR(1) model is appropriate.

For data from a single population, it is natural to base estimation of $\rho(u)$ for each $u = 1, \ldots, n-1$ on all pairs of observations $(y_{ij}, y_{i,j+u})$ across all individuals $i = 1, \ldots, m$ and relevant choices of $j$.

• Care must be taken to ensure that the fact that responses have different means and overall variances at each $t_j$ is taken into account, as with scatterplot matrices.

• Thus, we consider centered and scaled observations. In particular, $\rho(u)$ for a particular lag $u$ may be estimated by calculating the sample correlation coefficient, treating all pairs of the form
\[
\left( \frac{Y_{ij} - \bar{Y}_j}{S_j}, \; \frac{Y_{i,j+u} - \bar{Y}_{j+u}}{S_{j+u}} \right)
\]
as if they were observations on two random variables from a sample of $m$ individuals, where each individual contributes more than one pair.

The resulting estimator, as a function of $u$, is called the sample autocorrelation function, which we denote as $\hat{\rho}(u)$. It may be calculated and plotted against $u$ to provide the analyst with both numerical and visual information on the nature of correlation if the stationarity assumption is plausible.

We illustrate using the data from girls in the dental study. Here, the time interval is of length $d = 2$ years, and $n = 4$, so $u$ can take on values $1, \ldots, n-1 = 3$.

• When $u = 1$, each girl has three pairs of values separated by $d$ units (i.e., one time interval): the values at $(t_1, t_2)$, $(t_2, t_3)$, and $(t_3, t_4)$. Thus, there is a total of 33 possible pairs from all 11 girls.

• When $u = 2$, there are two pairs per girl, at $(t_1, t_3)$ and $(t_2, t_4)$, or 22 total pairs.

• When $u = 3$, each girl contributes a single pair at $(t_1, t_4)$, 11 pairs in total.

Thus, the calculation of $\hat{\rho}(u)$ is carried out by calculating the sample correlation coefficient from 33, 22, and 11 observations for $u = 1, 2$, and 3, respectively, and yields
\[
\hat{\rho}(u) = (0.891, \; 0.871, \; 0.841).
\]
Because each successive estimate is based on a smaller number of pairs, the estimates are not of equal quality and should be interpreted with care.
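The calculation just described can be sketched in a few lines of Python (this is not the SAS approach used in Section 4.6). For illustration only, the data below are the first five girls as printed in the transformed-data listing of Section 4.6, so the resulting values will not equal the estimates obtained above from all 11 girls. The function names are assumptions of this sketch.

```python
from math import sqrt

# First five girls (ages 8, 10, 12, 14), as printed in the Section 4.6 listing.
girls = [
    [21.0, 20.0, 21.5, 23.0],
    [21.0, 21.5, 24.0, 25.5],
    [20.5, 24.0, 24.5, 26.0],
    [23.5, 24.5, 25.0, 26.5],
    [21.5, 23.0, 22.5, 23.5],
]


def standardize(data):
    """Center and scale each time point by its own sample mean and SD."""
    m, n = len(data), len(data[0])
    means = [sum(row[j] for row in data) / m for j in range(n)]
    sds = [sqrt(sum((row[j] - means[j]) ** 2 for row in data) / (m - 1))
           for j in range(n)]
    return [[(row[j] - means[j]) / sds[j] for j in range(n)] for row in data]


def lag_pairs(z, u):
    """All pairs (z_ij, z_i,j+u) over individuals i and valid subscripts j."""
    n = len(z[0])
    return [(row[j], row[j + u]) for row in z for j in range(n - u)]


def corr(pairs):
    """Sample correlation coefficient of a list of (x, y) pairs."""
    k = len(pairs)
    mx = sum(x for x, _ in pairs) / k
    my = sum(y for _, y in pairs) / k
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)


def sample_acf(data):
    """Sample autocorrelation function: lag u -> estimate, u = 1, ..., n-1."""
    z = standardize(data)
    return {u: corr(lag_pairs(z, u)) for u in range(1, len(data[0]))}


if __name__ == "__main__":
    for u, r in sample_acf(girls).items():
        print("lag", u, "estimate", round(r, 3))
```

With five individuals and four time points, lags 1, 2, and 3 use 15, 10, and 5 pairs, mirroring the 33, 22, and 11 pairs available from all 11 girls.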
The estimates suggest that, if we are willing to believe stationarity, as observations become farther apart in time ($u$ increasing), correlation seems to stay fairly constant. This agrees with the evidence from the calculation of the sample covariance matrix and the scatterplot matrix in Figure 3.

Figure 5 shows a plot of the sample autocorrelation function, displaying the same information graphically.

[Figure 5: Sample autocorrelation function for data from girls in the dental study. Vertical axis: autocorrelation function; horizontal axis: number of intervals.]

An alternative way of displaying information on correlation under the assumption of stationarity is to plot the pairs for each choice of lag $u$. From above, there are 33 pairs corresponding to lag $u = 1$, 22 for lag $u = 2$, and 11 for lag $u = 3$. In Figure 6, these pairs are plotted for each $u$. The plot gives a similar impression as the numerical estimate. An advantage of the plot is that it clearly shows that the information on correlation (the total number of pairs) decreases as $u$ increases.

For more than one group, these procedures may be carried out separately for each group.

[Figure 6: Lag plots for data from girls in the dental study for lags $u = 1, 2$, and 3. Panels: Lag 1, Lag 2, Lag 3; axes: response at age.]

When data are not equally spaced, extensions of the method for estimating the autocorrelation function are available but are beyond the scope of our discussion here. The reader is referred to Diggle, Heagerty, Liang, and Zeger (2002).

It is important to recognize that stationarity is an assumption. The foregoing procedures are relevant when this assumption is valid. Unfortunately, assessing with confidence whether stationarity holds is not really possible in longitudinal data situations, where the number of time points is usually small. Because many popular models for correlation used in longitudinal data analysis embody the
stationarity assumption, it is often assumed without comment, and it is often reasonable.

4.6 Implementation with SAS

We demonstrate the use of various SAS procedures on the dental data. In particular, we show how the following may be obtained:

• Sample mean vectors for each group (girls and boys)

• Group-specific sample covariance and correlation matrices

• Pooled sample covariance and correlation matrix

• Pairs for plotting scatterplot matrices for each group

• Autocorrelation functions for each gender and pairs for making lag plots

There are actually numerous ways to obtain the pooled sample covariance and correlation matrices. We show one way here, using SAS PROC DISCRIM. Additional ways can be found in the program on the course web site.

EXAMPLE 1 - DENTAL STUDY DATA: The data are in the file dental.dat.

PROGRAM:

    /* CHAPTER 4, EXAMPLE 1

       Using SAS to obtain sample mean vectors, sample covariance
       matrices, and sample correlation matrices */

    options ls=80 ps=59 nodate; run;

    /* The data are not in the correct form for use with the SAS
       procedures CORR and DISCRIM we use below. These procedures
       require that the data be in the form of one record (line) per
       experimental unit. The data in the file dental.dat are in the
       form of one record per observation, so that each child has 4
       data records. In particular, the data set looks like

          1 1  8 21   0
          2 1 10 20   0
          3 1 12 21.5 0
          4 1 14 23   0
          5 2  8 21   0

       column 1   observation number
       column 2   child id number
       column 3   age
       column 4   response (distance)
       column 5   gender indicator (0=girl, 1=boy)

       We thus create a new data set such that each record in the data
       set represents all 4 observations on each child plus gender. To
       do this, we use some data manipulation features of the SAS data
       step. The second data step does this. We redefine the values of
       AGE so that we may use AGE as an "index" in creating the new
       data set DENT2. The DATA step that creates DENT2 demonstrates
       one way (using the notion of an ARRAY) to transform a data set
       in the form of one observation per record */
    /* (the original form) into a data set in the form of one record
       per individual. The data must be sorted prior to this operation;
       we invoke PROC SORT for this purpose.

       In the new data set, the observations at ages 8, 10, 12, and 14
       are placed in variables AGE1, AGE2, AGE3, and AGE4, respectively.

       We use PROC PRINT to print out the first 5 records (so data for
       the first 5 children, all girls) using the OBS= feature of the
       DATA= option. */

    data dent1; infile 'dental.dat';
      input obsno child age distance gender;
    run;

    data dent1; set dent1;
      if age=8 then age=1;
      if age=10 then age=2;
      if age=12 then age=3;
      if age=14 then age=4;
      drop obsno;
    run;

    proc sort data=dent1;
      by gender child;
    run;

    data dent2(keep=age1-age4 gender child);
      array aa{4} age1-age4;
      do age=1 to 4;
        set dent1;
        by gender child;
        aa{age}=distance;
        if last.child then return;
      end;
    run;

    title "TRANSFORMED DATA - 1 RECORD/INDIVIDUAL";
    proc print data=dent2(obs=5); run;

    /* Here we use PROC CORR to obtain the sample means at each age
       (means of the variables AGE1-AGE4 in DENT2) and to calculate the
       sample covariance matrix and corresponding sample correlation
       matrix separately for each group (girls and boys). The COV
       option in the PROC CORR statement asks for the sample covariance
       to be printed; without it, only the sample correlation matrix
       would appear in the output. */

    proc sort data=dent2; by gender; run;

    title "SAMPLE COVARIANCE AND CORRELATION MATRICES BY GENDER";
    proc corr data=dent2 cov;
      by gender; var age1 age2 age3 age4;
    run;

    /* We now obtain the "centered" and "scaled" values that may be
       used for plotting scatterplot matrices such as those shown
       earlier. Here we call PROC MEANS to calculate the sample mean
       (MAGE1-MAGE4) and standard deviation (SDAGE1-SDAGE4) for each of
       the variables AGE1-AGE4 for each gender. These are output to the
       data set DENTSTATS, which has two records, one for each gender
       (see the output). We then MERGE this data set with DENT2 BY
       GENDER, which has the effect of matching up the appropriate
       gender mean and SD to each child. We print out the first 3
       records of the resulting data set to illustrate.

       We use the NOPRINT option with PROC MEANS to suppress */
    /* printing of its output.

       The variables CSAGE1-CSAGE4 contain the centered/scaled values.
       These may be plotted against each other to obtain plots like
       Figure 3; we have not done this here to save space. */

    proc sort data=dent2; by gender child; run;

    proc means data=dent2 mean std noprint; by gender;
      var age1 age2 age3 age4;
      output out=dentstats mean=mage1 mage2 mage3 mage4
                           std=sdage1 sdage2 sdage3 sdage4;
    run;

    title "SAMPLE MEANS AND SDS BY GENDER FROM PROC MEANS";
    proc print data=dentstats; run;

    data dentstats; merge dentstats dent2; by gender;
      csage1=(age1-mage1)/sdage1;
      csage2=(age2-mage2)/sdage2;
      csage3=(age3-mage3)/sdage3;
      csage4=(age4-mage4)/sdage4;
    run;

    title "INDIVIDUAL DATA MERGED WITH MEANS AND SDS BY GENDER";
    proc print data=dentstats(obs=3); run;

    /* One straightforward way to have SAS calculate the pooled sample
       covariance matrix and the corresponding estimated correlation
       matrix is using PROC DISCRIM. This procedure is aimed at
       so-called discriminant analysis, a topic in general multivariate
       analysis, in which the data are considered as in the form of
       vectors; here, the elements of a data vector are denoted as
       AGE1-AGE4. Here, we only use PROC DISCRIM for its facility to
       print out the sample covariance matrix and correlation matrix
       "automatically" and disregard other portions of the output. */

    proc discrim pcov pcorr data=dent2;
      class gender;
      var age1 age2 age3 age4;
    run;

    /* Although it is a bit cumbersome, we may use some DATA step
       manipulations and PROC CORR to obtain the values of the
       autocorrelation function for each gender. We first drop
       variables no longer needed from the data set DENTSTATS. We then
       create three data sets, LAG1, LAG2, and LAG3, and describe LAG1
       here; the other two are similar. We create two new variables,
       PAIR1 and PAIR2. For LAG1, PAIR1 and PAIR2 are the two values in
       a centered and scaled pair for u=1. As there are 4 ages, each
       child has 3 such pairs. The output of PROC PRINT for LAG1 shows
       this for the first 2 children. We then sort the data by gender
       and call PROC CORR to find the sample correlation between the
       two variables for each gender. The same principle */
    /* is used to obtain the correlation by gender for lags 2 and 3
       (u=2,3).

       There are other, more sophisticated ways to obtain the values of
       the autocorrelation function; however, for longitudinal data
       sets where the number of time points is small, the "manual"
       approach we have demonstrated here is easy to implement and
       understand.

       PAIR1 versus PAIR2 may be plotted for each lag to obtain a
       visual presentation of the results as in Figure 6. */

    data dentstats; set dentstats;
      drop age1-age4 mage1-mage4 sdage1-sdage4;
    run;

    data lag1; set dentstats;
      by child;
      pair1=csage1; pair2=csage2; output;
      pair1=csage2; pair2=csage3; output;
      pair1=csage3; pair2=csage4; output;
      if last.child then return;
      drop csage1-csage4;
    run;

    title "AUTOCORRELATION FUNCTION AT LAG 1";
    proc print data=lag1(obs=6); run;
    proc sort data=lag1; by gender; run;
    proc corr data=lag1; by gender;
      var pair1 pair2;
    run;

    data lag2; set dentstats;
      by child;
      pair1=csage1; pair2=csage3; output;
      pair1=csage2; pair2=csage4; output;
      if last.child then return;
      drop csage1-csage4;
    run;

    title "AUTOCORRELATION FUNCTION AT LAG 2";
    proc print data=lag2(obs=6); run;
    proc sort data=lag2; by gender; run;
    proc corr data=lag2; by gender;
      var pair1 pair2;
    run;

    data lag3; set dentstats;
      by child;
      pair1=csage1; pair2=csage4; output;
      if last.child then return;
      drop csage1-csage4;
    run;

    title "AUTOCORRELATION FUNCTION AT LAG 3";
    proc print data=lag3(obs=6); run;
    proc sort data=lag3; by gender; run;
    proc corr data=lag3; by gender;
      var pair1 pair2;
    run;

OUTPUT: We have deleted some of the output of PROC DISCRIM that is irrelevant to our purposes here to shorten the presentation. The full output from the call to this procedure is on the course web page.

    TRANSFORMED DATA - 1 RECORD/INDIVIDUAL

    Obs    age1    age2    age3    age4    child    gender
      1    21.0    20.0    21.5    23.0      1        0
      2    21.0    21.5    24.0    25.5      2        0
      3    20.5    24.0    24.5    26.0      3        0
      4    23.5    24.5    25.0    26.5      4        0
      5    21.5    23.0    22.5    23.5      5        0

    SAMPLE COVARIANCE AND CORRELATION MATRICES BY GENDER

    gender=0

    The CORR Procedure
    4 Variables:  age1  age2  age3  age4

    Covariance Matrix, DF = 10
                  age1           age2           age3           age4
    age1   4.513636364    3.354545455    4.331818182    4.356818182
    age2   3.354545455    3.618181818    4.027272727    4.077272727
    age3   4.331818182    4.027272727    5.590909091    5.465909091
    age4   4.356818182    4.077272727    5.465909091    5.940909091

    Simple Statistics
    Variable    N        Mean    Std Dev          Sum     Minimum     Maximum
    age1       11    21.18182    2.12453    233.00000    16.50000    24.50000
    age2       11    22.22727    1.90215    244.50000    19.00000    25.00000
    age3       11    23.09091    2.36451    254.00000    19.00000    28.00000
    age4       11    24.09091    2.43740    265.00000    19.50000    28.00000

    Pearson Correlation Coefficients, N = 11
    Prob > |r| under H0: Rho=0
               age1       age2       age3       age4
    age1    1.00000    0.83009    0.86231    0.84136
                        0.0016     0.0006     0.0012
    age2    0.83009    1.00000    0.89542    0.87942
             0.0016                0.0002     0.0004
    age3    0.86231    0.89542    1.00000    0.94841
             0.0006     0.0002                <.0001
    age4    0.84136    0.87942    0.94841    1.00000
             0.0012     0.0004     <.0001

    SAMPLE COVARIANCE AND CORRELATION MATRICES BY GENDER

    gender=1

    The CORR Procedure
    4 Variables:  age1  age2  age3  age4

    Covariance Matrix, DF = 15
                  age1           age2           age3           age4
    age1   6.016666667    2.291666667    3.629166667    1.612500000
    age2   2.291666667    4.562500000    2.193750000    2.810416667
    age3   3.629166667    2.193750000    7.032291667    3.240625000
    age4   1.612500000    2.810416667    3.240625000    4.348958333

    Simple Statistics
    Variable    N        Mean    Std Dev          Sum     Minimum     Maximum
    age1       16    22.87500    2.45289    366.00000    17.00000    27.50000
    age2       16    23.81250    2.13600    381.00000    20.50000    28.00000
    age3       16    25.71875    2.65185    411.50000    22.50000    31.00000
    age4       16    27.46875    2.08542    439.50000    25.00000    31.50000

    Pearson Correlation Coefficients, N = 16
    Prob > |r| under H0: Rho=0
               age1       age2       age3       age4
    age1    1.00000    0.43739    0.55793    0.31523
                        0.0902     0.0247     0.2343
    age2    0.43739    1.00000    0.38729    0.63092
             0.0902                0.1383     0.0088
    age3    0.55793    0.38729    1.00000    0.58599
             0.0247     0.1383                0.0171
    age4    0.31523    0.63092    0.58599    1.00000
             0.2343     0.0088     0.0171

    SAMPLE MEANS AND SDS BY GENDER FROM PROC MEANS

    Obs gender _TYPE_ _FREQ_   mage1   mage2   mage3   mage4  sdage1  sdage2  sdage3  sdage4
      1    0      0     11   21.1818 22.2273 23.0909 24.0909 2.12453 1.90215 2.36451 2.43740
      2    1      0     16   22.8750 23.8125 25.7188 27.4688 2.45289 2.13600 2.65185 2.08542

    INDIVIDUAL DATA MERGED WITH MEANS AND SDS BY GENDER

    Obs gender _TYPE_ _FREQ_   mage1   mage2   mage3   mage4  sdage1  sdage2  sdage3
      1    0      0     11   21.1818 22.2273 23.0909 24.0909 2.12453 1.90215 2.36451
      2    0      0     11   21.1818 22.2273 23.0909 24.0909 2.12453 1.90215 2.36451
      3    0      0     11   21.1818 22.2273 23.0909 24.0909 2.12453 1.90215 2.36451

    Obs  sdage4  age1  age2  age3  age4  child    csage1    csage2    csage3    csage4
      1 2.43740  21.0  20.0  21.5  23.0    1    -0.08558  -1.17092  -0.67283  -0.44757
      2 2.43740  21.0  21.5  24.0  25.5    2    -0.08558  -0.38234   0.38447   0.57811
      3 2.43740  20.5  24.0  24.5  26.0    3    -0.32093   0.93196   0.59593   0.78325

    INDIVIDUAL DATA MERGED WITH MEANS AND SDS BY GENDER

    The DISCRIM Procedure

    Observations    27        DF Total              26
    Variables        4        DF Within Classes     25
    Classes          2        DF Between Classes     1

    Class Level Information
              Variable                                         Prior
    gender    Name      Frequency     Weight    Proportion    Probability
       0      _0            11       11.0000     0.407407      0.500000
       1      _1            16       16.0000     0.592593      0.500000

    The DISCRIM Procedure
    Pooled Within-Class Covariance Matrix, DF = 25
    Variable          age1           age2           age3           age4
    age1       5.415454545    2.716818182    3.910227273    2.710227273
    age2       2.716818182    4.184772727    2.927159091    3.317159091
    age3       3.910227273    2.927159091    6.455738636    4.130738636
    age4       2.710227273    3.317159091    4.130738636    4.985738636

    The DISCRIM Procedure
    Pooled Within-Class Correlation Coefficients / Pr > |r|
    Variable      age1       age2       age3       age4
    age1       1.00000    0.57070    0.66132    0.52158
                           0.0023     0.0002     0.0063
    age2       0.57070    1.00000    0.56317    0.72622
                0.0023                0.0027     <.0001
    age3       0.66132    0.56317    1.00000    0.72810
                0.0002     0.0027                <.0001
    age4       0.52158    0.72622    0.72810    1.00000
                0.0063     <.0001     <.0001

    AUTOCORRELATION FUNCTION AT LAG 1

    Obs    gender    _TYPE_    _FREQ_    child       pair1       pair2
      1       0         0        11        1      -0.08558    -1.17092
      2       0         0        11        1      -1.17092    -0.67283
      3       0         0        11        1      -0.67283    -0.44757
      4       0         0        11        2      -0.08558    -0.38234
      5       0         0        11        2      -0.38234     0.38447
      6       0         0        11        2       0.38447     0.57811

    AUTOCORRELATION FUNCTION AT LAG 1

    gender=0

    The CORR Procedure
    2 Variables:  pair1  pair2

    Simple Statistics
    Variable    N    Mean    Std Dev    Sum     Minimum    Maximum
    pair1      33       0    0.96825      0    -2.20369    2.07616
    pair2      33       0    0.96825      0    -1.88353    2.07616

    Pearson Correlation Coefficients, N = 33
    Prob > |r| under H0: Rho=0
               pair1      pair2
    pair1    1.00000    0.89130
                         <.0001
    pair2    0.89130    1.00000
              <.0001

    AUTOCORRELATION FUNCTION AT LAG 1

    gender=1

    The CORR Procedure
    2 Variables:  pair1  pair2

    Simple Statistics
    Variable    N    Mean    Std Dev    Sum     Minimum    Maximum
    pair1      48       0    0.97849      0    -2.39513    1.99154
    pair2      48       0    0.97849      0    -1.55080    1.99154

    Pearson Correlation Coefficients, N = 48
    Prob > |r| under H0: Rho=0
               pair1      pair2
    pair1    1.00000    0.47022
                         0.0007
    pair2    0.47022    1.00000
              0.0007

    AUTOCORRELATION FUNCTION AT LAG 2

    Obs    gender    _TYPE_    _FREQ_    child       pair1       pair2
      1       0         0        11        1      -0.08558    -0.67283
      2       0         0        11        1      -1.17092    -0.44757
      3       0         0        11        2      -0.08558     0.38447
      4       0         0        11        2      -0.38234     0.57811
      5       0         0        11        3      -0.32093     0.59593
      6       0         0        11        3       0.93196     0.78325

    AUTOCORRELATION FUNCTION AT LAG 2

    gender=0

    The CORR Procedure
    2 Variables:  pair1  pair2

    Simple Statistics
    Variable    N    Mean    Std Dev    Sum     Minimum    Maximum
    pair1      22       0    0.97590      0    -2.20369    1.56184
    pair2      22       0    0.97590      0    -1.88353    2.07616

    Pearson Correlation Coefficients, N = 22
    Prob > |r| under H0: Rho=0
               pair1      pair2
    pair1    1.00000    0.87087
                         <.0001
    pair2    0.87087    1.00000
              <.0001

    AUTOCORRELATION FUNCTION AT LAG 2

    gender=1

    The CORR Procedure
    2 Variables:  pair1  pair2

    Simple Statistics
    Variable    N    Mean    Std Dev    Sum     Minimum    Maximum
    pair1      32       0    0.98374      0    -2.39513    1.96044
    pair2      32       0    0.98374      0    -1.21378    1.99154

    Pearson Correlation Coefficients, N = 32
    Prob > |r| under H0: Rho=0
               pair1      pair2
    pair1    1.00000    0.59443
                         0.0003
    pair2    0.59443    1.00000
              0.0003

    AUTOCORRELATION FUNCTION AT LAG 3

    Obs    gender    _TYPE_    _FREQ_    child       pair1       pair2
      1       0         0        11        1      -0.08558    -0.44757
      2       0         0        11        2      -0.08558     0.57811
      3       0         0        11        3      -0.32093     0.78325
      4       0         0        11        4       1.09115     0.98839
      5       0         0        11        5       0.14977    -0.24243
      6       0         0        11        6      -0.55627    -0.65271

    AUTOCORRELATION FUNCTION AT LAG 3

    gender=0

    The CORR Procedure
    2 Variables:  pair1  pair2

    Simple Statistics
    Variable    N    Mean    Std Dev    Sum     Minimum    Maximum
    pair1      11       0    1.00000      0    -2.20369    1.56184
    pair2      11       0    1.00000      0    -1.88353    1.60380

    Pearson Correlation Coefficients, N = 11
    Prob > |r| under H0: Rho=0
               pair1      pair2
    pair1    1.00000    0.84136
                         0.0012
    pair2    0.84136    1.00000
              0.0012

    AUTOCORRELATION FUNCTION AT LAG 3

    gender=1

    The CORR Procedure
    2 Variables:  pair1  pair2

    Simple Statistics
    Variable    N    Mean    Std Dev    Sum     Minimum    Maximum
    pair1      16       0    1.00000      0    -2.39513    1.88553
    pair2      16       0    1.00000      0    -1.18382    1.93307

    Pearson Correlation Coefficients, N = 16
    Prob > |r| under H0: Rho=0
               pair1      pair2
    pair1    1.00000    0.31523
                         0.2343
    pair2    0.31523    1.00000
              0.2343
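
The pooled matrices printed by PROC DISCRIM above can be cross-checked directly from the group-specific sample covariance matrices in the PROC CORR output, using the weighted-average formula for the pooled sample covariance matrix given earlier in this chapter. The following is a minimal Python sketch under that formula; the function names are illustrative assumptions, and the inputs are the values printed in the output above.

```python
from math import sqrt

# Sample covariance matrices copied from the PROC CORR output above.
S_girls = [  # gender=0, 11 girls (DF = 10)
    [4.513636364, 3.354545455, 4.331818182, 4.356818182],
    [3.354545455, 3.618181818, 4.027272727, 4.077272727],
    [4.331818182, 4.027272727, 5.590909091, 5.465909091],
    [4.356818182, 4.077272727, 5.465909091, 5.940909091],
]
S_boys = [  # gender=1, 16 boys (DF = 15)
    [6.016666667, 2.291666667, 3.629166667, 1.612500000],
    [2.291666667, 4.562500000, 2.193750000, 2.810416667],
    [3.629166667, 2.193750000, 7.032291667, 3.240625000],
    [1.612500000, 2.810416667, 3.240625000, 4.348958333],
]


def pooled_cov(cov_list, sizes):
    """Weighted average of group covariance matrices, weights (r_l - 1)/(m - q)."""
    m, q = sum(sizes), len(sizes)
    n = len(cov_list[0])
    return [[sum((r - 1) * S[j][k] for S, r in zip(cov_list, sizes)) / (m - q)
             for k in range(n)] for j in range(n)]


def cov_to_corr(S):
    """Correlation matrix corresponding to a covariance matrix."""
    sd = [sqrt(S[j][j]) for j in range(len(S))]
    return [[S[j][k] / (sd[j] * sd[k]) for k in range(len(S))]
            for j in range(len(S))]


pooled = pooled_cov([S_girls, S_boys], [11, 16])
pooled_corr = cov_to_corr(pooled)

if __name__ == "__main__":
    for row in pooled:
        print(["%.3f" % v for v in row])
    for row in pooled_corr:
        print(["%.3f" % v for v in row])
```

Up to rounding of the printed inputs, the result agrees with the Pooled Within-Class Covariance Matrix and Correlation Coefficients reported by PROC DISCRIM.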