METH MULTIVARIATE ANLY STAT 636
This 46-page set of class notes was uploaded by Darien Kutch on Wednesday, October 21, 2015. The notes belong to STAT 636 at Texas A&M University, taught by Staff in Fall. Since upload, they have received 17 views. For similar materials see /class/225763/stat-636-texas-a-m-university in Statistics at Texas A&M University.
LECTURE 6 FACTOR ANALYSIS

Read paragraphs 6.1-6.2. Example in DJ, page 149: data were taken on test scores of students in various topics. Suppose we are interested in determining a smaller number of uncorrelated variables to describe the data.

REVIEW OF PRINCIPAL COMPONENTS ANALYSIS

Recall that principal components analysis has this objective. The principal component procedure defines a matrix A, namely the matrix of eigenvectors, that relates the new variables to the original ones. Thus y = A^T x, or equivalently x = Ay. In terms of data matrices we write Y = XA or X = YA^T. The matrix A is determined from the covariance matrix Σ or the correlation matrix R and satisfies the relation Σ = AΛA^T or R = AΛA^T, where Λ is the diagonal matrix of eigenvalues. Also recall that, when the data have been standardized, the correlation between x_i and y_j is given by Corr(x_i, y_j) = a_ij √λ_j.

A STATISTICAL MODEL

Looking for a statistical model to describe the data, Spearman conjectured that each student could be given a score on each topic consisting of a measure of general intelligence, f, and a measure of ability in the specific topic, η_i. Thus for each topic the score would look like

Topic_i = λ_i f + η_i,  i = 1, ..., 6

That is,

Classics = λ1 f + η1
French = λ2 f + η2
...
Music = λ6 f + η6

Here f is a random component common to all topics (the common factor) and η_i is a random component specific to the i-th topic (the specific factor). The λ_i are parameters reflecting the importance of f in describing the response; λ_i is called the loading of the i-th response on the common factor (the factor loading). Further, he assumed that

1. f is a random variable with mean 0 and variance 1;
2. η_i is a random variable with mean 0 and variance ψ_i;
3. f and the η_i are independent for all i.

The assumption that the means are zero is consistent with the fact that in the correlation matrix the data have been standardized. Similarly, the assumption that the variance of f is one is not restrictive, since any different assumption could be absorbed into the λ's; the η's are allowed to have different variances. The strong assumption is that all of the random variables are independent.
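As a sanity check on what the model implies, the one-factor structure is easy to simulate. The loadings below are invented for illustration (they are not the Spearman data); the check is that the sample correlations approach λ_i λ_j, as the model predicts.

```python
import numpy as np

# Simulating Spearman's one-factor model x_i = lam_i * f + eta_i.
# The loadings are invented for illustration (not the Spearman data);
# the check is that the sample correlations approach lam_i * lam_j.
rng = np.random.default_rng(0)
lam = np.array([0.9, 0.8, 0.7])
psi = 1.0 - lam**2                              # specific variances, so Var(x_i) = 1

n = 100_000
f = rng.normal(size=n)                          # common factor, mean 0, variance 1
eta = rng.normal(size=(n, 3)) * np.sqrt(psi)    # specific factors, variance psi_i
X = f[:, None] * lam + eta

R_hat = np.corrcoef(X, rowvar=False)
R_model = np.outer(lam, lam)
np.fill_diagonal(R_model, 1.0)
print(np.round(np.abs(R_hat - R_model).max(), 3))   # small for large n
```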
Denoting the vector of topic scores by x, we summarize the model in matrix form as

x = λf + η

Here x, λ, and η are column vectors and f is a scalar. The assumptions are:

1. f ~ (0, 1);
2. η ~ (0, Ψ), with Ψ diagonal;
3. f and η are independent;
4. λ is a vector of unknown parameters.

Under this model it follows that we can write the correlation matrix as

P = Corr(x) = Corr(λf + η) = λλ^T + Ψ

The diagonal elements of λλ^T, that is the λ_i², are called the communalities, and the elements ψ_i of the diagonal matrix Ψ are called the specific variances. The parameters λ_i are the correlations of the x_i with f. To see this, note that Cov(x, f) = E[(λf + η)f] = λ, and recall that we have assumed f has unit variance and we have standardized the x_i to have unit variance.

COMPARING THE PRINCIPAL COMPONENT MODEL AND THE FACTOR MODEL

In principal components, the matrix A relates the observed data x to the new variables y, and the reverse relation is defined by A^T. In the factor model, the vector λ describes x in terms of f. Similarly, in principal components we can write the correlation matrix as P = AΛA^T; in the factor model we have a similar expression but add on the variances of the specific factors, thus P = λλ^T + Ψ. Look at the Spearman example in the SAS program RRH STUDENT PC FA.

SOLVING THE FACTOR ANALYSIS EQUATIONS

To determine whether f, λ, and η exist, we try to find λ and Ψ such that P = λλ^T + Ψ. These are called the factor analysis equations. Our problems are: (1) to determine λ and Ψ, if they exist, that exactly satisfy this equation; (2) if not, to determine additional factors. Examining these equations, we see that there are p(p + 1)/2 distinct elements on the left and 2p unknowns on the right. Thus with our example, p = 6 and we are faced with the problem of solving 21 equations in 12 unknowns. To illustrate with p = 3, these equations are

1 = λ1² + ψ1,  1 = λ2² + ψ2,  1 = λ3² + ψ3
r12 = λ1 λ2,  r13 = λ1 λ3,  r23 = λ2 λ3

Clearly this is a cumbersome task as p increases. DJ considers two special cases for P. In one case he is led to the equation λ1² = −0.3, an impossibility. In the other case he is led to the conclusion ψ1 = −0.44 and λ1 = 1.2; both are impossible, since ψ1 is a variance and λ1 is the correlation between x1 and f. The conclusion in these cases is that we must increase the number of factors if we are to find a solution.
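When a solution does exist for p = 3, it is available in closed form: from r12 = λ1λ2, r13 = λ1λ3, r23 = λ2λ3 it follows that λ1² = r12 r13 / r23, and the remaining loadings and specific variances follow. A minimal sketch, using loadings invented for illustration to generate R:

```python
import numpy as np

# Closed-form solution of the p = 3 factor analysis equations:
# from r12 = lam1*lam2, r13 = lam1*lam3, r23 = lam2*lam3 it follows that
# lam1^2 = r12*r13/r23. The loadings generating R are invented for illustration.
lam_true = np.array([0.9, 0.8, 0.7])
R = np.outer(lam_true, lam_true)
np.fill_diagonal(R, 1.0)

r12, r13, r23 = R[0, 1], R[0, 2], R[1, 2]
lam1 = np.sqrt(r12 * r13 / r23)
lam2, lam3 = r12 / lam1, r13 / lam1
psi = 1.0 - np.array([lam1, lam2, lam3])**2     # psi_j = 1 - lam_j^2

print(np.round([lam1, lam2, lam3], 6))          # recovers [0.9 0.8 0.7]
```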
A GENERAL MODEL

The general model extends the above example to allow for m common factors, and allows the factorization of the covariance matrix as well as the correlation matrix. We write the model as

x = Λf + η

where x is the p-vector of response variables, f is the m-vector of common factors, η is the p-vector of specific factors, and Λ is the p × m matrix of factor loadings. Again we assume:

1. f ~ (0, I);
2. η ~ (0, Ψ), where Ψ is diagonal;
3. f and η are independent.

In general we may assume that E(x) = 0, that is, the data have been centered; but as in principal components analysis we argue that it makes more sense to assume that the variables have been standardized to have unit variance, and hence Var(x) = P. Most computer programs standardize the data automatically. As before, we see that

Var(x) = Var(Λf + η) = ΛΛ^T + Ψ

and we define the factor analysis equations as

Σ = ΛΛ^T + Ψ

Note that these equations imply the relation σ_jj = Σ_k λ_jk² + ψ_j. The proportion of the variance that is explained by the common factors is called the communality of the j-th response variable. This is given by

h_j² = ( Σ_{k=1}^m λ_jk² ) / σ_jj

If we are analyzing the correlation matrix, the denominator is one. The covariance between x_i and x_j is given by

Cov(x_i, x_j) = Σ_{k=1}^m λ_ik λ_jk

and the covariance between x_j and the factor f_k is given by the loading of the j-th response on the k-th factor, that is, Cov(x_j, f_k) = λ_jk. If the data have been standardized, this is the correlation between those two variables. Clearly the solution of these equations contains the same problems as in the case m = 1; the problems are computational for a given m, and are magnified by trying to determine m.

Example. Consider the matrix

Σ = ( 19  30   2  12 )
    ( 30  57   5  23 )
    (  2   5  38  47 )
    ( 12  23  47  68 )

and the factor loading and specific variance matrices

Λ = (  4  1 )      Ψ = diag(2, 4, 1, 3)
    (  7  2 )
    ( −1  6 )
    (  1  8 )

It can be verified that these matrices exactly satisfy the factor analysis equations.
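The claim is easy to verify numerically; a one-line check using the matrices as read from the example:

```python
import numpy as np

# Verifying that the (Lambda, Psi) pair satisfies Sigma = Lambda Lambda^T + Psi
# exactly, using the matrices as read from the example.
Sigma = np.array([[19., 30.,  2., 12.],
                  [30., 57.,  5., 23.],
                  [ 2.,  5., 38., 47.],
                  [12., 23., 47., 68.]])
Lam = np.array([[ 4., 1.],
                [ 7., 2.],
                [-1., 6.],
                [ 1., 8.]])
Psi = np.diag([2., 4., 1., 3.])

print(np.allclose(Sigma, Lam @ Lam.T + Psi))   # True
```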
This illustrates the situation, but we may be surprised that an exact solution is obtained with only two factors. This seems contrary to our intuition, since we would expect that four eigenvectors would be needed. You might try to find the source of this confusion.

LACK OF UNIQUENESS

There is an additional problem of lack of uniqueness. That is, if Λ and Ψ satisfy the equations and T is an orthogonal matrix, then ΛT is also a solution, since

Σ = (ΛT)(ΛT)^T + Ψ = ΛTT^TΛ^T + Ψ = ΛΛ^T + Ψ

Thus the factor loading matrix is not unique. Note that multiplying by T corresponds to a rotation, and hence a part of most programs is to allow for a rotation of factors in search of a more satisfying solution.

SOLVING THE FACTOR ANALYSIS EQUATIONS

There are many methods for doing this. Johnson describes several methods that have been proposed for solving the equations; principal factoring and the maximum likelihood method seem to be the most common. A brief description of principal factoring follows. Since factor analysis is commonly performed on the correlation matrix, we will make that assumption; thus the factor analysis equations are R = ΛΛ^T + Ψ.

Principal Factoring. Most methods require an initial estimate of the communalities or, equivalently, of the specific variances. A simple choice is to let Ψ = 0, hence all communalities are equal to one. Another choice is to assume the communalities are given by R_j², the squared multiple correlation of the j-th response with the remaining responses; this is obtained by regressing x_j on the other responses (excluding x_j). In this case our initial estimate is

ψ̂_j = 1 − Σ_{k=1}^m λ_jk² = 1 − R_j²

Thus we seek a matrix Λ satisfying the relation

ΛΛ^T = R − Ψ

As noted, ΛT is equally good for any orthogonal matrix T. Thus we could choose an orthogonal matrix such that (ΛT)^T(ΛT) = D, a diagonal matrix. For this reason we assume that Λ^TΛ = D, where D is diagonal. Note that postmultiplying the factor analysis equations by Λ yields

(R − Ψ)Λ = ΛΛ^TΛ = ΛD

Recalling the eigen analysis of a matrix and applying it to R − Ψ shows that the diagonal elements of D are eigenvalues of R − Ψ and the columns of A are the corresponding eigenvectors.
Recall from our discussion of eigenvectors that they are orthogonal and scaled to have length one; that is, the above analysis yields A^T A = I. To get vectors that satisfy our requirement that Λ^TΛ = D, we scale the eigenvectors by multiplying by D^{1/2}; that is, we consider Λ = A D^{1/2}. Thus the estimate of the factor loading matrix is given by the scaled eigenvectors of R − Ψ, and the estimate of Ψ is given by the diagonal elements of R − ΛΛ^T, that is,

ψ̂_j = 1 − Σ_{k=1}^m λ̂_jk²

Recalling our results from the principal components analysis, we note that the matrix of factor loadings is analogous to the matrix L of component correlation vectors, which had the property R = LL^T. This suggests that if we are to keep only m factors, we keep those corresponding to the m largest eigenvalues. If an eigenvalue is negative, it implies that R − Ψ is not positive semidefinite; if so, we must begin again, either with a new estimate of the communalities (equivalently, of Ψ) or with a smaller value of m. Note that if we make the initial assumption that the specific variances are all equal to zero, that is Ψ = 0, then the factor loading matrix Λ is given by the scaled eigenvectors of R, and hence agrees with the vectors obtained by principal components. Usually we would iterate on this method, using the estimates of the communalities just obtained and starting the process anew. After extraction of the factors it is common to compute the residual matrix, defined by (hereafter we drop the hat on Λ)

Res = R − ΛΛ^T − Ψ

This gives an indication of how well the given number of factors allows us to reproduce the correlation matrix.
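The principal factoring iteration just described can be sketched in a few lines of numpy. The correlation matrix below is invented for illustration (it has an exact one-factor structure), the starting value is Ψ = 0, and m = 1 factor is kept; the residual matrix shrinks as the communality estimates are recycled.

```python
import numpy as np

# Principal factoring by iterated eigen analysis of R - Psi, starting from
# Psi = 0 and keeping m = 1 factor. R is invented for illustration.
R = np.array([[1.00, 0.72, 0.63],
              [0.72, 1.00, 0.56],
              [0.63, 0.56, 1.00]])
p, m = 3, 1

psi = np.zeros(p)                             # initial estimate: communalities = 1
for _ in range(100):
    vals, vecs = np.linalg.eigh(R - np.diag(psi))
    idx = np.argsort(vals)[::-1][:m]          # keep the m largest eigenvalues
    Lam = vecs[:, idx] * np.sqrt(vals[idx])   # Lambda = A D^(1/2)
    psi = 1.0 - (Lam**2).sum(axis=1)          # updated specific variances

resid = R - Lam @ Lam.T - np.diag(psi)        # residual matrix
print(np.round(np.abs(resid).max(), 4))       # small after iterating
```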
Maximum Likelihood Analysis. This method is based on applying maximum likelihood computations to the Wishart distribution. Again it is an iterative process: an initial estimate of Ψ is required, as well as the iterative computation of eigenvalues and eigenvectors of a matrix. The details of the development are tedious, but lead to an eigen analysis of the matrix

Ψ^{-1/2} (R − Ψ) Ψ^{-1/2}

Let A be the matrix of eigenvectors of this matrix and Δ the diagonal matrix of its eigenvalues. The loadings are then given by

Λ = Ψ^{1/2} A Δ^{1/2}

Given Λ, we then compute a new matrix of specific variances, Ψ = Diag(R − ΛΛ^T), and repeat the process until convergence. A feature of the method is that it allows a test of the adequacy of the m-factor model. The test statistic is

χ² = a ln( |ΛΛ^T + Ψ| / |R| ),  where a = n − 1 − (2p + 5)/6 − 2m/3

and the degrees of freedom of the chi-square statistic are

v = [ (p − m)² − (p + m) ] / 2

The null hypothesis is that the m-factor model is adequate, as opposed to the alternative that another model would be better.

ROTATING FACTORS

We have seen that the factor loading matrix is not unique under orthogonal transformations; thus an infinity of factorizations can be generated. For example, with m = 2, postmultiplication of Λ by the matrix

T = (  cos θ  sin θ )
    ( −sin θ  cos θ )

for any value of θ will perform an orthogonal rotation and yield a new set of factor loadings. The hope is that this rotation will produce a more meaningful set of factors. Several rotation methods have been suggested. If we let Γ = ΛT, then ΓΓ^T = ΛTT^TΛ^T = ΛΛ^T, and hence the communalities, that is, the diagonal elements of these two products, are unchanged under the rotation. We would like to choose a rotation that yields a simple structure for the factor loading matrix. Some of the criteria desired for a simple structure are:

1. Each row of Γ should contain at least one zero, and each column should contain at least m zeros.
2. Each pair of columns should have several entries with zero loadings in one column but not the other.

Examine the two factors in the STUDENT data geometrically.
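The invariance of the communalities under rotation is a one-line numerical check; the loadings and rotation angle below are arbitrary.

```python
import numpy as np

# Communalities are invariant under orthogonal rotation:
# (Lam T)(Lam T)^T = Lam T T^T Lam^T = Lam Lam^T. Loadings and angle arbitrary.
Lam = np.array([[ 4., 1.],
                [ 7., 2.],
                [-1., 6.],
                [ 1., 8.]])
theta = 0.7
T = np.array([[ np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])

Gamma = Lam @ T
print(np.allclose(Gamma @ Gamma.T, Lam @ Lam.T))   # True
```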
Analytical Methods for Rotation. There are several automated methods for doing rotations. Among these is the varimax rotation: for each column of the rotated factor matrix Γ we compute the variance of the squared loadings, and the varimax criterion V is the sum of these variances over the columns. By recalling that Γ = ΛT, we see that this quantity depends on θ, the angle of rotation, and hence we determine θ to maximize V. SAS has several other methods for rotating the factors, such as quartimax, equamax, etc.

Oblique Rotations. These rotations are produced by multiplying Λ by a non-orthogonal matrix; that is, in two dimensions we would rotate one axis through an angle θ1 and the other through θ2. Such rotations are subject to much criticism and are not recommended.

FACTOR SCORES

Recall that in principal components analysis we could describe x as a function of y and conversely: x = Ay, or equivalently y = A^T x. Often we wish to use the principal components for further analysis, for example in a regression model. In factor analysis there is no immediate way to obtain the factor scores. Methods have been proposed, but the resulting scores may differ. Here are two of the proposed methods.

Bartlett's Method. This method is based on weighted least squares. Consider the relation x = Λf + η; this looks like a linear regression model with response x and design matrix Λ. The problem is that Var(η) = Ψ, not σ²I. We can remove the problem by premultiplying the regression model by Ψ^{-1/2}. The revised model is

Ψ^{-1/2} x = Ψ^{-1/2} Λ f + Ψ^{-1/2} η,  rewritten as  z = Wf + e

where now Var(e) = I. It follows that the estimate of f is given by the weighted least squares equations

(W^T W) f̂ = W^T z,  or equivalently  f̂ = (Λ^T Ψ^{-1} Λ)^{-1} Λ^T Ψ^{-1} x

Most regression programs will allow you to do this directly. Having obtained the definition of f̂ in terms of x, the n × m matrix of factor scores is given by

F = X Ψ^{-1} Λ (Λ^T Ψ^{-1} Λ)^{-1}

Thompson's Method. From our assumptions it follows that x and f are jointly multivariate normal, with Var(f) = I, Cov(f, x) = Λ^T, and Var(x) = P. From our discussion of the multivariate normal distribution, the conditional mean of f given x is

E(f | x) = Λ^T P^{-1} x = Λ^T (ΛΛ^T + Ψ)^{-1} x

and the n × m matrix of factor scores is given by

F = X R^{-1} Λ
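The two score formulas behave differently, and the difference is easy to see on a one-factor example with invented loadings: on a noiseless observation (η = 0) Bartlett's weighted least squares estimator recovers f exactly, while Thompson's conditional-mean estimator shrinks it toward zero.

```python
import numpy as np

# Bartlett (WLS) and Thompson factor scores for a one-factor model.
# Loadings invented for illustration.
lam = np.array([0.9, 0.8, 0.7]).reshape(-1, 1)   # p x 1 loading matrix
psi = 1.0 - lam[:, 0]**2                         # specific variances
R = lam @ lam.T + np.diag(psi)                   # implied correlation matrix

Psi_inv = np.diag(1.0 / psi)
W_bartlett = np.linalg.inv(lam.T @ Psi_inv @ lam) @ lam.T @ Psi_inv
W_thompson = lam.T @ np.linalg.inv(R)

x = lam[:, 0] * 2.0                  # observation generated with f = 2, eta = 0
f_b = (W_bartlett @ x).item()
f_t = (W_thompson @ x).item()
print(round(f_b, 6))                 # 2.0: Bartlett recovers f exactly
print(f_t < f_b)                     # True: Thompson shrinks toward 0
```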
EXAMINING THE DATA AND ASSESSING THE NORMALITY ASSUMPTION

NORMALITY

Marginal Distributions.
(i) Histograms.
(ii) q-q plots. For the data x_i, compute the standardized variates u_i, order them so that u(1) ≤ u(2) ≤ ... ≤ u(n), and compare u(i) to q_i, where q_i is the point such that P(z ≤ q_i) = (i − 1/2)/n, with z ~ N(0, 1). Improvement: replace (i − 1/2)/n by (i − 3/8)/(n + 1/4). To do this you may use PROC UNIVARIATE or construct your own plot. Refer to SAS program RRH Normality test.

Bivariate Normality.
1. Scatter plots of the data should be elliptical in shape. To aid in this examination we could compute x̄ and S and plot the ellipses (x − x̄)^T S^{-1} (x − x̄) = c for various values of c. If we use c = χ²₂(α), the upper-α point of the chi-square distribution with 2 degrees of freedom, then approximately 1 − α of the data should lie in the ellipse. This is quite tedious, and a more formal procedure is as follows.
2. Compute the squared distances d_j² = (x_j − x̄)^T S^{-1} (x_j − x̄) and order them. Compute the c_j, where c_j is the 100(j − 1/2)/n percentile of the chi-square distribution with two degrees of freedom; an improvement is to use (j − 3/8)/(n + 1/4). A plot of the pairs (c_j, d²_(j)) should resemble a straight line. A curved pattern suggests a lack of normality; one or two points to the right of the line indicate outlying observations. This procedure generalizes to p dimensions by computing the c_j from the chi-square distribution with p degrees of freedom. Refer to SAS program RRH Bivar Normality test.

Multivariate Normality. Apply the above procedure using p degrees of freedom for the chi-square statistic.
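The chi-square quantile plot just described can be sketched with numpy/scipy; the bivariate normal data here are simulated for illustration, and the correlation of the plotted pairs is used as a rough numerical summary of straightness.

```python
import numpy as np
from scipy import stats

# Chi-square quantile plot check of bivariate normality: ordered squared
# distances d_j^2 = (x_j - xbar)' S^{-1} (x_j - xbar) against chi-square(2)
# quantiles at (j - 1/2)/n. Data simulated for illustration.
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=500)

xbar = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.sort([(x - xbar) @ S_inv @ (x - xbar) for x in X])

n = len(d2)
c = stats.chi2.ppf((np.arange(1, n + 1) - 0.5) / n, df=2)

# For normal data the points (c_j, d2_(j)) should hug a straight line.
r = np.corrcoef(c, d2)[0, 1]
print(round(r, 3))
```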
TRANSFORMATIONS TO NORMALITY

1. Variance stabilizing transformations. Suppose the variance of x depends on the mean; that is, E(x) = μ and Var(x) = σ² h(μ). The idea is to find a variable z = g(x) such that the variance of z does not depend on E(z), which is approximately g(μ). Example: when sampling from the bivariate normal distribution, the variance of the sample correlation coefficient r depends on ρ. In this case the appropriate transformation is

z = (1/2) log_e( (1 + r) / (1 − r) )

(Fisher's z transformation; ref. DJ page 39). It has been shown that z is approximately normal with E(z) ≈ ζ = (1/2) log_e( (1 + ρ)/(1 − ρ) ) and Var(z) ≈ 1/(n − 3). Other useful transformations: if the variance of x is proportional to μ, then z = √x; if proportional to μ², then z = log x; if proportional to μ⁴, then z = 1/x.

2. Box-Cox transformations. The last three transformations above have z as a power of x; the Box-Cox method uses the data to determine the power.

APPLICATION TO CORRELATION

1. To make inferences on ρ we can apply the z transformation as follows. To test H0: ρ = ρ0, let ζ0 = (1/2) log_e( (1 + ρ0)/(1 − ρ0) ). Then the test statistic is

u = (z − ζ0) √(N − 3)

and the hypothesis is rejected if |u| exceeds the selected critical value from N(0, 1). Similarly, confidence intervals for ρ can be obtained by writing a confidence interval for ζ and solving for ρ. The confidence interval for ζ is

z ± z_{α/2} / √(N − 3)

and the interval for ρ follows from ρ = tanh(ζ). The chart on DJ page 37 is usually sufficient for describing confidence intervals; an alternative procedure is given on DJ page 39.

2. With independent samples of sizes N1 and N2 from two bivariate normal populations, the hypothesis of equal correlations is tested using the statistic

u = (z1 − z2) / √( 1/(N1 − 3) + 1/(N2 − 3) )

with an analogous procedure for developing confidence intervals.

VARIABLE GROUPING BASED ON CORRELATIONS

In Example 2.1, DJ considers creating new variables by grouping variables based on their correlations. He identifies five groups and considers defining five new variables by averaging the variables within a group. This is our first attempt at trying to simplify the data, that is, to reduce the number of variables that need to be examined. We will pursue this idea in later chapters. REFER TO SAS PROGRAM RRH CORRELATION. Use SAS to examine the correlation structure and identify any groupings.
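The Fisher z test and confidence interval can be sketched directly; the sample correlation, sample size, and null value below are illustrative numbers, not DJ's.

```python
import numpy as np
from scipy import stats

# Fisher z inference for a single correlation (illustrative numbers):
# test H0: rho = 0.5 and form a 95% confidence interval for rho.
r, N = 0.74, 28
z = np.arctanh(r)                       # z = (1/2) log((1 + r)/(1 - r))

zeta0 = np.arctanh(0.5)
u = (z - zeta0) * np.sqrt(N - 3)        # approximately N(0, 1) under H0
p_value = 2 * stats.norm.sf(abs(u))

half = stats.norm.ppf(0.975) / np.sqrt(N - 3)
lo, hi = np.tanh(z - half), np.tanh(z + half)   # back-transform rho = tanh(zeta)
print(round(p_value, 3), round(lo, 2), round(hi, 2))
```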
LECTURE 9 CANONICAL CORRELATION ANALYSIS

Introduction. The concept of canonical correlation arises when we want to quantify the association between two sets of variables. For example, suppose that the first set of variables, labeled arithmetic, records x1 = the speed of an individual in working problems and x2 = the accuracy. The second set of variables, labeled reading, consists of x3 = reading speed and x4 = comprehension. We can examine the six pairwise correlations, but in addition we may ask whether it makes sense to ask if arithmetic is correlated with reading. The answer is given by considering a linear combination of the arithmetic variables, say u, and a linear combination of the reading variables, say v, and using their correlation to represent the association between the groups. Thus we construct u = a1 x1 + a2 x2 and v = b1 x3 + b2 x4, and we seek coefficients so that this correlation is maximized. NOTE: Every text I know of uses u and v for these variables; SAS PROC CANCORR uses v and w. That is OK, but don't get confused.

Development. Suppose we have a vector of variables x that consists of two sets, x1 and x2, where x1 has length p1 and x2 has length p2. Assume that p1 ≤ p2. To develop the notation, let

x = ( x1 )   and   Var(x) = ( Σ11  Σ12 )
    ( x2 )                  ( Σ21  Σ22 )

The matrix Σ12 gives the covariances between the variables in set one and set two; in correlation form it gives the correlations. When p1 and p2 are moderately large, examining the p1 p2 correlations and drawing conclusions is not an easy task. As an alternative we consider the linear combinations u = a^T x1 and v = b^T x2. Note that

Var(u) = a^T Σ11 a,  Var(v) = b^T Σ22 b,  Cov(u, v) = a^T Σ12 b

We want to determine the vectors a and b so that

Corr(u, v) = a^T Σ12 b / √( (a^T Σ11 a)(b^T Σ22 b) )

is as large as possible. To this end we determine a and b as the solution to the problem: maximize a^T Σ12 b subject to a^T Σ11 a = 1 and b^T Σ22 b = 1. The variables so determined are called the first pair of canonical variables, u1 and v1. The second pair of canonical variables, u2 and v2, is similarly determined by linear combinations of x1 and x2 with unit variance and maximum correlation among all variables that are uncorrelated with the first pair. This reminds us of the discussion of principal components, and leads to the determination of eigenvalues and eigenvectors. The solution leads us to the stationary equations

−λ Σ11 a + Σ12 b = 0
Σ21 a − λ Σ22 b = 0

Multiplying the first equation by a^T and the second by b^T shows that λ = a^T Σ12 b. We thus seek λ so that

det( −λΣ11    Σ12  ) = 0
    (  Σ21   −λΣ22 )

The following result is useful: if the matrix A is written in partitioned form as

A = ( A11  A12 )
    ( A21  A22 )

then |A| = |A11| |A22 − A21 A11^{-1} A12| = |A22| |A11 − A12 A22^{-1} A21|.
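The partitioned determinant identity is easy to confirm numerically on an arbitrary well-conditioned matrix (the matrix below is invented for the check):

```python
import numpy as np

# Numerical check of the partitioned determinant identity
# |A| = |A11||A22 - A21 A11^{-1} A12| = |A22||A11 - A12 A22^{-1} A21|.
rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4)) + 4.0 * np.eye(4)   # well-conditioned example
A11, A12 = A[:2, :2], A[:2, 2:]
A21, A22 = A[2:, :2], A[2:, 2:]

d = np.linalg.det(A)
d1 = np.linalg.det(A11) * np.linalg.det(A22 - A21 @ np.linalg.inv(A11) @ A12)
d2 = np.linalg.det(A22) * np.linalg.det(A11 - A12 @ np.linalg.inv(A22) @ A21)
print(np.allclose([d1, d2], d))   # True
```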
Applying the second form of this result to our matrix, the determinant condition becomes

|−λΣ11| · |−λΣ22 + (1/λ) Σ21 Σ11^{-1} Σ12| = 0

Since λ is only involved in the last determinant, it follows that we can determine the values of λ by finding the eigenvalues of the matrix Σ11^{-1} Σ12 Σ22^{-1} Σ21 and taking square roots. The positive square root of the largest eigenvalue gives the largest correlation. Note that the matrix has at most p1 nonzero eigenvalues. To find a and b we return to the stationary equations. Recalling that λ ≠ 0 and multiplying the second equation by Σ22^{-1}, we see that

b = (1/λ) Σ22^{-1} Σ21 a

Substituting this into the first equation and rearranging terms, we see that a is given by the solution of

( Σ11^{-1} Σ12 Σ22^{-1} Σ21 − λ² I ) a = 0

That is, the vector a is the eigenvector corresponding to the eigenvalue λ². Similar computations show that the vector b is given by the solution of

( Σ22^{-1} Σ21 Σ11^{-1} Σ12 − λ² I ) b = 0

Thus the first pair of canonical variates is u1 = a1^T x1 and v1 = b1^T x2, with correlation ρ1. To find the second canonical pair (u2, v2) we solve the problem: maximize a2^T Σ12 b2 subject to a2^T Σ11 a2 = 1, b2^T Σ22 b2 = 1, a2^T Σ11 a1 = 0, and b2^T Σ22 b1 = 0. It follows that the squared correlation between u2 and v2 is λ2², the second largest eigenvalue of Σ11^{-1} Σ12 Σ22^{-1} Σ21, and the vectors a2 and b2 are obtained by solving the above equations using λ2². Although we did not specify this in the optimization problem, it also follows that a2^T Σ12 b1 = 0 and a1^T Σ12 b2 = 0. We can continue this for all nonzero eigenvalues.

Summary. The canonical variable pairs u_i = a_i^T x1 and v_i = b_i^T x2 as determined have the following properties: Corr(u_i, v_i) = ρ_i, and Corr(u_i, u_j) = 0, Corr(v_i, v_j) = 0, Corr(u_i, v_j) = 0 for i ≠ j. These properties can be summarized by the correlation matrix

( I_{p1}      Diag(ρ_i) )
( Diag(ρ_i)  I_{p2}     )

Example. Returning to the reading/arithmetic example, suppose the sample correlation matrix is given by

R = ( 1.0  .4  .5  .6 )     R11 = (  1  .4 )   R12 = ( .5  .6 )
    (  .4 1.0  .3  .4 )           ( .4   1 )         ( .3  .4 )
    (  .5  .3 1.0  .2 )     R21 = R12^T        R22 = (  1  .2 )
    (  .6  .4  .2 1.0 )                              ( .2   1 )

Note that it is best to apply the results to standardized data, and hence we use the correlation matrix.
We may then compute

A = R11^{-1} R12 R22^{-1} R21 ≈ ( .452  .289 )    B = R22^{-1} R21 R11^{-1} R12 ≈ ( .206  .251 )
                                ( .146  .095 )                                   ( .278  .340 )

The nonzero eigenvalues of these two matrices are the same, namely λ1² = .5457 and λ2² = .0009. Recall that we have specified that the variances of the u_i and v_i must be one, that is, a_i^T R11 a_i = 1 and b_i^T R22 b_i = 1. The eigenvectors as usually computed are normalized to have length one and do not satisfy this condition, so they must be rescaled. After rescaling, the coefficient matrices are

Â = ( .856  −.677 )     B̂ = ( .545  −.863 )
    ( .278  1.056 )          ( .737   .706 )

It follows that the first canonical pair is defined by

u1 = .856 z1 + .278 z2,  v1 = .545 z3 + .737 z4,  with correlation ρ1 = √.5457 = .74

The second canonical pair is defined by

u2 = −.677 z1 + 1.056 z2,  v2 = −.863 z3 + .706 z4,  with correlation ρ2 = √.0009 = .03

We see that the first pair captures most of the relation between arithmetic and reading. The canonical variate for arithmetic, u1, places over three times as much weight on speed as on accuracy, and the canonical variate for reading, v1, puts more weight on comprehension than on speed, roughly in the proportion 4 to 3. Note that this does not say, for example, that speed is three times as important as accuracy in arithmetic. It simply says that if we are asking for a measure of the relation between arithmetic and reading, these functions provide the essential component of that relation.
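The canonical correlations of this example are quickly reproduced from R with numpy:

```python
import numpy as np

# Reproducing the canonical correlations of the arithmetic/reading example.
R = np.array([[1.0, 0.4, 0.5, 0.6],
              [0.4, 1.0, 0.3, 0.4],
              [0.5, 0.3, 1.0, 0.2],
              [0.6, 0.4, 0.2, 1.0]])
R11, R12 = R[:2, :2], R[:2, 2:]
R21, R22 = R[2:, :2], R[2:, 2:]

A = np.linalg.inv(R11) @ R12 @ np.linalg.inv(R22) @ R21
eigvals = np.sort(np.linalg.eigvals(A).real)[::-1]
rho = np.sqrt(eigvals)              # canonical correlations

print(np.round(rho, 2))             # approximately [0.74 0.03]
```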
Interpretation of Canonical Variables. In general the canonical variables are artificial and may have no physical meaning. The interpretation is often aided by computing the correlations between the original variables and the canonical variables. To do this, note that the canonical variables are related to the original variables by u = Â^T z1 and v = B̂^T z2, where z_i denotes the standardized data from which the eigenvectors were determined. Recalling that the canonical variables have been standardized to have variance one, it follows that

Corr(u, z1) = Cov(u, z1) = Cov(Â^T z1, z1) = Â^T R11

Similarly, Corr(u, z2) = Â^T R12, Corr(v, z1) = B̂^T R21, and Corr(v, z2) = B̂^T R22.

Example. Returning to the arithmetic/reading example, we see that

Corr(u1, z1) = (.856, .278) R11 = (.97, .62)
Corr(v1, z2) = (.545, .737) R22 = (.69, .85)

Of the two variables in z1, u1 is most highly correlated with the first; of the two variables in z2, v1 is most highly correlated with the second. Similarly we obtain Corr(u1, z2) = (.51, .63) and Corr(v1, z1) = (.71, .46). As in our study of principal components, it is more informative to look at the correlations than at the eigenvectors.

Observations. It can be shown that the first canonical correlation is larger than any of the simple correlations in R12. If there is one variable in set one but several in set two, the squared canonical correlation is the squared multiple correlation R² in the regression of z1 on z2. In general, it can be shown that the squared multiple correlation for the regression of u_k on z2 is ρ_k²; this is also the squared multiple correlation for the regression of v_k on z1.

LECTURE 11 MULTIVARIATE ANALYSIS OF VARIANCE

Introduction. The one- and two-sample examples in Lecture 10 are special cases of a general methodology called the analysis of variance. For example, in the univariate case the two-sample problem can be described in linear model format as

x_ir = μ_i + e_ir,  for i = 1, 2 and r = 1, ..., n_i

Thus we have n1 observations on one population with mean μ1 and n2 observations on a second population with mean μ2. The populations are assumed to have equal variance σ², and our objective is to make inferences about the population means. In matrix form (illustrated for k = 2) we write the model as

( x1 )   ( J_{n1}   0     ) ( μ1 )   ( e1 )
( x2 ) = ( 0       J_{n2} ) ( μ2 ) + ( e2 )

where J_n denotes a column of n ones. The natural extension of this problem is one in which we observe k ≥ 2 populations; the matrix notation extends directly. The results of the preliminary analysis are presented in an analysis of variance (AOV) table as follows.

Label        df       Sum of Squares
Population   k − 1    Σ_i n_i (x̄_i − x̄)²
Residual     N − k    Σ_i Σ_r (x_ir − x̄_i)²

with mean squares given by the sums of squares divided by their degrees of freedom.
The ratio of the two mean squares gives the test statistic for the hypothesis H0: μ_i = μ_j for all i and j, and it is distributed as F(k − 1, N − k), where N is the total number of observations.

The Multivariate Analysis of Variance Model. Refer to DJ Chapter 11. The above model is extended to allow for a matrix of observations on each population as follows:

( X1 )   ( J_{n1}  0       ...  0      ) ( μ1 )
( X2 ) = ( 0       J_{n2}  ...  0      ) ( μ2 ) + e
( ...)   ( ...                         ) ( ...)
( Xk )   ( 0       0       ...  J_{nk} ) ( μk )

Here the μ_i are row vectors of length p, the X_i are data matrices with n_i rows and p columns, and the matrix of random errors e is N × p, with rows assumed to be independent N(0, Σ). In matrix form we write the model as

X = Aθ + e

The parameter estimates are given by

μ̂_i = x̄_i  and  Σ̂ = (1/(N − k)) Σ_i (X_i − J_{n_i} x̄_i)^T (X_i − J_{n_i} x̄_i)

Multivariate Hypotheses. In the univariate case we are interested in testing hypotheses about the rows of the parameter vector θ, in the simplest case the hypothesis that all of the μ_i are the same. We write such a hypothesis as H0: Lθ = h. In the multivariate case we may also be interested in testing hypotheses about the columns of θ, or combinations of the two. We write the general hypothesis as

H0: L θ M^T = h

To illustrate, consider the case k = 2 discussed in Example 10.6. Recall that we considered four different hypotheses.

1. The hypothesis of equal mean vectors, that is, μ1 = μ2. This corresponds in the present notation to letting L = (1, −1) and M = I_p.
2. The group-by-time interaction hypothesis, that is, parallel profiles. This corresponds to choosing L = (1, −1) and letting M be the successive-difference matrix from the Hotelling T² discussion; with p = 4,

M = ( 1 −1  0  0 )
    ( 0  1 −1  0 )
    ( 0  0  1 −1 )

3. The group main effect hypothesis, that is, equal averages over time. This corresponds to choosing L = (1, −1) and M = j^T, the row vector of ones.
4. The time main effect hypothesis. This corresponds to choosing L = (1, 1) and the M matrix from hypothesis 2.

Development of the Test Statistic. It is helpful to establish the dimensions of the matrices involved. From the linear model statement we see that X is N × p, A is N × k, and θ is k × p. From the hypothesis statement we assume that L is q × k and M is t × p.
Case 1: M = I. Such hypotheses are the analogs of those tested in the univariate case. To describe the test statistic we use the analogs of the residual and population sums of squares from the univariate discussion. These are given by

E = (N − k) Σ̂  and  H = (Lθ̂ − h)^T [ L (A^T A)^{-1} L^T ]^{-1} (Lθ̂ − h)

The likelihood ratio test statistic is based on

Λ* = |E| / |E + H|

The exact distribution of this statistic is not known except in special cases, but an approximation is given by −2 log λ ≈ χ²_d, where d = pq and, in the k-sample problem, q = k − 1. It follows that the test rejects the null hypothesis if

−N log( |E| / |E + H| ) > χ²_{pq}(α)

Note that we may write this as

−2 log λ = N log |I + HE^{-1}| = N Σ_i log(1 + λ_i)

where the λ_i are the eigenvalues of HE^{-1}. An improvement is given by replacing N by N − k − (p − q + 1)/2. The quantity Λ* is known as Wilks' Lambda and is one of the test statistics included in the SAS output. Other test statistics provided are:

1. Roy's greatest root test, based on the largest eigenvalue of HE^{-1}, say λ_max.
2. The Lawley-Hotelling test, based on T = tr(HE^{-1}).
3. Pillai's test, based on V = tr[ H (H + E)^{-1} ].

The distributions of these quantities can all be approximated by the F distribution.

Case 2: M ≠ I. The easiest way to explain this is to transform the model by the matrix M. Thus we take the original model X = Aθ + e and multiply on the right by M^T, yielding

X M^T = A θ M^T + e M^T,  written as  Z = AΓ + ε

We can then apply the results from Case 1 to this model. The net result is that we define E* = M E M^T and H* = M H M^T and apply the above results to these two matrices. In particular, the Wilks test rejects the hypothesis if

−[ N − k − (t − q + 1)/2 ] log( |E*| / |E* + H*| ) > χ²_{tq}(α)

Illustrate with DJ Example 10.6.

Confidence Intervals. Most texts recommend confidence intervals based on the Bonferroni method. For example, if you want confidence intervals on mean differences of the form μ_ri − μ_si, they are given by

x̄_ri − x̄_si ± t_{N−k}(α / 2m) √( σ̂_ii (1/n_r + 1/n_s) )

where m is the number of intervals (here m = p).
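The Case 1 computation can be sketched end to end on simulated data; the group means, sample sizes, and seed below are invented for illustration, with the means separated so that the test should reject.

```python
import numpy as np
from scipy import stats

# One-way MANOVA by hand on simulated data (k = 3 groups, p = 2 responses):
# E, H, Wilks' Lambda, and the chi-square approximation with the correction
# noted above.
rng = np.random.default_rng(1)
k, p, n_i = 3, 2, 30
means = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
groups = [rng.normal(size=(n_i, p)) + m for m in means]

N = k * n_i
xbar = np.vstack(groups).mean(axis=0)
H = sum(n_i * np.outer(g.mean(axis=0) - xbar, g.mean(axis=0) - xbar) for g in groups)
E = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)

wilks = np.linalg.det(E) / np.linalg.det(E + H)
q = k - 1
chi2_stat = -(N - k - 0.5 * (p - q + 1)) * np.log(wilks)
p_value = stats.chi2.sf(chi2_stat, df=p * q)
print(round(wilks, 3), p_value < 0.05)   # well-separated means: test rejects
```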
A general result, based on the union-intersection principle, gives simultaneous confidence intervals whose width is determined by the maximum characteristic root of HE^{-1}. Note that this root is also the basis for Roy's greatest root test.

Canonical Analysis. It seems appropriate to end these lectures with yet another eigenvalue problem. Again the idea is to try to reduce the dimensionality of the observations by considering linear combinations of the observed variables. To develop the idea, consider a linear combination of the p variates, that is, consider the data vector y = Xa. It follows that E(y) = Aθa and that the elements of y are independent with variance a^T Σ a = σ². Thus the linear model for y is

y = Aβ + η,  where β = θa and η = ea

For testing the hypothesis Lβ = 0 we have the test statistic

F = ( a^T H a / q ) / ( a^T E a / ν )

where H and E are defined as above, q is the number of rows of L, and ν = N − k. For our first canonical variable we seek a such that this F ratio is maximized. Using our usual methods, it follows that the maximum of the F ratio is attained at a1, the eigenvector associated with λ1, the largest eigenvalue of HE^{-1}, the maximized ratio being proportional to λ1. Continuing, the vector a2 corresponding to the second eigenvalue gives the canonical variable, uncorrelated with the first, with the next largest F ratio. We can now consider doing univariate analysis of variance on each of the canonical variables. Note: this is the basis for Roy's greatest root test. Note: recall that in the CANDISC program the canonical variates were constructed in this way; hence we now see the basis for that choice of canonical variates. Refer to DJ Example 11.1.

LECTURE 8 CLUSTER ANALYSIS

Introduction. Recall that in discriminant analysis we had data from several populations, and the objective was to determine a rule for assigning a future observation to one of the populations, that is, a rule for discriminating between the populations. In cluster analysis the situation is, in a sense, the reverse: we have data that may have come from several populations, but it is not known which population each observation came from. The objective is to group the observations into subgroups that might correspond to different populations.
is to group the observations into subgroups that might correspond to different populations. The idea is to group together observations that are 'similar'. That raises the questions of how we define similarity and how similar observations have to be to be put into the same group.

Measures of Similarity/Dissimilarity

The definition of 'similar' is subjective. One idea is to use the distance between two points. Some of the suggestions for measuring distance are:

1. Euclidean Distance. Given two p-vectors x and y, let

d²(x, y) = (x − y)'(x − y) = Σ_{k=1}^p (x_k − y_k)²,

and let d(x, y) = sqrt(d²(x, y)).

2. Absolute Distance.

d(x, y) = Σ_{k=1}^p |x_k − y_k|.

3. Standardized Euclidean Distance. Given the N by p data matrix X, compute the vector of sample means x̄ and the sample standard deviations s_1, ..., s_p, and let the standardized data matrix be Z = (X − 1x̄')Diag(1/s_1, ..., 1/s_p). Note that all of the data are used to compute the means and standard deviations; hence these are inflated by between-group differences. However, this does remove problems introduced by scales that are not comparable. To illustrate, suppose we have two observations x1 = (1, 96) and x2 = (2, 108). Then d²(x1, x2) = 1 + 144 = 145. If the second variable was measured in inches and we change to feet, then x1 = (1, 8) and x2 = (2, 9), and d²(x1, x2) = 1 + 1 = 2. The standardized variables are commonly used.

4. Mahalanobis Distance. For some symmetric matrix A, define d²(x, y) = (x − y)'A(x − y). The use of the common covariance matrix S defined above is not recommended.

We will use the standardized Euclidean distance in our discussions. An alternative to using distance to define similarity is to use one or more of the graphical techniques described in DJ, Chapter 3. Thus if p = 2 we could construct a scatterplot of the data and see if it suggests a grouping. Alternatively, we could use the star plots or the Chernoff faces to suggest observations that are similar. Similarly, the Andrews plots could be used. Since these are very subjective, we will concentrate on methods based on distances.

METHODS

The methods fall into two general classes: hierarchical and non-hierarchical. The first class includes what are called the single-, average-, and complete-linkage methods.
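The first three distance measures above can be sketched in a few lines (Python is used here rather than the SAS referenced later in these notes); the two-point data are the inches/feet illustration from the standardized-distance discussion.

```python
import numpy as np

def sq_euclidean(x, y):
    """Squared Euclidean distance d^2(x, y) = (x - y)'(x - y)."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(d @ d)

def absolute_distance(x, y):
    """Absolute (city-block) distance: sum_k |x_k - y_k|."""
    return float(np.sum(np.abs(np.asarray(x, float) - np.asarray(y, float))))

# The scale illustration from the notes: the second variable is measured
# in inches (x1 = (1, 96), x2 = (2, 108)) and then converted to feet.
print(sq_euclidean([1, 96], [2, 108]))  # 145.0
print(sq_euclidean([1, 8], [2, 9]))     # 2.0

# Standardized Euclidean distance: standardize each column of the data
# matrix by its mean and standard deviation before computing distances.
X = np.array([[1.0, 96.0],
              [2.0, 108.0]])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(round(sq_euclidean(Z[0], Z[1]), 6))  # 4.0 -- unchanged by the units
```

Note how the raw squared distance changes from 145 to 2 under a mere change of units, while the standardized distance does not; this is the motivation for standardizing before clustering.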
The second class includes the K-means method. A brief discussion follows.

Hierarchical Methods

Single Linkage Clustering. The idea is to group observations that are near neighbors, that is, observations that are close. The method is iterative. A brief description of the steps follows, and an example will make it clear.

Step 1. Given the N by p standardized data matrix Z, compute the Euclidean distances d_ij as defined above, and think of these distances as being displayed in a matrix D. Thus for N = 3,

D1:
  [ 0           ]
  [ d21  0      ]
  [ d31  d32  0 ]

Our first clustering consists of N clusters with one observation in each. We now form a new clustering consisting of N − 1 clusters, where one cluster consists of the two points with minimum distance and the remaining clusters have only one observation. Thus if d_gh is minimum, observations g and h are grouped together.

Step 2. Form a new distance matrix D2 of size N − 1 by deleting the rows and columns g and h from D1 and adding a new row labeled (gh). The distances associated with this cluster are defined as d_(gh),k = min(d_gk, d_hk). Again we pick the smallest distance and do the grouping as described above. The new cluster could consist of two new points, say u and v, or it could define the three-observation cluster, say (g, h, w).

Step 3. A new matrix D3 is formed either by deleting the rows u and v and adding a new row labeled (u, v), or by deleting the rows w and (gh) and adding a new row labeled (g, h, w). The process is continued until a suitable clustering is defined. Note that if it is continued to completion, all observations will be in one cluster. We need a criterion for stopping before this state.

Example. Suppose we have five observations and the initial distance matrix is given by

        1    2    3    4    5
  1     0
  2     9    0
  3     3    7    0
  4     6    5    9    0
  5    11   10    2    8    0

Note that d_53 = 2 is minimum, and hence observations 3 and 5 are grouped into the cluster (35). The distances from the other clusters to this cluster are given by

d_(35),1 = min(d31, d51) = min(3, 11) = 3
d_(35),2 = min(d32, d52) = min(7, 10) = 7
d_(35),4 = min(d34, d54) = min(9, 8) = 8

We form D2 by deleting the rows 3 and 5 from D1 and adding the row (35). Thus, with rows and columns labeled (35), 1, 2, 4,
D2:
  (35)   0
   1     3    0
   2     7    9    0
   4     8    6    5    0

The minimum distance is now d_(35),1 = 3, and so we place observation 1 into the cluster (135). The new distances are defined by

d_(135),2 = min(d_(35),2, d_12) = min(7, 9) = 7
d_(135),4 = min(d_(35),4, d_14) = min(8, 6) = 6

The new distance matrix is formed by deleting the rows 1 and (35) from D2 and adding the row (135). Thus

D3:
  (135)  0
    2    7    0
    4    6    5    0

The minimum distance is now given by d_24 = 5, and so a new cluster is formed for this pair of points. We now have two clusters, (135) and (24), and their distance is given by

d_(135),(24) = min(d_(135),2, d_(135),4) = min(7, 6) = 6.

The new distance matrix is given by

         (135)  (24)
  (135)    0
  (24)     6     0

The final step would place all observations in the same cluster. The question of interest is: when should we stop this process? As an aid to this decision we may construct a dendrogram. Such a diagram is illustrated in DJ, page 326. For our example, the dendrogram joins observations 3 and 5 at height 2, adds observation 1 at height 3, joins observations 2 and 4 at height 5, and merges the two clusters at height 6. It seems clear that in this example two clusters stand out, but that is not always the case. There are other criteria for making this decision, and we will discuss them in conjunction with the SAS programs.

Average Linkage Clustering. This is similar to single linkage except that the distance between two clusters is given by the average distance between all pairs of items where one member of the pair belongs to each cluster.

Complete Linkage. Again this is similar to single linkage except that at each stage the distance between two clusters is determined by the two elements, one from each cluster, that are most distant. This ensures that all items in a cluster are within some maximum distance of each other.

Non-Hierarchical Methods

K-Means Method.

Step 1. Partition the data into K initial clusters. This may be done at random.

Step 2. Determine the centroid (that is, the mean) for each cluster. For each observation, reassign it to the cluster whose centroid is closest; that is, compute the distance to each centroid and assign the observation to the one for which this distance is smallest. It is best to use the standardized data, and one normally uses the
Euclidean distance. If an observation is reassigned, recompute the centroid for the cluster receiving the new observation and for the one losing that observation.

Step 3. Repeat Step 2 until no more reassignments are made.

Alternatively, we could begin the procedure by specifying K initial centroids and proceed as in Step 2.

Example. Suppose p = 2 and we have N = 4 observations. The data are as follows:

Item   x1   x2
 A      5    3
 B     −1    1
 C      1   −2
 D     −3   −2

The objective is to divide the four items into K = 2 clusters. To begin, we arbitrarily partition the items into two clusters, (AB) and (CD), and compute the cluster centroids. The following table shows this computation:

Cluster   x̄1   x̄2
 (AB)      2     2
 (CD)     −1    −2

At Step 2 we begin with item A and compute its squared distance to the cluster centroids. Thus

d²(A, AB) = (5 − 2)² + (3 − 2)² = 10
d²(A, CD) = (5 + 1)² + (3 + 2)² = 61

Since A is closer to (AB), it is not reassigned. Continuing, we examine item B:

d²(B, AB) = (−1 − 2)² + (1 − 2)² = 10
d²(B, CD) = (−1 + 1)² + (1 + 2)² = 9

Since B is closer to cluster (CD), it is reassigned, and we update the centroids as follows:

Cluster   x̄1   x̄2
 (A)       5     3
 (BCD)    −1    −1

Note d²(A, A) = 0, d²(A, BCD) = (5 + 1)² + (3 + 1)² = 52, and d²(B, A) = (−1 − 5)² + (1 − 3)² = 40; the remaining squared distances are shown in the following table:

              Item
Cluster    A     B     C     D
 (A)       0    40    41    89
 (BCD)    52     4     5     5

Since each item is currently assigned to the cluster with closest centroid, the procedure is terminated. Since the procedure depends on the initial choice of clusters and on the number of clusters, it is suggested that it be repeated for different choices. In particular, the specification of K could lead to unusual clusterings, and outlying observations can produce unusual clusters. This method is useful when the data set is large. The SAS procedure FASTCLUS uses this method.

Number of Clusters

The determination of the number of clusters is not well defined. Clearly we would like to identify a small number. The dendrogram is helpful in this regard. Several numerical criteria have been suggested. The cubic clustering criterion (CCC) suggested by DJ is helpful. The idea is that CCC > 2 suggests good
clusters; CCC < 2 suggests potential clusters. A plot of CCC versus the number of clusters is examined; as the number of clusters is increased, we look for jumps or leveling-off points. These are often not well defined. An alternative is the Hotelling T² statistic. If two clusters were just combined, it tests the hypothesis that the cluster means are the same. If this is rejected (i.e., T² is large), we look for a large value and move up one to the next largest number of clusters. We will discuss T² in Chapter 10.

Problems. Sometimes the clusters that are formed are stringy or even parabolic in shape. A procedure called ACECLUS forms canonical variables based on the original data, using computations similar to CANDISC. The clustering is then done on these variables and often produces better results.

Example. Data are collected on birth rates, death rates, and infant death rates for 97 countries. We want to use these data to see if we can identify clusters for these countries. We will use the SAS procedure CLUSTER to do this. The problem will also illustrate the virtues of the ACECLUS procedure.

Clustering With Categorical Variables

In some applications the variable that we observe is simply the answer to a question, say yes or no. This is usually coded as a 'one' or 'zero' so that we can apply our standard results for clustering, but it is of interest to examine this situation further. To illustrate, suppose there are five such variables in our study and the responses for two individuals are given as follows:

Variable    1   2   3   4   5
Case i      1   0   0   1   1
Case k      1   1   0   1   0

Note that these two cases agree on variables one, three, and four, and disagree on variables two and five. If we compute the squared Euclidean distance for these two cases, we get

Σ_{j=1}^5 (x_ij − x_kj)² = 2.

Frequently these variables are treated in the same way as the continuous variables in clustering programs. As a measure of similarity, this distance may be misleading. That is, a 1-1 match may imply a stronger indication of similarity than a 0-0 match. For
example, suppose the question is whether the individual can speak a particular foreign language, say Greek. If both respondents answer yes, that is a 1-1 match, and it gives a stronger indication of similarity than if they both answer no. It seems reasonable, then, to define a different measure of distance for these cases. To see how this might be done, suppose we summarize the data for the above example in a two-way table as follows:

                  Case k
               1       0       Totals
Case i   1    a = 2   b = 1    a + b = 3
         0    c = 1   d = 1    c + d = 2
Totals      a + c = 3  b + d = 2  T = 5

Thus a is the number of 1-1 matches, b the number of 1-0 matches, and so on. Several measures of similarity to be used in the distance table have been suggested. For example:

1. (a + d)/T. Gives equal weight to 1-1 and 0-0 matches.
2. a/T. Ignores 0-0 matches in the numerator.
3. a/(a + b + c). Ignores 0-0 matches entirely.
4. a/(b + c). Ratio of matches to mismatches, ignoring 0-0 matches.

For our example, the similarity measure for the two cases using the first idea is 3/5.

Example. A more extensive example is given in Johnson and Wichern. Six variables are measured: height, weight, eye color, hair color, handedness, and gender. They are coded as follows:

X1 = 1 if height ≥ 72, 0 if height < 72
X2 = 1 if weight ≥ 150, 0 if weight < 150
X3 = 1 if brown eyes, 0 other
X4 = 1 if blond hair, 0 other
X5 = 1 if right-handed, 0 if left-handed
X6 = 1 if female, 0 if male

Note: Suppose you had three choices for hair color: blond, brown, and other. Then you must use two variables for hair color,

X4 = 1 if blond, 0 other, and X4' = 1 if brown, 0 other.

It is not appropriate to use, for example, x4 = 1 if blond, 2 if brown, 3 if other, as this places both an order and a scale on the arbitrary assignment of these numbers.

The data for their example are as follows:

            Variable
Case    1   2   3   4   5   6
  1     0   0   0   1   1   1
  2     1   1   1   0   1   0
  3     0   1   0   1   1   0
  4     0   0   1   0   1   1
  5     1   1   1   0   0   0

Using the similarity measure (a + d)/T, the similarity matrix is given by

        1      2      3      4      5
  1     1
  2     1/6    1
  3     4/6    3/6    1
  4     4/6    3/6    2/6    1
  5     0      5/6    2/6    2/6    1

In this example, cases 2 and 5 are most similar and would form the first cluster. We then determine new similarities and proceed to look for more clusters. These similarity indices are motivated by the fact
that a 1-1 match may imply a stronger similarity than a 0-0 match. That is, two people, one with brown hair and one with red hair, would otherwise be judged to be similar simply because both answer 'no' on the blond variable. As an alternative, we could modify the survey to ask for more detail (i.e., blond, brown, red), use the coding suggested above, and just compute distances on the zero-one variables. If we had asked for actual height and weight, these continuous variables could also be used in the clustering. We would scale them so that they do not dominate the distance measure. Standardizing all variables, both continuous and discrete, would seem to make them comparable.

NOTE: I will skip the section on Multidimensional Scaling for now and return later if time permits.

LECTURE 10 INFERENCE ON MEAN VECTORS

Introduction

In this section we will generalize the concept of t-tests to the multivariate situation. In the two-sample problem we will assume equality of covariance matrices; later we will develop tests for equality of covariance matrices and for other patterns on the covariance matrix. The concept of t-tests generalizes in the univariate case to the analysis of variance, and we will make that generalization in the multivariate case. We will see that SAS does not have a PROC specifically devoted to the multivariate t-test, so we will develop an IML program. Later we will see how to use the multivariate analysis of variance program to perform the test.

Hotelling T² Statistic

This concept is discussed in DJ, Section 10.2, page 408. The problem is as follows. Given a random sample of size n from a p-variate normal distribution N_p(μ, Σ), that is, the data matrix X, we want to test the hypothesis H: μ = μ0, where μ0 is a specified p-vector. For motivation, recall the special case p = 1. In that case we computed

t = (x̄ − μ0) / (s/√n)

and rejected the hypothesis if |t| > t_{α/2}(n − 1). Equivalently, we could compute F = t² and reject the hypothesis if F > F_α(1, n − 1). Note that we can write

t² = n(x̄ − μ0)(s²)⁻¹(x̄ − μ0).

In the multivariate case the test statistic is given by

T² = n(x̄ − μ0)'Σ̂⁻¹(x̄ − μ0),

noting
that x̄ and μ0 are vectors of length p. The hypothesis is rejected if

T² > [p(n − 1)/(n − p)] F_α(p, n − p),

or equivalently if

[(n − p)/(p(n − 1))] T² > F_α(p, n − p).

To motivate the statistic, note that this problem is similar to the classification problem. That is, did these data come from a population with mean μ0 or from a population with some other mean? The decision is based on the distance from x̄ to μ0. Equivalently, it is based on the ratio of the likelihood functions for μ = μ0 and μ = x̄.

Example. Recall the SWEAT data from a previous exercise, and suppose the question is raised as to whether these 20 females were selected from a population of females whose mean vector is known to be μ0' = (4, 50, 10) on the three variables x1 = sweat rate, x2 = sodium content, and x3 = potassium content. From the 20 observations we compute

x̄' = (4.64, 45.40, 9.97)

and

      [  2.9   10.0   −1.8 ]
Σ̂ =  [ 10.0  200.0   −5.6 ]
      [ −1.8   −5.6    3.6 ]

It follows that T² = 9.74 and [(n − p)/(p(n − 1))] T² = 2.90. With α = 0.10 we see that F_{0.10}(3, 17) = 2.44, and the hypothesis is rejected.

Invariance. Since the variances in this example are quite different, it is natural to ask if we should have standardized the data before performing the test. The answer is that it does not make any difference. In fact, if C is any nonsingular matrix of size p and d is any p-vector, we can compute y = Cx + d, and the T² statistic based on y will be identical to the one based on x.

Development of the T² Statistic

For those interested, here are two developments of the T² statistic.

Likelihood Ratio Test. Given n observations on a p-variate normal distribution, the likelihood function is given by

L(μ, Σ) = (2π)^{−np/2} |Σ|^{−n/2} exp{−(1/2) tr[Σ⁻¹A(μ)]},

where A(μ) = (X − 1μ')'(X − 1μ'). Letting A(x̄) = (X − 1x̄')'(X − 1x̄'), we can write

A(μ0) = A(x̄) + n(x̄ − μ0)(x̄ − μ0)'.

The likelihood ratio test leads to the ratio

Λ = |A(x̄)| / |A(μ0)|,

and we reject the hypothesis H: μ = μ0 if Λ is 'small'. To relate this to the T² statistic, consider the determinant

|A(μ0)| = |A(x̄) + n(x̄ − μ0)(x̄ − μ0)'| = |A(x̄)| [1 + n(x̄ − μ0)'A(x̄)⁻¹(x̄ − μ0)].

It follows that

Λ⁻¹ = 1 + T²/(n − 1),

and we reject the hypothesis if T² is large.

Union-Intersection Test. Let a be an arbitrary p-vector and consider y = a'x ~ N(a'μ,
a'Σa). Then consider the hypothesis H_a: μ_y = μ_{y0} based on the data y = Xa. The t-statistic for this univariate test is

t(a) = √n (ȳ − μ_{y0}) / s_y.

Noting that μ_{y0} = a'μ0, ȳ = a'x̄, and s_y² = a'Σ̂a, we see that

t²(a) = n [a'(x̄ − μ0)]² / (a'Σ̂a),

and we would accept the hypothesis if t²(a) < F_α(1, n − 1). Now, μ = μ0 is equivalent to a'μ = a'μ0 for every vector a. Thus acceptance of the hypothesis H: μ = μ0 is equivalent to accepting the hypothesis H_a for every vector a, and this implies that

max_a t²(a) < F_α(1, n − 1).

Thus we seek a to maximize t²(a), or equivalently we consider the problem

max_a n a'(x̄ − μ0)(x̄ − μ0)'a   subject to a'Σ̂a = 1.

This leads to the system of equations

[n(x̄ − μ0)(x̄ − μ0)' − λΣ̂] a = 0.

Since the matrix n(x̄ − μ0)(x̄ − μ0)' has rank one, there is only one nonzero eigenvalue and eigenvector, given by

λ = n(x̄ − μ0)'Σ̂⁻¹(x̄ − μ0),   a = Σ̂⁻¹(x̄ − μ0).

Thus the maximum is given by the T² statistic, and we are led to the same test.

Confidence Regions and Intervals

It is usually more informative to give a confidence interval on a parameter than simply an accept/reject decision. In the univariate case we recall the familiar confidence interval for μ,

μ ∈ x̄ ± t_{α/2}(n − 1) √(s²/n).

The analog of that in the multivariate case is the confidence region given by the p-dimensional ellipsoid centered at x̄,

{μ0 : n(x̄ − μ0)'Σ̂⁻¹(x̄ − μ0) ≤ [p(n − 1)/(n − p)] F_α(p, n − p)}.

Thus we accept as reasonable any vector μ that lies in this ellipsoid. Since this is hard to visualize in more than two dimensions, a common practice is to write confidence intervals on the individual components or on linear functions of them. Since we are writing several intervals, the confidence coefficient based on a single interval is no longer applicable, and alternatives have been suggested. For example, if we are writing k intervals, a simple suggestion is to use t_{α/(2k)}(n − 1) in the above univariate expression, with s² replaced by the corresponding diagonal element of Σ̂. This is known as the Bonferroni method. In addition to the p components of μ we may be interested in other linear functions, say a'μ, and k can be quite large, making these intervals very wide. An alternative suggested by the union-intersection principle is to use
the intervals

a'μ ∈ a'x̄ ± √{ (a'Σ̂a/n) [p(n − 1)/(n − p)] F_α(p, n − p) }.

These intervals, called simultaneous confidence intervals, although wider than the simple t-statistic intervals, more correctly reflect the confidence coefficient. Since we often write many such intervals, they are generally better than the Bonferroni intervals, which are given by

a'μ ∈ a'x̄ ± t_{α/(2k)}(n − 1) √(a'Σ̂a/n).

GENERAL LINEAR HYPOTHESES

As a generalization of the hypothesis H: μ = μ0, we consider the hypothesis H0: Mμ = m, where M is a matrix of size q × p of rank q and m is a q-vector, usually zero. We will encounter several M matrices.

Testing for Equality of the Elements of μ. The initial hypothesis H: μ = μ0 is a special case of this with M = I and m = μ0. Another hypothesis is that all of the elements of μ are equal. With p = 4, consider the matrix

      [ 1  −1   0   0 ]
M =  [ 0   1  −1   0 ],   m = 0.
      [ 0   0   1  −1 ]

Thus Mμ = (μ1 − μ2, μ2 − μ3, μ3 − μ4)', and we see that this matrix is testing the hypothesis that all elements of the mean vector are the same. The T² statistic generalizes as follows:

T² = n(Mx̄ − m)'(MΣ̂M')⁻¹(Mx̄ − m),

and we reject the hypothesis if

T² > [q(n − 1)/(n − q)] F_α(q, n − q).

The concept of simultaneous confidence intervals extends directly. Thus we can write intervals on linear functions of the form a'Mμ:

a'Mμ ∈ a'Mx̄ ± √{ (a'MΣ̂M'a/n) [q(n − 1)/(n − q)] F_α(q, n − q) }.

For example, with M as given above and a' = (1, −1, 0), we have a'M = (1, −2, 1, 0) and hence the function a'Mμ = μ1 − 2μ2 + μ3.

The Analysis of Repeated Measurements

Often the p responses on a subject are taken over time. That is, the p columns of the X matrix represent observations taken at various points in time. In this case we may ask if there is a difference in response at different points in time. The hypothesis of equal means discussed above would be appropriate. If this hypothesis is rejected, we might ask if there is a linear trend; that is, is the rate of change from one time to the next constant? Assuming that the time periods are equally spaced, this says that the difference between consecutive means is constant. The hypothesis of a linear trend is described for p = 4 as

H0: μ2 − μ1 = μ3 − μ2 = μ4 − μ3.

This is equivalent
to the hypothesis

H0: μ1 − 2μ2 + μ3 = 0 and μ2 − 2μ3 + μ4 = 0,

and the appropriate matrix is given by

M = [ 1  −2   1   0 ],   m = 0.
     [ 0   1  −2   1 ]

A plot of the sample means over time helps to visualize the situation. If the hypothesis of a linear trend is accepted, we would see that these means lie roughly on a straight line, and we might be interested in the equation of that line. This suggests fitting a linear regression model to the sample means. Thus we would consider the linear model

x̄_j = b0 + b1 t_j + e_j,

where t_j denotes the j-th time period. Ordinary linear regression is not appropriate in this case since the x̄_j are not independent. Recall that if x̄ denotes the vector of means, then Var(x̄) = Σ/n. We thus consider a weighted least squares estimator. To describe this, let B denote the matrix whose first column is all ones and whose second column is (t1, t2, ..., tp)'. Then the estimator is given by

(b0, b1)' = (B'Σ̂⁻¹B)⁻¹ B'Σ̂⁻¹ x̄.

Clearly one might consider a more complex relation, for example a quadratic in time. Refer to RRH DJEX105 to illustrate this analysis. Note the comment on page 421.

THE TWO SAMPLE PROBLEM

Suppose we have data from two normal populations with the same covariance matrix Σ but possibly different mean vectors μ1 and μ2. We wish to test the hypothesis H: μ1 = μ2. Let the data matrices be X1 of size n1 × p and X2 of size n2 × p. Define the sample mean vectors x̄1 and x̄2 and the sample covariance matrices Σ̂1 and Σ̂2. Since we are assuming the covariance matrices are equal, we compute the 'pooled' covariance matrix

Σ̂ = [(n1 − 1)Σ̂1 + (n2 − 1)Σ̂2] / (n1 + n2 − 2).

For testing the hypothesis H: μ1 − μ2 = δ, the test statistic is given by

T² = [n1n2/(n1 + n2)] (x̄1 − x̄2 − δ)'Σ̂⁻¹(x̄1 − x̄2 − δ).

The hypothesis is rejected if

T² > [p(n1 + n2 − 2)/(n1 + n2 − p − 1)] F_α(p, n1 + n2 − p − 1).

In general we may consider a hypothesis of the form H: M(μ1 − μ2) = δ, where M is a matrix of size q × p of rank q. The test statistic is

T² = [n1n2/(n1 + n2)] [M(x̄1 − x̄2) − δ]'(MΣ̂M')⁻¹[M(x̄1 − x̄2) − δ],

and the hypothesis is rejected if

T² > [q(n1 + n2 − 2)/(n1 + n2 − q − 1)] F_α(q, n1 + n2 − q − 1).

Simultaneous confidence intervals on functions of the form a'M(μ1 − μ2) are given by a'M(μ1 − μ2) ∈ a'M(x̄1 − x̄2)
± √{ (1/n1 + 1/n2) a'MΣ̂M'a [q(n1 + n2 − 2)/(n1 + n2 − q − 1)] F_α(q, n1 + n2 − q − 1) }.

For example, with a' = (1, 0, ..., 0) we are interested in the first row of M; with a' = (1, −1, 0, ..., 0) we are interested in the difference of the first two rows; etc.

Two Group Repeated Measures: Profile Analysis

Recall the concept of repeated measurements discussed earlier. That is, we now assume that we have data on two populations taken over time and are interested in examining the relation beyond simply asking if the mean vectors are the same. Several hypotheses are of interest. Again these are motivated by a plot of the sample means for the two groups as a function of time. This plot is sometimes called a "profile". Examining this plot suggests three questions that we might ask. REFER TO PROFILE PLOT.

1. Are the profiles for the two groups similar, in the sense that the line segments between adjacent tests are parallel?
2. If the profiles are parallel, are they at the same level?
3. If the profiles are parallel, are the means of the tests different?

Another way to look at the profile plot is to display the means for the two groups in a two-way table.

Question 1: Parallelism Hypothesis (Group by Time Interaction)

To answer the first question, we note that the parallelism condition is that the difference of the means in consecutive time periods is the same in each group. Thus we want to test the hypothesis

H0: μ11 − μ12 = μ21 − μ22,  μ12 − μ13 = μ22 − μ23,  μ13 − μ14 = μ23 − μ24,  etc.

The appropriate matrix in this case, for p = 4, is

      [ 1  −1   0   0 ]
M =  [ 0   1  −1   0 ],
      [ 0   0   1  −1 ]

and again we test the hypothesis H0: M(μ1 − μ2) = 0. In analysis of variance terminology this is known as the "group by time interaction" hypothesis.

Question 2: Equal Average Over Time (Group Main Effect)

Let μij denote the mean response in group i (i = 1, 2) at time j (j = 1, 2, ..., p). Suppose we want to check that the average effect over time is the same in the two groups. The hypothesis of interest is

H0: Σ_{j=1}^p μ1j = Σ_{j=1}^p μ2j.

In our general notation this corresponds to letting M = (1, 1, ..., 1) and testing the hypothesis H0: M(μ1 − μ2) = 0. In analysis of variance terminology this is known as the "marginal means
main effect" test. Note that we can test this hypothesis even if the parallelism hypothesis is rejected, but the interpretation is not as strong.

Question 3: Are the Means of the Tests Different? (Time Main Effect)

In this case the hypothesis is stated as

H0: Σ_{i=1}^2 μij is the same for all j = 1, ..., p,

or, in matrix notation, the hypothesis is written as H0: M(μ1 + μ2) = 0 with

      [ 1  −1   0   0 ]
M =  [ 0   1  −1   0 ].
      [ 0   0   1  −1 ]

The test statistic is given by

T² = [M(x̄1 + x̄2)]' [(1/n1 + 1/n2) MΣ̂M']⁻¹ [M(x̄1 + x̄2)],

and the hypothesis is rejected if

T² > [q(n1 + n2 − 2)/(n1 + n2 − q − 1)] F_α(q, n1 + n2 − q − 1).

Refer to RRH DJEX106 to illustrate this analysis. In the next section we will extend this concept to more than two groups.
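To close, here is a short numeric sketch of the T² machinery from this lecture (in Python, since the SAS/IML programs referenced above are not reproduced here). The first part recomputes the one-sample SWEAT statistic from the rounded summary values quoted earlier; because the inputs are rounded, T² comes out near 9.66 rather than the 9.74 obtained from the unrounded data. The second part illustrates the profile-analysis contrast matrices; the two profile mean vectors at the end are invented purely for illustration.

```python
import numpy as np

# One-sample Hotelling T^2 for the SWEAT example, using the rounded
# summary statistics quoted in the notes (n = 20, p = 3).
n, p = 20, 3
xbar = np.array([4.64, 45.40, 9.97])
mu0 = np.array([4.0, 50.0, 10.0])
Sig = np.array([[ 2.9,  10.0, -1.8],
                [10.0, 200.0, -5.6],
                [-1.8,  -5.6,  3.6]])

d = xbar - mu0
T2 = n * d @ np.linalg.solve(Sig, d)   # n (xbar - mu0)' Sig^{-1} (xbar - mu0)
F = (n - p) / (p * (n - 1)) * T2       # scaled to an F(p, n - p) statistic
# T2 is about 9.66 here; F is about 2.88 > F_{0.10}(3, 17) = 2.44, so reject.
print(round(T2, 2), round(F, 2))

# Profile-analysis contrast matrices for p = 4 occasions.
M_par = np.array([[1, -1,  0,  0],
                  [0,  1, -1,  0],
                  [0,  0,  1, -1]], float)   # Questions 1 and 3
M_lev = np.ones((1, 4))                      # Question 2

# Hypothetical parallel profiles: group 2 is group 1 shifted up by 3.
mu1 = np.array([10.0, 12.0, 15.0, 14.0])
mu2 = mu1 + 3.0
print(M_par @ (mu1 - mu2))   # all zeros: the profiles are parallel
print(M_lev @ (mu1 - mu2))   # [-12.]: the levels differ by p * 3
```

In practice the contrasts would be applied to the sample means and the pooled Σ̂, exactly as in the T² formulas above; the point of the last two lines is only that the difference matrix annihilates a constant vertical shift while the row of ones does not.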