# Applied Multivariate Analysis ISQS 6348

TTU


These 17 pages of class notes were uploaded by Agustina Batz on Thursday, October 22, 2015. The notes belong to ISQS 6348 at Texas Tech University, taught by Westfall in Fall. Since their upload, they have received 22 views. For similar materials see /class/226412/isqs-6348-texas-tech-university in Informational Systems at Texas Tech University.


Date Created: 10/22/15

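## ISQS 6348 Midterm, Spring 2000

Instructions: Open book and notes. Points (out of 100) in parentheses.

1. Data vectors $X_1, X_2, \ldots, X_{25}$ are to be sampled from a population whose mean vector and covariance matrix are

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} \quad\text{and}\quad \Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}.$$

1a (2) Find $E(X_2)$ (hint: it is identical to $E(X_1)$).

Solution: $E(X_2) = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$.

1b (3) Find $\mathrm{Cov}(X_{25})$ (hint: it is identical to $\mathrm{Cov}(X_1)$).

Solution: $\mathrm{Cov}(X_{25}) = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}$.

1c (2) Let $a' = (.5 \;\; .5)$ for the remainder. Find $E(a'X_1)$.

Solution: $E(a'X_1) = a'E(X_1) = (.5 \;\; .5)\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = .5\mu_1 + .5\mu_2 = (\mu_1 + \mu_2)/2$.

1d (3) Find $\mathrm{Var}(a'X_1)$.

Solution: $\mathrm{Var}(a'X_1) = a'\,\mathrm{Cov}(X_1)\,a = (.5 \;\; .5)\begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}\begin{pmatrix} .5 \\ .5 \end{pmatrix} = .25\sigma_{11} + .5\sigma_{12} + .25\sigma_{22}$.

1e (3) Let $\bar X = \frac{1}{25}\sum_{i=1}^{25} X_i$. Find $E(a'\bar X)$.

Solution: $E(a'\bar X) = a'E(\bar X) = (.5 \;\; .5)\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = (\mu_1 + \mu_2)/2$.

1f (5) Find $\mathrm{Var}(a'\bar X)$.

Solution: $\mathrm{Var}(a'\bar X) = a'\,\mathrm{Cov}(\bar X)\,a = a'(\Sigma/25)a = .01\sigma_{11} + .02\sigma_{12} + .01\sigma_{22}$.

1g (2) Find the standard deviation of $a'\bar X$.

Solution: $\sqrt{.01\sigma_{11} + .02\sigma_{12} + .01\sigma_{22}}$.

2 (20) Suppose you now have carried out the study in question 1, and you have collected the data. Suppose the sample mean vector and sample covariance matrix are

$$\bar X = \begin{pmatrix} 30.0 \\ 10.0 \end{pmatrix} \quad\text{and}\quad S = \begin{pmatrix} 16.0 & 4.0 \\ 4.0 & 4.0 \end{pmatrix}.$$

Find the 95% confidence interval for $(\mu_1 + \mu_2)/2$.

Solution: The general form of confidence interval inferences is $\hat\theta \pm c_\alpha\,\mathrm{se}(\hat\theta)$. Here $\hat\theta = a'\bar X = 30/2 + 10/2 = 20$, and $\mathrm{se}(\hat\theta) = \sqrt{a'(S/25)a} = \sqrt{.01(16) + .02(4) + .01(4)} = .529$. The critical value for the single confidence interval is $c_\alpha = t_{25-1,\,.025} = 2.064$. Note that there is only one inference here, and there is no suggestion that the particular linear combination $(\mu_1 + \mu_2)/2$ is related to the MSLC, so the ordinary $t$ critical value is used and not the Bonferroni or $T^2$-based critical values. The confidence interval is thus $20 \pm 2.064(.529)$, or $18.91 < (\mu_1 + \mu_2)/2 < 21.09$.

3. An experiment yields the following unadjusted (or ordinary) p-values: for hypothesis $H_1$, $p_1 = .013$; for hypothesis $H_2$, $p_2 = .49$; for hypothesis $H_3$, $p_3 = .011$; for hypothesis $H_4$, $p_4 = .50$.

3a (10) Calculate the Bonferroni adjusted p-values and state which hypotheses are rejected. State also what significance level you are applying, and state whether it refers to EER or CER.

Solution: The Bonferroni adjusted p-values are obtained by multiplying the unadjusted p-values by $k$ and truncating to 1.0 if necessary. Thus the adjusted p-values are $\tilde p_1 = 4 \times .013 = .052$; $\tilde p_2 = 4 \times .49$, which is truncated to 1.0; $\tilde p_3 = 4 \times .011 = .044$; and $\tilde p_4 = 4 \times .50$, which is also truncated to 1.0. Since $.044 < .05$, hypothesis $H_3$ is rejected at the significance level .05. The significance level .05 refers to the EER.

3b (10) Calculate the step-down Bonferroni adjusted p-values and state which hypotheses are rejected. State what significance level you are applying, and state whether it refers to EER or CER.

Solution: The step-down Bonferroni adjusted p-values are obtained by multiplying the ordered unadjusted p-values by $k, k-1, \ldots, 1$, making sure that the ordering of the adjusted p-values corresponds to the ordering of the unadjusted p-values. Starting with the smallest: $\tilde p_3 = 4 \times .011 = .044$; $\tilde p_1 = \max(.044,\, 3 \times .013) = \max(.044, .039) = .044$; $\tilde p_2 = \max(.044,\, 2 \times .49) = \max(.044, .98) = .98$; and $\tilde p_4 = \max(.98,\, 1 \times .50) = .98$. We reject $H_3$ and $H_1$ since their adjusted p-values are less than .05; here .05 refers to the EER.

4. Assume the distribution of $X$ is multivariate normal with mean vector and covariance matrix

$$\mu = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 4 & 0 & 0 \\ 0 & 16 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$

The eigenvalue-eigenvector pairs of $\Sigma$ are $\lambda_1 = 4,\ e_1 = (1, 0, 0)'$; $\lambda_2 = 16,\ e_2 = (0, 1, 0)'$; $\lambda_3 = 1,\ e_3 = (0, 0, 1)'$.

4a (10) Find the Euclidean length of the major axis of the 95% probability ellipsoid.

Solution: The axes of the probability ellipsoid are $\mu \pm \sqrt{\chi^2_{p,\alpha}\,\lambda_i}\;e_i$, where $(\lambda_i, e_i)$ denote the eigenvalue-eigenvector pairs of the covariance matrix. Here $p = 3$, so the critical value is $\chi^2_{3,.05} = 7.815$. The longest axis corresponds with the largest eigenvalue, here $\lambda_2 = 16$. The axis extends from $(0,0,0)' - \sqrt{7.815 \times 16}\,(0,1,0)'$ to $(0,0,0)' + \sqrt{7.815 \times 16}\,(0,1,0)'$, or from $(0, -11.18, 0)'$ to $(0, 11.18, 0)'$. The Euclidean distance between these points is $\sqrt{(0-0)^2 + (11.18 - (-11.18))^2 + (0-0)^2} = 22.36$.

4b (10) Describe the appearance of the ellipsoid. How does it look?

Solution: The ellipse is three-dimensional, centered at the origin (the origin is the point $0$), where all axes of the ellipse coincide exactly with the coordinate axes. The ellipse is longest along the $x_2$ axis; it is half as long along the $x_1$ axis, and is half again as long along the $x_3$ axis. If the $x_1$ and $x_2$ axes are drawn on a blackboard with $x_1$ horizontal and $x_2$ vertical, and the $x_3$ axis is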
coming out of the blackboard, then the ellipse will have more vertical elongation than horizontal elongation, and will have even smaller depth than horizontal elongation. It will look like a football standing on its end that has been squashed.

5. A sample of 100 financially viable firms and a sample of 50 firms with financial problems are compared. Variables include $X_1$ = debt ratio and $X_2$ = assets-to-liabilities ratio. Both measures are ratios with reasonably common scales or variances. The most significant linear combination is MSLC $= -.04X_1 + .28X_2$.

5a (5) Interpret the linear combination in terms of what it DOES.

Solution: This linear combination is the LC that maximizes the squared univariate two-sample $t$ statistic for comparing the two samples, among all possible (infinitely many) LCs of the form $a_1X_1 + a_2X_2$ that could be considered when comparing financially viable and financially troubled firms via the two-sample $t$ test.

5b (5) Interpret the linear combination in terms of what it MEASURES.

Solution: Since the variances of $X_1$ and $X_2$ are comparable, it is permissible to examine the sizes of the coefficients. Since the coefficient of $X_1$ is small, we may state that the MSLC is essentially a measure of assets-to-liabilities ratio.

5c (10) Find the appropriate critical value for the two-sample $t$ statistic based on the MSLC in this example. Emphasis on the word "appropriate."

Solution: You must use the $T^2$ method whenever the linear combination is the MSLC, or even if it is suggested by the MSLC. This is a two-sample case, so

$$c_\alpha = \sqrt{\frac{(n_1 + n_2 - 2)\,p}{n_1 + n_2 - p - 1}\,F_{p,\;n_1+n_2-p-1,\;.05}} = \sqrt{\frac{(100 + 50 - 2)(2)}{100 + 50 - 2 - 1}\,F_{2,\;100+50-2-1,\;.05}} = \sqrt{\frac{296}{147}\,(3.058)} = 2.481.$$

## ISQS 6348 Midterm 2, Fall 06

Open notes, no books. Points (out of 100) in parentheses.

1 (20) Consider the study of housing selling and asking prices, as indicated in the following SAS code:

    proc glm data=houseprices;
       class location;
       model ask sell = location;
       manova h=location;
    run;

Recall that the LOCATION variable has five levels: A, B, C, D, and E. Recall also that the MANOVA hypothesis was rejected. Now that you have rejected the MANOVA
hypothesis, what further hypotheses would you like to test? These further hypotheses should be logically related to the specific MANOVA hypothesis that was tested here, and should be natural follow-up questions of interest.

Solution: The null hypothesis is that the mean asking prices and the mean selling prices are equal for all five locations. If you reject this hypothesis, you still don't know (i) whether the differences are in the asking price or the selling price or both, or (ii) which specific locations differ. Thus a number of subhypotheses are of interest. For example, you could test H0: there is no difference between the five regions in the asking price only. Or you could test H0: there is no difference between the five regions in the selling price only. Or you could test H0: there is no difference between regions A and B in either asking or selling price. Or H0: there is no difference between regions A and C in either asking or selling price. Etc.

To get even more specific, you could test pairwise hypotheses like H0: there is no difference between regions A and B in asking price, and H0: there is no difference between regions A and C in asking price. Etc. And to finish the picture, you could test H0: there is no difference between regions A and B in selling price, and H0: there is no difference between regions A and C in selling price. Etc.

I suppose you might consider other hypotheses relating to ANCOVA, MANCOVA, etc. as well. But a main point that I emphasized in class was that, as a composite test, the MANOVA test is unspecific. Therefore, if you reject the MANOVA test, you want to look at component tests to identify where the differences are.

2. Two uses of ANCOVA/MANCOVA are (a) variance reduction and (b) confounder control.

2A (10) In what kind of study is variance reduction the more important goal? Describe a specific example of such a study.

Solution: This is used in experimental studies, where the subjects are randomly assigned to the groups. For example, you want to test two interfaces. So you randomly assign
students to one or the other interface, let them use it, and then measure performance. But performance may also be affected by computer literacy, so if you measure that prior to the study and use ANCOVA, you will reduce the variability and have a more powerful comparison of the two interfaces. Another example: a drug trial for lowering cholesterol. The two groups are drug and placebo, and the covariate is initial cholesterol, measured prior to the start of the study. The same issues as discussed for the previous example apply equally here.

2B (10) In what kind of study is confounder control the more important goal? Describe a specific example of such a study.

Solution: In nonexperimental studies (also called observational studies), where the observations are not randomly assigned to groups, the group differences can be confounded by covariates. For example, in the housing case in problem 1, a covariate might be size of home. If one region has larger homes than another, then the region comparison is confounded by size. It might be that the regions really have equally priced homes, but because one region has larger homes than the other, the simple price average is larger in the region with bigger homes. Another example was the comparison of ethnic groups' satisfaction with TTU. It is possible that "year graduated" is a confounding variable. In experimental studies you do not have such confounders, because random assignment assures that all confounders are unrelated to treatment group.

3. Consider the following measurement model:

$$Y_1 = T + \varepsilon_1, \qquad Y_2 = T + \varepsilon_2.$$

Assume $\mathrm{Var}(T) = 1$, $\mathrm{Var}(\varepsilon_1) = 1$, $\mathrm{Var}(\varepsilon_2) = 1$, $\mathrm{Cov}(T, \varepsilon_1) = 0$, $\mathrm{Cov}(T, \varepsilon_2) = 0$, and $\mathrm{Cov}(\varepsilon_1, \varepsilon_2) = -0.99$.

3A (15) Find the covariance matrix of $Y = T + \varepsilon$, where $Y = (Y_1, Y_2)'$, $T = (T, T)'$, and $\varepsilon = (\varepsilon_1, \varepsilon_2)'$.

Solution: Since $\mathrm{Cov}(T, \varepsilon_1) = 0$ and $\mathrm{Cov}(T, \varepsilon_2) = 0$, we have $\Sigma_{YY} = \Sigma_{TT} + \Sigma_{\varepsilon\varepsilon}$:

$$\mathrm{Cov}\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \mathrm{Cov}\begin{pmatrix} T \\ T \end{pmatrix} + \mathrm{Cov}\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} + \begin{pmatrix} 1 & -.99 \\ -.99 & 1 \end{pmatrix} = \begin{pmatrix} 2 & .01 \\ .01 & 2 \end{pmatrix}.$$

To find the covariance matrix $\Sigma_{TT}$, recall that $\mathrm{Cov}(T, T) = \mathrm{Var}(T)$.

3B (10) Find Cronbach's alpha.

Solution: Using the covariance matrix of the $Y$ vector, the sum of the diagonal elements is 4 and the sum of all elements is 4.02, so Cronbach's $\alpha = 2(1 - 4/4.02) = .01$. Pretty low!

3C (15) Find the true reliability of $Y_1 + Y_2$ as a measure of $T$.

Solution:
Recall $Y_0 = Y_1 + Y_2$ and $T_0 = T_1 + T_2$. In this example $T_1 = T_2 = T$, so $T_0 = 2T$. But the reliability of $Y_0$ as a measure of $T$ is identical to the reliability of $Y_0$ as a measure of $2T$, since it is based on correlation, so we can use the formula developed in class. Since $\Sigma_{T\varepsilon} = 0$, the squared correlation between $Y_0$ and $T_0$ is equal to

$$\frac{\{\mathrm{sum}(\Sigma_{TT})\}^2}{\mathrm{sum}(\Sigma_{TT})\,\mathrm{sum}(\Sigma_{YY})} = \frac{\mathrm{sum}(\Sigma_{TT})}{\mathrm{sum}(\Sigma_{YY})} = \frac{4}{4.02} = .995.$$

How interesting! The true reliability is very high when the correlation between residuals is extreme and negative. Some insight into this: suppose that $\varepsilon_2 = -\varepsilon_1$, implying $\mathrm{Cov}(\varepsilon_1, \varepsilon_2) = -1$. Then $Y_1 + Y_2 = (T + \varepsilon_1) + (T + \varepsilon_2) = (T + \varepsilon_1) + (T - \varepsilon_1) = 2T$. Look, no error! So in the case of perfect negative correlation of the errors, we have perfect reliability.

4A (10) Factor analysis is similar to multivariate regression. Explain the most important similarities.

Solution: The main similarity is that the models are identical in their linear function representations. In multivariate regression we have multiple $Y$ variables, each predicted as a linear function of a collection of $X$ variables. The same $X$ variables are used for each $Y$ variable, although the regression coefficients are different for the different $Y$ variable models. In factor analysis we have exactly the same linear models, except the $X$ variables are now the latent data corresponding to each of the factors. Another similarity: the usual regression assumptions are needed for both models.

4B (10) Factor analysis is different from multivariate regression. Explain the most important differences.

Solution: The most important difference is that there is no $X$ data to use in the FA model. The factor data are latent and unobservable. So even though you make the same regression assumptions for both models, you can't check the assumptions in the FA model in the usual way (e.g., by plotting the data), since there is no data available for checking these assumptions. Another difference: FA requires additional assumptions for estimating the model that are not required by regression analysis, including uncorrelatedness of the factors, the X data that are
used in FA, and uncorrelatedness of the errors of the FA model. Recall that in the MV regression model, the errors can be arbitrarily correlated; these are the partial correlations between the $Y$ variables. So in addition to the postulation of the existence of latent variables, the FA model requires assumptions over and above those of regression in order to allow estimation of the model.

Give an example of a real situation where the correlation between $T$ and $\varepsilon$ is positive, and one where it is negative. By "real" I mean you should have a specific thing "T" in real life that you are measuring, and some real way of measuring it that gives you "Y". The examples can be silly or profound, but they should be real. Mainly, they should exhibit an understanding of the correlation in question.

Here's a solution.

Positive correlation: Suppose we are measuring liberalness ($T$) of a person. Suppose the $T$ we want is the person's percentile rank in the population, measured on a 0-1 scale. For example, if a person has $T = .34$, then 34% of the population is less liberal (or equivalently, more conservative) than that person. Suppose also we measure $T$ by asking the person "who did you vote for in the last election?" and record their answers: 1 for Kerry and 0 for Bush. All others, and all non-voters, are discarded. Now, let's make the assumption that if the person's $T$ is less than .5, then the person voted Bush, otherwise Kerry. In this case $\varepsilon = Y - T$, where $Y$ is a 0/1 variable and $T$ is a variable between 0 and 1. The following code simulates this scenario:

    /* Example where error is positively correlated with T */
    data a;
       do i = 1 to 1000;
          liberal = ranuni(0);          /* T = liberal */
          response = (liberal > .5);    /* response is Y: 0 if T < .5, 1 if T > .5 */
          error = response - liberal;
          output;
       end;
    run;
    proc gplot; plot error*liberal; run;
    proc corr; var error liberal; run;

The output of the CORR procedure (two variables, `error` and `liberal`, N = 1000; the simple-statistics table is not legible in the source) shows a Pearson correlation of 0.50533 between `error` and `liberal`, with p < .0001 under H0: Rho = 0. This clearly indicates a positive correlation between $T$ and $\varepsilon$.

Negative correlation example: [the opening of this example is illegible in the source] ... either $T = 0$ or $T = 2$ ... discarded. So you ask someone "who did you vote for?" and they tell you, and let us assume that they tell the truth. However, [illegible] everyone [answers] "1" instead of the requested "0" or "2". So the observed data $Y$ is always "1". Then $\varepsilon = Y - T$, where $\varepsilon = -1$ when $T = 2$ and $\varepsilon = +1$ when $T = 0$. Thus $\varepsilon$ is negatively correlated with $T$. Here is a file to demonstrate:

    /* Example with negative correlation */
    data a;
       do i = 1 to 1000;
          if ranuni(0) < .5 then T = 0; else T = 2;
          y = 1;
          error = y - T;    /* error = Y - T, as defined in the text */
          output;
       end;
    run;
    proc gplot; plot error*T; run;
    proc corr; var error T; run;

The output shows:

    The CORR Procedure
    2 Variables:    error    T

    Simple Statistics
    Variable      N        Mean     Std Dev          Sum     Minimum
    error      1000    -0.02200     1.00026    -22.00000    -1.00000
    T          1000     1.02200     1.00026         1022           0

    Pearson Correlation Coefficients, N = 1000
    Prob > |r| under H0: Rho=0
                 error           T
    error      1.00000    -1.00000
                            <.0001
    T         -1.00000     1.00000
                <.0001

This case is identical to problem 1A of the homework: the variance of $T$ is 1.0, the variance of $\varepsilon$ is 1.0, and the correlation between $T$ and $\varepsilon$ is $-1.0$. The general message is that if you have a question that forces people from the extremes inward, then there is negative correlation between $T$ and $\varepsilon$.

## Multivariate Analysis Final Exam, Fall 2008

Instructions: Open notes. Points (out of 100) are in parentheses.

1A (5) Draw the path diagram corresponding to the following code:

    proc calis data=isqs6348.p1017;
       var smoking1-smoking4 X1-X3;
       lineqs
          smoking1 = beta1 F1 + e1,
          smoking2 = beta2 F1 + e2,
          smoking3 = beta3 F1 + e3,
          smoking4 = beta4 F1 + e4,
          F1 = beta11 X1 + beta12 X2 + beta13 X3 + d1;
       std
          e1-e4 = the1-the4,
          d1 = 1;
    run;

Solution: [Path diagram not reproduced in the source; only fragments of the error terms e3 and e4 survive.] I could label the arrows with parameters to be even more explicit.

1B (5) How many parameters are estimated in the model above? What are the parameters?

Solution: There are 11 parameters indicated in the code: beta1-beta4, beta11-beta13, and the1-the4.

1C (10) What are the statistical assumptions of the model shown in the code above?
Solution: The variables e1-e4, d1, and F are all uncorrelated, and the e's and d are uncorrelated with the X's. All of the usual regression assumptions apply as well: linearity, homoscedasticity, independence, and (ideally) normality. Specifically, the assumptions are embedded in the model, which is given by

$$Y_1 = \beta_1 F + \varepsilon_1, \quad Y_2 = \beta_2 F + \varepsilon_2, \quad Y_3 = \beta_3 F + \varepsilon_3, \quad Y_4 = \beta_4 F + \varepsilon_4, \quad F = \gamma_1 X_1 + \gamma_2 X_2 + \gamma_3 X_3 + \delta.$$

We see all equations are regression models, so the usual regression assumptions apply, in addition to the uncorrelatedness assumptions mentioned above.

2 (10) Why does an upward-curving chi-square q-q plot suggest that the multivariate distribution is more outlier-prone than the multivariate normal distribution? What is the logic that explains it?

Solution: We know that the squared Mahalanobis distance $d_i^2$ from observation $i$ to the mean vector has an approximate chi-square distribution under the MVN assumption. In the chi-square q-q plot, the sorted observed squared Mahalanobis distances $d^2_{(1)} \le d^2_{(2)} \le \cdots \le d^2_{(n)}$ are plotted against the quantiles $q_{(i-.5)/n}$ of the chi-square distribution with $p$ degrees of freedom ($p$ denotes the number of variables here). If the distribution is MVN, the distances and the quantiles should be approximately equal, i.e., $d^2_{(i)} \approx q_{(i-.5)/n}$, giving the plot the appearance of a 45-degree line. If the distribution is outlier-prone, then the most extreme observations will have Mahalanobis distances $d^2_{(i)}$ from the mean that tend to be much larger than the corresponding chi-square quantiles $q_{(i-.5)/n}$, i.e., $d^2_{(i)} \gg q_{(i-.5)/n}$ for these observations. The plotted values $(q_{(i-.5)/n},\, d^2_{(i)})$ will thus lie well above the 45-degree line for these observations, giving the chi-square q-q plot an upward-curving appearance.

3. Suppose that the following model is to be estimated from manifest variables $Y_1$, $Y_2$, and $X$:

$$Y_1 = \beta_1 Y_2 + \varepsilon_1, \qquad Y_2 = X + \varepsilon_2.$$

Assume $X$ has variance 1.0, and $\varepsilon_1$ and $\varepsilon_2$ have variances $\psi_1$ and $\psi_2$, respectively; also assume that $X$, $\varepsilon_1$, $\varepsilon_2$ are mutually uncorrelated.

3A (10) Find the 3x3 covariance matrix of $(Y_1, Y_2, X)$ that is implied by this model.

Solution: Note that $Y_1 = \beta_1 Y_2 + \varepsilon_1 = \beta_1(X + \varepsilon_2) + \varepsilon_1 = \beta_1 X + \beta_1 \varepsilon_2 + \varepsilon_1$, so that

$$\begin{pmatrix} Y_1 \\ Y_2 \\ X \end{pmatrix} = \begin{pmatrix} 1 & \beta_1 & \beta_1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ X \end{pmatrix}.$$

Now

$$\mathrm{Cov}\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ X \end{pmatrix} = \begin{pmatrix} \psi_1 & 0 & 0 \\ 0 & \psi_2 & 0 \\ 0 & 0 & 1 \end{pmatrix},$$

hence

$$\mathrm{Cov}\begin{pmatrix} Y_1 \\ Y_2 \\ X \end{pmatrix} = \begin{pmatrix} 1 & \beta_1 & \beta_1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \psi_1 & 0 & 0 \\ 0 & \psi_2 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ \beta_1 & 1 & 0 \\ \beta_1 & 1 & 1 \end{pmatrix} = \begin{pmatrix} \psi_1 + \beta_1^2\psi_2 + \beta_1^2 & \beta_1\psi_2 + \beta_1 & \beta_1 \\ \beta_1\psi_2 + \beta_1 & \psi_2 + 1 & 1 \\ \beta_1 & 1 & 1 \end{pmatrix}.$$

3B (10) Explain how the implied covariance matrix is used to estimate the parameters of the model.

Solution: The idea is to choose the parameters in the implied covariance matrix so that the implied covariance matrix matches the observed covariance matrix as closely as possible. For example, if the sample covariance matrix of the $(Y_1, Y_2, X)$ data was

$$S = \begin{pmatrix} 2.3 & .6 & 1.2 \\ .6 & 1.8 & 1.1 \\ 1.2 & 1.1 & 1 \end{pmatrix},$$

we would choose the parameters $\beta_1$, $\psi_1$, and $\psi_2$ so that the implied form and $S$ are as close as possible. Unweighted least squares, weighted least squares, and maximum likelihood are all possible methods to do this.

4A (5) What are the assumptions behind Cronbach's $\alpha$?

Solution: We assume the measurement model $Y_i = T_i + e_i$, $i = 1, \ldots, p$, where the errors are uncorrelated with each other and with the $T$'s.

4B (5) What can you say about Cronbach's $\alpha$ when the assumptions are violated?

Solution: In this case we are no longer guaranteed that Cronbach's $\alpha$ is less than or equal to the squared correlation between $\sum Y_i$ and $\sum T_i$.

5. Each of the following statements is debatable; i.e., each statement is neither clearly true nor clearly false. Discuss the validity of each statement separately. Provide specific details and/or "for instances" to make answers better.

5A (8) Multiple comparisons procedures allow you to "data snoop."

Discussion: To an extent this is true. If you prespecify many hypotheses and correct for the type I error inflation using Bonferroni's method or some other method, then you can legitimately claim significances that exceed the Bonferroni threshold. On the other hand, it can be difficult to predefine every single hypothesis, and it can be difficult to count the total number of tests that one performs, especially if the data analysis is done interactively. So even Bonferroni's method does not protect against indiscriminate data snooping.

5B (8) The most significant linear combination is similar
to a principal component.

Discussion: Yes it is. They are both linear combinations, and as linear combinations, they can be calculated and the data sorted from smallest to largest value to aid in interpretation. However, they do different things. The PC maximizes variance and the MSLC maximizes significance. So they are used for different purposes.

5C (8) Mahalanobis distance should be used in multivariate analysis.

Discussion: Sure. Mahalanobis distance has its benefits, namely to incorporate correlation information and to measure distances realistically in the light of differing correlations and different variances. Also, it is the foundation for many types of multivariate tests, such as the Hotelling $T^2$ test. On the other hand, good old Euclidean distance is not necessarily "wrong." It has its benefits; for example, we "see" Euclidean distance most clearly in graphs. Other types of distance can be appropriate as well, e.g., "driving distance" when looking at a map. This is certainly not Euclidean distance, but it is a multivariate distance, since the map is two-dimensional (three-dimensional when the Earth's curvature is incorporated). Also, Mahalanobis distance can be difficult to calculate, especially in high-dimensional situations, so we may prefer something simpler, like standardized Euclidean distance, simply for expediency (the default distance for cluster analysis is standardized Euclidean for this reason).

5D (8) Latent variables don't exist.

Discussion: We never see a latent variable, so arguably they don't exist at all. On the other hand, the latent variable as a concept is logical. For example, we think some people are more satisfied than others (the job salience/job satisfaction case). This can be modeled as a single number, a latent "satisfaction" number. When a person has a higher value of that number, then the person is more satisfied. The latent variable concept seems logical enough, even if the variables are fictitious in actuality.

5E (8) The better the fit statistics in a structural equations model, the
better the model.

Discussion: This is true to an extent. Certainly if the fit statistics are good, then the model is more believable. On the other hand, the relationships may be so weak that the model is practically useless. For example, in the job salience/job satisfaction case, the model could have fit perfectly in terms of RMR, RMSEA, GFI, chi-squared, etc., but if the $R^2$ measuring how well job salience predicts job satisfaction was very small, say .03, then we would not think the model was "good." Some mentioned that the fit statistics will look OK when the model is "overfit," but the model itself won't be very good, useful, or realistic. I also gave credit for this answer, although it was not exactly what I had in mind, since many of the fit statistics penalize you for overfitting (e.g., AIC); hence these statistics will tend not to look very good for overfit models.
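The Bonferroni and step-down Bonferroni adjustments worked by hand in problem 3 of the Spring 2000 midterm can be checked numerically. The following is a minimal sketch in Python rather than the SAS used in the course; the function names `bonferroni` and `step_down_bonferroni` are illustrative and not from the notes, and the truncation at 1.0 in the step-down version is an assumption carried over from the single-step rule.

```python
def bonferroni(pvals):
    """Single-step Bonferroni: multiply each p-value by k, truncate at 1."""
    k = len(pvals)
    return [min(1.0, k * p) for p in pvals]


def step_down_bonferroni(pvals):
    """Step-down adjustment: multiply the ordered p-values by k, k-1, ..., 1,
    then take a running maximum so the adjusted values keep the same order.
    (Truncating at 1.0 here is an assumption, mirroring the single-step rule.)"""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * k
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (k - rank) * pvals[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Unadjusted p-values for H1..H4 from problem 3 of the Spring 2000 midterm.
p = [0.013, 0.49, 0.011, 0.50]
bonf = bonferroni(p)
step = step_down_bonferroni(p)
for name, b, s in zip(["H1", "H2", "H3", "H4"], bonf, step):
    print(f"{name}: Bonferroni={b:.3f}  step-down={s:.3f}")
```

With the exam's p-values, the single-step method rejects only H3 at the .05 level, while the step-down method rejects both H3 and H1, in agreement with the hand calculation.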
