
# Multivar Data Analysis STA 135

UCD


This 70 page Class Notes was uploaded by Carmen Mayer on Tuesday September 8, 2015. The Class Notes belongs to STA 135 at University of California - Davis taught by Staff in Fall. Since its upload, it has received 37 views. For similar materials see /class/191914/sta-135-university-of-california-davis in Statistics at University of California - Davis.



### A. Square root of a matrix

Calculate the square root of a positive (or non-negative) definite matrix A (see pp. 65–66). A positive- or non-negative-definite matrix has eigenvalues λ ≥ 0 and orthonormal eigenvectors; A = PLP' is the spectral decomposition of A, where the eigenvalues are the diagonal of the matrix L and the eigenvectors are the columns of the matrix P. Then A^(1/2) = P L^(1/2) P'.

```r
> A <- matrix(c(25,-2,4, -2,4,1, 4,1,9), ncol=3)
> A
     [,1] [,2] [,3]
[1,]   25   -2    4
[2,]   -2    4    1
[3,]    4    1    9
> eig <- eigen(A)
> eig
$values
[1] 26.078452  8.495796  3.425752

$vectors
           [,1]      [,2]      [,3]
[1,] 0.97169436 0.1914314 0.1384345
[2,] 0.07792066 0.2934880 0.9527818
[3,] 0.22302119 0.9365996 0.2702642

> diag(eig$val)
         [,1]     [,2]     [,3]
[1,] 26.07845 0.000000 0.000000
[2,]  0.00000 8.495796 0.000000
[3,]  0.00000 0.000000 3.425752
> B <- eig$vec %*% diag(eig$val^.5) %*% t(eig$vec)   # B = A^(1/2)
> B %*% B    # check: B %*% B = A
     [,1] [,2] [,3]
[1,]   25   -2    4
[2,]   -2    4    1
[3,]    4    1    9
```

### B. Random variable generation

Generate 6 independent standard normal random variables with rnorm():

```r
> rnorm(6)
[1] 0.8457609 1.7154079 0.6080926 1.1890400 0.8762133 0.1515956
> y <- rnorm(6)
> y
[1] 2.0232779 1.2769311 0.4330825 0.5096962 0.6214099 0.2061143
> # resize y into 2 rows, 3 columns to get a bivariate standard normal
> # random sample Y1, Y2, Y3 ~ N(mean 0, variance I)
> matrix(y, nrow=2)
          [,1]      [,2]      [,3]
[1,] 2.0232779 0.4330825 0.6214099
[2,] 1.2769311 0.5096962 0.2061143
```

Note: after the transformation from Y to X (so that X has mean mu and variance Sigma), take the transpose of y to align the variables horizontally, i.e. X should be an n x 2 matrix.

### C. Logical expressions

`<` less than, `<=` less than or equal to, `>` greater than, `>=` greater than or equal to, `==` equal to.

### D. Ellipse

```r
> data1 <- read.table("T4-1.dat")   # Johnson & Wichern data; paths abbreviated
> data2 <- read.table("T4-5.dat")
> data <- cbind(data1, data2)
> data <- data^(1/4)
> dim(data)
[1] 42  2
> colMeans(data)
0.5642575 0.6029812
> var(data)
0.01435023 0.01171547
0.01171547 0.01454530
> xbar <- matrix(c(.564,.603), ncol=1)
> S <- matrix(c(.0144,.0117,.0117,.0146), ncol=2)
> n <- 42
```

Draw a confidence ellipsoid for 2 variables, based on the chi-square approximation. This applies only to a pair of variables at a time, i.e. it uses simultaneous confidence regions of 2 variables at a time. Load the package ellipse:

```r
> library(ellipse)
> plot(ellipse(S/n, centre=xbar, level=.95), type="l",
+      xlab="mu1", ylab="mu2", main="95% Confidence Ellipse")
```
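The spectral-decomposition recipe in section A can be checked numerically. A minimal sketch in Python/NumPy (the notes themselves use R; the matrix is the one from the example above):

```python
import numpy as np

# Symmetric positive definite matrix from the example above
A = np.array([[25.0, -2.0, 4.0],
              [-2.0,  4.0, 1.0],
              [ 4.0,  1.0, 9.0]])

# Spectral decomposition A = P L P' (eigh handles symmetric matrices)
eigval, P = np.linalg.eigh(A)

# Matrix square root: A^(1/2) = P diag(sqrt(eigenvalues)) P'
B = P @ np.diag(np.sqrt(eigval)) @ P.T

print(np.allclose(B @ B, A))  # True: squaring B recovers A
```

The eigenvalues returned (about 3.43, 8.50, 26.08) match the `eig$values` printed in the R transcript, up to ordering.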
[Figure: 95% confidence ellipse for (mu1, mu2)]

Plot pairwise ellipses over the data:

```r
> plot(data[,1], data[,2], pch=20, main="95% Confidence ellipse",
+      xlab="x1", ylab="x2")
> lines(ellipse(S/n, centre=xbar, level=.95))
```

Alternative: draw the confidence ellipsoid for 2 variables based on the F distribution (shared by permission from our student Chris Dienes — thank you). Transform the unit-circle coordinates:

```r
> eig <- eigen(S/n)
> V1 <- eig$vec[,1]; V2 <- eig$vec[,2]
> e1 <- sqrt(eig$val[1]*c2); e2 <- sqrt(eig$val[2]*c2)   # c2 as defined below
> angle <- seq(0, 2*pi, length=101)
> mu1 <- e1*V1[1]*cos(angle) + e2*V2[1]*sin(angle) + xbar[1]
> mu2 <- e1*V1[2]*cos(angle) + e2*V2[2]*sin(angle) + xbar[2]
> plot(mu1, mu2, type="l")
```

To execute commands in a script file: Windows: File > New Script; Mac: File > New Document.

Comparing the ellipses of both methods — microwave data (Examples 5.3, 5.4, 5.6):

```r
xbar <- matrix(c(.564,.603), ncol=1)
S <- matrix(c(.0144,.0117,.0117,.0146), ncol=2)
Sinv <- matrix(c(203.018,-163.391,-163.391,200.228), ncol=2)
n <- 42; p <- 2; m <- 2
cc <- sqrt(p*(n-1)/(n*(n-p)) * qf(.95, p, n-p))   # T^2 constant
bb <- qt(1-.05/(2*m), n-1)/sqrt(n)                # Bonferroni constant
a  <- matrix(c(1,0), ncol=1)
a2 <- matrix(c(0,1), ncol=1)
# T^2 simultaneous CIs
S1  <- t(a)%*%xbar  - cc*sqrt(t(a)%*%S%*%a)
S2  <- t(a)%*%xbar  + cc*sqrt(t(a)%*%S%*%a)
S11 <- t(a2)%*%xbar - cc*sqrt(t(a2)%*%S%*%a2)
S22 <- t(a2)%*%xbar + cc*sqrt(t(a2)%*%S%*%a2)
# Bonferroni CIs
B1  <- t(a)%*%xbar  - bb*sqrt(t(a)%*%S%*%a)
B2  <- t(a)%*%xbar  + bb*sqrt(t(a)%*%S%*%a)
B11 <- t(a2)%*%xbar - bb*sqrt(t(a2)%*%S%*%a2)
B22 <- t(a2)%*%xbar + bb*sqrt(t(a2)%*%S%*%a2)
# Ellipse using the chi-square approximation
library(ellipse)
plot(ellipse(S/n, centre=xbar, level=.95), type="l",
     xlim=c(.51,.63), ylim=c(.54,.67),
     xlab="mu1", ylab="mu2", main="95% Confidence Ellipse")
lines(c(S1,S1), c(S11,S22)); lines(c(S2,S2), c(S11,S22))
lines(c(S1,S2), c(S11,S11)); lines(c(S1,S2), c(S22,S22))
lines(c(B1,B1), c(B11,B22)); lines(c(B2,B2), c(B11,B22))
lines(c(B1,B2), c(B11,B11)); lines(c(B1,B2), c(B22,B22))
# Ellipse using the F distribution
sinvn <- solve(S)*n
c2 <- p*(n-1)/(n*(n-p)) * qf(.95, p, n-p)
V1 <- eigen(sinvn)$vec[,1]; V2 <- eigen(sinvn)$vec[,2]
e1 <- sqrt(1/eigen(sinvn)$val[1]*c2)
e2 <- sqrt(1/eigen(sinvn)$val[2]*c2)
angle <- seq(0, 2*pi, length=101)
mu1 <- e1*V1[1]*cos(angle) + e2*V2[1]*sin(angle) + xbar[1]
mu2 <- e1*V1[2]*cos(angle) + e2*V2[2]*sin(angle) + xbar[2]
lines(mu1, mu2, col="red", type="l")
legend("topright", legend=c("Chi-sq","F dist"), col=c("black","red"), pch=c(1,1))
# Suppose trueMu = (0.54, 0.59)
points(.54, .59, pch=19)
text(.54, .595, labels="trueMu=(0.54,0.59)")
```
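As a cross-check on the chi-square ellipse above: every point on its boundary satisfies n(x̄ − μ)' S⁻¹ (x̄ − μ) = χ²₂(0.95). A sketch in Python (NumPy/SciPy rather than R) with the microwave-data summary statistics from the notes:

```python
import numpy as np
from scipy.stats import chi2

xbar = np.array([0.564, 0.603])   # sample mean vector (microwave data, n = 42)
S = np.array([[0.0144, 0.0117],
              [0.0117, 0.0146]])  # sample covariance matrix
n = 42
crit = chi2.ppf(0.95, df=2)       # chi-square critical value, p = 2

# Parametrise the boundary via the eigen-decomposition of S/n
lam, V = np.linalg.eigh(S / n)
angle = np.linspace(0.0, 2.0 * np.pi, 200)
boundary = (xbar[:, None]
            + np.sqrt(crit * lam[0]) * V[:, [0]] * np.cos(angle)
            + np.sqrt(crit * lam[1]) * V[:, [1]] * np.sin(angle))

# Each boundary point gives a quadratic-form value equal to the critical value
d = boundary - xbar[:, None]
q = np.einsum('ij,ij->j', d, np.linalg.solve(S / n, d))
print(np.allclose(q, crit))  # True
```

This is the same construction the F-distribution variant uses; only the constant multiplying the eigenvalues changes.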
[Figure: 95% confidence ellipse with T² and Bonferroni interval rectangles; chi-square and F-distribution boundaries overlaid]

[A summary page of useful R functions for data analysis appears here in the original; it is too garbled to transcribe beyond the recognisable entries: rnorm, plot (scatter plot of 2 variables), pairs (matrix of plots of n variables), hist (histogram of 1 variable), var (variance/covariance), qqnorm (ordered observations against reference quantiles), the normal and chi-square distribution functions, and solve.]

## Multivariate Analysis

Many statistical techniques focus on just one or two variables. Multivariate analysis (MVA) techniques allow more than two variables to be analysed at once. Multiple regression is not typically included under this heading, but can be thought of as a multivariate analysis.

### Outline of Lectures

We will cover:

- Why MVA is useful and important
- Simpson's Paradox
- Some commonly used techniques: principal components, cluster analysis, correspondence analysis, others if time permits
- Market segmentation methods
- An overview of MVA methods and their niches

### Simpson's Paradox

Example: 44% of male applicants are admitted by a university, but only 33% of female applicants. Does this mean there is unfair discrimination? The university investigates and breaks down the figures for the Engineering and English programmes.

All applicants:

|        | Male | Female |
|--------|------|--------|
| Accept | 35   | 20     |
| Refuse | 45   | 40     |
| Total  | 80   | 60     |

Engineering:

|        | Male | Female |
|--------|------|--------|
| Accept | 30   | 10     |
| Refuse | 30   | 10     |
| Total  | 60   | 20     |

English:

|        | Male | Female |
|--------|------|--------|
| Accept | 5    | 10     |
| Refuse | 15   | 30     |
| Total  | 20   | 40     |

There is no relationship between sex and acceptance within either programme, so no evidence of discrimination. Why? More females apply for the English programme, but it is hard to get into; more males applied to Engineering, which has a higher acceptance rate than English. You must look deeper than a single crosstab to find this out.

### Another Example

A study of graduates' salaries showed a negative association between economists' starting salary and the level of the degree, i.e. PhDs
earned less than Masters degree holders, who in turn earned less than those with just a Bachelor's degree. Why? The data was split into three employment sectors: teaching, government, and private industry. Each sector showed a positive relationship. Employer type was confounded with degree level.

- CAUSATION: changes in A cause changes in B.
- COMMON RESPONSE: changes in both A and B are caused by changes in a third variable C.
- CONFOUNDING: changes in B are caused both by changes in A and by changes in a third variable C.

### Simpson's Paradox

In each of these examples, the bivariate analysis (crosstabulation or correlation) gave misleading results. Introducing another variable gave a better understanding of the data; it even reversed the initial conclusions.

### Many Variables

Market research surveys commonly have many relevant variables; e.g. one not atypical survey had 2000 variables. Typically researchers pore over many crosstabs, but it can be difficult to make sense of these, and the crosstabs may be misleading. MVA can help summarise the data, e.g. factor analysis and segmentation based on agreement ratings on 20 attitude statements. MVA can also reduce the chance of obtaining spurious results.

### Multivariate Analysis Methods

Two general types of MVA technique:

- Analysis of dependence: one or more variables are dependent variables, to be explained or predicted by others, e.g. multiple regression, PLS, MDA.
- Analysis of interdependence: no variables are thought of as dependent; look at the relationships among variables, objects, or cases, e.g. cluster analysis and factor analysis.

### Principal Components

Identify underlying dimensions, or principal components, of a distribution. This helps understand the joint or common variation among a set of variables, and is probably the most commonly used method of deriving factors in factor analysis (before rotation). The first principal component is identified as the vector (or, equivalently, the linear combination of variables) on which the most data variation can be projected.
The 2nd principal component is a vector perpendicular to the first, chosen so that it contains as much of the remaining variation as possible; and so on for the 3rd principal component, the 4th, the 5th, etc.

Principal components examples: ellipse, ellipsoid, sphere; rugby ball; pen; frying pan; banana; CD; book.

### Multivariate Normal Distribution

A generalisation of the univariate normal, determined by the mean vector and covariance matrix: X ~ N(μ, Σ). E.g. for the standard bivariate normal, X ~ N(0, I₂), the density is

p(x) = (1/(2π)) e^(−(x² + y²)/2)

### Example: Crime Rates by State

Crime rates per 100,000 population by state (first rows shown):

| State | Murder | Rape | Robbery | Assault | Burglary | Larceny | AutoTheft |
|---|---|---|---|---|---|---|---|
| 1 Alabama | 14.2 | 25.2 | 96.8 | 278.3 | 1135.5 | 1881.9 | 280.7 |
| 2 Alaska | 10.8 | 51.6 | 96.8 | 284.0 | 1331.7 | 3369.8 | 753.3 |
| 3 Arizona | 9.5 | 34.2 | 138.2 | 312.3 | 2346.1 | 4467.4 | 439.5 |
| 4 Arkansas | 8.8 | 27.6 | 83.2 | 203.4 | 972.6 | 1862.1 | 183.4 |
| 5 California | 11.5 | 49.4 | 287.0 | 358.0 | 2139.4 | 3499.8 | 663.5 |

The PRINCOMP Procedure: observations 50, variables 7.

Simple statistics:

| | Murder | Rape | Robbery | Assault | Burglary | Larceny | AutoTheft |
|---|---|---|---|---|---|---|---|
| Mean | 7.444 | 25.734 | 124.092 | 211.300 | 1291.904 | 2671.288 | 377.526 |
| StD | 3.867 | 10.160 | 88.349 | 100.253 | 432.456 | 725.909 | 193.394 |

Correlation matrix:

| | Murder | Rape | Robbery | Assault | Burglary | Larceny | AutoTheft |
|---|---|---|---|---|---|---|---|
| Murder | 1.0000 | 0.6012 | 0.4837 | 0.6486 | 0.3858 | 0.1019 | 0.0688 |
| Rape | 0.6012 | 1.0000 | 0.5919 | 0.7403 | 0.7121 | 0.6140 | 0.3489 |
| Robbery | 0.4837 | 0.5919 | 1.0000 | 0.5571 | 0.6372 | 0.4467 | 0.5907 |
| Assault | 0.6486 | 0.7403 | 0.5571 | 1.0000 | 0.6229 | 0.4044 | 0.2758 |
| Burglary | 0.3858 | 0.7121 | 0.6372 | 0.6229 | 1.0000 | 0.7921 | 0.5580 |
| Larceny | 0.1019 | 0.6140 | 0.4467 | 0.4044 | 0.7921 | 1.0000 | 0.4442 |
| AutoTheft | 0.0688 | 0.3489 | 0.5907 | 0.2758 | 0.5580 | 0.4442 | 1.0000 |

Eigenvalues of the correlation matrix:

| | Eigenvalue | Difference | Proportion | Cumulative |
|---|---|---|---|---|
| 1 | 4.11496 | 2.87624 | 0.5879 | 0.5879 |
| 2 | 1.23872 | 0.51291 | 0.1770 | 0.7648 |
| 3 | 0.72582 | 0.40938 | 0.1037 | 0.8685 |
| 4 | 0.31643 | 0.05846 | 0.0452 | 0.9137 |
| 5 | 0.25797 | 0.03593 | 0.0369 | 0.9506 |
| 6 | 0.22204 | 0.09798 | 0.0317 | 0.9823 |
| 7 | 0.12406 | | 0.0177 | 1.0000 |

Eigenvectors:

| | Prin1 | Prin2 | Prin3 | Prin4 | Prin5 | Prin6 | Prin7 |
|---|---|---|---|---|---|---|---|
| Murder | 0.300279 | -0.629174 | 0.178245 | -0.232114 | 0.538123 | 0.259117 | 0.267593 |
| Rape | 0.431759 | -0.169435 | -0.244198 | 0.062216 | 0.188471 | -0.773271 | -0.296485 |
| Robbery | 0.396875 | 0.042247 | 0.495861 | -0.557989 | -0.519977 | -0.114385 | -0.003903 |
| Assault | 0.396652 | -0.343528 | -0.069510 | 0.629804 | -0.506651 | 0.172363 | 0.191745 |
| Burglary | 0.440157 | 0.203341 | -0.209895 | -0.057555 | 0.101033 | 0.535987 | -0.648117 |
| Larceny | 0.357360 | 0.402319 | -0.539231 | -0.234890 | 0.030099 | 0.039406 | 0.601690 |
| AutoTheft | 0.295177 | 0.502421 | 0.568384 | 0.419238 | 0.369753 | -0.057298 | 0.147046 |
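The eigenvalue table can be reproduced directly from the correlation matrix as printed. A sketch in Python/NumPy (the notes use SAS; the correlations are transcribed from the output above):

```python
import numpy as np

# Correlation matrix from the PRINCOMP output above
# (order: Murder, Rape, Robbery, Assault, Burglary, Larceny, AutoTheft)
R = np.array([
    [1.0000, 0.6012, 0.4837, 0.6486, 0.3858, 0.1019, 0.0688],
    [0.6012, 1.0000, 0.5919, 0.7403, 0.7121, 0.6140, 0.3489],
    [0.4837, 0.5919, 1.0000, 0.5571, 0.6372, 0.4467, 0.5907],
    [0.6486, 0.7403, 0.5571, 1.0000, 0.6229, 0.4044, 0.2758],
    [0.3858, 0.7121, 0.6372, 0.6229, 1.0000, 0.7921, 0.5580],
    [0.1019, 0.6140, 0.4467, 0.4044, 0.7921, 1.0000, 0.4442],
    [0.0688, 0.3489, 0.5907, 0.2758, 0.5580, 0.4442, 1.0000]])

eigval = np.linalg.eigvalsh(R)[::-1]   # eigenvalues, sorted descending
prop = eigval / eigval.sum()           # proportion of variance per component

print(np.round(eigval, 3))             # compare with the eigenvalue column
print(np.round(np.cumsum(prop), 3))    # compare with the cumulative column
```

For a correlation matrix the eigenvalues sum to the number of variables (here 7), so the proportions are simply eigenvalues divided by 7.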
Two to three components explain 76–87% of the variance. The first principal component has roughly uniform variable weights, so it is a general crime-level indicator. The second principal component appears to contrast violent versus property crimes. The third component is harder to interpret.

### Cluster Analysis

Techniques for identifying separate groups of similar cases. The similarity of cases is either specified directly in a distance matrix or defined in terms of some distance function. Cluster analysis is also used to summarise data by defining segments of similar cases in the data; this use is known as dissection.

### Clustering Techniques

Two main types of cluster analysis methods:

- Hierarchical cluster analysis: each cluster, starting with the whole dataset, is divided into two, then divided again, and so on.
- Iterative methods: k-means clustering (PROC FASTCLUS) and analogous nonparametric density estimation methods.

There are also other methods: overlapping clusters, fuzzy clusters.

### Applications

Market segmentation is usually conducted using some form of cluster analysis to divide people into segments. Other methods, such as latent class models or archetypal analysis, are sometimes used instead. It is also possible to cluster other items, such as products/SKUs, image attributes, and brands.

### Tandem Segmentation

One general method is to conduct a factor analysis followed by a cluster analysis. This approach has been criticised for losing information and not yielding as much discrimination as cluster analysis alone. However, it can make it easier to design the distance function and to interpret the results.

### Tandem k-means Example

```sas
proc factor data=datafile n=6 rotate=varimax round reorder
            flag=.54 scree out=scores;
   var reasons1-reasons15 usage1-usage10;
run;

proc fastclus data=scores maxc=4 seed=109162319 maxiter=50;
   var factor1-factor6;
run;
```
We have used the default unweighted Euclidean distance function, which is not sensible in every context. Also note that k-means results depend on the initial cluster centroids, determined here by the seed. Typically k-means is very prone to local maxima, so run it at least 20 times to ensure a reasonable maximum.

Selected outputs, 19th run of 5 segments:

```
FASTCLUS Procedure: Replace=RANDOM Radius=0 Maxclusters=5 Maxiter=100 Converge=0.02

Statistics for Variables

Variable     Total STD    Within STD    R-Squared    RSQ/(1-RSQ)
FACTOR1      1.000000     0.788183      0.379684     0.612082
FACTOR2      1.000000     0.893187      0.203395     0.255327
FACTOR3      1.000000     0.809710      0.345337     0.527503
FACTOR4      1.000000     0.733956      0.462104     0.859095
FACTOR5      1.000000     0.948424      0.101820     0.113363
FACTOR6      1.000000     0.838418     0.298092     0.424689
OVER-ALL     1.000000     0.838231      0.298405     0.425324

Pseudo F Statistic                      =  287.84
Approximate Expected Over-All R-Squared =  0.37027
Cubic Clustering Criterion              =  2.6135
WARNING: The two above values are invalid for correlated variables.
```

(The cluster summary, cluster means, and cluster standard deviation tables for this run appear in the original as rotated-page scans and are not legible.)
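The sensitivity of k-means to its starting seeds, noted above, is easy to demonstrate with a tiny hand-rolled implementation. A sketch in Python/NumPy (the data are hypothetical, not the survey factors):

```python
import numpy as np

def kmeans(X, seeds, iters=50):
    """Plain k-means: assign each point to the nearest centroid, recompute means."""
    C = seeds.astype(float).copy()
    for _ in range(iters):
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in range(len(C)):
            if (labels == k).any():
                C[k] = X[labels == k].mean(axis=0)
    d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    rms = np.sqrt(d.min(axis=1).mean())   # root-mean-square criterion
    return labels, C, rms

rng = np.random.default_rng(0)
# Two well-separated hypothetical clusters of survey-like scores
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(5.0, 0.5, (50, 2))])

# Run from 20 different random seeds and keep the best criterion value,
# as the notes recommend for PROC FASTCLUS
runs = [kmeans(X, X[rng.choice(len(X), 2, replace=False)]) for _ in range(20)]
labels, C, rms = min(runs, key=lambda r: r[2])
print(round(rms, 2))
```

Individual runs can stall in poor local optima; keeping the best of many restarts is the same multiple-seeds advice given above for FASTCLUS.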
### Cluster Analysis Options

There are several choices of how to form clusters in hierarchical cluster analysis: single linkage, average linkage, density linkage, Ward's method, and many others. Ward's method, like k-means, tends to form equal-sized, roundish clusters. Average linkage generally forms roundish clusters with equal variance. Density linkage can identify clusters of different shapes.

[Figure: example solutions from FASTCLUS, density linkage, and CLUSTER]

### Cluster Analysis Issues

- Distance definition: weighted Euclidean distance often works well, if the weights are chosen intelligently.
- Cluster shape: the shape of the clusters found is determined by the method, so choose the method appropriately.
- Hierarchical methods usually take more computation time than k-means; however, multiple runs are more important for k-means, since it can be badly affected by local minima.
- Adjusting for response styles can also be worthwhile: some people give more positive responses overall than others, and clusters may simply reflect these response styles unless this is adjusted for, e.g. by standardising responses across attributes for each respondent.

### MVA: FASTCLUS

PROC FASTCLUS in SAS tries to minimise the root mean square difference between the data points and their corresponding cluster means. It iterates until convergence is reached on this criterion; however, it often reaches a local minimum. It can be useful to run it many times with different seeds and choose the best set of clusters based on this RMS criterion. See for more k-means issues.

### Iteration History from FASTCLUS

```
                        Relative Change in Cluster Seeds
Iteration   Criterion      1        2        3        4        5
    1        0.9645     0.0436   0.7366   0.6440   0.6343   0.5666
    2        0.8596     0.3549   0.1727   0.1227   0.1246   0.0731
    3        0.8499     0.2091   0.1047   0.1047   0.0656   0.0584
    4        0.8454     0.1534   0.0701   0.0785   0.0276   0.0439
    5        0.8430     0.1153   0.0640   0.0727   0.0331   0.0276
    6        0.8414     0.0878   0.0613   0.0488   0.0253   0.0327
    7        0.8402     0.0840   0.0547   0.0522   0.0249   0.0340
    8        0.8392     0.0657   0.0396   0.0440   0.0188   0.0286
    9        0.8386     0.0429   0.0267   0.0324   0.0149   0.0223
   10        0.8383     0.0197   0.0139   0.0170   0.0119   0.0173
```
```
Criterion Based on Final Seeds = 0.83824
Convergence criterion is satisfied.
```

(The cluster means and cluster standard deviations for the 19th and 20th runs of 5 segments appear in the original as rotated-page scans and are not legible.)

### Howard-Harris Approach

Provides an automatic approach to choosing seeds for k-means clustering. It chooses the initial seeds by a fixed procedure:

1. Take the variable with the highest variance, split the data at its mean, and calculate the centroids of the resulting two groups.
2. Apply k-means with these centroids as the initial seeds; this yields a 2-cluster solution.
3. Choose the cluster with the higher within-cluster variance, choose the variable with the highest variance within that cluster, split the cluster as above, and repeat to give a 3-cluster solution.
4. Repeat until the set number of clusters has been reached.

I believe this approach is used by the ESPRI software package, after the variables are standardised by their range.
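The seed-choosing step of the Howard-Harris procedure can be sketched directly. The code below (Python/NumPy, hypothetical data) implements only the first split — highest-variance variable, cut at its mean, centroids of the two halves:

```python
import numpy as np

def howard_harris_seeds(X):
    """First Howard-Harris split: cut the highest-variance variable at its
    mean and return the centroids of the two resulting groups as seeds."""
    j = int(X.var(axis=0).argmax())        # variable with the highest variance
    cut = X[:, j].mean()
    left, right = X[X[:, j] <= cut], X[X[:, j] > cut]
    return np.vstack([left.mean(axis=0), right.mean(axis=0)])

rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(0, 3, 200),   # high-variance variable
                     rng.normal(0, 1, 200)])
seeds = howard_harris_seeds(X)
print(seeds.shape)  # (2, 2)
```

These seeds would then be passed to k-means for the 2-cluster solution, and the procedure recurses on the cluster with the higher within-cluster variance.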
### Another Clustering Method

One alternative approach to identifying clusters is to fit a finite mixture model: assume the overall distribution is a mixture of several normal distributions. Typically this model is fit using some variant of the EM algorithm, e.g. the weka.clusterers.EM method in the WEKA data mining package (see the WEKA tutorial for an example using Fisher's iris data). Advantages of this method include:

- The probability model allows for statistical tests.
- It handles missing data within the model-fitting process.
- The approach can be extended to define clusters based on model parameters, e.g. regression coefficients.

This is also known as latent class modeling.

[Tables: cluster means for Reason 1-15 and for Usage 1-10 across Clusters 1-4, shaded from minimum to maximum; many cell values are illegible in the scan.]

### Correspondence Analysis

Provides a graphical summary of the interactions in a table. It is also known as a perceptual map — but so are many other charts. It can be very useful, e.g. to provide an overview of cluster results. However, the correct interpretation is less than intuitive, and this leads many researchers astray.

[Figure: correspondence analysis map, "Four Clusters (imputed, normalised)", plotting Usage 1-10 and Reason 1-15 against Clusters 1-4]
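The mechanics behind a correspondence analysis map can be sketched as an SVD of the standardised residuals of a contingency table. A sketch in Python/NumPy (the table here is hypothetical, not the survey data):

```python
import numpy as np

# Hypothetical contingency table (rows x columns)
N = np.array([[30.0, 10.0, 25.0,  5.0],
              [20.0, 15.0, 10.0,  5.0],
              [10.0, 25.0, 15.0, 30.0]])

P = N / N.sum()                        # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)    # row and column masses

# Standardised residuals from independence, then SVD
Srm = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(Srm, full_matrices=False)

# Principal coordinates: the points plotted on the perceptual map
row_coords = (U * sv) / np.sqrt(r)[:, None]
col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]

# Share of total association ("inertia") captured by a 2-D map,
# the "2-D fit" quoted on such charts
inertia = sv ** 2
print(round(inertia[:2].sum() / inertia.sum(), 3))  # 1.0: a 3-row table fits exactly in 2-D
```

With only three rows the map is exact; larger tables give partial 2-D fits, which is why real maps quote a fit percentage.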
(The map above reports a 2-D fit of 79.1%.)

### Interpretation

Correspondence analysis plots should be interpreted by looking at points relative to the origin:

- Points that are in similar directions are positively associated.
- Points that are on opposite sides of the origin are negatively associated.
- Points that are far from the origin exhibit the strongest associations.

Also, the results reflect relative associations, not just which rows are highest or lowest overall.

### Software for Correspondence Analysis

The earlier chart was created using a specialised package called BRANDMAP. Correspondence analysis can also be done in most major statistical packages, for example using PROC CORRESP in SAS:

```sas
* Perform Simple Correspondence Analysis (Example 1 in SAS OnlineDoc);
proc corresp all data=Cars outc=Coor;
   tables Marital, Origin;
run;

* Plot the Simple Correspondence Analysis Results;
%plotit(data=Coor, datatype=corresp);
```

[Figure: "Cars by Marital Status" — European, Japanese, and American origins plotted against Married, Single, Married with Kids, Single with Kids]

### Canonical Discriminant Analysis

Predicts a discrete response from continuous predictor variables: it aims to determine which of g groups each respondent belongs to, based on the predictors. It finds the linear combination of the predictors with the highest correlation with group membership, called the first canonical variate, then repeats to find further canonical variates that are uncorrelated with the previous ones. This produces a maximum of g-1 canonical variates.

[Figure: CDA plot, Canonical Var 1 vs Canonical Var 2]

### Discriminant Analysis

Discriminant analysis also refers to a wider family of techniques, still for a discrete response and continuous predictors. These produce discriminant functions that classify observations into groups; the functions can be linear or quadratic, and can also be based on nonparametric techniques. Often one trains on one dataset, then tests on another.

### CHAID

Chi-squared Automatic Interaction Detection. For
a discrete response and many discrete predictors — a common situation in market research. CHAID produces a tree structure whose nodes get purer (more different from each other) as the tree grows. It uses a chi-squared test statistic to determine the best variable to split on at each node, and tries various ways of merging categories, making a Bonferroni adjustment for multiple tests. It stops when no more statistically significant splits can be found.

### Example of CHAID Output

[Figure: CHAID tree for a Titanic survival example, splitting on sex, age (adults), and passenger class (1st or 2nd class); the scanned node counts are largely illegible.]

### CHAID Software

Available in SAS Enterprise Miner, if you have enough money. It was provided as a free macro until SAS decided to market it as a data mining technique; TREEDISC.SAS is still available on the web, although apparently not on the SAS web site. It is also implemented in at least one standalone package. CHAID was developed in the 1970s; other tree-based techniques are available, and we will discuss these later.

### TREEDISC Macro

```sas
%treedisc(data=survey2, depvar=bs,
          nominal=c o p q x ae af ag ai aj al am ao ap aw bf1 bf2 ck cn,
          ordinal=lifestag t u v w y ab ah ak,
          ordfloat=ac ad an aq ar as aw,
          options=list noformat read, maxdepth=3, trace=medium,
          draw=gr, leaf=50, outtree=all);
```

You need to specify the type of each variable: nominal, ordinal, or ordinal with a floating value.

### Partial Least Squares (PLS)

A multivariate generalisation of regression: the model has the form Y = XB + E. PLS also extracts factors underlying the predictors, chosen to explain both the response variation and the variation among the predictors. The results are often more powerful than principal components regression. "PLS" also refers to a more general technique for fitting general path models, not discussed here.

### Structural Equation Modeling (SEM)

A general method for fitting and testing path analysis models based on covariances, also known as LISREL. It is implemented in SAS in PROC CALIS.
It fits specified causal structures (path models) that usually involve factors, or latent variables — a confirmatory analysis.

### SEM Example

Relationship between academic and job success. SAS code:

```sas
data job (type=cov);
   input _type_ $ _name_ $ act cgpa entry salary promo;
   cards;
n    .      500    500    500    500    500
cov  act    1.024    .      .      .      .
cov  cgpa   0.792  1.077    .      .      .
cov  entry  0.567  0.537  0.852    .      .
cov  salary 0.445  0.424  0.518  0.670    .
cov  promo  0.434  0.389  0.475  0.545  0.716
;
proc calis data=job cov stderr;
   lineqs
      act    =      1 F1 + e1,
      cgpa   = p2f1   F1 + e2,
      entry  = p3f1   F1 + e3,
      salary =      1 F2 + e4,
      promo  = p5f1   F2 + e5;
   std
      e1 = vare1, e2 = vare2, e3 = vare3, e4 = vare4, e5 = vare5,
      F1 = varF1, F2 = varF2;
   cov
      F1 F2 = covF1F2;
   var act cgpa entry salary promo;
run;
```

### Results

[Table: parameter estimates, standard errors, and t-values; largely illegible in the scan.] All parameters are statistically significant, with a high correlation found between the latent traits of academic and job success. However, the overall chi-squared value for the model is 11.13 with 4 df, so the model does not fit the observed covariances perfectly.

### Latent Variable Models

We have seen that both latent trait and latent class models can be useful: latent traits for factor analysis and SEM, latent classes for probabilistic segmentation. The Mplus software can now fit combined latent trait and latent class models. It appears very powerful, and subsumes a wide range of multivariate analyses.

### Broader MVA Issues

Preliminaries: EDA is usually very worthwhile — univariate summaries (e.g. histograms), scatterplot matrices, and multivariate profiles (spiderweb plots). Missing data: establish the amount (by variable, and overall) and the pattern across individuals; think about the reasons for missing data; and treat missing data appropriately, e.g. impute, or build it into the model fitting.

### Factor Analysis Example

```r
> X <- read.table("T8-4.dat")   # path abbreviated
> names(X) <- c("JPMorgan","Citibank","WellsFargo","RoyalDutchShell","ExxonMobil")
> S <- var(X)
> R <- cor(X)
```

Principal component solution of the factor model, m = 2 factors:
```r
> R
                 JPMorgan  Citibank WellsFargo RoyalDutchShell ExxonMobil
JPMorgan        1.0000000 0.5322878  0.5104973       0.1145019  0.1544528
Citibank        0.5322878 1.0000000  0.5741424       0.3222921  0.2125747
WellsFargo      0.5104973 0.5741424  1.0000000       0.1824992  0.1452057
RoyalDutchShell 0.1145019 0.3222921  0.1824992       1.0000000  0.5833777
ExxonMobil      0.1544528 0.2125747  0.1452057       0.5833777  1.0000000
> eig <- eigen(R)
> eig$val
[1] 2.4372731 1.4070127 0.5005127 0.4000316 0.2551699
> eig$vec
          [,1]      [,2]       [,3]      [,4]       [,5]
[1,] 0.4690832 0.3680070 0.60431522 0.3630228 0.38412160
[2,] 0.5324055 0.2364624 0.13610618 0.6292079 0.49618794
[3,] 0.4651633 0.3151795 0.77182810 0.2889658 0.07116948
[4,] 0.3873459 0.5850373 0.09336192 0.3812515 0.59466408
[5,] 0.3606821 0.6058463 0.10882629 0.4934145 0.49755167
> m <- 2
> L <- eig$vec[,1:m] %*% diag(sqrt(eig$val[1:m]))   # estimated factor loadings
> L
          [,1]      [,2]
[1,] 0.7323218 0.4365209
[2,] 0.8311791 0.2804859
[3,] 0.7262022 0.3738582
[4,] 0.6047155 0.6939569
[5,] 0.5630885 0.7186401
> Psi <- diag(R - L %*% t(L))   # estimated specific variances
> Psi
0.2731542 0.2304689 0.3328604 0.1527429 0.1664878
> h2 <- apply(L^2, 1, sum)      # estimated communalities
> h2
[1] 0.7268458 0.7695311 0.6671396 0.8472571 0.8335122
> 1 - h2   # alternative calculation of the estimated specific variances
[1] 0.2731542 0.2304689 0.3328604 0.1527429 0.1664878
```

The R function factanal provides maximum likelihood estimates (a method based on the R matrix); SAS provides principal component estimates. MLE without rotation:

```r
> factanal(X, factors=2, rotation="none")

Call:
factanal(x = X, factors = 2, rotation = "none")

Uniquenesses:
0.417 0.275 0.542 0.005 0.530

Loadings:
  Factor1 Factor2
1 0.121   0.754
2 0.328   0.785
3 0.188   0.550
4 0.997
5 0.585

               Factor1 Factor2
SS loadings      1.622   1.610
Proportion Var   0.324   0.322
Cumulative Var   0.324   0.646

Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 1.97 on 1 degree of freedom.
The p-value is 0.16
```

MLE with varimax rotation:

```r
> factanal(X, factors=2, rotation="varimax")

Call:
factanal(x = X, factors = 2, rotation = "varimax")

Uniquenesses:
0.417 0.275 0.542 0.005 0.530

Loadings:
  Factor1 Factor2
1         0.753
2 0.819   0.232
3 0.558   0.108
4 0.113   0.991
5 0.108   0.577
```
```
               Factor1 Factor2
SS loadings      1.725   1.507
Proportion Var   0.345   0.301
Cumulative Var   0.345   0.646

Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 1.97 on 1 degree of freedom.
The p-value is 0.16
```

Store the results:

```r
> fit <- factanal(X, factors=2, rotation="none")
> names(fit)
 [1] "converged"    "loadings"     "uniquenesses" "correlation"  "criteria"
 [6] "factors"      "dof"          "method"       "STATISTIC"    "PVAL"
[11] "n.obs"        "call"
> fit$uniq    # access the estimated specific variances
       JPMorgan        Citibank      WellsFargo RoyalDutchShell      ExxonMobil
      0.4165374       0.2746902       0.5420233       0.0050000       0.5298429
> fit$loadings   # access the estimated factor loadings
> L <- fit$loadings
> L[,1]
> L[,2]
> # plot the loadings of Factor 1 against Factor 2
> plot(L, pch=20, xlab="F1", ylab="F2",
+      main="Factor loadings (MLE method, no rotation)")
> text(L, labels=names(X))   # add variable names
```

[Figure: factor loadings, MLE method, no rotation]

Predict the factors (factor scores) using the weighted least squares (Bartlett) method:

```r
> fit2 <- factanal(X, factors=2, rotation="none", scores="Bartlett")
> fit2$scores   # access the predicted factors
      Factor1     Factor2
1 -1.31015135  0.17157713
2  0.23325203  0.11007375
3 -0.15535353 -0.15550535
4 -1.20351173  0.35125332
...
```

Predict the factor scores using the regression method:

```r
> fit3 <- factanal(X, factors=2, rotation="none", scores="regression")
> fit3$scores
> # plot the predicted factor scores
> plot(fit3$scores, pch=20, xlab="Factor1", ylab="Factor2",
+      main="Factor scores by regression")
> abline(h=0)
> abline(v=0)
```

[Figure: factor scores by regression]
### Multivariate Regression Example

```r
dataset1 <- read.table("T7-6.dat")   # paths abbreviated
dataset2 <- read.table("T7-7.dat")
names(dataset) <- c("y1","y2","z1","z2","z3","z4","z5")
Y <- dataset[,1:2]
Y <- as.matrix(Y)

# Design matrix: code the variables in the FULL model
#   Y = B0 + B1*z1 + B2*z2 + B3*z3 + B4*z4 + B5*z5 + E
# column 1: all 1, for the intercept term
# columns 2..6: predictors z1..z5
Z <- cbind(1, dataset[,3], dataset[,4], dataset[,5], dataset[,6], dataset[,7])

# Generalized inverse: if Z is not of full rank, or the columns of Z
# are collinear, then
library(MASS)   # load the proper package into R
# ginv(Z)       # function call for the generalized inverse

n <- nrow(Y)       # number of observations
m <- ncol(Y)       # number of response variables
r <- ncol(Z) - 1   # r+1 = number of parameters = rank(Z) in the full model

# Full model
Bhat <- solve(t(Z) %*% Z) %*% t(Z) %*% Y   # parameter estimates
Yhat <- Z %*% Bhat                         # predicted values E(Y)
E <- Y - Yhat                              # residuals
# Full model residual sum of squares and cross-products matrix
W <- t(Y - Z %*% Bhat) %*% (Y - Z %*% Bhat)
# Estimated error covariance matrix (unbiased estimator, Result 7.9)
S <- W / (n - r - 1)
# Maximum likelihood estimator of the error covariance (biased, Result 7.10),
# used for hypothesis testing and model selection
Sig <- W / n

# Suppose we want to test the reduced model  Y = B0 + B3*z3 + B4*z4 + B5*z5 + E,
# which is equivalent to testing H0: B1 = B2 = 0
# (z1 and z2 are not significant in predicting the response)
R1 <- c(1,4,5,6)        # index of the Z columns used in the reduced model
q <- ncol(Z[,R1]) - 1   # q+1 = number of parameters in the reduced model
Bhat1 <- solve(t(Z[,R1]) %*% Z[,R1]) %*% t(Z[,R1]) %*% Y
# Reduced model residual sum of squares and cross-products matrix
G <- t(Y - Z[,R1] %*% Bhat1) %*% (Y - Z[,R1] %*% Bhat1)
Sig1 <- G / n           # MLE of the reduced-model error covariance
# Likelihood ratio test statistic (valid when n-r and n-m are large)
LRT <- -(n - r - 1 - .5*(m - r + q + 1)) * log(det(Sig)/det(Sig1))
# Chi-square critical value
qchisq(.95, m*(r - q))

# Residual diagnostics (refer to Section 7.6 and the middle of p. 395)
Yhat <- Z %*% Bhat
e <- Y - Yhat
par(mfrow=c(2,3))   # split the graph window into 2 rows x 3 columns
```
    plot(Yhat[,j], e[,j], xlab = "Fitted values", ylab = "Residuals",
         main = c("Residual plot for response", j))
    abline(h = 0)
  }
> par(mfrow = c(1,1))

> par(mfrow = c(2,3))
> for(j in 1:m) {
    qqnorm(e[,j], main = c("Q-Q plot of residuals for response", j))
    qqline(e[,j])
  }
> par(mfrow = c(1,1))

Some inference based on the fitted regression model

Estimated values of the mean responses at z1 = 10, z2 = 9.5, z3 = z4 = z5 = 0
> z0 <- c(1, 10, 9.5, 0, 0, 0)
> t(z0) %*% Bhat

Simultaneous confidence intervals for the mean responses
> se <- sqrt( t(z0) %*% solve(t(Z) %*% Z) %*% z0 * (n/(n - r - 1)) * diag(S) )
> rbind( t(z0) %*% Bhat - sqrt(m*(n - r - 1)/(n - r - m) * qf(0.95, m, n - r - m)) * se,
         t(z0) %*% Bhat + sqrt(m*(n - r - 1)/(n - r - m) * qf(0.95, m, n - r - m)) * se )

Predicted values of individual responses at z1 = 10, z2 = 9.5, z3 = z4 = z5 = 0
> z0 <- c(1, 10, 9.5, 0, 0, 0)
> t(z0) %*% Bhat

Simultaneous prediction intervals for individual responses
> se <- sqrt( (1 + t(z0) %*% solve(t(Z) %*% Z) %*% z0) * (n/(n - r - 1)) * diag(S) )
> rbind( t(z0) %*% Bhat - sqrt(m*(n - r - 1)/(n - r - m) * qf(0.95, m, n - r - m)) * se,
         t(z0) %*% Bhat + sqrt(m*(n - r - 1)/(n - r - m) * qf(0.95, m, n - r - m)) * se )

Note: the confidence intervals are printed by column; you may wish to write a for loop
as before to store the results by row.

A. Elementary commands
> x42 + 17
Error: Object "x42" not found
> 42 + 17
[1] 59

Arithmetic operators: + plus, - minus, * times, / divided by, ^ power

B. Vectors
> x <- c(1,2,3,4)
> x
[1] 1 2 3 4
> length(x)
[1] 4
> names <- c("Peter","Mary","Nancy")
> names
[1] "Peter" "Mary"  "Nancy"
> mix <- c(names, 55, 33)
> mix
[1] "Peter" "Mary"  "Nancy" "55"    "33"

seq(lower, upper, increment)
> seq(1,5,1)
[1] 1 2 3 4 5
> seq(2,20,2)
 [1]  2  4  6  8 10 12 14 16 18 20
> x <- c(seq(1,5,1), seq(4,20,4))
> x
 [1]  1  2  3  4  5  4  8 12 16 20

rep(pattern, number of times)
> rep(5,3)
[1] 5 5 5
> rep(1:3,4)
 [1] 1 2 3 1 2 3 1 2 3 1 2 3
> x <- rep(seq(3), 3)
> x
[1] 1 2 3 1 2 3 1 2 3
> x <- rep(seq(3), c(1,2,3))
> x
[1] 1 2 2 3 3 3
> x <- rep(seq(3), rep(3,3))
> x
[1] 1 1 1 2 2 2 3 3 3

> x <- 1:5
> x[3]
[1] 3
> x[c(1,3)]
[1] 1 3
> x + 2
[1] 3 4 5 6 7
> x * 2
[1]  2  4  6  8 10
> x / 2
[1] 0.5 1.0 1.5 2.0 2.5
> x^2
[1]  1  4  9 16 25

Common functions:
sqrt      square root
log       natural logarithm; log10  logarithm base 10
exp       exponential; abs  absolute value
round     round to the nearest integer; ceiling  round up; floor  round down
sin, cos, tan, asin, acos, atan
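For readers coming from Python, the `seq`/`rep` idioms above have close NumPy analogues. A small sketch (the R names are the real ones from the transcript; the NumPy mapping is an illustration):

```python
import numpy as np

a = np.arange(1, 6)                 # like seq(1, 5, 1): 1 2 3 4 5
b = np.arange(2, 21, 2)             # like seq(2, 20, 2): 2 4 ... 20
c = np.tile([1, 2, 3], 4)           # like rep(1:3, 4): the whole pattern, 4 times
d = np.repeat([1, 2, 3], [1, 2, 3]) # like rep(seq(3), c(1,2,3)): 1 2 2 3 3 3

print(a ** 2)                       # like x^2 in R: elementwise power
```

Note the same distinction as in R: `np.tile` repeats the whole vector (`rep(x, times)`), while `np.repeat` with a count vector repeats element by element (`rep(x, c(...))`).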
C. Matrices
matrix(data, nrow, ncol, byrow = F)
> x <- c(1,2,3)
> y <- c(4,5,6)
> xy <- matrix(c(x,y), nrow = 2)
> xy
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> z <- 1:20
> matrix(z, nrow = 4)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
> matrix(z, nrow = 4, byrow = T)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
[3,]   11   12   13   14   15
[4,]   16   17   18   19   20

Elementwise operations
> xy * xy
     [,1] [,2] [,3]
[1,]    1    9   25
[2,]    4   16   36
> xy / xy
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    1    1

Matrix product: the operator is %*%
> xy %*% t(xy)
     [,1] [,2]
[1,]   35   44
[2,]   44   56

D. Data frames and reading in data

Constructing a data frame
> height <- c(50,70,45,80,100)
> weight <- c(120,140,100,200,190)
> age <- c(20,40,41,31,33)
> names <- c("Bob","Ted","Alice","Mary","Sue")
> sex <- c("Male","Male","Female","Female","Female")
> data <- data.frame(names, sex, height, weight, age)
> data
  names    sex height weight age
1   Bob   Male     50    120  20
2   Ted   Male     70    140  40
3 Alice Female     45    100  41
4  Mary Female     80    200  31
5   Sue Female    100    190  33
> data[5, c(1,2,5)]
  names    sex age
5   Sue Female  33
> data[,"age"]
[1] 20 40 41 31 33
> data$age
[1] 20 40 41 31 33
> attach(data)
> age
[1] 20 40 41 31 33

Change the value of age
> data$age <- c(20,30,45,32,32)
> data
  names    sex height weight age
1   Bob   Male     50    120  20
2   Ted   Male     70    140  30
3 Alice Female     45    100  45
4  Mary Female     80    200  32
5   Sue Female    100    190  32

Reading in data: read into a data frame
PC version
> read.table("c:/children.txt", header = T)    file with column names
  age  sex
1   9  boy
2  13 girl
Mac version
> read.table("children.txt", header = T)       file with column names

> read.table("Sta135jimingT1-1.dat")           file without column names
       V1    V2
1 3497900 0.623
2 2485475 0.593
3 1782875 0.512
4 1725450 0.500
5 1645575 0.463
6 1469800 0.395
> data2 <- read.table("Sta135jimingT1-1.dat")
> mean(data2)
          V1           V2
2.101179e+06 5.143333e-01
> var(data2)
             V1           V2
V1 5.894434e+11 5.737374e+04
V2 5.737374e+04 7.016667e-03

Read into a matrix
> matrix(scan('T1-5.dat'), ncol = 7, byrow = T)
Read 294 items
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    8   98    7    2   12    8    2
[2,]    7  107    4    3    9    3    5
[3,]    7  103    4    3    4    5    5
[4,]   10   88    5  ...
> w <- matrix(scan('Sta135jimingWichern_dataT1-5.dat'), ncol = 7, byrow = T)
Read 294 items
> colMeans(w)
[1]  7.500000 73.857143
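The `byrow` distinction above is worth internalizing: R's `matrix()` fills column by column by default. The same two fill orders exist in NumPy as `order="F"` (Fortran/column-major, R's default) versus `order="C"` (row-major, R's `byrow = TRUE`). A minimal sketch:

```python
import numpy as np

z = np.arange(1, 21)

# R's matrix(z, nrow = 4) fills column by column (Fortran order) ...
by_col = z.reshape((4, 5), order="F")

# ... while matrix(z, nrow = 4, byrow = TRUE) fills row by row (C order)
by_row = z.reshape((4, 5), order="C")

print(by_col[0])   # first row: 1 5 9 13 17
print(by_row[0])   # first row: 1 2 3 4 5
```

The first rows match the two R outputs in the transcript above, which is a quick way to check which fill order you are getting.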
     4.547619  2.190476 10.047619  9.404762  3.095238
> var(w)
          [,1]        [,2]      [,3]      [,4]       [,5]       [,6]      [,7]
[1,] 2.5000000   2.7804878 0.3780488 0.4634146  0.5853659  2.2317073 0.1707317
[2,] 2.7804878 300.5156794 3.9094077 1.3867596  6.7630662 30.7909408 0.6236934
[3,] 0.3780488   3.9094077 1.5220674 0.6736353  2.3147503  2.8217189 0.1416957
[4,] 0.4634146   1.3867596 0.6736353 1.1823461  1.0882695  0.8106852 0.1765389
[5,] 0.5853659   6.7630662 2.3147503 1.0882695 13.6353080  3.1265970 1.0441347
[6,] 2.2317073  30.7909408 2.8217189 0.8106852  3.1265970 30.9785134 0.5946574
[7,] 0.1707317   0.6236934 0.1416957 0.1765389  1.0441347  0.5946574 0.4785134

> data[,3:5]
  height weight age
1     50    120  20
2     70    140  30
3     45    100  45
4     80    200  32
5    100    190  32
> mean(data[,3:5])
height weight    age
  69.0  150.0   31.8
> var(data[,3:5])
       height weight   age
height  505.0  887.5 -21.5
weight  887.5 1900.0 -67.5
age     -21.5  -67.5  79.2
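The `mean()`/`var()` calls above compute the column means and the sample covariance matrix with divisor n - 1. The same numbers come out of NumPy with `ddof = 1`, using the height/weight/age columns from the data frame above:

```python
import numpy as np

# The height, weight, and (updated) age columns of the data frame above
X = np.array([[ 50, 120, 20],
              [ 70, 140, 30],
              [ 45, 100, 45],
              [ 80, 200, 32],
              [100, 190, 32]], dtype=float)

mean = X.mean(axis=0)            # like mean(data[,3:5]): 69.0, 150.0, 31.8
cov = np.cov(X, rowvar=False)    # like var(): sample covariance, divisor n - 1

print(mean)
print(cov[0, 0])                 # var(height) = 505.0
```

Note that the height-age and weight-age covariances come out negative (-21.5 and -67.5): in this tiny data set the tallest people happen to be middle-aged, so the signs in the transcript are worth double-checking against the raw columns.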
