# Experimental Statistics For Biological Sciences II ST 512


This 65 page Class Notes was uploaded by Jordane Kemmer on Thursday October 15, 2015. The Class Notes belongs to ST 512 at North Carolina State University taught by David Dickey in Fall. Since its upload, it has received 50 views. For similar materials see /class/223952/st-512-north-carolina-state-university in Statistics at North Carolina State University.


Date Created: 10/15/15

## Example: GRADE, IQ, and STUDY TIME

| IQ | STUDY TIME | GRADE |
|---|---|---|
| 105 | 10 | 75 |
| 110 | 12 | 79 |
| 120 | 6 | 68 |
| 116 | 13 | 85 |
| 122 | 16 | 91 |
| 130 | 8 | 79 |
| 114 | 20 | 98 |
| 102 | 15 | 76 |

Use the class computing account to enter the study time data. Regress GRADE on IQ. Regress GRADE on TIME and IQ. Finally, regress GRADE on TIME, IQ, and TI, where TI = TIME*IQ. The TI variable could, for example, be created in your data step. For the regression of GRADE on TIME and IQ, use the option I in PROC REG. This will output the $(X'X)^{-1}$ matrix.

ANOVA (Grade on IQ)

| Source | df | SSq | Mn Sq | F |
|---|---|---|---|---|
| IQ | 1 | 15.9393 | 15.9393 | 0.153 |
| Error | 6 | 625.935 | 104.32 | |

It appears that IQ has nothing to do with grade, but we did not look at study time. Looking at the multiple regression we get

ANOVA (Grade on IQ, Study Time)

| Source | df | SSq | Mn Sq | F |
|---|---|---|---|---|
| Model | 2 | 596.12 | 298.06 | 32.57 |
| Error | 5 | 45.76 | 9.15 | |

| Source | df | Type I (sequential) | Type III (partial) |
|---|---|---|---|
| IQ | 1 | 15.94 | 121.24 |
| STUDY | 1 | 580.18 | 580.18 |

| Parameter | Estimate | t | Pr > \|t\| | Std Err |
|---|---|---|---|---|
| INTERCEPT | 0.74 | 0.05 | 0.9656 | 16.26 |
| IQ | 0.47 | 3.64 | 0.0149 | 0.13 |
| STUDY | 2.10 | 7.96 | 0.0005 | 0.26 |

From this regression we also can get

$$(X'X)^{-1} = \begin{pmatrix} 28.8985 & -0.2261 & -0.2242 \\ -0.2261 & 0.0018 & 0.0011 \\ -0.2242 & 0.0011 & 0.0076 \end{pmatrix}$$

1. To test $H_0$: the coefficient on IQ is 0. (Note: calculations done with extra decimal accuracy.)
   - (a) Using the t-test: $t = 0.47 / \sqrt{0.0018 \times 9.15} = 3.64$.
   - (b) Using the Type III F-test: $F = 121.24 / 9.15 = 13.25 = t^2$.
   - Note: The Type III sum of squares is defined by setting $t^2 = F$. This means that Type III SSq $= b \cdot b / c$, where $b$ is the coefficient being tested and $c$ is the diagonal element of $(X'X)^{-1}$ which corresponds to $b$. We have $(0.47)(0.47)/0.0018 = 121.24$.
2. Estimate the mean grade for the population of all potential students with IQ 113 and study time 14 hours.
   - (a) Write this estimate as $Ab$, where $A = (1, 113, 14)$.
   - (b) The variance of this is $A (X'X)^{-1} A' \cdot MSE = 1.303$.
   - (c) The prediction is $Ab = 83.64$.
   - (d) To get the confidence interval: $83.64 \pm 2.571 \sqrt{1.303}$.
   - (e) Interval: (80.71, 86.57).
3. Estimate the grade for an individual with 113 IQ and 14 hours study time: $83.64 \pm 2.571 \sqrt{1.303 + 9.15}$, giving (75.33, 91.95).
4. What percent of grade variability is explained by IQ and STUDY? $R^2$ = (corrected regression SSq)/(corrected total SSq) $= 596.12/641.88 = 0.93$, or 93%.
5. Notes: When a new column is added to a regression, all the coefficients can change.
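The numbers in items 1 through 4, and the way the IQ coefficient moves once STUDY enters the model, can be checked with a short script. This is a minimal sketch in plain Python rather than the SAS used in these notes; the helper names `invert` and `fit` are hypothetical, and the tabled value 2.571 is taken from the notes rather than computed.

```python
import math

# (IQ, study time, grade) rows from the data table in the notes
DATA = [(105, 10, 75), (110, 12, 79), (120, 6, 68), (116, 13, 85),
        (122, 16, 91), (130, 8, 79), (114, 20, 98), (102, 15, 76)]

def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(m)
    a = [[float(v) for v in row] + [float(i == j) for j in range(n)]
         for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [row[n:] for row in a]

def fit(x_rows, y):
    """Least squares fit; returns (coefficients b, (X'X)^-1, MSE)."""
    k = len(x_rows[0])
    xtx = [[sum(r[i] * r[j] for r in x_rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x_rows, y)) for i in range(k)]
    inv = invert(xtx)
    b = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    sse = sum((yi - sum(bi * xi for bi, xi in zip(b, r))) ** 2
              for r, yi in zip(x_rows, y))
    return b, inv, sse / (len(y) - k)

grade = [g for _, _, g in DATA]
b_iq_only, _, _ = fit([(1, iq) for iq, _, _ in DATA], grade)
b, inv, mse = fit([(1, iq, st) for iq, st, _ in DATA], grade)

t_iq = b[1] / math.sqrt(inv[1][1] * mse)   # item 1(a)
type3_iq = b[1] ** 2 / inv[1][1]           # item 1(b): b*b/c

a_row = (1, 113, 14)                       # item 2: IQ 113, 14 hours of study
pred = sum(ai * bi for ai, bi in zip(a_row, b))
var_mean = mse * sum(a_row[i] * inv[i][j] * a_row[j]
                     for i in range(3) for j in range(3))
T_5DF = 2.571                              # tabled t for 5 df, as used in the notes
half_mean = T_5DF * math.sqrt(var_mean)
half_indiv = T_5DF * math.sqrt(var_mean + mse)   # item 3 adds MSE for an individual
clm = (pred - half_mean, pred + half_mean)
cli = (pred - half_indiv, pred + half_indiv)

print(f"IQ slope alone: {b_iq_only[1]:.4f}; with STUDY in the model: {b[1]:.4f}")
print(f"t = {t_iq:.2f}, Type III SS = {type3_iq:.2f}, prediction = {pred:.2f}")
print(f"mean interval: ({clm[0]:.2f}, {clm[1]:.2f})")
print(f"individual interval: ({cli[0]:.2f}, {cli[1]:.2f})")
```

The first printed line shows the point of note 5: the IQ slope is small when IQ is the only regressor and much larger once STUDY is included.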
Their t-statistics can change as well: the t's could go from significance to insignificance or vice versa. The exception to the above case is when the added column of X is orthogonal to the original columns. This means that the new $X'X$ has the old $X'X$ in the upper left corner, the sum of squares of the new column as the bottom right element, and all other elements 0.

Rerun this example adding a row `113 14 .` at the end of the dataset. The dot implies a missing value. Use the statement `MODEL GRADE = IQ STUDY / P CLM;`. Compare to part 2 above. Rerun again with CLI instead of CLM. Compare to part 3 above. Was the extra data row used in computing the regression coefficients?

    OPTIONS LS=80 NODATE;
    DATA GRADES;
      INPUT IQ STUDY GRADE @@;
      ST_IQ = STUDY*IQ;
    CARDS;
    105 10 75  110 12 79  120  6 68  116 13 85
    122 16 91  130  8 79  114 20 98  102 15 76
    ;
    PROC REG;
      MODEL GRADE = IQ STUDY ST_IQ / SS1 SS2;
      TITLE "GRADE AND STUDY TIME EXAMPLE FROM ST 512 NOTES";
    PROC PLOT;
      PLOT STUDY*IQ='X' / VPOS=35;
    DATA EXTRA;
      INPUT IQ STUDY GRADE;
    CARDS;
    113 14 .
    ;
    DATA BOTH;
      SET GRADES EXTRA;
    PROC REG;
      MODEL GRADE = IQ STUDY / P CLM;
    RUN;

Output:

    GRADE AND STUDY TIME EXAMPLE FROM ST512 NOTES

    DEP VARIABLE: GRADE
                        SUM OF        MEAN
    SOURCE    DF       SQUARES      SQUARE    F VALUE    PROB>F
    MODEL      3       610.810     203.603     26.217    0.0043
    ERROR      4     31.064674    7.766169
    C TOTAL    7       641.875

    ROOT MSE    2.786785    R-SQUARE    0.9516
    DEP MEAN   81.375000    ADJ R-SQ    0.9153
    C.V.        3.42462

                PARAMETER     STANDARD    T FOR H0:
    VARIABLE     ESTIMATE        ERROR   PARAMETER=0   PROB>|T|
    INTERCEP    72.206076    54.072776         1.335     0.2527
    IQ          -0.131170     0.455300        -0.288     0.7876
    STUDY       -4.111072     4.524301        -0.909     0.4149
    ST_IQ        0.053071     0.038581         1.376     0.2410

    VARIABLE    TYPE I SS    TYPE II SS
    INTERCEP    52975.125     13.848316
    IQ          15.939299      0.644589
    STUDY         580.176      6.412303
    ST_IQ       14.695210     14.695210

Discussion of the interaction model. We call the product I*S = IQ*STUDY an "interaction" term. Our model is

$$\hat{G} = 72.21 - 0.13\,I - 4.11\,S + 0.0531\,I \cdot S.$$

Now if IQ = 100 we get

$$\hat{G} = (72.21 - 13.1) + (-4.11 + 5.31)\,S$$

and if IQ = 120 we get

$$\hat{G} = (72.21 - 15.7) + (-4.11 + 6.37)\,S.$$

Thus we expect an extra hour of study to increase the grade by 1.20 points for someone with IQ 100 and by 2.26 points for someone with IQ 120 if we use this interaction model.
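The two study-time slopes quoted in the discussion follow directly from the fitted coefficients. A minimal Python sketch (not part of the original SAS notes): the coefficient values are the estimates from the PROC REG listing, with minus signs as implied by the model equation, and `line_at` is a hypothetical helper name.

```python
# Estimates from the interaction-model listing (signs as in the model equation)
B0, B_IQ, B_STUDY, B_ST_IQ = 72.206076, -0.131170, -4.111072, 0.053071

def line_at(iq):
    """Intercept and slope of predicted GRADE versus STUDY at a fixed IQ."""
    return B0 + B_IQ * iq, B_STUDY + B_ST_IQ * iq

lines = {iq: line_at(iq) for iq in (100, 120)}
for iq, (intercept, slope) in lines.items():
    print(f"IQ {iq}: grade = {intercept:.2f} + {slope:.2f} * study")
```

The slope depends on IQ only through the product term, so with a zero interaction coefficient both lines would have the same slope.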
Since the interaction is not significant, we may want to go back to the simpler "main effects" model. Suppose we measure IQ in deviations from 100 and STUDY in deviations from 8. What happens to the coefficients and t-tests in the interaction model? How about the main effects model?

[Plot of STUDY*IQ, titled "GRADE AND STUDY TIME EXAMPLE FROM CLASS NOTES"; symbol used is 'X'; vertical axis STUDY.]

    GRADE AND STUDY TIME EXAMPLE FROM NOTES

    DEP VARIABLE: GRADE
                        SUM OF        MEAN
    SOURCE    DF       SQUARES      SQUARE    F VALUE    PROB>F
    MODEL      2       596.115     298.058     32.568    0.0014
    ERROR      5     45.759885    9.151977
    C TOTAL    7       641.875

    ROOT MSE    3.025223    R-SQUARE    0.9287
    DEP MEAN   81.375000    ADJ R-SQ    0.9002
    C.V.        3.717633

                      PARAMETER     STANDARD    T FOR H0:
    VARIABLE   DF      ESTIMATE        ERROR   PARAMETER=0   PROB>|T|
    INTERCEP    1      0.736555    16.262800         0.045     0.9656
    IQ          1      0.473084     0.129980         3.640     0.0149
    STUDY       1      2.103436     0.264184         7.962     0.0005

                    PREDICT   STD ERR   LOWER95%   UPPER95%
    OBS   ACTUAL      VALUE   PREDICT       MEAN       MEAN    RESIDUAL
      1   75.000     71.445     1.933     66.477     76.412       3.555
      2   79.000     78.017     1.270     74.752     81.282    0.983001
      3   68.000     70.127     1.963     65.082     75.173      -2.127
      4   85.000     82.959     1.093     80.150     85.768       2.041
      5   91.000     92.108     1.835     87.390     96.826      -1.108
      6   79.000     79.065     2.242     73.303     84.827   -0.064928
      7   98.000     96.737     2.224     91.019    102.455       1.263
      8   76.000     80.543     1.929     75.585     85.500      -4.543
      9        .     83.643     1.141     80.709     86.577           .

    SUM OF RESIDUALS            7.10543E-15
    SUM OF SQUARED RESIDUALS       45.75988

## ANALYSIS OF VARIANCE AS A SPECIAL CASE OF REGRESSION

First let us review ANOVA from last semester. Suppose we record the weights of 20 plants, where each of 4 fertilizers is used on one group of 5. We get these yields (as listed): 60 62 63 62, 61 61 61 61, 59 60 60 62, 61 64 63 60, 60 60 66 64. The means for fertilizers A, B, C, and D are 60, 61, 63, and 62.

You should remember how to compute this table:

ANOVA

| Source | df | SSq | Mn Sq | F |
|---|---|---|---|---|
| FERTILIZER | 3 | 25 | 8.333 | 3.92 |
| ERROR | 16 | 34 | 2.125 | |
| TOTAL | 19 | 59 | | |

## Formulas for Simple linear regression

Data points $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$: (1, 5), (2, 7), (3, 9), (4, 6), (5, 8).

- Sum of x values: $\sum_{i=1}^{n} x_i = 15$. Mean of x is $\bar{x} = 15/5 = 3$.
- For y: $\sum_{i=1}^{n} y_i = 35$, $\bar{y} = 7$.
- Raw sum of squares for x: $\sum_{i=1}^{n} x_i^2 = 1 + 4 + \dots + 25 = 55$.
- Raw sum of squares
for y: $\sum_{i=1}^{n} y_i^2 = 25 + 49 + \dots + 64 = 255$.
- Raw sum of cross products: $\sum_{i=1}^{n} x_i y_i = 1(5) + 2(7) + \dots + 5(8) = 110$.
- Corrected sum of squares for x: $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = (1-3)^2 + \dots + (5-3)^2 = 10$, which equals the raw SS minus the "correction term": $\sum x_i^2 - n\bar{x}^2 = 55 - 5(3^2) = 10$.
- Corrected sum of squares for y: $S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = (5-7)^2 + \dots + (8-7)^2 = 10$, which equals the raw SS minus the "correction term": $\sum y_i^2 - n\bar{y}^2 = 255 - 5(7^2) = 10$.
- Corrected sum of cross products: $S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = (1-3)(5-7) + \dots + (5-3)(8-7) = 5$, which equals $\sum x_i y_i - n\bar{x}\bar{y} = 110 - 5(3)(7) = 5$.
- Slope: $b = S_{xy}/S_{xx} = 5/10 = 0.5$.
- Intercept: $a = \bar{y} - b\bar{x} = 7 - 0.5(3) = 5.5$, so $\hat{y} = 5.5 + 0.5x$.

Inference

1. Estimate the error variance $\sigma^2$.

   | $y$ | 5 | 7 | 9 | 6 | 8 |
   |---|---|---|---|---|---|
   | $\hat{y}$ | 6 | 6.5 | 7 | 7.5 | 8 |
   | $r = y - \hat{y}$ | -1 | 0.5 | 2 | -1.5 | 0 |

   The "error sum of squares" is $SSE = \sum_{i=1}^{n} r_i^2 = 1 + 0.25 + 4 + 2.25 + 0 = 7.5$. The error degrees of freedom (df) are n minus the number of estimated coefficients: $n - 2$ (slope, intercept). The mean squared error is $MSE = SSE/(n-2) = 7.5/3 = 2.5$. This MSE is the desired estimate of $\sigma^2$.

   SSE is also $S_{yy} - S_{xy}^2/S_{xx} = S_{yy} - b^2 S_{xx} = 10 - (0.5)^2(10) = 7.5$. $S_{yy}$ is called the "total sum of squares (corrected)". We see that Error sum of squares = Total sum of squares $- S_{xy}^2/S_{xx}$. Therefore we call $S_{xy}^2/S_{xx}$ the "regression sum of squares". Also, $R^2$ = Regression SSq / Total SSq.
2. Estimate the variance of b. Imagine drawing repeated samples at these x values (1, 2, 3, 4, 5) and for each sample fitting a slope b. These b's have a distribution that is normal with mean equal to the true (population) value of the slope and variance $\sigma^2/S_{xx}$. Estimate this with $MSE/S_{xx} = 2.5/10 = 0.25$. $\sqrt{MSE/S_{xx}}$ is called the "standard error" of b. Task: test $H_0$: the true slope is 0. $t = b/\sqrt{0.25} = 1$, which is not an unusual t.
3. Estimate the variance of a, the intercept. Again imagine repeated samples, each with its regression intercept. These a's have a normal distribution with mean equal to the true (but unknown) parameter again. For a, the standard error is $\sqrt{MSE\,(1/n + \bar{x}^2/S_{xx})}$.
4. Note: the intercept is the mean of y when x is 0. How about estimating the mean of y when x is, say, some number c? The estimate is $a + bc$. The standard error is $\sqrt{MSE\,(1/n + (\bar{x} - c)^2/S_{xx})}$. CLM: $a + bc \pm t\,\sqrt{MSE\,(1/n + (\bar{x} - c)^2/S_{xx})}$.
5. An estimate of the mean y at x = c serves as an estimate for any individual future y at x = c as well. However, we are uncertain as to that future y because (A) we do not know the mean of all possible values of y at x = c, and (B) we do not know where our
individual y will be with respect to that population mean. Part 4 deals with the variation coming from (A), and individuals vary around their population means with a variance being estimated by MSE. Thus we use the same prediction as in 4 but with standard error $\sqrt{MSE\,(1 + 1/n + (\bar{x} - c)^2/S_{xx})}$. CLI: $a + bc \pm t\,\sqrt{MSE\,(1 + 1/n + (\bar{x} - c)^2/S_{xx})}$.

Example of a regression printout:

    The SAS System
    The REG Procedure
    Model: MODEL1
    Dependent Variable: y

                         Analysis of Variance
                               Sum of        Mean
    Source             DF     Squares      Square   F Value   Pr > F
    Model               1     2.50000     2.50000      1.00   0.3910
    Error               3     7.50000     2.50000
    Corrected Total     4    10.00000

    Root MSE          1.58114    R-Square    0.2500
    Dependent Mean    7.00000    Adj R-Sq    0.0000
    Coeff Var        22.58770

                         Parameter Estimates
                      Parameter    Standard
    Variable    DF     Estimate       Error   t Value   Pr > |t|
    Intercept    1      5.50000     1.65831      3.32     0.0452
    x            1      0.50000     0.50000      1.00     0.3910

                         Output Statistics
            Dep Var   Predicted
    Obs           y       Value   Residual
      1      5.0000      6.0000    -1.0000
      2      7.0000      6.5000     0.5000
      3      9.0000      7.0000     2.0000
      4      6.0000      7.5000    -1.5000
      5      8.0000      8.0000          0

## Unbalanced data handout

### Part 1: The basic problem with unbalanced data

Unbalanced one-way classification: no problem.

A two-way classification by job category and gender:

Salaries (thousand $) at XYZ Company

| | Workers | Executives |
|---|---|---|
| Males | 30, 50, 30, 35, 55 | 100 |
| Females | 20 | 75, 85, 80 |

Males are complaining because:

- There are more female than male executives.
- Females make more on average: $(30 + 50 + 30 + 35 + 55 + 100)/6 = 50$, or 50,000 (M), versus $(20 + 75 + 85 + 80)/4 = 260/4 = 65$, or 65,000 (F).

This is very interesting in light of the fact that every female makes at least 10,000 less than the worst paid of her male counterparts. The comparison of the male to female average salary clearly is unfair. Why is this? Maybe it is because of some sort of interaction in the salaries, which would mean that the difference between male and female salaries is a function of the job level. To check this out we look at a table of means:

Table of Mean Salaries (thousand $)

| | Workers | Executives |
|---|---|---|
| Males | 40 | 100 |
| Females | 20 | 80 |

The interaction is 0. In both columns males make an average of 20,000 more than females.
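The contrast between the overall averages and the cell averages can be reproduced in a few lines. This is a minimal Python sketch under the salary table given in the notes; the dictionary layout and the names `overall_mean`, `gap_workers`, and `gap_executives` are illustrative choices, not anything from the original handout.

```python
# Salaries (thousand $), keyed by (gender, job level)
salaries = {
    ("M", "worker"): [30, 50, 30, 35, 55],
    ("M", "executive"): [100],
    ("F", "worker"): [20],
    ("F", "executive"): [75, 85, 80],
}

def overall_mean(gender):
    """Mean of every salary for one gender, ignoring job level."""
    values = [s for (g, _), cell in salaries.items() if g == gender for s in cell]
    return sum(values) / len(values)

# Mean salary within each of the four cells
cell_mean = {key: sum(cell) / len(cell) for key, cell in salaries.items()}

gap_workers = cell_mean[("M", "worker")] - cell_mean[("F", "worker")]
gap_executives = cell_mean[("M", "executive")] - cell_mean[("F", "executive")]
interaction = gap_workers - gap_executives

print("overall means:", overall_mean("M"), overall_mean("F"))
print("male minus female within level:", gap_workers, gap_executives)
print("interaction (difference of the gaps):", interaction)
```

The overall means favor females while both within-level gaps favor males by the same amount, which is the imbalance effect the handout describes.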
Therefore the difference in overall male and female salaries has nothing to do with interaction. It is simply a result of the imbalance in the data.

### Part 2: Contrasts and sums of squares

Model: $Y_{ijk} = \mu + G_i + L_j + e_{ijk}$, that is, Y = overall mean + gender effect + level of job effect + error.

Summing all the table entries (using $e_{ij\cdot}$ to denote a sum over k):

| | Workers | Executives |
|---|---|---|
| M | $5\mu + 5G_1 + 5L_1 + e_{11\cdot}$ | $\mu + G_1 + L_2 + e_{12\cdot}$ |
| F | $\mu + G_2 + L_1 + e_{21\cdot}$ | $3\mu + 3G_2 + 3L_2 + e_{22\cdot}$ |

so the original data table has rows with means

- M: $\mu + G_1 + \tfrac{5}{6}L_1 + \tfrac{1}{6}L_2$ + error term
- F: $\mu + G_2 + \tfrac{1}{4}L_1 + \tfrac{3}{4}L_2$ + error term

The difference of these two means (row 1 mean minus row 2 mean) is thus an estimate of $G_1 - G_2 + \tfrac{7}{12}(L_1 - L_2)$, not an estimate of just $G_1 - G_2$.

We have four means $\bar{Y}_{ij\cdot}$ in the cells of our table. The mean of the 6 male salaries minus the mean of the 4 female salaries is seen to be

$$\tfrac{5}{6}\bar{Y}_{11\cdot} + \tfrac{1}{6}\bar{Y}_{12\cdot} - \tfrac{1}{4}\bar{Y}_{21\cdot} - \tfrac{3}{4}\bar{Y}_{22\cdot}.$$

Now with 4 cell means we can compute any linear combination $c_{11}\bar{Y}_{11\cdot} + c_{12}\bar{Y}_{12\cdot} + c_{21}\bar{Y}_{21\cdot} + c_{22}\bar{Y}_{22\cdot}$. Clearly this is an estimate of

$$(c_{11} + c_{12} + c_{21} + c_{22})\mu + (c_{11} + c_{12})G_1 + (c_{21} + c_{22})G_2 + (c_{11} + c_{21})L_1 + (c_{12} + c_{22})L_2,$$

to which we would add $c_{11}GL_{11} + c_{12}GL_{12} + c_{21}GL_{21} + c_{22}GL_{22}$ if our model included interaction. It has standard error

$$\sqrt{\sigma^2\left(\frac{c_{11}^2}{n_{11}} + \frac{c_{12}^2}{n_{12}} + \frac{c_{21}^2}{n_{21}} + \frac{c_{22}^2}{n_{22}}\right)}$$

and sum of squares

$$Q^2 = \frac{\left(c_{11}\bar{Y}_{11\cdot} + c_{12}\bar{Y}_{12\cdot} + c_{21}\bar{Y}_{21\cdot} + c_{22}\bar{Y}_{22\cdot}\right)^2}{c_{11}^2/n_{11} + c_{12}^2/n_{12} + c_{21}^2/n_{21} + c_{22}^2/n_{22}},$$

where $\bar{Y}_{ij\cdot}$ is the mean and $n_{ij}$ is the number of observations in the ij-th cell of the table.

**Type I.** I want to estimate $G_1 - G_2$, so if I do not worry about contamination from other parts of my model, I want $(c_{11} + c_{12})G_1 + (c_{21} + c_{22})G_2 = G_1 - G_2$, so that any c's with $c_{12} = 1 - c_{11}$ and $c_{22} = -1 - c_{21}$ will work. Since there are many ways to do this, I will pick the one with the smallest standard error. Using calculus to find the $c_{11}$ and $c_{21}$ that do the job, we find $c_{11} = \tfrac{5}{6}$ and $c_{21} = -\tfrac{1}{4}$. Thus

$$\tfrac{5}{6}\bar{Y}_{11\cdot} + \tfrac{1}{6}\bar{Y}_{12\cdot} - \tfrac{1}{4}\bar{Y}_{21\cdot} - \tfrac{3}{4}\bar{Y}_{22\cdot}$$

is the estimate of $G_1 - G_2$ (plus other contaminating effects) that has minimum variance. Its sum of squares is the Type I sum of squares for GENDER in a model containing GENDER, LEVEL, and possibly GENDER*LEVEL, in that order. It is also
sometimes called the sum of squares for GENDER ignoring LEVEL. Using the cell means we have our Type I sum of squares:

$$\frac{\left(\tfrac{5}{6}(40) + \tfrac{1}{6}(100) - \tfrac{1}{4}(20) - \tfrac{3}{4}(80)\right)^2}{\tfrac{(5/6)^2}{5} + \tfrac{(1/6)^2}{1} + \tfrac{(1/4)^2}{1} + \tfrac{(3/4)^2}{3}} = \frac{(-15)^2}{5/12} = 540.$$

**Type II.** Suppose I decide that I want to estimate $G_1 - G_2 + 0L_1 + 0L_2$. I now need to have $c_{11} + c_{12} = 1$, $c_{21} + c_{22} = -1$, $c_{11} + c_{21} = 0$, and $c_{12} + c_{22} = 0$, or in a table:

| | | |
|---|---|---|
| $c_{11}$ | $1 - c_{11}$ | $1\,G_1$ |
| $-c_{11}$ | $-(1 - c_{11})$ | $-1\,G_2$ |
| $0\,L_1$ | $0\,L_2$ | $0\,\mu$ |

Now we minimize the standard error of such a linear combination by minimizing

$$c_{11}^2\left(\tfrac{1}{5}\right) + (1 - c_{11})^2\left(\tfrac{1}{1}\right) + c_{11}^2\left(\tfrac{1}{1}\right) + (1 - c_{11})^2\left(\tfrac{1}{3}\right),$$

so we set $2c_{11}\left(\tfrac{1}{5} + 1\right) - 2(1 - c_{11})\left(1 + \tfrac{1}{3}\right) = 0$, which gives $c_{11} = \tfrac{10}{19}$. Using the cell means we have our Type II sum of squares:

$$\frac{\left(\tfrac{10}{19}(40) + \tfrac{9}{19}(100) - \tfrac{10}{19}(20) - \tfrac{9}{19}(80)\right)^2}{\tfrac{(10/19)^2}{5} + \tfrac{(9/19)^2}{1} + \tfrac{(10/19)^2}{1} + \tfrac{(9/19)^2}{3}} = \frac{20^2}{12/19} = 633.33.$$

This would also be referred to as the sum of squares for GENDER adjusted for LEVEL.

**Type III.** The Type II sum of squares above is fine for the model without interaction. If we had an interaction, our Type II linear combination of means would be

$$\tfrac{10}{19}\bar{Y}_{11\cdot} + \tfrac{9}{19}\bar{Y}_{12\cdot} - \tfrac{10}{19}\bar{Y}_{21\cdot} - \tfrac{9}{19}\bar{Y}_{22\cdot}$$

and thus would estimate

$$0\mu + 1G_1 - 1G_2 + 0L_1 + 0L_2 + \tfrac{10}{19}GL_{11} + \tfrac{9}{19}GL_{12} - \tfrac{10}{19}GL_{21} - \tfrac{9}{19}GL_{22},$$

which still seems like a bizarre quantity in which to be interested. Looking at the four cells, it is clear that we cannot get the coefficients of these interactions to be 0 unless we set all the c's to 0. Perhaps we can get a linear combination that is zeroed out by the "standard assumptions" that the interaction effects sum to 0 in each row. This is accomplished by this table of c's:

| | | |
|---|---|---|
| $0.5$ | $0.5$ | $1\,G_1$ |
| $-0.5$ | $-0.5$ | $-1\,G_2$ |
| $0\,L_1$ | $0\,L_2$ | $0\,\mu$ |

The margin restrictions and the restriction that the coefficients in each row be equal (so the sum of coefficients times interactions will be the coefficient times the sum of the interactions, which is 0 in each row) are enough to completely specify the c's. Notice that the linear combination we are discussing compares the average of the two row 1 cell means to the average of the two cell means in row 2. This is $0.5(40 + 100) - 0.5(20 + 80) = 20$ thousand dollars and represents what we would say (after a little careful thought) is the correct salary comparison for males versus females if we adjust for job level.

**Type IV.** The Type IV and Type III sums of squares are the same unless a cell in your table has no entries. In that case the Type IV SS has the unfortunate property of sometimes having different values depending on the alphabetical order of the levels of your factors. Thus if, instead of "workers" and "executives", we had used the terms "blue-collar" and "VIP", the actual Type IV sums of squares might change. For this reason I recommend you never use Type IV, and I do not discuss this topic further here.

### Part 3: LSMEANS

PROC GLM can produce LSMEANS, or least squares adjusted means. The ith GENDER LSMEAN, for example, would be an estimate of $\mu + G_i + \tfrac{1}{2}(L_1 + L_2)$, so these are estimates of what we would have had if the data had been balanced.

### Part 4: Some SAS Examples

    data unbal;
      input gender $ 1-7 level $ 9-19 n @;
      do worker = 1 to n;
        input salary @;
        output;
      end;
    cards;
    male    worker      5 30 50 30 35 55
    male    executive   1 100
    female  worker      1 20
    female  executive   3 75 85 80
    ;
    proc glm;
      class gender level;
      model salary = gender level gender*level / ss1 ss2 ss3 ss4;
      lsmeans gender / pdiff;
    run;

Output:

    General Linear Models Procedure
    Class Level Information
    Class     Levels    Values
    GENDER         2    female male
    LEVEL          2    executive worker
    Number of observations in data set = 10

    Dependent Variable: SALARY
                           Sum of          Mean
    Source      DF        Squares        Square   F Value   Pr > F
    Model        3     6240.00000    2080.00000     20.80   0.0014
    Error        6      600.00000     100.00000
    Cor. Tot.    9     6840.00000

    R-Square        C.V.    Root MSE    SALARY Mean
    0.912281    17.85714     10.0000         56.000

    Source     DF      Type I SS   Mean Square   F Value   Pr > F
    GENDER      1      540.00000     540.00000      5.40   0.0591
    LEVEL       1     5700.00000    5700.00000     57.00   0.0003
    GEN*LEV     1        0.00000       0.00000      0.00   1.0000

    Source     DF     Type II SS   Mean Square   F Value   Pr > F
    GENDER      1      633.33333     633.33333      6.33   0.0455
    LEVEL       1     5700.00000    5700.00000     57.00   0.0003
    GEN*LEV     1        0.00000       0.00000      0.00   1.0000

    Source     DF    Type III SS   Mean Square   F Value   Pr > F
    GENDER      1      631.57895     631.57895      6.32   0.0457
    LEVEL       1     5684.21050    5684.21050     56.84   0.0003
    GEN*LEV     1        0.00000       0.00000      0.00   1.0000

    Source     DF     Type IV SS   Mean Square   F Value   Pr > F
    GENDER      1      631.57895     631.57895      6.32   0.0457
    LEVEL       1     5684.21050    5684.21050     56.84   0.0003
    GEN*LEV     1        0.00000       0.00000      0.00   1.0000

    Least Squares Means
                   SALARY    Pr > |T| H0:
    GENDER         LSMEAN    LSMEAN1=LSMEAN2
    female     50.0000000    0.0457
    male       70.0000000

### Example 2: Paint example from the book

In this example I point out that no amount of statistical gymnastics will make up for a poor experiment. Here there are 2 additives, A and B, each at two levels in some paint. Drying time for the paint is the response. The data are badly unbalanced.

Paint Drying Times

| | no A | A |
|---|---|---|
| no B | 20, 33, 24, 23, 26 | 20, 31 |
| B | 26 | 14, 17, 18, 22, 16, 17, 8, 7 |

    data paint;
      input A B n @;
      do board = 1 to n;
        input dryt @;
        output;
      end;
    cards;
    0 0 5 20 33 24 23 26
    0 1 1 26
    1 0 2 20 31
    1 1 8 14 17 18 22 16 17 8 7
    ;
    proc glm;
      class a b;
      model dryt = a b a*b / ss1 ss2 ss3 ss4;
      lsmeans a b / pdiff;
    run;

Output:

    General Linear Models Procedure
    Class Level Information
    Class    Levels    Values
    A             2    0 1
    B             2    0 1
    Number of observations in data set = 16

    Dependent Variable: DRYT
                            Sum of          Mean
    Source        DF       Squares        Square   F Value   Pr > F
    Model          3     441.57500     147.19167      5.25   0.0152
    Error         12     336.17500      28.01458
    Corr. Total   15     777.75000

    Source   DF      Type I SS   Mean Square   F Value   Pr > F
    A         1      260.41667     260.41667      9.30   0.0101
    B         1      109.63470     109.63470      3.91   0.0713
    A*B       1       71.52363      71.52363      2.55   0.1361

    Source   DF     Type II SS   Mean Square   F Value   Pr > F
    A         1       38.61883      38.61883      1.38   0.2631
    B         1      109.63470     109.63470      3.91   0.0713
    A*B       1       71.52363      71.52363      2.55   0.1361

    Source   DF    Type III SS   Mean Square   F Value   Pr > F
    A         1      64.208562     64.208562      2.29   0.1559
    B         1      52.893493     52.893493      1.89   0.1945
    A*B       1      71.523630     71.523630      2.55   0.1361

    Source   DF     Type IV SS   Mean Square   F Value   Pr > F
    A         1      64.208562     64.208562      2.29   0.1559
    B         1      52.893493     52.893493      1.89   0.1945
    A*B       1      71.523630     71.523630      2.55   0.1361

    Least Squares Means
                 DRYT    Pr > |T| H0:
    A          LSMEAN    LSMEAN1=LSMEAN2
    0      25.6000000    0.1559
    1      20.1875000

                 DRYT    Pr > |T| H0:
    B          LSMEAN    LSMEAN1=LSMEAN2
    0      25.3500000    0.1945
    1      20.4375000
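The GENDER sums of squares in Example 1 and the A sums of squares in Example 2 can be reproduced from cell means and cell counts with the Part 2 formula for the sum of squares of a linear combination. A minimal Python sketch (not SAS); `contrast_ss` is a hypothetical helper name, and the coefficient vectors are the Type I, II, and III choices derived in Part 2, with the paint Type I weights built the same way from the cell counts.

```python
def contrast_ss(coefs, means, counts):
    """(sum of c * cell mean)^2 divided by sum of c^2 / n, the Part 2 formula."""
    estimate = sum(c * m for c, m in zip(coefs, means))
    return estimate ** 2 / sum(c * c / n for c, n in zip(coefs, counts))

# Example 1: cells ordered M-worker, M-executive, F-worker, F-executive
means1, counts1 = [40, 100, 20, 80], [5, 1, 1, 3]
type1_gender = contrast_ss([5/6, 1/6, -1/4, -3/4], means1, counts1)
type2_gender = contrast_ss([10/19, 9/19, -10/19, -9/19], means1, counts1)
type3_gender = contrast_ss([0.5, 0.5, -0.5, -0.5], means1, counts1)

# Example 2: cells ordered (no A, no B), (no A, B), (A, no B), (A, B)
cells = [[20, 33, 24, 23, 26], [26], [20, 31], [14, 17, 18, 22, 16, 17, 8, 7]]
means2 = [sum(c) / len(c) for c in cells]
counts2 = [len(c) for c in cells]
# Type I for A compares the two raw A means, so cells are weighted by their counts
type1_a = contrast_ss([5/6, 1/6, -2/10, -8/10], means2, counts2)
# Type III for A compares unweighted averages of cell means (the LSMEANS)
type3_a = contrast_ss([0.5, 0.5, -0.5, -0.5], means2, counts2)
lsmean_no_a = (means2[0] + means2[1]) / 2
lsmean_a = (means2[2] + means2[3]) / 2

print(f"GENDER: Type I {type1_gender:.2f}, Type II {type2_gender:.2f}, "
      f"Type III {type3_gender:.2f}")
print(f"A: Type I {type1_a:.5f}, Type III {type3_a:.6f}")
print(f"A LSMEANS: {lsmean_no_a:.4f}, {lsmean_a:.4f}")
```

Only the coefficient vector changes between the three types; the data and the formula stay the same, which is why the answers diverge only when the cell counts are unequal.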
Points: The overall F is significant (P = 0.0152), but nothing is significant individually except for the Type I for A. If you put B in there first, it would be significant, but not A in the Type I list (SS(A|B) would be 38.62, as we can tell from the current Type II list). In other words, there is clearly some effect of these additives, but it is virtually impossible to sort out the nature of the effect. The cell means suggest the interesting hypothesis that both additives need to be present. The comparison of the lower right cell to the rest uses up almost all the treatment sum of squares (check it out). No amount of statistical calculation can save a poorly designed experiment like this. There is no magic: the only inference we can make is based on arbitrary, uncheckable assumptions about the treatment effects (e.g., assuming no B or A*B, we have significant A).

## Flowers

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix} \sim MVN\left(\begin{pmatrix} 4 \\ 6 \\ 10 \end{pmatrix}, \begin{pmatrix} 8 & 5 & 0 \\ 5 & 12 & 4 \\ 0 & 4 & 9 \end{pmatrix}\right), \qquad Y \sim MVN(\mu, V).$$

Two customers will pay $Z_1$ and $Z_2$:

$$\begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix} = \begin{pmatrix} 1 & -1 & 1 \\ 1 & -3 & 2 \end{pmatrix} \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \qquad Z = AY + B.$$

Then $\mu_Z = A\mu + B = (8, 6)'$ and

$$V_Z = A V_Y A' = \begin{pmatrix} 1 & -1 & 1 \\ 1 & -3 & 2 \end{pmatrix} \begin{pmatrix} 8 & 5 & 0 \\ 5 & 12 & 4 \\ 0 & 4 & 9 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ -1 & -3 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} 11 & 22 \\ 22 & 74 \end{pmatrix}.$$

$\Pr\{Z_1 < Z_2\} = \Pr\{Z_1 - Z_2 < 0\}$, where $Z_1 - Z_2 \sim N(2, 41)$. In normal tables look up $(0 - 2)/\sqrt{41} = -0.31$: area to the left = probability = 0.3783.

## Practice on Matrices

1 2 f 2 l 1 1 Y 8 1 5 j 24 j

- (a) Compute $X'X$.
- (b) Compute the inverse of $X'X$.
- (c) Compute $X'Y$.
- (d) Multiply the matrix in (b) by the one in (c).
- (e) What is the rank of the matrix X?
- (f) What is the rank of the matrix $X'X$? In what way does this relate to (b) above?
- (g) Find a matrix with the same first column as X and a different second column such that the rank of this new X matrix is not 2.
- (h) We will show that the formula $(X'X)^{-1}X'Y$ gives the slope and intercept in a regression on points whose x coordinates are in the second column of the X matrix and y coordinates in the Y vector. What would that plot look like for your X matrix in (g)?

interaction: The effect of a delay is the same for location A as for location B.

## COVARIANCE ANALYSIS

An experiment is run to assess the effects of two fertilizers on yield using a completely randomized
design with 8 reps. The experiment is done in a greenhouse by randomly assigning fertilizer A to eight pots and fertilizer B to eight pots. The yields were the weights of the roots of the plants after several weeks of growth. The yields for fertilizer A were 18, 15, 12, 11, 13, 17, 12, and 16. For fertilizer B the weights were 9, 10, 12, 13, 15, 15, 11, and 9. It is easy to calculate the ANOVA table or, equivalently, the two-sample t statistic for this data. The difference of the two means is $14.25 - 11.75 = 2.5$. The MSE is 6.357 and $t = 2.5/\sqrt{6.357\,(2/8)} = 1.98$ with 14 df. Equivalently, compute

| SOURCE | DF | SS | Mn Sq | F |
|---|---|---|---|---|
| Fertilizer | 1 | 25 | 25.00 | 3.93 (insignificant) |
| Error | 14 | 89 | 6.36 | $F_{0.05} = 4.60$ |

While harvesting the plants it was noticed that the pots had become infested with insects. The degree of infestation was rated on a scale from 0 (no infestation) to 10 (high infestation) by estimating the number of insects in the pot.

D. A. Dickey, St512

The data are recorded in a SAS dataset and PROC PRINT is issued:

| YIELD | INFEST | X1 | FERT | FD |
|---|---|---|---|---|
| 18 | -5 | -5 | A | 1 |
| 15 | 0 | 0 | A | 1 |
| 12 | 4 | 4 | A | 1 |
| 11 | 3 | 3 | A | 1 |
| 13 | 2 | 2 | A | 1 |
| 17 | -4 | -4 | A | 1 |
| 12 | 5 | 5 | A | 1 |
| 16 | -2 | -2 | A | 1 |
| 9 | 4 | 0 | B | 0 |
| 10 | 3 | 0 | B | 0 |
| 12 | -1 | 0 | B | 0 |
| 13 | -5 | 0 | B | 0 |
| 15 | -4 | 0 | B | 0 |
| 15 | -5 | 0 | B | 0 |
| 11 | 1 | 0 | B | 0 |
| 9 | 4 | 0 | B | 0 |

The columns X1 and FD will be used in later analysis. The column INFEST is the infestation rating minus the sample mean rating, which was 5. This subtraction of the mean is not really necessary. It is convenient, since it tells at a glance how far any pot is above or below average infestation. We now run the following SAS step:

    PROC PLOT;
      PLOT YIELD*INFEST=FERT;

The statements cause the plot symbol to be the value of FERT, so that an A in the plot corresponds to fertilizer level 1 and a B corresponds to fertilizer level 2.

[Plot of YIELD*INFEST; symbol used is the value of FERT. Vertical axis YIELD, with the mean of the A's and the mean of the B's marked; horizontal axis INFEST.]

We see that the overall picture is two parallel lines. It is fairly obvious that there is a fertilizer effect, but we did not pick it up in our ANOVA since the ANOVA model basically just fits
two horizontal lines to the data. Any departure from this model is attributed to error variation, and we see that this resulting variation is way too large. The ANOVA model is written

$$Y_{ij} = \mu + A_i + E_{ij},$$

where $A_1$ is the fertilizer A effect and $A_2$ is the fertilizer B effect. The next step is to write a model which incorporates both the treatment effect and the linear effect of infestation as displayed in the graph. We write

$$\text{Yield} = Y_{ij} = \mu + A_i + \beta\,(X_{ij} - \bar{X}) + E_{ij},$$

where $\bar{X}$ is the sample mean of the "covariate" (infestation rating in our case). To fit this model, simply input the deviations of the covariate from its sample mean (the column INFEST in our case) and issue these commands:

    PROC GLM;
      CLASS FERT;
      MODEL YIELD = FERT INFEST / SOLUTION;

Notice that the CLASS statement will replace the column of k fertilizer values with k - 1 columns of indicator variables. In our case k = 2, and it is really not necessary to use a CLASS statement except for the fact that FERT is not numeric. Note that we do not put the variable INFEST in a CLASS statement. We do not want SAS to replace that one column with a set of columns; we just want a regression coefficient to be computed. Our output contains these items:

| PARAMETER | ESTIMATE | T TEST | P > \|T\| | STD ERR |
|---|---|---|---|---|
| INTERCEPT | 11.5139 B | 42.78 | 0.0001 | 0.2691 |
| FERT A | 2.9720 B | 7.79 | 0.0001 | 0.3816 |
| FERT B | 0.0000 B | . | . | . |
| INFEST | -0.6294 | -11.89 | 0.0001 | 0.0529 |

Because we used a CLASS statement, SAS created a column which replaced the FERT column. This column would be the same as our FD, i.e., it has eight 1's followed by eight 0's. The coefficient on the new column is 2.9720, which means that the overall level for fertilizer A is 11.5139 + 2.9720 = 14.4859. The level for B is then just 11.5139 + 0 = 11.5139. Why do we see the B's? These indicate that other combinations of numbers will fit the data equally well. For example, suppose we choose

| PARAMETER | ESTIMATE |
|---|---|
| INTERCEPT | 11.0000 |
| FERT A | 3.4859 |
| FERT B | 0.5139 |
| INFEST | -0.6294 |

This gives exactly the same fit as the other parameter estimates. Now what is the effect of the covariate? For either
variety, start with the overall level (11.5139 or 14.4859 as calculated above). Now subtract 0.6294 times the infestation rating. Recall that the infestation rating we are using is the original one with 5 subtracted off. Another summary is to say that the fit consists of two parallel lines with slope -0.6294, and the vertical distance between them is 2.9720. Finally, we see that if X equals $\bar{X}$, i.e., if INFEST = 0, then the slope gets multiplied by 0. This illustrates the fact that a covariance analysis simply adjusts all the observations to the levels they would have had if they had each been infested at rate $\bar{X}$. Thus the overall levels referred to earlier (11.5139 and 14.4859) are often called adjusted treatment means.

## TESTING FOR HOMOGENEOUS SLOPES

It is fairly obvious that the nice interpretations in covariance analysis hinge critically on the assumption that the lines within each treatment level have the same slope. If the slope of the infestation line were different for fertilizer A than for fertilizer B, then A and B would not differ by a constant amount (2.9720), and in fact B might be better than A for some X values and A better than B for others. Of course our model forces the fitting of parallel lines, but if that is inappropriate then our analysis is meaningless. How can we check for parallelism? The answer is that we use the "full and reduced model F test" from our multiple regression theory. We will fit a model which allows different slopes at each level of the treatment variable (full model) and compare this to our covariance model with the parallel slopes. One way to accomplish this fitting is to issue these commands:

    PROC GLM;
      CLASS FERT;
      MODEL YIELD = INFEST FERT FERT*INFEST;

The FERT*INFEST term will give the sum of squares for testing parallelism. Use the Type III sum of squares. In our data we see that FD is a dummy variable for fertilizer, which is what the CLASS statement will produce for us, and X1 is the product of FD times INFEST. It is a bit easier to understand what is going on with X1 written out as its own column.
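The parallel-lines fit and the separate within-fertilizer slopes can both be computed by hand from corrected sums of squares and cross products. This is a minimal Python sketch, not the PROC GLM run from the notes. It assumes the raw 0 to 10 infestation ratings as they appear in the `data yields` step of the notes, and it uses the standard one-covariate results that the common slope is the pooled within-group $S_{xy}/S_{xx}$ and that an adjusted mean is $\bar{y}_i - b(\bar{x}_i - \bar{x})$. The helper name `group_sums` is hypothetical.

```python
# (yield, raw infestation rating) pairs for each fertilizer
A = [(18, 0), (15, 5), (12, 9), (11, 8), (13, 7), (17, 1), (12, 10), (16, 3)]
B = [(9, 9), (10, 8), (12, 4), (13, 0), (15, 1), (15, 0), (11, 6), (9, 9)]

def group_sums(group):
    """Return (mean y, mean x, Sxx, Sxy) computed within one fertilizer group."""
    n = len(group)
    ybar = sum(y for y, _ in group) / n
    xbar = sum(x for _, x in group) / n
    sxx = sum((x - xbar) ** 2 for _, x in group)
    sxy = sum((x - xbar) * (y - ybar) for y, x in group)
    return ybar, xbar, sxx, sxy

ya, xa, sxx_a, sxy_a = group_sums(A)
yb, xb, sxx_b, sxy_b = group_sums(B)
xbar_all = sum(x for _, x in A + B) / len(A + B)

# Covariance model: one common slope, pooled over the two groups
slope = (sxy_a + sxy_b) / (sxx_a + sxx_b)
adj_a = ya - slope * (xa - xbar_all)   # adjusted treatment mean for A
adj_b = yb - slope * (xb - xbar_all)   # adjusted treatment mean for B

# Full model: a separate slope in each group; their difference is what
# the FERT*INFEST term tests
slope_a, slope_b = sxy_a / sxx_a, sxy_b / sxx_b

print(f"common slope {slope:.4f}; adjusted means A {adj_a:.4f}, B {adj_b:.4f}")
print(f"separate slopes A {slope_a:.4f}, B {slope_b:.4f}, "
      f"difference {slope_a - slope_b:.4f}")
```

The two separate slopes are close to each other, which is consistent with the parallel-lines picture described for the plot.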
So we input our own interaction column X1 in the model. The column X1 serves to estimate interaction and is exactly what PROC GLM does behind the scenes when we call for FERT*INFEST with FERT in a CLASS statement. We write our model as

$$Y_{ij} = \mu + A_i + \beta_1 (X_{ij} - \bar{X}) + \beta_2\,X1 + E_{ij},$$

and we see that in fertilizer group 2 (B) the slope is $\beta_1$, since X1 is 0 for this level. If we are in the fertilizer A group, then X1 is the same as INFEST, so in this group we have $(\beta_1 + \beta_2)(X - \bar{X})$, and thus the coefficient $\beta_2$ is the difference in slopes of the two regression lines. The t statistic for $\beta_2$ tests the parallelism hypothesis, and if t is significant we conclude that the lines are not parallel and the idea of "adjusted treatment means" is meaningless. For more than 2 treatments, use the Type III F test for FERT*INFEST to check for parallel slopes.

It is easy to produce a list of adjusted treatment means along with either confidence intervals or prediction intervals for individuals. Simply append to your data as many lines as you have treatment groups (2 in our case). On these lines set YIELD = . and put in the appropriate levels of the treatments. For the covariate, just put in 0 or $\bar{X}$, so that you are predicting at the mean level of the covariate. SAS will do the computations for you. Here are the last two lines of our current dataset along with the additional lines for producing adjusted treatment means (we use 0 since we have centered our X's):

    YIELD  INFEST  X1  FERT  FD
       11       1   0     B   0
        9       4   0     B   0
        .       0   0     A   1
        .       0   0     B   0

We now issue the following SAS code to perform our analysis:

    PROC GLM;
      CLASS FERT;
      MODEL YIELD = FERT INFEST / P CLM;

The output will contain predictions and confidence intervals for all the row configurations in your X matrix, including the last two rows. The predictions in those last two rows will be the so-called "adjusted treatment means". An alternative is to use the LSMEANS option in PROC GLM. See the SAS manual for more details.

    data yields;
      input fert $ @;
      do rep = 1 to 9;
        input yield infest @;
        output;
      end;
    cards;
    A 18 0 15 5 12 9 11 8 13 7 17 1 12 10 16 3 . 5
    B 9 9 10 8 12 4 13 0 15 1 15 0 11 6 9 9 . 5
    ;
    proc means;
      var infest yield;
    proc glm;
      class fert;
      model yield = fert infest / p;
      lsmeans fert / pdiff;
      means fert;

Output:

    Variable    N        Mean      Std Dev    Minimum    Maximum
    INFEST     18       5.000    3.4978985          0     10.000
    YIELD      16      13.000    2.7568098      9.000     18.000

    General Linear Models Procedure
    Class Level Information
    Class    Levels    Values
    FERT          2    A B
    Number of observations in data set = 18
    NOTE: Due to missing values, only 16 observations can be used in this analysis.

    Dependent Variable: YIELD
                               Sum of         Mean
    Source             DF     Squares       Square   F Value   Pr > F
    Model               2   106.50790     53.25395     92.40   0.0001
    Error              13     7.49210      0.57632
    Corrected Total    15   114.00000

    R-Square        C.V.    Root MSE    YIELD Mean
    0.934280    5.839650      0.7592        13.000

    Source    DF     Type I SS   Mean Square   F Value   Pr > F
    FERT       1     25.000000     25.000000     43.38   0.0001
    INFEST     1     81.507898     81.507898    141.43   0.0001

    Source    DF   Type III SS   Mean Square   F Value   Pr > F
    FERT       1     34.950206     34.950206     60.64   0.0001
    INFEST     1     81.507898     81.507898    141.43   0.0001

                      Observed      Predicted
    Observation          Value          Value       Residual
         1         18.00000000    17.63304982     0.36695018
         2         15.00000000    14.48602673     0.51397327
         3         12.00000000    11.96840826     0.03159174
         4         11.00000000    12.59781288    -1.59781288
         5         13.00000000    13.22721750    -0.22721750
         6         17.00000000    17.00364520    -0.00364520
         7         12.00000000    11.33900365     0.66099635
         8         16.00000000    15.74483597     0.25516403
         9 *                 .    14.48602673              .
        10          9.00000000     8.99635480     0.00364520
        11         10.00000000     9.62575942     0.37424058
        12         12.00000000    12.14337789    -0.14337789
        13         13.00000000    14.66099635    -1.66099635
        14         15.00000000    14.03159174     0.96840826
        15         15.00000000    14.66099635     0.33900365
        16         11.00000000    10.88456865     0.11543135
        17          9.00000000     8.99635480     0.00364520
        18 *                 .    11.51397327              .
    * Observation was not used in this analysis

    Sum of Residuals                            0.00000000
    Sum of Squared Residuals                    7.49210207
    Sum of Squared Residuals - Error SS         0.00000000
    First Order Autocorrelation                -0.04930460
    Durbin-Watson D                             2.08063484

                  YIELD    Pr > |T| H0:
    FERT         LSMEAN    LSMEAN1=LSMEAN2
    A        14.4860267    0.0001
    B        11.5139733

    Level of       -----------YIELD-----------    ----------INFEST-----------
    FERT      N          Mean             SD            Mean             SD
    A         8    14.2500000     2.60494036      5.37500000     3.73927036
    B         8    11.7500000     2.43486579      4.62500000     3.92564826

## Count data

### 1. Estimating and testing proportions

100 seeds, 45 germinate. We estimate the probability p that a plant will germinate to be 0.45 for this population. Is a 50% germination rate a reasonable possibility? We want Pr{45 or less germinate in 100 trials if p = 0.5}.

Binomial: n independent trials. Each trial is a success or a failure. p = probability of success, the same on every trial. X = observed number of successes in n trials.

$$\Pr\{X = r\} = \frac{n!}{r!\,(n-r)!}\,p^r (1-p)^{n-r}$$

[0! = 1, 1! = 1, 2! = 2·1 = 2, 3! = 3·2·1 = 6, 4! = 4·3·2·1 = 24, etc.]

Example: soft drinks. 2 cups: one has artificial sweetener, 1 sugar. Task: guess which has artificial sweetener. 6 people, 5 get it right. Can they tell the difference? $H_0$: p = 0.5 (indistinguishable).

$$\Pr\{5 \text{ or } 6 \text{ right if } p = 0.5\} = \frac{6!}{5!\,1!}(0.5)^5(0.5)^1 + \frac{6!}{6!\,0!}(0.5)^6(0.5)^0 = 6\left(\tfrac{1}{64}\right) + 1\left(\tfrac{1}{64}\right) = \tfrac{7}{64} = 0.11 > 0.05,$$

not significant.

Example: Germination. Pr{0 or 1 or 2 or ... or 45} is too complicated. Plot: centered above 1, a rectangle with base 1 (0.5 to 1.5) and height Pr{X = 1}. Centered above 2, a rectangle with base 1 (1.5 to 2.5) and height Pr{X = 2}, etc. Probability = sum of areas of the appropriate rectangles.

Program to show heights of rectangles:

    data a;
      do X = 30 to 70;
        fX = probbnml(.5,100,X) - probbnml(.5,100,X-1)*(X>0);
        output;
      end;
    proc timeplot;
      plot fX;
      id X;
    run;

[PROC TIMEPLOT output: fX plotted against X for X = 30 to 70; min 0.0000231707, max 0.0795892374.]

CENTRAL LIMIT THEOREM: If you average n independent identically distributed random variables, then

$$\frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} = \frac{\sum Y_i - n\mu}{\sqrt{n}\,\sigma}$$

has approximately a N(0, 1) distribution for n reasonably large.

Each trial: Y = 1 or 0 with probability p and 1 - p. $E\{Y\} = \mu = p$. The variance $\sigma^2$ is $p(1-p)$. Therefore

$$\frac{\sum Y_i - n\mu}{\sqrt{n}\,\sigma} = \frac{\#\text{successes} - np}{\sqrt{np(1-p)}}$$

is approximately normal.

Germination example: for 45 successes or less, start at 45.5 and find the area to the left:

$$\Pr\left\{Z < \frac{45.5 - 50}{\sqrt{100(0.5)(0.5)}}\right\} = \Pr\{Z < -4.5/5\} = \Pr\{Z < -0.9\} = 0.5 - 0.3159 = 0.1841,$$

not significant.

Example 2: Comparing 2 proportions. Variety A: 45 of 100 germinate. Variety B: 55 of 100 germinate. Again think of $Y_1$ = 1 or 0 if a variety A plant germinates or does not, and $Y_2$ = 1 or 0 for variety B. We have then $\bar{Y}_1 = 45/100$, $\bar{Y}_2 = 55/100$, so $\bar{Y}_2 - \bar{Y}_1 = 10/100 = 0.1$. Using our rules, the standard error of this difference is

$$\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}},$$

where for $H_0$: $p_1 = p_2$ the common proportion $p_1 = p_2 = p$ would be estimated from the pooled sample as $(55 + 45)/(100 + 100) = 0.5$, and we have

$$\sqrt{\left(\tfrac{1}{100} + \tfrac{1}{100}\right)(0.5)(0.5)} = \sqrt{0.005} = 0.0707,$$

so $Z = 0.1/0.0707 = 1.414$, not significant.

Fact: If $Z_1, Z_2, \dots, Z_k$ are independent N(0, 1) variables, then $Z_1^2 + Z_2^2 + \dots + Z_k^2$ has a $\chi^2_k$ distribution. In our case we have $Z^2 = 1.414^2 = 2$.

### 2. Contingency tables

Drought damage "independent of" variety (2:3:5 ratio in all rows):

| VARIETY | LIGHT | MOD | SEVERE | Total |
|---|---|---|---|---|
| A | 30 | 45 | 75 | 150 |
| B | 20 | 30 | 50 | 100 |
| C | 10 | 15 | 25 | 50 |
| Total | 60 | 90 | 150 | 300 |

Drought damage depends on variety (ratios not the same):

| VARIETY | LIGHT | MOD | SEVERE | Total |
|---|---|---|---|---|
| A | 30 | 75 | 45 | 150 |
| B | 15 | 10 | 75 | 100 |
| C | 15 | 5 | 30 | 50 |
| Total | 60 | 90 | 150 | 300 |

First table: perfect independence. How far away can we get by chance? The second table shows significant dependency. How to test? In the first table, overall we have a 60:90:150 split. If the rows have the same split, we expect 0.2n, 0.3n, 0.5n in the three columns for any group of n plants. In the bottom row we have 50 plants, so we expect 10, 15, 25, which is exactly what we get, and in every cell of the first table observed count = expected count. Note: the expected number is seen to be (row total × column total)/grand total, i.e., the upper left cell expected count is 150(60)/300 = 30.

Chi-Square: For any table (like the second one), compute all expected counts E. (For the second table, the expected numbers are in the first table.) Next, using O for observed, compute

$$\chi^2_k = \sum_{\text{all cells}} \frac{(O - E)^2}{E} = \frac{(30-30)^2}{30} + \frac{(75-45)^2}{45} + \dots + \frac{(30-25)^2}{25} = 69.25,$$

where k = degrees of freedom = (rows - 1)(columns - 1) = (3 - 1)(3 - 1) = 4. From tables, the upper 1% tail of the 4 df chi-square begins at 13.3, so we would reject even at 1%.

In SAS:

    data drought;
      do variety = "a", "b", "c";
        do response = "none", "mild", "severe";
          input count @@;
          output;
        end;
      end;
    cards;
    30 75 45 15 10 75 15 5 30
    ;
    proc freq;
      tables variety*response / chisq;
      weight count;
      title "Drought resistance data";
    run;

Output:

    Drought resistance data
    TABLE OF VARIETY BY RESPONSE
    VARIETY     RESPONSE
    Frequency
    Percent
    Row Pct
    Col Pct       mild     none     seve    Total
    a               75       30       45      150
                 25.00    10.00
1500 5000 5000 2000 3000 8333 5000 3000 b 10 15 75 100 333 500 2500 3333 1000 1500 7500 1111 2500 5000 167 500 1000 1667 1000 3000 6000 Total 90 60 150 300 3000 2000 5000 10000 STATISTICS FOR TABLE OF VARIETY BY RESPONSE Statistic DF Value Prob ChiSquare 4 69250 0001 some output deleted here Sample Size 300 Default order is alphabetical For character variable first encountered quotnonequot determines width SAS truncated quotseverequot Count data page 8 Example Back to Variety A B germination rates 0 E Germ Not germ Chisquare 2550 2550 A 45 50 55 50 2550 2550 2 B 55 50 45 50 Recall we got 2 square root of 2 Logistic Regression Idea pprobabiity of germinating function of some variables maybe temperature moisture or both Example Text page 530 Temperatures Germinating 7O 73 78 64 67 71 77 85 82 Not germ 50 63 58 72 67 75 Germination vs temperature Plot of GERMTEMP Legend A 1 obs B 2 obs etc GERM 1 A A A A A A A A A o A A A A A A I I I I I I I I I I I I I I I I 50 55 so 65 7o 75 so 35 Te mpe rat u re Idea Regress Germ 0 or 1 on Temperature Germination vs temperature Model MODEL1 Dependent Variable GERM Analysis of Variance Count data page 9 Sum of Mean Source DF Squares Square F Value ProbgtF Model 1 109755 109755 5702 00328 Error 13 250245 019250 0 Total 14 360000 Parameter Estimates Parameter Standard T for H0 Variable DF Estimate Error Parameter0 Prob gt T INTERCEP 1 1550126 090755721 1 708 01114 TEMP 1 0030658 001283925 2388 00328 Germination vs temperature Dep Var Predict Obs TEMP GERM Value Residual 1 70 10000 05959 04041 2 73 10000 06879 03121 3 78 10000 08412 01588 4 64 10000 04120 05880 5 67 10000 05039 04961 6 71 10000 06266 03734 7 77 10000 08105 01895 8 85 10000 10558 00558 lt 9 82 10000 09638 00362 10 50 0 0 0172 00172 lt 11 63 0 03813 0 3813 12 58 0 02280 0 2280 13 72 0 06572 06572 14 67 0 05039 05039 15 75 0 07492 07492 Count data page 10 Normal residuals Reasonable predicted probabilities 10558 00172 Better idea Map 0ltplt1 into L 1719541 then model L or 
6temperature e or L or 6temperature70 quotLikelihoodquot probability of sample p1pppp1pp p Use p for germinated 1p for not germinated Substitute p eLl 6L1 p 11 eL and L 2 oz l 6X quotMaximum Likelihood Estimatesquot Likelihood ea 7070 1 ea 7070 ea 7370 1 ea 73 70 11 ea 75397 Hoe6 Graph fa6 vs a6 and find values of a6 that maximize Theory also gives standard errors large sample approximations Use quotPROC LOGISTICquot to fit or quotPROC CATMODquot see text pg 535 We get Pr Germinate e4961 O1821X1e4961 O1821X where Xtemperature70 See graph page 535 Count data page 11 Data seeds Input Germ 13 n YGermquotYesquot If Germquot quot then Y do i1 to n input temp output end oards Yes 9 64 67 70 71 73 77 78 82 85 N0 6 50 58 63 67 72 75 23 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 PROC LOGISTIC dataseeds orderdata model germtemp itprint otable pprob6923 output outout1 prediotedp xbetalogit proo plot plot ptemp Ytempyvpos20 overlay run The LOGISTIC Prooedure Data Set WORKSEEDS Response Variable GERM Response Levels 2 Number of Observations 15 Link Funotion Logit Response Profile Ordered Value GERM Count 1 Yes 9 2 N0 6 WARNING 23 observations were deleted due to missing values for the response or explanatory variables Maximum Likelihood Iterative Phase Iter Step 2 Log L INTERCPT TEMP 0 INITIAL 20190350 0405465 0 1 IRLS 15205626 8553392 0127740 2 IRLS 14878609 11501730 0171150 3 IRLS 14866742 12219688 0181644 4 IRLS 14866718 12253782 0182141 5 IRLS 14866718 12253854 0182142 Model Fitting Information and Testing Global Null Hypothesis BETA0 Interoept Interoept and Criterion Only Covariates ChiSquare for Covariates AIC 22190 18867 SC 22898 20283 2 L06 L 20190 14867 5324 with 1 DF p00210 Count data page 12 Analysis of Maximum Likelihood Estimates Parameter Standard Wald Pr gt Standardized 0dds Variable DF Estimate Error ChiSquare ChiSquare Estimate Ratio INTERCPT 1 122539 71941 29013 00885 TEMP 1 01821 01034 31025 00782 0917127 1200 Association of 
Predicted Probabilities and observed Responses Concordant 796 Somers39 D 0611 Discordant 185 Gamma 0623 Tied 19 Taua 0314 54 pairs c 0806 Classification Table Correct Incorrect Percentages Prob No Sensi Speci False False Level Event Event Event Event Correct tivity ficity POS NEG 0692 5 4 2 4 600 556 667 286 500 Plot of PTEMP Legend A 1 obs B 2 obs etc Plot of YTEMP Symbol is value of Y E S t i 111 a1 1 1 111 11 1 AAAA t ABA e AAB d AAA AB P B r A 0 AB 1 B a AA 1 A i B l BAAA iOAAO 0 0 0 0 0 t y 11 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 TEMP Count data page 13 Likelihood Ratio ChiSquare Small contingency table 5 2 7 3 3 6 lt Probabilities are p11 2912 p21 p22 Likelihood is some constant times 2911 11 p12n12p21n11p22n12 where we must have these ps summing to 1 so that p22 1 p11 p12 921 The values of these ps that maximize the likelihood are the same values that maximize the logarithm of the likelihood namely 7111 710911 7122 1710 2911 2912 p21 and taking the derivatives with respect to each of the three unconstrained ps we have nijpij n221 p11 p12 p21 and if we then solve these 3 equations ij 11 12 21 we get estimates ij nijnu so we have 513 213 313 and 313 which we then plug into the log likelihood function to get 2 logLikelihood Count data page 14 C 2 5 ln513 2 ln213 3 n313 3 n313 which is C 34638368 where C is some constant Suppose p11 2 pp etcwhere p7 and p0 are probabilities of being in the first row and of being in the first column respectively This would be suggested by the independence hypothesis Then the likelihood is proportional to mamm1we7 1prpc 1pr1pc Taking logs and differentiating we have 5r 2 713 and 6 813 and 2 logLikelihood C 35268067 The difference in 2 LogLikelihood from the full and reduced models has approximately a Chisquare distribution with degrees of freedom equal to the difference in the number of unrestricted parameters The 
difference 06297 has 1 df and is the likelihood ratio Chisquare on the printout DATA LRT lnput major wealth n cards Stat Rich 5 Stat Poor 2 Other Rich 3 Other Poor 3 proc freq table majorwealthchisq norow nocol weight n run Count data page 15 TABLE OF MAJOR BY WEALTH MAJOR WEALTH Frequency Percent Poor Rich Total Other 3 3 6 2308 2308 4615 Stat 2 5 7 1538 3846 5385 Total 5 8 13 3846 6154 10000 STATISTICS FOR TABLE OF MAJOR BY WEALTH Statistic DF Value Prob ChiSquare 1 0627 0429 Likelihood Ratio ChiSquare 1 0630 0427 Continuity Adj ChiSquare 1 0048 0826 MantelHaenszel ChiSquare 1 0579 0447 Fisher39s Exact Test Left 0914 Right 0413 2Tail 0592 Phi Coefficient 0220 Contingency Coefficient 0214 Cramer39s V 0220 Sample Size 13 WARNING 100 of the cells have expected counts less than 5 ChiSquare may not be a valid test Count data page 16 Notice the warning The cell counts are not high enough for our usual Chisquare or the likelihood ratio Chisquare test statistics to have close to a X2 distribution both are only approximately Chisquare in large samples One approach to this is to use Fisher39s exact test How many tables are more extreme than this one First what do we mean by quotextremequot We expect 5613 43 rich statistics majors but we get more 5 lfwe insist on preserving the row and column totals what other tables could we get with even more rich statistics majors P08158 and P004662 n E are even more extreme Fisher suggested assigning hypergeometric probabilities as shown to these tables see text page 512513 for details 3 3 n1l n2l n1l n2l 6 7 5 8 P 3263 for 2 5 hm hm hm hm nl 32630815800466 4125 right tail Fisher exact Pvalue on printout Contrasts MSE 2125 Totals of denom F 1 Sum of3 SSq is SSTrt 25 t 2 We compute Q 201 where C weight Yi is 13971 total t 4treaments t These are contrasts because 2050 in every row i1 t They are orthogonal because 2010130 where C are i1 the coef cients weights multipliers from another row Example l1 11 21 O 3 O Other sets of tl3 orthogonal 
contrasts are possible Data Flowers input Fertilzer 01 C2 03 D1 D2 D3 Do Pep 1 to 5 input Y output end cards A 1 1 1 1 1 1 60 61 59 60 60 B 1 1 1 62 61 60 62 60 C 0 2 1 1 1 1 63 61 61 64 66 D 0 0 3 1 1 1 62 61 63 60 64 proc reg Set1 model Y C1 C2 C3 SS1 SS2 Set2 Model Y D1 D2 D3 SS1 SS2 run Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr gt F Model 3 2500000 833333 392 00283 Error 16 3400000 212500 Corrected Total 19 5900000 Parameter Estimates Parameter Variable DF Estimate Pr gt t Type I Type II Intercept 1 6150000 lt0001 75645 75645 01 1 050000 02941 25000 25000 02 1 083333 00064 208333 208333 03 1 016667 03889 16667 16667 Parameter Estimates Parameter Variable DF Estimate Pr gt t Type I Type II Intercept 1 6150000 lt0001 75645 75645 D1 1 100000 00074 200000 20000 D2 1 0 10000 0 0 D3 1 050000 01446 50000 5000
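The binomial calculations in the soft drink and germination examples can be checked numerically. The following is a minimal Python sketch using only the standard library; the helper names `binomial_pmf` and `normal_cdf` are illustrative choices, not anything from the notes or from SAS.

```python
from math import comb, erf, sqrt


def binomial_pmf(n, r, p):
    """Pr{X = r} for X ~ Binomial(n, p)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


def normal_cdf(z):
    """Standard normal Pr{Z < z}, computed from the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


# Soft drinks: 5 or 6 correct guesses out of 6 when p = 0.5.
soft_drink_tail = sum(binomial_pmf(6, r, 0.5) for r in (5, 6))

# Germination: 45 or fewer out of 100 when p = 0.5.
exact_tail = sum(binomial_pmf(100, r, 0.5) for r in range(46))

# Normal approximation with the continuity correction (start at 45.5).
z = (45.5 - 50) / sqrt(100 * 0.5 * 0.5)
approx_tail = normal_cdf(z)

print(f"soft drinks, exact tail:   {soft_drink_tail:.4f}")
print(f"germination, exact tail:   {exact_tail:.4f}")
print(f"germination, normal approx: {approx_tail:.4f} (z = {z:.1f})")
```

The exact sum of 46 binomial terms and the one-line normal approximation should agree closely here, which is the point of the rectangle picture.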
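The expected counts, the Pearson Chi-square, and the likelihood ratio Chi-square described in the contingency table sections can be sketched as follows. This is an illustrative Python version, not the PROC FREQ algorithm; the function names are assumptions, and the likelihood ratio statistic is written in its equivalent form 2 Σ O ln(O/E).

```python
from math import log


def expected_counts(table):
    """Expected count = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def pearson_chi_square(table):
    """Sum over all cells of (O - E)^2 / E."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


def likelihood_ratio_chi_square(table):
    """Difference in -2 log(Likelihood), reduced minus full model."""
    expected = expected_counts(table)
    return 2 * sum(
        o * log(o / e)
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
        if o > 0
    )


drought = [[30, 75, 45], [15, 10, 75], [15, 5, 30]]   # varieties A, B, C
germination = [[45, 55], [55, 45]]                    # varieties A, B
majors = [[5, 2], [3, 3]]                             # Stat, Other by Rich, Poor

df_drought = (len(drought) - 1) * (len(drought[0]) - 1)
print(f"drought: chi-square = {pearson_chi_square(drought):.2f}, df = {df_drought}")
print(f"germination: chi-square = {pearson_chi_square(germination):.2f}")
print(f"majors: Pearson = {pearson_chi_square(majors):.3f}, "
      f"LR = {likelihood_ratio_chi_square(majors):.3f}")
```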
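Fisher's exact test for the major by wealth table can be reproduced by enumerating every table with the same margins. The sketch below is a hedged Python illustration; `table_probability` is a hypothetical helper, and the two-tail rule used (sum all tables no more probable than the observed one) is an assumption that happens to match the 0.592 on the printout.

```python
from math import comb


def table_probability(x, row1_total, row2_total, col1_total):
    """Hypergeometric probability that the (1,1) cell equals x
    when all row and column totals are held fixed."""
    grand_total = row1_total + row2_total
    return (
        comb(row1_total, x)
        * comb(row2_total, col1_total - x)
        / comb(grand_total, col1_total)
    )


# Row 1 = statistics majors (7), row 2 = other majors (6), column 1 = rich (8).
# With these margins the number of rich statistics majors can be 2, 3, ..., 7.
probs = {x: table_probability(x, 7, 6, 8) for x in range(2, 8)}

observed = 5
right_tail = sum(p for x, p in probs.items() if x >= observed)
left_tail = sum(p for x, p in probs.items() if x <= observed)
# Assumed two-tail rule: tables at most as probable as the observed one.
two_tail = sum(p for p in probs.values() if p <= probs[observed] + 1e-12)

for x, p in probs.items():
    print(f"rich statistics majors = {x}: P = {p:.6f}")
print(f"left = {left_tail:.3f}, right = {right_tail:.3f}, 2-tail = {two_tail:.3f}")
```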
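The logistic likelihood for the seed data can be evaluated directly at the estimates reported by PROC LOGISTIC. This Python sketch does not refit the model; it only plugs the printed intercept and slope into -2 log(Likelihood), under the assumption that the temperatures in the data step are the full data set.

```python
from math import exp, log

germinated = [64, 67, 70, 71, 73, 77, 78, 82, 85]
not_germinated = [50, 58, 63, 67, 72, 75]


def germination_probability(temp, intercept, slope):
    """p = e^L / (1 + e^L) with L = intercept + slope * temp."""
    logit = intercept + slope * temp
    return exp(logit) / (1.0 + exp(logit))


def minus_two_log_likelihood(intercept, slope):
    """Use p for each germinated seed and 1 - p for each that did not."""
    log_lik = sum(
        log(germination_probability(t, intercept, slope)) for t in germinated
    )
    log_lik += sum(
        log(1.0 - germination_probability(t, intercept, slope))
        for t in not_germinated
    )
    return -2.0 * log_lik


# Intercept-only model: the logit of the overall proportion 9/15.
null_fit = minus_two_log_likelihood(log(9 / 6), 0.0)
# Final iteration from the printout (temperature not centered at 70).
full_fit = minus_two_log_likelihood(-12.253854, 0.182142)
lr_chi_square = null_fit - full_fit
p_at_70 = germination_probability(70, -12.253854, 0.182142)

print(f"-2 Log L, intercept only: {null_fit:.3f}")
print(f"-2 Log L, with TEMP:      {full_fit:.3f}")
print(f"chi-square for covariates: {lr_chi_square:.3f}")
print(f"estimated Pr(germinate) at 70 degrees: {p_at_70:.3f}")
```

Trying other (intercept, slope) pairs in `minus_two_log_likelihood` gives larger values, which is the numerical counterpart of graphing f(α, β) and looking for its maximum.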
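The contrast sums of squares can be verified from the treatment totals. In this Python sketch the signs of the coefficients are an assumption (the sums of squares do not depend on them), and `contrast_ss` and `is_orthogonal_set` are illustrative names.

```python
yields = {
    "A": [60, 61, 59, 60, 60],
    "B": [62, 61, 60, 62, 60],
    "C": [63, 61, 61, 64, 66],
    "D": [62, 61, 63, 60, 64],
}
# Assumed coefficient signs; flipping a whole row changes Q but not Q^2.
set1 = {"C1": [1, -1, 0, 0], "C2": [1, 1, -2, 0], "C3": [1, 1, 1, -3]}
set2 = {"D1": [1, 1, -1, -1], "D2": [1, -1, 1, -1], "D3": [1, -1, -1, 1]}

reps = 5
totals = [sum(values) for values in yields.values()]

# Pooled within-treatment variance: error SS on 16 df.
error_ss = sum(
    (y - sum(values) / reps) ** 2 for values in yields.values() for y in values
)
mse = error_ss / (len(yields) * (reps - 1))


def contrast_ss(coefs):
    """SSq = Q^2 / denom with Q = sum(C_i * total_i), denom = n * sum(C_i^2)."""
    q = sum(c * t for c, t in zip(coefs, totals))
    denom = reps * sum(c * c for c in coefs)
    return q * q / denom


def is_orthogonal_set(contrasts):
    """Every row sums to zero and every pair of rows has zero cross product."""
    rows = list(contrasts.values())
    sums_to_zero = all(sum(row) == 0 for row in rows)
    cross_products_zero = all(
        sum(a * b for a, b in zip(rows[i], rows[j])) == 0
        for i in range(len(rows))
        for j in range(i + 1, len(rows))
    )
    return sums_to_zero and cross_products_zero


for name, coefs in {**set1, **set2}.items():
    ss = contrast_ss(coefs)
    print(f"{name}: SSq = {ss:.4f}, F = {ss / mse:.2f}")
```

Either orthogonal set splits the same treatment sum of squares, 25, into three single-df pieces.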
