# STATISTICAL METHODS II STAT 516

These 35 pages of class notes were uploaded by Shane Marks on Monday, October 26, 2015. The class notes belong to STAT 516 at the University of South Carolina - Columbia, taught by Staff in Fall. Since their upload, they have received 16 views. For similar materials see /class/229675/stat-516-university-of-south-carolina-columbia in Statistics at the University of South Carolina - Columbia.

```
# Data for the ANCOVA example (the Trigonometry scores)
# that we studied in class.

# Entering the data and defining the variables.

# Reading the data into R:

> mydatafile <- tempfile()
> cat(file=mydatafile, "
...
", sep=" ")

> options(scipen=999)   # suppressing scientific notation

> trig <- read.table(mydatafile, header=FALSE,
+                    col.names=c("OBS", "CLASSTYPE", "PRE", "POST", "IQ"))

# Note: we could also save the data columns into a file and use a command such as:
# trig <- read.table(file="z:/stat_516/filename.txt", header=FALSE,
#                    col.names=c("OBS", "CLASSTYPE", "PRE", "POST", "IQ"))

> attach(trig)

# The data frame called trig is now created,
# with five variables: "OBS", "CLASSTYPE", "PRE", "POST", "IQ".

# A scatter plot of the data, plotted separately by class type.
# Setting up the axes and labels: for these data, the x-values are all
# within 0 and 25 and the y-values are all within 0 and 35:

> plot(c(0,25), c(0,35), type='n', ylab='Post-class score', xlab='Pre-class score')

# Making the plots for each treatment:

> lines(PRE[CLASSTYPE==1], POST[CLASSTYPE==1], type='p', col='blue', pch=1)
> lines(PRE[CLASSTYPE==2], POST[CLASSTYPE==2], type='p', col='red', pch=2)
> lines(PRE[CLASSTYPE==3], POST[CLASSTYPE==3], type='p', col='green', pch=8)

> legend('topleft', c('CLASSTYPE 1', 'CLASSTYPE 2', 'CLASSTYPE 3'),
+        pch=c(1,2,8), col=c('blue','red','green'), cex=1)
```

[Scatter plot: post-class score versus pre-class score, with separate plotting symbols for class types 1, 2, and 3.]

```
# We use the lm function to do the ANCOVA analysis. The response is POST, the factor is
# CLASSTYPE, and the covariate here is PRE.

# Making "CLASSTYPE" a factor:

> CLASSTYPE <- factor(CLASSTYPE)

# The summary function gives us least squares estimates of
# mu, tau_1, tau_2, tau_3, and (most importantly in this case) gamma:

> trigfit <- lm(POST ~ CLASSTYPE + PRE)
> summary(trigfit)

Call:
lm(formula = POST ~ CLASSTYPE + PRE)

Residuals:
```
```
     Min       1Q   Median       3Q      Max 
-10.7049  -3.5486  -0.2473   2.1001  17.0370 

Coefficients:
            Estimate Std. Error t value    Pr(>|t|)    
(Intercept)  10.3722     1.8103   5.730 0.000000513 ***
CLASSTYPE2   -0.9565     1.5023  -0.637      0.5270    
CLASSTYPE3    4.0505     1.6321   2.482      0.0161 *  
PRE           0.7732     0.1705   4.536 0.000034120 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.897 on 52 degrees of freedom
Multiple R-squared: 0.3281,  Adjusted R-squared: 0.2894 
F-statistic: 8.465 on 3 and 52 DF,  p-value: 0.0001128 

# Note that R sets the FIRST tau-hat equal to zero, whereas SAS sets the LAST
# tau-hat equal to zero.
# Either way is fine, since the least-squares estimates are not unique.

> anova(trigfit)
Analysis of Variance Table

Response: POST
          Df  Sum Sq Mean Sq F value      Pr(>F)    
CLASSTYPE  2  115.64   57.82  2.4109      0.0997    
PRE        1  493.39  493.39 20.5729 0.000034120 ***
Residuals 52 1247.09   23.98                        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# The test for treatment effects can be done with a reduced-vs.-full approach:

> trigfit.reduced <- lm(POST ~ PRE)
> anova(trigfit.reduced, trigfit)
Analysis of Variance Table

Model 1: POST ~ PRE
Model 2: POST ~ CLASSTYPE + PRE
  Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
1     54 1475.00                              
2     52 1247.09  2    227.91 4.7602 0.01255 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# The overall F-test can also be done with a reduced-vs.-full approach:

> trigfit.really.reduced <- lm(POST ~ 1)
> anova(trigfit.really.reduced, trigfit)
Analysis of Variance Table

Model 1: POST ~ 1
Model 2: POST ~ CLASSTYPE + PRE
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1     55 1856.12                                  
2     52 1247.09  3    609.03 8.4649 0.0001128 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Results: What does the overall F-test (F = 8.46) tell you?
# What do the equivalent tests for the effect of
# pre-class score (F = 20.57, or t = 4.54) tell you?
# What does the test for the effect of type of class (F = 4.76) tell you?
# How do we interpret the estimate 0.773 of beta_1 for our model?

# We can include multiple covariates by simply adding covariate terms
# into the lm statement. Here, IQ is another covariate:
```
```
> trigfit2 <- lm(POST ~ CLASSTYPE + PRE + IQ)
> summary(trigfit2)

Call:
lm(formula = POST ~ CLASSTYPE + PRE + IQ)

Residuals:
      Min        1Q    Median        3Q       Max 
-11.55440  -3.44045  -0.03712   2.91420  13.34325 

Coefficients:
            Estimate Std. Error t value  Pr(>|t|)    
(Intercept) -14.8759     8.8927  -1.673   0.10048    
CLASSTYPE2   -1.4026     1.4889  -0.942   0.35060    
CLASSTYPE3    4.9870     1.5609   3.195   0.00240 ** 
PRE           0.7802     0.1596   4.889 0.0000105 ***
IQ            0.2129     0.0736   2.892   0.00561 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.583 on 51 degrees of freedom
Multiple R-squared: 0.4228,  Adjusted R-squared: 0.3775 
F-statistic: 9.339 on 4 and 51 DF,  p-value: 0.000009665 

> anova(trigfit2)
Analysis of Variance Table

Response: POST
          Df  Sum Sq Mean Sq F value     Pr(>F)    
CLASSTYPE  2  115.64   57.82  2.7523     0.0735    
PRE        1  493.39  493.39 23.4867 0.00001217 ***
IQ         1  175.72  175.72  8.3636    0.00561 ** 
Residuals 51 1071.37   21.01                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# We can test for unequal slopes by including an interaction term in the lm statement:

> trigfit3 <- lm(POST ~ CLASSTYPE + PRE + CLASSTYPE*PRE)
> summary(trigfit3)

Call:
lm(formula = POST ~ CLASSTYPE + PRE + CLASSTYPE * PRE)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.4369  -3.4436  -0.4216   2.1491  16.1425 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)   
(Intercept)      8.4741        ...     ...   0.0854 . 
CLASSTYPE2       2.0495     4.0574   0.505  0.61568   
CLASSTYPE3       5.7670     4.7106   1.224  0.22659   
PRE              0.9947     0.3383   2.940  0.00495 **
CLASSTYPE2:PRE  -0.3278     0.4073  -0.805  0.42477   
CLASSTYPE3:PRE  -0.1968     0.5493  -0.358  0.72169   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.962 on 50 degrees of freedom
Multiple R-squared: 0.3368,  Adjusted R-squared: 0.2704 
F-statistic: 5.078 on 5 and 50 DF,  p-value: 0.0007692 

> anova(trigfit3)
Analysis of Variance Table

Response: POST
              Df  Sum Sq Mean Sq F value     Pr(>F)    
CLASSTYPE      2  115.64   57.82  2.3484     0.1060    
PRE            1  493.39  493.39 20.0394 0.00004404 ***
CLASSTYPE:PRE  2   16.04    8.02  0.3257     0.7238    
Residuals     50 1231.05   24.62                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# The interaction term is NOT significant here (F = 0.33), so we fail to reject H_0.
# We conclude the equal-slopes model is reasonable: there is NOT evidence that the
# slopes are unequal.
```
```
# This example shows the analyses for the RBD,
# using the wheat data example we looked at in class.

# Entering the data and defining the variables.

# Reading the data into R:

> mydatafile <- tempfile()
> cat(file=mydatafile, "
...
", sep=" ")

> options(scipen=999)   # suppressing scientific notation

> wheat <- read.table(mydatafile, header=FALSE,
+                     col.names=c("variety", "block", "yield"))

# Note: we could also save the data columns into a file and use a command such as:
# wheat <- read.table(file="z:/stat_516/filename.txt", header=FALSE,
#                     col.names=c("variety", "block", "yield"))

> attach(wheat)

# The data frame called wheat is now created,
# with three variables: variety, block, and yield.

# What if we ignored the blocks and just did a one-way CRD ANOVA?
# We specify that variety is a qualitative factor with the factor function.

# Making "variety" a factor:

> variety <- factor(variety)

# The lm statement specifies that yield is the response
# and variety is the factor.
# The ANOVA table is produced by the anova function:

> wheatfit <- lm(yield ~ variety)
> anova(wheatfit)
Analysis of Variance Table

Response: yield
          Df  Sum Sq Mean Sq F value  Pr(>F)  
variety    2  984.33  492.17  3.6167 0.05899 .
Residuals 12 1633.00  136.08                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# The effect of variety is not significant at the 5% level (p-value 0.0590).
# This is NOT the best way to analyze these data!!!

# Here we treat the experiment as a Randomized Block Design (RBD).
# The lm and anova functions will do a standard ANOVA for an RBD.
# We specify that block and variety are factors with the factor function:

> variety <- factor(variety)
> block <- factor(block)

> wheatRBDfit <- lm(yield ~ variety + block)
> anova(wheatRBDfit)
Analysis of Variance Table

Response: yield
          Df  Sum Sq Mean Sq F value    Pr(>F)    
variety    2  984.33  492.17  27.343 0.0002653 ***
block      4 1489.00  372.25  20.681 0.0002010 ***
Residuals  8  144.00   18.00                      
```
```
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Note that there are significant effects among treatments
# (F statistic = 27.34, p-value = 0.0003).
# This implies that the mean yield is significantly different
# for the different varieties of wheat.

# There is also significant variation among blocks
# (F statistic = 20.68, p-value = 0.0003).
# This may or may not be of interest.

# Question: Is Variety A superior to the others?
# We can again answer this question with a statement about a contrast.

# Variety A vs. others:

> contrasts(variety) <- matrix(c(1, -1/2, -1/2), nrow=3, ncol=1)

> print("Variety A vs others")
[1] "Variety A vs others"
> summary(lm(yield ~ variety + block))$coef["variety1",]
      Estimate     Std. Error        t value       Pr(>|t|) 
 -3.5666666667   0.4090979406 -20.0427046610   0.0000000549 

# This code shows the line labeled "variety1", which gives
# the two-sided p-value of the t-test about the contrast.

# Yes, clearly there is strong evidence (t = -20) that variety A has a superior
# mean yield to the others.

# The two-sided p-value is given above as 0.0000000549.
# What is the p-value for the corresponding one-sided test?
```

```
# This example shows the analyses for the one-way ANOVA,
# using the rice data example we looked at in class.

# Entering the data and defining the variables.

# Reading the data into R:

> mydatafile <- tempfile()
> cat(file=mydatafile, "
...
", sep=" ")

> options(scipen=999)   # suppressing scientific notation

> rice <- read.table(mydatafile, header=FALSE,
+                    col.names=c("variety", "yield"))

# Note: we could also save the data columns into a file and use a command such as:
# rice <- read.table(file="z:/stat_516/filename.txt", header=FALSE,
#                    col.names=c("variety", "yield"))

> attach(rice)

# The data frame called rice is now created,
# with two variables: variety and yield.

# lm and anova will do a standard analysis of variance.
# We specify that variety is a qualitative factor with the factor function.
```
```
# Making "variety" a factor:

> variety <- factor(variety)

# The lm statement specifies that yield is the response
# and variety is the factor.
# The ANOVA table is produced by the anova function:

> ricefit <- lm(yield ~ variety)
> anova(ricefit)
Analysis of Variance Table

Response: yield
          Df Sum Sq Mean Sq F value   Pr(>F)   
variety    3 8.9931  2.9977  7.2124 0.005034 **
Residuals 12 4.9876  0.4156                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# The following code produces some residual plots
# and Levene's test for unequal variances.

# A short function to perform the Levene test
# for equal variances across the populations:

> levene.test <- function(y, group) {
+   group <- as.factor(group)   # precautionary
+   means <- tapply(y, group, mean, na.rm=TRUE)
+   resp <- abs(y - means[group])
+   anova(lm(resp ~ group))[1, c(1,4,5)]
+ }

# Implementing the function for our data:

> levene.test(yield, variety)
      Df F value Pr(>F)
group  3  0.9092 0.4654

# Printing the sample mean yields for each variety level:

> fitted.values(ricefit)
      1       2       3       4       5       6       7       8 
 9.8450  9.8450  9.8450  9.8450  9.2825  9.2825  9.2825  9.2825 
      9      10      11      12      13      14      15      16 
 9.3850  9.3850  9.3850  9.3850 11.1650 11.1650 11.1650 11.1650 

# Plotting residuals versus fitted values:

> plot(fitted.values(ricefit), residuals(ricefit),
+      xlab="Fitted Values", ylab="Residuals")
> abline(h=0)
```

[Residual plot: residuals versus fitted values.]

```
# A normal Q-Q plot of the residuals:

> qqnorm(residuals(ricefit))
```

[Normal Q-Q plot of the residuals.]

```
# Note that according to Levene's test (p-value = 0.4654), we would
# FAIL TO REJECT the null hypothesis that all variances are equal.
# So the equal-variance assumption seems reasonable for these data.
# Notice there is some evidence of nonnormality based on the Q-Q plot.

# Estimating and testing contrasts:
# The following code estimates the contrasts in the example from class.
# We also use a t-test to test whether a contrast is zero.

# Getting the sample means for each group:

> samp.means <- tapply(yield, variety, mean, na.rm=TRUE)
```
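The contrast computations in the next section follow the standard one-way ANOVA formulas; as a quick reference (with MSE and the error df taken from the ANOVA table above, and n_i the number of observations in group i), a contrast L = Σ c_i μ_i is estimated and tested by:

```latex
\hat{L} = \sum_i c_i \bar{y}_{i\cdot}, \qquad
SE(\hat{L}) = \sqrt{MSE \sum_i \frac{c_i^2}{n_i}}, \qquad
t^* = \frac{\hat{L}}{SE(\hat{L})} \sim t_{df_E} \ \text{under } H_0\colon L = 0.
```

For example, for the "variety 4 vs. others" contrast below, c = (1/3, 1/3, 1/3, -1) with n_i = 4 and MSE = 0.4156, so SE(L-hat) = sqrt(0.4156 · (4/3)/4) ≈ 0.3722, matching the printed standard error.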
```
### Variety 4 vs. others ###

# Defining the correct coefficients for the contrast:

> contrast.coefficients <- c(1/3, 1/3, 1/3, -1)

# We just need to specify the name of the fit object (like "ricefit" here) and
# the name of the factor (like "variety") in the code below:

> Lhat <- sum(contrast.coefficients * samp.means)
> se.Lhat <- sqrt(sum(anova(ricefit)["Residuals","Mean Sq"] *
+                     contrast.coefficients^2 / table(variety)))
> t.star <- Lhat / se.Lhat
> two.sided.P.val <- round(2 * pt(abs(t.star),
+                                 df=anova(ricefit)["Residuals","Df"],
+                                 lower.tail=FALSE), 4)

> print("variety 4 vs others")
[1] "variety 4 vs others"
> data.frame(Lhat, se.Lhat, t.star, two.sided.P.val)
       Lhat   se.Lhat    t.star two.sided.P.val
1 -1.660833 0.3722147 -4.462031           8e-04

# A shorter way to test about this contrast:

# Comparing variety 4 vs. others:

> contrasts(variety) <- matrix(c(1/3, 1/3, 1/3, -1), nrow=4, ncol=1)

> summary(lm(yield ~ variety))$coef["variety1",]
      Estimate    Std. Error       t value      Pr(>|t|) 
 -1.2456250000  0.2791609911 -4.4620310010  0.0007763830 

# Look on the 'variety1' line (selected via the above code)
# for the p-value of the t-test about this contrast.

# I prefer the longer way, since it is more instructive about what is going on.
# The long way also gives the correct L-hat point estimate.

### Variety 1 vs. variety 2 ###

> contrast.coefficients <- c(1, -1, 0, 0)

> Lhat <- sum(contrast.coefficients * samp.means)
> se.Lhat <- sqrt(sum(anova(ricefit)["Residuals","Mean Sq"] *
+                     contrast.coefficients^2 / table(variety)))
> t.star <- Lhat / se.Lhat
> two.sided.P.val <- round(2 * pt(abs(t.star),
+                                 df=anova(ricefit)["Residuals","Df"],
+                                 lower.tail=FALSE), 4)

> print("variety 1 vs variety 2")
[1] "variety 1 vs variety 2"
> data.frame(Lhat, se.Lhat, t.star, two.sided.P.val)
    Lhat  se.Lhat  t.star two.sided.P.val
1 0.5625 0.455868 1.23391          0.2409

# A shorter way to test about this contrast:

# Comparing variety 1 vs. variety 2:

> contrasts(variety) <- matrix(c(1, -1, 0, 0), nrow=4, ncol=1)

> summary(lm(yield ~ variety))$coef["variety1",]
   Estimate  Std. Error     t value    Pr(>|t|) 
 0.28125000  0.22793399  1.23390900  0.24005730 
```
```
# Look on the 'variety1' line (selected via the above code)
# for the p-value of the t-test about this contrast.

# I prefer the longer way, since it is more instructive about what is going on.
# The long way also gives the correct L-hat point estimate.

# When doing simultaneous tests about multiple contrasts:

> exper.error.rate <- 0.05
> number.of.tests <- 2

> adjusted.cutoff <- qt(1 - exper.error.rate/(2*number.of.tests),
+                       df=anova(ricefit)["Residuals","Df"])
> print(paste("If |t| is greater than", round(adjusted.cutoff, 4),
+             "then reject H0 for that test"))
[1] "If |t| is greater than 2.56 then reject H0 for that test"

# Post Hoc Multiple Comparisons in R:

# Fisher LSD procedure, using an alpha of 0.05:

> alpha <- 0.05

# Note from the ANOVA table there are 12 error df in this problem,
# and n = 4 observations per level in this problem:

> error.df <- anova(ricefit)["Residuals","Df"]
> n <- 4
> MSW <- summary(ricefit)$sigma^2
> least.signif.differ <- qt(1 - alpha/2, error.df) * sqrt(2*MSW/n)

> cbind(TukeyHSD(aov(ricefit))$variety[,1], least.signif.differ)
            least.signif.differ
2-1 -0.5625            0.993251
3-1 -0.4600            0.993251
4-1  1.3200            0.993251
3-2  0.1025            0.993251
4-2  1.8825            0.993251
4-3  1.7800            0.993251

# Using alpha = 0.05 for the comparisonwise error rate, any pair of means whose
# absolute difference (from the first column; take absolute values) is GREATER THAN
# the Least Significant Difference (second column) is
# judged to be significantly different by Fisher.

# Tukey procedure:
# The TukeyHSD function does the Tukey multiple comparison procedure in R.
# The last column gives the ADJUSTED p-values.
# Using alpha = 0.05 for the experimentwise error rate, any
# pair of means with p-value LESS THAN 0.05 in that column is
# judged to be significantly different by Tukey.

> TukeyHSD(aov(ricefit), conf.level=0.95)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = ricefit)

$variety
       diff         lwr       upr     p adj
2-1 -0.5625 -1.91592699 0.7909270 0.6105496
3-1 -0.4600 -1.81342699 0.8934270 0.7473470
4-1  1.3200 -0.03342699 2.6734270 0.0567296
3-2  0.1025 -1.25092699 1.4559270 0.9957690
```
```
4-2  1.8825  0.52907301 3.2359270 0.0066015
4-3  1.7800  0.42657301 3.1334270 0.0097522

# conf.level=0.95 is actually the default level; we could choose another level
# if desired.

# Notice the results for the Fisher LSD and Tukey procedures. According to Fisher,
# the mean of variety 4 is significantly different from the means of each other
# variety. Tukey gives similar results, but Tukey's method does NOT find a
# significant difference between varieties 1 and 4.
# Recall Tukey is more conservative (less likely to reject H_0): Tukey
# offers more protection against Type I errors, but less power.

# Simultaneous 95% Confidence Intervals for All Pairwise Differences:
# Uses the Tukey method.

> TukeyHSD(aov(ricefit), conf.level=0.95)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = ricefit)

$variety
       diff         lwr       upr     p adj
2-1 -0.5625 -1.91592699 0.7909270 0.6105496
3-1 -0.4600 -1.81342699 0.8934270 0.7473470
4-1  1.3200 -0.03342699 2.6734270 0.0567296
3-2  0.1025 -1.25092699 1.4559270 0.9957690
4-2  1.8825  0.52907301 3.2359270 0.0066015
4-3  1.7800  0.42657301 3.1334270 0.0097522
```

```
# This example shows the analysis for the Latin Square experiment,
# using the productivity data example we looked at in class.

# Entering the data and defining the variables.

# Reading the data into R:

> mydatafile <- tempfile()
> cat(file=mydatafile, "
...
", sep=" ")

> options(scipen=999)   # suppressing scientific notation

> musicprod <- read.table(mydatafile, header=FALSE,
+                         col.names=c("OBS", "MUSIC", "DAY", "TIME", "PRODUCT"))

# Note: we could also save the data columns into a file and use a command such as:
# musicprod <- read.table(file="z:/stat_516/filename.txt", header=FALSE,
#                         col.names=c("OBS", "MUSIC", "DAY", "TIME", "PRODUCT"))

> attach(musicprod)

# The data frame called musicprod is now created,
# with five variables: OBS, MUSIC, DAY, TIME, and PRODUCT.

# lm and anova will do a standard analysis of variance.
# We specify our qualitative factors with the factor function.

# Making MUSIC, DAY, TIME factors:
```
```
> MUSIC <- factor(MUSIC)
> DAY <- factor(DAY)
> TIME <- factor(TIME)

# MUSIC is the treatment factor here, and TIME and DAY are the row and column
# factors. The ANOVA table is produced by the anova function:

> musicprodfit <- lm(PRODUCT ~ MUSIC + DAY + TIME)
> anova(musicprodfit)
Analysis of Variance Table

Response: PRODUCT
          Df Sum Sq Mean Sq F value    Pr(>F)    
MUSIC      4 56.314  14.079 12.2750 0.0003341 ***
DAY        4 41.362  10.341  9.0159 0.0013326 ** 
TIME       4 42.922  10.731  9.3559 0.0011356 ** 
Residuals 12 13.763   1.147                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# From the F-tests and their p-values, there is a significant effect of music type
# on mean productivity. We also see a significant row (TIME) effect and a
# significant column (DAY) effect.

# The sample mean productivity values for each music type, listed from smallest
# to largest:

> sort(tapply(PRODUCT, MUSIC, mean))
 7.96  9.20 10.10 11.28 12.22 

# Now, which of these means are significantly different?

# Tukey's procedure tells us which pairs of music types are significantly
# different.

# Tukey CIs for pairwise treatment mean differences:

> TukeyHSD(aov(musicprodfit))

$MUSIC
     diff         lwr        upr
2-1  1.24 -0.91893738 3.39893738
3-1  4.26  2.10106262 6.41893738
4-1  3.32  1.16106262 5.47893738
5-1  2.14 -0.01893738 4.29893738
3-2  3.02  0.86106262 5.17893738
4-2  2.08 -0.07893738 4.23893738
5-2  0.90 -1.25893738 3.05893738
4-3 -0.94 -3.09893738 1.21893738
5-3 -2.12 -4.27893738 0.03893738
5-4 -1.18 -3.33893738 0.97893738

# NOTE: The CIs which do NOT contain zero indicate the treatment means
# that are significantly different at the 0.05 experimentwise significance level.
```

```
# Multiple Regression example:
# California rain data.

# I am calling the data set "calirain".
# The variables are: number, city, precip, altitude, latitude, distance.

# Reading the data into R:

> mydatafile <- tempfile()
> cat(file=mydatafile, "
1 Eureka 39.57 43 40.8 1
2 RedBluff 23.27 341 40.2 97
3 Colusa 15.95 60 39.2 91
...
", sep=" ")

> options(scipen=999)   # suppressing scientific notation
```
```
> calirain <- read.table(mydatafile, header=FALSE,
+                        col.names=c("number", "city", "precip",
+                                    "altitude", "latitude", "distance"))

# Note: we could also save the data columns into a file and use a command such as:
# calirain <- read.table(file="z:/stat_516/filename.txt", header=FALSE,
#                        col.names=c("number", "city", "precip",
#                                    "altitude", "latitude", "distance"))

> attach(calirain)
The following object is masked from package:datasets:

    precip

# The data frame called calirain is now created,
# with six variables, four of which we will use in the analysis.

# The lm function gives us a basic regression output:

> rainreg <- lm(precip ~ altitude + latitude + distance)
> summary(rainreg)

Call:
lm(formula = precip ~ altitude + latitude + distance)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.7221  -5.6030  -0.5315   3.5102  33.3171 

Coefficients:
               Estimate  Std. Error t value Pr(>|t|)    
(Intercept) -102.357429   29.205402  -3.505 0.001676 ** 
altitude       0.004091    0.001218   3.358 0.002431 ** 
latitude       3.451080    0.794863   4.342 0.000191 ***
distance      -0.142858    0.036340  -3.931 0.000559 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.1 on 26 degrees of freedom
Multiple R-squared: 0.6003,  Adjusted R-squared: 0.5542 
F-statistic: 13.02 on 3 and 26 DF,  p-value: 0.00002205 

> anova(rainreg)
Analysis of Variance Table

Response: precip
          Df Sum Sq Mean Sq F value    Pr(>F)    
altitude   1  730.7   730.7  5.9329 0.0220220 *  
latitude   1 2175.3  2175.3 17.6612 0.0002752 ***
distance   1 1903.4  1903.4 15.4530 0.0005593 ***
Residuals 26 3202.3   123.2                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Testing about sets of coefficients in R:

# To test in R whether beta_1 = beta_3 = 0, given that X2 is in the model,
# we must specify the full model (with all variables included) and
# the reduced model (with ONLY latitude, X2, included):

> rain.full.model <- lm(precip ~ altitude + latitude + distance)
> rain.reduced.model <- lm(precip ~ latitude)

# This will give us the F statistic and p-value for this test:
```
```
> anova(rain.reduced.model, rain.full.model)
Analysis of Variance Table

Model 1: precip ~ latitude
Model 2: precip ~ altitude + latitude + distance
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
1     28 5346.8                                
2     26 3202.3  2    2144.5 8.7057 0.001276 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# INFERENCES ABOUT THE RESPONSE VARIABLE:

# We want to (1) estimate the mean precipitation for cities of altitude 100 feet,
# latitude 40 degrees, and 70 miles from the coast,
# and (2) predict the precipitation of a new city of altitude 100 feet,
# latitude 40 degrees, and 70 miles from the coast.

# The predict function will give us CIs for the mean of Y, and PIs for Y,
# for the specified values of x1, x2, x3:

> xvalues <- data.frame(cbind(altitude=100, latitude=40, distance=70))

# Getting the 90% confidence interval for the mean at x1=100, x2=40, and x3=70:

> predict(rainreg, xvalues, interval="confidence", level=0.90)
       fit      lwr      upr
1 26.09477 19.95278 32.23676

# Getting the 90% prediction interval for a new response at x1=100, x2=40, and x3=70:

> predict(rainreg, xvalues, interval="prediction", level=0.90)
       fit      lwr      upr
1 26.09477 6.194312 45.99523

# The following code will produce residual plots for this multiple regression.

# Residual plot:

> plot(fitted(rainreg), resid(rainreg))
> abline(h=0)
```

[Residual plot: residuals versus fitted values for rainreg.]

```
# A Q-Q plot of the residuals:

> qqnorm(resid(rainreg))
```

[Normal Q-Q plot of the residuals for rainreg.]

```
# Getting Variance Inflation Factors is easy in R.

# To get VIFs, first copy this function (by Bill Venables) into R:

vif <- function(object, ...)
  UseMethod("vif")

vif.default <- function(object, ...)
  stop("No default method for vif. Sorry.")

vif.lm <- function(object, ...) {
  V <- summary(object)$cov.unscaled
  Vi <- crossprod(model.matrix(object))
  nam <- names(coef(object))
  if (k <- match("(Intercept)", nam, nomatch=FALSE)) {
    v1 <- diag(V)[-k]
    v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
    nam <- nam[-k]
  } else {
    v1 <- diag(V)
    v2 <- diag(Vi)
    warning("No intercept term detected. Results may surprise.")
  }
  structure(v1 * v2, names=nam)
}
```
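The vif function above computes the usual variance inflation factor. For predictor k, with R²_k denoting the R² from regressing X_k on the other predictors, the standard definition is:

```latex
VIF_k = \frac{1}{1 - R_k^2}
```

So, for example, a VIF of about 1.54 (the value reported for altitude below) corresponds to R²_k ≈ 0.35; values near 1 indicate little collinearity among the predictors.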
```
# Then use it as follows:

> vif(rainreg)
altitude latitude distance 
1.536447 1.057748 1.493345 

# Getting influence statistics is easy in R.

# The studentized residuals we discussed in class,
# which we compare in absolute value to 2.5,
# are the internally studentized residuals.
# rstandard gives the INTERNALLY studentized residuals
# (what SAS calls "Student Residual"):

> rstandard(rainreg)
           1            2            3            4            5            6 
 0.107800522 -0.061716575 -0.312275645  0.355514998  1.035614068 -0.542297448 
           7            8            9           10           11           12 
-0.104113401 -0.828704478  1.395722121 -0.842258610  0.006989693 -0.150945461 
          13           14           15           16           17           18 
-0.350069344 -0.471480620 -0.891040867  1.098956696  0.118052633 -0.713830729 
          19           20           21           22           23           24 
-2.849768693  1.080889505  0.274137418  0.066776263 -0.039934708 -0.906269235 
          25           26           27           28           29           30 
 1.157541443  0.008171686 -0.711745114  0.688420227  3.383854694 -0.399465494 

# Getting the measures of influence:
# Gives DFFITS, Cook's Distance, hat diagonal elements, and some others:

> influence.measures(rainreg)
Influence measures of
  lm(formula = precip ~ altitude + latitude + distance) :

      dfb.1_   dfb.altt   dfb.lttd  dfb.dstn    dffit cov.r     cook.d    hat inf
1  -0.031720  -0.005837   0.035324  -0.02116  0.04775 1.406 0.00059250 0.1694    
2   0.015096   0.011977  -0.015751  -0.00742 -0.02202 1.324 0.00012609 0.1169    
3  -0.102106  -0.124670   0.097665   0.06592 -0.16000 1.466 0.00663064 0.2138    
4  -0.064431  -0.011068   0.075780  -0.06725  0.12793 1.301 0.00423442 0.1182    
5  -0.067901   0.525299   0.064860  -0.12449  0.63557 1.360 0.10069535 0.2730    
6   0.033449   0.011072  -0.049618   0.09041 -0.15799 1.215 0.00641678 0.0803    
7   0.012466   0.017885  -0.013863  -0.00883 -0.02855 1.259 0.00021179 0.0725    
8   0.027910   0.044375  -0.048168   0.07498 -0.20226 1.114 0.01035559 0.0569    
9   0.213613   0.710325  -0.217664  -0.14447  0.83082 1.149 0.16602010 0.2542    
10 -0.015675   0.014521  -0.007754   0.11930 -0.22187 1.121 0.01244998 0.0656    
11  0.000093  -0.001249  -0.000057   0.00129  0.00194 1.264 0.00000098 0.0745    
12 -0.013906  -0.002126   0.009428   0.02627 -0.04379 1.268 0.00049814 0.0804    
13       ...        ...        ...   0.04030 -0.08389 1.216 0.00182099 0.0561    
14 -0.053550   0.033641   0.047386  -0.02216 -0.10788 1.191 0.00300004 0.0512    
```
```
15 -0.031305  -0.075020   0.043988  -0.21301 -0.36210 1.205 0.03304889 0.1427    
16 -0.202922   0.269422   0.202845  -0.01763  0.47553 1.147 0.05606137 0.1566    
17  0.022447   0.004473  -0.018970  -0.02083  0.03879 1.298 0.00039107 0.1009    
18  0.139773  -0.035928  -0.130441  -0.16430 -0.31949 1.302 0.02601871 0.1696    
19  1.078644  -0.445710  -1.082679  -0.09222 -1.55337 0.317 0.43140824 0.1752    
20  0.186460  -0.283731  -0.201818   0.49799  0.57856 1.250 0.08311858 0.2215    
21  0.057229   0.005164  -0.052069  -0.01539  0.07684 1.251 0.00153051 0.0753    
22  0.014269   0.002335  -0.012541  -0.00912  0.02119 1.291 0.00011671 0.0948    
23 -0.009279  -0.000669   0.008208   0.00533 -0.01341 1.307 0.00004674 0.1049    
24  0.060666   0.128376  -0.073832  -0.05514 -0.21949 1.090 0.01213008 0.0558    
25  0.297799  -0.285425  -0.304233   0.42536  0.58079 1.182 0.08318359 0.1989    
26  0.002628   0.000339  -0.002394  -0.00128  0.00334 1.374 0.00000291 0.1484    
27 -0.172811  -0.077115   0.165408   0.01224 -0.21939 1.186 0.01227029 0.0883    
28  0.008072  -0.300581  -0.017754   0.38001  0.42188 1.504 0.04543204 0.2772    
29 -1.688239  -0.311130   1.845377  -0.92128  2.30700 0.146 0.77436240 0.2129    
30  0.069274   0.079890  -0.073852  -0.04762 -0.12640 1.260 0.00412833 0.0938    
```

## STAT 516 — STATISTICAL METHODS II

STAT 516 is primarily about linear models.

**Model**: A mathematical equation describing (approximating) the relationship between two or more variables.

- Any assumptions we make about the variables are also part of the model.

**Simple Linear Regression (SLR) Modeling**

- Analyzes the relationship between two quantitative variables.
- We have a sample, and for each observation we have data observed for two variables.

**Dependent (Response) Variable**: Measures the major outcome of interest in the study; often denoted Y.

**Independent (Predictor) Variable**: Another variable whose value may explain, predict, or affect the value of the dependent variable; often denoted X.

Example:

- In SLR, we assume the relationship between Y and X can be mathematically approximated by a straight-line equation.
- We assume this is a statistical relationship: not a perfect linear relationship, but an approximately linear one.

Example: Consider the relationship between gas spending and distance traveled. We might expect that gas spending changes with distance traveled, maybe nearly linearly.
- If we took a sample of trips and measured X and Y for each, would the data fall exactly along a line? (Picture)
- Our goal is often to predict Y, or to estimate the mean of Y, based on a given value of X. Examples:

**Simple Linear Regression Model** (expressed mathematically):

Y = β₀ + β₁X + ε

(Deterministic Component: β₀ + β₁X; Random Component: ε.)

Regression Coefficients: β₀, β₁.

We assume ε has a ___. Since ε has mean 0, the mean (expected value) of Y for a given X value is ___.

- This is called the conditional mean of Y.
- The deterministic part of the SLR model is simply the mean of Y for any value of X.

Example: Suppose β₀ = 2, β₁ = 1. (Picture)

- When X = 1, E(Y) = ___. When X = 2, E(Y) = ___.
- The actual Y values we observe for these X values are a little different: they vary along with the random error component ε.

**Assumptions for the SLR model**:

- The linear model is correctly specified.
- The error terms are independent across observations.
- The error terms are normally distributed.
- The error terms have the same variance σ² across observations.

Notes:

- Even if Y is linearly related to X, we rarely conclude that X causes Y. This would require eliminating all unobserved factors as possible causes for Y.
- We should not use the regression line for extrapolation, that is, predicting Y for any X values outside the range of our observed X values. We have no evidence that a linear relationship is appropriate outside the observed range. (Picture)

Example: Data gathered on 58 houses (Table 7.2, p. 293). X = size of house (in thousands of square feet); Y = selling price of house (in thousands of dollars).

- Is a linear relationship between X and Y appropriate? (On computer: examine a scatter plot of the sample data.)
- How do we choose the best slope and intercept for these data?

**Estimating Parameters**

- β₀ and β₁ are unknown parameters.
- We use the sample data to find estimates β̂₀ and β̂₁.
- Typically done by choosing β̂₀ and β̂₁ to produce the least-squares regression line. (Picture)

For each data point, the predicted Y value is denoted Ŷ.
(Picture)

- Residual (or error): Y − Ŷ for each data point.
- We want our line to make these residuals as small as possible.

**Least-squares line**: The line chosen so that the sum of squared residuals (SSE) is minimized.

- Choose β̂₀ and β̂₁ to minimize ___.

Example (House Price data): The following can be calculated from the sample: ___. So the estimates are ___. Our estimated regression line is ___.

- Typically we calculate the least-squares estimates on the computer.

Interpretations of estimated slope and intercept:
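The minimization described above has a closed-form solution; the standard least-squares formulas (filling in the computation the notes leave for class) are:

```latex
\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}
             = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},
```

which jointly minimize the sum of squared residuals SSE = Σᵢ (yᵢ − (β̂₀ + β̂₁xᵢ))². These are the quantities that R's `lm` function reports in its coefficient table.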
