# Analysis of Experiments STAT 3115

UCONN


These 23 pages of class notes were uploaded by Blair Williamson on Thursday, September 17, 2015. The notes belong to STAT 3115 at the University of Connecticut, taught by Joseph Glaz in Fall. Since its upload, the document has received 38 views. For similar materials see /class/205896/stat-3115-university-of-connecticut in Statistics at the University of Connecticut.

Date Created: 09/17/15

## Statistics 3115/5315, Handout 11: Confounding and Interaction in Regression

The list of objectives in regression analysis includes:

1. Predict the dependent variable using a set of independent variables.
2. Quantify the relationship between a dependent variable and one or more independent variables.

The first objective focuses on finding the best possible model as far as accuracy of prediction is concerned, while the second is related to obtaining accurate estimates of one or more regression coefficients. Confounding and interaction are two methodological concepts related to achieving the second objective. Both concern the assessment of the association between two or more variables, so that additional variables that may have an impact on this association are taken into account.

In general, confounding exists if meaningfully different interpretations of the relationship of interest result depending on whether an extraneous variable is ignored or included in the data analysis. For example, adding or deleting an independent variable in the model results in a change of sign of regression coefficients, or in a significant increase or decrease in their values. Interaction is the condition where the relationship of interest differs at different levels (values) of the independent variables. Assessment of confounding is questionable in the presence of interaction, so the first step is to investigate whether interaction is present.

### 1. Interaction in Regression

Consider the following regression models in two independent variables:

$$(1)\quad Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i}^2 + \beta_4 X_{2i}^2 + \beta_5 X_{1i} X_{2i} + \varepsilon_i$$
$$(2)\quad Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i}^2 + \beta_4 X_{2i}^2 + \varepsilon_i$$
$$(3)\quad Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$$
$$(4)\quad Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i}^2 + \varepsilon_i$$
$$(5)\quad Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_5 X_{1i} X_{2i} + \varepsilon_i$$

In models (1) and (5), the presence of the term $\beta_5 X_{1i} X_{2i}$ means that the rate of change in the mean of the response variable, when $X_1$ or $X_2$ increases by one unit, depends on the value of $X_1$ and/or $X_2$. On the other hand, in models (2) through (4), the rate of change in the response variable remains constant for all values of $X_1$ and/or $X_2$. We say that an interaction is present in models (1) and (5). We will test for the significance of such an interaction by testing $H_0\colon \beta_5 = 0$.
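To make the distinction concrete, here is a minimal Python sketch (the coefficient values are hypothetical, chosen only for illustration) contrasting model (3), which has no interaction, with model (5):

```python
# Sketch: how an interaction term makes the effect of X1 depend on X2.
# The coefficient values below are hypothetical, chosen for illustration.

def mean_no_interaction(x1, x2, b0=1.0, b1=2.0, b2=3.0):
    """Model (3): E[Y] = b0 + b1*x1 + b2*x2."""
    return b0 + b1 * x1 + b2 * x2

def mean_with_interaction(x1, x2, b0=1.0, b1=2.0, b2=3.0, b5=0.5):
    """Model (5): E[Y] = b0 + b1*x1 + b2*x2 + b5*x1*x2."""
    return b0 + b1 * x1 + b2 * x2 + b5 * x1 * x2

# Change in E[Y] when X1 increases by one unit, at two levels of X2:
for x2 in (0.0, 10.0):
    slope3 = mean_no_interaction(1.0, x2) - mean_no_interaction(0.0, x2)
    slope5 = mean_with_interaction(1.0, x2) - mean_with_interaction(0.0, x2)
    print(x2, slope3, slope5)
# Model (3): the per-unit effect of X1 is b1 = 2 at every level of X2.
# Model (5): the per-unit effect is b1 + b5*x2, i.e. 2 at x2 = 0 but 7 at x2 = 10.
```

Testing $\beta_5 = 0$ asks precisely whether the two slope columns printed above can be collapsed into one.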
### 2. Confounding in Regression

In models that do not contain a significant interaction component, we will examine a collection of "good" models and select a model that has consistent estimates of the regression coefficients, as far as their signs and changes in their values are concerned.

## Statistics 3115/5315, Handout 12: Indicator (Dummy) Variables in Regression

So far we have utilized quantitative independent variables in regression models. Sometimes qualitative independent variables are used. Examples of qualitative variables are gender of respondent (female, male); job location of an employee (plant, office, field); season of year (winter/spring, summer, autumn); and type of location for a restaurant (shopping mall, street, highway).

### Representation of Qualitative Variables

A qualitative variable can be represented by means of 0-1 indicator variables. A 0-1 indicator (dummy) variable is a variable that can take only two possible values, 0 or 1. A qualitative variable with $k$ classes (categories) is represented by $k - 1$ indicator variables.

Examples:

1. The independent variable gender of respondent (female, male) requires one indicator variable to represent it in a regression model. We denote this indicator variable by $X_1$ and define it as follows: $X_1 = 1$ if female, $0$ otherwise. Hence the numerical values of $X_1$ are associated with the two classes of the variable gender as follows:

| Class | $X_1$ |
|---|---|
| Male (reference class) | 0 |
| Female | 1 |

The first class is called the reference class, since all the indicator variables identifying this class are set equal to 0.

2. The independent variable location for a restaurant (shopping mall, street, highway) requires $k - 1 = 3 - 1 = 2$ indicator variables to represent it in a regression model. We denote these variables by $X_2$ and $X_3$ and define them as follows: $X_2 = 1$ if shopping-mall location, $0$ otherwise; $X_3 = 1$ if street location, $0$ otherwise. Hence the numerical values of $X_2$ and $X_3$ are associated with the three locations as follows:

| Class | $X_2$ | $X_3$ |
|---|---|---|
| Highway (reference class) | 0 | 0 |
| Shopping mall | 1 | 0 |
| Street | 0 | 1 |
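The $k - 1$ indicator coding above is mechanical enough to sketch in a few lines of Python (the class labels follow the restaurant example; the helper function is hypothetical, not part of any SAS workflow):

```python
# Sketch: representing a k-class qualitative variable with k-1 indicator
# variables, using the restaurant-location example (highway = reference class).

def indicator_code(location, classes=("highway", "shopping mall", "street")):
    """Return (X2, X3): one 0/1 indicator per non-reference class.

    classes[0] is the reference class and gets all indicators equal to 0.
    """
    return tuple(1 if location == c else 0 for c in classes[1:])

print(indicator_code("highway"))        # (0, 0)  reference class
print(indicator_code("shopping mall"))  # (1, 0)
print(indicator_code("street"))         # (0, 1)
```

The reference class is recognizable as the only row whose indicators are all zero, matching the table above.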
## Statistics 3115/5315, Handout 5: The Durbin-Watson Test

$H_0$: the measurement errors are independent.
$H_a$: successive measurement errors are correlated.

Test statistic:

$$DW = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}.$$

Decision rule: based on the sample size, the number of independent variables in the model, and the $\alpha$ level of the test, we obtain two entries, $d_L$ (lower bound) and $d_U$ (upper bound).

1. If $DW < d_L$, reject the null hypothesis and conclude that there is evidence of positive autocorrelation.
2. If $d_U < DW < 2$, do not reject $H_0$.
3. If $d_L < DW < d_U$, the DW test is inconclusive.
4. If $DW > 2$, compute $DW^* = 4 - DW$ and use the same rules 1-3 as above, substituting $DW^*$ for $DW$ and the word "negative" for "positive" in rule 1.

Remark: if a two-sided alternative is used, as specified above, then use $\alpha/2$ and apply the rules to both $DW$ and $DW^*$.

### PRESS Statistic

An interesting and very important criterion that can be used as a form of validation of the fitted model is the PRESS (Prediction Sum of Squares) statistic. Let $Y_i$ be the observation on the dependent variable in the $i$th case, $i = 1, \dots, n$. Suppose that at stage $i$ we withhold the $i$th case and use the remaining $n - 1$ cases to estimate the coefficients of a particular candidate model. We use the fitted model resulting at stage $i$ to estimate the response for the $i$th case, denoted $\hat{Y}_{i(i)}$, and we denote by $e_{(i)} = Y_i - \hat{Y}_{i(i)}$ the $i$th PRESS residual. These PRESS residuals are true prediction errors, $Y_i$ being independent of $\hat{Y}_{i(i)}$; therefore they can be used for validating the model under evaluation. For each candidate model we will have $n$ PRESS residuals associated with it, and the PRESS statistic is defined as

$$PRESS = \sum_{i=1}^{n} e_{(i)}^2 = \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_i} \right)^2,$$

where $e_i$ is the $i$th raw residual and $h_i$ is the leverage of the $i$th observation. Therefore, for choosing the best model, one might favor the model with the smallest PRESS. If PRESS is much larger than the sum of squares of the residuals, that will indicate that the data set has influential observations or outliers.
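Although the handout computes DW in SAS, the statistic itself is simple to evaluate directly. A minimal Python sketch, with made-up residuals:

```python
# Sketch: computing the Durbin-Watson statistic from a list of residuals.
# The residual values are hypothetical, chosen purely for illustration.

def durbin_watson(e):
    """DW = sum_{t=2}^{n} (e_t - e_{t-1})^2 / sum_{t=1}^{n} e_t^2."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(r ** 2 for r in e)
    return num / den

residuals = [0.5, -0.3, 0.4, -0.6, 0.2, 0.1, -0.4]
print(round(durbin_watson(residuals), 3))
# Values near 2 suggest independent errors; DW < d_L indicates positive
# autocorrelation, and for DW > 2 one examines 4 - DW instead.
```

Because the sign of the residuals alternates in this made-up series, DW comes out above 2, so rule 4 (examine $4 - DW$) would apply.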
The PRESS residuals can be used to generate another $R^2$-like statistic which reflects prediction capability:

$$R^2_{pred} = 1 - \frac{PRESS}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}.$$

### Utility of the PRESS Residuals

The individual PRESS residuals can be valuable quite apart from their role in computing the PRESS statistic. They give separate measures of the stability of the regression, and they help the analyst isolate cases that have a sizeable influence on the outcome of the fitted model.

Remarks:

1. To obtain the value of the Durbin-Watson statistic, specify the DW option in the MODEL statement.
2. To get values of the influence statistics, specify the INFLUENCE option in the MODEL statement.

Another regression diagnostic used to measure the influence of the $i$th case on each regression coefficient is given by the difference between the estimated regression coefficient $\hat{\beta}_k$ based on all $n$ cases and the regression coefficient estimated with the $i$th case removed, divided by the estimated standard error of $\hat{\beta}_k$ based on $n - 1$ cases:

$$DFBETAS_{k(i)} = \frac{\hat{\beta}_k - \hat{\beta}_{k(i)}}{s(\hat{\beta}_{k(i)})}.$$

A large absolute value of DFBETAS is indicative of a large impact of the $i$th case on the $k$th regression coefficient. The standard recommendation is to consider a case influential if the absolute value of DFBETAS exceeds 1 for small to medium data sets and $2/\sqrt{n}$ for large data sets.

## Statistics 3115/5315, Handout 8: Testing Hypotheses in Multiple Regression

One of the main tools in building an effective multiple regression model is testing hypotheses. The main types of hypothesis tests of interest are:

1. An overall test for the multiple regression model.
2. Tests for the addition of a single variable to a regression model.
3. Tests for the addition of a group of variables to a regression model.
4. A test for the significance of the intercept in a regression model.
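The PRESS shortcut formula needs only the raw residuals and leverages from a single fit of the full data. A Python sketch with hypothetical values:

```python
# Sketch: PRESS from raw residuals and leverages via e_i / (1 - h_i),
# plus the prediction R^2. All numbers below are hypothetical.

def press(raw_residuals, leverages):
    """PRESS = sum_i (e_i / (1 - h_i))^2."""
    return sum((e / (1.0 - h)) ** 2 for e, h in zip(raw_residuals, leverages))

def r2_pred(raw_residuals, leverages, y):
    """R^2_pred = 1 - PRESS / sum_i (y_i - ybar)^2."""
    ybar = sum(y) / len(y)
    ssy = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - press(raw_residuals, leverages) / ssy

e = [0.2, -0.1, 0.3, -0.4]
h = [0.5, 0.25, 0.25, 0.5]   # high-leverage cases inflate their PRESS residuals
y = [1.0, 2.0, 3.0, 4.0]
print(press(e, h), r2_pred(e, h, y))
# A PRESS much larger than the raw residual sum of squares flags
# influential observations or outliers.
```

Note how the two cases with $h_i = 0.5$ contribute four times their raw squared residual to PRESS, which is exactly the leave-one-out inflation the statistic is designed to capture.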
### 1. Overall Test for the Multiple Regression Model

Suppose we are investigating the performance of the regression model

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k + \varepsilon,$$

and we assume that the classical assumptions are valid. Then for testing $H_0\colon \beta_1 = \cdots = \beta_k = 0$ against the alternative that not all the $\beta_j$ are equal to 0, the following test statistic is used:

$$F = \frac{MS\ \text{regression}}{MS\ \text{residual}} = \frac{SSR/k}{SSE/(n - k - 1)},$$

which under the null hypothesis has an $F$ distribution with $k$ degrees of freedom for the numerator and $n - k - 1$ degrees of freedom for the denominator. The larger the value of the computed test statistic, the more evidence we have that the null hypothesis is not true. For a specified level of significance $\alpha$, we reject $H_0$ if the computed value of $F$ exceeds $F_{k, n-k-1, 1-\alpha}$, the critical point of the $F$ distribution with $k$ degrees of freedom for the numerator and $n - k - 1$ degrees of freedom for the denominator corresponding to significance level $\alpha$, which can be obtained from the $F$ table. The SAS output provides the $p$-value for this test, presented above the ANOVA table. Note that the SS and MS corresponding to the regression model are labeled "Model" by SAS. If the above null hypothesis is not rejected, it means that the model we are investigating is not appropriate. On the other hand, if this null hypothesis is rejected, we still have very little information about the actual performance of this model.
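The arithmetic of the overall F statistic can be sketched as follows (SSR, SSE, k, and n are hypothetical numbers, not output from a real fit):

```python
# Sketch: the overall F statistic from the ANOVA decomposition.
# SSR, SSE, k, n below are hypothetical, chosen for illustration.

def overall_f(ssr, sse, k, n):
    """F = (SSR / k) / (SSE / (n - k - 1)), with (k, n - k - 1) df."""
    msr = ssr / k
    mse = sse / (n - k - 1)
    return msr / mse

# A model with k = 3 predictors fitted to n = 20 cases:
f = overall_f(ssr=90.0, sse=40.0, k=3, n=20)
print(f)  # -> 12.0
# Compare f with the F(3, 16) critical value at the chosen alpha level;
# a large value is evidence against H0: beta_1 = ... = beta_k = 0.
```

The comparison against $F_{k, n-k-1, 1-\alpha}$ (or the SAS-reported $p$-value) is then a simple table lookup.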
### 2. Tests for the Addition of a Single Variable to a Regression Model

Partial $F$ tests are used to test the effectiveness of adding a variable to a regression model. Suppose we are investigating the performance of the regression model

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon,$$

and we assume that the classical assumptions are valid. We might be interested in testing whether the addition of the variable $X$ improves the performance of the regression model, given that the variables $X_1, \dots, X_p$ are already in the model. In other words, we are interested in comparing the performance of the models

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon \quad \text{and} \quad Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta X + \varepsilon.$$

This can be accomplished by testing $H_0\colon \beta = 0$ vs $H_a\colon \beta \neq 0$.

Notation:

1. $SS(X_1)$: the sum of squares explained by using only $X_1$ to predict $Y$, i.e., by using a simple regression model with $X_1$.
2. $SS(X_2 \mid X_1)$: the extra sum of squares explained by using $X_2$ in addition to $X_1$ to predict $Y$.
3. $SS(X_i \mid X_1, \dots, X_{i-1})$, $2 \le i \le p$: the extra sum of squares explained by using $X_i$ in addition to $X_1, \dots, X_{i-1}$ to predict $Y$.
4. $SS(X \mid X_1, \dots, X_p)$: the extra sum of squares explained by using $X$ in addition to $X_1, \dots, X_p$ to predict $Y$.
5. $SSR(X_1, \dots, X_i)$: the regression sum of squares for a model with $X_1, \dots, X_i$, $1 \le i \le p$.
6. $SSR(X_1, \dots, X_p, X)$: the regression sum of squares for a model with $X_1, \dots, X_p, X$.

One can show that for $2 \le i \le p$,

$$SS(X_i \mid X_1, \dots, X_{i-1}) = SSR(X_1, \dots, X_i) - SSR(X_1, \dots, X_{i-1}),$$

and

$$SS(X \mid X_1, \dots, X_p) = SSR(X_1, \dots, X_p, X) - SSR(X_1, \dots, X_p).$$

From the SAS output for the regression model $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta X + \varepsilon$, the Type I SS will give us $SS(X_1)$, $SS(X_2 \mid X_1)$, ..., $SS(X \mid X_1, \dots, X_p)$. The Type I SS are called the sequential sums of squares, and they depend on the order in which the variables are entered in the MODEL statement of SAS.

For testing $H_0\colon \beta = 0$, the following test statistic is used:

$$F = F(X \mid X_1, \dots, X_p) = \frac{SS(X \mid X_1, \dots, X_p)}{MS\ \text{residual}(X_1, \dots, X_p, X)},$$

where $MS\ \text{residual}(X_1, \dots, X_p, X) = SSE(X_1, \dots, X_p, X)/(n - p - 2)$ and $SSE(X_1, \dots, X_p, X)$ is the error sum of squares for the model with the variables $X_1, \dots, X_p, X$. Under the null hypothesis, this $F$ statistic has an $F$ distribution with 1 degree of freedom for the numerator and $n - p - 2$ degrees of freedom for the denominator. The SAS output gives the $p$-value for testing this hypothesis; it corresponds to the last line of the Type I SS.

Another way to test the above hypothesis, for the variable added last to the model, is the $t$ test for the significance of the coefficient $\beta$, with statistic $T = \hat{\beta}/s(\hat{\beta})$. The value of this test statistic and the corresponding $p$-value appear in the SAS output adjacent to the estimate of this regression coefficient, in the last line of the table presenting the estimates of the regression coefficients.
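A small Python sketch of the partial F computation from nested-model SSR values (all numbers are hypothetical):

```python
# Sketch: a partial F test for a variable X added last, computed from
# the extra (Type I style) sum of squares. All inputs are hypothetical.

def partial_f(ss_extra, sse_full, n, p):
    """F = SS(X | X1,...,Xp) / MS_residual(X1,...,Xp,X).

    MS_residual = SSE(X1,...,Xp,X) / (n - p - 2); df = (1, n - p - 2).
    """
    return ss_extra / (sse_full / (n - p - 2))

# Extra SS explained by X given X1, X2 already in the model (p = 2),
# obtained from the SSR of the two nested models:
ssr_reduced = 80.0                  # SSR(X1, X2)
ssr_full = 92.0                     # SSR(X1, X2, X)
ss_extra = ssr_full - ssr_reduced   # SS(X | X1, X2) = 12.0
f = partial_f(ss_extra, sse_full=48.0, n=20, p=2)
print(f)  # -> 4.0
```

The resulting statistic would be compared with the $F(1, 16)$ critical value; equivalently, its square root is the $t$ statistic for the coefficient of the variable added last.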
### 3. Tests for the Addition of a Group of Variables to a Regression Model

Multiple partial $F$ tests are used to test the effectiveness of adding a group of variables to a regression model. Suppose we are investigating the performance of the regression model

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon,$$

and we assume that the classical assumptions are valid. We might be interested in testing whether the addition of the variables $X_1^*, \dots, X_k^*$ improves the performance of the regression model, given that the variables $X_1, \dots, X_p$ are already in the model. In other words, we are interested in comparing the performance of the models

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon \quad \text{and} \quad Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta_1^* X_1^* + \cdots + \beta_k^* X_k^* + \varepsilon.$$

This can be accomplished by testing $H_0\colon \beta_1^* = \cdots = \beta_k^* = 0$ vs $H_a\colon$ not all $\beta_j^*$ are 0.

We define the extra sum of squares due to the addition of the variables $X_1^*, \dots, X_k^*$ to the existing model as

$$SS(X_1^*, \dots, X_k^* \mid X_1, \dots, X_p) = SSR(X_1, \dots, X_p, X_1^*, \dots, X_k^*) - SSR(X_1, \dots, X_p).$$

The following $F$ statistic can be used for the above null hypothesis:

$$F = F(X_1^*, \dots, X_k^* \mid X_1, \dots, X_p) = \frac{SS(X_1^*, \dots, X_k^* \mid X_1, \dots, X_p)/k}{MS\ \text{residual}(X_1, \dots, X_p, X_1^*, \dots, X_k^*)},$$

which has an $F$ distribution with $k$ and $n - p - k - 1$ degrees of freedom under $H_0$. If we have the SAS output for the model with $p + k$ independent variables, we can get $SS(X_1^*, \dots, X_k^* \mid X_1, \dots, X_p)$ from the table of Type I SS as follows:

$$SS(X_1^*, \dots, X_k^* \mid X_1, \dots, X_p) = \sum_{j=1}^{k} SS(X_j^* \mid X_1, \dots, X_p, X_1^*, \dots, X_{j-1}^*).$$

### 4. A Test for the Significance of the Intercept in a Regression Model

Suppose we are investigating the performance of the regression model

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon,$$

and we assume that the classical assumptions are valid. We might be interested in testing whether the intercept $\beta_0$ improves the performance of the model. To test $H_0\colon \beta_0 = 0$ vs $H_a\colon \beta_0 \neq 0$, as an "intercept added last" test, the $t$ statistic given by SAS can be used: $T = \hat{\beta}_0 / s(\hat{\beta}_0)$. The value of this test statistic and the corresponding $p$-value are given in the table of the estimates of the regression coefficients.

### 5. Type I and Type II SS and Their Use in Multiple Regression

Type I (sequential) SS and Type II (partial) SS can be obtained in PROC REG in SAS by specifying SS1 and SS2 as MODEL statement options. The Type I SS are called sequential sums of squares, and they represent a partitioning of the regression model SS into component sums of squares due to each independent variable as it is added to the model, in the order given in the MODEL statement of SAS.
As we have seen above, Type I SS are useful when we are comparing two specified regression models. Based on these tests, we can assess the importance of adding one or more variables to a regression model.

Type II SS are usually called partial sums of squares. The Type II SS for a variable is the extra sum of squares due to that variable when all the other variables are already in the model. It is also the reduction in the error sum of squares due to adding that variable to the model that already contains all the other variables. For a regression model

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon,$$

the partial sums of squares are $SS(X_1 \mid X_2, \dots, X_p)$, $SS(X_2 \mid X_1, X_3, \dots, X_p)$, ..., $SS(X_i \mid X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_p)$, ..., $SS(X_p \mid X_1, \dots, X_{p-1})$. The $F$ tests corresponding to these partial SS test the importance of each individual independent variable $X_i$, $1 \le i \le p$, when it is added last to the model. The $t$ tests $T = \hat{\beta}_i / s(\hat{\beta}_i)$ and the associated $p$-values given in the SAS output are equivalent to these $F$ tests.

## Statistics 3115/5315, Handout 7: Introduction to Multiple Regression Analysis

A multiple linear regression model is a generalization of a simple linear regression model, which has one independent variable, to a model with $k$ independent variables:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon. \quad (1)$$

The parameters $\beta_0, \beta_1, \dots, \beta_k$ are called the regression coefficients and have to be estimated from the data $(X_{11}, \dots, X_{k1}, Y_1), \dots, (X_{1n}, \dots, X_{kn}, Y_n)$.

Special cases of the general multiple regression model given in Equation (1) include

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon, \quad (2)$$
$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon, \quad (3)$$
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \beta_4 X_2^2 + \beta_5 X_1 X_2 + \varepsilon. \quad (4)$$

Model (3) is a second-order polynomial model in the single independent variable $X$, and model (4) is called a full (complete) second-order model.

Given the data, the regression coefficients are estimated using the least squares method, by minimizing

$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_{1i} - \hat{\beta}_2 X_{2i} - \cdots - \hat{\beta}_k X_{ki})^2,$$

where $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \cdots + \hat{\beta}_k X_{ki}$ is the predicted (fitted) value of the dependent variable for the $i$th case, $(X_{1i}, X_{2i}, \dots, X_{ki})$. One can show that the $\hat{\beta}_j$'s are linear functions of the $Y_i$'s and therefore have a normal distribution.
Moreover, $\hat{\beta}_j \sim N(\beta_j, \sigma^2_{\hat{\beta}_j})$, where the formula for $\sigma^2_{\hat{\beta}_j}$ is quite complex. Its estimator $s^2_{\hat{\beta}_j}$ can be obtained from the SAS output. SAS will also perform the testing of the hypotheses $\beta_j = 0$.

The $i$th residual is defined as $e_i = Y_i - \hat{Y}_i$.

### Assumptions of the Multiple Regression Model

- Existence: for given values of the independent variables in the model, the dependent variable $Y$ has a univariate distribution with a finite mean and variance.
- Independence: the observations $Y_1, \dots, Y_n$ are independent of each other.
- Linearity: the mean value of $Y$, given specified values of $X_1, X_2, \dots, X_k$, is a linear function of $X_1, X_2, \dots, X_k$: $\mu_{Y \mid X_1, \dots, X_k} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k$. This is equivalent to the linear model assumption together with the fact that the mean of the error is equal to 0.
- Homoscedasticity: the variance of $Y$ is the same for any fixed values of the independent variables in the model: $\sigma^2_{Y \mid X_1, \dots, X_k} = \mathrm{Var}(Y \mid X_1, \dots, X_k) = \sigma^2$.
- Normality: for any given values of $X_1, X_2, \dots, X_k$, the distribution of $Y$ is normal with mean $\mu_{Y \mid X_1, \dots, X_k}$ and variance $\sigma^2$. This is equivalent to the fact that the random error $\varepsilon$ has a normal distribution with mean 0 and variance $\sigma^2$.

### ANOVA for a Multiple Regression Model

As for the simple linear regression model, we have a decomposition of the total variability of the dependent variable $Y$: total sum of squares = regression sum of squares + residual sum of squares, or

$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2.$$

We define the multiple $R^2$ coefficient to be

$$R^2 = \frac{SSR}{SSY} = \frac{SSY - SSE}{SSY}.$$

The adjusted $R^2$ coefficient is given by

$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}.$$

The variance of the errors, $\sigma^2$, can be estimated by $s^2 = MSE = SSE/(n - k - 1)$.

The ANOVA table for multiple regression:

| Source | df | SS | MS | F |
|---|---|---|---|---|
| Regression | $k$ | SSR | SSR$/k$ | MSR/MSE |
| Residual | $n - k - 1$ | SSE | SSE$/(n - k - 1)$ | |
| Total | $n - 1$ | SSY | | |

The above $F$ statistic is used for testing the null hypothesis $H_0\colon \beta_1 = \cdots = \beta_k = 0$. This null hypothesis is rejected at the $\alpha$ level if the computed value of the $F$ statistic exceeds $F_{k, n-k-1, 1-\alpha}$. This test by itself is not very informative unless we accept the above null hypothesis.
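The $R^2$, adjusted $R^2$, and MSE formulas reduce to simple arithmetic on the ANOVA quantities. A Python sketch with hypothetical sums of squares:

```python
# Sketch: multiple R^2, adjusted R^2, and MSE from the ANOVA quantities.
# SSY, SSE, n, k below are hypothetical values.

def r2(ssy, sse):
    """R^2 = SSR / SSY = (SSY - SSE) / SSY."""
    return (ssy - sse) / ssy

def adjusted_r2(ssy, sse, n, k):
    """R^2_adj = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2(ssy, sse)) * (n - 1) / (n - k - 1)

def mse(sse, n, k):
    """s^2 = MSE = SSE / (n - k - 1)."""
    return sse / (n - k - 1)

ssy, sse, n, k = 200.0, 50.0, 26, 4
print(r2(ssy, sse))                 # -> 0.75
print(adjusted_r2(ssy, sse, n, k))
print(mse(sse, n, k))
# Unlike R^2, the adjusted version penalizes adding predictors that
# do not reduce SSE enough to justify the lost degrees of freedom.
```

Here the adjusted value comes out below 0.75, reflecting the degrees-of-freedom penalty for the four predictors.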
Remarks:

1. All the regression diagnostics we have discussed for the simple linear regression model have to be adjusted for the general regression model; see Chapter 14.
2. Confidence intervals for the average response and for a future observation of the dependent variable can be obtained via SAS with the CLM and CLI options.

## Statistics 3115/5315, Handout 4: Regression Diagnostics (Analysis of the Residuals)

Most regression diagnostics are based on the analysis of various types of residuals. We proceed to define these residuals and discuss their properties and applications.

1. The raw residuals (or just residuals) have been defined as $e_i = Y_i - \hat{Y}_i$, $i = 1, \dots, n$.

2. $\sum_{i=1}^{n} e_i = 0$. This immediately implies that the raw residuals are dependent. If $n$ is large, this dependence can be ignored.

3. The estimate of the variance of the random errors $\varepsilon$ is based on the residuals:

$$s^2_{Y \mid X} = \frac{1}{n - 2} \sum_{i=1}^{n} e_i^2 = MSE, \qquad E(s^2_{Y \mid X}) = \sigma^2.$$

To simplify the presentation of the results, we adopt the notation $s^2 = s^2_{Y \mid X}$.

4. Since the range of the raw residuals varies from one data set to another, depending on the units of the data set, it makes sense to standardize the residuals. One approach is to define the standardized residuals $z_i = e_i / s$. The standardized residuals have unit sample variance in the sense that

$$\frac{1}{n - 2} \sum_{i=1}^{n} z_i^2 = 1.$$

5. One can show that for $i = 1, \dots, n$, $E(e_i) = 0$ and $\mathrm{Var}(e_i) = \sigma^2 (1 - h_i)$, where, for a straight-line model with intercept,

$$h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2}$$

is called the leverage of the $i$th observation. Also, one can show that the $e_i$'s are linear combinations of $Y_1, \dots, Y_n$ and therefore have a normal distribution.

6. Since the $e_i$'s do not have the same population variance, a better way to standardize the residuals is to consider the studentized residuals, defined as

$$r_i = \frac{e_i}{s \sqrt{1 - h_i}}.$$

In some books these are called internally studentized residuals, the reason being that $s$ is a function of $e_i$ and therefore not independent of it. Studentized residuals have a mean that is close to zero, and their variance can be estimated by $\frac{1}{n-2} \sum_{i=1}^{n} r_i^2$, which is slightly greater than 1. For large $n$, the $r_i$ have an approximate $t$ distribution with $n - 2$ degrees of freedom. Also, $r_i^2/(n - 2)$ has a beta distribution with parameters $1/2$ and $(n - 2)/2$.
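The leverage and studentized-residual formulas for the straight-line case can be sketched directly (the data values are hypothetical; a real analysis would take the $e_i$ and $s$ from the fitted model):

```python
# Sketch: leverages and internally studentized residuals for a
# straight-line model with intercept. All data values are hypothetical.
from math import sqrt

def leverages(x):
    """h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]

def studentized(e, h, s):
    """r_i = e_i / (s * sqrt(1 - h_i))."""
    return [ei / (s * sqrt(1.0 - hi)) for ei, hi in zip(e, h)]

x = [1.0, 2.0, 3.0, 4.0, 10.0]   # the last point is far from xbar
h = leverages(x)
print([round(hi, 3) for hi in h])
# The leverages sum to 2 (one per estimated coefficient), and the extreme
# point x = 10 carries by far the largest h_i; 1/n <= h_i <= 1 holds here.

e = [0.5, -0.2, 0.1, 0.3, -0.7]  # hypothetical raw residuals, s = 0.5
print([round(ri, 2) for ri in studentized(e, h, s=0.5)])
```

Dividing by $\sqrt{1 - h_i}$ is what magnifies the residual of the high-leverage point, which is exactly why studentized residuals flag such cases better than raw ones.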
7. Analogously, one can define the so-called jackknife residuals

$$r_{(-i)} = \frac{e_i}{s_{(-i)} \sqrt{1 - h_i}},$$

where $s^2_{(-i)}$ is the residual variance computed with the $i$th observation deleted. In some books these residuals are called externally studentized residuals. Jackknife residuals also have mean approximately equal to zero, and their variance is also slightly greater than 1. If the classical assumptions hold, then each of the jackknife residuals has an exact $t$ distribution with $n - 3$ degrees of freedom.

Remarks:

1. Under normal circumstances, all the different residuals will exhibit similar behavior. On the other hand, if we have influential observations (observations with a high value of $h_i$), then $r_i$, and especially $r_{(-i)}$, will point out this problem.

2. One can show that

$$\hat{Y}_i = \sum_{j=1}^{n} h_{ij} Y_j = h_i Y_i + \sum_{j \neq i} h_{ij} Y_j,$$

where we have denoted $h_{ii} = h_i$, and for the model with an intercept

$$h_{ij} = \frac{1}{n} + \frac{(X_i - \bar{X})(X_j - \bar{X})}{\sum_{l=1}^{n} (X_l - \bar{X})^2}.$$

If the $i$th observation is far from $\bar{X}$ while the rest are not, then $h_i$ will tend to be large while the $h_{ij}$, $j \neq i$, will tend to be small, and therefore $Y_i$ will have a greater influence on $\hat{Y}_i$.

3. For a model with the intercept, $1/n \le h_i \le 1$. Observations with large values of $h_i$ will have a small value of $\mathrm{Var}(e_i)$ and, regardless of what value of $Y_i$ is observed, will have a residual close to 0. That is the reason why it is important to examine the studentized, or even better the jackknife, residuals to detect influential observations that may corrupt the effectiveness of the fitted model.

### Outliers

Outliers are extreme observations that do not belong to the data set. To detect outliers, one examines the various residuals. For example, if we decide to examine the jackknife residuals, the fact that they have a $t$ distribution with $n - 3$ degrees of freedom is helpful.
For example, if we have a data set of $n = 10$ observations and we want to test whether any of these observations are outliers, then we have to perform 10 tests simultaneously. The critical values are given in Table A-8a on page 729.

Outlier detection based on the leverages has been discussed by Hoaglin and Welsch (1978), who recommend looking carefully at any observation with a large leverage. Table A-9 provides leverage critical values for the statistic $\max_i h_i$, adjusted for the multiple testing of the $n$ leverages.

### Influential Observations

One of the most popular measures for evaluating the influence of an observation is Cook's distance. It measures the extent of change in the estimates of the regression coefficients if a particular observation is deleted. Cook's distance for the $i$th observation (straight-line model, two estimated coefficients) is given by

$$d_i = \frac{r_i^2 h_i}{2 (1 - h_i)} = \frac{e_i^2 h_i}{2 s^2 (1 - h_i)^2}.$$

One can show that $d_i$ is proportional to the combined squared change in the estimated coefficients $(\hat{\beta}_0, \hat{\beta}_1)$ when the $i$th case is deleted. Cook and Weisberg (1982) recommend scrutinizing any observation with $d_i > 1$. Table A-10 gives critical values for examining the maximum of a set of Cook's distances.

Another measure of the influence of observations is

$$DFFITS_i = \frac{\hat{Y}_i - \hat{Y}_{i(-i)}}{s_{(-i)} \sqrt{h_i}} = r_{(-i)} \sqrt{\frac{h_i}{1 - h_i}},$$

where $\hat{Y}_{i(-i)}$ is the fitted value for the $i$th case computed with that case deleted. If $|DFFITS_i| > 2$, the observation has to be scrutinized.
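As a closing illustration, Cook's distance and DFFITS for the straight-line case reduce to simple arithmetic on quantities already computed during diagnostics. A Python sketch with hypothetical inputs:

```python
# Sketch: Cook's distance and DFFITS for a straight-line model (two
# estimated coefficients), from raw residuals, leverages, and variance
# estimates. All inputs below are hypothetical.
from math import sqrt

def cooks_distance(e_i, h_i, s2):
    """d_i = e_i^2 * h_i / (2 * s^2 * (1 - h_i)^2), for a model with
    intercept and slope (2 coefficients)."""
    return e_i ** 2 * h_i / (2.0 * s2 * (1.0 - h_i) ** 2)

def dffits(jackknife_r_i, h_i):
    """DFFITS_i = r_(-i) * sqrt(h_i / (1 - h_i))."""
    return jackknife_r_i * sqrt(h_i / (1.0 - h_i))

# A high-leverage case with a sizable residual:
d = cooks_distance(e_i=1.2, h_i=0.8, s2=0.25)
print(round(d, 1))   # far above the d_i > 1 rule of thumb
print(abs(dffits(jackknife_r_i=2.5, h_i=0.8)) > 2)
```

Both measures combine the same two ingredients, residual size and leverage, which is why a point can be influential even when its raw residual looks unremarkable.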
