QUANT METH BEHAV DATA I PSYC 709
These 19 pages of class notes were uploaded by Alfonso Grady PhD on Monday, October 26, 2015. The class notes belong to PSYC 709 at University of South Carolina - Columbia, taught by Staff in Fall.
Date Created: 10/26/15
Single Factor Analysis of Variance (ANOVA)

1 The ANOVA model

1.1 Using a Factor with 2 levels

First off, let's assume that we are fitting a general linear model with one dependent variable:

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \qquad (1)$$

Let's further assume that our X takes on two values, 0 or 1:

    X = 0: Subject belongs to group A
    X = 1: Subject belongs to group B

What does our model look like when X = 0? What does our model look like when X = 1? What does our $\beta_0$ estimate? What does our $\beta_1$ estimate? (With this coding, $\beta_0$ is the mean of group A and $\beta_1$ is the difference between the group B mean and the group A mean.)

Let's denote $\bar{X}_A$ as the average Y value for Group A, and $\bar{X}_B$ as the average Y value for Group B. Then when model (1) is fit using PROC GLM, the test of $H_0: \beta_1 = 0$ is equivalent to the test of $H_0: \bar{X}_A = \bar{X}_B$. This is the idea of ANOVA.

Let's do an example in SAS using the MEEXP.txt data set. The variable HIST is coded as follows: 0 = No History, 1 = History. Notice that "No History" will be the control group. We can fit an ANOVA, with perceived benefit of mammography as the response, by using HIST:

    proc glm data=tmp1.meexp2;
      model pb = HIST;
    run; quit;

Now we will do this using the CLASS statement:

    proc glm data=tmp1.meexp2;
      class HIST;
      model pb = HIST / solution;
    run; quit;

Notice which group is considered the control group. Also notice the equivalence in the results.

ALWAYS PLOT THE DATA USING PROC BOXPLOT TO CHECK WHETHER WE CAN ASSUME THAT THE VARIANCES OF THE DIFFERENT GROUPS ARE EQUAL. ALSO CHECK THE USUAL ASSUMPTIONS FROM REGRESSION: NORMALITY OF RESIDUALS, AUTOCORRELATION, LINEARITY, AND HOMOSCEDASTICITY.

NOTE: ANOVA on a factor with two levels is equivalent to the (pooled-variance) two-sample t test.

1.2 Using a factor with more than 2 levels

Now suppose that we wish to fit an ANOVA model on a categorical variable which has more than 2 levels, say the "Natural Repellent" data, which is not on Blackboard. This data set has two variables:

    Y = percentage of 20 flies repelled
    X = treatment:
        1 = Untreated cloth
        2 = Piperine
        3 = Black Pepper
        4 = Lemon Juice
        5 = Hesperidin
        6 = Ascorbic Acid
        7 = Citric Acid

This model can be fit in SAS using the following:

    PROC SORT; BY TRT;
    PROC BOXPLOT; PLOT Y*TRT;
    PROC GLM;
      CLASS TRT;
      MODEL Y = TRT / SOLUTION;
      LSMEANS TRT;
    RUN; QUIT;

The LSMEANS statement will estimate arithmetic means adjusted for other effects in the model, even though we don't have any. Note the control level that is specified. Is that what we would have wanted? (There are two ways we can change this.) What are the tests of the $\beta$s equivalent to? What is the overall F test equivalent to?

1.3 Multiple Comparisons using LSMEANS statement

The next two sections will discuss comparing means of different factor levels. First we will compare all the means to each other, or to a control group. Second we will use linear combinations of the factor levels. Whatever comparison scheme you are using should be decided beforehand.

LSMEANS will provide means which are adjusted for the average value of the specified covariates. If there are no covariates in the model, then LSMEANS will provide the unadjusted means of the different factor levels. There are many different options that can be used in LSMEANS.

Obtaining means and CI: The first, which we used above, is to estimate the mean of each factor level. To this we can add a confidence interval option:

    PROC GLM;
      CLASS TRT;
      MODEL Y = TRT / SOLUTION;
      LSMEANS TRT / CL ALPHA=.05;
    RUN; QUIT;

Notice that the alpha option has been added and can be changed.

Obtaining the difference in means to the control: We will estimate the difference between each mean and a control mean. This can be done by using the PDIFF=CONTROL option:

    PROC GLM;
      CLASS TRT;
      MODEL Y = TRT / SOLUTION;
      LSMEANS TRT / CL PDIFF=CONTROL;
    RUN; QUIT;

Notice that we have 6 differences and 6 95% confidence intervals. What is the probability that at least one confidence interval does not contain the true difference in means? For k independent 95% intervals this probability is $1 - 0.95^k$:

    Number of 95% CIs    Probability of at least one error
            1                        .05
            2                        .098
            3                        .14
            4                        .19
           20                        .64
           50                        .92
          100                        .99

Notice that SAS says "Simultaneous 95% Confidence Limits for LSMean(i)-LSMean(j)". So SAS will already correct for this problem. If you do not want to correct for this problem you can use the ADJUST option in the following way:

    PROC GLM;
      CLASS TRT;
      MODEL Y = TRT / SOLUTION;
      LSMEANS TRT / CL PDIFF=CONTROL ADJUST=T;
    RUN; QUIT;

There are many different ways to adjust your confidence level so that you are 95% confident that all of the intervals are true. There are some methods which are a compromise between being 95% confident that all are true and 95% confident that each individual confidence interval is true (e.g. Fisher's "protected" test).

Obtaining the difference in means to each other: We will now estimate the differences between all pairs of means, along with a CI for each. This can be done using the PDIFF option:

    PROC GLM;
      CLASS TRT;
      MODEL Y = TRT / SOLUTION;
      LSMEANS TRT / CL PDIFF;
    RUN; QUIT;

Notice that the default that SAS uses here does not correct the alpha level so that we are 95% confident that all the CIs are true. If you would like SAS to correct, use the ADJUST=TUKEY method.

1.4 Multiple Comparisons using Estimate statement

The above is all useful for estimating the differences in means, and their significance, between a control group and treatment groups. Sometimes we would like to estimate other differences among the groups. Say we would like to estimate and test the following:

    1. $\bar{X}_1 - \bar{X}_3 = 0$   (Untreated vs. Black Pepper)
    2. $\bar{X}_4 - \bar{X}_5 = 0$   (Lemon Juice vs. Hesperidin)
    3. $\text{Mean of}(\bar{X}_2, \bar{X}_3, \bar{X}_4, \bar{X}_5, \bar{X}_6, \bar{X}_7) - \bar{X}_1 = 0$

This last test will be useful when trying to test if the mean of all of the treatment groups is different from the mean of the control group. This type of test will also be useful for testing the equality of two different types of treatments. For example, if one set of treatment groups were physical treatments and another were psychological treatments, you might want to test the difference between the two. This can be done using the ESTIMATE statement:

    PROC GLM;
      CLASS TRT;
      MODEL Y = TRT / SOLUTION;
      ESTIMATE "TRT1 - TRT3" TRT 1 0 -1 0 0 0 0;
      ESTIMATE "TRT4 - TRT5" TRT 0 0 0 1 -1 0 0;
      ESTIMATE "MEAN OF TRTS 2-7 - TRT 1" TRT -6 1 1 1 1 1 1 / DIVISOR=6;
    RUN; QUIT;

Note that the linear coefficients have to sum to 0 or error messages will appear. The DIVISOR option will divide all the linear coefficients by that value. For example, for a treatment with four levels, the coefficient list TRT 4 1 1 1 / DIVISOR=4 is equivalent to TRT 1 .25 .25 .25.

NOTE: Only variables which were specified in the CLASS statement, i.e. classification variables, can be used in the ESTIMATE statement.

Polynomial Regression

1 Quadratic Regression

Setting:
    • Y versus a single predictor
    • Y vs. X1 relationship is not linear
    • Transformation of Y and/or X1 is not successful or not allowed

Quadratic regression model:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i1}^2 + \epsilon_i \qquad (1)$$

with the usual multiple linear regression assumptions. Often, to minimize the collinearity, we will center the predictor variables. The centering is done around the mean. After centering, the regression coefficient $\beta_0$ represents the mean response of Y when $X - \bar{X} = 0$, or when $X = \bar{X}$. $\beta_2$ is often referred to as the quadratic effect coefficient.

Higher-order polynomials:

Cubic regression ("third order model"):

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i1}^2 + \beta_3 X_{i1}^3 + \epsilon_i \qquad (2)$$

Quartic regression ("fourth order model"):

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i1}^2 + \beta_3 X_{i1}^3 + \beta_4 X_{i1}^4 + \epsilon_i \qquad (3)$$

By adding higher and higher polynomial terms, the error sum of squares for the fitted regression will be reduced. However,
    • The fitted regression is often meaningless
    • Often high-order polynomials do not fit well by visual inspection
    • Severe collinearity can be created, even using centered X's

These models can be easily fit in SAS:

    proc glm;
      model y = x x*x x*x*x;
    run;

2 Cross-product Terms

Two-variable simple interaction model:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1} X_{i2} + \epsilon_i \qquad (4)$$

X1 and X2 are said to be additive in their relationship to Y if the slope of Y with respect to X1 does not depend on X2, and vice versa; otherwise, they are said to interact. The hypothesis test of $H_0: \beta_3 = 0$ is equivalent to a test of interaction between X1 and X2.

The meaning of the regression coefficients $\beta_1$ and $\beta_2$ here is not the same as that given earlier because of the interaction term $\beta_3 X_{i1} X_{i2}$. The coefficients $\beta_1$ and $\beta_2$ no longer indicate the change in the mean of Y with a unit increase in the predictor variable, with the other predictor held constant at any given level. It can be shown that the change in the mean response with a unit increase in X1 when X2 is held constant is $\beta_1 + \beta_3 X_2$. Similarly, the change in mean response with a unit increase in X2 when X1 is held constant is $\beta_2 + \beta_3 X_1$.

These models can also be easily fit in SAS:

    proc glm;
      model y = x1 x2 x1*x2;
    run;

Note: These interaction models can also present collinearity issues. For this reason it also might be necessary to center each independent variable by its sample mean.

3 Building A Regression Model

Setting: we have a valid MLR model (assumptions checked) with a large number K of terms.

Goal: arrive at a parsimonious submodel of k (note lower case k) terms that still explains Y well. There are $2^K - 1$ submodels.

Underfitting occurs when important regressors are left out of the model. This can lead to seriously deficient models and serious misinterpretations of variable relationships.

Overfitting occurs when all important regressors are in the model, but some unimportant ones are, too. Costs: df for error, unneeded complexity, somewhat widened confidence and prediction intervals.

Bottom line: slight overfitting is preferable to any kind of underfitting.

Question: How should we compare one model to another?

Answer 1: Look for models with low SSE (high $R^2$). Problem: when regressors are added to a model, even if they are ridiculous regressors, SSE will never increase.

Answer 2: Weigh decreases in SSE versus increases in model size k.

Some Model Selection Criteria

1. The MSE criterion: choose the model with minimum MSE, since MSE controls for model size. This is equivalent to minimizing ROOT MSE or maximizing adjusted $R^2$ ($R^2_{adj}$). Note that $R^2_{adj}$ can be negative.

Three other model selection criteria are:
    • $AIC = n \ln(SSE/n) + 2p$, and choose the model with smallest AIC
    • $BIC = n \ln(SSE/n) + p \ln(n)$, and choose the model with smallest BIC (SAS calls BIC "SBC")
    • $C_k = SSE_k / MSE_K - (n - 2k)$. Choose the smallest k such that $C_k \approx k$, then choose the model of that size with smallest SSE.

Comments: There is no universal agreement on which model selection criterion is best, and there are others we've not discussed. They tend to pick similar models. They do not incorporate scientific or good-sense knowledge about the predictors. Use them to aid the model selection process; you make the decision.
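The selection criteria above are simple enough to check by hand. The following is a minimal Python sketch, not part of the original notes, that evaluates AIC = n ln(SSE/n) + 2p, BIC = n ln(SSE/n) + p ln(n), and C_k = SSE_k / MSE_K - (n - 2k) for a few candidate submodels. The sample size, SSE values, and full-model MSE are made up for illustration, and the sketch assumes that p and k both count the fitted parameters, including the intercept.

```python
import math

def aic(sse, n, p):
    """AIC = n*ln(SSE/n) + 2p, as written in the notes."""
    return n * math.log(sse / n) + 2 * p

def bic(sse, n, p):
    """BIC (called SBC in SAS) = n*ln(SSE/n) + p*ln(n)."""
    return n * math.log(sse / n) + p * math.log(n)

def mallows_ck(sse_k, mse_full, n, k):
    """C_k = SSE_k / MSE_K - (n - 2k); MSE_K comes from the full K-term model."""
    return sse_k / mse_full - (n - 2 * k)

# Hypothetical inputs (assumptions, not data from the course).
n = 50            # sample size
mse_full = 2.0    # MSE of the full model
# (label, number of parameters, SSE of that submodel)
candidates = [("x1", 2, 140.0), ("x1 x2", 3, 96.0), ("x1 x2 x3", 4, 92.0)]

for label, p, sse in candidates:
    print(label,
          round(aic(sse, n, p), 2),
          round(bic(sse, n, p), 2),
          round(mallows_ck(sse, mse_full, n, p), 2))

# Smallest-is-best for both AIC and BIC.
best_by_aic = min(candidates, key=lambda c: aic(c[2], n, c[1]))[0]
best_by_bic = min(candidates, key=lambda c: bic(c[2], n, c[1]))[0]
print("AIC picks:", best_by_aic, "| BIC picks:", best_by_bic)
```

With these made-up numbers the two criteria need not agree, because BIC charges ln(n) per parameter instead of 2. That is one reason the notes treat the criteria as aids to a decision and not as the decision itself.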
    PROC REG;
      MODEL Y = X1 X2 / SELECTION=RSQUARE options;

Some options:
    • INCLUDE=i forces the first i predictors into every considered model
    • START=m1 considers only models with at least m1 predictors
    • STOP=m2 considers only models with at most m2 predictors
    • SELECT=b prints only the b best models
    • AIC, SBC, CP (what we called $C_k$), RMSE: prints the specified model selection criterion for each model considered
    • BACKWARD, FORWARD, or STEPWISE: explanation below

Stepwise Regression (Backward, Forward, or Stepwise): Stepwise regression algorithms consider submodels one after another in a hopefully "intelligent" fashion, printing out summaries and eventually settling on a "best" model. These algorithms are best viewed simply as automatic ways of looking at many models. Their final model choice should not be mindlessly adopted.

Stepwise Regression 1: Backward elimination starts with the full model, then removes the least significant regressor if its partial F test P-value is greater than a user-specified cutoff (in SAS, SLSTAY; the default is SLSTAY=.10). It continues until all remaining regressors are significant at the specified level.

Stepwise Regression 2: Forward selection starts with the intercept-only model, then adds the regressor giving the best 1-variable model if its P-value is less than a user-specified cutoff (in SAS, SLENTRY; the default is SLENTRY=.50). More regressors are added until none are available satisfying the entry criterion.

Stepwise Regression 3: Stepwise regression starts with forward selection. When at least two variables have been entered, it checks to see if any have become insignificant. It continues until no more regressors are available satisfying the entry criterion and all variables in the model satisfy the retention criterion.

Let's try an example with a new data set.

Generalized Logistic Regression

1 Multinomial Logistic Regression

Logistic regression can easily be extended to outcomes with multiple categories. Initially consider an outcome Y with values 0, 1, ..., J. We will consider Y = 0 our referent or non-case group, but beyond that we do not need to make any assumptions about order of severity for the remaining outcome categories. For simplicity most notation and examples will assume that J = 2, or a total of 3 possible outcomes. The model is based on the generalized logit function:

$$g_1(x) = \log\left[\frac{P(Y=1 \mid x)}{P(Y=0 \mid x)}\right] = \beta_{01} + \beta_{11} X \qquad (1)$$

$$g_2(x) = \log\left[\frac{P(Y=2 \mid x)}{P(Y=0 \mid x)}\right] = \beta_{02} + \beta_{12} X \qquad (2)$$

The conditional probabilities for each outcome category are:

$$P(Y=0 \mid x) = \frac{1}{1 + \exp(g_1(x)) + \exp(g_2(x))} \qquad (3)$$

$$P(Y=1 \mid x) = \frac{\exp(g_1(x))}{1 + \exp(g_1(x)) + \exp(g_2(x))} \qquad (4)$$

$$P(Y=2 \mid x) = \frac{\exp(g_2(x))}{1 + \exp(g_1(x)) + \exp(g_2(x))} \qquad (5)$$

The coefficients are interpreted as log odds ratios, just as in the binary logistic model; hypothesis tests and confidence intervals are constructed similarly. In some circumstances, it may further support the analysis to compare the magnitude of the two estimated odds ratios. For example, to test that the coefficients (i.e., log odds ratios) for a variable X are the same for outcomes 1 and 2, we would test the hypothesis $\beta_{11} = \beta_{12}$, or equivalently, $\beta_{11} - \beta_{12} = 0$. The point estimate is the difference in the estimated values; the variance can be calculated as

$$\widehat{Var}(\hat\beta_{11} - \hat\beta_{12}) = \widehat{Var}(\hat\beta_{11}) + \widehat{Var}(\hat\beta_{12}) - 2\,\widehat{Cov}(\hat\beta_{11}, \hat\beta_{12}) \qquad (6)$$

Let's fit this model with a nominal three-level outcome in PROC LOGISTIC with the LINK=GLOGIT option:

    proc logistic data=meexp descending order=data;
      model me = hist / link=glogit;
    run;

2 Ordinal Regression

When the scale of a multiple-category outcome is ordinal, one could use the multinomial logistic model described above. This analysis would not take into account the ordinal nature of the outcome, and hence the estimated odds ratios may not address the questions asked of the analysis. We will now discuss a number of different logistic regression models that take into account the ordering of the outcomes. Recall the multinomial model for outcome category j:

$$\log\left[\frac{P(Y=j \mid x)}{P(Y=0 \mid x)}\right] = \beta_{0j} + \beta_{1j} X \qquad (7)$$

There are other ways to model this. For example, the Adjacent Category Logistic model:

$$\log\left[\frac{P(Y=j \mid x)}{P(Y=j-1 \mid x)}\right] = \beta_{0j} + \beta_{1} X \qquad (8)$$

The Continuation Ratio Logistic model:

$$\log\left[\frac{P(Y=j \mid x)}{P(Y<j \mid x)}\right] = \beta_{0j} + \beta_{1j} X \qquad (9)$$

And the Proportional Odds model, in which the slope is common to all categories:

$$\log\left[\frac{P(Y \le j \mid x)}{P(Y>j \mid x)}\right] = \beta_{0j} + \beta_{1} X \qquad (10)$$

Right now we'll only discuss the Proportional Odds and the Multinomial models.
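Since the multinomial model is one of the two carried forward, a small numerical check of its conditional probabilities may help. The Python sketch below is an illustration added to the notes, not the course's own code. It evaluates P(Y=0|x) = 1 / (1 + sum of exp(g_j(x))) and P(Y=j|x) = exp(g_j(x)) / (1 + sum of exp(g_j(x))), with g_j(x) = b0j + b1j*x and Y = 0 as the referent. The function name and the coefficient values are hypothetical.

```python
import math

def generalized_logit_probs(x, intercepts, slopes):
    """Return [P(Y=0|x), P(Y=1|x), ..., P(Y=J|x)] with Y=0 as the referent.

    intercepts[j-1] and slopes[j-1] hold b0j and b1j for category j = 1..J.
    """
    logits = [b0 + b1 * x for b0, b1 in zip(intercepts, slopes)]
    denom = 1.0 + sum(math.exp(g) for g in logits)
    # Referent category first, then categories 1..J.
    return [1.0 / denom] + [math.exp(g) / denom for g in logits]

# Hypothetical coefficients for a three-level outcome (J = 2).
probs = generalized_logit_probs(x=1.0, intercepts=[-1.0, -2.0], slopes=[0.5, 1.2])
print(probs, sum(probs))

# The log of P(Y=1|x) / P(Y=0|x) recovers g_1(x) = -1.0 + 0.5 * 1.0.
print(math.log(probs[1] / probs[0]))
```

The probabilities always sum to one, and the ratio of any category to the referent depends only on that category's own logit, which is why each slope can be read as a log odds ratio against Y = 0.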
Probably the most frequently used ordinal logistic regression model in practice is the proportional odds model. The other models mentioned compare a single outcome response to one or more reference responses. The proportional odds model describes a "less than or equal" versus "more" comparison. For example, if the outcome is extent of disease, the model gives the log odds of a no-more-severe outcome versus a more severe outcome. The constraint placed on the model is that the effect of X on the log odds (the log odds ratio) does not depend on the outcome category. Thus inferences from the model lend themselves to a general discussion of direction of response and do not have to focus on specific outcome categories. Let's fit this in SAS:

    /* multinomial (generalized logit) model */
    proc logistic data=temp descending;
      model bwt4 = smoke / link=glogit;
    run;

    /* proportional odds model (the default for an ordinal response) */
    proc logistic data=temp descending;
      model bwt4 = smoke;
      output out=da pred=p predprobs=(i c);
    run;

Checking Regression Assumptions

1 Testing for Autocorrelation and Normality

In this section we will learn some techniques to test two of our regression assumptions. Recall that the four main simple linear regression assumptions are:

    1. The residuals are centered around zero for all predicted values (linearity). Can be checked using loess lines.
    2. $Var(\epsilon_i) = \sigma^2$ (constant error variance). Also can be checked using loess lines.
    3. $\epsilon_i$ are independent or uncorrelated, $i = 1, 2, \ldots, n$.
    4. $\epsilon_i$ are normally distributed.

Today we will go over ways to test and check the last two assumptions above. The fourth assumption can be checked using PROC UNIVARIATE, with methods previously learned, while autocorrelation can be tested for using PROC ARIMA. Here is an example which checks all four:

    proc glm data=tmp1.school2;
      model gpa = hs iq;
      output out=dat r=ry p=py;
    proc gplot;
      plot ry*(py iq) / vref=0;
    proc univariate normal;
      var ry;
      qqplot ry;
      histogram ry / normal;
    proc arima;
      identify var=ry;
    run; quit;

Let's look at two examples which violate certain assumptions:

    /* Example 1: the error spread depends on x1 */
    data ali;
      do i = 1 to 1000 by 1;
        x1 = normal(0)*20 + 100;
        x2 = normal(0)*20 + 100;
        y = 14 + 4*x1 + 8*x2 + normal(0)*x1**2;
        output;
      end;
    run;
    proc glm data=ali;
      model y = x1 x2;
      output out=dat r=ry p=py;
    proc gplot;
      plot ry*(py x1) / vref=0;
    proc univariate normal;
      var ry;
      qqplot ry;
      histogram ry / normal;
    proc arima;
      identify var=ry;
    run; quit;

    /* Example 2: the coefficient on x2 changes with the observation index i */
    data ali;
      do i = 1 to 1000 by 1;
        x1 = normal(0)*20 + 100;
        x2 = normal(0)*20 + 100;
        y = 14 + 4*x1 + i*x2 + normal(0);
        output;
      end;
    run;
    proc glm data=ali;
      model y = x1 x2;
      output out=dat r=ry p=py;
    proc gplot;
      plot ry*(py x1) / vref=0;
    proc univariate normal;
      var ry;
      histogram ry / normal;
      qqplot ry;
    proc arima;
      identify var=ry;
    run; quit;

2 Calculating ICC with PROC MIXED

PROC MIXED is a procedure that we will use in detail in the upcoming semester. For today we'll only use it to find the ICC for the UIS data set on Blackboard. Recall that

$$ICC = \frac{\tau_{00}}{\tau_{00} + \sigma^2}$$

Here's how we will calculate it:

    data uis;
      infile "Z:\uis.dat";
      input ID AGE BECK IVHX NDRUGTX RACE TREAT SITE DRFREE;
    run;
    proc sort; by site; run;
    PROC MIXED;
      class site;
      model beck = IVHX NDRUGTX RACE;
      random int / subject=site;
    run; quit;

Semi-partial $R^2$ and Hierarchical Analysis

1 Calculating $\beta$ and semipartial $R^2$

1.1 Semipartial $R^2$

PROC CORR provides correlations and other descriptive statistics involving multiple variables. These are useful for quantitative variables.

Correlation: $\rho = corr(X, Y)$, where X and Y are (quantitative) random variables. This is the population correlation and measures the strength of the linear relationship between X and Y. If X and Y are independent then $\rho = 0$, but the converse is not always true. The sample Pearson correlation is $r$. The Spearman correlation is obtained by ranking each of the X and Y variables and computing the Pearson correlation on the ranks. This is a nonparametric measure of the association between X and Y that is less affected by outliers. And remember to always plot the data.

The formula for the semi-partial $R^2$ for the correlation of Y and X beyond that which was accounted for by W is:

$$sr^2_{Y(X \cdot W)} = \left[\frac{r_{YX} - r_{YW}\, r_{XW}}{\sqrt{1 - r_{XW}^2}}\right]^2 \qquad (1)$$

In SAS this can be done using PROC CORR and the PARTIAL statement. The following statements would calculate the semi-partial $r$ for the variables weight, oxygen, and runtime beyond that which was accounted for by age. (Strictly, the PARTIAL statement removes age from both variables in each pair, so the values it reports are partial correlations.)

    PROC CORR DATA=FAKE;
      var weight oxygen runtime;
      partial age;
    RUN;

$R^2$ will then be obtained by squaring the correlation.

1.2 $\beta$'s

PROC STANDARD is a procedure that can be used to create Z scores in SAS. The syntax for STANDARD is similar to that of PROC MEANS; here is an example:

    proc standard data=new out=ABC mean=0 std=1 replace;
      var statwght sysbp diabp;
      by sex;
    run;
    proc print data=ABC;
    run;

This procedure creates a temporary data set ABC where the variables listed in the VAR statement (statwght, sysbp, and diabp) are standardized. This means that their values are equal to their Z scores. We can now use these Z scores to calculate the $\beta$'s:

    proc glm data=ABC;
      model statwght = sysbp diabp / clparm alpha=.1;
      output out=regdata r=resid p=yhat;
    run; quit;
    proc gplot data=regdata;
      plot yhat*resid;
    run; quit;

2 Hierarchical Analysis

2.1 The ANOVA table

    Source    df           Sums of Squares    Mean Square    F Ratio
    Model     k            SSR                MSR            F
    Error     n - k - 1    SSE                MSE
    Total     n - 1        SSTO

Where, if we refer back to the lecture notes or the book:
    • SSR = regression SS
    • SSE = residual SS
    • MSR = regression MS
    • MSE = residual MS

Or as formulas:

$$SSE = \sum e_i^2, \qquad SSTO = \sum (Y_i - \bar{Y})^2, \qquad SSR = SSTO - SSE \qquad (2)$$

2.2 Significance Testing for Variables

$SSR(X_2 \mid X_1)$: the decrease in the error sum of squares obtained by adding X2 to a model containing X1.

$$SSR(X_2 \mid X_1) = SSE(X_1) - SSE(X_1, X_2) \qquad (3)$$

There are two different ways this can be used for testing the significance of variables. They are called Sequential Sums of Squares (Type I SS) and Partial Sums of Squares (Type III SS). Assuming we have a model with 6 explanatory variables, we can find SS1 and SS3 in SAS by:

    PROC GLM;
      MODEL Y = x1 x2 x3 x4 x5 x6 / SS1 SS3;
    RUN;

We can use this output to form a significance test of the added variables. Say we have an already established model $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$ and want to test if X4 and X5 are significant:

    $H_0: \beta_4 = \beta_5 = 0$   vs.   $H_a$: "not $H_0$"

Here is how you would do this in SAS, using PROC REG:

    PROC REG;
      MODEL Y = x1 x2 x3 x4 x5 / SS1 SS2;
      TEST x4 = 0, x5 = 0;
    RUN;

Notice that in PROC REG, SS3 is denoted as SS2. Furthermore, this can be done for as many $\beta$s as you would like.
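The test of added variables can also be worked by hand from two error sums of squares. The Python sketch below is an added illustration, not the notes' own method. It uses the standard extra-sum-of-squares statistic, F = [(SSE_reduced - SSE_full) / q] / [SSE_full / df_error_full], where q is the number of coefficients set to 0 under H0. The function name and every numeric value are hypothetical.

```python
def partial_f(sse_reduced, sse_full, q, df_error_full):
    """Extra-sum-of-squares F statistic for adding q terms to a reduced model.

    sse_reduced   : SSE of the model without the q terms (e.g. x1 x2 x3)
    sse_full      : SSE of the model with them (e.g. x1 x2 x3 x4 x5)
    df_error_full : error df of the full model, n - k - 1
    """
    extra_ss = sse_reduced - sse_full       # SSR(added terms | terms already in)
    mse_full = sse_full / df_error_full     # MSE of the full model
    return (extra_ss / q) / mse_full

# Hypothetical numbers: n = 40 cases, full model with 5 predictors.
n = 40
n_predictors_full = 5
df_error_full = n - n_predictors_full - 1

f_stat = partial_f(sse_reduced=250.0, sse_full=204.0, q=2,
                   df_error_full=df_error_full)
print("F =", f_stat, "on", 2, "and", df_error_full, "df")
```

The resulting value would be compared with an F distribution on q and n - k - 1 degrees of freedom, which should correspond to what the TEST statement in PROC REG reports for the same hypothesis.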