224 Note 19 for STAT 544 at PSU
These 20 pages of class notes were uploaded by an elite notetaker on Friday, February 6, 2015. The notes belong to a course at Pennsylvania State University taught by a professor in the fall. Since the upload, they have received 21 views.
Date Created: 02/06/15
STAT 544 LECTURE 19: MULTINOMIAL LOGISTIC REGRESSION MODELS

Polytomous responses

Logistic regression can be extended to handle responses that are polytomous, i.e. taking $r > 2$ categories. (Note: the word "polychotomous" is sometimes used, but this word does not exist!)

When analyzing a polytomous response, it's important to note whether the response is ordinal (consisting of ordered categories) or nominal (consisting of unordered categories). Some types of models are appropriate only for ordinal responses; other models may be used whether the response is ordinal or nominal.

If the response is ordinal, we do not necessarily have to take the ordering into account, but it often helps if we do. Using the natural ordering can

- lead to a simpler, more parsimonious model and
- increase power to detect relationships with other variables.

If the response variable is polytomous and all the potential predictors are discrete as well, we could describe the multiway contingency table by a loglinear model. But fitting a loglinear model has two disadvantages:

- It has many more parameters, and many of them are not of interest. The loglinear model describes the joint distribution of all the variables, whereas the logistic model describes only the conditional distribution of the response given the predictors.
- The loglinear model is more complicated to interpret. In the loglinear model, the effect of a predictor X on the response Y is described by the XY association. In a logit model, however, the effect of X on Y is a main effect.

If you are analyzing a set of categorical variables, and one of them is clearly a "response" while the others are predictors, I recommend that you use logistic rather than loglinear models.

Grouped versus ungrouped

Consider a medical study to investigate the long-term effects of radiation exposure on mortality. The response variable is

  1 if alive,
  2 if dead from a cause other than cancer,
  3 if dead from cancer other than leukemia,
  4 if dead from leukemia.

The main predictor of interest is level of exposure (low, medium, high). The data could arrive in ungrouped form, with one record per subject:

  low   4
  med   1
  med   2
  high  1

Or it could arrive in grouped form:

  Exposure   Y1   Y2   Y3   Y4
  low        22    7    5    0
  medium     18    6    7    3
  high       14   12    9    9

In ungrouped form, the response occupies a single column of the dataset, but in grouped form the response occupies $r$ columns. Most computer programs for polytomous logistic regression can handle grouped or ungrouped data.

Whether the data are grouped or ungrouped, we will imagine the response to be multinomial. That is, the response for row $i$,

$$y_i = (y_{i1}, y_{i2}, \ldots, y_{ir})^T,$$

is assumed to have a multinomial distribution with index $n_i = \sum_{j=1}^{r} y_{ij}$ and parameter

$$\pi_i = (\pi_{i1}, \pi_{i2}, \ldots, \pi_{ir})^T.$$

If the data are grouped, then $n_i$ is the total number of "trials" in the $i$th row of the dataset, and $y_{ij}$ is the number of trials in which outcome $j$ occurred. If the data are ungrouped, then $y_i$ has a 1 in the position corresponding to the outcome that occurred and 0's elsewhere, and $n_i = 1$. Note, however, that if the data are ungrouped, we do not have to actually create a dataset with columns of 0's and 1's; a single column containing the response level $1, 2, \ldots, r$ is sufficient.

Describing polytomous responses by a sequence of binary models

In some cases, it makes sense to "factor" the response into a sequence of binary choices and model them with a sequence of ordinary logistic models. For example, consider the study of the effects of radiation exposure on mortality. The four-level response can be modeled in three stages:

  Population
    Stage 1: Alive vs. Dead
      Stage 2 (among the dead): Noncancer vs. Cancer
        Stage 3 (among cancer deaths): Other cancer vs. Leukemia

The stage 1 model, which is fit to all subjects, describes the log-odds of death. The stage 2 model, which is fit only to the subjects that die, describes the log-odds of death due to cancer versus death from other causes. The stage 3 model, which is fit only to the subjects who die of cancer, describes the log-odds of death due to leukemia versus death due to other cancers.

Because the multinomial distribution can be factored into a sequence of conditional binomials, we can fit these three logistic models separately. The overall likelihood function factors into three independent likelihoods.

This approach is attractive when the response can be naturally arranged as a sequence of binary choices. But in situations where arranging such a sequence is unnatural, we should probably fit a single multinomial model to the entire response.

Baseline-category logit model

Suppose that $y_i = (y_{i1}, y_{i2}, \ldots, y_{ir})^T$ has a multinomial distribution with index $n_i = \sum_{j=1}^{r} y_{ij}$ and parameter $\pi_i = (\pi_{i1}, \pi_{i2}, \ldots, \pi_{ir})^T$. When the response categories $1, 2, \ldots, r$ are unordered, the most popular way to relate $\pi_i$ to covariates is through a set of $r - 1$ baseline-category logits. Taking $j^*$ as the baseline category, the model is

$$\log\left(\frac{\pi_{ij}}{\pi_{ij^*}}\right) = x_i^T \beta_j, \qquad j \neq j^*.$$

If $x_i$ has length $p$, then this model has $(r - 1) \times p$ free parameters, which we can arrange as a matrix or a vector. For example, if the last category is the baseline ($j^* = r$), the coefficients are

$$\beta = [\beta_1, \beta_2, \ldots, \beta_{r-1}] \quad \text{or} \quad \beta = \mathrm{vec}[\beta_1, \beta_2, \ldots, \beta_{r-1}] = (\beta_1^T, \beta_2^T, \ldots, \beta_{r-1}^T)^T.$$

Comments on this model:

- The $k$th element of $\beta_j$ can be interpreted as the increase in log-odds of falling into category $j$ versus category $j^*$ resulting from a one-unit increase in the $k$th covariate, holding the other covariates constant.
- Removing the $k$th covariate from the model is equivalent to simultaneously setting $r - 1$ coefficients to zero.
- Any of the categories can be chosen to be the baseline. The model will fit equally well, achieving the same likelihood and producing the same fitted values. Only the values and interpretation of the coefficients will change.
- To calculate $\pi_i$ from $\beta$, the back-transformation is

$$\pi_{ij} = \frac{\exp(x_i^T \beta_j)}{1 + \sum_{k \neq j^*} \exp(x_i^T \beta_k)}$$

  for the non-baseline categories $j \neq j^*$, and the baseline-category probability is

$$\pi_{ij^*} = \frac{1}{1 + \sum_{k \neq j^*} \exp(x_i^T \beta_k)}.$$

Model fitting

This model is not difficult to fit by Newton-Raphson or Fisher scoring. PROC LOGISTIC can do it.

Goodness of fit

If the estimated expected counts $\hat{\mu}_{ij} = n_i \hat{\pi}_{ij}$ are large enough, we can test the fit of our model versus a saturated model that estimates $\pi_i$ independently for $i = 1, \ldots, N$. The deviance for comparing this model to a saturated one is

$$G^2 = 2 \sum_{i=1}^{N} \sum_{j=1}^{r} y_{ij} \log\frac{y_{ij}}{\hat{\mu}_{ij}}.$$

The saturated model has $N(r - 1)$ free parameters and the current model has $p(r - 1)$, where $p$ is the length of $x_i$, so the degrees of freedom are

$$df = (N - p)(r - 1).$$

The corresponding Pearson statistic is

$$X^2 = \sum_{i=1}^{N} \sum_{j=1}^{r} r_{ij}^2,$$

where

$$r_{ij} = \frac{y_{ij} - \hat{\mu}_{ij}}{\sqrt{\hat{\mu}_{ij}}}$$

is the Pearson residual. If the model is true, both are approximately distributed as $\chi^2_{df}$, provided that

- no more than 20% of the $\hat{\mu}_{ij}$'s are below 5.0, and
- none are below 1.0.

In practice this is often not satisfied, so there may be no way to assess the overall fit of the model. However, we may still apply a $\chi^2$ approximation to $\Delta G^2$ and $\Delta X^2$ to compare nested models, provided that $(N - p)(r - 1)$ is large relative to $\Delta df$.

Overdispersion

Overdispersion means that the actual covariance matrix of $y_i$ exceeds that specified by the multinomial model,

$$V(y_i) = n_i \left[ \mathrm{Diag}(\pi_i) - \pi_i \pi_i^T \right].$$

It is reasonable to think that overdispersion is present if

- the data are grouped ($n_i$'s are greater than 1),
- $x_i$ already contains all covariates worth considering, and
- the overall $X^2$ is substantially larger than its degrees of freedom $(N - p)(r - 1)$.

In this situation, it may be worthwhile to introduce a scale parameter $\sigma^2$, so that

$$V(y_i) = n_i \sigma^2 \left[ \mathrm{Diag}(\pi_i) - \pi_i \pi_i^T \right].$$

The usual estimate for $\sigma^2$ is

$$\hat{\sigma}^2 = \frac{X^2}{(N - p)(r - 1)},$$

which is approximately unbiased if $(N - p)(r - 1)$ is large. Introducing a scale parameter does not alter the estimate of $\beta$ (which then becomes a quasilikelihood estimate), but it does alter our estimate of the variability of $\hat{\beta}$. If we estimate a scale parameter, we should

- multiply the estimated ML covariance matrix for $\hat{\beta}$ by $\hat{\sigma}^2$ (SAS does this automatically);
- divide the usual Pearson residuals by $\hat{\sigma}$; and
- divide the usual $X^2$, $G^2$, $\Delta X^2$ and $\Delta G^2$ statistics by $\hat{\sigma}^2$ (SAS reports these as "scaled" statistics).

These adjustments will have little practical effect unless the estimated scale parameter is substantially greater than 1.0 (say, 1.2 or higher).

Example

The table below, reported by Delany and Moore (1987), comes from a study of the primary food choices of alligators in four Florida lakes. Researchers classified the stomach contents of 219 captured alligators into five categories: Fish (the most common primary food choice), Invertebrate (snails, insects, crayfish, etc.), Reptile (turtles, alligators), Bird, and Other (amphibians, plants, household pets, stones, and other debris).

Let's describe these data by a baseline-category model, with Primary Food Choice as the outcome and Lake, Sex, and Size as covariates.

                               Primary Food Choice
  Lake       Sex   Size     Fish   Invert.   Rept.   Bird   Other
  Hancock    M     small       7       1       0       0       5
                   large       4       0       0       1       2
             F     small      16       3       2       2       3
                   large       3       0       1       2       3
  Oklawaha   M     small       2       2       0       0       1
                   large      13       7       6       0       0
             F     small       3       9       1       0       2
                   large       0       1       0       1       0
  Trafford   M     small       3       7       1       0       1
                   large       8       6       6       3       5
             F     small       2       4       1       1       4
                   large       0       1       0       0       0
  George     M     small      13      10       0       2       2
                   large       9       0       0       1       2
             F     small       3       9       1       0       1
                   large       8       1       0       0       1

Because the usual primary food choice of alligators appears to be fish, we'll use fish as the baseline category; the four logit equations will then describe the log-odds that alligators select other primary food types instead of fish.

Entering the data

When the data are grouped, as they are in this example, SAS expects the response categories $1, 2, \ldots, r$ to appear in a single column of the dataset, with another column containing the frequency or count. That is, the data should look like this:

  Hancock  male    small  fish     7
  Hancock  male    small  invert   1
  Hancock  male    small  reptile  0
  Hancock  male    small  bird     0
  Hancock  male    small  other    5
  Hancock  male    large  fish     4
  Hancock  male    large  invert   0
  Hancock  male    large  reptile  0
  Hancock  male    large  bird     1
  Hancock  male    large  other    2
  ... lines omitted ...
  George   female  large  bird     0
  George   female  large  other    1

The lines that have a frequency of zero are not actually used in the modeling, because they contribute nothing to the loglikelihood.
You can include them if you want to, but it's not necessary.

Specifying the model

When specifying the model, you need to tell SAS about the existence of a count or frequency variable (the freq statement below); otherwise SAS will assume that the data are ungrouped, with each line representing a single alligator. You also need to specify which of the categories is the baseline. The link function is glogit, for generalized logit. To get fit statistics, include the options aggregate and scale=none.

  options nocenter nodate nonumber linesize=72;
  data gator;
     input lake $ sex $ size $ food $ count;
     cards;
  Hancock male small fish 7
  Hancock male small invert 1
  Hancock male small reptile 0
  Hancock male small bird 0
  ... lines omitted ...
  George female large other 1
  ;
  proc logistic data=gator;
     freq count;
     class lake size sex / order=data param=ref ref=first;
     model food(ref='fish') = lake size sex / link=glogit
           aggregate scale=none;
  run;

Here is the output pertaining to the goodness of fit.

               Response Profile

      Ordered                      Total
        Value     food         Frequency
            1     bird                13
            2     fish                94
            3     invert              61
            4     other               32
            5     reptile             19

  Logits modeled use food='fish' as the reference category.

  NOTE: 24 observations having zero frequencies or weights were
        excluded since they do not contribute to the analysis.

       Deviance and Pearson Goodness-of-Fit Statistics

  Criterion     DF       Value    Value/DF    Pr > ChiSq
  Deviance      40     50.2637      1.2566        0.1282
  Pearson       40     52.5643      1.3141        0.0881

  Number of unique profiles: 16

There are N = 16 profiles (unique combinations of lake, sex and size) in this dataset. The saturated model, which fits a separate multinomial distribution to each profile, has 16 × 4 = 64 free parameters. The current model has an intercept, three lake coefficients, one sex coefficient and one size coefficient for each of the four logit equations, for a total of 24 parameters. Therefore, the overall fit statistics have 64 - 24 = 40 degrees of freedom.

Output pertaining to the significance of covariates:

       Testing Global Null Hypothesis: BETA=0

  Test                 Chi-Square    DF    Pr > ChiSq
  Likelihood Ratio        66.4974    20        <.0001
  Score                   59.4616    20        <.0001
  Wald                    51.2336    20        0.0001

            Type III Analysis of Effects

                          Wald
  Effect     DF     Chi-Square    Pr > ChiSq
  lake       12        36.2293        0.0003
  size        4        15.8873        0.0032
  sex         4         2.1850        0.7018

The first section (global null hypothesis) tests the fit of the current model against a null or intercept-only model. The null model has four parameters (one for each logit equation). Therefore the comparison has 24 - 4 = 20 degrees of freedom. This test is highly significant, indicating that at least one of the covariates has an effect on food choice.

The next section (Type III analysis of effects) shows the change in fit resulting from discarding any one of the covariates (lake, sex, or size) while keeping the others in the model. For example, consider the test for lake. Discarding lake is equivalent to setting three coefficients to zero in each of the four logit equations, so the test for lake has 3 × 4 = 12 degrees of freedom. Judging from these tests, we see that

- lake has an effect on food choice;
- size has an effect on food choice; and
- sex does not have a discernible effect.

This suggests that we should probably remove sex from the model. We also may want to look for interactions between lake and size.

Here are the estimated coefficients:

              Analysis of Maximum Likelihood Estimates

                                               Standard        Wald
  Parameter            food     DF  Estimate     Error   Chi-Square  Pr > ChiSq
  Intercept            bird      1   -2.4633    0.7739      10.1310      0.0015
  Intercept            invert    1   -2.0744    0.6116      11.5025      0.0007
  Intercept            other     1   -0.9167    0.4782       3.6755      0.0552
  Intercept            reptile   1   -2.9141    0.8856      10.8275      0.0010
  lake      Oklawaha   bird      1   -1.1256    1.1924       0.8912      0.3452
  lake      Oklawaha   invert    1    2.6937    0.6692      16.2000      <.0001
  lake      Oklawaha   other     1   -0.7405    0.7422       0.9956      0.3184
  lake      Oklawaha   reptile   1    1.4008    0.8105       2.9872      0.0839
  lake      Trafford   bird      1    0.6617    0.8461       0.6117      0.4341
  lake      Trafford   invert    1    2.9363    0.6874      18.2469      <.0001
  lake      Trafford   other     1    0.7912    0.5879       1.8109      0.1784
  lake      Trafford   reptile   1    1.9316    0.8253       5.4775      0.0193
  lake      George     bird      1   -0.5753    0.7952       0.5233      0.4694
  lake      George     invert    1    1.7805    0.6232       8.1623      0.0043
  lake      George     other     1   -0.7666    0.5686       1.8179      0.1776
  lake      George     reptile   1   -1.1287    1.1925       0.8959      0.3439
  size      large      bird      1    0.7302    0.6523       1.2533      0.2629
  size      large      invert    1   -1.3363    0.4112      10.5606      0.0012
  size      large      other     1   -0.2906    0.4599       0.3992      0.5275
  size      large      reptile   1    0.5570    0.6466       0.7421      0.3890
  sex       female     bird      1    0.6064    0.6888       0.7750      0.3787
  sex       female     invert    1    0.4630    0.3955       1.3701      0.2418
  sex       female     other     1    0.2526    0.4663       0.2933      0.5881
  sex       female     reptile   1    0.6275    0.6852       0.8387      0.3598

                       Odds Ratio Estimates

                                          Point        95% Wald
  Effect                       food    Estimate   Confidence Limits
  lake Oklawaha vs Hancock     bird       0.324     0.031     3.358
  lake Oklawaha vs Hancock     invert    14.786     3.983    54.893
  lake Oklawaha vs Hancock     other      0.477     0.111     2.042
  lake Oklawaha vs Hancock     reptile    4.058     0.829    19.872
  lake Trafford vs Hancock     bird       1.938     0.369    10.176
  lake Trafford vs Hancock     invert    18.846     4.899    72.500
  lake Trafford vs Hancock     other      2.206     0.697     6.983
  lake Trafford vs Hancock     reptile    6.900     1.369    34.784
  lake George vs Hancock       bird       0.563     0.118     2.673
  lake George vs Hancock       invert     5.933     1.749    20.125
  lake George vs Hancock       other      0.465     0.152     1.416
  lake George vs Hancock       reptile    0.323     0.031     3.349
  size large vs small          bird       2.076     0.578     7.454
  size large vs small          invert     0.263     0.117     0.588
  size large vs small          other      0.748     0.304     1.842
  size large vs small          reptile    1.745     0.492     6.198
  sex female vs male           bird       1.834     0.475     7.075
  sex female vs male           invert     1.589     0.732     3.449
  sex female vs male           other      1.287     0.516     3.211
  sex female vs male           reptile    1.873     0.489     7.175

How do we interpret them? Recall that there are four logit equations to predict the log-odds of

- birds versus fish,
- invertebrates versus fish,
- other versus fish, and
- reptiles versus fish.

The intercepts give the estimated log-odds for the reference group lake=Hancock, size=small, sex=male. For example, the estimated log-odds of birds versus fish in this group is -2.4633; the estimated log-odds of invertebrates versus fish is -2.0744; and so on.

The lake effect is characterized by three dummy coefficients in each of the four logit equations. The estimated coefficient for the Lake Oklawaha dummy in the bird-versus-fish equation is -1.1256. This means that alligators in Lake Oklawaha are less likely to choose birds over fish than their colleagues in Lake Hancock are. In other words, birds appear to be a less common choice, relative to fish, in Lake Oklawaha than in Lake Hancock. The estimated odds ratio of exp(-1.1256) = 0.32 is the same for alligators of all sexes and sizes, because this is a model with main effects but no interactions.
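The interpretation above can be checked numerically. The following is a minimal sketch in Python (not part of the original SAS analysis), assuming the intercepts and Lake Oklawaha coefficients are read from the output as -2.4633, -2.0744, -0.9167, -2.9141 and -1.1256, 2.6937, -0.7405, 1.4008. The function name `category_probabilities` is a hypothetical helper introduced only for illustration. It applies the back-transformation from baseline-category logits to probabilities and confirms that the bird-versus-fish odds ratio between the two lakes equals exp(-1.1256).

```python
import math

# Intercepts of the four logit equations (each category versus fish), as read
# from the fitted-model output. Reference group: Hancock, small, male.
intercepts = {"bird": -2.4633, "invert": -2.0744, "other": -0.9167, "reptile": -2.9141}

# Lake Oklawaha dummy coefficients in the same four equations.
oklawaha = {"bird": -1.1256, "invert": 2.6937, "other": -0.7405, "reptile": 1.4008}


def category_probabilities(linear_predictors, baseline="fish"):
    """Back-transform baseline-category logits into category probabilities.

    Hypothetical helper: linear_predictors maps each non-baseline category
    to its value of x' beta_j; the baseline gets probability 1 / denominator.
    """
    denom = 1.0 + sum(math.exp(eta) for eta in linear_predictors.values())
    probs = {cat: math.exp(eta) / denom for cat, eta in linear_predictors.items()}
    probs[baseline] = 1.0 / denom
    return probs


# Odds ratio for bird versus fish, Oklawaha versus Hancock.
odds_ratio_bird = math.exp(oklawaha["bird"])

# Fitted probabilities for small male alligators in each of the two lakes.
p_hancock = category_probabilities(intercepts)
p_oklawaha = category_probabilities(
    {cat: intercepts[cat] + oklawaha[cat] for cat in intercepts}
)

print(round(odds_ratio_bird, 2))  # 0.32
for cat in ["fish", "invert", "reptile", "bird", "other"]:
    print(cat, round(p_hancock[cat], 3), round(p_oklawaha[cat], 3))
```

Because the model has no interactions, adding the size or sex coefficients to both linear predictors would change the fitted probabilities but leave the ratio of bird-versus-fish odds between the two lakes unchanged.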