
# Categorical Data Analysis I STAT 544

Penn State


This 0 page Class Notes was uploaded by Hilbert Denesik on Sunday November 1, 2015. The Class Notes belongs to STAT 544 at Pennsylvania State University taught by Staff in Fall. Since its upload, it has received 189 views. For similar materials see /class/233142/stat-544-pennsylvania-state-university in Statistics at Pennsylvania State University.


Date Created: 11/01/15

## STAT 544 Lecture 19: Multinomial Logistic Regression Models

### Polytomous responses

Logistic regression can be extended to handle responses that are *polytomous*, i.e. taking $r > 2$ categories. (Note: the word "polychotomous" is sometimes used, but this word does not exist!)

When analyzing a polytomous response, it's important to note whether the response is *ordinal* (consisting of ordered categories) or *nominal* (consisting of unordered categories). Some types of models are appropriate only for ordinal responses; other models may be used whether the response is ordinal or nominal.

If the response is ordinal, we do not necessarily have to take the ordering into account, but it often helps if we do. Using the natural ordering can

- lead to a simpler, more parsimonious model, and
- increase power to detect relationships with other variables.

If the response variable is polytomous and all the potential predictors are discrete as well, we could describe the multiway contingency table by a loglinear model. But fitting a loglinear model has two disadvantages:

- It has many more parameters, and many of them are not of interest. The loglinear model describes the joint distribution of all the variables, whereas the logistic model describes only the conditional distribution of the response given the predictors.
- The loglinear model is more complicated to interpret. In the loglinear model, the effect of a predictor $X$ on the response $Y$ is described by the $XY$ association. In a logit model, however, the effect of $X$ on $Y$ is a main effect.

If you are analyzing a set of categorical variables, and one of them is clearly a "response" while the others are predictors, I recommend that you use logistic rather than loglinear models.

### Grouped versus ungrouped

Consider a medical study to investigate the long-term effects of radiation exposure on mortality. The response variable is

$$
Y = \begin{cases}
1 & \text{if alive,} \\
2 & \text{if dead from cause other than cancer,} \\
3 & \text{if dead from cancer other than leukemia,} \\
4 & \text{if dead from leukemia.}
\end{cases}
$$

The main predictor of interest is level of exposure (low, medium, high). The data could arrive in ungrouped form, with one record per subject:

| Exposure | Y |
|----------|---|
| low      | 4 |
| med      | 1 |
| med      | 2 |
| high     | 1 |
| ...      | ... |

Or it could arrive in grouped form:

| Exposure | Y = 1 | Y = 2 | Y = 3 | Y = 4 |
|----------|-------|-------|-------|-------|
| low      | 22    | 7     | 5     | 0     |
| medium   | 18    | 6     | 7     | 3     |
| high     | 14    | 12    | 9     | 9     |

In ungrouped form, the response occupies a single column of the dataset, but in grouped form the response occupies $r$ columns. Most computer programs for polytomous logistic regression can handle grouped or ungrouped data.

Whether the data are grouped or ungrouped, we will imagine the response to be multinomial. That is, the response for row $i$,

$$
y_i = (y_{i1}, y_{i2}, \ldots, y_{ir})^T,
$$

is assumed to have a multinomial distribution with index $n_i = \sum_{j=1}^{r} y_{ij}$ and parameter

$$
\pi_i = (\pi_{i1}, \pi_{i2}, \ldots, \pi_{ir})^T.
$$

If the data are grouped, then $n_i$ is the total number of "trials" in the $i$th row of the dataset, and $y_{ij}$ is the number of trials in which outcome $j$ occurred. If the data are ungrouped, then $y_i$ has a 1 in the position corresponding to the outcome that occurred and 0's elsewhere, and $n_i = 1$. Note, however, that if the data are ungrouped, we do not have to actually create a dataset with columns of 0's and 1's; a single column containing the response level $1, 2, \ldots, r$ is sufficient.

### Describing polytomous responses by a sequence of binary models

In some cases, it makes sense to "factor" the response into a sequence of binary choices and model them with a sequence of ordinary logistic models. For example, consider the study of the effects of radiation exposure on mortality. The four-level response can be modeled in three stages:

- Stage 1 (whole population): alive versus dead
- Stage 2 (those who die): non-cancer versus cancer
- Stage 3 (those who die of cancer): other cancer versus leukemia

The stage 1 model, which is fit to all subjects, describes the log-odds of death. The stage 2 model, which is fit only to the subjects that die, describes the log-odds of death due to cancer versus death from other causes. The stage 3 model, which is fit only to the subjects who die of cancer, describes the log-odds of death due to leukemia versus death due to other cancers.

Because the multinomial distribution can be factored into a sequence of conditional binomials, we can fit these three logistic models separately. The overall likelihood function factors into three independent likelihoods.

This approach is attractive when the response can be naturally arranged as a sequence of binary choices. But in situations where arranging such a sequence is unnatural, we should probably fit a single multinomial model to the entire response.

### Baseline-category logit model

Suppose that $y_i = (y_{i1}, y_{i2}, \ldots, y_{ir})^T$ has a multinomial distribution with index $n_i = \sum_{j=1}^{r} y_{ij}$ and parameter $\pi_i = (\pi_{i1}, \pi_{i2}, \ldots, \pi_{ir})^T$.

When the response categories $1, 2, \ldots, r$ are unordered, the most popular way to relate $\pi_i$ to covariates is through a set of $r - 1$ baseline-category logits. Taking $j^*$ as the baseline category, the model is

$$
\log\left(\frac{\pi_{ij}}{\pi_{ij^*}}\right) = x_i^T \beta_j, \qquad j \neq j^*.
$$

If $x_i$ has length $p$, then this model has $(r - 1) \times p$ free parameters, which we can arrange as a matrix or a vector. For example, if the last category is the baseline ($j^* = r$), the coefficients are

$$
\beta = [\beta_1, \beta_2, \ldots, \beta_{r-1}]
\qquad \text{or} \qquad
\mathrm{vec}(\beta) = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_{r-1} \end{bmatrix}.
$$

Comments on this model:

- The $k$th element of $\beta_j$ can be interpreted as the increase in log-odds of falling into category $j$ versus category $j^*$ resulting from a one-unit increase in the $k$th covariate, holding the other covariates constant.
- Removing the $k$th covariate from the model is equivalent to simultaneously setting $r - 1$ coefficients to zero.
- Any of the categories can be chosen to be the baseline. The model will fit equally well, achieving the same likelihood and producing the same fitted values. Only the values and interpretation of the coefficients will change.
- To calculate $\pi_i$ from $\beta$, the back-transformation is

$$
\pi_{ij} = \frac{\exp(x_i^T \beta_j)}{1 + \sum_{k \neq j^*} \exp(x_i^T \beta_k)}
$$

  for the non-baseline categories $j \neq j^*$, and the baseline-category probability is

$$
\pi_{ij^*} = \frac{1}{1 + \sum_{k \neq j^*} \exp(x_i^T \beta_k)}.
$$

### Model fitting

This model is not difficult to fit by Newton-Raphson or Fisher scoring. PROC LOGISTIC can do it.

### Goodness of fit

If the estimated expected counts $\hat{\mu}_{ij} = n_i \hat{\pi}_{ij}$ are large enough, we can test the fit of our model versus a saturated model that estimates $\pi_i$ independently for $i = 1, \ldots, N$. The deviance for comparing this model to a saturated one is

$$
G^2 = 2 \sum_{i=1}^{N} \sum_{j=1}^{r} y_{ij} \log \frac{y_{ij}}{\hat{\mu}_{ij}}.
$$

The saturated model has $N(r - 1)$ free parameters and the current model has $p(r - 1)$, where $p$ is the length of $x_i$, so the degrees of freedom are

$$
\mathrm{df} = (N - p)(r - 1).
$$

The corresponding Pearson statistic is

$$
X^2 = \sum_{i=1}^{N} \sum_{j=1}^{r} r_{ij}^2,
\qquad \text{where} \qquad
r_{ij} = \frac{y_{ij} - \hat{\mu}_{ij}}{\sqrt{\hat{\mu}_{ij}}}
$$

is the Pearson residual. If the model is true, both are approximately distributed as $\chi^2_{\mathrm{df}}$, provided that

- no more than 20% of the $\hat{\mu}_{ij}$'s are below 5.0, and
- none are below 1.0.

In practice this is often not satisfied, so there may be no way to assess the overall fit of the model. However, we may still apply a $\chi^2$ approximation to $\Delta G^2$ and $\Delta X^2$ to compare nested models, provided that $(N - p)(r - 1)$ is large relative to $\Delta\mathrm{df}$.

### Overdispersion

Overdispersion means that the actual covariance matrix of $y_i$ exceeds that specified by the multinomial model,

$$
V(y_i) = n_i \left[ \mathrm{Diag}(\pi_i) - \pi_i \pi_i^T \right].
$$

It is reasonable to think that overdispersion is present if

- the data are grouped (the $n_i$'s are greater than 1),
- $x_i$ already contains all covariates worth considering, and
- the overall $X^2$ is substantially larger than its degrees of freedom $(N - p)(r - 1)$.

In this situation, it may be worthwhile to introduce a scale parameter $\sigma^2$, so that

$$
V(y_i) = n_i \sigma^2 \left[ \mathrm{Diag}(\pi_i) - \pi_i \pi_i^T \right].
$$

The usual estimate for $\sigma^2$ is

$$
\hat{\sigma}^2 = \frac{X^2}{(N - p)(r - 1)},
$$

which is approximately unbiased if $(N - p)(r - 1)$ is large. Introducing a scale parameter does not alter the estimate of $\beta$ (which then becomes a quasilikelihood estimate), but it does alter our estimate of the variability of $\hat{\beta}$. If we estimate a scale parameter, we should

- multiply the estimated ML covariance matrix for $\hat{\beta}$ by $\hat{\sigma}^2$ (SAS does this automatically),
- divide the usual Pearson residuals by $\hat{\sigma}$, and
- divide the usual $X^2$, $G^2$, $\Delta X^2$ and $\Delta G^2$ statistics by $\hat{\sigma}^2$ (SAS reports these as "scaled" statistics).

These
adjustments will have little practical effect unless the estimated scale parameter is substantially greater than 1.0 (say, 1.2 or higher).

### Example

The table below, reported by Delany and Moore (1987), comes from a study of the primary food choices of alligators in four Florida lakes. Researchers classified the stomach contents of 219 captured alligators into five categories: Fish (the most common primary food choice), Invertebrate (snails, insects, crayfish, etc.), Reptile (turtles, alligators), Bird, and Other (amphibians, plants, household pets, stones, and other debris).

Let's describe these data by a baseline-category model, with Primary Food Choice as the outcome and Lake, Sex, and Size as covariates.

| Lake     | Sex | Size  | Fish | Invert. | Rept. | Bird | Other |
|----------|-----|-------|------|---------|-------|------|-------|
| Hancock  | M   | small | 7    | 1       | 0     | 0    | 5     |
| Hancock  | M   | large | 4    | 0       | 0     | 1    | 2     |
| Hancock  | F   | small | 16   | 3       | 2     | 2    | 3     |
| Hancock  | F   | large | 3    | 0       | 1     | 2    | 3     |
| Oklawaha | M   | small | 2    | 2       | 0     | 0    | 1     |
| Oklawaha | M   | large | 13   | 7       | 6     | 0    | 0     |
| Oklawaha | F   | small | 3    | 9       | 1     | 0    | 2     |
| Oklawaha | F   | large | 0    | 1       | 0     | 1    | 0     |
| Trafford | M   | small | 3    | 7       | 1     | 0    | 1     |
| Trafford | M   | large | 8    | 6       | 6     | 3    | 5     |
| Trafford | F   | small | 2    | 4       | 1     | 1    | 4     |
| Trafford | F   | large | 0    | 1       | 0     | 0    | 0     |
| George   | M   | small | 13   | 10      | 0     | 2    | 2     |
| George   | M   | large | 9    | 0       | 0     | 1    | 2     |
| George   | F   | small | 3    | 9       | 1     | 0    | 1     |
| George   | F   | large | 8    | 1       | 0     | 0    | 1     |

Because the usual primary food choice of alligators appears to be fish, we'll use fish as the baseline category; the four logit equations will then describe the log-odds that alligators select other primary food types instead of fish.

### Entering the data

When the data are grouped, as they are in this example, SAS expects the response categories $1, 2, \ldots, r$ to appear in a single column of the dataset, with another column containing the frequency or count. That is, the data should look like this:

    Hancock   male     small   fish      7
    Hancock   male     small   invert    1
    Hancock   male     small   reptile   0
    Hancock   male     small   bird      0
    Hancock   male     small   other     5
    Hancock   male     large   fish      4
    Hancock   male     large   invert    0
    Hancock   male     large   reptile   0
    Hancock   male     large   bird      1
    Hancock   male     large   other     2
       ... lines omitted ...
    George    female   large   bird      0
    George    female   large   other     1

The lines that have a frequency of zero are not actually used in the modeling, because they contribute nothing to the loglikelihood. You can include them if you want to, but it's not necessary.

### Specifying the model

In the PROC LOGISTIC step, you need to tell SAS about the existence of a count or frequency variable (the `freq` statement); otherwise SAS will assume that the data are ungrouped, with each line representing a single alligator. You also need to specify which of the categories is the baseline. The link function is `glogit`, for generalized logit. To get fit statistics, include the options `aggregate` and `scale=none`.

    options nocenter nodate nonumber linesize=72;
    data gator;
       input lake $ sex $ size $ food $ count;
       cards;
    Hancock male small fish 7
    Hancock male small invert 1
    Hancock male small reptile 0
    Hancock male small bird 0
       ... lines omitted ...
    George female large other 1
    ;
    proc logist data=gator;
       freq count;
       class lake size sex / order=data param=ref ref=first;
       model food(ref='fish') = lake size sex / link=glogit
          aggregate scale=none;
    run;

Here is the output pertaining to the goodness of fit.

Response Profile

| Ordered Value | food    | Total Frequency |
|---------------|---------|-----------------|
| 1             | bird    | 13              |
| 2             | fish    | 94              |
| 3             | invert  | 61              |
| 4             | other   | 32              |
| 5             | reptile | 19              |

Logits modeled use food='fish' as the reference category.

NOTE: 24 observations having zero frequencies or weights were excluded since they do not contribute to the analysis.

Deviance and Pearson Goodness-of-Fit Statistics

| Criterion | DF | Value   | Value/DF | Pr > ChiSq |
|-----------|----|---------|----------|------------|
| Deviance  | 40 | 50.2637 | 1.2566   | 0.1282     |
| Pearson   | 40 | 52.5643 | 1.3141   | 0.0881     |

Number of unique profiles: 16

There are $N = 16$ profiles (unique combinations of lake, sex and size) in this dataset. The saturated model, which fits a separate multinomial distribution to each profile, has $16 \times 4 = 64$ free parameters. The current model has an intercept, three lake coefficients, one sex coefficient and one size coefficient for each of the four logit equations, for a total of 24 parameters. Therefore, the overall fit statistics have $64 - 24 = 40$ degrees of freedom.

Output pertaining to the significance of covariates:

Testing Global Null Hypothesis: BETA=0

| Test             | Chi-Square | DF | Pr > ChiSq |
|------------------|------------|----|------------|
| Likelihood Ratio | 66.4974    | 20 | <.0001     |
| Score            | 59.4616    | 20 | <.0001     |
| Wald             | 51.2336    | 20 | 0.0001     |

Type III Analysis of Effects

| Effect | DF | Wald Chi-Square | Pr > ChiSq |
|--------|----|-----------------|------------|
| lake   | 12 | 36.2293         | 0.0003     |
| size   | 4  | 15.8873         | 0.0032     |
| sex    | 4  | 2.1850          | 0.7018     |

The first section (global null hypothesis) tests the fit of the current model against a null or intercept-only model. The null model has four parameters (one for each logit equation). Therefore the comparison has $24 - 4 = 20$ degrees of freedom. This test is highly significant, indicating that at least one of the covariates has an effect on food choice.

The next section (Type III analysis of effects) shows the change in fit resulting from discarding any one of the covariates (lake, sex or size) while keeping the others in the model. For example, consider the test for lake. Discarding lake is equivalent to setting three coefficients to zero in each of the four logit equations, so the test for lake has $3 \times 4 = 12$ degrees of freedom. Judging from these tests, we see that

- lake has an effect on food choice,
- size has an effect on food choice, and
- sex does not have a discernible effect.

This suggests that we should probably remove sex from the model. We also may want to look for interactions between lake and size.

Here are the estimated coefficients.

Analysis of Maximum Likelihood Estimates

| Parameter | Level    | food    | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|-----------|----------|---------|----|----------|----------------|-----------------|------------|
| Intercept |          | bird    | 1  | -2.4633  | 0.7739         | 10.1310         | 0.0015     |
| Intercept |          | invert  | 1  | -2.0744  | 0.6116         | 11.5025         |            |
| Intercept |          | other   | 1  | -0.9167  | 0.4782         | 3.6755          |            |
| Intercept |          | reptile | 1  | -2.9141  | 0.8856         | 10.8275         |            |
| lake      | Oklawaha | bird    | 1  | -1.1256  | 1.1924         | 0.8912          |            |
| lake      | Oklawaha | invert  | 1  | 2.6937   | 0.6692         | 16.2000         | <.0001     |
| lake      | Oklawaha | other   | 1  | -0.7405  | 0.7422         | 0.9956          |            |
| lake      | Oklawaha | reptile | 1  | 1.4008   | 0.8105         | 2.9872          |            |
| lake      | Trafford | bird    | 1  | 0.6617   | 0.8461         | 0.6117          |            |
| lake      | Trafford | invert  | 1  | 2.9363   | 0.6874         | 18.2469         | <.0001     |
| lake      | Trafford | other   | 1  | 0.7912   | 0.5879         | 1.8109          |            |
| lake      | Trafford | reptile | 1  | 1.9316   | 0.8253         | 5.4775          |            |
| lake      | George   | bird    | 1  | -0.5753  | 0.7952         | 0.5233          |            |
| lake      | George   | invert  | 1  | 1.7805   | 0.6232         | 8.1623          |            |
| lake      | George   | other   | 1  | -0.7666  | 0.5686         | 1.8179          |            |
| lake      | George   | reptile | 1  | -1.1287  | 1.1925         | 0.8959          |            |
| size      | large    | bird    | 1  | 0.7302   | 0.6523         | 1.2533          |            |
| size      | large    | invert  | 1  | -1.3363  | 0.4112         | 10.5606         |            |
| size      | large    | other   | 1  | -0.2906  | 0.4599         | 0.3992          | 0.5275     |
| size      | large    | reptile | 1  | 0.5570   | 0.6466         | 0.7421          | 0.3890     |
| sex       | female   | bird    | 1  | 0.6064   | 0.6888         | 0.7750          | 0.3787     |
| sex       | female   | invert  | 1  | 0.4630   | 0.3955         | 1.3701          | 0.2418     |
| sex       | female   | other   | 1  | 0.2526   | 0.4663         | 0.2933          | 0.5881     |
| sex       | female   | reptile | 1  | 0.6275   | 0.6852         | 0.8387          | 0.3598     |

Odds Ratio Estimates

| Effect                   | food    | Point Estimate | 95% Wald Lower | 95% Wald Upper |
|--------------------------|---------|----------------|----------------|----------------|
| lake Oklawaha vs Hancock | bird    | 0.324          | 0.031          | 3.358          |
| lake Oklawaha vs Hancock | invert  | 14.786         | 3.983          | 54.895         |
| lake Oklawaha vs Hancock | other   | 0.477          | 0.111          | 2.042          |
| lake Oklawaha vs Hancock | reptile | 4.058          | 0.829          | 19.872         |
| lake Trafford vs Hancock | bird    | 1.938          | 0.369          | 10.176         |
| lake Trafford vs Hancock | invert  | 18.846         | 4.899          | 72.500         |
| lake Trafford vs Hancock | other   | 2.206          | 0.697          | 6.983          |
| lake Trafford vs Hancock | reptile | 6.900          | 1.369          | 34.784         |
| lake George vs Hancock   | bird    | 0.563          | 0.118          | 2.673          |
| lake George vs Hancock   | invert  | 5.933          | 1.749          | 20.125         |
| lake George vs Hancock   | other   | 0.465          | 0.152          | 1.416          |
| lake George vs Hancock   | reptile | 0.323          | 0.031          | 3.349          |
| size large vs small      | bird    | 2.076          | 0.578          | 7.454          |
| size large vs small      | invert  | 0.263          | 0.117          | 0.588          |
| size large vs small      | other   | 0.748          | 0.304          | 1.842          |
| size large vs small      | reptile | 1.745          | 0.492          | 6.198          |
| sex female vs male       | bird    | 1.834          | 0.475          | 7.075          |
| sex female vs male       | invert  | 1.589          | 0.732          | 3.449          |
| sex female vs male       | other   | 1.287          | 0.516          | 3.211          |
| sex female vs male       | reptile | 1.873          | 0.489          | 7.175          |

How do we interpret them? Recall that there are four logit equations to predict the log-odds of

- birds versus fish,
- invertebrates versus fish,
- other versus fish, and
- reptiles versus fish.

The intercepts give the estimated log-odds for the reference group lake=Hancock, size=small, sex=male. For example, the estimated log-odds of birds versus fish in this group is $-2.4633$; the estimated log-odds of invertebrates versus fish is $-2.0744$; and so on.

The lake effect is characterized by three dummy coefficients in each of the four logit equations. The estimated coefficient for the Lake Oklawaha dummy in the bird-versus-fish equation is $-1.1256$. This means that alligators in Lake Oklawaha are less likely to choose birds over fish than their colleagues in Lake Hancock are. In other words, relative to fish, birds appear to be a less common food choice in Lake Oklawaha than in Lake Hancock. The estimated odds ratio of $\exp(-1.1256) = 0.32$ is the same for alligators of all sexes and sizes, because this is a model with main effects but no interactions.
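To make the interpretation of the intercepts and the odds ratio concrete, here is a minimal Python sketch of the back-transformation from baseline-category logits to probabilities. It is an illustration only, not part of the SAS analysis: the function name `baseline_logit_probs` and the variable names are hypothetical, and the numbers are the four intercepts and the Oklawaha bird-versus-fish coefficient reported in the fitted model.

```python
import math

def baseline_logit_probs(etas, baseline):
    """Back-transform baseline-category logits to probabilities.

    etas maps each non-baseline category to its linear predictor x'beta_j;
    the baseline category implicitly has linear predictor 0.
    """
    denom = 1.0 + sum(math.exp(eta) for eta in etas.values())
    probs = {cat: math.exp(eta) / denom for cat, eta in etas.items()}
    probs[baseline] = 1.0 / denom
    return probs

# Intercepts of the four logit equations: these are the log-odds versus fish
# for the reference group (lake=Hancock, size=small, sex=male).
intercepts = {
    "bird": -2.4633,
    "invert": -2.0744,
    "other": -0.9167,
    "reptile": -2.9141,
}
ref_probs = baseline_logit_probs(intercepts, baseline="fish")

# The Oklawaha dummy shifts the bird-versus-fish logit by -1.1256, so the
# odds of bird versus fish are multiplied by exp(-1.1256).
odds_ratio_bird = math.exp(-1.1256)

for cat, prob in ref_probs.items():
    print(f"{cat:8s} {prob:.3f}")
print(f"bird vs fish odds ratio, Oklawaha vs Hancock: {odds_ratio_bird:.2f}")
```

The same function gives fitted probabilities for any other profile once the relevant lake, size and sex coefficients are added to each intercept.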

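The goodness-of-fit formulas from earlier in the lecture can also be checked by hand on a small table. The sketch below applies them to the grouped radiation-exposure counts, under the assumption of an intercept-only model (so $p = 1$ and every exposure level shares one probability vector, estimated by pooling the rows). That model is chosen here purely for illustration and is not one fitted in the lecture; the variable names are hypothetical. Several expected counts in this small table fall below 5, so the $\chi^2$ approximation would be doubtful, and the point is only the mechanics of $G^2$, $X^2$ and the degrees of freedom.

```python
import math

# Grouped counts from the radiation example: one row per exposure level,
# one column per response category Y = 1, 2, 3, 4.
counts = {
    "low": [22, 7, 5, 0],
    "medium": [18, 6, 7, 3],
    "high": [14, 12, 9, 9],
}

n_rows = len(counts)                         # N, number of rows (profiles)
n_cats = len(next(iter(counts.values())))    # r, number of response categories
row_totals = {level: sum(row) for level, row in counts.items()}  # n_i
grand_total = sum(row_totals.values())

# Intercept-only model (an assumption for this sketch): a single probability
# vector shared by all rows, estimated by the pooled column proportions.
pi_hat = [
    sum(row[j] for row in counts.values()) / grand_total
    for j in range(n_cats)
]

deviance = 0.0
pearson = 0.0
for level, row in counts.items():
    for j, y in enumerate(row):
        mu = row_totals[level] * pi_hat[j]   # estimated expected count
        if y > 0:                            # cells with y = 0 add nothing to G^2
            deviance += 2.0 * y * math.log(y / mu)
        pearson += (y - mu) ** 2 / mu

p = 1                                        # length of x_i (intercept only)
df = (n_rows - p) * (n_cats - 1)

print(f"G2 = {deviance:.2f}, X2 = {pearson:.2f}, df = {df}")
```

With $N = 3$ rows, $r = 4$ categories and $p = 1$, the degrees of freedom are $(3 - 1)(4 - 1) = 6$, matching the general formula $(N - p)(r - 1)$.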