# Note 4 for PUBHLTH 640 at UMass

This 78 page Class Notes was uploaded by an elite notetaker on Friday February 6, 2015. The Class Notes belongs to a course at University of Massachusetts taught by a professor in Fall. Since its upload, it has received 17 views.

Date Created: 02/06/15

PubHlth 640 - 4. Categorical Data Analysis

## Unit 4. Categorical Data Analysis

### Topics

1. Lesson Overview and Example
2. Other Examples of Categorical Data
3. Hypotheses of Independence, No Association, Homogeneity
4. The Chi Square Test of No Association in an R x C Table
5. Rejection of Independence: The Chi Square Residual
6. Confidence Interval Estimation of RR and OR
7. Strategies for Controlling Confounding
8. Standardization of Rates
   - A. Indirect Standardization (SMR)
   - B. Direct Standardization (SRR)
9. Stratified Analysis of Rates
   - A. Woolf Test of Homogeneity of Odds Ratios
   - B. Mantel-Haenszel Test of No Association
10. Factors Associated with Mammography Screening
11. The Chi Square Goodness-of-Fit Test

### Appendices

- A. The Chi Square Distribution
- B. Probability Models for the 2x2 Table
- C. Concepts of Observed and Expected
- D. Review: Measures of Association in a 2x2 Table
- E. Review: Confounding of Rates
- F. Computer Resources

### 1. Lesson Overview and Example

In Unit 3 (Discrete Distributions), the Bernoulli, binomial, Poisson, and central hypergeometric distributions were introduced. Fisher's exact test for association was also introduced.

- The setting was the single 2x2 table of count data.
- The focus was the null hypothesis of no association (e.g., no association of exposure with disease).

In Unit 4 (Categorical Data Analysis), we consider extensions of the 2x2 table of count data, e.g.:

- An r x c table, where r = the number of rows and c = the number of columns, and where r ≥ 2 and c ≥ 2
- Multiple 2x2 tables

You will learn the following:

- The chi square test for the null hypothesis of no association in an r x c table
- Strategies for the analysis of multiple 2x2 tables and, in particular, stratified analysis of association, including Mantel-Haenszel methods and standardization of rates

The focus is contingency table approaches for the analysis of proportions and rates. The methods have important advantages
including (1) they are model free and (2) they allow us to see the data in ways that are not possible in regression. Regression approaches for analyses of proportions and rates (e.g., logistic regression, Poisson regression, Cox proportional hazards regression) are introduced in Unit 5 (Logistic Regression) and Unit 6 (Introduction to Survival Analysis).

#### Data Example

Source: Fisher LD and Van Belle G. Biostatistics: A Methodology for the Health Sciences. New York: John Wiley, 1993.

It is of interest to explore the question of a relationship between coffee consumption and cardiovascular risk. However, the issue is made more difficult because many coffee drinkers are also smokers, and smoking is itself a risk factor for heart disease. The goal is to estimate the nature and strength of a coffee-MI relationship that is independent of the role of smoking.

For each category of smoking, the following summarizes the proportion of cases of MI among low coffee drinkers (left bar) and among high coffee drinkers (right bar).

[Figure: bar charts of the proportion of MI cases by coffee consumption, one panel per smoking category: Never Smoked, Former Smoker, ..., 45 cigarettes/day. Some rows omitted.]

- Among never smokers, the data suggest a positive coffee-MI relationship.
- Among former smokers, the coffee-MI association is less strong.
- Among frequent smokers, there is no longer evidence of a coffee-MI association.

### 2. Other Examples of Categorical Data

In Introduction to Biostatistics (PubHlth 540), Unit 8 (Chi Square Tests), contingency tables were introduced. These are summaries of categorical data. Various epidemiological study designs give rise to categorical data. Here are some other examples.

#### Example: Single group intervention

Does minoxidil show promise for the treatment of hair loss?

1. N = 13 volunteers
2. Administer minoxidil
3. Wait 6 months
4. Count occurrences of new hair growth. Call this X.

Suppose we observe X = 12.

Possible values of X = count of
occurrences of new hair growth are 0, 1, 2, ..., 13. Thus,

**IF**

1. p = probability[new hair growth] for all 13 volunteers, and
2. Outcomes for each of the 13 volunteers are independent

**THEN** X is distributed Binomial(N = 13, p).

The Binomial is an example of a categorical distribution.

#### Example: Two group randomized controlled trial

In controlled analysis, does minoxidil work for the treatment of hair loss?

1. Consent of N = 30 volunteers
2. Randomization to Standard Care (N1 = 17) or Minoxidil (N2 = 13)
3. Administer standard care, or administer minoxidil
4. Wait 6 months
5. Count occurrences of new hair growth. Call this X1 (standard care) or X2 (minoxidil).

This design produces a 2x2 table array of count data that is correctly modeled using two binomial distributions.

| | New Growth | Not | Total |
|---|---|---|---|
| Minoxidil | X2 = 12 | | N2 = 13 |
| Standard care | X1 = 6 | | N1 = 17 |

**IF**

1. p1 = probability[new hair growth on standard care]
2. p2 = probability[new hair growth on minoxidil]
3. The outcomes for all 30 trial participants are independent

**THEN**

1. X1 is distributed Binomial(N1 = 17, p1)
2. X2 is distributed Binomial(N2 = 13, p2)

The product of two binomial distributions in a 2x2 array of randomized trial data is an example of a categorical distribution.

#### Example: Case-Control Study

Is history of oral contraceptive (OC) use associated with thromboembolism?

1. Enroll cases of thromboembolism (N1 = 100) and controls without thromboembolism (N2 = 200)
2. Query history of OC use in each group
3. Count histories of OC use. Call this X1 (cases) or X2 (controls).

This design also produces a 2x2 table array of count data that is correctly modeled using two binomial distributions.

| | Case | Control |
|---|---|---|
| History of OC Use | X1 = 65 | X2 = 118 |
| Not | | |
| Total | N1 = 100 | N2 = 200 |

Reminder: A case-control design does not permit the estimation of probabilities of disease.

**IF**

1. p1 = probability[history of OC use among cases]
2. p2 = probability[history of OC use among controls]
3. The histories for all 300 observations
are independent

**THEN**

1. X1 is distributed Binomial(N1 = 100, p1)
2. X2 is distributed Binomial(N2 = 200, p2)

The product of two binomial distributions in a 2x2 array of case-control data is an example of a categorical distribution.

#### Example: Cross-sectional Prevalence Study

WHO investigated the variation in prevalence of Alzheimer's Disease with race/ethnicity.

| | Alzheimer's Disease | No Alzheimer's Disease |
|---|---|---|
| African Black | 115 | 22885 |
| Native Japan | 7560 | 46440 |
| European White | 105930 | 857070 |
| South Pacific | 21 | 8479 |
| North American Indian | 44 | 10956 |

This design produces a 5x2 table array of count data that is correctly modeled using five binomial distributions.

### 3. Hypothesis of Independence, No Association, Homogeneity of Proportions (Review)

"Independence," "No Association," and "Homogeneity of Proportions" are equivalent null hypotheses. For example:

1. "Length of time since last visit to physician is independent of income" says that income has no bearing on the elapsed time between visits to a physician. The expected elapsed time is the same regardless of income level.
2. "There is no association between coffee consumption and lung cancer" says that an individual's likelihood of lung cancer is not affected by his or her coffee consumption.
3. "The equality of probability of success on treatment (experimental versus standard of care)" in a randomized trial of two groups is a test of homogeneity of proportions.

The hypotheses of independence, no association, and homogeneity of proportions are equivalent in an analysis of contingency table data.

### 4. The Chi Square Test of No Association in an R x C Table

Example: Is there an association between income level and the time elapsed since last visit to a physician (HA)? Or is time elapsed independent of income level (HO)?

Last Consulted Physician

| Income | ≤ 6 months | 7-12 months | > 12 months | Total |
|---|---|---|---|---|
| < 6000 | 186 | 38 | 35 | 259 |
| 6000-9999 | 227 | 54 | 45 | 326 |
| 10000-13999 | 219 | 78 | 78 | 375 |
| 14000-19999 | 355 | 112 | 140 | 607 |
| ≥ 20000 | 653 | 285 | 259 | 1197 |
| Total | 1640 | 567 | 557 | 2764 |

#### Step 1: Begin the proof-by-contradiction argument with the null hypothesis of independence

We consider a cross-sectional design view. What are the chances of any particular combination of income x elapsed time?

| Notation for probability | A natural estimate is the proportion observed |
|---|---|
| $\pi_{ij}$ = Pr[income is level i and elapsed time is level j] | (count in row i AND column j) / total = 186/2764 when i = 1 and j = 1 |
| $\pi_{i\cdot}$ = Pr[income is level i] | (row i total) / total = 259/2764 when i = 1 |
| $\pi_{\cdot j}$ = Pr[elapsed time is level j] | (column j total) / total = 1640/2764 when j = 1 |

Recall the meaning of independence of two coin tosses:

Prob[heads on toss 1 and heads on toss 2] = Prob[heads on toss 1] x Prob[heads on toss 2]

The statistical model of independence in an r x c table has the same intuition:

Prob[income is level i and elapsed time is level j] = Prob[income is level i] x Prob[elapsed time is level j]

That is, $\pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$.

#### Step 2: State Assumptions

1. The contingency table of count data is a random sample from some population.
2. The cross-classification of each individual is independent of the cross-classifications of all other individuals.

The typical layout may use either $n_{ij}$ or $O_{ij}$ notation:

| Rows | j = 1 | ... | j = C | Total |
|---|---|---|---|---|
| i = 1 | $O_{11} = n_{11}$ | ... | $O_{1C} = n_{1C}$ | $N_{1\cdot} = O_{1\cdot}$ |
| ... | ... | ... | ... | ... |
| i = R | $O_{R1} = n_{R1}$ | ... | $O_{RC} = n_{RC}$ | $N_{R\cdot} = O_{R\cdot}$ |
| Total | $N_{\cdot 1} = O_{\cdot 1}$ | ... | $N_{\cdot C} = O_{\cdot C}$ | $N = O_{\cdot\cdot}$ |

#### Step 3: State Null and Alternative Hypotheses

$$H_O: \pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j} \qquad H_A: \pi_{ij} \ne \pi_{i\cdot}\,\pi_{\cdot j}$$

This is the null hypothesis of independence again.

#### Step 4: Estimate the $\pi_{ij}$ under the Null Hypothesis of Independence

$$\hat{\pi}_{ij} = \hat{\pi}_{i\cdot}\,\hat{\pi}_{\cdot j}$$

where

$$\hat{\pi}_{i\cdot} = \frac{n_{i\cdot}}{n} = \frac{\text{row } i \text{ total}}{\text{grand total}} \quad \text{and} \quad \hat{\pi}_{\cdot j} = \frac{n_{\cdot j}}{n} = \frac{\text{column } j \text{ total}}{\text{grand total}}$$

#### Step 5: Reason Null Hypothesis Expected Counts

$$E_{ij} = (\text{number of trials})\,[\hat{\pi}_{ij} \text{ under null}] = n\,\hat{\pi}_{i\cdot}\,\hat{\pi}_{\cdot j} = \frac{[\text{row } i \text{ total}]\,[\text{column } j \text{ total}]}{n}$$

#### Step 6: Reason an Appropriate Test Statistic

For each cell, we can obtain a sense for the unusualness of the data by
comparing observed versus expected under the null hypothesis. Large disparities are evidence against the null and in favor of the alternative. The statistical test is a (z-score)² measure that involves observed and expected counts:

$$\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Our chi square test statistic of association is the sum of these over all the cells in the table:

$$\chi^2_{(R-1)(C-1)} = \sum_{i=1}^{R}\sum_{j=1}^{C}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

How to compute degrees of freedom (df):

- df = (total cells) - (constraints on data)
- = RC - [1 for the grand total] - [(R - 1) for the row totals] - [(C - 1) for the column totals]. The row totals and the column totals have to be fixed; the extra "-1" in each is for the grand total.
- = RC - 1 - (R - 1) - (C - 1)
- = RC - R - C + 1
- = (R - 1)(C - 1)

Behavior of the chi square statistic under each of the null and alternative hypotheses:

| Null is true | Alternative is true |
|---|---|
| Each $(O_{ij} - E_{ij})^2 / E_{ij}$ is close to zero | Each $(O_{ij} - E_{ij})^2 / E_{ij}$ is >> 0 |
| The sum over all cells is small and has expected value (R - 1)(C - 1) | The sum over all cells is large and has expected value >> (R - 1)(C - 1) |

#### Step 7: Decision Rule

Reject the null hypothesis HO when:

- the test statistic is large
- the achieved significance level is small
- the test statistic exceeds the upper (α)(100)th percentile of the chi square distribution

#### Step 8: Computations

1. For each cell, compute

$$E_{ij} = (\text{number of trials})\,[\hat{\pi}_{ij} \text{ under null}] = n\,\hat{\pi}_{i\cdot}\,\hat{\pi}_{\cdot j} = \frac{[\text{row } i \text{ total}]\,[\text{column } j \text{ total}]}{n}$$

2. And then compute for each cell

$$\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Observed Counts (Last Consulted Physician)

| Income | ≤ 6 months | 7-12 months | > 12 months | Total |
|---|---|---|---|---|
| < 6000 | O11 = 186 | O12 = 38 | O13 = 35 | O1. = 259 |
| 6000-9999 | O21 = 227 | O22 = 54 | O23 = 45 | O2. = 326 |
| 10000-13999 | O31 = 219 | O32 = 78 | O33 = 78 | O3. = 375 |
| 14000-19999 | O41 = 355 | O42 = 112 | O43 = 140 | O4. = 607 |
| ≥ 20000 | O51 = 653 | O52 = 285 | O53 = 259 | O5. = 1197 |
| Total | O.1 = 1640 | O.2 = 567 | O.3 = 557 | O.. = 2764 |

Expected Counts (Last Consulted Physician)

| Income | ≤ 6 months | 7-12 months | > 12 months | Total |
|---|---|---|---|---|
| < 6000 | E11 = (259)(1640)/2764 = 153.68 | E12 = 53.13 | E13 = 52.19 | E1. = 259 |
| 6000-9999 | E21 = 193.43 | E22 = 66.87 | E23 = 65.70 | E2. = 326 |
| 10000-13999 | E31 = 222.50 | E32 = 76.93 | E33 = 75.57 | E3. = 375 |
| 14000-19999 | E41 = 360.16 | E42 = 124.52 | E43 = 122.32 | E4. = 607 |
| ≥ 20000 | E51 = 710.23 | E52 = 245.55 | E53 = (1197)(557)/2764 = 241.22 | E5. = 1197 |
| Total | E.1 = 1640 | E.2 = 567 | E.3 = 557 | E.. = 2764 |

$$\chi^2_{(R-1)(C-1)} = \sum_{\text{all cells}}\frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \frac{(186 - 153.68)^2}{153.68} + \cdots + \frac{(259 - 241.22)^2}{241.22} = 47.90$$

with degrees of freedom = (R - 1)(C - 1) = (5 - 1)(3 - 1) = 8.

Achieved significance level: p-value = Prob[chi square with df = 8 ≥ 47.90] << 0.0001.

#### Step 9: Statistical Conclusion

We reject the null hypothesis in this example because, if the null is true, the chance of obtaining a test statistic value as far from small as 47.90 is less than 1 chance in 10,000.

#### Special Case: The Chi Square Test of No Association in a 2 x 2 Table

Often the a, b, c, and d notation is used to represent the cell counts in a 2x2 table, as follows:

| 1st Classification | 2nd Classification: 1 | 2nd Classification: 2 | Total |
|---|---|---|---|
| 1 | a | b | a + b |
| 2 | c | d | c + d |
| Total | a + c | b + d | n |

The calculation for the chi square test given by $\sum_{\text{all cells}} (O_{ij} - E_{ij})^2 / E_{ij}$ is equivalently given by

$$\chi^2 = \frac{n\,(ad - bc)^2}{(a + c)(b + d)(c + d)(a + b)}$$

### 5. Rejection of Independence: The Chi Square Residual

With rejection of the null hypothesis of no association comes the question: why? We have a tool to help us that we've already seen. Appendix A gives us the following reasoning.

| IF | THEN | Comment |
|---|---|---|
| X has a distribution that is Binomial(n, p) exactly | X is approximately Normal($\mu$, $\sigma^2$) with $\mu = np$ and $\sigma^2 = np(1-p)$ | This X is our observed count O |
| | z-score $= \dfrac{X - E(X)}{SD(X)} = \dfrac{X - np}{\sqrt{np(1-p)}}$ is approximately Normal(0, 1) | This E(X) is our expected count E. Thus, the numerator of the z-score is O - E. The denominator of the z-score, $\sqrt{np(1-p)}$, is almost $\sqrt{E}$, but not quite. |

$$z\text{-score} \approx \frac{O - E}{\sqrt{E}}$$

This approximation to the z-score, and similar formulae, are called residuals.

The following are two measures of residuals (approximate z-scores) for investigation of association:

| Name | Calculation | Remark |
|---|---|---|
| Standardized residuals | $r_{ij} = \dfrac{O_{ij} - E_{ij}}{\sqrt{E_{ij}}}$ | As these are approximately z-scores, these residuals are distributed Normal(0, 1), approximately. |
| Adjusted standardized residuals | $r_{ij} = \dfrac{O_{ij} - E_{ij}}{\sqrt{E_{ij}\left(1 - \frac{n_{i\cdot}}{n}\right)\left(1 - \frac{n_{\cdot j}}{n}\right)}}$ | Thought to be more reasonably approximated as distributed Normal(0, 1). |

Behavior of measures of residual under each of the null and alternative hypotheses:

| Null is true | Alternative is true |
|---|---|
| O - E will be near zero, so the residual will be small | O - E will be appreciably different from zero when measured in SE units, so the residual will be large in absolute value |

How large is significantly large? We answer this using the Normal(0, 1) distribution.

#### Example: Investigation of Relationship between Income and Physician Visits (continued)

Shown are the adjusted standardized residuals r that are larger than 1.96 in magnitude (approximately); the other cells are left blank.

| Income | ≤ 6 months | 7-12 months | > 12 months |
|---|---|---|---|
| < 6000 | 4.3 | -2.4 | -2.8 |
| 6000-9999 | 4.0 | | -3.0 |
| 10000-13999 | | | |
| 14000-19999 | | | 2.0 |
| ≥ 20000 | -4.5 | 3.8 | |

Marking the cells that are appreciably lower than expected (negative residuals) reveals:

1. Low income individuals were more likely to visit their physician within 6 months than were higher income individuals.
2. Low income individuals were less likely to delay seeing their physician beyond one year than were higher income individuals.

#### The Small Cell Frequency Problem

The problem is: the entirety of this analysis relies on a continuous distribution (chi square) approximation to a distribution that is actually discrete (binomial, Poisson, etc.). The approximation is poor if the actual cell counts are small.

When is the approximation okay? There are a variety of rules on this:

1. Require that min $E_{ij}$ ≥ 10 for all the cells (very conservative).
2. Require min $E_{ij}$ ≥ 1 if 20% or fewer of the cells have expected counts < 5.
3. Require min $E_{ij}$ ≥ 2 if degrees of freedom is less than 30.

What do I do if there are sparse cell frequencies? Combine adjacent rows and/or columns to attain the required minimum expected cell frequencies. The disadvantage of this approach is a loss of degrees of freedom. The resulting test statistic is less powerful.
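The computations of Sections 4 and 5 can be sketched in a few lines of code. The following is a minimal illustration in plain Python, not part of the original notes: the function names `chi_square_analysis` and `chi2_upper_tail_even_df` are hypothetical, and the p-value helper uses a closed-form expression for the chi square upper tail that holds only when the degrees of freedom are even (true here, since df = 8). It computes the expected counts, the chi square statistic, the degrees of freedom, and the adjusted standardized residuals for the income by physician-visit table.

```python
import math

# Observed counts from the income x time-since-last-visit table (5 rows x 3 columns).
observed = [
    [186, 38, 35],
    [227, 54, 45],
    [219, 78, 78],
    [355, 112, 140],
    [653, 285, 259],
]


def chi_square_analysis(table):
    """Return expected counts, chi square statistic, df, and adjusted residuals.

    Hypothetical helper for illustration; assumes every expected count is > 0.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # E_ij = (row i total)(column j total) / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    # Chi square statistic: sum over all cells of (O - E)^2 / E
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)

    # Adjusted standardized residual:
    # (O - E) / sqrt(E * (1 - row share) * (1 - column share))
    adjusted = [
        [
            (o - e) / math.sqrt(e * (1 - r / n) * (1 - c / n))
            for o, e, c in zip(obs_row, exp_row, col_totals)
        ]
        for obs_row, exp_row, r in zip(table, expected, row_totals)
    ]
    return expected, chi2, df, adjusted


def chi2_upper_tail_even_df(x, df):
    """P[chi square(df) >= x] via a closed form that is valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


expected, chi2, df, adjusted = chi_square_analysis(observed)
p_value = chi2_upper_tail_even_df(chi2, df)

print(f"chi square = {chi2:.2f} on {df} df, p-value = {p_value:.2e}")
for row in adjusted:
    # Flag residuals larger than 1.96 in magnitude with an asterisk.
    print("  ".join(f"{r:6.2f}{'*' if abs(r) > 1.96 else ' '}" for r in row))
```

Cells flagged with an asterisk are the ones whose adjusted residual exceeds 1.96 in magnitude. In practice a statistical package would supply the chi square tail probability for any df; the hand-rolled helper is used here only to keep the sketch free of dependencies.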
### 6. Confidence Interval Estimation of Relative Risk (RR) and Odds Ratio (OR)

The z-score method is used to obtain confidence interval estimates of the relative risk (RR) and odds ratio (OR). The approach is the following. Its reasonableness is an application of the central limit theorem, the details of which are beyond the scope of this course.

If

- $\theta$ = parameter of interest
- $\hat{\theta}$ = best guess, based on a reasonably large sample size
- $\widehat{var}(\hat{\theta})$ = best guess of the variance of $\hat{\theta}$

Then $\dfrac{\hat{\theta} - \theta}{\sqrt{\widehat{var}(\hat{\theta})}}$ is well approximated as Normal(0, 1).

Thus, we can use this new z-score variable to obtain the confidence interval we're after. For a (1 - α)100% confidence interval:

$$\hat{\theta} \pm z_{1-\alpha/2}\sqrt{\widehat{var}(\hat{\theta})}$$

Outline of steps in obtaining a confidence interval for RR or OR:

1. Utilize the reasonableness of the z-score approximation to the distribution of ln(RR) and ln(OR).
2. Obtain a confidence interval for ln(RR) or ln(OR).
3. To obtain confidence interval estimates for RR and OR, exponentiate the confidence interval estimates for ln(RR) and ln(OR).

#### Confidence Interval Estimate of Relative Risk (RR)

Example:

| | CHD | No CHD | Total |
|---|---|---|---|
| High cholesterol | 27 | 95 | 122 |
| Not high cholesterol | 44 | 443 | 487 |
| Total | 71 | 538 | 609 |

$$\widehat{RR} = \frac{n_{11}/n_{1\cdot}}{n_{21}/n_{2\cdot}} = \frac{27/122}{44/487} = 2.45$$

1. Utilize the z-score approximation for the distribution of ln(RR):

$$\ln(\widehat{RR}) = \ln(2.45) = 0.896$$

$$\widehat{var}[\ln(\widehat{RR})] = \frac{1 - n_{11}/n_{1\cdot}}{n_{11}} + \frac{1 - n_{21}/n_{2\cdot}}{n_{21}}$$

In our example, $\widehat{var}[\ln(\widehat{RR})] = \dfrac{1 - 27/122}{27} + \dfrac{1 - 44/487}{44} = 0.0495$.

2. Obtain a confidence interval for ln(RR). For a 95% confidence interval, $z_{1-\alpha/2} = z_{.975} = 1.96$, so that with 95% confidence

$$0.896 - 1.96\sqrt{0.0495} \le \ln(RR) \le 0.896 + 1.96\sqrt{0.0495}$$

or 0.4599 ≤ ln(RR) ≤ 1.332.

3. Exponentiate: $e^{0.4599} \le RR \le e^{1.332}$, so that 1.584 ≤ RR ≤ 3.789.

#### Confidence Interval Estimate of Odds Ratio (OR)

Example:

| | Disease | No disease |
|---|---|---|
| Exposed | a = 8 | b = 30 |
| Not exposed | c = 2 | d = 20 |

$$\widehat{OR} = \frac{ad}{bc} = \frac{(8)(20)}{(2)(30)} = 2.67$$

1. Utilize the z-score approximation for the distribution of ln(OR):

$$\ln(\widehat{OR}) = \ln(2.67) = 0.981$$

$$\widehat{var}[\ln(\widehat{OR})] = \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}$$

In our example, $\widehat{var}[\ln(\widehat{OR})] = \dfrac{1}{8} + \dfrac{1}{30} + \dfrac{1}{2} + \dfrac{1}{20} = 0.708$.

2. Obtain a confidence interval for ln(OR). For a 95% confidence interval, $z_{1-\alpha/2} = z_{.975} = 1.96$, so that with 95% confidence

$$0.981 - 1.96\sqrt{0.708} \le \ln(OR) \le 0.981 + 1.96\sqrt{0.708}$$

or -0.67 ≤ ln(OR) ≤ 2.63.

3. Exponentiate: $e^{-0.67} \le OR \le e^{2.63}$, so that with 95% confidence 0.512 ≤ OR ≤ 13.87.

### 7. Strategies for Controlling Confounding

We can control confounding at study design:

- Restriction
- Matching

We can also control confounding analytically:

- Stratification
- Standardization
- Matching

#### Restriction

Restriction is the inclusion of only persons who are the same with respect to the confounder.

- A study of males only will not produce results that are confounded by gender effects.
- A study of nonsmokers only will not produce results that are confounded by the effects of smoking.

The advantage is a guarantee of control for confounding. However, there are also disadvantages: the sample size is limited and generalizability is reduced.

#### Matching in a Cohort Study

Matching in a cohort study involves the following:

- Enrollment of exposed persons without restriction
- Enrollment of unexposed persons only if they match exposed persons

#### Matching in a Case-Control Study

In a case-control study, the following occurs:

- Enrollment of cases without restriction
- Enrollment of controls only if they match cases

Be careful! Matching is not necessarily a good idea.

- In case-control studies, controls may be artificially similar to cases.
- Estimates of association may be spuriously low.
- If matching is related to exposure only (not confounding), then spurious confounding may be introduced.
- Sample size is reduced.
- Identical matched pairs provide no information.

Do not match:

- In most case-control studies
- On a variable that is intermediary
- When a large number of controls are available

Consider matching:

- In an experiment and some cohort studies
- On some variables: age, sex, site

### 8. Standardization of Rates

Many epidemiological studies of association involve the comparison of rates, using the method of standardization to control for confounding.

#### Goal of Standardization

- To
compare event rates in two populations while taking into account differences in their distributions with respect to one or more confounders.
- This is achieved by comparing the event rates in the two populations as if their distributions with respect to the confounder were the same.

Example: We wish to compare death rates in two populations.

- Exposed: Single women
- Unexposed: Ever married women

Age is a confounder since (1) death rates increase with age AND (2) ever married women tend to be older.

How do we pretend that the two groups have the same age distribution?

1. In both populations, we define strata of age.
2. We obtain the relative frequencies of each stratum of age.
3. These relative frequencies become weights.
4. A standardized rate is just a weighted average:

$$\text{Standardized rate} = \sum_{\text{age}} [\text{relative frequency of stratum}]\,[\text{rate in stratum}] = \sum_{\text{age}} [\text{weight}]\,[\text{rate in stratum}]$$

There are two commonly used methods of comparing standardized rates:

| | Name | Weight |
|---|---|---|
| 1. Indirect | SMR: standardized mortality ratio | Relative frequency of stratum among EXPOSED |
| 2. Direct | SRR: standardized rate ratio | Relative frequency of stratum among UNexposed |

A comparison of standardized rates has the following form:

$$\frac{\text{Standardized rate among the exposed}}{\text{Standardized rate among the unexposed}}$$

#### A. Indirect Standardization: Standardized Mortality Ratio (SMR)

$$SMR = \frac{\sum [\text{weight in EXPOSED}]\,[\text{rate in exposed}]}{\sum [\text{weight in EXPOSED}]\,[\text{rate in unexposed}]} = \frac{\text{Observed deaths in exposed}}{\text{Expected deaths in exposed}}$$

Suppose we observe the following:

| Stratum i | Exposed: Frequency $n_{1i}$ | Exposed: Events $a_i$ | Unexposed: Frequency $n_{0i}$ | Unexposed: Events $b_i$ |
|---|---|---|---|---|
| Young | 10000 | 50 | 100000 | 50 |
| Old | 1000 | 4 | 200000 | 400 |
| Total | 11000 | | 300000 | |

Step 1: Calculate stratum specific death rates in each of the exposed and unexposed populations.

| Stratum i | Exposed | Unexposed |
|---|---|---|
| Young | 50/10000 = .005 | 50/100000 = .0005 |
| Old | 4/1000 = .004 | 400/200000 = .002 |

Step 2: For an SMR, our weights will be the relative frequency of the age strata in the exposed population.

| Stratum i | Exposed: Frequency $n_{1i}$ | Weight |
|---|---|---|
| Young | 10000 | 10000/11000 = .9091 |
| Old | 1000 | 1000/11000 = .0909 |
| Total | 11000 | 1.000 |

Step 3: Compute the standardized event rate in each population.

| Exposed | Event Rate | Weight Assignment | Weight x Rate |
|---|---|---|---|
| Young | .005 | .9091 | .004546 |
| Old | .004 | .0909 | .000364 |
| Sum | | | .004910 |

| Unexposed | Event Rate | Weight Assignment | Weight x Rate |
|---|---|---|---|
| Young | .0005 | .9091 | .000455 |
| Old | .002 | .0909 | .000182 |
| Sum | | | .000637 |

Step 4: Compare.

$$SMR = \frac{.004910}{.000637} = 7.7$$

We say that the number of events among the exposed is 7.7 times greater than what was expected in the absence of exposure.

#### B. Direct Standardization: Standardized Rate Ratio (SRR)

$$SRR = \frac{\sum [\text{weight in UNexposed}]\,[\text{rate in exposed}]}{\sum [\text{weight in UNexposed}]\,[\text{rate in unexposed}]} = \frac{\text{Expected deaths in UNexposed}}{\text{Observed deaths in UNexposed}}$$

Consider the same pattern of event occurrence:

| Stratum i | Exposed: Frequency $n_{1i}$ | Exposed: Events $a_i$ | Unexposed: Frequency $n_{0i}$ | Unexposed: Events $b_i$ |
|---|---|---|---|---|
| Young | 10000 | 50 | 100000 | 50 |
| Old | 1000 | 4 | 200000 | 400 |
| Total | 11000 | | 300000 | |

Step 1: Calculate stratum specific death rates in each of the exposed and unexposed populations, just as we did before.

| Stratum i | Exposed | Unexposed |
|---|---|---|
| Young | 50/10000 = .005 | 50/100000 = .0005 |
| Old | 4/1000 = .004 | 400/200000 = .002 |

Step 2: For an SRR, our weights will be the relative frequency of the age strata in the unexposed population.

| Stratum i | Unexposed: Frequency $n_{0i}$ | Weight |
|---|---|---|
| Young | 100000 | 100000/300000 = .3333 |
| Old | 200000 | 200000/300000 = .6667 |
| Total | 300000 | 1.000 |

Step 3: Compute the standardized event rate in each population.

| Exposed | Event Rate | Weight Assignment | Weight x Rate |
|---|---|---|---|
| Young | .005 | .3333 | .00167 |
| Old | .004 | .6667 | .00267 |
| Sum | | | .00434 |

| Unexposed | Event Rate | Weight Assignment | Weight x Rate |
|---|---|---|---|
| Young | .0005 | .3333 | .00017 |
| Old | .002 | .6667 | .00133 |
| Sum | | | .0015 |

Step 4: Compare.

$$SRR = \frac{.00434}{.0015} = 2.89$$

#### Summary of Standardization

| | Indirect | Direct |
|---|---|---|
| Name | SMR: Standardized Mortality Ratio | SRR: Standardized Rate Ratio |
| Standardization is to | Exposed population | Unexposed population |
| Formula | $\sum$ [weight exposed][rate in exposed] / $\sum$ [weight exposed][rate in unexposed] = Observed in exposed / Expected in exposed | $\sum$ [weight unexposed][rate in exposed] / $\sum$ [weight unexposed][rate in unexposed] = Expected in unexposed / Observed in unexposed |
| Advantages | Easy; intuitive | Standardization is to unexposed; a single reference can be used for comparison of many populations |
| Disadvantages | Standardization is to exposed; often cannot compare populations | Not intuitive |

Two warnings about standardization:

- Warning 1: Standardization is not meaningful if the distributions of the confounder in the exposed and unexposed populations do not overlap. E.g., an example of non-overlapping distributions occurs when the ages among the exposed population range 50 and older while the ages among the unexposed population range 18 to 35.
- Warning 2: Standardization assumes that the effect of exposure is the same in each stratum. If it is not, we say that the stratification variable is an effect modifier.

### 9. Stratified Analysis of Rates

In a stratified analysis of rates, the goal is to understand an exposure-disease relationship while taking into account confounding or effect modification. Need a review of confounding and effect modification? See Appendix E.

Example: We will explore in some detail a data set investigating exposure to video display terminals and spontaneous abortion (SAB). In this unit, we'll do a stratified analysis of this association, considering strata defined by month of gestation. Suppose the following are observed:

| Month of Gestation | Unexposed: SAB / Pregnancies | Unexposed: % | Exposed: SAB / Pregnancies | Exposed: % |
|---|---|---|---|---|
| 1 | 10/512 | 2.0 | 1/366 | 0.3 |
| 2 | 38/502 | 7.5 | 30/365 | 8.2 |
| 3 | 15/462 | 3.2 | 12/335 | 3.6 |
| 4 | 7/449 | 1.6 | 5/323 | 1.5 |
| 5 | 2/442 | 0.5 | 4/318 | 1.3 |
| 6 | 4/440 | 0.9 | 1/314 | 0.3 |
| 7 | 2/436 | 0.5 | 1/313 | 0.3 |

The following analysis plan might be followed:

- Step 1: Are the stratum specific OR the same? Estimate a common OR and test homogeneity of the OR.
  - If the OR are the same, Step 2: Is the common OR = 1?
  - If the OR are different, Step 2: Report the stratum specific OR.

Step 1

- A preliminary is the
estimation of an assumed common odds ratio. This will be the Mantel-Haenszel estimate.
- Testing homogeneity of the stratum specific OR involves comparing the stratum specific OR's to the Mantel-Haenszel OR.

Step 2

- IF we judge the stratum specific odds ratios (OR) to be within noise of each other (the same), THEN we evaluate whether the common OR is close to unity (no association); or
- IF we judge the stratum specific odds ratios (OR) to be different, THEN we report the stratum specific OR.

#### How to Estimate the Mantel-Haenszel Odds Ratio, $OR_{MH}$

- It is a weighted average of the stratum specific odds ratios.
- The weights are a function of the variances of the stratum specific odds ratios.

Step 1: For each stratum, obtain the following.

| | Case | Control | Total |
|---|---|---|---|
| Exposed | a | b | M1 |
| UNexposed | c | d | M0 |
| Total | N1 | N0 | T |

$$OR_{stratum} = \frac{ad}{bc} \qquad \text{variance}[OR_{stratum}] = \frac{T}{bc}$$

Step 2: Calculate the $OR_{MH}$ as a weighted average of the stratum specific OR.

$$OR_{MH} = \frac{\sum_{\text{strata}} OR_{stratum} / \text{Var}(OR_{stratum})}{\sum_{\text{strata}} 1 / \text{Var}(OR_{stratum})} = \frac{\sum_{\text{strata}} ad/T}{\sum_{\text{strata}} bc/T}$$

Here are the calculations for the data in our example:

| Month of Gestation | a | b | M1 | c | d | M0 | N1 | N0 | T | OR |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 10 | 502 | 512 | 1 | 365 | 366 | 11 | 867 | 878 | 7.2709 |
| 2 | 38 | 464 | 502 | 30 | 335 | 365 | 68 | 799 | 867 | 0.9145 |
| 3 | 15 | 447 | 462 | 12 | 323 | 335 | 27 | 770 | 797 | 0.9032 |
| 4 | 7 | 442 | 449 | 5 | 318 | 323 | 12 | 760 | 772 | 1.0072 |
| 5 | 2 | 440 | 442 | 4 | 314 | 318 | 6 | 754 | 760 | 0.3568 |
| 6 | 4 | 436 | 440 | 1 | 313 | 314 | 5 | 749 | 754 | 2.8716 |
| 7 | 2 | 434 | 436 | 1 | 312 | 313 | 3 | 746 | 749 | 1.4378 |

| Month of Gestation | a | b | c | d | T | ad/T | bc/T |
|---|---|---|---|---|---|---|---|
| 1 | 10 | 502 | 1 | 365 | 878 | 4.1572 | 0.5718 |
| 2 | 38 | 464 | 30 | 335 | 867 | 14.6828 | 16.0554 |
| 3 | 15 | 447 | 12 | 323 | 797 | 6.0790 | 6.7302 |
| 4 | 7 | 442 | 5 | 318 | 772 | 2.8834 | 2.8627 |
| 5 | 2 | 440 | 4 | 314 | 760 | 0.8263 | 2.3158 |
| 6 | 4 | 436 | 1 | 313 | 754 | 1.6605 | 0.5782 |
| 7 | 2 | 434 | 1 | 312 | 749 | 0.8331 | 0.5794 |
| TOTALS | | | | | | 31.1223 | 29.6935 |

$$OR_{MH} = \frac{\sum ad/T}{\sum bc/T} = \frac{31.1223}{29.6935} = 1.0481$$

#### A. Woolf Test of Homogeneity

Step 1: For each stratum i, obtain the following:

$$\ln[OR_i] = \ln\left[\frac{a_i d_i}{b_i c_i}\right] \quad \text{and} \quad \text{weight } w_i = \left[\frac{1}{a_i} + \frac{1}{b_i} + \frac{1}{c_i} + \frac{1}{d_i}\right]^{-1}$$

Step 2: Obtain a weighted average of the stratum specific ln[OR], where K = number of strata:

$$\overline{\ln[OR]} = \frac{\sum_{i=1}^{K} w_i \ln[OR_i]}{\sum_{i=1}^{K} w_i}$$

Step 3: The Woolf statistic, under the null hypothesis of homogeneity of OR, is distributed chi square with degrees of freedom = (number of strata) - 1:

$$\chi^2_{K-1} = \sum_{i=1}^{K} w_i \left[\ln(OR_i) - \overline{\ln[OR]}\right]^2$$

Example:

| Month | a | b | c | d | w | ln[OR] | w ln[OR] |
|---|---|---|---|---|---|---|---|
| 1 | 10 | 502 | 1 | 365 | 0.9052 | 1.983882 | 1.795805 |
| 2 | 38 | 464 | 30 | 335 | 15.4346 | -0.089365 | -1.37932 |
| 3 | 15 | 447 | 12 | 323 | 6.4378 | -0.101763 | -0.655126 |
| 4 | 7 | 442 | 5 | 318 | 2.8714 | 0.007214 | 0.020713 |
| 5 | 2 | 440 | 4 | 314 | 1.3238 | -1.030529 | -1.367496 |
| 6 | 4 | 436 | 1 | 313 | 0.7965 | 1.054855 | 0.841448 |
| 7 | 2 | 434 | 1 | 312 | 0.6642 | 0.363106 | 0.241485 |
| Totals | | | | | 28.43872 | | -0.502491 |

$$\overline{\ln[OR]} = \frac{\sum w_i \ln[OR_i]}{\sum w_i} = \frac{-0.502491}{28.43872} = -0.017669$$

$$\chi^2_{K-1} = \sum_{i=1}^{K} w_i \left[\ln(OR_i) - \overline{\ln[OR]}\right]^2 = 6.1284$$

Step 4: Significance level calculation.

p-value = Probability[chi square with df = 6 ≥ 6.1284] = .409. Do not reject.

The null hypothesis is retained because the Woolf statistic is not statistically significant. Inasmuch as the stratum specific odds ratios range from 0.35 to 7.27, the lack of statistical significance is reflecting the limited availability of sample size to study.

#### B. Mantel-Haenszel Test of No Association

It has been determined previously that it is reasonable to assume that the stratum specific odds ratios are the same. Now we ask: are the stratum specific odds ratios all unity?

Step 1: For each stratum, the hypothesis of no association means that the count "a" has a distribution that is central hypergeometric.

| | Case | Control | Total |
|---|---|---|---|
| Exposed | a | b | M1 |
| UNexposed | c | d | M0 |
| Total | N1 | N0 | T |

$$E[a] = \frac{N_1 M_1}{T} \qquad \text{var}[a] = \frac{N_1 N_0 M_1 M_0}{T^2 (T - 1)}$$

Step 2: The test statistic will be based on the sum over strata of the counts "a":

$$\chi^2_{MH} = \frac{(A - E[A])^2}{\text{var}[A]}$$

where

$$A = \sum_{\text{strata}} a \qquad E[A] = \sum_{\text{strata}} \frac{N_1 M_1}{T} \qquad \text{var}[A] = \sum_{\text{strata}} \frac{N_1 N_0 M_1 M_0}{T^2 (T - 1)}$$

We get the following in our example:

| Month of Gestation | a | $N_1 M_1 / T$ | $N_1 N_0 M_1 M_0 / [T^2 (T - 1)]$ |
|---|---|---|---|
| 1 | 10 | 6.4146 | 2.6435 |
| 2 | 38 | 39.3725 | 15.2931 |
| 3 | 15 | 15.6512 | 6.3637 |
| 4 | 7 | 6.9793 | 2.8784 |
| 5 | 2 | 3.4895 | 1.4505 |
| 6 | 4 | 2.9178 | 1.2086 |
| 7 | 2 | 1.7463 | 0.7278 |
| TOTALS | 78 | 76.5712 | 30.5656 |

$$A = \sum a = 78 \qquad E[A] = \sum \frac{N_1 M_1}{T} = 76.5712 \qquad \text{var}[A] = \sum \frac{N_1 N_0 M_1 M_0}{T^2 (T - 1)} = 30.5656$$

$$\chi^2_{MH} = \frac{(A - E[A])^2}{\text{var}[A]} = \frac{(78 - 76.5712)^2}{30.5656} = 0.0668$$

Significance level (p-value): If the assumption of no association is true, then the chance of a chi square statistic more extreme than 0.0668 is

p-value = Prob[chi square with df = 1 ≥ 0.0668] = 0.80. Do not reject.

Conclude that, overall, the data do not suggest an association. This is not surprising inasmuch as $OR_{MH}$ = 1.048.

### 10. Factors Associated with Mammographic Screening

Source: Evans et al. (1998). Factors Associated with Repeat Mammography in a New York State Public Health Screening Program. Public Health Management Practice 4(5): 63-71.

#### Background

- Breast cancer is a major cause of morbidity and mortality. In the US, it is the second major cause of cancer deaths for women.
- There is no known way of primary prevention. In the meantime, secondary prevention is of critical public health importance.
- Mammography detects cancer approximately 1.7 years before a woman could feel the lump herself. It also locates cancers too small for detection by clinical breast exam.
- Stage of breast cancer at diagnosis is related to survival:

| Stage at Diagnosis | Percent Surviving to 5 Years |
|---|---|
| Early | 97 |
| Late | 20 |

- One screening mammogram is not enough. The risk of breast cancer increases with age.
- Previous work has shown that mammography is underutilized.
- Therefore, surveillance of patterns of repeat mammographic screening among women is needed to identify targets for intervention. Such a study is among the activities of the New York State Department of Health.

#### Research Question

Among women with no history of breast cancer and with a normal mammogram, what factors, among selected sets of characteristics (sociodemographic, cancer risk, health behavior, health care access), predict the occurrence of a repeat mammogram?
Design

Cohort study investigation of the occurrence of a repeat screening mammogram during the period 1988-1993 among women without a history of breast cancer and who received a baseline screening mammogram that is documented in the Breast and Cervical Screening Program Database of the New York State Department of Health.

Breast and Cervical Cancer Screening Program Cohort, 1988-1993, New York State Department of Health, 9 Mammography Sites:

- 16,529 baseline mammograms among women aged over 50
- Exclusions: 6,311 due to non-negative baseline mammogram; 205 due to requirement for follow-up testing; 528 due to history of breast cancer; 7,044 exclusions total
- Analysis Cohort: N = 9,485 women with no history of breast cancer and no missing data

Characteristics of Analysis Cohort

| | Frequency | % |
|---|---|---|
| Total | 9485 | 100 |
| Age 50-69 years | 3670 | 39 |
| Non-White Race/Ethnicity | 5160 | 54 |
| Less than High School Education | 6472 | 68 |
| Family History of Breast Cancer | 1130 | 12 |
| Previous Mammogram | 4366 | 46 |
| Returned for Repeat Screening Mammogram | 2604 | 27 |

Recall

- Among 9485 women with an initial negative mammogram, 2561 (27%) returned for a repeat screening mammogram.
- Interest is in variations in these events of return with demographics, medical history, and access to health care.
- Rationale is the importance of detecting breast cancer in its early stage.

A reasonable analysis plan is the following:

| Goal | Rationale | Methods |
|---|---|---|
| 1. Description of Analysis Sample | To describe sample; to compare sample with target; to identify data errors | Relative frequency |
| 2. Estimation of Crude Associations | To obtain these associations; to identify candidates for adjusted analysis; to guide adjusted analysis | Relative frequency; OR and 95% CI; chi square tests of association |
| 3. Model Free Estimation of Adjusted Associations | To obtain estimates of independent predictive significance; to obtain model free hypothesis tests; to discover effect modification; to discover confounding | Stratified estimates of OR and 95% confidence intervals; test of homogeneity of OR; estimation of OR_MH |

Characteristics of Participants with Negative Mammograms at Initial Visit, New York State 1988-1991 (N=9485). Partial listing.

| | n | % |
|---|---|---|
| Age, years | | |
| ≥ 70 | 885 | 9.3 |
| 50-69 | 3670 | 38.7 |
| 40-49 | 2805 | 29.6 |
| < 40 | 2061 | 21.7 |
| unknown | 64 | 0.7 |
| Race/Ethnicity | | |
| White, Non-Hispanic | 4325 | 45.6 |
| Black, Non-Hispanic | 2567 | 27.1 |
| Hispanic, Asian, Other | 2587 | 27.2 |
| unknown | 6 | 0.1 |
| Time Since Last Mammogram | | |
| Less than 1 year | 1552 | 16.4 |
| 1-5 years | 2220 | 23.4 |
| More than 5 years | 594 | 6.3 |
| No prior mammogram | 4933 | 52.0 |
| unknown | 186 | 2.0 |

Note: Initial visits occurred during the years 1988-1991. Almost half were over the age of 50; 46% were White; 52% had never had a mammogram. Give counts of unknown.

Crude Associations with Return for Screening Mammogram Among Women with Initial Negative Mammogram, New York State 1988-1993 (N=9485)

| | N | n returning for screening mammogram | % |
|---|---|---|---|
| Age (P = .0001, a) | | | |
| ≥ 70 | 885 | 292 | 33.0 |
| 50-69 | 3670 | 1328 | 36.2 |
| 40-49 | 2805 | 691 | 24.6 |
| < 40 | 2061 | 288 | 14.0 |
| Race/Ethnicity (P = .0001) | | | |
| White | 4325 | 1380 | 31.9 |
| Black | 2567 | 755 | 29.4 |
| Hispanic, Asian, Other | 2587 | 468 | 18.1 |
| Last Mammogram (P = .0001) | | | |
| Less than 1 year | 1552 | 576 | 37.1 |
| 1-5 years | 2220 | 727 | 32.7 |
| > 5 years | 594 | 165 | 27.8 |
| No prior | 4416 | 958 | 21.7 |

a: Chi square test of association.

- Reminder: P-values are not very useful. They are especially uninformative in large scale studies.
- Best return is seen among women 50-69 years of age.
- Crude analysis suggests that Hispanic, Asian, and women of other race/ethnicity are less likely to follow their initial negative mammogram with a repeat screen.
- Not surprisingly, women with a history of mammogram are more likely to return for a repeat screen.

11. The Chi Square Goodness-of-Fit Test

Another use of the chi square distribution:

- So far, we've used the chi square statistic to test the hypothesis of no association.
- Now we'll use the chi square distribution to assess whether two distributions are the same, or reasonably the same (goodness-of-fit).

Suppose that a histogram of the observed data looks like:

[Figure: histogram of the observed data]

Of interest: Can we reasonably assume, for purposes of analysis, that the data represent a sample from a Normal distribution? This permits application of normal theory estimation and hypothesis testing approaches like the ones we learned in PubHlth 540, Introductory Biostatistics. This might also be of interest if we'd like to know whether the sample distribution can reasonably be described as that of another distribution, e.g. Binomial or Poisson.

Consider the setting where interest is in goodness-of-fit to the Normal distribution. Which normal distribution? Let's consider the Normal distribution that is the "closest". By "closest" we mean

$$\mu = \text{sample mean } \bar{X}, \qquad \sigma^2 = \text{sample variance } S^2$$

The idea is to consider an overlay of this Normal distribution on the histogram of the observed data.

[Figure: Normal($\mu=\bar{X}$, $\sigma^2=S^2$) curve overlaid on the histogram]

The Idea of the Goodness-of-Fit Test

1. Divide up the range into intervals $i = 1, 2, 3, \ldots, K$.
2. In each interval $i$, obtain the observed count $O_i$ and the expected count $E_i$.
3. Also obtain, for each interval $i$, what is called a component chi square. Each is a comparison of the observed and expected counts:

$$\frac{(O_1-E_1)^2}{E_1}, \quad \frac{(O_2-E_2)^2}{E_2}, \quad \ldots, \quad \frac{(O_K-E_K)^2}{E_K}$$

4. Sum these to obtain the chi square goodness-of-fit test statistic:

$$\chi^2_{GOF} = \sum_{i=1}^{K}\frac{(O_i-E_i)^2}{E_i}$$

Behavior of the Chi Square Goodness-of-Fit Statistic

This is a setting where the null hypothesis is typically the one that we hope is operative. The null hypothesis says that the unknown true distribution that gave rise to the data is reasonably similar to the hypothesized distribution (in this example, Normal).

Values of the chi square goodness-of-fit test will be small when the two distributions are reasonably similar. This is because the observed and expected counts are similar, giving rise to component chi square values that are small.

How many degrees of freedom has the chi square statistic? Start with the total number of intervals, K. One df is lost for the last interval, and one more is lost for each parameter estimated using the data:

DF = K - 1 - (# parameters estimated)

Example

Source: Rosner B. Fundamentals of Biostatistics, second edition. Boston: Duxbury, 1986, p. 352.

Test for goodness-of-fit to the normal probability distribution for the following data, comprised of n = 14,736 blood pressure readings. Note: these data have sample mean and variance values of $\bar{X} = 80.68$ and $S^2 = 12^2$, respectively.

Step 1: Obtain observed counts from a histogram.

| Class i | Interval | Observed Count $O_i$ |
|---|---|---|
| 1 | < 50 | 57 |
| 2 | ≥ 50 to < 60 | 330 |
| 3 | ≥ 60 to < 70 | 2132 |
| 4 | ≥ 70 to < 80 | 4584 |
| 5 | ≥ 80 to < 90 | 4604 |
| 6 | ≥ 90 to < 100 | 2119 |
| 7 | ≥ 100 to < 110 | 659 |
| 8 | ≥ 110 | 251 |
| TOTAL | | 14,736 |

Tip: Check that the sum of the observed counts MATCHES the total sample size.

Step 2: Obtain the $\mu$ and $\sigma^2$ of the comparison normal distribution. Compute from the sample: $\bar{X} = 80.68$ and $S^2 = 12^2$. So we'll compare the data to the normal distribution with $\mu = 80.68$ and $\sigma^2 = 12^2$.

Step 3: Calculate the likelihood of a value in each interval, using the z-score method introduced in BE540.

For interval i=1:

$$\Pr[X < 50] = \Pr\left[Z < \frac{50-80.68}{12}\right] = \Pr[Z < -2.556] = 0.00529$$

For interval i=2:

$$\Pr[50 \le X < 60] = \Pr\left[\frac{50-80.68}{12} \le Z < \frac{60-80.68}{12}\right] = \Pr[-2.556 \le Z < -1.7233] = 0.04242 - 0.00529 = 0.0371$$

Etc. For interval i=K=8:

$$\Pr[X \ge 110] = \Pr\left[Z \ge \frac{110-80.68}{12}\right] = \Pr[Z \ge 2.4433] = 0.00728$$

Step 4: Calculate the expected count of observations in each interval, using

Expected count = (sample size) x (probability of interval)

For interval i=1: $E_1 = (14{,}736)(0.00529) = 77.95$

For interval i=2: $E_2 = (14{,}736)(0.0371) = 546.71$

Etc. For interval i=K=8: $E_8 = (14{,}736)(0.00728) = 107.28$

Step 5: Obtain the component chi squares.

| Class i | Interval | Observed Count $O_i$ | Expected Count $E_i$ | Component $(O_i-E_i)^2/E_i$ |
|---|---|---|---|---|
| 1 | < 50 | 57 | 77.95 | 5.6306 |
| 2 | ≥ 50 to < 60 | 330 | 546.71 | 85.9015 |
| 3 | ≥ 60 to < 70 | 2132 | 2126.40 | 0.0147 |
| 4 | ≥ 70 to < 80 | 4584 | 4283.75 | 21.0447 |
| 5 | ≥ 80 to < 90 | 4604 | 4478.27 | 3.5299 |
| 6 | ≥ 90 to < 100 | 2119 | 2431.44 | 40.1485 |
| 7 | ≥ 100 to < 110 | 659 | 683.75 | 0.8959 |
| 8 | ≥ 110 | 251 | 107.57 | 191.2444 |
| TOTAL | | 14,736 | 14,736 | 348.41 |

Tip: Check that sum of observed = sum of expected = sample size.

Step 6: Determine degrees of freedom.

DF = K - 1 - (# parameters estimated) = 8 - 1 - (1 for $\mu$) - (1 for $\sigma^2$) = 5

Step 7: Assess statistical significance.

$$\chi^2_{GOF;\,df=5} = 348.41$$

p-value = Prob [ Chi square (df=5) ≥ 348.41 ] << .0001

This suggests that the data cannot reasonably be assumed to follow a normal distribution. Examination of the component chi squares suggests that the normal distribution fit is reasonable for blood pressures between 60 and 110 mm Hg, but is poor for readings below 60 mm Hg or above 110 mm Hg.

Example

Source: Zar JH. Biostatistical Analysis, third edition. Upper Saddle River: Prentice Hall, 1996, p. 461.

A plant geneticist wishes to know if a sample of n = 250 seedlings comes from a population having a 9:3:3:1 ratio of yellow smooth : yellow wrinkled : green smooth : green wrinkled seeds. In this example, expected counts are computed using the hypothesized phenotype ratios.

| i | Phenotype | $O_i$ | Expected Count $E_i$ = n Pr[phenotype] | Component $(O_i-E_i)^2/E_i$ |
|---|---|---|---|---|
| 1 | Yellow smooth | 152 | (250)(0.5625) = 140.625 | 0.9201 |
| 2 | Yellow wrinkled | 39 | (250)(0.1875) = 46.875 | 1.3230 |
| 3 | Green smooth | 53 | (250)(0.1875) = 46.875 | 0.8003 |
| 4 | Green wrinkled | 6 | (250)(0.0625) = 15.625 | 5.9290 |
| TOTAL | | 250 | 250 | 8.972 |

DF = K - 1 - (# parameters estimated) = 4 - 1 - (0, because we didn't have to estimate any) = 3

$$\chi^2_{GOF;\,df=3} = 8.972$$

p-value = Prob [ Chi square (df=3) ≥ 8.972 ] = 0.02967

This suggests that the data do NOT come from a population having a 9:3:3:1 ratio of the four seedling types.

Appendix A: The Chi Square Distribution

In PubHlth 540, the chi square distribution was introduced in Unit 6 (Estimation) and in Unit 8 (Chi Square Tests). This appendix explains the appropriateness of using the chi square distribution, a model for a continuous random variable, for the analysis of discrete data.

The chi square distribution is related to the normal distribution.

| IF | THEN this has a Chi Square distribution | with DF = |
|---|---|---|
| Z has a distribution that is Normal(0,1) | $Z^2$ | 1 |
| X has a distribution that is Normal($\mu$, $\sigma^2$), so that Z-score = $(X-\mu)/\sigma$ | $\{\text{Z-score}\}^2$ | 1 |
| $X_1, X_2, \ldots, X_n$ are each distributed Normal($\mu$, $\sigma^2$) and are independent, so that $\bar{X}$ is Normal($\mu$, $\sigma^2/n$) and Z-score = $(\bar{X}-\mu)/(\sigma/\sqrt{n})$ | $\{\text{Z-score}\}^2$ | 1 |
| $X_1, X_2, \ldots, X_n$ are each distributed Normal($\mu$, $\sigma^2$) and are independent, and we calculate $S^2 = \sum_{i=1}^{n}(X_i-\bar{X})^2/(n-1)$ | $(n-1)S^2/\sigma^2$ | n - 1 |

The chi square distribution can be used in the analysis of categorical (count) data for reasons related to the normal distribution and, in particular, the central limit theorem.

$Z_1, Z_2, \ldots, Z_n$ are each Bernoulli with probability of event = $\pi$:

$$E[Z_i] = \mu = \pi \qquad Var[Z_i] = \sigma^2 = \pi(1-\pi)$$

The net number of events $X = \sum_{i=1}^{n} Z_i$ is Binomial(n, $\pi$).

We learned in PubHlth 540 that the distribution of the average of the $Z_i$ is well described as Normal($\mu$, $\sigma^2/n$). Apply this notion here. By convention,

$$\bar{Z} = \frac{\sum_{i=1}^{n} Z_i}{n} = \frac{X}{n} = \bar{X}$$

So perhaps the distribution of the sum is also well described as Normal, at least approximately. If $\bar{X}$ is described well as Normal($\mu$, $\sigma^2/n$), then $X = n\bar{X}$ is described well as Normal($n\mu$, $n\sigma^2$).

- Exactly: X is distributed Binomial(n, $\pi$).
- Approximately: X is distributed Normal($n\mu$, $n\sigma^2$), where $\mu = \pi$ and $\sigma^2 = \pi(1-\pi)$.

Putting it all together:

| IF | THEN | Comment |
|---|---|---|
| X has a distribution that is Binomial(n, $\pi$) exactly | X has a distribution that is Normal($n\mu$, $n\sigma^2$) approximately, where $\mu = \pi$ and $\sigma^2 = \pi(1-\pi)$ | |
| | Z-score = $\dfrac{X - n\pi}{\sqrt{n\pi(1-\pi)}}$ is approximately Normal(0,1) | |
| | $\{\text{Z-score}\}^2$ has a distribution that is well described as Chi Square with df = 1 | We arrive at a continuous distribution model for count data. |

A Feel for Things, continued

You will come to think of the chi square distribution as follows when analyzing count data.

For one cell:

$$\frac{(\text{Observed Count} - \text{Expected Count})^2}{\text{Expected Count}} \quad\text{is Chi Square (df = 1), approximately}$$

For the sum over all RC cells in an R x C table:

$$\sum_{i=1}^{R}\sum_{j=1}^{C}\frac{\left(\text{Observed Count}(i,j) - \text{Expected Count}(i,j)\right)^2}{\text{Expected Count}(i,j)} \quad\text{is Chi Square (df = (R-1)(C-1)), approximately}$$

Appendix B: Selected Models for Categorical Data

Various study designs (e.g. case-control, cohort, surveillance) give rise to categorical data utilizing some of the probability distributions that have been introduced in Unit 3 (e.g. binomial, poisson, product binomial, and product poisson).

1. Case-Control: We count events of exposure.

| | Case | Control |
|---|---|---|
| Exposed | a | b |
| Not | c | d |
| | FIXED | FIXED |

The count "a" is distributed Binomial(# trials = a+c, $\pi$ = Prob[exposed | case]).
The count "b" is distributed Binomial(# trials = b+d, $\pi$ = Prob[exposed | control]).

2. Cohort: We count events of disease.

| | Disease | Not | |
|---|---|---|---|
| Exposed | a | b | FIXED |
| Unexposed | c | d | FIXED |

The count "a" is distributed Binomial(# trials = a+b, $\pi$ = Prob[disease | exposed]).
The count "c" is distributed Binomial(# trials = c+d, $\pi$ = Prob[disease | unexposed]).

3. 2x2 Table: We count events of joint occurrence of exposure and disease, with all margins fixed.

| | Disease | Not | |
|---|---|---|---|
| Exposed | a | b | FIXED |
| Not | c | d | FIXED |
| | FIXED | FIXED | |

The count "a" is distributed Hypergeometric.

4. 2x2 Table: We count events of all 4 types of joint events, with no margins fixed.

| | Disease | Not |
|---|---|---|
| Exposed | a | b |
| Not | c | d |

The count "a" is distributed Poisson($\mu_a$).
The count "b" is distributed Poisson($\mu_b$).
The count "c" is distributed Poisson($\mu_c$).
The count "d" is distributed Poisson($\mu_d$).

5. RxC Table, General.

| | Mild | Moderate | Severe | |
|---|---|---|---|---|
| Exposed | a | b | c | FIXED |
| Not | d | e | f | FIXED |

The triplet of counts (a, b, c) is distributed Multinomial.
The triplet of counts (d, e, f) is distributed Multinomial.

Note: The multinomial distribution has not been discussed in this course. It is an extension of the Binomial distribution to the setting of more than two outcomes.

Appendix C: Concepts of Observed versus Expected

In categorical data analysis methodology, we compare observed counts of events with expected counts of events.
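The observed-versus-expected comparison can be sketched in a few lines of Python, using the seedling goodness-of-fit example from section 11. This is only an illustrative sketch, not part of the original notes: the variable names are made up, and the helper `chi2_sf_df3` is an assumption that uses a closed form for the upper tail of the chi square distribution that holds for 3 degrees of freedom only.

```python
import math

# Observed seedling counts and the hypothesized 9:3:3:1 ratio (section 11 example).
observed = {"yellow smooth": 152, "yellow wrinkled": 39,
            "green smooth": 53, "green wrinkled": 6}
ratio = {"yellow smooth": 9, "yellow wrinkled": 3,
         "green smooth": 3, "green wrinkled": 1}

n = sum(observed.values())                      # sample size, 250
ratio_total = sum(ratio.values())               # 16

# Expected count = sample size x hypothesized probability of the category.
expected = {k: n * r / ratio_total for k, r in ratio.items()}

# Component chi squares, (O - E)^2 / E, and their sum.
components = {k: (observed[k] - expected[k]) ** 2 / expected[k] for k in observed}
chi2_gof = sum(components.values())

# DF = K - 1 - (# parameters estimated); nothing was estimated here.
df = len(observed) - 1 - 0


def chi2_sf_df3(x):
    """Upper-tail probability of a chi square variable with exactly 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_sf_df3(chi2_gof)

for k in observed:
    print(f"{k:16s} O={observed[k]:4d} E={expected[k]:8.3f} component={components[k]:.4f}")
print(f"chi square = {chi2_gof:.3f}, df = {df}, p-value = {p_value:.4f}")
```

The printed total should agree with the 8.972 and p-value of about 0.03 reported in the seedling example. For other degrees of freedom a statistics library would be the natural choice for the tail probability.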
Emphasis on "counts"!

Consider an investigation of a possible association between electronic fetal monitoring (EFM) and delivery by caesarian section.

| EFM Exposure | Caesarian Section: Yes | Caesarian Section: No | |
|---|---|---|---|
| Yes | 5 | 1 | 6 |
| No | 2 | 7 | 9 |
| | 7 | 8 | 15 |

The observed counts are:

- # with EFM exposure=yes AND Caesarian section=yes: 5
- # with EFM exposure=yes AND Caesarian section=no: 1
- # with EFM exposure=no AND Caesarian section=yes: 2
- # with EFM exposure=no AND Caesarian section=no: 7

The expected counts depend on what we believe.

Absent a null hypothesis:

Cohort Study. Suppose we allow for the possibility of different probabilities of caesarian section for EFM exposed women versus non-EFM exposed women.

- Best guess of pr[caesarian section] for EFM exposed women = 5/6
- Best guess of pr[caesarian section] for non-EFM exposed women = 2/9

Case-Control Study. Suppose we allow for the possibility of different probabilities of history of EFM exposure for caesarian section women versus non-caesarian section women.

- Best guess of pr[EFM history] for C-section women = 5/7
- Best guess of pr[EFM history] for non C-section women = 1/8

Expected Counts Under Independence / No Association / Homogeneity

$$\text{Expected}_{\text{row},\,\text{col}} = \frac{[\text{row total}][\text{column total}]}{\text{grand total}}$$

Example: Expected Count in a Cohort Study

Viewed as a cohort study, the outcome is caesarian section. The null hypothesis of independence (no association, homogeneity of proportions) suggests that:

- Best guess of pr[caesarian section] = overall proportion of c-section: $\hat{\pi}_{csection} = 7/15$ = (column "yes" total)/(grand total)
- Best guess of pr[NO caesarian section] = overall proportion of NON c-section: $\hat{\pi}_{NO\,csection} = 8/15$ = (column "no" total)/(grand total)

| EFM | Caesarian Section: Yes | Caesarian Section: No |
|---|---|---|
| Yes | $n_{efm=yes}\,\hat{\pi}_{csection}$ = (6)(7/15) = (row "yes" total)(column "yes" total)/(grand total) | $n_{efm=yes}\,\hat{\pi}_{NO\,csection}$ = (6)(8/15) = (row "yes" total)(column "no" total)/(grand total) |
| No | $n_{efm=no}\,\hat{\pi}_{csection}$ = (9)(7/15) = (row "no" total)(column "yes" total)/(grand total) | $n_{efm=no}\,\hat{\pi}_{NO\,csection}$ = (9)(8/15) = (row "no" total)(column "no" total)/(grand total) |

Example: Expected Count in a Case-Control Study

Viewed as a case-control study, the outcome is history of EFM exposure. The null hypothesis of independence (no association, homogeneity of proportions) suggests that:

- Best guess of pr[hx EFM] = overall proportion of EFM exposure: $\hat{\pi}_{hx\,EFM} = 6/15$ = (row "yes" total)/(grand total)
- Best guess of pr[NO hx EFM] = overall proportion of NO EFM exposure: $\hat{\pi}_{NO\,hx\,EFM} = 9/15$ = (row "no" total)/(grand total)

| EFM | Caesarian Section: Yes | Caesarian Section: No |
|---|---|---|
| Yes | $n_{csection=yes}\,\hat{\pi}_{hx\,EFM}$ = (7)(6/15) = (column "yes" total)(row "yes" total)/(grand total) | $n_{csection=no}\,\hat{\pi}_{hx\,EFM}$ = (8)(6/15) = (column "no" total)(row "yes" total)/(grand total) |
| No | $n_{csection=yes}\,\hat{\pi}_{NO\,hx\,EFM}$ = (7)(9/15) = (column "yes" total)(row "no" total)/(grand total) | $n_{csection=no}\,\hat{\pi}_{NO\,hx\,EFM}$ = (8)(9/15) = (column "no" total)(row "no" total)/(grand total) |

Observed and Expected Counts: General R x C Table

A useful notation is "O" for observed and "E" for expected, with the following subscripts:

- $O_{ij}$ = observed count in row i and column j
- $E_{ij}$ = expected count in row i and column j
- $O_{i.} = E_{i.} = n_{i.}$ = observed (and expected) row total for row i
- $O_{.j} = E_{.j} = n_{.j}$ = observed (and expected) column total for column j

Yes, it's true. Under the null hypothesis, the expected and observed totals (row totals, column totals, grand total) match.

Observed Counts

$$\begin{array}{c|ccc|c} & j=1 & \cdots & j=C & \\ \hline i=1 & O_{11} & \cdots & O_{1C} & N_{1.}=O_{1.} \\ \vdots & \vdots & & \vdots & \vdots \\ i=R & O_{R1} & \cdots & O_{RC} & N_{R.}=O_{R.} \\ \hline & N_{.1}=O_{.1} & \cdots & N_{.C}=O_{.C} & N=O_{..} \end{array}$$

Expected Counts under Null (Independence, No Association, Homogeneity)

$$\begin{array}{c|ccc|c} & j=1 & \cdots & j=C & \\ \hline i=1 & E_{11}=\dfrac{n_{1.}n_{.1}}{n_{..}} & \cdots & E_{1C}=\dfrac{n_{1.}n_{.C}}{n_{..}} & N_{1.}=O_{1.} \\ \vdots & \vdots & & \vdots & \vdots \\ i=R & E_{R1}=\dfrac{n_{R.}n_{.1}}{n_{..}} & \cdots & E_{RC}=\dfrac{n_{R.}n_{.C}}{n_{..}} & N_{R.}=O_{R.} \\ \hline & N_{.1}=O_{.1} & \cdots & N_{.C}=O_{.C} & N=O_{..} \end{array}$$

Appendix D: Review, Measures of Association

Recall that various epidemiological studies (prevalence, cohort, case-control) give rise to data in the form of counts in a 2x2 table. Recall again the goal of assessing the association between exposure and disease in a 2x2 table of counts represented using the "a", "b", "c", and "d" notation:

| | Disease | Healthy | |
|---|---|---|---|
| Exposed | a | b | a+b |
| Not Exposed | c | d | c+d |
| | a+c | b+d | |

Let's consider some actual counts:

| | Disease | Healthy | |
|---|---|---|---|
| Exposed | 2 | 8 | 10 |
| Not Exposed | 10 | 290 | 300 |
| | 12 | 298 | 310 |

We might have more than one 2x2 table if the population of interest is partitioned into subgroups or strata. Example: stratification by gender would yield a separate 2x2 table for men and women.

A good measure of association is a single measure that is stable over the various characteristics (strata) of the population.

Excess Risk

Suppose that the cumulative incidence of disease among exposed = $\pi_1$ and that the cumulative incidence of disease among non-exposed = $\pi_0$.

Excess Risk: the difference between the cumulative incidence rates, $\beta = \pi_1 - \pi_0$.

Example: In our 2x2 table, we have $\hat{\pi}_1 = 2/10 = 0.20$ and $\hat{\pi}_0 = 10/300 = 0.0333$. Thus $\hat{\beta} = 0.20 - 0.0333 = 0.1667$.

- The effect of exposure is said to be additive because we can write $\pi_1 = \pi_0 + \beta$.
- Hypothesis testing focuses on $H_0: \beta = 0$.
- For a population that has been stratified with strata k = 1, ..., K, the additive model says that $\pi_{k1} = \pi_{k0} + \beta$. Note: the absence of a subscript k on the excess risk $\beta$ says that we are assuming that the excess risk is constant in every stratum (e.g. among men and women).
- Biological mechanisms which relate exposure to disease in an additive model often do not operate in the same way across strata.
- If so, the additive risk model does not satisfy our criterion of being stable.

Relative Risk (RR)

The relative risk is the ratio of the cumulative incidence rate of disease among the exposed, $\pi_1$, to the cumulative incidence rate of disease among the non-exposed, $\pi_0$.

Relative Risk: the ratio of the cumulative incidence rates, $RR = \pi_1 / \pi_0$.

Example: In our 2x2 table, we have $\hat{\pi}_1 = 2/10 = 0.20$ and $\hat{\pi}_0 = 10/300 = 0.0333$. Thus RR = 0.20/0.0333 = 6.006.

- The effect of exposure is said to be multiplicative because we can write $\pi_1 = \pi_0 \cdot RR$.
- Hypothesis testing focuses on $H_0: RR = 1$.
- This model is also said to be additive on the log scale. It is also said to be an example of a log-linear model. To see this:

$$\pi_1 = \pi_0 \cdot RR \;\Rightarrow\; \ln[\pi_1] = \ln[\pi_0] + \ln[RR] \;\Rightarrow\; \ln[\pi_1] = \ln[\pi_0] + \beta \quad\text{where}\quad \beta = \ln[RR]$$

- It has been found empirically that many exposure-disease relationships vary with age in such a way that the log-linear model is a good description. Specifically, the change with age in the relative risk of disease with exposure is reasonably stable. In such instances, the model is preferable to the additive risk model.

Attributable Risk

The attributable risk is the proportion of the incidence of disease among exposed persons that is in excess of the incidence of cases of disease among non-exposed persons. Often, it is expressed as a percent.

$$\text{Attributable Risk: } AR = \frac{\pi_1 - \pi_0}{\pi_1} \times 100\% \text{ when expressed as a percent}$$

Recalling that $RR = \pi_1/\pi_0$ reveals that

$$AR = \frac{RR - 1}{RR}$$

Example: In our 2x2 table, RR = 6.006 yields an attributable risk value of AR = (6.006 - 1)/6.006 = 0.8335 = 83.35%.

Odds Ratio

Recall that the odds ratio measure of association has some wonderful advantages, both biological and analytical. Recall first the meaning of an "odds":

$$\text{If Probability[event]} = \pi \text{ then Odds[event]} = \frac{\pi}{1-\pi}$$

Let's look at the odds that are possible in our 2x2 table:

| | Disease | Healthy | |
|---|---|---|---|
| Exposed | a | b | a+b |
| Not Exposed | c | d | c+d |
| | a+c | b+d | |

Cohort study design:

$$\text{Estimated odds of disease among exposed} = \frac{a/(a+b)}{b/(a+b)} = \frac{a}{b} = \frac{2}{8} = 0.25$$

$$\text{Estimated odds of disease among non-exposed} = \frac{c/(c+d)}{d/(c+d)} = \frac{c}{d} = \frac{10}{290} = 0.0345$$

Case-control study design:

$$\text{Estimated odds of exposure among diseased} = \frac{a/(a+c)}{c/(a+c)} = \frac{a}{c} = \frac{2}{10} = 0.20$$

$$\text{Estimated odds of exposure among healthy} = \frac{b/(b+d)}{d/(b+d)} = \frac{b}{d} = \frac{8}{290} = 0.0276$$

Odds ratio, cohort study design:

$$OR = \frac{\text{Odds of disease among exposed}}{\text{Odds of disease among non-exposed}} = \frac{a/b}{c/d} = \frac{ad}{bc}$$

Odds ratio, case-control study design:

$$OR = \frac{\text{Odds of exposure among diseased}}{\text{Odds of exposure among healthy}} = \frac{a/c}{b/d} = \frac{ad}{bc}$$

Terrific! The OR is the same regardless of the study design, cohort (prospective) or case-control (retrospective).

Example: In our 2x2 table, a = 2, b = 8, c = 10, and d = 290, so the OR = 7.25. This is slightly larger than the value of the RR = 6.006.

Thus, there are advantages of the Odds Ratio (OR):

1. Many exposure-disease relationships are described better using ratio measures of association rather than difference measures of association.
2. OR(cohort study) = OR(case-control study).
3. The OR is the appropriate measure of association in a case-control study. Note that it is not possible to estimate an incidence of disease in a retrospective study. This is because we select our study persons based on their disease status.
4. When the disease is rare, OR(case-control) ≈ RR.

Appendix E: Review, Confounding of Rates

Is our estimate of a disease-exposure relationship measuring what we think it is? Or is there some other influence that plays a role? The presence of other influences might be as confounders or effect modifiers.

- A confounded association does not tell us about the association of interest. A confounded relationship is biased because of an extraneous variable.
- An effect modified relationship changes with variations in the extraneous variable.

Several examples illustrate these ideas.

Example: Among 600 women, it appears that nulliparity is protective against breast cancer.

| Exposure Status | Breast Cancer | Control | |
|---|---|---|---|
| Null | 120 (40% = 120/300) | 180 (60% = 180/300) | 300 |
| Parous | 180 | 120 | 300 |
| | 300 | 300 | 600 |

Odds Ratio = 0.44

However, when we take into account exposure to radiation, a different story emerges.

No radiation:

| | Cancer | Control | |
|---|---|---|---|
| Null | 30 | 170 | 200 |
| Parous | 10 | 90 | 100 |
| | 40 | 260 | 300 |

Odds Ratio = 1.6

Radiation:

| | Cancer | Control | |
|---|---|---|---|
| Null | 90 | 10 | 100 |
| Parous | 170 | 30 | 200 |
| | 260 | 40 | 300 |

Odds Ratio = 1.6

The unadjusted odds ratio of 0.44 is reversed.
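The reversal can be checked numerically. The short Python sketch below recomputes the crude and stratum-specific odds ratios from the nulliparity tables, using OR = ad/bc as defined in Appendix D. The function and variable names are illustrative only, and the pooled value uses the usual Mantel-Haenszel summary estimator, sum(ad/T) / sum(bc/T), which is an assumption here since the notes report OR_MH without restating its formula in this section.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio ad/bc for a 2x2 table laid out as [[a, b], [c, d]]."""
    return (a * d) / (b * c)


# Rows: nulliparous, parous. Columns: cancer, control.
crude = (120, 180, 180, 120)
strata = {
    "no radiation": (30, 170, 10, 90),
    "radiation": (90, 10, 170, 30),
}

crude_or = odds_ratio(*crude)
stratum_or = {name: odds_ratio(*cells) for name, cells in strata.items()}

# Mantel-Haenszel summary odds ratio across the strata (assumed estimator).
num = sum(a * d / (a + b + c + d) for a, b, c, d in strata.values())
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata.values())
or_mh = num / den

print(f"crude OR = {crude_or:.2f}")
for name, value in stratum_or.items():
    print(f"{name}: OR = {value:.2f}")
print(f"Mantel-Haenszel OR = {or_mh:.2f}")
```

The crude value comes out below 1 while both stratum-specific values (and the pooled value) come out above 1, which is the reversal described here.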
It now appears that nulliparity is a risk factor for breast cancer; this is reflected in the odds ratio that is greater than 1.

How did this apparent contradiction occur?

- In the nulliparous group, there are disproportionately fewer women exposed to radiation.
- Women exposed to radiation are more likely to have breast cancer.
- Women exposed to radiation were less likely to be nulliparous, with the result that
- OR = 0.44 is biased due to the confounding effect of exposure to radiation.

The calculation of an association (for example, an RR or an OR) for a 2x2 table of counts may be misleading because of one or more extraneous influences. An extraneous influence can be a confounder, an effect modifier, both, or neither.

- A confounded association is biased and does not tell us about the association of interest.
- An effect modified relationship changes with variations in the extraneous variable.

Intuitively, confounding is the:

- Distortion of a predictor-outcome relationship due to a third variable that is related to both predictor and outcome.
- The bias from confounding can be a spurious strengthening, weakening, elimination, or reversal.
- A reversal is said to be an example of Simpson's Paradox.

Apparent, but not true, confounding can occur in the absence of a relationship between exposure and disease.

Example: Are breath mints associated with cancer?

| Exposure Status | Cancer | Control | |
|---|---|---|---|
| Breath Mints | 200 (75% = 200/267) | 1646 (18% = 1646/8935) | 1846 |
| None | 67 | 7289 | 7356 |
| | 267 | 8935 | 9202 |

Odds Ratio = 13.22

It looks like we should not be eating breath mints. What happens if we control for smoking?

Smokers:

| | Cancer | Control | |
|---|---|---|---|
| Breath Mints | 194 | 706 | 900 |
| None | 21 | 79 | 100 |
| | 215 | 785 | 1000 |

Odds Ratio = 1.03

Non-Smokers:

| | Cancer | Control | |
|---|---|---|---|
| Breath Mints | 6 | 940 | 946 |
| None | 46 | 7210 | 7256 |
| | 52 | 8150 | 8202 |

Odds Ratio = 1.00

Controlling for smoking, eating breath mints is no longer associated with cancer.

If the extraneous variable has no effect on disease, then it will not cause confounding.

Example: Hot tea is suspected of being associated with esophageal cancer.

| Exposure Status | Cancer | Control | |
|---|---|---|---|
| Tea | 1420 (94% = 1420/1504) | 3650 (81% = 3650/4499) | 5070 |
| Water | 84 | 849 | 933 |
| | 1504 | 4499 | 6003 |

Odds Ratio = 3.93

Notice that the tea drinkers have disproportionately fewer smokers:

| Exposure (Drink) | Smoker | NON Smoker | |
|---|---|---|---|
| Tea | 70 (1.4% = 70/5070) | 5000 | 5070 |
| Water | 833 (89% = 833/933) | 100 | 933 |
| | 903 | 5100 | 6003 |

Interestingly, smoking status does not distort the association of tea with cancer.

SMOKERS:

| | Cancer | Control | |
|---|---|---|---|
| Tea | 20 | 50 | 70 |
| Water | 75 | 758 | 833 |
| | 95 | 808 | 903 |

Odds Ratio = 4.04

NON-SMOKERS:

| | Cancer | Control | |
|---|---|---|---|
| Tea | 1400 | 3600 | 5000 |
| Water | 9 | 91 | 100 |
| | 1409 | 3691 | 5100 |

Odds Ratio = 3.93

This is because smoking itself is not associated with esophageal cancer.

WATER:

| | Cancer | Control | |
|---|---|---|---|
| SMOKER | 75 | 758 | 833 |
| NOT | 9 | 91 | 100 |
| | 84 | 849 | 933 |

Odds Ratio = 1.00

TEA:

| | Cancer | Control | |
|---|---|---|---|
| SMOKER | 20 | 50 | 70 |
| NOT | 1400 | 3600 | 5000 |
| | 1420 | 3650 | 5070 |

Odds Ratio = 1.03

Thus:

- It is possible to observe a strong relationship between the extraneous variable (smoking) and exposure (tea)
- with NO confounding of the exposure-disease relationship of interest.
- This will occur when the extraneous variable is unrelated to the disease outcome.

If the extraneous variable has no relationship to exposure, then it will not cause confounding.

Example: A crude analysis suggests that use of sugar substitutes is associated with bladder cancer.

| Exposure Status | Cancer | Healthy | |
|---|---|---|---|
| Substitute | 106 (75%) | 738 (13%) | 844 |
| Sugar | 35 | 5149 | 5184 |
| | 141 | 5887 | 6028 |

Odds Ratio = 21.13

However, we have learned that smoking is associated with bladder cancer.

| | Cancer | Healthy | |
|---|---|---|---|
| Smoker | 127 (90%) | 3051 (52%) | 3178 |
| NON Smoker | 14 | 2836 | 2850 |
| | 141 | 5887 | 6028 |

Odds Ratio = 8.43

However, the variable smoking is not related to the use of sugar substitutes.

| | Substitute | Sugar | |
|---|---|---|---|
| Smoker | 445 (14%) | 2733 | 3178 |
| NON Smoker | 399 (14%) | 2451 | 2850 |
| | 844 | 5184 | 6028 |

Odds Ratio = 1.0

The independence of smoking and sugar substitute use means that the stratum-specific odds ratios will be close to the unadjusted odds ratio.

Stratum: Smokers

| | Cancer | Control | |
|---|---|---|---|
| Substitute | 95 | 350 | 445 |
| Sugar | 32 | 2701 | 2733 |
| | 127 | 3051 | 3178 |

Odds Ratio = 22.91

Stratum: NON-Smokers

| | Cancer | Control | |
|---|---|---|---|
| Substitute | 11 | 388 | 399 |
| Sugar | 3 | 2488 | 2491 |
| | 14 | 2876 | 2890 |

Odds Ratio = 23.51

Thus, an extraneous variable unrelated to exposure does not cause confounding.

We have what we need to define confounding.

Definition of Confounding: A variable is confounding if

1. It is extraneous, not intermediary;
2. It is related to disease, BOTH among the exposed AND among the unexposed; and
3. It is related to exposure.

Recall that an intermediary variable is an intermediate in a causal pathway. Example: Coal dust → Asthma → Lesions on Lung. Asthma is the intermediary variable. Stratification on an intermediary variable eliminates the exposure-disease relationship.

When we discuss the logistic regression model, we'll learn about effect modification.

Appendix F: Computer Resources

Applets

- Applet/Calculator for Analysis of 2 Way Contingency Table, courtesy of John C Pezullo, PhD: http://statpages.org/ctab2x2.html
- Fisher's Exact Test. Source: Vassar Stats: http://faculty.vassar.edu/lowry/ch8a.html
- Chi Square Test for General R x C Table (maximum 9 x 9). Source: Colorado State University: http://www.physics.csbsju.edu/stats/contingency.html
- Standardized Mortality Ratio Calculation (download an Excel file calculator). Source: Pennsylvania Department of Health: http wwwdsfhealthstatepaushealthcwpView asp q 202 1 14
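As a complement to the applets above, the stratified calculations of section 9 (Woolf test of homogeneity, Mantel-Haenszel test of no association) can be sketched in plain Python. This is a hedged illustration rather than the software used in the course: the function names are made up, the counts are copied from the month-of-gestation example, and the two tail-probability helpers are assumptions that rely on closed forms valid only for df = 1 and for even df.

```python
import math

# (a, b, c, d) for each month of gestation, from the section 9 example.
strata = [
    (10, 502, 1, 365),
    (38, 464, 30, 335),
    (15, 447, 12, 323),
    (7, 442, 5, 318),
    (2, 440, 4, 314),
    (4, 436, 1, 313),
    (2, 434, 1, 312),
]


def woolf(tables):
    """Woolf chi square for homogeneity of odds ratios; returns (statistic, df)."""
    w = [1.0 / (1 / a + 1 / b + 1 / c + 1 / d) for a, b, c, d in tables]
    ln_or = [math.log(a * d / (b * c)) for a, b, c, d in tables]
    sum_w = sum(w)
    sum_wl = sum(wi * li for wi, li in zip(w, ln_or))
    sum_wl2 = sum(wi * li ** 2 for wi, li in zip(w, ln_or))
    return sum_wl2 - sum_wl ** 2 / sum_w, len(tables) - 1


def mantel_haenszel(tables):
    """Returns (chi square, A, E[A], Var[A], summary OR) over the strata."""
    A = E = V = num = den = 0.0
    for a, b, c, d in tables:
        m1, m0 = a + b, c + d          # row totals
        n1, n0 = a + c, b + d          # column totals
        t = m1 + m0                    # stratum total
        A += a
        E += n1 * m1 / t
        V += n1 * n0 * m1 * m0 / (t ** 2 * (t - 1))
        num += a * d / t
        den += b * c / t
    return (A - E) ** 2 / V, A, E, V, num / den


def chi2_sf_df1(x):
    """Upper-tail probability of chi square with 1 df."""
    return math.erfc(math.sqrt(x / 2))


def chi2_sf_even(x, df):
    """Upper-tail probability of chi square with an even number of df."""
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


woolf_chi2, woolf_df = woolf(strata)
mh_chi2, A, EA, VA, or_mh = mantel_haenszel(strata)

print(f"Woolf chi square = {woolf_chi2:.4f}, df = {woolf_df}, "
      f"p = {chi2_sf_even(woolf_chi2, woolf_df):.3f}")
print(f"MH: A = {A:.0f}, E[A] = {EA:.4f}, Var[A] = {VA:.4f}, "
      f"chi square = {mh_chi2:.4f}, p = {chi2_sf_df1(mh_chi2):.2f}, OR_MH = {or_mh:.3f}")
```

Because the code carries full precision while the worked table in section 9 uses rounded weights, the Woolf statistic may differ from the tabled 6.1284 in the second or third decimal place. The Mantel-Haenszel quantities should match the tabled totals closely.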
