Intermediate Biostatistics (PUBHLTH 640)
These 78-page class notes for PUBHLTH 640 at the University of Massachusetts, taught by Carol Bigelow in Fall, were uploaded by Agustin Bechtelar on Friday, October 30, 2015.
Date Created: 10/30/15
PubHlth 640, Unit 4: Categorical Data Analysis

Topics
 1. Lesson Overview and Example
 2. Other Examples of Categorical Data
 3. Hypotheses of Independence, No Association, Homogeneity
 4. The Chi Square Test of No Association in an R x C Table
 5. Rejection of Independence: The Chi Square Residual
 6. Confidence Interval Estimation of RR and OR
 7. Strategies for Controlling Confounding
 8. Standardization of Rates
    A. Indirect Standardization (SMR)
    B. Direct Standardization (SRR)
 9. Stratified Analysis of Rates
    A. Woolf Test of Homogeneity of Odds Ratios
    B. Mantel-Haenszel Test of No Association
10. Factors Associated with Mammography Screening
11. The Chi Square Goodness-of-Fit Test

Appendices
 A. The Chi Square Distribution
 B. Probability Models for the 2x2 Table
 C. Concepts of Observed and Expected
 D. Review: Measures of Association in a 2x2 Table
 E. Review: Confounding of Rates
 F. Computer Resources

1. Lesson Overview and Example

In Unit 3 (Discrete Distributions), the Bernoulli, binomial, Poisson, and central hypergeometric distributions were introduced. Fisher's exact test for association was also introduced.
 • The setting was the single 2x2 table of count data.
 • The focus was the null hypothesis of no association (e.g., no association of exposure with disease).

In Unit 4 (Categorical Data Analysis), we consider extensions of the 2x2 table of count data, e.g.:
 • An r x c table, where r = the number of rows, c = the number of columns, and r ≥ 2 and c ≥ 2.
 • Multiple 2x2 tables.

You will learn the following:
 • The chi square test for the null hypothesis of no association in an r x c table.
 • Strategies for the analysis of multiple 2x2 tables, in particular stratified analysis of association, including Mantel-Haenszel methods and standardization of rates.

The focus is contingency table approaches for the analysis of proportions and rates. The methods have important advantages, including: (1) they are
model free and (2) they allow us to see the data in ways that are not possible in regression. Regression approaches for analyses of proportions and rates (e.g., logistic regression, Poisson regression, Cox proportional hazards regression) are introduced in Unit 5 (Logistic Regression) and Unit 6 (Introduction to Survival Analysis).

Data Example
Source: Fisher LD and Van Belle G. Biostatistics: A Methodology for the Health Sciences. New York: John Wiley, 1993.

It is of interest to explore the question of a relationship between coffee consumption and cardiovascular risk. However, the issue is made more difficult because many coffee drinkers are also smokers, and smoking is itself a risk factor for heart disease. The goal is to estimate the nature and strength of a coffee-MI relationship that is independent of the role of smoking.

For each category of smoking, the following summarizes the proportion of cases of MI among low coffee drinkers (left bar) and among high coffee drinkers (right bar).

[Figure: bar charts of the proportion of MI cases by coffee consumption, one panel per smoking category: Never Smoked, Former Smoker, ... (some rows omitted) ..., 45 cigarettes/day.]

 • Among never smokers, the data suggest a positive coffee-MI relationship.
 • Among former smokers, the coffee-MI association is less strong.
 • Among frequent smokers, there is no longer evidence of a coffee-MI association.

2. Other Examples of Categorical Data

In Introduction to Biostatistics (PubHlth 540), Unit 8 (Chi Square Tests), contingency tables were introduced. These are summaries of categorical data. Various epidemiological study designs give rise to categorical data. Here are some other examples.

Example: Single group intervention
Does minoxidil show promise for the treatment of hair loss?

 N = 13 volunteers
   -> Administer minoxidil
   -> Wait 6 months
   -> Count occurrences of new hair growth. Call this X. Suppose we observe X = 12.

Possible values of X = count of occurrences of new hair growth are 0, 1, 2, ..., 13. Thus, IF (1) p
= probability[new hair growth] is the same for all 13 volunteers, and (2) outcomes for each of the 13 volunteers are independent, THEN X is distributed Binomial(N = 13, p).

The Binomial is an example of a categorical distribution.

Example: Two group randomized controlled trial
In controlled analysis, does minoxidil work for the treatment of hair loss?

 Consent of N = 30 volunteers
   -> Randomization
      Standard care (N1 = 17): administer standard care; wait 6 months; count occurrences of new hair growth. Call this X1.
      Minoxidil (N2 = 13): administer minoxidil; wait 6 months; count occurrences of new hair growth. Call this X2.

This design produces a 2x2 table array of count data that is correctly modeled using two binomial distributions.

                  New Growth    Not    Total
 Minoxidil        X2 = 12              N2 = 13
 Standard care    X1 = 6               N1 = 17

IF
 (1) p1 = probability[new hair growth on standard care],
 (2) p2 = probability[new hair growth on minoxidil], and
 (3) the outcomes for all 30 trial participants are independent,
THEN
 (1) X1 is distributed Binomial(N1 = 17, p1)
 (2) X2 is distributed Binomial(N2 = 13, p2)

The product of two binomial distributions in a 2x2 array of randomized trial data is an example of a categorical distribution.

Example: Case-Control Study
Is history of oral contraceptive (OC) use associated with thromboembolism?

 Enroll cases of thromboembolism (N1 = 100): query history of OC use; count histories of OC use. Call this X1.
 Enroll controls without thromboembolism (N2 = 200): query history of OC use; count histories of OC use. Call this X2.

This design also produces a 2x2 table array of count data that is correctly modeled using two binomial distributions.

                       Case         Control
 History of OC use     X1 = 65      X2 = 118
 Not
 Total                 N1 = 100     N2 = 200

Reminder: A case-control design does not permit the estimation of probabilities of disease.

IF
 (1) p1 = probability[history of OC use among cases],
 (2) p2 = probability[history of OC use among controls], and
 (3) the histories for all 300 observations are independent,
THEN
 (1) X1 is distributed Binomial(N1 = 100, p1)
 (2) X2
is distributed Binomial(N2 = 200, p2)

The product of two binomial distributions in a 2x2 array of case-control data is an example of a categorical distribution.

Example: Cross-sectional Prevalence Study
WHO investigated the variation in prevalence of Alzheimer's Disease with race/ethnicity.

                          Alzheimer's Disease    No Alzheimer's Disease
 African Black                     115                   22885
 Native Japan                     7560                   46440
 European White                 105930                  857070
 South Pacific                      21                    8479
 North American Indian              44                   10956

This design produces a 5x2 table array of count data that is correctly modeled using five binomial distributions.

3. Hypothesis of Independence, No Association, Homogeneity of Proportions

Review: "Independence", "No Association", and "Homogeneity of Proportions" are equivalent null hypotheses. For example:
 1. "Length of time since last visit to physician is independent of income" says that income has no bearing on the elapsed time between visits to a physician. The expected elapsed time is the same regardless of income level.
 2. "There is no association between coffee consumption and lung cancer" says that an individual's likelihood of lung cancer is not affected by his or her coffee consumption.
 3. "The equality of probability of success on treatment (experimental versus standard of care)" in a randomized trial of two groups is a test of homogeneity of proportions.

The hypotheses of independence, no association, and homogeneity of proportions are equivalent in an analysis of contingency table data.

4. The Chi Square Test of No Association in an R x C Table

Example: Is there an association between income level and the time elapsed since last visit to a physician (HA)? Or is time elapsed independent of income level (HO)?

                          Last Consulted Physician
 Income           ≤ 6 months    7-12 months    > 12 months    Total
 < 6000               186            38             35          259
 6000-9999            227            54             45          326
 10000-13999          219            78             78          375
 14000-19999          355           112            140          607
 ≥ 20000              653           285            259         1197
 Total               1640           567            557         2764
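As a preview of the computation that the steps below build up by hand, here is a minimal Python sketch of the chi square test of no association applied to the income by elapsed-time table above. The function names `chi_square_no_association` and `chi_square_upper_tail_even_df` are illustrative choices, not part of the course materials, and the closed-form tail probability used here is valid only when the degrees of freedom are even (as they are for this 5 x 3 table). Under those assumptions it should give a statistic close to the 47.90 on 8 degrees of freedom reported in Step 8.

```python
from math import exp, factorial

# Observed counts from the income x time-since-last-visit table above
# (rows: income groups; columns: <= 6, 7-12, > 12 months).
observed = [
    [186, 38, 35],
    [227, 54, 45],
    [219, 78, 78],
    [355, 112, 140],
    [653, 285, 259],
]


def chi_square_no_association(table):
    """Return (statistic, df, expected counts) for an R x C table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: (row total)(column total) / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(table, expected)
        for o, e in zip(o_row, e_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


def chi_square_upper_tail_even_df(x, df):
    """P[chi-square(df) >= x]; this closed form holds only for even df."""
    if df % 2 != 0:
        raise ValueError("this closed form requires an even number of df")
    half = x / 2
    return exp(-half) * sum(half ** k / factorial(k) for k in range(df // 2))


stat, df, expected = chi_square_no_association(observed)
p_value = chi_square_upper_tail_even_df(stat, df)
print(f"chi-square = {stat:.2f}, df = {df}, p-value = {p_value:.1e}")
```

For tables with an odd number of degrees of freedom, a statistical library routine for the chi square tail probability would be needed in place of the closed form.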
Step 1: Begin the proof-by-contradiction argument with the null hypothesis of independence.

We consider a cross-sectional design view: what are the chances of any particular combination (income x elapsed time)?

Notation for probabilities:
 • $\pi_{ij}$ = Pr[income is level i and elapsed time is level j]. A natural estimate is the observed proportion, $\hat{\pi}_{ij}$ = (count in row i AND column j)/(total). For example, when i = 1 and j = 1, $\hat{\pi}_{11} = 186/2764$.
 • $\pi_{i\cdot}$ = Pr[income is level i], estimated by $\hat{\pi}_{i\cdot}$ = (row i total)/(total). When i = 1, $\hat{\pi}_{1\cdot} = 259/2764$.
 • $\pi_{\cdot j}$ = Pr[elapsed time is level j], estimated by $\hat{\pi}_{\cdot j}$ = (column j total)/(total). When j = 1, $\hat{\pi}_{\cdot 1} = 1640/2764$.

Recall the meaning of independence of two coin tosses:
 Prob[heads on toss 1 and heads on toss 2] = Prob[heads on toss 1] x Prob[heads on toss 2]

The statistical model of independence in an r x c table has the same intuition:
 Prob[income is level i and elapsed time is level j] = Prob[income is level i] x Prob[elapsed time is level j]
That is,
 $\pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$

Step 2: State Assumptions
 1. The contingency table of count data is a random sample from some population.
 2. The cross-classification of each individual is independent of the cross-classifications of all other individuals.

The typical layout may use either $n_{ij}$ or $O_{ij}$ notation.

                Columns, j
 Rows, i        j = 1             ...    j = C             Total
 i = 1          O_11 = n_11       ...    O_1C = n_1C       N_1. = O_1.
 ...
 i = R          O_R1 = n_R1       ...    O_RC = n_RC       N_R. = O_R.
 Total          N_.1 = O_.1       ...    N_.C = O_.C       N = O_..

Step 3: State Null and Alternative Hypotheses
 HO: $\pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$ (this is the null hypothesis of independence again)
 HA: $\pi_{ij} \neq \pi_{i\cdot}\,\pi_{\cdot j}$

Step 4: Estimate the $\pi_{ij}$ under the Null Hypothesis of Independence
 $\hat{\pi}_{ij} = \hat{\pi}_{i\cdot}\,\hat{\pi}_{\cdot j}$, where
 $\hat{\pi}_{i\cdot} = n_{i\cdot}/n$ = (row i total)/(grand total) and $\hat{\pi}_{\cdot j} = n_{\cdot j}/n$ = (column j total)/(grand total).

Step 5: Reason Null Hypothesis Expected Counts $E_{ij}$
 $E_{ij}$ = (# trials)[$\hat{\pi}_{ij}$ under null] = $n\,\hat{\pi}_{i\cdot}\,\hat{\pi}_{\cdot j}$ = (row i total)(column j total)/n

Step 6: Reason an Appropriate Test Statistic
For each cell, we can obtain a sense for the unusualness of the data by comparing observed versus expected under the null hypothesis. Large disparities are evidence against the null and
in favor of the alternative. The statistical test is a z-score measure that involves observed and expected counts:

 $\left[\dfrac{O_{ij} - E_{ij}}{\sqrt{E_{ij}}}\right]^2 = \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$

Our chi square test statistic of association is the sum of these over all the cells in the table:

 $\chi^2_{df=(R-1)(C-1)} = \sum_{i=1}^{R}\sum_{j=1}^{C} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$

How to compute degrees of freedom (df):
 df = [total cells] - [constraints on data]
    = [RC] - [1 for grand total] - [(R-1), because row totals have to be fixed and the extra "1" is for the total] - [(C-1), because column totals are fixed and the extra "1" is for the total]
    = RC - 1 - (R-1) - (C-1)
    = RC - R - C + 1
    = (R-1)(C-1)

Behavior of the Chi Square Statistic under each of the null and alternative hypotheses:

 Null is true:
   Each $(O_{ij} - E_{ij})^2 / E_{ij}$ is close to zero.
   $\chi^2 = \sum_i \sum_j (O_{ij} - E_{ij})^2 / E_{ij}$ is small and has expected value (R-1)(C-1).
 Alternative is true:
   Each $(O_{ij} - E_{ij})^2 / E_{ij}$ is >> 0.
   $\chi^2 = \sum_i \sum_j (O_{ij} - E_{ij})^2 / E_{ij}$ is large and has expected value >> (R-1)(C-1).

Step 7: Decision Rule
Reject the null hypothesis HO when the test statistic is large, as when
 • the achieved significance level is small, or
 • the test statistic exceeds the upper $(1-\alpha)100$th percentile of the chi square distribution.

Step 8: Computations
 (1) For each cell, compute
     $E_{ij}$ = (# trials)[$\hat{\pi}_{ij}$ under null] = $n\,\hat{\pi}_{i\cdot}\,\hat{\pi}_{\cdot j}$ = (row i total)(column j total)/n
 (2) And then compute for each cell
     $(O_{ij} - E_{ij})^2 / E_{ij}$

Observed Counts

                          Last Consulted Physician
 Income           ≤ 6 months     7-12 months    > 12 months    Total
 < 6000           O11 = 186      O12 = 38       O13 = 35       O1. = 259
 6000-9999        O21 = 227      O22 = 54       O23 = 45       O2. = 326
 10000-13999      O31 = 219      O32 = 78       O33 = 78       O3. = 375
 14000-19999      O41 = 355      O42 = 112      O43 = 140      O4. = 607
 ≥ 20000          O51 = 653      O52 = 285      O53 = 259      O5. = 1197
 Total            O.1 = 1640     O.2 = 567      O.3 = 557      O.. = 2764

Expected Counts

                          Last Consulted Physician
 Income           ≤ 6 months      7-12 months     > 12 months     Total
 < 6000           E11 = 153.68    E12 = 53.13     E13 = 52.19     E1. = 259
 6000-9999        E21 = 193.43    E22 = 66.87     E23 = 65.70     E2. = 326
 10000-13999      E31 = 222.50    E32 = 76.93     E33 = 75.57     E3. = 375
 14000-19999      E41 = 360.16    E42 = 124.52    E43 = 122.32    E4. = 607
 ≥ 20000          E51 = 710.23    E52 = 245.55    E53 = 241.22    E5. = 1197
 Total            E.1 = 1640      E.2 = 567       E.3 = 557       E.. = 2764

 For example, E11 = (259)(1640)/2764 = 153.68 and E53 = (1197)(557)/2764 = 241.22.

 $\chi^2 = \sum_{\text{all cells}} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}} = \dfrac{(186 - 153.68)^2}{153.68} + \dots + \dfrac{(259 - 241.22)^2}{241.22} = 47.90$

with degrees of freedom = (R-1)(C-1) = (5-1)(3-1) = 8.

Achieved significance level: p-value = Prob[Chi square with df = 8 ≥ 47.90] << .0001

Step 9: Statistical Conclusion
We reject the null hypothesis in this example because, if the null is true, the chance of obtaining a test statistic value as far from small as 47.90 is less than 1 in 10,000.

Special Case: The Chi Square Test of No Association in a 2 x 2 Table

Often the a, b, c, and d notation is used to represent the cell counts in a 2x2 table, as follows:

                          2nd Classification Variable
 1st Classification       1          2
 1                        a          b          a + b
 2                        c          d          c + d
                          a + c      b + d      n

The calculation for the chi square test given by
 $\chi^2 = \sum_{\text{all cells}} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$
is equivalently given by
 $\chi^2 = \dfrac{n(ad - bc)^2}{(a+c)(b+d)(c+d)(a+b)}$

5. Rejection of Independence: The Chi Square Residual

With rejection of the null hypothesis of no association comes the question: why? We have a tool to help us that we have already seen. Appendix A gives us the following reasoning.

 IF X has a distribution that is Binomial(n, p) exactly,
 THEN X is approximately Normal($\mu$, $\sigma^2$) with $\mu = np$ and $\sigma^2 = np(1-p)$, and

   z-score $= \dfrac{X - E[X]}{SD[X]} = \dfrac{X - np}{\sqrt{np(1-p)}}$ is approximately Normal(0, 1).

 Comments:
 • This X is our observed count O.
 • This E[X] = np is our expected count E.
 • Thus the numerator of the z-score is O - E.
 • The denominator of the z-score, $\sqrt{np(1-p)}$, is almost $\sqrt{E} = \sqrt{np}$, but not quite.

   z-score $\approx \dfrac{O - E}{\sqrt{E}}$

This approximation to the z-score and similar formulae are called residuals.

The following are two measures of residuals (approximate z-scores) for investigation of association.

 Name: Standardized residuals
   Calculation: $r_{ij} = \dfrac{O_{ij} - E_{ij}}{\sqrt{E_{ij}}}$
   Remark: As these are approximately z-scores, these residuals are distributed Normal(0, 1), approximately.

 Name: Adjusted standardized residuals
   Calculation: $r^{*}_{ij} = \dfrac{O_{ij} - E_{ij}}{\sqrt{E_{ij}\,(1 - n_{i\cdot}/n)(1 - n_{\cdot j}/n)}}$
   Remark: Thought to be more reasonably approximated as distributed Normal(0, 1).

Behavior of Measures of Residual under each of the null and
alternative hypotheses:

 Null is true: O - E will be near zero, so the residual will be small.
 Alternative is true: O - E will be appreciably different from zero when measured in SE units, so the residual will be large in absolute value.

How large is significantly large? We answer this using the Normal(0, 1) distribution.

Example: Investigation of Relationship between Income and Physician Visits (continued)

Shown are the adjusted standardized residuals $r^{*}$ that are bigger than 1.96 in magnitude (approximately).

 Income           ≤ 6 months    7-12 months    > 12 months
 < 6000               4.3          -2.4           -2.8
 6000-9999            4.0                         -3.0
 10000-13999
 14000-19999                                       2.0
 ≥ 20000             -4.5           3.8

Attending to the cells that are appreciably higher or lower than expected reveals:
 1. Low income individuals were more likely to visit their physician within 6 months than were higher income individuals.
 2. Low income individuals were less likely to delay seeing their physician beyond one year than were higher income individuals.

The Small Cell Frequency Problem

The problem is this: the entirety of this analysis relies on a continuous distribution (chi square) approximation to a distribution that is actually discrete (binomial, Poisson, etc.). The approximation is poor if the actual cell counts are small.

When is the approximation okay? There are a variety of rules on this:
 1. Require that $E_{ij} \geq 10$ for all the cells (very conservative).
 2. Require min $E_{ij} \geq 1$ if 20% or fewer of the cells have expected counts < 5.
 3. Require min $E_{ij} \geq 2$ if degrees of freedom is less than 30.

What do I do if there are sparse cell frequencies?
Combine adjacent rows and/or columns to attain the required minimum expected cell frequencies. The disadvantage of this approach is a loss of degrees of freedom. The resulting test statistic is less powerful.

6. Confidence Interval Estimation of Relative Risk (RR) and Odds Ratio (OR)

The z-score method is used to obtain confidence interval estimates of the
relative risk (RR) and odds ratio (OR). The approach is the following. Its reasonableness is an application of the central limit theorem, the details of which are beyond the scope of this course.

If
 $\theta$ = parameter of interest,
 $\hat{\theta}$ = best guess, based on a reasonably large sample size, and
 $\widehat{var}(\hat{\theta})$ = best guess of the variance of $\hat{\theta}$,
then
 $\dfrac{\hat{\theta} - \theta}{\sqrt{\widehat{var}(\hat{\theta})}}$ is well approximated as Normal(0, 1).

Thus we can use this new z-score variable to obtain the confidence interval we are after. For a $(1-\alpha)100\%$ confidence interval:
 $\hat{\theta} \pm z_{1-\alpha/2}\sqrt{\widehat{var}(\hat{\theta})}$

Outline of Steps in Obtaining a Confidence Interval for RR or OR
 1. Utilize the reasonableness of the z-score approximation to the distribution of ln[RR] and ln[OR].
 2. Obtain the confidence interval for ln[RR] or ln[OR].
 3. To obtain confidence interval estimates for RR and OR, exponentiate the confidence interval estimates for ln[RR] and ln[OR].

Confidence Interval Estimate of Relative Risk (RR)

Example:
                          CHD     No CHD    Total
 High cholesterol          27       95       122
 Not high cholesterol      44      443       487
 Total                     71      538       609

 $\widehat{RR} = \dfrac{\hat{p}_1}{\hat{p}_2} = \dfrac{27/122}{44/487} = 2.45$

 1. Utilize the z-score approximation for the distribution of ln[RR]:
    $\ln[\widehat{RR}] = \ln(2.45) = 0.896$
    $\widehat{var}(\ln[\widehat{RR}]) \approx \dfrac{1 - \hat{p}_1}{n_1\hat{p}_1} + \dfrac{1 - \hat{p}_2}{n_2\hat{p}_2}$
    In our example, $\widehat{var}(\ln[\widehat{RR}]) \approx \dfrac{1 - 27/122}{27} + \dfrac{1 - 44/487}{44} = 0.0495$
 2. Obtain the confidence interval for ln[RR]. For a 95% confidence interval, $z_{1-\alpha/2} = z_{.975} = 1.96$, so that with 95% confidence
    $0.896 - 1.96\sqrt{0.0495} \leq \ln[RR] \leq 0.896 + 1.96\sqrt{0.0495}$, or
    $0.4599 \leq \ln[RR] \leq 1.332$
 3. Exponentiate:
    $e^{0.4599} \leq RR \leq e^{1.332}$, so that $1.584 \leq RR \leq 3.789$

Confidence Interval Estimate of Odds Ratio (OR)

Example:
                  Disease    No disease
 Exposed          a = 8      b = 30
 Not exposed      c = 2      d = 20

 $\widehat{OR} = \dfrac{ad}{bc} = \dfrac{(8)(20)}{(2)(30)} = 2.67$

 1. Utilize the z-score approximation for the distribution of ln[OR]:
    $\ln[\widehat{OR}] = \ln(2.67) = 0.981$
    $\widehat{var}(\ln[\widehat{OR}]) \approx \dfrac{1}{a} + \dfrac{1}{b} + \dfrac{1}{c} + \dfrac{1}{d}$
    In our example, $\widehat{var}(\ln[\widehat{OR}]) \approx \dfrac{1}{8} + \dfrac{1}{30} + \dfrac{1}{2} + \dfrac{1}{20} = 0.708$
 2. Obtain the confidence interval for ln[OR]. For a 95% confidence interval, $z_{1-\alpha/2} = z_{.975} = 1.96$, so that with 95% confidence
    $0.981 - 1.96\sqrt{0.708} \leq \ln[OR] \leq 0.981 + 1.96\sqrt{0.708}$, or
    $-0.67 \leq \ln[OR] \leq 2.63$
 3. Exponentiate:
    $e^{-0.67} \leq OR \leq e^{2.63}$, so that with 95% confidence $0.512 \leq OR \leq 13.87$
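The three-step recipe above (work on the log scale, build the interval, exponentiate) can be sketched in a few lines of Python. This is a minimal illustration under the large-sample variance approximations used in this section; the function names `rr_confidence_interval` and `or_confidence_interval` are hypothetical, and the multiplier 1.96 is hard-coded for a 95% interval.

```python
from math import exp, log, sqrt

Z_975 = 1.96  # 97.5th percentile of Normal(0, 1), for a 95% interval


def rr_confidence_interval(a, n1, c, n2, z=Z_975):
    """Relative risk (a/n1)/(c/n2) with a CI built on the ln[RR] scale."""
    p1, p2 = a / n1, c / n2
    rr = p1 / p2
    # Large-sample variance of ln[RR]
    var_ln_rr = (1 - p1) / (n1 * p1) + (1 - p2) / (n2 * p2)
    half_width = z * sqrt(var_ln_rr)
    return rr, exp(log(rr) - half_width), exp(log(rr) + half_width)


def or_confidence_interval(a, b, c, d, z=Z_975):
    """Odds ratio ad/bc with a CI built on the ln[OR] scale."""
    odds_ratio = (a * d) / (b * c)
    # Large-sample variance of ln[OR]
    var_ln_or = 1 / a + 1 / b + 1 / c + 1 / d
    half_width = z * sqrt(var_ln_or)
    return (
        odds_ratio,
        exp(log(odds_ratio) - half_width),
        exp(log(odds_ratio) + half_width),
    )


# CHD example: 27 of 122 with high cholesterol, 44 of 487 without
rr, rr_lo, rr_hi = rr_confidence_interval(27, 122, 44, 487)
print(f"RR = {rr:.2f}, 95% CI ({rr_lo:.3f}, {rr_hi:.3f})")

# Disease example: a = 8, b = 30, c = 2, d = 20
odds, or_lo, or_hi = or_confidence_interval(8, 30, 2, 20)
print(f"OR = {odds:.2f}, 95% CI ({or_lo:.3f}, {or_hi:.2f})")
```

Because the code carries full precision through each step, its limits can differ from the hand calculation in the last digit, where intermediate values were rounded.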
7. Strategies for Controlling Confounding

We can control confounding at study design:
 • Restriction
 • Matching

We can also control confounding analytically:
 • Stratification
 • Standardization
 • Matching

Restriction
Restriction is the inclusion of only persons who are the same with respect to the confounder.
 • A study of males only will not produce results that are confounded by gender effects.
 • A study of nonsmokers only will not produce results that are confounded by the effects of smoking.
The advantage is a guarantee of control for confounding. However, there are also disadvantages: the sample size is limited and generalizability is reduced.

Matching in a Cohort Study
Matching in a cohort study involves the following:
 • Enrollment of exposed persons without restriction.
 • Enrollment of unexposed persons only if they match exposed persons.

Matching in a Case-Control Study
In a case-control study, the following occurs:
 • Enrollment of cases without restriction.
 • Enrollment of controls only if they match cases.

Be careful! Matching is not necessarily a good idea.
 • In case-control studies, controls may be artificially similar to cases. Estimates of association may be spuriously low.
 • If matching is related to exposure only, not confounding, then spurious confounding may be introduced.
 • Sample size is reduced.
 • Identical matched pairs provide no information.

Do not match:
 • In most case-control studies
 • On a variable that is intermediary
 • When a large number of controls are available

Consider matching:
 • In an experiment and some cohort studies
 • On some variables: age, sex, site

8. Standardization of Rates

Many epidemiological studies of association involve the comparison of rates, using the method of standardization to control for confounding.

Goal of Standardization
 • To compare event rates in two populations while taking into account differences in their distributions with respect to one or more confounders.
 • This is achieved by comparing the event rates in the two populations as if their distributions with respect to the confounder were the same.

Example: We wish to compare death rates in two populations:
 Exposed = single women
 Unexposed = ever-married women
Age is a confounder since (1) death rates increase with age AND (2) ever-married women tend to be older.

How do we pretend that the two groups have the same age distribution?
 1. In both populations, we define strata of age.
 2. We obtain the relative frequencies of each stratum of age.
 3. These relative frequencies become weights.
 4. A standardized rate is just a weighted average:

 Standardized rate $= \sum_{\text{age strata}}$ (relative frequency of stratum)(rate in stratum) $= \sum_{\text{age strata}}$ (weight)(rate in stratum)

There are two commonly used methods of comparing standardized rates:

 Name                                               Weight
 1. Indirect: SMR (standardized mortality ratio)    Relative frequency of stratum among EXPOSED
 2. Direct: SRR (standardized risk ratio)           Relative frequency of stratum among UNEXPOSED

A comparison of standardized rates has the following form:

 (Standardized rate among the exposed) / (Standardized rate among the unexposed)

A. Indirect Standardization: Standardized Mortality Ratio (SMR)

 $SMR = \dfrac{\sum (\text{weight in EXPOSED})(\text{rate in exposed})}{\sum (\text{weight in EXPOSED})(\text{rate in unexposed})} = \dfrac{\text{Observed deaths in exposed}}{\text{Expected deaths in exposed}}$

Suppose we observe the following:

              Exposed                        Unexposed
 Stratum i    Frequency n1i    Events ai     Frequency n0i    Events bi
 Young        10000            50            100000           50
 Old          1000             4             200000           400
 Total        11000                          300000

Step 1: Calculate stratum-specific death rates in each of the exposed and unexposed populations.

 Stratum i    Exposed               Unexposed
 Young        50/10000 = .005       50/100000 = .0005
 Old          4/1000 = .004         400/200000 = .002

Step 2: For an SMR, our weights will be the relative frequency of the age strata in the exposed population.

 Exposed
 Stratum i    Frequency n1i    Weight
 Young        10000            10000/11000 = .9091
 Old          1000             1000/11000 = .0909
 Total        11000            1.000

Step 3: Compute the standardized event rate in each population.

 Exposed
              Event Rate    Weight Assignment    Weight x Rate
 Young        .005          .9091                .004546
 Old          .004          .0909                .000364
                                                 Sum = .004910

 Unexposed
              Event Rate    Weight Assignment    Weight x Rate
 Young        .0005         .9091                .000455
 Old          .002          .0909                .000182
                                                 Sum = .000637

Step 4: Compare.

 $SMR = \dfrac{.004910}{.000637} = 7.7$

We say that the number of events among the exposed is 7.7 times greater than what was expected in the absence of exposure.

B. Direct Standardization: Standardized Rate Ratio (SRR)

 $SRR = \dfrac{\sum (\text{weight in UNexposed})(\text{rate in exposed})}{\sum (\text{weight in UNexposed})(\text{rate in unexposed})} = \dfrac{\text{Expected deaths in UNexposed}}{\text{Observed deaths in UNexposed}}$

Consider the same pattern of event occurrence:

              Exposed                        Unexposed
 Stratum i    Frequency n1i    Events ai     Frequency n0i    Events bi
 Young        10000            50            100000           50
 Old          1000             4             200000           400
 Total        11000                          300000

Step 1: Calculate stratum-specific death rates in each of the exposed and unexposed populations, just as we did before.

 Stratum i    Exposed               Unexposed
 Young        50/10000 = .005       50/100000 = .0005
 Old          4/1000 = .004         400/200000 = .002

Step 2: For an SRR, our weights will be the relative frequency of the age strata in the unexposed population.

 Unexposed
 Stratum i    Frequency n0i    Weight
 Young        100000           100000/300000 = .3333
 Old          200000           200000/300000 = .6667
 Total        300000           1.000

Step 3: Compute the standardized event rate in each population.

 Exposed
              Event Rate    Weight Assignment    Weight x Rate
 Young        .005          .3333                .00167
 Old          .004          .6667                .00267
                                                 Sum = .00434

 Unexposed
              Event Rate    Weight Assignment    Weight x Rate
 Young        .0005         .3333                .00017
 Old          .002          .6667                .00133
                                                 Sum = .0015

Step 4: Compare.

 $SRR = \dfrac{.00434}{.0015} = 2.89$

Summary of Standardization

 Indirect
   Name: SMR, Standardized Mortality Ratio
   Standardization is to: the exposed population
   Formula: $\dfrac{\sum (\text{weight, exposed})(\text{rate in exposed})}{\sum (\text{weight, exposed})(\text{rate in unexposed})} = \dfrac{\text{Observed in exposed}}{\text{Expected in exposed}}$
   Advantages:
    • Easy
    • Intuitive
   Disadvantages:
    • Standardization is to the exposed
    • Often cannot compare populations

 Direct
   Name: SRR, Standardized Rate Ratio
   Standardization is to: the unexposed population
   Formula: $\dfrac{\sum (\text{weight, unexposed})(\text{rate in exposed})}{\sum (\text{weight, unexposed})(\text{rate in unexposed})} = \dfrac{\text{Expected in unexposed}}{\text{Observed in unexposed}}$
   Advantages:
    • Standardization is to the unexposed
    • A single reference can be used for comparison of many populations
   Disadvantages:
    • Not intuitive

Two warnings about standardization!

Warning 1: Standardization is not meaningful if the distributions of the confounder in the exposed and unexposed populations do not overlap. For example, nonoverlapping distributions occur when the ages among the exposed population range from 50 and older while the ages among the unexposed population range from 18 to 35.

Warning 2: Standardization assumes that the effect of exposure is the same in each stratum. If it is not, we say that the stratification variable is an effect modifier.

9. Stratified Analysis of Rates

In a stratified analysis of rates, the goal is to understand an exposure-disease relationship while taking into account confounding or effect modification. Need a review of confounding and effect modification? See Appendix E.

Example: We will explore in some detail a data set investigating exposure to video display terminals and spontaneous abortion (SAB). In this unit, we will do a stratified analysis of this association, considering strata defined by month of gestation.

Suppose the following are observed:

                       Exposed                      Unexposed
 Month of Gestation    SAB/Pregnancies    (%)       SAB/Pregnancies    (%)
 1                     10/512             2.0       1/366              0.3
 2                     38/502             7.5       30/365             8.2
 3                     15/462             3.2       12/335             3.6
 4                     7/449              1.6       5/323              1.5
 5                     2/442              0.5       4/318              1.3
 6                     4/440              0.9       1/314              0.3
 7                     2/436              0.5       1/313              0.3

The following analysis plan might be followed:

 Step 1: Are the stratum-specific OR the same?
   • Estimate the common OR
   • Test homogeneity of OR
 Step 2, if the OR are the same: Is the common OR = 1?
 Step 2, if the OR are different: Report stratum-specific OR.

Step 1:
 • A preliminary is the estimation of an assumed common odds ratio. This will be the Mantel-Haenszel estimate.
 • Testing homogeneity of the stratum-specific OR involves comparing the stratum-specific ORs to the Mantel-Haenszel OR.

Step 2:
IF we judge the stratum-specific odds ratios (OR) to be within noise of each other
(the same), THEN we evaluate whether the common OR is close to unity (no association); or, IF we judge the stratum-specific odds ratios (OR) to be different, THEN we report stratum-specific OR.

How to Estimate the Mantel-Haenszel Odds Ratio, $OR_{MH}$
 • It is a weighted average of the stratum-specific odds ratios.
 • The weights are a function of the variances of the stratum-specific odds ratios.

Step 1: For each stratum, obtain the following:

                Case    Control
 Exposed        a       b          M1
 UNexposed      c       d          M0
                N1      N0         T

 $OR_{stratum} = \dfrac{ad}{bc}$, with weight (the reciprocal of variance[$OR_{stratum}$]) taken to be $\dfrac{bc}{T}$.

Step 2: Calculate $OR_{MH}$ as a weighted average of the stratum-specific OR.

Mantel-Haenszel Odds Ratio:

 $OR_{MH} = \dfrac{\sum_{strata} OR_{stratum}/var(OR_{stratum})}{\sum_{strata} 1/var(OR_{stratum})} = \dfrac{\sum_{strata} (ad/T)}{\sum_{strata} (bc/T)}$

Here are the calculations for the data in our example. For each month of gestation, the observed 2x2 table is given in the notation above.

 Month     a      b     M1      c      d     M0     N1     N0      T     OR
 1        10    502    512      1    365    366     11    867    878    7.2709
 2        38    464    502     30    335    365     68    799    867    0.9145
 3        15    447    462     12    323    335     27    770    797    0.9032
 4         7    442    449      5    318    323     12    760    772    1.0072
 5         2    440    442      4    314    318      6    754    760    0.3568
 6         4    436    440      1    313    314      5    749    754    2.8716
 7         2    434    436      1    312    313      3    746    749    1.4378

 Month of Gestation     a      b      c      d      T      ad/T       bc/T
 1                     10    502      1    365    878     4.1572     0.5718
 2                     38    464     30    335    867    14.6828    16.0554
 3                     15    447     12    323    797     6.0790     6.7302
 4                      7    442      5    318    772     2.8834     2.8627
 5                      2    440      4    314    760     0.8263     2.3158
 6                      4    436      1    313    754     1.6605     0.5782
 7                      2    434      1    312    749     0.8331     0.5794
 TOTALS                                                  31.1223    29.6935

 $OR_{MH} = \dfrac{\sum (ad/T)}{\sum (bc/T)} = \dfrac{31.1223}{29.6935} = 1.0481$

A. Woolf Test of Homogeneity

Step 1: For each stratum i, obtain the following:

 $\ln[OR_i]$ and weight $w_i = \left[\dfrac{1}{a_i} + \dfrac{1}{b_i} + \dfrac{1}{c_i} + \dfrac{1}{d_i}\right]^{-1}$

Step 2: Obtain a weighted average of the stratum-specific ln[OR], where K is the number of strata:

 $\overline{\ln OR} = \dfrac{\sum_{i=1}^{K} w_i \ln[OR_i]}{\sum_{i=1}^{K} w_i}$

Step 3: The Woolf statistic, under the null hypothesis of homogeneity of OR, is
distributed chi square with degrees of freedom = (# strata) - 1:

 $\chi^2_{K-1} = \sum_{i=1}^{K} w_i\left(\ln[OR_i] - \overline{\ln OR}\right)^2 = \sum_{i=1}^{K} w_i\left(\ln[OR_i]\right)^2 - \dfrac{\left(\sum_{i=1}^{K} w_i \ln[OR_i]\right)^2}{\sum_{i=1}^{K} w_i}$

Example:

 Month     a      b      c      d        W        ln[OR]       W ln[OR]
 1        10    502      1    365      .9052     1.983882     1.795805
 2        38    464     30    335    15.4346    -0.089365    -1.37932
 3        15    447     12    323     6.4378    -0.101763    -0.655126
 4         7    442      5    318     2.8714     0.007214     0.020713
 5         2    440      4    314     1.3238    -1.030529    -1.367496
 6         4    436      1    313      .7965     1.054855     0.841448
 7         2    434      1    312      .6642     0.363106     0.241485
 Totals                              28.43872                -0.502491

 $\overline{\ln OR} = \dfrac{\sum_{i=1}^{K} w_i \ln[OR_i]}{\sum_{i=1}^{K} w_i} = \dfrac{-0.502491}{28.43872} = -0.017669$

 $\chi^2_{K-1} = \sum_{i=1}^{K} w_i\left(\ln[OR_i] - \overline{\ln OR}\right)^2 = 6.1284$

Step 4: Significance level calculation.
 p-value = Probability[Chi square with df = 6 ≥ 6.1284] = .409
Do not reject.

The null hypothesis is retained because the Woolf statistic is not statistically significant. Inasmuch as the stratum-specific odds ratios range from 0.35 to 7.27, the lack of statistical significance reflects the limited sample size available for study.

B. Mantel-Haenszel Test of No Association

It has been determined previously that it is reasonable to assume that the stratum-specific odds ratios are the same. Now we ask: are the stratum-specific odds ratios all unity?

Step 1: For each stratum, the hypothesis of no association means that the count "a" has a distribution that is central hypergeometric.

                Case    Control
 Exposed        a       b          M1
 UNexposed      c       d          M0
                N1      N0         T

 $E[a] = \dfrac{N_1 M_1}{T}$ and $var[a] = \dfrac{N_1 N_0 M_1 M_0}{T^2(T-1)}$

Step 2: The test statistic will be based on the sum over strata of the counts "a":

 $\chi^2_{MH} = \dfrac{(A - E[A])^2}{var[A]}$, with df = 1, where

 $A = \sum_{strata} a$, $E[A] = \sum_{strata} \dfrac{N_1 M_1}{T}$, and $var[A] = \sum_{strata} \dfrac{N_1 N_0 M_1 M_0}{T^2(T-1)}$

We get the following in our example:

 Month of Gestation     a      N1 M1 / T     N1 N0 M1 M0 / [T^2 (T-1)]
 1                     10        6.4146         2.6435
 2                     38       39.3725        15.2931
 3                     15       15.6512         6.3637
 4                      7        6.9793         2.8784
 5                      2        3.4895         1.4505
 6                      4        2.9178         1.2086
 7                      2        1.7463         0.7278
 TOTALS                78       76.5712        30.5656

 $A = \sum a = 78$
 $E[A] = \sum \dfrac{N_1 M_1}{T} = 76.5712$
 $var[A] = \sum \dfrac{N_1 N_0 M_1 M_0}{T^2(T-1)} = 30.5656$

 $\chi^2_{MH} = \dfrac{(A - E[A])^2}{var[A]} = \dfrac{(78 - 76.5712)^2}{30.5656} = 0.0668$

Significance Level (P-value)
If the assumption of no association is true, then the
chance of a chi square statistic more extreme than 0.0668 is
 P-value = Prob[Chi square with df = 1 ≥ 0.0668] = 0.80
Do not reject. Conclude that, overall, the data do not suggest an association. This is not surprising inasmuch as $OR_{MH} = 1.048$.

10. Factors Associated with Mammographic Screening

Source: Evans et al. (1998). Factors Associated with Repeat Mammography in a New York State Public Health Screening Program. Public Health Management Practice 4(5): 63-71.

Background
 • Breast cancer is a major cause of morbidity and mortality. In the US, it is the second major cause of cancer deaths for women.
 • There is no known way of primary prevention. In the meantime, secondary prevention is of critical public health importance.
 • Mammography detects cancer approximately 1.7 years before a woman could feel the lump herself. It also locates cancers too small for detection by clinical breast exam.
 • Stage of breast cancer at diagnosis is related to survival:

     Stage at Diagnosis    Percent Surviving to 5 Years
     Early                 97
     Late                  20

 • One screening mammogram is not enough. The risk of breast cancer increases with age.
 • Previous work has shown that mammography is underutilized.
 • Therefore, surveillance of patterns of repeat mammographic screening among women is needed to identify targets for intervention. Such a study is among the activities of the New York State Department of Health.

Research Question
Among women with no history of breast cancer and with a normal mammogram, what factors among selected sets of characteristics (sociodemographic, cancer risk, health behavior, health care access) predict the occurrence of a repeat mammogram?

Design
Cohort study investigation of the occurrence of a repeat screening mammogram during the period 1988-1993 among women without a history of breast cancer and who received a baseline screening mammogram that is documented in the Breast and Cervical Screening Program Database of the New York State
Department of Health.

Breast and Cervical Cancer Screening Program Cohort, 1988-1993
New York State Department of Health
9 Mammography Sites

16529 baseline mammograms among women aged over 50

Exclusions:
  6311 due to non-negative baseline mammogram
   205 due to requirement for follow-up testing
   528 due to history of breast cancer
  7044 exclusions total

Analysis Cohort, N = 9485 women
• No history of breast cancer
• No missing data

Characteristics of Analysis Cohort

                                              Frequency     %
Total                                           9485       100
Age 50-69 years                                 3670        39
Non-White Race/Ethnicity                        5160        54
Less than High School Education                 6472        68
Family History of Breast Cancer                 1130        12
Previous Mammogram                              4366        46
Returned for Repeat Screening Mammogram         2604        27

Recall

• Among 9485 women with an initial negative mammogram, 2561 (27%) returned for a repeat screening mammogram.
• Interest is in variations in these events of return with demographics, medical history, and access to health care.
• Rationale is the importance of detecting breast cancer in its early stage.

A reasonable analysis plan is the following.

1. Goal: Description of Analysis Sample
   Rationale: To describe the sample. To compare the sample with the target. To identify data errors.
   Methods: Relative frequency.

2. Goal: Crude Associations
   Rationale: To obtain these associations. To identify candidates for adjusted analysis. To guide adjusted analysis.
   Methods: Relative frequency. OR (95% CI). Chi-square tests of association.

3. Goal: Model-Free Adjusted Associations
   Rationale: To obtain estimates of independent predictive significance. To obtain model-free hypothesis tests. To discover effect modification. To discover confounding.
   Methods: Stratified estimates of OR and 95% confidence intervals. Test of homogeneity of OR. Estimation of $OR_{MH}$.

Characteristics of Participants with Negative Mammograms at Initial Visit, New York State 1988-1991 (N = 9485), Partial listing

                                          n        %
Age, years
  ≥70                                    885      9.3
  50-69                                 3670     38.7
  40-49                                 2805     29.6
  <40                                   2061     21.7
  unknown                                 64      0.7
Race/Ethnicity
  White, Non-Hispanic                   4325     45.6
  Black, Non-Hispanic                   2567     27.1
  Hispanic, Asian, Other                2587     27.2
  unknown                                  6      0.1
Time Since Last Mammogram
  Less than 1 year                      1552     16.4
  1-5 years                             2220     23.4
  More than 5 years                      594      6.3
  No prior mammogram                    4933     52.0
  unknown                                186      2.0

Note: Initial visits occurred during the years 1988-1991.
• Almost half were over the age of 50.
• 46% were White.
• 52% had never had a mammogram.
• Give counts of unknown.

Crude Associations with Return for Screening Mammogram Among Women with Initial Negative Mammogram, New York State 1988-1993 (N = 9485)

                                     Returned for Screening Mammogram
                               N          n          %
Age
  ≥70                         885        292       33.0
  50-69                      3670       1328       36.2
  40-49                      2805        691       24.6
  <40                        2061        288       14.0
                                                  P(a) = .0001
Race/Ethnicity
  White                      4325       1380       31.9
  Black                      2567        755       29.4
  Hispanic, Asian, Other     2587        468       18.1
                                                  P = .0001
Last Mammogram
  Less than 1 year           1552        576       37.1
  1-5 years                  2220        727       32.7
  > 5 years                   594        165       27.8
  No prior                   4416        958       21.7
                                                  P = .0001
(a) Chi-square test of association

• Reminder: P-values are not very useful. They are especially uninformative in large scale studies.
• Best return is seen among women 50-69 years of age.
• Crude analysis suggests that Hispanic women, Asian women, and women of other race/ethnicity are less likely to follow their initial negative mammogram with a repeat screen.
• Not surprisingly, women with a history of mammogram are more likely to return for a repeat screen.

11. The Chi Square Goodness-of-Fit Test

Another use of the chi-square distribution

• So far, we've used the chi-square statistic to test the hypothesis of no association.
• Now we'll use the chi-square distribution to assess whether two distributions are the same, or reasonably the same (goodness-of-fit).

Suppose that a histogram of the observed data looks like:

[Figure: histogram of the observed data]

Of interest: Can we reasonably assume, for purposes of analysis, that the data represent a sample from a Normal distribution? This permits application of normal theory estimation and hypothesis testing
approaches like the ones we learned in PubHlth 540, Introductory Biostatistics.

This might also be of interest if we'd like to know if the sample distribution can reasonably be described as that of another distribution (e.g., Binomial or Poisson).

Consider the setting where interest is in goodness-of-fit to the Normal distribution.

Which normal distribution? Let's consider the Normal distribution that is the closest. By closest, we mean:

$\mu$ = sample mean = $\bar{X}$
$\sigma^2$ = sample variance = $S^2$

The idea is to consider an overlay of this Normal distribution on the histogram of the observed data.

[Figure: Normal($\mu$, $\sigma^2$) curve overlaid on the histogram of the observed data]

The Idea of the Goodness-of-Fit Test

• Divide up the range into intervals $i = 1, 2, 3, \ldots, K$.
• In each interval $i$, obtain the observed count $O_i$ and the expected count $E_i$.
• Also obtain, for each interval $i$, what is called a component chi-square:

$$\frac{(O_i - E_i)^2}{E_i}$$

Interval, i      i = 1              i = 2              etc.     i = K
Observed         O_1                O_2                etc.     O_K
Expected         E_1                E_2                etc.     E_K
Component        (O_1 − E_1)²/E_1   (O_2 − E_2)²/E_2   etc.     (O_K − E_K)²/E_K

Each component is a comparison of the observed and expected counts.

• Sum these to obtain the chi-square goodness-of-fit test statistic:

$$\chi^2_{GOF} = \sum_{i=1}^{K} \frac{(O_i - E_i)^2}{E_i}$$

Behavior of the Chi Square Goodness-of-Fit Statistic

This is a setting where the null hypothesis is typically the one that we hope is operative. The null hypothesis says that the unknown true distribution that gave rise to the data is reasonably similar to the hypothesized distribution (in this example, Normal).

Values of the chi-square goodness-of-fit statistic will be small when the two distributions are reasonably similar. This is because the observed and expected counts are similar, giving rise to component chi-square values that are small.

How many degrees of freedom has the Chi Square Statistic?

Degrees of freedom to use =
    (# intervals total; this is K)
  − 1 (ONE df is lost for the last interval)
  − (# parameters estimated using the data)

DF = K − 1 − (# parameters estimated)

Example
Source: Rosner B. Fundamentals of Biostatistics, second edition. Boston: Duxbury, 1986, p 352.

Test for goodness-of-fit to the normal probability distribution the following data, comprised of n = 14736 blood pressure readings. Note: these data have sample mean and variance values of $\bar{X} = 80.68$ and $S^2 = 12^2$, respectively.

Step 1: Obtain observed counts from a histogram.

Class i     Interval           Observed Count, O_i
   1        <50                        57
   2        ≥50 to <60                330
   3        ≥60 to <70               2132
   4        ≥70 to <80               4584
   5        ≥80 to <90               4604
   6        ≥90 to <100              2119
   7        ≥100 to <110              659
   8        ≥110                      251
TOTAL                               14736

Tip: Check that the sum of the observed counts MATCHES the total sample size.

Step 2: Obtain the $\mu$ and $\sigma^2$ of the comparison normal distribution.
Compute from the sample: $\bar{X} = 80.68$, $S^2 = 12^2$.
So we'll compare the data to the normal distribution with $\mu = 80.68$, $\sigma^2 = 12^2$.

Step 3: Calculate the probability of a value in each interval using the z-score method introduced in BE540.

For interval i = 1:
$$Pr[X < 50] = Pr\left[Z < \frac{50 - 80.68}{12}\right] = Pr[Z < -2.556] = 0.00529$$

For interval i = 2:
$$Pr[50 \le X < 60] = Pr\left[\frac{50 - 80.68}{12} \le Z < \frac{60 - 80.68}{12}\right] = Pr[-2.556 \le Z < -1.7233] = 0.04242 - 0.00529 = 0.0371$$

Etc.

For interval i = K = 8:
$$Pr[X \ge 110] = Pr\left[Z \ge \frac{110 - 80.68}{12}\right] = Pr[Z \ge 2.4433] = 0.00728$$

Step 4: Calculate the expected count of observations in each interval using

Expected count = (sample size) × (probability of interval)

For interval i = 1:  E_1 = (14736)(0.00529) = 77.95
For interval i = 2:  E_2 = (14736)(0.0371) = 546.71
Etc.
For interval i = K = 8:  E_8 = (14736)(0.00728) = 107.28

Step 5: Calculate the component chi-square for each interval.

Class i     Interval           Observed Count O_i     Expected Count E_i     Component (O_i − E_i)²/E_i
   1        <50                        57                    77.95                    5.6306
   2        ≥50 to <60                330                   546.71                   85.9015
   3        ≥60 to <70               2132                  2126.40                    0.0147
   4        ≥70 to <80               4584                  4283.75                   21.0447
   5        ≥80 to <90               4604                  4478.27                    3.5299
   6        ≥90 to <100              2119                  2431.44                   40.1485
   7        ≥100 to <110              659                   683.75                    0.8959
   8        ≥110                      251                   107.57                  191.2444
TOTAL                               14736                 14736                     348.41

Tip: Check that sum of observed = sum of expected = sample size.

Step 6: Determine
degrees of freedom.

DF = K − 1 − (# parameters estimated) = 8 − 1 − 1 (for $\mu$) − 1 (for $\sigma$) = 5

Step 7: Assess statistical significance.

$$\chi^2_{goodness\ of\ fit;\ df=5} = 348.41$$

p-value = Prob[Chi-square with df = 5 ≥ 348.41] << 0.0001

This suggests that the data cannot reasonably be assumed to follow a normal distribution. Examination of the component chi-squares suggests that the normal distribution fit is reasonable for blood pressures between 60 and 110 mm Hg but is poor for readings below 60 mm Hg or above 110 mm Hg.

Example
Source: Zar JH. Biostatistical Analysis, third edition. Upper Saddle River: Prentice Hall, 1996, p 461.

A plant geneticist wishes to know if a sample of n = 250 seedlings comes from a population having a 9:3:3:1 ratio of yellow smooth : yellow wrinkled : green smooth : green wrinkled seeds.

In this example, expected counts are computed using the hypothesized phenotype ratios.

i    Phenotype          O_i    Expected Count E_i = (n)(Pr[phenotype])              Component (O_i − E_i)²/E_i
1    Yellow smooth      152    (250)(9/16) = (250)(0.5625) = 140.625                        0.9201
2    Yellow wrinkled     39    (250)(3/16) = (250)(0.1875) = 46.875                         1.3230
3    Green smooth        53    (250)(3/16) = (250)(0.1875) = 46.875                         0.8003
4    Green wrinkled       6    (250)(1/16) = (250)(0.0625) = 15.625                         5.9290
TOTAL                   250    250                                                          8.972

DF = K − 1 − (# parameters estimated) = 4 − 1 − 0 = 3 (0 because we didn't have to estimate any)

$$\chi^2_{goodness\ of\ fit;\ df=3} = 8.972$$

p-value = Prob[Chi-square with df = 3 ≥ 8.972] = 0.02967

This suggests that the data do NOT come from a population having a 9:3:3:1 ratio of the four seedling types.

Appendix A: The Chi Square Distribution

In PubHlth 540, the chi-square distribution was introduced in Unit 6 (Estimation) and in Unit 8 (Chi Square Tests). This appendix explains the appropriateness of using the chi-square distribution, a model for a continuous random variable, for the analysis of discrete data.

The chi-square distribution is related to the normal distribution.

IF: Z has a distribution that is Normal(0, 1)
THEN: $Z^2$ has a chi-square distribution with DF = 1

IF: X has a distribution that is Normal($\mu$, $\sigma^2$), so that Z-score = $(X - \mu)/\sigma$
THEN: $(\text{Z-score})^2$ has a chi-square distribution with DF = 1

IF: $X_1, X_2, \ldots, X_n$ are each distributed Normal($\mu$, $\sigma^2$) and are independent, so that $\bar{X}$ is Normal($\mu$, $\sigma^2/n$) and Z-score = $(\bar{X} - \mu)/(\sigma/\sqrt{n})$
THEN: $(\text{Z-score})^2$ has a chi-square distribution with DF = 1

IF: $X_1, X_2, \ldots, X_n$ are each distributed Normal($\mu$, $\sigma^2$) and are independent, and we calculate $S^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 / (n-1)$
THEN: $(n-1)S^2/\sigma^2$ has a chi-square distribution with DF = n − 1

The chi-square distribution can be used in the analysis of categorical (count) data for reasons related to the normal distribution and, in particular, the central limit theorem.

$Z_1, Z_2, \ldots, Z_n$ are each Bernoulli with probability of event = $\pi$:
$E[Z_i] = \mu = \pi$
$Var[Z_i] = \sigma^2 = \pi(1 - \pi)$

1. The net number of events $X = \sum_{i=1}^{n} Z_i$ is Binomial($n$, $\pi$).

2. We learned in PubHlth 540 that the distribution of the average of the $Z_i$ is well described as Normal($\mu$, $\sigma^2/n$). Apply this notion here. By convention, $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} Z_i = X/n$.

3. So perhaps the distribution of the sum is also well described as Normal, at least approximately. If $\bar{X}$ is described well as Normal($\mu$, $\sigma^2/n$), then $X = n\bar{X}$ is described well as Normal($n\mu$, $n\sigma^2$).

Exactly: X is distributed Binomial($n$, $\pi$).
Approximately: X is distributed Normal($n\mu$, $n\sigma^2$), where $\mu = \pi$ and $\sigma^2 = \pi(1 - \pi)$.

Putting it all together:

IF: X has a distribution that is Binomial($n$, $\pi$) exactly
THEN: X has a distribution that is Normal($n\mu$, $n\sigma^2$) approximately, where $\mu = \pi$ and $\sigma^2 = \pi(1 - \pi)$

THEN: Z-score = $\dfrac{X - E(X)}{SD(X)} = \dfrac{X - n\mu}{\sqrt{n}\,\sigma} = \dfrac{X - n\pi}{\sqrt{n\pi(1 - \pi)}}$ is approximately Normal(0, 1)

THEN: $(\text{Z-score})^2$ has a distribution that is well described as Chi-square with df = 1
Comment: We arrive at a continuous distribution model for count data.

A Feel for Things, continued

You will come to think of the chi-square distribution as this when analyzing count data.

For one cell:

$$\frac{(\text{Observed Count} - \text{Expected Count})^2}{\text{Expected Count}} \text{ is Chi-square, df = 1, approximately}$$

For the sum of all RC cells in an R x C table:

$$\sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(\text{Observed Count}_{ij} - \text{Expected Count}_{ij})^2}{\text{Expected Count}_{ij}} \text{ is Chi-square, df = } (R-1)(C-1)\text{, approximately}$$

Appendix B: Selected Models for Categorical Data

Various study
designs (e.g., case-control, cohort, surveillance) give rise to categorical data utilizing some of the probability distributions that have been introduced in Unit 3 (e.g., binomial, Poisson, product binomial, and product Poisson).

1. Case-Control. We count events of exposure.

              Case     Control
Exposed        a          b
Not            c          d
             FIXED      FIXED

The count a is distributed Binomial(# trials = a + c, $\pi$ = Prob[exposed | case]).
The count b is distributed Binomial(# trials = b + d, $\pi$ = Prob[exposed | control]).

2. Cohort. We count events of disease.

              Disease    Not
Exposed          a        b      FIXED
Unexposed        c        d      FIXED

The count a is distributed Binomial(# trials = a + b, $\pi$ = Prob[disease | exposed]).
The count c is distributed Binomial(# trials = c + d, $\pi$ = Prob[disease | unexposed]).

3. 2x2 Table. We count events of joint occurrence of exposure and disease.

              Disease    Not
Exposed          a        b      FIXED
Not              c        d      FIXED
               FIXED    FIXED

The count a is distributed Hypergeometric.

4. 2x2 Table. We count all 4 types of joint events.

              Disease    Not
Exposed          a        b
Not              c        d

The count a is distributed Poisson($\mu_a$).
The count b is distributed Poisson($\mu_b$).
The count c is distributed Poisson($\mu_c$).
The count d is distributed Poisson($\mu_d$).

5. R x C Table, General.

              Mild    Moderate    Severe
Exposed        a         b          c       FIXED
Not            d         e          f       FIXED

The triplet of counts (a, b, c) is distributed Multinomial.
The triplet of counts (d, e, f) is distributed Multinomial.

Note: The multinomial distribution has not been discussed in this course. It is an extension of the Binomial distribution to the setting of more than two outcomes.

Appendix C: Concepts of Observed versus Expected

In categorical data analysis methodology, we compare observed counts of events with expected counts of events. Emphasis on counts.

Consider an investigation of a possible association between electronic fetal monitoring (EFM) and delivery by caesarian section.

                       Caesarian Section
                       Yes      No
EFM Exposure   Yes      5        1        6
               No       2        7        9
                        7        8       15

The observed counts are:
  5 with EFM exposure = yes AND Caesarian section = yes
  1 with EFM exposure = yes AND Caesarian section = no
  2 with EFM exposure = no AND Caesarian section = yes
  7 with EFM exposure = no AND Caesarian section = no

The expected counts depend on what we believe.

Absent a null hypothesis:

Cohort Study. Suppose we allow for the possibility of different probabilities of caesarian section for EFM exposed women versus non-EFM exposed women.
  Best guess of pr[caesarian section] for EFM exposed women = 5/6
  Best guess of pr[caesarian section] for non-EFM exposed women = 2/9

Case-Control Study. Suppose we allow for the possibility of different probabilities of history of EFM exposure for caesarian section women versus non-caesarian section women.
  Best guess of pr[EFM history] for C-section women = 5/7
  Best guess of pr[EFM history] for non C-section women = 1/8

Expected Counts Under Independence / No Association / Homogeneity

$$\text{Expected}_{\text{row } i,\ \text{col } j} = \frac{(\text{row total})(\text{column total})}{\text{grand total}}$$

Example: Expected Count in a Cohort Study

Viewed as a cohort study, the outcome is caesarian section. The null hypothesis of independence, no association, or homogeneity of proportions suggests that:

Best guess of pr[caesarian section] = overall proportion of c-section:
$$\hat{\pi}_{\text{c-section}} = \frac{7}{15} = \frac{\text{column "yes" total}}{\text{grand total}}$$

Best guess of pr[NO caesarian section] = overall proportion of NON c-section:
$$\hat{\pi}_{\text{NON c-section}} = \frac{8}{15} = \frac{\text{column "no" total}}{\text{grand total}}$$

Expected counts, by cell:

EFM = Yes, Caesarian section = Yes:
  $(n_{efm=yes})\,\hat{\pi}_{\text{c-section}} = (6)(7/15)$ = (row "yes" total)(column "yes" total)/(grand total)
EFM = Yes, Caesarian section = No:
  $(n_{efm=yes})\,\hat{\pi}_{\text{NON c-section}} = (6)(8/15)$ = (row "yes" total)(column "no" total)/(grand total)
EFM = No, Caesarian section = Yes:
  $(n_{efm=no})\,\hat{\pi}_{\text{c-section}} = (9)(7/15)$ = (row "no" total)(column "yes" total)/(grand total)
EFM = No, Caesarian section = No:
  $(n_{efm=no})\,\hat{\pi}_{\text{NON c-section}} = (9)(8/15)$ = (row "no" total)(column "no" total)/(grand total)

Example: Expected Count in a Case-Control Study

Viewed as a case-control study, the outcome is history of EFM exposure. The null hypothesis of independence, no association, or homogeneity of proportions suggests that:

Best guess of pr[hx EFM] = overall proportion of EFM exposure:
$$\hat{\pi}_{\text{hx EFM}} = \frac{6}{15} = \frac{\text{row "yes" total}}{\text{grand total}}$$

Best guess of pr[NO hx EFM] = overall proportion of NO EFM exposure:
$$\hat{\pi}_{\text{NO hx EFM}} = \frac{9}{15} = \frac{\text{row "no" total}}{\text{grand total}}$$

Expected counts, by cell:

EFM = Yes, Caesarian section = Yes:
  $(n_{csection=yes})\,\hat{\pi}_{\text{hx EFM}} = (7)(6/15)$ = (column "yes" total)(row "yes" total)/(grand total)
EFM = Yes, Caesarian section = No:
  $(n_{csection=no})\,\hat{\pi}_{\text{hx EFM}} = (8)(6/15)$ = (column "no" total)(row "yes" total)/(grand total)
EFM = No, Caesarian section = Yes:
  $(n_{csection=yes})\,\hat{\pi}_{\text{NO hx EFM}} = (7)(9/15)$ = (column "yes" total)(row "no" total)/(grand total)
EFM = No, Caesarian section = No:
  $(n_{csection=no})\,\hat{\pi}_{\text{NO hx EFM}} = (8)(9/15)$ = (column "no" total)(row "no" total)/(grand total)

Observed and Expected Counts, General R x C Table

A useful notation is O for observed and E for expected, with the following subscripts:

$O_{ij}$ = Observed count in row i and column j
$E_{ij}$ = Expected count in row i and column j
$O_{i\cdot} = E_{i\cdot} = n_{i\cdot}$ = Observed (and expected) row total for row i
$O_{\cdot j} = E_{\cdot j} = n_{\cdot j}$ = Observed (and expected) column total for column j

Yes, it's true. Under the null hypothesis, the expected and observed totals (row totals, column totals, grand total) match.

Observed Counts

                   Columns, j
                   j = 1     ...     j = C
Rows, i   i = 1    O_11      ...     O_1C      N_1. = O_1.
          ...
          i = R    O_R1      ...     O_RC      N_R. = O_R.
                   N_.1 = O_.1  ...  N_.C = O_.C     N = O_..

Expected Counts under Null: Independence / No Association / Homogeneity

                   Columns, j
                   j = 1     ...     j = C
Rows, i   i = 1    E_11      ...     E_1C      N_1. = O_1.
          ...
          i = R    E_R1      ...     E_RC      N_R. = O_R.
                   N_.1 = O_.1  ...  N_.C = O_.C     N = O_..

where $E_{ij} = \dfrac{n_{i\cdot}\, n_{\cdot j}}{n}$.

Appendix D: Review, Measures of Association

Recall that various epidemiological studies (prevalence, cohort, case-control) give rise to data in the form of counts in a 2x2 table.

Recall again the goal of assessing
the association between exposure and disease in a 2x2 table of counts represented using the a, b, c, and d notation.

                Disease    Healthy
Exposed            a          b        a + b
Not Exposed        c          d        c + d
                 a + c      b + d

Let's consider some actual counts.

                Disease    Healthy
Exposed            2          8         10
Not Exposed       10        290        300
                  12        298        310

We might have more than one 2x2 table if the population of interest is partitioned into subgroups or strata.
Example: Stratification by gender would yield a separate 2x2 table for men and women.

A good measure of association is a single measure that is stable over the various characteristics (strata) of the population.

Excess Risk

Suppose that the cumulative incidence of disease among exposed = $\pi_1$ and that the cumulative incidence of disease among nonexposed = $\pi_0$.

Excess Risk: The difference between the cumulative incidence rates,
$$\beta = \pi_1 - \pi_0$$

Example: In our 2x2 table, we have $\pi_1 = 2/10 = 0.20$ and $\pi_0 = 10/300 = 0.0333$. Thus, $\beta = 0.20 - 0.0333 = 0.1667$.

• The effect of exposure is said to be additive because we can write $\pi_1 = \pi_0 + \beta$.
• Hypothesis testing focuses on $H_0$: $\beta = 0$.
• For a population that has been stratified with strata $k = 1, \ldots, K$, the additive model says that $\pi_{k1} = \pi_{k0} + \beta$.
  Note: The absence of a subscript k on the excess risk $\beta$ says that we are assuming that the excess risk is constant in every stratum (e.g., among men and women).
• Biological mechanisms which relate exposure to disease in an additive model often do not operate in the same way across strata.
• If so, the additive risk model does not satisfy our criterion of being stable.

Relative Risk (RR)

The relative risk is the ratio of the cumulative incidence rate of disease among the exposed, $\pi_1$, to the cumulative incidence rate of disease among the nonexposed, $\pi_0$.

Relative Risk: The ratio of the cumulative incidence rates,
$$RR = \pi_1 / \pi_0$$

Example: In our 2x2 table, we have $\pi_1 = 2/10 = 0.20$ and $\pi_0 = 10/300 = 0.0333$. Thus, $RR = 0.20/0.0333 = 6.006$.

• The effect of exposure is said to be multiplicative because we can write $\pi_1 = \pi_0 \times RR$.
• Hypothesis testing focuses on $H_0$: $RR = 1$.
• This model is also said to be additive on the log scale. It is also said to be an example of a log-linear model. To see this:

$$\pi_1 = \pi_0 \times RR \;\Rightarrow\; \ln(\pi_1) = \ln(\pi_0) + \ln(RR) \;\Rightarrow\; \ln(\pi_1) = \ln(\pi_0) + \beta, \quad \text{where } \beta = \ln(RR)$$

• It has been found empirically that many exposure-disease relationships vary with age in such a way that the log-linear model is a good description. Specifically, the change with age in the relative risk of disease with exposure is reasonably stable. In such instances, the model is preferable to the additive risk model.

Attributable Risk

The attributable risk is the proportion of the incidence of disease among exposed persons that is in excess of the incidence of disease among nonexposed persons. Often it is expressed as a percent.

Attributable Risk:
$$AR = \frac{\pi_1 - \pi_0}{\pi_1} \times 100\% \quad \text{when expressed as a percent}$$

Recalling that $RR = \pi_1/\pi_0$ reveals that
$$AR = \frac{RR - 1}{RR} \times 100\%$$

Example: In our 2x2 table, RR = 6.006 yields an attributable risk value of
$$AR = \frac{6.006 - 1}{6.006} = 0.8335 = 83.35\%$$

Odds Ratio

Recall that the odds ratio measure of association has some wonderful advantages, both biological and analytical. Recall first the meaning of an "odds":

$$\text{Odds[Event]} = \frac{\text{Probability[event]}}{1 - \text{Probability[event]}} = \frac{\pi}{1 - \pi}$$

Let's look at the odds that are possible in our 2x2 table.

                Disease    Healthy
Exposed            a          b        a + b
Not Exposed        c          d        c + d
                 a + c      b + d

Cohort study design

Estimated odds of disease among exposed = $\dfrac{a/(a+b)}{b/(a+b)} = \dfrac{a}{b} = \dfrac{2}{8} = 0.25$

Estimated odds of disease among non-exposed = $\dfrac{c/(c+d)}{d/(c+d)} = \dfrac{c}{d} = \dfrac{10}{290} = 0.0345$

Case-control study design

Estimated odds of exposure among diseased = $\dfrac{a/(a+c)}{c/(a+c)} = \dfrac{a}{c} = \dfrac{2}{10} = 0.20$

Estimated odds of exposure among healthy = $\dfrac{b/(b+d)}{d/(b+d)} = \dfrac{b}{d} = \dfrac{8}{290} = 0.0276$

Odds ratio

Cohort study design:
$$OR = \frac{\text{Odds disease among exposed}}{\text{Odds disease among non-exposed}} = \frac{a/b}{c/d} = \frac{ad}{bc}$$

Case-control study design:
$$OR = \frac{\text{Odds exposure among diseased}}{\text{Odds exposure among healthy}} = \frac{a/c}{b/d} = \frac{ad}{bc}$$

Terrific! The OR is the same regardless of the study design, cohort
(prospective) or case-control (retrospective).

Example: In our 2x2 table, a = 2, b = 8, c = 10, and d = 290, so the OR = 7.25. This is slightly larger than the value of the RR = 6.006.

Thus, there are advantages of the Odds Ratio, OR:

1. Many exposure-disease relationships are described better using ratio measures of association rather than difference measures of association.
2. $OR_{\text{cohort study}} = OR_{\text{case-control study}}$.
3. The OR is the appropriate measure of association in a case-control study. Note that it is not possible to estimate an incidence of disease in a retrospective study. This is because we select our study persons based on their disease status.
4. When the disease is rare, $OR_{\text{case-control}} \approx RR$.

Appendix E: Review, Confounding of Rates

Is our estimate of a disease-exposure relationship measuring what we think it is? Or is there some other influence that plays a role? The presence of other influences might be as confounders or effect modifiers.

• A confounded association does not tell us about the association of interest. A confounded relationship is biased because of an extraneous variable.
• An effect modified relationship changes with variations in the extraneous variable.

Several examples illustrate these ideas.

Example: Among 600 women, it appears that nulliparity is protective against breast cancer.

                              Case-Control Status
                           Breast Cancer       Control
Exposure     Nulliparous        120              180          300
Status                     40% = 120/300    60% = 180/300
             Parous             180              120          300
                                300              300          600

Odds Ratio = 0.44

However, when we take into account exposure to radiation, a different story emerges.

No radiation
                  Cancer    Control
Nulliparous         30        170       200
Parous              10         90       100
                    40        260       300
Odds Ratio = 1.6

Radiation
                  Cancer    Control
Nulliparous         90         10       100
Parous             170         30       200
                   260         40       300
Odds Ratio = 1.6

The unadjusted odds ratio of 0.44 is reversed. It now appears that nulliparity is a risk factor for breast cancer; this is reflected in the odds ratio that is greater than 1.

How did this apparent contradiction occur?

• In the nulliparous group there are
disproportionately fewer women exposed to radiation, with the result that the OR = 0.44 is biased due to the confounding effect of exposure to radiation.
• Women exposed to radiation are more likely to have breast cancer.
• Women exposed to radiation were less likely to be nulliparous.

The calculation of an association (for example, an RR or an OR) for a 2x2 table of counts may be misleading because of one or more extraneous influences. An extraneous influence can be:
  Confounder
  Effect modifier
  Both
  Neither

A confounded association is biased and does not tell us about the association of interest. An effect modified relationship changes with variations in the extraneous variable.

Intuitively, confounding is the

• Distortion of a predictor-outcome relationship due to a third variable that is related to both predictor and outcome.
• The bias from confounding can be a spurious strengthening, weakening, elimination, or reversal.
• A reversal is said to be an example of Simpson's Paradox.

Apparent, but not true, confounding can occur in the absence of a relationship between exposure and disease.

Example: Are breath mints associated with cancer?

                               Case-Control Status
                            Cancer            Control
Exposure    Breath Mints      200               1646          1846
Status                   75% = 200/267    18% = 1646/8935
            None               67               7289          7356
                              267               8935          9202

Odds Ratio = 13.22

It looks like we should not be eating breath mints. What happens if we control for smoking?

Smokers
                  Cancer    Control
Breath Mints       194        706       900
None                21         79       100
                   215        785      1000
Odds Ratio = 1.03

Non-Smokers
                  Cancer    Control
Breath Mints         6        940       946
None                46       7210      7256
                    52       8150      8202
Odds Ratio = 1.00

Controlling for smoking, eating breath mints is no longer associated with cancer.

If the extraneous variable has no effect on disease, then it will not cause confounding.

Example: Hot tea is suspected of being associated with esophageal cancer.

                               Case-Control Status
                            Cancer             Control
Exposure    Tea              1420               3650          5070
Status                  94% = 1420/1504    81% = 3650/4499
            Water              84                849           933
                             1504               4499          6003

Odds Ratio = 3.93

Notice that the tea drinkers have disproportionately fewer smokers.

                   Smoker              Non-Smoker
Exposure    Tea      70                   5000          5070
                 1.4% = 70/5070
            Water   833                    100           933
                 89% = 833/933
                    903                   5100          6003

Interestingly, smoking status does not distort the association of tea with cancer.

SMOKERS
            Cancer    Control
Tea           20         50        70
Water         75        758       833
              95        808       903
Odds Ratio = 4.04

NON-SMOKERS
            Cancer    Control
Tea         1400       3600      5000
Water          9         91       100
            1409       3691      5100
Odds Ratio = 3.93

This is because smoking itself is not associated with esophageal cancer.

WATER
              Cancer    Control
SMOKER          75        758       833
NOT              9         91       100
                84        849       933
Odds Ratio = 1.00

TEA
              Cancer    Control
SMOKER          20         50        70
NOT           1400       3600      5000
              1420       3650      5070
Odds Ratio = 1.03

Thus,
• It is possible to observe a strong relationship between the extraneous variable (smoking) and exposure (tea)
• with NO confounding of the exposure-disease relationship of interest.
• This will occur when the extraneous variable is UNrelated to the disease outcome.

If the extraneous variable has no relationship to exposure, then it will not cause confounding.

Example: A crude analysis suggests that use of sugar substitutes is associated with bladder cancer.

                               Case-Control Status
                            Cancer        Healthy
Exposure    Substitute     106 (75%)     738 (13%)       844
Status      Sugar           35            5149          5184
                           141            5887          6028

Odds Ratio = 21.13

However, we have learned that smoking is associated with bladder cancer.

                  Cancer        Healthy
Smoker           127 (90%)     3051 (52%)      3178
Non-Smoker        14           2836            2850
                 141           5887            6028

Odds Ratio = 8.43

However, the variable smoking is not related to the use of sugar substitutes.

                  Substitute      Sugar
Smoker            445 (14%)       2733         3178
Non-Smoker        399 (14%)       2451         2850
                  844             5184         6028

Odds Ratio = 1.0

The independence of smoking and sugar substitute use means that the stratum specific odds ratios will be close to the unadjusted odds ratio.

Stratum: Smokers
               Cancer    Control
Substitute       95        350       445
Sugar            32       2701      2733
                127       3051      3178
Odds Ratio = 22.91

Stratum: Non-Smokers
               Cancer    Control
Substitute       11        388       399
Sugar             3       2488      2491
                 14       2876      2890
Odds Ratio = 23.51

Thus, an extraneous variable unrelated to exposure does not cause confounding.

We have what we need to define confounding.

Definition: Confounding
A variable is confounding if
1) It is extraneous, not intermediary.
2) It is related to disease, BOTH among the exposed AND among the unexposed.
3) It is related to exposure.

Recall that an intermediary variable is an intermediate in a causal pathway.

Example: Coal dust → Asthma → Lesions on Lung. Asthma is the intermediary variable.

Stratification on an intermediary variable eliminates the exposure-disease relationship.

When we discuss the logistic regression model, we'll learn about effect modification.

Appendix F: Computer Resources

Applets

Applet/Calculator for Analysis of 2-Way Contingency Table (courtesy of John C Pezullo, PhD)
http://statpages.org/ctab2x2.html

Fisher's Exact Test (Source: Vassar Stats)
httpfacultyyassaredulowgchSahtml

Chi Square Test for General R x C Table, maximum 9 x 9 (Source: Colorado State University)
httpwwwphvsics cshsin J quot html

Standardized Mortality Ratio Calculation, Download an Excel File Calculator (Source: Pennsylvania Department of Health)
http wwwdsfhealthstatepaushealthcwpView asp q 202 1 14
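
As a complement to the applets listed above, the goodness-of-fit calculation of Section 11 can also be scripted. The following is a minimal sketch in Python, not part of the original notes, that reworks the seedling example (observed counts 152, 39, 53, 6 against the hypothesized 9:3:3:1 ratio). The function names `gof_chi_square` and `chi_square_sf_df3` are hypothetical, chosen for illustration. The p-value uses a closed-form upper tail that is valid for df = 3 only; other degrees of freedom would need a statistics library.

```python
import math


def gof_chi_square(observed, probabilities):
    """Return (expected counts, component chi-squares, total chi-square).

    observed: list of observed counts O_i
    probabilities: hypothesized probability of each category (sums to 1)
    """
    n = sum(observed)
    # Expected count = (sample size) x (probability of category)
    expected = [n * p for p in probabilities]
    # Component chi-square = (O_i - E_i)^2 / E_i
    components = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
    return expected, components, sum(components)


def chi_square_sf_df3(x):
    """Pr[chi-square with df = 3 >= x]. Closed form; assumes df = 3 exactly."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Seedling example: 9:3:3:1 ratio, no parameters estimated from the data,
# so DF = K - 1 - 0 = 4 - 1 - 0 = 3.
observed = [152, 39, 53, 6]
probabilities = [9 / 16, 3 / 16, 3 / 16, 1 / 16]

expected, components, chi2 = gof_chi_square(observed, probabilities)
p_value = chi_square_sf_df3(chi2)

print("Expected counts:", expected)
print("Component chi-squares:", [round(c, 4) for c in components])
print("Chi-square statistic:", round(chi2, 3))
print("p-value (df = 3):", round(p_value, 4))
```

The statistic and p-value printed by this sketch can be compared with the values 8.972 and 0.02967 reported in the seedling example. The same `gof_chi_square` function applies to the blood pressure example if the interval probabilities from Step 3 are supplied, but the degrees of freedom there are 5, so the df = 3 tail formula would not apply.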