
# Biostat Methods in Categorical Data 171 203


This 131 page Class Notes was uploaded by Dedric Ritchie on Friday October 23, 2015. The Class Notes belongs to 171 203 at University of Iowa taught by Brian Smith in Fall. Since its upload, it has received 46 views. For similar materials see /class/228027/171-203-university-of-iowa in Biostatistics at University of Iowa.


Date Created: 10/23/15

## Section 1: Introduction

Biostatistical Methods in Categorical Data (171:203)

Brian J. Smith, PhD

October 8, 2007

### Table of Contents

- 1.1 Introduction
  - 1.1.1 Role of Statistics in Biomedical Studies
  - 1.1.2 Principles of Causality
  - 1.1.3 Epidemiology
    - Steps for Conducting an Epidemiologic Study
    - Factors to Consider when Selecting a Statistical Method
- 1.2 Disease Prevalence
  - 1.2.1 Definition
  - 1.2.2 Example: Undergraduate Binge Drinking at UI
    - Estimated Prevalence
    - 95% Confidence Interval
    - SAS Program and Output
    - Test for an Association: Binge Drinking and Greek Membership
    - SAS Program and Output
    - Interpretation
    - Questions

### 1.1 Introduction

#### 1.1.1 Role of Statistics in Biomedical Studies

In this class, our focus will be on statistical methods for the analysis of categorical data. Examples from epidemiologic studies will be used to illustrate many of the methods. Statistics is used to:

- Summarize and describe data
- Use data from samples of subjects to make inference about larger populations
- Estimate associations between disease outcomes and select risk factors
- Quantify the level of uncertainty in sample estimates
- Control for the interplay between multiple factors in characterizing the risk of disease
- Provide evidence (not proof) to support or refute causality

#### 1.1.2 Principles of Causality

Sir Bradford Hill outlined nine criteria by which to evaluate the strength of evidence in favor of causation. Six of his most relevant criteria are given below.

1. Strength of association (clinical significance vs. statistical significance)
2. Time sequencing of exposure and disease onset (ecologic study vs. prospective cohort study)
3. Biologic plausibility (collaboration with subject-matter experts)
4. Consistency with other investigations (literature review)
5. Dose-response relationship (variation in exposures)
6. Lack of more compelling explanations (consideration of bias, confounding, and interaction)

#### 1.1.3 Epidemiology

The study of the distribution and determinants of disease frequency in human populations.

##### Steps for Conducting an Epidemiologic Study

1. Identify the disease and risk factors of interest
2. Specify the questions (hypotheses) to be addressed
3. Design the study
   1. Select an appropriate design
      - Descriptive: Ecological
      - Observational: Case-Control, Cohort, Cross-Sectional
      - Experimental: Clinical and Intervention Trials
   2. Specify the data to be collected
      - Inclusion/Exclusion Criteria
      - Variables to be measured
   3. Determine the appropriate statistical methods for describing and analyzing the data
      - Number of Subjects
4. Carry out the study and collect the data
5. Analyze the data
6. Assess the validity of any observed statistical results with respect to chance, bias, and confounding
7. Draw conclusions about the subject population

##### Factors to Consider when Selecting a Statistical Method

- Scientific questions to be addressed
- Study design
- Type of data to be analyzed (nominal, ordinal, discrete, continuous)

### 1.2 Disease Prevalence

#### 1.2.1 Definition

- The number of individuals in a population that are diseased at a given point in time
- Often expressed as a rate or percentage: (number of diseased individuals) / (total number at risk)
- Denominator includes subjects appearing in the numerator
- Value lies between zero and one

#### 1.2.2 Example: Undergraduate Binge Drinking at UI

A cross-sectional study of 1468 University of Iowa students was conducted in order to assess the nature of alcohol consumption on campus.

Analysis Goals:

- Estimate the prevalence of binge drinking at Iowa
- Test for an association between binge drinking and fraternity/sorority (Greek) membership

Table 1: Summary of binge drinking study data

| Binge Drinking | Greek: Yes | Greek: No | Total |
|---|---|---|---|
| Yes | 398 | 624 | 1022 |
| No | 83 | 363 | 446 |
| Total | 481 | 987 | 1468 |

##### Estimated Prevalence

- Prevalence is estimated with the usual binomial proportion:

$$\hat{p} = \frac{1022}{1468} = 0.696$$

##### 95% Confidence Interval

- If the sample size is sufficiently large, say $n\hat{p}(1-\hat{p}) \ge 5$, then Normal theory methods can be used to construct the confidence interval:

$$\hat{p} \pm z_{0.975}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.696 \pm 1.96\sqrt{\frac{0.696(1-0.696)}{1468}} = (0.673,\ 0.720)$$

- If the Normal theory method is not
appropriate, then an exact confidence interval must be constructed directly using the binomial distribution.

##### SAS Program and Output

The data may be entered either as cell counts or as one record per subject. Using cell counts:

    data uialcohol;
      input Binge $ Greek $ N;
      cards;
    Yes Yes 398
    Yes No  624
    No  Yes 83
    No  No  363
    ;

    proc freq order=data data=uialcohol;
      weight N;
      table Binge / binomial;
    run;

Using one record per subject (intermediate records elided):

    data uialcohol;
      input ID Binge $ Greek $;
      cards;
    1    Yes Yes
    ...
    398  Yes Yes
    399  Yes No
    ...
    1022 Yes No
    1023 No  Yes
    ...
    1106 No  No
    ...
    ;

    proc freq order=data data=uialcohol;
      table Binge / binomial;
    run;

Output:

    The FREQ Procedure

                                    Cumulative    Cumulative
    Binge    Frequency    Percent    Frequency       Percent
    Yes           1022      69.62         1022         69.62
    No             446      30.38         1468        100.00

    Binomial Proportion for Binge = Yes
    Proportion                  0.6962
    ASE                         0.0120
    95% Lower Conf Limit        0.6727
    95% Upper Conf Limit        0.7197

    Exact Conf Limits
    95% Lower Conf Limit        0.6719
    95% Upper Conf Limit        0.7196

    Test of H0: Proportion = 0.5
    ASE under H0                0.0130
    Z                          15.0335
    One-sided Pr > Z            <.0001
    Two-sided Pr > |Z|          <.0001

    Sample Size = 1468

##### Test for an Association: Binge Drinking and Greek Membership

Recall the factors to consider in choosing a statistical method:

- Question to be addressed: Is there an association between the two variables?
- Study design: Cross-sectional study of 1468 subjects randomly selected from the UI student population, independent of their drinking or Greek status. Note that the proportion of students who binge drink or who belong to Greek organizations can be estimated from these data.
- Type of variables to be analyzed: Both variables are nominal categorical variables with two levels (Yes/No), i.e., dichotomous variables.

Two common choices:

1. Pearson chi-square test for an association: appropriate if no more than 20% of the expected cell counts are less than 5 and none is less than 1.
2. Fisher's exact test: nonparametric analog to the Pearson test. Useful when the sample size is small.

##### SAS Program and Output

Program:

    proc freq data=uialcohol;
      weight N;
      table Binge*Greek / chisq;
    run;

Output:

    The FREQ Procedure

    Table of Binge by Greek

    Binge      Greek
    Frequency
    Percent
    Row Pct
    Col Pct         No      Yes    Total
    No             363       83      446
                 24.73     5.65    30.38
                 81.39    18.61
                 36.78    17.26
    Yes            624      398     1022
                 42.51    27.11    69.62
                 61.06    38.94
                 63.22    82.74
    Total          987      481     1468
                 67.23    32.77   100.00

    Statistics for Table of Binge by Greek

    Statistic                       DF      Value      Prob
    Chi-Square                       1    58.2732    <.0001
    Likelihood Ratio Chi-Square      1    62.0183    <.0001
    Continuity Adj. Chi-Square       1    57.3539    <.0001
    Mantel-Haenszel Chi-Square       1    58.2335    <.0001
    Phi Coefficient                        0.1992
    Contingency Coefficient                0.1954
    Cramer's V                             0.1992

    Fisher's Exact Test
    Cell (1,1) Frequency (F)          363
    Left-sided Pr <= F             1.0000
    Right-sided Pr >= F         2.949E-15
    Table Probability (P)       1.998E-15
    Two-sided Pr <= P           4.174E-15

    Sample Size = 1468

##### Interpretation

- The null and alternative hypotheses are $H_0$: no association; $H_A$: association.
- The sample size is large enough to satisfy the assumptions of the Pearson test. SAS will print a warning if too many of the expected cell counts are less than 5.
- Pearson's test gives a chi-square statistic of 58.3 with a p-value < 0.0001. At the 5% level of significance, there is a significant association between binge drinking and Greek membership.
- Note that in this case Fisher's exact test gives the same conclusion. This is not necessarily always the case. The advantage of Fisher's test is that it is appropriate regardless of the sample size.

##### Questions

- Are Greeks more or less likely to binge drink?
- How would the analysis differ if the study design were case-control or cohort?

## Section 2: SAS and R Statistical Software

Biostatistical Methods in Categorical Data (171:203)

Brian J. Smith, PhD

October 8, 2007

### Table of Contents

- 2.1 Introduction
  - 2.1.1 Data Management
  - 2.1.2 Iowa Radon Study Example
  - 2.1.3 Entering Data
    - SAS Program
  - 2.1.4 Importing Data
    - SAS Program
    - R Program
  - 2.1.5 Exporting Data
    - SAS Program
    - R Program
  - 2.1.6 Modifying Existing Datasets
    - SAS Program
- 2.2 Descriptive Summaries for Numerical Data
  - 2.2.1 Univariate Statistics
    - SAS Program and Output
    - Normality Test Result
  - 2.2.2 Plots
    - R Program and Output
    - Guidelines for Formatting Plots
- 2.3 Descriptive Summaries for Tabular Data
  - 2.3.1 Frequency Tables
    - SAS Program and Output
    - Association Test Result
- 2.4 Pairwise Association for Numerical Data
  - 2.4.1 Correlation Analysis
    - SAS Program and Output
    - Iowa Radon Study Results
- 2.5 Two-Sample Parametric Test for Numerical Data
  - 2.5.1 Two-Sample T-Test
    - SAS Program and Output
    - Iowa Radon Study Result
- 2.6 Two-Sample Non-Parametric Test for Numerical Data
  - 2.6.1 Rank-Based Tests
    - SAS Program and Output
    - Iowa Radon Study Results

### 2.1 Introduction

#### 2.1.1 Data Management

Data management refers to the creation, storage, and manipulation of data. The popularity of the SAS Software Environment is due in large part to its extensive collection of powerful data management procedures. In this class we will rely primarily on the SAS DATA step for data processing. The DATA step provides a general-purpose programming language for data management and will be used to perform the following tasks:

- Entering raw data to create SAS datasets
- Importing data into SAS datasets
- Creating new SAS datasets by subsetting, merging, modifying, or updating existing datasets
- Constructing new variables from existing datasets
- Exporting SAS data and results for use in external software programs

In addition to these tasks, we will also use SAS as our primary data analysis software. Plotting, however, will be performed in the R software environment (http://www.r-project.org) due to its superior graphics capabilities. Thus, we will also cover the basics of data management in R.

#### 2.1.2 Iowa Radon Study Example

Four hundred thirteen lung cancer cases and six hundred fourteen population-based controls were enrolled in the Iowa Radon Lung Cancer case-control study. The investigators were interested in assessing the effect of radon exposure on lung cancer risk. Listed below is a subset of the variables collected in the study.

| Variable | Description | Values |
|---|---|---|
| case | Lung cancer indicator | 1 = case, 0 = control |
| age | Age at enrollment (control) or diagnosis (case) | continuous |
| pyr | Cigarette pack-years | continuous 4485 |
| school | Attained education level | 1 = grade school, 2 = high school, 3 = some college, 4 = college degree, 5 = beyond college |
| wlm20 | 20-year radon exposure | continuous 192 |

We will consider a few basic techniques for creating and manipulating datasets for the radon data in SAS.

#### 2.1.3 Entering Data

##### SAS Program

    data radon;
      input case age pyr school wlm20;
      cards;
    1 65 478439425 60 699691992 1 4 6608462927
    0 59159479808 05 4 12691266326
    O 75 258042437 0 5 11 14448953
    1 66179829227 2975 2 7688580114
    1 81087645448 11502659138 2 51768967405
    1 52405201916 20109548255 3 56601141221
    ;
    run;

Syntax:

- This DATA step defines a new SAS dataset named `radon`.
- `input` defines the variables in the dataset. By default, variables are assumed to be numerical. To designate a variable as a character variable, insert a `$` after the name in the `input` statement.
- The `cards` statement precedes the data that will comprise the dataset.

#### 2.1.4 Importing Data

##### SAS Program

    proc import datafile="L:\Bios203\irlcs.txt" out=irlcs dbms="TAB" replace;
    run;

Syntax:

- The IMPORT procedure reads data from an external file into a SAS dataset.
- `datafile` is the external file name.
- `out` specifies the name of the SAS dataset to be created.
- `dbms` specifies the type of data to be imported. Here, TAB indicates that the data are stored in a tab-delimited text file. Other file types are available, including EXCEL2002 for importing data from a Microsoft Excel spreadsheet.

##### R Program

    irlcs <- read.delim("L:/Bios203/irlcs.txt")

Syntax:

- `read.delim` reads a tab-delimited text file and creates a data frame from it. See `read.table` for a more general R import function.

#### 2.1.5 Exporting Data

##### SAS Program

    proc export outfile="L:\Temp\irlcs.txt" data=irlcs dbms="TAB" replace;
    run;

Syntax:

- The EXPORT procedure saves a SAS dataset to an external file.
- `outfile` is the external file name.
- `data` specifies the name of the SAS dataset.
- `dbms` specifies the type of data to be exported. The file type options are the same as those for the IMPORT procedure.

##### R Program

    write.table(irlcs, "L:/Temp/irlcs.txt", quote=F, sep="\t", row.names=F)

Syntax:
- `write.table` saves the specified data frame to an external text file.
- `quote` is a logical argument indicating whether values of character variables should be enclosed in quotation marks.
- `sep` is a character string giving the delimiter; `"\t"` indicates a tab.
- `row.names` is a logical argument indicating whether the row names in the data frame are to be written out.

#### 2.1.6 Modifying Existing Datasets

##### SAS Program

    data newirlcs;
      set irlcs;
      smkever = (pyr > 0);
      college = (school = 3 or school = 4 or school = 5);
      lnwlm20 = log(wlm20);
    run;

Syntax:

- A new SAS dataset `newirlcs` is created from an existing one, `irlcs`, in this DATA step. `set` allows for the inclusion of data from an existing SAS dataset.
- New variables may be defined in the DATA step.
  - `smkever` is created from the `pyr` variable. It takes a value of 1 if `pyr` is positive and 0 otherwise.
  - `college` is created from the `school` variable. It takes a value of 1 if `school` equals 3, 4, or 5, and a value of 0 otherwise.
  - `lnwlm20` is the result of applying the natural log transformation to `wlm20`.

### 2.2 Descriptive Summaries for Numerical Data

#### 2.2.1 Univariate Statistics

The UNIVARIATE procedure in SAS provides data summarization methods that produce univariate statistics and information on the distribution of numerical variables. PROC UNIVARIATE provides:

- Descriptive statistics based on moments, such as the mean, standard deviation, and standard error
- Median, mode, range, and quantiles
- Plots of the data distribution
- Shapiro-Wilk tests of normality
- Paired t test, sign test, and Wilcoxon signed rank test for use with differenced data

##### SAS Program and Output

Program:

    proc univariate normal data=newirlcs;
      class case;
      var wlm20;
    run;

Syntax:

- The `normal` option specifies that tests of Normality be performed.
- `class` specifies that the results be generated separately for each level of the given variable. In this example, summary statistics are calculated separately for the cases and controls in the radon study.

Output:

    The UNIVARIATE Procedure
    Variable: WLM20
    CASE = 0

    Moments
    N                       614      Sum Weights              614
    Mean             10.3672855      Sum Observations  6365.51331
    Std Deviation    8.35364296      Variance          69.7833507
    Skewness         3.09058311      Kurtosis          14.6680267
    Uncorrected SS   108770.288      Corrected SS       42777.194
    Coeff Variation  80.5769547      Std Error Mean    0.33712559

    Basic Statistical Measures
    Location                 Variability
    Mean     10.36729        Std Deviation          8.35364
    Median    7.87101        Variance              69.78335
    Mode       .             Range                 68.23687
                             Interquartile Range    8.03331

    Tests for Location: Mu0=0
    Test           Statistic          p Value
    Student's t    t     30.752       Pr > |t|    <.0001
    Sign           M        307       Pr >= |M|   <.0001
    Signed Rank    S    94402.5       Pr >= |S|   <.0001

    Tests for Normality
    Test                  Statistic            p Value
    Shapiro-Wilk          W      0.732898      Pr < W      <0.0001
    Kolmogorov-Smirnov    D      0.159199      Pr > D      <0.0100
    Cramer-von Mises      W-Sq   5.396563      Pr > W-Sq   <0.0050
    Anderson-Darling      A-Sq   31.99803      Pr > A-Sq   <0.0050

    Quantiles (Definition 5)
    Quantile       Estimate
    100% Max       69.65952
    99%            52.91604
    95%            24.10922
    90%            18.59181
    75% Q3         13.38010
    50% Median      7.87101
    25% Q1          5.34678
    10%             3.36676
    5%              2.78499
    1%              2.31351
    0% Min          1.42265

    Extreme Observations
    ------Lowest------      ------Highest-----
      Value      Obs          Value      Obs
    1.42265      151        57.3208      959
    1.89906     1022        57.4753      402
    2.08609      931        63.5324      649
    2.14718      963        64.6272      990
    2.20491      962        69.6595      987

    Missing Values
                              -----Percent Of-----
    Missing                                Missing
    Value        Count        All Obs          Obs
    .                5           0.81       100.00

##### Normality Test Result

The Shapiro-Wilk test can be used to assess whether the data are normally distributed. The null and alternative hypotheses for this test are

- $H_0$: Data are normally distributed
- $H_A$: Data are not normally distributed

Conclusion: At the 5% level of significance, we reject the hypothesis that the WLM20 measurements are normally distributed (p < 0.0001).

#### 2.2.2 Plots

##### R Program and Output

    ## Histogram Plots
    windows(9, 5)
    par(mar=c(5,4,4,1), mfrow=c(1,2))
    hist(irlcs$WLM20[irlcs$CASE==0], main="Controls",
         xlab="WLM Radon Exposure")
    hist(irlcs$WLM20[irlcs$CASE==1], main="Cases",
         xlab="WLM Radon Exposure")

    ## Box Plots
    windows(7, 6)
    par(mar=c(3,4,1,1))
    boxplot(WLM20 ~ CASE, data=irlcs, xlab="",
            ylab="WLM Radon Exposure", axes=F)
    axis(1, at=c(1, 2), labels=c("Controls", "Cases"))
    axis(2)
    box()

Syntax:

- `windows` opens a new graphics window of the specified or default size.
- `par` sets or queries graphics
parameters for the active window.
  - `mar` is a vector giving the bottom, left, top, and right margin sizes, respectively.
  - `mfrow` is a vector setting the number of rows and columns of plots to display.

[Figure 1: Histogram plots of radon exposures among Iowa Radon Study cases and controls. Panels: Controls, Cases; x-axis: WLM Radon Exposure; y-axis: Frequency.]

[Figure 2: Box plots of radon exposures among Iowa Radon Study cases and controls. Groups: Controls, Cases; y-axis: WLM Radon Exposure.]

##### Guidelines for Formatting Plots

Plots provide graphical summaries of data. They should be self-explanatory and understandable to all other researchers involved in the project.

- Use descriptive labels for the axes. If a qualitative variable is plotted, use the category names as labels rather than any arbitrary numeric values that may be used to code the variable in the dataset. Labels for quantitative variables should describe the variable and give the units of measurement; avoid using variable names from the dataset as labels.
- Plots should be interpretable if displayed as a grayscale image. Be careful about using color in analysis reports and manuscripts, since readers may want to print out a black-and-white copy.
- Include captions with your plots. Descriptive captions often indicate the type of plot, the data being plotted, the source of the data, and any other features that are being highlighted by the plot.
- Be consistent with capitalization and punctuation. Decide whether to capitalize the first letter of all words in the caption and whether to end captions with a period, and then do so for all plots.
- Use plot titles sparingly. Captions are the best place to describe the plot; an additional plot title is generally not needed.

### 2.3 Descriptive Summaries for Tabular Data

#### 2.3.1 Frequency Tables

The FREQ procedure in SAS provides tabular summaries for categorical data. For one-way tables, PROC FREQ can compute binomial-based test
statistics for proportions. For two-way tables, PROC FREQ computes chi-square test statistics and measures of association. For n-way tables, PROC FREQ does stratified analysis, including the calculation of stratum-specific and pooled summary statistics.

##### SAS Program and Output

Program:

    proc freq data=newirlcs;
      tables school;
      tables college / binomial;
      tables case*school / chisq;
    run;

Syntax:

- A frequency table will be provided for variables that are individually listed in the `tables` statement; contingency tables for variables that are listed together with the `*` symbol.
- Estimated proportions and exact and approximate 95% confidence intervals may be obtained for dichotomous variables using the `binomial` option.
- The chi-square test for an association may be applied to contingency tables via the `chisq` option.

Output:

    The FREQ Procedure

                                     Cumulative    Cumulative
    SCHOOL    Frequency    Percent    Frequency       Percent
    1                89       8.67           89          8.67
    2               535      52.09          624         60.76
    3               288      28.04          912         88.80
    4                82       7.98          994         96.79
    5                33       3.21         1027        100.00

                                     Cumulative    Cumulative
    college   Frequency    Percent    Frequency       Percent
    0               624      60.76          624         60.76
    1               403      39.24         1027        100.00

    Binomial Proportion for college = 0
    Proportion                  0.6076
    ASE                         0.0152
    95% Lower Conf Limit        0.5777
    95% Upper Conf Limit        0.6375

    Exact Conf Limits
    95% Lower Conf Limit        0.5770
    95% Upper Conf Limit        0.6376

    Test of H0: Proportion = 0.5
    ASE under H0                0.0156
    Z                           6.8962
    One-sided Pr > Z            <.0001
    Two-sided Pr > |Z|          <.0001

    Sample Size = 1027

    Table of CASE by SCHOOL

    CASE       SCHOOL
    Frequency
    Percent
    Row Pct
    Col Pct         1        2        3        4        5    Total
    0              47      299      183       60       25      614
                 4.58    29.11    17.82     5.84     2.43    59.79
                 7.65    48.70    29.80     9.77     4.07
                52.81    55.89    63.54    73.17    75.76
    1              42      236      105       22        8      413
                 4.09    22.98    10.22     2.14     0.78    40.21
                10.17    57.14    25.42     5.33     1.94
                47.19    44.11    36.46    26.83    24.24
    Total          89      535      288       82       33     1027
                 8.67    52.09    28.04     7.98     3.21   100.00

    Statistics for Table of CASE by SCHOOL

    Statistic                       DF      Value      Prob
    Chi-Square                       4    16.4845    0.0024
    Likelihood Ratio Chi-Square      4    17.0087    0.0019
    Mantel-Haenszel Chi-Square       1    15.7067    <.0001
    Phi Coefficient                        0.1267
    Contingency Coefficient                0.1257
    Cramer's V                             0.1267

    Sample Size = 1027

##### Association Test Result

The chi-square test can be used
to assess whether there is an association between two categorical variables. The null and alternative hypotheses for this test are

- $H_0$: There is no association
- $H_A$: There is an association

Conclusion: At the 5% level of significance, there is an association between case-control status and education (p = 0.0024).

### 2.4 Pairwise Association for Numerical Data

#### 2.4.1 Correlation Analysis

The CORR procedure in SAS is a statistical procedure for numerical random variables that computes correlation coefficients, including

- Pearson correlation
- Spearman rank-order correlation
- Pearson, Spearman, and Kendall partial correlation

##### SAS Program and Output

Program:

    proc corr pearson spearman data=newirlcs;
      var pyr wlm20;
    run;

Syntax:

- `spearman` requests the Spearman rank-order correlation coefficients.
- `pearson` requests the Pearson correlation coefficients. Pearson is the default unless otherwise specified.

Output:

    The CORR Procedure

    2 Variables:    PYR    WLM20

    Simple Statistics
    Variable       N        Mean     Std Dev      Median     Minimum      Maximum
    PYR         1027    19.82656    25.65853     3.85000           0    138.45175
    WLM20       1027    10.64205     8.89201     8.17985     1.42265     91.53930

    Pearson Correlation Coefficients, N = 1027
    Prob > |r| under H0: Rho=0
                  PYR      WLM20
    PYR       1.00000    0.01254
                          0.6882
    WLM20     0.01254    1.00000
               0.6882

    Spearman Correlation Coefficients, N = 1027
    Prob > |r| under H0: Rho=0
                  PYR      WLM20
    PYR       1.00000    0.01560
                          0.6175
    WLM20     0.01560    1.00000
               0.6175

##### Iowa Radon Study Results

The correlation coefficient may be used to assess whether there is an association between two quantitative variables. The null and alternative hypotheses for this test are

- $H_0$: The two variables are not correlated
- $H_A$: They are correlated

Conclusion: At the 5% level of significance, there is no evidence that pack-years is correlated with radon exposure (Spearman p = 0.6175).

### 2.5 Two-Sample Parametric Test for Numerical Data

#### 2.5.1 Two-Sample T-Test

The TTEST procedure in SAS performs t tests for one sample, two samples, and paired observations. The one-sample t test compares the mean of the sample to a given number. The two-sample t test compares the mean of the first sample minus the mean of the
second sample to a given number. The paired observations t test compares the mean of the differences in the observations to a given number.

##### SAS Program and Output

Program:

    proc ttest data=newirlcs;
      class case;
      var wlm20;
    run;

Syntax:

- Grouping variables are listed in the `class` statement.
- Analysis variables are listed in the `var` statement.
- The `paired` statement may be used in place of the `class` and `var` statements to perform a paired t test. It has the general form `paired <variable 1>*<variable 2>;`.

Output:

    The TTEST Procedure

    Statistics
                                 Lower CL            Upper CL    Lower CL               Upper CL
    Variable  CASE           N       Mean      Mean      Mean     Std Dev    Std Dev     Std Dev    Std Err    Minimum    Maximum
    WLM20     0            614     9.7052    10.367    11.029      7.9111     8.3536       8.849     0.3371     1.4227      69.66
    WLM20     1            413     10.119    11.051    11.982      9.0178      9.633      10.339      0.474     2.0461     91.539
    WLM20     Diff (1-2)           -1.793    -0.683    0.4269      8.5213       8.89      9.2923     0.5658

    T-Tests
    Variable    Method           Variances      DF    t Value    Pr > |t|
    WLM20       Pooled           Equal        1025      -1.21      0.2274
    WLM20       Satterthwaite    Unequal       797      -1.17      0.2405

    Equality of Variances
    Variable    Method      Num DF    Den DF    F Value    Pr > F
    WLM20       Folded F       412       613       1.33    0.0014

##### Iowa Radon Study Result

The two-sample t test may be used to assess the difference in means between two independent groups. The test assumes that the mean difference has a t-distribution. This assumption is appropriate if (1) the variable is normally distributed, or (2) the sample sizes are large (rule of thumb: $n_1, n_2 \ge 30$). The associated null and alternative hypotheses are

- $H_0$: The group means are equal
- $H_A$: The mean for group 1 is less than / not equal to / greater than that for group 2

Conclusion: At the 5% level of significance, there is no evidence of a difference in mean radon exposures between cases and controls (p = 0.2405).

### 2.6 Two-Sample Non-Parametric Test for Numerical Data

#### 2.6.1 Rank-Based Tests

The NPAR1WAY procedure in SAS performs nonparametric tests for location and scale differences for a one-way classification of subjects, including

- the Wilcoxon rank-sum test
- the Kruskal-Wallis test

##### SAS Program and Output

Program:

    proc npar1way wilcoxon data=newirlcs;
      class case;
      var wlm20;
    run;

Syntax:

- The `wilcoxon` option requests the Wilcoxon rank-sum test in the case of two groups, and the Kruskal-Wallis test.
- Grouping variables are listed in the `class` statement.
- Analysis variables are listed in the `var` statement.

Output:

    The NPAR1WAY Procedure

    Wilcoxon Scores (Rank Sums) for Variable WLM20
    Classified by Variable CASE

                       Sum of      Expected       Std Dev          Mean
    CASE      N        Scores      Under H0      Under H0         Score
    1       413      217342.0      212282.0    4660.85021    526.251816
    0       614      310536.0      315596.0    4660.85021    505.758958

    Wilcoxon Two-Sample Test
    Statistic                 217342.0000

    Normal Approximation
    Z                              1.0855
    One-Sided Pr > Z               0.1388
    Two-Sided Pr > |Z|             0.2777

    t Approximation
    One-Sided Pr > Z               0.1390
    Two-Sided Pr > |Z|             0.2779

    Z includes a continuity correction of 0.5.

    Kruskal-Wallis Test
    Chi-Square                     1.1786
    DF                                  1
    Pr > Chi-Square                0.2776

##### Iowa Radon Study Results

The Wilcoxon rank-sum test may be used to compare the distribution of a given variable between two independent groups. This test is a nonparametric analog to the two-sample t test. The associated null and alternative hypotheses are

- $H_0$: The variable is equally distributed in the two groups
- $H_A$: The distribution in group 1 is shifted to the left / left or right / right of that in group 2

Conclusion: At the 5% level of significance, there is no evidence that the radon exposures differ systematically between cases and controls (p = 0.2777).

## Section 3: Measures of Risk

Biostatistical Methods in Categorical Data (171:203)

Brian J. Smith, PhD

October 8, 2007

### Table of Contents

- 3.1 Overview
- 3.2 Data Layouts
  - 3.2.1 Total Number of Cases and Noncases
    - Multiple Exposure Categories
    - Two Exposure Levels
- 3.3 Relative Risk
  - 3.3.1 Estimation
  - 3.3.2 Confidence Interval
    - Approximate Method
    - Example
    - SAS Code and Output
- 3.4 Odds Ratio
  - 3.4.1 Estimation
  - 3.4.2 Confidence Intervals
    - Approximate Method
    - Example
  - 3.4.3 Relationship between the Relative Risk and Odds Ratio
    - Comments on the Odds Ratio
- 3.5 Pearson Correlation
  - Example
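
As a brief orientation to the two ratio measures listed in the outline above, the following R sketch computes a relative risk and an odds ratio, each with a large-sample 95% confidence interval, for the 2 x 2 cohort example analyzed later in this section (40 of 120 exposed subjects and 60 of 380 unexposed subjects diseased). The variable names are illustrative assumptions rather than part of the course materials. The point estimates should agree with the PROC FREQ values reported later in this section (2.1111 and 2.6667).

```r
# 2 x 2 counts from the cohort example (variable names are hypothetical).
exp_dis   <- 40    # exposed, diseased        (cell a)
exp_non   <- 80    # exposed, not diseased    (cell b)
unexp_dis <- 60    # unexposed, diseased      (cell c)
unexp_non <- 320   # unexposed, not diseased  (cell d)

# Relative risk: risk among the exposed divided by risk among the unexposed.
rr <- (exp_dis / (exp_dis + exp_non)) / (unexp_dis / (unexp_dis + unexp_non))

# Odds ratio: cross-product ratio ad / bc.
or <- (exp_dis * unexp_non) / (exp_non * unexp_dis)

# Standard errors on the natural-log scale.
se_log_rr <- sqrt(1 / exp_dis - 1 / (exp_dis + exp_non) +
                  1 / unexp_dis - 1 / (unexp_dis + unexp_non))
se_log_or <- sqrt(1 / exp_dis + 1 / exp_non + 1 / unexp_dis + 1 / unexp_non)

# 95% limits: exponentiate the log-scale interval back to the ratio scale.
z <- qnorm(0.975)
rr_ci <- rr * exp(c(-1, 1) * z * se_log_rr)
or_ci <- or * exp(c(-1, 1) * z * se_log_or)

print(round(c(RR = rr, lower = rr_ci[1], upper = rr_ci[2]), 4))
print(round(c(OR = or, lower = or_ci[1], upper = or_ci[2]), 4))
```

Working on the log scale and exponentiating keeps both limits positive, which a symmetric interval around the ratio itself would not guarantee.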
### 3.1 Overview

For now we will concentrate on categorical measures of exposure.

- Measures of association involve a direct comparison of frequency counts across different values or categories of a risk factor.
- These measures rely on the selection of an appropriate reference population:
  - Exposed vs. nonexposed
  - Female vs. male
  - Older age group vs. youngest age group
  - Current or previous smokers vs. nonsmokers
- We will cover the following categorical measures of association:
  1. Relative Risk
  2. Odds Ratio
  3. Correlation

### 3.2 Data Layouts

#### 3.2.1 Total Number of Cases and Noncases

##### Multiple Exposure Categories

Our focus in this section will be on the number of observed diseased/nondiseased and exposed/unexposed subjects in the study. Such data could be derived from any study design (cohort, case-control, cross-sectional, etc.).

| Diseased | $x_1$ | $x_2$ | ... | $x_k$ | Totals |
|---|---|---|---|---|---|
| Yes | $a_1$ | $a_2$ | ... | $a_k$ | $n_1$ |
| No | $b_1$ | $b_2$ | ... | $b_k$ | $n_2$ |
| Totals | $m_1$ | $m_2$ | ... | $m_k$ | $n$ |

where

- $a_i$ and $b_i$ are the number of diseased and nondiseased subjects at exposure level $i$
- $n_1$ and $n_2$ are the total number of diseased and nondiseased subjects, respectively
- $m_i$ is the total number of subjects at exposure level $i$

##### Two Exposure Levels

The situation of two exposure levels, which often arises in practice, will be given a slightly different notation.

| Exposed | Diseased: Yes | Diseased: No | Totals |
|---|---|---|---|
| Yes | $a$ | $b$ | $a + b$ |
| No | $c$ | $d$ | $c + d$ |
| Totals | $a + c$ | $b + d$ | $n$ |

### 3.3 Relative Risk

#### 3.3.1 Estimation

A ratio comparison of two risk estimates is called a risk ratio or Relative Risk (RR). The relative risk of disease for the $i$-th exposure category relative to the $j$-th exposure category may be calculated directly as

$$\widehat{RR} = \frac{\widehat{\Pr}(D \mid E_i)}{\widehat{\Pr}(D \mid E_j)} = \frac{\hat{\pi}_i}{\hat{\pi}_j} = \frac{a_i / m_i}{a_j / m_j}$$

where

- $\pi_i$ and $\pi_j$ are the probability of disease for the $i$-th and $j$-th exposure categories
- $a_i$ and $a_j$ are the number of diseased subjects within each exposure category
- $m_i$ and $m_j$ are the total number of subjects (diseased plus nondiseased) within each exposure category

For 2 x 2 tables, the relative risk formula may be written as

$$\widehat{RR} = \frac{a / (a + b)}{c / (c + d)}$$

Notes:

- This estimator assumes that all subjects are followed for
the duration of the study (i.e., no loss to follow-up).
- It is only appropriate if subjects are not enrolled conditional on their disease status. In other words, subjects must be sampled independent of their disease status.

#### 3.3.2 Confidence Interval

##### Approximate Method

The 95% confidence interval is based on a normal theory approximation for the relative risk on the natural-log scale (Katz et al., 1978):

$$\ln \widehat{RR} \pm z_{0.975}\sqrt{\frac{1}{a} - \frac{1}{a+b} + \frac{1}{c} - \frac{1}{c+d}}$$

Exponentiation of this result yields the desired confidence interval for the relative risk on the original scale:

$$\widehat{RR} \times \exp\left(\pm z_{0.975}\sqrt{\frac{1}{a} - \frac{1}{a+b} + \frac{1}{c} - \frac{1}{c+d}}\right)$$

##### Example

Consider the following data from a cohort study.

| Exposed | Diseased: Yes | Diseased: No | Totals |
|---|---|---|---|
| Yes | 40 | 80 | 120 |
| No | 60 | 320 | 380 |
| Totals | 100 | 400 | 500 |

The relative risk of disease for subjects who are exposed versus those unexposed is

$$\widehat{RR} = \frac{40/120}{60/380} = 2.11$$

The 95% confidence interval for the relative risk estimate of 2.11 is

$$2.11 \times \exp\left(\pm 1.96\sqrt{\frac{1}{40} - \frac{1}{120} + \frac{1}{60} - \frac{1}{380}}\right) = (2.11 \times 0.709,\ 2.11 \times 1.410) = (1.50,\ 2.97)$$

Conclusions:

- Exposed individuals are 2.11 times as likely to develop disease as those who are unexposed. Equivalently, the risk of disease for exposed individuals is 2.11 times the risk for the unexposed.
- We are 95% confident that the interval (1.50, 2.97) contains the true relative risk of disease for exposed versus unexposed individuals.
- Exposure has a statistically significant positive effect on the risk of disease.
  - Why would we not be able to make this statement if the study had used a case-control design?

The explicit interpretation is that if the study were repeatedly carried out on the same population, 95% of the resulting confidence intervals would contain the true parameter (the relative risk).

##### SAS Code and Output

Program:

    data rrexample;
      input Case $ Exposed $ N;
      cards;
    Yes Yes 40
    Yes No  60
    No  Yes 80
    No  No  320
    ;

    proc freq order=data data=rrexample;
      weight N;
      tables Exposed*Case / relrisk;
    run;

Syntax:

- SAS expects the exposure reference cell to be given in the second row of the table. To ensure that this happens, the following steps were taken:
  1. The exposed subjects are entered first in the data
set.
  2. The `order=data` option was specified in PROC FREQ.
  3. A table with exposure as the row variable and case status as the column variable is requested via the `tables Exposed*Case` statement.
- The `relrisk` option generates relative-risk estimates for the specified frequency tables.

Output:

    The FREQ Procedure

    Table of Exposed by Case

    Exposed    Case
    Frequency
    Percent
    Row Pct
    Col Pct        Yes       No    Total
    Yes             40       80      120
                  8.00    16.00    24.00
                 33.33    66.67
                 40.00    20.00
    No              60      320      380
                 12.00    64.00    76.00
                 15.79    84.21
                 60.00    80.00
    Total          100      400      500
                 20.00    80.00   100.00

    Statistics for Table of Exposed by Case

    Estimates of the Relative Risk (Row1/Row2)
    Type of Study                  Value     95% Confidence Limits
    Case-Control (Odds Ratio)     2.6667      1.6681      4.2629
    Cohort (Col1 Risk)            2.1111      1.4975      2.9762
    Cohort (Col2 Risk)            0.7917      0.6925      0.9050

    Sample Size = 500

### 3.4 Odds Ratio

In a case-control study, where subjects are enrolled conditional on their disease status, we cannot estimate exposure-specific rates, risks, or relative risks without additional information. Unfortunately, the relative risk is often the population parameter of interest.

#### 3.4.1 Estimation

Recall the general notation used in the table:

| Exposed | Diseased: Yes | Diseased: No | Totals |
|---|---|---|---|
| Yes | $a$ | $b$ | $a + b$ |
| No | $c$ | $d$ | $c + d$ |
| Totals | $a + c$ | $b + d$ | $n$ |

The odds of exposure among the diseased is

$$\frac{\widehat{\Pr}(E \mid D)}{\widehat{\Pr}(\bar{E} \mid D)} = \frac{a / (a + c)}{c / (a + c)} = \frac{a}{c}$$

whereas the odds among the nondiseased is

$$\frac{\widehat{\Pr}(E \mid \bar{D})}{\widehat{\Pr}(\bar{E} \mid \bar{D})} = \frac{b / (b + d)}{d / (b + d)} = \frac{b}{d}$$

The ratio of these two odds is

$$\widehat{OR} = \frac{a / c}{b / d} = \frac{ad}{bc}$$

Notes:

- The ratio of these two odds is known as the Odds Ratio (OR).
- The numerator is the odds of exposure among diseased subjects; the denominator is the odds of exposure among nondiseased subjects.
- The odds ratio is symmetric with respect to disease and exposure status. Specifically, the formula for the disease odds ratio is the same as that for the exposure odds ratio given above. Hence, the odds ratio is often interpreted as the odds of disease for the exposed relative to the unexposed subjects.
- The odds ratio can be estimated regardless of the study design.

#### 3.4.2 Confidence Intervals

##### Approximate Method

The 95% confidence interval for the odds ratio
(Woolf, 1955) is

$$\widehat{OR} \times \exp\left(\pm z_{0.975}\sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}\right)$$

##### Example

From the example data used to compute the relative risk,

| Exposed | Diseased: Yes | Diseased: No | Totals |
|---|---|---|---|
| Yes | 40 | 80 | 120 |
| No | 60 | 320 | 380 |
| Totals | 100 | 400 | 500 |

the odds ratio is found to be

$$\widehat{OR} = \frac{ad}{bc} = \frac{40 \times 320}{80 \times 60} = 2.67$$

The 95% confidence interval for the odds ratio of 2.67 is

$$2.67 \times \exp\left(\pm 1.96\sqrt{\frac{1}{40} + \frac{1}{80} + \frac{1}{60} + \frac{1}{320}}\right) = (2.67 \times 0.626,\ 2.67 \times 1.599) = (1.67,\ 4.27)$$

##### Conclusions

- The odds of disease for exposed individuals is 2.67 times the odds for the unexposed. Equivalently, the disease odds ratio for exposed individuals relative to those who are unexposed is 2.67.
- We are 95% confident that the interval (1.67, 4.27) contains the true odds ratio of disease for exposed versus unexposed individuals.
- There is a statistically significant positive association between exposure and disease.

#### 3.4.3 Relationship between the Relative Risk and Odds Ratio

Note that the relative risk is defined as

$$RR = \frac{\Pr(\text{Disease} \mid \text{Exposed})}{\Pr(\text{Disease} \mid \text{Unexposed})} = \frac{\Pr(D \mid E)}{\Pr(D \mid \bar{E})}$$

This can be rewritten as

$$RR = \frac{\Pr(D \mid E)/[1 - \Pr(D \mid E)]}{\Pr(D \mid \bar{E})/[1 - \Pr(D \mid \bar{E})]} \times \frac{1 - \Pr(D \mid E)}{1 - \Pr(D \mid \bar{E})} = OR \times \frac{1 - \Pr(D \mid E)}{1 - \Pr(D \mid \bar{E})}$$

If the overall probability of disease is low in the exposed and unexposed populations, so that $\Pr(D \mid E)$ and $\Pr(D \mid \bar{E})$ are near zero, then

$$RR \approx OR = \frac{\Pr(E \mid D)/\Pr(\bar{E} \mid D)}{\Pr(E \mid \bar{D})/\Pr(\bar{E} \mid \bar{D})}$$

The qualification that the overall disease risk is low is referred to as the rare disease assumption. Under the rare disease assumption, the odds ratio is an approximation to the relative risk of disease.

##### Comments on the Odds Ratio

- The odds ratio is a useful measure of association in its own right.
- In the special situation where the disease of interest is rare, the odds ratio is also an approximation to the relative risk.
- The odds ratio is equally valid for data from case-control, cohort, or cross-sectional studies. In all of these designs, the calculated odds ratios are estimating the same population parameter.
- It can be interpreted either as the odds of disease for exposed versus unexposed individuals or the odds of exposure for diseased versus nondiseased individuals.
- When computing the odds ratio from
  tabular data, pay attention to the order of the categories in the table.
- Odds ratios can be produced in SAS using the same PROC FREQ statement used to obtain relative risk estimates (see the SAS code and output in Section 3.3.2).

### 3.5 Pearson Correlation

Recall that, in the case of normally distributed data, the correlation coefficient is defined as

$$\rho = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X)\operatorname{var}(Y)}}$$

and has the following properties:

- Its value ranges from -1 to 1.
- It measures the extent of the linear association between variables $X$ and $Y$.
- Values of 1 and -1 indicate a positive and negative linear association, respectively, with all points lying on a straight line.
- A value of 0 indicates no linear association.
- $\rho^2$ is the amount of variability in $X$ and $Y$ explained by the linear association between the two.

The Pearson correlation coefficient is an estimate of the population correlation and is computed as

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$

If both variables are dichotomous, say $X$ is the exposure status (0 = unexposed, 1 = exposed) and $Y$ is the disease status (0 = noncase, 1 = case), then the Pearson formula simplifies to

$$r = \frac{ad - bc}{\sqrt{m_1 m_2 n_1 n_2}}$$

where $m_1$, $m_2$, $n_1$, and $n_2$ are the marginal totals of the 2x2 table.

##### Notes

- This measure of association is appropriate for any study design.
- A value of 1 indicates that all diseased subjects are exposed and all nondiseased subjects are unexposed (a perfect positive association).
- A value of -1 indicates that all diseased subjects are unexposed and all nondiseased subjects are exposed (a perfect negative association).
- A value of 0 is equivalent to an odds ratio of 1 (no association).
- It can be obtained in SAS PROC FREQ by including the `measures` option in the `tables` statement.

##### Example

The Pearson correlation coefficient for the relative risk example is

$$r = \frac{40 \times 320 - 80 \times 60}{\sqrt{120 \times 380 \times 100 \times 400}} = 0.1873$$

## Section 4: Statistical Inference for Risk Measures

##### Table of Contents

- 4.1 Overview
- 4.2 Relative Risk
  - Example
  - 4.2.1 Hypothesis Testing
    - Pearson Chi-Square Test
    - Fisher's Exact Test
    - SAS Program and Output
- 4.3 Odds Ratio
  - 4.3.1 Hypothesis Testing
  - 4.3.2 Relationship between Confidence Intervals and Hypothesis Testing
    - Counter-Example
- 4.4 Multi-Level Exposures
  - SHHS Example
  - 4.4.1 General Test for an Association
    - Pearson Chi-Square Test
    - SAS Program and Output
    - Pairwise Comparisons
    - SAS Program and Output
  - 4.4.2 Tests for Trend
    - Cochran-Mantel-Haenszel Test
    - SAS Program and Output

### 4.1 Overview

Statistical inference provides a means for using sampling data to draw conclusions about a larger population. It involves the estimation of population parameters, the quantification of uncertainty, and the testing of hypotheses. In this section we will extend our discussion of measures of association to include inferential methods for

- Testing for an association between exposure and disease.

### 4.2 Relative Risk

##### Example

Recall the cohort data used previously to illustrate the relative risk and odds ratio.

| Exposed | Diseased: Yes | Diseased: No | Totals |
|---|---|---|---|
| Yes | 40 | 80 | 120 |
| No | 60 | 320 | 380 |
| Totals | 100 | 400 | 500 |

The estimates were

$$\widehat{RR} = \frac{a/(a+b)}{c/(c+d)} = \frac{40/120}{60/380} = 2.11 \qquad \widehat{OR} = \frac{ad}{bc} = \frac{40 \times 320}{80 \times 60} = 2.67$$

#### 4.2.1 Hypothesis Testing

For now, let us focus on the comparison of disease risk across two exposure levels. We will eventually address the general problem of making comparisons across 2 or more levels of an exposure variable.

Suppose that we are interested in testing the hypotheses

$$H_0: RR = 1 \qquad H_A: RR \neq 1$$

This is something that we already know how to do. Remember that the relative risk is computed as the ratio of two probabilities:

$$RR = \frac{\Pr(\text{Disease} \mid \text{Exposed})}{\Pr(\text{Disease} \mid \text{Unexposed})} = \frac{\pi_1}{\pi_2}$$

Thus, the hypotheses can be rewritten as a comparison of the probabilities between two independent groups:

$$H_0: \pi_1 = \pi_2 \qquad H_A: \pi_1 \neq \pi_2$$

Two potential options are

- the Pearson chi-square test for an association, or
- Fisher's exact test.

##### Pearson Chi-Square Test

The Pearson test can be used to test for an association between the levels of two categorical variables. Since it is based on normal-theory methods, it is only appropriate if the sample size is large enough. Our specific interest is in using the test to compare the
probability of disease between an exposed and unexposed group of subjects.

Comments on the Pearson test when the two variables are dichotomous (a 2x2 table):

- The sample size is deemed large enough if none of the expected cell counts $e_{ij} < 5$, where $e_{ij} = m_i n_j / n$.
- The Pearson chi-square test is equivalent to the two-sample test for binomial proportions.
- The null hypothesis is one of no association between the two variables; the alternative is that there is an association.
- The test statistic is

  $$X^2 = \frac{n(ad - bc)^2}{m_1 m_2 n_1 n_2} \sim \chi^2_1$$

  for which the 2-sided p-value is $p = \Pr(\chi^2_1 \geq X^2)$.

##### Example

The Pearson chi-square test statistic evaluates to

$$X^2 = \frac{500(40 \times 320 - 80 \times 60)^2}{380 \times 120 \times 400 \times 100} = 17.54$$

which gives a p-value of

$$p = \Pr(\chi^2_1 \geq 17.54) = 0.00003$$

Therefore, the relative risk is significantly different from one (p < 0.0001). In particular, the relative risk estimate of 2.11 is significantly greater than one. There is a statistically significant positive association between exposure and disease.

##### Fisher's Exact Test

Fisher's test is a nonparametric analog to the Pearson chi-square test. The test is always appropriate and is particularly useful if the sample size is not large enough to use the Pearson test. The hypotheses and conclusions are the same as before. We will rely on SAS to carry out the test.

##### SAS Program and Output

    data rrexample;
      input Case $ Exposed $ N;
      cards;
    Yes Yes 40
    Yes No 60
    No Yes 80
    No No 320
    ;

    proc freq order=data data=rrexample;
      weight N;
      tables Exposed*Case / relrisk chisq nopercent nocol expected;
    run;

##### Syntax

- `nopercent` and `nocol` suppress the printing of the overall and column percentages, respectively, in the output table.
- `expected` adds the expected cell counts to the table.

SAS output:

    The FREQ Procedure

    Table of Exposed by Case

    Exposed     Case
    Frequency |
    Expected  |
    Row Pct   |   Yes  |    No  |  Total
    ----------+--------+--------+-------
    Yes       |     40 |     80 |    120
              |     24 |     96 |
              |  33.33 |  66.67 |
    ----------+--------+--------+-------
    No        |     60 |    320 |    380
              |     76 |    304 |
              |  15.79 |  84.21 |
    ----------+--------+--------+-------
    Total          100      400      500

    Estimates of the Relative Risk (Row1/Row2)

    Type of Study                  Value    95% Confidence Limits
    Case-Control (Odds Ratio)     2.6667       1.6681      4.2629
    Cohort (Col1 Risk)            2.1111       1.4975      2.9762
    Cohort (Col2 Risk)            0.7917       0.6925      0.9050

    Sample Size = 500

    Statistics for Table of Exposed by Case

    Statistic                     DF       Value      Prob
    Chi-Square                     1     17.5439    <.0001
    Likelihood Ratio Chi-Square    1     16.1557    <.0001
    Continuity Adj. Chi-Square     1     16.4645    <.0001
    Mantel-Haenszel Chi-Square     1     17.5088    <.0001
    Phi Coefficient                       0.1873
    Contingency Coefficient               0.1841
    Cramer's V                            0.1873

    Fisher's Exact Test
    Cell (1,1) Frequency (F)            40
    Left-sided Pr <= F              1.0000
    Right-sided Pr >= F          4.659E-05
    Table Probability (P)        3.008E-05
    Two-sided Pr <= P            6.928E-05

### 4.3 Odds Ratio

#### 4.3.1 Hypothesis Testing

The hypotheses of interest are

$$H_0: OR = 1 \qquad H_A: OR \neq 1$$

which can be addressed with the same tests used for the relative risk, namely the Pearson chi-square and Fisher's exact tests.

#### 4.3.2 Relationship between Confidence Intervals and Hypothesis Testing

Our hypotheses

$$H_0: RR = 1 \text{ vs. } H_A: RR \neq 1 \qquad \text{and} \qquad H_0: OR = 1 \text{ vs. } H_A: OR \neq 1$$

can be tested using either confidence intervals or test statistics. Say we are interested in conducting tests at the 5% level of significance. The two options for hypothesis testing are

1. Confidence Interval Approach: If the 95% confidence interval does not contain 1, then the null hypothesis is rejected in favor of the alternative.
2. Test Statistic Approach: If the p-value computed from the test statistic is less than 0.05, then the null hypothesis is rejected in favor of the alternative.

It would be nice if the two approaches always led to the same conclusion; that is, if they were equivalent methods for testing the hypotheses.

##### Counter-Example

Consider the SAS output given below from the analysis of a hypothetical dataset.

##### Notes

- Based on the 95% confidence interval of (0.9126, 22.1893) for the odds ratio, we would fail to conclude that $H_A: OR \neq 1$.
- Based on the 95% confidence interval of (0.8364, 13.2842) for the relative risk, we would fail to conclude that $H_A: RR \neq 1$.
- Of course, the conclusion based on the confidence interval for the odds ratio may differ from that for the relative risk. This is relevant for studies in which either measure of association is appropriate, e.g., cohort studies.
- Based on
  the Pearson chi-square statistic, we would reject the null hypothesis and conclude that there is an association between exposure and disease (p = 0.0497).
- Based on Fisher's exact test, we would fail to conclude that there is an association (p = 0.0653).
- The confidence intervals and test statistics do not necessarily give equivalent results.

SAS output:

    The FREQ Procedure

    Table of Exposed by Case

    Exposed     Case
    Frequency |   Yes  |    No  |  Total
    ----------+--------+--------+-------
    Yes       |     14 |     28 |     42
    No        |      2 |     18 |     20
    ----------+--------+--------+-------
    Total           16       46       62

    Estimates of the Relative Risk (Row1/Row2)

    Type of Study                  Value    95% Confidence Limits
    Case-Control (Odds Ratio)     4.5000       0.9126     22.1893
    Cohort (Col1 Risk)            3.3333       0.8364     13.2842
    Cohort (Col2 Risk)            0.7407       0.5717      0.9597

    Sample Size = 62

    Statistics for Table of Exposed by Case

    Statistic                     DF       Value      Prob
    Chi-Square                     1      3.8525    0.0497
    Likelihood Ratio Chi-Square    1      4.3363    0.0373
    Continuity Adj. Chi-Square     1      2.7303    0.0985
    Mantel-Haenszel Chi-Square     1      3.7904    0.0515
    Phi Coefficient                       0.2493
    Contingency Coefficient               0.2419
    Cramer's V                            0.2493

    Fisher's Exact Test
    Cell (1,1) Frequency (F)            14
    Left-sided Pr <= F              0.9922
    Right-sided Pr >= F             0.0446
    Table Probability (P)           0.0367
    Two-sided Pr <= P               0.0653

### 4.4 Multi-Level Exposures

Our main focus has been on statistical tests for an association between a dichotomous exposure (exposed versus unexposed) and disease. We now turn to methods for assessing the effect of a categorical exposure with 2 or more levels. The notation in this more general situation is

| Diseased | $x_1$ | $x_2$ | ... | $x_I$ | Totals |
|---|---|---|---|---|---|
| Yes | $a_1$ | $a_2$ | ... | $a_I$ | $n_1$ |
| No | $b_1$ | $b_2$ | ... | $b_I$ | $n_2$ |
| Totals | $m_1$ | $m_2$ | ... | $m_I$ | $n$ |

where the columns $x_1, \dots, x_I$ are the exposure levels. That is, interest lies in the association between a dichotomous disease variable and a categorical exposure variable with $I$ levels. The null hypotheses to be addressed are

$$H_0: RR_2 = RR_3 = \dots = RR_I = 1$$

and

$$H_0: OR_2 = OR_3 = \dots = OR_I = 1$$

where the first exposure category $x_1$ is taken as the reference group. As we will see, the choice of a statistical test will depend on our specified alternative hypothesis.

##### SHHS Example

The following data present subjects from the Scottish Heart Health (cohort) Study (Tunstall-Pedoe et al., 1997) classified by cholesterol and
coronary heart disease (CHD) status.

| CHD | Cholesterol 1 (low) | 2 | 3 | 4 | 5 (high) | Totals |
|---|---|---|---|---|---|---|
| Yes | 15 | 20 | 26 | 41 | 48 | 150 |
| No | 798 | 794 | 791 | 785 | 777 | 3945 |
| Totals | 813 | 814 | 817 | 826 | 825 | 4095 |

Analysis Goal: Test for an association between cholesterol and risk of coronary heart disease.

#### 4.4.1 General Test for an Association

Suppose that we would like to address the following hypotheses:

$$H_0: RR_2 = RR_3 = \dots = RR_I = 1 \qquad H_A: RR_i \neq 1 \text{ for some } i$$

In other words, the null hypothesis is one of equal risk across all exposure levels versus the alternative that the risk differs between at least two of the levels.

The hypotheses can be written equivalently as

$$H_0: \pi_1 = \pi_2 = \pi_3 = \dots = \pi_I \qquad H_A: \pi_i \neq \pi_j \text{ for some } i \text{ and } j$$

where $\pi_i$ is the probability of disease at exposure level $i$. This is precisely the situation for which the Pearson chi-square test of homogeneity is appropriate.

##### Pearson Chi-Square Test

The Pearson chi-square test statistic is calculated as

$$X^2 = \sum_{\text{rows}} \sum_{\text{columns}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

which in our case is

$$X^2 = \sum_{i=1}^{2} \sum_{j=1}^{I} \frac{(\text{observed}_{ij} - e_{ij})^2}{e_{ij}} \sim \chi^2_{I-1}$$

where the expected number of subjects is computed as

$$e_{ij} = \frac{m_j n_i}{n}$$

The 2-sided p-value is $p = \Pr(\chi^2_{I-1} \geq X^2)$.

##### Notes

- The test is appropriate if no more than 20% of the expected cell counts are less than 5 and no expected count is less than 1. SAS will print a warning if this criterion is not satisfied. Fisher's exact test can be used in that case; however, SAS may not be able to carry out the exact test for large sample sizes.
- Note that we may reject the null hypothesis in favor of the alternative if any of the relative risks is significantly different from one. There is no assumed ordering of the relative risks or the exposure levels. Hence, this test is appropriate for nominal, ordinal, or discrete exposure variables.
- It may be used for the analogous test of equality across odds ratios.

##### SHHS Example

The first step is to calculate the expected cell counts. For instance, the expected count in the first cell (CHD = Yes, Cholesterol Status = 1) is

$$e_{11} = \frac{m_1 n_1}{n} = \frac{813 \times 150}{4095} = 29.78$$

The complete set of calculations for the Pearson chi-square test
statistic is given in the following worksheet.

| $i$ | $j$ | Observed | Expected $e_{ij}$ | (Observed - Expected)$^2$ | (Observed - Expected)$^2 / e_{ij}$ |
|---|---|---|---|---|---|
| 1 | 1 | 15 | 29.78 | 218.45 | 7.34 |
| 1 | 2 | 20 | 29.82 | 96.37 | 3.23 |
| 1 | 3 | 26 | 29.93 | 15.42 | 0.52 |
| 1 | 4 | 41 | 30.26 | 115.42 | 3.81 |
| 1 | 5 | 48 | 30.22 | 316.14 | 10.46 |
| 2 | 1 | 798 | 783.22 | 218.45 | 0.28 |
| 2 | 2 | 794 | 784.18 | 96.37 | 0.12 |
| 2 | 3 | 791 | 787.07 | 15.42 | 0.02 |
| 2 | 4 | 785 | 795.74 | 115.42 | 0.15 |
| 2 | 5 | 777 | 794.78 | 316.14 | 0.40 |

Test statistic: $X^2 = 26.32$

The resulting p-value is $\Pr(\chi^2_4 \geq 26.32) = 0.00003$. Therefore, there is a significant association between cholesterol and CHD risk. The risk of disease is not equal across the cholesterol categories.

An obvious follow-up question to ask is: where do the cholesterol categories differ with respect to the risk of CHD, and what is the direction of the association?

##### SAS Program and Output

    data shhs;
      input Case $ Exposure N;
      cards;
    Yes 1 15
    No 1 798
    Yes 2 20
    No 2 794
    Yes 3 26
    No 3 791
    Yes 4 41
    No 4 785
    Yes 5 48
    No 5 777
    ;

    proc freq order=data data=shhs;
      weight N;
      tables Case*Exposure / chisq exact;
    run;

##### Syntax

- For 2x2 tables, Fisher's exact test is automatically performed when the `chisq` option is given. For tables with more than two columns or rows, Fisher's exact test must be requested explicitly via the `exact` option.

SAS output:

    The FREQ Procedure

    Table of Case by Exposure

    Case        Exposure
    Frequency |
    Percent   |
    Row Pct   |
    Col Pct   |      1 |      2 |      3 |      4 |      5 |  Total
    ----------+--------+--------+--------+--------+--------+-------
    Yes       |     15 |     20 |     26 |     41 |     48 |    150
              |   0.37 |   0.49 |   0.63 |   1.00 |   1.17 |   3.66
              |  10.00 |  13.33 |  17.33 |  27.33 |  32.00 |
              |   1.85 |   2.46 |   3.18 |   4.96 |   5.82 |
    ----------+--------+--------+--------+--------+--------+-------
    No        |    798 |    794 |    791 |    785 |    777 |   3945
              |  19.49 |  19.39 |  19.32 |  19.17 |  18.97 |  96.34
              |  20.23 |  20.13 |  20.05 |  19.90 |  19.70 |
              |  98.15 |  97.54 |  96.82 |  95.04 |  94.18 |
    ----------+--------+--------+--------+--------+--------+-------
    Total          813      814      817      826      825     4095
                 19.85    19.88    19.95    20.17    20.15   100.00

    Statistics for Table of Case by Exposure

    Statistic                     DF       Value      Prob
    Chi-Square                     4     26.3232    <.0001
    Likelihood Ratio Chi-Square    4     26.4405    <.0001
    Mantel-Haenszel Chi-Square     1     25.3900    <.0001
    Phi Coefficient                       0.0802
    Contingency Coefficient               0.0799
    Cramer's V                            0.0802

    Fisher's Exact Test
    Table Probability (P)        1.523E-10
    Pr <= P                      2.813E-05

    Sample Size = 4095

##### Pairwise Comparisons

In our example, we rejected the null hypothesis and concluded that the risk of CHD was not equal across all cholesterol levels. This
global test of equality does not identify specific differences in the relative risks. One method for doing so is to look at all pairwise comparisons of the exposure levels.

- If there are $I$ levels for the exposure variable, there will be $I(I-1)/2$ pairwise comparisons to be made.
- If we use an $\alpha'$ level of significance for each of the pairwise comparisons, the overall significance level will be

  $$\alpha = 1 - (1 - \alpha')^{I(I-1)/2}$$

  A significance level of $\alpha = 0.05$ is typically used in hypothesis testing. Thus, $\alpha'$ should be adjusted to ensure that the desired overall level of significance is maintained.
- Two conservative methods for determining the significance level to be used in the individual pairwise comparisons are

  1. Bonferroni Method: $\alpha' = \dfrac{\alpha}{I(I-1)/2}$
  2. Probability Method: $\alpha' = 1 - (1 - \alpha)^{2/[I(I-1)]}$

  The Bonferroni method is used more often; however, the probability method is slightly less conservative (see Table 1).
- Pairs of exposure categories can be compared individually using the Pearson chi-square or Fisher's exact test, as usual.

Table 1. Adjusted significance level for use in statistical tests of multiple pairwise comparisons.

| Exposure Levels $I$ | Pairwise Comparisons $I(I-1)/2$ | Overall Significance $\alpha$ | Individual Test Significance $\alpha'$: Bonferroni | Individual Test Significance $\alpha'$: Probability |
|---|---|---|---|---|
| 3 | 3 | 0.05 | 0.01667 | 0.01695 |
| 4 | 6 | 0.05 | 0.00833 | 0.00851 |
| 5 | 10 | 0.05 | 0.00500 | 0.00512 |

##### SHHS Example

In the test of global equality, we rejected the null hypothesis that the relative risks were all equal to one (p < 0.0001). To determine where the cholesterol categories differ with respect to CHD, we can perform pairwise comparisons of the exposure levels. For each pair of exposure levels, the relative risk is computed and its significance tested using the Pearson chi-square test.

| Cholesterol Status | RR | p-value |
|---|---|---|
| 2 vs. 1 | 1.33 | 0.3949 |
| 3 vs. 1 | 1.72 | 0.0847 |
| 4 vs. 1 | 2.69 | 0.0005 |
| 5 vs. 1 | 3.15 | <0.0001 |
| 3 vs. 2 | 1.30 | 0.3763 |
| 4 vs. 2 | 2.02 | 0.0073 |
| 5 vs. 2 | 2.37 | 0.0006 |
| 4 vs. 3 | 1.56 | 0.0680 |
| 5 vs. 3 | 1.83 | 0.0100 |
| 5 vs. 4 | 1.17 | 0.4421 |

The Bonferroni method suggests a significance level of 0.005 for the individual pairwise comparisons. Comparisons for which the
p-value is less than the Bonferroni value are deemed to be significant. Specifically, the relative risks are significant for the cholesterol levels 4 vs. 1 (p = 0.0005), 5 vs. 1 (p < 0.0001), and 5 vs. 2 (p = 0.0006). The associated relative risks indicate a positive association between elevated cholesterol and disease risk.

##### SAS Program and Output

    proc freq order=data data=shhs;
      where Exposure in (1,2);
      weight N;
      tables Exposure*Case / nopercent nocol norow relrisk chisq;
    run;

##### Syntax

- The `where` statement can be used in any SAS procedure to restrict the analysis to a subset of the original data. The statement here specifies that the analysis be limited to the data for which the Exposure variable equals 1 or 2.

SAS output:

    The FREQ Procedure

    Table of Exposure by Case

    Exposure    Case
    Frequency |   Yes  |    No  |  Total
    ----------+--------+--------+-------
    1         |     15 |    798 |    813
    2         |     20 |    794 |    814
    ----------+--------+--------+-------
    Total           35     1592     1627

    Estimates of the Relative Risk (Row1/Row2)

    Type of Study                  Value    95% Confidence Limits
    Case-Control (Odds Ratio)     0.7462       0.3793      1.4680
    Cohort (Col1 Risk)            0.7509       0.3872      1.4563
    Cohort (Col2 Risk)            1.0063       0.9919      1.0209

    Sample Size = 1627

    Statistics for Table of Exposure by Case

    Statistic                     DF       Value      Prob
    Chi-Square                     1      0.7237    0.3949
    Likelihood Ratio Chi-Square    1      0.7262    0.3941
    Continuity Adj. Chi-Square     1      0.4622    0.4966
    Mantel-Haenszel Chi-Square     1      0.7233    0.3951
    Phi Coefficient                      -0.0211
    Contingency Coefficient               0.0211
    Cramer's V                           -0.0211

    Fisher's Exact Test
    Cell (1,1) Frequency (F)            15
    Left-sided Pr <= F              0.2486
    Right-sided Pr >= F             0.8465
    Table Probability (P)           0.0951
    Two-sided Pr <= P               0.4949

#### 4.4.2 Tests for Trend

In the Scottish Heart Health Study, as is often the case, the levels of the exposure variable are ordered. Rather than testing for a general association between exposure and disease, interest commonly lies in testing for a consistent trend in the risk of disease across the exposure levels. Such a trend is also known as a dose-response effect. We now focus on tests to address the hypotheses

$$H_0: RR_2 = RR_3 = \dots = RR_I = 1$$

$$H_A: 1 < RR_2 < RR_3 < \dots < RR_I \quad \text{or} \quad 1 > RR_2 > RR_3 > \dots > RR_I$$

Specifically, the alternative hypothesis is that disease risk is increasing or decreasing across
the levels of the exposure variable. An examination of the relative risk estimates can be used to determine the actual direction of the association.

##### Cochran-Mantel-Haenszel Test

One popular statistic for performing a test of trend is

$$X^2 = \frac{\left(\sum_{j=1}^{I} a_j x_j / n_1 - \bar{x}\right)^2}{\dfrac{n - n_1}{n_1 (n - 1)} \sum_{j=1}^{I} m_j (x_j - \bar{x})^2 / n} \sim \chi^2_1$$

where $\bar{x} = \sum_{j=1}^{I} m_j x_j / n$ and $x_j$ is the user-specified weight (the numeric value) for the $j$th exposure category.

- This is referred to as the Cochran-Mantel-Haenszel row mean scores test statistic.
- The choices of weights that we will consider are

  1. Integer Weights: Assigns integer values, say $1, \dots, I$, to the exposure levels. This assumes that the rate of increase or decrease is constant across the levels.
  2. Ranks: Column ranks are defined as

     $$x_1 = \frac{m_1 + 1}{2}, \qquad x_j = \sum_{k=1}^{j-1} m_k + \frac{m_j + 1}{2}$$

     i.e., the column rank based on the cumulative number of exposed individuals.
- Other choices that may be of interest are the mean, median, and midpoint values of the exposure variable within each category.
- The Cochran-Mantel-Haenszel test is more powerful for detecting positive or negative trends in the data than the Pearson chi-square test for a general association. Tests for trend also provide stronger evidence of a causal relationship.

##### SHHS Example

SAS was used to carry out the Cochran-Mantel-Haenszel test using integer weights for the cholesterol categories. The test statistic value was 25.39, with a p-value < 0.0001. Thus, at the 5% level of significance, it can be concluded that there is a significant dose-response effect of cholesterol on the risk of CHD. The estimated relative risks from our multiple comparisons example are

| Cholesterol Status | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| RR | 1.00 | 1.33 | 1.72 | 2.69 | 3.15 |

which indicate a positive association between elevated cholesterol and risk of CHD.

##### SAS Program and Output

    proc freq order=data data=shhs;
      weight N;
      tables Case*Exposure / cmh scores=table;
    run;

##### Syntax

- The `cmh` option requests that the Cochran-Mantel-Haenszel test be performed.
- `scores=` is used to select the weights to be used in computing the row mean scores test statistic.
  - `scores=table` is the default and uses the values of the
    column variables as the weights.
  - `scores=rank` requests that the ranks be used.
- Disease must be given as the row variable and Exposure as the column variable.

SAS output:

    The FREQ Procedure

    Table of Case by Exposure

    Case        Exposure
    Frequency |
    Percent   |
    Row Pct   |
    Col Pct   |      1 |      2 |      3 |      4 |      5 |  Total
    ----------+--------+--------+--------+--------+--------+-------
    Yes       |     15 |     20 |     26 |     41 |     48 |    150
              |   0.37 |   0.49 |   0.63 |   1.00 |   1.17 |   3.66
              |  10.00 |  13.33 |  17.33 |  27.33 |  32.00 |
              |   1.85 |   2.46 |   3.18 |   4.96 |   5.82 |
    ----------+--------+--------+--------+--------+--------+-------
    No        |    798 |    794 |    791 |    785 |    777 |   3945
              |  19.49 |  19.39 |  19.32 |  19.17 |  18.97 |  96.34
              |  20.23 |  20.13 |  20.05 |  19.90 |  19.70 |
              |  98.15 |  97.54 |  96.82 |  95.04 |  94.18 |
    ----------+--------+--------+--------+--------+--------+-------
    Total          813      814      817      826      825     4095
                 19.85    19.88    19.95    20.17    20.15   100.00

    Summary Statistics for Case by Exposure

    Cochran-Mantel-Haenszel Statistics (Based on Table Scores)

    Statistic    Alternative Hypothesis    DF       Value      Prob
        1        Nonzero Correlation        1     25.3900    <.0001
        2        Row Mean Scores Differ     1     25.3900    <.0001
        3        General Association        4     26.3168    <.0001

    Total Sample Size = 4095

##### Iowa Radon Example

Subjects from the Iowa Radon Lung Cancer (case-control) Study are classified by disease and radon exposure status in the table below.

| Lung Cancer | Radon Exposure: 0 to 4.23 | 4.24 to 8.47 | 8.48 to 12.70 | 12.71 to 16.94 | ≥ 16.95 | Totals |
|---|---|---|---|---|---|---|
| Yes | 56 | 147 | 87 | 56 | 67 | 413 |
| No | 104 | 229 | 118 | 75 | 88 | 614 |
| Totals | 160 | 376 | 205 | 131 | 155 | 1027 |
| Median Exposure | 3.16 | 6.18 | 10.50 | 14.58 | 21.16 | |

If the medians are to be used as weights in the Cochran-Mantel-Haenszel test, we can use those as the numeric values for the exposure variable in our dataset or compute the statistic by hand.

Table 2. Worksheet calculation of the Cochran-Mantel-Haenszel statistic for the Iowa Radon Lung Cancer Study data.

| Exposure | Cases $a_j$ | Controls $b_j$ | Totals $m_j$ | $x_j$ | $a_j x_j / n_1$ | $m_j x_j / n$ | $m_j (x_j - \bar{x})^2 / n$ |
|---|---|---|---|---|---|---|---|
| 0 to 4.23 | 56 | 104 | 160 | 3.16 | 0.4285 | 0.4923 | 7.0860 |
| 4.24 to 8.47 | 147 | 229 | 376 | 6.18 | 2.1996 | 2.2626 | 5.0777 |
| 8.48 to 12.70 | 87 | 118 | 205 | 10.50 | 2.2119 | 2.0959 | 0.0709 |
| 12.71 to 16.94 | 56 | 75 | 131 | 14.58 | 1.9769 | 1.8598 | 2.7888 |
| ≥ 16.95 | 67 | 88 | 155 | 21.16 | 3.4327 | 3.1936 | 19.1213 |
| Totals | 413 | 614 | 1027 | | 10.2497 | 9.9041 | 34.1448 |

Chi-square = 2.4132, p-value = 0.1203

At the 5% level of significance, we do not have evidence of a dose-response effect of radon exposure on lung cancer risk (p = 0.1203).

## Section 5: Sample Size and Power
##### Table of Contents

- 5.1 Introduction
  - 5.1.1 Notation
  - 5.1.2 Confidence Interval
- 5.2 Parameter Estimation
  - Proportion Example
  - Odds Ratio Example
- 5.3 Hypothesis Testing
  - Sample Size Algorithm
  - Proportion Example
  - SAS Program and Output
  - Odds Ratio Example
  - SAS Program and Output
- 5.4 Multivariate Analyses

### 5.1 Introduction

When designing a study, it is important to consider the sample size needed to provide a reasonable opportunity to address the research questions of interest. We will examine methods for estimating sample size requirements in the context of

1. Parameter Estimation
2. Hypothesis Testing

Recall the following definitions related to hypothesis testing:

- Significance Level: Probability of rejecting the null hypothesis when it is true; also referred to as the Type I error rate ($\alpha$).
- Power: Probability of rejecting the null hypothesis when it is false; 1 minus the Type II error rate ($1 - \beta$).

#### 5.1.1 Notation

We will primarily consider sample size estimation in cases where the outcome of interest is dichotomous, i.e., diseased versus nondiseased, and comparisons are between two groups, i.e., exposed versus unexposed.

Let $i = 1, 2$ index the two groups of exposed and unexposed individuals, respectively, such that

- $\pi_i$ = Probability of disease (cohort) or exposure (case-control) in Group $i$.
- $n_i$ = Number of subjects in Group $i$.
- $r = n_2 / n_1$ = Number of subjects in Group 2 relative to Group 1.

#### 5.1.2 Confidence Interval

In general, if a population parameter $\theta$ can be estimated with a sample statistic that is approximately normally distributed, then the associated confidence interval has the general form

$$\hat{\theta} \pm z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}$$

The parameters of interest in our sample size discussion are the disease probability, odds ratio, and relative risk. The table below summarizes the forms of these parameters for which the normal assumption is typically used in constructing confidence intervals.

Table 1. Common parameters and approximate standard errors.

| Parameter | $\theta$ | $\sigma / \sqrt{n}$ |
|---|---|---|
| Disease Probability | $\pi$ | $\sqrt{\dfrac{\pi(1-\pi)}{n}}$ |
| Difference | $\pi_1 - \pi_2$ | $\sqrt{\dfrac{1}{n}\left[\pi_1(1-\pi_1) + \dfrac{\pi_2(1-\pi_2)}{r}\right]}$ |
| Log-Odds Ratio | $\ln\left[\dfrac{\pi_1/(1-\pi_1)}{\pi_2/(1-\pi_2)}\right]$ | $\sqrt{\dfrac{1}{n}\left[\dfrac{1}{\pi_1(1-\pi_1)} + \dfrac{1}{r\,\pi_2(1-\pi_2)}\right]}$ |
| Log-Relative Risk | $\ln\left(\dfrac{\pi_1}{\pi_2}\right)$ | $\sqrt{\dfrac{1}{n}\left[\dfrac{1-\pi_1}{\pi_1} + \dfrac{1-\pi_2}{r\,\pi_2}\right]}$ |

In the two-group comparison, $n_1 = n$ is the sample size for Group 1 and $n_2 = rn$ is the sample size for Group 2.

### 5.2 Parameter Estimation

The sample size required to obtain a $100(1-\alpha)\%$ confidence interval of width $W$ is

$$n = \frac{z_{1-\alpha/2}^2\, \sigma^2}{(W/2)^2}$$

Note that we could write the confidence interval of interest in terms of $W$ such that

$$\hat{\theta} \pm \frac{W}{2}$$

##### Proportion Example

A particular gene polymorphism has been identified as a cancer risk factor. Public health officials would like to obtain 95% confidence intervals that are within 5 points of the estimated prevalence of this particular gene. How many subjects should be sampled for the estimation?

##### Odds Ratio Example

A case-control study is being designed to study the effects of residential radon on the risk of leukemia. The study will enroll twice as many controls as cases, and the investigators would like the confidence interval to be within 25% of the estimated odds ratio. Approximately half of the control subjects are expected to have high radon exposure.

### 5.3 Hypothesis Testing

The sample size needed for testing the two-sided hypothesis

$$H_0: \theta = \theta_0 \qquad H_A: \theta \neq \theta_0$$

with significance level $\alpha$ and power $1 - \beta$, under the assumption that the true value of the parameter is $\theta = \theta_A$, is

$$n = \left[\frac{z_{1-\alpha/2}\,\sigma_0 + z_{1-\beta}\,\sigma_A}{\theta_0 - \theta_A}\right]^2$$

If this is for a two-group comparison, then $n_1 = n$ is the sample size for Group 1 and $n_2 = rn$ is the sample size for Group 2. For one-sided alternatives, $\alpha$ is substituted for $\alpha/2$ in the sample size formula.

##### Sample Size Algorithm

1. Express the null and alternative hypotheses of interest in terms of the appropriate population parameter in Table 1.
2. Compute the probabilities under the alternative hypothesis that the population parameter $\theta = \theta_A$. Use these in the standard deviation formula to calculate $\sigma_A$.
3. Compute the probabilities under the null hypothesis that the population parameter $\theta = \theta_0$. Use these in the standard deviation formula to calculate $\sigma_0$.
4. Insert $\theta_A$, $\theta_0$, $\sigma_A$, and
   $\sigma_0$ into the sample size formula.

##### Proportion Example

A clinical trial is planned to study the efficacy of a new cancer treatment. Efficacy will be measured as the proportion of patients that respond to the treatment. The investigators would like to perform a 5% level test of the null hypothesis that the response rate is less than or equal to 20% versus the alternative that it is greater than 20%. How many subjects should be enrolled to have 80% power to detect a true response rate of 35%?

##### SAS Program and Output

    proc power;
      onesamplefreq test=z
        alpha=0.05
        power=0.80
        sides=U
        nullproportion=0.20
        proportion=0.35
        ntotal=.
        method=normal;
    run;

##### Syntax

- `test` indicates whether the test statistic is `z`, `adjz`, or `exact`.
- `method` specifies the computational method: `exact` (binomial distribution) or `normal` (approximation to the binomial). The latter must be used to obtain sample size estimates.
- `alpha` gives the significance level of the test and `power` the test power.
- `sides` indicates whether the alternative hypothesis is one-sided with the alternative in the direction of the effect (`1`), two-sided (`2`), one-sided with the effect greater than the null value (`U`), or one-sided with the effect less than the null value (`L`).
- `nullproportion` sets the proportion for the null hypothesis.
- `proportion` sets the alternative value at which the study is powered.
- `ntotal=.` requests sample size estimates; alternatively, the sample size can be given and power estimated with the option `power=.`

SAS output:

    The POWER Procedure
    Z Test for Binomial Proportion

    Fixed Scenario Elements
    Method                   Normal approximation
    Number of Sides          U
    Null Proportion          0.2
    Alpha                    0.05
    Binomial Proportion      0.35
    Nominal Power            0.8

    Computed N Total
    Actual Power    N Total
           0.801         50

##### Odds Ratio Example

For the leukemia case-control study described previously, suppose that a 5% level test is planned to determine if the odds ratio for radon is significantly different from unity. As before, it is expected that 50% of control subjects will have high radon exposure. How many subjects should be enrolled to ensure 80% power to detect a true odds
ratio of 1.50?

##### SAS Program and Output

    proc power;
      twosamplefreq test=pchi
        alpha=0.05
        power=0.80
        sides=2
        oddsratio=1.5
        refproportion=0.5
        groupweights=(2 1)
        ntotal=.;
    run;

##### Syntax

- `test` can be either `pchi`, `lrchi`, or `fisher`.
- `oddsratio` is the value at which the test is powered.
- `refproportion` is the proportion in the reference group.
- `groupweights` specifies the relative number of subjects in each group.

SAS output:

    The POWER Procedure
    Pearson Chi-square Test for Two Proportions

    Fixed Scenario Elements
    Distribution                      Asymptotic normal
    Method                            Normal approximation
    Number of Sides                   2
    Alpha                             0.05
    Reference (Group 1) Proportion    0.5
    Odds Ratio                        1.5
    Group 1 Weight                    2
    Group 2 Weight                    1
    Nominal Power                     0.8
    Null Odds Ratio                   1

    Computed N Total
    Actual Power    N Total
           0.801        876

### 5.4 Multivariate Analyses

The statistical methodology for determining sample size when there are multiple predictor variables is beyond the scope of this class. The two most commonly used methods are based on

1. Chi-square tests and the noncentrality parameter associated with the alternative hypothesis.
2. Simulations.

Popular software programs for computing sample size:

- NCSS PASS (www.ncss.com)
- Power and Precision (www.powerandprecision.com)
- nQuery (http://www.statsol.ie)

## Section 6: Confounding and Interaction

##### Table of Contents

- 6.1 Overview
  - 6.1.1 Confounding
    - Definition
    - Example 1
    - Example 2
    - Example 3: SHHS
    - Notes
    - Mantel-Haenszel Methods
    - Odds Ratio
    - Relative Risk
  - 6.1.2 Test of Homogeneity
  - 6.1.3 Hypothesis Testing
    - SAS Program and Output
  - 6.1.4 Interaction
    - Definition
    - Gene Example
    - Types of Interaction
    - Testing for Interaction
  - 6.1.5 Confounding versus Interaction
    - Notes
    - Evans County Heart Study (ECHS) Example
    - Esophageal Cancer Example
    - SAS Program
    - SAS Output: Tobacco-Cancer adjusted for Alcohol
    - Notes
    - SAS Output: Alcohol-Cancer adjusted for Tobacco
  - 6.1.6 Application to Matched Data
    - Advantages of Matching
    - Disadvantages of Matching
    - D-Dimer Example
    - SAS Program and Output
  - 6.1.7 Comments on the Cochran-Mantel-Haenszel Test
    - Non-Linear Trend Example
    - Linear Trend Example
    - Notes

### 6.1 Overview

Thus far, we have limited our discussion to the relationship between only two variables. However, there are often other variables or factors that have an important influence on the apparent relationship between the exposure and disease of interest. Whenever an epidemiologic study is designed or analyzed, you need to consider the issues of

- Confounding
- Interaction

#### 6.1.1 Confounding

Confounding is the bias in the risk estimate that can result when the exposure-disease relationship under study is partially or wholly explained by the effects of an extraneous variable.

For example, a relationship between the number of children and prevalent breast cancers for a sample of mothers may be explained by the ages of the mothers.

- Older mothers tend to have more children and also have a greater chance of developing breast cancer.
- Age is the extraneous variable which explains the relationship between number of children and breast cancer.
- The effect of number of children is confounded with the effect of age.

In this case, age is called a confounding variable or a confounder.

##### Definition

A confounder is an extraneous variable that partially or wholly accounts for the observed effect of the exposure on disease risk. In order for a variable to be a confounder, it must

1. be related to the disease,
2. be related to the risk factor, and
3. not be a consequence of the risk factor.

The effects of the confounder must be controlled for in the analysis in order to correctly measure the relationship between exposure and disease. In the case of categorical data, control means assessing the relationship across different levels, or strata, of the confounder.

Controlling for the confounder requires a consideration of both causal and data-based associations. That is, confounders may arise due to biologic relationships or simply due to
patterns that exist in the sampled data. There may be multiple confounders that need to be accounted for in the analysis. Indeed, potential confounders should be identified during the design of the study so that the appropriate data are collected.

Example 1

Table 1 presents disease and exposure data for a hypothetical group of study subjects. Based solely on these data, the crude odds ratio is 18.16.

Table 1. Cross-classification of exposure and disease

| | Diseased | Nondiseased |
|---|---|---|
| Exposed | 81 | 29 |
| Unexposed | 28 | 182 |

Odds Ratio = 18.16

Suppose that the presence or absence of a potential confounder C was recorded for each subject. One way to assess the impact of C is to calculate the odds ratios separately within each level of the confounder. The separate estimates are illustrated in the following table.

Table 2. Cross-classification of exposure and disease by levels of a confounder

| | Diseased (confounder present) | Nondiseased (confounder present) | Diseased (confounder absent) | Nondiseased (confounder absent) |
|---|---|---|---|---|
| Exposed | 80 | 20 | 1 | 9 |
| Unexposed | 8 | 2 | 20 | 180 |
| OR | 1.00 | | 1.00 | |

Thus, when considered within levels of the confounder, the exposure has absolutely no effect on the disease. The apparent relationship (crude odds ratio of 18.16) is explained by the confounding variable.

Why is this? If we examine the confounder and its relationship with disease and exposure, we see that there is a strong association with both. The odds ratio between disease status and the confounder is 36.0, while the odds ratio between exposure status and the confounder is 200.0.

Table 3. Cross-classification of disease and the confounder

| | Confounder Present | Confounder Absent |
|---|---|---|
| Diseased | 88 | 21 |
| Nondiseased | 22 | 189 |

Odds Ratio = 36.0

Table 4. Cross-classification of exposure and the confounder

| | Confounder Present | Confounder Absent |
|---|---|---|
| Exposed | 100 | 10 |
| Unexposed | 10 | 200 |

Odds Ratio = 200.0

Therefore, when we think we are seeing the effect of exposure, we may really be seeing the effect of the confounder.

Example 2

Consider the following data, for which there appears to be no association between exposure and disease.

| | Diseased | Nondiseased |
|---|---|---|
| Exposed | 240 | 420 |
| Unexposed | 200 | 350 |

Odds Ratio = 1.00

However, it could happen that the risk estimates indicate an association within the levels of a confounder.

| | Diseased (confounder present) | Nondiseased (confounder present) | Diseased (confounder absent) | Nondiseased (confounder absent) |
|---|---|---|---|---|
| Exposed | 120 | 378 | 120 | 42 |
| Unexposed | 20 | 175 | 180 | 175 |
| OR | 2.78 | | 2.78 | |

Thus, we have the reverse scenario to Example 1. Here, there is an association within the levels of the confounder, but no overall association when the confounder is ignored.

Example 3: SHHS

In the Scottish Heart Health Study (SHHS), information was collected on whether subjects owned or rented their place of residence. Residence was thought to be a surrogate measure of socioeconomic status, and investigators were interested in looking at its effect on disease.

| Residence | CHD: Yes | CHD: No | Totals |
|---|---|---|---|
| Rented | 85 | 1821 | 1906 |
| Owner-occupied | 77 | 2400 | 2477 |

Relative Risk = 1.43 (95% confidence interval: 1.06, 1.94)

Thus, there appears to be an association, but care must be taken to account for potential confounders such as smoking.

| Residence | CHD (smokers) | No CHD (smokers) | CHD (nonsmokers) | No CHD (nonsmokers) |
|---|---|---|---|---|
| Rented | 52 | 898 | 33 | 923 |
| Owner | 29 | 678 | 48 | 1722 |
| RR | 1.33 | | 1.27 | |

Notice that the stratum-specific estimates are lower than the crude estimate of 1.43. The reduced estimates indicate that a portion of the crude estimate is due to smoking. However, there does appear to be an additional effect of residence after controlling for smoking.

Notes

- Examples 1 and 2 both illustrate perfect confounding. That is, the risk estimates are equal across the levels of the confounder, but different from the crude risk estimate.
- If the stratum-specific risk estimates are all very similar to one another, as well as to the crude estimate, then confounding is not an important issue.
- Confounding is characterized by stratum-specific risk estimates that are consistently higher or lower than the crude estimate.
- It may be necessary to control for multiple confounding variables (see Table 5).

Table 5. Odds ratios for myocardial infarction by cigarette smoking habits amongst men aged 30-54 living in the northeast USA, from Kaufman
et al. (1983)

| Smoking | Unadjusted | Age-adjusted | Multiply-adjusted |
|---|---|---|---|
| Never | 1 | 1 | 1 |
| Ex | 1.5 | 1.1 | 1.2 |
| <25/day | 2.1 | 2.1 | 2.5 |
| 25-34 | 2.5 | 2.4 | 2.9 |
| 35-44 | 4.1 | 3.9 | 4.4 |
| ≥45 | 4.4 | 4.0 | 5.0 |

Multiply-adjusted: adjusted for age, geographic region, drug treatment for hypertension, history of elevated cholesterol, drug treatment for diabetes, family history of myocardial infarction or stroke, personality score, alcohol consumption, religion, and marital status.

Mantel-Haenszel Methods

We need a method to estimate the disease risk for an exposure variable in the presence of confounding. The first method we will discuss is that of Mantel-Haenszel. This is appropriate if the disease, exposure, and confounding variable are categorical or can be categorized.

We start by partitioning our data into strata defined by the q levels of the confounders. For strata i = 1, ..., q we will extend our previous notation to

| | Diseased | Nondiseased | Totals |
|---|---|---|---|
| Exposed | $a_i$ | $b_i$ | $a_i + b_i$ |
| Unexposed | $c_i$ | $d_i$ | $c_i + d_i$ |
| Totals | $a_i + c_i$ | $b_i + d_i$ | $n_i$ |

The Mantel-Haenszel method

- assumes that there is a true odds ratio which is consistent across all strata, and
- provides a pooled estimate of the common odds ratio. In essence, it is a weighted average of the odds ratios from the individual strata.

Note that it only makes sense to report the Mantel-Haenszel estimate if the exposure-disease relationship is consistent across the strata.

Odds Ratio

The Mantel-Haenszel estimate of the odds ratio is

$$OR_{MH} = \frac{\sum_{i=1}^{q} a_i d_i / n_i}{\sum_{i=1}^{q} b_i c_i / n_i}$$

with estimated standard error computed on the log scale as

$$SE(\ln OR_{MH}) = \sqrt{\frac{\sum_i P_i R_i}{2\left(\sum_i R_i\right)^2} + \frac{\sum_i \left(P_i S_i + Q_i R_i\right)}{2 \sum_i R_i \sum_i S_i} + \frac{\sum_i Q_i S_i}{2\left(\sum_i S_i\right)^2}}$$

where

$$P_i = (a_i + d_i)/n_i, \quad Q_i = (b_i + c_i)/n_i, \quad R_i = a_i d_i / n_i, \quad S_i = b_i c_i / n_i.$$

Relative Risk

The Mantel-Haenszel estimate of the relative risk is

$$RR_{MH} = \frac{\sum_{i=1}^{q} a_i (c_i + d_i) / n_i}{\sum_{i=1}^{q} c_i (a_i + b_i) / n_i}$$

with estimated standard error computed on the log scale as

$$SE(\ln RR_{MH}) = \sqrt{\frac{\sum_i \left[(a_i + b_i)(c_i + d_i)(a_i + c_i) - a_i c_i n_i\right] / n_i^2}{\left[\sum_i a_i (c_i + d_i) / n_i\right]\left[\sum_i c_i (a_i + b_i) / n_i\right]}}$$

Mantel-Haenszel estimates can be obtained in SAS.

6.1.2 Test of Homogeneity

It is important to keep in mind that these pooled risk estimates should only be reported if the risk is consistent (homogeneous) across the levels of the confounder. There are several test statistics that
address the hypothesis of homogeneity. We will discuss the Breslow-Day statistic, which is formulated as

$$X^2_{BD} = \sum_{i=1}^{q} \frac{\left(a_i - E(a_i)\right)^2}{\mathrm{var}(a_i)}$$

This is of the same form as the Pearson chi-square test statistic. The difference is in the calculation of the expected value. In the Pearson test, the expected value was computed under the null hypothesis of no association between disease and exposure. Here, our null hypothesis is one of homogeneity: that the odds ratios are equal across the levels of our confounder,

$$H_0: OR_1 = \cdots = OR_q = OR \qquad H_A: OR_i \ne OR_j \text{ for some } i \ne j$$

In other words, the null hypothesis of homogeneity implies that the stratum-specific odds ratios are all equal to a common odds ratio OR. Thus, the expected value is the number of subjects we would expect to observe in the stratum-specific tables if there were a common odds ratio.

If we define $A_i = E(a_i)$, then the expected cell counts in stratum i are as follows:

| | Diseased | Nondiseased | Totals |
|---|---|---|---|
| Exposed | $A_i$ | $(a_i + b_i) - A_i$ | $a_i + b_i$ |
| Unexposed | $(a_i + c_i) - A_i$ | $n_i - (a_i + b_i) - (a_i + c_i) + A_i$ | $c_i + d_i$ |
| Totals | $a_i + c_i$ | $b_i + d_i$ | $n_i$ |

We find $A_i = E(a_i)$ by noting that under the null hypothesis

$$\frac{A_i \left[n_i - (a_i + b_i) - (a_i + c_i) + A_i\right]}{\left[(a_i + b_i) - A_i\right]\left[(a_i + c_i) - A_i\right]} = OR$$

which can be rewritten as

$$(OR - 1) A_i^2 - \left[(OR - 1)(a_i + b_i + a_i + c_i) + n_i\right] A_i + (a_i + b_i)(a_i + c_i)\, OR = 0.$$

We then solve for $A_i$ (left as an exercise for those interested) to get an expression for the expected value, i.e.

$$E(a_i) = \frac{P_i \pm \sqrt{P_i^2 - 4 (OR - 1)(a_i + b_i)(a_i + c_i)\, OR}}{2 (OR - 1)}$$

where

$$P_i = (OR - 1)(a_i + b_i + a_i + c_i) + n_i.$$

Here $P_i$ is a new quantity, not the $P_i$ used in the standard error of $\ln OR_{MH}$, and the root to use is the one for which all four expected cell counts are positive. To evaluate this formula, we need a value for the odds ratio OR. The most common choice in practice is the Mantel-Haenszel estimate of the odds ratio, $OR_{MH}$.

The variance terms in the Breslow-Day test statistic are computed as

$$\mathrm{var}(a_i) = \left[\frac{1}{E(a_i)} + \frac{1}{E(b_i)} + \frac{1}{E(c_i)} + \frac{1}{E(d_i)}\right]^{-1} = \left[\frac{1}{A_i} + \frac{1}{(a_i + b_i) - A_i} + \frac{1}{(a_i + c_i) - A_i} + \frac{1}{n_i - (a_i + b_i) - (a_i + c_i) + A_i}\right]^{-1}$$

Finally, the two-sided p-value is

$$p = \Pr\left(\chi^2_{q-1} \ge X^2_{BD}\right)$$

If the p-value is significant, then the null hypothesis is rejected and it is concluded that the odds ratios are not homogeneous across strata. In that case, it is not appropriate to report the Mantel-Haenszel pooled estimate of the odds ratio. (A similar test statistic can be formulated for the relative risk.)

The test of homogeneity should be performed before deciding to report the pooled odds ratio.

6.1.3 Hypothesis Testing

The null hypothesis $H_0: OR_{MH} = 1$ can be tested against the alternative $H_A: OR_{MH} \ne 1$ with the following Mantel-Haenszel statistic:

$$X^2_{MH} = \frac{\left(\sum_{i=1}^{q} a_i - \sum_{i=1}^{q} E(a_i)\right)^2}{\sum_{i=1}^{q} \mathrm{var}(a_i)}$$

where

$$E(a_i) = \frac{(a_i + b_i)(a_i + c_i)}{n_i}, \qquad \mathrm{var}(a_i) = \frac{(a_i + b_i)(c_i + d_i)(a_i + c_i)(b_i + d_i)}{n_i^2 (n_i - 1)}.$$

The 2-sided p-value is

$$p = \Pr\left(\chi^2_1 \ge X^2_{MH}\right)$$

Example 3: SHHS

The next few pages display the SAS analysis of the effect of residence on CHD risk, controlling for smoking status. An interpretation of the results proceeds as follows:

1. The Breslow-Day test does not provide evidence against homogeneity of the risk ratios (p = 0.8701). Consequently, it is decided that the Mantel-Haenszel pooled estimate is appropriate to report.
2. The Mantel-Haenszel estimate of the common relative risk is 1.30, with a 95% confidence interval of (0.96, 1.78).
3. The Mantel-Haenszel test statistic indicates that the adjusted relative risk is not significantly different from one (p = 0.0940). Therefore, the association between residence and CHD is not significant after controlling for smoking status.

SAS Program and Output

    data shhs;
      input CHD $ Residence $ Smoker $ N;
      cards;
    Yes Rented Yes 52
    Yes Rented No 33
    No Rented Yes 898
    No Rented No 923
    Yes Owner Yes 29
    Yes Owner No 48
    No Owner Yes 678
    No Owner No 1722
    ;

    proc freq order=data data=shhs;
      weight N;
      tables Smoker*Residence*CHD / relrisk cmh;
    run;

Syntax

- In the tables Smoker*Residence*CHD statement, the confounding variable is positioned first. The measures and tests of association focus on the association between the last two variables.
- It is a good idea to request the stratum-specific risk estimates, via the relrisk option, in order to check that the desired relative risks are being computed.
- cmh will produce the Mantel-Haenszel odds ratios and relative risks, and carry out the Breslow-Day test of homogeneity.

The FREQ Procedure

Table 1 of Residence by CHD
Controlling for Smoker=Yes

| Residence | Statistic | CHD = Yes | CHD = No | Total |
|---|---|---|---|---|
| Rented | Frequency | 52 | 898 | 950 |
| | Percent | 3.14 | 54.19 | 57.33 |
| | Row Pct | 5.47 | 94.53 | |
| | Col Pct | 64.20 | 56.98 | |
| Owner | Frequency | 29 | 678 | 707 |
| | Percent | 1.75 | 40.92 | 42.67 |
| | Row Pct | 4.10 | 95.90 | |
| | Col Pct | 35.80 | 43.02 | |
| Total | Frequency | 81 | 1576 | 1657 |
| | Percent | 4.89 | 95.11 | 100.00 |

Statistics for Table 1 of Residence by CHD
Controlling for Smoker=Yes

Estimates of the Relative Risk (Row1/Row2)

| Type of Study | Value | 95% Confidence Limits |
|---|---|---|
| Case-Control (Odds Ratio) | 1.3538 | 0.8503, 2.1554 |
| Cohort (Col1 Risk) | 1.3344 | 0.8563, 2.0797 |
| Cohort (Col2 Risk) | 0.9857 | 0.9646, 1.0072 |

Sample Size = 1657
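As a cross-check on the SHHS interpretation, the Mantel-Haenszel relative risk, its log-scale standard error, and the Mantel-Haenszel test statistic can be computed directly from the stratum counts using the formulas in these notes. The following is a minimal Python sketch and is not part of the original course material. The function names are illustrative, the 1.96 multiplier assumes a normal approximation on the log scale, and the p-value helper handles only one degree of freedom.

```python
import math

# SHHS strata as (a, b, c, d): rows are Rented / Owner, columns are CHD / No CHD.
# Counts are copied from the stratified table in the notes.
tables = [
    (52, 898, 29, 678),    # smokers
    (33, 923, 48, 1722),   # nonsmokers
]


def mh_relative_risk(tables):
    """Return the Mantel-Haenszel pooled relative risk and the SE of its log."""
    num = den = var_num = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        num += a * (c + d) / n
        den += c * (a + b) / n
        var_num += ((a + b) * (c + d) * (a + c) - a * c * n) / n ** 2
    return num / den, math.sqrt(var_num / (num * den))


def mh_test_statistic(tables):
    """Return the Mantel-Haenszel chi-square statistic (no continuity correction)."""
    sum_a = sum_e = sum_v = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        sum_a += a
        sum_e += (a + b) * (a + c) / n
        sum_v += (a + b) * (c + d) * (a + c) * (b + d) / (n ** 2 * (n - 1))
    return (sum_a - sum_e) ** 2 / sum_v


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


rr_mh, se_log_rr = mh_relative_risk(tables)
ci_low = math.exp(math.log(rr_mh) - 1.96 * se_log_rr)
ci_high = math.exp(math.log(rr_mh) + 1.96 * se_log_rr)
x2_mh = mh_test_statistic(tables)
p_mh = chi2_sf_1df(x2_mh)

print(f"RR_MH = {rr_mh:.4f}, 95% CI ({ci_low:.4f}, {ci_high:.4f})")
print(f"X2_MH = {x2_mh:.4f}, p = {p_mh:.4f}")
```

Under these assumptions the printed values should be close to the pooled relative risk, confidence interval, and p-value quoted in the interpretation above. The sketch is only a check on the arithmetic; the SAS output remains the reference.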

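The Breslow-Day test of homogeneity can be sketched in the same way: solve the quadratic for the expected count in each stratum under a common odds ratio, then sum the standardized squared differences. This is a hedged Python illustration rather than the course's own code. The helper names are hypothetical, the Mantel-Haenszel odds ratio is plugged in as the common value, and the p-value line assumes two strata (one degree of freedom).

```python
import math

# SHHS strata as (a, b, c, d): rows are Rented / Owner, columns are CHD / No CHD.
tables = [
    (52, 898, 29, 678),    # smokers
    (33, 923, 48, 1722),   # nonsmokers
]


def mh_odds_ratio(tables):
    """Mantel-Haenszel pooled odds ratio."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den


def expected_a(a, b, c, d, odds_ratio):
    """Expected exposed-diseased count when the stratum odds ratio equals odds_ratio."""
    n = a + b + c + d
    m1, m2 = a + b, a + c
    if odds_ratio == 1.0:
        return m1 * m2 / n
    k = odds_ratio - 1.0
    p = k * (m1 + m2) + n
    root = math.sqrt(p * p - 4.0 * k * m1 * m2 * odds_ratio)
    lower, upper = max(0.0, m1 + m2 - n), min(m1, m2)
    # Keep the root that leaves all four expected cell counts positive.
    for candidate in ((p - root) / (2.0 * k), (p + root) / (2.0 * k)):
        if lower < candidate < upper:
            return candidate
    raise ValueError("no admissible root for this stratum")


def breslow_day(tables, odds_ratio):
    """Breslow-Day chi-square statistic for homogeneity of odds ratios."""
    stat = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        e_a = expected_a(a, b, c, d, odds_ratio)
        e_b = (a + b) - e_a
        e_c = (a + c) - e_a
        e_d = n - (a + b) - (a + c) + e_a
        var_a = 1.0 / (1.0 / e_a + 1.0 / e_b + 1.0 / e_c + 1.0 / e_d)
        stat += (a - e_a) ** 2 / var_a
    return stat


or_mh = mh_odds_ratio(tables)
x2_bd = breslow_day(tables, or_mh)
# With q = 2 strata the reference distribution has q - 1 = 1 degree of freedom.
p_bd = math.erfc(math.sqrt(x2_bd / 2.0))

print(f"OR_MH = {or_mh:.4f}")
print(f"X2_BD = {x2_bd:.4f}, p = {p_bd:.4f}")
```

With more than two strata the statistic is computed the same way, but the p-value would need a chi-square tail probability with q - 1 degrees of freedom, which this sketch does not implement.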