Class Note for STAT 528 at OSU 59
This 23 page Class Notes was uploaded by an elite notetaker on Friday February 6, 2015. The Class Notes belongs to a course at Ohio State University taught by a professor in Fall. Since its upload, it has received 31 views.
Date Created: 02/06/15
Stat 528 (Autumn 2008)
Associations for two-way tables

Reading: Sections 9.1, 9.2

o Associations between categorical variables
    - A motivating example
    - Joint distributions
    - Marginal distributions
    - Conditional distributions
o Inference for two-way tables
    - Tests for association
    - The hypotheses
    - Comparing expected and observed counts
    - Pearson's $X^2$ statistic
    - The chi-square distribution
    - Genetic damage example, cont.
    - Fisher's exact test

Motivating example

Is genetic damage associated with air pollution?

One possible effect of air pollution is genetic damage. A study designed to examine this problem exposed one group of mice to air near a steel mill and another group to air in a rural area, and compared the number of mutations in each group. Here are the data for a mutation at the Hm-2 gene locus:

o Of the 96 mice exposed to steel mill air, 30 had mutations.
o Of the 150 mice exposed to rural air, 23 had mutations.

Associations between categorical variables

o We use tables of counts or relative frequencies to summarize relationships between categorical variables.
    - A table comparing two categorical variables is called a two-way table.
o We look for patterns in the data.
    - Is there a simple relationship between x and y?
    - Which variable goes in the rows and which in the columns? In the air pollution example:
        * the row variable is mutation;
        * the column variable is location.

The joint distribution for the air pollution example

o Summarized as proportions of all 246 mice, we have:

                      Location
    Mutation    Steel mill air    Rural air    Total
    Yes              0.122          0.093
    No               0.268          0.516
    Total                                      1.000

o Conclusions:

Marginal distributions

o To investigate the column and row variables separately, we calculate the marginal distributions.
    - Calculate the row and column totals (the margins).
    - We often present these marginal distributions as proportions or percentages of the overall total.

Marginal distributions for the air pollution example

    Location      Steel mill air    Rural air    Total
    Number              96             150        246
    Proportion        0.390           0.610      1.000

o Ignoring whether or not a mouse has a mutation
at the Hm-2 gene locus.

    Mutation       Yes      No      Total
    Number          53      193      246
    Proportion    0.215    0.785    1.000

o Ignoring the location.

Conditional distributions

o Marginal distributions tell us nothing about the relationship between variables.
o The conditional distribution gives the distribution of one variable given a particular value of the other variable.
o Conditional distributions can be useful for investigating associations.

Conditional distributions for the example

Conditioning on being exposed to air from the steel mill:

    Mutation       Yes      No      Total
    Number          30       66       96
    Proportion    0.313    0.688    1.000

Conditioning on being exposed to air from the rural area:

    Mutation       Yes      No      Total
    Number          23      127      150
    Proportion    0.153    0.847    1.000

Conditional distributions for the example, cont.

Conditioning on having a mutation:

    Location      Steel mill air    Rural air    Total
    Number              30              23         53
    Proportion        0.566           0.434      1.000

Conditioning on not having a mutation:

    Location      Steel mill air    Rural air    Total
    Number              66             127        193
    Proportion        0.342           0.658      1.000

A test for association

o Suppose we have two categorical variables. We can present the distribution of both variables in a two-way table with $r$ rows and $c$ columns. Let $O_{ij}$ be the observed count for row $i$, column $j$:

$$
\begin{array}{cccc}
O_{11} & O_{12} & \cdots & O_{1c} \\
O_{21} & O_{22} & \cdots & O_{2c} \\
\vdots & \vdots &        & \vdots \\
O_{r1} & O_{r2} & \cdots & O_{rc}
\end{array}
$$

o Consider the following set of hypotheses:

    H0: There is no association between the row and column variables
    versus
    Ha: There is some unspecified association between the row and column variables.

The marginal counts

o The row sum for row $i$ is
$$O_{i\cdot} = \sum_{j=1}^{c} O_{ij}.$$
o The column sum for column $j$ is
$$O_{\cdot j} = \sum_{i=1}^{r} O_{ij}.$$
o Then the sum of all the counts is
$$n = \sum_{i=1}^{r} \sum_{j=1}^{c} O_{ij} = \sum_{i=1}^{r} O_{i\cdot} = \sum_{j=1}^{c} O_{\cdot j}.$$

Under the null hypothesis

o When the row and column variables are not associated, the conditional distributions of the columns are all the same.
o Since we don't know this common conditional distribution, we estimate it.
o The estimate for this common distribution is
$$\hat{p}_i = \frac{O_{i\cdot}}{n}, \quad \text{for } i = 1, \ldots, r.$$
o The expected cell count in the $(i, j)$ cell is the product of the number of observations in column $j$ and the estimated probability for row $i$:
$$E_{ij} = O_{\cdot j}\, \hat{p}_i = \frac{O_{i\cdot}\, O_{\cdot j}}{n}.$$
o If the null hypothesis is true, $O_{ij}$ should be close to $E_{ij}$. We roll all of these $r \times c$ comparisons into a single test statistic.

Carrying out the test

o Pearson's $X^2$ statistic is defined by
$$X^2 = \sum_{\text{rows}} \sum_{\text{columns}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}.$$
o Under H0, $X^2$ has approximately a chi-square distribution with $df = (r - 1) \times (c - 1)$ degrees of freedom.
    - Remember, $r$ is the number of rows and $c$ is the number of columns.
    - The approximation improves for larger $r$ and $c$ and works best when each $E_{ij}$ is not close to zero.
    - Rule of thumb: make sure that each $E_{ij} \geq 5$.

Calculating the P-value

o The approximate P-value for the test of association is
$$P(\chi^2 \geq X^2),$$
where $\chi^2$ has a chi-square distribution with $df = (r - 1) \times (c - 1)$ degrees of freedom.
o Table F provides the upper critical values of this distribution for different degrees of freedom.

Air pollution example, cont.

Is there evidence to conclude that location is related to the occurrence of mutations? Perform the significance test and summarize the results.

    Observed       Steel mill air    Rural air    Total
    Mutation             30              23         53
    No mutation          66             127        193
    Total                96             150        246

    Expected       Steel mill air    Rural air    Total
    Mutation
    No mutation
    Total

Pearson's chi-square test in Minitab

o To perform Pearson's chi-square test with Minitab, the data must be entered in the worksheet in a specific format:

    C1            C2
    Steel Mill    Rural
    30            23
    66            127

o Use the menu sequence Stat > Tables > Chi-Square Test (Two-Way Table in Worksheet).
o In the dialog box, enter C1 and C2 as the columns that contain the table.
o Results include the marginal counts, the expected count for each cell of the table, the contribution of each cell to the chi-square statistic, and a summary of the chi-square test consisting of the test statistic, degrees of freedom, and p-value.

Fisher's exact test

o There is a second test for association that is commonly used with two-by-two tables. The test has the same null and alternative hypotheses that we have been using: H0: There is no association between the row and column variables, versus Ha: There is
some unspecified association between the row and column variables.
o These hypotheses are often described in different terms, i.e., in terms of the odds ratio.
o Fisher's exact test is based on randomization (or permutation) arguments.
o All numbers in the margins of the table are considered fixed, and the reference distribution arises from the different ways to fill in the interior values for the cells.

Fisher's exact test, cont.

o This test does not require large sample sizes to make approximations work. However, small marginal counts remain troublesome.
    - The reference distribution is a discrete distribution.
    - When the marginal counts are small, there are only a few tables of data that are consistent with the margins.
    - Each of these tables gets a share of probability. Some shares will be big.
    - The distribution of the p-value is jumpy.
    - You may not be able to perform a test at a level near your chosen $\alpha$.
o This test can be extended to tables with $r$ rows and $c$ columns. However, Minitab does not currently implement such a test. Many other software packages also do not implement it.

Fisher's exact test in Minitab

o To perform Fisher's exact test with Minitab, the data must be entered in the worksheet in a specific format:

    C4        C5          C6
    Counts    Location    Mutation
    30        0           1
    66        0           0
    23        1           1
    127       1           0

o Use the menu sequence Stat > Tables > Cross Tabulation and Chi-Square.
o In the dialog box, rows are Mutation, columns are Location, and frequencies are Counts.
o Click on the Other Stats button and check the box for Fisher's exact test.

Simpson's paradox

o Similar associations between variables across groups can reverse direction when we sum (aggregate) across these groups.
o This is often due to the effect of lurking variables.
    - A lurking variable is a variable that is not observed in a study but can be associated with the observed variables.
o It is easiest to demonstrate this result with an example.

An example

Here are the numbers of flights on time and delayed for two airlines at five airports. Overall on-time percentages
for each airline are often reported in the news. Lurking variables can often make such reports misleading.

                       Alaska Airlines        America West
                      on time   delayed     on time   delayed
    Los Angeles          497       62          694       117
    Phoenix              221       12         4840       415
    San Diego            212       20          383        65
    San Francisco        503      102          320       129
    Seattle             1841      305          201        61
    All Airports        3274      501         6438       787

Presenting the numbers as proportions

                       Alaska Airlines        America West
                      on time   delayed     on time   delayed
    Los Angeles         0.89      0.11        0.86      0.14
    Phoenix             0.95      0.05        0.92      0.08
    San Diego           0.91      0.09        0.85      0.15
    San Francisco       0.83      0.17        0.71      0.29
    Seattle             0.86      0.14        0.77      0.23
    All Airports        0.87      0.13        0.89      0.11

o What percentage of Alaska Airlines flights were delayed? What percentage of America West flights were delayed? These are the numbers usually reported.
o America West does worse at every one of the five airports, yet does better overall. This sounds impossible. How can this happen?

Another summary of the data

Data from a two-way table can be summarized in many ways. The relative risk of a flight being delayed at airport $i$ is defined to be the ratio of the two airlines' delay proportions at that airport. In the table below it is the America West proportion divided by the Alaska Airlines proportion:
$$RR_i = \frac{\hat{p}_{i,\text{America West}}}{\hat{p}_{i,\text{Alaska}}}.$$

                      Alaska Airlines    America West
                          delayed           delayed        RR
    Los Angeles            0.11              0.14         1.30
    Phoenix                0.05              0.08         1.53
    San Diego              0.09              0.15         1.68
    San Francisco          0.17              0.29         1.70
    Seattle                0.14              0.23         1.64

(The RR column is computed from the unrounded proportions, so it differs slightly from the ratio of the rounded values shown.)

o For these data, the relative risk is more stable than the difference in proportions as the experimental conditions (city) change.
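The airline calculations above can be checked with a short script. The following is a minimal Python sketch, not part of the original course materials (which use Minitab). The names `flights`, `delay_rate`, `overall_delay_rate`, and `relative_risks` are illustrative choices, and the counts are copied from the table of on-time and delayed flights.

```python
# Flight counts from the notes: airport -> (on time, delayed).
# All function and variable names here are illustrative, not from the course.
flights = {
    "Alaska Airlines": {
        "Los Angeles": (497, 62),
        "Phoenix": (221, 12),
        "San Diego": (212, 20),
        "San Francisco": (503, 102),
        "Seattle": (1841, 305),
    },
    "America West": {
        "Los Angeles": (694, 117),
        "Phoenix": (4840, 415),
        "San Diego": (383, 65),
        "San Francisco": (320, 129),
        "Seattle": (201, 61),
    },
}


def delay_rate(on_time, delayed):
    """Proportion of flights that were delayed."""
    return delayed / (on_time + delayed)


def overall_delay_rate(by_airport):
    """Delay proportion after aggregating the counts over all airports."""
    total_on_time = sum(on_time for on_time, _ in by_airport.values())
    total_delayed = sum(delayed for _, delayed in by_airport.values())
    return delay_rate(total_on_time, total_delayed)


def relative_risks(data):
    """Per-airport relative risk: America West delay rate / Alaska delay rate."""
    alaska = data["Alaska Airlines"]
    west = data["America West"]
    return {
        airport: delay_rate(*west[airport]) / delay_rate(*alaska[airport])
        for airport in alaska
    }


def main():
    for airline, by_airport in flights.items():
        print(f"{airline}: overall delayed = {overall_delay_rate(by_airport):.2f}")
    for airport, rr in relative_risks(flights).items():
        print(f"{airport}: RR = {rr:.2f}")


if __name__ == "__main__":
    main()
```

Every per-airport relative risk comes out above 1, while the aggregated delay rate is lower for America West. One way to read this from the counts: most Alaska Airlines flights are at Seattle, where delays are relatively common, while most America West flights are at Phoenix, where delays are rare, so the overall rates are weighted toward very different airports.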