INTRO TO STATISTICS 2 (STA 3024)
These 21 pages of class notes were uploaded by Golden Bernhard on Friday, September 18, 2015. The notes belong to STA 3024 at the University of Florida, taught by Douglas Sparks in Fall 2015.
Date Created: 09/18/15
Chapter 3: Contingency Tables

Much of what we'll do this semester can be condensed to a simple idea: examining whether a response variable depends on one or more explanatory variables, and if so, how. The procedures we'll use will depend on whether the variables involved are categorical or quantitative, as follows:

    Explanatory Variables    Response Variable    Methods
    Categorical              Categorical          Contingency Tables
    Categorical              Quantitative         ANOVA
    Quantitative             Quantitative         Regression
    Quantitative             Categorical          (not discussed)

We'll start by looking at what to do when both the explanatory and response variables are categorical. This chapter corresponds to Chapter 11 of the textbook.

3.1 Basics of Contingency Tables

First we'll review the basic way we present data when both the explanatory and response variables are categorical. You may have seen part or all of this section before in your first-semester statistics course.

• Displaying Data with Tables

When both the explanatory and response variables are categorical, each subject can be classified into a particular combination of the variable values. We typically represent this using a table called a contingency table (or two-way table), with the explanatory variable values as rows and the response variable values as columns. Each combination of explanatory and response variable values constitutes a group of subjects, represented by a cell in the table.

Note: This is always the setup we will use for contingency tables in this course. You might also sometimes see contingency tables presented the opposite way, with the explanatory variable values as columns and the response variable values as rows. It would be okay to do things that way instead, but everything we're going to say in this chapter about rows and columns would need to be reversed.

EXAMPLE 3.1: We want to see if voters' choices in the 2006 election for governor of Florida (Charlie Crist, Jim Davis, or other) depended on their level of education (no college, some college, or college
degree). If we think about the two variables together, there are nine different combinations, which we can represent visually with a table:

                         Vote
    Education        Crist   Davis   Other
    No College
    Some College
    College Degree

Each individual in the population can be classified into one of these nine groups, or cells, in the table. ◊

• Populations and Samples

Usually we're interested in the entire population. We would like to know the values of both variables for every member of the population, so we could classify every individual subject. Instead, we typically have data only for a sample. We can count how many subjects in the sample fall into each group and fill in the cells of the contingency table with these observed counts from our sample.

EXAMPLE 3.2: Continuing from Example 3.1, we take a random sample of 2,804 voters and find the following:

                         Vote
    Education        Crist   Davis   Other   Total
    No College         321     328      23     672
    Some College       456     334      24     814
    College Degree     684     599      35    1318
    Total             1461    1261      82    2804

Remember that the data in this contingency table, like most contingency tables, is only the data for the sample. We typically don't know the data for the whole population. ◊

• Conditional Percentages and Independence

It can be hard to tell much from the raw data of a contingency table. We can often tell more about what's going on by looking instead at conditional percentages.

Conditional Percentages

To calculate conditional percentages, we look at each category of the explanatory variable and calculate the percentage of subjects in that category that fall into each category of the response variable. We call these conditional percentages the conditional distribution of the response variable given the explanatory variable. In terms of the contingency table, we calculate the conditional percentages for our sample by dividing each observed count by its row total.

EXAMPLE 3.3: To calculate the conditional distribution in Example 3.2, we divide each observed count by its row total.
                         Vote
    Education        Crist    Davis   Other   Total
    No College       47.8%    48.8%    3.4%   100%  (672)
    Some College     56.0%    41.0%    2.9%   100%  (814)
    College Degree   51.9%    45.4%    2.7%   100%  (1318)

(The numbers shown may not add properly due to rounding.) ◊

Be careful not to get the conditioning backwards when interpreting conditional percentages. The statement "47.8% of voters with no college voted for Crist" is a very different statement from "47.8% of Crist voters had no college." In fact, the percentage of Crist voters who had no college doesn't have to be anywhere close to the percentage of voters with no college who voted for Crist. It may help to remember that in the American population, a very high percentage of U.S. senators are men, but a very low percentage of men are U.S. senators.

Connection to Conditional Probability

Conditional percentages are essentially the same idea as conditional probability. Think about randomly selecting a subject from our contingency table. Let A represent the event that the subject has a particular explanatory variable value, and let B represent the event that the subject has a particular response variable value. Call the number of subjects for which A is true n_A, the number of subjects for which B is true n_B, the number of subjects for which both A and B are true n_AB, and the total number of subjects n. Then our contingency table looks like this:

                       Response
    Explanatory     Category B   ...   Total
    Category A         n_AB             n_A
    ...
    Total               n_B              n

Then the probability of B given A is

    P(B | A) = P(A and B) / P(A) = (n_AB / n) / (n_A / n) = n_AB / n_A,

which was exactly how we calculated the conditional percentage P(B | A).

Independence

When we say that two random variables X and Y are independent, what we mean is that if A is any event involving just X and B is any event involving just Y, then the events A and B are independent; that is, P(B | A) = P(B).
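The conditional-distribution calculation in Example 3.3, and its connection to P(B | A) = n_AB / n_A, can be sketched in a few lines of Python. (The counts are the observed counts from Example 3.2; the variable names are my own.)

```python
# Observed counts from Example 3.2: rows = education level, columns = vote.
counts = {
    "No College":     {"Crist": 321, "Davis": 328, "Other": 23},
    "Some College":   {"Crist": 456, "Davis": 334, "Other": 24},
    "College Degree": {"Crist": 684, "Davis": 599, "Other": 35},
}

# Conditional distribution of vote given education: divide each cell by its
# row total (n_AB / n_A), which is exactly the conditional probability P(B | A).
conditional = {}
for education, row in counts.items():
    row_total = sum(row.values())
    conditional[education] = {vote: n / row_total for vote, n in row.items()}

# e.g. the percentage of voters with no college who voted for Crist
print(round(100 * conditional["No College"]["Crist"], 1))   # 47.8
```

Note that each row of `conditional` sums to 1, since every subject in a row falls into exactly one vote category.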
More specifically, what this means is that if we look at the data for the entire population, the conditional distribution of the response variable given the explanatory variable should be the same for all values of the explanatory variable. This is easier to see with an example.

EXAMPLE 3.4: Continuing from Example 3.3, suppose for the entire population that 52.1% of people voted for Crist, 45.0% voted for Davis, and 2.9% voted for someone else. If education level and vote were independent, then for the entire population, the distribution of votes would be the same for each education level:

                         Vote
    Education        Crist    Davis   Other   Total
    No College       52.1%    45.0%    2.9%    100%
    Some College     52.1%    45.0%    2.9%    100%
    College Degree   52.1%    45.0%    2.9%    100%

In other words, if education level and vote were independent, then voters' education level would tell us nothing about the chances of them voting for each candidate. ◊

When we talk about independence of two variables, we're talking about those variables' values over the entire population, not just our sample. In our sample, random variation makes it unlikely that our conditional distributions will match up perfectly, even when the variables really are independent.

3.2 The Chi-Squared Test

The chi-squared test is a hypothesis test for testing whether two categorical variables are independent. First we'll look at the basic idea behind how it works, and then we'll cover the details of how to do it.

• Expected Counts

Recall when we thought about the explanatory and response variables taking certain values as events A and B. We thought about our contingency table this way:

                       Response
    Explanatory     Category B   ...   Total
    Category A         n_AB             n_A
    ...
    Total               n_B              n

If the explanatory and response variables are independent, then A and B are independent events, and so P(A and B) = P(A)P(B). But P(A and B) = n_AB/n, P(A) = n_A/n, and P(B) = n_B/n, so substituting in, we get that

    n_AB / n = (n_A / n)(n_B / n),

which we can rearrange as

    n_AB = (n_A × n_B) / n.

So if the explanatory and response variables are independent, then we would predict that each observed count would equal its row total times its column total, divided by the total sample size. We call this the expected count, where by "expected" we really mean "expected if the variables are independent." Here's the formula again, in words instead of symbols:
    Expected count = (Row total × Column total) / (Total sample size)

Each cell has its own expected count. The expected counts are just mathematical calculations, so it's okay for them to be decimals; we don't round them off to whole numbers. The observed counts, of course, are the actual data from the real world, so they're always whole numbers.

EXAMPLE 3.5: Refer back to the row and column totals in the contingency table in Example 3.2:

                         Vote
    Education        Crist   Davis   Other   Total
    No College         321     328      23     672
    Some College       456     334      24     814
    College Degree     684     599      35    1318
    Total             1461    1261      82    2804

For the No College / Crist cell, the expected count is (672 × 1461)/2804 = 350.14. For the No College / Davis cell, it is (672 × 1261)/2804 = 302.21. Repeating the process for the other seven cells, we arrive at the following table of expected counts:

                         Vote
    Education        Crist    Davis    Other
    No College       350.14   302.21   19.65
    Some College     424.13   366.07   23.80
    College Degree   686.73   592.72   38.54

Remember that we leave the expected counts as decimals. ◊

Agreement of Observed and Expected Counts

Because we have a random sample, we don't expect the observed and expected counts to match exactly, even if the variables really are independent. However, if there are cells where the observed and expected counts are drastically different, then that provides clear evidence that the variables aren't independent. So the question is: How much do the observed and expected counts have to differ for us to be convinced that the variables aren't independent? We'll use the chi-squared test to answer this question.

• Chi-Squared Test Procedure

The chi-squared test is a hypothesis test, so it consists of the usual five steps.

Assumptions

The chi-squared test makes two assumptions:

• The data must come from a random sample or randomized experiment. More specifically, the random selection or assignment of each subject in our sample must take place separately from that of each other subject. For example, we should not randomly select whole families and then count each person in the family as a separate subject in our sample.

• Each cell's expected count must be at least 5.
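Stepping back to Example 3.5 for a moment: the expected-count formula can be sketched directly in Python (a minimal sketch; the variable names are mine):

```python
# Observed counts from Example 3.2, as a list of rows.
observed = [
    [321, 328, 23],   # No College:     Crist, Davis, Other
    [456, 334, 24],   # Some College
    [684, 599, 35],   # College Degree
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell: (row total)(column total) / (total sample size).
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

print(round(expected[0][0], 2))   # 350.14 for the No College / Crist cell
```

The expected counts always share the observed table's row and column totals, so their grand total is again n = 2804.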
Note: The second assumption comes from the underlying math of the test procedure, which just requires that all of the expected counts are "not too small." We'll follow our textbook and say that "not too small" means "at least 5."

Hypotheses

The hypotheses of the chi-squared test are as follows:

    H0: The variables are independent.
    Ha: The variables are dependent (in some way).

Test Statistic

We want to measure how much the observed counts differ, collectively, from their corresponding expected counts. We do this with the chi-squared test statistic, X², which is calculated as

    X² = Σ over all cells of (observed count − expected count)² / expected count.

So for each cell, we take the difference between the observed and expected counts and square it, and we divide the result by the expected count. Then we sum up over all the cells to find X².

Let's think about how we expect X² to behave:

• If H0 is true, then the observed counts will probably be close to the expected counts, and the value of X² will probably be small. An exact match between the observed and expected counts (unlikely, but possible) would give X² = 0.

• If H0 is false, then the observed counts will probably differ more from their expected counts, and the value of X² will be larger.

So larger values of X² mean more evidence against H0, that is, more evidence that the variables are dependent.

P-Value and the Chi-Squared Distribution

The next step is to calculate the p-value associated with our value of X². To do that, we need to know the distribution of the X² test statistic when H0 is true. If our sample size is large enough (this is where the "at least 5" rule comes from), then the distribution of X² when H0 is true is what's called a chi-squared distribution.

The exact shape of the chi-squared distribution is determined by its degrees of freedom (df). A X² statistic that comes from a contingency table with r rows and c columns corresponds to a chi-squared distribution with

    df = (r − 1)(c − 1).
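The X² formula above can be applied to the voting data by combining it with the expected counts from Example 3.5 (a sketch; names are mine):

```python
# Observed counts from Example 3.2.
observed = [
    [321, 328, 23],   # No College
    [456, 334, 24],   # Some College
    [684, 599, 35],   # College Degree
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# X^2 = sum over all cells of (observed - expected)^2 / expected
chi_sq = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)
print(round(chi_sq, 2))   # 10.81
```

This value reappears in the worked chi-squared test for the voting data below.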
Figure 3.1 shows a chi-squared distribution with 4 degrees of freedom.

    [FIGURE 3.1: A chi-squared distribution with df = 4; horizontal axis is the value of X², roughly 0 to 20.]

Larger values of X² indicate more evidence against H0, so the p-value for an observed test statistic value X² is the probability to the right of the observed X² value for a chi-squared distribution with df = (r − 1)(c − 1), as shown in Figure 3.2. The shaded area in Figure 3.2 represents the probability that X² ≥ 8. A chi-squared distribution has only one tail, since it lives only on the positive numbers, so we never need to worry about two-tail probabilities or doubling anything.

    [FIGURE 3.2: The right tail of a chi-squared distribution with df = 4, shaded to the right of X² = 8.]

To get exact p-values, we typically need to use computer software. However, we can approximate the p-value using a chi-squared table, which tells us the X² values that correspond to certain right-tail probabilities (0.05, 0.10, etc.) for each df value. The first few rows of a chi-squared distribution table might look like Figure 3.3. Each row corresponds to a df value, and the number in the body of the table is the X² value that corresponds to the right-tail probability at the top of the column.

                        Right-Tail Probability
    df    0.250   0.100   0.050   0.025   0.010   0.005   0.001
     1     1.32    2.71    3.84    5.02    6.63    7.88   10.83
     2     2.77    4.61    5.99    7.38    9.21   10.60   13.82
     3     4.11    6.25    7.81    9.35   11.34   12.84   16.27
     4     5.39    7.78    9.49   11.14   13.28   14.86   18.47

    FIGURE 3.3: First few rows of a chi-squared table.

EXAMPLE 3.6: Suppose we observe a X² statistic value of 5.17 from a table with two rows and three columns. Then df = (2 − 1)(3 − 1) = 2. In the df = 2 row of the table, 5.17 falls between 4.61 (right-tail probability 0.100) and 5.99 (right-tail probability 0.050), so all we can say from the table is that the p-value is between 0.05 and 0.10. ◊

Decision

To make a decision, we compare the p-value to our significance level α:

• If the p-value is less than or equal to α, we reject H0 and conclude that the response variable depends on the explanatory variable in some way.

• If the p-value is greater than α, we fail to reject H0 and conclude that it's reasonable that the variables are independent.

EXAMPLE 3.7: Refer to the observed and expected counts from Example 3.5. Let's conduct the five steps of a chi-squared test for this data, using α = 0.05.

Assumptions: The data come from a random sample, and every expected count is at least 5 (the smallest is 19.65).

Hypotheses: H0: education level and vote are independent; Ha: they are dependent in some way.

Test statistic: Summing (observed − expected)²/expected over all nine cells gives X² ≈ 10.81.

P-value: Here df = (3 − 1)(3 − 1) = 4. In the df = 4 row of the chi-squared table, 10.81 falls between 9.49 (right-tail probability 0.050) and 11.14 (right-tail probability 0.025), so the p-value is between 0.025 and 0.05.

Decision: Since the p-value is less than α = 0.05, we reject H0 and conclude that vote depended on education level in some way. ◊
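In practice, software (R's pchisq, scipy.stats.chi2, etc.) computes the exact p-value. As a self-contained sketch, the chi-squared right-tail probability has a simple closed form when df is even, which is enough to check the table-based bracket from the worked test above (df = 4 there; the function name is mine):

```python
import math

def chi_sq_right_tail(x, df):
    """Right-tail probability P(X^2 >= x) for a chi-squared
    distribution with an even number of degrees of freedom df."""
    assert df % 2 == 0 and df > 0
    half = x / 2
    # Closed form for even df:
    # P(X^2 >= x) = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

# X^2 = 10.81 with df = (3-1)(3-1) = 4, as in the voting-data test.
p = chi_sq_right_tail(10.81, 4)
print(round(p, 3))   # 0.029, indeed between 0.025 and 0.05
```

As a sanity check, the df = 2 table entry 5.99 should give a right-tail probability of almost exactly 0.050.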
Variable Labels

Notice that we don't treat the explanatory variable any differently from the response variable at any point in the test. That means the chi-squared test gives the same results no matter which way we identify our variables as the explanatory variable and the response variable.

• Chi-Squared Tests for 2×2 Tables

In your first-semester course, you learned how to do a two-sided hypothesis test to compare two groups when the response variable was categorical with two values. (See Section 10.1 of the textbook for details, which we won't worry about.) But notice that this kind of data could also be represented as a contingency table with two rows and two columns (a 2×2 table), simply by calling "group number" our explanatory variable and letting the two groups be its categories.

EXAMPLE 3.8: The example on page 469 of the textbook compares the proportions of people who suffered a heart attack in a group taking a placebo and another group taking aspirin. As shown in Table 10.1 of the textbook, this situation can be represented with a contingency table:

                 Heart Attack
    Group        Yes     No
    Placebo
    Aspirin

This is the right kind of data for either the test from Section 10.1 of the textbook or the chi-squared test. ◊

Which way is better? It turns out that the chi-squared test for a 2×2 table is exactly equivalent to the two-sided hypothesis test you learned in Section 10.1, in the sense that they will always yield exactly the same p-value, and thus exactly the same conclusion. We won't worry about the details of why this is true, but see pages 557-559 in Section 11.2 of the textbook if you're interested.

• Limitations and Other Tests

The chi-squared test is a very useful tool, but it has its limitations. Sometimes there are other procedures that we should use instead of, or in addition to, the chi-squared test.

Ordinal Variables

There are two types of categorical variables:

• An ordinal variable, like education level, has ordered categories.
• A nominal variable, like race, has unordered categories.

For ordinal variables, notice that the chi-squared test doesn't use the ordering information: putting the categories in a different (possibly nonsensical) order would give the same results. There exist other tests that do use the ordering information, and these tests might work slightly better than the chi-squared test, specifically by making slightly fewer type II errors. In practice, however, chi-squared tests are still used fairly often in these situations.

Fisher's Exact Test

If any of the expected counts are less than 5, the chi-squared test should not be used. However, there exists another test, called Fisher's exact test, that is fine no matter how small the expected counts are. Fisher's exact test is more complicated and more computationally intensive than the chi-squared test, so if we have a choice between the two, we usually choose the chi-squared test. But if we have an expected count that's less than 5, then Fisher's exact test is our only choice. Since computer software can do it for us, we won't worry about how Fisher's exact test actually works; Section 11.5 of the textbook gives an overview.

Strength of Evidence

A common misinterpretation of a chi-squared test occurs when people confuse strong evidence of an association with evidence of a strong association.

EXAMPLE 3.9: Suppose a random sample over several years yields the following data for admissions decisions at a particular college:

                   Admissions Decision
    Gender       Accepted       Rejected
    Female       3830 (45%)     4659 (55%)
    Male         3537 (43%)     4692 (57%)

If we calculate the value of the X² statistic, we find that X² = 7.728, which corresponds to a p-value of 0.005. This is a small p-value, smaller than any α we would commonly use, so there is strong evidence that admissions decision depends on gender. However, the association itself does not appear to be very strong: in our sample, females were accepted at a 45% rate, while males were accepted at a 43% rate, only slightly lower. ◊
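The notes treat Fisher's exact test as a software black box, which is reasonable. For the curious, the 2×2 case can be sketched with nothing more than binomial coefficients: with the margins fixed, the top-left cell count follows a hypergeometric distribution. The function name and the example counts below are hypothetical, and the two-sided convention shown (summing the probabilities of all tables no more likely than the observed one) is the common one, though software may differ.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    total = comb(n, col1)

    def prob(x):
        # P(top-left cell = x) with all margins fixed: hypergeometric.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo = max(0, col1 - row2)      # smallest feasible top-left count
    hi = min(col1, row1)          # largest feasible top-left count
    # Sum the probabilities of all tables no more likely than the observed one
    # (with a tiny tolerance for floating-point ties).
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical small table whose expected counts fall below 5:
p = fisher_exact_2x2(1, 9, 6, 4)
print(round(p, 4))   # 0.0573
```

Because the test enumerates every table with the same margins, it stays valid no matter how small the counts are, at the cost of more computation.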
Strong evidence of an association is not the same thing as evidence of a strong association. If we have large sample sizes, it is possible to find strong evidence that an association exists, but that the association itself is fairly weak. The X² test statistic measures the strength of the evidence, not the strength of the association itself. We'll learn about ways to measure the strength of the association in the next section.

Pattern of Association

The chi-squared test also tells us nothing about the pattern of association between the two variables. When there are more than two categories for each variable, the pattern of association may be complicated, and the chi-squared test gives us no information about it.

EXAMPLE 3.10: From the chi-squared test in Example 3.7, we concluded that vote depended on education level. But how did vote depend on education level? For which education levels were people more likely, or less likely, to vote for Crist, or for Davis, or for someone else? We can't tell from the chi-squared test alone. ◊

In the next section, we'll learn how to identify the pattern of association by examining the standardized residuals.

3.3 Examining the Association

When we want to investigate the possible association between two categorical variables, we first perform a chi-squared test, as discussed in the previous section. Then one of two things happens:

• The chi-squared test concludes that it's reasonable that the variables are independent; that is, there's not enough evidence to conclude that an association exists. Then there's usually no reason to examine the strength or pattern of the association, because we can't even conclude that there's an association in the first place.

• The chi-squared test concludes that the variables are dependent; that is, there is an association. Then we usually want to examine the strength and pattern of that association. Since the chi-squared test itself tells us nothing about the strength or pattern of the association, we'll need to learn some additional methods.
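The evidence-versus-strength distinction from the last section is easy to demonstrate numerically: multiplying every count in a table by 10 leaves the conditional percentages (the association) unchanged but multiplies X² (the evidence) by 10. A quick sketch, using the admissions counts from Example 3.9 (the helper function is mine):

```python
def chi_sq_stat(table):
    """X^2 statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum((table[i][j] - rt * ct / n) ** 2 / (rt * ct / n)
               for i, rt in enumerate(row_totals)
               for j, ct in enumerate(col_totals))

admissions = [[3830, 4659],   # Female: accepted, rejected
              [3537, 4692]]   # Male:   accepted, rejected

# Same conditional percentages, ten times the sample size.
bigger = [[10 * x for x in row] for row in admissions]

print(round(chi_sq_stat(admissions), 3))   # 7.728
print(round(chi_sq_stat(bigger), 2))       # 77.28: same association,
                                           # ten times the evidence
```

This is why a tiny p-value from a huge sample does not, by itself, mean the association is strong.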
• Strength of Association

A measure of association is a number that summarizes the strength of the dependence of two variables. It could summarize the association in the sample, in which case it's a sample statistic, or it could summarize the association in the population, in which case it's a population parameter. Like everything else, we typically use a measure of association for the sample as an estimate of the corresponding measure of association for the population.

For now, let's restrict our attention to categorical variables with only two categories each; in other words, situations that we can represent with 2×2 contingency tables.

Difference Between Proportions

The most obvious measure of association is just the difference between the proportions in the conditional distributions; that is, the difference between the rows of the conditional distributions.

EXAMPLE 3.11: Recall the contingency table and conditional percentages for our college admissions data in Example 3.9:

                   Admissions Decision
    Gender       Accepted       Rejected
    Female       3830 (45%)     4659 (55%)
    Male         3537 (43%)     4692 (57%)

We said that a chi-squared test for this data yields a p-value of 0.005, so we can conclude that there is an association. But how strong is that association? The sample difference between the proportions accepted is 0.45 − 0.43 = 0.02. ◊

Notice that we do not calculate differences between the columns of conditional percentages (proportions) as a measure of association.

EXAMPLE 3.12: In Example 3.11, the fact that the proportion of females accepted is 0.10 lower than the proportion of females rejected tells us nothing about the strength of the association between admissions decision and gender. ◊

Note: Our discussion of rows and columns assumes we've calculated our conditional percentages by dividing each cell by its row total. If, for some reason, we've decided to calculate our conditional percentages the opposite way, by dividing each cell by its column total, then everything about rows and columns in this section should be reversed.
If our variables are independent, then the population difference between proportions must be zero. However, due to random variation, the sample difference between proportions is unlikely to be exactly zero, although it will probably be close. It's possible for the difference between proportions to be anything between −1 and 1, where it's positive or negative depending simply on which way we do the subtraction. The farther the difference between proportions is from zero (the closer it is to −1 or 1), the stronger the association is between the two variables. To see why, think about what would happen if the association between the variables were as strong as it could possibly be.

EXAMPLE 3.13: What would happen if admissions decision depended on gender as strongly as possible? Then we would see something like this:

                 Admissions Decision
    Gender       Accepted    Rejected
    Female         100%          0%
    Male             0%        100%

Here the difference between the proportions accepted is 1 − 0 = 1 (or −1, subtracting the other way), as far from zero as possible. ◊

Relative Risk

The difference between proportions may not be very informative if one of our categorical variables includes a category that is rare.

EXAMPLE 3.14: Suppose we're looking at the association between birth defects and exposure of the mother to a certain toxic substance during pregnancy. We take a random sample and calculate the following sample conditional percentages:

                   Condition of Baby
    Exposure       Defects    Healthy
    Exposed           6%        94%
    Not Exposed       2%        98%

The difference between the proportions with defects is only 0.06 − 0.02 = 0.04, which sounds small, even though the defect rate is three times as high in the exposed group. ◊

When some conditional proportions are close to zero, it's often better to use the ratio of proportions instead of the difference. When the proportion refers to something that's bad in the real world, the ratio of proportions is called the relative risk.

EXAMPLE 3.15: In Example 3.14, the sample relative risk of birth defects for exposure to the substance versus nonexposure is 0.06/0.02 = 3. ◊

If our variables are independent, then the population relative risk must equal 1. However, due to random variation, the sample relative risk is unlikely to be exactly equal to 1, although it will probably be close.
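Both measures of association for the birth-defect data take one line each (a sketch; the proportions are those given in Example 3.14):

```python
# Sample conditional proportions of birth defects from Example 3.14.
p_exposed = 0.06
p_not_exposed = 0.02

difference = p_exposed - p_not_exposed     # difference between proportions
relative_risk = p_exposed / p_not_exposed  # ratio of proportions

print(round(difference, 2))      # 0.04: sounds small...
print(round(relative_risk, 2))   # 3.0:  ...but three times the risk
```

The contrast between the two printed values is exactly the point of the example: with rare outcomes, the ratio conveys the association better than the difference.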
It's possible for the relative risk to be any number greater than or equal to zero. Relative risks farther from 1 in either direction (toward zero or toward infinity) represent stronger associations. Relative risks that are reciprocals, such as 4 and 0.25, represent the same strength of association.

More than Two Categories

Although the relative risk and difference between proportions are easier to understand in 2×2 contingency tables, the ideas still work even if we have more than two categories. The only difference is that a single situation could have several different relative risks (or several different differences between proportions) associated with it: one for each pair of categories.

EXAMPLE 3.16: In Example 3.14, suppose that we had instead classified exposure to the toxic substance into three categories: high exposure, low exposure, and nonexposure. Then we could calculate a relative risk of birth defects for each pair of exposure categories: high versus non-, low versus non-, and high versus low. ◊

Confidence Intervals

We know that we can use a measure of association for the sample to estimate the corresponding measure of association for the population. But we could also use the sample data to create a confidence interval for the population relative risk or population difference between proportions. We won't worry about how to actually do this, but computer software can do it for us.

EXAMPLE 3.17: In Example 3.15, the sample relative risk was 3. Suppose computer software calculates that a 95% confidence interval for the population relative risk is (2.63, 4.52). Then we are 95% confident that the true value of the relative risk for the population is between 2.63 and 4.52. ◊

• Pattern of Association

When we have a contingency table with several rows or several columns, if we conclude from a chi-squared test that our variables are associated, it may not be obvious how the response variable depends on the explanatory variable.

Residuals

We can examine the pattern of association using the residuals. For each cell, its residual equals its observed count minus its expected count. So the residual for a cell is how many more (if it's positive) or fewer (if it's negative) subjects we observed in that
cell than we would have expected if the variables were independent.

Standardized Residuals

Rather than looking at the residuals themselves, we instead look at the standardized residuals. Each cell's standardized residual is its residual divided by the standard error of that residual. We typically use computer software to calculate standardized residuals for us.

Distribution of the Standardized Residuals

If the variables are independent, then the standardized residuals will approximately follow a standard normal distribution (mean 0, standard deviation 1). Things that follow a standard normal distribution should rarely have values above about 2 or 3, or below about −2 or −3, so a standardized residual that far or farther from zero for a particular cell suggests that something is going on with that cell that prevents the variables from being independent. Just to make things more precise, we'll use 2.5 for this "about 2 or 3" cutoff. (Actually, the cutoff varies in a complicated way based on the number of rows and columns in our table, but 2.5 is close enough.)

Note: The textbook uses 3 instead of 2.5, but I really think 2.5 is a better guideline when our plan is to examine the standardized residuals only after a chi-squared test shows evidence of an association.

Interpreting Standardized Residuals

The standardized residuals allow us to draw conclusions about the population regarding which cells of the table have more subjects than we would expect if the variables were independent:

• A standardized residual above 2.5 indicates that the population has more subjects in the corresponding cell than we would expect if the variables were independent.

• A standardized residual below −2.5 indicates that the population has fewer subjects in the corresponding cell than we would expect if the variables were independent.

The pattern of association is sometimes easier to interpret when you think about it one row (explanatory variable value) at a time.
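Software computes standardized residuals directly from the observed table. As a sketch, the version below uses the usual adjusted-residual formula, residual / sqrt(expected × (1 − row total/n) × (1 − column total/n)); it is an assumption that this is the convention behind the software output quoted in the notes, but it does reproduce the reported values for the voting data.

```python
import math

# Observed counts from Example 3.2.
observed = [
    [321, 328, 23],   # No College:     Crist, Davis, Other
    [456, 334, 24],   # Some College
    [684, 599, 35],   # College Degree
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

std_resid = []
for i, rt in enumerate(row_totals):
    row = []
    for j, ct in enumerate(col_totals):
        expected = rt * ct / n
        # Standard error of this cell's residual (adjusted-residual formula).
        se = math.sqrt(expected * (1 - rt / n) * (1 - ct / n))
        row.append((observed[i][j] - expected) / se)
    std_resid.append(row)

print(round(std_resid[0][0], 2))   # -2.58 for the No College / Crist cell
```

Each value is the cell's residual measured in standard-error units, so it can be compared against the ±2.5 guideline directly.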
EXAMPLE 3.18: In Example 3.7, we used a chi-squared test with α = 0.05 to conclude that people's voting decisions depended on their level of education. To examine the pattern of this association, we calculate each cell's standardized residual using computer software and get the following:

                         Vote
    Education        Crist    Davis    Other
    No College       -2.58     2.29     0.88
    Some College      2.65    -2.68     0.05
    College Degree   -0.21     0.48    -0.80

Let's draw conclusions about the population one row at a time:

• No College: The standardized residual for Crist is −2.58, below −2.5, so the population has fewer Crist voters with no college than we would expect if the variables were independent. The other residuals in this row are within ±2.5.

• Some College: The residual for Crist is 2.65 (above 2.5), so the population has more Crist voters with some college than we would expect under independence; the residual for Davis is −2.68 (below −2.5), so the population has fewer Davis voters with some college than we would expect.

• College Degree: None of the residuals in this row are beyond ±2.5, so nothing in this row stands out as deviating from independence. ◊