ELEM STATISTICS [C3T1G1]
ELEM STATISTICS [C3T1G1] MATH 220
This 58-page set of class notes was uploaded by Eunice Schoen on Saturday, September 26, 2015. The notes belong to MATH 220 at James Madison University, taught by Rickie Domangue in Fall.
Date Created: 09/26/15
Chapter 8: Statistical Inference: Significance Tests About Hypotheses

8.1 What Are the Steps for Performing a Significance Test?

A. Definitions
• Hypothesis (page 368): In statistics, a hypothesis is a statement about a population, usually of the form that a parameter (such as $\mu$ or $p$) takes a particular numerical value or falls in a certain range of values.
• Significance testing is a major form of inferential statistics. It is a method of using data to summarize the evidence about a hypothesis.

Five Steps of Significance Testing (see p. 371)
1. Assumptions: Specify the variable and parameter. Assumptions commonly pertain to the method of data production (randomization), the sample size, and the shape of the population distribution.
2. Hypotheses:
   - Null hypothesis $H_0$: states that the parameter takes a particular value.
   - Alternative hypothesis $H_a$: states that the parameter falls in some alternative range of values.
3. Test statistic: Measures the distance between the point estimate of the parameter and its null hypothesis value, usually by the number of standard errors between them.
4. P-value: Presume $H_0$ to be true. The P-value is the probability that the test statistic takes the observed value or a value more extreme in the direction of $H_a$. Smaller P-values represent stronger evidence against $H_0$.
5. Conclusion: Report and interpret the P-value in the context of the study. Based on the P-value, make a decision about $H_0$ if one is needed. If a decision is needed:
   - reject the presumption that $H_0$ is true, and conclude that $H_a$ is true, if P-value $\le \alpha$;
   - do not reject the presumption of $H_0$ (do not conclude that $H_a$ is true) if P-value $> \alpha$.
   Here $\alpha$ is a small value, such as 0.05 or 0.01, prescribed by the researcher or some other interested party.

8.2 Significance Tests About Proportions

1. Example 1 (One-sided significance test). In the 1980s it was generally believed that congenital abnormalities affected about 5% of the nation's children. Some people believe that the increase in the number of chemicals in the environment has led to an increase in the incidence of abnormalities. A recent study
examined 384 children and found that 30 of them showed signs of an abnormality. Is this strong evidence that the risk has increased? Use a significance level of 0.05.

Five steps:
• Assumptions
  - The data are binary (success: child has an abnormality; failure: child does not). Assume the 384 children were selected at random.
  - $n = 384$ is sufficiently large, at the null hypothesis value of $p$, to ensure that the sampling distribution of $\hat{p}$ is approximately normal. Rule of thumb: the expected numbers of successes and failures at the null hypothesis value of $p$ are both at least 15. True here, since $384(0.05) = 19.2$ and $384(1 - 0.05) = 364.8$ are both at least 15.
• Hypotheses: Let $p$ = population proportion of children TODAY who have congenital abnormalities.
  - $H_0: p = 0.05$ (risk today is no different than in the 1980s)
  - $H_a: p > 0.05$ (there is an increased risk today)
• Test statistic: Let $\hat{p}$ be the sample proportion of the 384 children who have congenital abnormalities.
  $$z = \frac{\hat{p} - 0.05}{\sqrt{0.05(1 - 0.05)/384}}$$
• P-value
  - Suppose that $H_0$ is true, i.e., the risk is 0.05, the same as it was in the 1980s. The $z$ test statistic then has an approximate standard normal distribution.
  - Calculated $\hat{p} = 30/384 = 0.078$
  - Calculated $z = \dfrac{0.078 - 0.05}{\sqrt{0.05(1 - 0.05)/384}} \approx 2.55$ (with the standard error rounded to 0.011)
  - P-value $= P(z \ge 2.55) = 1 - 0.9946 = 0.0054$
• Conclusion (including interpretation of the P-value)
  - Interpretation: It is very unlikely (P = 0.0054) to obtain a sample proportion like 0.078 among 384 children (a difference like 0.028 from a population proportion of 0.05) due solely to random sampling variation.
  - The P-value is smaller than our significance level of 0.05, so there is strong evidence against the population proportion of children affected by congenital abnormalities being the same (0.05) as it was in the 1980s, and in favor of an increased risk.

2. Example 2 (One-sided significance test). A number of initiatives on the topic of legalized gambling have appeared on state ballots in recent years. Suppose that a political candidate has decided to support legalization of casino gambling if he is convinced
that more than two-thirds ($2/3 \approx 0.667$) of US adults approve of casino gambling. USA Today (June 17, 1999) reported the results of a Gallup poll in which 1523 adults, selected at random from households with telephones, were asked whether they approved of casino gambling. The number in the sample who approved was 1035. Does the sample provide convincing evidence that more than two-thirds approve? Use a significance level $\alpha = 0.05$.

• Assumptions
  - The data are binary (success: adult approves; failure: adult does not approve). The 1523 adults were selected at random.
  - $n = 1523$ is sufficiently large, at the null hypothesis value of $p$, to ensure that the sampling distribution of $\hat{p}$ is approximately normal. Rule of thumb: the expected numbers of successes and failures at the null hypothesis value of $p$ are both at least 15. True here, since $1523(0.667) = 1015.8$ and $1523(1 - 0.667) = 507.2$ are both at least 15.
• Hypotheses: Let $p$ = population proportion of adults who approve of legalization of casino gambling.
  - $H_0: p = 2/3 \approx 0.667$ (proportion approving in the population is 2/3)
  - $H_a: p > 2/3 \approx 0.667$ (more than 2/3 of the population approves)
• Test statistic: Let $\hat{p}$ be the random variable "sample proportion of a random sample of 1523 adults who approve."
  $$z = \frac{\hat{p} - 0.667}{\sqrt{0.667(1 - 0.667)/1523}}$$
• P-value
  - Suppose that $H_0$ is true, i.e., the population proportion approving is $2/3 \approx 0.667$. The $z$ test statistic then has an approximate standard normal distribution.
  - Calculated $\hat{p} = 1035/1523 = 0.680$
  - Calculated $z = \dfrac{0.680 - 0.667}{\sqrt{0.667(1 - 0.667)/1523}} = 1.08$
  - P-value $= P(z \ge 1.08) = 1 - 0.8599 = 0.1401$
• Conclusion (including interpretation of the P-value)
  - Interpretation: There is about a 14% chance (P = 0.1401) of obtaining a sample proportion of adults approving like 0.680 or higher (a difference like 0.013 from a population proportion equal to 0.667) due solely to random sampling variation.
  - The P-value is not smaller than our significance level of 0.05, so there is not convincing evidence against the population proportion approving being equal to 2/3; that is, there is not convincing evidence that the population
proportion approving is greater than 2/3.

3. Example 3 (One-sided significance test). In December 2003 a county-wide water conservation campaign was conducted in a particular county. In January 2004 a random sample of 500 homes was selected, and water usage was recorded for each home in the sample. The county supervisors want to know whether the data support the claim that fewer than half the households in the county reduced water consumption. Suppose that the sample proportion that reduced water consumption was 0.440. Use a significance level of 0.01.

• Assumptions
  - The data are binary (success: reduced water consumption; failure: did not reduce water consumption).
  - The 500 homes were selected at random.
  - $n = 500$ is sufficiently large, at the null hypothesis value of $p$, to ensure that the sampling distribution of $\hat{p}$ is approximately normal. Rule of thumb: the expected numbers of successes and failures at the null hypothesis value of $p$ are both at least 15. True here, since $500(0.5) = 250$ and $500(1 - 0.5) = 250$ are both at least 15.
• Hypotheses: Let $p$ = population proportion of homes that reduced water consumption.
  - $H_0: p = 0.5$ (exactly half of the homes in the population reduced water consumption)
  - $H_a: p < 0.5$ (fewer than half of the homes in the population reduced water consumption)
• Test statistic: Let $\hat{p}$ be the random variable "sample proportion of a random sample of 500 homes that reduced water consumption."
  $$z = \frac{\hat{p} - 0.5}{\sqrt{0.5(1 - 0.5)/500}}$$
• P-value
  - Suppose that $H_0$ is true, i.e., the population proportion of homes reducing water consumption is 0.5. The $z$ test statistic then has an approximate standard normal distribution.
  - Observed $\hat{p} = 0.440$
  - Calculated $z = \dfrac{0.440 - 0.5}{\sqrt{0.5(1 - 0.5)/500}} \approx \dfrac{-0.06}{0.022} \approx -2.73$
  - P-value $= P(z \le -2.73) = 0.0032$
• Conclusion (including interpretation of the P-value)
  - Interpretation: It is very unlikely (P = 0.0032) to obtain a sample proportion of homes reducing consumption like 0.440 or lower (a difference like 0.06 from a population proportion equal to 0.5) due solely to random sampling variation.
  - The P-value is smaller than our significance level of 0.01, so there is strong
evidence against the population proportion reducing water consumption being equal to 0.5, and in favor of the population proportion of homes reducing consumption being less than 0.5.

4. Example 4 (Two-sided significance test). The Associated Press (February 27, 1995) reported that 71% of Americans age 25 and older are overweight, a substantial increase over the 58% figure from 1983. Although this information came from a Harris Poll (a survey rather than a census of the population), let's assume for the purposes of this example that the nationwide population proportion is exactly 0.71. Suppose that an investigator wishes to know whether the proportion of such individuals in her state who are overweight differs from the national proportion. A random sample of size $n = 600$ results in 450 who are classified as overweight. What can the investigator conclude? Answer this question by carrying out a significance test with significance level 0.01.

• Assumptions
  - The data are binary (success: overweight; failure: not overweight). The 600 persons were selected at random.
  - $n = 600$ is sufficiently large, at the null hypothesis value of $p$, to ensure that the sampling distribution of $\hat{p}$ is approximately normal. Rule of thumb: the expected numbers of successes and failures at the null hypothesis value of $p$ are both at least 15. True here, since $600(0.71) = 426$ and $600(1 - 0.71) = 174$ are both at least 15.
• Hypotheses: Let $p$ = population proportion of people in her state who are overweight.
  - $H_0: p = 0.71$ (proportion of overweight individuals in the state is the same as nationally)
  - $H_a: p \ne 0.71$ (proportion of overweight individuals in the state differs from the national proportion)
• Test statistic: Let $\hat{p}$ be the variable "sample proportion of overweight individuals in a random sample of 600 from her state."
  $$z = \frac{\hat{p} - 0.71}{\sqrt{0.71(1 - 0.71)/600}}$$
• P-value
  - Suppose that $H_0$ is true, i.e., the population proportion of individuals in the state who are overweight is 0.71, the same as the national rate. The $z$ test statistic then has an approximate standard normal distribution.
  - Observed $\hat{p} = 450/600 = 0.75$
  - Calculated $z = \dfrac{0.75 - 0.71}{\sqrt{0.71(1 - 0.71)/600}} \approx 2.11$ (with the standard error rounded to 0.019)
  - P-value (two-tailed probability) $= P(z \ge 2.11 \text{ or } z \le -2.11) = 2P(z \le -2.11) = 2(0.0174) = 0.0348$
• Conclusion (including interpretation of the P-value)
  - Interpretation: There is about a 3.5% chance (P = 0.0348) of obtaining a sample proportion as far away as 0.04, in either direction, from a population proportion of 0.71 due solely to random sampling variation. While this probability is rather small, it is not small enough according to our criterion of 0.01. That is, P-value > 0.01, and thus there is not quite enough evidence, based on this sample, to conclude that the proportion of all individuals in her state who are overweight differs from the national proportion of 0.71.

5. Summary: Steps of a Significance Test for a Population Proportion $p$. See page 381.

6. Some things to know
• The P-value is a measure of how likely it is to obtain the kind of data that we got due solely to random sampling (chance) variation from a true null hypothesis. So small P-values provide evidence against the null hypothesis and in favor of the alternative hypothesis.
• The significance level (or $\alpha$-level) of a test, a small number, is a prescription for how small the P-value has to be before we say we have convincing evidence against the null hypothesis. Typical values of $\alpha$ are 0.05 or 0.01.
  - If P-value $\le \alpha$, then there is convincing evidence against the null hypothesis and in favor of the alternative hypothesis. We say that we reject the null hypothesis and accept the alternative hypothesis.
  - If P-value $> \alpha$, then there is not enough evidence against the presumed truth of the null hypothesis. We say that we do not reject the presumed truth of the null hypothesis.
• A decision "do not reject the null hypothesis" DOES NOT MEAN "ACCEPT THE NULL HYPOTHESIS." It means that the null hypothesis is plausible, but the alternative might be plausible as well.
• A decision to "accept the alternative hypothesis" DOES MEAN THAT the alternative hypothesis is plausible and the null hypothesis is not.

7. How Do We Decide Between a One-sided and a Two-sided Test? (page 384, textbook)

8.3 Significance Tests About Means

1. General Form of Significance Test (page 392)

2. Example. A study conducted by researchers at Pennsylvania State University investigated whether time perception, a simple indication of a person's ability to concentrate, is impaired during nicotine withdrawal. The study results were presented in the paper "Smoking Abstinence Impairs Time Estimation Accuracy in Cigarette Smokers." After a 24-hour smoking abstinence, 18 smokers were asked to estimate how much time had passed during a 45-second period. Suppose the resulting data on perceived elapsed time (in seconds) are as shown:

  69 65 72 73 59 55 39 52 57 56 50 70 47 56 45 70 64 53

The researchers wanted to determine whether smoking abstinence had a negative impact on time perception, causing elapsed time to be overestimated. Use a significance level of 0.05.

• Assumptions
  - The variable is quantitative (here, perceived time in seconds).
  - Assume the 18 smokers were randomly selected.
  - The population distribution is approximately normal (needed at least when the sample size is small, $n < 30$). A stem-and-leaf display shows the data distribution to be roughly bell shaped, which does not dispute a normal population.
• Hypotheses: Let $\mu$ = population mean amount of perceived time.
  - $H_0: \mu = 45$ (mean amount of time perceived is the same as the actual time)
  - $H_a: \mu > 45$ (mean amount of time perceived is greater than the actual time, indicating impaired concentration)
• Test statistic: Let $\bar{x}$ be the variable "sample mean amount of perceived time," and let $s$ be the sample standard deviation of the amount of perceived time.
  $$t = \frac{\bar{x} - 45}{s/\sqrt{n}}$$
• P-value
  - Suppose that $H_0$ is true, i.e., the population of individuals has a mean perceived amount of time equal to the actual time, 45 seconds. The $t$ test statistic then has a $t$ distribution with $df = 18 - 1 = 17$.
  - Observed (SPSS): $\bar{x} = 58.44$, $s = 10.02$
  - Calculated $t = \dfrac{58.44 - 45}{10.02/\sqrt{18}} = 5.693$
  - P-value (one-tailed probability) $= P(t \ge 5.693)$. Using Table B, $P < 0.001$, which is less than 0.05.
• Conclusion
(including interpretation of the P-value)
  - Interpretation: There is less than a 0.1% chance of obtaining a sample mean as far above 45 as the observed value 58.44, or higher, due solely to random sampling variation. This probability is smaller than our significance level of 0.05, and thus there is evidence, based on this sample, to conclude that smoking abstinence had a negative impact on time perception, causing elapsed time to be overestimated.

8.4 Decisions and Types of Errors in Significance Tests

1. Two Potential Types of Errors (page 400)

2. Example: On-Time Arrivals. The US Department of Transportation reported that during a recent period, 77% of all domestic passenger flights arrived on time (meaning within 15 minutes of the scheduled arrival). Suppose that an airline with a poor on-time record decides to offer its employees a bonus if, in an upcoming month, the airline's proportion of on-time flights exceeds the industry rate of 0.77. Let $p$ be the true proportion of the airline's flights that are on time during the month of interest. A random sample of flights might be selected and used as a basis for testing $H_0: p = 0.77$ versus $H_a: p > 0.77$.

3. Example: Slowing the Growth of Tumors. Researchers at the National Cancer Institute announced plans to begin studies of a cancer treatment thought to slow the growth of tumors. Having tested the treatment on only 13 patients, the researchers had very little information, but they stated that the treatment appears to be less toxic than standard chemotherapy treatments. An experiment to study the treatment more extensively was reported as being in the planning stages. Let $\mu$ denote the true population mean growth rate of tumors for patients receiving the new treatment. Data resulting from the planned experiments can be used to test
  - $H_0: \mu$ = mean growth rate of tumors without treatment
  - $H_a: \mu$ < mean growth rate of tumors without treatment

4. The significance level is the probability of a Type I error.

Chapter 9: Comparing Two Groups

9.0 Introduction

A. Bivariate Analyses: A Response
Variable and a Binary Explanatory Variable
• Example (categorical response variable): Response variable: student is a binge drinker or not. Explanatory variable: gender.
• Example (quantitative response variable): Response variable: GPA. Explanatory variable: gender.

B. Dependent and Independent Samples
• Independent samples
  - Experiment: subjects are randomly assigned to two treatments, A and B. A and B are the values of the binary explanatory variable.
    Example: 100 patients with a disease are randomly assigned to a new treatment (A) or a standard treatment (B), with 50 assigned to each. The binary explanatory variable is the treatment received. The response variable is length of life after treatment.
  - Observational study: one sample is chosen at random from one population (A), and another sample is chosen independently and at random from a second population (B). A and B are the values of the binary explanatory variable.
    Example: From a list of males (A), select a random sample; from a list of females (B), select a random sample. The binary explanatory variable is gender. The response variable is GPA.
  - Observational study: subjects are selected at random from one population and then grouped by a binary variable (A or B).
    Example: From one list of students, select a sample and then split the sample into two according to live on campus (A) and live off campus (B). The binary explanatory variable is live on or off campus. The response variable is study time.
• Dependent samples
  - Experiment: subjects are paired according to a common characteristic. One person within each pair is assigned at random to one treatment (A); the other person within the pair is assigned to the other treatment (B). This is repeated for several pairs. A and B are the values of the binary explanatory variable.
    Example: Subjects are paired according to similar IQ. One member of each pair is assigned to study method A, and the other member is assigned to study method B. The binary explanatory variable is study method. The response variable is score on an exam.
  - Experiment: each subject receives both treatments, A and B. A and B are the values of the binary explanatory variable.
    Example: Each subject with a cold receives two
different cold medicines, A and B, at different times. The binary explanatory variable is cold medicine. The response variable is a measure of relief of symptoms.
  - Observational study: pairs are sampled. One person in each pair has value A of the binary variable; the other person has value B.
    Example: Sample couples (husbands/wives). Husbands (A) define one sample; wives (B) define the other sample. The binary explanatory variable is person in pair (husband or wife). The response variable is opinion on some issue.

9.1 Categorical Response: How Can We Compare Two Proportions?

9.1.1 Significance Testing About Two Proportions

1. Example 1 (Source: Devore and Peck). Some people seem to believe that you can fix anything with duct tape. Even so, many were skeptical when researchers announced that duct tape may be a more effective and less painful alternative to liquid nitrogen, which doctors routinely use to freeze warts. The article "What a Fix-It: Duct Tape Can Remove Warts" (San Luis Obispo Tribune, October 15, 2002) described a study conducted at Madigan Army Medical Center. Patients with warts were randomly assigned to either the duct tape treatment or the more traditional freezing treatment. Those in the duct tape group wore duct tape over the wart for 6 days, then removed the tape, soaked the area in water, and used an emery board to scrape the area. This process was repeated for a maximum of 2 months or until the wart was gone. Of 100 people in the liquid nitrogen (freezing) group, 60 had their warts successfully removed. Of 104 in the duct tape group, 88 had their warts successfully removed. Do the data suggest that freezing is less successful than duct tape in removing warts? Use a significance level $\alpha = 0.05$.

• Assumptions
  - Categorical response variable for two groups: here the groups are defined by the binary explanatory variable (duct tape and liquid nitrogen groups); the categorical response variable is wart removed (SUCCESS) or not removed (FAILURE).
  - Independent random samples (in a survey) or random assignment (in an experiment): in this example, 204 subjects
randomly assigned to two treatments.
  - The two sample sizes $n_1$ and $n_2$ are sufficiently large to ensure that the sampling distribution of $\hat{p}_1 - \hat{p}_2$ is approximately normal. Rule of thumb: the number of successes and the number of failures in both samples are at least 10. Here, the liquid nitrogen group has 60 successes and 40 failures (both at least 10), and the duct tape group has 88 successes and 16 failures (both at least 10).
• Hypotheses: Let $p_1$ = population proportion of warts successfully removed by freezing, and $p_2$ = population proportion of warts successfully removed with the duct tape treatment.
  - $H_0: p_1 = p_2$, that is, $p_1 - p_2 = 0$
  - $H_a: p_1 < p_2$, that is, $p_1 - p_2 < 0$
• Test statistic
  - Let $\hat{p}_1$ be the random variable "sample proportion of the 100 subjects assigned to the liquid nitrogen treatment whose warts are removed."
  - Let $\hat{p}_2$ be the random variable "sample proportion of the 104 subjects assigned to the duct tape treatment whose warts are removed."
  $$z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
    where $\hat{p}$ = proportion of successful removals pooled across both groups.
• P-value
  - Suppose that $H_0$ is true. The $z$ test statistic then has an approximate standard normal distribution.
  - Calculated $\hat{p}_1 = 60/100 = 0.60$
  - Calculated $\hat{p}_2 = 88/104 = 0.85$
  - Calculated $\hat{p} = (60 + 88)/(100 + 104) = 148/204 = 0.73$
  - Calculated $z = \dfrac{0.60 - 0.85}{\sqrt{0.73(1 - 0.73)\left(\frac{1}{100} + \frac{1}{104}\right)}} \approx \dfrac{-0.25}{0.062} \approx -4.03$
  - P-value $= P(z \le -4.03) \approx 0$
• Conclusion (interpretation of the P-value, conclusion about the hypotheses, answer to the question posed)
  - Interpretation: It is practically impossible (P-value $\approx 0$) to obtain a difference in sample proportions like 0.25 from sampling error if in fact the population proportions of successful wart removals for the two treatments are the same (null true).
  - Conclusion about hypotheses: The P-value, being about 0, is smaller than our significance level of 0.05, so the null hypothesis is rejected and the alternative hypothesis is accepted.
  - Answer to the question: Yes, the data do suggest that freezing is less successful than duct tape in removing warts.

2. Example 2. A consumer magazine polls car owners to see if they are happy enough with their vehicles that they would purchase the
same model again. They randomly selected 450 owners of American-made cars and 450 owners of Japanese models. 342 of the 450 owners of American-made cars said they would purchase the same model again; 351 of the 450 owners of Japanese models said they would purchase the same model again. Is there sufficient evidence of a difference in opinion between the two types of car owners? Use a significance level $\alpha = 0.05$.

• Assumptions
  - Categorical response variable for two groups: here the groups are defined by the binary explanatory variable (own an American-made car, own a Japanese-made car); the categorical response variable is would purchase the same model again (SUCCESS) or would not (FAILURE).
  - Independent random samples (in a survey) or random assignment (in an experiment): in this example, two samples of car owners were randomly and independently selected.
  - The two sample sizes $n_1$ and $n_2$ are sufficiently large to ensure that the sampling distribution of $\hat{p}_1 - \hat{p}_2$ is approximately normal. Rule of thumb: the number of successes and the number of failures in both samples are at least 10. Here, the American-made group has 342 successes and 108 failures (both at least 10), and the Japanese-made group has 351 successes and 99 failures (both at least 10).
• Hypotheses: Let $p_1$ = population proportion of all owners of American-made cars who would purchase the same model again, and $p_2$ = population proportion of all owners of Japanese-made cars who would purchase the same model again.
  - $H_0: p_1 = p_2$, that is, $p_1 - p_2 = 0$
  - $H_a: p_1 \ne p_2$, that is, $p_1 - p_2 \ne 0$
• Test statistic
  - Let $\hat{p}_1$ be the random variable "sample proportion of a random sample of 450 owners of American-made cars who would purchase again."
  - Let $\hat{p}_2$ be the random variable "sample proportion of a random sample of 450 owners of Japanese-made cars who would purchase again."
  $$z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
    where $\hat{p}$ = proportion of owners who would purchase again, pooled across both groups.
• P-value
  - Suppose that $H_0$ is true. The $z$ test statistic then has an approximate standard normal distribution.
  - Calculated $\hat{p}_1$ =
342/450 = 0.76
  - Calculated $\hat{p}_2 = 351/450 = 0.78$
  - Calculated $\hat{p} = (342 + 351)/(450 + 450) = 693/900 = 0.77$
  - Calculated $z = \dfrac{0.76 - 0.78}{\sqrt{0.77(1 - 0.77)\left(\frac{1}{450} + \frac{1}{450}\right)}} = -0.71$
  - P-value $= P(z \le -0.71 \text{ or } z \ge 0.71) = 2(0.2389) = 0.4778$
• Conclusion (interpretation of the P-value, conclusion about the hypotheses, answer to the question posed)
  - Interpretation of P-value: Sampling from two populations with equal population proportions (null true) would result in a difference in sample proportions of 0.02, or more extreme, with probability of about 48%.
  - Conclusion about hypotheses: The P-value (about 0.48) is larger than our significance level of 0.05, so the null hypothesis is not rejected.
  - Answer to the question: No, there is not sufficient evidence of a difference in opinion between the American-made and Japanese-made car owners regarding purchasing the same model again.

9.1.2 Confidence Interval for the Difference Between Two Population Proportions

A. Large-sample confidence interval:
  $$(\hat{p}_1 - \hat{p}_2) \pm z\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$
   The $z$-score depends on the confidence level, such as 1.96 for 95% confidence.

B. Assumptions
  1. Independent random samples (or random assignment) for the two groups.
  2. Large enough sample sizes $n_1$ and $n_2$ so that there are at least 10 successes and 10 failures in each group.

C. Example (De Veaux, Velleman, Bock). There has been debate among doctors over whether surgery can prolong life among men suffering from prostate cancer, a type of cancer that typically develops and spreads very slowly. In the summer of 2003, The New England Journal of Medicine published results of some Scandinavian research. Men diagnosed with prostate cancer were randomly assigned to either undergo surgery or not. Among the 347 men who had surgery, 16 eventually died of prostate cancer, compared with 31 of the 348 men who did not have surgery. Construct a 95% confidence interval for the difference in rates of death for the two groups of men.
  - Binary explanatory variable: surgery or not.
  - Response variable: died from prostate cancer (success) or not (failure).
  - No-surgery group: $\hat{p}_1 = 31/348 = 0.089$
  - Surgery group: $\hat{p}_2 = 16/347 = 0.046$
  - The $z$-score for 95% confidence is 1.96.
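The interval for the prostate cancer example can be checked numerically. The following is a minimal sketch in Python that assumes the large-sample interval described in this section; the function name `two_proportion_ci` is illustrative and does not come from the notes or the textbook.

```python
import math


def two_proportion_ci(x1, n1, x2, n2, z=1.96):
    """Large-sample confidence interval for p1 - p2 (hypothetical helper).

    x1, x2 are success counts; n1, n2 are sample sizes;
    z is the z-score for the chosen confidence level (1.96 for 95%).
    """
    p1 = x1 / n1
    p2 = x2 / n2
    # Unpooled standard error of the difference in sample proportions.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se


# Group 1: no surgery (31 deaths among 348 men).
# Group 2: surgery (16 deaths among 347 men).
low, high = two_proportion_ci(31, 348, 16, 347)
print(f"95% CI for p1 - p2: ({low:.3f}, {high:.3f})")
```

Because the sketch works from the raw counts rather than the rounded proportions 0.089 and 0.046, its endpoints can differ from a hand calculation in later decimal places.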
  $$(0.089 - 0.046) \pm 1.96\sqrt{\frac{0.089(1 - 0.089)}{348} + \frac{0.046(1 - 0.046)}{347}} = 0.043 \pm 0.037$$
  $$0.006 < p_1 - p_2 < 0.080$$
  - Interpretation: We are 95% confident that, among men diagnosed with prostate cancer, the rate of death from prostate cancer is somewhere between 0.6 and 8 percentage points higher for those not having surgery as compared to those having surgery.
  - Assumptions:
    1. Subjects were randomly assigned to the two treatment groups.
    2. Sample sizes are large: 31 successes and 317 failures (both at least 10) in the no-surgery group, and 16 successes and 331 failures (both at least 10) in the surgery group.

9.2 Quantitative Response: How Can We Compare Two Means?

9.2.1 Significance Testing About Two Means

1. Example 1 (Source: Devore and Peck). To assess the impact of oral contraceptive use on bone mineral density (BMD), researchers in Canada carried out a study comparing BMD for women who had used oral contraceptives for at least 3 months to BMD for women who had never used oral contraceptives ("Oral Contraceptive Use and Bone Mineral Density in Premenopausal Women," Canadian Medical Association Journal (2001): 1023-1029). Data on BMD (in grams per centimeter), consistent with summary quantities given in the paper, appear in the following table (the actual sample sizes for the study were much larger).

  Never used oral contraceptives:  0.82  0.94  0.96  1.31  0.94  1.21  1.26  1.09  1.13  1.14
  Used oral contraceptives:        0.94  1.09  0.97  0.98  1.14  0.85  1.30  0.89  0.87  1.01

Is there sufficient evidence to conclude that women who use oral contraceptives have a lower mean BMD than women who have never used oral contraceptives? Use a significance level $\alpha = 0.05$.

• Assumptions
  - Independent random samples (in an observational study) or random assignment (in an experiment). In this example, assume the two groups of women represent independent random samples.
  - The two sample sizes $n_1$ and $n_2$ are both large (in general, 30 or larger), or the population distributions are normal. In this example the two sample sizes are small, so the population distributions of BMD for the two groups have to be normally distributed.
• Hypotheses: Let $\mu_1$ = population mean BMD for women who used oral
contraceptives, and $\mu_2$ = population mean BMD for women who never used oral contraceptives.
  - $H_0: \mu_1 = \mu_2$, that is, $\mu_1 - \mu_2 = 0$
  - $H_a: \mu_1 < \mu_2$, that is, $\mu_1 - \mu_2 < 0$
• Test statistic
  - Let $\bar{x}_1$ be the random variable "sample mean BMD of 10 women randomly selected from the population of women using oral contraceptives."
  - Let $\bar{x}_2$ be the random variable "sample mean BMD of 10 women randomly selected from the population of women who have never used oral contraceptives."
  $$t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
• P-value
  - Suppose that $H_0$ is true. The $t$ test statistic then has an approximate $t$ distribution with $df$ = smaller of $n_1 - 1$ and $n_2 - 1$.
  - Calculated $\bar{x}_1 = 1.00$, $\bar{x}_2 = 1.08$, $s_1 = 0.14$, $s_2 = 0.16$
  - Calculated $t = \dfrac{1.00 - 1.08}{\sqrt{\frac{0.14^2}{10} + \frac{0.16^2}{10}}} \approx -1.13$ (computed from the unrounded sample statistics)
  - $df$ = smaller of $10 - 1$ and $10 - 1$ = 9
  - P-value $= P(t \le -1.13) \approx 0.15$
• Conclusion (interpretation of the P-value, conclusion about the hypotheses, answer to the question posed)
  - Interpretation: There is about a 15% chance of obtaining a difference in sample means like 0.08, or more extreme, from sampling error if in fact the population means of BMD are the same (null true).
  - Conclusion about hypotheses: The P-value (about 0.15) is greater than our significance level of 0.05, so the null hypothesis is not rejected.
  - Answer to the question: No, the data do not provide sufficient evidence that women who use oral contraceptives have a lower mean BMD than women who have never used oral contraceptives.

9.2.2 Confidence Interval for the Difference Between Two Population Means

A. Confidence interval:
  $$(\bar{x}_1 - \bar{x}_2) \pm t\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
   The $t$ percentile depends on the confidence level.

B. Assumptions
  1. Independent random samples (for an observational study) or random assignment to groups (for an experiment).
  2. The two sample sizes $n_1$ and $n_2$ are both large (in general, 30 or larger), or the population distributions are normal.

C. Example (Peck, Olsen, Devore). Does talking elevate blood pressure, contributing to the tendency for blood pressure to be higher when measured in a doctor's office than when measured in a less stressful environment (called the white coat effect)? The article The Talking Effect and White Coat Effect in Hypertensive
Patients: Physical Effort or Emotional Content? (Behavioral Medicine (2001): 149-157) described a study in which patients with high blood pressure were randomly assigned to one of two groups. Those in the first group (the talking group) were asked questions about their medical history and about the sources of stress in their lives in the minutes before their blood pressure was measured. Those in the second group (the counting group) were asked to count aloud from 1 to 100 four times before their blood pressure was measured. The following data values for diastolic blood pressure in millimeters

Chapter 3: Association: Contingency, Correlation, and Regression

3.0 Introduction
• Goal of chapter: Examine the association, or relationship, between two variables, one usually called the response variable and the other the explanatory variable.
• Examples
  - Is expected grade in Math 220 associated with math grades earned in high school?
  - Is number of pairs of shoes owned associated with, or related to, gender?
• Definitions (page 90): The response variable is the outcome variable on which comparisons are made. The explanatory variable defines the groups to be compared with respect to values on the response variable.
• Definition (page 90): An association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable.

3.1 How Can We Explore the Association Between Two Categorical Variables?

A. Contingency Table (page 92): A display for two categorical variables. Its rows list the categories of one variable and its columns list the categories of the other variable. Each entry in the table is the frequency of cases in the sample with certain outcomes on the two variables.

• Expected Grade (DV) versus High School Math Grades (IV)

                 Expected Grade
  HSGRADE      A     B     C   Total
  A            5     2     0       7
  B           14     6     0      20
  C            1     2     2       5
  Total       20    10     2      32

• Political Views (DV) versus Gender (IV)

                        Political View
  Gender     Liberal  Moderate  Conservative  Total
  Female           8         6             1     15
  Male             4         7             5     16
  Total           12        13             6     31

• Drive Standard
Transmission (DV) versus Gender (IV)

                 Drive Standard
    Gender      No    Yes   Total
    Female       7     8     15
    Male         8     9     17
    Total       15    17     32

B. Conditional Proportions Tables: Proportions on DV for different IV categories

• Expected Grade (DV) versus High School Math Grades (IV)

  Contingency table again:

                 Expected Grade
    HSGRADE      A     B     C    Total
    A            5     2     0      7
    B           14     6     0     20
    C            1     2     2      5
    Total       20    10     2     32

  Conditional proportions table:

                 Expected Grade
    HSGRADE      A      B      C     Total    n
    A          0.71   0.29   0.00    1.00     7
    B          0.70   0.30   0.00    1.00    20
    C          0.20   0.40   0.40    1.00     5

• Political Views versus Gender

  Contingency table again:

                 Political View
    Gender    Liberal  Moderate  Conservative  Total
    Female       8        6           1          15
    Male         4        7           5          16
    Total       12       13           6          31

  Conditional proportions table:

                 Political View
    Gender    Liberal  Moderate  Conservative  Total    n
    Female     0.53     0.40        0.07       1.00    15
    Male       0.25     0.44        0.31       1.00    16

• Drive Standard Transmission versus Gender

  Contingency table again:

                 Drive Standard
    Gender      No    Yes   Total
    Female       7     8     15
    Male         8     9     17
    Total       15    17     32

  Conditional proportions table:

                 Drive Standard
    Gender      No     Yes    Total    n
    Female     0.47   0.53    1.00    15
    Male       0.47   0.53    1.00    17

C. Clustered Bar Chart: Graphical Representation of Relationship Between Two Categorical Variables

• Expected Grade in Math 220 versus High School Math Grades (conditional proportions table as in B)
  [Figure: clustered bar chart. Horizontal axis: Expected Grade in Math 220. Clusters: grade in HS math classes (A, B, C).]

• Political Views versus Gender (conditional proportions table as in B)
  [Figure: clustered bar chart. Horizontal axis: Political View (Liberal, Moderate, Conservative). Clusters: Gender (Female, Male).]

• Drive Standard Transmission versus Gender (conditional proportions table as in B)
  [Figure: clustered bar chart. Horizontal axis: Drive Standard Transmission (Yes, No). Clusters: Gender (Female, Male).]

3.2 How Can We Explore the Association between Two Quantitative Variables?

A. The Scatterplot

1. Definition (page 99): Graphical display for two quantitative
variables. It uses the horizontal axis for the explanatory variable x and the vertical axis for the response variable y. The values of x and y for a subject are represented by a point relative to the two axes. The observations for the n subjects are n points on the scatterplot.

2. Example: y = Dad's Age, x = Mom's Age
   [Figure: scatterplot. Horizontal axis: Mom's Age. Vertical axis: Dad's Age.]

3. Characteristics of Association in Scatterplots
  - Direction: Positive, Negative, None
  - Form: Linear, Curved, None
  - Strength
  - Outliers

4. Another example (Source: Peck, Olsen, Devore, page 690)
   x = treadmill time to exhaustion (in minutes)
   y = 20 km ski time (in minutes)
   Data from 11 US biathletes:

    x    7.7   8.4   8.7   9.0   9.6   9.6  10.0  10.2  10.4  11.0  11.7
    y   71.0  71.4  65.0  68.7  64.4  69.4  63.0  64.6  66.9  62.6  61.7

   [Figure: scatterplot. Horizontal axis: Time to Exhaustion on treadmill (mins). Vertical axis: 20 km ski time (mins).]

5. Another example (Source: Peck, Olsen, Devore, page 244)
   x = distance (meters) from shore of the measurement of river water velocity
   y = velocity of water at that distance (cm/second)
   n = 10

    x    0.5    1.5    2.5    3.5    4.5    5.5    6.5    7.5    8.5    9.5
    y  22.00  23.18  25.48  25.25  27.15  27.83  28.49  28.18  28.50  28.63

   [Figure: scatterplot. Horizontal axis: Distance from Bank (meters). Vertical axis: Velocity (cm/sec).]

B. The Correlation Coefficient (correlation as a measure of strength)

1. Correlation: The correlation summarizes the direction and strength of a straight-line association between two quantitative variables. Correlation is denoted by $r$.
  - $r$ takes on values between $-1$ and $+1$.
  - $r > 0$ implies a positive association and $r < 0$ implies a negative association.
  - The closer $r$ is to $\pm 1$, the closer the data points fall to a straight line and the stronger is the linear association. The closer $r$ is to 0, the weaker is the linear association. See page 104 in book.
  - The value of the correlation does not depend on the unit of measurement for either variable.
  - Two variables have the same correlation no matter which is treated as the response variable.

2. Formula: Correlation $r$
$= \dfrac{1}{n-1} \sum \left(\dfrac{x - \bar{x}}{s_x}\right)\left(\dfrac{y - \bar{y}}{s_y}\right)$

   We will use SPSS or your calculator, so start practicing (read your calculator manual).

3. Examples
  - Dad's Age versus Mom's Age
    [Figure: scatterplot. Horizontal axis: Mom's Age. Vertical axis: Dad's Age.]
    $r = 0.669$
  - Data on 50 states: y = percentage of individuals without health insurance, x = high school graduation rate
    [Figure: scatterplot. Horizontal axis: HSGradRate. Vertical axis: percentage without health insurance.]
    $r = -0.451$
  - Source: Peck, Olsen, and Devore, page 125: y = yield of a plot of grain (kg/hectare), x = time between flowering and harvesting (days)

    x     16    18    20    22    24    26    28    30
    y   2508  2518  3304  3423  3057  3190  3500  3883

    x     32    34    36    38    40    42    44    46
    y   3823  3646  3708  3333  3517  3214  3103  2776

    [Figure: scatterplot. Horizontal axis: days between flowering and harvesting. Vertical axis: yield.]
    $r = 0.274$

3.3 How Can We Predict the Outcome of a Variable?

A. Review of Equation of a Straight Line

• $y = mx + b$ (slope-intercept form of equation), where m = slope of line and b = y-intercept
• Example: $y = 3x + 2$
• Given two points on a line, find the slope and intercept, and thus the equation.

B. Lines in Statistics

• The y coordinate of a line in statistics is used to predict the value of some response variable y for a particular value of some explanatory variable x.

• Equation of line in statistics: $\hat{y} = a + bx$
  - $\hat{y}$ = predicted y
  - a = y-intercept
  - b = slope of line

• The line is called the regression line or least squares line.

• Example in book (page 110): $\hat{y} = 61.4 + 2.4x$
  - $\hat{y}$ = predicted height (cm) of an individual whose partial remains are found at a burial site
  - x represents femur (thigh bone) length (cm)
  - 2.4 = slope of line; interpretation:
  - 61.4 = y-intercept of line; interpretation:

• Example of Prediction: If a femur bone is found and x = 40 cm (about 16 inches), then the predicted height of the person is $\hat{y} = 61.4 + 2.4(40) = 157.4$ cm, or about 62 inches. Note: The point (40, 157.4) is a point on the line.

• Using Data to Find the Slope and y-intercept of a Regression Line
  SPSS commands: ANALYZE > REGRESSION > LINEAR
  - Move response variable to dependent variable box.
  - Move explanatory variable to
independent variable box.
  - Click on SAVE button and check unstandardized predicted values and unstandardized residuals to save these values.
  - Click on CONTINUE.
  - Click on OK.

• Example of Output from SPSS: Treadmill / Ski Race Time

    x    7.7   8.4   8.7   9.0   9.6   9.6  10.0  10.2  10.4  11.0  11.7
    y   71.0  71.4  65.0  68.7  64.4  69.4  63.0  64.6  66.9  62.6  61.7

  [Figure: scatterplot. Horizontal axis: Time to Exhaustion on treadmill (mins). Vertical axis: 20 km ski time (mins).]

  Coefficients table (dependent variable: skitime) gives $\hat{y} = 88.796 - 2.334x$

  This equation (line) is called the regression or least squares equation (line).

  [Figure: scatterplot with fitted regression line. Horizontal axis: treadmilltime. Vertical axis: ski time.]

• Residuals Measure Size of Prediction Errors
  Residual (page 115): The prediction error for an observation, which is the difference $y - \hat{y}$ between the actual value $y$ in the data set and the predicted value $\hat{y}$ of the response variable, is called a residual.

  Example: Treadmill / Ski Run Time, with $\hat{y} = 88.796 - 2.334x$
  Suppose the observation is (x, y) = (7.7, 71.0).
  For x = 7.7, $\hat{y} = 88.796 - 2.334(7.7) = 70.8$
  Prediction error (residual) = $y - \hat{y} = 71.0 - 70.8 = 0.2$

  Prediction errors (residuals) for all n = 11 observations:

    Treadmill Time   Ski Time   Predicted y   Residual
         7.7           71.0        70.8          0.2
         8.4           71.4        69.2          2.2
         8.7           65.0        68.5         -3.5
         9.0           68.7        67.8          0.9
         9.6           64.4        66.4         -2.0
         9.6           69.4        66.4          3.0
        10.0           63.0        65.5         -2.5
        10.2           64.6        65.0         -0.4
        10.4           66.9        64.5          2.4
        11.0           62.6        63.1         -0.5
        11.7           61.7        61.5          0.2

  [Figure: graphical interpretation of prediction error (residual).]

• Summary Measure to Evaluate Predictions with a Line
  - Residual sum of squares = $\sum (\text{residual})^2 = \sum (y - \hat{y})^2$
  - The sum of squares for our treadmill / ski time line is $(0.2)^2 + (2.2)^2 + (-3.5)^2 + \cdots + (0.2)^2 = 43.4$
  - The residual sum of squares can be obtained in SPSS output:

    ANOVA
    Model 1       Sum of Squares   df   Mean Square      F
    Regression        74.630        1      74.630      15.585
    Residual          43.097        9       4.789
    Total            117.727       10
    a. Predictors: (Constant), treadmilltime
    b. Dependent Variable: skitime

• Least Squares Method for Finding a Line: Among all possible lines $\hat{y} = a + bx$ that can go through data points in a scatterplot, the
regression line results from the least squares method. This gives the line that has the smallest value for the residual sum of squares of all possible lines. Any other line through the treadmill / ski time scatterplot will result in a residual sum of squares larger than ours, which is 43.4.

• Estimating the error in a prediction of y for a new value of x
  - Use the "average" of prediction errors from the observed y's, called the standard error of the estimate in SPSS.
  - Standard Error of the Estimate (Prediction) Formula: $s = \sqrt{\dfrac{\sum (y - \hat{y})^2}{n - 2}}$
  - Treadmill / Ski Time Example: SE Estimate = 2.2 minutes
  - Model Summary table from SPSS:

    Model Summary
    Model     R     R Square   Adjusted R Square   Std. Error of the Estimate
      1     .796a     .634           .593                  2.1883
    a. Predictors: (Constant), treadmilltime

• Formulas for slope and intercept
  $\hat{y} = a + bx$, where $a = \bar{y} - b\bar{x}$ and $b = r \dfrac{s_y}{s_x}$

• Treadmill / Ski Time Data: $\bar{y} = 66.24$, $s_y = 3.43$, $\bar{x} = 9.66$, $s_x = 1.17$

3.4 What are Some Cautions in Analyzing Associations?

A. Extrapolation is Dangerous

• Extrapolation refers to using a regression line to predict y values for x values outside the observed range of data.
• Example: see the example on page 136 in book.

B. Be cautious of influential outliers

Regression Outlier
• A regression outlier is an observation (x, y) that is well removed from the trend that the rest of the data follows.
• Example:

Influential Observation
• An influential observation is an observation that has a large effect on the results of a regression analysis or correlation analysis.
• An observation is influential if
  - the observation is a regression outlier
  - its x value is relatively low or high compared to the rest of the data
  - See web site http://illuminations.nctm.org/LessonDetail.aspx?ID=L456

C. Correlation Does Not Imply Causation

• Example
  - Study to investigate the relationship between smoking during pregnancy and IQ of child at 4 years of age
  - Subjects: Pregnant women
  - Measured: x = average number of cigarettes smoked per day during pregnancy, y = IQ score of child at 4 years of age
  - Found a moderately strong negative correlation
  - Causal connection between x
and y?

• Lurking Variable (page 131 in book): A variable, usually unobserved, that influences the association between the variables of primary interest.

• Example: Sample of students from grades 1-12. A positive association was found between height of students and vocabulary skills (r = 0.81).

D. Simpson's Paradox

• Controlling for the effects of a third variable z means examining the association between x and y at fixed values of the third variable.
  - Example: Break pregnant mothers into groups depending upon the amount of alcohol drunk (z). Examine the relationship between x and y within each group.
  - After controlling, we may get the same association as before, a reversal of the association, or no association.

• Simpson's Paradox: a reversal in the direction of an association between x and y when controlling for the effects of a third variable.

• Examples of Simpson's Paradox
  - Two quantitative variables: crime rate versus education example (pages 129 and 130)
  - Two categorical variables: smoking status and survival example (page 132)
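
The regression computations of Section 3.3 can also be checked outside SPSS. What follows is a minimal Python sketch, not part of the original notes, that applies the formulas b = r(s_y / s_x) and a = y-bar - b(x-bar) to the treadmill / ski time data and then computes the residual sum of squares and the standard error of the estimate. The function name least_squares and the dictionary keys are illustrative choices only, and the standard error is assumed to use the divisor n - 2, which agrees with the residual df of 9 in the SPSS ANOVA table.

```python
import math
from statistics import mean, stdev

# Data from the notes: 11 US biathletes.
# x = treadmill time to exhaustion (minutes), y = 20 km ski time (minutes).
x = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
y = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]


def least_squares(x, y):
    """Hypothetical helper: fit y-hat = a + b*x using the notes' formulas."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (divisor n - 1)

    # Correlation: sum of products of z-scores, divided by n - 1.
    r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
            for xi, yi in zip(x, y)) / (n - 1)

    b = r * s_y / s_x      # slope
    a = y_bar - b * x_bar  # y-intercept

    # Residual = actual y minus predicted y, for each observation.
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)  # residual sum of squares

    # Assumption: standard error of the estimate uses the divisor n - 2.
    se = math.sqrt(sse / (n - 2))

    return {"r": r, "slope": b, "intercept": a, "sse": sse, "se": se}


fit = least_squares(x, y)
for name, value in fit.items():
    print(f"{name}: {value:.4f}")
```

The printed slope, intercept, residual sum of squares, and standard error should agree with the SPSS output in Section 3.3 up to rounding. The hand total of 43.4 in the notes is slightly larger than the SPSS value of 43.097 because it is built from residuals already rounded to one decimal place.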