# Introduction to Probability and Statistics MATH 120

Cal State Fullerton

These 167 pages of class notes were uploaded by Gunner Price III on Wednesday, September 30, 2015. The notes belong to MATH 120 at California State University - Fullerton, taught by Staff in Fall. Since their upload, they have received 15 views. For similar materials see /class/217028/math-120-california-state-university-fullerton in Mathematics (M) at California State University - Fullerton.

Date Created: 09/30/15

## Chapter 6: Relationships Between Categorical Variables

#### Principal Question

Is there a relationship between the two variables, so that the category into which individuals fall for one variable seems to depend on the category they are in for the other variable?

### 6.1 Displaying Relationships Between Categorical Variables

- Data are displayed in a contingency (or two-way) table.
- If one variable is explanatory, use it to define the rows of the table.
- Two types of conditional percents: row percents and column percents.
- Use row percents if the explanatory variable is the row variable.

#### Example 6.1 Smoking and Divorce Risk

Data on smoking habits and divorce history for the 1669 respondents who had ever been married (General Social Survey).

| | Ever divorced | Never divorced | Total |
|---|---|---|---|
| Smoker | 238 (49%) | 247 (51%) | 485 |
| Nonsmoker | 374 (32%) | 810 (68%) | 1184 |
| Total | 612 | 1057 | 1669 |

Among smokers, 49% have been divorced and 51% have not. Among nonsmokers, only 32% have been divorced and 68% have not. The difference between the row percents indicates a relationship.

#### Example 6.2 Tattoos and Ear Pierces

Responses from n = 565 men to two questions:

1. Do you have a tattoo?
2. How many total ear pierces do you have?

- Among men with no ear pierces, 43/424 = 10% have a tattoo.
- Among men with one ear pierce, 16/70 = 23% have a tattoo.
- Among men with two or more ear pierces, 26/71 = 37% have a tattoo.

The percentage with a tattoo increases as the number of ear pierces increases, which indicates a relationship. One could also examine column percents or overall percents.

#### Example 6.3 Gender and Reasons for Taking Care of Your Body

1997 poll, random-digit dialing of 1212 southern California residents. Question: "What is the most important reason why you try to take care of your body? Is it mostly to be attractive to others, mostly to keep healthy, or mostly to help your self-confidence, or what?"

[Table: reasons for taking care of body, by gender.]

The percent distribution of responses is shown for men and women. The pattern of responses is very similar, so the response does not seem to be related to gender.

### 6.2 Risk, Relative Risk, Odds Ratio, and Increased Risk

Risk = (Number in category) / (Total number in group)

Example: Within a group of 200 individuals, asthma affects 24 people. In this group the risk of asthma is 24/200 = 0.12, or 12%.

Relative Risk = (Risk in category 1) / (Risk in category 2)

- Example baseline group: those who don't drive under the influence.
- Relative risk = 1 → the two risks are the same.
- The risk in the denominator is often the baseline risk.

#### Example 6.4 Smoking and Divorce Risk (cont.)

- For smokers, risk of divorce = 238/485 = 0.491, or 49.1%.
- For nonsmokers, risk of divorce = 374/1184 = 0.316, or 31.6%.
- Relative risk of divorce = 49%/32% = 1.53.

In this sample, the risk of divorce for smokers is 1.53 times the risk of divorce for nonsmokers.

[Figure: Venn diagram of smokers, nonsmokers, and divorce.]

#### Percent Increase in Risk

Percent increase in risk = (Difference in risks / Baseline risk) × 100% = (relative risk − 1) × 100%

When the risk is smaller than the baseline risk, relative risk < 1 and the percent increase will actually be negative, so we say percent decrease in risk.

#### Example 6.5 Smoking and Divorce Risk (cont.)

- Relative risk of divorce for smokers = 1.53.
- Percent increase in risk of divorce for smokers = (1.53 − 1) × 100% = 53%.
- Equivalently, (Difference in risk / Baseline risk) × 100% = ((49% − 32%) / 32%) × 100% = 53%.

The risk of divorce is 53% higher for smokers than it is for nonsmokers.

#### Odds and Odds Ratio

Odds = Number in category 1 to Number in category 2, or (Number in category 1 / Number in category 2) to 1

Odds Ratio = (Odds for group 1) / (Odds for group 2)

Example:

- Odds of getting a divorce to not getting a divorce for smokers are 238 to 247, or 0.96 to 1.
- Odds of getting a divorce to not getting a divorce for nonsmokers are 374 to 810, or 0.46 to 1.
- Odds Ratio = 0.96/0.46 = 2.1 → the odds of divorce for smokers are about double the odds of divorce for nonsmokers.

#### Misleading Statistics About Risk

Questions to ask:

- What are the actual risks? What is the baseline risk?
- What is the population for which the reported risk or relative risk applies?
- What is the time period for this risk?

#### Example 6.7 Disaster in the Skies? Case Study 1.2 Revisited

"Errors by air traffic controllers climbed from 746 in fiscal 1997 to 878 in fiscal 1998, an 18% increase." (USA Today)

Look at the risk of controller error per flight:

- In 1998: 5.5 errors per million flights.
- In 1997: 4.8 errors per million flights.

The risk of error increased, but the actual risk is very small.

#### Example 6.8 Dietary Fat and Breast Cancer

"Italian scientists report that a diet rich in animal protein and fat (cheeseburgers, french fries, and ice cream, for example) increases a woman's risk of breast cancer threefold." (Prevention Magazine's Giant Book of Health Facts, 1991)

Two reasons the information is useless:

1. We don't know how the data were collected nor what population the women represent.
2. We don't know the ages of the women studied, so we don't know the baseline rate.

Age is a critical factor. Accumulated lifetime risk of a woman developing breast cancer by certain ages:

- By age 50: 1 in 50
- By age 60: 1 in 23
- By age 85: 1 in 9

The annual risk is 1 in 3700 for women in their early 30s. If the Italian study was on very young women, the threefold increase in risk represents a small increase.

### 6.3 The Effect of a Third Variable and Simpson's Paradox

#### Example 6.9 Educational Status and Driving after Substance Use

1996 nationwide survey of 11,847 individuals aged 16 or over. The response was driving status, with 3 categories:

- Unimpaired: never drove while impaired.
- Alcohol: drove within 2 hours of alcohol use, but never after drug use.
- Drug: drove within 2 hours of drug use and possibly after alcohol use.

[Table: driving status by educational status.]

As the amount of education increases, the proportion who drove within 2 hours of alcohol use also increases.

#### Example 6.10 Blood Pressure and Oral Contraceptive Use

Hypothetical data on 2400 women. Recorded oral contraceptive use and whether each woman had high blood pressure.

[Table: percent with high blood pressure for users and nonusers of oral contraceptives.]

The percent with high blood pressure is about the same among oral contraceptive users and nonusers.
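To make the role of a third variable concrete, here is a minimal Python sketch of how a pooled comparison can hide a pattern that holds inside every subgroup (Simpson's paradox). The counts are assumed purely for illustration (they total 2400 women but are not recovered from the table in these notes), and the names `counts`, `percent`, and `pooled` are hypothetical helpers introduced only for this sketch.

```python
# Hypothetical counts, assumed for illustration only (not recovered from the notes).
# Each entry is (number with high blood pressure, number of women in the group).
counts = {
    "18-34": {"users": (36, 600), "nonusers": (16, 400)},
    "35-49": {"users": (28, 200), "nonusers": (120, 1200)},
}


def percent(high, total):
    """Percentage of a group with high blood pressure."""
    return 100.0 * high / total


def pooled(group):
    """Add up the counts for one group across all age bands."""
    high = sum(by_group[group][0] for by_group in counts.values())
    total = sum(by_group[group][1] for by_group in counts.values())
    return high, total


# Within each age band, compare users with nonusers.
for age, by_group in counts.items():
    users = percent(*by_group["users"])
    nonusers = percent(*by_group["nonusers"])
    print(f"age {age}: users {users:.1f}%, nonusers {nonusers:.1f}%")

# Ignoring age, compare the same two groups again.
print(f"all ages: users {percent(*pooled('users')):.1f}%, "
      f"nonusers {percent(*pooled('nonusers')):.1f}%")
```

With these assumed counts, users have the higher percentage inside each age band (6.0% versus 4.0%, and 14.0% versus 10.0%), yet the pooled percentages are nearly equal and slightly reversed (8.0% versus 8.5%). The reversal comes entirely from the group sizes: most users fall in the younger band, where high blood pressure is less common.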
#### Example 6.10 Blood Pressure and Oral Contraceptive Use (cont.)

Blood pressure increases with age, and users tend to be younger.

[Table: controlling for the effect of age.]

In each age group, the percentage with high blood pressure is higher for users than for nonusers → Simpson's Paradox.

Many factors affect blood pressure. If users and nonusers differ with respect to such a factor, the factor confounds the results.

### 6.4 Assessing the Statistical Significance of a 2x2 Table

Question: Can a relationship observed in the sample data be inferred to hold in the population represented by the data?

A statistically significant relationship or difference is one that is large enough to be unlikely to have occurred in the observed sample if there is no relationship or difference in the population.

#### Five Steps to Determining Statistical Significance

1. Determine the null and alternative hypotheses.
2. Verify necessary data conditions and, if met, summarize the data into an appropriate test statistic.
3. Assuming the null hypothesis is true, find the p-value.
4. Decide whether or not the result is statistically significant based on the p-value.
5. Report the conclusion in the context of the situation.

#### Step 1: Null and Alternative Hypotheses

- Null hypothesis: The two variables are not related.
- Alternative hypothesis: The two variables are related.

#### Example: Popular Kids

478 students in grades 4-6 from 3 school districts in Michigan were given a questionnaire to determine which of the following is important to them:

- Good grades
- Athletic ability
- Popularity

Among other things, the questionnaire also asks for gender, grade level, race, and other demographic information. See data set.

Is there a relationship between Gender and Goals?

- H0: Gender and Goals are independent.
- Ha: Gender and Goals are dependent.

Minitab: Stat > Tables > Cross Tabulation and Chi-Square

Inspecting each row of the table, there seem to be differences. The question is whether these differences are statistically significant.

Tabulated statistics: Gender, Goals (rows: Gender; columns: Goals). Cell contents: count.

| | Grades | Popular | Sports | All |
|---|---|---|---|---|
| boy | 117 | 50 | 60 | 227 |
| girl | 130 | 91 | 30 | 251 |
| All | 247 | 141 | 90 | 478 |

#### Step 2: The Chi-square Statistic

The chi-square statistic measures the difference between the observed counts and the counts that would be expected if there were no relationship. Large difference → evidence of a relationship.

1. Compute the expected count for each cell: Expected count = (Row total × Column total) / (Total n for table).
2. Compute for each cell: (Observed count − Expected count)² / Expected count.
3. Compute the test statistic by totaling over all cells:

$$\chi^2 = \sum \frac{(\text{Observed count} - \text{Expected count})^2}{\text{Expected count}}$$

#### Example: Popular Kids (cont.)

Expected counts. Example calculation: 227 × 247 / 478 = 117.3.

| | Grades | Popular | Sports | All |
|---|---|---|---|---|
| boy | 117.3 | 67.0 | 42.7 | 227.0 |
| girl | 129.7 | 74.0 | 47.3 | 251.0 |
| All | 247.0 | 141.0 | 90.0 | 478.0 |

Residuals (observed − expected). Example calculation: 117 − 117.299 = −0.299.

| | Grades | Popular | Sports |
|---|---|---|---|
| boy | −0.299 | −16.960 | 17.259 |
| girl | 0.299 | 16.960 | −17.259 |

Contributions to chi-square. Example calculation: (−0.299)² / 117.3 = 0.00076.

| | Grades | Popular | Sports |
|---|---|---|---|
| boy | 0.0008 | 4.2958 | 6.9697 |
| girl | 0.0007 | 3.8851 | 6.3032 |

Pearson Chi-Square = 21.455, DF = 2, P-Value = 0.000

Conclude that there is a statistically significant relationship between goals and gender.

#### Step 3: The p-value of the Chi-square Test

Large test statistic → evidence of a relationship. So how large is enough to declare significance?

Q: If there is actually no relationship in the population, what is the likelihood that the chi-square statistic could be as large as it is or larger?

A: The p-value.

Note: The p-value is generally reported in computer output.

#### Steps 4 and 5: Making and Reporting a Decision

Large test statistic → small p-value → evidence that a real relationship exists in the population.

Common rule:

- p-value ≤ 0.05 → say the relationship is statistically significant, and we reject the null hypothesis.
- p-value > 0.05 → cannot say the relationship is statistically significant, and we cannot reject the null hypothesis.

Note: For 2x2 tables, a test statistic of 3.84 or larger is significant.

#### Example 6.10 Randomly Pick S or Q

- Of 92 college students asked "Randomly choose one of the letters S or Q", 66% (61/92) picked S.
- Of another 98 students asked "Randomly choose one of the letters Q or S", 46% (45/98) picked S.

[Minitab output: chi-square test for letter order and response.]

Can we conclude that the order of the letters and the response are related? The relationship is reported as statistically significant.

#### Factors that Affect Statistical Significance

1. The strength of the observed relationship.

   Example 6.10: Of those given "S or Q", 66% picked S. Of those given "Q or S", 46% picked S. The difference in percentages (66% versus 46%) reflects the strength of the observed relationship.

2. How many people were studied.

   Example: Treatment A had 8 of 10 patients improve and Treatment B had 5 of 10 patients improve. Strength: 80% − 50% = 30% seems large, but the study is too small. The p-value is 0.15.

   Treatment A had 80 of 100 patients improve and Treatment B had 50 of 100 patients improve. Strength: 80% − 50% = 30% is again large, and with the larger sample the p-value is close to 0.

#### Practical versus Statistical Significance

Statistical significance does not mean the relationship is of practical importance.

Example 6.12 Aspirin and Heart Attacks: the relationship is statistically significant.

- Placebo: 189/11,034 = 1.71% had an attack.
- Aspirin: 104/11,037 = 0.94% had an attack.

The difference is only 1.71% − 0.94% = 0.77%, yet with such a large sample it is statistically significant.

#### Interpreting a Nonsignificant Result

The sample results are not strong enough to safely conclude that there is a relationship in the population. The observed relationship could have resulted by chance even if there is no relationship in the population. This is not the same as saying there is no
relationship.

#### Case Study 6.2 Drinking, Driving, and the Supreme Court

"Random Roadside Survey" of drivers under 20 years of age.

[Table: results of the roadside survey, by gender.]

p-value = 0.201 → the observed association could easily have occurred even if there is no relationship in the population.

This result was used by the Supreme Court to overturn a law that allowed the sale of beer to females but not males.

## Chapter 3: Sampling: Surveys and How to Ask Questions

#### Principle Idea

The knowledge of how the data were generated is one of the key ingredients for translating data intelligently.

### 3.1 Description or Decision? Using Data Wisely

- Descriptive Statistics: using numerical and graphical summaries to characterize a data set.
- Inferential Statistics: using sample information to make conclusions about a broader range of individuals than just those observed.

#### The Fundamental Rule for Using Data for Inference

Available data can be used to make inferences about a much larger group if the data can be considered to be representative with regard to the questions of interest.

#### Example 3.1 Do First Ladies Represent Other Women?

Past First Ladies are not likely to be representative of other American women, nor even future First Ladies, on the question of age at death, since medical, social, and political conditions keep changing in ways that may affect their health.

#### Example 3.2 Do Penn State Students Represent Other College Students?

- If the question of interest is the average handspan of females in the college age range → Yes.
- If the question of interest is how fast they have ever driven a car → No, since Penn State is in a rural area with open spaces, county roads, and little traffic.

Does our in-class survey represent the whole of CSUF, or all college students, or all persons of the same age? This will depend on the questions asked. Let's look at the survey questions.

#### Populations, Samples, and Simple Random Samples

- Population: the larger group of units about which inferences are to be made.
- Sample: the smaller group of units actually measured.
- Simple Random
Sample: every conceivable group of units of the required size from the population has the same chance to be the selected sample. This helps ensure that the sample data will be representative of the population, but it can be difficult to obtain.

### 3.2 The Beauty of Sampling

Sample Survey: a subgroup of a large population questioned on a set of topics.

- A special type of observational study.
- Less costly and takes less time than a census.
- With proper methods, a sample of 1500 can almost certainly gauge the percentage in the entire population who have a certain trait or opinion to within 3%.

#### The Margin of Error: The Accuracy of Sample Surveys

The sample proportion and the population proportion with a certain trait or opinion differ by less than the margin of error in at least 95% of all random samples.

Conservative margin of error = $\frac{1}{\sqrt{n}} \times 100\%$

Add and subtract the margin of error to create an approximate 95% confidence interval.

#### Example 4.1 The Importance of Religion for Adult Americans

Poll of n = 1003 adult Americans: "How important would you say religion is in your own life?"

- Very important: 65%
- Fairly important: 23%
- Not very important: 12%
- No opinion: 0%

The conservative margin of error is 3%: $\frac{1}{\sqrt{1003}} \approx 0.03$.

Approximate 95% confidence interval for the percent of all adult Americans who say religion is very important: 65% ± 3%, or 62% to 68%.

#### Interpreting the Confidence Interval

The interval 62% to 68% may or may not capture the percent of adult Americans who considered religion to be very important in their lives. But in the long run this procedure will produce intervals that capture the unknown population values about 95% of the time → called the 95% confidence level.

#### Choosing Sample Size for a Survey

The sample size mainly depends on the accuracy (margin of error) that one requires for estimation. For example, when estimating a proportion one may require a particular margin of error. To obtain the required sample size for a given accuracy, we use the formula for the margin of error and solve for n.

Example: The sample size required to obtain an accuracy of ±3% is
computed as follows:

$$0.03 = \frac{1}{\sqrt{n}} \;\Rightarrow\; \sqrt{n} = \frac{1}{0.03} \;\Rightarrow\; n = \left(\frac{1}{0.03}\right)^2 \approx 1111.1, \text{ rounded up to } n = 1112$$

#### The Effect of Population Size

It turns out that the population size has almost no effect on the accuracy of the survey and the required sample size. The formula for the sample size that we used assumes that the population size is infinite.

### 3.3 Simple Random Sampling and Randomization

- Probability Sampling Plan: everyone in the population has a specified chance of making it into the sample.
- Simple Random Sample: every conceivable group of units of the required size has the same chance of being the selected sample.

#### Choosing a Simple Random Sample

You need:

1. A list of the units in the population.
2. A source of random numbers.

Portion of a Table of Random Digits:

| Row | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| 0 | 00157 | 37071 | 79553 | 31062 | 42411 | 79371 | 25506 | 69135 |
| 1 | 38354 | 03533 | 95514 | 03091 | 75324 | 40182 | 17302 | 64224 |
| 2 | 59785 | 46030 | 63753 | 53067 | 79710 | 52555 | 72307 | 10223 |
| 3 | 27475 | 10484 | 24616 | 13466 | 41618 | 08551 | 18314 | 57700 |
| 4 | 28966 | 35427 | 09495 | 11567 | 56534 | 60365 | 02736 | 32700 |

#### Simple Random Sample of Students

Class of 270 students. We want a simple random sample of 10 students.

1. Number the units: students numbered 001 to 270.
2. Choose a starting point: Row 3, 2nd column (10484).
3. Read off consecutive numbers (3-digit labels here): 104, 842, 461, 613, 466, 416, 180, 855, 118, 314, 577, 002, 896, ...
4. If a number corresponds to a label, select that unit. If not, skip it. Continue until the desired sample size is obtained.
5. Step 4 is very inefficient. We can give each unit in the population multiple labels, e.g. use 001 to 270, then 301 to 570, and 601 to 870, so the second 3-digit number of 842 would correspond to
the unit with label 842 − 600 = 242.

- Using the method in Step 4, the selected units would be 104, 180, 118, 002, etc.
- Using the method in Step 5, the selected units would be found more efficiently as 104, 242, 161, 013, 166, 116, 180, 255, 118, 014.

#### Example 4.4 Representing the Heights of British Women

Simple random sample of 10 from 199 British women.

1. Assign an ID number from 001 to 199 to each woman.
2. Use random digits to randomly select ten numbers between 001 and 199, and sample the heights of the women with those IDs.

Sample 1, using the statistical package Minitab:

- IDs: 176, 10, 1, 40, 85, 162, 46, 69, 77, 154
- Heights: 60.6, 63.4, 62.6, 65.7, 69.3, 68.7, 61.8, 64.6, 60.8, 59.9
- Mean: 63.7 inches

Sample 2, using the table (Row 5, Col. 3, multiple-labels approach):

- IDs: 41, 93, 167, 33, 157, 131, 110, 180, 185, 196
- Heights: 59.4, 66.5, 63.8, 62.6, 65.0, 60.2, 67.3, 59.8, 67.7, 61.8
- Mean: 63.4 inches

### 3.4 Other Sampling Methods

It is not always practical to take a simple random sample, because it can be difficult to get a numbered list of all units.

Example: College administration would like to survey a sample of students living in dormitories.

[Figure: a simple random sample of 30 rooms.]

#### Stratified Random Sampling

Divide the population of units into groups (called strata) and take a simple random sample from each of the strata.

College survey: Two strata, undergrad and graduate dorms. Take a simple random sample of 15 rooms from each of the strata, for a total of 30 rooms.

Ideal: stratify so that there is little variability in responses within each of the strata.

#### Cluster Sampling

Divide the population of units into groups (called clusters), take a random sample of clusters, and measure only those items in these clusters.

College survey: Each floor of each dorm is a cluster. Take a random sample of 5 floors, and all rooms on those floors are surveyed.

Advantage: need only a list of the clusters instead of a list of all individuals.

#### Systematic Sampling

Order the population of units in some way, select one of the first k units at random, and then every kth unit thereafter.

College survey: Order the list of rooms starting at the top floor of the 1st undergrad
dorm. Pick one of the first 11 rooms at random → room 3, then pick every 11th room after that.

Note: often a good alternative to random sampling, but it can lead to a biased sample.

#### Systematic Sampling (cont.)

Say that the goal is to determine how much a shopper spends on average on groceries each day. Suppose that we have access to the records of a grocery chain indicating how much each shopper spends. In order not to examine all records, we select a systematic sample starting at a random day in January and then picking the records every 7 days from that point on. For each day that is picked, all the records are examined (clustering).

Will this lead to an OK result?

#### Random-Digit Dialing

The method approximates a simple random sample of all households in the United States that have telephones.

1. List all possible exchanges (area code + next 3 digits).
2. Take a sample of exchanges (chance of being sampled based on the white pages proportion of households with a specific exchange).
3. Take a random sample of banks (next 2 digits) within each sampled exchange.
4. Randomly generate the last two digits from 00 to 99.

Once a phone number is determined, make multiple attempts to reach someone at that household.

#### Multistage Sampling

Using a combination of the sampling methods at various stages.

Example:

- Stratify the population by region of the country.
- For each region, stratify by urban, suburban, and rural, and take a random sample of communities within those strata.
- Divide the selected communities into city blocks as clusters, and sample some blocks.
- Everyone on the block or within the fixed area may then be sampled.

#### Example 4.7 The Nationwide Personal Transportation Survey

The Nationwide Personal Transportation Survey is taken every 5 years by the US Department of Transportation.

- 1995 survey: 21,000 households.
- Interviews conducted by telephone using a computer-assisted telephone interviewing (CATI) system.

Multistage sample:

- US households were stratified by region of country, size of metropolitan area, and whether there is a subway system.
- Households were then selected
by random-digit dialing.
- Everyone in a selected household was included → each household was a cluster.

#### Example 4.8 A Los Angeles Times National Poll

"[...] half of Americans polled said they view Jan. 1, 2000, as just another New Year's Day. [...] About one in 10 report that they are stockpiling goods [...]" (Los Angeles Times)

Times Poll:

- 1249 adults nationwide, by telephone.
- Over a two-day period in February 1999.
- Telephone numbers chosen from all exchanges in the nation.
- Random-digit dialing techniques used so that listed and non-listed numbers could be contacted.

### 3.5 Difficulties and Disasters in Sampling

Some problems occur even when a sampling plan has been well designed.

- Using the wrong sampling frame
- Not reaching the individuals selected
- Self-selected sample
- Convenience/haphazard sample

#### Using the Wrong Sampling Frame

The sampling frame is the list of units from which the sample is selected. This list may or may not be the same as the list of all units in the desired target population.

Example: using a telephone directory to survey the general population excludes those who move often, those with unlisted home numbers, and those who cannot afford a telephone.

Solution: use random-digit dialing.

[Figure: the target population compared with the sampling frame of white pages listings.]

#### Not Reaching the Individuals Selected

Failing to contact or measure the individuals who were selected in the sampling plan leads to nonresponse bias.

- Telephone surveys tend to reach more women.
- Some people are rarely home.
- Others screen calls or may refuse to answer.
- Quickie polls: almost impossible to get a random sample in one night.

#### Nonresponse or Volunteer Response

"In 1993 the GSS (General Social Survey) achieved its highest response rate ever, 82.4%. This is five percentage points higher than our average over the last four years." (GSS News, Sept. 1993)

- The lower the response rate, the less the results can be generalized to the population as a whole.
- Response to a survey is voluntary. Those who respond are likely to have
stronger opinions than those who don't.
- Surveys often use reminders and follow-up calls to decrease the nonresponse rate.

#### Example 4.9 Which Scientists Trashed the Public?

"82% of scientists trashed the media, agreeing with the statement 'The media do not understand statistics well enough to explain new findings.'" (Science, Mervis, 1998)

Science Poll:

- 1400 professionals in science and in journalism.
- Only 34% response rate among scientists.
- Typical respondent was a white, male physical scientist over the age of 50 doing basic research.

Respondents represent a narrow subset of scientists → inappropriate to generalize to all scientists.

#### Disasters in Sampling

Responses from a self-selected group, convenience sample, or haphazard sample are rarely representative of any larger group.

#### Example 4.10 A Meaningless Poll

"Do you support the President's economic plan?"

Results from a TV quickie poll and a proper study:

| | Television Poll | Survey |
|---|---|---|
| Yes (support plan) | 42% | 75% |
| No (don't support plan) | 58% | 18% |
| Not sure | 0% | 7% |

Those dissatisfied were more likely to respond to the TV poll, and it did not give the "not sure" option.

#### Case Study 3.1 The Infamous Literary Digest Poll of 1936

Election of 1936: Democratic incumbent Franklin D. Roosevelt and Republican Alf Landon.

Literary Digest Poll:

- Sent questionnaires to 10 million people from magazine subscriber lists, phone directories, and car owners, who were more likely wealthy and unhappy with Roosevelt.
- Only 2.3 million responses, for a 23% response rate.
- Those with strong feelings, the Landon supporters wanting a change, were more likely to respond.
- Incorrectly predicted a 3-to-2 victory for Landon.

Gallup Poll:

- George Gallup had just founded the American Institute of Public Opinion in 1935.
- Surveyed a random sample of 50,000 people from a list of registered voters.
- Also took a random sample of 3000 people from the Digest lists.
- Correctly predicted Roosevelt the winner.
- Also predicted the wrong
results of the Literary Digest poll within 1%.

### 3.6 How to Ask Survey Questions

#### Possible Sources of Response Bias in Surveys

- Deliberate bias: The wording of a question can deliberately bias the responses toward a desired answer.

  Example: "Given that the threat of nuclear war is higher now than it has ever been in human history, and the fact that nuclear war poses a threat to the very existence of the human race, would you favor an all-out nuclear test ban?" This will probably result in a higher percentage in favor of the test ban than the question "Are you in favor of or opposed to a nuclear test ban?"

- Unintentional bias: Questions can be worded such that the meaning is misinterpreted by a large percentage of the respondents.

  Example: "Have you ever used drugs?" In this question the type of drug is not clear.

- Desire to please: Respondents have a desire to please the person who is asking the question. They tend to understate their response to an undesirable social habit or opinion.

  Example: When asked whether they have voted or not, respondents often say that they have voted when in fact they might not have.

- Asking the uninformed: People do not like to admit that they don't know what you are talking about when you ask them a question.

- Unnecessary complexity: If questions are to be understood, they must be kept simple. Some questions ask more than one question at once.

- Ordering of questions: If one question requires respondents to think about something that they may not have otherwise considered, then the order in which questions are presented can change the results.

  Example: "Do you favor an increase in taxes?" and "Do you favor an increase in taxes for education?" The order here may affect the response.

- Confidentiality and anonymity: People will often answer questions differently based on the degree to which they believe they are anonymous. It is easier to ensure confidentiality (a promise not to release identifying information) than anonymity (the researcher does not know the identity of the respondents).

#### Be Sure You Understand What Was Measured

Words can have different meanings. It is important to get a precise definition of what was actually asked or measured. E.g., who is really unemployed?

#### Some Concepts Are Hard to Precisely Define

E.g., how to measure intelligence?

#### Measuring Attitudes and Emotions

E.g., how to measure self-esteem and happiness?

#### Open or Closed Questions: Should Choices Be Given?

- Open question: respondents are allowed to answer in their own words.
- Closed question: respondents are given a list of alternatives, usually with a choice of "other" that they can fill in.

If closed questions are preferred, they should first be presented as open questions in a pilot survey to establish the list of choices. Results can be difficult to summarize with open questions.

#### Case Study 3.2 No Opinion of Your Own? Let Politics Decide

1978 poll, Cincinnati, Ohio: people were asked whether they favored or opposed repealing the 1975 Public Affairs Act. There was no such act, yet about one-third expressed an opinion.

1995 Washington Post poll: 1000 randomly selected people were asked, "Some people say the 1975 Public Affairs Act should be repealed. Do you agree or disagree that it should be repealed?" 43% expressed an opinion, with 24% agreeing that it should be repealed.

Second 1995 Washington Post poll: polled two separate groups of 500 randomly selected adults.

Group 1: "President Clinton, a Democrat, said that the 1975 Public Affairs Act should be repealed. Do you agree or disagree?" Of those expressing an opinion, 36% of the Democrats agreed that it should be repealed and 16% of the Republicans agreed that it should be repealed.

Group 2: "The Republicans in Congress said that the 1975 Public Affairs Act should be repealed. Do you agree or disagree?" Of those
expressing an opinion, 36% of the Republicans agreed that it should be repealed and 19% of the Democrats agreed that it should be repealed.

## Chapter 2: Turning Data Into Information

### 2.1 Raw Data

- Raw data are numbers and category labels that have been collected but have not yet been processed in any way.
- When measurements are taken from a subset of a population, they represent sample data.
- When all individuals in a population are measured, the measurements represent population data.
- Descriptive statistics: summary numbers for either a population or a sample.

### 2.2 Types of Data

[Diagram: types of data. Qualitative (categorical): nominal, ordinal. Quantitative: continuous, discrete.]

- Raw data from categorical variables consist of group or category names that don't necessarily have a logical ordering. Examples: eye color, country of residence.
- Categorical variables for which the categories have a logical ordering are called ordinal variables. Examples: highest educational degree earned, tee shirt size (S, M, L, XL).
- Raw data from quantitative variables consist of numerical values taken on each individual. Examples: height, number of siblings.
- Discrete variables are those whose possible values are countable. Example: number of siblings.
- Continuous variables are those that take on values in intervals. Example: height.

Sometimes the type of the variable depends on the way it is being observed. For example, the following two questions about income lead to two different types of variable:

- State your annual income in dollars.
- Is your income (1) between 20,000 and 40,000, (2) between 40,000 and 60,000, or (3) above 60,000?

#### Asking the Right Questions: One Categorical Variable

Question 1a: How many and what percentage of individuals fall into each category?

Example: What percentage of college students favor the legalization of marijuana, and what percentage of college students oppose legalization of marijuana?

Question 1b: Are individuals equally divided across categories, or do the percentages across categories follow some other
interesting pattern? Example: When individuals are asked to choose a number from 1 to 10, are all numbers equally likely to be chosen?

Asking the Right Questions: Two Categorical Variables
Question 2a: Is there a relationship between the two variables, so that the category into which individuals fall for one variable seems to depend on which category they are in for the other variable? Example: In Case Study 1.6, we asked if the risk of having a heart attack was different for the physicians who took aspirin than for those who took a placebo.
Question 2b: Do some combinations of categories stand out because they provide information that is not found by examining the categories separately? Example: The relationship between smoking and lung cancer was detected, in part, because someone noticed that the combination of being a nonsmoker and having lung cancer is unusual.

Asking the Right Questions: One Quantitative Variable
Question 3a: What are the interesting summary measures, like the average or the range of values, that help us understand the collection of individuals who were measured? Example: What is the average handspan measurement, and how much variability is there in handspan measurements?
Question 3b: Are there individual data values that provide interesting information because they are unique or stand out in some way (outliers)? Example: What is the oldest recorded age of death for a human? Are there many people who have lived nearly that long, or is the oldest recorded age a unique case?

Asking the Right Questions: One Categorical and One Quantitative Variable
Question 4a: Are the measurements similar across categories? Example: Do men and women drive at the same "fastest speeds" on average?
Question 4b: When the categories have a natural ordering (an ordinal variable), does the measurement variable increase or decrease, on average, in that same order? Example: Do high school dropouts, high school graduates, college dropouts, and college graduates have increasingly higher average incomes?

Asking the Right
Questions: Two Quantitative Variables
Question 5a: If the measurement on one variable is high or low, does the other one also tend to be high or low? Example: Do taller people also tend to have larger handspans?
Question 5b: Are there individuals whose combination of data values provides interesting information because that combination is unusual? Example: An individual who has a very low IQ score but can perform complicated arithmetic operations very quickly may shed light on how the brain works. Neither the IQ nor the arithmetic ability may stand out as uniquely low or high, but it is the combination that is interesting.

Explanatory and Response Variables
When there is a question about the relationship between two variables, it is useful to identify one variable as the explanatory variable and the other variable as the response variable. In general, the value of the explanatory variable for an individual is thought to partially explain the value of the response variable for that individual.

Explanatory and Response Variables (cont.)
Example: The relationship between the dosage of a blood-pressure-lowering drug and the reduction in a patient's blood pressure within 30 minutes is of interest.
Response variable: blood pressure of the patient after 30 minutes.
Explanatory variable: the dosage of the drug.

2.3 Summarizing One or Two Categorical Variables
Numerical summaries:
- Count how many fall into each category.
- Calculate the percent in each category.
- If there are two variables, have the categories of the explanatory variable define the rows, and compute row percentages.

Example 2.1: Importance of Order
Survey of n = 190 college students.
About half (92) were given the question "Randomly pick a letter: S or Q." Note: 66% picked the first choice of S.
The other half (98) were given the question "Randomly pick a letter: Q or S." Note: 54% picked the first choice of Q.

Table 2.2: Order of Letters on Form and Choice of Letter
                  S Picked     Q Picked     Total
S listed first    61 (66%)     31 (34%)     92
Q listed first    45 (46%)     53 (54%)     98
Total             106 (56%)    84 (44%)     190

Example 2.2
Lighting the Way to Nearsightedness
Survey of n = 479 children. Those who slept with a nightlight or in a fully lit room before age 2 had a higher incidence of nearsightedness (myopia) later in childhood.

Table 2.3: Nighttime Lighting in Infancy and Eyesight
Slept with    No Myopia    Myopia       High Myopia    Total
Darkness      155 (90%)    15 (9%)      2 (1%)         172
Nightlight    153 (66%)    72 (31%)     7 (3%)         232
Full light    34 (45%)     36 (48%)     5 (7%)         75
Total         342 (71%)    123 (26%)    14 (3%)        479

Note: The study does not prove that sleeping with light actually caused myopia in more children.

Using Minitab: Summarizing One Categorical Variable
Stat > Tables > Tally Individual Variables
[Screenshot: Tally Individual Variables dialog listing the worksheet columns (Sex, HrsSleep, SQpick, Height, RandNumb, Fastest, RtSpan, LftSpan, Form), with display options for counts, percents, and cumulative counts.]

Using Minitab: Summarizing One Categorical Variable (Minitab output)
Tally for Discrete Variables: SQpick
SQpick    Count    Percent
Q         84       44.21
S         106      55.79
N =       190

Using Minitab: Summarizing Two Categorical Variables
Stat > Tables > Cross Tabulation and Chi-Square
[Screenshot: Cross Tabulation and Chi-Square dialog.]

Using Minitab: Summarizing Two Categorical Variables (Minitab output)
Tabulated statistics: SQpick, Form
Rows: SQpick   Columns: Form
        Q or S    S or Q    All
Q       53        31        84
        63.10     36.90     100.00
S       45        61        106
        42.45     57.55     100.00
All     98        92        190
        51.58     48.42     100.00
(Each cell shows the count and, below it, the row percent.)

Visual Summaries for Categorical Variables
- Pie charts: useful for summarizing a single categorical variable if there are not too many categories.
- Bar graphs: useful for summarizing one or two categorical variables, and particularly useful for making comparisons when there are two categorical variables.

Example 2.3: Humans Are Not Good Randomizers
Survey of n = 190 college students: "Randomly pick a number between 1 and 10."
[Figure 2.1: Pie chart of numbers picked. Figure 2.2: Bar graph of numbers picked.]
Results: Most chose 7; very few chose 1 or 10.
Example 2.4: Revisiting Nightlights and Nearsightedness
Survey of n = 479 children.
Response: level of myopia (the chart shows the percent with each myopia level).
Explanatory: amount of sleeptime lighting before age 2 (darkness, nightlight, full light).
[Figure: Bar chart for myopia and nighttime lighting in infancy.]

Minitab Pie Chart
Graph > Pie Chart > Tally individual variables
[Figure: Minitab output, pie chart of RandNumb.]

Minitab Bar Graphs
Graph > Bar Chart
[Figure: Minitab output, bar chart.]

2.4 Finding Information in Quantitative Data
A long list of numbers needs to be organized to obtain answers to questions of interest.
[Table 2.4: Stretched right handspans (cm) of 190 college students, listed separately for males (87 students) and females (103 students).]

Five-Number Summaries
Find the extremes (high, low), the median, and the quartiles (medians of the lower and upper halves of the values).
- A quick overview of the data values.
- Information about the center, spread, and shape of the data.

Five-Number Summaries: Median
The median of a set of data is the middle value when the data set is arranged in increasing or decreasing order. How to find the median:
a. Arrange the data from smallest to largest.
b. If n is odd, then the $\left(\frac{n+1}{2}\right)$th observation in the ordered data set is the median.
c. If n is even, then the mean of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th observations in the ordered data set is the median.

Five-Number Summaries: Quartiles
Quartiles are numbers that approximately divide the ordered data into quarters.
The first quartile is denoted by Q1. At least 25% of the observations are less than or equal to Q1, and at least 75% of the data are greater than or equal to Q1.
The third quartile is denoted by Q3. At least 75% of the
observations are less than or equal to Q3, and at least 25% of the data are greater than or equal to Q3.

To calculate the quartiles:
1. Locate the median M of the observations.
2. The first quartile Q1 is the median of the observations that are less than M.
3. The third quartile Q3 is the median of the observations that are greater than M.

Example
Odd number of observations: 0 2 5 7 8 9 14, so Q1 = 2, M = 7, Q3 = 9.
Even number of observations: 0 2 5 8 9 14, so Q1 = 2, M = 6.5, Q3 = 9.

Example 2.5: Right Handspans
             Males (87 students)    Females (103 students)
Median       22.5                   20.0
Quartiles    21.5, 23.5             19.0, 21.0
Extremes     18.0, 25.0             12.5, 23.25
About 25% of handspans of females are between 12.5 and 19.0 centimeters, about 25% are between 19 and 20 cm, about 25% are between 20 and 21 cm, and about 25% are between 21 and 23.25 cm.

Interesting Features of Quantitative Variables
- Location: center or average, e.g., the median.
- Spread: variability, e.g., the difference between the two extremes or the two quartiles.
- Shape: later, in Section 2.5.

Outliers and How to Handle Them
Outlier: a data point that is not consistent with the bulk of the data.
- Look for them via graphs.
- They can have a big influence on conclusions.
- They can cause complications in some statistical analyses.
- They cannot be discarded without justification.

Example 2.6: Ages of Death of U.S. First Ladies
Partial data listing and five-number summary.
[Table 2.5: The First Ladies of the United States of America, with age at death and a five-number summary. Legible values: median 70, extremes 34 and 97.]
The extremes are more interesting here. Who died at 34? Martha Jefferson. Who lived to be 97? Bess Truman.

Possible Reasons for Outliers and Reasonable Actions
- A mistake was made while taking a measurement or entering it into the computer. If verified, the value should be discarded or corrected.
- The individual in question belongs to a different group than the bulk of individuals measured. Values may be discarded if a summary is desired and reported for the majority group only.
- The outlier is a legitimate data value and represents natural variability for the group and
variables measured. Values may not be discarded; they provide important information about location and spread.

Example 2.7: Tiny Boatsmen
Weights (in pounds) of 18 men on crew teams:
Cambridge: 188.5, 183.0, 194.5, 185.0, 214.0, 203.5, 186.0, 178.5, 109.0
Oxford: 186.0, 184.5, 204.0, 184.5, 195.5, 202.5, 174.0, 183.0, 109.5
Note: The last weight in each list is unusually small. They are the coxswains for their teams, while the others are rowers.

2.5 Pictures for Quantitative Data
- Histograms: similar to bar graphs, used for any number of data values.
- Stem-and-leaf plots and dotplots: present all individual values, useful for small to moderate sized data sets.
- Boxplot or box-and-whisker plot: useful summary for comparing two or more groups.

Interpreting Histograms, Stemplots, and Dotplots
[Figure: histogram, stemplot, and dotplot of handspans.]
- Values are centered around 20 cm.
- Two possible low outliers.
- Apart from outliers, spans range from about 16 to 23 cm.

Creating a Histogram
Step 1: Decide how many equally spaced (same width) intervals to use for the horizontal axis. Between 6 and 15 intervals is a good number.
Step 2: Decide whether to use frequencies (counts) or relative frequencies (proportions) on the vertical axis.
Step 3: Draw equally spaced intervals on the horizontal axis covering the entire range of the data. Determine the frequency or relative frequency of data values in each interval and draw a bar with the corresponding height. Decide on a rule for values that fall on the border between two intervals.

Histogram
Example: A marketing consultant observed 50 shoppers at a grocery store. One variable of interest was how much each shopper spent in the store. Here are the data (in dollars), arranged in increasing order:
2.32 6.61 6.90 8.04 9.45 10.26 11.34 11.63 12.66 12.95
13.67 13.72 14.35 14.52 14.55 15.01 15.33 16.55 17.15 18.22
18.30 18.71 19.54 19.55 20.58 20.89 20.91 21.13 23.85 26.04
27.07 28.76 29.15 30.54 31.99 32.82 33.26 33.80 34.76 36.22
37.52 39.28 40.80 43.97 45.58 52.36 61.57 63.85 64.30 69.49

Histogram
A frequency table summarizes these data as follows (dollars spent: frequency, relative frequency):
2.32 to 12.32: 8, 0.16; 12.32 to 22.32: 20,
0.40; 22.32 to 32.32: 7, 0.14; 32.32 to 42.32: 8, 0.16; 42.32 to 52.32: 2, 0.04; 52.32 to 62.32: 2, 0.04; 62.32 to 72.32: 3, 0.06. Total: 50, 1.00.

Histogram: A Few Notes
- Each of the intervals in the first column is called a measurement class.
- Observed values that fall on the boundaries of the measurement classes should consistently go into either the lower or the upper subinterval.
- The number of measurement classes for a data set should be chosen so that the least amount of information is lost while the data are effectively summarized. Too few classes summarize the data too much; too many classes do not summarize the data effectively.

Minitab Histogram
Graph > Histogram
To set the number of bins, follow these steps:
1. Select the histogram bars by clicking on one of the bars.
2. Right click on the graph and select Edit Bars.
3. Choose the Binning tab.
4. Type in the number of bins desired.
[Figures: histogram of the shopping data; three histograms of the shopping data.]

Creating a Dotplot
A dotplot displays a dot for each observation along a number line. If there are multiple occurrences of an observation, or if observations are too close together, then dots will be stacked vertically. If there are too many points to fit vertically in the graph, then each dot may represent more than one point. (Minitab Release 12.1, 1998)

Creating a Stem-and-Leaf Plot
Step 1: Determine the stem values. The stem contains all but the last of the displayed digits of a number. Stems should define equally spaced intervals.
Step 2: For each individual, attach a leaf to the appropriate stem. A leaf is the last of the displayed digits of a number. Often leaves are ordered on each stem.
Note: There is more than one way to define stems. You can use split stems, or truncate or round values first.

Stem-and-Leaf Plots
For a given number, the stem consists of all but the final (rightmost) digit. The leaf consists of the final digit. A leaf digit unit (LDU) determines
the location of the decimal place.
Example:
Number    Stem    Leaf    LDU
347.5     347     5       0.1
3475      347     5       1

Example: Stem-and-leaf display of the shopping data (values were truncated to integers; LDU = 1):
0 | 26689
1 | 0112233444556788899
2 | 000136789
3 | 012334679
4 | 035
5 | 2
6 | 1349
In Minitab, use Graph > Stem-and-Leaf to get a stem-and-leaf display. Increment can be used to set the scale.

Example 2.8: Big Music Collection
"About how many CDs do you own?"
[Stem-and-leaf plot: estimated music CDs owned for n = 24 Penn State students.]
- The stem is 100s and the leaf unit is 10s.
- The final digit is truncated.
- Numbers ranged from 0 to about 450, with 450 being a clear outlier and most values ranging from 0 to 99.

Describing Shape
- Symmetric, bell-shaped
- Symmetric, not bell-shaped
- Skewed right: values trail off to the right
- Skewed left: values trail off to the left

Can you think of everyday-life data that may follow each of the following shapes?
(a) Bell-shaped (b) Triangular (c) Uniform or rectangular (d) Reverse J-shaped (e) J-shaped (f) Right-skewed (g) Left-skewed (h) Bimodal (i) Multimodal

2.6 Numerical Summaries of Quantitative Data
Notation for raw data:
- n = number of individuals in a data set
- $x_1, x_2, x_3, \ldots, x_n$ represent the individual raw data values
Example: A data set consists of handspan values in centimeters for six females; the values are 21, 19, 20, 20, 22, and 19. Then n = 6, $x_1 = 21$, $x_2 = 19$, $x_3 = 20$, $x_4 = 20$, $x_5 = 22$, and $x_6 = 19$.

Describing the Location of a Data Set
- Mean: the numerical average.
- Median: the middle value (if n is odd) or the average of the middle two values (if n is even).
Symmetric: mean = median. Skewed left: mean < median. Skewed right: mean > median.
[Figure: comparison of mean and median for symmetric, left-skewed, and right-skewed shapes.]

Determining the Mean and Median
The mean: $\bar{x} = \frac{\sum x_i}{n}$, where $\sum$ means "add together all the values."
The median: see Slide 28.

Example 2.9: Will "Normal" Rainfall Get Rid of Those Odors?
Data: Average rainfall (inches) for Davis, California, for 47 years.
Mean = 18.69 inches. Median = 16.72 inches.
In 1997-98, a company with an odor problem blamed it on excessive rain. That year, rainfall was 29.69 inches. More rain occurred in 4 other years.
[Figure: Annual rainfall in Davis, California.]

The Influence of Outliers on the Mean and Median
Outliers have a larger influence on the mean than on the median. High outliers will increase the mean; low outliers will decrease the mean.
- If ages at death are 70, 72, 74, 76, and 78, then mean = median = 74 years.
- If ages at death are 35, 72, 74, 76, and 78, then the median is 74 but the mean is 67 years.

Describing Spread: Range and Interquartile Range
- Range = high value - low value
- Interquartile range (IQR) = upper quartile - lower quartile
- Standard deviation (covered later, in Section 2.7)

Example 2.10: Fastest Speeds Ever Driven
Five-number summary for 87 males:
Median       110
Quartiles    95, 120
Extremes     55, 150
- The median of 110 mph measures the center of the data.
- The two extremes describe the spread over 100% of the data. Range = 150 - 55 = 95 mph.
- The two quartiles describe the spread over the middle 50% of the data. Interquartile range = 120 - 95 = 25 mph.

Notation and Finding the Quartiles (also see Slide 30)
Split the ordered values into the half that is below the median and the half that is above the median.
Q1 = lower quartile = median of the data values that are below the median.
Q3 = upper quartile = median of the data values that are above the median.

Example 2.10: Fastest Speeds (cont.)
Ordered data (in rows of 10 values) for the 87 males:
55 60 80 80 80 80 85 85 85 85
90 90 90 90 90 92 94 95 95 95
95 95 95 100 100 100 100 100 100 100
100 100 101 102 105 105 105 105 105 105
105 105 109 110 110 110 110 110 110 110
110 110 110 110 110 112 115 115 115 115
115 115 120 120 120 120 120 120 120 120
120 120 124 125 125 125 125 125 125 130
130 140 140 140 140 145 150
- Median = (87 + 1)/2 = 44th value in the list = 110 mph.
- Q1 = median of the 43 values below the median = (43 + 1)/2 = 22nd value from the start of the list = 95 mph.
- Q3 = median of the 43 values above the median = (43 + 1)/2 = 22nd value from the end of the list = 120 mph.

Percentiles
The kth percentile is a number that has k% of the data values at or below it and (100 -
k)% of the data values at or above it.
- Lower quartile = 25th percentile
- Median = 50th percentile
- Upper quartile = 75th percentile

Picturing Location and Spread with Boxplots
Boxplots for right handspans of males and females:
- The box covers the middle 50% of the data.
- The line within the box marks the median value.
- Possible outliers are marked with an asterisk.
- Apart from outliers, the lines extending from the box reach to the minimum and maximum values.
[Figure: boxplots of right handspan (cm) for females and males.]

How to Draw a Boxplot of a Quantitative Variable
Step 1: Label either a vertical axis or a horizontal axis with numbers from the minimum to the maximum of the data.
Step 2: Draw a box with the lower end at Q1 and the upper end at Q3.
Step 3: Draw a line through the box at the median M.
Step 4: Draw a line from the Q1 end of the box to the smallest data value that is not further than 1.5 x IQR from Q1. Draw a line from the Q3 end of the box to the largest data value that is not further than 1.5 x IQR from Q3.
Step 5: Mark data points further than 1.5 x IQR from either edge of the box with an asterisk. Points represented with asterisks are considered to be outliers.

How to Draw a Boxplot
[Figure: boxplot of Amount, annotated with the outliers, Q3, M, Q1, and the minimum and maximum excluding outliers.]

2.7 Bell-Shaped Distributions of Numbers
Many measurements follow a predictable pattern:
- Most individuals are clumped around the center.
- The greater the distance a value is from the center, the fewer individuals have that value.
Variables that follow such a pattern are said to be bell-shaped. A special case is called a normal distribution or normal curve.

Example 2.11: Bell-Shaped British Women's Heights
Data: a representative sample of 199 married British couples. The figure shows a histogram of the wives' heights with a normal curve superimposed. The mean height is 1602 millimeters.
[Figure: Minitab histogram of height (mm) with a normal curve superimposed.]

Describing Spread with Standard Deviation
Standard deviation measures variability by summarizing how far individual data values are from the mean. Think of the standard deviation as roughly the average distance
values fall from the mean.

Describing Spread with Standard Deviation
Numbers                      Mean    Standard Deviation
100, 100, 100, 100, 100      100     0
90, 90, 100, 110, 110        100     10
Both sets have the same mean of 100.
Set 1: all values are equal to the mean, so there is no variability at all.
Set 2: one value equals the mean and the other four values are 10 points away from the mean, so the average distance away from the mean is about 10.

Calculating the Standard Deviation
Formula for the sample standard deviation:
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
The value of $s^2$ is called the sample variance. An equivalent formula, easier to compute, is
$s = \sqrt{\frac{\sum x_i^2 - n\bar{x}^2}{n - 1}}$

Calculating the Standard Deviation
Step 1: Calculate $\bar{x}$, the sample mean.
Step 2: For each observation, calculate the difference between the data value and the mean.
Step 3: Square each difference in step 2.
Step 4: Sum the squared differences in step 3, and then divide this sum by n - 1.
Step 5: Take the square root of the value in step 4.

Calculating the Standard Deviation
Consider four pulse rates: 62, 68, 74, 76.
Step 1: $\bar{x} = \frac{62 + 68 + 74 + 76}{4} = \frac{280}{4} = 70$
Steps 2 and 3:
Value    Value - Mean    (Value - Mean)^2
62       -8              64
68       -2              4
74       4               16
76       6               36
Sum      0               120
Step 4: $s^2 = \frac{120}{4 - 1} = 40$
Step 5: $s = \sqrt{40} \approx 6.3$

Population Standard Deviation
Data sets usually represent a sample from a larger population. If the data set includes measurements for an entire population, the notations for the mean and standard deviation are different, and the formula for the standard deviation is also slightly different. A population mean is represented by the symbol $\mu$ ("mu"), and the population standard deviation is
$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$

Interpreting the Standard Deviation for Bell-Shaped Curves: The Empirical Rule
For any bell-shaped curve, approximately
- 68% of the values fall within 1 standard deviation of the mean in either direction,
- 95% of the values fall within 2 standard deviations of the mean in either direction,
- 99.7% of the values fall within 3 standard deviations of the mean in either direction.

The Empirical Rule, the Standard Deviation, and the Range
Empirical Rule => the range from the minimum to the maximum data values
equals about 4 to 6 standard deviations for data with an approximate bell shape. You can get a rough idea of the value of the standard deviation by dividing the range by 6: $s \approx \frac{\text{Range}}{6}$.

Example 2.11: Women's Heights (cont.)
Mean height for the 199 British women is 1602 mm and the standard deviation is 62.4 mm.
- 68% of the 199 heights would fall in the range 1602 ± 62.4, or 1539.6 to 1664.4 mm.
- 95% of the heights would fall in the interval 1602 ± 2(62.4), or 1477.2 to 1726.8 mm.
- 99.7% of the heights would fall in the interval 1602 ± 3(62.4), or 1414.8 to 1789.2 mm.

Example 2.11: Women's Heights (cont.)
Summary of the actual results:
                 Numerical Range       Empirical Rule % and Number    Actual Number    Actual Percent
Mean ± 1 s.d.    1539.6 to 1664.4      68% of 199 = 135               140              140/199 or 70%
Mean ± 2 s.d.    1477.2 to 1726.8      95% of 199 = 189               189              189/199 or 95%
Mean ± 3 s.d.    1414.8 to 1789.2      99.7% of 199 = 198             198              198/199 or 99.5%
Note: The minimum height is 1410 mm and the maximum height is 1760 mm, for a range of 1760 - 1410 = 350 mm. So an estimate of the standard deviation is
$s \approx \frac{\text{Range}}{6} = \frac{350}{6} \approx 58.3$ mm.

Standardized z-Scores
Standardized score or z-score:
$z = \frac{\text{Observed value} - \text{Mean}}{\text{Standard deviation}}$
Example: Mean resting pulse rate for adult men is 70 beats per minute (bpm), and the standard deviation is 8 bpm. The standardized score for a resting pulse rate of 80 is
$z = \frac{80 - 70}{8} = 1.25$
A pulse rate of 80 is 1.25 standard deviations above the mean pulse rate for adult men.

The Empirical Rule Restated
For bell-shaped data,
- about 68% of the values have z-scores between -1 and +1,
- about 95% of the values have z-scores between -2 and +2,
- about 99.7% of the values have z-scores between -3 and +3.

Chapter 5: Relationships Between Quantitative Variables
Three tools we will use:
- Scatterplot: a two-dimensional graph of the data values.
- Correlation: a statistic that measures the strength and direction of a linear relationship.
- Regression equation: an equation that describes the average relationship between a response and an explanatory variable.

5.1 Looking for Patterns with
Scatterplots

Questions to Ask about a Scatterplot
- What is the average pattern? Does it look like a straight line, or is it curved?
- What is the direction of the pattern?
- How much do individual points vary from the average pattern?
- Are there any unusual data points?
Minitab commands: Graph > Scatterplot > Simple; Graph > Scatterplot > With Groups.

A College Example
Data were collected from the 25 top liberal arts colleges and the 25 top research universities. The data set contains the following variables, among others: Name (name of the school), SchoolType (coded for liberal arts college or university), SAT (median combined Math and Verbal SAT score of students), and Accept (percentage of applicants accepted). Explore the pairwise relationship between pairs of variables: SAT, Accept, and Top 10%.

Positive and Negative Association
- Two variables have a positive association when the values of one variable tend to increase as the values of the other variable increase.
- Two variables have a negative association when the values of one variable tend to decrease as the values of the other variable increase.
Positive association: Schools with a higher percentage of students who graduated in the top 10% of their graduating class have students with higher median SAT scores.
Negative association: Schools with a lower acceptance rate have students with higher median SAT scores.

Example 5.1: Height and Handspan
The data shown are the first 12 observations of a data set that includes the heights (in inches) and fully stretched handspans (in centimeters) of 167 college students.
[Table: Height (in) and Handspan (cm). Legible rows include (69, 22.0), (66, 18.5), (72, 24.0), (76, 24.5), (67, 20.0), (70, 23.0), (62, 17.0), and so on, for n = 167 observations.]

Example 5.1: Height and Handspan (cont.)
Taller people tend to have greater handspan measurements than shorter people do. When two variables tend to increase together, we say that they have a positive association.

Example 5.2: Driver Age and Maximum Legibility Distance of Highway Signs
- A research firm determined the maximum distance at which each of 30 drivers could
read a newly designed sign.
- The 30 participants in the study ranged in age from 18 to 82 years old.
- We want to examine the relationship between age and the sign legibility distance.

Example 5.2: Driver Age and Maximum Legibility Distance of Highway Signs (cont.)
[Figure: scatterplot of legibility distance versus driver age.]
We see a negative association with a linear pattern. We will use a straight-line equation to model this relationship.

Example: Carbon Dioxide Concentration
Trends in CO2 concentration in the last two centuries. Source of data: WorldWatch Institute. Data file: caridi0MTW
[Figure: scatterplot of CO2 concentration versus year.]
There is a positive association between Year and the CO2 concentration level. The association, however, is nonlinear.

Consider the College data. Explore the relationship between pairs of variables.
[Figure: scatterplots for pairs of variables in the College data.]

Groups
Use different plotting symbols or colors to represent different subgroups. What does the graph tell you?
[Figure: scatterplot of the College data with groups marked.]

5.2 Describing Linear Patterns with a Regression Line
When the best equation for describing the relationship between x and y is a straight line, the equation is called the regression line. Two purposes of the regression line:
- to estimate the average value of y at any specified value of x;
- to predict the value of y for an individual, given that individual's x value.

Example 5.1: Height and Handspan (cont.)
Regression equation: Handspan = -3 + 0.35 Height
- Estimate the average handspan for people 60 inches tall: Average handspan = -3 + 0.35(60) = 18 cm.
- Predict the handspan for someone who is 60 inches tall: Predicted handspan = -3 + 0.35(60) = 18 cm.
- Slope = 0.35 => handspan increases by 0.35 cm, on average, for each increase of 1 inch in height.

The Equation for the Regression Line
$\hat{y} = b_0 + b_1 x$
- $\hat{y}$ is spoken as "y-hat," and it is also referred to either as predicted y or estimated y.
- $b_0$ is the intercept of the straight line. The intercept is the value of $\hat{y}$ when x = 0.
- $b_1$ is the slope of the straight line. The slope tells us how much $\hat{y}$ changes when x increases by one unit, and its sign tells us whether y increases or decreases when x increases.

Example 5.2: Driver Age and Maximum Legibility Distance of Highway Signs (cont.)
Regression equation: Distance = 577 - 3 Age
- The slope of -3 tells us that, on average, the legibility distance decreases 3 feet when age increases by one year.
- Estimate the average distance for 20-year-old drivers: Average distance = 577 - 3(20) = 517 ft.
- Predict the legibility distance for a 20-year-old driver: Predicted distance = 577 - 3(20) = 517 ft.

College Example (cont.)
SAT = 1371 - 2.83 Accept
If a school has a 50% acceptance rate, then on average the median SAT score for that school is 1371 - 2.83(50) = 1229.5.

Prediction Errors and Residuals
Prediction error = difference between the observed value of y and the predicted value $\hat{y}$.
Residual = $y - \hat{y}$
[Figure: components in a regression model.]

Example 5.2: Driver Age and Maximum Legibility Distance of Highway Signs (cont.)
Regression equation: $\hat{y} = 577 - 3x$
x = Age    y = Distance    Predicted = 577 - 3x    Residual
18         510             577 - 3(18) = 523       510 - 523 = -13
20         590             577 - 3(20) = 517       590 - 517 = 73
22         516             577 - 3(22) = 511       516 - 511 = 5
The residual can be computed for all 30 observations.
Positive residual => observed value higher than predicted.
Negative residual => observed value lower than predicted.

Least Squares Line and Formulas
- The least squares regression line minimizes the sum of squared prediction errors.
- SSE = sum of squared prediction errors.
- Formulas for slope and intercept:
$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$

5.3 Measuring Strength and Direction with Correlation
Correlation r indicates the strength and the direction of a straight-line relationship.
- The strength of the relationship is determined by the closeness of the points to a straight line.
- The direction is determined by whether one variable generally increases or generally decreases when the other variable increases.

Interpretation of r and a Formula
- r is always between -1 and +1.
- The magnitude indicates the strength.
- r = -1 or r = +1 indicates a perfect linear relationship.
- The sign
indicates the direction.
- r = 0 indicates a slope of 0, so knowing x does not change the predicted value of y.
Formula for correlation:
$r = \frac{1}{n - 1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$

Example 5.1: Height and Handspan (cont.)
Regression equation: Handspan = -3 + 0.35 Height
Correlation r ≈ 0.7 => a somewhat strong positive linear relationship.

Example 5.2: Driver Age and Maximum Legibility Distance of Highway Signs (cont.)
Regression equation: Distance = 577 - 3 Age
Correlation r = -0.8 => a somewhat strong negative linear association.

Example 5.6: Left and Right Handspans
If you know the span of a person's right hand, can you accurately predict his or her left handspan?
Correlation r = 0.95 => a very strong positive linear relationship.

Example 5.7: Verbal SAT and GPA
Grade point averages (GPAs) and verbal SAT scores for a sample of 100 university students.
Correlation r = 0.43 => a moderately strong positive linear relationship.

Example 5.8: Age and Hours of TV Viewing
Relationship between age and hours of daily television viewing for 1913 survey respondents.
Correlation r = 0.12 => a weak connection.
Note: A few respondents claimed to watch more than 20 hours per day.

Example 5.9: Hours of Sleep and Hours of Study
Relationship between reported hours of sleep in the previous 24 hours and the reported hours of study during the same period.
Correlation r = -0.36 => a not too strong negative association.

Interpretation of r^2 and a Formula
The squared correlation $r^2$ is between 0 and 1 and indicates the proportion of variation in the response explained by x.
- SSTO = sum of squares total = sum of squared differences between the observed y values and $\bar{y}$.
- SSE = sum of squared errors (residuals) = sum of squared differences between the observed y values and the predicted values based on the least squares line.
$r^2 = \frac{\text{SSTO} - \text{SSE}}{\text{SSTO}}$

Interpretation of r^2
Example 5.6: Left and Right Handspans. $r^2 = 0.90$ => the span of one hand is very predictable from the span of the other hand.
Example 5.8: TV Viewing and Age. $r^2 = 0.014$ => only about 1.4%; knowing a person's age doesn't help much in predicting the amount of daily TV viewing.

Example 5.6: Left and Right Handspans (Minitab regression output)
[Minitab regression output for the handspan data, with coefficient and analysis of variance tables. Legible value: R-Sq(adj) = 90.2%.]

Regression in Minitab
Stat > Regression > Regression

Regression Analysis: SAT versus ACCEPT
The regression equation is SAT = 1371 - 2.83 ACCEPT
Predictor    Coef       SE Coef    T        P
Constant     1371.5     21.45      53.92    0.000
ACCEPT       -2.8300    0.5351     -5.29    0.000
S = 50.0585   R-Sq = 35.8%   R-Sq(adj) = 35.5%

5.4 Why the Answers May Not Make Sense
- Allowing outliers to overly influence the results
- Combining groups inappropriately
- Using correlation and a straight-line equation to describe curvilinear data

Example 5.4: Height and Foot Length (cont.)
Three outliers were data entry errors.
Regression equation: uncorrected data: 15.4 + 0.13 height; corrected data: -3.2 + 0.42 height.
Correlation: uncorrected data: r = 0.28; corrected data: r = 0.69.

Example 5.10: Earthquakes in the U.S.
One outlier is the San Francisco earthquake of 1906.
Correlation: all data: r = 0.73; without San Francisco: r = -0.96.

Combining Groups
College student heights and responses to the question "What is the fastest you have ever driven a car?" Combining the groups produces an illegitimate correlation.
[Figures: scatterplot of all data; scatterplot by gender.]

Example 5.12: Don't Predict without a Plot
Population (in millions) versus year.
Correlation: r = 0.95
Regression line: population = -2218 + 1.218 Year
Poor prediction for the year 2030: -2218 + 1.218(2030), or about 255 million, only 5 million more than in 1990.

Extrapolation
- It is usually a bad idea to use a regression equation to predict values far outside the range where the original data fell.
- There is no guarantee that the relationship will continue beyond the range for which we have observed data.

5.5 Correlation Does Not Prove Causation
Interpretations of an observed association:
1. Causation
2. Confounding factors present
3. Explanatory and response variables are both affected by other variables
4. The response variable is causing a change in the explanatory variable

Case Study 5.1: A Weighty Issue
[Figure: relationship between actual and ideal weight, for females and males.]

Chapter 1: Statistics Success Stories and Cautionary Tales

1.1 What is
Statistics?
Statistics is a collection of procedures and principles for gathering data and analyzing information in order to help people make decisions when faced with uncertainty.

1.2 Seven Statistical Stories With Morals
- Case Study 1.1: Who Are Those Speedy Drivers?
- Case Study 1.2: Safety in the Skies?
- Case Study 1.3: Did Anyone Ask Whom You've Been Dating?
- Case Study 1.4: Who Are Those Angry Women?
- Case Study 1.5: Does Prayer Lower Blood Pressure?
- Case Study 1.6: Does Aspirin Reduce Heart Attack Rates?
- Case Study 1.7: Does the Internet Increase Loneliness and Depression?

First, let's take a survey
- Please take a few minutes to take the survey given at [link missing in the source]. We will use the result of this survey in learning some of our course material.
- Let's enter the data in Minitab. The instructions are given in the Minitab Manual that is contained in the CD that comes with your textbook.

Case Study 1.1: Who Are Those Speedy Drivers?
Question: What's the fastest you have ever driven a car? ____ mph.
Data: 87 male and 102 female students from a large statistics class at a university.

Males: 110 109 90 140 105 150 120 110 110 90 115 95 145 140 110 105 85 95 100 115 124 95 100 125 140 85 120 115 105 125 102 85 120 110 120 115 94 125 80 85 140 120 92 130 125 110 90 110 110 95 95 110 105 80 100 110 130 105 105 120 90 100 105 100 120 100 100 80 100 120 105 60 125 120 100 115 95 110 101 80 112 120 110 115 125 55 90

Females: 80 75 83 80 100 100 90 75 95 85 90 85 90 90 120 85 100 120 75 85 80 70 85 110 85 75 105 95 75 70 90 70 82 85 100 90 75 90 110 80 80 110 110 95 75 130 95 110 110 80 90 105 90 110 75 100 90 110 85 90 80 80 85 50 80 100 80 80 80 95 100 90 100 95 80 80 50 88 90 90 85 70 90 30 85 85 87 85 90 85 75 90 102 80 100 95 110 80 95 90 80 90

Which gender has driven faster? How to summarize data?

Case Study 1.1: Who Are Those Speedy Drivers?
Dotplot: each dot represents the response of an individual student.
(Figure omitted: dotplots for Males and Females; horizontal axis: fastest speed, mph.)

Case Study 1.1: Who Are Those Speedy Drivers?
Five-number summary: the lowest value, the cutoff points for 1/4, 1/2, and 3/4
of the data, and the highest value.

| | Males | Females |
|---|---|---|
| Median | 110 | 89 |
| Quartiles | 95, 120 | 80, 95 |
| Extremes | 55, 150 | 30, 130 |

Note: 3/4 of men have driven 95 mph or more; only 1/4 of women have done so.
Moral: Simple summaries of data can tell an interesting story and are easier to digest than long lists.

Case Study 1.2: Safety in the Skies?
"Planes get closer in midair as traffic control errors rise. Errors by air traffic controllers climbed from 746 in fiscal 1997 to 878 in fiscal 1998, an 18% increase." (USA Today, Levin 1999)
Sounds ominous, but: the errors per million flights handled by controllers climbed from 4.8 to 5.5. So the original rate of errors in 1997, from which the 18% increase was calculated, was less than 5 errors per million flights.
Moral: When discussing the change in the rate or risk of occurrence of something, make sure you also include the base rate or baseline risk.

Case Study 1.3: Did Anyone Ask Whom You've Been Dating?
"According to a new USA Today/Gallup Poll of teenagers across the country, 57 percent of teens who go out on dates say they've been out with someone of another race or ethnic group." (Peterson 1997)
Sacramento Bee headline read "Interracial dates common among today's teenagers."
- Millions of teenagers in the US. Did polltakers ask all of them? No.
- The article states "the results of the new poll of 602 teens, conducted Oct. 13-20, reflect the ubiquity of interracial dating today."
- Asked only 602 teens. Could such a small sample tell us anything about the millions of teenagers in the US? Yes, if those teens constituted a random sample from the population.

Case Study 1.3: Did Anyone Ask Whom You've Been Dating?
How accurate could this sample be?
(Figure omitted: U.S. teens who date; how many have dated somebody of another race? Of the 496 in the sample who have dated, 57% have dated somebody of another race; the margin of error is about 5%.)
The percent of all teenagers in the US who date that would say they have dated interracially is likely to be in the range 57% ± 5%, or between 52% and 62%.
The 5% given above is the margin of error. A quick estimate of the margin of error is $1/\sqrt{n}$, where $n$ is the sample size. For
example, $1/\sqrt{496} \approx 0.045$.

Case Study 1.3: Did Anyone Ask Whom You've Been Dating?
Moral: A representative sample of only a few thousand, or perhaps even a few hundred, can give reasonably accurate information about a population of millions.

Case Study 1.4: Who Are Those Angry Women?
Shere Hite sent questionnaires to 100,000 women asking about love, sex, and relationships. Only 4.5% of the women responded, and Hite used those responses to write her book Women and Love.
"The women who responded were fed up with men and eager to fight them. For example, 91% of those who were divorced said that they had initiated the divorce. The anger of women toward men became the theme of the book." (Moore 1997, p. 11)
Extensive nonresponse from a random sample, or the use of a self-selected (i.e., all-volunteer) sample, will probably produce biased results.
Moral: An unrepresentative sample, even a large one, tells you almost nothing about the population.

Case Study 1.5: Does Prayer Lower Blood Pressure?
"Attending religious services lowers blood pressure more than tuning into religious TV or radio, a new study says."
USA Today headline read "Prayer can lower blood pressure." (Davis 1998)
Based on an observational study that followed 2391 people for 6 years.
People who attended a religious service once a week and prayed or studied the Bible once a day were 40% less likely to have high blood pressure than those who don't go to church every week and prayed and studied the Bible less.
Researchers did observe a relationship, but it's a mistake to conclude prayer actually causes lower blood pressure.

Case Study 1.5: Does Prayer Lower Blood Pressure?
In observational studies, groups can differ in important ways that may contribute to the observed relationship.
People who attended church regularly may have:
- been less likely to smoke or drink alcohol
- had a better social network
- been somewhat healthier and able to go to church
These other factors are possible confounding variables.
Moral: Cause-and-effect conclusions cannot generally be made based on an observational study.
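Returning briefly to the margin-of-error estimate in Case Study 1.3: the arithmetic can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The function name `quick_margin_of_error` is hypothetical, the rule used is the quick estimate 1/sqrt(n) quoted in the notes, and the sample size 496 and the 57% figure are taken from that case study.

```python
import math


def quick_margin_of_error(n: int) -> float:
    """Quick estimate of the margin of error for a sample proportion: 1/sqrt(n)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return 1.0 / math.sqrt(n)


# Values from Case Study 1.3 (496 teens who date, 57% dated interracially).
n = 496
p_hat = 0.57

moe = quick_margin_of_error(n)
low, high = p_hat - moe, p_hat + moe

print(f"margin of error: {moe:.3f}")                    # about 0.045
print(f"likely range: {low:.2f} to {high:.2f}")
```

The notes round the margin of error up to 5%, which gives the slightly wider range of 52% to 62%.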
Case Study 1.6: Does Aspirin Reduce Heart Attack Rates?
Physician's Health Study (1988): 5-year randomized experiment
- 22,071 male physicians of age 40-84 randomly assigned to one of two treatment groups
- Group 1: aspirin every other day; Group 2: placebo
- Physicians blinded as to which group they were in

Case Study 1.6: Does Aspirin Reduce Heart Attack Rates?
Table 1.1 The Effect of Aspirin on Heart Attacks

| Treatment | Heart attacks | Participants | Heart attacks per 1000 |
|---|---|---|---|
| Aspirin | 104 | 11,037 | 9.42 |
| Placebo | 189 | 11,034 | 17.13 |

Aspirin group: 9.42 heart attacks per 1000 participants.
Placebo group: 17.13 heart attacks per 1000 participants.
Randomization → other important factors (age, diet, etc.) should have been similar for both groups. The only important difference should be whether they took aspirin or placebo.
Moral: Unlike with observational studies, cause-and-effect conclusions can generally be made on the basis of randomized experiments.

Case Study 1.7: Does the Internet Increase Loneliness and Depression?
"...greater use of the Internet was associated with declines in participants' communication with family members in the household, declines in size of their social circle, and increases in their depression and loneliness" (Kraut et al. 1998, p. 1017)
New York Times headline read "Sad, Lonely World Discovered in Cyberspace." (Harmon 1998)
Study included 169 individuals from 73 households in Pittsburgh, given free computers and internet service. Participants answered questions at the beginning and either 1 or 2 years later on social contacts, stress, loneliness, and depression.

Case Study 1.7: Does the Internet Increase Loneliness and Depression?
Changes were quite small, although "significant":
- number of people in local social network decreased from an average of 23.94 to 22.90 people
- on a scale of 1 to 5, self-reported loneliness decreased from an average of 1.99 to 1.89 (lower → greater loneliness)
- on a scale of 0 to 3, self-reported depression dropped from an average of 0.73 to 0.62 (lower → higher depression)
Follow-up study: "Using the Internet at home doesn't make people more depressed and lonely after all." (Elias 2001)
Whether the link ever existed will never be known, but it is not surprising, given the small magnitude of the original finding, that it subsequently disappeared.

Case Study 1.7: Does the Internet Increase Loneliness and Depression?
Moral: A "statistically significant" finding does not necessarily have practical importance. When a study reports a statistically significant finding, find out the magnitude of the relationship or difference.
A secondary moral to this story is that the implied direction of cause and effect may be wrong. In this case, it could be that people who were more lonely and depressed were more prone to use the Internet. And, as the follow-up research makes clear, remember that "truth" doesn't necessarily remain fixed across time.

1.3 The Common Elements in the Seven Stories
In every story, data are used to make a judgment about a situation. This is what statistics is all about.

The Discovery of Knowledge
1. Asking the right questions
2. Collecting useful data, which includes deciding how much is needed
3. Summarizing and analyzing data, with the goal of answering the questions
4. Making decisions and generalizations based on the observed data
5. Turning the data and subsequent decisions into new knowledge

Class exercises
- Page 10: 22
- Page 11: 26

Chapter 7: Probability
A phenomenon is called random if individual outcomes are uncertain but there is nonetheless a regular distribution of relative frequencies in a large number of repetitions.
(Figures omitted.)

7.1 Random Circumstances
Examples of Random Phenomenon: Alicia Has a Bad Day
- Doctor Visit: Diagnostic test comes back positive for a disease (D). Test is 95% accurate. About 1 out of 1000 women actually have D.
- Statistics Class: Professor randomly selects 3 separate students at the beginning of each class to answer questions. Alicia is picked to answer the third question.

Random Circumstances in Alicia's Day
Random Circumstance 1: Disease status
- Alicia has D.
- Alicia does not have D.
Random Circumstance 2: Test result
- Test is positive.
- Test is negative.
Random Circumstance 3: 1st student's name is drawn
- Alicia is selected.
- Alicia is not selected.
Random Circumstance 4: 2nd student's name is drawn
- Alicia is selected.
- Alicia is not selected.
Random Circumstance 5: 3rd student's name is drawn
- Alicia is selected.
- Alicia is not selected.

Assigning Probabilities
- A probability is a value between 0 and 1 and is written either as a fraction or as a decimal fraction.
- A probability simply is a number between 0 and 1 that is assigned to a possible outcome of a random circumstance.
- For the complete set of distinct possible outcomes of a random circumstance, the total of the assigned probabilities must equal 1.

7.2 Interpretations of Probability
The Relative Frequency Interpretation of Probability
In situations that we can imagine repeating many times, we define the probability of a specific outcome as the proportion of times it would occur over the long run, called the relative frequency of that particular outcome.

Example 7.1 Probability of Male versus Female Births
Long-run relative frequency of males born in the United States is about 0.512 (Information Please Almanac, 1991, p. 815).
Table provides results of a simulation: the proportion is far from 0.512 over the first few weeks but in the long run settles down around 0.512. (Table omitted.)

Determining the Relative Frequency Probability of an Outcome
Method 1: Make an Assumption about the Physical World

Example 7.2 A Simple Lottery
Choose a three-digit number between 000 and 999. In the long run, a player should win about 1 out of 1000 times. This does not mean a player will win exactly once in every thousand plays.

Example 7.3 Probability Alicia Has to Answer a Question
There are 50 student names in a bag. If the names are mixed well, we can assume each student is equally likely to be selected. Probability Alicia will be selected to answer the first question is 1/50.

Determining the Relative Frequency Probability of an Outcome
Method 2: Observe the Relative Frequency

Example 7.4 The Probability
of Lost Luggage
"1 in 176 passengers on U.S. airline carriers will temporarily lose their luggage."
This number is based on data collected over the long run. So the probability that a randomly selected passenger on a U.S. carrier will temporarily lose luggage is 1/176, or about 0.006.

Proportions and Percentages as Probabilities
Ways to express the relative frequency of lost luggage:
- The proportion of passengers who lose their luggage is 1/176, or about 0.006.
- About 0.6% of passengers lose their luggage.
- The probability that a randomly selected passenger will lose his/her luggage is about 0.006.
- The probability that you will lose your luggage is about 0.006.
The last statement is not exactly correct: your probability depends on other factors (how late you arrive at the airport, etc.).

Estimating Probabilities from Observed Categorical Data
Assuming data are representative, the probability of a particular outcome is estimated to be the relative frequency (proportion) with which that outcome was observed.
Approximate margin of error for the estimated probability is $1/\sqrt{n}$.

Example 7.5 Nightlights and Myopia Revisited
Assuming these data are representative of a larger population, what is the approximate probability that someone from that population who sleeps with a nightlight in early childhood will develop some degree of myopia?
(Table omitted: myopia counts by sleeping condition are not recoverable from the source.)
Note: 72 + 7 = 79 of the 232 nightlight users developed some degree of myopia. So we estimate the probability to be 79/232 = 0.34. This estimate is based on a sample of 232 people, with a margin of error of about 0.066.

7.3 Probability Definitions and Relationships
A sample space, usually denoted by S, is the set of all possible outcomes of an experiment.
Examples:
- Rolling a die → S = {1, 2, 3, 4, 5, 6}
- Flipping two coins → S = {HH, HT, TH, TT}
- Testing a patient; measuring the lifetime of a light bulb (the sample spaces for these two are illegible in the source)

Any subset of a sample space is called an event.
Examples:
- In rolling a die, getting an even number.
- In flipping two coins, getting a match: E = {HH, TT}.
Generally, events are denoted by capital letters like A, B, E, and
their probabilities are denoted by P(A), P(B), P(E).

Example 7.2 A Simple Lottery (cont.)
Random Circumstance: A three-digit winning lottery number is selected.
Sample Space: {000, 001, 002, 003, ..., 997, 998, 999}. There are 1000 simple events.
Probabilities for simple events: The probability that any specific three-digit number is a winner is 1/1000. Assume all three-digit numbers are equally likely.
Event A = last digit is a 9 = {009, 019, ..., 999}. Since one out of ten numbers is in the set, P(A) = 1/10.
Event B = three digits are all the same = {000, 111, 222, 333, 444, 555, 666, 777, 888, 999}. Since event B contains 10 events, P(B) = 10/1000 = 1/100.

Union of events: The event $A \cup B$ is the event that either A or B or both occur in a single performance of an experiment.
Example:
Experiment: Rolling two dice
Event A: getting a sum of 7
Event B: getting a 4 on at least one of the dice
Event $A \cup B$: getting a sum of 7 or a 4 on one of the dice

Intersection of events: The intersection of two events A and B, denoted by $A \cap B$, is the event that occurs only if both A and B occur in a single performance of an experiment.
Example: For the previous example, $A \cap B$ = {(4,3), (3,4)}.

Complement of an event E is all the events in the sample space that are not in E. It is denoted by $E^c$.

Two events A and B are called mutually exclusive if A and B cannot occur at the same time ($A \cap B = \emptyset$).
Example: Roll two dice.
A: Getting a sum of 3
B: Getting a 4 on at least one of the dice

Two events A and B are called independent of each other if the knowledge that A has occurred does not affect the probability that B occurs.
Example: Roll two dice.
A: Getting a 3 on the first die
B: Getting a 4 on the second die
A and B are independent.
Example:
A: Getting a 5 on the first die
B: Getting a sum of 10
A and B are not independent.

Computing probabilities involving sample spaces with equally likely events
If every single event in S is as likely to occur as any other single event in S, the outcomes in S are called equally likely.

Example 7.7 Winning a Free Lunch
A drawing is held once a week for a free lunch.
Event A: You win in week 1.
Event B: Vanessa
wins in week 1.
Event C: Vanessa wins in week 2.
- Events A and B refer to the same random circumstance and are mutually exclusive.
- Events A and C refer to two different random circumstances and are independent.

Equally likely outcomes (cont.)
Example: Flipping two fair coins, S = {HH, HT, TH, TT}.
Suppose that n(S) denotes the number of elements in S and n(E) denotes the number of elements in E. If all elements in S are equally likely, then
$P(E) = n(E) / n(S)$

The basic three rules of probability theory are called the Axioms of Probability.
Axioms of Probability
1. For any event E, $0 \le P(E) \le 1$.
2. The probability of all the events in the sample space is 1: $P(S) = 1$.
3. If E and F are two mutually exclusive events, then $P(E \cup F) = P(E) + P(F)$.

Example: When rolling two dice, compute the probabilities of each of the following events:
A: getting a sum of 7
B: getting a sum of 11
C: getting a sum of 7 or a sum of 11
D: getting a sum of 7 and a sum of 11
n(S) = 36. (Grid of outcomes by 1st die and 2nd die omitted.)
$C = A \cup B$ and $A \cap B = \emptyset$, so $P(C) = P(A \cup B) = P(A) + P(B) = 6/36 + 2/36 = 8/36$.
Alternatively, $n(A \cup B) = 8$, so $P(A \cup B) = 8/36$.
$D = A \cap B$ and $A \cap B = \emptyset$, so $P(D) = P(A \cap B) = P(\emptyset) = 0$.

Not equally likely events
Example: A box contains 3 red marbles and two white marbles. Two marbles are picked at random with replacement. What is the sample space for this experiment?
S = {RR, RW, WR, WW}
A: two whites
B: two reds
C: red and white
With replacement, P(R) = 3/5 and P(W) = 2/5 on each draw, so P(A) = 4/25, P(B) = 9/25, and P(C) = 12/25. (Tree diagram for the 1st and 2nd draw omitted.)

Example: You have a pair of black socks and a pair of white socks. If you pick two socks from these at random, what is the probability that they match?
B1: first black
W1: first white
B2: 2nd black
W2: 2nd white
M: event that the socks match

In-class exercises, Sections 7.1-7.3
Pages 269-271: 5, 18, 21ab, 22, 24ab, 25, 3b

7.4 Basic Rules for Finding Probabilities
Probability an Event Does Not Occur
Rule 1 (for "not the event"): $P(A^c) = 1 - P(A)$

Addition Rule: If A and B are two events, then
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
Example: A card is picked at random. What is the probability the card is an ace or a diamond?
A: Card is an ace
B: Card is a diamond
$P(A \cup B) = P(A) + P(B) - P(A \cap B) = 4/52 + 13/52 - 1/52 = 16/52$

Example 7.9 Probability a Stranger Does Not Share Your Birth Date
P(next stranger you meet will share your birthday) = 1/365
P(next stranger you meet will not share your birthday) = 1 - 1/365
= 364/365 ≈ 0.9973

Example 7.10 Roommate Compatibility
Brett is off to college. There are 1000 male students. Brett hopes his roommate will not like to party and not snore.

| | Snores | Does not snore | Total |
|---|---|---|---|
| Likes to party | 150 | 100 | 250 |
| Does not like to party | 200 | 550 | 750 |
| Total | 350 | 650 | 1000 |

A = likes to party, P(A) = 250/1000 = 0.25
B = snores, P(B) = 350/1000 = 0.35
The probability Brett will be assigned a roommate who either likes to party or snores, or both, is
P(A or B) = P(A) + P(B) - P(A and B) = 0.25 + 0.35 - 0.15 = 0.45.
So the probability his roommate is acceptable is 1 - 0.45 = 0.55.

Conditional Probability
Example: We roll two dice and note that the first die is showing a 4.
E: Rolling a sum of 8
F: Rolling a 4 on the 1st die
$P(E \mid F) = P(E \cap F) / P(F)$
Example: For the example above we have $P(E \cap F) = 1/36$ and $P(F) = 6/36$, therefore
$P(E \mid F) = (1/36) / (6/36) = 1/6$

Example 7.8 Probability That a Teenager Gambles Depends upon Gender
Survey: 78,564 students (9th and 12th graders)
The proportions of males and females admitting they gambled at least once a week during the previous year were reported. Results for 9th grade:
P(student is weekly gambler | teen is boy) = 0.20
P(student is weekly gambler | teen is girl) = 0.05
Notice the dependence between "weekly gambling habit" and "gender." Knowledge of a 9th grader's gender changes the probability that he/she is a weekly gambler.

Example 7.13 Alicia Answering
If we know Alicia is picked to answer one of the questions, what is the probability it was the first question?
A = Alicia is selected to answer Question 1, P(A) = 1/50
B = Alicia is selected to answer any one of the questions, P(B) = 3/50
Since A is a subset of B, P(A and B) = 1/50.
$P(A \mid B) = P(A \text{ and } B) / P(B) = (1/50) / (3/50) = 1/3$

Example: A card is picked at random and it is noted that it is a face card. What is the probability that it is a queen?
E: Getting a queen
F: Getting a face card
$P(E \mid F) = P(E \cap F) / P(F) = (4/52) / (12/52) = 1/3$

The general multiplication rule
Let E and F be two events. Since $P(E \mid F) = P(E \cap F) / P(F)$,
$P(E \cap F) = P(E \mid F) P(F)$

Example 7.8 Probability of Male and Gambler (cont.)
For 9th graders, 22.9% of the boys and 4.5% of the girls admitted they gambled at least once a week during the previous year. The population consisted of 50.9% girls and 49.1% boys.
Event A = male, P(A) = 0.491
Event B = weekly gambler, P(B | A)
= 0.229
P(male and gambler) = P(A and B) = P(A)P(B | A) = (0.491)(0.229) = 0.1124
About 11% of all 9th graders are males and weekly gamblers.

Independent and Dependent Events
- Two events are independent of each other if knowing that one will occur (or has occurred) does not change the probability that the other occurs.
- Two events are dependent if knowing that one will occur (or has occurred) changes the probability that the other occurs.
Two events E and F are called independent if $P(E \mid F) = P(E)$.

Example:
Event A: Alicia is selected to answer Question 1.
Event B: Alicia is selected to answer Question 2.
Events A and B refer to different random circumstances, but are A and B independent events?
- P(A) = 1/50.
- If event A occurs, her name is no longer in the bag, so P(B) = 0.
- If event A does not occur, there are 49 names in the bag (including Alicia's name), so P(B) = 1/49.
Knowing whether A occurred changes P(B). Thus, the events A and B are not independent.

In general, E is independent of F if the knowledge that F has occurred does not change the probability of occurrence of E.
Note that if $P(F) > 0$, then
$P(E \mid F) = P(E) \iff P(E \cap F) / P(F) = P(E) \iff P(E \cap F) = P(E) P(F)$
Thus, two events E and F are independent if and only if $P(E \cap F) = P(E) P(F)$.
In general, for independent events $E_1, E_2, \ldots, E_n$,
$P(E_1 \cap E_2 \cap \cdots \cap E_n) = P(E_1) P(E_2) \cdots P(E_n)$

Example 7.11 Probability of Two Boys or Two Girls in Two Births
What is the probability that a woman who has two children has either two girls or two boys?
Recall that the probability of a boy is 0.512 and the probability of a girl is 0.488. Then we have (using Rule 3b):
Event A = two girls, P(A) = (0.488)(0.488) = 0.2381
Event B = two boys, P(B) = (0.512)(0.512) = 0.2621
Note: Events A and B are mutually exclusive (disjoint).
The probability that the woman has either two boys or two girls is
P(A or B) = P(A) + P(B) = 0.2381 + 0.2621 = 0.5002

Example 7.12 Probability Two Strangers Both Share Your Birth Month
Assume all 12 birth months are equally likely. What is the probability that the next two unrelated strangers you meet both share your birth month?
Event A = 1st stranger shares your birth month, P(A) = 1/12
Event B = 2nd stranger shares your birth month, P(B) = 1/12
Note: Events A and B are independent.
P(both strangers share your birth month) = P(A and B) = P(A)P(B) = (1/12)(1/12) = 0.007
Note: The probability that 4 unrelated strangers all share your birth month would be $(1/12)^4$.

In Summary: Steps for Finding
Probabilities
Step 1: List each separate random circumstance involved in the problem.
Step 2: List the possible outcomes for each random circumstance.
Step 3: Assign whatever probabilities you can with the knowledge you have.
Step 4: Specify the event for which you want to determine the probability.
Step 5: Determine which of the probabilities from step 3 and which probability rules can be combined to find the probability of interest.
(Summary table of mutually exclusive and independent events omitted; illegible in the source.)

Bayes rule
Conditional probability: define the event in physical terms and see if you know the probability.
Know P(B | A) but want P(A | B): use Rule 3a to find P(B) = P(A and B) + P(A^c and B), then use Rule 4:
$P(A \mid B) = \dfrac{P(A \text{ and } B)}{P(B)} = \dfrac{P(B \mid A) P(A)}{P(B \mid A) P(A) + P(B \mid A^c) P(A^c)}$

Example: John is undecided as to whether to take a French course or a Chemistry course. He estimates his probability of receiving an A in a French course (value illegible in the source), whereas it would be only 1/3 in a Chemistry course. If John decides to base his decision on the flip of a coin, what is the probability that he gets an A in Chemistry?
C: Choosing Chemistry
F: Choosing French
A: Getting an A
$P(A \cap C) = P(A \mid C) P(C) = (1/3)(1/2) = 1/6$

What is the probability that Alicia has the disease, given that the test was positive?
Steps 1 to 3: Random circumstances, outcomes, probabilities.
Random circumstance 1: Alicia's disease status
Possible outcomes: A = disease; A^c = no disease
Probabilities: P(A) = 1/1000 = 0.001; P(A^c) = 0.999
Random circumstance 2: Alicia's test results
Possible outcomes: B = test is positive; B^c = test is negative
Probabilities:
P(B | A) = 0.95 (positive test given disease)
P(B^c | A) = 0.05 (negative test given disease)
P(B | A^c) = 0.05 (positive test given no disease)
P(B^c | A^c) = 0.95 (negative test given no disease)

Example 7.17 Alicia Is Probably Healthy
Step 4: Specify the event you want to determine the probability of.
P(disease | positive test) = P(A | B)
Step 5: Determine which probabilities and probability rules can be combined to find the probability of interest.

Example 7.17 Alicia Healthy? (cont.)
Note: P(A and B) = P(B | A)P(A); similarly for P(A^c and B).
P(A and B) = P(B | A)P(A) = (0.95)(0.001) = 0.00095
P(A^c and B) = P(B | A^c)P(A^c) = (0.05)(0.999) = 0.04995
P(B) = 0.04995 + 0.00095 = 0.0509
P(A | B) = P(A and B) / P(B) = 0.00095 / 0.0509 = 0.019
There is less than a 2% chance that Alicia has the disease, even though her test was positive.

Two-Way Table: Hypothetical Hundred Thousand
Example 7.8 Teens and Gambling (cont.)
Sample of 9th grade teens: 49.1% boys, 50.9%
girls.
Results: 22.9% of boys and 4.5% of girls admitted they gambled at least once a week during the previous year.
Start with a hypothetical 100,000 teens:
(0.491)(100,000) = 49,100 boys and thus 50,900 girls.
Of the 49,100 boys, (0.229)(49,100) = 11,244 would be weekly gamblers.
Of the 50,900 girls, (0.045)(50,900) = 2,291 would be weekly gamblers.

Example 7.8 Teens and Gambling (cont.)

| | Weekly Gambler | Not Weekly Gambler | Total |
|---|---|---|---|
| Boy | 11,244 | 37,856 | 49,100 |
| Girl | 2,291 | 48,609 | 50,900 |
| Total | 13,535 | 86,465 | 100,000 |

P(boy and gambler) = 11,244/100,000 = 0.1124
P(boy | gambler) = 11,244/13,535 = 0.8307
P(gambler) = 13,535/100,000 = 0.13535

Tree Diagrams
Step 1: Determine the first random circumstance in the sequence and create the first set of branches.
Steps 2 and 3: (illegible in the source)
Step 4: To determine the probability of following any particular sequence of branches, multiply the probabilities on those branches. This is an application of Rule 3a.
Step 5: To determine the probability of any collection of sequences of branches, add the individual probabilities for those sequences, as found in step 4. This is an application of Rule 2b.

Example 7.18 Alicia's Possible Fates
(Tree diagram omitted.)
P(Alicia has D and has a positive test) = 0.00095
P(test is positive) = 0.00095 + 0.04995 = 0.0509
P(Alicia has D | positive test) = 0.00095/0.0509 = 0.019

Example 7.8 Teens and Gambling (cont.)
(Tree diagram omitted.)
P(boy and gambler) = (0.491)(0.229) = 0.1124
P(girl and gambler) = (0.509)(0.045) = 0.0229
P(girl and not gambler) = (0.509)(0.955) = 0.4861
P(gambler) = 0.1124 + 0.0229 = 0.1353
P(boy | gambler) = 0.1124/0.1353 = 0.8307

In-class exercises, Sections 7.4-7.5
Pages 271-274: 32, 35, 38, 44, 47, 50, 52, 53

Chapter 3: Gathering Useful Data

4.1 Speaking the Language of Research Studies
Observational Study: Researchers observe or question participants about opinions, behaviors, or outcomes. Participants are not asked to do anything differently. Two special cases: sample surveys and case-control studies.
Experiment: Researchers manipulate something and measure the effect of the manipulation on some outcome of interest.
Randomized experiments: participants are randomly assigned to participate in one condition (called treatment) or another.
Sometimes an experiment cannot be conducted due to practical/ethical issues.

Who is Measured? Units, Subjects, Participants
Unit: a single individual or object being measured.
If an experiment, then called an experimental unit.
When units are
people, often called subjects or participants.

Roles Played by Variables Measured or Not
Explanatory variable (or independent variable) is one that may explain or may cause differences in a response variable (or outcome or dependent variable).
A confounding variable is a variable that affects the response variable and also is related to the explanatory variable. A potential confounding variable not measured in the study is called a lurking variable.

Example 3.3 What Confounding Variables Lurk behind Lower Blood Pressure?
Recall Case Study 1.5: people who attended church regularly had lower blood pressure than those who stayed home.
Possible confounding variables:
- Amount of social support
- Health status
- Age
- Attitude toward life

Example 3.4 The Fewer the Pages, the More Valuable the Book?
Data on number of pages and price of 15 books, ordered by number of pages.
Do prices increase? No: many books with fewer pages are more expensive.

Table 3.1 Pages versus Price for the Books on a Professor's Shelf

| Book # | Pages | Price | Book # | Pages | Price |
|---|---|---|---|---|---|
| 1 | 104 | 32.95 | 9 | 417 | 4.95 |
| 2 | 138 | 24.95 | 10 | 417 | 39.75 |
| 3 | 220 | 49.95 | 11 | 436 | 5.95 |
| 4 | 264 | 79.95 | 12 | 458 | 60.00 |
| 5 | 336 | 4.50 | 13 | 466 | 49.95 |
| 6 | 342 | 49.95 | 14 | 469 | 5.99 |
| 7 | 378 | 4.95 | 15 | 585 | 5.95 |
| 8 | 385 | 5.99 | | | |

Example 3.4 The Fewer the Pages, the More Valuable the Book?
Confounding Variable: Type of book (hardcover versus paperback)
- For each type of book, price does tend to increase with number of pages, especially for technical books.
- Type of book affects price and is related to number of pages.
(Figure omitted: scatterplot of price versus pages by type of book.)

Case Study 3.1 Lead Exposure and Bad Teeth
"Children exposed to lead are more likely to suffer tooth decay..." (USA Today)
Observational study involving 24,901 children.
Explanatory variable: level of lead exposure.
Response variable: extent child has missing/decayed teeth.
Possible confounding variables: income level, diet, time since last dental visit.
Lurking variables: amount of fluoride in water, health care.
(Diagram omitted: sample of 24,901 children split into low exposure and high exposure.)

4.2 Designing a Good Experiment
Randomized
experiments often allow us to determine cause and effect.
Random assignment: to make the groups approximately equal in all respects except for the explanatory variable.

Who Participates in Randomized Experiments?
Participants in randomized experiments are often volunteers.
Remember the Fundamental Rule: Available data can be used to make inferences about a much larger group if the data can be considered to be representative with regard to the question(s) of interest.

Randomization: The Crucial Element
Randomizing the Type of Treatment: Randomly assigning the treatments to the experimental units keeps the researchers from making assignments favorable to their hypotheses and also helps protect against hidden or unknown biases.
Randomizing the Order of Treatments: If all treatments are applied to each unit, randomization should be used to determine the order in which they are applied (e.g., a taste-testing experiment of a soda).

Case Study 4.2 Kids and Weight Lifting
Is weight training good for children? If so, is it better to lift heavy weights for few repetitions or moderate weights more times?
Randomized experiment involving 43 young volunteers.
"Leg extension strength significantly increased in both exercise groups compared with that in the control subjects." (Faigenbaum et al. 1999, p. 25)
(Diagram omitted: 43 volunteers, 5.2 to 11.8 years old; random assignment to three groups: heavy load, moderate load, and control; muscle strength is compared.)

Randomly assigning treatments
50 volunteers to be tested for the effect of alcohol and marijuana on driving skill.
(Diagram omitted: volunteers randomly assigned to treatment groups; driving skill compared.)

Control Groups, Placebos, and Blinding
Control Groups: Treated identically in all respects except they don't receive the active treatment. Sometimes they receive a dummy treatment or a standard/existing treatment.
Placebo: Looks like the real drug but has no active ingredient.
Placebo effect: people respond to placebos.
Blinding:
Single-blind: participants do not know which treatment they have received.
Double-blind: neither participant
nor researcher making measurements knows who had which treatment.
Double Dummy: Each group is given two treatments.
Group 1 = real treatment 1 and placebo treatment 2
Group 2 = placebo treatment 1 and real treatment 2

Example of a Double Dummy experiment
Determining the effect of nicotine patch vs. nicotine gum on helping people quit smoking.
Subjects are assigned to:
- Treatment group: nicotine patch and placebo gum
- Treatment group: nicotine gum and placebo patch
- Control group: placebo gum and placebo patch

Pairing and Blocking
Matched-Pair Designs: Use either two matched individuals, or the same individual receives each of two treatments. Special case of a block design. Important to randomize the order of the two treatments and use blinding if possible.
Block Designs: Experimental units are divided into homogeneous groups called blocks; each treatment is randomly assigned to one or more units in each block.
If blocks = individuals and units = repeated time periods in which they receive varying treatments, these are called repeated-measures designs.
(Diagrams omitted.)

Case Study 4.3 Quitting Smoking with Nicotine Patches
"After the eight-week period of patch use, almost half (46%) of the nicotine group had quit smoking, while only one-fifth (20%) of the placebo group had." (Newsweek, March 9, 1993, p. 62)
Double-blind, Placebo-controlled, Randomized Experiment
- 240 smokers recruited (volunteers)
- Randomized to 22-mg nicotine patch or placebo (controlled) patch for 8 weeks
- Double-blind: neither the participants nor the nurses taking the measurements knew who had received the active nicotine patches

Example of Blocking
We want to determine the effect of two types of fertilizers on growth of tomatoes. We happen to have three types of tomatoes, so we block on tomato type.
(Diagram omitted: Fertilizer I and Fertilizer II applied within each tomato type.)

4.3 Designing a Good Observational Study
Disadvantage: more difficult to try to establish causal links.
Advantage: more likely to measure participants in their natural setting.

Types of Observational Studies
Retrospective: Participants are asked to recall past events.
Prospective:
Participants are followed into the future and events are recorded.

Case-Control Studies
"Cases" who have a particular attribute or condition are compared to "controls" who do not, to see how they differ on an explanatory variable of interest.
Advantages: efficiency and reduction of potential confounding variables through careful choice of controls.

Case Study 4.4 Baldness and Heart Attacks
"Men with typical male pattern baldness ... are anywhere from 30 to 300 percent more likely to suffer a heart attack than men with little or no hair loss at all." (Newsweek, March 9, 1993, p. 62)
Case-control study:
cases = men admitted to hospital with heart attack
controls = men admitted for other reasons
Explanatory variable: degree of baldness
Response variable: heart attack status (yes or no)
(Diagram omitted: sample of cases, 665 men with heart attack; sample of controls, 772 men in hospital but not for heart attack; extent of baldness compared.)

4.4 Difficulties and Disasters in Experiments and Observational Studies

Confounding Variables and the Implication of Causation in Observational Studies
Big misinterpretation: reporting a cause-and-effect relationship based on an observational study.
There is no way to separate the role of confounding variables from the role of explanatory variables in producing the outcome variable if randomization is not used.

Extending Results Inappropriately
Many studies use convenience samples or volunteers. Need to assess whether the results can be extended to any larger group for the question of interest.

Interacting Variables
A second variable can interact with the explanatory variable in its relationship with the outcome variable. Results should be reported taking the interaction into account.
Example: Interaction in Case Study 3.3
The difference between the nicotine and placebo patches is greater when there are no smokers in the home than when there are smokers in the home.
(Figure omitted.)

Hawthorne and Experimenter Bias
Hawthorne effect: participants in an experiment respond differently than they otherwise would, just because they are in
the experiment. Many treatments have a higher success rate in clinical trials than in actual practice.
Experimenter effects: recording data to match the desired outcome, treating subjects differently, etc. Most are overcome by blinding and control groups.

Ecological Validity and Generalizability
When variables have been removed from their natural setting and are measured in the laboratory or in some other artificial setting, the results may not reflect the impact of the variable in the real world.
Example 3.7 Real Smokers with a Desire to Quit (Case Study 3.3)
Ensured ecological validity and generalizability by using participants around the country with a wide range of ages, and recorded many other variables and checked that they were not related to the patch assignment or the response variable.

Relying on Memory or Secondhand Sources
Can be a problem in retrospective observational studies. Try to use authoritative sources such as medical records rather than rely on memory. If possible, use prospective observational studies.

Chapter 8 Random Variables

8.1 What is a Random Variable?
Often when an experiment is performed, we are interested in some function of the outcome as opposed to the single outcome itself.
Example. Experiment: rolling two dice. Single outcomes: (1, 1), (1, 2), ..., (6, 6). Function of the outcomes: sum of the two dice.
Example. Experiment: flipping three coins. Single outcomes: TTT, TTH, HTT, ..., HHH. Function of the outcomes: number of heads.
These functions of the outcomes are called random variables. More formally, a random variable is a function or rule that assigns a unique number to each outcome of an experiment.
[Figure: a random variable maps the sample space to the real numbers]

Example: Suppose a coin is flipped three times. Define the following random variables: X = number of heads, Y = number of consecutive heads.

Sample space:  TTT  TTH  THT  HTT  HHT  HTH  THH  HHH
X:              0    1    1    1    2    2    2    3
Y:              0    0    0    0    2    0    2    3

Example: A die is rolled continuously until a 6 is rolled. Let X be the number of trials required to stop the experiment. What are the possible values of X?
Example: A spinner stops randomly at a point. Let X be the value at which the spinner stops. What are the possible values of X?
[Figure: spinner dial with marks including 0, 1/4 and 1/2]
Example: A light bulb is picked at random. Let X be the life of this light bulb. What are the possible values of X?

Just like data, there are two types of random variables: discrete and continuous.
A discrete random variable takes on a countable number of values (not necessarily finite).
A continuous random variable assumes values in one or more intervals.
In the examples of the random variables given in the previous slides, which are continuous and which are discrete?

Probability distribution (mass) function
Consider the experiment of flipping two coins and define the random variable X to be the number of heads.

Event     Value of X   Probability
HH        2            1/4
HT, TH    1            1/2
TT        0            1/4

Each value of X is assigned a probability value. The function that assigns probabilities to each possible value of a discrete random variable is called the probability mass (distribution) function. The probability distribution function is denoted by $p(x)$.
Properties of $p(x)$: $0 \le p(x) \le 1$ for every $x$, and $\sum_x p(x) = 1$.
$p(x)$ is usually given in the form of a formula, a table, or a graph.

Probability Distribution of a Discrete R.V.
Using the sample space to find probabilities:
Step 1: List all simple events in the sample space.
Step 2: Find the probability for each simple event.
Step 3: List the possible values of the random variable X and identify the value of X for each simple event.
Step 4: Find all simple events for which X = k, for each possible value k.
Step 5: P(X = k) is the sum of the probabilities for all simple events for which X = k.
The probability distribution function (pdf) of X is a table or rule that assigns probabilities to the possible values of X.

Example: Twenty percent of people in a population smoke. A person is picked at random. Let X be the random variable that assigns 0 if the person picked does not smoke and the value 1 if the person picked smokes. X has the following probability distribution function:

x       0      1
p(x)    0.80   0.20

Example 8.4 How Many Girls are Likely?
Family has 3 children. Probability of a girl is 1/2. What are the probabilities of having 0, 1, 2, or 3 girls?
Sample space: for each birth, write either B or G.
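As a rough sketch of Steps 1 to 5 applied to this example, the following Python snippet lists the birth orders, gives each one a probability, and adds up the probabilities for each number of girls. It assumes independent births with a probability of 1/2 for a girl at each birth (the assumption that makes the simple events equally likely); the names `pmf_from_sample_space`, `event_probability`, and `number_of_girls` are illustrative only and do not come from the notes.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def pmf_from_sample_space(events, prob, value):
    """Steps 4-5: for each value k, add the probabilities of the simple events with X = k."""
    pmf = defaultdict(Fraction)
    for event in events:
        pmf[value(event)] += prob(event)
    return dict(sorted(pmf.items()))


# Step 1: list the simple events, e.g. 'BGB' = boy, girl, boy.
births = ["".join(order) for order in product("BG", repeat=3)]

# Step 2: assumed model -- independent births, P(girl) = 1/2 at each birth.
P_GIRL = Fraction(1, 2)


def event_probability(event):
    return P_GIRL ** event.count("G") * (1 - P_GIRL) ** event.count("B")


# Step 3: X = number of girls in a simple event.
def number_of_girls(event):
    return event.count("G")


pmf = pmf_from_sample_space(births, event_probability, number_of_girls)

# Cumulative distribution P(X <= k), built as a running sum of the pmf.
cdf = {}
running = Fraction(0)
for k, p in pmf.items():
    running += p
    cdf[k] = running

for k in pmf:
    print(k, pmf[k], cdf[k])
```

Exact fractions are used instead of floating-point numbers so the results can be compared directly with a hand-built table. Changing `P_GIRL` shows how the distribution shifts when the simple events are no longer equally likely.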
There are eight possible arrangements of B and G for three births. These are the simple events.
Sample space and probabilities: the eight simple events are equally likely.
Random variable X = number of girls in the three births. For each simple event, the value of X is the number of G's listed.

Simple event:  BBB  BBG  BGB  GBB  BGG  GBG  GGB  GGG
Probability:   1/8  1/8  1/8  1/8  1/8  1/8  1/8  1/8
X:              0    1    1    1    2    2    2    3

Probability distribution function for number of girls X:

k           0     1     2     3
P(X = k)    1/8   3/8   3/8   1/8

Cumulative Distribution Function of a Discrete Random Variable
The cumulative distribution function (cdf) gives $P(X \le k)$ for any real number k.
Example 8.4 (cont.) Cumulative distribution function for the number of girls:

k             0     1     2     3
P(X <= k)     1/8   4/8   7/8   1

Finding Probabilities for Complex Events
Example 8.4 (cont.) A Mixture of Children
What is the probability that a family with 3 children will have at least one child of each sex? If X = number of girls, then either the family has one girl and two boys (X = 1) or two girls and one boy (X = 2).
$P(X = 1 \text{ or } X = 2) = P(X = 1) + P(X = 2) = 3/8 + 3/8 = 6/8 = 3/4$

Example: In a given population, the probability that a person smokes is 0.20. We sample from this population until our selection results in a person that smokes. We then stop the experiment. Let X be the number of samples required to stop the experiment.
a) Give the probability distribution function of X.
b) Compute the probability that fewer than three samples are required to stop the experiment.
c) Compute the probability that at least three samples are required to stop the experiment.
a) First think about the possible values of X. Now let's compute the probabilities associated with each value of X, writing $S$ for "the person smokes" and $S^c$ for "the person does not smoke".
X = 1 means that the first person picked smokes: $P(X = 1) = p(1) = 0.20$.
X = 2 means that the first person picked does not smoke but the second person does: $P(X = 2) = p(2) = P(S^c \text{ and } S) = P(S^c)P(S) = (0.8)(0.2) = 0.16$.
Similarly, you can show that $P(X = 3) = p(3) = P(S^c \text{ and } S^c \text{ and } S) = P(S^c)P(S^c)P(S) = (0.8)^2(0.2) = 0.128$.
In general, $p(k) = (0.8)^{k-1}(0.2)$ for $k = 1, 2, 3, \ldots$
b) $P(X < 3) = p(1) + p(2) = 0.20 + 0.16 = 0.36$
c) $P(X \ge 3) = 1 - P(X < 3) = 1 - 0.36 = 0.64$

8.3 Expectations for Random Variables
Just like data, we can define mean, variance, and standard deviation for random variables.
Let $x_1, x_2, \ldots, x_n$ be n observations from a random variable X. Then the mean of these observations is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
The Law of Large Numbers: under suitable conditions, as n gets large, $\bar{x}$ will approach a real number called the mean of X. The mean of X is denoted by $\mu$.
The mean of a random variable X is also known as the expected value of X and is denoted by $E(X)$. $E(X)$ is computed from the probability mass function as follows:
$E(X) = \sum_x x\,p(x)$
Example: Let X be the number that appears when rolling a die. Then $p(x) = 1/6$ for $x = 1, 2, \ldots, 6$, and $E(X) = (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5$.

Example 8.6 California Decco Lottery
[Table: possible outcomes of a play with their probabilities, and the resulting expected value; values not legible in this copy]

Standard Deviation for a Discrete Random Variable
The standard deviation of a random variable is essentially the average distance the random variable falls from its mean over the long run.
If X is a random variable with possible values $x_1, x_2, x_3, \ldots$ occurring with probabilities $p_1, p_2, p_3, \ldots$ and expected value $E(X) = \mu$, then
Variance of X: $V(X) = \sigma^2 = \sum_i (x_i - \mu)^2 p_i$
Standard deviation of X: $\sigma = \sqrt{\sum_i (x_i - \mu)^2 p_i}$

Example 8.7 Stability or Excitement?
Two plans for investing $100. Which would you choose?

Plan 1: X = net gain     Probability      Plan 2: Y = net gain     Probability
$5,000                   .001             $20                      .3
$1,000                   .005             $10                      .2
$0                       .994             $4                       .5

Expected value in each plan:
Plan 1: $E(X) = 5000(.001) + 1000(.005) + 0(.994) = \$10.00$
Plan 2: $E(Y) = 20(.3) + 10(.2) + 4(.5) = \$10.00$

Example 8.7 Stability or Excitement? (cont.)
Variability in each plan:
Plan 1: $V(X) = (5000 - 10)^2(.001) + (1000 - 10)^2(.005) + (0 - 10)^2(.994) = 29{,}900.00$ and $\sigma = \$172.92$
Plan 2: $V(Y) = (20 - 10)^2(.3) + (10 - 10)^2(.2) + (4 - 10)^2(.5) = 48.00$ and $\sigma = \$6.93$
The possible outcomes for Plan 1 are much more variable. If you wanted to invest cautiously, you would choose Plan 2, but if you wanted to have the chance to gain a large amount of money, you would choose Plan 1.

In-class exercises, Sections 8.1-8.3, pages 321-323: 2, 6, 10, 18, 20, 21

8.4 Binomial Random Variables
A class of discrete random variables: a binomial random variable results from a binomial experiment.
Conditions for a binomial experiment:
1. There are n trials, where n is determined in advance and is not a random value.
2. Two possible outcomes on each trial, called "success" and "failure" and denoted S and F.
3. Outcomes are independent from one trial to the next.
4. Probability of a "success", denoted by p, remains the same from one trial to the next. Probability of "failure" is 1 - p.

Examples of Binomial Random Variables
A binomial random variable is defined as X = number of successes
in the n trials of a binomial experiment.
[Table: examples of binomial random variables, with columns Random Variable, Success, Failure, n, p; entries not legible in this copy]

Finding Binomial Probabilities
$P(X = k) = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}$ for $k = 0, 1, 2, \ldots, n$
[Figure: three binomial probability histograms; horizontal axis: number of successes, vertical axis: probability]

Example 8.9 Probability of Two Wins in Three Plays
p = probability of a win = 0.2; plays of the game are independent.
X = number of wins in three plays.
What is P(X = 2)?
$P(X = 2) = \frac{3!}{2!\,1!}(.2)^2(.8)^1 = 3(.04)(.8) = 0.096$

Expected Value and Standard Deviation for a Binomial Random Variable
For a binomial random variable X based on n trials and success probability p:
Mean: $\mu = E(X) = np$
Standard deviation: $\sigma = \sqrt{np(1-p)}$

Example 8.12 Extraterrestrial Life?
50% of a large population would say "yes" if asked, "Do you believe there is extraterrestrial life?"
A sample of n = 100 is taken.
X = number in the sample who say "yes" is approximately a binomial random variable.
Mean: $\mu = E(X) = 100(.5) = 50$
Standard deviation: $\sigma = \sqrt{100(.5)(.5)} = 5$
In repeated samples of n = 100, on average 50 people would say "yes". The amount by which that number would differ from sample to sample is about 5.

8.5 Continuous Random Variables
Recall that the range of a continuous random variable is one or more intervals in the real number line.
A probability density function of a continuous random variable X is a function f from which probabilities associated with X are computed.
The total area under the density function is 1.
The probability that X falls between any two real numbers a and b (a < b) is the area under the density between a and b.
If X is a continuous random variable and a is a constant, then P(X = a) = 0.

Example: You expect a call between now and [number illegible] minutes from now. It is reasonable to
assume that the waiting time is uniformly distributed.
a) Give the probability density for the waiting time.
b) What is the probability that you will have to wait between 20 and 25 minutes?
c) What is the probability that you will have to wait exactly 15 minutes?

8.6 Normal Random Variables
The normal distribution is determined by two parameters, $\mu$ and $\sigma$. We write $X \sim N(\mu, \sigma)$.
The probability density function for the normal distribution is given by
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$
$\mu$ is a location parameter: it determines the center of the probability density function.
$\sigma$ is a scale parameter: it determines the spread of the probability density function.

Example 8.14 College Women's Heights
Data suggest the distribution of heights of college women can be modeled by a normal curve with mean $\mu = 65$ inches and standard deviation $\sigma = 2.7$ inches.
Note: tick marks are given at the mean and at 1, 2, 3 standard deviations above and below the mean.
Empirical Rule values are exact characteristics of a normal curve model.

Standard Scores
The formula for converting any value x to a z-score is
$z = \frac{\text{Value} - \text{Mean}}{\text{Standard deviation}} = \frac{x - \mu}{\sigma}$
A z-score measures the number of standard deviations that a value falls from the mean.

Example 8.14 Height (cont.)
For a population of college women, the z-score corresponding to a height of 62 inches is
$z = \frac{\text{Value} - \text{Mean}}{\text{Standard deviation}} = \frac{62 - 65}{2.7} = -1.11$
This z-score tells us that 62 inches is 1.11 standard deviations below the mean height for this population.

Finding Probabilities for z-scores
Table A.1 Standard Normal (z) Probabilities
[Table excerpt not legible in this copy]
Body of table contains $P(Z \le z^*)$.
Leftmost column of table shows the algebraic sign, the digit before the decimal place, and the first decimal place.
Second decimal place of $z^*$ is in the column heading.

More Finding Probabilities for z-scores
$P(Z \le -2.59) = .0048$
$P(Z \le 1.31) = .9049$
$P(Z \le 2.00) = .9772$
$P(Z \le -4.75) = .000001$ (from the "in the extreme" portion of the table)

Example 8.15 Probability Z > 1.31
$P(Z > 1.31) = 1 - P(Z \le 1.31) = 1 - .9049 = .0951$

Example 8.16 Probability Z is between -2.59 and 1.31
$P(-2.59 \le Z \le 1.31) = P(Z \le 1.31) - P(Z \le -2.59) = .9049 - .0048 = .9001$

Use z-scores to Solve General Problems
Example 8.14 Height (cont.)
What is the probability that a randomly selected college woman is 62 inches or shorter?
$P(X \le 62) = P\left(Z \le \frac{62 - 65}{2.7}\right) = P(Z \le -1.11) = .1335$
About 13% of college women are 62 inches or shorter.

Example 8.14 Height (cont.)
What proportion of college women are taller than 68 inches?
$P(X > 68) = P\left(Z > \frac{68 - 65}{2.7}\right) = P(Z > 1.11) = 1 - P(Z \le 1.11) = 1 - .8665 = .1335$
About 13% of college women are taller than 68 inches.

Example: When a beverage dispensing machine is set to dispense an amount $\mu$, it dispenses an amount Y that has a normal distribution with mean $\mu$ and variance $\sigma^2 = .25$. At what amount should this machine be set so that it would dispense more than 12.5 only 2.5% of the time?
We want $\mu$ such that $P(Y > 12.5) = 1 - P(Y \le 12.5) = .025$, that is, $P\left(Z \le \frac{12.5 - \mu}{.5}\right) = .975$.
Note that $P(Z \le 1.96) = .975$.
Therefore we have $\frac{12.5 - \mu}{.5} = 1.96$, so $\mu = 12.5 - 1.96(.5) = 11.52$.

In-class exercises, Sections 8.4-8.6, pages 323-326: 32, 40, 46, 55, 59, 60
Omit Sections 8.7 & 8.8

Chapter 4 Gathering Useful Data

4.1 Speaking the Language of Research Studies
Observational study
Researchers observe or question participants about opinions, behaviors, or outcomes.
Participants are not asked to do anything differently.
Two special cases: sample surveys and case-control studies.
Experiment
Researchers manipulate something and measure the effect of the manipulation on some outcome of interest.
Randomized experiments: participants are randomly assigned to participate in one condition (called a treatment) or another.
Sometimes an experiment cannot be conducted due to practical/ethical issues.

Who is Measured: Units, Subjects, Participants
Unit: a single individual or object being measured.
If an experiment, then called an experimental unit.
When units are people, often called subjects or participants.

Measured or Not
Explanatory variable (or independent variable) is one that may explain or cause differences in a response variable (or outcome or dependent variable).
A confounding variable is a variable that affects the response variable and is related
to the explanatory variable. A potential confounding variable not measured in the study is called a lurking variable.

Roles Played by Variables
Example 3.3 What Confounding Variables Lurk behind Lower Blood Pressure?
Recall Case Study 1.5: people who attended church regularly had lower blood pressure than those who stayed home.
Possible confounding variables: amount of social support, health status, age, attitude toward life.

Example 3.4 The Fewer the Pages, the More Valuable the Book?
Data on number of pages and price of 15 books, ordered by number of pages.
Many books with fewer pages are more expensive.
[Table: number of pages and price for the 15 books; values not legible in this copy]
Example 3.4 The Fewer the Pages, the More Valuable the Book? (cont.)
Confounding variable: type of book, hardcover versus paperback.
For each type of book, price does tend to increase with number of pages, especially for technical books.
Type of book affects price and is related to number of pages.

Case Study 4.1 Lead Exposure and Bad Teeth
Observational study involving 24,901 children.
Explanatory variable: level of lead exposure.
Response variable: whether the child has missing/decayed teeth.
Possible confounding variables: income level, diet, time since last dental visit.
Lurking variables: amount of fluoride in water, health care.

In class exercise Section 41 pages 15 14

4.2 Designing a Good Experiment
Randomized experiments often allow us to determine cause and effect.
Random assignment is used to make the groups approximately equal in all respects except for the explanatory variable.

Who Participates in Randomized Experiments?
Participants in randomized experiments are often volunteers.
Remember the Fundamental Rule: available data can be used to make inferences about a much larger group if the data can be considered to be representative with regard to the question(s) of interest.

Randomization: The Crucial Element
Randomizing the Type of Treatment
Randomly assigning the treatments to the experimental units keeps the researchers from making assignments favorable to their hypotheses and also helps protect against hidden or unknown biases.
Randomizing the Order of Treatments
If all treatments are applied to each unit, randomization should be used to determine the order in which they are applied. Example: a taste-testing experiment of a soda.
[Figure: randomly assigning treatments to volunteers in a test of the effect of alcohol and marijuana]

Case Study 4.2 Kids and Weight Lifting
Is weight training good for children? If so, is it better to lift heavy weights for few repetitions or moderate weights more times?
Experiment involving 43 young volunteers.
[Group descriptions partly illegible in this copy: a heavy-load exercise group, a moderate-load exercise group, and a control group]
Leg extension strength significantly increased in both exercise groups compared with the control subjects (Faigenbaum et al., 1999, p. 25).

Control Groups, Placebos, and Blinding
Control groups: handled identically to the treatment group in all respects, except that they don't receive the active treatment. Sometimes they receive a dummy treatment or a standard or existing treatment.
Placebo: looks like the real drug but has no active ingredient.
Placebo effect: people respond to placebos.
Blinding:
Single-blind: participants do not know which treatment they have received.
Double-blind: neither participant nor researcher making measurements knows who had which treatment.
Double dummy: each group is given two treatments. Group 1: real treatment 1 and placebo treatment 2. Group 2: placebo treatment 1 and real treatment 2.

Example of a Double Dummy Experiment
Determining the effect of nicotine patch vs. nicotine gum on helping people quit smoking. Subjects are assigned to one of three groups:
Treatment group: nicotine patch and placebo gum
Treatment group: nicotine gum and placebo patch
Control group: placebo gum and placebo patch

Pairing and Blocking
Matched-Pair Designs
Use either two matched individuals, or the same individual receives each of two treatments. Special case of a block design. Important to randomize the order of the two treatments and use blinding if possible.
Block Designs
Experimental units are divided into homogeneous groups called blocks; each treatment is randomly assigned to one or more units in each block.
If the blocks are individuals and the units are repeated time periods in which they receive varying treatments, these are called repeated-measures designs.

Case Study 4.3 Quitting Smoking with Nicotine Patches
"After the eight-week period of patch use, almost half (46%) of the nicotine group had quit smoking, while only one-fifth (20%) of the placebo group had." Newsweek, March 9, 1993, p. 62
Double-blind, placebo-controlled, randomized experiment.
240 smokers recruited (volunteers).
Randomized to 22-mg nicotine patch or placebo (controlled) patch for 8 weeks.
Double-blind: neither the participants nor the nurses taking the measurements knew who had received the active nicotine patches.

Example of Blocking
We want to determine the effect of two types of fertilizers on growth of tomatoes. We happen to have three types of tomatoes, so we block on tomato type.
[Figure: Fertilizer I and Fertilizer II applied within each tomato-type block]

Types of Observational Studies
Retrospective: participants are asked to recall past events.
Prospective: participants are followed into the future and events are recorded.

Case-Control Studies
"Cases" who have a particular attribute or condition are compared to "controls" who do not, to see how they differ on an explanatory variable of interest.
Advantages: efficiency and reduction of potential confounding variables through careful choice of controls.

Case Study 4.4 Baldness and Heart Attacks
"Men with typical male pattern baldness are anywhere from 30 to 300 percent more likely to suffer a heart attack than men with little or no hair loss at all." Newsweek, March 9, 1993, p. 62
Case-control study: cases = men admitted to hospital with heart attack; controls = men admitted for other reasons.
Explanatory variable: heart attack status (yes or no).
Response variable: degree of baldness.
[Sample sizes of cases and controls not legible in this copy]

4.4 Difficulties and Disasters in Experiments and Observational Studies
Confounding Variables and the Implication of Causation in Observational Studies
Big misinterpretation: reporting a cause-and-effect relationship based on an observational study. There is no way to separate the role of confounding variables from the role of explanatory variables in producing the outcome variable if randomization is not used.

Extending Results Inappropriately
Many studies use
convenience samples or volunteers. Need to assess whether the results can be extended to any larger group for the question of interest.

Interacting Variables
A second variable can interact with the explanatory variable in its relationship with the outcome variable. Results should be reported taking the interaction into account.
Example: Interaction in Case Study 3.3
The difference between the nicotine and placebo patches is greater when there are no smokers in the home than when there are smokers in the home.

Hawthorne and Experimenter Bias
Hawthorne effect: participants in an experiment respond differently than they otherwise would, just because they are in the experiment. Many treatments have a higher success rate in clinical trials than in actual practice.
Experimenter effects: recording data to match the desired outcome, treating subjects differently, etc. Most are overcome by blinding and control groups.

Ecological Validity and Generalizability
When variables have been removed from their natural setting and are measured in the laboratory or in some other artificial setting, the results may not reflect the impact of the variable in the real world.
Example 3.7 Real Smokers with a Desire to Quit (Case Study 3.3)
Ensured ecological validity and generalizability by using participants around the country with a wide range of ages, and recorded many other variables and checked that they were not related to the patch assignment or the response variable.

Relying on Memory or Secondhand Sources
Can be a problem in retrospective observational studies. Try to use authoritative sources such as medical records rather than rely on memory. If possible, use prospective observational studies.

In-class exercises, Sections 4.3-4.4, pages 145-147: 34, 42, 49

Chapter 2 Turning Data Into Information

2.1 Raw Data
Raw data are numbers and category labels that have been collected but have not yet been processed in any way.
When measurements are taken from a subset of a population, they represent sample data.
When all individuals in a population are measured, the measurements represent population data.
Descriptive statistics: summary numbers for either a population or a sample.

2.2 Types of Data
[Diagram: data divided into Categorical and Quantitative]
Raw
data from categorical variables consist of group or category names that don't necessarily have a logical ordering. Examples: eye color, country of residence.
Categorical variables for which the categories have a logical ordering are called ordinal variables. Examples: highest educational degree earned, tee shirt size (S, M, L, XL).
Raw data from quantitative variables consist of numerical values taken on each individual. Examples: height, number of siblings.
Discrete variables are those whose possible values are countable. Example: number of siblings.
Continuous variables are those that take on values in intervals. Example: height.
Sometimes the type of the variable depends on the way it is being observed. For example, the following two questions about income lead to two different types of variable:
State your annual income in dollars.
Is your income (1) between $20,000 and $40,000, (2) between $40,000 and $60,000, (3) above $60,000?

Asking the Right Questions: One Categorical Variable
Question 1a: How many and what percentage of individuals fall into each category?
Example: What percentage of college students favor the legalization of marijuana, and what percentage of college students oppose legalization of marijuana?
Question 1b: Are individuals equally divided across categories, or do the percentages across categories follow some other interesting pattern?
Example: When individuals are asked to choose a number from 1 to 10, are all numbers equally likely to be chosen?

Asking the Right Questions: Two Categorical Variables
Question 2a: Is there a relationship between the two variables, so that the category into which individuals fall for one variable seems to depend on which category they are in for the other variable?
Example: In Case Study 1.6, we asked if the risk of having a heart attack was different for the physicians who took aspirin than for those who took a placebo.
Question 2b: Do some combinations of categories stand out because they provide information that is not found by examining the categories separately?
Example: The relationship between smoking and lung cancer was detected, in part, because someone noticed that the combination of being a nonsmoker and having lung cancer is unusual.

Asking the Right
Questions: One Quantitative Variable
Question 3a: What are the interesting summary measures, like the average or the range of values, that help us understand the collection of individuals who were measured?
Example: What is the average handspan measurement, and how much variability is there in handspan measurements?
Question 3b: Are there individual data values that provide interesting information because they are unique or stand out in some way (outliers)?
Example: What is the oldest recorded age of death for a human? Are there many people who have lived nearly that long, or is the oldest recorded age a unique case?

Asking the Right Questions: One Categorical and One Quantitative Variable
Question 4a: Are the measurements similar across categories?
Example: Do men and women drive at the same "fastest speeds" on average?
Question 4b: When the categories have a natural ordering (an ordinal variable), does the measurement variable increase or decrease, on average, in that same order?
Example: Do high school dropouts, high school graduates, college dropouts, and college graduates have increasingly higher average incomes?

Asking the Right Questions: Two Quantitative Variables
Question 5a: If the measurement on one variable is high (or low), does the other one also tend to be high (or low)?
Example: Do taller people also tend to have larger handspans?
Question 5b: Are there individuals whose combination of data values provides interesting information because that combination is unusual?
Example: An individual who has a very low IQ score but can [...] quickly. Neither value alone need stand out as low or high, but it is the combination that is interesting.

Explanatory and Response Variables
When there is a question about the relationship between two variables, it is useful to identify one variable as the explanatory variable and the other variable as the response variable.
In general, the value of the explanatory variable for an individual is thought to partially explain the value of the response variable for that individual.
Example: The relationship between the dosage of a blood pressure lowering
drug and the reduction in blood pressure of a patient within 30 minutes is of interest.
Response variable: blood pressure of the patient after 30 minutes.
Explanatory variable: the dosage of the drug.

2.3 Summarizing One or Two Categorical Variables
Numerical Summaries
Count how many fall into each category.
Calculate the percent in each category.
If two variables, have the categories of the explanatory variable define the rows and compute row percentages.

Example 2.1 Importance of Order
Survey of 190 students. About half (92) were given the question "Randomly pick a letter: S or Q." Note: 66% picked the first choice of S.
The other half (98) were given the question "Randomly pick a letter: Q or S." Note: 54% picked the first choice of Q.

Order of letters on form    Picked S     Picked Q     Total
S or Q                      61 (66%)     31 (34%)     92
Q or S                      45 (46%)     53 (54%)     98
Total                       106 (56%)    84 (44%)     190

Example 2.2 Lighting the Way to Nearsightedness
Survey of n = 479 children.
Those who slept with a nightlight or in a fully lit room before age 2 had a higher incidence of nearsightedness (myopia).
Note: the study does not prove that sleeping with light actually caused myopia in more children.

Using Minitab: Summarizing One Categorical Variable
Stat > Tables > Tally Individual Variables
Minitab output:
Results for: pennstate1.MTW
Tally for Discrete Variables: SQpick

SQpick   Count   Percent
Q           84     44.21
S          106     55.79
N=         190

Using Minitab: Summarizing Two Categorical Variables
Stat > Tables > Cross Tabulation and Chi-Square
Tabulated statistics: SQpick, Form
Rows: SQpick   Columns: Form

        Q or S   S or Q      All
Q           53       31       84
         63.10    36.90   100.00
S           45       61      106
         42.45    57.55   100.00
All         98       92      190
         51.58    48.42   100.00

In-class exercises from Sections 2.1 & 2.2, page 58: 1, 5, 11, 16

Visual Summaries for Categorical Variables
Pie charts: useful for summarizing a single categorical variable if there are not too many categories.
Bar graphs: useful for summarizing one or two categorical variables, and particularly useful for making comparisons when there are two categorical variables.

Example 2.3 Humans Are Not
Good Randomizers
Survey of n = 190 college students: "Randomly pick a number between 1 and 10."
[Figure: chart of the numbers picked by the students]
Results: most chose 7; very few chose 1 or 10.

Example 2.4 Revisiting Nightlights and Nearsightedness
Survey of n = 479 children.
Response: degree of myopia.
Explanatory: amount of sleeptime lighting.
[Figure: bar chart of myopia by amount of sleeptime lighting in infancy]

Minitab Pie Chart
Graph > Pie Chart > Tally individual variables
[Figure: Minitab pie chart output]
Minitab Bar Graphs
[Figure: Minitab bar graph output]

2.4 Finding Information in Quantitative Data
A long list of numbers needs to be organized to obtain answers to questions of interest.
[Table 2.4: stretched right handspans (cm) of college students, listed separately for males and females; values not legible in this copy]

Five-Number Summaries
Find the extremes (high, low), the median, and the quartiles (medians of the lower and upper halves of the values).
Quick overview of the data values.
Information about the center, spread, and shape of the data.

Median of a set of data is the middle value when the data set is arranged in increasing or decreasing order.
How to find the median:
a. Arrange the data from smallest to largest.
b. If n is odd, then the $\frac{n+1}{2}$-th observation in the ordered data set is the median.
c. If n is even, then the mean of the $\frac{n}{2}$-th and $\left(\frac{n}{2}+1\right)$-th observations in the ordered data set is the median.

Quartiles are numbers that approximately divide the ordered data into quarters.
A first quartile is denoted by Q1. At least 25% of the observations are less than or equal to Q1, and at least 75% of the data are greater than or equal to Q1.
A third quartile is denoted by Q3. At least 75% of the observations are less than or equal to Q3, and at least 25% of the data are greater than or equal to Q3.
To calculate the quartiles:
1. Locate the median M of the observations.
2. The first quartile Q1 is the median of the observations that are less than M.
3. The third quartile Q3 is
the median of the observations that are greater than M.
[Examples: an odd and an even number of observations with Q1, M, and Q3 marked; data values not legible in this copy]
About 25% of handspans of females are between 12.5 and 19.0 centimeters, about 25% are between 19 and 20 cm, about 25% are between 20 and 21 cm, and about 25% are between 21 and 23.25 cm.

Interesting Features of Quantitative Variables
Location: center or average, e.g. median.
Spread: variability, e.g. difference between two extremes or two quartiles.
Shape: later in Section 2.5.

Outliers and How to Handle Them
Outlier: a data point that is not consistent with the bulk of the data.
Look for them via graphs.
Can have a big influence on conclusions.
Can cause complications in some statistical analyses.
Cannot discard without justification.

Example 2.6 Ages of Death of U.S. First Ladies
Partial data listing and five-number summary.
[Table: partial listing of first ladies with ages at death, and the five-number summary; values not legible in this copy]
Extremes are more interesting here:
Who died at 34? Martha Jefferson.
Who lived to be 97? Bess Truman.

Possible Reasons for Outliers and Reasonable Actions
Mistake made while taking the measurement or entering it into the computer. If verified, should be discarded/corrected.
Individual in question belongs to a different group than the bulk of individuals measured. Values may be discarded if a summary is desired and reported for the majority group only.
Outlier is a legitimate data value and represents natural variability for the group and variable(s) measured. Values may not be discarded; they provide important information about location and spread.

Example 2.7 Tiny Boatmen
Weights (in pounds) of 18 men on crew teams:
Cambridge: 188.5, 183.0, 194.5, 185.0, 214.0, 203.5, 186.0, 178.5, 109.0
Oxford: 186.0, 184.5, 204.0, 184.5, 195.5, 202.5, 174.0, 183.0, 109.5
Note: the last weight in each list is unusually small. They are the coxswains for their teams, while the others are rowers.

2.5 Pictures for Quantitative Data
Histograms: similar to bar graphs, used for any number of data values.
Stem-and-leaf plots and dotplots: present all individual values, useful
for small to moderate-sized data sets
- Boxplot or box-and-whisker plot: useful summary for comparing two or more groups

Interpreting Histograms, Stemplots, and Dotplots
- Values are centered around 20 cm.
- Two possible low outliers.
- Apart from outliers, spans range from about 16 to 23 cm.

Creating a Histogram
Step 1: Decide how many equally spaced (same width) intervals to use for the horizontal axis. Between 6 and 15 intervals is a good number.
Step 2: Decide whether to use frequencies (count) or relative frequencies (proportion) on the vertical axis.
Step 3: Draw equally spaced intervals on the horizontal axis covering the range of the data. Determine the frequency or relative frequency of data values in each interval and draw a bar with corresponding height. Decide on a rule to use for values that fall on the border between two intervals.

Histogram
Example: A marketing consultant observed 50 shoppers at a grocery store. One variable of interest was how much each shopper spent in the store. Here are the data (in dollars), arranged in increasing order:

2.32, 6.61, 6.90, 8.04, 9.45, 10.26, 11.34, 11.63,
12.66, 12.95, 13.67, 13.72, 14.35, 14.52, 14.55, 15.01,
15.33, 16.55, 17.15, 18.22, 18.30, 18.71, 19.54, 19.55,
20.58, 20.89, 20.91, 21.13, 23.85, 26.04, 27.07, 28.76,
29.15, 30.54, 31.99, 32.82, 33.26, 33.80, 34.76, 36.22,
37.52, 39.28, 40.80, 43.97, 45.58, 52.36, 61.57, 63.85,
64.30, 69.49

Histogram
A frequency table summarizes these data as follows:

| Class | Frequency | Relative frequency |
|---|---|---|
| 2.32 to 12.32 | 8 | 0.16 |
| 12.32 to 22.32 | 20 | 0.40 |
| 22.32 to 32.32 | 7 | 0.14 |
| 32.32 to 42.32 | 8 | 0.16 |
| 42.32 to 52.32 | 2 | 0.04 |
| 52.32 to 62.32 | 2 | 0.04 |
| 62.32 to 72.32 | 3 | 0.06 |
| Total | 50 | 1.00 |

- Each of the intervals in the first column is called a measurement class.
- Observed values that fall on the boundaries of the measurement classes should consistently go into either the lower or the upper subinterval.
- The number of measurement classes for a data set should be chosen so that the least amount of information is lost while the data are effectively summarized. Too few classes summarize the data too much. Too many classes do not summarize the data effectively.

Histogram: Minitab
Graph > Histogram
A few notes: to set the number of bins, follow these steps:
- Select the histogram bars by clicking on one of the bars.
- Right-click on the graph and select Edit bars.
- Choose the binning tab.
- Type in the number of bins desired.

Histogram
[Figure: three histograms of the shopping data]

Creating a Dotplot
A dotplot displays a dot for each observation along a number line. If there are multiple occurrences of an observation, or if observations are too close together, then dots will be stacked vertically. If there are too many points to fit vertically in the graph, then each dot may represent more than one point. (Minitab Release 12.1, 1998)

Creating a Stem-and-Leaf Plot
Step 1: Determine stem values. The stem contains all but the last of the displayed digits of a number. Stems should define equally spaced intervals.
Step 2: For each individual, attach a leaf to the appropriate stem. A leaf is the last of the displayed digits of a number. Often leaves are ordered on each stem.
Note: There is more than one way to define stems. You can use split stems or truncate/round values first.

Stem-and-leaf plots
For a given number, the stem consists of all but the final (rightmost) digit. The leaf consists of the final digit. A leaf digit unit (LDU) determines the location of the decimal place.

Example:

| Number | Stem | Leaf | LDU |
|---|---|---|---|
| 34.75 | 347 | 5 | 0.01 |
| 3475 | 347 | 5 | 1 |

Example: Stem-and-leaf display of the shopping data (values were truncated to integer numbers), LDU = 1:

| Stem | Leaves |
|---|---|
| 0 | 26689 |
| 1 | 0112233444556788899 |
| 2 | 000136789 |
| 3 | 012334679 |
| 4 | 035 |
| 5 | 2 |
| 6 | 1349 |

In Minitab, use Graph > Stem-and-Leaf to get a stem-and-leaf display. Increment can be used to set the scale.

Example: Big Music Collection
About how many CDs do you own? Estimated music CDs owned for n = 24 Penn State students.
[Stem-and-leaf plot; legible rows: 0 | 001222233, 0 | 55569, 1 | 002, 1 | 5, 3 | 0, 4 | 5]
- Stem is 100s and leaf unit is 10s.
- Final digit is truncated.
- Numbers ranged from 0 to about 450, with 450 being a clear outlier and most values ranging from 0 to 99.

Describing Shape
Can you think of everyday
life data that may follow each of the following shapes?
[Figure panels: (a) Bell-shaped, (b) Triangular, (c) Uniform (or rectangular), (d) Reverse J-shaped, (e) J-shaped, (f) Right skewed, (g) Left skewed, (h) Bimodal, (i) Multimodal]

Shapes to look for:
- Symmetric, bell-shaped
- Symmetric, not bell-shaped
- Skewed right: values trail off to the right
- Skewed left: values trail off to the left

Picturing Location and Spread with Boxplots
Boxplots for right handspans of males and females.
- Box covers the middle 50% of the data.
- Line within box marks the median.
- Possible outliers are marked with an asterisk.
- Apart from outliers, lines extending from the box reach to the min and max values.

How to Draw a Boxplot of a Quantitative Variable
Step 1: Label either a vertical axis or a horizontal axis with numbers from min to max of the data.
Step 2: Draw a box with lower end at Q1 and upper end at Q3.
Step 3: Draw a line through the box at the median M.
Step 4: Draw a line from the Q1 end of the box to the smallest data value that is not further than 1.5 × IQR from Q1. Draw a line from the Q3 end of the box to the largest data value that is not further than 1.5 × IQR from Q3.
Step 5: Mark data points further than 1.5 × IQR from either edge of the box with an asterisk. Points represented with asterisks are considered to be outliers.

How to Draw a Boxplot
[Figure: example boxplot]

Exercises
Pages 58-63
- Sections 2.1 & 2.2
- Sections 2.3 & 2.4: 26, 29a, 33, 34, 37

Notation for Raw Data
- n = number of individuals in a data set
- $x_1, x_2, x_3, \ldots, x_n$ represent individual raw data values

Example: A data set consists of handspan values in centimeters for six females; the values are 21, 19, 20, 20, 22, and 19. Then n = 6, $x_1 = 21$, $x_2 = 19$, $x_3 = 20$, $x_4 = 20$, $x_5 = 22$, and $x_6 = 19$.

2.5 Numerical Summaries of Quantitative Data

Describing the Location of a Data Set
- Mean: the numerical average
- Median: the middle value (if n is odd) or the average of the middle two values (if n is even)

Symmetric: mean = median
Skewed left: mean < median
Skewed right: mean > median

Describing the Location
[Figure: comparison of mean and median for various shapes. Symmetric: mean = median. Left-skewed: mean < median. Right-skewed: mean > median.]
Determining the Mean and Median
The Mean: $\bar{x} = \frac{\sum x_i}{n}$, where $\sum x_i$ means "add together all the values."
The Median: see slide 28.

Example 2.9 Will "Normal" Rainfall Get Rid of These Odors?
Data: average rainfall (inches) for Davis, California, for 47 years.
Mean = 18.69 inches
Median = 16.72 inches
In 1997-98, a company with an odor problem blamed it on excessive rain. That year rainfall was 29.69 inches. More rain occurred in 4 other years.
[Figure: annual rainfall in inches]

The Influence of Outliers on the Mean and Median
- Larger influence on the mean than on the median.
- High outliers will increase the mean.
- Low outliers will decrease the mean.

If ages at death are 70, 72, 74, 76, and 78, then mean = median = 74 years.
If ages at death are 35, 72, 74, 76, and 78, then median = 74 but mean = 67 years.

Describing Spread: Range and Interquartile Range
- Range = high value - low value
- Interquartile Range (IQR) = upper quartile - lower quartile
- Standard Deviation (covered later in Section 2.7)

Example 2.10 Fastest Speeds Ever Driven
Five-number summary for 87 males:
- Median: 110
- Quartiles: 95, 120
- Extremes: 55, 150

- Median = 110 mph measures the center of the data.
- Two extremes describe spread over 100% of the data: Range = 150 - 55 = 95 mph.
- Two quartiles describe spread over the middle 50% of the data: Interquartile Range = 120 - 95 = 25 mph.

Example 2.10 Fastest Speeds (cont.)
[Ordered data for the 87 males, from 55 to 150 mph]
- Median = (87 + 1)/2 = 44th value in the list = 110 mph
- Q1 = median of the 43 values below the median = (43 + 1)/2 = 22nd value from the start of the list = 95 mph
- Q3 = median of the 43 values above the median = (43 + 1)/2 = 22nd value from the end of the list = 120 mph

Notation and Finding the Quartiles (also see slide 30)
Split the ordered values into the half that is below the median and the half that is above the median.
Q1 =
lower quartile = median of the data values that are below the median.
Q3 = upper quartile = median of the data values that are above the median.

Percentiles
The kth percentile is a number that has k% of the data values at or below it and (100 - k)% of the data values at or above it.
- Lower quartile = 25th percentile
- Median = 50th percentile
- Upper quartile = 75th percentile

In-class exercises, Sections 2.5 & 2.6: 60, 64, 65, 71

2.7 Bell-Shaped Distributions of Numbers
Many measurements follow a predictable pattern:
- Most individuals are clumped around the center.
- The greater the distance a value is from the center, the fewer individuals have that value.

Variables that follow such a pattern are said to be bell-shaped. A special case is called a normal distribution or normal curve.

Example 2.11 Bell-Shaped British Women's Heights
Data: representative sample of 199 married British couples.
[Figure: histogram of the wives' heights with a normal curve superimposed]
The mean height is 1602 millimeters.

Describing Spread with Standard Deviation
Standard deviation measures variability by summarizing how far individual data values are from the mean. Think of the standard deviation as roughly the average distance values fall from the mean.

Describing Spread with Standard Deviation

| Numbers | Mean | Standard deviation |
|---|---|---|
| 100, 100, 100, 100, 100 | 100 | 0 |
| 90, 90, 100, 110, 110 | 100 | 10 |

Both sets have the same mean of 100.
- Set 1: all values are equal to the mean, so there is no variability at all.
- Set 2: one value equals the mean and the other four values are 10 points away from the mean, so the average distance away from the mean is about 10.

Calculating the Standard Deviation
Formula for the sample standard deviation:

$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

The value of $s^2$ is called the sample variance. An equivalent formula, easier to compute, is:

$s = \sqrt{\frac{\sum x_i^2 - n\bar{x}^2}{n - 1}}$

Calculating the Standard Deviation
Step 1: Calculate $\bar{x}$, the sample mean.
Step 2: For each observation, calculate the difference between the data value and the mean.
Step 3: Square each difference in step 2.
Step 4: Sum the squared
differences in step 3, and then divide this sum by n - 1.
Step 5: Take the square root of the value in step 4.

Calculating the Standard Deviation
Consider four pulse rates: 62, 68, 74, 76.

Step 1: $\bar{x} = \frac{62 + 68 + 74 + 76}{4} = 70$

Steps 2 and 3:

| Value | Value − mean | (Value − mean)² |
|---|---|---|
| 62 | −8 | 64 |
| 68 | −2 | 4 |
| 74 | 4 | 16 |
| 76 | 6 | 36 |
| Sum | 0 | 120 |

Step 4: $s^2 = \frac{120}{4 - 1} = 40$

Step 5: $s = \sqrt{40} \approx 6.3$

Population Standard Deviation
Data sets usually represent a sample from a larger population. If the data set includes measurements for an entire population, the notation is different, and the formula for the standard deviation is also slightly different. A population mean is represented by the symbol $\mu$ ("mu"), and the population standard deviation is

$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$

Interpreting the Standard Deviation for Bell-Shaped Curves: The Empirical Rule
For any bell-shaped curve, approximately
- 68% of the values fall within 1 standard deviation of the mean in either direction,
- 95% of the values fall within 2 standard deviations of the mean in either direction,
- 99.7% of the values fall within 3 standard deviations of the mean in either direction.

The Empirical Rule, the Standard Deviation, and the Range
- The Empirical Rule implies that the range from the minimum to the maximum data values equals about 4 to 6 standard deviations for data with an approximate bell shape.
- You can get a rough idea of the value of the standard deviation by dividing the range by 6: $s \approx \frac{\text{Range}}{6}$

Example 2.11 Women's Heights (cont.)
Mean height for the 199 British women is 1602 mm and the standard deviation is 62.4 mm.
- 68% of the 199 heights would fall in the range 1602 ± 62.4, or 1539.6 to 1664.4 mm.
- 95% of the heights would fall in the interval 1602 ± 2(62.4), or 1477.2 to 1726.8 mm.
- 99.7% of the heights would fall in the interval 1602 ± 3(62.4), or 1414.8 to 1789.2 mm.

Example 2.11 Women's Heights (cont.)
Summary of the actual results:
[Table: Empirical Rule percentages compared with the actual percentages of heights within 1, 2, and 3 standard deviations of the mean]

Note: The minimum height is 1410 mm and the maximum height is 1760 mm, for a range of 1760 - 1410 = 350 mm. So an estimate of the standard deviation is $\frac{\text{Range}}{6} = \frac{350}{6} \approx 58.3$ mm.
Standardized z-Scores
Standardized score or z-score:

$z = \frac{\text{Observed value} - \text{Mean}}{\text{Standard deviation}}$

Example: Mean resting pulse rate for adult men is 70 beats per minute (bpm), and the standard deviation is 8 bpm. The standardized score for a resting pulse rate of 80 is

$z = \frac{80 - 70}{8} = 1.25$

A pulse rate of 80 is 1.25 standard deviations above the mean pulse rate for adult men.

The Empirical Rule Restated
For bell-shaped data:
- About 68% of the values have z-scores between -1 and +1.
- About 95% of the values have z-scores between -2 and +2.
- About 99.7% of the values have z-scores between -3 and +3.

In-class exercises, Section 2.7, pages 65-66: 78a, 79a, 84, 94
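The five calculation steps for the sample standard deviation, the z-score formula, and the Empirical Rule intervals can be checked with a few lines of code. The sketch below is only an illustration: the function names (`sample_sd`, `z_score`, `empirical_rule_intervals`) are made up for this example, and it assumes the women's height figures read as a mean of 1602 mm and a standard deviation of 62.4 mm.

```python
import math


def sample_sd(values):
    """Sample standard deviation, following the five steps in the notes."""
    n = len(values)
    mean = sum(values) / n                    # Step 1: sample mean
    diffs = [x - mean for x in values]        # Step 2: value minus mean
    squared = [d ** 2 for d in diffs]         # Step 3: square each difference
    variance = sum(squared) / (n - 1)         # Step 4: sum, divide by n - 1
    return math.sqrt(variance)                # Step 5: square root


def z_score(value, mean, sd):
    """Standardized score: (observed value - mean) / standard deviation."""
    return (value - mean) / sd


def empirical_rule_intervals(mean, sd):
    """Intervals mean +/- k*sd for k = 1, 2, 3 (about 68%, 95%, 99.7%)."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}


# Pulse-rate example from the notes: 62, 68, 74, 76
pulse_sd = sample_sd([62, 68, 74, 76])

# Resting pulse of 80 with mean 70 and standard deviation 8
pulse_z = z_score(80, 70, 8)

# Women's heights, assuming mean 1602 mm and standard deviation 62.4 mm
height_intervals = empirical_rule_intervals(1602, 62.4)

print(round(pulse_sd, 1))
print(pulse_z)
for k, (low, high) in height_intervals.items():
    print(k, round(low, 1), round(high, 1))
```

Dividing by `n - 1` in step 4 makes this the sample version; for an entire population the divisor would be `n` instead.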

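The quartile rule from these notes (Q1 and Q3 are the medians of the values below and above the median) and the 1.5 × IQR boxplot rule can be sketched in code as well. This is a minimal illustration, not a standard library routine: the helper names are hypothetical, the function expects at least two values, and the Cambridge crew weights are used under the assumption that they read as 188.5, 183.0, and so on, with one decimal place.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                        # values below the median position
    # For odd n the median itself is left out of both halves.
    upper = data[half + 1:] if n % 2 else data[half:]
    return data[0], median(lower), median(data), median(upper), data[-1]


def boxplot_outliers(values):
    """Values further than 1.5 x IQR from either edge of the box."""
    _, q1, _, q3, _ = five_number_summary(values)
    step = 1.5 * (q3 - q1)
    return [x for x in values if x < q1 - step or x > q3 + step]


# Cambridge crew weights in pounds (decimal placement is an assumption)
cambridge = [188.5, 183.0, 194.5, 185.0, 214.0, 203.5, 186.0, 178.5, 109.0]

summary = five_number_summary(cambridge)
outliers = boxplot_outliers(cambridge)

print(summary)
print(outliers)
```

Statistical software often uses slightly different quartile conventions, so its Q1 and Q3 may differ a little from the median-of-halves values computed here.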