# Applied Statistical Methods (STAT1000)

These 290 pages of class notes were uploaded by Josefa Cartwright Jr. on Monday, October 26, 2015. The notes belong to STAT1000 at the University of Pittsburgh, taught by Nancy Pfenning in Fall.

Date Created: 10/26/15

## Lecture 20

Nancy Pfenning, Stats 1000

### Standardized Statistics

Recall: If the underlying population variable $X$ is normal with mean $\mu$ and standard deviation $\sigma$, then for a random sample of size $n$, the random variable $\bar{X}$ is normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. We used this fact to transform $\bar{X}$ to a standard normal random variable $Z$, and solved for probabilities with normal tables:

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

$Z$ is normal with mean 0 and standard deviation 1. Note that the spread of $Z$ is always 1, regardless of sample size.

In situations involving a large sample size $n$, sample standard deviation $s$ is approximately equal to $\sigma$, and we can treat $\frac{\bar{X} - \mu}{s/\sqrt{n}}$ as approximately a standard normal variable $Z$.

If sample size $n$ is small, $s$ may be quite different from $\sigma$, and the random variable

$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$

which we call $t$, does not follow a standard normal distribution. Because of subtracting the expected value of $\bar{X}$ (that is, $\mu$) from $\bar{X}$ in the numerator, the distribution of $t$ is, like $Z$, centered at zero and symmetric. Because of dividing by $s/\sqrt{n}$, which is not the standard deviation of $\bar{X}$, the standard deviation of $t$ is not fixed at 1 as it is for $Z$. Sample standard deviation $s$ contains less information than $\sigma$, so the spread of $t$ is greater than that of $Z$, especially for small sample sizes $n$. Since $s$ approaches $\sigma$ as sample size $n$ increases, the $t$ distribution approaches the standard normal $Z$ distribution as $n$ increases.

Thus, the spread of sample mean standardized using $s$ instead of $\sigma$ depends on the sample size $n$. We say the distribution has $n - 1$ degrees of freedom, abbreviated df. Since there are many different $t$ distributions (one for each df), it would take too much space to provide tables for each of them in as much detail as was provided for the standard normal $z$ in Table A.1. Instead, $t$ tables are condensed to provide the minimal adequate information needed to state useful results.

### Statistical Inference

Statistical inference is the process of inferring something about a larger group (the population) by analyzing data for a part of that group (the sample). There are two general
forms of statements we make using statistical inference: (1) confidence intervals and (2) significance tests. We use these forms of inference in order to answer questions about (a) a population proportion $p$ for categorical data or (b) the population mean $\mu$ for quantitative data. In addition, we can use significance tests to answer questions about relationships between two variables, such as the chi-square test of a relationship between two categorical variables. The chi-square statistic,

$$\text{chi-square} = \text{sum of all } \frac{(\text{observed} - \text{expected})^2}{\text{expected}},$$

is another standardized statistic that follows a known pattern, with values and probabilities that can be summarized in a table.

1. Confidence Interval Questions

   (a) for $p$: In May 2000, 56% of 1012 respondents to an Associated Press survey supported gays' rights to inherit from their partners. What interval should contain the proportion of all Americans who support gays' rights to inherit? How confident can we be that this interval contains the true proportion $p$?

   (b) for $\mu$: A random sample of 25 laboratory mice from a large colony was found to have mean weight 33 grams and standard deviation 5 grams. Within what interval does mean weight for all colony mice lie? How confident can we be about the correctness of this interval?

2. Significance Test Questions

   (a) for $p$: In May 2000, 56% of 1012 respondents to an Associated Press survey supported gays' rights to inherit from their partners. Can we conclude that a majority of the population support gays' rights to inherit?

   (b) for $\mu$: Researchers are going under the assumption that their lab mice weigh an average of 30 grams, but an assistant feels they actually weigh more. She takes an SRS of 25 mice and finds their mean weight to be 33 grams. Somehow it is known that weights of all mice in the lab vary normally with standard deviation 5 grams. If the mean weight were really only 30 grams, how unlikely would it be to get a sample of 25 whose mean weight is as high as 33 grams?

The laws of probability will enable us to answer such questions with precision. But these laws are
inapplicable and useless if our data have not been produced correctly. For example, maybe the lab assistant's selection was biased towards slower, heavier mice, or maybe it was biased towards smaller, cuter mice. The sample must be chosen at random, in such a way that it serves as an adequate representative of the entire population. The reliability of our conclusions still depends on conscientious adherence to the basic principles of statistical design presented in Chapters 3 and 4.

### Chapter 10: Estimating Proportions With Confidence

#### Probability vs. Confidence

Recall: our Rules for Sample Proportions stated that if numerous samples or repetitions of the same size are taken, sample proportion $\hat{p}$ has mean $p$ (the true proportion for the population), standard deviation $\sqrt{\frac{p(1-p)}{n}}$, and a shape that is approximately normal, as long as $np \ge 10$ and $n(1-p) \ge 10$.

Because of approximate normality, we can invoke the Empirical Rule. It tells us that the approximate probability is

- 68% that $\hat{p}$ falls within $1\sqrt{\frac{p(1-p)}{n}}$ of $p$;
- 95% that $\hat{p}$ falls within $2\sqrt{\frac{p(1-p)}{n}}$ of $p$;
- 99.7% that $\hat{p}$ falls within $3\sqrt{\frac{p(1-p)}{n}}$ of $p$.

If $\hat{p}$ falls within $\sqrt{\frac{p(1-p)}{n}}$ of $p$, then $p$ must fall within $\sqrt{\frac{p(1-p)}{n}}$ of $\hat{p}$. Similarly, if $\hat{p}$ falls within $2\sqrt{\frac{p(1-p)}{n}}$ of $p$, then $p$ must fall within $2\sqrt{\frac{p(1-p)}{n}}$ of $\hat{p}$, etc.

But $p$ is not a random variable like $\hat{p}$: its value is not a numerical outcome of a random phenomenon, but fixed and unchanging, even if we don't happen to know what it is. Thus, we cannot talk about the probability of $p$ lying in a certain interval. Instead, if we take a sample of size $n$ from a population and record the sample proportion $\hat{p}$ in the category of interest, we can be approximately 68% confident that the interval $\hat{p} \pm \sqrt{\frac{p(1-p)}{n}}$ contains the unknown population proportion $p$.

Notice that the standard deviation of $\hat{p}$ is $\sqrt{\frac{p(1-p)}{n}}$. Since $p$ is unknown, this standard deviation cannot be known either, so we estimate it by substituting $\hat{p}$ for $p$: the standard error of $\hat{p}$ is

$$\text{s.e.}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.$$

In general, a standard error is calculated from the sample as an estimate of the standard deviation of a sample statistic.
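As a rough illustration of the standard error and the Empirical Rule, the following Python sketch computes $\text{s.e.}(\hat{p})$ and the approximate 68%, 95%, and 99.7% intervals. The function names `standard_error` and `empirical_rule_intervals` are hypothetical, chosen only for this example; the inputs (56% of 1012 respondents) come from the Associated Press survey question posed earlier in these notes.

```python
import math


def standard_error(p_hat, n):
    """Estimate the standard deviation of a sample proportion by
    substituting p_hat for the unknown p."""
    return math.sqrt(p_hat * (1 - p_hat) / n)


def empirical_rule_intervals(p_hat, n):
    """Approximate 68%, 95%, and 99.7% intervals: p_hat plus or minus
    1, 2, or 3 standard errors (Empirical Rule multipliers)."""
    se = standard_error(p_hat, n)
    multipliers = {"68%": 1, "95%": 2, "99.7%": 3}
    return {level: (p_hat - k * se, p_hat + k * se)
            for level, k in multipliers.items()}


if __name__ == "__main__":
    # Survey figures quoted in the notes: 56% of 1012 respondents.
    p_hat, n = 0.56, 1012
    print(f"standard error: {standard_error(p_hat, n):.4f}")
    for level, (low, high) in empirical_rule_intervals(p_hat, n).items():
        print(f"{level}: ({low:.3f}, {high:.3f})")
```

The multipliers 1, 2, and 3 are the Empirical Rule approximations only; more precise multipliers would have to come from normal tables.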
Now, combining the Empirical Rule with the language of confidence and the standard error approximation, we say we are approximately

- 68% confident that $p$ is in the interval $\hat{p} \pm \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$;
- 95% confident that $p$ is in the interval $\hat{p} \pm 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$;
- 99.7% confident that $p$ is in the interval $\hat{p} \pm 3\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$.

The 95% confidence interval is by far the one most commonly seen. When news reports refer to the "margin of error", they mean the give-or-take around the estimate that results in an interval that captures the unknown parameter with a 95% success rate in the long run, namely $2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$.

Note: If we substitute $\hat{p} = 0.5$ into this expression for margin of error, the result equals the "conservative margin of error" $\frac{1}{\sqrt{n}}$ introduced in Chapter 4, when we first discussed estimating a population proportion based on sample proportion. For values of $\hat{p}$ further from 0.5 in either direction (that is, closer to 0 or 1), $2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$ will be considerably smaller than $\frac{1}{\sqrt{n}}$ and will let us be more precise in our interval estimate for $p$.

Example: 664 teenagers who reported having sex for the first time between 1999 and 2000 were asked where this first encounter took place; 56% said it was at their own or their partner's home. Assuming those 664 constitute a random sample of all U.S. teens, give an approximate 95% confidence interval for the proportion of all U.S. teens having their first sexual encounter at home.

$$\hat{p} \pm 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.56 \pm 2\sqrt{\frac{0.56(0.44)}{664}} = 0.56 \pm 0.0385 \approx 0.56 \pm 0.04 = (0.52, 0.60)$$

In Chapter 4, we calculated the conservative margin of error to be 0.0388, quite close to this margin of error because 0.56 is close to 0.5. Again, we are approximately 95% confident that the proportion of all teens having their first sexual encounter at their or their partner's home is between 0.56 - 0.04 and 0.56 + 0.04, that is, between 52% and 60%.

Example: An article entitled "Helping stroke victims" reports: "Lowering stroke victims' body temperature with cooling blankets and other means can significantly improve their chances of survival, researchers say. German researchers who took steps to reduce the temperature of 25 people who had suffered severe strokes
found that 14 survived instead of the expected five." Based on the information provided, we can set up a 95% confidence interval for the proportion of all severe stroke victims who would survive with the cooling blanket treatment. First, the sample proportion of survivors is $\hat{p} = 14/25 = 0.56$. The 95% confidence interval is

$$0.56 \pm 2\sqrt{\frac{0.56(0.44)}{25}} = 0.56 \pm 0.20 = (0.36, 0.76)$$

In order to confirm that chances of survival are significantly improved, we note that the expected survival rate is only 20% (five of 25), which falls below the entire confidence interval for overall survival rate of those who are treated with cooling blankets.

Exercise: Find an article or report that includes mention of sample size and summarizes values of a categorical variable with a count, proportion, or percentage. Based on that information, set up a 95% confidence interval for population proportion in the category of interest.

## Lecture 21

Recall: We used the fact that the probability is 95% for sample proportion $\hat{p}$ to fall within 2 standard errors of population proportion $p$ (from the Empirical Rule) in order to construct a 95% confidence interval for unknown population proportion $p$, based on a sample proportion $\hat{p}$ that has been observed.

Example: Of 5685 respondents in a survey, 4548 confessed to routinely singing in their cars. Give a 95% confidence interval for the proportion of all people who routinely sing in their cars.

Since $\hat{p} = 4548/5685 = 0.80$, our 95% confidence interval is

$$0.80 \pm 2\sqrt{\frac{0.80(0.20)}{5685}} = 0.80 \pm 0.011 \approx (0.79, 0.81)$$

Note that the larger sample size results in a smaller margin of error and thus a narrower confidence interval.

### Other Levels of Confidence

Example: 1000 husbands and wives were surveyed about the secrets they kept from their spouses; the most common secret, admitted by 48% of the 40% who said they kept secrets (that is, by 19% of the original 1000), was not telling their spouses about the real price of something they bought.

1. Give a 95% confidence interval for the proportion of all spouses who kept a secret about the real price of something they bought.

   $0.19 \pm 2\sqrt{\frac{0.19(0.81)}{1000}} = 0.19 \pm 0.0248 \approx (0.17, 0.21)$. Notice
that the margin of error is smaller this way than it is when we use the easier, more conservative formula $\frac{1}{\sqrt{n}}$, because 0.19 is rather far from 0.5.

Unfortunately, the level of precision obtained from the Empirical Rule is not always adequate for our purposes, and so we turn now to normal tables to understand how to obtain a higher level of precision. In fact, 95% probability of being within a certain distance of the mean corresponds to left and right tail areas of $\frac{1 - 0.95}{2} = 0.0250$, which correspond to $z^*$ being not quite 2 but 1.96. The margin of error for 95% confidence is not quite $2\sqrt{\frac{0.19(0.81)}{1000}} = 0.0248$ but $1.96\sqrt{\frac{0.19(0.81)}{1000}} = 0.0243$.

2. If I want to say I'm "almost positive" that the population proportion $p$ is in such-and-such an interval, I may want to set my desired level of confidence at 99% instead of 95%. First, if I had a standardized score $z$, the probability is 99% that $z$ lies between what values $-z^*$ and $z^*$? The ones that have area $\frac{1 - 0.99}{2} = 0.005$ to the left and right, respectively. According to Table A.1, $-z^*$ with 0.005 to the left is between $-2.57$ and $-2.58$. An extra decimal digit of accuracy is obtained from the "infinite" row of Table A.2 (recall that the $t$ distribution with infinite degrees of freedom is equivalent to the standard normal distribution): 99% confidence corresponds to $z^* = 2.576$, $-z^* = -2.576$. The standard normal variable $Z$ of interest here is the standardized sample proportion $\frac{\hat{p} - p}{\text{s.e.}(\hat{p})}$.

$$
\begin{aligned}
0.99 &= P(-2.576 < Z < 2.576) \\
&= P\left(-2.576 < \frac{\hat{p} - p}{\text{s.e.}(\hat{p})} < 2.576\right) \\
&= P\left(-2.576\,\text{s.e.}(\hat{p}) < \hat{p} - p < 2.576\,\text{s.e.}(\hat{p})\right) \\
&= P\left(\hat{p} - 2.576\,\text{s.e.}(\hat{p}) < p < \hat{p} + 2.576\,\text{s.e.}(\hat{p})\right)
\end{aligned}
$$

   So 99% is the probability that $\hat{p}$ lies within $2.576\,\text{s.e.}(\hat{p})$ of $p$, and we are 99% confident that the interval $\hat{p} \pm 2.576\,\text{s.e.}(\hat{p})$ contains $p$, i.e., that $p$ is in the interval

   $$0.19 \pm 2.576\sqrt{\frac{0.19(0.81)}{1000}} = 0.19 \pm 0.03 = (0.16, 0.22)$$

3. I can narrow the interval by reducing my confidence level: according to Table A.2, 90% confidence corresponds to $z^* = 1.645$. I can be 90% confident that $p$ is in the interval $\hat{p} \pm 1.645\,\text{s.e.}(\hat{p})$, that is, the unknown population proportion is in the interval

   $$0.19 \pm 1.645\sqrt{\frac{0.19(0.81)}{1000}} = 0.19 \pm 0.02 = (0.17, 0.21)$$

Note the tradeoff: we have a higher rate of confidence for a wider, less precise interval, and a
lower rate of confidence for a narrower, more precise interval.

In general, a level $C$ confidence interval for any parameter is an interval computed from sample data by a method that has probability $C$ of producing an interval that contains the true value of the parameter. For now (Chapter 10), the parameter of interest is $p$; in Chapter 12, it will be $\mu$ or other parameters involving population mean.

We want to say our confidence level is $C$ that the actual proportion $p$ lies in a certain interval; in other words, that $p$ lies within a certain distance of $\hat{p}$; in other words, that $p$ lies in the interval

estimate $\pm$ margin of error

where the estimate is $\hat{p}$ and the margin of error depends on confidence $C$. $C$ equals a probability associated with a standard normal value $z^*$. First note that if $C$ is the area under the standard normal curve between $-z^*$ and $z^*$, then the regions to the left of $-z^*$ and to the right of $z^*$ each have area $\frac{1-C}{2}$. We call $z^*$, with probability $\frac{1-C}{2}$ lying to the right under the standard normal curve, the multiplier that accompanies the confidence level $C$. The "infinite" row of Table A.2 provides $z^*$ values for the four most common confidence levels $C$:

1. 90% is the confidence level $C$ for $z^* = 1.645$
2. 95% is the confidence level $C$ for $z^* = 1.960$
3. 98% is the confidence level $C$ for $z^* = 2.326$
4. 99% is the confidence level $C$ for $z^* = 2.576$

For a given $C$, the approximate margin of error is $z^*\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$.

Conditions: The interval $\hat{p} \pm z^*\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$ is approximately correct as long as the population is at least ten times the sample size, and $n\hat{p}$ and $n(1-\hat{p})$ are both at least 10. The former guarantees approximate independence of selections (if they were dependent, the standard deviation would change). The latter simply requires a check that there have been at least ten each of successes and failures observed.

In general,

1. A 90% confidence interval for $p$ is $\hat{p} \pm 1.645\,\text{s.e.}(\hat{p})$
2. A 95% confidence interval for $p$ is $\hat{p} \pm 1.960\,\text{s.e.}(\hat{p})$
3. A 98% confidence interval for $p$ is $\hat{p} \pm 2.326\,\text{s.e.}(\hat{p})$
4. A 99% confidence interval for $p$ is $\hat{p} \pm 2.576\,\text{s.e.}(\hat{p})$

Example: An article reported that in a random sample of 244 doctors,
184 said they would object to the sale of human organs for transplants. Obtain a 90% confidence interval for the proportion $p$ of all doctors objecting to such sales.

First we find $\hat{p} = 184/244 = 0.754$. For $C = 0.90$, $z^* = 1.645$. The standard error is $\text{s.e.}(\hat{p}) = \sqrt{\frac{0.754(0.246)}{244}} = 0.0276$. A 90% confidence interval for $p$ is

$$0.754 \pm 1.645(0.0276) = 0.754 \pm 0.045 \approx (0.71, 0.80)$$

We are 90% confident that between 71% and 80% of all doctors object to the sale of human organs for transplant.

Caution: the margin of error accounts for random sampling error only; it does not include bias, which may result from the selection process, the wording of questions, etc.

### Choosing a Sample Size

Sometimes, before the sample has been taken, we have in mind a particular margin of error that we would like to report in our confidence interval. It is easy enough to take our expression for a conservative margin of error, $m = \frac{1}{\sqrt{n}}$, and turn it around to solve for $n$ in terms of $m$:

$$n = \frac{1}{m^2}$$

Thus, if we desired a margin of error equal to 0.03, we would take $n = \frac{1}{0.03^2} \approx 1111$. Polling organizations often sample roughly 1000 people and report a margin of error close to 3%. If we desired a margin of error equal to 0.02, we would take $n = \frac{1}{0.02^2} = 2500$. Note that as sample size goes up, margin of error goes down.

Example: A New York Times article entitled "Lawsuits Cast Attention on Passengers' Blood Clots on Long Flights" describes a study published in the New England Journal of Medicine in September 2001. One detail of the study is that for passengers arriving at Charles de Gaulle Airport near Paris, there were 3 cases of pulmonary embolism for 2 million passengers who traveled more than 5000 miles. We should not use this information to set up a confidence interval for the proportion of all passengers traveling more than 5000 miles who would suffer from pulmonary embolism, because the number of successes is too small: the distribution of sample proportion wouldn't be normal enough to justify setting up a confidence interval based on normal critical values.

Exercise: Here is an excerpt from a Pittsburgh Post-Gazette article entitled Criminal pasts cited for
many city school bus drivers:

State auditors checking the records of a random sample of 100 city bus drivers have found that more than a quarter of them had criminal histories. The audit also found that 26 of the drivers were never checked for child abuse histories (in Pennsylvania schools, a mandate for all employees and even some volunteers). In all, the auditors discovered 80 convictions for various offenses among the 100 sampled. Thirty-four of those incidents occurred more than ten years ago, including one rape and four drug offenses. In Pennsylvania, it's perfectly legal for school officials to hire a bus driver with certain convictions that are more than five years old, but that doesn't mean they should, state Auditor General Robert P. Casey Jr. said yesterday in releasing the report. "No one convicted of rape should be driving a school bus full of children," said Casey, who also said he was disappointed with the school district's initial response to the audit. "The General Assembly needs to look at this law," he said. A series of problems last year with school bus drivers, including a February accident that was nearly fatal to an 8-year-old Elliott girl, prompted Casey to take a closer look at Pittsburgh's staff of 750 drivers, he said. When his office presented their results to school officials about eight months ago, Casey said, they were "very reluctant to do anything about it," and sent him only a brief response outlining what steps were being taken to remedy the problems.

Note that the article states that about 25% in a sample of Pittsburgh school bus drivers had criminal records. Report a 98% confidence interval for the proportion of all Pittsburgh school bus drivers with criminal records. One of the conditions for our approximation is not quite met; what is it?

## Lecture 22

### Interpreting Confidence Intervals

Example: Suppose the proportion $p$ of M&Ms that are blue is unknown, and when I take a sample of 75 M&Ms to estimate $p$, I get $\hat{p} = 9/75 = 0.12$ that are blue. A 95% confidence interval for $p$ is
$$0.12 \pm 2\sqrt{\frac{0.12(0.88)}{75}} = 0.12 \pm 0.075 = (0.045, 0.195)$$

Tell whether each of the following is a correct interpretation of this interval.

1. The probability is 95% that the proportion of all M&Ms that are blue is between 0.045 and 0.195. No: this is the most common misinterpretation of the interval, and the word "probability" is the problem. Even though it may be unknown, population proportion $p$ is a fixed parameter, not subject to the laws of probability. Remember that probability is the study of random behavior; it applies to random variables, not to parameters.
2. The probability is 95% that the sample proportion of blue M&Ms is between 0.045 and 0.195. No: in fact, the probability is 100% that our sample proportion is in the confidence interval, because we built the interval around $\hat{p}$. Remember that setting up a confidence interval is a form of statistical inference, a process whereby we use statistics to draw conclusions about parameters, and so we need to be making a statement about $p$, not $\hat{p}$.
3. We are 95% confident that the proportion of all M&Ms that are blue is between 0.045 and 0.195. Yes.
4. The probability is 95% that the interval we produced, (0.045, 0.195), contains $p$. Yes: because sample proportion $\hat{p}$ varies from sample to sample, the interval built around $\hat{p}$ varies randomly, as long as the sample was random. Thus, the word "probability" does apply to the interval produced.

Picture 100 students each selecting a random sample of 75 M&Ms from a large bowlful and setting up a 95% confidence interval for the proportion $p$ of all M&Ms that are blue. Roughly 95% of those 100 intervals, that is, 95 of the intervals, should contain $p$.

Now imagine the students each randomly selected 75 M&Ms from a huge barrelful instead of a bowlful. Would their confidence intervals be any more or less accurate? No: population size does not enter into our calculations. It is irrelevant as long as it is at least ten times the sample size, and as long as the samples are selected at random.

### Using Confidence Intervals to Guide Decisions

Example: In a group of 371 college
students, 196 wore some type of corrective lenses.

1. Give a 95% confidence interval for the proportion of all college students wearing corrective lenses. Since $196/371 = 0.53$, our interval is $0.53 \pm 2\sqrt{\frac{0.53(0.47)}{371}} = 0.53 \pm 0.05 = (0.48, 0.58)$.
2. Are you convinced that a majority of students wear corrective lenses? No, because the interval (0.48, 0.58) contains values less than 0.5, suggesting that the population proportion $p$ is not necessarily greater than 0.5.

Example: In a group of 371 college students, 128 wore contact lenses.

1. Give a 95% confidence interval for the proportion of all college students wearing contact lenses. Since $128/371 = 0.35$, our interval is $0.35 \pm 2\sqrt{\frac{0.35(0.65)}{371}} = 0.35 \pm 0.05 = (0.30, 0.40)$.
2. Are you convinced that a minority of students wear contact lenses? Yes, because the interval (0.30, 0.40) doesn't even come close to containing proportions of 0.5 or more.

Example: $32/233 = 0.137$ of the 233 females in a group of college students wore glasses, whereas $36/138 = 0.261$ of the 138 males wore glasses. Compare the confidence intervals for population proportions of females and of males wearing glasses in order to decide if these population proportions could be equal.

For the females, a 95% confidence interval for $p$ is

$$0.137 \pm 2\sqrt{\frac{0.137(0.863)}{233}} = 0.137 \pm 0.045 = (0.092, 0.182)$$

For the males, a 95% confidence interval for $p$ is

$$0.261 \pm 2\sqrt{\frac{0.261(0.739)}{138}} = 0.261 \pm 0.075 = (0.186, 0.336)$$

The males' interval for $p$ is higher than that of the females, to the point where the intervals share no overlap. It seems doubtful that the proportion of all males wearing glasses is the same as the proportion of all females wearing glasses.

Example: A 95% confidence interval for the proportion of female students smoking is (0.081, 0.167), and a 95% confidence interval for the proportion of male students smoking is (0.062, 0.170). Is it reasonable to assume that the proportion smoking is the same for females and males? Yes, because the intervals do overlap.

### When Confidence Intervals are Not Appropriate

Remember that we set up a confidence interval based on sample data in order to draw conclusions about the larger population from which the sample was obtained. Confidence intervals are
not appropriate if there is no larger group being represented by the sample.

Example: In 2000, $923{,}000 / 1{,}233{,}000 = 75\%$ of all bachelor's degrees were earned by whites. Construct a 95% confidence interval for the proportion of all bachelor's degrees earned by whites. There is no need to construct a confidence interval, because the given proportion already describes the population.

### Two Types of Inference: Confidence Intervals and Hypothesis Tests

In the preceding examples, we examined confidence intervals for population proportion in order to get a feel for whether or not the population proportion could take a hypothetical value. Because this type of conclusion, in the form of a yes-or-no decision, is often quite important, we will now take a more rigorous approach to such problems.

The following pairs of problems will help us to highlight the similarities and differences between situations involving confidence intervals and hypothesis tests.

1. (a) In a group of 371 Pitt students, 42 were left-handed. Give a 95% confidence interval for the proportion of all Pitt students who are left-handed.

   (b) In a group of 371 Pitt students, 42 were left-handed. Is this significantly lower than the proportion of all Americans who are left-handed, which is 0.12?

2. (a) In a group of 371 students, 45 chose the number seven when picking a number between one and twenty "at random". Give a 95% confidence interval for the proportion of all students who would pick the number seven.

   (b) In a group of 371 students, 45 chose the number seven when picking a number between one and twenty "at random". Does this provide convincing statistical evidence of bias in favor of the number seven, in that the proportion choosing seven was significantly higher than $1/20 = 0.05$?

3. (a) One year, a university offers admission to 1200 students and 888 accept. Assuming that year is representative of all the recent years, give a 99% confidence interval for the proportion accepting in any given year.

   (b) A university has found over the years that, out of all the students who are offered
admission, the proportion who accept is 0.70. After a new director of admissions is hired, the university wants to check if the proportion of students accepting has changed significantly. Suppose they offer admission to 1200 students and 888 accept. Is this evidence of a change from the status quo?

Like the confidence interval problems 1(a), 2(a), and 3(a), the significance test problems 1(b), 2(b), and 3(b) all involve a single categorical variable with two possible values (being left-handed or not, picking the number seven or not, accepting admission or not). We know the sample size $n$ and the sample count $X$ in the category of interest, and so can calculate the sample proportion $\hat{p} = X/n$ in the category of interest. Based on this sample proportion, we want to draw conclusions about the unknown population proportion $p$. In a confidence interval problem, our conclusion takes the form of an interval estimate for $p$. In a hypothesis test problem, a hypothetical value for unknown population proportion $p$ is proposed, and we need to decide whether or not $p$ really takes that proposed value. We will begin to solve such problems next lecture.

## Lecture 3

Nancy Pfenning, Stats 1000

We learned last time how to construct a stemplot to display a single quantitative variable. A back-to-back stemplot is a useful display tool when we are interested in comparing the values of a single quantitative variable for two categorical groups.

Example: Let's use a back-to-back stemplot to compare earnings (in thousands of dollars) of 28 male and 51 female students. Since the earnings range from 0 to 22 thousand, we will split the stems 0, 1, and 2 five ways each. Note that besides the quantitative variable earnings, we are adding in a categorical variable, sex, that has two possible values, male and female. Sharing the same stems, male earnings precede them (right to left) while female earnings follow the stems (left to right).

[Figure: back-to-back stemplot of earnings, males to the left of the stems and females to the right; the layout is not recoverable from the source text.]

The center is clearly higher for the males, with midpoint
at 3, than for the females, with midpoint at 2. Male earnings range from 0 to 22 thousand, whereas female earnings range from 0 to 15 thousand. However, the spreads appear comparable if we disregard the high outliers. Shapes are very skewed to the right, as is often the case with monetary variables such as earnings, costs of homes, etc. Both distributions have a single peak in the low thousands: 2 or 3 thousand dollars was a common amount for both sexes.

Note that some of the stems have no leaves. These stems must not be omitted; otherwise, we could not see outliers for what they are.

Since we see a tendency in this particular class for males to earn more than females, it is natural to wonder whether the same conclusion can be drawn about Pitt students, or all college students in general. Our ultimate goal in this course is to go beyond the data at hand and draw conclusions about the larger population from which the data originated, a process called statistical inference. This requires careful development of needed theory over the course of the semester.

Up to now, we've mentioned center as simply the midpoint (median), and spread as the range. These only provide limited information from a couple of observations. Since center and spread are the most important features of a distribution, they should be defined carefully.

One measure of center is the median, or middle value. There is a single middle value for an odd number of observations. For an even number of observations, we take the median to be the average of the two middle values.

Example: The median earnings of the 28 male students is the average of the 14th and 15th values, which is 3 thousand dollars. The median earnings of the 51 female students is the 26th value, 2 thousand dollars. We can say that the typical male student earns 1 thousand dollars more than the typical female student.

Example: The median of 11 Math SAT scores is the 6th, or 592:

468 472 511 534 557 592 592 614 667 669 704

Another measure of center is the mean, or arithmetic average: just add up all
the numbers and divide by how many there are. The mean of $n$ observations $x_1, x_2, \ldots, x_n$ is denoted

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

Example: The mean earnings of 28 males is $\frac{0 + \cdots + 22}{28} = 5.8$. The mean earnings of 51 females is $\frac{0 + \cdots + 15}{51} = 3.7$.

Example: The mean of 11 Math SAT scores is $\frac{468 + \cdots + 704}{11} = 580$.

For fairly symmetric distributions, like the distribution of 11 SAT scores, mean and median are approximately the same (median 592 vs. mean 580). For a distribution that is skewed left or has low outliers, like age at death of all Americans, the mean tends to be less than the median. For a distribution that is skewed right or has high outliers, such as earnings of males or females, the mean tends to be greater than the median (males: median 3 vs. mean 5.8; females: median 2 vs. mean 3.7).

In general, we prefer the mean as a measure of center because it includes information from all the observations. However, if a distribution has pronounced skewness or outliers, the median may be better because it is less affected by those few extreme values. For this reason, we call median a resistant measure of center.

If we use median as our measure of center, we can use quartiles to help describe the spread: they tell us where the middle half of the data values occur. The lower quartile has one fourth of the data below it; it is the middle of the values below the median. The upper quartile has three fourths of the data below it; it is the middle of the values above the median. For an odd number of values, we will exclude the median when finding the middles of the values below and above it. Software, or even other textbooks, may use a different algorithm and produce slightly different quartiles.

Example: Let's find the quartiles for earnings of male and female students. For the 28 males, Q1 is the middle of the lower 14 values, that is, the average of the 7th and 8th, which is 2. Q3 is the middle of the upper 14 values, that is, the average of the 21st and 22nd, which is 9. For the 51 female earnings, Q1 is the middle of the lower 25 values, or 13th, which is 1. Q3 is the middle of the upper
25 values, or 39th, which is 5.

The Five Number Summary is a good way to describe a quantitative data set. It lists the minimum, Q1, median, Q3, and maximum.

Example: The Five Number Summary for male earnings is 0, 2, 3, 9, 22. The Five Number Summary for female earnings is 0, 1, 2, 5, 15. Note that only one quarter of the males earned 2 thousand or less, whereas one half of the females earned 2 thousand or less.

A boxplot lets us take in the information from the Five Number Summary visually:

1. The bottom whisker extends to the minimum.
2. The bottom of the box is at Q1.
3. There is a line through the box at the median.
4. The top of the box is at Q3.
5. The top whisker extends to the maximum.

For a modified boxplot, denote outliers with a distinct symbol, and extend whiskers to the minimum and maximum non-outliers.

A simple criterion to identify outliers is based on the interquartile range $\text{IQR} = Q_3 - Q_1$, which tells the range of the middle half of the data. Any value below $Q_1 - 1.5 \times \text{IQR}$ will be considered a low outlier, and any value above $Q_3 + 1.5 \times \text{IQR}$ will be considered a high outlier.

Example: We can draw side-by-side boxplots of the male and female students' earnings for a good visual comparison. The males' IQR is $9 - 2 = 7$, so low outliers would be below $2 - 1.5 \times 7 = -8.5$ (of course there are none) and high outliers would be above $9 + 1.5 \times 7 = 19.5$. Thus, the values 21 and 22 would be considered high outliers for the males. The females' IQR is $5 - 1 = 4$, so low outliers would be below $1 - 1.5 \times 4 = -5$ (there are none) and high outliers would be above $5 + 1.5 \times 4 = 11$. Thus, the values 12 and 15 would be considered high outliers for the females.

Note: side-by-side boxplots have the advantage over back-to-back stemplots in that we can compare more than two distributions at a time.

[Figure: boxplots of earnings by sex, with means marked; vertical axis earnings, horizontal axis sex.]

Clearly, earnings are higher for the males. Disregarding the outliers, overall spreads are comparable, but the middle half of the males' earnings has a considerably wider range than that
for the females.

Example: Let's consider the values of the variable age, compared for off- and on-campus students. I expect off-campus students to be older overall (higher center), with more spread (because off campus there may be students beyond their twenties), and right skewness/high outliers for both groups, more so for off-campus students.

[Figure: side-by-side boxplots of Age by Live (off, on).]

Descriptive Statistics: Age by Live

| Variable | Live | N | N* | Mean | Median | TrMean | StDev | SE Mean | Minimum | Maximum | Q1 | Q3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | off | 222 | 1 | 21.253 | 20.330 | 20.630 | 3.824 | 0.257 | 17.670 | 45.580 | 19.580 | 21.520 |
| Age | on | 222 | 0 | 19.488 | 19.420 | 19.424 | 0.901 | 0.060 | 17.920 | 28.420 | 19.080 | 19.750 |

NOTE: N missing = 2

Side-by-side boxplots confirm that center and spread are both greater for off-campus students, and there are many high outliers in both groups. Surprisingly, a low outlier appears for the on-campus students. Five Number Summary values are 17.67, 19.58, 20.33, 21.52, and 45.58 for off-campus; 17.92, 19.08, 19.42, 19.75, and 28.42 for on-campus. According to the 1.5 × IQR Rule, boundaries for low and high outliers are

- 16.67 and 24.43 for off-campus students: none are below the lower bound; the upper bound is exceeded by many.
- 18.075 and 20.755 for on-campus students: there are a few below the lower bound and many above the upper bound.

Exercise: Consider the values of one quantitative variable in our survey, compared for two categorical groups. First, state your expectations about how the quantitative values would compare for the two groups. Then use MINITAB to get side-by-side boxplots and report the Five Number Summary for each. Tell how their centers, spreads, and shapes compare. Use the 1.5 × IQR Rule to report the boundaries for low and high outliers in both groups, and tell whether there are any outliers according to the Rule.

Lecture 4

The median is an OK measure of center, especially in the case of skewness or outliers, but in general the mean is our preferred measure of center. The measure of spread to accompany the mean is the standard deviation $s$,
or square root of the average squared deviation from the mean. $s$ tells us how far the observations tend to be from their mean $\bar{x}$. If we solve first for the average squared deviation from the mean, or variance $s^2$, and then take its square root to find $s$, we can write the variance and standard deviation of $n$ observations $x_1, x_2, \ldots, x_n$ as

$$s^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s = \sqrt{s^2}$$

It is natural for students to wonder why we divide by $n - 1$ instead of $n$. Ultimately, variance $s^2$ from a sample is used to estimate the variance of the entire population. It does a better job of estimating when we divide by $n - 1$ instead of $n$.

Example: 468 472 511 534 557 592 592 614 667 669 704

It can be shown that the 11 Math SAT scores have mean $\bar{x} = 580$, standard deviation $s = 80$. How do we interpret these numbers? They are telling us that these students typically scored about 580, give or take about 80 points.

How do we calculate the standard deviation by hand? We must find the square root of the "average" squared deviation from the mean:

1. Find the mean: 580.
2. Find the deviations from the mean: $468 - 580 = -112$, $472 - 580 = -108$, ..., $704 - 580 = 124$.
3. Find the squared deviations from the mean: $(-112)^2 = 12544$, $(-108)^2 = 11664$, ..., $124^2 = 15376$.
4. "Average" the squared deviations, dividing their sum by the number of observations minus one. This gives us the variance: $s^2 = \frac{12544 + 11664 + \cdots + 15376}{11 - 1} = \frac{63924}{10} \approx 6392$.
5. Take the square root of the variance to find the standard deviation: $s = \sqrt{s^2} = \sqrt{6392} \approx 80$.

Example: For the male earnings, we can calculate $\bar{x} = 5.8$ thousand and $s = 5.7$ thousand. This tells us the typical distance of their earnings from their mean (5.8) is 5.7 thousand dollars. Are they really that far away? If we exclude the outliers 21 and 22, $\bar{x} = 4.6$ and $s = 3.7$. The outliers had substantially inflated the value of the mean and also the value of the standard deviation.

For a data set with outliers, caution should be used in describing its center and spread with mean and standard deviation, because their values can be severely affected by just one or a few extreme observations. For such data sets,
resistant measures like median and quartiles should be used. They are hardly affected at all if one or a few extreme observations are included or not. The distribution of SAT scores was fairly symmetric and outlier-free, so standard deviation should provide an adequate measure of spread. The earnings have high outliers and right skewness, and so would be best summarized with the Five Number Summary.

In many cases, an outlier occurs because of faulty recording of data. A student may report her height in inches as 52 when she means 5 feet 2 inches. Or I may mistype a height of 62 inches as 662. In these instances, the outliers should be corrected or eliminated.

2.7 Bell-Shaped Distributions of Numbers

Some quantitative variables, such as SAT scores, have a distribution with a symmetric, single-peaked shape. Such a shape occurs naturally in all sorts of contexts.

Example: Suppose I constructed a histogram for heights of 50 female students, using classes of width 2 inches. Then the total area taken up by the histogram's rectangles would be 100% × 2 inches = 200% · inches. Let's divide the percentages by 2 inches, so that our vertical scale is now "percent per inch" and the total area will be 100%. Then the area of any block tells the percentage of females in that height range. For example, if the median height is 65 inches, then the area under the histogram to the left of 65 would be 50%.

In general, if the vertical scale of my histogram is adjusted so that the total area of all rectangles together is 1, or 100%, then the area of the rectangles over any interval tells us the proportion or percentage of observations which fall in that interval.

Imagine making more and more observations on a continuous variable like height of female college students (in this class, at Pitt, in the US) and recording their values to the utmost accuracy. Then the profile of our histogram would be smoothed out. Idealized, we would have a smooth curve. A density curve is an idealized representation of a distribution, where the area under the curve
between any two values gives the proportion or percentage of observations which fall between those values. By this construction, the total area under a density curve must be 1, or 100%.

Whereas a frequency histogram displays sample data values, a density curve displays the behavior of a continuous quantitative variable for an entire population. We denoted the mean of sample data as $\bar{x}$ and the standard deviation $s$. Now we denote the mean of a density curve with the Greek letter $\mu$ (called "mu") and the standard deviation with the Greek letter $\sigma$ (called "sigma").

The density curve for heights of males or females in a certain age group, like many naturally occurring density curves, follows a symmetric bell shape called normal. Besides providing a good model for many actual data sets (e.g., heights, weights, test scores, measurement errors), the normal curve also approximates typical long-run random behavior (e.g., dice rolls, coin flips). Most importantly, it approximates the shape of the distribution of sample mean or sample proportion when large enough random samples are taken from a quantitative or categorical population whose shape is not necessarily normal.

The normal curve is symmetric about its mean $\mu$, indicating that it is just as likely for the variable to take a value below its mean as above. It is single-peaked and bell-shaped, with tapering ends, showing that values closest to the mean are most likely and values further from the mean are increasingly less common. The total area, as for any density curve, must be 100%, or 1. In fact, the mean $\mu$ and standard deviation $\sigma$ tell us everything about a normal distribution, but it is easiest to begin by specifying three useful landmarks on the normal curve.

Empirical Rule: For any normal curve, approximately

- 68% of the values fall within 1 standard deviation of the mean;
- 95% of the values fall within 2 standard deviations of the mean;
- 99.7% of the values fall within 3 standard deviations of the mean.

Example: The distribution of verbal SAT scores in a certain population is
normal with mean $\mu = 500$, standard deviation $\sigma = 100$. What does the Empirical Rule tell us about the distribution of scores?

- 68% fall within 100 of 500, i.e., between 400 and 600;
- 95% fall within 2(100) of 500, i.e., within 200 of 500, between 300 and 700;
- 99.7% fall within 3(100) of 500, i.e., within 300 of 500, between 200 and 800.

Standardizing Normal Values

A way of assessing a particular value of any normal distribution is to identify how many standard deviations below or above the mean it is. We do this by finding its standardized score, or z-score:

$$z = \frac{\text{observed value} - \text{mean}}{\text{standard deviation}}$$

The z-score will be positive if the value is above the mean, negative if it is below the mean.

Example: Say a variable is normal with mean 50, standard deviation 10. If it takes the value 70, what is its standardized value? The standardized value of 70 is $z = \frac{70 - 50}{10} = 2$. For a variable with mean 50, standard deviation 10, the value 70 is 2 standard deviations above the mean.

Example: Say a variable is normal with mean 36, standard deviation 2. What is the standardized value of 33? It is $z = \frac{33 - 36}{2} = -1.5$. In other words, 33 is 1.5 standard deviations below the mean.

Example: The distribution of heights of young women in the US is normal with mean 65, standard deviation 2.7. The distribution of heights of young men in the US is normal with mean 69, standard deviation 3. Who is taller relative to other members of their sex: Jane at 71 inches or Joe at 75 inches? Jane's standardized height is $\frac{71 - 65}{2.7} = 2.22$. Joe's standardized height is $\frac{75 - 69}{3} = 2.00$. Jane is taller for a woman than Joe is for a man.

Lecture 5

Chapter 3: Gathering Useful Data

In the early part of this course (Chapters 2, 5, and 6) we use descriptive statistics to summarize the data at hand; that is, we summarize the sample. Our ultimate goal in the course (Chapters 10 to 16) is to use inferential statistics to draw conclusions about the larger population from which our sample originated. Such inferences can only be made if the sample data are truly representative of the population with
regard to the question of interest.

Example: I could use heights of female students in this class to draw conclusions about heights of all college females. But I could not use SAT scores of class members to draw conclusions about SAT scores of all college students, because Pitt Stat 200 students would not be representative of all college students with regard to SAT scores.

Example: Larry Flynt's sample of girls who had posed for his magazines would not be representative of the general population of women with regard to whether or not they see pornography as being exploitative.

The simplest way, in theory at least, to guarantee that the sample truly represents the population is to take a simple random sample, where every possible group of a given size has the same chance of being selected. This is sampling at random and without replacement. In practice, samples are chosen in a wide variety of ways. Common sense is often the best guideline in assessing whether or not a sample truly represents the population.

Statistical data are gathered via two basic types of research studies: observational studies and experiments. In an observational study, researchers note values of the variables of interest as they naturally occur. In an experiment, researchers impose a treatment, manipulating the explanatory variable so they can see the effect on the response variable of interest. If researchers find out what type of sunscreen people use and how much time they spend in the sun, then they are conducting an observational study. If researchers provide people with one or the other type of sunscreen and then find out how much time is spent in the sun, they are conducting an experiment.

A randomized experiment is one in which the treatments are assigned at random, a good way to control for possible confounding variables: those which are tied in with the explanatory variable and may affect the response of interest. If researchers had let the subjects (participants in an experiment) choose which sunscreen to use,
their inclination to spend (or avoid) time in the sun could influence their choice and also impact how much time they sunbathed. Another word for a confounding variable is a lurking variable: it is lurking in the background, clouding the issue of interest.

Example: I recorded students' weights and how much time they spent on the phone in a given day (two quantitative variables) and found that students who weighed more tended to spend less time on the phone. Does phone time really go down as weight goes up? A possible confounding variable would be gender: we know males tend to weigh more than females, and perhaps they tend to spend less time on the phone. In order to control for a confounding variable, study similar groups separately; that is, look at the relationship between weight and phone time for women, then for men. As if by magic, the relationship vanishes: in fact, weight has nothing to do with time spent on the phone.

Designing Experiments

Example: Does sugar cause hyperactivity in children? How can this be tested? In order to prevent confounding variables from clouding the issue, researchers could conduct an experiment. The subjects would most likely be volunteers. There are two variables of interest: sugar intake is the explanatory variable and activity level the response. Each of these has the potential of being handled as quantitative or categorical. To keep things simple, let's take both of them to be categorical: sugar intake is low or high, activity level is normal or hyper.

The critical stage in which to employ randomization in an experiment is during the assignment of treatments: some children should be randomly assigned to higher levels of sugar. Should they be given Count Chocula cereal for breakfast while their counterparts are given half a grapefruit? No: the treatment group receiving sugar should be compared to a control group which is treated identically in all other respects. Thus, the children should be provided with the same diet, except that one group has foods sweetened with sugar,
the other with an artificial sweetener. When the treatment is a drug, the control is a placebo pill.

Researchers know that subjects often respond to the idea of being treated, and are careful to prevent this from confounding their results. Because many people have heard that sugar causes hyperactivity, if subjects knew they were given additional sugar, they might alter their behavior. Thus, the subjects should be blind, that is, not know whether they're given sugar or an artificial sweetener. Is this really possible?

What if researchers know whether or not a child received sugar when it comes time to assess activity level? Since this evaluation could be rather subjective, it is important that the researcher not be aware of which treatment a subject received. In other words, the experiment should ideally be double-blind.

Activity habits and levels vary greatly from family to family, depending on region, socioeconomic status, etc. One way to control for all of these influences would be to use a matched pairs design: select two siblings from each family, and randomly assign one to the sugar diet, the other to an artificial sweetener.

Note: the most common matched pairs design is one in which the same individual is evaluated for both treatment and control, such as in a before-and-after study. If the order of treatments could play a role in the response, then order should be randomized. For example, if we want to see if people prefer Pepsi or Coke, each individual can taste both drinks (blind, of course), but which is tasted first should be determined at random, say by a coin flip.

Besides matched pairs designs, another way of controlling for outside variables is blocking, that is, dividing the units first into groups that are similar in a way that may play a role in their responses, then randomizing assignment to treatments within these blocks. If age is important for activity level, divide the children first into younger, medium, and older groups. If gender may be important, divide first into males and females.

Example: A recent
newspaper article entitled "Heights fears ease with pill" reports: "In a small study released yesterday, a drug already on the market for tuberculosis helped people who were terrified of heights get over that fear with only two therapy sessions instead of the usual seven or eight. The study, led by Michael Davis, a professor of psychiatry and behavioral sciences at the Emory University School of Medicine, was described at a session about unlearning fears at the Society for Neuroscience meeting. Davis based his work on research that had found the transmission of a certain protein to a brain receptor was critical to overcoming fear. He found the TB drug D-cycloserine aids transmission of the protein."

This was an experiment, because clearly the drug was administered to study participants by the researchers, as opposed to observing differences in height fears for people who do and do not happen to take that particular drug. The explanatory variable of interest is whether or not the TB drug is taken (a categorical variable). The response variable records the effectiveness of therapy by counting how many sessions are needed for patients to overcome their fear of heights; thus, it is a quantitative variable. The subjects were apparently a small group of people who were terrified of heights. The treatment was the TB drug D-cycloserine, and we can assume that researchers randomly assigned the drug or a placebo to patients in order to make the comparison. We can also assume the study was double-blind, because a reputable researcher would not compromise his study by allowing the experimenter effect to enter in.

Exercise: Find an article or report about an experiment. Tell what the variables of interest are, whether they are quantitative or categorical, and which is explanatory and response. Describe the subjects, treatments, whether or not the study was blind, etc.

Lecture 20: Sampling Distributions; Means
- Typical Inference Problem for Means
- 3 Approaches to Understanding Dist. of Means
- Center, Spread, Shape of Dist.
of Means
- 68-95-99.7 Rule; Checking Assumptions

(C) 2007 Nancy Pfenning, Elementary Statistics: Looking at the Big Picture

Looking Back: Review
- 4 Stages of Statistics
  - Data Production (discussed in Lectures 1-4)
  - Displaying and Summarizing (Lectures 5-12)
  - Probability
    - Finding Probabilities (discussed in Lectures 13-14)
    - Random Variables (discussed in Lectures 15-18)
    - Sampling Distributions
      - Proportions (discussed in Lecture 19)
      - Means
  - Statistical Inference

Typical Inference Problem about Mean
- The numbers 1 to 20 have mean 10.5, s.d. 5.8. If numbers picked "at random" by sample of 400 students have mean 11.6, does this suggest bias in favor of higher numbers?
- Solution Method: Assume (temporarily) that population mean is 10.5, and find the probability of sample mean as high as 11.6. If it's too improbable, we won't believe population mean is 10.5; we'll conclude there is bias in favor of higher numbers.

Key to Solving Inference Problems
- For a given population mean $\mu$, standard deviation $\sigma$, and sample size $n$, need to find probability of sample mean $\bar{X}$ in a certain range.
- Need to know sampling distribution of $\bar{X}$.
- Notation: $\bar{x}$ denotes a single statistic; $\bar{X}$ denotes the random variable.

Definition (Review)
- Sampling distribution of sample statistic tells probability distribution of values taken by the statistic in repeated random samples of a given size.
- Looking Back: We summarized probability distribution of sample proportion by reporting its center, spread, shape. Now we will do the same for sample mean.

Understanding Sample Mean: 3 Approaches
1. Intuition
2. Hands-on Experimentation
3. Theoretical Results
- Looking Ahead: We'll find that our intuition is consistent with experimental results, and both are confirmed by mathematical theory.
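
The "hands-on experimentation" approach can also be imitated on a computer. The following is a minimal sketch, not part of the original slides: it assumes the numbers 1 to 20 are picked with equal probability (the temporary assumption of no bias), draws many samples of size 400, and summarizes the resulting sample means. The seed and the number of repeated samples (`REPLICATIONS`) are arbitrary choices made for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable (arbitrary choice)

POPULATION = range(1, 21)   # the numbers 1 to 20, assumed equally likely
SAMPLE_SIZE = 400           # sample size from the example
REPLICATIONS = 2000         # hypothetical number of repeated random samples

# Draw repeated random samples and record each sample mean.
sample_means = [
    sum(random.choices(POPULATION, k=SAMPLE_SIZE)) / SAMPLE_SIZE
    for _ in range(REPLICATIONS)
]

center = statistics.mean(sample_means)    # should land close to 10.5
spread = statistics.stdev(sample_means)   # should land close to 5.8 / sqrt(400)
share_as_high = sum(m >= 11.6 for m in sample_means) / REPLICATIONS

print(f"center of sample means: {center:.2f}")
print(f"spread of sample means: {spread:.2f}")
print(f"share of sample means as high as 11.6: {share_as_high:.4f}")
```

If sample means as high as 11.6 almost never show up under the no-bias assumption, that is the kind of evidence the Solution Method describes.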
Example: Intuit Behavior of Sample Mean
- Background: Population of possible dice rolls $X$ are equally likely values {1, 2, 3, 4, 5, 6} with a uniform (flat) shape and $\mu = 3.5$, $\sigma = 1.7$.
- Question: How does sample mean $\bar{X}$ behave for repeated random samples of size $n = 2$?
- Experiment: each student rolls 2 dice, records sample mean on sheet and in notes.
- Looking Back: We've seen the probability histogram for this R.V.
- [Figure: probability histogram.]
- Response: For repeated random samples of size 2, $\bar{X}$ is a quantitative R.V.; summarize with
  - Center: Some means less than 3.5, others more; altogether, they should average out to ____.
  - Spread: Means for 2 dice range easily from ____ to ____.
  - Shape: up from 1 to 3.5, down to 6.

Example: Sample Mean for Larger n
- Background: Population of possible dice rolls $X$ are equally likely values {1, 2, 3, 4, 5, 6} with a uniform (flat) shape and $\mu = 3.5$, $\sigma = 1.7$.
- Question: How does sample mean $\bar{X}$ behave for repeated random samples of size $n = 8$?
- Experiment: each student rolls 8 dice, records sample mean on sheet and in notes.
- Response: For repeated random samples of size 8, $\bar{X}$ is a quantitative R.V.; summarize with
  - Center: Some means less than 3.5, others more; altogether, they should average out to ____.
  - Spread: Means for $n = 8$ rarely as low as 1 or as high as 6: ____ spread than for $n = 2$.
  - Shape: Bulges more near 3.5, tapers more at extremes 1 and 6 → shape close to ____.

Mean of Sample Mean (Theory)
- For random samples of size $n$ from population with mean $\mu$, we can write sample mean as $\bar{X} = \frac{1}{n}(X_1 + X_2 + \cdots + X_n)$, where each $X_i$ has mean $\mu$.
- The Rules for constant multiples of means and for sums of means tell us that $\bar{X}$ has mean $\mu_{\bar{X}} = \frac{1}{n}(\mu + \mu + \cdots + \mu) = \frac{1}{n}(n\mu) = \mu$.

Standard Deviation of Sample Mean
- For random samples of size $n$ from population with mean $\mu$, standard deviation $\sigma$, we write $\bar{X} = \frac{1}{n}(X_1 + X_2 + \cdots + X_n)$, where each $X_i$ has s.d. $\sigma$.
- The Rules for constant multiples of s.d.s and for sums of variances tell us that $\bar{X}$ has s.d. $\sigma_{\bar{X}} = \frac{1}{n}\sqrt{\sigma^2 + \sigma^2 + \cdots + \sigma^2} = \frac{1}{n}\sqrt{n\sigma^2} = \frac{\sigma}{\sqrt{n}}$.

Rule of Thumb (Review)
- Need population size at least $10n$ → formula for s.d. of $\bar{X}$ approx. correct even if sampled without replacement.
- Note: for means, there is no Rule of Thumb for approximate normality that is as simple as the one for proportions [$np$ and $n(1-p)$ both at least 10].

Central Limit Theorem (Review)
- Approximate normality of sample statistic for repeated random samples of a large enough size is cornerstone of inference theory.
  - Makes intuitive sense.
  - Can be verified with experimentation.
  - Proof requires higher-level mathematics; result called Central Limit Theorem.

Shape of Sample Mean
- For random samples of size $n$ from population of quantitative values $X$, the shape of the distribution of sample mean $\bar{X}$ is approximately normal if
  - $X$ itself is normal; or
  - $X$ is fairly symmetric and $n$ is at least 15; or
  - $X$ is moderately skewed and $n$ is at least 30.

Behavior of Sample Mean: Summary
- For random sample of size $n$ from population with mean $\mu$, standard deviation $\sigma$, sample mean $\bar{X}$ has
  - mean $\mu$
  - standard deviation $\frac{\sigma}{\sqrt{n}}$
  - shape approximately normal for large enough $n$

Behavior of Sample Mean: Implications
- For random sample of size $n$ from population with mean $\mu$, s.d. $\sigma$, sample mean $\bar{X}$ has
  - mean $\mu$ → $\bar{X}$ is unbiased estimator of $\mu$ (sample must be random)
  - standard deviation $\frac{\sigma}{\sqrt{n}}$: $n$ in denominator → $\bar{X}$ has less spread for larger samples (population size must be at least $10n$)
  - shape approx. normal for large enough $n$ → can find probability that sample mean takes value in given interval
- Looking Ahead: Finding probabilities about sample mean will enable us to solve inference problems.

Example: Behavior of Sample Mean, 2 Dice
- Background: Population of dice rolls has $\mu = 3.5$, $\sigma = 1.7$.
- Question: For repeated random samples of $n = 2$, how does sample mean $\bar{X}$ behave?
- Response: For repeated random samples of $n = 2$, sample mean roll $\bar{X}$ has
  - Center: mean ____
  - Spread: standard deviation ____
  - Shape: ____ because the population is flat, not normal, and $n = 2$ is very small.

Example: Behavior of Sample Mean, 8 Dice
- Background: Population of dice rolls has $\mu = 3.5$, $\sigma = 1.7$.
- Question: For repeated random samples of $n = 8$, how does sample mean $\bar{X}$ behave?
- Response: For repeated random samples of $n = 8$, sample mean roll $\bar{X}$ has
  - Center: mean ____
  - Spread: standard deviation ____
  - Shape: ____ than for $n = 2$ (Central Limit Theorem)

68-95-99.7 Rule for Normal R.V. (Review)
- Sample at random from normal population; for sampled value $X$ (a R.V.), probability is
  - 68% that $X$ is within 1 standard deviation of mean
  - 95% that $X$ is within 2 standard deviations of mean
  - 99.7% that $X$ is within 3 standard deviations of mean
- [Figure: normal curve with axis marked from mu-3sigma to mu+3sigma; central areas 68%, 95%, 99.7%; tail areas 0.16, 0.025, and 0.0015 on each side.]
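
The three percentages in the 68-95-99.7 Rule are rounded areas under the standard normal curve. As a quick check, here is a small sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `area_within` is made up for this illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the standard normal curve within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} s.d.: {area_within(k):.4f}")
```

Because any normal variable can be standardized, the same areas apply to a normal curve with any mean and standard deviation.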
68-95-99.7 Rule for Sample Mean
- For sample means $\bar{X}$ taken at random from large population with mean $\mu$, s.d. $\sigma$, probability is
  - 68% that $\bar{X}$ is within $1\left(\frac{\sigma}{\sqrt{n}}\right)$ of $\mu$
  - 95% that $\bar{X}$ is within $2\left(\frac{\sigma}{\sqrt{n}}\right)$ of $\mu$
  - 99.7% that $\bar{X}$ is within $3\left(\frac{\sigma}{\sqrt{n}}\right)$ of $\mu$
- These results only hold if $n$ is large enough.

Example: 68-95-99.7 Rule for 8 Dice
- Background: Population of dice rolls has $\mu = 3.5$, $\sigma = 1.7$. For random samples of size 8, sample mean roll $\bar{X}$ has mean 3.5, standard deviation 0.6, and shape fairly normal.
- Question: What does 68-95-99.7 Rule tell us about behavior of $\bar{X}$?
- Response: The probability is approximately
  - 0.68 that $\bar{X}$ is within 1(0.6) of 3.5: in ____
  - 0.95 that $\bar{X}$ is within 2(0.6) of 3.5: in ____
  - 0.997 that $\bar{X}$ is within 3(0.6) of 3.5: in ____

Typical Problem about Mean (Review)
- The numbers 1 to 20 have mean 10.5, s.d. 5.8. If numbers picked "at random" by sample of 400 students have mean 11.6, does this suggest bias in favor of higher numbers?
- Solution Method: Assume (temporarily) that population mean is 10.5, and find the probability of sample mean as high as 11.6. If it's too improbable, we won't believe population mean is 10.5; we'll conclude there is bias in favor of higher numbers.

Example: Testing Assumption About Mean
- Background: We asked, "The numbers 1 to 20 have mean 10.5, s.d. 5.8. If numbers picked 'at random' by sample of 400 students have mean 11.6, does this suggest bias in favor of higher numbers?"
- Response: If $\mu = 10.5$, $\sigma = 5.8$, then for $n = 400$, $\bar{X}$ should have mean ____, s.d. ____. Since 11.6 is more than 3 s.d.s above the mean, this suggests bias in favor of higher numbers.

Example: Behavior of Individual vs. Mean
- Background: IQ scores are normal with mean 100, s.d. 15.
- Question: Is 88 unusually low for
  - IQ of a randomly chosen individual?
  - Mean IQ of 4 randomly chosen individuals?
- Response:
  - IQ $X$ of a randomly chosen individual has mean 100, s.d. 15. For $x = 88$, $z$ = ____
  - Mean IQ $\bar{X}$ of 4 randomly chosen individuals has mean ____, s.d. ____. For $\bar{x} = 88$, we have $z$ = ____

Example: Checking Assumptions
- Background: Household sizes in the U.S. have mean 2.5, s.d. 1.4.
- Question: Is 3 unusually high for
  - Size of a randomly chosen household?
  - Mean size of 10 randomly chosen households?
  - Mean size of 100 randomly chosen households?
- Response: Is 3 unusually high for
  - Size of a randomly chosen household? ____
  - Mean size of 10 randomly chosen households? ____
  - Mean size of 100 randomly chosen households? ____

Lecture Summary (Sampling Distributions; Means)
- Typical inference problem for means
- 3 approaches to understanding dist. of sample mean
  - Intuit
  - Hands-on
  - Theory
- Center, spread, shape of dist. of sample mean
- 68-95-99.7 Rule for sample mean
  - Revisit typical problem
  - Checking assumptions for use of Rule
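
The standardizing step used in these examples can be sketched in a few lines. This is an illustrative sketch only: the function names are made up here, and the calculation simply applies the s.d. formula $\sigma/\sqrt{n}$ from the theory slides. It says nothing about whether the shape of $\bar{X}$ is close enough to normal for the 68-95-99.7 Rule to apply, which is why the skewed household-size example is left out.

```python
import math

def sd_of_sample_mean(sigma, n):
    """Standard deviation of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def z_for_sample_mean(xbar, mu, sigma, n):
    """How many s.d.s of the sample mean the observed xbar lies from mu."""
    return (xbar - mu) / sd_of_sample_mean(sigma, n)

# Numbers 1 to 20: mu = 10.5, sigma = 5.8, n = 400, observed mean 11.6
z_numbers = z_for_sample_mean(11.6, 10.5, 5.8, 400)

# IQ: mu = 100, sigma = 15; the value 88 for one individual vs. a mean of 4
z_individual = z_for_sample_mean(88, 100, 15, 1)
z_mean_of_4 = z_for_sample_mean(88, 100, 15, 4)

print(f"numbers 1 to 20, n = 400: z = {z_numbers:.2f}")
print(f"IQ, one individual:       z = {z_individual:.2f}")
print(f"IQ, mean of 4:            z = {z_mean_of_4:.2f}")
```

The same observed value sits twice as many standard deviations from the mean when it is a mean of 4 individuals, because the spread of $\bar{X}$ shrinks by a factor of $\sqrt{4} = 2$.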
Lecture 7: Quantitative Variables; Summaries, Begin Normal
- Mean vs. Median
- Standard Deviation
- Normally Shaped Distributions

Looking Back: Review
- 4 Stages of Statistics
  - Data Production (discussed in Lectures 1-4)
  - Displaying and Summarizing
    - Single variables: 1 categorical, 1 quantitative
    - Relationships between 2 variables
  - Probability
  - Statistical Inference

Ways to Measure Center and Spread
- Five Number Summary (already discussed)
- Mean and Standard Deviation

Definition
- Mean: the arithmetic average of values. For $n$ sampled values, the mean is called "x-bar": $\bar{x} = \frac{x_1 + \cdots + x_n}{n}$
- The mean of a population, to be discussed later, is denoted $\mu$ and called "mu".

Example: Calculating the Mean
- Background: Credits taken by 14 "other" students: 4 7 11 11 11 13 13 14 14 15 17 17 17 18
- Question: How do we find the mean number of credits?
- Response: ____

Example: Mean vs. Median (Skewed Left)
- Background: Credits taken by 14 "other" students: 4 7 11 11 11 13 13 14 14 15 17 17 17 18
- Question: Why is the mean (13) less than the median (13.5)?
- Response: ____
- [Figure: plot of credits.]
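
As a check on the numbers quoted for the credits data, here is a short sketch using Python's standard `statistics` module; the variable names are chosen only for this illustration.

```python
import statistics

# Credits taken by the 14 "other" students, as listed on the slide
credit_hours = [4, 7, 11, 11, 11, 13, 13, 14, 14, 15, 17, 17, 17, 18]

mean_credits = statistics.mean(credit_hours)      # sum of the values divided by 14
median_credits = statistics.median(credit_hours)  # average of the 7th and 8th values

print(f"mean = {mean_credits}, median = {median_credits}")
```

The two low values, 4 and 7, pull the mean down while leaving the middle of the ordered list where it is, which is the pattern the slide calls skewed left.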
Example: Mean vs. Median (Skewed Right)
Background: Output for students' computer times:
0 10 20 30 30 30 30 45 45 60 60 60 67 90 100 120 200 240 300 420

Variable   N    Mean   Median  TrMean  StDev   SE Mean
computer   20   97.9   60.0    85.4    109.7   24.5

Question: Why is the mean (97.9) greater than the median (60)?
Response: ____
[Figure: histogram of computer times (Frequency vs. computer), with median = 60 and mean = 97.85 marked]

Mean vs. Median as Summary of Center
- Pronounced skewness / outliers: report median.
- Otherwise, in general: report mean (contains more information).

Role of Shape in Mean vs. Median
- Symmetric: mean approximately equals median
- Skewed left / low outliers: mean less than median
- Skewed right / high outliers: mean greater than median

Ways to Measure Center and Spread
- Five Number Summary
- Mean and Standard Deviation

Definition
Standard deviation: square root of "average" squared distance from mean $\bar{x}$. For n sampled values the standard deviation is
$s = \sqrt{\frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$
Looking Ahead: Ultimately, squared deviation from a sample is used as estimate for squared deviation for the population. It does a better job as an estimate if we divide by n - 1 instead of n.
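The definition of standard deviation above can be followed step by step in Python. The sketch below applies it to the computer-times data from the skewed-right example and compares the result with the StDev value in the output; `sample_sd` is a hypothetical helper name, and `statistics.stdev` is used only as a cross-check.

```python
from math import sqrt
from statistics import stdev

# Students' computer times from the skewed-right example
times = [0, 10, 20, 30, 30, 30, 30, 45, 45, 60,
         60, 60, 67, 90, 100, 120, 200, 240, 300, 420]

def sample_sd(values):
    """Square root of the 'average' squared deviation, dividing by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    squared_deviations = [(x - xbar) ** 2 for x in values]
    return sqrt(sum(squared_deviations) / (n - 1))

s = sample_sd(times)
print(round(s, 1))  # compare with StDev in the output above
```

Dividing by n - 1 rather than n is the choice described in the "Looking Ahead" note; the standard library's `stdev` makes the same choice.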
Interpreting Mean and Standard Deviation
- Mean: typical value
- Standard deviation: typical distance of values from their mean
(Having a feel for how standard deviation measures spread is much more important than being able to calculate it by hand.)

Example: Guessing Standard Deviation
Background: Household size in U.S. has mean approximately 2.5 people.
Question: Which is the standard deviation?
(a) 0.014 (b) 0.14 (c) 1.4 (d) 14.0
Hint: Ask if any students grew up in a household with number of people quite close to the mean; what is the distance of that value from the mean? Next, a student whose household size was far from the mean reports it, and its distance from the mean. Now consider all U.S. household sizes' distances from the mean; what would be their typical size?
Response: (a) 0.014 (b) 0.14 (c) 1.4 (d) 14.0
Sizes vary; they differ from ____ by about ____

Example: Standard Deviations from Mean
Background: Household size in U.S. has mean 2.5 people, standard deviation 1.4.
Question: About how many standard deviations above the mean is a household with 4 people?
Looking Ahead: For performing inference, it will be useful to identify how many standard deviations a value is below or above the mean, a process known as "standardizing".
Response: ____ (sd's above mean for 4 people)

Example: Estimating Standard Deviation
Background: Consider ages of students.
Question: Guess the standard deviation of
1. Ages of all students in a high school (mean about 16)
2. Ages of high school seniors (mean about 18)
3. Ages of all students at a university (mean about 20)
Looking Back: What distinguishes this style of question from an earlier one that asked us to choose the most reasonable standard deviation for household size? Which type of question is more challenging?
Response:
1. Ages of all students in a high school (mean about 16): standard deviation ____
2. Ages of high school seniors (mean about 18): standard deviation ____
3. Ages of all students at a university (mean about 20): standard deviation ____

Example: Calculating a Standard Deviation
Background: Heights 64 66 67 67 68 70 have mean 67.
Question: What is their standard deviation?
Response: Standard deviation s is the sq. root of "average" squared deviation from mean:
mean = 67
deviations = ____
squared deviations = ____
"average" sq. deviation = ____
s = sq. root of "average" sq. deviation = ____
(This is the typical distance from the average height 67.)

Example: How Shape Affects Standard Deviation
Background: Output, histogram for student earnings:

Variable   N     Mean    Median  TrMean  StDev   SE Mean
Earned     446   3.776   2.000   2.823   6.503   0.308

[Figure: histogram of student earnings]
Question: Should we say students averaged $3776, and earnings differed from this by about $6500? If not, do these values seem too high or too low?
Response: ____ In fact, most are within about ____ of ____

Focus on Particular Shape: Normal
- Symmetric: just as likely for a value to occur a certain distance below as above the mean. Note: if shape is normal, mean equals median.
- Bell-shaped: values closest to mean are most common; increasingly less common for values to occur further from mean.

Focus on Area of Histogram
Can adjust vertical scale of any histogram so it shows percentage by areas instead of heights. Then total area enclosed is 1 or 100%.

Histogram of Normal Data
[Figure: histogram of normally shaped data set; bulges in the middle, symmetric about mean, tapers at the ends; total area = 100% or proportion 1]

Example: Percentages on a Normal Histogram
Background: IQs are normal with a mean of 100, as shown in this histogram.
[Figure: histogram of IQs with 90, 100, and 120 marked; region labeled "percentage of IQ's between 90 and 120"; total area inside histogram is 100%]
Question: About what percentage are between 90 and 120?
Response: ____

What We Know About Normal Data
If we know a data set is normal (shape) with given mean (center) and standard deviation (spread), then it is known what percentage of values occur in any interval. Following rule presents "tip of the iceberg", gives general feel for data values.

68-95-99.7 Rule for Normal Data
Values of a normal data set have
- 68% within 1 standard deviation of mean
- 95% within 2 standard deviations of mean
- 99.7% within 3 standard deviations of mean
[Figure: 68-95-99.7 Rule for Normal Distributions; axis marked mean-3sd, mean-2sd, mean-1sd, mean, mean+1sd, mean+2sd, mean+3sd, with 68%, 95%, and 99.7% of values bracketed]

If we denote mean $\bar{x}$ and standard deviation $s$, then values of a normal data set have
- 68% in $(\bar{x} - 1s, \bar{x} + 1s)$
- 95% in $(\bar{x} - 2s, \bar{x} + 2s)$
- 99.7% in $(\bar{x} - 3s, \bar{x} + 3s)$

Example: Using Rule to Sketch Histogram
Background: Shoe sizes for 163 adult males normal with mean 11, standard deviation 1.5.
Question: How would the histogram appear?
Response:
[Figure: 68-95-99.7 Rule for Male Shoe Sizes]

Example: Using Rule to Summarize
Background: Shoe sizes for 163 adult males normal with mean 11, standard deviation 1.5.
Question: What does the 68-95-99.7 Rule tell us about those shoe sizes?
Response:
- 68% in ____
- 95% in ____
- 99.7% in ____

Example: Using Rule for Tail Percentages
Background: Shoe sizes for 163 adult males normal with mean 11, standard deviation 1.5.
Question: What percentage are less than 9.5?
Response: ____
[Figure: 68-95-99.7 Rule for Male Shoe Sizes; axis marked 6.5, 8, 9.5, 11, 12.5, 14, 15.5; mean = 11, sd = 1.5]

Example: Using Rule for Tail Percentages
Background: Shoe sizes for 163 adult males normal with mean 11, standard deviation 1.5.
Question: The bottom 2.5% are below what size?
Response: ____
[Figure: 68-95-99.7 Rule for Male Shoe Sizes; axis marked 6.5, 8, 9.5, 11, 12.5, 14, 15.5; mean = 11, sd = 1.5]

From Histogram to Smooth Curve
- Start: quantitative variable with infinite possible values over continuous range. (Such as foot lengths, not shoe sizes.)
- Imagine infinitely large data set. (Infinitely many college males, not just a sample.)
- Imagine values measured to utmost accuracy. (Record lengths like 9.7333..., not just to nearest inch.)
- Result: histogram turns into smooth curve.
- If shape is normal, result is normal curve.
[Figure: normal curve; axis marked mean-3sd, mean-2sd, mean-1sd, mean, mean+1sd, mean+2sd, mean+3sd, with 68%, 95%, and 99.7% of values bracketed]
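The shoe-size questions can be worked through with the 68-95-99.7 Rule alone. The following Python sketch is one way to organize that arithmetic; the function name `empirical_rule` is hypothetical, and the tail percentages rely on the symmetry of the normal shape described above.

```python
def empirical_rule(mean, sd):
    """Intervals holding about 68%, 95%, 99.7% of normal data, keyed by percent."""
    return {pct: (mean - k * sd, mean + k * sd)
            for pct, k in ((68, 1), (95, 2), (99.7, 3))}

# Male shoe sizes: mean 11, standard deviation 1.5
shoe = empirical_rule(11, 1.5)

# By symmetry, the area outside an interval splits evenly between the two tails.
pct_below_9_5 = (100 - 68) / 2            # 9.5 is mean - 1 sd
pct_in_each_2sd_tail = (100 - 95) / 2     # tails beyond mean -/+ 2 sd
size_cutting_bottom_2_5 = shoe[95][0]     # lower end of the 95% interval

print(shoe)
print(pct_below_9_5, pct_in_each_2sd_tail, size_cutting_bottom_2_5)
```

The same dictionary answers both directions of question: from a size to a percentage (9.5 is the lower end of the 68% interval) and from a percentage to a size (the bottom 2.5% lies below the lower end of the 95% interval).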
Lecture Summary (Quantitative Summaries, Begin Normal)
- Mean: typical value (average)
- Mean vs. Median: affected by shape
- Standard Deviation: typical distance of values from mean
- Mean and Standard Deviation: affected by outliers, skewness
- Normal Distribution: symmetric, bell-shape
- 68-95-99.7 Rule: key values of normal dist.
- Sketching Normal Histogram & Curve

Lecture 26
Nancy Pfenning, Stats 1000

Chapter 12: More About Confidence Intervals

Recall: Setting up a confidence interval is one way to perform statistical inference: we use a statistic measured from the sample to construct an interval estimate for the unknown parameter for the population. We learned in Chapter 10 how to construct a confidence interval for unknown population proportion $p$ based on sample proportion $\hat{p}$ when there was a single categorical variable of interest, such as smoking or not. In this chapter, we will learn how to construct other confidence intervals:
- for population mean $\mu$, based on sample mean $\bar{x}$, when there is one quantitative variable of interest;
- for population mean difference $\mu_d$, based on sample mean difference $\bar{d}$, in a matched pairs study, when the single set of quantitative differences $d$ is the variable of interest;
- for difference between population means $\mu_1 - \mu_2$, based on difference between sample means $\bar{x}_1 - \bar{x}_2$, in a two-sample study.

The latter two situations involve one quantitative variable and an additional categorical variable with two possible values, although we may think of the distribution of differences in the matched-pairs study as a single quantitative variable. Also discussed in the textbook, but not in our course, is the method of constructing a confidence interval for the difference between two population proportions $p_1 - p_2$, based on the difference between sample proportions $\hat{p}_1 - \hat{p}_2$. Because such situations involve two categorical variables, they
can be handled instead with a chi-square procedure, which will be discussed further in Chapter 15.

The Empirical Rule for normal distributions allowed us to state that, in general, the probability is 95% that a normal variable falls within 2 standard deviations of its mean. Since sample proportion $\hat{p}$ for a large enough sample size $n$ is approximately normal with mean $p$ and standard deviation $\sqrt{\frac{p(1-p)}{n}}$, we were able to construct an approximate 95% confidence interval for $p$:
$\hat{p} \pm 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
In general, an approximate 95% confidence interval for a parameter is the accompanying statistic plus or minus two standard errors; this works well if the statistic's sampling distribution is approximately normal.

If we are interested in the unknown population mean $\mu$ when there is a single quantitative variable of interest, we use the fact, established in Chapter 9, that sample mean $\bar{x}$ has mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$. For a large enough sample size $n$ (say, $n$ at least 30), population standard deviation $\sigma$ will be fairly well approximated by sample standard deviation $s$, and so our standard error for $\bar{x}$ is $\frac{s}{\sqrt{n}}$. Also, for large $n$, by virtue of the Central Limit Theorem, the distribution of $\bar{x}$ will be approximately normal, even if the underlying population variable $X$ is not. Thus, for a large sample size $n$, the Empirical Rule tells us that an approximate 95% confidence interval for population mean $\mu$ is
$\bar{x} \pm 2\frac{s}{\sqrt{n}}$

Example: The mean number of credits taken by a sample of 81 statistics students was 15.60, and the standard deviation was 1.8. Construct an approximate 95% confidence interval for the mean number of credits taken by all statistics students. Does this interval also have a 95% chance of capturing the mean number of credits taken by all students at the entire university?

$\bar{x} \pm 2\frac{s}{\sqrt{n}} = 15.60 \pm 2\frac{1.8}{\sqrt{81}} = 15.60 \pm 0.40 = (15.20, 16.00)$

This interval applies to statistics students only. Especially because the intro stats courses are 4 credits each instead of the usual 3, these students may average slightly higher credit hours than students in general.
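The "statistic plus or minus two standard errors" recipe for a mean is short enough to sketch in Python. The function name `mean_ci` and its `multiplier` parameter are illustrative assumptions, not something defined in these notes; the default multiplier of 2 is the Empirical Rule value used in the credits example.

```python
from math import sqrt

def mean_ci(xbar, s, n, multiplier=2.0):
    """Approximate confidence interval for a mean: xbar +/- multiplier * s / sqrt(n)."""
    margin = multiplier * s / sqrt(n)
    return xbar - margin, xbar + margin

# Credits example: n = 81, sample mean 15.60, sample standard deviation 1.8
low, high = mean_ci(15.60, 1.8, 81)
print(round(low, 2), round(high, 2))
```

Keeping the multiplier as a parameter means the same function can be reused with a different multiplier when a confidence level other than roughly 95% is wanted.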
The Empirical Rule is only roughly accurate; besides, we sometimes may prefer a level of confidence other than 95%. More precise standard normal values for confidence levels 90%, 95%, 98%, and 99% may be obtained from the "infinite" row at the bottom of Table A2. The row is called "infinite" because $t^*$ multipliers converge to $z^*$ for infinite sample sizes (same as infinite degrees of freedom). We can summarize the intervals as follows: for a large sample size $n$, an approximate
- 90% confidence interval for $\mu$ is $\bar{x} \pm 1.645\frac{s}{\sqrt{n}}$
- 95% confidence interval for $\mu$ is $\bar{x} \pm 1.960\frac{s}{\sqrt{n}}$
- 98% confidence interval for $\mu$ is $\bar{x} \pm 2.326\frac{s}{\sqrt{n}}$
- 99% confidence interval for $\mu$ is $\bar{x} \pm 2.576\frac{s}{\sqrt{n}}$

Example: The mean number of credits taken by a sample of 81 statistics students was 15.60, and the standard deviation was 1.8. Construct a more precise 95% confidence interval for the mean number of credits taken by all statistics students. Then construct a 90% confidence interval.

A 95% confidence interval is
$\bar{x} \pm 1.96\frac{s}{\sqrt{n}} = 15.60 \pm 1.96\frac{1.8}{\sqrt{81}} = 15.60 \pm 0.39 = (15.21, 15.99)$
A 90% confidence interval is
$\bar{x} \pm 1.645\frac{s}{\sqrt{n}} = 15.60 \pm 1.645\frac{1.8}{\sqrt{81}} = 15.60 \pm 0.33 = (15.27, 15.93)$
Note the trade-off: we obtain a narrower, more precise interval when we make do with a lower level of confidence.

Recall: we learned in Chapter 9 that not all standardized test statistics follow a standard normal curve. In particular, when the sample size $n$ is small, $s$ may be quite different from $\sigma$, and the random variable $\frac{\bar{X} - \mu}{s/\sqrt{n}}$ follows a $t$ distribution with $n - 1$ degrees of freedom, not a $z$ distribution. Especially for small samples, $t$ has more spread than the standard normal $z$. It is still symmetric about zero and bell-shaped like the $z$ curve. Table A2 provides $t^*$ multipliers for constructing 90%, 95%, 98%, or 99% confidence intervals for unknown population mean $\mu$ when the sample size is on the small side.

Example: Suppose a sample of only 9 statistics students averaged 15.60 credits with standard deviation 1.8. Construct 95% and 90% confidence intervals for the mean number of credits taken by all statistics students.

A sample of size $n = 9$ has df
$= n - 1 = 9 - 1 = 8$, and so we obtain the correct $t^*$ multipliers from the 8 df row of Table A2.

A 95% confidence interval is
$\bar{x} \pm 2.31\frac{s}{\sqrt{n}} = 15.60 \pm 2.31\frac{1.8}{\sqrt{9}} = 15.60 \pm 1.39 = (14.21, 16.99)$
A 90% confidence interval is
$\bar{x} \pm 1.86\frac{s}{\sqrt{n}} = 15.60 \pm 1.86\frac{1.8}{\sqrt{9}} = 15.60 \pm 1.12 = (14.48, 16.72)$
Not only are the $t^*$ multipliers larger than the $z^*$ multipliers, but we are dividing by the square root of a much smaller sample size $n$, which results in much wider intervals than we had for the sample of size 81.

Example: Suppose the FAA weighed a random sample of 25 airline passengers during the summer and found their weights to have mean 180, standard deviation 40. Give a 99% confidence interval for the mean summer weight of all airline passengers.

We use the $t^*$ multiplier for the df $= 25 - 1 = 24$ row and the column for confidence level 99%. Our 99% confidence interval is
$180 \pm 2.80\frac{40}{\sqrt{25}} = 180 \pm 22.4 = (157.6, 202.4)$

Conditions for Using the t Confidence Interval

It is important to remember that if the population $X$ is not normal, then neither is sample mean $\bar{x}$, and so the R.V. $\frac{\bar{X} - \mu}{s/\sqrt{n}}$ does not have a $t$ distribution, and the $t^*$ values from Table A2 are not necessarily correct. Fortunately, $t$ procedures tend to be robust against non-normality, especially for larger sample sizes $n$, except for extreme outliers or strong skewness in the population (demonstrated by outliers or skewness in the data). Methods of Chapter 2 are essential now for determining the shape of the population. Note: there is no way of rescuing data that has been obtained through a poor design. All of our theory requires a simple random sample, taken from a population that is at least 10 times the sample size.

Thus, $t^*$ values will produce an accurate confidence interval if the sample size is large, or if the sample size is small but the data show no outliers or pronounced skewness. The $t^*$ values will not produce an accurate confidence interval if the sample size is small and the data show outliers or skewness.

Example: The sample of 9 students included a part-time student taking only 4 credits. Is our
confidence interval necessarily accurate? No, because the sample size is small and there is a very low outlier.

Example: Is our confidence interval for mean weight of airline passengers accurate? Weights tend to follow a normal curve, and anyway the sample size of 25 isn't especially small, so the interval obtained above should be quite accurate.

Lecture 27

Last time we learned to construct a confidence interval for unknown population mean of a quantitative variable, such as credits taken by statistics students, or weights of airline passengers. Table A2 provides $t^*$ multipliers for various sample sizes and four different levels of confidence. The "infinite" row contains $z^*$ multipliers, which apply when the sample size is large enough that $s$ is virtually identical to $\sigma$.

Matched Pairs t Procedures

One of the Three Basic Principles of Experimental Design is to control the effects of confounding variables by comparing several treatments (or treatment to control). One way to do a comparison is a matched pairs study, where individuals are matched in pairs. Two different treatments may be assigned to each pair, with the assignment randomized (for instance, using coin flips), and outcomes are compared within each pair. Alternatively, the response of an individual before treatment is paired with his or her response after treatment. Or, values of a particular variable may be studied for both members of a pair, e.g. comparing earnings of husbands and wives.

Although such studies of a quantitative variable originally include an additional categorical variable (such as whether the subject was given the drug or the placebo, or whether the spouse is male or female), a matched pairs situation reverts to the study of a single quantitative variable, namely the single sample of differences. The population mean difference is denoted $\mu_d$, whereas the sample mean difference is denoted $\bar{d}$. Robustness is assessed based on the $n$ pairs of observed differences $d$, not the $2n$ data values.

Example: A social scientist wants to produce
statistical evidence that men earn more than women. She records these salaries for a sample of 11 husband-wife pairs. Do the data support the scientist's theory? If robust, construct a 95% confidence interval for the population mean difference $\mu_d$ and check if it contains zero.

Since the sample size is quite small and the income differences have an obvious outlier (145), we should not use $t$ procedures.

Example: Here are average weekly losses of man-hours due to accidents in 10 individual plants before and after a certain safety program was put into operation. Construct a 95% confidence interval for the mean decrease in weekly man-hours lost due to accidents for all plants after implementing the safety program, and use the interval to decide if the program seems effective.

Before   After   Difference
45       36       9
73       60      13
46       44       2
124      119      5
33       35      -2
57       51       6
83       77       6
34       29       5
26       24       2
17       11       6

$\bar{d} = 5.2$, $s_d = 4.1$

First, we can verify with a histogram that the data show no outliers or skewness, and are approximately normal. Next, we find that a 95% confidence interval for the population mean difference $\mu_d$ is $\bar{d} \pm t^*\frac{s_d}{\sqrt{n}}$, where $t^*$ comes from the df $= 9$ row and the 95% confidence level column. Our confidence interval is
$5.2 \pm 2.262\frac{4.1}{\sqrt{10}} = 5.2 \pm 2.9 = (2.3, 8.1)$
We are 95% confident that the population mean difference in average weekly man-hours lost is between 2.3 and 8.1. Implicit is the assumption that the plants constitute a random sample of all plants for which such a safety program is intended. Since the interval is strictly to the right of zero, containing only positive numbers, it suggests that there was a real decrease in mean man-hours lost from before to after.

However, the study design is somewhat flawed because time could possibly be a confounding variable. Perhaps because of heightening awareness of safety issues and increased fear of lawsuits, there was a general decrease in man-hours lost due to accidents during that time period, even in plants that did not implement the safety program. How could we control for this possible confounding
variable? By comparing our ten plants to another sample of plants over the same time period which did not implement the safety program. Such a design, because it involves samples from two distinct populations, is called a two-sample design.

Comparing Two Means

We will use inference to compare the mean responses in two groups, each from a distinct population. This is called a two-sample situation, one of the most common settings in statistical applications. One example would be to compare mean IQ's of male and female seventh-graders (i.e., comparing results in an observational study). Another example would be to compare the change in blood pressure for two groups of black men, where one group has been given calcium supplements, the other a placebo (i.e., comparing results in an experiment).

In general, a two-sample $t$ procedure arises in situations where there is one quantitative variable of interest, plus a categorical variable which has two possible values. The variables in the first example are IQ and gender; in the second example, they are blood pressure and whether the subject has been given calcium or a placebo. Responses in each group must be independent of those in the other; sample sizes may differ. The setting is not appropriate for matched pairs, which represent a single population.

The following notation is used to describe the two populations and the results of two independent random samples:

            Parameters                    Statistics
Population  R.V.   mean     s.d.          sample size  sample mean  sample s.d.
1           $X_1$  $\mu_1$  $\sigma_1$    $n_1$        $\bar{x}_1$  $s_1$
2           $X_2$  $\mu_2$  $\sigma_2$    $n_2$        $\bar{x}_2$  $s_2$

Naturally enough, we estimate the parameter $\mu_1 - \mu_2$ with the statistic $\bar{x}_1 - \bar{x}_2$. As one would hope and expect, it turns out that the distribution of the R.V. $\bar{X}_1 - \bar{X}_2$ is centered at $\mu_1 - \mu_2$, providing an unbiased estimator. The spread of the distribution is not so intuitive: it can be shown that the standard error of $\bar{x}_1 - \bar{x}_2$ is
$\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
Although the standardized R.V.
$\frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
does not have a $t$ distribution per se, it can still be used with $t^*$ values in either of two ways.

Option 1: Approximate
$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1 - 1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2 - 1}\left(\frac{s_2^2}{n_2}\right)^2}$
and use the $t$ table. The computer takes this approach, but for obvious reasons we would rather not if solving a two-sample problem by hand. Instead, we will use

Option 2 (conservative approach): use the smaller of $n_1 - 1$, $n_2 - 1$ as our df in the $t$ table.

An approximate confidence interval for $\mu_1 - \mu_2$ is
$(\bar{x}_1 - \bar{x}_2) \pm t^*\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
where $t^*$ uses the smaller of $n_1 - 1$, $n_2 - 1$ as its df, and the desired confidence level dictates which column from Table A2 to use. This interval should be fairly accurate as long as the sample sizes are large, or if small samples show no outliers or skewness.

Example: In random samples of 47 male and 31 female seventh-graders in a Midwest school district, IQ's were found to have the following means and standard deviations:

          n    mean   s.d.
Males     47   111    12
Females   31   106    14

What shapes are required of the underlying populations to justify use of two-sample $t$ procedures? Any shapes should be acceptable, since the sample sizes of 47 and 31 are reasonably large.

Use a two-sample $t$ procedure to give a 90% confidence interval for the difference between mean IQ's, males minus females. The 90% confidence interval for $\mu_1 - \mu_2$ is given by
$(\bar{x}_1 - \bar{x}_2) \pm t^*\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
where we take $t^*$ to be the value for the smaller of $47 - 1$, $31 - 1$ df and confidence level 90%. We find the $t^*$ value for the 30 row, 90% column, to be 1.70, and our 90% confidence interval is
$(111 - 106) \pm 1.70\sqrt{\frac{12^2}{47} + \frac{14^2}{31}} = 5 \pm 5.2 = (-0.2, 10.2)$
It is common for boys to score somewhat higher than girls on standardized tests. Does this seem to be the case for all seventh-grade boys and girls in this school district? The interval just barely contains zero, so it is difficult to be sure. Eventually, we will learn to carry out a formal test of whether or not two means $\mu_1$ and $\mu_2$ are equal.

Pooled Two-Sample t Procedures

If the samples are coming from populations that have equal variances, we can use a pooled procedure. The test statistic can be shown to follow a genuine $t$ distribution with $n_1 + n_2 - 2$ df. This places us further down on the $t$ table than taking the smaller of $n_1 - 1$ and $n_2 - 1$ as our df, resulting in slightly narrower confidence intervals. One criterion for use of a pooled procedure is to check that sample standard deviations are close enough to suggest equal population standard deviations (and hence equal variances). We do this by verifying that the larger sample standard deviation is no more than twice the smaller.

Example: Looking at the sample standard deviations for IQ, we note that 14 is not more than twice 12, so a pooled procedure seems appropriate. There are actually much better criteria for use of a pooled procedure, which are outlined in your textbook. In any case, for our purposes in this course the non-pooled procedure will be considered adequate.

Example: In a previous Example, we explored the sampling distribution of sample mean height when random samples are taken from a population of women whose mean height is claimed to be 64.5. We noted the sample mean height of surveyed Stats female students and calculated by hand the probability of observing such a high sample mean, if population mean were really only 64.5. We used this probability to decide whether we were willing to believe that population mean was in fact 64.5, or if the population of female Stats students is actually taller on average.

For this Example, we address the same question by using MINITAB to set up a confidence interval for unknown population mean height, given that population standard deviation is 2.5; thus, a $z$ procedure is used. When a one-sided alternative is not specified, the confidence interval just barely contains 64.5 (it goes down to 64.491), and so we can't quite produce evidence that population mean height of females differs from 64.5. If a "greater-than" alternative is specified, then our lower bound is 64.538, which would suggest population mean height is higher than 64.5. If the standard deviation of 2.5 were not given, we would carry out a $t$ procedure. Again, the confidence interval just barely contains 64.5 with a two-sided alternative, and just barely misses it with a one-sided alternative.
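The two $z$ results quoted for the height example can be reproduced by hand. The Python sketch below assumes the summary values that appear in the MINITAB output for this example (N = 281, sample mean 64.783, assumed sigma 2.5) and uses the standard library's `NormalDist` for the normal multipliers; it is an illustration of the arithmetic, not a description of how MINITAB computes its output.

```python
from math import sqrt
from statistics import NormalDist

# Summary values taken from the MINITAB output for HTfemale
n, xbar, sigma = 281, 64.783, 2.5
se = sigma / sqrt(n)                       # standard error of the sample mean

# Two-sided 95% interval: 2.5% in each tail
z_two_sided = NormalDist().inv_cdf(0.975)
ci = (xbar - z_two_sided * se, xbar + z_two_sided * se)

# One-sided 95% lower bound: all 5% in the lower tail
z_one_sided = NormalDist().inv_cdf(0.95)
lower_bound = xbar - z_one_sided * se

print(round(ci[0], 3), round(ci[1], 3), round(lower_bound, 3))
```

Because the one-sided bound uses the smaller multiplier (about 1.645 instead of about 1.96), it sits closer to the sample mean, which is why 64.5 falls just inside the two-sided interval but just below the one-sided lower bound.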
Considering the 95% confidence interval will give results that match up neatly with those of a hypothesis test at the 5% level only in the case of a two-sided alternative.

One-Sample Z: HTfemale
Test of mu = 64.5 vs mu not = 64.5
The assumed sigma = 2.5
Variable    N     Mean     StDev   SE Mean
HTfemale    281   64.783   2.637   0.149
Variable    95.0% CI             Z      P
HTfemale    (64.491, 65.075)     1.90   0.058

One-Sample Z: HTfemale
Test of mu = 64.5 vs mu > 64.5
The assumed sigma = 2.5
Variable    N     Mean     StDev   SE Mean
HTfemale    281   64.783   2.637   0.149
Variable    95.0% Lower Bound    Z      P
HTfemale    64.538               1.90   0.029

One-Sample T: HTfemale
Test of mu = 64.5 vs mu not = 64.5
Variable    N     Mean     StDev   SE Mean
HTfemale    281   64.783   2.637   0.157
Variable    95.0% CI             T      P
HTfemale    (64.473, 65.093)     1.80   0.073

One-Sample T: HTfemale
Test of mu = 64.5 vs mu > 64.5
Variable    N     Mean     StDev   SE Mean
HTfemale    281   64.783   2.637   0.157
Variable    95.0% Lower Bound    T      P
HTfemale    64.523               1.80   0.037

Exercise: In a previous Exercise, we explored the sampling distribution of sample mean number selected when random samples are taken from a population where all numbers between 1 and 20 are equally likely, so population mean is 10.5. We noted the sample mean selection by surveyed Stats students and calculated by hand the probability of observing such a high sample mean, if population mean were really only 10.5. We used this probability to decide whether we were willing to believe that population mean was in fact 10.5, or if students were rather biased towards higher numbers. For this Exercise, address the same question by using MINITAB to set up a confidence interval for unknown population mean selection, given that population standard deviation is 5.77. Does your interval contain 10.5? What do you conclude?

Exercise: For this Exercise, address the same question again by using MINITAB to set up a confidence interval for unknown population mean selection, but this time assume population standard deviation is unknown. Does your interval contain 10.5? What do you conclude?

Lecture 28

Chapter 13: More About Significance Tests

Recall: Hypothesis tests are
a form of statistical inference: we use a statistic measured from the sample to decide whether or not the unknown parameter for the population equals a hypothetical value. We learned in Chapter 11 how to test a hypothesis about an unknown population proportion $p$ based on sample proportion $\hat{p}$ when there was a single categorical variable of interest, such as smoking or not. In this chapter, we will learn how to perform other hypothesis tests:
- about population mean $\mu$, based on sample mean $\bar{x}$, when there is one quantitative variable of interest;
- about population mean difference $\mu_d$, based on sample mean difference $\bar{d}$, in a matched pairs study, when the single set of quantitative differences $d$ is the variable of interest;
- about difference between population means $\mu_1 - \mu_2$, based on difference between sample means $\bar{x}_1 - \bar{x}_2$, in a two-sample study.

Also discussed in the textbook, but not in our course, is the method of testing hypotheses about the difference between two population proportions $p_1 - p_2$, based on the difference between sample proportions $\hat{p}_1 - \hat{p}_2$. Because such situations involve two categorical variables, they can be handled with a chi-square procedure, which will be discussed further in Chapter 15.

Paralleling the specific steps we learned to test a hypothesis about a single proportion, the following five steps can be taken to test a hypothesis about any unknown parameter:
1. Determine the null and alternative hypotheses.
2. Verify that the necessary data conditions are met; if so, standardize the sample statistic.
3. Find the p-value, which is the probability, assuming the null hypothesis is true, that the test statistic would take a value as high/low/different as the one observed.
4. Decide whether results are statistically significant: reject the null hypothesis if the p-value is small.
5. State conclusions in context.

Hypothesis Tests About $\mu$ or $\mu_d$

If we are interested in the unknown population mean $\mu$ when there is a single quantitative variable of interest, we use the fact that sample mean $\bar{x}$ has
mean $\mu$ and standard error $s/\sqrt{n}$. In order to obtain accurate results for smaller sample sizes, since $s$ may be quite different from $\sigma$, our standardized test statistic $\frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ is taken to follow a $t$ distribution with $n - 1$ degrees of freedom, not a $z$ distribution.

We learned to use Table A2 in Chapter 11 to get a range for the P-value in hypothesis tests about $p$ by surrounding our test statistic $z$ with values $z^*$ from the infinite row of the table. The columns correspond to symmetric tail areas of .05 for confidence level 90%, tail areas .025 for confidence level 95%, tail areas .01 for confidence level 98%, and tail areas .005 for confidence level 99%. Now we use Table A2 to get a range for the P-value in hypothesis tests about the mean by surrounding our test statistic $t$ with values $t^*$ from the df $= n - 1$ row of the table. Again, the columns correspond to symmetric tail areas of .05, .025, .01, and .005 respectively. Using Table A2, our hypothesis test about $\mu$ follows these steps:

1. Set up $H_0: \mu = \mu_0$ vs. $H_a: \mu < \mu_0$, $\mu > \mu_0$, or $\mu \ne \mu_0$.
2. Verify that the sample size is large, or that the data set shows no outliers or skewness; if so, calculate the $t$-statistic $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$.
3. Get an expression for the P-value:
   - $P(T \le t\text{-statistic})$ for $H_a: \mu < \mu_0$
   - $P(T \ge t\text{-statistic})$ for $H_a: \mu > \mu_0$
   - $2P(T \ge \lvert t\text{-statistic} \rvert)$ for $H_a: \mu \ne \mu_0$
4. Assess significance by comparing the $t$ statistic to $t^*$ values in Table A2 and getting a range for the P-value.
5. State conclusions in the context of the particular mean of interest.

Example: I had been going under the assumption that my students averaged 15 credits in a semester, but then I thought that because mine is a 4-credit course, their mean may actually be higher than 15. The mean number of credits taken by a sample of 81 statistics students was 15.6 and the standard deviation was 1.8. Does this provide evidence that statistics students overall average more than 15 credits?

1. $H_0: \mu = 15$ vs. $H_a: \mu > 15$
2. Since $n = 81$ is large, non-normal shape would not be a problem; calculate $t = \frac{15.6 - 15}{1.8/\sqrt{81}} = 3$.
3. P-value $= P(T \ge 3)$
4. For 80 df, 3 is greater than 2.64, so the P-value is less than .005;
results are statistically significant and we reject $H_0$.
5. Overall, statistics students average more than 15 credits.

Example: Suppose a sample of 36 statistics students had been taken. How many df should we use from Table A2? Note that the table does not include exactly 35 df, so we must choose between 30 and 40. Always choose the smaller df, because this makes it slightly more difficult to reject the null hypothesis, which is the safer approach to take. Thus we would carry out the test using the 30 df row of Table A2.

### Conditions for Using the $t$ Test

Just as with confidence intervals, P-value ranges obtained by comparing the $t$ statistic to $t^*$ values in Table A2 will produce accurate results if the sample size is large, or if the sample size is small but the data show no outliers or pronounced skewness. The table will not produce accurate results if the sample size is small and the data show outliers or skewness.

### Matched Pairs Hypothesis Tests

When the mean of a quantitative variable is explored via a matched pairs design, hypothesis tests are carried out on the population mean difference $\mu_d$, based on the sample mean difference $\bar{d}$.

Example: To test if students' mothers tend to be younger than their fathers, I looked at the difference (mother's age minus father's age) for a sample of 12 students. This difference had mean $\bar{d} = -1.5$ and standard deviation $s_d = 3$. Is the mean difference significantly less than zero, using $\alpha = .05$ as the cutoff probability? To test $H_0: \mu_d = 0$ vs. $H_a: \mu_d < 0$, we check if the distribution seems fairly symmetric and outlier-free (it is) and calculate $t = \frac{-1.5 - 0}{3/\sqrt{12}} = -1.73$. Because the alternative has the "<" sign, our P-value is $P(T \le -1.73)$. According to the table, the probability of a $T$ random variable with 11 df being greater than 1.80 is .05; likewise, the probability of being less than $-1.80$ is also .05. The test statistic $-1.73$ isn't as far out on the tail of the $t$ curve as $-1.80$, so its tail probability is more than .05. We do not have evidence to reject the null hypothesis at $\alpha = .05$; a sample of 12 age differences was
not enough to convince us that mothers tend to be younger. In fact, another much larger sample was taken, producing a much smaller P-value, and this sample did provide evidence that the mean age difference is negative.

### Hypothesis Tests About the Difference Between Two Means

We can test for equality of the mean responses in two groups, each from a distinct population. This is called a two-sample situation, one of the most common settings in statistical applications. One example would be to compare mean IQs of male and female seventh-graders, i.e., comparing results in an observational study. Another example would be to compare the change in blood pressure for two groups of black men, where one group has been given calcium supplements and the other a placebo, i.e., comparing results in an experiment. As with confidence intervals, we use the following notation (mean and sd are parameters; sample size, sample mean and sample sd are statistics):

| Population | RV | mean | sd | sample size | sample mean | sample sd |
|---|---|---|---|---|---|---|
| 1 | $X_1$ | $\mu_1$ | $\sigma_1$ | $n_1$ | $\bar{x}_1$ | $s_1$ |
| 2 | $X_2$ | $\mu_2$ | $\sigma_2$ | $n_2$ | $\bar{x}_2$ | $s_2$ |

The null hypothesis is $H_0: \mu_1 = \mu_2$ (same as $H_0: \mu_1 - \mu_2 = 0$) and the alternative substitutes the appropriate inequality for "=". We carry out our test using the two-sample $t$ statistic

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

and use the smaller of $n_1 - 1$, $n_2 - 1$ as our df in the $t$ table. The approximate P-value is found in the usual way from Table A2.

Example: In random samples of 47 male and 31 female seventh-graders in a Midwest school district, IQs were found to have the following means and standard deviations:

| | Males | Females |
|---|---|---|
| sample size | 47 | 31 |
| mean | 111 | 106 |
| standard deviation | 12 | 14 |

Is the mean male IQ significantly higher than that for the females? Test at level $\alpha = .05$. We will test $H_0: \mu_1 - \mu_2 = 0$ vs. $H_a: \mu_1 - \mu_2 > 0$. Our two-sample $t$ statistic is

$$t = \frac{(111 - 106) - 0}{\sqrt{\frac{12^2}{47} + \frac{14^2}{31}}} = 1.63$$

For 30 df, 1.63 is less than 1.70, so our one-sided P-value is greater than .05. There is not quite enough evidence to reject $H_0$ at the .05 level; the population of boys doesn't necessarily average higher than the population of girls in this district.

Example: Suppose the FAA weighed a random sample of 25 airline passengers during the summer and
found their weights to have mean 180, standard deviation 40. Are airline passengers necessarily heavier now than they were in 1995, when mean weight for 16 passengers was 160 with standard deviation 30? Answer this question two ways: first by looking at a 90% confidence interval for the difference in mean weights, then by testing at the .05 level if the mean weight increased. Note that since the sample sizes aren't especially large, we should first check that the weight distributions do not show obvious outliers or skewness. We have $n_1 = 16$, $n_2 = 25$, $\bar{x}_1 = 160$, $\bar{x}_2 = 180$, $s_1 = 30$, $s_2 = 40$.

1. We use the $t^*$ multiplier for the df $= 16 - 1 = 15$ row and the column for confidence level 90%. Our 90% confidence interval for $\mu_1 - \mu_2$ is $(160 - 180) \pm 1.75\sqrt{\frac{30^2}{16} + \frac{40^2}{25}} = -20 \pm 19 = (-39, -1)$. The interval contains only negative numbers and suggests a significant increase in mean weight from 1995 to 2002.
2. We test $H_0: \mu_1 - \mu_2 = 0$ vs. $H_a: \mu_1 - \mu_2 < 0$ about population mean weight in 1995 minus population mean weight in 2002. The test statistic is $t = \frac{(160 - 180) - 0}{\sqrt{\frac{30^2}{16} + \frac{40^2}{25}}} = -1.82$. For 15 df, 1.82 is between 1.75 and 2.13, so the P-value is between .05 and .025. We reject $H_0$ at the .05 level and conclude that mean weight has increased significantly.

The FAA reached this conclusion in the spring of 2003 and made new restrictions on the number of passengers aboard smaller planes, based on the fact that people are heavier than they used to be.

### Multiple Hypothesis Tests

Example: Verbal SATs have mean 500. An education expert samples verbal SAT scores of 20 students each in 100 schools across the state, and finds that in 4 of those schools the sample mean verbal SAT is significantly lower than 500, using $\alpha = .05$. Are these schools necessarily inferior, in that their students do significantly worse on the verbal SATs? No. First note that 20 indicates the sample size here and 100 is the number of tests; in other words, we test $H_0: \mu = 500$ vs. $H_a: \mu < 500$ over and over, one hundred times. Remember that if $\alpha = .05$ is used as a cutoff, then 5% of the time in the long run we will reject $H_0$ even when it is
true. Roughly 5 schools in 100 will produce samples of students with verbal SATs low enough to reject $H_0$ just by chance in the selection process, even if the mean for all students at those schools is in fact 500.

Example: Kanarek and others studied the relationship between cancer rates and levels of asbestos in the drinking water. After adjusting for age and various demographic variables, but not smoking, they found a strong relationship between the rate of lung cancer among white males and the concentration of asbestos fibers in the drinking water (P-value < .001). An increase of 100 times the asbestos concentration results in an increase of .05 per 1000 in the lung cancer rate: one additional lung cancer case per year for every 20,000 people. The investigators tested over 200 relationships; the P-value for lung cancer in white males was by far the smallest one they got. Does asbestos in the drinking water cause lung cancer in white males? No. When they test hundreds of relationships, sooner or later, by chance alone, some will end up looking significant. There are other problems with this study: failing to control for the possible confounding variable of smoking, and calling a relationship strong even though it would imply just one additional case of lung cancer for every 20,000 white males.

Example: A researcher of ESP tests 500 subjects. Four of them do significantly better (each P-value < .01) than random guessing. Should the researcher conclude that those four have ESP? No. In so many trials, even if each subject is just guessing, chances are that a few of the 500 will do significantly better than guessing and a few will do significantly worse. The researcher should proceed with further testing of those four subjects.

In general, we should be aware that many tests run at once will probably produce some significant results by chance alone, even if none of the null hypotheses are false.

Example: Do students overall spend more time on the computer than they do watching TV? If so, then when I consider the
differences in minutes spent (computer minus TV) for a population of students, I'd hypothesize the mean of the differences to be positive. A paired $t$ procedure based on computer and TV times of several hundred Stats students could be used to test my hypothesis. Since a paired test like this really just involves one quantitative variable (the single sample of differences), an appropriate display would be a histogram of observed differences; note that it is remarkably bell-shaped, but may or may not be centered at zero.

[Figure: Histogram of Differences (with $H_0$ and 95% t-confidence interval for the mean); horizontal axis: Differences, vertical axis: Frequency]

Paired T for Compu - TV

| | N | Mean | StDev | SE Mean |
|---|---|---|---|---|
| Compu | 444 | 81.64 | 88.61 | 4.21 |
| TV | 444 | 58.18 | 70.28 | 3.34 |
| Difference | 444 | 23.47 | 110.50 | 5.24 |

95% lower bound for mean difference: 14.82

T-Test of mean difference = 0 (vs > 0): T-Value = 4.47, P-Value = 0.000

The P-value of 0.000 lets me reject the null hypothesis that the population mean difference is zero, and conclude that it is indeed positive. Apparently students do spend more time overall on the computer than they do watching TV.

Exercise: Find paired data in our survey, such as math and verbal SATs, ages of mothers and fathers, heights of females and their mothers, or heights of males and their fathers. Use MINITAB to test $H_0: \mu_d = 0$ against an appropriate $H_a$. State your conclusion in terms of the variable chosen.

Example: Who carries more cash, males or females? Or don't they differ? I can use MINITAB to test the null hypothesis that mean cash carried for populations of females is the same as that for males, vs. a two-sided alternative (I had no preconceptions in advance of one group carrying more money). When comparing values of one quantitative variable for two categorical groups, side-by-side boxplots would be an appropriate display.

[Figure: Boxplots of Cash by Sex (means are indicated by solid circles); horizontal axis: Sex (female, male), vertical axis: Cash]

Two-sample T for Cash

| Sex | N | Mean | StDev | SE Mean |
|---|---|---|---|---|
| female | 280 | 24.0 | 39.6 | 2.4 |
| male | 159 | 34.2 | 58.4 | 4.6 |

Difference = mu (female) - mu (male)

Estimate for difference: -10.23

95% CI for difference: (-20.47, 0.02)

T-Test of difference = 0 (vs not =): T-Value = -1.97, P-Value = 0.050, DF = 241

NOTE: N missing = 7

The P-value of .05 is on the small side, leading us to conclude that there is a significant difference between males and females in the amount of cash that they carry. Since the difference between sample means (female minus male) was negative, we have reason to believe that overall, males carry more cash. The sample mean for females was about 24 dollars, for males about 34 dollars.

Exercise: Compare values of a quantitative survey variable for two categorical groups (such as males and females, or on- and off-campus students) by testing $H_0: \mu_1 - \mu_2 = 0$ against an appropriate $H_a$. State your conclusion in terms of the variable chosen.

Exercise: Read the article "The most important meal", which reports that in a study of American eighth-graders in 96 public schools in San Diego, New Orleans, Minneapolis, and Austin, overweight students were more likely to skip breakfast than students who were not overweight. Unstack the data in our class survey according to gender; then, for each gender group, test the null hypothesis of equal weights for students who did and did not eat breakfast, according to their survey responses. Make sure to formulate the correct alternative hypothesis.

Exercise: Read "Science lifts mummy's curse" and use the means for Age at death (exposed vs. unexposed), along with the sample sizes $n$ and standard deviations (in parentheses), to test for a significant difference in age at death between those who were and were not exposed to the mummy's curse. State your conclusions clearly.

## Lecture 23 (Nancy Pfenning, Stats 1000)

### Chapter 11: Testing Hypotheses About Proportions

Recall: last time we presented the following examples:

1. In a group of 371 Pitt students, 42 were left-handed. Is this significantly lower than the proportion of all Americans who are left-handed, which is .12?
2. In a group of 371 students, 45 chose the number seven when picking a number
between one and twenty at random. Does this provide convincing statistical evidence of bias in favor of the number seven, in that the proportion of students picking seven is significantly higher than $1/20 = .05$?
3. A university has found over the years that out of all the students who are offered admission, the proportion who accept is .70. After a new director of admissions is hired, the university wants to check if the proportion of students accepting has changed significantly. Suppose they offer admission to 1200 students and 888 accept. Is this evidence of a change from the status quo?

Each example mentions a possible value for $p$ which would indicate no difference, no change, or the status quo. The null hypothesis $H_0$ states that $p$ equals this traditional value. In contrast to the null hypothesis, each example suggests that an alternative may be true: a significance test problem always pits an alternative hypothesis $H_a$ against $H_0$. $H_a$ proposes that the proportion differs from the traditional value $p_0$; $H_a$ rocks the boat, upsets the apple cart, marches to a different drummer.

A key difference among our three examples is the direction in which $H_a$ refutes $H_0$. In the first, it is suggested that the proportion of all Pitt students who are left-handed is less than the proportion for adults in the US, which is .12. In the second, we wonder if the proportion of students picking the number seven is significantly more than .05. In the third, we inquire about a difference in either direction from the stated proportion of .70. We can list our null and alternative hypotheses as follows:

1. $H_0: p = .12$, $H_a: p < .12$
2. $H_0: p = .05$, $H_a: p > .05$
3. $H_0: p = .70$, $H_a: p \ne .70$

In general we have $H_0: p = p_0$ vs. $H_a: p < p_0$, $p > p_0$, or $p \ne p_0$.

Note that your textbook may have expressed the first two null hypotheses as $H_0: p \ge .12$ and $H_0: p \le .05$. These expressions serve well as logical opposites to the alternative hypotheses, but our strategy to carry out a test will be to assume $H_0$ is true, which means we must commit to a single value $p_0$ at which to center the hypothesized distribution of $\hat{p}$. Thus we will write
$H_0: p = p_0$ in these notes. Alternatives with < or > signs are called one-sided alternatives; with $\ne$ they are two-sided. When in doubt, a two-sided alternative should be used, because it is more general.

Note: In statistical inference, we draw conclusions about unknown parameters. Thus $H_0$ and $H_a$ are statements about a parameter $p$, not a statistic $\hat{p}$. We can't argue about $\hat{p}$: its value has been measured and taken as fact.

Note: Just as success and failure in binomial settings lost their connotations of favorable and unfavorable, $H_a$ may or may not be a desired outcome. It can be something we hope or fear or simply suspect is true. However, because $p_0$ is a traditionally accepted value, we'll stick with $H_0$ unless there is convincing evidence to the contrary: $H_0$ is innocent until proven guilty.

How can we produce evidence to refute $H_0$? By using what probability theory tells us about the behavior of the RV sample proportion $\hat{p}$: it is centered at $p$, has spread $\sqrt{\frac{p(1-p)}{n}}$, and for large enough $n$ its shape is normal. Our strategy will be to determine if the observed value $\hat{p}$ is just too unlikely to have occurred if $H_0: p = p_0$ were true. If the probability of such an outcome, called the P-value, is too small, then we'll reject $H_0$ in favor of $H_a$.

The P-value of a test about a proportion $p$ is the probability, computed assuming that $H_0: p = p_0$ is true, that the test statistic would take a value at least as extreme (that is, as low or as high or as different) as the one observed. The smaller the P-value, the stronger the evidence against $H_0$.

Since Example 1 has $H_a: p < .12$, the P-value is the probability of a sample proportion of left-handers as low as .11 or lower coming from a population where the proportion of left-handers is .12. Based on what we learned about the sampling distribution of $\hat{p}$, we know that $\hat{p}$ here (assuming $H_0$ is true) has mean $p = .12$, standard error $\sqrt{\frac{.12(1-.12)}{371}}$, and an approximately normal shape, since $371(.12) \approx 45$ and $371(.88) \approx 326$ are both greater than 10. Also, we have in mind a much larger population of Pitt students, certainly more than $10(371) = 3710$.
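As a numerical check on the quantities just described for Example 1, here is a minimal Python sketch. It is not part of the course's MINITAB workflow: the helper name `null_distribution` is hypothetical, the sample proportion is rounded to three decimals as an assumption about how the notes round, and the standard library's `NormalDist` stands in for the normal table. Because the software does not round $z$ before looking up the tail area, the probability it reports (about 0.34) can differ slightly in the later decimals from a table lookup.

```python
from math import sqrt
from statistics import NormalDist

def null_distribution(n, p0):
    """Mean and standard error of p-hat assuming H0: p = p0,
    plus the check that n*p0 and n*(1-p0) are both at least 10."""
    se = sqrt(p0 * (1 - p0) / n)
    normal_ok = n * p0 >= 10 and n * (1 - p0) >= 10
    return p0, se, normal_ok

n, p0 = 371, 0.12                 # Example 1: left-handers among Pitt students
mean, se, normal_ok = null_distribution(n, p0)

p_hat = round(42 / n, 3)          # observed sample proportion, rounded (assumption)
z = (p_hat - mean) / se           # standardized value of the observed p-hat
p_value = NormalDist().cdf(z)     # lower-tail area, because Ha is p < .12

print(f"se = {se:.4f}, normal approximation ok = {normal_ok}")
print(f"z = {z:.2f}, P-value = {p_value:.2f}")
```

A probability this large says that a sample proportion near .11 is entirely ordinary when the population proportion is .12.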
P-value $= P(\hat{p} \le .113) \approx P\left(Z \le \frac{.113 - .12}{\sqrt{.12(1-.12)/371}}\right) = P(Z \le -.41) = .3409$

Note: For confidence intervals, since $\hat{p}$ has standard deviation $\sqrt{\frac{p(1-p)}{n}}$ with $p$ unknown, we estimated it with $\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$. Now we carry out our test assuming $H_0: p = p_0$ is true, so the standard deviation of $\hat{p}$ is $\sqrt{\frac{p_0(1-p_0)}{n}}$ and the test statistic is $z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$.

Since it's not at all unlikely (probability about .34) for a random sample of 371 from a population with proportion .12 of left-handers to have a sample proportion of only .11 left-handers, we have no cause to reject $H_0: p = .12$. The proportion of left-handers at Pitt may well be the same as for the whole country, .12.

Because we rely on standard normal tables to determine the P-value, we transform from an observed value $\hat{p}$ to a standardized value $z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$. The way to compute the P-value depends on the form of $H_a$, as illustrated below, first in terms of $\hat{p}$, then in terms of $z$.

P-value for Tests of Significance about $p$

| Alternative | Observed | Standardized |
|---|---|---|
| $H_a: p < p_0$ | P-value $= P(\text{RV } \hat{p} \le \text{stat } \hat{p})$ | P-value $= P(Z \le z)$ |
| $H_a: p > p_0$ | P-value $= P(\text{RV } \hat{p} \ge \text{stat } \hat{p})$ | P-value $= P(Z \ge z)$ |
| $H_a: p \ne p_0$ | P-value $=$ combined tail area | P-value $= 2P(Z \ge \lvert z \rvert)$ |

In each case $z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$.

### Summary of Test of Significance about $p$

Say a simple random sample of size $n$ is drawn from a large population with unknown proportion $p$ of successes. We measure $\hat{p} = X/n$ and carry out the test as follows:

1. Set up $H_0: p = p_0$ vs. $H_a: p < p_0$, $p > p_0$, or $p \ne p_0$.
2. Verify that the population is at least 10 times the sample size and that $np_0 \ge 10$ and $n(1 - p_0) \ge 10$. Then calculate the standardized test statistic $z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$.
3. Find the P-value:
   - $P(Z \le z\text{-statistic})$ for $H_a: p < p_0$
   - $P(Z \ge z\text{-statistic})$ for $H_a: p > p_0$
   - $2P(Z \ge \lvert z\text{-statistic} \rvert)$ for $H_a: p \ne p_0$
4. Determine if the results are statistically significant: if the P-value is small, reject $H_0$ in favor of $H_a$ and say the data are statistically significant; otherwise, we have failed to produce convincing evidence against $H_0$. For specified $\alpha$, reject $H_0$ if P-value < $\alpha$.
5. State conclusion in context of the particular problem.

Example: Let's follow these steps to solve the second problem: In a group
of 371 students, 45 chose the number seven when picking a number between one and twenty at random. Does this provide convincing statistical evidence of bias in favor of the number seven, in that the proportion of students picking seven is significantly higher than $1/20 = .05$? First calculate $\hat{p} = \frac{45}{371} = .12$.

1. $H_0: p = .05$, $H_a: p > .05$
2. We have in mind a very large population of all students. We check that $371(.05) \approx 19$ and $371(.95) \approx 352$ are both greater than 10. Next, $z = \frac{.12 - .05}{\sqrt{.05(1-.05)/371}} = 6.19$.
3. P-value $= P(Z \ge 6.19) = P(Z \le -6.19) \approx 0$
4. Since the P-value is very small, we reject $H_0$ and say the results are statistically significant.
5. There is very strong evidence of bias in favor of the number seven.

Example: I also suspected bias in favor of the number seventeen. In a group of 371 students, 25 chose the number seventeen when picking a number between one and twenty at random. Does this provide convincing statistical evidence of bias in favor of the number seventeen, in that the proportion of students picking seventeen is significantly higher than $1/20 = .05$? First calculate $\hat{p} = \frac{25}{371} = .067$.

1. $H_0: p = .05$, $H_a: p > .05$
2. $z = \frac{.067 - .05}{\sqrt{.05(1-.05)/371}} = 1.50$
3. P-value $= P(Z \ge 1.50) = P(Z \le -1.50) = .0668$
4. We could call this a borderline P-value.

Next lecture we'll discuss guidelines for how small the P-value should be in order to reject $H_0$, and we'll solve the third example. Often a cutoff probability $\alpha$ is set in advance, in which case we reject $H_0$ if the P-value is less than $\alpha$.

## Lecture 24: Testing Hypotheses About Proportions

Last time we learned the steps to carry out a test of significance:

1. Set up $H_0: p = p_0$ vs. $H_a: p < p_0$, $p > p_0$, or $p \ne p_0$.
2. In order to verify that the underlying distribution is approximately binomial, check that the population is at least 10 times the sample size. In order to justify use of a normal approximation to binomial proportion, check that $np_0 \ge 10$ and $n(1 - p_0) \ge 10$. Calculate the standardized test statistic $z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$.
3. Find the P-value:
   - $P(Z \le z\text{-statistic})$ for $H_a: p < p_0$
   - $P(Z \ge z\text{-statistic})$ for $H_a: p > p_0$
   - $2P(Z \ge \lvert z\text{-statistic} \rvert)$ for $H_a: p \ne p_0$
4. Assess significance: if the P-value is small, reject $H_0$ in
favor of $H_a$ and say the data are statistically significant; otherwise, we have failed to produce convincing evidence against $H_0$. For specified $\alpha$, reject $H_0$ if P-value < $\alpha$.
5. State conclusions in context.

Last time we began to solve the following example:

Example: When students are asked to pick a number "at random" from one to twenty, I suspect their selections will show bias in favor of the number seventeen. In a group of 371 students, 25 chose the number seventeen. Does this provide convincing statistical evidence of bias in favor of the number seventeen, in that the proportion of students picking seventeen is significantly higher than $1/20 = .05$? The null and alternative hypotheses were $H_0: p = .05$, $H_a: p > .05$, and so the $z$-statistic was 1.50 and the P-value was $P(Z \ge 1.50) = P(Z \le -1.50) = .0668$.

Step 4 says to reject $H_0$ if the P-value is small. How small is small? Sometimes it is decided in advance exactly how small the P-value would have to be to lead us to reject the null hypothesis: a cutoff probability $\alpha$ is prescribed in advance. Then if the P-value is less than $\alpha$, we reject $H_0$ and say the results are statistically significant at level $\alpha$. Otherwise, we do not have sufficient evidence to reject $H_0$.

Example: Is there evidence of bias in favor of the number seventeen at the $\alpha = .05$ level? The P-value is .0668, which is not less than .05, so by this criterion it is not small enough to reject $H_0$. It could be that students didn't have any systematic preference for the number seventeen, and the proportion of seventeens selected was a bit high only by chance.

Example: A university has found over the years that out of all the students who are offered admission, the proportion who accept is .70. After a new director of admissions is hired, the university wants to check if the proportion of students accepting has changed significantly. Suppose they offer admission to 1200 students and 888 accept. Is this evidence at the $\alpha = .05$ level that there has been a real change from the status quo? How about at the .02 level? First we find that $\hat{p} = .73$ is the sample proportion of students who accepted admission.

1. Set up $H_0: p = .70$ vs. $H_a: p \ne .70$.
2. Both conditions are satisfied; $z = \frac{.73 - .70}{\sqrt{.70(1-.70)/1200}} = 2.27$.
3. Because of the two-sided alternative, our P-value is $2P(Z \ge \lvert 2.27 \rvert) = 2P(Z \le -2.27) = 2(.0116) = .0232$.
4. Since .0232 < .05, we have evidence to reject at the 5% level. But .0232 is not less than .02, so we don't have evidence to reject at the 2% level.
5. If we set out to gather evidence of a change in either direction for overall proportion of students accepting admission, we would say yes with a cutoff of .05, no with a cutoff of .02. Thus this test is rather inconclusive.

Example: A university has found over the years that out of all the students who are offered admission, the proportion who accept is .70. After a new director of admissions is hired, the university wants to check if the proportion of students accepting has increased significantly. Suppose they offer admission to 1200 students and 888 accept. Is this evidence at the $\alpha = .05$ level that there has been a significant increase in proportion of students accepting admission? How about at the .02 level? Again we find that $\hat{p} = .73$ is the sample proportion of students who accepted admission.

1. The subtle rephrasing of the question ("increased" instead of "changed") results in a different alternative hypothesis: $H_0: p = .70$ vs. $H_a: p > .70$.
2. The $z$ statistic is unchanged: $z = \frac{.73 - .70}{\sqrt{.70(1-.70)/1200}} = 2.27$.
3. Because of the one-sided alternative, our P-value is $P(Z \ge 2.27) = P(Z \le -2.27) = .0116$.
4. Since .0116 < .05, we again have evidence to reject at the 5% level. This time, .0116 is also less than .02, so we also have evidence to reject at the 2% level.
5. If we set out to gather evidence of increased overall proportion of students accepting admission, we would say yes, we have produced evidence of an increase, whether the $\alpha = .05$ or $\alpha = .02$ level is used.

The previous examples demonstrate that

1. It is more difficult to reject $H_0$ for a two-sided alternative than for a one-sided alternative. In general, the two-sided P-value is twice the one-sided P-value. The one-sided P-value is
half the two-sided P-value.
2. It is more difficult to reject $H_0$ for lower levels of $\alpha$.

Calculating the P-value in Step 3 gives us the maximum amount of information to carry out our test: we know exactly how unlikely the observed $\hat{p}$ is. If a cutoff level $\alpha$ is prescribed in advance, then it is possible to bypass the calculation of the P-value in Step 3. Instead, the $z$-statistic is compared to the critical value $z^*$ associated with $\alpha$. For example, if we have a two-sided alternative and $\alpha$ is set at .05, then the rejection region would be where the test statistic $z$ exceeds 1.96 in absolute value. The disadvantage to this method is that it provides only the bare minimum of information needed to decide whether to reject $H_0$ or not. We will not employ the rejection region method in this course, but students should be aware of it in case they encounter it in other contexts.

A method that falls somewhere in between those which provide maximum and minimum information is the following: "close in on" the P-value by surrounding the $z$ statistic with neighboring values $z^*$ from the infinite row of Table A2. The advantage to this method is that it familiarizes us with the use of Table A2, which will be needed when we carry out hypothesis tests about the unknown population mean of a quantitative variable. Note that

- $z^* = 1.645$ corresponds to an area of .90 symmetric about zero, so each tail probability (that $Z$ takes a value less than $-1.645$ or greater than $+1.645$) is .05
- $z^* = 1.960$ corresponds to an area of .95 symmetric about zero, so each tail probability (that $Z$ takes a value less than $-1.960$ or greater than $+1.960$) is .025
- $z^* = 2.326$ corresponds to an area of .98 symmetric about zero, so each tail probability (that $Z$ takes a value less than $-2.326$ or greater than $+2.326$) is .01
- $z^* = 2.576$ corresponds to an area of .99 symmetric about zero, so each tail probability (that $Z$ takes a value less than $-2.576$ or greater than $+2.576$) is .005

These tail probabilities may be penciled in at the top or bottom ends of the columns in Table A2 for easy reference. We will now
re-solve some of our earlier examples using Table A2 instead of Table A1.

Example: A university has found over the years that out of all the students who are offered admission, the proportion who accept is .70. After a new director of admissions is hired, the university wants to check if the proportion of students accepting has increased significantly. Suppose they offer admission to 1200 students and 888 accept. Is this evidence at the $\alpha = .05$ level that there has been a significant increase in the proportion of students accepting? First we found that $\hat{p} = .73$ is the sample proportion of students who accepted admission, and set up $H_0: p = .70$ vs. $H_a: p > .70$. Next we calculated $z = \frac{.73 - .70}{\sqrt{.70(1-.70)/1200}} = 2.27$. Our P-value is $P(Z \ge 2.27)$. According to Table A2, $z = 2.27$ is between $z^* = 1.960$ and $z^* = 2.326$. Therefore our P-value $P(Z \ge 2.27)$ is between .025 and .01, which means it must be less than .05. We can reject $H_0$ at the 5% level. Recall: Table A1 showed the precise P-value to be .0116, which is in fact between .025 and .01.

Example: A university has found over the years that out of all the students who are offered admission, the proportion who accept is .70. After a new director of admissions is hired, the university wants to check if the proportion of students accepting has changed significantly. Suppose they offer admission to 1200 students and 888 accept. Is this evidence at the $\alpha = .05$ level that there has been a real change in either direction from the status quo? First we found that $\hat{p} = .73$ is the sample proportion of students who accepted admission, and we set up $H_0: p = .70$ vs. $H_a: p \ne .70$. Next we calculated $z = \frac{.73 - .70}{\sqrt{.70(1-.70)/1200}} = 2.27$. Because of the two-sided alternative, our P-value is $2P(Z \ge \lvert 2.27 \rvert) = 2P(Z \ge 2.27)$. According to Table A2, $z = 2.27$ is between $z^* = 1.960$ and $z^* = 2.326$. Therefore $P(Z \ge 2.27)$ is between .025 and .01, and the P-value $2P(Z \ge 2.27)$ is between $2(.025)$ and $2(.01)$, that is, between .05 and .02. We still can reject $H_0$ at the 5% level, but not at the 2% level.

Example: In a group of 371 Pitt students, 42 were left-handers, which makes the sample proportion .113. Is this significantly
lower than the proportion of Americans who are left-handers, which is .12? Earlier we found the $z$-statistic to be $z = \frac{.113 - .12}{\sqrt{.12(1-.12)/371}} = -.41$ and the P-value to be $P(Z \le -.41)$. Consulting Table A2, we see that .41 is less extreme than 1.645, so the P-value is larger than .05. Again, we have failed to produce any evidence against $H_0$.

Example: When students are asked to pick a number at random from one to twenty, I suspect their selections will show bias in favor of the number seventeen. In a group of 371 students, 25 chose the number seventeen. Does this provide convincing statistical evidence of bias in favor of the number seventeen, in that the proportion of students picking seventeen is significantly higher than $1/20 = .05$? The null and alternative hypotheses were $H_0: p = .05$, $H_a: p > .05$, and so the $z$-statistic was $z = 1.50$ and the P-value was $P(Z \ge 1.50)$. Instead of using Table A1 to find the precise P-value, we note from Table A2 that 1.50 is less than 1.645, so the tail probability must be greater than .05. Thus our P-value $P(Z \ge 1.50)$ is greater than .05, and we do not have convincing evidence of bias. Note: Earlier we found the exact P-value to be $P(Z \le -1.50) = .0668$, which is indeed greater than .05.

Example: In a previous Example, we began by assuming that the proportion of freshmen taking intro Stats classes is .25. According to survey data, we found the sample proportion of freshmen to be .08. By hand, we calculated the probability of a sample proportion this low coming from a population with proportion .25: it was approximately zero. I characterized this as virtually impossible, and decided not to believe that the overall proportion of freshmen is .25. Alternatively, I could use MINITAB to test the hypothesis that population proportion is .25 vs. the "less than" alternative. Since Year allows for more than two possibilities, it is necessary to use Stat, then Tables, then Tally to count the number of freshmen (35). Then use the Summarized Data option in the 1 Proportion procedure, specifying 445 as the Number of Trials and 35 as the Number of Successes.
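For readers without MINITAB, the same summarized-data test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `one_proportion_z_test` and its `alternative` argument are hypothetical, and it uses the normal-approximation $z$ test from these notes with the standard library's `NormalDist` in place of a table.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0, alternative="less"):
    """Normal-approximation z test of H0: p = p0 from summarized data.
    alternative is 'less', 'greater', or 'two-sided'."""
    p_hat = successes / n
    se0 = sqrt(p0 * (1 - p0) / n)      # standard deviation of p-hat under H0
    z = (p_hat - p0) / se0
    std_normal = NormalDist()          # standard normal: mean 0, sd 1
    if alternative == "less":
        p_value = std_normal.cdf(z)
    elif alternative == "greater":
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "two-sided":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")
    return p_hat, z, p_value

# Freshmen example: 35 successes in 445 trials, H0: p = .25 vs. Ha: p < .25
p_hat, z, p_value = one_proportion_z_test(35, 445, 0.25, alternative="less")
print(f"sample p = {p_hat:.6f}, z = {z:.2f}, P-value = {p_value:.3f}")
```

With these inputs the standardized statistic is roughly $-8.35$, far beyond anything in the normal table, so the P-value is zero to three decimal places.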
I opted to use test and interval based on normal distribution, since that's how I originally solved the problem by hand. The P-value is zero, and I reject the null hypothesis in favor of the alternative. I again conclude that the proportion of freshmen in intro Stats classes (at least in the Fall) is less than .25.

Tally for Discrete Variables: Year

| Year | Count |
|---|---|
| 1 | 35 |
| 2 | 257 |
| 3 | 102 |
| 4 | 37 |
| other | 14 |
| N = | 445 |

Test and CI for One Proportion

Test of p = 0.25 vs p < 0.25

| Sample | X | N | Sample p | 95.0% Upper Bound | Z-Value | P-Value |
|---|---|---|---|---|---|---|
| 1 | 35 | 445 | 0.078652 | 0.099642 | -8.35 | 0.000 |

Exercise: In a previous Exercise, we explored the sampling distribution of sample proportion of females, when random samples are taken from a population where the proportion of females is .5. We noted the sample proportion of females among surveyed Stats students, and calculated by hand the probability of observing such a high sample proportion, if population proportion were really only .5. We used this probability to decide whether we were willing to believe that population proportion is in fact .5. For this Exercise, address the same question by carrying out a formal hypothesis test using MINITAB. Be sure to specify the appropriate alternative hypothesis. State your conclusions clearly in context.

## Lecture 25: Type I and Type II Error

When we set a cutoff level $\alpha$ in advance for a hypothesis test, we are actually specifying the long-run probability we are willing to take of rejecting a true null hypothesis, which is one of the two possible mistaken decisions that can be made in a hypothesis test setting.

Example: Recall our testing-for-disease example in Chapter 7, in which the probability of a false positive was .015 and the probability of a false negative was .003. All the possibilities for Decision and Actuality are shown in the table below. If we decide to use .015 as our cutoff probability (P-value < .015 means reject $H_0$, otherwise don't reject), then .015 is the probability of making a Type I Error: the probability of rejecting the null hypothesis even though it is true. That
means the probability of correctly accepting a true null hypothesis is 17 015 985 In medical situtations this is the speci city of the test I In our example we were told the probability ofa false negative or Type II Errorl Thus the probability of a correct positive for an ill person called the sensitivity of the test 1 minus the probabilty of Type II errorl Statisticians refer to this probability as the power of the test In a 2 test about population proportion p the probability of a Type II error incorrectly failing to reject the null hypothesis when the alternative is true can only be calculated if we are told speci cally the actual value of the population proportionl Thus we need to know the alternative proportion which contradicts the null hypothesized proportionl What we do not need in order to calculate the probability of Type II error is the value of an observed proportion Our probability is about the test itself not about the results Rather than focusing on making such calculations we will instead think carefully about the implications of making Type I or Type II errorsl Example For our medical example above the probability of incorrectly telling a healthy person that he or she does have AIDS is higher than the probability of incorrectly telling an infected person that he or she does not have AIDS If a healthy person initially tests positive Type I error then the consequence besides considerable anxiety is a subsequent more discerning test which has a better chance of making the correct diagnosis second time around If an infected person tests negative Type II error then the consequences are more dire because treatment will be withheld or at best delayed and there is the risk of further infecting other individuals Thus in this case it makes sense to live with a higher probability of Type I error in order to diminish the probability of Type II errorl Example Consider the following legal example the null hypothesis is that the defendant is innocent and the 
alternative is that the defendant is guilty The trial weighs evidence as in a hypothesis test in order to decide whether or not to reject the null hypothesis of innocence What would Type I and H errors signify in this context A Type I error means rejecting a null hypothesis that is true in other words nding an innocent person guilty Most people would agree that this is much worse than committing a Type ll error in this context which would be failing to convict a guilty person Dr Stephen Fienberg of CMU did extensive work for the government is assessing the effectiveness of liedetector tests He concluded that probabilities of committing both types of error were so high that he and a panel of investigators recommended discontinuing the use of such tests A peek at a brain can unmask a liar tells about the most recent technology for new sorts of lie detectors to replace the oldfashioned polygraph Role of Sample Size Example Suppose one demographer claims that there are equal proportions of male and female births in a certain state whereas another claims there are more males They use hospital records from all over the state to sample 10000 recent births and nd 5120 to be males or p 512 They test H0 p 5 vs Ha p gt 5 and calculate 2 24 so the Pvalue is 0084 quite 0 small Does this mean a they have evidence that the population proportion of male births is much higher than 5 or b they have very strong evidence that the population proportion of male births is higher than 5 The interpretation in b is the correct one a is not Especially when the sample size is large we may produce very strong evidence of a relatively minor difference rom the claimed p0 Conversely if n is too small we may fail to gather evidence about a difference that is quite substantial Example A Statistics recitation instructor suspects there to be a higher proportion of females overall in Stats classes She observes 12 females in a group of 20 students so oes this con rm her suspicions She would test H0 p 5 
vs. $H_a: p > .5$. First she verifies that $20(.5)$ and $20(1 - .5)$ are both 10, just barely satisfying our condition for a normal approximation. Also, she has in mind a population in the hundreds or even thousands, so the binomial model applies. She calculates

$$z = \frac{.6 - .5}{\sqrt{\frac{.5(1 - .5)}{20}}} = .89$$

The P-value is .1867, providing her with no statistical evidence to support her claim. In fact, the population proportion of females really is greater than .5, but this sample size was just too small to prove it. In contrast, a lecture class of 80 students with $\hat{p} = .6$ would produce a z statistic of 1.79 and a P-value of .0367.

Remember that

$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}} = \frac{(\hat{p} - p_0)\sqrt{n}}{\sqrt{p_0(1 - p_0)}}$$

We reject $H_0$ for a small P-value, which in turn has arisen from a $z$ that is large in absolute value, on the fringes of the normal curve. There are three components that may result in a $z$ that is large in absolute value, which in turn cause us to reject $H_0$:

1. What people tend to focus on as the cause of rejecting $H_0$ is a large difference $\hat{p} - p_0$ between the observed proportion and the proportion proposed in the null hypothesis. This naturally makes $z$ large and the P-value small.
2. A large sample size $n$, because $\sqrt{n}$ is actually multiplied in the numerator of the test statistic $z$, brings about a large $z$ and a small P-value. Conversely, a small sample size $n$ may lead to a smaller $z$ and failure to reject $H_0$, even if it is false (a Type II error).
3. If $p_0$ is close to .5, then $\sqrt{p_0(1 - p_0)}$ is considerably larger than it is for $p_0$ close to 0 or 1. For example, $\sqrt{p_0(1 - p_0)}$ is .5 for $p_0 = .5$, but it is .1 for $p_0 = .01$ or $.99$. Because this quantity is in the denominator of $z$, the same difference $\hat{p} - p_0$ produces a $z$ that is larger in absolute value when $p_0$ is close to 0 or 1.

When Hypothesis Tests Are Not Appropriate

Remember that we carry out a hypothesis test, based on sample data, in order to draw conclusions about the larger population from which the sample was obtained. Hypothesis tests are not appropriate if there is no larger group being represented by the sample.

Example: In 2002, the government requested and won approval for 1,228 special warrants for secret wiretaps and searches of suspected terrorists and spies. Is this significantly higher than 934, which was the number of special warrants approved in 2001? Statistical inference is not appropriate here because 1,228 and 934 represent entire populations for 2002 and 2001; they are not sample data.

Example: In 2000, 75% of all bachelor's degrees were earned by whites. Is this significantly lower than 86%, the proportion of all bachelor's degrees earned by whites in 1981? We would not carry out a significance test, because the given proportions already describe the population.

Example: An internet review of home pregnancy tests reports: "Home pregnancy testing kits usually claim accuracy of over 95% (whatever that may mean). The reality is that the literature contains information on only four kits evaluated as they are intended to be used: by women testing their own urine. The results we have suggest that for every four women who use such a test and are pregnant, one will get a negative test result. It also suggests that for every four women who are not pregnant, one will have a positive test result." From this information, we can identify the probabilities of both Type I and II errors, according to the review, as being 1 in 4, or 25%.

Example: Gonorrhea is a very common infectious disease. In 1999, the rate of reported gonorrhea infections was 132.2 per 100,000 persons. A polymerase chain reaction (PCR) test for gonorrhea is known to have sensitivity 97% and specificity 98%. What are the probabilities of Type I and Type II Errors? Given the high degree of accuracy of the test, if a randomly chosen person in the U.S. is routinely screened for gonorrhea and the test comes up positive, what is the probability of actually having the disease?

The null hypothesis would be that someone does not have the disease. A Type I Error would be rejecting the null hypothesis even though it is true: testing positive when a person does not have the disease. A Type II Error would be failing to reject the null hypothesis even though it is false: testing negative when a person does have the disease. A sensitivity of 97% means that if someone has the disease, the probability of correctly testing positive is 97%, and so the probability of testing negative when someone has the disease is 3%; this is the probability of a Type II error. A specificity of 98% means that if someone does not have the disease, the probability of correctly testing negative is 98%, and so the probability of testing positive when someone does not have the disease is 2%; this is the probability of a Type I error.

A two-way table makes it easier to identify the probability we are seeking, of having the disease given that the test is positive. We begin with a total of 100,000 people, of whom 132 have the disease; the remaining 99,868 do not. Sensitivity 97% means 127 of the 132 with gonorrhea test positive. Specificity 98% means 97,871 of the 99,868 people without gonorrhea test negative. The remaining counts can be filled in by subtraction.

                    Positive   Negative     Total
    Gonorrhea            127          5       132
    No Gonorrhea       1,997     97,871    99,868
    Total              2,124     97,876   100,000

Of the 2,124 people who test positive, 127 actually have the disease: if someone tests positive, the probability of having the disease is $127/2124 = .06$. Remember, however, that this probability applies to a randomly chosen person being screened. If someone is screened because of exhibiting symptoms, the probability is of course higher.

Exercise: Refer to the article "How not to catch a spy: Use a lie detector", which reports at the bottom of the first column: "Even if the test were designed to catch eight of every 10 spies, it would produce false results for large numbers of people. For every 10,000 employees screened, Fienberg said, eight real spies would be singled out, but 1,598 innocent people would be singled out with them, with no hint of who's a spy and who isn't." Based on this information, set up a two-way table, classifying 10,000 employees as actually being spies or not, and being singled out as a spy by the lie detector or not. Report the probability of a Type I Error and of a Type II Error. If someone is identified by the lie detector as being a spy, what is
the probability that he or she is actually a spy?

Lecture 15: Random Variables
- Definitions, Notation
- Probability Distributions
- Application of Probability Rules
- Mean and sd of Random Variables; Rules

© 2007 Nancy Pfenning, Elementary Statistics: Looking at the Big Picture

Looking Back: Review
4 Stages of Statistics
- Data Production (discussed in Lectures 1-4)
- Displaying and Summarizing (Lectures 5-12)
- Probability
  - Finding Probabilities (discussed in Lectures 13-14)
  - Random Variables
  - Sampling Distributions
- Statistical Inference

Definition
Random Variable: a quantitative variable whose values are results of a random process.

Looking Ahead: In Inference, we'll want to draw conclusions about population proportion or mean, based on sample proportion or mean. To accomplish this, we will explore how sample proportion or mean behave in repeated samples. If the samples are random, sample proportion or sample mean are random variables. Sample proportion and sample mean are very complicated random variables, so we start out by looking at much simpler random variables.

Definitions
- Discrete Random Variable: one whose possible values are finite or countably infinite (like the numbers 1, 2, 3, ...)
- Continuous Random Variable: one whose values constitute an entire (infinite) range of possibilities over an interval

Notation
Random variables are generally denoted with capital letters such as X, Y, or Z. The letter Z is often reserved for random variables that follow a standardized normal distribution.

Example: A Simple Random Variable
- Background: Toss a coin twice and let the random variable X be the number of tails appearing.
- Questions:
  - What are the possible values of X?
  - What kind of random variable is X?
- Responses:
  - Possible values: 0, 1, 2
  - X is a discrete random variable.

Definitions
- Probability distribution of a random variable tells all of its possible values along with their associated probabilities.
- Probability histogram displays possible values of a random variable along the horizontal axis, probabilities along the vertical axis.

Looking Back: Last chapter we considered individual probabilities, like the chance of getting two tails in two coin tosses. Now we take a more global perspective, considering the probabilities of all the possible numbers of tails occurring in two coin tosses.

Median and Mean of Probability Distribution
- Median is the middle value, with half of values above and half below ("equal area" value on histogram).
- Mean is the average value ("balance point" of histogram).
- Mean equals Median for symmetric distributions.

Example: Probability Distribution of a Random Variable
- Background: The random variable X is the number of tails in two tosses of a coin.
- Questions:
  - What is the probability distribution of X?
  - How can we display and summarize the distribution of X?
- Responses:
  - Possible outcomes (1st toss, 2nd toss): HH, HT, TH, TT. Each has probability 1/4.
  - Probability distribution:

        X = no. of tails     0     1     2
        Probability         1/4   1/2   1/4

  - Non-overlapping "Or" Rule: P(X=1) = 1/4 + 1/4 = 1/2
  - Display: probability histogram
  - Summarize: center (mean = median = 1); spread (typical distance from 1 is about 0.7); shape (symmetric)

Notation
P(X=x) denotes the probability that the random variable X takes the value x.

Permissible Probabilities and Sum-to-One Rule for Probability Distributions
Any probability distribution of a discrete random variable X must satisfy:
- $0 \le P(X = x) \le 1$, where $x$ is any value of X
- $P(X = x_1) + P(X = x_2) + \cdots + P(X = x_k) = 1$, where $x_1, x_2, \ldots, x_k$ are all possible values of X
According to this Rule, if a probability histogram has bars of width 1, their total area must be 1.

Interim Table
To construct the probability distribution for more complicated random processes, begin with an interim table showing all outcomes and their probabilities.

Example: Interim Table and Probability Distribution
- Background: A coin is tossed 3 times and the random variable X is the number of tails tossed.
- Questions: What are
  - all possible outcomes, values of X, and probabilities (Interim Table)?
  - the probability distribution of X and probability histogram?
  - the shape, center, and spread of the distribution?
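The interim-table approach asked for above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coin is taken to be fair (so all toss sequences are equally likely), and the function name `tails_distribution` is made up for this sketch.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

def tails_distribution(n_tosses):
    """Build the interim table and the distribution of X = number of tails.

    Hypothetical helper for illustration; assumes a fair coin.
    """
    outcomes = ["".join(seq) for seq in product("HT", repeat=n_tosses)]
    prob_each = Fraction(1, len(outcomes))  # equally likely outcomes
    # Interim table: one row per outcome (outcome, value of X, probability)
    interim = [(o, o.count("T"), prob_each) for o in outcomes]
    # Non-overlapping "Or" Rule: add probabilities of outcomes sharing a value of X
    dist = defaultdict(Fraction)
    for _, x, p in interim:
        dist[x] += p
    return interim, dict(dist)

interim, dist = tails_distribution(3)
for outcome, x, p in interim:
    print(outcome, x, p)
for x in sorted(dist):
    print(f"P(X = {x}) = {dist[x]}")
```

Using exact fractions rather than decimals keeps the Sum-to-One Rule easy to check: the probabilities in `dist` add to exactly 1.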
Example: Interim Table and Probability Distribution (continued)
- Background: A coin is tossed 3 times and the random variable X is the number of tails tossed.
- Responses:
  - Use the Non-overlapping "Or" Rule to combine probabilities: for X=1, 1/8 + 1/8 + 1/8 = 3/8, and likewise for X=2.
  - Probability distribution (with accompanying probability histogram):

        X = no. of tails     0     1     2     3
        P(X=x)              1/8   3/8   3/8   1/8

  - Shape: symmetric
  - Center: median = mean = 1.5
  - Spread: typical distance from the mean is a bit less than 1, since 1 and 2 (which are more common) are only 0.5 away from 1.5, and 0 and 3 (which are less common) are 1.5 away from 1.5.

Looking Ahead: Standard deviation of a random variable will be introduced later on.

Definition (Review)
Probability: the chance of an event occurring, determined as the
- Proportion of equally likely outcomes comprising the event; or
- Proportion of outcomes observed in the long run that comprised the event; or
- Likelihood of occurring, assessed subjectively.

Example: Probability Distribution Based on Long-Run Observed Outcomes
- Background: The Census Bureau reported the distribution of U.S. household size in 2000:

        X          1      2      3      4      5      6      7
        P(X=x)   0.26   0.34   0.16   0.14   0.07   0.02   0.01

- Question: What is the difference between how these probabilities have been assessed and the way we assessed probabilities for coin flip examples?
- Response: Coin flip probabilities are based on known properties of the coin: two equally likely faces. Household probabilities are based on observing all households in the U.S. in 2000.

Probability Rules (Review)
Probabilities must obey
- Permissible Probabilities Rule
- Sum-to-One Rule
- "Not" Rule
- Non-Overlapping "Or" Rule
- Independent "And" Rule
- General "Or" Rule
- General "And" Rule
- Rule of Conditional Probability

Each of the following examples uses the household size distribution above as its Background.

Example: Permissible Probabilities Rule
- Question: How do these probabilities conform to the Permissible Probabilities Rule?
- Response: Each of the probabilities is between 0 and 1.

Example: Sum-to-One Rule
- Question: According to the Sum-to-One Rule, what must be true about the probabilities in the distribution?
- Response: According to the Rule, we have 0.26 + 0.34 + 0.16 + 0.14 + 0.07 + 0.02 + 0.01 = 1.

Example: "Not" Rule
- Question: According to the "Not" Rule, what is the probability of a household not consisting of just one person?
- Response: P(X≠1) = 1 - P(X=1) = 1 - 0.26 = 0.74

Example: Non-Overlapping "Or" Rule
- Question: According to the Non-overlapping "Or" Rule, what is the probability of having fewer than 3 people?
- Response: The probability of having fewer than 3 people is P(X<3) = P(X=1 or X=2) = P(X=1) + P(X=2) = 0.26 + 0.34 = 0.60

Example: Independent "And" Rule
- Question: Suppose a polling organization has sampled two households at random. According to the Independent "And" Rule, what is the probability that the first has 3 people and the second has 4 people?
- Response: The probability that the first has 3 people and the second has 4 people is P(X1=3 and X2=4) = P(X1=3) × P(X2=4) = 0.16(0.14) = 0.0224, where we use X1 to denote the number in the 1st household and X2 to denote the number in the 2nd household.

Example: General "Or" Rule
- Question: Suppose a polling organization has sampled two households at random. According to the General "Or" Rule, what is the probability that one or the other has 3 people?
- Response: The events overlap: it is possible that both households have 3 people. P(X1=3 or X2=3) = P(X1=3) + P(X2=3) - P(X1=3 and X2=3) = 0.16 + 0.16 - 0.16(0.16) = 0.2944. We apply the Independent "And" Rule for P(X1=3 and X2=3).

Example: Rule of Conditional Probability
- Question: Suppose a polling organization samples only from households with fewer than 3 people. What is the probability that a household with fewer than 3 people has only 1 person?
- Response: P(X=1 given X<3) = P(X=1 and X<3) / P(X<3) = 0.26 / 0.60 = 0.43

Mean and Standard Deviation of Random Variable
- Mean of discrete random variable X:

  $$\mu = x_1 P(X = x_1) + \cdots + x_k P(X = x_k)$$

  Mean is the weighted average of values, where each value is weighted with its probability.
- Standard deviation of discrete random variable X:

  $$\sigma = \sqrt{(x_1 - \mu)^2 P(X = x_1) + \cdots + (x_k - \mu)^2 P(X = x_k)}$$

  Standard deviation is the "typical" distance of values from the mean. Squared standard deviation is the variance.

Looking Back: Greek letters are used because these are the mean and standard deviation of all the random variable's values.

Example: Mean of Random Variable
- Background: Household size in the U.S. has the distribution above.
- Question: What is the mean household size?
- Response: 1(0.26) + 2(0.34) + ... + 7(0.01) = 2.52, or about 2.5, is the mean household size.

Looking Back: Median is 2 (has 0.5 at or below it). Mean is greater than median because the distribution is skewed right. Also, the mean is less than the "middle" number 4 because smaller household sizes are weighted with higher probabilities.

Example: Standard Deviation of R.V.
- Background: Household size in the U.S. has the distribution above.
- Question: What is the standard deviation of household sizes (typical distance from the mean 2.5)? (a) 0.014 (b) 0.14 (c) 1.4 (d) 14.0
- Response: The typical distance of household sizes from their mean 2.5 is 1.4: the closest are 0.5 away (2 and 3), the farthest is 4.5 away (7). Or calculate by hand or with software.

Looking Back: A non-normal distribution does not conform to the 68-95-99.7 Rule: the probability of being more than 2 sds below the mean (less than -0.3) is 0.
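As a check on the "by hand or software" calculation, here is a minimal Python sketch that applies the weighted-average formulas for the mean and standard deviation of a discrete random variable to the household size distribution. The helper name `mean_and_sd` is made up for this illustration.

```python
import math

# Household size distribution reported for the U.S. in 2000
household_size = {1: 0.26, 2: 0.34, 3: 0.16, 4: 0.14, 5: 0.07, 6: 0.02, 7: 0.01}

def mean_and_sd(dist):
    """Return (mean, standard deviation) of a discrete random variable.

    dist maps each possible value x to its probability P(X = x).
    Hypothetical helper for illustration.
    """
    # Mean: each value weighted by its probability
    mean = sum(x * p for x, p in dist.items())
    # Variance: squared distances from the mean, weighted by probability
    variance = sum((x - mean) ** 2 * p for x, p in dist.items())
    return mean, math.sqrt(variance)

mean, sd = mean_and_sd(household_size)
print(f"mean = {mean:.2f}, sd = {sd:.2f}")
```

The same function can be applied to any distribution table in these notes, as long as the probabilities obey the Sum-to-One Rule.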
Example: Standard Deviation of R.V. (continued)
- Response: The typical distance of household sizes from their mean 2.5 is the standard deviation, 1.4.

[Figure: probability histogram of household size, with the mean and standard deviation marked]

Rules for Mean and Variance
- Multiply a random variable by a constant: its mean and standard deviation are multiplied by the same constant (or its absolute value, since sd > 0).
- Take the sum of two independent random variables:
  - mean of sum = sum of means
  - variance of sum = sum of variances (variance is squared standard deviation)

Looking Ahead: These rules will help us identify the mean and standard deviation of sample proportion and sample mean.

Example: Mean, Variance, and SD of R.V.
- Background: Number X rolled on a die has

        X = no. rolled     1     2     3     4     5     6
        P(X=x)            1/6   1/6   1/6   1/6   1/6   1/6

- Question: What are the mean, variance, and standard deviation of X?
- Response:
  - Mean (same as median because symmetric) = 3.5
  - Variance (found by hand or with software) = 2.92
  - Standard deviation = square root of variance = 1.7

Example: Mean and SD for Multiple of R.V.
- Background: Number X rolled on a die has mean 3.5, sd 1.7.
- Question: What are the mean and sd of double the roll?
- Response: For double the roll, mean is 2(3.5) = 7, sd is 2(1.7) = 3.4.

Example: Mean and SD for Sum of R.V.s
- Background: Numbers X1, X2 on 2 dice each have mean 3.5, variance 2.92.
- Question: What are the mean, variance, and sd of the total on 2 dice?
- Response: Mean = 3.5 + 3.5 = 7, variance = 2.92 + 2.92 = 5.84, sd = $\sqrt{5.84}$ = 2.4.

Example: Doubling R.V. or Adding Two R.V.s
- Background: Double roll of a die: mean 7, sd 3.4. Total of 2 dice: mean 7, sd 2.4.

[Figures: probability histograms for double the number rolled on 1 die and for the total rolled on 2 dice]

- Question: Why does the total of 2 dice have less spread than the double roll of 1 die?
- Response: Doubling the roll of 1 die makes the extremes (2(1) = 2 or 2(6) = 12) more likely; totaling 2 dice tends to have low and high rolls "cancel each other out". This is the key to the benefits of sampling many individuals: the average of their responses gets us closer to what's true for the larger group.
- If the numbers on a die were unknown, and you had to guess their mean value, would you make a better guess with a single roll or the average of two rolls?

Lecture Summary (Random Variables)
- Random variables
  - Discrete vs. continuous
  - Notation
- Probability distributions: displaying, summarizing
- Probability rules applied to random variables
- Constructing distribution table
- Mean and standard deviation of random variable
- Rules for mean and variance

Lecture Notes for Stat 1000 by Dr. Nancy Pfenning

Note: Survey/article exercises must be handed in to me in lecture, along with the textbook homework problems, by the due date. They must be your own individual work. Each is worth a maximum of 2 points. For problems involving survey variables, access the survey data via my website (nancyp@pitt.edu, www.pitt.edu/~nancyp/stat-1000/index.html), where there is a link to the most recent survey data at surveymm-dd-yy.txt, followed by instructions for downloading into MINITAB. Be sure to choose different variables from those used in lecture examples. For problems involving your own newspaper articles or internet reports, you must hand in a copy of the article or report itself.

Lecture 1
Chapter 1: Statistics Success Stories and Cautionary Tales

Statistics is a collection of procedures and principles for gathering data and analyzing information in order to help people make decisions when faced with uncertainty.

Some years ago, Rutgers' football team managed to rack up seven turnovers from Pitt's team, but Pitt won 29-17. Pitt's coach Walt Harris remarked, "The scoreboard is what matters. Statistics are for proverbial losers." Pitt's team may have come out ahead in spite of the statistics involved, but studying statistics can actually put you ahead of the game, in that it will help you understand the world around you much better than you would otherwise. Consider the following quotation: "The country is hungry for information; everything of a statistical character, or even a statistical appearance, is taken up with an eagerness
that is almost pathetic; the community have not yet learned to be half skeptical and critical enough in respect to such statements." If this was true back in the 1870s, when spoken by General Francis A. Walker, superintendent of the 1870 census, how much more true is it today, when we are bombarded with information of a statistical nature in virtually all aspects of our lives? Here are some examples of questions which can be answered with the help of statistics.

Example: Suppose we have the following data on how much money three students earned (in thousands of dollars) last year:

    Name      Earned
    Jessica       10
    Nicole         0
    Brian          2

If someone asked us to summarize the information, we could simply state that the students earned 10, 0, and 2 thousand dollars, respectively. Now imagine data given for an entire class of 80 or 90 students. Could we look at the list of all of their earnings and discover an overall pattern? If so, could we identify any clear exceptions to this pattern? How could we briefly summarize the data, using just a few words or numbers? How were the data produced and measured? My data came from a sample of students attending class. Could we use information about their earnings to draw conclusions about the earnings of the population of all Pitt students? How reliable would those conclusions be?

First (Chapter 2) we perform data analysis, summarizing the data at hand with graphical displays or key numerical and verbal descriptions. Next (Chapters 3 and 4) we'll find out about good data production via experiments, observational studies, or surveys. Then we look at relationships: between two quantitative variables in Chapter 5, between two categorical variables in Chapter 6. Next (Chapter 7) we'll establish enough groundwork in probability so that we can understand the behavior of random variables (Chapter 8). In Chapter 9 we focus on particular random variables, sample mean and sample proportion, in order to establish the theory needed to perform statistical inference in Chapters 11 through 16: given information from a random sample, we will draw conclusions about the entire population from which the sample was obtained. All of this will be facilitated with MINITAB, an easy-to-use statistical package. Altogether five recitations are to be held in the Stats Lab, including the first two.

Example: In May 2000, 56% of 1,012 respondents to an Associated Press survey supported gays' rights to inherit from their partners. "The AP poll of 1,012 people was taken May 17-21. Its error margin was plus or minus 3 percentage points, slightly larger for the split sample." What do those 3 percentage points mean, and how were they calculated? In fact, methods of statistics will eventually tell us (in Chapter 10) that we can be pretty sure that the percentage of all American adults who support gays' rights to inherit is within 3% of 56%. We call 3% the margin of error. A rough approximation for the margin of error can be found by taking 1 divided by the square root of the sample size:

$$\frac{1}{\sqrt{1012}} \approx \frac{1}{32} \approx .03 = 3\%$$

In general, we can be 95% sure that the population percentage comes within one margin of error of the sample percentage, as long as the sample has been chosen at random from the population. Assuming our 1,012 American adults were sampled at random, we can be 95% sure that between 53% and 59% of all American adults support gays' rights to inherit. If about 500 each Democrats and Republicans were surveyed, then the margin of error within each group would be about $1/\sqrt{500} \approx .04 = 4\%$, which is why the article reports that the error margin is slightly larger for the split sample.

Example: In a recent survey of 2,500 American adults, 475 (that is, 19%) said they believed money could buy happiness. What does this tell us about how all American adults feel? Now the margin of error is

$$\frac{1}{\sqrt{2500}} = \frac{1}{50} = .02 = 2\%$$

Assuming our 2,500 American adults were sampled at random, we can be 95% sure that between 17% and 21% of all American adults believe money can buy happiness.

Example: Larry Flynt, publisher of Hustler Magazine, spoke to an interviewer about the issue of exploitation.
"Often some women on the fringe, such as Gloria Steinem and that bunch, see pornography as being exploitative and demeaning to women, but of the thousands of girls who have posed for my magazines, I've never had one who felt she had been exploited." Can we generalize from his sample to the population of all women? No: his was by no means a representative sample, and so it really tells us nothing about how women in general feel about pornography.

Example: A recent study found that men are twice as likely as women to be struck by lightning. Should men fear for their lives? No, because the baseline risk for women is only 1 in 10 million; thus the risk for men is still only 1 in 5 million.

Example: Why is it that many languages have no specific word for the color blue, and do not distinguish between blue and green? Researchers from Ohio State University reviewed 203 languages from around the world and levels of ultraviolet-B, which in high levels damages the eye, making it less able to distinguish blue from green. In areas with low levels of UV-B, languages tended to have a word for blue, while areas with high levels tended not to.

Example: Psychologists and social scientists noted that children who grow up in families with fewer kids tend to have higher IQs. Should we conclude that parents can boost their kids' IQ scores by having fewer children? As with any observational study, it is difficult here to establish proof of a cause-effect relationship. In fact, researchers have reason to believe that causation goes more in the other direction: parents with higher IQs tend to have fewer children, and by heredity those children tend to have higher IQs.

Example: Suppose we noticed that people who use stronger sunscreen tend to stay in the sun longer. Could we conclude that stronger sunscreen leads to more time in the sun? No: a confounding variable could be the person's inclination to seek or avoid time in the sun, which could also influence what type of sunscreen is used. In an observational study, where
variables' values are observed as they naturally occur, it is common for a confounding variable to cloud the issue.

Example: In a study of 87 French and Swiss college students, researchers randomly gave half of them sunscreen with a protection factor of 10 and the other half with a factor of 30. The students, who weren't told which lotion they had received, went on summer vacation and recorded the amount of time they spent in the sun. Users of the stronger sunscreen spent 25% more time in the sun, mostly sunbathing, because they typically waited until their skin turned red before rushing to the shade. Can we conclude that, in general, using a stronger sunscreen leads people to spend more time in the sun? Yes, because an experiment was performed, whereby researchers imposed the type of sunscreen (treatment) at random. This controls for possible confounding variables, such as we discussed for the observational study above, and lets us draw a conclusion about cause and effect.

Example: Students who take a formal SAT prep course score significantly better on the SATs. What does this mean? Statistical significance means that it would be unlikely to see such a difference in the sample if there were actually no difference in the population. Especially with a large sample (this study involved over 14,000 students), we may be able to produce statistical evidence of a difference that has little practical significance. In fact, research shows that coaching may only improve SAT scores by about 20 points out of the possible 1600.

Chapter 2: Turning Data Into Information

Example: Consider the following raw data, which have not yet been processed.

Name      Sex   Earned   Age    Year
Jessica   f     10       22.3   4
Nicole    f     0        19.4   2
Brian     m     2        19.8   2

Survey results from a class can be considered sample data if we think of these students as being a subset of the larger population of all Pitt students. A number that describes the sample is called a statistic (19% was the proportion of sampled adults who believed money could buy happiness), whereas a number
that describes the population is called a parameter (the unknown proportion of all adults who believe money could buy happiness). If all the individuals in a sample or population were the same, then there would be nothing of interest to examine, and statistics would be unnecessary. But fortunately, characteristics do vary from one individual to the next, and so we call these characteristics variables. A variable may be categorical, like sex, or quantitative, like earnings or age. What about year? If I'd only permitted responses of 1, 2, 3, or 4, year could be treated as a quantitative variable. But since the response "other" was also possible, we must treat year as categorical. We can call it ordinal because the years do follow an order from lowest to highest, unlike major, for example, which cannot be ordered.

Lecture 2

The best way to handle statistical information depends to a large extent on the number and type of variables involved. Let's identify these for the previous examples.

Example: Earnings of students is a quantitative variable.

Example: 19% of 2500 Americans said they believed money could buy happiness. Just one categorical variable (believing or not) is involved.

Example: Men are twice as likely as women to be hit by lightning. Here we consider two categorical variables: gender, and whether or not a person is hit by lightning. When the relationship between two variables is being considered, it helps to decide which, if any, plays the role of explanatory variable and which is the response variable. In this case, gender would be the explanatory variable and being hit by lightning or not is the response.

Example: Languages tend not to have a specific word for the color blue if they are spoken in countries with high levels of UV-B. Low or high levels of UV-B is the explanatory variable and having a word for the color blue or not is the response; both are treated as categorical variables.

Example: Do children in smaller families have higher IQs? We are interested in two quantitative variables, family size and IQ
score. Researchers originally thought of family size as being the explanatory variable, but then they realized that it is closer to being the response, explained by IQ of parents and heredity.

Example: Do people stay in the sun longer if they use a stronger sunscreen? Type of sunscreen is the explanatory variable, and it's categorical. Time in the sun is the response, and it's quantitative.

Example: Students who take SAT prep courses score significantly better. SAT score is a quantitative variable. If we compared scores for two groups, those who did and did not take a prep course, then we'd be introducing an additional categorical variable.

Example: A December 2003 New York Times article stated: "It's not your imagination; it really is taking longer to get there. Scheduled travel time between many major cities by air, rail and bus all increased from 1995 to last year, according to the Transportation Department's Bureau of Transportation Statistics. The bureau studied 261 city-pair markets and found that in 68 percent of them, scheduled air travel time increased for direct service. The scheduled trip took longer by train in 61 percent of those city pairs, and by bus in 52 percent." The variables involved are change in scheduled travel time and mode of transportation. As reported, travel time is not summarized quantitatively; rather, it was recorded whether or not the time increased from 1995 to 2002, a categorical variable. Mode of transportation is a categorical variable, allowing for three possibilities. Whether travel time increased is summarized with percentages, and these are cited for the various modes of transportation. There is not much emphasis on making comparisons for plane vs. train vs. bus, but if we wanted to make an assignment of explanatory/response, mode of transportation would be explanatory and whether time increased would be response.

Exercise: Hand in an article or report about a statistical study; tell what variable or variables are involved and whether they are quantitative or categorical.
If there are two variables, tell which is explanatory and which is response.

Chapter 2: Turning Data Into Information

Now we'll begin to learn how to display and summarize data, depending on the number and type of variables involved. There are various ways to display and summarize data, depending on whether there are quantitative or categorical variables, or some combination, involved. We summarize categorical variables by recording the count or (usually preferable) the percent or proportion in the category of interest.

Example: In the 1311 crimes in Maryland from 1978 to 1999 where the defendants were eligible for the death penalty, 690 involved black victims. There is one categorical variable involved here, namely race of the victim. The percentage involving black victims was 53%, and so the percentage involving white victims was 47%. This information could be displayed with a pie chart (a 53% slice for black victims and a 47% slice for white) or a bar graph (bars of height 53% for black victims, 47% for white). Either way, we see the percentages of victims in the two races are comparable, both close to 50%.

If two categorical variables are involved, we can record counts in the various category combinations with a two-way table. Once we've determined which is the explanatory variable, we can compare percentages in the response category of interest for each of the explanatory groups.

Example: Here is a two-way table for the Maryland crimes, now classified not only according to the race of the victim, but also as to whether or not the defendant was given the death penalty.

               Death Penalty   No Death Penalty   Total
Black Victim   15              675                690
White Victim   61              560                621
Total          76              1235               1311

Now there are two categorical variables involved. It only makes sense to take the victim's race to be the explanatory variable, and whether or not the death penalty was imposed would be the response. (Why would the reverse be nonsensical?) Thus, we would compare percentages sentenced to death for the case of black victim ($15/690 \approx 0.022$) vs. white
victim ($61/621 \approx 0.098$). To display the relationship between two categorical variables, we list the possible explanatory values along a horizontal axis and represent percentages in the various response categories with bars of the appropriate height: bars of heights 2.2% and 97.8% showing the death or no-death rate when the victim was black, next to bars of heights 9.8% and 90.2% showing the rates when the victim was white. The fact that the death penalty rate was more than four times higher in the case of white victims leads us to suspect that the victim's race plays a role in sentencing. In order to convince someone that this difference cannot be explained away by attributing it to chance, a statistical procedure called the chi-square test is needed. This will be presented in Chapter 15, after the necessary theory has been developed, and we will indeed show that there is strong statistical evidence of discrimination. Interestingly, the race of the defendant did not appear to impact sentencing.

Example: In the first lecture, we considered data for how much money a large group of students earned (in thousands of dollars) the year before.

Name      Earned
Jessica   10
Nicole    0
Brian     2

To see the pattern of variation of a quantitative variable like earnings of class members, some common display tools are dotplots, stemplots, histograms, and boxplots. A good display will help us to summarize a distribution by reporting its center, spread, and shape. Until we learn more precise measures, we will mention the midpoint for center and the range (lowest to highest) for spread. As for shape, we will focus on whether the distribution is balanced (symmetric) or lopsided (skewed left or right), whether it has one peak or more, and whether outliers are present.

One very useful and straightforward display is the rather self-explanatory dotplot. A dotplot's horizontal axis corresponds to the full range of possible values; each occurrence of a value is marked with a dot in the appropriate horizontal position, and for multiple occurrences the dots stack up
vertically.

Example: Here is a dotplot for amounts earned (in thousands of dollars) by 79 students.

[Figure: dotplot for earned; horizontal axis: earned]

Another way to display the distribution of a quantitative data set is with a histogram. It is important to note that histograms differ from bar graphs in that they represent frequencies by area, not height. To construct a histogram we

1. Divide the range of data into classes of equal width. In this case, height and area correspond. However, it is possible to use classes of unequal width, in which case it is important to represent frequencies with area, not height.
2. Count the number or percentage of observations in each class.
3. Draw the histogram, using the horizontal axis for the range of data values and the vertical axis for counts or percents.

Example: Construct a histogram for earnings of 79 students.

0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 5 5 5 5 5
6 6 6 6 7 7 7 8 8 8 8 9 10 10 10 11 12 12 12 15 21 22

1. Since the earnings range from 0 to 22 thousand dollars, I could maybe use 5 classes of width 5, or maybe more classes of a narrower width.
2. A table helps to record the number of students with earnings in each class interval.

Class      Count   Percent
0 to 5     52      66
5 to 10    17      22
10 to 15   7       9
15 to 20   1       1
20 to 25   2       3

3. Our horizontal axis, labeled "earnings", extends from 0 to 25. The vertical axis could be labeled "count" or "percent". If labeled with counts, there would be rectangles of height 52, 17, 7, 1, and 2. If labeled with percents, we could either divide each count by the total, 79, and have rectangles of height 66, 22, 9, 1, and 3, or we could adjust the scale to represent percent per thousand and have rectangles of heights 66/5, 22/5, 9/5, 1/5, and 3/5, resulting in a total area of 100.

The distribution is centered in the 0 to 5 range, if we consider where the midpoint of all the values would be. Values are spread from 0 to 22, for a range of 22. The shape is extremely right-skewed, with possible outliers in the twenties. MINITAB opted to construct a histogram for the same data using 12 intervals of width 2.

[Figure: MINITAB histogram of earned; vertical axis: Frequency]
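The class counts in the table above can be checked with a short Python sketch. The value tallies below are one reading of the run-together data listing (an assumption, although they do total 79 students), and the helper name `class_counts` is hypothetical. Each class is taken to include its left endpoint, so an earning of 5 falls in "5 to 10"; that convention is what reproduces the count of 52 in the first class.

```python
# Earnings (thousands of dollars) for the 79 students, stored as
# value -> number of students. This tally is an assumed reading of the
# flattened data listing in the notes.
value_counts = {0: 8, 1: 8, 2: 19, 3: 14, 4: 3, 5: 5, 6: 4, 7: 3, 8: 4,
                9: 1, 10: 3, 11: 1, 12: 3, 15: 1, 21: 1, 22: 1}
earned = [value for value, k in value_counts.items() for _ in range(k)]

def class_counts(data, width, n_classes):
    """Count observations in classes [0, width), [width, 2*width), ..."""
    counts = [0] * n_classes
    for x in data:
        counts[int(x // width)] += 1  # left endpoint belongs to the class
    return counts

counts = class_counts(earned, width=5, n_classes=5)
# Percent of all students in each class, rounded to whole percents.
percents = [round(100 * c / len(earned)) for c in counts]
# Heights on a "percent per thousand dollars" scale (class width is 5).
heights_per_thousand = [p / 5 for p in percents]

print(counts)
print(percents)
print(heights_per_thousand)
```

Because the percents are rounded, they sum to 101 rather than exactly 100, so the "total area of 100" on the percent-per-thousand scale holds only up to rounding.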
The advantage of the histogram is that it is easily constructed for a large data set, like earnings of a large group of students. For a quick, specific display of a relatively small data set, we can use a stemplot, consisting of a vertical list of stems, after each of which follows a horizontal list of one-digit leaves.

Example: Use a stemplot to display the Math SAT scores of eleven students: 511 592 704 667 468 592 614 472 534 669 557. Sorting the data helps us to keep organized: 468 472 511 534 557 592 592 614 667 669 704. If we used the hundreds and tens digits as stems and the ones as leaves, we would have 25 stems (46, 47, ..., 70) with only 11 leaves, not a very useful display. Instead we will simply truncate the ones digits, then use hundreds for stems and tens for leaves:

4 | 6 7
5 | 1 3 5 9 9
6 | 1 6 6
7 | 0

But this stemplot has rather few stems, which may prohibit us from getting a feel for the shape of the distribution. Besides truncating digits, another option in constructing a stemplot is to split the stems two, five, or ten ways. Splitting two ways (first stem gets leaves 0-4, second stem gets leaves 5-9) would result in this plot:

4 | 6 7
5 | 1 3
5 | 5 9 9
6 | 1
6 | 6 6
7 | 0

We see the distribution to be centered at 592 (not 59), with a spread of values ranging from the 4 hundreds to the 7 hundreds (specifically, from 468 to 704), and with a more or less single-peaked shape (called unimodal) which is fairly symmetric. There are no apparent outliers.

Example: Below are dotplot, histogram, and stemplot displaying results of the survey question which asked several hundred students to pick a number at random between 1 and 20, followed by MINITAB's descriptive statistics.

[Figures: dotplot for Random; histogram of Random (vertical axis: Frequency); stem-and-leaf of Random, N = 446, leaf unit = 1.0]

Descriptive Statistics: Random

Variable   N     Mean     Median   TrMean   StDev   SE Mean
Random     446   11.614   13.000   11.714   5.283   0.250

Variable   Minimum   Maximum   Q1      Q3
Random     1.000     20.000    7.000   17.000

The median is 13, selections range from 1 to 20, and the shape exhibits peaks at certain popular numbers like 17, and troughs at numbers like 1, 10, or 20 that students tend to avoid. There was a tendency for students to pick numbers greater than 10, which results in some left skewness. This data set should not (and does not) contain outliers. The stemplot is not a good display in this case because it does not handle the large sample size very well (additional threes and sevens are replaced by plus signs) and doesn't let us focus easily on one number at a time.

Exercise: Pick a quantitative variable from those in the survey. Use MINITAB to display the variable's values with all three graphs discussed: a dotplot, a histogram, and a stemplot. Report the median for center, range for spread, and describe the shape. Be sure to mention if there are outliers.

(C) 2007 Nancy Pfenning, Elementary Statistics: Looking at the Big Picture

Lecture 12
Relationships between Two Quantitative Variables: Regression
- Equation of Regression Line; Residuals
- Effect of Explanatory/Response Roles
- Unusual Observations
- Sample vs. Population
- Time Series; Additional Variables

Looking Back: Review
- 4 Stages of Statistics
  - Data Production (discussed in Lectures 1-4)
  - Displaying and Summarizing
    - Single variables: 1 cat, 1 quan (discussed Lectures 5-8)
    - Relationships between 2 variables: categorical and quantitative (discussed in Lecture 9); two categorical (discussed in Lecture 10)
  - Probability
  - Statistical Inference

Review
- Relationship between 2 quantitative variables
  - Display with scatterplot
  - Summarize:
    - Form: linear or curved
    - Direction: positive or negative
    - Strength: strong, moderate, weak
- If form is linear, correlation r tells direction and strength. Also, the equation of the least squares regression line lets us predict a response for any explanatory value x.

Least Squares Regression Line
Summarize the linear relationship between explanatory (x) and response (y) values with the line $\hat{y} = b_0 + b_1 x$ that minimizes the sum of squared prediction errors (called residuals).
- Slope: predicted change in response y for every unit increase in explanatory value x
- Intercept: where the best-fitting line crosses the y-axis (predicted response for x = 0)

Example: Least Squares Regression Line
- Background: Car-buyer used software to regress price on age for 14 used Grand Ams. The regression equation is Price = 14690 - 1288 Age.
- Question: What do the slope (-1288) and intercept (14690) tell us?
- Response:
  - Slope: For each additional year in age, predict price ____
  - Intercept: Best-fitting line ____

Example: Extrapolation
- Background: Car-buyer used software to regress price on age for 14 used Grand Ams. The regression equation is Price = 14690 - 1288 Age.
- Question: Should we predict a new Grand Am to cost 14690 - 1288(0) = 14690?
- Response: ____

Definition
- Extrapolation: using the regression line to predict responses for explanatory values outside the range of those used to construct the line.

Example: More Extrapolation
- Background: A regression of 17 male students' weights (lbs.) on heights (inches) yields the equation $\hat{y} = -438 + 8.7x$.
- Question: What weight does the line predict for a 20-inch-long infant?
- Response: ____

Expressions for slope and intercept
Consider slope and intercept of the least squares regression line $\hat{y} = b_0 + b_1 x$.
- Slope: $b_1 = r \, s_y / s_x$, so if x increases by a standard deviation, predict y to increase by r standard deviations.
  - |r| close to 1: y responds closely to x
  - |r| close to 0: y hardly responds to x
- Intercept: $b_0 = \bar{y} - b_1 \bar{x}$, so when $x = \bar{x}$, predict $\hat{y} = b_0 + b_1 \bar{x} = \bar{y} - b_1 \bar{x} + b_1 \bar{x} = \bar{y}$; the line passes through the point of averages $(\bar{x}, \bar{y})$.

Example: Individual Summaries on Scatterplot
- Background: Car-buyer plotted price vs. age for 14 used Grand Ams [(4, 13000), (8, 4000), etc.].
  [Figure: scatterplot of price vs. age]
- Question: What are the approximate means and sds of age and price?
- Response: Age has approx. mean ____ yrs, sd ____ yrs; price has approx. mean ____, sd ____.

Definitions
- Residual: error in using regression line $\hat{y} = b_0 + b_1 x$ to predict y given x. It equals the vertical distance observed minus predicted, which can be written $y_i - \hat{y}_i$.
- s denotes typical residual size, calculated as $s = \sqrt{\dfrac{(y_1 - \hat{y}_1)^2 + \cdots + (y_n - \hat{y}_n)^2}{n - 2}}$
- Note: s just "averages" out the residuals $y_i - \hat{y}_i$.
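The expressions for slope and intercept and the definition of s can be tried out with a minimal Python sketch. The five (x, y) pairs are made-up illustration data, not values from the lecture, and the function names are hypothetical; the formulas are the ones stated above ($b_1 = r\,s_y/s_x$, $b_0 = \bar{y} - b_1\bar{x}$, and s with divisor n - 2).

```python
from math import sqrt
from statistics import mean, stdev

def least_squares_line(x, y):
    """Return (b0, b1, r) using b1 = r * s_y / s_x and b0 = ybar - b1 * xbar."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)  # sample standard deviations (divisor n - 1)
    # Correlation r: average product of standardized deviations (divisor n - 1).
    r = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)
    b1 = r * sy / sx
    b0 = ybar - b1 * xbar
    return b0, b1, r

def typical_residual_size(x, y, b0, b1):
    """s = sqrt(sum of squared residuals / (n - 2))."""
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]  # observed - predicted
    return sqrt(sum(e ** 2 for e in residuals) / (len(x) - 2))

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1, r = least_squares_line(x, y)
s = typical_residual_size(x, y, b0, b1)
print(f"y-hat = {b0:.2f} + {b1:.2f} x, r = {r:.3f}, s = {s:.3f}")

# The fitted line passes through the point of averages (xbar, ybar).
print(abs((b0 + b1 * mean(x)) - mean(y)) < 1e-9)
```

The final check mirrors the algebra on the slide: substituting $x = \bar{x}$ into the fitted line returns $\bar{y}$ regardless of the data used.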
Example: Considering Residuals
- Background: Car-buyer regressed price on age for 14 used Grand Ams [(4, 13000), (8, 4000), etc.].
  The regression equation is price = 14686 - 1290 age
  S = 2175   R-Sq = 78.5%   R-Sq(adj) = 76.7%
- Question: What does s = 2175 tell us?
- Response: Regression line predictions not perfect.
  - x = 4: predict $\hat{y}$ = ____; actual y = 13000; prediction error = ____
  - x = 8: predict $\hat{y}$ = ____; actual y = 4000; prediction error = ____
  - Typical size of 14 prediction errors is ____ (dollars)
  [Figure: scatterplot of price vs. age with regression line]

Example: Considering Residuals
- Typical size of 14 prediction errors is s = 2175 (dollars).
- Some points' vertical distance from the line is more, some less; 2175 is the typical distance.
  [Figure: scatterplot of price vs. age with regression line]

Example: Residuals and their Typical Size s
- Background: For a sample of schools, regressed
  - average Math SAT on average Verbal SAT
  - average Math SAT on % of teachers w/ advanced degrees
  [Figure, left: regression plot, Math = 96.8098 + 0.832573 Verbal, R-Sq = 93.9%; 7.08 is typical residual size]
  [Figure, right: regression plot, Math = 478.038 + 0.796632 AdvDegrees, R-Sq = 16.9%; 26.2 is typical residual size]
- Question: How are s = 7.08 (left) and s = 26.2 (right) consistent with the values of the correlation r?
- Response: On left, $r = +\sqrt{R\text{-}Sq} = +\sqrt{0.939} = 0.97$; relation is strong and typical error size is only 7.08.
- Response: On right, $r = +\sqrt{R\text{-}Sq} = +\sqrt{0.169} = 0.41$; relation is weak and typical error size is 26.2.
- Note: can better predict av. Math SAT from av. Verbal SAT than from % of teachers w/ advanced degrees.
- A Closer Look: If output reports R-Sq, take its square root (+ or - depending on slope) to find r.
- Looking Back: Correlations based on averages are overstated; strength of relationship for individual students would be less.

Example: Typical Residual Size s close to $s_y$ or 0
- Background: Scatterplots show relationships
  - Price per kilogram vs. price per lb. for groceries
  - Students' final exam score vs. (number) order handed in
  [Figure: two scatterplots; in the second, the regression line is approx. the same as the line at the average y-value]
- Questions: Which has s = 0? Which has s close to $s_y$?
- Responses:
  - Plot on ____ has s = 0: no prediction errors.
  - Plot on ____: s close to $s_y$. Regressing on x doesn't help; regression line approx. horizontal.

Example: Typical Residual Size s close to $s_y$
- Background: 2008-9 Football Season Scores
  Regression Analysis: Steelers versus Opponents
  The regression equation is Steelers = 23.5 - 0.053 Opponents
  S = 9.931

  Descriptive Statistics: Steelers

  Variable   N    Mean    Median   TrMean   StDev   SE Mean
  Steelers   19   22.74   23.00    22.82    9.66    2.22

  Variable   Minimum   Maximum   Q1      Q3
  Steelers   6.00      38.00     14.00   31.00

- Question: Since s = 9.931 and $s_y$ = 9.66 are very close, do you expect |r| close to 0 or 1?
- Response: r must be close to ____.

Explanatory/Response Roles in Regression
Our choice of roles, explanatory or response, does not affect the value of the correlation r, but it does affect the regression line.

Example: Regression Line when Roles are Switched
- Background: Compare regression of y on x (left) and regression of x on y (right) for the same 4 points.
  [Figure: regression of y on x; regression of x on y]
- Question: Do we get the same line regressing y on x as we do regressing x on y?
- Response: The lines are very different.
  - Regressing y on x: slope ____
  - Regressing x on y: slope ____

Definitions
- Outlier (in regression): point with unusually large residual
- Influential observation: point with high degree of influence on regression line

Example: Outliers and Influential Observations
- Background: Exploring relationship between orders for new planes and fleet size of some airlines in 2004 (r = +0.69).
  [Figure: scatterplot of planes ordered vs. fleet size]
- Question: Is each of these an outlier or influential?
  - Southwest
  - JetBlue
- Response:
  - Southwest: ____; omit it and r reduces to 0.22
  - JetBlue: ____ (large residual); omit it and r increases to 0.97

Example: Outliers and Influential Observations
- Background: Exploring relationship between orders for new planes and fleet size.

  Unusual Observations
  Obs   FleetSiz   PlanesOr   Fit     SE Fit   Residual   St Resid
  6     400        397.0      398.1   127.1    -1.1       -0.04 X
  7     60         373.0      115.2   51.7     257.8      2.16R
  R denotes an observation with a large standardized residual
  X denotes an observation whose X value gives it large influence

  Influential observations tend to be extreme in the horizontal direction.
- Response:
  - Southwest: marked ____ in MINITAB
  - JetBlue: marked ____ in MINITAB

Definitions
- Slope $\beta_1$: how much response y changes in general (for entire population) for every unit increase in explanatory value x
- Intercept $\beta_0$: where the line that best fits all explanatory/response points (for entire population) crosses the y-axis
- Looking Back: Greek letters often refer to population parameters.

Line for Sample vs. Population
- Sample: line best fitting sampled points; predicted response is $\hat{y} = b_0 + b_1 x$
- Population: line best fitting all points in population from which given points were sampled; mean response is $\mu_y = \beta_0 + \beta_1 x$

Role of Sample Size
A larger sample helps provide more evidence of a relationship between two quantitative variables in the general population.

Example: Role of Sample Size
- Background: Relationship between ages of students' mothers and fathers; both have r = +0.78, but sample size is over 400 (on left) or just 5 (on right).
  [Figure: scatterplots of MotherAge vs. FatherAge for the two samples]
- Question: Which plot provides more evidence of strong positive relationship in population?
- Response: Plot on ____. Can believe configuration on ____ occurred by chance.

Time Series
If the explanatory variable is time, plot one response for each time value and "connect the dots" to look for a general trend over time, also peaks and troughs.

Example: Time Series
- Background: Time series plot shows average daily births each month in year 2000 in the US.
  [Figure: time series plot of average daily births by month]
- Question: Where do you see a peak or a trough?
- Response: Trough in ____, peak in ____.
  [Figure: time series plot of average daily births by month, annotated with a peak in September, 9 months after December]

Example: Time Series
- Background: Time series plot shows average daily births each month in year 2000 in the US.
- Questions: How can we explain why there are
  - Conceptions in US: fewer in July, more in December?
  - Conceptions in Europe: more in summer, fewer in winter?
- Response: Difficult to explain.
- A Closer Look: Statistical methods can't always explain why, but at least they help us understand what is going on.

Additional Variables in Regression
- Confounding Variable: Combining two groups that differ with respect to a variable that is related to both explanatory and response variables can affect the nature of their relationship.
- Multiple Regression: More advanced treatments consider the impact of not just one but two or more quantitative explanatory variables on a quantitative response.

Example: Additional Variables
- Background: A regression of phone time (in minutes, the day before) and weight shows a negative relationship.
  [Figure: scatterplot of Phone vs. weight]
- Questions: Do heavy people talk on the phone less? Do light people talk more?

Example: Multiple Regression
- Background: We used a car's age to predict its price.
- Question: What additional quantitative variable would help predict a car's price?
time (in minutes, the day before) and weight shows a negative relationship.
[Figure: scatterplots of Phone versus weight, shown separately for females and males]
Response: Gender is a confounding variable → regress separately for males and females → no relationship.

Example: Multiple Regression
Background: We used a car's age to predict its price.
Response: ____

Lecture Summary (Regression)
- Equation of regression line
  - Interpreting slope and intercept
  - Extrapolation
- Residuals: typical size is s
- Line affected by explanatory/response roles
- Outliers and influential observations
- Line for sample or population; role of sample size
- Time series
- Additional variables

Lecture: Categorical & Quantitative Variable; Inference in Several-Sample Design
- Compare and Contrast Several- and 2-sample
- Variation Among Means or Within Groups
- F Statistic as Ratio of Variation
- Role of Sample Size

Looking Back: Review
4 Stages of Statistics
- Data Production (discussed in Lectures 1-4)
- Displaying and Summarizing (Lectures 5-12)
- Probability (discussed in Lectures 13-20)
- Statistical Inference
  - 1 categorical
  - 1 quantitative (discussed in Lectures 24-27)
  - cat and quan: paired, 2-sample, several-sample
  - 2 categorical
  - 2 quantitative

Inference Methods for C→Q (Review)
- Paired: reduces to 1-sample t
  - Focused on mean of differences
- Two-Sample: 2-sample t similar to 1-sample t
  - Focused on difference between means
- Several-Sample: need new distribution (F)
  - Focus on difference among means

Display & Summary, Several Samples (Review)
- Display: Side-by-side boxplots
  - One boxplot for each categorical group
  - All share same quantitative scale
- Summarize (compare):
  - Five Number Summaries (looking at boxplots)
  - Means and Standard Deviations
Looking
Ahead: Inference for population relationship focuses on means and standard deviations.

Notation
I = no. of groups compared
- Sample: sizes $n_1, n_2, \ldots, n_I$ (sum to N); means $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_I$ (overall $\bar{x}$)
- Population: means $\mu_1, \mu_2, \ldots, \mu_I$

Two- vs. Several-Sample Inference
- Similar: test statistic standardizes difference among sample means, taking sample sizes and standard deviations into account.
- Different: several-sample test statistic F focuses on
  - Squared differences of means (in numerator)
  - Squared standard deviations, that is, variances (in denominator)
Procedure called ANOVA (ANalysis Of VAriance).
For 2 groups of equal sizes and $\sigma_1 = \sigma_2$, $F = t^2$, and conclusions (including P-value) are the same.

t and F Distributions
- Left: sampled 100 values from a t distribution
- Right: squared the 100 values from t distribution
Squaring makes F nonnegative, right-skewed.
[Figure: histograms of 100 values sampled from a t distribution (left) and of the squared sample of 100 values (right); vertical axis Frequency]

Two- vs. Several-Sample Statistics: What Makes t or F Statistics Large
- How different are sample means? Large diff among sample means (in numerator)
- How spread out are the distributions? Small spreads (in denominator)
- How large are the samples? Large sample sizes (denominator of denominator)

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
$$F = \frac{\left[n_1(\bar{x}_1-\bar{x})^2 + n_2(\bar{x}_2-\bar{x})^2 + \cdots + n_I(\bar{x}_I-\bar{x})^2\right]/(I-1)}{\left[(n_1-1)s_1^2 + (n_2-1)s_2^2 + \cdots + (n_I-1)s_I^2\right]/(N-I)}$$

Example: Sample SDs' Effect on P-Value
Background: Boxplots with $\bar{x}_1 = 3$, $\bar{x}_2 = 4$, $\bar{x}_3 = 5$ could appear as on left or right, depending on sds.
Context: sample mean monthly pay (in $1000s) for 3 racial/ethnic groups.
[Figure: two sets of side-by-side boxplots, left with larger spreads and right with smaller spreads]
Question: For which scenario does the difference among means appear more significant?
Response: Difference between means appears more significant on the right (smaller sds → less overlap).

Example: Sample SDs' Effect on Conclusion
Background: Boxplots with $\bar{x}_1 = 3$, $\bar{x}_2 = 4$, $\bar{x}_3 = 5$ could appear as on left or right, depending on sds.
Question: For which scenario are we more likely to reject hypothesis of equal population means?
Response: Scenario on right: smaller sds → larger F statistic → smaller P-value → likelier to reject $H_0$ and conclude population means differ.

Measuring Variation Among and Within Groups

$$F = \frac{\left[n_1(\bar{x}_1-\bar{x})^2 + n_2(\bar{x}_2-\bar{x})^2 + \cdots + n_I(\bar{x}_I-\bar{x})^2\right]/(I-1)}{\left[(n_1-1)s_1^2 + (n_2-1)s_2^2 + \cdots + (n_I-1)s_I^2\right]/(N-I)}$$

Numerator of F: Difference Among Means
- SSG: Sum of Squared
diffs among Groups: $SSG = 5(3-4)^2 + 5(4-4)^2 + 5(5-4)^2 = 10$
- DFG: Degrees Of Freedom for Groups: $DFG = I - 1 = 3 - 1 = 2$
- MSG: Mean of Squared diffs among Groups: $MSG = SSG/DFG = 10/2 = 5$
(monthly earnings in $1000s for 3 racial/ethnic groups: $n_1 = n_2 = n_3 = 5$; $\bar{x}_1 = 3$, $\bar{x}_2 = 4$, $\bar{x}_3 = 5$; overall $\bar{x} = 4$)

In the F ratio:
- Numerator: variation among groups. How different are $\bar{x}_1, \ldots, \bar{x}_I$ from one another?
- Denominator: variation within groups. How spread out are samples (sds $s_1, \ldots, s_I$)?

Numerator of F: Difference Among Means
Note: numerator of F is the same for both scenarios because the difference among means is the same.

Denominator of F: Spread Within Groups
- SSE: Sum of Squared Error within Groups: $SSE = (5-1)(1.58)^2 + (5-1)(1.58)^2 + (5-1)(1.58)^2 \approx 30$
- DFE: Degrees of Freedom for Error: $DFE = N - I = 15 - 3 = 12$
- MSE: Mean Square Error within Groups: $MSE = SSE/DFE = 30/12 = 2.5$

Denominator of F: Spread Within Groups
Note: denominator of F is smaller for the scenario on the right because of less spread.
Because the numerators are the same, F (the quotient) is considerably larger on the right.

The F Statistic

$$F = \frac{\left[n_1(\bar{x}_1-\bar{x})^2 + n_2(\bar{x}_2-\bar{x})^2 + \cdots + n_I(\bar{x}_I-\bar{x})^2\right]/(I-1)}{\left[(n_1-1)s_1^2 + (n_2-1)s_2^2 + \cdots + (n_I-1)s_I^2\right]/(N-I)} = \frac{MSG}{MSE} = \frac{5}{2.5} = 2$$

Is 2 large?
F measures difference among sample means relative to spreads and sample sizes.
If F is large, reject $H_0: \mu_1 = \mu_2 = \mu_3$ and conclude population means differ.

Example: Size of Standardized Statistics
Background: Say standardized statistic is 2.
Question: Is 2 large?
- For z?
- For t?
- For F?

Example: Size of Standardized
Statistics
Background: Say standardized statistic is 2.
Response:
- z = 2: combined tail probs ____
- t = 2: large? depends on ____
- F = 2: large? depends on ____, based on total sample size N and number of groups I

F and its Degrees of Freedom
Family of F curves, all nonnegative, right-skewed. Spreads vary, depending on
- DFG = I - 1 in numerator
- DFE = N - I in denominator

Example: Degrees of Freedom for F
Background: Consider these F distributions:
- F with I = 5, N = 390
- F with DFG = 2, DFE = 12 (written F(2, 12))
Questions:
- What are degrees of freedom if I = 5, N = 390?
- What are I and N if DFG = 2, DFE = 12?
Responses:
- DFG = 5 - 1 = 4, DFE = 390 - 5 = 385
- Since I - 1 = 2, I = 3. Since N - I = N - 3 = 12, N = 15.

Example: Assessing Size of F Statistic
Background: Say F = 3 for DFG = 4, DFE = 385.
[Figure: F distribution for 4 df in numerator, 385 df in denominator (I = 5, N = 390); P(F > 3) = 0.0185]
Questions: Is F = 3 large? Will we reject a claim that the 5 population means are equal?
Responses: P-value = 0.0185 → very little area past F = 3 → F is large → reject claim that the 5 population means are equal.

Example: Assessing F for Different DF
Background: Say F = 3 for DFG = 2, DFE = 12.
[Figure: F distribution for 2 df in numerator, 12 df in denominator (I = 3, N = 15); P(F > 3) = 0.0878]
Questions: Is F = 3 large? What would we conclude if F = 2 for DFG = 2, DFE = 12?
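The F ratio and its degrees of freedom can be retraced numerically from summary statistics. The sketch below is only an illustration: the function name `f_statistic` is hypothetical, and the common standard deviation (the square root of 2.5, about 1.58) is an assumption chosen to match the earnings example with means 3, 4, 5 and five individuals per group.

```python
import math

def f_statistic(ns, means, sds):
    """Return (F, DFG, DFE) from group sizes, sample means, and sample sds."""
    N = sum(ns)                      # total sample size
    I = len(ns)                      # number of groups
    overall = sum(n * m for n, m in zip(ns, means)) / N  # overall mean
    # Numerator pieces: variation among groups
    ssg = sum(n * (m - overall) ** 2 for n, m in zip(ns, means))
    dfg = I - 1
    # Denominator pieces: variation within groups
    sse = sum((n - 1) * s ** 2 for n, s in zip(ns, sds))
    dfe = N - I
    return (ssg / dfg) / (sse / dfe), dfg, dfe

# Assumed common sd of about 1.58 (variance 2.5) in every group.
sd = math.sqrt(2.5)
F, dfg, dfe = f_statistic([5, 5, 5], [3, 4, 5], [sd, sd, sd])
print(round(F, 3), dfg, dfe)
```

Only the ratio and the degrees of freedom are computed here; judging whether F is large still requires the F distribution with DFG and DFE degrees of freedom, which Python's standard library does not provide.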
Example: Assessing F for Different DF
Background: F = 3 for DFG = 2, DFE = 12.
[Figure: F distribution for 2 df in numerator, 12 df in denominator (I = 3, N = 15)]
Responses: P-value = 0.0878 → F = 3 is not especially large. The P-value for F = 2 must be even larger → do not reject $H_0$ → conclude population means may be equal.

The F Statistic

$$F = \frac{\left[n_1(\bar{x}_1-\bar{x})^2 + n_2(\bar{x}_2-\bar{x})^2 + \cdots + n_I(\bar{x}_I-\bar{x})^2\right]/(I-1)}{\left[(n_1-1)s_1^2 + (n_2-1)s_2^2 + \cdots + (n_I-1)s_I^2\right]/(N-I)} = \frac{MSG}{MSE} = \frac{5}{2.5} = 2$$

Is 2 large (for DFG = 2, DFE = 12)? No.
F measures difference among sample means relative to spreads and sample sizes.
If F is large, reject $H_0: \mu_1 = \mu_2 = \mu_3$ and conclude population means differ.

Example: Drawing Conclusions Based on F
Background: Earnings for 5 sampled individuals from each of three racial/ethnic groups had means 3, 4, 5 (in thousands of dollars). ANOVA procedure resulted in F = 2, which in this case is not large.
Question: What do we conclude about mean earnings for populations in the three racial/ethnic groups?
Response: Since F is not large, sample means do not differ significantly from one another. Conclude population mean earnings may be equal.

Example: Role of n in ANOVA Test
Background: Earnings for 12 (instead of 5) sampled individuals from each of three racial/ethnic groups had means 3, 4, 5 (in thousands of dollars). ANOVA procedure resulted in F = 4.8 and a P-value of 0.015.
Response: Conclude population mean
earnings for the three groups differ; larger samples help provide more evidence against $H_0$. This answers the question: What do we conclude about mean earnings for populations in the three racial/ethnic groups?

Mean of F
Since t has sd (typical distance of values from 0) approximately 1, and the F distribution is similar to squaring a t distribution, the mean of F is approximately 1.
[Figure: F distribution for 4 df in numerator, 385 df in denominator (I = 5, N = 390), with P(F > 3) = 0.0185; F distribution for 2 df in numerator, 12 df in denominator (I = 3, N = 15), with P(F > 3) = 0.0878]

Example: Testing Relationship or Parameters
Background: Research question: For all students at a university, are Math SATs related to what year they're in?
Question: How can the question be reformulated in terms of relevant parameters (means) instead of in terms of whether or not the variables are related?
Response: Are mean Math SAT scores equal for the populations of students in each year?

Example: Testing Relationship or Parameters
Background: Research question: Do mean earnings differ significantly for three racial/ethnic groups?
Question: How can the question be reformulated in terms of relevant variables instead of in terms of whether or not the means are equal?
Response: Are earnings related to racial/ethnic group?

Lecture Summary (Inference for Cat & Quan; ANOVA)
- Several-sample vs. 2-sample design
  - Notation
  - Compare and contrast
t and F statistics
  - What makes t or F large
- Variation among means or within groups; F as ratio of variations
- How large is "large" F?
  - F degrees of freedom
  - F distribution
- Role of sample size

Lecture 14: Finding Probabilities: More General Rules
- General "Or" Rule
- Conditional Probability
- General "And" Rule
- Two Types of Error
- Independence

Looking Back: Review
4 Stages of Statistics
- Data Production (discussed in Lectures 1-4)
- Displaying and Summarizing (Lectures 5-12)
- Probability
  - Random Variables
  - Sampling Distributions
- Statistical Inference

Basic Probability Rules (Review)
Non-Overlapping "Or" Rule: For any two non-overlapping events A and B, P(A or B) = P(A) + P(B).
Independent "And" Rule: For any two independent events A and B, P(A and B) = P(A) × P(B).

More General Probability Rules
- Need "Or" Rule that applies even if events overlap.
- Need "And" Rule that applies even if events are dependent.
- Consult two-way table to consider combinations of events when more than one variable is involved.

Example: Parts of Table Showing "Or" and "And"
Background: Professor notes gender (female or male) and grade (A or not A) for students in class.

          A      not A   Total
Female    0.15   0.45    0.60
Male      0.10   0.30    0.40
Total     0.25   0.75    1.00

Questions: What part of a two-way table shows
- Students who are female and get an A?
- Students who are female or get an A?
Responses:
- Students who are female and get an A: the single cell in the Female row and A column.
- Students who are female or get an A: the entire Female row together with the entire A column.
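As a rough illustration of reading "and" and "or" off the professor's two-way table, the sketch below stores the four inner cells and sums the ones that belong to each event. The dictionary layout and the helper name `prob` are assumptions made for this example only.

```python
# Inner cells of the gender-by-grade table (joint probabilities).
table = {
    ("Female", "A"): 0.15, ("Female", "not A"): 0.45,
    ("Male", "A"): 0.10, ("Male", "not A"): 0.30,
}

def prob(event):
    """Sum the cells (gender, grade) for which event(gender, grade) is true."""
    return sum(p for (gender, grade), p in table.items() if event(gender, grade))

# "And" picks out one cell; "or" picks out a whole row plus a whole column,
# counting the shared cell only once.
p_female_and_a = prob(lambda gender, grade: gender == "Female" and grade == "A")
p_female_or_a = prob(lambda gender, grade: gender == "Female" or grade == "A")

print(round(p_female_and_a, 2), round(p_female_or_a, 2))
```

Summing cells directly avoids double counting the Female-and-A cell, which is the issue an "Or" Rule for overlapping events has to address.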
Example: Intuiting General "Or" Rule
Background: Professor reports: probability of getting an A is 0.25; probability of being female is 0.60. Probability of both is 0.15.
Question: What is the probability of being a female or getting an A?
Response: 0.60 + 0.25 - 0.15 = 0.70. The overlap (female and A) is subtracted so that it is not counted twice.
[Figure: two-way tables (A / not A by Female / Male) illustrating the overlap]

General "Or" Rule (General Addition Rule)
For any two events A and B,
P(A or B) = P(A) + P(B) - P(A and B), where P(A and B) = 0 if no overlap.
A Closer Look: In general, the word "or" in probability entails addition.

Example: Applying General "Or" Rule
Background: For 36 countries besides the U.S. who sent troops to Iraq, the probability of sending them early (by spring 2003) was 0.42. The probability of keeping them there longer (still in fall 2004) was 0.78. The probability of sending them early and keeping them longer was 0.33.
Question: What was the probability of sending troops early or keeping them longer?
Response: 0.42 + 0.78 - 0.33 = 0.87

Basic Probability Rules (Review)
Non-Overlapping "Or" Rule: For any two non-overlapping events A and B, P(A or B) = P(A) + P(B).
Independent "And" Rule: For any two independent events A and B, P(A and B) = P(A) × P(B).

Example: When Probabilities Can't Simply be Multiplied
Background: In a child's pocket are 2 quarters
and 2 nickels. He randomly picks a coin, does not replace it, and picks another.
Response: To find the probability of the first and the second coin being quarters, we can't multiply 0.5 by 0.5 because after the first coin has been removed, the probability of the second coin being a quarter is not 0.5: it is 1/3 if the first coin was a quarter, 2/3 if the first was a nickel.

Example: When Probabilities Can't Simply be Multiplied
[Figure: Possibilities for 1st selection: probability of a quarter is 2/4 = 1/2. Possibilities for 2nd selection: probability of a quarter is 1/3 if 1st selection was a quarter, 2/3 if 1st selection was a nickel.]

Definition and Notation
Conditional Probability of a second event, given a first event, is the probability of the second event occurring, assuming that the first event has occurred.
P(B given A) denotes the conditional probability of event B occurring, given that event A has occurred.
Looking Ahead: Conditional probabilities help us handle dependent events.

Example: Intuiting the General "And" Rule
Background: In a child's pocket are 2 quarters and 2 nickels. He randomly picks a coin, does not replace it, and picks another.
Question: What is the probability that the first and the second coin are quarters?
Response: probability of first a
quarter (2/4) times (conditional) probability that second is a quarter, given first was a quarter (1/3): 2/4 × 1/3 = 1/6.
[Figure: of the times when the 1st coin is a quarter, in 1/3 of them the 2nd coin is also a quarter]

Example: Intuiting General "And" Rule with Two-Way Table
Background: Surveyed students classified by sex and whether or not they have ears pierced.

          Ears pierced   Ears not pierced   Total
Female    270            30                 300
Male      20             180                200
Total     290            210                500

Question: What are the following probabilities?
- P(M) (being male)
- P(E given M) (having ears pierced, given male)
- P(M and E) (being male and having ears pierced)
Response:
- P(M) = 200/500 = 0.40
- P(E given M) = 20/200 = 0.10 (restricted to Male row)
- P(M and E) = 20/500 = 0.04, or P(M and E) = P(M) × P(E given M) = 0.40 × 0.10 = 0.04
Note: P(M and E) ≠ P(M) × P(E) = 0.40 × 0.58 = 0.232.

General "And" Rule (General Multiplication Rule)
For any two events A and B,
P(A and B) = P(A) × P(B given A), where P(B given A) = P(B) if independent.
A Closer Look: In general, the word "and" in probability entails multiplication.

Example: Applying General "And" Rule
Question: What are the following probabilities, for the situation in this background?
Background: Studies suggest lie detector tests are well below perfection, 80% of the time concluding someone is a spy when he/she actually is, 16% of the time concluding someone is a spy when he/she isn't. We'll assume 10 of 10,000 govt. employees are spies.
- Probability of being a spy and being detected as one
- Probability of not being a spy but detected as one
- Overall probability of a positive lie detector test

Example: Applying General "And" Rule
Background: same lie detector setting as above.
Note: P(D given S) = 0.80, P(D given not S) = 0.16, P(S) = 0.001, P(not S) = 0.999.
Response:
- Being a spy and being detected as one: P(S and D) = P(S) × P(D given S) = 0.001 × 0.80 = 0.0008
- Not being a spy and detected as one: P(not S and D) = P(not S) × P(D given not S) = 0.999 × 0.16 = 0.15984
- Overall probability ("Or" Rule): P(D) = 0.0008 + 0.15984 = 0.16064

Example: "Or" Probability as Weighted Average of Conditional Probabilities
Background: same lie detector setting as above.
Question: Should we expect the overall probability of being detected as a spy, P(D), to be closer to P(D given S) = 0.80 or to P(D given not S) = 0.16?
Response: Vast majority are not spies → expect P(D) closer to P(D given not S) = 0.16. In fact, P(D) was found to be 0.16064.
General "And" Rule Leads to Rule of Conditional Probability
Recall: For any two events A and B, P(A and B) = P(A) × P(B given A).
Rearrange to form Rule of Conditional Probability:
P(B given A) = P(A and B) / P(A)

Example: Applying Rule of Conditional Probability
Background: For the lie detector problem, we have
- Probability of being a spy: P(S) = 0.001
- Probability of spies being detected: P(D given S) = 0.80
- Probability of non-spies detected: P(D given not S) = 0.16
- Probability of being a spy and detected: P(S and D) = 0.0008
- Overall probability of positive lie detector: P(D) = 0.16064
Question: If the lie detector indicates an employee is a spy, what is the probability that he/she actually is one?
Response: P(S given D) = P(D and S) / P(D) = 0.0008 / 0.16064 ≈ 0.005
Note: P(S given D) is very different from P(D given S).

Two Types of Error in Lie Detector Test
1st Type of Error: Conclude employee is a spy when he/she actually is not.
2nd Type of Error: Conclude employee is not a spy when he/she actually is.

Example: Two Types of Error in Lie Detector Test
Background: For the lie detector problem, we have
- Probability of spies being detected: P(D given S) = 0.80
- Probability of non-spies detected: P(D given not S) = 0.16
Questions:
- What is probability of 1st type of error (conclude employee is spy when he/she actually is not)?
- What is probability of 2nd type of error (conclude employee is not a spy when he/she actually is)?
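The lie detector calculations above can be retraced step by step. This is a minimal sketch using only the figures given in the example (10 spies among 10,000 employees, detection rates 0.80 and 0.16); the variable names are illustrative, not part of the lecture.

```python
# Figures from the lie detector example.
p_spy = 10 / 10000                 # P(S) = 0.001
p_detect_given_spy = 0.80          # P(D given S)
p_detect_given_not_spy = 0.16      # P(D given not S)

# General "And" Rule: P(A and B) = P(A) * P(B given A)
p_spy_and_detect = p_spy * p_detect_given_spy
p_not_spy_and_detect = (1 - p_spy) * p_detect_given_not_spy

# "Or" Rule for non-overlapping events: the two ways of being detected
p_detect = p_spy_and_detect + p_not_spy_and_detect

# Rule of Conditional Probability: P(S given D) = P(S and D) / P(D)
p_spy_given_detect = p_spy_and_detect / p_detect

print(round(p_spy_and_detect, 5), round(p_detect, 5), round(p_spy_given_detect, 4))
```

Because so few employees are spies, almost all positive results come from non-spies, which is why P(S given D) ends up far below P(D given S).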
Example: Two Types of Error in Lie Detector Test
Background: For the lie detector problem, we have
- Probability of spies being detected: P(D given S) = 0.80
- Probability of non-spies detected: P(D given not S) = 0.16
Responses:
- 1st type: P(D given not S) = 0.16
- 2nd type: P(not D given S) = 1 - 0.80 = 0.20

Testing for Independence
The concept of independence is tied in with conditional probabilities.
Looking Ahead: Much of statistics concerns itself with whether or not two events, or two variables, are dependent (related).

Example: Intuiting Conditional Probabilities When Events Are Dependent
Background: Students are classified according to gender (M or F) and ears pierced or not (E or not E).

          Ears pierced   Ears not pierced   Total
Female    270            30                 300
Male      20             180                200
Total     290            210                500

Questions:
- Should gender and ears pierced be dependent or independent? If dependent, which should be lower: P(E) or P(E given M)?
- What are the above probabilities, and which is lower?
Responses:
- Expect P(E given M) lower than P(E) because fewer males have pierced ears.
- P(E given M) = 20/200 = 0.10 < P(E) = 290/500 = 0.58

Example: Intuiting Conditional Probabilities When Events Are Independent
Background: Students are classified according to gender (M or F) and whether they get an A in Stats.

          A      not A   Total
Female    0.15   0.45    0.60
Male      0.10   0.30    0.40
Total     0.25   0.75    1.00

Questions:
- Should gender and getting an A or not be dependent or independent? How should P(A) and P(A given F) compare?
- What are the above probabilities, and which is higher?
Responses:
- Gender and grade should be independent, so we should have P(A) = P(A given F).
- P(A) = 0.25 and P(A given F) = 0.15/0.60 = 0.25

Independence and Conditional Probability
Rule: A and B independent → P(B) = P(B given A)
Test: P(B) = P(B given A) → A and B are independent
      P(B) ≠ P(B given A) → A and B are dependent
Independent ⇔ regular and conditional probabilities are equal (occurrence of A doesn't affect probability of B).

Independence and Product of Probabilities
Rule: Independent → P(A and B) = P(A) × P(B)
Test: P(A and B) = P(A) × P(B) → independent
      P(A and B) ≠ P(A) × P(B) → dependent
Independent ⇔ probability of both equals product of individual probabilities.

Table of Counts Expected if Independent
- For A, B independent, P(A and B) = P(A) × P(B).
- This Rule dictates what counts would appear in a two-way table if the variable "A or not A" is independent of the variable "B or not B".
- If independent, count in category-combination A and B must equal total in A times total in B, divided by overall total in table.

Example: Counts Expected if Independent
Background: Students are classified according to gender and ears pierced or not. A table of expected counts (174 = 300 × 290/500, etc.) has been produced.

Exp       E      not E   Total
F         174    126     300
M         116    84      200
Total     290    210     500

Question: How different are the observed and expected counts?

Example: Counts Expected if Independent
Background: Students are classified according to gender
and ears pierced or not. A table of expected counts (174 = 300 × 290/500, etc.) has been produced.

Counts expected if gender and pierced ears were independent:
Exp       E      not E   Total
F         174    126     300
M         116    84      200
Total     290    210     500

Counts actually observed:
Obs       E      not E   Total
F         270    30      300
M         20     180     200
Total     290    210     500

Response: Observed and expected counts are very different (270 vs. 174, 20 vs. 116, etc.) because gender and pierced ears are dependent.

Example: Counts Expected if Independent
Background: Students are classified according to gender and grade (A or not). A table of expected counts (15 = 60 × 25/100, etc.) has been produced.

Exp       A      not A   Total
F         15     45      60
M         10     30      40
Total     25     75      100

Obs       A      not A   Total
F         15     45      60
M         10     30      40
Total     25     75      100

Question: How different are the observed and expected counts?
Response: Counts are identical because gender and grade are independent.

Lecture Summary (Finding Probabilities: More General Rules)
- General "Or" Rule
- Conditional Probability
- General "And" Rule
- Two Types of Error
- Independence
  - Testing for independence
  - Rule for independent events
  - Counts expected if independent

Lecture 14, Nancy Pfenning, Stats 1000
Chapter 7: Probability

Last time we established some basic definitions and rules of probability:

Rule 1: $P(A^C) = 1 - P(A)$.

Rule 2: In general, the probability of one event or another occurring is P(A or B) = P(A) + P(B) - P(A and B). If the events are mutually exclusive, then P(A and B) = 0 and so P(A or B) = P(A) + P(B).

Rule 3: In general, the probability of one event and another is P(A and B) = P(A)P(B|A), which we can re-express as $P(B|A) = \frac{P(A \text{ and } B)}{P(A)}$. If the events are independent, P(B|A) = P(B) and so P(A and B) = P(A)P(B).

This time we utilize these rules to solve some more complicated problems, and discuss why at times probabilities can be counterintuitive. Tree diagrams are
often helpful in understanding conditional probability problems.

Example (Game Show): Suppose a prize is hidden behind one of three doors, A, B, or C. After the contestant picks one of the three, the host reveals one of the remaining two doors, showing no prize. He gives the guest a chance to switch to the door he or she did not select originally. Is the probability of winning the prize higher with the "keep" or "switch" strategy? Use a probability tree to calculate the probability of winning using each strategy.

When events occur in stages, we normally are interested in the probability of the second, given that the first has occurred. At times, however, we may want to know the probability of an earlier event having occurred, given that a later event ultimately occurred.

Example: Suppose that the proportion of people infected with AIDS in a large population is .01. If AIDS is present, a certain medical test is positive with probability .997 (called the sensitivity of the test) and negative with probability .003. If AIDS is not present, the test is positive with probability .015, negative with probability .985 (called the specificity of the test). Does a positive test mean a person almost certainly has the disease? Let's figure out the following: if a person tests positive, what is the probability of having AIDS?

A tree diagram is very helpful for this type of problem. We'll let A and not A denote the events of having AIDS or not, T and not T denote the events of testing positive or not. First we'll find the overall probability of testing positive. Either a person has AIDS and tests positive, or he/she does not have AIDS and tests positive:

P(T) = P(A and T) + P(not A and T)

According to the multiplication rule, it follows that

P(T) = P(A)P(T|A) + P(not A)P(T|not A) = .01(.997) + .99(.015) = .00997 + .01485 = .02482

Now we can apply the definition of conditional probability to find the probability we seek, the probability of having AIDS given that a person tested positive:

$$P(A|T) = \frac{P(A \text{ and } T)}{P(T)} = \frac{.00997}{.02482} = .40$$

Thus, even if a person tests positive, he/she is more likely not to have
the disease.

[Tree diagram: $P(A \text{ and } T) = .01(.997) = .00997$; $P(\text{not } A \text{ and } T) = .99(.015) = .01485$.]

Most students, and even most physicians, would have expected the probability to be much higher. Conditional probabilities are often misunderstood, and people often are misled by "confusion of the inverse": confusing the probability of having the disease given that you test positive, $P(A \mid T) = .40$, with the probability of testing positive given that you have the disease, $P(T \mid A) = .997$.

Example: What is the chance that at least two people in a class of 50 share the same birthday? In a survey, the average of 52 responses was .23. Is this intuitive guess close to the actual probability?

Example: Here is an easier example to solve before tackling the previous, more difficult one. What is the chance that at least two people in a group of 3 share the same birthday? Assume all days to be equally likely and disregard leap year. If we call the students A, B, and C, then at least two can share the same birthday in any of these ways: AB or AC or BC or ABC. They are all mutually exclusive, so by the addition rule the probability of any one or the other happening is the sum of the four probabilities.

Look first at the probability of A and B having the same birthday and C different. Whatever A's birthday is, the probability of B having the same birthday is $\frac{1}{365}$, and the probability of C's birthday being different is $\frac{364}{365}$. So by the multiplication rule, the probability of B being the same and C being different is $\frac{1}{365} \cdot \frac{364}{365}$. Similarly, the probabilities of A and C or B and C being the same are each $\frac{1}{365} \cdot \frac{364}{365}$. The probability that A and B and C are all the same is $\frac{1}{365} \cdot \frac{1}{365}$. Altogether the probability is

$\frac{1}{365} \cdot \frac{364}{365} + \frac{1}{365} \cdot \frac{364}{365} + \frac{1}{365} \cdot \frac{364}{365} + \frac{1}{365} \cdot \frac{1}{365} = .0082$

Example: Consider this problem: What is the probability of at least two out of 10 people sharing the same birthday? If we call the people A, B, C, D, E, F, G, H, I, J, at least two can share the same birthday in more than 1000 ways: AB, AC, AD, ..., AJ, BC, BD, ..., ABC, ..., ABCDEFGHIJ. Imagine how much more complicated it would be for the original problem with 50 students
instead of 10! The solutions are much easier if we employ an alternate strategy, taking advantage of probability Rule 1, which tells us the probability of something happening must equal 1 minus the probability of it not happening. This is because the probabilities of all possibilities together must sum to 1.

First we'll apply this strategy to redo the easiest problem, the chance of at least 2 out of 3 sharing a birthday. The probability of at least 2 out of 3 sharing the same birthday must equal 1 minus the probability of all 3 having different birthdays. The probability of all 3 different is the probability of B different from A ($\frac{364}{365}$) times the probability of C different from both ($\frac{363}{365}$). Thus the probability of at least 2 the same is

$1 - \frac{364}{365} \cdot \frac{363}{365} = .0082$

the same answer we got originally.

Now we will use this strategy on the probability of at least 2 out of 50 sharing a birthday, which is 1 minus the probability of all 50 birthdays being different:

$1 - \frac{364}{365} \cdot \frac{363}{365} \cdot \frac{362}{365} \cdots \frac{316}{365} = 1 - \frac{364 \cdot 363 \cdots 316}{365^{49}} = .97$

It is almost certain that at least 2 people in a class of 50 share the same birthday! The fact that students' personal probabilities for this event averaged only about .23 demonstrates that intuition is often an inadequate substitute for systematic application of the laws of probability. Note: in a class of 80, the probability of at least two birthdays the same is .999915.

Another way to understand why shared birthdays in a large class are not so unlikely is to realize that if there are many unlikely events possible, it is not so unlikely that at least one of them occurs. This brings us to a discussion of coincidences. A coincidence is a surprising concurrence of events, perceived as meaningfully related, with no apparent causal connection. Should we really be surprised by coincidences?

Example: On my trip to Denver in 1997, checking into the Sheraton with my husband and three children, I was dismayed to find they only had a single room reserved for us, even though I'd asked for two queens and a cot. After I got a call in
our room the next morning from a woman I'd never met, claiming to be a friend of Nancy Pfenning, we gradually pieced together the truth: Nancy Pfenning from Bismarck, North Dakota was supposed to stay at the Sheraton that night too! In fact, we had usurped her reservation; we were supposed to be at the other Sheraton down the road. Class members may have had similarly surprising experiences.

There are so many possible improbable events that may occur in our lives that, in the long run, some of them are bound to happen. Thus coincidences, rather than defying the laws of probability, can actually be explained by the laws of probability.

Note: We will not cover computer probability simulation in this course.

Exercise: Write up and email me directly (not as an attachment) a personal coincidence story that happened to you. Were the occurrences really so unlikely?

Lecture 15
Chapter 8: Random Variables

A random variable (R.V.) is one whose values are quantitative outcomes of a random phenomenon. If it has a finite or countably infinite number of possible values (like the counting numbers 1, 2, 3, ...), then it is called discrete. Probability distributions of discrete R.V.'s can often be specified in a list.

Example: Consider the distribution of the R.V. X, the number of girls in a randomly chosen family with 3 children. This distribution can be found by first examining the sample space of all possible outcomes with their associated probabilities, then using Rule 2 to specify the probabilities of the events of having 0, 1, 2, or 3 girls.

Value of X     0     1     2     3
Probability   1/8   3/8   3/8   1/8

Note that each probability is between 0 and 1, and together they sum to 1.

1. What is the probability that a randomly chosen family of 3 children has 2 girls? $P(X = 2) = \frac{3}{8}$
2. What is the probability of having at least 2 girls? $P(X \ge 2) = \frac{3}{8} + \frac{1}{8} = \frac{1}{2}$
3. What is the probability of having more than 2 girls? $P(X > 2) = \frac{1}{8}$

Note that for a discrete R.V. like this, whether or not we have strict inequality makes a difference.

The probability distribution of a discrete R.V. can be displayed in a
probability histogram, which represents all the possible values of a R.V. and their probabilities. Thus, it displays behavior for an entire (possibly abstract) population. The frequency histograms of Chapter 1 displayed behavior of concrete sample data values. In a probability histogram, possible values of the R.V. X are marked along the horizontal axis. As long as the possible values of X are in increments of 1, the height of each rectangle will be the same as its area, which equals the probability that the R.V. X takes the value at the rectangle's base.

Means (Expected Values) and Variances of Random Variables

Sample mean and sample proportion are random variables, as long as the sample has been chosen at random. Their distributions are of particular interest because they will allow us to establish how good an estimate our sample statistics are for the unknown parameters of interest. As we learned in Chapter 2, a distribution may be summarized by telling its center and spread. Now we will focus on the mean as center and variance as spread of a random variable.

The mean is simply the average of all the possible values of X, where more probable values are given more weight. We sometimes call it the expected value of X, written $E(X)$. If X is a discrete random variable with possible values $x_1, x_2, x_3, \ldots$ occurring with probabilities $p_1, p_2, p_3, \ldots$, then the mean of X (or, equivalently, the expected value) is

$\mu = E(X) = x_1 p_1 + x_2 p_2 + x_3 p_3 + \cdots = \sum x_i p_i$

Example: What is the mean (expected) number of girls in a family of three children?

$\mu = E(X) = 0\left(\frac{1}{8}\right) + 1\left(\frac{3}{8}\right) + 2\left(\frac{3}{8}\right) + 3\left(\frac{1}{8}\right) = \frac{12}{8} = 1.5$

The average number of girls for a family with 3 children is 1.5.

Example: Use the probability distribution below to find the mean $\mu$ of all dice rolls.

Value of X     1     2     3     4     5     6
Probability   1/6   1/6   1/6   1/6   1/6   1/6

$\mu = 3.5$

Example: Use the probability distribution below to find the mean $\mu$ of all household sizes X in the U.S.

Value of X     1     2     3     4     5     6     7
Probability   .25   .32   .17   .15   .07   .03   .01

$\mu = 1(.25) + 2(.32) + 3(.17) + 4(.15) + 5(.07) + 6(.03) + 7(.01) = 2.6$

Notice that the mean equalled the median (midpoint) in the first two examples because the
distributions were perfectly symmetric. Median household size is 2 because at least 50% of households (.25 + .32 = .57) had 2 people or fewer; mean household size is higher than median because this distribution is right-skewed.

To describe spread of a random variable X, we focus first on the variance $\sigma^2 = V(X)$, then take its square root to find the standard deviation:

$\sigma^2 = V(X) = (x_1 - \mu)^2 p_1 + (x_2 - \mu)^2 p_2 + (x_3 - \mu)^2 p_3 + \cdots = \sum (x_i - \mu)^2 p_i$

The standard deviation $\sigma$ of X is the square root of the variance.

Example: Find the standard deviation of X, the number of girls in a family of 3 children. Recall that $\mu = 1.5$.

$Var(X) = \sigma^2 = (0 - 1.5)^2 \left(\frac{1}{8}\right) + (1 - 1.5)^2 \left(\frac{3}{8}\right) + (2 - 1.5)^2 \left(\frac{3}{8}\right) + (3 - 1.5)^2 \left(\frac{1}{8}\right) = .75$

$\sigma = \sqrt{Var(X)} = \sqrt{.75} = .87$

You should know how to calculate $\mu$ for a discrete random variable, and you will be required to calculate $\sigma$ in a homework exercise, but not on quizzes or exams.

Binomial Distributions

Our primary goal in this course involves the use of statistics (sample mean $\bar{x}$ or sample proportion $\hat{p}$) to estimate parameters (population mean $\mu$ or population proportion $p$). If we take a simple random sample of size n from a population and observe for each individual the value of some quantitative variable (like height, number of girls, or household size), then we can calculate its sample mean value $\bar{x}$ and use it to estimate the unknown population mean $\mu$. On the other hand, if we observe the value of some categorical variable (like gender) to see whether or not each individual has a particular characteristic, we can calculate sample count X, then sample proportion $\hat{p}$ of units falling into that category, and use $\hat{p}$ to estimate the unknown $p$. In this section we will focus on sample count for categorical data.

Example: A recent survey of 1012 Americans found that 516 of them opposed gay marriage. We can set up a R.V. X for the sample count opposing, and say X takes the value 516 for this sample. Or we can set up a R.V. $\hat{p} = \frac{X}{n}$ for the sample proportion opposing, and say $\hat{p}$ takes the value $\frac{516}{1012} = .51$.

Just as many quantitative variables fall into a particular pattern known as the normal distribution, the counts for many
categorical variables fall into a particular pattern called binomial. In this section we study the distribution of binomial counts X; in future chapters we will shift our attention to the distribution of proportions. The two are directly related because $\hat{p} = \frac{X}{n}$.

The distribution of the count X of successes in the binomial setting is called the binomial distribution with parameters n and p. The binomial setting has the following requirements:

1. There is a fixed number n of observations.
2. Each of the n observations is independent of the others.
3. There are two possible categories (success and failure) for each observation.
4. The probability p of success is the same for each observation.

Example: The following R.V.'s do not have a binomial distribution:

1. Pick a card from a deck of 52, replace it, pick another, etc. Let X be the number of tries until you get an ace. (n not fixed)
2. Choose 16 cards without replacement from a deck of 52. Let X be the number of red cards chosen. (observations not independent)
3. Pick a card from a deck of 52, replace it, pick another, etc. Do this 16 times. We are interested in the number of cards in each suit (hearts, diamonds, clubs, spades) picked. (more than 2 possible categories for each observation)
4. Pick a card from a deck of 52, replace it, pick a card from a deck of 32, replace it, back to 52, etc. After 16 tries, let X be the number of aces picked. (different probabilities for success: $\frac{1}{13}$ or $\frac{1}{8}$)

Example: The following R.V. does have a binomial distribution: Pick a card from a deck of 52, replace it, pick another. Do this 16 times. Let X be the R.V. for the number of red cards picked. Then X is binomial with n = 16, p = .5.

Requirement 2 may be fudged slightly. If only a very small degree of dependence is present, we may still treat a R.V. as binomial. In general, sampling with replacement is associated with independence, and sampling without replacement is associated with dependence. Almost all of our examples in this course will assume data arises from a simple random sample, that is, sampling without
replacement, in which selections are dependent. Under what circumstances can violation of Requirement 2 be overlooked?

Example: Pick 2 people at random without replacement from a class where 25 out of 75 are male. Let X be the number of males picked. The probability of success for the first person picked is exactly $\frac{25}{75} = .333$. The probability of success for the second person is $\frac{24}{74} = .324$ if the first was male, $\frac{25}{74} = .338$ if the first was female: pretty close! Since population size 75 is much larger than sample size 2, X is approximately binomial with n = 2, $p = \frac{1}{3}$. However, if 2 people were picked from a group of only 3, where one third are male, the probability of the second being male is either 0 or $\frac{1}{2}$: very different! Likewise, there would be a higher degree of dependence if we picked 25 from 75, where one third are male.

Rule of Thumb: If the population is at least 10 times the sample size, replacement has little effect. In such cases, when taking a simple random sample of size n from a population with proportion p in a certain category, the sample count X in the category of interest is approximately binomial with the same n and p.

Most problems for binomial R.V.'s take the form of a question like this: "If X is binomial with a certain n and p given, what is the probability that X equals some value, or lies within some range of values?" To answer such questions, we can

1. use the formula $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$ for $k = 0, \ldots, n$ (not done in this course), or
2. use binomial tables (not done in this course), or
3. use a calculator or MINITAB (not done in this course), or
4. use a normal approximation to find probabilities for the count X (done later in this chapter) or for the sample proportion $\hat{p} = \frac{X}{n}$ (done in Chapter 9).

Since our normal approximation will be based on a variable having the same mean and standard deviation as the binomial variable of interest X, we first discuss the binomial mean and standard deviation. Earlier we found the mean value of X, the binomial count of girls in a family with three children, by using our formula for means of discrete random variables:

$\mu = E(X) = 0\left(\frac{1}{8}\right) + 1\left(\frac{3}{8}\right) + 2\left(\frac{3}{8}\right) + 3\left(\frac{1}{8}\right) = \frac{12}{8} = 1.5$

In fact, there is a much simpler way to find a binomial mean. If sample count X of successes is a binomial R.V. for n observations with probability p of success on each observation, then X has mean $\mu = np$.

Example: The number of girls X in a family of 3 children is binomial with n = 3 and p = .5. The mean number of girls is $\mu = np = 3(.5) = 1.5$.

The mean of a binomial R.V. is easy to grasp intuitively. Say the probability of success for each observation is .2 and we make 10 observations. Then on the average we should have 10(.2) = 2 successes. The spread of a binomial distribution is not so intuitive, so we will not justify our formula for standard deviation:

$\sigma = \sqrt{np(1 - p)}$

Example: Standard deviation for number of girls X in a family of 3 children is $\sigma = \sqrt{3(.5)(1 - .5)} = .87$, a much easier solution method than the one we used previously.

Example: Pick a card from a deck of 52, replace it, pick another. Do this 16 times. Let X be the R.V. for the number of red cards picked. Then X is binomial with n = 16, p = .5. Find the mean and standard deviation of X.

$\mu = np = 16(.5) = 8$
$\sigma = \sqrt{np(1 - p)} = \sqrt{16(.5)(1 - .5)} = 2$

In this situation, we'd expect on average to get 8 red cards, give or take about 2.

Example: The population of whole numbers from 1 to 20 has a mean of 10.5 and a standard deviation of 5.77. I am interested in the mean and standard deviation of all the numbers chosen "randomly" from 1 to 20 by students. I suspect the mean may be higher than 10.5, since people tend to think larger numbers are more random than smaller numbers. I suspect the standard deviation may be less than 5.77, since people may avoid extreme numbers like 1 and 20. First I record what proportion of students chose each number:

 1: .0157    2: .0291    3: .0448    4: .0359    5: .0448
 6: .0448    7: .0762    8: .0381    9: .0247   10: .0224
11: .0471   12: .0583   13: .1143   14: .0516   15: .0493
16: .0404   17: .1480   18: .0471   19: .0359   20: .0314

Next I calculate mean and standard deviation as follows:

$\mu = E(X) = 1(.0157) + 2(.0291) + \cdots + 20(.0314) = 11.614$
$Var(X) = (1 - 11.614)^2 (.0157) + (2 - 11.614)^2 (.0291) + \cdots + (20 - 11.614)^2 (.0314) = 27.85$
$\sigma = \sqrt{Var(X)} = \sqrt{27.85} = 5.28$

As I suspected, the mean of 11.614 is higher than what it would be if
students truly picked at random, and the standard deviation is lower. We can say students' selections averaged 11.614, and they typically deviated from this average by about 5.28.

Exercise: Use the survey data to report the probability distribution of year for the undergraduates in the class (years 1, 2, 3, and 4). You will need to tally the years and adjust the total to exclude "other" students. Find the mean, variance, and standard deviation. Use mean and standard deviation in a sentence about the distribution of year in order to tell what is typical for students in the class.

Lecture 16

Last time we talked about a distribution that often applies when we are interested in a single categorical variable: the binomial count X for number of successes in a situation that allows for two possible categories (success and failure). Next we consider a distribution that arises when we are interested in a single quantitative variable whose possible values constitute a continuum.

Unlike discrete R.V.'s, continuous R.V.'s can take all values in an interval of real numbers. The probability distribution of a continuous R.V. X is represented by a density curve. The probability of an event is the area under the curve over the values of X that make up the event: $P(a \le X \le b)$ equals the area under the curve from a to b. The total area under the curve is 1, so the probability of any event must be between 0 and 1. The mean $\mu$ of a continuous R.V. tells its center or balance point; the standard deviation $\sigma$ tells its spread.

The simplest continuous random variable is the uniform R.V., which takes any value over an interval with equal probability.

Example: A student's friend can call her cell phone any time between 9 am and 5 pm (an 8-hour interval) with equal probability. What is the probability that she calls during statistics class, between 10:00 and 11:00?

Normal probability distributions are for a particular type of continuous R.V. that we have already encountered in Chapter 2.

Example: The distribution of verbal SAT scores in a certain
population is normal with mean $\mu = 500$, standard deviation $\sigma = 100$. The Empirical Rule told us, for example, that 95% of the scores fell within 2 standard deviations of the mean, that is, between 300 and 700. Now we will imagine choosing a score at random and consider its value to be a random variable X. Using probability notation, we now write $P(300 < X < 700) = .95$.

Recall: The relative size of a normal value x is expressed by standardizing to find the z-score:

$z = \frac{\text{observed value} - \text{mean}}{\text{standard deviation}} = \frac{x - \mu}{\sigma}$

Example: The z-score for a verbal SAT of 300 is $\frac{300 - 500}{100} = -2$. The z-score for a verbal SAT of 700 is $\frac{700 - 500}{100} = +2$.

By construction, the standard normal z distribution has mean $\mu = 0$ and standard deviation $\sigma = 1$. The normal table (A.1 in the appendix) gives us the area under the standard normal curve to the left of any value z, which is also the proportion of standard normal observations which are less than the value z. These are given for any z between -3.49 and +3.49, with a few extreme cases beyond these values to specify exactly how unlikely such values are. But we often simply state $P(Z < z)$ is approximately zero for any z less than -3.5, which means $P(Z > z)$ is approximately one for z less than -3.5. Similarly, $P(Z < z)$ is approximately one for any z greater than +3.5, which means $P(Z > z)$ is approximately zero for z greater than +3.5.

Example: Find the following standard normal probabilities:

1. $P(Z < -1.92) = .0274$
2. $P(Z < 0.68) = .7517$
3. $P(Z > -2.5) = P(Z < +2.5)$ (by the symmetry of the standard normal curve about zero) $= .9938$. Alternatively, because the total area under the curve is 1, we can write $P(Z > -2.5) = 1 - P(Z < -2.5) = 1 - .0062 = .9938$.
4. $P(Z > 1.22) = P(Z < -1.22) = .1112$, or $P(Z > 1.22) = 1 - P(Z < 1.22) = 1 - .8888 = .1112$.
5. $P(-1.00 < Z < 1.00) = P(Z < 1.00) - P(Z < -1.00) = .8413 - .1587 = .6826$, since the area between -1 and +1 is the area left of +1 minus the area left of -1.

Recall: we said 68% of values are within $\sigma$ of $\mu$. For a standard normal z, this means 68% are within 1 of 0, or from -1 to +1. The tables have given us two more decimal places of accuracy for our Empirical Rule.
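
The z-score formula and the table lookups above can be checked numerically. The following is only a sketch, not part of the course materials: it assumes Python's standard `math.erf` in place of the normal table, and the helper names `phi` and `standardize` are made up for illustration.

```python
import math

def phi(z):
    """Area under the standard normal curve to the left of z, i.e. P(Z < z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def standardize(x, mean, sd):
    """z-score: (observed value - mean) / standard deviation."""
    return (x - mean) / sd

if __name__ == "__main__":
    # z-scores for verbal SAT scores of 300 and 700 (mean 500, sd 100)
    print(standardize(300, 500, 100), standardize(700, 500, 100))
    # left-tail area, the quantity the table lists directly
    print(round(phi(-1.92), 4))
    # right-tail area, using the fact that total area is 1
    print(round(1 - phi(-2.5), 4))
    # area between -1 and +1: left of +1 minus left of -1
    print(round(phi(1.0) - phi(-1.0), 4))
```

Small differences in the fourth decimal place from the table arithmetic are to be expected, since the table entries are themselves rounded.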
Note that since Table A.1 only shows us the probability of values z being less than a given value, we must either use symmetry or the fact that total area is one to find the proportion of values greater than a given value.

The above examples give a certain value z and ask for a probability. The next examples will give us a probability (or percentile) and ask for the corresponding value z. Keep in mind that standard normal values z are of the form _._ _ and are shown along the margins. The row is for ones and tenths; the column fine-tunes for the hundredths place. Probabilities below given z values are of the form ._ _ _ _ and are shown inside the table.

Example:

1. What is the z-score for the 33rd percentile? In other words, the probability is .33 that a standard normal variable Z falls below what value? According to Table A.1, $.3300 = P(Z < -.44)$, so -.44 is the z-score for the 33rd percentile.
2. What is the 90th percentile of z? $.8997 = P(Z < 1.28)$, so 1.28 is the 90th percentile of Z.
3. What is the 10th percentile of z? $.1003 = P(Z < -1.28)$, so -1.28 is the 10th percentile of Z.
4. .99 is the probability of Z being greater than what value? The same value for which 1% of the time Z is less than that value: $.0100 \approx .0099 = P(Z < -2.33)$, so $.99 = P(Z > -2.33)$, and -2.33 has 99% of z-scores above it. Alternatively, we can say $.99 = P(Z < 2.33)$ means $.99 = P(Z > -2.33)$.
5. .48 is the probability of Z being greater than what value? $.48 = P(Z < -.05)$ means $.48 = P(Z > +.05)$.
6. .95 is the probability of Z being how far from zero? $\frac{1 - .95}{2} = .0250 = P(Z < -1.96) = P(Z > 1.96)$, so .95 is the probability of being within 1.96 of 0, i.e. approximately within $2\sigma$ of $\mu$, where $\sigma$ is 1 and $\mu$ is 0. The Empirical Rule is only roughly accurate.

Note: by a percentile, we mean a value on a scale of 100 that indicates the percent of a distribution that is equal to or below it. Keep this definition in mind when considering the following example.

Example: Researchers recently reported that more than 20 percent of black and Hispanic children and more than 12 percent of white children were at or
above the 95th percentile for body mass, and thus classified as overweight. Is this possible? Strictly speaking, no; the inconsistency results because doctors referred to the 95th percentile on charts made between 1960 and 1980. There are more overweight children today than there were at that time.

Using Table A.1 for Non-Standard Normal Values

Example: The distribution of heights of young women in the U.S. is normal with mean 65, standard deviation 2.7. The distribution of heights of young men in the U.S. is normal with mean 69, standard deviation 3. What percentiles are Jane and Joe in, if Jane is 71 inches and Joe is 75 inches tall?

For women, $P(X < 71) = P\left(\frac{X - 65}{2.7} < \frac{71 - 65}{2.7}\right) = P(Z < 2.22) = .9868$.
For men, $P(X < 75) = P\left(\frac{X - 69}{3} < \frac{75 - 69}{3}\right) = P(Z < 2) = .9772$.

Jane is in the 98.68th percentile and Joe is in the 97.72nd percentile.

Lecture 19: Sampling Distributions: Proportions
- Typical Inference Problem
- Definition of Sampling Distribution
- 3 Approaches to Understanding Sampling Dist.
- Applying 68-95-99.7 Rule
(C) 2007 Nancy Pfenning, Elementary Statistics: Looking at the Big Picture

Looking Back: Review
4 Stages of Statistics
- Data Production (discussed in Lectures 1-4)
- Displaying and Summarizing (Lectures 5-12)
- Probability
  - Finding Probabilities (discussed in Lectures 13-14)
  - Random Variables (discussed in Lectures 15-18)
  - Sampling Distributions: Proportions, Means
- Statistical Inference

Typical Inference Problem
If a sample of 100 students has 0.13 left-handed, can you believe the population proportion is 0.10?
Solution Method: Assume (temporarily) that the population proportion is 0.10, and find the probability of a sample proportion as high as 0.13. If it's too improbable, we won't believe the population proportion is 0.10.

Key to Solving Inference Problems
For a given population proportion p and sample size n, we need to find the probability of sample proportion $\hat{p}$ in a certain range: we need to know the sampling distribution of $\hat{p}$.
Note: $\hat{p}$ can denote a single statistic or a random
variable.

Definition
The sampling distribution of a sample statistic tells the probability distribution of values taken by the statistic in repeated random samples of a given size.
Looking Back: We summarize a probability distribution by reporting its center, spread, and shape.

Behavior of Sample Proportion (Review)
For a random sample of size n from a population with p in the category of interest, sample proportion $\hat{p}$ has
- mean $p$
- standard deviation $\sqrt{\frac{p(1 - p)}{n}}$
- shape approximately normal for large enough n
Looking Back: Can find normal probabilities using the 68-95-99.7 Rule, etc.

Rules of Thumb (Review)
- Population at least 10 times sample size n: the formula for the standard deviation of $\hat{p}$ is approximately correct even if sampled without replacement.
- $np$ and $n(1 - p)$ both at least 10: guarantees $\hat{p}$ approximately normal.

Understanding Dist. of Sample Proportion: 3 Approaches
1. Intuition
2. Hands-on Experimentation
3. Theoretical Results
Looking Ahead: We'll find that our intuition is consistent with experimental results, and both are confirmed by mathematical theory.

Example: Intuit Behavior of Sample Proportion
- Background: Population proportion of blue M&M's is p = 1/6 = 0.17.
- Question: How does sample proportion $\hat{p}$ behave for repeated random samples of size n = 25 (a teaspoon)?
- Experiment: sample teaspoons of M&Ms, record sample proportion of blues on sheet and in notes (need a calculator).
Looking Ahead: The shape of the underlying distribution will play a role in the shape of $\hat{p}$.
- Response: For repeated random samples of size 25, $\hat{p}$ is a quan. RV; summarize with
  - Center: Some $\hat{p}$'s more than 0.17, others less; should balance out, so the mean of the $\hat{p}$'s is ___.
  - Spread of $\hat{p}$'s (s.d. ___): For n = 6, could easily get $\hat{p}$ as low as ___, as high as ___. For n = 25, unlikely to get $\hat{p}$ as low as ___, as high as ___.
  - Shape: $\hat{p}$ close to 0.17 most common, far from 0.17 in either direction increasingly less likely → ___.

Example: Intuit Behavior of Sample Proportion
- Background: Population proportion of blue M&M's is p = 0.17.
- Question: How does sample proportion $\hat{p}$ behave for repeated random samples of size n = 25 (a teaspoon)? n = 75 (a Tablespoon)?
- Experiment: sample Tablespoons of M&Ms, record sample proportion of blues on sheet and in notes.
- Response: For repeated random samples of size 75, $\hat{p}$ is a quan. RV;
summarize with
  - Center: Some $\hat{p}$'s more than 0.17, others less; should balance out, so the mean of the $\hat{p}$'s is ___.
  - Spread: Compared to the spread of $\hat{p}$ for samples of 25, $\hat{p}$ for samples of size 75 will have ___ standard deviation.
  - Shape: $\hat{p}$'s clumped near 0.17, taper at tails → ___.

Central Limit Theorem
Approximate normality of a sample statistic for repeated random samples of a large enough sample size is the cornerstone of inference theory.
- Makes intuitive sense.
- Can be verified with experimentation.
- Proof requires higher-level mathematics; the result is called the Central Limit Theorem.

Behavior of Sample Proportion: Implications
For a random sample of size n from a population with p in the category of interest, sample proportion $\hat{p}$ has
- mean $p$ → $\hat{p}$ is an unbiased estimator of p (sample must be random)
- standard deviation $\sqrt{\frac{p(1 - p)}{n}}$, with n in the denominator → $\hat{p}$ has less spread for larger samples (population size must be at least 10n)
- shape approx. normal for large enough n → can find probability that sample proportion takes value in
given interval.

Example: Behavior of Sample Proportion
- Background: Population proportion of blue M&M's is p = 0.17.
- Question: For repeated random samples of n = 25, how does $\hat{p}$ behave?
- Response: For n = 25, $\hat{p}$ has
  - Center: mean ___
  - Spread: standard deviation ___
  - Shape: not really normal because ___

Example: Sample Proportion for Larger n
- Background: Population proportion of blue M&M's is p = 0.17.
- Question: For repeated random samples of n = 75, how does $\hat{p}$ behave?
- Response: For n = 75, $\hat{p}$ has
  - Center: mean ___
  - Spread: standard deviation ___
  - Shape: approximately normal because ___

68-95-99.7 Rule for Normal RV (Review)
Sample at random from a normal population; for sampled value X (a RV), the probability is
- 68% that X is within 1 standard deviation of the mean
- 95% that X is within 2 standard deviations of the mean
- 99.7% that X is within 3 standard deviations of the mean

68-95-99.7 Rule for Sample Proportion
For sample proportions $\hat{p}$ taken at random from a large population with underlying p, the probability is
- 68% that $\hat{p}$ is within $1\sqrt{\frac{p(1 - p)}{n}}$ of p
- 95% that $\hat{p}$ is within $2\sqrt{\frac{p(1 - p)}{n}}$ of p
- 99.7% that $\hat{p}$ is within $3\sqrt{\frac{p(1 - p)}{n}}$ of p

Example: Sample Proportion for n = 75, p = 0.17
- Background: Population proportion of blue M&Ms is p = 0.17. For random samples of n = 75, $\hat{p}$ is approx. normal with mean 0.17 and s.d. $\sqrt{\frac{0.17(1 - 0.17)}{75}} = 0.043$.
- Question: What does the 68-95-99.7 Rule tell us about the behavior of $\hat{p}$?
- Response: The probability is approximately
  - 0.68 that $\hat{p}$ is within 1(0.043) of 0.17: in ___
  - 0.95 that $\hat{p}$ is within 2(0.043) of 0.17: in ___
  - 0.997 that $\hat{p}$ is within 3(0.043) of 0.17: in ___

90-95-98-99 Rule (Review)
For standard normal Z, the probability is
- 0.90 that Z takes a value in interval (-1.645, +1.645)
- 0.95 that Z takes a value in interval (-1.960, +1.960)
- 0.98 that Z takes a value in interval (-2.326, +2.326)
- 0.99 that Z takes a value in interval (-2.576, +2.576)

Example: 90-95-98-99 Rule for n = 75, p = 0.17
- Background: Population proportion of blue M&Ms is p = 0.17. For random samples of n = 75, $\hat{p}$ is approx. normal with mean 0.17 and s.d. $\sqrt{\frac{0.17(1 - 0.17)}{75}} = 0.043$.
- Question: What does the 90-95-98-99 Rule tell us about the behavior of $\hat{p}$?
- Response: The probability is approximately
  - 0.90 that $\hat{p}$ is within 1.645(0.043) of 0.17: in ___
  - 0.95 that $\hat{p}$ is within 1.960(0.043) of 0.17: in ___
  - 0.98 that $\hat{p}$ is within 2.326(0.043) of 0.17: in ___
  - 0.99 that $\hat{p}$ is within 2.576(0.043) of 0.17: in ___

Typical Inference Problem (Review)
If a sample of 100 students has 0.13 left-handed, can you believe the population proportion is 0.10?
Solution Method: Assume (temporarily) that the population proportion is 0.10, and find the probability of a sample proportion as high as 0.13. If it's too improbable, we won't believe the population proportion is 0.10.

Example: Testing Assumption About p
- Background: Earlier we asked: "If a sample of 100 students has 0.13 left-handed, can you believe the population proportion is 0.10?"
- Response: If p = 0.10, $\hat{p}$ for n = 100 has mean 0.10, s.d. $\sqrt{\frac{0.10(1 - 0.10)}{100}} = 0.03$, and shape approx. normal since 100(0.10) and 100(1 - 0.10) are both ≥ 10. According to the Rule, the probability is (1 - 0.68)/2 = 0.16 that $\hat{p}$ would take a value of 0.13 (1 s.d. above the mean) or more. Since this isn't so improbable, we can believe p = 0.10.

Lecture Summary (Distribution of Sample Proportion)
- Typical inference problem
- Sampling distribution: definition
- 3 approaches to understanding sampling dist.
  - Intuition
  - Hands-on experiment
  - Theory
- Center, spread, shape of sampling distribution
- Central Limit Theorem
  - Role of sample size
  - Applying 68-95-99.7 Rule

Lecture 31: Categorical & Quantitative Variable: More About ANOVA
- ANOVA Hypotheses, Table, Test Stat, P-value
- 1st Step in Practice: Displays, Summaries
- ANOVA Output
- Guidelines for Use of ANOVA

Looking Back: Review
4 Stages of Statistics
- Data Production (discussed in Lectures 1-4)
- Displaying and Summarizing (Lectures 5-12)
- Probability (discussed in Lectures 13-20)
- Statistical Inference
  - 1 categorical (discussed in Lectures 21-23)
  - 1 quantitative (discussed in Lectures 24-27)
  - cat and quan: paired, 2-sample
  - 2 categorical
  - 2 quantitative

Inference for Relationship (Review)
- H0 and Ha about variables: not related or related
- Applies to all three: C→Q, C→C, Q→Q

Inference Methods for C→Q (Review)
- Paired: reduces to 1-sample t
  - Focused on mean of differences
- Two-Sample: 2-sample t similar
to lsarnple I I 110 and Ha about parameters equality or not El Focused on difference between means El or mean0 for paired I SeveralSample F distribution El C90 P0P PIOPOI OHS equal El Focus on difference among means 539 Q9Q3 P0P 51013e equals 23m Eiemenhwstahsucs mm mm Bthieture m a mmmmnw mm Eiementaivstatstics inakmvauheaiv mare L114 Elementary Statistics Looking at the Big Picture 1 C 2007 Nancy Pfenning ANOVA Null and Alternative Hypotheses Example How to Refute a Claim about All H0 explanatory C amp response Q elated I Equivalently LJ 1 M2 2 MI difference among sample means just chance H a explanatory C amp response Q are related I Equivalently H a all the Hi are equal difference too extreme to be due to chance Depending on formulation the word no appears in H0 or Ha 2mm mmmmnv Eiemenhwstaushcs mm We swim m 5 El Background Reader asked medical advice columnist Dear Doctor does everyone with Parkinson s disease shake and doctor replied All patients with Parkinson s disease do not shake El Question Is this What the doctor meant to say 2mm mm mm ammw 3mm mm um aw mm m a Example How to Refute a Claim about All Example ANOVA Alternative Hypothesis El Background Reader asked medical advice columnist Dear Doctor does everyone with Parkinson s disease shake and doctor replied All patients with Parkinson s disease do not shake El Response No He meant 2mm mmmmnv Eiemenhwstaushcs mm We swim m s El Background Null hypothesis to test for relationship between race 3 groups and eamings Ho 1 2 3 El Question Is this the correct altemative Ha 3 H1 75 2 75 3 2mm mm mm ammw 3mm mm um aw mm m g Elementary Statistics Looking at the Big Picture C 2007 Nancy Pfenning Example ANO VA Alternative Hypothesis The F Statistic Review D Background Null hypothesis to test for F ma c1 a n2E2 m2 rim 3 r I 1 relationship between race 3 groups and earnings I nl 15 n2 15g m 1kg N 1 HoiM1M2M3 D Response Ha 1 M1 72 2 72 M3 byitselfnot II Numerator variation among groups correct still other ways to 
disagree with Ho I HOW dlfferem are 3717 39 39 39 2331 from one anomer H a M1 H2 72 M3 iI Denominator variation within groups Ha I 1 72 M2 M3 I How spread out are samples sds 31751 H a 3 M1 M3 7395 2 Words are better say c ZEIEI7 Nancy Ptennirig Eiernentaiy Statistics tciciving attne Big Picture tai ii c ZEIEI7 Nancy Ptennirig Eiernentaiy Statistics Lciciving attne Big Picture tai i2 Role of Variations on Conclusion Review ANOVA Table Source Degrees of Freedom Sum of Squares Mean Sum of Squares F P Boxplots Wlth same Varlathn among groups 3 4 5 bm Factor DFG 1 e i SSG MSG SSGDFG F 9 gm different variation Within sds large left or small rror BFEI N7 I MSE SSEDP e e ta i r1 ht 7 O g iI Organizes calculations 6 5 a E I Source refers to source of variation 4 E ii Factor refers to variation among groups eXpl var 3 E This variation is from the numerator 2 El Error refers to individuals differing Within groups m among This variation is from the denominator Scenario on right smaller sds 9 larger F var with 9smaller Pvalue likelier to reject H 0 9conclude pop means differ c ZEIEI7 Nancy Prennirig Eiernentaiy Statistics tciciving attne Big Picture tai is c ZEIEI7 Nancy Ptennirig Eiernentaiy Statistics tciciving attne Big Picture tai i4 Elementary Statistics Looking at the Big Picture 3 C 2007 Nancy Pfenning ANOVA Table ANOVA Table SourcT Degrees of Freedom l Sum of Squares Mean Sum of Squares F P l Source Degrees of Freedom 3 Sum of Squaresr Mean Sum of Squares F P i Factor DFG 1 7 1 r SSG MSG SSGDFG F LEE p7value i Factor DFG I 7 1 r 550 IMSG SSGDFG F LEE p vaiue i Error Dr E l 7 l SSE MSE SSEDFE i Error DFE N 7 I 1551 MSE SSEDFE l Total r 7 1 l SST l Total N 7 1 SST ll Orgamzes calculat1ons ll Orgamzes calculat1ons I Source refers to source of variation I Source refers to source of variation I DF use I no of groups N total sample size I DF use I no of groups N total sample size D DFG 1 1 I SSG measures overall variation among groups B DFE N 39 I I SSE measures overall 
variation within groups SSG and SSE tedious to calculate other table entries straightforward except for Pvalue c 2mm Nancy Ftennlng Elementary Statlsllcs Luuklng attne Big F39lcture L31 15 c 2mm Nancy Ftennlng Elementary Statlsllcs Luuklng attne Big F39lcture L31 we ANOVA Table Example Key ANO VA Values Source Degrees of Freedom Sum of Squares Mean Sum of Squares F P i Factor DFG 1 7 1 ssc MSG SSGDFG 1 1 b 39 p vaiue D BaCkground compare mlleages for 8 sedans 8 list gi E1 NF MSE SSEDFE minivans 12 SUVs find SSG420 SSE1814 ll Organizes calculations ll Question What are the following values for table I Source refers to source of variation l DFG I DF use I no of groups N total sample size 39 DFE I MSG I SSG measures overall variation among groups I MSE I SSE measures overall variation within groups I F I Mean Sums Divide Sums by DFs I F Take quotient of MSG and MSE I Pvalue Found with software or tables cl 2mm Nancy Ftennlng ementary Statlstlcs tearing attne Eilg F39lcture Lat 17 C 2mm Nancy Ftennlng Elementary Statlstlcs tearing attne Eilg F39lcture tat ta Elementary Statistics Looking at the Big Picture 4 Example Key ANOVA Values I Background Compare mileages for 8 sedans 8 minivans 12 SUVs nd SSG420 SSE1814 El Response DFG 3 7 l DFEN 71 8812 7 3 MSGSSGDFG422 MSESSEDFE181425 FMSGMSE217256 C 2007 Nancy Pfenning C 2mm Nancy Pfenning Eiementary Statistics Lnnking atthe Big Picture Lai 2n E Example CompletingANOlA Table El Background Found these values for ANOVA DFG31 2 DFENI88123 25 MSGSSGDFG422 21 MSESSEDFE181425 7256 FMSGMSE217256 289 El Question Complete AN OVA table Source DF SS MS F P Factor Error C 2mm Nancy Pfenning Eiementary Statistics Lnnking atthe Big Picture Lai Zi Example CompletingANOlA Table El Background Found these values for ANOVA DFG31 2 DFENI88123 25 MSGSSGDFG422 21 MSESSEDFE181425 7256 FMSGMSE217256 289 El Response So u rce D F SS M S F P Factor Error C 2mm Nancy Pfenning Eiementary Statistics Lnnking atthe Big Picture Lai 23 ANOVA F Statistic and 
PValue I Sample means very different Pvalue small Reject claim of equal population means l Sample means relatively close F not large Pvalue not small Believe claim of equal population means C 2mm Nancy Pfenning Eiementary Statistics Lnnking atthe Big Picture Lai 24 Elementary Statistics Looking at the Big Picture C 2007 Nancy Pfenning E HOW Large is Large F Example Examining Boxplots Particular F distribution determined by I Background For all students at a university are DFG DFE Math SATs related to What year they re in these determined by sample size number of groups aquot i i Pvalue in software output lets us know if F is large 70quot 39 E Note Pvalue is bottom line of test top line is E 600 i l H examination of display and summaries 500 g i i i 2 year 3 4 other I Question What do the boxplots suggest 0 2mm Nancy Pfenning Eiementary Statistics Luuking atthe Big Picture Lai 25 0 2mm Nancy Pfenning Eiementary Statistics Luuking atthe Big Picture Lai 29 2 Example Examining Boxplots Example Examining Summaries I Background For all students at a university are I Background For all students at a university are Math SATs related to What year they re in Math SATs related to What year they re in 300 Level N Mean StDev i 1 32 64375 6369 5 00 2 233 61391 6100 5 l a 3 87 60184 8979 500 i 4 28 58179 8973 4007 I other 10 57800 7208 2 Year 3 quot quote II Question What do the summaries suggest El Response As year goes up mean Suggests students scored better in Math 01337 Nancy F39fEWHNE EiEmEmaN Statistics LDDWQ BMW 3 9 Picture L31 13 01337 Nancy WEWHNE EiEmEmaN Statistics LDDWQ BMW 3 9 Picture L31 19 Elementary Statistics Looking at the Big Picture 6 C 2007 Nancy Pfenning E Example Examining Summaries Example ANO VA Output I Background For all students at a university are I Background For all students at a university are Math SATs related to What year they re in Math SATs related to What year they re in Analysis of Variance for Math Level N Mean StDev Source DP SS MS F P 1 32 
543 75 63 69 Year 4 78254 19563 387 0004 2 233 613 91 6100 Error 385 1946372 5056 4 28 581 79 89 73 El Question What does the output suggest other 10 57800 7208 El Response Means decrease by about points for each successive year 1 to 4 Standard deviations are around and sample sizes are c 2mm Nancy Pfenning Eiernentaiy Statistics Luuking attne Big Picture Lai 3i c 2mm Nancy Pfenning Eiernentaiy Statistics Luuking attne Big Picture Lai 32 Example ANOVA OWPW HOW Large is Large F Review I Background For all students at a university are Particular F dist determined by DFG DFE Math SATs related to What year they re in Analysis of Variance for Math Source DF 33 MS F Pvalue in software output lets us know if F is large 0 004 Year 4 78254 19563 387 Pval00049F387 is large in given situation Error 385 1946372 5056 F distribution for Total 389 2024626 15 groups total N390 these determined by sample size number of groups El Response Test Pvalue 0004 Small RejectHO v Conclude all 5 population means are equal l 5039 9Year and Math SAT score related in pop c 2mm Nancy Prennirig Eiernentaiy Statistics Luuking attne Big Picture Lai 34 c 2mm Nancy Pfenning F387 tar as Elementary Statistics Looking at the Big Picture 7 C 2007 Nancy Pfenning Example ANOVA Output Continued El Response Test H0 til 2 2 M3 M M5 Pvalue is 0004 very small so rejectHo Conclude not all 5 population means are equal Year apparently plays a role in Math SAT score What about Data Production issues C 2mm Nancy Ptenning Eiementary Statistics Luuking atthe Big Picture tai 3B E Guidelines for Use of ANOVA Procedure I Need random samples taken independently from several populations I Confounding variables should be separated out I Sample sizes must be large enough to offset non normality of distributions I Need populations at least 10 times sample sizes I Population variances must be equal C 2mm Nancy Ptenning Eiementary Statistics LEIEIKWQ atthe Big Picture tai 37 i i Pooled TwoSample t Procedure Review If we can 
assume $\sigma_1 = \sigma_2$, the standardized difference between sample means follows a pooled t distribution.
- Some apply a Rule of Thumb: use pooled t if the larger sample sd is not more than twice the smaller.
The F distribution is in a sense "pooled": our standardized statistic follows the F distribution only if population variances are equal (same as equal sds).

Example: Checking Standard Deviations
Background: For all students at a university, are Math SATs related to what year they're in?
Level    N     Mean     StDev
1        32    643.75   63.69
2        233   613.91   61.00
3        87    601.84   89.79
4        28    581.79   89.73
other    10    578.00   72.08
Question: Is it safe to assume equal population variances?
Response: Largest sd = 89.79 is not > 2(smallest sd) = 2(61.00) = 122 → assumption of equal variances is OK.

Example: Reviewing ANOVA
Background: For all students at a university, are Verbal SATs related to what year they're in?
Level    N     Mean     StDev
1        32    596.25   86.91
2        234   592.76   65.87
3        86    596.51   77.26
4        29    579.83   79.47
other    10    551.00   124.32
Source   DF   SS      MS     F      P
Year     4    23559   5890   1.10   0.357
Questions: Are conditions met? Do the data provide evidence of a relationship?
Responses: Most sample sizes are large, and the largest sd 124.32 is not > 2(65.87), so the conditions are reasonably met. P-value = 0.357 is not small → no evidence of a relationship.

Example: Considering Data Production
Background: The F test found evidence of a relationship between Math SAT and year (P-value 0.004) but not between Verbal SAT and year (P-value 0.357).
Question: Keeping in mind that the sample consisted of students in various years taking an introductory statistics class, are there any concerns about bias or confounding variables?
Response: Students taking an introductory statistics class are not necessarily representative of the larger population of students in terms of the relationship between year and Math SAT. Students who are quantitatively challenged may postpone their statistics requirement; those who are good in Math may tend to take it as freshmen. On the other hand, verbal ability would tend not to be an issue.

Lecture Summary (Inference for Cat & Quan: More About ANOVA)
- ANOVA for several-sample inference
  - Formulating hypotheses correctly
  - ANOVA table
  - F statistic and P-value
- 1st step in practice: displays and summaries
  - Side-by-side boxplots
  - Compare means, look at sds and sample sizes
- ANOVA output
- Guidelines for use of ANOVA

Lecture 33: Two Categorical Variables: More About Chi-Square
- Hypotheses about Variables or Parameters
- Computing Chi-square Statistic
- Details
  of Chi-square Test
- Confounding Variables

Looking Back: Review
4 Stages of Statistics
- Data Production (discussed in Lectures 1-4)
- Displaying and Summarizing (Lectures 5-12)
- Probability (discussed in Lectures 13-20)
- Statistical Inference
  - 1 categorical (discussed in Lectures 21-23)
  - 1 quantitative (discussed in Lectures 24-27)
  - cat and quan: paired, 2-sample, several-sample (Lectures 28-31)
  - 2 categorical
  - 2 quantitative

H0 and Ha for 2 Categorical Variables
- In terms of variables
  - H0: two categorical variables are not related
  - Ha: two categorical variables are related
- In terms of parameters
  - H0: population proportions in response of interest are equal for various explanatory groups
  - Ha: population proportions in response of interest are not equal for various explanatory groups
The word "not" appears in H0 about variables, and in Ha about parameters.

Chi-Square Statistic
- Compute table of counts expected if H0 is true: each is
  expected = (Column total × Row total) / Table total
  - Same as counts for which proportions in response categories are equal for various explanatory groups
- Compute chi-square test statistic:
  chi-square = sum of (observed - expected)² / expected

"Observed" and "Expected"
The expressions "observed" and "expected" are commonly used for chi-square hypothesis tests. More generally, "observed" is our sample statistic; "expected" is what happens on average in the population when H0 is true and there is no difference from the claimed value, or no relationship.

Example: 2 Categorical Variables: Data
Background: Interested in relationship between gender & lenswear.
          contacts   glasses   none      All
female    121        32        129       282
          42.91%     11.35%    45.74%    100.00%
male      42         37        85        164
          25.61%     22.56%    51.83%    100.00%
All       163        69        214       446
Question: What do the data show about the relationship in the sample?
Response: Females wear contacts more than males do; males wear glasses more; proportions with none are close.

Example: 2 Categorical Variables: Test
Background: Interested in relationship between gender & lenswear.
Observed   Contacts   Glasses   None   Total
Female     121        32        129    282
Male       42         37        85     164
Total      163        69        214    446
Question: Is there evidence of a relationship in the larger population from which the sample was taken?
Response: First calculate expected counts.
Expected   Contacts                Glasses               None                    Total
Female     163(282)/446 = 103      69(282)/446 = 44      214(282)/446 = 135      282
Male       163(164)/446 = 60       69(164)/446 = 25      214(164)/446 = 79       164
Total      163                     69                    214                     446
Response: Compare observed and expected counts → they differ.
Response: Next find components (standardized squared differences between observed and expected):
- (121 - 103)²/103 = 3.1, (32 - 44)²/44 = 3.3, (129 - 135)²/135 = 0.3
- (42 - 60)²/60 = 5.4, (37 - 25)²/25 = 5.8, (85 - 79)²/79 = 0.5
- 5.4, 5.8 largest: most impact from males with contacts and with glasses
- 0.3, 0.5 smallest: least impact from those with no lenswear
Response: Sum components to get chi-square = 3.1 + 3.3 + 0.3 + 5.4 + 5.8 + 0.5 = 18.4. Is it large?

Chi-Square Distribution (Review)
chi-square = sum of (observed - expected)²/expected follows a predictable pattern known as the chi-square distribution, with df = (r - 1) × (c - 1)
- r = number of rows (possible explanatory values)
- c = number of columns (possible response values)
Properties of chi-square:
- Non-negative (based on squares)
- Mean = df (= 1 for smallest, 2×2, table)
- Skewed right

Example: Chi-Square Degrees of Freedom
Background: Examining relationship between gender and lenswear.
        C     G    N     Total
F       121   32   129   282
M       42    37   85    164
Total   163   69   214   446
Question: How many degrees of freedom apply?
Response: The row variable (male or female) has r = 2; the column variable (contacts, glasses, none) has c = 3. For this 2-by-3 table, df = (2 - 1) × (3 - 1) = 2.
Note: Degrees of freedom tell how many counts can vary freely before the rest are "locked in."

Chi-Square Density Curve
For chi-square with 2 df, P(χ² ≥ 6) = 0.05
→ If χ² is more than 6, the P-value is less than 0.05.
[Figure: density curve for chi-square with 2 df, area 0.05 to the right of 6]
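The hand calculation for the gender and lenswear table can be checked with a short script. This is a minimal sketch, not part of the lecture: the function names (`expected_counts`, `chi_square`, `p_value_df2`) are hypothetical, and the P-value helper relies on the fact that a chi-square distribution with exactly 2 df has right-tail area exp(-x/2), so it does not apply to other df.

```python
import math

# Observed counts from the gender/lenswear example
# (rows: female, male; columns: contacts, glasses, none).
observed = [
    [121, 32, 129],
    [42, 37, 85],
]


def expected_counts(table):
    """Expected count for each cell: row total * column total / table total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(table):
    """Return (statistic, degrees of freedom) for a two-way table of counts."""
    expected = expected_counts(table)
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


def p_value_df2(statistic):
    """Right-tail area for chi-square with exactly 2 df: exp(-x / 2).

    This closed form holds only for df = 2; other df need tables or software.
    """
    return math.exp(-statistic / 2)


stat, df = chi_square(observed)
print(f"chi-square = {stat:.2f}, df = {df}, P-value = {p_value_df2(stat):.5f}")
print(f"P(chi-square >= 6) with 2 df = {p_value_df2(6):.3f}")
```

With unrounded expected counts the statistic comes out near 17.7 rather than the 18.4 obtained by hand, because the hand calculation rounds the expected counts to whole numbers and the components to one decimal place. Either way the statistic is far above 6, so the conclusion is the same.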
Example: Assessing Chi-Square
Background: In testing for a relationship between gender and lenswear in a 2×3 table, found χ² = 18.4.
Question: Is there evidence of a relationship in general between gender and lenswear (not just in the sample)?
Response: For df = (2 - 1) × (3 - 1) = 2, chi-square is considered "large" if greater than 6 → 18.4 is large → P-value is small → evidence of a relationship between gender and lenswear.

Example: Checking Assumptions
Background: We produced the table of expected counts below.
Observed   Contacts   Glasses   None   Total
Female     121        32        129    282
Male       42         37        85     164
Total      163        69        214    446
Expected   Contacts   Glasses   None   Total
Female     103        44        135    282
Male       60         25        79     164
Total      163        69        214    446
Question: Are the samples large enough to guarantee that the individual distributions are approx. normal, so that the sum of standardized components follows a χ² distribution?
Response: All expected counts are more than 5 → yes.

Example: Chi-Square with Software
Background: Some subjects were injected under the arm with Botox, others with placebo. After a month, they reported if sweating had decreased. Output shown:
Expected counts are printed below observed counts
          Decreased   NotDecreased   Total
Botox     121         40             161
          80.50       80.50
Placebo   40          121            161
          80.50       80.50
Total     161         161            322
Chi-Sq = 20.376 + 20.376 + 20.376 + 20.376 = 81.503
DF = 1, P-Value = 0.000
Question: What do we conclude?
Response: Sample sizes are large enough. The proportions with reduced sweating (121/161 for Botox vs. 40/161 for placebo) seem different. P-value = 0.000 is small → difference is significant. Conclude Botox reduces sweating.

Guidelines for Use of Chi-Square (Review)
- Need random samples taken independently from two or more populations
- Confounding variables should be separated out
- Sample sizes must be large enough to offset non-normality of distributions
- Need populations at least 10 times sample sizes

Example: Confounding Variables
Background:
Students of all years: χ² = 13.6, p = 0.000
             On Campus   Off Campus   Total   Rate On Campus
Undecided    124         81           205     124/205 = 60%
Decided      96          129          225     96/225 = 43%
Underclassmen: χ² = 0.025, p = 0.873
             On Campus   Off Campus   Total   Rate On Campus
Undecided    117         55           172     117/172 = 68%
Decided      82          37           119     82/119 = 69%
Upperclassmen: χ² = 1.267, p = 0.262
             On Campus   Off Campus   Total   Rate On Campus
Undecided    7           26           33      7/33 = 21%
Decided      14          92           106     14/106 = 13%
Question: Are major (decided or undecided) and living situation related?
Response:
- Students of all years: p = 0.000 → evidence of a relationship. But year is a confounding variable.
- Separate by suspected confounding variable:
  - Underclassmen: p = 0.873 → no evidence of a relationship
  - Upperclassmen: p = 0.262 → no evidence of a relationship

Activity
Complete a table of total students of each gender on the roster, and count those attending and not attending for each gender group. Carry out a chi-square test to see if gender and attendance are related in general.
         Attend   Not Attend   Total
Female
Male
Total

Lecture Summary (Inference for Cat→Cat: More Chi-Square)
- Hypotheses about variables or parameters
- Computing chi-square statistic
  - Observed and expected counts
- Chi-square test
  - Calculations
  - Degrees of freedom
  - Chi-square density curve
  - Checking assumptions
  - Testing with software
- Confounding variables

Lecture 29: Categorical & Quantitative Variable: Inference in Two-Sample Design
- Sampling Distribution of Difference between Means
- 2-sample t Statistic for Hypothesis Test
- Test with Software or by Hand
- 2-sample Confidence Interval
- Pooled 2-sample t Procedures

Looking Back: Review
4 Stages of Statistics
- Data Production (discussed in Lectures 1-4)
- Displaying and Summarizing (Lectures 5-12)
- Probability (discussed in Lectures 13-20)
- Statistical Inference
  - 1 categorical (discussed in Lectures 21-23)
  - 1 quantitative (discussed in Lectures 24-27)
  - cat and quan: paired, 2-sample, several-sample
  - 2 categorical
  - 2 quantitative

Inference Methods for C→Q (Review)
- Paired: reduces to 1-sample t (already covered)
  - Focused on mean of differences
- Two-Sample: 2-sample t similar to 1-sample t
  - Focus on difference between means
- Several-Sample: need new distribution (F)

Display & Summary, 2-Sample Design (Review)
- Display: Side-by-side boxplots
  - One boxplot for each categorical group
  - Both share same quantitative scale
- Summarize: Compare
  - Five Number Summaries (looking at boxplots)
  - Means and
    Standard Deviations
Looking Ahead: Inference for a population relationship will focus on means and standard deviations.

Notation
- Sample sizes $n_1, n_2$
- Sample
  - Means $\bar{x}_1, \bar{x}_2$
  - Standard deviations $s_1, s_2$
- Population
  - Means $\mu_1, \mu_2$
  - Standard deviations $\sigma_1, \sigma_2$

Two-Sample Inference
Inference about $\mu_1 - \mu_2$:
- Test: Is it zero? (Suggests categorical explanatory variable does not impact quantitative response)
- C.I.: If diff ≠ 0, how different are pop means?
Looking Back: We estimated $\mu$ with $\bar{x}$ and established the center, spread, and shape of $\bar{X}$ relative to $\mu$.
Now estimate $\mu_1 - \mu_2$ with $\bar{x}_1 - \bar{x}_2$.
Probability background: As a R.V., $\bar{X}_1 - \bar{X}_2$ has
- Center: mean (if samples are unbiased) $\mu_1 - \mu_2$
- Spread: sd (if independent) $\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$
- Shape: (if sample means are normal) normal

Two-Sample Inference
Note: claiming that the difference between population means is zero (or not),
H0: $\mu_1 - \mu_2 = 0$ vs. Ha: $\mu_1 - \mu_2 \neq 0$,
is equivalent to claiming the population means are equal (or not),
H0: $\mu_1 = \mu_2$ vs. Ha: $\mu_1 \neq \mu_2$.

Two-Sample t Statistic
Standardize the difference between sample means:
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} \quad \text{(assuming H0 true)}$$
- Mean 0 if H0: $\mu_1 - \mu_2 = 0$ is true
- sd > 1 but close to 1 if samples are large
- Shape: bell-shaped, symmetric about 0 (but not quite the same as 1-sample t)

Shape of Two-Sample t Distribution
- t follows the two-sample t distribution only if sample means are normal
- 2-sample t is like 1-sample t, with df somewhere between the smaller $n_i - 1$ and $n_1 + n_2 - 2$
- Like z if sample sizes are large enough
[Figure: z distribution and t distribution (n = 7, df = 6); the two-sample t with equal standard deviations and $n_1 = n_2 = 4$ is the same as t with 6 df]

What Makes One-Sample t Large (Review)
One-sample t statistic:
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{(\bar{x} - \mu_0)\sqrt{n}}{s}$$
t is large in absolute value if
- Sample mean $\bar{x}$ is far from $\mu_0$
- Sample size n is large
- Standard deviation s is small

What Makes Two-Sample t Large
Two-sample t statistic:
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$
t is large in absolute value if
- $\bar{x}_1$ is far from $\bar{x}_2$
- Sample sizes $n_1, n_2$ are large
- Standard deviations $s_1, s_2$ are small

Example: Sample Means' Effect on P-Value
Background: A two-sample t statistic has been computed to test H0: $\mu_1 - \mu_2 = 0$ vs. Ha: $\mu_1 - \mu_2 > 0$.
Question: How does the size of the difference between sample means affect the P-value, in terms of area under the two-sample t curve?
Response: If the difference is not large, the P-value is not small.
Response: As the difference becomes large, the P-value becomes small.
[Figure: two-sample t curves. Small difference between sample means: t is small, P-value is large. Large difference between sample means: t is large, P-value is small.]

Example: Sample SDs' Effect on P-Value
Background: Boxplots with $\bar{x}_1 = 3$, $\bar{x}_2 = 4$ could appear as on the left or right, depending on sds.
Context: sample mean monthly pay (in $1000s) for females ($3000) vs. males ($4000).
[Figure: two pairs of side-by-side boxplots with the same means but different sds]
Question: For which scenario does the difference between means appear more significant?
Response: The difference between means appears more significant in the scenario with smaller sds: less spread → less overlap.

Example: Sample SDs' Effect on Conclusion
Background: Boxplots with $\bar{x}_1 = 3$, $\bar{x}_2 = 4$ could appear as on the left or right, depending on sds.
Context: sample mean monthly pay (in $1000s) for females ($3000) vs. males ($4000).
Question: For which scenario are we more likely to reject H0: $\mu_1 - \mu_2 = 0$?
Response: The scenario with smaller sds: smaller sds → larger two-sample t → smaller P-value → rejecting H0 is more likely.

Example: Sample Sizes' Effect on Conclusion
Background: Boxplot has $\bar{x}_1 = 3$, $\bar{x}_2 = 4$.
Context: sample mean monthly pay (in $1000s) for females ($3000) vs. males ($4000).
Question: Which would provide more evidence to reject H0 and conclude population means differ: if the sample sizes were each 5 or each 12?
Response: The larger sample size (12) provides more evidence to reject H0.

Example: Two-Sample t with Software
Background: Two-sample t procedure output based on survey data of students' age and sex.
Two-sample T for Age
Sex      N     Mean    StDev   SE Mean
female   281   20.28   3.34    0.20
male     163   20.53   1.96    0.15
Difference = mu (female) - mu (male)
Estimate for difference: -0.250
95% CI for difference: (-0.745, 0.245)
T-Test of difference = 0 (vs not =): T-Value = -0.99  P-Value = 0.321  DF = 441
Questions: Does a student's sex tell us something about age? If so, how do ages of male and female students differ in general?
Responses: P-value = 0.321 is not small → no evidence that age and sex are related. The sample means are close. A difference of 0 between population means is plausible: the confidence interval contains 0.

Example: Two-Sample t by Hand
Background: Students' age and sex summaries: 281 females: mean 20.28, sd 3.34; 163 males: mean 20.53, sd 1.96
Background: Students' age and sex summaries: 281 females: mean 20.28, sd
334 163 males mean 2053 Sd 196 D Question Are students sex and age related I Response Testing for relationship same as testing H 0 vs H a Standardized diff between sample mean ages is Samples are large Zsample t about same as Z lti just under l Pval for 2sided H a just over Small 9 evidence that sexampage are related C 2mm Nancy Ptenning Eiementary Statistics Luuking atthe Big Picture LZB 3i C 2mm Nancy Ptenning Eiementary Statistics Luuking atthe Big Picture LZB 33 Example T woSample l by Hand TwoSample Con dence Interval Note Software output consistent with results by hand Confidence interval for diff between population means is Twosample T for Age 82 82 Sex N Mean StDev SE Mean 1 2 female 281 2028 334 020 501 2 1 mUIt39pller male 163 2053 196 015 711 quot2 Difference mu female 39 mu male gt I Multiplier from twosample t distribution Estimate for difference 0250 957 c for difference 0745 0245 I Multiplier smaller for lower con dence TTest of difference 0 vs not TValue o99 PValue 0321 IDF 441 I Multiplier smaller for larger df If samples are large multiplier for 95 confidence is 2 as for Z distribution Samples are large Zsample t about same as Z lti just under llPval for 2sided Ha just over C 2mm Nancy Ptenning Eiementary Statistics Luuking atthe Big Picture LZB 35 C 2mm Nancy Ptenning Eiementary Statistics Luuking atthe Big Picture LZB 3B Elementary Statistics Looking at the Big Picture 7 C 2007 Nancy Pfenning Example Two Sample Con dence Interval El Background Students age and sex summaries 281 females mean 2028 sd 33439 163 males mean 2053 sd 196 El Question What interval should contain difference between population mean ages 2mm mmmm amnuwsmsm mm mm swam 12937 Example Two Sample Con dence Interval El Background Students age and sex summaries 281 females mean 2028 sd 33439 163 males mean 2053 sd 196 El Response For this large a sample size 2sample t multiplier same as z multiplier 2 We re 95 sure that females are between iyears younger and iyears older than 
males on average Thus is a plausible age difference consistent with test not rejecting Ho 2mm mm mm ammw mm mm um aw mm LE 39 Example Interpreting Con dence Interval El Background A 95 confidence interval for difference between population mean hts in inches females minus males is 64 53 El Question What does the interval tell us amnuwsmsm mm mm swam 12940 Example Interpreting Con dence Interval Elementary Statistics Looking at the Big Picture El Background A 95 confidence interval for difference between population mean hts in inches females minus males is 64 53 El Response We re 95 sure that on average females are shorter by 7 to 7 inches We would reject the null hypothesis of equal population means 2mm mm mm ammw mm mm um aw mm LE 42 C 2007 Nancy Pfenning Example Changing Order of Subtraction Example Changing Order of Subtraction El Background A 95 con dence interval for difference between population mean hts in inches females minus males is 64 53 Question What would the interval for the difference be if we took males minus females El 2mm mnwmm amnuwsmsm mm tithe swim 12943 El Background A 95 confidence interval for difference between population mean hts in inches females minus males is 64 53 El Response Interval for males minus females would 2mm mm mm ammw 3mm mm um aw mm LE 45 Pooled TwoSample t Procedure Example Checking Rule for Pooled t lfwe can assume 01 02 standardized difference between sample means follows an actuall distribution with df m 722 2 I Higher df9narrower Cl easier to rejectIIO I Some apply Rule of Thumb use pooled I if larger sample sd not more than twice smaller amnuwsmsm mm tithe swim 12946 El Background Consider use of pooled tprocedure El Question Does Rule of Thumb allow use of pooled l in each of the following I Male and female ages have sample 5115 334 and 196 I 1bedroom apartment rents downtown and near campus have sample sds 258 and 89 2mm mm mm ammw 3mm mm um aw mm LE 47 Elementary Statistics Looking at the Big Picture C 2007 Nancy 
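The by-hand two-sample calculations for the age and sex summaries can be sketched in a few lines of Python. This is a minimal illustration, not the software's own method: the function name `two_sample_summary` is hypothetical, the standard error is the unpooled one from the slides, and the multiplier 2 is the large-sample approximation the slides use in place of an exact t multiplier.

```python
import math

def two_sample_summary(mean1, sd1, n1, mean2, sd2, n2, multiplier=2.0):
    """Unpooled two-sample t statistic and approximate CI for mu1 - mu2.

    multiplier=2.0 is the large-sample 95% approximation (an assumption;
    software would use an exact t multiplier instead).
    """
    diff = mean1 - mean2
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)  # standard error of the difference
    t = (diff - 0) / se                        # standardized difference under H0
    ci = (diff - multiplier * se, diff + multiplier * se)
    return diff, se, t, ci

# Summaries from the slides: females first, then males
diff, se, t, ci = two_sample_summary(20.28, 3.34, 281, 20.53, 1.96, 163)
print(f"difference = {diff:.3f}, SE = {se:.3f}, t = {t:.2f}")
print(f"approximate 95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

Because the multiplier is rounded to 2, the interval comes out slightly wider than the software's (-0.745, 0.245), but it leads to the same conclusion: 0 is inside the interval.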
Example: Checking Rule for Pooled t
- Background: Consider use of the pooled t procedure.
- Response: We check if the larger sd is more than twice the smaller in each case.
  - 3.34 > 2(1.96) = 3.92? No, so pooled t is OK.
  - 258 > 2(89) = 178? Yes, so pooled t is not OK.

Lecture Summary (Inference for Cat & Quan: Two-Sample)
- Inference for 2-sample design
  - Notation
  - Test
  - Confidence interval
- Sampling distribution of difference between means
- 2-sample t statistic: role of difference between sample means, standard deviation sizes, sample sizes
- Test with software or by hand
- Confidence interval
- Pooled 2-sample t procedures

Lecture 23: Inference for Categorical Variable: More About Hypothesis Tests
(Elementary Statistics: Looking at the Big Picture, © 2007 Nancy Pfenning)
- Examples of Tests with 3 Forms of Alternative
- How Form of Alternative Affects Test
- When P-Value is "Small": Statistical Significance
- Hypothesis Tests in Long-Run
- Relating Test Results to Confidence Interval

Looking Back: Review
4 Stages of Statistics
- Data Production (discussed in Lectures 1-4)
- Displaying and Summarizing (Lectures 5-12)
- Probability (discussed in Lectures 13-20)
- Statistical Inference
  - 1 categorical: confidence intervals, hypothesis tests
  - 1 quantitative
  - categorical and quantitative
  - 2 categorical
  - 2 quantitative

Three Types of Inference Problem (Review)
In a sample of 446 students, 0.55 ate breakfast.
1. What is our best guess for the proportion of all students who eat breakfast? (Point Estimate)
2. What interval should contain the proportion of all students who eat breakfast? (Confidence Interval)
3. Do more than half (50%) of all students eat breakfast? (Hypothesis Test)

Hypothesis Test About p (Review)
State null and alternative hypotheses $H_0$ and $H_a$. Null is "status quo", alternative "rocks the boat":
$H_0: p = p_0$ vs $H_a: p > p_0$, $p < p_0$, or $p \neq p_0$
1. Consider sampling and study design.
2. Summarize with $\hat{p}$, standardize to z, assuming that $H_0: p = p_0$ is true; consider if z is "large".
3. Find P-value = probability of z this far above/below/away from 0; consider if it is "small".
4. Based on size of P-value, choose $H_0$ or $H_a$.

Checking Sample Size: CI vs Test
- Confidence Interval: Require observed counts in and out of the category of interest to be at least 10: $n\hat{p} = X \geq 10$ and $n(1 - \hat{p}) = n - X \geq 10$.
- Hypothesis Test: Require expected counts in and out of the category of interest to be at least 10 (assume $p = p_0$): $np_0 \geq 10$ and $n(1 - p_0) \geq 10$.

Example: Checking Sample Size in Test
- Background: 30/400 = 0.075 students picked #7 "at random" from 1 to 20. Want to test $H_0: p = 0.05$ vs $H_a: p > 0.05$.
- Question: Is n large enough to justify finding a P-value based on normal probabilities?
- Response: $np_0 = 400(0.05) = 20$ and $n(1 - p_0) = 400(0.95) = 380$ are both at least 10.
- Looking Back: For the confidence interval, we checked that 30 and 370 are both at least 10.

Example: Test with ">" Alternative (Review)
- Note: Step 1 requires 3 checks:
  - Is the sample unbiased? (Does the sample proportion have mean 0.05?)
  - Is the population ≥ 10n? (Is the formula for sd correct?)
  - Are $np_0$ and $n(1 - p_0)$ both at least 10? (Can we find or estimate the P-value based on normal probabilities?)
1. Students are "typical" humans; bias is the issue at hand.
2. If $p = 0.05$, the sd of $\hat{p}$ is $\sqrt{\dfrac{0.05(1 - 0.05)}{400}}$ and $z = \dfrac{0.075 - 0.05}{\sqrt{\dfrac{0.05(1 - 0.05)}{400}}} = 2.29$.
3. P-value $= P(Z \geq 2.29)$ is small: just over 0.01.
4. Reject $H_0$, conclude $H_a$: picks were biased for #7.
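The four steps of the "#7" test can be traced numerically. The sketch below assumes only the Python standard library; `normal_cdf` and `one_proportion_z` are hypothetical helper names, and the normal tail area is computed with `math.erfc` rather than read from a table.

```python
import math

def normal_cdf(z):
    """P(Z < z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def one_proportion_z(x, n, p0):
    """Expected-count check, sample proportion, and z statistic under H0: p = p0."""
    counts_ok = n * p0 >= 10 and n * (1 - p0) >= 10  # expected counts at least 10
    p_hat = x / n
    sd = math.sqrt(p0 * (1 - p0) / n)                # sd of p-hat if H0 is true
    z = (p_hat - p0) / sd
    return counts_ok, p_hat, z

# 30 of 400 students picked #7; test H0: p = 0.05 vs Ha: p > 0.05
counts_ok, p_hat, z = one_proportion_z(30, 400, 0.05)
p_value = 1 - normal_cdf(z)  # right-tailed, because the alternative is "greater than"
print(f"counts ok: {counts_ok}, p-hat = {p_hat:.3f}, z = {z:.2f}, P-value = {p_value:.3f}")
```

The P-value lands just over 0.01, in line with step 3 of the example.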
Example: Test with "Less Than" Alternative
- Background: 111/230 of surveyed commuters at a university walked to school.
- Question: Do fewer than half of the university's commuters walk to school?
- Response: First write $H_0: p = 0.5$ vs $H_a: p < 0.5$.
  1. Students need to be representative in terms of year.
  2. Output → $\hat{p} = 0.48$, $z = -0.53$.
  3. P-value = 0.299.
  4. Reject $H_0$? No.

    Test and CI for One Proportion
    Test of p = 0.5 vs p < 0.5
    Sample  X    N    Sample p  95.0% Upper Bound  Z-Value  P-Value
    1       111  230  0.482609  0.536805           -0.53    0.299

- Note: The P-value is a left-tailed probability because the alternative was "less than".

Example: Test with "Not Equal" Alternative
- Background: 43% of Florida's community college students are disadvantaged.
- Question: Is the % disadvantaged at Florida Keys Community College (169/356 = 47.5%) unusual?
- Response: First write $H_0: p = 0.43$ vs $H_a: p \neq 0.43$.
  1. 356(0.43) and 356(1 - 0.43) are both ≥ 10; population ≥ 10(356).
  2. $\hat{p} = 0.475$, $z = 1.70$.
  3. P-value = 0.088.
  4. Reject $H_0$? No (with 0.05 as cutoff).

    Test and CI for One Proportion
    Test of p = 0.43 vs p not = 0.43
    Sample  X    N    Sample p  95.0% CI              Z-Value  P-Value
    1       169  356  0.474719  (0.422847, 0.526592)  1.70     0.088

- Note: The P-value is a two-tailed probability because the alternative was "not equal".

90-95-98-99 Rule: Outside Probabilities
[Figure: standard normal curve with tail areas 0.05, 0.025, 0.01 and 0.005 beyond z = 1.645, 1.960, 2.326 and 2.576 on each side.]

One-sided or Two-sided Alternative
- Form of alternative hypothesis impacts the P-value
- P-value is the deciding factor in the test
- Alternative should be based on what researchers hope/fear/suspect is true before "snooping" at the data
- If < or > is not obvious, use a two-sided alternative (more conservative)

Example: How Form of Alternative Affects Test
- Background: 43% of Florida's community college students are disadvantaged.
- Question: Is the % disadvantaged at Florida Keys (169/356 = 47.5%) unusually high?
- Response: Now write $H_0: p = 0.43$ vs $H_a: p > 0.43$.
  1. Same checks of data production as before.
  2. Same $\hat{p} = 0.475$, $z = 1.70$.
  3. Now P-value = 0.044.
  4. Reject $H_0$? Yes (with 0.05 as cutoff).

    Test of p = 0.43 vs p > 0.43
    Sample  X    N    Sample p  95.0% Lower Bound  Z-Value  P-Value
    1       169  356  0.474719  0.431186           1.70     0.044

P-value for One- or Two-Sided Alternative
- P-value for a one-sided alternative is half the P-value for a two-sided alternative.
- P-value for a two-sided alternative is twice the P-value for a one-sided alternative.
For this reason, the two-sided alternative is more conservative (larger P-value, harder to reject $H_0$).

Thinking About Data
Before getting caught up in details of a test, consider the evidence at hand.
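The halving relation between one-sided and two-sided P-values can be checked on the Florida Keys data (169 of 356, $p_0 = 0.43$). This is a minimal sketch under the normal approximation; the function name `p_values` and the dictionary keys are illustrative choices, not part of any statistics package.

```python
import math

def normal_cdf(z):
    """P(Z < z) for a standard normal Z."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def p_values(x, n, p0):
    """z statistic and P-values for the three forms of alternative."""
    p_hat = x / n
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    lower = normal_cdf(z)                # alternative p < p0 (left tail)
    upper = 1 - lower                    # alternative p > p0 (right tail)
    two_sided = 2 * min(lower, upper)    # alternative p != p0 (both tails)
    return z, {"less": lower, "greater": upper, "not equal": two_sided}

z, p = p_values(169, 356, 0.43)
print(f"z = {z:.2f}")
for alternative, value in p.items():
    print(f"Ha: p {alternative} 0.43 -> P-value = {value:.3f}")
```

With 0.05 as cutoff, the "greater" alternative rejects $H_0$ while the "not equal" alternative does not, even though the data and the z statistic are identical.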
Example: Thinking About Data at Hand
- Background: 43% of Florida's community college students are disadvantaged. At Florida Keys, the rate is 47.5%.
- Question: Is the rate at Florida Keys significantly lower?
- Response: No. The sample rate (47.5%) is higher than 43%, so the data give no evidence of a lower rate and a formal test is unnecessary.

Definition
$\alpha$ (alpha): cutoff level which signifies a P-value is small enough to reject $H_0$.

How Small is a "Small" P-Value?
- Avoid blind adherence to cutoff $\alpha = 0.05$
- Take into account:
  - Past considerations: is $H_0$ "written in stone" or easily subject to debate?
  - Future considerations: What would be the consequences of either type of error?
    - Rejecting $H_0$ even though it's true
    - Failing to reject $H_0$ even though it's false
- Consider decisions encountered so far

Example: Reviewing P-values and Conclusions
- Background: Consider our prototypical examples:
  - Are random number selections biased? P-value = 0.011
  - Do fewer than half of commuters walk? P-value = 0.299
  - Is % disadvantaged significantly different? P-value = 0.088
  - Is % disadvantaged significantly higher? P-value = 0.044
- Question: What conclusions did we draw, based on those P-values?
- Response: Consistent with 0.05 as cutoff $\alpha$:
  - P-value = 0.011 → Reject $H_0$? Yes
  - P-value = 0.299 → Reject $H_0$? No
  - P-value = 0.088 → Reject $H_0$? No
  - P-value = 0.044 → Reject $H_0$? Yes

Example: Cut-Offs for "Small" P-Value
- Background: A bookstore chain will open a new store in a city if there's evidence that its proportion of college grads is higher than 0.26, the national rate.
- Question: Choose a cutoff (0.10, 0.05, 0.01)
  - if no other info is provided
  - if the chain is enjoying considerable profits and owners are eager to pursue new ventures
  - if the chain is in financial difficulties and can't afford losses if unsuccessful due to too few grads
- Response:
  - No other info: use 0.05
  - Considerable profits, eager for new ventures: use 0.10
  - Financial difficulties: use 0.01

Definition
Statistically significant data produce a P-value small enough to reject $H_0$. z plays a role:
$z = \dfrac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}}$
Reject $H_0$ if the P-value is small; if |z| is large; if
- Sample proportion $\hat{p}$ is far from $p_0$
- Sample size n is large
- Standard deviation is small (if $p_0$ is close to 0 or 1)

Role of Sample Size n
- Large n: may reject $H_0$ even though the observed proportion isn't very far from $p_0$, from a practical standpoint. A very small P-value is strong evidence against $H_0$, but p is not necessarily very far from $p_0$.
- Small n: may fail to reject $H_0$ even though it is false. Failing to reject a false $H_0$ is the 2nd type of error.

Definition
- Type I Error: reject the null hypothesis even though it is true (false positive). Its probability is the cutoff $\alpha$.
- Type II Error: fail to reject the null hypothesis even though it's false (false negative).

Hypothesis Test and Long-Run Behavior
Repeatedly carry out hypothesis tests of $p = 0.5$ based on 20 coin flips, using cutoff $\alpha = 0.05$. In the long run, 5% of the tests will reject $H_0: p = 0.5$, even though it's true.
[Figure: repeated sets of 20 coin flips, each testing $H_0: p = 0.5$ against a two-sided alternative. Examples: proportion of heads 0.45, z = -0.45, P-value = 0.655; proportion of heads 0.40, z = -0.89, P-value = 0.371; proportion of heads 0.75, z = 2.24, P-value = 0.025. In the long run, 95% of tests do not reject $H_0$ and 5% of tests reject $H_0$.]

Confidence Interval and Hypothesis Test Results
- Confidence Interval: range of plausible values
- Hypothesis Test: decides if a value is plausible
Informally,
- If $p_0$ is in the confidence interval, don't reject $H_0: p = p_0$
- If $p_0$ is outside the confidence interval, reject $H_0: p = p_0$
[Figure: relationship between a 95% confidence interval for the population proportion and a two-sided test with 0.05 as cutoff for the P-value. If $p_0$ is outside the interval, reject $H_0$; if $p_0$ is inside the interval, do not reject $H_0$.]

Example: Test Results Based on CI
- Background: A 95% confidence interval for the proportion of all students choosing #7 "at random" from numbers 1 to 20 is (0.055, 0.095).
- Question: Would we expect a hypothesis test to reject the claim $p = 0.05$ in favor of the claim $p > 0.05$?
- Response: Yes. 0.05 lies outside (below) the interval, so it is not a plausible value for p.

Example: CI Results Based on Test
- Background: A hypothesis test did not reject $H_0: p = 0.5$ in favor of the alternative $H_a: p < 0.5$.
- Question: Do we expect 0.5 to be contained in a confidence interval for p?
- Response: Yes. A value that the test does not reject is plausible, so we expect the interval to contain it.

Lecture Summary (More Hypothesis Tests for Proportions)
- Examples with 3 forms of alternative hypothesis
- Form of alternative hypothesis
  - Effect on test results
  - When data render formal test unnecessary
  - P-value for 1-sided vs 2-sided alternative
- Cutoff for "small" P-value
- Statistical significance; role of n
- Type I or II Error
- Hypothesis tests in long-run
- Relating tests and confidence intervals

Lecture 17: Continuous Random Variables; Normal Distribution
(Elementary Statistics: Looking at the Big Picture, © 2007 Nancy Pfenning)
- Relevance of Normal Distribution
- Continuous Random Variables
- 68-95-99.7 Rule for Normal RVs
- Standardizing/Unstandardizing
- Probabilities for Standard/Non-standard Normal RVs

Looking Back: Review
4 Stages of Statistics
- Data Production (discussed in Lectures 1-4)
- Displaying and Summarizing (Lectures 5-12)
- Probability
  - Finding Probabilities (discussed in Lectures 13-14)
  - Random Variables (introduced in Lecture 15)
    - Binomial (discussed in Lecture 16)
    - Normal
  - Sampling Distributions
- Statistical Inference

Role of Normal Distribution in Inference
- Goal: Perform inference about an unknown population proportion, based on the sample proportion
- Strategy: Determine behavior of the sample proportion in random samples with known population proportion
- Key Result: Sample proportion follows a normal curve for large enough samples
- Looking Ahead: A similar approach will be taken with means.

Discrete vs Continuous Distributions
- Binomial Count X: discrete (distinct possible values like numbers 1, 2, 3, ...)
- Sample Proportion $\hat{p}$: also discrete (distinct values like count)
- Normal Approximation to Sample Proportion: continuous (follows normal curve), with mean p and standard deviation $\sqrt{\dfrac{p(1 - p)}{n}}$

Sample Proportions Approx. Normal (Review)
- Proportion of tails in n = 16 coin flips (p = 0.5) has $\mu = 0.50$, $\sigma = \sqrt{\dfrac{0.5(1 - 0.5)}{16}} = 0.125$, shape approx. normal
- Proportion of lefties (p = 0.1) in n = 100 people has $\mu = 0.10$, $\sigma = \sqrt{\dfrac{0.1(1 - 0.1)}{100}} = 0.03$, shape approx. normal
[Figure: probability histograms for the sample proportion of tails (n = 16) and the sample proportion of left-handed people (n = 100).]

Example: Variable Types
- Background: Variables in survey excerpt:

    age    breakfast?  comp  credits
    19.67  no          120   15
    20.08  no          120   16
    19.08  yes         40    14

- Question: Identify the type (categorical, discrete quantitative, continuous quantitative) of
  - Age
  - Breakfast
  - Comp (daily time in min. on computer)
  - Credits
- Response:
  - Age: continuous quantitative
  - Breakfast: categorical
  - Comp: continuous quantitative
  - Credits: discrete quantitative

Probability Histogram for Discrete RV
The histogram for male shoe size X represents probability by area of bars.
[Figure: shoe-size histograms with $P(X \leq 9)$ shaded on the left and $P(X < 9)$ shaded on the right.]
For a discrete RV, strict inequality or not matters.

Definition
Density curve: smooth curve showing the probability distribution of a continuous RV. The area under the curve shows the probability that the RV takes a value in a given interval.
Looking Ahead: The most commonly used density curve is normal z, but to perform inference we also use t, F, and chi-square curves.

Density Curve for Continuous RV
The density curve for male foot length X represents probability by area under the curve.
[Figure: foot-length density curves with the probability of X less than or equal to 9 and the probability of X less than 9 shaded.]
$P(X \leq 9) = P(X < 9)$. For a continuous RV, strict inequality or not doesn't matter.

68-95-99.7 Rule for Normal Data (Review)
Values of a normal data set have
- 68% within 1 standard deviation of the mean
- 95% within 2 standard deviations of the mean
- 99.7% within 3 standard deviations of the mean

68-95-99.7 Rule: Normal Random Variable
Sample at random from a normal population; for the sampled value X (a RV), the probability is
- 68% that X is within 1 standard deviation of the mean
- 95% that X is within 2 standard deviations of the mean
- 99.7% that X is within 3 standard deviations of the mean
[Figure: 68-95-99.7 Rule for normal distributions, with areas 0.68, 0.95 and 0.997 within 1, 2 and 3 sds of the mean and tail areas 0.16, 0.025 and 0.0015.]

68-95-99.7 Rule for Normal RV
Looking Back: We use Greek letters to denote the population mean and standard deviation: mean $\mu$, standard deviation $\sigma$.
[Figure: the same sketch with the axis marked $\mu - 3\sigma$, $\mu - 2\sigma$, $\mu - \sigma$, $\mu$, $\mu + \sigma$, $\mu + 2\sigma$, $\mu + 3\sigma$.]

Example: 68-95-99.7 Rule for Normal RV
- Background: IQ for a randomly chosen adult is a normal RV X with $\mu = 100$, $\sigma = 15$.
- Question: What does the Rule tell us about the distribution of X?
Example: 68-95-99.7 Rule for Normal RV
- Background: IQ for a randomly chosen adult is a normal RV X with $\mu = 100$, $\sigma = 15$.
- Response: We can sketch the distribution of X.
[Figure: 68-95-99.7 Rule sketch for IQ, with the axis marked at 55, 70, 85, 100, 115, 130, 145.]

Example: Finding Probabilities with Rule
- Background: IQ for a randomly chosen adult is a normal RV X with $\mu = 100$, $\sigma = 15$.
- Question: Prob. of IQ between 70 and 130?
- Response: 70 and 130 are 2 standard deviations either side of the mean, so the prob. of IQ between 70 and 130 is 0.95.
- Question: Prob. of IQ less than 70?
- Response: 70 is 2 standard deviations below the mean, so the prob. of IQ less than 70 is 0.025.
- Question: Prob. of IQ less than 100?
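The Rule's probabilities for the IQ example can be compared with exact normal areas. The sketch below is an illustration only: `normal_cdf` is a hypothetical helper built on `math.erfc`, and the "rule" column holds the rounded values from the slides.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X < x) for a normal RV X with mean mu and standard deviation sigma."""
    return 0.5 * math.erfc(-(x - mu) / (sigma * math.sqrt(2)))

MU, SIGMA = 100, 15  # IQ of a randomly chosen adult, from the slides

# Central areas within 1, 2, 3 standard deviations of the mean
for k, rule in [(1, 0.68), (2, 0.95), (3, 0.997)]:
    low, high = MU - k * SIGMA, MU + k * SIGMA
    exact = normal_cdf(high, MU, SIGMA) - normal_cdf(low, MU, SIGMA)
    print(f"P({low} < X < {high}): rule {rule}, exact {exact:.4f}")

p_between_70_130 = normal_cdf(130, MU, SIGMA) - normal_cdf(70, MU, SIGMA)
p_below_70 = normal_cdf(70, MU, SIGMA)    # rule: (1 - 0.95) / 2 = 0.025
p_below_100 = normal_cdf(100, MU, SIGMA)  # symmetry about the mean gives 0.5
print(f"P(X < 70) = {p_below_70:.4f}, P(X < 100) = {p_below_100:.2f}")
```

The exact areas differ from the Rule only in the third decimal place, which is why the Rule is adequate for sketching and estimating.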
Example: Finding Probabilities with Rule
- Background: IQ for a randomly chosen adult is a normal RV X with $\mu = 100$, $\sigma = 15$.
- Response: The normal curve is symmetric about its mean of 100, so the prob. of IQ less than 100 is 0.5.

Example: Finding Values of X with Rule
- Background: IQ for a randomly chosen adult is a normal RV X with $\mu = 100$, $\sigma = 15$.
- Question: Prob. is 0.997 that IQ is between which two values?
- Response: Prob. is 0.997 that IQ is between 55 and 145 (3 standard deviations either side of the mean).
- Question: Prob. is 0.025 that IQ is above which value?
- Response: Prob. is 0.025 that IQ is above 130 (2 standard deviations above the mean).
[Figure: 68-95-99.7 Rule sketch for IQ, with the axis marked at 55, 70, 85, 100, 115, 130, 145.]

Example: Using Rule to Evaluate Probabilities
- Background: Foot length of a randomly chosen adult male is a normal RV X with $\mu = 11$, $\sigma = 1.5$ (in.).
- Question: How unusual is a foot less than 6.5 inches?
Example: Using Rule to Evaluate Probabilities
- Background: Foot length of a randomly chosen adult male is a normal RV X with $\mu = 11$, $\sigma = 1.5$ (in.).
- Response: 6.5 = 11 - 3(1.5) is 3 standard deviations below the mean, so $P(X < 6.5) = 0.0015$: a foot this short is very unlikely.
[Figure: 68-95-99.7 Rule sketch for X = male foot length, with the axis marked at 6.5, 8.0, 9.5, 11, 12.5, 14.0, 15.5.]

Example: Using Rule to Estimate Probabilities
- Background: Foot length of a randomly chosen adult male is a normal RV X with $\mu = 11$, $\sigma = 1.5$ (in.).
- Question: How unusual is a foot more than 13 inches?
- Response: 13 is between 12.5 (1 sd above the mean) and 14 (2 sds above the mean), so $P(X > 13)$ is between 0.025 and 0.16.

Definition (Review)
z-score, or standardized value, tells how many standard deviations below or above the mean the original value is:
$z = \dfrac{\text{value} - \text{mean}}{\text{standard deviation}}$
- Notation for Population: $z = \dfrac{x - \mu}{\sigma}$
- $z > 0$ for x above the mean
- $z < 0$ for x below the mean
- Unstandardize: $x = \mu + z\sigma$

Standardizing Values of Normal RVs
Standardizing to z lets us avoid sketching a different curve for every normal problem: we can always refer to the same standard normal (z) curve.
[Figure: 68-95-99.7 Rule sketch for Z, with the axis marked from -3 to 3.]

Example: Standardized Value of Normal RV
- Background: Typical nightly hours slept by college students are normal with $\mu = 7$, $\sigma = 1.5$.
- Question: How many standard deviations below or above the mean is 9 hrs?
- Response: Standardize to $z = \dfrac{9 - 7}{1.5} = +1.33$: 9 is 1.33 standard deviations above the mean.

Example: Standardizing/Unstandardizing Normal RV
- Background: Typical nightly hours slept by college students are normal with $\mu = 7$, $\sigma = 1.5$.
- Questions:
  - What is the standardized value for a sleep time of 4.5 hours?
  - If the standardized sleep time is +2.5, how many hours is it?
- Responses:
  - $z = \dfrac{4.5 - 7}{1.5} = -1.67$
  - $x = 7 + 2.5(1.5) = 10.75$ hours

Interpreting z-scores (Review)
This table classifies ranges of z-scores informally, in terms of being unusual or not.

| Size of \|z\| | Unusual? |
|---|---|
| greater than 3 | extremely unusual |
| between 2 and 3 | very unusual |
| between 1.75 and 2 | unusual |
| between 1.5 and 1.75 | maybe unusual (depends on circumstances) |
| between 1 and 1.5 | somewhat low/high, but not unusual |
| less than 1 | quite common |

Looking Ahead: Inference conclusions will hinge on whether or not a standardized score can be considered "unusual".
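The standardizing and unstandardizing formulas, together with the informal z-score table, can be sketched as follows. The function names are hypothetical, and how a value exactly on a boundary (for example |z| = 2) is classified is an assumption, since the table only gives ranges.

```python
def standardize(x, mu, sigma):
    """z-score: how many standard deviations x lies below or above the mean."""
    return (x - mu) / sigma

def unstandardize(z, mu, sigma):
    """Original value corresponding to a z-score."""
    return mu + z * sigma

def describe(z):
    """Informal label from the z-score table (boundary handling is an assumption)."""
    size = abs(z)
    if size > 3:
        return "extremely unusual"
    if size >= 2:
        return "very unusual"
    if size >= 1.75:
        return "unusual"
    if size >= 1.5:
        return "maybe unusual (depends on circumstances)"
    if size >= 1:
        return "somewhat low/high, but not unusual"
    return "quite common"

MU, SIGMA = 7, 1.5  # nightly hours slept by college students, from the slides

for hours in (9, 4.5, 10.75):
    z = standardize(hours, MU, SIGMA)
    print(f"{hours} hours: z = {z:+.2f}, {describe(z)}")

print(f"z = +2.5 corresponds to {unstandardize(2.5, MU, SIGMA)} hours")
```

Keeping the sign in `standardize` and taking the absolute value only inside `describe` mirrors the slides: the sign says below or above the mean, while the size says how unusual the value is.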
z Scores I Background Typical nightly hours slept by college El Background Typical nightly hours slept by college students normal u 7 a 15 students normal u 7 0 15 El Questions El Responses I How unusual is a sleep time of45 hours z167 I Sleep time of45 hours I How unusual is a sleep time of 1075 hours z25 I Sleep time of 1075 hours Size of z Unusual z greater than 3 extremely unusual z between 2 and 3 very unusual 2 between 175 and 2 unusual 2 between 15 and 175 maybe unusual depends on circumstances 2 between 1 and 15 somewhat lowhigh but not unusual 2 less than 1 quite common c2uu7 Nancy Pfenning Eiememaiy Statistics Luuking attne Big Picture MAME c2uu7 Nancy Pfenning Eiememaiy Statistics Luuking attne Big Picture Lizi an N ormal Probability Problems Example Estimating Probability Given 2 I I Estimate probability given 2 I I Background Sketch of 6895997 Rule for Z El Probability close to 0 or 1 for extreme Z I Estimate 2 given probability area16 A area16 l Estimate probability given nonstandard x area025 3968 l Estimate nonstandard x given probability a39ea39 25 0015 15 a 95 area lg 0 i 997 gt1 i i T I 3 2 71 0 1 2 3 Z I Question Estimate PZlt147 C 2mm Nancy Pfenning Eiementary Statistics Looking attne Big Picture Li 7 5i C 2mm Nancy Pfenning Eiementary Statistics Looking attne Big Picture Ln 52 Elementary Statistics Looking at the Big Picture 10 C 2007 Nancy Pfenning Example Estimating Probability Given 2 Example Estimating Probability Given 2 11 Background Sketch of 6895997 Rule for Z 11 Background Sketch of 6895997 Rule for Z area16 1 area16 area16 area 68 68 gt area025 area39025 area025 are 0015 ar a 15 x 95 l 95 a e 709 0 s QIEW l gt1 0 1 227 i i l 73 2 0 1 2 393 Z a 72 71 0 1 2 la 1 47 Z 11 Response PZlt1 47 11 Question Estimate PZgt075 0 2mm Nancy Pfenmng E1ememary 5mm Lunkwg atthe E119 F39mture L17 54 0 2mm Nancy Pfenmng E1ememary 5mm Lunkwg atthe E119 F39mture L17 55 Example Estimating Probability Given 2 Example Estimating Probability Given 2 11 
Background Sketch of 6895997 Rule for Z areao50 area7 areai16 11 Background Sketch of 6895997 Rule for Z area16 area16 b r area025 EBBH area 025 39r g are OO15 area 15 4 35 I h 0 3995 gt1 l n amalgam 1 997 7 1 71 IAVIIIIIIIIIIIIIIIIrA l t I l t 393 2 r1 0 1 2 1393 Z 3 2 1 0 1 2 3 75 V r 2 11 Response PZgt075 I III Question Estimate PZlt28 C 2mm Nancy Pfenmng E1ememary 5mm Lunkwg atthe E19 F39mture L17 57 C m7 Nancy Pfenmng E1ememary 5mm Lunkwg atthe E19 P1cture L17 58 Elementary Statistics Looking at the Big Picture C 2007 Nancy Pfenning Example Estimating Probability Given 2 o V 1i r E Normal Probability Problems l Estimate probability given 2 El IProbability close to 0 or 1 for extreme Z I Estimate 2 given probability l Estimate probability given nonstandard x I Estimate nonstandard x given probability e ZEIEI7 Nancy Prenning Eierneritaiy Statistics Luuking atthe Big Picture U761 0 3VR Vq l 3 2 1 o 1 2 3 Z 28 I Response PZlt2 8 c 2667 Nancy Pfenning Eiernentary Statistics Luuking atthe Big Picture Li 7 EU I Example Probabilities for Extreme 2 I Background Sketch of 6895997 Rule for Z area16 area16 I 39 gt Z I Question Estimate PZlt145 e ZEIEI7 Nancy Prenning Eiementary Statistics Luuking atthe Big Picture 768 area025 area025 are 0015 magg 5 lt1 95 l o 997 7 TT i i T I l 39 3 72 c1 0 1 2 3 L17 EZ Example Probabilities for Extreme 2 I Background Sketch of 6895997 Rule for Z area16 A area16 E 7684 area025 area025 are t0015 area 5 95 gt g o 997 e h i i T i i i 3 72 c1 0 1 2 3 Z I Response PZltl45 e ZEIEI7 Nancy Prenning Eiementary Statistics Luuking atthe Big Picture w 64 Elementary Statistics Looking at the Big Picture Example Probabilities for Extreme 2 I Background Sketch of 6895997 Rule for Z area16 area16 lt I gt 68 area025 area025 are 0015 area 15 41 95 l g o 997 7 h 1 7 T 39 I 39 3 2 1 0 1 2 3 Z I Question Estimate PZgt38 C 2997 Nancy Pfenmng Etementary Staustms Luukmg atthe Ehg F39mture L17 as C 2007 Nancy Pfenning Example Probabilities for Extreme 2 I 
Background Sketch of 6895997 Rule for Z area16 area16 4 I 39 gt 68 area025 area025 are 0015 area 15 95 gt g 0 997 1 7 T 39 7T 3 2 1 0 1 2 3 Z I Response PZgt38 C 2997 Nancy Pfenmng Etementary Staustms Luukmg atthe Ehg F39mture L17 B7 Example Probabilities for Extreme 2 I Background Sketch of 6895997 Rule for Z area16 area16 I 39 gt 768 area025 area025 are 0015 area 5 lt1 95 l g o 997 7 TT 1 1 T I i 39 3 2 1 0 1 2 3 Z I Question Estimate PZlt13 C 2997 Nancy Pfenmng Etementary Staustms Luukmg atthe Ehg F39mture L17 BE Elementary Statistics Looking at the Big Picture Example Probabilities for Extreme 2 I Background Sketch of 6895997 Rule for Z area16 area16 4 E 7684 area025 area025 are 0015 area 5 95 gt g o 997 71 1 T 1 3 2 1 0 1 2 3 Z I Response PZltl3 C 2997 Nancy Pfenmng Etementary Staustms Luukmg atthe Ehg Pcmre L17 7n C 2007 Nancy Pfenning Example Probabilities for Extreme 2 Example Probabilities for Extreme z I Background Sketch of 6895997 Rule for Z I Background Sketch of 6895997 Rule for Z area16 areai16 area16 area16 lt gt lt gt area39025 3968 area025 area39025 3968 area025 are 0015 area 15 are 0015 area 15 0 1 1 37 7 a g 0 I237 5 gt I 7 I I 71 I T I I I t I T I I 3 2 71 0 1 2 3 3 2 71 0 1 2 3 Z Z I Question Estimate PZgt235 II Response PZgt235 cnum Nancy F39fErIrIIng EIememary StatIstIcs LuukIng althe Etg F39Icture L17 71 cnum Nancy F39fErIrIIng EIememary StatIstIcs Luukmg althe Etg F39Icture L17 73 N ormal Probability Problems Example Estimating 2 Given Probability I Estimate probability given z I Background Sketch of 6895997 Rule for Z 1 Probability close to 0 or 1 for extreme Z I I Estimate 2 given probability I area16 A area16 I Estimate probability given non standarcl x area025 3968 areaquot 025 l Estimate nonstandard x given probability quot i0015 a 95 area 5 0 I 997 gt1 1 I T I 3 2 1 0 1 2 3 Z I Question Prob is 001 that Zltwhat value cnum Nancy Pfenmng EIememary StatIstIcs LuukIng althe Etg F39Icture L17 74 02uu7 Nancy F39fErIrIIng EIememary 
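The standardizing steps and the informal z-score table can be checked with a few lines of Python. This is a minimal sketch rather than anything from the lecture itself: the helper names (`standardize`, `unstandardize`, `describe`, `bracket_prob_below`) are made up for illustration, the cutoffs are the 68-95-99.7 Rule areas and the "Interpreting z-scores" table, and the mean 7 and standard deviation 1.5 come from the sleep example.

```python
# Sketch only: helper names are illustrative, not from the lecture.

# Cumulative areas P(Z < z) at z = -3..3 implied by the 68-95-99.7 Rule.
RULE_Z = [-3, -2, -1, 0, 1, 2, 3]
RULE_CUM = [0.0015, 0.025, 0.16, 0.5, 0.84, 0.975, 0.9985]


def standardize(x, mean, sd):
    """z = (x - mean) / sd: how many sds x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


def unstandardize(z, mean, sd):
    """x = mean + z * sd: convert a z-score back to the original units."""
    return mean + z * sd


def describe(z):
    """Informal label taken from the 'Interpreting z-scores' table."""
    size = abs(z)
    if size > 3:
        return "extremely unusual"
    if size >= 2:
        return "very unusual"
    if size >= 1.75:
        return "unusual"
    if size >= 1.5:
        return "maybe unusual"
    if size >= 1:
        return "somewhat low/high but not unusual"
    return "quite common"


def bracket_prob_below(z):
    """Return (low, high) bounds on P(Z < z) using only the Rule's areas."""
    low, high = 0.0, 1.0
    for cut, cum in zip(RULE_Z, RULE_CUM):
        if cut <= z:
            low = cum      # z is at or beyond this cut point
        else:
            high = cum     # first cut point above z closes the bracket
            break
    return low, high


z_short = standardize(4.5, mean=7, sd=1.5)    # 4.5 hours of sleep
x_long = unstandardize(2.5, mean=7, sd=1.5)   # z = +2.5 back to hours
print(round(z_short, 2), describe(z_short))
print(x_long, describe(2.5))
print(bracket_prob_below(z_short))
```

For the sleep example this reproduces the values quoted on the slides: 4.5 hours standardizes to about -1.67, and z = +2.5 corresponds to 10.75 hours. The bracket function only says which two Rule areas a probability falls between, which is the kind of estimate the slides ask for.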
Example: Estimating z Given Probability
- Background: Sketch of 68-95-99.7 Rule for Z.
  [Figure: 68-95-99.7 sketch with the given tail area marked]
- Response: Prob. is 0.01 that Z < ____
- Question: Prob. is 0.15 that Z > what value?
- Response: Prob. is 0.15 that Z > ____

Example: Estimating Probability Given x
- Background: Hrs. slept X normal; μ = 7, σ = 1.5.
- Question: Estimate P(X > 9).
- Response: ____
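The two problem types above (z given a probability, and a probability given a non-standard x) can be cross-checked against exact normal areas. The sketch below assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`; the variable names are illustrative, and the lecture itself only asks for rough estimates from the 68-95-99.7 Rule.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)         # standard normal curve
sleep = NormalDist(mu=7, sigma=1.5)   # hours slept, from the lecture example

# z given probability: left-tail area 0.01, then right-tail area 0.15.
z_left = Z.inv_cdf(0.01)              # P(Z < z_left) = 0.01
z_right = Z.inv_cdf(1 - 0.15)         # P(Z > z_right) = 0.15

# Probability given x: standardize 9 hours, then use the z curve.
z_nine = (9 - sleep.mean) / sleep.stdev
p_more_than_9 = 1 - Z.cdf(z_nine)

print(f"0.01 is P(Z < {z_left:.2f})")
print(f"0.15 is P(Z > {z_right:.2f})")
print(f"P(X > 9) = P(Z > {z_nine:.2f}) = {p_more_than_9:.3f}")
```

The exact answers land where the Rule says they should: an area of 0.01 lies between .0015 and .025, so the z value falls between -3 and -2, and an area of 0.15 is just under .16, so the z value is slightly above 1.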
Example: Estimating Probability Given x
- Background: Hrs. slept X normal; μ = 7, σ = 1.5.
- Question: Estimate P(6 < X < 8).
- Response: ____
- A Closer Look: -0.67 and +0.67 are the quartiles of the z curve.

Example: Estimating x Given Probability
- Background: Hrs. slept X normal; μ = 7, σ = 1.5.
  [Figure: 68-95-99.7 sketch with the given tail area marked]
- Question: 0.04 is P(X < ?)
- Response: ____
- Question: 0.20 is P(X > ?)
- Response: ____

Strategies for Normal Probability Problems
- Estimate probability given non-standard x
  - Standardize to z
  - Estimate probability using Rule
- Estimate non-standard x given probability
  - Estimate z
  - Unstandardize to x
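The second strategy (estimate z, then unstandardize to x) can be written out as two small steps. This is a sketch under the assumption that exact normal areas from Python's `statistics.NormalDist` are an acceptable stand-in for the Rule-based estimates; the function names are hypothetical, and the numbers 7 and 1.5 are the mean and standard deviation from the sleep example.

```python
from statistics import NormalDist

MEAN, SD = 7, 1.5     # hours slept, from the lecture example
Z = NormalDist()      # standard normal curve


def x_for_left_area(p, mean, sd):
    """Find x with P(X < x) = p: estimate z first, then unstandardize."""
    z = Z.inv_cdf(p)          # step 1: z with left-tail area p
    return mean + z * sd      # step 2: x = mean + z * sd


def x_for_right_area(p, mean, sd):
    """Find x with P(X > x) = p, using left-tail area 1 - p."""
    return x_for_left_area(1 - p, mean, sd)


def prob_between(a, b, mean, sd):
    """P(a < X < b): standardize both endpoints, then subtract z-curve areas."""
    za, zb = (a - mean) / sd, (b - mean) / sd
    return Z.cdf(zb) - Z.cdf(za)


x_low = x_for_left_area(0.04, MEAN, SD)     # 0.04 is P(X < ?)
x_high = x_for_right_area(0.20, MEAN, SD)   # 0.20 is P(X > ?)
p_mid = prob_between(6, 8, MEAN, SD)        # 6 and 8 standardize to -0.67 and +0.67

print(f"0.04 is P(X < {x_low:.2f})")
print(f"0.20 is P(X > {x_high:.2f})")
print(f"P(6 < X < 8) = {p_mid:.3f}")
```

Because 6 and 8 hours sit at roughly the quartiles of the curve, the middle probability comes out close to one half, which matches the "Closer Look" remark.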
Lecture Summary (Normal Random Variables)
- Relevance of normal distribution
- Continuous random variables; density curves
- 68-95-99.7 Rule for normal R.V.s
- Standardizing/unstandardizing
- Probability problems
  - Find probability given z
  - Find z given probability
  - Find probability given x
  - Find x given probability

Lecture 11: Relationships between Two Quantitative Variables: Correlation
- Display and summarize
- Correlation for direction and strength
- Properties of correlation
- Regression line

Looking Back: Review
- 4 Stages of Statistics
  - Data Production (discussed in Lectures 1-4)
  - Displaying and Summarizing
    - Single variables: 1 cat., 1 quan. (discussed in Lectures 5-8)
    - Relationships between 2 variables
      - Categorical and quantitative (discussed in Lecture 9)
      - Two categorical (discussed in Lecture 10)
  - Probability
  - Statistical Inference

Example: Two Single Quantitative Variables
- Review: Single quantitative variables
  - Summarize with mean and standard deviation
- Background: Data on male students' heights and weights:

Variable    N    Mean     Median   TrMean   StDev   SE Mean
height      17   69.765   69.000   69.800   2.137   0.518
weight      17   170.59   175.00   169.33   28.87   7.00

  [Figure: histograms of height (65 to 74 inches) and weight]
- Question: What do these tell us about the relationship between male height and weight?
- Response: ____

Definition
- Scatterplot displays relationship between 2 quantitative variables:
  - Explanatory variable (x) on horizontal axis
  - Response variable (y) on vertical axis

Example: Explanatory/Response Roles
- Background: We're interested in the relationship between male students' heights and weights.
- Question: Which variable should be graphed along the horizontal axis of the scatterplot?
- Response: ____

Definitions
- Form: relationship is linear if scatterplot points cluster around some straight line
- Direction: relationship is
  - positive if points slope upward left to right
  - negative if points slope downward left to right

Example: Form and Direction
- Background: Scatterplot displays relationship between male students' heights and weights.
  [Figure: scatterplot of weight (pounds) vs. height (inches)]
- Question: What are the form and direction of the relationship?
- Response: Form is ____, direction is ____

Strength of a Linear Relationship
- Strong: scatterplot points tightly clustered around a line
  - Explanatory value tells us a lot about response
- Weak: scatterplot points loosely scattered around a line
  - Explanatory value tells us little about response

Example: Relative Strengths
- Background: Scatterplots display:
  - mothers' ht. vs. fathers' ht. (left)
  - males' wt. vs. ht. (middle)
  - mothers' age vs. fathers' age (right)
  [Figure: three scatterplots, left to right as listed]
- Question: How do relationships' strengths compare? (Which is strongest, which is weakest?)
- Response:
  - Strongest is plot on ____
  - Weakest is plot on ____

Example: Negative Relationship
- Background: Scatterplot displays price vs. age for 14 used Pontiac Grand Am's.
  [Figure: scatterplot of price vs. age (in years)]
- Questions:
  - Why should we expect the relationship to be negative?
  - Does it appear linear? Is it weak or strong?
Example: Negative Relationship (price vs. age for 14 used Pontiac Grand Am's)
- Responses: ____

Definition
- Correlation r tells direction and strength of linear relation between 2 quantitative variables
- Direction: r is
  - positive for positive relationship
  - negative for negative relationship
  - zero for no relationship
- Strength: r is between -1 and +1; it is
  - close to 1 in absolute value for strong relationship
  - close to 0 in absolute value for weak relationship
  - close to 0.5 in absolute value for moderate relationship

Example: Extreme Values of Correlation
- Background: Scatterplots show relationships:
  - Price per kilogram vs. price per pound for groceries
  - Used cars' age vs. year made
  - Students' final exam score vs. (number) order handed in
  [Figure: three scatterplots, one per relationship]
- Question: Which has r = -1? r = 0? r = +1?
- Response:
  - Price per kilogram vs. price per pound: ____
  - Used cars' age vs. year made: ____
  - Students' final exam score vs. order handed in: ____
  For this class, order tells us nothing about final score.

Example: Other Values of r
- Background: Scatterplots display:
  - mothers' ht. vs. fathers' ht. (left)
  - males' wt. vs. ht. (middle)
  - mothers' age vs. fathers' age (right)
  [Figure: three scatterplots, left to right as listed]
- Question: Which graphs go with which correlation: r = 0.23, r = 0.78, r = 0.65?
- Response: r = ____ on left; r = ____ in middle; r = ____ on right

Example: Imperfect Relationships
- Background: For 50 states, % voting Republican vs. % Democrat in 2000 presidential election had r = -0.96.
  [Figure: scatterplot of 2000 Republican vote vs. Democratic vote by state]
- Questions:
  - Why should we expect the relationship to be negative?
  - Why is it imperfect?
- Responses:
  - Negative: ____
  - Imperfect: ____

More about Correlation r
- Tells direction and strength of linear relation between 2 quantitative variables
  - A strong curved relationship may have r close to 0
  - Correlation not appropriate for categorical data
- Unaffected by roles explanatory/response
- Unaffected by change of units
- Overstates strength if based on averages

Example: Correlation when Roles are Switched
- Background: For male students, plot weight vs. height (left) or height vs. weight (right).
  [Figure: scatterplots of weight (pounds) vs. height (inches) and of height vs. weight]
- Questions:
  - How do directions and strengths compare, left vs. right?
  - How do correlations r compare, left vs. right?
- Responses: ____
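The three properties of r listed under "More about Correlation r" can be checked numerically. The sketch below is an illustration, not the lecture's own computation: it uses the 17 male students' heights and weights as tabulated in the lecture's "Correlation Based on Averages" example, and it assumes that the three tallest students (72, 73 and 74 inches) each appear once with weights 195, 175 and 235, which agrees with the summary statistics (N = 17, mean height 69.765, mean weight 170.59). The function name `correlation` and the conversion factors are illustrative choices.

```python
from statistics import mean, stdev

# Heights (inches) and weights (pounds) for the 17 male students.
# Assumption: the students at 72, 73 and 74 inches are single observations.
heights_in = [65, 68, 68, 68, 69, 69, 69, 69, 69, 70, 70, 71, 71, 71, 72, 73, 74]
weights_lb = [140, 130, 150, 181, 125, 150, 172, 180, 185, 168, 180,
              145, 175, 214, 195, 175, 235]


def correlation(x, y):
    """r = average product of standardized x and y values (divisor n - 1)."""
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    products = ((a - mx) / sx * (b - my) / sy for a, b in zip(x, y))
    return sum(products) / (len(x) - 1)


# Individuals: weight vs. height.
r = correlation(heights_in, weights_lb)

# Roles switched: height vs. weight.
r_switched = correlation(weights_lb, heights_in)

# Units changed: centimeters and kilograms.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.4536 for w in weights_lb]
r_metric = correlation(heights_cm, weights_kg)

# Averages: one average weight per distinct height.
by_height = {}
for h, w in zip(heights_in, weights_lb):
    by_height.setdefault(h, []).append(w)
avg_heights = sorted(by_height)
avg_weights = [mean(by_height[h]) for h in avg_heights]
r_avg = correlation(avg_heights, avg_weights)

print(f"individuals:    r = {r:.2f}")
print(f"roles switched: r = {r_switched:.2f}")
print(f"metric units:   r = {r_metric:.2f}")
print(f"averages:       r = {r_avg:.2f}")
```

Switching roles or changing units leaves r unchanged because r is built from standardized values, which do not depend on which variable is called x or on the units of measurement. Averaging within each height removes the scatter among individuals, so the correlation based on averages comes out stronger.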
Example: Correlation when Units are Changed
- Background: For male students, plot Left: wt. (lbs.) vs. ht. (in.) or Right: wt. (kg) vs. ht. (cm).
  [Figure: scatterplots of weight vs. height in each set of units]
- Questions:
  - How do directions and strengths compare, left vs. right?
  - How do correlations r compare, left vs. right?
- Responses: ____

Example: Correlation Based on Averages

Ht    65   68           69                   70       71           72   73   74
Wt    140  130 150 181  125 150 172 180 185  168 180  145 175 214  195  175  235
AvWt  140  153.7        162.4                174.0    178.0        195  175  235

- Background: For male students, plot Left: wt. vs. ht. or Right: average wt. vs. ht.
  [Figure: scatterplots of weight vs. height (inches) and of average weight vs. height (inches)]
- Question: Which one has r = +0.87? (The other has r = +0.65.)
- Response: Plot on ____ has r = +0.87 (stronger).

In general, correlation based on averages tends to overstate strength because scatter due to individuals has been reduced.

Least Squares Regression Line
If form appears linear, then we picture points clustered around a straight line.
- Questions (Rhetorical):
  1. Is there only one best line?
  2. If so, how can we find it?
  3. If found, how can we use it?
- Response 3: If found, can use line to make predictions.
  Write equation of line ŷ = b0 + b1 x:
  - Explanatory value is x
  - Predicted response is ŷ
  - y-intercept is b0
  - Slope is b1
  and use the line to predict a response for any given explanatory value.
- Response 2: Find line that makes best predictions.
  Minimize sum of squared residuals (prediction errors). Resulting line called least squares line or regression line.
  A Closer Look: The mathematician Sir Francis Galton called it the regression line because of the regression to mediocrity seen in any imperfect relationship: besides responding to x, we see y tending towards its average value.
- Response 1: Methods of calculus give a unique best line.
  Best line has b1 = r (s_y / s_x) and b0 = ȳ - b1 x̄.

Example: Least Squares Regression Line
- Background: Car-buyer wants to know if $4000 is a fair price for an 8-yr-old Grand Am; uses software to regress price on age for 14 used Grand Am's.
  [Figure: fitted line plot of price vs. age (in years), Price = 14685.3 - 1281.64 Age]
- Question: How can she use the line?
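The formulas for the best line, b1 = r s_y / s_x and b0 = ȳ - b1 x̄, can be tried out directly. The lecture does not list the 14 Grand Am prices and ages, so this sketch substitutes the male students' heights and weights from the correlation examples as stand-in data (assuming the students at 72, 73 and 74 inches are single observations); the function names are hypothetical. It also checks the least squares idea by confirming that nudging the slope or intercept makes the sum of squared residuals larger.

```python
from statistics import mean, stdev

# Stand-in data: heights (inches) and weights (pounds) of 17 male students.
heights_in = [65, 68, 68, 68, 69, 69, 69, 69, 69, 70, 70, 71, 71, 71, 72, 73, 74]
weights_lb = [140, 130, 150, 181, 125, 150, 172, 180, 185, 168, 180,
              145, 175, 214, 195, 175, 235]


def correlation(x, y):
    """r = average product of standardized x and y values (divisor n - 1)."""
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    return sum((a - mx) / sx * (b - my) / sy for a, b in zip(x, y)) / (len(x) - 1)


def least_squares(x, y):
    """Return (b0, b1) with b1 = r * s_y / s_x and b0 = ybar - b1 * xbar."""
    b1 = correlation(x, y) * stdev(y) / stdev(x)
    b0 = mean(y) - b1 * mean(x)
    return b0, b1


def sum_squared_residuals(b0, b1, x, y):
    """Sum of squared prediction errors, (observed y - predicted y) squared."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


b0, b1 = least_squares(heights_in, weights_lb)
best = sum_squared_residuals(b0, b1, heights_in, weights_lb)

# Any other line should do worse; try shifting the intercept and the slope.
shifted = sum_squared_residuals(b0 + 1, b1, heights_in, weights_lb)
tilted = sum_squared_residuals(b0, b1 + 0.1, heights_in, weights_lb)

predicted_at_70 = b0 + b1 * 70   # predicted weight for a 70-inch student
print(f"weight-hat = {b0:.1f} + {b1:.2f} * height")
print(f"predicted weight at 70 inches: {predicted_at_70:.1f}")
print(best < shifted, best < tilted)
```

The same two steps apply to the car example: once software reports the intercept and slope for price on age, the predicted price for an 8-year-old car is the intercept plus the slope times 8.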
idit i sW335 W II Correlation r tells directlon and strength S2I7A89 Rsii7s5 r8 8 i I Properties of r I Unaffected by explanatoryresponse roles Price In 3 5 2 r I Unaffected by change of units I Overstates strength if based on averages i 5 ii I Least squares regression line for predictions Age in years A El Response Predict for x8 y C 2mm Nancy Pfenning Eiementaiy Statistics Luuking atthe Big Picture Lii 53 C 2mm Nancy Pfenning Eiementaiy Statistics Luuking atthe Big Picture LB 54 Elementary Statistics Looking at the Big Picture 11 C 2007 Nancy Pfenning Lecture 4 Designing Studies Focus on Sample Surveys ulssues for any Study Design ulssues in Design of Sample Survey Questions 2 mm mm mm Elementary shims mm We aw 7mm Looking Back Review El 4 Stages of Statistics I Displaying and Summarizing I Probability I Statistical Inference 2mm mm mm Eiementaiy Statstics mm mm aw mm in Looking Back Review El Types of Study Design varia C I Experiment researchers control explanatory I Observational study values occur naturally El Special case sample surveys o en selfreported l El Two steps in Data Production I Obtain an unbiased sample summary 0 sa I Assess variables values to obtain unbiased El Design survey questions to assess Values Without bias 2 mm mm mm Elementary shims mm We aw 7mm in Example Formulating a Survey Question El Background A popular 2005 movie sparked speculation how common is it for a 40year old male to be a virgin El Question Assuming you had a representative sample of 40yearold males what survey question would you ask to nd out what proportion are virgins Students can jot down question amp discuss a er covering issues in survey question design 2mm mm mm Eiementaiy Statstics mm mm aw mm m Elementary Statistics Looking at the Big Picture C 2007 Nancy Pfenning Sample Survey Design Issues to Consider Example Open vs closed questions El Open vs closed questions El Unbalanced response options El Leading questions or planting ideas with questions El Complicated 
questions El Sensitive questions El Hardtode ne concepts 2mm mnwmm Eiemenhwstahshcs mm tithe gimme L45 El Questions 1 What kind of question is this a open b closed 2 What is an open question 2mm mm mm ammw Statstics mm um aw mm L45 Example Open vs closed questions De nitions El Responses 1 What kind of question is this a open b closed 2 What is an open question 2mm mnwmm Eiemenhwstahshcs mm tithe gimme ma El An open question does not have a xed set of response options El A closed question either provides or implies a xed set of possible responses 2mm mm mm ammw Statstics mm um aw mm L4H Elementary Statistics Looking at the Big Picture C 2007 Nancy Pfenning Example Overly restrictive options Example Overly restrictive options El Background A neuroscientist asked survey respondents How o en do you dream in color Answer alwayssometimesnever El Question What is the most important improvement that should be made to this survey question 2mm mnwmm amnuwsmms mm tithe swim mu El Background A neuroscientist asked survey respondents How often do you dream in color Answer alwayssometimesnever El Response mmu mm mm ammwsmm makinvanheatv We L412 Example Unbalanced Response Options Example Unbalanced Response Options El Background 91 of Americans surveyed rated their own health as good to excellent El Questions I Is this result surprising to you I If so does it seem unexpectedly high or low 2mm mnwmm amnuwsmms mm tithe swim L413 El Background 91 of Americans surveyed rated their own health as good to excellent El Response mmu mm mm ammwsmm tnakmvanheaw We L415 Elementary Statistics Looking at the Big Picture C 2007 Nancy Pfenning Example Unbalanced Response Options El Background 91 of Americans surveyed rated their own health as good to excellent Options provided were Excellent Very Good Good Fair Poor El Question Now is the result surprising 2mm mnwmm Eiemenuwsuusucs mm mm swim L416 Example Unbalanced Response Options El Background 91 of Americans surveyed rated their own health 
as good to excellent Options provided were Excellent Very Good Good Fair Poor El Response mmmmnm mm ammwsmm makmvanheaiv mm ma Example Deliberate bias El Background The following question was posted on wwwahumanrightcom Ifmy child or my spouse were assaulted I would choose one Run away and hope my kid or spouse can keep up 7 Be a good Witness so I can tell the cops What happened later Try to convince the attacker to stop through verbal PBISHZSIOH 4 Fight to stop the attack El Question Do we know what response the surveyer wants us to choose 4 2mm mnwmm Eiemenuwsuusucs mm mm swim L419 Elementary Statistics Looking at the Big Picture Example Deliberate bias El Background The following question was posted on wwwahumanrightcom If my child or my spouse were assaulted I would Run away and hope my kid or spouse can keep up Be a good Witness so I can tell the cops What happened later Try to convince the attacker to stop through verbal PBISHZSIOH 4 Fight to stop the attack El Response We are obviously supposed to c oosei H mmmmnm mm ammwsmm damning mm L421 C 2007 Nancy Pfenning Deliberate Bias If it s clear what response the surveyer wants then the results are not useful from a statistical standpoint 2mm mnwmm amnuwsmsm mm mm gimme L422 Example Complicated question El Background A telephone surveyer asked a homemaker to agree or disagree with this I don t go out of my way to purchase lowfat foods unless they re also low in calories El Question How can this survey question be improved 2mm mm mm ammw Stalstics mm um aw mm L423 Example Complicated question El Background A telephone surveyer asked a homemaker to agree or disagree with this I don t go out of my way to purchase lowfat foods unless they re also low in calories El Response 2mm mnwmm amnuwsmsm mm mm gimme L425 Example A controversial question Elementary Statistics Looking at the Big Picture El Background Anonymous FA Youth Survey given to 6m12Lh public school students asked How old were you when you first I got 
suspended from school I got arrested I carried a handgun etc Choose never have 10 or younger 11 12 17 El Questions I Why did parents object I Why was the question worded this way 2mm mm mm ammw Stalstics mm um aw mm L425 C 2007 Nancy Pfenning Example A controversial question El Background Anonymous FA Youth Survey given to 6 12 h public school students asked How old were you when you rst I got suspended from school I got arrested I carried a handgun etc Choose never have 10 oryounger 11 12 17 El Responses 2mm mnwmm amnuwsmgm mm mm mm L475 Example Keyboards for Sense ofAnonymity El Background A stats computer tutor was piloted in a class where students consented to be identi ed by name Still one student lled in the text boxes with obscenities El Question Why did the student write inappropriately in the computer lab and not on his hardcopy homeworks or exams 2mm mm mm ammw Statstics mm um aw mm mu Example Keyboards for Sense ofAnonymity El Background A stats computer tutor was piloted in a class where students consented to be identi ed by name Still one student lled in the text boxes with obscenities El Response This tendency is used to researchers advantage when seeking responses to sensitive questions 2mm mnwmm amnuwsmgm mm mm mm L432 Example HardtoDe ne Concepts Elementary Statistics Looking at the Big Picture El Background A survey found 19 of Americans believe money can bu appiness I Robert Frost Happiness makes up in height for What it lacks in length I Albert Camus But What is happiness except the simple harmony between a man and the life he leads El Questions I By Frost s de nition can money buy happiness I By Camus s definition can money buy happiness I What de nition of happiness Were respondents using 2mm mm mm ammw Statstics mm um aw mm L433 C 2007 Nancy Pfenning Example HardtoDe ne Concepts Example Formulating a Survey Question El Background A survey found 19 of Americans believe money can buy happiness I Robert Frost Happiness makes up in height for 
what it lacks in length.
- Albert Camus: But what is happiness except the simple harmony between a man and the life he leads?
Responses:
- Frost:
- Camus:
- Respondents:

Example: Formulating a Survey Question
Background: Earlier we asked: Assuming you had a representative sample of 40-year-old males, what survey question would you ask to find out what proportion are virgins?
Question: Are you satisfied with the phrasing of your question, and if not, how would you rephrase it?
Response: Consider:
- Open or closed?
- If closed, what response options are provided?
- Is question designed to elicit honest responses?
- Is the concept well defined?

Issues to Consider for Any Study Design
- Sample size
- Errors in study's conclusions

Example: Sample Size and Study Design
Background: Researchers want to know if stronger sunscreens cause more time in sun. They could design an observational study or an experiment to test this.
Question: Which is better: using 10 students or 100 students?
Response: It depends.
- If study is flawed (obs. study with confounding variables or poorly designed experiment) → ____
- If study is well designed → ____

Issues to Consider for Any Study Design
- Sample size
- Errors in study's conclusions

Example: Two Types of Error
Background: A
study tested effectiveness of radar guns to identify speeders.
Question: What are the two possible errors in the study's conclusions, and the potential harmful consequences of each? Note: the study either concludes that the guns work properly or that they do not.
Response:

Example: Sample Size and Error
Background: A study tested effectiveness of radar guns to identify speeders.
Question: Which error is more likely to be made if only a small sample of guns is tested?
Response:

Example: Errors in Home Drug Testing
Background: A study discussed limitations and risks in the use of home drug testing kits.
Question: What are the two possible errors in a drug test's conclusions, and the potential harmful consequences of each?
Response:

Lecture Summary (Sample Surveys)
- Open vs. closed questions
- Unbalanced response options
- Leading questions
- Complicated questions
- Sensitive questions
- Hard-to-define concepts
- Issues for any study design
  - Sample size
  - Errors in study's conclusions

Lecture: Binomial Random Variables

Looking Back: Review
4 Stages of Statistics
- Data Production (discussed in Lectures 1-4)
- Displaying and Summarizing (Lectures 5-12)
- Probability
  - Finding
Probabilities (discussed in Lectures 13-14)
  - Random Variables (introduced in Lecture 15)
  - Binomial
  - Sampling Distributions
- Statistical Inference

Binomial Random Variables: lecture topics
- Definition
- What if Events are Dependent?
- Center, Spread, Shape of Counts, Proportions
- Normal Approximation

Definition (Review)
- Discrete Random Variable: one whose possible values are finite or countably infinite (like the numbers 1, 2, 3, ...)
Looking Ahead: To perform inference about categorical variables, need to understand behavior of sample proportion. A first step is to understand behavior of sample counts. We will eventually shift from discrete counts to a normal approximation, which is continuous.

Definition
Binomial Random Variable: counts sampled individuals falling into particular category.
- Sample size n is fixed
- Each selection independent of others
- Just 2 possible values for each individual
- Each has same probability p of falling in category of interest

Example: A Simple Binomial Random Variable
Background: The random variable X is the count of tails in two flips of a coin.
Questions: Why is X binomial? What are n and p? How do we display X?
Responses:
- Sample size n fixed
- Each selection independent of others
- Just 2 possible values for each
- Each has same probability p
- Display with: [Figure: probability histogram, X = number of tails]
Looking Back: We already discussed this random variable when learning about probability distributions.

Example: Determining if R.V. is Binomial
Background: Consider following R.V.:
- Pick card from deck of 52, replace, pick another. X = no. of cards
picked until you get ace.
Question: Is X binomial?
Response:

Example: Determining if R.V. is Binomial
Background: Consider following R.V.:
- Pick 16 cards without replacement from deck of 52. X = no. of red cards picked.
Question: Is X binomial?
Response:

Example: Determining if R.V. is Binomial
Background: Consider following R.V.:
- Pick 16 cards with replacement from deck of 52. W = no. of clubs, X = no. of diamonds, Y = no. of hearts, Z = no. of spades.
Question: Are W, X, Y, Z binomial?
Response:

Example: Determining if R.V. is Binomial
Background: Consider following R.V.:
- Pick with replacement from German deck of 32 (doesn't include numbers 2-6), then from deck of 52, back to deck of 32, etc., for 16 selections altogether. X = no. of aces picked.
Question: Is X binomial?

Example: Determining if R.V. is Binomial
Background: Consider following R.V.:
- Pick with replacement from German deck of 32, which doesn't
include numbers 2-6; then from deck of 52, back to deck of 32, etc., for 16 selections altogether. X = no. of aces picked.
Response:

Example: Determining if R.V. is Binomial
Background: Consider following R.V.:
- Pick 16 cards with replacement from deck of 52. X = no. of hearts picked.
Question: Is X binomial?
Response:
- fixed n = 16
- selections independent (with replacement)
- just 2 possible values (heart or not)
- same p = 0.25 for all selections

Requirement of Independence
Snag:
- Binomial theory requires independence
- Actual sampling done without replacement, so selections are dependent
Resolution: When sampling without replacement, selections are approximately independent if population is at least 10n.

Example: A Binomial Probability Problem
Background: The proportion of Americans who are left-handed is 0.10. Of 44 presidents, 7 have been left-handed (proportion 0.16).
Question: How can we establish if being left-handed predisposes someone to be president?
Response: Determine if 7 out of 44 (0.16) is ____ when sampling at random from a population where 0.10 fall in the category of interest.

Solving Binomial Probability Problems
- Use binomial formula or tables: only practical for small sample sizes
- Use software: won't take this approach until later
- Use normal approximation for count X: not quite; more interested in proportions
- Use normal
approximation for proportion: need mean and standard deviation

Example: Mean of Binomial Count, Proportion
Background: Based on long-run observed outcomes, probability of being left-handed is approx. 0.1. Randomly sample 100 people.
Questions: On average, what should be the
- count of lefties?
- proportion of lefties?
Responses: On average, we should get
- count of lefties ____
- proportion of lefties ____

Mean and SD of Counts, Proportions
Count X binomial with parameters n, p has
- Mean $np$
- Standard deviation $\sqrt{np(1-p)}$
Sample proportion $\hat{p}$ has
- Mean $p$
- Standard deviation $\sqrt{\frac{p(1-p)}{n}}$
Looking Back: Formulas for s.d. require independence: population at least 10n.

Example: Standard Deviation of Sample Count
Background: Probability of being left-handed is approx. 0.1. Randomly sample 100 people. Sample count has mean $100(0.1) = 10$, standard deviation $\sqrt{100(0.1)(1-0.1)} = 3$.
Question: How do we interpret these?
Response: On average, expect sample count ____ lefties. Counts vary; typical distance from ____ is ____.

Example: SD of Sample Proportion
Background: Probability of being left-handed is approx. 0.1. Randomly sample 100 people. Sample proportion has mean 0.1, standard deviation $\sqrt{\frac{0.1(1-0.1)}{100}} = 0.03$.
Question: How do we interpret these?
Response: On average, expect sample proportion ____ lefties. Proportions vary; typical distance from ____ is ____.

Example: Role of Sample Size in Spread
Background: Consider proportion of tails in various sample sizes n of coinflips.
Questions: What is the standard deviation for
- n = 1?
- n = 4?
- n = 16?
Responses:
- n = 1: s.d. = ____
- n = 4: s.d. = ____
- n = 16: s.d. = ____
Because of n in the denominator of the formula for standard deviation, spread of sample proportion ____ as n increases.

Shape of Distribution of Count, Proportion
Binomial count X or proportion $\hat{p}$ for repeated random samples has shape approximately normal if samples are large enough to offset underlying skewness (Central Limit Theorem). For a given sample size n, shapes are identical for count and proportion.

Example: Underlying Coinflip Distribution
Background: Distribution of count or proportion of tails in n = 1 coinflip (p = 0.5).
[Figure: probability histograms of count and proportion of tails in 1 coinflip]
Question: What are the distributions' shapes?

Example: Distribution for 4
Coinflips
Background: Distribution of count or proportion of tails in n = 4 coinflips (p = 0.5).
[Figure: probability histograms of X = count and $\hat{p}$ = proportion of tails in 4 coinflips]
Question: What are the distributions' shapes?
Response:

Shift from Counts to Proportions
- Binomial theory begins with counts
- Inference will be about proportions

Example: Distribution of $\hat{p}$ for 16 Coinflips
Background: Distribution of proportion of tails in n = 16 coinflips (p = 0.5).
[Figure: probability histogram of proportion of tails in 16 coinflips]
Response:

Example: Underlying Distribution of Lefties
Background: Distribution of proportion of lefties (p = 0.1) for samples of n = 1.
[Figure: probability histogram of sample proportion left-handed, n = 1]
Question: What is the shape?
Response:

Example: Dist.
of $\hat{p}$ of Lefties for n = 16
Background: Distribution of proportion of lefties (p = 0.1) for n = 16.
[Figure: probability histogram of $\hat{p}$ = sample proportion left-handed, n = 16]
Question: What is the shape?
Response:

Example: Dist. of $\hat{p}$ of Lefties for n = 100
Background: Distribution of proportion of lefties (p = 0.1) for n = 100.
[Figure: probability histogram of $\hat{p}$ = sample proportion left-handed, n = 100]
Question: What is the shape?
Response:

Rule of Thumb: Sample Proportion Approximately Normal
Distribution of $\hat{p}$ is approximately normal if sample size n is large enough relative to shape, determined by population proportion p.
Require $np \ge 10$ and $n(1-p) \ge 10$.
Together, these require us to have larger n for p close to 0 or 1 (underlying distribution skewed right or left).

Example: Applying Rule of Thumb
Background: Consider distribution of sample proportion for various n and p.
Question: Is shape approximately normal?
- n = 4, p = 0.5
- n = 20, p = 0.5
- n = 20, p = 0.1
- n = 20, p = 0.9
- n = 100, p = 0.1
Response: Normal?
- n = 4, p = 0.5: $np = 4(0.5) = 2 < 10$
- n = 20, p = 0.5: $np = 20(0.5) = 10 = n(1-p)$

Example: Left-handed Presidents Problem
Background: The proportion of Americans who are left-handed is 0.1. We consider $P(\hat{p} \ge 7/44 = 0.16)$ for a sample of 44 presidents.
Question: Can we use a normal approximation to find the probability
that at least 7 of 44 (0.16) are left-handed?

Example: Applying Rule of Thumb (remaining responses)
- n = 20, p = 0.1: $np = 20(0.1) = 2 < 10$
- n = 20, p = 0.9: $n(1-p) = 20(1-0.9) = 2 < 10$
- n = 100, p = 0.1: $np = 100(0.1) = 10$, $n(1-p) = 100(0.9) = 90$, both $\ge 10$

Example: Solving the Left-handed Problem
Background: The proportion of Americans who are left-handed is 0.1. We consider $P(\hat{p} \ge 7/44 = 0.16)$ for a sample of 44 presidents.
Response: $np = 44(0.1) = 4.4 < 10$, so the normal approx. is poor.
[Figure: probability histogram of sample proportion with normal curve]
Approximated probability is 0.10.

Example: From Count to Proportion and Vice Versa
Background: Consider these reports:
- In a sample of 87 assaults on police, 23 used weapons.
- 0.44 in sample of 25 bankruptcies were due to med. bills.
Question: In each case, what are n, X, and $\hat{p}$?
Response:
- First has n = ____, X = ____, $\hat{p}$ = ____
- Second has n = ____, $\hat{p}$ = ____, X = ____

Lecture Summary (Binomial Random Variables)
- Definition: 4 requirements for binomial
- R.V.s that do or don't conform to requirements
- Relaxing requirement of independence
- Binomial counts, proportions
  - Mean
  - Standard deviation
  - Shape
- Normal approximation to binomial

Lecture 26
Inference for Quantitative Variable: Confidence Interval
- t Confidence Interval for Population Mean

Looking Back: Review
4 Stages of Statistics
- Data Production (discussed in Lectures 1-4)
- Displaying and Summarizing (Lectures 5-12)
- Probability (discussed in Lectures 13-20)
- Statistical Inference
  - 1 categorical (discussed in Lectures 21-23)
  - 1 quantitative: z CI, z test, t CI, t test
  - categorical and quantitative
  - 2 categorical
  - 2 quantitative

Lecture 26 topics, continued
- Comparing z and t Confidence Intervals
- When neither z nor t Applies
- Other Levels of Confidence
- t Test vs. Confidence Interval

Behavior of Sample Mean (Review)
For random sample of size n from population with mean $\mu$, standard deviation $\sigma$, sample mean $\bar{X}$ has
- mean $\mu$
- standard deviation $\frac{\sigma}{\sqrt{n}}$
- shape approximately normal for large enough n

Sample Mean: Standardizing to z (Review)
- If $\sigma$ is known, standardized $\bar{X}$ follows z (standard normal) distribution: $z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$
- If $\sigma$ is unknown but n is large enough (20 or 30), then $s \approx \sigma$ and $\frac{\bar{X} - \mu}{s/\sqrt{n}} \approx z$

Sample Mean: Standardizing to t (Review)
For $\sigma$ unknown and n small, $\frac{\bar{X} - \mu}{s/\sqrt{n}} = t$
- t (like z) centered at 0 since $\bar{X}$ centered at $\mu$
- t (like z) symmetric and bell-shaped if $\bar{X}$ normal
- t more spread than z (s.d. > 1) [s gives less info]
- t has n - 1 degrees of freedom (spread depends on n)

Inference by Hand Based on z or t
                           σ known    σ unknown
  small sample (n < 30)    z          t
  large sample (n ≥ 30)    z          z (approx.)
- z used if $\sigma$ known or n large
- t used if $\sigma$ unknown and n small

Inference with Software Based on z or t
- z used if $\sigma$ is known
- t used if $\sigma$ is unknown
[Figure: distribution of t, heavy-tailed for small n]
z or t: standardized difference between sample mean and proposed population mean
Confidence Interval for Mean (Review)
95% confidence interval for $\mu$, $\sigma$ known, is $\bar{x} \pm 2\frac{\sigma}{\sqrt{n}}$
- multiplier 2 is from z distribution (95% of normal values within 2 s.d.s of mean)
- For n small, $\sigma$ unknown, can't say 95% CI is $\bar{x} \pm 2\frac{s}{\sqrt{n}}$

Confidence Interval for Mean ($\sigma$ Unknown)
95% confidence interval for $\mu$ is $\bar{x} \pm \text{multiplier} \cdot \frac{s}{\sqrt{n}}$
- multiplier from t distribution with n - 1 degrees of freedom (df)
- multiplier at least 2, closer to 3 for very small n

Degrees of Freedom
- Mathematical explanation of df: not needed for elementary statistics
- Practical explanation of df: several useful distributions like t, F, chi-square are families of similar curves; df tells us which one applies (depends on sample size n)

z or t: Which to Concentrate On?
- For purpose of learning, start with z (know what to expect from 68-95-99.7 Rule, etc.; only one z distribution)
- For practical purposes, t more realistic (usually don't know population s.d. $\sigma$)
- Software automatically uses appropriate t distribution with n - 1 df: just enter data

Example: Confidence Interval with t Curve
Background: Random sample of shoe sizes for 9 college males: 11.5, 12.0, 11.0, 15.0, 11.5, 10.0, 9.0, 10.0, 11.0
Question: What is 95% CI for population mean? Use t (8 df).
[Figure: t curve for 8 df; tail areas 0.05, 0.025, 0.01, 0.005 beyond 1.86, 2.31, 2.90, 3.36]
Response: Mean = 11.222, s = 1.698, n = 9, multiplier = 2.31
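The hand calculation in this example can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the multiplier 2.31 is the t table value for 8 df quoted on the slide, and the variable names are illustrative assumptions rather than anything from the lecture.

```python
import math
import statistics

# Shoe sizes for the 9 college males in the example
shoe_sizes = [11.5, 12.0, 11.0, 15.0, 11.5, 10.0, 9.0, 10.0, 11.0]

n = len(shoe_sizes)
mean = statistics.mean(shoe_sizes)
s = statistics.stdev(shoe_sizes)   # sample s.d. (divides by n - 1)
t_multiplier = 2.31                # assumed from the t table: 8 df, 95% confidence

std_error = s / math.sqrt(n)       # s / sqrt(n)
margin = t_multiplier * std_error  # multiplier * s / sqrt(n)
ci_low, ci_high = mean - margin, mean + margin

print(f"mean = {mean:.3f}, s = {s:.3f}, SE = {std_error:.3f}")
print(f"95% t interval: ({ci_low:.2f}, {ci_high:.2f})")
```

Software would look up the multiplier itself (slightly more precisely than 2.31), which is why its reported endpoints can differ from the hand calculation in the third decimal place.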
Example: t Confidence Interval with Software
Background: Random sample of shoe sizes for 9 college males: 11.5, 12.0, 11.0, 15.0, 11.5, 10.0, 9.0, 10.0, 11.0
Question: What is 95% CI for population mean?
Response: Just enter 9 values, request interval.

    One-Sample T: Shoe
    Variable   N    Mean   StDev   SE Mean   95.0% CI
    Shoe       9  11.222   1.698     0.566   (9.917, 12.527)

Example: Compare t and z Confidence Intervals
Background: Random sample of shoe sizes for 9 college males: 11.5, 12.0, 11.0, 15.0, 11.5, 10.0, 9.0, 10.0, 11.0. We produced 95% t confidence interval $11.222 \pm 2.31\frac{1.698}{\sqrt{9}} = 11.222 \pm 1.307 = (9.92, 12.53)$. If 1.698 had been population s.d., would get z CI $11.222 \pm 1.96\frac{1.698}{\sqrt{9}} = 11.222 \pm 1.109 = (10.11, 12.33)$.
Question: How do the t and z intervals differ?
Response:
- t multiplier is 2.31; z multiplier is 1.96
- t interval width about ____; z interval width only about ____
- $\sigma$ known → ____ info → ____ interval

Example: t vs. z Confidence Intervals, Large n
Background: Earnings for sample of 446 students at a university averaged 3776, with s.d. 6500. The t multiplier for 95% confidence and 445 df is 1.9653.
Question: How different are the t and z intervals?
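The comparison asked for in this example can be sketched numerically. The following is a minimal Python sketch, not part of the lecture: it uses only the summary numbers quoted in the Background (n = 446, mean 3776, s.d. 6500) and the three candidate multipliers; the dictionary labels and variable names are assumptions made for illustration.

```python
import math

n = 446
sample_mean = 3776.0
sample_sd = 6500.0
std_error = sample_sd / math.sqrt(n)   # s / sqrt(n)

# Candidate multipliers for 95% confidence (values quoted in the example)
multipliers = {
    "t (445 df)": 1.9653,
    "precise z": 1.96,
    "approximate z": 2.0,
}

intervals = {}
for label, mult in multipliers.items():
    margin = mult * std_error
    intervals[label] = (sample_mean - margin, sample_mean + margin)
    low, high = intervals[label]
    print(f"{label:>14}: {sample_mean:.0f} +/- {margin:.0f} = ({low:.0f}, {high:.0f})")
```

With a sample this large the three margins of error differ by only a few units out of roughly 600, so the choice of multiplier has little practical effect.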
Response: The intervals will be almost identical, whether we use
- t multiplier 1.9653
- precise z multiplier 1.96
- approximate z multiplier 2
Interval approx. ____

Behavior of Sample Mean (Review)
For random sample of size n from population with mean $\mu$, standard deviation $\sigma$, sample mean $\bar{X}$ has
- mean $\mu$
- standard deviation $\frac{\sigma}{\sqrt{n}}$
- shape approx. normal for large enough n
If $\sigma$ is unknown and n small, $\frac{\bar{X} - \mu}{s/\sqrt{n}} = t$

Guidelines for $\bar{X}$ Approx. Normal (Review)
Can assume shape of $\bar{X}$ for random samples of size n is approximately normal if
- Graph of sample data appears normal; or
- Sample data fairly symmetric, n at least 15; or
- Sample data moderately skewed, n at least 30; or
- Sample data very skewed, n much larger than 30
If $\bar{X}$ is not normal, $\frac{\bar{X} - \mu}{s/\sqrt{n}}$ is not t.

Example: Small, Skewed Data Set
Background: Credits taken by 14 nontraditional students: 4, 7, 11, 11, 12, 13, 13, 14, 14, 17, 17, 17, 17, 18
Question: What is a 95% confidence interval for population mean?
[Figure: histogram of credits]
Response: n small, shape of credits left-skewed → ____
Looking Ahead: Nonparametric methods can be used for small n, skewed data.

Intervals at Other Levels of Confidence
- Lower confidence → smaller t multiplier
- Higher confidence → larger t multiplier
- Table excerpt: at any given level, t > z multiplier (using s, not $\sigma$, gives wider interval: less info)
- t multipliers decrease as df (and n) increase

                           Confidence Level
                           90%      95%           98%      99%
  z (infinite n)           1.645    1.960 or 2    2.326    2.576
  t: df = 19 (n = 20)      1.73     2.09          2.54     2.86
  t: df = 11 (n = 12)      1.80     2.20          2.72     3.11
  t: df = 3 (n = 4)        2.35     3.18          4.54     5.84

Example: Intervals at Other Confidence Levels
Background: Random sample of shoe sizes for 9 college males: 11.5, 12.0, 11.0, 15.0, 11.5, 10.0, 9.0, 10.0, 11.0. We can produce 95% confidence interval $11.2 \pm 2.31\frac{1.7}{\sqrt{9}} = (9.9, 12.5)$.
Question: What would 99% confidence interval be, and how does it compare to 95% interval? Use the fact that t multiplier for 8 df, 99% confidence is 3.36.
Response: 99% interval is $11.2 \pm 3.36\frac{1.7}{\sqrt{9}}$ = ____
- Width for 95%: ____
- Width for 99%: ____
[Figure: t curve for 8 df with central areas 90%, 95%, 98%, 99% between ±1.86, ±2.31, ±2.90, ±3.36]

Looking Back: Review
4 Stages of Statistics
- Data Production (already discussed)
- Displaying and Summarizing (already discussed)
- Probability
- Statistical Inference
  - 1 categorical
  - 1 quantitative: z CI, z test, t CI, t test
  - categorical and quantitative
  - 2 categorical
  - 2 quantitative

Summary of t Confidence Intervals
Confidence interval for $\mu$ is $\bar{x} \pm \text{multiplier} \cdot \frac{s}{\sqrt{n}}$, where multiplier depends on
- df: smaller for larger n, larger for smaller n
- level: smaller for lower level, larger for higher
Note: margin of error is larger for larger s.
Interval narrower for
- larger n (via df and $\sqrt{n}$ in denominator)
- lower level of confidence
- smaller s.d. (distribution with less spread)

From z Confidence Intervals to Tests (Review)
- For confidence intervals, used "inside" probabilities
- For hypothesis tests, used "outside" probabilities
[Figure: z curve with inside areas 0.90, 0.95, 0.98, 0.99 between ±1.645, ±1.960, ±2.326, ±2.576, and tail areas 0.05, 0.025, 0.01, 0.005]

From t Confidence Intervals to Tests
- Confidence interval: use multiplier for t dist., n - 1 df
- Hypothesis test: P-value based on tail of t dist., n - 1 df

Example: Hypothesis Test, t vs. z
Background: Suppose one test with very large n has z = 2; another test with n = 19 (18 df) has t = 2.
[Figure: t curve for sample size n = 19 (df = 18); tail areas 0.05, 0.025, 0.01, 0.005 beyond 1.73, 2.10, 2.55, 2.88]
Question: How do P-values compare for z and t? Assume alternative is "greater than."

Comparing Critical Values: z with t for 18 df
[Figure: z curve with critical values 1.645, 1.960, 2.326, 2.576 alongside t curve for 18 df with critical values 1.73, 2.10, 2.55, 2.88]
Response:
- 90-95-98-99 Rule → z P-value ____
- t curve for 18 df → t P-value ____

Example: t Test by Hand
Background: Wts. of 19 female college students: 110, 110, 112, 120, 120, 120, 125, 125, 130, 130, 132, 133, 134, 135, 135, 135, 145, 148, 159
Question: Is population mean reported by NCHS (141.7) plausible, or is there evidence that we've sampled from population with lower mean (or that there is bias due to under-reporting)?
Response: $H_0: \mu = 141.7$ vs. $H_a: \mu < 141.7$
1. Pop. ≥ 190; shape of wts. close to normal, n = 19 OK.
2. $\bar{x} = 129.36$, $s = 12.82$, $t = \frac{129.36 - 141.7}{12.82/\sqrt{19}} = -4.19$
3. P-value = $P(t \le -4.19)$ is small, because t more extreme than 3 can be considered unusual for most n; in particular, for 18 df, $P(t < -2.88)$ is less than 0.005.
   [Figure: t tail probabilities for df = 18; tail areas 0.05, 0.025, 0.01, 0.005 beyond 1.73, 2.10, 2.55, 2.88]
   Since -4.19 < -2.88, P-value < 0.005.
4. Reject $H_0$. Conclude bias: sample unrepresentative, or values under-reported.
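The steps of this hand calculation can be sketched in Python. This is a minimal illustration using only the standard library, not the lecture's own method: the critical value 2.88 (18 df, tail area 0.005) is taken from the t table values shown in the example, and the variable names are assumptions. Because the code does not round the sample mean first, its mean and t differ from the slide's in the second decimal place.

```python
import math
import statistics

# Weights of the 19 female college students in the example
weights = [110, 110, 112, 120, 120, 120, 125, 125, 130, 130,
           132, 133, 134, 135, 135, 135, 145, 148, 159]

mu_0 = 141.7        # population mean reported by NCHS (null hypothesis value)
t_critical = 2.88   # assumed from the t table: 18 df, tail area 0.005

n = len(weights)
mean = statistics.mean(weights)
s = statistics.stdev(weights)                 # sample s.d. (divides by n - 1)
t_stat = (mean - mu_0) / (s / math.sqrt(n))   # standardized sample mean

# Alternative is mu < 141.7, so the evidence is in the left tail:
# t below -2.88 means the P-value is below 0.005.
p_value_below_0005 = t_stat < -t_critical

print(f"n = {n}, mean = {mean:.2f}, s = {s:.2f}, t = {t_stat:.2f}")
print("P-value < 0.005:", p_value_below_0005)
```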
Lecture Summary (Inference for Means: t Confidence Intervals)
• t confidence interval for population mean
  • Multiplier from t distribution with n-1 df
  • When to perform inference with z or t
  • Constructing t CI by hand or with software
• Comparing z and t confidence intervals
• When neither z nor t applies
• Other levels of confidence
• From confidence interval to hypothesis test
• t test by hand

Lecture 29, Nancy Pfenning, Stats 1000
Reviewing Confidence Intervals and Tests for Ordinary One-Sample, Matched Pairs, and Two-Sample Studies About Means

Example: Blood pressure X was measured for a sample of 10 black men. It was found that $\bar{x} = 114.9$, $s = 10.84$. Give a 90% confidence interval for mean blood pressure $\mu$ of all black men. [Note: we can assume that blood pressure tends to differ for different races or genders, and that is why a separate study is made of black men; the confounding variables of race and gender are being controlled.]

This is an ordinary one-sample t procedure. A level 90% confidence interval for $\mu$ is $\bar{x} \pm t^* \frac{s}{\sqrt{n}}$, where $t^*$ has 10 - 1 = 9 df. Consulting the df = 9 row and 90% confidence column of Table A.2, we find $t^* = 1.83$. Our confidence interval is $114.9 \pm 1.83 \frac{10.84}{\sqrt{10}} = (108.6, 121.2)$. Here is what the MINITAB output looks like:

               N     MEAN   STDEV  SE MEAN   90.0 PERCENT C.I.
    calcbeg   10   114.90   10.84     3.43   (108.62, 121.18)

Example: Blood pressure for a sample of 10 black men was measured at the beginning and end of a period of treatment with calcium supplements. To test at the 5% level if calcium was effective in lowering blood pressure, let the R.V. X denote decrease in blood pressure, beginning minus end, and $\mu_D$ would be the population mean decrease. This is a matched pairs procedure. To test $H_0: \mu_D = 0$ vs. $H_a: \mu_D > 0$, we find differences X to have sample mean $\bar{d} = 5.0$, sample standard deviation $s = 8.74$. The t statistic is $t = \frac{\bar{d} - 0}{s/\sqrt{n}} = \frac{5.0}{8.74/\sqrt{10}} =$
1.81, and the P-value is $P(T \ge 1.81)$. We refer to Table A.2 for the t(9) distribution and see that 1.81 is just under 1.83, which puts our P-value just over .05. Our test has not quite succeeded in finding the difference to be significantly greater than zero in a statistical sense. Populations of black men treated with calcium may experience no decrease in blood pressure. MINITAB output appears below.

    TEST OF MU = 0.00 VS MU G.T. 0.00
                N    MEAN   STDEV  SE MEAN      T   P VALUE
    calcdiff   10    5.00    8.74     2.76   1.81     0.052

It is possible that our sample size was too small to generate statistically significant results. Another concern is the possibility of confounding variables influencing their blood pressure change. The placebo effect may tend to bias results towards a larger decrease. Or time may play a role: if the beginning or end measurement date happened to be in the middle of a harsh winter or a politically stressful time, results could be affected.

Example: Data for a control group (taking placebos) of 11 black men at the beginning and end of the same time period produced control sample mean difference $\bar{d}_2 = -.64$ and $s_2 = 5.87$. Now we test $H_0: \mu_1 - \mu_2 = 0$ [same as $H_0: \mu_1 = \mu_2$, or mean difference for calcium-takers same as mean difference for placebo-takers] vs. $H_a: \mu_1 - \mu_2 > 0$ [same as $H_a: \mu_1 > \mu_2$, or mean difference for calcium-takers greater than mean difference for placebo-takers]. The t statistic is

$t = \frac{(\bar{d}_1 - \bar{d}_2) - 0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{5 - (-.64)}{\sqrt{\frac{8.74^2}{10} + \frac{5.87^2}{11}}} = \frac{5.64}{3.282} = 1.72$

Since 10 - 1 < 11 - 1, use 9 df. In this row of Table A.2, we see that 1.72 is smaller than 1.83, so the P-value is larger than .05. Once again, there is not quite enough evidence to reject the null hypothesis. Note that in the MINITAB output below, degrees of freedom were calculated in a very complicated way to be 15, whereas we simply took the smaller sample size minus one, which was 9.

    TWOSAMPLE T FOR calcdiff VS contdiff
                N    MEAN   STDEV  SE MEAN
    calcdiff   10    5.00    8.74     2.76
    contdiff   11   -0.64    5.87     1.77
    TTEST MU calcdiff = MU contdiff (VS GT): T = 1.72  P = 0.053  DF = 15

Robustness: the sample sizes are quite small, and so MINITAB
plots of the above distributions should be consulted to verify that they have no pronounced outliers or skewness. Data values are shown in the following table.

                  Calcium                          Control
    Beginning    End   Difference    Beginning    End   Difference
       107       100        7           123       124       -1
       110       114       -4           109        97       12
       123       105       18           112       113       -1
       129       112       17           102       105       -3
       112       115       -3            98        95        3
       111       116       -5           114       119       -5
       112       102       10           112       114       -2
       136       125       11           110       121      -11
       102       104       -2           117       118       -1
       107       106        1           119       114        5
                                        130       133       -3

                N     MEAN   MEDIAN   TRMEAN   STDEV  SEMEAN
    calcbeg    10   114.90   111.50   113.88   10.84    3.43
    calcend    10   109.90   109.00   109.25    7.80    2.47
    calcdiff   10     5.00     4.00     4.62    8.74    2.76
    contbeg    11   113.27   112.00   113.11    9.02    2.72
    contend    11   113.91   114.00   113.89   11.33    3.42
    contdiff   11    -0.64    -1.00    -0.89    5.87    1.77

Example: A biologist suspects that the antiseptic Benzamil actually impairs the healing process. To test her suspicions with a matched-pairs design, 9 salamanders are randomly selected for treatment. Each has one wounded hind leg treated with Benzamil, the other wounded leg treated with saline as a control. The healing X (area in square millimeters covered with new skin) is measured after a certain period of time, with the following results:

    Animal No.   Benzamil   Control   Difference (B-C)
        1           .14       .32         -.18
        2           .08       .15         -.07
        3           .21       .42         -.21
        4           .13       .13          .00
        5           .10       .26         -.16
        6           .08       .07          .01
        7           .11       .20         -.09
        8           .04       .16         -.12
        9           .19       .18          .01

$\bar{d} = -.09$, $s_D = .084$

First find a 95% confidence interval for $\mu_d$, the population mean difference (benzamil minus control). Then test $H_0: \mu_d = 0$ vs. $H_a: \mu_d < 0$. Note that because treatment and control are applied to both legs of the same salamander, the design is matched pairs and a one-sample t procedure should be used on the single sample of differences.

The critical value $t^*$ for our 95% confidence interval is taken from the 9 - 1 = 8 df row and the 95% confidence column of Table A.2: $t^* = 2.31$. Our interval is $-.09 \pm 2.31 \frac{.084}{\sqrt{9}} = -.090 \pm .065 = (-.155, -.025)$. Note that the above interval contains only negative values, leading us to expect the difference to be statistically significant.

To test $H_0: \mu_d = 0$ vs. $H_a: \mu_d < 0$, use $t = \frac{-.09 - 0}{.084/\sqrt{9}} = -3.21$. Because $H_a$ has the < sign, the P-value is $P(T \le -3.21) = P(T$
$\ge 3.21)$ by the symmetry of the T distribution. Our test statistic is between 2.90 and 3.36 in the 8 df row, which means the P-value is between .01 and .005. No level $\alpha$ has been specified, so we will simply judge the P-value to be small and reject $H_0$. Our conclusion is that Benzamil does indeed impair the healing process.

               N      MEAN    MEDIAN    TRMEAN    STDEV   SEMEAN
    benzamil   9    0.1200    0.1100    0.1200   0.0543   0.0181
    control    9    0.2100    0.1800    0.2100   0.1071   0.0357
    diff       9   -0.0900   -0.0900   -0.0900   0.0843   0.0281

               N      MEAN    STDEV   SE MEAN   95.0 PERCENT C.I.
    diff       9   -0.0900   0.0843    0.0281   (-0.1548, -0.0252)

    TEST OF MU = 0.0000 VS MU L.T. 0.0000
               N      MEAN    STDEV   SE MEAN       T   P VALUE
    diff       9   -0.0900   0.0843    0.0281   -3.20    0.0063

Example: Suppose a biologist uses a two-sample design to test if Benzamil impairs the healing process: one random sample of 9 salamanders has a wounded hind leg treated with Benzamil; a different sample of 9 salamanders has a wounded hind leg treated with saline. Now find a 95% confidence interval for the difference between the mean healings, and test $H_0: \mu_1 - \mu_2 = 0$ vs. $H_a: \mu_1 - \mu_2 < 0$. Note: we use $t^*$ for df the smaller of 9 - 1 and 9 - 1, which is of course 8. Once again we use the 95% confidence column.

    Benzamil   Control
       .14       .32
       .08       .15
       .21       .42
       .13       .13
       .10       .26
       .08       .07
       .11       .20
       .04       .16
       .19       .18

$\bar{x}_1 = .12$, $\bar{x}_2 = .21$, $s_1 = .054$, $s_2 = .107$

Now our 95% confidence interval is $(.12 - .21) \pm 2.31 \sqrt{\frac{.054^2}{9} + \frac{.107^2}{9}} = -.090 \pm .092 = (-.182, .002)$. Note that this confidence interval does contain zero. Our test statistic can be calculated to be -2.25, producing a P-value between .05 and .025. We could reject at the .05 level of significance, but not at the .01 level of significance. Thus, we see how the matched pairs study had a better chance of pinning down a difference. Subtracting values for each individual cancels out the variation among individuals, helping us to concentrate on the variation between treatment and control.

    95 PCT CI FOR MU benzamil - MU control: (-0.1781, -0.001861)
    TTEST MU benzamil = MU control (VS LT): T = -2.25  P = 0.023  DF = 11

Robustness: What plots should be made first to validate use of the above procedures? Sample sizes are small, so the data
should show no outliers or skewness. Looking at the shapes of the histograms below, results of the matched pairs procedure are dubious because of skewness; the two-sample results should be fine.

[Figure: histograms (frequency) of difference, benzamil, and control.]

Example: Producers of gasoline want to test which is better, Gas A or Gas B. Miles per gallon are measured for 6 cars using Gas A, and for another set of 6 cars using Gas B.

    Gas A   Gas B
      15      13
      20      17
      25      23
      25      24
      30      28
      35      34

$\bar{x}_1 = 25.00$, $\bar{x}_2 = 23.17$, $s_1 = 7.07$, $s_2 = 7.52$, $n_1 = 6$, $n_2 = 6$

The two-sample t statistic is $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{25.00 - 23.17}{\sqrt{\frac{7.07^2}{6} + \frac{7.52^2}{6}}} = .43$. The P-value, according to MINITAB, is extremely large: .674. There is no evidence at all that $\mu_1 \ne \mu_2$, because the P-value is large, because t is small, because $s_1$ and $s_2$ are large. High variation among mileages for various cars prevented us from pinning down the effects of using a different gas. A matched pairs design would be better.

Example: Producers of gasoline measure mpg for 6 cars using Gas A and for the same 6 cars using Gas B. Which gas is used first is determined by a coin flip.

    Gas A   Gas B   Difference
      15      13        2
      20      17        3
      25      23        2
      25      24        1
      30      28        2
      35      34        1

$\bar{d} = 1.833$, $s_D = .753$

Now the t statistic is $t = \frac{1.833}{.753/\sqrt{6}} = 5.97$ and, according to MINITAB, the P-value is extremely small: .002. We have strong evidence against $H_0: \mu_d = 0$, because the P-value is very small, because t is large, because $s_D$ is small. Concentrating on the difference between mileages, Gas A minus Gas B, wipes out the differences among mileages for various cars and helps control this outside variable.

Lecture 30
Chapter 14: More About Regression

In Chapter 5, we displayed the relationship between two quantitative variables with a scatterplot, and summarized it by reporting direction, form, and strength. If the form appeared linear, then we made a much more specific summary by describing the relationship with the equation of a straight line, called the least squares
regression line. We also used the correlation r to specify the direction and strength of the relationship. All of this was done for the sample of explanatory and response pairs only. By construction, our line was the one that best fitted the sample data points, but we did not attempt to draw conclusions about how the explanatory and response variables were related in the entire population from which the sample was taken.

Now that we are familiar with the principles of statistical inference, our goal in this chapter is to use sample explanatory and response values to draw conclusions about how the variables are related for the population. It should go without saying that such conclusions will only be meaningful if the sample is truly representative of the larger population. As usual, all results are based on probability distributions, which tell us what we can expect from random behavior.

The first step in this inference process is an important one: examine the scatterplot to decide if the form of the relationship really does appear linear. The methods of inference that we will develop cannot help us in producing evidence that a straight-line relationship holds for the larger population; this is something that we must decide for ourselves, based on the appearance of the scatterplot. If the points seem to cluster around a curve rather than a straight line, then other, more advanced options must be explored. In more advanced treatments of relationships between two quantitative variables, methods are presented for transforming variables so that the resulting relationship is linear. In this book, we will proceed no further if the relationship is non-linear.

If linearity seems to be a reasonable assumption, then we can use inference to draw conclusions about what the line should be like for the entire population, and also about how much spread there is around the line. Whereas in Chapter 5 we only went so far as to predict a single value for the response to a given explanatory value, we will now have the
tools to make interval estimates.

Our first example concerns two variables which common sense suggests should have a positive linear relationship: ages of students' mothers and fathers.

Example: Are ages of all students' mothers and fathers related? If so, what can we say about this relationship for a population of age pairs? Because couples tend to have ages that are reasonably close to one another, we would expect the mother to be on the young side if the father is young, and on the old side if the father is old. There is reason to expect a rather steady increase in the variable MotherAge as values of the variable FatherAge increase. Therefore, we do expect the relationship to be positive and linear, not just for a sample of age pairs, but also for the larger population.

Using the methods of Chapter 5, we can look at a scatterplot of father and mother ages and decide that it appears linear. The least squares method can be used to fit a line that comes closest to the data points, in the sense that it minimizes the sum of squared residuals, which are the sample prediction errors. The typical size of sample prediction error is S in the output: it is the square root of the average squared distance of observed response values minus predicted response values. This average is calculated dividing by $n - 2$, which will be our regression degrees of freedom.

[Figure: regression plot of MotherAge versus FatherAge with fitted line MotherAge = 14.5418 + 0.665761 FatherAge; S = 3.28794, R-Sq = 61.0%, R-Sq(adj) = 60.9%.]

    The regression equation is
    MotherAge = 14.5418 + 0.665761 FatherAge
    S = 3.28794   R-Sq = 61.0%   R-Sq(adj) = 60.9%
    431 cases used 15 cases contain missing values
    Predictor       Coef   SE Coef       T       P
    Constant      14.542     1.317   11.05   0.000
    FatherAge    0.66576   0.02571   25.89   0.000
    Pearson correlation of FatherAge and MotherAge = 0.781
    P-Value = 0.000

You may have noticed that a p-value is always reported along with the correlation r, and that t statistics and p-values are included with the regression output. We will pay more attention to these once we have established a meaningful hypothesis test procedure for the
regression context. First we consider relationships for populations, as opposed to for samples, of quantitative explanatory and response values.

When we introduced the process of performing statistical inference about a single parameter, such as population mean, based on a statistic, such as sample mean, we acknowledged that although sample mean may be our best estimate for population mean, it is almost surely off by some amount. Similarly, although the least squares regression line is our best guess for the line that describes the relationship for the entire population, it is also probably off to some extent. Unlike inference about a single parameter like unknown population mean, when we perform inference about a relationship between two quantitative variables there are actually three unknown parameters.

• First of all, we only know the slope $b_1 = .666$ of the line that best fits our sample. The line that best fits the entire population may have more or less slope. We use $\beta_1$ to denote the unknown slope of the population regression line. The graph below shows that other slopes are plausible candidates for $\beta_1$.

[Figure: scatterplot of MotherAge versus FatherAge with the sample line; the line for the population may have slope this high or this low.]

• Next, we only know the intercept $b_0 = 14.542$ of the least squares line fitted from the sample. The line that best fits the entire population has an unknown intercept $\beta_0$. There is a whole range of plausible values for $\beta_0$, resulting in lines that may be lower or higher than the line constructed from our sample.

[Figure: scatterplot of MotherAge versus FatherAge; the line for the population may have an intercept that positions it this high or this low.]

• Thirdly, the typical size of the n = 431 residuals for our sample is reported in the regression output to be S = 3.288. The spread $\sigma$ about the population regression line for all age pairs is unknown. The population may exhibit less spread than what is seen in our sample, or more.

[Figure: scatterplot of MotherAge versus FatherAge; typical size s = 3.3 of residuals tells spread about the line for the sample; population spread may be more or less than s.]
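The three sample quantities named in the list above (slope $b_1$, intercept $b_0$, and typical residual size $s$) can be computed directly from a set of explanatory/response pairs. The following is a minimal sketch using only the Python standard library. The function name `least_squares` and the age pairs are hypothetical, invented for illustration; they are not the 431 pairs behind the regression output in these notes.

```python
import math

# Hypothetical (FatherAge, MotherAge) pairs, invented for illustration only.
pairs = [(42, 41), (45, 44), (48, 45), (50, 49),
         (53, 50), (55, 51), (60, 56), (63, 57)]


def least_squares(pairs):
    """Return (b0, b1, s): sample intercept, sample slope, typical residual size."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    # Sum of squared x-deviations and sum of cross-products of deviations.
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    b1 = sxy / sxx                 # sample slope, estimates beta_1
    b0 = y_bar - b1 * x_bar        # sample intercept, estimates beta_0
    # Residual = observed response minus predicted response.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in pairs)
    s = math.sqrt(sse / (n - 2))   # divide by n - 2, the regression df
    return b0, b1, s


b0, b1, s = least_squares(pairs)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, s = {s:.3f}")
```

Each of the three printed values is a statistic from one sample; rerunning the sketch on a different sample from the same population would give somewhat different values, which is why $\beta_0$, $\beta_1$ and $\sigma$ remain unknown.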
When inference methods were first introduced in Chapter 10, we learned to perform various forms of inference: first a point estimate, then a confidence interval around the point estimate, and then a hypothesis test. For practical purposes, it is usually enough to use $b_0$ as a point estimate for unknown intercept $\beta_0$ of the population regression line, and $s$ as a point estimate for unknown spread $\sigma$ about the population regression line. Because of the special role played by slope in the relationship between two quantitative variables, we will not merely use $b_1$ as a point estimate for unknown slope $\beta_1$ for the population. In addition, we will be interested in an interval estimate for $\beta_1$, and, perhaps most importantly, we will carry out a hypothesis test about $\beta_1$. Later on in this chapter, we will use inference to make response predictions in the form of intervals, given a particular explanatory value.

Behavior of slope for a sample

Before performing inference about population proportion in Chapters 10 and 11, and population mean in Chapters 12 and 13, we took a great deal of care in Chapter 9 to think about the behavior of sample proportion relative to population proportion, and sample mean relative to population mean. Similar considerations will be helpful now, so that we can grasp the workings of the process of inference for regression. We can imagine a large population of explanatory and response values (ages of all students' fathers and mothers) from which a sample is taken.

[Diagram: Population of Age Pairs (father, mother) -> Sample of Age Pairs]

Intuitively, it makes sense that if the ages are related linearly in the population, they should also be related more or less linearly in the sample. If a certain slope $\beta_1$ holds for the population, then the slope $b_1$ in the sample should be in the same ballpark. Similarly, if a certain intercept $\beta_0$ holds for the population, then the intercept $b_0$ for the sample
should be somewhere in that vicinity. Also, if responses for the entire population are spread about the line with some standard deviation $\sigma$, then the sample standard deviation $s$ should be similar.

The behavior of statistics like sample slope $b_1$ in random samples taken from the larger population of explanatory/response pairs is perfectly predictable, as long as the population relationship meets certain requirements. As we stated at the beginning of this chapter, the relationship must be linear. In addition, the distribution of standardized sample slope is exactly $t$ if the residuals are exactly normally distributed. For any explanatory value, responses then should vary normally about the population regression line, and their standard deviation is the parameter $\sigma$ that is estimated by $s$.

The graph below is an oversimplification of the situation, in that it shows only 20 normally distributed MotherAges for each FatherAge, instead of an infinite number. Likewise, the idealized model assumes FatherAges to be a continuous normal distribution, instead of just taking whole even-numbered age values as shown. Otherwise, it is a fair representation of how we imagine the population relationship between ages: it is positive and linear, with constant spread $\sigma$ about the regression line, following a normal pattern.

[Figure: MotherAge versus FatherAge for the idealized population. Each distribution of MotherAges is centered at the mean response to all such FatherAges, on the population regression line; the standard deviation of each distribution is sigma; the shape of each distribution is normal.]

Population relationship expressed as $\mu_y = \beta_0 + \beta_1 x$

Notice that a new symbol $\mu_y$ has been used to model the relationship in the larger population. Because statistics concerns itself with drawing conclusions about populations based on samples, we must always be sure to distinguish between parameters describing populations and statistics describing samples. In the case of a regression line, we have already used the notation $\hat{y}$ to refer to the response predicted for the sample: $\hat{y} = b_0 + b_1 x$,
where $b_0$ and $b_1$ are calculated from the sample data. The corresponding parameter is $\mu_y = \beta_0 + \beta_1 x$, the unknown population mean response to a given explanatory value $x$, which responds linearly with an unknown intercept $\beta_0$ and slope $\beta_1$. Note: the mean response $\mu_y$ is the same thing as expected response.

Distribution of sample slope $b_1$

As always, we report the long-run behavior of a sample statistic by describing its distribution, specifically by telling its center, spread, and shape.

• Center: If the previously mentioned requirements are met (linear scatterplot, normally distributed residuals, and apparently constant spread about the line), then slope $b_1$ of the least squares line for a random sample of explanatory/response pairs has mean equal to the unknown slope $\beta_1$ of the least squares line for the population.

• Spread: The standard deviation of sample slope $b_1$ is

$\frac{\sigma}{\sqrt{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}}$

which we estimate with the standard error of $b_1$,

$SE_{b_1} = \frac{s}{\sqrt{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}}$

where $s$, the estimate for spread $\sigma$ about the population regression line, measures typical residual size. Although the above formula need not be used for calculations as long as software is available, it is worth examining $SE_{b_1}$ to see how the residuals contribute to the spread of the distribution of sample slope. The appearance of $s$ in the numerator of $SE_{b_1}$ should make perfect intuitive sense: if the residuals as a group are small, then there is very little spread about the line, and we should be able to pinpoint its slope fairly precisely. Conversely, if the residuals are large, then there is much spread about the line, and there is a much wider range of plausible slopes. Note that the quantity $(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2$, which appears in the denominator, measures combined distances of explanatory values from their mean. This will be larger for larger sample sizes, and so $b_1$ has less spread for larger samples. Again, our intuition tells us that we should be better able to pinpoint the unknown population slope $\beta_1$ if we obtain a sample slope from a larger sample.

• Shape: Finally, $b_1$
itself has a normal shape if the residuals are normal, or if the sample size is large enough to offset non-normality of the residuals.

The graph below depicts what we have established about the distribution of sample slope $b_1$ for large enough sample sizes: it is centered at population slope $\beta_1$, has approximate standard deviation $SE_{b_1}$, and follows a normal distribution.

[Figure: normal curve for sample slope b1, centered at beta1 with standard deviation SEb1.]

Distribution of standardized sample slope

Recall that in Chapters 12 and 13, when we standardized sample mean using sample standard deviation $s$ instead of unknown population standard deviation $\sigma$, the resulting random variable $\frac{\bar{x} - \mu}{s/\sqrt{n}}$ followed a $t$ distribution instead of $z$. This could only be asserted if the sample size was large enough to offset any non-normality in the population distribution, so that the Central Limit Theorem could guarantee sample mean to be approximately normally distributed.

In this chapter, we standardize sample slope $b_1$ using $SE_{b_1}$, calculated from $s$ because $\sigma$ is unknown. The resulting standardized slope

$\frac{b_1 - \beta_1}{SE_{b_1}} = t$

follows a $t$ distribution, and its degrees of freedom are $n - 2$, the same as for $s$. Again, this can only be asserted if $b_1$ follows a normal distribution, which is the case if the sample is large enough to offset non-normality in the residuals. Remember also that for large samples, the $t$ distribution is virtually identical to that of $z$.

The distribution of standardized sample slope is displayed below: centered at zero, as is any $t$ distribution; standard deviation subject to degrees of freedom, which are determined by sample size (in particular, standard deviation close to 1 if the sample size is large enough to make $t$ roughly the same as $z$); and bell-shaped, like any $t$ distribution.

[Figure: t curve for standardized sample slope (b1 - beta1)/SEb1, centered at 0, with standard deviation as for the t distribution with n - 2 df.]

Now that we know more about the behavior of $b_1$ relative to $\beta_1$, we will make use of the critical role played by $\beta_1$ in the relationship between explanatory and response variables, so as to set up a test for evidence of
a relationship in the larger population. Because the construction of confidence intervals tends to be more intuitive than carrying out hypothesis tests, we start by setting up a confidence interval for the unknown slope of the linear relationship in the population. After that, we will establish a procedure for testing the null hypothesis that slope for the population relationship is zero. In practice, it may make sense to carry out the test first, and then report the confidence interval for slope if there is statistical evidence of a relationship.

Inference about $\beta_1$

If the relationship between sampled values of two quantitative variables appears linear, then methods of Chapter 5 can be used to produce the line that best fits those sample values. For example, ages of students' fathers and mothers produced the following regression output.

    Pearson correlation of FatherAge and MotherAge = 0.781
    P-Value = 0.000
    The regression equation is
    MotherAge = 14.5 + 0.666 FatherAge
    431 cases used 15 cases contain missing values
    Predictor       Coef   SE Coef       T       P
    Constant      14.542     1.317   11.05   0.000
    FatherAge    0.66576   0.02571   25.89   0.000
    S = 3.288   R-Sq = 61.0%   R-Sq(adj) = 60.9%

The fact that r is .781 tells us there is a fairly strong positive relationship between x and y data values. Based on the fact that $b_1 = .666$, our best guess for how MotherAge responds to FatherAge is to predict that if one student's father is 1 year older than a second student's father, the first student's mother would be .666 years older than the second student's mother. By now, we know enough about behavior of samples to realize that there must be some margin of error attached to this slope. For every additional year of FatherAge in the population, does MotherAge tend to be an additional .666 years, give or take about .1 years? Or .666 years, give or take about 1 year? As usual, the size of the margin of error will supply important information. In the former case, having evidence that population slope is in the interval (.566, .766) would convince us of a positive relationship, whereas in the latter case, where
the range of plausible values (-.334, 1.666) for unknown population slope straddles zero, we could not claim to have statistical evidence of a relationship. Knowing enough about the distribution of sample slope relative to population slope will help us find the answer to our earlier question about the relationship between ages. Thus, we are ready to begin the process of statistical inference to draw conclusions about the relationship between two quantitative variables in a larger population, based on sample data about those variables.

Confidence interval for $\beta_1$

Example: For a population of students' parents, what does the age of the father tell us about the age of a mother? Specifically, if one father is a year older than another, how much older, if at all, do we expect the mother to be? The estimate $b_1$ for the unknown slope $\beta_1$ of the line that relates the variables MotherAge and FatherAge in the larger population is shown not only in the regression equation, but also as the coefficient of FatherAge in the second row of the output table.

    Predictor       Coef   SE Coef       T       P
    Constant      14.542     1.317   11.05   0.000
    FatherAge    0.66576   0.02571   25.89   0.000

It is reported to five decimal places as .66576, and its standard error .02571 appears in the next column.

A 95% confidence interval for $\beta_1$ is constructed in the usual way, as estimate $\pm$ margin of error. The estimate is of course $b_1$, and the margin of error is a multiple of the standard error $SE_{b_1}$, where the multiplier is the value of the relevant $t$ distribution that corresponds to a symmetric area of 95%. As established in the previous section, standardized sample slope follows the $t$ distribution with $n - 2$ degrees of freedom. The output shows our sample size to be 431, and so there are 429 degrees of freedom. With such a large sample, the $t$ multiplier is virtually identical to the $z$ multiplier for 95% confidence, which is approximately 2. Our 95% confidence interval for $\beta_1$ is

$.666 \pm 2(.02471) = .666 \pm .049 = (.617, .715)$

The fact that this interval contains only positive numbers supplies us with statistical
evidence of a positive relationship between fathers' and mothers' ages in the population. More specifically, for every additional year of FatherAge, we are 95% confident that the corresponding value of MotherAge is an additional .617 to .715 years.

[Figure: scatterplot of MotherAge versus FatherAge. The line for the population may have slope as high as .715 or as low as .617; the line for the sample has slope .666.]

You may perhaps wonder why MotherAge doesn't increase by a full year for every increase of one year in FatherAge: as a student's father gets older, doesn't his or her mother have to age at exactly the same rate? It is important to recognize that ages are not being recorded as a time series, year by year, for only one mother and father. Rather, we are thinking about an entire population of age pairs, from which, at one point in time, we extract a sample of 431 independent age pairs. If one of these fathers is older than another by one year, then the mother paired with him may be older than the other mother too, but not necessarily. On average, we expect her to be older than the other mother by about .666 years.

Independence of the observations from one another is an additional condition for our inference methods to yield accurate results, and the sampling process should always be considered in case there may be a violation of this condition.

Example: Next lecture, we will look at the relationship between male students' heights and weights. The data must consist of (height, weight) pairs obtained randomly and independently from a population of male students. Methods developed in this chapter would not apply if our data consisted of height and weight measurements for the same student, recorded each month over several years' time.

Motivated by the earlier example on the relationship between parents' ages, we now state our general confidence interval result.

95% Confidence Interval for $\beta_1$

An approximate 95% confidence interval for slope $\beta_1$ of the line that best fits the population of explanatory and
response values, based on a random sample with large size n, is

estimate $\pm$ margin of error $= b_1 \pm 2(SE_{b_1})$

For a small sample size n, the approximate 95% interval is

$b_1 \pm \text{multiplier}(SE_{b_1})$

where the multiplier is the value of the $t$ distribution for $n - 2$ degrees of freedom associated with 95% confidence (right-tail area under the curve is .025). This multiplier is greater than 2, but as long as there are at least six explanatory/response pairs in our sample, it will be no more than 3. This interval is only appropriate if
• the scatterplot appears linear
• the sample size is large enough to offset any non-normality in the response values
• spread of responses is fairly constant over the range of explanatory values
• explanatory/response pairs are independent of one another

Hypothesis test about $\beta_1$

Our first step in learning to perform inference about proportions in Chapter 10 was to set up a confidence interval. By checking if the interval contained a proposed value of population proportion, we were able to make a rather informal decision as to whether that value was plausible, based on whether or not the value was contained in the interval. In Chapter 11, we learned to carry out a formal test of hypotheses about unknown population proportion, following five basic steps. Similarly, we used the confidence interval in our example above to informally conclude that the value of $\beta_1$ is not zero. A more formal way to reach this conclusion is by carrying out a test of hypotheses.

As with our other hypothesis test procedures about the relationship between two variables, there are two formulations of the null and alternative hypotheses: one about a key parameter, the other about the variables and their relationship. When there are two quantitative variables of interest, the null hypothesis states that the slope $\beta_1$ of the least squares line for the population is zero. Equivalently, it claims that the variables are not related, because the equation $\mu_y = \beta_0 + \beta_1 x$ reduces to $\mu_y = \beta_0$ when $\beta_1$ is zero, and the mean population response does
not depend on the so-called explanatory variable $x$. The alternative may be one-sided or two-sided. The two-sided alternative $\beta_1 \ne 0$ is equivalent to the statement that the variables are related in the population. The one-sided alternatives $\beta_1 > 0$ or $\beta_1 < 0$ are more specific, in that they express a claim not only that the variables are related, but also with regards to the direction of the purported relationship. In order to determine which formulation is appropriate, the wording and background of a problem must be carefully considered.

Example: Is there statistical evidence of a relationship between FatherAge and MotherAge? We could equivalently pose the question as $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \ne 0$, where $\beta_1$ is the slope of the line that relates ages of fathers and mothers for the entire population of students. Because common sense would tell us to expect a positive relationship, we may go so far as to formulate the alternative as one-sided: $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 > 0$.

Example: Based on information from a sample of 4 states, can we conclude that for all states there is a negative relationship between voting Democratic and voting Republican? In this case we would write $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 < 0$.

Example: A website called ratemyprofessor.com reports students' ratings of their professors at universities around the country. These are unofficial, in that they are not monitored by the universities themselves. Besides listing average rating of the professors' teaching on a scale of 1 to 5, where 1 is the worst and 5 is the best, there is also a rating of how easy their courses are, where 1 is the hardest and 5 is the easiest. Is there a relationship between the rating of teaching and the rating of ease? Offhand, we may suspect that students would favor easy teachers, in which case the relationship would be positive. On the other hand, teachers who are more conscientious may maintain higher standards not just for their students but also for themselves. In this example, because the direction could really go either way, we should keep a more general
two-sided alternative and write $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$.

We have already established that the distribution of sample slope $b_1$, if certain conditions are met, is normal with mean $\beta_1$ and approximate standard deviation $SE_{b_1}$. Under the null hypothesis that $\beta_1 = 0$, the standardized test statistic $t = \frac{b_1 - 0}{SE_{b_1}}$ follows a $t$ distribution with $n - 2$ degrees of freedom. If the sample slope $b_1$ is relatively close to zero (taking sample size and spread into account), then the standardized test statistic $t$ is not especially large, and so the p-value is not small, and there is no compelling evidence of a nonzero population slope $\beta_1$. Thus, if $b_1$ isn't large enough, we cannot produce evidence that the two quantitative variables are related in the larger population. Conversely, if sample slope $b_1$ is relatively far from zero, then $t$ is large, the p-value is small, and we have statistical evidence that the population slope $\beta_1$ is not zero. In other words, a large $t$ results in a small p-value and a conclusion that the variables are related. If a one-sided alternative has been formulated, and the sample slope $b_1$ tends in the direction claimed by that alternative, then the p-value is the one-tailed probability of $t$ being as extreme as the one observed.

sample slope $b_1$ close to zero: $|t|$ is small, so the p-value is large
sample slope $b_1$ far from zero: $|t|$ is large, so the p-value is small

Example: Let's revisit the output for the regression of MotherAge on FatherAge, carrying out a five-step test of hypotheses. This will require us to focus our attention on the size of the $t$ statistic and p-value.

| Predictor | Coef | SE Coef | T | P |
|---|---|---|---|---|
| Constant | 14.542 | 1.317 | 11.05 | 0.000 |
| FatherAge | 0.66576 | 0.02571 | 25.89 | 0.000 |

1. The null hypothesis states that the slope $\beta_1$ of the line that relates MotherAge to FatherAge in the population is zero; alternatively, it may state that MotherAge and FatherAge are not related. Since common sense would suggest a positive relationship between the two variables, our alternative hypothesis would be that $\beta_1 > 0$; alternatively, it could make a more general claim that
MotherAge and FatherAge are related for the general population of parents' ages, that is, that $\beta_1 \neq 0$.

2. When we set up a confidence interval for unknown population slope $\beta_1$, we noted that $b_1$ and its standard error $SE_{b_1}$ are reported in the second row of the regression table. The first row concerns intercept $b_0$, which is not of particular interest to us since we tend not to perform inference about intercept $\beta_0$. Remember that it is the slope that provides key information about if and how the explanatory and response values are related. Thus, the standardized sample slope $t$ is reported as 25.89. Notice that the $t$ statistic is easily calculated from $b_1$ and $SE_{b_1}$:

$t = \frac{b_1 - 0}{SE_{b_1}} = \frac{0.66576 - 0}{0.02571} = 25.89$

Remember that 0 is subtracted from sample slope $b_1$ because, for random samples, $b_1$ is centered at population slope $\beta_1$, which is proposed to be 0 in the null hypothesis. For a large sample like this, our cutoff point for large values of $t$ is like that for $z$, namely 2. Obviously, $t = 25.89$ is extremely large compared to 2.

3. The p-value corresponding to our $t$ statistic is shown to be 0.000 (in the second row, not the first). Just as $t$ was extremely large, the p-value is extremely small.

4. The fact that the p-value is so small tells us that obtaining a sample slope as far from zero as 0.66576 would be extremely unlikely if population slope were zero, and so we conclude population slope is not zero. This p-value actually corresponds to a two-sided alternative; technically, if we suspected all along that the slope would be positive and formulated the alternative as $\beta_1 > 0$, the p-value should be half of the one shown in the output. This only serves to strengthen our conclusion that ages are related.

5. To summarize, we have strong statistical evidence that MotherAge and FatherAge have a positive relationship, not just in our sample but also in the larger population of students.

Motivated by the example above, we summarize the process of testing for a relationship between two quantitative variables by testing the null hypothesis that
slope of the regression line for the population of explanatory and response values is zero.

Hypothesis Test about $\beta_1$

Just as for any hypothesis procedure, there are five basic steps to test for a relationship between two quantitative variables in the population of interest, based on a random sample of size $n$.

1. Assuming the relationship (if it exists) between two quantitative variables to be linear rather than curved, we test the null hypothesis that the variables are not related, which is equivalent to the claim $H_0: \beta_1 = 0$, where $\beta_1$ is the slope of the population least squares regression line. The alternative may state more generally that the two variables are related, which is equivalent to the claim $H_a: \beta_1 \neq 0$, or a more specific one-sided alternative may be formulated as $H_a: \beta_1 < 0$ if we suspect in advance that the relationship is negative, or $H_a: \beta_1 > 0$ if we suspect the relationship is positive.

2. Software should be used to produce the standardized sample slope $t = \frac{b_1 - 0}{SE_{b_1}}$, which, when conditions (below) are met, follows a $t$ distribution with $n - 2$ degrees of freedom. This $t$ statistic is a standardized measure for how far sample slope $b_1$ is from zero.

3. The p-value to accompany the $t$ test statistic is the probability of a $t$ random variable being as extreme as the one observed. It is reported alongside $t$ as part of the regression output. A small p-value suggests that $t$ would be unusually extreme if the null hypothesis were true, that is, that the sample slope could be considered unusually steep if it were coming from a population where the explanatory and response variables were not related.

4. If the p-value is small, we reject the null hypothesis of no relationship, equivalent to rejecting the claim that slope $\beta_1$ for the population is zero. If the p-value is not small, we conclude that the null hypothesis may be true.

5. Conclusions should be stated in context: if the null hypothesis has been rejected, we conclude that there is statistical evidence of a relationship between the explanatory and response
variables. If it has been rejected against a one-sided alternative, we conclude there is evidence of a negative or of a positive relationship, depending on how the alternative has been expressed. If the null hypothesis has not been rejected, we conclude there is not enough statistical evidence to convince us of a relationship between the two quantitative variables.

Results of the above test are only valid if the following conditions are met:
- the scatterplot appears linear
- the sample size is large enough to offset any non-normality in the response values
- spread of responses is fairly constant over the range of explanatory values
- explanatory/response pairs are independent of one another

In the previous example, not only did we have strong evidence of a relationship by virtue of the p-value being close to zero, but we also could assert that the relationship was strong by virtue of the correlation $r$ being 0.78, which is pretty close to one. It is nevertheless possible to produce weak evidence of a strong relationship, or strong evidence of a weak relationship. These possibilities will be explored in the following examples. We will also consider an example where there is no statistical evidence of a relationship.

Example: While most voters in a presidential election vote for the democratic or republican candidate, other parties do account for a small percentage of the popular vote in each state. The table below looks at the relationship between percentages voting democratic and republican in the year 2000 for just a few states.

| State | Democratic | Republican |
|---|---|---|
| Alabama | 47.9 | |
| California | 53.4 | 41.7 |
| Ohio | 46.4 | 50.0 |
| Minnesota | 47.9 | 45.5 |

The points in the scatterplot below do appear to cluster around some straight line, rather than a curve. The line has a negative slope because when the percentage voting republican is low, the percentage voting democratic is high, and vice versa.

[Scatterplot: Democratic (vertical axis) versus Republican (horizontal axis) percentages for the sampled states]

When a regression is carried out, the correlation $r$ is
found to be quite close to $-1$ ($r = -0.922$), suggesting a strong negative relationship. On the other hand, the p-value (0.078) may not necessarily be considered small enough to provide statistical evidence of a relationship.

Pearson correlation of dem and rep = -0.922, P-Value = 0.078
Pearson correlation of Democratic and Republican = -0.910, P-Value = 0.090

Due to the small sample size of only 4, we do not have especially strong evidence of a relationship in the larger population of states, even though for the sample the relationship is apparently quite strong. In other words, we have weak evidence of a strong relationship between percentage voting republican and percentage voting democratic. A larger sample of states would have certainly supplied very strong evidence of such a relationship.

In the preceding example, we saw that although a linear relationship between two quantitative variables may be quite strong, with too small a sample we may only produce weak evidence of that relationship. In the next example, we see that with a large sample, we may produce very strong evidence of a rather weak relationship in the population.

Example: As a contrast to the rather strong relationship between MotherAge and FatherAge, we now look at the relatively weak relationship between MotherHt and FatherHt. A scatterplot for the latter is shown below.

[Scatterplot: MomHT versus DadHT]

There is apparently a slight tendency for relatively short fathers to be paired with relatively short mothers, and for relatively tall fathers to be paired with relatively tall mothers. Since height is such a minor factor when it comes to couples' compatibility, the relationship is naturally quite weak. According to the output below, the correlation is only $r = 0.225$. On the other hand, a test of whether the slope of the regression line could be zero for the general population of parents' heights produces a very large $t$ statistic (4.79) and a very small p-value (0.000).

Pearson correlation of MomHT and DadHT = 0.225
The regression equation is MomHT = 50.4 + 0.200 DadHT
431 cases used, 15 cases
contain missing values

| Predictor | Coef | SE Coef | T | P |
|---|---|---|---|---|
| Constant | 50.431 | 2.936 | 17.18 | 0.000 |
| DadHT | 0.20019 | 0.04178 | 4.79 | 0.000 |

S = 2.551, R-Sq = 5.1%, R-Sq(adj) = 4.9%

In this case, by virtue of a large sample (431 height pairs), we are able to produce very strong evidence of a relationship in the general population of parents' heights, but the relationship itself is rather weak.

Lecture 31: Other Interval Estimates in Regression

In the previous lecture, we learned two important regression inference procedures: testing for statistical evidence of a relationship between the two quantitative variables of interest, and estimating the slope of the line that relates those variables in the larger population. For practical purposes, two other types of estimation are quite common.

Example: Reassessment of property values in Allegheny County (Western Pennsylvania) in 2002 was extremely controversial, and some property owners believed the assessment was too high, resulting in higher taxes. Suppose a homeowner was told that his land (not including house) was reassessed at $40,000, and he wants to contest it as being unreasonably high.

As a first step, he could look at a sample of assessment values of other properties in his neighborhood. A sample of 29 land values in the neighborhood has mean $34,624 and standard deviation $17,494. A value of $40,000 at this point doesn't seem unusually high, at $(40000 - 34624)/17494 = 0.31$ standard deviations above the mean.

| Variable | N | Mean | Median | TrMean | StDev | SE Mean | Minimum | Maximum | Q1 | Q3 |
|---|---|---|---|---|---|---|---|---|---|---|
| LandValu | 29 | 34624 | 25600 | 34226 | 17494 | 3249 | 9000 | 71000 | 22200 | 49050 |

But the homeowner suspects his property, at 4,000 square feet, is smaller than average, and so he researches the sizes of those neighborhood properties.

| Variable | N | Mean | Median | TrMean | StDev | SE Mean | Minimum | Maximum | Q1 | Q3 |
|---|---|---|---|---|---|---|---|---|---|---|
| Size | 29 | 5619 | 4900 | 5544 | 2755 | 512 | 1425 | 11853 | 3671 | 7299 |

By now we have established that his property's assessed value ($40,000) is higher than average ($34,624), although its size (4,000 square feet) is smaller than average (5,619 square feet). This in itself
is not enough evidence to argue that the assessment is unfair: the homeowner needs to show that, in general, the relationship between size and value is such that $40,000 would be an unreasonably high value for a lot of size 4,000 square feet. First, let's look at a scatterplot of the 29 size and value pairs.

[Scatterplot: land value (vertical axis, 0 to 70000) versus Size (horizontal axis, 0 to 12000)]

Certainly the relationship appears to be positive, linear, and quite strong, suggesting that a smaller-than-average lot should be given a smaller-than-average assessment. This is confirmed by the output below, which shows the correlation to be quite close to one. Furthermore, the p-value is close to zero, providing evidence that this relationship should hold in the larger population from which the sample of land sizes and values was obtained.

Pearson correlation of Size and LandValue = 0.927, P-Value = 0.000

An option when using software to perform a regression is to request a prediction interval for a new observation, with results shown below for an observed size of 4,000 square feet.

Predicted Values for New Observations

| New Obs | Fit | SE Fit | 95.0% CI | 95.0% PI |
|---|---|---|---|---|
| 1 | 25094 | 1446 | (22127, 28060) | (11066, 39121) |

Values of Predictors for New Observations: New Obs 1, Size = 4000

The output includes two very different intervals: one labeled "95.0% CI" that ranges roughly from 22,000 to 28,000, and one labeled "95.0% PI" that ranges roughly from 11,000 to 39,000. Both intervals are centered at 25,094, the predicted value for a lot of size 4,000 square feet. The first of the intervals is not especially relevant to the homeowner, because it presents a set of plausible values for the mean value of all 4,000-square-foot lots in the neighborhood. The second interval reports a 95% prediction interval for the value of one individual lot whose size is 4,000 square feet. Since the assessed value of $40,000 falls above the interval (11066, 39121), the homeowner does have statistical evidence that the assessment is unusually high
given the size of his lot.

In order to put new inference skills in perspective, the following example includes a variety of estimates: estimating an individual or mean value of a quantitative variable, estimating an individual or mean response for a given explanatory value, and estimating an individual or mean response for a different explanatory value. We present a series of questions, all alike in that they seek estimates concerning male weight, but all different in terms of whether an estimate is sought for an individual or a mean, and also in terms of what height information, if any, is provided.

Example:
1. Based on a sample of male weights, how do we estimate weight of an individual male?
2. Based on a sample of male weights, how do we estimate mean weight of all males?
3. Based on a sample of male heights and weights, how do we estimate weight of an individual 71-inch-tall male?
4. Based on a sample of male heights and weights, how do we estimate mean weight of all 71-inch-tall males?
5. Based on a sample of male heights and weights, how do we estimate weight of an individual 76-inch-tall male?
6. Based on a sample of male heights and weights, how do we estimate mean weight of all 76-inch-tall males?

Estimating an individual weight with no height information

In Chapter 2, we learned that if a distribution is roughly normal and we know its mean and standard deviation, we can report a range for most of its values using the 68-95-99.7 Rule, which is based on a normal distribution. For example, the output below shows male weights to have mean 170.83 and standard deviation 33.06.

| Variable | N | N* | Mean | Median | TrMean | StDev | SE Mean | Minimum | Maximum | Q1 | Q3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| WTmale | 162 | 2 | 170.83 | 165.00 | 168.24 | 33.06 | 2.60 | 115.00 | 315.00 | 150.00 | 185.00 |

If the shape of the distribution of weights is approximately normal, then about 95% of the time any one individual weight should fall within 2 standard deviations of the mean: from $170.83 - 2(33.06)$ to $170.83 + 2(33.06)$, that is, in the interval (104.71, 236.95). The accuracy of this interval is not
necessarily to be trusted, because the shape of the distribution of weights, shown in the histogram below, is not entirely normal, but is rather right-skewed.

[Histogram: frequency of WTmale]

Although weights are often purported to be normal for specific age and gender groups, the reality is that most populations include individuals with weights that are unusually high, to the point where they cannot be balanced out by unusually low weights. In our sample, for instance, the highest weight (315) is $(315 - 170.83)/33.06 = 4.4$ standard deviations above the mean; a man would have to weigh just 25 pounds to be this many standard deviations below the mean. In fact, the lowest weight (115) has a z-score of $(115 - 170.83)/33.06 = -1.7$, so it is only 1.7 standard deviations below the mean.

Estimating mean weight with no height information

In Chapter 12, we learned to perform inference about the mean of a single quantitative variable. These methods can be used to set up a confidence interval for the mean weight of all male college students, based on a sample of weights.

One-Sample T: WTmale

| Variable | N | Mean | StDev | SE Mean | 95.0% CI |
|---|---|---|---|---|---|
| WTmale | 162 | 170.83 | 33.06 | 2.60 | (165.70, 175.96) |

Thus, a 95% confidence interval for the mean weight of all male college students is (165.70, 175.96). Notice how much narrower this interval is than the interval that should contain an individual weight. The interval for individuals ranged all the way from about 105 to 237 pounds, with a width of 132 pounds. In contrast, the interval for mean weight ranged only from about 166 to 176 pounds, with a width of only 10 pounds. It is much harder to pinpoint an individual as opposed to a mean value. Remember that the spread of all values is estimated with $s$, while the spread of sample mean is estimated with $s/\sqrt{n}$. Whereas the non-normality of the distribution of weights presented a problem in setting up a range for 95% of individual weights, by virtue of the Central Limit Theorem, the large sample size guarantees sample mean weight to be approximately normal, and so this interval should
be quite accurate.

Including Height Information

In fact, the confidence interval above is of limited usefulness, because instead of asking "What is a typical weight for any male college student?" we would be more inclined to wonder "What is a typical weight for a male college student who is $x$ inches tall?" A range of plausible values for the mean weight of all male college students is no longer appropriate if we are specifically interested in what is plausible for the mean of, say, all 71-inch-tall male college students. In order to really do justice to the variable weight, the variable height should be taken into account. Inference for regression can be used to produce a range of plausible values for the mean weight of all male college students of a given height. Along the way, we will also take a look at the range of plausible values for the weight of an individual male college student of a given height, in order to contrast such intervals.

We have already examined the distribution of weights alone. Now let's examine the heights of our sample of male college students, and look at the relationship between height and weight. Then we'll produce a 95% prediction interval for the weight of an individual 71-inch-tall male, along with a 95% confidence interval for the mean weight of all 71-inch-tall males. These in turn will be compared to 95% prediction and confidence intervals for a given height of 76 inches. In the end, we will contrast these to the intervals already discussed, which do not take height into account.

| Variable | N | N* | Mean | Median | TrMean | StDev | SE Mean | Minimum | Maximum | Q1 | Q3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HTmale | 163 | 1 | 70.626 | 71.000 | 70.626 | 2.940 | 0.230 | 63.000 | 79.000 | 68.000 | 73.000 |

[Histogram: frequency of HTmale]

Based on the descriptive statistics and histograms, we can say that sampled heights appear normally distributed (symmetric, bulging in the middle and tapering at the ends) with mean 70.626 and standard deviation 2.940.

The separate summaries and histograms produced so far do not supply any information about
the relationship between height and weight; this requires a regression procedure, starting off with a scatterplot for display.

[Scatterplot: WTmale (vertical axis, 100 to 300) versus HTmale (horizontal axis, 65 to 80)]

The scatterplot shows a moderately strong positive relationship between male heights and weights. The right skewness that we saw in the histogram of weights is seen in the scatterplot as a looser scattering of points in the higher weight ranges. Fortunately, the sample size of 163 is large enough to offset this non-normality. Next, we look at regression output.

The regression equation is WTmale = -188 + 5.08 HTmale
162 cases used, 2 cases contain missing values

| Predictor | Coef | SE Coef | T | P |
|---|---|---|---|---|
| Constant | -187.55 | 56.12 | -3.34 | 0.001 |
| HTmale | 5.0759 | 0.7942 | 6.39 | 0.000 |

S = 29.60, R-Sq = 20.3%, R-Sq(adj) = 19.8%

The fact that the correlation $r$ is the positive square root of 0.203, or 0.45, tells us that the relationship is of moderate strength for the sample of height/weight values. Height does tell us something about weight, but its prediction power is far from perfect. The fact that $p = 0.000$ tells us we have very strong evidence that a relationship holds in the larger population from which the sample was taken. And the fact that the slope is 5.08 tells us that if one male college student is 1 inch taller than another, his weight should be about 5 pounds more.

Estimating individual weight for a given height of 71 inches

Output for a request of confidence and prediction intervals for a height of 71 inches will help us estimate the weight of a particular male student who is 71 inches tall.

Predicted Values for New Observations

| New Obs | Fit | SE Fit | 95.0% CI | 95.0% PI |
|---|---|---|---|---|
| 1 | 172.83 | 2.35 | (168.20, 177.47) | (114.20, 231.47) |

Values of Predictors for New Observations: New Obs 1, HTmale = 71.0

When we only considered mean and standard deviation for weights, with no additional information provided by heights, we could say that if the distribution were normal, then 95% of the time any individual male weight should fall within 2 standard deviations of the mean,
in the interval (104.71, 236.95). With height taken into account, the prediction interval ("95.0% PI") reported in the regression prediction output tells us that 95% of the time the weight for a 71-inch-tall individual male should fall in the interval (114.20, 231.47). This interval is about 15 pounds narrower, and therefore more precise, than the above interval that did not utilize information about height.

Estimating mean weight for a given height of 71 inches

If our goal is to produce a range of plausible values for the mean weight of all 71-inch-tall male college students, it is the confidence interval labeled "95.0% CI" that is relevant. We are 95% confident that the mean weight of all 71-inch-tall male college students is somewhere between 168.20 and 177.47 pounds. Once again, we see a dramatic difference between the extent to which we can pinpoint an individual (interval width $231.47 - 114.20 = 117.27$) and a mean for all individuals (interval width $177.47 - 168.20 = 9.27$).

Since heights and weights have a positive relationship, we expect that weight estimates for taller men should be higher.

Estimating individual weight for a given height of 76 inches

According to the output below, if an individual male is 76 inches tall, we predict his weight to be somewhere between 138.97 and 257.45 pounds. Since 76 is 5 inches taller than 71, and since the slope $b_1 = 5.08$ tells us that each additional inch in height is accompanied by about 5 more pounds in weight, this entire interval is about 25 pounds higher than the interval of predicted weight for a height of 71 inches. The width of this interval is $257.45 - 138.97 = 118.48$, just slightly wider than the interval for an individual 71-inch-tall male (interval width 117.27 pounds).

Predicted Values for New Observations

| New Obs | Fit | SE Fit | 95.0% CI | 95.0% PI |
|---|---|---|---|---|
| 1 | 198.21 | 4.88 | (188.58, 207.84) | (138.97, 257.45) |

Values of Predictors for New Observations: New Obs 1, HTmale = 76.0

Estimating mean weight for a given height of 76 inches

The output also includes a 95% confidence interval for the mean weight of all 76-inch-tall men. The width of
this interval, $207.84 - 188.58 = 19.26$, is more than twice the width of the 95% confidence interval for the mean weight of all 71-inch-tall men (9.27). In the next section, we will see that this difference is due to the fact that 76 is much further from the mean height than 71 is. For now, we summarize our interval estimates with the display below.

[Figure: 95% prediction intervals for an individual and 95% confidence intervals for the mean, plotted along a weight axis from 100 to 260, for three cases: no height information used (intervals centered at the sample mean weight, 170.83); height = 71 inches (intervals centered at the weight predicted for height 71); height = 76 inches (intervals centered at the weight predicted for height 76, 198.21)]

Especially in the case of prediction intervals, if there is a substantial relationship between two quantitative variables, we can produce a narrower interval if we include information in the form of a given explanatory value. Confidence intervals for means will be considerably narrower than prediction intervals for individuals. In the next section, we will see that sample size plays an important role in the width of the confidence interval. Naturally enough, if the relationship is positive, then for higher values of the explanatory variable, both confidence and prediction intervals are centered at a higher response.

Role of $s$ in Confidence and Prediction Intervals

As is the case for any interval estimates, our confidence and prediction intervals are of the form

estimate $\pm$ margin of error = estimate $\pm$ multiplier $\times$ standard error

No matter if we are constructing a prediction interval for an individual response or a confidence interval for the mean response to a given explanatory value, the estimate at the center of our interval is the regression line's predicted response for that explanatory value.

Example: The regression line for estimating male weight from height is WTmale = -188 + 5.08 HTmale. The
confidence and prediction intervals for male weight when height is 71 are both centered at the estimate $-188 + 5.08(71) = 172.68$. The confidence and prediction intervals for male weight when height is 76 are both centered at the estimate $-188 + 5.08(76) = 198.08$.

Both confidence and prediction intervals are centered at the predicted response $b_0 + b_1 x$, but the confidence interval for mean response is narrower.

When sample size is large, the prediction interval extends roughly $2s$ on either side of the predicted response. If there were no relationship between explanatory and response values, this interval would be no different from the interval that extends two ordinary standard deviations in $y$ ($s_y$) on either side of the mean response. If there is a strong relationship between explanatory and response values, this interval is noticeably more precise than the interval obtained without taking explanatory value into account.

Example: For the regression of male weight on height, based on a large sample of 162 height/weight pairs, the regression output showed $s = 29.6$. The prediction interval should have a margin of error equal to roughly twice this, and so its entire width should be about four times 30, or 120.

| New Obs | Fit | SE Fit | 95.0% CI | 95.0% PI |
|---|---|---|---|---|
| 1 | 172.83 | 2.35 | (168.20, 177.47) | (114.20, 231.47) |

Values of Predictors for New Observations: New Obs 1, HTmale = 71.0

| New Obs | Fit | SE Fit | 95.0% CI | 95.0% PI |
|---|---|---|---|---|
| 1 | 198.21 | 4.88 | (188.58, 207.84) | (138.97, 257.45) |

Values of Predictors for New Observations: New Obs 1, HTmale = 76.0

In fact, the width of the prediction interval for weight when height equals 71 is $231.47 - 114.20 = 117.27$, and the width of the prediction interval for weight when height equals 76 is $257.45 - 138.97 = 118.48$. Both of these are quite close to our ad hoc calculation of 120.

When sample size is large and a confidence interval for mean response is desired for an explanatory value that is close to the mean $\bar{x}$, this interval extends roughly $2s/\sqrt{n}$ on either side of the predicted response.

Example: Since the mean male height is 70.626, as shown in our summary output, a height of 71 is close to
the mean, and so for our sample of size 162 the standard error should be roughly $s/\sqrt{n} = 29.6/\sqrt{162} \approx 2.3$. If our confidence interval for mean weight extends roughly 2 standard errors on either side of the predicted weight 172.83, its width should be about $4(2.3) = 9.2$. In fact, the 95% confidence interval has width $177.47 - 168.20 = 9.27$.

| Variable | N | N* | Mean | Median | TrMean | StDev |
|---|---|---|---|---|---|---|
| HTmale | 163 | 1 | 70.626 | 71.000 | 70.626 | 2.940 |
| WTmale | 162 | 2 | 170.83 | 165.00 | 168.24 | 33.06 |

| New Obs | Fit | SE Fit | 95.0% CI | 95.0% PI |
|---|---|---|---|---|
| 1 | 172.83 | 2.35 | (168.20, 177.47) | (114.20, 231.47) |

Values of Predictors for New Observations: New Obs 1, HTmale = 71.0

On the other hand, when predicting the mean response to an explanatory value far from the mean of all explanatory values, the standard error is considerably larger than $s/\sqrt{n}$.

Example: A height of 76 inches is rather far from the mean height of 70.626. The confidence interval for mean weight of all 76-inch-tall men has a width of $207.84 - 188.58 = 19.26$, which is more than eight times the standard error 2.3, instead of just four times, as was the case for estimating mean weight when height was 71 (close to the mean of all heights). The illustration shows that whereas prediction interval width remains fairly uniform throughout the range of explanatory values, the confidence interval band widens considerably for explanatory values far below or above average.

| New Obs | Fit | SE Fit | 95.0% CI | 95.0% PI |
|---|---|---|---|---|
| 1 | 198.21 | 4.88 | (188.58, 207.84) | (138.97, 257.45) |

Values of Predictors for New Observations: New Obs 1, HTmale = 76.0

[Figure: regression line of WTmale on HTmale with 95% PI and 95% CI bands. Annotations: the margin of error in the PI for an individual is approximately $2s$; the margin of error in the CI for the mean is approximately $2s/\sqrt{n}$ for heights near the mean, and more than $2s/\sqrt{n}$ for heights far from the mean]

These rough estimates are presented here merely as a reference point; in practice, the precise prediction interval and confidence interval should be found using software. Sample size plays its usual role, in that smaller samples result in wider intervals.

Exercise: Find two quantitative variables from our survey, summarize
their relationship as in Chapter 5, and then test $H_0: \beta_1 = 0$. State your conclusions in terms of the variables of interest.

(C) 2007 Nancy Pfenning, Elementary Statistics: Looking at the Big Picture

Lecture 32: Two Categorical Variables: Chi-Square
- Formulating Hypotheses to Test Relationship
- Test based on Proportions or on Counts
- Chi-square Test
- Confidence Intervals

Looking Back: Review
4 Stages of Statistics
- Data Production (discussed in Lectures 1-4)
- Displaying and Summarizing (Lectures 5-12)
- Probability (discussed in Lectures 13-20)
- Statistical Inference
  - 1 categorical (discussed in Lectures 21-23)
  - 1 quantitative (discussed in Lectures 24-27)
  - cat and quan: paired, 2-sample, several-sample (Lectures 28-31)
  - 2 categorical
  - 2 quantitative

Inference for Relationship (Review)
- $H_0$ and $H_a$ about variables: not related or related
  - Applies to all three: C→Q, C→C, Q→Q
- $H_0$ and $H_a$ about parameters: equality or not
  - C→Q: pop means equal
  - C→C: pop proportions equal
  - Q→Q: pop slope equals zero

Example: 2 Categorical Variables: Hypotheses
- Background: We are interested in whether or not smoking plays a role in alcoholism.
- Question: How would $H_0$ and $H_a$ be written in terms of variables? In terms of parameters?
- Response:
  - in terms of variables: $H_0$: smoking and alcoholism not related; $H_a$: smoking and alcoholism related
  - in terms of parameters: $H_0$: pop proportions alcoholic equal for smokers, nonsmokers; $H_a$: pop proportions alcoholic not equal for smokers, nonsmokers

The word "not" appears in $H_0$ about variables, in $H_a$ about parameters.

Example: Summarizing with Proportions
- Background: Research question: Does smoking play a role in alcoholism?
- Question: What statistics from this table should we examine to answer the research question?

| | Alcoholic | Not Alcoholic | Total |
|---|---|---|---|
| Smoker | 30 | 200 | 230 |
| Nonsmoker | 10 | 760 | 770 |
| Total | 40 | 960 | 1000 |

- Response: Compare proportions (response) for explanatory groups: $\hat{p}_1 = 30/230 = 0.130$ for smokers and $\hat{p}_2 = 10/770 = 0.013$ for nonsmokers.

Example: Test Statistic for Proportions
- Background: One approach to the question of whether smoking and alcoholism are related is to compare proportions, $\hat{p}_1 = 30/230 = 0.130$ and $\hat{p}_2 = 10/770 = 0.013$.
- Question: What would be the next step, if we've summarized the situation with the difference between sample proportions $0.130 - 0.013$?
- Response: Standardize the difference between sample proportions $0.130 - 0.013$. In fact, the standardized difference is normal ($z$).

Advantage of z Inference for 2 Proportions
- Can test against a one-sided alternative.

Disadvantage of z Inference for 2 Proportions
- 2-by-2 table: comparing proportions is straightforward.
- Larger table: comparing proportions is complicated; can't just standardize one difference $\hat{p}_1 - \hat{p}_2$.

Another Comparison in Considering Categorical Relationships (Review)
- Instead of considering how different the proportions are in a two-way table, we may consider how different the counts are from what we'd expect if the explanatory and response variables were in fact unrelated.
- Compared observed and expected counts in the wasp study:

| Obs | A | NA | T |
|---|---|---|---|
| B | 16 | 15 | 31 |
| U | 24 | 7 | 31 |
| T | 40 | 22 | 62 |

| Exp | A | NA | T |
|---|---|---|---|
| B | 20 | 11 | 31 |
| U | 20 | 11 | 31 |
| T | 40 | 22 | 62 |

Inference Based on Counts
To test hypotheses about a relationship in an r-by-c table, compare the counts observed to the counts expected if $H_0$ (equal proportions in the response of interest) were true.

Example: Table of Expected Counts
- Background: Data on smoking and alcoholism.
- Question: What counts are expected if $H_0$ is true?
- Response: Overall proportion alcoholic is $40/1000 = 0.04$. If proportions alcoholic were the same for smokers (S) and nonsmokers (NS), expect
  - $(40/1000)(230) = 9.2$ smokers to be alcoholic
  - $(40/1000)(770) = 30.8$ nonsmokers to be alcoholic; also
  - $(960/1000)(230) = 220.8$ smokers not alcoholic
  - $(960/1000)(770) = 739.2$ nonsmokers not alcoholic
- Note: Each expected count is Expected = (Column total × Row total) / Table total. For example, $40(230)/1000 = 9.2$ smokers are expected to be alcoholic.

Chi-Square Statistic
- Components to compare observed and expected counts, one table cell at a time: component $= \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$. Components are individual standardized squared differences.
- Chi-square test statistic $\chi^2$ combines all components by summing them up: chi-square $= \text{sum of } \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$. Chi-square is the sum of standardized squared differences.

Example: Chi-Square Statistic
- Background: Observed and Expected Tables.
- Question: What is chi-square $= \text{sum of } \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$?
- Response: Find chi-square as the sum of $\frac{(\text{observed} - \text{expected})^2}{\text{expected}}$ over the four cells.

Example: Assessing Chi-Square Statistic
- Background: We found chi-square = 64.
- Question: Is the chi-square statistic (64) large?

Chi-Square Distribution
$\chi^2 = \text{sum of } \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$ follows a predictable pattern (assuming $H_0$ is true) known as the chi-square distribution with df $= (r - 1) \times (c - 1)$
- $r$ = number of rows (possible explanatory values)
- $c$ = number of columns (possible response values)

Chi-Square Density Curve
For chi-square with 1 df, $P(\chi^2 \geq 3.84) = 0.05$. If $\chi^2$ is more than 3.84, the P-value is less than 0.05.

[Figure: chi-square density curve with 1 df; right-tail area = 0.05 beyond 3.84]

Properties of chi-square:
- Nonnegative (based on squares)
- Mean = df (= 1 for smallest [2×2] table)
- Spread
depends on df I Meandfl for smallest 2x2 table I Sk ewe d right I Spread depends on df I Skewed right e 2mm Nancy Pfenning Elementary Statistics Luuking althe Big Picture L32 32 e 2mm Nancy Pfenning Elementary Statistics Luuking althe Big Picture L32 33 Elementary Statistics Looking at the Big Picture 6 C 2007 Nancy Pfenning Example Assessing Chi Square Continued El Background In testing for relationship between smoking and alcoholism in 2x2 table found V2 64 El Question Is there evidence of a relationship in general between smoking and alcoholism not just in the samp e 2mm mnwmm Eiemenuwsuusucs mm tithe swim 11234 Example Assessing C hi Square Continued El Background In testing for relationship between smoking and alcoholism in 2x2 table found X2 64 El Response For df2lx2ll chisquare considered large if greater than 384 9chi 9 square64 large Pvalue small evidence of a relationship between smoking and alcoholism 2mm mm mm ammw mm mm um aw mm m as Inference for 2 Categorical Variables z or X2 For 2x2 table 22 X2 I 2 statistic compan39ng proportions9 combined tail probability005 for I chisquare statistic compan39ng counts9 lighttail prob005 for X2 39 384 2mm mnwmm Eiemenuwsuusucs mm tithe swim 11237 Example Relating ChiSquare amp z El Background We found chisquare 64 for the 2by2 table relating smoking and alcoholism El Question What would be the 2 statistic for a test comparing proportions alcoholic for smokers vs nonsmokers 2mm mm mm ammw mm mm um aw mm m as Elementary Statistics Looking at the Big Picture C 2007 Nancy Pfenning E Example Relating ChiSquare amp Z Assessing Size of Test Statistics Summary ll Background We found chisquare 64 for the When test statistic is large 2by2 table relating smoking and alcoholism I Z greater than 196 about 2 El Res onse p 1 depends on df greater than about 2 or 3 I I F depends on DFG DFE I X2 depends on djErlxc1 greater than 384 about 4 if d l e ZEIE7 Nancy F39fErlrllrlg Elementary Stallstlcs Luuklng allne El lg F39lcture L32 
4n e ZEIE7 Nancy F39fErlrllrlg Elementary Stallstlcs Luuklng allne El lg F39lcture L32 4i i l 39 7 ExplanatoryResponse 2 Categorical Variables Example summaries mpaaed by R0195 I Roles impact what summaries to report El Background Compared proportions alcoholic 2 resp for smokers and nonsmokers expl I Rl notim 11 rP l 0 es do pact X stat st 0 O va ue Alcoholic Not Alcoholic Total Smoker so 200 230 731 0130 Non smoker 10 760 770 172 0013 Total 40 960 1000 El Question What summaries would be appropriate if alcoholism is explanatory variable e ZEIE7 Nancy F39fErlrllrlg Elementary Stallstlcs Luuklng allne El lg F39lcture L32 42 e ZEIE7 Nancy F39fErlrllrlg Elementary Stallstlcs Luuklng allne Big Picture L32 43 Elementary Statistics Looking at the Big Picture 8 C 2007 Nancy Pfenning Example Summaries Impacted by Roles Example Summaries Impacted by Roles I Background Compared proportions alcoholic I Background Compared proportions alcoholic resp for smokers and nonsmokers expl T6513 for smOk rS and non1 0kers expll Alcoholic Not Alcoholic Total A39C h quotc NOtA39COhO m Total Smoker 30 200 230 A m Smoker 30 200 230 11 238 0130 Nonsmoker 10 760 770 Nonsmoker 10 760 770 p2 W 0013 Total 40 960 1000 Total 40 960 1000 II Note we can summarize by saying I alcoholics are 3 to 4 times as likely to be II Response Compare proportions resp smokers for CXPD I Smokers are 10 times as likely to be alcoholics c 2mm Nancy Pfenning Elementary Statistics Looking attne Big Picture L32 45 c 2mm Nancy Pfenning Elementary Statistics Looking attne Big Picture L32 47 Guidelines for Use of ChiSquare Procedure Rule of Thumb for Sample Size in ChiSquare I Need random samples taken independently from I Sample sizes must be large enough to offset non several populations normality of distributions I Confounding variables Should be separated out Require expected counts all at least 5 in 2x2 table I Sample sizes must be large enough to offset non Requlremem adjusted for larger tables normality of 
distributions I N e e d populations at least 10 times sample Sizes Looking Back Chisquare statistic only follows chi square distribution individual counts vary normally Our requirement is extension of requirement for single categorical variables TIP 2 107 n1 i P 2 10 with 0 replaced by 5 because of summing several components C 2mm Nancy Pfenning Elementary Statistics Looking attne Big Picture L32 43 C 2mm Nancy Pfenning Elementary Statistics Looking attne Big Picture L32 4a Elementary Statistics Looking at the Big Picture 9 C 2007 Nancy Pfenning Example Role of Sample Size Example Role of Sample Size I Background Suppose counts in smoking and I Background Suppose counts in smoking and alcohol alcohol twoway table were 1 10th the originals twoway table were 1 10th the originals II Question Find chisquare what do we conclude II Response Observed counts 1 10th 9 expected counts 1 10th 9chisquare 1 10th instead of 64 However the statistic does not follow X2istribution because expected counts 092 2208 308 7392 are not all at least 5 individual dists are not normal C 2mm Nancy Ptenning Eiernentary Statistics Looking attne Big Picture L32 5n C 2mm Nancy Ptenning Eiernentary Statistics Looking attne Big Picture L32 52 1 739quot Confidence Intervals for 2 Categorical Variables Example C0n dence Intervalst 2 13701907170175 Evidence of relationship to what extent does I Background explanatory Variable affect response I Nonsmokers 95 CI for pop prop alcoholic 00050021 Focus on proportions 2 approaches I Smokers 95 CI for pop prop alcoholic 009 017 I Compare con dence intervals for population I Question What do the intervals suggest about proportion in response of interest one interval relationship between smokmg and alcoholism for each explanatory group I Set up con dence interval for difference between population proportions in response of interest 1St group minus 2nd group C 2mm Nancy Ptenning Eiernentary Statistics Looking attne Big Picture L32 53 C 2mm Nancy Ptenning 
Eiernentary Statistics Looking attne Big Picture L32 54 Elementary Statistics Looking at the Big Picture 10 C 2007 Nancy Pfenning Example Con dence Intervals for 2 Proportions Example Difference between 2 Proportions CI El Background El Background 95 CI for difference between I Nonsmokers 95 CI for pop prop alcoholic 00050021 population proportions alcoholic smokers minus I Smokers 95 CI for pop prop alcoholic 009 017 nonsmokers is 0088 0146 El Response Overlap 7 9 Relationship between D Question What does the interval suggest about smoking and alcoholism 7 likely to be relationship between smoking and alcoholism alcoholic if a smoker 2mm NamHerman EiemenDHSlahshcs mm tithe BlvVlduie mas Emo7Nanw mm Eiemenlawstalsllcs minimums mm mm Example Difference between 2Proportz39ons CI Leotul e summary Inference for Cat Cat Chi Square El Background 95 CI for difference between H l t f M t population proportions alcoholic smokers minus D ypo 6565 m ems 0 I as or Famine ers k 0 088 0 146 El Inference based on proportions or counts nonsmo ers 1s D Chrsquare test Table of expected counts El Response Entire interval above zero suggests smokers significantly more likely to be alcoholic9there a relationship Chisquare statistic chisquare distribution Relating z and chisquare for 2x2 table Relative size of chisquare statistic Explanatoryresponse roles in chisquare test Guidelines for use of chisquare Role of sample size Con dence intervals for 2 categorical Variables DUE 2mm NamHerman EiemenDHSlahshcs mm tithe BlvVlduie mas Emo7Nanw mm Eiemenlawstalsllcs lnakmva healv mats L19 an Elementary Statistics Looking at the Big Picture 11 Lecture 18 Nancy Pfenning Stats 1000 Chapter 9 Means and Proportions as Random Variables Recall in Section 13 we stated that the normal curve is idealized its curve depicts an idealized histogram from a population with in nite possible values all falling into a precise pattern We call its mean M and its standard deviation 7 Example Height a quantitative 
variable, of college-aged women in the US is normal with $\mu = 65$, $\sigma = 2.7$. Of course this is idealized: we haven't measured all of them.

On the other hand, a set of actual observations from a variable $x$ has a mean $\bar{x}$ and a standard deviation $s$.

Example: Heights of women in this class have mean $\bar{x}$ = ____, standard deviation $s$ = ____.

Analogously for categorical variables:

Example: The proportion of women in the US is $p = .5$.

Example: The proportion of women in this class is $\hat{p}$ = ____.

In the first and third examples, $\mu$ and $\sigma$, $p$ are parameters: numbers which describe the entire population. In the second and fourth examples, $\bar{x}$ and $s$, $\hat{p}$ are statistics: numbers computed or measured from sample data.

Example: Identify the sample, the population, the statistic, and the parameter of interest in each of the following.

1. A survey is carried out at a university to estimate the proportion of undergraduates living at home during the current term.
   - sample: the undergrads surveyed
   - population: all undergrads at that university
   - statistic: proportion of sampled undergrads living at home
   - parameter: proportion of all undergrads at that university living at home
2. In 1988, investigators chose 400 teachers at random from the National Science Teachers Association list and polled them as to whether or not they believed in the biblical creation. Of 200 respondents, 30% did believe.
   - sample: 200 respondents
   - population: National Science Teachers Association members
   - statistic: 30%, the proportion of believers in the sample
   - parameter: unknown proportion of all NSTA members believing in biblical creation
3. A survey of 1000 households in a certain city found their mean household size to be approximately 3.1 persons.
   - sample: 1000 surveyed households
   - population: all households in that city
   - statistic: 3.1, the mean size of sampled households
   - parameter: unknown mean household size in the city
4. A balanced coin is flipped 100 times and the percentage of heads is 47%.
   - sample: the 100 flips
   - population: all coin flips
   - statistic: 47%
   - parameter: 50%, the percentage of all coin flips that would result in heads

Ultimately we will
measure statistics and use them to draw conclusions about unknown parameters (statistical inference, or reasoning backward). First we must discover, for a given parameter, how the accompanying statistic tends to behave (reasoning forward), which is accomplished through use of the laws of probability. This forward reasoning process is in a way impractical, because in real life parameters are usually unknown and cannot be given. But we need to learn how to reason forward before we can learn to reason backward.

In another 100 coin flips, the percentage of heads will easily differ from 47%; another 1000 households wouldn't necessarily have a mean size of 3.1 persons. Sampling variability is a fact of life. Although the outcome for any one sample is unknown, it is also a fact that regular and predictable patterns emerge in the long run. We will get a feel for these patterns by examining the sampling distributions of means and proportions.

In practice, just one sample is taken from a population of categorical values, and the statistic $\hat{p}$ (sample proportion) is measured, one time only. In theory, we may consider values of $\hat{p}$ for repeated samples, in order to get an idea of how sample proportion, as a variable, behaves. For samples taken at random, sample proportion $\hat{p}$ is a random variable. To get an idea of how such a random variable behaves, we consider its sampling distribution: the distribution of values taken by the statistic in all possible samples of the same size from the same population.

Sampling Distribution of Sample Proportion

Example: Suppose the population proportion of blue M&M's in a large bowl is $p = .17$. What kind of values would sample proportion $\hat{p}$ of blue M&M's take for repeated samples of (1) $n = 25$ (a teaspoon)? (2) $n = 75$ (a Tablespoon)?

1. Behavior of sample proportion for samples of size 25 taken from a population whose proportion in the category of interest (blue) is .17:
   (a) center: Some sample proportions will be less than .17, some more, but they'll tend to go below .17 just as much as they go above: the
mean of sample proportion $\hat{p}$ should equal population proportion $p = .17$.
   (b) spread: How much the sample proportions vary depends on the size of the sample. If we'd only taken samples of size 5, the sample proportion of blues could vary all the way from 0/5 = 0 to 3/5 = .60. For samples of size 25, sample proportions would tend not to vary this much.
   (c) shape: The most common sample proportion in the long run will be about .17, with proportions below and above .17 becoming less and less likely: we'd expect a single-peaked, symmetric bell shape with tapering ends. In other words, it should follow the normal curve.
2. Behavior of sample proportion $\hat{p}$ for samples of size 75 taken from a population whose proportion in the category of interest (blue) is $p = .17$: This distribution should also be centered at .17. There should be less spread than for samples of 25. Once again, the shape should be normal.

The laws of probability will confirm that what we expect to see in practice should also hold in theory. Under the right circumstances, statistical theory dictates the occurrence of precisely the same phenomena that can be observed in practice.

First of all, our theory assumes a binomial model: in order for observations to be approximately independent, the sample size must not be too large relative to population size. Thus, we need a population at least ten times the sample size.

Next, in order for the Central Limit Theorem to apply, the sample size $n$ must be large enough relative to population shape, which is determined by the value of $p$: $p = .5$ means the population is symmetric and a smaller sample should be adequate; $p$ closer to 0 or 1 means the population is skewed left or right and a larger sample is needed. We will require that $np \ge 10$ and $n(1 - p) \ge 10$. Note: this is identical to the requirement for use of a normal approximation to probabilities involving the binomial count $X$ of successes. That's because the sample proportion $\hat{p} = X/n$ has the same shape as $X$; the scale is simply divided by $n$.

If the above conditions hold, then we have the
following Rules for Sample Proportions. If numerous samples or repetitions of the same size are taken:

1. center: The mean of the distribution of sample proportion $\hat{p}$ will be the true proportion $p$ from the population. Thus, $\hat{p}$ is an unbiased estimator for $p$.
2. spread: The standard deviation of sample proportion is $\sqrt{p(1 - p)/n}$. Thus, the spread decreases as sample size increases.
3. shape: The frequency curve made from proportions from the various samples will be approximately normal (Central Limit Theorem).

Applying these rules to our M&M experiment, we can predict that

1. For a teaspoon (sample size 25):
   (a) The histogram of sample proportion values will be centered at population proportion .17.
   (b) The standard deviation of $\hat{p}$ should be approximately $\sqrt{.17(.83)/25} = .075$.
   (c) The histogram should be only roughly normal because our requirement is not satisfied: $np = 25(.17) = 4.25$ is less than 10.
2. For a Tablespoon (sample size 75):
   (a) The histogram should also be centered at .17.
   (b) The standard deviation should be approximately $\sqrt{.17(.83)/75} = .043$.
   (c) The histogram should be closer to normal because the requirement is satisfied: $np = 75(.17) = 12.75$ and $n(1 - p) = 75(.83) = 62.25$ are both greater than 10.

Recall: The Empirical Rule, introduced in Chapter 2, stated that for any normal curve with mean $\mu$, standard deviation $\sigma$, approximately

1. 68% of values should fall within $1\sigma$ of $\mu$
2. 95% of values should fall within $2\sigma$ of $\mu$
3. 99.7% of values should fall within $3\sigma$ of $\mu$

This enables us to set up probability intervals for the sample proportion of blues in a Tablespoon. For samples of size 75, approximately

1. 68% of sample proportions should be within 1(.043) of .17, that is, in (.127, .213)
2. 95% of sample proportions should be within 2(.043) of .17, that is, in (.084, .256)
3. 99.7% of sample proportions should be within 3(.043) of .17, that is, in (.041, .299)

At this point, we should check how well our own sample proportions conformed to the Empirical Rule.

Example: Lacking any further information, one might begin by assuming that the proportion of freshmen taking intro Stats classes is .25. According
to survey data, the sample proportion of freshmen among surveyed students is $\hat{p} \approx .08$ (35 of the students surveyed). If the population proportion were truly .25, then sample proportion would have mean .25 and standard deviation $\sqrt{.25(1 - .25)/n} \approx .02$ for the surveyed sample size $n$. The probability of a sample proportion as low as .08 coming from a population with proportion of freshmen equal to .25 would be

$$P(\hat{p} \le .08) = P\left(Z \le \frac{.08 - .25}{.02}\right) = P(Z \le -8.5) \approx 0$$

I would characterize this as virtually impossible, and so I now decide not to believe that the overall proportion of freshmen is .25.

Exercise: Assume the proportion of females in all intro Stat classes is $p = .5$. What are the mean and standard deviation of sample proportion if population proportion were indeed .5? Use the class survey responses to find the sample proportion of females in the class. Then use a normal approximation to find the probability of a sample proportion as high as the one observed, if the population proportion were truly .5. Characterize the results based on your probability, in words such as "not unusual", "unlikely", "almost impossible", etc. Finally, tell whether you believe $p$ is .5.

Lecture 19

Sampling Distribution of Sample Mean

In practice, just one sample is taken from a population of quantitative values, and the statistic $\bar{x}$ (sample mean) is measured, one time only. In theory, we may consider values of $\bar{x}$ for repeated samples, in order to get an idea of how sample mean, as a variable, behaves. For samples taken at random, sample mean is a random variable, written $\bar{X}$. To get an idea of how such a random variable behaves, we consider its sampling distribution: the distribution of values taken by the statistic in all possible samples of the same size from the same population.

Example: The population of possible rolls $X$ for a single die (equally likely values 1, 2, 3, 4, 5, 6) has mean $\mu = 3.5$ and standard deviation $\sigma = 1.7$. The sample mean roll $\bar{X}$ of 2 dice takes on various values, subject to the laws of chance: it is a random variable. We can summarize its sampling distribution, just as we summarized distributions of data values in Chapter 2, by telling about its
center, spread, and shape.

1. Sometimes the mean roll of 2 dice will be less than 3.5, sometimes greater than 3.5. It should be just as likely to get a lower-than-average mean as a higher-than-average mean: the sampling distribution of sample mean roll $\bar{X}$ should be centered at 3.5.
2. For the roll of 2 dice, the sample mean roll $\bar{X}$ will have a fair amount of spread: sample means all the way from 1 (if two 1's are rolled) to 6 (if two 6's are rolled) are not uncommon.
3. The most likely mean roll is 3.5, resulting from (1,6), (2,5), (3,4), (4,3), (5,2), or (6,1). Lower or higher mean rolls are progressively less likely, with 1 (two 1's are rolled) and 6 (two 6's are rolled) being least likely. Thus, the shape should be somewhat triangular: highest in the middle at 3.5, descending on either side.

Example: The sample mean roll $\bar{X}$ of 8 dice is also a random variable, whose sampling distribution can be summarized by telling its center, spread, and shape.

1. Sometimes the mean roll of 8 dice will be less than 3.5, sometimes greater than 3.5. It should be just as likely to get a lower-than-average mean as a higher-than-average mean: the sampling distribution of sample mean roll $\bar{X}$ should be centered at 3.5.
2. For the roll of 8 dice, the distribution of sample mean roll $\bar{X}$ would not be as spread as that for 2 dice. All eight 1's or 6's will almost never happen: rolling this many dice at once, there tend to be some low numbers that balance out the high numbers.
3. The most likely mean roll is still 3.5, with lower or higher mean rolls progressively less likely. But now there is a much better chance of the mean being close to 3.5, and a much worse chance of being as low as 1 or as high as 6. The shape of the sampling distribution bulges at the mean 3.5 and tapers away at either end: it appears normal.

Rules for Sample Means

These examples suggest some general results for the sampling distribution of any sample mean. Suppose a simple random sample of size $n$ is taken from a population of quantitative values for a random variable $X$ having mean $\mu$ and finite standard
deviation $\sigma$. Then the following hold for the sampling distribution of sample mean $\bar{X}$:

1. The distribution of $\bar{X}$ is centered at $\mu$. Thus, if we are using sample mean $\bar{x}$ (a statistic) to estimate population mean $\mu$ (a parameter), we may sometimes underestimate and sometimes overestimate, but there will be no systematic tendency either way. Thus, we say $\bar{X}$ is an unbiased estimator of $\mu$.
2. The distribution of $\bar{X}$ has more spread for smaller samples, less spread for larger samples. In fact, it can be shown that the standard deviation of $\bar{X}$ is $\sigma/\sqrt{n}$, where $\sigma$ is the population standard deviation. Thus, we can tell precisely how much the spread decreases as sample size increases: increasing from 2 to 8 dice means the spread of $\bar{X}$ decreases from $1.7/\sqrt{2} = 1.2$ to $1.7/\sqrt{8} = .6$.
3. For large sample size $n$, the sampling distribution of $\bar{X}$ is approximately normal. This is the celebrated Central Limit Theorem. A simpler situation is when the population itself is normal. Then sample mean $\bar{X}$ is guaranteed to be normal for any sample size $n$, even $n = 1$.

We will summarize our results as follows: Take a simple random sample of size $n$ from a population of values of a quantitative variable $X$, and consider sample mean $\bar{X}$. If $X$ is normal with mean $\mu$, standard deviation $\sigma$, then $\bar{X}$ is normal with mean $\mu$, standard deviation $\sigma/\sqrt{n}$. Otherwise, $\bar{X}$ is approximately normal with mean $\mu$, standard deviation $\sigma/\sqrt{n}$ for large enough $n$ (CLT).

How large is large enough? It depends on the shape of the population distribution. More observations are required if the shape of the population distribution is far from normal.

The above rules enable us to set up probability intervals for the sample mean roll of 8 dice. For samples of size 8 coming from a population with mean 3.5, standard deviation 1.7, approximately

1. 68% of sample means should be within $1(1.7/\sqrt{8}) = .6$ of 3.5, that is, in (2.9, 4.1)
2. 95% of sample means should be within $2(1.7/\sqrt{8}) = 1.2$ of 3.5, that is, in (2.3, 4.7)
3. 99.7% of sample means should be within $3(1.7/\sqrt{8}) = 1.8$ of 3.5, that is, in (1.7, 5.3)

At this point, we should check how well our own sample means conformed to the Empirical Rule.
Example: Women's heights are normal with mean 64.5, standard deviation 2.5. Pick one woman at random. According to the 68-95-99.7 Rule, the probability is

- 68% that her height $X$ is between 62 and 67 inches
- 95% that her height $X$ is between 59.5 and 69.5 inches
- 99.7% that her height $X$ is between 57 and 72 inches

Now pick a random sample of 25 women. Their sample mean height $\bar{X}$ is normal with mean 64.5, standard deviation $2.5/\sqrt{25} = .5$. The probability is

- 68% that their sample mean height $\bar{X}$ is between 64 and 65 inches
- 95% that their sample mean height $\bar{X}$ is between 63.5 and 65.5 inches
- 99.7% that their sample mean height $\bar{X}$ is between 63 and 66 inches

Thus, the sample of 25 heights has a mean which is, according to the laws of probability, much closer to the true mean than the value for a single height would be. Also, note the tradeoff: lower probability of mean height being in a narrower interval, higher probability of mean height being in a wider interval. Such tradeoffs will be encountered later with confidence intervals.

Example: What is the probability that the height of a randomly chosen woman is less than 63.75 inches?

$$P(X < 63.75) = P\left(Z < \frac{63.75 - 64.5}{2.5}\right) = P(Z < -.3) = .3821$$

Example: What is the probability that sample mean height for a random sample of 25 women is less than 63.75 inches?

$$P(\bar{X} < 63.75) = P\left(Z < \frac{63.75 - 64.5}{2.5/\sqrt{25}}\right) = P(Z < -1.5) = .0668$$

Thus, it is not unusual for an individual woman to be less than 63.75 inches, but it would be unusual for the mean height of 25 women to be that low.

Example: Household size $X$ in the US has mean $\mu = 2.6$, standard deviation $\sigma = 1.4$.

1. Do you think the population distribution is normal? No: most households will have about 1 or 2 or 3 people, but a few households will be unusually large, so the distribution would be right-skewed.
2. Pick a household at random. Find the probability that the household size exceeds 2.7. We can't answer this without knowing the exact distribution; normal tables do not apply.
3. Take a random sample of 10 households. Find the probability that sample mean household size exceeds 2.7. Can't be done:
sample size $n = 10$ is too small to expect the Central Limit Theorem to guarantee an approximately normal distribution of $\bar{X}$, so we cannot find probabilities from normal tables.
4. Take a random sample of 100 households. Find the probability that sample mean household size exceeds 2.7. Now $\bar{X}$ is approximately normal with mean 2.6, standard deviation $1.4/\sqrt{100} = .14$, and so

$$P(\bar{X} > 2.7) = P\left(Z > \frac{2.7 - 2.6}{.14}\right) = P(Z > .71) = P(Z < -.71) = .2389$$

Example: We considered the distribution of mean roll of 2 dice and 8 dice. What about the mean roll of 100 dice? It is unlikely to stray far at all from the overall mean of 3.5. According to the law of large numbers, the actually observed mean outcome $\bar{X}$ must approach the mean $\mu$ of the population as the number of observations increases.

Example: Presumably, heights in inches of young women have a mean of 64.5 and a standard deviation of 2.5. Sample mean height for a random sample of 281 women would have mean 64.5 and standard deviation $2.5/\sqrt{281} = .149$. The observed sample mean for surveyed women's heights is 64.783. The probability of a sample mean this high coming from a population with mean 64.5 is

$$P(\bar{X} \ge 64.783) = P\left(Z \ge \frac{64.783 - 64.5}{.149}\right) = P(Z \ge 1.90) = .0287$$

It's pretty unlikely to get a sample mean as high as 64.783 if population mean were 64.5. We have reason to suspect that the population mean is somewhat higher than 64.5. Some sources report the population mean as being 65; in reality, it could well be somewhere between 64.5 and 65.0.

Exercise: If students each picked a number truly at random from 1 to 20, then their responses would follow a uniform distribution, with each of the numbers appearing with probability .05. It can be shown that the mean of all the numbers between 1 and 20 is 10.5 and the standard deviation is 5.77. What are the mean and standard deviation of sample mean selection for a sample of $n$ students, if their selections are truly random? Use the class survey responses to find the sample mean random number selected. Then use a normal approximation to find the probability of a sample mean as high as the one observed, if the
population mean were truly 10.5. Characterize the results based on your probability, in words such as "not unusual", "unlikely", "almost impossible", etc. Finally, tell whether you have statistical evidence of bias in favor of higher numbers.
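
The rules for sample proportions and sample means in Lectures 18 and 19 can be checked numerically. The following is a minimal Python sketch, not part of the original notes: the helper names `normal_cdf`, `sd_sample_proportion`, and `sd_sample_mean` are illustrative assumptions, and the normal probability is computed from the error function rather than read from a table. It reproduces the M&M standard deviations and the two women's-height probabilities.

```python
import math


def normal_cdf(z: float) -> float:
    """Return P(Z < z) for a standard normal Z, using the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sd_sample_proportion(p: float, n: int) -> float:
    """Standard deviation of sample proportion: sqrt(p(1 - p)/n)."""
    return math.sqrt(p * (1.0 - p) / n)


def sd_sample_mean(sigma: float, n: int) -> float:
    """Standard deviation of sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


# M&M example: population proportion of blue is .17
sd_teaspoon = sd_sample_proportion(0.17, 25)    # teaspoon, n = 25
sd_tablespoon = sd_sample_proportion(0.17, 75)  # Tablespoon, n = 75

# Women's heights: mean 64.5, standard deviation 2.5
# One woman (n = 1) versus the mean of a random sample of 25 women
p_single = normal_cdf((63.75 - 64.5) / sd_sample_mean(2.5, 1))
p_mean25 = normal_cdf((63.75 - 64.5) / sd_sample_mean(2.5, 25))

print(f"sd of sample proportion, n=25: {sd_teaspoon:.3f}")
print(f"sd of sample proportion, n=75: {sd_tablespoon:.3f}")
print(f"P(X < 63.75), one woman:       {p_single:.4f}")
print(f"P(Xbar < 63.75), 25 women:     {p_mean25:.4f}")
```

The same three helpers can be reused for the exercises: substitute the assumed population values and the class sample size, then compare the observed statistic to the resulting normal distribution.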
