Introduction to Probability and Statistics
Introduction to Probability and Statistics STAT 20
This 54 page Class Notes was uploaded by Floy Kub on Thursday October 22, 2015. The Class Notes belongs to STAT 20 at University of California - Berkeley taught by Staff in Fall. Since its upload, it has received 9 views. For similar materials see /class/226733/stat-20-university-of-california-berkeley in Statistics at University of California - Berkeley.
Date Created: 10/22/15
Probability Models: General Probability Rules

- Coin tossing
- Probability models
  - Sample spaces and events
  - Venn diagrams
  - Basic probability rules
  - Assigning probabilities: finite sample space
  - Assigning probabilities: intervals of outcomes
  - Independence and the multiplication rule

Randomness and Probability

Recall: We call a phenomenon random if individual outcomes are uncertain but there is a regular distribution of outcomes in a large number of repetitions. The probability of any outcome of a random phenomenon is the proportion of times the outcome would occur in a very long series of repetitions.

Example: Coin tossing

- French naturalist Buffon observed 2048 heads in 4040 tosses. Relative frequency: $2048/4040 = 0.5069$.
- English statistician Karl Pearson observed 12012 heads in 24000 tosses. Relative frequency: $12012/24000 = 0.5005$.
- English mathematician John Kerrich observed 5067 heads in 10000 tosses. Relative frequency: $5067/10000 = 0.5067$.

Probability Models

The sample space, denoted by $S$, is the set of all possible outcomes of a random phenomenon.

- Toss a coin and record the side facing up. Then $S = \{\text{Head}, \text{Tail}\} = \{H, T\}$.
- Toss a coin twice. Record the side facing up each time. Then $S = \{HH, HT, TH, TT\}$.
- Toss a coin twice. Record the number of heads in the two tosses. Then $S = \{0, 1, 2\}$.

An event is an outcome or a set of outcomes of a random phenomenon, i.e. a subset of the sample space.

Toss a coin three times. Then $S = \{HHH, HHT, HTH, THH, HTT, THT, TTH, TTT\}$.

- Let $A$ be the event that we get exactly two tails. Then $A = \{HTT, THT, TTH\}$.
- Let $B$ be the event that we get at least one head. Then $B = \{HHH, HHT, HTH, THH, HTT, THT, TTH\}$.

A probability model is a mathematical description of a random phenomenon consisting of two parts: a sample space $S$ and a way of assigning probabilities to events. The probability of an event $A$, denoted by $P(A)$, can be considered the long-run relative frequency of the event $A$.

Set Notation

Suppose $A$ and $B$ are events in the sample space $S$. Then

- $A \cup B$ ≡ $A$ or $B$ ≡ the set of all outcomes in $A$ or in $B$ or in both
- $A \cap B$ ≡ $A$ and $B$ ≡ the set of all outcomes that are in $A$ and in $B$
- $A \cap B = \emptyset$ ≡ $A$ and $B$ are disjoint ≡ $A$ and $B$ are mutually exclusive ≡ $A$ and $B$ have no outcomes in common
- $A^c$ ≡ the complement of $A$ ≡ the event that $A$ does not occur

Example: Toss a coin twice. Let $A$ be the event that we get 2 heads, $B$ the event that we get exactly 1 tail, and $C$ the event that we get at least one head. So $A = \{HH\}$, $B = \{TH, HT\}$, $C = \{HH, HT, TH\}$.

Exercise: write out $A^c$, $A \cup B$, $B^c$, $A \cap C$, $A \cap B$ and $B \cup C$.

Probabilities in a Finite Sample Space

If the sample space is finite, each distinct outcome is assigned a probability. The probability of an event is the sum of the probabilities of the distinct outcomes making up the event.

If a random phenomenon has $k$ equally likely outcomes, each individual outcome has probability $1/k$. For any event $A$,

$$P(A) = \frac{\text{number of outcomes in } A}{\text{number of outcomes in } S}.$$

Rules of Probability

1. For any event $A$, $0 \le P(A) \le 1$.
2. $P(S) = 1$.
3. For any event $A$, $P(A^c) = 1 - P(A)$.
4. If $A \cap B = \emptyset$, then $P(A \cup B) = P(A) + P(B)$. More generally, $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

Example: Roll a fair die and look at the face value. The sample space is $S = \{1, 2, 3, 4, 5, 6\}$. This is a finite sample space and each outcome is equally likely. That is, $P(X = j) = 1/6$ for all $j \in S$, where $X$ is the face value of the die after rolling.

$$P(X \ge 5) = P(X = 5) + P(X = 6) = 1/6 + 1/6 = 1/3$$

$$P(X \ge 2) = \ldots$$

Assigning Probabilities: Intervals of Outcomes

[The body of this slide is illegible in the source.]

The Accuracy of Percentages
David Shilane
Lecture 17, Statistics 20, University of California, Berkeley
Tuesday, April 10th, 2007

We're often interested in estimating a percentage. Some examples include:

- Baseball batting averages
- The proportion of customers who buy an item during a sale
- The risk of obtaining a disease
- Political approval rates

The usual statistical technique we use to estimate percentages is to sample data and calculate the proportion of outcomes we're interested in. However, because data are random and we can usually only collect a small amount of it, the question is how
accurate are our estimates?

A Motivating Example

Figure 1: Tarja Halonen, President of Finland

Figure 2: Tarja Halonen with her doppelganger, Conan O'Brien

Estimating Tarja's Approval Rating

We can define a politician's approval rating to be the proportion of constituents who approve of the politician. Thought of another way, the approval rating is the probability that a randomly selected constituent will approve of the politician.

Sometimes approval is measured in different ways: it can be the proportion of people who intend to vote for the politician, who approve of actions relating to an issue, or even just whether people like the person or not. Unfortunately, people respond differently depending upon what question is asked and even how it's delivered. Therefore it is important to remember that the results obtained from a poll are with respect to a particular question, and we should be hesitant to generalize results for one question to answer another.

Types of Surveys

- Do you approve of Tarja Halonen? Yes / No
- How strongly do you approve of Tarja Halonen? 1 2 3 4 5 6

For the latter survey we might say people approve of Tarja if they responded with at least 4, and otherwise disapprove.

The Data

In the simplest case we would draw $n$ names out of a hat with replacement and ask them to complete the survey. Then our data is, for $1 \le i \le n$,

$$X_i = \begin{cases} 1 & \text{if person } i \text{ approves} \\ 0 & \text{if person } i \text{ disapproves} \end{cases}$$

Our quantity of interest is the approval rating $P(X_i = 1) = p$, where $p$ is an unknown number we wish to estimate. We can do so by taking the empirical proportion $\hat{p}$ of people who approve, which is also the sample mean of $n$ Binomial$(1, p)$ random variables:

$$\hat{p} = \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
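The estimate described on "The Data" slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the seed, and the "true" approval rating of 0.6 are all made up for the example and do not come from the lecture.

```python
import random


def empirical_proportion(responses):
    """Return the sample mean of 0/1 responses, i.e. the estimate p-hat."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(responses) / len(responses)


def simulate_survey(p, n, rng):
    """Draw n independent 0/1 responses, each equal to 1 with probability p.

    This mimics sampling with replacement: every draw has the same chance p
    of being an approval, regardless of earlier draws.
    """
    return [1 if rng.random() < p else 0 for _ in range(n)]


rng = random.Random(20)  # fixed seed so the run is repeatable
responses = simulate_survey(p=0.6, n=1000, rng=rng)  # p = 0.6 is hypothetical
p_hat = empirical_proportion(responses)
print(f"p_hat = {p_hat:.3f}")
```

Because the responses are coded as 0 and 1, the proportion of approvals and the sample mean are the same number, which is why one function serves for both.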
Is it really with replacement?

The short answer is no. In reality, people are selected without replacement, and this means we have to use more complicated methods: there are entire courses devoted to designing and analyzing surveys.

Another potential problem is that it may be difficult to select people with equal probability. There are "do not call" lists, and of course some people don't have telephones. Furthermore, not everyone you call necessarily responds, and this may lead to a selection bias in your results.

For all these reasons, opinion polling is rarely as simple as the survey we're describing. However, at the end of the day you still have to compute an estimate and assess its accuracy, so what we would do in the simple situation provides some building blocks for the more difficult problems.

But what are we estimating?

The underlying assumption here is that Tarja has a true approval rating $p = P(X = 1) = E[X]$, because $X$ is Binomial$(1, p)$. This is equivalent to the approval rating we would obtain from surveying all $N \approx 5$ million people in Finland. Surveying everyone is not only impractical but also very costly in terms of time and labor, so the best we can do is sample as many people as we can.

If each person approves with probability $p$ and disapproves with probability $1 - p$, then sampling with replacement is like flipping a weighted coin $n$ times.

Expected Value and Variance for a single coin flip

- $E[X] = 1 \cdot p + 0 \cdot (1 - p) = p$
- $Var(X) = E[X^2] - (E[X])^2 = [1^2 \cdot p + 0^2 \cdot (1 - p)] - p^2 = p - p^2 = p(1 - p)$
- $SD(X) = \sqrt{Var(X)} = \sqrt{p(1 - p)}$

Expected Value and Variance of the Sample Mean

- $E[\bar{X}] = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n} E[X_1 + \cdots + X_n] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}[p + \cdots + p] = \frac{1}{n}(np) = p$
- $Var(\bar{X}) = Var\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} Var(X_i) = \frac{1}{n^2}[p(1-p) + \cdots + p(1-p)] = \frac{1}{n^2} \cdot n\,p(1-p) = \frac{p(1-p)}{n}$

Note: $E\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} E[X_i]$ always, but $Var\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} Var(X_i)$ only when $X_1, \ldots, X_n$ are uncorrelated, which they are by independence.

Mean and Variance for Tarja's Estimated
Approval Rating

- Mean: $\hat{p} = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
- Variance: $\frac{\hat{p}(1 - \hat{p})}{n}$
- SD: $\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$

The Law of Large Numbers and Central Limit Theorem

- Because $E[\hat{p}] = p$ and $SD(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}$, the law of large numbers says that the sample mean will get closer and closer to the true mean as $n$ grows larger. This is true because as $n \to \infty$, $SD(\hat{p}) \to 0$, so $\hat{p} \to p$.
- The central limit theorem says that $\hat{p}$ will have approximately a Normal distribution as $n$ grows large, because it is a mean of $n$ independent, identically distributed random variables. Therefore, when we collect a large number of surveys, we can use the Normal distribution to make probability statements about the results.

Confidence Intervals

We are often interested in determining a range of values that cover a certain proportion of the data. This range is called a confidence interval and is specified by the proportion you ask for. It is very common to use a 95% confidence interval, but the number itself is somewhat arbitrary.

To determine a confidence interval we need to convert to standard units:

$$z = \frac{X - E[X]}{SD(X)}$$

We start by finding the value of $z$ in the Normal table that specifies the desired proportion. When we want to cover 95% of the area under the Normal curve we use $z = 1.96$. Sometimes we refer to this value as $z_{0.95} = 1.96$ to indicate that it covers 95% of the area.

Finding an Interval

In order to find a 95% confidence interval we need to backsolve the standard units equation:

$$z_{0.95} = \frac{X_{right} - X}{SD(X)} \;\Rightarrow\; z_{0.95}\,SD(X) = X_{right} - X \;\Rightarrow\; X_{right} = X + z_{0.95}\,SD(X)$$

This gives the right endpoint. Then we just plug in $-z_{0.95}$ to find the left endpoint:

$$X_{left} = X - z_{0.95}\,SD(X)$$

Then for any Normal random variable $X$, a 95% confidence interval is given by

$$(X_{left}, X_{right}) = X \pm z_{0.95}\,SD(X)$$

A 95% Confidence Interval for the Sample Mean

Remember that $E[\hat{p}] = p \approx \hat{p}$. We would prefer to fill in the true expected value, but since we don't know $p$,
the best we can do is fill in our estimate $\hat{p}$ in its place. Likewise,

$$SD(\hat{p}) = \sqrt{\frac{p(1-p)}{n}} \approx \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Therefore we plug in these values for the mean and SD to find a 95% confidence interval for the sample mean as

$$(X_{left}, X_{right}) = \hat{p} \pm z_{0.95}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \bar{X} \pm 1.96\sqrt{\frac{\bar{X}(1-\bar{X})}{n}}$$

Our Old Friend the Box Model

If you're following along in Chapter 21 of the Freedman, Pisani, Purves text, then it's perfectly equivalent to use the following Box Model to produce your results:

1. Start with a box containing tickets labeled with 0's and 1's. The proportion of 1's is $p$ and the proportion of 0's is $1 - p$.
2. Calculate the mean and SD of the box.
3. Find the standard deviation of the approval rating by dividing the SD of the box by $\sqrt{n}$.
4. Then, since we don't actually know $p$, estimate these numbers using the empirical mean $\hat{p}$ and SD $\sqrt{\hat{p}(1-\hat{p})}$.
5. Construct a 95% confidence interval using the formula $\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$.

Interpreting Confidence Intervals

- Before we generate a confidence interval, we can say that the interval we get will have probability 0.95 of containing the true mean.
- Once we generate a specific interval, it either contains the true mean or doesn't. It's a very common mistake to say that the specific interval we obtain contains the mean with probability 0.95. However, from our viewpoint the truth is not a random variable, so this interpretation is not valid.
- What we can say is that if we repeated the experiment a large number of times, then approximately 95% of the confidence intervals we generate will contain the true mean.

Example: Tarja Halonen Approval Poll

A total of $n = 1000$ Finns were surveyed independently with replacement to determine whether they approve of Tarja Halonen's job performance as president. A total of 573 respondents approved and 427 did not. Construct a 95% confidence interval for Tarja's approval rating.

$\hat{p} \pm z_{0.95}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \bar{X} \pm z_{0.95}\sqrt{\frac{\bar{X}(1-\bar{X})}{n}}$
$= 0.573 \pm 1.96\sqrt{\frac{0.573(1-0.573)}{1000}} = (0.542, 0.604)$

The margin of error is about ±3.1% for the poll.

Repeating the Experiment

Now let's pretend that we know Tarja Halonen's approval rating is exactly 0.53. If we conduct a large number of polls, what proportion of 95% confidence intervals will contain her true approval?

I performed this experiment a total of 10000 times by simulating random numbers on a computer, which took about 4 seconds to run. I ultimately found that 9473 of the experiments generated confidence intervals containing the truth, so a proportion of 0.9473 of all 95% confidence intervals contained the true value. Is this a reasonable proportion? Let's make another confidence interval.

We now have $n = 10000$ experiments, and on each one the confidence interval we generated either contained the value 0.53 or it didn't. For $1 \le i \le n$, the data are of the form

$$Y_i = \begin{cases} 1 & \text{if experiment } i\text{'s CI contains } 0.53 \\ 0 & \text{otherwise} \end{cases}$$

Because we were generating 95% confidence intervals, our assumption is that $P(Y_i = 1) = p = 0.95$. We can validate this assumption if the 95% CI for $p$ contains 0.95. This confidence interval is

$$\hat{p} \pm z_{0.95}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.9473 \pm 1.96\sqrt{\frac{0.9473(1-0.9473)}{10000}} = (0.9429, 0.9517)$$

Therefore the experiment produced a reasonable result that appears to validate the notion that roughly 95% of all confidence intervals will contain the true value.

Producing Data

We previously focused on ways to analyze data that has already been collected:

- Summary statistics
- Look for patterns in the data
- Relationships between variables

Conclusions from exploratory data analysis alone are often not sufficient, because striking patterns in the data can arise from many sources, e.g. lurking variables.

We will focus on producing trustworthy data and how to judge the quality of data produced by others. The design for how the data is collected is the most important prerequisite for trustworthy statistical inference. Sampling and
experiments are used for collecting and producing data.

Observational Studies

- An observational study observes individuals and measures variables of interest but does not attempt to influence the responses.
- A sample survey is an example of an observational study. A sample is a small group of people that is used to represent the larger population.
  Example: Opinion polls report the view of the entire population based on interviews with a sample of 1000 people.
- A census attempts to contact every individual in the entire population; it is often expensive, very time consuming, and inaccurate.
- If the goal is to get a picture of the entire population, disturbed as little as possible by the act of gathering information, observational studies are used.

First Steps For Data Collection

When you collect and produce data, you need to know the following:

- The individuals of interest for the study.
- The variables to be measured must be clearly defined and measured accurately.
- Observational studies cannot control for lurking variables, so be careful when drawing conclusions from them. Example: Simpson's paradox.

Experiments

- An experiment deliberately imposes some treatment on the individuals in order to observe their response.
- Example: assigning some mice to high doses of saccharine and some to a control diet, and observing their respective incidences of cancer.
- For understanding cause and effect, experiments are the only source for obtaining fully convincing data.
- Generally, experiments are preferred over observational studies, especially for establishing causality, but they may be either impossible to conduct or unethical.

Design of Experiments

- Experimental units: the individuals on which the experiment is done.
- Treatment: a specific experimental condition applied to the units.
- Randomization: the use of chance to divide the experimental units into groups.
- Factors are often called the explanatory variables.
- Levels are specific values of each factor that are applied to the experimental units.

Population and
Samples

The population of interest is the group of individuals about which we want information.

The sample is a part of the population from which we actually collect information, and from which we try to draw conclusions about the whole. We seek to design a sample that is representative of the population.

Some Bad Ideas

- Self-selection, for example call-in or instant polls. Respondents are not representative: they tend to have strong opinions and may be members of a nonrepresentative TV channel or website audience.
- Convenience sampling, for example mall customers. They tend to be wealthier than the average American and are more likely to be either teens or retired. Furthermore, mall interviewers tend to pick clean-cut individuals, skewing the sample even more.

Confounding

Two variables are confounded when their effects on a response variable cannot be distinguished from each other.

For example, suppose a smoking study contains individuals who are either male smokers or female nonsmokers. If the smoking group has a higher incidence of lung cancer than the nonsmoking group, we cannot tell if the effect is due to smoking or gender. Smoking and gender are confounded.

This is an exaggerated example, but confounding may be subtle, and one of the confounded variables may not even have been measured.

Simple Random Samples (SRS)

The bad sampling schemes above lead to bias: a systematic favoring of some outcomes over others. We seek an unbiased sampling scheme.

A simple random sample (SRS) of size $n$ consists of $n$ individuals chosen from the population in such a way that every set of $n$ individuals has an equal chance of being selected.

This can be accomplished by the proverbial "drawing numbers from a hat":

- Use a computer
- Use a random number table

To choose an SRS by hand, first label each individual in the population using the smallest possible labels. Then use the random number table to select labels at random. Throw out any labels that do not correspond to an individual.

SRS or Not?

Is each of the
following samples an SRS or not?

- A deck of cards is shuffled and the top five dealt.
- A telephone survey is conducted by dialing telephone numbers at random (i.e. each valid phone number is equally likely).
- A sample of 10% of the Berkeley student body is chosen by numbering the students $1, \ldots, N$, drawing a random integer $i$ from 1 to 10, and drawing every tenth student beginning with $i$ (e.g. if $i = 5$, students 5, 15, 25, ... are chosen).

Multistage Samples

As the name would imply, a multistage sample is drawn in stages. This is often done for nationwide samples of families, households, or individuals, because the cost of sending interviewers to widely scattered households would be too high.

Example: the Current Population Survey, for data on employment and unemployment.

- Stage 1: divide the US into 2007 geographical areas called Primary Sampling Units (PSUs). Select a sample of 754 PSUs.
- Stage 2: divide each PSU selected in stage 1 into small areas called Census Blocks. Stratify the blocks using ethnic and other information and take a stratified sample of the blocks in each PSU.
- Stage 3: group the housing units in each block into clusters of four nearby units. Interview the households in a random sample of these clusters.

The final sample consists of clusters of nearby households that an interviewer can easily visit.

Stratified Samples

In general, a probability sample is a sample chosen in such a way that we know what samples are possible and what probability each possible sample has. Often an SRS is not practical and we need alternative types of probability samples.

To select a stratified random sample, first divide the population into groups of similar individuals, called strata. Then choose a separate SRS in each stratum and combine these SRSs to form the full sample.

A stratified design can produce more exact information than an SRS of the same size by taking advantage of the fact that individuals in the same stratum are similar to one another.

Example: a population of election districts might be divided into urban, suburban, and rural strata.

Potential Problems with Surveys

- Undercoverage: some groups in the population may be left out, e.g. the homeless, prison inmates, students in dormitories, households without telephones.
- Nonresponse: a selected individual can't be contacted or refuses to cooperate. This may be a serious problem. The notes give a telephone survey as an example (the details are illegible in the source) in which the nonresponse rate is 53%.
- Response bias: respondents may lie, particularly about illegal or unpopular behavior.
- Wording effects: leading questions, or even seemingly innocuous words that have positive or negative connotations, may affect survey results.

Statistical Inference

- Typical situation: we want to answer a question about a population of individuals.
- To answer the question, we use a sample of individuals from the population. Using the sample, we try to draw conclusions about the entire population.
- Parameter: a number (unknown in practice) that describes the population. We will call this $p$.
- Statistic: a number describing the sample (changes from sample to sample). We will call this $\hat{p}$.
- A statistic is used to estimate an unknown parameter.

Example: US presidential election forecast, 1936

Literary Digest mailed questionnaires to 10 million people (25% of voters at the time). 2.4 million people responded.

- Their prediction: Landon 57%, Roosevelt 43%
- Actual result: Roosevelt 62%, Landon 38%

What went wrong?

- Selection bias: the mailing list was drawn from telephone books, club memberships, mail order lists, and automobile ownership lists.
- Nonresponse bias: only 24% responded, and these were biased toward the Republicans.

The Gallup Poll surveyed 50000 people and correctly predicted Roosevelt's victory.

Example

We are interested in the percentage of American adults that find shopping for clothes frustrating and time consuming. 2500 people are selected using a simple random sample and each individual is interviewed. 1650 individuals in the sample agreed that shopping is often frustrating and time consuming.

- What is the population?
- What is the sample?
- What is the parameter?
- What is the statistic?

We want to estimate the parameter $p$ = percentage of Americans that find shopping frustrating and time consuming. Using the sample, we have $\hat{p} = 1650/2500 = 0.66 = 66\%$.

- If a second random sample of 2500 adults is taken, would we expect exactly 1650 people to agree that shopping is frustrating?
- No! The new sample will have different people in it, so the results will not be the same. The new sample may have 1440 people agree that shopping is often frustrating, and $\hat{p} = 0.576 = 57.6\%$ for this sample.
- The value of $\hat{p}$ will vary from sample to sample.

The sampling distribution of a statistic is the distribution of the statistic in all possible samples of the same size from the same population.

Bias and Variability

- Bias concerns the center of the sampling distribution. A statistic is unbiased if the mean of its sampling distribution is equal to the true value of the parameter being estimated.
  Bias of an estimator = mean of sampling distribution - true value of parameter
- The variability of a statistic is about the spread of its sampling distribution. The spread is determined by the sampling design and the sample size, and shrinks as the sample size grows.
  Variability of an estimator = SD of sampling distribution

Sampling Distribution for the Shopping Example

- The shape of the distribution of $\hat{p}$ will be approximately normal. We will see why later.
- The center, or mean, of the distribution will be $p$. This is true for both large and small samples.
- The spread of the distribution depends on the size of the sample. The values of $\hat{p}$ for samples of size 2500 will be much less spread out than the values from samples of size 100.

Inference for Regression

- The simple linear regression model
- Estimating regression parameters
- Confidence intervals and significance tests for regression parameters
- Inference about prediction
- Analysis of variance for regression
- The regression fallacy

Simple Linear Regression Model

The simple linear regression model states that the response variable
$y$ and the explanatory variable $x$ have a linear relationship of the form

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where

- $\beta_0$ and $\beta_1$ are the y-intercept and the slope of the true population regression line
- $\varepsilon \sim N(0, \sigma)$
- The $\varepsilon_i$ corresponding to the pairs $(x_i, y_i)$ are independent of each other
- Given $x$, $y$ has mean $\beta_0 + \beta_1 x$ and variance $\sigma^2$

$E(y|x) = \mu_{y|x} = \beta_0 + \beta_1 x$ is called the population regression line, and $Var(y|x) = \sigma^2_{y|x} = \sigma^2$.

Simple Linear Regression

Earlier in the course we discussed how to find the best fitting line for bivariate data. Here we consider that problem from the perspective of statistical inference.

Suppose we observe pairs of observations $(x_1, y_1), \ldots, (x_n, y_n)$. For example:

- $x$ = father's height, $y$ = son's height
- $x$ = midterm score, $y$ = final score
- $x$ = temperature, $y$ = yield

The values of $x$ define different groups of subjects, which we think of as belonging to subpopulations, one for each possible value of $x$.

Let $\mu_{y|x}$ and $\sigma^2_{y|x}$ denote the mean and variance of $y$ in the subpopulation with a certain value of $x$. Under the linear regression model with equal variance,

$$\mu_{y|x} = \beta_0 + \beta_1 x \quad \text{and} \quad \sigma^2_{y|x} = \sigma^2$$

(possibly after transformation).

Estimating the Regression Parameters by Least Squares

Given a sample of $n$ pairs of observations $(x_1, y_1), \ldots, (x_n, y_n)$, we use the method of least squares to estimate the unknown parameters $\beta_0$, $\beta_1$ and $\sigma$. This gives us the fitted line

$$\hat{y} = b_0 + b_1 x$$

where

- $b_1 = r\,\frac{s_y}{s_x}$ is the estimate of $\beta_1$
- $b_0 = \bar{y} - b_1\bar{x}$ is the estimate of $\beta_0$

Recall that the residual is the difference between the observed value and the predicted value:

$$e_i = y_i - b_0 - b_1 x_i = y_i - \hat{y}_i$$

The sample variance of the $e_i$ can be used to estimate $\sigma^2$:

$$s^2 = \frac{\sum_{i=1}^{n} e_i^2}{n - 2}$$

$s$ is called the regression standard error and it has $n - 2$ degrees of freedom. Why $n - 2$ degrees of freedom?

CIs for Regression Parameters

Under the assumption that $\varepsilon \sim N(0, \sigma)$,

$$b_1 \sim N\left(\beta_1,\ \frac{\sigma}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}\right), \qquad b_0 \sim N\left(\beta_0,\ \sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}\right)$$

We don't know $\sigma$, so we will use $s$ to estimate it. This leads to $t$ confidence intervals for $\beta_0$ and $\beta_1$.

Conditions for regression inference

- The sample is an SRS from the population.
- There is a linear relationship in the population. We check this condition by assessing the linearity of a
scatterplot of the sample data.
- The standard deviation of the responses about the population line is the same for all values of the explanatory variable. We check this by plotting the residuals and observing whether or not the spread of the observations around the least-squares line is roughly uniform as $x$ varies.
- The response varies Normally about the population regression line. We check this condition by observing a Normal quantile plot of the residuals.

Note that the last three conditions are statements about the population that cannot be verified directly. We use the sample to assess their reasonability.

A level $1 - \alpha$ confidence interval for $\beta_0$ is given by

$$\left(b_0 - t^* SE_{b_0},\ b_0 + t^* SE_{b_0}\right)$$

where $t^*$ is the upper $\alpha/2$ critical value of the $t_{n-2}$ distribution and

$$SE_{b_0} = s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$

A level $1 - \alpha$ confidence interval for $\beta_1$ is given by

$$\left(b_1 - t^* SE_{b_1},\ b_1 + t^* SE_{b_1}\right)$$

where $t^*$ is the upper $\alpha/2$ critical value of the $t_{n-2}$ distribution and

$$SE_{b_1} = \frac{s}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$

Hypothesis Tests for Regression Parameters

To test the hypothesis $H_0: \beta_1 = a$ we use the test statistic

$$t = \frac{b_1 - a}{SE_{b_1}}$$

- The p-value for the test statistic is found from the $t_{n-2}$ distribution.
- If the regression assumptions are true, testing $H_0: \beta_1 = 0$ corresponds to testing whether or not there is a linear relationship between $y$ and $x$.

A similar test can be performed for $\beta_0$, but it is rarely of interest.

Performing this regression analysis in STATA yields the following results:

    . regress damage distance

          Source |       SS       df       MS          Number of obs =      15
    -------------+------------------------------       F(1, 13)      =  156.89
           Model |  841.766403     1  841.766403       Prob > F      =  0.0000
        Residual |  69.7509359    13  5.36545661       R-squared     =  0.9235
    -------------+------------------------------       Adj R-squared =  0.9176
           Total |  911.517339    14  65.1083813       Root MSE      =  2.3163

    --------------------------------------------------------------------------
          damage |     Coef.  Std. Err.       t   P>|t|   [95% Conf. Interval]
    -------------+------------------------------------------------------------
        distance |  4.919331   .3927478  12.525   0.000   4.070851   5.767811
           _cons |  10.27793   1.420278   7.237   0.000   7.209605   13.34625
    --------------------------------------------------------------------------

The following are a residual plot and a normal quantile plot of the residuals.

[Figure: residual plot and Normal quantile plot of the residuals]
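As a cross-check on the STATA output above, the formulas for $b_0$, $b_1$, $s$, $SE_{b_0}$ and $SE_{b_1}$ can be evaluated directly. The following is a minimal Python sketch, not the method used in the lecture: the function and variable names are illustrative, the 15 (distance, damage) pairs are the values read from the data table of the fire damage example, and the critical value $t^* = 2.160$ for 13 degrees of freedom is taken from a standard $t$ table rather than computed.

```python
import math

# (distance in miles, damage in $1000s) for the 15 fires in the example
fires = [
    (0.7, 14.1), (1.1, 17.3), (1.8, 17.8), (2.1, 24.0), (2.3, 23.1),
    (2.6, 19.6), (3.0, 22.3), (3.1, 27.5), (3.4, 26.2), (3.8, 26.1),
    (4.3, 31.3), (4.6, 31.3), (4.8, 36.4), (5.5, 36.0), (6.1, 43.2),
]


def least_squares(pairs):
    """Fit y = b0 + b1*x by least squares and return estimates and SEs."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    b1 = sxy / sxx                      # slope estimate
    b0 = y_bar - b1 * x_bar             # intercept estimate
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in pairs)
    s = math.sqrt(sse / (n - 2))        # regression standard error
    return {
        "b0": b0,
        "b1": b1,
        "s": s,
        "se_b0": s * math.sqrt(1 / n + x_bar ** 2 / sxx),
        "se_b1": s / math.sqrt(sxx),
    }


fit = least_squares(fires)
T_STAR = 2.160  # upper 0.025 critical value of t with 13 df (table value)
ci_b1 = (fit["b1"] - T_STAR * fit["se_b1"], fit["b1"] + T_STAR * fit["se_b1"])
t_stat = fit["b1"] / fit["se_b1"]       # test statistic for H0: beta1 = 0

print(f"b0 = {fit['b0']:.3f}, b1 = {fit['b1']:.3f}, s = {fit['s']:.4f}")
print(f"SE(b1) = {fit['se_b1']:.4f}, t = {t_stat:.2f}")
print(f"95% CI for beta1: ({ci_b1[0]:.3f}, {ci_b1[1]:.3f})")
```

The printed slope, intercept, standard errors and interval should agree with the `distance` and `_cons` rows of the STATA table up to rounding, which is a useful sanity check that the formulas have been read correctly.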
Example: Fire damage and distance to fire station

Suppose a fire insurance company wants to relate the amount of fire damage in major residential fires to the distance between the burning house and the nearest fire station. The study is to be conducted in a large suburb of a major city; a sample of 15 recent fires in this suburb is selected. The amount of damage (in thousands of dollars) and the distance (in miles) between the fire and the nearest fire station are recorded for each fire.

    Obs   Dist   Damage
      1    0.7     14.1
      2    1.1     17.3
      3    1.8     17.8
      4    2.1     24.0
      5    2.3     23.1
      6    2.6     19.6
      7    3.0     22.3
      8    3.1     27.5
      9    3.4     26.2
     10    3.8     26.1
     11    4.3     31.3
     12    4.6     31.3
     13    4.8     36.4
     14    5.5     36.0
     15    6.1     43.2

[Figure: scatterplot of damage against distance to the fire station]

Example cont.

The fitted line is

$$\widehat{\text{damage}} = 10.28 + 4.92 \times \text{dist}$$

Suppose we want to predict the mean amount of damage for fires 2 miles from the nearest fire station. In this case $x = 2$ and our prediction is $10.28 + 4.92 \times 2 = 20.12$.

Inference about Prediction

What if we want to predict the amount of damage of a burning house which is 2 miles from the nearest fire station? Still, the prediction is $10.28 + 4.92 \times 2 = 20.12$.

The predicted values are the same, but they have different standard errors. Individual burning houses which are 2 miles away from the fire station don't all have the same amount of damage, so the prediction for an individual amount of damage has a larger standard error than the prediction for the mean amount of damage.

CIs for the Mean Response

For a specific value of $x$, say $x^*$, the assumption is that $y$ comes from a $N(\mu_{y|x^*}, \sigma)$ distribution, where

$$\mu_{y|x^*} = \beta_0 + \beta_1 x^*$$

Plugging in our estimates of $\beta_0$ and $\beta_1$, $\mu_{y|x^*}$ is estimated by

$$\hat{\mu}_{y|x^*} = b_0 + b_1 x^*$$

and a level $1 - \alpha$ confidence interval for the mean response $\mu_{y|x^*}$ is given by

$$\hat{\mu}_{y|x^*} \pm t^* SE_{\hat{\mu}}$$

where $t^*$ is the upper $\alpha/2$ critical value of the $t_{n-2}$ distribution and

$$SE_{\hat{\mu}} = s\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$

Prediction Interval for a Future Observation

Suppose we want to predict a specific observation value at $x = x^*$. At each $x^*$, $y \sim N(\mu_{y|x^*}, \sigma)$. We want to predict a $y$ drawn from this distribution. Our best guess is the estimated mean of the distribution:

$$\hat{y} = \hat{\mu}_{y|x^*} = b_0 + b_1 x^*$$

How accurate is this estimate? The error here will be larger than the error for the mean response, $SE_{\hat{\mu}}$, because there is error in estimating $\mu_{y|x^*}$ as well as error in drawing a value from the normal distribution $N(\mu_{y|x^*}, \sigma)$.

A level $1 - \alpha$ prediction interval for a future observation $y$ corresponding to $x^*$ is given by

$$\hat{y} \pm t^* SE_{\hat{y}}$$

where $t^*$ is the upper $\alpha/2$ critical value of the $t_{n-2}$ distribution and

$$SE_{\hat{y}} = s\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$

Analysis of Variance for Regression

Analysis of variance is the term for statistical analyses that break down the variation in data into separate pieces that correspond to different sources of variation. In the regression setting, the observed variation in the responses comes from two sources:

- As the explanatory variable $x$ changes, it pulls the response with it along the regression line. This is the variation along the line, or regression sum of squares:
  $$SS_{Regression} = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$$
- When $x$ is held fixed, $y$ still varies, because not all individuals who share a common $x$ have the same response $y$. This is the variation about the line, or residual sum of squares:
  $$SS_{Residual} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

The ANOVA Equation

It turns out that $SS_{Residual}$ and $SS_{Regression}$ together account for all the variation in $y$:

$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

$$SS_{Total} = SS_{Regression} + SS_{Residual}$$

The degrees of freedom break down in a similar manner:

$$n - 1 = 1 + (n - 2)$$

Dividing a sum of squares by its degrees of freedom gives a mean square. For example,

$$MS_{Residual} = \frac{SS_{Residual}}{df_{Residual}} = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - 2} = s^2$$

The ANOVA F Statistic

As an alternative test of the hypothesis $H_0: \beta_1 = 0$, we use the F statistic

$$F = \frac{MS_{Regression}}{MS_{Residual}} = \frac{SS_{Regression}/df_{Regression}}{SS_{Residual}/df_{Residual}}$$

Under $H_0$, $F \sim F_{1,n-2}$, where $F_{1,n-2}$ is an F distribution with 1 and $n - 2$ degrees of freedom.

$$R^2 = \frac{\text{Regression SS}}{\text{Total SS}}$$

The Regression Fallacy

Sir Francis Galton (1822-1911), who was the first to apply regression to biological and psychological data, looked at examples such as the
heights of children versus the heights of their parents. He found that taller-than-average parents tended to have children who were also taller than average, but not as tall as their parents. Galton called this fact "regression toward mediocrity".

As another example, students who score at the bottom on the first exam in a course are likely to do better on the second exam. Is it because they work harder?

Inference for Two-Way Tables

• Two-way table for categorical dataset
• Chi-square test for two-way table
• Models for two-way tables
  - Examining independence between variables
  - Comparing several populations

Example: Background music and consumer behavior

In a study conducted in a Northern Ireland supermarket, researchers counted the number of bottles of French, Italian, and other wine purchased while shoppers were subject to one of three treatments: no music, French accordion music, and Italian string music. The following two-way table summarizes the data:

                  Music
Wine       None   French   Italian   Total
French       30       39        30      99
Italian      11        1        19      31
Other        43       35        35     113
Total        84       75        84     243

Example (cont.)

The table of counts looks suspiciously like the joint distribution tables we studied earlier. Indeed, from these counts we can ascertain the empirical joint distribution, marginal distributions, and conditional distributions of wine type and music type:

                  Music
Wine       None    French   Italian   Total
French     0.123    0.160     0.123   0.407
Italian    0.045    0.004     0.078   0.128
Other      0.177    0.144     0.144   0.465
Total      0.346    0.309     0.346   1.000

We are interested in determining whether there is a relationship between the row variable (wine type) and the column variable (music type). If this were the true distribution, then the answer would be clear: music and wine are not independent, so there is a relationship. However, this table is random, and we want to know whether or not music and wine are independent under the true distribution. This requires a statistical test.

The $\chi^2$ Test for an r x c Table

Hypotheses

• $H_0$: the row and column variables are independent, i.e. there is no relationship between the two.
• $H_a$: the row and column variables are dependent.

Intuition for the Test

Suppose $H_0$ is true and the two variables are independent. What counts would we expect to observe? Recall that under the independence assumption,

  $P(A \cap B) = P(A) P(B)$

Thus, for each cell we have

  Expected Cell Count = (row total x column total) / total count

Our test will be based on a measure of how far the observed table is from the expected table.

Example (cont.)

For the supermarket example, the expected counts are

                  Music
Wine       None    French   Italian   Total
French     34.22    30.56     34.22      99
Italian    10.72     9.57     10.72      31
Other      39.06    34.88     39.06     113
Total      84       75        84        243

The $X^2$ (Chi-Squared) Statistic

To measure how far this expected table is from the observed table, we will use the following test statistic:

  $X^2 = \sum_{\text{all cells}} \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$

What does the $\chi^2$ distribution look like?

[Figure: chi-squared densities for several degrees of freedom]

Unlike the Normal or t distributions, the $\chi^2$ distribution takes values in $(0, \infty)$. As with the t distribution, the exact shape of the $\chi^2$ distribution depends on its degrees of freedom.

The $\chi^2$ Distribution

Under $H_0$, the $X^2$ test statistic has an approximate $\chi^2$ distribution with $(r-1)(c-1)$ degrees of freedom, denoted $\chi^2_{(r-1)(c-1)}$.

Why $(r-1)(c-1)$? Recall that our expected table is based on some quantities estimated from the data, namely the row and column totals. Once these totals are known, filling in $(r-1)(c-1)$ undetermined table entries (for example, everything outside the last row and last column) gives us the whole table. Thus there are only $(r-1)(c-1)$ freely varying quantities in the table.

p-Value for the $\chi^2$ Test

If the observed and expected counts are very different, $X^2$ will be large, indicating evidence against $H_0$. Thus the p-value is always based on the right-hand tail of the distribution. There is no notion of a two-tailed test in this context. The p-value is therefore

  $P(\chi^2_{(r-1)(c-1)} \ge X^2)$

Recall that $X^2$ has an approximate $\chi^2_{(r-1)(c-1)}$ distribution. When is the approximation valid? For any two-way table larger than 2 x 2, we require that the average expected
cell count is at least 5 and each expected count is at least 1. For 2 x 2 tables, we require that each expected count be at least 5.

Example (cont.)

Let's get back to our example. Recall the observed and expected counts:

Observed
                  Music
Wine       None   French   Italian   Total
French       30       39        30      99
Italian      11        1        19      31
Other        43       35        35     113
Total        84       75        84     243

Expected
                  Music
Wine       None    French   Italian   Total
French     34.22    30.56     34.22      99
Italian    10.72     9.57     10.72      31
Other      39.06    34.88     39.06     113
Total      84       75        84        243

  $X^2 = \frac{(30 - 34.22)^2}{34.22} + \frac{(39 - 30.56)^2}{30.56} + \frac{(30 - 34.22)^2}{34.22} + \cdots + \frac{(35 - 34.88)^2}{34.88} + \frac{(35 - 39.06)^2}{39.06} = 18.28$

The table is 3 x 3, so there are $(r-1)(c-1) = 2 \times 2 = 4$ degrees of freedom. Finally, the p-value is found from the $\chi^2$ table:

  $0.001 \le P(\chi^2_4 \ge 18.28) \le 0.002$

Models for Two-Way Tables

The $\chi^2$ test for the presence of a relationship between the row and column variables of a two-way table is valid for data produced by several different study designs, although the exact null hypothesis varies.

1. Examining independence between variables

Suppose we select an SRS of size $n$ from a population and classify each individual according to 2 categorical variables. Then a $\chi^2$ test can be used to test

  $H_0$: The two variables are independent
  $H_a$: Not independent

Example: Suppose we collect an SRS of 114 college students and categorize each by major and GPA range (e.g. 0-0.5, 0.5-1, ..., 3.5-4). Then we can use a $\chi^2$ test to ascertain whether grades and major are independent.

2. Comparing several populations

Suppose we select independent SRSs from each of $c$ populations, of sizes $n_1, n_2, \ldots, n_c$. We then classify each individual according to a categorical response variable with $r$ possible values (the same across populations). This yields an r x c table, and a $\chi^2$ test can be used to test

  $H_0$: Distribution of the response variable is the same in all populations
  $H_a$: Distributions of response variables are not all the same

Example: Suppose we select independent SRSs of Psychology, Biology, and Math majors, of sizes 40, 39, and 35, and classify each individual by GPA range. Then we can use a $\chi^2$ test to ascertain whether or not the distribution of grades is the same in all three populations.

Example: Literary Analysis (Rice, 1995)

When Jane Austen died, she left the novel Sanditon only partially completed; an imitator later finished it. The table below gives the counts of several words in chapters from various works: Sense and Sensibility, Emma, and Sanditon I (Austen's portion), and Sanditon II (the imitator's portion).

                     Austen                      Imitator
Word       Sense and     Emma    Sanditon I    Sanditon II
           Sensibility
a              147        186        101            83
an              25         26         11            29
this            32         39         15            15
that            94        105         37            22
with            59         74         28            43
without         18         10         10             4
TOTAL          375        440        202           196

Question 1: Is there consistency in Austen's work, i.e. do the frequencies with which Austen used these words change from work to work?

Answer: $X^2 = 12.27$, df = ____, p-value = ____

Question 2: Was the imitator successful, i.e. are the frequencies of the words the same in Austen's work and the imitator's work?

Tests of Significance

Outline

• General Procedure for Hypothesis Testing
  - Null and Alternative Hypotheses
  - Test Statistics
  - p-values
• Interpretation of the Significance Level
• Tests for a Population Mean
• Interpretation of p-values
• Statistical vs. Practical Significance
• Confidence Intervals and Hypothesis Tests
• Potential Abuses of Tests

Testing Hypotheses

A hypothesis test is an assessment of the evidence provided by the data in favor of (or against) some claim about the population.

For example, suppose we perform a randomized experiment or take a random sample and calculate some sample statistic, say the sample mean. We want to decide if the observed value of the sample statistic is consistent with some hypothesized value of the corresponding population parameter. If the observed and hypothesized values differ (as they almost certainly will), is the difference due to an incorrect hypothesis or merely due to chance variation?

A confidence interval is a very useful statistical inference tool when the goal is to estimate a population parameter. When the goal is to assess the evidence provided by the data in favor of some claim about the population, tests of significance are used.

Example: Filling Coke Bottles

A machine at a Coke production plant is designed to fill bottles with 16 oz of Coke. The actual amount varies slightly from bottle to bottle. From past experience it
is known that the SD = 0.2 oz. An SRS of 100 bottles filled by the machine has a mean of 15.94 oz per bottle. Is this evidence that the machine needs to be recalibrated, or could this difference be a result of random variation?

General Procedure for Hypothesis Testing

1. Formulate the null hypothesis and the alternative hypothesis.

• The null hypothesis $H_0$ is the statement being tested. Usually it states that the difference between the observed value and the hypothesized value is only due to chance variation. For example, $\mu = 16$ oz.
• The alternative hypothesis $H_a$ is the statement we will favor if we find evidence that the null hypothesis is false. It usually states that there is a real difference between the observed and hypothesized values. For example, $\mu \ne 16$, $\mu > 16$, or $\mu < 16$.

A test is called

• two-sided if $H_a$ is of the form $\mu \ne 16$
• one-sided if $H_a$ is of the form $\mu > 16$ or $\mu < 16$

Example: GRE Scores

The mean score of all examinees on the Verbal and Quantitative sections of the GRE is about 1040. Suppose 50 randomly sampled UC Berkeley graduate students have a mean GRE V+Q score of 1310. We are interested in determining if a mean GRE V+Q score of 1310 gives evidence that, as a whole, Berkeley graduate students have a higher mean GRE score than the national average.

What is $H_0$? What is $H_a$?

General Procedure for Hypothesis Testing (cont.)

2. Calculate the test statistic on which the test will be based.

The test statistic measures the difference between the observed data and what would be expected if the null hypothesis were true. When $H_0$ is true, we expect the estimate based on the sample to take a value near the parameter value specified by $H_0$.

Our goal is to answer the question: how extreme is the value calculated from the sample, compared with what we would expect under the null hypothesis?

For the Coke example, the mean of the sample is 15.94 oz, and the population mean specified by the null hypothesis is 16 oz. A test statistic is

  $z = \frac{15.94 - 16}{0.2 / \sqrt{100}} = -3$

We'll have more to say about this in a moment.

In many common
situations the test statistic has the form

  (estimate - hypothesized value) / (standard deviation of the estimate)

3. Find the p-value of the observed result.

• The p-value is the probability of observing a test statistic as extreme or more extreme than actually observed, assuming the null hypothesis $H_0$ is true.
• The smaller the p-value, the stronger the evidence against the null hypothesis.
• If the p-value is as small or smaller than some number $\alpha$ (e.g. 0.01, 0.05), we say that the result is statistically significant at level $\alpha$.
• $\alpha$ is called the significance level of the test.

In the case of the Coke example, p = 0.0013 for a one-sided test or p = 0.0026 for a two-sided test. Once again, we'll have more to say about this in a moment.

Interpretation of the Significance Level

To perform a test of significance level $\alpha$, we perform the previous three steps and then reject $H_0$ if the p-value is less than or equal to $\alpha$.

The following outcomes are possible when conducting a test:

                        Reality
Our decision    H0 true         Ha true
H0              correct         Type II error
Ha              Type I error    correct

Suppose $H_0$ is actually true. If we draw many samples and perform a test for each one, a fraction $\alpha$ of these tests will incorrectly reject $H_0$. In other words, $\alpha$ is the probability that we will make a Type I error.

Type II error is related to the notion of the power of a test, which we will discuss later.

Example: An Exact Binomial Test

In the last 51 World Series (through 2003), there have been 24 seven-game series. Suppose we wish to test the hypothesis

  $H_0$: Games within a World Series are independent, with each team having probability 1/2 of winning.

For the alternative hypothesis, let's use the generic

  $H_a$: The model in $H_0$ is incorrect.

Let $X$ denote the number of games in the World Series. Under $H_0$, $X$ has the following distribution:

  x           4      5      6       7
  P(X = x)   1/8    1/4    5/16    5/16

For our test statistic, let's just use $M$ = the number of seven-game series. What is the p-value? We need to find $m$ such that $P_{H_0}(M \ge m) \approx 0.05$.

Assuming different years' World Series are independent (i.e. that the last 51 World Series are an SRS from the population of World Series), the number of seven-game series in 51 trials is $B(51, 5/16)$:

  $P(M \ge 21) = 0.086$
  $P(M \ge 22) = 0.049$

We want to have a significance level of no more than $\alpha = 5\%$, so we reject $H_0$ when $M \ge 22$, that is, when $M$ exceeds the critical value 21.

Do we reject $H_0$ at significance level $\alpha = 0.05$? This is just a matter of checking whether our observed value of $M = 24$ exceeds the critical value 21. It does, so we reject $H_0$.

Tests for a Population Mean

In the preceding example we were able to perform an exact Binomial test. Frequently an exact test is impractical, but we can use the approximate normality of means to conduct an approximate test.

Suppose we want to test the hypothesis that $\mu$ has a specific value:

  $H_0: \mu = \mu_0$

Since $\bar{x}$ estimates $\mu$, the test is based on $\bar{x}$, which has a (perhaps approximately) Normal distribution. Thus

  $z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$

is a standard normal random variable under the null hypothesis.

p-values for different alternative hypotheses:

• $H_a: \mu > \mu_0$: the p-value is $P(Z \ge z)$ (area of right-hand tail)
• $H_a: \mu < \mu_0$: the p-value is $P(Z \le z)$ (area of left-hand tail)
• $H_a: \mu \ne \mu_0$: the p-value is $2P(Z \ge |z|)$ (area of both tails)

Example: Filling Coke Bottles (cont.)

We are interested in assessing whether or not the machine needs to be recalibrated, which will be the case if it is systematically over- or under-filling bottles. Thus we will use the hypotheses

  $H_0: \mu = 16$
  $H_a: \mu \ne 16$

Recall that $\bar{x} = 15.94$, $\sigma = 0.2$, and $n = 100$. Thus

  $z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{15.94 - 16}{0.2 / \sqrt{100}} = -3$

The p-value for a two-sided test is

  $p = 2P(Z \ge 3) = 0.0026$

If $\alpha = 0.01$, we reject $H_0$. If $\alpha = 0.05$, we reject $H_0$.

Example: TV Tubes

TV tubes are taken at random and the lifetime measured: $n = 100$, $\sigma = 300$, and $\bar{x} = 1265$ days. Test whether the population mean is 1200 or greater than 1200.

  $H_0: \mu = 1200$
  $H_a: \mu > 1200$

Under $H_0$, $\bar{x} \sim N(1200, 30)$, so

  $z = \frac{\bar{x} - 1200}{30} \sim N(0, 1)$ under $H_0$

The test statistic is

  $z = \frac{1265 - 1200}{30} = 2.17$

and the p-value is $P(Z \ge 2.17 \mid H_0) = 0.015$. This is evidence against $H_0$ at significance level 0.05, so we reject $H_0$. That is, we conclude that the average lifetime of TV tubes is greater than 1200 days.

Confidence Intervals and Hypothesis Tests

A level $\alpha$ two-sided test rejects a hypothesis $H_0: \mu = \mu_0$ exactly when the value of $\mu_0$ falls outside a $1 - \alpha$ confidence interval for $\mu$.

For example, consider a two-sided test of the following hypotheses

  $H_0: \mu = \mu_0$
  $H_a: \mu \ne \mu_0$

at the significance level $\alpha = 0.05$.

• If $\mu_0$ is a value inside the 95% confidence interval for $\mu$, then this test will have a p-value greater than
0.05 and therefore will not reject $H_0$.
• If $\mu_0$ is a value outside the 95% confidence interval for $\mu$, then this test will have a p-value smaller than 0.05 and therefore will reject $H_0$.

A Rough Interpretation of p-values

p-value              Interpretation
p > 0.10             no evidence against H0
0.05 < p <= 0.10     weak evidence against H0
0.01 < p <= 0.05     evidence against H0
p <= 0.01            strong evidence against H0

Statistical vs. Practical Significance

Saying that a result is statistically significant does not signify that it is large or necessarily important. That decision depends on the particulars of the problem. A statistically significant result only says that there is substantial evidence that $H_0$ is false.

Failure to reject $H_0$ does not imply that $H_0$ is correct. It only implies that we have insufficient evidence to conclude that $H_0$ is incorrect.

Example

A particular area contains 8000 condominium units. In a survey of the occupants, a simple random sample of size 100 yields the information that there are 160 motor vehicles in the sample, giving an average number of motor vehicles per unit of 1.6, with a sample standard deviation of 0.8. Construct a confidence interval for the total number of vehicles in the area.

The city claims that there are only 11000 vehicles in the area, so there is no need for a new garage. What do you think?

More on Constructing Hypothesis Tests

Hypotheses always refer to some population or model, not to a particular outcome. As a result, $H_0$ and $H_a$ must be expressed in terms of some population parameter or parameters.

$H_a$ typically expresses the effect that we hope to find evidence for, so $H_a$ is usually carefully thought out first. We then set up $H_0$ to be the case when the hoped-for effect is not present.

It is not always clear whether $H_a$ should be one-sided or two-sided, i.e. does the parameter differ from its null hypothesis value in a specified direction?

Note: You are not allowed to look at the data first and then frame $H_a$ to fit what the data show.

Potential Abuses of Tests

In many applications a researcher constructs a
null hypothesis with the intent of discrediting it. For example:

• $H_0$: new drug has the same effect as placebo
• $H_0$: men and women are paid equally

A small p-value can help a drug company get a drug approved by the FDA. Similarly, a researcher may have an easier time publishing results if the p-value is smaller than 0.05. Because of that, we have to be aware of the following potential abuses:

• Using one-sided tests to make the p-value one-half as big
• Conducting repeated sampling and testing, and reporting only the lowest p-value
• Testing many hypotheses, or testing the same hypothesis on many different subgroups

In the last two, even if there is actually no effect, you will probably get at least one small p-value.

Normal Approximation to the Binomial
Central Limit Theorem

• Binomial Calculations for Compound Events
• The Normal Approximation to the Binomial
• Parameters of the Approximating Distribution
• Behavior of the Approximation as a Function of p
• Calculations with the Normal Approximation
• The Continuity Correction
• Sampling Distributions
• The Mean and
Standard Deviation of $\bar{x}$
• The Central Limit Theorem
• Normal Approximation to the Binomial Revisited

Shopping Revisited

Let $p$, where $0 < p < 1$, be the proportion of American adults that find shopping frustrating. What is the proportion of American adults that do not find shopping frustrating?

We take a simple random sample of $n$ people from this population: $Y_1, Y_2, \ldots, Y_n$, where $Y_i = 1$ if the $i$th individual in the sample finds shopping frustrating and $Y_i = 0$ otherwise.

What is the expected value of $Y_1$?

  $\mu_Y = E(Y_1) = 0 \cdot (1 - p) + 1 \cdot p = p$

What is the expected value of $Y_i$, where $1 \le i \le n$?

What is the variance of $Y_1$?

  $E[(Y_1 - \mu_Y)^2] = (0 - p)^2 (1 - p) + (1 - p)^2 p = p^2 (1 - p) + (1 - p)^2 p = p(1 - p)[p + (1 - p)] = p(1 - p)$

Let $X = Y_1 + Y_2 + \cdots + Y_n = \sum_{i=1}^n Y_i$.

What is $E(X)$?

  $\mu_X = E(X) = E(Y_1 + Y_2 + \cdots + Y_n) = E(Y_1) + E(Y_2) + \cdots + E(Y_n) = \sum_{i=1}^n E(Y_i) = np$

What is $Var(X)$?

  $Var(X) = Var(Y_1 + Y_2 + \cdots + Y_n) = Var(Y_1) + Var(Y_2) + \cdots + Var(Y_n) = \sum_{i=1}^n Var(Y_i) = np(1 - p)$

What is the distribution of $X$?

  $X \sim B(n, p)$

Recall that the equation for binomial probabilities is

  $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$

Binomial Calculations for Compound Events

For a compound event such as $P(X \le k)$, the probability is given by

  $P(X \le k) = \sum_{j=0}^{k} P(X = j)$

Shopping Continued

Suppose we draw an SRS of 1500 American adults and each individual is interviewed. If $p = 0.12$, what is the expected number of people in the sample that find shopping frustrating?

  $E(X) = np = 1500(0.12) = 180$

What is the probability that the sample contains 170 or fewer people that find shopping frustrating?

  $P(X \le 170) = \sum_{j=0}^{170} P(X = j) = \sum_{j=0}^{170} \binom{1500}{j} 0.12^j \, 0.88^{1500 - j}$

That's pretty ugly. Is there an easier way?

The Normal Approximation to the Binomial

It turns out that as $n$ gets larger, the Binomial distribution looks increasingly like the Normal distribution. Consider the following Binomial histograms, each representing 10000 samples from a Binomial distribution:

[Figure: histograms of samples from Binomial distributions]

Parameters of the Approximating Distribution

The approximating Normal distribution has the same mean and standard deviation as the underlying Binomial distribution. Thus, if $X \sim B(n, p)$, having mean $E(X) = np$ and standard deviation $SD(X) = \sqrt{np(1 - p)}$, it is approximated by a Normal distribution:

  $X$ is approximately $N(\mu_X = np, \; \sigma_X = \sqrt{np(1 - p)})$

$X$ is the count of successes. What about $\hat{p} = X / n$, the sample proportion of successes?

  $\hat{p}$ is approximately $N(\mu_{\hat{p}} = p, \; \sigma_{\hat{p}} = \sqrt{p(1 - p) / n})$

When is the approximation appropriate?

The farther $p$ is from 1/2, the larger $n$ needs to be for the approximation to work. Thus, as a rule of thumb, only use the approximation if

  $np \ge 10$ and $n(1 - p) \ge 10$

Calculations with the Normal Approximation

Recall the problem we set out to solve: $P(X \le 170)$, where $X \sim B(1500, 0.12)$. How do we calculate this using the Normal approximation?

If we were to draw a histogram of the $B(1500, 0.12)$ distribution with bins of width one, $P(X \le 170)$ would be represented by the total area of the bins spanning

  $(-0.5, 0.5), (0.5, 1.5), \ldots, (169.5, 170.5)$

Thus, using the approximating Normal distribution $Y \sim N(180, 12.59)$, we calculate

  $P(X \le 170) \approx P(Y \le 170.5) = 0.2253$

For reference, the exact Binomial probability is 0.2265, so the approximation is apparently pretty good.

The Continuity Correction

The addition of 0.5 in the previous slide is an example of the continuity correction, which is intended to refine the approximation by accounting for the fact that the Binomial distribution is discrete while the Normal distribution is continuous.

In general, we make the following adjustments:

  $P(X \le x) \approx P(Y \le x + 0.5)$
  $P(X < x) = P(X \le x - 1) \approx P(Y \le x - 0.5)$
  $P(X \ge x) \approx P(Y \ge x - 0.5)$
  $P(X > x) = P(X \ge x + 1) \approx P(Y \ge x + 0.5)$

Sampling Distributions

The Normal approximation to the Binomial distribution is in fact a special case of a more general phenomenon. The general reason for this phenomenon depends on the notion of a sampling distribution.

Consider the following setup: we observe a sample of size $n$ from some population and compute the mean $\bar{x}$.

In lecture 5 we defined the sampling distribution of a statistic to be the distribution of the statistic in all possible samples of the same size from the same population. If we
repeatedly drew samples of size $n$ and calculated $\bar{x}$, we could ascertain the sampling distribution of $\bar{x}$.

The Mean and Standard Deviation of $\bar{X}$

What are the mean and standard deviation of $\bar{X}$? Let's be more specific about what we mean by a sample of size $n$. We consider the sample to be a collection of $n$ independent and identically distributed (iid) random variables $X_1, X_2, \ldots, X_n$ with common mean $\mu$ and common standard deviation $\sigma$. Thus

  $E(\bar{X}) = E\left(\frac{1}{n} \sum_{i=1}^n X_i\right) = \frac{1}{n} \sum_{i=1}^n E(X_i) = \frac{1}{n} \, n\mu = \mu$

  $Var(\bar{X}) = Var\left(\frac{1}{n} \sum_{i=1}^n X_i\right) = \frac{1}{n^2} \sum_{i=1}^n Var(X_i) = \frac{1}{n^2} \, n\sigma^2 = \frac{\sigma^2}{n}$

  $SD(\bar{X}) = \sqrt{Var(\bar{X})} = \frac{\sigma}{\sqrt{n}}$

A Word on Notation

The thing to keep in mind is that $\bar{x}$ is a fixed number, while $\bar{X}$ is a random variable. The authors of the book are not very careful in making this distinction, and they denote the random variable by $\bar{x}$.

The Central Limit Theorem

Now we know that $\bar{X}$ has mean $\mu$ and standard deviation $\sigma / \sqrt{n}$, but what is its distribution?

If $X_1, X_2, \ldots, X_n$ are Normally distributed, then $\bar{X}$ is also Normally distributed. Thus

  $X_i \sim N(\mu, \sigma) \implies \bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$

If $X_1, X_2, \ldots, X_n$ are not Normally distributed, then the Central Limit Theorem tells us that $\bar{X}$ is approximately Normal.

The Central Limit Theorem: Suppose $X_1, X_2, \ldots, X_n$ are iid random variables with mean $\mu$ and finite standard deviation $\sigma$. If $n$ is sufficiently large, the sampling distribution of $\bar{X}$ is approximately Normal with mean $\mu$ and standard deviation $\sigma / \sqrt{n}$.

Normal Approximation to the Binomial Revisited

What does all this have to do with the Normal approximation to the Binomial? An observation from a Binomial distribution, $Y$, is actually the sum of $n$ independent observations from a simpler distribution, the Bernoulli distribution.

A Bernoulli random variable $X$ takes the value 1 with probability $p$ or the value 0 with probability $1 - p$, and has $E(X) = p$ and $SD(X) = \sqrt{p(1 - p)}$.

Letting $X_1, \ldots, X_n$ be $n$ iid Bernoulli random variables,

  $Y = \sum_{i=1}^n X_i = n\bar{X}$

According to the CLT, $\bar{X}$ has approximately a $N(\mu, \sigma / \sqrt{n})$ distribution, where $\mu = p$ and $\sigma = \sqrt{p(1 - p)}$. It turns out that $n\bar{X}$ is also approximately Normal, and it has

  $E(n\bar{X}) = n\mu = np$
  $Var(n\bar{X}) = n^2 Var(\bar{X}) = n^2 \frac{\sigma^2}{n} = n\sigma^2 = np(1 - p)$

Thus, in general, if $X_1, \ldots, X_n$ are iid random variables with mean $\mu$ and standard deviation $\sigma$, then

  $\sum_{i=1}^n X_i$ is approximately $N(n\mu, \sqrt{n} \, \sigma)$

Least-squares regression
Cautions about correlation and regression

Outline

• Least-squares regression
  - Equations of regression line (slope, intercept)
  - Residuals and residual plot
  - Outliers and influential observations
• Cautions about correlation and regression

Least-Squares Regression

Regression describes the relationship between two variables in the situation where one variable can be used to explain or predict the other.

The regression line is a straight line that describes how a response variable $y$ changes as an explanatory variable $x$ changes.

Fitting the Regression Line to Data

Since we intend to predict $y$ from $x$, the errors of interest are mispredictions of $y$ for a fixed $x$.

The least-squares regression line of $y$ on $x$ is the line that minimizes the sum of squared errors. This is the least-squares criterion.

Given pairs of observations $(x_1, y_1), \ldots, (x_n, y_n)$, the regression line is given by

  $\hat{y} = a + bx$

where

  $b = r \frac{s_y}{s_x}$ and $a = \bar{y} - b\bar{x}$

Interpreting the Regression Model

• The response in the model is denoted $\hat{y}$ to indicate that these are predicted $y$ values, not the true (observed) $y$ values. The hat denotes prediction.
• The slope of the line indicates how much $\hat{y}$ changes for a unit change in $x$.
• The intercept is the value of $\hat{y}$ for $x = 0$. It may or may not have a physical interpretation, depending on whether or not $x$ can take values near 0.
• To make a prediction for an unobserved $x$, just plug it in and calculate $\hat{y}$.

Note that the line need not pass through the observed data points. In fact, it often will not pass through any of them.

Facts about Least-Squares Regression

• The distinction between explanatory and response variables is essential. Looking at vertical deviations means that changing the axes would change the regression line.
• A change of 1 sd in $x$ corresponds to a change of $r$ sds in $\hat{y}$.
• The least-squares regression line always passes through the point $(\bar{x}, \bar{y})$.
• $r^2$ (the square of the correlation) is the fraction of the variation in the values of $y$ that is explained by the least-squares regression on $x$. When reporting
the results of a linear regression, you should report $r^2$.

These properties depend on the least squares fitting criterion and are one reason why that criterion is used.

Residual Plots

A residual plot is a scatterplot of the residuals against the explanatory variable. It can be used to assess the fit of the regression line. Patterns to look for:
- Curvature indicates that the relationship is not linear.
- Increasing or decreasing spread indicates that the prediction will be less accurate in the range of explanatory variables where the spread is larger.
- Points with large residuals are outliers in the vertical direction.
- Points that are extreme in the x direction are potential high-influence points.

Influential observations are individuals with extreme x values that exert a strong influence on the position of the regression line. Removing them would significantly change the regression line.

Residuals

Residuals are the vertical distances between the data points and the corresponding predicted values:

$r_i = \text{observed } y - \text{predicted } y = y_i - \hat{y}_i = y_i - (a + b x_i)$

For a least squares regression, the residuals always have mean zero.

A Regression Example

Consider the following data on unemployment rate and unemployment expenditure for several countries. Cells marked ? are illegible in this copy, and some digits in the expenditure column (3 and 8 in particular) may be misread.

Country   Unemp. rate   Unemp. exp.
swz        0.5           0.16
?          1.4           0.19
swd        1.6           0.72
jap        2.1           0.34
aut        3.3           0.38
fin        3.4           0.66
por        4.?           ?
ger        4.7           1.17
nor        5.2           1.20
us         5.4           0.47
uk         6.8           0.38
?          7.0           0.42
aus        7.0           1.01
bel        7.6           1.99
nl         7.8           2.25
nz         7.9           1.34
can        8.1           1.73
fr         8.9           1.34
den        9.7           3.22
?         10.3           0.40
?         13.8           2.79
sp        15.9           2.43

Summary Statistics: $\bar{x} = 6.5$, $\bar{y} = 1.20$ (the remaining summary values are not legible).

Regression Example (cont.)

Regression coefficients:

$b = r \frac{s_y}{s_x} = 0.168$

$a = \bar{y} - b\bar{x} = 1.20 - 0.168 \times 6.5 = 0.108$

[Figures: scatterplot of unemployment expenditure against unemployment rate; residual plot against unemployment rate]

Cautions about Correlation and Regression (cont.)

Lurking variables are variables not among the explanatory or response variables in a study that may influence the interpretation of relationships among the measured
variables. Lurking variables may falsely suggest a relationship when there is none, or may mask a real relationship.

Association is not causation. Two variables may be correlated because both are affected by some other measured or unmeasured variable. Example: Nations with more TV sets have higher life expectancies. Do TVs cause longer life? What's the real explanation?

Even if there is a causal relationship, it only makes sense in one direction. Sometimes the direction is obvious (e.g. if there is a time lag), but not always. Consider, for example, the high correlation between self-esteem and success in school or work.

Cautions about Correlation and Regression
- Correlation and regression describe only linear relationships.
- They are not resistant: outliers and influential observations can change them substantially.

Extrapolation is the use of a regression line for prediction far outside the range of x values used to obtain the line. Such predictions are not to be trusted.

Averaging data smooths out fine-scale variation, leading to higher correlation. This phenomenon is called ecological correlation. Results obtained on averages should not be applied to individuals. Example: In the 1988 CPS, the correlation between income and education for men age 25-64 was about 0.4. Grouping data into nine census regions, averaging each variable within each region, and computing the correlation of the nine points yields $r \approx 0.7$.

Establishing Causal Relationships

The best way to establish a causal relationship is to conduct an experiment where values of one or several variables are manipulated and the effect on some outcome is observed. What if an experiment is not possible? There may be evidence for a causal relationship if:
- The association is strong.
- The association is consistent across multiple studies.
- Higher doses are associated with stronger responses.
- The alleged cause precedes the effect in time.
- The alleged cause is plausible, perhaps because of similar studies such as on animals.
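
The least squares formulas and the facts about residuals stated earlier in these notes can be checked numerically. The following is a minimal sketch, not part of the original notes: the function name least_squares_line and the five data points are invented for illustration (they are not the unemployment data), and sample standard deviations with divisor n - 1 are assumed.

```python
from math import sqrt


def least_squares_line(xs, ys):
    """Return (a, b, r) for the line y-hat = a + b*x.

    Uses the formulas b = r * s_y / s_x and a = ybar - b * xbar.
    """
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    # Sample standard deviations (divisor n - 1 is an assumption here).
    sx = sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    # Correlation: sum of products of deviations, scaled by (n - 1) * sx * sy.
    r = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    b = r * sy / sx
    a = ybar - b * xbar
    return a, b, r


# Hypothetical data, made up for this sketch.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b, r = least_squares_line(xs, ys)
xbar = sum(xs) / len(xs)
ybar = sum(ys) / len(ys)

# Residuals: observed y minus predicted y.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
mean_residual = sum(residuals) / len(residuals)

# Fraction of the variation in y explained by the line; should equal r squared.
ss_total = sum((y - ybar) ** 2 for y in ys)
ss_resid = sum(e ** 2 for e in residuals)
explained = 1 - ss_resid / ss_total

print(f"slope b = {b:.3f}, intercept a = {a:.3f}, r = {r:.3f}")
print(f"mean residual = {mean_residual:.2e}")
print(f"explained fraction = {explained:.3f}, r^2 = {r * r:.3f}")
print(f"line at xbar = {a + b * xbar:.3f}, ybar = {ybar:.3f}")
```

For these made-up points the slope works out to 0.6 and the intercept to 2.2. The mean residual is zero up to rounding error, the fitted line passes through the point of means, and the explained fraction of variation matches the squared correlation, as the facts listed in the notes say it should.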