Introduction to Statistics
Introduction to Statistics STAT 2
These 130 pages of class notes were uploaded by Floy Kub on Thursday, October 22, 2015. The class notes belong to STAT 2 at University of California - Berkeley, taught by Staff in Fall. Since their upload, they have received 33 views. For similar materials see /class/226739/stat-2-university-of-california-berkeley in Statistics at University of California - Berkeley.
Date Created: 10/22/15
STAT 2 Lecture 12: How accurate is regression?

Last time: calculating regression predictions
One method:
- Change x to standard units
- Predict y in standard units
- Predict y
Note: by far the best method for solving percentile problems.

Last time: calculating regression predictions
Regression formula:
$\text{Prediction for } y = \text{mean}(y) + (x - \text{mean}(x)) \times r \times \mathrm{SD}(y) / \mathrm{SD}(x)$

Last time
Regression toward the mean (the regression effect): if someone has a high percentile on x, and x and y are positively correlated, he/she will usually have a less high percentile on y.

Last time
Regression paradox: if x = 10 leads you to predict y = 20, then in general y = 20 will not lead you to predict x = 10.

Today
- Residuals
- Prediction error
- Regression and the normal distribution

I. Residuals

[Figure: How accurate is regression?]

Residuals
- True value of y minus predicted value of y
- Vertical distance from the regression line
- Residuals have mean 0

[Figure: How accurate is regression? Residual plot for the weight/height regression; x-axis: height]

Residual plots
- Plot residuals against x-values
- If the plot appears random, the regression is acceptable
- If there's a pattern in the plot, there's something the regression hasn't explained

Example: nonlinearity
- Residuals form a U shape
- The straight line is not matching the pattern in the original data
- Perhaps try a parabola
[Figure: residual plot]

Example: outlier
- The outlier is pulling the whole line off course
- Try the regression after dropping the outlier; explain the difference

Example: heteroscedasticity
- Residuals form a "funnel" shape
- SD is not constant with x
- Perhaps try taking the log of y
[Figure: residual plot]

More residual plots
- Can also plot residuals against index (order of data)
- If there's a pattern, there's something you haven't explained
[Figure: residual plot for the weight/height regression]

More residual plots
- Can also draw a box plot and/or histogram of the residuals
- Will reveal distribution and outliers, but not patterns

[Figure: How accurate is regression? Boxplot of residuals; histogram of residuals; x-axis: weight (lb)]
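The definition of a residual (true value of y minus predicted value of y) and the claim that residuals have mean 0 can be checked with a short sketch. The height and weight values below are made up purely for illustration and are not the lecture's data; the function names `summarize` and `predict` are likewise hypothetical. The prediction uses the regression formula from the "Last time" slide.

```python
import statistics

# Hypothetical heights (in) and weights (lb), invented for illustration only.
heights = [60, 62, 64, 65, 67, 70]
weights = [115, 140, 128, 150, 145, 172]


def summarize(xs, ys):
    """Return mean(x), mean(y), SD(x), SD(y) and the correlation r."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.pstdev(xs), statistics.pstdev(ys)
    # r is the average product of x and y after converting to standard units.
    r = statistics.mean(
        [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y) for x, y in zip(xs, ys)]
    )
    return mean_x, mean_y, sd_x, sd_y, r


def predict(x, mean_x, mean_y, sd_x, sd_y, r):
    """Regression formula: mean(y) + (x - mean(x)) * r * SD(y) / SD(x)."""
    return mean_y + (x - mean_x) * r * sd_y / sd_x


stats = summarize(heights, weights)
# Residual = true value of y minus predicted value of y.
residuals = [w - predict(h, *stats) for h, w in zip(heights, weights)]
mean_residual = statistics.mean(residuals)
print("mean of residuals is (numerically) zero:", abs(mean_residual) < 1e-9)
```

Plotting `residuals` against `heights` would give the residual plot described in the slides; plotting is left out here to keep the sketch dependency-free.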
What to look for
- Size of residuals
- Outliers
- Do the residuals have a normal distribution?

II. Regression and errors

How big is the typical error?
- Taking the mean of the residuals gives zero
- Instead, find the mean of the squared residuals, then take the square root (analogous to the SD)
- This is called prediction error, or RMS (root mean square) error

For the height/weight regression
- Find the residuals (weight minus predicted weight)
- Square them
- Find the mean of the squares
- Take the square root of that mean
We get an RMS error of 30 pounds.

RMS error
- RMS error will be approximately the true error of prediction if your regression model is right
- If your model is wrong, the true error may be much larger

Technical note
When making a prediction based on a sample, even if the model is truly linear, RMS error will be an underestimate of prediction error. You can fix this by adjusting for degrees of freedom. However, if your sample size is large, it doesn't make much difference.

Faster way of calculating RMS error
- RMS error = $\mathrm{SD}(y) \times \sqrt{1 - r^2}$
For the height/weight regression:
- SD(y) = 32.87 pounds
- r = 0.3967
- RMS error = $32.87 \times \sqrt{1 - 0.3967^2} = 30.2$ pounds

Example: Savings ratio
We have data for the savings ratio (aggregate personal savings divided by disposable income) for 50 countries. Can we model this using the percentage of the population over 75 years old?

[Figure: Savings ratio for 50 countries; x-axis: percentage of population over 75; y-axis: savings ratio]

Savings ratio
- Population over 75 (OLD) has mean 2.3 and SD 1.28
- Savings ratio (SR) has mean 9.67 and SD 4.44
- Correlation is 0.3165

Savings ratio
Regression line:
$SR = 9.67 + (OLD - 2.3) \times 0.3165 \times 4.44 / 1.28$
$\phantom{SR} = 9.67 + 1.099 \times OLD - 2.525$
$\phantom{SR} = 1.099 \times OLD + 7.15$

[Figure: Savings ratio: the regression line. Panels: savings ratio for 50 countries with regression line; residuals. x-axis: percentage of population over 75]
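The savings ratio regression line and the RMS shortcut can both be computed from the five summary statistics alone. The following is a minimal sketch using the numbers quoted in the slides; the function names `regression_line` and `rms_error` are hypothetical helpers introduced here, not part of the lecture.

```python
import math


def regression_line(mean_x, sd_x, mean_y, sd_y, r):
    """Slope and intercept of the regression line from summary statistics."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rms_error(sd_y, r):
    """Shortcut from the slides: RMS error = SD(y) * sqrt(1 - r^2)."""
    return sd_y * math.sqrt(1 - r ** 2)


# Summary statistics quoted in the slides for OLD (x) and SR (y).
slope, intercept = regression_line(
    mean_x=2.3, sd_x=1.28, mean_y=9.67, sd_y=4.44, r=0.3165
)
savings_rms = rms_error(sd_y=4.44, r=0.3165)

print(f"SR = {slope:.3f} * OLD + {intercept:.2f}")
print(f"RMS error = {savings_rms:.1f}")
```

Working from the rounded summary statistics gives a slope of about 1.098 rather than the 1.099 on the slide; the small gap is presumably because the slide used unrounded inputs.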
[Figure: Savings ratio residuals: boxplot of residuals; histogram of residuals]

[Figure: Savings ratio: identifying outliers. Panels: savings ratio for 50 countries; residuals, with Japan and Zambia labeled. x-axis: percentage of population over 75]

Savings ratio
RMS error = $\mathrm{SD}(SR) \times \sqrt{1 - r^2} = 4.44 \times \sqrt{1 - 0.3165^2} = 4.2$
Note: RMS error drops to 3.6 if we remodel without Zambia and Japan.

Savings ratio
- Somewhat iffy to use this model for prediction for other countries (not a random sample of countries)
- Can only safely predict in limited circumstances
- Better to think of it as a model to help you understand the relationship between the variables

III. The distribution of residuals

Savings ratio residuals
- Residuals often have an approximately normal distribution
- If you assume they are normal, you can calculate various chances

Example: Height/weight
- Mean weight 147.2 lb, SD 32.87 lb
- Regression line is: Pred. weight = $4.666 \times \text{height} - 158.4$
What's the chance that a 65-inch woman weighs over 140 pounds (i.e. what proportion of 65-inch women weigh over 140 pounds)?

Example: Height/weight
- Prediction is 144.9 pounds
- New SD for 65-inch women = RMS error = $32.87 \times \sqrt{1 - 0.3967^2} = 30.2$ pounds
- 65-inch-tall women have mean weight 144.9 lb with SD 30.2 lb

Example: Height/weight
Assuming a normal distribution:
- Change 140 pounds to standard units: $z = (140 - 144.9) / 30.2 = -0.16$
- $P(Z < -0.16) = 44\%$
- 56% chance that a 65-inch-tall woman weighs over 140 pounds

Example: Height/weight
What's the lower quartile for weights of 70-inch-tall women?
- Average weight for 70-inch women = $4.666 \times 70 - 158.4 = 168.2$ pounds
- New SD of their weights = 30.2 pounds (same as before)

Example: Height/weight
- Lower quartile corresponds to a z-score of $-0.67$
- Change into original units: $LQ = 168.2 - 0.67 \times 30.2 = 148$ pounds

If residuals aren't normal
- Look at the histogram of residuals
[Figure: Histogram of residuals]
- Find
percentiles of the residuals

If residuals aren't normal
e.g. 12% of our residuals are above 1. Say for x = 5, we predict y = 6. What's the chance that, if x = 5, y > 7?
Answer: roughly 12%.
You need a lot of residuals to do this accurately.

Recap: Residuals
- True value of y minus predicted value of y
- Vertical distance from the regression line
- Plot residuals to check for patterns, outliers, homo- or heteroscedasticity

RMS error
- Find the residuals
- Square them
- Find the mean of the squares
- Take the square root of that mean
Or, more simply, $\mathrm{SD}(y) \times \sqrt{1 - r^2}$.

Normal distribution for residuals
- Draw a histogram of the residuals to check for a normal distribution
- If normal, can use the new SD (= RMS error) to find probabilities or percentiles of y given x

Tomorrow
Case study: Predicting WNBA results (or: Why prediction is hard)

STAT 2 Lecture 19: Binomial probabilities

This week
- The binomial formula
- The law of averages
- Expected value and standard error
- The normal approximation

I. A reminder: tossing coins

Tossing coins
I toss three coins. What's the probability that exactly two out of three are heads?

Draw a tree
[Figure: tree diagram for the 1st, 2nd and 3rd toss; each branch has probability 1/2]

Outcome   Prob.
HHH       1/8
HHT       1/8
HTH       1/8
HTT       1/8
THH       1/8
THT       1/8
TTH       1/8
TTT       1/8

Using the tree
- $P(HHT) = 1/2 \times 1/2 \times 1/2 = 1/8$
- $P(HTH) = 1/2 \times 1/2 \times 1/2 = 1/8$
- $P(THH) = 1/2 \times 1/2 \times 1/2 = 1/8$
- $P(\text{2 heads out of 3}) = 3/8$
Note that there are three ways of getting two heads in three tosses. Each way has equal probability.

Why are there 3 ways?
- Number of tosses is fixed at 3
- There need to be two heads and one tail (can only get head or tail)
- The tail can fall on any of the three tosses
- So there are three places the tail can be

Why are the probabilities the same for each way?
- Probability of getting a head doesn't change from toss to toss
- Tosses are independent
- Each way has probability $1/2 \times 1/2 \times 1/2$

II. Tossing a biased coin

Tossing a biased coin
I have a biased coin which has a 60% chance of coming up heads. I toss it three times. What's the probability that exactly two out of three are
heads?

Drawing the tree
[Figure: tree diagram; each toss branches to H with probability .6 and T with probability .4]

Outcome   Calculation      Prob.
HHH       .6 × .6 × .6     .216
HHT       .6 × .6 × .4     .144
HTH       .6 × .4 × .6     .144
HTT       .6 × .4 × .4     .096
THH       .4 × .6 × .6     .144
THT       .4 × .6 × .4     .096
TTH       .4 × .4 × .6     .096
TTT       .4 × .4 × .4     .064

Using the tree
- $P(HHT) = .6 \times .6 \times .4 = 0.144$
- $P(HTH) = .6 \times .4 \times .6 = 0.144$
- $P(THH) = .4 \times .6 \times .6 = 0.144$
- $P(\text{2 H out of 3}) = 3 \times 0.144 = 0.432$
There are three ways of getting two heads in three tosses. Each way has equal probability.

Why are the probabilities the same for each way?
- $P(HHT) = .6 \times .6 \times .4 = 0.144$
- $P(HTH) = .6 \times .4 \times .6 = 0.144$
- $P(THH) = .4 \times .6 \times .6 = 0.144$
- Tosses are independent and the probabilities don't change
- The probability of each way is found by multiplying P(H), P(H), P(T) in some order

III. Tossing a lot of biased coins

Tossing biased coins
I have a biased coin which has a 60% chance of coming up heads. I toss it five times. What's the probability that exactly three out of five are heads?

Tossing biased coins
- Will take a long time to draw out the whole tree
- Can find the probability by multiplying the number of ways by the probability of each way

What's the probability of one way?
- Each way has three heads and two tails
- The probability of each way will be found by multiplying P(H), P(H), P(H), P(T), P(T) in some order
- $P(\text{one way}) = .6 \times .6 \times .6 \times .4 \times .4 = 0.6^3 \times 0.4^2 = 0.03456$

How many ways are there?
How many ways are there of arranging 3 H's and 2 T's? Can list them all:
HHHTT, HHTHT, HHTTH, HTHHT, HTHTH, HTTHH, THHHT, THHTH, THTHH, TTHHH
- 10 ways

Find the probability
- 10 ways
- $P(\text{one way}) = 0.03456$
- $P(\text{three heads in five tosses}) = 10 \times 0.03456 = 0.3456 = 34.56\%$
Quite a lot of work; maybe there's a formula or something?

IV. Tossing way too many biased coins

Tossing biased coins
I have a biased coin which has a 60% chance of coming up heads. I toss it eight times. What's the probability that exactly four out of eight are heads?

What's the probability of one way?
- The probability of each way will be found by multiplying P(H), P(H), P(H), P(H), P(T), P(T), P(T), P(T) in some order
- $P(\text{one way}) = 0.6^4 \times 0.4^4 = 0.003318$

How many ways are there?
- Too many ways to list: need to count them without listing them
- Fortunately, there's the binomial coefficient
- Number of ways of choosing 4 out of 8 objects is $\frac{8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1}{(4 \times 3 \times 2 \times 1) \times (4 \times 3 \times 2 \times 1)}$

Find the probability
Number of ways is
$\frac{8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1}{(4 \times 3 \times 2 \times 1) \times (4 \times 3 \times 2 \times 1)} = 70$
- $P(\text{one way}) = 0.003318$
- $P(\text{4 H out of 8}) = 70 \times 0.003318 = 0.2322 = 23.22\%$

Where did that binomial coefficient come from?
Say you have 4 different heads (H1, H2, H3, H4) and 4 different tails (T1, T2, T3, T4). How many ways are there of arranging these 8 coins?
8 ways of picking the first coin, then 7 of picking the second given the first, then 6 of the third, and so on.

Where did that binomial coefficient come from?
If there are 8 different objects, then there are $8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1$ ways of arranging them. We write this as $8!$ (pronounced "eight factorial").

Where did that binomial coefficient come from?
Now, we don't have 8 different objects: the four heads are the same. How many times are we counting each set of four heads? Think about the orders that the four heads could be in.

Where did that binomial coefficient come from?
The four heads H1, H2, H3, H4 could be in any of $4!$ different orders. So we're counting each way $4!$ times. Need to divide the number of ways by $4!$.

Where did that binomial coefficient come from?
But the four tails could be in any of $4!$ different orders, too. Then the number of ways is $\frac{8!}{4! \, 4!} = 70$.

V. The binomial formula

The binomial formula
- The probability of some event happening on one trial is p
- I perform n independent trials
- What is the chance of the event happening exactly k times in n trials?
- The number of times the event occurs has a binomial distribution

The binomial formula
$P(k \text{ out of } n) = (\text{no. of ways}) \times (\text{prob. of each way}) = \frac{n!}{k! \, (n-k)!} \, p^k (1-p)^{n-k}$

When can you use the binomial formula?
We want to know how many times something does or doesn't happen, e.g. rolling a die multiple times:
- can use binomial if we want to know the number of sixes
- can't use binomial if we want to know the sum of rolls

When can you use the binomial formula?
- n, the number of trials, must be fixed in advance
- p, the probability of success on one trial, doesn't change
- Trials must be independent

Why can't you use the binomial formula?
- I toss a coin until I get a tail. What's the probability that I get exactly two heads?
- I draw three cards from a shuffled deck. What's the chance that
exactly two of the cards are spades?

Example: counting sixes
I roll a die ten times. What's the probability I get exactly one six?

Example: counting sixes
Use the formula:
$\frac{10!}{1! \, 9!} \left(\frac{1}{6}\right)^1 \left(\frac{5}{6}\right)^9 = \frac{10 \times 9 \times 8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1}{1 \times (9 \times 8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1)} \times \frac{1}{6} \times \left(\frac{5}{6}\right)^9 = 10 \times \frac{1}{6} \times \left(\frac{5}{6}\right)^9 = 32.30\%$

Counting sixes: a note
I roll a die ten times. What's the probability I get no sixes?
$\frac{10!}{0! \, 10!} \left(\frac{1}{6}\right)^0 \left(\frac{5}{6}\right)^{10} = 1 \times 1 \times \left(\frac{5}{6}\right)^{10}$
i.e. the same formula as the power rule, $\left(\frac{5}{6}\right)^{10}$.

Using the addition rule
I roll a die ten times. What's the probability I get two sixes or fewer?
- Can add up P(no sixes) + P(one six) + P(two sixes)
- Find each of these probabilities using the binomial formula

Using the addition rule
- $P(\text{no sixes}) = \left(\frac{5}{6}\right)^{10} = 16.15\%$
- $P(\text{one six}) = 10 \left(\frac{1}{6}\right)^1 \left(\frac{5}{6}\right)^9 = 32.30\%$
- $P(\text{two sixes}) = \frac{10!}{2! \, 8!} \left(\frac{1}{6}\right)^2 \left(\frac{5}{6}\right)^8 = 29.07\%$
- $P(\text{no more than 2 sixes}) = 77.52\%$

VI. Is it unusual?

Is it unusual?
I toss a coin six times and get five tails. Is this unusual?
$P(\text{5 or more tails}) = P(\text{5 tails}) + P(\text{6 tails}) = 6 \left(\frac{1}{2}\right)^5 \left(\frac{1}{2}\right)^1 + \left(\frac{1}{2}\right)^6 = 10.94\%$
Not that unusual.
NB: P(5 or more heads) = 10.94% as well.

Is it unusual?
I toss a coin twelve times and get ten tails. Is this unusual?
$P(\text{10 or more tails}) = P(\text{10 tails}) + P(\text{11 tails}) + P(\text{12 tails}) = 66 \left(\frac{1}{2}\right)^{10} \left(\frac{1}{2}\right)^2 + 12 \left(\frac{1}{2}\right)^{11} \left(\frac{1}{2}\right)^1 + \left(\frac{1}{2}\right)^{12} = 1.93\%$
Unusual; maybe the coin is biased?

Is it unusual?
I toss a coin 24 times and get 20 tails. Is this unusual?
P(20 or more tails) = 0.077%, or 1 in 1300.
Very unusual; strongly suspect the coin is biased.

Recap: The binomial formula
- The probability of some event happening on one trial is p
- I perform n independent trials
- What is the chance of the event happening exactly k times in n trials?
- The number of times the event occurs has a binomial distribution

The binomial formula
$P(k \text{ out of } n) = (\text{no. of ways}) \times (\text{prob. of each way}) = \frac{n!}{k! \, (n-k)!} \, p^k (1-p)^{n-k}$

When can you use the binomial formula?
We want to know how many times something does or doesn't happen, e.g. rolling a die multiple times:
- can use binomial if we want to know the number of sixes
- can't use binomial if we want to know the sum of rolls

When can you use
the binomial formula?
- n, the number of trials, must be fixed in advance
- p, the probability of success on one trial, doesn't change
- Trials must be independent

Tomorrow
The law of averages

STAT 2 Lecture 31: Statistical tests (continued)

Today
- Statistical significance
- The z-test for counts and percentages
- The t-test
- The two-sample z-test

I. Statistical significance (ctd.)

Recall: hypotheses
I think that Berkeley students study, on average, 20 hours a week. My friend thinks the true average is much less than this.
- Null hypothesis: average hours studied per week = 20
- Alternative hypothesis: average hours studied per week < 20

One-sided and two-sided tests
- My friend is willing to let me win even if the average number of hours studied is much more than 20
- We perform a one-sided or one-tailed test: we will only reject the null if the average is low, not high

Recall: test statistic
- I decide to survey a simple random sample of 200 Berkeley students
- My test statistic will be the z-statistic based on the average number of hours studied in the sample
- I can use this because the sample average for a large sample is approximately normally distributed

The data
- My sample has a mean of 19 hours and an SD* of 9 hours
- Estimated SE of sample average = $9 / \sqrt{200} = 0.636$
- z-statistic = $(19 - 20) / 0.636 = -1.57$
*It's slightly better to use the SD+ (the SD computed with n − 1 in the denominator) when using the sample SD to stand in for the population SD.

Recall: P-value
- From tables, $P(Z < -1.57) = 5.82\%$
- We don't need $P(Z > 1.57)$ because it's a one-tailed test, so the P-value is 5.82%
- If we reject strictly when the P-value is less than 5%, we can't reject here: the difference isn't significant at the 5% level

How do we interpret this?
- I would say the difference was not statistically significant at the 5% level, and keep believing the true average is 20 hours
- My friend would say I got lucky: that the true value is less than 20 hours, and though the P-value was low, the test didn't quite pick it up
- He might suggest another survey with a larger sample size: this would reduce the SE and make it more likely the test
would pick up the difference.

How do we interpret this?
- I would reply that even if there were a difference, it would be small: it may or may not be statistically significant, but it isn't practically significant
- The tests we've seen don't test for a practically significant difference (which is usually subjective)

II. The z-test for counts and percentages

Example: The UC Davis ESP test
- Clairvoyants made 7500 guesses as to which of 4 targets had been selected by the "Aquarius" random number generator
- 2006 guesses were right
- Is this more than we'd expect by chance?

Example: The UC Davis ESP test
- Test statistic: z-statistic based on the number of correct guesses (we could use the binomial, but the normal approximation is very good and it saves us having to add thousands of probabilities)
- Null hypothesis: each guess is like a draw from the box [1 0 0 0]; that is, each guess has a 1 in 4 chance of being correct, and the event that a guess is right is independent of all other guesses
- Hard to define an alternative: maybe "the expected percentage of correct guesses is more than 25%" (hardcore frequentists would not approve of this)

Example: The UC Davis ESP test
What would we expect under the null hypothesis?
- Expect 25% of guesses to be correct, or 7500/4 = 1875
- SE of number of correct guesses is $\sqrt{7500} \times \sqrt{0.25 \times 0.75} \approx 37$
- Number of correct guesses has an approximately normal distribution

Example: The UC Davis ESP test
- 2006 guesses were right
- The z-statistic is (observed number − expected number) / (SE of number) = $(2006 - 1875) / 37 = 3.5$
- $P(Z > 3.5) = 0.02\%$, so the one-tailed P-value is 0.02%

Example: The UC Davis ESP test
- The number of correct guesses is not consistent with the null hypothesis
- Does this mean the subjects have ESP?
- Maybe, or maybe the machine was malfunctioning

Summary
- With a large sample, we can use the z-test to test a hypothesis about a count or a percentage, since the Central Limit Theorem will hold
- Exception: for very rare (e.g. the percentage of people who are one of a set of triplets) or almost certain events, such tests may be
misleading.

III. The t-test

The t-test
- The z-test doesn't work well for moderate-sized samples
- W. S. Gosset (a.k.a. "Student") invented the t-test for use in some such situations

Example
- I know the true density of silver is 10490 kg/m³
- I have a silver-looking bar. I'm not sure if it's pure silver, so I want to find out if the bar's density is really 10490 kg/m³
- I measure its density ten times

Example
- The measurements average 11000 kg/m³, with an SE of 250 kg/m³
- From previous experience, I believe my measurement errors are unbiased and have a normal distribution
- For a small sample, the z-test may not work well

The t-statistic
- We perform a t-test
- The test statistic is calculated the same way: (observed − expected) / (standard error)
- But now the statistic has a t-distribution, not a standard normal distribution

t-distribution
- The t-distribution (a.k.a. Student's distribution) is actually a whole set of distributions that depends on the sample size
- The particular distribution we use is determined by the degrees of freedom
- For the t-test, DF = sample size − 1

Example
- Null hypothesis: the average of all measurements (true value + chance error) is 10490 kg/m³
- Alternative: the average of all measurements is not 10490 kg/m³
- We could also make our null "the true density of the bar is 10490 kg/m³"; however, this excludes the possibility of bias

Example
- The t-statistic is $(11000 - 10490) / 250 = 2.04$
- DF = 10 − 1 = 9
- We look up probabilities for a t-distribution with 9 degrees of freedom from tables or a computer

Example
- $P(T_9 < -2.04) = 3.59\%$
- $P(T_9 > 2.04) = 3.59\%$
- So the P-value is 7.18%
- Not quite enough evidence to reject the null hypothesis: the bar may be silver (although we may be suspicious)

The t-distribution
[Figure: Histogram of the t-distribution with 9 degrees of freedom]

When do we use a t-test?
- The data under the null hypothesis are like draws from a box
- The box SD is unknown (if known, usually use the z-test)
- The number of observations is moderate
- The distribution of the data is approximately normal
If we're sure the data is normal, we can perform the t-test with extremely small
samples (down to about n = 4).

Similarities and differences between the z- and t-tests
- Both calculate the test statistic the same way: (observed − expected) / SE
- In the z-test, this statistic has a standard normal distribution under the null; in the t-test, it has a t-distribution with n − 1 degrees of freedom

Similarities and differences between the z- and t-tests
- The z-test assumes the sample average (or percentage, or total, or count) is normally distributed: this will always occur if your sample size is large enough
- The t-test assumes the data (or error) is approximately normally distributed: sample size doesn't affect this
Note: well, not quite, but close enough.

In practice
- Many scientists use the t-test even when the data is not known to be normally distributed
- The usual justification is "it works"
- More accurately, "it works, except when it doesn't"
- Statisticians do not condone t-tests for non-normal data

Example
- To correctly measure carbon monoxide levels, a spectrophotometer should have a "span gas" reading of 70 ppm
- A technician makes five span gas readings: 78, 83, 68, 72, 88
- Are these readings significantly different from 70?
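The slides pose this question without working it through. As a minimal sketch, the t-statistic can be computed the way the earlier slides describe: (observed − expected) / SE, with DF = sample size − 1. Two assumptions are made here that the slides do not state: the test is two-sided at the 5% level, and the cutoff 2.776 is the 97.5th percentile of the t-distribution with 4 degrees of freedom, taken from standard t tables. The variable names are illustrative.

```python
import math
import statistics

readings = [78, 83, 68, 72, 88]   # span gas readings from the slide (ppm)
expected = 70                      # value under the null hypothesis (ppm)

n = len(readings)
sample_mean = statistics.mean(readings)
# statistics.stdev divides by n - 1, i.e. it is the SD+ used to stand in
# for the unknown box SD.
sd_plus = statistics.stdev(readings)
se = sd_plus / math.sqrt(n)
t_stat = (sample_mean - expected) / se
df = n - 1

# Assumption: two-sided test at the 5% level; 2.776 is the tabulated
# 97.5th percentile of the t-distribution with 4 degrees of freedom.
CRITICAL_T_DF4_TWO_SIDED_5PCT = 2.776
significant_at_5pct = abs(t_stat) > CRITICAL_T_DF4_TWO_SIDED_5PCT

print(f"mean = {sample_mean}, SE = {se:.2f}, t = {t_stat:.2f}, DF = {df}")
print("significant at 5% (two-sided):", significant_at_5pct)
```

Under these assumptions the t-statistic falls short of the cutoff, so the readings would not be declared significantly different from 70 at the 5% level; a one-sided test or a different significance level could lead to a different conclusion.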