Elementary Statistical Methods
Elementary Statistical Methods STAT 30100
This 97 page Class Notes was uploaded by Bailey Macejkovic on Saturday September 19, 2015. The Class Notes belongs to STAT 30100 at Purdue University taught by Ellen Gundlach in Fall. Since its upload, it has received 10 views. For similar materials see /class/207926/stat-30100-purdue-university in Statistics at Purdue University.
Date Created: 09/19/15
Section 5.2: Using the Normal Distribution with Sample Means; Quality Control

Learning goals for this chapter:
- Define sampling variability and understand how to minimize it.
- Understand the difference between parameters and statistics.
- Explain how sampling distributions work.
- Define the Central Limit Theorem and know how to apply it.
- Distinguish between X and x̄ problems.
- Distinguish between forwards and backwards x̄ problems.
- State examples of processes and of control vs. capability.
- Read, create, and interpret process control charts.
- Calculate upper and lower control limits and the center line for control charts.
- Know when a process is out of control.

In Section 1.3 we learned how to use the Normal distribution for X if we knew μ and σ and if X was Normally distributed. Here we will see how to use the Normal distribution for sample means. You just have to use the right μ and σ, and everything else works just like in Section 1.3.

Normal distribution for X (review of Section 1.3)
- What is X? X is a single measurement in general: a score, an amount, a height, a weight, etc.
- When do we use the Normal distribution? If X is Normally distributed.
- Mean and standard deviation of X: μ and σ, told to us in the problem.
- Formula we use for Z: $Z = \frac{X - \mu}{\sigma}$

Normal distribution for x̄ (Section 5.2)
- What is x̄? x̄ is the sample mean.
- When do we use the Normal distribution? If X is Normally distributed, then x̄ is too. However, according to the Central Limit Theorem, no matter what distribution the population has, as long as you draw an SRS (simple random sample) with a fairly large sample size n, the sample mean is approximately Normally distributed. This is a very nice and very big deal.
- Mean and standard deviation of a sample mean: $\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \sigma/\sqrt{n}$
- Formula we use for Z: $Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$

How do you tell these x̄ problems apart from Section 1.3 X problems?
- Section 1.3 X problems would have a sample size n = 1 or would talk about the subjects/units in a very general sense.
- Section 5.2 x̄ problems have a specific sample size bigger than 1.

Example: The scores of students on the ACT college entrance examination in 2003
had mean μ = 20.8 and σ = 4.8. The distribution of scores is only roughly Normal.
a) What is the approximate probability that a single student randomly chosen from all those taking the test scores 23 or higher?
b) Now take an SRS of 50 students who took the test. What are the mean and standard deviation of the sample mean score x̄ of these 50 students?
c) What is the approximate probability that the mean score x̄ of these 50 students is 23 or higher?

To summarize Chapter 5:
1. What is your variable?
   a. An individual or all members of a population: X
   b. A sample mean: x̄ (need a specific sample size n)
2. Get the appropriate mean and standard deviation.

   Type                     Mean        Standard deviation
   Individual or all (X)    μ_X = μ     σ_X = σ
   Sample mean (x̄)          μ_x̄ = μ     σ_x̄ = σ/√n

3. Set up your inequality, for example P(x̄ < 0.2).
4. Work the problem as a Normal distribution problem from Section 1.3: convert to Z using the appropriate mean and standard deviation, use the standard Normal table, etc. Know when it is appropriate to use the Normal approximation.

Section 1.3 and Section 5.2 Examples: X or x̄? Forwards or backwards?

1. The time that a technician requires to perform preventative maintenance on an air conditioning unit is governed by an exponential distribution with a mean of 60 minutes and a standard deviation of 60 minutes. Your company operates 70 of these units. What is the probability that their average maintenance time exceeds 50 minutes?
2. A laboratory weighs filters from a coal mine to measure the amount of dust in the mine atmosphere. Repeated measurements of the weight of dust on the same filter vary Normally with a standard deviation of 0.08 mg because the weighing is not perfectly precise. The dust on a particular filter actually weighs 123 mg. Repeated weighings will then have the Normal distribution with mean 123 mg and standard deviation 0.08 mg. What is the probability that the laboratory reports an average of 3 weighings as 124 mg or higher for this filter?
3. The level of nitrogen oxides (NOX) in the exhaust of a particular car model varies with mean
0.9 grams per mile (g/mi) and standard deviation 0.15 g/mi. A company has 125 cars of this model in its fleet. What is the level L such that the probability that the sample mean is greater than L is only 0.01?
4. The NCAA requires Division I athletes to score at least 820 on the combined math and verbal parts of the SAT exam to compete in their first college year. In 2002, the scores of the 1.3 million students taking the SATs were approximately Normal with mean 1020 and standard deviation 207. What percent of all students had scores less than 820?

Statistical Process Control (Quality Control)

A process is a chain of activities that turns inputs into outputs. A process is like a population containing all the outputs that would be produced by the process if it ran forever in its present state. The outputs produced today or this week are a sample from this population.

The goal of statistical process control is to make a process stable over time and keep it stable unless planned changes are made. All processes have variation. Statistical stability means the pattern of variation remains stable, not that there is no variation in the variable measured.

Control charts are statistical tools that monitor a process and alert us when the process has been disturbed so that it is now out of control. This is a signal to find and correct the cause of the disturbance.

Common cause variation: A process that is in control has only common cause variation. Common cause variation is the inherent variability of the system, due to many small causes that are always present.

Special cause variation: When the normal functioning of the process is disturbed by some unpredictable event, special cause variation is added to the common cause variation. We hope to be able to discover what lies behind special cause variation and eliminate that cause to restore the stable functioning of the process.

Control charts distinguish between the common cause variation and the special cause variation.

How do you know if a process is out of control?
- One point outside the $\mu \pm 3\sigma/\sqrt{n}$ limits.
- A run of 9 consecutive points above the center line (μ) or 9 consecutive points below the center line.

In this class we will focus on sample mean (x̄) charts, but other common control charts include s
(standard deviation), p̂ (sample proportion), and R (range).

Control vs. Capability
- If a process is in control, we know what to expect in the finished product. Statistical quality control only pays attention to the internal state of the process.
- If a process is capable, the process can meet or exceed the requirements placed on it by some external source (client demands, goals of the organization, etc.).

What do you need to know about Statistical Quality Control for the homework and exams?
- What does "in control" mean?
- What is a process?
- What is the difference between control and capability?
- Where would you draw the center line and the upper and lower control limits on a control chart?
- How do you know when a process is out of control? Be able to recognize it from a control chart.
- Give an example of a process.

Chapter 13: Two-Way Analysis of Variance

Learning goals for this chapter:
- Know how two-way ANOVA is related to one-way ANOVA and 2-sample comparison of means techniques.
- Test the standard deviations to see if it is OK to pool the variances.
- Understand why it is important to be able to pool the variances for two-way ANOVA.
- Explain and check the assumptions that must be met for doing two-way ANOVA.
- Calculate R² and the estimate for σ.
- Write the 3 sets of hypotheses for two-way ANOVA.
- Use the F test statistics and P-values from SPSS to do two-way ANOVA hypothesis tests.
- Write conclusions to two-way ANOVA tests in terms of the story, including using the words "population mean."
- Interpret means plots in terms of the two main effects and potential interaction.
- Understand that summary statistics and graphs refer to the sample data, and hypothesis tests give us information about the population parameter.
- Recognize the response variable, factors, number of levels for each factor, and the total number of observations.
- Identify whether the best statistical technique for a story is 1-sample mean, matched pairs, 2-sample comparison of means, one-way ANOVA, two-way ANOVA, or summary statistics.

Chapter 7: Two-sample comparison of means (t tests). 1
categorical variable for sorting and 1 quantitative variable for measurement.
Example: Are the mean taste ratings of chewy granola bars the same as those for crunchy granola bars if you conduct a taste test (scale of 1-10)?

Chapter 12: F tests compare the means of several populations. 1 categorical variable for sorting and 1 quantitative variable for measurement.
Example: Are the mean taste ratings of Quaker, Kellogg's, and Nature Valley granola bars the same if you conduct a taste test (scale of 1-10)?

Chapter 13: F tests compare the means of populations that are classified in 2 ways. 2 categorical variables for sorting and 1 quantitative variable for measurement.
Example: Do brand, texture (chewy vs. crunchy), and/or their interaction make a difference to the mean taste ratings (scale of 1-10) for granola bars?

What's similar for Two-Way ANOVA?
Just as in one-way ANOVA, we still
- assume the data are approximately Normal,
- assume the groups have the same standard deviation (even if the means may be different),
- pool to estimate the standard deviation,
- use F statistics for significance tests.

What's different for Two-Way ANOVA?
We can look at each categorical variable separately, and we can look at their interaction. With one-way ANOVA it was impossible to look at interaction.

Example (from the 4th edition of M&M): Each of the following situations is a 2-way study design. For each case, identify the response variable and both factors, and state the number of levels for each factor (I and J) and the total number of observations (N).
a) A study of smoking classifies subjects as nonsmokers, moderate smokers, or heavy smokers. Samples of 80 men and 80 women are drawn from each group. Each person reports the number of hours of sleep he or she gets on a typical night.
b) The strength of concrete depends upon the formula used to prepare it. An experiment compares 6 different mixtures. Nine specimens of concrete are poured from each mixture. Three of these specimens are subjected to 0 cycles of freezing and thawing, 3 are subjected to 100
cycles, and 3 specimens are subjected to 500 cycles. The strength of each specimen is then measured.
c) Four methods for teaching sign language are to be compared. Sixteen students in special education and sixteen students majoring in other areas are the subjects for the study. Within each group they are randomly assigned to the methods. Scores on a final exam are compared.

Example (from the 4th edition of M&M): In the course of a clinical trial of measures to prevent coronary heart disease, blood pressure measurements were taken on 12,866 men. Individuals were classified by age group and race. The means for systolic blood pressure are given in the following table.

             35-39   40-44   45-49   50-54   55-59
White        131.0   132.3   135.2   139.4   142.0
Non-White    132.3   134.2   137.2   141.3   144.1

Note that we are not given raw data on these 12,866 men. The table above gives the mean for each race/age combination. This means we can't use ANOVA. We'll just use graphing and marginal means to describe the situation.
a) Plot the means with age on the x-axis and blood pressure on the y-axis. For each racial group, connect the points for the different ages.
b) Describe the patterns you see. Does there appear to be a difference between the 2 racial groups? Does systolic blood pressure appear to vary with age? If so, how does it vary? Is there an interaction between race and age?
c) Compute the marginal means. Find the differences between the white and non-white mean blood pressures for each age group. Use this information to summarize numerically the patterns in the plot.

How do you know from looking at a means plot whether there is interaction?
- If any lines cross each other, then you have an interaction.
- If the lines are all fairly parallel to each other, then you do not have an interaction.
- If the lines aren't parallel but don't cross each other either, then you might have an interaction.

If you have 2 factors, why not just do 2 separate one-way ANOVAs?
- It is more efficient to study 2 factors simultaneously rather than separately.
- We can reduce the
residual variation in a model by including a second factor thought to influence the response (lurking variable).
- We can investigate interactions between the factors.

If the two-way ANOVA test finds that a main effect is significant, then you can go back to doing one-way ANOVA for that factor to find which levels are significantly different from each other.

Two-Way ANOVA Table

Source                    Sum of Squares   Degrees of Freedom      Mean Square         F                  df for P-value
A (main effect of A)      SSA              DFA = I - 1             MSA = SSA/DFA       F_A = MSA/MSE      DFA, DFE
B (main effect of B)      SSB              DFB = J - 1             MSB = SSB/DFB       F_B = MSB/MSE      DFB, DFE
AB (interaction of A, B)  SSAB             DFAB = (I - 1)(J - 1)   MSAB = SSAB/DFAB    F_AB = MSAB/MSE    DFAB, DFE
Error                     SSE              DFE = N - IJ            MSE = SSE/DFE
Total                     SST              DFT = N - 1             MST = SST/DFT

The hypothesis tests are:
- H0: main effect of A = 0. Use the P-value from F_A.
- H0: main effect of B = 0. Use the P-value from F_B.
- H0: interaction of A and B = 0. Use the P-value from F_AB.

Example: One way to repair serious wounds is to insert some material as a scaffold for the body's repair cells to use as a template for new tissue. Scaffolds made from extracellular material (ECM) are particularly promising for this purpose. Because they are made from biological material, they serve as an effective scaffold and are then reabsorbed. One study compared 6 types of scaffold material. Three of these were ECMs and the other three were made of inert materials. There were 3 mice used per scaffold type. The response measure was the percent of glucose phosphated isomerase (Gpi) cells in the region of the wound. A large value is good, indicating that there are many bone marrow cells sent by the body to repair the tissue. Here are the data for 2 weeks, 4 weeks, and 8 weeks after the repair.

[Data table (Gpi % by material at 2, 4, and 8 weeks) and SPSS Data View screenshot with columns Gpi, Material, and Time: individual values not recoverable from the source.]
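The degrees-of-freedom column of the two-way ANOVA table above can be checked with a few lines of arithmetic. The sketch below is only an illustration in Python (the notes themselves use SPSS); the function name `two_way_anova_df` is made up for this example, and it assumes the scaffold study has 3 mice for every material-by-time combination, so that N = 6 x 3 x 3 = 54.

```python
def two_way_anova_df(I, J, N):
    """Degrees of freedom for a two-way ANOVA.

    I = number of levels of factor A, J = number of levels of factor B,
    N = total number of observations (formulas from the ANOVA table above).
    """
    return {
        "A": I - 1,
        "B": J - 1,
        "AB": (I - 1) * (J - 1),
        "Error": N - I * J,
        "Total": N - 1,
    }


# Scaffold example: 6 materials (factor A), 3 time points (factor B),
# and an assumed 3 mice per combination, so N = 54.
df = two_way_anova_df(I=6, J=3, N=54)
for source, value in df.items():
    print(f"{source:>5}: {value}")
```

A quick sanity check on any such table is that the A, B, AB, and Error degrees of freedom add up to the Total degrees of freedom.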
Using SPSS: Analyze > General Linear Model > Univariate. Move Gpi into the Dependent Variable box. Move Time and Material into the Fixed Factors box. To get a means plot, click the Plots box, move Material into the Horizontal Axis box, and move Time into the Separate Lines box. Click Add and Continue. To get summary statistics, click the Options box. Move Material and Time into the Display Means box. Click the Descriptive Statistics box and click Continue. Click OK.

a) Make a table giving the sample size, mean, and standard deviation for each of the material-by-time combinations. Is it reasonable to pool the variances?

Descriptive Statistics (Dependent Variable: GPI)

Material   Time    Mean    Std. Deviation   N
ECM1       2       70.00    5.000            3
           4       65.00    8.660            3
           8       63.33    2.887            3
           Total   66.11    6.009            9
ECM2       2       66.67    7.638            3
           4       63.33    2.887            3
           8       63.33    5.774            3
           Total   64.44    5.270            9
ECM3       2       71.67   10.408            3
           4       73.33    2.887            3
           8       73.33    5.774            3
           Total   72.78    6.180            9
MAT1       2       48.33    2.887            3
           4       23.33    2.887            3
           8       21.67    5.774            3
           Total   31.11   13.411            9
MAT2       2       10.00    5.000            3
           4        6.67    2.887            3
           8        6.67    2.887            3
           Total    7.78    3.632            9
MAT3       2       26.67    2.887            3
           4       11.67    2.887            3
           8       10.00    5.000            3
           Total   16.11    8.580            9
Total      2       48.89   24.648           18
           4       40.56   28.330           18
           8       39.72   28.619           18
           Total   43.06   27.064           54

(The standard deviation for ECM1 at 2 weeks is missing from the extracted output; 5.000 is the value implied by the ECM1 and 2-week totals.)

Because the sample sizes in this experiment are very small, we expect a large amount of variability in the sample standard deviations. Although they vary more than we would prefer, we will proceed with the ANOVA.

b) Make a plot of the means of the combinations. Describe the main features of the plot.

[Figure: SPSS means plot, "Estimated Marginal Means of GPI," with Material on the horizontal axis and separate lines for Time.]

c) Make a table of the sample size and mean for each type of material. Make a plot of the means of the materials. Give a short summary of how Gpi depends on the type of material.
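For part a) above, the question of pooling can be explored numerically. The sketch below is an illustration in Python rather than SPSS and rests on two assumptions that are not stated in these notes: it uses the rule of thumb commonly taught for one-way ANOVA (largest standard deviation less than twice the smallest), and it fills the ECM1 2-week standard deviation with 5.000, a value inferred from the group totals because it is missing from the extracted output.

```python
import math

# Cell standard deviations (2, 4, 8 weeks) from the Descriptive Statistics
# output; every cell has n = 3 mice.
cell_sds = {
    "ECM1": [5.000, 8.660, 2.887],  # 5.000 is inferred, not printed in the source
    "ECM2": [7.638, 2.887, 5.774],
    "ECM3": [10.408, 2.887, 5.774],
    "MAT1": [2.887, 2.887, 5.774],
    "MAT2": [5.000, 2.887, 2.887],
    "MAT3": [2.887, 2.887, 5.000],
}
n_per_cell = 3

all_sds = [s for sds in cell_sds.values() for s in sds]

# Assumed rule of thumb: pooling is comfortable when this ratio is below 2.
ratio = max(all_sds) / min(all_sds)

# With equal cell sizes, the pooled variance (MSE) is the plain average
# of the cell variances.
pooled_var = sum(s ** 2 for s in all_sds) / len(all_sds)
pooled_sd = math.sqrt(pooled_var)
error_df = len(all_sds) * (n_per_cell - 1)

# Standard error of a material mean (each is an average of 9 observations).
se_material_mean = pooled_sd / math.sqrt(3 * n_per_cell)

print(f"largest sd / smallest sd = {ratio:.2f}")
print(f"pooled sd = {pooled_sd:.2f} on {error_df} error degrees of freedom")
print(f"standard error of a material mean = {se_material_mean:.3f}")
```

The ratio comes out well above 2, which matches the remark that the standard deviations vary more than we would prefer; with only 3 observations per cell, the notes proceed with the ANOVA anyway. The pooled standard deviation is the estimate of σ that the learning goals refer to.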
1. Material (Dependent Variable: GPI)

Material   Mean     Std. Error   95% CI Lower Bound   95% CI Upper Bound
ECM1       66.111   1.742        62.578               69.644
ECM2       64.444   1.742        60.911               67.978
ECM3       72.778   1.742        69.245               76.311
MAT1       31.111   1.742        27.578               34.644
MAT2        7.778   1.742         4.245               11.311
MAT3       16.111   1.742        12.578               19.644

d) Make a table of the sample size, mean, and standard error for each time period. Make a plot of the means. Give a short summary of how Gpi depends on the time period.

2. Time (Dependent Variable: GPI)

Time   Mean     Std. Error   95% CI Lower Bound   95% CI Upper Bound
2      48.889   1.232        46.391               51.387
4      40.556   (not legible in the source)
8      39.722   (not legible in the source)

e) Report the F statistics and P-values. What do you conclude? Write a short paragraph summarizing the results of your analysis.

[SPSS output: Tests of Between-Subjects Effects table for GPI (Corrected Model, Material, Time, Material * Time, Error, Total, R Squared); values not recoverable from the source.]

H0: main effect of material = 0      conclusion:
H0: main effect of time = 0          conclusion:
H0: interaction = 0                  conclusion:

Chapter 6: Confidence Intervals and Hypothesis Testing
Using Z for the CI and test of the population mean

Learning goals for this chapter:
- Understand what inference is and why it is needed.
- Know that all inference techniques give us information about the population parameter.
- Explain what a confidence interval is and when it is needed.
- Calculate a confidence interval for the population mean when the population standard deviation is known.
- Know the assumptions that must be met for doing inference for the population mean.
- Calculate the needed sample size if you have a predetermined margin of error.
- Know how to write hypotheses, calculate a test statistic and P-value, and write conclusions in terms of the story.
- Draw Normal curve pictures to match the hypothesis test.
- Understand the logic of hypothesis testing and when a hypothesis test is needed.
- Use the confidence interval to perform a two-sided hypothesis test.
- Explain sampling variability and the
difference between the population mean and the sample mean.
- Explain the difference between the population standard deviation and the sample standard deviation.
- Know which technique is most appropriate for a story: confidence interval, hypothesis test, or simple summary statistics.

When we collect data from our sample, we can calculate sample statistics. However, usually we are interested in what is true for the whole population, not just for the sample. Remember that a census is very hard and expensive to do well.

Why can't we just accept our sample mean or sample proportion as the official mean or proportion for the population? Every time we estimate the statistics x̄ and p̂ (sample mean and sample proportion), we get a different answer due to sampling variability.

Two most common types of formal statistical inference:
- Confidence intervals: when we want to estimate a population parameter.
- Significance tests: when we want to assess the evidence provided by the data in favor of some claim about the population (a yes/no question about the population).

Confidence intervals allow us to estimate the population mean or population proportion. The true mean or proportion for the population exists and is a fixed number, but we just don't know what it is. Using our sample statistic, we can create a "net" to give us an estimate of where to expect the population parameter to be.

Confidence interval = net
Population parameter = invisible, stationary butterfly

We don't know exactly where the butterfly is, but from our sample we have a pretty good estimate of the location.

[Figure: density curve.]

If we just take a single sample, our single confidence interval "net" may or may not include the population parameter. However, if we take many samples of the same size and create a confidence interval from each sample statistic, over the long run 95% of our confidence intervals will contain the true population parameter (if we are using a 95% confidence level).

[Figure: many SRSs of size n drawn from a population with mean μ, each giving its own confidence interval; 95% of these intervals capture the unknown μ. Source: Introduction to the Practice of Statistics, Fifth Edition, W. H. Freeman and Company.]
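The long-run interpretation of the confidence level described above can be tried out with a small simulation. This is a sketch in Python, not part of the course's SPSS workflow: the population values (μ = 20, σ = 5), the sample size, and the function name `coverage` are all made up for illustration, and it uses the known-σ interval for a population mean, x̄ ± z* σ/√n, covered in this chapter.

```python
import math
import random
from statistics import NormalDist


def coverage(mu, sigma, n, conf_level, reps, seed=0):
    """Fraction of simulated confidence intervals that contain the true mean mu.

    Each repetition draws an SRS of size n from a Normal(mu, sigma) population
    and builds the interval x_bar +/- z* sigma / sqrt(n) with sigma known.
    """
    rng = random.Random(seed)
    # z* leaves area conf_level between -z* and z* on the standard Normal curve.
    z_star = NormalDist().inv_cdf(0.5 + conf_level / 2)
    margin = z_star * sigma / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / reps


# Hypothetical population and sample size, chosen only for illustration.
rate = coverage(mu=20.0, sigma=5.0, n=25, conf_level=0.95, reps=2000)
print(f"Fraction of intervals that captured mu: {rate:.3f}")
```

Any single interval either contains μ or it does not; the 95% describes how often the method succeeds over many samples, so the printed fraction should land close to 0.95.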
We don't need to take a lot of random samples to recreate the sampling distribution with the population mean μ at its center. All we need is one simple random sample of size n. Because of what we know about the sample mean distribution, we can use that one sample mean's confidence interval to infer what the population mean really is.

If you increase the sample size n, you decrease the size of your net (your margin of error).

[Figure: confidence intervals for n = 320 and n = 1280 on a common axis running from 14000 to 26000. Source: Introduction to the Practice of Statistics, Fifth Edition, W. H. Freeman and Company.]

If you increase your confidence level C, then you increase the size of your net (your margin of error).

[Figure: 99% and 95% confidence intervals on a common axis running from 14000 to 26000. Source: Introduction to the Practice of Statistics, Fifth Edition, W. H. Freeman and Company.]

A smaller net is good because it gives you more information: it is a smaller range for where to expect your true population parameter.

Freeman applet: go to the course website > Freeman link > statistical applets > confidence interval.

Confidence intervals look like: estimate ± margin of error.

Confidence interval for a population mean μ:
$\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$
where z* is the value on the standard Normal curve with area C between -z* and z*. Table D at the back of the book also contains more z* values on the bottom row.

Remember from Ch. 5 that the mean and standard deviation for a sample mean are $\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \sigma/\sqrt{n}$. Also remember that if X is Normally distributed, then x̄ will be too, and if n is large, the sample mean will be approximately Normally distributed even if X is not Normally distributed (Central Limit Theorem).

What if your margin of error is too large? Here are ways to reduce it:
- Increase the sample size (bigger n).
- Use a lower level of confidence (smaller C).
- Reduce σ.

Sample size n for a desired margin of error m:
$n = \left( \frac{z^* \sigma}{m} \right)^2$

Note that it is the sample size n that influences the margin of error. The population size has nothing to do with it.

Be
careful! You can only use the formula $\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$ under certain circumstances:
- Data must be an SRS from the population. Do not use it if the sampling is anything more complicated than an SRS.
- Data must be collected correctly (no bias). The margin of error covers only random sampling errors. Undercoverage and nonresponse are not covered.
- Outliers can have a big effect on the confidence interval. This makes sense because we use the mean and standard deviation to get a CI.
- You must know the standard deviation of the population, σ.

Example: A questionnaire of drinking habits was given to a random sample of fraternity members, and each student was asked to report the number of beers he had drunk in the past month. The sample of 30 students resulted in an average of 22 beers, with a population standard deviation of 9 beers.
a) Give a 90% confidence interval for the mean number of beers drunk by fraternity members in the past month.
b) Is it true that 90% of the fraternity members each month drink the number of beers that lie in the interval you found in part a)? Explain your answer.
   No, this is the confidence interval for the population mean, not for individual population members. If we take many 30-frat-member samples and make a confidence interval from each sample, 90% of these confidence intervals will contain the true population mean number of beers drunk in a month by fraternity members.
c) What is the margin of error for the 90% confidence interval?
d) How many students should you sample if you want a margin of error of 1 for a 90% confidence interval?

Hypothesis Testing

To do a significance test, you need 2 hypotheses:
- H0 (null hypothesis): the statement being tested, usually phrased as "no effect" or "no difference."
- Ha (alternative hypothesis): the statement we hope or suspect is true instead of H0.

Hypotheses always refer to some population or model, not to a particular outcome.

Hypotheses can be one-sided or two-sided.
- One-sided hypothesis: covers just part of the range for your parameter.
    H0: μ = 10        OR        H0: μ = 10
    Ha: μ > 10                  Ha: μ < 10
- Two-sided
hypothesis: covers the whole possible range for your parameter.
    H0: μ = 10
    Ha: μ ≠ 10

Even though Ha is what we hope or believe to be true, our test gives evidence for or against H0 only. We never prove H0 true; we can only state whether we have enough evidence to reject H0 (which is evidence in favor of Ha, but not proof that Ha is true) or that we don't have enough evidence to reject H0.

A test statistic measures compatibility between H0 and the data.

P-value: the probability, computed assuming that H0 is true, that the test statistic would take a value as extreme or more extreme than that actually observed due to random fluctuation. It is a measure of how unusual your sample results are. The smaller the P-value, the stronger the evidence against H0 provided by the data. Calculate the P-value by using the sampling distribution of the test statistic (only the Normal distribution for Chapter 6).

Compare the P-value to a significance level α. If the P-value ≤ α, we can reject H0. If you can reject H0, your results are significant. If you do not reject H0, your results are not significant.

The 4 steps common to all tests of significance:
1. State the null hypothesis H0 and the alternative hypothesis Ha.
2. Calculate the value of the test statistic (z-score in Chapter 6).
3. Find the P-value for the observed data (from the Normal distribution in Chapter 6).
4. State your conclusion about the data in a sentence, using the P-value and/or comparing the P-value to a significance level for your evidence.

Z Test for a Population Mean (unknown mean μ and known standard deviation σ)
To test the hypothesis H0: μ = μ0, compute the test statistic
$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
The P-values for a test of H0 against
- Ha: μ > μ0 is P(Z ≥ z)
- Ha: μ < μ0 is P(Z ≤ z)
- Ha: μ ≠ μ0 is 2P(Z ≥ |z|)
These P-values are exact if the population is Normally distributed and are approximately correct for large n in other cases.

Examples:
1. Last year, the government made a claim that the average income of the American people was $33,950. However, a sample of 50 people taken recently showed an average income of $34,076, with a population standard deviation of $324. Conduct a significance test to see if the true population mean is more than the
government's claim. Use α = 0.01.
2. Suppose that the cellulose content of alfalfa hay in the population has a standard deviation of 8 mg. A sample of 15 cuttings has a mean cellulose content of 145 mg. A previous study claimed that the mean cellulose content was 140 mg. Perform a hypothesis test to determine if the mean cellulose content is different from 140 mg if α = 0.05.

Using confidence intervals to do hypothesis tests
You can use a CI to do a HT only if 2 conditions are met:
- Your alternative hypothesis has a ≠ (is two-sided).
- Your confidence level and your significance level add to 100% (e.g., an α of 0.05 (5%) plus a confidence level of 95% gives 100%).

You check to see if your null hypothesis could be true at the same time your confidence interval is true.
- If the number from your null hypothesis fits inside the confidence interval, then you DO NOT REJECT H0, because the null hypothesis and the CI agree about μ, so we have no evidence that the null hypothesis is wrong.
- If the number from your null hypothesis is outside the confidence interval, then you REJECT H0, because the null hypothesis and the CI disagree about μ, so one of them has to be wrong.

a) Find a 95% confidence interval for the mean cellulose content.
b) Now redo the hypothesis test from Example 2, this time using the confidence interval from part a). The result should be the same.

3. An environmentalist collects a liter of water from 45 different locations along the banks of a stream. He measures the amount of dissolved oxygen in each specimen. The mean oxygen level is 4.62 mg, with a population standard deviation of 0.92. A water purifying company claims that the mean level of oxygen in the water is 5 mg. Conduct a hypothesis test with α = 0.001 to determine whether the population mean oxygen level is less than 5 mg.

Annual Drinking Water Quality Report 2004, Town of Brookston, IN
"I'm pleased to report that our drinking water is safe and meets federal and state requirements."
Test Results: MCL is the maximum contaminant level, the highest level of a contaminant that is
allowed in drinking water.

[Table of test results (contaminant measurements and violation yes/no): not recoverable from the source.]

One of these violation reports should actually be a "yes" instead of a "no." Which one is it, and why? What hypotheses go along with these confidence intervals?

Note: When I called the Town of Brookston office to ask them about this, the water manager called the state EPA office to get more information. What they told him was that yes, technically I was correct, but that they don't use the confidence intervals that are reported. Apparently these are the FEDERAL EPA rules. They only use the mean. I tried to get the sample size or other information, but I wasn't able to learn anything more.

P-values can be more informative than a reject/do not reject H0 decision based on α.
- As the P-value gets smaller, the evidence for rejecting H0 gets stronger.
- Just because we use α = 0.05 a lot doesn't mean that's the level you have to use; it's just the most common. There's nothing particularly special about that level.
- In a large sample, even tiny deviations from the null hypothesis can be statistically significant.
- If we fail to reject H0, it may be because H0 is true or because our sample size is insufficient to detect the alternative.
- Plot your data and look at the P-value, both, to determine your conclusions. Could outliers be part of the problem?
- A confidence interval actually estimates the size of an effect, rather than simply asking if it is too large to reasonably occur by chance alone.
- You must have a well-designed experiment in order for statistical inference to work. Randomization is important.

Section 2.5: Data Analysis for Two-Way Tables
Section 9.1: Chi-Square Test for Two-Way Tables

Learning goals for this chapter:
- Find the joint, marginal, and conditional distributions from a two-way table of the counts, by hand and with SPSS.
- Determine from the wording of the story whether the question is asking for a joint, marginal, or conditional percentage/probability.
- Know when two-way tables and the chi-square test are the correct statistical technique for a story.
- Perform a hypothesis test for a χ² test, including
stating the hypotheses, obtaining the test statistic and P-value from SPSS, and writing a conclusion in terms of the story.
- Check assumptions to see if it is appropriate to use a χ² test, using the footnote of the SPSS χ² test.

Two-way tables and the chi-square test are used when you are studying the association between 2 categorical variables.

The joint distribution of the 2 categorical variables is cell/total for each cell (the inner squares of the table). All the joint distribution values should add to 1.

The marginal distribution allows us to study 1 variable at a time. You get the marginals just by adding across a row or down a column for the specific variable you are interested in. The marginals are written in the margins of the table (far right and very bottom). The marginals for the row variable should add to 1. The marginals for the column variable should add to 1.

Conditional distribution: If you know one variable for sure (you have reduced your world), what are the respective percentages for the other variable? Bar graphs are a good way to demonstrate conditional distributions.

Hypothesis testing with 2-way tables
H0: There is no association between the row and column variables in the population.
Ha: There is an association between the row and column variables in the population.

To test the null hypothesis, compare observed cell counts with expected cell counts calculated under the assumption that the null hypothesis is true.

Test statistic (chi-square test statistic):
$X^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$
$\text{expected count} = \frac{\text{row total} \times \text{column total}}{n}$
where n = total number of observations for the table.

The X² test statistic has an approximately chi-square distribution. To use the chi-square table, you need the degrees of freedom, (r - 1)(c - 1). Go to Table F in the back of the book. WE WILL LET SPSS CALCULATE THE TEST STATISTIC AND P-VALUE FOR US. YOU DO NOT NEED TO KNOW HOW TO USE THE TABLE.

The P-value for the chi-square test is P(χ² ≥ X²). We'll be using SPSS to do the test.

The chi-square test becomes more accurate as the cell counts increase and
for tables larger than 2x2.

For tables larger than 2x2, use the chi-square test whenever
• the average of the expected counts is 5 or more, and
• the smallest expected count is 1 or more, and
• fewer than 20% of cells have expected counts of less than 5.
For 2x2 tables, use the chi-square test whenever
• all 4 expected cell counts are 5 or more.

Example: Market researchers know that background music can influence the mood and purchasing behavior of customers. One study in a supermarket in Northern Ireland compared 3 treatments: no music, French accordion music, and Italian string music. Under each condition, the researchers recorded the number of bottles of French, Italian, and other wine purchased. The 2-way table of counts appears in the SPSS output below (total # of bottles sold = 243).

Calculate the joint distribution for music and wine (in %):

              Music
Wine          None     French   Italian
French        12.3     16.0     12.3
Italian        4.5      0.4      7.8
Other         17.7     14.4     14.4

Calculate the marginal distribution for music (in %):

              Music
Wine             None     French   Italian
French           12.3     16.0     12.3
Italian           4.5      0.4      7.8
Other            17.7     14.4     14.4
Marg for music   34.6     30.9     34.6

Calculate the marginal distribution for wine (in %):

              Music
Wine             None     French   Italian   Marg for wine
French           12.3     16.0     12.3      40.7
Italian           4.5      0.4      7.8      12.8
Other            17.7     14.4     14.4      46.5
Marg for music   34.6     30.9     34.6      100

Questions (for each one, decide whether it is joint, marginal, or conditional):
1. What percent of all wine bought was Italian with French music playing in the store?
2. Of the Italian wine purchased, what percent was from a store playing French music?
3. What percent of wine bought was Italian?
4. What percent of the wine purchased from French-music-playing stores was French?
5. What percent of wine was purchased from a store with no music playing?

Using SPSS, set up the data so that you have a wine column, a music column, and a purchase column where you will input the counts inside the chart:

Wine      Music     Purchase
French    None      30
Italian   None      11
Other     None      43
French    French    39
Italian   French     1
Other     French    35
French    Italian   30
Italian   Italian   19
Other     Italian   35

Then go to Data > Weight Cases. Click "Weight cases by" and then move purchase into the frequency variable box. Click OK. Do Analyze > Descriptive Statistics > Crosstabs. Make sure "observed" is checked. Put wine into the Rows box and music into the Columns box. Click OK. You will get:

Type of Wine * Type of Music Crosstabulation
Count
                        Type of Music
                        French   Italian   None   Total
Type of Wine  French    39       30        30     99
              Italian    1       19        11     31
              Other     35       35        43     113
Total                   75       84        84     243

Then, if you want the %s for joint and marginal distributions instead of counts, go back to your data and do Analyze > Descriptive Statistics > Crosstabs (your rows and columns should still be entered from the previous step) > click Cells > click Total. Also unclick "observed" so your table won't also include the counts and be too crowded. Click Continue and then OK. You will get:

Type of Wine * Type of Music Crosstabulation
% of Total
                        Type of Music
                        French   Italian   None    Total
Type of Wine  French    16.0     12.3      12.3    40.7
              Italian     .4      7.8       4.5    12.8
              Other     14.4     14.4      17.7    46.5
Total                   30.9     34.6      34.6    100.0

Is there a relationship in the population between the type of wine purchased and the type of music that is playing? Perform a significance test and write a short summary of your conclusion.

Hypotheses:
Test statistic:
P-value:
Conclusion in terms of the story:
Was it appropriate to use the chi-square test here? Justify your answer.

To make SPSS do the hypothesis test, go back to Analyze > Descriptive Statistics > Crosstabs > Cells. Then click "total" to make its check go away. Also click "expected" under counts. Click Continue. Then click Statistics > Chi-Square > Continue > OK. You will get:

Chi-Square Tests
                       Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square     18.279(a)  4    .001
Likelihood Ratio       21.875     4    .000
N of Valid Cases       243
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 9.57.

Use the Pearson Chi-Square to get your $X^2$ test statistic and the Asymp. Sig. to get the P-value.

Example: Psychological and social factors can influence the survival of patients with serious diseases. One study examined the
relationship between survival of patients with coronary heart disease and pet ownership. Each of 92 patients was classified as having a pet or not and by whether they survived for one year. The researchers suspect that having a pet might be connected to the patient status. Here are the data:

a. Find the joint and marginal distributions (in probabilities) of patient status and pet ownership.
b. Assuming a patient is still alive, what is the probability he owns a pet? Is this a joint, marginal, or conditional probability?
c. What is the probability a patient is still alive and owns a pet? Is this a joint, marginal, or conditional probability?
d. What is the probability a patient owns a pet? Is this a joint, marginal, or conditional probability?
e. State the hypotheses for a $\chi^2$ test of this problem, and find the $X^2$ test statistic, its degrees of freedom, and the P-value. State your conclusion in terms of the original problem.

Hypotheses:
Test statistic:
P-value:
Conclusion in terms of the story:

Chi-Square Tests
                               Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                              (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             8.851(b)  1    .003
Continuity Correction(a)       7.190     1    .007
Likelihood Ratio               9.011     1    .003
Fisher's Exact Test                                         .006         .004
Linear-by-Linear Association   8.755     1    .003
N of Valid Cases               92

Student Handout for M&Ms/Skittles Activity
Chapter 9: Two-Way Distributions

Part 1: Plain vs. Peanut M&Ms

1. Your data for plain, mine for peanut, in counts. Overall total number of plain and peanut M&Ms counted:
2. Joint Distribution (in white boxes): Divide each count above by the overall total # of M&Ms.
3. Marginal Distributions (above, in shading): Add down the columns and across the rows. The bottom numbers should add to 100% and the right column should add to 100%.
4. Conditional distribution of flavor for green M&Ms: you know the M&M is green; now what is the chance it is plain, and what is the chance it is peanut? The denominator will be the same for both of these calculations. These two percentages should add to 100%.
   Plain:
   Peanut:
5. Bar graph for the conditional distributions above (you will have 2 bars on 1 graph).

Conditional distribution of color for plain M&Ms. The denominator will be the same for all 6 calculations. All 6 add to 100%.
   Brown:   Yellow:   Red:   Blue:   Orange:   Green:

Sketch a bar graph for the conditional distribution of color for plain M&Ms. You will have 6 bars on the graph.

Conditional distribution of color for peanut M&Ms. The denominator will be the same for all 6 calculations. All 6 add to 100%.
   Brown:   Yellow:   Red:   Blue:   Orange:   Green:

Sketch a bar graph for the conditional distribution of color for peanut M&Ms. You will have 6 bars on the graph. Use the same y-axis scale that you used for the bar graph for plain M&Ms so that you can easily compare your results. How do they compare?

In order to do a hypothesis test, we need a large data set, like one from the whole class:

          Brown   Yellow   Red   Blue   Orange   Green
Plain     147     302      264   407    330      373
Peanut     69     110       70   162    148      123

10. Hypotheses for the M&Ms $\chi^2$ hypothesis test. Be sure to state whether your conclusion refers to the population or the sample.
11. Test statistic and P-value for the $\chi^2$ hypothesis test.

Chi-Square Tests
                       Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square     14.396(a)  5    .013
Likelihood Ratio       14.623     5    .012
N of Valid Cases       2505
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 58.81.

12. Conclusion for the $\chi^2$ hypothesis test ($\alpha = 0.01$) in terms of the story.
13. Was it appropriate to use the chi-square test here?

Part 2: M&Ms vs. Skittles

Table of counts for the whole class:

           Yellow   Non-yellow   Total
M&Ms       302      1521         1823
Skittles   361      1351         1712
Total      663      2872         3535

14. Hypotheses for the $\chi^2$ test.
15. Test statistic and P-value.

Chi-Square Tests
                           Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                           (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square         11.839(b)  1    .001
Continuity Correction(a)   11.544     1    .001
Likelihood Ratio           11.840     1    .001
Fisher's Exact Test                                      .001         .000
N of Valid Cases           3535
a. Computed only for a 2x2 table.
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 321.09.

16. Conclusion ($\alpha = 0.01$) in terms of the story.
17. Was it
appropriate to use the chi-square test here?

Chapter 12: Inference for One-Way ANOVA and Comparing the Means

Learning goals for this chapter:
• Know how one-way ANOVA and 2-sample comparison of means techniques are related.
• Test the standard deviations to see if it is appropriate to pool the variances.
• Understand why it is important to pool the variances in one-way ANOVA.
• Explain and check the assumptions for doing one-way ANOVA.
• Calculate $R^2$ and the estimate for $\sigma$.
• Write the correct hypotheses (including the words "population mean") for one-way ANOVA.
• Use the F test statistic and P-value from SPSS to perform the one-way ANOVA test.
• State the conclusion to a one-way ANOVA test in terms of the story.
• Know when to use a Bonferroni multiple comparisons test.
• Use SPSS to perform the Bonferroni multiple comparisons test and interpret the output (both P-values and confidence intervals).
• State the conclusions to a Bonferroni multiple comparisons test in terms of the story.
• Interpret side-by-side boxplots and means plots in terms of the story.
• Recognize the response variable, factors, number of levels for each factor, and the total number of observations for a story.
• Identify from reading a story whether the scenario is one-way ANOVA.

Use One-Way ANOVA when you have one categorical and one quantitative variable and you want to compare the means.
• If the categorical variable has 2 groups (gender: male or female, for example), use the Ch. 7 two-sample comparison of means t-test.
• If the categorical variable has more than 2 groups (eye color: blue, brown, black, green, hazel, other), then use Ch. 12 one-way ANOVA.

ANOVA = ANalysis Of VAriance = the method for comparing several means.

One-way ANOVA F test for
$H_0$: $\mu_1 = \mu_2 = \cdots = \mu_I$ (the population means are equal)
$H_a$: not all the population means are equal (at least one is different)

Is there at least one population mean that is statistically significantly different from the others?

When you first approach a problem which involves comparing more than 2 groups, here is what you should do:
1. Find the size (n), sample mean, and sample standard deviation of each group. You can then plot the means on a graph. Do histograms of each group to look for outliers and overall shape.
2. Find the 5-number summary (Min, Q1, Median, Q3, Max) for each group and do side-by-side boxplots to see how much overlap there is between the groups.
3. Run ANOVA.

Standard Deviations:
The standard deviation $\sigma$ is assumed to be the same for all of the groups, even though the sample sizes $n_i$ may be different. If the largest s < 2 × (smallest s), we can use the methods based on the assumption of equal standard deviations.

If we assume all the standard deviations are equal, each s is an estimate of $\sigma$. We combine these into a Pooled Estimator of $\sigma$:

$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_I - 1)s_I^2}{(n_1 - 1) + (n_2 - 1) + \cdots + (n_I - 1)}$

In the SPSS ANOVA output, $s_p = \sqrt{MSE}$.

The ANOVA output (see pg. 764 for more information):

Source             Sum of Squares   Degrees of Freedom   Mean Square               F             Sig.
Groups (Between)   SSG              DFG = I - 1          MSG = SSG/DFG             F = MSG/MSE   P-value
Error (Within)     SSE              DFE = N - I          MSE = SSE/DFE = $s_p^2$
Total              SST              DFT = N - 1          MST = SST/DFT

Note that N = the total number of observations = the sum of all the $n_i$.

$R^2 = SSG / SST$ = coefficient of determination = % of variation in the data that is accounted for by the FIT part of the model. This tells you how good a job your model is doing of explaining the variation in your data. The closer to 100%, the better your model is.

Example: Many studies have suggested that there is a link between exercise and healthy bones. Exercise stresses the bones and this causes them to get stronger. One study examined the effect of jumping on the bone density of growing rats. There were 3 treatments: a control with no jumping, a low-jump condition (the jump height was 30 cm), and a high-jump condition (60 cm). After 8 weeks of 10 jumps per day, 5 days per week, the bone density of the rats (expressed in mg/cm^3) was measured. Here are the data:

Group       Bone density
control     611 621 614 593 593 653 600 554 603 569
low jump    635 605 638 594 599 632 631 588 607 596
high jump   650 622 626 626 631 622 643 674 643 650

Enter the data into SPSS in 3 vertical columns with the following labels: Group, Density, and Treatment. All the densities are listed in one long column. Group is where you list control, lowjump, or highjump. Treatment is a numerical way of describing your group. Make control be 1, lowjump be 2, and highjump be 3. For some reason, ANOVA needs a numerical column for the factor box (not stated in your SPSS manual).

[Screenshot: SPSS Data Editor showing the Density, Group, and Treatment columns]

a. Identify the response variable, I, $n_i$, and N for this study.

b. Make a table giving the mean and standard deviation for each group of rats. Make a graph of the means. Is it reasonable to pool the variances?

Using SPSS: Analyze > Compare Means > Means. Move density into the Dependent List box. Move group into the Independent List box. Click the Options box to get your summary statistics. Click OK.

Report: Bone Density
Group      Mean     N    Std. Deviation   Minimum   Maximum   Median
Control    601.10   10   27.364           554       653       601.50
Highjump   638.70   10   16.594           622       674       637.00
Lowjump    612.50   10   19.329           588       638       606.00
Total      617.43   30   26.270           554       674       621.50

[Means plot created in the ANOVA SPSS step below: mean bone density versus Treatment, where Control = 1, Low Jump = 2, High Jump = 3]

c. Do side-by-side boxplots for each group.

Using SPSS to get the boxplots: Graphs > Boxplot > Define. Move density into the Variable box. Move group into the Category Axis box. Click OK.

[Side-by-side boxplots: Bone Density by Group]

d. Run the analysis of variance. Report the F statistic and P-value. Write the hypotheses that go with this information. What do you conclude?

Using SPSS: Analyze > Compare Means > One-Way ANOVA. Move density into the Dependent List box. Move treatment into the Factor box. If you want a plot of the means, you can click Options and Means plot. Click OK.
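As a cross-check on the SPSS steps above, the pieces of the ANOVA table (SSG, SSE, MSE, F, the pooled standard deviation, and R^2) can be rebuilt from nothing more than the group sizes, means, and standard deviations in the Report table. The sketch below is only an illustration in Python, not part of the SPSS workflow; the function name one_way_anova_from_summary and the dictionary layout are assumptions made for this example, and it does not compute the P-value.

```python
import math

# Group summary statistics copied from the SPSS "Report" table in part (b):
# label -> (sample size n_i, sample mean, sample standard deviation)
groups = {
    "control":   (10, 601.10, 27.364),
    "low jump":  (10, 612.50, 19.329),
    "high jump": (10, 638.70, 16.594),
}

def one_way_anova_from_summary(groups):
    """Rebuild the one-way ANOVA table entries from group summaries (hypothetical helper)."""
    stats = list(groups.values())
    N = sum(n for n, _, _ in stats)          # total number of observations
    I = len(stats)                           # number of groups
    grand_mean = sum(n * m for n, m, _ in stats) / N
    ssg = sum(n * (m - grand_mean) ** 2 for n, m, _ in stats)  # between groups
    sse = sum((n - 1) * s ** 2 for n, _, s in stats)           # within groups
    dfg, dfe = I - 1, N - I
    msg, mse = ssg / dfg, sse / dfe
    sds = [s for _, _, s in stats]
    return {
        "SSG": ssg, "SSE": sse, "SST": ssg + sse,
        "DFG": dfg, "DFE": dfe,
        "MSG": msg, "MSE": mse,
        "F": msg / mse,
        "pooled_sd": math.sqrt(mse),          # s_p = sqrt(MSE)
        "R2": ssg / (ssg + sse),              # SSG / SST
        # rule of thumb for pooling: largest s < 2 * smallest s
        "ok_to_pool": max(sds) < 2 * min(sds),
    }

anova = one_way_anova_from_summary(groups)
for name, value in anova.items():
    print(name, value)
```

Because the inputs are rounded summary statistics, the results should agree with the SPSS table only up to rounding.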
ANOVA: Bone Density
                 Sum of Squares   df   Mean Square   F       Sig.
Between Groups   7433.867         2    3716.933      7.978   .002
Within Groups    12579.500        27   465.907
Total            20013.367        29

e. What is the pooled estimate for the standard deviation?

f. What is $R^2$? This means that ___% of the variation in bone density is explained by membership in the groups of high jump, low jump, and control. The other ___% of the variation is due to rat-to-rat variation within each of these groups.

If $H_0$ is rejected in One-Way ANOVA, that means that we have evidence that at least one of the means is different. Which ones?

Multiple Comparisons:
Use a multiple comparisons method ONLY AFTER you have rejected $H_0$ with the F test.

To perform a multiple comparisons procedure, compute t statistics for all pairs of means using the formula

$t_{ij} = \frac{\bar{x}_i - \bar{x}_j}{s_p \sqrt{\frac{1}{n_i} + \frac{1}{n_j}}}$

If $|t_{ij}| \ge t^{**}$, we declare that the population means $\mu_i$ and $\mu_j$ are different. Otherwise, we conclude that the data do not distinguish between them.

The value for $t^{**}$ depends on which multiple comparisons procedure we choose. We will use Bonferroni's multiple comparisons. SPSS will do this for you and give you the P-value.

Another multiple comparisons method is simultaneous confidence intervals for all the possible differences. Find

$(\bar{x}_i - \bar{x}_j) \pm t^{**} s_p \sqrt{\frac{1}{n_i} + \frac{1}{n_j}}$

for all the different pairs. If an interval contains 0, then that pair of means will not be declared significantly different.

Example: Going back to the rat jumping data, use the Bonferroni multiple comparisons procedure to determine which pairs of means differ significantly. Summarize your results in a short report.

Using SPSS: Analyze > Compare Means > One-Way ANOVA. Move density into the Dependent List box. Move treatment into the Factor box. Click the Post Hoc box. Click Bonferroni. Click Continue. Click OK. Remember: Control = 1, Lowjump = 2, Highjump = 3.

[SPSS output: Multiple Comparisons, Dependent Variable: Bone Density. Surviving entries: Mean Difference 11.400 (Std. Error 9.653) and Mean Difference 37.600* (Std. Error 9.653).]
* The mean difference is significant at the .05 level.

Example: Recommendations regarding how long infants in developing countries should be breastfed are controversial. If the nutritional quality of the breast milk is inadequate because the mothers are malnourished, then there is risk of inadequate nutrition for the infant. On the other hand, the introduction of other foods carries the risk of infection from contamination. Further complicating the situation is the fact that companies that produce infant formulas and other foods benefit when these foods are consumed by large numbers of customers. One question related to this controversy concerns the amount of energy intake for infants who have other foods introduced into the diet at different ages. Part of one study compared the energy intakes, measured in kilocalories per day (kcal/d), for infants who were breastfed exclusively for 4, 5, or 6 months. Here are the data:

Energy Intake (kcal/d)

a. Identify the response variable, I, $n_i$, and N for this study.

b. Make a table giving the sample size, mean, and standard deviation for each group of infants. Is it reasonable to pool the variances?

Report: Energy
Time    Mean     N    Std. Deviation   Median   Minimum   Maximum
BF4     570.00   19   122.958          609.00   209       738
BF5     483.00   18   112.948          512.00   177       639
BF6     541.88    8    93.963          506.50   445       703
Total   530.20   45   118.906          528.00   177       738

c. Show side-by-side boxplots for the 3 groups in Time.

d. Run the analysis of variance. Report the F statistic and P-value. Write the hypotheses for your test. What do you conclude?

ANOVA: Energy
                 Sum of Squares   df   Mean Square   F       Sig.
Between Groups   71288.325        2    35644.163     2.718   .078
Within Groups    550810.9         42   13114.545
Total            622099.2         44

Should we now do the Bonferroni multiple comparisons procedure for this example? Why or why not?

Chapter 1: Looking at Data: Distributions
Section 1.1: Introduction; Displaying Distributions with Graphs
Section 1.2: Describing Distributions with Numbers

Learning goals for this chapter:
• Identify categorical and quantitative variables.
• Interpret, create (by hand and with SPSS), and know when to use: bar graphs, pie charts, stemplots (standard, back-to-back, split), histograms, and boxplots (regular, modified, side-by-side).
• Describe the shape, center, and spread of data
distributions.
• Define, calculate (by hand and with SPSS), and know when to use measures of center (mean vs. median) and spread (range, 5-number summary, IQR, variance, standard deviation).
• Understand what a resistant measure of center and spread is and when this is important.
• Use the 1.5×IQR rule to look for outliers.
• Draw a Normal curve in correct proportions and identify the mean/median, standard deviation, middle 68%, middle 95%, and middle 99.7%.
• Perform calculations with the empirical rule, both backwards and forwards.
• Understand the need for standardization.

Big picture: what do we learn in this chapter?
Individuals vs. Variables
Categorical vs. Quantitative Variables
Graphs:
• Bar graphs and pie charts (categorical variables)
• Histograms and stemplots (quantitative variables); good for checking for symmetry and skewness
• Boxplots (quantitative variables); graphical display of the 5-number summary; modified boxplots show outliers
Describing distributions:
• Shape: symmetric/skewed, unimodal/bimodal/multimodal
• Center: mean or median
• Spread: usually standard deviation/variance or IQR (from the 5-number summary)
• Outliers
If you have a symmetric distribution with no outliers, use the mean and standard deviation.
If you have a skewed distribution and/or you have outliers, use the 5-number summary instead.

2 components in describing data or information:
Individuals: objects being described by a set of data (people, households, cars, animals, corn, etc.)
Variables: characteristics of individuals (height, yield, length, age, eye color, etc.)
• Categorical: places an individual into one of several groups (gender, eye color, college major, hometown, etc.)
• Quantitative: attaches a numerical value to a variable so that adding or averaging the values makes sense (height, weight, age, income, yield, etc.)

Distribution of a variable: describes what values a variable takes and how often it takes those values.

If you have more than one variable in your problem, you should look at each variable by itself before you look at relationships between the variables.

Example: Identify whether the following questions would give you categorical or quantitative data.
a. What letter grade did you get in your Calculus class last semester?
b. What was your score on the last exam?
c. Who will you vote for in the next election?
d. How many votes did George W. Bush get?
e. How many red M&Ms are in this bag?
f. Which type of M&Ms has more red ones: peanut or plain?

It's always a good idea to start by displaying variables graphically before you do any other statistical analysis. What kind of graph should you use? That depends on whether you have a categorical or quantitative variable.

Categorical Variables: Bar graphs or pie charts

"Messy room" example: In a poll of 200 parents of children ages 6 to 12, respondents were asked to name the most disgusting things ever found in their children's rooms. The results are below (J&C 2005).

Most disgusting thing                            # of parents   % of parents
Food-related                                     106            53
Animal- and insect-related nuisances             22             11
Clothing (dirty socks and underwear especially)  22             11
Other                                            50             25

Bar graph: can use either # of parents (like below) or % of parents.
[Bar graph: # of parents (0 to 120) by type of disgusting mess (animal, clothing, food, other); cases weighted by # of parents]

Pie chart: needs % of parents.
[Pie chart: type of disgusting mess (animal, clothing, food, other); cases weighted by # of parents]

Quantitative Variables: Stemplots, histograms, and boxplots (discussed a little later)

Example: You investigate the amount of time students spend online (in minutes). You study 28 students, and their times are listed below. Show the distribution of times with a stemplot.

7 42 72 20 24 25 25 28 28 30 32 35 43 44 45 46 47 48 48 50 51 75 77 78 79 83 87 88

To create a stemplot by hand:
1. Put the data in order from smallest to largest.
2. The stem will be all digits for a data point except for the last one. Write the stems in a vertical line. (Think of 7 as being 07 so that all the numbers have a digit in the tens place.)
3. The leaf will be the next digit (in this case, the ones place) from each data point. Write the leaves after the appropriate stem in increasing order.
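The by-hand recipe above (sort, split each value into stem and leaf, write the leaves in increasing order) can be sketched in a few lines of Python. This is only an illustration for non-negative whole numbers with the tens place as the stem; the function name stemplot is made up for this example and is not an SPSS feature.

```python
from collections import defaultdict

# Online times (minutes) for the 28 students in the example
times = [7, 42, 72, 20, 24, 25, 25, 28, 28, 30, 32, 35, 43, 44,
         45, 46, 47, 48, 48, 50, 51, 75, 77, 78, 79, 83, 87, 88]

def stemplot(data):
    """Return the lines of a stemplot: stem = all digits but the last, leaf = last digit."""
    leaves = defaultdict(list)
    for x in sorted(data):              # step 1: put the data in order
        leaves[x // 10].append(x % 10)  # steps 2 and 3: split into stem and leaf
    low, high = min(leaves), max(leaves)
    lines = []
    for stem in range(low, high + 1):   # keep empty stems so gaps stay visible
        row = " ".join(str(leaf) for leaf in leaves[stem])
        lines.append(f"{stem} | {row}".rstrip())
    return lines

for line in stemplot(times):
    print(line)
```

Empty stems (here 1 and 6) are printed on purpose: leaving them out would hide the gaps in the distribution.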
It is possible to trim any digits that you feel may be unnecessary. For example, if our second data point had been 20.3, we would probably choose to ignore the .3 for the purposes of the stemplot so that we could create a more reasonable stemplot. If we did not ignore this .3, then our stems would have been 07, 08, 09, 10, 11, 12, 13, ..., 88, with decimal numbers as our leaves. This would show a very uniform stemplot with only one leaf for each stem (all leaves would be 0 except for the 3). This would not be helpful to us at all. It makes much more sense to use the tens place for the stem and the ones place as the leaves in this example.

Stemplot:
0 | 7
1 |
2 | 0 4 5 5 8 8
3 | 0 2 5
4 | 2 3 4 5 6 7 8 8
5 | 0 1
6 |
7 | 2 5 7 8 9
8 | 3 7 8

A split stemplot just has more stems. There are several ways to split the stems. Here they are split by fives:
0 |
0 | 7
1 |
1 |
2 | 0 4
2 | 5 5 8 8
3 | 0 2
3 | 5
4 | 2 3 4
4 | 5 6 7 8 8
5 | 0 1
5 |
6 |
6 |
7 | 2
7 | 5 7 8 9
8 | 3
8 | 7 8

Why do we need split stemplots? Sometimes it is easier to see the shape of the data with more stems. Sometimes a regular stemplot is better. If you're not sure, try it both ways and see if a pattern appears.

Try a stemplot and a split stemplot with this data (use the hundreds place for stems):
3 4 17 18 39 93 102 110 143 178 250 278 299 3001

Histograms: Sorting the quantitative data into bins.
How many bins?
• Not too many bins with either 0 or 1 counts
• Not overly summarized so that you lose all the information
• Not so detailed that it is no longer a summary

[Figure: three histograms of Beak Length (15 to 40) labeled "Too few bins", "OK", and "Too many bins"]

Histograms vs. Bar graphs:

Histograms                                            Bar graphs
The bars for each interval touch each other.          The bars for each category do not touch each other. There are spaces between the bars.
Histograms have a continuous, quantitative x-axis     Bar graphs can have the categories on the x-axis listed in any order
with the x-values in order.                           (alphabetical, biggest-to-smallest, etc.).
Quantitative variables                                Categorical variables

[Figures: histogram of Beak Length (15 to 40) and an example bar graph]

Histograms vs. Stemplots:

Histograms                                            Stemplots
Quantitative variables                                Quantitative variables
Good for big data sets, especially if technology      Good for small data sets; convenient for back-of-the-envelope calculations.
is available.                                         Rarely found in scientific or laymen publications.
Uses a box to represent each data point.              Uses a digit to represent each data point.

[Figures: histogram of Beak Length (15 to 40) and a stemplot of the same data]

You've drawn your graph (histogram or stemplot). Now what? Look for the overall pattern and any outliers. The pattern is described by shape, center, and spread.

1. Shape
• # of peaks: unimodal (1), bimodal (2), multimodal (> 2)
• Where the long tail is:
   Symmetric: Mean = Median
   Right skewed (long tail on the right): Median < Mean
   Left skewed (long tail on the left): Median > Mean
To describe the shape, use a histogram with a smoothed curve highlighting the overall pattern of the distribution (don't get overly detailed).

2. Center
If the distribution is symmetric, the mean will equal the median, but otherwise these numbers are not the same.

a. Mean: arithmetic average
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
where n = the total # of observations and $x_i$ = an individual observation.

b. Mode: the most common number (biggest peak).

c. Median (M): midpoint of the distribution, such that 1/2 the observations are smaller and 1/2 the observations are larger. The median is not as affected by outliers as the mean is (the median is resistant to outliers).
To find the median:
i. Order the data from smallest to largest.
ii. Count the # of observations (n).
iii. Calculate $\frac{n+1}{2}$ to find the center of the data set.
iv. If n is odd, M is the data point at the center of the data set.
v. If n is even, $\frac{n+1}{2}$ falls between 2 data points, called the
middle pair. M = the average of the middle pair.

Examples of center:
Find the mean and median of the following 7 numbers in Dataset A: 23, 25, 32.5, 33, 67, 1, 20
Find the mean and median of the following 8 numbers in Dataset B: 1, 2, 4, 6, 8, 9, 12, 13

3. Spread

a. Range = max − min (simplest, not always the most helpful).

b. Variance ($s^2$): average of the square of deviations of observations from the mean.
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

c. Standard Deviation (s): square root of the variance; common way for measuring how far observations are from the mean.

Example of finding the standard deviation by hand, for the data 0, 2, 4:
1. Calculate the mean.
2. Calculate the variance.
3. Take the square root of the variance.

d. Pth percentile: value such that p% of the observations fall at or below it.
Median (M) = 50th percentile
First Quartile (Q1) = 25th percentile
Third Quartile (Q3) = 75th percentile
How do you find quartiles? Think of them as mini-medians. Leave the median out and then find the median of what is left over on the left side (Q1) and what is left over on the right side (Q3).
Find the 1st and 3rd quartiles of the following 7 numbers in Dataset A.

e. 5-Number Summary: Min, Q1, M, Q3, Max

f. Interquartile Range (IQR) = Q3 − Q1
Call an observation a suspected outlier if it is > Q3 + 1.5 × IQR OR < Q1 − 1.5 × IQR.

g. Boxplots: Graph of the 5-number summary. Modified boxplots have lines extend from the box out to the smallest and largest observations which are NOT outliers. Dots mark any outliers. We will always ask for the modified boxplot, but if there are no outliers, the modified and regular boxplots look exactly the same.

[Figure: boxplot for Dataset A with its 5-number summary marked (M = 25, Q3 = 33, Max = 67)]
Since there were no outliers in this dataset, a regular boxplot and a modified boxplot look exactly the same for this data.

For the online time example with 2 additional data points added in, list the 5-number summary, find any outliers present, and show a boxplot and modified boxplot:
7 20 24 25 25 28 28 30 32 35 42 43 44 45 46 47 48 48 50 51 72 75 77 78 79 83 87 88 135 151

How do you know which method is best for determining center and spread?
• 5-Number Summary: better for skewed distributions or distributions with outliers.
• Mean and Standard Deviation: good for reasonably symmetric distributions free of outliers.
• Always start with a graph.

In the internet time example, here is how the mean/standard deviation and 5-number summary are affected by the outliers:
[Table comparing the mean, standard deviation, and 5-number summary with and without the outliers 135 and 151]

"The Median vs. the Mean in the Age of Average" by Mike Pesca on NPR's Day to Day, 7/19/06: http://www.npr.org/templates/story/story.php?storyId=5567890

Do you always have to do all of this by hand? NO! Statistical software packages like SPSS can make life much easier for you, but it's a good idea to know how to do these by hand so you can make sense of your output. Also, on the exam you won't have access to a computer. Read over your SPSS manual and get comfortable with using SPSS. You will have a chance to practice on the HW for this week, and you will work on it in lab on Friday.

Enter your data, then Analyze > Descriptive Statistics > Explore. Follow the instructions on p. 48 of the SPSS manual. The output from SPSS for the internet time problem looks like:

Descriptives: Time spent on the web
                                                    Statistic   Std. Error
Mean                                                54.77       5.961
95% Confidence Interval for Mean   Lower Bound      42.58
                                   Upper Bound      66.96
5% Trimmed Mean                                     52.13
Median                                              46.50
Variance                                            1065.840
Std. Deviation                                      32.647
Minimum                                             7
Maximum                                             151
Range                                               144
Interquartile Range                                 48
Skewness                                            1.314       .427
Kurtosis                                            1.977       .833

[SPSS Stem-and-Leaf Plot and boxplot for time spent on the web; the plot lists Extremes (>=151), Stem width: 100, Each leaf: 1 case(s)]

You could also try calculating the mean and standard deviation without the outlier for comparison. SPSS can also give you the Quartiles (listed under Percentiles), but these are not necessarily the same as what you would get by hand. The Weighted Average and Tukey's Hinges
are not the same method we use. For this class, whenever we ask you to calculate the Quartiles, we want you to do them by hand.

What if you want to compare the results from two or more different groups? Use side-by-side boxplots or back-to-back stemplots for your graphs.

[Figure 1.17, Introduction to the Practice of Statistics, Fifth Edition: side-by-side boxplots for Two-city, Two-hwy, Mini-city, and Mini-hwy]

Preview of Section 1.3:
A z-score tells us how many standard deviations away from the mean an observation is. This is also called getting a standardized value.

Why is standardization useful? For comparing apples to oranges.

Example (p. 88, Problem 1.99): Jacob scores 16 on the ACT. Emily scores 670 on the SAT. Assuming that both tests measure scholastic aptitude, who has the higher score? The SAT scores for 1.4 million students in a recent graduating class were roughly normal with a mean of 1026 and standard deviation of 209. The ACT scores for more than 1 million students in the same class were roughly normal with a mean of 20.8 and standard deviation of 4.8.

How else can we use standardization? If the distribution of observations has a bell shape, then these standardized values have some special properties. One of these is the 68-95-99.7 (Empirical) Rule:
• Approximately 68% of the observations fall within $1\sigma$ of the mean (between $\mu - 1\sigma$ and $\mu + 1\sigma$).
• Approximately 95% of the observations fall within $2\sigma$ of the mean (between $\mu - 2\sigma$ and $\mu + 2\sigma$).
• Approximately 99.7% of the observations fall within $3\sigma$ of the mean (between $\mu - 3\sigma$ and $\mu + 3\sigma$).

$P(\mu - 1\sigma < X < \mu + 1\sigma) \approx 0.68$
$P(\mu - 2\sigma < X < \mu + 2\sigma) \approx 0.95$
$P(\mu - 3\sigma < X < \mu + 3\sigma) \approx 0.997$

[Figure: bell-shaped curve marking 68%, 95%, and 99.7% of the data, with the horizontal axis in standard deviations away from the mean (z-score) from −3 to 3]

The axis counts standard deviations away from the mean, so a z-score of −2 could also be written as $-2\sigma$, for example.

The mean and the median of a bell-shaped curve are in the middle. This is shown with a 0 because the mean is 0 standard deviations away from itself.

The most famous bell-shaped distribution is the Normal distribution. We will spend several lectures talking about it for Section 1.3, and it will be important to everything we do for the rest of the semester.

Example: Checking account balances are approximately Normally distributed with a mean of 1325 and a standard deviation of 25.
a. Between what numbers do 68% of the balances fall?
b. Above what number do 2.5% of the balances lie?
c. Approximately what percent of balances are between 1250 and 1400?

Chapters 2 and 10: Least Squares Regression

Learning goals for this chapter:
• Describe the form, direction, and strength of a scatterplot.
• Use SPSS output to find the following: least-squares regression line, correlation, $r^2$, and estimate for $\sigma$.
• Interpret a scatterplot, residual plot, and Normal probability plot.
• Calculate the predicted response and residual for a particular x-value.
• Understand that least-squares regression is only appropriate if there is a linear relationship between x and y.
• Determine explanatory and response variables from a story.
• Use SPSS and the t table to find the confidence interval for the regression slope and intercept.
• Perform a hypothesis test for the regression slope and for zero population correlation/independence, including stating the null and alternative hypotheses, obtaining the test statistic and P-value from SPSS, and stating the conclusions in terms of the story.
• Understand that correlation and causation are not the same thing.
• Estimate correlation for a scatterplot display of data.
• Distinguish between prediction and extrapolation.
• Check for differences between outliers and influential outliers by rerunning the regression.
• Know that scatterplots and regression lines are based on sample data, but hypothesis tests and confidence intervals give you information about the population parameter.

When you have 2 quantitative variables and you want to look at the relationship between them, use a scatterplot. If the scatterplot looks linear, then you can do least squares regression to get an equation of a line that uses x to explain what
happens with y.

The general procedure:
1. Make a scatterplot of the data from the x and y variables. Describe the form, direction, and strength. Look for outliers.
2. Look at the correlation to get a numerical value for the direction and strength.
3. If the data is reasonably linear, get an equation of the line using least squares regression.
4. Look at the residual plot to see if there are any outliers or the possibility of lurking variables. (Patterns bad, randomness good.)
5. Look at the normal probability plot to determine whether the residuals are normally distributed. (The dots sticking close to the 45-degree line is good.)
6. Look at hypothesis tests for the correlation, slope, and intercept. Look at confidence intervals for the slope, intercept, and mean response, and at the prediction intervals.
7. If you had an outlier, you should rework the data without the outlier and comment on the differences in your results.

Association
• Positive, negative, or no association
• Remember: ASSOCIATION or CORRELATION is NOT the same thing as CAUSATION. (See chapter 325 notes.)

Response variable
• Y
• Dependent variable
• Measures an outcome of a study

Explanatory variable
• X
• Independent variable
• Explains or is related to changes in the response variables (p. 105)

Scatterplots
• Show the relationship between 2 quantitative variables measured on the same individuals.
• Dots only; don't connect them with a line or a curve.
• Form: Linear? Nonlinear? No obvious pattern?
• Direction: Positive or negative association? No association?
• Strength: How closely do the points follow a clear form? Strong or weak or moderate?
• Look for OUTLIERS.

Correlation
• Measures the direction and strength of the linear relationship between 2 quantitative variables.
• Symbol: r
• It is computed from the standardized value of each observation with respect to the mean and standard deviation:
  $r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$
  where we have data on variables x and y for n individuals.
• You won't need to use this formula, but SPSS will.

Using SPSS to get correlation: Use the Pearson Correlation output. Analyze >
Correlate > Bivariate (see page 55 in the SPSS manual). The SPSS manual tells you where to find r using the least squares regression output, but this r is actually the ABSOLUTE VALUE OF r, so you need to pay attention to the direction yourself. The Pearson Correlation gives you the actual r with the correct sign.

Properties of correlation
• X and Y both have to be quantitative.
• It makes no difference which you call X and which you call Y.
• Does not change when you change the units of measurement.
• If r is positive, there is a positive association between X and Y: as X increases, Y increases.
• If r is negative, there is a negative association between X and Y: as X increases, Y decreases.
• $-1 \le r \le 1$
• The closer r is to 1 or to $-1$, the stronger the linear relationship.
• The closer r is to 0, the weaker the linear relationship.
• Outliers strongly affect r. Use r with caution if outliers are present.

[Figure 2.10: scatterplots illustrating correlations of strength 0, 0.3, 0.5, 0.7, 0.9, and 0.99. Introduction to the Practice of Statistics, Fifth Edition, 2005, W. H. Freeman and Company.]

Example: We want to examine whether the amount of rainfall per year increases or decreases corn bushel output. A sample of 10 observations was taken, and the amount of rainfall (in inches) was measured, as was the subsequent growth of corn.

[Scatterplot: corn yield versus amount of rain (in).]

a) What does the scatterplot tell us? What is the form? Direction? Strength? What do we expect the correlation to be?

[SPSS Correlations output for corn yield (bushels) and amount of rain (in): Pearson Correlation, Sig. (2-tailed), and N. Correlation is significant at the 0.01 level (2-tailed).]

Inference for Correlation
• R = correlation
• $R^2$ = % of variation in Y explained by the regression line (the closer to 100%, the better)
• $\rho$ (Greek letter "rho") = correlation for the population
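The correlation formula described earlier (the sum of the products of the standardized x and y values, divided by n - 1) can be sketched in a few lines of Python. This is only an illustration: the function name `correlation` and the rain/corn numbers below are made up for the sketch and are not the class data set, and SPSS remains the tool used in this course.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Pearson correlation r from standardized values (hypothetical helper)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    # Sum of products of z-scores, divided by n - 1
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


# Made-up illustration data (NOT the corn data from the example)
rain = [2.0, 3.0, 4.0, 5.0, 6.0]
corn = [70.0, 81.0, 88.0, 100.0, 108.0]

r = correlation(rain, corn)
print("r   =", round(r, 3))
print("r^2 =", round(r ** 2, 3))  # share of variation in y explained by the line
```

Swapping the two lists gives the same r, which matches the property that it makes no difference which variable you call X and which you call Y.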
When $\rho = 0$, there is no linear association in the population, so X and Y are independent (if X and Y are both normally distributed).

Hypothesis test for correlation: To test the null hypothesis $H_0: \rho = 0$, SPSS will compute the t statistic
$t = \frac{r \sqrt{n-2}}{\sqrt{1-r^2}}$
with degrees of freedom = n - 2 for simple linear regression.

b) Are corn yield and rain independent in the population? Perform a test of significance to determine this.
c) Do corn yield and rain have a positive correlation in the population? Perform a test of significance to determine this.

This test statistic for the correlation is numerically identical to the t statistic used to test $H_0: \beta_1 = 0$.

Can we do better than just a scatterplot and the correlation in describing how x and y are related? What if we want to predict y for other values of x?

Least Squares Regression fits a straight line through the data points that will minimize the sum of the squared vertical distances of the data points from the line.
• Minimizes $\sum_{i=1}^{n} e_i^2$
• Equation of the line is $\hat{y} = b_0 + b_1 x$, with $\hat{y}$ the predicted y-value.
• Slope of the line is $b_1 = r \frac{s_y}{s_x}$, where the slope measures the amount of change caused in the predicted response variable when the explanatory variable is increased by one unit.
• Intercept of the line is $b_0 = \bar{y} - b_1 \bar{x}$, where the intercept is the value of the predicted response variable when the explanatory variable = 0.

Type of line                  Equation of line                                  Slope        y-intercept
Ch. 10 Sample                 $\hat{y} = b_0 + b_1 x$                           $b_1$        $b_0$
Ch. 10 Population (model)     $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$     $\beta_1$    $\beta_0$

Using the corn example, find the least squares regression line. Tell SPSS to do Analyze > Regression > Linear. Put rain into the independent box and corn into the dependent box. Click OK.

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .995(a)   .991       .989                1.290
a. Predictors: (Constant), amount of rain (in)
b. Dependent Variable: corn yield (bushels)

ANOVA
Model            Sum of Squares   df   Mean Square   F         Sig.
1  Regression    1397.195         1    1397.195      840.070   .000(a)
   Residual      13.305           8    1.663
   Total         1410.500         9
a. Predictors: (Constant), amount of rain (in)
b. Dependent Variable: corn yield (bushels)

Coefficients
                          Unstandardized Coefficients   Standardized Coefficients                    95% Confidence Interval for B
Model                     B         Std. Error          Beta                        t        Sig.    Lower Bound   Upper Bound
1  (Constant)             50.835    1.728                                           29.421   .000    46.851        54.819
   amount of rain (in)    9.625     .332                .995                        28.984   .000    8.859         10.391
a. Dependent Variable: corn yield (bushels)

d) What is the least-squares regression line equation?

The scatterplot with the least squares regression line looks like:

[Scatterplot: corn yield versus amount of rain (in) with the least-squares regression line, $R^2 = 0.9906$. $R^2$ is the percent of variation in corn yield explained by the regression line with rain: 99.06%.]

Confidence Intervals and Significance Tests for Regression Slope and Intercept
• Level C confidence interval for the intercept $\beta_0$ is $b_0 \pm t^* SE_{b_0}$.
• Level C confidence interval for the slope $\beta_1$ is $b_1 \pm t^* SE_{b_1}$.
• SPSS will also give you these confidence intervals for 95%, but you may have to use the estimates for the coefficients and their standard errors to find other confidence intervals (use the t table and n - 2 degrees of freedom to get $t^*$).
• Hypothesis testing for $H_0: \beta_1 = 0$: the test statistic is $t = \frac{b_1}{SE_{b_1}}$ with df = n - 2. SPSS will give you the test statistic under "t" and the 2-sided P-value under "Sig."

e) Give a 95% confidence interval for the slope.
f) Give a 90% confidence interval for the slope.
g) Is the slope positive? Perform a test of significance.
h) What % of the variability in corn yield is explained by the least squares regression line?
i) What is the estimate of the standard error of the model?

What do we mean by prediction or extrapolation? Use your least-squares regression line to find y for other x-values.
• Prediction: using the line to find y-values corresponding to x-values that are within the range of your data x-values.
• Extrapolation: using the line to find y-values corresponding to x-values that are outside the range of your data x-values.
• Be careful about extrapolating y-values for x-values that are far away from the x data you currently have. The line may not be valid for wide ranges of x!

Example: On the rain/corn data above, predict the corn yield for
a) 5 inches of rain
b) 72 inches of rain
c) 0 inches of rain
d) 100 inches of rain
For which amounts of rainfall above do you think the line does a good job of predicting actual corn yield? Why?

[Cartoon: "Deadly Sins" by J. B. Landers, on www.causeweb.org, used with permission.]

Assumptions for Regression
1. Repeated responses y are independent of each other.
2. For any fixed value of x, the response y varies according to a Normal distribution.
3. The mean response $\mu_y$ has a straight-line relationship with x.
4. The standard deviation of y ($\sigma$) is the same for all values of x. The value of $\sigma$ is unknown.

How do you check these assumptions?
• Scatterplot and $R^2$: Do you have a straight-line relationship between X and Y? How strong is it? How close to 100% is $R^2$? Hopefully no outliers. (#3)
• Normal probability plot: Are the residuals approximately normally distributed? Do the dots fall fairly close to the diagonal line (which is always there in the same spot)? (#2)
  [Figure: Normal P-P Plot of Regression Standardized Residual. Dependent variable: corn yield (bushels). Axes: Expected Cum Prob versus Observed Cum Prob.]
• Residual plot: Do you have constant variability? Do the dots on your residual plot look random and fairly evenly distributed above and below the 0 line? Hopefully no outliers. (#1 and #4)

Residual is the vertical difference between the observed y-value and the regression line y-value:
$e_i = y_i - \hat{y}_i$ (observed y minus predicted y)

Residual plot: scatterplot of the regression residuals against the explanatory variable (e vs. x).
• The e-axis has both negative and positive values but is centered about e = 0 (the mean of the least-squares residuals is always zero, $\bar{e} = 0$).
• Good: total randomness, no pattern, approximately the same number of points above and below the e = 0 line.
• Bad: obvious pattern, funnel shape, parabola, more points above 0 than below (or vice versa). If you have a pattern, your data does not necessarily fit the model (line) well.

[Figures: example residual plots.]
Outliers
• Outliers are observations that lie outside the overall pattern of the other observations.
• Outliers in the y direction of a scatterplot have large regression residuals ($e_i$).
• Outliers in the x direction of a scatterplot are often influential for the regression line.
• An observation is influential if removing it would markedly change the result of the calculation.
• Outliers can drastically affect the regression line, correlation, means, and standard deviations.
• You can draw a second regression line that doesn't include the outliers. If the second line moves more than a small amount when the point is deleted, or if $R^2$ changes much, the point is influential.

Which hypothesis test do you use when? If you're not sure whether to use $\beta_1$ or $\rho$, here are some guidelines. The test statistics and P-values are identical for either symbol.

Review of SPSS instructions for Regression

When you set up your regression, you click on Analyze > Regression > Linear. Put in your y variable for "dependent" and your x variable for "independent" on the gray screen. Don't hit "OK" yet, though.

At the bottom of that gray screen, click on "Statistics" and then click on "confidence intervals" if you will want the confidence intervals for any part of the problem. You can also click on "descriptives" if you want information like the mean and standard deviation for each variable. Click "continue" on the Statistics gray screen.

Back on the regression gray screen, click on "Plots" and then click on "normal probability plot". Click "continue" on the Plots gray screen.

Back on the regression gray screen, click on "Save" and then click on "unstandardized residuals". Click "continue" on the Save gray screen, and then "OK" on the big Regression gray screen.

You still won't have a residual plot yet. If you click back to your data input screen, you now have a new column called "res_1". To make the residual plot, you
follow the same steps for making a scatterplot: go to Graphs > Scatter > Simple, then put "res_1" in for y and your x variable in for x. Click "OK". Once you see your residual plot, you'll need to double-click on it to go to Chart Editor. On the Chart Editor toolbar, you can see a button that shows a graph with a horizontal line. Click on that button. Make sure that the y-axis is set to 0.

How will I ever use this stuff again in my future career?

Testimonial from a Former Student (emails received June 9, 2005)

Ellen,
I hope that all is well. It is your favorite student here, Eric, from your Fall 04 Stat 301 class. I need some help. Believe it or not, you were right, and I am using stat everyday, all day long, but I am drawing a blank. I am trying to determine a linear regression line and I can't remember the equation. Y = mX + b of course, but on the regression analysis output, what do I use as m & b? I am very disappointed in myself because I can't remember, but alas I am asking for help. I tried looking for the notes on your home page but I couldn't find it any longer, and I think it was taken down for the summer. If you could help me, that would be great. Hope your summer months are spent by the pool.
Regards,
Eric

Ellen,
As for the project: It essentially was a regression to determine the amount of out-of-state cotton seed a crushing plant would need when their own state's current production increased or decreased. It is not very statistically sound since we are only using 5 years' worth of data, but it is just a tool in price analysis that we are using to determine a spread between plants that we can buy the cotton seed at. As for using me, that would be great. Let those impressionable young students see that we are using what they learn on an everyday basis, some more than others, especially me since I deal with prices and trading.
Eric

Example: The scatterplot below shows the calories and sodium content for each of 17 brands of meat hot dogs.

a) Describe the main features of the relationship.

[Scatterplot: sodium content versus calories.]
b) What is the correlation between calories and sodium?

Correlations
                                          Calories   Sodium content
Calories          Pearson Correlation     1          .863**
                  Sig. (2-tailed)                    .000
                  N                       17         17
Sodium content    Pearson Correlation     .863**     1
                  Sig. (2-tailed)         .000
                  N                       17         17
**. Correlation is significant at the 0.01 level (2-tailed).

c) Report the least-squares regression line.

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .863(a)   .745       .728                48.913
a. Predictors: (Constant), Calories
b. Dependent Variable: Sodium content

Coefficients
                   Unstandardized Coefficients   Standardized Coefficients                   95% Confidence Interval for B
Model              B          Std. Error         Beta                        t        Sig.   Lower Bound   Upper Bound
1  (Constant)      -91.185    77.812                                         -1.172   .260
   Calories        3.212      .485               .863                        6.628    .000   2.179         4.245
a. Dependent Variable: Sodium content

d) Show a residual plot and comment on its features.

[Residual plot: residuals versus Calories.]

e) Is there an outlier? If so, where is it?

f) Show a normal probability plot and comment on its features.

[Figure: Normal P-P Plot of Regression Standardized Residual. Dependent variable: Sodium content. Axes: Expected Cum Prob versus Observed Cum Prob.]

g) Leave off the outlier and recalculate the correlation and another least squares regression line. Is your outlier influential? Explain your answer.

Correlations
                                 cal2      sod2
cal2    Pearson Correlation      1         .834**
        Sig. (2-tailed)                    .000
        N                        16        16
sod2    Pearson Correlation      .834**    1
        Sig. (2-tailed)          .000
        N                        16        16
**. Correlation is significant at the 0.01 level.

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .834(a)   .695       .674                36.406
a. Predictors: (Constant), cal2
b. Dependent Variable: sod2

Coefficients
                   Unstandardized Coefficients   Standardized Coefficients                  95% Confidence Interval for B
Model              B         Std. Error          Beta                        t       Sig.   Lower Bound   Upper Bound
1  (Constant)      46.900    69.371                                          .676    .510   -101.886      195.686
   cal2            2.401     .425                .834                        5.653   .000   1.490         3.312
a. Dependent Variable: sod2

h) If there is a new brand of meat hot dog with 150 calories per frank, how many milligrams of sodium do you estimate that one of these hot dogs contains?

Section 1.3: The Normal Distribution

Learning goals for this chapter:
• Know when and how to use the empirical (68-95-99.7) rule.
• Understand what the standard Normal distribution is and how it is related to other Normal distributions.
• Calculate both forwards and backwards Normal distribution problems.
• Draw a Normal distribution curve appropriate for a story.

Normal curves: $N(\mu, \sigma)$
• Bell-shaped, unimodal, symmetric.
• Mean $\mu$: always in the center of the curve (the peak).
• Standard deviation $\sigma$: controls the spread of the graph (wide or narrow).
• Probabilities are just the area under the curve (integral) between the points of interest.
• Total area under the Normal curve = 1 or 100%.
• Curve stretches from $-\infty$ to $\infty$, but the area under the curve gets very small the farther you go from the mean.

In the last set of Chapter 1 notes, we discussed bell-shaped distributions, standardization, and the 68-95-99.7 rule.

Standard Normal Distribution

What if you need different probabilities for $X \sim N(\mu, \sigma)$? Do we have to use Calculus? No! We have a great shortcut: the Normal table (Table A in the front cover of your book).
• You must convert $X \sim N(\mu, \sigma)$ to $Z \sim N(0, 1)$, where Z has the standard Normal distribution.
• Convert using the formula $Z = \frac{x - \mu}{\sigma}$.
• Z-scores are what you need in order to use Table A in the front cover of your book.
• Z-scores also let you compare 2 values from different Normal distributions to see their probabilities on the same scale.
• $P(Z < z\text{-score})$ is what you will find on the Normal table.

What if you want to know something else?
• $P(Z > z) = 1 - P(Z \le z)$
• $P(Z = z) = 0$. Therefore, $P(Z \ge z) = P(Z > z)$.
• $P(a < Z < b) = P(Z < b) - P(Z < a)$

Z-scores tell you how far (measured in standard deviations) the original observations fall from the mean.

To find a probability if you have $X \sim N(\mu, \sigma)$ and a sample score to work with:
1. Convert X to Z: $Z = \frac{x - \mu}{\sigma}$.
2. Rearrange (if necessary) the inequality so that it uses < or ≤. Remember that $P(Z > z) = 1 - P(Z \le z)$.
3. Look up the probability for your z-score on Table A.
4. If the z-score is between 2 table values, either pick the closer one or average the two closest values.
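The forwards procedure above can be checked with a short Python sketch. It uses `statistics.NormalDist` from the standard library (Python 3.8 or later) in place of Table A, so results may differ from the table in the last decimal place. The helper names are made up for this sketch, and the balances 1300 and 1350 are illustrative values for the checking-account distribution N(1325, 25), not questions from the notes.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)  # Z ~ N(0, 1)


def prob_below(x, mu, sigma):
    """P(X < x): convert x to a z-score, then take the lower-tail area."""
    z = (x - mu) / sigma
    return STANDARD_NORMAL.cdf(z)


def prob_above(x, mu, sigma):
    """P(X > x) = 1 - P(X <= x), the '1 minus' rule."""
    return 1.0 - prob_below(x, mu, sigma)


def prob_between(a, b, mu, sigma):
    """P(a < X < b) = P(X < b) - P(X < a)."""
    return prob_below(b, mu, sigma) - prob_below(a, mu, sigma)


MU, SIGMA = 1325, 25  # checking-account balances from the notes

print(round(prob_below(1300, MU, SIGMA), 4))          # z = -1, prints 0.1587
print(round(prob_above(1350, MU, SIGMA), 4))          # z = +1, prints 0.1587
print(round(prob_between(1300, 1350, MU, SIGMA), 4))  # within 1 sigma, prints 0.6827
```

The last value is the exact version of the "approximately 68%" in the empirical rule.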
Example: Checking account balances are $N(1325, 25)$. Bill has a balance of 1270.
a) What is the probability an account will have less money than Bill's?
b) What is the probability an account balance will be more than 1380?
c) What is the probability an account balance will be exactly 1380?
d) What is the probability an account will have less than 1325 (the mean)?
e) What is the probability that an account will have between 1310 and 1390?
f) What is the probability an account will have less than 10?

Backwards Normal Problems

If you are given the probability and know $X \sim N(\mu, \sigma)$ but you don't know the sample's score (backwards from the previous problems):
1. Treat it as $P(Z < z\text{-score})$ = the probability. Work backward from the probability in Table A to a corresponding z-score.
2. Adjust to < if necessary by doing the "1 -" trick.
3. If you have a 2-sided probability, use $P(-z_0 < Z < z_0) = 2 P(Z < z_0) - 1$.
4. Convert the z-score to X with $X = \mu + z\sigma$.

Example: In the checking account example, where the balances are $N(1325, 25)$:
a) What is the account balance $x_0$ such that the percentage of balances less than it is 23%?
b) What is the account balance $x_0$ such that the probability of a balance being more than it is 0.15?
c) Between what 2 central values do 40% of the balances fall?

Mixed forwards and backwards problems: IQ scores are Normally distributed with a mean of 100 and a standard deviation of 15.
1. What is the IQ range for the top 10% of people?
2. What percent of the population scores between 100 and 120?
3. Between what two scores does the central 20% of the population fall?
4. If a person is selected at random, what is the chance he scores below an 85?

[Table A: Standard Normal probabilities. Table entry for z is the area under the standard Normal curve to the left of z.]

Chapter 7
Section 7.1: Inference for the Mean of a Population
Section 7.2: Comparing Two Means

Learning goals for this chapter:
• Understand what inference is and why it is needed.
• Know that all inference techniques give us information about
the population parameter.
• Explain what a confidence interval is and when it is needed.
• Calculate a confidence interval for the population mean when the population standard deviation is unknown.
• Know the assumptions that must be met for doing inference for the population mean when the population standard deviation is unknown (robustness) for 1-sample mean, matched pairs, and 2-sample comparison of means.
• Know how to write hypotheses, calculate a test statistic and P-value, and write conclusions in terms of the story.
• Draw Normal curve pictures to match the hypothesis test.
• Understand the logic of hypothesis testing and when a hypothesis test is needed.
• Use the confidence interval to perform a two-sided hypothesis test.
• Explain sampling variability and the difference between the population mean and the sample mean.
• Explain the difference between the population standard deviation and the sample standard deviation.
• Know which technique is most appropriate for a story: confidence interval, hypothesis test, or simple summary statistics.
• Know which inference technique is most appropriate for a story: 1-sample mean using Z, 1-sample mean using t, matched pairs, or 2-sample comparison of means.
• Interpret Normal quantile plots and histograms to determine whether the t procedures are appropriate.
• Know how to do all calculations listed above by hand (with the t table) and using SPSS.

In Chapter 6, we knew the population standard deviation $\sigma$.
• Confidence interval for the population mean $\mu$: $\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$
• Hypothesis test statistic for the population mean $\mu$: $z_0 = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$
• Used the distribution $Z \sim N(0, 1)$.

In Chapter 7, we don't know the population standard deviation $\sigma$.
• Use the sample standard deviation (s).
• Confidence interval for the population mean $\mu$: $\bar{x} \pm t^* \frac{s}{\sqrt{n}}$
• Hypothesis test statistic for the population mean $\mu$: $t_0 = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$
• The t distribution uses n - 1 degrees of freedom.

Sometimes you'll see the symbol for standard error: $SE_{\bar{x}} = \frac{s}{\sqrt{n}}$.

Using the t distribution:
• Suppose that an SRS of size n is drawn from an $N(\mu, \sigma)$ population.
• There is a
different t distribution for each sample size, so $t(k)$ stands for the t distribution with k degrees of freedom.
• Degrees of freedom = k = n - 1 = sample size - 1.
• As k increases, the t distribution looks more like the Normal distribution (because as n increases, $s \to \sigma$). $t(k)$ distributions are symmetric about 0 and are bell-shaped; they are just a bit wider than the Normal distribution.
• The t table shows upper tails only, so
  ○ if $t^*$ is negative, $P(t < t^*) = P(t > |t^*|)$.
  ○ if you have a 2-sided test, multiply $P(t > |t^*|)$ by 2 to get the area in both tails.
  ○ The Normal table showed lower tails only, so the t table is backwards.

Finding $t^*$ on the table: Start at the bottom line to get the right column for your confidence level, and then work up to the correct row for your degrees of freedom.

What happens if your degrees of freedom isn't on the table, for example df = 79? Always round DOWN to the next lowest degrees of freedom to be conservative.

... the unknown population mean yield of tomatoes ... the story. Also draw a picture of the t curve with the number and symbol for the population mean you use in your null hypothesis, the sample mean, the standard error, and the test statistic. Also shade the appropriate part of the t curve which shows the P-value.

Example (Exercise 7.37): How accurate are radon detectors of a type sold to homeowners? To answer this question, university researchers placed 12 detectors in a chamber that exposed them to 105 picocuries per liter of radon. The detector readings were as follows:

91.9   97.8   111.4   122.3   105.4   95.0
103.8  99.6   119.3   104.8   101.7   96.6

a) Is there convincing evidence that the mean reading of all detectors of this type differs from the true value of 105? Use $\alpha = 0.10$ for the test. Carry out a test in detail and write a brief conclusion. (SPSS tells us the mean and standard deviation of this data are 104.13 and 9.40, respectively.)
b) Find a 90% confidence interval for the population mean.

Now redo the above example using SPSS completely. To do just a confidence interval: enter data, then Analyze > Descriptive Statistics > Explore. Click
on Statistics and change the CI to 90%. Then hit OK. If you need to do a hypothesis test and a CI, go to Analyze > Compare Means > One-Sample T Test. Change the test value to 105 (since that is our $H_0$), change Options to 90%, and hit OK. This will give you the output below.

One-Sample Test (Test Value = 105)
                                                                            90% Confidence Interval of the Difference
                           t        df   Sig. (2-tailed)   Mean Difference   Lower     Upper
radon detector readings    -.319    11   .755              -.8667            -5.739    4.005

Using this SPSS output, what would your t-curve with shaded P-value look like if you had hypotheses of:
• $H_0: \mu = 105$, $H_a: \mu \ne 105$
• $H_0: \mu = 105$, $H_a: \mu > 105$
• $H_0: \mu = 105$, $H_a: \mu < 105$

You must choose your hypotheses BEFORE you examine the data. When in doubt, do a two-sided test.

How do you know when it is appropriate to use the t procedures?
• Very important! Always look at your data first. Histograms and Normal quantile plots (pp. 80-83 in your book) will help you see the general shape of your data.
• t procedures are quite robust against non-normality of the population except in the case of outliers or strong skewness.
• Larger samples (n) improve the accuracy of the t distribution.

Some guidelines for inference on a single mean:
• n < 15: Use t procedures if the data are close to Normal. If the data are non-Normal or if outliers are present, do not use t.
• 15 ≤ n < 40: Use t procedures except in the presence of outliers or strong skewness.
• n ≥ 40: Use t procedures even if the data are skewed.

Normal quantile plots: In SPSS, go to Graphs > Q-Q. Move your variable into the variable column and hit OK.

[Figure: Normal Q-Q Plot of Radon Detector Reading. Axes: Expected Normal Value versus Observed Value.]

Look to see how closely the data points (dots) follow the diagonal line. The line will always be a 45-degree line. Only the data points will change. The closer they follow the line, the more normally distributed the data is.

What happens if the t procedure is not appropriate? What if you have outliers or skewness with a smaller sample size (n < 40)?
Outliers: Investigate the cause of the outliers.
  ○ Was the data recorded correctly? Is there any reason why that data might be invalid (an equipment malfunction, a person lying in their response, etc.)? If there is a good reason why that point could be disregarded, try taking it out and compare the new confidence interval or hypothesis test results to the old ones.
  ○ If you don't have a valid reason for disregarding the outlier, you have to leave the outlier in and not use the t procedures.

Skewness:
  ○ If the skewness is not too extreme, the t procedures are still appropriate if the sample size is bigger than 15.
  ○ If the skewness is extreme or if the sample size is less than 15, you can use nonparametric procedures. One type of nonparametric test is similar to the t procedures except it uses the median instead of the mean. Another possibility would be to transform the data, possibly using logarithms. A statistician should be consulted if you have data which doesn't fit the t procedures' requirements. We won't cover nonparametric procedures or transformations for non-Normal data in this course, but your book has supplementary chapters (14 and 15) on these topics online if you need them later in your own research. They are also discussed on pages 4654 70 of your book.

What do you do when you have 2 lists of data instead of 1? First, decide whether you have 1 sample with 2 measurements on each unit OR 2 independent samples with one measurement each.

1. Matched Pairs (covered in 7.1)
• One group of individuals with 2 different measurements on each individual.
• Same individuals, different measurements.
• Examples: pre- and post-tests, before and after measurements.
• Confidence intervals and hypothesis tests are based on the difference obtained between the 2 measurements.
  1. Find the difference (post-test minus pre-test, or before minus after, etc.) in the individual measurements.
  2. Find the sample mean $\bar{d}$ and sample standard deviation s of these differences.
  3. Use the t distribution because the standard deviation is estimated from the data.
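The three matched-pairs steps above can be sketched in Python as follows. The function name `paired_summary` and the four pairs of scores are hypothetical and chosen only to keep the arithmetic easy to follow; they are not the nursing data. The t statistic computed here is the one-sample t statistic from earlier in this chapter applied to the differences, testing a population mean difference of 0.

```python
from math import sqrt
from statistics import mean, stdev


def paired_summary(first, second):
    """Matched pairs: work with the differences (second - first) only.

    Returns (d_bar, s_d, standard_error, t_statistic, degrees_of_freedom).
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    diffs = [b - a for a, b in zip(first, second)]  # step 1: differences
    n = len(diffs)
    d_bar = mean(diffs)                             # step 2: mean of differences
    s_d = stdev(diffs)                              # step 2: sample SD of differences
    se = s_d / sqrt(n)                              # standard error of d_bar
    t_stat = (d_bar - 0) / se                       # step 3: t statistic for H0: mean difference = 0
    return d_bar, s_d, se, t_stat, n - 1


# Hypothetical pre/post scores for 4 individuals (NOT the nursing data)
pre = [5, 6, 4, 7]
post = [6, 8, 5, 9]

d_bar, s_d, se, t_stat, df = paired_summary(pre, post)
print("mean difference:", d_bar)
print("t =", round(t_stat, 3), "with df =", df)
```

Reversing the order of the two lists flips the sign of the mean difference and of t but leaves their sizes unchanged, so the order has to match the way the difference is defined in the hypotheses.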
• Confidence interval: $\bar{d} \pm t^* \frac{s}{\sqrt{n}}$
• Hypothesis testing: $H_0: \mu_{diff} = 0$, with t-test statistic $t = \frac{\bar{d} - 0}{s / \sqrt{n}}$

Example of Matched Pairs: In an effort to determine whether sensitivity training for nurses would improve the quality of nursing provided at an area hospital, the following study was conducted. Eight different nurses were selected, and their nursing skills were given a score from 1-10. After this initial screening, a training program was administered, and then the same nurses were rated again. Below is a table of their pre- and post-training scores. Conduct a test to determine whether the training could on average improve the quality of nursing provided in the population.
a) What are your hypotheses?
b) What is the test statistic?
c) What is the P-value?
d) What is your conclusion in terms of the story?
e) What is the 95% confidence interval of the population mean difference in nursing scores?

Enter the pre- and post-training scores into SPSS. Then Analyze > Compare Means > Paired-Samples T Test. Then input both variable names and hit the arrow key. If you need to change the confidence interval, go to Options. SPSS will always do the left column of data minus the right column of data for the order of the difference. If this bothers you, just be careful how you enter the data into the program.

Paired Samples Statistics
                               Mean      N    Std. Deviation   Std. Error Mean
Pair 1   Posttraining score    6.3212    8    1.82086          .64377
         Pretraining score     5.2700    8    2.01808          .71350

Data entered as written above, with pretraining in the left column and posttraining in the right column:

Paired Samples Test
                                        Paired Differences
                                                                                 95% Confidence Interval of the Difference
                                        Mean        Std. Deviation   Std. Error Mean   Lower       Upper      t         df   Sig. (2-tailed)
Pair 1   pretraining - posttraining     -1.05125    1.47417          .52120            -2.28369    .18119     -2.017    7    .084

Data entered backwards from how it is written above, with posttraining in the left column and pretraining in the right column:

Paired Samples Test
                                                                                 95% Confidence Interval of the Difference
                                        Mean        Std. Deviation   Std. Error Mean   Lower       Upper
Pair 1   posttraining - pretraining     1.05125     1.47417          .52120            -.18119     2.28369

What's different? What's the same? Which one matches the way that you
defined $\mu_{diff}$?

2. 2-Sample Comparison of Means (covered in 7.2)
• A group of individuals is divided into 2 different experimental groups.
• Each individual receives only one treatment and/or is measured only once.
• No one unit can be in both groups.
• Responses from each sample are independent of each other.
• Examples: treatment vs. control groups, male vs. female, 2 groups of different women.
• Goal: To do a hypothesis test based on
  $H_0: \mu_A = \mu_B$ (same as $H_0: \mu_A - \mu_B = 0$)
  $H_a: \mu_A > \mu_B$ or $H_a: \mu_A < \mu_B$ or $H_a: \mu_A \ne \mu_B$ (pick one)
• The 2-Sample t Test Statistic is used for hypothesis testing when the standard deviations are ESTIMATED from the data (these are approximately t distributions, but not exact):
  $t = \frac{(\bar{x}_A - \bar{x}_B) - 0}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}}$
• Confidence Interval for $\mu_A - \mu_B$:
  $(\bar{x}_A - \bar{x}_B) \pm t^* \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}$ with df = $\min(n_A - 1, n_B - 1)$
• Equal sample sizes are recommended, but not required.
• Use the same guidelines for determining whether the t procedures are appropriate that you used for 1-sample mean and matched pairs, but use $n_A + n_B$ for the sample size.

Example of 2-Sample Comparison of Means: A group of 15 college seniors are selected to participate in a manual dexterity skill test against a group of 20 industrial workers. Skills are assessed by scores obtained on a test taken by both groups. Conduct a hypothesis test to determine whether the industrial workers had significantly better average manual dexterity skills than the students. Descriptive statistics are listed below. Also construct a 95% confidence interval for this problem.

group       n     $\bar{x}$   s
students    15    35.12       4.31
workers     20    37.32       3.83

Example of 2-Sample Comparison of Means (Exercise 7.84): The SSHA is a psychological test designed to measure the motivation, study habits, and attitudes towards learning of college students. These factors, along with ability, are important in explaining success in school. A selective private college gives the SSHA to an SRS of both male and female first-year students. The data for the women are as follows:

154 109 137 115 152 140 154 178 101 103 126 126 137 165 165 129 200 148

Here are the scores for the men:

108 140 114 91 180 115 126
92 169 146 109 132 75 88 113 151 70 115 187 104

a) Test whether the population mean SSHA score for men is different than the population mean score for women. State your hypotheses, carry out the test using SPSS, obtain a P-value, and give your conclusions.

When you enter your data into SPSS, have 2 variables: gender (type: string) and score (numeric). In the gender column, state whether a score is from a man or a woman, and in the score column, state all 38 scores. Analyze > Compare Means > Independent-Samples T Test. Move score into the Test Variables box. Move gender into the Grouping Variable box, then click Define Groups and state which of woman and man is group 1 and which is group 2; hit Continue. We will need a 90% confidence interval in part c, so go to Options to change it.

Group Statistics
         gender   N     Mean      Std. Deviation   Std. Error Mean
score    woman    18    141.06    26.436           6.231
         man      20    121.25    32.852           7.346

Independent Samples Test
                                        Levene's Test for Equality of Variances   t-test for Equality of Means
                                                                                                                                                         90% Confidence Interval of the Difference
                                        F       Sig.                              t        df        Sig. (2-tailed)   Mean Difference   Std. Error Difference   Lower    Upper
score   Equal variances assumed         .862    .359                              2.032    36        .050              19.806            9.745                   3.353    36.258
        Equal variances not assumed                                               2.056    35.587    .047              19.806            9.633                   3.538    36.073

What do we do with this "Equal variances assumed" and "Equal variances not assumed"? Always go with the bottom row, "Equal variances not assumed". This is the more conservative approach.

b) Most studies have found that the population mean SSHA score for men is lower than the population mean score in a comparable group of women. Test this supposition here.
c) Give a 90% confidence interval for the difference in population means of SSHA scores of male and female first-year students at this college.

To summarize Chapters 6 and 7:

Z vs. t:
• Z if you know the population standard deviation.
• t if you know only the sample standard deviation. (This is usually the real-life situation, and we will assume that we have only the sample standard deviation unless we are explicitly told otherwise.)

Matched pairs vs. 2-sample comparison of
means:
• Matched pairs if all units are measured twice and/or receive both treatments over time. (Before vs. after is the most common example.)
• 2-sample comparison of means if you have two separate groups, but each unit is only measured once. (Men vs. women is the most common example.)
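As a check on the 2-sample comparison of means summarized above, here is a minimal Python sketch that recomputes the test statistic for the SSHA example from the raw scores. The function name `two_sample_t` is made up for this sketch. It uses the conservative by-hand degrees of freedom, min(nA - 1, nB - 1), rather than the fractional df that SPSS reports, and the value of t* is typed in from a t table because the Python standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, variance

# SSHA scores from the example (women: n = 18, men: n = 20)
women = [154, 109, 137, 115, 152, 140, 154, 178, 101,
         103, 126, 126, 137, 165, 165, 129, 200, 148]
men = [108, 140, 114, 91, 180, 115, 126, 92, 169, 146,
       109, 132, 75, 88, 113, 151, 70, 115, 187, 104]


def two_sample_t(a, b):
    """2-sample t statistic with standard deviations estimated from the data.

    Returns (difference_in_means, standard_error, t_statistic, conservative_df).
    """
    n_a, n_b = len(a), len(b)
    diff = mean(a) - mean(b)
    se = sqrt(variance(a) / n_a + variance(b) / n_b)  # sample variances (n - 1)
    df = min(n_a - 1, n_b - 1)                        # conservative by-hand df
    return diff, se, diff / se, df


diff, se, t_stat, df = two_sample_t(women, men)
print("difference in means:", round(diff, 3))
print("standard error:", round(se, 3))
print("t =", round(t_stat, 3), "with conservative df =", df)

# 90% confidence interval; t* = 1.740 is the table value for df = 17 (typed in by hand)
t_star = 1.740
print("90% CI:", (round(diff - t_star * se, 2), round(diff + t_star * se, 2)))
```

The difference, standard error, and t statistic agree with the "Equal variances not assumed" row of the SPSS output (19.806, 9.633, and 2.056). The by-hand interval is slightly wider than the SPSS interval because the conservative df of 17 gives a larger t* than the df of 35.587 that SPSS uses.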