INTR BUSINES STAT I
INTR BUSINES STAT I STAT 226
This 116 page Class Notes was uploaded by Giovani Ullrich PhD on Saturday September 26, 2015. The Class Notes belongs to STAT 226 at Iowa State University taught by Staff in Fall. Since its upload, it has received 17 views. For similar materials see /class/214412/stat-226-iowa-state-university in Statistics at Iowa State University.
Date Created: 09/26/15
Stat 226: Introduction to Business Statistics I
Spring 2009
Professor: Dr. Petrutza Caragea
Section A: Tuesdays and Thursdays, 9:30-10:50 a.m.

Chapter 4, Section 4.4: Sampling Distributions & Central Limit Theorem (CLT)

Toward Statistical Inference

SAMPLING DISTRIBUTIONS

Sampling Distribution: The sampling distribution of a statistic (e.g. the sample mean $\bar{x}$) is the distribution of all possible values taken by the statistic in all possible samples of the same size from the same population.

We know that our $\bar{x}$ value is one of the $\bar{x}$ values described by the sampling distribution.

• Handout: Summary "Sampling Distribution" and recap of last week's sampling activity
• Handout: Toward the Central Limit Theorem

Before continuing with the actual Central Limit Theorem, let's look at two important properties any statistic should have.

Note: in general we will refer to a statistic that is used to estimate an unknown population parameter as a so-called estimator, i.e. statistic = estimator.

When estimating a population parameter using a sample statistic (e.g. $\bar{x}$ estimating $\mu$), we have to be concerned about two things:
1. Bias: how far is $\bar{x}$ away from $\mu$ on average, i.e. how large is the error in estimating $\mu$ on average?
2. Variability of the statistic estimating the unknown population parameter: how spread out is the sampling distribution of the statistic?

Bias: Bias concerns the center of the sampling distribution. A statistic used to estimate a parameter is said to be unbiased if the mean of its sampling distribution is equal to the true value of the parameter being estimated.

Example: $\bar{x}$ will be unbiased if the mean of the sampling distribution of $\bar{x}$ is $\mu$, which in fact it is.

So if a statistic (an estimator) is unbiased, e.g. $\bar{x}$, then it holds that
• the mean of the sampling distribution of the statistic is always equal to the population parameter, e.g. the mean of $\bar{x}$ is $\mu$.
• In repeated sampling $\bar{x}$ will sometimes fall above the true value and sometimes fall below. However, there is no systematic tendency to over- or underestimate the parameter $\mu$ ⇒ $\bar{x}$ is correct on average.
• How close the value of the sample statistic falls to the parameter in most samples is determined by the overall spread of the sampling distribution. If individual observations have a standard deviation $\sigma$, then sample means $\bar{x}$ for samples of size $n$ have a standard deviation of $\sigma/\sqrt{n}$ ⇒ averages vary less than individual observations!

Variability: The variability of a statistic is described by the spread of its sampling distribution. This spread is determined by the sampling design and the sample size $n$. In general, the larger the sample size $n$, the smaller the spread.

Ideally we want the statistic to have small bias and small variability.

How can we reduce bias and variability when estimating $\mu$?
• To reduce bias, use a random sample.
• To reduce variability when estimating $\mu$, we need to increase the sample size $n$.

In general, statistical methods (estimators) are judged as good and reliable if they provide consistently accurate estimates when used repeatedly.

Recall what we learned about the sampling distribution of $\bar{x}$: If all possible random samples of size $n$ are taken from some population with mean $\mu$ and standard deviation $\sigma$, then the sampling distribution of the sample mean $\bar{x}$ will
• have a mean $\mu_{\bar{x}}$ equal to $\mu$, the population mean
• have a standard deviation $\sigma_{\bar{x}}$ equal to $\sigma/\sqrt{n}$

What about the shape of the sampling distribution?
⇒ the shape resembles more and more the shape of a normal distribution
as the sample size $n$ increases. WHY?

CENTRAL LIMIT THEOREM

The Central Limit Theorem is one of the most important theorems in statistics. It allows us to use normal calculations to answer questions about sample means even when the population distribution is not normal, as long as the number of observations used to compute the sample mean is sufficiently large.

Central Limit Theorem (CLT): If we draw a simple random sample of size $n$ from any population with mean $\mu$ and standard deviation $\sigma$, and $n$ is sufficiently large, then the sampling distribution of the sample mean $\bar{x}$ is approximately normal:
$\bar{x}$ is approximately $N(\mu, \sigma/\sqrt{n})$

Some guidelines regarding the necessary sample size:
• if the population is already normal ⇒ $\bar{x}$ will be $N(\mu, \sigma/\sqrt{n})$ regardless of sample size; the CLT does not apply (it is not needed)
• if the population is symmetric and bell-shaped (resembling somewhat a normal distribution) ⇒ $\bar{x}$ will be approximately $N(\mu, \sigma/\sqrt{n})$ if the sample size is $\geq 15$; the CLT does apply
• if the population is far from normal (e.g. skewed or multi-modal) ⇒ $\bar{x}$ will be approximately $N(\mu, \sigma/\sqrt{n})$ if the sample size is $\geq 30$; the CLT does apply

The more heavily a distribution is skewed, or in the situation of very extreme observations (outliers) in the population itself, the more observations are necessary for the CLT to apply, and sometimes $n = 30$ may not be sufficient.

Example 1: According to Chance magazine (1993, Vol. 6, Nr. 3, p. 5) the mean healthy body temperature is around $\mu = 98.2$ °F with a standard deviation of $\sigma = 0.6$. The distribution of body temperature is known to be bell-shaped. Suppose we take a random sample of 16 adults.
(a) What proportion of humans has a temperature at or above the presumed norm of 98.6 °F?
(b) What proportion of samples of size 16 have a mean temperature at
or above the presumed norm of 98.6 °F?

Example 2: A bottling company uses a machine to fill bottles with cola. The bottles are supposed to contain 300 ml. In fact, the contents vary according to a normal distribution with mean $\mu = 298$ ml and standard deviation $\sigma = 3$ ml.
(a) What proportion of individual bottles contains less than 295 ml?
(b) What proportion of 6-packs contains less than 295 ml (mean content per bottle)?
(c) Did we need the CLT to derive our answers in (b)?

THE LAW OF LARGE NUMBERS

There is a false belief that short-run behavior must match what can only be expected in the long run, like: "Tossing a coin, you have had Heads come up 10 times; the Law of Averages says Tails should come up next."

The "Law of Averages" is really the Law of Large Numbers (p. 283) and applies only to the long-run (large $n$) behavior of averages, saying: As $n$ increases, $\bar{x}$ gets closer to $\mu$ in value.

Application of the Law of Large Numbers: Assume that for auto accidents in the state of Iowa the average damage loss is $2252 per accident.
(a) If you are in an accident, does $2252 apply?
(b) If you are in five accidents, does $2252 apply?
(c) If you are an insurance company and 206 of your clients have accidents, does $2252 apply?

The "law of averages": The baseball player Tony Gwynn got a hit about 34% of the time over his 20-year career. After he failed to hit safely in six straight at-bats, the TV commentator said, "Tony is due for a hit by the law of averages." Is that right? Why?

Chapter 2: Examining Relationships

Main Goals
1. graphically and numerically describe the relation between 2 quantitative variables
2. identify if one variable can help predict/explain another variable

Variable associations: Two variables measured on the same individual are associated if certain values of one variable tend to occur often with certain values of a second variable.

Example:
• number of items sold on a given day and daily gross sales
• height and weight of a person
• assessed value and sale price of a home

In general, associations will not be exact, as there is always variation!

We distinguish two types of variables:

response variable: a result or outcome of interest
• often also called dependent variable
• denoted using the letter $y$

explanatory variable: explains changes in the response variable
• often also called independent variable
• denoted using the letter $x$

Examples:   explanatory variable $x$      response variable $y$

Chapter 2.1 Scatterplots

EXPLORING RELATIONSHIPS GRAPHICALLY

We can display the explanatory and the response variable in a so-called scatterplot showing their relationship:
explanatory variable ⇒ x-axis
response variable ⇒ y-axis

[Figure: example scatterplots]

Example: number of radio ads aired per week and amount of sales (in $1000)

No. of ads x:   2   5   8   8   10   12
Sales y:        2   4   7   6   10

No. of ads helps explain/predict sales.

[Figure: scatterplot of sales y versus no. of ads x]

FOUR FEATURES TO LOOK FOR IN A SCATTERPLOT

1. Direction: positive or negative
[Figure: positive association, negative association]
2. Form:
  • linear (straight line)
  • curved
  • scattered
3. Strength: strong vs. weak (and moderate)
4. Outliers

STAT 226: INTERPRETING CONFIDENCE INTERVALS (HANDOUT 7)

The following are correct interpretations of confidence intervals:
• I am 90% confident that the interval 11915 1281 captures the true mean yield in bushels per acre.
• The interval 227 degrees 2618 degrees gives a range of reasonable values for the average angle of deformity for all patients having HAV, a deformity of the big toe. We are 95% confident of this.
• The mean IQ of all seventh-grade girls in the school district is somewhere between 9513 and 10912 with 99% confidence.

The following are incorrect interpretations of confidence intervals:
• 99% of the IQs are contained in the interval 953 10912.
• The probability that the interval 11915 12811 captures the true mean yield in bushels per acre is 90%.
• We are 95% confident that the interval 227 degrees 2618 degrees contains the sample average angle of deformity for the 38 patients having HAV.
• 99% of the time the mean IQ of all seventh-grade girls will be contained in the interval 953 10912.
• We are 90% confident that the interval 11915 12811 captures the yields in bushels per acre.

Section 2.4 Cautions about Correlation and Regression

Warnings
• Correlation and regression are powerful tools for describing the relationship between two variables.
• But...

Limitations
• Correlation and regression describe only linear relationships.
• The correlation $r$ and the least-squares regression line are not resistant to outliers.
  • Always plot your data before interpreting a linear regression or correlation between two variables.
• What should we do with outliers if they are present?

Outliers: What should you do?
• Make sure data points have been recorded correctly.
• Collect more data.
• Examine collection techniques.
• Examine outside influences.
• As a last resort, remove the outlier.
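The warning above, that the correlation $r$ is not resistant to outliers, can be checked numerically. The following is a minimal sketch in Python, not part of the course material: the data values are made up for illustration, and the helper name `pearson_r` is hypothetical. It computes $r$ for points lying close to a straight line and then again after adding a single outlier.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the sample correlation r between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data lying close to a straight line.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.1, 14.0, 16.2]
r_clean = pearson_r(x, y)

# The same data plus one made-up outlier far below the linear pattern.
r_outlier = pearson_r(x + [9], y + [2.0])

print(f"r without outlier: {r_clean:.3f}")
print(f"r with outlier:    {r_outlier:.3f}")
```

With these made-up values, one added point is enough to pull $r$ from close to 1 down to roughly 0.5, which is why plotting the data first is recommended.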
Beware Extrapolation
• Extrapolation is the use of a regression line for prediction far outside the range of values of the explanatory variable $x$ that you used to obtain the line.
  • Such predictions are often not accurate.
• Associations for variables can be trusted only for the range of values for which data have been collected.
  • Even a very strong relationship may not hold outside the data's range.

Example of Extrapolation
• We fit a least-squares regression line to model the relationship between diamond size and price. Our explanatory variable is measured in carats and the response variable is dollars.
• The model is $\hat{y} = -259.63 + 3721.02x$

[Figure: Price of Diamonds, scatterplot of price]

What would our model predict for the Hope Diamond (45.52 carats)?
$\hat{y} = -259.63 + 3721.02(45.52) \approx 169{,}121$ dollars

Beware Correlations Based on Averaged Data
• Many regression and correlation studies work with averages or other measures that combine information from many individuals.
  • You should note this carefully and resist the temptation to apply the results of such studies to individuals.
• Correlations based on averages are usually too high when applied to individuals.
  • It is important to note exactly what variables were measured in a statistical study.

Beware the Lurking Variable
• A lurking variable is a variable that potentially affects the relationship among the variables in a study but is not included among the variables studied.
• A lurking variable can falsely suggest a strong relationship between $x$ and $y$, or it can hide a relationship that is really there.
• Sometimes the relationship between two variables is influenced by other variables that we did not measure or even think about.

Example 2.16
• Studies show that men who complain of chest pain are more likely to get detailed tests and aggressive treatment such as bypass surgery than are women with similar complaints. Is this association between gender and treatment due to discrimination?

Example 2.16 (continued)
• Not necessarily. Men and women develop heart
problems at different ages: women are on average between 10 and 15 years older than men. Aggressive treatments are more risky for older patients, so doctors may hesitate to recommend them. Lurking variables (the patient's age and condition) may explain the relationship between gender and doctors' decisions.

Another Lurking Variable Example
• A study showed that there was a strong correlation between the number of firefighters at a fire and the property damage that the fire causes.
• So maybe we should send fewer firefighters to fight fires?
• Wrongo!
• What is the lurking variable in this case?

Association is not Causation
• An association between an explanatory variable $x$ and a response variable $y$, even if it is very strong, is not by itself good evidence that changes in $x$ actually cause changes in $y$.

Example 2.18
• Measure the number of television sets per person $x$ and the average life expectancy $y$ for the world's nations. There is a high positive correlation: nations with many TV sets have higher life expectancies.
• So we conclude that shipping TVs to Rwanda would increase this country's life expectancy?

Example 2.18 (continued)
• No! There are lurking variables present.
• What are they? Nations with more TVs per person tend to be wealthier and have better health care options.
• Clearly there is no cause-and-effect relationship between TV sets and length of life.

How To Show Causation
• The only way to get absolutely conclusive evidence of cause and effect, i.e. that $x$ causes changes in $y$, is through a carefully designed experiment in which we change $x$ in an environment which we completely control. This keeps lurking variables under control.
• When experiments cannot be done, finding the explanation for an observed association is often difficult and controversial.

Does Smoking Really Cause Cancer?
• In order to know that smoking causes cancer, we would have to design an experiment where we can change the explanatory variable (smoking or not) in a controlled environment.
• Is this possible?
• Are
there other types of cause-and-effect relationships similar to this scenario?

Criteria for Establishing Causation with no Experiment
• The association is strong.
• The association is consistent.
• Higher doses are associated with stronger responses.
• The alleged cause precedes the effect in time.
• The alleged cause is plausible.

Does Smoking Cause Cancer?
• Proving smoking causes lung cancer:
  • The association between smoking and lung cancer is strong.
  • This association is consistent in many studies.
  • High doses are associated with stronger response. That is, people who smoke more often tend to get lung cancer more often.
  • The cause precedes the effect in time.
  • The cause is plausible.

Section 2.4 Summary
• Correlation and regression must be interpreted with caution. Plot the data to be sure the relationship is roughly linear and to detect outliers and influential observations.
• Avoid extrapolation: the use of a regression line for prediction for values of the explanatory variable far outside the range of the data from which the line was calculated.
• Remember that correlations based on averages are usually too high when applied to individuals.
• Lurking variables that you did not measure may explain the relations between the variables you did measure. Correlation and regression can be misleading if you ignore important lurking variables.
• Most of all, be careful not to conclude that there is a cause-and-effect relationship between two variables just because they are strongly associated. High correlation does not imply causation. The best evidence that an association is due to causation comes from an experiment in which the explanatory variable is directly changed and other influences on the response are controlled.

Section 7.1 Inference for population means, $\sigma$ unknown

INFERENCE FOR THE POPULATION MEAN $\mu$ WHEN THE POPULATION STANDARD DEVIATION $\sigma$ IS UNKNOWN
When the population standard deviation $\sigma$ is unknown, we have to estimate it first based on the collected data, using the sample standard deviation $s$.

Using $s$ instead of $\sigma$ adds more variability to the distribution of the standardized sample mean $(\bar{x} - \mu)/(s/\sqrt{n})$ ⇒ so we need a distribution with heavier tails.

The t-distribution accounts for the additional variation by having heavier tails (see graph next page).

Recall that the normal distribution is characterized by two parameters:
• $\mu$, the mean, and
• $\sigma$, the standard deviation

The t-distribution is characterized by a single parameter, the so-called "degrees of freedom", short df.

As the degrees of freedom increase, the t-distribution approaches the standard normal distribution $N(0, 1)$.

Why? As the sample size increases, $s$ estimates $\sigma$ more accurately because we have more information about the population standard deviation in our sample.

[Figure: density curves of t-distributions compared with the standard normal distribution]

READING THE t-TABLE (Table D)
• the t-distribution is symmetric
• it is characterized by the degrees of freedom

[Table D: t distribution critical values]

FINDING CRITICAL VALUES $t^*$ FOR A t-DISTRIBUTION (TABLE D)

Example:
• the 95th percentile of a t-distribution with df = 5 is the critical value
such that the area to the right (upper-tail probability) of $t^*$ is equal to 0.05 or 5%
  • look down the df column (first column on the left) to 5
  • at the top of the table find the right-tail (upper-tail) probability of 0.05
  • the critical value $t^*$ corresponds to where row and column intersect; this is $t^* = 2.015$
Note: the t-table works with the area above (right) while the z-table works with the area below (left).
• the 5th percentile for a t-distribution with df = 15
• Find the quantiles that bound the middle 95%.

CONFIDENCE INTERVALS FOR $\mu$ WHEN $\sigma$ IS UNKNOWN
• A $(1 - \alpha) \cdot 100\%$ confidence interval for $\mu$ is given by
  $\bar{x} \pm t^* \frac{s}{\sqrt{n}}$
  ⇒ just change $\sigma$ to $s$ and $z^*$ to $t^*$
• look up $t^*$ corresponding to a t-distribution with df = $n - 1$

Example: A random sample of 30 pills yielded a mean level of 20.5 mg of aspirin and a standard deviation of 1.5 mg. Find a 95% confidence interval for the mean level $\mu$ of aspirin in a pill.

• Handout: t-test examples

What about assumptions for the t-test?
• simple random sample (ensuring independence of observations)
• data following a normal distribution, or a sufficiently large sample size for the CLT to apply

Review

Distribution: The distribution of a variable describes WHAT values the variable takes and HOW often it takes these values.

Depending on the type of the data (categorical or quantitative) we need to use different graphical and numerical tools to analyze and summarize the
data at hand. We will start by describing data graphically:
• bar graphs, pie charts and Pareto charts can be used to graphically summarize categorical data
• a common graphical display for quantitative data is a histogram

Chapter 1.1 Displaying Distributions with Graphs

Strategies for data analysis:
• examine each variable in the data set individually and look for possible relationships among variables
• start with graphs and add numerical summaries as needed

One of the first visual displays is the so-called coxcomb by Florence Nightingale.

[Figure: Florence Nightingale's coxcomb diagram, with explanatory text in its lower left corner]

Categorical variables: Bar graphs and Pie charts

Example: data on loan amounts owed by Asia to banks in other countries

country          loan amount*   percentage
Japan            97.2           35.8
Germany          31.7           11.7
France           24.?           9.1
US               23.?           ?
Great Britain    16.3           ?
others           ?              ?

* loan amount in billions of dollars; totals affected by rounding
[Figure: pie chart of loan amount by country (France, Germany, Great Britain, Japan, US, others)]

Pie Chart: A pie chart shows the amount of data that belong to each category as a proportional part of the circle.
• useful when only one variable is of interest
• easy to compare the relative size of the parts to each other as well as the size of each part compared to the total
• percentages have to add up to 100%, i.e. we need to account for all possible categories!

Assume we did not have complete information on loan amounts for all other countries ⇒ a pie chart can no longer be constructed, because the loan amounts for Japan, Germany, France, US and Great Britain do not account for 100% of the loan amount given to Asia (see also AYK problem 1.3, page 9).

Bar Graph: A bar graph shows the amount of data that belong to each category as proportionally sized rectangular areas. Bar graphs are more flexible than pie charts because we don't need to account for all possible categories of the variable.

[Figure: bar chart of loan amount by country]

• Bar graphs are valuable presentation tools as they are effective at reinforcing differences in magnitude (note that bars have to be of equal widths and are equally spaced).
• Bar graphs are obviously useful when the observed outcomes of the variable (our data) can be placed into different categories.
• Bar graphs can be either horizontal or vertical.

Comments:
• Often classes have a natural order, so it makes sense to put the bars in that order. For example, consider how many Fr., So., Jr., Sr. are enrolled in this 226 section.
• Sometimes, however, it is more useful to arrange the bars with respect to their magnitude, i.e. order them from tallest to shortest, in order to
highlight the categories with the highest frequencies (most important classes). This type of bar graph is called a Pareto chart.

[Figure: Pareto chart]

Quantitative Variables: Frequency Tables and Histograms

Numerical data (observations are numbers rather than categories) are very common in business. One of the most common ways to summarize observations from a quantitative variable is a histogram.

Idea:
• group values into classes (intervals of equal width)
• count how many observations fall into each class
• draw a bar graph for the counts, keeping the intervals in numerical order (adjacent but non-overlapping)

Chapter 1.1 Histograms

Example: Ages and annual salaries for the CEOs of the 60 top-ranked small companies in America in 1993. Source: Forbes, Nov. 8, 1993, "America's Best Small Companies".

[Figure: Distribution of variable Age]
[Figure: Distribution of Salary in Thousands]

General Guidelines for Drawing a Histogram
• all classes have the same width
• classes must not overlap: each observation can only belong to one class
• use a reasonable number of classes for samples of size $n \leq 150$ observations; there is no best choice of class width and number of classes ⇒ use good judgement to decide (see also next example for the effect of class width)
• the frequency $f$ for each class corresponds to the number of observations that belong to that class
• the sum of all class frequencies must add to $n$, the size of the sample

Note: Classifying data into classes leads to a loss of information; always be aware of that!
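The histogram guidelines above (equal-width, non-overlapping classes whose frequencies add up to $n$) can be sketched in a few lines of Python. This is only an illustrative sketch: the ages are made-up values rather than the Forbes data, and the function name `frequency_table` and its parameters are hypothetical choices, not something defined in the course.

```python
def frequency_table(values, start, width, num_classes):
    """Count observations in equal-width, non-overlapping classes [lo, hi)."""
    counts = [0] * num_classes
    for v in values:
        index = int((v - start) // width)  # class that contains v
        if not 0 <= index < num_classes:
            raise ValueError(f"{v} falls outside the classes")
        counts[index] += 1
    return [
        (start + i * width, start + (i + 1) * width, count)
        for i, count in enumerate(counts)
    ]


# Hypothetical ages, for illustration only.
ages = [32, 45, 47, 51, 52, 53, 55, 56, 58, 59, 61, 62, 63, 66, 70]

table = frequency_table(ages, start=30, width=10, num_classes=5)
for lo, hi, f in table:
    print(f"{lo}-{hi}: {'*' * f} ({f})")

# The class frequencies must add up to the sample size n.
total = sum(f for _, _, f in table)
print("sum of frequencies:", total, "n:", len(ages))
```

Each class includes its lower bound and excludes its upper bound, so an observation such as 70 belongs to exactly one class. Rerunning the sketch with a different `width` shows how the choice of class width changes the apparent shape.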
[Figure: histograms illustrating the effect of class width]

Different shapes of distributions/histograms: We distinguish the following main shapes of distributions:
• symmetric
• skewed to the right
• skewed to the left
• uniform/rectangular
• J-shaped

In addition we distinguish between bimodal and multimodal distributions as opposed to unimodal distributions.

Example: Boston Housing Data: housing values for Boston suburbs, 506 observations on 14 different variables

[Figure: bar chart for the only categorical variable]

Example: Boston Housing Data (cont'd): different shapes of distributions

[Figure: histograms of several Boston Housing variables]

Describing distributions/histograms:
1. overall shape of histogram
2. center
3. spread
4. possible outliers

Chapter 1.1 Stem-and-Leaf Plot

Stemplot
• shows the actual digits
• Each numerical value is divided into two components:
  leading digits = stem
  trailing digits = leaf

How to make a stem-and-leaf plot:
1. Separate each observation into a stem, consisting of all but the last (rightmost) digit, and a leaf, the last digit. Stems may have as many digits as needed, but each leaf contains only a single digit.
2. Write the stems in a vertical column with the smallest at the top, then draw a vertical line at the right of this column.
3. Write each leaf in the row to the right of its stem, in increasing order out from the stem.
4. Add a legend on how to read the stemplot.

Example: random sample of package design ratings ranging from 0 to 45:
22 21 31 20 25 21 32 26 43 30 27 30 27 36 28 33
38 35 19 30 34 41 lntmductlon m Engine seems l cmei 1i Smion 11 25 37 Chapter 11 Stemplots Split Stemand Leaf Plot Sometimes it might be more informative to have a socalled Split Stemeand Leaf Plot where we split the stem further namely into two groups for the leading digits 0 to 4 and 5 to 9 lntmduction m Engine Statistics l cmei 1 Snipquot 1 1 27 37 Chapter 11 Stemplots Backto Back Stemand Leaf Plot 411 lntmductlon m Engine seems l cmei 1i Smion 11 28 37 chm 1 17 Tm P ms A magnum mkexmng m mqu PW mm 11m 1mm 21 zoos 5mg Hung m Emma Asa on u s Russoquot Fen MM 1 Chaptev 1 1 7 Tm ms m 1M1 Mum 1 chm 1 17 Tm P ms mg mg A unrwmdsvhysdzh 1mm mm mm W vede m m ms 15 WSW m my th m mm Aways Wilma on m Aamnhhxsmyouv mm mm mums m m ve nhxs m 1M1m Mum 1 Chaptev 1 1 7 Tm ms mum faunas mm guarm vzvuimn 1M1 Mum 1 Chapter 11 Time Plots a A trend In a time series refers to a longaterm persistent rise or fall Le a positive upward or negative downward tren Time Series average number nlnccupied mums By mnnlh average number of occupied rooms 00 A o o i s lntmductiun m Busims scams i Chaucer 1i Seem 1 1 33 37 Chapter 11 Time Plots 0 A attern in a time series that repeats itself at known regular intervals is called seasonal variation Time Series average number nlnccupied mums By mnnlh 00 ioocoo 0000 0000 iii average number of occupied rooms i ll mi Jiliiii iflif i 500 immunequot m Busims Statistics i Chaucer 1i Seem 11 34 37 Chapter 11 Time Plots e iere ltinsmsomillmnn m 3 Man i wmv l se i mean 2 u a ws wse wax wen in we w 1 ww ww wu wi ww ww wxn um in um um um um w Var immunequot m Busims scams i Chaucer 1i Seem 1 1 35 37 Chapter 11 Time Plots i rirre sews sansmsayvmr Mean 46 m 5m 37 17775 15m u we i we Sump z quot 1 immunequot m Busims Statistics i Chaucer 1i Seem 11 as 37 Chapter 11 Time Plots E51372 nu mher m empmym Pagans Tn mousmaZEyTsaIE ma nnnn Mean 3922 s a am am lquot N m mmuw am pmquot n mamas NV ntmductmn m Busmamp 5mm cmm 1 5mmquot 1 1 3737 Section 12 Describing 
Distributions with Numbers

Measuring Center
- A description of a distribution should always include a measure of its center.

Measuring Center: the Mean
- The mean is the most common measure of center.
- To find the mean of a set of observations, add their values and divide by the number of observations:

$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$   (equation 1)

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$   (equation 2)

Notation
- The $\Sigma$ (capital Greek sigma) in the formula for the mean is short for "add them all up".
- The subscripts on the observations $x_i$ are just a way of keeping the $n$ observations distinct; they do not necessarily indicate order or any other special facts about the data.
- The bar over the $x$ indicates the mean of the $x$ values; pronounce the mean $\bar{x}$ as "x bar".

Example: Frank Thomas
- Home runs per season: 7 32 24 41 38 40 40 35 29 15 43 4 28 42 18 12 39 21
- $\bar{x} = \frac{7 + 32 + 24 + \cdots + 12 + 39 + 21}{18} = \frac{508}{18} = 28.222$ HRs

Measuring Center: the Median
- The median is the formal version of the midpoint.
- The median $M$ is the midpoint of a distribution: the number such that half the observations are smaller and the other half are larger.
- How do you find it?

Finding the Median
- Arrange all observations in order from smallest to largest.
- If the number of observations is odd, the median is the center observation in the ordered list.
- If the number of observations is even, the median is the mean (average) of the two center observations in the ordered list.

Example: Frank Thomas
- Find the median of the numbers 7 32 24 41 38 40 40 35 29 15 43 4 28 42 18 12 39 21.
- Order them from smallest to largest: 4 7 12 15 18 21 24 28 29 32 35 38 39 40 40 41 42 43
- Find the position of the middle value using $(n+1)/2 = (18+1)/2 = 9.5$.
- So the median is $0.5 \times (29 + 32) = 30.5$ HRs, which is the average of the 9th and 10th measurements.

Example: Ryne Sandberg
- Find the median of Ryne Sandberg's HRs: 5 7 8 9 12 14 16 19 19 25 26 26 26 30 40
- Since $n = 15$ is odd, find the center observation: $(15+1)/2 = 8$, so find the 8th measurement.
- The median is 19 HRs.

Mean vs. Median
- Median = middle number
- Mean = value where the histogram balances
- Mean and median are similar when
data are symmetric
- Mean and median are different when
  - data are skewed
  - there are outliers

Mean vs. Median
- The mean is influenced by unusually high or unusually low values (outliers).
- Example: income in a small town of 6 people: $25,000, $27,000, $29,000, $35,000, $37,000, $38,000.
  The mean income is about $31,833. The median income is $32,000.
- Bill Gates moves to town: $25,000, $27,000, $29,000, $35,000, $37,000, $38,000, $40,000,000.
  The mean income is $5,741,571. The median income is $35,000.
- The mean is pulled by the outlier; the median is not.
- The mean is not a good measure of center for these data.

Mean vs. Median
- We can see that the mean is more sensitive to extreme observations than the median.
- The mean is easier to calculate for large data sets.
- In a symmetric distribution the mean and median should be in approximately the same location.

Mean vs. Median
- Skewness pulls the mean in the direction of the tail.
  - Skewed to the right: mean > median
  - Skewed to the left: mean < median
- Outliers pull the mean in their direction.
  - Large outlier: mean > median
  - Small outlier: mean < median

Symmetric Example

[Figure: histogram of CEO age in years; mean ≈ median]

Forbes magazine published data on the best small firms in 1993. These were firms with annual sales of more than $5 million and less than $350 million. Firms were ranked by five-year average return on investment. The data extracted are the age and annual salary of the chief executive officer for the first 60 ranked firms.

[Figure: histogram of CEO salary in thousands of dollars, same Forbes data; mean > median]

Choosing a Numerical Summary
- Skewed distributions: the median is the best measure of center; the mean is affected by outliers.
- Symmetric or approximately symmetric distributions: both mean and median can be used as a measure of center, but the mean is preferred because of its additional mathematical properties.

Is a central measure enough?
- A warm, stable climate greatly affects some individuals' health. Atlanta and San Diego have about equal average temperatures (62 vs. 64 degrees). If a person's health requires a stable climate, in which city would you recommend they live?
- Or: suppose you were willed $100,000. You can invest the money in one of two mutual funds with similar yearly average returns (9.8% vs. 9.9%). Which mutual fund would you choose?

Measuring Spread
- A description of a distribution should always include a measure of spread.
- One way to measure spread is to give the smallest and largest observations.
- These single observations show the full spread of the data, but they are influenced by outliers, because they can themselves be outliers.
- We can improve our description of spread by also giving several percentiles.

The pth percentile
- The pth percentile of a distribution is the value such that p percent of the observations fall at or below it.
- The most commonly used percentiles are the median (50th percentile) and the quartiles (the 25th and 75th percentiles).
- To calculate a percentile, arrange the observations in increasing order and count up the required percent from the bottom of the list.

The Quartiles Q1 and Q3
- The 1st quartile, Q1, is the 25th percentile.
- The 2nd quartile, M, is the 50th percentile, or median.
- The 3rd quartile, Q3, is the 75th percentile.
- The first and third quartiles show the spread of the middle half of the data.

Calculating the Quartiles
- Arrange the observations in increasing order and locate the median M in the ordered list of observations.
- The first quartile Q1 is the median of the observations whose position in the ordered list is to the left of the location of the overall median.
- The third quartile Q3 is the median of the observations whose position in the ordered list is to the right of
the location of the overall median.

Five-Number Summary
- The five-number summary of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation.
- These numbers give us a clear idea of the center and spread of a distribution:
  Minimum, Q1, Median, Q3, Maximum
- For a one-number summary of spread use the interquartile range, IQR = Q3 - Q1.

Example (data already in order):
12641 15953 16015 16555 16904 17124 17274 17516 17813 18206 18405 19090 19312 19338 20788

Quartiles from the Example
- There are 15 values ($n = 15$). When $n$ is odd, M is the $(n+1)/2$-th ordered value, so M is the 8th ordered value: M = 17516.
- Find Q1 and Q3 by finding the medians of the lower and upper halves: Q1 = 16555 and Q3 = 19090.
- IQR = 19090 - 16555 = 2535

Words of Caution
- Do we include the median when we calculate Q1 and Q3?
- Rule of thumb: when the sample size is odd, do not include the median in either half. When the sample size is even, do include the two values used to find the median, one in each half.
- Be careful when observations take the same numerical value: write down all of the observations and apply the rules just as if they all had distinct values.

Boxplots
- A boxplot is a graph of the five-number summary:
  - a central box spans the quartiles
  - a line in the box marks the median
  - lines (whiskers) extend from the box out to the smallest and largest observations
- Boxplots are very useful for side-by-side comparison of several distributions.

Boxplots
- You can draw boxplots either horizontally or vertically.
- Be sure to include a numerical scale in the graph.
- When you look at a boxplot, first locate the median, which marks the center of the distribution.
- Then look at the spread:
  - the quartiles show the spread of the middle half of the data
  - the extremes show the spread of the entire data set

Example: 63 65 50 37 35 41 25 23 27 21 17 17 20 19 22 15 15 15 15 101

Ordered data (position: value):
 1: 15    6: 17   11: 23   16: 41
 2: 15    7: 19   12: 25   17: 50
 3: 15    8: 20   13: 27   18: 63
 4: 15    9: 21   14: 35   19: 65
 5: 17   10: 22   15: 37   20: 101

Example: MIN = 15, Q1 = 17, M = 22.5, Q3 = 39, MAX = 101
- Make a box plot.

[Figure: boxplot of the example data on a scale from 0 to 120]

- JMP produces box plots that identify potential outliers. Also, the red line indicates the most dense 50% of the observations. The diamond gives the mean and a 95% confidence interval (we will talk about this later).

Information from the Boxplot
- If the median is close to the box's center: symmetric
- If the median is to the left of the box's center: skewed right
- If the median is to the right of the box's center: skewed left
- If the whiskers are approximately the same length: symmetric
- If the right whisker is longer than the left whisker: skewed right
- If the left whisker is longer than the right whisker: skewed left

IQR as a measure of spread
- Advantages
  - Simple to compute
  - Not sensitive to outliers (resistant)
- Disadvantages
  - Doesn't involve the actual values of all data points
  - Complex theoretical properties
  - Not commonly used

Measuring Spread
- The five-number summary (specifically the IQR) is a measure of spread that does not use all of the data.
- Is there a way we can use all the data to measure spread?
- Maybe obtain the distance from some reference point for each observation.

The Standard Deviation
- The standard deviation looks at how far the observations are from their mean.
- It is the square root of the average squared deviation from the mean, an estimate of the average distance from the mean:

$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

- The standard deviation squared, $s^2$, is the variance:

$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

Example 2 (data values 1, 2, 3, 4, 5, with $\bar{x} = 3$):

$x_i$   $x_i - \bar{x}$   $(x_i - \bar{x})^2$
1       -2                4
2       -1                1
3        0                0
4        1                1
5        2                4

$\sum (x_i - \bar{x})^2 = 10$, so $s^2 = \frac{1}{4} \times 10 = 2.5$ and $s = \sqrt{s^2} = \sqrt{2.5} = 1.58$.

Degrees of Freedom
- Recall that the equation for the variance is $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$.
- Notice that the average in the variance divides the sum by $n-1$ rather than $n$.
- The reason is that the deviations $x_i - \bar{x}$ always sum to exactly 0, so that knowing $n-1$ of them determines the last one.
- Only $n-1$ of the squared deviations can vary freely, and we average by dividing the total by $n-1$.
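The degrees-of-freedom argument can be checked numerically on the data from Example 2. The following is a minimal sketch in Python; the helper name `sample_variance` is hypothetical and chosen only for this illustration. It shows that the deviations from the mean sum to zero and that dividing by n - 1 reproduces the variance and standard deviation computed by hand.

```python
import math


def sample_variance(values):
    """Return (mean, deviations, variance), dividing by n - 1."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]
    variance = sum(d * d for d in deviations) / (n - 1)
    return mean, deviations, variance


data = [1, 2, 3, 4, 5]  # data values from Example 2
mean, deviations, variance = sample_variance(data)
std_dev = math.sqrt(variance)

print(mean)               # 3.0
print(sum(deviations))    # 0.0 (the deviations always sum to zero)
print(variance)           # 2.5
print(round(std_dev, 2))  # 1.58
```

Because the deviations must sum to zero, the last one is fixed once the other n - 1 are known, which is why the sum of squares is divided by n - 1.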
- The number $n-1$ is called the degrees of freedom of the variance or standard deviation.

Important Details about the Standard Deviation
- $s$ measures spread around the mean; it should be used only when the mean is chosen as the measure of center.
- $s = 0$ only when there is no spread.
- The larger the standard deviation, the more variability there is in the data.
- $s$ has the same units of measurement as the original observations.
- Like the mean, $s$ is influenced by outliers and skewness.

Standard Deviation as a Measure of Spread
- Advantages
  - Uses deviations from every data point
  - Has well-established theoretical properties
  - Commonly used
- Disadvantages
  - Inflated by outliers
  - Influenced by skewness

Choosing a Summary
- Skewed distribution: use the five-number summary. Because the two sides of a strongly skewed distribution have different spreads, no single number describes the spread well.
- Symmetric distribution: use the mean and standard deviation.

Choosing a Summary
- Remember that a graph gives the best overall picture of a distribution.
- Numerical measures of center and spread report specific facts about a distribution, but they do not describe its entire shape.
- Numerical summaries do not disclose the presence of multiple peaks or gaps.
- Always plot your data.

Section 1.2 Summary
- A numerical summary of a distribution should report its center and its spread or variability.
- The mean $\bar{x}$ and the median M describe the center of a distribution in different ways. The mean is the arithmetic average of the observations, and the median is the midpoint of the values.
- When you use the median to indicate the center of the distribution, describe its spread by giving the quartiles. The first quartile Q1 has one-fourth of the observations below it, and the third quartile Q3 has three-fourths of the observations below it.
- The five-number summary, consisting of the median, the quartiles, and the high and low extremes, provides a quick
overall description of a distribution. The median describes the center, and the quartiles and extremes show the spread.
- Boxplots based on the five-number summary are useful for comparing several distributions. The box spans the quartiles and shows the spread of the central half of the distribution. The median is marked within the box. Lines extend from the box to the extremes and show the full spread of the data.
- The variance $s^2$ and especially its square root, the standard deviation $s$, are common measures of spread about the mean as center. The standard deviation $s$ is zero when there is no spread and gets larger as the spread increases.
- A resistant measure of any aspect of a distribution is relatively unaffected by changes in the numerical value of a small proportion of the total number of observations, no matter how large these changes are. The median and quartiles are resistant, but the mean and standard deviation are not.
- The mean and standard deviation are good descriptions for symmetric distributions without outliers. They are most useful for the Normal distributions introduced in the next section. The five-number summary is a better exploratory summary for skewed distributions.

Data Summary
- Categorical data: graphical (bar chart, pie chart); numerical (total, percent)
- Numerical data: graphical (histogram, stemplot, time plot); numerical (mean, median, IQR, SD)

Properties of Standard Deviation
- Measures the spread of data about the mean. Only use $s$ when the mean is an appropriate measure of center.
- $s \geq 0$. If $s = 0$, then all data values are the same.
- Has the same unit of measurement as the original observations.
- Inflated by outliers.

Chapter 6 Section 6.2 Tests of Significance (Hypothesis testing)

TESTS OF SIGNIFICANCE

Example: pick a jury of 12
people randomly out of a pool of 12 men and 12 women. For a fair jury we expect about 6 men and 6 women. What about a selection of
- 5 men and 7 women?
- 4 men and 8 women?
- or even 1 man and 11 women?

Where do we draw the line and no longer believe that the jury selection was truly random and fair? That is, when do we start doubting that the chance of getting selected for each gender was truly 50/50?

SOME BASIC TERMINOLOGY

Hypothesis: A hypothesis is a claim or belief about a population parameter that we wish to test.

In any test there are two competing hypotheses:
- the null hypothesis, denoted by $H_0$, is a statement of what we assume to be true, vs.
- the alternative hypothesis, denoted by $H_a$, which is a statement against $H_0$; this is what we want to show.

The philosophy behind a statistical hypothesis test is the same as in a jury trial. There are only two possibilities:
- not guilty (corresponding to $H_0$) vs.
- guilty (corresponding to $H_a$)

As in a jury trial, the philosophy is "innocent until proven guilty". That is, we assume not guilty until we have enough evidence to determine guilt. Likewise, we assume $H_0$ is true until we have sufficient evidence in the data in favor of $H_a$.

Both null and alternative hypotheses are always stated in terms of the population parameter. Generally this will be $\mu$ for us.

Example: A brewery claims that the average content $\mu$ of their cans of beer is 12 oz, but we suspect that the average content is less (we are getting ripped off). We want to test $H_0: \mu = 12$ vs. $H_a: \mu < 12$.

Example: Developing a new diet to lose weight, we are interested in the average weight loss $\mu$ in lbs; we want to see if the diet is effective. We want to test $H_0: \mu = 0$ vs. $H_a: \mu > 0$.

Example: A machine in control should cut wood into 5 feet pieces.
It is suspected that the machine is out of control. We want to test $H_0: \mu = 5$ vs. $H_a: \mu \neq 5$.

In summary, we have three different types of alternative hypotheses against the null hypothesis $H_0: \mu = \mu_0$:
- $H_a: \mu < \mu_0$
- $H_a: \mu > \mu_0$
- $H_a: \mu \neq \mu_0$

Note: the "=" sign is always included in the null hypothesis, never in the alternative hypothesis. $H_0$ and $H_a$ always have to contradict each other. $\mu_0$ corresponds to the mean we assume under $H_0$.

Technically, we test
- $H_0: \mu \leq \mu_0$ vs. $H_a: \mu > \mu_0$ instead of $H_0: \mu = \mu_0$, as well as
- $H_0: \mu \geq \mu_0$ vs. $H_a: \mu < \mu_0$ instead of $H_0: \mu = \mu_0$.

For simplicity we will keep using $H_0: \mu = \mu_0$.

Example: A brewery claims the mean content of a can of beer is 12 oz. We take a random sample of 36 beer cans and obtain the sample mean $\bar{x} = 11.82$ oz. If the standard deviation is known to be $\sigma = 0.38$ oz, do we have enough evidence that the brewery is making a false claim? Earlier we set up the following hypotheses: $H_0: \mu = 12$ vs. $H_a: \mu < 12$.

If $H_a$ is indeed true, we should expect $\bar{x}$ to be less than 12 oz. Again, just as in the jury selection example, how much less than 12 oz should $\bar{x}$ be before we start doubting that $\mu = 12$? Is a mean of 11.98 low enough? What about $\bar{x} = 11.22$?

The question of interest becomes: Is a value of $\bar{x} = 11.82$ for a sample of size 36 unusually small if the brewery's claim of $\mu = 12$ oz is supposed to be true? If so, then this would be evidence against $H_0$ in favor of $H_a$.

We can use our knowledge about the sampling distribution of the mean $\bar{x}$ and the normal calculation from Chapter 1.3 to assess how unusual/unlikely our data, and hence the corresponding sample mean, is. We need to find

$P(\bar{X} \leq \bar{x}) = P(\bar{X} \leq 11.82)$,

i.e. the probability of obtaining a sample mean that is at least as unusual (in our case, as small) as the observed one of $\bar{x} = 11.82$.
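The probability $P(\bar{X} \leq 11.82)$ can be computed directly from the numbers in the brewery example. The following is a minimal sketch in Python using the standard library's `statistics.NormalDist`; the variable names are chosen for this illustration only, and the left tail is used on the assumption that the alternative is one-sided ($\mu < 12$), as the example describes.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 12.0    # mean content claimed by the brewery (oz), assumed true under H0
sigma = 0.38   # known population standard deviation (oz)
n = 36         # sample size
x_bar = 11.82  # observed sample mean (oz)

# Standard deviation of the sampling distribution of the sample mean.
standard_error = sigma / sqrt(n)

# z-score of the observed sample mean under H0.
z = (x_bar - mu_0) / standard_error

# Left-tail area: probability of a sample mean at least this small under H0.
p_value = NormalDist().cdf(z)

print(round(z, 2))        # -2.84
print(round(p_value, 4))  # about 0.0022
```

Looking up the rounded value z = -2.84 in a standard normal table gives 0.0023 (0.23%); the small difference from the unrounded calculation is only due to rounding z.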
If the brewery's claim is true ($\mu = 12$), what do we know about how $\bar{x}$ behaves for a sample size of $n = 36$? Its sampling distribution is approximately normal with mean $\mu = 12$ and standard deviation $\sigma / \sqrt{n} = 0.38 / \sqrt{36} \approx 0.0633$.

To evaluate evidence in favor of $H_a$, judge how unusual the observed sample mean $\bar{x} = 11.82$ is by where it falls on the sampling distribution of $\bar{x}$ under the null hypothesis.

For reference, we then compute a so-called p-value, which is the probability of getting a value at least as unusual as the observed sample mean $\bar{x}$, assuming that $H_0$ is true. The p-value measures evidence against $H_0$.

In the brewery example, "more unusual than $\bar{x} = 11.82$" corresponds to $\bar{x}$ smaller than 11.82 under $H_0$, or equivalently to a z-score smaller than

$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{11.82 - 12}{0.38 / \sqrt{36}} = -2.84$.

What is the area to the left of $z = -2.84$?

The smaller the p-value, the stronger the evidence is against the null hypothesis $H_0$ and in favor of $H_a$. Why? Recall that the p-value tells us how likely it is to obtain a sample mean as extreme as the observed one if the null hypothesis holds true.

So there is only a 0.23% chance of observing a sample mean of 11.82 or less when $H_0: \mu = 12$ holds true.

- Handout: How to find p-values

FIRST SUMMARY OF A HYPOTHESIS TEST
1. Write $H_0$ and $H_a$ in terms of the parameter $\mu$ (the population mean).
2. Assume $H_0$ is true.
3. Find the z-score for the sample mean $\bar{x}$ from your data.
4. Find the corresponding p-value (area under the normal curve).
5. If the data come from a population that has a different mean than the one assumed under the null hypothesis $H_0$, we will see a small p-value, i.e. our data most likely come from a different population with a different population mean $\mu$.

How small of a p-value do we need?

Typically we will make a decision to reject $H_0$ by comparing our p-value to a preselected cutoff value. This cutoff value is called the level of significance and is denoted by $\alpha$. Common choices are $\alpha = 0.05$ and $\alpha = 0.01$.

So why $\alpha = 0.05$? What does it imply? The level of significance corresponds to the error rate that we allow ourselves: when $H_0$ is in fact true, we will make the wrong decision, i.e. reject the null hypothesis $H_0$, in 5% of such tests.

If we chose $\alpha = 0.01$, we would commit this error only 1% of the time, but it would be harder to reject the null hypothesis: $\bar{x}$ will have to be more extreme before we can reject $H_0$.

The choice of $\alpha$ is somewhat subjective. How much of an error probability are we willing to accept? This is equivalent to asking how strong your evidence against $H_0$ has to be before you are willing to reject $H_0$.

DECISION RULE
- If p-value $\leq \alpha$: reject $H_0$ in favor of $H_a$. We say: "We have statistically significant evidence against $H_0$ and have reason to believe in $H_a$."
- If p-value $> \alpha$: fail to reject $H_0$. We say: "We do not have sufficient evidence against $H_0$ and no reason to believe in $H_a$."

Example: $\alpha = 0.05$ level of significance. We say any p-value $\leq 0.05$ is statistically significant at the 0.05 level.

Example: Suppose the p-value is 0.03.
- If $\alpha = 0.05$: since $0.03 \leq 0.05$, we reject $H_0$.
- If $\alpha = 0.01$: since $0.03 > 0.01$, we fail to reject $H_0$.

A technical and philosophical note
- The decision is always in terms of the null hypothesis $H_0$: we either are able to reject $H_0$ or we fail to reject $H_0$.
- We never prove either $H_0$ or $H_a$; we just collect evidence against $H_0$. If we fail to find strong evidence against $H_0$, we will stick to $H_0$. This does not imply that $H_0$ is necessarily true; maybe we just did not have a sufficiently large sample size.
- On
the other hand, rejecting $H_0$ in favor of $H_a$ does not guarantee that $H_a$ is true, despite very strong evidence.

For any given hypothesis test there are two kinds of errors we can commit, but we also have a very high chance of making a correct decision.

TYPE I AND TYPE II ERROR IN HYPOTHESIS TESTS
- Type I error: rejecting $H_0$ when $H_0$ is in fact true.
- Type II error: failing to reject $H_0$ when $H_0$ is in fact false.

- Handouts: "z-procedure" and Examples

CONNECTION BETWEEN CONFIDENCE INTERVALS AND TWO-SIDED HYPOTHESIS TESTS (p. 394-395)

Recall the example of the water bottling company. Water bottles are supposed to contain 710 ml on average ($\sigma = 6$ ml), and a sample of 90 bottles yielded an average of 708 ml.

Example: Is the bottling process still on target? We constructed a 98% CI for $\mu$ and obtained (706.53, 709.47). We concluded intuitively that this is a good indicator that the process is not on target any longer. Was this intuitive decision justified?

Let's see what decision we will obtain by conducting the corresponding hypothesis test.

WHAT IS THE CONNECTION?

A two-sided hypothesis test using a significance level $\alpha$ and a $(1 - \alpha) \times 100\%$ confidence interval are equivalent. That is, a two-sided hypothesis test rejects the null hypothesis $H_0$ exactly when the value $\mu_0$ falls outside the corresponding $(1 - \alpha) \times 100\%$ confidence interval.

PRACTICAL VERSUS STATISTICAL SIGNIFICANCE
- Handout
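The equivalence between a two-sided test and a confidence interval can be checked numerically for the water bottling example from Section 6.2. The following is a minimal sketch in Python using the standard library's `statistics.NormalDist`; it assumes the two-sided hypotheses $H_0: \mu = 710$ vs. $H_a: \mu \neq 710$ and a significance level $\alpha = 0.02$ to match the 98% confidence level. The variable names are chosen for this illustration only.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 710.0   # target mean (ml), the value tested under H0
sigma = 6.0    # known population standard deviation (ml)
n = 90         # sample size
x_bar = 708.0  # observed sample mean (ml)
alpha = 0.02   # significance level matching a 98% confidence interval

std_normal = NormalDist()
standard_error = sigma / sqrt(n)

# 98% confidence interval: estimate +/- critical value * standard error.
z_star = std_normal.inv_cdf(1 - alpha / 2)
ci_low = x_bar - z_star * standard_error
ci_high = x_bar + z_star * standard_error

# Two-sided z test of H0: mu = 710.
z = (x_bar - mu_0) / standard_error
p_value = 2 * std_normal.cdf(-abs(z))

outside_ci = not (ci_low <= mu_0 <= ci_high)
reject_h0 = p_value <= alpha

print(round(ci_low, 2), round(ci_high, 2))  # 706.53 709.47
print(outside_ci, reject_h0)                # True True
```

Both routes lead to the same decision here: 710 lies outside the 98% interval, and the two-sided p-value is below 0.02.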
Chapter 10 Section 10.1 Inference for Simple Linear Regression

IS THE LINEAR RELATIONSHIP BETWEEN x AND y SIGNIFICANT OR NOT?

Do New Jersey banks serve minority communities? Financial institutions have a legal and social responsibility to serve all communities. Do banks adequately serve both inner-city and suburban neighborhoods, both poor and wealthy communities? In New Jersey, banks have been charged with withdrawing from urban areas with a high percentage of minorities. To examine this charge, a regional New Jersey newspaper, the Asbury Park Press, compiled county-by-county data on the number $y$ of people in each county per branch bank in the county and the percentage $x$ of the population in each county that is minority.

Source: McClave, J. T., Benson, P. G., Sincich, T. (2007). Statistics for Business and Economics, 10th ed., Prentice Hall, Upper Saddle River, NJ.

Data:

     County        People per bank branch   Minority population (%)
 1   Atlantic      3073                     [illegible]
 2   Bergen        2095                     13
 3   Burlington    2905                     17.8
 4   Camden        3330                     23.4
 5   Cape May      1321                     7.3
 ...
21   Warren        2349                     2.8

[Figure: scatterplot of number of people per bank branch against percentage of minority population]

If the charge against New Jersey banks holds true, we should see an increase in the number of people per bank (fewer bank branches) as the minority percentage in the population increases.

Correlation and LS regression line

POPULATION REGRESSION LINE

Because we have complete data for all 21 New Jersey counties and only New Jersey is of interest to
us, we have data on the entire population. The least squares regression line fitted through the 21 observations corresponds therefore to the so-called population regression line

$\mu_y = \beta_0 + \beta_1 x$

$\beta_0$ and $\beta_1$ are population parameters describing the linear relationship between $x$ and $y$ in the entire population.

Note: The population regression line $\mu_y = \beta_0 + \beta_1 x$ describes the linear relationship between the explanatory variable $x$ and $\mu_y$, i.e. the relationship between $x$ and the average (mean) value of $y$ for a given $x$.

If we are interested in describing each individual $y$ in the population, we need to account for the fact that not all $y$ are equal to $\mu_y$ and therefore will not fall on the straight line, but will deviate from the line by some error $\varepsilon$:

$y = \beta_0 + \beta_1 x + \varepsilon = \mu_y + \varepsilon$

[Figure: scatterplot of the data for the 21 New Jersey counties, the entire population]

New Jersey counties: The simple linear regression model $y = \beta_0 + \beta_1 x + \varepsilon$ allows us to describe the linear relationship between each $y_i$ and the given value $x_i$ of the explanatory variable, $i = 1, 2, \ldots, 21$, i.e.

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

The $\varepsilon_i$'s are independent and normally distributed with mean 0 and standard deviation $\sigma$. This is an important assumption to which we will come back later.

Typically we are not as fortunate and won't be able to observe an entire population. Hopefully, though, with the help of a representative random sample we will still obtain reliable information about the true underlying linear relationship in the population.

Recall the general form of the fitted least squares regression line from Chapter 2,

$\hat{y} = a + b x$

where $a$ and $b$ are obtained from the sample as follows:

$b = r \frac{s_y}{s_x}$ and $a = \bar{y} - b \bar{x}$

We can use $a$ to estimate $\beta_0$ and $b$ to estimate $\beta_1$. Both $a$ and $b$ are sample statistics and will vary from sample to sample. If we took another sample, we would get different values of $a$ and $b$ (sampling variability). Consequently, $a$ and $b$ have a sampling distribution.

The textbook unfortunately switches notation from Chapter 2 to Chapter 10. In the following we will denote $a$ as $b_0$ and $b$ as $b_1$.

Knowing the sampling distribution of $b_0$ and $b_1$ allows us to
- construct confidence intervals for the slope $\beta_1$ and intercept $\beta_0$
- test whether the response $y$ depends linearly on $x$, i.e. whether there is a significant linear relationship between $x$ and $y$ in the population

Generally we will focus on the slope $\beta_1$, because the value of the slope determines whether or not a linear relationship between $x$ and $y$ exists.

Note: In order to test whether a linear relationship exists between $x$ and $y$, we need to test whether the population slope $\beta_1 \neq 0$. Why? If $\beta_1 = 0$, we get the following regression model:

$y = \beta_0 + \beta_1 x + \varepsilon = \beta_0 + 0 \cdot x + \varepsilon = \beta_0 + \varepsilon$

If $\beta_1 = 0$, then $x$ does not help explain the behavior of $y$.

ASSUMPTIONS FOR REGRESSION INFERENCE

Before we construct CIs and tests, we should have a look at the assumptions that are necessary for inference on regression parameters:
1. simple random sample (ensuring independence of the $y$'s)
2. linear relationship between $x$ and $\mu_y$
3. the standard deviation of the responses about the population line is the same for all values of the explanatory variable $x$
4. the response $y$ varies according
to a normal distribution about the population regression line for all values of the explanatory variable $x$

CHECKING THE ASSUMPTIONS
- independence
- linear relationship
- normality
- constant variance

CONFIDENCE INTERVALS FOR SLOPE $\beta_1$

Recall that the general form of a confidence interval is given by

estimate $\pm$ margin of error,

where the margin of error corresponds to critical value $\times$ standard error.

The CI for the slope $\beta_1$ is of the same form:

$b_1 \pm t^* \, SE_{b_1}$

The standard error $SE_{b_1}$ can be obtained from the JMP output.

Note: the critical value $t^*$ corresponds now to a t-distribution with $df = n - 2$.

New Jersey example: Let's construct a 95% confidence interval for the slope $\beta_1$.

Interpretation (cont'd)

TESTING FOR A SIGNIFICANT LINEAR RELATIONSHIP, I.E. $\beta_1 \neq 0$

Example (New Jersey data): Is there a significant linear relationship between the percentage of the minority population and the number of people per bank branch?

Recall the population regression line $\mu_y = \beta_0 + \beta_1 x$. We are interested in showing that $\beta_1$ is significantly different from zero, i.e. $\beta_1 \neq 0$, because this implies that there exists indeed a linear relationship between $x$ and $y$. We therefore set up the following hypotheses:
- $H_0: \beta_1 = 0$ (no linear relationship between $x$ and $y$)
- $H_a: \beta_1 \neq 0$ (there exists a linear relationship between $x$ and $y$)
Note: If there exists a linear relationship ($\beta_1 \neq 0$), then this linear relationship can be either positive or negative:
- $\beta_1 < 0$: negative relationship
- $\beta_1 > 0$: positive relationship

If we are simply interested in showing that a linear relationship exists, and the direction (either positive or negative) is not important, we test $H_0$ against the two-sided alternative $H_a: \beta_1 \neq 0$.

If we are specifically interested in showing either a positive or a negative relationship, we need to set up the alternatives accordingly, i.e.
- $H_a: \beta_1 > 0$ for a positive linear relationship
- $H_a: \beta_1 < 0$ for a negative linear relationship

A general form of the test statistic is given by

$t = \frac{b_1 - \beta_1}{SE_{b_1}}$, with $df = n - 2$ for the t-distribution,

where $b_1$ is the estimate of $\beta_1$ based on the sample. Under the null hypothesis we assume $\beta_1 = 0$; the test statistic therefore simplifies to

$t = \frac{b_1 - 0}{SE_{b_1}} = \frac{b_1}{SE_{b_1}}$

Often this test statistic is called the t-ratio, e.g. in JMP.

Finding the p-value: p-values are found in exactly the same way as we have done before. Depending on the alternative, the p-value corresponds to a different area under the t-distribution:
- $H_a: \beta_1 \neq 0$
- $H_a: \beta_1 > 0$
- $H_a: \beta_1 < 0$

Note: JMP gives p-values corresponding to a two-sided alternative, i.e. $H_a: \beta_1 \neq 0$. We need to divide the JMP p-value by 2 if we are interested in testing a one-sided alternative such as $H_a: \beta_1 > 0$ or $H_a: \beta_1 < 0$.

Decision rule: as before, we reject $H_0$ if p-value $\leq \alpha$.

Conclusion: Rejecting $H_0$ implies that there exists a statistically significant linear relationship between $x$ and $y$.

Does this conclusion imply that a change in the response $y$ can be caused by a change in the explanatory variable $x$?

[JMP output: linear fit of number of people per bank branch on percentage of minority population, with Analysis of Variance and Parameter Estimates tables; the slope estimate for percentage of minority population is 35.287737]
[JMP output for the New Jersey banks example: fitted line for number of people per bank branch versus percent minority population, Analysis of Variance table (Source, DF, Sum of Squares, Mean Square, F Ratio) and Parameter Estimates table (Term, Estimate, Std Error, t Ratio, Prob>|t|) for the intercept and percent minority population; the values are garbled in this copy.]

Example: New Jersey banks

GROUP EXERCISES - STAT 226

EXERCISE 1. The mean age of 5 persons in a room is 30 years. A 36 year old person walks in. What is now the mean age of the persons in the room? Suppose that the median age is 30 years and a 36 year old person enters. Can you find the new median age from this information?

EXERCISE 2. For the following set of numbers
4 0 1 4 3 6
the mean, variance and standard deviation are given by $\bar{x} = 3$, $s^2 = 4.8$, $s = 2.19$. Suppose you add 2 to each of the numbers in the first set. That gives us the set
6 2 3 6 5 8
(a) Find the mean and the standard deviation of this set of numbers.
(b) Compare your answers with those for the set given above. How did adding 2 to each number change the mean?
THIS EXERCISE SHOULD HELP YOU SEE THAT THE STANDARD DEVIATION (OR VARIANCE) MEASURES ONLY THE SPREAD AROUND THE MEAN AND IGNORES CHANGES IN WHERE THE DATA ARE POSITIONED.

EXERCISE 3. This is a variance contest. You must give a list of six numbers chosen from the whole numbers 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, with repeats allowed.
(a) Give a list of six numbers with the largest variance such a list can possibly have.
(b) Give a list of six numbers with the smallest variance such a list can possibly have.

Stat 226: Introduction to Business Statistics I, Spring 2009
Chapter 2, Section 2.3: Least Squares Regression
If we find a linear association between two quantitative variables (e.g. through a scatterplot), we can use this knowledge to help predict the value of one variable (y) given the value of another variable (X). What are plausible values of y when X = ____?
We can often use a straight line (so-called regression or prediction line) to
1. describe how a response variable y changes as an explanatory variable X changes
2. predict a value of y given a specific value of X

How shall we fit a line, i.e. what is a good line?
We will use the so-called "least squares" principle to find the best line: the best line should minimize the sum of the squared errors, i.e. give the closest fit to all data points. This best line through the data is called the least squares regression line or prediction line.

Least squares regression line (LS regression line)
The equation of the least squares regression line is given by
$\hat{y} = a + bX$, where
- $\hat{y}$ corresponds to the predicted value of y for a given value of X
- $b = r \, \dfrac{s_y}{s_x}$ is the slope of the LS regression line
- $a = \bar{y} - b\,\bar{x}$ is the intercept of the LS regression line
- $r$ is the correlation coefficient between X and y
- $\bar{x}$, $s_x$ correspond to the mean and standard deviation of X
- $\bar{y}$, $s_y$ correspond to the mean and standard deviation of y

Example (recall previous data):
X: 2 5 8 8 10 12
y: 2 4 7 6 9 10
We found $\bar{x} = 7.5$, $s_x = 3.5637$, $\bar{y} = 6.33$, $s_y = 3.0111$ and $r = 0.9878$, so
$b = r \, \dfrac{s_y}{s_x} =$ ____ and $a = \bar{y} - b\,\bar{x} =$ ____
⇒ the LS regression line is given by ____

Let's discuss next how to plot the LS regression line and the meaning/interpretation of the slope b and intercept a.
How to plot the LS regression line on the scatterplot: find two points on the prediction line and connect them.
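The slope and intercept formulas above, and the two-point recipe for drawing the line, can be sketched as follows. This is a minimal illustration using the six radio-ads/sales pairs from the example; the helper name `predict` is introduced here for illustration only, and the choice of X = 3 and X = 9 as plotting points is simply a convenient pair.

```python
from statistics import mean, stdev

x = [2, 5, 8, 8, 10, 12]   # radio ads aired per week
y = [2, 4, 7, 6, 9, 10]    # sales amount
n = len(x)

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)    # sample standard deviations (divide by n - 1)

# Correlation coefficient r computed from its definition
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

b = r * s_y / s_x        # slope of the LS regression line
a = y_bar - b * x_bar    # intercept of the LS regression line


def predict(x_value):
    """Predicted value y-hat on the LS regression line (hypothetical helper)."""
    return a + b * x_value


# Two points on the prediction line; connecting them draws the line.
points = [(x_value, predict(x_value)) for x_value in (3, 9)]

print(f"r = {r:.4f}, b = {b:.4f}, a = {a:.4f}")
print("points to connect:", [(px, round(py, 2)) for px, py in points])
```

Any two distinct X values give the same line; picking them far apart within the range of the data just makes the drawing more accurate.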
For example, take X = 3 and X = 9: plot both points $(X, \hat{y})$ on the scatterplot and connect them.

What can we use the line for?
We can use the line to predict the sales amount based on the number of radio ads aired per week, e.g. what is $\hat{y}$ when X = 6?

INTERPRETATION OF INTERCEPT a AND SLOPE b
- a (intercept) corresponds to the predicted value of y when X = 0
- b (slope) corresponds to the change in the predicted value $\hat{y}$ when X increases by 1 unit
  - b > 0: $\hat{y}$ increases as X increases
  - b < 0: $\hat{y}$ decreases as X increases

Can we use the prediction line to predict the sales amount when the number of radio ads aired per week is 15? That is, what is $\hat{y}$ when X = 15?

Some facts about LS regression
1. In regression we must clearly know which variable is the explanatory variable and which variable is the response variable. Switching X and y will change the regression line but will not affect the value of the correlation r.
2. The least squares regression line goes through the point $(\bar{x}, \bar{y})$.

Some facts about LS regression (cont'd)
3. $r^2$ is called the coefficient of determination and corresponds to the amount (percent) of variation in the y-values that is accounted for by the regression of y on X; $r^2$ tells us how good the predictions will be.
4. We want $r^2$ close to 1 in general.

Example (weekly radio ads and the sales amount): r = 0.9878, so $r^2 = 0.9757$. Thus about 97.6% of the variation in y can be explained by the least squares regression line of y on the number of advertisements.
Based on the value of $r^2$, how do you know if r is negative or positive, that is, the direction of the association?
RESIDUALS AND RESIDUAL PLOTS
Residual: A residual is defined as the difference between an observed value y and its predicted value $\hat{y}$ based on the prediction line:
residual = observed − predicted = $y - \hat{y}$
A residual can be thought of as an error that we commit when using the prediction line. Unless a y-value falls right on the prediction line, a residual will be either
- positive, when the observed y-value falls above the prediction line, or
- negative, when the observed y-value falls below the prediction line,
so a residual is only zero ⇔ $y = \hat{y}$.

Example (radio ads vs. sales): the LS regression line is $\hat{y} = 0.0735 + 0.835\,X$

radio ads X    sales y    $\hat{y}$    residual $y - \hat{y}$
2              2
5              4
8              7
8              6
10             9
12             10
Total

In order to plot residuals we need to plot the residuals on the vertical axis and the corresponding X values on the horizontal axis.

WHAT TO LOOK FOR IN A RESIDUAL PLOT
Residual plots can tell us whether the fitted linear model is adequate. In general:
- residuals appear to be randomly scattered around 0 ⇒ good fit
- any pattern in a residual plot always indicates a bad fit
Some examples:
- good fit: residuals are scattered around 0; we have about as many residuals below zero as above
- curved pattern: data are not linear, a straight line fits poorly
- increasing/decreasing spread of residuals

INFLUENTIAL POINTS AND OUTLIERS
Outlier: an observation that is separated from the main bulk of the data.
An observation that has a considerable effect on the fitted regression model (e.g. on the correlation r, intercept a and/or slope b) is considered influential.
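Looking back at the radio-ads residual table earlier in this section, the empty columns can be filled in with a short script. This sketch uses the rounded coefficients quoted in the notes (a = 0.0735, b = 0.835); because of that rounding the residuals add up to a value close to, but not exactly, zero.

```python
x = [2, 5, 8, 8, 10, 12]   # radio ads
y = [2, 4, 7, 6, 9, 10]    # sales

# Rounded LS coefficients as quoted in the notes
a, b = 0.0735, 0.835

rows = []
for xi, yi in zip(x, y):
    y_hat = a + b * xi          # predicted value on the LS line
    residual = yi - y_hat       # observed minus predicted
    rows.append((xi, yi, y_hat, residual))

total = sum(residual for _, _, _, residual in rows)

print(f"{'X':>4} {'y':>4} {'y_hat':>8} {'residual':>9}")
for xi, yi, y_hat, residual in rows:
    print(f"{xi:>4} {yi:>4} {y_hat:>8.4f} {residual:>9.4f}")
print(f"Total of residuals: {total:.4f}")
```

Plotting the last column against X (X on the horizontal axis) gives the residual plot; with only six points, the check is mainly that positive and negative residuals are mixed with no obvious pattern.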
Often, observations which are outliers with respect to their X-values tend to be more influential than observations which are outliers with respect to their y-values. Such observations (outlying w.r.t. all X values) are said to have a high leverage; they alter the fitted least squares regression line significantly.
Consider the next example: data on treadmill time until exhaustion versus ski time for biathletes.

[JMP output, two slides: linear fits of ski time on treadmill time, each with Summary of Fit, Analysis of Variance and Parameter Estimates tables; the values are garbled in this copy.]

Chapter 2: Examining Relationships

Introduction
- A university admissions office would like to know how high school GPA relates to college ...
- An insurance group reports that heavier cars have fewer deaths per 10,000 vehicles registered than do lighter cars.
- These and many other statistical studies look at the relationship between two variables.
- Most statistical studies examine data on more than one variable.

Introduction
- Data consisting of two measurements (two variables) on each unit of a study is called bivariate data.
- To study a relationship between two variables, we
measure both variables on the same individuals.
- Often we think that one of the variables explains or influences the other.

Definitions
- A response variable measures an outcome of a study.
  - Sometimes called dependent variable.
- An explanatory variable explains or influences changes in a response variable.
  - Sometimes called independent variable.
- It is easiest to identify explanatory and response variables when we actually set values of one variable to see how it affects another variable.

Introduction
- When we don't set the values of either variable but just observe both variables, there may or may not be explanatory and response variables.
  - Whether there are depends on how we plan to use the data.
- Calling one variable explanatory and the other response doesn't necessarily mean that changes in one cause changes in the other.

Introduction
- Fortunately, statistical analysis of multi-variable data builds on the tools we used to examine individual variables:
  - first plot the data, then add numerical summaries
  - look for overall patterns and deviations from those patterns
  - when the overall pattern is quite regular, use a compact mathematical model to describe it

Big Picture (STATISTICS in SUMMARY: Analyzing Data in Two Variables)
- Plot your data: scatterplot
- Interpret what you see: direction, form, strength. Linear?
- Numerical summary: $\bar{x}$, $\bar{y}$, $s_x$, $s_y$ and $r$
- Mathematical model? Regression line

Section 2.1 Scatterplots

Scatterplot
- A scatterplot shows the relationship between two quantitative variables measured on the same individuals.
- The values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis.
- Each individual in the data appears as the point in the plot fixed by the values of both variables for that individual.

Helpful Hints
- Always plot the explanatory variable, if there is one, on the horizontal axis.
- We usually call the explanatory variable x and the response variable y.
- If there is no explanatory/response distinction, either variable can go on the horizontal axis.
[Data table: gross sales and number of items sold for each day; the values are garbled in this copy.]

Scatterplot
Scatterplot of gross sales for each day in April 2000 against the number of items sold for the same day.
[Figure: scatterplot with Gross Sales on one axis and number of items sold on the other.]

Interpreting Scatterplots
- In any graph of data, look for the overall pattern and for striking deviations from that pattern.
- You can describe the overall pattern of a scatterplot by the form, direction and strength of the relationship.
- An important kind of deviation is an outlier: an individual value that falls outside the overall pattern of the relationship.

Interpreting Scatterplots
- Form
  - is there a linear pattern (may be approximate)?
  - are there clear outliers or clusters of points?
- Direction
  - positively associated: an increase in one variable is accompanied by an increase in the other
  - negatively associated: a decrease in one variable is accompanied by an increase in the other
- Strength
  - how closely the points follow a clear form

Example continued
- Form: linear
- Direction: positive
- Strength: moderately strong

Caution
- Not all relationships are linear.
- Not all relationships have a clear direction that we can describe as positive association or negative association.

Want to Add Categorical Variables?
- Use different colors or symbols to plot points when you want to add a categorical variable to a scatterplot.
- Sometimes several individuals have exactly the same data: you can use a different plotting symbol to call attention to such points.

[Data table: date, day of the week, gross sales and number of items sold; the values are garbled in this copy.]

Scatterplot
- x's represent weekdays
- the second plotting symbol represents Saturdays

Section 2.1 Summary
- To study relationships between variables, we must measure the variables on the same group of individuals.
- If we think that a variable x may explain or even cause changes in another variable y, we call x an explanatory variable and y a response variable.
- A scatterplot displays the relationship between two quantitative variables measured on the same individuals. Mark values of one variable on the horizontal axis and values of the other variable on the vertical axis. Plot each individual's data as a point on the graph.
- Always plot the explanatory variable, if there is one, on the x axis of a scatterplot. Plot the response variable on the y axis.
- Plot points with different colors or symbols to see the effect of a categorical variable in a scatterplot.
- In examining a scatterplot, look for an overall pattern showing the form, direction and strength of the relationship, and then for outliers or other deviations from this pattern.
- Form: Linear relationships, where the points show a straight-line pattern, are an important form of relationship between two variables. Curved relationships and clusters are other forms to watch for.
- Direction: If the relationship has a clear direction, we speak of either positive association (high values of the two variables tend to occur together) or negative association (high values of one variable tend to occur with low values of the other variable).
- Strength: The strength of a relationship is determined by how close the points in the scatterplot lie to a simple form such as a line.

Stat 226: Introduction to Business Statistics I, Spring 2009
Chapter 1, Section 1.2
Review: Chapter 1, Section 1 (Describing distributions with graphs)
- Pie charts
- Bar graphs
- Pareto graphs
- Histograms
- Stemplots

Match the histograms to the best description:
1. Numbers of medals won by countries in the 1992 Winter Olympics
2. Last digit of each of 500 students' social security numbers
3. Age at death of a sample of 45 persons
4. The SAT scores of 500 students
5. The heights in inches of 500 college students
6. Time on hold at a help line
[Figure: six histograms to be matched.]

Chapter 1.2 Describing Distributions with Numbers
We want to describe NUMERICALLY
1. the CENTER of the data
2. the SPREAD of the data

Measuring the center
- associated with locating the middle of the data
- finding the value that is most typical for the data
Three common measures:
- mean: average value of all data points
- median: middle value of all data points
- mode: data point(s) with highest frequency (most popular)

Sample mean
Notation: $\bar{x}$
The sample mean of a set of observations $x_1, x_2, \ldots, x_n$ is the arithmetic average of all observations:
$\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{1}{n} \sum_{i=1}^{n} x_i$
Example: number of sick days employees took in a small local business:
0 1 2 0 4 0 1 2 3

Sometimes the mean is not an appropriate measure of the center because it simply does not reflect a typical value of the data. This is almost always the case when we have unusually large or small observations in the data (called outliers).
Example: starting salaries of 5 people after graduating from college:
35000 37000 35000 33000 210000
$\bar{x} = \dfrac{35000 + 37000 + 35000 + 33000 + 210000}{5} = 70000$
70000 is certainly not a typical starting salary for all 5 people; it is just the average.

Note: The sample mean $\bar{x}$ is sensitive toward outliers, i.e. it gets pulled toward the extreme values in a data set.
After removing the salary of 210000, the new sample mean is
$\bar{x}_{new} = \dfrac{35000 + 37000 + 35000 + 33000}{4} = 35000$

Median
A measure of center that is more robust against outliers is the so-called median.
Notation: median = M
The median corresponds to the value of the data that occupies the middle position when all observations are ordered from smallest to largest.

Finding the median:
1. order all observations from smallest to largest
2. assess whether the total number of observations n is odd or even
3. locate the middle value of the data:
   - odd: the median M is the middle observation in the ordered list, i.e. the $\left(\frac{n+1}{2}\right)$th observation
   - even: the median M corresponds to the average of the two middle observations in the ordered list, i.e. the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th observation

Examples:
- Example 1 (salary data, ordered): 33000 35000 35000 37000 210000
- Example 2: 33000 35000 35000 37000 39000 210000

Mean vs. Median
Note: for the salary data in Example 1, $\bar{x} = 70000$ and M = 35000.
- The median M is obviously less influenced by outliers.
- We should not conclude, though, that the median should always be preferred over the mean simply because of its robustness against outliers.
- The mean and median measure the center of a data set in different ways; they are both useful depending on the situation/application.
Example (sick days data): 0 1 2 0 4 0 1 2 3; add one more data point of x = 56.
If costs are directly associated with the amount of sick days, then the mean would clearly be a better measure as it takes the extreme observation into account. If we are just interested in the typical number of sick days for all employees, the median is probably the more representative measure.
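The salary and sick-days examples above can be checked with Python's standard `statistics` module. This is a small sketch; the variable names are chosen here for illustration.

```python
from statistics import mean, median

# Starting salaries (Example 1) and the same list without the outlier
salaries = [35000, 37000, 35000, 33000, 210000]
salaries_no_outlier = [s for s in salaries if s != 210000]

# Example 2: an even number of observations
salaries_even = [33000, 35000, 35000, 37000, 39000, 210000]

# Sick days, before and after adding the extreme observation x = 56
sick_days = [0, 1, 2, 0, 4, 0, 1, 2, 3]
sick_days_plus = sick_days + [56]

print("salaries:      mean", mean(salaries), "median", median(salaries))
print("no outlier:    mean", mean(salaries_no_outlier))
print("even example:  median", median(salaries_even))
print("sick days:     mean", round(mean(sick_days), 2), "median", median(sick_days))
print("sick days +56: mean", round(mean(sick_days_plus), 2), "median", median(sick_days_plus))
```

The single observation of 56 moves the mean of the sick-days data a great deal while the median barely changes, which is the robustness property described above.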
Mean vs. Median: relation between the shape of a distribution and mean/median
The more symmetric a distribution is, the closer the mean and the median will be.
- perfectly symmetric: mean = median
- skewed to the right: mean > median
- skewed to the left: mean < median

Mode
The mode corresponds to the value of the variable that occurs most frequently. It is most useful for categorical data with a relatively small number of possible values.
Example (Stat 226, classification): Fr: 4, So: 31, Jr: 41, Sr: 7

Measuring spread/variation
- Variation is always present in real data.
- It is important to know how spread out the data are, as this tells us something about the behavior of a variable.
- Furthermore, describing data just using the measures of location/center is not sufficient: totally different data sets can still have the same mean/median.
Example (number of sick days for 9 employees, two data sets):
Data set 1: 0, 0, 0, 1, 1, 2, 2, 3, 4
Data set 2: 0, 0, 0, 0, 0, 0, 0, 0, 13

Measures of spread: Quartiles Q1, Q2, Q3
- describe the position of a specific data value in relation to the rest of the data
- are 3 numbers that divide the ordered observations into 4 equally sized groups, i.e. each group contains 25% of all observations

Finding quartiles
- Q1 = median of all observations to the left of the median M
- Q2 corresponds to the median M
- Q3 = median of all observations to the right of the median M
Example (sick days, total number of observations is odd):
0 0 0 1 1 2 2 3 4
If an additional observation of x = 56 is added, the total number of observations is now even:
0 0 0 1 1 2 2 3 4 56
Quartiles are also less influenced by outliers.

Boxplots
A graphical display of the 5-number summary
is a so-called boxplot.
- boxplots can be either vertical or horizontal
- side-by-side boxplots can be used to compare different groups

Five-number summary
A convenient tool to describe both the center and the spread in a data set.
The 5-number summary consists of the following measures:
Min, Q1, Median, Q3, Max
Example: sick days
Note: Min and Q1 coincide here due to the nature of the data; this is more the exception than it is the rule.

Boxplots: side-by-side boxplots
Example: data on the number of surgeries performed by male and female surgeons in a hospital
female: 5 7 10 14 18 19 25 29 31 32
male: 20 25 25 27 28 31 33 34 36 36 37 44 50 59 85 86
5-number summary:

Boxplots: using boxplots to describe distributions
- less variability among female surgeons
- the distribution is also more symmetric for female surgeons
- more variability among male surgeons
- the mean/median is much higher for male surgeons than for female ones
In general:
- for a symmetric distribution, Q1 and Q3 are about equally far apart from M, as well as Min and Max
- for a skewed-to-the-right distribution, Q3 will be further away from M than Q1, as well as Max compared with Min
- for a skewed-to-the-left distribution, Q1 will be further away from M than Q3, as well as Min compared with Max

More measures of spread
Measuring spread: the range, IQR, variance and standard deviation
We need to describe the amount of spread or variability that is present in the data.
Note: Any measure of spread will take the value of zero only if all observations in the data set have the same value.

Range R
The range R corresponds to the difference between the highest and lowest value.
Example: number of surgeries performed by the 16 male surgeons
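The 5-number summary and the range for the examples above can be sketched with the quartile rule from these notes (Q1 and Q3 are the medians of the observations to the left and right of M). The function name `five_number_summary` is introduced here for illustration, and statistical software such as JMP may use a slightly different quartile convention, so its output can differ a little.

```python
from statistics import median


def five_number_summary(data):
    """Return (Min, Q1, Median, Q3, Max) using the rule from the notes:
    Q1/Q3 are the medians of the observations left/right of the median M."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]                                   # left of M
    upper = values[half + 1:] if n % 2 else values[half:]   # right of M
    return (values[0], median(lower), median(values), median(upper), values[-1])


sick_days = [0, 1, 2, 0, 4, 0, 1, 2, 3]
female = [5, 7, 10, 14, 18, 19, 25, 29, 31, 32]
male = [20, 25, 25, 27, 28, 31, 33, 34, 36, 36, 37, 44, 50, 59, 85, 86]

for label, data in (("sick days", sick_days), ("female", female), ("male", male)):
    print(label, five_number_summary(data))

# Range R = highest value - lowest value, for the 16 male surgeons
range_male = max(male) - min(male)
print("range (male):", range_male)
```

For the sick-days data the first two entries of the summary are equal, which is the "Min and Q1 coincide" remark made above.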
Interquartile range
Note: the range shows the full spread in the data, but it depends on the smallest and largest observation, which could be outliers. Alternatively, we can use the so-called interquartile range (IQR).
IQR = Q3 − Q1, corresponding to the range of the middle 50% of the data.
Example: 16 male surgeons

More measures of spread: sample variance
Improve the description of SPREAD by looking at the deviations of each single observation from the mean, i.e. how far is an observation away from the overall mean $\bar{x}$?
Sample variance $s^2$: The sample variance corresponds to the sum of all squared deviations of each observation from the sample mean $\bar{x}$, divided by $n - 1$:
$s^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

Sample standard deviation
Standard deviation s: The standard deviation is the positive square root of the variance $s^2$.
Why work with s instead of $s^2$? s has the same units of measurement as the observations in the data set.
Example (number of surgeries, female surgeons): 5 7 10 14 18 19 25 29 31 32

Variance and standard deviation
Note:
- the variance $s^2$, and hence s, can only be greater than or equal to zero, as it is based on squared deviations
- $s^2$ and s measure the spread about the sample mean $\bar{x}$
- $s^2 = s = 0$ only if all observations are of the same value
- $s^2$ and s are strongly influenced by outliers: one outlier can cause $s^2$ and s to drastically increase in value

Choosing a numerical summary
The choice of an appropriate measure of center/spread heavily relies on
- the shape of the distribution
- the presence of outliers
If the data are reasonably symmetric and no outliers are present, then the sample mean $\bar{x}$ and the standard deviation s can be used.
If the data are skewed and/or outliers are present, the 5-number summary should be used.
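The IQR, variance and standard deviation for the surgeon examples can be worked out step by step as below. This is a minimal sketch: the variance is computed with the $n - 1$ divisor (the same divisor that reproduces $s^2 = 4.8$ in Exercise 2 of the group exercises), and the quartiles follow the "median of each half" rule from these notes.

```python
from statistics import mean, median, variance

female = [5, 7, 10, 14, 18, 19, 25, 29, 31, 32]
male = [20, 25, 25, 27, 28, 31, 33, 34, 36, 36, 37, 44, 50, 59, 85, 86]

# Sample variance and standard deviation, female surgeons
x_bar = mean(female)
squared_deviations = [(x - x_bar) ** 2 for x in female]
s2 = sum(squared_deviations) / (len(female) - 1)   # divide by n - 1
s = s2 ** 0.5                                      # same units as the data

# IQR for the 16 male surgeons (n is even, so split the sorted list in half)
male_sorted = sorted(male)
half = len(male_sorted) // 2
q1 = median(male_sorted[:half])
q3 = median(male_sorted[half:])
iqr = q3 - q1

print(f"female: mean = {x_bar}, s^2 = {s2:.3f}, s = {s:.3f}")
print(f"male:   Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```

The library function `statistics.variance` uses the same $n - 1$ divisor, so it can serve as a cross-check on the hand computation.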
Stat 226: Introduction to Business Statistics I, Spring 2009
Chapter 1, Section 1.3: The Normal Distribution

DENSITY CURVES
So far we have
1. graphically displayed data (histogram, stemplot, boxplot)
2. described the overall pattern and identified deviations and outliers
3. numerically quantified center and spread of the distribution
If the distribution as displayed by the histogram appears sufficiently regular, we can approximate it with a smooth curve, a so-called density curve.
The density curve is a simplified and idealized version of reality, but can still be useful!
Example: the Normal Distribution

The Density Curve (gas mileage example from the textbook)
Properties: A density curve is a curve that
- is always on or above the horizontal axis, and
- has an area of exactly 1 underneath it.
A density curve describes the overall pattern of a distribution. The area under the curve and above any range of values is the proportion of all observations that fall in that range.
Examples:

MEDIAN AND MEAN OF A DENSITY CURVE
Median: the equal-areas point, with 50% of the mass on either side.
Mean: the balancing point of the curve, if it were a solid mass.

INTRODUCTION TO NORMAL DISTRIBUTIONS
- the Normal (or Gaussian) distribution is the single most important distribution in Statistics (Carl Friedrich Gauss, 1777-1855)
- many variables can be modeled/described using the Normal
distribution, e.g.
  - height of humans
  - SAT scores
  - length of human pregnancies, etc.
- it is characterized by the following two parameters: the mean $\mu$ and the standard deviation $\sigma$
- overall shape: symmetric, unimodal, bell-shaped

[Figure: pictures of various normal distributions.]

Notation: to denote the normal distribution we use $N(\mu, \sigma)$.
EXAMPLE: ____ denotes a normal distribution with mean ____ and standard deviation ____; ____ denotes a normal distribution with mean ____ and standard deviation ____.
To denote that a variable X (e.g. heights, SAT scores, etc.) follows a normal distribution we write $X \sim N(\mu, \sigma)$.

THE 68-95-99.7 RULE
Holds for ALL normal distributions, i.e. for ANY choice of $\mu$ and $\sigma$.
68-95-99.7 Rule: For a variable that follows a $N(\mu, \sigma)$ distribution we have that
- approx. 68% of the data fall within 1 standard deviation of the mean, i.e. within $\mu \pm \sigma$
- approx. 95% of all the data fall within 2 standard deviations of the mean, i.e. within $\mu \pm 2\sigma$
- approx. 99.7% of all the data fall within 3 standard deviations of the mean, i.e. within $\mu \pm 3\sigma$
[Figure: normal curve with the 68%, 95% and 99.7% regions marked.]

Example: The length of human pregnancies follows a normal distribution with mean $\mu = 266$ days and a standard deviation of $\sigma = 16$ days.
1. How long do the middle 95% of all pregnancies last?
2. How long do the shortest 16% of all pregnancies last at most?
3. How long do the longest 0.15% of all pregnancies last at least?

THE STANDARD NORMAL DISTRIBUTION
- is a special normal distribution
- has a mean of 0 and a standard deviation of 1
- denoted by $N(0, 1)$
- nearly all the area is between −3 and 3
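The three pregnancy questions above can be answered with the 68-95-99.7 rule alone. The sketch below does that arithmetic and then, as an optional cross-check, compares two of the answers with exact normal areas from Python's `statistics.NormalDist`; the variable names are illustrative only.

```python
from statistics import NormalDist

mu, sigma = 266, 16   # length of human pregnancies, in days

# 1. Middle 95%: within 2 standard deviations of the mean
middle_95 = (mu - 2 * sigma, mu + 2 * sigma)

# 2. Shortest 16%: 68% lie within 1 sd, so 32% lie outside, 16% in each tail
shortest_16_at_most = mu - sigma

# 3. Longest 0.15%: 99.7% lie within 3 sd, so 0.3% lie outside, 0.15% per tail
longest_015_at_least = mu + 3 * sigma

print("middle 95% last between", middle_95, "days")
print("shortest 16% last at most", shortest_16_at_most, "days")
print("longest 0.15% last at least", longest_015_at_least, "days")

# Cross-check: the rule is an approximation of the exact normal areas
pregnancy = NormalDist(mu, sigma)
exact_lower_tail = pregnancy.cdf(shortest_16_at_most)        # close to 0.16
exact_upper_tail = 1 - pregnancy.cdf(longest_015_at_least)   # close to 0.0015
print(f"exact tails: {exact_lower_tail:.4f}, {exact_upper_tail:.5f}")
```

The exact tail areas differ slightly from 16% and 0.15% because 68, 95 and 99.7 are rounded percentages.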
Knowing the mean and the standard deviation of a normal distribution allows us to determine
1. what proportion of individuals fall in a specified range
2. what percentile a given individual falls at, if you know their data value
3. what data value corresponds to a given percentile

- For the standard normal distribution, the proportion of observations falling into a specified range is tabulated.
- This is the ONLY normal distribution for which we have tabulated values.
- We therefore need to convert any given normal distribution to a standard normal distribution, i.e. the values from any $N(\mu, \sigma)$ are transformed to the corresponding values from a $N(0, 1)$.
- This is called standardizing.

Standardizing, z-score
If x is an observation from a normal distribution that has mean $\mu$ and standard deviation $\sigma$, the standardized value of x is given by
$z = \dfrac{x - \mu}{\sigma}$
A standardized value is often called a z-score.
- A z-score tells us how many standard deviations the original observation is off the mean, and in which direction.
- Observations larger than the mean have a positive z-score when standardized, and observations smaller than the mean have a negative z-score when standardized.

Example: length of human pregnancies (continued)

FINDING Z-SCORES AND CORRESPONDING PROPORTIONS/AREAS UNDER THE NORMAL CURVE
Why are z-scores helpful?
- IQs follow a normal distribution with mean $\mu = 100$ and standard deviation $\sigma = 16$
- heights of males follow approx. a normal distribution with mean $\mu = 70$ inches and $\sigma = 3$
Who is more unusual: a man being 73 inches tall, or a man having an IQ of 124?

Once we know the corresponding z-score of an observation, we can look up the overall
proportion (percentage) of men in that population having a height of 73 inches or more
⇒ need to know how to read Table A (Table of the Standard Normal Distribution); see Table A in your textbook.
Note: in the following, the terms proportion, probability, percentage and area are all interchangeable, i.e. proportion = probability = percentage = area.

Reading Table A: the left column gives the z-score values correct to one decimal place, and the top row gives the second decimal place for a z-score. To find the area to the left of a given z, locate its first decimal in the left column and its second decimal along the top row; the entry where the corresponding row and column meet gives the value.
[Figure: excerpt of Table A with the standard normal curve.]

USING TABLE A TO FIND PROPORTIONS UNDER THE NORMAL CURVE
Consider the following situations:
1. What proportion of observations is below z = −1.67? I.e. what is the probability of observing a z-score of −1.67 or less?
2. What proportion of observations is greater than z = 1.67?
3. What proportion is less than z = −2.00 or greater than z = 2.00 (both tails combined)?
4. What is the area between z = −1.25 and z = 1.25?
5. What z-score does the 30th percentile correspond to?

APPLICATIONS OF THE NORMAL DISTRIBUTION
1. State the problem, i.e. state the mean $\mu$, the standard deviation $\sigma$ and the value of the observation x.
2. Standardize x, i.e. find the corresponding z-score using $z = \dfrac{x - \mu}{\sigma}$.
3. Draw a picture, i.e. locate the z-score under the normal curve and shade the area of interest.
4. Use Table A to find the shaded area.
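The Table A look-ups and the four-step recipe above can be mirrored in software. The sketch below uses `statistics.NormalDist` from the Python standard library in place of the printed table (an assumption made here for illustration; the course itself uses Table A and JMP), so the results agree with Table A up to rounding.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal distribution N(0, 1)

# Table A gives areas to the LEFT of z; cdf() plays the same role.
below = Z.cdf(-1.67)                             # area below z = -1.67
above = 1 - Z.cdf(1.67)                          # area above z = 1.67
two_tails = Z.cdf(-2.00) + (1 - Z.cdf(2.00))     # below -2.00 or above 2.00
between = Z.cdf(1.25) - Z.cdf(-1.25)             # between -1.25 and 1.25
z_30th = Z.inv_cdf(0.30)                         # z-score of the 30th percentile

print(f"below -1.67: {below:.4f}, above 1.67: {above:.4f}")
print(f"two tails beyond +/-2.00: {two_tails:.4f}")
print(f"between -1.25 and 1.25: {between:.4f}")
print(f"30th percentile: z = {z_30th:.2f}")

# Four-step recipe for the height example: heights ~ N(70, 3), x = 73 inches
mu, sigma, x = 70, 3, 73       # step 1: state the problem
z = (x - mu) / sigma           # step 2: standardize
taller = 1 - Z.cdf(z)          # steps 3-4: area to the right of z
print(f"z = {z:.2f}, proportion 73 inches or taller: {taller:.4f}")
```

Because the normal curve is symmetric, the areas below −1.67 and above 1.67 are equal, which is a quick sanity check on any table look-up.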
area.

APPLICATIONS OF THE NORMAL DISTRIBUTION
Example: male heights ~ N(70, 3)
- What proportion of men is shorter than 72 inches?
- What proportion of men is taller than 65 inches?
- What proportion of men is taller than 73 inches?

What proportion of men has an IQ of 124 or more? IQ ~ N(100, 16)

BACKWARDS CALCULATIONS
We can also work backwards: given a certain percentile or proportion, what is the corresponding value of \( X \)?
Example: heights ~ N(70, 3)
- What value does the 50th percentile of men's height correspond to?
- What value does the 10th percentile of men's height correspond to?

In general, to do backward calculations use the following formula:
\( X = z\sigma + \mu \)
What value does the 85th percentile correspond to?

ASSESSING NORMALITY OF DATA
- Based on experience and/or past data, the assumption of normality might be justified.
- In general, though, it is quite risky to assume normality without looking at the data and verifying normality.
- Normally distributed data allow the application of further statistical procedures, which enable us to learn more about the data and to derive additional information about the variable we are interested in. We will learn about such procedures in Chapters 6 & 7.
- If data are not normally distributed and we still apply statistical procedures that require the assumption of normality, the derived information can be wrong and misleading.

HOW TO ASSESS NORMALITY
- A histogram, stemplot, or boxplot reveals non-normal features such as
  - skewness
  - multiple modes
  - outliers
- If the above graphical displays appear somewhat normal, i.e. they indicate a symmetric, unimodal, bell-shaped distribution, we can use a so-called normal quantile plot.
- Normal quantile plots are a more sensitive tool, allowing us to take a closer look to judge the adequacy of normality.

NORMAL QUANTILE PLOTS
- hard to construct by hand => use JMP
- for the main idea see pages 67 & 68 of the textbook
- If the distribution is close to a normal distribution, the plotted points in a normal quantile plot will lie close to a straight line.

Some caution:
- Real data almost always show some departure from normality, i.e. from a perfect normal distribution.
- It is important to restrict the examination of a normal quantile plot to searching for clear departures from normality.
- We can ignore minor wiggles in the plot: most common methods will work well as long as the data are reasonably close to a normal distribution.

[Figures: normal quantile plots for data with extreme outliers; for small sample sizes (n = 10, n = 25); and for observations from a right-skewed and a triangular distribution]

Stat 226: Introduction to Business Statistics I, Spring 2009
Professor: Dr. Petrutza Caragea, Section A, Tuesdays and Thursdays 9:30-10:50 a.m.
Chapter 8, Section 8.1: Inference for population proportions

INFERENCE FOR A POPULATION PROPORTION p
Suppose we are interested in the proportion of people with credit card debt larger than $5,000. The parameter of interest is now no longer
a mean but a proportion, e.g. say 25% of all credit card holders. We will denote the population proportion by \( p \).

The Census Bureau obtains a random sample of 2500 people and finds that 750 have more than $5,000 in credit card debt. How would you estimate \( p \)?

We can use the sample proportion \( \hat{p} \) to estimate the population proportion \( p \) (here \( \hat{p} = 750/2500 = 0.30 \)). \( \hat{p} \) is an unbiased estimator of \( p \).

SAMPLING DISTRIBUTION OF \( \hat{p} \)
Assuming that we have a random sample, the value of \( \hat{p} \) will be random as well: our statistic \( \hat{p} \) is a random variable. The properties of the sampling distribution of \( \hat{p} \) are:
- the shape is close to normal
- the mean of \( \hat{p} \) is the population proportion \( p \), i.e. \( \mu_{\hat{p}} = p \)
- the standard deviation of \( \hat{p} \) is \( \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \)

Because the mean of the sampling distribution is indeed \( p \), \( \hat{p} \) is an unbiased estimator of \( p \). As the sample size \( n \) increases, the spread of the sampling distribution of \( \hat{p} \) decreases.

For sufficiently large sample sizes we have that
\( \hat{p} \sim N\left(p, \sqrt{\frac{p(1-p)}{n}}\right) \)

Knowing the sampling distribution of \( \hat{p} \), we can do inference for the population proportion \( p \) in the form of
- confidence intervals
- hypothesis tests

A \( (1-\alpha) \cdot 100\% \) CI FOR THE POPULATION PROPORTION p
A \( (1-\alpha) \cdot 100\% \) confidence interval for the population proportion \( p \) is given by
\( \hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \)
where we estimate \( \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \) by using \( \hat{p} \) instead of the unknown \( p \).

Example: Bob wonders what proportion of students at his school think that tuition is too high. He interviews a random sample of 50 of the 2400 students at his small college and finds that 38 think that tuition is too high. Construct a 95% confidence interval for the proportion of students of the entire college thinking that tuition is too high.
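As a rough illustration of the interval formula applied to Bob's sample (38 of 50 students), the following Python sketch computes the sample proportion, the margin of error, and the interval. The function name `proportion_ci` is hypothetical, and the critical value z* = 1.96 for 95% confidence is the usual standard normal value, assumed here rather than taken from the notes.

```python
import math

def proportion_ci(successes, n, z_star=1.96):
    """Large-sample confidence interval for a population proportion.

    z_star = 1.96 is the assumed standard normal critical value for 95% confidence.
    """
    p_hat = successes / n                      # sample proportion
    se = math.sqrt(p_hat * (1 - p_hat) / n)    # estimated standard deviation of p_hat
    margin = z_star * se                       # margin of error m
    return p_hat, margin, (p_hat - margin, p_hat + margin)

# Bob's data: 38 of 50 sampled students think tuition is too high
p_hat, margin, (low, high) = proportion_ci(38, 50)
print(f"p_hat = {p_hat:.2f}, margin = {margin:.4f}, CI = ({low:.4f}, {high:.4f})")
# prints: p_hat = 0.76, margin = 0.1184, CI = (0.6416, 0.8784)
```

Under these assumptions the interval runs from roughly 64% to 88%. Whether the large-sample formula is appropriate still depends on the assumptions that apply to any such interval (random sample, population much larger than the sample, enough successes and failures).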
Assumptions
- Independence assumption
  - Plausible independence?
  - Random sample?
  - 10% condition: population size > 10n
- Sample size assumption: large enough for the CLT
  - Check that \( n\hat{p} > 10 \) and \( n(1-\hat{p}) > 10 \)

Question: If this is not true, can we still estimate \( p \)?

Answer: Yes. Use the so-called Wilson's estimator \( \tilde{p} \).
How? We simply add 4 phony observations: 2 yes (positive counts) and 2 no (negative counts). Then estimate
\( \tilde{p} = \frac{X + 2}{n + 4} \)
A \( (1-\alpha) \cdot 100\% \) confidence interval for the population proportion \( p \) using Wilson's estimate is given by
\( \tilde{p} \pm z^* \sqrt{\frac{\tilde{p}(1-\tilde{p})}{n + 4}} \)
- This helps to move \( \tilde{p} \) further away from 0 or 1, respectively.
- As long as \( n \ge 5 \), this works very well.

Margin of error m too large to be useful
Example: Bob's data (cont'd)
[Cartoon: a weather forecaster who is "never wrong" because the forecast for tomorrow's high temperature is given as a range between two values]

DETERMINING SAMPLE SIZE
Back to Bob's data: a confidence interval that says that the percentage of people who think they pay too much for tuition is between 10% and 90% wouldn't be of much use. Most likely you have a sense of how large a margin of error you can tolerate. You would like to get a narrower interval without giving up confidence => you need to have less variability in your sample proportions. How can you do that? Choose a larger sample. How large?

Recall that when estimating unknown means we used
\( n \ge \left(\frac{z^* \sigma}{m}\right)^2 \)
All we need to do is adjust the standard deviation \( \sigma \) and use the standard deviation of \( \hat{p} \):
\( \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \)
This yields a required sample size of
\( n = \left(\frac{z^*}{m}\right)^2 p(1-p) \)   (round up!)

What we know: \( z^* \), \( m \)
What we usually don't know: \( p \)
What can we do?
- One possibility: consider the worst-case scenario, the one that needs the largest sample size: \( p = 0.5 \)

Other possibilities:
- Use information on \( p \) (\( \hat{p} \) or \( \tilde{p} \)) from previous studies, e.g. prior belief, a pilot study, historical data.
- If we are going to use Wilson's estimate \( \tilde{p} \), we need to remember that we add 4 phony observations, so we really only need
  \( n = \left(\frac{z^*}{m}\right)^2 \tilde{p}(1-\tilde{p}) - 4 \)
  observations (again, round up).

Example: Bob's data. We want a 95% CI for \( p \) that has a width of only 0.1 => margin of error m = width/2 = 0.1/2 = 0.05. Based on Bob's previous data we have
\( \tilde{p} = \frac{X + 2}{n + 4} = \frac{4}{14} = 0.2857 \)

SIGNIFICANCE TEST FOR A SINGLE PROPORTION
- When performing a hypothesis test, the null hypothesis specifies a value for \( p \), which we will call \( p_0 \).
- When calculating p-values, we again act as if the hypothesized \( p \), i.e. \( p_0 \), were actually true.
- When testing \( H_0: p = p_0 \), we substitute \( p_0 \) for \( p \) in the expression for \( \sigma_{\hat{p}} \) and then standardize \( \hat{p} \) in order to obtain our test statistic. That is, we get
  \( z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} \)

Stat 226: Introduction to Business Statistics I, Spring 2009
Professor: Dr. Petrutza Caragea, Section A, Tuesdays and Thursdays 9:30-10:50 a.m.
Chapter 3, Section 3.3: Toward Statistical Inference

Question: What is the average height of all Stat 226 students?
We have several options to answer this question:
1. wild guess
2. collect everybody's height and compute the exact average
3. take a representative sample and compute the sample mean

Out of the three options, it becomes obvious that the third, taking a
representative sample and computing the sample mean, appears to be the most reasonable one. However, this option raises a new and even more important question, namely: How reliable is our estimate based on the sample?

Answer: it depends on
1. the choice of the sample, i.e. in which way was the sample obtained?
2. the sample size: the larger the sample => the more information we have at hand => the more accurate and precise our estimate should be.

Let's recall the "BIG PICTURE":
- \( \mu \) is the overall mean of the population
- \( \mu \) is a fixed value, but unknown
- \( \mu \) is referred to as a population parameter
- \( \bar{x} \) is the mean of the sample taken from the population
- \( \bar{x} \) varies from sample to sample (random), but we will know its value once we have collected the sample
- \( \bar{x} \) is referred to as a sample statistic

HOW DO WE OBTAIN A REPRESENTATIVE SAMPLE?
We distinguish two types of studies in statistics: observational studies versus experiments.
- Observational study: observe individuals with respect to a variable of interest.
  - In a 1981 study, researchers compared the scholastic performance of music students with that of non-music students at a California high school.
  - Music students had a much higher overall GPA than non-music students.
  - A whopping 16% of music students had all As, compared with only 5% of the non-music students.
  - As a result of the study, music programs were expanded nationwide.

What is wrong with concluding that music education causes good grades?
- Researchers tried to show an association between music education and grades. But the study was not a survey, nor were students assigned to get music education.
- Students were simply observed, recording the choices (music education, no music education) they made and
the overall outcome (grades).

Observational study: In observational studies, treatments don't get assigned to study individuals; individuals are simply observed.

- Experiment: we actively impose a treatment on individuals and observe a variable of interest.
  - Is a new drug more effective in lowering blood cholesterol level compared to standard drugs?
  - A group of patients gets randomly assigned to one of two treatment groups: new drug and standard drug.
  - Receiving the standard drug is called the control treatment.
  - Patients do not know which drug they receive, to eliminate bias.
  - If neither doctors nor patients know who is receiving which treatment, then this study is called a double-blinded study.

Experiments: An experiment requires a random assignment of study subjects to treatments.

Experiments are the only way to show cause-and-effect relationships. There is much more to learn about designing an experiment, but that is beyond the scope of this class. Keep in mind, though, that
- experiments can be designed well, but also really badly. Badly designed experiments often reveal no information at all.
- most of the success in conducting a designed experiment results directly from how well the pre-experimental planning was done.

HOW TO OBTAIN A RANDOM SAMPLE
Consider the following example: You want to find out how much debt an Iowa State student has on average. How should you pick a representative sample?
- take all Stat 226 students from our section
- go to the dorms and take a random sample of 100 students
- go to the library and take a random sample of 100 students
- sample from the football team

- Convenience sampling: the selection of units from the population is based on easy
availability and/or accessibility
  - e.g. mail survey
  - the tradeoff made for ease of obtaining the sample is that such samples are typically not very representative of the population
  - very often yields biased responses
- Voluntary response sample: consists of people who chose themselves by responding to a general appeal
  - e.g. NBC, CNN polls
  - be aware: they often over-represent people with strong opinions, most often negative opinions
  - very often yields biased responses
  - a study of exercise called for volunteers to run on a treadmill => the study concluded that Americans are in great shape

The advice columnist Ann Landers once asked her readers, "If you had to do it over again, would you have children?" A few weeks later, her column was headlined "70% OF PARENTS SAY KIDS NOT WORTH IT". Indeed, 70% of the nearly 10,000 parents who wrote in said they would not have children if they could make the choice again. These data are worthless as indicators of opinion among all American parents. The people who responded felt strongly enough to take the trouble to write Ann Landers. Their letters showed that many of them were angry at their children. These people don't fairly represent all parents. It is not surprising that a statistically designed opinion poll on the same issue a few months later found that 91% of parents would have children again. Ann Landers announced a 70% "No" result when the truth about parents was close to 91% "Yes".
http mm kinns k1 caius1 Lunbmythtml

SO HOW DO WE CHOOSE A SAMPLE?
Example: using the map provided in class, choose 5 counties of Iowa.
The best way to obtain a representative sample is to let chance choose the sample from the population: random selection => removes bias and subjectivity.

Simple Random Sample of size n: To obtain a so-called simple random sample of size n, create a list of all individuals of the population and choose n
at random, e.g. using the table of random digits (Table B). In a simple random sample (SRS), each set of n individuals has an equal chance of selection.

USING THE TABLE OF RANDOM DIGITS
- Label all individuals, assigning each a distinct number/label.
  - Labels have to have the same number of digits.
  [Table: example labelings of the counties Adam, Story, Polk, Ida, Mills, Clay, Lyon, szz, Linn, Lee, Jackson, marked correct / wrong / correct]
- Pick a line in Table B to start, e.g. line 122.
- Choose a sample of size n by selecting the first n labels that appear.
- If a label/number does not match any label in the list, or if a label/number comes up more than once => skip it.
- If you cannot obtain a sample of size n in one line, continue in the next line (e.g. with line 123 if you started in line 122).

Example: for a simple random sample of size n = 5 of Iowa counties, starting at line 122 and using labels as indicated on the map provided in class, we obtain:

Caution:
- SRSs are not always feasible and appropriate.
- E.g. you may consider so-called stratified random samples: divide the population into strata, groups of individuals that are similar in some way that is important to the response. Then choose a separate SRS from each stratum and combine these SRSs to form the full sample (more on this on page 179 of the textbook).

Assuming that we obtained a representative sample of size n, how do we know that the sample mean \( \bar{x} \) from this sample is indeed a good estimate for \( \mu \)?

Answer: Amazingly, averages of random samples behave in very regular and predictable ways, so knowing how \( \bar{x} \) values behave in general lets us deduce how our \( \bar{x} \) value is likely to behave in terms of being close to \( \mu \). More details on this follow in Section 4.4.

Stat 226: Introduction to Business Statistics I, Spring 2009
Professor: Dr. Petrutza Caragea, Section A, Tuesdays and Thursdays 9:30-10:50 a.m.
Introduction

WHAT IS STATISTICS?
Statistics is the science of collecting, describing, and interpreting data, allowing for data-based decision making.

"I like to think of statistics as the science of learning from data." (Jon Kettenring, ASA President, 1997)

In business and industry, statistics can be used to quantify unknowns in order to optimize resources, e.g.
- predict the demand for products and services
- check the quality of items manufactured in a facility
- manage investment portfolios
- forecast how much risk activities entail and calculate fair and competitive insurance rates

DESCRIPTIVE VS. INFERENTIAL
We distinguish between descriptive and inferential statistics.

Descriptive statistics is the collection, presentation, and description of data in the form of graphs, tables, and numerical summaries such as averages, variances, etc.
Goals:
- look for patterns
- summarize and present data
- quick information
- compare several groups, i.e. one can easily look for differences and similarities

Compared to descriptive statistics, inferential statistics deals with the interpretation of data as well as drawing conclusions and making generalizations based on data for a larger group of subjects.
Goals:
- making data-based decisions
- generalizing information obtained from descriptive analysis to a larger group of individuals

Example: Before movies are released, they are previewed by a selected audience. Assume 200 people are asked to provide an overall rating for a movie, yielding the following responses:
- 24% very satisfied
- 26% satisfied
- 33% in between
- 12% dissatisfied
- 5% very dissatisfied

24% of the 200 previewers were very satisfied with the movie: this
is a descriptive statement based on a sample of 200 previewers.
=> 24% of all people who will see the movie will be very satisfied with the movie: this is an inferential statement for the entire population of individuals.

POPULATION VS. SAMPLE
Population: The population in a study is the entire group of individuals or subjects about which we want to gain information.
Examples:
- all ISU students currently enrolled
- all Audi A6 vehicles manufactured in a year
- all customers banking with Wells Fargo

Sample: A sample is a subgroup or part of a population from which we obtain information in order to draw conclusions about the entire population.
Examples:
- every 5th ISU student currently enrolled
- all Audi A6 vehicles manufactured on a single day
- 100 randomly chosen customers banking with Wells Fargo

We need to be careful: the terms population and sample are relative. Consider all college students in the US; then all ISU students are no longer the population of interest, but rather a sample.

CLEARLY FORMULATE WHAT THE POPULATION OF INTEREST IS!

When using numerical summaries to describe samples or populations, we need to distinguish between a so-called statistic and a parameter:
- any numerical summary describing a sample is called a statistic
- any numerical summary describing a population is called a parameter
Example (movie preview):
- 24% of the 200 previewers: 24% is a statistic
- 24% of all people going to see the movie: 24% is a parameter

It is important to distinguish between a population parameter and a sample statistic. A parameter is a numerical summary of a population. Populations typically consist of too many individuals for all of them ever to be observed. For example, it would be impossible to know the average summer earnings of all
university students. This would require us to identify, find, and question thousands of students. Therefore, we will hardly ever know the true parameter value of a population. It is, however, feasible to select a sample of 100 students using proper randomization, and then the average earnings of these 100 students could be computed. Any numerical measure computed from a subset of the population, typically a sample, is a statistic and can be observed.

PARAMETER VS. STATISTIC
A parameter is a numerical summary for the entire population. It typically remains unknown, as we cannot observe the entire population. We will use the information based on the data, such as a sample mean, to get an idea of what the value of the unknown population parameter is: this process is inferential.

Statistics are numerical summaries (e.g. an average) that are obtained from real data. We can actually observe a statistic: statistics are descriptive.

INDIVIDUALS AND VARIABLES
Some more definitions:

Individuals: Individuals are the subjects/objects of the population of interest. They can be people, but also business firms, common stocks, or any other object that we want to study. Examples?

A variable: A variable is any characteristic of an individual that we are interested in. A variable typically will take on different values for different individuals. Examples?

KINDS OF VARIABLES
Categorical variables: Individuals can be placed into one of several categories. We distinguish nominal and ordinal variables.
- nominal: no order possible
  - gender
  - religion
  - race
  - colors
- ordinal: order is possible
  - grades
  - educational degrees

Quantitative variables: Quantitative variables take numerical values for which arithmetic operations such as adding and averaging make sense, e.g.
- height of
a person
- weight of a person
- temperature
- time it takes to run a mile
- currency exchange rates

DISTRIBUTION
The distribution of a variable describes WHAT values the variable takes and HOW often it takes these values.

Depending on the type of the data, categorical or quantitative, we need to use different graphical and numerical tools to analyze and summarize the data at hand. We will start by describing data graphically:
- bar graphs, pie charts, and Pareto charts can be used to graphically summarize categorical data
- a common graphical display for quantitative data is a histogram
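To make the idea of a distribution concrete, here is a minimal Python sketch that tabulates which values a categorical variable takes and how often. The ten ratings are made-up illustrative data, not values from these notes, and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical ratings from 10 previewers (illustrative data only)
ratings = [
    "satisfied", "very satisfied", "in between", "satisfied", "dissatisfied",
    "in between", "satisfied", "very satisfied", "in between", "in between",
]

counts = Counter(ratings)   # WHAT values occur and HOW often (counts)
n = len(ratings)
distribution = {value: count / n for value, count in counts.items()}  # relative frequencies

# Print a small frequency table, most common category first
for value, count in counts.most_common():
    print(f"{value:15s} {count:2d} {count / n:.0%}")
```

A table like this is the numerical counterpart of a bar graph: each category's count or relative frequency would become the height of one bar.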