# INTRO TO STATISTICS STAT 104

These 286 pages of class notes were uploaded by Giovani Ullrich PhD on Saturday, September 26, 2015. The notes belong to STAT 104 at Iowa State University, taught by Staff in Fall.

Date Created: 09/26/15

## Stat 104 Lecture 13: Probability Distributions

#### Discrete variables

- Numerical values associated with elements in a sample space.
- Only distinct (discrete) points on the number line.

#### The Deal, Continued

Bag o' chips (poker chips):

- some are red
- some are white
- some are blue

Draw a chip from the bag.

- Draw a red chip: win 3 bonus points.
- Draw a blue chip: win 1 bonus point.
- Draw a white chip: lose 1 bonus point.

#### Discrete Random Variable

X = number of bonus points

| x | -1 | 1 | 3 |
|---|---|---|---|
| P(x) | 0.60 | 0.30 | 0.10 |

[Figure: probability histogram of X; vertical axis Probability, horizontal axis x from -1 to 4.]

#### Discrete RV Properties

- Property 1: $0 \le P(x) \le 1$ for every value $x$.
- Property 2: $\sum P(x) = 1$.

#### Mean of a Discrete RV

The center of the distribution of values, found as a weighted average of the values:

$$\mu = \sum x \, P(x)$$

X = number of bonus points

| x | -1 | 1 | 3 | Total |
|---|---|---|---|---|
| P(x) | 0.60 | 0.30 | 0.10 | 1 |
| xP(x) | -0.60 | 0.30 | 0.30 | 0 |

#### Variance of a Discrete RV

The spread of the distribution of values, found as a weighted average of the squared deviations from the mean:

$$\sigma^2 = \sum (x - \mu)^2 \, P(x)$$

[Table: variance calculation for X = number of bonus points; entries not recoverable from the source.]

#### Rake It In

If all 1,440,000 tickets are sold and if all prizes are claimed, the Iowa Lottery will pay out 824,400 dollars.

Mean payout: $\mu = 824{,}400 / 1{,}440{,}000 = 0.5725$

This means the Iowa Lottery pays out, on average, under 60 cents for every 1-dollar ticket sold.

## Chapter 10: Inferences Involving Two Populations (Inference on Two Samples)

### Independent versus Dependent Sampling

In order to perform inference on the difference of two population means, we must first determine whether the samples are independent or dependent.

- Independent: the individuals selected for one sample do not dictate which individuals are to be in a second sample.
- Dependent: the individuals selected to be in one sample are used to determine the individuals to be in the second sample. The same set of sources is used to obtain the data representing both populations. Dependent samples are often referred to as matched-pairs samples.

Example: Determine whether the following are independent or dependent samples.
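1. A sociologist wishes to compare the annual salaries of married couples. She obtains a random sample of 50 married couples in which both spouses work and determines each spouse's annual salary.
2. A study examined the contribution of nongenetic factors to structural brain abnormalities in schizophrenia. The researchers examined the brains of 29 patients diagnosed with schizophrenia and compared them with a group of healthy patients. The whole-brain volumes of the two groups were compared.

### Inference about Two Means: Independent Samples

Two independent samples, population standard deviations unknown. We still need $s$ and the t-distribution.

#### Sampling Distribution of the Difference of Two Means with Population Standard Deviations Unknown (Independent Samples)

We have:

- a simple random sample of size $n_1$ from a population with unknown mean $\mu_1$ and unknown standard deviation $\sigma_1$;
- a simple random sample of size $n_2$ from a population with unknown mean $\mu_2$ and unknown standard deviation $\sigma_2$.

If the populations are normally distributed or the sample sizes are large, then

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

approximately follows a t-distribution with the smaller of $n_1 - 1$ or $n_2 - 1$ degrees of freedom.

#### Constructing a $(1 - \alpha) \cdot 100\%$ Confidence Interval about the Difference of Two Means

Suppose a simple random sample of size $n_1$ is taken from a population with unknown mean $\mu_1$ and unknown standard deviation $\sigma_1$. In addition, a simple random sample of size $n_2$ is taken from a population with unknown mean $\mu_2$ and unknown standard deviation $\sigma_2$. If the two populations are normally distributed or the sample sizes are sufficiently large, then a $(1 - \alpha) \cdot 100\%$ confidence interval about $\mu_1 - \mu_2$ is given by

$$\text{Lower Limit: } (\bar{x}_1 - \bar{x}_2) - t_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

$$\text{Upper Limit: } (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

where df = the smaller of $n_1 - 1$ and $n_2 - 1$.

Example: For the following data, construct a 95% confidence interval.

[Table: summary data for Population 1 and Population 2; values not recoverable from the source.]

#### Hypothesis Testing

Hypothesis testing is done as before. However, now the null and alternative hypotheses will take one of the following forms:

- $H_0: \mu_1 - \mu_2 = 0$ vs $H_1: \mu_1 - \mu_2 \ne 0$
- $H_0: \mu_1 - \mu_2 = 0$ vs $H_1: \mu_1 - \mu_2 > 0$
- $H_0: \mu_1 - \mu_2 = 0$ vs $H_1: \mu_1 - \mu_2 < 0$

NOTE that $\mu_1 - \mu_2 < 0$ is equivalent to $\mu_1 < \mu_2$.

Example: Using the same populations from before, test the claim that $\mu_1 > \mu_2$ at the specified $\alpha$ level of significance for the following data.

[Table: summary data for Population 1 and Population 2; values not recoverable from the source.]

### Inference about Two Means: Dependent Samples (Matched Pairs)

Two dependent samples. Paired data usually result from before-and-after studies or from studies of related individuals (identical twins, husband and wife, etc.).

Paired difference: $d = x_1 - x_2$, where $x_1$ is the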
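first observation and $x_2$ is the second observation, taken from the same individual or from the second paired individual. With this type of data:

Sample mean of the differences:

$$\bar{d} = \frac{\sum d}{n}$$

Sample standard deviation of the differences: $s_d$

Now we need the sampling distribution for $\bar{d}$:

$$E(\bar{d}) = \mu_d \quad \text{and} \quad \operatorname{Var}(\bar{d}) = \frac{\sigma_d^2}{n}$$

To perform any sort of analysis on $\bar{d}$, we need to make sure that $\bar{d}$ comes from a normal distribution. When will $\bar{d}$ be normal?

1. When the original populations are normal, OR
2. When we obtain a sample of size greater than or equal to 30, by the CLT.

#### Confidence Interval for Matched-Pairs Data

A $(1 - \alpha) \cdot 100\%$ confidence interval for $\mu_d$ is given by

$$\text{Lower Limit: } \bar{d} - t_{\alpha/2} \frac{s_d}{\sqrt{n}} \qquad \text{Upper Limit: } \bar{d} + t_{\alpha/2} \frac{s_d}{\sqrt{n}}$$

with $n - 1$ degrees of freedom.

Example: Construct a 99% confidence interval based on the following normally distributed data.

[Data: seven paired observations $(x_1, x_2)$; the partially legible values are tabulated in the hypothesis-testing example that follows.]

#### Hypothesis Testing

Hypothesis testing about the mean difference is done as before. However, the null and alternative hypotheses will take one of the following forms:

- $H_0: \mu_d = 0$ vs $H_1: \mu_d \ne 0$
- $H_0: \mu_d = 0$ vs $H_1: \mu_d > 0$
- $H_0: \mu_d = 0$ vs $H_1: \mu_d < 0$

The test statistic we will use follows a t-distribution with $n - 1$ degrees of freedom:

$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$

Example: Test at the 0.05 level of significance.

| Pair | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| $x_1$ | 76 | 76 | 74 | 57 | 83 | ? | 56 |
| $x_2$ | ? | ? | 107 | 94 | ? | ? | 85 |

(Entries marked ? are illegible in the source, as is the worked solution.) With $n = 7$ pairs, df = $7 - 1 = 6$ and the critical value is $-t_{0.05} = -1.943$.

### Inference about Two Population Proportions

#### Sampling Distribution of the Difference between Two Proportions

Suppose we have:

- a simple random sample of size $n_1$ from a population in which $x_1$ of the individuals have a specified characteristic;
- a simple random sample of size $n_2$ from a second population in which $x_2$ of the individuals have a specified characteristic.

We are interested in the statistic $\hat{p}_1 - \hat{p}_2$, where $\hat{p}_1 = x_1 / n_1$ and $\hat{p}_2 = x_2 / n_2$. Its sampling distribution is approximately normal with mean $p_1 - p_2$ and standard deviation

$$\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$$

provided that $n_1 / N_1 \le 0.05$ and $n_2 / N_2 \le 0.05$, and that $n_1 p_1 > 5$, $n_1 q_1 > 5$, $n_2 p_2 > 5$, and $n_2 q_2 > 5$.

Now that we know the approximate sampling distribution of $\hat{p}_1 - \hat{p}_2$, we can introduce a procedure that can be used to test claims regarding two population proportions. The test statistic would be

$$z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}}$$

However, when we test a hypothesis, the null hypothesis is assumed true: $H_0: p_1 = p_2$. This null hypothesis assumes that the value of $p_1$ equals the value of $p_2$. Since the null hypothesis is assumed to be true, we are assuming $p_1 = p_2 = p$, where $p$ is the common population proportion. Substituting the value of $p$ into the equation for the test statistic, we obtain a new test statistic. Because $p$ is unknown, it is estimated by $\hat{p}$, named the pooled estimate of $p$, defined as follows.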
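Pooled estimate of $p$:

$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$

Hence the test statistic we will use in hypothesis testing is

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,\hat{q}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}, \qquad \hat{q} = 1 - \hat{p}$$

A $(1 - \alpha) \cdot 100\%$ confidence interval for $p_1 - p_2$ is given by

$$\text{Lower Limit: } (\hat{p}_1 - \hat{p}_2) - z_{\alpha/2} \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$$

$$\text{Upper Limit: } (\hat{p}_1 - \hat{p}_2) + z_{\alpha/2} \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$$

Notice that we do not pool the sample proportions here as we did in hypothesis testing.

Example: On April 12, 1955, Jonas Salk released the results of clinical trials for his vaccine to prevent polio. In these clinical trials, 400,000 children were randomly divided into two groups. The subjects in Group 1 (the experimental group) were given the vaccine, while the subjects in Group 2 (the control group) were given a placebo. Of the 200,000 children in the experimental group, 33 developed polio. Of the 200,000 children in the control group, 115 developed polio. Test the claim that the percentage of subjects in the experimental group who contracted polio is less than the percentage of subjects in the control group who contracted polio at the $\alpha = 0.01$ level of significance.

Check the conditions: each sample is no more than 5% of its population ($200{,}000 / N_1 \le 0.05$ and $200{,}000 / N_2 \le 0.05$), and

$$\hat{p}_1 = \frac{33}{200{,}000} \approx 0.00017, \qquad \hat{p}_2 = \frac{115}{200{,}000} \approx 0.00058$$

$$n_1 \hat{p}_1 = 33 > 5, \quad n_2 \hat{p}_2 = 115 > 5, \quad n_1 \hat{q}_1 = 199{,}967 > 5, \quad n_2 \hat{q}_2 = 199{,}885 > 5$$

Hypotheses: $H_0: p_1 - p_2 = 0$ vs $H_1: p_1 - p_2 < 0$.

$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{33 + 115}{400{,}000} \approx 0.0004$$

$$z = \frac{0.00017 - 0.00058}{\sqrt{(0.0004)(0.9996)\left(\dfrac{1}{200{,}000} + \dfrac{1}{200{,}000}\right)}} \approx -6.48$$

[The remainder of the worked solution is illegible in the source.]

End of Notes

## Chapter 3: Numerically Summarizing Data

### Section 3.1 Measures of Central Tendency

Measures of central tendency:

- Mean
- Median
- Mode
- Midrange

Measures of central tendency are numeric values that locate, in some sense, the middle of a data set. We have all heard the term "average." Most generally, when this term is used it refers to the mean, but one must be careful because it can refer to the mean, the median, or the mode. Each measure gives very different information.

Recall: A parameter is a descriptive measure of a population. A statistic is a descriptive measure of a sample.

The arithmetic mean of a variable is computed by determining the sum of all the values of the variable in the data set, divided by the number of observations. The population mean $\mu$ is computed using all the individuals in a population. The population mean is a parameter. The sample mean $\bar{x}$ is computed using the sample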
data. The sample mean is a statistic.

#### 1. Mean

If $x_1, x_2, \ldots, x_N$ are the $N$ observations of a variable from a population, then the population mean $\mu$ is

$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum x_i}{N}$$

If $x_1, x_2, \ldots, x_n$ are the $n$ observations of a variable from a sample, then the sample mean $\bar{x}$ is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$$

#### 2. Median

The median of a variable is the value that lies in the middle of the data when arranged in ascending order. That is, half the data are below the median and half the data are above the median. We use $M$ to represent the median.

Steps in computing the median of a data set:

1. Arrange the data in ascending order.
2. Determine the number of observations, $n$.
3. Determine the observation in the middle of the data set.
   - If the number of observations is odd, then the median is the data value that is exactly in the middle of the data set. That is, the median is the observation that lies in the $\frac{n+1}{2}$ position.
   - If the number of observations is even, then the median is the mean of the two middle observations in the data set. That is, the median is the mean of the data values that lie in the $\frac{n}{2}$ and $\frac{n}{2} + 1$ positions.

#### 3. Mode

The mode of a variable is the most frequent observation of the variable that occurs in the data set. If a data set has two values that occur with the highest frequency, we say the data are bimodal. If a data set has three or more data values that occur with the highest frequency, the data set is multimodal.

#### 4. Midrange

The midrange is the average of the smallest and largest data values:

$$\text{midrange} = \frac{\text{smallest} + \text{largest}}{2}$$

Example: You are given the following starting salaries for five graduates from the business college at Iowa State University:

35,000; 37,000; 35,000; 33,000; 210,000

Find the mean, median, mode, and midrange.

- Mean:
- Median:
- Mode:
- Midrange:

What would be the most appropriate measure of central tendency to report to give future graduates a good indication of what starting salary to expect? Why?

Notice the mean is the most sensitive to extreme values, while the median is not. We say the median is resistant to extreme values. Let us take a look
at how the mean compares to the median and mode in the various distributions.

[Figure: distributions that are skewed right, skewed left, and symmetric.]

When do we use the mean, median, and mode?

- Mean: when the data are quantitative and the frequency distribution is roughly symmetric.
- Median: when the data are quantitative and the frequency distribution is skewed left or skewed right.
- Mode: when the most frequent observation is the desired measure of central tendency, or the data are qualitative.

### Section 3.2 Measures of Dispersion

Measures of dispersion (spread):

1. Range
2. Interquartile Range (Sec. 3.4)
3. Variance
4. Standard Deviation

In completely describing a distribution, we need more than just the center of the distribution. Measures of dispersion give an indication of how much variability exists in a data set.

#### 1. Range

The range $R$ of a variable is the difference between the largest data value and the smallest data value. That is,

Range = R = Largest Data Value − Smallest Data Value

#### 2. Interquartile Range

The interquartile range (IQR) is the difference between the third and first quartiles:

$$\text{IQR} = Q_3 - Q_1$$

#### 3. Variance

The variance is based upon the difference between each observation and the mean. It is calculated as a mean of the squared deviations. The divisor used in the calculation of this mean depends on whether we are calculating the population variance or the sample variance.

Deviation about the mean: $x_i - \mu$ or $x_i - \bar{x}$

What would be the value if I were to add all the deviations? Why?

The population variance of a variable is the sum of the squared deviations about the population mean divided by the number of observations in the population, $N$. That is, it is the mean of the squared deviations about the population mean. The population variance is symbolically represented by $\sigma^2$:

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N}$$

where $x_1, x_2, \ldots, x_N$ are the $N$ observations in the population and $\mu$ is the population mean.

Notice that the population variance $\sigma^2$ is a parameter.

An algebraically equivalent formula for computing the population variance is

$$\sigma^2 = \frac{\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{N}}{N}$$

where $\sum x_i^2$ means to square each observation and then sum these squared values, and $\left(\sum x_i\right)^2$ means to add all the observations and then square the sum.

The sample variance $s^2$ is computed by determining the sum of the squared deviations about the sample mean and dividing this result by $n - 1$. The formula for the sample variance from a sample of size $n$ is

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$

where $x_1, x_2, \ldots, x_n$ are the $n$ observations in the sample and $\bar{x}$ is the sample mean.

Notice that the sample variance $s^2$ describes the variance of a sample and is therefore a statistic.

An algebraically equivalent formula for computing the sample variance is

$$s^2 = \frac{\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{n}}{n - 1}$$

where $\sum x_i^2$ means to square each observation and then sum these squared values, and $\left(\sum x_i\right)^2$ means to add all the observations and then square the sum.

Notice that the sample variance is obtained by dividing by $n - 1$. If we divided by $n$, as we might expect, the sample variance would consistently underestimate the population variance. Whenever a statistic consistently overestimates or underestimates a parameter, it is called biased. To obtain an unbiased estimate of the population variance, divide the sum of the squared deviations about the mean by $n - 1$. Hence we have $n - 1$ degrees of freedom in the computation of $s^2$, because an unknown parameter $\mu$ is estimated with $\bar{x}$.

#### 4. Standard Deviation

The population standard deviation $\sigma$ is obtained by taking the square root of the population variance. That is,

$$\sigma = \sqrt{\sigma^2}$$

Notice that $\sigma$ is a parameter, since it describes a measure of the population.

The sample standard deviation $s$ is obtained by taking the square root of the sample variance. That is,

$$s = \sqrt{s^2}$$

Notice that $s$ is a statistic, since it describes a measure of the sample.

If the standard deviation is just the square root of the variance, what information is the standard deviation giving us that the variance is not? Can you think of a reason why we use the standard deviation over the variance?

#### Understanding Standard Deviation
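
The population and sample formulas differ only in the divisor. As a minimal Python sketch, using a small hypothetical data set (not taken from the notes) and illustrative function names, the code below computes both variances from their definitions, checks the algebraically equivalent shortcut formula, and shows that the deviations about the mean add up to zero:

```python
import math


def population_variance(values):
    """Mean of the squared deviations about the mean (divide by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations about the sample mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Equivalent form: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(values)
    return (sum(x * x for x in values) - sum(values) ** 2 / n) / (n - 1)


# Hypothetical data chosen so the arithmetic is easy to follow (mean = 5).
data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)

deviation_total = sum(x - mean for x in data)  # deviations cancel out
pop_var = population_variance(data)            # 32 / 8
samp_var = sample_variance(data)               # 32 / 7
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)

print("sum of deviations:", deviation_total)
print("population variance and sd:", pop_var, pop_sd)
print("sample variance and sd:", samp_var, samp_sd)
print("shortcut matches:", math.isclose(samp_var, sample_variance_shortcut(data)))
```

Dividing by n − 1 instead of n makes the sample variance slightly larger than the population formula applied to the same numbers, which is the correction for bias described above.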
Recall the data set from Chapter 2 on the three-year rate of return on a mutual fund. Suppose we are comparing two mutual funds that have the same mean, but one has a standard deviation of 8 and the other a standard deviation of 20. Which would you invest your money in? Why?

Example: Nine randomly selected students from a section of STAT 104 measured their pulse. The following data were obtained:

76, 60, 60, 81, 72, 80, 80, 68, 73

Find the range, variance, and standard deviation.

- Range:
- Variance:
- Standard Deviation:

#### The Empirical Rule

If a distribution is roughly bell shaped, then:

- Approximately 68% of the data will lie within one standard deviation of the mean.
- Approximately 95% of the data will lie within two standard deviations of the mean.
- Approximately 99.7% of the data will lie within three standard deviations of the mean.

[Figure: Visualizing the Empirical Rule.]

### Section 3.4 Measures of Position

Measures of position are used to describe the relative position of a certain data value within the entire set of data. Can you think of an example where you have been given a measure of position?

The z-score represents the number of standard deviations that a data value is from the mean. It is obtained by subtracting the mean from the data value and dividing this result by the standard deviation. There is both a population z-score and a sample z-score; their formulas are as follows.

Population z-score:

$$z = \frac{x - \mu}{\sigma}$$

Sample z-score:

$$z = \frac{x - \bar{x}}{s}$$

The z-score is unitless; it has a mean of 0 and a standard deviation of 1. Z-scores provide a way of comparing "apples to oranges" by converting variables with different centers and/or spreads to variables with the same center and spread.

Example: The average 20-29 year old man is 69.9 inches tall, with a standard deviation of 3.0 inches. The average 20-29 year old woman is 64.6 inches tall, with a standard deviation of 2.8 inches. Who is relatively taller: a 75-inch man or a 70-inch woman?

Percentiles are the values of the variable that divide a set of ranked data into 100 equal subsets. Each
set of data has 99 percentiles. The kth percentile, denoted $P_k$, is a value such that at most $k$ percent of the data are smaller in value than $P_k$ and at most $(100 - k)$ percent of the data are larger in value.

#### Determining the kth Percentile, $P_k$

1. Arrange the data in ascending order.
2. Compute an index $i$ using the formula
   $$i = \frac{nk}{100}$$
   where $k$ is the percentile of the data value and $n$ is the number of individuals in the data set.
3. Locate the percentile:
   - If $i$ is not an integer, round up to the next highest integer. Locate the $i$th value of the data set written in ascending order. This number represents the kth percentile.
   - If $i$ is an integer, the kth percentile is the mean of the $i$th and $(i+1)$st data values.

Often we are interested in knowing the percentile to which a specific data value corresponds.

#### Finding the Percentile that Corresponds to a Data Value

- Arrange the data in ascending order.
- Use the following formula to determine the percentile of the score $x$:
  $$\text{Percentile of } x = \frac{\text{Number of data values less than } x}{n} \cdot 100$$
- Round this number to the nearest integer.

Quartiles are the most common percentiles. They divide the data into four equal parts.

- $Q_1$ represents the 1st quartile. It is also the 25th percentile.
- $Q_2$ represents the 2nd quartile. It is also the 50th percentile. Note: this is also the median.
- $Q_3$ represents the 3rd quartile. It is also the 75th percentile.

Outliers are extreme observations in a data set. Outliers distort both the mean and the standard deviation, since neither is a resistant measure. Because these measures often form the basis for most statistical inference, any conclusions drawn from a data set that contains outliers can be flawed. We check for outliers using the interquartile range.

#### Checking for Outliers by Using Quartiles

1. Determine the first and third quartiles of the data.
2. Compute the interquartile range. The interquartile range, or IQR, is the difference between the third and first quartiles: $\text{IQR} = Q_3 - Q_1$.
3. Determine the fences. Fences serve as cutoff points for determining outliers.
   - Lower Fence $= Q_1 - 1.5(\text{IQR})$
   - Upper Fence $= Q_3 + 1.5(\text{IQR})$
4. If a data value is less than the lower fence or greater than the upper fence, then it is considered an outlier.

Example: The following data represent the number of inches of rain in Chicago during the month of April for 20 randomly selected years.

[Table: April rainfall data; values not included in the source.]

- Find the quartiles.
- Find the 67th percentile.
- What percentile is represented by 6.28 inches of rain?
- Are there any outliers in this data set?

### Section 3.5 The Five-Number Summary; Boxplots

Exploratory Data Analysis: This is the area of statistics that looks at data in order to spot any interesting results that might be concluded from the data. The idea here is to draw graphs of data and obtain measures of central tendency and spread in order to form some conjectures regarding the data.

Rather than numerically describing a distribution via the mean and standard deviation, exploratory data analysis summarizes a distribution by using measures that are resistant to extreme observations. Such a measure would be the five-number summary.

Five-number summary:

- Minimum
- $Q_1$
- $M$
- $Q_3$
- Maximum

We use the five-number summary to construct a boxplot.

#### Drawing a Boxplot

1. Determine the upper and lower fences.
2. Draw vertical lines at each quartile. Enclose these vertical lines in a box.
3. Label the lower and upper fences.
4. Draw a line from the first quartile to the smallest data value that is larger than the lower fence. Draw a line from the third quartile to the largest data value smaller than the upper fence.
5. Any data values less than the lower fence or greater than the upper fence are outliers and are marked with an asterisk.

#### Distribution Shape Based upon Boxplot

- If the median is near the center of the box and each of the horizontal lines is of approximately equal length, then the distribution is roughly symmetric.
- If the median is to the left of the center of the box, or the right line is substantially longer than the left line, the distribution is skewed right.
- If the median is to the right of the center of the box, or the left line is
substantially longer than the right line, the distribution is skewed left.

Example: The following data represent the number of grams of fat in breakfast meals offered at McDonald's:

12, 23, 28, 2, 31, 37, 34, 15, 23, 38, 31, 16, 11, 8, 8, 17, 20

- Find the five-number summary.
- Construct a boxplot.
- Comment on the shape of the distribution.

## Chapter 1: Data Collection

### Section 1.1 Introduction to the Practice of Statistics

Statistics is the science of collecting, describing, displaying, and interpreting data.

1. Identify the research objective.
2. Collect the information necessary for step 1.
3. Organize and summarize the information obtained in step 2.
4. Draw conclusions by generalizing to the population.

Descriptive statistics consists of organizing and summarizing the information collected.

Inferential statistics uses methods that generalize results obtained from a sample to the population and measure their reliability.

We collect data to answer a specific question of interest:

- Does nitrogen improve corn yield?
- What seed is best?
- What is the relationship between rainfall and yield?
- Does this new drug cure the disease? Is it safe?
- What do voters think about a candidate or an issue?

Does nitrogen improve corn yield? What is the group we want to study? Are we interested in all corn, just one brand of corn, or only corn grown in Iowa?

The group that is to be studied is called the population, and each element of the population is called an individual.

We now decide that we are specifically interested in all corn types grown in Iowa. Is it feasible to collect data from every single corn field in the state of Iowa? We look for a reasonable subset of the population, called a sample.

What characteristic of the population do we want to study? A parameter is a descriptive measure of the population. A statistic is a descriptive measure of the sample.

- (a) Identify the research objective. Does nitrogen improve the corn yield?
- (b) Identify the population. All corn grown in Iowa.
- (c) What is the parameter of interest? Corn yield in bushels/acre when nitrogen is added to
the soil.
- (d) Identify the sample. A random sample of 100 farms in Iowa, for example.
- (e) List the descriptive statistics.
- (f) State the conclusions made in the study.

Once we have data collected from our sample, we can look at the statistics. Statistics are the numbers summarizing the data in a sample. We hope these statistics are good estimates of the parameters.

What do we need to measure to answer the question of interest? Variables are the characteristics of the individuals within the population. What would be the variable for the proposed question? Is this variable qualitative or quantitative?

Qualitative or categorical variables allow for classification of individuals based on some attribute or characteristic. Examples: eye color, class status, sex.

Quantitative variables provide numerical measures of individuals. Arithmetic operations such as addition can be performed on the values of quantitative variables and provide meaningful results.

- A discrete variable is a quantitative variable that has either a finite number of possible values or a countable number of possible values. Examples: age, number of siblings, height to nearest inch.
- A continuous variable is a quantitative variable that has an infinite number of possible values and/or is not countable. Examples: distance from campus, height.

Example: Gallup News Service conducted a survey of 1012 adults aged 18 years or older, August 29-September 5, 2000. The respondents were asked, "Has anyone in your household been the victim of a crime in the past 12 months?" Of the 1012 adults surveyed, 24% said they or someone in the household had experienced some type of crime during the preceding year. Gallup News Service concluded that 24% of all households had been victimized by crime during the past year.

- (a) Identify the research objective.
- (b) Identify the population.
- (c) What is the parameter of interest?
- (d) Identify the sample.
- (e) List the descriptive statistics.
- (f) State the conclusions made in the study.

### Section 1.2 Observational Studies; Simple Random Sampling

Data can
be obtained from four sources:

- A census
- Existing sources
- Survey sampling
- Designed experiments

A census is a list of all individuals in a population along with certain characteristics of each individual.

Existing sources: don't collect data that have already been collected.

Survey sampling is used in research where there is no attempt to influence the value of the variable of interest. Data collected from a survey sample lead to an observational study.

An observational study measures the characteristic of a population by studying individuals in a sample, but does not attempt to manipulate or influence the variables of interest.

A designed experiment applies a treatment to individuals (referred to as experimental units) and attempts to isolate the effects of the treatment on a response variable.

Example: Determine whether the following is an observational study or a designed experiment.

- A study to determine whether there is a relation between a person's body mass index (BMI) and the number of hours an individual exercises per week.
- Seventh-grade students are randomly divided into two groups. One group is taught math using traditional techniques, while the other is taught math using a reform method. After one year, each group is given an achievement test to compare its proficiency with that of the other group.

It is vital that we understand that observational studies do not allow a researcher to claim causation, only association. Observational studies do not control for lurking variables.

Lurking variables are variables that may affect the outcome. With the use of a designed experiment we may control for lurking variables.

So observational studies are very useful tools for determining whether there is a relation between two variables, but it requires a designed experiment to isolate the cause of the relation. Observational studies are also used for learning about characteristics of a population.

Note: Lurking variables are also referred to as extraneous variables.

#### Sampling

The goal in sampling is to
obtain individuals that will participate in a study so that accurate information about the population can be obtained.

How can the researcher obtain accurate information about the population through the sample while minimizing the costs in terms of money, time, personnel, etc.? The sample should be representative of the population.

The most basic sample survey design is simple random sampling, which is often abbreviated as random sampling.

A sample of size $n$ from a population of size $N$ is obtained through simple random sampling if every possible sample of size $n$ has an equally likely chance of occurring. The sample is then called a simple random sample.

Obtaining a random sample:

1. Assign a unique number from 1 to $N$ to each individual in the population.
2. Select $n$ random numbers from this list.

A frame is a list of all the individuals in the population.

We can obtain a random sample by sampling with or without replacement.

Sampling without replacement: Once an individual is selected to be in the sample, it cannot be selected again. For instance, if we are using a deck of cards as the population and I draw a card and set it aside before selecting the next card, this is sampling without replacement.

Sampling with replacement: Once an individual is selected to be in the sample, the appropriate measurements are taken and then the individual is placed back into the population before selecting the next individual. Here it is possible for an individual to be selected more than one time. For example, if we are using a deck of cards as the population and I draw a card, record its suit, and then place it back in the deck before the next card is selected, this is sampling with replacement.

Example: Suppose we have a population that contains 500 individuals and we wish to select 10 individuals for our sample. Use the random number table to select the 10 individuals.

### Section 1.3 Other Types of Sampling

Stratified Sample: A stratified sample is obtained by separating the population into nonoverlapping groups
called strata and then obtaining a simple random sample from each stratum. The individuals within each stratum should be homogeneous in some way.

Systematic Sampling: A systematic sample is obtained by selecting every kth individual from the population. The first individual selected is a random number between 1 and k.

Cluster Sample: A cluster sample is obtained by selecting all individuals within a randomly selected collection or group of individuals.

Convenience Sample: A convenience sample is a sample in which the individuals are easily obtained.

Example: Identify the type of sampling used.

- A radio station asks its listeners to call in their opinion regarding the use of American forces in peacekeeping missions.
- A farmer divides his orchard into 50 subsections, randomly selects 4 subsections, and samples all of the trees within the 4 subsections in order to approximate the yield of his orchard.
- A school official divides the student population into five classes: freshman, sophomore, junior, senior, graduate student. The official takes a random sample from each class and asks the members' opinion regarding student services.

### Section 1.4 Sources of Errors in Sampling

Nonsampling errors are errors that result from the survey process. They are due to the nonresponse of individuals selected to be in the survey, to inaccurate responses, to poorly worded questions, to bias in the selection of individuals to be given the survey, and so on.

Selection of the frame: it is often difficult to get a complete list of individuals in a population. Certain segments of the population are often underrepresented.

Nonresponse means that an individual selected for the sample does not respond to the survey.

Questionnaire design, done appropriately, is critical in minimizing the amount of nonsampling error.

An open question is one in which the respondent is free to choose his or her response. A closed question is one in which the respondent must choose from a list of predetermined responses.

The wording and ordering of questions is
important in not swaying the opinion of the individual being surveyed.

Bias is minimized by selecting a random sample.

Sampling error is the error that results from using sampling to estimate information regarding a population. This type of error occurs because a sample gives incomplete information about the population.

Example: The following surveys are flawed. Determine whether the sampling method or the survey itself is flawed. For flawed surveys, identify the cause of the error and suggest a remedy.

- A magazine is conducting a study on the effects of infidelity in a marriage. The editors randomly select 400 women whose husbands were unfaithful and ask, "Do you believe a marriage can survive when the husband destroys the trust that must exist between husband and wife?"
- A college vice president wants to conduct a study regarding the achievement of undergraduate students. He selects the first 50 students who enter the building on a given day and administers his survey.
- A polling organization is going to conduct a study to estimate the percentage of households that speak a foreign language as the primary language. It mails a questionnaire to 1023 randomly selected households throughout the United States and asks the head of household if a foreign language is the primary language spoken in the home. Of the 1023 households selected, 12 responded.

### Section 1.5 The Design of Experiments

A designed experiment is a controlled study in which one or more treatments are applied to experimental units. The experimenter then observes the effect of varying these treatments on a response variable. Control, manipulation, randomization, and replication are the key ingredients of a well-designed experiment.

The experimental unit is a person, object, or some other well-defined item upon which a treatment is applied.

The treatment is a condition applied to the experimental unit.

A response variable is a quantitative or qualitative variable that represents the variable of interest. Response variables are also referred
to as dependent variables. The predictor variables are the factors that affect the response variable. Predictor variables are also referred to as independent variables. Extraneous variables are variables that may affect the outcome of the experiment but are not controlled by the experimenter.

Handling extraneous variables:
- Hold constant
- Ignore and randomize
- Block and randomize

Experimental Designs: how do we assign treatments to experimental units?

A completely randomized design is an experimental design where the experimental units are randomly assigned to the treatments. This is the most popular type of design because of its simplicity. However, it is not always the best.

A design that requires each treatment to be applied within every block is called a randomized complete block design.

Matched-Pairs Design: This is a special type of block design where the experimental units are somehow related (such as twins) and there are only two treatments.

Example: A school psychologist wants to test the effectiveness of a new method for teaching reading. She selects five hundred first-grade students in District 203 and randomly divides them into two groups. Group 1 is taught by means of the new method, while Group 2 is taught via traditional methods. The same teacher is assigned to teach both groups. At the end of the year, an achievement test is administered and the results of the two groups are compared.

Steps in an experiment:
1. Identify the problem to be solved. This includes the population, the predictor variables, and the response variable. What is your claim about how the predictor variable affects the response variable of the population?
2. Identify the experimental units. What are they? How many are there?
3. Identify and determine the levels of the treatment.
4. Determine the experimental design.
5. Determine how additional predictor variables (extraneous variables) will be controlled.
6. Collect and process the data.
7. Test the claim.

Random Numbers

Row      Column Number
Number   01-05   06-10   11-15   16-20   21-25   26-30   31-35   36-40   41-45   46-50
01       89302   23212   74483   36590   5956    36544   68518   40805   09980   00167
02       61458   17630   96252   95649   73727   33912   72890   66218   52311   97141
03       11452   74197   81902   48443   90360   26430   73231   37740   20628   44690
04       27575   04429   31303   022     01698   19191   18948   78871   36030   23980
05       368213  59109   88976   46845   2832    47460   88944   08264   008113  54592
06       81902   93458   42161   26099   09419   89073   284     09160   61845   40906
07       59761   55212   333     68751   86737   79743   85262   31887   37879   17525
08       46827   25906   64708   0       78423   15910   86548   08763   47050   18513
09       2       66449   32353   83668   13874   86741   81312   54185   78324   00118
10       98144   96372   50277   15571   -----   31457   00377   55141
11       14228   17930   30118   00438   49666   65189   62869   31304   17117   7148
12       55366   51057   90065   11791   62426   02957   85518   28822   30588   32798
13       96101   30646   35526   90389   73634   79304   96633   6626    94683   16696
14       38152   55474   30153   26523   83647   31988   82182   98377   33802   80471
15       85007   18416   24661   35581   45868   15662   78906   36392   07617   50248
16       85544   15890   80011   18160   33468   84106   40603   01315   74664   20553
17       10446   20699   98370   17684   16932   810449  92654   02084   19935   59321
18       67237   45509   17638   65115   29757   80705   82686   48565   72612   61760
19       3026    89817   05403   82209   30573   47501   00135   33955   50250   72592
20       67411   58542   18678   46491   13219   84084   27783   34508   55158   78742

Descriptive Analysis and Presentation of Single Variable Data

Variable
- Qualitative (Categorical): describes or categorizes an element of a population
  - Nominal: names an element
  - Ordinal: incorporates an ordered position or ranking
- Quantitative (Numerical): quantifies an element of a population
  - Discrete: assumes a countable number of values
  - Continuous: assumes an uncountable number of values

Categorical (Qualitative)
- Display
  - Frequency distribution
  - Relative frequency distribution
  - Cumulative relative frequency distribution
  - Pie chart (circle graph)
  - Bar graph
- Summarize
  - None

A frequency distribution lists the number of occurrences for each category. This is a count of the number of individuals in each category.

Example: A survey was taken of a past section of STAT 104 to
ask students what their current class standing was (coded from 1 = Freshman through Other). The following data was collected: [data values illegible in the source]. We wish to construct a frequency distribution for these data.

The relative frequency is the proportion or percent of observations within a category and is found using the formula

$$\text{Relative Frequency} = \frac{\text{frequency}}{\text{sum of all frequencies}}$$

A relative frequency distribution lists the relative frequencies of each category of data.

Convert the previous frequency distribution to a relative frequency distribution.

Another Example: Data
- Who: Carnivores
- What:
  - Family, species (categorical)
  - Body mass (numerical)
  - Bite force (numerical)
  - Diet (categorical: herbivore, omnivore, etc.)

[Figure: chart of a categorical (qualitative) variable]

Quantitative (Numerical)
- Display
  - Dot plot
  - Stem and leaf
  - Histogram
  - Box plot
- Summary
  - Center
  - Spread
  - Position

Body Mass of Canidae (dogs), rounded to nearest kg, in ascending order:

1 3 3 3 4 4 4 5 5 5 5 5 6 6 6 7 8 9 9 10 10 11 12 13 22 23 25 36

[Figure: dot plot of body mass (kg), horizontal axis from 0 to 30 and beyond]

Stem-and-Leaf Display

Construction of a Stem-and-Leaf Plot:
1. The stem of the graph will consist of the digits to the left of the rightmost digit. The leaf of the graph will be the rightmost digit.
2. Write the stems in a vertical column in increasing order. Draw a vertical line to the right of the stems.
3. Write each leaf corresponding to the stems to the right of the vertical line. The leaves must be written in ascending order.

Regular Stem-and-Leaf: Body Mass (kg) of Canidae

0 | 1 3 3 3 4 4 4 5 5 5 5 5 6 6 6 7 8 9 9
1 | 0 0 1 2 3
2 | 2 3 5
3 | 6

Notice how the data is bunched. Splitting is more informative.

Split Stem-and-Leaf: Body Mass (kg) of Canidae

0 | 1 3 3 3 4 4 4
0 | 5 5 5 5 5 6 6 6 7 8 9 9
1 | 0 0 1 2 3
1 |
2 | 2 3
2 | 5
3 |
3 | 6

Examples:
- Weight of ISU football players [data values illegible in the source]
- GPA of previous students, sorted and rounded; the vertical line represents a decimal point [data values illegible in the source]
- Body Mass of Felidae (cats), rounded to nearest kg [data values illegible in the source]

For Discrete or Continuous Data (quantitative data)

Histogram
- A picture of the distribution of the data.
- Collects values into classes.
- Classes should be of equal width.
- Different class choices can yield different pictures.
- The height of each rectangle is the frequency or relative frequency of the class.

Histogram: 2 axes
- Horizontal (measurement): where values of the numerical variable of interest are located.
- Vertical (frequency): frequency of values within each class interval on the horizontal axis.

Note: Continuous data do not have any predetermined categories that can be used to construct a frequency distribution; therefore the categories must be created. Categories of data are created by using intervals of numbers called classes.

Constructing a Histogram
- Order data from smallest to largest (using a stem and leaf display).
- Determine classes:
  - equal width
  - more data => more classes

Body Mass of Canidae, rounded to nearest kg:

Class                       Freq
 0 ≤ Body Mass <  5          7
 5 ≤ Body Mass < 10         12
10 ≤ Body Mass < 15          5
15 ≤ Body Mass < 20          0
20 ≤ Body Mass < 25          2
25 ≤ Body Mass < 30          1
30 ≤ Body Mass < 35          0
35 ≤ Body Mass < 40          1

[Figure: histogram of body mass (kg) of Canidae, classes of width 5 from 0 to 40]

Distributions: Shape
- Symmetry (mirror image)
  - Unimodal, flat
- Skew (mode on one side)
  - Toward higher values (right)
  - Toward lower values (left)
- Other
  - Multiple peaks, outliers

Shapes of Distributions
- Bell-shaped: the highest frequency occurs near the middle, and frequencies tail off to the left and right in roughly the same pattern.
- Skewed: one tail is stretched out longer than the other tail.
  - Left skewed: most of the data is piled on the high numbers.
  - Right skewed: most of the data is piled on the low numbers.
- Uniform: the frequency at each value of the variable is equal.
- Unimodal: there is only one major peak.
- Bimodal: there are two major peaks.

[Figure: Symmetric and Unimodal. Histogram of Octane Rating; axes: Octane, Frequency]
[Figure: Skewed to Right. pH of Pork Loins; axes: pH, Frequency]
[Figure: Skewed to Left. Histogram of an index for young adult men; vertical axis: Frequency]
[Figure: Multiple Peaks. Size of Diamonds (carats); axes: Size (carats), Frequency]

Summarizing Numerical Data
- What is a typical value?
- Look for the center of the distribution.
- What do we mean by center?

Measures of Center (Central Tendency)
- Mean
- Median
- Mode
- Midrange

Measures of central tendency are numeric values that locate, in some sense, the middle of a data set.

Recall:
A parameter is a descriptive measure of a population (Greek letters).
A statistic is a descriptive measure of a sample (Roman characters).

1. Mean

The arithmetic mean of a variable is computed by determining the sum of all the values of the variable in the data set, divided by the number of observations. The population mean $\mu$ is computed using all the individuals in a population. The population mean is a parameter. The sample mean $\bar{x}$ is computed using the sample data. The sample mean is a statistic.

If $x_1, x_2, \ldots, x_N$ are the $N$ observations of a variable from a population, then the population mean $\mu$ is

$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum x_i}{N}$$

If $x_1, x_2, \ldots, x_n$ are the $n$ observations of a variable from a sample, then the sample mean is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$$

2. Median

The median of a variable is the value that lies in the middle of the data when arranged in ascending order. That is, half the data are below the median and half the data are above the median. We use $M$ to represent the median.

Steps in Computing the Median of a Data Set:
1. Arrange the data in ascending order.
2. Determine the number of observations, $n$.
3. Determine the observation in the middle of
the data set.
   (a) If the number of observations is odd, then the median is the data value that is exactly in the middle of the data set. That is, the median is the observation that lies in the $\frac{n+1}{2}$ position.
   (b) If the number of observations is even, then the median is the mean of the two middle observations in the data set. That is, the median is the mean of the data values that lie in the $\frac{n}{2}$ and $\frac{n}{2} + 1$ positions.

3. Mode

The mode of a variable is the most frequent observation of the variable that occurs in the data set. If a data set has two values that occur with the highest frequency, we say the data are bimodal. If a data set has three or more data values that occur with the highest frequency, the data set is multimodal.

4. Midrange

The midrange is the average of the smallest and largest data value:

$$\text{midrange} = \frac{\text{smallest} + \text{largest}}{2}$$

Example: You are given the following starting salaries for five graduates from the business college at Iowa State University:

35,000   37,000   35,000   33,000   210,000

Find the mean, median, mode, and midrange.

What would be the most appropriate measure of central tendency to report to give a good indication for future graduates about what starting salary to expect? Why?

Notice the mean is the most sensitive to extreme values, while the median is not. We say the median is resistant to extreme values. If 210,000 was 40,000 instead, then the mean would be 36,000.

What does each measure?
- The sample midrange is midway between the smallest and largest values.
  - Affected by outliers
- The sample median divides the distribution into a lower and an upper half.
  - Not affected by outliers
- The sample mean is the balance point of the distribution.
  - Affected by outliers

When do we use the mean, median, and mode?

Mean: When the data are quantitative and the frequency distribution is roughly symmetric.

Median: When the data are quantitative and the frequency distribution is skewed left or skewed right.

Mode: When the most frequent observation is the desired
measure of central tendency or the data are qualitative.

Measures of Dispersion (Spread)
- Range
- Variance
- Standard Deviation
- IQR (later)

In completely describing a distribution, we need more than just the center of the distribution. Measures of dispersion give an indication of how much variability there exists in a data set.

10 10 10 10 10   vs.   1 6 10 14 19

Both have the same mean and median, but do they have the same distribution?

1. Range

The range $R$ of a variable is the difference between the largest data value and the smallest data value. That is,

Range = R = Largest Data Value - Smallest Data Value

2. Variance

The variance is based on the deviations about the mean; it is calculated as a mean of the squared deviations. The divisor used in the calculation of this mean depends on whether we are calculating the population variance or the sample variance.

Deviation about the mean: $x_i - \mu$ or $x_i - \bar{x}$

What would be the value if I were to add all the deviations? Why?

The population variance of a variable is the sum of the squared deviations about the population mean divided by the number of observations in the population, $N$. That is, it is the mean of the squared deviations about the population mean. The population variance is symbolically represented by $\sigma^2$:

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$

where $x_1, x_2, \ldots, x_N$ are the $N$ observations in the population and $\mu$ is the population mean. Notice that the population variance $\sigma^2$ is a parameter.

An algebraically equivalent formula for computing the population variance is

$$\sigma^2 = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{N}}{N}$$

where $\sum x_i^2$ means to square each observation and then sum these squared values, and $\left(\sum x_i\right)^2$ means to add all the observations and then square the sum.

The sample variance $s^2$ is computed by determining the sum of the squared deviations about the sample mean and dividing this result by $n - 1$. The formula for the sample variance from a sample of size $n$ is

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

where $x_1, x_2, \ldots, x_n$ are the $n$ observations in the sample and $\bar{x}$ is the sample mean. Notice that the sample variance $s^2$ is a statistic.

An algebraically equivalent formula for computing the sample variance is

$$s^2 = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n - 1}$$

where $\sum x_i^2$ means to square each observation and then sum these
squared values, and $\left(\sum x_i\right)^2$ means to add all the observations and then square the sum.

Notice that the sample variance is obtained by dividing by $n - 1$. If we divided by $n$, the sample variance would consistently underestimate the population variance $\sigma^2$.

3. Standard Deviation

The population standard deviation $\sigma$ is obtained by taking the square root of the population variance. That is,

$$\sigma = \sqrt{\sigma^2}$$

Notice that $\sigma$ is a parameter, since it is describing a measure of the population.

The sample standard deviation $s$ is obtained by taking the square root of the sample variance. That is,

$$s = \sqrt{s^2}$$

Notice that $s$ is a statistic, since it is describing a measure of the sample.

If the standard deviation is just the square root of the variance, what information is the standard deviation giving us that the variance is not? Can you think of a reason why we use the standard deviation over the variance?

Understanding Standard Deviation

Suppose we are comparing two mutual funds that have the same mean but different standard deviations. Which would you invest your money in?

Example: STAT 104 scores. The following data were obtained: [data values illegible in the source; $n = 9$]. Find the range, variance, and standard deviation.

Totals: we have $\sum x = 650$ and $\sum x^2 = 47474$.

$$\text{Variance: } s^2 = \frac{\sum x^2 - \frac{\left(\sum x\right)^2}{n}}{n - 1} = \frac{47474 - \frac{650^2}{9}}{9 - 1} = \frac{529.556}{8} = 66.1944$$

$$\text{Standard deviation: } s = \sqrt{66.1944} = 8.136$$

Example: During the 2007 season, including pre- and postseason, the Super Bowl XLII champions, the New York Giants, had a winning record of 15-9. Their point totals for the 24 games are listed in ascending order below. Find the range, variance, and standard deviation.

Legible values: 10 12 13 13 13 16 20 24 38 [remaining values illegible in the source]

When to use which
- Standard deviation
  - More informative than range
  - Not resistant to outliers
  - Quantitative, bell-shaped, symmetric data
  - Use with Mean
- IQR
  - Less informative
  - Resistant to outliers
  - Quantitative, skewed data
  - Use with Median

Empirical Rule

If a distribution is roughly bell-shaped, then approximately 68% of the data lie within 1 standard deviation of the mean, approximately 95% lie within 2 standard deviations of the mean, and approximately 99.7% lie within 3 standard deviations of the mean.

[Figure: bell-shaped curve illustrating the Empirical Rule]

Measures of Position

The z-score of a data value measures how many standard deviations the value lies from the mean:

$$z = \frac{x - \text{mean}}{\text{standard deviation}}$$

Example: The average 20-29 year old man is 69.9 inches tall, with a standard deviation of 3.0 inches. The average 20-29 year old woman is 64.6 inches tall,
with a standard deviation of 2.8 inches. Who is relatively taller: a 75-inch man or a 79-inch woman?

Percentiles divide the data into 100 equal subsets. Each set of data has 99 percentiles. The $k$th percentile, denoted $P_k$, is a value such that at most $k$ percent of the data are smaller in value and at most $(100 - k)$ percent of the data are larger in value.

Determining the $k$th Percentile, $P_k$:
1. Arrange the data in ascending order.
2. Compute an index $i$ using the formula

$$i = \frac{nk}{100}$$

   where $k$ is the percentile of the data value and $n$ is the number of individuals in the data set.
3. (a) If $i$ is not an integer, round up to the next highest integer. Locate the $i$th value of the data set written in ascending order. This number represents the $k$th percentile.

      Example: $n = 50$, $k = 75$: $i = \frac{nk}{100} = \frac{50 \cdot 75}{100} = 37.5$, which rounds up to 38. The 75th percentile is at the 38th position.

   (b) If $i$ is an integer, the $k$th percentile is the mean of the $i$th and $(i+1)$st data values.

      Example: $n = 50$, $k = 30$: $i = \frac{nk}{100} = \frac{50 \cdot 30}{100} = 15$. The 30th percentile is the average of the 15th and 16th positions.

Often we are interested in knowing the percentile to which a specific data value corresponds.

Finding the Percentile that Corresponds to a Data Value:
1. Arrange the data in ascending order.
2. Use the following formula to determine the percentile of the score $x$:

$$\text{Percentile of } x = \frac{\text{number of data values less than } x}{n} \cdot 100$$

3. Round this number to the nearest integer.

Quartiles divide the data into four equal parts.
- $Q_1$ represents the 1st quartile. It is also the 25th percentile.
- $Q_2$ represents the 2nd quartile. It is also the 50th percentile. Note: This is also the median.
- $Q_3$ represents the 3rd quartile. It is also the 75th percentile.

Example: Rainfall (in inches) in April for 20 randomly selected years: [data values illegible in the source].

Find the 87th percentile.
What percentile is represented by 2.8 inches of rain?

Outliers

Extreme observations affect the mean and the standard deviation. Since neither is a resistant measure, conclusions drawn from a data set that contains outliers can be flawed. We check
for outliers using the interquartile range.

Checking for Outliers by Using Quartiles:
1. Determine the first and third quartiles of the data.
2. Compute the interquartile range. The interquartile range, or IQR, is the difference between the third and first quartiles:

$$\text{IQR} = Q_3 - Q_1$$

3. Compute the fences:

$$\text{Lower Fence} = Q_1 - 1.5(\text{IQR}) \qquad \text{Upper Fence} = Q_3 + 1.5(\text{IQR})$$

4. If a data value is less than the lower fence or greater than the upper fence, then it is considered an outlier.

Using the rainfall example data, find the quartiles and identify any outliers.

Five-Number Summary

The five-number summary is an easy-to-compute and effective way of describing a set of data:
- Minimum
- 1st Quartile, $Q_1$
- Median / 2nd Quartile, $Q_2 = M$
- 3rd Quartile, $Q_3$
- Maximum

We use the five-number summary to construct a boxplot.

Boxplots provide a visual description of the data set. They organize the five-number summary into a picture.

Drawing a Boxplot:
1. Determine the upper and lower fences.
2. Draw vertical lines at $Q_1$, $M$, and $Q_3$.
3. Label the lower and upper fences.
4. Draw a line from the first quartile to the smallest data value that is larger than the lower fence. Draw a line from the third quartile to the largest data value that is smaller than the upper fence.

How to Determine the Distribution Shape Based upon a Boxplot:
1. If the median is near the center of the box and each of the horizontal lines is of approximately equal length, then the distribution is roughly symmetric.
2. If the median is to the left of the center of the box, or the right line is substantially longer than the left line, the distribution is skewed right.
3. If the median is to the right of the center of the box, or the left line is substantially longer than the right line, the distribution is skewed left.

Example 1: The following data represent the number of grams of fat in breakfast meals offered at McDonald's: [data values illegible in the source].

Find the five-number summary. Construct a boxplot. Comment on the shape of the distribution.

Example 2: The number of CDs owned by all Stat 101 students was recorded. The following is a sample of 20 of those students:

23 43 46 55 61 64 72 72 73 79 80 85 86 88 97 97 108 231 237

Construct a boxplot for the given data.

Stat 104 Lecture 17

Normal Approximation of the Binomial

X is a Binomial random variable that counts the
number of successes in $n$ independent trials with success probability $p$.
- Mean: $\mu = np$
- Standard deviation: $\sigma = \sqrt{np(1-p)}$

For large $n$, it is difficult to calculate Binomial probabilities from the formula. For large $n$, the Binomial distribution is symmetric, mounded at $np$, and looks like a normal model.

Example: 38% of people in the US have O blood type. If 1000 people chosen at random donate blood, what is the chance that 360 or fewer will be O?

[Figure: normal density curve, horizontal axis from 350 to 450]

We want to calculate $P(X \le 360)$.
- Mean: $\mu = np = 1000(0.38) = 380$
- Standard deviation: $\sigma = \sqrt{np(1-p)} = \sqrt{1000(0.38)(1 - 0.38)} = \sqrt{235.6} = 15.35$
- Standardize:

$$z = \frac{x - \mu}{\sigma} = \frac{360 - 380}{15.35} = \frac{-20}{15.35} = -1.30$$

Standard Normal Model
- Standard normal table handed out in class.
- Table 3, page 662 in your text.
- http://davidmlane.com/hyperstat/z_table.html

Comments

The book suggests a continuity correction factor:
- Add 0.5 to $x$ for the probability of being less than or equal to $x$.
- Subtract 0.5 from $x$ for the probability of being greater than or equal to $x$.

Continuity Correction Factor

$$z = \frac{360.5 - 380}{15.35} = \frac{-19.5}{15.35} = -1.27$$

Stat 104 Lecture 23

Testing Hypotheses
- Hypothesis: a tentative theory.
- Statistical hypothesis: a tentative theory about the value of a population parameter.
- Use sample data, in the form of a sample statistic, to support or refute hypotheses.
- How likely is it to get the sample data if a hypothesis is true?

Example: The hog processing plant wants to see if the population mean hot carcass weight of hogs is 200 lbs, or is it something less than 200 lbs?
- Null hypothesis: $H_0: \mu = 200$
- Alternative hypothesis: $H_A: \mu < 200$
- $\mu$ is the population mean hot carcass weight.

Decisions & Conclusions
- Reject $H_0$: the population mean is a value given in the alternative hypothesis.
- Fail to reject $H_0$: the population mean could be the value given in the null hypothesis.

Errors

                      H0 is True          H0 is False
Fail to reject H0     Correct decision    Type II error
Reject H0             Type I error        Correct decision

- We would like to reduce the chance of making either error.
- One way to do this is to have a large sample of data.

Level of Significance
- Denoted by the Greek letter $\alpha$.
- The probability of making a Type I error.
- Typically $\alpha$ is chosen to be 5% or …

Test Statistic
- Summary of the sample data used to make a decision in a test of hypothesis.
- Depends on what is known about the population.

Testing Step by Step
- Step 1: Set-Up
- Step 2: Test Criteria
- Step 3: Sample Evidence
- Step 4: Probability Value
- Step 5: Results

Step 1: Set-Up
- $\mu$ is the population mean hot carcass weight of hogs.
- Null hypothesis: $H_0: \mu = 200$
- Alternative hypothesis: $H_A: \mu < 200$

Step 2: Test Criteria
- The distribution of the sample mean is approximately normal.
- $\sigma = 26.1$ lbs
- $z$ test statistic
- $\alpha = 0.05$

Step 3: Sample Evidence
- Sample mean: $\bar{x} = 187.5$
- Test statistic:

$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{187.5 - 200}{\sigma / \sqrt{n}} = \frac{-12.5}{\sigma / \sqrt{n}} = -3.39$$

Step 4: Probability Value
- Alternative hypothesis: $H_A: \mu < 200$
- The P-value is the probability of getting a value of $z$ smaller than $-3.39$.
- Table Z: 0.0003

Step 5: Results
- Reject the null hypothesis because the P-value is smaller than 0.05.
- The population mean hot carcass weight of hogs is less than 200 lbs.

Chapter 10 Inference on Two Samples

Section 10.1 Inference about Two Means: Dependent Samples

Independent versus Dependent Sampling

In order to perform inference on the difference of two population means, we must first determine whether the data come from an independent or dependent sample.

A sampling method is independent when the individuals selected for one sample do not dictate which individuals are to be in a second sample.

A sampling method is dependent when the individuals selected to be in one sample are used to determine the individuals to be in the second sample. Dependent samples are often referred to as matched-pairs samples.

Example: Determine whether the following are independent or dependent samples.

1. A sociologist wishes to compare the annual salaries of married
couples. She obtains a random sample of 50 married couples in which both spouses work and determines each spouse's annual salary.

2. A study was conducted by researchers to determine the contribution of genetic and nongenetic factors to structural brain abnormalities in schizophrenia. The researchers examined the brains of 29 twin patients diagnosed with schizophrenia and compared them with 29 healthy twins. The whole brain volumes of the two groups were compared.

Hypothesis Tests

When analyzing matched-pairs data, we compute the difference in each matched pair and then perform inference on the differenced data in the same manner we learned earlier.

Paired Difference: $d = x_1 - x_2$

If we are going to make inferences on this new data, we need statistics dealing with this type of data.

Sample Mean of the Differences: $\bar{d}$

$\bar{d}$ is a new statistic that has a sampling distribution with

$$\mu_{\bar{d}} = \mu_d \qquad \text{and} \qquad \mathrm{Var}(\bar{d}) = \frac{\sigma_d^2}{n}$$

To perform any sort of analysis on $d$, we need to make sure that $\bar{d}$ comes from a normal distribution. When will $\bar{d}$ be normal?
1. When the original populations are normal.
2. When we obtain a sample of size greater than or equal to 30.

We can use the classical or the p-value approach when testing a hypothesis about a mean difference. The new parameter of interest will be $\mu_d$. So the null and alternative hypotheses will take one of the following forms:

$H_0: \mu_d = 0$ vs. $H_1: \mu_d \ne 0$
$H_0: \mu_d = 0$ vs. $H_1: \mu_d > 0$
$H_0: \mu_d = 0$ vs. $H_1: \mu_d < 0$

The test statistic we will use,

$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$

will follow a t-distribution with $n - 1$ degrees of freedom.

Example: Assume the following data come from a normal population, and test the claim that the difference is less than zero at the 0.05 level of significance.

X1:   76   76   74   57   83   66   56
X2:   81   66  107   94   78   90   85
d:

Confidence Interval for Matched-Pairs Data

A $(1 - \alpha) \cdot 100\%$ confidence interval for $\mu_d$ is given by

$$\text{Lower Limit: } \bar{d} - t_{\alpha/2} \cdot \frac{s_d}{\sqrt{n}} \qquad \text{Upper Limit: } \bar{d} + t_{\alpha/2} \cdot \frac{s_d}{\sqrt{n}}$$

Use the above example to construct a 99% confidence interval.

Section 10.2 Inference about Two Means: Independent Samples

We will now turn our attention to inferential methods for comparing
means from two independent samples. We will discuss only the case where the population standard deviations are unknown.

Sampling Distribution of the Difference of Two Means: Independent Samples with Population Standard Deviations Unknown

Suppose a simple random sample of size $n_1$ is taken from a population with unknown mean $\mu_1$ and unknown standard deviation $\sigma_1$. In addition, a simple random sample of size $n_2$ is taken from a population with unknown mean $\mu_2$ and unknown standard deviation $\sigma_2$. If the two populations are normally distributed or the sample sizes are sufficiently large, then

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

approximately follows a t-distribution with the smaller of $n_1 - 1$ or $n_2 - 1$ degrees of freedom.

The hypothesis testing procedure will again be similar to what we have already done. However, now the null and alternative hypotheses will take one of the following forms:

$H_0: \mu_1 - \mu_2 = 0$ vs. $H_1: \mu_1 - \mu_2 \ne 0$
$H_0: \mu_1 - \mu_2 = 0$ vs. $H_1: \mu_1 - \mu_2 > 0$
$H_0: \mu_1 - \mu_2 = 0$ vs. $H_1: \mu_1 - \mu_2 < 0$

Constructing a $(1 - \alpha) \cdot 100\%$ Confidence Interval About the Difference of Two Means

Under the same conditions (simple random samples of sizes $n_1$ and $n_2$, unknown means $\mu_1$ and $\mu_2$, unknown standard deviations $\sigma_1$ and $\sigma_2$, and either normal populations or sufficiently large samples), a $(1 - \alpha) \cdot 100\%$ confidence interval about $\mu_1 - \mu_2$ is given by

$$\text{Lower Limit: } (\bar{x}_1 - \bar{x}_2) - t_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

$$\text{Upper Limit: } (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Example: Test the claim that $\mu_1 > \mu_2$ at the 0.1 level of significance for the following data. Also construct a 95% confidence interval.

Section 10.3 Inference about Two Population Proportions

We will now discuss inferential methods for comparing two population proportions.

Sampling Distribution of the Difference between Two Proportions

Suppose a simple random sample of size $n_1$ is taken from a population where $x_1$ of the individuals have a specified characteristic, and a simple random sample of size $n_2$ is independently taken from a different
population where $x_2$ of the individuals have a specified characteristic. The sampling distribution of $\hat{p}_1 - \hat{p}_2$ is approximately normal with mean $p_1 - p_2$ and standard deviation

$$\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$$

provided that $n_1 p_1 > 10$, $n_1 q_1 > 10$, $n_2 p_2 > 10$, and $n_2 q_2 > 10$, where $q_1 = 1 - p_1$ and $q_2 = 1 - p_2$.

Now that we know the approximate sampling distribution of $\hat{p}_1 - \hat{p}_2$, we can introduce a procedure that can be used to test claims regarding two population proportions.

Let us first consider the test statistic. It would seem logical that the test statistic would be

$$z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}}$$

However, when we test a hypothesis, the null hypothesis is assumed true. When comparing two population proportions, the null hypothesis will always be $H_0: p_1 - p_2 = 0$. This null hypothesis assumes that the value of $p_1$ equals the value of $p_2$. Since the null hypothesis is assumed to be true, we are assuming $p_1 = p_2 = p$, where $p$ is the common population proportion.

Substituting the value of $p$ into the equation for the test statistic, we obtain a new test statistic:

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{p(1 - p)} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

We need a point estimate of $p$ because it is unknown. The best point estimate of $p$ is called the pooled estimate of $p$, denoted $\hat{p}$.

Pooled Estimate of $p$:

$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$

Hence the test statistic we will use in hypothesis testing is

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

Confidence Intervals for the Difference between Two Population Proportions

$$\text{Lower Limit: } (\hat{p}_1 - \hat{p}_2) - z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$

$$\text{Upper Limit: } (\hat{p}_1 - \hat{p}_2) + z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$

Notice that we do not pool the sample proportions. This is because we are not making any assumptions regarding their equality, as we did in hypothesis testing.

Example: On April 12, 1955, Dr. Jonas Salk released the results of clinical trials for his vaccine to prevent polio. In these clinical trials, 400,000 children were randomly divided into two groups. The subjects in Group 1 (the experimental group) were given the vaccine, while the subjects in Group 2 (the control group) were given a placebo. Of the 200,000 children in the experimental group, 33 developed polio. Of the 200,000 children in the control group, 115 developed polio. Test the
claim that the percentage of subjects in the experimental group who contracted polio is less than the percentage of subjects in the control group who contracted polio at the 0.01 level of significance. Also construct a 90% confidence interval for the difference between the two population proportions.

Chapter 4 Describing the Relation Between Two Variables

When collecting data, sometimes we measure more than one variable on each individual. Bivariate data are data consisting of the values of two different response variables that are obtained from the same population.

There are three possible combinations of variables that may be measured:
1. Both variables are qualitative.
2. One variable is qualitative and the other is quantitative.
3. Both variables are quantitative.

When two values are measured for each individual of the population, we denote the data as ordered pairs $(x, y)$. In some examples we will see that $x$ is the input variable and $y$ is the response variable.

Both Variables are Qualitative

We often arrange the data in a cross-tabulation or contingency table. These tables count all possible combinations of the levels of the variables. After counting the numbers, we make the tables with frequencies, relative frequencies, or percentages listed in each of the respective categories. Below is an example of a contingency table.

One Qualitative Variable and One Quantitative Variable

In this case we separate out our results and group them according to the qualitative variable. So essentially we have separate samples labeled by the qualitative variable. We can use this information to draw dotplots and boxplots and to compute five-number summaries, means, and standard deviations.

Example: GPA by college

LAS     AG      BUSINESS   EDUCATION
2.46    3.05    2.57       3.32
3.42    2.16    3.15       2.84
3.78    1.43    2.11       2.02
2.06    3.92    1.17       1.64
1.95    2.65    3.84       3.86
3.33    2.81    3.51       3.62

Two Quantitative Variables

In many statistical problems we are given data that consist of a pair, with an input/explanatory/independent x variable and an output/response/dependent y
variable. We often want to know if there is a linear relationship between these variables. Chapter 4 deals with this topic.

Example: $x$ is the number of hours of sleep deprivation; $y$ is the number of errors committed.

Section 4.1 Scatter Diagrams; Correlation

When we have two quantitative variables, we are often interested in using the value of one variable to predict the value of the other variable.

Caution: This relationship may not be causal. Was the data obtained from an observational study or a controlled experiment? Always consider lurking or hidden variables.

The response variable is the variable whose value can be explained by, or determined by, the value of the predictor variable.

Suppose I am interested in knowing if the amount of time studying for exam one is related to the score I receive on exam one. Then the time spent studying is the predictor variable and the score on the exam is the response variable.

The first step in identifying if there is a relationship between the variables is to draw a picture. A scatter diagram is a graph that shows the relationship between two quantitative variables measured on the same individual. Each individual in the data set is represented by a point in the scatter diagram. The predictor variable is plotted on the horizontal axis and the response variable is plotted on the vertical axis. Do not connect the points when drawing a scatter diagram.

Example: Draw a scatter diagram of the following data.

x:    2    4    8    8    9
y:   14   18   21   23   25

The correlation coefficient is a measure of the strength of linear relation between two quantitative variables. We use $\rho$ (rho) to represent the population correlation coefficient and $r$ to represent the sample correlation coefficient. We will only discuss in detail the sample correlation coefficient $r$.

Calculation of the Correlation Coefficient:

$$r = \frac{\sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)}{n - 1}$$

or equivalently, using Sum of Squares notation,

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx} \cdot SS_{yy}}}$$

where $SS_{xy}$, $SS_{xx}$, and $SS_{yy}$ are defined as follows:

$$SS_{xy} = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n} \qquad SS_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} \qquad SS_{yy} = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}$$

Using the previous data set, we can check if there is a linear relationship between $x$ and $y$. Calculate $r$.

Properties of the Correlation Coefficient:
1. The correlation coefficient is always between $-1$ and $1$, inclusive. That is, $-1 \le r \le 1$.
2. If $r = +1$, there is a perfect positive linear relation between the two variables.
3. If $r = -1$, there is a perfect negative linear relation between the two variables.
4. The closer $r$ is to $+1$, the stronger the evidence is of positive association between the two variables.
5. The closer $r$ is to $-1$, the stronger the evidence is of negative association between the two variables.
6. If $r$ is close to 0, there is evidence of no linear relation between the two variables. Because the correlation coefficient is a measure of the strength of linear relation, $r$ close to 0 does not imply no relation, just no linear relation.
7. The correlation coefficient is a unitless measure of association. So the unit of measure for $x$ and $y$ plays no role in the interpretation of $r$.
8. Linear correlation does not mean causation. CORRELATION DOES NOT IMPLY CAUSATION.
9. $r$ is sensitive to extreme data points.

Let's take a look at some scatter diagrams and see what we think about the linear association between two variables.

[Figure: example scatter diagrams]

Section 4.2 Least-Squares Regression

We have looked at scatter diagrams and correlation. We now know how to find the strength of the linear association between $x$ and $y$. If the data show a linear relationship between $x$ and $y$, can we find an equation to represent this relationship? We want to find the line that best describes the relation between the two variables.

What does "best" mean? The line that best describes the relation between two variables is the one that makes the residuals (errors) as small as possible.

The difference between the observed value of $y$ and the predicted value of $y$ is the residual (error):

$$\text{Residual} = y - \hat{y}$$

The most
popular technique for making the residuals as small as possible is the method of least squares.

The Least-Squares Regression Criterion

The least-squares regression line is the one that minimizes the sum of the squared errors. It is the line that minimizes the sum of the squared vertical distances between the observed values of y and those predicted by the line, $\hat{y}$. We represent this as

Minimize $\sum \text{residuals}^2$, i.e., $\min \sum (y - \hat{y})^2$ or $\min \sum e^2$.

Example: Set the cruise control on your car at 50 mph, and let y = distance and x = time.

[Figure: Distance vs. Time. Horizontal axis: Time (hours), 0 to 6. Vertical axis: Distance (miles).]

Each individual point can be represented as $y = \beta_0 + \beta_1 x + \varepsilon$, where $\varepsilon$ represents the error (residual).

The least-squares regression line is written as $\hat{y} = b_0 + b_1 x$, where $b_0$ is the y-intercept and $b_1$ is the slope.

Example: Let x = the number of cars that fit into the garage. Let y = cost of the house, in thousands of dollars. Find the least-squares regression line for the following data.

x: 0, 1, 1, 2
y: 80, 140, 180, 220

[Figure: Number of Cars vs. Cost. Horizontal axis: Number of Cars. Vertical axis: Price of House.]

We know the line that minimizes the sum of the squared residuals is $\hat{y} = b_0 + b_1 x$, but how do we determine the values of $b_0$ and $b_1$?

Equations for $b_1$ and $b_0$

$$b_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} = \frac{SS_{xy}}{SS_{xx}} \quad \text{or} \quad b_1 = r \cdot \frac{s_y}{s_x}$$

$$b_0 = \bar{y} - b_1 \bar{x}$$

$b_1$ is the slope of the least-squares regression line. This value represents the expected increase in y for a one-unit increase in x.

$b_0$ is the y-intercept of the least-squares regression line. This represents the expected value of y when x equals zero. This may or may not have a practical interpretation. It will only have meaning if the following two conditions are met:

1. A value of zero for the predictor variable makes sense.
2. There are observed values of the predictor variable near zero.

Extrapolation vs. Interpolation

The second condition above is especially important because statisticians do not use the regression model to make predictions outside the scope of the model (extrapolation). In other words, statisticians do not recommend using the regression
model to make predictions for values of the predictor variable that are much larger or much smaller than those observed, because we cannot be certain of the behavior of the data for which we have no observations. Interpolation is making predictions within the scope of our data.

Calculate $b_0$ and $b_1$ for the above example.

Calculate each residual and the sum of the squared residuals.

[Worksheet table, ending in a Total row]

This is the smallest possible value for the sum of squares of the residuals. What if we tried another model?

[Worksheet table, ending in a Total row]

Now that we have the best line in terms of minimizing the sum of the squares of the errors, we want to determine whether this line is any good at describing the relation between x and y. How much of the variability in the output y can be attributed to the input x?

Section 4.3 Diagnostics on the Least-Squares Regression Line

The coefficient of determination, $R^2$, measures the percentage of total variation in the response variable that is explained by the least-squares regression line.

The coefficient of determination is a number between 0 and 1. If $R^2 = 0$, the least-squares regression line has no explanatory power. If $R^2 = 1$, the least-squares regression line explains 100% of the variation in the response variable.

Deviations

The deviation between the observed value of the response variable, $y$, and the mean value of the response variable, $\bar{y}$, is called the total deviation, so total deviation $= y - \bar{y}$.

The deviation between the predicted value of the response variable, $\hat{y}$, and the mean value of the response variable, $\bar{y}$, is called the explained deviation, so explained deviation $= \hat{y} - \bar{y}$.

The deviation between the observed value of the response variable, $y$, and the predicted value of the response variable, $\hat{y}$, is called the unexplained deviation, so unexplained deviation $= y - \hat{y}$.

Total Deviation = Unexplained Deviation + Explained Deviation

$$y - \bar{y} = (y - \hat{y}) + (\hat{y} - \bar{y})$$

Let us look at this in terms of our example.

[Figure: Number of Cars vs. Cost. Horizontal axis: Number of Cars. Vertical axis: Price of House.]

It is also true that

Total Variation = Unexplained Variation + Explained Variation

Note: variation = sum of squared deviations. In other words,

$$SSTO = SSE + SSReg$$

$$SSTO = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad SSReg = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \qquad SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Note that

$$SSTO = SS_{yy} = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2, \qquad SSReg = b_1^2\, SS_{xx}, \qquad SSE = SSTO - SSReg$$

Knowing this information, we can find the value of $R^2$, the coefficient of determination, in three ways:

1. $R^2 = \dfrac{SSReg}{SSTO}$
2. $R^2 = 1 - \dfrac{SSE}{SSTO}$
3. $R^2 = r^2$

Caution: Squaring the linear correlation coefficient to obtain the coefficient of determination works only for the least-squares fit of the simple linear regression model. It does not work in general.

In the example above, how much of the variability in the cost of the house can be attributed to the number of cars that fit into the garage?

Example 1: The prices of homes (y) sold in a certain country are compared to the living area (x). The prices are given in thousands of dollars and the living area is given in hundreds of square feet.

x: 27, 18, 13, 17, 25, 18, 20, 22, 30, 10
y: 132, 175, 129, 138, 232, 135, 150, 209, 271, 89

Note: $\sum x_i = 200$, $\sum y_i = 1660$, $\sum x_i y_i = 35585$, $\sum y_i^2 = 303326$, $\sum x_i^2 = 4344$ (sums over $i = 1, \dots, 10$).

a) Find the least-squares line.
b) What would you expect to pay for an extra 100 square feet of living area?
c) Find SSTO.
d) Find SSReg.
e) What fraction of the variability in house prices is explained by living area?
f) What is the correlation between the price of houses and the living area?
g) What is the error (residual) for the house with 1700 square feet of living area?

Example 2: The diastolic blood pressure (x) and the systolic blood pressure (y) were recorded for 15 females. The data are given in the table.

x: 76, 70, 82, 90, 68, 60, 62, 60, 62, 72, 68, 80, 74, 90, 66
y: 122, 102, 118, 126, 108, 130, 104, 118, 130, 116, 102, 122, 130, 140, 102

Note: $\sum x_i = 1080$, $\sum y_i = 1770$, $\sum x_i y_i = 128160$, $\sum y_i^2 = 210940$, $\sum x_i^2 = 79152$ (sums over $i = 1, \dots, 15$).

a) Find the least-squares line.
b) Why is this called the least-squares line?
c) What systolic pressure would you expect for a woman with a diastolic pressure of 65?
d) Find $\sum (y - \hat{y})^2$.
e) Find the correlation between x and y. That is, find the correlation coefficient r.
f) Find the coefficient of determination, $R^2$.
g) Is there a linear relationship between x and y? Justify your answer.
h) Assume the data follow a bell-shaped curve and the standard deviation of the diastolic pressure (x) is 10. Using the empirical rule, what fraction of the data do you expect to fall below 82?

Example 3: A camp director needs to buy 200 sets of flashlight batteries for her campers to use over the summer. The costs for a set of two batteries are 2.50, 3.00, and 4.00 dollars. She decides to buy four sets of each type and run them until the flashlight goes dark, because she wants to determine the best buy. The lifetime for each set of batteries is given in the table.

x (cost): 2.5, 2.5, 2.5, 2.5, 3.0, 3.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0
y (lifetime): 8, 12, 11, 9, 15, 11, 16, 18, 12, 20, 23, 17

Note: $\sum x_i = 38$, $\sum y_i = 172$, $\sum x_i y_i = 568$, $\sum y_i^2 = 2698$, $\sum x_i^2 = 125$ (sums over $i = 1, \dots, 12$).

a) Find the regression line.
b) In what sense is this line the best?
c) Find the total sum of squares for this regression problem.
d) Find $R^2$.
e) Find the correlation between the price and the lifetime of a set of flashlight batteries.
f) If the lifetimes for the 2.50-dollar batteries form a normal distribution, in what interval would you expect 99% of the lifetimes to fall?

Descriptive Analysis and Presentation of Bivariate Data

Bivariate Data

Bivariate data: data consisting of the values of two different response variables that are obtained from the same population element.

There are three possible combinations of variables that may be measured:

1. Both variables are qualitative.
2. One variable is qualitative and the other is quantitative.
3. Both variables are quantitative (ordered pairs (x, y)). In some examples we will see that x is the input variable and y is the response variable.

Both Variables Qualitative

We often arrange the data in a cross-tabulation or contingency table. These tables count all possible combinations of the levels of the variables, with each individual counted in the respective categories. Below is an example of a contingency table.

- Who: 46 students in a previous section of Stat 104
- What: Class and Gender

[Table: counts by Class (Freshman, Sophomore, Junior, Senior) and Gender, with totals; cell values not legible in the extraction]

Example: friends. Source: Utts and Heckard (2004), Mind on Statistics, Belmont, CA: Brooks/Cole, pages 529-530.

Summary:
Female, Same: 16
Female, Opposite: 58
Female, No difference: 63
Male, Same: 13
Male, Opposite: 15
Male, No difference: 40

- Who: students in a statistics class at Penn State University
- What: With whom is it easiest to make friends? (opposite sex, same sex, no difference); Gender (male, female)

Crosstabs Table: With whom is it easiest to make friends?

                 Same Sex   Opposite Sex   No Difference   Row Total
Female  Count    16         58             63              137
        Row %    11.7       42.3           46.0            100
Male    Count    13         15             40              68
        Row %    19.1       22.1           58.8            100

Row/Column Percentages: A row percentage takes the row count relative to the total number of individuals in that row and expresses it as a percentage. Note that males answer "same" and "opposite" with about the same percentage. But females are over 3 times as likely to answer "opposite" as "same".

One Qualitative Variable and One Quantitative Variable

- Separate results by the levels of the qualitative variable.
- Consider them as separate samples.
- Use the quantitative information to draw dotplots and boxplots, and to compute five-number summaries, means, and standard deviations.

Example: GPA by college

[Table: five-number summaries of GPA by college (including AG, Business, and Education); only partly legible. Q3: 3.42, 3.05 (AG), 3.51, 3.62 (Ed). Max: 3.78, 3.92 (AG), 3.84, 3.86 (Ed).]

Example: Recall Carnivores, Bite Force

- Who: Carnivores
- What: Species (Canidae, Felidae): qualitative. Body mass (kilograms): quantitative.

Body Mass of Felidae and Body Mass of Canidae (rounded to nearest kg; the two lists are interleaved in the extraction): 47 12 13 4 7 3 2 5 511 510 10 9 25 11 36 4 410101711 21 4 5 2 9 7 23131 4 22 5 4162 96 55178 36 5 3 3 512 6 6 5 4 4 5 3 3 3 8 6

[Figure: Dot Plot of Body Mass (kg) by Family (Felidae, Canidae)]

[Figure: Oneway Analysis of Body Mass (kg) By Family (Canidae, Felidae)]

Two Quantitative Variables

In many statistical problems we are given data that consist of a pair, with an input (explanatory, independent) x-variable and an output (response, dependent) y-variable.

Example: x is the number of hours of sleep deprivation; y is the number of errors committed on an exam.

Two Quantitative Variables

We are often interested in using the value of one variable to
predict the value of the other variable. Suppose I am interested in knowing whether the amount of time studying for exam one is related to the score I receive on exam one. Then the time spent studying is the predictor variable and the score on the exam is the response variable.

The response variable is the variable whose value can be explained by, or determined by, the value of the predictor variable.

The first step in identifying whether there is a relationship between the variables is to draw a picture. A scatter diagram is a graph that shows the relationship between two quantitative variables measured on the same individual. Each individual is represented by a point on the scatter diagram. The predictor (explanatory) variable is plotted on the horizontal axis. The response variable is plotted on the vertical axis. Do not connect the points when drawing a scatter diagram.

Example: Draw a scatter diagram of the following data.

x: 2, 4, 8, 8, 9
y: 14, 18, 21, 23, 26

Example: Draw a scatter diagram of the data for sleep deprivation and number of errors committed on an exam. y is the response variable; x is the explanatory variable.

x: 12, 12, 18, 18, 24, 24, 30, 30
y: 19, 22, 21, 32, 61, 60, 88, 93

How to Measure Association

The correlation coefficient is a measure of the strength of linear relation between two quantitative variables. We use $\rho$ (rho) to represent the population correlation coefficient and $r$ to represent the sample correlation coefficient. We will only discuss in detail the sample correlation coefficient $r$.

Calculation of the Correlation Coefficient

$$r = \frac{\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)}{n - 1}$$

or equivalently

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}}$$

where $SS_{xy}$, $SS_{xx}$, and $SS_{yy}$ are defined as follows:

$$SS_{xy} = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}, \qquad SS_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}, \qquad SS_{yy} = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}$$

Using the previous data sets, we can check whether there is a linear relationship between x and y. Calculate r.

[Worksheet tables for the two calculations, each ending in a Total row]

Properties of the Correlation Coefficient

1. The correlation coefficient is always between -1 and 1, inclusive. That is, $-1 \le r \le 1$.
2. If r = 1, there is a perfect positive linear relation between the two variables.
3. If r = -1, there is a perfect negative linear relation between the two variables.
4. The closer r is to +1, the stronger the evidence of positive association between the two variables.
5. The closer r is to
-1, the stronger the evidence of negative association between the two variables.
6. If r is close to 0, there is evidence of no linear relation between the two variables. Because the correlation coefficient is a measure of the strength of linear relation, r close to 0 does not imply no relation, just no linear relation.
7. The correlation coefficient is a unitless measure of association. So the unit of measure for x and y plays no role in the interpretation of r. Recall z-scores.
8. Linear correlation does not mean causation. CORRELATION DOES NOT IMPLY CAUSATION. Only designed experiments imply causation.
9. Correlation is not affected by changes in center or scale.
10. r is sensitive to an extreme data point.

[Figure label: r > 0.99]

Let's take a look at some scatter diagrams and see what we think about the linear association between two variables.

[Figures: scatter diagrams with the following captions]
- r = 1: Perfect positive correlation
- r = 0.9: Strong positive correlation
- r = 0.4: Weak positive correlation
- r = -1: Perfect negative correlation
- r = -0.9: Strong negative correlation
- r = -0.4: Weak negative correlation
- r close to zero: no linear correlation
- r close to zero: no linear correlation

Recap: Correlation

- Linear association: How closely do the points on the scatter diagram represent a straight line? The correlation coefficient gives the direction of, and quantifies the strength of, the linear association between two quantitative variables.
- ALWAYS STATE 3 THINGS:
  - Degree of correlation (strong, weak, moderate)
  - Direction of correlation (positive, negative)
  - The word LINEAR
- r = 0.9 implies there is a strong positive linear correlation.
- r = -0.1 implies there is a weak negative linear correlation.

Canidae: Bite
force vs. Body mass

[Figure: scatter diagram of Bite Force vs. Body Mass]

Correlation Coefficient

Body mass (x) and Bite force (y):

$\sum x = 257.8$, $\sum x^2 = 4108.58$, $\sum y = 4312.8$, $\sum y^2 = 989568.5$, $\sum xy = 63006.59$, $n = 28$

$$r = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sqrt{\left[\sum x^2 - \frac{(\sum x)^2}{n}\right]\left[\sum y^2 - \frac{(\sum y)^2}{n}\right]}} = \frac{63006.59 - \frac{(257.8)(4312.8)}{28}}{\sqrt{\left[4108.58 - \frac{(257.8)^2}{28}\right]\left[989568.5 - \frac{(4312.8)^2}{28}\right]}} \approx 0.9807$$

There is a strong positive linear association between the body mass and bite force for the various species of Canidae.

Correlation Cautions

- Don't confuse correlation with causation.
- Correlation does NOT imply causation.
- There is a strong positive correlation between the number of crimes committed in communities and the number of 2nd graders in those communities.

Equation of a Line: Review

- The equation of a straight line: $y = mx + b$
- m is the slope, or rise over run: the change in y over the change in x.
- b is the y-intercept, the value where the line cuts the y-axis.

Review: $y = 3x + 2$

- x = 0: y = 2 (the y-intercept)
- x = 3: y = 11
- The change in y (9) divided by the change in x (3) gives the slope.

Least-Squares Regression

If the data show a linear relationship between x and y, can we find an equation to represent this relationship? We want to find the line that best describes the relation between the two variables.

What does best mean? The line that best describes the relation between two variables is the one that makes the residuals (errors) as small as possible. The difference between the observed value of y and the predicted value of y is the residual (error):

Residual $= y - \hat{y}$, where $y$ is the "truth" and $\hat{y}$ is the "estimate".

The most popular technique for making the residuals as small as possible is the method of least squares.

The Least-Squares Regression Criterion

The least-squares regression line is the one that minimizes the sum of the squared errors. It is the line that minimizes the sum of the squared vertical distances between the observed values of y and those predicted by the line. We represent this as

Minimize $\sum \text{residuals}^2$, i.e., $\min \sum (y - \hat{y})^2$ or $\min \sum e^2$.

Example: Set the cruise control on your car at 50 mph, and let y = distance and x = time (hours).

[Figure: Distance vs. Time. Horizontal axis: Time (hours).]

Each individual point can be represented as $y = \beta_0 + \beta_1 x + \varepsilon$.

Recall: residual $e = y - \hat{y}$. When $y < \hat{y}$, $e < 0$; when $y > \hat{y}$, $e > 0$. The term $\varepsilon$ accounts for this.

The least-squares regression line is written as $\hat{y} = b_0 + b_1 x$, where $b_0$ is the y-intercept and $b_1$ is the slope.

Example: Let x = the number of cars that fit into the garage. Let y = cost of the house, in thousands of dollars. Find the least-squares regression line for the following data.

[Figure: Number of Cars vs. Cost]

We know the line that minimizes the sum of the squared residuals is $\hat{y} = b_0 + b_1 x$, but how do we determine the values of $b_0$ and $b_1$?

Equations for $b_1$ and $b_0$

$$b_1 = \frac{SS_{xy}}{SS_{xx}}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

$b_1$ is the slope of the least-squares regression line. This value represents the expected increase in y for a one-unit increase in x.

$b_0$ is the y-intercept of the least-squares regression line. This represents the expected value of y when x equals zero. It will only have meaning if the following two conditions are met:

1. A value of zero for the predictor variable x makes sense.
2. There are observed values of the predictor variable near zero.

Extrapolation vs. Interpolation

Statisticians do not use the regression model to make predictions outside the scope of the model (extrapolation). In other words, statisticians do not recommend using the regression model to make predictions for values of the predictor variable that are much larger or much smaller than those observed. Interpolation is making predictions within the scope of our data.

Calculating $b_0$ and $b_1$ for the above example:

[Worksheet table, ending in a Total row]

Least-squares regression line: $\hat{y} = 85 + 70x$

Calculate each residual and the sum of the squared residuals.

What if we tried another model? Say $\hat{y} = 80 + 75x$. Total: 950.

Least-squares regression line: $\hat{y} = 85 + 70x$. Total: 900.

Now that we have the best line in terms of minimizing the sum of the squares of the errors, we want to determine whether this line is any good at describing the relation between x and y.

Diagnostics on the Least-Squares Regression Line

How much of the variability in the output y can be attributed to the input x?

The coefficient of determination, $R^2$, measures the percentage of total variation in the response variable that is explained by the least-squares regression line.

The coefficient of determination is a number between 0 and 1. If $R^2 = 0$, the least-squares regression line has no explanatory power. If $R^2 = 1$, the least-squares regression line explains 100% of the variation in the response variable.

The deviation between the observed value of the response variable, $y$, and the mean value of the response variable, $\bar{y}$, is called the total deviation, so total deviation $= y - \bar{y}$.

The deviation between the predicted value of the
response variable, $\hat{y}$, and the mean value of the response variable, $\bar{y}$, is called the explained deviation, so explained deviation $= \hat{y} - \bar{y}$.

The deviation between the observed value of the response variable, $y$, and the predicted value of the response variable, $\hat{y}$, is called the unexplained deviation, so unexplained deviation $= y - \hat{y}$.

Total Deviation = Unexplained Deviation + Explained Deviation

$$y - \bar{y} = (y - \hat{y}) + (\hat{y} - \bar{y})$$

Let us look at this in terms of our example.

[Figure: Number of cars vs. cost]

It is also true that

Total Variation = Unexplained Variation + Explained Variation

Note: variation = sum of squared deviations. In other words,

$$SSTO = SSE + SSReg$$

$$SSTO = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad SSReg = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \qquad SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Note that

$$SSTO = SS_{yy} = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2, \qquad SSReg = b_1^2\, SS_{xx}, \qquad SSE = SSTO - SSReg$$

Knowing this information, we can find the value of $R^2$, the coefficient of determination, in three ways:

1. $R^2 = \dfrac{SSReg}{SSTO}$
2. $R^2 = 1 - \dfrac{SSE}{SSTO}$
3. $R^2 = r^2$

Caution: Squaring the linear correlation coefficient to obtain the coefficient of determination works only for the least-squares fit of the simple linear regression model. It does not work in general.

Linear Regression

Example: Body mass (kg) and Bite force (N) for Canidae

- y, Response: Bite force (N)
- x, Explanatory: Body mass (kg)
- Cases: 28 species of Canidae

[Figure: Regression Plot. Horizontal axis: Body mass (kg), 0 to 35. Vertical axis: BF ca (N).]

Least Squares Estimates

- Body mass, x: $\sum x = 257.8$, $\sum x^2 = 4108.58$, $\sum xy = 63006.59$
- Bite force, y: $\sum y = 4312.8$, $\sum y^2 = 989568.5$

$$b_1 = \frac{63006.59 - \frac{(257.8)(4312.8)}{28}}{4108.58 - \frac{(257.8)^2}{28}} = 13.428$$

$$b_0 = 154.029 - 13.428(9.207) = 30.397$$

$$\hat{y} = 30.397 + 13.428x$$

Interpret the slope and intercept

- Slope: for every 1 kg increase in body mass, the bite force increases, on average, by 13.428 N.
- Intercept: there is not a reasonable interpretation of the intercept in this context, because one wouldn't see a Canidae with a body mass of 0 kg.
- Note that it is not enough to say "The slope is the increase in y for a 1-unit increase in x and the intercept is y when x = 0."

[Figure: Bite Force vs. Body Mass with the fitted line $\hat{y} = 30.397 + 13.428x$. Horizontal axis: Body mass (kg). Vertical axis: BF ca (N).]

Prediction: find the estimate for y when x = 25

- Least-squares line: $\hat{y} = 30.397 + 13.428x$
- x = 25: $\hat{y} = 30.397 + 13.428(25) = 366.1$ N

Residual

- Body mass: x = 25 kg
- Bite force: y = 351.5 N
- Predicted: $\hat{y} = 366.1$ N
- Residual $= y - \hat{y} = 351.5 - 366.1 = -14.6$ N

[Figure: Plot of Residuals vs. Body Mass. Horizontal axis: Body mass (kg), 0 to 35. Vertical axis: Residual.]

Residuals

- Residuals help us see whether the linear model makes sense.
- Plot the residuals versus the explanatory variable.
- If the plot is a random scatter of points, then the linear model is the best we can do.

Interpretation of the Plot

- The residuals are scattered randomly. This indicates that the linear model is an appropriate model for the relationship between body mass and bite force for Canidae.

Body mass and Bite force

- r = 0.9807
- $r^2 = 0.9807^2 = 0.962$, or 96.2%
- 96.2% of the variation in bite force can be explained by the linear relationship with body mass.

Regression Conditions

- Quantitative variables: both variables should be quantitative.
- Linear model: does the scatter diagram show a reasonably straight line?
- Outliers: watch out for outliers, as they can be very influential.

Regression Cautions

- Beware of extraordinary points.
- Don't extrapolate beyond the data.
- Don't infer that x causes y just because there is a good linear model relating the two variables.
- Don't choose a model based on $R^2$ alone.

Stat 104 Lecture 27

Chapter 8

- Quantitative variable
- Population parameters: $\sigma$ known

[Diagram: Population (parameter $\mu$) and Sample (statistic $\bar{y}$)]

Chapter 8

- Sampling from a population that follows a Normal Model
- z statistic, Table Z, P-value

Chapter 9, Sections 1-2

- Quantitative variable
- Population parameters: $\sigma$ unknown

Unknown $\sigma$

- If we do not know the value of the population standard deviation, we cannot use the methods from Chapter 8.
- We can use the sample standard deviation, s, as an estimate of the population standard deviation, $\sigma$.
- We can NOT continue to use the standard normal distribution or Table Z. Why?

95% Confidence

- Simulation illustrating repeating the procedure
- http://statweb.calpoly.edu/chance/applets/ConfSim/ConfSim.html

Simulating Confidence Intervals
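The idea of repeating the procedure many times can also be sketched in a few lines of Python. This is only an illustration, not the applet referenced above: the population mean and standard deviation (5.0 and 0.3), the sample size of 10, the number of repetitions, and the function name `coverage_rate` are all assumptions made for the sketch. The critical value 2.262 is the 95% t value for 9 degrees of freedom.

```python
import math
import random
import statistics

def coverage_rate(mu, sigma, n, t_star, repetitions, seed=0):
    """Fraction of simulated intervals (y_bar +/- t* s / sqrt(n)) that contain mu."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(repetitions):
        # Draw one sample of size n from a hypothetical Normal population.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        y_bar = statistics.mean(sample)
        s = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
        margin = t_star * s / math.sqrt(n)
        if y_bar - margin <= mu <= y_bar + margin:
            hits += 1
    return hits / repetitions

# Assumed values for illustration only.
rate = coverage_rate(mu=5.0, sigma=0.3, n=10, t_star=2.262, repetitions=2000)
print(f"Proportion of intervals containing mu: {rate:.3f}")
```

In the long run, about 95% of the intervals built this way should contain the population mean; any single run will land near, but not exactly on, that figure.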
[Figure: simulated confidence intervals]

Chapter 9, Sections 1-2

- Confidence interval for $\mu$: $\bar{y} - t^* \dfrac{s}{\sqrt{n}}$ to $\bar{y} + t^* \dfrac{s}{\sqrt{n}}$
- $t^*$ is found in Table T, with df = n - 1.

Chapter 9, Sections 1-2

- Test statistic: $t = \dfrac{\bar{y} - \mu_0}{s / \sqrt{n}}$
- Table T gives the P-value.

Example

What is the mean alcohol content of beer? A random sample of 10 beers is taken and the alcohol content is measured.

We do not know the value of the population standard deviation, $\sigma$, for the population of all beers.

Sample Data

Alcohol: 5.19, 5.17, 4.76, 4.96, 4.32, 4.78

Sample Summary

- Sample size: n = 10
- Sample mean: $\bar{y} = 4.762$
- Sample standard deviation: s = 0.314

Confidence Interval for $\mu$

$\bar{y} - t^* \dfrac{s}{\sqrt{n}}$ to $\bar{y} + t^* \dfrac{s}{\sqrt{n}}$, with df = n - 1

Inference for $\mu$

- Do NOT use Table Z.
- Use Table T instead.

95% Confidence Interval

- $\bar{y} = 4.762$
- n = 10
- s = 0.314
- $t^* = 2.262$

Calculation

$$\bar{y} - t^* \frac{s}{\sqrt{n}} \ \text{ to } \ \bar{y} + t^* \frac{s}{\sqrt{n}}$$

$$4.762 - 2.262\left(\frac{0.314}{\sqrt{10}}\right) \ \text{ to } \ 4.762 + 2.262\left(\frac{0.314}{\sqrt{10}}\right)$$

4.762 - 0.225 to 4.762 + 0.225

4.537 to 4.987

Stat 104 Lecture 6

Summary Measures

- Dispersion or spread
  - sample range
  - sample mean absolute deviation
  - sample standard deviation

9-hole Golf Scores

45, 44, 50, 43, 48, 52

Sample Range = maximum - minimum = 52 - 43 = 9 strokes

[Figure: dot plot of the golf scores, axis from 40 to 55]

Measures of Spread

- Based on the deviation from the sample mean.
- Deviation from the mean: $y - \bar{y}$

9-hole Golf Scores

45, 44, 50, 43, 48, 52; $\bar{y} = 47$ strokes

[Figure: dot plot of the golf scores with the mean marked, axis from 40 to 55]

Deviations from the Mean

[Figure: dot plot showing the deviations from the mean, axis from 40 to 55]

Sample Mean Absolute Deviation

$$MAD = \frac{\sum |y - \bar{y}|}{n}$$

$$MAD = \frac{4 + 3 + 2 + 5 + 3 + 1}{6} = \frac{18}{6} = 3.0 \text{ strokes}$$

Sample Variance

Almost the average squared deviation:

$$s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$$

Sample Variance: Golf Scores

$$s^2 = \frac{16 + 9 + 4 + 25 + 9 + 1}{5} = \frac{64}{5} = 12.8 \text{ strokes}^2$$

Sample Standard Deviation: Golf Scores

$$s = \sqrt{s^2} = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}, \qquad s = \sqrt{12.8} = 3.58 \text{ strokes}$$

Sample Standard Deviation: Body Mass of Canidae

$$s = \sqrt{s^2} = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}, \qquad s = \sqrt{64.36} = 8.02 \text{ kg}$$

Standard Score

Look at the number of standard deviations the score is from the mean.

Summary Measures

- Position
  - Sample quartiles
  - Five-number summary
  - Sample interquartile range
  - Box and whiskers plot

Sample Quartiles

- Medians of the lower
and upper halves of the data.
- Trying to split the data into fourths (quarters).

Sample Quartiles: Body Mass (kg) of Canidae

[Stem-and-leaf display of body mass (kg) of Canidae, annotated with Q1 = (4 + 5)/2 = 4.5 kg and Q3 = (10 + 11)/2 = 10.5 kg]

Measure of Spread

- InterQuartile Range (IQR)
  - The distance between the quartiles: IQR = 10.5 - 4.5 = 6 kilograms
  - The length of the interval that contains the central 50% of the data.

Five-Number Summary

- Minimum: 1 kilogram
- Q1: 4.5 kilograms
- Median: 6 kilograms
- Q3: 10.5 kilograms
- Maximum: 36 kilograms

Box Plot

- Establish an axis with a scale.
- Draw a box that extends from Q1 to Q3.
- Draw a line from Q1 to the minimum and another line from Q3 to the maximum.

Outlier Box Plots

- Establish boundaries on what are usual values, based on the width of the box.
- Values outside the boundaries are flagged as potential outliers.

[Figure: Box Plot of Body Mass of Canidae. Axis: Body Mass (kg).]

[Figure: Body Mass of Canidae and Felidae, by Family. Axis: Body Mass (kg).]

Chapter 7 The Normal Probability Distribution

Section 7.1 Properties of the Normal Distribution

Recall that a continuous random variable is a random variable that has an infinite number of possible values and is not countable. To find probabilities for continuous random variables, we do not use probability distribution functions as we did for discrete random variables. Instead, we use probability density functions.

Probability Density Function

A probability density function is an equation used to compute probabilities of continuous random variables. It must satisfy the following two properties:

1. The area under the graph of the equation over all possible values of the random variable must equal one.
2. The graph of the equation must be greater than or equal to zero for all possible values of the random variable. That is, the graph of the equation must lie on or above the horizontal axis for all possible values of the random variable.

The area under the graph of a density function over some interval represents the
probability of observing a value of the random variable in that interval.

A continuous random variable is normally distributed, or has a normal probability distribution, if its relative frequency histogram has the shape of a normal curve (bell-shaped and symmetric).

Properties of the Normal Density Curve

1. It is symmetric about its mean, $\mu$.
2. The highest point occurs at $x = \mu$.
3. The area under the curve is one.
4. The area under the curve to the right of $\mu$ equals the area under the curve to the left of $\mu$, which equals 1/2.
5. As x increases without bound (gets larger and larger), the graph approaches, but never equals, zero. As x decreases without bound (gets larger and larger in the negative direction), the graph approaches, but never equals, zero.
6. The Empirical Rule:
   - Approximately 68% of the area under the normal curve is between $x = \mu - \sigma$ and $x = \mu + \sigma$.
   - Approximately 95% of the area under the normal curve is between $x = \mu - 2\sigma$ and $x = \mu + 2\sigma$.
   - Approximately 99.7% of the area under the normal curve is between $x = \mu - 3\sigma$ and $x = \mu + 3\sigma$.

The Area under a Normal Curve

Suppose a random variable X is normally distributed with mean $\mu$ and standard deviation $\sigma$. Notation: $X \sim N(\mu, \sigma)$.

The area under the normal curve for any range of values of the random variable X represents either

- the proportion of the population with the characteristics described by the range, or
- the probability that a randomly selected individual from the population will have the characteristics described by the range.

Standardizing a Normal Random Variable

Suppose the random variable X is normally distributed with mean $\mu$ and standard deviation $\sigma$. Then the random variable

$$Z = \frac{X - \mu}{\sigma}$$

is normally distributed with mean $\mu = 0$ and standard deviation $\sigma = 1$. The random variable Z is said to have the standard normal distribution. Notation: $Z \sim N(0, 1)$.

For any given x, we can calculate the associated z-score using the formula above.

Section 7.2 The Standard Normal Distribution

Properties of the Standard Normal Curve

1. It is symmetric about its mean, $\mu = 0$.
2. The highest point occurs at $\mu =$
0.
3. The area under the curve is one. This characteristic is required in order to satisfy the requirement that the sum of all probabilities in a legitimate probability distribution equals 1.
4. The area under the curve to the right of $\mu = 0$ equals the area under the curve to the left of $\mu = 0$, which equals 1/2.
5. As z increases without bound (gets larger and larger), the graph approaches, but never equals, zero. As z decreases without bound (gets larger and larger in the negative direction), the graph approaches, but never equals, zero.
6. The Empirical Rule:
   - Approximately 0.68 (68%) of the area under the standard normal curve is between -1 and 1.
   - Approximately 0.95 (95%) of the area under the standard normal curve is between -2 and 2.
   - Approximately 0.997 (99.7%) of the area under the standard normal curve is between -3 and 3.

Notation for the Probability of a Standard Normal Random Variable

- $P(a < Z < b)$ represents the probability that a standard normal random variable is between a and b.
- $P(Z > a)$ represents the probability that a standard normal random variable is greater than a.
- $P(Z < b)$ represents the probability that a standard normal random variable is less than b.

The notation $z_\alpha$ (pronounced "z sub alpha") is the z-score such that the area under the standard normal curve to the right of $z_\alpha$ is $\alpha$.

Table II at the back of the text is referred to as a Z table. It tabulates the area to the left of a given z-score. We will now take a look at some examples.

Standard Normal Distribution (area to the left of z)

z     .00     .01     .02     .03     .04     .05     .06     .07     .08     .09
-3.4  0.0003  0.0003  0.0003  0.0003  0.0003  0.0003  0.0003  0.0003  0.0003  0.0002
-3.3  0.0005  0.0005  0.0005  0.0004  0.0004  0.0004  0.0004  0.0004  0.0004  0.0003
-3.2  0.0007  0.0007  0.0006  0.0006  0.0006  0.0006  0.0006  0.0005  0.0005  0.0005
-3.1  0.0010  0.0009  0.0009  0.0009  0.0008  0.0008  0.0008  0.0008  0.0007  0.0007
-3.0  0.0013  0.0013  0.0013  0.0012  0.0012  0.0011  0.0011  0.0011  0.0010  0.0010
-2.9  0.0019  0.0018  0.0018  0.0017  0.0016  0.0016  0.0015  0.0015  0.0014  0.0014
-2.8  0.0026  0.0025  0.0024  0.0023  0.0023  0.0022  0.0021  0.0021  0.0020  0.0019
-2.7  0.0035  0.0034  0.0033  0.0032  0.0031  0.0030  0.0029  0.0028  0.0027  0.0026
-2.6  0.0047  0.0045  0.0044  0.0043  0.0041  0.0040  0.0039  0.0038  0.0037  0.0036
-2.5  0.0062  0.0060  0.0059  0.0057  0.0055  0.0054  0.0052  0.0051  0.0049  0.0048
-2.4  0.0082  0.0080  0.0078  0.0075  0.0073  0.0071  0.0069  0.0068  0.0066  0.0064
-2.3  0.0107  0.0104  0.0102  0.0099  0.0096  0.0094  0.0091  0.0089  0.0087  0.0084
-2.2  0.0139  0.0136  0.0132  0.0129  0.0125  0.0122  0.0119  0.0116  0.0113  0.0110
-2.1  0.0179  0.0174  0.0170  0.0166  0.0162  0.0158  0.0154  0.0150  0.0146  0.0143
-2.0  0.0228  0.0222  0.0217  0.0212  0.0207  0.0202  0.0197  0.0192  0.0188  0.0183
-1.9  0.0287  0.0281  0.0274  0.0268  0.0262  0.0256  0.0250  0.0244  0.0239  0.0233
-1.8  0.0359  0.0351  0.0344  0.0336  0.0329  0.0322  0.0314  0.0307  0.0301  0.0294
-1.7  0.0446  0.0436  0.0427  0.0418  0.0409  0.0401  0.0392  0.0384  0.0375  0.0367
-1.6  0.0548  0.0537  0.0526  0.0516  0.0505  0.0495  0.0485  0.0475  0.0465  0.0455
-1.5  0.0668  0.0655  0.0643  0.0630  0.0618  0.0606  0.0594  0.0582  0.0571  0.0559
-1.4  0.0808  0.0793  0.0778  0.0764  0.0749  0.0735  0.0721  0.0708  0.0694  0.0681
-1.3  0.0968  0.0951  0.0934  0.0918  0.0901  0.0885  0.0869  0.0853  0.0838  0.0823
-1.2  0.1151  0.1131  0.1112  0.1093  0.1075  0.1056  0.1038  0.1020  0.1003  0.0985
-1.1  0.1357  0.1335  0.1314  0.1292  0.1271  0.1251  0.1230  0.1210  0.1190  0.1170
-1.0  0.1587  0.1562  0.1539  0.1515  0.1492  0.1469  0.1446  0.1423  0.1401  0.1379
-0.9  0.1841  0.1814  0.1788  0.1762  0.1736  0.1711  0.1685  0.1660  0.1635  0.1611
-0.8  0.2119  0.2090  0.2061  0.2033  0.2005  0.1977  0.1949  0.1922  0.1894  0.1867
-0.7  0.2420  0.2389  0.2358  0.2327  0.2296  0.2266  0.2236  0.2206  0.2177  0.2148
-0.6  0.2743  0.2709  0.2676  0.2643  0.2611  0.2578  0.2546  0.2514  0.2483  0.2451
-0.5  0.3085  0.3050  0.3015  0.2981  0.2946  0.2912  0.2877  0.2843  0.2810  0.2776
-0.4  0.3446  0.3409  0.3372  0.3336  0.3300  0.3264  0.3228  0.3192  0.3156  0.3121
-0.3  0.3821  0.3783  0.3745  0.3707  0.3669  0.3632  0.3594  0.3557  0.3520  0.3483
-0.2  0.4207  0.4168  0.4129  0.4090  0.4052  0.4013  0.3974  0.3936  0.3897  0.3859
-0.1  0.4602  0.4562  0.4522  0.4483  0.4443  0.4404  0.4364  0.4325  0.4286  0.4247
-0.0  0.5000  0.4960  0.4920  0.4880  0.4840  0.4801  0.4761  0.4721  0.4681  0.4641

Standard Normal Distribution (area to the left of z)

z     .00     .01     .02     .03     .04     .05     .06     .07     .08     .09
0.0   0.5000  0.5040  0.5080  0.5120  0.5160  0.5199  0.5239  0.5279  0.5319  0.5359
0.1   0.5398  0.5438  0.5478  0.5517  0.5557  0.5596  0.5636  0.5675  0.5714  0.5753
0.2   0.5793  0.5832  0.5871  0.5910  0.5948  0.5987  0.6026  0.6064  0.6103  0.6141
0.3   0.6179  0.6217  0.6255  0.6293  0.6331  0.6368  0.6406  0.6443  0.6480  0.6517
0.4   0.6554  0.6591  0.6628  0.6664  0.6700  0.6736  0.6772  0.6808  0.6844  0.6879
0.5   0.6915  0.6950  0.6985  0.7019  0.7054  0.7088  0.7123  0.7157  0.7190  0.7224
0.6   0.7257  0.7291  0.7324  0.7357  0.7389  0.7422  0.7454  0.7486  0.7517  0.7549
0.7   0.7580  0.7611  0.7642  0.7673  0.7704  0.7734  0.7764  0.7794  0.7823  0.7852
0.8   0.7881  0.7910  0.7939  0.7967  0.7995  0.8023  0.8051  0.8078  0.8106  0.8133
0.9   0.8159  0.8186  0.8212  0.8238  0.8264  0.8289  0.8315  0.8340  0.8365  0.8389
1.0   0.8413  0.8438  0.8461  0.8485  0.8508  0.8531  0.8554  0.8577  0.8599  0.8621
1.1   0.8643  0.8665  0.8686  0.8708  0.8729  0.8749  0.8770  0.8790  0.8810  0.8830
1.2   0.8849  0.8869  0.8888  0.8907  0.8925  0.8944  0.8962  0.8980  0.8997  0.9015
1.3   0.9032  0.9049  0.9066  0.9082  0.9099  0.9115  0.9131  0.9147  0.9162  0.9177
1.4   0.9192  0.9207  0.9222  0.9236  0.9251  0.9265  0.9279  0.9292  0.9306  0.9319
1.5   0.9332  0.9345  0.9357  0.9370  0.9382  0.9394  0.9406  0.9418  0.9429  0.9441
1.6   0.9452  0.9463  0.9474  0.9484  0.9495  0.9505  0.9515  0.9525  0.9535  0.9545
1.7   0.9554  0.9564  0.9573  0.9582  0.9591  0.9599  0.9608  0.9616  0.9625  0.9633
1.8   0.9641  0.9649  0.9656  0.9664  0.9671  0.9678  0.9686  0.9693  0.9699  0.9706
1.9   0.9713  0.9719  0.9726  0.9732  0.9738  0.9744  0.9750  0.9756  0.9761  0.9767
2.0   0.9772  0.9778  0.9783  0.9788  0.9793  0.9798  0.9803  0.9808  0.9812  0.9817
2.1   0.9821  0.9826  0.9830  0.9834  0.9838  0.9842  0.9846  0.9850  0.9854  0.9857
2.2   0.9861  0.9864  0.9868  0.9871  0.9875  0.9878  0.9881  0.9884  0.9887  0.9890
2.3   0.9893  0.9896  0.9898  0.9901  0.9904  0.9906  0.9909  0.9911  0.9913  0.9916
2.4   0.9918  0.9920  0.9922  0.9925  0.9927  0.9929  0.9931  0.9932  0.9934  0.9936
2.5   0.9938  0.9940  0.9941  0.9943  0.9945  0.9946  0.9948  0.9949  0.9951  0.9952
2.6   0.9953  0.9955  0.9956  0.9957  0.9959  0.9960  0.9961  0.9962  0.9963  0.9964
2.7   0.9965  0.9966  0.9967  0.9968  0.9969  0.9970  0.9971  0.9972  0.9973  0.9974
2.8   0.9974  0.9975  0.9976  0.9977  0.9977  0.9978  0.9979  0.9979  0.9980  0.9981
2.9   0.9981  0.9982  0.9982  0.9983  0.9984  0.9984  0.9985  0.9985  0.9986  0.9986
3.0   0.9987  0.9987  0.9987  0.9988  0.9988  0.9989  0.9989  0.9989  0.9990  0.9990
3.1   0.9990  0.9991  0.9991  0.9991  0.9992  0.9992  0.9992  0.9992  0.9993  0.9993
3.2   0.9993  0.9993  0.9994  0.9994  0.9994  0.9994  0.9994  0.9995  0.9995  0.9995
3.3   0.9995  0.9995  0.9995  0.9996  0.9996  0.9996  0.9996  0.9996  0.9996  0.9997
3.4   0.9997  0.9997  0.9997  0.9997  0.9997  0.9997  0.9997  0.9997  0.9997  0.9998

[Handwritten examples, only partly legible: find the probability that Z is less than a given value; find the probability that Z is between two values; find the probability that the absolute value of Z is greater than a given value; find the probability that Z is greater than a given value; find percentiles of Z; find the value of $z_\alpha$ for a given $\alpha$.]

Section 7.3 Applications of the Normal Distribution

Finding the Area under any Normal Curve

1. Draw a normal curve with the desired area shaded.
2. Convert the values of X to z-scores using $z = \dfrac{x - \mu}{\sigma}$.
3. Draw a standard normal curve with the desired area shaded.
4. Find the area under the standard normal curve. This is the area under the normal curve drawn in Step 1.

Procedure for Finding the Value of a Normal Random Variable Corresponding to a Specified Proportion or Probability

1. Draw a standard normal curve with the area corresponding to the proportion or probability shaded.
2. Use the Z table to find the z-score that corresponds to the shaded area.
3. Obtain the normal value from the fact that $x = \mu + z\sigma$.

We will take a look at some examples.

[Handwritten examples, only partly legible: for a normally distributed X, find the probability that X is greater than a given value; find the probability that X is between two values; find a percentile; find a probability involving an absolute value.]
absoluk mm Value 0 is gmms 09 X rs ess 111cm 5 105 Lat x Nuns Final 1m LU xr Nuns Fmd 1m F Sih FucmHe or X Wk percen h le JEv X 106 Section 74 Assessing Normality Suppose that we obtain a simple random sample from a population whose distribution is unknown Many of the statistical tests that we perform on small data sets sample size less than 30 require that the population from which the sample is drawn be normally distributed So how do we know if a data set comes from a normal distribution We will use a normal probability plot to answer the above question This plot is also called a normal quantile plot A normal probability plot plots observed data verses normal scores A normal score is the expected Zscore of the data value if the distribution of the random variable is normal If sample data are taken from a population that is normally distributed a normal probability plot of the actual values versus the expected Z scores will be approximately linear Fat pencil test The book talks in detail on how to manually draw a normal probability plot We will not do this by hand We will use JMP to draw these plots for us How to Obtain a Normal Probability Plot from JMP Click on Analyze and then Distribution Select a column heading for Y columns Click OK You will obtain a histogram and other output On the output screen find the red down triangle v and find Normal Quantile Plot This will yield a plot that can be used to test for normality 107 Example Use the following normal probability plots to assess whether the sample data could have come from a population that is normally distributed Normal Quantile Hot Normal Quantile Plot 01 05 10 Normal Quantile Hot Normal Quantile Plot 108 Section 75 Sampling Distributions And The Central Limit Theorem In general a sampling distribution of a statistic is a probability distribution such as the normal distribution for all possible values of the statistic computed from a sample of size n The sampling distribution of the sample mean is a 
probability distribution of all possible values of the random variable $\bar{x}$ computed from a sample of size $n$ from a population with mean $\mu$ and standard deviation $\sigma$.

The idea behind obtaining the sampling distribution of the mean is as follows:

1. Obtain a simple random sample of size $n$.
2. Compute the sample mean.
3. Assuming that we are sampling from a finite population, repeat steps 1 and 2 until all simple random samples of size $n$ have been obtained.

Since each sample of size $n$ will have an observed value of $\bar{x}$, and not all observed values will be exactly the same, $\bar{x}$ is a random variable. Since $\bar{x}$ is a random variable, we can ask the following questions:

- What is $E(\bar{x})$?
- What is $\mathrm{Var}(\bar{x})$?
- What is the distribution of $\bar{x}$?

The Mean and Standard Deviation of the Sampling Distribution of $\bar{x}$

Suppose that a simple random sample of size $n$ is drawn from a population with mean $\mu$ and standard deviation $\sigma$. The sampling distribution of $\bar{x}$ will have mean $\mu_{\bar{x}} = \mu$ and standard deviation $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$. The standard deviation of the sampling distribution of $\bar{x}$, $\sigma_{\bar{x}}$, is called the standard error of the mean.

Now we have answered the questions:

- What is $E(\bar{x})$?
- What is $\mathrm{Var}(\bar{x})$?

What about the distribution of $\bar{x}$?

If a random variable $X$ is normally distributed with mean $\mu$ and standard deviation $\sigma$, then the distribution of the sample mean $\bar{x}$ is normally distributed with mean $\mu_{\bar{x}} = \mu$ and standard deviation $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$.

What happens if the distribution of $X$ is not normal?

CENTRAL LIMIT THEOREM

Suppose a random variable $X$ has a population mean $\mu$ and standard deviation $\sigma$, and that a random sample of size $n$ is taken from this population. Then the sampling distribution of $\bar{x}$ becomes approximately normal as the sample size $n$ increases. The mean of the distribution is $\mu_{\bar{x}} = \mu$ and the standard deviation is $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$.

Let us visualize this.

[Figure: simulated sampling distributions of the sample mean]

When is n large enough to assume normality?

The size of $n$ depends on how close to normal the original population is. If the population is normal, $n = 1$ is
large enough. As a rule of thumb, we will use $n \ge 30$ as sufficiently large. Hence, when $n \ge 30$ the sampling distribution of $\bar{x}$ will be approximately normal:

$\bar{x} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$

Example: The length of human pregnancies is approximately normally distributed with a mean of 266 days and a standard deviation of 16 days.

1. What is the probability a randomly selected pregnancy lasts less than 260 days?
2. What is the probability that a random sample of 20 pregnancies has a mean gestation period of 260 days or less?
3. What is the probability that a random sample of 50 pregnancies has a mean gestation period of 260 days or less?
4. What might you conclude if a random sample of 50 pregnancies resulted in a mean gestation period of 260 days or less?

Section 7.6 The Normal Approximation to the Binomial Probability Distribution

Criteria for a Binomial Probability Experiment

A probability experiment is said to be a binomial experiment if all the following are true:

- The experiment is performed $n$ independent times. Each repetition of the experiment is called a trial. Independence means that the outcome of one trial will not affect the outcome of the other trials.
- For each trial there are two mutually exclusive outcomes: success or failure.
- The probability of success, $p$, is the same for each trial of the experiment.

When we were dealing with probabilities for the binomial distribution, we only set up an expression, since evaluating it is mathematically very tedious. However, we now have a way to approximate those probabilities.

As the number of trials $n$ in a binomial experiment increases, the probability distribution of the random variable $X$ becomes more nearly symmetric and bell-shaped. As a general rule of thumb, if $np > 5$ and $nq > 5$, then the probability distribution will be approximately symmetric and bell-shaped.

The Normal Approximation to the Binomial Probability Distribution

If $np \ge 10$ and $n(1 - p) = nq \ge 10$, then the binomial random variable $X$ is approximately normally distributed with mean $\mu_X = np$ and standard
deviation $\sigma_X = \sqrt{npq}$.

What is the major difference between a binomial random variable and a normal random variable?

A binomial random variable is a discrete random variable and a normal random variable is a continuous random variable. Therefore, since we are using a continuous density function to approximate a discrete probability, we must apply a correction for continuity. The continuity correction says that we add and subtract 0.5 from every value of $X$.

Example: Suppose a softball player safely reaches base 45% of the time. Assuming at-bats are independent events, use the normal approximation to the binomial to approximate the probability that in the next 100 at-bats:

1. The player reaches base safely exactly 50 times.
2. The player reaches base safely 60 or more times.
3. The player reaches base safely 50 or fewer times.
4. The player reaches base safely between 60 and 90 times, inclusive.

Chapter 9 Hypothesis Testing

Section 9.1 The Language of Hypothesis Testing

Steps in Hypothesis Testing

1. A claim is made.
2. Evidence (sample data) is collected in order to test the claim.
3. The data are analyzed in order to support or refute the claim.

A hypothesis is a statement or claim regarding a characteristic of one or more populations.

Hypothesis testing is a procedure, based on sample evidence and probability, used to test claims regarding a characteristic of one or more populations.

The null hypothesis, denoted $H_0$, is a statement to be tested. The null hypothesis is assumed true until evidence indicates otherwise. In this chapter, it will be a statement regarding the value of a population parameter.

The alternative hypothesis, denoted $H_1$ or $H_A$, is a claim to be tested. We are trying to find evidence for the alternative hypothesis. In this chapter, it will be a claim about a population parameter.

We have two types of alternative hypothesis: one-sided and two-sided alternatives.

1. Equal hypothesis versus not-equal hypothesis (two-sided test)
   - $H_0$: parameter = some value
   - $H_1$: parameter $\ne$ some value

2a. Equal
hypothesis versus less than (one-sided test)
   - $H_0$: parameter = some value
   - $H_1$: parameter < some value

2b. Equal hypothesis versus greater than (one-sided test)
   - $H_0$: parameter = some value
   - $H_1$: parameter > some value

Example: Determine the null and alternative hypotheses for each of the following.

1. According to the United States Department of Agriculture, the mean farm rent in Indiana was $89.00 per acre in 1995. A researcher for the USDA claims that the mean rent has decreased since then.
2. According to the United States Census Bureau, 16.3% of Americans did not have health insurance coverage in 1998. A politician claims that this percentage has decreased since 1998.
3. According to the United States Energy Information Administration, the mean expenditure for residential energy consumption was $1,338 in 1997. An economist claims that the mean expenditure for residential energy is different today.

Type I and Type II Errors

Four Outcomes from Hypothesis Testing

1. We reject the null when in fact the alternative is true. This decision would be correct.
2. We fail to reject the null when in fact the null is true. This decision would be correct.
3. We reject the null when in fact the null is true. This decision would be incorrect. This type of error is called a Type I error.
4. We fail to reject the null when in fact the alternative is true. This decision would be incorrect. This type of error is called a Type II error.

The level of significance, $\alpha$, is the probability of making a Type I error. We refer to the probability of making a Type II error as $\beta$.

Example: For each of the following, explain what it would mean to make a Type I and a Type II error.

1. According to the United States Department of Agriculture, the mean farm rent in Indiana was $89.00 per acre in 1995. A researcher for the USDA claims that the mean rent has decreased since then.
2. According to the United States Census Bureau, 16.3% of Americans did not have health insurance coverage in 1998. A politician claims that this percentage has decreased since 1998.
3. According to
the United States Energy Information Administration, the mean expenditure for residential energy consumption was $1,338 in 1997. An economist claims that the mean expenditure for residential energy is different today.

Section 9.2 Testing a Hypothesis About $\mu$, $\sigma$ Known

The Classical Method of Testing a Hypothesis

If a claim is made regarding the population mean with $\sigma$ known, we use the following steps to test the claim, provided:

- The sample is obtained using simple random sampling.
- The population from which the sample is drawn is normally distributed, or the sample size $n$ is large.

Use the following steps to test the hypothesis:

1. A claim is made regarding the population mean. The claim is used to determine the null and alternative hypotheses.
2. Select a level of significance $\alpha$ based upon the seriousness of making a Type I error. The level of significance is used to determine the critical value. The critical value represents the maximum number of standard deviations the sample mean can be from $\mu_0$ before the null hypothesis is rejected.
3. Provided the population from which the sample is drawn is normally distributed or the sample size is large, and the population standard deviation $\sigma$ is known, the distribution of the sample mean $\bar{x}$ is normal with mean $\mu_0$ and standard deviation $\frac{\sigma}{\sqrt{n}}$. Therefore

   $z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$

   represents the number of standard deviations the sample mean is from the assumed mean $\mu_0$. $z$ is called the test statistic.
4. Compare the value of the test statistic to that of the critical value to make a decision regarding the null hypothesis.
   - Two-tailed: if $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$, reject the null hypothesis.
   - Left-tailed: if $z < -z_{\alpha}$, reject the null hypothesis.
   - Right-tailed: if $z > z_{\alpha}$, reject the null hypothesis.
5. State the conclusion.

Example: A researcher claims that the average age of a woman before she has her first child is greater than the 1990 mean age of 24.6 years, on the basis of data obtained from the National Vital Statistics Report, Vol. 48, No. 14. She obtains a simple random
sample of 40 women who gave birth to their first child in 1999 and finds the sample mean age to be 27.1 years. Assume that the population standard deviation is 6.4 years. Test the researcher's claim using the classical approach at the $\alpha = 0.05$ level of significance.

Testing a Hypothesis about $\mu$ with $\sigma$ Known Using p-Values

A p-value is the probability of observing a test statistic as extreme or more extreme than the one observed, under the assumption that the null hypothesis is true.

Use the following steps to test the hypothesis:

1. A claim is made regarding the population mean. The claim is used to determine the null and alternative hypotheses.
2. Compute the test statistic $z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$.
3. Use the value of the test statistic to obtain the p-value for a one-sided test from the Z-table. If the hypothesis is two-sided, double the one-sided p-value to obtain the two-sided p-value.
4. Make a decision: if the p-value is less than $\alpha$, reject the null; if the p-value is greater than $\alpha$, fail to reject the null.
5. State the conclusion.

Example: In 1990, the average farm size in Kansas was 694 acres, according to data obtained from the U.S. Department of Agriculture. A researcher claims that farm sizes are larger now due to consolidation of farms. She obtains a random sample of 40 farms and determines the mean size to be 731 acres. Assume that $\sigma = 212$ acres. Test the researcher's claim using the p-value approach at the $\alpha = 0.05$ level of significance.

Using Confidence Intervals to Test Hypotheses

When testing $H_0: \mu = \mu_0$ vs. $H_1: \mu \ne \mu_0$, if a $(1 - \alpha) \cdot 100\%$ confidence interval contains $\mu_0$, we do not reject the null hypothesis. However, if the confidence interval does not contain $\mu_0$, we have evidence that supports the claim stated in the alternative hypothesis and conclude $\mu \ne \mu_0$ at the level of significance $\alpha$.

Section 9.3 Testing a Hypothesis about $\mu$, $\sigma$ Unknown

The procedures here are exactly the same, except now the test statistic will follow a t-distribution with $n - 1$ degrees of freedom.

Test Statistic: $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$

Example: In
1989, the average age of an inmate on death row was 36.2 years, according to data obtained from the U.S. Department of Justice. A sociologist wants to test the claim that the average age of a death row inmate has changed since then. She randomly selects 32 death row inmates and finds that their mean age is 38.9, with a standard deviation of 9.6. Using the classical approach, test the sociologist's claim at the $\alpha = 0.01$ level of significance.

Example: The mean monthly cellular telephone bill in 1999 was $40.24, according to the Cellular Telecommunications Industry Association. A researcher at CTIA claims that the average monthly bill has changed since then. He conducts a survey of 49 cellular phone users and determines the mean bill to be $45.15, with a standard deviation of $21.20. Test the researcher's claim using the p-value approach at the $\alpha = 0.10$ level of significance.

Section 9.4 Testing a Hypothesis About a Population Proportion

The procedures here are exactly the same, except now the test statistic will again follow a standard normal distribution. However, remember that we must check that $np > 10$ and $n(1 - p) > 10$, and that $n \le 0.05N$.

Test Statistic: $z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}}$

Example: In a survey conducted by the Gallup Organization between August 29 and September 5, 2000, 395 of 1012 adults aged 18 years or older said they had a gun in the house. In 1990, 47% of households had a gun. Is there significant evidence to support the claim that the proportion of households that have a gun has decreased since 1990 at the $\alpha = 0.01$ level of significance? Use the classical approach.

Example: Pathological gambling is an impulse-control disorder. The American Psychiatric Association lists 10 characteristics that diagnose the disorder in its DSM-IV manual. The National Gambling Impact Study Commission randomly selected 2417 adults and found that 35 were pathological gamblers. Is there evidence to support the claim that more than 1% of the adult population are pathological gamblers at the $\alpha = 0.05$ level of significance?

Chapter 7 Sampling
Distribution

Sampling Distributions: Quantitative Variable

[Diagram: Population (parameter) and Sample (statistic), linked by inference]

In particular, we want to see what happens to the sample mean when we repeatedly sample from the population.

Example
- Population: Stat 104, this section
- Variable: Number of siblings
- Type of variable: Numerical or quantitative

However, what if everyone in the class is not available? In this case we have to rely on sample data to tell us something about the population.

Example
- Population: All Stat 104 students in Section B
- Population parameter: The average number of siblings for a Stat 104 Section B student
- Sample: 4 randomly selected students
- Sample statistic: The sample mean number of siblings for 4 students

What have we learned?
- Different samples produce different sample means.
- There is variation among sample means.
- Can we model this variation? What is a model for the distribution of the sample mean?

Sampling Distributions: The Central Limit Theorem

In general, a sampling distribution of a statistic is a probability distribution, such as the normal distribution, for all possible values of the statistic computed from a sample of size $n$.

The sampling distribution of the sample mean is a probability distribution of all possible values of the random variable $\bar{x}$ computed from a sample of size $n$ from a population with mean $\mu$ and standard deviation $\sigma$.

1. Obtain a simple random sample of size $n$.
2. Compute the sample mean.
3. Assuming that we are sampling from a finite population, repeat steps 1 and 2 until all simple random samples of size $n$ have been obtained.

For a population of size $N = 100$ and sample size $n = 5$, what is the number of possible samples of size 5?

$\binom{100}{5} = \frac{100!}{5!\,95!} = \frac{100 \cdot 99 \cdot 98 \cdot 97 \cdot 96 \cdot 95!}{5 \cdot 4 \cdot 3 \cdot 2 \cdot 1 \cdot 95!} = 75{,}287{,}520$

Since each sample of size $n$ will have an observed value of $\bar{x}$, and not all observed values will be exactly the same, $\bar{x}$ is a random variable. Since $\bar{x}$ is a random variable, we can ask the following questions:

- What is $E(\bar{x})$?
- What is $\mathrm{Var}(\bar{x})$?
- What is the distribution of $\bar{x}$?

The Mean and Standard Deviation of the Sampling
Distribution of $\bar{x}$

Suppose that a simple random sample of size $n$ is drawn from a population with mean $\mu$ and standard deviation $\sigma$. The sampling distribution of $\bar{x}$ will have mean $\mu_{\bar{x}} = \mu$ and standard deviation $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$. The standard deviation of the sampling distribution of $\bar{x}$ is called the standard error of the mean.

Now we have answered the questions:

- What is $E(\bar{x})$? The population mean: $E(\bar{x}) = \mu$.
- What is $\mathrm{Var}(\bar{x})$? The population variance divided by the sample size: $\mathrm{Var}(\bar{x}) = \frac{\sigma^2}{n}$.

What about the distribution of $\bar{x}$?

Simulation

We can simulate the repeated random selection of samples of individuals from a population.

www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html

If a random variable $X$ is normally distributed with mean $\mu$ and standard deviation $\sigma$, then the distribution of the sample mean $\bar{x}$ is normally distributed with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$.

What happens if the distribution of $X$ is not normal?

CENTRAL LIMIT THEOREM

Suppose a random variable $X$ has a population mean $\mu$ and standard deviation $\sigma$, and that a random sample of size $n$ is taken from this population. Then the sampling distribution of $\bar{x}$ becomes approximately normal as the sample size $n$ increases. The mean of the distribution is $\mu_{\bar{x}} = \mu$ and the standard deviation is $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$:

$\bar{x} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$

Let us visualize this.

[Figure: simulated distribution of sample means from 100 random draws]

When is n large enough to assume normality?

The size of $n$ depends on how close to normal the original population is.

- If the population is normal, $n = 1$ is large enough.
- As a rule of thumb, we will use $n \ge 30$ as sufficiently large. Hence, when $n \ge 30$ the sampling distribution will be approximately normal: $\bar{x} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$.

Example: The length of human pregnancies is approximately normally distributed with a mean of 266 days and a standard deviation of 16 days.

1. What is the probability a randomly selected pregnancy lasts less than 260 days?
2. What is the probability that a random sample of 20 pregnancies has a mean gestation period of 260 days or less?
3. What is the probability that a random sample of 50 pregnancies has a mean gestation period of 260 days or less?
4. What might you conclude if a random sample of 50 pregnancies resulted in a mean gestation period of 260 days or less?

Conditions (NOT IN BOOK)

- Random sampling condition
  - Samples must be selected at random from the population.
  - Without this, results from previous slides don't necessarily hold.
- 10% condition
  - When sampling without replacement, the sample size should be less than 10% of the population size.

Example: Apples

The diameter of red delicious apples is normally distributed with a mean of 3 inches and a standard deviation of 0.25 inches. A random sample of 100 apples is gathered and the mean diameter is obtained. (Several numeric values in this example are illegible in the source and are left as gaps.)

- Identify the sampling distribution.
- If another sample of size 100 is taken, what is the probability that the sample mean will be greater than 2.5[?] inches?
- What is the probability that the sample mean will be between 2.[?] and 2.81 inches?

Stat 104 Lecture 18

Sampling Distributions: Quantitative variable

[Diagram: Population (parameter) and Sample (statistic), linked by inference]

Example
- Population: Stat 104 students in Section A
- Variable: Number of children in your family
- Type of variable: Numerical or quantitative

Example
- Population: All Stat 104 students in Section A
- Population parameter: The average number of children in a family of a Stat 104 Section A student
- Sample: 4 randomly selected students
- Sample statistic: The sample mean number of children in the 4 students' families

What have we learned?
- Different samples produce different sample means.
- There is variation among sample means.
- Can we model this variation? What is a model for the distribution of the sample mean?

Simulation

We can simulate the repeated random selection of samples of individuals from a population.

www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html

Simulation
- Simple random sample of size $n = 5$.
- Repeat many times.
- Record the sample mean to simulate the sampling distribution of $\bar{x}$.

Simulation
- Different samples will produce different sample means.
- There is variation in the sample means.
- Can
we model this variation? What is the distribution of the sample mean?

Formulas

Probability Rules

- Complements $A$ and $A^c$: $P(A^c) = 1 - P(A)$
- $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$
- $P(A \text{ and } B) = P(A)\,P(B \mid A) = P(B)\,P(A \mid B)$
- $A$ and $B$ are mutually exclusive if $P(A \text{ and } B) = 0$.
- $A$ and $B$ are independent if $P(A) = P(A \mid B)$ or $P(B) = P(B \mid A)$.

Discrete random variable $x$

- Probability distribution: $0 \le P(x) \le 1$, $\sum P(x) = 1$
- Mean: $\mu = \sum x\,P(x)$
- Standard deviation: $\sigma = \sqrt{\sum (x - \mu)^2 P(x)}$

Binomial random variable: $x$ counts the number of successes in $n$ independent trials, where the probability of success on any one trial is $p$.

- Probability function: $P(x) = \binom{n}{x} p^x (1 - p)^{n - x}$, $x = 0, 1, 2, \ldots, n$
- Mean: $\mu = np$
- Standard deviation: $\sigma = \sqrt{np(1 - p)}$

Normal random variable $y$

- Mean: $\mu$
- Standard deviation: $\sigma$
- Standardize: $z = \frac{y - \mu}{\sigma}$

Distribution of the sample mean $\bar{y}$

- Shape: approximately normal
- Mean: $\mu$
- Standard deviation: $SD(\bar{y}) = \frac{\sigma}{\sqrt{n}}$
- Standardize: $z = \frac{\bar{y} - \mu}{SD(\bar{y})}$

Chapter 3 Numerically Summarizing Data

Section 3.1 Measures of Central Tendency

Measures of Central Tendency

- Mean
- Median
- Mode
- Midrange

Measures of central tendency are numeric values that locate, in some sense, the middle of a data set.

We have all heard the term "average." Most generally, when this term is used it refers to the mean, but one must be careful because it can refer to the mean, the median, or the mode. Each measure gives very different information.

Recall: A parameter is a descriptive measure of a population. A statistic is a descriptive measure of a sample.

The arithmetic mean of a variable is computed by determining the sum of all the values of the variable in the data set, divided by the number of observations. The population mean, $\mu$, is computed using all the individuals in a population. The population mean is a parameter. The sample mean, $\bar{x}$, is computed using the sample data. The sample mean is a statistic.

1. Mean

If $x_1, x_2, \ldots, x_N$ are the $N$ observations of a variable from a population, then the population mean $\mu$ is

$\mu = \frac{\sum x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$

If $x_1, x_2, \ldots, x_n$ are the $n$ observations of a variable from a sample, then the sample mean $\bar{x}$ is

$\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$

2. Median

The median of a variable is the value that lies in the
middle of the data when arranged in ascending order. That is, half the data are below the median and half the data are above the median. We use $M$ to represent the median.

Steps in Computing the Median of a Data Set

1. Arrange the data in ascending order.
2. Determine the number of observations, $n$.
3. Determine the observation in the middle of the data set.
   a. If the number of observations is odd, then the median is the data value that is exactly in the middle of the data set. That is, the median is the observation that lies in the $\frac{n + 1}{2}$ position.
   b. If the number of observations is even, then the median is the mean of the two middle observations in the data set. That is, the median is the mean of the data values that lie in the $\frac{n}{2}$ and $\frac{n}{2} + 1$ positions.

3. Mode

The mode of a variable is the most frequent observation of the variable that occurs in the data set. If a data set has two values that occur with the highest frequency, we say the data are bimodal. If a data set has three or more data values that occur with the highest frequency, the data set is multimodal.

4. Midrange

The midrange is the average of the smallest and largest data values:

$\text{midrange} = \frac{\text{smallest} + \text{largest}}{2}$

Example: You are given the following starting salaries for five graduates from the business college at Iowa State University:

$35,000   $37,000   $35,000   $33,000   $210,000

Find the mean, median, mode, and midrange.

Mean:
Median:
Mode:
Midrange:

What would be the most appropriate measure of central tendency to report to give future graduates a good indication of what starting salary to expect? Why?

Notice that the mean is the most sensitive to extreme values, while the median is not. We say the median is resistant to extreme values.

Let us take a look at how the mean compares to the median and mode in the various distributions.

[Figure: Distributions, including skewed-left and skewed-right shapes]

When do we use the mean, median, and mode?

Mean: When the data are quantitative and the frequency distribution is roughly symmetric.

Median: When the data are quantitative and the
frequency distribution is skewed left or skewed right.

Mode: When the most frequent observation is the desired measure of central tendency, or the data are qualitative.

Section 3.2 Measures of Dispersion

Measures of Dispersion (Spread)

1. Range
2. Interquartile Range (Sec. 3.4)
3. Variance
4. Standard Deviation

To describe a distribution completely, we need more than just the center of the distribution. Measures of dispersion give an indication of how much variability exists in a data set.

1. Range

The range, $R$, of a variable is the difference between the largest data value and the smallest data value. That is,

Range = $R$ = Largest Data Value $-$ Smallest Data Value

2. Interquartile Range

The interquartile range, IQR, is the difference between the third and first quartiles:

$IQR = Q_3 - Q_1$

3. Variance

The variance is based upon the difference between each observation and the mean. It is calculated as a mean of the squared deviations. The divisor used in the calculation of this mean depends on whether we are calculating the population variance or the sample variance.

Deviation about the mean: $x_i - \mu$ or $x_i - \bar{x}$

What would be the value if I were to add all the deviations? Why?

The population variance of a variable is the sum of the squared deviations about the population mean divided by the number of observations in the population, $N$. That is, it is the mean of the squared deviations about the population mean. The population variance is symbolically represented by $\sigma^2$:

$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N}$

where $x_1, x_2, \ldots, x_N$ are the $N$ observations in the population and $\mu$ is the population mean.

Notice that the population variance $\sigma^2$ is a parameter.

An algebraically equivalent formula for computing the population variance is

$\sigma^2 = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{N}}{N}$

where $\sum x_i^2$ means to square each observation and then sum these squared values, and $\left(\sum x_i\right)^2$ means to add all the observations and then square the sum.

The sample variance, $s^2$, is computed by determining the sum of the squared deviations about the sample mean and dividing this result by $n - 1$. The
formula for the sample variance from a sample of size $n$ is

$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$

where $x_1, x_2, \ldots, x_n$ are the $n$ observations in the sample and $\bar{x}$ is the sample mean.

Notice that the sample variance $s^2$ is a statistic.

An algebraically equivalent formula for computing the sample variance is

$s^2 = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n - 1}$

where $\sum x_i^2$ means to square each observation and then sum these squared values, and $\left(\sum x_i\right)^2$ means to add all the observations and then square the sum.

Notice that $s^2$ describes the variance of a sample and is therefore a statistic.

Notice that the sample variance is obtained by dividing by $n - 1$. If we divided by $n$, as we might expect, the sample variance would consistently underestimate the population variance.

Whenever a statistic consistently overestimates or underestimates a parameter, it is called biased. To obtain an unbiased estimate of the population variance, divide the sum of the squared deviations about the mean by $n - 1$. Hence, we have $n - 1$ degrees of freedom in the computation of $s^2$, because an unknown parameter $\mu$ is estimated with $\bar{x}$.

4. Standard Deviation

The population standard deviation, $\sigma$, is obtained by taking the square root of the population variance. That is,

$\sigma = \sqrt{\sigma^2}$

Notice that $\sigma$ is a parameter, since it describes a measure of the population.

The sample standard deviation, $s$, is obtained by taking the square root of the sample variance. That is,

$s = \sqrt{s^2}$

Notice that $s$ is a statistic, since it describes a measure of the sample.

If the standard deviation is just the square root of the variance, what information is the standard deviation giving us that the variance is not? Can you think of a reason why we use the standard deviation over the variance?

Understanding Standard Deviation

Recall the data set from Chapter 2 on the three-year rate of return on a mutual fund. Suppose we are comparing two mutual funds that have the same mean, but one has a standard deviation of 8 and the other a standard deviation of 20. Which would you invest your money in? Why?

Example: Nine randomly selected
students from a section of STAT 104 measured their pulse. The following data were obtained:

76   60   60   81   72   80   80   68   73

Find the range, variance, and standard deviation.

Range:

Variance: (worksheet with a column of the $x$ values 76, 60, 60, 81, 72, 80, 80, 68, 73 and a Totals row)

Standard Deviation:

The Empirical Rule

If a distribution is roughly bell-shaped, then

- Approximately 68% of the data will lie within one standard deviation of the mean.
- Approximately 95% of the data will lie within two standard deviations of the mean.
- Approximately 99.7% of the data will lie within three standard deviations of the mean.

[Figure: Visualizing the Empirical Rule]

Section 3.4 Measures of Position

Measures of position are used to describe the relative position of a certain data value within the entire set of data.

The z-score represents the number of standard deviations that a data value is from the mean. It is obtained by subtracting the mean from the data value and dividing this result by the standard deviation. There is both a population z-score and a sample z-score; their formulas are as follows:

Population z-score: $z = \frac{x - \mu}{\sigma}$

Sample z-score: $z = \frac{x - \bar{x}}{s}$

The z-score is unitless. It has a mean of 0 and a standard deviation of 1.

Z-scores provide a way of comparing "apples to oranges" by converting variables with different centers and/or spreads to variables with the same center and spread.

Example: The average 20-29-year-old man is 69.9 inches tall, with a standard deviation of 3.0 inches. The average 20-29-year-old woman is 64.6 inches tall, with a standard deviation of 2.8 inches. Who is relatively taller: a 75-inch man or a 70-inch woman?

Percentiles are the values of the variable that divide a set of ranked data into 100 equal subsets. Each set of data has 99 percentiles. The kth percentile, denoted $P_k$, is a value such that at most $k$ percent of the data are smaller in value than $P_k$ and at most $(100 - k)$ percent of the data are larger in value.

Determining the kth Percentile, $P_k$

1. Arrange the data in ascending order.
2. Compute an index $i$ using the formula

   $i = \left(\frac{k}{100}\right) n$

   where $k$ is the percentile of the data value
and n is the number of individuals in the data set.
3. (a) If i is not an integer, round up to the next highest integer. Locate the ith value of the data set written in ascending order. This number represents the kth percentile.
   (b) If i is an integer, the kth percentile is the mean of the ith and (i + 1)st data values.

Often we are interested in knowing the percentile to which a specific data value corresponds.

Finding the Percentile that Corresponds to a Data Value
- Arrange the data in ascending order.
- Use the following formula to determine the percentile of the score x:
  $\text{Percentile of } x = \frac{\text{number of data values less than } x}{n} \cdot 100$
- Round this number to the nearest integer.

Quartiles are the most common percentiles. They divide the data into four equal parts.
- Q1 represents the 1st quartile. It is also the 25th percentile.
- Q2 represents the 2nd quartile. It is also the 50th percentile. Note: This is also the median.
- Q3 represents the 3rd quartile. It is also the 75th percentile.

Outliers are extreme observations in a data set. Outliers distort both the mean and the standard deviation, since neither is a resistant measure. Because these measures often form the basis for most statistical inference, any conclusions drawn from a data set that contains outliers can be flawed. We check for outliers using the interquartile range.

Checking for Outliers by Using Quartiles
- Determine the first and third quartiles of the data.
- Compute the interquartile range. The interquartile range, or IQR, is the difference between the third and first quartile: $IQR = Q_3 - Q_1$.
- Determine the fences. Fences serve as cutoff points for determining outliers.
  Lower Fence $= Q_1 - 1.5(IQR)$
  Upper Fence $= Q_3 + 1.5(IQR)$
- If a data value is less than the lower fence or greater than the upper fence, then it is considered an outlier.

Example: The following data represent the number of inches of rain in Chicago during the month of April for 20 randomly selected years.
Find the quartiles.
Find the 67th percentile.
What percentile is represented by 6.28 inches of rain?
Are there any outliers in this data set?

Section 3.5 The Five-Number Summary; Boxplots

Exploratory Data Analysis: This is the area of statistics that looks at data in order to spot any interesting results that might be concluded from the data. The idea here is to draw graphs of data and obtain measures of central tendency and spread in order to form some conjectures regarding the data.

Rather than numerically describing a distribution via the mean and standard deviation, exploratory data analysis summarizes a distribution by using measures that are resistant to extreme observations. Such a measure would be the five-number summary.

Five-Number Summary
- Minimum
- Q1
- M
- Q3
- Maximum

We use the five-number summary to construct a boxplot.

Drawing a Boxplot
- Determine the upper ($Q_3 + 1.5\,IQR$) and lower ($Q_1 - 1.5\,IQR$) fences.
- Draw vertical lines at each quartile. Enclose these vertical lines in a box.
- Label the lower and upper fences.
- Draw a line from the first quartile to the smallest data value that is larger than the lower fence. Draw a line from the third quartile to the largest data value smaller than the upper fence.
- Any data values less than the lower fence or greater than the upper fence are outliers and are marked with an asterisk.

Distribution Shape Based upon Boxplot
- If the median is near the center of the box and each of the horizontal lines is of approximately equal length, then the distribution is roughly symmetric.
- If the median is to the left of the center of the box or the right line is substantially longer than the left line, the distribution is skewed right.
- If the median is to the right of the center of the box or the left line is substantially longer than the right line, the distribution is skewed left.

Example: The following data represent the number of grams of fat in breakfast meals offered at McDonald's.

12 23 28 2 31 37 34 15 23 38 31 16 11 8 8 17 20

Find the five-number summary.
Construct a boxplot.
Comment on the shape of the distribution.

Chapter 9 Inferences
Involving One Population

In Ch. 8 we had σ known and used the CLT to say $\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$, so we can use the standardization $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$.

What happens when we don't know σ? Recall the point estimate for σ is s, the sample standard deviation. Can we say $\bar{X} \sim N\left(\mu, \frac{s}{\sqrt{n}}\right)$ and use $Z = \frac{\bar{X} - \mu}{s/\sqrt{n}}$?

An employee of the Guinness Brewing Co., William Sealy Gosset, a statistician and eventually the Master Brewer for Guinness who worked under the pen name "Student", discovered that simply substituting the sample standard deviation into the formulas designed for the population standard deviation did not give the correct results. He applied his statistical knowledge both in the brewery and on the farm, to the selection of the best yielding varieties of barley.

Properties of the t-Distribution
1. Different for different degrees of freedom.
2. Centered at 0 and symmetric about 0.
3. Area under the curve is 1. Area under the curve to the right of 0 equals area under the curve to the left of 0, which equals 1/2.
4. As t increases or decreases without bound, the graph approaches but never equals 0.
5. The area in the tails is a little greater than the area in the tails of the standard normal, due to estimating σ with s, which introduces further variability.
6. As n increases, the density for t gets closer to a standard normal density: due to the Law of Large Numbers, as n increases, s gets closer to σ.

Testing a Hypothesis about μ, σ Unknown

The procedures here are exactly the same, except now the test statistic will follow a t-distribution with n - 1 degrees of freedom.

Test statistic: $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$

p-value approach:
1. Hypothesis
2. Test statistic
3. p-value (t-table with n - 1 df)
4. Decision
5. Conclusion

[Flowchart: "Normal sample?" and "Is n large?" decisions leading to "Use standard normal" (z and σ/√n) or "Use t-distribution"; remaining labels not recoverable]

Note: Getting real-life data from a normal distribution with σ known is EXTREMELY rare (virtually nonexistent).

Same procedures as before, replacing z
with t.

Assumption: the sampled population is normally distributed or n > 30.

1. Confidence interval (book notation): $\bar{x} \pm t(\mathrm{df}, \alpha/2)\,\frac{s}{\sqrt{n}}$
2. Test statistic: $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$, with df = n - 1

Example: A Gallup poll conducted January 17 to February 6, 2005 asked 1028 teenagers aged 13 to 17, "Typically, how many hours per week do you spend watching TV?" Survey results indicate that $\bar{x} = 13$ and s = 2 hours. Construct a 95% confidence interval for the number of hours of TV teenagers watch each week.

Example: The mean monthly cellular telephone bill in 1999 was $40.24, according to the Cellular Telecommunications Industry Association (CTIA). A researcher at CTIA claims that the average monthly bill has changed since then. He conducts a survey of 49 cellular phone users and determines the mean bill to be $45.15 with a standard deviation of $21.20. Test the researcher's claim using the p-value approach at the α = 0.10 level of significance.
1. Hypothesis
2. Test statistic
3. p-value
4. Decision
5. Conclusion

[Table: t-distribution, area in the right tail by degrees of freedom; entries not recoverable]

Confidence Interval Review
- Find the confidence interval.
- Interpretation: I am (1 - α)100% confident that the population mean μ is between the lower and upper limit.
- Meaning of the confidence interval: If we obtained 100 simple random samples of size n from the population whose mean μ is unknown, then approximately (1 - α)100 of the intervals will contain μ.

Example
- What is the mean alcohol content of beer?
- A random sample of 10 beers is taken and the alcohol content is measured.
- Do we know the value of the population standard deviation σ for the population of all beers?

Sample data (alcohol content)
Sample size: n = 10
Sample mean: $\bar{x} = 4.762$
Sample standard deviation: s = 0.314

Calculation:
$\bar{x} \pm t\,\frac{s}{\sqrt{n}} = 4.762 \pm 2.262 \cdot \frac{0.314}{\sqrt{10}} = 4.762 \pm 0.225$, which gives the interval (4.537, 4.987).

Interpretations
- We are 95% confident that the population mean alcohol content of beer is
between 4.537 and 4.987.
- If we were to repeat this procedure 100 more times and collect 100 confidence intervals, we would expect 95 of them to contain the true mean alcohol content.

Hypothesis testing
- Using the sample above, test the claim that the population mean alcohol content of beer is not 5 at the α = 0.05 level of significance.
- $H_0: \mu = 5$ versus $H_a: \mu \neq 5$. σ is unknown, so use the t-table.
- Recall the sample mean for the 10 beers shown before was 4.762 and the sample standard deviation was 0.314.
- Test statistic: $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{4.762 - 5}{0.314/\sqrt{10}} = \frac{-0.238}{0.0993} = -2.397$
- Find the p-value for t = -2.397 with 9 df and the 0.05 level of significance.

[Table: t-distribution, area in the right tail by degrees of freedom; entries not recoverable]

- Decision: Reject the null hypothesis because the p-value is smaller than 0.05.
- Conclusion: There is sufficient statistical evidence to conclude that the population mean alcohol content of beer is not 5.

[Flowchart: "What is the unknown parameter of interest?"; remaining labels not recoverable]

Inference About a Binomial Probability of Success, p

Point Estimate for a Population Proportion

Each individual either does or does not have a certain characteristic. The best point estimate of p, denoted $\hat{p}$, the proportion of the population with a certain characteristic, is given by

$\hat{p} = \frac{x}{n}$

where x is the number of individuals in the sample with the specified characteristic, i.e., x is the number of successes from a Binomial(n, p) distribution.

In Ch. 7 we learned the sampling distribution of the sample mean is normally distributed by the Central Limit Theorem. What can we say about the sampling distribution of $\hat{p}$?

Online Simulation. Population: Reese's Pieces
[Applet URL not recoverable]

Population parameter: p = proportion of orange Reese's Pieces

[Applet screenshot: Reese's Pieces Samples, sample size 25]

Simulation
- Simple random sample of size n = 25
- Repeat several times
- Record the sample proportion of orange Reese's Pieces

[Applet screenshot: Reese's Pieces Samples, sample size 25, distribution of sample proportions]

Sampling Distribution of $\hat{p}$

For a simple random sample of size n, the sampling distribution of $\hat{p}$ is approximately normal with mean $\mu_{\hat{p}} = p$ and standard error $\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}}$, provided that np > 5 and nq > 5. Hence $\hat{p} \sim N\left(p, \sqrt{\frac{pq}{n}}\right)$.

Conditions:
1. n > 20 (large enough sample size)
2. np > 5 and nq > 5
3. 10% condition: the sample consists of less than 10% of the population

Constructing a (1 - α)100% Confidence Interval for a Population Proportion

Suppose a simple random sample of size n is taken from a population. A (1 - α)100% confidence interval for p is given by the following quantities:

Lower limit: $\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$
Upper limit: $\hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$

In general: Estimate ± CV · SE(Estimate), where here the estimate is $\hat{p}$ and SE(Estimate) is $SE(\hat{p})$.

Example: The drug Lipitor is meant to lower cholesterol levels. In a clinical trial of [illegible] patients who received [illegible] mg doses of Lipitor daily, [illegible] reported a headache as a side effect.
1. [beginning illegible] experience a headache as a side effect.
2. Verify that the requirements for constructing a confidence interval about p are satisfied.
3. Construct a [illegible]% confidence interval for the population proportion of Lipitor users who will report a headache as a side effect.

Example
Sample: 900 randomly selected registered voters nationwide (FOX News/Opinion Dynamics Poll, Jan. 30-31, [year illegible]). [Question text illegible; it concerns whether global warming exists.]
Find a 95% confidence interval for p.
- BUT we're not done; we need to say:
- We are 95% confident that the true population proportion of all registered voters in the US who believe global warming exists is between 0.795 and [illegible].

Interpretation of 95% Confidence
- If
one were to repeatedly sample at random 900 registered voters and compute a 95% confidence interval for each sample, 95% of the intervals produced would contain the population proportion p.
- If we were to repeat this experiment 100 times, i.e., take 100 samples of size 900, we would expect that approximately 95 of them would contain the true population proportion p.

Sample Size for Estimating the Population Proportion

The sample size required for estimating p with a margin of error ME is given by

$n = \hat{p}\hat{q}\left(\frac{z_{\alpha/2}}{ME}\right)^2$

where $\hat{p}$ is a prior estimate of p. If a prior estimate of p is unavailable, the sample size required is

$n = 0.25\left(\frac{z_{\alpha/2}}{ME}\right)^2$

Always round n up to the next integer.

Why use 0.25 if the prior is unavailable? In $n = \hat{p}\hat{q}\left(\frac{z_{\alpha/2}}{ME}\right)^2$, the product $\hat{p}\hat{q} = \hat{p}(1 - \hat{p})$ is largest when $\hat{p} = 0.5$, so $\max(\hat{p}\hat{q}) = 0.25$.

Example: [problem statement largely illegible; it asks for the sample size required at a stated confidence level if (a) she uses a 1995 estimate obtained from the U.S. Census Bureau ([value illegible]) and (b) she does not use any prior estimate]

Testing a Hypothesis About a Population Proportion

Provided that np > 5 and nq > 5, n > 20, and the 10% condition holds, then assuming $H_0: p = p_0$ is true, the test statistic is

$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0 q_0}{n}}} \sim N(0, 1)$

The standard error uses $p_0$, NOT $\hat{p}$, because we assume $H_0$ is true until we can prove otherwise.

Example: In [illegible], [illegible] of [illegible] adults aged 18 years or older said they had a gun in the house. In [illegible], 47% of households had a gun. Is there significant evidence to support the claim that [illegible] at the [illegible] level of significance? Use the p-value approach.
1. Hypothesis
2. Test statistic
3. p-value
4. Decision
5. Conclusion

Example: [problem statement illegible] Use the p-value approach.
1. Hypothesis
2. Test statistic
3. p-value
4. Decision
5. Conclusion

Example: Are non-whites underrepresented on juries in Story County?

According to the U.S. census, Story County has 9.7% of its population classified as non-white.
Population: All people eligible for jury duty in Story County.
Parameter: Proportion of all people eligible for jury duty who are non-white.
For a random sample of 120 people called for jury duty in Story County, only 3 are non-white. Is this convincing evidence of underrepresentation of non-whites? Use the p-value approach.

1. Check conditions
- Random sampling condition
- Sample
size
- Success/Failure condition

Chapter 2 Organizing And Summarizing Data

Section 2.1 Organizing Qualitative Data

Qualitative data allows for classification of individuals based on some attribute or characteristic. When this type of data is collected, we are most generally interested in determining the number of individuals that occur within each category.

Common Displays of Qualitative Data
- Frequency Distribution
- Relative Frequency Distribution
- Cumulative Relative Frequency Distribution
- Bar Chart
- Pie Graph
- Pareto Chart

A frequency distribution lists the number of occurrences for each category.

Example: A survey was taken of a past section of STAT 104 to determine what their current classification was. Please note: 1 = Freshman, 2 = Sophomore, 3 = Junior, 4 = Senior, 5 = Other. The following data were collected. We wish to construct a frequency distribution.

4 3 3 2 2 1 2 3 2 1 1 2 4 5 3 3 2 1 3 2 4 1 1 2 3 4 2 2 1 2 1 2 1 3 2 2 1 1 4 3 4 5 1 3 4 3 2 1 1 2 2 1 1 2 1 3 3 2 1 1 1 2 1 4

[Worksheet table: Class (starting with Freshman (1)) and frequency, with a Total row]

The relative frequency is the proportion or percent of observations within a category and is found using the formula

$\text{Relative Frequency} = \frac{\text{frequency}}{\text{sum of all frequencies}}$

A relative frequency distribution lists the relative frequencies of each category of data.

Convert the above frequency distribution to a relative frequency distribution.

[Worksheet table: Class (starting with Freshman (1)) and relative frequency]

The cumulative relative frequency is a running total of the relative frequencies.

Convert the above relative frequency distribution to a cumulative relative frequency distribution.

Class           Tally   Frequency   Relative Frequency   Cumulative Relative Frequency
Freshman (1)
Sophomore (2)
Junior (3)
Senior (4)
Other (5)

A bar graph is constructed by labeling each category of data on a horizontal axis and the frequency or relative frequency of the category on the vertical axis. A rectangle of equal width is drawn for each category. The height of the rectangle is equal to the category's frequency or relative frequency.

[Figure: bar graph "STAT 104 Classification", frequency by class (Freshman, Sophomore, Junior, Senior, Other)]
[Figure: bar graph "STAT 104 Classification", relative frequency by class (Freshman, Sophomore, Junior, Senior, Other)]

A Pareto chart is a bar graph whose bars are drawn in decreasing order of frequency or relative frequency. The above bar charts are also Pareto charts.

A pie chart is a circle divided into sectors. Each sector represents a category of data. The area of each sector is proportional to the frequency of the category.

[Figure: pie chart "STAT 104 Classification" with sectors Freshman, Sophomore, Junior, Senior, Other]

Section 2.2 Organizing Quantitative Data I

Recall that we divided quantitative data into two categories: discrete and continuous. In summarizing quantitative data, we must first recognize whether the data are discrete or continuous. If the data are discrete, the categories of data will be the observations (as in qualitative data); however, if the data are continuous, the categories of data (called classes) must be created using ranges of the observations, i.e., intervals of numbers.

Discrete Data

We can create frequency, relative frequency, and cumulative relative frequency distributions in the same manner for discrete data as we did for qualitative data.

A histogram is constructed by drawing rectangles for each class of data. The height of each rectangle is the frequency or relative frequency of the class. The width of each rectangle should be the same, and the rectangles should touch each other.

Example: The manager of a Wendy's fast-food restaurant was interested in studying the typical number of customers who arrive during the lunch hour. The manager collected data for 40 randomly selected 15-minute intervals of time during lunch and constructed the following histogram.

[Figure: histogram "Arrivals at Wendy's"; horizontal axis: Number of Customers (0 to 12)]

Construct a relative frequency graph of this data.

Continuous Data

Raw continuous data do not have any predetermined categories that can be used to construct a frequency distribution; therefore the categories must be created. Categories of
data are created by using intervals of numbers called classes.

Example: The following table represents the number of United States residents between the ages of 25 and 74 that have earned a bachelor's degree. The data are based on the Census Populations Survey conducted in 1998.

[Table: number of residents by age class; entries not recoverable]

Notice that the data are categorized, or grouped, by intervals of numbers. Each interval represents a class. The lower class limit of a class is the smallest value within the class. The upper class limit of a class is the largest value within the class. The class width is the difference between two consecutive lower class limits.

Notice in the above table that the class widths are equal for all classes. One exception to this requirement is in open-ended tables. A table is open ended if the last class does not have an upper class limit.

Example: Construct a frequency histogram and a relative frequency histogram for the following data.

Three-Year Rate of Return of Mutual Funds
27.4 16.7 10.8 24.1 35.9 18.2 32.0 25.5 23.7 38.1

We must decide on appropriate class widths. We need the lower class limit of the first class to be slightly smaller than the smallest data value and the upper class limit of the last class to be slightly greater than the largest data value. Notice the data range from 10.8 to 47.7, so we will use a lower limit of 10.0 and a class width of 5.

Note: Frequency distributions should be constructed so as to provide a good overall summary of the data. Too few classes will cause a bunching effect, whereas too many classes will spread the data out.

[Figure: histogram "Three-Year Rate of Return of Mutual Funds"; horizontal axis: Rate of Return (10 to 50)]

What would a relative frequency histogram of this data look like?

The class midpoint is found by adding the lower class limit and upper class limit of a class and dividing the result by 2:

$\text{class midpoint} = \frac{\text{lower class limit} + \text{upper class limit}}{2}$

A frequency polygon is drawn by plotting a point above each class midpoint on a horizontal axis at a height equal to
the frequency of the class. After the points for each class are plotted, straight lines are drawn between consecutive points.

Example: Draw a frequency polygon for the three-year rate of return data from the histogram example above.

Stem-and-Leaf Plots

A stem-and-leaf plot is another way to represent quantitative data graphically.

Construction of a Stem-and-Leaf Plot
1. The stem of the graph will consist of the digits to the left of the rightmost digit. The leaf of the graph will be the rightmost digit.
2. Write the stems in a vertical column in increasing order. Draw a vertical line to the right of the stems.
3. Write each leaf corresponding to the stems to the right of the vertical line. The leaves must be written in ascending order.

Example: Construct a stem-and-leaf plot of the three-year rate of return data given below.

22.6 29.6 11.6 45.9 16.6 13.4 21.3 27.0 19.6 15.8

Step One: Round the data. Why would we need to do this?

27 17 11 24 36 13 29 22 18 17
23 30 12 46 17 32 48 12 18 23
18 32 26 24 38 24 15 13 31 22
18 21 27 20 16 15 37 19 19 29

Construct the plot.

Three-Year Rate of Return for Mutual Funds
1 | 12233
1 | 556777888899
2 | 012233444
2 | 67799
3 | 0122
3 | 678
4 |
4 | 68
1 | 1 represents 11

Selecting the stems in a stem-and-leaf plot is similar to selecting the class width in a histogram.

In the following stem-and-leaf plot, a stem of 8 represents the class 80 to 89.

7 | 57
8 | 23358888889
9 | 11233344889
7 | 5 represents 75

In the following stem-and-leaf plot, the first stem of 8 represents the class 80 to 84 and the second stem of 8 represents the class 85 to 89.

7 |
7 | 57
8 | 233
8 | 58888889
9 | 11233344
9 | 889
7 | 5 represents 75

Which stem-and-leaf plot gives a better description of the frequency distribution?

Shapes of Distributions
- Symmetric: left and right side are mirror images of each other.
- Skewed: one tail is stretched out longer than the other tail.
- Left Skewed: the longer tail is associated with smaller numbers; most of the data is piled on the high numbers end.
- Right Skewed: the longer tail is associated with larger numbers; most of the data is piled on the low
numbers end.
- Uniform: the frequency of each value of the variable is equal.
- Unimodal: there is only one major peak.
- Bimodal: there are two major peaks.
- J-shaped: there is no tail on the side of the class with the highest frequency.
- Bell-shaped: highest frequency occurs near the middle, and frequencies tail off to the left and right in roughly the same pattern.

Let us look at histograms of these shapes.

[Figure: Shapes of Distributions; panels: bell shaped, right skewed, left skewed, uniform, J-shaped, bimodal]

A time series plot is obtained by plotting the time in which a variable is measured on the horizontal axis and the corresponding value of the variable on the vertical axis. Lines are then drawn connecting the points.

Example: The following data represent the percentage of recent high school graduates who enroll in college. Construct a time series plot of the data.

Year   Percent Enrolled
1985   57.7
1986   53.8
1987   56.8
1988   58.9
1989   59.6
1990   59.9
1991   62.4
1992   61.7
1993   62.6
1994   61.9
1995   61.9
1996   65.0
1997   67.0

Section 2.3 Graphical Misrepresentation Of Data

Characteristics of Good Graphics
- Label the graphic clearly and provide explanations if needed.
- Avoid distortion. Don't lie about the data.
- Avoid three dimensions. Three-dimensional charts may look nice, but they distract the reader and often result in misinterpretation of the graphic.
- Do not use more than one design in the same graphic. Sometimes graphs use a different design in a portion of the graphic in order to draw attention to this area. Don't use this technique. Let the numbers speak for themselves.

Chapter 6 Discrete Probability Distributions

Section 6.1 Probability Distributions

In Chapter 5 we presented the concept of an experiment and the outcomes of an experiment. When experiments are conducted in a way such that the outcome is a numerical result, we say the outcome is a random variable.

A random variable is a numerical measure of the outcome of a probability experiment, so its value is determined by chance. Random variables are
denoted using capital letters such as X. Each of the observed values is denoted with a small letter such as x.

A discrete random variable is a random variable that has either a finite number of possible values or a countable number of possible values.

A continuous random variable is a random variable that has an infinite number of possible values that is not countable.

The probability distribution of a random variable X provides the possible values of the random variable and their corresponding probabilities. A probability distribution can be in the form of a table, graph, or mathematical formula.

Requirements for a Discrete Probability Distribution

Let P(X = x) denote the probability the random variable X equals x; then
1. $\sum P(X = x) = 1$
2. $0 \le P(X = x) \le 1$

Example: Determine which of the following are probability distributions.

x          0     1     2     3     4
P(X = x)   0.2   0.2   0.2   0.2   0.2

x          10    20    30    40    50
P(X = x)   0.1   0.23  0.22  0.6   0.15

x          100   200   300   400   500
P(X = x)   0.1   0.25  0.2   0.3   0.1

The Mean of a Discrete Random Variable

The mean of a discrete random variable is given by the formula

$\mu_X = E(X) = \sum x \, P(X = x)$

where x is the value of the random variable and P(X = x) is the probability of observing the value x. This will also be referred to as the expected value of X, denoted E(X).

Variance and Standard Deviation of a Discrete Random Variable

The variance of a discrete random variable is given by

$\mathrm{Var}(X) = \sigma_X^2 = \sum (x - \mu_X)^2 P(X = x) = \sum x^2 P(X = x) - \mu_X^2 = E(X^2) - [E(X)]^2$

where $E(X^2) = \sum x^2 P(X = x)$.

To find the standard deviation of the discrete random variable, take the square root of the variance. That is, $\sigma_X = \sqrt{\sigma_X^2}$.

Example: Find the mean and variance of the following probability distribution.

Section 6.2 The Binomial Probability Distribution

Criteria for a Binomial Probability Experiment

An experiment is said to be a binomial experiment provided:
1. The experiment is performed a fixed number of times. Each repetition of the experiment is called a trial.
2. The trials are independent. This means the outcome of one trial will not affect the outcome of the other trials.
3. For each trial
there are two mutually exclusive outcomes: success or failure.
4. The probability of success is fixed for each trial of the experiment.

Notation Used in the Binomial Probability Distribution
- There are n independent trials of the experiment.
- Let p denote the probability of success, so that q = 1 - p is the probability of failure.
- Let X denote the number of successes in n independent trials of the experiment, so $0 \le x \le n$.

Since the binomial distribution is a specific example of a probability distribution, we must be able to assign a probability distribution or probability function to the random variable X. So what is the distribution of X when X ~ bin(n, p)?

Binomial Probability Distribution

The probability of obtaining x successes in n independent trials of a binomial experiment, where the probability of success is p, is given by

$P(X = x) = \binom{n}{x} p^x q^{n - x}$, where x = 0, 1, 2, ..., n

Mean and Standard Deviation of a Binomial Random Variable

A binomial experiment with n independent trials and probability of success p will have a mean and standard deviation given by the formulas

$\mu_X = np$ and $\sigma_X = \sqrt{npq}$

Example: Singulair is a medication for controlling asthma attacks. In clinical trials of Singulair, 18.4% of the patients in the study experienced headaches as a side effect. Let X = the number of patients experiencing headaches.
1. Compute the mean and standard deviation of X, the number of patients experiencing headaches in 400 trials of the probability experiment.
2. Interpret the mean.
3. Define p.
4. What is the distribution of X?
5. Give an expression for the probability that exactly 70 patients in part 1 experience headaches.
6. Give an expression for the probability that 100 or more patients experience headaches.
7. Give an expression for the probability that between 90 and 110 patients experience headaches.
8. Would it be unusual if 100 or more patients experience headaches in this study?

Stat 104 Lecture 11

Probability
- Subjective (Personal): based on feeling or opinion
- Empirical: based on experience
- Theoretical (Formal): based on assumptions

The Deal
Bag o' chips
poker chips
- some are red
- some are white
- some are blue
Draw a chip from the bag.

The Deal
- Draw a red chip: win 3 bonus points.
- Draw a blue chip: win 1 bonus point.
- Draw a white chip: lose 1 bonus point.

Is this a good deal?
- Subjective (personal) probability: based on your beliefs and opinion
- Empirical probability: based on experience
  - Conduct a series of trials
  - Each trial has an outcome (R, B, W)

Empirical Probability
Look at the long run relative frequency of each of the outcomes:
- Red
- Blue
- White

Theoretical Probability
Look in the bag and see how many
- White chips
- Assumption: each chip has the same probability of being chosen (equally likely)

Properties of Probability
- A probability is a number between 0 and 1.
- "Something has to happen" rule: the probability of the set of all possible outcomes of a trial must be 1.

Law of Large Numbers
For repeated independent trials, the long run relative frequency of an outcome gets closer and closer to the true probability of the outcome.

Surviving the Titanic

Probability of survival
- Relative frequency of saved: 706/2223 = 0.318 or 31.8%
- Relative frequency of lost: 1517/2223 = 0.682 or 68.2%

Conditional Probability
- Probability relative to a pre-existing condition
- P(A | B): the probability of A occurring given B has occurred

Conditional Probability
- P(Saved | First Class): number of First Class who were saved relative to the total number of First Class passengers
- P(Saved | First Class) = 199/329 = 0.605 or 60.5%

The area under the graph of a density function over some interval represents the probability of observing a value of the random variable in that interval: P(a < X < b).

Chapter 6 The Normal Distribution

The Normal Probability Distribution

A continuous random variable is normally distributed, or has a normal probability distribution, if its relative frequency histogram has the shape of a normal curve (bell-shaped and symmetric).

Properties of the Normal Density Curve
1. It is symmetric about its mean μ.
2. The highest point occurs at x = μ.
3. The area under the curve is one: $\int f(x)\,dx = 1$.
4. The area under the curve to the right of μ equals the area under the curve to the left of μ, which equals 1/2.
5. As x increases without bound (gets larger and larger), the graph approaches but never equals zero. As x decreases without bound (gets larger and larger in the negative direction), the graph approaches but never equals zero: $x \in (-\infty, \infty)$.
6. The Empirical Rule:
   - Approximately 68% of the area under the normal curve is between x = μ - σ and x = μ + σ.
   - Approximately 95% of the area under the normal curve is between x = μ - 2σ and x = μ + 2σ.
   - Approximately 99.7% of the area under the normal curve is between x = μ - 3σ and x = μ + 3σ.

Recall: a continuous random variable is a random variable that has an infinite number of possible values that is not countable.

Probability Density Function

A probability density function is an equation used to compute probabilities of continuous random variables. It must satisfy the following two properties:
1. The area under the graph of the equation over all possible values of the random variable must equal one.
   Discrete: $\sum P(x) = 1$. Continuous: $\int f(x)\,dx = 1$.
2. The graph of the equation must be greater than or equal to zero for all possible values of the random variable. That is, the graph of the equation must lie on or above the horizontal axis for all possible values of the random variable.
   Discrete: $0 \le P(X) \le 1$. Continuous: $\int_a^b f(x)\,dx \ge 0$ for all a, b among the values of X, where a < b.

The Area under a Normal Curve

Suppose a random variable X is normally distributed with a mean μ and a standard deviation σ. Notation: X ~ N(μ, σ).

The area under the normal curve for any range of values of the random variable X represents either:
- The proportion of the population with the characteristics described by the range (e.g., 68% of student heights are between 65.18 and 73.15), or
- The probability that a randomly selected individual from the population will have the characteristics described by the range (e.g., a student has a 0.68 probability to be between 65.18 and 73.15 inches tall).

[Figure: histogram of height]

Height is approximately normally distributed, ~ N(69.167, 3.987), so by the empirical rule 68% should be between 69.167 - 3.987 = 65.18 and 69.167 + 3.987 = 73.15.

The Empirical Rule (standard normal curve)
- Approximately 0.68 (68%) of the area under the standard normal curve is between -1 and 1.
- Approximately 0.95 (95%) of the area under the standard normal curve is between -2 and 2.
- Approximately 0.997 (99.7%) of the area under the standard normal curve is between -3 and 3.

Standardizing a Normal Random Variable

Suppose the random variable X is normally distributed with a mean μ and standard deviation σ. Then the random variable

$Z = \frac{X - \mu}{\sigma}$

is normally distributed with a mean μ = 0 and standard deviation σ = 1. The random variable Z is said to have the standard normal distribution. Notation: Z ~ N(0, 1).

For any given x, we can calculate the associated z-score using the formula above.

Notation for the Probability of a Standard Normal Random Variable
- P(a < Z < b) represents the probability a standard normal random variable is between a and b.
- P(Z > a) represents the probability a standard normal random variable is greater than a.
- P(Z < b) represents the probability a standard normal random variable is less than b.

The notation $z_\alpha$ (pronounced "z sub alpha") is the z-score such that the area under the standard normal curve to the right of $z_\alpha$ is α.

Table II at the back of the text is referred to as a Z-table. It tabulates the area to the left of a given Z.

The Standard Normal Distribution, Z ~ N(0, 1)

Properties of the Standard Normal Curve
1. It is symmetric about its mean μ = 0.
2. The highest point occurs at μ = 0.
3. The area under the curve is one. This characteristic is required in order to satisfy the requirement that the sum of all probabilities in a legitimate probability distribution equals 1.
4. The area under the curve to the right of μ = 0 equals the area under the curve to the left of μ = 0, which equals 1/2.
5. As Z increases without bound (gets larger and larger), the graph approaches but never equals zero. As Z decreases without bound (gets larger and larger in the negative direction), the graph approaches but never equals zero: $Z \in (-\infty, \infty)$.
6. The Empirical Rule.

[Table II: Standard Normal Distribution, area under the standard normal curve to the left of z; entries not recoverable]

Applications of the Normal Distribution

Finding the Area under Any Normal Curve
1. Draw a normal curve with the desired area shaded.
2. Convert the values of X to Z-scores using $Z = \frac{X - \mu}{\sigma}$.
3. Draw a standard normal curve with the desired area shaded.
4. Find the area under the standard normal curve. This is the area under the normal curve drawn in step 1.

Finding the Value of a Normal Random Variable Corresponding to a Specified Proportion or Probability
1. Draw a standard normal curve with the area corresponding to the proportion or probability shaded.
2. Use the Z-table to find the z-score that corresponds to the shaded area.
3. Obtain the normal value from the formula $X = \mu + Z\sigma$.
5m W1 W0 r986 W1 2 01110 00110 01110 01110 1910 11110 10110 11110 51110 19110 11 10100 19000 01000 91000 0000 01000 r1000 00100 911100 101100 139 W39quot W d quot W W 5 39 d quot A 395 zquot 4 1 N 1 A H 1 A 9 Hi 1 ER H A lt L r a ECHOU WSW 931 00301 SEEITO W110 H011 156039 89114 Fl Ll ih l LNG U 1101011 100 L EDUJ U HIM H H 1 903 ELL LII 10 mm mm Rum lam gun rum H110 ml0 mu r 39 39 0 39 09139 1 5 quot1 quot1 0 10 39 2323 995 2 mm W 0000 11500 10900 10500 90900 81900 190 0 10900 55900 9990 0 91 59960 W mm 909610 mm mgrquot 38 560 M60 m6 H 5000 99100 911110 58100 50100 5001 91500 9000 10900 111000 91 9960 9666 mm 91cm gm g m W6 W6 Egg 3 91 10500 91100 1111110 0000 10100 011110 L000 915100 91000 11 mm mm mm m0 WW Wm mm mm mo um n r0200 1000 10200 01500 20500 I 90100 0000 10100 05010 111 39 39 1 39 0200 00200 1111100 0500 9mm 39 119200 0100 10100 1300 0391 bltfggn I F 1110 0 001110 101110 10100 c0200 010 110110 111110 1010 0 9110 0 0 1 A 39 HO0 WNW DENT 13951quot HEN 29100 U HU39U 11111111 TUU39U mill 7 3399quot quot3 mi quot1219 MS 5975 L 535 3 TN 39 01100 11100 91100 01100 22100 90100 02100 31100 91100 00100 2 000 0100 00100 01100 01010 000 00190 90980 59900 119011 11 lam bags LL93 59 159 SETSquot may WW quot1 WWW LHUU39 VHVUIU MKJ H TN 10110 0011111 1011110 111100 EMU IZ VI W n y w v I r9001 LNUUU WOUU 690011 EUJO39U 5LUUU HLUUU 0900 0300 r 39 2 J05 I 23 10000 01000 190110 391100 5000 19000 02000 09100 9000 9 39 39 m mm mm m N L 911100 10100 115000 05000 100 110011 171000 110110 90000 111000 92 6am Mm mm mm Kim 90 90100 120011 112000 03000 5000 11000 0000 91000 11000 91000 1 wm more 869 WU mm 90 01000 02000 101110 11000 11000 2000 92000 01000 same 9000 112 39 M 11000 111000 91000 51001 91000 01000 11000 01000 91000 01000 0 mm m W M 5 fquot 01000 111000 110110 111 11000 1000 1003911 11000 1000 1000 0390 W W W 90W 52 3123 if 10000 10000 00000 110000 110000 001100 00000 011000 01000 1391 W J39 Hquot 85 395 mquot 1 90000 00000 90000 90000 10000 10100 2391 2 HRH 0191 
14911911 9Z09390 quot A A A A 1 0 HHXH KIHJIH 80000 15L 111511 SLQSO 9950 M11 1355quot USED RLHU 8115 1411151 1390 A 1 1 A m AA 1 A 1 W A A 111000 WUU U FOUUU Lilli H1000 HM WUUU 50000 50000 51100 1 H 014 0 0115 U 0115 0 0115 0 1615 0 1915 0 00151 BROS 11 111415411 0003 U 00 am mum mm 60 80 1039 9 SO39 90 20 Z039 039 0039 7 11011011111510 1121111010 0101111013 1100 0001 10101 01301 0000 11001 WWII H 8L6039 Z80639O L D 390 09160390 L 1olt H QFTI IRON INT 1011 1081quot 1 80 0881quot 0be13 0901quot KJIIEU U39ll m m wwmwm 6511 L L V quot 0050 c L39 P 12211 061 LSIL39O 951130 8801quot 63390 1391 8 3 0 303391 311911 9039 5039 HT E039 2039 1039 0039 3 uonnqmsga laumN pmpuulg quot39p1u00 sedu19xg 28060 Zd H 991 2 81600 Zd H 991 z A Let Xm NH 5 Find in probabith W X is greater them a A L25 x NUl5F1 nd m probabilihj 1m X is bumen 9 and 5 The Normal Approximation To the Binomial Probability Distribution Criteria for a Binomial Probability Experiment A probability experiment is said to be a binomial experiment if all the following are true The experiment is performed on n independent times Each repetition of the experiment is called a trial Independence means that the outcome of one trial will not affect the outcome of the other trials For each trial there are two mutually exclusive outcomes success orfailure The probability of success p is the same for each trial of the experiment T Lei X N39l5l39139nd the Probabth l hdi the absoluk VOJW oi X is less man 5 A u xnI Hi 394 39ia39i39lu han n lqL139I39Jt 39all r I r aftrquot 411 Fa When we were dealing with probabilities for the binomial distribution we only set up an expression since it is very mathematically tedious However we have a new way to approximate those probabilities 15 PX 213 Z 065x035 x z 0062 x x13 As the number of trials n in a binomial experiment increases the probability distribution of the random variable X becomes more nearly symmetric and bell shaped As a general rule of thumb if np gt 5 and nq gt 5 then the 
probability distribution will be approximately symmetric and bell shaped.

Standard Normal Model

- Standard normal table handed out in class
- Table 3, page 662 in your text
- http://davidmlane.com/hyperstat/z_table.html

The Normal Approximation to the Binomial Probability Distribution

If $np > 5$ and $nq > 5$, then the binomial random variable $X$ is approximately normally distributed with mean $\mu_X = np$ and standard deviation $\sigma_X = \sqrt{npq}$:

$$X \sim \text{Bin}(n, p) \;\Rightarrow\; X \approx N(\mu_X, \sigma_X) = N\left(np, \sqrt{npq}\right)$$

What is the major difference between a binomial random variable and a normal random variable? A binomial random variable is a discrete random variable and a normal random variable is a continuous random variable. Therefore, since we are using a continuous density function to approximate a discrete probability, we must apply a correction for continuity. The continuity correction says that we add and subtract 0.5 from every value of $x$:

$$P(X = x_0) \approx P(x_0 - 0.5 < X < x_0 + 0.5)$$

Example: Suppose a softball player safely reaches base 45% of the time. Assuming at-bats are independent events, use the normal approximation to the binomial to approximate the probability that in the next 100 at-bats:

1. The player reaches base safely exactly 50 times.
2. The player reaches base safely 60 or more times.
3. The player reaches base safely 50 or fewer times.
4. The player reaches base safely between 60 and 90 times, inclusive.

Setup: $X \sim \text{Bin}(100, 0.45)$, so $\mu_X = np = 100(0.45) = 45$ and $\sigma_X = \sqrt{100(0.45)(0.55)} \approx 4.97$. Hence $X \approx N(45, 4.97)$.

1. The player reaches base safely exactly 50 times.

$$P(X = 50) \approx P(49.5 < X < 50.5) = P\left(\frac{49.5 - 45}{4.97} < Z < \frac{50.5 - 45}{4.97}\right) = P(Z < 1.11) - P(Z < 0.91) = 0.8665 - 0.8186 = 0.0479$$

2. The player reaches base safely 60 or more times.

$$P(X \ge 60) \approx P(X > 59.5) = 1 - P\left(Z < \frac{59.5 - 45}{4.97}\right) = 1 - P(Z < 2.92) = 1 - 0.9982 = 0.0018$$

Exact versus approximation: if $n = 200$ and $p = 0.1$, the exact probability is $P(17 < X \le 25) = \sum_{x=18}^{25} P(X = x) = 0.6146$. With the continuity correction, the approximation is $P(17.5 < X < 25.5) = P(X < 25.5) - P(X < 17.5) = 0.6256$.

3. The player reaches base safely 50 or fewer times.

$$P(X \le 50) \approx P(X < 50.5) = P\left(Z < \frac{50.5 - 45}{4.97}\right) = P(Z < 1.11) = 0.8665$$

4. The player reaches base safely between 60 and 90 times, inclusive.

$$P(60 \le X \le 90) \approx P(59.5 < X < 90.5) = P\left(\frac{59.5 - 45}{4.97} < Z < \frac{90.5 - 45}{4.97}\right) = P(Z < 9.15) - P(Z < 2.92) \approx 1 - 0.9982 = 0.0018$$

Review

Let $X \sim N(3, 2)$.

1. Find the probability that $X$ is less than 5.
2. Find the probability that $X$ is greater than 2.
3. Find the probability that $X$ is between 2 and 5.

Each roll of the die is independent. Use the binomial distribution to provide math notation for the probability of each of the following:

- The player rolls a 1 exactly 10 times.
- The player rolls an even number fewer than 10 times.
- The player rolls a number less than or equal to 4 more than [illegible] times.
- The player rolls a [illegible] between 5 and 15 times, inclusive.

Sample Quiz/Exam Question

Let $X \sim N(6, 3)$.

1. Find the probability that $X$ is greater than 13.
2. Find the probability that $X$ is between 2 and 8.

Suppose we have a die that is unbiased and each roll of the die is independent. Use the binomial distribution to provide math notation for the probability that in the next 20 rolls we:

3. See a value on the die greater than or equal to 5 at least 12 times.
4. See an even number fewer than 12 times.
5. Is the normal approximation appropriate for problem 3? Why?

Stat 104 Lecture 8

Scatter Diagram

Statistics is about variation: recognize, quantify, and try to explain variation. Variation in two quantitative variables is displayed in a scatter diagram.

- The numerical variable on the vertical axis, $y$, is the response variable.
- The numerical variable on the horizontal axis, $x$, is the explanatory variable.

Scatter Diagram Example: body mass (kg) and bite force (N) for Canidae.

- $y$, response: bite force (N)
- $x$, explanatory: body mass (kg)
- Cases: 28 species of Canidae

[Figure: Bivariate Fit of BFca (N) By Body Mass (kg)]

Positive Association

- Above-average values of bite force are associated with above-average values of body mass.
- Below-average values of bite force are associated with below-average values of body mass.

Scatter
Diagram Example: outside temperature and amount of natural gas used.

- Response: natural gas used (in 1000s)
- Explanatory: outside temperature (°C)
- Cases: 26 days

[Figure: scatter diagram of natural gas used versus temperature]

Negative Association

- Above-average values of gas are associated with below-average temperatures.
- Below-average values of gas are associated with above-average temperatures.

Correlation

Linear association: how closely do the points on the scatter diagram represent a straight line? The correlation coefficient gives the direction of, and quantifies the strength of, the linear association between two quantitative variables.

Standardize $y$ and $x$:

$$z_y = \frac{y - \bar{y}}{s_y}, \qquad z_x = \frac{x - \bar{x}}{s_x}$$

[Figure: Standardized Bite Force versus Standardized Body Mass of Canidae]

Correlation Coefficient

$$r = \frac{\sum z_x z_y}{n - 1} = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}$$

For body mass and bite force:

$$r = \frac{\sum z_x z_y}{n - 1} = \frac{26.4796}{27} = 0.9807$$

There is a strong correlation (linear association) between body mass and bite force for the various species of Canidae.

JMP: Analyze, Multivariate Methods, Multivariate. Y, Columns: Body mass and BF ca (bite force at the canine).

[JMP output: Multivariate Correlations and Scatterplot Matrix for body mass and BF ca]

Correlation Properties

- The sign of $r$ indicates the direction of the association.
- The value of $r$ is always between $-1$ and $1$.
- Correlation has no units.
- Correlation is not affected by changes of center or scale.

Correlation Cautions

- Don't confuse correlation with causation. There is a strong positive correlation between the number of crimes committed in communities and the number of 2nd graders in those communities.
- Beware of lurking variables.

Stat 104 Lecture 29

Two Sample Data

- Dependent samples: data are connected.
- Independent samples: data are separate.

Dependent Samples: paired data. One set of individuals, with two values of the response variable (a pair of values) for each individual.

Independent Samples: two separate sets of individuals, with one value of the response variable for each individual.

Know the Difference: it is important to know the difference between data arising from two independent samples and data arising from dependent (paired) samples.

Alcohol and Reaction Time

- Dependent (paired) samples: one set of individuals. Each individual has reaction time measured after consuming a glass of grape juice and again after consuming a glass of grape juice with 2 oz of 190 proof alcohol.
- Independent samples: two separate sets of individuals. One set has reaction time measured after consuming a glass of grape juice. The other set has reaction time measured after consuming a glass of grape juice with 2 oz of 190 proof alcohol.

In this study the samples are dependent (paired): one set of 12 individuals, each measured under both conditions.

Summary of Differences

$$n = 12, \qquad \bar{d} = \frac{\sum d}{n} = \frac{5.1}{12} = 0.425, \qquad s_d = 0.5083$$

Confidence Interval for $\mu_d$

$$\bar{d} \pm t^* \frac{s_d}{\sqrt{n}}, \qquad t^* \text{ from Table T with } df = n - 1$$

From Table T with $df = 11$ and a 95% confidence level, $t^* = 2.201$.

$$0.425 \pm 2.201\left(\frac{0.5083}{\sqrt{12}}\right) = 0.425 \pm 2.201(0.1467) = 0.425 \pm 0.323, \quad \text{that is, } 0.102 \text{ to } 0.748$$

Interpretation

- We are 95% confident that the mean difference in reaction time is between 0.102 and 0.748 seconds.
- On average, a person's reaction time increases by between 0.102 and 0.748 seconds after drinking this amount of alcohol.

Test of Hypothesis for $\mu_d$

Step 1, Set up: $H_0: \mu_d = 0$ versus $H_A: \mu_d > 0$, where $\mu_d$ is the mean difference in reaction time (Alcohol minus No Alcohol).

Step 2, Test criteria: the differences are normally distributed; the population standard deviation is not known; use the t-test statistic; $\alpha = 0.05$.

Step 3, Sample evidence:

$$t = \frac{\bar{d} - 0}{s_d / \sqrt{n}} = \frac{0.425}{0.1467} = 2.897$$

From Table T with $df = 11$: the one-tail probability 0.01 corresponds to $t = 2.718$ and the one-tail probability 0.005 corresponds to $t = 3.106$, so $2.718 < 2.897 < 3.106$.

Step 4, Probability value: the P-value is between 0.005 and 0.01.

Step 5, Results: reject the null hypothesis because the P-value is smaller than $\alpha = 0.05$. With alcohol, the reaction time is longer on average.

Comment: this agrees with the confidence interval. Zero was not in the confidence interval, and so zero is not a plausible value for the population mean difference.

JMP

- Data in two columns: reaction time with no alcohol, and reaction time with alcohol.
- Create a new column of differences: Cols, Formula.
- Analysis, Distribution, Differences.
- JMP Starter, Basic, Matched Pairs.

JMP output, Distribution of Difference:

- Moments: Mean 0.425; Std Dev 0.5083395; Std Err Mean 0.146745; upper 95% Mean 0.7479835; lower 95% Mean 0.10202; N 12
- Test Mean = value: Hypothesized Value 0; Actual Estimate 0.425; Std Dev 0.50834
- t Test: Test Statistic 2.8962; Prob > |t| 0.0145; Prob > t 0.0073; Prob < t 0.9927

JMP output, Matched Pairs (Difference: Alcohol minus No Alcohol):

- Mean Difference 0.425; Std Error 0.14674; Upper 95% 0.74798; Lower 95% 0.10202; N 12; Correlation 0.20195
- t-Ratio 2.896181; DF 11; Prob > |t| 0.0145; Prob > t 0.0073; Prob < t 0.9927

Chapter 5 Probability

Section 5.1 Probability of Simple Events

Probability is a measure of the likelihood of a random phenomenon or chance behavior. Probability describes how likely it is that some event will occur. Probability falls into 3 major approaches:

1. Classical Approach
2. Empirical/Experimental Approach
3. Subjective Approach

We will discuss each approach in detail, but first we need to look at some basic ideas associated with probability.

In probability, an experiment is any process that can be repeated in which the results are uncertain. Probability experiments do not always produce the same results or outcome, so the result of any single trial of the experiment is not known ahead of time.

Suppose we are to flip a coin one time; what is the probability that
we observe a tails? So if we flip the coin 10 times, would we definitely see 5 tails? Why not? What if we flipped the coin 100 times? A million times?

The Law of Large Numbers

As the number of repetitions of a probability experiment increases, the proportion with which a certain outcome is observed gets closer to the probability of the outcome.

Suppose I have a fair die and I am going to roll that die one time and observe the outcome. What are all the possible outcomes?

- The sample space, $S$, of a probability experiment is the collection of all possible outcomes, or simple events.
- A simple event is any single outcome from a probability experiment. Each simple event is denoted $e_i$.
- An event is any collection of outcomes from a probability experiment. An event may consist of one or more simple events. Events are denoted using capital letters such as $E$.

Properties of Probabilities

We define the probability of an event, denoted $P(E)$, as the likelihood of that event occurring. Probabilities have some properties that must be satisfied:

1. The probability of any event $E$, $P(E)$, must be between 0 and 1 inclusive. That is, $0 \le P(E) \le 1$.
2. If an event is impossible, the probability of the event is 0.
3. If an event is a certainty, the probability of the event is 1.
4. If $S = \{e_1, e_2, \ldots, e_n\}$, then $P(e_1) + P(e_2) + \cdots + P(e_n) = 1$.

We will now discuss the three methods, or approaches, for determining probabilities.

Classical Approach

The classical method of computing probabilities requires equally likely outcomes. An experiment is said to have equally likely outcomes when each simple event has the same probability of occurring. Some examples would be each number of a die, each card in a deck of cards, and each side of a coin.

Computing Probabilities Using the Classical Method

If an experiment has $n$ equally likely simple events, and if the number of ways that an event $E$ can occur is $m$, then the probability of $E$ is

$$P(E) = \frac{\text{Number of ways } E \text{ can occur}}{\text{Number of possible outcomes}} = \frac{m}{n}$$

So if $S$ is the sample space of this experiment, then

$$P(E) = \frac{N(E)}{N(S)}$$

where $N(E)$ is the number of simple events in $E$ and $N(S)$ is the number of simple events in the sample space.

Determining the size of the sample space

A permutation is an ordered arrangement in which $r$ objects are chosen from $n$ distinct objects and repetition is not allowed. The number of arrangements of $r$ objects taken from $n$ distinct objects is given by

$$_nP_r = \frac{n!}{(n - r)!}$$

With permutations, order matters.

A combination is a collection, without regard to order, in which $r$ objects are chosen from $n$ distinct objects without repetition. The number of combinations of $n$ distinct objects taken $r$ at a time ("n choose r") is given by

$$_nC_r = \binom{n}{r} = \frac{n!}{r!\,(n - r)!}$$

Example 1: Let the sample space be $S = \{1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$. Suppose the simple events are equally likely. Compute the probability of the event $E$ = "an odd number".

Example 2: What is the probability of getting a flush (all cards of the same color) in a 5-card poker hand when a standard 52-card deck is used?

Empirical/Experimental Approach

In this approach, probabilities are obtained from empirical evidence, that is, evidence based upon the outcomes of a probability experiment.

Approximating Probabilities through the Empirical Approach

The probability of an event $E$ is approximately the number of times event $E$ is observed divided by the number of repetitions of the experiment:

$$P(E) \approx \text{relative frequency of } E = \frac{\text{frequency of } E}{\text{number of trials of experiment}}$$

Example: On September 8, 1998, Mark McGwire hit his 62nd home run of the season. Of the 62 home runs he hit, 26 went to left field, 21 went to left center, 12 went to center, 3 went to right center, and 0 went to right field.

1. What is the probability that a randomly selected home run was hit to left center field?
2. What is the probability that a randomly selected home run was hit to right field?
3. Is it impossible for Mark McGwire to hit a home run to right field?
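
The classical and empirical formulas above can be checked with a few lines of Python. This is only a sketch: the helper names `classical_probability` and `empirical_probability` are hypothetical (they do not come from the notes), and the data are the sample space of Example 1 and the home run counts quoted in the McGwire example.

```python
from fractions import Fraction
from math import comb, perm


def classical_probability(event, sample_space):
    """P(E) = N(E) / N(S), assuming equally likely simple events."""
    space = set(sample_space)
    favorable = set(event) & space  # ignore outcomes outside S
    return Fraction(len(favorable), len(space))


def empirical_probability(counts, outcome):
    """P(E) is approximated by frequency of E / number of trials."""
    trials = sum(counts.values())
    return Fraction(counts.get(outcome, 0), trials)


# Example 1: S = {1, ..., 10}, E = "an odd number"
S = range(1, 11)
odd_numbers = [x for x in S if x % 2 == 1]
p_odd = classical_probability(odd_numbers, S)

# McGwire example: 62 home runs by field
home_runs = {"left": 26, "left center": 21, "center": 12,
             "right center": 3, "right": 0}
p_left_center = empirical_probability(home_runs, "left center")
p_right = empirical_probability(home_runs, "right")

print(p_odd)          # 1/2
print(p_left_center)  # 21/62
print(p_right)        # 0

# Counting helpers from the standard library (Python 3.8+):
print(perm(5, 2))  # 5!/(5-2)! = 20 ordered arrangements
print(comb(5, 2))  # 5!/(2!(5-2)!) = 10 unordered selections
```

Exact fractions are used so the results match the hand calculation. Note that an empirical probability of 0 is only an estimate from the 62 observed home runs; it is not the same statement as property 2, which concerns events that are impossible.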
