# APPLIED STATISTICS I STAT 700

This 195 page Class Notes was uploaded by Shane Marks on Monday October 26, 2015. The Class Notes belongs to STAT 700 at University of South Carolina - Columbia taught by Staff in Fall. Since its upload, it has received 17 views. For similar materials see /class/229672/stat-700-university-of-south-carolina-columbia in Statistics at University of South Carolina - Columbia.

Date Created: 10/26/15

STAT 700: APPLIED STATISTICS I

Fall 2005 Lecture Notes

Joshua M. Tebbs, Department of Statistics, The University of South Carolina

TABLE OF CONTENTS (STAT 700, J. TEBBS)

- 1 Looking at Data Distributions
  - 1.1 Displaying distributions with graphs
    - 1.1.1 Categorical data distributions
    - 1.1.2 Quantitative data distributions
  - 1.2 Displaying distributions with numbers
    - 1.2.1 Measures of center
    - 1.2.2 The five number summary, box plots, and percentiles
    - 1.2.3 Measures of variation
    - 1.2.4 The empirical rule
    - 1.2.5 Linear transformations
  - 1.3 Density curves and normal distributions
    - 1.3.1 Measuring the center and spread for density curves
    - 1.3.2 Normal density curves
    - 1.3.3 Standardization
    - 1.3.4 Finding areas under any normal curve
    - 1.3.5 Inverse normal calculations: Finding percentiles
    - 1.3.6 Diagnosing normality
- 2 Looking at Data Relationships
  - 2.1 Scatterplots
  - 2.2 Correlation
  - 2.3 Least squares regression
    - 2.3.1 The method of least squares
    - 2.3.2 Prediction and calibration
    - 2.3.3 The square of the correlation
  - 2.4 Cautions about correlation and regression
    - 2.4.1 Residual plots
    - 2.4.2 Outliers and influential observations
    - 2.4.3 Correlation versus causation
- 3 Producing Data
  - 3.1 Introduction
  - 3.2 Experiments
    - 3.2.1 Terminology and examples
    - 3.2.2 Designing experiments
  - 3.3 Sampling designs and surveys
    - 3.3.1 Sampling models
    - 3.3.2 Common problems in sample surveys
  - 3.4 Introduction to statistical inference
- 4 Probability: The Study of Randomness
  - 4.1 Randomness
  - 4.2 Probability models
    - 4.2.1 Assigning probabilities
    - 4.2.2 Independence and the multiplication rule
  - 4.3 Random variables
    - 4.3.1 Discrete random variables
    - 4.3.2 Continuous random variables
  - 4.4 Means and variances of random variables
    - 4.4.1 Means: Discrete case
    - 4.4.2 Variances: Discrete case
- 5 Sampling Distributions
  - 5.1 The binomial distribution
  - 5.2 An introduction to sampling distributions
  - 5.3 Sampling
distributions of binomial proportions
  - 5.4 Sampling distributions of sample means
- 6 Introduction to Statistical Inference
  - 6.1 Introduction
  - 6.2 Confidence intervals for a population mean when $\sigma$ is known
  - 6.3 Hypothesis tests for the population mean when $\sigma$ is known
    - 6.3.1 The significance level
    - 6.3.2 One and two sided tests
    - 6.3.3 A closer look at the rejection region
    - 6.3.4 Choosing the significance level
    - 6.3.5 Probability values
    - 6.3.6 Decision rules in hypothesis testing
    - 6.3.7 The general outline of a hypothesis test
    - 6.3.8 Relationship with confidence intervals
  - 6.4 Some general comments on hypothesis testing
- 7 Inference for Distributions
  - 7.1 Introduction
  - 7.2 One sample t procedures
    - 7.2.1 One sample t confidence intervals
    - 7.2.2 One sample t tests
    - 7.2.3 Matched pairs t test
  - 7.3 Robustness of the t procedures
  - 7.4 Two sample t procedures
    - 7.4.1 Introduction
    - 7.4.2 Two sample t confidence intervals
    - 7.4.3 Two sample t hypothesis tests
    - 7.4.4 Two sample pooled t procedures
- 8 Inference for Proportions
  - 8.1 Inference for a single population proportion
    - 8.1.1 Confidence intervals
    - 8.1.2 Hypothesis tests
    - 8.1.3 Sample size determinations
  - 8.2 Comparing two proportions
    - 8.2.1 Confidence intervals
    - 8.2.2 Hypothesis tests

CHAPTER 1 (STAT 700, J. TEBBS)

## 1 Looking at Data Distributions

TERMINOLOGY: Statistics is the development and application of theory and methods to the collection, design, analysis, and interpretation of observed information from planned or unplanned experiments and other studies.

TERMINOLOGY: Biometry is the development and application of statistical methods for biological experiments (which are often planned).

SCOPE OF APPLICATION: Statistical thinking can be used in all disciplines. Consider the following examples:

- In a reliability study, tribologists aim to determine the main cause of failure in an engine assembly. Failures reported in the field have had an effect on customer
confidence for safety concerns.
- In a marketing project, store managers in Aiken, SC want to know which brand of coffee is most liked among the 18-24 year-old population.
- In an agricultural study in North Carolina, researchers want to know which of three fertilizer compounds produces the highest yield.
- In a clinical trial, physicians on a Drug and Safety Monitoring Board want to determine which of two drugs is more effective for treating HIV in the early stages of the disease.
- A chemical reaction that produces a product may have higher or lower yield depending on the temperature and stirring rate in the vessel where the reaction takes place. An engineer would like to understand how yield is affected by the temperature and stirring rate.
- The Human Relations Department at a major corporation wants to determine employee opinions about a prospective pension plan adjustment.
- In a public health study conducted in Sweden, researchers want to know whether or not smoking (i) causes lung cancer and/or (ii) is strongly linked to a particular social class.

TERMINOLOGY: In a statistical problem, the population is the entire group of individuals that we want to make some statement about. A sample is a part of the population that we actually observe.

TERMINOLOGY: A variable is a characteristic (e.g., temperature, age, CD4 count, growth, etc.) that we would like to measure on individuals. The actual measurements recorded on individuals in the sample are called data.

TWO TYPES OF VARIABLES: Quantitative variables have measurements (data) on a numerical scale. Categorical variables have measurements (data) where the values simply indicate group membership.

TERMINOLOGY: The distribution of a variable tells us what values the variable takes and how often it takes these values.

Example 1.1. Which of the following variables are quantitative in nature? Which are categorical?

- IKEA Atlanta daily sales (measured in \$1000s)
- store location (Baltimore, Atlanta, Houston, Detroit, etc.)
- CD4 cell count
- yield
(bushels/acre)
- payment times (in days)
- payment times (late/not late)
- age
- advertising medium (radio/TV/internet)
- number of cigarettes smoked per day
- smoking status (yes/no)

TERMINOLOGY: An experiment is a planned study where individuals are subjected to treatments. In an observational study, individuals are not "treated"; instead, we simply observe their information (data).

TERMINOLOGY: The process of generalizing the results in our sample to that of the entire population is known as statistical inference. We'll study this more formally later in the course.

Example 1.2. Salmonella bacteria are widespread in human and animal populations; in particular, some serotypes can cause disease in swine. A food scientist wants to see how withholding feed from pigs prior to slaughter can reduce the number and size of gastrointestinal tract lacerations during the actual slaughtering process. This is an important issue since pigs infected with salmonellosis may contaminate the food supply.

- Individuals: pigs
- Population: all market-bound pigs, say
- Sample: 45 pigs from 3 farms (15 per farm) assigned to three treatments:
  - Treatment 1: no food withheld prior to transport,
  - Treatment 2: food withheld 12 hours prior to transport, and
  - Treatment 3: food withheld 24 hours prior to transport.
- Data were measured on many variables, including body temperature prior to slaughter, weight prior to slaughter, treatment assignment, the farm from which each pig originated, number of lacerations recorded, and size of laceration. Boxplots of the laceration lengths by treatment are in Figure 1.1.

[Figure 1.1: Salmonella experiment. Laceration length (in cm) for three treatments (trt.0, trt.12, trt.24).]

SOME PARTICULAR QUESTIONS OF INTEREST:

- How should we assign pigs to one of the three treatments?
- What are the sources of variation? That is, what systematic components might affect laceration size or number of lacerations?
- Why would one want to use
animals from three farms?
- Why might body temperature or prior weight be of interest?

GENERAL REMARKS:

- In agricultural, medical, and other biological applications, the most common objective is to compare two or more treatments. In light of this, we will often talk about statistical inference in the context of comparing treatments in an experimental setting. For example, in the salmonella experiment, one goal is to compare the three withholding times (0 hours, 12 hours, and 24 hours).
- Since populations are usually large, the sample we observe is just one of many possible samples. That is, samples may be similar, but they are by no means identical. Because of this, there will always be a degree of uncertainty about the decisions that we make concerning the population of interest.
- A main objective of this course is to learn how to design controlled experiments and how to analyze data from these experiments. We would like to make conclusions based on the data we observe, and, of course, we would like our conclusions to apply for the entire population of interest.

### 1.1 Displaying distributions with graphs

IMPORTANT: Presenting data effectively is an important part of any statistical analysis. How we display data depends on the type of variables or data that we are dealing with.

- Categorical: pie charts, bar charts, tables
- Quantitative: stemplots, boxplots, histograms, timeplots

UNDERLYING THEMES: Remember that the data we collect are often best viewed as a sample from a larger population of individuals. In this light, we have two primary goals in this section:

- learn how to summarize and to display the sample information, and
- start thinking about how we might use this information to learn about the underlying population.

#### 1.1.1 Categorical data distributions

Example 1.3. HIV infection has spread rapidly in Asia since 1989, partly because blood and plasma donations are not screened regularly before transfusion. To study this,
researchers collected data from a sample of 1390 individuals from villages in rural eastern China between 1990 and 1994 (these individuals were likely to donate plasma for financial reasons). One of the variables studied was education level. This was measured as a categorical variable with three categories (levels): illiterate, primary, and secondary.

TABLES: I think the easiest way to portray the distribution of a categorical variable is to use a table of counts and/or percents. A table for the education data collected in Example 1.3 is given in Table 1.1.

Table 1.1: Education level for plasma donors in rural eastern China between 1990 and 1994.

| Education level | Count | Percentage |
|---|---|---|
| Illiterate | 645 | 46.4 |
| Primary | 550 | 39.6 |
| Secondary | 195 | 14.0 |
| Total | 1390 | 100.0 |

REMARK: Including percentages in the table (in addition to the raw counts) is helpful for interpretation. Most of us can understand percentages easily. Furthermore, it puts numbers like "645" into perspective; e.g., 645 out of what?

INFERENCE: These data are from a sample of rural villagers in eastern China. From these data, what might we be able to say about the entire population of individuals?

PIE CHARTS AND BAR CHARTS: Pie charts and bar charts are appropriate for categorical data but are more visual in nature. A pie chart for the data collected in Example 1.3 is given in Figure 1.2(a). A bar chart is given in Figure 1.2(b).

REMARK: Unfortunately, pie charts and bar charts are of limited use because they can only be used effectively with a single categorical variable. On the other hand, tables can be modified easily to examine the distribution of two or more categorical variables (as we will see later).

[Figure 1.2: Education level for plasma donors in rural eastern China between 1990 and 1994. (a) Pie chart. (b) Bar chart (percentage by education level).]

#### 1.1.2 Quantitative data distributions

GRAPHICAL DISPLAYS: Different graphical displays can also be used with quantitative data. We will examine
histograms, stem plots, time plots, and box plots.

Example 1.4. Monitoring the shelf life of a product from production to consumption by the consumer is essential to assure quality. A sample of 25 cans of a certain carbonated beverage was used in an industrial experiment that examined the beverage's shelf life, measured in days. The data collected in the experiment are given in Table 1.2.

Table 1.2: Beverage shelf life data.

262 188 234 203 212 212 301 225 241 211 231 227 217 252 206 281 251 219 268 231 279 243 241 290 249

GOALS:

- The goal of a graphical display is to provide a visual impression of the characteristics of the data from a sample. The hope is that the characteristics of the sample are a likely indication of the characteristics of the population from which it was drawn.
- In particular, we will always be interested in the
  - center of the distribution of data,
  - spread (variation) in the distribution of data,
  - shape: is the distribution symmetric or skewed?
  - the presence of outliers.

TERMINOLOGY: A frequency table is used to summarize data in a tabular form. Included are two things:

- class intervals: intervals of real numbers
- frequencies: how many observations fall in each interval.

Table 1.3: Frequency table for the shelf life data in Example 1.4.

| Class interval | Frequency |
|---|---|
| [175, 200) | 1 |
| [200, 225) | 7 |
| [225, 250) | 9 |
| [250, 275) | 4 |
| [275, 300) | 3 |
| [300, 325) | 1 |

NOTE: The number of intervals should be large enough that not all observations fall in one or two intervals, but small enough so that we don't have each observation belonging to its own interval.

HISTOGRAMS: To construct a histogram, all you do is plot the frequencies on the vertical axis and the class intervals on the horizontal axis. The histogram for the shelf life data in Example 1.4 is given in Figure 1.3.

[Figure 1.3: Histogram for the shelf life data in Example 1.4. Horizontal axis: shelf life (in days).]

INTERPRETATION: We see that the distribution of the shelf life data is approximately symmetric. The center
is around 240 days, and variation among the shelf lives is pretty apparent. There are no gross outliers in the data set.

INFERENCE:

- We can use the distribution to estimate what percentage of cans have shelf life in a certain range of interest. Suppose that the experimenter believed most shelf lives should be larger than 250 days. From the distribution, we see that this probably is not true if these data are representative of the population of shelf lives.
- We can also associate the percentage of lives in a certain interval as being proportional to the area under the histogram in that interval. For example, are more cans likely to have shelf lives of 200-225 days or 300-325 days? We can estimate these percentages by looking at the data graphically.

TERMINOLOGY: If a distribution is not symmetric, it is said to be skewed.

[Figure 1.4: Left: Death times for rats treated with a toxin. Right: Percentage of monthly on-time payments.]

SKEWED DISTRIBUTIONS: Skewed distributions occur naturally in many applications. Not all distributions are symmetric or approximately symmetric. In Figure 1.4, the left distribution is skewed right; the right distribution is skewed left.

STEM PLOTS: These plots provide a quick picture of the distribution while retaining the numerical values themselves. The idea is to separate each data value into a stem and a leaf. Stems are usually placed on the left, with leaves on the right (in ascending order). Stem plots work well with small to moderate sized data sets, say, 15 to 50 observations. The stem plot for the shelf life data in Example 1.4 appears in Table 1.4. In this plot, the units digit is the leaf; the tens and hundreds digits form the stem.

LONGITUDINAL DATA: In many applications, especially in business, data are observed over time. Data that are observed over time are sometimes called
longitudinal data. More often, longitudinal data are quantitative (but they need not be). Examples of longitudinal data include monthly sales, daily temperatures, hourly stock prices, etc.

Table 1.4: Stem plot for the shelf life data in Example 1.4.

| Stem | Leaves |
|---|---|
| 18 | 8 |
| 19 | |
| 20 | 3 6 |
| 21 | 1 2 2 7 9 |
| 22 | 5 7 |
| 23 | 1 1 4 |
| 24 | 1 1 3 9 |
| 25 | 1 2 |
| 26 | 2 8 |
| 27 | 9 |
| 28 | 1 |
| 29 | 0 |
| 30 | 1 |

TIME PLOTS: If it is the longitudinal aspect of the data that you wish to examine graphically, you need to use a graphical display which exploits this aspect. Time plots are designed to do this (histograms and stem plots, in general, do not do this). To construct a time plot, simply plot the individual values on the vertical axis versus time on the horizontal axis. Usually, individual values are then connected with lines.

Example 1.5. The Foster Brewing Company is largely responsible for the development of packaged beers in Australia. In fact, canned beer had been developed in the USA just prior to the Second World War and was first produced in Australia in the early 1950s. Developments since then have included improved engineering techniques, which have led to larger vessels and improved productivity. The data available online at http://www.maths.soton.ac.uk/teaching/units/math6011/mwhdata/Beer2.htm are the monthly beer sales data in Australia from January 1991 to September 1996. A time plot of the data appears in Figure 1.5.

[Figure 1.5: Australian beer sales from January 1991 to September 1996. Horizontal axis: month.]

INTERPRETING TIME PLOTS: When I look at time plots, I usually look for two things in particular:

- Increasing or decreasing trends. Is there a general shift over time, upward or downward? Is it slight or notably apparent?
- Evidence of seasonal effects. Are there repeated patterns at regular intervals? If there are, what is most likely to have produced this pattern?

USEFULNESS: Analyzing longitudinal data is important for forecasting or prediction. For example, can we forecast the next two years of beer sales?
Why might this information be important?

DECOMPOSING TIME SERIES: For a given set of longitudinal data, it is often useful to decompose the data into its systematic parts (e.g., trends, seasonal effects) and into its random parts (the part that is left over afterwards). See pages 21-23, MM.

### 1.2 Displaying distributions with numbers

NOTE: In Section 1.1, the main goal was to describe distributions of data graphically. We now wish to do this numerically. For the remainder of the course, we will adopt the following notation to describe a sample of data:

- $n$ = number of observations in the sample
- $x$ = variable of interest
- $x_1, x_2, \ldots, x_n$ = the $n$ data values in our sample

We now examine numerical summaries of data. We will quantify the notion of center and variation (i.e., spread).

PREVAILING THEME: Our goal is to numerically summarize the distribution of the sample and get an idea of these same notions for the population from which the sample was drawn.

#### 1.2.1 Measures of center

TERMINOLOGY: With a sample of observations $x_1, x_2, \ldots, x_n$, the sample mean is defined as

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

That is, the sample mean is just the arithmetic average of the $n$ values $x_1, x_2, \ldots, x_n$. The symbol $\bar{x}$ is pronounced "x-bar," and is common notation. Physically, we can envision $\bar{x}$ as the balancing point on the histogram for the data.

SIGMA NOTATION: The symbol $\sum$ denotes sum. In particular, the sum of the data $x_1, x_2, \ldots, x_n$ can be expressed either as

$$x_1 + x_2 + \cdots + x_n \quad \text{or} \quad \sum_{i=1}^{n} x_i.$$

We will be using sigma notation throughout the course. The symbol $i$ denotes the index of summation. This is used to tell us which values to add up. The $n$ on top of the summation symbol is the index of the final quantity added.

TERMINOLOGY: With a sample of observations $x_1, x_2, \ldots, x_n$, the sample median, denoted by $M$, is the middle ordered value (when the data are ordered low to high). If the sample size $n$ is odd, then the median will be uniquely defined; if $n$ is even, then the median is the average of the middle two ordered observations.

Example 1.6. With our beverage shelf life
data from Example 1.4, the sum of the data is

$$\sum_{i=1}^{25} x_i = x_1 + x_2 + \cdots + x_{25} = 262 + 188 + \cdots + 249 = 5974,$$

and the sample mean of the 25 shelf lives is given by

$$\bar{x} = \frac{1}{25}\sum_{i=1}^{25} x_i = \frac{1}{25}(5974) = 238.96 \text{ days}.$$

From the stem plot in Table 1.4, we can see that the median is $M = 234$ days. Note that the median is the 13th ordered value; 12 values fall below $M$ and 12 values fall above $M$. Also, note that the mean and median are fairly close, but not equal.

COMPARING THE MEAN AND MEDIAN: The mean is a measure that can be heavily influenced by outliers. Unusually high data observations will tend to increase the mean, while unusually low data observations will tend to decrease the mean. One or two outliers will generally not affect the median. Sometimes we say that the median is generally robust to outliers.

MORAL: If there are distinct outliers in your data set, then perhaps the median should be used as a measure of center instead of the mean.

OBSERVATION: If the mean and median are close, this is usually (but not always) an indication that the distribution is approximately symmetric. In fact,

- if a data distribution is perfectly symmetric, the median and mean will be equal.
- if a data distribution is skewed right, the mean will be greater than the median.
- if a data distribution is skewed left, the median will be greater than the mean.

#### 1.2.2 The five number summary, box plots, and percentiles

BOX PLOT: A box plot is a graphical display that uses the five number summary:

- min, the smallest observation.
- $Q_1$, the first quartile; i.e., the observation such that approximately 25 percent of the observations are smaller than $Q_1$.
- median, $M$, the middle ordered observation; i.e., the observation such that roughly half of the observations fall above the median, and roughly half fall below.
- $Q_3$, the third quartile; i.e., the observation such that approximately 75 percent of the observations are smaller than $Q_3$.
- max, the largest observation.

NOTE: To compute the five number summary, one first has to order the data from low to high. The values of $Q_1$, the
median, and $Q_3$ may not be uniquely determined. In practice, computing packages give these values by default or by request.

Example 1.7. For the beverage shelf life data in Example 1.4, the five number summary is

min = 188, $Q_1$ = 217, median = 234, $Q_3$ = 252, max = 301.

The box plot for the shelf life data uses these values and is given in Figure 1.6.

[Figure 1.6: Box plot for the shelf life data in Example 1.4. Vertical axis: shelf life (in days).]

TERMINOLOGY: The $p$th percentile of a data distribution is the value at which $p$ percent of the observations fall at or below it. Here, $p$ is a number between 0 and 100. We have already seen three special percentiles: $Q_1$ is the 25th percentile, $M$ is the 50th percentile, and $Q_3$ is the 75th percentile.

#### 1.2.3 Measures of variation

OBSERVATION: Two data sets could have the same mean, but values may be spread about the mean value differently. For example, consider the two data sets:

- 24 25 26 27 28
- 6 16 26 36 46

Both data sets have $\bar{x} = 26$. However, the second data set has values that are much more spread out about 26. The first data set has values that are much more compact about 26. That is, variation in the data is different for the two data sets.

RANGE: An easy way to assess the variation in a data set is to compute the range, which we denote by $R$. The range is the largest value minus the smallest value; i.e.,

$$R = x_{\max} - x_{\min}.$$

For example, the range for the first data set above is $28 - 24 = 4$, while the range for the second is $46 - 6 = 40$.

DRAWBACKS: The range is very sensitive to outliers since it only uses the extreme observations. Additionally, it ignores the middle $n - 2$ observations, which is potentially a lot of information.

INTERQUARTILE RANGE: The interquartile range, $IQR$, measures the spread in the center half of the data; it is the difference between the first and third quartiles; i.e.,

$$IQR = Q_3 - Q_1.$$

NOTE: This measure of spread is more resistant to outliers since it does not use the extreme observations. Because of this, the $IQR$ can be
very useful for describing the spread in skewed distributions.

INFORMAL OUTLIER RULE: The $1.5 \times IQR$ outlier rule says to classify an observation as an outlier if it falls $1.5 \times IQR$ above the third quartile or $1.5 \times IQR$ below the first quartile.

Example 1.8. For the shelf life data in Example 1.4 and Example 1.7, we see that

$$IQR = Q_3 - Q_1 = 252 - 217 = 35.$$

Thus, observations above

$$Q_3 + 1.5 \times IQR = 252 + 1.5(35) = 304.5$$

or below

$$Q_1 - 1.5 \times IQR = 217 - 1.5(35) = 164.5$$

would be classified as outliers. Thus, none of the shelf life observations would be classified as outliers using this rule.

TERMINOLOGY: The sample variance of $x_1, x_2, \ldots, x_n$ is denoted by $s^2$ and is given by

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right].$$

RATIONALE: We are trying to measure how far observations deviate from the sample mean $\bar{x}$. A natural quantity to look at is each observation's deviation from the mean, i.e., $x_i - \bar{x}$. However, one can show that

$$\sum_{i=1}^{n}(x_i - \bar{x}) = 0;$$

that is, the positive deviations and negative deviations "cancel each other out" when you add them.

REMEDY: Devise a measure that maintains the magnitude of each deviation but ignores their signs. Squaring each deviation achieves this goal. The quantity

$$\sum_{i=1}^{n}(x_i - \bar{x})^2$$

is called the sum of squared deviations. Dividing $\sum_{i=1}^{n}(x_i - \bar{x})^2$ by $n - 1$ leaves (approximately) an average of the $n$ squared deviations. This is the sample variance.

TERMINOLOGY: The sample standard deviation is the positive square root of the variance; i.e.,

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}.$$

FACTS ABOUT THE STANDARD DEVIATION AND VARIANCE:

1. The larger the value of $s$ ($s^2$), the more variation in the data $x_1, x_2, \ldots, x_n$.
2. $s \ge 0$ ($s^2 \ge 0$).
3. If $s = 0$ ($s^2 = 0$), then $x_1 = x_2 = \cdots = x_n$. That is, all the data values are equal (there is zero spread).
4. $s$ and $s^2$, in general, are heavily influenced by outliers.
5. $s$ is measured in the original units of the data; $s^2$ is measured in (units)$^2$.
6. The quantities $s$ and $s^2$ are used mostly when the distribution of the sample data $x_1, x_2, \ldots, x_n$ is approximately symmetric (as opposed to skewed distributions).
7. Computing $s$ and $s^2$ can be very tedious if $n$ is large.
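
As a rough numerical check on the definitions above, the following Python sketch computes the sample variance of the 25 shelf-life values from Table 1.2 in two ways: from the squared deviations and from the equivalent shortcut form based on $\sum x_i$ and $\sum x_i^2$. It also applies the informal $1.5 \times IQR$ rule using the quartiles quoted in Example 1.7. The choice of Python and all variable names are assumptions made for illustration only; they are not part of the original notes.

```python
import math
import statistics

# Shelf-life data (days) from Table 1.2 of the notes.
shelf_life = [262, 188, 234, 203, 212, 212, 301, 225, 241, 211, 231, 227, 217,
              252, 206, 281, 251, 219, 268, 231, 279, 243, 241, 290, 249]

n = len(shelf_life)
xbar = sum(shelf_life) / n

# Definition: sum of squared deviations divided by n - 1.
ss_dev = sum((x - xbar) ** 2 for x in shelf_life)
var_definition = ss_dev / (n - 1)

# Shortcut form: sum of squares minus (sum)^2 / n, divided by n - 1.
sum_x = sum(shelf_life)
sum_x2 = sum(x * x for x in shelf_life)
var_shortcut = (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Standard-library routine (also uses the n - 1 divisor).
var_library = statistics.variance(shelf_life)
sd = math.sqrt(var_library)

# Informal 1.5 x IQR rule, taking Q1 = 217 and Q3 = 252 from Example 1.7
# (quartile conventions differ between packages, so these are hard-coded).
q1, q3 = 217, 252
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in shelf_life if x < lower_fence or x > upper_fence]

print(round(var_definition, 2), round(var_shortcut, 2), round(var_library, 2))
print(round(sd, 2), lower_fence, upper_fence, outliers)
```

The two hand formulas should agree with the library value up to floating-point rounding. The shortcut form is convenient for hand calculation, but in floating-point arithmetic it can lose precision when the mean is large relative to the spread, so the deviation form is usually the safer one to code.
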
Fortunately, many hand-held calculators can compute the standard deviation and variance easily, and, of course, many software packages can as well.

Example 1.9. Keeping plants healthy requires an understanding of the organisms and agents that cause disease as well as an understanding of how plants grow and are affected by disease. In a phytopathology experiment studying disease transmission in insects, $x$ denotes the number of insects per plot. A sample of $n = 5$ plots is observed, yielding $x_1 = 5$, $x_2 = 7$, $x_3 = 4$, $x_4 = 9$, and $x_5 = 5$. Calculating the sum and the uncorrected sum of squares, we get

$$\sum_{i=1}^{5} x_i = 5 + 7 + 4 + 9 + 5 = 30$$

$$\sum_{i=1}^{5} x_i^2 = 5^2 + 7^2 + 4^2 + 9^2 + 5^2 = 196.$$

The sample mean is $\bar{x} = \frac{1}{5}(30) = 6$ insects/plot. The sum of squared deviations is equal to

$$\sum_{i=1}^{5}(x_i - \bar{x})^2 = \sum_{i=1}^{5} x_i^2 - \frac{1}{5}\left(\sum_{i=1}^{5} x_i\right)^2 = 196 - \frac{1}{5}(30)^2 = 196 - 180 = 16.$$

Thus, the sample variance is $s^2 = \frac{1}{4}\sum_{i=1}^{5}(x_i - \bar{x})^2 = 16/4 = 4$, and the sample standard deviation is $s = \sqrt{4} = 2$ insects/plot.

Example 1.10. With our shelf life data from Example 1.4, we now compute the variance and standard deviation. Recall that in Example 1.6 we computed $\bar{x} = 238.96$ days. First, we compute

$$\sum_{i=1}^{25} x_i = 262 + 188 + \cdots + 249 = 5974$$

$$\sum_{i=1}^{25} x_i^2 = 262^2 + 188^2 + \cdots + 249^2 = 1447768.$$

Thus,

$$\sum_{i=1}^{25}(x_i - \bar{x})^2 = \sum_{i=1}^{25} x_i^2 - \frac{1}{25}\left(\sum_{i=1}^{25} x_i\right)^2 = 1447768 - \frac{1}{25}(5974)^2 = 20220.96,$$

and the sample variance is $s^2 = \frac{1}{24}\sum_{i=1}^{25}(x_i - \bar{x})^2 = 20220.96/24 = 842.54$. The sample standard deviation is $s = \sqrt{842.54} \approx 29.03$ days. The unit of measurement associated with the variance is days$^2$, which has no physical meaning. This is one of the advantages of using the standard deviation as a measure of variation.

#### 1.2.4 The empirical rule

The empirical rule, or the 68-95-99.7 rule, says that if a histogram of observations is approximately symmetric, then

- approximately 68 percent of the observations will be within one standard deviation of the mean,
- approximately 95 percent of the observations will be within two standard deviations of the mean, and
- approximately 99.7 percent (or almost all) of the observations will be within three standard deviations of the mean.

We will see the justification for this in the next section when we start
discussing the normal distribution.

Example 1.11. Returning to our shelf life data from Example 1.4, recall that the histogram of measurements was approximately symmetric (see Figure 1.3). Hence, the data should follow the empirical rule. Forming our intervals one, two, and three standard deviations from the mean, we have

- $(\bar{x} - s, \bar{x} + s) = (238.96 - 29.03, 238.96 + 29.03)$, or $(209.93, 267.99)$,
- $(\bar{x} - 2s, \bar{x} + 2s) = (238.96 - 2 \times 29.03, 238.96 + 2 \times 29.03)$, or $(180.91, 297.02)$,
- $(\bar{x} - 3s, \bar{x} + 3s) = (238.96 - 3 \times 29.03, 238.96 + 3 \times 29.03)$, or $(151.88, 326.04)$.

Checking the empirical rule with the shelf life data, we see that 17/25, or 68 percent, fall in $(209.93, 267.99)$; 24/25, or 96 percent, fall in $(180.91, 297.02)$; and all the data fall in the interval $(151.88, 326.04)$. Thus, for the shelf life data, we see a close agreement with the empirical rule.

A NOTE ON COMPUTING STATISTICS BY HAND: Fortunately, we have calculators and computing to perform calculations involved with getting means and standard deviations. Except where explicitly instructed to perform calculations by hand, please feel free to use SAS, Minitab, or any other computer package that suits you. The goal of this course is not to be inundated with hand calculations (although we will be doing some!).

#### 1.2.5 Linear transformations

REMARK: In many data analysis settings, we often would like to work with data on different scales. Converting data to a different scale is called a transformation.

LINEAR TRANSFORMATIONS: A linear transformation changes the original variable $x$ into the new variable $u$ in a way given by the following formula:

$$u = a + bx,$$

where $b$ is nonzero and $a$ is any number.

EFFECTS OF LINEAR TRANSFORMATIONS: Suppose that we have data $x_1, x_2, \ldots, x_n$ measured on some original scale with sample mean $\bar{x}$ and sample variance $s_x^2$. Consider transforming the data by

$$u_i = a + b x_i,$$

for $i = 1, 2, \ldots, n$, so that we have the "new" data $u_1 = a + b x_1$, $u_2 = a + b x_2$, ..., $u_n = a + b x_n$. Then, the mean and variance of the transformed data $u_1, u_2, \ldots, u_n$ are given by

$$\bar{u} = a + b\bar{x} \quad \text{and} \quad s_u^2 = b^2 s_x^2.$$

Furthermore, the standard deviation
of $u_1, u_2, \ldots, u_n$ is given by $s_u = |b| s_x$, where $|b|$ denotes the absolute value of $b$. It is interesting to note that the constant $a$ does not affect the variance or the standard deviation (why?).

Example 1.12. A common linear transformation characterizes the relationship between the Celsius and Fahrenheit temperature scales. Letting $x$ denote the Celsius temperature, we can transform to the Fahrenheit scale $u$ by

$$u = 32 + 1.8x.$$

You will note that this is a linear transformation with $a = 32$ and $b = 1.8$. From the web, I found the high daily temperatures recorded in June for Baghdad, Iraq (the last twenty years' worth of data; i.e., 600 observations!). I plotted the data (in Celsius) in Figure 1.7. I then transformed the data to Fahrenheit using the formula above and plotted the Fahrenheit data as well. You should note the shape of the distribution is unaffected by the linear transformation. This is true in general; that is, linear transformations do not change the shape of a distribution.

[Figure 1.7: High June temperatures in Baghdad from 1984-2004. Left: high June temperatures (C). Right: high June temperatures (F).]

MEAN, VARIANCE, AND STANDARD DEVIATION COMPARISONS: I asked the software package R to compute the mean, variance, and standard deviation of the 600 Celsius observations. I also asked for the Five Number Summary.

    > mean(x)
    [1] 32.341
    > var(x)
    [1] 4.116
    > stdev(x)
    [1] 2.029
    > summary(x)
    Min.  1st Qu.  Median  3rd Qu.  Max.
    25.1  31.0     32.4    33.6     38.4

We see that $\bar{x} = 32.341$, $s_x^2 = 4.116$, and $s_x = 2.029$. What are the mean, variance, and standard deviation of the data measured in Fahrenheit? I also asked R to give us these.

    > mean(u)
    [1] 90.214
    > var(u)
    [1] 13.335
    > stdev(u)
    [1] 3.652
    > summary(u)
    Min.  1st Qu.  Median  3rd Qu.  Max.
    77.1  87.8     90.3    92.6     101.1

EXERCISE: Verify that $\bar{u} = a + b\bar{x}$, $s_u^2 = b^2 s_x^2$, and $s_u = |b| s_x$ for these data (recall that $a = 32$ and $b = 1.8$). Also, with regard to the five values in the Five Number Summary, what are the relationships that
result for each under a linear transformation? How is the IQR affected by a linear transformation?

1.3 Density curves and normal distributions

Example 1.13. Low infant birth weight is a fairly common problem that expecting mothers experience. In Larimer County, CO, a group of researchers studied mothers aged from 25 to 34 years during 1999 to 2003. During this time, 1,242 male birth weights were recorded. The data appear in a relative frequency histogram in Figure 1.8.

SIDE NOTE: A relative frequency histogram is a special histogram. Percentages are plotted on the vertical axis (instead of counts). Plotting percentages does not alter the shape of the histogram.

DENSITY CURVE: On Figure 1.8, I have added a density curve to the histogram. One can think of a density curve as a smooth approximation to a histogram of data. Since density curves serve as approximations to real-life data, it is sometimes convenient to think about them as theoretical models for the variable of interest (here, birth weights).

Figure 1.8: Distribution of American male birth weights (horizontal axis: male birth weights in lbs). A density curve has been superimposed over the relative frequency histogram.

PROPERTIES: A density curve describes the overall pattern of a distribution. In general, a density curve associated with a quantitative variable $x$ (e.g., birth weights, etc.) is a curve with the following properties:

1. the curve is non-negative
2. the area under the curve is one
3. the area under the curve between two values, say, $a$ and $b$, represents the proportion of observations that fall in that range.

EXAMPLES: For example, the proportion of male newborns that weigh between 7 and 9 lbs is given by the area under the density curve between 7 and 9. Also, the proportion of newborns that are of low birth weight (i.e., < 5.5 lbs) is the area under the density curve to the left of 5.5. In general, finding these areas might not be straightforward.
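The three properties of a density curve can be checked numerically. The sketch below is only an illustration: the notes do not give a formula for the birth weight curve, so a hypothetical triangular density on 4 to 12 lbs (peaking at 8 lbs) stands in for it, and the function names `triangular_density` and `area_under` are made up for this example. Areas are approximated with the trapezoid rule.

```python
def triangular_density(x, lo=4.0, peak=8.0, hi=12.0):
    """Hypothetical triangular density on [lo, hi] with its mode at `peak`.

    This is a stand-in for illustration only; it is NOT the curve in Figure 1.8.
    """
    if x < lo or x > hi:
        return 0.0
    height = 2.0 / (hi - lo)  # peak height chosen so the total area is one
    if x <= peak:
        return height * (x - lo) / (peak - lo)
    return height * (hi - x) / (hi - peak)


def area_under(f, a, b, n=1000):
    """Trapezoid-rule approximation to the area under f between a and b."""
    step = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * step)
    return total * step


# Property 2: the total area under the curve should be one.
total_area = area_under(triangular_density, 4.0, 12.0)

# Property 3: areas correspond to proportions of observations.
between_7_and_9 = area_under(triangular_density, 7.0, 9.0)
below_5_5 = area_under(triangular_density, 4.0, 5.5)

print(f"total area under the curve: {total_area:.4f}")
print(f"proportion between 7 and 9: {between_7_and_9:.4f}")
print(f"proportion below 5.5:       {below_5_5:.4f}")
```

For a density with no simple geometric shape, the same `area_under` helper could be applied to whatever formula describes the curve; only the stand-in density would change.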
Figure 1.9: Density curve for the remission times of leukemia patients (horizontal axis: remission times in months). The shaded region represents the proportion of patient times between 10 and 20 months.

MATHEMATICAL DEFINITION: A density curve is a function $f$ which describes the distribution of values taken on by a quantitative variable. The function $f$ has the following properties:

1. $f(x) \ge 0$ (nonnegative)
2. $\int_{-\infty}^{\infty} f(x)\,dx = 1$ (area under the curve is 1)
3. the proportion of observations that fall between $a$ and $b$ is given by $\int_a^b f(x)\,dx$ (the area under the curve between $a$ and $b$).

Example 1.14. Researchers have learned that the best way to cure patients with acute lymphoid leukemia is to administer large doses of several chemotherapeutic drugs over a short period of time. Suppose that the density curve in Figure 1.9 represents the remission time for a certain group of leukemia patients. The equation of the curve is

$$f(x) = \frac{1}{128}\, x^2 e^{-x/4},$$

for values of $x > 0$ (here, $x$ is our variable representing remission time). If this is the true model for these remission times, what is the proportion of patients that will experience remission times between 10 and 20 months? That is, what is the area of the shaded region in Figure 1.9? Using calculus, we could compute

$$\int_{10}^{20} f(x)\,dx = \int_{10}^{20} \frac{1}{128}\, x^2 e^{-x/4}\,dx \approx 0.42.$$

Thus, about 42 percent of the patients will experience remission times between 10 and 20 months.

Figure 1.10: Four density curves (panels: proportion of response to drug, battery lifetimes, standardized test scores, remission times).

1.3.1 Measuring the center and spread for density curves

MAIN POINT: The notion of center and spread is the same for density curves as before. However, because we are talking about theoretical models (instead of raw data), we change our notation to reflect this. In particular,

• The mean for a density curve is denoted by $\mu$.
• The variance for a density curve is denoted by $\sigma^2$.
• The standard deviation for a density
curve is denoted by $\sigma$.

SUMMARY: Here is a review of the notation that we have adopted so far in the course.

                          Observed data    Density curve
    Mean                  $\bar{x}$        $\mu$
    Variance              $s^2$            $\sigma^2$
    Standard deviation    $s$              $\sigma$

IMPORTANT NOTE: The sample values $\bar{x}$ and $s$ can be computed from observed data. The population values $\mu$ and $\sigma$ are theoretical values. In practice, these values are often unknown (i.e., we cannot compute these values from observed sample data).

COMPARING MEAN AND MEDIAN: The mean for a density curve $\mu$ can be thought of as a "balance point" for the distribution. On the other hand, the median for a density curve is the point that divides the area under the curve in half. The relationship between the mean and median is the same as it was before; namely,

• if a density curve is perfectly symmetric, the median and mean will be equal
• if a density curve is skewed right, the mean will be greater than the median
• if a density curve is skewed left, the median will be greater than the mean.

MODE: The mode for a density curve is the point where the density curve is at its maximum. For example, in Figure 1.9, the mode looks to be around 8 months.

1.3.2 Normal density curves

NORMAL DENSITY CURVES: The most famous (and important) family of density curves is the normal family, or normal distributions. The function that describes the normal distribution with mean $\mu$ and standard deviation $\sigma$ is given by

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}.$$

Figure 1.11: A normal density curve with mean $\mu = 25$ and standard deviation $\sigma = 5$ (horizontal axis: tomato yields $x$, bushels/acre; vertical axis: $f(x)$). A model for tomato yields.

FACTS: Here are some properties for the normal family of density curves:

• the curves are symmetric (thus, the empirical rule holds perfectly)
• the area under a normal density curve is one
• the mean, median, and mode are all equal
• the curves are mound shaped and unimodal
• the change-of-curvature points are at $\mu \pm \sigma$.

EMPIRICAL RULE:

• 68 percent (68.26) of the observations will be within one $\sigma$ of $\mu$,
•
95 percent (95.45) of the observations will be within two $\sigma$ of $\mu$, and
• 99.7 percent (99.73) of the observations will be within three $\sigma$ of $\mu$.

Example 1.15. For the normal density curve in Figure 1.11, which is used as a model for tomato yields $x$, measured in bushels per acre, the empirical rule says that

• about 68 percent of the observations (yields) will be within $25 \pm 5$, or $(20, 30)$ bushels/acre
• about 95 percent of the observations (yields) will be within $25 \pm 2 \times 5$, or $(15, 35)$ bushels/acre
• about 99.7 percent of the observations (yields) will be within $25 \pm 3 \times 5$, or $(10, 40)$ bushels/acre.

PRACTICE QUESTIONS USING THE EMPIRICAL RULE: Concerning Figure 1.11, a normal distribution with mean $\mu = 25$ and standard deviation $\sigma = 5$,

• what percentage of yields will be less than 20 bushels per acre?
• what percentage of yields will be greater than 35 bushels per acre?
• what percentage of yields will be between 10 and 35 bushels per acre?
• what percentage of yields will be between 15 and 30 bushels per acre?

SHORT-HAND NOTATION FOR NORMAL DISTRIBUTIONS: Because we will mention normal distributions often, a short-hand notation is useful. We abbreviate the normal distribution with mean $\mu$ and standard deviation $\sigma$ by $\mathcal{N}(\mu, \sigma)$. To denote that a variable $X$ follows a normal distribution with mean $\mu$ and standard deviation $\sigma$, we write $X \sim \mathcal{N}(\mu, \sigma)$.

Figure 1.12: Effects of changing $\mu$ and $\sigma$ on the shape of the normal density curve.

NORMAL DENSITY CURVE SHAPES: Figure 1.12 displays four normal distributions.

• The left two are the $\mathcal{N}(25, 4)$ and $\mathcal{N}(25, 8)$ density curves. Note how the $\mathcal{N}(25, 8)$ distribution has an increased spread (i.e., more variability).
• The right two are the $\mathcal{N}(25, 4)$ and $\mathcal{N}(30, 4)$ density curves. Note how the $\mathcal{N}(30, 4)$ distribution has been shifted over to the right (i.e., the mean has shifted).

1.3.3 Standardization

QUESTION: For the normal density curve in Figure 1.11, what is the percentage of yields between 26.4 and 37.9 bushels per acre?

GOAL: For any $\mathcal{N}(\mu, \sigma)$ distribution, we would like to
compute areas under the curve. Of course, using the empirical rule, we know that we can find areas in special cases (i.e., when values happen to fall at $\mu \pm \sigma$, $\mu \pm 2\sigma$, and $\mu \pm 3\sigma$). How can we do this in general?

IMPORTANT RESULT: Suppose that the variable $X \sim \mathcal{N}(\mu, \sigma)$, and let $x$ denote a specific value of $X$. We define the standardized value of $x$ to be

$$z = \frac{x - \mu}{\sigma}.$$

Standardized values are sometimes called z-scores.

FACTS ABOUT STANDARDIZED VALUES:

• they are unitless quantities
• they indicate how many standard deviations an observation falls above (or below) the mean $\mu$.

CAPITAL VERSUS LOWERCASE NOTATION: From here on out, our convention will be to use a capital letter $X$ to denote a variable of interest. We will use a lowercase letter $x$ to denote a specific value of $X$. This is standard notation.

STANDARD NORMAL DISTRIBUTION: Suppose that the variable $X \sim \mathcal{N}(\mu, \sigma)$. Then, the standardized variable

$$Z = \frac{X - \mu}{\sigma}$$

has a normal distribution with mean 0 and standard deviation 1. We call $Z$ a standard normal variable and write $Z \sim \mathcal{N}(0, 1)$. We call $\mathcal{N}(0, 1)$ the standard normal distribution.

Example 1.16. SAT math scores are normally distributed with mean $\mu = 500$ and standard deviation $\sigma = 100$. Let $X$ denote the SAT math score so that $X \sim \mathcal{N}(500, 100)$.

(a) Suppose that I got a score of $x = 625$. What is my standardized value? How is this value interpreted?
(b) What is the standardized value of $x = 342$? How is this value interpreted?
(c) Does the variable $(X - 500)/100$ have a normal distribution? What are its mean and standard deviation?

Figure 1.13: Standard normal areas. Left: Area to the left of $z = 2.31$. Right: Area to the left of $z = -1.48$.

FINDING AREAS UNDER THE STANDARD NORMAL DENSITY CURVE: Table A (in your textbook) provides areas under the standard normal curve. Note that for a particular value of $z$, the tabled entry gives the area to the left of $z$. For example,

• the area to the left of $z = 2.31$ is 0.9896
• the area to the left of $z = -1.48$ is 0.0694.

QUESTIONS FOR YOU:

• What
is the area to the left of $z = 0.34$?
• What is the area to the right of $z = 2.31$?
• What is the area to the right of $z = -2.00$?
• What is the area between $z = -1.48$ and $z = 2.31$?

IMPORTANT: When you are making standard normal area calculations, I highly recommend that you sketch a picture of the distribution, place the values of $z$ that you are given, and shade in the appropriate area(s) you want. This helps tremendously!

ONLINE RESOURCE: The authors of the text have created an applet that will compute areas under the standard normal curve (actually, for any normal curve). It is located at http://bcs.whfreeman.com/ips5e

1.3.4 Finding areas under any normal curve

FINDING AREAS: Computing the area under any normal curve can be done by using the following steps:

1. State the problem in terms of the variable of interest $X$, where $X \sim \mathcal{N}(\mu, \sigma)$.
2. Restate the problem in terms of $Z$ by standardizing $X$.
3. Find the appropriate area using Table A.

Example 1.17. For the $\mathcal{N}(25, 5)$ density curve in Figure 1.11, what is the percentage of yields between 26.4 and 37.9 bushels per acre?

SOLUTION:

1. Here, the variable of interest is $X$, the yield (measured in bushels per acre). Stating the problem in terms of $X$, we would like to consider $26.4 < X < 37.9$.

2. Now, we standardize by subtracting the mean and dividing by the standard deviation. Here, $\mu = 25$ and $\sigma = 5$:

$$\frac{26.4 - 25}{5} < \frac{X - 25}{5} < \frac{37.9 - 25}{5}, \quad \text{that is,} \quad 0.28 < Z < 2.58.$$

3. Finally, we find the area between $z = 0.28$ and $z = 2.58$ on the standard normal distribution using Table A.

• the area to the left of $z = 2.58$ is 0.9951 (Table A)
• the area to the left of $z = 0.28$ is 0.6103 (Table A).

Thus, the area between $z = 0.28$ and $z = 2.58$ is $0.9951 - 0.6103 = 0.3848$.

Figure 1.14: Standard normal distribution. Area between $z = 0.28$ and $z = 2.58$.

ANSWER: So, about 38.48 percent of the tomato yields will be between 26.4 and 37.9 bushels per acre.

QUESTIONS FOR YOU: In Example 1.17,

• What proportion of yields will exceed 34.5 bushels per acre?
• What
proportion of yields will be less than 18.8 bushels per acre?

Figure 1.15: Standard normal distribution (horizontal axis: battery lifetimes $x$). Unshaded area is the upper ten percent of the distribution.

1.3.5 Inverse normal calculations: Finding percentiles

OBSERVATION: In the last subsection, our goal was to find the area under a normal density curve. This area represents a proportion of observations falling in a certain interval (e.g., between 26.4 and 37.9, etc.). Now, our goal is to go "the other way"; that is, we want to find the observed value(s) corresponding to a given proportion.

Example 1.18. The lifetime of a cardiac pacemaker battery is normally distributed with mean $\mu = 250$ days and standard deviation $\sigma = 23$ days. Ten percent of the batteries will last longer than how many days?

SOLUTION: Denote by $X$ the lifetime of a pacemaker battery so that $X \sim \mathcal{N}(250, 23)$. We want to find the 90th percentile of the $\mathcal{N}(250, 23)$ distribution. See Figure 1.15. We will solve this by first finding the 90th percentile of the standard normal distribution. From Table A, this is given by $z = 1.28$. Note that the area to the left of $z = 1.28$ is 0.8997 (as close to 0.9000 as possible). Now, we unstandardize; that is, we set

$$\frac{x - 250}{23} = 1.28.$$

Solving for $x$, we get $x = 1.28(23) + 250 = 279.44$. Thus, 10 percent of the battery lifetimes will exceed 279.44 days.

FORMULA FOR PERCENTILES: Since, in general, $z = (x - \mu)/\sigma$, we see that $x = \sigma z + \mu$. Thus, if $z$ denotes the $p$th percentile of the $\mathcal{N}(0, 1)$ distribution, then $x = \sigma z + \mu$ is the $p$th percentile of the $\mathcal{N}(\mu, \sigma)$ distribution.

QUESTIONS FOR YOU:

• What is the 80th percentile of the standard normal distribution? The 20th?
• In Example 1.18, five percent of battery lifetimes will be below which value?
• In Example 1.18, what is the proportion of lifetimes between 214 and 282 days?

1.3.6 Diagnosing normality

Example 1.19. Since an adequate supply of oxygen is necessary to support life in a body of water, a determination of the amount of oxygen provides a means of assessing the quality of the water
with respect to sustaining life. Dissolved oxygen (DO) levels provide information about the biological, biochemical, and inorganic chemical reactions occurring in aquatic environments. In a marine biology study, researchers collected $n = 164$ water specimens and recorded the DO content (measured in parts per million). A histogram of the data appears in Figure 1.16. Are these data normally distributed? How can we tell?

Figure 1.16: Dissolved oxygen contents for $n = 164$ water specimens (horizontal axis: DO content, ppm).

GOAL: Given a set of data $x_1, x_2, \ldots, x_n$, we would like to determine whether or not a normal distribution adequately "fits" the data.

CHECKING NORMALITY: Here are some things we can do to check whether or not data are well represented by a normal distribution:

1. Plot the data!
2. Compute summary measures; check to see if the data follow the empirical rule.
3. Construct a normal quantile plot.

NORMAL SCORES: A set of $n$ normal scores from the standard normal distribution are the $z$ values which partition the density curve into $n + 1$ equal areas. That is, the normal scores are percentiles from the standard normal distribution!

EXERCISE: With $n = 4$, find the normal scores from the standard normal distribution.

NORMAL QUANTILE PLOTS: A normal quantile plot plots the observed data (ordered from low to high and then suitably standardized) versus the corresponding $n$ normal scores from a standard normal distribution.

Figure 1.17: Normal quantile plot for the DO content data in Example 1.19 (horizontal axis: standard normal quantiles; vertical axis: standardized DO contents).

REALIZATION: If we plot our ordered standardized data versus the normal scores, then

• if the resulting plot is relatively straight, this supports the notion that the data are normally distributed (i.e., a normal density curve "fits" the data well)
• if the resulting plot has "heavy tails" and is curved, this supports the notion that the data are not normally
distributed (i.e., a normal density curve does not "fit" the data well).

CONCLUSION: The normal quantile plot for the DO data in Example 1.19 is given in Figure 1.17. A normal distribution reasonably fits the dissolved oxygen data.

2 Looking at Data: Relationships

OVERVIEW: A problem that often arises in the biological and physical sciences, economics, industrial applications, and biomedical settings is that of investigating the mathematical relationship between two or more variables.

EXAMPLES:

• amount of alcohol in the body (BAL) versus body temperature (degrees C)
• weekly fuel consumption (degree days) versus house size (square feet)
• amount of fertilizer (pounds/acre) applied versus yield (kg/acre)
• sales (1000s) versus marketing expenditures (1000s)
• HIV status (yes/no) versus education level (e.g., primary, secondary, college, etc.)
• gender versus promotion (yes/no). Are promotion rates different?
• remission time (in days) versus treatment (e.g., surgery/chemotherapy/both).

REMARK: In Chapter 1, our focus was primarily on graphical and numerical summaries of data for a single variable (categorical or quantitative). In this chapter, we will largely look at situations where we have two quantitative variables.

TERMINOLOGY: Two variables measured on the same individuals are said to be associated if values of one variable tend to occur with values of the other variable.

Example 2.1. Many fishes have a lateral line system enabling them to experience mechanoreception, the ability to sense physical contact on the surface of the skin or movement of the surrounding environment, such as sound waves in air or water. In an experiment to study this, researchers subjected fish to electrical impulses. The frequency (number per second) of electrical impulses (EI) emitted from one particular fish was measured at several temperatures (measured in Celsius); the data are listed in Table 2.5.

Figure 2.18 (horizontal axis: temperature in Celsius; vertical axis: frequency, number per second): EI frequency at different
temperatures.

The scatterplot for the data appears in Figure 2.18. It is clear that these variables are strongly related. As the water temperature increases, there is a tendency for the frequency of impulses to increase as well.

Table 2.5: Electrical impulse data.

    Temperature    Frequency    Temperature    Frequency
    20             224          27             301
    22             252          28             306
    23             267          30             318
    25             287

VARIABLES: In studies where there are multiple variables under investigation (e.g., temperature, EI frequency), it is common that one desires to study how one variable is affected by the other(s). In some problems (not all), it makes sense to focus on the behavior of one variable and, in particular, determine how another variable influences it. In a scientific investigation, the main variable under consideration is called the response variable. An explanatory variable explains or causes changes in the response variable (there may be more than one explanatory variable!). In Example 2.1, EI frequency is the response variable, and temperature is the explanatory variable.

NOTATION: We denote the explanatory variable by $x$ and the response variable by $y$.

NOTE: Explanatory variables are sometimes called independent variables. A response variable is sometimes called a dependent variable.

2.1 Scatterplots

SCATTERPLOTS: A scatterplot is a graphical display that plots observations on two quantitative variables. It is customary that the response variable is placed on the vertical axis. The explanatory variable is placed on the horizontal. Scatterplots give a visual impression of how the two variables behave together.

INTERPRETING SCATTERPLOTS: It is important to describe the overall pattern of a scatterplot by examining the following:

• form: are there curved relationships or different clusters of observations?
• direction: are the two variables positively related or negatively related?
• strength: how strong is the relationship? Obvious? Weak? Moderate?
• the presence of outliers.

LINEAR RELATIONSHIPS: If the form of the scatterplot looks
to resemble a straight line trend, we say that the relationship between the variables is linear.

TERMINOLOGY: Two variables are positively related if they tend to increase together. They are negatively related if an increase in one is associated with a decrease in the other. The data in Figure 2.18 display a strong positive relationship.

Figure 2.19: Four scatterplots. Upper left: positive linear relationship. Upper right: mild linear negative relationship. Lower left: curved relationship. Lower right: random scatter.

ADDING CATEGORICAL VARIABLES TO SCATTERPLOTS: In some situations, we might want to add a third variable to a scatterplot. As long as this third variable is categorical in nature, we can do this by using different plotting symbols for the levels of the categorical variable.

Example 2.2. An engineer is studying the effects of the pH for a cleansing tank and polymer type on the amount of suspended solids in a coal cleansing system. Data from the experiment are given in Table 2.6. The engineer believes that the explanatory variable pH ($X$) is important in describing the response variable $Y$, the amount of suspended solids (measured in ounces). However, she is also studying three different polymer types (generically denoted by A, B, and C). In Figure 2.20, different plotting symbols are used to differentiate among the three polymer types. What is the relationship between the amount of suspended solids and pH for each polymer?

Figure 2.20: Amount of suspended material, as a function of pH, for three polymers.

Table 2.6: Cleansing data for different polymers.

    Polymer A        Polymer B        Polymer C
    y      x         y      x         y      x
    292    6.5       410    9.2       167    6.5
    329    6.9       198    6.7       225    7.0
    352    7.8       227    6.9       247    7.2
    378    8.4       277    7.5       268    7.6
    392    8.8       297    7.9       288    8.7
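One way to prepare a scatterplot with a categorical third variable is to split the observations by level and give each level its own plotting symbol. The sketch below is a hedged illustration, not the procedure used to draw Figure 2.20: the values are the Table 2.6 entries under the assumption that the flattened table reads row by row as Polymer A, B, C, and the names `cleansing_data`, `MARKERS`, and `assign_markers` are invented for this example.

```python
# (pH, suspended solids) pairs for each polymer, taken from Table 2.6.
# Assumption: the flattened table is read row by row as A, B, C.
cleansing_data = {
    "A": [(6.5, 292), (6.9, 329), (7.8, 352), (8.4, 378), (8.8, 392)],
    "B": [(9.2, 410), (6.7, 198), (6.9, 227), (7.5, 277), (7.9, 297)],
    "C": [(6.5, 167), (7.0, 225), (7.2, 247), (7.6, 268), (8.7, 288)],
}

# Candidate plotting symbols; one is needed per level of the categorical variable.
MARKERS = ["o", "^", "s", "x", "+"]


def assign_markers(levels, symbols=MARKERS):
    """Map each level of a categorical variable to its own plotting symbol."""
    ordered = sorted(levels)
    if len(ordered) > len(symbols):
        raise ValueError("not enough distinct plotting symbols")
    return {level: symbols[i] for i, level in enumerate(ordered)}


symbol_for = assign_markers(cleansing_data.keys())

# One series per polymer: explanatory values (pH), responses, and the symbol.
series = []
for polymer, points in sorted(cleansing_data.items()):
    series.append({
        "polymer": polymer,
        "marker": symbol_for[polymer],
        "x": [ph for ph, _ in points],
        "y": [solids for _, solids in points],
    })

for s in series:
    print(f"Polymer {s['polymer']} (symbol {s['marker']}): "
          f"pH {min(s['x'])} to {max(s['x'])}, "
          f"solids {min(s['y'])} to {max(s['y'])}")
```

Each entry of `series` could then be handed to a plotting routine, one call per polymer, with pH on the horizontal axis and the amount of suspended solids on the vertical axis.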
SIDE-BY-SIDE BOXPLOTS: When we have a quantitative response variable $Y$ and a categorical explanatory variable $X$, we can display the relationship between these variables using side-by-side boxplots. These plots are very helpful with data from designed experiments, where the response variable is often quantitative and the goal is often to compare two or more treatments (so that the categorical variable is "treatment").

Example 2.3. The operators of a nursery would like to investigate differences among three fertilizers (denoted by f1, f2, and f3) they might use on plants they are growing for commercial sale. The researchers have 24 seedlings and decide to use 8 seedlings for each fertilizer. At the end of six weeks, the height of each seedling, $Y$ (measured in mm), is collected. The data from the experiment are displayed in Figure 2.21.

Figure 2.21: Seedlings height data for three fertilizers (horizontal axis: fertilizer f1, f2, f3; vertical axis: height in mm).

2.2 Correlation

SCENARIO: We would like to study the relationship between two quantitative variables, $x$ and $y$. We observe the pair $(X, Y)$ on each of $n$ individuals in our sample, and we wish to use these data to say something about the relationship.

NOTE: Scatterplots give us graphical displays of the relationship between two quantitative variables. We now wish to summarize this relationship numerically.

TERMINOLOGY: The correlation is a numerical summary that describes the strength and direction of the linear relationship between two quantitative variables. With a sample of $n$ individuals, denote by $x_i$ and $y_i$ the two measurements for the $i$th individual. The correlation is computed by the following formula:

$$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right),$$

where $\bar{x}$ and $\bar{y}$ are the sample means and $s_x$ and $s_y$ are the sample standard deviations.

REMARK: Unless $n$ is small, it is often best to use statistical software to compute the correlation. You should note that the terms $(x_i - \bar{x})/s_x$ and $(y_i - \bar{y})/s_y$ are the sample standardized values of $x_i$ and $y_i$,
respectively.

PROPERTIES OF THE CORRELATION:

• The correlation $r$ is a unitless number; that is, there are no units attached to it (e.g., dollars, mm, etc.).
• It also makes no difference what you call $x$ and what you call $y$; the correlation will be the same.
• The correlation $r$ always satisfies $-1 \le r \le 1$.
• If $r = 1$, then all data lie on a straight line with positive slope. If $r = -1$, then all the data lie on a straight line with negative slope. If $r = 0$, then there is no linear relationship present in the data.
• When $0 < r < 1$, there is a tendency for the values to vary together in a positive way (i.e., a positive linear relationship). When $-1 < r < 0$, there is a tendency for the values to vary together in a negative way (i.e., a negative linear relationship).
• The correlation only measures linear relationships! It does not describe a curved relationship, no matter how strong that relationship is!
• Thus, we could have two variables $X$ and $Y$ that are perfectly related, but the correlation could still be zero!! This could occur if the variables are related quadratically, for example. See Example 2.5.
• The value of $r$ could be highly affected by outliers. This makes sense, since sample means and sample standard deviations are affected by outliers (and these values are required to compute $r$).

Figure 2.22: Four scatterplots using data generated from Minitab. Upper left: $r = 0$. Upper right: $r = 0.9$. Lower left: $r = -0.99$. Lower right: $r = -0.5$.

Figure 2.23: Plant growth versus water amount.

WARNING: The correlation is by no means a complete measure of a bivariate (i.e., two-variable) data set. Always plot your data before computing the correlation!

Example 2.4. In Example 2.1, the correlation between EI frequency and temperature is $r = 0.981$. This value suggests that there is a strong positive linear relationship between the two variables.

Example 2.5. Researchers are trying to understand the relationship between the amount of
water applied to plots (measured in cm) and total plant growth (measured in cm). A sample of $n = 30$ plots is taken from different parts of a field. The data from the sample are given in Figure 2.23. Using Minitab, the correlation between plant growth and water amount is $r = 0.088$. This is an example where the two variables under investigation (water amount and plant growth) have a very strong relationship, but the correlation is near zero. This occurs because the relationship is not linear; rather, it is quadratic. An investigator who did not plot these data and only looked at the value of $r$ could be led astray and conclude that these variables were not related!

2.3 Least-squares regression

REMARK: Correlation and scatterplots help us to document relationships (correlation only helps us assess linear relationships). The statistical technique known as regression allows us to formally model these relationships. Regression, unlike correlation, requires that we have an explanatory variable and a response variable.

NOTE: In this course, we will restrict attention to regression models for linear relationships with a single explanatory variable (this is called simple linear regression). Techniques for handling nonlinear relationships and/or more than one explanatory variable will be explored in the subsequent course.

Example 2.6. The following data are rates of oxygen consumption of birds measured at different temperatures. Here, the temperatures were set by the investigator, and the O2 rates, $y$, were observed for these particular temperatures.

    x (degrees Celsius)    -18    -15    -10    -5     0      5      10     19
    y (ml/g/hr)            5.2    4.7    4.5    3.6    3.4    3.1    2.7    1.8

The scatterplot of the data appears in Figure 2.24. There is clearly a negative linear relationship between the two variables.

REGRESSION MODELS: A straight-line regression model consists of two parts:

1. a straight-line equation that summarizes the general tendency of the relationship between the two variables, and
2. a measure of variation in the data around the line.

GOALS:
We first need an approach to compute the equation of the line. Then we will look at a numerical summary that measures the variation about the line. Of course, we would like for the variation about the line to be small so that the regression line "fits" the data well.

Figure 2.24: Bird oxygen rate data for different temperatures (horizontal axis: temperature; vertical axis: O2 rate).

USEFULNESS: We can use the resulting regression equation to

• quantify the relationship between $Y$ and $X$
• use the relationship to predict a new response $y^*$ that we might observe at a given value, say, $x^*$ (perhaps one not included in the experiment or study)
• use the relationship to calibrate; that is, given a new $y$ value we might see, say, $y^*$, for which the corresponding value $x$ is unknown, estimate the value $x^*$.

SCENARIO: We would like to find the equation of a straight line that best describes the relationship between two quantitative variables. We observe the pair $(x, y)$ on each of $n$ individuals in our sample, and we wish to use these data to compute the equation of the "best-fit line."

STRAIGHT-LINE EQUATIONS, A REVIEW: Suppose that $y$ is a response variable (plotted on the vertical axis) and that $x$ is an explanatory variable (plotted on the horizontal axis). A straight line relating $y$ to $x$ has an equation of the form

$$y = a + bx,$$

where the constant $b$ represents the slope of the line and $a$ is the y-intercept.

INTERPRETATION: The slope of a regression line gives the amount by which the response $y$ changes when $x$ increases by one unit. The y-intercept gives the value of the response $y$ when $x = 0$.

2.3.1 The method of least squares

TERMINOLOGY: When we say "fit a regression model," we basically mean that we are finding the values of $a$ and $b$ that are most consistent with the observed data. The method of least squares provides a way to do this.

RESIDUALS: For each $y_i$, and given values of $a$ and $b$, note that the quantity

$$e_i = y_i - (a + bx_i)$$

measures the vertical distance from $y_i$ to
the line $a + bx_i$. This distance is called the $i$th residual for particular values of $a$ and $b$. If a point falls above the line in the $y$ direction, the residual is positive. If a point falls below the line in the $y$ direction, the residual is negative. See Figure 2.25. We will discuss residuals more thoroughly in Section 2.4.

RESIDUAL SUM OF SQUARES: A natural way to measure the overall deviation of the observed data from the line is with the residual sum of squares. This is given by

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left[ y_i - (a + bx_i) \right]^2.$$

THE METHOD OF LEAST SQUARES: The most widely accepted method for finding $a$ and $b$ is the method of least squares. The method says to select the values of $a$ and $b$ that minimize the sum of squared residuals. A calculus argument can be used to find equations for $a$ and $b$; these are given by

$$b = r \times \frac{s_y}{s_x} \quad \text{and} \quad a = \bar{y} - b\bar{x}.$$

Figure 2.25: A scatterplot with straight line and residuals associated with the straight line.

EQUATION OF THE LEAST-SQUARES REGRESSION LINE: With the values of $a$ and $b$ just mentioned, the equation of the least-squares regression line is

$$\hat{y} = a + bx.$$

We use the notation $\hat{y}$ to remind us that this is a regression line fit to the data $(x_i, y_i)$; $i = 1, 2, \ldots, n$.

TERMINOLOGY: The values $\hat{y}_i = a + bx_i$ are called fitted values. These are values that fall directly on the line. We should see that the residuals are given by

$$e_i = y_i - \hat{y}_i = \text{Observed } y - \text{Fitted } y.$$

REMARK: Computing the least-squares regression line by hand is not recommended because of the intense computations involved. It is best to use software. However, it is interesting to note that the values of $a$ and $b$ depend on five numerical summaries that we already know; namely, $\bar{x}$, $\bar{y}$, $s_x$, $s_y$, and $r$!!

Figure 2.26: Bird oxygen rate data for different temperatures (horizontal axis: temperature; vertical axis: O2 rate). The least-squares regression line is superimposed.

Example 2.7. We now compute the least-squares regression line for the bird oxygen rate data in Example 2.6. Using Minitab, we obtain the following output.

The regression
equation is O2Rate = 3.47 - 0.0878 Temperature

    Predictor         Coef    SE Coef         T        P
    Constant       3.47142    0.06012     57.74    0.000
    Temperature   -0.08776    0.00499    -17.58    0.000

    S = 0.168249    R-Sq = 98.1%    R-Sq(adj) = 97.8%

From the output, we see that $a = 3.47$ and $b = -0.0878$. Thus, using symbols, our best-fit regression line is

$$\hat{y} = 3.47 - 0.0878x.$$

INTERPRETATION:

• The slope $b = -0.0878$ is interpreted as follows: for a one-unit (degree) increase in temperature, we would expect the oxygen rate to decrease by 0.0878 ml/g/hr.
• The y-intercept $a = 3.47$ is interpreted as follows: for a temperature of $x = 0$, we would expect the oxygen rate to be 3.47 ml/g/hr.

2.3.2 Prediction and calibration

PREDICTION: One of the nice things about a regression line is that we can make predictions. For our oxygen data in Example 2.6, suppose we wanted to predict a future oxygen consumption rate (for a new bird used in the experiment, say) when the temperature is set at $x^* = 2.5$ degrees. Using our regression equation, the predicted value $\hat{y}^*$ is given by

$$\hat{y}^* = 3.47 - 0.0878(2.5) \approx 3.252 \text{ ml/g/hr}$$

(the value 3.252 uses the unrounded coefficients from the output). Thus, for a bird subjected to 2.5 degrees Celsius, we would expect its oxygen rate to be approximately 3.252 ml/g/hr.

EXTRAPOLATION: It is sometimes desired to make predictions based on the fit of the straight line for values of $x^*$ outside the range of $x$ values used in the original study. This is called extrapolation, and it can be very dangerous. In order for our inferences to be valid, we must believe that the straight-line relationship holds for $x$ values outside the range where we have observed data. In some situations, this may be reasonable; in others, we may have no theoretical basis for making such a claim without data to support it. Thus, it is very important that the investigator have an honest sense of the relevance of the straight-line model for values outside those used in the study if inferences such as estimating the mean for such $x^*$ values are to be reliable.

CALIBRATION: In a standard prediction problem, we are given a value of $x$ (e.g., temperature) and then use the
regression equation to solve for $y$ (e.g., oxygen rate). A calibration problem is the exact opposite. Namely, suppose now that we have a value of $y$, say, $y^*$, and the goal is to estimate the unknown corresponding value of $x$, say, $x^*$.

Figure 2.27: Tree ages measured by two different methods (vertical axis: CO2Date; horizontal axis: Rings).

Example 2.8. Consider a situation where interest focuses on two different methods of calculating the age of a tree. One way is by counting tree rings. This is considered to be very accurate, but requires sacrificing the tree. Another way is by a carbon dating process. Suppose that data are obtained for $n = 100$ trees on age by the counting method ($x$) and age by carbon dating ($y$), both measured in years. A scatterplot of these data, with the straight line fit, is given in Figure 2.27. Suppose that the carbon dating method is applied to a new tree not in the study, yielding an age $y^* = 34$ years. What can we say about the true age of the tree, $x^*$ (that is, its age by the very accurate counting method), without sacrificing the tree?

ANALYSIS: Here, we are given the value of the response $y^* = 34$, obtained from using the carbon dating method. The idea is to use the least squares regression line to estimate $x^*$, the corresponding value using the ring counting method. The almost obvious solution arises from solving the regression equation $y = a + bx$ for $x$ to obtain
$$x = \frac{y - a}{b}.$$
For our tree age data in Example 2.8, I used the R software package to compute the regression line as
$$\hat{y} = 6.934 + 1.22x.$$
Thus, for a tree yielding an age $y^* = 34$ years using carbon dating, the true age of the tree, $x^*$, is estimated as
$$x^* = \frac{34 - 6.934}{1.22} \approx 22.19 \text{ years}.$$

2.3.3 The square of the correlation

SUMMARY DIAGNOSTIC: In a regression analysis, one way to measure how well a straight line fits the data is to compute the square of the correlation $r^2$. This is interpreted as the proportion of total variation in the data explained by the straight line relationship with
the explanatory variable.

NOTE: Since $-1 \le r \le 1$, it must be the case that $0 \le r^2 \le 1$. Thus, an $r^2$ value close to 1 is often taken as evidence that the regression model does a good job at describing the variability in the data.

IMPORTANT: It is critical to understand what $r^2$ does and does not measure. Its value is computed under the assumption that the straight line regression model is correct. Thus, if the relationship between $X$ and $Y$ really is a straight line, $r^2$ assesses how much of the variation in the data may actually be attributed to that relationship rather than just to inherent variation.

• If $r^2$ is small, it may be that there is a lot of random inherent variation in the data, so that, although the straight line is a reasonable model, it can only explain so much of the observed overall variation.

• Alternatively, $r^2$ may be close to 1, but the straight line model may not be the most appropriate model. In fact, $r^2$ may be quite high, but, in a sense, is irrelevant, because it assumes the straight line model is correct. In reality, a better model may exist (e.g., a quadratic model, etc.).

Example 2.9. With our bird oxygen rate data from Example 2.6, I used Minitab to compute the correlation to be $r = -0.9904$. The square of the correlation is
$$r^2 = (-0.9904)^2 = 0.9809.$$
Thus, about 98.1 percent of the variation in the oxygen rate data $Y$ is explained by the least squares regression of $Y$ on temperature. This is a very high percentage. The other 1.9 percent is explained by other variables not accounted for in our straight line regression model.

TRANSFORMATIONS: In some problems, it may be advantageous to perform regression calculations on a transformed scale. Sometimes, it may be that data measured on a different scale (e.g., square root scale, log scale, etc.) will obey a straight line relationship whereas the same data measured on the original scale (before transforming) do not. For more information on transformations, see Example 2.14 on pages 143-4.

2.4 Cautions about correlation and regression

A
CLOSER LOOK AT RESIDUALS: When we fit a regression equation to data, we are using a mathematical formula to explain the relationship between variables. However, we know that very few relationships are perfect. Thus, the residuals represent the "left-over" variation in the response after fitting the regression line. As noted earlier,
$$e_i = y_i - \hat{y}_i = \text{Observed } y - \text{Predicted } y.$$
The good news is that software packages such as SAS or Minitab will store the residuals for you. An important part of any regression analysis is looking at the residuals.

Example 2.10. With our bird oxygen rate data from Example 2.6, one of the observed data pairs was $(x, y) = (5, 3.1)$. In this example, we will compute the least squares residual associated with this observation. From Example 2.7, we saw that the least squares regression line was $\hat{y} = 3.47 - 0.0878x$. Thus, the fitted value associated with $x = 5$ is
$$\hat{y} = 3.47 - 0.0878(5) = 3.031,$$
and the residual is
$$e = y - \hat{y} = 3.1 - 3.031 = 0.069.$$
The other seven residuals from Example 2.6 are computed similarly.

2.4.1 Residual plots

TERMINOLOGY: A residual plot is a scatterplot of the regression residuals against the explanatory variable. Residual plots help us assess the fit of a regression line. In particular, residual plots that display nonrandom patterns suggest that there are some problems with our straight line model assumption.

Example 2.11. In order to assess the effects of drug concentration ($X$) on the resulting increase in CD4 counts ($Y$), physicians used a sample of $n = 50$ advanced HIV patients with different drug concentrations and observed the resulting CD4 count increase. Data from the study appear in Figure 2.28(a). There looks to be a significant linear trend between drug concentration and CD4 increase. The residual plot from the straight line fit is in Figure 2.28(b). This plot may look random at first glance, but, upon closer inspection, one will note that there is a "w-shape" in it. This suggests that the true relationship between CD4 count increase and drug concentration is curved.
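A nonrandom residual pattern like the one in Example 2.11 can be imitated with a small computation. The sketch below is illustrative only: the data are hypothetical (a noiseless curve $y = x^2$, not the CD4 data), and the function names `least_squares_line` and `residuals` are assumptions introduced here. It fits a straight line using $b = r \times s_y/s_x$ and $a = \bar{y} - b\bar{x}$ from Section 2.3.1, then lists the residuals.

```python
import math


def least_squares_line(x, y):
    """Return (a, b) for the least squares line y-hat = a + b*x.

    Uses b = r * s_y / s_x and a = y-bar - b * x-bar.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sample standard deviations (divisor n - 1).
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    # Sample correlation r.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    r = s_xy / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b


def residuals(x, y, a, b):
    """Residuals e_i = observed y_i minus fitted value (a + b*x_i)."""
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


# Hypothetical curved data: y = x^2 with no random noise.
x = list(range(11))
y = [xi ** 2 for xi in x]

a, b = least_squares_line(x, y)
e = residuals(x, y, a, b)

print(f"fitted line: y-hat = {a:.2f} + {b:.2f} x")
for xi, ei in zip(x, e):
    print(f"x = {xi:2d}   residual = {ei:7.2f}")
```

For these made-up data the residuals are positive at both ends of the $x$ range and negative in the middle. A straight line fit to a curved relationship leaves exactly this kind of systematic pattern, which is what a residual plot is meant to expose.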
Figure 2.28: HIV study. CD4 count increase versus drug concentration. (a) Scatterplot. (b) Residual plot.

INTERESTING FACT: For any data set, the mean of the residuals from the least squares regression is always zero. This fact is useful when interpreting residual plots.

Example 2.12. An entomological experiment was conducted to study the survivability of stalk borer larvae. It was of interest to develop a model relating the mean size of larvae (cm) as a function of the stalk head diameter. Data from the experiment appear in Figure 2.29(a). There looks to be a moderate linear trend between larvae size and head diameter size. The residual plot from the straight line regression fit is in Figure 2.29(b). This plot displays a "fanning out" shape; this suggests that the variability increases for larger diameters. This pattern again suggests that there are problems with the straight line model we have fit.

Figure 2.29: Entomology experiment. Larvae size (cm) versus head diameter size (cm). (a) Scatterplot. (b) Residual plot.

2.4.2 Outliers and influential observations

OUTLIERS: Another problem is that of outliers; i.e., data points that do not fit well with the pattern of the rest of the data. In straight line regression, an outlier might be an observation that falls far off the apparent approximate straight line trajectory followed by the remaining observations. Practitioners often "toss out" such anomalous points, which may or may not be a good idea. If it is clear that an outlier is the result of a mishap or a gross recording error, then this may be acceptable. On the other hand, if no such basis may be identified, the outlier may, in fact, be a genuine response; in this case, it contains information about the process under study and may be reflecting a legitimate phenomenon. In this case,
"throwing out" an outlier may lead to misleading conclusions because a legitimate feature is being ignored.

TERMINOLOGY: In regression analysis, an observation is said to be influential if its removal from consideration causes a large change in the analysis (e.g., large change in the regression line, large change in $r^2$, etc.). An influential observation need not be an outlier. Similarly, an outlier need not be influential.

2.4.3 Correlation versus causation

REMARK: Investigators are often tempted to infer a causal relationship between $X$ and $Y$ when they fit a regression model or perform a correlation analysis. However, a significant association between $X$ and $Y$ does not necessarily imply a causal relationship.

Example 2.13. A Chicago newspaper reported that "there is a strong correlation between the numbers of fire fighters ($X$) at a fire and the amount of damage ($Y$, measured in \$1000s) that the fire does." Data from 20 recent fires in the Chicago area appear in Figure 2.30. From the plot, there appears to be a strong linear relationship between $X$ and $Y$. Few, however, would infer that the increase in the number of fire trucks causes the observed increase in damages. Often, when two variables $X$ and $Y$ have a strong association, it is because both $X$ and $Y$ are, in fact, each associated with a third variable, say $W$. In the example, both $X$ and $Y$ are probably strongly linked to $W$, the severity of the fire, so it is understandable that $X$ and $Y$ would increase together.

Figure 2.30: Chicago fire damages (\$1000s) and the number of fire trucks.

MORAL: This phenomenon is the basis of the remark "Correlation does not necessarily imply causation." An investigator should be aware of the temptation to infer causation in setting up a study, and be on the lookout for lurking variables (like $W$ above) that are actually the driving force behind observed results. In general, the best way to control the effects of lurking variables is to use a carefully designed experiment.
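The role of a lurking variable can be sketched with a small simulation. Everything below is hypothetical: the numbers are not the Chicago data, and the names `severity`, `trucks`, `damage`, `pearson_r`, and `line_residuals` are assumptions introduced for illustration. Severity $W$ drives both $X$ and $Y$, while neither $X$ nor $Y$ has any effect on the other.

```python
import math
import random


def pearson_r(u, v):
    """Sample correlation between two equal-length lists."""
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    s_uv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    s_vv = sum((vi - v_bar) ** 2 for vi in v)
    return s_uv / math.sqrt(s_uu * s_vv)


def line_residuals(x, y):
    """Residuals from the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


rng = random.Random(700)  # fixed seed so the run is repeatable
n = 5000

# W: severity of each hypothetical fire (the lurking variable).
severity = [rng.uniform(1, 10) for _ in range(n)]
# X and Y each depend on W plus independent noise, never on each other.
trucks = [w + rng.gauss(0, 1) for w in severity]
damage = [10 * w + rng.gauss(0, 10) for w in severity]

r_xy = pearson_r(trucks, damage)
# Correlation left over once the straight line effect of W is removed from both.
r_xy_given_w = pearson_r(line_residuals(severity, trucks),
                         line_residuals(severity, damage))

print(f"correlation of X and Y:             {r_xy:.3f}")
print(f"correlation after adjusting for W:  {r_xy_given_w:.3f}")
```

In this simulated setting, $X$ and $Y$ are strongly correlated even though neither causes the other, and the association nearly disappears once $W$ is accounted for.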
In observational studies, it is very difficult to make causal statements. Oftentimes, the best we can do is make statements documenting the observed association, and nothing more. For more information on causation, see Section 2.5.

3 Producing Data

3.1 Introduction

EXPLORATORY DATA ANALYSIS: Up until now, we have focused mainly on exploratory data analysis; that is, we have analyzed data to explore distributions and possible relationships between variables. In a nutshell, exploratory data analysis is performed to answer the question, "What do we see in our sample of data?" Graphical displays and summary statistics are an important part of this type of analysis.

STATISTICAL INFERENCE: A further analysis occurs when we wish to perform statistical inference. This has to do with generalizing the results of the sample to that of the population from which the sample data arose. Statistical inference is more numerical in nature and consists largely of confidence intervals and hypothesis tests. These important topics will be covered later in the course.

TERMINOLOGY: This chapter deals with producing data. Primary data are data that one collects proactively through observational studies and experiments. Secondary data are data gathered from other sources (e.g., journals, internet, census data, etc.).

TERMINOLOGY: An observational study observes individuals and measures variables of interest but does not attempt to influence the responses. An experiment deliberately imposes a treatment on individuals in order to observe their responses.

Example 3.1. The Human Relations Department at a major corporation wants to collect employee opinions about a prospective pension plan adjustment. Survey forms are sent out via email asking employees to answer 10 multiple-choice questions.

Example 3.2. In a clinical trial, physicians on a Drug and Safety Monitoring Board want to determine which of two drugs is more effective for treating HIV in its early stages. Patients in the trial are
randomly assigned to one of two treatment groups. After 6 weeks on treatment, the net CD4 count change is observed for each patient.

DISCUSSION: Example 3.1 describes an observational study; there is no attempt to influence the responses of employees. Instead, information in the form of data from the questions is merely observed for each employee returning the survey form. Example 3.2 describes an experiment because we are trying to influence the response from patients (CD4 count) by giving different drugs.

TOWARD STATISTICAL INFERENCE: In Example 3.1, it might be the hope that the information gathered from those employees returning the survey form is, in fact, representative of the entire company. When would this not be true? What might be ineffective with this type of survey design? In Example 3.2, we probably would like to compare the two drugs being given to patients in terms of the CD4 count change. However, would these results necessarily mean that similar behavior would occur in the entire population of advanced HIV patients? This is precisely the question that we would like to answer, and using appropriate statistical inference procedures will help us answer it.

3.2 Experiments

3.2.1 Terminology and examples

TERMINOLOGY: In the language of experiments, individuals are called experimental units. An experimental condition applied to experimental units is called a treatment.

Example 3.3. Does aspirin reduce heart attacks? The strongest evidence for this claim comes from the Physicians Health Study, a large double-blinded experiment involving 22,000 male physicians. One group of about 11,000 took an aspirin every second day, while the rest of them took a placebo (i.e., a sugar pill designed to look like an aspirin).

• Treatment: Drug (e.g., placebo/aspirin)
• Experimental units: Physicians (they receive the treatment)
• Response: Heart attack/not.

TERMINOLOGY: In Example 3.3, we might call the group that received the placebo the control group. This
group enables us to control the effects of outside variables on the outcome, and it gives a frame of reference for comparisons.

TERMINOLOGY: In the language of experiments, explanatory variables are often called factors. These are often categorical in nature. Factors are made up of different levels.

Example 3.4. A soil scientist wants to study the relationship between grass quality (for golf greens) and two factors:

• Factor 1: Fertilizer (ammonium sulphate, urea, isobutylidene diurea, sulfur urea)
• Factor 2: Number of months of dead grass buildup (three, five, seven)
• Response: Grass quality, measured by the amount of chlorophyll content in the grass clippings (mg/gm).

Here, there are 12 treatment combinations ($4 \times 3 = 12$) obtained from forming all possible combinations of the levels of each factor; i.e.,

AS3   U3   IBID3   SU3
AS5   U5   IBID5   SU5
AS7   U7   IBID7   SU7

A golf green is divided up into 12 plots, roughly of equal size. How should the 12 treatment combinations be applied to the plots? This example describes an experiment with a factorial treatment structure, since each treatment combination is represented. If we had two greens, we could replicate the experiment; in this case, we would have $4 \times 3 \times 2 = 24$ observations.

Example 3.5. The response time (measured in milliseconds) was observed for three different types of circuits used in automatic valve shutoff mechanisms. Twelve machines were used in the experiment and were randomly assigned to the three circuit types. Four replicates were observed for each circuit type. The data observed in the experiment are given in Table 3.7.

Table 3.7: Circuit response data.

Circuit type    Times           Means
1                9  20  12  15  $\bar{x}_1$ = 13.50
2               23  26  20  21  $\bar{x}_2$ = 22.50
3                6   3  10   6  $\bar{x}_3$ = 6.25

• Treatment: Circuit type, with three levels (1, 2, 3)
• Response: Time
• Experimental unit: Machine.

REMARK: When experimental units are randomly assigned to the treatments without restriction, we call such a design a completely randomized design (CRD). The experiment described in Example 3.5 is
a CRD.

DISCUSSION: We see that the treatment means $\bar{x}_1$, $\bar{x}_2$, $\bar{x}_3$ are different. This is not surprising because each machine is inherently different. The big question is this: are the observed differences real, or could they have resulted just by chance? Results that are unlikely to have been caused by mere chance are said to be statistically significant. A major part of this course (coming up) will be learning how to determine if results are statistically significant in an experimental setting.

EXPERIMENTS: An experiment is an investigation set up to answer research questions of interest.

• In our context, an experiment is most likely to involve a comparison of treatments (e.g., circuits, fertilizers, rations, drugs, methods, varieties, etc.).

• The outcome of an experiment is information in the form of observations on a response variable (e.g., yield, number of insects, weight gain, length of life, etc.).

• Because of uncertainty in the responses due to sampling and biological variation, we can never provide definitive answers to the research questions of interest based on such observations. However, we can make inferences that incorporate and quantify inherent uncertainty.

3.2.2 Designing experiments

IMPORTANT NOTES: Before an experiment may be designed, the questions of interest must be well formulated. Nothing should start until this happens. The investigator and statistician should work together to identify important features and the appropriate design. This is extremely important, as the design of the experiment lends itself naturally to the analysis of the data collected from the experiment. If the design changes, the analysis will as well. An experiment will most likely give biased results if it is not designed properly or is analyzed incorrectly. Before we talk about the fine points, consider this general outline of how experiments are performed:

General Outline of Experimentation

Sample experimental units from population
↓
Randomize experimental units to treatments
↓
Record
data on each experimental unit
↓
Analyze variation in data
↓
Make statement about the differences among treatments

PREVAILING THEME: The prevailing theme in experimental design is to allow explicitly for variation in and among samples, but design the experiment to control this variation as much as possible.

• This is certainly possible in experiments such as field trials in agriculture, clinical trials in medicine, reliability studies in engineering, etc.

• This is not usually possible in observational studies since we have no control over the individuals.

THE THREE BASIC PRINCIPLES: The basic principles of experimental design are randomization, replication, and control. We now investigate each principle in detail.

PRINCIPLE 1: RANDOMIZATION: This device is used to ensure that samples are "as alike as possible" except for the treatments. Instead of assigning treatments systematically, we assign them so that, once all the acknowledged sources of variation are accounted for, it can be assumed that no obscuring or confounding effects remain.

Example 3.6. An entomologist wants to determine if two preparations of a virus would produce different effects on tobacco plants. Consider the following experimental design:

• He took 10 leaves from each of 4 plots of land (so that there are 40 leaves in the experiment).

• For each plot, he randomly assigned 5 leaves to Preparation 1 and 5 leaves to Preparation 2.

Randomization in this experiment is restricted; that is, leaves were randomly assigned to treatments within each plot. Notice how the researcher has acknowledged the possible source of variation in the different plots of land. In this experiment, the researcher wants to compare the two preparations. With this in mind, why would it be advantageous to include leaves from different plots? This experiment is an example of a randomized complete block design (RCBD), where

• Block: Plot of land
• Treatment: Preparations (2)
• Experimental unit: Tobacco leaf.
AN INFERIOR DESIGN: In Example 3.6, suppose that our researcher used a CRD and simply randomized 20 leaves to each preparation. In this situation, he would have lost the ability to account for the possible variation among plots. The effects of the different preparations would be confounded with the differences in plots.

TERMINOLOGY: In the language of experiments, a block is a collection of experimental units that are thought to be more alike in some way.

Example 3.7. Continuing with Example 3.2, suppose it is suspected that males react to the HIV drugs differently than females. In light of this potential source of variation, consider the following three designs:

• Design 1: assign all the males to Drug 1, and assign all the females to Drug 2.

• Design 2: ignoring gender, randomize each individual to one of the two drugs.

• Design 3: randomly assign drugs within gender; that is, randomly assign the two drugs within the male group, and do the same for the female group.

Design 1 would be awful since, if we observed a difference, we would have no way of knowing whether it was from the treatments (i.e., drugs) or due to the differences in genders. Here, one might say that "drugs are completely confounded with gender." Design 2 is better than Design 1, but we still might not observe the differences due to drugs since differences in genders might not "average out" between the samples. However, in Design 3, differences in treatments may be assessed despite the differences in response due to the different genders! Design 3 is an example of an RCBD with

• Block: Gender
• Treatment: Drugs (2)
• Experimental unit: Human subject.

QUESTION: Randomization of experimental units to treatments is an important aspect of experimental design. But how does one physically randomize?

TABLE OF RANDOM DIGITS: To avoid bias, treatments are assigned using a suitable randomization mechanism. Back in the days before we had computers, researchers could use a device known as a Table of
Random Digits (see Table B). For example, to assign each of eight fan blades to two experimental treatments (e.g., lubricant A and lubricant B) in a CRD, we list the blades as follows by their serial number and code:

xc4553d  0     xc4550d  4
xc4552e  1     xc4551e  5
xc4567d  2     xc4530e  6
xc4521d  3     xc4539e  7

To choose which four are allocated to lubricant A, we can generate a list of random numbers, then match the randomly selected numbers to the list above. The first 4 matches go to lubricant A. Suppose the list was 26292 31009. Reading from left to right in the sequence (skipping repeated digits and digits that match no code), we see that numbers 2, 6, 3, and 1 would be assigned to lubricant A. The others would be allocated to lubricant B.

REMARK: Tables of Random Digits are useful but are largely archaic. In practice, it is easier to use random number generators from statistical software packages.

Example 3.8. Matched pairs design. A certain stimulus is thought to produce an increase in mean systolic blood pressure (SBP) in middle-aged men.

DESIGN ONE: Take a random sample of men, then randomly assign each man to receive the stimulus or not using a CRD. Here, the two groups can be thought of as independent since one group receives one treatment and the other group receives the other.

DESIGN TWO: Consider an alternative design, the so-called matched pairs design. Rather than assigning men to receive one treatment or the other (stimulus/no stimulus), obtain a response from each man under both treatments. That is, obtain a random sample of middle-aged men and take two readings on each man, with and without the stimulus.

Table 3.8: Sources of variation present in different designs.

Design                       Type of Difference   Sources of Variation
Independent samples (CRD)    among men            among men, within men
Matched pairs setup          within men           within men

In this design, because readings of each type are taken on the same man, the difference between before and after readings on a given man should be less variable than the difference between a before response on one man and an after response
on a different man. The man-to-man variation inherent in the latter difference is not present in the difference between readings taken on the same subject.

ADVANTAGE OF MATCHED PAIRS: In general, by obtaining a pair of measurements on a single individual (e.g., man, rat, pig, plot, tobacco leaf, etc.), where one of the measurements corresponds to treatment 1 and the other measurement corresponds to treatment 2, you eliminate the variation among the individuals. Thus, you may compare the treatments (e.g., stimulus/no stimulus, ration A/ration B, etc.) under more homogeneous conditions where only variation within individuals is present (that is, the variation arising from the difference in treatments).

REMARK: In some situations, of course, pairing might be impossible or impractical (e.g., destructive testing in manufacturing, etc.). However, in a matched pairs experiment, we still may think of two populations; e.g., those of all men with and without the stimulus. What changes in this setting is really how we have "sampled" from these populations. The two samples are no longer independent, because they involve the same individual.

A NOTE ON RANDOMIZATION: In matched pairs experiments, it is common practice, when possible, to randomize the order in which treatments are assigned. This may eliminate "common patterns" (from always following, say, treatment A with treatment B) that may confound our ability to determine a treatment effect. In practice, the experimenter could flip a fair coin to determine which treatment is applied first. If there are carry-over effects that may be present, these would have to be dealt with accordingly. We'll assume that there are no carry-over effects in our discussion here.

PRINCIPLE 2: CONTROL: The basic idea of control is to eliminate confounding effects of other variables. Two variables are said to be confounded when their effects cannot be distinguished from each other. Consider the following two examples.

Example 3.9. In an agricultural study, researchers want to know which
of three fertilizer compounds produces the highest yield. Suppose that we keep Fertilizer 1 plants in one greenhouse, Fertilizer 2 plants in another greenhouse, and Fertilizer 3 plants in yet another greenhouse. This choice may be made for convenience or simplicity, but this has the potential to introduce a big problem; namely, we may never know whether any observed differences are due to the actual differences in the treatments. Here, the observed differences could be due to the different greenhouses. The effects of fertilizer and greenhouse location are confounded.

Example 3.10. Consider a cancer clinical trial (a medical experiment where a drug's efficacy is of primary interest) to compare a new, experimental treatment to a standard treatment. Suppose that a doctor assigns patients with advanced cases of disease to a new experimental drug and assigns patients with mild cases to the standard drug, thinking that the new drug is promising and should thus be given to the sicker patients. Here, the effects of the drugs are confounded with the seriousness of the disease.

SOME GENERAL COMMENTS:

• One way to control possible confounding effects is to use blocking. Also, the use of blocking allows the experimenter to make treatment comparisons under more homogeneous conditions. How could blocking be used in the last two examples?

• Assignment of treatments to the samples should be done so that potential sources of variation do not obscure the treatment differences.

• If an experiment is performed, but the researchers fail to design the experiment to "block out" possible confounding effects, all results could be meaningless. Thus, it pays to spend a little time at the beginning of the investigation and think hard about designing the experiment appropriately.

PRINCIPLE 3: REPLICATION: Since we know that individuals will vary within samples and among different samples from a population, we should collect data on more than one individual (e.g., pig, plant, person, plot, etc.).
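Why more than one individual? A minimal simulation sketch, under assumed conditions: responses are drawn from a hypothetical normal population with mean 50 and standard deviation 10 (values chosen only for illustration, as is the helper name `spread_of_means`). It compares how much a treatment mean varies from one simulated experiment to the next when the mean is based on 1, 4, or 16 individuals.

```python
import random
import statistics


def spread_of_means(n_units, n_experiments, rng, mu=50.0, sigma=10.0):
    """Standard deviation of the sample mean across repeated experiments.

    Each simulated experiment records n_units responses from a
    hypothetical normal population and keeps their average.
    """
    means = []
    for _ in range(n_experiments):
        responses = [rng.gauss(mu, sigma) for _ in range(n_units)]
        means.append(statistics.fmean(responses))
    return statistics.stdev(means)


rng = random.Random(2005)  # fixed seed so the run is repeatable
results = {n: spread_of_means(n, 4000, rng) for n in (1, 4, 16)}

for n, sd in results.items():
    print(f"{n:2d} individual(s) per mean: spread of the mean = {sd:.2f}")
```

In this simulated setting, the spread of the mean shrinks as more individuals are averaged, roughly in proportion to one over the square root of the number of individuals.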
Doing so will provide us with more information about the population of interest and will reduce chance variation in the results.

Example 3.11. A common way to test whether a particular hearing aid is appropriate for a patient is to play a tape on which words are pronounced clearly but at low volume, and ask the patient to repeat the words as heard. However, a major problem for those wearing hearing aids is that the aids amplify background noise as well as the desired sounds. In an experiment, 24 subjects with normal hearing listened to standard audiology tapes of English words at low volume, with a noisy background (there were 25 words per tape). Recruited subjects were to repeat the words and were scored correct or incorrect in their perception of the words. The order of list presentation was randomized. Replication was achieved by using 24 subjects.

Example 3.12. A chemical engineer is designing the production process for a new product. The chemical reaction that produces the product may have higher or lower yield depending on the temperature and stirring rate in the vessel in which the reaction takes place. The engineer decides to investigate the effects of temperature (500 and 600) and stirring rate (60rpm, 90rpm, and 120rpm) on the yield of the process. Here, we have 6 treatment combinations. One full replication would require 6 observations, one for each of the six treatment combinations. This is an example of an experiment with a factorial treatment structure.

REMARK: There are statistical benefits conferred by replicating the experiment under identical conditions (we'll become more familiar with these benefits later on); namely,

• increased precision and
• higher power to detect potential significant differences.

3.3 Sampling designs and surveys

TERMINOLOGY: A survey is an observational study where individuals are asked to respond to questions. In theory, no attempt is made to influence the responses of those participating in the survey. The design of the survey refers
to the method used to choose the sample from the population. The survey is given to the sample of individuals chosen.

3.3.1 Sampling models

TERMINOLOGY: Nonprobability samples are samples of individuals that are chosen without using randomization methods. Such samples are rarely representative of the population from which the individuals were drawn (i.e., such samples often give biased results).

• A voluntary response sample (VRS) consists of people who choose themselves by responding to a general appeal (e.g., online polls, etc.). A famous VRS was collected by Literary Digest in 1936. Literary Digest predicted Republican presidential candidate Alfred Landon would defeat the incumbent president FDR by a 3:2 margin. In that election, FDR won 62 percent of the popular vote! The sampling procedure included two major mistakes. First, most of those contacted were from Literary Digest's subscription list! Second, only 23 percent of the ballots were returned (a voluntary response sample with major nonresponse).

• A convenience sample chooses individuals that are easiest to contact. For example, students standing by the door at the Student Union taking surveys are excluding those students who do not frequent the Union.

TERMINOLOGY: A probability sample is a sample where the individuals are chosen using randomization. While there are many sampling designs that can be classified as probability samples, we will discuss four: simple random samples (SRS), stratified random samples, systematic samples, and cluster samples.

SIMPLE RANDOM SAMPLE: A simple random sample (SRS) is a sampling model where individuals are chosen without replacement and each sample of size $n$ has an equal chance of being selected. See Example 3.18 on page 220.

• In the SRS model, we are choosing individuals so that our sample will hopefully be representative. Each individual in the population has an equal chance of being selected.

• We use the terms "random sample" and "simple random sample" interchangeably.

DIFFICULTIES: In
practice, researchers most likely will not (i) identify exactly what the population is, (ii) identify all of the members of the population (an impossible task, most likely), and (iii) choose an SRS from this list. It is more likely that reasonable attempts will have been made to choose the individuals at random from those available. Hence, in theory, the sample obtained for analysis might not be "a true random sample;" however, in reality, the SRS model assumption might not be that far off.

SYSTEMATIC SAMPLE: A systematic sample is chosen by listing individuals in a frame (i.e., a complete list of individuals in a population) and selecting every $j$th individual on the list for the sample. In a systematic sample, like an SRS, each individual has the same chance of being selected. However, unlike the SRS, each sample of size $n$ does not have the same chance of being selected (this is what distinguishes the two sampling models).

Example 3.13. In a plant disease experiment, researchers select every 10th plant in a particular row for a sample of plants. The plants will then be tested for the presence of a certain virus.

STRATIFIED RANDOM SAMPLE: To select a stratified random sample, first divide the population into groups of similar individuals, called strata. Then choose a separate SRS in each stratum and combine these to form the full sample. Examples of strata include gender, cultivar, age, breed, salary, etc.

REASONS FOR STRATIFICATION:

• Stratification broadens the scope of the study.

• Strata may be formed because they themselves may be of interest separately.

Example 3.14. In Example 1.3 (notes), we examined the following data from blood plasma donors in rural China.

Table 3.9: Education level for plasma donors in rural eastern China between 1990-1994.

Education level    Count    Percentage
Illiterate           645          46.4
Primary              550          39.6
Secondary            195          14.0
Total               1390         100.0

The three groups of individuals (illiterate, primary, and secondary) form strata for this population. Furthermore, it might be reasonable to assume
that the samples within strata are each an SRS.

CLUSTER SAMPLE: Cluster samples proceed by first dividing the population into clusters, i.e., groups of individuals which may or may not be alike. Then, one randomly selects a sample of clusters and uses every individual in those chosen clusters.

Example 3.15. For a health survey, we want a sample of $n = 400$ from a population of 10,000 dwellings in a city. To use an SRS would be difficult because no frame is available and one would be too costly and time consuming to produce. Instead, we can use the city blocks as clusters. We then sample approximately 1/25 of the clusters.

NOTE: We have looked at four different probability sampling models. In practice, it is not uncommon to use more than one of these techniques in the same study (this is common in large government-based surveys). For example, one might use a cluster sample to first select city blocks. Then, one could take a stratified sample from those city blocks. In this situation, such a design might be called a multistage sampling design.

3.3.2 Common problems in sample surveys

UNDERCOVERAGE: This occurs when certain segments of the population are excluded from the study. For example, individuals living in certain sections of a city might be difficult to contact.

NONRESPONSE: This occurs when an individual chosen for the sample cannot be contacted or refuses to participate. Nonresponse is common in mail and phone surveys.

RESPONSE BIAS: This could include interviewer bias and respondent bias. An interviewer might bias the results by asking the questions negatively. For example, "In light of the mounting casualties that we have seen in Iraq recently, do you approve of the way President Bush has handled the war in Iraq?" On the other hand, respondent bias might occur if the interviewee is not comfortable with the topic of the interview, such as sexual orientation or criminal behaviors. These types of biases cannot be accounted for statistically. Also important is how the questions
are worded. The following question appeared in an American-based survey in 1992: "Does it seem possible or does it seem impossible to you that the Nazi extermination of the Jews never happened?" When 22 percent of the sample said that it was possible, the news media wondered how so many Americans could be uncertain. A much simpler version of this question was later asked, and only 1 percent of the respondents said it was possible.

3.4 Introduction to statistical inference

RECALL: Statistical inference deals with generalizing the results of a sample to that of the population from which the sample was drawn. This type of analysis is different from exploratory data analysis, which looks only at characteristics of the sample through graphical displays (e.g., histograms, scatterplots, etc.) and summary statistics.

Example 3.16. A Columbia-based health club wants to estimate the proportion of Columbia residents who enjoy running as a means of cardiovascular exercise. This proportion is unknown (that is why they want to estimate it!). Let $p$ denote the proportion of Columbia residents who enjoy running as a means of exercise. If we knew $p$, then there would be no reason to estimate it. In lieu of this knowledge, we can take a sample from the population of Columbia residents and estimate $p$ with the data from our sample. Here, we have two values:

- $p$ = the true proportion of Columbia residents who enjoy running (unknown)
- $\hat{p}$ = the proportion of residents who enjoy running observed in our sample.

Suppose that in a random sample of $n = 100$ Columbia residents, 19 said that they enjoy running as a means of exercise. Then, $\hat{p} = 19/100 = 0.19$.

TERMINOLOGY: A parameter is a number that describes a population. A parameter can be thought of as a fixed number, but, in practice, we do not know its value.

TERMINOLOGY: A statistic is a number that describes a sample. The value of a statistic can be computed from our sample data, but it can change from sample to sample. We often use a statistic to estimate
an unknown parameter.

REMARK: In light of these definitions, in Example 3.16, we call $p$ the population proportion and $\hat{p}$ the sample proportion, and we can use $\hat{p} = 0.19$ as an estimate of the true $p$.

HYPOTHETICALLY: Suppose now that I take another SRS of Columbia residents of size $n = 100$ and 23 of them said that they enjoy running as a means of exercise. From this sample, our estimate of $p$ is $\hat{p} = 23/100 = 0.23$. That the two samples gave different estimates should not be surprising (the samples most likely included different people). Statistics' values vary from sample to sample because of this very fact. On the other hand, the value of $p$, the population proportion, does not change.

OF INTEREST: In light of the fact that statistics' values will change from sample to sample (this is called sampling variability), it is natural to want to know what would happen if we repeated the sampling procedure many times. In practice, we cannot do this because, chances are, we only get to take one sample from our population. However, we can imitate this notion by using simulation.

SIMULATION: Here is what happened when I simulated 10 different values of $\hat{p}$ under the assumption that $p = 0.2$ (I used R to perform the simulations):

    > phat
    0.18 0.25 0.20 0.20 0.17 0.16 0.19 0.19 0.14 0.19

Now, I simulated a few more values of $\hat{p}$ under the assumption that $p = 0.2$:

    > phat
    0.17 0.16 0.20 0.20 0.16 0.25 0.21 0.17 0.14 0.26
    0.12 0.20 0.18 0.15 0.28 0.20 0.21 0.24 0.25 0.22
    0.18 0.20 0.19 0.16 0.21 0.21 0.25 0.18 0.20 0.14
    0.14 0.21 0.20 0.17 0.18 0.15 0.21 0.12 0.18 0.23
    0.18 0.22 0.26 0.18 0.13 0.19 0.17 0.28 0.18 0.21
    0.22 0.18 0.19 0.22 0.23 0.17 0.26 0.21 0.19 0.19
    0.20 0.10 0.17 0.18 0.18 0.18 0.21 0.20 0.23 0.23
    0.26 0.18 0.18 0.16 0.24 0.22 0.16 0.21 0.27 0.18
    0.19 0.26 0.25 0.24 0.10 0.18 0.18 0.25 0.18 0.21
    0.20 0.21 0.20 0.18 0.22 0.19 0.26 0.17 0.16 0.20

It should be clear that I can generate as many values of $\hat{p}$ as I want. Thus, to answer the question, "What would happen if I took many samples?" we can do the following:

1. Take a large number of samples assuming that $p = 0.2$.
2. Calculate the sample
proportion $\hat{p}$ for each sample.
3. Make a histogram of the values of $\hat{p}$.
4. Examine the distribution displayed in the histogram.

Figure 3.31: Sampling distribution of the sample proportion $\hat{p}$ when $p = 0.2$.

RESULTS: When I did this using 10,000 random samples, each of size $n = 100$, I got the histogram in Figure 3.31. It is interesting to note that the shape looks normally distributed with a mean of around 0.2. In fact, when I took the sample mean of the 10,000 randomly generated $\hat{p}$ values, I got

    > mean(phat)
    0.200419

This is close to the true proportion $p = 0.2$. What we have just done is generated (using computer simulation) the sampling distribution of $\hat{p}$ when $p = 0.2$.

SAMPLING DISTRIBUTIONS: The sampling distribution of a statistic is the distribution of values taken by the statistic in repeated sampling using samples of the same size.

REMARK: Sampling distributions are important distributions in statistical inference. As with any distribution, we will be interested in the following:
- center of the distribution
- spread (variation) in the distribution
- shape: is the distribution symmetric or skewed?
- the presence of outliers.

TWO QUESTIONS FOR THOUGHT:
- First, based on the appearance of the sampling distribution for $\hat{p}$ when $p = 0.2$ (see Figure 3.31), what range of values of $\hat{p}$ are common?
- Second, suppose that in your investigation, you did not get a value of $\hat{p}$ in this commonly observed range. What does this suggest about the value of $p$?

TERMINOLOGY: A statistic is said to be an unbiased estimator for a parameter if the mean of the statistic's sampling distribution is equal to the parameter. If the mean of the statistic's sampling distribution is not equal to this parameter, the statistic is called a biased estimator. It should be clear that bias (or lack thereof) is a property that concerns the center of a sampling distribution.

TERMINOLOGY: The variability of a statistic is described by the spread in the
statistic's sampling distribution. This variability often depends largely on the sample size $n$. When $n$ is larger, and the sample is a probability sample (e.g., SRS), the variability often is smaller.

IDEAL SITUATION: Better statistics are those which have small (or no) bias and small variability. If a statistic is unbiased (or has very low bias), then we know that we are approximately "right on average." If we have a statistic with small variability, then we know there is not a large spread in that statistic's sampling distribution. The combination of "right on average" and small variation is the ideal case! See p. 237 for an illustration.

MANAGING BIAS AND VARIABILITY: To reduce bias, use random sampling. In many situations, SRS designs produce unbiased estimates of population parameters. To reduce variation, use a larger sample size $n$. Often, the variation in the sampling distribution decreases as $n$ increases (we'll see examples of this later on).

Figure 3.32: Sampling distribution of the sample proportion $\hat{p}$ when $p = 0.2$. Panels: (a) $n = 100$; (b) $n = 1000$.

ILLUSTRATION: To illustrate how increasing the sample size decreases the variation in a statistic's sampling distribution, consider Figure 3.32. On the left, we see the sampling distribution of $\hat{p}$ when $p = 0.2$ and $n = 100$. On the right, we see the sampling distribution of $\hat{p}$ when $p = 0.2$ and $n = 1000$. I simulated both distributions using R. Note how the variation in the right distribution is greatly reduced. This happens because the sample size is larger. Note also that increasing the sample size didn't affect the mean of the sampling distribution of $\hat{p}$; i.e., $\hat{p}$ is unbiased.

REMARK: Randomization is an important aspect of sampling design in experiments and observational studies. Whenever you sample individuals, there are two types of error: sampling error and nonsampling error. Sampling error is induced from natural variability among individuals in our population.
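SKETCH: The two simulated sampling distributions compared above (n = 100 versus n = 1000, both with p = 0.2) were produced in R, but the R code is not shown in these notes. The following is a minimal Python sketch of the same idea, not the original code. It assumes each sample consists of n independent draws that are "successes" with probability p; the function name `simulate_phat`, the fixed seed, and the choice of 2000 replications are illustrative assumptions.

```python
import random
import statistics


def simulate_phat(p, n, reps, seed=0):
    """Return `reps` simulated sample proportions.

    Each proportion is computed from n independent draws that are
    successes with probability p (hypothetical helper for illustration).
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    phats = []
    for _ in range(reps):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        phats.append(successes / n)
    return phats


if __name__ == "__main__":
    for n in (100, 1000):
        phats = simulate_phat(p=0.2, n=n, reps=2000)
        # Center (bias) and spread (variability) of the simulated distribution
        print(n, round(statistics.mean(phats), 4), round(statistics.stdev(phats), 4))
```

For both sample sizes the mean of the simulated values should sit near 0.2, while the standard deviation for n = 1000 should be noticeably smaller than for n = 100, which is the pattern described for the two panels.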
The plain fact is that we usually can never get to see every individual in the population. Sampling error arises from this fact. Nonsampling error occurs from other sources such as measurement error, poor sampling designs, nonresponse, undercoverage, missing data, etc. Randomization helps to reduce the nonsampling error; however, it usually cannot fully eliminate it.

4 Probability: The Study of Randomness

4.1 Randomness

TERMINOLOGY: We call a phenomenon random if its individual outcomes are uncertain, but there is nonetheless a regular distribution of outcomes in a large number of repetitions. The probability of any outcome is the proportion of times the outcome would occur in a very long series of independent repetitions.

NOTE: This interpretation of probability just described (as a long-run proportion) is called the relative frequency approach to measuring probability.

EXAMPLES: Here are examples of outcomes we may wish to assign probabilities to:
- tomorrow's temperature exceeding 80 degrees
- manufacturing a defective part
- the NASDAQ losing 5 percent of its value
- rolling a "2" on an unbiased die.

Example 4.1. An example illustrating the relative frequency approach to probability. Suppose we roll a die $n = 1000$ times and record the number of times we observe a "2." Let $A$ denote this outcome. The quantity

$$\frac{\text{number of times } A \text{ occurs}}{\text{number of trials performed}} = \frac{f}{n}$$

is called the relative frequency of the outcome. If we performed this experiment repeatedly, the relative frequency approach says that $P(A) \approx f/n$, for large $n$.

NOTATION: The symbol $P(A)$ is shorthand for "the probability that $A$ occurs."

SIMULATION: I simulated the experiment in Example 4.1 four times; that is, I rolled a single die 1000 times on four separate occasions (using a computer). Each occasion is depicted in its own graph in Figure 4.33. The vertical axis measures the relative frequency (proportion of occurrences) of the outcome $A$ = {roll a 2}. You will note that each relative frequency gets close to 1/6 as $n$, the number of rolls, increases. The relative frequency approach says that this relative frequency will "converge" to 1/6, if we were allowed to continually roll the die.

Figure 4.33: The proportion of tosses which result in a "2"; each plot represents 1000 rolls of a fair die.

4.2 Probability models

TERMINOLOGY: In probability applications, it is common to perform an experiment and then observe an outcome. The set of all possible outcomes for an experiment is called the sample space, hereafter denoted by $S$.

Example 4.2. A rat is selected and we observe the sex of the rat: $S = \{\text{male}, \text{female}\}$.

Example 4.3. The Michigan state lottery calls for a three-digit integer to be selected: $S = \{000, 001, 002, \ldots, 998, 999\}$.

Example 4.4. An industrial experiment consists of observing the lifetime, measured in hours, of a certain battery: $S = \{w : w \ge 0\}$. That is, any positive value (in theory) could be observed.

TERMINOLOGY: An event is an outcome or set of outcomes in a random experiment. Put another way, an event is a subset of the sample space. We typically denote events by capital letters towards the beginning of the alphabet (e.g., $A$, $B$, $C$, etc.).

EXAMPLES: In Example 4.2, let $A = \{\text{female}\}$. In Example 4.3, let $B = \{003, 547, 988\}$. In Example 4.4, let $C = \{w : w < 30\}$. What are $P(A)$, $P(B)$, and $P(C)$? To answer these questions, we have to learn more about how probabilities are assigned to outcomes in a sample space.

PROBABILITY RULES: Probabilities are assigned according to the following rules:
1. The probability $P(A)$ of any event $A$ satisfies $0 \le P(A) \le 1$.
2. If $S$ is the sample space in a probability model, then $P(S) = 1$.
3. Two events $A$ and $B$ are disjoint if they have no outcomes in common (i.e., they cannot both occur simultaneously). If $A$ and $B$ are disjoint, then $P(A \text{ or } B) = P(A) + P(B)$.
4. The complement of any event $A$ is the event that $A$ does not occur. The complement is denoted by $A^c$. The complement rule says that $P(A^c) = 1 - P(A)$.

NOTE: Venn diagrams are helpful in depicting events in a sample space. See those on p. 263.

Example 4.5. Consider the following probability model for the ages of students taking distance courses at USC:

| Age group | 18-23 years | 24-29 years | 30-39 years | 40 years and over |
|---|---|---|---|---|
| Probability | 0.47 | 0.27 | 0.14 | 0.12 |

First, note that each outcome has a probability that is between 0 and 1. Also, the probabilities sum to one because the four outcomes make up $S$.
1. What is the probability that a distance course student is 30 years or older?
2. What is the probability that a distance course student is less than 40 years old?
3. What is the probability that a distance course student is 25 years old?

4.2.1 Assigning probabilities

Example 4.6. In an experiment, we observe the number of insects that test positive for a virus. If 6 insects are under study, the sample space is $S = \{0, 1, 2, \ldots, 6\}$. Define

- $A_1$ = {exactly one insect is infected} = $\{1\}$
- $A_2$ = {no insects are infected} = $\{0\}$
- $A_3$ = {more than one insect is infected} = $\{2, 3, 4, 5, 6\}$
- $A_4$ = {at most one insect is infected} = $\{0, 1\}$.

Note that events $A_1$ and $A_2$ each contain only one outcome. The events $A_3$ and $A_4$ each contain more than one outcome. How do we assign probabilities to these events? To do this, we need to know how probabilities are assigned to each outcome in $S$.

EXERCISE: Draw a Venn diagram for the events $A_1$, $A_2$, $A_3$, and $A_4$.

TERMINOLOGY: A sample space is called finite if there are a fixed and limited number of outcomes. Examples 4.2, 4.3, 4.5, and 4.6 have finite sample spaces. Example 4.4 does not have a finite sample space.

ASSIGNING PROBABILITIES: In a finite sample space, computing the probability of an event can be done by
1. identifying all outcomes associated with the event, and
2. adding up the probabilities associated with each outcome.

Example 4.7. In Example 4.6, consider the following probability model for the number of insects which test positive for a virus:

| Outcome | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| Probability | 0.55 | 0.25 | 0.10 | 0.04 | 0.03 | 0.02 | 0.01 |

With this probability model, we can now compute $P(A_1)$, $P(A_2)$, $P(A_3)$, and $P(A_4)$. In particular, $P(A_1) = 0.25$, $P(A_2) = 0.55$, $P(A_3) = 0.10 + 0.04 + 0.03 + 0.02 + 0.01 = 0.20$, and $P(A_4) = 0.55 + 0.25 = 0.80$.

QUESTION: What is $P(A_1 \text{ or } A_2)$? $P(A_3 \text{ or } A_4)$? $P(A_1 \text{ or } A_4)$?

EQUALLY LIKELY OUTCOMES: In discrete sample spaces with equally likely outcomes, assigning probabilities is very easy. In particular, if a random phenomenon has a sample space with $k$ possible outcomes, all equally likely, then
- each individual outcome has probability $1/k$, and
- the probability of any event $A$ is

$$P(A) = \frac{\text{number of outcomes in } A}{\text{number of outcomes in } S} = \frac{\text{number of outcomes in } A}{k}.$$

Example 4.8. Equally likely outcomes. In Example 4.3, the sample space for the Michigan lottery experiment is $S = \{000, 001, 002, \ldots, 998, 999\}$. Suppose that I bought three lottery tickets numbered 003, 547, and 988. What is the probability that my number is chosen? That is, what is $P(B)$, where $B = \{003, 547, 988\}$? If each of the $k = 1000$ outcomes in $S$ is equally likely (this should be true if the lottery numbers are randomly selected), then each outcome has probability $1/k = 1/1000$. Also, note that there are 3 outcomes in $B$; thus,

$$P(B) = \frac{\text{number of outcomes in } B}{1000} = \frac{3}{1000}.$$

Example 4.9. Equally likely outcomes. Four equally qualified applicants ($a$, $b$, $c$, $d$) are up for two positions. Applicant $a$ is a minority. Positions are chosen at random. What is the probability that the minority is hired? Here, the sample space is $S = \{ab, ac, ad, bc, bd, cd\}$. We are assuming that the order of the positions is not important. If the positions are assigned at random, each of the six sample points is equally likely and has probability 1/6. Let $E$ denote the event that a minority is hired. Then $E = \{ab, ac, ad\}$ and

$$P(E) = \frac{\text{number of outcomes in } E}{6} = \frac{3}{6}.$$

4.2.2 Independence and the multiplication rule

INFORMALLY: Some events in a sample space may be "related" in some way. For example, if $A$ denotes the event that it rains tomorrow and $B$ denotes the event that it will be overcast tomorrow, then we know that the occurrence of $A$ is related
to the occurrence of $B$. If $B$ does not occur, this changes our measure of $P(A)$ from what it would be if $B$ did occur.

TERMINOLOGY: When the occurrence or non-occurrence of $A$ has no effect on whether or not $B$ occurs, and vice versa, we say that the events $A$ and $B$ are independent.

MATHEMATICALLY: If the events $A$ and $B$ are independent, then

$$P(A \text{ and } B) = P(A)P(B).$$

This is known as the multiplication rule for independent events. Note that this rule only applies to events that are independent. If events $A$ and $B$ are not independent, this rule does not hold!

Example 4.10. A red die and a white die are rolled. Let $A$ = {4 on red die} and $B$ = {sum is odd}. Are the events independent?

SOLUTION: The sample space here is

$$
\begin{array}{rl}
S = \{ & (1,1), (1,2), (1,3), (1,4), (1,5), (1,6), \\
 & (2,1), (2,2), (2,3), (2,4), (2,5), (2,6), \\
 & (3,1), (3,2), (3,3), (3,4), (3,5), (3,6), \\
 & (4,1), (4,2), (4,3), (4,4), (4,5), (4,6), \\
 & (5,1), (5,2), (5,3), (5,4), (5,5), (5,6), \\
 & (6,1), (6,2), (6,3), (6,4), (6,5), (6,6) \ \}.
\end{array}
$$

In each outcome, the first value corresponds to the red die; the second value corresponds to the white die. Of the 36 outcomes in $S$ (each of which is assumed equally likely),
- 6 are favorable to $A$,
- 18 are favorable to $B$, and
- 3 are favorable to both $A$ and $B$.

Thus, since

$$\frac{3}{36} = P(A \text{ and } B) = P(A)P(B) = \frac{6}{36} \times \frac{18}{36},$$

the events $A$ and $B$ are independent.

Example 4.11. In an engineering system, two components are placed in a series so that the system is functional as long as both components are. Let $A_1$ and $A_2$ denote the events that components 1 and 2 are functional, respectively. From past experience, we know that $P(A_1) = P(A_2) = 0.95$. Assuming independence between the two components, the probability the system is functional (i.e., the reliability of the system) is

$$P(A_1 \text{ and } A_2) = P(A_1)P(A_2) = (0.95)^2 = 0.9025.$$

If the two components are not independent, then we do not have enough information to determine the reliability of the system.

EXTENSION: The notion of independence still applies if we are talking about more than two events. If the events $A_1, A_2, \ldots, A_n$ are independent, then

$$P(\text{all } A_i \text{ occur}) = P(A_1)P(A_2) \cdots P(A_n).$$

FACT: If $A$ and $B$ are independent events, so are (a) $A^c$ and $B$, (b) $A$ and
$B^c$, and (c) $A^c$ and $B^c$. That is, complements of independent events are also independent.

Example 4.12. Suppose that in a certain population, individuals have a certain disease with probability $p = 0.05$. Suppose that 10 individuals are observed. Assuming independence among individuals, what is the probability that
(a) no one has the disease?
(b) at least one has the disease?

4.3 Random variables

TERMINOLOGY: A random variable is a variable whose numerical value is determined by chance. We usually denote random variables by capital letters near the end of the alphabet (e.g., $X$, $Y$, $Z$, etc.).

NOTATION: We denote a random variable $X$ with a capital letter; we denote an observed value by $x$, a lowercase letter. This is standard notation.

4.3.1 Discrete random variables

TERMINOLOGY: A discrete random variable, say, $X$, has a finite (limited) number of possible values. The probability distribution of $X$ lists these values and the probabilities associated with each value:

| $x$ | $x_1$ | $x_2$ | $x_3$ | $\cdots$ | $x_k$ |
|---|---|---|---|---|---|
| Probability | $p_1$ | $p_2$ | $p_3$ | $\cdots$ | $p_k$ |

The probabilities $p_i$ must satisfy:
1. Every probability $p_i$ is a number between 0 and 1.
2. The probabilities $p_1, p_2, \ldots, p_k$ must sum to 1.

RULE: We find the probability of any event by adding the probabilities $p_i$ of the particular values $x_i$ that make up the event.

Example 4.14. Suppose that an experiment consists of flipping two fair coins. The sample space consists of four outcomes:

$$S = \{(H,H), (H,T), (T,H), (T,T)\}.$$

Now, let $Y$ count the number of heads in the two flips. Before we perform the experiment, we do not know the value of $Y$ (it is a random variable). Assuming that each outcome in $S$ is equally likely, we can compute the probability distribution for $Y$:

| $y$ | 0 | 1 | 2 |
|---|---|---|---|
| Probability | 0.25 | 0.50 | 0.25 |

What is the probability that I flip at least one head? This corresponds to the values $y = 1$ and $y = 2$; hence, the probability is

$$P(Y \ge 1) = P(Y = 1) + P(Y = 2) = 0.50 + 0.25 = 0.75.$$

Example 4.15. During my morning commute, there are 8 stoplights between my house and I-77. Consider the following probability distribution for $X$, the number of stop lights at which I stop (note
that I stop only at red lights):

| $x$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| Probability | 0.20 | 0.20 | 0.30 | 0.20 | 0.04 | 0.02 | 0.02 | 0.01 | 0.01 |

Using this model, what is the probability that I must stop (a) at most once? (b) at least five times? (c) not at all?

SOLUTION (a):

$$P(X \le 1) = P(X = 0) + P(X = 1) = 0.20 + 0.20 = 0.40.$$

SOLUTION (b):

$$P(X \ge 5) = P(X = 5) + P(X = 6) + P(X = 7) + P(X = 8) = 0.02 + 0.02 + 0.01 + 0.01 = 0.06.$$

SOLUTION (c): $P(X = 0) = 0.20$.

Figure 4.34: Probability histogram for the number of traffic light stops.

TERMINOLOGY: Probability histograms are graphical displays that show the probability distribution of a discrete random variable. The values for the variable are placed on the horizontal axis; the probabilities are placed on the vertical axis. A probability histogram for Example 4.15 is given in Figure 4.34.

Figure 4.35: Four density curves. Panels: proportion of response to drug; battery lifetimes; standardized test scores; remission times.

4.3.2 Continuous random variables

TERMINOLOGY: Continuous random variables are random variables that take on values in intervals of numbers (instead of a fixed number of values like discrete random variables).

Example 4.16. Let $Y$ denote the weight, in ounces, of the next newborn boy in Columbia, SC. Here, $Y$ is a continuous random variable because it can (in theory) assume any value larger than zero.

TERMINOLOGY: The probability distribution of a continuous random variable is represented by a density curve. Examples of density curves appear in Figure 4.35.

COMPUTING PROBABILITIES: How do we compute probabilities associated with continuous random variables? We do this by finding areas under density curves. See Figure 4.36. Associating the area under a density curve with probability is not new. In Chapter 1, we associated areas with the proportion of observations falling in that range.

Figure 4.36: Density curve for $X$, the remission times of leukemia patients.
The shaded region represents the probability $P(10 < X < 20)$.

These are similar ideas; that is, in Figure 4.36, the shaded area represents
- $P(10 < X < 20)$, the probability that a single individual selected from the population will have a remission time between 10 and 20 months
- the proportion of individuals in the population having a remission time between 10 and 20 months.

UNUSUAL FACT: Continuous random variables are different than discrete random variables. Discrete random variables assign positive probabilities to specific values (see Examples 4.14 and 4.15). On the other hand, continuous random variables assign probability 0 to specific points! Why? It has to do with how we assign probabilities for continuous random variables. The area under a density curve, directly above a specific point, is zero! Thus, for the density curve in Figure 4.36, the probability that a remission time for a selected patient equals 22 months is $P(X = 22) = 0$.

Figure 4.37: A normal probability distribution with mean $\mu = 25$ and standard deviation $\sigma = 5$. A model for tomato yields. Horizontal axis: tomato yields $x$ (bushels/acre); vertical axis: $f(x)$.

NOTE: The normal distributions studied in Chapter 1 (Section 1.3) are examples of probability distributions. Normal density curves were studied extensively in Chapter 1.

RECALL: Suppose that the random variable $X \sim \mathcal{N}(\mu, \sigma)$. Then, the random variable

$$Z = \frac{X - \mu}{\sigma}$$

has a normal distribution with mean 0 and standard deviation 1 (i.e., a standard normal distribution). That is, $Z \sim \mathcal{N}(0, 1)$.

REVIEW QUESTIONS: Let $X$ denote the tomato yields per acre in a certain geographical region. The probability distribution for $X$ is normal and is depicted in Figure 4.37. Compute the following probabilities and draw appropriate shaded pictures:
- $P(X > 27.4)$
- $P(19.3 < X < 30.8)$
- $P(X < 16.2)$.

4.4 Means and variances of random variables

REMARK: Random variables have probability distributions that describe (1) the values associated with the random variable and (2)
the probabilities associated with these values. Graphically, we represent probability distributions with probability histograms (in the discrete case) and density curves (in the continuous case).

FACT: Random variables have means and variances associated with them! For any random variable $X$,
- the mean of $X$ is denoted by $\mu_X$
- the variance of $X$ is denoted by $\sigma_X^2$
- the standard deviation of $X$ is denoted by $\sigma_X$ (and is still the positive square root of the variance).

4.4.1 Means: Discrete case

MEAN OF A DISCRETE RANDOM VARIABLE: Suppose that $X$ is a discrete random variable whose distribution is

| $x$ | $x_1$ | $x_2$ | $x_3$ | $\cdots$ | $x_k$ |
|---|---|---|---|---|---|
| Probability | $p_1$ | $p_2$ | $p_3$ | $\cdots$ | $p_k$ |

To find the mean of $X$, simply multiply each possible value by its probability and add the results; i.e.,

$$\mu_X = x_1 p_1 + x_2 p_2 + x_3 p_3 + \cdots + x_k p_k = \sum_{i=1}^{k} x_i p_i.$$

PHYSICAL INTERPRETATION: The mean $\mu_X$ of a discrete random variable $X$ may be thought of as "the balance point" on a probability histogram for $X$ (this is similar to how we interpreted $\bar{x}$ for a sample of data in Chapter 1).

Example 4.17. In Example 4.15, we saw the probability distribution for $X$, the number of stop lights at which I stop:

| $x$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| Probability | 0.20 | 0.20 | 0.30 | 0.20 | 0.04 | 0.02 | 0.02 | 0.01 | 0.01 |

What is the mean number of stops I have to make on any given day?

SOLUTION: The mean of $X$ is given by

$$
\begin{aligned}
\mu_X &= x_1 p_1 + x_2 p_2 + x_3 p_3 + \cdots + x_9 p_9 \\
&= 0(0.20) + 1(0.20) + 2(0.30) + 3(0.20) + 4(0.04) + 5(0.02) + 6(0.02) + 7(0.01) + 8(0.01) \\
&= 1.93.
\end{aligned}
$$

Thus, the mean number of stops I make is $\mu_X = 1.93$ stops/day.

INTERPRETATION: The mean number of stops is 1.93; this may cause confusion because 1.93 is not even a value that $X$ can assume! Instead, we may interpret the mean as a long-run average. That is, suppose that I recorded the value of $X$ on 1000 consecutive morning commutes to I-77 (that's over 3 years of commutes!). If I computed $\bar{x}$, the average of these 1000 observations, it would be approximately equal to $\mu_X = 1.93$.

LAW OF LARGE NUMBERS: Draw independent observations at random (e.g., SRS) from any population (distribution) with mean $\mu$. The Law of Large Numbers says that as $n$, the number of
observations drawn, increases, the sample mean $\bar{x}$ of the observed values approaches the population mean $\mu$ and stays arbitrarily close to it. In other words, $\bar{x}$ converges to $\mu$.

REMARK: The Law of Large Numbers (LLN) applies to any probability distribution (not just normal distributions). The LLN is illustrated in Figure 4.38.

Figure 4.38: An illustration of the Law of Large Numbers. Each graph depicts the long-run sample mean $\bar{x}$ computed from 1,000 observations from a $\mathcal{N}(25, 5)$ distribution. Horizontal axis: number of observations; vertical axis: sample mean.

SCENARIO: Fred and Barney are playing roulette in Las Vegas. For those of you that are not familiar with roulette, don't worry; the important thing to know is that the probability of winning on a given color (red or black) is $p = 18/38$. There are 18 red slots, 18 black slots, and 2 green slots. Fred is playing roulette, always bets on red, and he has lost 10 consecutive times. Barney says "play one more time; you are due for a win!" What is wrong with this reasoning?

ANSWER: The important statistical issue that Barney is ignoring is that the spins of the roulette wheel are likely independent. That is, what comes up in one spin has nothing to do with what comes up in other spins. That Fred has lost the previous 10 spins is irrelevant insofar as his next play. Also, Barney probably has never heard of the Law of Large Numbers. The value $p = 18/38$ is really a mean! To see this, let $X$ denote the outcome of Fred's bet on any given spin (i.e., $x = 0$ if Fred loses and $x = 1$ if Fred wins). Then, $x = 1$ occurs with probability $p = 18/38$ and $x = 0$ occurs with probability $1 - p = 20/38$. That is, $X$ obeys the following probability distribution:

| $x$ | 0 (loss) | 1 (win) |
|---|---|---|
| Probability | 20/38 | 18/38 |

Here, the mean of $X$ is

$$\mu_X = x_1 p_1 + x_2 p_2 = 0(20/38) + 1(18/38) = 18/38 = p.$$

By the
Law of Large Numbers, we know that the proportion of Fred's wins, say, $\hat{p}$, will get close to $p = 18/38$ if he plays the game a large number of times. That is, if he continues to play over and over again, he will win approximately $100(18/38) = 47.4$ percent of the time in the long run; this has little to do with what will happen on the next play.

REMARK: The text describes Barney's erroneous reasoning as the "Law of Small Numbers." Many people incorrectly believe in this; that is, they expect even short sequences of random events to show the kind of average behavior that, in fact, appears only in the long run. Sports commentators are all guilty of this!

4.4.2 Variances: Discrete case

VARIANCE OF A DISCRETE RANDOM VARIABLE: Suppose that $X$ is a discrete random variable whose distribution is

| $x$ | $x_1$ | $x_2$ | $x_3$ | $\cdots$ | $x_k$ |
|---|---|---|---|---|---|
| Probability | $p_1$ | $p_2$ | $p_3$ | $\cdots$ | $p_k$ |

To find the variance of $X$, we use the following formula:

$$\sigma_X^2 = (x_1 - \mu_X)^2 p_1 + (x_2 - \mu_X)^2 p_2 + (x_3 - \mu_X)^2 p_3 + \cdots + (x_k - \mu_X)^2 p_k = \sum_{i=1}^{k} (x_i - \mu_X)^2 p_i.$$

The standard deviation is the positive square root of the variance; i.e., $\sigma_X = \sqrt{\sigma_X^2}$.

Example 4.18. In Example 4.15, we saw the probability distribution for $X$, the number of stop lights at which I stop:

| $x$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| Probability | 0.20 | 0.20 | 0.30 | 0.20 | 0.04 | 0.02 | 0.02 | 0.01 | 0.01 |

In Example 4.17, we computed the mean to be $\mu_X = 1.93$. The variance of $X$ is given by

$$
\begin{aligned}
\sigma_X^2 &= \sum_{i=1}^{9} (x_i - \mu_X)^2 p_i \\
&= (0 - 1.93)^2(0.20) + (1 - 1.93)^2(0.20) + (2 - 1.93)^2(0.30) \\
&\quad + (3 - 1.93)^2(0.20) + (4 - 1.93)^2(0.04) + (5 - 1.93)^2(0.02) \\
&\quad + (6 - 1.93)^2(0.02) + (7 - 1.93)^2(0.01) + (8 - 1.93)^2(0.01) \\
&= 2.465.
\end{aligned}
$$

Thus, the standard deviation of $X$ is equal to $\sigma_X = \sqrt{2.465} = 1.570$ stops/day.

DIFFERENCES: In Chapter 1, we spent some time talking about the values $\bar{x}$ and $s$, the sample mean and sample standard deviation computed from a sample of data. These values are statistics because they are computed from a sample of data. On the other hand, the values $\mu_X$ and $\sigma_X$ are parameters. They are not computed from sample data; rather, they are values that are associated with the population (i.e., distribution) of values that are possible. As a general rule, we use sample statistics to estimate population parameters. Thus,
if I drove to work, say, $n = 25$ days, and computed the sample mean and sample standard deviation of these 25 daily observations, I would have $\bar{x}$ and $s$. If I didn't know the probability distribution of $X$, the number of stops per day, I could use these values as estimates of $\mu_X$ and $\sigma_X$, respectively.

MATERIAL TO SKIP: From the text (MM), we are not going to cover the "Rules for Means" section on pages 298-299 and the "Rules for Variances" section on pages 301-304. We are also going to skip Section 4.5 ("General Probability Rules").

5 Sampling Distributions

5.1 The binomial distribution

BERNOULLI TRIALS: Many experiments can be envisioned as a sequence of trials. Bernoulli trials are trials that have the following characteristics:
(i) each trial results in either a "success" or a "failure" (i.e., there are only 2 possible outcomes per trial),
(ii) there are $n$ trials (where $n$ is fixed in advance),
(iii) the trials are independent, and
(iv) the probability of success, denoted as $p$, $0 < p < 1$, is the same on every trial.

TERMINOLOGY: With a sequence of $n$ Bernoulli trials, define $X$ by

$X$ = number of successes (out of $n$).

Then, $X$ is said to have a binomial distribution with parameters $n$ (the number of trials performed) and success probability $p$. We write $X \sim \mathcal{B}(n, p)$.

NOTE: A binomial random variable is discrete because it can only assume $n + 1$ values.

Example 5.1. Each of the following situations might be modeled as binomial experiments. Are you satisfied with the Bernoulli assumptions in each instance?

(a) Suppose we flip a fair coin 10 times and let $X$ denote the number of tails in 10 flips. Here, $X \sim \mathcal{B}(n = 10, p = 0.5)$.

(b) In a field experiment, forty percent of all plots respond to a certain treatment. I have four plots of land to be treated. If $X$ is the number of plots that respond to the treatment, then $X \sim \mathcal{B}(n = 4, p = 0.4)$.

(c) In a large African city, the prevalence rate for HIV is about 12 percent. Let $X$ denote the number of HIV infecteds in a sample of 500 individuals. Here, $X \sim \mathcal{B}(n = 500, p = 0.12)$.

(d)
It is known that screws produced by a certain company do not meet specifications (i.e., are defective) with probability 0.001. Let $X$ denote the number of defectives in a package of 40. Then, $X \sim \mathcal{B}(n = 40, p = 0.001)$.

Example 5.2. Explain why the following are not binomial experiments.
(a) I draw 3 cards from an ordinary deck and count $X$, the number of aces. Drawing is done without replacement.
(b) A couple decides to have children until a girl is born. Let $X$ denote the number of children the couple will have.
(c) In a sample of 5000 individuals, I record the age of each person, denoted as $X$.
(d) A chemist repeats a solubility test ten times on the same substance. Each test is conducted at a temperature 10 degrees higher than the previous test. Let $X$ denote the number of times the substance dissolves completely.

GOAL: We would like to compute probabilities associated with binomial experiments, so we need to derive a formula that allows us to do this. Recall that $X$ is the number of successes in $n$ Bernoulli trials, and $p$ is the probability of success on any one trial. How can we get exactly $x$ successes in $n$ trials? Denoting S = success and F = failure for an individual Bernoulli trial, a possible outcome in the underlying sample space for $n$ Bernoulli trials is
$$\underbrace{SSFSFSFFS \cdots FSF}_{n \text{ trials}}$$
Because the individual trials are independent, the probability that we get any particular ordering of $x$ successes and $n - x$ failures is $p^x (1 - p)^{n - x}$. Now, how many ways are there to choose $x$ successes from $n$ trials? The answer to this last question is $\binom{n}{x}$, a binomial coefficient computed as follows:
$$\binom{n}{x} = \frac{n!}{x!\,(n - x)!}.$$
Thus, for values of $x = 0, 1, 2, \ldots, n$, the probability formula for $X$ is
$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}.$$

RECALL: For any positive integer $a$, $a! = a \times (a - 1) \times (a - 2) \times \cdots \times 2 \times 1$. The symbol $a!$ is read "$a$ factorial." By definition, $0! = 1$. Also, recall that $a^0 = 1$.

Example 5.3. In Example 5.1(b), assume that $X$, the number of plots that respond to a treatment, follows a binomial distribution with $n = 4$ trials and success probability $p = 0.4$. That is, assume that $X \sim$
$\mathcal{B}(n = 4, p = 0.4)$. What is the probability that exactly 2 plots respond? That is, what is $P(X = 2)$?

SOLUTION: Note that
$$\binom{4}{2} = \frac{4!}{2!\,(4 - 2)!} = \frac{24}{2 \times 2} = 6.$$
Thus, using the binomial probability formula, we have
$$P(X = 2) = \binom{4}{2} (0.4)^2 (1 - 0.4)^{4 - 2} = 6 \times 0.16 \times 0.36 = 0.3456.$$
So, you can see that computing binomial probabilities is quite simple. In fact, we can compute all the individual probabilities associated with this experiment:
$$P(X = 0) = \binom{4}{0} (0.4)^0 (1 - 0.4)^{4 - 0} = 1 \times (0.4)^0 \times (0.6)^4 = 0.1296$$
$$P(X = 1) = \binom{4}{1} (0.4)^1 (1 - 0.4)^{4 - 1} = 4 \times (0.4)^1 \times (0.6)^3 = 0.3456$$
$$P(X = 2) = \binom{4}{2} (0.4)^2 (1 - 0.4)^{4 - 2} = 6 \times (0.4)^2 \times (0.6)^2 = 0.3456$$
$$P(X = 3) = \binom{4}{3} (0.4)^3 (1 - 0.4)^{4 - 3} = 4 \times (0.4)^3 \times (0.6)^1 = 0.1536$$
$$P(X = 4) = \binom{4}{4} (0.4)^4 (1 - 0.4)^{4 - 4} = 1 \times (0.4)^4 \times (0.6)^0 = 0.0256$$
Note that these probabilities sum to 1 (as they should).

Figure 5.39: Probability histogram for the number of plots which respond to treatment, $X \sim \mathcal{B}(n = 4, p = 0.4)$.

ADDITIONAL QUESTIONS: (a) What is the probability that at least one plot responds to treatment? (b) at most one responds? (c) all four respond?

BINOMIAL PROBABILITY TABLE: Table C (MM, pages T6-10) contains the binomial probabilities
$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}.$$
This table can be used to avoid computing probabilities by hand. Blank entries in the table correspond to probabilities that are less than 0.0001. Minitab and SAS (as well as other packages like Excel) can also be used.

Example 5.4. In a small Phase II clinical trial with 20 patients, let $X$ denote the number of patients that respond to a new skin rash treatment. The physicians assume independence among the patients. Here, $X \sim \mathcal{B}(n = 20, p)$, where $p$ denotes the probability of response to the treatment. For this problem, we'll assume that $p = 0.3$. We want to compute (a) $P(X = 5)$, (b) $P(X \geq 5)$, and (c) $P(X < 3)$.

Figure 5.40: Probability histogram for number of patients, $X \sim \mathcal{B}(n = 20, p = 0.3)$.

SOLUTIONS: From Table C, the answer to part (a) is $P(X = 5) = \binom{20}{5} (0.3)^5 (0.7)^{20 - 5} =$
0.1789. For part (b), we could compute
$$P(X \geq 5) = P(X = 5) + P(X = 6) + \cdots + P(X = 20).$$
To make this probability calculation, we would have to use the binomial probability formula 16 times and add up the results! Alternatively, we can use the complement rule:
$$P(X \geq 5) = 1 - P(X \leq 4) = 1 - [P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4)]$$
$$= 1 - (0.0008 + 0.0068 + 0.0278 + 0.0716 + 0.1304) = 0.7626.$$
The probabilities in part (b) were taken from Table C. For part (c), we can also use Table C to compute
$$P(X < 3) = P(X \leq 2) = 0.0008 + 0.0068 + 0.0278 = 0.0354.$$

Figure 5.41: Probability histogram for number of seeds, $X \sim \mathcal{B}(n = 15, p = 0.6)$.

MEAN AND VARIANCE OF THE BINOMIAL DISTRIBUTION: Mathematics can show that if $X \sim \mathcal{B}(n, p)$, then the mean and standard deviation are given by
$$\mu_X = np, \qquad \sigma_X = \sqrt{np(1 - p)}.$$

Example 5.4. Suppose that 15 seeds are planted in identical soils and temperatures, and let $X$ denote the number of seeds that germinate. If 60 percent of all seeds germinate on average, and if we assume a $\mathcal{B}(15, 0.6)$ probability model for $X$, the mean number of seeds that germinate is
$$\mu_X = np = 15(0.6) = 9 \text{ seeds}$$
and the standard deviation is
$$\sigma_X = \sqrt{np(1 - p)} = \sqrt{15(0.6)(1 - 0.6)} \approx 1.9 \text{ seeds}.$$

REMARK: Note that the $\mathcal{B}(n = 15, p = 0.6)$ probability histogram is approximately symmetric. As $n$ increases, the $\mathcal{B}(n, p)$ distribution becomes more symmetric.

Figure 5.42: Upper left: $\mathcal{N}(50.73, 10.41)$ probability model for tire lifetimes (in 1000s of miles). Upper right: sampling distribution of $\bar{x}$ when $n = 10$. Lower left: sampling distribution of $\bar{x}$ when $n = 25$. Lower right: sampling distribution of $\bar{x}$ when $n = 100$.

5.2 An introduction to sampling distributions

Example 5.5. Many years of research have led to tread designs for automobile tires which offer good traction. The lifetime of Brand A tires, under regular driving conditions, is a variable which we will denote by $X$ (measured in 1000s of miles). The variable $X$ is assumed to
follow a $\mathcal{N}(50.73, 10.41)$ probability distribution. This distribution is sketched in Figure 5.42 (upper left). The other three histograms were constructed using simulation. To construct the upper right histogram in Figure 5.42,
• I generated 10,000 samples, each of size $n = 10$, from a $\mathcal{N}(50.73, 10.41)$ distribution.
• I computed $\bar{x}$, the sample mean, for each sample.
• I plotted the 10,000 sample means $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_{10000}$ in a histogram (this is what you see in the upper right histogram).

NOTE: The other two histograms were computed in the same fashion, except that I used $n = 25$ (lower left) and $n = 100$ (lower right).

DISCUSSION: What we have just done is use simulation to generate the sampling distribution of the sample mean $\bar{x}$ when $n = 10$, $n = 25$, and $n = 100$. From these sampling distributions, we make the following observations:
• The sampling distributions are all centered around $\mu = 50.73$, the population mean.
• The variability in $\bar{x}$'s sampling distribution looks to get smaller as $n$ increases.
• The sampling distributions all look normal!

TERMINOLOGY: The population distribution of a variable is the distribution of its values for all members of the population. The population distribution is also the probability distribution of the variable when we choose one individual at random from the population. In Example 5.5, the population distribution is $\mathcal{N}(50.73, 10.41)$. This is the distribution for the population of Brand A tires.

TERMINOLOGY: The sampling distribution of a statistic is the distribution of values taken by the statistic in repeated sampling using samples of the same size. In Example 5.5, the sampling distribution of $\bar{x}$ has been constructed by simulating samples of individual observations from the $\mathcal{N}(50.73, 10.41)$ population distribution and then computing $\bar{x}$ for each sample.

OBSERVATION: A statistic from a random sample or randomized experiment is a random variable! This is true because it is computed from a sample of data (which are regarded as realizations of random variables). Thus, the sampling distribution of a
statistic is simply its probability distribution.

REMARK: The sampling distribution of a statistic summarizes how the statistic behaves in repeated sampling. This is an important notion to understand because many statistical procedures that we will discuss are based on the idea of "repeated sampling."

Figure 5.43: Sampling distribution of $\hat{p}$ when $n = 100$ and $p = 0.2$.

5.3 Sampling distributions of binomial proportions

Example 5.6 (see also Example 3.16). A Columbia-based health club wants to estimate $p$, the proportion of Columbia residents who enjoy running as a means of cardiovascular exercise. Since $p$ is a numerical measure of the population (i.e., Columbia residents), it is a parameter. To estimate $p$, suppose that we take a random sample of $n$ residents and record $X$, the number of residents that enjoy running (out of $n$). Then the sample proportion is
$$\hat{p} = \frac{X}{n}.$$
Thus, the sample proportion is simply the binomial count $X$ divided by the sample size $n$. We use $\hat{p}$ (a statistic) to estimate the unknown $p$. In Example 3.16, recall that we simulated the sampling distribution of $\hat{p}$ under the assumption that $n = 100$ and $p = 0.2$ (see Figure 5.43).

SAMPLING DISTRIBUTION OF THE SAMPLE PROPORTION: Let $\hat{p} = X/n$ denote the sample proportion of successes in a $\mathcal{B}(n, p)$ experiment. Mathematics can be used to show that the mean and standard deviation of $\hat{p}$ are
$$\mu_{\hat{p}} = p, \qquad \sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}.$$
Furthermore, for large $n$, mathematics can show that the sampling distribution of $\hat{p}$ is approximately normal. Putting this all together, we can say that
$$\hat{p} \sim \mathcal{AN}\left(p, \sqrt{\frac{p(1 - p)}{n}}\right)$$
for large $n$. The abbreviation $\mathcal{AN}$ stands for "approximately normal."

REMARK: It is important to remember that this is an approximate result (i.e., $\hat{p}$ is not perfectly normal). This isn't surprising either; remember that $\hat{p}$ is a discrete random variable because it depends on the binomial count $X$, which is discrete. What we are saying is that the discrete distribution of $\hat{p}$ is
approximated by a smooth normal density curve. The approximation is best
• for larger values of $n$, and
• for values of $p$ closer to 1/2.
Because of these facts, as a rule of thumb, the text recommends that both $np \geq 10$ and $n(1 - p) \geq 10$ be satisfied. Under these conditions, you can feel reasonably assured that the normal approximation is satisfactory.

NOTE: Figure 5.44 displays simulated sampling distributions for $\hat{p}$ for different values of $n$ and $p$. Two values of $p$ are used. When $p = 0.1$ (top row), the normal approximation is not adequate until $n = 100$ (when $n = 10$ and $n = 40$, the approximation is lousy!). On the other hand, when $p = 0.5$ (bottom row), the normal approximation already looks to be satisfactory when $n = 40$.

Figure 5.44: Sampling distributions for the sample proportion $\hat{p}$ using different $n$ and $p$. Top row: $p = 0.1$ with $n = 10$, $n = 40$, and $n = 100$. Bottom row: $p = 0.5$ with $n = 10$, $n = 40$, and $n = 100$.

Example 5.7. Suppose that in Example 5.6, we plan to take a sample of $n = 100$ residents. If $n = 100$ and $p = 0.2$, what is the probability that 32 or more residents say that they enjoy running as a means of exercise? Do you think this probability will be small or large? Before you read on, look at Figure 5.43.

SOLUTION: To solve this problem, we could compute an exact answer (i.e., one that does not rely on an approximation). If $X$ denotes the number of residents that enjoy running in our sample, and if we assume a binomial model, we know that $X \sim \mathcal{B}(100, 0.2)$. Thus, we could compute
$$P(X \geq 32) = \binom{100}{32} (0.2)^{32} (0.8)^{68} + \binom{100}{33} (0.2)^{33} (0.8)^{67} + \cdots + \binom{100}{100} (0.2)^{100} (0.8)^{0} = 0.0031,$$
using Minitab. This is an exact answer, under the binomial model. We could also get an approximate answer using the normal approximation. Here, $np = 100(0.2) = 20$ and $n(1 - p) = 100(0.8) = 80$, which are both larger than 10. Thus, we know that
$$\hat{p} \sim \mathcal{AN}\left(0.2, \sqrt{\frac{0.2(1 - 0.2)}{100}}\right).$$
Using this fact, we can compute
$$P(X \geq 32) = P(\hat{p} \geq 0.32) \approx P\left(Z \geq \frac{0.32 - 0.2}{\sqrt{\frac{0.2(1 - 0.2)}{100}}}\right) = P(Z \geq 3.0) = 0.0013.$$
Thus, the approximation is close to the exact answer; both probabilities are very small. This should convince you that the normal approximation to the sampling distribution of $\hat{p}$ can be very good. Furthermore, normal probability calculations are often easier than exact (lengthy) binomial calculations, if you don't have statistical software!

DISCUSSION: In Example 5.7, suppose that we did observe $\hat{p} = 0.32$ in our sample. Since $P(\hat{p} \geq 0.32)$ is so small, what does this suggest? This small probability might lead us to question whether or not $p$ truly is 0.2! Which values of $p$ would be more consistent with the observed $\hat{p} = 0.32$: values of $p$ larger than 0.2 or values of $p$ smaller than 0.2?

REMARK: One can use mathematics to show that the $\mathcal{B}(n, p)$ and the $\mathcal{N}(np, \sqrt{np(1 - p)})$ distributions are also close when the sample size $n$ is large. Put another way, one could also use the normal approximation when dealing with binomial counts. We won't pay too much attention to this, however, since a problem involving $X \sim \mathcal{B}(n, p)$ can always be stated as a problem involving $\hat{p}$. Hence, there is no need to worry about the "continuity correction" material in MM (p. 347-8). Besides, we can always make exact binomial probability calculations with the help of statistical software (as in Example 5.7).

Example 5.8. Hepatitis C (HCV) is a viral infection that causes cirrhosis and cancer of the liver. Since HCV is transmitted through contact with infectious blood, screening donors is important to prevent further transmission. Currently, the worldwide seroprevalence rate of HCV is around 3 percent, and the World Health Organization has projected that HCV will be a major burden on the US health care system before the year 2020. A study was performed recently at the Blood Transfusion Service in Xuzhou City, China. The study involved an SRS of $n = 1875$ individuals and each was tested for the HCV antibody. If $p = 0.03$, what is the probability that 70 or more individuals will test positive?
Figure 5.45: Sampling distribution of $\hat{p}$ when $n = 1875$ and $p = 0.03$.

SOLUTION: If $X$ denotes the number of infecteds, and if we assume a binomial model (is this reasonable?), then $X \sim \mathcal{B}(n = 1875, p = 0.03)$. Here, $np = 1875(0.03) = 56.25$ and $n(1 - p) = 1875(0.97) = 1818.75$. Thus, we can feel comfortable with the normal approximation for $\hat{p}$; that is,
$$\hat{p} \sim \mathcal{AN}\left(0.03, \sqrt{\frac{0.03(1 - 0.03)}{1875}}\right).$$
Using this fact, we can compute
$$P(X \geq 70) = P\left(\hat{p} \geq \frac{70}{1875}\right) \approx P\left(Z \geq \frac{\frac{70}{1875} - 0.03}{\sqrt{\frac{0.03(1 - 0.03)}{1875}}}\right) = P(Z \geq 1.86) = 0.0314.$$
Again, the event $\{X \geq 70\}$ is not too likely under the assumption that $p = 0.03$. Thus, if we did observe 70 or more HCV positives in our sample of $n = 1875$, what might this suggest about individuals living near Xuzhou City? Note that the exact value of $P(X \geq 70)$, using Minitab, is 0.0339, very close to the approximation!

5.4 Sampling distributions of sample means

REMARK: Binomial counts and sample proportions are discrete random variables that summarize categorical data. To summarize data for continuous random variables, we use sample means, percentiles, sample standard deviations, etc. These statistics also have sampling distributions! In this subsection, we pay particular attention to the sampling distribution of $\bar{x}$, the sample mean.

DISCUSSION: In Example 5.5, we simulated the sampling distribution of $\bar{x}$ using the $\mathcal{N}(50.73, 10.41)$ population distribution. We saw that in each case (i.e., when $n = 10$, $n = 25$, and $n = 100$), the sampling distribution of $\bar{x}$ looked to be well modeled by a normal distribution (with the same population mean, but with smaller standard deviation). This is not surprising in light of the following mathematical facts.

FACTS: Suppose that the statistic $\bar{x}$ is computed using an SRS from a population distribution (not necessarily normal!) with mean $\mu$ and standard deviation $\sigma$. The mathematical arguments on p. 360-1 show that
$$\mu_{\bar{x}} = \mu, \qquad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}.$$
That is, the mean of the sampling distribution of $\bar{x}$ is equal to the population mean $\mu$, and the standard deviation of the sampling distribution of $\bar{x}$ is $1/\sqrt{n}$ times the
population standard deviation. In light of the preceding facts, we can state that
• The sample mean $\bar{x}$ is an unbiased estimator for the population mean $\mu$.
• The precision with which $\bar{x}$ estimates $\mu$ improves as the sample size increases. This is true because the standard deviation of $\bar{x}$ gets smaller as $n$ gets larger.
• Thus, the sample mean $\bar{x}$ is a very good estimate for the population mean $\mu$; it has the desirable properties that an estimator should have, namely, unbiasedness and small variability for large sample sizes.

BIG RESULT: If a population has a $\mathcal{N}(\mu, \sigma)$ distribution, and the statistic $\bar{x}$ is computed from an SRS from this population distribution, then
$$\bar{x} \sim \mathcal{N}\left(\mu, \frac{\sigma}{\sqrt{n}}\right).$$
Note that this is not an approximate result; it is an exact result. If the underlying population distribution is $\mathcal{N}(\mu, \sigma)$, the sample mean $\bar{x}$ will also vary according to a normal distribution. This sampling distribution will have the same mean, but will have smaller standard deviation. The simulated sampling distributions of $\bar{x}$ in Figure 5.42 simply reinforce this result.

Example 5.9. In the interest of pollution control, an experimenter records $X$, the amount of bacteria per unit volume of water (measured in mg/cm$^3$). The population distribution for $X$ is assumed to be $\mathcal{N}(48, 10)$.
(a) What is the probability that a single water specimen's bacteria amount will exceed 50 mg/cm$^3$?
SOLUTION: Here, we use the population distribution $\mathcal{N}(48, 10)$ to compute
$$P(X \geq 50) = P\left(Z \geq \frac{50 - 48}{10}\right) = P(Z \geq 0.2) = 0.4207.$$
(b) Suppose that the experimenter takes an SRS of $n = 100$ water specimens. What is the probability that the sample mean $\bar{x}$ will exceed 50 mg/cm$^3$?
SOLUTION: Here, we need to use the sampling distribution of the sample mean $\bar{x}$. Since the population distribution is normal, we know that
$$\bar{x} \sim \mathcal{N}\left(\mu, \frac{\sigma}{\sqrt{n}}\right) = \mathcal{N}\left(48, \frac{10}{\sqrt{100}}\right) = \mathcal{N}(48, 1).$$
Thus,
$$P(\bar{x} \geq 50) = P\left(Z \geq \frac{50 - 48}{1}\right) = P(Z \geq 2.00) = 0.0228.$$
Thus, we see that $P(X \geq 50)$ and $P(\bar{x} \geq 50)$ are very different probabilities!

CURIOSITY: What happens when we sample data from a nonnormal population distribution? Does $\bar{x}$ still have a normal
distribution?

Example 5.10. Animal scientists are interested in the proximate mechanisms animals use to guide movements. The ultimate bases for movements are related to animal adaptations to different environments and the development of behaviors that bring them to those environments. Denote by $X$ the distance (in meters) that an animal moves from its birth site to the first territorial vacancy. For the banner-tailed kangaroo rat, the population distribution of $X$ has mean $\mu = 10$ and standard deviation $\sigma = 10$. The density curve that summarizes this population distribution is given in Figure 5.46 (upper left). This density curve is sometimes called an exponential distribution (for an exponential distribution, the standard deviation equals the mean).

INVESTIGATION: Suppose that I take an SRS of $n$ banner-tailed kangaroo rats and record $X$ for each rat. How will the sample mean $\bar{x}$ behave in repeated sampling? That is, what values of $\bar{x}$ can I expect to see when sampling from this exponential distribution with mean $\mu = 10$? To investigate this, we again use a simulation study.
• Generate 10,000 samples, each of size $n$, from an exponential population distribution with mean $\mu = 10$ and standard deviation $\sigma = 10$.
• Compute $\bar{x}$, the sample mean, for each sample.
• Plot the 10,000 sample means $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_{10000}$ in a histogram. This histogram will be the approximate sampling distribution of $\bar{x}$.

OBSERVATIONS: We first note that the exponential distribution is heavily skewed right. However, from Figure 5.46, we note that
• the sampling distribution for $\bar{x}$, when $n = 5$, is still skewed right, but it already is taking a unimodal shape,
• the sampling distribution for $\bar{x}$, when $n = 25$, is almost symmetric! Sharp eyes might be able to detect a slight skew to the right.
• the sampling distribution for $\bar{x}$, when $n = 100$, is nearly perfectly normal in shape.

Figure 5.46: Upper left: population
distribution of distance traveled to first territorial vacancy for banner-tailed kangaroo rats. Upper right: sampling distribution of $\bar{x}$ when $n = 5$. Lower left: sampling distribution of $\bar{x}$ when $n = 25$. Lower right: sampling distribution of $\bar{x}$ when $n = 100$.

REMARK: What we have just witnessed in Example 5.10 is an application of the following important result. It is hard to overstate the importance of this result in statistics.

CENTRAL LIMIT THEOREM: Draw an SRS of size $n$ from any population distribution with mean $\mu$ and standard deviation $\sigma$. When $n$ is large, the sampling distribution of $\bar{x}$ is approximately normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$; that is,
$$\bar{x} \sim \mathcal{AN}\left(\mu, \frac{\sigma}{\sqrt{n}}\right).$$
The real novelty in this result is that the sample mean $\bar{x}$ will be approximately normal for large sample sizes, even if the original population distribution is not! Remember that $\mathcal{AN}$ is an abbreviation for approximately normal.

HOW GOOD IS THE APPROXIMATION? Since the Central Limit Theorem (CLT) only offers an approximate sampling distribution for $\bar{x}$, one might naturally wonder exactly how good the approximation is. In general, the goodness of the approximation jointly depends on
• the sample size $n$, and
• the skewness in the underlying population distribution.

Figure 5.47: Upper left: gamma population distribution with $\mu = 12$ and $\sigma = \sqrt{48}$. Upper right: sampling distribution of $\bar{x}$ when $n = 5$. Lower left: sampling distribution of $\bar{x}$ when $n = 25$. Lower right: sampling distribution of $\bar{x}$ when $n = 100$.

JOINT BEHAVIOR: For heavily skewed population distributions (such as the exponential), we need the sample size to be larger for the CLT to "work." On the other hand, for population distributions that are not too skewed, the CLT will "work" even when smaller values of $n$ are used. To illustrate, consider the population distribution in Figure 5.47 (upper left). This is a gamma distribution with
mean $\mu = 12$ and standard deviation $\sigma = \sqrt{48} \approx 6.93$. You should see that with only $n = 5$ (upper right), the normal approximation to the sampling distribution of $\bar{x}$ is already very good! At $n = 25$, it is already almost perfect (compare these findings with the exponential distribution in Figure 5.46).

QUESTION: So, how large does $n$ have to be for the CLT to work? Unfortunately, there is no minimum sample size $n$ that fits in every situation! As a general rule, larger sample sizes are needed when the population distribution is more skewed; smaller sample sizes are needed when the population distribution is less skewed.

Example 5.11. There are many breeds of potatoes that are sold and studied throughout the United States and all over the world. For all varieties, it is important to conduct carefully designed experiments in order to study the marketable maturity and adaptability, with emphasis upon general appearance, susceptibility to infection, and external and internal defects which affect specific markets. For one specific variety of potatoes, Cherry Red, an experiment was carried out using 40 plots of land. The plots were fairly identical in every way in terms of soil composition, amount of precipitation, etc. The population distribution of yields from last year's harvest was estimated to have mean $\mu = 158.2$ and standard deviation $\sigma = 14.9$ bushels/plot. Suppose that this year's average yield (in the forty plots) was $\bar{x} = 155.8$. Would you consider this unusual?

SOLUTION: We can answer this question by computing $P(\bar{x} \leq 155.8)$ under the assumption that $\mu = 158.2$ and $\sigma = 14.9$. From the CLT, we know that
$$\bar{x} \sim \mathcal{AN}\left(\mu, \frac{\sigma}{\sqrt{n}}\right) = \mathcal{AN}\left(158.2, \frac{14.9}{\sqrt{40}}\right) = \mathcal{AN}(158.2, 2.356).$$
Thus, we have that
$$P(\bar{x} \leq 155.8) \approx P\left(Z \leq \frac{155.8 - 158.2}{2.356}\right) = P(Z \leq -1.02) = 0.1539.$$
This probability is not that small! In light of this, would you consider $\bar{x} = 155.8$ to be all that unusual?

FINAL NOTE: That the sample proportion
$$\hat{p} \sim \mathcal{AN}\left(p, \sqrt{\frac{p(1 - p)}{n}}\right)$$
for large $n$ is really just an application of the Central Limit Theorem! This is true because the sample proportion $\hat{p}$ is
really just an average of values that are 0s and 1s. For more detail, see p. 365.

6 Introduction to Statistical Inference

6.1 Introduction

We are now ready to formally start the second part of this course; namely, we start our introduction to statistical inference. As we have learned, this part of the analysis aims to answer the question, "What do the observed data in my sample suggest about the population of individuals?"

TWO MAIN AREAS OF STATISTICAL INFERENCE:
• estimation of one or more population parameters (e.g., confidence intervals)
• hypothesis tests for population parameters.

PREVIEW: We wish to learn about a single population (or multiple populations) based on the observations in a random sample from an observational study or from a randomized experiment. The observed data in our sample arise by chance, because of natural variability (e.g., biological, etc.) in the population of interest. Thus, the statements that we make about the population will be probabilistic in nature. From our sample data, we will never get to make deterministic statements about our population.

Example 6.1. Here are some examples of the types of problems that we will consider throughout the rest of the course.
• A veterinarian is studying the effect of lead poisoning on a species of Canadian geese. She wants to estimate $\mu$, the mean plasma glucose level (measured in mg/100 mL plasma) for the population of infected geese.
• An aluminum company is experimenting with a new design for electrolytic cells. A major design objective is to maximize a cell's mean service life. Is the new design superior to the industry standard?
• In a large-scale Phase III clinical trial involving patients with advanced lung cancer, the goal is to determine the efficacy of a new drug. Ultimately, we would like to know whether this drug will extend the life of lung cancer patients as compared to current available therapies.
• A tribologist is studying the effects of different
lubrications in the design of a rocket engine. He is interested in estimating the probability of observing a cracked bolt in a particular mechanical assembly in the population of engines in the fleet.
• In laboratory work, it is desirable to run careful checks on the variability of readings produced on standard samples. In a study of the amount of calcium in drinking water undertaken as part of a water quality assessment, multiple observations are taken on the same sample to investigate whether or not there was too much variability in the measurement system.
• The Montana Agricultural Experiment Station continues to develop wheat varieties widely adopted by Montana producers. In an experiment with four varieties, researchers are interested in whether there is significant variability among the wheat yields.

6.2 Confidence intervals for a population mean when $\sigma$ is known

REMARK: We start our discussion of confidence intervals in a rather unrealistic setting. To be specific, we would like to construct a confidence interval for the population mean $\mu$ when $\sigma$ (the population standard deviation) is known.

REMARK: Why is this an unrealistic situation? Recall that $\sigma$ is a parameter; that is, it is a measure that summarizes the entire population. Rarely (i.e., almost never) will population parameters be known. However, there are instances where we might have a "reliable guess" for $\sigma$ (perhaps from past data or knowledge). In this situation, the methods that we discuss in this chapter are applicable. In the next chapter, we discuss the more realistic situation where $\sigma$ is not known.

RECALL: With data from a random sample, we have seen that the sample mean $\bar{x}$ is an unbiased estimate for the population mean $\mu$. However, we know that the value of $\bar{x}$ changes from sample to sample, whereas $\mu$ does not change. That is, the estimate $\bar{x}$ is a "likely value" of $\mu$, but that is all. Recall that if our data are normally distributed, then
$$\bar{x} \sim \mathcal{N}\left(\mu, \frac{\sigma}{\sqrt{n}}\right).$$
Since $\bar{x}$ can assume many different values, in theory an infinite number of
values, instead of reporting a single value as an estimate for a parameter, let's report an interval of plausible values for the parameter. That is, let's report an interval that is likely to contain the unknown parameter $\mu$. Of course, the term "likely" means that probability is going to come into play.

TERMINOLOGY: A confidence interval for a parameter is an interval of likely (plausible) values for the parameter.

DERIVATION: To find a confidence interval for the population mean $\mu$, we start with the sampling distribution of $\bar{x}$. In particular, we know that from a random sample of normally distributed data, the statistic
$$\bar{x} \sim \mathcal{N}\left(\mu, \frac{\sigma}{\sqrt{n}}\right).$$
Thus, we know that, by standardizing,
$$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0, 1),$$
and
$$P\left(-z_{\alpha/2} < \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right) = 1 - \alpha,$$
where $z_{\alpha/2}$ is the upper $\alpha/2$ percentile from the standard normal distribution; i.e., $z_{\alpha/2}$ satisfies
$$P(Z < -z_{\alpha/2}) = P(Z > z_{\alpha/2}) = \alpha/2.$$
For example, if $\alpha = 0.05$, then $\alpha/2 = 0.025$ and $z_{0.025} = 1.96$ from Table A (see also Figure 6.48).

Figure 6.48: Standard normal density with unshaded area equal to 0.95 (between $-1.96$ and $1.96$).

With $\alpha = 0.05$ (see Figure 6.48), we have that
$$0.95 = P\left(-1.96 < \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} < 1.96\right)$$
$$= P\left(-1.96\frac{\sigma}{\sqrt{n}} < \bar{x} - \mu < 1.96\frac{\sigma}{\sqrt{n}}\right)$$
$$= P\left(1.96\frac{\sigma}{\sqrt{n}} > \mu - \bar{x} > -1.96\frac{\sigma}{\sqrt{n}}\right)$$
$$= P\left(\bar{x} + 1.96\frac{\sigma}{\sqrt{n}} > \mu > \bar{x} - 1.96\frac{\sigma}{\sqrt{n}}\right)$$
$$= P\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right).$$
Thus, we have "trapped" the unknown population mean $\mu$ in between the random endpoints $\bar{x} - 1.96\sigma/\sqrt{n}$ and $\bar{x} + 1.96\sigma/\sqrt{n}$ with probability 0.95 (recall that $\sigma$ is known).

TERMINOLOGY: We call
$$\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right)$$
a 95 percent confidence interval for $\mu$.

Example 6.2. The dissolved oxygen (DO) content (measured in mg/L) is recorded for a sample of $n = 6$ water specimens in a certain geographic location. Here are the data:

2.62 2.65 2.79 2.83 2.91 3.57

From past knowledge, a reliable guess of the population standard deviation is $\sigma = 0.3$ mg/L. Assuming that the DO contents are well represented by a normal distribution, we would like to find a 95 percent confidence interval for $\mu$, the population mean DO concentration, i.e., of all possible water
specimens from this location. The sample mean is given by
$$\bar{x} = \frac{17.37}{6} = 2.90 \text{ mg/L}.$$
Thus, the 95 percent confidence interval for $\mu$ is
$$\left(2.90 - 1.96 \times \frac{0.3}{\sqrt{6}},\ 2.90 + 1.96 \times \frac{0.3}{\sqrt{6}}\right), \text{ or } (2.66, 3.14).$$
Thus, we are 95 percent confident that the population mean DO concentration $\mu$ is between 2.66 and 3.14 mg/L.

BEHAVIOR: The probability associated with the confidence interval has to do with the random endpoints. The interval
$$\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right)$$
is random (because $\bar{x}$ varies from sample to sample), not the parameter $\mu$. The values of the endpoints change as $\bar{x}$ changes! See Figure 6.3 on p. 386.
• If we are allowed the freedom to take repeated random samples, each of size $n$, then 95 percent of the intervals constructed would contain the unknown parameter $\mu$. Hence, we are making a statement about the process of selecting samples and the methods used. Of course, this has to do with the quality of the data collection!
• We say that "we are 95 percent confident that the interval contains the mean $\mu$." In Example 6.2, we are 95 percent confident that the population mean DO concentration level $\mu$ is between 2.66 and 3.14 mg/L.
• It is not technically correct to say that "the probability that $\mu$ is between 2.66 and 3.14 mg/L is 0.95." Remember, $\mu$ is a fixed quantity; it is either in the interval (2.66, 3.14) or it is not.
• The confidence interval is an interval of likely values for $\mu$, the population mean (a parameter). In Example 6.2, some students would erroneously conclude that 95 percent of the individual DO measurements will be between 2.66 and 3.14 mg/L! To compute the proportion of individual measurements between 2.66 and 3.14 mg/L, we would have to use the population distribution of DO content (not the sampling distribution of $\bar{x}$)!

REMARK: There is nothing special about using a 95 percent confidence interval. We can use any confidence level; the only thing that will change is the value of $z_{\alpha/2}$ in Figure 6.48. Popular confidence levels used in practice are 99 percent, 95 percent (the most common), 90 percent, and even 80 percent.
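The interval from Example 6.2 and the repeated-sampling interpretation above can be checked numerically. The following is a minimal Python sketch, not part of the original notes: the function names `z_interval` and `coverage` are made up for illustration, and the "true" mean of 2.9 used in the simulation is an assumption chosen only to demonstrate coverage. Because the code uses the unrounded sample mean 2.895 rather than 2.90, its lower endpoint comes out near 2.655 instead of the hand-computed 2.66.

```python
import math
import random
from statistics import NormalDist


def z_interval(data, sigma, level=0.95):
    """Confidence interval for a population mean when sigma is known."""
    n = len(data)
    xbar = sum(data) / n
    # Upper alpha/2 percentile of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin


def coverage(mu, sigma, n, level=0.95, reps=10_000, seed=1):
    """Fraction of simulated intervals that contain the assumed true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lo, hi = z_interval(sample, sigma, level)
        hits += lo < mu < hi
    return hits / reps


# DO data from Example 6.2 (mg/L); sigma = 0.3 is treated as known.
do_data = [2.62, 2.65, 2.79, 2.83, 2.91, 3.57]
lo, hi = z_interval(do_data, sigma=0.3)
print(f"95% CI for mu: ({lo:.3f}, {hi:.3f})")

# Hypothetical true mean of 2.9, assumed here only for the simulation.
print("simulated coverage:", coverage(mu=2.9, sigma=0.3, n=6))
```

The simulated coverage should land close to 0.95, which is the sense in which the procedure, rather than any single computed interval, carries the 95 percent probability.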
The larger the confidence level, the "more confident" we are that the resulting interval will contain the population mean $\mu$.

TERMINOLOGY: A level $C$ confidence interval for a parameter is an interval computed from sample data by a method that has probability $C$ of producing an interval containing the true value of the parameter. The value $C$ is called the confidence level. In decimal form, $C = 0.99$, $C = 0.95$, $C = 0.90$, $C = 0.80$, etc.

GENERAL FORM: The general form of a level $1 - \alpha$ confidence interval for a population mean $\mu$ is
$$\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right).$$
To compute this interval, we need four things:
• the sample mean $\bar{x}$,
• the standard normal percentile $z_{\alpha/2}$,
• the population standard deviation $\sigma$, and
• the sample size $n$.

Figure 6.49: Standard normal density with unshaded area equal to 0.80.

FINDING $z_{\alpha/2}$: The value $z_{\alpha/2}$ is the upper $\alpha/2$ percentile of the standard normal distribution. Here are the values of $z_{\alpha/2}$ (found using Table A) for commonly used confidence levels. Make sure you understand where these values come from!

alpha          0.20   0.10   0.05   0.01
C = 1 - alpha  0.80   0.90   0.95   0.99
z_{alpha/2}    1.28   1.65   1.96   2.58

EXERCISE: What would the value of $z_{\alpha/2}$ be for a 92 percent confidence interval? an 84 percent confidence interval? a 77 percent confidence interval?

Example 6.2 (continued). An 80 percent confidence interval for the population mean DO content $\mu$ is
$$\left(2.90 - 1.28 \times \frac{0.3}{\sqrt{6}},\ 2.90 + 1.28 \times \frac{0.3}{\sqrt{6}}\right), \text{ or } (2.74, 3.06).$$
Thus, we are 80 percent confident that the population mean DO concentration $\mu$ is between 2.74 and 3.06 mg/L. Note that this is a shorter interval than the 95 percent confidence interval (2.66, 3.14). Intuitively, this makes sense; namely, the higher the confidence level, the lengthier the interval.

NONNORMAL DATA: When we derived the general form of a level $1 - \alpha$ confidence interval for the population mean $\mu$, we used the fact that
$$\bar{x} \sim \mathcal{N}\left(\mu, \frac{\sigma}{\sqrt{n}}\right).$$
This is only (exactly) true if the data $x_1, x_2, \ldots, x_n$ are normally distributed. However, what if the population distribution is nonnormal? Even
if the underlying population distribution is nonnormal, we know from the Central Limit Theorem that $\bar{x}$ is approximately normally distributed, with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$, when the sample size $n$ is large. In this situation, we call
$$\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$
an approximate level $1 - \alpha$ confidence interval.

Example 6.3. In Example 1.4, we examined the shelf life data from $n = 25$ beverage cans in an industrial experiment. The data (measured in days) are listed below.

    262 188 234 203 212 212 301 225 241 211 231 227 217
    252 206 281 251 219 268 231 279 243 241 290 249

In Chapter 1, we informally decided that a normal distribution was not a bad model for these data (although there was some debate). We computed the range to be $R = x_{\max} - x_{\min} = 301 - 188 = 113$. We can use the range to formulate a guess of $\sigma$. To see how, recall that the distribution of shelf lives was approximately symmetric. Thus, by the Empirical Rule, almost all the data should be within $3\sigma$ of the mean. This allows us to say that
$$R \approx 6\sigma \implies \sigma \approx 113/6 = 18.8.$$
From the 25 cans, recall that we computed $\bar{x} = 238.96$. Thus, an approximate confidence interval for $\mu$, the population mean shelf life, is given by
$$\left(238.96 - 1.96 \times \frac{18.8}{\sqrt{25}},\ 238.96 + 1.96 \times \frac{18.8}{\sqrt{25}}\right) \quad \text{or} \quad (231.59,\ 246.33).$$
Thus, we are approximately 95 percent confident that the population mean shelf life $\mu$ is between 231.59 and 246.33 days.

MARGIN OF ERROR: For a level $1 - \alpha$ confidence interval for a population mean $\mu$, the margin of error is defined as
$$m = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}.$$
This is the quantity that we add to and subtract from the sample mean $\bar{x}$, so that the confidence interval for $\mu$ can be thought of as
$$(\bar{x} - m,\ \bar{x} + m).$$
Recall that this interval is exact when the population distribution is normal and is approximately correct for large $n$ in other cases. Clearly, the length of the interval then is simply
$$L = 2m = 2 z_{\alpha/2}\frac{\sigma}{\sqrt{n}}.$$
This expression for the length of a confidence interval for $\mu$ is helpful. We can use it to study how the length is related to the confidence level $1 - \alpha$, the sample size $n$, and the population standard deviation $\sigma$.

• As the confidence level increases, so
does $z_{\alpha/2}$ (look at Figure 6.50, then look at Figure 6.48). This, in turn, increases the length of the confidence interval. This makes sense: the more confident we want to be, the larger the interval must be.

• As the sample size increases, the length of the interval decreases (note that $n$ is in the denominator of $L$). This also makes sense: the more data (information) we have from the population, the more precise our interval estimate will be. Shorter intervals are desirable; they are more precise than larger intervals.

• The population standard deviation $\sigma$ is a parameter; we can not think about "changing its value." However, if we were to focus on a particular subpopulation, the value of $\sigma$ associated with this smaller population may be smaller. For example, suppose that in Example 6.3 we provided a guess of the population standard deviation to be $\sigma \approx 18.8$. If we were to focus on a smaller population of cans, say, only those cans produced by a particular manufacturer, it may be that the variability in the shelf lives is smaller because, inherently, the cans themselves are more similar! If we can focus on a particular subpopulation with a smaller $\sigma$, we can reduce the length of our interval. Of course, then our inference is only applicable to this smaller subpopulation.

SAMPLE SIZE DETERMINATIONS: Before one launches into a research investigation where data are to be collected, inevitably one starts with the simple question: "How many observations do I need?" The problem is, the answer is often not simple. The answer almost always depends on the resources available (e.g., time, money, space, etc.) and statistical issues like confidence and power. In single population problems, we usually can come up with a sample size formula which incorporates these statistical issues. Recall that the margin of error for a confidence interval for $\mu$ (when $\sigma$ is known) is given by
$$m = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}.$$
This is an equation we can solve for $n$; in particular,
$$n = \left(\frac{z_{\alpha/2}\,\sigma}{m}\right)^2.$$
Thus, if we specify our confidence level $1 - \alpha$, an estimate of the population standard deviation $\sigma$, and a desired margin of error $m$, we can find the sample size needed.

Example 6.4. In a biomedical experiment, we would like to estimate the mean lifetime of healthy rats, $\mu$ (measured in days), that are given a high dose of a toxic substance. This may be done in an early phase clinical trial by researchers trying to find a maximum tolerable dose for humans. Suppose that we would like to write a 99 percent confidence interval for $\mu$ with length no more than 4 days. The researchers have provided a guess of $\sigma \approx 8$ days. How many rats should we use for the experiment?

SOLUTION: With a confidence level of $1 - \alpha = 0.99$, our value of $z_{\alpha/2}$ is $z_{\alpha/2} = z_{0.01/2} = z_{0.005} = 2.58$. Also, if the length of the desired interval is 4 days, our margin of error is $m = 2$ days. The sample size needed is
$$n = \left(\frac{2.58 \times 8}{2}\right)^2 \approx 107.$$
Thus, we would need $n = 107$ rats to achieve these goals.

CURIOSITY: If collecting 107 rats is not feasible, we might think about weakening our requirements (after all, 99 percent confidence is high, and the margin of error is tight). Suppose that we used a 90 percent confidence level instead, with margin of error $m = 5$ days. Then the desired sample size would be
$$n = \left(\frac{1.65 \times 8}{5}\right)^2 \approx 7.$$
This is an easier experiment to carry out now (we need only 7 rats)! However, we have paid a certain price: less confidence and less precision in our interval estimate.

CAUTIONS REGARDING CONFIDENCE INTERVALS: Moore and McCabe (p. 393) offer the following cautions when using confidence intervals for inference.

• Data that we use to construct a confidence interval should be (or should be close to) a simple random sample. If the sample is biased, so will be the results. Poor data collection techniques inundated with nonsampling errors will produce poor results, too!

• Confidence interval formulas for $\mu$ are different if you use different sampling designs (e.g., stratified, cluster, systematic, etc.). We won't discuss these.

• Outliers almost always affect the analysis. You need to be careful in
checking for them and decide what to do about them.

• When the sample size $n$ is small, and when the population distribution is highly nonnormal, the confidence interval formula we use for $\mu$ is probably a bad formula. This is true because the normal approximation to the sampling distribution for $\bar{x}$ is probably a bad approximation. The text offers an $n \geq 15$ guideline for most population distributions. That is, if your sample size is larger than or equal to 15, you're probably fine. However, this is only a guideline. If the underlying population distribution is very skewed, this may or may not be a good guideline (you might need $n$ to be larger). Of course, it goes both ways; namely, if the underlying distribution is very symmetric, then this guideline might actually be too restrictive.

6.3 Hypothesis tests for the population mean when $\sigma$ is known

Example 6.5. In a laboratory experiment to investigate the influence of different doses of vitamin A on weight gain over a three week period, $n = 5$ rats received a standard dose of vitamin A. The following weight increases (mg) were observed:

    35 49 51 43 27

For now, we will assume that these data arise from a population distribution which is normal. A reliable guess of the population standard deviation is $\sigma \approx 14$ mg. The sample mean of these data is $\bar{x} = 41$ mg.

REMARK: Often, we take a sample of observations with a specific research question in mind. For example, consider the data on weight gains of rats treated with vitamin A discussed in Example 6.5. Suppose that we know from several years of experience that the mean weight gain of rats this age and during a three week period, when they are not treated with vitamin A, is 27.8 mg.

SPECIFIC QUESTION: If we treat rats of this age and type with vitamin A, how does this affect 3 week weight gain? That is, if we could administer the standard dose of vitamin A to the entire population of rats of this age and type, would the population mean weight
gain change from what it would be if no vitamin A was given? The sample results seem to suggest so, but whether or not this increase is representative of the population is an entirely different question.

STRATEGY: Since we can not observe the entire population of rats, we observe a sample of such rats, treat them with vitamin A, and view the sample as being randomly drawn from the population of rats treated with vitamin A. This population has an unknown mean weight gain, denoted by $\mu$. Clearly, our question of interest may be regarded as a question about the parameter $\mu$.

(i) Either $H_0: \mu = 27.8$ mg is true; that is, the vitamin A treatment does not affect weight gain, and the mean is equal to what it would be if no vitamin A was given.

(ii) Or $H_1: \mu \neq 27.8$ mg is true; that is, vitamin A does have an effect on the mean weight gain.

TERMINOLOGY: We call $H_0$ the null hypothesis and call $H_1$ the alternative hypothesis. A formal statistical inference procedure for deciding between $H_0$ and $H_1$ is called a hypothesis test.

APPROACH: Suppose that, in truth, the application of vitamin A has no effect on weight gain and that $\mu$ really is 27.8 mg. For the particular sample we observed, $\bar{x} = 41.0$ mg (based on $n = 5$). So, the key question becomes: "How likely is it that we would see a sample mean of $\bar{x} = 41.0$ mg if the population mean really is $\mu = 27.8$ mg?"

• If $\bar{x} = 41.0$ mg is likely when $H_0$ is true, then we would not discount $H_0$ as a plausible explanation. That is, we would not reject $H_0$.

• If $\bar{x} = 41.0$ mg is not likely when $H_0$ is true, then this would cause us to think that $H_0$ is not a plausible explanation. That is, we would reject $H_0$ as an explanation.

THE FORMAL METHOD: As you might expect, we characterize the notion of "likely" and "not likely" in terms of probability. To do this, we pretend that $H_0$ is true and assess the probability of seeing the $\bar{x}$ value we observed in our particular sample.

• If this probability is small, then we reject $H_0$.

• If this probability is not small, then we do not reject $H_0$.

IN GENERAL: Consider
the general situation where we desire to test
$$H_0: \mu = \mu_0 \quad \text{versus} \quad H_1: \mu \neq \mu_0,$$
where $\mu_0$ is the value of interest in the experiment or observational study. If we assume that $H_0$ is true, then $\mu = \mu_0$. Recall that when $x_1, x_2, \ldots, x_n$ is a random sample from a normal population with mean $\mu$ and standard deviation $\sigma$, we know that, when $H_0$ is true, the quantity
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1).$$
Hence, if the null hypothesis is true, $z$ follows a standard normal distribution. The quantity $z$ is sometimes called a one-sample $z$ statistic.

INTUITIVELY: A "likely value" of $\bar{x}$ is one for which $\bar{x}$ is close to $\mu_0$, or equivalently, one for which the value of the $z$ statistic is close to zero. That is, if $H_0$ is true, we would expect to see $\bar{x}$ close to $\mu_0$, or
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \ \text{close to } 0.$$
On the other hand, an unlikely value of $\bar{x}$ would be one for which $\bar{x}$ is not close to $\mu_0$, or equivalently, one for which the $z$ statistic is far away from zero (in either direction).

STRATEGY: To formalize the notion of "unlikely," suppose that we decide to reject $H_0$ if the probability of seeing the value of $\bar{x}$ that we saw is less than some small value $\alpha$, say, $\alpha = 0.05$. That is, for probabilities smaller than $\alpha$, we feel that our sample evidence is strong enough to reject $H_0$.

REALIZATION: Values of the $z$ statistic that are larger than $z_{\alpha/2}$ or smaller than $-z_{\alpha/2}$ (i.e., in the shaded regions in Figure 6.50) are unlikely in the sense that the chance of seeing them is less than $\alpha$, the cut-off probability for "unlikeliness" that we have specified. In Figure 6.50, the level of $\alpha = 0.05$.

Figure 6.50: Standard normal density with shaded region equal to $\alpha = 0.05$ (the central area, between $-1.96$ and $1.96$, equals 0.95). If the value of $z$ falls in the shaded region, our sample evidence against $H_0: \mu = \mu_0$ is strong.

TERMINOLOGY: We call the shaded areas in Figure 6.50 the rejection region for the hypothesis test of $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$.

• If the value of the $z$ statistic falls in the rejection region, then we reject $H_0$. That is, the evidence in our sample shows that the null hypothesized value of the mean, $\mu_0$, is not a likely value for the parameter $\mu$.

• If the $z$
statistic does not fall in one of the shaded regions, then we do not reject $H_0$. That is, there is not enough evidence to refute the conjecture that $\mu = \mu_0$.

• In either case, note that we are making a statement about $H_0$. We either have sufficient evidence to reject it, or we do not.

Example 6.6. Using the rat weight gain data from Example 6.5, we wish to test, at level $\alpha = 0.05$, $H_0: \mu = 27.8$ versus $H_1: \mu \neq 27.8$, with our $n = 5$ experimental values. Recall that our guess for $\sigma$ is $\sigma \approx 14$. The $z$ statistic is equal to
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{41.0 - 27.8}{14/\sqrt{5}} = 2.11.$$
Taking $\alpha = 0.05$, we have that $z_{0.025} = 1.96$. Thus, the rejection region for the test is values of $z$ that are larger than 1.96 or smaller than $-1.96$ (see Figure 6.50).

CONCLUSION: Comparing the $z$ statistic 2.11 to the critical value $z_{0.025} = 1.96$, we see that $2.11 > 1.96$ (see Figure 6.50). Hence, we reject $H_0$ at the $\alpha = 0.05$ level, since the $z$ statistic falls in the rejection region. The evidence in our sample is strong enough to discount $\mu_0 = 27.8$ as plausible; i.e., to conclude that vitamin A does have an effect on the mean weight gain.

TERMINOLOGY: The $z$ statistic
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
is an example of a test statistic. A test statistic is computed from the sample and is used as a basis for deciding between $H_0$ and $H_1$. If the test statistic falls in the $\alpha$ level rejection region for a test, we reject $H_0$ at level $\alpha$.

6.3.1 The significance level

REALIZATION: In the hypothesis test of $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$, we used a cut-off probability level of $\alpha$. We realized that this probability determined the size of the rejection region. The value $\alpha$ is called the significance level for the test. Formally, because we perform the test assuming that $H_0$ is true, $\alpha$ denotes the probability of rejecting a true $H_0$.

REJECTING $H_0$: When do we reject $H_0$? There are two scenarios:

(i) $H_0$ really is not true, and this caused the unlikely value of $z$ that we saw.

(ii) $H_0$ is, in fact, true, but it turned out that we ended up with an "unusual" sample that caused us to reject $H_0$
nonetheless.

TYPE I ERROR: The situation in (ii) is an error. That is, we have made an incorrect judgement between $H_0$ and $H_1$. Unfortunately, because we are dealing with chance mechanisms in the sampling procedure, it is possible to make this type of mistake. A mistake like that in (ii) is called a Type I Error. The probability of committing a Type I Error is equal to $\alpha$, the significance level for the test.

TERMINOLOGY: When we reject $H_0$, we say formally that we reject $H_0$ at the level of significance $\alpha$, or that we have a statistically significant result at level $\alpha$. Thus, we have clearly stated what criterion we have used to determine what is "unlikely." If we do not state the level of significance, other people have no sense of how stringent or lenient we were in our determination. An observed value of the test statistic leading to the rejection of $H_0$ is said to be statistically significant at level $\alpha$.

6.3.2 One- and two-sided tests

TWO-SIDED TESTS: For the rat weight gain study in Example 6.5, we considered the following hypotheses:
$$H_0: \mu = 27.8 \quad \text{versus} \quad H_1: \mu \neq 27.8.$$
The alternative hypothesis here is an example of a two-sided alternative; hence, this is called a two-sided test. The reason for this is that the alternative simply specifies a deviation from the null hypothesized value $\mu_0$ but does not specify the direction of that deviation. Values of the $z$ statistic far enough away from 0, in either direction, will ultimately lead to the rejection of $H_0$.

ONE-SIDED TESTS: In some applications, the researcher may be interested in testing that the population mean $\mu$ is no larger or no smaller than a certain pre-specified value. For example, suppose that we are hopeful that vitamin A not only has some sort of effect on weight gain, but, in fact, it causes rats to gain more weight than they would if they were untreated. In this instance, we would be interested in the following test:
$$H_0: \mu = 27.8 \quad \text{versus} \quad H_1: \mu > 27.8.$$
That is, it might be of more interest to specify a different type of alternative
hypothesis. As we now see, the principles underlying the approach are similar to those in the two-sided case, but the procedure is modified slightly to accommodate the particular direction of a departure from $H_0$ in which we are interested (here, that the population mean is larger than 27.8). The alternative hypothesis $H_1: \mu > 27.8$ is called a one-sided alternative, and the test above is called a one-sided test.

INTUITION: If $H_1$ is really true, we would expect to see a value of $\bar{x}$ larger than 27.8, or, equivalently, a value of
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
larger than 0. With a two-sided test, we only cared about the $z$ statistic being large in magnitude, regardless of direction. But, if our interest is in this one-sided alternative, we now care about the direction.

IN GENERAL: Just as before, consider the general set of hypotheses
$$H_0: \mu = \mu_0 \quad \text{versus} \quad H_1: \mu > \mu_0.$$
If we assume that the null hypothesis $H_0$ is true (i.e., $\mu = \mu_0$), then the one-sample $z$ statistic
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
still follows a standard normal distribution. However, we are now interested only in the situation where $\bar{x}$ is significantly larger than $\mu_0$; i.e., where the $z$ statistic is significantly larger than 0. We know that there is a value $z_\alpha$ such that
$$P(z > z_\alpha) = \alpha.$$
Graphically, the shaded area in Figure 6.51 has area (probability) equal to $\alpha = 0.05$. Note here that the entire probability $\alpha$ is concentrated in the right tail, because we are only interested in large, positive values of the $z$ statistic. With $\alpha = 0.05$, values of the $z$ statistic greater than $z_{0.05} = 1.65$ are "unlikely" in this sense.

Figure 6.51: Standard normal density with shaded area equal to $\alpha = 0.05$ (the unshaded area equals 0.95). This is the rejection region for a one-sided upper tail test. If the value of $z$ falls in the shaded region, our sample evidence against $H_0$ is strong.

Example 6.7. Using the rat weight gain data from Example 6.5, we wish to test, at level $\alpha = 0.05$, $H_0: \mu = 27.8$ versus $H_1: \mu > 27.8$. The test statistic is the same as before; i.e.,
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{41.0 - 27.8}{14/\sqrt{5}} = 2.11.$$

CONCLUSION: From Table A (MM), with $\alpha = 0.05$, we have $z_{0.05} = 1.65$ (see also Figure 6.51). Comparing the value of the $z$ statistic to this tabled value, we see that $2.11 > 1.65$. We thus reject $H_0$ at the $\alpha = 0.05$ level; that is, the evidence in the sample is strong enough to conclude that the mean weight gain is greater than $\mu_0 = 27.8$ mg; i.e., that vitamin A increases weight gain on average.

QUESTION: Is there enough evidence to reject $H_0$ at the $\alpha = 0.01$ level? at the $\alpha = 0.001$ level? Why or why not? Note that $z_{0.01} = 2.33$ and $z_{0.001} = 3.08$ (using Table A).

6.3.3 A closer look at the rejection region

TWO-SIDED REJECTION REGION: The test of hypotheses of the form
$$H_0: \mu = \mu_0 \quad \text{versus} \quad H_1: \mu \neq \mu_0$$
is a two-sided test. The alternative hypothesis specifies that $\mu$ is different from $\mu_0$ but may be on either side of it. With tests of this form, we know that the rejection region is located in both tails of the standard normal distribution, with total shaded area equal to $\alpha$ (the significance level for the test).

ONE-SIDED REJECTION REGION: Similarly, tests of hypotheses of the form
$$H_0: \mu = \mu_0 \quad \text{versus} \quad H_1: \mu > \mu_0$$
and
$$H_0: \mu = \mu_0 \quad \text{versus} \quad H_1: \mu < \mu_0$$
are one-sided hypothesis tests. The alternative in which we are interested lies to one side of the null hypothesized value. For the first test, the rejection region is located in the upper tail of the standard normal distribution. For the second test, the rejection region is located in the lower tail of the standard normal distribution. In either case, the area of the rejection region equals $\alpha$, the significance level for the test.

6.3.4 Choosing the significance level

CURIOSITY: How does one decide on an appropriate value for $\alpha$? Recall that we mentioned a particular type of mistake that we might make, that of a Type I Error. Because we perform the hypothesis test under the assumption that $H_0$ is true, this means that the probability we reject a true $H_0$ is equal to $\alpha$. Choosing $\alpha$, thus, has to do with how serious a mistake a Type I Error might be in the particular
application.

Example 6.8. Suppose the question of interest concerns the efficacy of a costly new drug for the treatment of advanced glaucoma in humans, and the new drug has potentially dangerous side effects (e.g., permanent loss of sight, etc.). Suppose a study is conducted where sufferers of advanced glaucoma are randomly assigned to receive either the standard treatment or the new drug (a clinical trial), and suppose the random variable of interest is the period of prolongation of sight. It is known that the standard drug prolongs sight for $\mu_0$ months. We hope that the new drug is more effective in the sense that it increases the prolongation of sight, in which case it may be worth its additional expense and risk of side effects. We thus would consider testing $H_0: \mu = \mu_0$ versus $H_1: \mu > \mu_0$, where $\mu$ denotes the mean prolongation period under treatment of the new drug.

SCENARIO: Suppose that, after analyzing the data, unbeknownst to us, our sample of patients leads us to commit a Type I Error; that is, we reject a true $H_0$ and claim that the new drug is more effective than the standard drug when, in reality, it isn't. Because the new drug is so costly and carries the possibility of serious side effects, this could be a very serious mistake. In this instance, patients would be paying more, with the risk of dangerous side effects, for no real gain over the standard treatment. In a situation like this, it is intuitively clear that we would like $\alpha$ to be very small, so that the chance of claiming a new treatment is superior, when it really isn't, is small. That is, we would like to be conservative. In situations where the consequences of a Type I Error are not so serious, there is no reason to take $\alpha$ to be so small. That is, we might choose a more anticonservative significance level. This is often done in preliminary investigations.

TYPE II ERROR: Committing a Type I Error is not the only type of mistake that we can make in a hypothesis test. The sample data might be "unusual" in such
a way that we end up not rejecting $H_0$ when $H_1$ is really true. This type of mistake is called a Type II Error. Because a Type II Error is also a mistake, we would like the probability of committing such an error, say, $\beta$, to also be small. In many situations, committing a Type II Error is not as serious as committing a Type I Error. In Example 6.8, if we commit a Type II Error, we infer that the new drug is not effective when it really is. Although this, too, is undesirable, as we are discarding a potentially better treatment, we are no worse off than before we conducted the test, whereas, if we commit a Type I Error, we will unduly expose patients to unnecessary costs and risks for no gain.

SUMMARY:

• Type I Error: Reject $H_0$ when it is true.

• Type II Error: Not rejecting $H_0$ when $H_1$ is true.

Figure 6.52: Standard normal density with the area to the right of $z = 2.11$. The shaded area equals 0.0174, the probability value for the one-sided test in Example 6.7.

6.3.5 Probability values

INVESTIGATION: In Example 6.7, we considered the one-sided test $H_0: \mu = 27.8$ versus $H_1: \mu > 27.8$. Recall that our one-sample $z$ statistic was $z = 2.11$. We saw that when $\alpha = 0.05$, we rejected $H_0$ since $z > z_{0.05} = 1.65$. However, when $\alpha = 0.01$, we do not reject $H_0$ since $z < z_{0.01}$.

REALIZATION: From the above statements, we know that there must exist some $\alpha$ between 0.01 and 0.05 where our test statistic $z = 2.11$ and critical value $z_\alpha$ will be the same. The value of $\alpha$ where this occurs is called the probability value for the test.

TERMINOLOGY: The smallest value of $\alpha$ for which $H_0$ is rejected is called the probability value of the test. We often abbreviate this as P value.

CALCULATION: In Example 6.7, the probability value for the test is the area to the right of $z = 2.11$ under the standard normal distribution; i.e.,
$$P(z > 2.11) = 0.0174.$$
Note that how we compute the probability value is consistent with the alternative hypothesis $H_1: \mu > 27.8$. We reject $H_0$ for large
positive values of $z$; thus, the probability value is the area in the right tail of the distribution (see Figure 6.52).

MAIN POINT: In any hypothesis test, we can always make our decision by comparing the probability value to our significance level $\alpha$. In particular,

• If the probability value is smaller than $\alpha$, we reject $H_0$.

• If the probability value is not smaller than $\alpha$, we do not reject $H_0$.

RULES FOR COMPUTING PROBABILITY VALUES: If we have a one-sided hypothesis test of the form
$$H_0: \mu = \mu_0 \quad \text{versus} \quad H_1: \mu > \mu_0,$$
the probability value for the test is given by the area to the right of $z$ under the standard normal distribution. If we have a one-sided hypothesis test of the form
$$H_0: \mu = \mu_0 \quad \text{versus} \quad H_1: \mu < \mu_0,$$
the probability value for the test is given by the area to the left of $z$ under the standard normal distribution. If we have a two-sided hypothesis test
$$H_0: \mu = \mu_0 \quad \text{versus} \quad H_1: \mu \neq \mu_0,$$
the probability value for the test is given by the area to the right of $|z|$ plus the area to the left of $-|z|$ under the standard normal distribution.

IN ANY CASE: If the probability value is smaller than $\alpha$, we reject $H_0$.

Example 6.8. It is thought that the body temperature of intertidal crabs exposed to air is less than the ambient temperature. Body temperatures were obtained from a random sample of $n = 8$ such crabs exposed to an ambient temperature of 25.4 degrees C:

    25.8 24.6 26.1 24.9 25.1 25.3 24.0 24.5

Assume that the body temperatures are approximately normally distributed, and let $\mu$ denote the mean body temperature for the population of intertidal crabs exposed to an ambient temperature of 25.4 degrees C. Then, we wish to test, say, at the conservative $\alpha = 0.01$ level,
$$H_0: \mu = 25.4 \quad \text{versus} \quad H_1: \mu < 25.4.$$

ANALYSIS: From past experience, a reliable guess is $\sigma \approx 0.7$ degrees C. Simple calculations show that $\bar{x} = 25.0$, so that the one-sample $z$ statistic is
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{25.0 - 25.4}{0.7/\sqrt{8}} = -1.62.$$
To compute the P value, note that $P(z < -1.62) = 0.0526$, which is not less than 0.01. Thus, we do not have enough evidence against $H_0$.
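The three probability value rules, and the crab calculation in Example 6.8, can be checked numerically. The following is a minimal Python sketch, not part of the original notes; the function names `normal_cdf` and `z_test` are hypothetical, and the standard normal areas come from the standard library's `math.erfc` rather than from Table A. It uses the rounded sample mean 25.0 reported in the text.

```python
import math


def normal_cdf(x):
    """Area to the left of x under the standard normal density."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def z_test(xbar, mu0, sigma, n, alternative="two-sided"):
    """One-sample z test for a mean when sigma is known.

    Returns the z statistic and the probability value. The names and
    the 'alternative' labels are illustrative choices for this sketch.
    """
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    if alternative == "greater":        # H1: mu > mu0, area to the right of z
        p = normal_cdf(-z)
    elif alternative == "less":         # H1: mu < mu0, area to the left of z
        p = normal_cdf(z)
    elif alternative == "two-sided":    # H1: mu != mu0, both tails beyond |z|
        p = 2.0 * normal_cdf(-abs(z))
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return z, p


# Crab data from Example 6.8: xbar = 25.0 (rounded, as in the text),
# mu0 = 25.4, sigma guessed to be 0.7, n = 8, lower-tail alternative.
z_crab, p_crab = z_test(25.0, 25.4, 0.7, 8, alternative="less")
reject_crab = p_crab < 0.01   # decision at the alpha = 0.01 level

print(f"z = {z_crab:.2f}, P value = {p_crab:.4f}, reject H0: {reject_crab}")
```

Because the sketch carries the unrounded $z$ statistic into the tail-area calculation, its P value is about 0.053, slightly different from the table-based 0.0526 obtained with $z$ rounded to $-1.62$. The decision at the $\alpha = 0.01$ level is the same either way.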
That is, there is not enough evidence in the sample, at the $\alpha = 0.01$ level of significance, to suggest that the mean body temperature of intertidal crabs exposed to air at 25.4 degrees C is, indeed, less than 25.4.

6.3.6 Decision rules in hypothesis testing

SUMMARY: Summarizing everything we have talked about so far, in a hypothesis testing situation, we reject $H_0$ if

1. the test statistic $z$ falls in the rejection region for the test. This is called the rejection region approach to testing.

2. the probability value for the test is smaller than $\alpha$. This is called the probability value approach to testing.

These two decision rules are equivalent.

• With the rejection region approach, we think about the value of the test statistic. If it is "large," then it is an unlikely value under the null hypothesis. Of course, "large" depends on the probability $\alpha$ we have chosen to define "unlikely."

• With the probability value approach, we think directly about the probability of seeing something as weird or weirder than what we saw in our experiment. If this probability is small (i.e., smaller than $\alpha$), then the test statistic we saw was unlikely.

• A large test statistic and a small probability value are equivalent.

• An advantage to working with probability values is that we calculate the probability of seeing what we saw; this is useful for thinking about just how strong the evidence in the data really is.

• Smaller probability values mean "more unlikely." Hence, since we compute the probability value under the assumption that $H_0$ is true, it should be clear that the smaller the probability value is, the more unlikely it is that $H_0$ is true. That is, the smaller the probability value, the more evidence against $H_0$. Large probability values are not evidence against $H_0$.

6.3.7 The general outline of a hypothesis test

SUMMARY: We now summarize the steps in conducting a hypothesis test for a single population mean $\mu$ (when $\sigma$ is known). The same principles that we discuss here can be
generalized to other testing situations.

1. Determine the question of interest. This is the first and foremost issue. No experiment should be conducted unless the scientific questions are well formulated.

2. Express the question of interest in terms of two hypotheses involving the population mean $\mu$.

   • $H_0$: the null hypothesis, the hypothesis of no effect. If $\mu_0$ is a specified value, $H_0: \mu = \mu_0$.

   • $H_1$: the alternative hypothesis, the condition that we suspect or hope is true. Depending on the nature of the problem, $H_1$ will be two-sided, i.e., $H_1: \mu \neq \mu_0$, or one-sided, i.e., $H_1: \mu > \mu_0$ or $H_1: \mu < \mu_0$.

3. Choose the significance level, $\alpha$, to be a small value, like 0.05. The particular situation (i.e., the severity of committing a Type I Error) will dictate its value.

4. Conduct the experiment, collect the data, and calculate the test statistic.

5. Perform the hypothesis test, either rejecting or not rejecting $H_0$ in favor of $H_1$. This can be done two different ways, as we have seen.

   • The rejection region approach: Reject the null hypothesis if the test statistic falls in the rejection region for the test.

   • The probability value approach: Reject the null hypothesis if the probability value for the test is smaller than $\alpha$.

Example 6.9. In a certain region, a forester wants to determine if $\mu$, the mean age of a certain species of tree, is significantly different than 32 years. Using a carbon dating procedure, the following ages were observed for a simple random sample of $n = 16$ trees:

    32.8 25.2 39.7 30.8 30.5 26.7 20.9 17.6
    37.9 28.6 13.8 42.8 36.1 29.4 21.0 18.7

From previous studies, a reliable guess of the population standard deviation is $\sigma = 5$ years. We would like to test, at the $\alpha = 0.10$ level, $H_0: \mu = 32$ versus $H_1: \mu \neq 32$ (a two-sided alternative). To perform this test, I used Minitab; here is the output:

    Test of mu = 32 vs not = 32
    The assumed standard deviation = 5

    Variable   N     Mean   StDev  SE Mean        90% CI               Z      P
    years     16  28.2813  8.4181   1.2500  (26.2252, 30.3373)  -2.98  0.003

CONCLUSION: Because the probability value is small, i.e., smaller
than $\alpha = 0.10$, we have sufficient evidence to reject $H_0$ using our sample results. That is, our data suggest that the mean age of the trees under investigation is different than 32 years. As a side note, observe that the 90 percent confidence interval for $\mu$ does not include 32, the null hypothesized value of $\mu$!

6.3.8 Relationship with confidence intervals

EQUIVALENCE: There is an elegant duality between hypothesis tests and confidence intervals. We have already seen that a confidence interval for a single population mean is based on the probability statement
$$P\left(-z_{\alpha/2} < \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right) = 1 - \alpha.$$
As we have seen, a two-sided hypothesis test is based on a probability statement of the form
$$P\left(\left|\frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}\right| \geq z_{\alpha/2}\right) = \alpha.$$
Comparing the two statements, a little algebra shows that they are actually the same. Thus, choosing a "small" significance level $\alpha$ in a hypothesis test is equivalent to choosing a "large" confidence level $1 - \alpha$ for a confidence interval. Furthermore, with the same choice of $\alpha$, we may

• reject $H_0: \mu = \mu_0$ in favor of the two-sided alternative, at significance level $\alpha$, if the $100(1 - \alpha)$ percent confidence interval does not include $\mu_0$.

Thus, the experimenter can perform a two-sided test at level $\alpha$ just by looking at whether or not the null hypothesized value $\mu_0$ falls in the $100(1 - \alpha)$ percent confidence interval. We can not use our confidence intervals to conduct one-sided tests.

OBSERVATION: In Example 6.9, we see that $\mu_0 = 32$ is not contained in the 90 percent confidence interval for $\mu$. Thus, we may reject $H_0: \mu = 32$ in favor of the two-sided alternative at the $\alpha = 0.10$ level.

6.4 Some general comments on hypothesis testing

Although we have only discussed hypothesis tests for a single population mean $\mu$, the ideas outlined below will hold in many other hypothesis testing situations.

• We do not even begin to collect data until the question of interest has been established. There are good reasons for doing this. For one, it makes the experimenter specify, up front, the goal of the
experiment and research question of interest. Also, if one were to start performing the experiment and then pose the question of interest, preliminary experimental results may sway the researcher into making biased judgements about the real question at hand. For example, if one starts off with a two-sided alternative hypothesis, but changes his mind after seeing some data to use a one-sided alternative, this may introduce bias into the experiment and the data not yet collected.

• Refrain from using statements like "Accept $H_0$" and "$H_0$ is true." These statements can be grossly misleading. If we do reject $H_0$, we are saying that the sample evidence is sufficiently strong to suggest that $H_0$ is probably not true. On the other hand, if we do not reject $H_0$, we do not because the sample does not contain enough evidence against $H_0$. Hypothesis tests are set up so that we assume $H_0$ to be true and then try to refute it based on the experimental data. If we can not refute $H_0$, this doesn't mean that $H_0$ is true, only that we couldn't reject it.

• To illustrate the preceding remark, consider an American courtroom trial analogy. The defendant, before the trial starts, is assumed to be "not guilty." This is the null hypothesis. The alternative hypothesis is that the defendant is "guilty." This can be viewed as a hypothesis test:

    H0: defendant is not guilty  versus  H1: defendant is guilty.

Based on the testimony (data), we make a decision between $H_0$ and $H_1$. If the evidence is overwhelming (beyond a reasonable doubt) against the defendant, we reject $H_0$ in favor of $H_1$ and classify the defendant as "guilty." If the evidence presented in the case is not beyond a reasonable doubt against the defendant, we stick with our original assumption that the defendant is "not guilty." This does not mean that the defendant is innocent! It only means that there wasn't enough evidence to sway our opinion from "not guilty" to "guilty." By the way, what would Type I and Type II Errors be in this analogy?

• The significance
level and rejection region are not cast in stone! The results of hypothesis tests should not be viewed with absolute yes/no interpretation, but rather as guidelines for aiding us in interpreting experimental results and deciding what to do next.

The assumption of normality may be a bad assumption. If the original population distribution is severely non-normal, this could affect the results since the one-sample $z$ statistic is no longer exactly standard normal. As long as the departure from normality is not great, we should be fine.

It has become popular in reports and journals in many applied disciplines (which routinely use these statistical tests) to set $\alpha = 0.05$, regardless of the problem at hand, and then strive rigorously to find a probability value less than 0.05. From a practical standpoint, there probably isn't a lot of difference between a probability value of 0.049 and a probability value of 0.051. Thus, probability values need to be interpreted with these cautionary remarks in mind.

Many statistical packages, such as SAS and Minitab, report only probability values. This way, a test at any significance level $\alpha$ may be performed by comparing the probability value reported in the output to the pre-specified $\alpha$.

Changing the value of $\alpha$ after the experiment has been performed (or during the experiment) is a terrible misuse of statistics. For example, suppose that, before the experiment is performed, the researcher took $\alpha = 0.01$ (conservative), but after getting a probability value of 0.04, the researcher changes to $\alpha = 0.05$ just to get a statistically significant result. Theoretical arguments can show that this practice pretty much negates all information conveyed in the test, and hence, the experiment itself.

A LOOK AHEAD: We have spent a great deal of time learning the mechanics of hypothesis tests for a single population mean $\mu$ (in the situation where $\sigma$ is known). In introducing hypothesis tests for a single population mean $\mu$, we have discussed many ideas
(e.g., significance level, probability values, rejection region, null and alternative hypotheses, one- and two-sided tests, Type I and II errors, philosophies of testing, relationship with confidence intervals, etc.). The basic premise behind these underlying ideas will be the same for the rest of the course, regardless of what population parameters we are interested in. The theory and philosophy we have discussed will carry over to other hypothesis testing situations.

OMISSION: For now, we will skip Section 6.4 in MM.

7 Inference for Distributions

7.1 Introduction

In the last chapter, we learned that hypothesis tests and confidence intervals for the population mean $\mu$, based on an SRS, were constructed using the fact that

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$

You will recall that a $100(1-\alpha)$ percent confidence interval for $\mu$ was given by

$$\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$

and the one-sample $z$ statistic used to test $H_0: \mu = \mu_0$ was given by

$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}.$$

For these procedures to make sense, we required that the population standard deviation $\sigma$ was known.

REALITY: In most applications, $\sigma$ will not be known. Thus, if we want to write a confidence interval or perform a hypothesis test for the population mean $\mu$, we should not use the methods outlined in the previous chapter.

SOLUTION: When $\sigma$ is not known, however, we can do the almost obvious thing; namely, we can use the sample standard deviation $s$ as an estimate for $\sigma$. Recall that

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}.$$

Using $s$ as an estimate of $\sigma$, we can create the following quantity:

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}.$$

However, unlike $z$, $t$ does not have a $N(0, 1)$ distribution! This follows because we are using an estimate $s$ for the population parameter $\sigma$. Put another way, $z$ and $t$ have different sampling distributions. Because of this, we need to become comfortable with a new density curve.

Figure 7.53: The $t_3$ distribution (solid) and the $N(0, 1)$ distribution (dotted).

7.2 One-sample t procedures

BIG RESULT: Suppose that an SRS of size $n$ is drawn from a $N(\mu, \sigma)$ population distribution. Then, the random variable
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

has a $t$ distribution with $n - 1$ degrees of freedom. This density curve is denoted by $t_{n-1}$. Figure 7.53 displays the $t_3$ density curve along with the $N(0, 1)$ density curve.

FACTS ABOUT THE t DISTRIBUTION:
• All $t$ distributions are continuous and symmetric about 0.
• The $t$ family is indexed by a degree of freedom value $k$ (an integer).
• As $k$ increases, the $t_k$ distribution approaches the standard normal distribution. When $k > 30$, the two distributions are nearly identical.
• When compared to the standard normal distribution, the $t$ distribution, in general, is less peaked and has more mass in the tails.
• Table D lists probabilities and percentiles for the $t$ distributions. Of course, statistical software packages should be used in practice.

Figure 7.54: The $t_{11}$ density curve and 95th percentile $t_{11,0.05} = 1.796$ (the shaded upper-tail area is 0.05).

NOTATION: The upper $\alpha$ percentile of a $t_k$ distribution is the value $t_{k,\alpha}$ which satisfies

$$P(t < -t_{k,\alpha}) = P(t > t_{k,\alpha}) = \alpha.$$

In Figure 7.54, we see that the 95th percentile of the $t_{11}$ distribution is $t_{11,0.05} = 1.796$ (see also Table D). By symmetry, the 5th percentile is $-t_{11,0.05} = -1.796$.

QUESTIONS:
• What is the 99th percentile of the $t_{11}$ distribution? The 1st percentile?
• What is the 90th percentile of the $t_{19}$ distribution? The 10th percentile?
• What is the 92nd percentile of the $t_6$ distribution? The 8th percentile?

7.2.1 One-sample t confidence intervals

ONE-SAMPLE t CONFIDENCE INTERVAL: Suppose that an SRS is drawn from a $N(\mu, \sigma)$ population. A $100(1-\alpha)$ percent confidence interval for $\mu$ is given by

$$\left(\bar{x} - t_{n-1,\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1,\alpha/2}\frac{s}{\sqrt{n}}\right).$$

This interval is exact when the population distribution is normal and is approximately correct for large $n$ in other cases.

REMARKS: Comparing the $t$ interval to the $z$ interval, we see that they are identical in form; that is, each interval is of the form $\bar{x} \pm m$, where $m$ denotes the margin of error. For the $t$ interval, the margin of error is

$$m = t_{n-1,\alpha/2}\frac{s}{\sqrt{n}}.$$

For the $z$ interval, the margin of
error was

$$m = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}.$$

REMINDER: When the population standard deviation $\sigma$ is not known, we use the $t$ interval!

TERMINOLOGY: In the $t$ confidence interval, the margin of error can be written as

$$m = \underbrace{t_{n-1,\alpha/2}}_{t \text{ percentile from Table D}} \times \underbrace{\frac{s}{\sqrt{n}}}_{\text{standard error}}.$$

The quantity $s/\sqrt{n}$ is called the standard error of the estimate $\bar{x}$. To see where this comes from, recall that from a random sample of normally distributed data, the statistic

$$\bar{x} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right).$$

Thus, the standard error $s/\sqrt{n}$ is an estimate of the standard deviation of $\bar{x}$. The standard error of an estimate (such as $\bar{x}$) is an important quantity. It summarizes numerically the amount of variation in an estimate. Standard errors of estimates are commonly reported in statistical analysis.

Example 7.1. In an agricultural experiment, a random sample of $n = 10$ plots produces the following yields (measured in kg per plot). The plots were treated identically in the planting, growing, and harvest phases. The goal is to obtain a 95 percent confidence interval for $\mu$, the population mean yield. Here are the sample yields:

23.2 20.1 18.8 19.3 24.6 27.1 33.7 24.7 32.4 17.3

From these data, we compute $\bar{x} = 24.1$ and $s = 5.6$. Also, with $n = 10$, the degrees of freedom is $n - 1 = 9$, and $t_{n-1,\alpha/2} = t_{9,0.025} = 2.262$ (Table D). The 95 percent confidence interval is

$$\left(24.1 - 2.262 \times \frac{5.6}{\sqrt{10}},\ 24.1 + 2.262 \times \frac{5.6}{\sqrt{10}}\right) \quad \text{or} \quad (20.1, 28.1) \text{ kg/plot}.$$

Thus, based on these data, we are 95 percent confident that the population mean yield $\mu$ is between 20.1 and 28.1 kg/plot.

EXERCISE: Using the data in Example 7.1, find a 90 percent confidence interval for $\mu$. Also, find a 99 percent confidence interval.

7.2.2 One-sample t tests

RECALL: In the last chapter, we saw that the one-sample $z$ statistic, used to test $H_0: \mu = \mu_0$, was given by

$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}.$$

As with the one-sample $z$ confidence interval, we required that $\sigma$ be known in advance. When $\sigma$ is not known in advance, we can use a one-sample $t$ test.

ONE-SAMPLE t TEST: Suppose that an SRS is drawn from a $N(\mu, \sigma)$ population, where both $\mu$ and $\sigma$ are unknown. To test $H_0: \mu = \mu_0$, we use the one-sample $t$
statistic

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}.$$

When $H_0$ is true, the one-sample $t$ statistic varies according to a $t_{n-1}$ sampling distribution. Thus, rejection regions are located in the tails of this density curve.

TWO-SIDED REJECTION REGION: The two-sided test $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$ has a rejection region located in both tails of the $t_{n-1}$ distribution. The total area of the rejection region equals $\alpha$ (the significance level for the test). By convention, each tail has area $\alpha/2$.

ONE-SIDED REJECTION REGION: One-sided tests of the form $H_0: \mu = \mu_0$ versus $H_1: \mu > \mu_0$ and $H_0: \mu = \mu_0$ versus $H_1: \mu < \mu_0$ have rejection regions located in only one tail of the $t_{n-1}$ distribution. The test with $H_1: \mu > \mu_0$ has its rejection region in the upper tail; the test with $H_1: \mu < \mu_0$ has its rejection region in the lower tail. The total area of the rejection region equals $\alpha$.

NOTE: Probability values are computed in the same manner as before (in the last chapter). The only difference is that now we are finding areas under the $t_{n-1}$ density curve (instead of finding areas under the $N(0, 1)$ density curve). To be more explicit,
• $H_1: \mu > \mu_0$: Probability value = area to the right of $t$
• $H_1: \mu < \mu_0$: Probability value = area to the left of $t$
• $H_1: \mu \neq \mu_0$: Probability value = twice that of a one-sided test.

Example 7.2. Starting salaries for a school's recent graduates are an important factor when assessing the quality of the program. At a particular college, a random sample of $n = 15$ students was taken; here were the starting salaries (in thousands of dollars) for those students in the sample:

37.1 49.8 36.5 26.3 40.1 18.3 45.2 51.3 39.3 49.9 44.1 64.3 39.3 41.0 56.3

The dean of the college, in an attempt to boost the ratings of his program, claims that "his school's average starting salary exceeds 40,000 dollars." Based on these data, is there evidence to support his claim?

ANALYSIS: Statistically, we are interested in testing $H_0: \mu = 40$ versus $H_1: \mu > 40$, where $\mu$ denotes the population mean starting income for new graduates in this school's program. To
conduct the test, we will use $\alpha = 0.05$. Here is the Minitab output:

  Test of mu = 40 vs > 40

  Variable   N     Mean    StDev  SE Mean     T      P
  salary    15  42.5883  11.3433   2.9288  0.88  0.196

The $t$ statistic provided above is computed by

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{42.59 - 40}{11.34/\sqrt{15}} \approx 0.88.$$

The probability value is computed by finding the area to the right of $t = 0.88$ under the $t_{14}$ density curve (see Figure 7.55). Note that $P = 0.196$ is much larger than $\alpha = 0.05$. This is not a significant result; i.e., we do not have enough evidence to reject $H_0$.

CONCLUSION: At the $\alpha = 0.05$ level, we do not have evidence to conclude that the mean starting income is larger than 40,000 dollars per year. That is, we cannot support the dean's assertion.

Figure 7.55: The $t_{14}$ density curve, test statistic, and probability value for Example 7.2.

Example 7.3. A rental car company is monitoring the drop off/pick up times at the Kansas City International Airport. The times listed below (measured in minutes) are the round-trip times to circle the airport (passengers are dropped off and picked up at two terminals during the round trip). Management believes that the average time to circle the airport is about 12.5 minutes. To test this claim, management collects a sample of $n = 20$ times selected at random over a one-week period. Here are the data:

19.7 14.7 19.2 15.1 10.4 12.1 13.1 13.1 19.6 33.3
13.7 10.2 10.6 19.1 10.8 21.0 24.0 26.0 12.3 14.4

ANALYSIS: Since it is unknown whether the mean is larger or smaller than 12.5 minutes, management decides to test $H_0: \mu = 12.5$ versus $H_1: \mu \neq 12.5$ at the $\alpha = 0.05$ level, where $\mu$ denotes the population mean waiting time. Here is the Minitab output:

  Test of mu = 12.5 vs not = 12.5

  Variable   N     Mean   StDev  SE Mean              95% CI     T      P
  time      20  16.6291  6.0840   1.3604  (13.7816, 19.4765)  3.04  0.007

Figure 7.56: The $t_{19}$ density curve and test statistic for Example 7.3. The probability value (shaded region) is $P = 0.007$, twice the area to the right of $t = 3.04$.

Note that we can make our decision in different ways:
• The 95 percent confidence interval is (13.78, 19.48) minutes. Note that this does not include the null hypothesized value $\mu_0 = 12.5$ minutes. Thus, we would reject $H_0: \mu = 12.5$ at the five percent level.
• The probability value for the test is $P = 0.007$ (see Figure 7.56). This is much smaller than the significance level $\alpha = 0.05$. Thus, we would reject $H_0: \mu = 12.5$ at the five percent level.
• The rejection region for the test consists of values of $t$ larger than $t_{19,0.025} = 2.093$ and values of $t$ smaller than $-t_{19,0.025} = -2.093$. Since our test statistic $t$ is in the rejection region (i.e., $t$ is larger than 2.093), we would reject $H_0: \mu = 12.5$ at the five percent level.

CONCLUSION: At the five percent level, we have sufficient evidence to conclude that the mean waiting time is different than 12.5 minutes.

Table 7.10: Systolic blood pressure data.

  Subject  Before  After  Difference
     1       120    128       -8
     2       124    131       -7
     3       130    131       -1
     4       118    127       -9
     5       140    132        8
     6       128    125        3
     7       140    141       -1
     8       135    137       -2
     9       126    118        8
    10       130    132       -2
    11       126    129       -3
    12       127    135       -8

7.2.3 Matched-pairs t test

RECALL: In Chapter 3, we talked about an experimental design where each subject received two different treatments (in an order determined at random) and provided a response to each treatment; we called this a matched-pairs design. We now learn how to analyze data from such designs. Succinctly put, matched-pairs designs are analyzed by looking at the differences in response from the two treatments.

Example 7.4. In Example 3.8, we considered an experiment involving middle-aged men; in particular, we wanted to determine whether or not a certain stimulus (e.g., drug, exercise regimen, etc.) produces an effect on the systolic blood pressure (SBP). Table 7.10 contains the before and after SBP readings for the $n = 12$ middle-aged men in the experiment. We would like to see whether these data show a statistically significant effect at the $\alpha = 0.05$ level.

APPROACH: In matched-pairs experiments, to determine whether or not the treatments differ, we use a one-sample $t$ test on the data differences.
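As a quick check of this approach, the sketch below forms the Before minus After differences from Table 7.10 and computes the one-sample t statistic on them. It uses only the Python standard library; the variable names are illustrative, and the critical value 2.201 (upper 0.025 percentile of the t distribution with 11 degrees of freedom) is an assumption taken from a standard t table rather than computed here.

```python
import math
import statistics

# Before and after SBP readings from Table 7.10 (12 subjects).
before = [120, 124, 130, 118, 140, 128, 140, 135, 126, 130, 126, 127]
after = [128, 131, 131, 127, 132, 125, 141, 137, 118, 132, 129, 135]

# Matched pairs: reduce the two columns to a single sample of differences.
diffs = [b - a for b, a in zip(before, after)]

n = len(diffs)
mean_diff = statistics.mean(diffs)    # sample mean of the differences
sd_diff = statistics.stdev(diffs)     # sample standard deviation (n - 1 divisor)
std_error = sd_diff / math.sqrt(n)    # standard error of the mean difference

# One-sample t statistic for H0: mu = 0, where mu is the mean difference.
t_stat = (mean_diff - 0) / std_error

# Two-sided test at alpha = 0.05 with n - 1 = 11 degrees of freedom.
T_CRIT_11_025 = 2.201                 # assumed table value, not computed
reject_h0 = abs(t_stat) > T_CRIT_11_025

print(f"mean = {mean_diff:.4f}, s = {sd_diff:.4f}, SE = {std_error:.4f}")
print(f"t = {t_stat:.2f}, reject H0 at the 5% level: {reject_h0}")
```

Working from the differences is the whole point of the design: once each subject is reduced to a single number, nothing about the calculation differs from an ordinary one-sample t test.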
HYPOTHESIS TEST: To test whether or not there is a difference in the matched-pairs treatment means, we can test $H_0: \mu = 0$ versus $H_1: \mu \neq 0$, where $\mu$ denotes the mean treatment difference. The null hypothesis says that there is no difference between the mean treatment responses, while $H_1$ specifies that there is, in fact, a difference (without regard to direction). Of course, one-sided tests would be carried out in the obvious way. Also, one could easily construct a one-sample $t$ confidence interval for $\mu$. Even though there are two treatments, we carry out a one-sample analysis because we are looking at the data differences.

MINITAB: For Example 7.4, here is the Minitab output:

  Test of mu = 0 vs not = 0

  Variable     N     Mean   StDev  SE Mean             95% CI      T      P
  difference  12  -1.8333  5.8284   1.6825  (-5.5365, 1.8698)  -1.09  0.299

ANALYSIS: We have $n = 12$ data differences (one for each man). From the output, we see that $\bar{x} = -1.8333$ and $s = 5.8284$, so that

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{-1.8333 - 0}{5.8284/\sqrt{12}} = -1.09.$$

The probability value is $P = 0.299$; i.e., the area to the left of $t = -1.09$ plus the area to the right of $t = 1.09$ under the $t_{11}$ density curve. Because this probability is not small (i.e., certainly not smaller than $\alpha = 0.05$), we do not reject $H_0$. See Figure 7.57.

CONFIDENCE INTERVAL: We see that the 95 percent confidence interval for $\mu$, based on this experiment, is $(-5.54, 1.87)$. This interval includes $\mu_0 = 0$.

CONCLUSION: At the five percent level, we do not have evidence to say that the stimulus changes the mean SBP level in this population of middle-aged men.

Figure 7.57: The $t_{11}$ density curve for Example 7.4. The probability value (shaded region) is twice the area to the left of $t = -1.09$. This probability value equals $P = 0.299$.

7.3 Robustness of the t procedures

TERMINOLOGY: A statistical inference procedure is called robust if the probability calculations are not seriously affected by a departure from the assumptions made.

IMPORTANT: The one-sample $t$ procedures are based on the population distribution being normal. However, these procedures are robust
to departures from normality. This means that even if the population distribution (from which we obtain our data) is non-normal, we can still use the $t$ procedures. The text provides the following guidelines:
• $n < 15$: Use $t$ procedures only if the population distribution appears normal and there are no outliers.
• $15 \le n \le 40$: Be careful about using $t$ procedures if there is strong skewness and/or outliers present.
• $n > 40$: $t$ procedures should be fine regardless of the population distribution shape.

NONPARAMETRIC TESTS: When normality is in doubt and the sample size is not large, it might be possible to use a nonparametric testing procedure. The term "nonparametric" means "distribution-free." The nice thing about nonparametric methods is that they do not require distributional assumptions (such as normality). We will not discuss nonparametric tests in this course. For more information, see Chapter 15.

7.4 Two-sample t procedures

7.4.1 Introduction

A COMMON PROBLEM: One of the most common statistical problems is that of comparing two treatment (e.g., treatment versus control) or two stratum means. For example, which of two HIV drugs longer delays the onset of AIDS? Are there salary differences between the two genders among university professors? Which type of football (air-filled versus helium-filled) travels the farthest? Do rats living in a germ-free environment gain more weight than rats living in an unprotected environment? How can we answer these questions?

PROCEDURE: Take two simple random samples of experimental units (e.g., plants, patients, professors, plots, rats, etc.). Each unit in the first sample receives treatment 1; each in the other receives treatment 2. We would like to make a statement about the difference in responses to the treatments based on this setup.

Example 7.5. Suppose that we wish to compare the effects of two concentrations of a toxic agent on weight loss in rats. We select a random sample of rats from the population and then randomly assign each rat to
receive either concentration 1 or concentration 2. The variable of interest is $x$ = weight loss for rat.

Until the rats receive the treatments, we assume them all to have arisen from a common population where $x \sim N(\mu, \sigma)$. Once the treatments are administered, however, the two samples become different. One convenient way to view this is to think of two populations:

  Population 1: individuals under treatment 1
  Population 2: individuals under treatment 2.

That is, populations 1 and 2 may be thought of as the original population with all possible rats treated with treatment 1 and 2, respectively. We may thus regard our samples as being randomly selected from these two populations. Because of the nature of the data, it is further reasonable to think about two random variables $x_1$ and $x_2$, one corresponding to each population, and to think of them as being normally distributed:

  Population 1: $x_1 \sim N(\mu_1, \sigma_1)$
  Population 2: $x_2 \sim N(\mu_2, \sigma_2)$.

NOTATION: Because we are now thinking about two independent populations, we must adjust our notation accordingly so we may talk about the two different random variables and the observations on each of them. Write $x_{ij}$ to denote the $j$th unit receiving the $i$th treatment; that is, the $j$th value observed on the random variable $x_i$. With this definition, we may thus view our data as follows:

  $x_{11}, x_{12}, \ldots, x_{1n_1}$: $n_1$ units in sample from population 1
  $x_{21}, x_{22}, \ldots, x_{2n_2}$: $n_2$ units in sample from population 2

FRAMEWORK: In this framework, we may now cast our question as follows:

  difference in mean response for two treatments $\Longleftrightarrow$ is $\mu_1$ different than $\mu_2$?

More formally, then, we look at the difference $\mu_1 - \mu_2$:

  $\mu_1 - \mu_2 = 0 \Longrightarrow$ there is no difference
  $\mu_1 - \mu_2 \neq 0 \Longrightarrow$ there is a difference.

OBVIOUS STRATEGY: We base our investigation of this population mean difference on the data from the two samples by estimating the difference. In particular, we can compute $\bar{x}_1 - \bar{x}_2$, the difference in the sample means.

THEORETICAL RESULTS: It may be shown mathematically that if both population
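distributions (one for each treatment) are normally distributed, then the random variable

$$\bar{x}_1 - \bar{x}_2 \sim N\left(\mu_1 - \mu_2,\ \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right).$$

That is, the sampling distribution of $\bar{x}_1 - \bar{x}_2$ is normal with mean $\mu_1 - \mu_2$, and the standard deviation of $\bar{x}_1 - \bar{x}_2$ satisfies

$$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}.$$

It also follows, by direct standardization, that

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \sim N(0, 1).$$

IMPLICATIONS: From the last result, we can make the following observations:
• A $100(1-\alpha)$ percent confidence interval for $\mu_1 - \mu_2$ is given by

$$\left((\bar{x}_1 - \bar{x}_2) - z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}},\ (\bar{x}_1 - \bar{x}_2) + z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right).$$

• To test $H_0: \mu_1 - \mu_2 = 0$ versus $H_1: \mu_1 - \mu_2 \neq 0$ at the $\alpha$ significance level, we can use the two-sample $z$ statistic

$$z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}.$$

Note that this test statistic is computed assuming that $H_0$ is true; i.e., $\mu_1 - \mu_2 = 0$. Thus, values of $z > z_{\alpha/2}$ and $z < -z_{\alpha/2}$ lead to rejecting $H_0$ in favor of $H_1$. Of course, one-sided tests could be carried out as well. Probability values could also be used; e.g., areas to the right/left of $z$.

PROBLEM: As in the one-sample case, it will rarely be true that we know the population standard deviations $\sigma_1$ and $\sigma_2$. If we do not know these, then the two-sample confidence interval and two-sample $z$ statistic above are really not useful!

SOLUTION: As in the one-sample setting, we do the almost obvious thing. If $\sigma_1$ and $\sigma_2$ are not known, we estimate them! To estimate $\sigma_1$, we use the sample standard deviation from sample 1; i.e.,

$$s_1 = \sqrt{\frac{1}{n_1 - 1}\sum_{j=1}^{n_1}(x_{1j} - \bar{x}_1)^2},$$

where $\bar{x}_1$ is the sample mean of the data from sample 1. Similarly, to estimate $\sigma_2$, we use the sample standard deviation from sample 2; i.e.,

$$s_2 = \sqrt{\frac{1}{n_2 - 1}\sum_{j=1}^{n_2}(x_{2j} - \bar{x}_2)^2},$$

where $\bar{x}_2$ is the sample mean of the data from sample 2.

IMPORTANT FACT: Suppose that $x_{11}, x_{12}, \ldots, x_{1n_1}$ is an SRS of size $n_1$ from a $N(\mu_1, \sigma_1)$ population, that $x_{21}, x_{22}, \ldots, x_{2n_2}$ is an SRS of size $n_2$ from a $N(\mu_2, \sigma_2)$ population, and that the two samples are independent of each other. Then, the quantity

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

has (approximately) a $t_k$ sampling distribution, where

$$k \approx \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1 - 1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2 - 1}\left(\frac{s_2^2}{n_2}\right)^2}.$$

If $k$ is not an integer in the formula above, we can simply round $k$ to the nearest integer.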
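The degrees-of-freedom formula is tedious by hand but short in code. The following is a minimal Python sketch; the function name `satterthwaite_df` and the summary statistics passed to it are hypothetical, chosen only for illustration and not taken from these notes.

```python
def satterthwaite_df(s1, n1, s2, n2):
    """Approximate degrees of freedom k for the two-sample t statistic.

    s1, s2 are the sample standard deviations; n1, n2 are the sample sizes.
    """
    v1 = s1**2 / n1   # estimated variance of the first sample mean
    v2 = s2**2 / n2   # estimated variance of the second sample mean
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))


# Hypothetical summary statistics, for illustration only.
k = satterthwaite_df(s1=2.0, n1=10, s2=3.0, n2=15)
k_rounded = round(k)   # round to the nearest integer for table lookup

print(f"k = {k:.2f}, rounded to {k_rounded}")
```

The computed value always lies between the smaller of n1 - 1 and n2 - 1 and the pooled value n1 + n2 - 2, which gives a quick sanity check on any hand calculation.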
This expression is sometimes called Satterthwaite's formula for the degrees of freedom. That $t \sim t_k$ is not an exact result; rather, it is an approximation. The approximation is best when the population distributions are normal. For nonnormal populations, this approximation improves with larger sample sizes $n_1$ and $n_2$.

REMARK: The authors of your textbook note that computing $k$ by hand is frightful. In light of this, the authors recommend that instead of computing $k$ via Satterthwaite's formula, you use $k$ = the smaller of $n_1 - 1$ and $n_2 - 1$. I think this is a silly recommendation (although I understand why they suggest it). In practice, software packages are going to compute $k$ anyway, and there is no reason not to use software.

7.4.2 Two-sample t confidence intervals

RECALL: When $\sigma_1$ and $\sigma_2$ were known, we learned that a $100(1-\alpha)$ percent confidence interval for $\mu_1 - \mu_2$ was given by

$$\left((\bar{x}_1 - \bar{x}_2) - z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}},\ (\bar{x}_1 - \bar{x}_2) + z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right).$$

When $\sigma_1$ and $\sigma_2$ are not known, a $100(1-\alpha)$ percent confidence interval for $\mu_1 - \mu_2$ becomes

$$\left((\bar{x}_1 - \bar{x}_2) - t_{k,\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},\ (\bar{x}_1 - \bar{x}_2) + t_{k,\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\right),$$

where the degrees of freedom $k$ is computed using Satterthwaite's formula. This is called a two-sample $t$ confidence interval for $\mu_1 - \mu_2$.

INTERPRETATION: The two-sample $t$ confidence interval gives plausible values for the difference $\mu_1 - \mu_2$.
• If this interval includes 0, then $\mu_1 - \mu_2 = 0$ is a plausible value for the difference, and there is no reason to believe that the means are truly different.
• If this interval does not include 0, then $\mu_1 - \mu_2 = 0$ is not a plausible value for the difference. That is, it looks as though the means are truly different!
• Thus, we can use the confidence interval to make a decision about the difference!

Figure 7.58: Boxplots of pig weight gains (lbs) by ration (A and B) in Example 7.6.

Example 7.6. Agricultural researchers are interested in two types of rations, 1 and 2, being fed to pigs. An experiment was conducted with 24
pigs randomly selected from a specific population; then
• $n_1 = 12$ pigs were randomly assigned to Ration 1, then
• the other $n_2 = 12$ were assigned to Ration 2.

The goal of the experiment was to determine whether there is a mean difference in the weight gains (lbs) for pigs fed the two different rations. Because of this, researchers are interested in $\mu_1 - \mu_2$, the mean difference. The data (weight gains, in lbs) from the experiment are below. Boxplots for the data appear in Figure 7.58.

  Ration 1 (A): 31 34 29 26 32 35 38 34 30 29 32 31
  Ration 2 (B): 26 24 28 29 30 29 32 26 31 29 32 28

Assuming that these data are well modeled by normal distributions, we would like to find a 95 percent confidence interval for $\mu_1 - \mu_2$ based on this experiment.

MINITAB: Here is the output from Minitab for Example 7.6:

  Two-sample T for Ration 1 vs Ration 2

      N   Mean  StDev  SE Mean
  1  12  31.75   3.19     0.92
  2  12  28.67   2.46     0.71

  Difference = mu (1) - mu (2)
  Estimate for difference: 3.08333
  95% CI for difference: (0.65479, 5.51187)
  T-Test of difference = 0 (vs not =): T-Value = 2.65  P-Value = 0.015  DF = 20

INTERPRETATION: From the output, we see that $\bar{x}_1 = 31.75$, $\bar{x}_2 = 28.67$, $s_1 = 3.19$, $s_2 = 2.46$, and $k \approx 20$ (this is Satterthwaite's degrees of freedom approximation). From Table D, we see that $t_{20,0.025} = 2.086$. For right now, we only focus on the 95 percent confidence interval for $\mu_1 - \mu_2$; this interval is

$$(\bar{x}_1 - \bar{x}_2) \pm t_{k,\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \Longrightarrow (31.75 - 28.67) \pm 2.086\sqrt{\frac{(3.19)^2}{12} + \frac{(2.46)^2}{12}},$$

or $(0.65, 5.51)$. That is, we are 95 percent confident that the mean difference $\mu_1 - \mu_2$ is between 0.65 and 5.51 lbs. Indeed, the evidence suggests that the two rations are different with respect to weight gain.

7.4.3 Two-sample t hypothesis tests

GOAL: Based on random samples from normal populations, we would like to develop a hypothesis testing strategy for $H_0: \mu_1 - \mu_2 = 0$ versus $H_1: \mu_1 - \mu_2 \neq 0$, where $\mu_1$ is the mean of population 1 and $\mu_2$ is the mean of population 2. We do not assume that the population standard deviations $\sigma_1$ and $\sigma_2$ are known. One-sided tests may also be of interest; these can be
performed in the usual way.

RECALL: Suppose that $x_{11}, x_{12}, \ldots, x_{1n_1}$ is an SRS of size $n_1$ from a $N(\mu_1, \sigma_1)$ population, that $x_{21}, x_{22}, \ldots, x_{2n_2}$ is an SRS of size $n_2$ from a $N(\mu_2, \sigma_2)$ population, and that the two samples are independent of each other. Then, the quantity

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

has (approximately) a $t_k$ sampling distribution.

NOTE: The two-sample $t$ statistic

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

is computed assuming that $H_0: \mu_1 - \mu_2 = 0$ is true. When $H_0$ is true, this statistic varies (approximately) according to a $t_k$ sampling distribution, where $k$ is computed using Satterthwaite's formula. Thus, we can do hypothesis tests (one- or two-sided tests) by computing $t$ and comparing it to the $t_k$ distribution! Values of $t$ that are "unlikely" lead to the rejection of $H_0$.

Example 7.7. Two identical footballs, one air-filled and one helium-filled, were used outdoors on a windless day at The Ohio State University's athletic complex. Each football was kicked 39 times, and the two footballs were alternated with each kick. The experimenter recorded the distance traveled by each ball. Rather than give you the raw data, we'll look at the summary statistics (sample means and variances). Boxplots appear in Figure 7.59.

  Treatment  $n_i$  $\bar{x}_i$  $s_i^2$
  Air         39      25.92      21.97
  Helium      39      26.38      38.61

Figure 7.59: Ohio State football data (distance, in yards, by ball type: air or helium).

Let $\mu_1$ denote the population mean distance kicked using air, and let $\mu_2$ denote the population mean distance kicked using helium. The belief is that those balls kicked with helium, on average, will travel farther than those balls filled with air, so that $\mu_1 - \mu_2 < 0$ (this is the researcher's hypothesis). Thus, we are interested in testing, at $\alpha = 0.05$, say, $H_0: \mu_1 - \mu_2 = 0$ versus $H_1: \mu_1 - \mu_2 < 0$. The two-sample $t$ test statistic is given by

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{25.92 - 26.38}{\sqrt{\frac{21.97}{39} + \frac{38.61}{39}}} = -0.37.$$

Is this an unusual value? To make this decision, we compare it to a $t_k$ distribution, where

$$k \approx \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1 - 1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2 - 1}\left(\frac{s_2^2}{n_2}\right)^2} = \frac{\left(\frac{21.97}{39} + \frac{38.61}{39}\right)^2}{\frac{1}{38}\left(\frac{21.97}{39}\right)^2 + \frac{1}{38}\left(\frac{38.61}{39}\right)^2} \approx 71.$$
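Both computations for Example 7.7 can be reproduced from the summary statistics alone. This is a sketch using only the Python standard library; the variable names are illustrative.

```python
import math

# Summary statistics from Example 7.7: sample size, mean, variance.
n1, xbar1, var1 = 39, 25.92, 21.97   # air-filled ball
n2, xbar2, var2 = 39, 26.38, 38.61   # helium-filled ball

# Estimated variances of the two sample means.
v1 = var1 / n1
v2 = var2 / n2

# Two-sample t statistic, computed under H0: mu1 - mu2 = 0.
t_stat = (xbar1 - xbar2) / math.sqrt(v1 + v2)

# Satterthwaite's approximation for the degrees of freedom.
k = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(f"t = {t_stat:.2f}")
print(f"k = {k:.2f}, rounded to {round(k)}")
```

Because the table reports variances rather than standard deviations, the values enter the formulas directly without squaring.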
Figure 7.60: The $t_{71}$ density curve, test statistic, and probability value for Example 7.7.

Table D doesn't have a row for $k = 71$ degrees of freedom, but Minitab provides $-t_{71,0.05} = -1.67$. Note that this value is close to $-z_{0.05} = -1.65$ (why is this?).

DECISION: Since $t = -0.37$ is not less than $-t_{71,0.05} = -1.67$, we do not reject $H_0$. That is, there is insufficient evidence to conclude that those balls kicked with helium, on average, will travel farther than those balls filled with air. The probability value for this test is $P = 0.356$, the area to the left of $t = -0.37$ under the $t_{71}$ density curve (see Figure 7.60). We do not have a statistically significant result at any reasonable $\alpha$ level!

ROBUSTNESS AGAIN: Like the one-sample $t$ procedure, the two-sample $t$ procedures are robust as well. That is, our populations need not be exactly normal for the two-sample $t$ confidence intervals and hypothesis tests to provide accurate results, as long as there are no serious departures from normality and the sample sizes are not too small. A good exploratory analysis should always be done beforehand to check whether or not the normality assumption seems reasonable. If it does not, then the results you obtain from these procedures may not be valid.

7.4.4 Two-sample pooled t procedures

REMARK: The two-sample $t$ procedures that we have discussed so far (in Sections 7.4.2 and 7.4.3) are always usable when the underlying distributions are normal (or approximately normal because of robustness). However, it turns out that when the true population standard deviations are equal; i.e., when $\sigma_1 = \sigma_2$, there is a different approach to take when making inferential statements about $\mu_1 - \mu_2$ (in the form of a confidence interval or hypothesis test). This approach is called the pooled approach.

NOTE: There are many instances where it is reasonable to assume that the two populations have a common standard deviation; i.e., $\sigma_1 = \sigma_2 = \sigma$. One interpretation is that the treatment application
affects only the mean of the response, but not the variability. If there is strong empirical evidence that $\sigma_1 \neq \sigma_2$, do not use these pooled procedures! Formal hypothesis tests do exist for testing $\sigma_1 = \sigma_2$, but your text does not recommend using them (see Section 7.3, MM; we'll skip this section). As it turns out, hypothesis tests for $\sigma_1 = \sigma_2$ are not robust to nonnormality; thus, the tests can give unreliable results when population distributions are not exactly normal.

RULE OF THUMB: Do not use the pooled two-sample procedures if the ratio of the larger sample standard deviation to the smaller sample standard deviation exceeds 2.

A POOLED ESTIMATE: If both populations have a common standard deviation, then both populations have a common variance; i.e., $\sigma_1^2 = \sigma_2^2 = \sigma^2$, and both of the sample variances $s_1^2$ and $s_2^2$ are estimating this common value of $\sigma^2$. Hence, in computing an estimate for $\sigma^2$, we will pool observations together and form a weighted average of the two sample variances, $s_1^2$ and $s_2^2$. That is, our estimate of $\sigma^2$ becomes

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}.$$

This is called the pooled estimate of $\sigma^2$, the assumed common variance.

IMPORTANT FACT: Suppose that $x_{11}, x_{12}, \ldots, x_{1n_1}$ is an SRS of size $n_1$ from a $N(\mu_1, \sigma_1)$ population, that $x_{21}, x_{22}, \ldots, x_{2n_2}$ is an SRS of size $n_2$ from a $N(\mu_2, \sigma_2)$ population, and
deviations are truly equal This interval is robust to departures from normality but it is not robust to departures from the equal population standard deviation assumption Example 78 A botanist is interested in comparing the growth response of dwarf pea stems to different levels of the hormone indoleacetic acid 1AA Using 16 day old pea plants the botanist obtains 5 millimeter sections and oats these sections on solutions with different hormone concentrations to observe the effect of the hormone on the growth of the pea stem Let 1 and x2 denote respectively the independent growths that can be attributed to the hormone during the rst 26 hours after sectioning for lO 4 and 10 4 levels of concentration of lAA measured in Previous studies show that both 1 and 2 are approximately normal with equal variances that is we are assuming that a a 02 The botanist would like to determine if there is a difference in means for the two treatments To do this she would like to compute a 90 percent con dence interval for M1 7 2 the difference in the population mean growth response Here are the data from the experiment boxplots for the data are in Figure 761 Treatment 1 081810 0109171014 0912 05 Treatment 2 10 0816 2613112418 251419 2012 PAGE 173 CHAPTER7 STAT70mJTEBBS Growth mm 1 Trt1 Trt2 Figure 761 Borplots of stem growths by treatment MINITAB Here is the Minitab output for the analysis Two sample T for Trt 1 vs Trt 2 N Mean StDev SE Mean Trt 1 11 1027 0494 015 Trt 2 13 1662 0594 016 Difference mu Trt 1 mu Trt 2 Estimate for difference 0634266 9O o CI for difference 1021684 O246847 T Test of difference 0 vs not T 281 P Value 0010 DF 22 Both use Pooled StDev 05507 ANALYSIS From the output7 we see E1 1027E2 1662751 04947 and 52 0594 The degrees of freedom associated with this experiment is 711 712 7 11 13 7 2 22 PAGE 174 CHAPTER 7 STAT 7007 J TEBBS Our estimate of the common variance 02 is given by the pooled estimate 05507 x 03033 n1n272 111372 52 7 n1 715 n2 71s 1004942 1205942 p From 
Table D, we see that $t_{22,0.05} = 1.717$. For right now, we only focus on the 90 percent confidence interval for $\mu_1 - \mu_2$; this interval is

$$(\bar{x}_1 - \bar{x}_2) \pm t_{n_1+n_2-2,\alpha/2} \times s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \implies (1.027 - 1.662) \pm 1.717 \times 0.5507\sqrt{\frac{1}{11} + \frac{1}{13}},$$

or $(-1.022, -0.247)$ mm. That is, we are 90 percent confident that the mean difference $\mu_1 - \mu_2$ is between $-1.022$ and $-0.247$ mm. Since this interval does not include zero, the evidence suggests that the two levels of IAA have a different effect on growth.

POOLED HYPOTHESIS TESTS FOR $\mu_1 - \mu_2$: When the researcher assumes that $\sigma_1 = \sigma_2 = \sigma$, a common population standard deviation, the form of the two-sample $t$ statistic changes as well. We are interested in testing $H_0: \mu_1 - \mu_2 = 0$ versus one- or two-sided alternatives.

RECALL: Suppose that $x_{11}, x_{12}, \ldots, x_{1n_1}$ is an SRS of size $n_1$ from a $\mathcal{N}(\mu_1, \sigma_1)$ population, that $x_{21}, x_{22}, \ldots, x_{2n_2}$ is an SRS of size $n_2$ from a $\mathcal{N}(\mu_2, \sigma_2)$ population, and that the two samples are independent of each other. When $\sigma_1 = \sigma_2$, i.e., the population standard deviations are equal, then the quantity

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

has a $t_{n_1+n_2-2}$ sampling distribution. Recall that $s_p = \sqrt{s_p^2}$, where $s_p^2$ is the pooled sample variance.

NOTE: Observe that when $H_0: \mu_1 - \mu_2 = 0$ is true, the quantity above becomes

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}.$$

This is called the two-sample pooled $t$ statistic. When $H_0$ is true, this statistic varies according to a $t_{n_1+n_2-2}$ sampling distribution. Thus, we can make a decision about $H_0$ by comparing $t$ to this density curve.

Example 7.9. In an experiment that compared a standard fertilizer and a modified fertilizer for tomato plants, a gardener randomized 5 plants to receive the standard fertilizer (Treatment 1) and 6 plants to receive the modified fertilizer (Treatment 2). To minimize variation, the plants were all planted in one row. The gardener's hypothesis was that the modified fertilizer was superior to the standard fertilizer; i.e., that the modified fertilizer produced a higher mean yield. At the $\alpha = 0.05$ significance level, the researcher would like to test

$$H_0: \mu_1 - \mu_2 = 0 \quad \text{versus} \quad H_1: \mu_1 - \mu_2 < 0.$$

We assume that yields arise from two
independent normal populations with a common standard deviation. To test his claim, the following yields were observed in the experiment:

- Standard (1): 26.7, 17.9, 21.1, 15.5, 18.8
- Modified (2): 28.6, 25.7, 29.5, 19.1, 19.7, 24.3

MINITAB: Since we are assuming that the population standard deviations are equal, we use a pooled analysis. Here is the Minitab output for this experiment.

Two-sample T for Fert1 vs Fert2

| | N | Mean | StDev | SE Mean |
|---|---|---|---|---|
| Fert1 | 5 | 20.00 | 4.25 | 1.9 |
| Fert2 | 6 | 24.48 | 4.37 | 1.8 |

- Difference = mu (Fert1) - mu (Fert2)
- Estimate for difference: -4.48333
- T-Test of difference = 0 (vs <): T-Value = -1.72, P-Value = 0.060, DF = 9
- Both use Pooled StDev = 4.3165

Figure 7.62: The $t_9$ density curve, test statistic, and probability value for Example 7.9.

ANALYSIS: From the output, we see $\bar{x}_1 = 20.00$, $\bar{x}_2 = 24.48$, $s_1 = 4.25$, and $s_2 = 4.37$. The degrees of freedom associated with this experiment is $n_1 + n_2 - 2 = 5 + 6 - 2 = 9$. Our estimate of the common variance $\sigma^2$ is given by the pooled estimate

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{4(4.25)^2 + 5(4.37)^2}{5 + 6 - 2} \approx (4.3165)^2 \approx 18.6322.$$

From Table D, we see that $-t_{9,0.05} = -1.833$. The two-sample pooled $t$ statistic is given by

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{20.00 - 24.48}{4.3165\sqrt{\frac{1}{5} + \frac{1}{6}}} \approx -1.72.$$

Note that $t = -1.72$ does not fall in the rejection region, since $t$ is not less than $-t_{9,0.05} = -1.833$; however, it is very close. The probability value is $P = 0.06$; see Figure 7.62.

CONCLUSION: Officially, we do not have a statistically significant result at the $\alpha = 0.05$ level. That is, there is not enough evidence to conclude that the modified fertilizer is superior to the standard fertilizer. There is evidence that the modified fertilizer outperforms the standard, just not enough at the $\alpha = 0.05$ level. I call this a borderline result.

## 8 Inference for Proportions

REMARK: In Chapters 6 and 7, we have discussed inference procedures for means of quantitative variables using one or two samples. We now switch gears and discuss analogous procedures for categorical data. With categorical data, it is common to deal with proportions.

RECALL: We denote
the population proportion by $p$. For example, $p$ might denote the proportion of test plants infected with a certain virus, the proportion of students who graduate in 4 years or less, the proportion of the voting population who favor a certain candidate, or the proportion of individuals with a certain genetic condition. As we will now see, to make probability statements about $p$, we return to our use of the standard normal distribution.

### 8.1 Inference for a single population proportion

RECALL: Suppose that the number of successes $X$ is observed in a binomial investigation; that is, $X \sim \mathcal{B}(n, p)$, and let $\hat{p} = X/n$ denote the sample proportion of successes. In Chapter 5, we argued (via simulation) that the approximate sampling distribution of $\hat{p}$ was normal; in fact, mathematics shows that

$$\hat{p} \sim \mathcal{AN}\left(p, \sqrt{\frac{p(1-p)}{n}}\right)$$

for large $n$ (i.e., for large sample sizes). Recall that the abbreviation $\mathcal{AN}$ stands for approximately normal. For small sample sizes, this approximation may not be adequate. In the light of this result, we know that

$$z = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}} \sim \mathcal{AN}(0, 1)$$

for large sample sizes; that is, $z$ has an approximate standard normal distribution. This provides the mathematical basis for the one-sample $z$ procedures for proportions.

#### 8.1.1 Confidence intervals

WALD INTERVAL: From the last result, we know that there exists a value $z_{\alpha/2}$ such that

$$P\left(-z_{\alpha/2} < \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}} < z_{\alpha/2}\right) \approx 1 - \alpha.$$

For example, if $\alpha = 0.05$ (i.e., 95 percent confidence), then $z_{\alpha/2} = z_{0.025} = 1.96$. If we perform some straightforward algebra on the event in the last probability equation and estimate $\sqrt{p(1-p)/n}$ with the standard error $SE_{\hat{p}} = \sqrt{\hat{p}(1-\hat{p})/n}$, we obtain a $100(1-\alpha)$ percent confidence interval for $p$. This interval is given by

$$\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right).$$

We call this the Wald confidence interval for $p$. You will note that this interval has the same form as all of our other confidence intervals! Namely, the form of the interval is

estimate $\pm$ margin of error.

Here, the estimate is the sample proportion $\hat{p}$, and the margin of error is given by

$$m = z_{\alpha/2} \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = z_{\alpha/2} \times SE_{\hat{p}}.$$

WALD RULES OF THUMB: The Wald interval for $p$ is based on the
approximate sampling distribution result for $\hat{p}$ given above. Remember that this result may not hold in small samples. Thus, your authors have made the following recommendations regarding when to use this interval; namely, use the Wald interval when

- you want to get a 90, 95, or 99 percent confidence interval (or anything in between 90-99 percent), and
- both of the quantities $n\hat{p}$ and $n(1-\hat{p})$ are larger than 15.

Example 8.1. Apple trees in a large orchard were sprayed in order to control moth injuries to apples growing on the trees. A random sample of $n = 1000$ apples was taken from the orchard, 150 of which were injured. Thus, the sample proportion of injured apples was $\hat{p} = 150/1000 = 0.15$, and a 95 percent Wald confidence interval for $p$, the population proportion of injured apples, is

$$\left(0.15 - 1.96\sqrt{\frac{0.15(1-0.15)}{1000}},\ 0.15 + 1.96\sqrt{\frac{0.15(1-0.15)}{1000}}\right),$$

or $(0.128, 0.172)$. That is, we are 95 percent confident that the true proportion of injured apples, $p$, is somewhere between 12.8 and 17.2 percent. We should check the Wald Rules of Thumb in this example:

- this is a 95 percent confidence interval, and
- both of the quantities $n\hat{p} = 1000(0.15) = 150$ and $n(1-\hat{p}) = 1000(0.85) = 850$ are larger than 15.

REMARK: Checking the Wald interval Rules of Thumb is important. Why? If these guidelines are not met, the Wald interval may not be a very good interval! Extensive simulation studies have shown that the Wald interval can be very anticonservative; that is, the interval produces a true confidence level much less than what you think you're getting. In fact, it is not uncommon for a "95 percent" confidence interval to have a true confidence level around 0.85-0.90 or even lower! This is why your authors have proposed the Rules of Thumb for checking the Wald interval's accuracy and appropriateness. Summarizing, the Wald interval can perform very poorly when sample sizes are small. This has prompted statisticians to consider other confidence interval expressions for $p$. The one we present now is a slight modification of the Wald
interval, and it is clearly superior, especially in small sample size situations.

AGRESTI-COULL INTERVAL: The form of the Agresti-Coull (AC) interval is much like that of the Wald interval, except that a slight adjustment to the sample proportion is made. The Wald interval uses

$$\hat{p} = \frac{X}{n},$$

whereas the AC interval uses

$$\tilde{p} = \frac{X + 2}{n + 4}.$$

The estimate $\tilde{p}$ is called the plus-four estimate. Note that this estimate is formed by adding two successes ($X + 2$) and two failures, so that the total sample size is $n + 4$. The resulting confidence interval, then, is computed in the same way as the Wald interval; namely, the interval is given by

$$\left(\tilde{p} - z_{\alpha/2}\sqrt{\frac{\tilde{p}(1-\tilde{p})}{n+4}},\ \tilde{p} + z_{\alpha/2}\sqrt{\frac{\tilde{p}(1-\tilde{p})}{n+4}}\right).$$

This is called a $100(1-\alpha)$ percent Agresti-Coull confidence interval for $p$.

AGRESTI-COULL RULES OF THUMB: Use the AC interval when

- you want to get a 90, 95, or 99 percent confidence interval (or anything in between 90-99 percent), and
- the sample size $n$ is 10 or more.

NOTE: Use the AC interval instead of the Wald interval, especially when sample sizes are small.

Example 8.2. At a local supermarket, the owner has received 32 checks from out-of-state customers. Of these, 2 of the checks bounced (i.e., the customer's bank account had insufficient funds to cover the amount on the check). For insurance purposes, the store owner would like to estimate the proportion of bad checks received monthly using a 90 percent confidence interval. Because the sample size is small, it is best to use the Agresti-Coull interval. Here, the plus-four estimate of $p$ is given by

$$\tilde{p} = \frac{X + 2}{n + 4} = \frac{2 + 2}{32 + 4} = \frac{4}{36} \approx 0.111.$$

Recalling that with 90 percent confidence $z_{\alpha/2} = z_{0.10/2} = z_{0.05} = 1.65$ (Table A), the 90 percent AC interval is given by

$$\left(0.111 - 1.65 \times \sqrt{\frac{0.111(1-0.111)}{36}},\ 0.111 + 1.65 \times \sqrt{\frac{0.111(1-0.111)}{36}}\right),$$

or $(0.025, 0.197)$. Thus, we are 90 percent confident that the true proportion of bad checks is between 2.5 and 19.7 percent.

#### 8.1.2 Hypothesis tests

REMARK: We can also perform hypothesis tests regarding $p$, the population proportion. As with the Wald and AC confidence
intervals, we will use the standard normal distribution to gauge which test statistics are unlikely under $H_0$.

TEST STATISTIC: Suppose that we observe the number of successes $X$ in a binomial investigation; that is, $X \sim \mathcal{B}(n, p)$. To test the hypothesis $H_0: p = p_0$, we compute the one-sample $z$ statistic for proportions

$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}.$$

Note that this test statistic is computed assuming that $H_0: p = p_0$ is true. When $H_0$ is true, $z \sim \mathcal{AN}(0, 1)$; that is, $z$ has an approximate standard normal distribution. Thus, unlikely values of $z$ are located in the tails of this distribution. Of course, the significance level $\alpha$ and the direction of $H_1$ tell us which values of $z$ are unlikely under $H_0$.

- $H_0$ is rejected at level $\alpha$ in favor of $H_1: p > p_0$ when $z > z_{\alpha}$; i.e., $z$ falls in the upper tail.
- $H_0$ is rejected at level $\alpha$ in favor of $H_1: p < p_0$ when $z < -z_{\alpha}$; i.e., $z$ falls in the lower tail.
- $H_0$ is rejected at level $\alpha$ in favor of the two-sided alternative $H_1: p \neq p_0$ when $z > z_{\alpha/2}$ or $z < -z_{\alpha/2}$.

NOTE: Probability values are computed in the same manner. To be more explicit, to test $H_0$ versus $H_1$, the probability values are computed as areas under the standard normal curve; i.e.,

- $H_1: p > p_0$: probability value = area to the right of $z$
- $H_1: p < p_0$: probability value = area to the left of $z$
- $H_1: p \neq p_0$: probability value = twice that of a one-sided test.

RULES OF THUMB: To perform hypothesis tests for $p$, the appropriate Rules of Thumb are that both of the quantities $np_0$ and $n(1-p_0)$ are larger than 10.

Example 8.3. Dimenhydrinate, also known by the trade names Dramamine and Gravol, is an over-the-counter drug used to prevent motion sickness. A random sample of $n = 100$ Navy servicemen was given Dramamine to control seasickness while out at sea. From previous studies, it was estimated that about 25 percent of all men experienced some form of seasickness when not treated. To evaluate the effectiveness of Dramamine, it is desired to test, at $\alpha = 0.05$, $H_0: p = 0.25$ versus $H_1: p < 0.25$, where $p$ denotes the proportion of
men who would experience seasickness when treated with Dramamine. Of the 100 servicemen, 20 of them did, in fact, experience seasickness. Thus, $\hat{p} = 0.20$, and our one-sample $z$ statistic is

$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} = \frac{0.20 - 0.25}{\sqrt{\frac{0.25(1-0.25)}{100}}} \approx -1.15.$$

Is this an unlikely value of the test statistic? We can make our decision using either the rejection region or probability value approach.

- Rejection region approach. Since our test statistic $z = -1.15$ is not less than $-z_{0.05} = -1.65$, we do not reject $H_0$.
- Probability value approach. The probability value is the area to the left of $z = -1.15$; this area is 0.1251; see Figure 8.63. Since the probability value is not smaller than $\alpha = 0.05$, we do not reject $H_0$.

CONCLUSION: At the five percent significance level, there is not enough evidence to support the hypothesis that Dramamine reduces the rate of seasickness.

Figure 8.63: The standard normal density curve, test statistic, and probability value ($P = 0.1251$) for Example 8.3.

#### 8.1.3 Sample size determinations

RECALL: In Chapter 6, we discussed a sample size formula appropriate for estimating a population mean using a $100(1-\alpha)$ percent confidence interval with a certain prescribed margin of error. We now discuss the analogous problem with proportions. We will focus explicitly on the Wald interval.

CHOOSING A SAMPLE SIZE: To determine an appropriate sample size for estimating $p$, we need to specify the margin of error

$$m = z_{\alpha/2} \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.$$

However, you will note that $m$ depends on $\hat{p}$, which, in turn, depends on $n$. This is a small problem, but we can overcome it by replacing $\hat{p}$ with $p^*$, a guess for the value of $p$. Doing this, the last expression becomes

$$m = z_{\alpha/2} \times \sqrt{\frac{p^*(1-p^*)}{n}}.$$

Solving this last equation for $n$, we get

$$n = \left(\frac{z_{\alpha/2}}{m}\right)^2 p^*(1-p^*).$$

This is the desired sample size to find a $100(1-\alpha)$ percent confidence interval for $p$ with a prescribed margin of error equal to $m$.

CONSERVATIVE APPROACH: If there is no sensible guess for $p$ available, use $p^* = 0.5$.
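The sample size formula above is easy to check numerically. The following is a minimal Python sketch, not part of the original notes; the function name `wald_sample_size` and its parameter names are illustrative assumptions. It evaluates $n = (z_{\alpha/2}/m)^2\, p^*(1-p^*)$ and rounds up, since a fractional sample size cannot be collected.

```python
import math

def wald_sample_size(z, m, p_guess=0.5):
    """Smallest n whose Wald margin of error is at most m.

    z       -- critical value z_{alpha/2} (e.g., 1.96 for 95 percent confidence)
    m       -- desired margin of error
    p_guess -- guess for p (hypothetical name); 0.5 is the conservative default
    """
    n_exact = (z / m) ** 2 * p_guess * (1.0 - p_guess)
    return math.ceil(n_exact)  # round up to a whole number of observations

# With z = 1.96 and m = 0.03, compare a guess of 0.4 with the conservative 0.5.
n_guess = wald_sample_size(1.96, 0.03, p_guess=0.4)
n_conservative = wald_sample_size(1.96, 0.03)
print(n_guess, n_conservative)
```

The default `p_guess=0.5` corresponds to the case where no sensible guess for $p$ is available.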
In this situation, the resulting value for $n$ will be as large as possible. Put another way, using $p^* = 0.5$ gives the most conservative solution (i.e., the largest sample size).

Example 8.4. In a Phase II clinical trial, it is posited that the proportion of patients responding to a certain drug is $p^* = 0.4$. To engage in a larger Phase III trial, the researchers would like to know how many patients they should recruit into the study. Their resulting 95 percent confidence interval for $p$, the true population proportion of patients responding to the drug, should have a margin of error no greater than $m = 0.03$. What sample size do they need for the Phase III trial?

SOLUTION: Here, we have $m = 0.03$, $p^* = 0.4$, and $z_{\alpha/2} = z_{0.05/2} = 1.96$ (Table A). The desired sample size is

$$n = \left(\frac{z_{\alpha/2}}{m}\right)^2 p^*(1-p^*) = \left(\frac{1.96}{0.03}\right)^2 0.4(1-0.4) \approx 1024.4,$$

which we round up to 1025. Thus, their Phase III trial should recruit 1025 patients.

### 8.2 Comparing two proportions

COMPARING PROPORTIONS: Analogously to the problem of comparing two population means (Chapter 7, Section 7.3), we would also like to compare two population proportions, say, $p_1$ and $p_2$. For example, do male and female voters differ on their support of a political candidate? Is the response rate higher for a new drug than the control drug? Is the proportion of nonresponse different for two credit card mailings? We can use the following methods to answer these types of questions.

FRAMING THE PROBLEM: Our conceptualization of the problem is similar to that for comparing two means. We now have two independent binomial investigations:

$$X_1 \sim \mathcal{B}(n_1, p_1), \qquad X_2 \sim \mathcal{B}(n_2, p_2).$$

From the investigations, our two sample proportions are $\hat{p}_1 = X_1/n_1$ and $\hat{p}_2 = X_2/n_2$. Clearly, the problem involves the difference of these proportions; i.e.,

$$D = \hat{p}_1 - \hat{p}_2.$$

If the true population proportions $p_1$ and $p_2$ were, in fact, equal, then we would expect to see values of $\hat{p}_1 - \hat{p}_2$ close to zero. We need to know how this statistic varies in repeated sampling.

#### 8.2.1 Confidence intervals

MATHEMATICAL RESULTS: Choose independent SRSs from two populations with proportions
$p_1$ and $p_2$, respectively. The estimator

$$\hat{p}_1 - \hat{p}_2 \sim \mathcal{AN}\left(p_1 - p_2,\ \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\right).$$

That is, the sampling distribution of $\hat{p}_1 - \hat{p}_2$ is approximately normal with mean $p_1 - p_2$ and standard deviation

$$\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}.$$

Replacing $p_1$ and $p_2$ with their estimates $\hat{p}_1$ and $\hat{p}_2$, respectively, in the last expression gives the standard error of $D = \hat{p}_1 - \hat{p}_2$. That is,

$$SE_D = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}.$$

TWO-SAMPLE WALD INTERVAL: An approximate $100(1-\alpha)$ percent confidence interval for $p_1 - p_2$, based on two independent random samples, is given by

$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \times SE_D.$$

We call this the two-sample Wald interval for $p_1 - p_2$. Note that this interval has the same form as all of our other confidence intervals! Namely, the form of the interval is

estimate $\pm$ margin of error.

Here, the estimate is $D = \hat{p}_1 - \hat{p}_2$ and the margin of error is given by

$$m = z_{\alpha/2} \times SE_D = z_{\alpha/2} \times \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}.$$

RULES OF THUMB: Much like the one-sample problem considered in Section 8.1, there are guidelines to use with the two-sample Wald interval. Your authors recommend using the Wald interval only when

- you want to get a 90, 95, or 99 percent confidence interval (or anything in between 90-99 percent), and
- the quantities $n_1\hat{p}_1$, $n_1(1-\hat{p}_1)$, $n_2\hat{p}_2$, and $n_2(1-\hat{p}_2)$ are all larger than 10.

Example 8.5. An experimental type of chicken feed, Ration 1, contains an unusually large amount of a certain feed ingredient that enables farmers to raise heavier chickens. However, farmers are warned that the new feed may be too strong and that the mortality rate may be higher than that with the usual feed. One farmer wished to compare the mortality rate of chickens fed Ration 1 with the mortality rate of chickens fed the current best-selling feed, Ration 2. Denote by $p_1$ and $p_2$ the population mortality rates (proportions) for Ration 1 and Ration 2, respectively. Researchers would like to get a 95 percent confidence interval for $p_1 - p_2$. Two hundred chickens were randomly assigned to each ration; of those fed Ration
1, 24 died within one week; of those fed Ration 2, 16 died within one week.

- Sample 1: 200 chickens fed Ration 1 $\implies \hat{p}_1 = 24/200 = 0.12$
- Sample 2: 200 chickens fed Ration 2 $\implies \hat{p}_2 = 16/200 = 0.08$.

Thus, the difference $D = \hat{p}_1 - \hat{p}_2 = 0.12 - 0.08 = 0.04$, and an approximate 95 percent Wald confidence interval for the true difference $p_1 - p_2$, based on this experiment, is

$$0.04 \pm 1.96\sqrt{\frac{0.12(0.88)}{200} + \frac{0.08(0.92)}{200}},$$

or $(-0.02, 0.10)$. Thus, we are 95 percent confident that the true difference in mortality rates is between $-0.02$ and $0.10$. Note that this interval includes zero, so we cannot say that the mortality rates are necessarily different for the two rations at the five percent level. It is easy to see that the Rules of Thumb are satisfied in this problem.

AGRESTI-CAFFO INTERVAL: Just as in the one-sample problem (Section 8.1), the Wald interval has some serious flaws when the sample sizes are small. An alternative interval is available for small sample sizes; it is called the Agresti-Caffo interval. The interval is based, again, on adding 2 successes and 2 failures. However, because we have two samples now, we add 1 success and 1 failure to each. To be more specific, we compute

$$\tilde{p}_1 = \frac{X_1 + 1}{n_1 + 2} \quad \text{and} \quad \tilde{p}_2 = \frac{X_2 + 1}{n_2 + 2}.$$

The $100(1-\alpha)$ percent Agresti-Caffo (AC) confidence interval for $p_1 - p_2$ is given by

$$(\tilde{p}_1 - \tilde{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\tilde{p}_1(1-\tilde{p}_1)}{n_1 + 2} + \frac{\tilde{p}_2(1-\tilde{p}_2)}{n_2 + 2}}.$$

Your authors recommend this method when confidence levels are between 90-99 percent and when both sample sizes $n_1$ and $n_2$ are at least 5.

Example 8.6. Who is a better breaker of his opponent's serve: Andre Agassi or Roger Federer? In a recent match between the two tennis stars, Agassi converted 3 of 13 break opportunities and Federer converted 6 of 8. For all of you non-tennis aficionados, a break of serve occurs when one player wins a game from his opponent while the opponent is serving. Denote by $p_1$ the true proportion of breaks for Agassi and $p_2$ the true proportion of breaks for Federer. Our plus-four estimates for $p_1$ and $p_2$ are given by

$$\tilde{p}_1 = \frac{3 + 1}{13 + 2} \approx 0.27 \quad \text{and} \quad \tilde{p}_2 = \frac{6 + 1}{8 + 2} = 0.70.$$

A 95 percent confidence interval for $p_1 - p_2$ is given by

$$(0.27 - 0.70) \pm 1.96\sqrt{\frac{0.27(1-0.27)}{13 + 2} + \frac{0.70(1-0.70)}{8 + 2}},$$

or $(-0.79, -0.07)$. Thus, we are 95 percent confident that the true difference $p_1 - p_2$ is between $-0.79$ and $-0.07$. Because the interval does not include 0, this suggests that there is a difference between the rates at which these two players break in the other's service games.

#### 8.2.2 Hypothesis tests

HYPOTHESIS TEST FOR TWO PROPORTIONS: Analogously to performing a hypothesis test for two means (Section 7.3), we can also compare two proportions using a hypothesis test. To be precise, we would like to test

$$H_0: p_1 - p_2 = 0 \quad \text{versus} \quad H_1: p_1 - p_2 \neq 0.$$

Of course, one-sided tests are available and are conducted in the usual way. Our two-sample $z$ statistic for proportions is given by

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}},$$

where

$$\hat{p} = \frac{X_1 + X_2}{n_1 + n_2}$$

is the overall sample proportion of successes in the two samples. This estimate of $p$ is called the pooled estimate because it combines the information from both samples. When $H_0: p_1 - p_2 = 0$ is true, this statistic has an approximate standard normal distribution. Thus, values of $z$ in the tails of this distribution are considered unlikely.

RULES OF THUMB: This two-sample test should perform adequately as long as the quantities $n_1\hat{p}_1$, $n_1(1-\hat{p}_1)$, $n_2\hat{p}_2$, and $n_2(1-\hat{p}_2)$ are all larger than 5.

Example 8.7. A tribologist is interested in testing the effect of two lubricants used in a manufacturing assembly found in rocket engines. Lubricant 1 is the standard lubricant that is currently used. Lubricant 2 is an experimental lubricant, which is much more expensive. Does Lubricant 2 reduce the proportion of cracked bolts? A random sample of $n = 450$ bolts is used, with 250 bolts receiving Lubricant 1 and the other 200 receiving Lubricant 2. After simulating rocket engine conditions, it is noted what proportion of the bolts have cracked. It is thus desired to test, at the conservative $\alpha = 0.01$ level,

$$H_0: p_1 - p_2 = 0 \quad \text{versus} \quad H_1: p_1 - p_2 > 0,$$
where $p_1$ denotes the proportion of bolts that would crack using Lubricant 1 and $p_2$ denotes the proportion of bolts that would crack using Lubricant 2. After the experiment is over, it is noted that 21 of the bolts cracked using Lubricant 1 ($\hat{p}_1 = 21/250 = 0.084$), and 8 with Lubricant 2 ($\hat{p}_2 = 8/200 = 0.040$). The pooled estimate of $p$ is given by

$$\hat{p} = \frac{X_1 + X_2}{n_1 + n_2} = \frac{21 + 8}{250 + 200} \approx 0.064,$$

and the value of the two-sample $z$ test statistic is

$$z = \frac{0.084 - 0.040}{\sqrt{0.064(1-0.064)\left(\frac{1}{250} + \frac{1}{200}\right)}} \approx 1.89.$$

The probability value for the test, the area to the right of $z = 1.89$ on the standard normal distribution, is 0.0294 (Table A). This is a small probability value! However, it is not small enough to be deemed significant at the $\alpha = 0.01$ level. Alternatively, you could have noted that the critical value here is $z_{0.01} = 2.33$. Our test statistic $z = 1.89$ does not exceed this value.

CONCLUSION: There is some evidence that Lubricant 2 does reduce the proportion of cracked bolts in this assembly, but not enough evidence to be statistically significant at the conservative $\alpha = 0.01$ level.
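The arithmetic in Example 8.7 can be reproduced with a few lines of code. What follows is a minimal Python sketch rather than anything from the original notes; the function names `two_proportion_z` and `upper_tail` are illustrative assumptions, and the normal tail area is computed with the standard library's complementary error function instead of Table A.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-sample z statistic for H0: p1 - p2 = 0."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)  # overall proportion of successes
    se = math.sqrt(p_pooled * (1.0 - p_pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1_hat - p2_hat) / se

def upper_tail(z):
    """Area to the right of z under the standard normal curve."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Example 8.7: 21 of 250 bolts cracked with Lubricant 1, 8 of 200 with Lubricant 2.
z = two_proportion_z(21, 250, 8, 200)
p_value = upper_tail(z)        # H1: p1 - p2 > 0, so the right tail is used
reject_at_01 = p_value < 0.01  # decision at the alpha = 0.01 level
print(round(z, 2), round(p_value, 4), reject_at_01)
```

Because the code carries the unrounded pooled estimate through the calculation, the last digit of the probability value may differ slightly from the table-based figure; the decision not to reject $H_0$ at the 0.01 level is unchanged.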

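For the one-sample intervals of Section 8.1.1, the Wald and Agresti-Coull formulas differ only in the estimate and the effective sample size. The Python sketch below is an illustration under that reading, not part of the original notes; the function names `wald_interval` and `agresti_coull_interval` are assumptions, and the critical values 1.96 and 1.65 are the ones used in Examples 8.1 and 8.2.

```python
import math

def wald_interval(x, n, z):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = x / n
    margin = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def agresti_coull_interval(x, n, z):
    """Plus-four interval: add two successes and two failures first."""
    p_tilde = (x + 2) / (n + 4)
    margin = z * math.sqrt(p_tilde * (1.0 - p_tilde) / (n + 4))
    return p_tilde - margin, p_tilde + margin

# Example 8.1: 150 injured apples out of 1000, 95 percent confidence.
wald_lo, wald_hi = wald_interval(150, 1000, 1.96)
# Example 8.2: 2 bounced checks out of 32, 90 percent confidence.
ac_lo, ac_hi = agresti_coull_interval(2, 32, 1.65)
print(wald_lo, wald_hi)
print(ac_lo, ac_hi)
```

The code keeps the unrounded plus-four estimate $4/36$, whereas the hand calculation in Example 8.2 rounds it to 0.111 first, so the endpoints can differ from the text in the third decimal place.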