# Introductory Business Statistics BUS 221

CWU

This 55 page Class Notes was uploaded by Cecelia Mayert MD on Monday October 5, 2015. The Class Notes belongs to BUS 221 at Central Washington University taught by Staff in Fall. Since its upload, it has received 15 views. For similar materials see /class/218983/bus-221-central-washington-university in Business at Central Washington University.

Date Created: 10/05/15

Bus 221 Notes

Legend

- Grey: particularly important summary/review information.
- This material will not be presented in class, as it is largely an outline of the class and students strongly prefer to simply get started with the material in Chapter 1.
- Particularly important information.
- The level of detail here is not necessary to know for the exam, but is rather meant to facilitate a more detailed understanding of the concept at hand.
- Particularly useful for understanding regression analysis and/or the concept of confounding/lurking variables (CLV) at a higher level of sophistication than covered in the class.
- Refers to examples or exercises/problems from the text.
- Refers to graphics, tables, charts, transparencies of pages, etc. from the textbook.

## Introduction to Statistics

1. General definition of statistics: finding the truth through the use of data (numerical values).
   a. For individual decision making
   b. For firm decision making
      i. Should a firm advertise for its product?
   c. For policy makers
   d. For scientists/academics
2. Examples of usefulness
   a. A firm's decision to advertise
      i. To whom?
      ii. How much?
      iii. Where?
   b. A firm trying to decide whether to close a plant and open another in its place, or to shift production to another facility
      i. How well is the plant operating relative to others within the company?
3. This class will focus on one particular statistic: the mean.
   a. The mean of a certain variable
      i. For example, the mean height of a CWU student
   b. The mean effect of one variable on another
      i. For example, the mean effect of education on earnings
4. Essentially, this class is about assessing how accurately an estimate of the mean calculated from a sample reflects the true mean in the full population of interest.
   a. For example, suppose you were interested in determining the mean effect of advertising on purchase decisions in a certain region in which you are interested in potentially advertising. Say you took a small sample of people from your region of interest and then used them to
conduct an experiment on the mean effect of advertising on the purchase of your product.
   b. Getting information about the true mean effect in the entire population of your region is not as simple as calculating the mean effect on the small, randomly selected sample for which you conduct your experiment. Why? Because people are different. More specifically, other factors which will affect how people respond to advertising (for example, age) are not constant across individuals. Therefore, the effect on one individual may not be representative of the mean effect for the entire region.
   c. The accuracy of the mean effect calculated in a randomly selected sample who took part in your experiment will be heavily influenced by the following two factors:
      i. The variation of the effect in the population. That is, the variation in the effect across all people in the underlying population (the region of interest).
         1. Say you had two regions with the same mean effect in the population but different spreads of the individual effect. In which region would your measured average effect be more accurate?
      ii. The size of the sample.
         1. Suppose you used a sample of 1 person in your experiment in a region with a large spread in the individual effect of advertising on purchase decisions. How accurate would your measured average effect be? How could you make the results of your experiment more accurate?
   d. So the accuracy of your measured average effect (the quality of your estimate obtained from your experiment on the effect of advertising on purchase decisions) is determined by the sample size in your experiment and the spread of the data in the underlying population for whom you are trying to determine the true effect (the region in which you are interested).
      i. The bigger the sample, the more accurate the estimate.
      ii. The smaller the spread in the individual effects, the more accurate the estimate.
         1. With a matched pairs design and the use of a placebo, calculating $\sigma$ is straightforward.
         2. With a regression or an
experiment using treatment and control groups, calculating $\sigma$ can be complicated.
   e. Eventually we will get to the point where we can make claims about the range the true mean is likely to fall within, and perform tests on hypotheses we make about the true value of the mean.
      i. Confidence intervals
         1. In which region would your confidence interval be smaller?
      ii. Hypothesis tests
         1. In which region would you be more likely to reject the null hypothesis of a zero effect of advertising on purchase decisions, assuming the true mean effect in the population was positive?
5. In the process we will discuss the following statistical concepts:
   a. Data and their distributions
      i. What is a distribution?
      ii. Picturing distributions
      iii. The Normal distribution
         1. Relevance: confidence intervals and hypothesis tests only work when the data are normally distributed or close to it.
      iv. The mean and variance
         1. Relevance: we will usually be estimating the mean. Meanwhile, the variance will be used to construct confidence intervals and perform hypothesis tests.
   b. Relationships between data: estimating the effect of one variable on another
      i. Correlation
         1. Correlation measures how linear (like a straight line) the association between two variables is when we plot one against the other.
         2. Relevance: correlation is used in regression analysis, which we discuss next.
      ii. Regression
         1. Regressions are one technique for measuring (estimating) the mean effect of one variable on another.
         2. Note that experiments are more accurate ways of estimating the mean effect of one variable on another.
   c. Producing data
      i. Experiments
         1. Experiments allow us to estimate the mean effect of one variable on another.
      ii. Sampling
         1. Samples are subsets of the population of interest.
         2. Relevance (calculating mean effects from experiments): before conducting an experiment, we first must collect a sample on which to perform the experiment. We also must divide the sample into separate subsamples. This must be done correctly for the experiment to be valid in terms of
measuring what you want it to.
         3. Relevance (calculating the mean of a variable): in order for the mean calculated from a large sample to accurately reflect the true underlying mean of the population of interest, the sample must be constructed properly.
   d. Inference
      i. Confidence intervals: estimating the population mean from sample data
         1. For example, the mean effect of one variable on another
            a. The mean effect of college on earnings
         2. For example, the mean of a certain variable
            a. The mean height of a CWU student
      ii. Testing hypotheses about population means using sample data
         1. For example, testing whether the mean effect of one variable on another is above zero
            a. Testing whether the mean effect of college on earnings is above zero
         2. For example, testing whether the mean height of a CWU student is 68 inches

## Chapter 1: Picturing Distributions with Graphs

1. Individuals, Variables, Observations & Distributions
   a. Example
      ii. Individuals: make & model
      iii. Variables: vehicle type, transmission type, number of cylinders, city mpg, highway mpg
   b. Sample: a subset of a population, where the population is the full group from which we take a sample.
      i. Although the discussion in this chapter is focused on samples, it also pertains to the full population. We focus the discussion on samples because we typically do not have a full population.
      ii. The latter part of this course is dedicated to finding ways to describe a population using information from a sample.
   c. Individuals: the persons or objects on which data (information) is collected.
   d. Variables: the categories of data (information) you collect from the individuals. For example, the height (variable 1) and weight (variable 2) of the students in our class (the individuals).
      i. Alternatively, a characteristic of an individual.
      ii. Quantitative variables: the data is recorded as a unit of measurement.
      iii. Categorical variables: the data is not recorded as a unit of measurement and instead is recorded as a category, e.g. major, town, age group.
   e. Observation: the value that a
variable takes for a particular individual.
      i. In the example below, it would be the specific height (the variable) for a specific member of the class (the individual).
      ii. The term "observations" will frequently be used as shorthand for the phrase "observations of the variable for the individuals".
   f. Data: any collection of observations.
      i. Examples include:
         1. a single data point (the observation of one variable for one individual);
         2. the observations on one variable for all individuals;
         3. the entire data set, including observations on all variables for all individuals.
      ii. "Data" will often be used as an alternative to the term "observations".
2. Distribution of a variable for a sample: it tells us what proportion (or percent, or number) of the data (of the observations on the variable) fall into different ranges. That is, it tells us how the data are distributed across the different ranges.
   c. Distributions in tabular form
   d. Distributions in a more visual (graphical) form
      i. The histogram
         1. Number of observations
         2. Percent of observations
      ii. The pie chart
         1. Percent of observations only

Visually representing distributions of categorical variables: pie charts and bar graphs
   e. Pie chart: the size of the wedge represents the percent of individuals that fit into a certain category in the sample.
   f. Bar chart: the height of each bar represents the percent or count (number) of individuals that fit into a given category in the sample.
      ii. The bars are drawn with a space between them.
      iii. The height of each bar can represent percent or count.

Visually representing distributions of quantitative variables: histograms & stemplots
   g. Histogram: very similar to a bar chart. Data are placed into one of a number of categories (usually equal sized), and then the height of each bar represents the percent or count (number) of individuals that fit into a certain category in the sample.
      iv. Histograms with more classes show more detail but may have a less clear pattern.
      v. The categories are ranges for the value of the variable rather than being
different types.
      vi. The bars in a histogram do not have any space between them.
   h. Interpreting histograms
      i. Overall pattern
         1. Shape
            a. Skewed
            b. Symmetric
         2. Center
            a. The point where half the observations lie to the left and half lie to the right
         3. Spread
      ii. Deviations from the overall pattern
         1. Outlier
   i. Stemplots: very similar to a histogram (a stemplot looks like a sideways histogram), but a stemplot reveals the exact numerical value of the sample data in each category.
      i. For each value that your variable takes, the rightmost digit is the leaf and the remaining digits are the stem.
      ii. When making stemplots, we often first re-record the data in terms of the number of (rounded) tens, hundreds, or thousands. We then put the rounded data into the stemplot.
      iii. We will also often split the data into groups before making stemplots.
         1. The data is first rounded to the nearest 10. The data are then split into groups: 0-4.9 tens, 5-9.9 tens, 10-14.9 tens, 15-19.9 tens, 20-24.9 tens, 25-29.9 tens, 30-34.9 tens, and 35-39.9 tens.
4. Visually representing quantitative variables measured over time: time plots
   j. Time series data: sometimes data on a certain variable for an individual or group is collected over time. These types of data are called time series data.
      i. Cross-sectional data: a different type of data than time series data, and the name for the data that we have been discussing up to this point. Cross-sectional data on a certain variable is data collected from a section of individuals at a certain point in time. Bar graphs, pie charts, and histograms plot cross-sectional data.
   k. Time plots: a time plot of a variable graphs the level of the variable on the y-axis against the time period when the variable was collected on the x-axis.
      ii. Cycles: regular or cyclical up-and-down movements in data over time.
      iii. Trends: a long-term movement in one direction over time.

## Chapter 2: Describing Distributions with Numbers

1. Measuring the center of a distribution: mean and median
   a. Mean: $\bar{x} = \frac{1}{n}\sum x_i$
      i. $\bar{x}$: the mean of the variable (the mean of the observations on the
variable in question for the different individuals in our sample), where $x$ represents the variable at which we are looking.
      ii. $n$: the number of individuals in the data.
      iii. $\sum$: the sum.
      iv. $x$: the variable at which we are looking.
      v. $x_i$: the observation on the variable at which we are looking for individual $i$.
         1. The observations for the individuals, otherwise known simply as "the observations".
      vi. $\sum x_i$: the total sum of the observations on the variable for each individual (individual 1 through individual $n$): $\sum x_i = x_1 + x_2 + \dots + x_n$.
      vii. Notation allows us to represent formulas easily.
      viii. We can calculate the mean of a sample or a full population.
   b. Median ($M$): the midpoint of a distribution.
      i. If there is an even number of observations, the median is the mean of the two middle observations.
         1. This table displays the raw data.
            a. This stemplot allows us to find the middle numbers.
            b. Before plotting the data in the stemplot, we will first convert the data to 10ths and then round the data.
         2. We must find the original data point that corresponds with the middle data point in our stemplot.
            i. The original data point will have a different value, since the data in the stemplot was converted to 10ths and then rounded.
         3. We can calculate the median of a sample or a full population.
            a. Calculating the median is one part of this problem.
   c. Comparing the mean and median
      1. In a skewed distribution, the mean is usually further out in the tail of the distribution (the side of the distribution that extends further).
2. Measuring the spread of the distribution: variance & standard deviation
   a. $s^2$ = variance $= \frac{1}{n-1}\sum (x_i - \bar{x})^2$
      i. Here $x_i$ represents the value for the $i$-th individual of the variable at which we are looking.
      ii. This is the variance of a sample. The formula for the variance of a population is a little different.
   b. $s$ = standard deviation $= \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2}$
      i. This is the standard deviation of a sample. The formula for the standard deviation of a population is a little different.
   d. Calculating $s^2$ and $s$ by hand
      i. Make a
column of observations.
      ii. Make a column of deviations, $x_i - \bar{x}$.
      iii. Make a column of squared deviations, $(x_i - \bar{x})^2$.
      iv. Sum the last column and divide by $n - 1$ to calculate $s^2$.
      v. Take the square root of $s^2$ to calculate $s$.
3. Measuring the spread of a distribution: the quartiles, five-number summary & boxplots
   a. Quartiles
      i. First quartile ($Q_1$): the median of the ordered observations to the left of the median.
      ii. Third quartile ($Q_3$): the median of the ordered observations to the right of the median.
   b. Five-number summary: Minimum, $Q_1$, $M$, $Q_3$, Maximum
   c. Boxplot
4. Spotting suspected outliers
   a. Interquartile range: $IQR = Q_3 - Q_1$
   b. Interquartile rule for outliers
      i. An observation is a suspected outlier if it falls more than $1.5 \times IQR$ above the third quartile or below the first quartile.
5. Choosing measures of center & spread
   a. The five-number summary is preferred with a skewed distribution or with strong outliers, while $\bar{x}$ and $s$ are convenient for somewhat symmetric distributions.
6. Organizing a statistical problem
   a. State: what is the practical question, in the context of the real-world setting?
   b. Formulate: what specific statistical operations does this problem call for?
   c. Solve: make the graphs and carry out the calculations needed for this problem.
   d. Conclude: give your practical conclusion in the setting of the real-world problem.

## Chapter 3: The Normal Distributions

1. Density curves
   a. A density curve is another way of representing a distribution for a quantitative variable. A density curve is just a line (a continuous function).
      i. In many ways, a density curve does a better job of representing a distribution than a histogram.
         1. How does a histogram meet the definition of a distribution?
         2. Does a density curve also represent the definition of a distribution?
         3. Which does a better job of giving information on the proportion of data that fall within different ranges? That is, which one allows us to calculate the proportion for a greater number of ranges?
      iii. For a density curve, the y-axis represents proportion, which is just percent
/ 100.
         1. For a histogram, the y-axis can represent the count (number of observations), the percent, or the proportion.
      iv. For a density curve, the x-axis represents different possible values for the variable at which we are looking.
         1. Here $x$ will represent the variable at which we are looking.
         2. For a histogram, the x-axis also represents different possible values (or ranges) for the variable at which we are looking.
      v. The proportion of data falling between two values is now represented by the area under the curve and above the x-axis.
         1. For a histogram, the proportion of data falling between two values is represented by the height of the bar.
      vi. The total area under the curve is 1.
         1. For a histogram (with proportion on the y-axis), the summed heights of the bars is 1.
      vii. The proportion for any single value is 0, because the area between any single value and itself is just 0.
   b. Note: contrary to all of the histograms we have been looking at, which show the distribution of a sample, a density curve will typically be used to estimate the distribution of a full population.
   c. Questions
      i. How do we calculate the proportion of observations between two values?
         1. How would we do it for a histogram?
      ii. How do we calculate the proportion of observations below/above a certain level?
         1. How would we do it for a histogram?
      iii. What is the height of a horizontal density curve over the region [0, 1]? How about over the region [0, 2]?
2. Describing density curves
   a. Median: the equal-areas point
   b. Mean: the balance point
      ii. Remember, the mean is pulled away from the median toward the long tail.
3. Normal distributions (Normal density curves)
   a. A Normal distribution is a symmetric, bell-shaped density curve described by the following equation:
      i. $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$ for $-\infty < x < \infty$
      ii. You do not need to know this equation for the exam.
      iii. This is an equation for a whole family of normal density curves, in the same way that $f(x) = mx + b$ is an equation for a whole family of straight lines.
         1. This might look more familiar as $y = mx + b$.
      iv. I will use the terms
"density curve" and "distribution" interchangeably.
   b. There is a whole family of Normal density curves. In fact, there are infinitely many normal density curves. By altering $\mu$ and $\sigma$, the density curve can be made to take a wide variety of shapes. However, it will always be symmetric and bell shaped.
   c. It turns out that $\mu$ is the mean of the distribution and $\sigma$ is the standard deviation of the distribution.
   d. $N(\mu, \sigma)$ is notation for a Normal distribution with mean $\mu$ and standard deviation $\sigma$.
      i. For example, $N(0, 1)$ would refer to a normal distribution with a mean of 0 and a standard deviation of 1.
      ii. $\mu$, the mean, determines the location of the distribution on the x-axis.
         1. Note that the mean and median are located at the center of a Normal distribution, at the same point.
         2. Later we will use $\mu$ to describe the mean of the population, which is the full set of data from which a sample is taken. $\bar{x}$ is the mean of a sample.
      iii. $\sigma$, the standard deviation, determines the spread of the distribution.
         3. The change-of-curvature points are located at distance $\sigma$ on either side of the mean.
         4. Does it make sense that a more spread-out Normal density curve would have a greater standard deviation $\sigma$? Remember, the analogy is a histogram with a greater spread.
         5. Later we will use $\sigma$ to describe the standard deviation of the population, which is the full set of data from which a sample is taken. $s$ is the standard deviation of the sample.
   e. Note: contrary to all of the histograms we have been looking at, which show the distribution of a sample, the Normal distribution will be used to estimate the distribution of a full population.
4. Motivation for using the Normal distribution (1)
   a. Sample versus population
   b. An important objective of this course is to utilize data from a sample to say something about the population.
      i. For example, we can calculate the proportion of the population that have a value for the variable that falls above a certain value, below a certain value, or between two values.
      ii. For example, we can estimate a confidence interval for the population mean.
      iii. For
example, we can test whether the population mean is a certain value.
   c. These procedures can be done using a Normal distribution, although sometimes it is first necessary to assume that the variable is Normally distributed.
   d. Since a Normal distribution closely approximates the distribution of many real-world data and many chance outcomes, this assumption is often reasonable. Remember, sometimes the assumption is not even necessary.
      i. When the assumption is necessary, we will first verify that the assumption is reasonable by checking to see if the histogram has a bell shape.
   e. Before using a Normal distribution to perform these procedures, we first must nail down the mean and standard deviation so that we know which Normal distribution to use.
      i. We will assume that the mean of the Normal distribution ($\mu$ in the above equation) is the same as the mean of the sample.
      ii. We will assume that the standard deviation of the Normal distribution ($\sigma$ in the above equation) is the same as, or closely related to, the standard deviation of the sample.
5. Motivation for using the Normal distribution (2)
   a. First, Normal distributions are good descriptions for some distributions of real data.
      i. However, the description is not perfect. It implies that there are some values for $x$ which are extremely high and extremely low.
   b. Second, Normal distributions are good approximations to the results of many kinds of chance outcomes, such as the proportion of heads in many tosses of a coin.
   c. Third, we will see that many statistical inference procedures based on Normal distributions work well for other roughly symmetric distributions.
6. The 68-95-99.7 rule
   a. In a Normal distribution with mean $\mu$ and standard deviation $\sigma$:
      i. Approximately 68% of the observations fall within $\sigma$ of the mean $\mu$.
      ii. Approximately 95% of the observations fall within $2\sigma$ of $\mu$.
      iii. Approximately 99.7% of the observations fall within $3\sigma$ of $\mu$.
      iv. This is analogous to a histogram.
         1. Draw three normal distributions of varying $\mu$ but with the same $\sigma$.
         2. Draw three normal distributions of varying $\sigma$
but with the same $\mu$.
            a. Are your distributions of varying height? Why?
      ix. There are infinitely many normal distributions.
7. Cumulative proportion
   a. The cumulative proportion for a given variable is the proportion of individuals whose observed value is equal to or less than some specified value for the variable in question.
      i. $P(x \le \text{specified value})$
      ii. Where $P$ is the proportion of data less than or equal to the specified value.
   b. The cumulative proportion is just the area below the density curve to the left of the specified value.
   c. The proportion of individuals whose observed value is greater than $x$ for the variable in question is just 1 minus the cumulative proportion of $x$.
8. The standard Normal distribution
   a. The standard normal distribution is a normal distribution with the below properties. That is, it is a distribution for a variable that has the following characteristics:
      i. $\mu = 0$
      ii. $\sigma = 1$
   b. Since the standard deviation is one, any individual's observed value for a variable that has a standard normal distribution tells us exactly how many standard deviations that individual's observed value for the variable is from the mean.
      i. For example, imagine the value 2.5 in the standard normal distribution. Since $\mu = 0$ and $\sigma = 1$, it is implied that the value 2.5 in the standard normal distribution is 2.5 standard deviations from the mean.
   c. We can obtain the cumulative proportion from a calculator or computer (once feeding in $\mu$ and $\sigma$) or from a table.
   d. Conveniently, there is a standard normal table in the back of our book on p. 690-691 that tells us the cumulative proportion for any value (any observed variable value) in a standard normal distribution.
   e. Unfortunately, this table only applies to a standard normal distribution, which is just one of the infinitely many normal distributions. We would need a table corresponding to the particular Normal distribution with which we were dealing (described by the $\mu$ and $\sigma$) in order to get the appropriate proportions for any other Normal distribution. Luckily, by converting the data using a formula we will see shortly, we
will be able to use the standard Normal table to get the appropriate proportions for all Normal distributions (again, after the data has been converted using a formula we will see shortly).
      i. Imagine a variable that has a standard normal distribution, say the mean likeability rating of CWU students, where scores can be negative or positive and where zero represents neither likable nor unlikable. We could then use the standard Normal table to calculate the proportion of students above or below any level of likability, or between any two levels of likability.
9. Using the standard Normal table to find proportions for any Normal distribution
   a. It turns out that the cumulative proportion for a specific $x$ value (observed value) in any and every Normal distribution is determined by the number of standard deviations that specific $x$ value is from the mean.
   b. Therefore, to find the cumulative proportion of an $x$ value (an observed value of a variable) in any and every Normal distribution, we can simply calculate how many standard deviations the $x$ value is from the mean and then look up the number in the standard normal table on p. 690-691.
      i. Remember, since the standard normal has $\mu = 0$ and $\sigma = 1$, every value in a standard normal distribution tells you the number of standard deviations that value is from the mean.
   d. The below formula allows us to determine the number of standard deviations that a specific value falls from the mean in any and every Normal distribution:
      i. $z = \frac{x - \mu}{\sigma}$
      ii. $z$ is called the z-value or z-score. It tells us how far, in standard deviations, a specific $x$ value (a specific observed value of a variable) falls from the mean for a Normally distributed variable.
   g. Summary: finding cumulative proportions for values in any Normal distribution
      i. Convert the $x$ value to a z-value.
      ii. Look up the z-value in Table A (p. 684-5) to find the corresponding proportion.
   h. Finding the proportion of data that lie between two particular values. That is, finding the proportion of data that fall within a certain range, say
between $x_1$ and $x_2$.
      i. Convert the $x$ values ($x_1$ and $x_2$) to z-values.
      ii. Look up the z-values in Table A and use the fact that the total area under the curve is 1 to find the required area under the standard Normal curve.
10. A general approach for finding a cumulative proportion for a certain value in a normal distribution
   a. State the problem
      i. Write down relevant information
      ii. Draw a picture
12. Example of the usefulness of the normal distribution
   a. If you had sample data on a certain variable for a group of individuals, you could calculate proportions you were interested in using the following steps:
      i. Assume (following a quick confirming glance at the histogram for the sample) that the variable is distributed Normally.
      ii. Estimate the mean and standard deviation of the Normal distribution using the sample mean and standard deviation.
      iii. Use the z formula and the standard Normal table to do such things as calculate the proportion of individuals in the full population whose value falls below a certain level of the variable.

## Chapter 4: Scatterplots and Correlation

1. Displaying relationships between data: scatterplots
   a. Up to now we have focused on describing one variable (mean, standard deviation, histogram, distribution, etc.). In the next few chapters we will focus on the relationship between two different variables.
   b. A scatterplot is drawn using observations on two variables from a group of individuals. Each point on the graph represents the following ordered pair for each individual: (observation on first variable for a given individual, observation on second variable for that same individual).
   c. Utilize a table with observations on two variables for a group of individuals to fill in a graph where one variable is represented on the x-axis and the other variable is represented on the y-axis.
2. A scatterplot will often give insight into the relationship between two variables. That is, a scatterplot will often give insight into how two variables are related.
   a. Positive relationship: if a
rise in one variable is associated with a rise in the other.
   b. Negative relationship: if a rise in one variable is associated with a fall in the other.
   c. No relationship: if a rise in one variable is associated with no particular change in the other variable.
3. Interpreting scatterplots
   a. Overall pattern of relationship
      i. Direction of relationship
         1. Positive association
         2. Negative association
         3. Note: if the line is sloped upwards, the relationship is positive. If the line is sloped downwards, the relationship is negative.
      ii. Form of relationship, e.g. linear
      iii. Strength of relationship: strong or weak
         1. The closer the pattern of the data is to a curve or line (that is, the more compact the formation of the data, giving the appearance of a curved or straight line), the stronger the relationship.
   b. Deviations from the pattern
4. Examples
      i. Height plotted against weight
         1. Which is the explanatory variable and which is the response variable?
      ii. Price of house plotted against price of car
         1. Is there another variable which is driving this correlation?
      iii. Shoe size plotted against number of basketball games played
         1. Is there another variable which is driving this correlation?
5. Explanatory and response variables
   a. Explanatory variable (causal or independent variable): a variable which may explain changes in the response variable.
   b. Response variable (dependent variable): a variable which responds to changes in the explanatory variable.
   c. Slide
   d. It is not always obvious which is the explanatory and which is the response variable.
6. Adding categorical variables to scatterplots
   a. This is done by using a different plot color or symbol for individuals in each category.
   b. Doing so can provide information useful in understanding data. For example, it can provide information useful when trying to determine whether the relationship between variables is causal or not.
      i. Graph: white versus AFDC payments
7. Correlation: measuring linear association
   a. $r = \frac{1}{n-1}\sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$
   b. Correlation tells us about the direction and strength
of the LINEAR relationship between two variables. Two variables may be strongly related, but if the relationship is not close to linear, the correlation will have a low absolute value.
      i. $r$ will always be between $-1$ and $1$.
      ii. A positive $r$ implies a positive relationship between the variables, and a negative $r$ implies a negative relationship between the variables.
      iii. The more closely the pattern of the data seen in a scatterplot resembles a straight line, the greater the absolute value of $r$.
8. Facts about correlation
   a. Correlation makes no distinction between explanatory and response variables.
   b. Because $r$ uses the standardized values of the observations, $r$ does not change when we change the units of measurement of $x$, $y$, or both.
   c. Positive $r$ indicates positive association between the variables, and negative $r$ indicates negative association.
   d. The correlation $r$ is always a number between $-1$ and $1$.
   e. Correlation requires that both variables be quantitative, so that it makes sense to do the arithmetic indicated by the formula for $r$.
   f. Correlation measures the strength of only the linear relationship between two variables. Correlation does not describe curved relationships between variables, no matter how strong they are.
   g. Like the mean and standard deviation, the correlation is not resistant: $r$ is strongly affected by a few outlying observations.
   h. Correlation is not a complete summary of two-variable data, even when the relationship between the variables is linear.

## Chapter 5: Regression

1. Introduction: the purpose of a regression is to allow us to more completely (compared to correlation) characterize the relationship between two variables. Put another way, a regression helps us to do the following:
   a. Determine the causal impact of one variable (the x variable) on another (the y variable).
      i. Causal in the sense that, holding all other variables/characteristics constant, increases in the x variable will cause increases in the y variable.
      ii. Note: regressions are often plagued by bias that prevents one
from interpreting the results of the regression as the "causal impact" of one variable on another.
   b. Predict the value of one variable given the value of another variable.
      i. We can always do this, though the prediction will not always be accurate.
   c. Essentially, a regression finds the line that most closely "fits" the data for two variables plotted on a scatterplot. That is, a regression produces the y-intercept and slope of the line which most closely resembles the data, according to a specific mathematical criterion which we will discuss shortly.
   d. Example: the effect of study hours on GPA
      i. Make a table with the raw data and then graph it.
      ii. Plot the data, where again each data point on the graph represents the values of the two variables for a specific individual.
      iii. On the graph, the explanatory variable will be represented on the x axis and the response variable will be represented on the y axis.
      iv. A regression allows us to predict the value of the response variable for any level of the explanatory variable. Start on the x axis at the explanatory variable level and go up to the line; when you hit the line, go left to the y axis. The intersection point with the y axis is the predicted level of the response variable.
2. Regression line
   a. A straight line that describes how a response variable y changes as an explanatory variable x changes.
   b. Regressions are often called linear regressions because they hypothesize and describe a straight-line relationship between two variables.
   c. One variable has a causal effect on another variable when, in a properly done experiment, an increase in the causal variable results in a change in the average level of the other variable.
3. Bivariate Regression in a Nutshell
   a. Begin with a data set containing two variables.
      i. Think of a data table including the following columns: the individuals, Variable 1 (the Y variable), and Variable 2 (the X variable, or predicting variable). Later we will add the following variables to
the data table: Yhat and e (for error).
   b. Assume the following linear relationship between the explanatory variable and the response variable:
      i. $Y = a + bX$
         1. This is the bivariate regression model. It is a linear relationship because it has the general form of a line, $Y = mX + b$: $a$ above corresponds to $b$ here, and $bX$ above corresponds to $mX$ here.
      ii. Later we will discuss the importance of this assumed linear relationship.
      iii. Note that other variables besides the predicting variable may impact Y. Therefore, the assumed linear relationship does not imply that other variables do not impact Y.
         1. More on this below.
   c. a and b are the Y-intercept and slope, respectively, of the line which comes as close as possible to matching the data (the best-fit line) when we plot the predicting variable on the X axis and the forecast variable on the Y axis.
      i. Plotting the data is simply recording, with a series of points on a graph, the combinations of the X variable (the predicting variable) and the Y variable (the forecast variable) for each period for which we have data.
   d. We can estimate a and b using equations which will be presented later in the lecture.
   e. The following equation gives us the predicted value of Y (we'll call it Yhat, written $\hat{Y}$) for each level of X:
      i. $\hat{Y} = a + bX$
      ii. This is just the equation for the line which most closely matches the data.
      iii. We can add the Yhat variable to the above data table, calculate Yhat for each period, and then record it in the Yhat column in the data table.
   f. The error e (also called the residual) is defined as the difference between the actual level of Y for a given level of X and the predicted level of Y (Yhat) for that level of X:
      i. $e = Y - \hat{Y}$
      ii. On the graph, this is just the vertical distance between the predicted level of Y for a given level of X (Yhat, which is the height of the line which most closely matches the data at that level of X, $\hat{Y} = a + bX$) and the actual level of Y for that same level of X.
      iii. We can add the e variable to the above data table, calculate e for each
period (that is, each X, Y combination for each period), and then record it in the e column in the data table.
   g. Note that the equations for a and b that will be presented shortly find the line which most closely resembles the data according to the least-squares criterion: the equations find the a and b for the line which minimizes the sum of squared errors.
      i. $\min \sum e^2 = \sum (Y - a - bX)^2$
4. Residuals: the difference between an observed value of the response variable and the value predicted by the regression line.
   The mean of the least-squares residuals is always zero.
      i. This is one of the conditions for the least-squares minimization.
      ii. See slide for next bullet point.
   e. A residual plot is a scatterplot of the regression residuals against the explanatory variable. Residual plots help us assess how well a regression line fits the data.
5. The least-squares regression line: the line that minimizes the sum of the squared vertical distances (residuals) from each observation to the line.
   c. Though you can often get close to estimating a least-squares line by just eyeballing it, this becomes much more difficult when there are outliers.
6. Predicting y for a given level of x
   a. Example: the effect of study hours on GPA
   b. If we plug a, b, and a particular value for x into the above equation, $\hat{y} = a + bx$, we will get the predicted value of y (yhat) for an individual with that level of x.
      i. Remember, we use the symbol yhat to represent the predicted value of y.
7. Influential observation: an observation that noticeably changes the level of b when removed.
   a. Note that not all outliers are influential.
8. Facts about least-squares regression
   a. The distinction between explanatory and response variables is essential in regression.
      i. If we switch the variables on the x and y axes, we get a different b.
   b. There is a close connection between correlation and the slope of the least-squares line. The equation for b says that along the regression line, a change of one standard deviation in x corresponds to a change of r standard deviations in y.
   c. The
least-squares regression line always passes through the point $(\bar{x}, \bar{y})$ on the graph of y against x.
   d. If there were two sets of data that both had the same regression line, the data set whose scatterplot more closely resembled a line would have the higher $r^2$.
9. Cautions about correlation and regression
   a. Extrapolation: using a regression line to predict the value of a y variable when the level of the x variable is far outside the range used to estimate your regression (that is, to estimate a and b).
   b. The less linear the association between x and y, the less valid the regression results.
      i. This is less true for a multiple regression.
   c. b will not accurately estimate the causal effect of the explanatory variable on the response variable when there is a confounding lurking variable.
      i. One variable has a causal effect on another variable when, in a properly done experiment, an increase in the causal variable results in a change in the average level of the other variable. "Average level" implies the effect does not need to hold for all individuals, but rather need only hold on average (that is, it must hold for at least one individual).
      ii. Even when b is different from zero, implying that there is an association between x and y, it is not necessarily the case that x has a causal impact on y. Typically with a regression we are trying to ascertain the causal impact of x on y, rather than simply the association between x and y.
      iii. Regardless of whether there is a causal impact of x on y or not, b will still enable you to predict y given x.
      iv. Lurking variable: any variable omitted from the analysis (not among the explanatory or response variables).
      v. Confounding Lurking Variable (CLV): a lurking variable is confounding when it causes or is correlated with x and causes or is correlated with y, but is omitted from the analysis. When there is a confounding lurking variable, the regression estimate of b (the impact of x on y) will be biased. Note: in order to be a CLV,
the variable must cause either x or y.
      vi. An example of a potential confounding lurking variable is intelligence, in a study of the impact of education on earnings. Intelligence would be a confounding lurking variable, for example, if it had a causal impact on the amount of education individuals received and if it also had a causal impact on the amount of earnings that individuals received, whatever the education level.
         1. Think about an experiment where a random sample of graduating high school seniors is taken from the population, and half of the sample is financially induced to go to college while the other half is financially induced to abstain from college. Then, 12 years later at age 30, the earnings of the members of each group are collected.
         2. We can plot education on the x axis against earnings on the y axis for all of the participants in the study, some of whom received 12 years of education and some of whom received 16 years of education. Note that we expect the average intelligence levels in both groups (the group receiving 12 years of education and the group receiving 16 years of education) to be roughly the same. Now we can think of the slope of the regression line through the data as an estimate of the "true" effect of education on earnings.
         Now consider collecting observational data as an alternative to experimental data. With observational data we simply collect a sample of 30-year-olds from the population of 30-year-olds. Further assume (1) that more intelligent people get more educated and (2) that more intelligent people earn extra due to their great intelligence. The first assumption implies that those getting 12 years of education are less intelligent than the general population, while those getting 16 years of education are more intelligent than the general population.
         What would the plot of earnings for individuals with education = 12 look like relative to the prior group plotted, who had education = 12? Remember that the experimental group receiving 12
years of education was reflective of the general population in terms of intelligence, while the observational group receiving 12 years of education tended to be less intelligent than the general population.
         What would the plot of earnings for individuals with education = 16 look like relative to the prior group plotted, who had education = 16, assuming higher-intelligence people were more likely to go to college and received greater earnings regardless of education level (greater relative to someone of the same education level)?
            1. Remember that the experimental group receiving 16 years of education was reflective of the general population in terms of intelligence, while the observational group receiving 16 years of education tended to be more intelligent than the general population.
         If we drew a line through the new cluster of points, we would see that the slope was higher than the slope of the prior line, drawn using data from the experiment. The use of observational data caused there to be a confounding lurking variable, and this in turn caused the slope of the line through the data to be an inaccurate measurement (an overestimate in this case) of the effect of education on earnings. Remember, the old slope, using experimental data, measured the true effect of education on earnings. Note that many studies are done using observational data, and this often results in confounding lurking variables.
      iii. An example of a confounding lurking variable biasing the estimate of b is problem 42 in your homework. There, changes in saccharin appear to cause changes in weight. However, it is likely that fat consumption is a confounding lurking variable that causes changes in weight and which is correlated with saccharin consumption. To understand this better, do the following:
      iv. With a confounding lurking variable, the estimated impact of the explanatory variable x on the response variable (measured by the slope of the regression line) will be biased (inaccurate).
      v. In order for there not to be a confounding lurking variable
problem biasing our estimate of b, there should at least be no lurking variable which has a causal impact on both the explanatory and response variable. One way to ascertain whether a confounding lurking variable may be biasing b (or biasing the causal relationship you have made in your mind when you are conducting your own informal analysis of the causal effect of one variable on another) is to ask yourself whether there is a variable that has a causal impact on both the explanatory and response variable. If there is, then b (or the effect you have estimated in your informal analysis) is likely biased (inaccurate).

Chapter 8 Producing Data: Sampling
1. Introduction
   a. Before we can answer any statistical questions and/or perform any statistical analysis (like calculating a regression estimate of b, or calculating the mean for a variable), we need to collect, or have someone else collect, data. This chapter is about the process by which we collect data.
2. Population versus sample
   a. The population in a statistical study is the entire group of individuals about which we want to acquire information.
   b. A sample is a subset of the population from which we actually collect information.
      i. The sample is typically utilized as an alternative to the population because it is too difficult or costly to collect data on the entire population.
   c. Sampling design: the process by which a sample is chosen from the population.
   d. Sample survey: a survey given to a sample that is used to acquire information about a population.
   e. The goal of a sample is to reflect a population so that it can yield accurate information about the population. However, many samples, because of the way in which they are chosen, do not reflect the population.
3. Sample bias: how to sample badly
   b. Convenience sample: a sample selected by taking the members of the population that are easiest to reach.
      i. Convenience samples are often biased samples.
Creating an unbiased sample: simple random samples
   a. Simple Random Sample (SRS): a sample where
each member of the population has an equal probability of being selected and every possible sample (of a given size) has an equal probability of being selected.
   b. A random sample is unbiased. It does not systematically favor any individuals from the population. That is, we expect that a random sample is not biased towards (i.e., does not contain too much of) any particular type of individual from the population.
An SRS is one type of probability sample.
Inference about the population
   a. The purpose of a sample is to give us information about a larger population. The process of drawing conclusions about a population on the basis of sample data is called inference, because we infer information about the population from what we know about the sample.
   b. Unfortunately, it is unlikely that results from a random sample are exactly the same as for the entire population; the sample results will differ somewhat, just by chance.
   c. Properly designed samples avoid systematic bias, but their results are rarely exactly correct and they vary from sample to sample.
   d. One point is worth making now: larger random samples give more accurate results than smaller samples.
Data collection for measuring simple statistics on one variable (for example, the mean approval rating of Congress) versus data collection for studies of the impact of one variable on another
   a. Inference relates to both means of variables and means of effects of one variable on another.
   b. Sometimes we collect data on only one variable to determine various statistics relating to that variable.
   c. Sometimes we collect data on several variables to measure the impact of one variable on another.
8. Potential sources of bias in a sample: cautions about sample surveys
   a. Undercoverage: when the chosen sample excludes a group or groups from the population.
   b. Nonresponse: when an individual chosen for the sample can't be contacted or refuses to participate.
   c. Response bias: examples include the following.
      i. People know that they should take the trouble to vote, for example, so
many who didn't vote in the last election will tell an interviewer that they did.
      ii. The race or sex of the interviewer can influence responses to questions about race relations or attitudes toward feminism.
      iii. Answers to questions that ask respondents to recall past events are often inaccurate because of faulty memory.
   d. Wording of questions: confusing or leading questions can introduce strong bias, and changes in wording can greatly change a survey's outcome.

Chapter 9 Producing Data: Experiments
2. Introduction
   a. The principles in the last chapter pertain both to samples taken to analyze certain variables (for example, the mean height of a CWU student) and to samples taken to analyze the effect of one variable on another (for example, the effect of education on earnings).
   b. The principles in this chapter pertain only to samples taken to analyze the effect of one variable on another (for example, the effect of education on earnings).
Experiments
   a. Vocabulary
      i. Subjects: the individuals studied in an experiment.
      ii. Factors: the explanatory variables in an experiment.
         1. Factors can be combined to make up a treatment.
         2. Different degrees of a factor (for example, different amounts of a drug) can be used to create different treatments.
      iii. Treatment: any specific experimental condition applied to the subjects. If an experiment has several factors, a treatment is a combination of specific values for each factor.
         1. Last chapter, the treatments were "get a factor" or "don't get a factor." This chapter, a treatment can consist of different combinations of factors/nonfactors.
   b. Example: the effects of TV advertising
      iii. We can study the combined effects of several factors simultaneously.
How to experiment badly
   a. Don't randomly select who receives the treatment (or the different treatments). When individuals aren't randomly chosen to receive the treatment, experiments become susceptible to the bias created by confounding lurking variables.
      i. A simple design often yields worthless results because of confounding
with lurking variables.
Randomized comparative experiments: an experiment that uses both comparison of two or more treatments and chance assignment of subjects to treatments.
   a. Control group: the group in an experiment that receives no treatment, or that receives an alternative treatment to which the treatment being analyzed is compared.
   b. Completely randomized experiment: all the subjects are allocated at random among all the treatments.
The logic of randomization in comparative experiments
   a. Random assignment of subjects forms groups that should be similar in all respects before the treatments are applied.
   b. Comparative design ensures that influences other than the experimental treatments operate equally on all groups.
Statistical significance: an observed effect so large that it would rarely occur by chance is called statistically significant.
   If we assign many subjects to each group, the effects of chance will average out, and there will be little difference in the average responses in the two groups unless the treatments themselves cause a difference.
   By selecting samples randomly, we have not taken age out of the analysis; we have simply tried to minimize the possibility that older people are more concentrated in the treatment group compared to the control group.
   This helps us to understand why lurking variables that have a causal impact on the response variable, but which are not confounded with explanatory variables, do not bias our estimate of the impact of the explanatory variable on the response variable.
      1. Intelligence causes a higher test score, but if it does not cause people to take an online course, there is no reason to expect more intelligent people to be in the online class group, and hence no reason to expect our conclusions to be biased due to not factoring intelligence into the analysis. Of course, our estimated impact of the online course on scores may be affected by the random chance that we have intelligent people overrepresented in the online class group. However, the larger the sample, the smaller this random chance becomes.
10. Cautions about experimentation
   a. The logic of a randomized comparative experiment depends on our ability to treat all the subjects identically in every way except for the actual treatments being compared.
      i. Placebo: a placebo is a dummy or fake treatment. Placebos are useful in experiments because individuals given a treatment often show an effect just because of the belief that an effect should occur. To control for this effect, both groups (the treatment and control group) are given a treatment: the treatment group receives the real treatment and the control group receives the placebo.
      ii. Double blind: an experiment is double blind when neither the individuals receiving the treatments nor the scientists analyzing the effects of the treatment are aware of who received the actual treatment and who received the placebo. Double-blind experiments are useful because scientists recording effects will often record an effect if they expect to see an effect.
      iii. Lack of realism: when the subjects, treatments, or setting of an experiment do not realistically duplicate the conditions we really want to study. An unrealistic environment often influences the degree of the effect of a treatment. Example 9.3 showed a 40-minute video to students who knew an experiment was going on. We can't be sure that the results apply to everyday television viewers.
11. Matched pairs and other block designs
   a. Matched pairs design: a matched pairs design compares just two treatments. Choose pairs of subjects that are as closely matched as possible. Use chance to decide which subject in a pair gets the first treatment. The other subject in that pair gets the other treatment. That is, the random assignment of subjects to treatments is done within each matched pair, not for all subjects at once. Sometimes each pair in a matched pairs design consists of just one subject, who gets both treatments one after the other. Each subject serves as his or her own control. The order of the treatments can
influence the subject's response, so we randomize the order for each subject.
   b. Block design: there is nonrandom assignment into blocks or groups (for example, men and women, or Harvey and his pair Michael) before there is random assignment of treatments.
14. Understanding the Confounding Lurking Variable Problem
   One variable has a causal effect on another variable when, in a properly done experiment, an increase in the causal variable results in a change in the average level of the other variable. "Average level" implies the effect does not need to hold for all individuals, but rather need only hold on average (that is, it must hold for at least one individual).
   Definition 1 of CLV: In a scientific study (experimental or observational) of the impact of one variable on another variable, a confounding lurking variable (CLV) is one which (1) directly or indirectly causes, or is directly or indirectly caused by, the response variable and (2) has a direct or indirect causal impact on the explanatory variable.
   Definition 2 of CLV: If there exists a variable with (1) a direct causal impact on the explanatory or response variable and (2) a correlation with the other variable in the study, then there will be a variable (not necessarily the variable you have found) that meets the conditions in the definition of CLV above. That is, whenever there exists a CLV meeting these two criteria, there will be another CLV (call it the true CLV) which meets the definition of a CLV given above.
   As an example, consider the following. Often a confounding lurking variable will have only an indirect causal effect on one or more of the study variables (treatment/explanatory or response). That is, often a confounding lurking variable will affect a second lurking variable that then has a direct effect on the treatment/explanatory or response variable. When this is the case, the second lurking variable will have a causal effect on one of the study variables (the treatment/explanatory or response variable) but only be correlated with the
other study variable. Because the role of this second variable in impacting the study variables is often more obvious, it is often identified as the confounding lurking variable, even though there is another variable which is really driving the confounding. Therefore, when an identified confounding lurking variable is correlated with one of the study variables and has a causal effect on the other study variable, it is implied that there is another variable (the true confounding lurking variable) that has a causal impact (direct and/or indirect) on both variables.
   In order to be confounding, a potential CLV must not only meet the definition of a CLV, but there must also be variation in the CLV. This is a stricter criterion than simply being causal by the definition above (i.e., in the sense of having an impact in an experiment), because it adds the condition that the level of the potential CLV must be different across the two groups. Several factors contribute to determining the size of the bias when a CLV exists.
   When the conditions for a CLV are met, we cannot differentiate the effect of the difference in treatment level across the two groups from the effect of the difference in the characteristic level (confounding lurking variable) across the two groups, and we say that the effects of the treatment are confounded by the effect of the confounding lurking variable (the characteristic).
   When analyzing the effects of a certain treatment, all characteristics must be the same across the treatment and control groups to guarantee accuracy of the results. When the characteristics are not the same, we introduce potential bias into the study. I say "introduce potential bias" because in order for bias to result, the characteristic must meet the definition of a CLV given above. Anytime the characteristics across the treatment and control groups are not the same, the potential for a CLV problem exists.
   Put more generally, when analyzing the effects of a certain treatment, all conditions must be controlled (i.e., be the
same across the treatment and control groups) to guarantee accuracy of the results. When the conditions are not the same (for example, because members of the treatment group have different characteristics than members of the control group), we introduce potential bias into the study. I say "introduce potential bias" because in order for bias to result, the characteristic must meet the definition of a CLV given above. Anytime the conditions are not controlled (i.e., are not the same across the treatment and control groups), the potential for a CLV problem exists.
   In well-done experiments, the experimenter can take efforts to control the conditions across the treatment and control group. For example, the experimenter can take efforts to increase the similarity of characteristics across the treatment and control groups. The experimenter can do this because the experimenter chooses who gets the treatment. The way the experimenter attempts to match the characteristics across the treatment and control groups is by getting a large SRS from the population and then randomly putting people into the treatment or control group. Better control of conditions (characteristics) across the two groups means more similar characteristics across the two groups, which means less likelihood of a CLV.
   Alternatively, with an observational study, the experimenter is much less able to control the conditions across the treatment and control groups, because the subject chooses who receives the treatment. The problem with the subject choosing whether to receive the treatment or not is that oftentimes subjects with certain characteristics will be more likely to choose the treatment. This results in differences in characteristics across the treatment and control groups and, when that characteristic is correlated with or causes changes in the response, in effect guarantees the existence of one or more CLVs.
   A regression is one way of analyzing the results of a scientific study. Regressions can be done with either
observational or experimental data (studies), but are more typically based on observational data. In a regression, there are typically varying degrees of the treatment (amounts of treatment) across individuals included in the sample. Thus, a regression can be interpreted as an observational or experimental study where there are multiple treatment levels, rather than simply level 0 and level 1.
   Even with a well-done experiment with an SRS, there can be a CLV if there are different levels of a certain characteristic across the treatment and control groups. Different characteristics across the treatment and control groups imply at least a correlation between the characteristics and the treatment. Thus, this meets one of the two criteria for the existence of a CLV given in Definition 2 above. If the second criterion is met, a CLV exists.
15. Contexts where a bias exists due to a confounding lurking variable
   a. Regression
      i. Estimates of b from a regression are biased when there is a confounding lurking variable.
      ii. Examples of potential sources of bias include nonresponse of a sample member and/or the use of a convenience sample.
      iii. See 8.1. Here, you can interpret this study as a regression of brain cancer (y variable) on cell phone use (x variable). A potential confounding lurking variable would be stressful jobs, because they may have a causal effect on cell phone use and on brain cancer.
   b. Simple non-regression measurement of the effect of one variable on another variable
      i. The measured effect of one variable on another variable is biased when there is a confounding lurking variable.
      ii. Examples of potential sources of bias include nonresponse of a sample member and/or the use of a convenience sample.
      iii. See 8.1. Here, there may be a confounding lurking variable having a causal impact on both cell phone use (the treatment) and cancer (the response). A potential confounding lurking variable would be stressful jobs, because they may have a causal effect on cell phone use and on brain cancer.
      iv. See 9.8. Here,
there may be a confounding lurking variable having a causal impact on both meter/chart use (the treatment) and change in electricity use (the response). A potential confounding lurking variable would be year, where the year of the study caused usage of the meter/chart (since meters/charts were administered in the study year) while also causing electricity use through its impact on the weather in that year. Alternatively, one can think of weather severity in the year of the study as the potential confounding lurking variable, which causes electricity use in the study year while also being correlated with meter/chart use in the study year.
   c. Simple non-regression comparison of the effects of several different variables on a certain variable
      i. Comparisons of the effects of several different variables on another variable are biased when there is a confounding lurking variable.
      ii. Examples of potential sources of bias include nonresponse of a sample member and/or the use of a convenience sample.
      iii. See 8.29. Here, there may be a confounding lurking variable having a causal impact on the type of anesthesia and on the rate of death after surgery. A potential confounding lurking variable is type of surgery (for example, kidney surgery), which might have a causal impact on both the treatment (since a certain anesthesia may work better at anesthetizing the kidney) and on the measured effect/response (since people with kidney disease might have a high rate of death).
   d. Opinion surveys
      i. Survey results are only biased by, for example, nonresponse or the use of a convenience sample when there is a variable correlated with inclusion in the survey sample which also affects the average response.
         See 8.4. Here, unhappiness with the Social Security system will be correlated with inclusion in the survey sample if people unhappy with SS are more likely to respond to the survey because they are eager to voice their frustration. Since this would also affect the average response, it would cause a bias in the
survey results.
         See 8.4. Here, there may be a confounding lurking variable having a causal impact on inclusion in the sample and on the attitudes towards Social Security. For example, unhappiness with the Social Security system may have a causal effect on an individual's likelihood of responding to the survey, while also having a causal effect on the average measured unhappiness with the system.
         See 8.4. Here, the school might have a more liberal student body. In this case, liberalness would be correlated with inclusion in the survey sample. That is, survey respondents would be more liberal on average. If liberalness also affected attitudes towards Social Security, then the survey results would be biased.
      ii. One could determine the bias by noting the magnitude of b in the regression of "inclusion in survey" on the measured opinion. Within this context, survey opinion results are biased when there is a confounding lurking variable, that is, when there is a variable which has a causal impact on both inclusion in the survey and on the measured opinion.
16. Digression: knowing that a variable is causal does not necessarily imply an understanding of the true underlying process
   a. Example: Education & Income
   b. Example: Heat pumps. Some heat pump valves are defective and don't open. When they don't open, they will explode shortly after the pump is turned on. Imagine an experiment on heat pump valves where a hole is drilled into the side of all of the valves in the treatment group. From the experiment, we find that the treatment group has a reduced incidence of explosions. In this sense we can say that drilling a hole causes a reduction in explosions. However, we don't understand the true underlying nature of the explosion.
17. Digression: knowing that a variable is causal implies an average causal effect, but not necessarily an individual causal effect
   a. Example: Education and Income. Assume that education only results in greater earnings for intelligent people, and people are
either intelligent or not. Next, assume that an experiment shows that all people in the control group (who received no education) had the same low income regardless of intelligence. Meanwhile, among those in the treatment group (who received education), the intelligent people received a high income and the unintelligent people received a low income. While a well-done experiment in this context reveals whether education causes income to increase "on average," it doesn't reveal whether education causes greater income for any individual. To know that, we must know whether the individual is of high or low intelligence. Since intelligence wasn't a variable included in the experiment, the experiment cannot reveal that intelligence is the crucial variable in understanding whether education causes higher income.

Chapter 10: Introducing Probability

1. The idea of probability
  a. We call a phenomenon random if individual outcomes are uncertain but there is nonetheless a regular distribution of outcomes in a large number of repetitions.
    i. Alternatively, the proportion of heads is random. It is uncertain what the number of heads will be.
      1. Nonetheless, there is a distribution over the different possible proportions for a given number of tosses.
      2. Tree diagram: the probability of getting a head is 0.50.
    If the probability of getting a head is 0.5, it is not implied that the proportion of heads in n trials is equal to 0.5.
  d. Probability vs. proportion
    i. Probability and proportion are closely linked. If, for example, an event had a probability in a random sample of PROB, it would imply that the proportion of times that the event occurred in the population was PROB. Similarly, if the proportion of times that the event occurred in the population was PROB, then the probability of the event occurring in a random sample would be PROB.
      1. For example, if the proportion of CWU students below 68 inches in height is 0.50, then if we randomly selected one CWU student there would be a 50 percent chance (0.50 probability) of the
student having a height below 68 inches. Just as we can think of a distribution on x (for example, height) that shows cumulative proportions, we can think of a distribution on x that shows the probability that a randomly selected person will have different heights.

  The law of large numbers: the idea that the proportion approaches the probability as the number of trials increases.
    This helps us to understand why larger samples result in more accurate conclusions from experiments. Think of an experiment on the effect of education on income where ability is an important potential confounding lurking variable. As the sample size increases, the difference in ability level between the treatment group and control group diminishes, because the proportion of individuals at each ability level in the two groups converges to their proportion in the population. This diminishes the likelihood that ability will bias the results of the experiment, because to create a bias ability must be higher in one group than the other; that is, ability must be correlated with the treatment. If the treatment and control groups are the same, the level of ability across sub-samples is the same; that is, ability is not correlated with the treatment, and ability does not influence the measured earnings of one group more than the other.

  Probability distribution: The probability distribution of a random variable X tells us what values X can take (think of the different values as different events) and how to assign probabilities to those values.
    i. Finite populations, like the height of a CWU student or the height of an American, can be approximately normally distributed.
    ii. Processes, like a lab measurement or a performance measurement (the percent of free throws made in a game, or the number of heads in a fixed number of coin flips), can also be normally distributed. Note that often with processes a sample is not chosen from a fixed population. Rather, the outcome of a process that has an infinite population (an infinite number of possibilities) is observed. Therefore distributions
(we will primarily utilize the Normal distribution) can describe both fixed populations, like the height of all CWU students, and processes, like a lab measurement.

2. Probability models
  a. Random phenomenon
  b. Sample space S: The sample space S of a random phenomenon is the set of all possible outcomes.
  c. Event: An event is an outcome or a set of outcomes of a random phenomenon. That is, an event is a subset of the sample space.
  d. Probability model: A probability model is a mathematical description of a random phenomenon consisting of two parts: a sample space S and a way of assigning probabilities to events.
    i. The Normal distribution is an example of a probability model.
  e. Example: rolling two dice
    i. Random phenomenon: the two-number combination resulting from rolling two dice simultaneously.
      1. An alternative random phenomenon would be the additive total from rolling two dice simultaneously. See Example 10.5.
    ii. Sample space S: see slide.
      1. How would the sample space for the alternative random phenomenon be different?
    iii. Event: there are many possible events. We will look at the event "rolling a 5" (a total of 5), which we will call Event1.
      1. If the dice are perfectly balanced, all 36 outcomes in Figure 10.2 will be equally likely. That is, each outcome has a probability of 1/36. Therefore Event1 has probability 4/36.
    iv. Probability model: the sample space described above combined with a way of assigning probabilities to events, described by the following: count up the number of possible ways of achieving the event in your sample space S and then multiply by 1/36.

3. Probability rules (straight from the text)
  a. Rule 1: The probability P(A) of any event A satisfies 0 ≤ P(A) ≤ 1.
  b. Rule 2: If S is the sample space in a probability model, then P(S) = 1.
  c. Rule 3: Two events A and B are disjoint if they have no outcomes in common and so can never occur together.
    i. The addition rule for disjoint events: if A and B are disjoint,
    ii. P(A or B) = P(A) + P(B)
      1. Example of nondisjoint events: rolling a 12 with two dice and having the same number show on both
dice (1-1, 2-2, 3-3, etc.).
  d. Rule 4: For any event A,
    i. P(A does not occur) = 1 − P(A)
    ii. What is the probability of the event "rolling a 5"?
    iii. What is the probability of the event "not rolling a 5," or "rolling a 5 does not occur" (two ways of saying the same thing)?

4. Discrete probability models
  a. Definition: A probability model with a finite sample space is called discrete.
    i. To assign probabilities in a discrete model, list the probabilities of all the individual outcomes. These probabilities must be numbers between 0 and 1 and must have sum 1. The probability of any event is the sum of the probabilities of the outcomes making up the event.

5. Continuous probability models
  The uniform density curve spreads probability evenly between 0 and 1.
    Calculate the probability of being between 0 and 0.5.
  Note: The probability model for a continuous random variable assigns probabilities to intervals of outcomes rather than to individual outcomes. In fact, all continuous probability models assign probability 0 to every individual outcome. Only intervals of values have positive probability.
    1. In the above example, the width of the area under the curve would be zero.
    2. Another way of looking at it is that if the density curve is uniform (probability spread evenly over all possible values), then the probability of getting any particular value is 1/infinity, since there is an infinite number of possibilities for the value.
  Normal distributions are probability models. There is a close connection between a Normal distribution as an idealized description for data and a Normal probability model. Not only does the height of females closely follow the Normal distribution with mean μ = 64 inches and standard deviation σ = 2.7 inches, but if we choose a female at random, call the person's height X, and repeat the random choice very many times, the distribution of values of X is the same Normal distribution, N(64, 2.7).

6. Summary & Example
  a. Examples of random variables
    i. The height of a randomly selected CWU student, where x, the random variable, is the
height of the randomly selected student.
    ii. The mean height of a sample of randomly selected CWU students, where Xbar is the mean height in our sample.
    iii. The mean impact of education on earnings in an experiment on the effect of education on earnings for CWU students.
  Random variable: There are two criteria for a variable to be a random variable.
    1. It is a variable for which individual outcomes are uncertain.
    2. It is a variable which nonetheless has a stable distribution of outcomes in a large number of repetitions.
    Example: The height of a randomly selected CWU student, call it x, is a random variable because (1) it is uncertain and (2) there is a stable distribution which reveals the probability of different possible outcomes.
  Probability: The probability of any outcome of a random phenomenon is the proportion of times the outcome would occur in a very long series of repetitions.
    i. Probability vs. proportion: If, for example, an event had a probability in a random sample of PROB, it would imply that the proportion of times that the event occurred in the population was PROB.
    ii. Example: If we randomly select a CWU student, there is a certain probability that the student's height x will be less than, say, 65 inches. To determine the probability that x is less than 65 inches, first pick a student's name from a hat, record the student's height (again, this is x), and then put the student's name back in the hat. Do this a very large number of times, say 10,000. The proportion of times that a student with a height less than 65 inches was selected approximates the probability that a randomly chosen student will have a height less than 65 inches.
    iii. Law of large numbers: the idea that the proportion approaches the probability as the number of trials increases. In the above example, the proportion of times that a student with a height less than 65 inches was selected will vary depending on the number of times that a student is selected. Above we selected a student 10,000 times. We could have selected a student's name out of the hat only
100 times and then calculated the proportion of students whose height was less than 65 inches. The more times a student is selected, the closer we expect the proportion to be to the true probability.
  Probability distribution: The probability distribution is just the distribution; you can think of it as the histogram for a very large number of repetitions.
    i. In the above example, the histogram representing the values we found for the randomly chosen student's height X in the 10,000 trials is the probability distribution.

Chapter 11: Sampling Distributions

1. Introduction
  a. We use sample statistics like the sample mean to estimate parameters like the population mean.
    i. Sometimes we use a sample to estimate the mean of a certain variable (a parameter), for example to estimate the mean height of a CWU student.
    ii. We also use samples in the process of an experiment when attempting to estimate the causal impact of one variable on another variable.
  b. There are several interesting features about Xbar obtained from a sample.
    i. X represents the value of a random variable, for example the height of a randomly selected CWU student.
      Xbar represents the mean of the random variable X for a sample of individuals, for example the mean height of a randomly selected sample of CWU students.
      Xbar is a random variable (see the definition of random and random variable above).
    ii. Think of a sample of 1, 2, 3, or 4 students from a population of 4 students at CWU, from which you will calculate Xbar, the mean height of the sample, to estimate μ, the mean height of the population at CWU.
      Assume the students have heights 50, 60, 70 & 80 inches.
    iii. As Xbar is a random variable, there is a distribution on Xbar.
      1. We might simply estimate what the distribution looks like, or we could collect a large number of samples and determine its distribution from the histogram.
      Draw a distribution for samples of size 1, 2, 3 and
4. That is, for n = 1, 2, 3 and 4.
      Note that for a sample of size 1 the distribution of Xbar looks identical to the distribution of x. Remember from above that the distribution of x is just the distribution of the height of all CWU students. Therefore, if the sample size is 1, and thus Xbar will take any value that x can take, then the distribution on Xbar is the distribution on x.
      Generally, the approximate distribution for a random variable can be inferred by making a large number of observations of the random variable and observing the histogram for very small categories.
    iv. The spread of the distribution (it makes more sense to think of the spread as the range of possible values for Xbar) depends on the size of the sample.
      With a sample of size 1 the distribution of Xbar looks like the distribution of X, but with a sample size of 4 the distribution of Xbar collapses on μ.
    v. The mean of the distribution of Xbar is the same as the mean of the distribution of X.
  c. Example: the mean height of CWU students
    Let's randomly select a sample (an SRS) of 100 CWU students from the population of all 10,000 CWU students.
      1. X here is the height of a randomly chosen CWU student.
      There is a sample mean for a sample of, say, 100 CWU students: Xbar.
      There is a true population mean height for all 10,000 CWU students: μ.
      There is a sample standard deviation for a sample of 100 CWU students: sx.
      There is a true population standard deviation for all CWU students: σx.
      There is a histogram which reflects the distribution of the height of the 100 CWU students in the sample.
      There is a histogram which reflects the distribution of the height of all CWU students.
    Now take an SRS of 100 students and calculate the mean of the sample, Xbar.
    Take 1000 SRSs of 100 students and calculate the mean of each sample: Xbar1, Xbar2, ..., Xbar1000.
      Draw a histogram representing the distribution of the mean height in all of the 1000 samples.
    Now take an SRS of 999 students.
      Take 1000 SRSs
of 999 students and calculate the mean of each sample: Xbar1, Xbar2, ..., Xbar1000.
      Draw a histogram representing the distribution of the mean height in all of the 1000 samples.
    iv. Questions
      1. Where are the histograms in steps b and c (the 1000 samples of size 100 and the 1000 samples of size 999) likely to be centered?
      2. What is the relationship between the distribution of x and the distribution of Xbar?
      3. What is the difference in spread between the histograms in steps b and c?
      4. Generally speaking, what happens to the spread in the histogram as the size of the sample increases?

2. Introduction
  a. Let's look at the random variable X, the height of a randomly selected CWU student. Say we took a sample of 100 students and then calculated Xbar.
  b. Xbar is a random variable because the outcome of Xbar calculated from the sample is uncertain: we do not know who will be selected into the sample.

3. Parameters and statistics
  a. Often we will calculate a statistic from a sample in order to get an estimate of a parameter from the population.
    i. For example, we could calculate the mean of a sample of CWU students to get an estimate of the mean height for the population (i.e., all CWU students).
  We use a different name and a different letter to label a number depending on whether it describes the population (a parameter) or a sample (a statistic).
  Parameter: A number that describes the population. In statistical practice, the value of a parameter is not known because we cannot examine the entire population.
    i. μ
    ii. p
  Statistic: A number that can be computed from the sample data without making use of any unknown parameters.
    i. Name some contributing factors to the erroneous rejection.
      1. The sample is not representative due to chance, not bias.
      2. Largely due to the fact that the sample size is too small.

4. Statistical estimation and the law of large numbers
  b. Law of large numbers: Draw observations at random from any population with finite mean μ. As the number of observations drawn increases, the mean Xbar of the observed values gets closer and closer to the mean μ of the
population.
    i. In the long run, the proportion of outcomes taking any value (for example, the proportion of coin flips that yield a head) gets close to the probability of that value, and the average outcome gets close to the population mean.
    iii. Example: the sample mean height of a student in this class.
  d. Describing sampling distributions: shape, center & spread

6. The sampling distribution of Xbar
  a. Suppose that Xbar is the mean of an SRS (simple random sample) of size n drawn from a large population with mean μ and standard deviation σ. Then the sampling distribution of Xbar has mean μ and standard deviation σ/√n.
    i. Xbar: the sample mean
    ii. Mean of Xbar = μ
      1. Where μ is the mean of x.
      2. Because the mean of Xbar is equal to μ, we say that the statistic Xbar is an unbiased estimator of the parameter μ.
    iii. Standard deviation of Xbar = σ/√n
      1. Where σ is the standard deviation of x.
    a. This is an example where, instead of randomly selecting a value from a fixed population that has a Normal distribution, we are observing a process whose outcome follows a Normal distribution based on the observation of past outcomes. Remember from the Ch. 10 notes: Finite populations, like the height of a CWU student or the height of an American, can be approximately normally distributed. Processes, like a lab measurement or a performance measurement (the percent of free throws made in a game), can also be normally distributed. Note that in the latter case a sample is not chosen from a fixed population; rather, the outcome of a process that has an infinite population is observed.
  b. The shape of the distribution of Xbar depends on the shape of the underlying population and the size of the sample.
    i. The center of the distribution of Xbar is determined by the center of the distribution of X.
    ii. For small samples, the shape of the distribution of Xbar is similar to the distribution of X. However, the larger the sample, the more the shape of the distribution looks like a Normal distribution, regardless of the
distribution of X. Hence, with larger samples the shape of the distribution may be completely unrelated to the distribution of X.
    Specifically, if x has the N(μ, σ) distribution, then the sample mean Xbar of an SRS of size n will approximately have the N(μ, σ/√n) distribution.
  Use the simple random sample applet:
    1. Population: 100
    2. Sample: 5
    3. Subtract 1 from the numbers in the sample before logging approval/disapproval.
  Phat represents the proportion in the sample. Phat from a sample will be normally distributed and centered at p, the proportion in the population. Phat is to p as Xbar is to μ. In both cases the former is a statistic and the latter is a parameter.
    1. This can be seen from the table presented in class.

7. The Central Limit Theorem and the shape of the sampling distribution
  a. If n is large, then for an SRS from any population with mean μ and finite standard deviation σ, Xbar is approximately N(μ, σ/√n) regardless of the distribution of x.
  What is the random variable z?
    Remember that the z-score tells us how many standard deviations a particular observation falls from the mean.

Chapter 14: Introduction to Inference

1. Introduction
  a. Statistical inference: Statistical inference provides methods for drawing conclusions about a population from sample data. It allows us to infer information about the population from the sample data, and it uses the language of probability to say how trustworthy our conclusions are.
  b. Assumptions on which the inference procedures in Chapter 14 are based:
    i. We have an SRS from the population of interest. There is no nonresponse or other practical difficulty.
    ii. The variable we measure has a perfectly Normal distribution N(μ, σ) in the population.
    iii. We don't know the population mean μ, but we do know the population standard deviation σ.

2. Confidence intervals: the concept
  ... sample means will capture the unknown mean μ of the population.
  c. The form of a confidence interval
    i. estimate ± margin of error
    Xbar ± 2 × sd(Xbar) is an example
of a 95% confidence interval.

5. Examples
  i. A (see overhead)
    1. State: The problem tells us that we are interested in estimating the mean percent change during three months of breastfeeding in the bone mineral content of the spines of the population of all breastfeeding mothers.
    2. Formulate: The problem asks us to use a 99% confidence interval to estimate this mean percent change in the population.
    3. Solve:
      a. Xbar ± z* × σ/√n
      b. −3.587 ± 2.576 × 2.5/√47
      c. −3.587 ± 0.939
      d. −4.526 to −2.648
    4. Conclude: We are 99% confident that the mean percent change during three months of breastfeeding in the bone mineral content of the spines of the population of all breastfeeding mothers is between −4.526 and −2.648.
  ii. B
    State: We want to estimate the mean healing rate for comparison with rates under other conditions.
    Formulate: We will estimate the mean rate μ for all newts of this species by giving a 95% confidence interval.
    Solve:
      1. Xbar ± z* × σ/√n
      2. 25.67 ± 1.96 × 8/√18
      3. 25.67 ± 3.70
      4. 21.97 to 29.37
    Conclude: We are 95% confident that the mean healing rate for all newts of this species is between 21.97 and 29.37 micrometers per hour.

6. Tests of significance: the reasoning
  Example: testing whether artificially sweetened colas lose their sweetness over time. x is the difference in sweetness over a 1-month period at high temperatures, meant to simulate a 4-month period under normal temperatures.
  Suppose we know that for any cola the sweetness loss scores vary from taster to taster according to a Normal distribution with standard deviation σ = 1.
  iii. Slide: For which Xbar would you reject the hypothesis?
    1. Xbar = 1.02 is way out on the Normal curve in Figure 15.1, so far out that such an observed value is good evidence that in fact the true μ is greater than 0, that is, that the cola lost sweetness.

7. Tests of significance: Stating hypotheses
  a. Start with some finding for which you are trying to find evidence. That is, start with a claim about the population for which you are trying to find
evidence (stated in terms of a parameter).
    Example: testing whether pizza the night before an exam has an effect on exam performance. This is a two-sided hypothesis.
    Example: testing whether a college education has a positive effect on earnings. This is a one-sided hypothesis.
  b. Alternative hypothesis: The alternative hypothesis is one-sided if it states that a parameter is larger than or smaller than the null hypothesis value. It is two-sided if it states that the parameter is different from the null value.
  c. Null hypothesis
    i. Denoted Ho
    ii. A claim about a population parameter. The opposite of Ha.
  d. Note: The best that one can do regarding verifying the null hypothesis is to fail to reject it.
  Example (bone mineral loss):
    i. The alternative hypothesis is one-sided because the claim about which we are trying to find evidence is that nursing mothers lose bone mineral (the mean percent change in the mineral content of the spine is < 0).
      1. Ha: μ < 0, where μ represents the mean percent change in bone mineral content.
    ii. The null hypothesis is that there is zero bone mineral loss, because we are trying to determine whether nursing mothers lose bone mineral.
      1. Ho: μ = 0
  Example (cola sweetness):
    i. The alternative hypothesis is one-sided because the claim about which we are trying to find evidence is that artificially sweetened colas lose their sweetness over time (sweetness loss > 0).
      1. Ha: μ > 0, where μ represents mean sweetness loss.
    ii. The null hypothesis is that there is zero taste loss, because we are trying to determine whether artificially sweetened colas lose their sweetness over time.
      1. Ho: μ = 0

8. Tests of significance: P-values & statistical significance
  i. Alternative definition of the P-value: the probability, assuming Ho is true, of getting a value for Xbar that is at least as far from the Ho value, in the direction of Ha, as the value actually observed.
  ii. When we say a result is statistically significant at level α, we are saying that our rejection of the null hypothesis is significant at level α.
  iii. Note that significant in the statistical sense does not mean important. It means simply not likely to happen by chance.

9
Tests of significance: Tests for a population mean
  a. Test statistic
    i. Definition: a statistic that we construct to test the null hypothesis.
    ii. The test statistic for hypotheses about the mean μ of a Normal distribution is the standardized version of Xbar. It is known as the z statistic: z = (Xbar − μ0)/(σ/√n).
    iii. μ0, the hypothesized value of μ in the null, is used because the test statistic is calculated under the assumption that the null hypothesis is correct.

10. Significance from a table
  a. The outcome of a test is significant at level α if P ≤ α.
  b. From Table C we can approximate the P level.
    i. First find the two numbers in the z* row that your z value is between.
    ii. Then determine whether you have a one-sided or two-sided test.
    iii. Then look up the two corresponding values in either the one-sided P or two-sided P row. The P level will be between these two values.

Chapter 15: Thinking About Inference

1. Conditions for inference in practice
  a. When you use statistical inference, you are acting as if your data are a random sample or come from a randomized comparative experiment.
  b. Statistical confidence intervals and tests cannot remedy basic flaws in producing the data, such as voluntary response samples or uncontrolled experiments.

2. How confidence intervals behave
  a. Reducing the margin of error, m = z* × σ/√n
    i. Lower z*
      1. We do this by reducing the confidence level C.
      2. To obtain a smaller margin of error from the same data, you must be willing to accept lower confidence.
    ii. Lower σ
    iii. Increase n
      1. If we have a certain target for the margin of error, we can select the sample size to ensure obtaining that margin of error.
        a. We start with the fact that the margin of error is m = z* × σ/√n. Using this formula, we can solve for n to determine the required sample size to obtain a specified margin of error.
      2. n = (z* × σ/m)²
  b. The margin of error in a confidence interval covers only random sampling errors. Practical difficulties such as undercoverage and nonresponse are often more serious than random sampling error. The margin of error
does not take such difficulties into account.

3. How significance tests behave
  a. How small a P is convincing?
    i. How plausible is Ho? If Ho represents an assumption that the people you must convince have believed for years, strong evidence (small P) will be needed to persuade them.
    ii. What are the consequences of rejecting Ho? If rejecting Ho in favor of Ha means making an expensive changeover from one type of product packaging to another, you need strong evidence that the new packaging will boost sales.
  b. Significance depends on the alternative hypothesis.
  c. Significance depends on sample size.
  d. Beware of multiple analyses.

Chapter 17: Inference about a population mean

1. Conditions for inference
  a. The data are an SRS (simple random sample) from the population.
  b. The population distribution for the underlying random variable X is approximately Normal.
    i. The larger the sample size, the less important the Normality condition.
    ii. In practice it is enough that the distribution be symmetric and single-peaked unless the sample is very small.

2. Estimating the standard deviation of the sampling distribution
  a. When we don't know the population standard deviation σ for the random variable x, we cannot calculate the sampling distribution standard deviation σ/√n for Xbar.
  b. When this is the case, we use s, the sample standard deviation, to estimate σ, the population standard deviation.
  c. Our estimate for the standard deviation of the sampling distribution of Xbar, called the standard error to denote that it is an estimate, is then
    i. Standard error of Xbar = s/√n

4. Stating hypotheses
  a. Null hypothesis: The statement being tested.
    i. Denoted Ho
    ii. A claim about a population parameter
    iii. A null hypothesis always takes the form
      1. Ho: μ = hypothesized value
  b. Alternative hypothesis: Frequently, but not always, the claim about the population that we are trying to find evidence for.
    i. Denoted Ha
    ii. A claim about a
population parameter
    iii. The alternative hypothesis is one-sided if it states that a parameter is larger than or smaller than the null hypothesis value. It is two-sided if it states that the parameter is different from the null value.
    iv. An alternative hypothesis always takes one of the following forms:
      1. μ ≠ hypothesized value
      2. μ > hypothesized value
      3. μ < hypothesized value
  d. The one-sample t test

5. Tests of significance: the four-step process
  a. State
    i. State the problem in terms of a specific question about the mean (the mean of what?).
    ii. For example: "We are interested in knowing whether there is evidence that the mean ___ is ___."
  b. Formulate
    i. Identify what μ is the mean of.
    ii. State the null hypothesis Ho.
    iii. State the alternative hypothesis Ha.
      1. Frequently, but not always, the claim about the population that we are trying to find evidence for.
  c. Solve
    i. Check the conditions for the test you plan to use.
    ii. Calculate the test statistic.
    iii. Find the P-value.
  d. Conclude: Describe your results in the context of the stated question.

7. Confidence intervals
  a. State
    i. State what is being estimated (the mean of what?) and why.
  b. Formulate
    i. Describe how you will estimate the mean: "We will estimate the mean μ using a confidence interval of size C."
  c. Solve
    i. Check the conditions for the test you plan to use.
      1. SRS
    ii. Calculate the confidence interval.
  d. Conclude
    i. "We are C confident that the mean ___ is between ___ and ___."

8. Matched pairs t procedures
  a. Definition: To compare the responses to the two treatments in a matched pairs design, find the difference between the responses within each pair. Then apply the one-sample t procedures to these differences.
    i. Ho: μ = 0
    ii. Ha: μ > 0, μ < 0, or μ ≠ 0

9. Robustness of t procedures
  a. A confidence interval or significance test is called robust if the confidence level or P-value does not change very much when the conditions for use of the procedure are violated.
    i. For example, even with outliers, the confidence level and significance test results will still be approximately correct if the sample size is large enough.

Chapter
19: Two-Sample Problems

1. Two-sample problems
  a. Two-sample problems are problems where we want to test whether two populations have the same mean (for example, is the mean height of a CWU student the same as the mean height of an Evergreen student?) or whether two treatments have the same mean effect (for example, physical therapy advice versus regular physical therapy).
    i. Whereas one-sample problems focus on the random variable Xbar, two-sample problems focus on the random variable Xbar1 − Xbar2.
  b. The null hypothesis is that the difference in the two means is zero (μ1 − μ2 = 0), or equivalently that the mean of the random variable Xbar1 − Xbar2 is zero.
    i. Another way of stating this null hypothesis is that the two means are equal: μ1 = μ2.

2. Conditions for inference
  a. We have two SRSs from two distinct populations. The samples are independent. That is, one sample has no influence on the other.
    i. Matching violates independence, for example.
  b. Both populations are Normally distributed. In practice, it is enough that the distributions have similar shapes and that the data have no strong outliers.

3. Two-sample t procedures
  a. Two-sample confidence intervals: a confidence interval on μ1 − μ2
    i. The confidence interval has the usual form:
      1. observed value of random variable ± t* × standard deviation of random variable
      2. Here the random variable is Xbar1 − Xbar2.
  b. Two-sample hypothesis tests: a hypothesis on μ1 − μ2
    i. The null is that μ1 − μ2 = 0.
    ii. The test statistic has the usual form:
      1. (observed value of random variable − hypothesized value of random variable) / standard deviation of random variable
      2. Here the random variable is Xbar1 − Xbar2.

5. Confidence interval: the four-step process
  a. State
    i. State what is being estimated (the mean of what?) and why.
  b. Formulate
    i. Describe how you will estimate the mean: "We will estimate the mean μ using a confidence interval of size C."
  c. Solve
    i. Check the conditions for the test you plan to use.
      1. SRS
      2. Approximately normally distributed random variable
        a. See rules of thumb for t
procedures.
      3. Samples independent
    ii. Calculate the confidence interval.
  d. Conclude
    i. "We are C confident that the mean ___ is between ___ and ___."

6. Robustness again
  a. As a guide to practice, adapt the guidelines given in chapter 18 for the use of one-sample t procedures to two-sample procedures by replacing "sample size" with the "sum of the sample sizes."
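
As a numerical check on the interval form used throughout these notes (estimate ± critical value × standard deviation of the estimate), the sketch below recomputes the two Chapter 14 confidence interval examples and applies the sample-size formula n = (z* × σ/m)². The function names and the target margin of 0.5 are illustrative assumptions, not part of the course materials, and the decimal placements (−3.587, 2.5, 25.67, 8) are taken as read from those examples.

```python
import math


def z_interval(xbar, z_star, sigma, n):
    """Return (margin, low, high) for the interval xbar ± z* · σ/√n."""
    margin = z_star * sigma / math.sqrt(n)
    return margin, xbar - margin, xbar + margin


def required_n(z_star, sigma, m):
    """Smallest whole sample size whose margin of error is at most m,
    from solving m = z* · σ/√n for n, i.e. n = (z* · σ / m)²."""
    return math.ceil((z_star * sigma / m) ** 2)


# Example A: mean percent change in bone mineral content, 99% confidence (z* = 2.576)
margin_a, low_a, high_a = z_interval(xbar=-3.587, z_star=2.576, sigma=2.5, n=47)

# Example B: mean healing rate of newts, 95% confidence (z* = 1.96)
margin_b, low_b, high_b = z_interval(xbar=25.67, z_star=1.96, sigma=8, n=18)

# Hypothetical target: shrink Example A's margin of error to 0.5 (assumed value)
n_for_half_point = required_n(z_star=2.576, sigma=2.5, m=0.5)

print(f"A: margin {margin_a:.3f}, interval {low_a:.3f} to {high_a:.3f}")
print(f"B: margin {margin_b:.2f}, interval {low_b:.2f} to {high_b:.2f}")
print(f"n needed for margin 0.5 in A: {n_for_half_point}")
```

Rounding the sample size up rather than to the nearest integer is a deliberate choice: a smaller n would give a margin of error slightly larger than the target.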
