Introduction to Statistical Methods STAT 301
These 25 pages of class notes were uploaded by Ena Kris on Tuesday, September 22, 2015. The notes belong to STAT 301 at Colorado State University, taught by Staff in Fall.
Date Created: 09/22/15
Sections 2.1 and 2.2

Variable: a response that varies from unit to unit.
Variables of Interest: the specific characteristics of interest in a study. (What question do I ask the people or items in my study?)
Univariate Data Set: a data set consisting of observations on a single variable or characteristic.
Bivariate Data Set: a data set consisting of pairs of observations on two variables.
Multivariate Data Set: a data set consisting of observations on two or more variables.
Categorical (or Qualitative) Data: eye color, type of car, political affiliation, employment status.
Numerical (or Quantitative) Data: height, weight, number of customers, blood alcohol level.
Discrete: quantitative data for which noninteger values DO NOT make sense.
Continuous: quantitative data for which noninteger values DO make sense.

Examples of Discrete vs. Continuous Data
Discrete: number of siblings, number of classes registered for, number of speeding tickets, bus rides per week.
Continuous: height, weight, length, blood alcohol level, speed, time, salary.

Variable Type: Is the research question answered with a category or a number? If it is answered with a number, do noninteger values make sense?

Stem-and-Leaf Display
- Select one or more leading digits for the stem values. The trailing digit or digits become the leaves.
- List possible stem values in a vertical column.
- Record the leaf for every observation beside the corresponding stem value.
- Indicate the units for stems and leaves someplace in the display.

Curt Storlie, ST301, 2.1 and 2.2 Handout.doc, 8/20/01, page 1

Sections 1.1 to 1.3

Statistics: the scientific discipline that provides methods to help us make sense of data.
Descriptive Statistics: methods that organize and summarize data.
Inferential Statistics: methods for generalizing from a sample to the population from which the sample was selected.
Population: the entire collection of items of interest in this study.
Sample: a subset of the entire population, selected in some prescribed manner for study.
Observational Study: a study in which the investigator merely observes and records information on the
items in the sample.
Experimental Study: a study in which the investigator compares two or more treatments or experimental conditions after randomly assigning the sample items (subjects) to the different treatment groups.
Experiment: a procedure for investigating the effect of an experimental condition (which is manipulated by the experimenter) on a response variable.
Response Variable: the variable being studied by the experimenter.
Experimental Condition (or Treatment): any particular specification of the factors that may affect the response variable.
Extraneous Factor: a factor that is not of interest in the current study but is thought to affect the response variable.
Confounded: two factors are confounded if their effects on the response variable cannot be distinguished from one another.
Randomization: random assignment, to ensure that the experiment does not intentionally favor one experimental condition over another.
Blocking: using extraneous factors to create groups (blocks) that are similar. All experimental conditions are then tried in each block.
Control: holding extraneous factors constant so that their effects are not confounded with those of the experimental conditions.
Simple Random Sample (SRS): a sample that is selected from a population in a way that ensures that every different possible sample of the desired size has the same chance of being selected. Each item in the population has the same chance of being selected to be in the sample.
Stratification: the process of grouping population items into subpopulations (strata) before sampling.

Sections 3.1 to 3.3

Frequency Distribution: a table that displays the categories, frequencies, and relative frequencies.
Frequency: the number of observed responses that fall into a particular category.
Relative Frequency: the fraction or proportion of observed responses in a particular category.
Class Intervals: the intervals that the values of the variable are broken up into.
Class Width: the width of the class interval.
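As a rough illustration of the frequency and relative frequency definitions above, the following Python sketch tallies a small categorical sample. The data values and the function name `frequency_distribution` are made up for illustration and are not part of the handout.

```python
from collections import Counter


def frequency_distribution(observations):
    """Map each category to (frequency, relative frequency)."""
    counts = Counter(observations)   # frequency of each category
    n = len(observations)            # total number of observed responses
    return {category: (freq, freq / n) for category, freq in counts.items()}


# Hypothetical sample of eye colors (illustrative data only).
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "blue"]
table = frequency_distribution(eye_colors)

for category, (freq, rel_freq) in sorted(table.items()):
    print(f"{category:<6} frequency={freq} relative frequency={rel_freq:.3f}")
```

The relative frequencies of all categories always sum to 1, which is a quick check on a hand-built table.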
Cutpoints: the endpoints of the class intervals.
Midpoint: the average of the cutpoints of any class interval.
Cumulative Relative Frequency: the sum of the relative frequencies of a class interval and all class intervals that precede it.
Histogram: a graph based on a frequency distribution; each relative frequency is represented by a rectangle whose area is proportional to the corresponding relative frequency.

Constructing a Histogram (Bar Chart) for Categorical Data
- Draw a horizontal line and write the category names at regularly spaced intervals.
- Draw a vertical line and scale it using relative frequency values (or use frequencies).
- Place a rectangle above each category label. The height is the category's relative frequency (or frequency), and all rectangles should have identical base widths.

Constructing a Histogram for Numerical Discrete Data
- Draw a horizontal scale and mark possible values.
- Draw a vertical scale and mark it with either relative frequencies or frequencies.
- Above each possible value, draw a rectangle centered at that value. The height of each rectangle is the corresponding relative frequency (or frequency). Usually possible values are consecutive whole numbers, in which case the base width for each rectangle is 1.

Constructing a Histogram for Numerical Continuous Data (Equal Class Widths)
- Mark boundaries of the class intervals on the horizontal axis.
- Use either relative frequencies or frequencies on the vertical axis.
- Draw the rectangle for each class directly above the corresponding interval, so that the edges are at the class boundaries.

Constructing a Histogram for Continuous Data When Class Widths Are Unequal
In this case, frequencies or relative frequencies should not be used on the vertical axis. Instead, the height of each rectangle, often called the density for the class, is given by
$\text{density} = \dfrac{\text{relative frequency}}{\text{class width}}$
The vertical axis is called the density scale; it should be marked so that each rectangle can be drawn to have the calculated height; density = rectangle
height.
Modes: the number of peaks in a histogram.
Unimodal: the shape of a histogram with only one peak.
Bimodal: the shape of a histogram with two peaks.
Multimodal: the shape of a histogram with two or more peaks.
Symmetric: a histogram which, when cut in half vertically, has halves that are mirror images of each other.
Upper & Lower Tails: the part of a unimodal histogram to the right and left of the peak, respectively.
Skewed: a unimodal histogram that is not symmetric.
Positive Skew (Skewed Right): a unimodal histogram with a distinctly larger right tail.
Negative Skew (Skewed Left): a unimodal histogram with a distinctly larger left tail.
Normal Curve: a symmetric unimodal histogram from a normal distribution (bell curve).
Heavy Tailed: a curve with tails that do not decline as rapidly as the tails of a normal curve.
Light Tailed: a curve with tails that decline more rapidly than the tails of a normal curve.
Sampling Variability: the extent to which samples differ from one another and from the population.

Section 4.2

Range: the difference between the largest and smallest values in a data set; range = max - min.
Deviation from the Sample Mean: the difference between an observation in the data set and the sample mean:
$x_1 - \bar{x},\ x_2 - \bar{x},\ \ldots,\ x_n - \bar{x}$
Sample Variance: the typical squared deviation of a sample observation from the sample mean:
$s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \dfrac{S_{xx}}{n-1} = \dfrac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n-1}$
where $S_{xx}$ is the sum of the squared deviations from the mean.
Sample Standard Deviation: the typical deviation of a sample observation from the sample mean, $s = \sqrt{s^2}$.

Notes:
- The standard deviation is in the same units as the original data, whereas the variance is in squared units. Therefore, the standard deviation tends to be the most popular measure of variability.
- Use caution when interpreting the meaning of the standard deviation and the variance of a data set.
- Only $n-1$ of the $n$ deviations contain independent information about the variability in a data set. This is part
of the reason that $n-1$ is used in the denominator of the variance calculation.

Notation:
- Population variance: $\sigma^2$
- Population standard deviation: $\sigma$
Note: Oftentimes the population characteristics are unknown. The sample variance $s^2$ and sample standard deviation $s$ are used to estimate the population variance $\sigma^2$ and population standard deviation $\sigma$, respectively.

Quartile: separates 25% of the data from the remaining data.
Lower Quartile (Q1): median of the lower half of the data.
Upper Quartile (Q3): median of the upper half of the data.
Middle Quartile (Q2, or Median): median of the entire data set.
Note: If n is odd, the median of the entire data set is included in both halves of the data set when determining the upper and lower quartiles.
Interquartile Range (IQR): the difference between the upper and lower quartiles; IQR = Q3 - Q1.

Section 4.1

Sample Mean: arithmetic average of sample responses:
$\bar{x} = \dfrac{\text{sum of all observations in the sample}}{\text{number of observations in the sample}} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
Population Mean: the average x value in the entire population:
$\mu = \dfrac{\text{sum of all observations in the population}}{\text{number of observations in the population}} = \dfrac{x_1 + x_2 + \cdots + x_N}{N}$

Notes:
- Different samples will lead to different sample means.
- If you average all of the possible sample means, you will get the population mean.
- The larger the variability of the population data set, the more variability in the sample means.
- The value of a sample or population mean can be greatly influenced by the presence of an outlier.

Median: the middle value of an ordered data set. It separates the data set into two equal halves.
Median = the $\left(\frac{n+1}{2}\right)$th ordered observation, if n is odd;
Median = the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th ordered observations, if n is even.

Trimmed Mean: a measure of center in which the observations are first ordered from smallest to largest, one or more observations are deleted from each end, and the remaining ones are averaged.

Computing a Trimmed Mean
- Order the data set from smallest to largest.
- Delete a selected number of
observations from both the top and bottom of the ordered data set.
- Average the remaining observations.
Trimming Percentage: the percentage of values deleted from each end of the ordered list.

Proportion of Successes (Categorical Data):
$p = \dfrac{\text{number of S's in the sample}}{n}$ = sample proportion of successes
$\pi = \dfrac{\text{number of S's in the population}}{N}$ = population proportion of successes

Sections 4.3 and 4.4

Chebyshev's Rule: the percentage of observations that are within k standard deviations of the mean is at least
$100\left(1 - \dfrac{1}{k^2}\right)\%$ for $k \ge 1$.
Notes:
- Chebyshev's Rule can be applied to any data set, no matter the shape of the distribution (symmetric or skewed).
- Chebyshev's Rule is extremely conservative. Oftentimes a higher percentage of observations lie within k standard deviations than Chebyshev's Rule suggests.

Empirical Rule: if the histogram of values in a data set can be reasonably well approximated by a normal curve (the distribution of the data set is approximately normal), then
1. Approximately 68% of the observations are within one standard deviation of the mean.
2. Approximately 95% of the observations are within two standard deviations of the mean.
3. Approximately 99.7% of the observations are within three standard deviations of the mean.
Note: The Empirical Rule is much more precise than Chebyshev's Rule, but it can only be applied when the data distribution is reasonably normal.

z score: a measure of relative standing. It tells us how many standard deviations an observation in a data set is from the mean of the data set:
$z\ \text{score} = \dfrac{\text{observation} - \text{mean}}{\text{standard deviation}}$
Notes:
- If $x_i$ is greater than $\bar{x}$, the z score will be positive.
- If $x_i$ is less than $\bar{x}$, the z score will be negative.

rth percentile: the value such that r percent of the observations in the data set fall at or below that value.

Boxplot: a graph that provides information about the center, spread, and symmetry/skewness of the data.

Construction of a Boxplot
1. Draw a horizontal (or vertical) measurement scale.
2. Construct a rectangular box whose left (or lower) edge is at the lower quartile and whose right (or upper) edge is at the upper quartile (so box width = IQR).
3. Draw a vertical (or horizontal) line segment inside the box at the location of the median.
4. Extend horizontal (or vertical) line segments from each end of the box out to the smallest and largest observations in the data set. These line segments are called whiskers.

Outlier: an observation that is more than 1.5 IQR away from the closest end of the box.
- Extreme Outlier: an outlier that is more than 3 IQR from the closest end of the box.
- Mild Outlier: an outlier that is between 1.5 IQR and 3 IQR from the closest end of the box.

Sections 5.1 and 5.2: Scatter Plots and Correlation

Bivariate Data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
Scatter plot: a picture of bivariate numerical data in which each observation $(x_i, y_i)$ is represented as a point located with respect to a horizontal x axis and a vertical y axis.

[Example: a small table of (x, y) values and its scatter plot.]

Things to Notice About Bivariate Data and Scatter Plots
1. The relationship between y and x:
- Positive Relationship: y increases as x increases (or y decreases as x decreases).
- Negative Relationship: y increases as x decreases (or y decreases as x increases).
- No Relationship.
2. Does it appear that the value of y can be predicted from knowing x, by finding a line that is reasonably close to the points in the plot?

Correlation Coefficient: a quantitative assessment of the strength of the relationship between the x's and y's.
Pearson's Sample Correlation Coefficient:
$r = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}} = \dfrac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$
where
$S_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n}$, $S_{yy} = \sum y^2 - \dfrac{(\sum y)^2}{n}$, $S_{xy} = \sum xy - \dfrac{(\sum x)(\sum y)}{n}$

Properties of r
- The value of r does not depend on the unit of measurement for either variable. It is unitless. For example, if x is distance, r will not change regardless of whether x is measured in feet, yards, miles, or any other unit.
- The value of r does not depend on which of the two variables is labeled x.
- The value of r is
between $-1$ and $1$:
  r close to $1$ indicates a strong positive linear relationship;
  r close to $-1$ indicates a strong negative linear relationship;
  r close to $0$ indicates a very weak linear relationship, or no linear relationship at all;
  $r = 1$ only when all the points in the scatter plot lie exactly on a straight line that slopes upward;
  $r = -1$ only when all the points in the scatter plot lie exactly on a straight line that slopes downward.
- The value of r is a measure of the extent to which x and y are linearly related (straight line only). A value of r close to zero means x and y are not linearly related. It doesn't mean that x and y are unrelated.
- The magnitude of r is the square root of the quantity $R^2$ in Minitab computer output; the sign of r matches the direction of the relationship.

Strength of r
  $-1$ to $-0.8$: strong
  $-0.8$ to $-0.5$: moderate
  $-0.5$ to $0.5$: weak
  $0.5$ to $0.8$: moderate
  $0.8$ to $1$: strong

[Examples: six scatter plots. Exact line, $r = 1$; positive correlation, $r = 0.8$; no relationship, r near 0; exact line, $r = -1$; negative correlation, $r = -0.8$; no correlation, r near 0.]

Note: When examining a scatter plot you should be able to do the following:
- Determine if there is a relationship between the two variables x and y.
- If there is a relationship, be able to describe its shape, direction, and strength (of r).

Population Correlation Coefficient ($\rho$): Pearson's sample correlation coefficient r is an estimate of the population correlation coefficient $\rho$.
- $\rho$ is a number between $-1$ and $1$.
- $\rho$ is unitless: the value of $\rho$ does not depend on the unit of measurement for either variable.
- $\rho = 1$ or $-1$ only if all (x, y) pairs in the population lie exactly on a straight line.
- $\rho$ measures the extent to which there is a linear relationship in
the population.

Sections 5.3 and 5.4

Prediction: a predicted y value for any x value can be obtained by plugging the x value into the least squares line equation.
Ex: The least squares line for a sample is $\hat{y} = 12 + 25x$. What is the predicted value for $x = 4$?
$\hat{y} = 12 + 25x = 12 + 25(4) = 12 + 100 = 112$
Don't ever make a prediction for an x value outside the range of the x's. This is called Extrapolation.
Ex: If I had only collected data for the example above for x's between 10 and 20, I cannot make a prediction for $x = 4$, which is outside the range of my x's.

Coefficient of Determination: the proportion of variation in y that can be explained by a linear relationship between x and y.
Residual Sum of Squares: a measure of unexplained variation, denoted SSResid, given by
$\text{SSResid} = \sum (y - \hat{y})^2 = (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \cdots + (y_n - \hat{y}_n)^2 = \sum y^2 - a\sum y - b\sum xy$
Total Sum of Squares: a measure of total variation, denoted SSTo, given by
$\text{SSTo} = \sum (y - \bar{y})^2 = (y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2 = \sum y^2 - \dfrac{(\sum y)^2}{n} = S_{yy}$
Coefficient of Determination, denoted $r^2$, is given by
$r^2 = \dfrac{\text{SSTo} - \text{SSResid}}{\text{SSTo}}$
Also, $r^2 = (\text{correlation})^2$. This is why we call the coefficient of determination $r^2$. High values of $r^2$ indicate that a line is a pretty good model for the relationship between x and y.
Standard Deviation About the Least Squares Line:
$s_e = \sqrt{\dfrac{\text{SSResid}}{n-2}}$

Hypothesis Testing in Regression
Recall that we are assuming that the relationship between x and y is defined by the equation $y = \alpha + \beta x + e$. We can test whether or not there is a linear relationship by testing whether or not the slope equals 0:
$H_0: \beta = 0$ vs. $H_a: \beta \ne 0$
In order to perform this test, we need to make the assumption that $e \sim N(0, \sigma^2)$ at every x value. Then, if $H_0$ is true ($\beta = 0$),
$t = \dfrac{b}{s_b}$
has a t distribution with $n-2$ df, where $s_b = \dfrac{s_e}{\sqrt{S_{xx}}}$.
$s_b$ is given in the Minitab printout, along with t and the p-value for the test. See page 514 of the book for an example of the Minitab output.

Confidence Interval for $\beta$: when the normality and equal variance assumptions made above hold, a $(1-\alpha)100\%$ CI for $\beta$ is
$b \pm t_{\alpha/2,\,n-2}\; s_b$

Sections 5.3 and 5.4: Simple Linear Regression

Independent (Predictor) Variable (X): the variable that doesn't depend or rely on the other variable.
Dependent (Response) Variable (Y): the variable that does depend or rely on the other variable.
Example: For the variables Age and Income, we realize that a person does not get older or younger depending on their income, but a person's income can change as they get older. Thus, Income is dependent on Age. So Age is the Independent Variable (X) and Income is the Dependent Variable (Y).

Linear Regression Model: assume that y and x are related by the linear equation $y = \alpha + \beta x + e$.
- $\alpha$ is the y-intercept: the value of y when x is zero.
- $\beta$ is the slope: the increase in y for a single unit increase in x.
- $e$ is a random error term.
We will use the sample of bivariate data to estimate $\alpha$ and $\beta$.

Estimated Regression Line: $\hat{y} = a + bx$
- a = point estimate for $\alpha$, the y-intercept
- b = point estimate for $\beta$, the slope

Predicted Values: the predicted (or fitted) values result from substituting each sample x value into the estimated regression line. The predicted value for $y_1$ is $\hat{y}_1$:
$\hat{y}_1 = a + bx_1,\quad \hat{y}_2 = a + bx_2,\quad \ldots,\quad \hat{y}_n = a + bx_n$
Residuals: the residuals of the estimated regression line are the actual y values minus the predicted y values:
$y_1 - \hat{y}_1,\ y_2 - \hat{y}_2,\ \ldots,\ y_n - \hat{y}_n$

Finding the Best Estimated Regression Line (Least Squares Line)
The criterion most commonly used for determining the best line that fits the data is the line that minimizes the residual sum of squares (SSResid):
$\text{SSResid} = \sum (y - \hat{y})^2 = (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \cdots + (y_n - \hat{y}_n)^2$
This technique is called the Least Squares Method. The a and b estimates that accomplish this are given below:
$b = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \dfrac{S_{xy}}{S_{xx}} = \dfrac{\sum xy - (\sum x)(\sum y)/n}{\sum x^2 - (\sum x)^2/n}$
$a = \bar{y} - b\bar{x}$
Note: All least squares lines have the property that the sum of all the residuals is 0.

Section 6.1: Probability

Subjective Interpretation
- Probability: a personal, subjective measure of the strength of belief that the event will occur.
- $P(A) = 0$: probability of the event A equals zero; belief that the event won't occur.
- $P(A) = 1$:
Probability of the event A equals one; belief that the event will certainly occur.
- Problem: Different people may place different probabilities on the same event.

Relative Frequency Interpretation
Probability: the long-run proportion of times that an event will occur, given many replications under identical circumstances. A number between 0 and 1 that reflects the likelihood of an event occurring.
- $P(A) = 0$: the event A occurs 0% of the time (Impossible Event).
- $P(A) = 1$: the event A occurs 100% of the time (Certain Event).

Event Operations
- $A \cap B$: A intersect B; A and B both happen.
- $A \cup B$: A union B; A happens, B happens, or A and B both happen.
- $A^c$: complement of A; A does not happen.
- S is the sample space; S is the union of all possible events.

Properties of Probability
- $0 \le P(A) \le 1$: the probability of any event is between 0 and 1, inclusive.
- $P(A \cap B) = 0$: the two events are mutually exclusive, i.e., both events cannot occur simultaneously.
- $P(A^c) = 1 - P(A)$: the probability that an event A doesn't occur is $1 - P(A)$.
- $P(A) + P(A^c) = 1$: the probabilities of an event and its complement sum to 1.
- $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
- $P(A \cup B) \le P(A) + P(B)$

Conditional Probabilities
$A \mid B$: "A given B"
$P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$ for $P(B) > 0$
$P(B \mid A) = \dfrac{P(A \cap B)}{P(A)}$ for $P(A) > 0$
$P(A \cap B) = P(A)\,P(B \mid A) = P(B)\,P(A \mid B)$

Independence
- Dependence: the occurrence of one event changes the probability that the other event occurs.
- Independence: the chance that one event occurs isn't affected by knowing whether or not the other event has occurred.
The following 3 statements are equivalent:
1. Events A & B are independent.
2. $P(A \mid B) = P(A)$
3. $P(B \mid A) = P(B)$
- Multiplication Rule for Independent Events: the events A & B are independent if and only if $P(A \cap B) = P(A)\,P(B)$.
- If the events A, B & C are independent, then $P(A \cap B \cap C) = P(A)\,P(B)\,P(C)$.

Section 7.3

Standard Normal Distribution: a special case of the normal distribution where $\mu = 0$ and $\sigma = 1$. It is customary to use the letter z to represent a variable whose distribution is described by the standard normal curve.
Note: If X has a normal
distribution with mean $\mu$ and std. dev. $\sigma$, then the z-score of X, $z = \dfrac{X - \mu}{\sigma}$, has a Standard Normal Distribution.

Cumulative Distribution Function for Standard Normal
- $P(z < a) = P(z \le a)$: look up a in the z-table (Appendix Table II, pg. 680).
- $P(z > a) = P(z \ge a) = 1 - P(z < a)$
- $P(a < z < b) = P(z < b) - P(z < a)$ for $b > a$

Identifying Extreme Values
- Goal: to identify the values included in the most extreme percentage of the distribution.
- Solution: find the z-value by looking up the probability in the z-table.
- Rule of Thumb: If the probability that we are trying to find is between two z-scores, choose the z-score that the probability is closest to. If the probability is directly between two z-scores, use the average of those z-scores.

Other Normal Distributions
Notes:
- The letter z represents standard normal variables, while the letter x is used more generally to represent normal variables with mean $\mu$ and standard deviation $\sigma$.
- $P(a < x < b) = P(a^* < z < b^*)$
- To use the z-curve areas to compute probabilities about a normal variable x, we must convert the endpoints a and b (these are both associated with x) into the $a^*$ and $b^*$ (these are associated with z) that give the same probability. To do this, we compute z-scores for the endpoints a and b. This process is called standardizing the endpoints.
- Summary: If x is normally distributed with mean $\mu$ and standard deviation $\sigma$, then
  1. $P(x < b) = P(z < b^*) = P\left(z < \dfrac{b - \mu}{\sigma}\right)$
  2. $P(x > a) = P(z > a^*) = P\left(z > \dfrac{a - \mu}{\sigma}\right)$
  3. $P(a < x < b) = P(a^* < z < b^*) = P\left(\dfrac{a - \mu}{\sigma} < z < \dfrac{b - \mu}{\sigma}\right)$
  where z is standard normally distributed ($\mu = 0$, $\sigma = 1$).

Describing Extreme Values in a Normal Distribution
Steps:
1. Solve the problem for the standard normal distribution (determine $z^*$).
2. Translate the answer $z^*$ into one for the normal distribution of interest (use $x = \mu + z^*\sigma$).

Sections 8.1 and 8.2

SECTION 8.1
Recall: $\bar{x}$ estimates $\mu$, $s^2$ estimates $\sigma^2$, p estimates $\pi$.
Sample-to-Sample Variability: the extent to which samples differ from one another.
Sampling Variability: the extent to which samples differ from one another and from the population.
Note: The average of all possible sample
means of a population is equal to the population mean.
Statistic: a numerical quantity computed from values in a sample.
Parameter: a numerical quantity computed from values in a population.
Sampling Distribution: the distribution of a statistic.
Note: The bigger the sample size, the less variability in the sampling distribution. This implies that the sample mean is closer to the population mean when the sample size is large.

SECTION 8.2
Note: A sample mean based on a large sample size tends to be closer to the population mean than does a sample mean based on a small sample size.

Rules Concerning the $\bar{x}$ Sampling Distribution
1. $\mu_{\bar{x}} = \mu$
   In words: The true (or population) mean of the sample means is equal to the true (or population) mean of the individual observations (or x values). Or, the mean of all possible sample means is equal to the population mean of the individual observations.
2. $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$
   Note: This rule is approximately correct as long as no more than 5% of the population is included in the sample, or $n/N < .05$.
   In words: The true (or population) standard deviation of the sample means is equal to the true (or population) standard deviation of the individual observations divided by the square root of the sample size.
   Note: As the sample size n increases, the true standard deviation of the sample means, $\sigma_{\bar{x}}$, decreases. Or, the spread of the sampling distribution of the sample mean decreases as the sample size increases.
3. When the population distribution of the individual observations (or x values) is normal, the sampling distribution of the sample means is also normal for any sample size. In other words, the sample size n does not have to be large for the sampling distribution of the sample means to be normal, as long as the population distribution of the individual observations (or x values) is normal.
4. Central Limit Theorem (CLT): When the sample size n is sufficiently large, the sampling distribution of the sample means is well approximated by a normal curve, even when the
population distribution of the individual observations (or x values) is not itself normal.

Recall: A variable is standardized by subtracting its mean and then dividing by its standard deviation.
Note: If n is large or the population distribution is normal, the standardized variable
$z = \dfrac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
has approximately a standard normal (z) distribution.
Rule of Thumb: The CLT can safely be applied if the sample size n exceeds 30.

Section 8.3

Recall:
S = success: an individual or object that possesses the property of interest.
F = failure: an individual or object that does not possess the property of interest.
$\pi = \dfrac{\text{\# of successes in the population}}{N}$ = proportion of S's in the population
$p = \dfrac{\text{\# of successes in the sample}}{n}$ = proportion of S's in the sample
Note: If we record a Success as a 1 and a Failure as a 0 ('S' = 1, 'F' = 0), then $p = \bar{x}$ of the 0/1 data.
Note: The sampling distribution of p depends on both n and $\pi$.
1. The larger the sample size n, the closer the sampling distribution of p is to normal.
2. The closer $\pi$ is to its extreme values 0 and 1, the larger the sample size necessary to get the sampling distribution of p to be approximately normal.
Rule of Thumb: If both $n\pi \ge 5$ and $n(1-\pi) \ge 5$, then it is safe to use the normal approximation.

Notation:
$\mu_p$ = mean value of the sample proportions of success
$\sigma_p$ = standard deviation of the sample proportions of success

Rules Concerning the p Sampling Distribution
1. $\mu_p = \pi$
   In words: The true (or population) mean of all possible sample proportions of success is equal to the true (or population) proportion of success.
2. $\sigma_p = \sqrt{\dfrac{\pi(1-\pi)}{n}}$
   In words: The true (or population) standard deviation of all possible sample proportions of success is equal to the square root of the following quantity: the population proportion of success, multiplied by 1 minus the population proportion of success, divided by the sample size.
   Note: $\sigma_p^2 = \dfrac{\pi(1-\pi)}{n}$
3. When n is large and $\pi$ is not too near 0 or 1, the sampling distribution of p is approximately normal.

Section 9.1
Point Estimate of a Population Characteristic: a single number that is based on sample data and represents a plausible value of the characteristic.
Notes: "^" means "estimate of". For example, $\hat{\mu}$ means "estimate of $\mu$".
1. $\hat{\mu} = \bar{x}$ is a point estimate of $\mu$; $\hat{\mu}$ = trimmed mean is a point estimate of $\mu$; $\hat{\mu}$ = sample median is a point estimate of $\mu$.
2. $\hat{\sigma}^2 = s^2$ is a point estimate of $\sigma^2$.
3. $\hat{\sigma} = s$ is a point estimate of $\sigma$.
4. $\hat{\pi} = p$ is a point estimate of $\pi$.

Unbiased Statistic: a statistic with mean value equal to the value of the population characteristic being estimated.
Biased Statistic: a statistic that is not unbiased.
Notes:
1. $\bar{x}$ is an unbiased statistic for $\mu$, since the mean of $\bar{x}$ is $\mu_{\bar{x}} = \mu$.
2. $s^2$ is an unbiased estimator of $\sigma^2$.
3. $s$ is a biased estimator of $\sigma$, but there are other good reasons for using it.
Note: Given a choice between several unbiased statistics that could be used for estimating a population characteristic, the best statistic to use is the one with the smallest standard deviation.

Sections 9.2 and 9.3

Common Values for $z_{\alpha/2}$
Confidence Level   $1-\alpha$   $\alpha$   $z_{\alpha/2}$
90%                .90          .10        1.645
95%                .95          .05        1.96
98%                .98          .02        2.33
99%                .99          .01        2.58

Notes:
1. Confidence Level of an interval estimate refers to the percentage of CI's, computed similarly, that will contain $\mu$. The more reliable the estimate is, the higher the percentage of CI's that will contain $\mu$.
2. Margin of Error (B) of an interval estimate refers to half the width of a CI, i.e., $\bar{x} \pm B$.

Population Proportion ($\pi$)
Recall:
1. Notation:
- $\pi$ = proportion of the population that possess the property of interest
- p = proportion of the sample that possess the property of interest; $p = \dfrac{\text{\# in sample that possess the property of interest}}{n}$
2. Properties of the Sampling Distribution of p:
- The sampling distribution of p is centered at $\pi$: $\mu_p = \pi$.
- The standard deviation of p is $\sigma_p = \sqrt{\dfrac{\pi(1-\pi)}{n}}$.
- If both $n\pi \ge 5$ and $n(1-\pi) \ge 5$, the sampling distribution of p is well approximated by the normal distribution.

Large Sample $(1-\alpha)100\%$ CI for $\pi$: If both $np \ge 5$ and $n(1-p) \ge 5$, a large sample CI for $\pi$ is
$p \pm z_{\alpha/2}\sqrt{\dfrac{p(1-p)}{n}}$

Sections 9.2 and 9.3

Confidence Interval (CI) for a Population Characteristic: an interval of plausible values for the characteristic. It is constructed so that, with a chosen degree of confidence, the value of the characteristic will be captured inside the interval.
Confidence Level Associated with a CI Estimate: specifies the success rate of the method used to construct the interval.

95% CI for $\mu$ when $\sigma$ is known: If (1) $X_1, \ldots, X_n$ are approximately normally distributed, or if (2) n is large ($n \ge 30$), a 95% CI for $\mu$ is
$\bar{x} \pm 1.96\,\dfrac{\sigma}{\sqrt{n}}$
[Figure: z curve.]
A couple of other ways to represent the CI are
$\left(\bar{x} - 1.96\,\dfrac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\,\dfrac{\sigma}{\sqrt{n}}\right)$ and $\bar{x} - 1.96\,\dfrac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96\,\dfrac{\sigma}{\sqrt{n}}$
Note: If the 95% CI method were used to generate an estimate over and over again with different samples, in the long run 95% of the resulting intervals would capture the true value of the characteristic being estimated. In regards to our 95% CI for $\mu$, approximately 95% of all samples will result in an $\bar{x}$ value that is within $1.96\,\dfrac{\sigma}{\sqrt{n}}$ of the population mean $\mu$.

General Interpretation for ALL CI's: Based on this sample, we can be [confidence level]% confident that the true [parameter of interest] lies between [lower bound of CI] and [upper bound of CI].
You are required to fill in all bracketed pieces (underlined in the original handout) in the above interpretation.

$(1-\alpha)100\%$ CI for $\mu$ when $\sigma$ is known: If (1) $X_1, \ldots, X_n$ are approximately normally distributed, or if (2) n is large ($n \ge 30$), a $(1-\alpha)100\%$ CI for $\mu$ is
$\bar{x} \pm z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$
where $z_{\alpha/2}$ is a value such that $P(-z_{\alpha/2} < z < z_{\alpha/2}) = 1 - \alpha$.

Section 9.3 (cont.)

t-Distribution: Let $x_1, x_2, \ldots, x_n$ constitute a random sample from a normal population distribution. Then the sampling distribution of the standardized variable
$t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$
is the t-distribution with $n-1$ degrees of freedom.
Note: When $\sigma$ is unknown, we have to estimate it with s. Consequently, using s in place of $\sigma$ introduces extra variability, so the t-distribution is more spread out than the standard normal (z) distribution.

Properties of t-Distributions
- The t curve
corresponding to any fixed number of degrees of freedom (df) is bell-shaped and centered at zero, just like the z curve.
o Any t curve is more spread out than the z curve.
o As the number of degrees of freedom increases, the spread of the corresponding t curve decreases.
o As the number of degrees of freedom gets very large, the corresponding sequence of t curves approaches the z curve.

$(1-\alpha)100\%$ CI for $\mu$ when $\sigma$ is unknown
If (1) $x_1, \dots, x_n$ are approximately normally distributed, or (2) n is large ($n \ge 30$), a $(1-\alpha)100\%$ CI for $\mu$ is
$\bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}$
where the t critical value is based on $n - 1$ degrees of freedom and is obtained from Appendix Table III. Remember, this CI is appropriate for small n only when the population distribution is at least approximately normal.

Sample Size Determination: The sample size required to estimate $\mu$ to within an amount B with $(1-\alpha)100\%$ confidence is
$n = \left(\frac{z_{\alpha/2}\,\sigma}{B}\right)^2$
where B is the margin of error. If $\sigma$ is unknown, it may be estimated based on previous data.
Note: ALWAYS ROUND A SAMPLE SIZE UP.

Summary of CIs

$(1-\alpha)100\%$ CI for $\mu$
o When $\sigma$ is known: If (1) $x_1, \dots, x_n$ are approximately normally distributed, or (2) n is large ($n \ge 30$), a $(1-\alpha)100\%$ CI for $\mu$ is
$\bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$
o When $\sigma$ is unknown: If (1) $x_1, \dots, x_n$ are approximately normally distributed, or (2) n is large ($n \ge 30$), a $(1-\alpha)100\%$ CI for $\mu$ is
$\bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}$

$(1-\alpha)100\%$ CI for $\pi$
When both $np \ge 5$ and $n(1-p) \ge 5$, a large-sample CI for $\pi$ is
$p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$

General Format for ANY CI
o If $\sigma$ is known: (point estimate using a specified statistic) $\pm$ (critical value) x (standard deviation of the statistic)
o If $\sigma$ is unknown: (point estimate using a specified statistic) $\pm$ (critical value) x (estimated standard deviation of the statistic)

Note: The estimated standard deviation of the statistic is generally referred to as the standard error (SE); e.g., the standard error of $\bar{x}$ is $SE(\bar{x}) = s/\sqrt{n}$. This is sometimes denoted $s_{\bar{x}}$.

10.1 Examples

Ex 1: Indicate whether the following pairs of null vs
alternative hypotheses are feasible.
i. H0: $\mu = 15$ vs. Ha: $\mu$ [?] 15 (Feasible)
ii. H0: $\pi = .4$ vs. Ha: $\pi$ [?] .6 (Not feasible)
iii. H0: $\mu = 123$ vs. Ha: $\mu < 123$ (Feasible)
iv. H0: $\mu = 123$ vs. Ha: $\mu = 125$ (Not feasible)
v. H0: $p = .1$ vs. Ha: $p$ [?] .1 (Not feasible)
vi. H0: $\pi = .14$ vs. Ha: $\pi < .14$ (Feasible)
vii. H0: $\mu > 15$ vs. Ha: $\mu$ [?] 15 (Not feasible)
viii. H0: $\mu = 15$ vs. Ha: $\mu$ [?] 15 (Feasible)

Ex 2: To determine whether the pipe welds in a nuclear power plant meet specifications, a random sample of welds is selected and tests are conducted on each weld in the sample. Weld strength is measured as the force required to break the weld. The specifications state that the mean strength of welds should exceed 100 lb/in$^2$. The inspection team wants to show that the welds meet specifications, so which hypotheses should be tested?
o H0: $\mu = 100$ vs. Ha: $\mu < 100$
o H0: $\mu = 100$ vs. Ha: $\mu > 100$
o H0: $\mu = 100$ vs. Ha: $\mu \ne 100$
Answer: H0: $\mu = 100$ vs. Ha: $\mu > 100$

Ex 3: A county commissioner must vote on a resolution that would commit substantial resources to the construction of a sewer in an outlying residential area. Her fiscal decisions have been criticized in the past, so she decides to take a survey of constituents to find out whether they favor spending money for a sewer system. She will vote to appropriate funds only if she can be fairly certain that a majority of the people in her district favor the measure. What hypotheses should she test?
Answer: H0: $\pi = .5$ vs. Ha: $\pi > .5$

Ex 4: The mean length of a long-distance telephone call placed with Sprint was known to be 73 minutes under the old rate structure. In an attempt to be more competitive with other companies, Sprint decided to lower its rates. The company hopes this will encourage customers to make longer calls, so that it does not lose revenue. Let $\mu$ denote the true mean length of long-distance calls under the new rate structure. What hypotheses should be tested to determine whether the mean length of long-distance phone calls has increased with the lowered rates?
Answer: H0: $\mu = 73$ vs. Ha: $\mu > 73$

Ex 5: Many older homes have electrical systems that use fuses rather than circuit breakers. A manufacturer of 40-amp fuses wants to make sure that the mean
current at which its fuses burn out is in fact 40. If the mean current is lower than 40, customers will complain because the fuses require replacement too often. If the mean current is higher than 40, the manufacturer might be liable for damage to an electrical system due to fuse malfunction. The manufacturer does not want to spend more time and money adjusting its process unless it can show that the mean is not 40 amps. A sample of fuses is to be selected and inspected. If a hypothesis test is to be performed on the resulting data, what null and alternative hypotheses would be of interest to the manufacturer?
Answer: H0: $\mu = 40$ vs. Ha: $\mu \ne 40$

Additional Problems: 10.5, 10.7

Sections 10.1, 10.2

Hypothesis: A claim or statement about the value of a population characteristic.

Null Hypothesis (H0): A claim about a population characteristic that is initially assumed to be true.

Alternative Hypothesis (Ha): The competing claim. Typically this is what we want to show is true.

Test of Hypotheses: A method for using sample data to decide between two competing claims (hypotheses) about a population characteristic.

Conclusions: We will reject H0 if the observed sample is very unlikely to have occurred when H0 is true.
1. Fail to Reject H0 if there is not convincing evidence against H0 (for Ha).
2. Reject H0 if there is convincing evidence against H0 (for Ha).
Note: H0 will be rejected only if the sample strongly suggests that H0 is false.

Interpretations:
1. Fail to Reject H0: there is not convincing evidence that Ha is true (not strong evidence against H0).
2. Reject H0: there is convincing evidence that Ha is true (strong evidence against H0).
Note: A statistical hypothesis test is only capable of demonstrating strong support for the alternative hypothesis. It cannot demonstrate strong support for the null hypothesis. This is why we say "Fail to Reject H0" instead of "Accept H0" or "Reject Ha".

Possible Hypotheses

Less Than Hypothesis Test
H0: Population Characteristic = Hypothesized Value
Ha: Population Characteristic <
Hypothesized Value
Example: H0: $\mu = 10$ vs. Ha: $\mu < 10$

Greater Than Hypothesis Test
H0: Population Characteristic = Hypothesized Value
Ha: Population Characteristic > Hypothesized Value
Example: H0: $\mu = 10$ vs. Ha: $\mu > 10$

Not Equal To Hypothesis Test
H0: Population Characteristic = Hypothesized Value
Ha: Population Characteristic $\ne$ Hypothesized Value
Example: H0: $\mu = 10$ vs. Ha: $\mu \ne 10$

Note: The hypothesized value in Ha must be the same as the hypothesized value in H0.

Test Procedure: The decision rule for the method used to determine whether or not H0 should be rejected.

Hypothesis Testing Errors
Type I Error: Reject H0 when H0 is true.
Type II Error: Fail to Reject H0 when H0 is false.

Error Rates
$\alpha$ = probability of a Type I error. $\alpha$ is also called the Level of Significance (LOS).
$\beta$ = probability of a Type II error.

Note: We typically choose $\alpha$ for a test based on how much risk we are willing to take in making a Type I error. Decreasing the risk of a Type I error (small $\alpha$) results in an increase in the risk of a Type II error (large $\beta$). Identify the largest $\alpha$ that is tolerable for the problem and use this largest $\alpha$ as the level of significance.

Sections 10.3, 10.4

Test Statistic: The function of sample data on which a conclusion to reject or fail to reject H0 is based.

P-value: A measure of inconsistency between the hypothesized value for a population characteristic and the observed sample. It is the probability, assuming H0 is true, of obtaining a test statistic value at least as contradictory to H0 as what actually resulted. Loosely speaking, the P-value tells us how surprising the observed sample would be if H0 were true; it is not the probability that H0 is true.
Note: Small P-values suggest H0 is false, while large P-values mean the sample is consistent with H0.

Decisions: Determine an appropriate Level of Significance $\alpha$. Then:
1. If P-value $\le \alpha$, Reject H0.
2. If P-value $> \alpha$, Fail to Reject H0.

Steps in Hypothesis Testing
1. Describe the population characteristic about which hypotheses are to be tested.
2. State the null hypothesis H0.
3. State the alternative hypothesis Ha.
4. Select the
significance level $\alpha$ for the test.
5. Display the test statistic to be used, substituting the hypothesized value identified in step 2.
6. Compute all quantities appearing in the test statistic, and then the value of the test statistic itself.
7. Determine the P-value associated with the observed value of the test statistic.
8. State the conclusion in the context of the problem. The level of significance should be included.

Test Procedure for $\mu$ when $\sigma$ is known (z Test)
If (1) $x_1, \dots, x_n$ are approximately normally distributed, or (2) n is large ($n \ge 30$):
For the hypotheses
H0: $\mu = \mu_0$ ($\mu_0$ is the hypothesized value)
Ha: $\mu \ne \mu_0$, or Ha: $\mu < \mu_0$, or Ha: $\mu > \mu_0$
the test statistic is
$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
where $z$ has a standard normal distribution when H0 is true. From it we can obtain the P-value, the probability of observing a sample at least this extreme if H0 were true.

Computing the P-value
1. Two-Tailed Test (z curve), Ha: $\mu \ne \mu_0$
P-value = sum of the areas in the two tails.
Note: For a two-tailed test,
o if the test statistic > 0, P-value = 2 x (area under the curve to the right of the test statistic);
o if the test statistic < 0, P-value = 2 x (area under the curve to the left of the test statistic).
2. Upper-Tailed Test (z curve), Ha: $\mu > \mu_0$
P-value = area in the upper tail.
Note: For an upper-tailed test, P-value = area under the curve to the right of the test statistic.
3. Lower-Tailed Test (z curve), Ha: $\mu < \mu_0$
P-value = area in the lower tail.
Note: For a lower-tailed test, P-value = area under the curve to the left of the test statistic.

Test Procedure for $\mu$ when $\sigma$ is unknown (t Test)
If (1) $x_1, \dots, x_n$ are approximately normally distributed, or (2) n is large ($n \ge 30$):
For the hypotheses
H0: $\mu = \mu_0$ ($\mu_0$ is the hypothesized value)
Ha: $\mu \ne \mu_0$, or Ha: $\mu < \mu_0$, or Ha: $\mu > \mu_0$
the test statistic is
$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$
where $t$ has a t distribution with $n - 1$ df when H0 is true.
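As a rough illustration of the z test and its three P-value rules, here is a minimal Python sketch. The function name `z_test`, the `alternative` labels, and the sample numbers are all assumptions made for this example; tail areas come from the standard library's `statistics.NormalDist` instead of a z table. The t test follows the same pattern with $s$ in place of $\sigma$ and a t curve with $n - 1$ df, but the standard library has no t distribution, so only the z test is shown.

```python
from math import sqrt
from statistics import NormalDist


def z_test(xbar, mu0, sigma, n, alternative="two-sided"):
    """Return (z, p_value) for H0: mu = mu0 when sigma is known."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()  # the z curve: mean 0, standard deviation 1
    if alternative == "greater":      # Ha: mu > mu0, area in the upper tail
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":       # Ha: mu < mu0, area in the lower tail
        p_value = std_normal.cdf(z)
    elif alternative == "two-sided":  # Ha: mu != mu0, sum of both tail areas
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
    return z, p_value


# Made-up numbers: xbar = 105 from n = 36 observations, sigma = 15, mu0 = 100.
alpha = 0.05
z, p_value = z_test(xbar=105, mu0=100, sigma=15, n=36, alternative="greater")
decision = "Reject H0" if p_value <= alpha else "Fail to Reject H0"
print(f"z = {z:.2f}, P-value = {p_value:.4f}, decision at alpha = {alpha}: {decision}")
```

Using `abs(z)` in the two-sided branch covers both cases in the note on two-tailed tests, since doubling the area beyond |z| is the same as doubling the right-tail area when z > 0 and the left-tail area when z < 0.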