STATISTICAL METHODS I STAT 515
This 124 page Class Notes was uploaded by Shane Marks on Monday October 26, 2015. The Class Notes belongs to STAT 515 at University of South Carolina - Columbia taught by Staff in Fall. Since its upload, it has received 32 views. For similar materials see /class/229669/stat-515-university-of-south-carolina-columbia in Statistics at University of South Carolina - Columbia.
Date Created: 10/26/15
STAT 515 Chapter 11: Regression

• Mostly we have studied the behavior of a single random variable.
• Often, however, we gather data on two random variables.
• We wish to determine: Is there a relationship between the two r.v.'s?
• Can we use the values of one r.v. to predict the other r.v.?

Often we assume a straight-line relationship between two variables.
• This is known as simple linear regression.

Probabilistic vs. Deterministic Models

If there is an exact relationship between two (or more) variables that can be predicted with certainty, without any random error, this is known as a deterministic relationship.

Examples:

In statistics, we usually deal with situations having random error, so exact predictions are not possible. This implies a probabilistic relationship between the 2 variables.

Example: Y = breathalyzer reading, X = amount of alcohol consumed (oz.)

• We typically assume the random errors balance out (they average zero).
• Then this is equivalent to assuming the mean of Y, denoted $E(Y)$, equals the deterministic component.

Straight-Line Regression Model

$Y = \beta_0 + \beta_1 X + \varepsilon$

Y = response variable (dependent variable)
X = predictor variable (independent variable)
$\varepsilon$ = random error component
$\beta_0$ = Y-intercept of regression line
$\beta_1$ = slope of regression line

Note that the deterministic component of this model is $E(Y) = \beta_0 + \beta_1 X$.

Typically, in practice, $\beta_0$ and $\beta_1$ are unknown parameters. We estimate them using the sample data.

Response Variable (Y): Measures the major outcome of interest in the study.
Predictor Variable (X): Another variable whose value explains, predicts, or is associated with the value of the response variable.

Fitting the Model (Least Squares Method)

If we gather data (X, Y) for several individuals, we can use these data to estimate $\beta_0$ and $\beta_1$ and thus estimate the linear relationship between Y and X.

First step: Decide if a straight-line relationship between Y and X makes sense. Plot the bivariate data using a scattergram (scatterplot).

Once we settle on the "best-fitting" regression line, its equation gives a predicted Y-value for any new X
value.

How do we decide, given a data set, which line is the "best-fitting" line? Note that usually no line will go through all the points in the data set.

For each point, the error = $Y - \hat{Y}$, the observed Y-value minus the Y-value predicted by the line. (Some positive errors, some negative errors.)

We want the line that makes these errors as small as possible (so that the line is "close" to the points).

Least-squares method: We choose the line that minimizes the sum of all the squared errors (SSE).

Least squares regression line: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$, where $\hat{\beta}_0$ and $\hat{\beta}_1$ are the estimates of $\beta_0$ and $\beta_1$ that produce the best-fitting line in the least squares sense.

Formulas for $\hat{\beta}_0$ and $\hat{\beta}_1$:

Estimated slope and intercept: $\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}}$ and $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$

where $SS_{xy} = \sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n}$ and $SS_{xx} = \sum X_i^2 - \frac{(\sum X_i)^2}{n}$, and n = the number of observations.

Example (Table 11.3):

Interpretations:
Slope:
Intercept:

Example:

Avoid extrapolation: predicting/interpreting the regression line for X-values outside the range of X in the data set.

Model Assumptions

Recall model equation: $Y = \beta_0 + \beta_1 X + \varepsilon$

To perform inference about our regression line, we need to make certain assumptions about the random error component $\varepsilon$. We assume:
1. The mean of the probability distribution of $\varepsilon$ is 0. (In the long run, the values of the random error part average zero.)
2. The variance of the probability distribution of $\varepsilon$ is constant for all values of X. We denote the variance of $\varepsilon$ by $\sigma^2$.
3. The probability distribution of $\varepsilon$ is normal.
4. The values of $\varepsilon$ for any two observed Y-values are independent: the value of $\varepsilon$ for one Y-value has no effect on the value of $\varepsilon$ for another Y-value.

Picture:

Estimating $\sigma^2$

Typically the error variance $\sigma^2$ is unknown. An unbiased estimate of $\sigma^2$ is the mean squared error (MSE), also denoted $s^2$ sometimes.

$MSE = \frac{SSE}{n - 2}$, where $SSE = SS_{yy} - \hat{\beta}_1 SS_{xy}$ and $SS_{yy} = \sum Y_i^2 - \frac{(\sum Y_i)^2}{n}$

Note that an estimate of $\sigma$ is $s = \sqrt{MSE} = \sqrt{\frac{SSE}{n - 2}}$

Since $\varepsilon$ has a normal distribution, we can say, for example, that about 95% of the observed Y-values fall within 2s units of the corresponding values $\hat{Y}$.

Testing the Usefulness of the Model

For the SLR model, $Y = \beta_0 + \beta_1 X + \varepsilon$.

Note: X is completely useless in helping to predict Y if
and only if $\beta_1 = 0$.

So to test the usefulness of the model for predicting Y, we test: $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$

If we reject $H_0$ and conclude $H_a$ is true, then we conclude that X does provide information for the prediction of Y.

Picture:

Recall that the estimate $\hat{\beta}_1$ is a statistic that depends on the sample data. This $\hat{\beta}_1$ has a sampling distribution.

If our four SLR assumptions hold, the sampling distribution of $\hat{\beta}_1$ is normal with mean $\beta_1$ and standard deviation $\sigma / \sqrt{SS_{xx}}$, which we estimate by $s / \sqrt{SS_{xx}}$.

Under $H_0: \beta_1 = 0$, the statistic $\frac{\hat{\beta}_1}{s / \sqrt{SS_{xx}}}$ has a t-distribution with n − 2 d.f.

Test for Model Usefulness

                    One-Tailed Tests                                      Two-Tailed Test
H0:                 $\beta_1 = 0$               $\beta_1 = 0$               $\beta_1 = 0$
Ha:                 $\beta_1 < 0$               $\beta_1 > 0$               $\beta_1 \neq 0$
Test statistic:     $t = \frac{\hat{\beta}_1}{s / \sqrt{SS_{xx}}}$ (the same for all three tests)
Rejection region:   $t < -t_{\alpha}$           $t > t_{\alpha}$            $t > t_{\alpha/2}$ or $t < -t_{\alpha/2}$
P-value:            left tail area outside t    right tail area outside t   2 × (tail area outside t)

Example: In the drug reaction example, recall $\hat{\beta}_1 = 0.7$. Is the real $\beta_1$ significantly different from 0? (Use $\alpha = .05$.)

A $100(1 - \alpha)\%$ Confidence Interval for the true slope $\beta_1$ is given by:

$\hat{\beta}_1 \pm t_{\alpha/2} \frac{s}{\sqrt{SS_{xx}}}$

where $t_{\alpha/2}$ is based on n − 2 d.f.

In our example, a 95% CI for $\beta_1$ is:

The scatterplot gives us a general idea about whether there is a linear relationship between two variables.

More precise: The coefficient of correlation (denoted r) is a numerical measure of the strength and direction of the linear relationship between two variables.

Formula for r (the correlation coefficient between two variables X and Y):

$r = \frac{SS_{xy}}{\sqrt{SS_{xx} SS_{yy}}}$

Most computer packages will also calculate the correlation coefficient.

Interpreting the correlation coefficient:
• Positive r => The two variables are positively associated (large values of one variable correspond to large values of the other variable).
• Negative r => The two variables are negatively associated (large values of one variable correspond to small values of the other variable).
• r = 0 => No linear association between the two variables.

Note: $-1 \le r \le 1$ always. How far r is from 0 measures the strength of the linear relationship.
• r nearly 1 => Strong positive relationship between the two variables
• r nearly −1 => Strong negative
relationship between the two variables
• r near 0 => Weak relationship between the two variables

Pictures:

Example (Drug/reaction time data):

Interpretation:

Notes:
(1) Correlation makes no distinction between predictor and response variables.
(2) Variables must be numerical to calculate r.

Examples: What would we expect the correlation to be if our two variables were:
(1) Work Experience & Salary?
(2) Weight of a Car & Gas Mileage?

Some Cautions

Example:
Speed of a car (X):    20  30  40  50  60
Mileage in mpg (Y):    24  28  30  28  24

Scatterplot of these data:

Calculation will show that r = 0 for these data. Are the two variables related?

Another caution: Correlation between two variables does not automatically imply that there is a cause-effect relationship between them.

The population correlation coefficient between two variables is denoted $\rho$. To test $H_0: \rho = 0$, we simply use the equivalent test of $H_0: \beta_1 = 0$ in the SLR model. If this null hypothesis is rejected, we conclude there is a significant correlation between the two variables.

The square of the correlation coefficient is called the coefficient of determination, $r^2$.

Interpretation: $r^2$ represents the proportion of sample variability in Y that is explained by its linear relationship with X.

$r^2 = 1 - \frac{SSE}{SS_{yy}}$   ($r^2$ is always between 0 and 1)

For the drug/reaction time example, $r^2$ =

Interpretation:

Estimation and Prediction with the Regression Model

Major goals in using the regression model:
(1) Determining the linear relationship between Y and X (accomplished through inferences about $\beta_1$)
(2) Estimating the mean value of Y, denoted $E(Y)$, for a particular value of X.
Example: Among all people with drug amount 3.5%, what is the estimated mean reaction time?
(3) Predicting the value of Y for a particular value of X.
Example: For a "new" individual having drug amount 3.5%, what is the predicted reaction time?

• The point estimate for these last two quantities is the same; it is:

Example:

• However, the variability associated with these point estimates is very different.
• Which quantity has more variability, a
single Y-value or the mean of many Y-values?

This is seen in the following formulas:

$100(1 - \alpha)\%$ Confidence Interval for the mean value of Y at $X = x_p$:

$\hat{Y} \pm t_{\alpha/2} \, s \sqrt{\frac{1}{n} + \frac{(x_p - \bar{X})^2}{SS_{xx}}}$

where $t_{\alpha/2}$ is based on n − 2 d.f.

$100(1 - \alpha)\%$ Prediction Interval for an individual new value of Y at $X = x_p$:

$\hat{Y} \pm t_{\alpha/2} \, s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{X})^2}{SS_{xx}}}$

where $t_{\alpha/2}$ is based on n − 2 d.f.

The extra "1" inside the square root shows the prediction interval is wider than the CI, although they have the same center.

Note: A Prediction Interval attempts to contain a random quantity, while a confidence interval attempts to contain a fixed parameter value.

The variability in our estimate of $E(Y)$ reflects the fact that we are merely estimating the unknown $\beta_0$ and $\beta_1$. The variability in our prediction of the new Y includes that variability, plus the natural variation in the Y-values.

Example (drug/reaction time data):
95% CI for $E(Y)$ with X = 3.5:
95% PI for a new Y having X = 3.5:

515 SAS TEMPLATES http://www.stat.sc.edu/~vesselin/515SAS.html

STAT 515 SAS Templates

This page contains sample code for using SAS in the course STAT 515 with the text Statistics, 9th Edition by McClave and Sincich. It is designed so that you can read along with the material below and copy and paste the SAS code (usually looking like typewriter font) directly from the web page into the SAS program editor window. The introduction section is written assuming you have no previous experience using SAS. In the later sections, the examples usually correspond to portions of the text book, and it would be helpful to have the book with you. In general, you will be able to use these templates to do your homework assignments after modifying them by entering your own data and making sure that all of the names match up.

Originally created by B. Habing. Last updated: 12/04

• Introduction to SAS
• Normal, t, Chi2, and F Tables (Section 5.3 and Supplement 6)
• CIs for Means and Variances (Section 7.2 and Supplement 7)
• CI for One Proportion (Section 7.3)
• One Sample t-test (Section 8.4)
• Test for One Proportion (Section 8.5)
• Two Sample t-test (Section 9.1)
• Paired t-test (Section 9.2)
• For Two Proportions (Section 9.3)
• One-way ANOVA (Section 10.2)
• Simple Linear Regression (Chapter 11 and Supplements)
• Goodness of Fit Test (Section 13.2)
• Two-Way Contingency Table (Section 13.3 and Supplement 13.3)
• Text Book Data Sets on CD
• Computer Trouble: SAS Won't Start; Graphs Printing Small in Word

Introduction to SAS

To begin with, you should open up the program SAS. If there is a SAS icon on the screen already, you can just double click that to start the program. If there is not a SAS icon on the screen already, you need to use the start menu. Click the Start rectangle in the bottom left of the screen with your left mouse button. This should open up a list of options above the Start rectangle, and you should slide the mouse arrow up to the one labeled Programs. This should open up a list of more options to the right, and you should keep sliding along until you get to the one labeled The SAS System for Windows. Once you get there, click on it. This should start up SAS.

There are three main windows that are used in SAS: the Log window, the Program Editor window, and the Output window. The Log and Program Editor windows are the two on the screen when you start up the program. The Output window isn't visible yet because you haven't created anything to output. If you happen to lose one of these windows, they usually have a bar at the bottom of the SAS window. You can also find them under the View menu.

9/5/2007 11:44 AM

The Program Editor is where you tell SAS what you want done. The Output window is where it puts the results, and the Log window is where it tells you what it did and if there are any errors. It is important to note that the Output window often gets very long! You usually want to copy the parts you want to print into MS Word and print from there. It is also important to note that you should check the Log window every time you run anything. (The errors will appear in maroon. Successful runs appear in blue.)

Hitting the F3 key at the top of your keyboard will run the
program currently in the Program Editor window. You can also run programs by clicking on the little image of the runner in the list of symbols near the top of the SAS program screen.

In older editions of SAS, running the program will erase whatever was written in the Program Editor window. To recall whatever was there, make sure you are in that window and hit the F4 key.

If you keep running more programs, it will keep adding it all to the Output window. To clear the Output window, make sure you are in that window and choose Clear text under the Edit menu.

If you happen to lose a window, try looking under the View menu.

The following is the SAS code for entering data about the starting salaries of a group of bank employees. The data consists of the beginning salaries of all 32 male and 61 female entry level clerical workers hired between 1969 and 1977 by a bank. (Yes, they really are annual salaries!) The data is reported in the book The Statistical Sleuth by Ramsey and Schafer, and is originally from: H.V. Roberts, "Harris Trust and Savings Bank: An Analysis of Employee Compensation" (1979), Report 7946, Center for Mathematical Studies in Business and Economics, University of Chicago Graduate School of Business.

The data is formatted in two columns: the first is the starting salary, the second is an id code, m for male and f for female. You can just cut and paste the code below, starting with OPTIONS and ending with the ; after all the numbers, into the Program Editor window.

OPTIONS pagesize=50 linesize=64;
DATA bankdata;
INPUT salary gender $ @@;
LABEL salary = "Starting Salary"
      gender = "m=male, f=female";
CARDS;
3900 f 4020 f 4290 f 4380 f 4380 f 4380 f 4380 f 4380 f
4440 f 4500 f 4500 f 4620 f 4800 f 4800 f 4800 f 4800 f
4800 f 4800 f 4800 f 4800 f 4800 f 4800 f 4980 f 5100 f
5100 f 5100 f 5100 f 5100 f 5100 f 5160 f 5220 f 5220 f
5280 f 5280 f 5280 f 5400 f 5400 f 5400 f 5400 f 5400 f
5400 f 5400 f 5400 f 5400 f 5400 f 5400 f
5400 f 5520 f 5520 f 5580 f 5640 f 5700 f 5700 f 5700 f
5700 f 5700 f 6000 f 6000 f 6120 f 6300 f 6300 f
4620 m 5040 m 5100 m 5100 m 5220 m 5400 m 5400 m 5400 m
5400 m 5400 m 5700 m 6000 m 6000 m 6000 m 6000 m 6000 m
6000 m 6000 m 6000 m 6000 m 6000 m 6000 m 6000 m 6000 m
6300 m 6600 m 6600 m 6600 m 6840 m 6900 m 6900 m 8100 m
;

Note that most lines end with a semicolon, but not all. SAS will crash if you miss one, but usually the Log window will tell you where the problem is. An extra/missing semicolon is probably the single largest reason for a SAS program crashing.

The OPTIONS line only needs to be used once during a session. It sets the length of the page and the length of the lines for viewing on the screen and printing. The font can be set by using the Options choice under the Tools menu along the top of the screen. When you cut and paste from SAS to a word processor, the font 10 point Courier New works well.

The DATA line defines what the name of the data set is. The name should start with a letter, have no spaces, and contain only letters, numbers, and underscores. The INPUT line gives the names of the variables, and they must be in the order that the data will be entered. The $ after gender on the INPUT line means that the variable gender is qualitative instead of quantitative. The @@ at the end of the INPUT line means that the variables will be entered right after each other on the same line with no returns (instead of needing one row for each person).

If we hit F3 at this point to enter what we put above, nothing new will appear on the output screen. This is no big surprise, however, once we realize that we haven't told SAS to return any output! The code below simply tells SAS to print back out the data we entered.

PROC PRINT DATA=bankdata;
TITLE "Gender Equity in Salaries";
RUN;

The only difficulty we have now is that it would be nice to look at both the men and women separately, so we need to be able to split the data up based on what's in the second column. The following lines will make two
separate data sets, male and female, and then print out the second one to make sure it is working right.

DATA male;
SET bankdata;
KEEP salary;
WHERE gender='m';
RUN;

DATA female;
SET bankdata;
KEEP salary;
WHERE gender='f';
RUN;

PROC PRINT DATA=female;
TITLE "Female Salaries";
RUN;

Whenever you have a DATA line, that means you are creating a new data set with that name. The SET line tells it that we are making this new data set from an old one. The KEEP line says the only variables we want in this new data set are the ones on that line. The lines after that give any special commands that go into the making of the new data set. In this case, the WHERE command is used to make sure we only keep one gender or the other. Later we will see examples of making data sets that involve using mathematical functions. In any case, it should be pretty straightforward when you just stop and read through what the lines say.

The most basic procedure to give out some actual graphs and statistics is PROC UNIVARIATE:

PROC UNIVARIATE DATA=female PLOT FREQ;
VAR salary;
TITLE 'Summary of the Female Salaries';
RUN;

The VAR line says which of the variables you want a summary of. Also note that the graphs here are pretty awful. The INSIGHT procedure will do most of the things that the UNIVARIATE procedure will, and a lot more. The one thing it won't do is be open to programming it to do new things. Later in the semester we'll see how some of the other procedures in SAS can be used to do things that aren't already programmed in.

PROC INSIGHT;
OPEN female;
DIST salary;
RUN;

You can cut and paste the graphs from PROC INSIGHT right into Microsoft Word. Simply click on the border of the box you want to copy with the left mouse button to select it. You can then cut and paste like normal. Before printing the graphs, you might want to adjust them so that they print at the correct size.

While
in PROC INSIGHT, clicking on the arrow in the bottom corner of each of the boxes gives you options for adjusting the graph's format. To quit PROC INSIGHT, click on the X in the upper right portion of the spreadsheet.

The graph at the top is a Box Plot (or a Box and Whisker Plot). The wide line in the middle is the median. The edges of the box are the 25th percentile (1st Quartile, Q1) and the 75th percentile (3rd Quartile, Q3). A percentile is the point where at least that percent of the observations are less than the percentile and (100 − that percent) are greater. The distance between the 75th percentile and 25th percentile is called the Interquartile Range (IQR = Q3 − Q1). It is a measure of how spread out the data is: the larger the IQR, the more spread out the middle of the data is. Extending beyond the edge of the box are the whiskers. The whiskers are allowed to go up to 1.5 IQRs from the edge of the box, and must end at a data point. The values beyond that are displayed as dots. They are possible outliers.

We can see from this box plot that about 1/4 of the data is less than 4800, about 1/4 is between 4800 and 5220, about 1/4 is between 5220 and 5400, and the last 1/4 is between 5400 and 6300. (Click on the boxes in the box plot to see the numbers.) We also see that there are no really extreme points. The box plot has the advantage that it is always drawn the same way (unlike a histogram), but the disadvantage that it doesn't show as much detail.

One thing that we can notice from the histogram and box plot is that the data does not look very symmetric (a.k.a. balanced); instead it looks slightly skewed to the left, that is, it looks like it collapsed a bit farther out in that direction. We can add a curve over the histogram to make it easier to compare to a bell-shaped or normal curve. Under the Curves menu, choose Parametric Density. Just hit ok on the box that pops up.

As mentioned before, one of the problems with the histogram is that the way it looks can be affected a lot by how the width of the bars is selected and where
they start and end. The box with the arrow in it at the lower left side of the histogram lets you control that. Click on that box, and then select Ticks. Change the 3800 to 3600 and the 6600 to 6400, and then click ok. Now try 3700 and 6700, and set the Tick Increment to 600.

Unlike the histogram, the QQ plot is not subject to options chosen by the user. Under the Curves menu, select QQ Ref Line, and then click ok in the window that comes up. The idea of the QQ plot is that it plots the actual data along the y-axis, and along the x-axis the values that the data would have if they were exactly the percentiles of a normal curve (bell curve). So, if the data is approximately like that of a bell curve, the points should look fairly close to a straight line. If not, they should be off.

Notice that this looks very close to a straight line. The data isn't actually that far from a bell curve (and could be made to look even closer in the histogram if we set the bars up just right). Because the data is close to being bell-curved, we will find that some very nice properties will hold. For example, it will be close to being symmetric (the mean and median differ by less than 100). We will see that the bell curve (or normal curve) will be very important throughout the semester, and so it will be useful to tell if the data follows a bell curve.

Let's change one of the values, though, so that the data appears less normal. Click on the spreadsheet and change the 3900 to 8900, the 4020 to 8020, the 4290 to 8290, and the first 4380 to 8380. Now note that those four points are rather extreme (as indicated on the box plot). You can also see the change in the QQ plot. And note that the mean is over 100 larger than the median (we only changed 4 out of 61 values!).

To compare both men and women, we could open PROC INSIGHT with the original data set. It is probably quickest to close it using the X in the spreadsheet, then go to the Solutions menu, the Analysis submenu, and then Interactive Data Analysis. Select WORK and then BANKDATA, and then hit Open. Under the
Analyze menu, choose Box Plot/Mosaic Plot (Y). Pick Salary for Y and Gender for X.

Section 5.3 and Supplement 6: Normal, t, Chi2, and F Tables

SAS has built-in functions that can calculate the values you find in the tables. Each distribution has one function that gives P(X < x0) for a given value x0, and is called the probability function. Each distribution also has another function that finds the value x0 for which P(X < x0) = p, and is called the quantile function.

Distribution        Quantile             Probability
Standard Normal     PROBIT(pct)          PROBNORM(val)
Chi2                CINV(pct,df)         PROBCHI(val,df)
t                   TINV(pct,df)         PROBT(val,df)
F                   FINV(pct,dfx,dfy)    PROBF(val,dfx,dfy)

In each case, the pct is the probability (percent of the area) that is less than the val, and df are the degrees of freedom. Notice that this is the opposite of what the chi2, t, and F tables in the book report: the tables in the book give the probability greater than the value.

The following code will obtain the answers for examples 5.2 to 5.10 in the text book.

DATA normanswers;
e5p2 = PROBNORM(1.33) - PROBNORM(-1.33);
e5p3 = 1 - PROBNORM(1.64);
e5p4 = PROBNORM(0.67);
e5p5 = PROBNORM(-1.96) + (1 - PROBNORM(1.96));
e5p6 = PROBNORM((12-10)/1.5) - PROBNORM((8-10)/1.5);
e5p7 = PROBNORM((20-27)/3);
e5p8 = PROBIT(1-0.1);
e5p9 = PROBIT(0.975);
e5p10 = 550 + 100*PROBIT(0.90);
PROC PRINT DATA=normanswers;
RUN;

Note that the answers differ slightly from the text because the text rounded.

The code works similarly for other distributions. Say the sample size was 10 (so that df = 9) and we were asked to find P(chi2 > 4.16816), P(2 < chi2 < 4), and the x0 such that P(chi2 > x0) = 0.005. The code to give these three answers would be as follows:

DATA chianswers;
a1 = 1 - PROBCHI(4.16816,9);
a2 = PROBCHI(4,9) - PROBCHI(2,9);
a3 = CINV(1-0.005,9);
PROC PRINT DATA=chianswers;
RUN;

You can compare the results of this code to what you would get using Table XI in the text. (Drawing the pictures should help!) While the Table can give us the answers for the first and third questions, for the second the best it can do is to say "between .100 and .050" minus "between .010 and .005", so
that the final answer is somewhere between .095 and .040.

Section 7.2 and Supplement 7: Confidence Intervals for Means and Variances

The following instructions and code will construct confidence intervals for the population mean, variance, and standard deviation from the data in Example 7.2 on page 301. The first step is to enter the data.

DATA examp7p2;
INPUT numb_chars @@;
CARDS;
1.13 1.55 1.43 0.92 1.25 1.36 1.32 0.85
1.07 1.48 1.20 1.33 1.18 1.22 1.29
;

We could then use PROC INSIGHT to analyze this data. You could start PROC INSIGHT by going to the Solutions menu and then the Analysis submenu, then choosing Interactive Data Analysis. In the box that comes up, choose WORK and examp7p2 and hit Open. You could also start PROC INSIGHT by using the following:

PROC INSIGHT;
OPEN examp7p2;
DIST numb_chars;
RUN;

We can add a QQ plot to the output by going under the Curves menu and selecting QQ Ref Line. Notice in this case that most of the points are fairly near the straight line. We would say: "The data seems to approximate a normal distribution, and so we can trust the results of the confidence intervals for the mean and variance." If it was less clear that it was approximately normal, then we could still have a reasonable amount of faith in the confidence interval for the mean, but not for the standard deviation. See supplement 7.3 for more on when we can trust the results of a confidence interval. See problem 5.45 on page 241 for three sample QQ plots (a.k.a. normal probability plots), where one looks good and two do not.

To construct the confidence intervals, go to the Tables menu and select Basic Confidence Intervals. If you choose your desired percent, all three will be added to the output window.

To report the results of this example, the easiest thing to do would be to open Microsoft Word and copy into it the code from the Program Editor window, the box containing the QQ plot, and the box containing the confidence intervals.

Section 7.3: Confidence Interval for One Proportion

To calculate a confidence interval for a proportion, we
can use SAS's ability to find probabilities for a normal distribution and simply type the formulas into a data step. The following code will calculate the Agresti and Coull corrected confidence interval in example 7.5 (page 312). It is set up so that you can enter the x, n, and confidence level after the CARDS line, in that order. Notice that we have simply entered the formulas exactly as they are in the book, and remembered to put in the semicolons.

DATA pinterval;
INPUT x n confcoeff;
phat = x/n;
pstar = (x+2)/(n+4);
alpha = 1 - confcoeff;
zalphaover2 = PROBIT(1-alpha/2);
plusorminus = zalphaover2*sqrt(pstar*(1-pstar)/(n+4));
lower = pstar - plusorminus;
upper = pstar + plusorminus;
KEEP x n confcoeff pstar plusorminus lower upper;
CARDS;
3 200 0.95
;
PROC PRINT DATA=pinterval;
RUN;

Section 8.4: One Sample t-test

The following code is an example of how you can get a test of hypotheses for a single mean. It is example 8.4 on page 343. Please note that PROC INSIGHT always gives the p-value for the alternate hypothesis "not equal to". If you want the p-value for < or >, then you must draw the picture and see how you need to change the p-value. The text gives instructions for this on page 344 in its Note.

DATA examp8p4;
INPUT hstays @@;
CARDS;
[the 100 data values for Example 8.4 are illegible in this copy; enter them here from the text]
;
PROC INSIGHT;
OPEN examp8p4;
DIST hstays;
RUN;

We can construct the QQ plot for this data by choosing QQ Ref Line under the Curves menu. In this case the data doesn't appear normal at all. It's very skewed to the right. Because of this, we can't fully trust the results, but with a sample size of 100 and the robustness of this t-test, we can still get some idea of whether the mean is 5 or is < 5.

To conduct the test of hypotheses, we can simply go to Tests for Location under the Tables menu. Select mu = 5 (our null hypothesis value in this example) and hit ok. This
gives the results of three tests. The first is our t-test for testing the alternate hypothesis of "not equal to". (The other two tests, the Sign Test and the Signed Rank test, are discussed in STAT 518.) Looking at the p-value, we can see that it is 0.2042, with a t-statistic of -1.28. Following the instructions on page 344, we cut that p-value in half for our hypotheses and get that the p-value is 0.1021. Comparing this to our alpha = 0.05, we fail to reject the null hypothesis.

Another way to get the one-sided p-values is to use the following code. It returns all three p-values (one for each possible alternate hypothesis) and you have to choose the correct one. The only portions you need to change are the name of the data set, the name of the variable, and the value after the CARDS line. The value after the CARDS line should be the value for the null hypothesis. If you read through the code, you should be able to make out several of the formulas.

PROC MEANS NOPRINT DATA=examp8p4;
VAR hstays;
OUTPUT OUT=temp MEAN=xbar STD=sd N=n;
RUN;

DATA temp2;
SET temp;
KEEP xbar mu sd n t pgreater pless ptwoside;
INPUT mu;
t = (xbar-mu)/(sd/sqrt(n));
df = n - 1;
pgreater = 1 - PROBT(t,df);
pless = PROBT(t,df);
ptwoside = 2*MIN(1-ABS(PROBT(t,df)),ABS(PROBT(t,df)));
CARDS;
5
;
PROC PRINT;
RUN;

Of course, you don't need to do it both ways. They'll always give you the same answer.

Section 8.5: Test for One Proportion

A test for one proportion can be conducted in the same manner that we made a confidence interval for one proportion. We can simply put all of the needed formulas into a data step. The p-values are calculated using the same ideas as the note on page 344. The code below will conduct the test for examples 8.7 and 8.8 on pages 355-357; the observed x, n, and the p from the null hypothesis appear in the CARDS section. Note that the values are slightly different because the book rounded. Also, instead of using the more complicated method to see if the sample size is large enough, it will calculate np and n(1-p)
instead.

DATA ptest;
INPUT x n pnull;
np = n*pnull;
n1minusp = n*(1-pnull);
phat = x/n;
z = (phat-pnull)/sqrt(pnull*(1-pnull)/n);
pgreater = 1 - PROBNORM(z);
pless = PROBNORM(z);
ptwoside = 2*MIN(1-ABS(PROBNORM(z)),ABS(PROBNORM(z)));
CARDS;
10 300 0.05
;
PROC PRINT DATA=ptest;
RUN;

Section 9.1: Two Sample t-test

The following sample code will analyze the data in Example 9.4 on page 387. Notice that we have to enter the group and the value for each observation, and that the group name is a word and so it needs a $.

DATA examp9p4;
INPUT group $ value @@;
CARDS;
new 80 new 80 new 79 new 81
stand 79 stand 62 stand 70 stand 68
new 76 new 66 new 71 new 76
stand 73 stand 76 stand 86 stand 73
new 70 new 85
stand 72 stand 68 stand 75 stand 66
;
PROC TTEST DATA=examp9p4;
CLASS group;
VAR value;
RUN;

The order you enter the groups determines which hypothesis is being tested. Because the new group was entered first, the procedure will look at the difference new - stand. By default, the procedure tests the hypothesis that this difference is equal to zero. To test that the difference is equal to some other value, say 5, you would add H0=5 to the first line of PROC TTEST, between examp9p4 and the semicolon.

The first three rows of the output contain the means, confidence intervals for the means, standard deviations, and confidence intervals for the standard deviations for the two groups individually and for the difference of the two groups. Notice that the third row gives the confidence interval (-1.399, 9.5327) that matches what the text found in the middle of page 388. (SAS uses 95% for CIs by default.) Also notice that we can find the pooled standard deviation on the third line: 6.119 is the square root of the pooled variance 37.45.

The next two rows of the output are for the two t-tests. The Pooled line is the case where the variances are equal, and the Satterthwaite line is for unequal variances. One important thing to note here is that SAS is doing the two-sided test, for the alternate hypothesis "not equal to". If you are testing either < or >, then you will need to
draw the picture and adjust the p-value by hand.

The final line is a test of the null hypothesis that the variances of the two groups are equal. The Pr>F column is the p-value for this test, so if it is a small value (less than alpha) you reject the null hypothesis that the variances are equal. For this example we see that the p-value is 0.8148, which is very large, and so we fail to reject that the variances are equal. However, this F test for two variances is not robust at all and should not be used.

In checking the assumption that the data is normal, we need to make sure we check it for BOTH samples. We can still use PROC INSIGHT for this.

    PROC INSIGHT;
      OPEN examp9p4;
    RUN;

When you choose Distribution (Y) under the Analyze menu, choose value for Y and choose group for group. This will put the information for both groups in the same window (use the scroll bar at the bottom to switch between them). If you add the q-q plot and q-q line, it will add them for both variables. In this case the second sample appears to look fairly normal, but the first one is a bit questionable. Even so, I would trust the two-sample t-test, since it is robust against slight violations.

A logical next question is "If I can't trust the F-test, then how do I know if I can use the test where we assume the variances are equal?" Because the two-sample t-test for two means is fairly robust, especially when the two samples are about equal sized (see pg 383), we could just compare the standard deviations, or we could use two boxplots like they do on page 382, and see if we believe the variances are equal or not. If we have our doubts though, especially since the sample sizes are not equal here, we should use the Satterthwaite formula.

NOTE: There are more robust tests available to see if two variances are equal than the F test. One of these that SAS can be made to perform in PROC GLM is called the "Modified Levene's Test" or the "Brown and Forsythe Test".
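As a hedged illustration of the NOTE above, the following sketch shows one way the Brown and Forsythe test might be requested in PROC GLM for the Example 9.4 data. It assumes the data set is named examp9p4 with the variables group and value, as in the PROC TTEST code; the HOVTEST=BF option on the MEANS statement is standard SAS/STAT syntax, but this block is not part of the original templates and is not required for STAT 515.

```sas
/* Re-enter the Example 9.4 data so the sketch runs on its own */
DATA examp9p4;
   INPUT group $ value @@;   /* $ = character variable, @@ = several observations per line */
CARDS;
new 80 new 80 new 79 new 81 stand 79 stand 62 stand 70 stand 68
new 76 new 66 new 71 new 76 stand 73 stand 76 stand 86 stand 73
new 70 new 85 stand 72 stand 68 stand 75 stand 66
;

/* One-way model with group as the only factor.               */
/* HOVTEST=BF asks for the Brown and Forsythe test of the     */
/* null hypothesis that the group variances are equal.        */
PROC GLM DATA=examp9p4;
   CLASS group;
   MODEL value = group;
   MEANS group / HOVTEST=BF;
RUN;
QUIT;
```

A small p-value in the Brown and Forsythe table would suggest that the variances differ, in which case the Satterthwaite line of the PROC TTEST output would be the one to read.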
Similarly, SAS can be made to perform tests of the null hypothesis "the data comes from a normally distributed population" vs. the alternate hypothesis that the population is not normal. Perhaps the best of these tests is the "Anderson-Darling Test". Because these tests are somewhat complicated we will not be covering them in STAT 515, but if you ever need a test of either of these hypotheses, a statistical consultant can show you how to conduct them.

Section 9.2 Paired t-test

The following code will conduct the paired t-test shown on pages 399-400 (the data is in Table 9.4). Remember that the p-value is for the two-sided alternate hypothesis and would need to be adjusted for testing either > or <.

    DATA learner;
      INPUT new standard;
    CARDS;
    77 72
    74 68
    82 76
    73 68
    87 84
    69 68
    66 61
    80 76
    ;
    PROC TTEST DATA=learner;
      PAIRED new*standard;
    RUN;

Another way to conduct this t-test is to remember that it is simply a one-sample t-test on the differences. In PROC INSIGHT you could choose Variables under the Edit menu. Select the Other... option. We can now make a new variable that is equal to new-standard. Click on Y-X in the transformation box, then select new for the Y value, standard for the X value, and click on OK. You can now choose Distribution (Y) under the Analyze menu and simply do a one-sample t-test on the new variable you created.

Section 9.3 Tests and Confidence Intervals for Two Proportions

Tests and confidence intervals for two proportions can be made in a similar fashion to how we made tests and confidence intervals for one proportion. The following code will calculate the confidence interval discussed on pages 511 and 512.

    DATA cin;
      INPUT x1 n1 x2 n2 confcoeff;
      p1h = x1/n1;
      p2h = x2/n2;
      alpha = 1-confcoeff;
      zalphaover2 = PROBIT(1-alpha/2);   /* upper alpha/2 point of the standard normal */
      diff = p1h-p2h;
      plusorminus = zalphaover2*sqrt(p1h*(1-p1h)/n1 + p2h*(1-p2h)/n2);
      lower = diff-plusorminus;
      upper = diff+plusorminus;
      KEEP confcoeff diff plusorminus lower upper;
    CARDS;
    546 1000 475 1000 0.95
    ;
    PROC PRINT DATA=cin;
    RUN;

This code performs
the test of hypotheses for the data in examples 9.6 and 9.7 on pages 413-415.

    DATA test2p;
      INPUT x1 n1 x2 n2;
      p1h = x1/n1;
      p2h = x2/n2;
      ph = (x1+x2)/(n1+n2);              /* pooled estimate under the null hypothesis */
      diff = p1h-p2h;
      z = (p1h-p2h)/sqrt(ph*(1-ph)/n1 + ph*(1-ph)/n2);
      pgreater = 1 - PROBNORM(z);
      pless = PROBNORM(z);
      ptwoside = 2*MIN(1-ABS(PROBNORM(z)),ABS(PROBNORM(z)));
      KEEP ph z pgreater pless ptwoside;
    CARDS;
    555 1500 578 1750
    ;
    PROC PRINT DATA=test2p;
    RUN;

Section 10.2 One-way Analysis of Variance

The following code will analyze the data in examples 10.3 and 10.4. Notice that each observation must have both a group identifier and a measurement. Recall that the $ tells SAS that the variable before it is characters, not numbers, and the @@ means that more than one observation occurs on a line.

    DATA iron;
      INPUT brand $ dist @@;
    CARDS;
    A 251.2 B 263.2 C 269.7 D 251.6
    A 245.1 B 262.9 C 263.2 D 248.6
    A 248.0 B 265.0 C 277.5 D 249.4
    A 251.1 B 254.5 C 267.4 D 242.0
    A 260.5 B 264.3 C 270.5 D 246.5
    A 250.0 B 257.0 C 265.5 D 251.3
    A 253.9 B 262.8 C 270.7 D 261.8
    A 244.6 B 264.4 C 272.9 D 249.0
    A 254.6 B 260.6 C 275.6 D 247.1
    A 248.8 B 255.9 C 266.5 D 245.9
    ;
    PROC INSIGHT;
      OPEN iron;
      FIT dist = brand;
    RUN;

This output gives you the ANOVA table automatically. To check the assumptions, we need to verify that the variances seem to be approximately equal and that the distributions seem approximately normal. Remember that you also need for the samples to be random and independent, but you can't really check that from the plots.

One way to check whether the groups seem to have the same variance is to use the residual versus predicted plot at the bottom left of the PROC INSIGHT window. This is basically like Figure 10.10 on page 457, except that it is turned on its side and all of the columns have been made to have the same mean. If the columns (four, in this example) all look equally spread out, then we would say it looks like the variances are the same.

Another way to check whether the groups seem to have the same variances is to use side-by-side box plots like we did for the two-sample t-test. This is similar to Figure 10.10,
but using Box Plots instead of Dot Plots. Under the Analyze menu choose Box Plot/Mosaic Plot (Y). Select dist for Y and brand for X. This plot shows us both how the means differ (C looks larger than the others) and that all have roughly the same variance (the size of each box plot isn't that much different from the others).

Finally, you also could have chosen Distribution (Y) under the Analyze menu, with dist for Y and brand for group. This will calculate the standard deviation for each group, and we could conclude that since all the sds are between 3.8 and 5.2, they are close enough that the procedure should work well. Any of these three methods is acceptable, and you do not need to do all of them.

To check if each of the samples looks like it came from a normal distribution, we really should make a separate q-q plot for each group, just like for the two-sample t-test. Under the Analyze menu choose Distribution (Y). Select dist for Y and brand for group. You can then choose QQ Ref Line under the Curves menu. To get rid of all the extra information, you could deselect the various "checked" tables and graphs under the Tables and Graphs menus. For this example, Brands A and C look very close to normal, Brand B looks a little odd, and Brand D appears to have one outlier. Since the F test for ANOVA is robust, however, none of these seem bad enough not to trust the test.

Unfortunately these can be hard to check if there weren't many observations in each sample. An alternative approach is to use the combined q-q plot that will show up at the bottom of the PROC INSIGHT window that has the ANOVA table in it. To add that plot to the output, choose Residual Normal QQ under the Graphs menu.

NOTE: PROC GLM is another commonly used method of performing both Analysis of Variance and regression. It will construct the ANOVA table and many other statistics that we will use in STAT 516; it won't construct the various graphs, however. The code we would
use with this data and PROC GLM would be:

    PROC GLM DATA=iron;
      CLASS brand;
      MODEL dist=brand;
    RUN;

The last page of this output is the same as that on the top of page 453, except that it was generated using PROC GLM instead of PROC ANOVA.

Chapter 11 and Supplements: Simple Linear Regression

The following code will calculate all of the statistics for the data in Table 11.1 on page 514 and give the output that is discussed in Sections 11.2 through 11.8.

    DATA stimulus;
      INPUT amount_x reaction_y;
    CARDS;
    1 1
    2 1
    3 2
    4 2
    5 4
    ;
    PROC INSIGHT;
      OPEN stimulus;
      FIT reaction_y = amount_x;
    RUN;

To generate the q-q plot for the residuals, go to Residual Normal QQ under the Graphs menu. To generate the confidence interval for the slope, select C.I. / C.I. (Wald) for Parameters under the Tables menu. To get the prediction intervals and confidence interval for the regression line like Figure 11.24 on page 556, you can simply go to Confidence Curves under the Curves menu and select the type you want. You can put both on the same graph. This picture will not let you see the actual values though. To get output like that in Figures 11.19 and 11.20, you would use the following code:

    PROC GLM DATA=stimulus;
      MODEL reaction_y = amount_x / ALPHA=0.05 CLI;
    RUN;

    PROC GLM DATA=stimulus;
      MODEL reaction_y = amount_x / ALPHA=0.05 CLM;
    RUN;

where CLI is for the individual prediction interval and CLM is for the mean (or regression line).

If we wanted to get the intervals for a new x value, say x=4.5, we would add a new data point to the data set and rerun the PROC GLM code. The . for the y-value tells SAS that it shouldn't be included in the calculation of the regression line.

    DATA stimulus;
      INPUT amount_x reaction_y;
    CARDS;
    1 1
    2 1
    3 2
    4 2
    4.5 .
    5 4
    ;

Section 13.2 Goodness of Fit Test

The following code will analyze the data in example 13.2 on page 710.

    DATA ex13p2;
      INPUT opinion $ count;
    CARDS;
    legal 39
    decrim 99
    exist 336
    noopin 26
    ;
    PROC FREQ DATA=ex13p2 ORDER=data;
      TABLES opinion / TESTP=(.07,.18,.65,.10);
      WEIGHT count;
    RUN;

Instead of
using TESTP (test proportion), you also could use TESTF (test frequency). In this case you would put the expected values in instead of the proportions.

One further complication with PROC FREQ in SAS is that it doesn't handle observed values of zero well. If there is a cell that was 0, use the value 0.00001 instead. This way SAS will actually recognize that it is a cell, and it won't throw the test statistic off by very much.

Section 13.3 Two-way Contingency Tables

The following code will work example 13.3 on pages 719-720.

    DATA ex13p3;
      INPUT rel $ marit $ count;
    CARDS;
    A D 39
    B D 19
    C D 12
    D D 28
    None D 18
    A Never 172
    B Never 61
    C Never 44
    D Never 70
    None Never 37
    ;
    PROC FREQ DATA=ex13p3;
      WEIGHT count;
      TABLES rel*marit / chisq expected nopercent;
    RUN;

The Text Book Data

The various data sets used in the text book can be found on the CD in the back of the text, in text files. The key to the various names in this directory can be found in Appendix B of the text on pages 823-827.

Computer Trouble

In most cases help with the computers (NOT the programming) can be gained by e-mailing help@stat.sc.edu. For the printers on the first and second floor, printer paper is available in the Stat Department office. For printers on the third floor, paper is available in the Math Department office.

If you are using a PC, restarting the machine will fix many problems, but obviously don't try that if you have a file that won't save or the like.

If SAS won't start, one of the things to check is that your computer has loaded the X drive correctly (whatever that means). Go to My Computer and see if the apps on 'lc-nt' (X:) is listed as one of the drives. If it isn't, go to the Tools menu and select Map Network Drive. Select X for the drive and enter \\lc-nt\apps for the Folder. Then click Finish. This should connect your computer to the X drive and allow SAS to run. If you already had the X drive connected, then you will need to e-mail help@stat.sc.edu.

If your graphs print out extremely small
after you copy them to Word, you might be able to fix the problem by "opening and closing" the image. In Word, left click once on the image and select Edit Picture or Open Picture Object under the Edit menu. A separate window will open with the image in it. Simply choose Close Picture. It should now print out ok. This will also make the spacing between the characters in the labels look right if they were somewhat off.

If the problem is an emergency requiring immediate attention, see the statistics department computer person in room 209D. If they are not available and it is an emergency, see Minna Moore in room 417. Flagrantly non-emergency cases may result in suspension of computer privileges.

11/5/2007

Chapter 9: Inferences Based on Two Samples: Confidence Intervals and Tests of Hypothesis

Identifying the Target Parameter
    Parameter            Key Words or Phrases
    $\mu_1 - \mu_2$      Mean difference; difference in averages
    $p_1 - p_2$          Difference between proportions, percentages, fractions, or rates

Comparing Two Population Means: Paired Difference Experiments
o Is the mean reading test score using the new method greater than the mean reading test score using the old method?
o $H_0$: $(\mu_1 - \mu_2) = 0$; $H_a$: $(\mu_1 - \mu_2) > 0$

Comparing Two Population Means: Paired Difference Experiments
o Because the samples are not independent of each other, a new technique is used.
o A new variable d is created.
o Testing with the new variable d: $H_0$: $\mu_d = 0$; $H_a$: $\mu_d > 0$

Comparing Two Population Means: Paired Difference Experiments
o Testing is now based on a one-sample t statistic: $t = \dfrac{\bar d - 0}{s_d/\sqrt{n_d}}$
o $\bar d$ = sample mean difference
o $s_d$ = sample standard deviation of differences
o $n_d$ = number of differences = number of pairs

Paired Samples
o Intervention Studies
o Before and After Experiments

Comparing Two Population Means: Paired Difference Experiments
o This type of experiment (paired observations) is called a paired difference experiment.
o Pairing removes differences between pairs, days in this
case, and focuses on differences within pairs (sales).
o Comparisons within groups is called blocking.
o A paired difference experiment is a randomized block experiment.

Paired Difference Test of Hypothesis for $\mu_d = \mu_1 - \mu_2$: Small Sample
    One-Tailed Test: $H_0$: $\mu_d = D_0$; $H_a$: $\mu_d < D_0$ [or $H_a$: $\mu_d > D_0$]
    Two-Tailed Test: $H_0$: $\mu_d = D_0$; $H_a$: $\mu_d \neq D_0$
    Test statistic: $t = \dfrac{\bar d - D_0}{s_d/\sqrt{n_d}}$
    Rejection region (one-tailed): $t < -t_\alpha$ [or $t > t_\alpha$ when $H_a$: $\mu_d > D_0$]
    Rejection region (two-tailed): $|t| > t_{\alpha/2}$
    where $t_\alpha$ and $t_{\alpha/2}$ are based on $(n_d - 1)$ degrees of freedom.

Paired Difference Confidence Interval for $\mu_d = \mu_1 - \mu_2$
    Large Sample: $\bar d \pm z_{\alpha/2}\,\dfrac{\sigma_d}{\sqrt{n_d}} \approx \bar d \pm z_{\alpha/2}\,\dfrac{s_d}{\sqrt{n_d}}$
    Small Sample: $\bar d \pm t_{\alpha/2}\,\dfrac{s_d}{\sqrt{n_d}}$, where $t_{\alpha/2}$ is based on $(n_d - 1)$ degrees of freedom.

Conditions for Valid Large-Sample Inferences about $\mu_d$
    1. Random sample of differences selected
    2. Sample size is large ($n_d > 30$)

Conditions for Valid Small-Sample Inferences about $\mu_d$
    1. Random sample of differences selected
    2. Population of differences has a distribution that is approximately normal

Paired Difference Test of Hypothesis for $\mu_d = \mu_1 - \mu_2$: Large Sample
    One-Tailed Test: $H_0$: $\mu_d = D_0$; $H_a$: $\mu_d < D_0$ [or $H_a$: $\mu_d > D_0$]
    Two-Tailed Test: $H_0$: $\mu_d = D_0$; $H_a$: $\mu_d \neq D_0$
    Test statistic: $z = \dfrac{\bar d - D_0}{\sigma_d/\sqrt{n_d}} \approx \dfrac{\bar d - D_0}{s_d/\sqrt{n_d}}$
    Rejection region (one-tailed): $z < -z_\alpha$ [or $z > z_\alpha$ when $H_a$: $\mu_d > D_0$]
    Rejection region (two-tailed): $|z| > z_{\alpha/2}$

Two Samples: Comparing Proportions

Comparing Two Population Proportions: Independent Sampling
Properties of the Sampling Distribution of $(\hat p_1 - \hat p_2)$
o The mean of the sampling distribution of $(\hat p_1 - \hat p_2)$ is $(p_1 - p_2)$: $E(\hat p_1 - \hat p_2) = p_1 - p_2$, so $(\hat p_1 - \hat p_2)$ is an unbiased estimator of $(p_1 - p_2)$.
o The standard deviation of the sampling distribution of $(\hat p_1 - \hat p_2)$ is $\sigma_{(\hat p_1 - \hat p_2)} = \sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}$.
o If $n_1$ and $n_2$ are large, the sampling distribution of $(\hat p_1 - \hat p_2)$ is approximately normal.

Large-Sample Test of Hypothesis about $(p_1 - p_2)$
    One-Tailed Test: $H_0$: $(p_1 - p_2) = 0$; $H_a$: $(p_1 - p_2) < 0$ [or $H_a$: $(p_1 - p_2) > 0$]
    Two-Tailed Test: $H_0$: $(p_1 - p_2) = 0$; $H_a$: $(p_1 - p_2) \neq 0$
    Test statistic: $z = \dfrac{\hat p_1 - \hat p_2}{\sigma_{(\hat p_1 - \hat p_2)}}$
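    Rejection region (one-tailed): $z < -z_\alpha$ [or $z > z_\alpha$ when $H_a$: $(p_1 - p_2) > 0$]
    Rejection region (two-tailed): $|z| > z_{\alpha/2}$
    Note: $\sigma_{(\hat p_1 - \hat p_2)} \approx \sqrt{\hat p \hat q \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$, where $\hat p = \dfrac{x_1 + x_2}{n_1 + n_2}$

Large-Sample $100(1-\alpha)\%$ Confidence Interval for $(p_1 - p_2)$
    $(\hat p_1 - \hat p_2) \pm z_{\alpha/2}\,\sigma_{(\hat p_1 - \hat p_2)} \approx (\hat p_1 - \hat p_2) \pm z_{\alpha/2}\sqrt{\dfrac{\hat p_1 \hat q_1}{n_1} + \dfrac{\hat p_2 \hat q_2}{n_2}}$

Conditions Required for Valid Large-Sample Inferences about $(p_1 - p_2)$
o Independent, randomly selected samples
o Sample sizes $n_1$ and $n_2$ are sufficiently large so that the sampling distribution of $(\hat p_1 - \hat p_2)$ will be approximately normal

STAT 515 Chapter 7 Supplement
Brian Habing, University of South Carolina
Last Updated: October 16, 2003

S7 Confidence Intervals for Variances

Just as we can use the t distribution to form a confidence interval for the mean of a population, we can use the $\chi^2$ and F distributions to make confidence intervals for the variance of a population, or the ratio of two variances. The logic is the same in both cases: use the sampling distribution to form a probability statement containing just the one unknown parameter, and then solve for that parameter.

S7.1 The Confidence Interval for $\sigma$ or $\sigma^2$

In section S6.2 we saw that if the random sample was drawn from a population that was normally distributed, then

    $\dfrac{(n-1)s^2}{\sigma^2} \sim \chi^2_{df=n-1}$

If we choose $\chi^2_{\alpha/2}$ to be the value such that $P(\chi^2_{n-1} \geq \chi^2_{\alpha/2}) = \alpha/2$, and $\chi^2_{1-\alpha/2}$ to be the value such that $P(\chi^2_{n-1} \leq \chi^2_{1-\alpha/2}) = \alpha/2$, we get the following:

    $P\left(\chi^2_{1-\alpha/2} \leq \dfrac{(n-1)s^2}{\sigma^2} \leq \chi^2_{\alpha/2}\right) = 1 - \alpha$    (S1)

This is illustrated in the figure below.

[Figure: sampling distribution of $(n-1)s^2/\sigma^2$ for a random sample from a normal population, with $\chi^2_{1-\alpha/2}$ and $\chi^2_{\alpha/2}$ marked on the horizontal axis.]

We can now solve the inequality in (S1) for $\sigma^2$ to get the confidence interval:

    $P\left(\chi^2_{1-\alpha/2} \leq \dfrac{(n-1)s^2}{\sigma^2} \leq \chi^2_{\alpha/2}\right) = 1 - \alpha$
    $P\left(\dfrac{1}{\chi^2_{1-\alpha/2}} \geq \dfrac{\sigma^2}{(n-1)s^2} \geq \dfrac{1}{\chi^2_{\alpha/2}}\right) = 1 - \alpha$
    $P\left(\dfrac{(n-1)s^2}{\chi^2_{\alpha/2}} \leq \sigma^2 \leq \dfrac{(n-1)s^2}{\chi^2_{1-\alpha/2}}\right) = 1 - \alpha$

The $(1-\alpha)100\%$ confidence interval for the population variance is thus

    $\left(\dfrac{(n-1)s^2}{\chi^2_{\alpha/2}},\ \dfrac{(n-1)s^2}{\chi^2_{1-\alpha/2}}\right)$    (S2)

This can be changed to a confidence interval for the population standard deviation simply by taking the square root of both sides:

    $\left(\sqrt{\dfrac{(n-1)s^2}{\chi^2_{\alpha/2}}},\ \sqrt{\dfrac{(n-1)s^2}{\chi^2_{1-\alpha/2}}}\right)$    (S3)

Say we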
are interested in the standard deviation of a certain population. A sample of size n = 12 (df = 11) is gathered, $s^2$ is found to be 20.2, the q-q plot for the data looks fairly normal, and it is desired to construct a 95% confidence interval. Using (S3), we see that we need to determine $\chi^2_{\alpha/2}$ and $\chi^2_{1-\alpha/2}$, where $\alpha/2 = 0.025$ and $1 - \alpha/2 = 1 - 0.025 = 0.975$. From Table VII we see that the values are $\chi^2_{0.025} = 21.9200$ and $\chi^2_{0.975} = 3.81575$. Plugging these values into (S3) gives

    $\left(\sqrt{\dfrac{(11)(20.2)}{21.9200}},\ \sqrt{\dfrac{(11)(20.2)}{3.81575}}\right) = \left(\sqrt{10.137},\ \sqrt{58.232}\right) = (3.18,\ 7.63)$

S7.2 The Confidence Interval for $\sigma_x^2/\sigma_y^2$

A confidence interval for the ratio of two variances could be constructed in the same way as one for a single variance. From S6.4 we see that if two samples $x_1, x_2, \ldots, x_{n_x}$ and $y_1, y_2, \ldots, y_{n_y}$ are drawn randomly from two populations that seem normal, then

    $\dfrac{s_x^2/\sigma_x^2}{s_y^2/\sigma_y^2} \sim F_{df_x = n_x - 1,\ df_y = n_y - 1}$

If we choose $F_{\alpha/2}$ to be the value such that $P(F_{df_x = n_x - 1,\ df_y = n_y - 1} \geq F_{\alpha/2}) = \alpha/2$, and $F_{1-\alpha/2}$ to be the value such that $P(F_{df_x = n_x - 1,\ df_y = n_y - 1} \leq F_{1-\alpha/2}) = \alpha/2$, we get the following:

    $P\left(F_{1-\alpha/2} \leq \dfrac{s_x^2/\sigma_x^2}{s_y^2/\sigma_y^2} \leq F_{\alpha/2}\right) = 1 - \alpha$    (S4)

Solving for $\sigma_x^2/\sigma_y^2$ then gives us the $(1-\alpha)100\%$ confidence interval

    $\left(\dfrac{s_x^2}{s_y^2} \cdot \dfrac{1}{F_{\alpha/2}},\ \dfrac{s_x^2}{s_y^2} \cdot \dfrac{1}{F_{1-\alpha/2}}\right)$    (S5)

It might be good practice to see if you can work out the steps between (S4) and (S5).

Note that the text only gives the table for finding the $F_{\alpha/2}$ values, and not the $F_{1-\alpha/2}$ ones. One way of getting around this would be to use SAS; just remember that SAS gives the area in the lower end of the table, while the text and the formulas above use the upper end. Another way is to use the following relationship:

    $F_{(1-\alpha/2),\ df_x = n_x - 1,\ df_y = n_y - 1} = \dfrac{1}{F_{(\alpha/2),\ df_x = n_y - 1,\ df_y = n_x - 1}}$

That is, to find the $F_{0.975}$ you flip the degrees of freedom and take $1/F_{0.025}$. So, if you wanted a 95% confidence interval when $n_x = 6$ and $n_y = 5$, the two values you would use are 9.36 for $F_{0.025}$ (straight from the table for df of 5 and 4) and 1/7.39 = 0.135 for $F_{0.975}$ (one over the value from the table for the reversed df of 4 and 5).

S7.3 Robustness: What if the Data Isn't Normal?

A statistical procedure is called robust if it performs well even when its assumptions aren't met. In the case of using the t distribution to make inferences about the mean, the
$\chi^2$ distribution for the variance, and the F distribution for two variances, we need to assume that the initial populations were normally distributed. The procedures that use the t distribution are fairly robust, however. That is, the procedures involving the use of the t distribution to make inferences about $\mu$ work fairly well even when the data isn't normally distributed. In general they will work well for small sample sizes ($n \leq 30$) even if there are some doubts about the q-q plot. For large sample sizes they will work well for all but the worst q-q plots.

The procedures discussed in S6.1 and S6.2 for making inferences about population variances are not robust at all. If there are any questions about the q-q plots, they should not be used.

It is important to note that in Chapters 10, 11, and 13 we will see other uses of the $\chi^2$ and F distributions that are robust. A particular distribution is never robust or non-robust; robustness is a property of the entire procedure that you are attempting.

In this course we are only covering some of the most common and basic methods for making inferences about a population. A variety of other methods are discussed in STAT 518, Nonparametric Statistical Methods. A method is called nonparametric if it does not depend on the assumption that the data follows a particular distribution (like the normal distribution). There are two reasons for not simply always using nonparametric methods. One is that they are somewhat more complicated to explain (see Section 14.2 for example). The second reason is that while the nonparametric tests are generally better when the assumptions are badly violated, the standard methods we are learning here are better when the assumptions are met.

10/18/2007

Chapter 9: Inferences Based on Two Samples: Confidence Intervals and Tests of Hypothesis

Comparing Two Population Means: Independent Sampling
Large Sample Confidence Interval for $(\mu_1 - \mu_2)$
    $(\bar x_1 - \bar x_2) \pm z_{\alpha/2}\,\sigma_{(\bar x_1 - \bar x_2)} = (\bar x_1 - \bar x_2) \pm z_{\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$
    assuming independent sampling, which provides the following
substitution: $\sigma_{(\bar x_1 - \bar x_2)} = \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}} \approx \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$

Identifying the Target Parameter
    Parameter            Key Words or Phrases
    $\mu_1 - \mu_2$      Mean difference; difference in averages
    $p_1 - p_2$          Difference between proportions, percentages, fractions, or rates

Comparing Two Population Means: Independent Sampling
Properties of the Sampling Distribution of $(\bar x_1 - \bar x_2)$
o The mean of the sampling distribution of $(\bar x_1 - \bar x_2)$ is $(\mu_1 - \mu_2)$.
o Assuming the two samples are independent, the standard deviation of the sampling distribution is $\sigma_{(\bar x_1 - \bar x_2)} = \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$.
o The sampling distribution of $(\bar x_1 - \bar x_2)$ is approximately normal for large samples by the CLT.

Comparing Two Population Means: Independent Sampling
o Confidence intervals and hypothesis testing can be done for both large and small samples.
o Large sample cases use the z-statistic; small sample cases use the t-statistic.
o When comparing two population means, we test the difference between the means.

Large Sample Test of Hypothesis for $(\mu_1 - \mu_2)$
    One-Tailed Test: $H_0$: $(\mu_1 - \mu_2) = D_0$; $H_a$: $(\mu_1 - \mu_2) < D_0$ [or $H_a$: $(\mu_1 - \mu_2) > D_0$]
    Two-Tailed Test: $H_0$: $(\mu_1 - \mu_2) = D_0$; $H_a$: $(\mu_1 - \mu_2) \neq D_0$
    where $D_0$ = hypothesized difference between the means
    Test statistic: $z = \dfrac{(\bar x_1 - \bar x_2) - D_0}{\sigma_{(\bar x_1 - \bar x_2)}}$, where $\sigma_{(\bar x_1 - \bar x_2)} = \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$
    Rejection region (one-tailed): $z < -z_\alpha$ [or $z > z_\alpha$ when $H_a$: $(\mu_1 - \mu_2) > D_0$]
    Rejection region (two-tailed): $|z| > z_{\alpha/2}$

Required Conditions for Valid Large-Sample Inferences about $(\mu_1 - \mu_2)$
    1. Random, independent sample selection
    2. Sample sizes are both at least 30, to guarantee that the CLT applies to the distribution of $(\bar x_1 - \bar x_2)$

Required Conditions for Valid Small-Sample Inferences about $(\mu_1 - \mu_2)$
    1. Random, independent sample selection
    2. Approximate normal distribution of both sampled populations
    3. Population variances are equal ($\sigma_1^2 = \sigma_2^2$)

Small Sample Confidence Interval for $(\mu_1 - \mu_2)$
    $(\bar x_1 - \bar x_2) \pm t_{\alpha/2}\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$
    where $s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$ and $t_{\alpha/2}$ is based on $(n_1 + n_2 - 2)$ degrees of freedom.

Small Samples: Assume $\sigma_1^2$
$\neq \sigma_2^2$
    Confidence interval: $(\bar x_1 - \bar x_2) \pm t_{\alpha/2}\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$
    Test statistic for $H_0$: $(\mu_1 - \mu_2) = 0$: $t = \dfrac{\bar x_1 - \bar x_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$
    where t is based on $\nu$ degrees of freedom, $\nu = \dfrac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$

Comparing Two Population Means: Independent Sampling
Small Sample Test of Hypothesis for $(\mu_1 - \mu_2)$
    One-Tailed Test: $H_0$: $(\mu_1 - \mu_2) = D_0$; $H_a$: $(\mu_1 - \mu_2) < D_0$ [or $H_a$: $(\mu_1 - \mu_2) > D_0$]
    Two-Tailed Test: $H_0$: $(\mu_1 - \mu_2) = D_0$; $H_a$: $(\mu_1 - \mu_2) \neq D_0$
    where $D_0$ = hypothesized difference between the means
    Test statistic: $t = \dfrac{(\bar x_1 - \bar x_2) - D_0}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$
    Rejection region (one-tailed): $t < -t_\alpha$ [or $t > t_\alpha$ when $H_a$: $(\mu_1 - \mu_2) > D_0$]
    Rejection region (two-tailed): $|t| > t_{\alpha/2}$
    where $t_\alpha$ and $t_{\alpha/2}$ are based on $(n_1 + n_2 - 2)$ degrees of freedom.

STAT 515 Annotations to the Text
Brian Habing, University of South Carolina
Last Updated: December 31, 2003

In several places the textbook uses some shorthand notation that might seem confusing on first glance. There are also a few places where it presents things in a more complicated fashion than it needs to. The notes below are given by section number and will hopefully make reading the text a bit easier once in a while. There are also seven supplements available on the course web page that contain additional material. Chapters and sections with supplements are noted below in the appropriate place.

Section 3.8: Notice that we can think of the permutations rule (page 156) and the combinations rule (page 158) as both being special cases of the partitions rule (page 157). A permutation is a partition where there are n groups of size 1 and one group of size N-n. This is the case where each of the first n selected are selected for different purposes (and the order is important), but we don't care about the order of those not selected. A combination is a partition where one group is of size n and the other is of size N-n. This is the case where the first n are interchangeable with each other, and the remaining N-n are also interchangeable.

Section 4.4: q = 1-p is just shorthand. Don't let the q throw you off.

Section 5.4: We will always use method 4 in the box on page 237, the normal probability plot (also called the q-q plot). Exercise 5.45 on page 241 gives
three sample q-q plots with answers in the back of the text. Histograms (method 1) are bad because they are easily manipulated by choosing different class intervals. Figuring out the percentages (method 2) is time consuming. Comparing the IQR to s (method 3) isn't built into most packages and doesn't give as much information as the q-q plot.

Section 5.5: The explanations in this section are a lot more complicated than they need to be. The entire key to the section can be seen by simply reading the material on page 244 and at the top of page 245.

First, notice in Figure 5.18 that they are looking for the probability that the binomial random variable x will be less than or equal to 10. In order to get all of the histogram bars for 10 and less, they need to take everything from 10.5 and smaller. If you started at 10 you would miss half of the 10 bar. If you started at 11 you would include an extra half of the 11 bar. If they had asked <10 we would have needed to start at 9.5; if they had asked >10 we would have needed to start at 10.5; and if they had asked $\geq$10 we would have needed to start at 9.5. This going up or down by 0.5 is the continuity correction, and the safest way to see which way you need to go is to draw the picture.

Second, notice that they are using $\mu = \mu_x = np$ and $\sigma = \sigma_x = \sqrt{np(1-p)}$ because we are looking at the binomial distribution (box on the bottom of page 194).

Third, an easier rule for seeing if n is large enough is to say that n is large enough for the normal approximation to work if both $np \geq 5$ and $n(1-p) \geq 5$. Technically this condition is weaker than the one using $3\sigma$, but it seems to work fairly well in practice and will match a rule we will use in Chapter 13. The accuracy of the normal approximation to the binomial, even for large n, is a topic of current research by statisticians, and we will see in section 7.3 that some tricks can be done to make it work even better in certain circumstances.

Applying this to example 5.12 we would do the following:

1) $np = 200(0.06) = 12 \geq 5$ and $n(1-p) = 200(1-0.06) = 200(0.94) = 188 \geq 5$, so the sample
size is large enough for the normal approximation to the binomial to be reasonable. Notice that we've already found $\mu = np = 12$. Just using the formula for $\sigma$ gives us $\sigma = \sqrt{np(1-p)} = \sqrt{200(0.06)(1-0.06)} = \sqrt{11.28} = 3.359$.

2) If we want $P(x \geq 20)$ then we want to include the 20 bar, so we need to start at 19.5. (This is the uncolored-in area in Figure 5.20; there is no reason to switch it around like the book does.)

    $P(x \geq 20) \approx P(x \geq 19.5) = P\left(\dfrac{x - \mu}{\sigma} \geq \dfrac{19.5 - 12}{3.359}\right) = P(z \geq 2.23)$

This is now just a probability to look up on the normal table, and we get 0.5 - 0.4871 = 0.0129.

Section 6.3: When reading the examples in this section it is important to note the blue box on the bottom of page 274 that says that $\mu_{\bar x} = \mu$ and that $\sigma_{\bar x} = \sigma/\sqrt{n}$.

See the supplement: Chapter 6 More on Sampling Distributions (t, chi-square and F)

Section 7.1: As in section 6.3, note that $\mu_{\bar x} = \mu$ and that $\sigma_{\bar x} = \sigma/\sqrt{n}$. Also, since we virtually never know $\sigma$ and since we can always use the t-table, you should probably never use $\bar x \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ or $\bar x \pm z_{\alpha/2}\,s/\sqrt{n}$. Instead you should always use the $\bar x \pm t_{\alpha/2}\,s/\sqrt{n}$ that is discussed in section 7.2 (unless for some bizarre reason you actually know $\sigma$).

Section 7.3: In the boxes on page 309 notice first that $\sigma_{\hat p} = \sqrt{pq/n} = \sqrt{p(1-p)/n}$, which we approximate by $\sqrt{\hat p(1-\hat p)/n}$ if we aren't given p (and we never are for a confidence interval). Second, notice that we can again use the $np \geq 5$ and $n(1-p) \geq 5$ rule discussed above instead of the $\hat p \pm 3\sigma_{\hat p}$ rule; we'll just have to use the $\hat p$ instead of the p. Finally, it is probably always better to use the Agresti and Coull formula in the box on the bottom of page 311 than it is to use the formulas in the box on the bottom of page 309. The correction basically says: pretend your sample was four larger and that two of those four were successes. The reason this works is fairly complicated to explain, but computer simulation studies can be performed that show that, in a large number of cases, the normal approximation to the binomial works better using this correction in the context of confidence intervals for a single percentage. We will not use it for a test of hypotheses or a
confidence interval for two proportions Section 74 The entire key to this section can be found in the top equations in the blue boxes on page 316 and 318 The basic idea is to simply set the plus or minus portion of the confidence interval equal to the size you want it to be The second equations in each box are gotten simply by solving the first equation for n See the sugglement Chapter 7 Confidence Intervals for Variances Section 81 Again note that a 65 Section 82 Just like with the confidence intervals we will never have the 039 so we should 9 c S uo 1nstead of z uo A J always use I Section 83 The note about pvalues on page 344 can be important Section 85 In the boxes on page 309 notice rst that U J JM and unlike for n n con dence intervals we actually have p Second notice that we can again use the anS and n1p 2 5 rule discussed above instead ofthe f7 i 30 rule See the sugglement Section 86 Power Curves Section 91 Just like in sections 71 and 82 we will never use the large sample formulas with z and 6discussed on pages 380385 Instead use the tformula on page 386 if the variances are equal Ifthe variances are not equal use the box on page 390 Section 92 Again do not use the large sample formula in the box on page 401 P1Q1P292 P11P1P21P2 quot1 quot2 quot1 quot2 I When we are making a con dence interval we know nothing about the values of p1 and p2 except Section 93 Note that 6 17 2 what we have in the sample In this case we just substitute in 71 and 72 When we are making a test of the hypothesis that the two populations have equal percentages however it doesn t make sense to put in two different values because we are assuming they are the same In this case A x x we subst1tute 1n p for both p1 and p2 n1 n2 See the sugglements Section 102 The ANOVA Table Section 113 Checking the Regression Assumptions Section 115 The ANOVA Table for Regression Section 133 ChiSquare Test for Homogeneity STAT 515 Annotations to the Text Brian Habing University of South Carolina Last 
Updated December 31 2003 In several places the textbook uses some shorthand notation that might seem confusing on first glance There are also a few places where it presents things in a more complicated fashion than it needs to The notes below are given by section number and will hopefully make reading the text a bit easier once in a while There are also seven supplements available on the course web page that contain additional material Chapters and sections with supplements are noted below in the appropriate place Section 38 Notice that we can think of the permutations rule page 156 and the combinations rule page 158 as both being special cases of the partitions rule page 157 A permutation is a partition where there are 71 groups of size 1 and one group of size Nn This is the case where each of the first 71 selected are selected for different purposes and the order is important but we don t care about the order of those not selected A combination is a partition where one group is of size n and the other is of size N n This is the case where the first 71 are interchangeable with each other and the remaining N n are also interchangeable Section 44 qlp is just shorthand Don t let the q throw you off Section 54 We will always use method 4 in the box on page 237 the normal probability plot also called the qq plot Exercise 545 on page 241 gives three sample qq plots with answers in the back of the text Histograms method 1 are bad because they are easily manipulated by choosing different class intervals Figuring out the percentages method 2 is time consuming Comparing the IQR to s method 3 isn t built into most packages and doesn t give as much information as the qq plot Section 55 The explanations in this section are a lot more complicated than they need to be The entire key to the section can be seen by simply reading the material on page 244 and at the top ofpage 245 First notice in Figure 518 that they are looking for the probability that the binomial random variable 
$x$ will be less than or equal to 10. In order to get all of the histogram bars for 10 and less, they need to take everything from 10.5 and smaller. If you started at 10 you would miss half of the 10 bar. If you started at 11 you would include an extra half of the 11 bar. If they had asked for $x < 10$ we would have needed to start at 9.5; if they had asked for $x > 10$ we would have needed to start at 10.5; and if they had asked for $x \ge 10$ we would have needed to start at 9.5. This going up or down by 0.5 is the continuity correction, and the safest way to see which way you need to go is to draw the picture.

Second, notice that they are using $\mu = \mu_x = np$ and $\sigma = \sigma_x = \sqrt{np(1-p)}$ because we are looking at the binomial distribution (box on the bottom of page 194).

Third, an easier rule for seeing if $n$ is large enough is to say that $n$ is large enough for the normal approximation to work if both $np \ge 5$ and $n(1-p) \ge 5$. Technically this condition is weaker than the one using $3\sigma$, but it seems to work fairly well in practice and will match a rule we will use in Chapter 13. The accuracy of the normal approximation to the binomial (even for large $n$) is a topic of current research by statisticians, and we will see in Section 7.3 that some tricks can be done to make it work even better in certain circumstances.

Applying this to Example 5.12 we would do the following:

1. $np = 200(0.06) = 12 \ge 5$ and $n(1-p) = 200(1 - 0.06) = 200(0.94) = 188 \ge 5$, so the sample size is large enough for the normal approximation to the binomial to be reasonable. Notice that we've already found $\mu = np = 12$. Just using the formula for $\sigma$ gives us $\sigma = \sqrt{np(1-p)} = \sqrt{200(0.06)(1 - 0.06)} = \sqrt{11.28} \approx 3.359$.

2. If we want $P(x \ge 20)$ then we want to include the 20 bar, so we need to start at 19.5. (This is the uncolored-in area in Figure 5.20; there is no reason to switch it around like the book does.)

$P(x \ge 20) \approx P(x \ge 19.5) = P\left(\frac{x - \mu}{\sigma} \ge \frac{19.5 - 12}{3.359}\right) = P(z \ge 2.23)$

This is now just a probability to look up on the normal table, and we get $0.5 - 0.4871 = 0.0129$.

Section 6.3: When reading the examples in this section it is important to note the blue box on the bottom of page 274 that says that $\mu_{\bar{x}} = \mu$
and that $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.

See the supplement: Chapter 6, More on Sampling Distributions (t, chi-square, and F).

Section 7.1: As in Section 6.3, note that $\mu_{\bar{x}} = \mu$ and that $\sigma_{\bar{x}} = \sigma/\sqrt{n}$. Also, since we virtually never know $\sigma$, and since we can always use the t-table, you should probably never use $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ or $\bar{x} \pm z_{\alpha/2}\,s/\sqrt{n}$. Instead you should always use $\bar{x} \pm t_{\alpha/2}\,s/\sqrt{n}$, which is discussed in Section 7.2 (unless for some bizarre reason you actually know $\sigma$).

Section 7.3: In the boxes on page 309, notice first that $\sigma_{\hat{p}} = \sqrt{pq/n} = \sqrt{p(1-p)/n}$, which we approximate by $\sqrt{\hat{p}(1-\hat{p})/n}$ if we aren't given $p$ (and we never are for a confidence interval). Second, notice that we can again use the $np \ge 5$ and $n(1-p) \ge 5$ rule discussed above instead of the $\hat{p} \pm 3\sigma_{\hat{p}}$ rule; we'll just have to use $\hat{p}$ instead of $p$. Finally, it is probably always better to use the Agresti and Coull formula in the box on the bottom of page 311 than to use the formulas in the box on the bottom of page 309. The correction basically says to pretend your sample was four larger and that two of those four were successes. The reason this works is fairly complicated to explain, but computer simulation studies can be performed that show that in a large number of cases the normal approximation to the binomial works better using this correction in the context of confidence intervals for a single percentage. We will not use it for a test of hypotheses or a confidence interval for two proportions.

Section 7.4: The entire key to this section can be found in the top equations in the blue boxes on pages 316 and 318. The basic idea is to simply set the plus or minus portion of the confidence interval equal to the size you want it to be. The second equations in each box are gotten simply by solving the first equation for $n$.

See the supplement: Chapter 7, Confidence Intervals for Variances.

Section 8.1: Again note that $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.

Section 8.2: Just like with the confidence intervals, we will never have the $\sigma$, so we should always use $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ instead of $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$.

Section 8.3: The note about p-values on page 344 can
be important.

Section 8.5: In the boxes on page 309, notice first that $\sigma_{\hat{p}} = \sqrt{pq/n} = \sqrt{p(1-p)/n}$ and, unlike for confidence intervals, we actually have $p$. Second, notice that we can again use the $np \ge 5$ and $n(1-p) \ge 5$ rule discussed above instead of the $\hat{p} \pm 3\sigma_{\hat{p}}$ rule.

See the supplement: Section 8.6, Power Curves.

Section 9.1: Just like in Sections 7.1 and 8.2, we will never use the large sample formulas with $z$ and $\sigma$ discussed on pages 380-385. Instead use the t-formula on page 386 if the variances are equal. If the variances are not equal, use the box on page 390.

Section 9.2: Again, do not use the large sample formula in the box on page 401.

Section 9.3: Note that

$\sigma_{(\hat{p}_1 - \hat{p}_2)} = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$

When we are making a confidence interval we know nothing about the values of $p_1$ and $p_2$ except what we have in the sample. In this case we just substitute in $\hat{p}_1$ and $\hat{p}_2$. When we are making a test of the hypothesis that the two populations have equal percentages, however, it doesn't make sense to put in two different values because we are assuming they are the same. In this case we substitute in $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$ for both $p_1$ and $p_2$.

See the supplements: Section 10.2, The ANOVA Table; Section 11.3, Checking the Regression Assumptions; Section 11.5, The ANOVA Table for Regression; Section 13.3, Chi-Square Test for Homogeneity.

STAT 515 Chapter 6 Supplement
Brian Habing, University of South Carolina
Last Updated January 11, 2002

S6 More on Sampling Distributions: The $t$, $\chi^2$, and $F$ Distributions

As we saw in Section 6.3, the normal distribution plays a pivotal role in describing how the sample mean $\bar{x}$ will behave when you have a random sample $x_1, x_2, \ldots, x_n$. Unfortunately the central limit theorem only applies when the sample size is large. Additionally, it only tells us about the sampling distribution of the sample mean and not about the sampling distribution of the sample variance $s^2$. These limitations can be overcome if we can believe that the sample was taken from a population that was normal to begin with. That is, if we apply the methods in Section 5.4 and
verify the data is normal, we can get the sampling distribution for $\bar{x}$ when $n$ is small and can also get the sampling distribution for $s^2$.

S6.1 $\bar{x}$ and the Normal Distribution

A fact that is proved in STAT 512 is: if the random sample is drawn from a population that follows a normal distribution, then

$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$

is exactly standard normal. In other words, if the base population is already normal, the central limit theorem result applies even when $n = 1$. The only difficulty in this is that we rarely, if ever, know the value of the parameter $\sigma$. Because of this we can't use this fact directly.

S6.2 $s^2$ and the $\chi^2$ (chi-squared) Distribution

The $\chi^2$ distribution can be defined as follows: if $z_1, z_2, \ldots, z_{n-1}$ are independent and each follows the standard normal distribution, then

$\chi^2 = z_1^2 + z_2^2 + \cdots + z_{n-1}^2$

follows the $\chi^2$ distribution with $n - 1$ degrees of freedom. The table for this distribution and a typical picture of it can be found in Table VII on page 810. This distribution is skewed to the right, has mean $n - 1$, variance $2(n - 1)$, and takes all values 0 and higher. (The normal, on the other hand, takes all positive and negative values.)

The usefulness of this distribution becomes a bit clearer if we again consider the random sample $x_1, x_2, \ldots, x_n$ from a normal distribution. Looking at the formula for $s^2$,

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

we can see that we are squaring a bunch of independent normal random variables (the $x_i - \bar{x}$) and summing them up. The only reason that this isn't a $\chi^2$ random variable is that they aren't standard normal, and we are dividing by $n - 1$. By multiplying both sides of the above equation by $n - 1$ and dividing by $\sigma^2$ we get the following:

$\frac{(n-1)s^2}{\sigma^2} = \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{\sigma}\right)^2$

If the $\bar{x}$ on the right side of the equation were replaced by $\mu$, then we would be summing up a bunch of $z^2$, and it would be the sum of $n$ squared standard normals, making a $\chi^2$ random variable with $n$ degrees of freedom. Because we are using $\bar{x}$ instead of $\mu$ we lose one degree of freedom, and so

$\frac{(n-1)s^2}{\sigma^2} = \chi^2_{df=n-1}$   (S1)

where the df in the subscript is the number of degrees of freedom. If the data come from a sample that
is normal, we know that the left hand side of equation S1 behaves as a $\chi^2$ random variable, we know the value $n - 1$ because it is based on the sample size, and we know the value of $s^2$ because we can calculate it from the data. If we solve this equation for $\sigma^2$ we could then get information about this unknown population parameter. We will discuss this more in Chapters 7 and 8.

Table VII on pages 810-811 of the text gives several of the values of the $\chi^2$ random variable for a variety of degrees of freedom. Each row of this table corresponds to a different number of degrees of freedom (df). Remember that you need to look at $df = n - 1$ if you are using the sample variance to investigate the population variance. The rest of the table is set up the opposite of the normal table. The values along the top are the probabilities (the areas that are shaded in on the figure), and the body of the table contains the $\chi^2$ values that go with those probabilities. Because each df needs its own row, no table can possibly contain all of the possible values. (A normal table can because we can change any normal to a standard normal; there is no such simplification for the $\chi^2$.)

Say we wanted to know $P(\chi^2_{df=8} \le 1.344)$. Looking at the $df = 8$ row of Table VII, we see that 1.344 corresponds to a probability of 0.995. The probability in the table is for greater than 1.344 though, so we have to take $1 - 0.995$ and get a value of 0.005.

Many of the values are not in the table, however. If we wanted to know $P(\chi^2_{df=8} \ge 2.5)$, the closest values in the table are 2.179 and 2.733. Because of this, all we can say about the probability is that it is between 0.950 and 0.975. If an exact value is needed then a computer package such as SAS would be used instead.

We can also use the table in the reverse direction. The $c_0$ such that $P(\chi^2_{df=8} \ge c_0) = 0.010$ is 20.0902. The $c_0$ such that $P(\chi^2_{df=8} \le c_0) = 0.010$ is 1.646482.

S6.3 $\bar{x}$, $s$, and Student's $t$ Distribution

As noted in S6.1, the difficulty with the central limit theorem is that it requires us to know $\sigma$. The tool we use to deal with this is the $t$
distribution that was discovered in 1908 by chemist William Gosset at the Guinness brewery in Ireland. Because he didn't want employees at other breweries to know that he found statistics useful, he published his results under the pseudonym Student. Hence, the distribution is often known as Student's $t$ distribution.

A $t$ distribution is formed by dividing a standard normal by the square root of a $\chi^2$ over its degrees of freedom, where the normal and the $\chi^2$ are independent:

$t_{df} = \frac{z}{\sqrt{\chi^2_{df}/df}}$   (S2)

At first this seems to be more than a little bit out of nowhere. A fact proved in STAT 712 sheds some light on why it is useful, however. If the sample $x_1, x_2, \ldots, x_n$ is independent and comes from a normal distribution, then its sample mean $\bar{x}$ and sample variance $s^2$ are independent. For such a sample, S6.1 showed that $\bar{x}$ is related to a standard normal distribution, and S6.2 showed that $s^2$ is related to a $\chi^2$ distribution. Combining these previous results gives

$t_{df=n-1} = \frac{z}{\sqrt{\chi^2_{df=n-1}/(n-1)}} = \frac{\dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}}{\sqrt{\dfrac{(n-1)s^2/\sigma^2}{n-1}}}$

By cancelling the $n - 1$ terms in the denominator, applying the square root, multiplying both the numerator and denominator by one over the denominator, and cancelling, we get

$t_{df=n-1} = \frac{\bar{x} - \mu}{s/\sqrt{n}}$   (S3)

Just as we could solve equation S1 to find out information about the population variance, we can solve equation S3 to find out information about the population mean, if the sample comes from a population that follows the normal distribution. This usage is discussed more in Section 7.2.

One useful fact about the $t$ distribution is that it becomes very similar to the standard normal distribution as the sample size $n$ increases. Many tables for the $t$ distribution stop at 30 degrees of freedom and simply refer the user to a standard normal table. Our Table VI continues on past 30, but does not give all the values. Notice that the values change very little from one row to the next after about row eighteen. If you go to the bottom row, all of the values should be recognizable from the normal table.

S6.4 Two Variances and the $F$ Distribution

The final sampling distribution we
will be concerned with is the $F$ distribution. The $F$ distribution is defined by

$F_{df_x=n_x-1,\,df_y=n_y-1} = \frac{\chi^2_x/(n_x-1)}{\chi^2_y/(n_y-1)}$   (S4)

where $\chi^2_x$ and $\chi^2_y$ are independent $\chi^2$ random variables with $n_x - 1$ and $n_y - 1$ degrees of freedom respectively. Because it is formed by using two $\chi^2$ random variables, the $F$ distribution has two separate degrees of freedom, one for the numerator and one for the denominator. This makes the $F$ tables even more complicated than the $\chi^2$ or $t$ tables.

The formula for the $F$ distribution again looks out of nowhere, until you recognize that we could get this formula by comparing two variances. Say we have independent random samples from two populations, call them $x_1, x_2, \ldots, x_{n_x}$ and $y_1, y_2, \ldots, y_{n_y}$. We could then write

$F_{df_x=n_x-1,\,df_y=n_y-1} = \frac{\dfrac{(n_x-1)s_x^2}{\sigma_x^2} \Big/ (n_x-1)}{\dfrac{(n_y-1)s_y^2}{\sigma_y^2} \Big/ (n_y-1)}$

Cancelling and then inverting the fractions we get

$F_{df_x=n_x-1,\,df_y=n_y-1} = \frac{s_x^2/\sigma_x^2}{s_y^2/\sigma_y^2} = \frac{s_x^2}{s_y^2} \cdot \frac{\sigma_y^2}{\sigma_x^2}$   (S5)

Equation S5 thus allows us to compare the variances of two different populations if we can assume both populations are normally distributed. Tables VIII, IX, X, and XI give the values of the $F$ distribution for various combinations of degrees of freedom and areas under the curve.

The $F$ distribution is useful not only for comparing two variances (more in Section 9.5); in Section 10.2 we will see that it is useful for comparing more than two means, and in Chapter 11 that it is useful for predicting one variable from another.

One final fact that we will encounter later concerns the relationship between the $t$ distribution and the $F$ distribution. Turn back to formula S2 and square the numerator and the denominator. The denominator becomes the same as the denominator in S5. The numerator becomes a $z^2$, which is just a $\chi^2$ with one degree of freedom. We thus get the following result:

$(t_{df=n-1})^2 = F_{df_x=1,\,df_y=n-1}$

This result means that in some cases in Chapter 11 we will be able to work with either a $t$ distribution or an $F$ distribution and get the same result.

10/2/2007

The Concept of Sampling Distributions

Taking all possible samples of size 2, we can graph them and come up with a
sampling distribution of the sample statistic $\bar{x}$. Sampling distributions can be derived for any statistic. Knowing the properties of the underlying sampling distributions allows us to judge how accurate the statistics are as estimates of parameters.

Chapter 6: Sampling Distributions

The Concept of Sampling Distributions

• Parameter: numerical descriptive measure of a population. It is usually unknown.
• Sample statistic: numerical descriptive measure of a sample. It is usually known.
• Sampling distribution: the probability distribution of a sample statistic calculated from a very large number of samples of size $n$.

Decisions about which sample statistic to use must take into account the sampling distribution of the statistics you will be choosing from.

The Concept of Sampling Distributions

Data: 19 19 20 21 20 25 22 18 18 17

We can take 45 samples of size 2 from this group of 10 observations. We take one random sample and compute its mean $\bar{x}$. Another random sample may yield 22, 25 with $\bar{x} = 23.5$.

Given the probability distribution of $x$, find the sampling distribution of the mean and the median of $x$.

[Figures: sampling distribution of the mean; sampling distribution of the median]

Properties of Sampling Distributions: Unbiasedness and Minimum Variance

Two point estimators, A and B, of parameter $\theta$. After generating the sampling distributions of A and B, we can see that A is an unbiased estimator of $\theta$. B is a biased estimator of $\theta$, with a bias toward overstatement.

The Concept of Sampling Distributions: Simulating a Sampling Distribution

• Use a software package to generate samples of size $n = 11$ from a population with a known $\mu = .5$.
• Calculate the mean and median for each sample.
• Generate histograms for the means and medians of the samples.
• Note the greater clustering of the values of $\bar{x}$ around $\mu$.
• These histograms are approximations of the sampling distributions of $\bar{x}$ and $m$.

Properties of Sampling Distributions: Unbiasedness and Minimum Variance

What if A and B are both
unbiased estimators of $\theta$? Look at the sampling distributions and compare their standard deviations. A has a smaller standard deviation than B. Which would you use as your estimator?

Properties of Sampling Distributions: Unbiasedness and Minimum Variance

• Point estimator: formula or rule for using sample data to calculate an estimate of a population parameter.
• Point estimators have sampling distributions.
• Sampling distributions can also indicate whether an estimator is likely to under- or over-estimate a parameter.

The Sampling Distribution of $\bar{x}$ and the Central Limit Theorem

Assume 1000 samples of size $n$ are taken from a population, with $\bar{x}$ calculated for each sample. What are the properties of the sampling distribution of $\bar{x}$?

• Mean of the sampling distribution equals the mean of the sampled population: $\mu_{\bar{x}} = \mu$.
• Standard deviation of the sampling distribution equals the standard deviation of the sampled population divided by the square root of the sample size: $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.
• $\sigma_{\bar{x}}$ is referred to as the standard error of the mean.

The Sampling Distribution of $\bar{x}$ and the Central Limit Theorem

If we sample $n$ observations from a normally distributed population, the sampling distribution of $\bar{x}$ will be a normal distribution.

Central Limit Theorem: In a population with standard deviation $\sigma$ and mean $\mu$, the distribution of sample means from samples of $N$ observations will approach a normal distribution with standard deviation $\sigma/\sqrt{N}$ and mean $\mu$ as $N$ gets larger. The larger the $N$, the closer the sampling distribution of $\bar{x}$ is to a normal distribution.

Note how the sampling distribution approaches the normal distribution as $N$ increases, whatever the shape of the distribution of the original population.

The Sampling Distribution of $\bar{x}$ and the Central Limit Theorem

Assume a population with $\mu = 80$, $\sigma = 6$. If a sample of 36 is taken from this population, what is the probability that the sample mean is larger than 82? Sketch the curve of $\bar{x}$ and identify the area of interest.

Convert 82 to a z-value. First, calculate the standard deviation of the sampling distribution: $\sigma_{\bar{x}} = \sigma/\sqrt{n} = 6/\sqrt{36} = 1$. Then calculate the z-value: $z = (\bar{x} - \mu)/\sigma_{\bar{x}} = (82 - 80)/1 = 2$. Use the tables to find the probability of interest: $P(\bar{x} > 82) = P(z > 2) = .5 - .4772 = .0228$.

11/26/2007

Chapter 11: Simple Linear Regression

Probabilistic Models

5 steps of Simple Linear Regression:
1. Hypothesize the deterministic component.
2. Use sample data to estimate unknown model parameters.
3. Specify the probability distribution of $\varepsilon$; estimate the standard deviation of the distribution.
4. Statistically evaluate model usefulness.
5. Use for prediction and estimation once the model is useful.

Probabilistic Models

General form of probabilistic models: $y$ = Deterministic Component + Random Error, where $E(y)$ = Deterministic Component.

First-order (straight-line) probabilistic model: $y = \beta_0 + \beta_1 x + \varepsilon$

Fitting the Model: The Least Squares Approach

[Figure: scatterplot of reaction time versus drug percentage]

The least squares line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ has:
• Sum of errors (SE) = 0
• Sum of squared errors (SSE) smallest of all straight-line models

Formulas:
Slope: $\hat{\beta}_1 = SS_{xy}/SS_{xx}$
y-intercept: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
where $SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$ and $SS_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$

Model Assumptions

1. Mean of the probability distribution of $\varepsilon$ is 0.
2. Variance of the probability distribution of $\varepsilon$ is constant for all values of $x$.
3. Probability distribution of $\varepsilon$ is normal.
4. Values of $\varepsilon$ are independent of each other.

Assessing the Utility of the Model: Making Inferences about the Slope $\beta_1$

[Figure: sampling distribution of $\hat{\beta}_1$]

A Test of Model Utility: Simple Linear Regression

One-tailed test: $H_0$: $\beta_1 = 0$; $H_a$: $\beta_1 < 0$ (or $H_a$: $\beta_1 > 0$)
Two-tailed test: $H_0$: $\beta_1 = 0$; $H_a$: $\beta_1 \ne 0$
Test statistic: $t = \hat{\beta}_1 / s_{\hat{\beta}_1} = \hat{\beta}_1 / (s/\sqrt{SS_{xx}})$
Rejection region (one-tailed): $t < -t_\alpha$ (or $t > t_\alpha$ when $H_a$: $\beta_1 > 0$)
Rejection region (two-tailed): $|t| > t_{\alpha/2}$
where $t_\alpha$ and $t_{\alpha/2}$ are based on $n - 2$ degrees of freedom.

An Estimator of $\sigma^2$

Estimator of $\sigma^2$ for a straight-line model:

$s^2 = \frac{SSE}{\text{degrees of freedom for error}} = \frac{SSE}{n - 2}$

where $SSE = \sum (y_i - \hat{y}_i)^2 = SS_{yy} - \hat{\beta}_1 SS_{xy}$ and $SS_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$.

$s = \sqrt{s^2} = \sqrt{SSE/(n-2)}$ is the Estimated Standard Error of the
Regression Model.

Assessing the Utility of the Model: Making Inferences about the Slope $\beta_1$

A $100(1-\alpha)\%$ Confidence Interval for $\beta_1$: $\hat{\beta}_1 \pm t_{\alpha/2}\, s_{\hat{\beta}_1}$, where $s_{\hat{\beta}_1} = s/\sqrt{SS_{xx}}$.

Overview of the ANOVA Table for Comparing Means
Corresponding to Section 10.2
B. Habing 112003

Say we are comparing the means of $p$ different groups (treatment groups) where each group has $n_i$ observations. We'll let the total number of observations be $n = n_1 + n_2 + \cdots + n_p$ then. Because we have to number both the treatment group and the observation number, we need to use two subscripts. The first subscript $i$ is for the group number and the second $j$ is for which observation it is. So that the notation matches that for regression, we'll use $y_{ij}$ to name each observation then, but you could use $x_{ij}$ like the book does if you wanted to. The data could be laid out as follows:

Treatment               $i = 1$         $i = 2$         ...    $i = p$
                        $y_{11}$        $y_{21}$        ...    $y_{p1}$
                        $y_{12}$        $y_{22}$        ...    $y_{p2}$
                        ...             ...                    ...
                        $y_{1n_1}$      $y_{2n_2}$      ...    $y_{pn_p}$
Means for each group    $\bar{y}_1$     $\bar{y}_2$     ...    $\bar{y}_p$
Overall mean            $\bar{y}$

If we looked at Example 10.3 on page 452, we would have $p = 4$ (there are four brands) and $n_1 = 10$ (because 10 different balls were used with the first brand). Similarly $n_2$, $n_3$, and $n_4$ are all 10 in this example (they don't all need to be equal in general), and the total sample size is $n = 40$. Just as we have laid out in the table above, $y_{11} = 251.2$ (first group, first observation), $y_{23} = 265.0$ (second group, third observation), etc. Similarly $\bar{y}_1 = 250.8$ (the mean of the first group), etc. To find $\bar{y}$ we would have to take the average of all 40 observations.

n see m mpafpuge 442 We cmtsxmplyuse a a w b mndnm mm Ifwe hadthree ymlps X39s D39s and 39s we mm mum m m fa awmg Th2 gadaf u momma 15 m album um F 5mm m 0 m a mm bemgwmmdemmmmmagmmm SWMMW F mm W1 swmmm m1 Wm wwm W

The ANOVA Table for Comparing Means

Source | SS (Sum of Squares: the numerator of the variance) | DF (the denominator of the variance) | MS (Mean Square: the variance) | F
Treatment (or Between, or Model) | $SST = \sum_{i=1}^{p} n_i (\bar{y}_i - \bar{y})^2$ | $p - 1$ | $MST = \frac{SST}{p - 1}$ | $F = \frac{MST}{MSE}$
Error (or Within) | $SSE = \sum_{i=1}^{p} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2$ | $n - p$ | $MSE = \frac{SSE}{n - p}$ |
Total | $TSS = \sum_{i=1}^{p} \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2$ | $n - 1$ | |

Notice that if we took the $TSS/(n-1)$ we would end
up with just the overall variance of all the observations. Compare the formula you get here to our usual formula for $s^2$; it only looks different because we had to use two subscripts to make sure to include all of the observations.

The five keys to remember about the Analysis of Variance table are:

1. The sums of squares add up: $SST + SSE = TSS$.
2. The degrees of freedom can be calculated from the sum of squares formulas. Looking at the SST, notice that there are $p$ different $\bar{y}_i$ and one $\bar{y}$, so we get $p - 1$ degrees of freedom. Looking at the SSE, we have $n$ different $y_{ij}$ and $p$ different $\bar{y}_i$, and so the degrees of freedom are $n - p$. Finally, for TSS there are $n$ different $y_{ij}$ and one $\bar{y}$, so there are $n - 1$ degrees of freedom.
3. The degrees of freedom add up: $(p - 1) + (n - p) = n - 1$.
4. The mean squares (the variances) are found by taking the sum of squares (the numerator) and dividing by the degrees of freedom (the denominator).
5. The F statistic is calculated by dividing the two MS.

STAT 515 Chapter 5: Continuous Distributions

Probability distributions are used a bit differently for continuous r.v.'s than for discrete r.v.'s. Continuous distributions typically are represented by a probability density function (pdf) or density curve (kind of a "theoretical histogram"). A density curve
is a representation of the underlying population distribution, not a description of actual sample data.

The normal distribution is a particular type of continuous distribution. Its density has a bell shape.

Properties of Density Functions:

1. Density function always on or above the horizontal axis (curve can never have a negative value).
2. Total area beneath the curve (between curve and horizontal axis) is exactly 1.
3. An area under a density function represents a probability about the r.v. (or the proportion of observations we expect to have certain values).

With discrete r.v.'s, we looked at the probability function (table, graph) to find the probability of the r.v. taking a particular value. For continuous r.v.'s, the probability distribution will give us the probability that a value falls in an interval (for example, between two numbers). That is, the probability distribution of a continuous r.v. $X$ will tell us $P(a \le X \le b)$, where $a$ and $b$ are particular numbers. Specifically, $P(a \le X \le b)$ is the area under the density function between $x = a$ and $x = b$.

Examples:

The Uniform Distribution

This is a simple example of a continuous distribution. A uniform r.v. is equally likely to take any value between its lower limit (some number $c$) and its upper limit (some number $d$). Density looks like a rectangle. If total area is 1, then what is the height of the density function? (The rectangle has width $d - c$, so the height must be $1/(d - c)$.)

• Mean of a Uniform($c$, $d$) r.v.: $(c + d)/2$
• Std. deviation of a Uniform($c$, $d$) r.v.: $(d - c)/\sqrt{12}$

Example: A machine designed to fill 16-ounce water bottles actually dispenses a random amount between 15.0 and 17.0 ounces. The amount $X$ of water dispensed is a Uniform(15, 17) random variable.

Density:

What is the probability that the bottle has less than 15.5 ounces of water?

$P(X < 15.5) = P(15 < X < 15.5) = (15.5 - 15)/(17 - 15) = 0.25$

In general, for $X \sim$ Uniform($c$, $d$): $P(a < X < b) = \frac{b - a}{d - c}$

The Normal Distribution

The density function for the normal distribution is complicated:

$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2/(2\sigma^2)}$ for all $x$

Note that the normal distribution changes depending on the values of the mean $\mu$ and the standard deviation $\sigma$.

Standard Normal Distribution (Notation: $N(0, 1)$): The
normal distribution with mean $\mu = 0$ and standard deviation $\sigma = 1$.

Picture:

• Mound-shaped, symmetric, centered at 0
• Density always positive, even in tails
• Area under curve is 0.5 to left of zero, 0.5 to right of zero
• Almost all area under curve (99.7%) between $-3$ and 3

Note: The $N(0, 1)$ distribution is sometimes called the z-distribution, and standard normal values are denoted by $z$.

Table IV in back of book gives areas between 0 and certain listed values of $z$.

Example: Area under $N(0, 1)$ curve between 0 and 1.24?

Table IV: Go to row labeled 1.2, column labeled .04. Correct area: 0.3925

What does this area mean?

• If $Z$ is a r.v. with a standard normal distribution, then $P(0 < Z < 1.24) = 0.3925$. (Note: Same as $P(0 \le Z \le 1.24)$.)
• We expect that 39.25% of the values of data having a standard normal distribution will be between 0 and 1.24.

Other Probabilities:

$P(Z > 1.24)$
$P(Z < 1.24)$

Values to the left of zero: Use symmetry!

$P(-0.54 \le Z < 0)$
$P(Z < -0.54)$
$P(-1.75 < Z < 0.79)$
$P(0.79 < Z < 1.16)$

Finding Probabilities for any Normal r.v.

Note: There are many different normal distributions (change $\mu$ and/or $\sigma$, get a different distribution).

• Changing $\mu$ shifts the distribution to the left or right
• Increasing $\sigma$ makes the normal distribution wider
• Decreasing $\sigma$ makes the normal distribution narrower

So why so much emphasis on the standard normal?

Standardizing: If a r.v. $X$ has a normal distribution with mean $\mu$ and standard deviation $\sigma$, then the standardized variable

$Z = \frac{X - \mu}{\sigma}$

has a standard normal distribution.

So: We can convert any normal r.v. to a standard normal and then use Table IV to find probabilities!

Example: Assume lengths of pregnancies are normally distributed with mean 266 days and standard deviation 16 days. What proportion of pregnancies last less than 255 days?

What is the probability that a random pregnancy will last between 260 and 280 days?

We can also find the particular value of a normal r.v. that corresponds to a given proportion.

Example: Suppose the shortest tenth of pregnancies are classified as unusually premature. What's the maximum pregnancy length that
would be classified as such?

We need to "unstandardize" to get back to the $X$-value (pregnancy length).

General Rule: To unstandardize a z-value, use $X = z\sigma + \mu$.

Normal Approximation to the Binomial

The normal distribution is very powerful: it can be used to approximate probabilities for r.v.'s that are not normal.

Calculating binomial probabilities:
• Using Table II: doesn't cover all values of $n$ (only 5, 10, 15, 20, 25).
• Using the binomial probability formula: can be tedious for large $n$.

Fortunately, when $n$ is large, the binomial distribution closely resembles the normal distribution with mean $np$ and standard deviation $\sqrt{npq}$.

Rule of Thumb: When can this normal approximation be applied?

Continuity Correction: Since the normal is a continuous distribution and the binomial distribution a discrete distribution, an adjustment of 0.5 is usually made to the value of interest.

Example: A hotel has found that 5 percent of its guests will steal towels. If there are 220 rooms with guests in a hotel on a certain night, what is the probability that at least 20 of the rooms will need the towels replaced?

What is the probability that between 10 and 20 rooms will need towels replaced?

Analysis of a Two-Way Table: Chi-square Test for Independence of Two Classifications

Calculations for our Example

Does the incidence of heart disease depend on snoring pattern? (Test using $\alpha = 0.05$.) A random sample of 2484 British adults was taken, and the results given in a contingency table:

                      Snoring Pattern
Heart Disease | Never | Occasionally | Almost Every Night | Total
Yes           | 24    | 35           | 51                 | 110
No            | 1355  | 603          | 416                | 2374
Total         | 1379  | 638          | 467                | 2484

Expected cell counts:

$\frac{(110)(1379)}{2484} = 61.07$, $\frac{(110)(638)}{2484} = 28.25$, $\frac{(110)(467)}{2484} = 20.68$
$\frac{(2374)(1379)}{2484} = 1317.93$, $\frac{(2374)(638)}{2484} = 609.75$, $\frac{(2374)(467)}{2484} = 446.32$

Test statistic:

$\chi^2 = \frac{(24 - 61.07)^2}{61.07} + \frac{(35 - 28.25)^2}{28.25} + \frac{(51 - 20.68)^2}{20.68} + \frac{(1355 - 1317.93)^2}{1317.93} + \frac{(603 - 609.75)^2}{609.75} + \frac{(416 - 446.32)^2}{446.32} = 71.75$

For $(r - 1)(c - 1) = (1)(2) = 2$ degrees of freedom, $\chi^2_{.05} = 5.99$ (Table VII). Since $71.75 > 5.99$, we reject $H_0$ and conclude the classifications are dependent. The incidence of heart disease does depend on snoring pattern.

STAT 515 Chapter 7 Supplement
Brian Habing, University of South Carolina
Last Updated October 16, 2003

S7 Confidence Intervals for Variances

Just as we can use the $t$ distribution to form a confidence interval for the mean of a population, we can use the $\chi^2$ and $F$ distributions to make confidence intervals for the variance of a population, or the ratio of two variances. The logic is the same in both cases: use the sampling distribution to form a probability statement containing just the one unknown parameter, and then solve for that parameter.

S7.1 The Confidence Interval for $\sigma$ or $\sigma^2$

In section S6.2 we saw that if the random sample was drawn from a population that was normally distributed, then

$\frac{(n-1)s^2}{\sigma^2} = \chi^2_{df=n-1}$

If we choose $\chi^2_{\alpha/2}$ to be the value such that $P(\chi^2_{df=n-1} \ge \chi^2_{\alpha/2}) = \alpha/2$, and $\chi^2_{1-\alpha/2}$ to be the value such that $P(\chi^2_{df=n-1} \le \chi^2_{1-\alpha/2}) = \alpha/2$, we get the following:

$P\left[\chi^2_{1-\alpha/2} \le \frac{(n-1)s^2}{\sigma^2} \le \chi^2_{\alpha/2}\right] = 1 - \alpha$   (S1)

This is illustrated in the figure below.

[Figure: density of $(n-1)s^2/\sigma^2$ for a random sample from a normal population, with $\chi^2_{1-\alpha/2}$ and $\chi^2_{\alpha/2}$ marked on the horizontal axis]

We can now solve the inequality in S1 for $\sigma^2$ to get the confidence interval:

$P\left[\chi^2_{1-\alpha/2} \le \frac{(n-1)s^2}{\sigma^2} \le \chi^2_{\alpha/2}\right] = P\left[\frac{1}{\chi^2_{1-\alpha/2}} \ge \frac{\sigma^2}{(n-1)s^2} \ge \frac{1}{\chi^2_{\alpha/2}}\right] = P\left[\frac{(n-1)s^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}\right] = 1 - \alpha$

The $(1 - \alpha)100\%$ confidence interval for the population variance is thus

$\left(\frac{(n-1)s^2}{\chi^2_{\alpha/2}},\ \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}\right)$   (S2)

This can be changed to a confidence interval for the population standard deviation simply by taking the square root of both sides:

$\left(\sqrt{\frac{(n-1)s^2}{\chi^2_{\alpha/2}}},\ \sqrt{\frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}}\right)$   (S3)

Say we are interested in the standard deviation of a certain population. A sample of size $n = 12$ ($df = 11$) is gathered, $s^2$ is found to be 20.2, the q-q plot for the data looks fairly normal, and it is desired to construct a 95% confidence interval. Using S3, we see that we need to determine $\chi^2_{\alpha/2}$ and $\chi^2_{1-\alpha/2}$, where $\alpha/2 = 0.025$ and $1 - \alpha/2 = 1 - 0.025 = 0.975$. From Table VII we see that the values are $\chi^2_{0.025} = 21.9200$ and $\chi^2_{0.975} = 3.81575$. Plugging these values into S3 gives

$\left(\sqrt{\frac{11(20.2)}{21.9200}},\ \sqrt{\frac{11(20.2)}{3.81575}}\right) = (\sqrt{10.137},\ \sqrt{58.232}) = (3.18,\ 7.63)$

S7.2 The Confidence Interval for $\sigma_x^2/\sigma_y^2$

A confidence interval for the ratio of two variances could be constructed in the same way as one for a single variance. From S6.4 we see that if two samples $x_1, x_2, \ldots, x_{n_x}$ and $y_1, y_2, \ldots, y_{n_y}$ are drawn randomly from two populations that seem normal, then
Fdfxnx71dfyny71 2 Lx 2 0y If we choose Fat2 to be the value such that PFdfwnr1dfynr1 2 Fat2 g and Fla2 to be the value such that PFdfwnr1dfynr1 lt Fla2 we get the following P Fla2 Faz 17a S4 7 Solving for 42 then gives us the 1 7 a100 con dence interval 1 9 9 Q Q S5 Fox2 7 Flier2 It might be good practice to see if you can work out the steps between S4 and SB Note that the text only gives the table for nding the Fat2 values7 and not the Fla2 ones One way of getting around this would be to use SAS just remember that SAS gives the area in the lower end of the table7 while the text and the formulas above use the upper end Another way is to use the following relationship Fl a2gtdfxnx71dfyny 1 Fa2dfxny71dfynxil That is7 to nd the F0975 you ip the degrees of freedom and take 1F0V025 So7 if you wanted a 95 con dence interval when 711 6 and my 5 the two values you would use are 936 for F0025 straight from the table for df of 5 and 4 and 1739 0135 for F0975 one over the value from the table for the reversed df of 4 and 5 873 Robustness What if the Data Isn7t Normal A statistical procedure is called robust if it performs well even when its assump tions aren7t met In the case of using the t distribution to make inferences about the mean7 the XZ distribution for the variance7 and the F distribution for two vari ances7 we need to assume that the initial populations were normally distributed The procedures that use the t distribution are fairly robust however That is7 the proce dures involving the use of the t distribution to make inferences about M work fairly well even when the data isn7t normally distributed In general it will work well for small sample sizes 71 S 30 even if their are some doubts about the 01 01 plot For large sample sizes it will work well for all but the worst q q plots The procedures discussed in 861 and 862 for making inferences about population variances are not robust at all If there are any questions about the 01 01 plots7 they should not be used It is 
important to note that in Chapters 10, 11, and 13 we will see other uses of the $\chi^2$ and F distributions that are robust. A particular distribution is never robust or non-robust; robustness is a property of the entire procedure that you are attempting.

In this course we are only covering some of the most common and basic methods for making inferences about a population. A variety of other methods are discussed in STAT 518, Nonparametric Statistical Methods. A method is called nonparametric if it does not depend on the assumption that the data follows a particular distribution (like the normal distribution). There are two reasons for not simply always using nonparametric methods. One is that they are somewhat more complicated to explain (see Section 14.2, for example). The second reason is that while the nonparametric tests are generally better when the assumptions are badly violated, the standard methods we are learning here are better when the assumptions are met.

10/8/2007
Chapter 7: Inferences Based on a Single Sample: Estimation with Confidence Intervals

Identifying the Target Parameter
Target parameter: the unknown population parameter of interest for estimating.

Parameter   Key Words or Phrases                     Type of Data
$\mu$       Mean; average                            Quantitative
$p$         Proportion; percentage; fraction; rate   Qualitative

Large-Sample Confidence Interval for a Population Mean
- $\bar{x} \pm 2\sigma_{\bar{x}} = \bar{x} \pm 2\sigma/\sqrt{n}$: we are about 95% confident, for any $\bar{x}$ from sample size n, that $\mu$ will lie in the interval $\bar{x} \pm 2\sigma/\sqrt{n}$.
- We usually don't know $\sigma$, but with a large sample, s is a good estimator of $\sigma$.
- We can calculate confidence intervals for different confidence coefficients.
- Interval estimator (or confidence interval): a formula used to calculate an interval estimate from sample data.
- Confidence coefficient: the probability that a randomly selected confidence interval encloses the population parameter.
- Confidence level: the confidence coefficient expressed as a percentage.
How to estimate
the population mean and assess the estimate's reliability?
- $\bar{x}$ is an estimate of $\mu$, and we use the CLT to assess how accurate that estimate is.
- According to the CLT, 95% of all $\bar{x}$ from sample size n lie within $\pm 1.96\sigma_{\bar{x}}$ of the mean.
- We can use this to assess the accuracy of $\bar{x}$ as an estimate of $\mu$.
- The confidence coefficient is equal to $1-\alpha$ and is split between the two tails of the distribution.
- The confidence interval is expressed more generally as $\bar{x} \pm z_{\alpha/2}\sigma_{\bar{x}} = \bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$.
- For samples of size > 30, the confidence interval is expressed as $\bar{x} \pm z_{\alpha/2}\, s/\sqrt{n}$.
- Requires that the sample used be random.

TABLE 7.2 Commonly Used Values of $z_{\alpha/2}$
Confidence Level $100(1-\alpha)$   $\alpha$   $\alpha/2$   $z_{\alpha/2}$
90%                                .10        .05          1.645
95%                                .05        .025         1.96
99%                                .01        .005         2.575

Small-Sample Confidence Interval for a Population Mean
- 2 problems presented by sample sizes of less than 30:
  - The CLT no longer applies.
  - The population standard deviation is almost always unknown, and s may provide a poor estimate when n is small.
- If we can assume that the sampled population is approximately normal, then the sampling distribution of $\bar{x}$ can be assumed to be approximately normal.
- Instead of using z, we use $t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$. This t is referred to as the t-statistic.
- The t-statistic has a sampling distribution very similar to z.
- Its variability depends on n, the sample size, and is expressed as $(n-1)$ degrees of freedom (df). As df gets smaller, variability increases.
- Table for t: combinations of degrees of freedom and $t_\alpha$. [Table: partial t table showing components of the table]
- Comparing the t and z distributions for the same $\alpha$, with df = 4 for the t-distribution, you can see that the t-score is larger, and therefore the confidence interval will be wider. The
closer df gets to 30, the more closely the t-distribution approximates the normal distribution.
- When creating a confidence interval around $\mu$ for a small sample we use $\bar{x} \pm t_{\alpha/2}\, s/\sqrt{n}$, basing $t_{\alpha/2}$ on $(n-1)$ degrees of freedom.
- We assume a random sample drawn from a population that is approximately normally distributed.

Large-Sample Confidence Interval for a Population Proportion
- Confidence intervals around a proportion are confidence intervals around the probability of success in a binomial experiment.
- The sample statistic of interest is $\hat{p}$.
- The mean of the sampling distribution of $\hat{p}$ is p; $\hat{p}$ is an unbiased estimator of p.
- The standard deviation of the sampling distribution is $\sqrt{pq/n}$, where $q = 1 - p$.
- For large samples the sampling distribution of $\hat{p}$ is approximately normal.
- Sample size n is large if $\hat{p} \pm 3\sigma_{\hat{p}}$ falls between 0 and 1.
- The confidence interval is calculated as $\hat{p} \pm z_{\alpha/2}\sigma_{\hat{p}} = \hat{p} \pm z_{\alpha/2}\sqrt{pq/n} \approx \hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}\hat{q}/n}$, where $\hat{p} = x/n$ and $\hat{q} = 1 - \hat{p}$.
- When p is near 0 or 1, the confidence intervals calculated using the formulas presented are misleading.
- An adjustment can be used that works for any p, even with very small sample sizes.

Agresti and Coull Correction
$$\tilde{p} \pm z_{\alpha/2}\sqrt{\frac{\tilde{p}(1-\tilde{p})}{n+4}}, \qquad \tilde{p} = \frac{x+2}{n+4}$$
where x = number of successes and n = sample size.

STAT 515 Chapter 7: Confidence Intervals
- With a point estimate, we used a single number to estimate a parameter.
- We can also use a set of numbers to serve as "reasonable" estimates for the parameter.
Example: Assume we have a sample of size 100 from a population with $\sigma = 0.1$.
From CLT:
Empirical Rule: If we take many samples, calculating $\bar{X}$ each time, then about 95% of the values of $\bar{X}$ will be between:
Therefore:
This interval is called an approximate 95% "confidence interval" for $\mu$.
Confidence Interval: An interval (along with a level of confidence) used to estimate a parameter.
- Values in the interval are considered "reasonable" values for the parameter.
Confidence level: The percentage of all CIs (if we took many samples, each time
computing the CI) that contain the true parameter.
Note: The endpoints of the CI are statistics, calculated from sample data. (The endpoints are random, not the parameter!)
In general, if $\bar{X}$ is normally distributed, then in $100(1-\alpha)\%$ of samples, the interval $\bar{X} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ will contain $\mu$.
Note: $z_{\alpha/2}$ = the z-value with $\alpha/2$ area to the right.
$100(1-\alpha)\%$ CI for $\mu$: $\bar{X} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$
Problem: We typically do not know the parameter $\sigma$. We must use its estimate s instead.
Formula: CI for $\mu$ (when $\sigma$ is unknown)
Since $\dfrac{\bar{X} - \mu}{s/\sqrt{n}}$ has a t-distribution with $n-1$ d.f., our $100(1-\alpha)\%$ CI for $\mu$ is
$$\bar{X} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}}$$
where $t_{\alpha/2}$ = the value in the t-distribution ($n-1$ d.f.) with $\alpha/2$ area to the right.
- This is valid if the data come from a normal distribution.
Example: We want to estimate the mean weight $\mu$ of trout in a lake. We catch a sample of 9 trout. Sample mean $\bar{X} = 3.5$ pounds, s = 0.9 pounds. 95% CI for $\mu$?
Question: What does 95% confidence mean here, exactly?
- If we took many samples and computed many 95% CIs, then about 95% of them would contain $\mu$.
The fact that (2.81, 4.19) contains $\mu$ "with 95% confidence" implies the method used would capture $\mu$ 95% of the time, if we did this over many samples.
Picture:
A WRONG statement: "There is 95% probability that $\mu$ is between 2.81 and 4.19."
Wrong! $\mu$ is not random; $\mu$ doesn't change from sample to sample. It's either between 2.81 and 4.19 or it's not.

Level of Confidence
Recall example: 95% CI for $\mu$ was (2.81, 4.19).
- For a 90% CI, we use $t_{.05}$ (8 d.f.) = 1.86.
- For a 99% CI, we use $t_{.005}$ (8 d.f.) = 3.355.
90% CI:
99% CI:
Note tradeoff: If we want a higher confidence level, then the interval gets wider (less precise).

Confidence Interval for a Proportion
- We want to know how much of a population has a certain characteristic.
- The proportion (always between 0 and 1) of individuals with a characteristic is the same as the probability of a random individual having the characteristic.
Estimating a proportion is equivalent to estimating the binomial probability p.
Point estimate of p is the sample proportion $\hat{p} = x/n$.
Note: $\hat{p} = \sum X_i / n$ is a type of sample average (of 0's and 1's), so the CLT tells us that when sample size is large, sampling
distribution of $\hat{p}$ is approximately normal.
For large n, a $100(1-\alpha)\%$ CI for p is $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}\hat{q}/n}$.
How large does n need to be?
Example 1: A student government candidate wants to know the proportion of students who support her. She takes a random sample of 93 students, and 47 of those support her. Find a 90% CI for the true proportion.
Check:
Example 2: We wish to estimate the probability that a randomly selected part in a shipment will be defective. Take a random sample of 79 parts and find 4 defective parts. Find a 95% CI for p.

Confidence Interval for the Variance $\sigma^2$ (or for s.d. $\sigma$)
Recall that if the data are normally distributed, $\dfrac{(n-1)s^2}{\sigma^2}$ has a $\chi^2$ sampling distribution with $n-1$ d.f.
This can be used to develop a $(1-\alpha)100\%$ CI for $\sigma^2$:
$$\left(\frac{(n-1)s^2}{\chi^2_{\alpha/2}},\ \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}\right)$$
Example: Trout data example (assume data are normal; how to check this?). s = 0.9 pounds, so $s^2 = 0.81$; n = 9. Find a 95% CI for $\sigma^2$.
95% CI for $\sigma$:
Also, a CI for the ratio of two variances, $\sigma_1^2/\sigma_2^2$, can be found by the formula:
Example: If we have a second sample of 13 trout with sample variance $s_2^2 = 0.7$, then a 95% CI for $\sigma_1^2/\sigma_2^2$ is:

Sample Size Determination
Note that the bound (or margin of error) B of a CI equals half its width.
For the CI for the mean (with $\sigma$ known), this is $B = z_{\alpha/2}\,\sigma/\sqrt{n}$.
For the CI for the proportion, this is $B = z_{\alpha/2}\sqrt{pq/n}$.
Note: When the sample size n is bigger, the CI is narrower (more precise).
We often want to determine what sample size we need to achieve a prespecified margin of error and level of confidence.
Solving for n:
CI for mean: $n = \left(\dfrac{z_{\alpha/2}\,\sigma}{B}\right)^2$
CI for proportion: $n = \dfrac{z_{\alpha/2}^2\, pq}{B^2}$
Note: Always round n up to the next largest integer.
These formulas involve $\sigma$, p and q, which are usually unknown in practice. We typically guess them based on prior knowledge; often we use p = 0.5, q = 0.5.
Example 1: How many patients do we need for a blood pressure study? We want a 90% CI for mean systolic blood pressure reduction, with a margin of error of 5 mmHg. We believe that $\sigma = 10$ mmHg.
Example 2: Pollsters want a 95% CI for the proportion of voters supporting President Bush. They want a 3% margin of error (B = .03). What sample size do they need?

10/15/2007
The Elements of a Test of
Hypothesis
Chapter 8: Inferences Based on a Single Sample: Tests of Hypothesis

7 elements:
1. The Null hypothesis
2. The alternate, or research hypothesis
3. The test statistic
4. The rejection region
5. The assumptions
6. The Experiment and test statistic calculation
7. The Conclusion

Example: Does a manufacturer's pipe meet building code?
- Null hypothesis: Pipe does not meet code. $H_0: \mu \leq 2400$
- Alternate hypothesis: Pipe meets specifications. $H_a: \mu > 2400$
- Test statistic to be used: z
- Rejection region: determined by the Type I error, which is the probability of rejecting the null hypothesis when it is true, which is $\alpha$. Here, $\alpha = .05$. Region is z > 1.645, from the z value table.
- Assume that s is a good approximation of $\sigma$.
- Sample of 50 taken: $\bar{x} = 2460$, s = 200.
- Test statistic is $z = \dfrac{\bar{x} - 2400}{\sigma/\sqrt{n}} \approx \dfrac{2460 - 2400}{200/\sqrt{50}} = \dfrac{60}{28.28} = 2.12$
- The test statistic lies in the rejection region, therefore we reject $H_0$.

Type I vs. Type II Error
Conclusions and Consequences for a Test of Hypothesis
                                True State of Nature
Conclusion                      $H_0$ True                             $H_a$ True
Fail to reject $H_0$            Correct decision                       Type II error (probability $\beta$)
Reject $H_0$                    Type I error (probability $\alpha$)    Correct decision

Large-Sample Test of Hypothesis about a Population Mean
The alternative hypothesis can take one of 3 forms:
- One-tailed, lower-tail: $H_a: \mu < 2400$
- One-tailed, upper-tail: $H_a: \mu > 2400$
- Two-tailed: $H_a: \mu \neq 2400$

The Elements of a Test of Hypothesis
1. The Null hypothesis: the status quo. What we will accept unless proven otherwise. Stated as $H_0$: parameter = value.
2. The Alternative (research) hypothesis ($H_a$): theory that contradicts $H_0$. Will be accepted if there is evidence to establish its truth.
3. Test Statistic: sample statistic used to determine whether or not to reject $H_0$ and accept $H_a$.
4. The rejection region: the region that will lead to $H_0$ being rejected and $H_a$ accepted. Set to minimize the likelihood of a Type I error.
5. The
assumptions: clear statements about the population being sampled.
6. The Experiment and test statistic calculation: performance of sampling and calculation of the value of the test statistic.
7. The Conclusion: decision to (not) reject $H_0$, based on a comparison of the test statistic to the rejection region.

Large-Sample Test of Hypothesis about a Population Mean
[Table: rejection regions for common values of $\alpha$ (lower-tailed, upper-tailed, two-tailed)]
- The null hypothesis is the status quo, expressed in one of three forms, for example $H_0: \mu = 2400$.
- We can either "Reject" or "Fail to reject" the Null.
- If we have n = 100, $\bar{x} = 11.85$, s = .5, and we want to test if $\mu \neq 12$ with a 99% confidence level, our setup would be as follows:
  - $H_0: \mu = 12$; $H_a: \mu \neq 12$
  - Test statistic: $z = \dfrac{\bar{x} - 12}{\sigma_{\bar{x}}}$
  - Rejection region: z < -2.575 or z > 2.575 (two-tailed)
- The CLT applies, therefore no assumptions about the population are needed.
- Solve: $z = \dfrac{\bar{x} - 12}{\sigma_{\bar{x}}} = \dfrac{11.85 - 12}{\sigma/\sqrt{n}} \approx \dfrac{11.85 - 12}{s/\sqrt{n}} = \dfrac{-.15}{.5/10} = -3.0$
- Since z falls in the rejection region, we conclude that at the .01 level of significance the observed mean differs significantly from 12.

Small-Sample Test of Hypothesis about a Population Mean
- When the sample size is small (< 30), we use a different sampling distribution for determining the rejection region and we calculate a different test statistic.
- The t-statistic and t distribution are used in cases of a small-sample test of hypothesis about $\mu$.
- All steps of the test are the same, and an assumption about the population distribution is now necessary, since the CLT does not apply.

Small-Sample Test of Hypothesis about $\mu$
- One-tailed test: $H_0: \mu = \mu_0$; $H_a: \mu < \mu_0$ (or $H_a: \mu > \mu_0$)
- Two-tailed test: $H_0: \mu = \mu_0$; $H_a: \mu \neq \mu_0$
- Test statistic: $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$
- Rejection region, one-tailed: $t < -t_\alpha$ (or $t > t_\alpha$ when $H_a: \mu > \mu_0$)
- Rejection region, two-tailed: $|t| > t_{\alpha/2}$
where $t_\alpha$ and $t_{\alpha/2}$ are based on $(n-1)$ degrees of freedom.

Observed Significance Levels: p-Values
- The p-value, or observed significance level, is the smallest level of $\alpha$ at which we can reject the Null.

Large-Sample Test of
Hypothesis about a Population Proportion

Large-Sample Test of Hypothesis about p
- One-tailed test: $H_0: p = p_0$; $H_a: p < p_0$ (or $H_a: p > p_0$)
- Two-tailed test: $H_0: p = p_0$; $H_a: p \neq p_0$
- Test statistic: $z = \dfrac{\hat{p} - p_0}{\sigma_{\hat{p}}}$, where, according to $H_0$, $\sigma_{\hat{p}} = \sqrt{p_0 q_0 / n}$ and $q_0 = 1 - p_0$
- Rejection region, one-tailed: $z < -z_\alpha$ (or $z > z_\alpha$ when $H_a: p > p_0$)
- Rejection region, two-tailed: $|z| > z_{\alpha/2}$

Assumptions needed for a valid large-sample test of hypothesis for p:
- A random sample is selected from a binomial population.
- The sample size n is large (condition satisfied if $p_0 \pm 3\sigma_{\hat{p}}$ falls between 0 and 1).

Observed Significance Levels: p-Values
- When p-values are used, results are reported by setting the maximum $\alpha$ you are willing to tolerate and comparing the p-value to that to reject or not reject $H_0$.

Tests of Hypothesis about a Population Variance
- Hypotheses about the variance use the chi-square distribution and statistic.
- The quantity $\dfrac{(n-1)s^2}{\sigma^2}$ has a sampling distribution that follows the chi-square distribution, assuming the population the sample is drawn from is normally distributed.

Small-Sample Test of Hypothesis about $\sigma^2$
- One-tailed test: $H_0: \sigma^2 = \sigma_0^2$; $H_a: \sigma^2 < \sigma_0^2$ (or $H_a: \sigma^2 > \sigma_0^2$)
- Two-tailed test: $H_0: \sigma^2 = \sigma_0^2$; $H_a: \sigma^2 \neq \sigma_0^2$
- Test statistic: $\chi^2 = \dfrac{(n-1)s^2}{\sigma_0^2}$
- Rejection region, one-tailed: $\chi^2 < \chi^2_{1-\alpha}$ (or $\chi^2 > \chi^2_\alpha$ when $H_a: \sigma^2 > \sigma_0^2$)
- Rejection region, two-tailed: $\chi^2 < \chi^2_{1-\alpha/2}$ or $\chi^2 > \chi^2_{\alpha/2}$
where $\sigma_0^2$ is the hypothesized variance and the distribution of $\chi^2$ is based on $(n-1)$ degrees of freedom.

9/19/2007
Chapter 4: Discrete Random Variables

Probability Distributions for Discrete Random Variables
2 requirements that must be satisfied:
1. $p(x) \geq 0$ for all values of x
2. $\sum p(x) = 1$, where the summation of $p(x)$ is over all possible values of x

- Random variable: a variable that assumes numerical values associated with random outcomes of an experiment.
- Experiment: tossing 2 coins simultaneously. Random variable X: number of heads observed. X can assume values of 0, 1 and 2.
- Two types of random variable: discrete and continuous.
[Figure: sample points for the two-coin experiment]

Two Types of Random
Variables
- Discrete random variable: a random variable that has a finite, or countable, number of distinct possible values. Example: number of people born in July.
- Continuous random variable: a random variable that has an infinite number of distinct possible values. Example: average age of people born in July.

Probability Distributions for Discrete Random Variables
Probability distribution of a discrete random variable X (number of heads):
P(x = 0) = P(TT) = 1/4
P(x = 1) = P(HT) + P(TH) = 1/4 + 1/4 = 1/2
P(x = 2) = P(HH) = 1/4

Expected Values of Discrete Random Variables
- The mean, or expected value, of a discrete random variable is $\mu = E(x) = \sum x\,p(x)$.
- The variance of a discrete random variable is $\sigma^2 = E[(x-\mu)^2] = \sum (x-\mu)^2 p(x)$, and the standard deviation is $\sigma = \sqrt{\sigma^2}$.
[Table: probability rules for a discrete random variable]

The Binomial Random Variable
- Binomial experiment: an experiment of n identical trials.
- 2 possible outcomes on each trial, denoted as S (success) and F (failure).
- Probability of success (p) is constant from trial to trial. Probability of failure (q) is 1 - p.
- Trials are independent.
- Binomial random variable: number of S's in n trials.

Example: The Heart Association claims that only 10% of US adults over 30 can pass the President's Physical Fitness Commission's test. Select 4 adults at random and administer the test. What is the probability that none of the adults passes the test?
[Table: sample points for the fitness test; S = pass, F = fail]
Use the multiplicative rule to calculate probabilities of the possible outcomes:
P(SSSS) = (.1)(.1)(.1)(.1) = .0001
P(FSSS) = (.9)(.1)(.1)(.1) = .0009
P(FFFF) = (.9)(.9)(.9)(.9) = .6561
What is the probability that 3 of the 4 adults pass the test?
P(3 of the 4 adults pass the test) = 4(.1)^3(.9) = 4(.0009) = .0036
What is the probability that 3 of the 4 adults fail the test?
P(3 of the 4 adults fail the test) = 4(.9)^3(.1) = 4(.0729) = .2916
Do you see a pattern?

Using Binomial Tables
Binomial tables are cumulative tables: entries represent cumulative binomial probabilities.
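The probabilities in the fitness-test example, and the cumulative entries a binomial table would list for n = 4 and p = 0.1, can be checked with a short Python sketch. The function names `binomial_pmf` and `binomial_cdf` are illustrative choices, not from the notes; the calculation assumes the standard binomial probability formula $p(x) = \binom{n}{x} p^x q^{n-x}$.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable: n trials, success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binomial_cdf(x, n, p):
    """Cumulative probability P(X <= x), the kind of entry a binomial table lists."""
    return sum(binomial_pmf(k, n, p) for k in range(x + 1))

n, p = 4, 0.1  # four adults, each passing the test with probability 0.1

print(f"P(none pass)   = {binomial_pmf(0, n, p):.4f}")
print(f"P(3 of 4 pass) = {binomial_pmf(3, n, p):.4f}")
print(f"P(3 of 4 fail) = {binomial_pmf(1, n, p):.4f}")  # 3 fail means exactly 1 passes

# Cumulative column for n = 4, p = 0.1
for x in range(n + 1):
    print(f"P(X <= {x}) = {binomial_cdf(x, n, p):.4f}")
```

The first three printed values should agree with the hand calculations above (.6561, .0036 and .2916).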
Make use of additive and complementary properties to calculate probabilities of individual x's, or of x being greater than a particular value. For example, $P(x = 2) = P(x \leq 2) - P(x \leq 1)$ and $P(x > 2) = 1 - P(x \leq 2)$.

The Binomial Random Variable
Formula for the probability distribution $p(x)$:
$$p(x) = \binom{n}{x} p^x q^{n-x}, \qquad \binom{n}{x} = \frac{n!}{x!\,(n-x)!}$$
where p = probability of success on a single trial, q = 1 - p, n = number of trials, and x = number of successes in n trials.
- Mean: $\mu = np$
- Variance: $\sigma^2 = npq$
- Standard deviation: $\sigma = \sqrt{npq}$

8/28/2007
Chapter 1: Statistics, Data, and Statistical Thinking

The Science of Statistics
Statistics: the science of data
- Collection
- Evaluation (classification, summary, organization and analysis)
- Interpretation

Types of Statistical Applications
- Descriptive Statistics: describe collected data.
  - "Nearly 87% of players participating in a Speed Training Program improved their sprint times."
  - "Only about 3% of players participating in a Speed Training Program had decreased times."
- Inferential Statistics: make generalizations about a group based on a subset (sample) of that group.
  - "Based on exit polls, more people voted for Candidate A."

Fundamental Elements of Statistics
- Experimental unit: object of interest. Example: graduating senior.
- Population: the set of units we are interested in learning about. Example: all 7450 graduating seniors at "State U".
- Variable: characteristic of an individual population unit. Example: age at graduation.
- Sample: subset of the population. Example: 700 graduating seniors at "State U".
- Statistical inference: generalization about a population based on sample data. Example: the average age at graduation is 27.9, based on sample data.
- Measure of reliability: statement about the uncertainty associated with an inference.

Elements of Descriptive Statistical Problems
- population/sample of interest
- investigative variables
- numerical summary tools (charts, graphs, tables)
- pattern
identification in data

Elements of Inferential Statistical Problems
- population of interest
- investigative variables
- sample taken from population
- inference about population based on sample data
- reliability measure for the inference

Types of Data
- Quantitative data: measured on a naturally occurring scale.
  - Equal intervals along the scale (allows for meaningful mathematical calculations).
  - Data with an absolute zero (zero means no value) is ratio data (bank balance, grade).
  - Data with a relative zero (zero has value) is interval data (temperature).
- Qualitative data: measured by classification only.
  - Non-numerical in nature.
  - Meaningfully ordered categories identify ordinal data (best to worst ranking, age categories).
  - Categories without a meaningful order identify nominal data (political affiliation, industry classification, ethnic/cultural groups).
- Different statistical techniques are used for quantitative and qualitative data.
- Qualitative and quantitative data can be used together in some techniques.
- Quantitative data can be transformed into qualitative data through category creation.
- Qualitative data cannot be meaningfully transformed into quantitative data.

Collecting Data
Data sources:
- Published source: books, journals, abstracts (primary vs. secondary).
- Designed experiment: often used for gathering information about an intervention.
- Survey: data gathered through questions from a sample of people.
- Observational study: data gathered through observation, no interaction with units.
Sampling:
- Sampling is necessary if inferential statistics are to be used.
- Samples need to be representative (reflect the population of interest).
- Random sampling: the most common sampling method to ensure the sample is representative. Ensures that each subset of fixed size is equally likely to be selected.

The Role of Statistics in Critical Thinking
Common sources of error in survey data:
- Selection bias: exclusion of a subset of the population of interest prior to sampling.
- Nonresponse bias: introduced when responses are not
gotten from all sample members.
- Measurement error: inaccuracy in recorded data. Can be due to survey design, interviewer impact, or a transcription error.

Collecting Data
Question: a local TV station conducts exit polling during an election, selecting every 10th person who exits the polling station. Is this a random sample?
No! Why? Before the first person is surveyed, there are only 10 subsets that can be selected from the whole population. Once the first person is surveyed, there is only 1 subset that can be selected from the whole population.

The Role of Statistics in Critical Thinking
- Statistical literacy is necessary today to make informed decisions both at work and at home.
- Requires statistical thinking to critically assess data and the inferences drawn from it.
- Statistical thinking assists you in identifying research resulting from unethical statistical practices.

Summary
- 2 types of statistical applications in business: descriptive and inferential.
- 6 fundamental elements of statistics: population, experimental units, variable, sample, inference, measure of reliability.
- 2 types of data: quantitative and qualitative.
- 4 data collection methods: published source, designed experiment, survey, observation.
- Sources of error in survey data: selection bias, nonresponse bias, measurement error.

STAT 515 Section [number illegible] Supplement
Brian Habing, University of South Carolina
Last Updated: November 1, 2003

Once a power has been calculated (like in the example on pages 359-361), it is necessary to have a way of easily displaying the results. One of the most common ways is to create a graph that represents the power at each of the possible values of the parameter. So for the example we would need to plot the points [values illegible].
[Figure: power curve plotted against the mean]
Looking at the above curve, we can see the probability of correctly rejecting the null hypothesis for each of the possible values of the mean. For example, the test has nearly a 100% chance of performing correctly if the actual mean is greater than 2525. On the other hand, there is less than a 20% chance of it performing correctly if the actual mean is less than 2410.
Why is it that the curve will always pass through the point corresponding to the
null hypothesis value and the $\alpha$-level? Why will it always go to 1 (or 0) at the extremes?
It is certainly time consuming to calculate the values of a power curve by hand, but many computer packages will do it for you. The shape of the curve is based on the alternate hypothesis [remainder of passage illegible]. Once you know what the curve looks like, you should be able to determine both the null and alternate hypotheses that the power curves below correspond to.
[Figure: power curves plotted against the mean]
Similarly, if you are given $H_0$, $H_a$, and $\alpha$, you should be able to sketch what the power curve will generally look like.

11/5/2007
Chapter 10: Design of Experiments and Analysis of Variance

Elements of a Designed Experiment
- Response variable: also called the dependent variable.
- Factors (quantitative and qualitative): also called the independent variables.
- Factor levels
- Treatments
- Experimental unit
[Figure: single-factor experiment, showing the independent and dependent variables]
[Figure: two-factor experiment]

Designed vs. Observational Experiment
- In a designed experiment, the analyst determines the treatments and the methods of assigning units to treatments.
- In an observational experiment, the analyst observes treatments and responses, but does not determine treatments.
- Many experiments are a mix of designed and observational.

The Completely Randomized Design
- Achieved when the samples of experimental units for each treatment are random and independent of each other.
- The design is used to compare the treatment means:
  $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$
  $H_a$: At least two of the treatment means differ
- The hypotheses are tested by comparing the differences between the treatment means to the amount of sampling variability present.
- The test statistic is calculated using measures of variability within treatment groups and measures of variability between treatment groups.

Mean Square for Treatments (MST)
Measure of the variability among treatment means:
$$MST = \frac{SST}{k-1}$$

Mean Square for Error (MSE)
Measure of sampling variability within treatments:
$$MSE = \frac{SSE}{n-k}$$

Sum of Squares for Treatments (SST)
Measure of the total variation between treatment means, with k treatments. Calculated by
$$SST = \sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2$$
where $n_i$ = number of observations in the ith treatment group, $\bar{x}_i$ = mean of measurements in the ith treatment group, and $\bar{x}$ = overall mean of all measurements.

Sum of Squares for Error (SSE)
Measure of the variability around treatment means attributable to sampling error. Calculated by
$$SSE = \sum_{j=1}^{n_1}(x_{1j} - \bar{x}_1)^2 + \sum_{j=1}^{n_2}(x_{2j} - \bar{x}_2)^2 + \cdots + \sum_{j=1}^{n_k}(x_{kj} - \bar{x}_k)^2$$
After substitution, SSE can be rewritten as
$$SSE = (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_k - 1)s_k^2$$

F-Statistic
Ratio of MST to MSE:
$$F = \frac{MST}{MSE}, \quad \text{with } (k-1,\ n-k) \text{ d.f.}$$
- Values of F close to 1 suggest that population means do not differ.
- Values further away from 1 suggest variation among means exceeds that within means (supports $H_a$).

Conditions Required for a Valid ANOVA F-Test: Completely Randomized Design
- Independent, randomly selected samples.
- All sampled populations have distributions that approximate the normal distribution.
- The k population variances are equal.

A Format for an ANOVA Summary Table
[Table: ANOVA summary table for a completely randomized design]
[Figure: ANOVA summary table, an example from Minitab]

Conducting an ANOVA for a Completely Randomized Design
1. Assure randomness of design, and independence and randomness of samples.
2. Check normality and equal variance assumptions.
3. Create the ANOVA summary table.
4. Conduct multiple comparisons for pairs of means as necessary or desired.
5. If $H_0$ is not rejected, consider possible explanations, keeping in mind the possibility of a Type II error.

9/24/2007
Chapter 5: Continuous Random Variables

The Normal Distribution
A normal random variable has a probability distribution
called a normal distribution.

The Normal Distribution
- Bell-shaped curve
- Symmetrical about its mean $\mu$
- Spread determined by the value of its standard deviation $\sigma$

Continuous Probability Distributions
- Continuous probability distribution: areas under the curve correspond to probabilities for x.
- Area A corresponds to the probability that x lies between a and b.
- Do you see the similarity in shape between the continuous and discrete probability distributions?

The Normal Distribution
The mean and standard deviation affect the flatness and center of the curve, but not the basic shape.

The Uniform Distribution
Uniform probability distribution: the distribution resulting when a continuous random variable is evenly distributed over a particular interval.

Probability Distribution for a Uniform Random Variable x
- Probability density function: $f(x) = \frac{1}{d - c}, \quad c \le x \le d$
- Mean: $\mu = \frac{c + d}{2}$
- Standard deviation: $\sigma = \frac{d - c}{\sqrt{12}}$
- $P(a < x < b) = \frac{b - a}{d - c}, \quad c \le a < b \le d$

The Normal Distribution
The function that generates a normal curve is of the form
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left[(x - \mu)/\sigma\right]^2}$
where
- $\mu$ = mean of the normal random variable x
- $\sigma$ = standard deviation
- $\pi$ = 3.1416...
- $e$ = 2.71828...
$P(x < a)$ is obtained from a table of normal probabilities.

9/24/2007

The Normal Distribution
- Probabilities associated with values or ranges of a random variable correspond to areas under the normal curve.
- Calculating probabilities can be simplified by working with a Standard Normal Distribution.
- A Standard Normal Distribution is a normal distribution with $\mu = 0$ and $\sigma = 1$.
- The standard normal random variable is denoted by the symbol z.

The Normal Distribution
What is $P(z > 1.64)$?
- The table gives the area between 0 and 1.64, which is .4495.
- Symmetry about the mean tells us that the total area to the right of the mean is .5.
- $P(z > 1.64) = .5 - .4495 = .0505$

The Normal Distribution
The table for the Standard Normal Distribution contains the probability for the area between 0 and z. A partial table shows the components of the table.

The Normal Distribution
What is $P(z < .67)$?
- The table gives us area $A_1$ (between 0 and .67).
- Symmetry about the mean tells us that $A_2 = .5$ (the area to the left of the mean).
- $P(z < .67) = A_1 + A_2 = .2486 + .5 = .7486$

The Normal Distribution
What is $P(-1.33 < z < 1.33)$?
- The table gives us area $A_1$ (between 0 and 1.33).
- Symmetry about the mean tells us that $A_2 = A_1$.
- $P(-1.33 < z < 1.33) = P(-1.33 < z < 0) + P(0 < z < 1.33) = A_2 + A_1 = .4082 + .4082 = .8164$

The Normal Distribution
What is $P(|z| > 1.96)$?
- The table gives us the area $.5 - A_2 = .4750$, so $A_2 = .0250$.
- Symmetry about the mean tells us that $A_2 = A_1$.
- $P(|z| > 1.96) = A_1 + A_2 = .0250 + .0250 = .05$

The Normal Distribution
What if the values of interest were not normalized? We want to know $P(8 < x < 12)$, with $\mu = 10$ and $\sigma = 1.5$.
- Convert to standard normal using $z = \frac{x - \mu}{\sigma}$.
- $P(8 < x < 12) = P(-1.33 < z < 1.33) = 2(.4082) = .8164$

The Normal Distribution
You can also use the table in reverse to find a z value that corresponds to a particular probability.
- What is the value of z that will be exceeded only 10% of the time?
- Look in the body of the table for the value closest to .4 and read the corresponding z value.
- $z = 1.28$

The Normal Distribution
Steps for Finding a Probability Corresponding to a Normal Random Variable
- Sketch the distribution, locate the mean, and shade the area of interest.
- Convert to standard z values using $z = \frac{x - \mu}{\sigma}$.
- Add the z values to the sketch.
- Use tables to calculate probabilities, making use of the symmetry property where necessary.

The Normal Distribution
Which values of z enclose the middle 95% of the standard normal z values?
- Using the symmetry property, $z_0$ and $-z_0$ must each bound an area of $.95/2 = .475$ on either side of the mean.
- From the table we find that $z_0$ and $-z_0$ are 1.96 and -1.96, respectively.

The Normal Distribution
Making an Inference
How likely is an observation in area A, given an assumed normal distribution with a mean of 27 and a standard deviation of 3?
- The z value for $x = 20$ is $-2.33$.
- $P(x < 20) = P(z < -2.33) = .5 - .4901 = .0099$
- You could reasonably conclude that this is a rare event.

The Normal Distribution
Given a normally distributed variable x with mean 550 and standard deviation 100, what value of x identifies the top 10% of the distribution?
- $P(x \ge x_0) = P\left(z \ge \frac{x_0 - 550}{100}\right) = .10$
- The z value corresponding to an area of .40 is 1.28.
- Solving for $x_0$: $x_0 = 550 + 1.28(100) = 550 + 128 = 678$

Descriptive Methods for Assessing Normality
- Evaluate the shape from a histogram or stem-and-leaf display.
- Compute intervals about the mean ($\bar{x} \pm s$, $\bar{x} \pm 2s$, $\bar{x} \pm 3s$) and the corresponding percentages.
- Compute the IQR and divide by the standard deviation. The result is roughly 1.3 if normal.
- Use a statistical package to evaluate a normal probability plot for the data.
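The table lookups in the worked examples above can be cross-checked numerically. The following is a minimal sketch, not part of the original slides, that assumes Python 3.8 or later and uses the standard library's `statistics.NormalDist`. The helper `prob_between` and all variable names are illustrative choices. Results may differ from the table answers in the third or fourth decimal place, because the table method rounds z to two decimals first.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
Z = NormalDist(mu=0.0, sigma=1.0)


def prob_between(dist, a, b):
    """Return P(a < X < b) as a difference of two cumulative probabilities."""
    return dist.cdf(b) - dist.cdf(a)


# Areas under the standard normal curve from the worked examples.
p_upper_164 = 1.0 - Z.cdf(1.64)              # P(z > 1.64)
p_lower_067 = Z.cdf(0.67)                    # P(z < .67)
p_central_133 = prob_between(Z, -1.33, 1.33) # P(-1.33 < z < 1.33)
p_two_tail_196 = 2.0 * (1.0 - Z.cdf(1.96))   # P(|z| > 1.96), by symmetry

# Non-standard normal: P(8 < x < 12) with mean 10 and standard deviation 1.5.
# NormalDist standardizes internally, so no manual z conversion is needed.
p_8_to_12 = prob_between(NormalDist(mu=10.0, sigma=1.5), 8.0, 12.0)

# Using the distribution "in reverse": inv_cdf maps a cumulative area to a value.
z_top_10 = Z.inv_cdf(0.90)                                # exceeded 10% of the time
z_mid_95 = Z.inv_cdf(0.975)                               # upper bound of middle 95%
x_top_10 = NormalDist(mu=550.0, sigma=100.0).inv_cdf(0.90)

# Rare-event check: P(x < 20) when the mean is 27 and the standard deviation is 3.
p_below_20 = NormalDist(mu=27.0, sigma=3.0).cdf(20.0)

# Normality rule of thumb: IQR divided by the standard deviation.
iqr_over_sd = Z.inv_cdf(0.75) - Z.inv_cdf(0.25)

if __name__ == "__main__":
    for name, value in [
        ("P(z > 1.64)", p_upper_164),
        ("P(z < .67)", p_lower_067),
        ("P(-1.33 < z < 1.33)", p_central_133),
        ("P(|z| > 1.96)", p_two_tail_196),
        ("P(8 < x < 12)", p_8_to_12),
        ("z exceeded 10% of the time", z_top_10),
        ("z bounding the middle 95%", z_mid_95),
        ("x marking the top 10%", x_top_10),
        ("P(x < 20)", p_below_20),
        ("IQR / standard deviation", iqr_over_sd),
    ]:
        print(f"{name}: {value:.4f}")
```

Computing each probability as a difference of cumulative areas mirrors the sketch-and-shade steps in the notes: every shaded region is the area to the left of one bound minus the area to the left of the other.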