STAT METH SOC RES 1 (STA 6126)
This 34 page Class Notes was uploaded by Golden Bernhard on Friday September 18, 2015. The Class Notes belongs to STA 6126 at University of Florida taught by Yasar Yesilcay in Fall. Since its upload, it has received 19 views. For similar materials see /class/206557/sta-6126-university-of-florida in Statistics at University of Florida.
Chapter 5: Statistical Inference

Methods of statistical inference:
- Point estimation
- Interval estimation
- Significance tests (Chapter 6)

5.1 Point Estimation

A point estimator of a parameter is a sample statistic that predicts the value of the parameter.
- A point estimator of the population mean μ is the sample mean: μ̂ = Ȳ.
- A point estimator of the population variance σ² is the sample variance S² (and σ̂ = S for the standard deviation).
- A point estimator of the population proportion π is the sample proportion: π̂ = p.

Desirable properties of estimators:
- Efficiency
- Unbiasedness
- Normality

The above estimators have the following properties:
1. They are efficient; i.e., one cannot find other estimators that have smaller standard errors, and these estimators are closer to the true parameter values.
2. They are unbiased: in repeated sampling, the estimates average out to give the true values of the parameters. (S is not an unbiased estimator of σ, but its bias is small and decreases as the sample size increases.)
3. The sample mean and the sample proportion (but not S) have approximately normal sampling distributions in large samples.

5.2 Confidence Interval for the Mean, for Large Samples

In Chapter 4 we wrote
0.95 = P(μ - 1.96·σ/√n < Ȳ < μ + 1.96·σ/√n).
We also said that this can be rewritten as
0.95 = P(Ȳ - 1.96·σ/√n < μ < Ȳ + 1.96·σ/√n).

A large-sample confidence interval for μ is
Ȳ ± z · S/√n,
where z is chosen in such a way that
P(-z ≤ Z ≤ z) = some desired confidence level.
The most frequently used confidence level is 95%, but sometimes we may also use 99% or 90% if we want more or less confidence. The larger the confidence level (good), the larger will be the confidence interval (not good).

Example: Suppose a random sample of 100 females selected from a population of females yielded a sample mean height of 166 and a sample standard deviation of 12. Find a 95% confidence interval for the population mean.

We are given that Ȳ = 166 and S = 12. From the tables of the normal distribution we can find that P(-1.96 < Z < 1.96) = 0.95. Hence a 95% confidence interval for the population mean is
CI = Ȳ ± z·S/√n = 166 ± 1.96 × 12/√100 = 166 ± 1.96 × 1.2 = 166 ± 2.352 = (163.648, 168.352).
Thus we are 95% confident that the population mean μ is a point in the interval from 163.648 to 168.352. Alternatively, we can state that we are 95% confident that the interval (163.648, 168.352) contains the true population mean.

Some questions you should ask yourself (and avoid the wrong answers):
1. Why "confidence" and not "probability"? What is the probability that the population mean μ is in the above interval; that is, is P(163.648 ≤ μ ≤ 168.352) = 0.95? NO!!
2. Does the confidence interval tell us anything about the population values? Can we say that 95% of the females in the population have heights between 163.648 and 168.352? NO.
3. Does the confidence interval tell us anything about the sample values? Can we say that 95% of the females in the sample have heights between 163.648 and 168.352? NO.
4. Does the confidence interval tell us anything about the sample mean? Can we say that the sample mean is in the interval with 95% confidence? NO. Actually, we are 100% confident: the sample mean is ALWAYS at the center of the confidence interval (look at the formula). But we are NOT interested in the sample mean; we are ONLY interested in making statements about the population mean.

Note that:
- A confidence interval IS a statement about a population parameter.
- A confidence interval IS NOT a statement about population values.
- A confidence interval IS NOT a statement about sample values.
- A confidence interval IS NOT a statement about the sample statistics.

A general formula to keep in mind: a confidence interval for a parameter is
(estimate of the parameter) ± ME,
where ME = margin of error = (table value) × (SE of the estimate).

Hence an approximate confidence interval for the population mean μ is
Ȳ ± ME, with ME = z · S/√n,
when the sample size is large.

5.3 CI for the Population Proportion π

Using the general formula, we may also write: an approximate confidence interval for π is
p ± ME, with ME = z · √(p(1 - p)/n),
when np ≥ 10 and n(1 - p) ≥ 10.

5.4 Deciding on the Sample Size

Remember the formulas for the margin of error:
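As a numerical check, the interval example above and the sample-size formulas of this section can be sketched in a few lines of Python (Python is an illustration choice, not something the notes use; the error bounds B = 2 and B = 0.03 and the planning value π = 0.5 below are hypothetical, not from the notes):

```python
import math

# 95% CI for the mean: ybar ± z * s / sqrt(n)  (height example above).
n, ybar, s, z = 100, 166.0, 12.0, 1.96
me = z * s / math.sqrt(n)        # margin of error = 1.96 * 12 / 10 = 2.352
ci = (ybar - me, ybar + me)      # (163.648, 168.352)

# Sample size for estimating a mean: n >= sigma^2 * (z/B)^2, rounded UP.
# B = 2 is a hypothetical bound on the estimation error; sigma guessed as 12.
sigma, B = 12.0, 2.0
n_mean = math.ceil(sigma**2 * (z / B)**2)          # 139

# Sample size for a proportion: n >= pi*(1-pi) * (z/B)^2.
# pi = 0.5 is the conservative planning value; B = 0.03 is hypothetical.
pi, B_p = 0.5, 0.03
n_prop = math.ceil(pi * (1 - pi) * (z / B_p)**2)   # 1068

print(me, ci, n_mean, n_prop)
```

Note that both sample-size results are rounded up with `ceil`, matching the rule that n must be an integer at least as large as the formula's value.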
For estimating μ: ME ≈ z · σ/√n.
For estimating π: ME ≈ z · √(π(1 - π)/n).

These are measures of sampling error (estimation error), that is, the difference in absolute value between the estimate and the true value of the parameter. Although we do not know the true value of the parameter, statistical theory lets us put an upper bound, say B, on the estimation error.

Thus, for estimating μ, we require
|Ȳ - μ| ≤ z · σ/√n ≤ B.
Solving this for n, we get n ≥ σ² × (z/B)². Then, to estimate the mean of a population where the estimate is not more than B units from the true value, with a confidence level of 1 - α, we need
n ≥ σ² × (z/B)².
Similarly, to estimate the population proportion where the estimate is not more than B units from the true value, with a confidence level of 1 - α, we need
n ≥ π(1 - π) × (z/B)².
Note that the above formulas usually give a value that is not an integer. We always round up that value, since n has to be an integer.

Factors that influence our decision on n. As can be seen in the formula for n, the sample size is influenced by:
1. B, the bound on estimation error (precision): the smaller the bound, the larger will be the required sample size.
2. The confidence level, reflected in z: the more confidence you need, the larger will be z, and hence n.
3. σ², the variability of the population: the more heterogeneous (variable) the population is, the larger will be the required sample size.
4. Available resources: no matter what you get from the formula, if you do not have the necessary money, time, manpower, and equipment, you cannot have as large a sample as you wish.
5. Complexity of the analyses: here we are thinking of only one variable; in many surveys we collect data on more than one variable and carry out more complex analyses, which may require larger samples.
6. Nonsampling errors: the larger the sample size, the less will be our control over nonsampling errors such as a low response rate.

Skip 5.5.

Chapter 9: Linear Regression and Correlation

As in the previous chapter, we are interested in the presence of an association (or independence) between two variables. However, there are two main differences in this chapter:
1. Both the response and the explanatory variable are quantitative.
2. We are interested in a linear relationship (association) between the response and the explanatory variable.

The response is also called the dependent variable, usually denoted by Y and always shown on the vertical axis of a scatter diagram. The explanatory variable is the independent variable, usually denoted by X and always shown on the horizontal axis of a scatter diagram. The explanatory variable is also called the predictor, because it helps to predict the values of the response.

9.1 Linear Relationship

The response Y is hypothesized to be a linear function of the independent variable X. Thus we have the following Simple Linear Regression (SLR) model:
Y = α + β·X + ε.
Alternatively, we may take the expected values (means) of both sides of the above relation and show the relation between these variables as
μ_Y = α + β·X.
This is called the true regression line. In this case we are using the assumptions that Y is a random variable, X is a fixed variable (not random), and the mean of the error term is E(ε) = 0.

In the above equations:
- Y: the response (dependent variable), a random variable that has a normal distribution with mean μ_Y and standard deviation σ.
- X: the explanatory variable (independent variable); a nonrandom variable.
- α: the intercept (y-intercept), the value of the mean of Y when X = 0. Note that this interpretation may not always be meaningful; extra care is needed in interpreting the intercept.
- β: the slope, the change in the mean of Y for a one-unit increase in X.
- ε: the error term (residual), which takes into account all other factors that are not in the model but which influence the values of Y; assumed to have a mean of zero and standard deviation σ.

9.2 Least Squares Prediction Equation

In the above models, α, β, and σ are called the parameters of the regression model, estimated by α̂ = a, β̂ = b, and σ̂ = S, respectively. The data come in n pairs; that is, for each unit in the sample we observe the values of X and Y and keep them as
a pair. Thus the data are shown as
(x1, y1), (x2, y2), …, (xn, yn).
Then, using these data, we get what is called the prediction equation
ŷ = a + b·x,
with ŷ denoting the predicted value of Y for some given value of X = x. The following are called the least squares estimators (LSE) of the parameters:
b = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² = (Σ xi·yi - n·x̄·ȳ) / (Σ xi² - n·x̄²),
a = ȳ - b·x̄.

Scatter Diagram

The first step of fitting a linear regression model to the data is to look at the data in a graph. Before we start the estimation and prediction process, we first draw a scatter diagram (or scatter plot) of the data. Such a diagram will give us two important bits of information:
(a) Does the relation between X and Y look linear?
(b) What can we say about the slope and the strength of the relationship?
A scatter diagram is a plot of the (x, y) pairs on a two-dimensional graph, with Y on the vertical axis and X on the horizontal axis.

Example: Suppose we are interested in investigating whether there is a linear relationship between
- mileage of cars (mpg), and
- horsepower of cars.
Observe that both variables are quantitative. Which one of these is the dependent variable (response), and which one is the independent (explanatory) variable? From the observed data on the 392 cars we get the scatter diagram in Figure 1.

Figure 1: A scatter diagram of Mileage vs. Horsepower, with the fitted line MilesperGallon = 39.85 - 0.16 × horse and R Square = 0.59. [Figure not reproduced here.]

What can we say about the relationship between the two variables?

Prediction Equation

We obtain the following output by clicking Analyze > Regression > Linear, and then specifying the dependent (mpg) and independent (horse) variables:

Coefficients(a)
Model        | Unstandardized B | Std. Error | Standardized Beta | t       | Sig.
1 (Constant) | 39.855           | .730       |                   | 54.578  | .000
  Horsepower | -.157            | .007       | -.771             | -23.931 | .000
a. Dependent Variable: Miles per Gallon

Thus the prediction equation is
ŷ = 39.855 - 0.157 × hp.
Interpret the slope and the intercept. Is the interpretation of the intercept meaningful?
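The least squares formulas above are easy to compute directly. Here is a minimal Python sketch (Python is an illustration choice), fitted to the ten (horsepower, mpg) pairs from the cars example; note that a fit to only ten cars will not reproduce the full-sample SPSS estimates:

```python
# Least squares fit of y = a + b*x using the formulas above.
# Data: ten (horsepower, mpg) pairs from the cars example; estimates
# from only ten observations differ from the full 392-car output.
x = [130, 165, 150, 150, 140, 198, 220, 215, 225, 190]
y = [18, 15, 18, 16, 17, 15, 14, 14, 14, 15]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# b = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2),  a = ybar - b*xbar
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sxy / sxx
a = ybar - b * xbar

# Residuals e_i = y_i - (a + b*x_i); for a least squares line they sum to 0.
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print(a, b)        # b comes out negative: mpg falls as horsepower rises
print(sum(resid))  # essentially 0, up to floating-point error
```

The zero-sum property of the residuals is one of the two defining conditions of the least squares line discussed in this section.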
Residual Analyses

The prediction equation provides predicted values of Y (mpg), using the horsepower of each car as the predictor (independent variable). The difference between the observed and the predicted values gives the residuals. For example, for a car that has hp = 130, the prediction equation gives
ŷ = 39.855 - 0.157 × 130 = 19.4.
For the first car the observed mileage is 18; hence the corresponding residual is 18 - 19.4 = -1.4.

The following table gives the values of y, x, ŷ, and the residual for the first 10 observations in the sample. Verify them to see if you have understood the concept of residuals.

Observed value (y) | Horsepower (x) | Predicted value (ŷ) | Residual (y - ŷ)
18 | 130 | 19.4 | -1.4
15 | 165 | 13.9 |  1.1
18 | 150 | 16.2 |  1.8
16 | 150 | 16.2 | -0.2
17 | 140 | 17.8 | -0.8
15 | 198 |  8.7 |  6.3
14 | 220 |  5.2 |  8.8
14 | 215 |  6.0 |  8.0
14 | 225 |  4.4 |  9.6
15 | 190 |  9.9 |  5.1

Effect of Outliers on the Prediction Equation

One of the reasons we are interested in the residuals is to see if there are any outliers in the data set. Such outliers will have large residuals in absolute value and may influence the estimates of the parameters, and hence the predictions, by pulling the prediction line upwards or downwards. We call these observations influential points, and we must carry out the analyses without them if they cannot be corrected. Note that some outliers may have no significant effect on the estimates. The crucial question is, of course, how large is "large"? For this it is better to look at the standardized residuals, which are the residuals divided by their standard deviation.

Method of Least Squares

The parameters are estimated using a method called least squares estimation (LSE). In this method, the values of a and b are determined in such a way that
(a) the sum of the residuals is zero, and
(b) the sum of the squared residuals, SSE = Σ(yi - ŷi)², is a minimum.
The formulas for a and b given above satisfy both of these conditions. For this reason the prediction equation is also called the least squares line. The value of the SSE that uses these values of a and b is
the smallest among all possible combinations of values for a and b; i.e., the SSE for the least squares line is smaller than the SSE one would have with any other line fitted to the given data set.

9.3 The Linear Regression Model

We have already defined the regression model in one of two forms: either
Y = α + β·X + ε,
or, in terms of the expected value (mean) of Y,
E(Y) = μ_Y = α + β·X.
The latter formulation indicates that the mean of the error terms is zero, E(ε) = 0. As a matter of fact, the regression model is based on the assumption
Y | X = x ~ N(μ_Y = α + β·x, σ).
This means the mean of Y will change as X changes; thus each value of X defines a population of Y's, but all populations of Y's have the same standard deviation σ.

Based on this assumption, the standard deviation σ is estimated by
σ̂ = √( SSE / (n - 2) ).
The denominator, n - 2, is the degrees of freedom. We lost two degrees of freedom because we estimated two unknown parameters, α and β. Statistical software gives the SSE as well as the df, and divides them to give an estimate of the variance, called the mean squared error (MSE), as seen in the following output (the ANOVA table) for the cars data:

ANOVA(b)
Model        | Sum of Squares | df  | Mean Square | F       | Sig.
1 Regression | 14169.756      | 1   | 14169.756   | 572.709 | .000(a)
  Residual   | 9649.237       | 390 | 24.742      |         |
  Total      | 23818.993      | 391 |             |         |
a. Predictors: (Constant), Horsepower
b. Dependent Variable: Miles per Gallon

Hence the mean squared error is
MSE = σ̂² = 9649.237 / 390 = 24.742.
The square root of the mean squared error is called the root mean square, and it is an estimate of the standard deviation of the populations of Y's. Hence, from the above output, we can write
root mean square = σ̂ = √MSE = √24.742 = 4.97.

The residuals in the car data are all divided by 4.97 to give the standardized residuals. The following output shows that the standardized residuals are in the range from -3.259 to 3.414; hence there may be a few influential points in the data set.

Residuals Statistics(a)
                     | Minimum | Maximum | Mean  | Std. Deviation | N
Predicted Value      | 3.64    | 32.61   | 23.45 | 6.020          | 392
Residual             | -16.212 | 16.980  | .000  | 4.968          | 392
Std. Predicted Value | -3.290  | 1.523   | .000  | 1.000          | 392
Std. Residual        | -3.259  | 3.414   | .000  | .999           | 392
a. Dependent Variable: Miles per Gallon

9.4 Measuring the Strength of Linear Association: the Correlation Coefficient

The Pearson product-moment correlation coefficient, denoted by r, shows the strength of the linear association between the two variables. It is calculated using
r = (S_X / S_Y) × b.
For the cars data set, r = -0.771. What does that tell you?

Properties of the Pearson correlation:
- The correlation coefficient is valid and meaningful only when there is a linear relationship; it measures the strength of the linear association between X and Y.
- The correlation is always between -1 and 1; that is, -1 ≤ r ≤ 1.
- The correlation coefficient r has the same sign as the slope. Thus r > 0 when β (and b) > 0; r < 0 when β (and b) < 0; also r = 0 when b = 0, and r is near zero when β = 0.
- r = ±1 exactly when there is a perfect linear relation between X and Y, i.e., when all data points are on the estimated regression line. Then there is no prediction error in Ŷ = a + b·X.
- The further away r is from zero, the stronger the linear association between X and Y. Thus r = 0.8 indicates a stronger association than r = 0.6.
- The correlation coefficient is a unitless measure; i.e., its value does not depend on units of measurement.
- The correlation between X and Y is the same as the correlation between Y and X (this is not true for the slope, though!).
- The correlation can also be interpreted as a standardized slope, in the sense that an increase in X by one standard deviation S_X will result in a change of r standard deviations S_Y in Y. This is seen in the relationship S_X · b = r · S_Y, which comes from the equality r = (S_X / S_Y) × b by cross-multiplication.

A PRE (Proportional Reduction in Error) Measure: the Coefficient of Determination R²

The coefficient of determination R² is the square of the correlation coefficient, r², sometimes expressed as a percentage, i.e., R² = 100 × r² (percent).

Properties of R²:
- R² = (TSS - SSE) / TSS.
- Always 0 ≤ R² ≤ 1 (or 0% ≤ R² ≤ 100%).
- R² is a measure of the strength of the linear association between X and Y.
- R² gives the amount (percent) of variation in
response Y that is explained by the explanatory (independent) variable X.
- R² gives the proportional reduction in error by using the linear prediction equation instead of the alternative of using the sample mean as a predictor of Y.
- R² does not depend on the units of measurement of X or Y, and it is itself unitless.

Here TSS = total sum of squares = Σ(yi - ȳ)², and SSE = sum of squared errors = Σ(yi - ŷi)². These are given in the ANOVA table.

9.5 Inferences for Slope and Correlation

As before, we will make inferences about the unknown population parameters. In this case the parameters of interest are:
- the true but unknown slope β, estimated by b;
- the true but unknown population correlation ρ, estimated by the sample correlation r.

We have the same 6-step testing procedure.

1. Assumptions:
(a) Both X and Y are quantitative variables.
(b) The sample is selected randomly.
(c) Y ~ N(μ_Y = α + β·x, σ) for each X = x.

2. Hypotheses: H0: β = 0 vs. Ha: β ≠ 0.
Note that the null hypothesis is equivalent to stating that X and Y are independent of each other, i.e., there is no linear relation between X and Y, whereas the alternative hypothesis is equivalent to stating that X and Y are linearly associated. We may also be interested in testing
H0: β = 0 vs. Ha: β < 0, or H0: β = 0 vs. Ha: β > 0.
In these cases we are also showing the direction of the linear association. Other null and alternative hypotheses are also possible.

3. Test statistic:
T = b / SE(b), with SE(b) = σ̂_b = √( SSE/(n - 2) ) / √( Σ(xi - x̄)² ).

4. The p-value. As before, its definition depends on Ha:
- If Ha: β ≠ 0, then p-value = 2 × P(T ≥ |T_cal|).
- If Ha: β < 0, then p-value = P(T ≤ T_cal).
- If Ha: β > 0, then p-value = P(T ≥ T_cal).
The calculated value of the test statistic and the p-value for a two-sided alternative are given in the computer output.

5. Decision: same as before.
6. Conclusion: same as before.

Confidence Interval for β

The general formula we saw before applies here:
CI for β = (estimate of β) ± ME, with ME = t × SE(b).
Here the estimate of β is b, and SE(b) = √( SSE/(n - 2) ) / √( Σ(xi - x̄)² ).

Example: Using the data on cars, we can look at the third panel of the computer output to make inferences about β:

Coefficients(a)
Model        | B      | Std. Error | Beta  | t       | Sig.
1 (Constant) | 39.855 | .730       |       | 54.578  | .000
  Horsepower | -.157  | .007       | -.771 | -23.931 | .000
a. Dependent Variable: Miles per Gallon

The calculated value of the test statistic is T_cal = -23.931, and for Ha: β ≠ 0 the p-value is 2 × P(T_390 ≥ 23.931) < 0.001. Since this is extremely small, we will reject H0 and conclude that the observed data give strong evidence of a linear relation between the horsepower of cars and their mileage.

CI for β: Since df = 390 is large, we use t(0.025) ≈ z(0.025) = 1.96. Hence a 95% CI for β is
-0.157 ± 1.96 × 0.007 = -0.157 ± 0.01372, that is, (-0.171, -0.143).
What does this CI tell us?

Interpretation of the CI for β: We are 95% confident that the true slope β is some number in the interval (-0.171, -0.143). Observe that, since this interval does not contain zero, we will reject the null hypothesis H0: β = 0 at the 5% level of significance. Furthermore, since both ends of the confidence interval are negative, we have evidence to conclude that the true slope is negative. That is, the observed data give evidence of a decreasing linear relation between the engine power and mileage of cars.

Computer Output

Until now we looked at parts of the SPSS output. Here is the complete output; there are 4 or 5 panels.

Panel 1: Variables Entered/Removed(b)
Model | Variables Entered | Variables Removed | Method
1     | Horsepower        | .                 | Enter
a. All requested variables entered
b. Dependent Variable: Miles per Gallon
This is the first panel of SPSS output. It will be useful in the next two chapters.

Panel 2: Model Summary(b)
Model | R       | R Square | Adjusted R Square | Std. Error of the Estimate
1     | .771(a) | .595     | .594              | 4.974
a. Predictors: (Constant), Horsepower
b. Dependent Variable: Miles per Gallon
In this panel, be careful with R: it is not equal to the correlation coefficient when the slope is negative.

Panel 3: Coefficients(a)
Model        | B      | Std. Error | Beta  | t       | Sig. | 95% CI for B: Lower | Upper
1 (Constant) | 39.855 | .730       |       | 54.578  | .000 | 38.419              | 41.290
  Horsepower | -.157  | .007       | -.771 | -23.931 | .000 | -.170               | -.145
a. Dependent Variable: Miles per Gallon
This is the panel from which you get a and b, the CIs for β and α, as well as the calculated values of the test statistics and the two-tailed p-values for testing H0: β = 0 vs. Ha: β ≠ 0 and H0: α = 0 vs. Ha: α ≠ 0.

Panel 4: ANOVA(b)
Model        | Sum of Squares | df  | Mean Square | F       | Sig.
1 Regression | 14169.756      | 1   | 14169.756   | 572.709 | .000(a)
  Residual   | 9649.237       | 390 | 24.742      |         |
  Total      | 23818.993      | 391 |             |         |
a. Predictors: (Constant), Horsepower
b. Dependent Variable: Miles per Gallon
We will see more details of using the ANOVA table in the next two chapters. In this chapter we have seen that the MSE gives an estimate of σ². Note that the p-value in this table is also used for testing H0: β = 0 vs. Ha: β ≠ 0. The sum of squares for residuals is the SSE.

Panel 5: Residuals Statistics(a)
                     | Minimum | Maximum | Mean  | Std. Deviation | N
Predicted Value      | 3.64    | 32.61   | 23.45 | 6.020          | 392
Residual             | -16.212 | 16.980  | .000  | 4.968          | 392
Std. Predicted Value | -3.290  | 1.523   | .000  | 1.000          | 392
Std. Residual        | -3.259  | 3.414   | .000  | .999           | 392
a. Dependent Variable: Miles per Gallon
This panel is optional and is printed if you ask SPSS to calculate the residuals. We look at it to see if there are any points (observations) with large standardized residuals; these may be influential points and need to be checked.

Inference on ρ, the Pearson Correlation

Hypotheses: H0: ρ = 0 vs. Ha: ρ ≠ 0 (or Ha: ρ > 0, or Ha: ρ < 0).
Test statistic:
T = r / √( (1 - r²) / (n - 2) ) ~ t with n - 2 df.
All other steps are the same as before.

9.6 Model Assumptions and Violations

1. Linear relation between X and Y: a slight violation of this assumption is not serious; we will have an approximation. However, if the relation is far from linear, then the results are meaningless and we must try other models.
2. Y ~ N(μ_Y|X, σ): inferences are reasonable if the assumption is approximately true. This assumption is almost never exactly true; the larger the sample size, the more robust the results are against violation of this assumption.

Extrapolation is dangerous. Any inferences and predictions are valid only for values of X that are within the range of the observed values of X,
between the minimum X-value and the maximum X-value. Predictions outside this range will not be valid, since we have no idea how the relation behaves outside that range.

Watch out for influential observations. A single observation may have large influence if its X-value is unduly large or unduly small, or if it falls quite far from the trend followed by the rest of the data. Such observations may change the estimates of α and β, as well as the correlation coefficient. We look for such points in the scatter diagram as well as in the standardized residuals. If there is a point that we suspect is influential, then:
1. Check if there is any recording error in the data. If yes, correct it and run the regression analysis again.
2. Check if the observation does not belong to the population of interest. If it does not, delete the observation and run the regression analysis again.
3. If there is no recording error and the observation does belong to the population of interest, delete the observation (one observation at a time) and run the regression analysis again. Then:
(a) if the estimates do not change significantly, you may use these results;
(b) if the estimates change significantly, then either collect more data and run the analyses again, or use another predictor.

Factors Influencing the Correlation

In addition to the outliers, the correlation coefficient may be influenced by the range of X-values in the data set, if that range is only a subset of the possible values in the population of interest. For best results, use a random sample from the population, so that the range of X-values is representative of the population values.

Regression Model with Error Term

We have already seen that the simple linear regression model is represented as
Y = α + β·X + ε,
where ε, called the error term, incorporates the effects of all factors other than X that influence Y. The error term is also interpreted as the population residual, since it is the vertical distance between the observed point and the true but unknown regression line:
ε = Y - μ_Y|X = Y - (α + β·X).

Models and Reality

Keep in mind that all models are approximate representations of reality. Hence we do not expect an exact linear relationship or an exact normal distribution. If the model seems to be too simplistic, we may modify (improve) the model after checking diagnostic tests, as shown in Chapters 11 to 15. Model building is an iterative process.
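To pull the chapter's formulas together, the key quantities for the cars example can be recomputed from the rounded values printed in the SPSS panels above. A short Python sketch (the last digits depend on that rounding):

```python
import math

# Summary values from the SPSS output for the cars data (rounded as printed).
n = 392
TSS = 23818.993          # total sum of squares (ANOVA table)
SSE = 9649.237           # residual (error) sum of squares (ANOVA table)
b, se_b = -0.157, 0.007  # slope and its standard error (Coefficients panel)

# Estimate of sigma: MSE = SSE/(n-2); root mean square = sqrt(MSE).
mse = SSE / (n - 2)               # 24.742, as in the ANOVA table
root_mse = math.sqrt(mse)         # about 4.974, as in the Model Summary

# R^2 = (TSS - SSE)/TSS; r takes the sign of the slope.
r2 = (TSS - SSE) / TSS            # about 0.595
r = -math.sqrt(r2)                # about -0.771 (negative because b < 0)

# Test statistic for rho: T = r / sqrt((1 - r^2)/(n - 2)).
t_r = r / math.sqrt((1 - r2) / (n - 2))   # about -23.93, close to SPSS's t

# 95% CI for beta: b ± 1.96 * SE(b)  (df = 390 is large, so t ≈ z).
ci = (b - 1.96 * se_b, b + 1.96 * se_b)   # about (-0.171, -0.143)

print(mse, root_mse, r2, r, t_r, ci)
```

Note that computing T = b/SE(b) from the rounded values -0.157 and 0.007 gives about -22.4 rather than the -23.931 printed by SPSS, which uses the unrounded b and SE(b); the t computed from r above agrees with SPSS because it uses the unrounded sums of squares.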