Regression & Variance
These 14 pages of class notes were uploaded by Henderson Lind II on Tuesday, September 8, 2015. The notes belong to MATH 463 at the University of Oregon, taught by Staff in Fall.
Date Created: 09/08/15
Lecture Notes for Math 463/563
Mathematical Methods of Regression Analysis

Qi-Man Shao
Department of Mathematics
University of Oregon

© 2005 by Qi-Man Shao. All rights reserved.

Chapter 1 Introduction to Regression Analysis

We are often interested in comparisons among several distributions or relationships among several variables. A study of data often leads us to ask whether there is a cause-and-effect relation between two or more variables. Regression analysis is a statistical method for investigating such relationships. The goal is to build a good model: a prediction equation relating the effect to its causes.

Example 1.1 Height and weight; overweight and health.

Example 1.2 Does smoking cause lung cancer? The table below summarizes a study carried out by government statisticians in England. The data concern 25 occupational groups and are condensed from data on thousands of individual men. One variable is the smoking ratio, which is a measure of the number of cigarettes smoked per day by men in each occupation relative to the number smoked by all men of the same age. The other variable is the standardized mortality ratio.

Smoking   Mortality     Smoking   Mortality
   77         84          112         96
  137        116          113        144
  117        123          110        139
   94        128          125        113
  116        155          133        146
  102        101          115        128
  111        118          105        115
   93        113           87         79
   88        104           91         85
  102         88          100        120
   91        104           76         60
  104        129           66         51
  107         86

Response variable, explanatory variable

A response variable measures an outcome of a study. An explanatory variable explains or causes changes in the response variable.

By convention, $y$ denotes the response variable, and $x_1, x_2$, etc., denote the explanatory variables.

Note: The response variable is also called the dependent variable, and the explanatory variable is also called the independent or predictor variable.

Warning:
• When you observe two variables, there may or may not be a cause-and-effect relationship.

Scatter Plots

Always plot the explanatory variable, if there is one, on the horizontal axis. If there is no explanatory-response distinction, either variable can go on the
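horizontal axis.

Interpreting scatterplots
• Overall pattern
• Direction
  - Positive association: above-average (below-average) values of $x$ tend to accompany above-average (below-average) values of $y$.
  - Negative association: above-average (below-average) values of $x$ tend to accompany below-average (above-average) values of $y$.
• Form of relationship: linear?
• Strength of the relationship
• Outliers or other deviations from the pattern

General form of the probabilistic model in regression

Assume that $y$ is the response variable and $x_1, x_2, \ldots, x_k$ are predictor variables:

$$y = E(y) + \varepsilon,$$

where
• $y$ is the response variable;
• $E(y)$ is the mean of $y$ and is a function of $x_1, \ldots, x_k$ and some parameters $\beta_0, \beta_1, \beta_2, \ldots$;
• $\varepsilon$ is the random error.

• Linear model: $E(y)$ is a linear function of the parameters. Examples:
  $$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon, \qquad y = \beta_0 + \beta_1 x + \beta_2 z + \varepsilon.$$
• Non-linear model: $E(y)$ is not a linear function of the parameters. Examples:
  $$y = \beta_0 + \beta_1 x^{\beta_2} + \varepsilon, \qquad y = \frac{\beta_0}{1 + e^{\beta_1 x}} + \varepsilon.$$

In this course, we will focus on
• Simple linear regression model: $y = \beta_0 + \beta_1 x + \varepsilon$
• Multiple linear regression model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$
• Model building
• Estimation, inference
• Diagnostics
• Model checking
• Non-linear regression

Chapter 2 Simple Linear Regression

In this chapter we discuss the simple linear model that relates a response variable $y$ to a single predictor variable $x$. We will use the method of least squares to fit the model to a set of data points, then perform statistical inference such as constructing prediction intervals, and finally carry out the analysis of variance.

The term regression and the general methods for studying relationships now included under this term were introduced in the 1880s by Francis Galton, the renowned British biologist. Galton was engaged in the study of heredity. One of his observations was that the children of tall parents tend to be taller than average, but not as tall as their parents. This "regression toward mediocrity" gave these statistical methods their name. The term regression persists to this day to describe statistical relations between variables.

Parents' height   Children's height
     64.5               65.8
     65.5               66.7
     66.5               67.2
     67.5               67.6
     68.5               68.2
     69.5               68.9
     70.5               69.5
     71.5               69.9
     72.5               72.2

2.1 Simple linear regression model

$$y = \beta_0 + \beta_1 x + \varepsilon,$$

where
• $y$ is the response variable, and $x$ is the predictor variable;
• $\varepsilon$ is the random error, with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$.

Note: $E(y) = \beta_0 + \beta_1 x$ and $\mathrm{Var}(y) = \sigma^2$.

For $n$ observations on the predictor variable $x$ and the response variable $y$,

$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n),$$

the statistical model becomes

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,$$

where
• the $\varepsilon_i$ are assumed to be independent with $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$;
• the parameters are $\beta_0$, $\beta_1$, $\sigma$;
• $x_i$ is a known constant, not a random variable.

Important Features of the Model
• Mean response: $E(y_i) = \beta_0 + \beta_1 x_i$. Population regression line, or regression function: $E(y) = \beta_0 + \beta_1 x$.
• $\mathrm{Var}(y_i) = \sigma^2$.
• $\beta_0$ and $\beta_1$ are called the regression coefficients; $\beta_1$ is the slope of the regression line, and $\beta_0$ is the intercept of the regression line.

2.2 Fitting the Model: The Method of Least Squares

To find good estimators of the regression parameters $\beta_0$ and $\beta_1$, we employ the method of least squares. Let

$$Q = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2.$$

The estimators of $\beta_0$ and $\beta_1$ are those values $\hat\beta_0$ and $\hat\beta_1$ that minimize $Q$. They satisfy

$$\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0 \quad \text{and} \quad \sum_{i=1}^{n} x_i (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0,$$

so that

$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x,$$

where

$$\bar x = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar y = \frac{1}{n}\sum_{i=1}^{n} y_i,$$

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y) = \sum_{i=1}^{n} x_i y_i - n \bar x \bar y, \qquad S_{xx} = \sum_{i=1}^{n} (x_i - \bar x)^2 = \sum_{i=1}^{n} x_i^2 - n \bar x^2.$$

• Regression line (least squares line): $\hat y = \hat\beta_0 + \hat\beta_1 x$
• Fitted value or prediction: $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$
• Residual: $e_i = y_i - \hat y_i$

Properties of the Fitted Regression Line
• $\sum_{i=1}^{n} e_i = 0$
• $\sum_{i=1}^{n} \hat y_i = \sum_{i=1}^{n} y_i$
• The regression line always goes through the point $(\bar x, \bar y)$.
• $E(\hat\beta_1) = \beta_1$, $\mathrm{Var}(\hat\beta_1) = \dfrac{\sigma^2}{S_{xx}}$
• $E(\hat\beta_0) = \beta_0$, $\mathrm{Var}(\hat\beta_0) = \sigma^2\left(\dfrac{1}{n} + \dfrac{\bar x^2}{S_{xx}}\right)$

Estimator of $\sigma^2$:

$$s^2 = \frac{1}{n-2}\sum_{i=1}^{n} (y_i - \hat y_i)^2 = \frac{1}{n-2}\left(S_{yy} - \frac{S_{xy}^2}{S_{xx}}\right), \qquad S_{yy} = \sum_{i=1}^{n} (y_i - \bar y)^2.$$

Note: $n - 2$ is the degrees of freedom of $s^2$.

2.3 Inferences about $\beta_1$

Model Assumptions
(i) The $\varepsilon_i$ are assumed to be independent with $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$.
(ii) Each $\varepsilon_i$ is normally distributed.

Sampling distribution of $\hat\beta_1$
• $E(\hat\beta_1) = \beta_1$, $\mathrm{Var}(\hat\beta_1) = \dfrac{\sigma^2}{S_{xx}}$
• Standard deviation of $\hat\beta_1$: $\sigma_{\hat\beta_1} = \sigma / \sqrt{S_{xx}}$
• Standard error of $\hat\beta_1$: $s_{\hat\beta_1} = s / \sqrt{S_{xx}}$
• $\dfrac{\hat\beta_1 - \beta_1}{\sigma_{\hat\beta_1}} \sim N(0, 1)$, $\quad \dfrac{\hat\beta_1 - \beta_1}{s_{\hat\beta_1}} \sim T(n-2)$

Here $T(n-2)$ is the $t$ distribution with $n - 2$ degrees of freedom, $T_{n-2}$ denotes a random variable with that distribution, and $t_{n-2}(\alpha)$ is its upper $\alpha$ critical value.

Confidence interval for $\beta_1$

A level $100(1-\alpha)\%$ confidence interval for $\beta_1$ is

$$\hat\beta_1 \pm t_{n-2}(\alpha/2)\, s_{\hat\beta_1}.$$

Tests concerning $\beta_1$

Null hypothesis: $H_0: \beta_1 = \beta_{10}$

$t$ statistic: $t = \dfrac{\hat\beta_1 - \beta_{10}}{s_{\hat\beta_1}}$

$H_a$                          P-value                          Reject $H_0$ at level $\alpha$ if
$\beta_1 > \beta_{10}$         $P(T_{n-2} \ge t)$               $t \ge t_{n-2}(\alpha)$
$\beta_1 < \beta_{10}$         $P(T_{n-2} \le t)$               $t \le -t_{n-2}(\alpha)$
$\beta_1 \ne \beta_{10}$       $2P(T_{n-2} \ge |t|)$            $|t| \ge t_{n-2}(\alpha/2)$

Example 2.1 What is the relationship between parents' heights and their adult children's heights? The table below shows one of the data sets given by Galton.

Parents' height   Children's height
     64.5               65.8
     65.5               66.7
     66.5               67.2
     67.5               67.6
     68.5               68.2
     69.5               68.9
     70.5               69.5
     71.5               69.9
     72.5               72.2

It is given that $\bar x = 68.5$, $\bar y = 68.4444$, $S_{xy} = 41.1$, $S_{xx} = 60$, $S_{yy} = 29.9022$.
a) Find the equation of the least squares regression line.
b) Give a 90% confidence interval for the slope.
c) Test $H_0: \beta_1 = 0$ against $H_a: \beta_1 > 0$ at the 0.05 significance level.

Solution:

2.4 Confidence Intervals for Mean Response

A common objective in regression analysis is to estimate the mean response. For a specific value of $x$, say $x^*$, the mean of the response is given by

$$\mu_y = \beta_0 + \beta_1 x^*.$$

Estimate of $\mu_y$:

$$\hat\mu_y = \hat\beta_0 + \hat\beta_1 x^*.$$

Sampling distribution of $\hat\mu_y$
• $E(\hat\mu_y) = \beta_0 + \beta_1 x^*$
• $\sigma^2_{\hat\mu_y} = \sigma^2\left(\dfrac{1}{n} + \dfrac{(x^* - \bar x)^2}{S_{xx}}\right)$
• $\dfrac{\hat\mu_y - \mu_y}{\sigma_{\hat\mu_y}} \sim N(0, 1)$, $\quad \dfrac{\hat\mu_y - \mu_y}{s_{\hat\mu_y}} \sim T(n-2)$, where
  $$s^2_{\hat\mu_y} = s^2\left(\frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}\right).$$

A level $100(1-\alpha)\%$ confidence interval for $\mu_y$ is

$$\hat\mu_y \pm t_{n-2}(\alpha/2)\, s_{\hat\mu_y}.$$

Example 2.2 Refer to Example 2.1. Find a 95% confidence interval for the mean height of children whose parents' height is 70 inches.

Solution:

2.5 Prediction Intervals

We now consider the prediction of a new observation $y$ corresponding to a given level $x^*$ of the predictor variable. The new observation on $y$ to be predicted is viewed as the result of a new trial, independent of the trials on which the regression analysis is based.

For a specific value of $x$, say $x^*$, the response value is

$$y = \beta_0 + \beta_1 x^* + \varepsilon,$$

where $\varepsilon$ is an unobservable random error. The predicted response $\hat y$ is given by

$$\hat y = \hat\beta_0 + \hat\beta_1 x^*.$$

The predicted value is the same as the estimate of the mean response. However, the margin of error is larger, because it is harder to predict one individual value than to estimate the mean.

Sampling distribution of $\hat y - y$
• $\sigma^2_{\hat y - y} = \sigma^2\left(1 + \dfrac{1}{n} + \dfrac{(x^* - \bar x)^2}{S_{xx}}\right)$
• $\dfrac{\hat y - y}{\sigma_{\hat y - y}} \sim N(0, 1)$
• $\dfrac{\hat y - y}{s_{\hat y - y}} \sim T(n-2)$, where
  $$s^2_{\hat y - y} = s^2\left(1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}\right).$$

A level $100(1-\alpha)\%$ prediction interval for $y$ is

$$\hat y \pm t_{n-2}(\alpha/2)\, s_{\hat y - y}.$$

Example 2.3 Refer to Example 2.1. If Mark's parents are 70 inches tall, find a 95% prediction interval for Mark's height.

Solution:

Example 2.4 Can the highest price next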
day of a stock be predicted from today's closing price? The table below gives the closing prices and the next-day highest prices of a stock on NASDAQ.

Closing price (x)   Highest price next day (y)
      27.94                 27.38
      26.75                 27.44
      26.19                 26.81
      27.19                 27.50
      26.69                 28.13
      27.87                 28.44
      37.06                 39.38
      36.81                 37.06
      36.38                 36.13
      33.50                 34.63
      31.44                 33.88
      33.25                 33.63
      34.56                 36.19
      34.25                 35.06
      33.19                 34.00
      32.00                 31.75
      31.25                 31.13
      30.00                 30.50
      28.31                 31.38
      28.56                 29.88

It is known that $\bar x = 31.16$, $\bar y = 32.01$, $S_{xy} = 247.93$, $S_{xx} = 247.157$, $s = 0.9825$.
a) Find the equation of the least squares regression line.
b) Give a 95% confidence interval for the slope and for the intercept.
c) Test $H_0: \beta_1 = 0$ against $H_a: \beta_1 > 0$ at the 0.05 significance level.
d) Suppose that today's closing price is 25. Find an 80% prediction interval for tomorrow's highest price.

Solution:

Example 2.5 Is Old Faithful faithful? One of the most well-known geysers is Old Faithful, located in Yellowstone National Park in Wyoming. Old Faithful has two reservoirs, which fire either together or separately. Accurate predictions of eruption times allow the Park Service to inform visitors of the approximate time of the next eruption. Visitors in turn can adjust their schedules appropriately. We will use data from Old Faithful's July 1995 eruptions to investigate estimating the time interval between consecutive eruptions.

Using the data from the July 1995 eruptions of Old Faithful, explore the relationship between the duration of the current eruption and the length of time between the current eruption and the next eruption.
a) Create a scatterplot of the duration of the current eruption vs. the length of time between the current eruption and the next eruption. Do the data exhibit a linear relationship?
b) Obtain the least squares estimates of $\beta_1$ and $\beta_0$, and state the estimated regression function.
c) Based on the regression equation found in part (b), predict the amount of time between the current eruption and the next eruption, given that the duration of the current eruption is 3.0 minutes.
d) Obtain a 95% confidence interval for the mean waiting time for
the next eruption if the duration of the current eruption is 3.0 minutes.
e) Obtain a 95% prediction interval for the amount of time between the current eruption and the next eruption if the duration of the current eruption is 3.0 minutes.
f) Should the Park Service use the 95% confidence interval or the 95% prediction interval? Why?

Solution:
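
Numerical check on the Galton data. The formulas of Sections 2.2 through 2.5 can be tried out on the data of Example 2.1 (this is separate from Example 2.5, whose data are not listed in these notes). The following is a minimal Python sketch, not part of the original notes. The function and variable names (`fit_simple_linear`, `interval`, `T7_05`, `T7_025`) are illustrative assumptions, and the critical values $t_7(0.05) = 1.895$ and $t_7(0.025) = 2.365$ are taken from a standard $t$ table rather than computed.

```python
from math import sqrt

def fit_simple_linear(x, y):
    """Least squares fit of y = b0 + b1*x using the formulas of Section 2.2."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                               # slope estimate
    b0 = ybar - b1 * xbar                        # intercept estimate
    s = sqrt((syy - sxy ** 2 / sxx) / (n - 2))   # estimate of sigma
    return {"n": n, "xbar": xbar, "sxx": sxx, "b0": b0, "b1": b1, "s": s}

def interval(center, t_crit, se):
    """Symmetric interval: center +/- t_crit * se."""
    return (center - t_crit * se, center + t_crit * se)

# Galton data from Example 2.1
parents = [64.5, 65.5, 66.5, 67.5, 68.5, 69.5, 70.5, 71.5, 72.5]
children = [65.8, 66.7, 67.2, 67.6, 68.2, 68.9, 69.5, 69.9, 72.2]

# Upper critical values of the t distribution with n - 2 = 7 degrees of
# freedom, read from a standard t table (assumed, not computed here).
T7_05 = 1.895    # t_7(0.05)
T7_025 = 2.365   # t_7(0.025)

fit = fit_simple_linear(parents, children)

# Section 2.3: standard error of the slope, 90% interval, one-sided test
se_slope = fit["s"] / sqrt(fit["sxx"])
slope_ci_90 = interval(fit["b1"], T7_05, se_slope)
t_stat = (fit["b1"] - 0.0) / se_slope   # H0: beta1 = 0
reject_h0 = t_stat >= T7_05             # Ha: beta1 > 0 at level 0.05

# Sections 2.4 and 2.5: mean response and new observation at x* = 70
x_star = 70.0
y_hat = fit["b0"] + fit["b1"] * x_star
spread = 1 / fit["n"] + (x_star - fit["xbar"]) ** 2 / fit["sxx"]
mean_ci_95 = interval(y_hat, T7_025, fit["s"] * sqrt(spread))
pred_pi_95 = interval(y_hat, T7_025, fit["s"] * sqrt(1 + spread))

print("slope:", fit["b1"], "intercept:", fit["b0"], "s:", fit["s"])
print("90% CI for slope:", slope_ci_90)
print("t statistic:", t_stat, "reject H0:", reject_h0)
print("95% CI for mean response at 70:", mean_ci_95)
print("95% prediction interval at 70:", pred_pi_95)
```

On these data the sketch gives a slope of 0.685 and an intercept of about 21.52, in agreement with $S_{xy}/S_{xx} = 41.1/60$ and $\bar y - \hat\beta_1 \bar x$. The prediction interval is wider than the confidence interval for the mean response because of the extra 1 inside the square root.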