This 31 page Class Notes was uploaded by Orval Funk on Monday September 28, 2015. The Class Notes belongs to STAT101 at University of Pennsylvania taught by Staff in Fall. Since its upload, it has received 21 views. For similar materials see /class/215432/stat101-university-of-pennsylvania in Statistics at University of Pennsylvania.
Date Created: 09/28/15
STAT 101, Module 4: Fitting Straight Lines
(Textbook: Section 3.5)

From Linear Association and Correlation to Straight Lines

• What exactly do we mean when we say that the variables x and y are linearly associated? One hypothetical answer is as follows: if we had lots of y-values for each x-value, and if we formed the means of the y-values at each x-value, then we would say that y is linearly associated with x if the y-means fell on a straight line:

$\mathrm{mean}(y \mid x) = a\,x + b$

Read "mean(y | x)" as "mean of the y-values at x". Instead of "at x" we also say "conditional on x".

Example: The variable Height in PennStudents.JMP has its values rounded to half inches. If we round to full inches, we get several Weight values for each Height value; hence we can form the means of the Weights for each value of Height. This is depicted in the plot below:
o the fine dots show the Heights and Weights of individual cases, and
o the fat dots show the means of the Weights for each value of Height.
If we ignore the leftmost two points and the rightmost point, for which there are only single cases (hence no averaging), we find that the fat dots follow a straight line quite closely. If we had more cases (N larger), the fat dots might follow the line even more closely.

[Figure: scatterplot of WEIGHT against HEIGHT; fine dots are individual cases, fat dots are the mean Weights at each Height]

Thus we say: Weight is linearly associated with Height if the conditional means of Weight fall near a straight-line equation in Height, which is the case for this dataset.

The Straight Line Equation: Interpretation and Fitting

• The equation for the straight line in the above plot is

$\mathrm{mean}(\text{Weight} \mid \text{Height}) = 5.412 \cdot \text{Height} - 216.650$

Some comments and definitions:
o The coefficient 5.412 is called the slope. It expresses the fact that a difference of 1 inch in Height corresponds to a difference of 5.412 lb in Weight on average. Note that by talking about differences we avoid the causal trap: we are not saying that increasing Height increases Weight. This would be nonsense anyway because we can't increase Height.
o The coefficient -216.650 is called the intercept. It would express that students with zero
Height would on average have a negative weight of -216.650 lb, which is obvious nonsense. The reason is that zero Height is an unrealistic extrapolation, that is, reaching outside the range of the data. This constant is simply needed to express the best-fitting line on the range of the data (Heights from about 60 in to about 76 in).
o Together, slope and intercept are called the regression coefficients for the regression of Weight on Height.
o Regression means fitting a straight line to data.
o The variable Height is the predictor variable, or simply the predictor. It is also called the x-variable.
o The variable Weight is the response variable, or simply the response. It is also called the y-variable.
o We avoid the terms "independent" and "dependent" variables for x and y because these terms conjure up causality: "y depends on x" is proper language only in the controlled experiments familiar from high school science labs. Yet you should know which term means x and which means y because the terms are so common, even though misleading.
o The fitted line is written in general as $\hat{y} = b_1 x + b_0$, where in our case x = Height and y = Weight, $b_1$ is the slope and $b_0$ is the intercept. If one evaluates the equation at the observed values $x_i$, one writes $\hat{y}_i = b_1 x_i + b_0$. The hat on the y is meant to indicate an estimate or prediction of the y-values, not an actually observed y-value. Given the value of the x-variable, $\hat{y}$ is our best guess of the location of the y-values.

• Q: How does one fit a straight line to x-y data? How can these values, the slope 5.412 and the intercept -216.650, be found?
A: With the Least Squares or LS method.

[Figure: a few data points and a fitted line, plotted against x]

In detail: Form the so-called Residual Sum of Squares:

$\mathrm{RSS} = (y_1 - b_1 x_1 - b_0)^2 + (y_2 - b_1 x_2 - b_0)^2 + \dots + (y_N - b_1 x_N - b_0)^2 = (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_N - \hat{y}_N)^2$

The name of this quantity derives from the name "residual" for the deviation of the response value from the straight-line estimate, $e_i = y_i - \hat{y}_i$, and therefore

$\mathrm{RSS} = e_1^2 + e_2^2 + \dots + e_N^2$

This quantity really depends on the choice of the slope $b_1$ and the intercept $b_0$, hence we should write it as $\mathrm{RSS}(b_0, b_1)$. Imagine playing
with the values of $b_0$ and $b_1$ till you can't get the RSS any lower. For the Height-Weight data, for example, if we choose $b_1 = 5$ and $b_0 = -200$ we get RSS = 221406.839; if, however, we choose $b_1 = 5.412$ and $b_0 = -216.650$, we get RSS = 170935.4. It turns out that this is the lowest RSS value for the Height-Weight data: no other combination of values for $b_0$ and $b_1$ can beat it.

Here is an applet that lets you play interactively with the slope $b_1$ and the intercept $b_0$: [link not recoverable]
It shows five data points in red and a blue line. You can move the line up and down by gripping the intercept point, and you can play with the slope by gripping the right-hand part of the line. The plot shows the squared residuals as actual squares attached to the residuals, so you can think of the RSS as the sum of the areas of the squares. The applet allows you to move the red data points also. The beauty is that you see the residuals and the RSS change in real time as you manipulate the data points $(x_i, y_i)$ and the coefficients $(b_0, b_1)$. Unfortunately, the applet does not compute the exact Least Squares line.

The same applet also allows you to change the fitting criterion: instead of the RSS you can choose what we may call the RSA, the residual sum of absolute values. There is even a third criterion, the sum of squared orthogonal (perpendicular) distances.

You should absorb the idea that by playing with slope and intercept one can obtain a straight line that is nearest to the data in the sense that the RSS is made smallest. This is called the Least Squares method, and the coefficients $b_0$, $b_1$ that minimize $\mathrm{RSS}(b_0, b_1)$ are called the Least Squares or LS estimates of the intercept and slope.

Here is another applet that allows you to place or enter as many data points as you like: [link not recoverable]
You can make a guess as to the LS line, and you can then ask for the actual LS line to be shown. The coolest part is that you can move the data points around and the LS line follows in real time. The drawback of this applet is that it
does not show the RSS of your guessed line.

• Q: Why squared residuals? Why not absolute values of the residuals?
A: Once again, squares are good for doing algebra. Below we will give explicit formulas for the LS estimates of $b_0$ and $b_1$. There exist deeper reasons that have to do with the bell curve, but for this stay tuned. Minimizing the sum of absolute values of the residuals can be done also, but there are no explicit formulas. The first of the above applets lets you play with the RSA.

• Q: Why vertical distances and not orthogonal distances?
A: Because we want to predict Weight from Height. That is, the formula $b_1 x + b_0$, where x = Height, should produce a value that is close to y = Weight. This means vertical distance in the plot: the distance between the numbers $y$ and $b_1 x + b_0$ is the distance between the data point $(x, y)$ and the point on the line $(x, b_1 x + b_0)$. The first of the above applets lets you play with orthogonal distances also. The result is not regression for prediction but something that is called principal components, which has an entirely different use. If you think about it, orthogonal distance is messy. Do you remember working with formulas for orthogonal distances of points from lines in high school? They involve Pythagoras' formula, which blends horizontal and vertical distance into oblique (tilted) distance.

Regression in Practice

• Data example: Diamond.JMP
o JMP: Analyze > Fit Y by X > (select variables as usual) > OK; click the little red triangle in the top left of the scatterplot window > Fit Line. Aesthetics (thicker line): right-click on the scatterplot > Line Width Scale > 2
o Output: the only relevant part is the equation under "Linear Fit". Here:
Price (Singapore dollars) = -259.6259 + 3721.0249 * Weight (carats)

• Interpretations:
o Slope = 3721 (rounded): diamonds that differ by 1 carat in weight differ on average by S$3721 in price. Is this meaningful? Check the range of the weights. => We need to change units: diamonds that differ by 0.1 carat in weight differ on average by S$372.1 in price. Or with points (1/100 carat), often used by
traders: diamonds that differ by 1 point in weight differ on average by S$37.21 in price.
o Intercept = -259.6 (rounded): diamonds with zero weight have on average a price of S$-259.6. Not meaningful: again a case of extrapolation. The intercept is needed to produce the best-fitting line in the range of the data. What is the range here?

• Predictions: The linear equation can be used to estimate/predict average prices.
o JMP: click the little red triangle below the scatterplot, next to "Linear Fit" > Save Predicteds. This produces a new column with a formula that describes the fitted straight line. Every observed weight in the data now has its estimated average price in this new column.
o For predictions/estimated average prices for weights that are not in the dataset, add new rows to the data. JMP: Rows > Add Rows > ... > OK. Now enter the weight values you are interested in into the new rows, and the predictions will be calculated instantly.
o For example, if we enter a weight of 0.30 carats, the predicted price is shown as S$856.68. For a weight of 0.38 carats, the predicted price is S$1154.36. Finally, for a weight of 0.10 carats, the prediction is S$112.48.

[Table: Quantiles of Weight (carats) at 100%, 90%, 75%, 50%, 25%, 10%, 0%]

Comments on Extrapolation: None of the weights 0.10, 0.30, 0.38 exists in the data, hence these are true predictions. The value 0.38 is a slight extrapolation on the high side, as the highest weight seen in the data is 0.35. The predicted/estimated mean price of S$1154.36 is higher than the highest price, S$1086, seen in the data. Similarly, the weight 0.10 requires a slight extrapolation, this time on the low side, as 0.12 is the lowest weight seen in the data. Again, the predicted/estimated mean price of S$112.48 is lower than the lowest price, S$223, seen in the data.
Q1: Which of the two extrapolations would you trust less? Imagine you were the seller of two diamonds of weight 0.10 and 0.38 carats, respectively.
Q2: In general, how would you expect prices to deviate from the estimated line between 0.00 and 0.12 carats and above 0.35 carats, respectively?
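
The predictions and the extrapolation check discussed above can be reproduced outside JMP. The following is a minimal sketch, assuming the fitted line reads Price = -259.6259 + 3721.0249 * Weight and that the observed weights range from 0.12 to 0.35 carats; the names `predict_price` and `is_extrapolation` are made up for this illustration and are not JMP functions.

```python
# Fitted line for the diamond data, as read from the "Linear Fit" output:
#   Price (S$) = -259.6259 + 3721.0249 * Weight (carats)
INTERCEPT = -259.6259
SLOPE = 3721.0249

# Range of the observed weights (carats). Predictions for weights outside
# this range are extrapolations and deserve less trust.
WEIGHT_MIN, WEIGHT_MAX = 0.12, 0.35


def predict_price(weight):
    """Estimated mean price in S$ for a diamond of the given weight."""
    return INTERCEPT + SLOPE * weight


def is_extrapolation(weight):
    """True if the weight lies outside the range of weights seen in the data."""
    return weight < WEIGHT_MIN or weight > WEIGHT_MAX


if __name__ == "__main__":
    for w in (0.10, 0.30, 0.38):
        note = "extrapolation" if is_extrapolation(w) else "within data range"
        print(f"weight {w:.2f} carats: predicted price S${predict_price(w):.2f} ({note})")
```

The flag only compares x (Weight) with its observed range, in line with the idea that extrapolation is judged in terms of the predictor, not the response.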
Make the scatterplot with the fitted line; extend the x-range to include 0.00 and about 0.45 carats, and extend the y-range to reach well below zero and up to about S$1600.

Rule: Know the ranges of the observed x-values. Knowing the ranges of the y-values is good also, but extrapolation is defined in terms of x.

Quantiles of Weight (carats):
100.0%  maximum   0.35000
 90.0%            0.29300
 75.0%  quartile  0.25000
 50.0%  median    0.18000
 25.0%  quartile  0.16000
 10.0%            0.15000
  0.0%  minimum   0.12000

[Table: Quantiles of Price (Singapore dollars), same layout; values not recoverable]

And the Formulas are...

The LS estimates of slope and intercept can be obtained through explicit formulas:

$b_1 = \dfrac{\mathrm{cov}(x,y)}{s_x^2}$ and $b_0 = \bar{y} - b_1 \bar{x}$

See the Textbook, p. 71. Deriving these formulas requires derivatives from calculus and will not be done here. We will never hand-calculate these formulas from the raw x and y columns because this is what JMP is for. We will, however, do some minor algebra. For example, it is easy to see from the formula for the correlation coefficient, $c(x,y) = \mathrm{cov}(x,y)/(s_x s_y)$, that

$b_1 = c(x,y)\,\dfrac{s_y}{s_x}$

It follows further:

$s_x = s_y = 1$ and $\bar{x} = \bar{y} = 0 \;\Rightarrow\; b_1 = c(x,y)$ and $b_0 = 0$

In particular, if we regress the z-scores of y on the z-scores of x, the least squares line equation is $\hat{z}_y = c(x,y)\, z_x$. That is, after standardization of both x and y, the LS line runs through the origin and its slope is the correlation.

Some Weirdness: predicting y from x and x from y

Observation: Since we measure distance between the data points and the line in terms of vertical distance, there seems to exist an asymmetry between how we treat x and y.

Q: Comparing
o regression of y on x, that is, finding a formula in x that predicts y, and
o regression of x on y, that is, finding a formula in y that predicts x,
aren't we getting the same lines?
A: No, we are not. The reason was given in the above observation. We can easily see the consequences in the regression of the z-scores:

$\hat{z}_y = c(x,y)\, z_x$ (regress y on x)
$\hat{z}_x = c(x,y)\, z_y$ (regress x on y)

If we solve the second equation to predict $z_y$ from $z_x$, we get

$\hat{z}_y = \dfrac{1}{c(x,y)}\, z_x$

which is not the same as $\hat{z}_y = c(x,y)\, z_x$.

How sensible is the formula $\hat{z}_y = c(x,y)\, z_x$? Here are some special cases:
• If $c(x,y) = 1$,
then the data fall on the straight line $z_y = z_x$, so of course the best prediction formula for $z_y$ is $z_x$.
• Similarly for $c(x,y) = -1$.
• If $c(x,y) = 0$, then the x-values have no information for predicting the y-values with a linear formula. Hence the best prediction is $\hat{z}_y = 0$, ignoring x.

In general, note that in the formulas for the slope and intercept, if the correlation is zero, then
o the slope is zero and
o the intercept is the mean of the y-values.
Hence the best one can do in the presence of a zero correlation is fitting a horizontal straight line at the level of the overall mean of the observed y-values.

Changing Units of the x- and y-variables

Problem: We have an equation $\hat{y} = b_1 x + b_0$ that predicts precipitation (y) from average temperature (x) in a number of locations. Precipitation is measured in millimeters of rainfall (plus melted snow and hail and dew), whereas temperature is given in degrees Celsius. We need to translate the equation from metric to US units. How?

Complete solution:

1. Write the starting regression equation more intuitively as
Prec(mm) = b1(mm/C) * Temp(C) + b0(mm)
where the parentheses indicate the units. An example is
Prec(mm) = 0.558 (mm/C) * Temp(C) + 95.15 (mm)
This equation is obtained from the dataset PhilaMonthlyTempPrec.JMP. Re-create this equation and interpret the regression coefficients.

2. The target equation is
Prec(in) = b1(in/F) * Temp(F) + b0(in)

3. Re-express the old units (mm and C) in new units (in and F). That is, re-express both Prec(mm) and Temp(C) in US units:
Prec(mm) = 25.4 * Prec(in)
Temp(C) = 5/9 * (Temp(F) - 32)
(References: Wikipedia pages on conversion tables and on temperature conversion formulas; links not recoverable.)

4. Substitute in the regression equation:
25.4 * Prec(in) = b1(mm/C) * 5/9 * (Temp(F) - 32) + b0(mm)

5. Solve for the response in new units, Prec(in), and regroup to separate into a new slope times the predictor in new units, Temp(F), and constants that form the intercept in new units:
Prec(in) = (1/25.4) * (5/9) * b1(mm/C) * Temp(F) - (1/25.4) * 32 * (5/9) * b1(mm/C) + (1/25.4) * b0(mm)
         = 0.02187 * b1(mm/C) * Temp(F) - 0.70 * b1(mm/C) + 0.03937 * b0(mm)

Comparison with the target equation Prec(in) = b1(in/F) * Temp(F) + b0(in) yields
b1(in/F) = 0.02187 * b1(mm/C)
b0(in) = -0.70 * b1(mm/C) + 0.03937 * b0(mm)

If it helps, you can make this more concrete by assuming some values such as b1(mm/C) = 0.56 mm/C and b0(mm) = 96 mm.

Practice: Given a prediction formula Prec(cm) = b1(cm/C) * Temp(C) + b0(cm), find the conversion to degrees Kelvin.

Practice: Given a prediction formula for quantity sold in million items, Q(mill), based on price in US dollars, P($),
Q(mill) = b1(mill/$) * P($) + b0(mill)
translate to quantity in thousands, Q(1000), and price in Euros, P(€), assuming the 2007/02/09 conversion of P($) = 1.30 * P(€).

Simplified Solution for simple multiplicative changes: Often, as in the second practice example, the unit changes involve only simple multiplications such as
Q(mill) = Q(1000)/1000, P($) = 1.30 * P(€)
In this case the algebra is easy:
b1(1000/€) = 1000 * 1.30 * b1(mill/$)
b0(1000) = 1000 * b0(mill)

Measuring the Quality of Fit of Least Squares Lines

In principle the RSS is the measure for the quality of the fit of the line to the data. If nothing is said to the contrary, the RSS is the value obtained by the Least Squares (LS) estimate, that is, the minimum achievable $\mathrm{RSS}(b_0, b_1)$. There is a problem with the RSS, however: we don't know how much is much and how little is little. This requires some standardization.

• One way of making the RSS more interpretable is by turning it into something like a standard deviation. One could divide it by N or N-1 or, actually, N-2:

$s_e = \sqrt{\dfrac{\mathrm{RSS}}{N-2}} = \sqrt{\dfrac{e_1^2 + e_2^2 + \dots + e_N^2}{N-2}}$

In spite of the division by N-2, think of $s_e$ as having a division by N and hence an average of squared residuals under the root. If N is not tiny (greater than 30, say), the subtraction of 2 makes almost no difference. Still, the intellectually curious will wonder: why N-2? The partial answer is based on the following fact: fitting a straight line to residuals produces a zero slope and a zero intercept. Checking what the conditions $b_1 = 0$ and $b_0 = 0$ mean, we see that they imply $\mathrm{cov}(x,e) = 0$ and $\mathrm{mean}(e) = 0$. These are two linear equations for the numbers $e_1, e_2, \dots, e_N$. Similar to the argument for the division by N-1 in the case of the standard deviation, we are in a situation where knowing N-2 of the N residuals enables us to calculate the remaining
two residuals from the equations $\mathrm{cov}(x,e) = 0$ and $\mathrm{mean}(e) = 0$. Hence the division by N-2.

At any rate, think of $s_e$ as the residual standard deviation. Its units are those of the response variable. It is therefore a measure of dispersion of the observed y-values around the LS line. Recall that the usual standard deviation is a measure of dispersion around the mean.

JMP calls $s_e$ the "Root Mean Square Error" or RMSE, a term that is unfortunately very common. Our quibble is with the term "error", which to us is not the same as "residual", as we will see later. The term "Root Mean Square Residual" or RMSR would have been acceptable; it is also used, but less often.

• Another standardization of the RSS is by comparing it with the sum of squares around the mean. (The RSS is the sum of squares around the LS line.) Consider the following ratio:

$\dfrac{e_1^2 + e_2^2 + \dots + e_N^2}{(y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \dots + (y_N - \bar{y})^2}$

This ratio is a number between zero and one. Why? Because
numerator = $\min \mathrm{RSS}(b_0, b_1)$
denominator = $\mathrm{RSS}(b_0 = \mathrm{mean}(y),\, b_1 = 0)$
Hence the numerator cannot be greater than the denominator, and both are non-negative. The numerator measures how close the data fall to the line, while the denominator measures how close the data fall to the mean. Therefore the ratio measures how much the LS line beats the mean: the smaller the ratio, the more the line beats the mean.
Q: When is the ratio 1, when 0?

By convention one reverses the scale by defining

$R^2 = 1 - \dfrac{e_1^2 + e_2^2 + \dots + e_N^2}{(y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \dots + (y_N - \bar{y})^2}$

This is called "R Square" (JMP) or "R Squared". Another way to express $R^2$ is in terms of $s_e^2$ and $s_y^2$:

$R^2 = 1 - \dfrac{(N-2)\, s_e^2}{(N-1)\, s_y^2}$

The ratio $s_e^2 / s_y^2$ can be thought of as the fraction of variance unexplained, hence $R^2$ can be thought of as the fraction of explained variance. Whatever, just get used to the words. It's not a good term either. Some soften it to "fraction of variance accounted for by the regression".

Now here is a minor miracle:

$R^2 = \mathrm{cor}(x,y)^2$

We are not going to prove it, but it gives a new interpretation for the correlation: its square is the fraction of variance accounted for by the regression. The nearer the fraction is to 1, the better the
line summarizes the association between x and y.

JMP: $R^2$ and RMSE (= $s_e$) are reported in the following table:

Summary of Fit
RSquare                      0.978261
RSquare Adj                  0.977788
Root Mean Square Error       31.84052
Mean of Response             500.0833
Observations (or Sum Wgts)   48

The last two lines are self-explanatory: mean(y) and N. We can ignore "RSquare Adj" for now.

Diagnostics Check: Is Y Linearly Associated with X?

Straight lines can be fitted to any pair of quantitative variables x and y. Whether the fit makes sense is related to the question of whether the association is linear. Here is a diagnostic check that allows us to get a sense of whether the association is not really curved: draw a curve in the x-y scatterplot.

JMP: As usual, do Analyze > Fit Y by X > (select x and y) > click the top left red triangle near the plotting area > Fit Spline > Other > 0.01 flexible; then play with the slider in the horizontal box at the bottom. (If there is any other output, close it by clicking on its blue icon.)

By playing with the slider we can make the fitted spline curve more or less smooth or wiggly. Make the curve so smooth that there is no more than one inflection point, and preferably just a concave or convex curve. If the swings of the curve are large, comparable in magnitude with the residuals, then the association is more likely curved, not linear. This is obviously not an exact science; it requires some practice.

Examples:

[Figure: Bivariate Fit of Price (Singapore dollars) By Weight (carats), with Smoothing Spline Fit, lambda=0.000001; R-Square 0.986639, Sum of Squares Error 28663.28]

This curve is too wiggly. Make it smoother with a larger lambda: the larger lambda, the smoother the curve. The following plot has a smoother curve that is almost a straight line. We have pretty much a linear association. The little bit of convexity is so small that it should not worry us. Then again, it is consistent with the idea that convexity has to set in below 0.12 carats
and above 0.35 carats. What are the reasons for this idea?

[Figure: Bivariate Fit of Price (Singapore dollars) By Weight (carats), with Smoothing Spline Fit, lambda=0.00002283]

The following example is quite convincingly curved:

[Figure: Bivariate Fit of MPG Highway By Weight (000 lbs), with Smoothing Spline Fit, lambda=0.27162]

Then again, we might be suspicious that the feather-weight Honda Insight in the top left determines the curvature. As a rule, one should not rely on individual points for any pattern. But even after removing (JMP: select, exclude, hide) the Honda Insight as well as the Toyota Prius with the second highest MPG, we still see curvature:

[Figure: MPG Highway against Weight (000 lbs), with the Honda Insight and Toyota Prius excluded]

Diagnostics Check: Residual Plot

We learnt how to check for nonlinearity, at least informally, by fitting a curve and judging by eye whether a curve is needed or a straight line suffices to describe the data. Practitioners of regression go one step further by extracting the residuals from the regression and examining them separately. The idea is that if there is a linear association between x and y, then the residuals should look unstructured or random. The reason: subtracting the line from the response should leave behind residuals that are entirely unpredictable even if one knows x.

[Figure: an x-y plot with its fitted line, together with the corresponding plot of the residuals against x]

The above plot illustrates the connection between the original x-y plot and a plot of the residuals against x, which one might call an x-e plot but which is called a residual plot. Recall that residuals have zero mean and zero correlation with x; hence a residual plot should show no structure if the x-y association is linear: knowing x should give us no information about e.

JMP has two ways to plot residuals versus x. Either way, fit a line first, then click the red icon near "Linear Fit" below the x-y plot, then select
o "Plot Residuals" (a residual plot will appear below the output), or
o "Save Residuals" (creates a new column with residuals in the
spreadsheet; plot the residuals against x with Analyze > Fit Y by X > select x as X, the residuals as Y > OK).

The plots below show four examples of x-y plots with associated residual plots.
• The top left example is artificial and shows a perfect case of linear association.
• The bottom left example is real (Diamond.JMP) and also shows a satisfactory residual plot. Why satisfactory? Ask yourself whether knowing x (= Weight) helps to predict anything about the residuals. The answer is probably "no".
• The top right example is artificial and shows a nonlinear, convex association. The residual plot is unsatisfactory because x has information about the residuals: for small and large x-values the residuals are positive, and for intermediate x-values the residuals tend to be negative.
• The bottom right example is real (Accord2006.JMP) and shows unsatisfactory residuals also. Similar to the preceding example, x (= Year) has information about the residuals: small and large values of Year have positive residuals, and intermediate values of Year tend to have negative residuals.

When judging the plots below, keep in mind that individual points do not make evidence. Also, it does not matter how the x-values are distributed. E.g., in the two real data examples you see rounding in x, which is irrelevant.

[Figure: four x-y plots with their residual plots; the real-data panels show Price (S$) and Residual against Weight (carats), and Residual against Year (1990 to 2005)]

Quality of Prediction and Prediction Intervals

The predictions $\hat{y}(x) = b_0 + b_1 x$ would be more useful if one had an idea how precise they are. So one has to think about what "precision" might mean.

Thought 1: The predictions are precise if the actual y-values are not far off from their predictions. In other words, predictions are likely to be precise if the residuals are small. Smallness, on the other hand, is judged with measures of dispersion such as the standard deviation. But we already have $s_e$, the standard deviation of the residuals, as such a measure. In fact, $s_e$ is used to
judge the prediction quality of a regression equation. More on this later. Obviously, other measures of residual dispersion could be used as well.

Thought 2: One can go one step further by asking whether one couldn't augment the predictions $\hat{y}(x)$ with an interval around them. The idea would be to give an interval of the form $\hat{y}(x) \pm \text{constant}$, and this interval would be constructed such that it is likely to contain a large fraction, such as 95%, of actual observations y: among all y-values observed at x, we would want that about 19 out of 20 satisfy any of the following equivalent conditions: