# STATISTICS BUS DECIS QMB 3250


This 84-page set of class notes was uploaded by Dayne Schmidt PhD on Friday, September 18, 2015. The notes belong to QMB 3250 at the University of Florida, taught by Staff in Fall. Since the upload, they have received 28 views. For similar materials see /class/206899/qmb-3250-university-of-florida in Quantitative Methods In Business at University of Florida.


## 1 Simple Linear Regression I – Least Squares Estimation

Textbook Sections 18.1–18.3

Previously we have worked with a random variable $x$ that comes from a population that is normally distributed with mean $\mu$ and variance $\sigma^2$. We have seen that we can write $x$ in terms of $\mu$ and a random error component $\varepsilon$, that is, $x = \mu + \varepsilon$. For the time being, we are going to change our notation for our random variable from $x$ to $y$. So, we now write $y = \mu + \varepsilon$. We will now find it useful to call the random variable $y$ a dependent or response variable. Many times, the response variable of interest may be related to the values of one or more known or controllable independent or predictor variables. Consider the following situations:

LR1: A college recruiter would like to be able to predict a potential incoming student's first-year GPA based on known information concerning high school GPA ($x_1$) and college entrance examination score ($x_2$). She feels that the student's first-year GPA will be related to the values of these two known variables.

LR2: A marketer is interested in the effect of changing shelf height ($x_1$) and shelf width ($x_2$) on the weekly sales of her brand of laundry detergent in a grocery store.

LR3: A psychologist is interested in testing whether the amount of time to become proficient in a foreign language is related to the child's age.

In each case, we have at least one variable that is known (in some cases it is controllable), and a response variable that is a random variable. We would like to fit a model that relates the response to the known or controllable variables. The main reasons that scientists and social researchers use linear regression are the following:

1. Prediction – To predict a future response based on known values of the predictor variables and past data related to the process.

2. Description – To measure the effect of changing a controllable variable on the mean value of the response variable.

3. Control – To confirm that a process is providing responses (results) that we expect under the present operating conditions (measured by the levels of the predictor variables).

### 1.1 A Linear Deterministic Model

Suppose you are a vendor who sells a product that is in high demand (e.g., cold beer on the beach, cable television in Gainesville, or life jackets on the Titanic, to name a few). If you begin your day with 100 items, have a profit of $10 per item, and an overhead of $30 per day, you know exactly how much profit you will make that day, namely 100(10) − 30 = 970. Similarly, if you begin the day with 50 items, you can also state your profits with certainty. In fact, for any number of items you begin the day with, $x$, you can state what the day's profits will be. That is,

$$y = 10x - 30.$$

This is called a deterministic model. In general, we can write the equation for a straight line as

$$y = \beta_0 + \beta_1 x,$$

where $\beta_0$ is called the $y$-intercept and $\beta_1$ is called the slope. $\beta_0$ is the value of $y$ when $x = 0$, and $\beta_1$ is the change in $y$ when $x$ increases by 1 unit. In many real-world situations, the response of interest (in this example, the vendor's profit) cannot be explained perfectly by a deterministic model. In this case, we make an adjustment for random variation in the process.

### 1.2 A Linear Probabilistic Model

The adjustment people make is to write the mean response as a linear function of the predictor variable. This way, we allow for variation in individual responses ($y$), while associating the mean linearly with the predictor $x$. The model we fit is as follows:

$$E[y|x] = \beta_0 + \beta_1 x,$$

and we write the individual responses as

$$y = \beta_0 + \beta_1 x + \varepsilon.$$

We can think of $y$ as being broken into a systematic and a random component:

$$y = \underbrace{\beta_0 + \beta_1 x}_{\text{systematic}} + \underbrace{\varepsilon}_{\text{random}},$$

where $x$ is the level of the predictor variable corresponding to the response, $\beta_0$ and $\beta_1$ are unknown parameters, and $\varepsilon$ is the random error component corresponding to the response, whose distribution we assume is $N(0, \sigma)$, as before. Further, we assume the error terms are independent of one another (we discuss this in more detail in a later chapter). Note that $\beta_0$ can be interpreted as the mean response when $x = 0$, and $\beta_1$ can be interpreted as the change
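The deterministic vendor model above is simple enough to state as a one-line function; a minimal Python sketch:

```python
# Deterministic profit model from the vendor example:
# profit = $10 per item minus $30 daily overhead, i.e. y = 10x - 30.
def profit(items):
    """Daily profit for a vendor who starts the day with `items` units."""
    return 10 * items - 30

print(profit(100))  # 100(10) - 30 = 970
print(profit(50))
```

There is no error term here: given $x$, the profit is known with certainty, which is exactly what distinguishes the deterministic model from the probabilistic one that follows.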
in the mean response when $x$ is increased by 1 unit. Under this model, we are saying that $y|x \sim N(\beta_0 + \beta_1 x, \sigma)$. Consider the following example.

Example 1.1 – Coffee Sales and Shelf Space

A marketer is interested in the relation between the width of the shelf space for her brand of coffee ($x$) and weekly sales ($y$) of the product in a suburban supermarket (assume the height is always at eye level). Marketers are well aware of the concept of compulsive purchases, and know that the more shelf space their product takes up, the higher the frequency of such purchases. She believes that over the range of 3 to 9 feet, the mean weekly sales will be linearly related to the width of the shelf space. Further, among weeks with the same shelf space, she believes that sales will be normally distributed with unknown standard deviation $\sigma$ (that is, $\sigma$ measures how variable weekly sales are at a given amount of shelf space). Thus, she would like to fit a model relating weekly sales $y$ to the amount of shelf space $x$ her product receives that week. That is, she is fitting the model

$$y = \beta_0 + \beta_1 x + \varepsilon, \qquad \text{so that } y|x \sim N(\beta_0 + \beta_1 x, \sigma).$$

One limitation of linear regression is that we must restrict our interpretation of the model to the range of values of the predictor variables that we observe in our data. We cannot assume this linear relation continues outside the range of our sample data. We often refer to $\beta_0 + \beta_1 x$ as the systematic component of $y$, and $\varepsilon$ as the random component.

### 1.3 Least Squares Estimation of β0 and β1

We now have the problem of using sample data to compute estimates of the parameters $\beta_0$ and $\beta_1$. First, we take a sample of $n$ subjects, observing values $y_i$ of the response variable and $x_i$ of the predictor variable. We would like to choose as estimates for $\beta_0$ and $\beta_1$ the values $b_0$ and $b_1$ that "best fit" the sample data. Consider the coffee example mentioned earlier. Suppose the marketer conducted the experiment over a twelve-week period (4 weeks with 3' of shelf space, 4 weeks with 6', and 4 weeks with 9'), and observed the sample data in Table 1.

| Shelf Space ($x$) | Weekly Sales ($y$) | Shelf Space ($x$) | Weekly Sales ($y$) |
|---|---|---|---|
| 6 | 526 | 6 | 434 |
| 3 | 421 | 3 | 443 |
| 6 | 581 | 9 | 590 |
| 9 | 630 | 6 | 570 |
| 3 | 412 | 3 | 346 |
| 9 | 560 | 9 | 672 |

Table 1: Coffee sales data for $n = 12$ weeks

Figure 1: Plot of coffee sales vs. amount of shelf space

Now look at Figure 1. Note that while there is some variation among the weekly sales at $x = 3'$, $6'$, and $9'$, respectively, there is a trend for the mean sales to increase as shelf space increases. If we define the fitted equation to be an equation

$$\hat{y} = b_0 + b_1 x,$$

we can choose the estimates $b_0$ and $b_1$ to be the values that minimize the distances of the data points to the fitted line. Now, for each observed response $y_i$, with a corresponding predictor variable $x_i$, we obtain a fitted value $\hat{y}_i = b_0 + b_1 x_i$. So, we would like to minimize the sum of the squared distances of each observed response to its fitted value. That is, we want to minimize the error sum of squares, $SSE$, where

$$SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}\left(y_i - (b_0 + b_1 x_i)\right)^2.$$

A little bit of calculus can be used to obtain the estimates:

$$b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{SS_{xy}}{SS_{xx}},$$

and

$$b_0 = \bar{y} - b_1\bar{x} = \frac{\sum_{i=1}^{n} y_i}{n} - b_1\,\frac{\sum_{i=1}^{n} x_i}{n}.$$

An alternative formula, but exactly the same mathematically, is to compute the sample covariance of $x$ and $y$, as well as the sample variance of $x$, then take the ratio. This is the approach your book uses, but it is extra work compared to the formula above:

$$\text{cov}(x,y) = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{n-1} = \frac{SS_{xy}}{n-1}, \qquad s_x^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1} = \frac{SS_{xx}}{n-1}, \qquad b_1 = \frac{\text{cov}(x,y)}{s_x^2}.$$

Some shortcut equations, known as the corrected sums of squares and cross-products, that while not very intuitive are very useful in computing these and other estimates, are:

$$SS_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$

$$SS_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$

$$SS_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$$

Example 1.1 Continued – Coffee Sales and Shelf Space

For the coffee data, we observe the following summary statistics in Table 2.

| Week | Space ($x$) | Sales ($y$) | $x^2$ | $xy$ | $y^2$ |
|---|---|---|---|---|---|
| 1 | 6 | 526 | 36 | 3156 | 276676 |
| 2 | 3 | 421 | 9 | 1263 | 177241 |
| 3 | 6 | 581 | 36 | 3486 | 337561 |
| 4 | 9 | 630 | 81 | 5670 | 396900 |
| 5 | 3 | 412 | 9 | 1236 | 169744 |
| 6 | 9 | 560 | 81 | 5040 | 313600 |
| 7 | 6 | 434 | 36 | 2604 | 188356 |
| 8 | 3 | 443 | 9 | 1329 | 196249 |
| 9 | 9 | 590 | 81 | 5310 | 348100 |
| 10 | 6 | 570 | 36 | 3420 | 324900 |
| 11 | 3 | 346 | 9 | 1038 | 119716 |
| 12 | 9 | 672 | 81 | 6048 | 451584 |
| Total | 72 | 6185 | 504 | 39600 | 3300627 |

Table 2: Summary calculations – coffee sales data

From this, we obtain the following sums of squares and cross-products:

$$SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 504 - \frac{(72)^2}{12} = 72$$

$$SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 39600 - \frac{(72)(6185)}{12} = 2490$$

$$SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 3300627 - \frac{(6185)^2}{12} = 112772.9$$

From these, we obtain the least squares estimates of the true linear regression relation $\beta_0 + \beta_1 x$:

$$b_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{2490}{72} = 34.5833$$

$$b_0 = \bar{y} - b_1\bar{x} = 307.967$$

So the fitted equation, estimating the mean weekly sales when the product has $x$ feet of shelf space, is $\hat{y} = b_0 + b_1 x = 307.967 + 34.5833x$. Our interpretation for $b_1$ is: the estimate for the increase in mean weekly sales due to increasing shelf space by 1 foot is 34.5833 bags of coffee. Note that this should only be interpreted within the range of $x$ values that we have observed in the experiment, namely $x = 3$ to $9$ feet.

Example 1.2 – Computation of a Stock Beta

A widely used measure of a company's performance is their beta. This is a measure of the firm's stock price volatility relative to the overall market's volatility. One common use of beta is in the capital asset pricing model (CAPM) in finance, but you will hear them quoted on many business news shows as well. It is computed as (Value Line):

"The 'beta factor' is derived from a least squares regression analysis between weekly percent changes in the price of a stock and weekly percent changes in the price of all stocks in the survey over a period of five years. In the case of shorter price histories, a smaller period is used, but never less than two years."

In this example, we will compute the stock beta over a 28-week period for Coca-Cola and Anheuser-Busch, using the S&P 500 as "the market" for comparison. Note that this period is only about 10% of the period used by Value Line. Note: while there are 28 weeks of data, there are only n = 27 weekly changes. Table 3 provides the dates, weekly
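The least squares estimates for the coffee example can be verified directly from the data in Table 1, using the shortcut formulas above; a short Python sketch (the last digits may differ slightly from the hand-rounded values in the notes):

```python
# Least squares estimates for the coffee data (Table 1) via the
# corrected sums of squares and cross-products (shortcut formulas).
x = [6, 3, 6, 9, 3, 9, 6, 3, 9, 6, 3, 9]
y = [526, 421, 581, 630, 412, 560, 434, 443, 590, 570, 346, 672]
n = len(x)

SSxx = sum(v * v for v in x) - sum(x) ** 2 / n
SSxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

b1 = SSxy / SSxx                   # slope: 2490/72
b0 = sum(y) / n - b1 * sum(x) / n  # intercept: ybar - b1*xbar

print(round(SSxx, 1), round(SSxy, 1))
print(round(b1, 4), round(b0, 3))
```

The same two lines of arithmetic reproduce $SS_{xx} = 72$ and $SS_{xy} = 2490$, and hence the fitted slope of about 34.58 bags per extra foot of shelf space.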
closing prices, and weekly percent changes of the S&P 500, Coca-Cola, and Anheuser-Busch. The following summary calculations are also provided, with $x$ representing the S&P 500, $y_C$ representing Coca-Cola, and $y_A$ representing Anheuser-Busch. All calculations should be based on 4 decimal places. Figure 2 gives the plot and least squares regression line for Anheuser-Busch, and Figure 3 gives the plot and least squares regression line for Coca-Cola.

$$\sum x = 15.5200 \qquad \sum y_C = -2.4882 \qquad \sum y_A = 2.4281$$

$$\sum x^2 = 124.6354 \qquad \sum y_C^2 = 461.7296 \qquad \sum y_A^2 = 195.4900$$

$$\sum xy_C = 161.4408 \qquad \sum xy_A = 84.7527$$

(a) Compute $SS_{xx}$, $SS_{xy_C}$, and $SS_{xy_A}$.

(b) Compute the stock betas for Coca-Cola and Anheuser-Busch.

Table 3: Weekly closing stock prices and weekly percent changes, 05/20/97–11/24/97 – S&P 500, Anheuser-Busch, Coca-Cola

Example 1.3 – Estimating Cost Functions of a Hosiery Mill

The following approximate data were published by Joel Dean in the 1941
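Parts (a) and (b) of the exercise follow mechanically from the summary sums; a Python sketch (the sums are taken as printed above, so treat the resulting betas as illustrative):

```python
# Stock betas from the printed summary sums (n = 27 weekly changes);
# beta = SS_{x,stock} / SS_xx, with x = weekly % change in the S&P 500.
n = 27
Sx, Sx2 = 15.5200, 124.6354
Syc, Sxyc = -2.4882, 161.4408    # Coca-Cola sums
Sya, Sxya = 2.4281, 84.7527      # Anheuser-Busch sums

SSxx = Sx2 - Sx ** 2 / n
beta_cc = (Sxyc - Sx * Syc / n) / SSxx
beta_ab = (Sxya - Sx * Sya / n) / SSxx
print(round(SSxx, 4), round(beta_cc, 4), round(beta_ab, 4))
```

A beta is just the least squares slope of the stock's weekly percent change regressed on the market's, so the whole computation reduces to two corrected cross-products divided by $SS_{xx}$.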
article "Statistical Cost Functions of a Hosiery Mill" (Studies in Business Administration, vol. 14, no. 3).

Figure 2: Plot of weekly percent stock price changes for Anheuser-Busch versus S&P 500, and least squares regression line

Figure 3: Plot of weekly percent stock price changes for Coca-Cola versus S&P 500, and least squares regression line

- $y$ – Monthly total production cost (in $1000s)
- $x$ – Monthly output (in thousands of dozens produced)

A sample of n = 48 months of data were used, with $x_i$ and $y_i$ being measured for each month. The parameter $\beta_1$ represents the change in mean cost per unit increase in output (unit variable cost), and $\beta_0$ represents the true mean cost when output is 0, without shutting down the plant (fixed cost). The data are given in Table 4 (the order is arbitrary, as the data are printed in table form), and were obtained from visual inspection/approximation of a plot.

| $i$ | $x_i$ | $y_i$ | $i$ | $x_i$ | $y_i$ | $i$ | $x_i$ | $y_i$ |
|---|---|---|---|---|---|---|---|---|
| 1 | 46.75 | 92.64 | 17 | 36.54 | 91.56 | 33 | 32.26 | 66.71 |
| 2 | 42.18 | 88.81 | 18 | 37.03 | 84.12 | 34 | 30.97 | 64.37 |
| 3 | 41.86 | 86.44 | 19 | 36.60 | 81.22 | 35 | 28.20 | 56.09 |
| 4 | 43.29 | 88.80 | 20 | 37.58 | 83.35 | 36 | 24.58 | 50.25 |
| 5 | 42.12 | 86.38 | 21 | 36.48 | 82.29 | 37 | 20.25 | 43.65 |
| 6 | 41.78 | 89.87 | 22 | 38.25 | 80.92 | 38 | 17.09 | 38.01 |
| 7 | 41.47 | 88.53 | 23 | 37.26 | 76.92 | 39 | 14.35 | 31.40 |
| 8 | 42.21 | 91.11 | 24 | 38.59 | 78.35 | 40 | 13.11 | 29.45 |
| 9 | 41.03 | 81.22 | 25 | 40.89 | 74.57 | 41 | 9.50 | 29.02 |
| 10 | 39.84 | 83.72 | 26 | 37.66 | 71.60 | 42 | 9.74 | 19.05 |
| 11 | 39.15 | 84.54 | 27 | 38.79 | 65.64 | 43 | 9.34 | 20.36 |
| 12 | 39.20 | 85.66 | 28 | 38.78 | 62.09 | 44 | 7.51 | 17.68 |
| 13 | 39.52 | 85.87 | 29 | 36.70 | 61.66 | 45 | 8.35 | 19.23 |
| 14 | 38.05 | 85.23 | 30 | 35.10 | 77.14 | 46 | 6.25 | 14.92 |
| 15 | 39.16 | 87.75 | 31 | 33.75 | 75.47 | 47 | 5.45 | 11.44 |
| 16 | 38.59 | 92.62 | 32 | 34.29 | 70.37 | 48 | 3.79 | 12.69 |

Table 4: Production costs and output – Dean (1941)

This dataset has n = 48 observations, with a mean output (in 1000s of dozens) of $\bar{x} = 31.0673$, and a mean monthly cost (in $1000s) of $\bar{y} = 65.4329$.

$$\sum x_i = 1491.23 \quad \sum x_i^2 = 54067.42 \quad \sum y_i = 3140.78 \quad \sum y_i^2 = 238424.46 \quad \sum x_i y_i = 113095.80$$

From these quantities, we get:

$$SS_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 54067.42 - \frac{(1491.23)^2}{48} = 54067.42 - 46328.48 = 7738.94$$

$$SS_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = 113095.80 - \frac{(1491.23)(3140.78)}{48} = 113095.80 - 97575.53 = 15520.27$$

$$SS_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n} = 238424.46 - \frac{(3140.78)^2}{48} = 238424.46 - 205510.40 = 32914.06$$

$$b_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{15520.27}{7738.94} = 2.0055$$

$$b_0 = \bar{y} - b_1\bar{x} = 65.4329 - 2.0055(31.0673) = 3.1274$$

$$\hat{y}_i = b_0 + b_1 x_i = 3.1274 + 2.0055x_i \qquad e_i = y_i - \hat{y}_i = y_i - (3.1274 + 2.0055x_i)$$

Table 5 gives the raw data, their fitted values, and residuals. A plot of the data and regression line are given in Figure 4.

Figure 4: Estimated cost function for hosiery mill (Dean, 1941)

We have seen now how to estimate $\beta_0$ and $\beta_1$. Now we can obtain an estimate of the variance of the responses at a given value of $x$. Recall from your previous statistics course that you estimated the variance by taking the average squared deviation of each measurement from the sample (estimated) mean; that is, you calculated $s_y^2 = \frac{\sum_{i=1}^{n}(y_i-\bar{y})^2}{n-1}$. Now that we fit the regression model, we no longer use $\bar{y}$ to estimate the mean for each $y_i$, but rather $\hat{y}_i = b_0 + b_1 x_i$. The estimate we use now looks similar to the previous estimate, except we replace $\bar{y}$ with $\hat{y}_i$, and we replace $n-1$ with $n-2$, since we have estimated 2 parameters, $\beta_0$ and $\beta_1$. The new estimate, which we will refer to as the estimated error variance, is

$$s_e^2 = MSE = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{n-2} = \frac{SS_{yy} - b_1 SS_{xy}}{n-2}$$

Table 5: Approximated monthly outputs, total costs, fitted values, and residuals – Dean (1941)

This estimated error variance $s_e^2$ can be thought of as the "average" squared distance from each observed response to the fitted line. The word "average" is in quotes since we divide by $n-2$ and not $n$. The closer the observed responses fall to the line, the smaller $s_e$ is, and the better our predicted values will be.

Example 1.1 Continued – Coffee Sales and Shelf Space

For the coffee data,

$$s_e^2 = \frac{112772.9 - 34.5833(2490)}{12-2} = \frac{112772.9 - 86112.5}{10} = 2666.04,$$

and the estimated residual standard error (deviation) is $s_e = \sqrt{2666.04} = 51.63$. We now have estimates for all of the parameters of the regression equation relating the mean weekly sales to the amount of shelf space the coffee gets in the store. Figure 5 shows the 12 observed responses and the estimated fitted regression equation.
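The estimated error variance for the coffee example can be checked the same way, from the sums of squares computed earlier (Python sketch):

```python
# Estimated error variance and residual standard error for the coffee
# example: SSE = SS_yy - b1*SS_xy, s_e^2 = SSE/(n-2), s_e = sqrt(MSE).
SSyy, SSxy, b1, n = 112772.9, 2490.0, 34.5833, 12

SSE = SSyy - b1 * SSxy   # error sum of squares
MSE = SSE / (n - 2)      # estimated error variance (n-2 df)
se = MSE ** 0.5          # residual standard error
print(round(SSE, 1), round(MSE, 2), round(se, 2))
```

Dividing by $n-2$ rather than $n-1$ reflects the two estimated parameters $b_0$ and $b_1$ used in the fitted values.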
Figure 5: Plot of coffee data and fitted equation

Example 1.3 Continued – Estimating Cost Functions of a Hosiery Mill

For the cost function data:

$$SSE = SS_{yy} - b_1 SS_{xy} = 32914.06 - 2.0055(15520.27) = 1788.51$$

$$s_e^2 = MSE = \frac{SSE}{n-2} = \frac{1788.51}{48-2} = 38.88 \qquad s_e = \sqrt{38.88} = 6.24$$

## 2 Simple Regression II – Inferences Concerning β1

Textbook Section 18.5 and some supplementary material

Recall that in our regression model, we are stating that $E[y|x] = \beta_0 + \beta_1 x$. In this model, $\beta_1$ represents the change in the mean of our response variable $y$ as the predictor variable $x$ increases by 1 unit. Note that if $\beta_1 = 0$, we have $E[y|x] = \beta_0 + \beta_1 x = \beta_0 + 0x = \beta_0$, which implies the mean of our response variable is the same at all values of $x$. In the context of the coffee sales example, this would imply that mean sales are the same regardless of the amount of shelf space, so a marketer has no reason to purchase extra shelf space. This is like saying that knowing the level of the predictor variable does not help us predict the response variable.

Under the assumptions stated previously, namely that $y \sim N(\beta_0 + \beta_1 x, \sigma)$, our estimator $b_1$ has a sampling distribution that is normal with mean $\beta_1$ (the true value of the parameter) and standard error $\frac{\sigma}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}}$. That is:

$$b_1 \sim N\!\left(\beta_1,\ \frac{\sigma}{\sqrt{SS_{xx}}}\right)$$

We can now make inferences concerning $\beta_1$.

### 2.1 A Confidence Interval for β1

First, we obtain the estimated standard error of $b_1$ (this is the standard deviation of its sampling distribution):

$$s_{b_1} = \frac{s_e}{\sqrt{SS_{xx}}} = \frac{s_e}{\sqrt{(n-1)s_x^2}}$$

The interval can be written:

$$b_1 \pm t_{\alpha/2,n-2}\, s_{b_1} \equiv b_1 \pm t_{\alpha/2,n-2}\frac{s_e}{\sqrt{SS_{xx}}}$$

Note that $\frac{s_e}{\sqrt{SS_{xx}}}$ is the estimated standard error of $b_1$, since we use $s_e = \sqrt{MSE}$ to estimate $\sigma$. Also, we have $n-2$ degrees of freedom instead of $n-1$, since the estimate $s_e^2$ has 2 estimated parameters used in it (refer back to how we calculate it above).

Example 2.1 – Coffee Sales and Shelf Space

For the coffee sales example, we have the following results:

$$b_1 = 34.5833 \qquad SS_{xx} = 72 \qquad s_e = 51.63 \qquad n = 12$$

So a 95% confidence interval for the parameter $\beta_1$ is:

$$34.5833 \pm t_{.025,12-2}\frac{51.63}{\sqrt{72}} \equiv 34.5833 \pm 2.228(6.085) \equiv 34.5833 \pm 13.557,$$

which gives us the range (21.026, 48.140). We are 95% confident that the true mean sales increase by between 21.026 and 48.140 bags of coffee per week for each extra foot of shelf space the brand gets (within the range of 3 to 9 feet). Note that the entire interval is positive (above 0), so we are confident that in fact $\beta_1 > 0$, and the marketer is justified in pursuing extra shelf space.

Example 2.2 – Hosiery Mill Cost Function

$$b_1 = 2.0055 \qquad SS_{xx} = 7738.94 \qquad s_e = 6.24 \qquad n = 48$$

For the hosiery mill cost function analysis, we obtain a 95% confidence interval for average unit variable costs ($\beta_1$). Note that $t_{.025,48-2} = t_{.025,46} \approx 2.015$, since $t_{.025,40} = 2.021$ and $t_{.025,60} = 2.000$ (we could approximate this with $z_{.025} = 1.96$ as well).

$$2.0055 \pm 2.015\frac{6.24}{\sqrt{7738.94}} \equiv 2.0055 \pm 2.015(0.0709) \equiv 2.0055 \pm 0.1429 \equiv (1.8626,\ 2.1484)$$

We are 95% confident that the true average unit variable costs are between $1.86 and $2.15 (this is the incremental cost of increasing production by one unit, assuming that the production process is in place).

### 2.2 Hypothesis Tests Concerning β1

Similar to the idea of the confidence interval, we can set up a test of hypothesis concerning $\beta_1$. Since the confidence interval gives us the range of believable values for $\beta_1$, it is more useful than a test of hypothesis. However, here is the procedure to test whether $\beta_1$ is equal to some value, say $\beta_1^0$:

- $H_0: \beta_1 = \beta_1^0$ ($\beta_1^0$ specified, usually 0)
- $H_a$: (1) $\beta_1 \neq \beta_1^0$, (2) $\beta_1 > \beta_1^0$, (3) $\beta_1 < \beta_1^0$
- Test statistic: $t_{obs} = \dfrac{b_1 - \beta_1^0}{s_e/\sqrt{SS_{xx}}} = \dfrac{b_1 - \beta_1^0}{s_{b_1}}$
- Rejection region: (1) $|t_{obs}| \geq t_{\alpha/2,n-2}$, (2) $t_{obs} \geq t_{\alpha,n-2}$, (3) $t_{obs} \leq -t_{\alpha,n-2}$
- P-value: (1) $2P(t \geq |t_{obs}|)$, (2) $P(t \geq t_{obs})$, (3) $P(t \leq t_{obs})$

Using tables, we can only place bounds on these p-values.

Example 2.1 Continued – Coffee Sales and Shelf Space

Suppose in our coffee example the marketer gets a set amount of space (say 6') for free, and she must pay extra for any more space. For the extra space to be profitable (over the long run), the mean weekly sales must increase by more than 20 bags; otherwise, the expense outweighs the increase in sales. She wants to test to see if it is worth it to buy more space. She works under the assumption
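The 95% confidence interval for $\beta_1$ in the coffee example, as a Python sketch (the critical value 2.228 is $t_{.025,10}$ from the t table, as in the notes):

```python
# 95% CI for the slope in the coffee example:
# b1 +/- t_{.025,10} * s_e / sqrt(SS_xx), with t_{.025,10} = 2.228.
b1, se, SSxx = 34.5833, 51.63, 72.0
t_crit = 2.228

sb1 = se / SSxx ** 0.5   # estimated standard error of b1
lo = b1 - t_crit * sb1
hi = b1 + t_crit * sb1
print(round(sb1, 3), round(lo, 3), round(hi, 3))
```

Because the whole interval lies above zero, the same computation also settles the question of whether $\beta_1 > 0$.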
that it is not worth it, and will only purchase more if she can show that it is worth it. She sets $\alpha = 0.05$.

1. $H_0: \beta_1 = 20$, $H_A: \beta_1 > 20$

2. Test statistic: $t_{obs} = \dfrac{34.5833 - 20}{51.63/\sqrt{72}} = \dfrac{14.5833}{6.085} = 2.397$

3. Rejection region: $t_{obs} > t_{.05,10} = 1.812$

4. p-value: $P(T > 2.397) < P(T > 2.228) = .025$ and $P(T > 2.397) > P(T > 2.764) = .010$, so $.010 < \text{p-value} < .025$

So, she has concluded that $\beta_1 > 20$, and she will purchase the shelf space. Note also that the entire confidence interval was above 20, so we already knew this.

Example 2.2 Continued – Hosiery Mill Cost Function

Suppose we want to test whether average monthly production costs increase with monthly production output. This is testing whether unit variable costs are positive ($\alpha = 0.05$).

- $H_0: \beta_1 = 0$ (mean monthly production cost is not associated with output)
- $H_A: \beta_1 > 0$ (mean monthly production cost increases with output)
- Test statistic: $t_{obs} = \dfrac{2.0055 - 0}{6.24/\sqrt{7738.94}} = \dfrac{2.0055}{0.0709} = 28.29$
- Rejection region: $t_{obs} > t_{.05,46} \approx 1.680$ (or use $z_{.05} = 1.645$)
- p-value: $P(T > 28.29) \approx 0$

We have overwhelming evidence of positive unit variable costs.

### 2.3 The Analysis of Variance Approach to Regression

Consider the deviations of the individual responses $y_i$ from their overall mean $\bar{y}$. We would like to break these deviations into two parts: the deviation of the observed value from its fitted value, $\hat{y}_i = b_0 + b_1 x_i$, and the deviation of the fitted value from the overall mean. See Figure 6, corresponding to the coffee sales example. That is, we'd like to write:

$$y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})$$

Note that all we are doing is adding and subtracting the fitted value. It so happens that, algebraically, we can show the same equality holds once we've squared each side of the equation and summed it over the $n$ observed and fitted values. That is,

$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$$

Figure 6: Plot of coffee data, fitted equation, and the line $\bar{y} = 515.4167$

These three pieces are called the total, error, and model sums of squares, respectively. We denote them as $SS_{yy}$, $SSE$, and $SSR$, respectively. We have already seen that $SS_{yy}$ represents the total variation in the observed responses, and
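The marketer's one-sided test can be reproduced numerically (Python sketch):

```python
# One-sided test H0: beta_1 = 20 vs HA: beta_1 > 20 (coffee example);
# reject at alpha = .05 when t_obs > t_{.05,10} = 1.812.
b1, se, SSxx, beta10 = 34.5833, 51.63, 72.0, 20.0

t_obs = (b1 - beta10) / (se / SSxx ** 0.5)
print(round(t_obs, 3), t_obs > 1.812)
```

The observed statistic exceeds the table cutoff, matching the conclusion in the notes that the extra shelf space is worth buying.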
that $SSE$ represents the variation in the observed responses around the fitted regression equation. That leaves $SSR$ as the amount of the total variation that is "accounted for" by taking into account the predictor variable $x$. We can use this decomposition to test the hypothesis $H_0: \beta_1 = 0$ vs $H_A: \beta_1 \neq 0$. We will also find this decomposition useful in subsequent sections when we have more than one predictor variable. We first set up the Analysis of Variance (ANOVA) table in Table 6. Note that we will have to make minimal calculations to set this up, since we have already computed $SS_{yy}$ and $SSE$ in the regression analysis.

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F |
|---|---|---|---|---|
| MODEL | $SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$ | 1 | $MSR = \frac{SSR}{1}$ | $F = \frac{MSR}{MSE}$ |
| ERROR | $SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ | $n-2$ | $MSE = \frac{SSE}{n-2}$ | |
| TOTAL | $SS_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2$ | $n-1$ | | |

Table 6: The Analysis of Variance table for simple regression

The procedure of testing for a linear association between the response and predictor variables using the analysis of variance involves using the F-distribution, which is given in Table 6 (pp. B-11–B-16) of your textbook. This is the same distribution we used in the chapter on the 1-way ANOVA. The testing procedure is as follows:

1. $H_0: \beta_1 = 0$, $H_A: \beta_1 \neq 0$ (this will always be a 2-sided test)
2. Test statistic: $F_{obs} = \frac{MSR}{MSE}$
3. Rejection region: $F_{obs} > F_{1,n-2,\alpha}$
4. p-value: $P(F > F_{obs})$ (you can only get bounds on this from tables, but computer outputs report them exactly)

Note that we already have a procedure for testing this hypothesis (see the section on inferences concerning $\beta_1$), but this is an important lead-in to multiple regression.

Example 2.1 Continued – Coffee Sales and Shelf Space

Referring back to the coffee sales data, we have already made the following calculations:

$$SS_{yy} = 112772.9 \qquad SSE = 26660.4 \qquad n = 12$$

We then also have that $SSR = SS_{yy} - SSE = 86112.5$. The Analysis of Variance is given in Table 7.

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F |
|---|---|---|---|---|
| MODEL | $SSR = 86112.5$ | 1 | $MSR = 86112.5$ | $F = 32.30$ |
| ERROR | $SSE = 26660.4$ | $12-2 = 10$ | $MSE = 2666.04$ | |
| TOTAL | $SS_{yy} = 112772.9$ | $12-1 = 11$ | | |

Table 7: The Analysis of Variance table for the coffee
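The ANOVA quantities in Table 7 follow from $SS_{yy}$ and $SSE$ alone; a Python sketch (the last quantity is the ratio $SSR/SS_{yy}$, which Section 2.3.1 names the coefficient of determination):

```python
# ANOVA decomposition for the coffee example: SS_yy = SSR + SSE.
# F = MSR/MSE tests H0: beta_1 = 0 with (1, n-2) degrees of freedom.
SSyy, SSE, n = 112772.9, 26660.4, 12

SSR = SSyy - SSE     # model sum of squares
MSR = SSR / 1        # model mean square (1 df)
MSE = SSE / (n - 2)  # error mean square (n-2 df)
F = MSR / MSE        # observed F statistic
R2 = SSR / SSyy      # proportion of total variation "accounted for"
print(round(SSR, 1), round(F, 2), round(R2, 4))
```

As the notes point out, no new sums are needed: everything in the table is arithmetic on quantities already computed during the regression fit.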
data example

To test the hypothesis of no linear association between amount of shelf space and mean weekly coffee sales, we can use the F-test described above. Note that the null hypothesis is that there is no effect on mean sales from increasing the amount of shelf space. We will use $\alpha = 0.01$.

1. $H_0: \beta_1 = 0$, $H_A: \beta_1 \neq 0$
2. Test statistic: $F_{obs} = \frac{MSR}{MSE} = \frac{86112.5}{2666.04} = 32.30$
3. Rejection region: $F_{obs} > F_{1,10,.01} = 10.04$
4. p-value: $P(F > 32.30) \approx 0$

We reject the null hypothesis and conclude that $\beta_1 \neq 0$. There is an effect on mean weekly sales when we increase the shelf space.

Example 2.2 Continued – Hosiery Mill Cost Function

For the hosiery mill data, the sums of squares for each source of variation in monthly production costs, and their corresponding degrees of freedom, are (from previous calculations):

- Total: $SS_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2 = 32914.06$, $df_{Total} = n-1 = 47$
- Error: $SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = 1788.51$, $df_E = n-2 = 46$
- Model: $SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 = SS_{yy} - SSE = 32914.06 - 1788.51 = 31125.55$, $df_R = 1$

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F |
|---|---|---|---|---|
| MODEL | $SSR = 31125.55$ | 1 | $MSR = 31125.55$ | $F = 800.55$ |
| ERROR | $SSE = 1788.51$ | $48-2 = 46$ | $MSE = 38.88$ | |
| TOTAL | $SS_{yy} = 32914.06$ | $48-1 = 47$ | | |

Table 8: The Analysis of Variance table for the hosiery mill cost example

The Analysis of Variance is given in Table 8. To test whether there is a linear association between mean monthly costs and monthly production output, we conduct the F-test ($\alpha = 0.05$).

1. $H_0: \beta_1 = 0$, $H_A: \beta_1 \neq 0$
2. Test statistic: $F_{obs} = \frac{31125.55}{38.88} = 800.55$
3. Rejection region: $F_{obs} > F_{1,46,.05} \approx 4.06$
4. p-value: $P(F > 800.55) \approx 0$

We reject the null hypothesis and conclude that $\beta_1 \neq 0$.

#### 2.3.1 Coefficient of Determination

A measure of association that has a clear physical interpretation is $R^2$, the coefficient of determination. This measure is always between 0 and 1, so it does not reflect whether $y$ and $x$ are positively or negatively associated, and it represents the proportion of the total variation in the response variable that is accounted for by fitting the regression on $x$. The formula for $R^2$ is

$$R^2 = 1 - \frac{SSE}{SS_{yy}} = \frac{SSR}{SS_{yy}} = \frac{(\text{cov}(x,y))^2}{s_x^2 s_y^2}$$

Note that $SS_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2$ represents the total variation in the response variable, while $SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ represents the variation in the observed responses about the fitted equation (after taking into account $x$). This is why we sometimes say that $R^2$ is the proportion of the variation in $y$ that is "explained" by $x$.

Example 2.1 Continued – Coffee Sales and Shelf Space

For the coffee data, we can calculate $R^2$ using the values of $SS_{yy}$ and $SSE$ we have previously obtained:

$$R^2 = 1 - \frac{26660.4}{112772.9} = \frac{86112.5}{112772.9} = 0.7636$$

Thus, over ¾ of the variation in sales is "explained" by the model using shelf space to predict sales.

Example 2.2 Continued – Hosiery Mill Cost Function

For the hosiery mill data, the model (regression) sum of squares is $SSR = 31125.55$, and the total sum of squares is $SS_{yy} = 32914.06$. The coefficient of determination is

$$R^2 = \frac{31125.55}{32914.06} = 0.9457$$

Almost 95% of the variation in monthly production costs is "explained" by the monthly production output.

## 3 Simple Regression III – Estimating the Mean and Prediction at a Particular Level of x; Correlation

Textbook Sections 18.7–18.8

We sometimes are interested in estimating the mean response at a particular level of the predictor variable, say $x = x_0$. That is, we'd like to estimate $E[y|x_0] = \beta_0 + \beta_1 x_0$. The actual estimate (point prediction) is just $\hat{y} = b_0 + b_1 x_0$, which is simply where the fitted line crosses $x = x_0$. Under the previously stated normality assumptions, the estimator $\hat{y}_0$ is normally distributed with mean $\beta_0 + \beta_1 x_0$ and standard error of estimate $\sigma\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}$. That is:

$$\hat{y}_0 \sim N\!\left(\beta_0 + \beta_1 x_0,\ \sigma\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{SS_{xx}}}\right)$$

Note that the standard error of the estimate is smallest at $x_0 = \bar{x}$, that is, at the mean of the sampled levels of the predictor variable. The standard error increases as the value $x_0$ moves away from this mean. For instance, our marketer may wish to estimate the mean sales when she has 6' of shelf space, or 7', or 4'. She may also wish to obtain a confidence interval for the mean at these levels of $x$.

### 3.1 A Confidence Interval for E[y|x0] = β0 + β1x0

Using the ideas described in
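The claim that the standard error of the estimated mean is smallest at $x_0 = \bar{x}$ is easy to check numerically for the coffee example (Python sketch):

```python
# Estimated standard error of yhat_0 = b0 + b1*x0 (coffee example):
# s_e * sqrt(1/n + (x0 - xbar)^2 / SS_xx); minimized at x0 = xbar = 6.
se, n, xbar, SSxx = 51.63, 12, 6.0, 72.0

def se_mean(x0):
    return se * (1 / n + (x0 - xbar) ** 2 / SSxx) ** 0.5

for x0 in (4, 6, 7, 9):
    print(x0, round(se_mean(x0), 3))
```

Running this shows the standard error growing as $x_0$ moves away from $\bar{x} = 6$ in either direction, which is why the confidence bands in the plots widen toward the edges of the data.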
the previous section, we can write out the general form for a $(1-\alpha)100\%$ confidence interval for the mean response when $x = x_0$:

$$(b_0 + b_1 x_0) \pm t_{\alpha/2,n-2}\, s_e \sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{SS_{xx}}}$$

Example 3.1 – Coffee Sales and Shelf Space

Suppose our marketer wants to compute 95% confidence intervals for the mean weekly sales at $x = 4$, $6$, and $7$ feet, respectively (these are not simultaneous confidence intervals, as were computed based on Bonferroni's method previously). Each of these intervals will depend on $t_{.025,10} = 2.228$ and $\bar{x} = 6$. These intervals are:

$$(307.967 + 34.5833(4)) \pm 2.228(51.63)\sqrt{\tfrac{1}{12} + \tfrac{(4-6)^2}{72}} \equiv 446.300 \pm 115.03\sqrt{0.1389} \equiv 446.300 \pm 42.872 \equiv (403.428,\ 489.172)$$

$$(307.967 + 34.5833(6)) \pm 2.228(51.63)\sqrt{\tfrac{1}{12} + \tfrac{(6-6)^2}{72}} \equiv 515.467 \pm 115.03\sqrt{0.0833} \equiv 515.467 \pm 33.200 \equiv (482.267,\ 548.667)$$

$$(307.967 + 34.5833(7)) \pm 2.228(51.63)\sqrt{\tfrac{1}{12} + \tfrac{(7-6)^2}{72}} \equiv 550.050 \pm 115.03\sqrt{0.0972} \equiv 550.050 \pm 35.863 \equiv (514.187,\ 585.913)$$

Notice that the interval is narrowest at $x_0 = 6$. Figure 7 is a computer-generated plot of the data, the fitted equation, and the confidence limits for the mean weekly coffee sales at each value of $x$. Note how the limits get wider as $x$ moves away from $\bar{x} = 6$.

Figure 7: Plot of coffee data, fitted equation, and 95% confidence limits for the mean

Example 3.2 – Hosiery Mill Cost Function

Suppose the plant manager is interested in mean costs among months where output is 30,000 items produced ($x_0 = 30$). She wants a 95% confidence interval for this true, unknown mean. Recall $b_0 = 3.1274$, $b_1 = 2.0055$, $s_e = 6.24$, $n = 48$, $\bar{x} = 31.0673$, $SS_{xx} = 7738.94$, and $t_{.025,46} \approx 2.015$. Then the interval is obtained as:

$$3.1274 + 2.0055(30) \pm 2.015(6.24)\sqrt{\frac{1}{48} + \frac{(30 - 31.0673)^2}{7738.94}} \equiv 63.29 \pm 2.015(6.24)\sqrt{0.0210} \equiv 63.29 \pm 1.82 \equiv (61.47,\ 65.11)$$

We can be 95% confident that the mean production costs among months where 30,000 items are produced is between $61,470 and $65,110 (recall units were 1000s of dozens for $x$ and $1000s for $y$). A plot of the data, regression line, and 95% confidence bands for mean costs is given in Figure 8.

### 3.2 Predicting a Future Response at a Given Level of x

In many situations, a researcher would like to predict the outcome of the response variable at a specific level of the predictor variable. In
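The three coffee-sales confidence intervals can be generated from one helper function (Python sketch using the notes' rounded estimates):

```python
# 95% CIs for the mean weekly sales at x0 = 4, 6, 7 feet:
# yhat +/- t_{.025,10} * s_e * sqrt(1/n + (x0 - xbar)^2 / SS_xx).
b0, b1, se, n, xbar, SSxx = 307.967, 34.5833, 51.63, 12, 6.0, 72.0
t_crit = 2.228

def mean_ci(x0):
    yhat = b0 + b1 * x0
    half = t_crit * se * (1 / n + (x0 - xbar) ** 2 / SSxx) ** 0.5
    return yhat - half, yhat + half

for x0 in (4, 6, 7):
    lo, hi = mean_ci(x0)
    print(x0, round(lo, 2), round(hi, 2))
```

Only the $(x_0 - \bar{x})^2$ term changes across the three intervals, which is why the interval at $x_0 = 6$ is the narrowest of the three.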
the previous section we estimated the mean response in this section we are interested in predicting a single outcome In the context of the coffee sales example this would be like trying to predict next week s sales given we know that we will have 6 of shelf space size Figure 8 Plot of hosiery mill cost data7 tted equation7 and 95 con dence limits for the mean First suppose you know the parameters 30 and 31 Then you know that the response variable for a xed level of the predictor variable m mg is normally distributed with mean Eylmg 30 31 and standard deviation 0 We know from previous work with the normal distribution that approximately 95 of the measurements lie within 2 standard deviations of the mean So if we know BO 31 and a we would be very con dent that our response would lie between 30 31 7 2a and 30 lmy 20 Figure 9 represents this idea F2 40 o o w m I I I I I I I 50 6O 7O 80 90 100110120130140150 X Figure 9 Distribution of response variable with known 3031 and a We rarely if ever know these parameters and we must estimate them as we have in previous sections There is uncertainty in what the mean response at the speci ed level mg of the response variable We do however know how to obtain an interval that we are very confident contains the true mean 30 31 If we apply the method of the previous paragraph to all believable values of this mean we can obtain a prediction interval that we are very confident will contain our future response Since a is being estimated as well instead of 2 standard deviations we must use tut2 estimated standard deviations Figure 10 portrays this idea F1 40 Iquot39quotquotquotIquot39quotquotquotIquotquotquotquot39Iquotquotquotquot39I 20 60 100 140 180 X Figure 10 Distribution of response variable with estimated 3031 and a Note that all we really need are the two extreme distributions from the con dence interval for 22 the mean response If we use the method from the last paragraph on each of these two distributions we can obtain the 
prediction interval by choosing the left-hand point of the lower distribution and the right-hand point of the upper distribution. This is displayed in Figure 11.

[Figure 11: Upper and lower prediction limits when we have estimated the mean]

The general formula for a (1 − α)100% prediction interval for a future response is similar to the confidence interval for the mean at x_g, except that it is wider, to reflect the variation in individual responses. The formula is:

(b0 + b1 x_g) ± t_{α/2, n−2} s_e √(1 + 1/n + (x_g − x̄)²/SS_xx)

Example 3.1 (Continued) – Coffee Sales and Shelf Space: For the coffee example, suppose the marketer wishes to predict next week's sales when the coffee will have 5' of shelf space. She would like to obtain a 95% prediction interval for the number of bags to be sold. First, we observe that t_{.025,10} = 2.228; all other relevant numbers can be found in the previous example. The prediction interval is then:

(307.967 + 34.5833(5)) ± 2.228(51.63)√(1 + 1/12 + (5 − 6)²/72) = 480.883 ± 115.03√1.0972 = 480.883 ± 120.51 ≡ (360.37, 601.39)

This interval is quite wide, reflecting the large variation in weekly sales at each level of x. Note that, just as the width of the confidence interval for the mean response depends on the distance between x_g and x̄, so does the width of the prediction interval. This should come as no surprise, considering the way we set up the prediction interval (see Figure 10 and Figure 11). Figure 12 shows the fitted equation and 95% prediction limits for this example. It must be noted that a prediction interval for a future response is only valid if conditions are similar when the response occurs as when the data were collected. For instance, if the store is being boycotted by a group of animal rights activists for selling meat next week, our prediction interval will not be valid.

[Figure 12: Plot of coffee data, fitted equation, and 95% prediction limits for a single response]

Example 3.2 (Continued) – Hosiery Mill Cost Function: Suppose the plant manager knows, based on purchase orders, that this month
her plant will produce 30,000 items (x_g = 30). She would like to predict what the plant's production costs will be. She obtains a 95% prediction interval for this month's costs:

(3.1274 + 2.0055(30)) ± 2.0156(6.24)√(1 + 1/48 + (30 − 31.0673)²/7738.94) = 63.29 ± 12.58√1.0210 = 63.29 ± 12.70 ≡ (50.59, 75.99)

She predicts that the costs for this month will be between $50,590 and $75,990. This interval is much wider than the interval for the mean, since it includes the random variation in monthly costs around the mean. A plot of the 95% prediction bands is given in Figure 13.

3.3 Coefficient of Correlation

In many situations, we would like to obtain a measure of the strength of the linear association between the variables y and x. One measure of this association, reported in research journals in many fields, is the Pearson product moment coefficient of correlation. This measure, denoted by r, is a number that can range from −1 to +1. A value of r close to 0 implies that there is very little association between the two variables (y tends to neither increase nor decrease as x increases). A positive value of r means there is a positive association between y and x (y tends to increase as x increases). Similarly, a negative value means there is a negative association (y tends to decrease as x increases). If r is either +1 or −1, it means the data fall exactly on a straight line (SSE = 0) that has either a positive or negative slope, depending on the sign of r. The formula for calculating r is:

r = SS_xy / √(SS_xx SS_yy) = cov(x, y) / (s_x s_y)

Note that the sign of r is always the same as the sign of b1. We can test whether a population coefficient of correlation is 0, but since that test is mathematically equivalent to testing whether β1 = 0, we won't cover it separately.

[Figure 13: Plot of hosiery mill cost data, fitted equation, and 95% prediction limits for an individual outcome]

Example 3.1 (Continued) – Coffee Sales and Shelf Space: For the coffee data, we can calculate r using the values of SS_xy, SS_xx, and SS_yy we have previously obtained:

r = 2490 / √((72)(112774.9)) = 2490 / 2849.5 = .8738
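As a small side sketch (not from the original notes), r can be computed directly from the three sums of squares; the inputs below are the coffee-sales values SS_xy = 2490, SS_xx = 72, SS_yy = 112774.9 quoted in the notes:

```python
import math

def pearson_r(ss_xy, ss_xx, ss_yy):
    # r = SS_xy / sqrt(SS_xx * SS_yy); it always carries the sign of b1 = SS_xy/SS_xx
    return ss_xy / math.sqrt(ss_xx * ss_yy)

r_coffee = pearson_r(2490.0, 72.0, 112774.9)
b1 = 2490.0 / 72.0  # slope estimate; same sign as r_coffee
```

Squaring r_coffee recovers the R-square value (.7636) reported on the SAS printout, which is one way to remember that r² = SSR/SS_yy in simple regression.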
Example 3.2 (Continued) – Hosiery Mill Cost Function: For the hosiery mill cost function data, we have:

r = 15520.27 / √((7738.94)(32914.06)) = 15520.27 / 15959.95 = .9725

Computer Output for Coffee Sales Example (SAS System):

Dependent Variable: SALES

Analysis of Variance

                      Sum of          Mean
Source       DF      Squares        Square     F Value   Prob>F
Model         1   86112.50000   86112.50000    32.297    0.0002
Error        10   26662.41667    2666.24167
C Total      11  112774.91667

Root MSE    51.63566     R-square   0.7636
Dep Mean   515.41667     Adj R-sq   0.7399

Parameter Estimates

             Parameter     Standard     T for H0:
Variable DF   Estimate        Error     Parameter=0   Prob > |T|
INTERCEP  1  307.916667   39.43738884      7.808        0.0001
SPACE     1   34.583333    6.08532121      5.683        0.0002

      Dep Var  Predict  Std Err  Lower95%  Upper95%  Lower95%  Upper95%
Obs    SALES    Value   Predict     Mean      Mean    Predict   Predict  Residual
  1    421.0    411.7   23.568     359.2     464.2     285.2     538.1     9.3333
  2    412.0    411.7   23.568     359.2     464.2     285.2     538.1     0.3333
  3    443.0    411.7   23.568     359.2     464.2     285.2     538.1    31.3333
  4    346.0    411.7   23.568     359.2     464.2     285.2     538.1   -65.6667
  5    526.0    515.4   14.906     482.2     548.6     395.7     635.2    10.5833
  6    581.0    515.4   14.906     482.2     548.6     395.7     635.2    65.5833
  7    434.0    515.4   14.906     482.2     548.6     395.7     635.2   -81.4167
  8    570.0    515.4   14.906     482.2     548.6     395.7     635.2    54.5833
  9    630.0    619.2   23.568     566.7     671.7     492.7     745.6    10.8333
 10    560.0    619.2   23.568     566.7     671.7     492.7     745.6   -59.1667
 11    590.0    619.2   23.568     566.7     671.7     492.7     745.6   -29.1667
 12    672.0    619.2   23.568     566.7     671.7     492.7     745.6    52.8333

4 Logistic Regression

Often the outcome is nominal (or binary), and we wish to relate the probability that an outcome has the characteristic of interest to an interval-scale predictor variable. For instance, a local service provider may be interested in the probability that a customer will redeem a coupon that is mailed to him/her, as a function of the amount of the coupon. We would expect that as the value of the coupon increases, so does the proportion of coupons redeemed. An experiment could be conducted as follows:

• Choose a range of reasonable coupon values, say x = 0 (flyer only), 1, 2, 5, 10.

• Identify a sample of customers, say 200 households.

• Randomly assign
customers to coupon values, say 40 per coupon value level.

• Send out coupons, and determine whether each coupon was redeemed by the expiration date (y = 1 if yes, 0 if no).

• Tabulate the results and fit the estimated regression equation.

Note that probabilities are bounded by 0 and 1, so we cannot fit a linear regression, since it will provide fitted values outside this range (unless b0 is between 0 and 1 and b1 is 0). We consider the following model, which does force fitted probabilities to lie between 0 and 1:

p(x) = e^(β0 + β1 x) / (1 + e^(β0 + β1 x)),   where e ≈ 2.71828

Unfortunately, unlike the case of simple linear regression, where there are closed-form equations for the least squares estimates of β0 and β1, computer software must be used to obtain maximum likelihood estimates of β0 and β1, as well as their standard errors. Fortunately, many software packages (e.g. SAS, SPSS, Statview) offer procedures to obtain the estimates, standard errors, and tests. We will give estimates and standard errors in this section, obtained from one of these packages. Once the estimates of β0 and β1 are obtained, which we will label b0 and b1 respectively, we obtain the fitted equation:

p̂(x) = e^(b0 + b1 x) / (1 + e^(b0 + b1 x))

Example 4.1 – Viagra Clinical Trial: In a clinical trial for Viagra, patients suffering from erectile dysfunction were randomly assigned to one of four daily doses (0 mg, 25 mg, 50 mg, and 100 mg). One measure obtained from the patients was whether the patient had improved erections after 24 weeks of treatment (y = 1 if yes, y = 0 if no). Table 9 gives the number of subjects with y = 1 and y = 0 for each dose level. (Source: Goldstein, et al. (1998), "Oral Sildenafil in the Treatment of Erectile Dysfunction", NEJM 338:1397-1404.)

Based on an analysis using SAS software, we obtain the following estimates and standard errors for the logistic regression model:

b0 = −0.8311   s_{b0} = 0.1354   b1 = 0.0313   s_{b1} = 0.0034

             Responding
Dose    n    y = 1   y = 0
  0    199     50     149
 25     96     54      42
 50    105     81      24
100    101     85      16

Table 9: Patients showing improvement (y = 1) and not showing improvement (y = 0), by Viagra dose
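The following is a minimal sketch (not part of the original notes) of evaluating the fitted logistic equation at a few doses, and of the chi-square statistic (b1/s_{b1})² used in the next subsection to test H0: β1 = 0; the estimates b0 = −0.8311, b1 = 0.0313, and s_{b1} = 0.0034 are taken from the SAS analysis above:

```python
import math

def logistic_prob(x, b0, b1):
    # Fitted probability: p-hat(x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))

# Viagra-trial estimates from the SAS analysis
b0, b1, s_b1 = -0.8311, 0.0313, 0.0034
p0 = logistic_prob(0, b0, b1)      # estimated P(improvement) at dose 0 mg
p100 = logistic_prob(100, b0, b1)  # estimated P(improvement) at dose 100 mg
wald = (b1 / s_b1) ** 2            # chi-square statistic for H0: beta1 = 0
```

Because b1 > 0, the fitted probability increases with dose; the function is bounded between 0 and 1 for any x, which is exactly why this model is preferred over a straight line for binary outcomes.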
[Figure 14: Plot of estimated logistic regression equation, Viagra data]

A plot of the fitted equation (line) and the sample proportions at each dose (dots) is given in Figure 14.

4.1 Testing for Association between Outcome Probabilities and x

Consider the logistic regression model:

p(x) = e^(β0 + β1 x) / (1 + e^(β0 + β1 x)),   where e ≈ 2.71828

Note that if β1 = 0, then the equation becomes p(x) = e^(β0)/(1 + e^(β0)). That is, the probability that the outcome is the characteristic of interest is not related to x, the predictor variable. In terms of the Viagra example, this would mean that the probability a patient shows improvement is independent of dose. This is what we would expect if the drug were not effective (still allowing for a placebo effect). Further, note that if β1 > 0, the probability of the characteristic of interest occurring increases in x, and if β1 < 0, the probability decreases in x. We can test whether β1 = 0 as follows:

• H0: β1 = 0 (Probability of outcome is independent of x)
• HA: β1 ≠ 0 (Probability of outcome is associated with x)
• Test Statistic: X²_obs = (b1/s_{b1})²
• Rejection Region: X²_obs ≥ χ²_{α,1} (= 3.841 for α = 0.05)
• P-value: Area in the χ²_1 distribution above X²_obs

Note that if we reject H0, we determine the direction of the association (positive/negative) by the sign of b1.

Example 4.1 (Continued) – Viagra Clinical Trial: For these data, we can test whether the probability of showing improvement is associated with dose, as follows:

• H0: β1 = 0 (Probability of improvement is independent of dose)
• HA: β1 ≠ 0 (Probability of improvement is associated with dose)
• Test Statistic: X²_obs = (b1/s_{b1})² = (0.0313/0.0034)² = (9.2059)² = 84.75
• Rejection Region: X²_obs ≥ χ²_{.05,1} = 3.841
• P-value: Area in the χ²_1 distribution above 84.75 (virtually 0)

Thus, we have strong evidence of a positive (since b1 > 0) association between probability of improvement and dose, and we reject H0.

5 Multiple Linear Regression I

Textbook Sections 19.1-19.3 and Supplement

In most situations, we have more than one independent variable. While the amount of math can become overwhelming and involves matrix algebra,
many computer packages exist that will provide the analysis for you. In this chapter, we will analyze the data by interpreting the results of a computer program. It should be noted that simple regression is a special case of multiple regression, so most concepts we have already seen apply here.

5.1 The Multiple Regression Model and Least Squares Estimates

In general, if we have k predictor variables, we can write our response variable as:

y = β0 + β1 x1 + ⋯ + βk xk + ε

Again, y is broken into a systematic component (β0 + β1 x1 + ⋯ + βk xk) and a random component (ε). We make the same assumptions as before in terms of ε: specifically, that the errors are independent and normally distributed with mean 0 and standard deviation σ. That is, we are assuming that y, at a given set of levels of the k independent variables (x1, …, xk), is normal with mean E[y|x1, …, xk] = β0 + β1 x1 + ⋯ + βk xk and standard deviation σ. Just as before, β0, β1, …, βk and σ are unknown parameters that must be estimated from the sample data. The parameter βi represents the change in the mean response when the ith predictor variable changes by 1 unit and all other predictor variables are held constant. In this model:

• y – Random outcome of the dependent variable
• β0 – Regression constant (the mean of y when x1 = ⋯ = xk = 0, if appropriate)
• βi – Partial regression coefficient for variable xi (the change in E[y] when xi increases by 1 unit and all other xj's are held constant)
• ε – Random error term, assumed (as before) to be N(0, σ)
• k – The number of independent variables

By the method of least squares (choosing the bi values that minimize SSE = Σ(yi − ŷi)²), we obtain the fitted equation:

ŷ = b0 + b1 x1 + b2 x2 + ⋯ + bk xk

and our estimate of σ:

s_e = √(SSE/(n − k − 1)) = √(Σ(y − ŷ)²/(n − k − 1))

The Analysis of Variance table will be very similar to what we used previously, with the only adjustments being in the degrees of freedom. Table 10 shows the values for the general case when there are k predictor variables. We will rely on computer outputs to obtain the Analysis of Variance and the estimates b0, b1, …, bk.

ANOVA
Source of Variation | Sum of Squares        | Degrees of Freedom | Mean Square           | F
MODEL               | SSR = Σ(ŷi − ȳ)²    | k                  | MSR = SSR/k           | F = MSR/MSE
ERROR               | SSE = Σ(yi − ŷi)²   | n − k − 1          | MSE = SSE/(n − k − 1) |
TOTAL               | SS_yy = Σ(yi − ȳ)²  | n − 1              |                       |

Table 10: The Analysis of Variance Table for multiple regression

5.2 Testing for Association Between the Response and the Full Set of Predictor Variables

To see if the set of predictor variables is useful in predicting the response variable, we will test H0: β1 = β2 = ⋯ = βk = 0. Note that if H0 is true, then the mean response does not depend on the levels of the predictor variables. We interpret this to mean that there is no association between the response variable and the set of predictor variables. To test this hypothesis, we use the following method:

1. H0: β1 = β2 = ⋯ = βk = 0
2. HA: Not every βi = 0
3. T.S.: F_obs = MSR/MSE
4. R.R.: F_obs > F_{α, k, n−k−1}
5. p-value: P(F > F_obs) (you can only get bounds on this from tables, but computer outputs report it exactly)

The computer automatically performs this test and provides you with the p-value, so in practice you really don't need to obtain the rejection region explicitly to make the appropriate conclusion. However, we will do so in this course to help reinforce the relationship between the test's decision rule and the p-value. Recall that we reject the null hypothesis if the p-value is less than α.

5.3 Testing Whether Individual Predictor Variables Help Predict the Response

If we reject the previous null hypothesis and conclude that not all of the βi are zero, we may wish to test whether individual βi are zero. Note that if we fail to reject the null hypothesis that βi is zero, we can drop the predictor xi from our model, thus simplifying the model. Note that this test asks whether xi is useful given that we are already fitting a model containing the remaining k − 1 predictor variables. That is, does this variable contribute anything once we've taken into account the other predictor variables? These tests are t-tests, where we compute t = bi/s_{bi}, just as we did in the section on making inferences concerning β1 in simple regression. The procedure for testing whether βi = 0 (the ith predictor variable does not contribute to predicting the response, given the other k − 1 predictor variables are in the model) is as follows:

• H0: βi = 0 (y is not associated with xi, after controlling for all other independent variables)
• HA: (1) βi ≠ 0, (2) βi > 0, (3) βi < 0
• T.S.: t_obs = bi/s_{bi}
• R.R.: (1) |t_obs| > t_{α/2, n−k−1}, (2) t_obs > t_{α, n−k−1}, (3) t_obs < −t_{α, n−k−1}
• p-value: (1) 2P(T > |t_obs|), (2) P(T > t_obs), (3) P(T < t_obs)

Computer packages print the test statistic and the p-value based on the two-sided test, so conducting this test is simply a matter of interpreting the results of the computer output.

5.4 Testing for an Association Between a Subset of Predictor Variables and the Response

We have seen the two extreme cases: testing whether all regression coefficients are simultaneously 0 (the F-test), and testing whether a single regression coefficient is 0, controlling for all other predictors (the t-test). We can also test whether a subset of the k regression coefficients are 0, controlling for all other predictors. Note that the two extreme cases can be tested using this very general procedure. To make the notation as simple as possible, suppose our model consists of k predictor variables, of which we'd like to test whether q (q ≤ k) are simultaneously not associated with y, after controlling for the remaining k − q predictor variables. Further assume that the k − q remaining predictors are labelled x1, x2, …, x_{k−q}, and that the q predictors of interest are labelled x_{k−q+1}, x_{k−q+2}, …, x_k. This test is of the form:

H0: β_{k−q+1} = β_{k−q+2} = ⋯ = βk = 0
HA: β_{k−q+1} ≠ 0 and/or β_{k−q+2} ≠ 0 and/or … and/or βk ≠ 0

The procedure for obtaining the numeric elements of the test is as follows:

1. Fit the model under the null hypothesis (β_{k−q+1} = β_{k−q+2} = ⋯ = βk = 0). It will include only the first k − q predictor variables. This is referred to as the Reduced model. Obtain the error sum of squares (SSE_R) and the error degrees of freedom df_R = n − (k − q) − 1.

2. Fit the model with all k predictors. This is referred to as the Complete (or Full) model, and was used for the F-test for all
regression coefficients. Obtain the error sum of squares (SSE_F) and the error degrees of freedom df_F = n − k − 1.

3. By definition of the least squares criterion, we know that SSE_R ≥ SSE_F. We now obtain the test statistic:

T.S.: F_obs = [(SSE_R − SSE_F)/q] / [SSE_F/(n − k − 1)] = [(SSE_R − SSE_F)/q] / MSE_F

and our rejection region is values of F_obs ≥ F_{α, q, n−k−1}.

Example 5.1 – Texas Weather Data: In this example, we will use regression in the context of predicting an outcome. A construction company is making a bid on a project in a remote area of Texas. A certain component of the project will take place in December and is very sensitive to the daily high temperatures. They would like to estimate what the average high temperature will be at the location in December. They believe that temperature at a location will depend on its latitude (a measure of distance from the equator) and its elevation. That is, they believe that the response variable (mean daily high temperature in December at a particular location) can be written as:

y = β0 + β1 x1 + β2 x2 + β3 x3 + ε,

where x1 is the latitude of the location, x2 is the longitude, and x3 is its elevation (in feet). As before, we assume that ε ~ N(0, σ). Note that higher latitudes mean farther north, and higher longitudes mean farther west. To estimate the parameters β0, β1, β2, β3, and σ, they gather data for a sample of n = 16 counties and fit the model described above. The data, including one other variable, are given in Table 11.

COUNTY       LATITUDE  LONGITUDE   ELEV   TEMP   INCOME
HARRIS        29.767     95.367      41    56    24322
DALLAS        32.850     96.850     440    48    21870
KENNEDY       26.933     97.800      25    60    11384
MIDLAND       31.950    102.183    2851    46    24322
DEAF SMITH    34.800    102.467    3840    38    16375
KNOX          33.450     99.633    1461    46    14595
MAVERICK      28.700    100.483     815    53    10623
NOLAN         32.450    100.533    2380    46    16486
EL PASO       31.800    106.40     3918    44    15366
COLLINGTON    34.850    100.217    2040    41    13765
PECOS         30.867    102.900    3000    47    17717
SHERMAN       36.350    102.083    3693    36    19036
TRAVIS        30.300     97.700     597    52    20514
ZAPATA        26.900     99.283     315    60    11523
LASALLE       28.450     99.217     459    56    10563
CAMERON       25.900     97.433      19    62    12931

Table 11:
Data corresponding to 16 counties in Texas. The results of the Analysis of Variance are given in Table 12, and the parameter estimates, estimated standard errors, t-statistics, and p-values are given in Table 13. Full computer programs and printouts are given as well. We see from the Analysis of Variance that at least one of the variables latitude, longitude, and elevation is related to the response variable temperature. This can be seen by setting up the test H0: β1 = β2 = β3 = 0, as described previously. The elements of this test, provided by the computer output, are detailed below, assuming α = .05.

1. H0: β1 = β2 = β3 = 0
2. HA: Not all βi = 0
3. T.S.: F_obs = MSR/MSE = 311.443/0.634 = 491.235
4. R.R.: F_obs > F_{.05, 3, 12} = 3.49 (this is not provided on the output; the p-value takes its place)
5. p-value: P(F > 491.235) < .0001 (actually it is smaller, but this is the smallest p-value the computer will print)

ANOVA
Source of Variation | Sum of Squares   | Degrees of Freedom          | Mean Square   | F           | p-value
MODEL               | SSR = 934.328    | k = 3                       | MSR = 311.443 | F = 491.235 | .0001
ERROR               | SSE = 7.609      | n − k − 1 = 16 − 3 − 1 = 12 | MSE = 0.634   |             |
TOTAL               | SS_yy = 941.938  | n − 1 = 15                  |               |             |

Table 12: The Analysis of Variance Table for Texas data

                               t FOR H0:            STANDARD ERROR
PARAMETER         ESTIMATE      βi = 0    P-VALUE   OF ESTIMATE
INTERCEPT (β0)  b0 = 109.25887    36.68    .0001      2.97857
LATITUDE  (β1)  b1 =  −1.99323   −14.61    .0001      0.13639
LONGITUDE (β2)  b2 =  −0.38471    −1.68    .1182      0.22858
ELEVATION (β3)  b3 =  −0.00096    −1.68    .1181      0.00057

Table 13: Parameter estimates and tests of hypotheses for individual parameters

We conclude that at least one of these three variables is related to the response variable temperature. We also see from the individual t-tests that latitude is useful in predicting temperature, even after taking into account the other predictor variables. The formal test (based on the α = 0.05 significance level) for determining whether temperature is associated with latitude, after controlling for longitude and elevation, is given here:

• H0: β1 = 0 (TEMP is not associated with LAT (x1), after controlling for LONG (x2) and ELEV (x3))
• HA: β1 ≠ 0 (TEMP is associated with LAT, after controlling for LONG and ELEV)
• T.S.: t_obs = b1/s_{b1} = −1.99323/0.13639 = −14.614
• R.R.: |t_obs| > t_{.025, 12} = 2.179
• p-value: 2P(T > |t_obs|) = 2P(T > 14.614) < .0001

Thus, we can conclude that there is an association between temperature and latitude, controlling for longitude and elevation. Note that the coefficient is negative, so we conclude that temperature decreases as latitude increases (at a given level of longitude and elevation).

Note from Table 13 that neither the coefficient for LONGITUDE (x2) nor for ELEVATION (x3) is significant at the α = 0.05 significance level (p-values are .1182 and .1181, respectively). Recall these are testing whether each term is 0, controlling for LATITUDE and the other term. Before concluding that neither LONGITUDE (x2) nor ELEVATION (x3) is a useful predictor controlling for LATITUDE, we will test whether they are both simultaneously 0, that is:

H0: β2 = β3 = 0   vs   HA: β2 ≠ 0 and/or β3 ≠ 0

First, note that we have n = 16, k = 3, q = 2, SSE_F = 7.609, df_F = 16 − 3 − 1 = 12, MSE_F = 0.634, df_R = 16 − (3 − 2) − 1 = 14, and F_{.05, 2, 12} = 3.89. Next, we fit the model with only LATITUDE (x1), obtain the error sum of squares SSE_R = 60.935, and get the following test statistic:

T.S.: F_obs = [(SSE_R − SSE_F)/q] / MSE_F = [(60.935 − 7.609)/2] / 0.634 = 26.663/0.634 = 42.055

Since 42.055 >> 3.89, we reject H0 and conclude that LONGITUDE (x2) and/or ELEVATION (x3) are associated with TEMPERATURE (y), after controlling for LATITUDE. The reason we failed to reject H0: β2 = 0 and H0: β3 = 0 individually, based on the t-tests, is that ELEVATION and LONGITUDE are highly correlated: elevations rise as you go further west in the state. So once you control for LONGITUDE, we observe little ELEVATION effect, and vice versa. We will discuss why this is the case later. In theory, we have little reason to believe that temperatures naturally increase or decrease with LONGITUDE, but we may reasonably expect that as ELEVATION increases, TEMPERATURE decreases. We refit the more parsimonious (simplistic) model that uses ELEVATION (x1) and LATITUDE (x2) to predict TEMPERATURE. Note the new symbols for ELEVATION and
LATITUDE; that is just to show you that they are merely symbols. The results are given in Table 14 and Table 15.

ANOVA
Source of Variation | Sum of Squares   | Degrees of Freedom          | Mean Square   | F           | p-value
MODEL               | SSR = 932.532    | k = 2                       | MSR = 466.266 | F = 644.014 | .0001
ERROR               | SSE = 9.406      | n − k − 1 = 16 − 2 − 1 = 13 | MSE = 0.724   |             |
TOTAL               | SS_yy = 941.938  | n − 1 = 15                  |               |             |

Table 14: The Analysis of Variance Table for Texas data, without LONGITUDE

Both variables are associated with temperature in this model. We see this by observing that the t-statistic for testing H0: β1 = 0 (no elevation effect on temperature) is −8.41, corresponding to a p-value of .0001, and the t-statistic for testing H0: β2 = 0 (no latitude effect) is −17.65, also corresponding to a p-value of .0001. Further, note that both estimates are negative, reflecting that as elevation and latitude increase, temperature decreases. That should not come as any big surprise.

                              t FOR H0:            STANDARD ERROR
PARAMETER         ESTIMATE     βi = 0    P-VALUE   OF ESTIMATE
INTERCEPT (β0)  b0 = 63.45485    36.68    .0001      0.48750
ELEVATION (β1)  b1 = −0.00185    −8.41    .0001      0.00022
LATITUDE  (β2)  b2 = −1.83216   −17.65    .0001      0.10380

Table 15: Parameter estimates and tests of hypotheses for individual parameters, without LONGITUDE

The magnitudes of the estimated coefficients are quite different, which may make you believe that one predictor variable is more important than the other. This is not necessarily true, because the ranges of their levels are quite different (a 1-unit change in latitude represents a change of approximately 19 miles, while a unit change in elevation is 1 foot), and recall that βi represents the change in the mean response when variable xi is increased by 1 unit. The data corresponding to the 16 locations in the sample are plotted in Figure 15, and the fitted equation for the model that does not include LONGITUDE is plotted in Figure 16. The fitted equation is a plane in three dimensions.

[Figure 15: Plot of temperature data in 3 dimensions]

[Figure 16: Plot of the fitted equation for temperature data]

Example 5.2 – Mortgage Financing Cost Variation (By City): A study in the mid-1960s reported regional differences in mortgage costs for new homes. The sampling units were n = 18 metro areas (SMSAs) in the U.S. The dependent variable is the average yield (in percent) on a new home mortgage for the SMSA. The independent variables are given below. (Source: Schaaf, A.H. (1966), "Regional Differences in Mortgage Financing Costs", Journal of Finance 21:85-94.)

x1 – Average Loan Value / Mortgage Value Ratio (higher x1 means lower down payment and higher risk to the lender)
x2 – Road Distance from Boston (higher x2 means farther from the Northeast, where most capital was at the time, and higher costs of capital)
x3 – Savings per Annual Dwelling Unit Constructed (higher x3 means higher relative credit surplus and lower costs of capital)
x4 – Savings per Capita (does not adjust for new housing demand)
x5 – Percent Increase in Population, 1950-1960
x6 – Percent of First Mortgage Debt Controlled by Inter-regional Banks

The data, fitted values, and residuals are given in Table 16. The Analysis of Variance is given in Table 17. The regression coefficients, test statistics, and p-values are given in Table 18. (Show that the fitted value for Los Angeles is 6.19, based on the fitted equation, and that the residual is −0.02.) Based on the large F-statistic and its small corresponding p-value, we conclude that this set of predictor variables is associated with the mortgage rate. That is, at least one of these independent variables is associated with y. Based on the t-tests, while none are strictly significant at the α = 0.05 level, there is some evidence that x1 (Loan Value/Mortgage Value, P = .0515), x3 (Savings per Unit Constructed, P = .0593), and, to a lesser extent, x4 (Savings per Capita, P = .1002) are helpful in predicting mortgage rates. We can fit a reduced model with just these three predictors, and test whether we can simultaneously drop x2, x5, and x6 from the model. That is: H0: β2 = β5 = β6 = 0 vs HA: β2 ≠ 0 and/or β5 ≠ 0 and/or β6 ≠ 0. First, we have the following values: SSE_F = 0.10980, df_F = 18 − 6 − 1 = 11, MSE_F = 0.00998, df_R = 18
− (6 − 3) − 1 = 14, and F_{.05, 3, 11} = 3.59.

SMSA                       y      x1     x2     x3     x4      x5    x6    ŷ     y − ŷ
Los Angeles-Long Beach    6.17   7871   3042   9173   173811  4575  3371  6.19  −0.02
Denver                    6.06   7770   1997   8471   111074  5178  2179  6.04   0.02
San Francisco-Oakland     6.04   7517   3162   12913  173811  2470  4670  6.05  −0.01
Dallas-Fort Worth         6.04   7774   1821   4172   77874   4577  5173  6.05  −0.01
Miami                     6.02   7774   1542   11971  113677  8879  1877  6.04  −0.02
Atlanta                   6.02   7376   1074   3273   58279   3979  2676  5.92   0.10
Houston                   5.99   7673   1856   4572   77874   5471  3577  6.02  −0.03
Seattle                   5.91   7275   3024   10977  118670  3171  1770  5.91   0.00
New York                  5.89   7773    216   36473  258274  1179   773  5.82   0.07
Memphis                   5.87   7774   1350   11170  61376   2774  1173  5.86   0.01
New Orleans               5.85   7274   1544   8170   63671   2773   871  5.81   0.04
Cleveland                 5.75   6770    631   20277  134670  2476  1070  5.64   0.11
Chicago                   5.73   6879    972   29071  162678  2071   974  5.60   0.13
Detroit                   5.66   7077    699   22374  104976  2477  3177  5.63   0.03
Minneapolis-St. Paul      5.66   6978   1377   13874  128973  2878  1977  5.81  −0.15
Baltimore                 5.63   7279    399   12574  83673   2279   876  5.77  −0.14
Philadelphia              5.57   6877    304   25975  131573  1873  1877  5.57   0.00
Boston                    5.28   6778      0   42872  208170   775   270  5.41  −0.13

Table 16: Data and fitted values for mortgage rate multiple regression example

ANOVA
Source of Variation | Sum of Squares    | Degrees of Freedom          | Mean Square   | F         | p-value
MODEL               | SSR = 0.73877     | k = 6                       | MSR = 0.12313 | F = 12.33 | .0003
ERROR               | SSE = 0.10980     | n − k − 1 = 18 − 6 − 1 = 11 | MSE = 0.00998 |           |
TOTAL               | SS_yy = 0.84858   | n − 1 = 17                  |               |           |

Table 17: The Analysis of Variance Table for Mortgage rate regression analysis

                            STANDARD
PARAMETER      ESTIMATE       ERROR       t-statistic   P-value
INTERCEPT (β0) b0 = 4.28524   0.66825        6.41        .0001
x1 (β1)        b1 = 0.02033   0.00931        2.18        .0515
x2 (β2)        b2 = 0.0000014 0.0000047      0.29        .7775
x3 (β3)        b3 = −0.00158  0.000753      −2.10        .0593
x4 (β4)        b4 = 0.000202  0.000112       1.79        .1002
x5 (β5)        b5 = 0.00128   0.00177        0.73        .4826
x6 (β6)        b6 = 0.0000236 0.000230       0.10        .9203

Table 18: Parameter estimates and tests of hypotheses for individual parameters – Mortgage rate regression analysis

ANOVA
Source of Variation | Sum of Squares    | Degrees of Freedom                | Mean Square   | F         | p-value
MODEL               | SSR = 0.73265     | k − q = 3                         | MSR = 0.24422 | F = 29.49 | .0001
ERROR               | SSE = 0.11593     | n − (k − q) − 1 = 18 − 3 − 1 = 14 | MSE = 0.00828 |           |
TOTAL               | SS_yy = 0.84858   | n − 1 = 17                        |               |           |

Table 19: The Analysis of Variance Table for Mortgage rate regression analysis (Reduced Model)

                            STANDARD
PARAMETER      ESTIMATE       ERROR        t-statistic   P-value
INTERCEPT (β0) b0 = 4.22260   0.58139         7.26        .0001
x1 (β1)        b1 = 0.02229   0.00792         2.81        .0138
x3 (β3)        b3 = −0.00186  0.00041778     −4.46        .0005
x4 (β4)        b4 = 0.000225  0.000074        3.03        .0091

Table 20: Parameter estimates and tests of hypotheses for individual parameters – Mortgage rate regression analysis (Reduced Model)

Next, we fit the reduced model, with β2 = β5 = β6 = 0. We get the Analysis of Variance in Table 19 and the parameter estimates in Table 20. Note first that all three regression coefficients are significant now at the α = 0.05 significance level. Also, our residual standard error, s_e = √MSE, has decreased from 0.0999 to 0.0910. This implies we have lost very little predictive ability by dropping x2, x5, and x6 from the model. Now, to formally test whether these three predictor variables' regression coefficients are simultaneously 0 (with α = 0.05):

• H0: β2 = β5 = β6 = 0
• HA: β2 ≠ 0 and/or β5 ≠ 0 and/or β6 ≠ 0
• T.S.: F_obs = [(0.11593 − 0.10980)/3] / 0.00998 = 0.00204/0.00998 = 0.205
• R.R.: F_obs ≥ F_{.05, 3, 11} = 3.59

We fail to reject H0, and conclude that none of x2, x5, or x6 is associated with mortgage rate, after controlling for x1, x3, and x4.

Example 5.3 – Store Location Characteristics and Sales: A study proposed using linear regression to describe sales at retail stores based on location characteristics. As a case study, the authors modelled sales at n = 16 liquor stores in Charlotte, N.C. (Note that in North Carolina all stores are state run and do not practice promotion, as liquor stores in Florida do.) The response was SALES volume (for the individual stores) in the fiscal year 7/1/1979-6/30/1980. The independent variables were: POP (number of people living within 15 miles of store), MHI (mean household income among households within 15 miles of store), DIS (distance to the nearest store), TFL (daily traffic volume on the street on which the store was located), and EMP (the amount of employment within 15 miles of the store).
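Since this case-study write-up reports only R² = .69 (no ANOVA table is given), it is worth noting that the overall F statistic can be recovered from R² alone, using SSR = R²·SS_yy and SSE = (1 − R²)·SS_yy. A minimal sketch (not part of the original notes):

```python
def overall_f(r_squared, n, k):
    # F = MSR/MSE = [R^2 / k] / [(1 - R^2) / (n - k - 1)]
    return (r_squared / k) / ((1.0 - r_squared) / (n - k - 1))

# Liquor store study: R^2 = .69, n = 16 stores, k = 5 predictors
f_obs = overall_f(0.69, n=16, k=5)
```

The identity holds because SS_yy cancels from the numerator and denominator of MSR/MSE, so no sums of squares are needed.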
The regression coefficients and standard errors are given in Table 21. (Source: Lord, J.D. and C.D. Lynds (1981), "The Use of Regression Models in Store Location Research: A Review and Case Study", Akron Business and Economic Review, Summer, 13-19.)

Variable   Estimate    Std Error
POP        0.09460     0.01819
MHI        0.06129     0.02057
DIS        4.88524     1.72623
TFL        2.59040     1.22768
EMP        0.00245     0.00454

Table 21: Regression coefficients and standard errors for liquor store sales study

(a) Do any of these variables "fail" to be associated with store sales, after controlling for the others?

(b) Consider the signs of the significant regression coefficients. What do they imply?

5.5 R² and Adjusted-R²

As was discussed in the previous chapter, the coefficient of multiple determination represents the proportion of the variation in the dependent variable that is "explained" by the regression on the collection of independent variables x1, …, xk. R² is computed exactly as before:

R² = SSR/SS_yy = 1 − SSE/SS_yy

One problem with R² is that, when we continually add independent variables to a regression model, it continually increases (or at least never decreases), even when the new variables add little or no predictive power. Since we are trying to fit the simplest (most parsimonious) model that explains the relationship between the set of independent variables and the dependent variable, we need a measure that penalizes models containing useless or redundant independent variables. This penalization takes into account that, by including useless or redundant predictors, we are decreasing the error degrees of freedom (df_E = n − k − 1). A second measure, which does not carry the "proportion of variation explained" interpretation but is useful for comparing models of varying degrees of complexity, is Adjusted-R²:

Adjusted-R² = 1 − [SSE/(n − k − 1)] / [SS_yy/(n − 1)] = 1 − [(n − 1)/(n − k − 1)](SSE/SS_yy)

Example 5.1 (Continued) – Texas Weather Data: Consider the two models we have fit:

Full Model – IVs: LATITUDE, LONGITUDE, ELEVATION
Reduced Model – IVs: LATITUDE, ELEVATION

For the
Full Model we have n 16 k 3 SSE 7609 SSyy 941938 and we obtain R129 and Adj 135 7609 15 R2177170080992 Ad 7R2177 F 941938 J 12 lt For the Reduced Model we have 7609 m 17125008 09900 n 16 k 2 SSE 9406 SSW 941938 and we obtain R123 and Adj R33 9406 15 9406 2 7 7 7 7 7 2 7 7 7 R 1 941938 1 39010 039990 Adj RR 1 13 941938 Thus by both measures the Full Model wins but it should be added that both appear to fit the data very well gt 17115010 09885 Example 52 Continued 7 Mortgage Financing Costs For the mortgage data with Total Sum of Squares SSyy 084858 and n 18 when we include all 6 independent variables in the full model we obtain the following results SSR 073877 SSE 010980 k 6 From this full model we compute R2 and Adj R2 2 SSRF 073877 RF 7 7 08706 Ad 7R2 17 SS 084858 J F n n71 551915 17 010980 7571 i 77 08000 SSW 11 084858gt Example 53 Continued 7 Store Location Characteristics and Sales In this study the authors reported that R2 069 Note that although we are not given the Analysis of Variance we can still conduct the F test for the overall model F7 MSR 7 SSRk 7 iiik 7 RZk MSE SSE7171571 53 57971 17R2n7k71 For the liquor store example there were 71 16 stores and k 5 variables in the full model To test Hoi l g g 4 50 US HANotallB0 we get the following test statistic and rejection region 04 005 069 5 0138 T5 3 F0179 445 RR 3 F0179 Z Fakn7k71 F005510 333 17 06916 7 5 71 m Thus at least one of these variables is associated with store sales What is Adjusted R2 for this analysis 56 Mult icollinearity Textbook Section 194 Supplement Multicollinearity refers to the situation where independent variables are highly correlated among themselves This can cause problems mathematically and creates problems in interpreting regression coefficients Some of the problems that arise include o Difficult to interpret regression coefficient estimates 0 In ated std errors of estimates and thus small testatistics 0 Signs of coefficients may not be what is expected 0 However predicted values 
are not adversely affected.

It can be thought that the independent variables are explaining the "same" variation in y, and it is difficult for the model to attribute the variation explained (recall partial regression coefficients). Variance Inflation Factors provide a means of detecting whether a given independent variable is causing multicollinearity. They are calculated for each independent variable as:

VIF_j = 1 / (1 - R2_j)

where R2_j is the coefficient of multiple determination when x_j is regressed on the k - 1 other independent variables. One rule of thumb suggests that severe multicollinearity is present if VIF_j > 10 (that is, R2_j > .90).

Example 5.1 (Continued)

First we run a regression with ELEVATION as the dependent variable and LATITUDE and LONGITUDE as the independent variables. We then repeat the process with LATITUDE as the dependent variable, and finally with LONGITUDE as the dependent variable. Table 22 gives R2 and VIF for each model.

Variable     R2      VIF
ELEVATION   .9393   16.47
LATITUDE    .7635    4.23
LONGITUDE   .8940    9.43

Table 22: Variance Inflation Factors for Texas weather data

Note how large the factor is for ELEVATION. Texas elevation increases as you go West and as you go North. The Western rise is the more pronounced of the two (the simple correlation between ELEVATION and LONGITUDE is .89). Consider the effects on the coefficients in Table 23 and Table 24 (these are subsets of previously shown tables). Compare the estimate and estimated standard error for the coefficients for ELEVATION and LATITUDE in the two models. In particular, the ELEVATION coefficient doubles in absolute value and its standard error decreases by a factor of almost 3. The LATITUDE coefficient and standard error do not change very much. We choose to keep ELEVATION (as opposed to LONGITUDE) in the model due to theoretical considerations with respect to weather and climate.

                                            STANDARD ERROR
PARAMETER          ESTIMATE                 OF ESTIMATE
INTERCEPT beta0    b0 =  109.25887          2.97857
LATITUDE  beta1    b1 =   -1.99323          0.13639
LONGITUDE beta2    b2 =   -0.38471          0.22858
ELEVATION beta3    b3 =   -0.00096          0.00057

Table 23:
Parameter estimates and standard errors for the full model

                                            STANDARD ERROR
PARAMETER          ESTIMATE                 OF ESTIMATE
INTERCEPT beta0    b0 =   63.45485          0.48750
ELEVATION beta1    b1 =   -0.00185          0.00022
LATITUDE  beta2    b2 =   -1.83216          0.10380

Table 24: Parameter estimates and standard errors for the reduced model

5.7 Autocorrelation

Textbook Section 19.5

Recall a key assumption in regression: error terms are independent. When data are collected over time, the errors are often serially correlated (autocorrelated). Under first-order autocorrelation, consecutive error terms are linearly related:

epsilon_t = rho * epsilon_{t-1} + nu_t

where rho is the correlation between consecutive error terms, and nu_t is a normally distributed independent error term. When errors display a positive correlation (rho > 0), consecutive error terms are associated. We can test this relation as follows; note that when rho = 0, error terms are independent, which is the assumption in the derivation of the tests in the chapters on linear regression.

Durbin-Watson Test for Autocorrelation

H0: rho = 0 (No autocorrelation)    Ha: rho > 0 (Positive autocorrelation)

D = [ sum from t=2 to n of (e_t - e_{t-1})^2 ] / [ sum from t=1 to n of e_t^2 ]

Decision rule: D >= dU --> Do not reject H0;  D <= dL --> Reject H0;  dL < D < dU --> Withhold judgement.

Values of dL and dU (indexed by n and k, the number of predictor variables) are given in Table 11(a), p. B-22.

Cures for Autocorrelation:

o Additional independent variables - A variable may be missing from the model that will eliminate the autocorrelation.
o Transform the variables - Take "first differences" y'_t = y_{t+1} - y_t and x'_t = x_{t+1} - x_t, and run the regression with the transformed y and x.

Example 5.4 - Spirits Sales and Income and Prices in Britain

A study was conducted relating annual spirits (liquor) sales (y) in Britain to per capita income (x1) and prices (x2), where all monetary values were in constant (adjusted for inflation) dollars, for the years 1870-1938. The following output gives the results from the regression analysis and the Durbin-Watson statistic. Note that there are n = 69 observations and k = 2 predictors, and the approximate lower and upper bounds for the rejection region are dL = 1.55 and dU = 1.67 for
an alpha = 0.05 level test. Since the test statistic is d = 0.247 (see output below), we reject the null hypothesis of no autocorrelation among the residuals and conclude that they are positively correlated. See Figure 17 for a plot of the residuals versus year. Source: Durbin, J. and Watson, G.S. (1950), "Testing for Serial Correlation in Least Squares Regression, I," Biometrika, 37:409-428.

The REG Procedure
Model: MODEL1
Dependent Variable: consume

Analysis of Variance

                          Sum of       Mean
Source            DF     Squares      Square    F Value    Pr > F
Model              2     4.80557     2.40278     712.27    <.0001
Error             66     0.22264     0.00337
Corrected Total   68     5.02821

Root MSE          0.05808     R-Square    0.9557
Dependent Mean    1.76999     Adj R-Sq    0.9544
Coeff Var         3.28143

Parameter Estimates

                   Parameter    Standard
Variable     DF    Estimate     Error       t Value    Pr > |t|
Intercept     1     4.61171     0.15262       30.22     <.0001
income        1     0.11846     0.10885        1.09     0.2804
price         1    -1.23174     0.05024      -24.52     <.0001

Durbin-Watson D               0.247
Number of Observations           69
1st Order Autocorrelation     0.852

Figure 17: Plot of the residuals versus year for British spirits data (residuals against years 1870-1940)

6 Special Cases of Multiple Regression

Textbook Sections 20.2-20.3

In this section, we will look at three special cases that are frequently used methods of multiple regression. The ideas, such as the Analysis of Variance, tests of hypotheses, and
parameter estimates, are exactly the same as before, and we will concentrate on their interpretation through specific examples. The three special cases are:

1. Polynomial Regression
2. Regression Models with Nominal (Dummy) Variables
3. Regression Models Containing Interaction Terms

6.1 Polynomial Regression

While certainly not restricted to this case, it is best to describe polynomial regression in the case of a model with only one predictor variable. In many real-world settings, relationships will not be linear, but will demonstrate nonlinear associations. In economics, a widely described phenomenon is "diminishing marginal returns." In this case, y may increase with x, but the rate of increase decreases over the range of x. By adding quadratic terms, we can test if this is the case. Other situations may show that the rate of increase in y is increasing in x.

Example 6.1 - Health Club Demand

y = beta0 + beta1*x + beta2*x^2 + epsilon

Again, we assume that epsilon ~ N(0, sigma). In this model, the number of people attending in a day when there are x machines is normally distributed with mean beta0 + beta1*x + beta2*x^2 and standard deviation sigma. Note that we are no longer saying that the mean is linearly related to x, but rather that it is (approximately) quadratically related to x (curved). Suppose she leases varying numbers of machines over a period of n = 12 Wednesdays (always advertising how many machines will be there on the following Wednesday), observes the number of people attending the club each day, and obtains the data in Table 25. In this case, we would like to fit the multiple regression model:

y = beta0 + beta1*x + beta2*x^2 + epsilon

which is just like our previous model, except that instead of a second predictor variable x2, we are using the variable x^2; the effect is that the fitted equation y-hat will be a curve in 2 dimensions, not a plane in 3 dimensions as we saw in the weather example. First, we will run the regression on the computer, obtaining the Analysis of Variance and the parameter estimates, then plot the data and fitted equation. Table 26 gives the Analysis of Variance for this example,
and Table 27 gives the parameter estimates and their standard errors. Note that even though we have only one predictor variable, it is being used twice and could, in effect, be treated as two different predictor variables, so k = 2. The first test of hypothesis is whether the attendance is associated with the number of machines. This is a test of H0: beta1 = beta2 = 0. If the null hypothesis is true, that implies mean daily attendance is unrelated to the number of machines, and thus the club owner would purchase very few (if any) of the machines.

Week   Machines (x)   Attendance (y)
  1        3              555
  2        6              776
  3        1              267
  4        2              431
  5        5              722
  6        4              635
  7        1              218
  8        5              692
  9        3              534
 10        2              459
 11        6              810
 12        4              671

Table 25: Data for health club example

ANOVA
Source of    Sum of             Degrees of        Mean
Variation    Squares            Freedom           Square             F        p-Value
MODEL        SSR = 393933.12    k = 2             MSR = 196966.56    253.80    .0001
ERROR        SSE =   6984.55    n-k-1 = 12-2-1=9  MSE =    776.06
TOTAL        SSyy = 400917.67   n-1 = 11

Table 26: The Analysis of Variance Table for health club data

                                      t FOR H0:              STANDARD ERROR
PARAMETER         ESTIMATE            betai = 0    P-VALUE   OF ESTIMATE
INTERCEPT beta0   b0 =   72.0500        2.04       .0712     35.2377
MACHINES  beta1   b1 =  199.7625        8.67       .0001     23.0535
MACHINES SQ beta2 b2 =  -13.6518       -4.23       .0022      3.2239

Table 27: Parameter estimates and tests of hypotheses for individual parameters

As before, this test is the F-test from the Analysis of Variance table, which we conduct here at alpha = .05:

1. H0: beta1 = beta2 = 0
2. HA: Not both betai = 0
3. T.S.: F_obs = MSR/MSE = 196966.56/776.06 = 253.80
4. R.R.: F_obs > F_{2,9,.05} = 4.26 (This is not provided on the output; the p-value takes the place of it.)
5. p-value: P(F > 253.80) = .0001 (Actually it is less than .0001, but this is the smallest p-value the computer will print.)

Another test with an interesting interpretation is H0: beta2 = 0. This is testing the hypothesis that the mean increases linearly with x (since if beta2 = 0 this becomes the simple regression model; refer back to the coffee data example). The t-test in Table 27 for this hypothesis has a test statistic t_obs = -4.23, which corresponds to a p-value of .0022, which, since it is below .05, implies we reject H0 and conclude beta2 is not 0.
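The least-squares computations behind Tables 26 and 27 can be reproduced directly. The sketch below (variable names are ours) fits the quadratic by ordinary least squares with numpy, using the data transcribed from Table 25, and recovers the coefficient estimates, SSE, and R2.

```python
import numpy as np

# Health club data transcribed from Table 25
machines = np.array([3, 6, 1, 2, 5, 4, 1, 5, 3, 2, 6, 4], dtype=float)
attend = np.array([555, 776, 267, 431, 722, 635, 218, 692, 534, 459, 810, 671],
                  dtype=float)

# Design matrix with columns for the intercept, x, and x^2
X = np.column_stack([np.ones_like(machines), machines, machines**2])

# Ordinary least-squares estimates b0, b1, b2
b, *_ = np.linalg.lstsq(X, attend, rcond=None)

fitted = X @ b
sse = float(np.sum((attend - fitted) ** 2))          # error sum of squares
ssyy = float(np.sum((attend - attend.mean()) ** 2))  # total sum of squares
r2 = 1.0 - sse / ssyy

print(b)        # approx [72.05, 199.7625, -13.6518], as in Table 27
print(sse, r2)  # approx 6984.55 and 0.9826, as in Table 26
```

Running this gives estimates agreeing with the tables to rounding, a useful check that the data were transcribed correctly.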
Since b2 is negative, we will conclude that beta2 is negative, which is in agreement with her theory that once you get to a certain number of machines, it does not help to keep adding new machines. This is the idea of diminishing returns. Figure 18 shows the actual data and the fitted equation:

y-hat = 72.0500 + 199.7625*x - 13.6518*x^2

Figure 18: Plot of the data and fitted equation for health club example

6.2 Regression Models With Nominal (Dummy) Variables

All of the predictor variables we have used so far were numeric, or what are often called quantitative variables. Other variables also can be used that are called qualitative variables. Qualitative variables measure characteristics that cannot be described numerically, such as a person's sex, race, religion, or blood type; or a city's region or mayor's political affiliation; the list of possibilities is endless. In this case, we frequently have some numeric predictor variable(s) that we believe is (are) related to the response variable, but we believe this relationship may be different for different levels of some qualitative variable of interest. If a qualitative variable has m levels, we create m-1 indicator (or dummy) variables. Consider an example where we are interested in health care expenditures as related to age, for men and women separately. In this case, the response variable is health care expenditures, one predictor variable is age, and we need to create a variable representing sex. This can be done by creating a variable x2 that takes on the value 1 if a person is female and 0 if the person is male. In this case, we can write the mean response as before:

E[y | x1, x2] = beta0 + beta1*x1 + beta2*x2

Note that for women of age x1, the mean expenditure is E[y | x1, 1] = beta0 + beta1*x1 + beta2*(1) = (beta0 + beta2) + beta1*x1, while for men of age x1, the mean expenditure is E[y | x1, 0] = beta0 + beta1*x1 + beta2*(0) = beta0 + beta1*x1. This model allows for different means for men and women, but requires they have the same slope (we will see a more general case in the next section). In this case, the interpretation of
g 0 is that the means are the same for both sexes this is a hypothesis a health care professional may wish to test in a study In this example the variable sex had two variables so we had to create 2 7 1 1 dummy variable now consider a second example Example 62 We would like to see if annual per capita clothing expenditures is related to annual per capita income in cities across the Us Further we would like to see if there is any differences in the means across the 4 regions Northeast South Midwest and West Since the variable region has 4 levels we will create 3 dummy variables 23 and 4 as follows we leave 1 to represent the predictor variable per capita income 7 1 ifregionSouth 27 0 otherwise m 7 1 if regionMidwest 3 T 0 otherwise m 7 1 if regionWest 4T 0 otherwise Note that cities in the Northeast have 2 3 4 0 while cities in other regions will have either 23 or 4 being equal to 1 Northeast cities act like males did in the previous example The data are given in Table 28 The Analysis of Variance is given in Table 29 and the parameter estimates and standard errors are given in Table 30 Note that we would fail to reject H0 Bl g g B4 0 at 04 05 significance level if we looked only at the F statistic and it s p value F0179 293 p value0562 This would lead us to conclude that there is no association between the predictor variables income and region and the response variable clothing expenditures This is where you need to be careful when using multiple regression with many predictor variables Look at the test of H0 Bl 0 based on the t test in Table 30 Here we observe tobs311 with a p value of 0071 We thus conclude 31 74 0 and that clothing expenditures is related to income as we would expect However we do fail to reject H0 g 0 H0 g 0and H0 B4 0 so we fail to observe any differences among the regions in terms of clothing expenditures after adjusting for the variable income Figure 19 and Figure 20 show the original data using region as the plotting symbol and the 4 tted 
equations corresponding to the 4 51 PER CAPITA INCOME amp CLOTHING EXPENDITURES 1990 Income Expenditure Metro Area Region 1 y 2 3 4 New York City Northeast 25405 2290 0 0 0 Philadelphia Northeast 21499 2037 0 0 0 Pittsburgh Northeast 18827 1646 0 0 0 Boston Northeast 24315 1659 0 0 0 Buffalo Northeast 17997 1315 0 0 0 Atlanta South 20263 2108 1 0 0 MiamiFt Laud South 19606 1587 1 0 0 Baltimore South 21461 1978 1 0 0 Houston South 19028 1589 1 0 0 DallasFt Worth South 19821 1982 1 0 0 Chicago Midwest 21982 2108 0 1 0 Detroit Midwest 20595 1262 0 1 0 Cleveland Midwest 19640 2043 0 1 0 MinneapolisSt Paul Midwest 21330 1816 0 1 0 St Louis Midwest 20200 1340 0 1 0 Seattle West 21087 1667 0 0 1 Los Angeles West 20691 2404 0 0 1 Portland West 18938 1440 0 0 1 San Diego West 19588 1849 0 0 1 San FranOakland West 25037 2556 0 0 1 Table 28 Clothes Expenditures and income example ANOVA Source of Sum of Degrees of Mean Variation Squares Freedom Square F p Value MODEL 11164190 4 2791047 293 0562 ERROR 14266402 15 951093 TOTAL 25430592 19 Table 29 The Analysis of Variance Table for clothes expenditure data 25 FOR H0 STANDARD ERROR PARAMETER ESTIMATE 10 P VALUE OF ESTIMATE INTERCEPT 60 7657428 7082 4229 797948 1 61 0113 311 0071 0036 952 62 237494 117 2609 203264 953 63 21691 011 9140 197536 954 64 254992 130 2130 196036 Table 30 Parameter estimates and tests of hypotheses for indiVidual parameters regions Recall that the tted equation is 37 76574280113z1 l 237494z2 l 21691z3254992z47 and each of the regions has a different set of levels of variables m2 m3 and 4 HPJHPJHPJHPJNFQNIQNIQN moJanJwoor4mltpinm ocgto oltaolt3cgtoltaoltaoltzo H cgtolt3olt3cgtoltgtoltaolt3cgtolt3lt I I I I I I 6000 18000 20000 22000 24000 26000 X1 REGION N N N Midwest S S S Northeast M M M South W W N West Figure 19 Plot of clothing data7 with plotting symbol region uuu uuquuuuuupuuq 16000 18000 20000 22000 24000 26000 X1 Figure 20 Plot of tted equations for each region 63 Regression Models With 
Interactions

In some situations, two or more predictor variables may interact in terms of their effects on the mean response. That is, the effect on the mean response of changing the level of one predictor variable depends on the level of another predictor variable. This idea is easiest understood in the case where one of the variables is qualitative.

Example 6.3 - Truck and SUV Safety Ratings

Several years ago, The Wall Street Journal reported safety scores on 33 models of SUVs and 53 trucks. Safety scores were reported, as well as the vehicle's weight (x1) and an indicator of whether the vehicle has side airbags (x2 = 1 if it does, 0 if not). We fit a model relating safety scores to weight, presence of side airbags, and an interaction term that allows the effect of weight to depend on whether side airbags are present:

y = beta0 + beta1*x1 + beta2*x2 + beta3*x1*x2 + epsilon

We can write the equations for the two side airbag types as follows:

Side airbags:     y = beta0 + beta1*x1 + beta2*(1) + beta3*x1*(1) + epsilon = (beta0 + beta2) + (beta1 + beta3)*x1 + epsilon

No side airbags:  y = beta0 + beta1*x1 + beta2*(0) + beta3*x1*(0) + epsilon = beta0 + beta1*x1 + epsilon

The data for the 33 models are given in Table 31. The Analysis of Variance table for this example is given in Table 32. Note that R2 = .5518. Table 33 provides the parameter estimates, standard errors, and individual t-tests. Note that the F-test for testing H0: beta1 = beta2 = beta3 = 0 rejects the null hypothesis (F_obs = 11.90, p-value = .0001), but none of the individual t-tests are significant (all p-values exceed 0.05). This can happen due to the nature of the partial regression coefficients. It can be seen that weight is a very good predictor, and that the presence of side airbags and the interaction term do not contribute much to the model (SSE for a model containing only weight (x1) is 34937; use this to test H0: beta2 = beta3 = 0). For vehicles with side airbags, the fitted equation is:

y-hat_airbags = (b0 + b2) + (b1 + b3)*x1 = 44.18 + 0.02162*x1

while for vehicles without side airbags, the fitted equation is:

y-hat_noairbags = b0 + b1*x1 = 76.09 + 0.01262*x1

Figure 21 shows the two fitted equations for the safety data.
weight Figure 21 Plot of fitted equations for each vehicle type SUVTruck Safety Ratings Make Model Safety Weight 11 Airbag 12 TOYOTA AVALON 11134 3437 1 CHEVROLET IMPALA 11922 3454 1 FORD RANGER 11339 3543 0 BUICK LESABRE 12416 3610 1 MAZDA MPV 117113 3660 1 PLYMOUTH VOYAGER 11729 3665 0 VOLVO S80 136166 3698 1 AUDI A8 138162 3751 1 DODGE DAKOTA 12049 3765 0 ACURA RL 113105 3824 1 PONTIAC TRANSPORT 118183 3857 1 CHRYSLER TOWNampCOUNTRY 12262 3918 0 FORD F 150 1187 3926 0 TOYOTA 4RUNNER 13096 3945 0 MERCURY GRAND MARQUIS 13637 3951 0 ISUZU RODEO 12692 3966 0 TOYOTA SIENNA 138154 39 73 0 MERCURY VILLAGER 123107 4041 0 LINCOLN TOWN CAR 120183 4087 1 FORD F150X 132101 4125 0 FORD WINDSTAR 15248 4126 1 NISSAN PATHFINDER 137167 4147 1 OLDSMOBILE BRAVADO 11761 4164 0 HONDA ODYSSEY 156184 4244 0 MERCURY MOUNTAINEER 13627 4258 1 TOYOTA TUNDRA 11827 4356 0 MERCEDESBENZ ML320 140157 4396 1 FORD ECONOLINE 14072 4760 0 DODGE RAM 120108 4884 0 LINCOLN NAVIGATOR 144157 4890 1 DODGE RAM 144175 4896 0 CADILLAC ESCALANTE 15882 5372 1 CHEVROLET SUBURBAN 17026 5759 1 Table 31 Safety ratings for trucks and SUV s ANOVA Source of Sum of Degrees of Mean Variation Squares Freedom Square F p Value MODEL 383803 3 127934 1190 0001 ERROR 311688 29 10748 TOTAL 695491 32 Table 32 The Analysis of Variance Table for truck SUV safety data STANDARD ERROR 25 FOR H0 PARAMETER ESTIMATE OF ESTIMATE 610 P VALUE lNTERCEPT 60 7609 2704 281 0087 1 61 001262 00065 193 0629 952 62 3191 3178 100 3236 953 63 00090 0076 118 2487 Table 33 Parameter estimates and tests of hypotheses for individual parameters 7 Safety data 7 Introduction to Time Series and Forecasting Textbook Sections 211 216 In the remainder of the course we consider data that are collected over time Many economic and nancial models are based on time series First we will describe means of smoothing series then some simple ways to decompose a series then we will describe some simple methods used to predict future outcomes based on past values 71 
Time Series Components

Textbook Section 21.2

Time series can be broken into five components: level, long-term trend, cyclical variation, seasonal variation, and random variation. A brief description of each is given below:

Level - Horizontal sales history in absence of other sources of variation (long-run average).

Trend - Continuing pattern of increasing/decreasing values in the form of a line or curve.

Cyclical - Wave-like patterns that represent business cycles over multiple periods, such as economic expansions and recessions.

Seasonal - Patterns that occur over repetitive calendar periods, such as quarters, months, weeks, or times of the day.

Random - Short-term irregularities that cause variation in individual outcomes above and beyond the other sources of variation.

Example 7.1 - U.S. Cotton Production 1978-2001

Figure 22 represents a plot of U.S. cotton production from 1978 to 2001 (Source: Cotton association web site). We can see that there has been a trend to higher production over time, with cyclical patterns arising as well along the way. Since the data are annual production, we cannot observe seasonal patterns.

Figure 22: Plot of U.S. cotton production 1978-2001

Example 7.2 - Texas In-State Finance/Insurance/Real Estate Sales 1989-2002

Table 34 gives in-state gross sales for the Finance, Insurance, and Real Estate (FIRE) firms in the state of Texas for the 4 quarters of years 1989-2002, in hundreds of millions of dollars (Source: State of Texas web site). A plot of the data, with vertical lines delineating years, is shown in Figure 23. There is a clear positive trend in the series, and the fourth quarter tends to have much larger sales than the other three quarters. We will use the variables in the last two columns in a subsequent section.

Figure 23: Plot of quarterly Texas in-state FIRE gross sales 1989-2002

7.2 Smoothing Techniques

Textbook Section 21.3

Moving averages are averages of
values at a particular time period and values that are near it in time We will focus on odd numbered moving averages as they are simpler to describe and 57 t year quarter gross sales yu tted sales ratio ys t 1 1989 1 1567 1725 0908 2 1989 2 1998 1813 1102 3 1989 3 1929 1900 1015 4 1989 4 3152 1988 1586 5 1990 1 2108 2075 1016 6 1990 2 2004 2163 0926 7 1990 3 1965 2250 0873 8 1990 4 3145 2338 1345 9 1991 1 1 850 2425 0 763 10 1991 2 2 303 2513 0 916 11 1991 3 2 209 2600 0 850 12 1991 4 4 030 2688 1 499 13 1992 1 2 455 2776 0 884 14 1992 2 2 536 2863 0 886 15 1992 3 2 800 2951 0 949 16 1992 4 4 733 3038 1 558 17 1993 1 2 666 3126 0 853 18 1993 2 3 256 3213 1 013 19 1993 3 3 050 3301 0 924 20 1993 4 5 307 3388 1 566 21 1994 1 2 950 3476 0 849 22 1994 2 3 190 3563 0 895 23 1994 3 3 025 3651 0 829 24 1994 4 4 847 3738 1 297 25 1995 1 3 005 3826 0 785 26 1995 2 3 297 3913 0 843 27 1995 3 3 301 4001 0825 28 1995 4 4 607 4089 1127 29 1996 1 3 333 4176 0 798 30 1996 2 3 352 4264 0 786 31 1996 3 3 430 4351 0 788 32 1996 4 5 552 4439 1 251 33 1997 1 3 297 4526 0 728 34 1997 2 3 637 4614 0 788 35 1997 3 3 909 4701 0 832 36 1997 4 6 499 4789 1 357 37 1998 1 4 047 4876 0 830 38 1998 2 4 621 4964 0 931 39 1998 3 4 509 5051 0 893 40 1998 4 6 495 5139 1 264 41 1999 1 4 334 5226 0 829 42 1999 2 4 557 5314 0 858 43 1999 3 4 596 5401 0 851 44 1999 4 7 646 5489 1 393 45 2000 1 4 596 5577 0 824 46 2000 2 5 282 5664 0 933 47 2000 3 5 158 5752 0 897 48 2000 4 7 834 5839 1 342 49 2001 1 5 155 5927 0 870 50 2001 2 5 312 6014 0 883 51 2001 3 5 331 6102 0 874 52 2001 4 10 42 6189 1 684 53 2002 1 5 397 6277 0 860 54 2002 2 5 832 6364 0 916 55 2002 3 5 467 6452 0 847 56 2002 4 8 522 6539 1 303 Table 34 Quarterly in state gross sales for Texas FIRE rms 58 implement the textbook also covers even numbered MA s as well A 3 period moving averge involves averaging the value directly prior to the current time point the current value and the value directly after the current time point There will not be 
values for either the first or last periods of the series. Similarly, a 5-period moving average will include the current time point, the two prior time points, and the two subsequent time points.

Example 7.3 - U.S. Internet Retail Sales 1999q4-2003q1

The data in Table 35 give the U.S. e-commerce sales for n = 14 quarters (quarter 1 is the 4th quarter of 1999 and quarter 14 is preliminary reported sales for the 1st quarter of 2003), in millions of dollars (Source: U.S. Census Bureau).

Quarter    Sales    MA(3)    ES(0.1)    ES(0.5)
  1         5393      --       5393       5393
  2         5722     5788      5426       5558
  3         6250     6350      5508       5904
  4         7079     7526      5665       6491
  5         9248     8112      6024       7870
  6         8009     8387      6222       7939
  7         7904     7936      6390       7922
  8         7894     8862      6541       7908
  9        10788     9384      6965       9348
 10         9470    10006      7216       9409
 11         9761     9899      7470       9585
 12        10465    11332      7770      10025
 13        13770    12052      8370      11897
 14        11921       --      8725      11909

Table 35: Quarterly e-commerce sales and smoothed values for U.S. 1999q4-2003q1

To obtain the three-period moving average (MA(3)) for the second quarter, we average the first, second, and third period sales:

(5393 + 5722 + 6250)/3 = 17365/3 = 5788.3 ~ 5788

We can similarly obtain the three-period moving averages for quarters 3-13. The data and three-period moving averages are given in Figure 24. The moving average is the dashed line, while the original series is the solid line.

Exponential smoothing is an alternative means of smoothing the series. It makes use of all prior time points, with higher weights on more recent time points and exponentially decaying weights on more distant time points. One advantage is that we have smoothed values for all time points. One drawback is that we must select a tuning parameter (although we would also have to choose the length of a moving average, as well, for that method). One widely used convention is to set the first period's smoothed value to the first observation, then make subsequent smoothed values a weighted average of the current observation and the previous value of the smoothed series. We use the notation S_t for the smoothed value at time t.
Figure 24: Plot of quarterly U.S. internet retail sales and 3-period moving average

S_1 = y_1        S_t = w*y_t + (1-w)*S_{t-1},  t >= 2

Example 7.3 (Continued)

Thus, for quarter 4 of 1999, we set S_1 = y_1 = 5393. In Table 35, we include smoothed values based on w = 0.1 and w = 0.5, respectively:

w = 0.1:  S_2 = 0.1*y_2 + 0.9*S_1 = 0.1(5722) + 0.9(5393) = 572.2 + 4853.7 = 5425.9 ~ 5426

w = 0.5:  S_2 = 0.5*y_2 + 0.5*S_1 = 0.5(5722) + 0.5(5393) = 2861.0 + 2696.5 = 5557.5 ~ 5558

The smoothed values are given in Table 35, as well as in Figure 25. The solid line is the original series, the smoothest line is w = 0.1, and the intermediate line is w = 0.5.

Figure 25: Plot of quarterly U.S. internet retail sales and exponentially smoothed series

7.3 Estimating Trend and Seasonal Effects

Textbook Section 21.4

While the cyclical patterns are difficult to predict and estimate, we can estimate linear trend and seasonal indexes fairly simply. Further, there is no added difficulty if the trend is nonlinear (quadratic), but we will consider only the linear case here. First, we must identify seasons; these can be weeks, months, or quarters (or even times of the day or days of the week). Then we fit a linear trend for the entire series. This is followed by taking the ratio of the actual to the fitted value (from the regression equation) for each period. Next, we average these ratios for each season, and adjust so that the averages sum to 1.0.

Example 7.2 (Continued)

Consider the Texas gross in-state sales for the FIRE industry. The seasons are the four quarters. Fitting a simple linear regression relating sales to time period, we get:

y-hat = b0 + b1*t = 1.6376 + 0.08753*t

The fitted values, as well as the observed values, have been shown previously in Table 34. Also, for each outcome, we obtain the ratio of the observed to fitted value, also given in the table. Consider the first and last cases:

y_1 = 1.567, t = 1:    y-hat_1 = 1.6376 + 0.08753(1) = 1.725      y_1/y-hat_1 = 1.567/1.725 = 0.908

y_56 = 8.522, t = 56:  y-hat_56 = 1.6376 + 0.08753(56) = 6.539    y_56/y-hat_56 = 8.522/6.539 = 1.303

Next we take the mean of the observed-to-fitted ratios for each quarter. There are 14 years of
data Q1 0908 1016 0763 0884 0853 0849 0785 0798 0728 0830 0829 0824 0870 0860 7 0 84 14 7 The means for the remaining three quarters are Q2 0906 Q3 0875 Q4 1398 The means sum to 4022 and have a mean of 4022410055 If we divide each mean by 10055 the indexes will sum to 1 Q1 0838 Q2 0901 Q3 0870 Q4 1390 The seasonally adjusted time series is given by dividing each observed value by its seasonal index This way we can determine when there are real changes in the series beyond seasonal uctuations Table 36 contains all components as well as the seasonally adjusted values t year quarter gross sales yu tted sales ratio ys s Season adjusted 1 1989 1 1567 1725 0908 1870 2 1989 2 1998 1813 1102 2218 3 1989 3 1929 1900 1015 2218 4 1989 4 3152 1988 1586 2268 5 1990 1 2108 2075 1016 2515 6 1990 2 2004 2163 0926 2224 7 1990 3 1965 2250 0873 2259 8 1990 4 3145 2338 1345 2263 9 1991 1 1 850 2 425 0763 2 207 10 1991 2 2 303 2 513 0 916 2 556 11 1991 3 2 209 2 600 0 850 2 540 12 1991 4 4 030 2 688 1499 2 899 13 1992 1 2 455 2 776 0884 2 929 14 1992 2 2 536 2 863 0886 2 815 15 1992 3 2 800 2 951 0949 3 218 16 1992 4 4 733 3 038 1 558 3 405 17 1993 1 2 666 3 126 0 853 3 182 18 1993 2 3 256 3 213 1013 3 614 19 1993 3 3 050 3 301 0924 3 506 20 1993 4 5 307 3 388 1 566 3 818 21 1994 1 2 950 3 476 0849 3 521 22 1994 2 3 190 3 563 0895 3 540 23 1994 3 3 025 3 651 0829 3 477 24 1994 4 4 847 3 738 1 297 3 487 25 1995 1 3 005 3826 0785 3 585 26 1995 2 3 297 3913 0843 3 660 27 1995 3 3 301 4001 0825 3 794 28 1995 4 4 607 4 089 1127 3 314 29 1996 1 3 333 4 176 0 798 3 977 30 1996 2 3 352 4 264 0786 3 720 31 1996 3 3 430 4 351 0788 3 942 32 1996 4 5 552 4 439 1251 3 994 33 1997 1 3 297 4 526 0728 3 934 34 1997 2 3 637 4 614 0788 4 037 35 1997 3 3 909 4 701 0832 4 493 36 1997 4 6 499 4 789 1357 4 675 37 1998 1 4 047 4 876 0830 4 829 38 1998 2 4 621 4 964 0931 5 129 39 1998 3 4 509 5 051 0893 5 183 40 1998 4 6 495 5 139 1 264 4 672 41 1999 1 4 334 5 226 0829 5 171 42 1999 2 4 557 5 314 0858 5 058 
43  1999  3   4.596   5.401   0.851   5.282
44  1999  4   7.646   5.489   1.393   5.501
45  2000  1   4.596   5.577   0.824   5.485
46  2000  2   5.282   5.664   0.933   5.862
47  2000  3   5.158   5.752   0.897   5.928
48  2000  4   7.834   5.839   1.342   5.636
49  2001  1   5.155   5.927   0.870   6.152
50  2001  2   5.312   6.014   0.883   5.896
51  2001  3   5.331   6.102   0.874   6.128
52  2001  4  10.42    6.189   1.684   7.498
53  2002  1   5.397   6.277   0.860   6.440
54  2002  2   5.832   6.364   0.916   6.473
55  2002  3   5.467   6.452   0.847   6.283
56  2002  4   8.522   6.539   1.303   6.131

Table 36: Quarterly in-state gross sales for Texas FIRE firms, and seasonally adjusted series

7.4 Introduction to Forecasting

Textbook Section 21.5

There are an unlimited number of possible ways of forecasting future outcomes, so we need means of comparing the various methods. First, we introduce some notation:

o y_t - Actual (random) outcome at time t, unknown prior to t
o F_t - Forecast of y_t, made prior to t
o e_t - Forecast error: e_t = y_t - F_t (the book does not use this notation)

Two commonly used measures for comparing forecasting methods are given below:

Mean Absolute Deviation (MAD):  MAD = (sum of |e_t|) / n,  where n = number of forecasts

Sum of Squared Errors (SSE):  SSE = sum of e_t^2 = sum of (y_t - F_t)^2

When comparing forecasting methods, we wish to minimize one or both of these quantities.

7.5 Simple Forecasting Techniques

Textbook Section 21.6 and Supplement

In this section, we describe some simple methods of using past data to predict future outcomes. Most forecasts you hear reported are generally complex hybrids of these techniques.

7.5.1 Moving Averages

This method, which is not included in the text, is a slight adjustment to the centered moving averages in the smoothing section. At time point t, we use the previous k observations to forecast y_t; that is, we use the mean of the last k observations to forecast the outcome at t:

F_t = (y_{t-1} + y_{t-2} + ... + y_{t-k}) / k

Problem: How to choose k?

Example 7.4 - Anheuser-Busch Annual Dividend Yields 1952-1995

Table 37 gives average dividend yields for Anheuser-Busch for the years 1952-1995 (Source: Value Line), along with forecasts and errors based on moving averages with lags of 1, 2, and 3.
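The lag-k moving-average forecast and the MAD criterion can be sketched in a few lines. The function names below are ours; the check uses the 1960-1962 yields from Table 37 to reproduce the 1963 forecasts worked out below (3.20, 3.00, and 3.47 for lags 1, 2, and 3).

```python
def ma_forecast(history, k):
    """Forecast the next outcome as the mean of the k most recent observations."""
    if len(history) < k:
        raise ValueError("need at least k observations")
    return sum(history[-k:]) / k

def mad(actuals, forecasts):
    """Mean absolute deviation of the forecast errors e_t = y_t - F_t."""
    errors = [abs(y - f) for y, f in zip(actuals, forecasts)]
    return sum(errors) / len(errors)

# Anheuser-Busch dividend yields for 1960-1962, transcribed from Table 37
y = [4.40, 2.80, 3.20]

# One-step-ahead forecasts of the 1963 yield for lags 1, 2, 3
print(ma_forecast(y, 1))            # 3.2
print(round(ma_forecast(y, 2), 2))  # approx 3.0
print(round(ma_forecast(y, 3), 2))  # 3.47

# Forecast error for 1963 (actual yield 3.10) under the lag-1 forecast
print(round(mad([3.10], [ma_forecast(y, 1)]), 2))  # approx 0.1
```

In practice one would compute MAD over the full forecast history for each lag, as in Table 37, and pick the lag with the smallest value.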
Note that we don t have early year forecasts and the longer the lag the longer we must wait until we get our rst forecast Here we compute moving averages for year1963 17Year F1963 y1962 QiYear F1963 21962 5211961 3228 3390 37Year F1963 y1962y1 61y1960 3223844 33947 Figure 26 displays raw data and moving average forecasts 63 Year y F1t 1t F2t 2t Fm 3t 1952 5130 i i 1953 4120 5 30 1i10 i i 1954 3190 4120 0130 4175 0185 i i 1955 5120 3190 1130 4105 1115 4147 0173 1956 5180 5120 0160 4155 1125 4143 1137 1957 6130 5180 0150 5150 0180 4197 1133 1958 5160 6130 0170 6105 0145 5177 0117 1959 4180 5160 0180 5 95 1i15 5 90 1i10 1960 4140 4180 0140 5120 0180 5 57 1i17 1961 2180 4 40 160 4 60 180 4 93 213 1962 3120 2180 0140 3160 0140 4100 0180 1963 3110 3120 0110 3100 0110 3147 0137 1964 3110 3110 0100 3115 0105 3103 0107 1965 2160 3110 0150 3110 0150 3113 0153 1966 2100 2160 0160 2185 0185 2193 0193 1967 1160 2100 0140 2130 0170 2157 0197 1968 1130 1160 0130 1180 0150 2107 0177 1969 1120 1130 0110 1145 0125 1163 0143 1970 1120 1120 0100 1125 0105 1137 0117 1971 1110 1120 0110 1120 0110 1123 0113 1972 0190 1110 0120 1115 0125 1117 0127 22 1973 1140 0190 0150 1100 0140 1107 0 33 23 1974 2100 1140 0160 1115 0185 1113 0 87 24 1975 1190 2100 0110 1170 0120 1143 0 47 25 1976 2130 1190 0140 1195 0135 1177 0 53 26 1977 3110 2130 0180 2110 1100 2107 1 03 27 1978 3150 3110 0140 2170 0180 2143 1 07 28 1979 3180 3150 0130 3130 0150 2197 0 83 29 1980 3170 3180 0110 3165 0105 3147 0 23 30 1981 3110 3170 0160 3175 0165 3167 0157 31 1982 2160 3110 0150 3140 0180 3153 0193 32 1983 2140 2160 0120 2185 0145 3113 0173 33 1984 3100 2140 0160 2150 0150 2170 0 30 34 1985 2140 3100 0160 2170 0130 2167 0127 35 1986 1180 2140 0160 2170 0190 2160 0180 36 1987 1170 1180 0110 2110 0140 2140 0170 37 1988 2120 1170 0150 1175 0145 1197 0 23 38 1989 2110 2120 0110 1195 0115 1190 0 20 39 1990 2140 2110 0130 2115 0125 2100 0 40 40 1991 2110 2140 0130 2125 0115 2123 0113 41 1992 2120 2110 0110 2125 0105 2120 0 
42   1993   2.70   2.20   0.50   2.15   0.55   2.23   0.47
43   1994   3.00   2.70   0.30   2.45   0.55   2.33   0.67
44   1995   2.80   3.00  -0.20   2.85  -0.05   2.63   0.17

Table 37: Dividend yields, forecasts, and errors for 1-, 2-, and 3-year moving averages.

Figure 26: Plot of the data and moving average forecasts for Anheuser-Busch dividend data.

7.5.2 Exponential Smoothing

Exponential smoothing is a method of forecasting that weights data from previous time periods with exponentially decreasing magnitudes. Forecasts can be written as follows, where the forecast for period 2 is traditionally (but not always) simply the outcome from period 1:

F_{t+1} = S_t = w*y_t + (1 - w)*S_{t-1} = w*y_t + (1 - w)*F_t

where:

- F_{t+1} is the forecast for period t+1
- y_t is the outcome at period t
- S_t is the smoothed value for period t (note that S_{t-1} = F_t)
- w is the smoothing constant (0 <= w <= 1)

Forecasts are smoother than the raw data, and the weights on previous observations decline exponentially with time.

Example 7.4 (Continued): We use 3 smoothing constants, allowing decreasing amounts of smoothing, for illustration:

- w = 0.2:  F_{t+1} = 0.2 y_t + 0.8 F_t
- w = 0.5:  F_{t+1} = 0.5 y_t + 0.5 F_t
- w = 0.8:  F_{t+1} = 0.8 y_t + 0.2 F_t

Year 2 = 1953: set F_1953 = y_1952, then cycle from there.

 t   Year    y     F(w=.2) e(w=.2) F(w=.5) e(w=.5) F(w=.8) e(w=.8)
 1   1952   5.30     --      --      --      --      --      --
 2   1953   4.20    5.30   -1.10    5.30   -1.10    5.30   -1.10
 3   1954   3.90    5.08   -1.18    4.75   -0.85    4.42   -0.52
 4   1955   5.20    4.84    0.36    4.33    0.88    4.00    1.20
 5   1956   5.80    4.92    0.88    4.76    1.04    4.96    0.84
 6   1957   6.30    5.09    1.21    5.28    1.02    5.63    0.67
 7   1958   5.60    5.33    0.27    5.79   -0.19    6.17   -0.57
 8   1959   4.80    5.39   -0.59    5.70   -0.90    5.71   -0.91
 9   1960   4.40    5.27   -0.87    5.25   -0.85    4.98   -0.58
10   1961   2.80    5.10   -2.30    4.82   -2.02    4.52   -1.72
11   1962   3.20    4.64   -1.44    3.81   -0.61    3.14    0.06
12   1963   3.10    4.35   -1.25    3.51   -0.41    3.19   -0.09
13   1964   3.10    4.10   -1.00    3.30   -0.20    3.12   -0.02
14   1965   2.60    3.90   -1.30    3.20   -0.60    3.10   -0.50
15   1966   2.00    3.64   -1.64    2.90   -0.90    2.70   -0.70
16   1967   1.60    3.31   -1.71    2.45   -0.85    2.14   -0.54
17   1968   1.30    2.97   -1.67    2.03   -0.73    1.71   -0.41
18   1969   1.20    2.64   -1.44    1.66   -0.46    1.38   -0.18
19   1970   1.20    2.35   -1.15    1.43   -0.23    1.24   -0.04
20   1971   1.10    2.12   -1.02    1.32   -0.22    1.21   -0.11
21   1972   0.90    1.91   -1.01    1.21   -0.31    1.12   -0.22
22   1973   1.40    1.71   -0.31    1.05    0.35    0.94    0.46
23   1974   2.00    1.65    0.35    1.23    0.77    1.31    0.69
24   1975   1.90    1.72    0.18    1.61    0.29    1.86    0.04
25   1976   2.30    1.76    0.54    1.76    0.54    1.89    0.41
26   1977   3.10    1.86    1.24    2.03    1.07    2.22    0.88
27   1978   3.50    2.11    1.39    2.56    0.94    2.92    0.58
28   1979   3.80    2.39    1.41    3.03    0.77    3.38    0.42
29   1980   3.70    2.67    1.03    3.42    0.28    3.72   -0.02
30   1981   3.10    2.88    0.22    3.56   -0.46    3.70   -0.60
31   1982   2.60    2.92   -0.32    3.33   -0.73    3.22   -0.62
32   1983   2.40    2.86   -0.46    2.96   -0.56    2.72   -0.32
33   1984   3.00    2.77    0.23    2.68    0.32    2.46    0.54
34   1985   2.40    2.81   -0.41    2.84   -0.44    2.89   -0.49
35   1986   1.80    2.73   -0.93    2.62   -0.82    2.50   -0.70
36   1987   1.70    2.54   -0.84    2.21   -0.51    1.94   -0.24
37   1988   2.20    2.38   -0.18    1.96    0.24    1.75    0.45
38   1989   2.10    2.34   -0.24    2.08    0.02    2.11   -0.01
39   1990   2.40    2.29    0.11    2.09    0.31    2.10    0.30
40   1991   2.10    2.31   -0.21    2.24   -0.14    2.34   -0.24
41   1992   2.20    2.27   -0.07    2.17    0.03    2.15    0.05
42   1993   2.70    2.26    0.44    2.19    0.51    2.19    0.51
43   1994   3.00    2.35    0.65    2.44    0.56    2.60    0.40
44   1995   2.80    2.48    0.32    2.72    0.08    2.92   -0.12

Table 38: Dividend yields, forecasts, and errors based on exponential smoothing with w = 0.2, 0.5, 0.8.

Table 38 gives average dividend yields for Anheuser-Busch for the years 1952-1995 (Source: Value Line), along with forecasts and errors based on exponential smoothing with w = 0.2, 0.5, and 0.8. Here we obtain forecasts based on exponential smoothing, beginning with year 2 (1953):

1953:  F_{w=.2,1953} = y_1952 = 5.30    F_{w=.5,1953} = y_1952 = 5.30    F_{w=.8,1953} = y_1952 = 5.30

1954 (w = 0.2):  F_{w=.2,1954} = .2 y_1953 + .8 F_{w=.2,1953} = .2(4.20) + .8(5.30) = 5.08
1954 (w = 0.5):  F_{w=.5,1954} = .5 y_1953 + .5 F_{w=.5,1953} = .5(4.20) + .5(5.30) = 4.75
1954 (w = 0.8):  F_{w=.8,1954} = .8 y_1953 + .2 F_{w=.8,1953} = .8(4.20) + .2(5.30) = 4.42

Which level of w appears to be discounting more distant observations at a quicker rate? What would happen if w = 1? If w = 0?

Figure 27 gives the raw data and exponential smoothing forecasts.

Figure 27: Plot of the data and exponential smoothing forecasts for Anheuser-Busch dividend data.

Table 39 gives measures of forecast errors for the three moving average and three exponential smoothing
methods.

           ------ Moving Average ------   -- Exponential Smoothing --
Measure    1-Period  2-Period  3-Period   w = 0.2  w = 0.5  w = 0.8
MAE          0.43      0.53      0.62       0.82     0.58     0.47
MSE          0.30      0.43      0.57       0.97     0.48     0.34

Table 39: Relative performances of 6 forecasting methods, Anheuser-Busch data.

Note that MSE = SSE/n, where n is the number of forecasts.

7.6 Forecasting With Seasonal Indexes

After the trend and seasonal indexes have been estimated, future outcomes can be forecast by the equation:

F_t = (b_0 + b_1 t) x SI_t

where b_0 + b_1 t is the linear trend and SI_t is the seasonal index for period t.

Example 7.2 (Continued): For the Texas FIRE gross sales data, we have b_0 = 1.6376, b_1 = 0.08753, SI_Q1 = 0.838, SI_Q2 = 0.901, SI_Q3 = 0.870, SI_Q4 = 1.390. Thus, for the 4 quarters of 2003 (t = 57, 58, 59, 60) we have:

Q1:  F_57 = (1.6376 + 0.08753(57))(0.838) = 5.553
Q2:  F_58 = (1.6376 + 0.08753(58))(0.901) = 6.050
Q3:  F_59 = (1.6376 + 0.08753(59))(0.870) = 5.918
Q4:  F_60 = (1.6376 + 0.08753(60))(1.390) = 9.576

7.6.1 Autoregression

Sometimes regression is run on past (lagged) values of the dependent variable (and possibly other variables). An autoregressive model based on lags of k periods can be written as follows:

y-hat_t = b_0 + b_1 y_{t-1} + b_2 y_{t-2} + ... + b_k y_{t-k}

Note that the regression cannot be run for the first k responses in the series. Technically, forecasts can be given only for periods after the regression has been fit, since the regression model depends on all periods used to fit it.

Example 7.4 (Continued): From computer software, autoregressions based on lags of 1, 2, and 3 periods are fit:

1-Period:  y-hat_t = 0.29 + 0.88 y_{t-1}
2-Period:  y-hat_t = 0.29 + 1.18 y_{t-1} - 0.29 y_{t-2}
3-Period:  y-hat_t = 0.28 + 1.21 y_{t-1} - 0.37 y_{t-2} + 0.05 y_{t-3}

Table 40 gives the raw data and forecasts based on the three autoregression models. Figure 28 displays the actual outcomes and predictions.

 t   Year    y    F_AR1  e_AR1  F_AR2  e_AR2  F_AR3  e_AR3
 1   1952   5.3     --     --     --     --     --     --
 2   1953   4.2   4.96  -0.76     --     --     --     --
 3   1954   3.9   3.99  -0.09   3.72   0.18     --     --
 4   1955   5.2   3.72   1.48   3.68   1.52   3.72   1.48
 5   1956   5.8   4.87   0.93   5.30   0.50   5.35   0.45
 6   1957   6.3   5.40   0.90   5.64   0.66   5.58   0.72
 7   1958   5.6   5.84  -0.24   6.06  -0.46   6.03  -0.43
 8   1959   4.8   5.22  -0.42   5.09  -0.29   5.03  -0.23
 9   1960   4.4   4.52  -0.12   4.34   0.06   4.35   0.05
10   1961   2.8   4.16  -1.36   4.10  -1.30   4.12  -1.32
11   1962   3.2   2.75   0.45   2.33   0.87   2.29   0.91
12   1963   3.1   3.11  -0.01   3.26  -0.16   3.35  -0.25
13   1964   3.1   3.02   0.08   3.03   0.07   3.00   0.10
14   1965   2.6   3.02  -0.42   3.06  -0.46   3.05  -0.45
15   1966   2.0   2.58  -0.58   2.47  -0.47   2.44  -0.44
16   1967   1.6   2.05  -0.45   1.90  -0.30   1.90  -0.30
17   1968   1.3   1.70  -0.40   1.60  -0.30   1.61  -0.31
18   1969   1.2   1.43  -0.23   1.36  -0.16   1.37  -0.17
19   1970   1.2   1.35  -0.15   1.33  -0.13   1.34  -0.14
20   1971   1.1   1.35  -0.25   1.36  -0.26   1.36  -0.26
21   1972   0.9   1.26  -0.36   1.24  -0.34   1.23  -0.33
22   1973   1.4   1.08   0.32   1.03   0.37   1.03   0.37
23   1974   2.0   1.52   0.48   1.68   0.32   1.70   0.30
24   1975   1.9   2.05  -0.15   2.25  -0.35   2.23  -0.33
25   1976   2.3   1.96   0.34   1.96   0.34   1.92   0.38
26   1977   3.1   2.31   0.79   2.46   0.64   2.47   0.63
27   1978   3.5   3.02   0.48   3.29   0.21   3.28   0.22
28   1979   3.8   3.37   0.43   3.53   0.27   3.49   0.31
29   1980   3.7   3.64   0.06   3.77  -0.07   3.75  -0.05
30   1981   3.1   3.55  -0.45   3.56  -0.46   3.54  -0.44
31   1982   2.6   3.02  -0.42   2.88  -0.28   2.86  -0.26
32   1983   2.4   2.58  -0.18   2.47  -0.07   2.47  -0.07
33   1984   3.0   2.40   0.60   2.37   0.63   2.39   0.61
34   1985   2.4   2.93  -0.53   3.14  -0.74   3.16  -0.76
35   1986   1.8   2.40  -0.60   2.26  -0.46   2.20  -0.40
36   1987   1.7   1.87  -0.17   1.72  -0.02   1.73  -0.03
37   1988   2.2   1.79   0.41   1.78   0.42   1.80   0.40
38   1989   2.1   2.23  -0.13   2.40  -0.30   2.41  -0.31
39   1990   2.4   2.14   0.26   2.13   0.27   2.10   0.30
40   1991   2.1   2.40  -0.30   2.52  -0.42   2.53  -0.43
41   1992   2.2   2.14   0.06   2.08   0.12   2.05   0.15
42   1993   2.7   2.23   0.47   2.28   0.42   2.29   0.41
43   1994   3.0   2.67   0.33   2.84   0.16   2.85   0.15
44   1995   2.8   2.93  -0.13   3.05  -0.25   3.03  -0.23

Table 40: Average dividend yields, forecasts, and errors based on autoregressions with lags of 1, 2, and 3 periods.

Figure 28: Plot of the data and autoregressive forecasts for Anheuser-Busch dividend data.

QMB 3250 Statistics for Business Decisions
Summer 2003
Dr. Larry Winner, University of Florida

Introduction (K&W Chapter 1)

This course applies and extends methods from STA 2023 to business applications. We begin with a series of definitions and descriptions.

Descriptive Statistics: Methods used to describe a set of measurements, typically either numerically and/or graphically. (Pages 2-3)

Inferential Statistics: Methods that use a sample of measurements to make statements regarding a larger set of measurements or a state of nature. (Pages 2-3)

Population: Set of all items, often referred to as units, of
interest to a researcher. This can be a large fixed population (e.g., all undergraduate students registered at UF in Fall 2003). It can also be a conceptual population (e.g., all potential consumers of a product during the product's shelf life). (Page 5)

Parameter: A numerical descriptive measure describing a population of measurements (e.g., the mean number of credit hours for all UF undergraduates in Fall 2003). (Page 5)

Sample: Set of items (units) drawn from a population. (Page 5)

Statistic: A numerical descriptive measure describing a sample. (Page 5)

Statistical Inference: Process of making a decision, estimate, and/or prediction regarding a population from sample data. Confidence levels refer to how often estimation procedures give correct statements when applied to different samples from the population. Significance levels refer to how often a decision rule will make incorrect conclusions when applied to different samples from the population. (Page 6)

Types of Variables (Reading: K&W Sections 2.2-2.5)

Measurement Types

We will classify variables as three types: nominal, ordinal, and interval.

Nominal variables are categorical with levels that have no inherent ordering. Assuming you have a car, its brand (make) would be nominal (e.g., Ford, Toyota, BMW). Also, we will treat binary variables as nominal (e.g., whether a subject given Olestra-based potato chips displayed gastrointestinal side effects). (Page 26)

Ordinal variables are categorical with levels that do have a distinct ordering; however, relative distances between adjacent levels may not be the same (e.g., film reviewers may rate movies on a 5-star scale; college athletic teams and company sales forces may be ranked by some criteria). (Page 27)

Interval variables are numeric variables that preserve distances between levels (e.g., company quarterly profits, with losses stated as negative profits; time for an accountant to complete a tax form). (Page 26)

Relationship Variable Types

Most often, statistical inference is focused on studying the relationship between (among) two or more variables.
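The population/parameter versus sample/statistic distinction above can be illustrated with a small simulation. The credit-hour values below are made up purely for illustration; this is a sketch, not data from the course:

```python
import random

random.seed(1)

# Hypothetical population: credit-hour loads for 10,000 undergraduates
# (simulated values -- for illustration only)
population = [random.choice([9, 12, 12, 13, 15, 15, 16, 18])
              for _ in range(10000)]

# Parameter: a numerical summary of the whole population (here, the mean)
mu = sum(population) / len(population)

# Sample: a subset of units drawn from the population
sample = random.sample(population, 100)

# Statistic: the same summary computed from the sample; statistical
# inference uses the statistic (x-bar) to estimate the parameter (mu)
xbar = sum(sample) / len(sample)

print("population mean (parameter):", round(mu, 2))
print("sample mean (statistic):    ", round(xbar, 2))
```

In practice the full population is never observed; only the sample and its statistic are available, which is what makes inference necessary.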
We will distinguish between dependent and independent variables.

Dependent variables are outcomes (also referred to as responses or endpoints) that are hypothesized to be related to the levels of other (input) variables. Dependent variables are typically labeled as Y. (Page 58)

Independent variables are inputs (also referred to as predictors or explanatory variables) that are hypothesized to cause, or be associated with, levels of the dependent variable. Independent variables are typically labeled as X when there is a single dependent variable. (Page 58)

Graphical Descriptive Methods (K&W Sections 2.3, 2.6, and Notes)

Single Variable (Univariate) Graphs

Interval Scale Outcomes

Histograms separate individual outcomes into bins of equal width, where the extreme bins may represent all individuals below or above a certain level. The bins are typically labeled by their midpoints. The heights of the bars over each bin may be either the frequency (number of individuals falling in that range) or the percent (fraction of all individuals falling in that range, multiplied by 100). Histograms are typically vertical. (Page 33)

Stem-and-Leaf Diagrams are simple depictions of a distribution of measurements, where the stems represent the first digit(s) and the leaves represent the last digits (or possibly decimals). The shape will look very much like a histogram turned on its side. Stem-and-leaf diagrams are typically horizontal. (Page 41)

Nominal/Ordinal/Interval Scale Outcomes

Pie Charts count individual outcomes by level of the variable being measured (or range of levels for interval scale variables) and represent the distribution of the variable such that the area of the pie for each level (or range) is proportional to the fraction of all measurements. (Page 48)

Bar Charts are similar to histograms, except that the bars do not need to physically touch. They are typically used to represent frequencies or percentages of nominal and ordinal outcomes.

Two Variable (Bivariate) Graphs

Scatter Diagrams are graphs where pairs of outcomes (X, Y) are plotted
against one another. These are typically interval scale variables. These graphs are useful in determining whether the variables are associated, possibly in a positive or negative manner. The vertical axis is typically the dependent variable and the horizontal axis is the independent variable (one major exception: demand curves in economics). (Page 58)

Sub-Type Bar Charts represent frequencies of nominal/ordinal dependent variables broken down by levels of a nominal/ordinal independent variable. (Page 63)

Three-Dimensional Bar Charts represent frequencies of outcomes where the two variables are placed on perpendicular axes and the heights represent the counts (numbers of individual observations) falling in each combination of categories. These are typically reserved for nominal/ordinal variables. (Page 63)

Time Series Plots are graphs of one or more variables versus time. The vertical axis represents the response, while the horizontal axis represents time (day, week, month, quarter, year, decade). These plots are also called line charts. (Page 69)

Data Maps are maps where geographical units (mutually exclusive and exhaustive regions such as states, counties, provinces) are shaded to represent levels of a variable. (Not in textbook)

Examples

Example: Time Lost to Congested Traffic

The following EXCEL spreadsheet contains the mean time lost annually in congested traffic (hours per person) for n = 39 US cities (Source: Texas Transportation Institute, 5/7/2001):

LOS ANGELES 56, NEW YORK 34, CHICAGO 34, SAN FRANCISCO 42, DETROIT 41, WASHINGTON DC 46, HOUSTON 50, ATLANTA 53, BOSTON 42, PHILADELPHIA 26, DALLAS 46, SEATTLE 53, SAN DIEGO 37, MINNEAPOLIS-ST PAUL 38, MIAMI 42, ST LOUIS 44, DENVER 45, PHOENIX 31, SAN JOSE 42, BALTIMORE 31, PORTLAND 34, ORLANDO 42, FORT LAUDERDALE 29, CINCINNATI 32, INDIANAPOLIS 37, CLEVELAND 20, KANSAS CITY 24, LOUISVILLE 37, TAMPA 35, COLUMBUS 29, SAN ANTONIO 24, AUSTIN 45, NASHVILLE 42, LAS VEGAS 21, JACKSONVILLE 30, PITTSBURGH 14, MEMPHIS 22, CHARLOTTE 32, NEW ORLEANS 18

A histogram of the times, using default numbers of bins and
upper endpoints, from EXCEL 97 (Pages 33-34):

[Figure: histogram of times lost (frequency vs. hours per person); chart not reproduced in these notes.]

A stem-and-leaf diagram of the times, using the Data Analysis Plus tool (Page 42):

Stem & Leaf Display
Stems  Leaves
  1  >  48
  2  >  01244699
  3  >  0112244457778
  4  >  122222245566
  5  >  0336

Textbook Example: AAA Quality Ratings of Hotels & Motels in FL

The following EXCEL 97 worksheet gives the AAA ratings (1-5 stars) and the frequency counts for Florida hotels (Source: AAA Tour Book, 1999 Edition):

Rating  Count
   1      108
   2      519
   3      744
   4       47
   5        5

A bar chart representing the distribution of ratings (Page 50) and a pie chart representing the distribution of ratings (Page 51). Note that the large majority of hotels get ratings of 2 or 3.

Example: Production Costs of a Hosiery Mill

The following EXCEL 97 worksheet gives, approximately, the quantity produced (Column 2) and total costs (Column 3) for n = 48 months of production for a hosiery mill (Source: Joel Dean (1941), "Statistical Cost Functions of a Hosiery Mill," Studies in Business Administration, Vol. 14, No. 3):

Month  Quantity  TotalCost
  1      46.75     92.64
  2      42.18     88.81
  3      41.86     86.44
  4      43.29     88.80
  5      42.12     86.38
  6      41.78     89.87
  7      41.47     88.53
  8      42.21     91.11
  9      41.03     81.22
 10      39.84     83.72
 11      39.15     84.54
 12      39.20     85.66
 13      39.52     85.87
 14      38.05     85.23
 15      39.16     87.75
 16      38.59     92.62
 17      36.54     91.56
 18      37.03     84.12
 19      36.60     81.22
 20      37.58     83.35
 21      36.48     82.29
 22      38.25     80.92
 23      37.26     76.92
 24      38.59     78.35
 25      40.89     74.57
 26      37.66     71.60
 27      38.79     65.64
 28      38.78     62.09
 29      36.70     61.66
 30      35.10     77.14
 31      33.75     75.47
 32      34.29     70.37
 33      32.26     66.71
 34      30.97     64.37
 35      28.20     56.09
 36      24.58     50.25
 37      20.25     43.65
 38      17.09     38.01
 39      14.35     31.40
 40      13.11     29.45
 41       9.50     29.02
 42       9.74     19.05
 43       9.34     20.36
 44       7.51     17.68
 45       8.35     19.23
 46       6.25     14.92
 47       5.45     11.44
 48       3.79     12.69

A scatter plot of total costs (Y) versus quantity produced (X) (Pages 59-60). Note the positive association between total cost and quantity produced.

Example: Tobacco Use Among US College Students

The following EXCEL 97 worksheet gives frequencies of college students by race (White non-Hispanic, Hispanic, Asian, and Black) and
current tobacco use (Yes, No) (Source: Rigotti, Lee, and Wechsler (2000), "US College Students' Use of Tobacco Products," JAMA, 284:699-705).

A cross-tabulation (AKA contingency table) classifying students by race and smoking status. The numbers in the table are the numbers of students falling in each category (Page 65):

            Smoke
Race       Yes    No
White     3807  6738
Hispanic   261   757
Asian      257   860
Black      125   663

A sub-type bar chart depicting counts of smokers/nonsmokers by race (Page 65). There is some evidence that a higher fraction of White students than Black students currently smoked at the time of the study: the relative height of the "Yes" bar to the "No" bar is higher for Whites than for Blacks.
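The comparison above can be checked numerically by converting the cross-tabulation to row percentages. A minimal sketch (the dictionary layout is ours; the counts are taken directly from the table):

```python
# Counts from the race-by-smoking cross-tabulation (Rigotti, Lee, Wechsler 2000)
counts = {
    "White":    {"Yes": 3807, "No": 6738},
    "Hispanic": {"Yes": 261,  "No": 757},
    "Asian":    {"Yes": 257,  "No": 860},
    "Black":    {"Yes": 125,  "No": 663},
}

# Percent of each race currently using tobacco (row percentages)
pct_smokers = {race: 100 * row["Yes"] / (row["Yes"] + row["No"])
               for race, row in counts.items()}

for race, pct in pct_smokers.items():
    print(f"{race:8s} {pct:5.1f}% current tobacco users")
```

The row percentages come out to roughly 36% for White students versus roughly 16% for Black students, consistent with the bar chart comparison.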

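As a numerical check on the moving average and exponential smoothing computations in the forecasting notes above (Tables 37 and 38), the two recursions can be sketched in Python. The function names are ours; the data are the first five years of the Anheuser-Busch dividend-yield series:

```python
def moving_average_forecast(y, k):
    # k-period moving average: F_t is the mean of the k most recent outcomes;
    # no forecast exists for the first k periods
    return [None] * k + [sum(y[t - k:t]) / k for t in range(k, len(y))]

def exp_smoothing_forecast(y, w):
    # F_2 = y_1, then F_{t+1} = w*y_t + (1 - w)*F_t
    F = [None, y[0]]
    for t in range(1, len(y) - 1):
        F.append(w * y[t] + (1 - w) * F[t])
    return F

# Dividend yields, 1952-1956 (from Table 37)
y = [5.30, 4.20, 3.90, 5.20, 5.80]

f3 = moving_average_forecast(y, 3)   # 3-year moving average
f8 = exp_smoothing_forecast(y, 0.8)  # exponential smoothing, w = 0.8

print([None if f is None else round(f, 2) for f in f3])
print([None if f is None else round(f, 2) for f in f8])
```

Rounded to two decimals, the 3-year moving averages for 1955-1956 (4.47, 4.43) and the w = 0.8 smoothed forecasts for 1954 and 1956 (4.42, 4.96) match the table entries.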