INTRO TO ECONOMETRICS (ECO 4421)
These 139 pages of class notes were uploaded by Elmore Funk on Thursday, September 17, 2015. The notes are for ECO 4421 at Florida State University, taught by Anastasia Semykina in Fall. Since its upload, the material has received 7 views. For similar materials, see /class/205447/eco-4421-florida-state-university in Economics at Florida State University.
Introductory Econometrics: A Modern Approach, 4e
Jeffrey M. Wooldridge, Michigan State University

SOUTH-WESTERN, CENGAGE Learning
Australia, Brazil, Japan, Korea, Mexico, Singapore, Spain, United Kingdom, United States

Vice President of Editorial, Business: Jack W. Calhoun
Executive Editor: Mike Worls
Sr. Developmental Editor: Laura Bofinger
Sr. Content Project Manager: Martha Conway
Marketing Specialist: Betty Jung
Marketing Communications Manager: Sarah Greber
Media Editor: Deepak Kumar
Sr. Manufacturing Coordinator: Sandee Milewski
Production Service: Macmillan Publishing Solutions
Sr. Art Director: Michelle Kunkler
Internal Designer: c miller design
Cover Designer: c miller design
Cover Image: John Foxx/Getty Images, Inc.

Printed in the United States of America
1 2 3 4 5 6 7  12 11 10 09 08

© 2009 South-Western, a part of Cengage Learning

ALL RIGHTS RESERVED. No part of this work covered by the copyright herein may be reproduced or used in any form or by any means (graphic, electronic, or mechanical, including photocopying, recording, taping, Web distribution, information storage and retrieval systems, or in any other manner) except as may be permitted by the license terms herein.

For product information and technology assistance, contact us at Cengage Learning Customer & Sales Support, 1-800-354-9706.

For permission to use material from this text or product, submit all requests online at cengage.com/permissions. Further permissions questions can be emailed to permissionrequest@cengage.com.

Library of Congress Control Number: 2008921832

Student Edition package ISBN-13: 978-0-324-58162-1
Student Edition package ISBN-10: 0-324-58162-9
Student Edition ISBN-13: 978-0-324-66054-8
Student Edition ISBN-10: 0-324-66054-5

South-Western Cengage Learning
5191 Natorp Boulevard
Mason, OH 45040
USA

Cengage Learning products are represented in Canada by Nelson Education, Ltd.

For your course and learning solutions, visit academic.cengage.com. Purchase any of our products at your local college store or at our
preferred online store, www.ichapters.com.

Brief Contents

Chapter 1  The Nature of Econometrics and Economic Data  1

PART 1: Regression Analysis with Cross-Sectional Data  21
Chapter 2  The Simple Regression Model  22
Chapter 3  Multiple Regression Analysis: Estimation  68
Chapter 4  Multiple Regression Analysis: Inference  117
Chapter 5  Multiple Regression Analysis: OLS Asymptotics  167
Chapter 6  Multiple Regression Analysis: Further Issues  184
Chapter 7  Multiple Regression Analysis with Qualitative Information: Binary (or Dummy) Variables  225
Chapter 8  Heteroskedasticity  264
Chapter 9  More on Specification and Data Issues  300

PART 2: Regression Analysis with Time Series Data  339
Chapter 10  Basic Regression Analysis with Time Series Data  340
Chapter 11  Further Issues in Using OLS with Time Series Data  377
Chapter 12  Serial Correlation and Heteroskedasticity in Time Series Regressions  408

PART 3: Advanced Topics  443
Chapter 13  Pooling Cross Sections across Time: Simple Panel Data Methods  444
Chapter 14  Advanced Panel Data Methods  481
Chapter 15  Instrumental Variables Estimation and Two Stage Least Squares  506
Chapter 16  Simultaneous Equations Models  546
Chapter 17  Limited Dependent Variable Models and Sample Selection Corrections  574
Chapter 18  Advanced Time Series Topics  623
Chapter 19  Carrying Out an Empirical Project  668

APPENDICES
Appendix A  Basic Mathematical Tools  695
Appendix B  Fundamentals of Probability  714
Appendix C  Fundamentals of Mathematical Statistics  747
Appendix D  Summary of Matrix Algebra  788
Appendix E  The Linear Regression Model in Matrix Form  799
Appendix F  Answers to Chapter Questions  813
Appendix G  Statistical Tables  823
References  830
Glossary  835
Index  849

Contents

CHAPTER 1  The Nature of Econometrics and Economic Data  1
1.1 What Is Econometrics?  1
1.2 Steps in Empirical Economic Analysis  2
1.3 The Structure of Economic Data  5
  Cross-Sectional Data  5
  Time Series Data
  Pooled Cross Sections  9
  Panel or Longitudinal Data  10
1.4 Causality and the Notion of Ceteris Paribus in Econometric Analysis  12
Summary / Key Terms / Problems / Computer Exercises

PART 1: Regression Analysis with Cross-Sectional Data  21

CHAPTER 2  The Simple Regression Model  22
2.1 Definition of the Simple Regression Model  22
2.2 Deriving the Ordinary Least Squares Estimates  27
  A Note on Terminology  35
2.3 Properties of OLS on Any Sample of Data  36
  Fitted Values and Residuals
  Algebraic Properties of OLS Statistics  37
  Goodness-of-Fit  40
2.4 Units of Measurement and Functional Form  41
  The Effects of Changing Units of Measurement on OLS Statistics  41
  Incorporating Nonlinearities in Simple Regression
  The Meaning of "Linear" Regression  45
2.5 Expected Values and Variances of the OLS Estimators  46
2.6 Regression through the Origin  53
Summary  59 / Key Terms  60 / Problems  61 / Computer Exercises / Appendix 2A  66

CHAPTER 3  Multiple Regression Analysis: Estimation  68
3.1 Motivation for Multiple Regression  68
  The Model with Two Independent Variables
  The Model with k Independent Variables  71
3.2 Mechanics and Interpretation of Ordinary Least Squares  73
  Obtaining the OLS Estimates
  Interpreting the OLS Regression Equation  74
  On the Meaning of "Holding Other Factors Fixed" in Multiple Regression
  Changing More Than One Independent Variable Simultaneously  77
  Fitted Values and Residuals  77
  A "Partialling Out" Interpretation of Multiple Regression
  Comparison of Simple and Multiple Regression Estimates  79
  Goodness-of-Fit
  Regression through the Origin
3.3 The Expected Value of the OLS Estimators  84
  Including Irrelevant Variables in a Regression Model  88
  Omitted Variable Bias: The Simple Case  89
  Omitted Variable Bias: More General Cases  93
3.4 The Variance of the OLS Estimators  94
  The Components of the OLS Variances: Multicollinearity  95
  Variances in Misspecified Models  99
  Estimating sigma^2: Standard Errors of the OLS Estimators  101
3.5 Efficiency of OLS: The Gauss-Markov Theorem  102
Summary / Key Terms / Problems / Computer Exercises / Appendix 3A  113

CHAPTER 4  Multiple Regression Analysis: Inference  117
4.1 Sampling Distributions of the OLS Estimators  117
4.2 Testing Hypotheses about a Single Population Parameter: The t Test  120
  Testing against One-Sided Alternatives  123
  Two-Sided Alternatives  128
  Testing Other Hypotheses about beta_j  130
  Computing p-Values for t Tests  133
  A Reminder on the Language of Classical Hypothesis Testing  135
  Economic, or Practical, versus Statistical Significance  135
4.3 Confidence Intervals  138
4.4 Testing Hypotheses about a Single Linear Combination of the Parameters  140
4.5 Testing Multiple Linear Restrictions: The F Test  143
  Testing Exclusion Restrictions  143
  Relationship between F and t Statistics  149
  The R-Squared Form of the F Statistic  150
  Computing p-Values for F Tests
  The F Statistic for Overall Significance of a Regression  152
  Testing General Linear Restrictions  153
4.6 Reporting Regression Results  154
Summary / Key Terms  158 / Problems  159 / Computer Exercises  163

CHAPTER 5  Multiple Regression Analysis: OLS Asymptotics  167
5.1 Consistency  167
  Deriving the Inconsistency in OLS  170
5.2 Asymptotic Normality and Large Sample Inference
  Other Large Sample Tests: The Lagrange Multiplier Statistic  176
5.3 Asymptotic Efficiency of OLS  179
Summary / Key Terms  181 / Problems / Computer Exercises  181 / Appendix 5A  182

CHAPTER 6  Multiple Regression Analysis: Further Issues  184
6.1 Effects of Data Scaling on OLS Statistics  184
  Beta Coefficients  187
6.2 More on Functional Form  189
  More on Using Logarithmic Functional Forms  189
  Models with Quadratics  192
  Models with Interaction Terms  197
6.3 More on Goodness-of-Fit and Selection of Regressors  199
  Adjusted R-Squared  200
  Using Adjusted R-Squared to Choose between Nonnested Models  201
  Controlling for Too Many Factors in Regression Analysis  203
  Adding Regressors to Reduce the Error Variance  205
6.4 Prediction and Residual Analysis  206
  Confidence Intervals for Predictions  206
  Residual Analysis  209
  Predicting y When log(y) Is the Dependent Variable
Summary / Key Terms / Problems / Computer Exercises  218 / Appendix 6A  223

CHAPTER 7  Multiple Regression Analysis with Qualitative Information: Binary (or Dummy) Variables  225
7.1 Describing Qualitative Information  225
7.2 A Single Dummy Independent Variable  226
  Interpreting Coefficients on Dummy Explanatory Variables When the Dependent Variable Is log(y)  231
7.3 Using Dummy Variables for Multiple Categories  233
  Incorporating Ordinal Information by Using Dummy Variables
7.4 Interactions Involving Dummy Variables  238
  Interactions among Dummy Variables  238
  Allowing for Different Slopes  239
  Testing for Differences in Regression Functions across Groups
7.5 A Binary Dependent Variable: The Linear Probability Model  246
7.6 More on Policy Analysis and Program Evaluation  251
Summary  254 / Key Terms / Problems / Computer Exercises  258

CHAPTER 8  Heteroskedasticity  264
8.1 Consequences of Heteroskedasticity for OLS  264
8.2 Heteroskedasticity-Robust Inference after OLS Estimation  265
  Computing Heteroskedasticity-Robust LM Tests
8.3 Testing for Heteroskedasticity  271
  The White Test for Heteroskedasticity  274
8.4 Weighted Least Squares Estimation  276
  The Heteroskedasticity Is Known up to a Multiplicative Constant  277
  The Heteroskedasticity Function Must Be Estimated: Feasible GLS  282
  What If the Assumed Heteroskedasticity Function Is Wrong?
  Prediction and Prediction Intervals with Heteroskedasticity  289
8.5 The Linear Probability Model Revisited  290
Summary / Key Terms / Problems / Computer Exercises  296

CHAPTER 9  More on Specification and Data Issues  300
9.1 Functional Form Misspecification  300
  RESET as a General Test for Functional Form Misspecification  303
  Tests against Nonnested Alternatives  305
9.2 Using Proxy Variables for Unobserved Explanatory Variables  306
  Using Lagged Dependent Variables as Proxy Variables  310
  A Different Slant on Multiple Regression
9.3 Models with Random Slopes  313
9.4 Properties of OLS under Measurement Error  315
  Measurement Error in the Dependent Variable
  Measurement Error in an Explanatory Variable
9.5 Missing Data, Nonrandom Samples, and Outlying Observations  322
  Missing Data  322
  Nonrandom Samples  323
  Outliers and Influential Observations  325
9.6 Least Absolute Deviations Estimation  330
Summary  331 / Key Terms  332 / Problems  332 / Computer Exercises  334

PART 2: Regression Analysis with Time Series Data  339
CHAPTER 10  Basic Regression Analysis with Time Series Data  340
10.1 The Nature of Time Series Data  340
10.2 Examples of Time Series Regression Models  342
  Static Models  342
  Finite Distributed Lag Models  342
  A Convention about the Time Index  345
10.3 Finite Sample Properties of OLS under Classical Assumptions  345
  Unbiasedness of OLS  345
  The Variances of the OLS Estimators and the Gauss-Markov Theorem  349
  Inference under the Classical Linear Model Assumptions
10.4 Functional Form, Dummy Variables, and Index Numbers  353
10.5 Trends and Seasonality  360
  Characterizing Trending Time Series  360
  Using Trending Variables in Regression Analysis  363
  A Detrending Interpretation of Regressions with a Time Trend
  Computing R-Squared When the Dependent Variable Is Trending  366
  Seasonality
Summary / Key Terms  371 / Problems  371 / Computer Exercises  373

CHAPTER 11  Further Issues in Using OLS with Time Series Data  377
11.1 Stationary and Weakly Dependent Time Series  377
  Stationary and Nonstationary Time Series  378
  Weakly Dependent Time Series  379
11.2 Asymptotic Properties of OLS  381
11.3 Using Highly Persistent Time Series in Regression Analysis  388
  Highly Persistent Time Series
  Transformations on Highly Persistent Time Series  393
  Deciding Whether a Time Series Is I(1)  394
11.4 Dynamically Complete Models and the Absence of Serial Correlation
11.5 The Homoskedasticity Assumption for Time Series Models  399
Summary / Key Terms  401 / Problems  401 / Computer Exercises  404

CHAPTER 12  Serial Correlation and Heteroskedasticity in Time Series Regressions  408
12.1 Properties of OLS with Serially Correlated Errors  408
  Unbiasedness and Consistency  408
  Efficiency and Inference  409
  Goodness-of-Fit
  Serial Correlation in the Presence of Lagged Dependent Variables
12.2 Testing for Serial Correlation  412
  A t Test for AR(1) Serial Correlation with Strictly Exogenous Regressors  412
  The Durbin-Watson Test under Classical Assumptions  415
  Testing for AR(1) Serial Correlation without Strictly Exogenous Regressors  416
  Testing for Higher Order Serial Correlation  417
12.3 Correcting for Serial Correlation with Strictly Exogenous Regressors  419
  Obtaining the Best Linear Unbiased Estimator in the AR(1) Model  419
  Feasible GLS Estimation with AR(1) Errors  421
  Comparing OLS and FGLS  423
  Correcting for Higher Order Serial Correlation  425
12.4 Differencing and Serial Correlation  426
12.5 Serial Correlation-Robust Inference after OLS
12.6 Heteroskedasticity in Time Series Regressions  432
  Heteroskedasticity-Robust Statistics  432
  Heteroskedasticity and Serial Correlation in Regression Models
Summary / Key Terms / Problems / Computer Exercises  438

PART 3: Advanced Topics  443

CHAPTER 13  Pooling Cross Sections across Time: Simple Panel Data Methods  444
13.1 Pooling Independent Cross Sections across Time
  The Chow Test for Structural Change across Time
13.2 Policy Analysis with Pooled Cross Sections  450
13.3 Two-Period Panel Data Analysis  455
  Organizing Panel Data  461
13.4 Policy Analysis with Two-Period Panel Data
13.5 Differencing with More Than Two Time Periods
  Potential Pitfalls in First Differencing Panel Data
Summary  471 / Key Terms  471 / Problems  471 / Computer Exercises  473 / Appendix 13A  478

CHAPTER 14  Advanced Panel Data Methods  481
14.1 Fixed Effects Estimation  481
  The Dummy Variable Regression  485
  Fixed Effects or First Differencing?  487
  Fixed Effects with Unbalanced Panels  488
14.2 Random Effects Models  489
  Random Effects or Fixed Effects?  493
14.3 Applying Panel Data Methods to Other Data Structures  494
Summary / Key Terms / Problems  497 / Computer Exercises  498 / Appendix 14A  503

CHAPTER 15  Instrumental Variables Estimation and Two Stage Least Squares  506
15.1 Motivation: Omitted Variables in a Simple Regression Model  507
  Statistical Inference with the IV Estimator
  Properties of IV with a Poor Instrumental Variable  514
  Computing R-Squared after IV Estimation
15.2 IV Estimation of the Multiple Regression Model  517
15.3 Two Stage Least Squares  521
  A Single Endogenous Explanatory Variable
  Multicollinearity and 2SLS  523
  Multiple Endogenous Explanatory Variables  524
  Testing Multiple Hypotheses after 2SLS Estimation
15.4 IV Solutions to Errors-in-Variables Problems
15.5 Testing for Endogeneity and Testing Overidentifying Restrictions  527
  Testing for Endogeneity  527
  Testing Overidentification Restrictions  529
15.6 2SLS with Heteroskedasticity  531
15.7 Applying 2SLS to Time Series Equations  531
15.8 Applying 2SLS to Pooled Cross Sections and Panel Data  534
Summary  536 / Key Terms  536 / Problems  536 / Computer Exercises  539 / Appendix 15A  543

CHAPTER 16  Simultaneous Equations Models  546
16.1 The Nature of Simultaneous Equations Models
16.2 Simultaneity Bias in OLS  550
16.3 Identifying and Estimating a Structural Equation  552
  Identification in a Two-Equation System  552
  Estimation by 2SLS  557
16.4 Systems with More Than Two Equations  559
  Identification in Systems with Three or More Equations  559
16.5 Simultaneous Equations Models with Time Series
16.6 Simultaneous Equations Models with Panel Data  564
Summary  566 / Key Terms  567 / Problems  567 / Computer Exercises  570

CHAPTER 17  Limited Dependent Variable Models and Sample Selection Corrections  574
17.1 Logit and Probit Models for Binary Response
  Specifying Logit and Probit Models  575
  Maximum Likelihood Estimation of Logit and Probit Models  578
  Testing Multiple Hypotheses  579
  Interpreting the Logit and Probit Estimates
17.2 The Tobit Model for Corner Solution Responses  587
  Interpreting the Tobit Estimates  589
  Specification Issues in Tobit Models  594
17.3 The Poisson Regression Model  595
17.4 Censored and Truncated Regression Models  600
  Censored Regression Models  601
  Truncated Regression Models  604
17.5 Sample Selection Corrections  606
  When Is OLS on the Selected Sample Consistent?  607
  Incidental Truncation  608
Summary / Key Terms / Problems / Computer Exercises / Appendix 17B  621

CHAPTER 18  Advanced Time Series Topics  623
18.1 Infinite Distributed Lag Models  624
  The Geometric (or Koyck) Distributed Lag  626
  Rational Distributed Lag Models  628
18.2 Testing for Unit Roots  630
18.3 Spurious Regression  636
18.4 Cointegration and Error Correction Models  637
  Cointegration  637
  Error Correction Models  643
18.5 Forecasting  645
  Types of Regression Models Used for Forecasting
  One-Step-Ahead Forecasting  647
  Comparing One-Step-Ahead Forecasts  651
  Multiple-Step-Ahead Forecasts  652
  Forecasting Trending, Seasonal, and Integrated Processes  655
Summary  660 / Key Terms  661 / Problems  661 / Computer Exercises  663

CHAPTER 19  Carrying Out an Empirical Project  668
19.1 Posing a Question  668
19.2 Literature Review  670
19.3 Data Collection  671
  Deciding on the Appropriate Data Set  671
  Entering and Storing Your Data  672
  Inspecting, Cleaning, and Summarizing Your Data
19.4 Econometric Analysis  675
19.5 Writing an Empirical Paper  678
  Introduction  678
  Conceptual (or Theoretical) Framework  679
  Econometric Models and Estimation Methods  679
  Conclusions  683
  Style Hints  684
Summary / Key Terms
Sample Empirical Projects  687
List of Journals  692
Data Sources  693

APPENDIX A  Basic Mathematical Tools  695
A.1 The Summation Operator and Descriptive Statistics  695
A.2 Properties of Linear Functions  697
A.3 Proportions and Percentages  699
A.4 Some Special Functions and Their Properties  702
  Quadratic Functions  702
  The Natural Logarithm  704
  The Exponential Function  708
A.5 Differential Calculus  709
Summary  711 / Key Terms  711 / Problems  711

APPENDIX B  Fundamentals of Probability  714
B.1 Random Variables and Their Probability Distributions  714
  Discrete Random Variables  715
  Continuous Random Variables  717
B.2 Joint Distributions, Conditional Distributions, and Independence  719
  Joint Distributions and Independence  719
  Conditional Distributions
B.3 Features of Probability Distributions  722
  A Measure of Central Tendency: The Expected Value
  Properties of Expected Values  724
  Another Measure of Central Tendency: The Median  725
  Measures of Variability: Variance and Standard Deviation  726
  Variance
  Standard Deviation  728
  Standardizing a Random Variable  728
  Skewness and Kurtosis  729
B.4 Features of Joint and Conditional Distributions  729
  Measures of Association: Covariance and Correlation
  Covariance  729
  Correlation Coefficient
  Variance of Sums of Random Variables  732
  Conditional Expectation  733
  Properties of Conditional Expectation  734
  Conditional Variance  736
B.5 The Normal and Related Distributions  737
  The Normal Distribution  737
  The Standard Normal Distribution  738
  Additional Properties of the Normal Distribution
  The Chi-Square Distribution
  The t Distribution  741
  The F Distribution  743
Summary / Key Terms  744 / Problems  745

APPENDIX C  Fundamentals of Mathematical Statistics  747
C.1 Populations, Parameters, and Random Sampling  747
  Sampling  748
C.2 Finite Sample Properties of Estimators  748
  Estimators and Estimates  749
  Unbiasedness  750
  The Sampling Variance of Estimators  752
  Efficiency  754
C.3 Asymptotic or Large Sample Properties of Estimators  755
  Consistency  755
  Asymptotic Normality  758
C.4 General Approaches to Parameter Estimation  760
  Method of Moments  760
  Maximum Likelihood  761
  Least Squares  762
C.5 Interval Estimation and Confidence Intervals  762
  The Nature of Interval Estimation  762
  Confidence Intervals for the Mean from a Normally Distributed Population
  A Simple Rule of Thumb for a 95% Confidence Interval  768
  Asymptotic Confidence Intervals for Nonnormal Populations  768
C.6 Hypothesis Testing  770
  Fundamentals of Hypothesis Testing  770
  Testing Hypotheses about the Mean in a Normal Population  772
  Asymptotic Tests for Nonnormal Populations  774
  Computing and Using p-Values  776
  The Relationship between Confidence Intervals and Hypothesis Testing  779
  Practical versus Statistical Significance  780
C.7 Remarks on Notation
Summary / Key Terms / Problems

APPENDIX D  Summary of Matrix Algebra  788
D.1 Basic Definitions  788
D.2 Matrix Operations  789
  Matrix Addition  789
  Scalar Multiplication  790
  Matrix Multiplication  790
  Transpose
  Partitioned Matrix Multiplication  792
  Trace  792
  Inverse  792
D.3 Linear Independence and Rank of a Matrix
D.4 Quadratic Forms and Positive Definite Matrices  793
D.5 Idempotent Matrices  794
D.6 Differentiation of Linear and Quadratic Forms
D.7 Moments and Distributions of Random Vectors  795
  Expected Value  795
  Variance-Covariance Matrix  795
  Multivariate Normal Distribution  796
  Chi-Square Distribution  797
  t Distribution  797
  F Distribution  797
Summary  797 / Key Terms  797 / Problems  798

APPENDIX E  The Linear Regression Model in Matrix Form  799
E.1 The Model and Ordinary Least Squares Estimation  799
E.2 Finite Sample Properties of OLS  801
E.3 Statistical Inference
E.4 Some Asymptotic Analysis  807
  Wald Statistics for Testing Multiple Hypotheses  809
Summary / Key Terms / Problems  811

APPENDIX F  Answers to Chapter Questions  813
APPENDIX G  Statistical Tables  823
References  830
Glossary  835
Index  849

Preface

My motivation for writing the first edition of Introductory Econometrics: A Modern Approach was that I saw a fairly wide gap between how econometrics is taught to undergraduates and how empirical researchers think about and apply econometric methods. I became convinced that teaching introductory econometrics from the perspective of professional users of econometrics would actually simplify the presentation, in addition to making the subject much more interesting.

Based on the positive reactions to earlier editions, it appears that my hunch was correct. Instructors teaching a variety of audiences, with students at different levels of preparation, have embraced the approach espoused in this text.

The emphasis in this edition is still on applying econometrics to real-world problems. Each econometric method is motivated by a particular issue facing researchers analyzing nonexperimental data. The focus in the main text is on understanding and interpreting the assumptions in light of actual empirical applications: the mathematics required is no more than college algebra and basic probability and statistics.

Organized for Today's Econometrics Instructor

The fourth edition preserves the overall organization of the third. The most noticeable feature that distinguishes this text from most others is the separation of topics by the kind of data being analyzed. This is a departure from the traditional approach, which presents a linear model, lists all assumptions that may be needed at some future point in the analysis, and then proves or asserts results without clearly connecting them to the assumptions. My approach is first to treat, in Part 1, multiple regression analysis with cross-sectional data,
under the assumption of random sampling. This setting is natural to students because they are familiar with random sampling from an introductory statistics course. Importantly, it allows us to distinguish between assumptions made about the underlying population regression model (assumptions that can be given economic or behavioral content) and assumptions about how the data were sampled. Under random sampling, the explanatory variables, along with the dependent variable, are treated as outcomes of random variables. For the social sciences, this is more realistic than the traditional assumption of nonrandom explanatory variables. As a nontrivial benefit, the population model/random sampling approach reduces the number of assumptions that students must absorb and understand. Ironically, the classical approach to regression analysis, which treats the explanatory variables as fixed in repeated samples and is still pervasive in introductory texts, literally applies to data collected in an experimental setting. In addition, the contortions required to state and explain assumptions can be confusing to students.

My focus on the population model emphasizes that the fundamental assumptions underlying regression analysis, such as the zero mean assumption on the unobservables, are properly stated conditional on the explanatory variables. This leads to a clear understanding of the kinds of problems, such as heteroskedasticity (nonconstant variance), that can invalidate standard inference procedures. Also, I am able to dispel several misconceptions that arise in econometrics texts at all levels. For example, I explain why the usual R-squared is still valid as a goodness-of-fit measure in the presence of heteroskedasticity (Chapter 8) or serially correlated errors (Chapter 12); I demonstrate that tests for functional form should not be viewed as general tests of omitted variables (Chapter 9); and I explain why one should always include in a regression model extra control variables that are uncorrelated with the explanatory variable of interest, often the key policy variable (Chapter 6).

Because the assumptions for cross-sectional analysis are relatively straightforward yet realistic, students can get involved
early with serious cross-sectional applications without having to worry about the thorny issues of trends, seasonality, serial correlation, high persistence, and spurious regression that are ubiquitous in time series regression models.

Initially, I figured that my treatment of regression with cross-sectional data followed by regression with time series data would find favor with instructors whose own research interests are in applied microeconomics, and that appears to be the case. It has been gratifying that adopters of the text with an applied time series bent have been equally enthusiastic about the structure of the text. By postponing the econometric analysis of time series data, I am able to put proper focus on the potential pitfalls in analyzing time series data that do not arise with cross-sectional data. In effect, time series econometrics finally gets the serious treatment it deserves in an introductory text.

As in the earlier editions, I have consciously chosen topics that are important for reading journal articles and for conducting basic empirical research. Within each topic, I have deliberately omitted many tests and estimation procedures that, while traditionally included in textbooks, have not withstood the empirical test of time. Likewise, I have emphasized more recent topics that have clearly demonstrated their usefulness, such as obtaining test statistics that are robust to heteroskedasticity or serial correlation of unknown form, using multiple years of data for policy analysis, or solving the omitted variable problem by instrumental variables methods. I appear to have made sound choices, as I have received only a handful of suggestions for adding or deleting material.

I take a systematic approach throughout the text, by which I mean that each topic is presented by building on the previous material in a logical fashion, and assumptions are introduced only as they are needed to obtain a conclusion. For example, professional users of econometrics understand that not all of the
Gauss-Markov assumptions are needed to show that the ordinary least squares (OLS) estimators are unbiased. Yet the vast majority of econometrics texts introduce a complete set of assumptions (many of which are redundant or, in some cases, even logically conflicting) before proving the unbiasedness of OLS. Similarly, the normality assumption is often included among the assumptions that are needed for the Gauss-Markov Theorem, even though it is fairly well known that normality plays no role in showing that the OLS estimators are the best linear unbiased estimators.

My systematic approach is illustrated by the order of assumptions that I use for multiple regression in Part 1. This ordering results in a natural progression for briefly summarizing the role of each assumption:

MLR.1: Introduce the population model and interpret the population parameters (which we hope to estimate).

MLR.2: Introduce random sampling from the population and describe the data that we use to estimate the population parameters.

MLR.3: Add the assumption on the explanatory variables that allows us to compute the estimates from our sample; this is the so-called no perfect collinearity assumption.

MLR.4: Assume that, in the population, the mean of the unobservable error does not depend on the values of the explanatory variables; this is the "mean independence" assumption combined with a zero population mean for the error, and it is the key assumption that delivers unbiasedness of OLS.

After introducing Assumptions MLR.1 to MLR.3, one can discuss the algebraic properties of ordinary least squares, that is, the properties of OLS for a particular set of data. By adding Assumption MLR.4, we can show that OLS is unbiased (and consistent). Assumption MLR.5 (homoskedasticity) is added for the Gauss-Markov Theorem and for the usual OLS variance formulas to be valid, and MLR.6 (normality) is added to round out the classical linear model assumptions for exact statistical inference.

I use parallel approaches when I turn to the study of
large-sample properties and when I treat regression for time series data in Part 2. The careful presentation and discussion of assumptions makes it relatively easy to cover more advanced topics, such as using pooled cross sections, exploiting panel data structures, and applying instrumental variables methods. Generally, I have strived to provide a unified view of econometrics, where all estimators and test statistics are obtained using just a few intuitively reasonable principles of estimation and testing (which, of course, also have rigorous justification). For example, regression-based tests for heteroskedasticity and serial correlation are easy for students to grasp because they already have a solid understanding of regression. This is in contrast to treatments that give a set of disjointed recipes for outdated econometric testing procedures.

Throughout the text, I emphasize ceteris paribus relationships, which is why, after one chapter on the simple regression model, I move to multiple regression analysis. The multiple regression setting motivates students to think about serious applications early. I also give prominence to policy analysis with all kinds of data structures. Practical topics, such as using proxy variables to obtain ceteris paribus effects and interpreting partial effects in models with interaction terms, are covered in a simple fashion.

New to This Edition

Specific changes to this edition include a discussion of variance inflation factors in Chapter 3. Until now, I have resisted including a formal discussion of the diagnostics available for detecting multicollinearity. In this edition, with some reservations, I provide a brief discussion. My view from earlier editions (that multicollinearity is still a poorly understood issue, and that claims that one can detect and correct for multicollinearity are wrongheaded) has not changed. But I find myself having to repeatedly explain the use and limits of statistics such as variance inflation factors, and so I have decided to confront the
issue head-on.

In Chapter 6, I add a discussion of the so-called smearing estimate for retransformation after estimating a linear model where the dependent variable is in logarithmic form. The smearing approach is widely used and simple to implement; it was an oversight of mine not to include it in previous editions. On a related matter, I have also added material on obtaining a 95% prediction interval after retransforming a model that satisfies the classical linear model assumptions.

In Chapter 8, I changed Example 8.6 to one that uses a more modern, much larger data set on financial wealth, income, and participation in 401(k) pension plans. This example, in conjunction with a new subsection on weighted least squares with a misspecified variance function, provides a nice illustration of how weighted least squares can be significantly more efficient than ordinary least squares even if we allow the variance function to be misspecified. Another new subsection in Chapter 8 discusses the problem of prediction after retransformation in a model with a logarithmic dependent variable and heteroskedasticity in the original linear model.

Chapter 9 contains several new items. First, I provide a brief discussion of models with random slopes, as an introduction to the notion that marginal effects can depend on unobserved individual heterogeneity. In the discussion of outliers and influential data, I have included a description of studentized residuals as a way to determine influential data points; I also note how these are easily obtained by "dummying out" an observation. Finally, the increasingly important method of least absolute deviations (LAD) is now more fully described in a new subsection. In the computer exercises, a new data set on the compensation of Michigan elementary school teachers is used to illustrate the resilience of LAD to the inclusion of suspicious data points.

In the time series chapters (Chapters 10, 11, and 12), two new examples and data sets on the U.S. economy are
included. The first is a simple equation, known in macroeconomics as Okun's Law; the second is a sector-specific analysis of the effects of the minimum wage. These examples nicely illustrate practical applications to economics of regression with time series data.

The advanced chapters now include discussions of the Chow test for panel data (Chapter 13), a more detailed discussion of pooled OLS and panel data methods for cluster samples (Chapter 14), and better discussions of the problems of a weak instrument and the nature of overidentification tests with instrumental variables (Chapter 15). In Chapter 17, I expand the discussion of estimating partial effects in nonlinear models, emphasizing the difference between partial effects evaluated at averages of the regressors versus averaging the partial effects across all units.

I have added more data sets for the fourth edition. I previously mentioned the school-level data set on teachers' compensation (ELEM94_95.RAW). In addition, a data set on charitable contributions in the Netherlands (CHARITY.RAW) is used in some new problems. The two new time series data sets are OKUN.RAW and MINWAGE.RAW. A few other data sets not used in the text will be available on the text's companion Web site, including a data set on salaries and publication records of economics professors at Big Ten universities.

Targeted at Undergraduates, Adaptable for Master's Students

The text is designed for undergraduate economics majors who have taken college algebra and one semester of introductory probability and statistics. Appendices A, B, and C contain the requisite background material. A one-semester or one-quarter econometrics course would not be expected to cover all, or even any, of the more advanced material in Part 3. A typical introductory course includes Chapters 1 through 8, which cover the basics of simple and multiple regression for cross-sectional data. Provided the emphasis is on intuition and interpreting the empirical examples, the material from the first eight chapters
should be accessible to undergraduates in most economics departments. Most instructors will also want to cover at least parts of the chapters on regression analysis with time series data, Chapters 10, 11, and 12, with varying degrees of depth. In the one-semester course that I teach at Michigan State, I cover Chapter 10 fairly carefully, give an overview of the material in Chapter 11, and cover the material on serial correlation in Chapter 12. I find that this basic one-semester course puts students on a solid footing to write empirical papers, such as a term paper, a senior seminar paper, or a senior thesis. Chapter 9 contains more specialized topics that arise in analyzing cross-sectional data, including data problems such as outliers and nonrandom sampling; for a one-semester course, it can be skipped without loss of continuity.

The structure of the text makes it ideal for a course with a cross-sectional or policy analysis focus: the time series chapters can be skipped in lieu of topics from Chapters 9, 13, 14, or 15. Chapter 13 is advanced only in the sense that it treats two new data structures: independently pooled cross sections and two-period panel data analysis. Such data structures are especially useful for policy analysis, and the chapter provides several examples. Students with a good grasp of Chapters 1 through 8 will have little difficulty with Chapter 13. Chapter 14 covers more advanced panel data methods and would probably be covered only in a second course. A good way to end a course on cross-sectional methods is to cover the rudiments of instrumental variables estimation in Chapter 15.

I have used selected material in Part 3, including Chapters 13, 14, 15, and 17, in a senior seminar geared to producing a serious research paper. Along with the basic one-semester course, students who have been exposed to basic panel data analysis, instrumental variables estimation, and limited dependent variable models are in a position to read large segments of the applied social sciences literature.
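The two-period panel data idea mentioned above (Chapter 13) can be previewed with a short simulation. This sketch is illustrative only; it is not from the text, and the variable names and numbers are hypothetical. It shows why differencing across the two periods helps: an unobserved individual effect that biases pooled OLS drops out of the first-differenced equation.

```python
import numpy as np

# Hypothetical sketch of two-period panel data and first differencing.
# Model: y_it = beta*x_it + a_i + u_it, where a_i is an unobserved
# individual effect that is constant over the two time periods.
rng = np.random.default_rng(0)
n, beta = 500, 2.0
a = rng.normal(size=n)                 # unobserved effect, fixed over time
x1 = rng.normal(size=n) + 0.5 * a      # regressor correlated with a
x2 = rng.normal(size=n) + 0.5 * a
y1 = beta * x1 + a + rng.normal(size=n)
y2 = beta * x2 + a + rng.normal(size=n)

# Pooled OLS on the levels is biased because x is correlated with a.
x_pool = np.concatenate([x1, x2])
y_pool = np.concatenate([y1, y2])
beta_pooled = (x_pool @ y_pool) / (x_pool @ x_pool)

# First differences: y2 - y1 = beta*(x2 - x1) + (u2 - u1); a_i drops out.
dx, dy = x2 - x1, y2 - y1
beta_fd = (dx @ dy) / (dx @ dx)

print(beta_pooled, beta_fd)            # beta_fd should be near the true 2.0
```

In this simulated setup, the pooled estimate is pulled away from the true slope by the correlation between the regressor and the individual effect, while the first-difference estimate is not; this is the essence of the two-period panel methods treated in Chapter 13.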
Chapter 17 provides an introduction to the most common limited dependent variable models.

The text is also well suited for an introductory master's level course, where the emphasis is on applications rather than on derivations using matrix algebra. Still, for instructors wanting to present the material in matrix form, Appendices D and E are self-contained treatments of the matrix algebra and the multiple regression model in matrix form. At Michigan State, PhD students in many fields that require data analysis, including accounting, agricultural economics, development economics, finance, international economics, labor economics, macroeconomics, political science, and public finance, have found the text to be a useful bridge between the empirical work that they read and the more theoretical econometrics they learn at the PhD level.

Design Features

Numerous in-text questions are scattered throughout, with answers supplied in Appendix F. These questions are intended to provide students with immediate feedback. Each chapter contains many numbered examples. Several of these are case studies drawn from recently published papers, although I have used my judgment to simplify the analysis, hopefully without sacrificing the main point.

The end-of-chapter problems and computer exercises are heavily oriented toward empirical work, rather than complicated derivations. The students are asked to carefully reason based on what they have learned. The computer exercises often expand on the in-text examples. Several exercises use data sets from published works, or similar data sets that are motivated by published research in economics and other fields.

A pioneering feature of this introductory econometrics text is the extensive glossary. The short definitions and descriptions are a helpful refresher for students studying for exams or reading empirical research that uses econometric methods. I have added and updated several entries for the fourth edition.

Student Supplements

The Student Solutions Manual, ISBN
0324586582, contains suggestions on how to read each chapter, as well as answers to selected problems and computer exercises. The Student Solutions Manual can be accessed online at academic.cengage.com/login. An access code has been packaged with every new book and is required to access the material online. For students who purchase a used book, the access code may be purchased from the same website. With their single sign-on access code, students also can access the data sets that accompany the text, as well as link to EconApps, a continually updated collection of economic news, debates, and data.

Instructor Supplements

The Instructor's Manual with Solutions (ISBN 0324586574) contains answers to all exercises, as well as teaching tips on how to present the material in each chapter. The instructor's manual also contains sources for each of the data files, with many suggestions for how to use them on problem sets, exams, and term papers. This supplement is available online only to instructors, at academic.cengage.com/economics/wooldridge.

Upon the instructor's request, EViews Student Version can be bundled with the text for an additional $18 per book. With EViews, students can do homework anywhere they have access to a PC. However, because Student EViews restricts the size of a data set that can be analyzed, some of the full data sets used in the text and in the problems cannot be used in Student EViews. Instead, with the exception of a few of the data sets used only in Part 3 of the text, I have provided smaller versions of the EViews data sets that can be used in Student EViews. These are described in the instructor's manual. For more information on this special EViews offer, contact your South-Western/Cengage Learning representative or call Cengage Learning Customer & Sales Support at 1-800-354-9706.

Data Sets Available in Four Formats

About 100 data sets are available in ASCII, EViews, Excel, and Stata. Because most of the data sets come from actual research, some are very large. Except for partially listing data sets to
illustrate the various data structures, the data sets are not reported in the text. This book is geared to a course where computer work plays an integral role.

An extensive data description manual is available online. This manual contains a list of data sources, along with suggestions for ways to use the data sets that are not described in the text. Instructors can access the data sets at this book's companion site at academic.cengage.com/economics/wooldridge. An online access card has been packaged with every new book, which will give students access to all of these data sets and the data description manual.

Suggestions for Designing Your Course

I have already commented on the contents of most of the chapters, as well as possible outlines for courses. Here, I provide more specific comments about material in chapters that might be covered or skipped.

Chapter 9 has some interesting examples, such as a wage regression that includes IQ score as an explanatory variable. The rubric of proxy variables does not have to be formally introduced to present these kinds of examples, and I typically do so when finishing up cross-sectional analysis. In Chapter 12, for a one-semester course, I skip the material on serial correlation robust inference for ordinary least squares, as well as dynamic models of heteroskedasticity.

Even in a second course, I tend to spend only a little time on Chapter 16, which covers simultaneous equations analysis. If people differ about one issue, it is the importance of simultaneous equations. Some think this material is fundamental; others think it is rarely applicable. My own view is that simultaneous equations models are overused (see Chapter 16 for a discussion). If one reads applications carefully, omitted variables and measurement error are much more likely to be the reason one adopts instrumental variables estimation, and this is why I use omitted variables to motivate instrumental variables estimation in Chapter 15. Still, simultaneous equations models are indispensable
for estimating demand and supply functions, and they apply in some other important cases as well.

Chapter 17 is the only chapter that considers models inherently nonlinear in their parameters, and this puts an extra burden on the student. The first material one should cover in this chapter is on probit and logit models for binary response. My presentation of Tobit models and censored regression still appears to be novel: I explicitly recognize that the Tobit model is applied to corner solution outcomes on random samples, while censored regression is applied when the data collection process censors the dependent variable.

Chapter 18 covers some recent important topics from time series econometrics, including testing for unit roots and cointegration. I cover this material only in a second-semester course at either the undergraduate or master's level. A fairly detailed introduction to forecasting is also included in Chapter 18.

Chapter 19, which would be added to the syllabus for a course that requires a term paper, is much more extensive than similar chapters in other texts. It summarizes some of the methods appropriate for various kinds of problems and data structures, points out potential pitfalls, explains in some detail how to write a term paper in empirical economics, and includes suggestions for possible projects.

Acknowledgments

I would like to thank those who reviewed the proposal for the fourth edition or provided helpful comments on the third edition:

Swarnjit S. Arora, University of Wisconsin-Milwaukee
Jushan Bai, New York University
Edward Coulson, Penn State University
Lisa M. Dickson, University of Maryland-Baltimore County
Angela K. Dills, Clemson University
Michael Jansson, University of California-Berkeley
Subal C. Kumbhakar, State University of New York-Binghamton
Angelo Melino, University of Toronto
Dec Mullarkey, Boston College
Kevin J. Murphy, Oakland University
Leslie Papke, Michigan State University
Subhash Ray, University of Connecticut
Edwin A. Sexton, Brigham Young University-Idaho
Lara
Shore-Sheppard, Williams College
Jeffrey Smith, University of Michigan
Stephen Stageberg, University of Mary Washington
Timothy Vogelsang, Michigan State University
Anne E. Winkler, University of Missouri-St. Louis
Daniel Monchuk, University of Southern Mississippi

Several of the changes I discussed earlier were driven by comments I received from people on this list, and I continue to mull over specific suggestions made by one or more reviewers. Many students and teaching assistants, too numerous to list, have caught mistakes in earlier editions or have suggested rewording some paragraphs. I am grateful to them.

Thanks to the people at South-Western/Cengage Learning, the revision process has once again gone smoothly. Mike Worls, my longtime acquisitions editor, has been supportive as always, and Laura Bofinger hit the ground running as my new developmental editor. I benefitted from the enthusiasm Laura brought to the project. Martha Conway did a terrific job as project manager, and Charu Khanna at Macmillan Publishing Solutions professionally and efficiently oversaw the typesetting of the manuscript.

This book is dedicated to my wife, Leslie, who subjected her senior seminar students to the third edition, and to our children, Edmund and Gwenyth, who now understand enough about economics to know that they would rather be real scientists.

About the Author

Jeffrey M. Wooldridge was an assistant professor of economics at the Massachusetts Institute of Technology. He received his bachelor of arts, with majors in computer science and economics, from the University of California, Berkeley, in 1982, and received his doctorate in economics from the University of California, San Diego, in 1986. He has served on the editorial boards of journals including the Journal of Business and Economic Statistics and the Stata Journal. He has also acted as an occasional econometrics consultant for Arthur Andersen, Charles River Associates, and the Washington State Institute for Public Policy.

CHAPTER 1

The Nature of Econometrics and Economic Data

Chapter 1 discusses the scope of econometrics and raises general issues that arise in the application of econometric methods. Section
1.3 examines the kinds of data sets that are used in business, economics, and other social sciences. Section 1.4 provides an intuitive discussion of the difficulties associated with the inference of causality in the social sciences.

1.1 What Is Econometrics?

Imagine that you are hired by your state government to evaluate the effectiveness of a publicly funded job training program, say, a program to teach workers various ways to use computers in the manufacturing process. The twenty-week program offers courses during nonworking hours. Any hourly manufacturing worker may participate, and enrollment in all or part of the program is voluntary. You are to determine what, if any, effect the training program has on each worker's subsequent hourly wage.

Now, suppose you work for an investment bank. You are to study the returns on different investment strategies involving short-term U.S. treasury bills to decide whether they comply with implied economic theories.

The task of answering such questions may seem daunting at first. At this point, you may only have a vague idea of the kind of data you would need to collect. By the end of this introductory econometrics course, you should know how to use econometric methods to formally evaluate a job training program or to test a simple economic theory.

Econometrics is based upon the development of statistical methods for estimating economic relationships, testing economic theories, and evaluating and implementing government and business policy. A common application of econometrics is the forecasting of such important macroeconomic variables as interest rates, inflation rates, and gross domestic product. Whereas forecasts of economic indicators are highly visible, econometric methods can be used in economic areas that have nothing to do with macroeconomic forecasting. For example, we will study the effects of political campaign expenditures on voting outcomes. We will consider the effect of school spending on student performance in the field of education. In addition, we will learn how to use econometric methods for forecasting economic time series.

Econometrics has evolved as a separate discipline from mathematical statistics because the former focuses on the problems inherent in
collecting and analyzing nonexperimental economic data. Nonexperimental data are not accumulated through controlled experiments on individuals, firms, or segments of the economy. (Nonexperimental data are sometimes called observational data, or retrospective data, to emphasize the fact that the researcher is a passive collector of the data.) Experimental data are often collected in laboratory environments in the natural sciences, but they are much more difficult to obtain in the social sciences. Although some social experiments can be devised, it is often impossible, prohibitively expensive, or morally repugnant to conduct the kinds of controlled experiments that would be needed to address economic issues. We give some specific examples of the differences between experimental and nonexperimental data in Section 1.4.

Naturally, econometricians have borrowed from mathematical statisticians whenever possible. The method of multiple regression analysis is the mainstay in both fields, but its focus and interpretation can differ markedly. In addition, economists have devised new techniques to deal with the complexities of economic data and to test the predictions of economic theories.

1.2 Steps in Empirical Economic Analysis

Econometric methods are relevant in virtually every branch of applied economics. They come into play either when we have an economic theory to test or when we have a relationship in mind that has some importance for business decisions or policy analysis. An empirical analysis uses data to test a theory or to estimate a relationship.

How does one go about structuring an empirical economic analysis? It may seem obvious, but it is worth emphasizing that the first step in any empirical analysis is the careful formulation of the question of interest. The question might deal with testing a certain aspect of an economic theory, or it might pertain to testing the effects of a government policy. In principle, econometric methods can be used to answer a wide range of questions. In some
cases, especially those that involve the testing of economic theories, a formal economic model is constructed. An economic model consists of mathematical equations that describe various relationships. Economists are well known for their building of models to describe a vast array of behaviors. For example, in intermediate microeconomics, individual consumption decisions, subject to a budget constraint, are described by mathematical models. The basic premise underlying these models is utility maximization. The assumption that individuals make choices to maximize their well-being, subject to resource constraints, gives us a very powerful framework for creating tractable economic models and making clear predictions. In the context of consumption decisions, utility maximization leads to a set of demand equations. In a demand equation, the quantity demanded of each commodity depends on the price of the goods, the price of substitute and complementary goods, the consumer's income, and the individual's characteristics that affect taste. These equations can form the basis of an econometric analysis of consumer demand.

Economists have used basic economic tools, such as the utility maximization framework, to explain behaviors that at first glance may appear to be noneconomic in nature. A classic example is Becker's (1968) economic model of criminal behavior.

Example 1.1 Economic Model of Crime

In a seminal article, Nobel Prize winner Gary Becker postulated a utility maximization framework to describe an individual's participation in crime. Certain crimes have clear economic rewards, but most criminal behaviors have costs. The opportunity costs of crime prevent the criminal from participating in other activities, such as legal employment. In addition, there are costs associated with the possibility of being caught, and then, if convicted, the costs associated with incarceration. From Becker's perspective, the decision to undertake illegal activity is one of resource
allocation, with the benefits and costs of competing activities taken into account.

Under general assumptions, we can derive an equation describing the amount of time spent in criminal activity as a function of various factors. We might represent such a function as

y = f(x1, x2, x3, x4, x5, x6, x7),    (1.1)

where
y = hours spent in criminal activities,
x1 = "wage" for an hour spent in criminal activity,
x2 = hourly wage in legal employment,
x3 = income other than from crime or employment,
x4 = probability of getting caught,
x5 = probability of being convicted if caught,
x6 = expected sentence if convicted, and
x7 = age.

Other factors generally affect a person's decision to participate in crime, but the list above is representative of what might result from a formal economic analysis. As is common in economic theory, we have not been specific about the function f in (1.1). This function depends on an underlying utility function, which is rarely known. Nevertheless, we can use economic theory, or introspection, to predict the effect that each variable would have on criminal activity. This is the basis for an econometric analysis of individual criminal activity.

Formal economic modeling is sometimes the starting point for empirical analysis, but it is more common to use economic theory less formally, or even to rely entirely on intuition. You may agree that the determinants of criminal behavior appearing in equation (1.1) are reasonable based on common sense; we might arrive at such an equation directly, without starting from utility maximization. This view has some merit, although there are cases in which formal derivations provide insights that intuition can overlook. Next is an example of an equation that we can derive through somewhat informal reasoning.

Example 1.2 Job Training and Worker Productivity

Consider the problem posed at the beginning of Section 1.1. A labor economist would like to examine the effects of job training on worker productivity. In this case, there is little need for formal economic theory. Basic economic understanding is sufficient
for realizing that factors such as education, experience, and training affect worker productivity. Also, economists are well aware that workers are paid commensurate with their productivity. This simple reasoning leads to a model such as

wage = f(educ, exper, training),    (1.2)

where
wage = hourly wage,
educ = years of formal education,
exper = years of workforce experience, and
training = weeks spent in job training.

Again, other factors generally affect the wage rate, but equation (1.2) captures the essence of the problem.

After we specify an economic model, we need to turn it into what we call an econometric model. Because we will deal with econometric models throughout this text, it is important to know how an econometric model relates to an economic model. Take equation (1.1) as an example. The form of the function f must be specified before we can undertake an econometric analysis. A second issue concerning (1.1) is how to deal with variables that cannot reasonably be observed. For example, consider the wage that a person can earn in criminal activity. In principle, such a quantity is well defined, but it would be difficult if not impossible to observe this wage for a given individual. Even variables such as the probability of being arrested cannot realistically be obtained for a given individual, but at least we can observe relevant arrest statistics and derive a variable that approximates the probability of arrest. Many other factors affect criminal behavior that we cannot even list, let alone observe, but we must somehow account for them.

The ambiguities inherent in the economic model of crime are resolved by specifying a particular econometric model:

crime = β0 + β1wagem + β2othinc + β3freqarr + β4freqconv + β5avgsen + β6age + u,    (1.3)

where
crime = some measure of the frequency of criminal activity,
wagem = the wage that can be earned in legal employment,
othinc = the income from other sources (assets, inheritance, and so on),
freqarr = the frequency of arrests for prior infractions, to
approximate the probability of arrest,
freqconv = the frequency of conviction, and
avgsen = the average sentence length after conviction.

The choice of these variables is determined by the economic theory, as well as data considerations. The term u contains unobserved factors, such as the wage for criminal activity, moral character, family background, and errors in measuring things like criminal activity and the probability of arrest. We could add family background variables to the model, such as number of siblings, parents' education, and so on, but we can never eliminate u entirely. In fact, dealing with this error term or disturbance term is perhaps the most important component of any econometric analysis.

The constants β0, β1, ..., β6 are the parameters of the econometric model, and they describe the directions and strengths of the relationship between crime and the factors used to determine crime in the model.

A complete econometric model for Example 1.2 might be

wage = β0 + β1educ + β2exper + β3training + u,    (1.4)

where the term u contains factors such as innate ability, quality of education, family background, and the myriad other factors that can influence a person's wage. If we are specifically concerned about the effects of job training, then β3 is the parameter of interest.

For the most part, econometric analysis begins by specifying an econometric model, without consideration of the details of the model's creation. We generally follow this approach, largely because careful derivation of something like the economic model of crime is time-consuming and can take us into some specialized and often difficult areas of economic theory. Economic reasoning will play a role in our examples, and we will merge any underlying economic theory into the econometric model specification. In the economic model of crime example, we would start with an econometric model such as (1.3) and use economic reasoning and common sense as guides for choosing the variables. Although this
approach loses some of the richness of economic analysis, it is commonly and effectively applied by careful researchers.

Once an econometric model such as (1.3) or (1.4) has been specified, various hypotheses of interest can be stated in terms of the unknown parameters. For example, in equation (1.3), we might hypothesize that wagem, the wage that can be earned in legal employment, has no effect on criminal behavior. In the context of this particular econometric model, the hypothesis is equivalent to β1 = 0.

An empirical analysis, by definition, requires data. After data on the relevant variables have been collected, econometric methods are used to estimate the parameters in the econometric model and to formally test hypotheses of interest. In some cases, the econometric model is used to make predictions in either the testing of a theory or the study of a policy's impact.

Because data collection is so important in empirical work, Section 1.3 will describe the kinds of data that we are likely to encounter.

1.3 The Structure of Economic Data

Economic data sets come in a variety of types. Whereas some econometric methods can be applied with little or no modification to many different kinds of data sets, the special features of some data sets must be accounted for or should be exploited. We next describe the most important data structures encountered in applied work.

Cross-Sectional Data

A cross-sectional data set consists of a sample of individuals, households, firms, cities, states, countries, or a variety of other units, taken at a given point in time. Sometimes, the data on all units do not correspond to precisely the same time period. For example, several families may be surveyed during different weeks within a year. In a pure cross-sectional analysis, we would ignore any minor timing differences in collecting the data. If a set of families was surveyed during different weeks of the same year, we would still view this as a cross-sectional data set.

An important
feature of cross-sectional data is that we can often assume that they have been obtained by random sampling from the underlying population. For example, if we obtain information on wages, education, experience, and other characteristics by randomly drawing 500 people from the working population, then we have a random sample from the population of all working people. Random sampling is the sampling scheme covered in introductory statistics courses, and it simplifies the analysis of cross-sectional data. A review of random sampling is contained in Appendix C.

Sometimes, random sampling is not appropriate as an assumption for analyzing cross-sectional data. For example, suppose we are interested in studying factors that influence the accumulation of family wealth. We could survey a random sample of families, but some families might refuse to report their wealth. If, for example, wealthier families are less likely to disclose their wealth, then the resulting sample on wealth is not a random sample from the population of all families. This is an illustration of a sample selection problem, an advanced topic that we will discuss in Chapter 17.

Another violation of random sampling occurs when we sample from units that are large relative to the population, particularly geographical units. The potential problem in such cases is that the population is not large enough to reasonably assume the observations are independent draws. For example, if we want to explain new business activity across states as a function of wage rates, energy prices, corporate and property tax rates, services provided, quality of the workforce, and other state characteristics, it is unlikely that business activities in states near one another are independent. It turns out that the econometric methods that we discuss do work in such situations, but they sometimes need to be refined. For the most part, we will ignore the intricacies that arise in analyzing such situations and treat these problems in a random sampling framework, even when it is
not technically correct to do so.

Cross-sectional data are widely used in economics and other social sciences. In economics, the analysis of cross-sectional data is closely aligned with the applied microeconomics fields, such as labor economics, state and local public finance, industrial organization, urban economics, demography, and health economics. Data on individuals, households, firms, and cities at a given point in time are important for testing microeconomic hypotheses and evaluating economic policies.

The cross-sectional data used for econometric analysis can be represented and stored in computers. Table 1.1 contains, in abbreviated form, a cross-sectional data set on 526 working individuals for the year 1976. (This is a subset of the data in the file WAGE1.RAW.) The variables include wage (in dollars per hour), educ (years of education), exper (years of potential labor force experience), female (an indicator for gender), and married (marital status). These last two variables are binary (zero-one) in nature and serve to indicate qualitative features of the individual (the person is female or not; the person is married or not). We will have much to say about binary variables in Chapter 7 and beyond.

The variable obsno in Table 1.1 is the observation number assigned to each person in the sample. Unlike the other variables, it is not a characteristic of the individual. All econometrics and statistics software packages assign an observation number to each data unit. Intuition should tell you that, for data such as that in Table 1.1, it does not matter which person is labeled as observation 1, which person is called observation 2, and so on. The fact that the ordering of the data does not matter for econometric analysis is a key feature of cross-sectional data sets obtained from random sampling.

Different variables sometimes correspond to different time periods in cross-sectional data sets. For example, to determine the effects of government policies on long-term economic growth, economists have studied the relationship
between growth in real per capita gross domestic product (GDP) over a certain period (say, 1960 to 1985) and variables determined in part by government policy in 1960 (government consumption as a percentage of GDP and adult secondary education rates). Such a data set might be represented as in Table 1.2, which constitutes part of the data set used in the study of cross-country growth rates by De Long and Summers (1991).

TABLE 1.1 A Cross-Sectional Data Set on Wages and Other Individual Characteristics

obsno   wage    educ   exper   female   married
1       3.10    11     2       1        0
2       3.24    12     22      1        1
3       3.00    11     2       0        0
4       6.00    8      44      0        1
5       5.30    12     7       0        1
.       .       .      .       .        .
525     11.56   16     5       0        1
526     3.50    14     5       1        0

TABLE 1.2 A Data Set on Economic Growth Rates and Country Characteristics

obsno   country     gpcrgdp   govcons60   second60
1       Argentina   0.89      9           32
2       Austria     3.32      16          50
3       Belgium     2.56      13          69
4       Bolivia     1.24      18          12
.       .           .         .           .
61      Zimbabwe    2.30      17          6

The variable gpcrgdp represents average growth in real per capita GDP over the period 1960 to 1985. The fact that govcons60 (government consumption as a percentage of GDP) and second60 (percentage of adult population with a secondary education) correspond to the year 1960, while gpcrgdp is the average growth over the period from 1960 to 1985, does not lead to any special problems in treating this information as a cross-sectional data set. The observations are listed alphabetically by country, but nothing about this ordering affects any subsequent analysis.

Time Series Data

A time series data set consists of observations on a variable or several variables over time. Examples of time series data include stock prices, money supply, consumer price index, gross domestic product, annual homicide rates, and automobile sales figures. Because past events can influence future events, and lags in behavior are prevalent in the social sciences, time is an important dimension in a time series data set. Unlike the arrangement of cross-sectional data, the
chronological ordering of observations in a time series conveys potentially important information.

A key feature of time series data that makes them more difficult to analyze than cross-sectional data is that economic observations can rarely, if ever, be assumed to be independent across time. Most economic and other time series are related, often strongly related, to their recent histories. For example, knowing something about the gross domestic product from last quarter tells us quite a bit about the likely range of the GDP during this quarter, because GDP tends to remain fairly stable from one quarter to the next. Although most econometric procedures can be used with both cross-sectional and time series data, more needs to be done in specifying econometric models for time series data before standard econometric methods can be justified. In addition, modifications and embellishments to standard econometric techniques have been developed to account for and exploit the dependent nature of economic time series and to address other issues, such as the fact that some economic variables tend to display clear trends over time.

Another feature of time series data that can require special attention is the data frequency at which the data are collected. In economics, the most common frequencies are daily, weekly, monthly, quarterly, and annually. Stock prices are recorded at daily intervals (excluding Saturday and Sunday). The money supply in the U.S. economy is reported weekly. Many macroeconomic series are tabulated monthly, including inflation and unemployment rates. Other macro series are recorded less frequently, such as every three months (every quarter). Gross domestic product is an important example of a quarterly series. Other time series, such as infant mortality rates for states in the United States, are available only on an annual basis.

Many weekly, monthly, and quarterly economic time series display a strong seasonal pattern, which can be an important factor in a time series analysis. For example, monthly
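The persistence described above, where a series is strongly related to its own recent history, can be made concrete with a small simulation. This is an illustrative sketch, not actual GDP data; the AR(1) coefficient of 0.9 is an invented value:

```python
import random

random.seed(0)

# Simulate a persistent (AR(1)) series, loosely mimicking how GDP-type
# series depend on their recent past: y_t = 0.9 * y_{t-1} + e_t.
y = [0.0]
for t in range(1, 200):
    y.append(0.9 * y[-1] + random.gauss(0, 1))

def lag1_corr(series):
    """Sample correlation between the series and its first lag."""
    n = len(series) - 1
    x, z = series[:-1], series[1:]
    mx, mz = sum(x) / n, sum(z) / n
    cov = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sz = sum((b - mz) ** 2 for b in z) ** 0.5
    return cov / (sx * sz)

# A strongly persistent series has a lag-1 autocorrelation near its
# AR coefficient; independence across time would put it near zero.
print(round(lag1_corr(y), 2))
```

The large lag-1 autocorrelation is exactly why observations in most economic time series cannot be treated as independent draws, unlike a randomly sampled cross section.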
data on housing starts differ across the months simply due to changing weather conditions. We will learn how to deal with seasonal time series in Chapter 10.

Table 1.3 contains a time series data set obtained from an article by Castillo-Freeman and Freeman (1992) on minimum wage effects in Puerto Rico. The earliest year in the data set is the first observation, and the most recent year available is the last observation. When econometric methods are used to analyze time series data, the data should be stored in chronological order.

TABLE 1.3  Minimum Wage, Unemployment, and Related Data for Puerto Rico

obsno   year   avgmin   avgcov   unemp     gnp
1       1950    0.20     20.1    15.4     878.7
2       1951    0.21     20.7    16.0     925.0
3       1952    0.23     22.6    14.8    1015.9
.        .       .        .       .        .
37      1986    3.35     58.1    18.9    4281.6
38      1987    3.35     58.2    16.8    4496.7

The variable avgmin refers to the average minimum wage for the year, avgcov is the average coverage rate (the percentage of workers covered by the minimum wage law), unemp is the unemployment rate, and gnp is the gross national product. We will use these data later in a time series analysis of the effect of the minimum wage on employment.

Pooled Cross Sections

Some data sets have both cross-sectional and time series features. For example, suppose that two cross-sectional household surveys are taken in the United States, one in 1985 and one in 1990. In 1985, a random sample of households is surveyed for variables such as income, savings, family size, and so on. In 1990, a new random sample of households is taken using the same survey questions. To increase our sample size, we can form a pooled cross section by combining the two years.

Pooling cross sections from different years is often an effective way of analyzing the effects of a new government policy. The idea is to collect data from the years before and after a key policy change. As an example, consider the following data set on housing prices taken in 1993 and 1995, before and after a reduction in property taxes in 1994. Suppose we have data on 250
houses for 1993 and on 270 houses for 1995. One way to store such a data set is given in Table 1.4. Observations 1 through 250 correspond to the houses sold in 1993, and observations 251 through 520 correspond to the 270 houses sold in 1995. Although the order in which we store the data turns out not to be crucial, keeping track of the year for each observation is usually very important. This is why we enter year as a separate variable.

A pooled cross section is analyzed much like a standard cross section, except that we often need to account for secular differences in the variables across time. In fact, in addition to increasing the sample size, the point of a pooled cross-sectional analysis is often to see how a key relationship has changed over time.

TABLE 1.4  Pooled Cross Sections: Two Years of Housing Prices

obsno   year   hprice    proptax   sqrft   bdrms   bthrms
1       1993    85500       42      1600     3      2.0
2       1993    67300       36      1440     3      2.5
3       1993   134000       38      2000     4      2.5
.        .        .          .        .      .       .
250     1993   243600       41      2600     4      3.0
251     1995    65000       16      1250     2      1.0
252     1995   182400       20      2200     4      2.0
253     1995    97500       15      1540     3      2.0
.        .        .          .        .      .       .
520     1995    57200       16      1100     3      1.5

Panel or Longitudinal Data

A panel data (or longitudinal data) set consists of a time series for each cross-sectional member in the data set. As an example, suppose we have wage, education, and employment history for a set of individuals followed over a ten-year period. Or we might collect information, such as investment and financial data, about the same set of firms over a five-year time period. Panel data can also be collected on geographical units. For example, we can collect data for the same set of counties in the United States on immigration flows, tax rates, wage rates, government expenditures, and so on, for the years 1980, 1985, and 1990.

The key feature of panel data that distinguishes them from a pooled cross section is that the same cross-sectional units (individuals, firms, or counties in the preceding examples) are followed over a given time period. The data in Table 1.4
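Carrying year as an explicit variable is what makes a pooled cross section useful for before-and-after comparisons. A minimal sketch, using the first few house prices shown in Table 1.4 (only four observations, so the averages are purely illustrative):

```python
# Pooled cross section: each observation records its year, as in Table 1.4.
houses = [
    {"year": 1993, "hprice": 85500},
    {"year": 1993, "hprice": 67300},
    {"year": 1995, "hprice": 65000},
    {"year": 1995, "hprice": 97500},
]

def mean_price(data, year):
    """Average house price among observations from a given year."""
    prices = [h["hprice"] for h in data if h["year"] == year]
    return sum(prices) / len(prices)

# Because year is stored with each row, comparing the two cross
# sections (before vs. after the 1994 tax change) is a one-liner.
before = mean_price(houses, 1993)   # 76400.0
after = mean_price(houses, 1995)    # 81250.0
```

A real analysis would also control for house characteristics such as size and bedrooms, rather than compare raw means; that is exactly the "accounting for secular differences" mentioned above.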
are not considered a panel data set because the houses sold are likely to be different in 1993 and 1995; if there are any duplicates, the number is likely to be so small as to be unimportant. In contrast, Table 1.5 contains a two-year panel data set on crime and related statistics for 150 cities in the United States.

There are several interesting features in Table 1.5. First, each city has been given a number from 1 through 150. Which city we decide to call city 1, city 2, and so on, is irrelevant. As with a pure cross section, the ordering in the cross section of a panel data set does not matter. We could use the city name in place of a number, but it is often useful to have both.

TABLE 1.5  A Two-Year Panel Data Set on City Crime Statistics

obsno   city   year   murders   population   unem   police
1        1     1986      5        350000      8.7     440
2        1     1990      8        359200      7.2     471
3        2     1986      2         64300      5.4      75
4        2     1990      1         65100      5.5      75
.        .      .        .           .         .        .
297    149     1986     10        260700      9.6     286
298    149     1990      6        245000      9.8     334
299    150     1986     25        543000      4.3     520
300    150     1990     32        546200      5.2     493

A second point is that the two years of data for city 1 fill the first two rows, or observations. Observations 3 and 4 correspond to city 2, and so on. Because each of the 150 cities has two rows of data, any econometrics package will view this as 300 observations. This data set can be treated as a pooled cross section, where the same cities happen to show up in each year. But, as we will see in Chapters 13 and 14, we can also use the panel structure to analyze questions that cannot be answered by simply viewing this as a pooled cross section.

In organizing the observations in Table 1.5, we place the two years of data for each city adjacent to one another, with the first year coming before the second in all cases. For just about every practical purpose, this is the preferred way for ordering panel data sets. Contrast this organization with the way the pooled cross sections are stored in Table 1.4. In short, the reason for ordering panel data as in Table 1.5 is
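One data transformation this ordering supports is computing, for each city, the change in a variable between the two years (first-differencing, developed in Chapters 13 and 14). A sketch using the first rows of Table 1.5:

```python
# Panel data in "long" form: two adjacent rows per city, earlier year
# first, as in Table 1.5. Murder counts are from the first table rows.
panel = [
    {"city": 1, "year": 1986, "murders": 5},
    {"city": 1, "year": 1990, "murders": 8},
    {"city": 2, "year": 1986, "murders": 2},
    {"city": 2, "year": 1990, "murders": 1},
]

def first_differences(rows):
    """Change in murders between the two years, city by city.
    Assumes consecutive rows belong to the same city, earlier year first."""
    return {rows[i]["city"]: rows[i + 1]["murders"] - rows[i]["murders"]
            for i in range(0, len(rows), 2)}

print(first_differences(panel))   # {1: 3, 2: -1}
```

With the pooled-cross-section layout of Table 1.4, this within-unit comparison would be impossible, since the same unit never appears twice.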
that we will need to perform data transformations for each city across the two years.

Because panel data require replication of the same units over time, panel data sets, especially those on individuals, households, and firms, are more difficult to obtain than pooled cross sections. Not surprisingly, observing the same units over time leads to several advantages over cross-sectional data or even pooled cross-sectional data. The benefit that we will focus on in this text is that having multiple observations on the same units allows us to control for certain unobserved characteristics of individuals, firms, and so on. As we will see, the use of more than one observation can facilitate causal inference in situations where inferring causality would be very difficult if only a single cross section were available. A second advantage of panel data is that they often allow us to study the importance of lags in behavior or the result of decision making. This information can be significant because many economic policies can be expected to have an impact only after some time has passed.

Most books at the undergraduate level do not contain a discussion of econometric methods for panel data. However, economists now recognize that some questions are difficult, if not impossible, to answer satisfactorily without panel data. As you will see, we can make considerable progress with simple panel data analysis, a method that is not much more difficult than dealing with a standard cross-sectional data set.

A Comment on Data Structures

Part 1 of this text is concerned with the analysis of cross-sectional data, because this poses the fewest conceptual and technical difficulties. At the same time, it illustrates most of the key themes of econometric analysis. We will use the methods and insights from cross-sectional analysis in the remainder of the text.

Although the econometric analysis of time series uses many of the same tools as cross-sectional analysis, it is
more complicated because of the trending, highly persistent nature of many economic time series. Examples that have been traditionally used to illustrate the manner in which econometric methods can be applied to time series data are now widely believed to be flawed. It makes little sense to use such examples initially, since this practice will only reinforce poor econometric practice. Therefore, we will postpone the treatment of time series econometrics until Part 2, when the important issues concerning trends, persistence, dynamics, and seasonality will be introduced.

In Part 3, we will treat pooled cross sections and panel data explicitly. The analysis of independently pooled cross sections and simple panel data analysis are fairly straightforward extensions of pure cross-sectional analysis. Nevertheless, we will wait until Chapter 13 to deal with these topics.

1.4 Causality and the Notion of Ceteris Paribus in Econometric Analysis

In most tests of economic theory, and certainly for evaluating public policy, the economist's goal is to infer that one variable (such as education) has a causal effect on another variable (such as worker productivity). Simply finding an association between two or more variables might be suggestive, but unless causality can be established, it is rarely compelling.

The notion of ceteris paribus, which means "other (relevant) factors being equal," plays an important role in causal analysis. This idea has been implicit in some of our earlier discussion, particularly Examples 1.1 and 1.2, but thus far we have not explicitly mentioned it.

You probably remember from introductory economics that most economic questions are ceteris paribus by nature. For example, in analyzing consumer demand, we are interested in knowing the effect of changing the price of a good on its quantity demanded, while holding all other factors, such as income, prices of other goods, and individual tastes, fixed. If other factors are not held fixed, then we cannot know the causal effect of a price change on quantity
demanded.

Holding other factors fixed is critical for policy analysis as well. In the job training example (Example 1.2), we might be interested in the effect of another week of job training on wages, with all other components being equal (in particular, education and experience). If we succeed in holding all other relevant factors fixed and then find a link between job training and wages, we can conclude that job training has a causal effect on worker productivity. Although this may seem pretty simple, even at this early stage it should be clear that, except in very special cases, it will not be possible to literally hold all else equal. The key question in most empirical studies is: Have enough other factors been held fixed to make a case for causality? Rarely is an econometric study evaluated without raising this issue.

In most serious applications, the number of factors that can affect the variable of interest, such as criminal activity or wages, is immense, and the isolation of any particular variable may seem like a hopeless effort. However, we will eventually see that, when carefully applied, econometric methods can simulate a ceteris paribus experiment.

At this point, we cannot yet explain how econometric methods can be used to estimate ceteris paribus effects, so we will consider some problems that can arise in trying to infer causality in economics. We do not use any equations in this discussion. For each example, the problem of inferring causality disappears if an appropriate experiment can be carried out. Thus, it is useful to describe how such an experiment might be structured, and to observe that, in most cases, obtaining experimental data is impractical. It is also helpful to think about why the available data fail to have the important features of an experimental data set.

We rely for now on your intuitive understanding of such terms as random, independence, and correlation, all of which should be familiar from an introductory
probability and statistics course. (These concepts are reviewed in Appendix B.) We begin with an example that illustrates some of these important issues.

Example 1.3  Effects of Fertilizer on Crop Yield

Some early econometric studies (for example, Griliches [1957]) considered the effects of new fertilizers on crop yields. Suppose the crop under consideration is soybeans. Since fertilizer amount is only one factor affecting yields (some others include rainfall, quality of land, and presence of parasites), this issue must be posed as a ceteris paribus question. One way to determine the causal effect of fertilizer amount on soybean yield is to conduct an experiment, which might include the following steps. Choose several one-acre plots of land. Apply different amounts of fertilizer to each plot and subsequently measure the yields; this gives us a cross-sectional data set. Then, use statistical methods (to be introduced in Chapter 2) to measure the association between yields and fertilizer amounts.

As described earlier, this may not seem like a very good experiment because we have said nothing about choosing plots of land that are identical in all respects except for the amount of fertilizer. In fact, choosing plots of land with this feature is not feasible: some of the factors, such as land quality, cannot even be fully observed. How do we know the results of this experiment can be used to measure the ceteris paribus effect of fertilizer? The answer depends on the specifics of how fertilizer amounts are chosen. If the levels of fertilizer are assigned to plots independently of other plot features that affect yield, that is, if other characteristics of plots are completely ignored when deciding on fertilizer amounts, then we are in business. We will justify this statement in Chapter 2.

The next example is more representative of the difficulties that arise when inferring causality in applied economics.

Example 1.4  Measuring the Return to Education

Labor economists
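The distinction between random and non-random assignment in the fertilizer example can be made vivid with a stylized simulation. This sketch is not from the text; the numeric values (a true fertilizer effect of 2 and a land-quality effect of 5) are invented for illustration:

```python
import random

random.seed(1)

# Hypothetical yield model: yield depends on fertilizer (true effect 2)
# and on unobserved land quality (effect 5), plus noise.
def grow(fertilizer, quality):
    return 10 + 2 * fertilizer + 5 * quality + random.gauss(0, 1)

qualities = [random.random() for _ in range(500)]

# (a) Random assignment: fertilizer chosen independently of plot quality.
random_fert = [random.random() for _ in qualities]
# (b) Confounded assignment: more fertilizer goes on higher-quality plots.
confounded_fert = qualities[:]

def slope(x, y):
    """Simple regression slope of y on x (an association, not necessarily causal)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var

slope_random = slope(random_fert,
                     [grow(f, q) for f, q in zip(random_fert, qualities)])
slope_conf = slope(confounded_fert,
                   [grow(f, q) for f, q in zip(confounded_fert, qualities)])
# slope_random comes out near the true effect (2); slope_conf comes out
# near 2 + 5 = 7, because fertilizer proxies for land quality when
# assignment is not random. The fertilizer-yield association is spurious.
```

This is precisely the "we are in business" condition stated above: independence of the assigned treatment from the other factors that affect the outcome.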
and policy makers have long been interested in the return to education. Somewhat informally, the question is posed as follows: If a person is chosen from the population and given another year of education, by how much will his or her wage increase? As with the previous examples, this is a ceteris paribus question, which implies that all other factors are held fixed while another year of education is given to the person.

We can imagine a social planner designing an experiment to get at this issue, much as the agricultural researcher can design an experiment to estimate fertilizer effects. Assume, for the moment, that the social planner has the ability to assign any level of education to any person. How would this planner emulate the fertilizer experiment in Example 1.3? The planner would choose a group of people and randomly assign each person an amount of education; some people are given an eighth-grade education, some are given a high school education, some are given two years of college, and so on. Subsequently, the planner measures wages for this group of people (where we assume that each person then works in a job). The people here are like the plots in the fertilizer example, where education plays the role of fertilizer and wage rate plays the role of soybean yield. As with Example 1.3, if levels of education are assigned independently of other characteristics that affect productivity (such as experience and innate ability), then an analysis that ignores these other factors will yield useful results. Again, it will take some effort in Chapter 2 to justify this claim; for now, we state it without support.

Unlike the fertilizer-yield example, the experiment described in Example 1.4 is infeasible. The ethical issues, not to mention the economic costs, associated with randomly determining education levels for a group of individuals are obvious. As a logistical matter, we could not give someone only an eighth-grade education if he or she already has a college degree.

Even though experimental data cannot be
obtained for measuring the return to education, we can certainly collect nonexperimental data on education levels and wages for a large group by sampling randomly from the population of working people. Such data are available from a variety of surveys used in labor economics, but these data sets have a feature that makes it difficult to estimate the ceteris paribus return to education. People choose their own levels of education; therefore, education levels are probably not determined independently of all other factors affecting wage. This problem is a feature shared by most nonexperimental data sets.

One factor that affects wage is experience in the workforce. Since pursuing more education generally requires postponing entering the workforce, those with more education usually have less experience. Thus, in a nonexperimental data set on wages and education, education is likely to be negatively associated with a key variable that also affects wage. It is also believed that people with more innate ability often choose higher levels of education. Since higher ability leads to higher wages, we again have a correlation between education and a critical factor that affects wage.

The omitted factors of experience and ability in the wage example have analogs in the fertilizer example. Experience is generally easy to measure, and therefore is similar to a variable such as rainfall. Ability, on the other hand, is nebulous and difficult to quantify; it is similar to land quality in the fertilizer example. As we will see throughout this text, accounting for other observed factors, such as experience, when estimating the ceteris paribus effect of another variable, such as education, is relatively straightforward. We will also find that accounting for inherently unobservable factors, such as ability, is much more problematic. It is fair to say that many of the advances in econometric methods have tried to deal with unobserved factors in econometric
models.

One final parallel can be drawn between Examples 1.3 and 1.4. Suppose that in the fertilizer example, the fertilizer amounts were not entirely determined at random. Instead, the assistant who chose the fertilizer levels thought it would be better to put more fertilizer on the higher-quality plots of land. (Agricultural researchers should have a rough idea about which plots of land are of better quality, even though they may not be able to fully quantify the differences.) This situation is completely analogous to the level of schooling being related to unobserved ability in Example 1.4. Because better land leads to higher yields, and more fertilizer was used on the better plots, any observed relationship between yield and fertilizer might be spurious.

Example 1.5  The Effect of Law Enforcement on City Crime Levels

The issue of how best to prevent crime has been, and will probably continue to be, with us for some time. One especially important question in this regard is: Does the presence of more police officers on the street deter crime?

The ceteris paribus question is easy to state: If a city is randomly chosen and given, say, ten additional police officers, by how much would its crime rates fall? Another way to state the question is: If two cities are the same in all respects, except that city A has ten more police officers than city B, by how much would the two cities' crime rates differ?

It would be virtually impossible to find pairs of communities identical in all respects except for the size of their police force. Fortunately, econometric analysis does not require this. What we do need to know is whether the data we can collect on community crime levels and the size of the police force can be viewed as experimental. We can certainly imagine a true experiment involving a large collection of cities where we dictate how many police officers each city will use for the upcoming year.

Although policies can be used to affect the size of police forces, we clearly cannot tell each city how many police officers it can
hire. If, as is likely, a city's decision on how many police officers to hire is correlated with other city factors that affect crime, then the data must be viewed as nonexperimental. In fact, one way to view this problem is to see that a city's choice of police force size and the amount of crime are simultaneously determined. We will explicitly address such problems in Chapter 16.

The first three examples we have discussed have dealt with cross-sectional data at various levels of aggregation (for example, at the individual or city levels). The same hurdles arise when inferring causality in time series problems.

Example 1.6  The Effect of the Minimum Wage on Unemployment

An important, and perhaps contentious, policy issue concerns the effect of the minimum wage on unemployment rates for various groups of workers. Although this problem can be studied in a variety of data settings (cross-sectional, time series, or panel data), time series data are often used to look at aggregate effects. An example of a time series data set on unemployment rates and minimum wages was given in Table 1.3.

Standard supply and demand analysis implies that, as the minimum wage is increased above the market clearing wage, we slide up the demand curve for labor and total employment decreases. (Labor supply exceeds labor demand.) To quantify this effect, we can study the relationship between employment and the minimum wage over time. In addition to some special difficulties that can arise in dealing with time series data, there are possible problems with inferring causality. The minimum wage in the United States is not determined in a vacuum. Various economic and political forces impinge on the final minimum wage for any given year. (The minimum wage, once determined, is usually in place for several years, unless it is indexed for inflation.) Thus, it is probable that the amount of the minimum wage is related to other factors that have an effect on employment levels.

We can imagine the U.S.
government conducting an experiment to determine the employment effects of the minimum wage (as opposed to worrying about the welfare of low-wage workers). The minimum wage could be randomly set by the government each year, and then the employment outcomes could be tabulated. The resulting experimental time series data could then be analyzed using fairly simple econometric methods. But this scenario hardly describes how minimum wages are set.

If we can control enough other factors relating to employment, then we can still hope to estimate the ceteris paribus effect of the minimum wage on employment. In this sense, the problem is very similar to the previous cross-sectional examples.

Even when economic theories are not most naturally described in terms of causality, they often have predictions that can be tested using econometric methods. The following example demonstrates this approach.

Example 1.7  The Expectations Hypothesis

The expectations hypothesis from financial economics states that, given all information available to investors at the time of investing, the expected return on any two investments is the same. For example, consider two possible investments with a three-month investment horizon, purchased at the same time: (1) Buy a three-month T-bill with a face value of $10,000, for a price below $10,000; in three months, you receive $10,000. (2) Buy a six-month T-bill (at a price below $10,000) and, in three months, sell it as a three-month T-bill. Each investment requires roughly the same amount of initial capital, but there is an important difference. For the first investment, you know exactly what the return is at the time of purchase because you know the initial price of the three-month T-bill, along with its face value. This is not true for the second investment: although you know the price of a six-month T-bill when you purchase it, you do not know the price you can sell it for in three months. Therefore, there is uncertainty in this investment for someone who has a three-month investment horizon.

The actual returns on these two investments will usually be different. According to the expectations hypothesis, the expected return from the second investment, given all information at the time of investment, should equal the return from purchasing a three-month T-bill. This theory turns out to be fairly easy to test, as we will see in Chapter 11.

Summary

In this introductory chapter, we have discussed the purpose and scope of econometric analysis. Econometrics is used in all applied economics fields to test economic theories, to inform government and private policy makers, and to predict economic time series. Sometimes, an econometric model is derived from a formal economic model, but in other cases, econometric models are based on informal economic reasoning and intuition. The goals of any econometric analysis are to estimate the parameters in the model and to test hypotheses about these parameters; the values and signs of the parameters determine the validity of an economic theory and the effects of certain policies.

Cross-sectional, time series, pooled cross-sectional, and panel data are the most common types of data structures that are used in applied econometrics. Data sets involving a time dimension, such as time series and panel data, require special treatment because of the correlation across time of most economic time series. Other issues, such as trends and seasonality, arise in the analysis of time series data but not cross-sectional data.

In Section 1.4, we discussed the notions of ceteris paribus and causal inference. In most cases, hypotheses in the social sciences are ceteris paribus in nature: all other relevant factors must be fixed when studying the relationship between two variables. Because of the nonexperimental nature of most data collected in the social sciences, uncovering causal relationships is very challenging.

Key Terms

Causal Effect
Ceteris Paribus
Cross-Sectional Data Set
Data Frequency
Econometric Model
Economic Model
Empirical Analysis
Experimental Data
Nonexperimental Data
Observational Data
Panel Data
Pooled Cross Section
Random Sampling
Retrospective Data
Time Series Data

Problems

1.1 Suppose that you are asked to conduct a study to determine whether smaller class sizes lead to improved student performance of fourth graders.
(i) If you could conduct any experiment you want, what would you do? Be specific.
(ii) More realistically, suppose you can collect observational data on several thousand fourth graders in a given state. You can obtain the size of their fourth-grade class and a standardized test score taken at the end of fourth grade. Why might you expect a negative correlation between class size and test score?
(iii) Would a negative correlation necessarily show that smaller class sizes cause better performance? Explain.

1.2 A justification for job training programs is that they improve worker productivity. Suppose that you are asked to evaluate whether more job training makes workers more productive. However, rather than having data on individual workers, you have access to data on manufacturing firms in Ohio. In particular, for each firm, you have information on hours of job training per worker (training) and number of nondefective items produced per worker hour (output).
(i) Carefully state the ceteris paribus thought experiment underlying this policy question.
(ii) Does it seem likely that a firm's decision to train its workers will be independent of worker characteristics? What are some of those measurable and unmeasurable worker characteristics?
(iii) Name a factor other than worker characteristics that can affect worker productivity.
(iv) If you find a positive correlation between output and training, would you have convincingly established that job training makes workers more productive? Explain.

1.3 Suppose at your university you are asked to find the relationship between weekly hours spent studying (study) and weekly hours spent working (work). Does it make sense to characterize the problem as inferring
whether study causes work or work causes study? Explain.

Computer Exercises

C1.1 Use the data in WAGE1.RAW for this exercise.
(i) Find the average education level in the sample. What are the lowest and highest years of education?
(ii) Find the average hourly wage in the sample. Does it seem high or low?
(iii) The wage data are reported in 1976 dollars. Using the Economic Report of the President (2004 or later), obtain and report the Consumer Price Index (CPI) for the years 1976 and 2003.
(iv) Use the CPI values from part (iii) to find the average hourly wage in 2003 dollars. Now does the average hourly wage seem reasonable?
(v) How many women are in the sample? How many men?

C1.2 Use the data in BWGHT.RAW to answer this question.
(i) How many women are in the sample, and how many report smoking during pregnancy?
(ii) What is the average number of cigarettes smoked per day? Is the average a good measure of the "typical" woman in this case? Explain.
(iii) Among women who smoked during pregnancy, what is the average number of cigarettes smoked per day? How does this compare with your answer from part (ii), and why?
(iv) Find the average of fatheduc in the sample. Why are only 1,192 observations used to compute this average?
(v) Report the average family income and its standard deviation in dollars.

C1.3 The data in MEAP01.RAW are for the state of Michigan in the year 2001. Use these data to answer the following questions.
(i) Find the largest and smallest values of math4. Does the range make sense? Explain.
(ii) How many schools have a perfect pass rate on the math test? What percentage is this of the total sample?
(iii) How many schools have math pass rates of exactly 50 percent?
(iv) Compare the average pass rates for the math and reading scores. Which test is harder to pass?
(v) Find the correlation between math4 and read4. What do you conclude?
(vi) The variable exppp is expenditure per pupil. Find the average of exppp along with its standard deviation. Would you say there is wide variation in per pupil
spending?
(vii) Suppose School A spends $6,000 per student and School B spends $5,500 per student. By what percentage does School A's spending exceed School B's? Compare this to 100·[log(6,000) − log(5,500)], which is the approximate percentage difference based on the difference in the natural logs. (See Section A.4 in Appendix A.)

C1.4 The data in JTRAIN2.RAW come from a job training experiment conducted for low-income men during 1976-1977 (see Lalonde [1986]).
(i) Use the indicator variable train to determine the fraction of men receiving job training.
(ii) The variable re78 is earnings from 1978, measured in thousands of 1982 dollars. Find the averages of re78 for the sample of men receiving job training and the sample not receiving job training. Is the difference economically large?
(iii) The variable unem78 is an indicator of whether a man is unemployed or not in 1978. What fraction of the men who received job training are unemployed? What about for men who did not receive job training? Comment on the difference.
(iv) From parts (ii) and (iii), does it appear that the job training program was effective? What would make our conclusions more convincing?

Part 1: Regression Analysis with Cross-Sectional Data

Part 1 of the text covers regression analysis with cross-sectional data. It builds upon a solid base of college algebra and basic concepts in probability and statistics. Appendices A, B, and C contain complete reviews of these topics.

Chapter 2 begins with the simple linear regression model, where we explain one variable in terms of another variable. Although simple regression is not widely used in applied econometrics, it is used occasionally and serves as a natural starting point because the algebra and interpretations are relatively straightforward.

Chapters 3 and 4 cover the fundamentals of multiple regression analysis, where we allow more than one variable to affect the variable we are trying to explain. Multiple regression is still the most commonly used method in empirical research, and so
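As an aside, the log-difference approximation invoked in Computer Exercise C1.3(vii) can be checked numerically; the $6,000 and $5,500 figures are those given in the exercise:

```python
import math

# Exact percentage by which $6,000 exceeds $5,500, versus the
# approximation 100*[log(6000) - log(5500)] based on natural logs.
exact = 100 * (6000 - 5500) / 5500
approx = 100 * (math.log(6000) - math.log(5500))

print(round(exact, 2))    # 9.09
print(round(approx, 2))   # 8.7
```

The approximation works well for small percentage changes and deteriorates as the gap grows, which is the point Section A.4 of Appendix A develops.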
these chapters deserve careful attention. Chapter 3 focuses on the algebra of the method of ordinary least squares (OLS), while also establishing conditions under which the OLS estimator is unbiased and best linear unbiased. Chapter 4 covers the important topic of statistical inference.

Chapter 5 discusses the large sample, or asymptotic, properties of the OLS estimators. This provides justification of the inference procedures in Chapter 4 when the errors in a regression model are not normally distributed. Chapter 6 covers some additional topics in regression analysis, including advanced functional form issues, data scaling, prediction, and goodness-of-fit. Chapter 7 explains how qualitative information can be incorporated into multiple regression models.

Chapter 8 illustrates how to test for and correct the problem of heteroskedasticity, or nonconstant variance, in the error terms. We show how the usual OLS statistics can be adjusted, and we also present an extension of OLS, known as weighted least squares, that explicitly accounts for different variances in the errors. Chapter 9 delves further into the very important problem of correlation between the error term and one or more of the explanatory variables. We demonstrate how the availability of a proxy variable can solve the omitted variables problem. In addition, we establish the bias and inconsistency in the OLS estimators in the presence of certain kinds of measurement errors in the variables. Various data problems are also discussed, including the problem of outliers.

CHAPTER 2  The Simple Regression Model

The simple regression model can be used to study the relationship between two variables. For reasons we will see, the simple regression model has limitations as a general tool for empirical analysis. Nevertheless, it is sometimes appropriate as an empirical tool. Learning how to interpret the simple regression model is good practice for studying multiple regression, which we will do in subsequent chapters.

2.1 Definition of the Simple Regression
Model

Much of applied econometric analysis begins with the following premise: y and x are two variables, representing some population, and we are interested in explaining y in terms of x. In writing down a model that will do so, we face three issues. First, since there is never an exact relationship between two variables, how do we allow for other factors to affect y? Second, what is the functional relationship between y and x? And third, how can we be sure we are capturing a ceteris paribus relationship between y and x (if that is a desired goal)?

We can resolve these ambiguities by writing down an equation relating y to x. A simple equation is

y = β0 + β1x + u.  (2.1)

Equation (2.1), which is assumed to hold in the population of interest, defines the simple linear regression model. It is also called the two-variable linear regression model or bivariate linear regression model because it relates the two variables x and y. We now discuss the meaning of each of the quantities in (2.1). (Incidentally, the term "regression" has origins that are not especially important for most modern econometric applications, so we will not explain it here. See Stigler (1986) for an engaging history of regression analysis.)

When related by (2.1), the variables y and x have several different names used interchangeably, as follows: y is called the dependent variable, the explained variable, the response variable, the predicted variable, or the regressand; x is called the independent variable, the explanatory variable, the control variable, the predictor variable, or the regressor. (The term covariate is also used for x.) The terms "dependent variable" and "independent variable" are frequently used in econometrics. But be aware that the label "independent" here does not refer to the statistical notion of independence between random variables (see Appendix B).

The terms "explained" and "explanatory" variables are probably the most descriptive. "Response" and "control" are used mostly in the experimental sciences, where the variable x is under the experimenter's control. We will not use the terms "predicted variable" and "predictor," although you sometimes see these in applications that are purely about prediction and not causality. Our terminology for simple regression is summarized in Table 2.1.

TABLE 2.1  Terminology for Simple Regression

        y                       x
  Dependent variable       Independent variable
  Explained variable       Explanatory variable
  Response variable        Control variable
  Predicted variable       Predictor variable
  Regressand               Regressor

The variable u, called the error term or disturbance in the relationship, represents factors other than x that affect y. A simple regression analysis effectively treats all factors affecting y other than x as being unobserved. You can usefully think of u as standing for "unobserved."

Equation (2.1) also addresses the issue of the functional relationship between y and x. If the other factors in u are held fixed, so that the change in u is zero, Δu = 0, then x has a linear effect on y:

Δy = β1Δx if Δu = 0.  (2.2)

Thus, the change in y is simply β1 multiplied by the change in x. This means that β1 is the slope parameter in the relationship between y and x, holding the other factors in u fixed; it is of primary interest in applied economics. The intercept parameter β0, sometimes called the constant term, also has its uses, although it is rarely central to an analysis.

Example 2.1  Soybean Yield and Fertilizer

Suppose that soybean yield is determined by the model

yield = β0 + β1 fertilizer + u,  (2.3)

so that y = yield and x = fertilizer. The agricultural researcher is interested in the effect of fertilizer on yield, holding other factors fixed. This effect is given by β1. The error term u contains factors such as land quality, rainfall, and so on. The coefficient β1 measures the effect of fertilizer on yield, holding other factors fixed: Δyield = β1Δfertilizer.

Example 2.2  A Simple Wage Equation

A model relating a person's wage to observed education and other unobserved factors is

wage = β0 + β1 educ + u.  (2.4)

If wage is measured in dollars per hour and educ is years of education, then β1 measures the change in hourly wage given another year of education, holding all other
factors fixed. Some of those factors include labor force experience, innate ability, tenure with current employer, work ethic, and innumerable other things.

The linearity of (2.1) implies that a one-unit change in x has the same effect on y, regardless of the initial value of x. This is unrealistic for many economic applications. For example, in the wage-education example, we might want to allow for increasing returns: the next year of education has a larger effect on wages than did the previous year. We will see how to allow for such possibilities in Section 2.4.

The most difficult issue to address is whether model (2.1) really allows us to draw ceteris paribus conclusions about how x affects y. We just saw in equation (2.2) that β1 does measure the effect of x on y, holding all other factors (in u) fixed. Is this the end of the causality issue? Unfortunately, no. How can we hope to learn in general about the ceteris paribus effect of x on y, holding other factors fixed, when we are ignoring all those other factors?

Section 2.5 will show that we are only able to get reliable estimators of β0 and β1 from a random sample of data when we make an assumption restricting how the unobservable u is related to the explanatory variable x. Without such a restriction, we will not be able to estimate the ceteris paribus effect, β1. Because u and x are random variables, we need a concept grounded in probability.

Before we state the key assumption about how x and u are related, we can always make one assumption about u. As long as the intercept β0 is included in the equation, nothing is lost by assuming that the average value of u in the population is zero. Mathematically,

E(u) = 0.  (2.5)

Assumption (2.5) says nothing about the relationship between u and x, but simply makes a statement about the distribution of the unobservables in the population. Using the previous examples for illustration, we can see that assumption (2.5) is not very restrictive. In Example 2.1, we lose nothing by normalizing the unobserved factors affecting soybean yield, such as land quality, to have an average of zero in the population of all cultivated plots. The same is true of the unobserved factors in Example 2.2. Without loss of generality, we can assume that things such as average ability are zero in the population of all working people. If you are not convinced, you should work through Problem 2.2 to see that we can always redefine the intercept in equation (2.1) to make (2.5) true.

We now turn to the crucial assumption regarding how u and x are related. A natural measure of the association between two random variables is the correlation coefficient. (See Appendix B for definition and properties.) If u and x are uncorrelated, then, as random variables, they are not linearly related. Assuming that u and x are uncorrelated goes a long way toward defining the sense in which u and x should be unrelated in equation (2.1). But it does not go far enough, because correlation measures only linear dependence between u and x. Correlation has a somewhat counterintuitive feature: it is possible for u to be uncorrelated with x while being correlated with functions of x, such as x². (See Section B.4 for further discussion.) This possibility is not acceptable for most regression purposes, as it causes problems for interpreting the model and for deriving statistical properties. A better assumption involves the expected value of u given x.

Because u and x are random variables, we can define the conditional distribution of u given any value of x. In particular, for any x, we can obtain the expected (or average) value of u for that slice of the population described by the value of x. The crucial assumption is that the average value of u does not depend on the value of x. We can write this assumption as

E(u|x) = E(u).  (2.6)

Equation (2.6) says that the average value of the unobservables is the same across all slices of the population determined by the value of x, and that the common average is necessarily equal to the average of u over the entire population. When assumption (2.6) holds, we say that u is mean independent of x. (Of course, mean independence is implied by full independence between u and x, an assumption often used in basic probability and statistics.) When we combine mean independence with assumption (2.5), we obtain the zero conditional mean assumption, E(u|x) = 0. It is critical to remember that equation (2.6) is the assumption with impact; assumption (2.5) essentially defines the intercept, β0.

Let us see what (2.6) entails in the wage example. To simplify the discussion, assume that u is the same as innate ability. Then (2.6) requires that the average level of ability is the same, regardless of years of education. For example, if E(abil|8) denotes the average ability for the group of all people with eight years of education, and E(abil|16) denotes the average ability among people in the population with sixteen years of education, then (2.6) implies that these must be the same. In fact, the average ability level must be the same for all education levels. If, for example, we think that average ability increases with years of education, then (2.6) is false. (This would happen if, on average, people with more ability choose to become more educated.) As we cannot observe innate ability, we have no way of knowing whether or not average ability is the same for all education levels. But this is an issue that we must address before relying on simple regression analysis.

QUESTION 2.1
Suppose that a score on a final exam, score, depends on classes attended (attend) and unobserved factors that affect exam performance (such as student ability):

score = β0 + β1 attend + u.  (2.7)

When would you expect this model to satisfy (2.6)?

In the fertilizer example, if fertilizer amounts are chosen independently of other features of the plots, then (2.6) will hold: the average land quality will not depend on the amount of fertilizer. However, if more fertilizer is put on the higher-quality plots of land, then the expected value of u changes with the level of fertilizer, and (2.6) fails.
The zero conditional mean assumption gives β1 another interpretation that is often useful. Taking the expected value of (2.1) conditional on x and using E(u|x) = 0 gives

E(y|x) = β0 + β1x.  (2.8)

Equation (2.8) shows that the population regression function (PRF), E(y|x), is a linear function of x. The linearity means that a one-unit increase in x changes the expected value of y by the amount β1. For any given value of x, the distribution of y is centered about E(y|x), as illustrated in Figure 2.1.

[FIGURE 2.1  E(y|x) as a linear function of x.]

It is important to understand that equation (2.8) tells us how the average value of y changes with x; it does not say that y equals β0 + β1x for all units in the population. For example, suppose that x is the high school grade point average and y is the college GPA, and we happen to know that E(colGPA|hsGPA) = 1.5 + 0.5 hsGPA. (Of course, in practice, we never know the population intercept and slope, but it is useful to pretend momentarily that we do to understand the nature of equation (2.8).) This GPA equation tells us the average college GPA among all students who have a given high school GPA. So suppose that hsGPA = 3.6. Then the average colGPA for all high school graduates who attend college with hsGPA = 3.6 is 1.5 + 0.5(3.6) = 3.3. We are certainly not saying that every student with hsGPA = 3.6 will have a 3.3 college GPA; this is clearly false. The PRF gives us a relationship between the average level of y at different levels of x. Some students with hsGPA = 3.6 will have a college GPA higher than 3.3, and some will have a lower college GPA. Whether the actual colGPA is above or below 3.3 depends on the unobservable factors in u, and those differ among students even within the slice of the population with hsGPA = 3.6.

Given the zero conditional mean assumption E(u|x) = 0, it is useful to view equation (2.1) as breaking y into two components. The piece β0 + β1x, which represents E(y|x), is called the systematic part of y, that is, the part of y explained by x, and u is called the unsystematic part, or the part of y not explained by x. In Chapter 3, when we introduce more than one explanatory variable, we will discuss how to determine how large the systematic part is relative to the unsystematic part.

In the next section, we will use assumptions (2.5) and (2.6) to motivate estimators of β0 and β1 given a random sample of data. The zero conditional mean assumption also plays a crucial role in the statistical analysis in Section 2.5.

2.2 Deriving the Ordinary Least Squares Estimates

Now that we have discussed the basic ingredients of the simple regression model, we will address the important issue of how to estimate the parameters β0 and β1 in equation (2.1). To do this, we need a sample from the population. Let {(xi, yi): i = 1, …, n} denote a random sample of size n from the population. Because these data come from (2.1), we can write

yi = β0 + β1xi + ui  (2.9)

for each i. Here, ui is the error term for observation i because it contains all factors affecting yi other than xi.

As an example, xi might be the annual income and yi the annual savings for family i during a particular year. If we have collected data on fifteen families, then n = 15. A scatterplot of such a data set is given in Figure 2.2, along with the (necessarily fictitious) population regression function. We must decide how to use these data to obtain estimates of the intercept and slope in the population regression of savings on income.

[FIGURE 2.2  Scatterplot of savings and income for 15 families, and the population regression E(savings|income) = β0 + β1 income.]

There are several ways to motivate the following estimation procedure. We will use (2.5) and an important implication of assumption (2.6): in the population, u is uncorrelated with x. Therefore, we see that u has zero expected value and that the covariance between x and u is zero:

E(u) = 0  (2.10)

and

Cov(x, u) = E(xu) = 0,  (2.11)

where the first equality in (2.11) follows from (2.10). (See Section B.4 for the definition and properties of covariance.) In terms of the observable variables x and y and the unknown parameters β0 and β1, equations (2.10) and (2.11) can be written as

E(y − β0 − β1x) = 0  (2.12)

and

E[x(y − β0 − β1x)] = 0,  (2.13)

respectively. Equations (2.12) and (2.13) imply two restrictions on the joint probability distribution of (x, y) in the population. Since there are two unknown parameters to estimate, we might hope that equations (2.12) and (2.13) can be used to obtain good estimators of β0 and β1. In fact, they can be. Given a sample of data, we choose estimates β̂0 and β̂1 to solve the sample counterparts of (2.12) and (2.13):

n⁻¹ Σ (yi − β̂0 − β̂1xi) = 0  (2.14)

and

n⁻¹ Σ xi(yi − β̂0 − β̂1xi) = 0,  (2.15)

where all sums here and below run from i = 1 to n. This is an example of the method of moments approach to estimation. (See Section C.4 for a discussion of different estimation approaches.) These equations can be solved for β̂0 and β̂1.

Using the basic properties of the summation operator from Appendix A, equation (2.14) can be rewritten as

ȳ = β̂0 + β̂1x̄,  (2.16)

where ȳ = n⁻¹ Σ yi is the sample average of the yi, and likewise for x̄. This equation allows us to write β̂0 in terms of β̂1, ȳ, and x̄:

β̂0 = ȳ − β̂1x̄.  (2.17)

Therefore, once we have the slope estimate β̂1, it is straightforward to obtain the intercept estimate β̂0, given ȳ and x̄.

Dropping the n⁻¹ in (2.15) (since it does not affect the solution) and plugging (2.17) into (2.15) yields

Σ xi[yi − (ȳ − β̂1x̄) − β̂1xi] = 0,

which, upon rearrangement, gives

Σ xi(yi − ȳ) = β̂1 Σ xi(xi − x̄).

From basic properties of the summation operator [see (A.7) and (A.8)],

Σ xi(xi − x̄) = Σ (xi − x̄)²  and  Σ xi(yi − ȳ) = Σ (xi − x̄)(yi − ȳ).

Therefore, provided that

Σ (xi − x̄)² > 0,  (2.18)

the estimated slope is

β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)².  (2.19)

Equation (2.19) is simply the sample covariance between x and y divided by the sample variance of x. (Dividing both the numerator and the denominator by n − 1 changes nothing. See Appendix C.) This makes sense because β̂1 equals the population covariance divided by the variance of x when E(u) = 0 and Cov(x, u) = 0. An immediate implication is that if x and y are
positively correlated in the sample, then β̂1 is positive; if x and y are negatively correlated, then β̂1 is negative.

Although the method for obtaining (2.17) and (2.19) is motivated by (2.6), the only assumption needed to compute the estimates for a particular sample is (2.18). This is hardly an assumption at all: (2.18) is true provided the xi in the sample are not all equal to the same value. If (2.18) fails, then we have either been unlucky in obtaining our sample from the population or we have not specified an interesting problem (x does not vary in the population). For example, if y = wage and x = educ, then (2.18) fails only if everyone in the sample has the same amount of education (for example, if everyone is a high school graduate; see Figure 2.3). If just one person has a different amount of education, then (2.18) holds, and the estimates can be computed.

[FIGURE 2.3  A scatterplot of wage against education when educi = 12 for all i.]

The estimates given in (2.17) and (2.19) are called the ordinary least squares (OLS) estimates of β0 and β1. To justify this name, for any β̂0 and β̂1, define a fitted value for y when x = xi as

ŷi = β̂0 + β̂1xi.  (2.20)

This is the value we predict for y when x = xi for the given intercept and slope. There is a fitted value for each observation in the sample. The residual for observation i is the difference between the actual yi and its fitted value:

ûi = yi − ŷi = yi − β̂0 − β̂1xi.  (2.21)

Again, there are n such residuals. (These are not the same as the errors in (2.9), a point we return to in Section 2.5.) The fitted values and residuals are indicated in Figure 2.4.

[FIGURE 2.4  Fitted values and residuals.]

Now, suppose we choose β̂0 and β̂1 to make the sum of squared residuals,

Σ ûi² = Σ (yi − β̂0 − β̂1xi)²,  (2.22)

as small as possible. The appendix to this chapter shows that the conditions necessary for (β̂0, β̂1) to minimize (2.22) are given exactly by equations (2.14) and (2.15), without n⁻¹. Equations (2.14) and (2.15) are often called the first order conditions for the OLS estimates, a term that comes from optimization using calculus (see Appendix A). From our previous calculations, we know that the solutions to the OLS first order conditions are given by (2.17) and (2.19). The name "ordinary least squares" comes from the fact that these estimates minimize the sum of squared residuals.

When we view ordinary least squares as minimizing the sum of squared residuals, it is natural to ask: Why not minimize some other function of the residuals, such as the absolute values of the residuals? In fact, as we will discuss in the more advanced Section 9.4, minimizing the sum of the absolute values of the residuals is sometimes very useful, but it does have some drawbacks. First, we cannot obtain formulas for the resulting estimators; given a data set, the estimates must be obtained by numerical optimization routines. As a consequence, the statistical theory for estimators that minimize the sum of the absolute residuals is very complicated. Minimizing other functions of the residuals, say, the sum of the residuals each raised to the fourth power, has similar drawbacks. (We would never choose our estimates to minimize, say, the sum of the residuals themselves, as residuals large in magnitude but with opposite signs would tend to cancel out.) With OLS, we will be able to derive unbiasedness, consistency, and other important statistical properties relatively easily. Plus, as the motivation in equations (2.13) and (2.14) suggests, and as we will see in Section 2.5, OLS is suited for estimating the parameters appearing in the conditional mean function (2.8).

Once we have determined the OLS intercept and slope estimates, we form the OLS regression line:

ŷ = β̂0 + β̂1x,  (2.23)

where it is understood that β̂0 and β̂1 have been obtained using equations (2.17) and (2.19). The notation ŷ, read as "y hat," emphasizes that the predicted values from equation (2.23) are estimates. The intercept, β̂0, is the predicted value of y when x = 0, although in some cases it will not make sense to set x = 0. In those situations, β̂0 is not, in itself, very interesting.
When using (2.23) to compute predicted values of y for various values of x, we must account for the intercept in the calculations. Equation (2.23) is also called the sample regression function (SRF) because it is the estimated version of the population regression function E(y|x) = β0 + β1x. It is important to remember that the PRF is something fixed, but unknown, in the population. Because the SRF is obtained for a given sample of data, a new sample will generate a different slope and intercept in equation (2.23).

In most cases, the slope estimate, which we can write as

β̂1 = Δŷ/Δx,  (2.24)

is of primary interest. It tells us the amount by which ŷ changes when x increases by one unit. Equivalently,

Δŷ = β̂1Δx,  (2.25)

so that given any change in x (whether positive or negative), we can compute the predicted change in y.

We now present several examples of simple regression obtained by using real data. In other words, we find the intercept and slope estimates with equations (2.17) and (2.19). Since these examples involve many observations, the calculations were done using an econometrics software package. At this point, you should be careful not to read too much into these regressions: they are not necessarily uncovering a causal relationship. We have said nothing so far about the statistical properties of OLS. In Section 2.5, we consider statistical properties after we explicitly impose assumptions on the population model, equation (2.1).

Example 2.3  CEO Salary and Return on Equity

For the population of chief executive officers, let y be annual salary (salary) in thousands of dollars. Thus, y = 856.3 indicates an annual salary of $856,300, and y = 1,452.6 indicates a salary of $1,452,600. Let x be the average return on equity (roe) for the CEO's firm for the previous three years. (Return on equity is defined in terms of net income as a percentage of common equity.) For example, if roe = 10, then average return on equity is 10%.

To study the relationship between this measure of firm performance and CEO compensation, we postulate the simple model

salary = β0 + β1 roe + u.

The slope parameter β1 measures the change in annual salary, in thousands of dollars, when return on equity increases by one percentage point. Because a higher roe is good for the company, we think β1 > 0.

The data set CEOSAL1.RAW contains information on 209 CEOs for the year 1990; these data were obtained from Business Week (5/6/91). In this sample, the average annual salary is $1,281,120, with the smallest and largest being $223,000 and $14,822,000, respectively. The average return on equity for the years 1988, 1989, and 1990 is 17.18%, with the smallest and largest values being 0.5% and 56.3%, respectively.

Using the data in CEOSAL1.RAW, the OLS regression line relating salary to roe is

salary-hat = 963.191 + 18.501 roe,  (2.26)

where the intercept and slope estimates have been rounded to three decimal places; we use "salary hat" to indicate that this is an estimated equation. How do we interpret the equation? First, if the return on equity is zero, roe = 0, then the predicted salary is the intercept, 963.191, which equals $963,191 since salary is measured in thousands. Next, we can write the predicted change in salary as a function of the change in roe: Δsalary-hat = 18.501 (Δroe). This means that if the return on equity increases by one percentage point, Δroe = 1, then salary is predicted to change by about 18.5, or $18,500. Because (2.26) is a linear equation, this is the estimated change regardless of the initial salary.

We can easily use (2.26) to compare predicted salaries at different values of roe. Suppose roe = 30. Then salary-hat = 963.191 + 18.501(30) = 1,518.221, which is just over $1.5 million. However, this does not mean that a particular CEO whose firm had roe = 30 earns $1,518,221. Many other factors affect salary. This is just our prediction from the OLS regression line (2.26). The estimated line is graphed in Figure 2.5, along with the population regression function E(salary|roe). We will never know the PRF, so we cannot tell how close the SRF is to the PRF. Another sample of data will give a different regression line, which may or may not be closer to the population regression line.

[FIGURE 2.5  The OLS regression line salary-hat = 963.191 + 18.501 roe and the (unknown) population regression function E(salary|roe) = β0 + β1 roe.]

Example 2.4  Wage and Education

For the population of people in the workforce in 1976, let y = wage, where wage is measured in dollars per hour. Thus, for a particular person, if wage = 6.75, the hourly wage is $6.75. Let x = educ denote years of schooling; for example, educ = 12 corresponds to a complete high school education. Since the average wage in the sample is $5.90, the Consumer Price Index indicates that this amount is equivalent to $19.06 in 2003 dollars.

Using the data in WAGE1.RAW, where n = 526 individuals, we obtain the following OLS regression line (or sample regression function):

wage-hat = −0.90 + 0.54 educ.  (2.27)

We must interpret this equation with caution. The intercept of −0.90 literally means that a person with no education has a predicted hourly wage of −90 cents an hour. This, of course, is silly. It turns out that only 18 people in the sample of 526 have less than eight years of education. Consequently, it is not surprising that the regression line does poorly at very low levels of education. For a person with eight years of education, the predicted wage is wage-hat = −0.90 + 0.54(8) = 3.42, or $3.42 per hour (in 1976 dollars).

The slope estimate in (2.27) implies that one more year of education increases hourly wage by 54 cents an hour. Therefore, four more years of education increase the predicted wage by 4(0.54) = 2.16, or $2.16 per hour. These are fairly large effects. Because of the linear nature of (2.27), another year of education increases the wage by the same amount, regardless of the initial level of education. In Section 2.4, we discuss some methods that allow for nonconstant marginal effects of our explanatory variables.

QUESTION 2.2
The estimated wage from (2.27), when educ = 8, is $3.42 in 1976 dollars. What is this value in 2003 dollars? (Hint: You have enough information in Example 2.4 to answer this question.)

Example 2.5  Voting Outcomes and Campaign Expenditures

The file VOTE1.RAW contains data on election outcomes and campaign expenditures for 173 two-party races for the U.S. House of Representatives in 1988. There are two candidates in each race, A and B. Let voteA be the percentage of the vote received by Candidate A and shareA be the percentage of total campaign expenditures accounted for by Candidate A. Many factors other than shareA affect the election outcome (including the quality of the candidates and possibly the dollar amounts spent by A and B). Nevertheless, we can estimate a simple regression model to find out whether spending more relative to one's opponent implies a higher share of the vote.

The estimated equation using the 173 observations is

voteA-hat = 26.81 + 0.464 shareA.  (2.28)

This means that if Candidate A's share of spending increases by one percentage point, Candidate A receives almost one-half a percentage point (0.464) more of the total vote. Whether or not this is a causal effect is unclear, but it is not unbelievable. One way to use equation (2.28) is to plug in certain values of shareA and predict voteA. For example, if shareA = 50 (when spending is equal), voteA is predicted to be about 50, or half the vote.

QUESTION 2.3
In Example 2.5, what is the predicted vote for Candidate A if shareA = 60? Does this answer seem reasonable?

In some cases, regression analysis is not used to determine causality but to simply look at whether two variables are positively or negatively related, much like a standard correlation analysis. An example of this occurs in Computer Exercise C2.3, where you are asked to use data on time spent sleeping and working to investigate the tradeoff between these two factors.

A Note on Terminology

In most cases, we will indicate the estimation of a relationship through OLS by writing an equation such as (2.26), (2.27), or (2.28). Sometimes, for the sake of brevity, it is useful to indicate that an OLS regression has been run without actually writing out the equation.
case is appropriate for the vast majority of applications Occasionally we may want to estimate the relationship between y and 6 assuming that the intercept is zero so that x 0 implies that 0 we cover this case briefly in Section 26 Unless explicitly stated otherwise we always estimate an intercept along with a slope 23 Properties of OLS on Any Sample of Data In the previous section we went through the algebra of deriving the formulas for the OLS intercept and slope estimates In this section we cover some further algebraic properties of the fitted OLS regression line The best way to think about these properties is to remember that they hold by construction for any sample of data The harder taskgconsidering the properties of OLS across all possible random samples of dataiis postponed until Section 25 Several of the algebraic properties we are going to derive will appear mundane Never theless having a grasp of these properties helps us to figure out what happens to the OLS estimates and related statistics when the data are manipulated in certain ways such as when the measurement units of the dependent and independent variables change Fitted Values and Residuals We assume that the interceptAand slope estimates Bo and 81 have been obtained for the given sample of data Given Bo and 3 we can obtain the fitted value 2 for each observa tion This is given by equation 220 By definition each fitted value of y is on the OLS regression line The OLS residual associated with observation 139 121 is the difference between y and its fitted value as given in equation 221 If 1211s positive the line under predicts y if 1211s negative the line overpredicts y The ideal case for observation 139 is when 121 0 but in most cases every residual is not equal to zero In other words none of the data points must actually lie on the OLS line E x a m p I e Z 6 CEO Salary and Retum on Equity Table 22 contains a listing of the first 15 observations in the CEO data set along with the fitted values 
called salaryhat and the residuals called ahat Chapter 2 The Simple Regression Model fitted Values and Residuals for the rst 15 05 obsrm me salary sr myhar what 1 141 1095 1224058 7 1290581 2 109 1001 1164854 71638542 3 235 1122 1397969 72759692 4 59 578 1072348 74943484 5 138 1368 1218508 1494923 6 200 1145 1333215 71882151 7 164 1078 1266611 71886108 8 163 1094 1264761 71707606 9 105 1237 1157454 7954626 10 263 833 1449773 76167726 11 259 567 1442372 78753721 12 268 933 1459023 75260231 13 148 1339 1237009 1019911 14 223 937 1375768 74387678 15 563 2011 2004808 6191895 The first four CEOs have lower salaries than what we predicted from the OLS regression line 226 in other words given only the firm s me these CEOs make less than what we predicted As can be seen from the positive what the fifth CEO makes more than predicted from the OLS regres sion line Algebraic Properties of OLS Statistics There are several useful algebraic properties of OLS estimates and their associated statis tics We now cover the three most important of these Part 1 Regression Analysis with Croserectional Data 1 The sum and therefore the sample average of the OLS residuals is zero Mathematically 22 0 t 230 11 This property needs no proof it follows immediately from the OLS first order condition 214 when we remember that the residuals are defined by 121 yl BO 7 Ext In other words the OLS estimates Bo and B are chosen to make the residuals add up to zero for any data set This says nothing about the residual for any particular observation 139 2 The sample covariance between the regressors and the OLS residuals is zero This follows from the first order condition 215 which can be written in terms of the residuals as ixl l 0 231 11 The sample average of the OLS residuals is zero so the lefthand side of 231 is proportional to the sample covariance between xx and 121 3 The point y is always on the OLS regression line In other words if we take equation 223 and plug in i for x then the predicted value 
is ȳ. This is exactly what equation (2.16) showed us.

Example 2.7  Wage and Education

For the data in WAGE1.RAW, the average hourly wage in the sample is 5.90, rounded to two decimal places, and the average education is 12.56. If we plug educ = 12.56 into the OLS regression line (2.27), we get wagehat = −0.90 + 0.54(12.56) = 5.8824, which equals 5.9 when rounded to the first decimal place. These figures do not exactly agree because we have rounded the average wage and education, as well as the intercept and slope estimates. If we did not initially round any of the values, we would get the answers to agree more closely, but to little useful effect.

Writing each y_i as its fitted value plus its residual provides another way to interpret an OLS regression. For each i, write

   y_i = ŷ_i + û_i.    (2.32)

From property (1), the average of the residuals is zero; equivalently, the sample average of the fitted values, ŷ_i, is the same as the sample average of the y_i, or ȳ̂ = ȳ. Further, properties (1) and (2) can be used to show that the sample covariance between ŷ_i and û_i is zero. Thus, we can view OLS as decomposing each y_i into two parts, a fitted value and a residual. The fitted values and residuals are uncorrelated in the sample.

Define the total sum of squares (SST), the explained sum of squares (SSE), and the residual sum of squares (SSR) (also known as the sum of squared residuals) as follows:

   SST ≡ Σ_{i=1}^n (y_i − ȳ)².    (2.33)

   SSE ≡ Σ_{i=1}^n (ŷ_i − ȳ)².    (2.34)

   SSR ≡ Σ_{i=1}^n û_i².    (2.35)

SST is a measure of the total sample variation in the y_i; that is, it measures how spread out the y_i are in the sample. If we divide SST by n − 1, we obtain the sample variance of y, as discussed in Appendix C. Similarly, SSE measures the sample variation in the ŷ_i (where we use the fact that ȳ̂ = ȳ), and SSR measures the sample variation in the û_i. The total variation in y can always be expressed as the sum of the explained variation, SSE, and the unexplained variation, SSR. Thus,

   SST = SSE + SSR.    (2.36)

Proving (2.36) is not difficult, but it requires us to use all of the properties of the summation operator covered
in Appendix A. Write

   Σ_{i=1}^n (y_i − ȳ)² = Σ_{i=1}^n [(y_i − ŷ_i) + (ŷ_i − ȳ)]²
                        = Σ_{i=1}^n [û_i + (ŷ_i − ȳ)]²
                        = Σ_{i=1}^n û_i² + 2 Σ_{i=1}^n û_i(ŷ_i − ȳ) + Σ_{i=1}^n (ŷ_i − ȳ)²
                        = SSR + 2 Σ_{i=1}^n û_i(ŷ_i − ȳ) + SSE.

Now, (2.36) holds if we show that

   Σ_{i=1}^n û_i(ŷ_i − ȳ) = 0.    (2.37)

But we have already claimed that the sample covariance between the residuals and the fitted values is zero, and this covariance is just (2.37) divided by n − 1. Thus, we have established (2.36).

Some words of caution about SST, SSE, and SSR are in order. There is no uniform agreement on the names or abbreviations for the three quantities defined in equations (2.33), (2.34), and (2.35). The total sum of squares is called either SST or TSS, so there is little confusion here. Unfortunately, the explained sum of squares is sometimes called the "regression sum of squares." If this term is given its natural abbreviation, it can easily be confused with the term "residual sum of squares." Some regression packages refer to the explained sum of squares as the "model sum of squares." To make matters even worse, the residual sum of squares is often called the "error sum of squares." This is especially unfortunate because, as we will see in Section 2.5, the errors and the residuals are different quantities. Thus, we will always call (2.35) the residual sum of squares or the sum of squared residuals. We prefer the abbreviation SSR to denote the sum of squared residuals because it is more common in econometric packages.

Goodness-of-Fit

So far, we have no way of measuring how well the explanatory or independent variable, x, explains the dependent variable, y. It is often useful to compute a number that summarizes how well the OLS regression line fits the data. In the following discussion, be sure to remember that we assume that an intercept is estimated along with the slope.

Assuming that the total sum of squares, SST, is not equal to zero (which is true except in the very unlikely event that all the y_i equal the same value) we can divide (2.36) by SST to get 1 = SSE/SST + SSR/SST. The R-squared of the regression, sometimes called the
coefficient of determination, is defined as

   R² ≡ SSE/SST = 1 − SSR/SST.    (2.38)

R² is the ratio of the explained variation to the total variation; thus, it is interpreted as the fraction of the sample variation in y that is explained by x. The second equality in (2.38) provides another way for computing R².

From (2.36), the value of R² is always between zero and one, because SSE can be no greater than SST. When interpreting R², we usually multiply it by 100 to change it into a percent: 100·R² is the percentage of the sample variation in y that is explained by x.

If the data points all lie on the same line, OLS provides a perfect fit to the data. In this case, R² = 1. A value of R² that is nearly equal to zero indicates a poor fit of the OLS line: very little of the variation in the y_i is captured by the variation in the ŷ_i (which all lie on the OLS regression line). In fact, it can be shown that R² is equal to the square of the sample correlation coefficient between y_i and ŷ_i. This is where the term "R-squared" came from. (The letter R was traditionally used to denote an estimate of a population correlation coefficient, and its usage has survived in regression analysis.)

Example 2.8  CEO Salary and Return on Equity

In the CEO salary regression, we obtain the following:

   salaryhat = 963.191 + 18.501 roe    (2.39)
   n = 209, R² = 0.0132.

We have reproduced the OLS regression line and the number of observations for clarity. Using the R-squared (rounded to four decimal places) reported for this equation, we can see how much of the variation in salary is actually explained by the return on equity. The answer is: not much. The firm's return on equity explains only about 1.3 percent of the variation in salaries for this sample of 209 CEOs. That means that 98.7 percent of the salary variation for these CEOs is left unexplained! This lack of explanatory power may not be too surprising because many other characteristics of both the firm and the individual CEO should influence salary; those factors are necessarily included in the errors in a simple regression analysis.
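The decomposition SST = SSE + SSR in (2.36) and the claim that R² equals the squared sample correlation between y_i and ŷ_i can both be checked numerically. A Python sketch on invented toy data (not the CEO sample):

```python
# Verify (2.36) and R-squared = corr(y, yhat)^2 on made-up data.
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / \
     sum((a - xbar) ** 2 for a in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * a for a in x]
uhat = [b - f for b, f in zip(y, yhat)]

sst = sum((b - ybar) ** 2 for b in y)      # total sum of squares (2.33)
sse = sum((f - ybar) ** 2 for f in yhat)   # explained sum of squares (2.34)
ssr = sum(u ** 2 for u in uhat)            # residual sum of squares (2.35)

assert math.isclose(sst, sse + ssr)        # the decomposition (2.36)

r2 = sse / sst
corr = sum((b - ybar) * (f - ybar) for b, f in zip(y, yhat)) / \
       math.sqrt(sst * sse)
assert math.isclose(r2, corr ** 2)         # R-squared is a squared correlation
assert 0.0 <= r2 <= 1.0
```

The second assertion works because Σ(y_i − ȳ)(ŷ_i − ȳ) = SSE once (2.37) is used, so corr(y, ŷ) = √(SSE/SST).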
In the social sciences, low R-squareds in regression equations are not uncommon, especially for cross-sectional analysis. We will discuss this issue more generally under multiple regression analysis, but it is worth emphasizing now that a seemingly low R-squared does not necessarily mean that an OLS regression equation is useless. It is still possible that (2.39) is a good estimate of the ceteris paribus relationship between salary and roe; whether or not this is true does not depend directly on the size of the R-squared. Students who are first learning econometrics tend to put too much weight on the size of the R-squared in evaluating regression equations. For now, be aware that using R-squared as the main gauge of success for an econometric analysis can lead to trouble.

Sometimes, the explanatory variable explains a substantial part of the sample variation in the dependent variable.

Example 2.9  Voting Outcomes and Campaign Expenditures

In the voting outcome equation in (2.28), R² = 0.856. Thus, the share of campaign expenditures explains over 85 percent of the variation in the election outcomes for this sample. This is a sizable portion.

2.4  Units of Measurement and Functional Form

Two important issues in applied economics are (1) understanding how changing the units of measurement of the dependent and/or independent variables affects OLS estimates and (2) knowing how to incorporate popular functional forms used in economics into regression analysis. The mathematics needed for a full understanding of functional form issues is reviewed in Appendix A.

The Effects of Changing Units of Measurement on OLS Statistics

In Example 2.3, we chose to measure annual salary in thousands of dollars, and the return on equity was measured as a percentage (rather than as a decimal). It is crucial to know how salary and roe are measured in this example in order to make sense of the estimates in equation (2.39).

We must also know that OLS estimates change in entirely expected ways when the units of
measurement of the dependent and independent variables change. In Example 2.3, suppose that, rather than measuring salary in thousands of dollars, we measure it in dollars. Let salardol be salary in dollars (salardol = 845,761 would be interpreted as $845,761). Of course, salardol has a simple relationship to the salary measured in thousands of dollars: salardol = 1,000·salary. We do not need to actually run the regression of salardol on roe to know that the estimated equation is

   salardolhat = 963,191 + 18,501 roe.    (2.40)

We obtain the intercept and slope in (2.40) simply by multiplying the intercept and the slope in (2.39) by 1,000. This gives equations (2.39) and (2.40) the same interpretation. Looking at (2.40), if roe = 0, then salardolhat = 963,191, so the predicted salary is $963,191 (the same value we obtained from equation (2.39)). Furthermore, if roe increases by one, then the predicted salary increases by $18,501; again, this is what we concluded from our earlier analysis of equation (2.39).

Generally, it is easy to figure out what happens to the intercept and slope estimates when the dependent variable changes units of measurement. If the dependent variable is multiplied by the constant c (which means each value in the sample is multiplied by c) then the OLS intercept and slope estimates are also multiplied by c. (This assumes nothing has changed about the independent variable.) In the CEO salary example, c = 1,000 in moving from salary to salardol.

We can also use the CEO salary example to see what happens when we change the units of measurement of the independent variable. Define roedec = roe/100 to be the decimal equivalent of roe; thus, roedec = 0.23 means a return on equity of 23 percent. To focus on changing the units of measurement of the independent variable, we return to our original dependent variable, salary, which is measured in thousands of dollars. When we regress salary on roedec, we obtain

   salaryhat = 963.191 + 1,850.1 roedec.    (2.41)

The coefficient on roedec is 100 times the coefficient on roe in (2.39). This is as it should be. Changing roe by one percentage point is equivalent to Δroedec = 0.01. From (2.41), if Δroedec = 0.01, then
Δsalaryhat = 1,850.1(0.01) = 18.501, which is what is obtained by using (2.39). Note that, in moving from (2.39) to (2.41), the independent variable was divided by 100, and so the OLS slope estimate was multiplied by 100, preserving the interpretation of the equation. Generally, if the independent variable is divided or multiplied by some nonzero constant, c, then the OLS slope coefficient is multiplied or divided by c, respectively.

The intercept has not changed in (2.41) because roedec = 0 still corresponds to a zero return on equity. In general, changing the units of measurement of only the independent variable does not affect the intercept.

In the previous section, we defined R-squared as a goodness-of-fit measure for OLS regression. We can also ask what happens to R² when the unit of measurement of either the independent or the dependent variable changes. Without doing any algebra, we should know the result: the goodness-of-fit of the model should not depend on the units of measurement of our variables. For example, the amount of variation in salary explained by the return on equity should not depend on whether salary is measured in dollars or in thousands of dollars, or on whether return on equity is a percentage or a decimal. This intuition can be verified mathematically: using the definition of R², it can be shown that R² is, in fact, invariant to changes in the units of y or x.

Incorporating Nonlinearities in Simple Regression

So far, we have focused on linear relationships between the dependent and independent variables. As we mentioned in Chapter 1, linear relationships are not nearly general enough for all economic applications. Fortunately, it is rather easy to incorporate many nonlinearities into simple regression analysis by appropriately defining the dependent and independent variables. Here, we will cover two possibilities that often appear in applied work.

In reading applied work in the social sciences, you will often encounter regression equations where the dependent variable appears in logarithmic form. Why is
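The two rescaling rules just stated (multiplying y by c scales both estimates by c; dividing x by c multiplies the slope by c and leaves the intercept alone) can be confirmed with a few lines of Python. The numbers below are invented, not the CEOSAL1.RAW sample:

```python
# Check the units-of-measurement rules for OLS on made-up data.
import math

def ols(x, y):
    """Return (intercept, slope) from the usual OLS formulas."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / \
         sum((a - xbar) ** 2 for a in x)
    return ybar - b1 * xbar, b1

roe = [14.1, 10.9, 23.5, 5.9, 13.8, 20.0]      # hypothetical percentages
salary = [1095, 1001, 1122, 578, 1368, 1145]   # in thousands of dollars

b0, b1 = ols(roe, salary)

# Dependent variable in dollars instead of thousands: c = 1000.
b0_dol, b1_dol = ols(roe, [1000 * s for s in salary])
assert math.isclose(b0_dol, 1000 * b0)
assert math.isclose(b1_dol, 1000 * b1)

# Independent variable as a decimal instead of a percentage: x/100.
b0_dec, b1_dec = ols([r / 100 for r in roe], salary)
assert math.isclose(b0_dec, b0)          # intercept unchanged
assert math.isclose(b1_dec, 100 * b1)    # slope multiplied by 100
```

The same script could also confirm that R² is unchanged under either rescaling.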
this done? Recall the wage-education example, where we regressed hourly wage on years of education. We obtained a slope estimate of 0.54 (see equation (2.27)), which means that each additional year of education is predicted to increase hourly wage by 54 cents. Because of the linear nature of (2.27), 54 cents is the increase for either the first year of education or the twentieth year; this may not be reasonable.

Probably a better characterization of how wage changes with education is that each year of education increases wage by a constant percentage. For example, an increase in education from 5 years to 6 years increases wage by, say, 8 percent (ceteris paribus), and an increase in education from 11 to 12 years also increases wage by 8 percent. A model that gives (approximately) a constant percentage effect is

   log(wage) = β0 + β1educ + u,    (2.42)

where log(·) denotes the natural logarithm. (See Appendix A for a review of logarithms.) In particular, if Δu = 0, then

   %Δwage ≈ (100·β1)Δeduc.    (2.43)

Notice how we multiply β1 by 100 to get the percentage change in wage given one additional year of education. Since the percentage change in wage is the same for each additional year of education, the change in wage for an extra year of education increases as education increases; in other words, (2.42) implies an increasing return to education. By exponentiating (2.42), we can write wage = exp(β0 + β1educ + u). This equation is graphed in Figure 2.6, with u = 0.

Estimating a model such as (2.42) is straightforward when using simple regression. Just define the dependent variable, y, to be y = log(wage). The independent variable is represented by x = educ. The mechanics of OLS are the same as before: the intercept and slope estimates are given by the formulas (2.17) and (2.19). In other words, we obtain β̂0 and β̂1 from the OLS regression of log(wage) on educ.

FIGURE 2.6  wage = exp(β0 + β1educ), with β1 > 0.

Example 2.10  A Log Wage Equation

Using the same data as in Example 2.4, but using log(wage) as the dependent variable, we obtain the following relationship:

   log(wage)hat =
0.584 + 0.083 educ    (2.44)
   n = 526, R² = 0.186.

The coefficient on educ has a percentage interpretation when it is multiplied by 100: wage increases by 8.3 percent for every additional year of education. This is what economists mean when they refer to the "return to another year of education."

It is important to remember that the main reason for using the log of wage in (2.42) is to impose a constant percentage effect of education on wage. Once equation (2.44) is obtained, the natural log of wage is rarely mentioned. In particular, it is not correct to say that another year of education increases log(wage) by 8.3 percent. The intercept in (2.44) is not very meaningful, because it gives the predicted log(wage) when educ = 0. The R-squared shows that educ explains about 18.6 percent of the variation in log(wage) (not wage). Finally, equation (2.44) might not capture all of the nonlinearity in the relationship between wage and schooling. If there are "diploma effects," then the twelfth year of education (graduation from high school) could be worth much more than the eleventh year. We will learn how to allow for this kind of nonlinearity in Chapter 7.

Another important use of the natural log is in obtaining a constant elasticity model.

Example 2.11  CEO Salary and Firm Sales

We can estimate a constant elasticity model relating CEO salary to firm sales. The data set is the same one used in Example 2.3, except we now relate salary to sales. Let sales be annual firm sales, measured in millions of dollars. A constant elasticity model is

   log(salary) = β0 + β1log(sales) + u,    (2.45)

where β1 is the elasticity of salary with respect to sales. This model falls under the simple regression model by defining the dependent variable to be y = log(salary) and the independent variable to be x = log(sales). Estimating this equation by OLS gives

   log(salary)hat = 4.822 + 0.257 log(sales)    (2.46)
   n = 209, R² = 0.211.

The coefficient of log(sales) is the estimated elasticity of salary with respect to sales. It implies that a 1 percent increase in firm sales increases CEO salary by about 0.257 percent (the usual interpretation of an elasticity). The
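Estimating a log-level model really is just OLS with the transformed dependent variable. A Python sketch on simulated data (not WAGE1.RAW; the true 8 percent return and all other numbers below are invented to show that 100·β̂1 recovers the percentage effect):

```python
# Simulate a log-level wage equation with a known 8% return to
# education, then estimate it by ordinary OLS on log(wage).
import random

random.seed(1)
educ = [random.randint(8, 18) for _ in range(500)]
# true model: log(wage) = 0.5 + 0.08*educ + u, with E(u|educ) = 0
log_wage = [0.5 + 0.08 * e + random.gauss(0, 0.3) for e in educ]

n = len(educ)
xbar = sum(educ) / n
ybar = sum(log_wage) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(educ, log_wage)) / \
     sum((x - xbar) ** 2 for x in educ)

print(f"estimated return to education: {100 * b1:.1f}% per year")
# the estimate should land close to the true 8% used to generate the data
assert abs(100 * b1 - 8.0) < 3.0
```

Note the interpretation: 100·β̂1 is an approximate percentage change in wage, not a change in log(wage) "percent," exactly as the text cautions.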
two functional forms covered in this section will often arise in the remainder of this text. We have covered models containing natural logarithms here because they appear so frequently in applied work. The interpretation of such models will not be much different in the multiple regression case.

It is also useful to note what happens to the intercept and slope estimates if we change the units of measurement of the dependent variable when it appears in logarithmic form. Because the change to logarithmic form approximates a proportionate change, it makes sense that nothing happens to the slope. We can see this by writing the rescaled variable as c1·y_i for each observation i. The original equation is log(y_i) = β0 + β1x_i + u_i. If we add log(c1) to both sides, we get log(c1) + log(y_i) = [log(c1) + β0] + β1x_i + u_i, or log(c1·y_i) = [log(c1) + β0] + β1x_i + u_i. (Remember that the sum of the logs is equal to the log of their product, as shown in Appendix A.) Therefore, the slope is still β1, but the intercept is now log(c1) + β0. Similarly, if the independent variable is log(x), and we change the units of measurement of x before taking the log, the slope remains the same, but the intercept changes. You will be asked to verify these claims in Problem 2.9.

We end this subsection by summarizing four combinations of functional forms available from using either the original variable or its natural log. In Table 2.3, x and y stand for the variables in their original form. The model with y as the dependent variable and x as the independent variable is called the level-level model, because each variable appears in its level form. The model with log(y) as the dependent variable and x as the independent variable is called the log-level model. We will not explicitly discuss the level-log model here, because it arises less often in practice. In any case, we will see examples of this model in later chapters.

The last column in Table 2.3 gives the interpretation of β1. In the log-level model, 100·β1 is sometimes called the semi-elasticity of y with respect to x. As we mentioned in
TABLE 2.3  Summary of Functional Forms Involving Logarithms

Model         Dependent Variable   Independent Variable   Interpretation of β1
Level-level   y                    x                      Δy = β1Δx
Level-log     y                    log(x)                 Δy = (β1/100)%Δx
Log-level     log(y)               x                      %Δy = (100β1)Δx
Log-log       log(y)               log(x)                 %Δy = β1%Δx

Example 2.11, in the log-log model, β1 is the elasticity of y with respect to x. Table 2.3 warrants careful study, as we will refer to it often in the remainder of the text.

The Meaning of "Linear" Regression

The simple regression model that we have studied in this chapter is also called the simple linear regression model. Yet, as we have just seen, the general model also allows for certain nonlinear relationships. So what does "linear" mean here? You can see by looking at equation (2.1), y = β0 + β1x + u, that the key is that this equation is linear in the parameters β0 and β1. There are no restrictions on how y and x relate to the original explained and explanatory variables of interest. As we saw in Examples 2.10 and 2.11, y and x can be natural logs of variables, and this is quite common in applications. But we need not stop there. For example, nothing prevents us from using simple regression to estimate a model such as cons = β0 + β1√inc + u, where cons is annual consumption and inc is annual income.

Whereas the mechanics of simple regression do not depend on how y and x are defined, the interpretation of the coefficients does depend on their definitions. For successful empirical work, it is much more important to become proficient at interpreting coefficients than to become efficient at computing formulas such as (2.19). We will get much more practice with interpreting the estimates in OLS regression lines when we study multiple regression.

Plenty of models cannot be cast as a linear regression model because they are not linear in their parameters; an example is cons = 1/(β0 + β1inc) + u. Estimation of such models takes us into the realm of the nonlinear regression model, which is beyond the scope of this text. For most applications, choosing a model that can
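The point that "linear" means linear in the parameters can be made concrete: a model like cons = β0 + β1√inc + u is estimable by ordinary OLS once we define x = √inc. A sketch with invented consumption and income figures:

```python
# "Linear in parameters": transform the regressor, then run plain OLS.
# All numbers are hypothetical, chosen only for illustration.
import math

inc = [20.0, 35.0, 50.0, 75.0, 100.0, 140.0]   # hypothetical incomes
cons = [15.0, 22.0, 27.0, 33.0, 38.0, 45.0]    # hypothetical consumption

x = [math.sqrt(v) for v in inc]                # the transformed regressor
n = len(x)
xbar, ybar = sum(x) / n, sum(cons) / n
b1 = sum((a - xbar) * (c - ybar) for a, c in zip(x, cons)) / \
     sum((a - xbar) ** 2 for a in x)
b0 = ybar - b1 * xbar

# The fitted relationship is conshat = b0 + b1*sqrt(inc): nonlinear in
# inc, but estimated with the same two OLS formulas as always.
assert b1 > 0   # consumption rises with income in this toy sample
```

By contrast, cons = 1/(β0 + β1inc) + u admits no such transformation of the variables, because the parameters themselves enter nonlinearly.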
be put into the linear regression framework is sufficient.

2.5  Expected Values and Variances of the OLS Estimators

In Section 2.1, we defined the population model y = β0 + β1x + u, and we claimed that the key assumption for simple regression analysis to be useful is that the expected value of u given any value of x is zero. In Sections 2.2, 2.3, and 2.4, we discussed the algebraic properties of OLS estimation. We now return to the population model and study the statistical properties of OLS. In other words, we now view β̂0 and β̂1 as estimators for the parameters β0 and β1 that appear in the population model. This means that we will study properties of the distributions of β̂0 and β̂1 over different random samples from the population. (Appendix C contains definitions of estimators and reviews some of their important properties.)

Unbiasedness of OLS

We begin by establishing the unbiasedness of OLS under a simple set of assumptions. For future reference, it is useful to number these assumptions using the prefix "SLR" for simple linear regression. The first assumption defines the population model.

Assumption SLR.1 (Linear in Parameters)

In the population model, the dependent variable, y, is related to the independent variable, x, and the error (or disturbance), u, as

   y = β0 + β1x + u,    (2.47)

where β0 and β1 are the population intercept and slope parameters, respectively.

To be realistic, y, x, and u are all viewed as random variables in stating the population model. We discussed the interpretation of this model at some length in Section 2.1 and gave several examples. In the previous section, we learned that equation (2.47) is not as restrictive as it initially seems; by choosing y and x appropriately, we can obtain interesting nonlinear relationships (such as constant elasticity models).

We are interested in using data on y and x to estimate the parameters β0 and, especially, β1. We assume that our data were obtained as a random sample. (See Appendix C for a review of random sampling.)

Assumption SLR.2 (Random Sampling)

We have a
random sample of size n, {(x_i, y_i): i = 1, 2, …, n}, following the population model in equation (2.47).

We will have to address failure of the random sampling assumption in later chapters that deal with time series analysis and sample selection problems. Not all cross-sectional samples can be viewed as outcomes of random samples, but many can be.

We can write (2.47) in terms of the random sample as

   y_i = β0 + β1x_i + u_i,  i = 1, 2, …, n,    (2.48)

where u_i is the error or disturbance for observation i (for example, person i, firm i, city i, and so on). Thus, u_i contains the unobservables for observation i that affect y_i. The u_i should not be confused with the residuals, û_i, that we defined in Section 2.3. Later on, we will explore the relationship between the errors and the residuals. For interpreting β0 and β1 in a particular application, (2.47) is most informative, but (2.48) is also needed for some of the statistical derivations. The relationship (2.48) can be plotted for a particular outcome of data as shown in Figure 2.7.

FIGURE 2.7  Graph of y_i = β0 + β1x_i + u_i; the population regression function is E(y|x) = β0 + β1x.

As we already saw in Section 2.2, the OLS slope and intercept estimates are not defined unless we have some sample variation in the explanatory variable. We now add variation in the x_i to our list of assumptions.

Assumption SLR.3 (Sample Variation in the Explanatory Variable)

The sample outcomes on x, namely, {x_i, i = 1, …, n}, are not all the same value.

This is a very weak assumption (certainly not worth emphasizing, but needed nevertheless). If x varies in the population, random samples on x will typically contain variation, unless the population variation is minimal or the sample size is small. Simple inspection of summary statistics on x_i reveals whether Assumption SLR.3 fails: if the sample standard deviation of x_i is zero, then Assumption SLR.3 fails; otherwise, it holds.

Finally, in order to obtain unbiased estimators of β0 and β1, we need to impose the zero conditional mean assumption that we discussed in some detail in Section 2.1. We now explicitly add it to our list of assumptions.
Assumption SLR.4 (Zero Conditional Mean)

The error u has an expected value of zero given any value of the explanatory variable. In other words,

   E(u|x) = 0.

For a random sample, this assumption implies that E(u_i|x_i) = 0, for all i = 1, 2, …, n.

In addition to restricting the relationship between u and x in the population, the zero conditional mean assumption (coupled with the random sampling assumption) allows for a convenient technical simplification. In particular, we can derive the statistical properties of the OLS estimators as conditional on the values of the x_i in our sample. Technically, in statistical derivations, conditioning on the sample values of the independent variable is the same as treating the x_i as fixed in repeated samples, which we think of as follows. We first choose n sample values for x_1, x_2, …, x_n. (These can be repeated.) Given these values, we then obtain a sample on y (effectively by obtaining a random sample of the u_i). Next, another sample of y is obtained, using the same values for x_1, x_2, …, x_n. Then another sample of y is obtained, again using the same x_1, x_2, …, x_n. And so on.

The fixed-in-repeated-samples scenario is not very realistic in nonexperimental contexts. For instance, in sampling individuals for the wage-education example, it makes little sense to think of choosing the values of educ ahead of time and then sampling individuals with those particular levels of education. Random sampling, where individuals are chosen randomly and their wage and education are both recorded, is representative of how most data sets are obtained for empirical analysis in the social sciences. Once we assume that E(u|x) = 0, and we have random sampling, nothing is lost in derivations by treating the x_i as nonrandom. The danger is that the fixed-in-repeated-samples assumption always implies that u_i and x_i are independent. In deciding when simple regression analysis is going to produce unbiased estimators, it is critical to think in terms of Assumption SLR.4.

Now, we are ready to show that the OLS estimators are unbiased. To this end, we use the fact
that Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) = Σ_{i=1}^n (x_i − x̄)y_i (see Appendix A) to write the OLS slope estimator in equation (2.19) as

   β̂1 = Σ_{i=1}^n (x_i − x̄)y_i / Σ_{i=1}^n (x_i − x̄)².    (2.49)

Because we are now interested in the behavior of β̂1 across all possible samples, β̂1 is properly viewed as a random variable.

We can write β̂1 in terms of the population coefficients and errors by substituting the right-hand side of (2.48) into (2.49). We have

   β̂1 = Σ_{i=1}^n (x_i − x̄)y_i / SST_x = Σ_{i=1}^n (x_i − x̄)(β0 + β1x_i + u_i) / SST_x,    (2.50)

where we have defined the total variation in x_i as SST_x ≡ Σ_{i=1}^n (x_i − x̄)² to simplify the notation. (This is not quite the sample variance of the x_i because we do not divide by n − 1.) Using the algebra of the summation operator, write the numerator of β̂1 as

   Σ_{i=1}^n (x_i − x̄)β0 + Σ_{i=1}^n (x_i − x̄)β1x_i + Σ_{i=1}^n (x_i − x̄)u_i
      = β0 Σ_{i=1}^n (x_i − x̄) + β1 Σ_{i=1}^n (x_i − x̄)x_i + Σ_{i=1}^n (x_i − x̄)u_i.    (2.51)

As shown in Appendix A, Σ_{i=1}^n (x_i − x̄) = 0 and Σ_{i=1}^n (x_i − x̄)x_i = Σ_{i=1}^n (x_i − x̄)² = SST_x. Therefore, we can write the numerator of β̂1 as β1SST_x + Σ_{i=1}^n (x_i − x̄)u_i. Putting this over the denominator gives

   β̂1 = β1 + Σ_{i=1}^n (x_i − x̄)u_i / SST_x = β1 + (1/SST_x) Σ_{i=1}^n d_i u_i,    (2.52)

where d_i = x_i − x̄. We now see that the estimator β̂1 equals the population slope, β1, plus a term that is a linear combination in the errors {u_1, u_2, …, u_n}. Conditional on the values of x_i, the randomness in β̂1 is due entirely to the errors in the sample. The fact that these errors are generally different from zero is what causes β̂1 to differ from β1.

Using the representation in (2.52), we can prove the first important statistical property of OLS.

Theorem 2.1 (Unbiasedness of OLS)

Using Assumptions SLR.1 through SLR.4,

   E(β̂0) = β0 and E(β̂1) = β1,    (2.53)

for any values of β0 and β1. In other words, β̂0 is unbiased for β0, and β̂1 is unbiased for β1.

PROOF: In this proof, the expected values are conditional on the sample values of the independent variable. Because SST_x and d_i are functions only of the x_i, they are nonrandom in the conditioning. Therefore, from (2.52), and keeping the conditioning on {x_1, x_2, …, x_n} implicit, we have

   E(β̂1) = β1 + E[(1/SST_x) Σ_{i=1}^n d_i u_i] = β1 + (1/SST_x) Σ_{i=1}^n E(d_i u_i)
          = β1 + (1/SST_x) Σ_{i=1}^n d_i E(u_i) = β1 + (1/SST_x) Σ_{i=1}^n d_i · 0 = β1,

where we have used the fact that the expected value of each u_i (conditional on {x_1, x_2, …, x_n})
is zero under Assumptions SLR.2 and SLR.4. Since unbiasedness holds for any outcome on {x_1, x_2, …, x_n}, unbiasedness also holds without conditioning on {x_1, x_2, …, x_n}.

The proof for β̂0 is now straightforward. Average (2.48) across i to get ȳ = β0 + β1x̄ + ū, and plug this into the formula for β̂0:

   β̂0 = ȳ − β̂1x̄ = β0 + β1x̄ + ū − β̂1x̄ = β0 + (β1 − β̂1)x̄ + ū.

Then, conditional on the values of the x_i,

   E(β̂0) = β0 + E[(β1 − β̂1)x̄] + E(ū) = β0 + E[(β1 − β̂1)]x̄,

since E(ū) = 0 by Assumptions SLR.2 and SLR.4. But we showed that E(β̂1) = β1, which implies that E[(β1 − β̂1)] = 0. Thus, E(β̂0) = β0. Both of these arguments are valid for any values of β0 and β1, and so we have established unbiasedness.

Remember that unbiasedness is a feature of the sampling distributions of β̂1 and β̂0, which says nothing about the estimate that we obtain for a given sample. We hope that, if the sample we obtain is somehow "typical," then our estimate should be "near" the population value. Unfortunately, it is always possible that we could obtain an unlucky sample that would give us a point estimate far from β1, and we can never know for sure whether this is the case. You may want to review the material on unbiased estimators in Appendix C, especially the simulation exercise in Table C.1 that illustrates the concept of unbiasedness.

Unbiasedness generally fails if any of our four assumptions fail. This means that it is important to think about the veracity of each assumption for a particular application. Assumption SLR.1 requires that y and x be linearly related, with an additive disturbance. This can certainly fail. But we also know that y and x can be chosen to yield interesting nonlinear relationships. Dealing with the failure of (2.47) requires more advanced methods that are beyond the scope of this text.

Later, we will have to relax Assumption SLR.2, the random sampling assumption, for time series analysis. But what about using it for cross-sectional analysis? Random sampling can fail in a cross section when samples are not representative of the underlying population; in fact, some data sets are
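Theorem 2.1 is a statement about the sampling distribution of β̂1, and it can be illustrated by simulation, much like the exercise in Table C.1. A Python sketch with invented population parameters satisfying SLR.1 through SLR.4: each sample's slope estimate misses β1 = 2, but the estimates average out to it.

```python
# Monte Carlo illustration of unbiasedness: draw many random samples
# from a population obeying SLR.1-SLR.4 and average the slope estimates.
import random

random.seed(42)
beta0, beta1 = 1.0, 2.0            # hypothetical population parameters
estimates = []
for _ in range(2000):
    x = [random.uniform(0, 10) for _ in range(30)]
    u = [random.gauss(0, 1) for _ in range(30)]   # E(u|x) = 0 by design
    y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]
    xbar, ybar = sum(x) / 30, sum(y) / 30
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    estimates.append(b1)

mean_b1 = sum(estimates) / len(estimates)
# individual estimates scatter around 2, but their average is very close
assert abs(mean_b1 - beta1) < 0.01
```

Any single `b1` in the list can be an "unlucky" draw well away from 2; unbiasedness only disciplines the average across repeated samples.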
constructed by intentionally oversampling different parts of the population. We will discuss problems of nonrandom sampling in Chapters 9 and 17.

As we have already discussed, Assumption SLR.3 almost always holds in interesting regression applications. Without it, we cannot even obtain the OLS estimates.

The assumption we should concentrate on for now is SLR.4. If SLR.4 holds, the OLS estimators are unbiased. Likewise, if SLR.4 fails, the OLS estimators generally will be biased. There are ways to determine the likely direction and size of the bias, which we will study in Chapter 3.

The possibility that x is correlated with u is almost always a concern in simple regression analysis with nonexperimental data, as we indicated with several examples in Section 2.1. Using simple regression when u contains factors affecting y that are also correlated with x can result in spurious correlation: that is, we find a relationship between y and x that is really due to other unobserved factors that affect y and also happen to be correlated with x.

Example 2.12  Student Math Performance and the School Lunch Program

Let math10 denote the percentage of tenth graders at a high school receiving a passing score on a standardized mathematics exam. Suppose we wish to estimate the effect of the federally funded school lunch program on student performance. If anything, we expect the lunch program to have a positive ceteris paribus effect on performance: all other factors being equal, if a student who is too poor to eat regular meals becomes eligible for the school lunch program, his or her performance should improve. Let lnchprg denote the percentage of students who are eligible for the lunch program. Then, a simple regression model is

   math10 = β0 + β1lnchprg + u,    (2.54)

where u contains school and student characteristics that affect overall school performance. Using the data in MEAP93.RAW on 408 Michigan high schools for the 1992-1993 school year, we obtain

   math10hat = 32.14 − 0.319 lnchprg
   n = 408, R² = 0.171.

This
equation predicts that if student eligibility in the lunch program increases by 10 percentage points, the percentage of students passing the math exam falls by about 3.2 percentage points. Do we really believe that higher participation in the lunch program actually causes worse performance? Almost certainly not. A better explanation is that the error term u in equation (2.54) is correlated with lnchprg. In fact, u contains factors such as the poverty rate of children attending school, which affects student performance and is highly correlated with eligibility in the lunch program. Variables such as school quality and resources are also contained in u, and these are likely correlated with lnchprg. It is important to remember that the estimate −0.319 is only for this particular sample, but its sign and magnitude make us suspect that u and x are correlated, so that simple regression is biased.

In addition to omitted variables, there are other reasons for x to be correlated with u in the simple regression model. Because the same issues arise in multiple regression analysis, we will postpone a systematic treatment of the problem until then.

Variances of the OLS Estimators

In addition to knowing that the sampling distribution of β̂1 is centered about β1 (β̂1 is unbiased), it is important to know how far we can expect β̂1 to be away from β1 on average. Among other things, this allows us to choose the best estimator among all, or at least a broad class of, unbiased estimators. The measure of spread in the distribution of β̂1 (and β̂0) that is easiest to work with is the variance or its square root, the standard deviation. (See Appendix C for a more detailed discussion.)

It turns out that the variance of the OLS estimators can be computed under Assumptions SLR.1 through SLR.4. However, these expressions would be somewhat complicated. Instead, we add an assumption that is traditional for cross-sectional analysis. This assumption states that the variance of the unobservable, u, conditional on x, is constant. This is known as the
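The mechanism suspected in Example 2.12, a factor like poverty hiding in u and correlated with x, can be made concrete by simulation. In the hedged sketch below, all numbers are invented and "poverty" simply stands in for the omitted factor: the true direct effect of x on y is +0.5, yet the simple regression slope comes out negative because SLR.4 fails.

```python
# Simulated failure of SLR.4: an omitted factor drives both x and y,
# so E(u|x) != 0 and OLS is biased. All parameters are hypothetical.
import random

random.seed(7)
n = 5000
poverty = [random.gauss(0, 1) for _ in range(n)]
# x rises with poverty (as lnchprg does); other variation is random
x = [10 + 3 * p + random.gauss(0, 1) for p in poverty]
# y falls with poverty; the true ceteris paribus effect of x is +0.5,
# but poverty is unobserved and therefore sits inside the error term
y = [50 + 0.5 * xi - 4 * p + random.gauss(0, 1)
     for xi, p in zip(x, poverty)]

xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)

# the slope is pulled far below the true +0.5 and, with these invented
# parameters, even turns negative, mimicking the lunch program result
assert b1 < 0
```

Here the population bias term is Cov(x, u)/Var(x), which these parameters make strongly negative; no amount of extra data fixes it, only controlling for the omitted factor does (the subject of Chapter 3).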
homoskedasticity, or constant variance, assumption.

Chapter 2: The Simple Regression Model

Assumption SLR.5 (Homoskedasticity)
The error u has the same variance given any value of the explanatory variable. In other words, Var(u|x) = σ².

We must emphasize that the homoskedasticity assumption is quite distinct from the zero conditional mean assumption, E(u|x) = 0. Assumption SLR.4 involves the expected value of u, while Assumption SLR.5 concerns the variance of u (both conditional on x). Recall that we established the unbiasedness of OLS without Assumption SLR.5: the homoskedasticity assumption plays no role in showing that β̂0 and β̂1 are unbiased. We add Assumption SLR.5 because it simplifies the variance calculations for β̂0 and β̂1 and because it implies that ordinary least squares has certain efficiency properties, which we will see in Chapter 3. If we were to assume that u and x are independent, then the distribution of u given x does not depend on x, and so E(u|x) = E(u) = 0 and Var(u|x) = σ². But independence is sometimes too strong of an assumption.

Because Var(u|x) = E(u²|x) − [E(u|x)]² and E(u|x) = 0, we have σ² = E(u²|x), which means σ² is also the unconditional expectation of u². Therefore, σ² = E(u²) = Var(u), because E(u) = 0. In other words, σ² is the unconditional variance of u, and so σ² is often called the error variance or disturbance variance. The square root of σ², σ, is the standard deviation of the error. A larger σ means that the distribution of the unobservables affecting y is more spread out.

It is often useful to write Assumptions SLR.4 and SLR.5 in terms of the conditional mean and conditional variance of y:

E(y|x) = β0 + β1x.    (2.55)
Var(y|x) = σ².    (2.56)

In other words, the conditional expectation of y given x is linear in x, but the variance of y given x is constant. This situation is graphed in Figure 2.8, where β0 > 0 and β1 > 0.

When Var(u|x) depends on x, the error term is said to exhibit heteroskedasticity, or nonconstant variance. Because Var(u|x) = Var(y|x), heteroskedasticity is present whenever Var(y|x) is a function of x.

Example 2.13
(Heteroskedasticity in a Wage Equation)

In order to get an unbiased estimator of the ceteris paribus effect of educ on wage, we must assume that E(u|educ) = 0, and this implies E(wage|educ) = β0 + β1educ. If we also make the homoskedasticity assumption, then Var(u|educ) = σ² does not depend on the level of education, which is the same as assuming Var(wage|educ) = σ². Thus, while average wage is allowed to increase with education level (it is this rate of increase that we are interested in estimating), the variability in wage about its mean is assumed to be constant across all education levels. This may not be realistic. It is likely that people with more education have a wider variety of interests and job opportunities, which could lead to more wage variability at higher levels of education. People with very low levels of education have fewer opportunities and often must work at the minimum wage; this serves to reduce wage variability at low education levels. This situation is shown in Figure 2.9. Ultimately, whether Assumption SLR.5 holds is an empirical issue, and in Chapter 8 we will show how to test Assumption SLR.5.

FIGURE 2.8: The simple regression model under homoskedasticity. (The conditional mean E(y|x) = β0 + β1x shifts with x while the spread of y about E(y|x) is constant.)

With the homoskedasticity assumption in place, we are ready to prove the following:

Theorem 2.2 (Sampling Variances of the OLS Estimators)
Under Assumptions SLR.1 through SLR.5,

Var(β̂1) = σ²/SSTx = σ² / Σ (x_i − x̄)²,    (2.57)
Var(β̂0) = σ² (n⁻¹ Σ x_i²) / SSTx,    (2.58)

where these are conditional on the sample values {x_1, …, x_n}.

FIGURE 2.9: Var(wage|educ) increasing with educ. (The conditional mean is E(wage|educ) = β0 + β1educ; the spread of wage about this line widens at higher education levels.)

PROOF: We derive the formula for Var(β̂1), leaving the other derivation as Problem 2.10. The starting point is equation (2.52): β̂1 = β1 + (1/SSTx) Σ d_i u_i. Because β1 is just a constant, and we are conditioning on the x_i, SSTx and d_i = x_i − x̄ are also nonrandom. Furthermore, because the u_i are independent random variables across i (by random sampling), the variance of the sum is the sum of the variances. Using these facts, we have Var(β̂1) =
(1/SSTx)² Var(Σ d_i u_i)
= (1/SSTx)² Σ d_i² Var(u_i)
= (1/SSTx)² Σ d_i² σ²    [since Var(u_i) = σ² for all i]
= σ² (1/SSTx)² Σ d_i²
= σ² (1/SSTx)² SSTx
= σ²/SSTx,

which is what we wanted to show.

Equations (2.57) and (2.58) are the "standard" formulas for simple regression analysis, which are invalid in the presence of heteroskedasticity. This will be important when we turn to confidence intervals and hypothesis testing in multiple regression analysis.

The formulas in (2.57) and (2.58) show that the variance of β̂1 depends on the error variance, σ², and the total variation in {x_1, …, x_n}, SSTx = Σ (x_i − x̄)². First, the larger the error variance, the larger is Var(β̂1). This makes sense, since more variation in the unobservables affecting y makes it more difficult to precisely estimate β1. On the other hand, more variability in the independent variable is preferred: as the variability in the x_i increases, the variance of β̂1 decreases. This also makes intuitive sense, since the more spread out is the sample of independent variables, the easier it is to trace out the relationship between E(y|x) and x; that is, the easier it is to estimate β1. If there is little variation in the x_i, then it can be hard to pinpoint how E(y|x) varies with x. As the sample size increases, so does the total variation in the x_i. Therefore, a larger sample size results in a smaller variance for β̂1.

This analysis shows that, if we are interested in β1 and we have a choice, then we should choose the x_i to be as spread out as possible.

QUESTION: Show that, when estimating β0, it is best to have x̄ = 0. What is Var(β̂0) in this case? (Hint: For any sample of numbers, Σ x_i² ≥ Σ (x_i − x̄)², with equality only if x̄ = 0.)

Sometimes, we have an opportunity to obtain larger sample sizes, although this can be costly.

For the purposes of constructing confidence intervals and deriving test statistics, we will need to work with the standard deviations of β̂1 and β̂0, sd(β̂1) and sd(β̂0). Recall that these are obtained by taking the square roots of the variances in (2.57) and (2.58). In particular, sd(β̂1) = σ/√SSTx, where σ is the square root of σ², and √SSTx is the square root of SSTx.

Estimating the Error Variance

The formulas in (2.57) and (2.58) allow us to isolate the factors that contribute to Var(β̂1) and Var(β̂0). But these formulas are unknown, except in the extremely rare
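The variance formula just derived can be checked informally by simulation (this code and its numbers are our illustration, not from the text): holding the sample values x_i fixed and redrawing homoskedastic errors many times, the sampling variance of the slope estimates should be close to σ²/SSTx.

```python
import numpy as np

# Informal Monte Carlo check (invented numbers): holding the x_i fixed and
# redrawing homoskedastic errors, the variance of the OLS slope estimates
# should be close to sigma^2 / SST_x, as in equation (2.57).
rng = np.random.default_rng(1)
n, sigma2, reps = 50, 4.0, 20_000

x = rng.uniform(0, 10, size=n)               # conditioning on these sample values
sst_x = np.sum((x - x.mean()) ** 2)

slopes = np.empty(reps)
for r in range(reps):
    u = rng.normal(scale=np.sqrt(sigma2), size=n)
    y = 1.0 + 2.0 * x + u                    # true beta1 = 2
    slopes[r] = np.sum((x - x.mean()) * (y - y.mean())) / sst_x

print(slopes.var(), sigma2 / sst_x)          # the two numbers should nearly agree
```

Rerunning with a more spread-out x (say, uniform on [0, 20]) shrinks both numbers, matching the discussion of SSTx above.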
case that σ² is known. Nevertheless, we can use the data to estimate σ², which then allows us to estimate Var(β̂1) and Var(β̂0).

This is a good place to emphasize the difference between the errors (or disturbances) and the residuals, since this distinction is crucial for constructing an estimator of σ². Equation (2.48) shows how to write the population model in terms of a randomly sampled observation as y_i = β0 + β1x_i + u_i, where u_i is the error for observation i. We can also express y_i in terms of its fitted value and residual, as in equation (2.32): y_i = ŷ_i + û_i. Comparing these two equations, we see that the error shows up in the equation containing the population parameters, β0 and β1. On the other hand, the residuals show up in the estimated equation with β̂0 and β̂1. The errors are never observable, while the residuals are computed from the data.

We can use equations (2.32) and (2.48) to write the residuals as a function of the errors:

û_i = y_i − β̂0 − β̂1x_i = (β0 + β1x_i + u_i) − β̂0 − β̂1x_i,

or

û_i = u_i − (β̂0 − β0) − (β̂1 − β1)x_i.    (2.59)

Although the expected value of β̂0 equals β0, and similarly for β̂1, û_i is not the same as u_i. The difference between them does have an expected value of zero.

Now that we understand the difference between the errors and the residuals, we can return to estimating σ². First, σ² = E(u²), so an "unbiased estimator" of σ² is n⁻¹ Σ u_i². Unfortunately, this is not a true estimator, because we do not observe the errors u_i. But we do have estimates of the u_i, namely, the OLS residuals û_i. If we replace the errors with the OLS residuals, we have n⁻¹ Σ û_i² = SSR/n. This is a true estimator, because it gives a computable rule for any sample of data on x and y. One slight drawback to this estimator is that it turns out to be biased (although for large n, the bias is small). Because it is easy to compute an unbiased estimator, we use that instead.

The estimator SSR/n is biased essentially because it does not account for two restrictions that must be satisfied by the OLS residuals. These restrictions are given by the two OLS first order
conditions:

Σ_{i=1}^{n} û_i = 0,  Σ_{i=1}^{n} x_i û_i = 0.    (2.60)

One way to view these restrictions is this: if we know n − 2 of the residuals, we can always get the other two residuals by using the restrictions implied by the first order conditions in (2.60). Thus, there are only n − 2 degrees of freedom in the OLS residuals, as opposed to n degrees of freedom in the errors. (If we were to replace û_i with u_i in (2.60), the restrictions would no longer hold.)

The unbiased estimator of σ² that we will use makes a degrees of freedom adjustment:

σ̂² = (1/(n − 2)) Σ_{i=1}^{n} û_i² = SSR/(n − 2).    (2.61)

(This estimator is sometimes denoted as s², but we continue to use the convention of putting "hats" over estimators.)

Theorem 2.3 (Unbiased Estimation of σ²)
Under Assumptions SLR.1 through SLR.5, E(σ̂²) = σ².

PROOF: If we average equation (2.59) across all i and use the fact that the OLS residuals average out to zero, we have 0 = ū − (β̂0 − β0) − (β̂1 − β1)x̄; subtracting this from (2.59) gives û_i = (u_i − ū) − (β̂1 − β1)(x_i − x̄). Therefore, û_i² = (u_i − ū)² + (β̂1 − β1)²(x_i − x̄)² − 2(u_i − ū)(β̂1 − β1)(x_i − x̄). Summing across all i gives

Σ û_i² = Σ (u_i − ū)² + (β̂1 − β1)² Σ (x_i − x̄)² − 2(β̂1 − β1) Σ u_i(x_i − x̄).

Now, the expected value of the first term is (n − 1)σ², something that is shown in Appendix C. The expected value of the second term is simply σ², because E[(β̂1 − β1)²] = Var(β̂1) = σ²/SSTx. Finally, the third term can be written as −2(β̂1 − β1)² SSTx; taking expectations gives −2σ². Putting these three terms together gives E(Σ û_i²) = (n − 1)σ² + σ² − 2σ² = (n − 2)σ², so that E[SSR/(n − 2)] = σ².

If σ̂² is plugged into the variance formulas (2.57) and (2.58), then we have unbiased estimators of Var(β̂1) and Var(β̂0). Later on, we will need estimators of the standard deviations of β̂1 and β̂0, and this requires estimating σ. The natural estimator of σ is

σ̂ = √σ̂²    (2.62)

and is called the standard error of the regression (SER). (Other names for σ̂ are the standard error of the estimate and the root mean squared error, but we will not use these.) Although σ̂ is not an unbiased estimator of σ, we can show that it is a consistent estimator of σ (see Appendix C), and it will serve our purposes well.

The estimate σ̂ is
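The last two points can be illustrated numerically (this sketch and its numbers are our invention, not part of the text): in every sample the OLS residuals satisfy the two restrictions in (2.60) exactly, while the errors do not; and across many simulated samples SSR/n is biased downward, whereas the degrees-of-freedom-adjusted SSR/(n − 2) centers on the true σ².

```python
import numpy as np

# Illustrative sketch (invented numbers): residuals vs. errors, the OLS
# restrictions in (2.60), and the degrees-of-freedom adjustment in (2.61).
rng = np.random.default_rng(2)
n, reps = 10, 40_000
x = rng.uniform(0, 5, size=n)                # x_i held fixed across replications

biased = np.empty(reps)
unbiased = np.empty(reps)
for r in range(reps):
    u = rng.normal(scale=3.0, size=n)        # errors, Var(u) = 9
    y = 2.0 - 1.0 * x + u
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    uhat = y - b0 - b1 * x                   # residuals, computed from the data
    # The residuals satisfy (2.60) exactly; the errors generally do not.
    assert abs(uhat.sum()) < 1e-8 and abs((x * uhat).sum()) < 1e-7
    ssr = np.sum(uhat ** 2)
    biased[r] = ssr / n
    unbiased[r] = ssr / (n - 2)

print(biased.mean(), unbiased.mean())        # roughly 7.2 and 9.0
```

With n = 10, the downward bias of SSR/n is the factor (n − 2)/n = 0.8, which is exactly what the simulated averages show.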
interesting because it is an estimate of the standard deviation in the unobservables affecting y; equivalently, it estimates the standard deviation in y after the effect of x has been taken out. Most regression packages report the value of σ̂ along with the R-squared, intercept, slope, and other OLS statistics (under one of the several names listed above). For now, our primary interest is in using σ̂ to estimate the standard deviations of β̂0 and β̂1. Since sd(β̂1) = σ/√SSTx, the natural estimator of sd(β̂1) is

se(β̂1) = σ̂/√SSTx = σ̂ / (Σ (x_i − x̄)²)^{1/2};

this is called the standard error of β̂1. Note that se(β̂1) is viewed as a random variable when we think of running OLS over different samples of y; this is true because σ̂ varies with different samples. For a given sample, se(β̂1) is a number, just as β̂1 is simply a number when we compute it from the given data.

Similarly, se(β̂0) is obtained from sd(β̂0) by replacing σ with σ̂. The standard error of any estimate gives us an idea of how precise the estimator is. Standard errors play a central role throughout this text; we will use them to construct test statistics and confidence intervals for every econometric procedure we cover, starting in Chapter 4.

2.6 Regression through the Origin

In rare cases, we wish to impose the restriction that when x = 0, the expected value of y is zero. There are certain relationships for which this is reasonable. For example, if income (x) is zero, then income tax revenues (y) must also be zero. In addition, there are settings where a model that originally has a nonzero intercept is transformed into a model without an intercept.

Formally, we now choose a slope estimator, which we call β̃1, and a line of the form

ỹ = β̃1x,    (2.63)

where the tildes over β̃1 and ỹ are used to distinguish this problem from the much more common problem of estimating an intercept along with a slope. Obtaining (2.63) is called regression through the origin because the line (2.63) passes through the point x = 0, ỹ = 0. To obtain the slope estimate in (2.63), we still rely on the method of ordinary least squares, which in
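A quick numeric illustration of regression through the origin (hypothetical code and numbers, not from the text): the through-origin slope, Σ x_i y_i / Σ x_i², generally differs from the usual OLS slope, and when the true intercept is positive and the x_i are positive, it is biased upward.

```python
import numpy as np

# Quick illustration (invented numbers): the through-origin slope
# sum(x*y)/sum(x*x) differs from the usual OLS slope whenever xbar != 0,
# and with a positive true intercept and positive x it is biased upward.
rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=200)
y = 4.0 + 1.5 * x + rng.normal(size=200)     # true slope 1.5, intercept 4

b1_origin = np.sum(x * y) / np.sum(x * x)                                        # no intercept
b1_usual = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # with intercept

print(b1_usual, b1_origin)                   # usual slope near 1.5; origin slope larger
```

This is the bias you are asked to derive in Problem 2.8.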
this case minimizes the sum of squared residuals:

Σ_{i=1}^{n} (y_i − β̃1x_i)².    (2.64)

Using one-variable calculus, it can be shown that β̃1 must solve the first order condition

Σ_{i=1}^{n} x_i(y_i − β̃1x_i) = 0.    (2.65)

From this, we can solve for β̃1:

β̃1 = (Σ x_i y_i) / (Σ x_i²),    (2.66)

provided that not all the x_i are zero, a case we rule out.

Note how β̃1 compares with the slope estimate when we also estimate the intercept (rather than set it equal to zero). These two estimates are the same if, and only if, x̄ = 0. (See equation (2.49) for β̂1.) Obtaining an estimate of β1 using regression through the origin is not done very often in applied work, and for good reason: if the intercept β0 ≠ 0, then β̃1 is a biased estimator of β1. You will be asked to prove this in Problem 2.8.

SUMMARY

We have introduced the simple linear regression model in this chapter, and we have covered its basic properties. Given a random sample, the method of ordinary least squares is used to estimate the slope and intercept parameters in the population model. We have demonstrated the algebra of the OLS regression line, including computation of fitted values and residuals, and the obtaining of predicted changes in the dependent variable for a given change in the independent variable. In Section 2.4, we discussed two issues of practical importance: (1) the behavior of the OLS estimates when we change the units of measurement of the dependent variable or the independent variable, and (2) the use of the natural log to allow for constant elasticity and constant semi-elasticity models.

In Section 2.5, we showed that, under the four Assumptions SLR.1 through SLR.4, the OLS estimators are unbiased. The key assumption is that the error term u has zero mean given any value of the independent variable x. Unfortunately, there are reasons to think this is false in many social science applications of simple regression, where the omitted factors in u are often correlated with x. When we add the assumption that the variance of the error given x is constant, we get simple formulas for the
sampling variances of the OLS estimators. As we saw, the variance of the slope estimator β̂1 increases as the error variance increases, and it decreases when there is more sample variation in the independent variable. We also derived an unbiased estimator for σ² = Var(u). In Section 2.6, we briefly discussed regression through the origin, where the slope estimator is obtained under the assumption that the intercept is zero. Sometimes this is useful, but it appears infrequently in applied work.

Much work is left to be done. For example, we still do not know how to test hypotheses about the population parameters, β0 and β1. Thus, although we know that OLS is unbiased for the population parameters under Assumptions SLR.1 through SLR.4, we have no way of drawing inference about the population. Other topics, such as the efficiency of OLS relative to other possible procedures, have also been omitted.

The issues of confidence intervals, hypothesis testing, and efficiency are central to multiple regression analysis as well. Since the way we construct confidence intervals and test statistics is very similar for multiple regression (and because simple regression is a special case of multiple regression), our time is better spent moving on to multiple regression, which is much more widely applicable than simple regression. Our purpose in Chapter 2 was to get you thinking about the issues that arise in econometric analysis in a fairly simple setting.

The Gauss-Markov Assumptions for Simple Regression

For convenience, we summarize the Gauss-Markov assumptions that we used in this chapter. It is important to remember that only SLR.1 through SLR.4 are needed to show β̂0 and β̂1 are unbiased. We added the homoskedasticity assumption, SLR.5, to obtain the usual OLS variance formulas (2.57) and (2.58).

Assumption SLR.1 (Linear in Parameters)
In the population model, the dependent variable, y, is related to the independent variable, x, and the error (or disturbance), u, as y = β0 + β1x + u, where β0 and β1
are the population intercept and slope parameters, respectively.

Assumption SLR.2 (Random Sampling)
We have a random sample of size n, {(x_i, y_i): i = 1, 2, …, n}, following the population model in Assumption SLR.1.

Assumption SLR.3 (Sample Variation in the Explanatory Variable)
The sample outcomes on x, namely, {x_i, i = 1, …, n}, are not all the same value.

Assumption SLR.4 (Zero Conditional Mean)
The error u has an expected value of zero given any value of the explanatory variable. In other words, E(u|x) = 0.

Assumption SLR.5 (Homoskedasticity)
The error u has the same variance given any value of the explanatory variable. In other words, Var(u|x) = σ².

KEY TERMS

Coefficient of Determination; Constant Elasticity Model; Control Variable; Covariate; Degrees of Freedom; Dependent Variable; Elasticity; Error Term (Disturbance); Error Variance; Explained Sum of Squares (SSE); Explained Variable; Explanatory Variable; First Order Conditions; Fitted Value; Gauss-Markov Assumptions; Heteroskedasticity; Homoskedasticity; Independent Variable; Intercept Parameter; Mean Independent; OLS Regression Line; Ordinary Least Squares (OLS); Population Regression Function (PRF); Predicted Variable; Predictor Variable; R-squared; Regressand; Regression through the Origin; Regressor; Residual; Residual Sum of Squares (SSR); Response Variable; Sample Regression Function (SRF); Semi-elasticity; Simple Linear Regression Model; Slope Parameter; Standard Error of β̂1; Standard Error of the Regression (SER); Sum of Squared Residuals (SSR); Total Sum of Squares (SST); Zero Conditional Mean Assumption

PROBLEMS

2.1 Let kids denote the number of children ever born to a woman, and let educ denote years of education for the woman. A simple model relating fertility to years of education is kids = β0 + β1educ + u, where u is the unobserved error.
(i) What kinds of factors are contained in u? Are these likely to be correlated with level of education?
(ii) Will a simple regression analysis uncover the ceteris paribus effect of education on fertility? Explain.

2.2 In the simple linear regression model y = β0 + β1x + u, suppose
that E(u) ≠ 0. Letting α0 = E(u), show that the model can always be rewritten with the same slope, but a new intercept and error, where the new error has a zero expected value.

2.3 The following table contains the ACT scores and the GPA (grade point average) for eight college students. Grade point average is based on a four-point scale and has been rounded to one digit after the decimal.

Student  GPA  ACT
1        2.8  21
2        3.4  24
3        3.0  26
4        3.5  27
5        3.6  29
6        3.0  25
7        2.7  25
8        3.7  30

(i) Estimate the relationship between GPA and ACT using OLS; that is, obtain the intercept and slope estimates in the equation GPA-hat = β̂0 + β̂1 ACT. Comment on the direction of the relationship. Does the intercept have a useful interpretation here? Explain. How much higher is the GPA predicted to be if the ACT score is increased by five points?
(ii) Compute the fitted values and residuals for each observation, and verify that the residuals (approximately) sum to zero.
(iii) What is the predicted value of GPA when ACT = 20?
(iv) How much of the variation in GPA for these eight students is explained by ACT? Explain.

2.4 The data set BWGHT.RAW contains data on births to women in the United States. Two variables of interest are the dependent variable, infant birth weight in ounces (bwght), and an explanatory variable, average number of cigarettes the mother smoked per day during pregnancy (cigs). The following simple regression was estimated using data on n = 1,388 births:

bwght-hat = 119.77 − 0.514 cigs.

(i) What is the predicted birth weight when cigs = 0? What about when cigs = 20 (one pack per day)? Comment on the difference.
(ii) Does this simple regression necessarily capture a causal relationship between the child's birth weight and the mother's smoking habits? Explain.
(iii) To predict a birth weight of 125 ounces, what would cigs have to be? Comment.
(iv) The proportion of women in the sample who do not smoke while pregnant is about .85. Does this help reconcile your finding from part (iii)?

2.5 In the linear consumption function cons-hat = β̂0 + β̂1 inc, the estimated
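Problem 2.3(i) and (ii) can be checked directly; the following sketch (our illustration, not part of the text) computes the OLS estimates from the eight observations in the table using the slope and intercept formulas from Section 2.2.

```python
import numpy as np

# Checking Problem 2.3(i) and (ii) numerically with the eight observations.
gpa = np.array([2.8, 3.4, 3.0, 3.5, 3.6, 3.0, 2.7, 3.7])
act = np.array([21, 24, 26, 27, 29, 25, 25, 30], dtype=float)

b1 = np.sum((act - act.mean()) * (gpa - gpa.mean())) / np.sum((act - act.mean()) ** 2)
b0 = gpa.mean() - b1 * act.mean()
resid = gpa - b0 - b1 * act

print(round(b0, 4), round(b1, 4))            # 0.5681 0.1022
print(abs(resid.sum()) < 1e-10)              # residuals sum to (numerically) zero: True
```

So each additional ACT point predicts about a 0.10 higher GPA, and a five-point increase predicts roughly half a grade point.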
marginal propensity to consume (MPC) out of income is simply the slope, β̂1, while the average propensity to consume (APC) is cons-hat/inc = β̂0/inc + β̂1. Using observations for 100 families on annual income and consumption (both measured in dollars), the following equation is obtained:

cons-hat = −124.84 + 0.853 inc
n = 100, R² = 0.692.

(i) Interpret the intercept in this equation, and comment on its sign and magnitude.
(ii) What is the predicted consumption when family income is $30,000?
(iii) With inc on the x-axis, draw a graph of the estimated MPC and APC.

2.6 Using data from 1988 for houses sold in Andover, Massachusetts, from Kiel and McClain (1995), the following equation relates housing price (price) to the distance from a recently built garbage incinerator (dist):

log(price)-hat = 9.40 + 0.312 log(dist)
n = 135, R² = 0.162.

(i) Interpret the coefficient on log(dist). Is the sign of this estimate what you expect it to be?
(ii) Do you think simple regression provides an unbiased estimator of the ceteris paribus elasticity of price with respect to dist? (Think about the city's decision on where to put the incinerator.)
(iii) What other factors about a house affect its price? Might these be correlated with distance from the incinerator?

2.7 Consider the savings function

sav = β0 + β1 inc + u, u = √inc · e,

where e is a random variable with E(e) = 0 and Var(e) = σ_e². Assume that e is independent of inc.
(i) Show that E(u|inc) = 0, so that the key zero conditional mean assumption (Assumption SLR.4) is satisfied. (Hint: If e is independent of inc, then E(e|inc) = E(e).)
(ii) Show that Var(u|inc) = σ_e²·inc, so that the homoskedasticity Assumption SLR.5 is violated. In particular, the variance of sav increases with inc. (Hint: Var(e|inc) = Var(e), if e and inc are independent.)
(iii) Provide a discussion that supports the assumption that the variance of savings increases with family income.

2.8 Consider the standard simple regression model y = β0 + β1x + u under the Gauss-Markov Assumptions SLR.1 through SLR.5. The usual OLS estimators β̂0 and β̂1 are unbiased for their respective population parameters. Let β̃1 be the estimator of β1 obtained by
assuming the intercept is zero (see Section 2.6).
(i) Find E(β̃1) in terms of the x_i, β0, and β1. Verify that β̃1 is unbiased for β1 when the population intercept (β0) is zero. Are there other cases where β̃1 is unbiased?
(ii) Find the variance of β̃1. (Hint: The variance does not depend on β0.)
(iii) Show that Var(β̃1) ≤ Var(β̂1). (Hint: For any sample of data, Σ x_i² ≥ Σ (x_i − x̄)², with strict inequality unless x̄ = 0.)
(iv) Comment on the tradeoff between bias and variance when choosing between β̂1 and β̃1.

2.9 (i) Let β̂0 and β̂1 be the intercept and slope from the regression of y_i on x_i, using n observations. Let c1 and c2, with c2 ≠ 0, be constants. Let β̃0 and β̃1 be the intercept and slope from the regression of c1y_i on c2x_i. Show that β̃1 = (c1/c2)β̂1 and β̃0 = c1β̂0, thereby verifying the claims on units of measurement in Section 2.4. (Hint: To obtain β̃1, plug the scaled versions of x and y into (2.19). Then, use (2.17) for β̃0, being sure to plug in the scaled x and y and the correct slope.)
(ii) Now, let β̃0 and β̃1 be from the regression of (c1 + y_i) on (c2 + x_i) (with no restriction on c1 or c2). Show that β̃1 = β̂1 and β̃0 = β̂0 + c1 − c2β̂1.
(iii) Now, let β̂0 and β̂1 be the OLS estimates from the regression of log(y_i) on x_i, where we must assume y_i > 0 for all i. For c1 > 0, let β̃0 and β̃1 be the intercept and slope from the regression of log(c1y_i) on x_i. Show that β̃1 = β̂1 and β̃0 = log(c1) + β̂0.
(iv) Now, assuming that x_i > 0 for all i, let β̃0 and β̃1 be the intercept and slope from the regression of y_i on log(c2x_i). How do β̃0 and β̃1 compare with the intercept and slope from the regression of y_i on log(x_i)?

2.10 Let β̂0 and β̂1 be the OLS intercept and slope estimators, respectively, and let ū be the sample average of the errors (not the residuals!).
(i) Show that β̂1 can be written as β̂1 = β1 + Σ w_i u_i, where w_i = d_i/SSTx and d_i = x_i − x̄.
(ii) Use part (i), along with Σ w_i = 0, to show that β̂1 and ū are uncorrelated. (Hint: You are being asked to show that E[(β̂1 − β1)·ū] = 0.)
(iii) Show that β̂0 can be written as β̂0 = β0 + ū − (β̂1 − β1)x̄.
(iv) Use parts (ii) and (iii) to show that Var(β̂0) = σ²/n + σ²(x̄)²/SSTx.
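The variance expression in Problem 2.10(iv) can be checked numerically (an invented illustration, not from the text): simulate many samples with the x_i held fixed and compare the variance of β̂0 across samples with σ²/n + σ²x̄²/SSTx, which the algebra requested next shows equals equation (2.58).

```python
import numpy as np

# Invented numerical check of Problem 2.10(iv): simulate many samples holding
# the x_i fixed; the sampling variance of beta0_hat should match
# sigma^2/n + sigma^2*xbar^2/SST_x, which equals equation (2.58).
rng = np.random.default_rng(6)
n, sigma2, reps = 40, 2.25, 30_000
x = rng.uniform(2, 8, size=n)
sst_x = np.sum((x - x.mean()) ** 2)

b0s = np.empty(reps)
for r in range(reps):
    u = rng.normal(scale=1.5, size=n)        # Var(u) = 2.25
    y = 5.0 + 0.7 * x + u
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sst_x
    b0s[r] = y.mean() - b1 * x.mean()

theory = sigma2 / n + sigma2 * x.mean() ** 2 / sst_x     # Problem 2.10(iv)
theory_258 = sigma2 * np.mean(x ** 2) / sst_x            # equation (2.58)
print(b0s.var(), theory)                     # simulated vs. theoretical variance
```

The two theoretical expressions agree exactly, which is the algebraic identity behind the next part of the problem.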
(v) Do the algebra to simplify the expression in part (iv) to equation (2.58). (Hint: SSTx/n = n⁻¹ Σ x_i² − (x̄)².)

2.11 Suppose you are interested in estimating the effect of hours spent in an SAT preparation course (hours) on total SAT score (sat). The population is all college-bound high school seniors for a particular year.
(i) Suppose you are given a grant to run a controlled experiment. Explain how you would structure the experiment in order to estimate the causal effect of hours on sat.
(ii) Consider the more realistic case where students choose how much time to spend in a preparation course, and you can only randomly sample sat and hours from the population. Write the population model as sat = β0 + β1hours + u, where, as usual in a model with an intercept, we can assume E(u) = 0. List at least two factors contained in u. Are these likely to have positive or negative correlation with hours?
(iii) In the equation from part (ii), what should be the sign of β1 if the preparation course is effective?
(iv) In the equation from part (ii), what is the interpretation of β0?

COMPUTER EXERCISES

C2.1 The data in 401K.RAW are a subset of data analyzed by Papke (1995) to study the relationship between participation in a 401(k) pension plan and the generosity of the plan. The variable prate is the percentage of eligible workers with an active account; this is the variable we would like to explain. The measure of generosity is the plan match rate, mrate. This variable gives the average amount the firm contributes to each worker's plan for each $1 contribution by the worker. For example, if mrate = 0.50, then a $1 contribution by the worker is matched by a 50¢ contribution by the firm.
(i) Find the average participation rate and the average match rate in the sample of plans.
(ii) Now, estimate the simple regression equation prate-hat = β̂0 + β̂1 mrate, and report the results along with the sample size and R-squared.
(iii) Interpret the intercept in your equation. Interpret the coefficient on mrate.
(iv) Find the predicted prate when mrate = 3.5. Is this a reasonable prediction? Explain what
is happening here.
(v) How much of the variation in prate is explained by mrate? Is this a lot in your opinion?

C2.2 The data set in CEOSAL2.RAW contains information on chief executive officers for U.S. corporations. The variable salary is annual compensation, in thousands of dollars, and ceoten is prior number of years as company CEO.
(i) Find the average salary and the average tenure in the sample.
(ii) How many CEOs are in their first year as CEO (that is, ceoten = 0)? What is the longest tenure as a CEO?
(iii) Estimate the simple regression model log(salary) = β0 + β1ceoten + u, and report your results in the usual form. What is the (approximate) predicted percentage increase in salary given one more year as a CEO?

C2.3 Use the data in SLEEP75.RAW from Biddle and Hamermesh (1990) to study whether there is a tradeoff between the time spent sleeping per week and the time spent in paid work. We could use either variable as the dependent variable. For concreteness, estimate the model sleep = β0 + β1totwrk + u, where sleep is minutes spent sleeping at night per week and totwrk is total minutes worked during the week.
(i) Report your results in equation form along with the number of observations and R². What does the intercept in this equation mean?
(ii) If totwrk increases by 2 hours, by how much is sleep estimated to fall? Do you find this to be a large effect?

C2.4 Use the data in WAGE2.RAW to estimate a simple regression explaining monthly salary (wage) in terms of IQ score (IQ).
(i) Find the average salary and average IQ in the sample. What is the sample standard deviation of IQ? (IQ scores are standardized so that the average in the population is 100 with a standard deviation equal to 15.)
(ii) Estimate a simple regression model where a one-point increase in IQ changes wage by a constant dollar amount. Use this model to find the predicted increase in wage for an increase in IQ of 15 points. Does IQ explain most of the variation in wage?
(iii) Now, estimate a model where each one-point increase in IQ has the same percent
age effect on wage. If IQ increases by 15 points, what is the approximate percentage increase in predicted wage?

C2.5 For the population of firms in the chemical industry, let rd denote annual expenditures on research and development, and let sales denote annual sales (both are in millions of dollars).
(i) Write down a model (not an estimated equation) that implies a constant elasticity between rd and sales. Which parameter is the elasticity?
(ii) Now, estimate the model using the data in RDCHEM.RAW. Write out the estimated equation in the usual form. What is the estimated elasticity of rd with respect to sales? Explain in words what this elasticity means.

C2.6 We used the data in MEAP93.RAW for Example 2.12. Now we want to explore the relationship between the math pass rate (math10) and spending per student (expend).
(i) Do you think each additional dollar spent has the same effect on the pass rate, or does a diminishing effect seem more appropriate? Explain.
(ii) In the population model math10 = β0 + β1 log(expend) + u, argue that β1/10 is the percentage point change in math10 given a 10% increase in expend.
(iii) Use the data in MEAP93.RAW to estimate the model from part (ii). Report the estimated equation in the usual way, including the sample size and R-squared.
(iv) How big is the estimated spending effect? Namely, if spending increases by 10%, what is the estimated percentage point increase in math10?
(v) One might worry that regression analysis can produce fitted values for math10 that are greater than 100. Why is this not much of a worry in this data set?

C2.7 Use the data in CHARITY.RAW [obtained from Franses and Paap (2001)] to answer the following questions:
(i) What is the average gift in the sample of 4,268 people (in Dutch guilders)? What percentage of people gave no gift?
(ii) What is the average mailings per year? What are the minimum and maximum values?
(iii) Estimate the model gift = β0 + β1mailsyear + u by OLS and report the results in the usual way, including the sample size and R-squared.
(iv) Interpret
the slope coefficient. If each mailing costs one guilder, is the charity expected to make a net gain on each mailing? Does this mean the charity makes a net gain on every mailing? Explain.
(v) What is the smallest predicted charitable contribution in the sample? Using this simple regression analysis, can you ever predict zero for gift?

APPENDIX 2A: Minimizing the Sum of Squared Residuals

We show that the OLS estimates β̂0 and β̂1 do minimize the sum of squared residuals, as asserted in Section 2.2. Formally, the problem is to characterize the solutions β̂0 and β̂1 to the minimization problem

min_{b0, b1} Σ_{i=1}^{n} (y_i − b0 − b1x_i)²,

where b0 and b1 are the dummy arguments for the optimization problem; for simplicity, call this function Q(b0, b1). By a fundamental result from multivariable calculus (see Appendix A), a necessary condition for β̂0 and β̂1 to solve the minimization problem is that the partial derivatives of Q(b0, b1) with respect to b0 and b1 must be zero when evaluated at β̂0, β̂1: ∂Q(β̂0, β̂1)/∂b0 = 0 and ∂Q(β̂0, β̂1)/∂b1 = 0. Using the chain rule from calculus, these two equations become

−2 Σ_{i=1}^{n} (y_i − β̂0 − β̂1x_i) = 0,
−2 Σ_{i=1}^{n} x_i(y_i − β̂0 − β̂1x_i) = 0.

These two equations are just (2.14) and (2.15) multiplied by −2n and, therefore, are solved by the same β̂0 and β̂1.

How do we know that we have actually minimized the sum of squared residuals? The first order conditions are necessary but not sufficient conditions. One way to verify that we have minimized the sum of squared residuals is to write, for any b0 and b1,

Q(b0, b1) = Σ [y_i − β̂0 − β̂1x_i + (β̂0 − b0) + (β̂1 − b1)x_i]²
= Σ [û_i + (β̂0 − b0) + (β̂1 − b1)x_i]²
= Σ û_i² + n(β̂0 − b0)² + (β̂1 − b1)² Σ x_i² + 2(β̂0 − b0)(β̂1 − b1) Σ x_i,

where we have used equations (2.30) and (2.31). The first term does not depend on b0 or b1, while the sum of the last three terms can be written as

Σ [(β̂0 − b0) + (β̂1 − b1)x_i]²,

as can be verified by straightforward algebra. Because this is a sum of squared terms, the smallest it can be is zero. Therefore, it is smallest when b0 = β̂0 and b1 = β̂1.

CHAPTER 3: Multiple Regression Analysis: Estimation

In Chapter 2, we learned how to use
simple regression analysis to explain a dependent variable, y, as a function of a single independent variable, x. The primary drawback in using simple regression analysis for empirical work is that it is very difficult to draw ceteris paribus conclusions about how x affects y: the key assumption, SLR.4, that all other factors affecting y are uncorrelated with x, is often unrealistic.

Multiple regression analysis is more amenable to ceteris paribus analysis because it allows us to explicitly control for many other factors that simultaneously affect the dependent variable. This is important both for testing economic theories and for evaluating policy effects when we must rely on nonexperimental data. Because multiple regression models can accommodate many explanatory variables that may be correlated, we can hope to infer causality in cases where simple regression analysis would be misleading.

Naturally, if we add more factors to our model that are useful for explaining y, then more of the variation in y can be explained. Thus, multiple regression analysis can be used to build better models for predicting the dependent variable.

An additional advantage of multiple regression analysis is that it can incorporate fairly general functional form relationships. In the simple regression model, only one function of a single explanatory variable can appear in the equation. As we will see, the multiple regression model allows for much more flexibility.

Section 3.1 formally discusses the advantages of multiple regression over simple regression. In Section 3.2, we demonstrate how to estimate the parameters in the multiple regression model using the method of ordinary least squares. In Sections 3.3, 3.4, and 3.5, we describe various statistical properties of the OLS estimators, including unbiasedness and efficiency. The multiple regression model is still the most widely used vehicle for empirical analysis in economics and other social sciences. Likewise, the method of ordinary least squares is popularly used for estimating the parameters of the multiple regression model.

3.1 Motivation for Multiple Regression

The Model with Two Independent Variables

We begin with some simple examples to show how multiple regression analysis can be used to solve problems that cannot be solved by simple regression.

The first example is a simple variation of the wage equation introduced in Chapter 2 for obtaining the effect of education on hourly wage:

wage = β0 + β1educ + β2exper + u,    (3.1)

where exper is years of labor market experience. Thus, wage is determined by the two
explanatory, or independent, variables, education and experience, and by other unobserved factors, which are contained in u. We are still primarily interested in the effect of educ on wage, holding fixed all other factors affecting wage; that is, we are interested in the parameter β1.

Compared with a simple regression analysis relating wage to educ, equation (3.1) effectively takes exper out of the error term and puts it explicitly in the equation. Because exper appears in the equation, its coefficient, β2, measures the ceteris paribus effect of exper on wage, which is also of some interest.

Not surprisingly, just as with simple regression, we will have to make assumptions about how u in (3.1) is related to the independent variables, educ and exper. However, as we will see in Section 3.2, there is one thing of which we can be confident: because (3.1) contains experience explicitly, we will be able to measure the effect of education on wage, holding experience fixed. In a simple regression analysis, which puts exper in the error term, we would have to assume that experience is uncorrelated with education, a tenuous assumption.

As a second example, consider the problem of explaining the effect of per student spending (expend) on the average standardized test score (avgscore) at the high school level. Suppose that the average test score depends on funding, average family income (avginc), and other unobservables:

avgscore = β0 + β1 expend + β2 avginc + u.  (3.2)

The coefficient of interest for policy purposes is β1, the ceteris paribus effect of expend on avgscore. By including avginc explicitly in the model, we are able to control for its effect on avgscore. This is likely to be important because average family income tends to be correlated with per student spending: spending levels are often determined by both property and local income taxes. In simple regression analysis, avginc would be included in the error term, which would likely be correlated with expend, causing the OLS estimator of β1 in the two-variable model to be biased. In the two
previous examples, we have shown how observable factors other than the variable of primary interest (educ in equation (3.1) and expend in equation (3.2)) can be included in a regression model. Generally, we can write a model with two independent variables as

y = β0 + β1 x1 + β2 x2 + u,  (3.3)

where β0 is the intercept, β1 measures the change in y with respect to x1, holding other factors fixed, and β2 measures the change in y with respect to x2, holding other factors fixed.

Part 1: Regression Analysis with Cross-Sectional Data

Multiple regression analysis is also useful for generalizing functional relationships between variables. As an example, suppose family consumption (cons) is a quadratic function of family income (inc):

cons = β0 + β1 inc + β2 inc² + u,  (3.4)

where u contains other factors affecting consumption. In this model, consumption depends on only one observed factor, income; so it might seem that it can be handled in a simple regression framework. But the model falls outside simple regression because it contains two functions of income, inc and inc², and therefore three parameters, β0, β1, and β2. Nevertheless, the consumption function is easily written as a regression model with two independent variables by letting x1 = inc and x2 = inc².

Mechanically, there will be no difference in using the method of ordinary least squares (introduced in Section 3.2) to estimate equations as different as (3.1) and (3.4). Each equation can be written as (3.3), which is all that matters for computation. There is, however, an important difference in how one interprets the parameters. In equation (3.1), β1 is the ceteris paribus effect of educ on wage. The parameter β1 has no such interpretation in (3.4). In other words, it makes no sense to measure the effect of inc on cons while holding inc² fixed, because if inc changes, then so must inc²! Instead, the change in consumption with respect to the change in income (the marginal propensity to consume) is approximated by

Δcons/Δinc ≈ β1 + 2 β2 inc.

See Appendix A for the calculus needed to derive this equation. In other words, the marginal effect of income on consumption depends on β2 as well as on β1 and the level of income.

This example shows that, in any particular application, the definitions of the independent variables are crucial. But, for the theoretical development of multiple regression, we can be vague about such details. We will study examples like this more completely in Chapter 6.

In the model with two independent variables, the key assumption about how u is related to x1 and x2 is

E(u | x1, x2) = 0.  (3.5)

The interpretation of condition (3.5) is similar to the interpretation of Assumption SLR.4 for simple regression analysis. It means that, for any values of x1 and x2 in the population, the average unobservable is equal to zero. As with simple regression, the important part of the assumption is that the expected value of u is the same for all combinations of x1 and x2; that this common value is zero is no assumption at all, as long as the intercept β0 is included in the model (see Section 2.1).

How can we interpret the zero conditional mean assumption in the previous examples? In equation (3.1), the assumption is E(u | educ, exper) = 0. This implies that other factors affecting wage are not related on average to educ and exper. Therefore, if we think innate ability is part of u, then we will need average ability levels to be the same across all combinations of education and experience in the working population. This may or may not be true; but, as we will see in Section 3.3, this is the question we need to ask in order to determine whether the method of ordinary least squares produces unbiased estimators.

In the test score example, assumption (3.5) says that other factors affecting test scores (school or student characteristics, for example) are, on average, unrelated to per student funding and average family income.

Question: A simple model to explain city murder rates in terms of the probability of conviction and the average sentence length is murdrate = β0 + β1 prbconv + β2 avgsen + u. What are some factors contained in u? Do you think the key assumption (3.5) is likely to hold?
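The marginal propensity to consume approximation Δcons/Δinc ≈ β1 + 2β2·inc from the quadratic model (3.4) can be sketched numerically. The coefficient values below are hypothetical, chosen only to illustrate a diminishing MPC; they are not estimates from any data set.

```python
# Hypothetical coefficients; beta2 < 0 gives a diminishing MPC.
beta0, beta1, beta2 = 10.0, 0.8, -0.002

def cons(inc):
    """Quadratic consumption function, as in (3.4) without the error term."""
    return beta0 + beta1 * inc + beta2 * inc ** 2

def mpc(inc):
    """Marginal propensity to consume at income level inc: beta1 + 2*beta2*inc."""
    return beta1 + 2 * beta2 * inc

# The MPC falls as income rises when beta2 < 0
mpc_low, mpc_high = mpc(10), mpc(50)

# Check the formula against a finite-difference approximation of the derivative
h = 1e-5
fd = (cons(10 + h) - cons(10 - h)) / (2 * h)
```

Note the contrast with (3.1): here a single observed variable, income, generates two regressors, so the partial effect of income is not a single coefficient.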
When applied to the quadratic consumption function in (3.4), the zero conditional mean assumption has a slightly different interpretation. Written literally, equation (3.5) becomes E(u | inc, inc²) = 0. Since inc² is known when inc is known, including inc² in the expectation is redundant: E(u | inc, inc²) = 0 is the same as E(u | inc) = 0. Nothing is wrong with putting inc² along with inc in the expectation when stating the assumption, but E(u | inc) = 0 is more concise.

The Model with k Independent Variables

Once we are in the context of multiple regression, there is no need to stop with two independent variables. Multiple regression analysis allows many observed factors to affect y. In the wage example, we might also include the amount of job training, years of tenure with the current employer, measures of ability, and even the number of siblings or mother's education. In the school funding example, additional variables might include measures of teacher quality and school size.

The general multiple linear regression model (also called the multiple regression model) can be written in the population as

y = β0 + β1 x1 + β2 x2 + β3 x3 + … + βk xk + u,  (3.6)

where β0 is the intercept, β1 is the parameter associated with x1, β2 is the parameter associated with x2, and so on. Since there are k independent variables and an intercept, equation (3.6) contains k + 1 (unknown) population parameters. For shorthand purposes, we will sometimes refer to the parameters other than the intercept as slope parameters, even though this is not always literally what they are. (See equation (3.4), where neither β1 nor β2 is itself a slope, but together they determine the slope of the relationship between consumption and income.)

The terminology for multiple regression is similar to that for simple regression and is given in Table 3.1. Just as in simple regression, the variable u is the error term or disturbance. It contains factors other than x1, x2, …, xk that affect y. No matter how many explanatory variables we include in our model, there will always be factors we cannot include, and these are collectively contained in u.

When applying the general multiple regression model, we must know how to interpret the parameters. We will get plenty of practice now and in subsequent chapters, but it is
useful at this point to be reminded of some things we already know.

Table 3.1: Terminology for Multiple Regression

y: dependent variable, explained variable, response variable, predicted variable, regressand
x1, x2, …, xk: independent variables, explanatory variables, control variables, predictor variables, regressors

Suppose that CEO salary (salary) is related to firm sales (sales) and CEO tenure (ceoten) with the firm by

log(salary) = β0 + β1 log(sales) + β2 ceoten + β3 ceoten² + u.  (3.7)

This fits into the multiple regression model (with k = 3) by defining y = log(salary), x1 = log(sales), x2 = ceoten, and x3 = ceoten². As we know from Chapter 2, the parameter β1 is the (ceteris paribus) elasticity of salary with respect to sales. If β3 = 0, then 100·β2 is approximately the ceteris paribus percentage increase in salary when ceoten increases by one year. When β3 ≠ 0, the effect of ceoten on salary is more complicated. We will postpone a detailed treatment of general models with quadratics until Chapter 6.

Equation (3.7) provides an important reminder about multiple regression analysis. The term "linear" in multiple linear regression model means that equation (3.6) is linear in the parameters, βj. Equation (3.7) is an example of a multiple regression model that, while linear in the βj, is a nonlinear relationship between salary and the variables sales and ceoten. Many applications of multiple linear regression involve nonlinear relationships among the underlying variables.

The key assumption for the general multiple regression model is easy to state in terms of a conditional expectation:

E(u | x1, x2, …, xk) = 0.  (3.8)

At a minimum, equation (3.8) requires that all factors in the unobserved error term be uncorrelated with the explanatory variables. It also means that we have correctly accounted for the functional relationships between the explained and explanatory variables. Any problem that causes u to be correlated with any of the independent variables causes (3.8) to fail. In Section 3.3, we will show that assumption (3.8) implies that OLS is unbiased, and will derive
the bias that arises when a key variable has been omitted from the equation. In Chapters 15 and 16, we will study other reasons that might cause (3.8) to fail and show what can be done in cases where it does fail.

3.2 Mechanics and Interpretation of Ordinary Least Squares

We now summarize some computational and algebraic features of the method of ordinary least squares as it applies to a particular set of data. We also discuss how to interpret the estimated equation.

Obtaining the OLS Estimates

We first consider estimating the model with two independent variables. The estimated OLS equation is written in a form similar to the simple regression case:

ŷ = β̂0 + β̂1 x1 + β̂2 x2,  (3.9)

where β̂0 is the estimate of β0, β̂1 is the estimate of β1, and β̂2 is the estimate of β2. But how do we obtain β̂0, β̂1, and β̂2? The method of ordinary least squares chooses the estimates to minimize the sum of squared residuals. That is, given n observations on y, x1, and x2, {(x_i1, x_i2, y_i): i = 1, 2, …, n}, the estimates β̂0, β̂1, and β̂2 are chosen simultaneously to make

Σ_{i=1}^n (y_i − β̂0 − β̂1 x_i1 − β̂2 x_i2)²  (3.10)

as small as possible.

To understand what OLS is doing, it is important to master the meaning of the indexing of the independent variables in (3.10). The independent variables have two subscripts here: i, followed by either 1 or 2. The i subscript refers to the observation number. Thus, the sum in (3.10) is over all i = 1 to n observations. The second index is simply a method of distinguishing between different independent variables. In the example relating wage to educ and exper, x_i1 = educ_i is education for person i in the sample, and x_i2 = exper_i is experience for person i. The sum of squared residuals in equation (3.10) is Σ_{i=1}^n (wage_i − β̂0 − β̂1 educ_i − β̂2 exper_i)². In what follows, the i subscript is reserved for indexing the observation number. If we write x_ij, then this means the i-th observation on the j-th independent variable. (Some authors prefer to switch the order of the observation number and the variable number, so that x_1i is
observation i on variable one. But this is just a matter of notational taste.)

In the general case with k independent variables, we seek estimates β̂0, β̂1, …, β̂k in the equation

ŷ = β̂0 + β̂1 x1 + β̂2 x2 + … + β̂k xk.  (3.11)

The OLS estimates, k + 1 of them, are chosen to minimize the sum of squared residuals:

Σ_{i=1}^n (y_i − β̂0 − β̂1 x_i1 − … − β̂k x_ik)².  (3.12)

This minimization problem can be solved using multivariable calculus (see Appendix 3A). This leads to k + 1 linear equations in k + 1 unknowns β̂0, β̂1, …, β̂k:

Σ_{i=1}^n (y_i − β̂0 − β̂1 x_i1 − … − β̂k x_ik) = 0
Σ_{i=1}^n x_i1 (y_i − β̂0 − β̂1 x_i1 − … − β̂k x_ik) = 0
⋮
Σ_{i=1}^n x_ik (y_i − β̂0 − β̂1 x_i1 − … − β̂k x_ik) = 0.  (3.13)

These are often called the OLS first order conditions. As with the simple regression model in Section 2.2, the OLS first order conditions can be obtained by the method of moments: under assumption (3.8), E(u) = 0 and E(x_j u) = 0, where j = 1, 2, …, k. The equations in (3.13) are the sample counterparts of these population moments, although we have omitted the division by the sample size n.

For even moderately sized n and k, solving the equations in (3.13) by hand calculations is tedious. Nevertheless, modern computers running standard statistics and econometrics software can solve these equations with large n and k very quickly. There is only one slight caveat: we must assume that the equations in (3.13) can be solved uniquely for the β̂j. For now, we just assume this, as it is usually the case in well-specified models. In Section 3.3, we state the assumption needed for unique OLS estimates to exist (see Assumption MLR.3).

As in simple regression analysis, equation (3.11) is called the OLS regression line, or the sample regression function (SRF). We will call β̂0 the OLS intercept estimate and β̂1, …, β̂k the OLS slope estimates (corresponding to the independent variables x1, x2, …, xk).

To indicate that an OLS regression has been run, we will either write out equation (3.11) with y and x1, …, xk replaced by their variable names (such as wage, educ, and exper), or we will say that "we ran an OLS regression of y on x1, x2, …, xk" or that "we regressed y on x1, x2, …, xk." These are shorthand for saying that the method of ordinary least squares was used to obtain the OLS equation (3.11). Unless explicitly stated otherwise, we always estimate an intercept along with the slopes.
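For a small example, the system (3.13) can be solved directly. A sketch with made-up data and k = 2, solving the three normal equations by Gaussian elimination and then confirming that the residuals satisfy the first order conditions:

```python
# Hypothetical data, chosen only so the system has a unique solution.
y  = [3.0, 5.0, 4.0, 8.0, 9.0, 7.0]
x1 = [1.0, 2.0, 2.0, 4.0, 5.0, 4.0]
x2 = [2.0, 1.0, 3.0, 2.0, 4.0, 5.0]
n = len(y)

# Build the 3x3 system A b_hat = c for the regressor columns [1, x1, x2]
cols = [[1.0] * n, x1, x2]
A = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
c = [sum(ci[i] * y[i] for i in range(n)) for ci in cols]

def solve3(A, c):
    """Gaussian elimination with partial pivoting for a 3x3 system."""
    M = [row[:] + [ci] for row, ci in zip(A, c)]
    for col in range(3):
        p = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[p] = M[p], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    beta = [0.0] * 3
    for r in (2, 1, 0):
        beta[r] = (M[r][3] - sum(M[r][j] * beta[j] for j in range(r + 1, 3))) / M[r][r]
    return beta

b0_hat, b1_hat, b2_hat = solve3(A, c)

# First order conditions: residuals are orthogonal to every regressor column
resid = [y[i] - b0_hat - b1_hat * x1[i] - b2_hat * x2[i] for i in range(n)]
moments = [sum(ci[i] * resid[i] for i in range(n)) for ci in cols]
```

The `moments` list is the left-hand side of (3.13) evaluated at the estimates; all three entries should be zero up to rounding. Econometrics software does the same computation, just for much larger n and k.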
Interpreting the OLS Regression Equation

More important than the details underlying the computation of the β̂j is the interpretation of the estimated equation. We begin with the case of two independent variables:

ŷ = β̂0 + β̂1 x1 + β̂2 x2.  (3.14)

The intercept β̂0 in equation (3.14) is the predicted value of y when x1 = 0 and x2 = 0. Sometimes, setting x1 and x2 both equal to zero is an interesting scenario; in other cases, it will not make sense. Nevertheless, the intercept is always needed to obtain a prediction of y from the OLS regression line, as (3.14) makes clear.

The estimates β̂1 and β̂2 have partial effect, or ceteris paribus, interpretations. From equation (3.14), we have

Δŷ = β̂1 Δx1 + β̂2 Δx2,

so we can obtain the predicted change in y given the changes in x1 and x2. (Note how the intercept has nothing to do with the changes in y.) In particular, when x2 is held fixed, so that Δx2 = 0, then

Δŷ = β̂1 Δx1, holding x2 fixed.

The key point is that, by including x2 in our model, we obtain a coefficient on x1 with a ceteris paribus interpretation. This is why multiple regression analysis is so useful. Similarly,

Δŷ = β̂2 Δx2, holding x1 fixed.

Example 3.1 (Determinants of College GPA)

The variables in GPA1.RAW include the college grade point average (colGPA), high school GPA (hsGPA), and achievement test score (ACT) for a sample of 141 students from a large university; both college and high school GPAs are on a four-point scale. We obtain the following OLS regression line to predict college GPA from high school GPA and achievement test score:

colGPA-hat = 1.29 + .453 hsGPA + .0094 ACT.  (3.15)

How do we interpret this equation? First, the intercept 1.29 is the predicted college GPA if hsGPA and ACT are both set as zero. Since no one who attends college has either a zero high school GPA or a zero on the achievement test, the intercept in this equation is not, by itself, meaningful. More interesting estimates are the slope coefficients on hsGPA and
ACT. As expected, there is a positive partial relationship between colGPA and hsGPA. Holding ACT fixed, another point on hsGPA is associated with .453 of a point on the college GPA, or almost half a point. In other words, if we choose two students, A and B, and these students have the same ACT score, but the high school GPA of Student A is one point higher than the high school GPA of Student B, then we predict Student A to have a college GPA .453 higher than that of Student B. (This says nothing about any two actual people, but it is our best prediction.)

The sign on ACT implies that, while holding hsGPA fixed, a change in the ACT score of 10 points (a very large change, since the average score in the sample is about 24, with a standard deviation less than three) affects colGPA by less than one-tenth of a point. This is a small effect, and it suggests that, once high school GPA is accounted for, the ACT score is not a strong predictor of college GPA. Naturally, there are many other factors that contribute to GPA, but here we focus on statistics available for high school students. Later, after we discuss statistical inference, we will show that not only is the coefficient on ACT practically small, it is also statistically insignificant.

If we focus on a simple regression analysis relating colGPA to ACT only, we obtain

colGPA-hat = 2.40 + .0271 ACT;

thus, the coefficient on ACT is almost three times as large as the estimate in (3.15). But this equation does not allow us to compare two people with the same high school GPA; it corresponds to a different experiment. We say more about the differences between multiple and simple regression later.

The case with more than two independent variables is similar. The OLS regression line is

ŷ = β̂0 + β̂1 x1 + β̂2 x2 + … + β̂k xk.  (3.16)

Written in terms of changes,

Δŷ = β̂1 Δx1 + β̂2 Δx2 + … + β̂k Δxk.  (3.17)

The coefficient on x1 measures the change in ŷ due to a one-unit increase in x1, holding all other independent variables fixed. That is,

Δŷ = β̂1 Δx1, holding x2, x3, …, xk fixed.  (3.18)

Thus, we have
controlled for the variables x2, x3, …, xk when estimating the effect of x1 on y. The other coefficients have a similar interpretation. The following is an example with three independent variables.

Example 3.2 (Hourly Wage Equation)

Using the 526 observations on workers in WAGE1.RAW, we include educ (years of education), exper (years of labor market experience), and tenure (years with the current employer) in an equation explaining log(wage). The estimated equation is

log(wage)-hat = .284 + .092 educ + .0041 exper + .022 tenure.  (3.19)

As in the simple regression case, the coefficients have a percentage interpretation. The only difference here is that they also have a ceteris paribus interpretation. The coefficient .092 means that, holding exper and tenure fixed, another year of education is predicted to increase log(wage) by .092, which translates into an approximate 9.2% [100(.092)] increase in wage. Alternatively, if we take two people with the same levels of experience and job tenure, the coefficient on educ is the proportionate difference in predicted wage when their education levels differ by one year. This measure of the return to education at least keeps two important productivity factors fixed; whether it is a good estimate of the ceteris paribus return to another year of education requires us to study the statistical properties of OLS (see Section 3.3).

On the Meaning of "Holding Other Factors Fixed" in Multiple Regression

The partial effect interpretation of slope coefficients in multiple regression analysis can cause some confusion, so we provide a further discussion now. In Example 3.1, we observed that the coefficient on ACT measures the predicted difference in colGPA, holding hsGPA fixed. The power of multiple regression analysis is that it provides this ceteris paribus interpretation even though the data have not been collected in a ceteris paribus fashion. In giving the coefficient on ACT a partial effect interpretation, it may seem that we actually went out and sampled people
with the same high school GPA but possibly with different ACT scores. This is not the case. The data are a random sample from a large university: there were no restrictions placed on the sample values of hsGPA or ACT in obtaining the data. Rarely do we have the luxury of holding certain variables fixed in obtaining our sample. If we could collect a sample of individuals with the same high school GPA, then we could perform a simple regression analysis relating colGPA to ACT. Multiple regression effectively allows us to mimic this situation without restricting the values of any independent variables.

The power of multiple regression analysis is that it allows us to do in nonexperimental environments what natural scientists are able to do in a controlled laboratory setting: keep other factors fixed.

Changing More Than One Independent Variable Simultaneously

Sometimes, we want to change more than one independent variable at the same time to find the resulting effect on the dependent variable. This is easily done using equation (3.17). For example, in equation (3.19), we can obtain the estimated effect on wage when an individual stays at the same firm for another year: exper (general workforce experience) and tenure both increase by one year. The total effect (holding educ fixed) is

Δlog(wage)-hat = .0041 Δexper + .022 Δtenure = .0041 + .022 = .0261,

or about 2.6%. Since exper and tenure each increase by one year, we just add the coefficients on exper and tenure and multiply by 100 to turn the effect into a percentage.

OLS Fitted Values and Residuals

After obtaining the OLS regression line (3.11), we can obtain a fitted or predicted value for each observation. For observation i, the fitted value is simply

ŷ_i = β̂0 + β̂1 x_i1 + β̂2 x_i2 + … + β̂k x_ik,  (3.20)

which is just the predicted value obtained by plugging the values of the independent variables for observation i into equation (3.11). We should not forget about the intercept in obtaining the fitted values; otherwise, the answer can be very misleading.
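The fitted-value computation in (3.20) can be sketched with the college GPA line from Example 3.1, using the same values the text plugs in below (hsGPA = 3.5, ACT = 24). The sketch also shows why forgetting the intercept is misleading: the prediction without it is off by the full 1.29.

```python
# Coefficients from the estimated line (3.15) in Example 3.1.
b0, b_hsGPA, b_ACT = 1.29, 0.453, 0.0094

def colgpa_hat(hsGPA, ACT, include_intercept=True):
    """Fitted college GPA from (3.20); the no-intercept option is only
    here to illustrate the warning in the text."""
    pred = b_hsGPA * hsGPA + b_ACT * ACT
    return pred + b0 if include_intercept else pred

fit_full = colgpa_hat(3.5, 24)                            # correct fitted value
fit_noint = colgpa_hat(3.5, 24, include_intercept=False)  # misleading
```

Rounded to three decimal places, `fit_full` reproduces the 3.101 computed in the text, while the intercept-free number understates it by 1.29 GPA points.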
As an example, if in (3.15) hsGPA_i = 3.5 and ACT_i = 24, then colGPA-hat_i = 1.29 + .453(3.5) + .0094(24) = 3.101 (rounded to three places after the decimal).

Normally, the actual value y_i for any observation i will not equal the predicted value, ŷ_i: OLS minimizes the average squared prediction error, which says nothing about the prediction error for any particular observation. The residual for observation i is defined just as in the simple regression case,

û_i = y_i − ŷ_i.  (3.21)

There is a residual for each observation. If û_i > 0, then ŷ_i is below y_i, which means that, for this observation, y_i is underpredicted. If û_i < 0, then y_i < ŷ_i, and y_i is overpredicted.

The OLS fitted values and residuals have some important properties that are immediate extensions from the single variable case:

1. The sample average of the residuals is zero, and so the sample average of the fitted values equals ȳ.
2. The sample covariance between each independent variable and the OLS residuals is zero. Consequently, the sample covariance between the OLS fitted values and the OLS residuals is zero.
3. The point (x̄1, x̄2, …, x̄k, ȳ) is always on the OLS regression line: ȳ = β̂0 + β̂1 x̄1 + β̂2 x̄2 + … + β̂k x̄k.

The first two properties are immediate consequences of the set of equations used to obtain the OLS estimates. The first equation in (3.13) says that the sum of the residuals is zero. The remaining equations are of the form Σ_{i=1}^n x_ij û_i = 0, which implies that each independent variable has zero sample covariance with the residuals. Property 3 follows immediately from property 1.

Question: In Example 3.1, the OLS fitted line explaining college GPA in terms of high school GPA and ACT score is colGPA-hat = 1.29 + .453 hsGPA + .0094 ACT. If the average high school GPA is about 3.4 and the average ACT score is about 24.2, what is the average college GPA in the sample?

A "Partialling Out" Interpretation of Multiple Regression

When applying OLS, we do not need to know explicit formulas for the β̂j that solve the system of equations in (3.13). Nevertheless, for certain derivations, we do need explicit formulas for the β̂j. These formulas also shed further light on the workings of OLS.

Consider again the case with k = 2 independent variables, ŷ = β̂0 + β̂1 x1 + β̂2 x2. For concreteness, we focus on β̂1. One way to express β̂1 is

β̂1 = (Σ_{i=1}^n r̂_i1 y_i) / (Σ_{i=1}^n r̂_i1²),  (3.22)

where the r̂_i1 are the OLS residuals from a simple regression of x1 on x2, using
the sample at hand. We regress our first independent variable, x1, on our second independent variable, x2, and then obtain the residuals (y plays no role here). Equation (3.22) shows that we can then do a simple regression of y on r̂1 to obtain β̂1. (Note that the residuals r̂_i1 have a zero sample average, and so β̂1 is the usual slope estimate from simple regression.)

The representation in equation (3.22) gives another demonstration of β̂1's partial effect interpretation. The residuals r̂_i1 are the part of x_i1 that is uncorrelated with x_i2. Another way of saying this is that r̂_i1 is x_i1 after the effects of x_i2 have been partialled out, or netted out. Thus, β̂1 measures the sample relationship between y and x1 after x2 has been partialled out. In simple regression analysis, there is no partialling out of other variables because no other variables are included in the regression. Computer Exercise C3.5 steps you through the partialling out process using the wage data from Example 3.2. For practical purposes, the important thing is that β̂1 in the equation ŷ = β̂0 + β̂1 x1 + β̂2 x2 measures the change in y given a one-unit increase in x1, holding x2 fixed.

In the general model with k explanatory variables, β̂1 can still be written as in equation (3.22), but the residuals r̂_i1 come from the regression of x1 on x2, …, xk. Thus, β̂1 measures the effect of x1 on y after x2, …, xk have been partialled or netted out.

Comparison of Simple and Multiple Regression Estimates

Two special cases exist in which the simple regression of y on x1 will produce the same OLS estimate on x1 as the regression of y on x1 and x2. To be more precise, write the simple regression of y on x1 as ỹ = β̃0 + β̃1 x1, and write the multiple regression as ŷ = β̂0 + β̂1 x1 + β̂2 x2. We know that the simple regression coefficient β̃1 does not usually equal the multiple regression coefficient β̂1. It turns out there is a simple relationship between β̃1 and β̂1, which allows for interesting comparisons between simple and multiple regression:

β̃1 = β̂1 + β̂2 δ̃1,  (3.23)

where δ̃1 is the slope coefficient from the simple regression of x_i2 on x_i1, i = 1, …, n.
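Both the partialling-out formula (3.22) and the relationship (3.23) can be checked numerically. A sketch with made-up data, using only simple regressions: each multiple regression slope is obtained by regressing y on the residuals from regressing one regressor on the other.

```python
# Hypothetical data for a model with two regressors.
y  = [3.0, 5.0, 4.0, 8.0, 9.0, 7.0]
x1 = [1.0, 2.0, 2.0, 4.0, 5.0, 4.0]
x2 = [2.0, 1.0, 3.0, 2.0, 4.0, 5.0]

def simple_ols(x, y):
    """Return (intercept, slope) from the simple regression of y on x."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    slope = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / \
            sum((a - xb) ** 2 for a in x)
    return yb - slope * xb, slope

def partial_out(target, other):
    """Residuals from regressing `target` on `other`: the part of `target`
    uncorrelated with `other`."""
    a0, a1 = simple_ols(other, target)
    return [t - a0 - a1 * o for t, o in zip(target, other)]

# Multiple regression slopes via partialling out, as in (3.22)
_, beta1_hat = simple_ols(partial_out(x1, x2), y)
_, beta2_hat = simple_ols(partial_out(x2, x1), y)

# Simple regression of y on x1 alone, and the slope of x2 on x1 (delta1)
_, beta1_tilde = simple_ols(x1, y)
_, delta1 = simple_ols(x1, x2)

# Equation (3.23): beta1_tilde = beta1_hat + beta2_hat * delta1
gap = beta1_tilde - (beta1_hat + beta2_hat * delta1)
```

The `gap` is zero up to rounding: the simple regression slope equals the multiple regression slope plus the omitted variable's partial effect times δ̃1.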
This equation shows how β̃1 differs from the partial effect of x1 on ŷ: the confounding term is the partial effect of x2 on ŷ times the slope in the sample regression of x2 on x1. (See Section 3A.4 in the chapter appendix for a more general verification.)

The relationship between β̃1 and β̂1 also shows there are two distinct cases where they are equal:

1. The partial effect of x2 on ŷ is zero in the sample. That is, β̂2 = 0.
2. x1 and x2 are uncorrelated in the sample. That is, δ̃1 = 0.

Even though simple and multiple regression estimates are almost never identical, we can use the above formula to characterize why they might be either very different or quite similar. For example, if β̂2 is small, we might expect the multiple and simple regression estimates of β1 to be similar. In Example 3.1, the sample correlation between hsGPA and ACT is about .346, which is a nontrivial correlation. But the coefficient on ACT is fairly small. It is not surprising to find that the simple regression of colGPA on hsGPA produces a slope estimate of .482, which is not much different from the estimate .453 in (3.15).

Example 3.3 (Participation in 401(k) Pension Plans)

We use the data in 401K.RAW to estimate the effect of a plan's match rate (mrate) on the participation rate (prate) in its 401(k) pension plan. The match rate is the amount the firm contributes to a worker's fund for each dollar the worker contributes (up to some limit); thus, mrate = .75 means that the firm contributes 75 cents for each dollar contributed by the worker. The participation rate is the percentage of eligible workers having a 401(k) account. The variable age is the age of the 401(k) plan. There are 1,534 plans in the data set, the average prate is 87.36, the average mrate is .732, and the average age is 13.2. Regressing prate on mrate, age gives

prate-hat = 80.12 + 5.52 mrate + .243 age.

Thus, both mrate and age have the expected effects. What happens if we do not control for age? The estimated effect of age is not trivial, and so we might expect a
large change in the estimated effect of mrate if age is dropped from the regression. However, the simple regression of prate on mrate yields

prate-hat = 83.08 + 5.86 mrate.

The simple regression estimate of the effect of mrate on prate is clearly different from the multiple regression estimate, but the difference is not very big: the simple regression estimate is only about 6.2% larger than the multiple regression estimate. This can be explained by the fact that the sample correlation between mrate and age is only .12.

In the case with k independent variables, the simple regression of y on x1 and the multiple regression of y on x1, x2, …, xk produce an identical estimate on x1 only if (1) the OLS coefficients on x2 through xk are all zero or (2) x1 is uncorrelated with each of x2, …, xk. Neither of these is very likely in practice. But if the coefficients on x2 through xk are small, or the sample correlations between x1 and the other independent variables are insubstantial, then the simple and multiple regression estimates of the effect of x1 on y can be similar.

Goodness-of-Fit

As with simple regression, we can define the total sum of squares (SST), the explained sum of squares (SSE), and the residual sum of squares (or sum of squared residuals, SSR) as

SST = Σ_{i=1}^n (y_i − ȳ)²  (3.24)
SSE = Σ_{i=1}^n (ŷ_i − ȳ)²  (3.25)
SSR = Σ_{i=1}^n û_i².  (3.26)

Using the same argument as in the simple regression case, we can show that

SST = SSE + SSR.  (3.27)

In other words, the total variation in {y_i} is the sum of the total variations in {ŷ_i} and in {û_i}. Assuming that the total variation in y is nonzero, as is the case unless y_i is constant in the sample, we can divide (3.27) by SST to get SSR/SST + SSE/SST = 1.

Just as in the simple regression case, the R-squared is defined to be

R² = SSE/SST = 1 − SSR/SST,  (3.28)

and it is interpreted as the proportion of the sample variation in y_i that is explained by the OLS regression line. By definition, R² is a number between zero and one.

R² can also be shown to equal the squared correlation coefficient between the actual y_i and the fitted values
ŷ_i. That is,

R² = [Σ_{i=1}^n (y_i − ȳ)(ŷ_i − ŷ̄)]² / {[Σ_{i=1}^n (y_i − ȳ)²][Σ_{i=1}^n (ŷ_i − ŷ̄)²]}.  (3.29)

We have put the average of the ŷ_i in (3.29), written ŷ̄, to be true to the formula for a correlation coefficient; we know that this average equals ȳ, because the sample average of the residuals is zero and y_i = ŷ_i + û_i.

An important fact about R² is that it never decreases, and it usually increases, when another independent variable is added to a regression. This algebraic fact follows because, by definition, the sum of squared residuals never increases when additional regressors are added to the model. For example, the last digit of one's social security number has nothing to do with one's hourly wage, but adding this digit to a wage equation will increase the R² (by a little, at least).

The fact that R² never decreases when any variable is added to a regression makes it a poor tool for deciding whether one variable or several variables should be added to a model. The factor that should determine whether an explanatory variable belongs in a model is whether the explanatory variable has a nonzero partial effect on y in the population. We will show how to test this hypothesis in Chapter 4, when we cover statistical inference. We will also see that, when used properly, R² allows us to test a group of variables to see if it is important for explaining y. For now, we use it as a goodness-of-fit measure for a given model.

Example 3.4 (Determinants of College GPA)

From the grade point average regression that we did earlier, the equation with R² is

colGPA-hat = 1.29 + .453 hsGPA + .0094 ACT
n = 141, R² = .176.

This means that hsGPA and ACT together explain about 17.6% of the variation in college GPA for this sample of students. This may not seem like a high percentage, but we must remember that there are many other factors (including family background, personality, quality of high school education, and affinity for college) that contribute to a student's college performance. If hsGPA and ACT explained almost all of the variation in colGPA, then performance in college would be preordained by high school performance!
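The decomposition (3.27) and the two expressions for R² in (3.28) and (3.29) can be verified numerically. A sketch with made-up data; a single regressor keeps the OLS formulas short, but the identities hold for any number of regressors when an intercept is included.

```python
# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 2.5, 4.1, 3.9, 6.0, 6.3]
n = len(x)
xb, yb = sum(x) / n, sum(y) / n

# Simple OLS fit
b1 = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / \
     sum((xi - xb) ** 2 for xi in x)
b0 = yb - b1 * xb
yhat = [b0 + b1 * xi for xi in x]

# Sums of squares as in (3.24), (3.25), (3.26); the fitted values have mean yb
sst = sum((yi - yb) ** 2 for yi in y)
sse = sum((fi - yb) ** 2 for fi in yhat)
ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))

# R-squared two ways: the ratio (3.28) and the squared correlation (3.29)
r2_ratio = sse / sst
cov_term = sum((yi - yb) * (fi - yb) for yi, fi in zip(y, yhat))
r2_corr = cov_term ** 2 / (sst * sse)
```

Both computations give the same number, and SST splits exactly into SSE plus SSR, which is why (3.28) is always between zero and one for a regression with an intercept.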
performance in college would be preordained by high school performance Example 35 Explalnlng Arrest Records CRIME1RAW contains data on arrests during the year 1986 and other information on 2725 men born in either 1960 or 1961 in California Each man in the sample was arrested at least once prior to 1986 The variable narr86 is the number of times the man was arrested during 1986 it is zero for most men in the sample 7229 and it varies from 0 to 12 The percentage of men arrested once during 1986 was 2051 The variable pcrw is the proportion not percentage of arrests prior to 1986 that led to conviction avgsen is average sentence lenth served for prior convictions zero for most people ptime86 is month spent in prison in 1986 and qemp86 is the number of quarters during which the man was employed in 1986 from zero to four A linear model explaining arrests is narr86 BU Blpcrw Bzavgsen Baptim286 BAqemp86 u where pcrw is a proxy for the likelihood for being convicted of a crime and avgsen is a measure of expected severity of punishment if convicted The variable ptime86 captures the incarcerative effects of crime if an individual is in prison he cannot be arrested for a crime outside of prison Labor market opportunities are crudely captured by qemp86 First we estimate the model without the variable avgsen We obtain 77 7 712 7 150 pcrw 7 034 ptime86 7 104 qemp86 n 2725 R2 0413 This equation says that as a group the three variables pcrw ptime86 and qemp86 explain about 41 of the variation in narrS Each of the OLS slope coefficients has the anticipated sign An increase in the proportion of convictions lowers the predicted number of arrests If we increase pcrw by 50 a large increase in the probability of conviction then holding the other factors f1xed Am 15050 i 075 This may seem unusual because an arrest cannot change by a fraction But we can use this value to obtain the predicted change in expected arrests for a large group of men For example among 100 men the predicted fall in 
arrests when pcnv increases by .50 is 7.5.

Similarly, a longer prison term leads to a lower predicted number of arrests. In fact, if ptime86 increases from 0 to 12, predicted arrests for a particular man fall by .034(12) = .408. Another quarter in which legal employment is reported lowers predicted arrests by .104, which would be 10.4 arrests among 100 men.

If avgsen is added to the model, we know that R² will increase. The estimated equation is

narr86^ = .707 − .151 pcnv + .0074 avgsen − .037 ptime86 − .103 qemp86
n = 2,725, R² = .0422.

Chapter 3: Multiple Regression Analysis: Estimation

Thus, adding the average sentence variable increases R² from .0413 to .0422, a practically small effect. The sign of the coefficient on avgsen is also unexpected: it says that a longer average sentence length increases criminal activity.

Example 3.5 deserves a final word of caution. The fact that the four explanatory variables included in the second regression explain only about 4.2% of the variation in narr86 does not necessarily mean that the equation is useless. Even though these variables collectively do not explain much of the variation in arrests, it is still possible that the OLS estimates are reliable estimates of the ceteris paribus effects of each independent variable on narr86. As we will see, whether this is the case does not directly depend on the size of R². Generally, a low R² indicates that it is hard to predict individual outcomes on y with much accuracy, something we study in more detail in Chapter 6. In the arrest example, the small R² reflects what we already suspect in the social sciences: it is generally very difficult to predict individual behavior.

Regression through the Origin

Sometimes, an economic theory or common sense suggests that β0 should be zero, and so we should briefly mention OLS estimation when the intercept is zero. Specifically, we now seek an equation of the form

ỹ = β̃1x1 + β̃2x2 + … + β̃kxk,    (3.30)

where the symbol "~" over the estimates is used to distinguish them from the OLS estimates obtained along with the intercept, as in
(3.11). In (3.30), when x1 = 0, x2 = 0, …, xk = 0, the predicted value is zero. In this case, β̃1, …, β̃k are said to be the OLS estimates from the regression of y on x1, x2, …, xk through the origin.

The OLS estimates in (3.30), as always, minimize the sum of squared residuals, but with the intercept set at zero. You should be warned that the properties of OLS that we derived earlier no longer hold for regression through the origin. In particular, the OLS residuals no longer have a zero sample average. Further, if R² is defined as 1 − SSR/SST, where SST is given in (3.24) and SSR is now Σ(yi − β̃1xi1 − … − β̃kxik)² (summing over i = 1, …, n), then R² can actually be negative. This means that the sample average ȳ "explains" more of the variation in the yi than the explanatory variables. Either we should include an intercept in the regression or conclude that the explanatory variables poorly explain y. To always have a nonnegative R-squared, some economists prefer to calculate R² as the squared correlation coefficient between the actual and fitted values of y, as in (3.29). (In this case, the average fitted value must be computed directly, since it no longer equals ȳ.) However, there is no set rule on computing R-squared for regression through the origin.

One serious drawback with regression through the origin is that, if the intercept β0 in the population model is different from zero, then the OLS estimators of the slope parameters will be biased. The bias can be severe in some cases. The cost of estimating an intercept when β0 is truly zero is that the variances of the OLS slope estimators are larger.

Part 1: Regression Analysis with Cross-Sectional Data

3.3 The Expected Value of the OLS Estimators

We now turn to the statistical properties of OLS for estimating the parameters in an underlying population model. In this section, we derive the expected value of the OLS estimators. In particular, we state and discuss four assumptions, which are direct extensions of the simple regression model assumptions, under which the OLS estimators are unbiased for the population parameters.
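The two pitfalls of regression through the origin noted above (residuals that need not average to zero, and a possibly negative R²) are easy to demonstrate numerically. The sketch below uses made-up data, not data from the text; the variable values and the true intercept of 10 are assumptions chosen so that forcing the fitted line through the origin is a clear misspecification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data whose true intercept is far from zero, so forcing the
# fitted line through the origin fits worse than the sample mean does.
n = 100
x = rng.uniform(1.0, 2.0, size=n)
y = 10.0 - 0.1 * x + rng.normal(scale=0.5, size=n)

# OLS through the origin minimizes sum((y - b*x)^2), giving b = x'y / x'x.
b_tilde = (x @ y) / (x @ x)
resid = y - b_tilde * x

sst = np.sum((y - y.mean()) ** 2)   # total sum of squares, about the mean
ssr = np.sum(resid ** 2)            # residual sum of squares through the origin
r2 = 1.0 - ssr / sst

print(r2)             # negative here: ybar "explains" y better than b*x does
print(resid.mean())   # sample average of the residuals is not zero
```

None of this arises when an intercept is included, which is the default throughout the rest of the chapter.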
We also explicitly obtain the bias in OLS when an important variable has been omitted from the regression.

You should remember that statistical properties have nothing to do with a particular sample, but rather with the property of estimators when random sampling is done repeatedly. Thus, Sections 3.3, 3.4, and 3.5 are somewhat abstract. Although we give examples of deriving bias for particular models, it is not meaningful to talk about the statistical properties of a set of estimates obtained from a single sample.

The first assumption we make simply defines the multiple linear regression (MLR) model.

Assumption MLR.1 (Linear in Parameters)

The model in the population can be written as

y = β0 + β1x1 + β2x2 + … + βkxk + u,    (3.31)

where β0, β1, …, βk are the unknown parameters (constants) of interest and u is an unobservable random error or disturbance term.

Equation (3.31) formally states the population model, sometimes called the true model, to allow for the possibility that we might estimate a model that differs from (3.31). The key feature is that the model is linear in the parameters β0, β1, …, βk. As we know, (3.31) is quite flexible because y and the independent variables can be arbitrary functions of the underlying variables of interest, such as natural logarithms and squares (see, for example, equation (3.7)).

Assumption MLR.2 (Random Sampling)

We have a random sample of n observations, {(xi1, xi2, …, xik, yi): i = 1, 2, …, n}, following the population model in Assumption MLR.1.

Sometimes, we need to write the equation for a particular observation i: for a randomly drawn observation from the population, we have

yi = β0 + β1xi1 + β2xi2 + … + βkxik + ui.    (3.32)

Remember that i refers to the observation, and the second subscript on x is the variable number. For example, we can write a CEO salary equation for a particular CEO i as

log(salaryi) = β0 + β1log(salesi) + β2ceoteni + β3ceoteni² + ui.    (3.33)

The term ui contains the unobserved factors for CEO i that affect his or her salary. For applications, it is usually easiest to write the model in population
form, as in (3.31). It contains less clutter and emphasizes the fact that we are interested in estimating a population relationship.

In light of model (3.31), the OLS estimators β̂0, β̂1, β̂2, …, β̂k from the regression of y on x1, …, xk are now considered to be estimators of β0, β1, …, βk. In Section 3.2, we saw that OLS chooses the intercept and slope estimates for a particular sample so that the residuals average to zero and the sample correlation between each independent variable and the residuals is zero. Still, we did not include conditions under which the OLS estimates are well-defined for a given sample. The next assumption fills that gap.

Assumption MLR.3 (No Perfect Collinearity)

In the sample (and therefore in the population), none of the independent variables is constant, and there are no exact linear relationships among the independent variables.

Assumption MLR.3 is more complicated than its counterpart for simple regression because we must now look at relationships between all independent variables. If an independent variable in (3.31) is an exact linear combination of the other independent variables, then we say the model suffers from perfect collinearity, and it cannot be estimated by OLS.

It is important to note that Assumption MLR.3 does allow the independent variables to be correlated; they just cannot be perfectly correlated. If we did not allow for any correlation among the independent variables, then multiple regression would be of very limited use for econometric analysis. For example, in the model relating test scores to educational expenditures and average family income,

avgscore = β0 + β1expend + β2avginc + u,

we fully expect expend and avginc to be correlated: school districts with high average family incomes tend to spend more per student on education. In fact, the primary motivation for including avginc in the equation is that we suspect it is correlated with expend, and so we would like to hold it fixed in the analysis. Assumption MLR.3 only rules out perfect correlation between expend and avginc in our sample. We
would be very unlucky to obtain a sample where per-student expenditures are perfectly correlated with average family income. But some correlation, perhaps a substantial amount, is expected and certainly allowed.

The simplest way that two independent variables can be perfectly correlated is when one variable is a constant multiple of another. This can happen when a researcher inadvertently puts the same variable measured in different units into a regression equation. For example, in estimating a relationship between consumption and income, it makes no sense to include as independent variables income measured in dollars as well as income measured in thousands of dollars. One of these is redundant. What sense would it make to hold income measured in dollars fixed while changing income measured in thousands of dollars?

We already know that different nonlinear functions of the same variable can appear among the regressors. For example, the model cons = β0 + β1inc + β2inc² + u does not violate Assumption MLR.3: even though x2 = inc² is an exact function of x1 = inc, inc² is not an exact linear function of inc. Including inc² in the model is a useful way to generalize functional form, unlike including income measured in dollars and in thousands of dollars.

Common sense tells us not to include the same explanatory variable measured in different units in the same regression equation. There are also more subtle ways that one independent variable can be a multiple of another. Suppose we would like to estimate an extension of a constant elasticity consumption function. It might seem natural to specify a model such as

log(cons) = β0 + β1log(inc) + β2log(inc²) + u,    (3.34)

where x1 = log(inc) and x2 = log(inc²). Using the basic properties of the natural log (see Appendix A), log(inc²) = 2·log(inc). That is, x2 = 2x1, and naturally this holds for all observations in the sample. This violates Assumption MLR.3. What we should do instead is include [log(inc)]², not log(inc²), along with log(inc). This is a
sensible extension of the constant elasticity model, and we will see how to interpret such models in Chapter 6.

Another way that independent variables can be perfectly collinear is when one independent variable can be expressed as an exact linear function of two or more of the other independent variables. For example, suppose we want to estimate the effect of campaign spending on campaign outcomes. For simplicity, assume that each election has two candidates. Let voteA be the percentage of the vote for Candidate A, let expendA be campaign expenditures by Candidate A, let expendB be campaign expenditures by Candidate B, and let totexpend be total campaign expenditures; the latter three variables are all measured in dollars. It may seem natural to specify the model as

voteA = β0 + β1expendA + β2expendB + β3totexpend + u,    (3.35)

in order to isolate the effects of spending by each candidate and the total amount of spending. But this model violates Assumption MLR.3 because x3 = x1 + x2 by definition. Trying to interpret this equation in a ceteris paribus fashion reveals the problem. The parameter β1 in equation (3.35) is supposed to measure the effect of increasing expenditures by Candidate A by one dollar on Candidate A's vote, holding Candidate B's spending and total spending fixed. This is nonsense, because if expendB and totexpend are held fixed, then we cannot increase expendA.

The solution to the perfect collinearity in (3.35) is simple: drop any one of the three variables from the model. We would probably drop totexpend, and then the coefficient on expendA would measure the effect of increasing expenditures by A on the percentage of the vote received by A, holding the spending by B fixed.

The prior examples show that Assumption MLR.3 can fail if we are not careful in specifying our model. Assumption MLR.3 also fails if the sample size, n, is too small in relation to the number of parameters being estimated. In the general regression model in equation (3.31), there are k + 1 parameters, and MLR.3 fails if n < k + 1.
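The perfect collinearity in the campaign-spending model can be seen directly in the matrix of regressors: with totexpend = expendA + expendB, the design matrix has deficient rank, so the OLS normal equations have no unique solution. Here is a minimal sketch with simulated (made-up) spending data, not data from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up spending data; totexpend is an exact linear function of the
# other two regressors, so Assumption MLR.3 fails by construction.
n = 50
expendA = rng.uniform(0.0, 100.0, size=n)
expendB = rng.uniform(0.0, 100.0, size=n)
totexpend = expendA + expendB

# Design matrix with an intercept column: 4 columns, but only rank 3,
# so X'X is singular and OLS is not defined.
X = np.column_stack([np.ones(n), expendA, expendB, totexpend])
print(np.linalg.matrix_rank(X))        # 3, not 4

# Dropping any one of the three spending variables restores full rank.
X_fixed = np.column_stack([np.ones(n), expendA, expendB])
print(np.linalg.matrix_rank(X_fixed))  # 3 = number of columns
```

The same rank requirement underlies the sample-size condition: the design matrix can have full column rank only if n ≥ k + 1.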
Intuitively, this makes sense: to estimate k + 1 parameters, we need at least k + 1 observations. Not surprisingly, it is better to have as many observations as possible, something we will see with our variance calculations in Section 3.4.

[Question 3.3: In the previous example, if we use as explanatory variables expendA, expendB, and shareA, where shareA = 100·(expendA/totexpend) is the percentage share of total campaign expenditures made by Candidate A, does this violate Assumption MLR.3?]

If the model is carefully specified and n ≥ k + 1, Assumption MLR.3 can fail in rare cases due to bad luck in collecting the sample. For example, in a wage equation with education and experience as variables, it is possible that we could obtain a random sample where each individual has exactly twice as much education as years of experience. This scenario would cause Assumption MLR.3 to fail, but it can be considered very unlikely unless we have an extremely small sample size.

The final, and most important, assumption needed for unbiasedness is a direct extension of Assumption SLR.4.

Assumption MLR.4 (Zero Conditional Mean)

The error u has an expected value of zero given any values of the independent variables. In other words,

E(u|x1, x2, …, xk) = 0.    (3.36)

One way that Assumption MLR.4 can fail is if the functional relationship between the explained and explanatory variables is misspecified in equation (3.31): for example, if we forget to include the quadratic term inc² in the consumption function cons = β0 + β1inc + β2inc² + u when estimating the model. Another functional form misspecification occurs when we use the level of a variable when the log of the variable is what actually shows up in the population model, or vice versa. For example, if the true model has log(wage) as the dependent variable but we use wage as the dependent variable in our regression analysis, then the estimators will be biased. Intuitively, this should be pretty clear. We will discuss ways of detecting functional form misspecification in Chapter 9.

Omitting an important factor that is correlated with any of x1, x2, …, xk causes Assumption MLR.4 to fail also. With multiple regression analysis, we are able to include many factors among the explanatory variables, and
omitted variables are less likely to be a problem in multiple regression analysis than in simple regression analysis. Nevertheless, in any application, there are always factors that, due to data limitations or ignorance, we will not be able to include. If we think these factors should be controlled for and they are correlated with one or more of the independent variables, then Assumption MLR.4 will be violated. We will derive this bias later.

There are other ways that u can be correlated with an explanatory variable. In Chapter 15, we will discuss the problem of measurement error in an explanatory variable. In Chapter 16, we cover the conceptually more difficult problem in which one or more of the explanatory variables is determined jointly with y. We must postpone our study of these problems until we have a firm grasp of multiple regression analysis under an ideal set of assumptions.

When Assumption MLR.4 holds, we often say that we have exogenous explanatory variables. If xj is correlated with u for any reason, then xj is said to be an endogenous explanatory variable. The terms "exogenous" and "endogenous" originated in simultaneous equations analysis (see Chapter 16), but the term "endogenous explanatory variable" has evolved to cover any case in which an explanatory variable may be correlated with the error term.

Before we show the unbiasedness of the OLS estimators under MLR.1 to MLR.4, a word of caution. Beginning students of econometrics sometimes confuse Assumptions MLR.3 and MLR.4, but they are quite different. Assumption MLR.3 rules out certain relationships among the independent or explanatory variables and has nothing to do with the error, u. You will know immediately when carrying out OLS estimation whether or not Assumption MLR.3 holds. On the other hand, Assumption MLR.4, the much more important of the two, restricts the relationship between the unobservables in u and the explanatory variables. Unfortunately, we will never know for sure whether the average value of the unobservables is unrelated to
the explanatory variables. But this is the critical assumption.

We are now ready to show unbiasedness of OLS under the first four multiple regression assumptions. As in the simple regression case, the expectations are conditional on the values of the explanatory variables in the sample, something we show explicitly in Appendix 3A but not in the text.

Theorem 3.1 (Unbiasedness of OLS)

Under Assumptions MLR.1 through MLR.4,

E(β̂j) = βj,  j = 0, 1, …, k,    (3.37)

for any values of the population parameter βj. In other words, the OLS estimators are unbiased estimators of the population parameters.

In our previous empirical examples, Assumption MLR.3 has been satisfied (because we have been able to compute the OLS estimates). Furthermore, for the most part, the samples are randomly chosen from a well-defined population. If we believe that the specified models are correct under the key Assumption MLR.4, then we can conclude that OLS is unbiased in these examples.

Since we are approaching the point where we can use multiple regression in serious empirical work, it is useful to remember the meaning of unbiasedness. It is tempting, in examples such as the wage equation in (3.19), to say something like ".092 is an unbiased estimate of the return to education." As we know, an estimate cannot be unbiased: an estimate is a fixed number, obtained from a particular sample, which usually is not equal to the population parameter. When we say that OLS is unbiased under Assumptions MLR.1 through MLR.4, we mean that the procedure by which the OLS estimates are obtained is unbiased when we view the procedure as being applied across all possible random samples. We hope that we have obtained a sample that gives us an estimate close to the population value, but, unfortunately, this cannot be assured. What is assured is that we have no reason to believe our estimate is more likely to be too big or more likely to be too small.

Including Irrelevant Variables in a Regression Model

One issue that we can dispense with fairly quickly is
that of inclusion of an irrelevant variable, or overspecifying the model, in multiple regression analysis. This means that one (or more) of the independent variables is included in the model even though it has no partial effect on y in the population. (That is, its population coefficient is zero.)

To illustrate the issue, suppose we specify the model as

y = β0 + β1x1 + β2x2 + β3x3 + u,    (3.38)

and this model satisfies Assumptions MLR.1 through MLR.4. However, x3 has no effect on y after x1 and x2 have been controlled for, which means that β3 = 0. The variable x3 may or may not be correlated with x1 or x2; all that matters is that, once x1 and x2 are controlled for, x3 has no effect on y. In terms of conditional expectations, E(y|x1, x2, x3) = E(y|x1, x2) = β0 + β1x1 + β2x2.

Because we do not know that β3 = 0, we are inclined to estimate the equation including x3:

ŷ = β̂0 + β̂1x1 + β̂2x2 + β̂3x3.    (3.39)

We have included the irrelevant variable, x3, in our regression. What is the effect of including x3 in (3.39) when its coefficient in the population model (3.38) is zero? In terms of the unbiasedness of β̂1 and β̂2, there is no effect. This conclusion requires no special derivation, as it follows immediately from Theorem 3.1. Remember, unbiasedness means E(β̂j) = βj for any value of βj, including βj = 0. Thus, we can conclude that E(β̂0) = β0, E(β̂1) = β1, E(β̂2) = β2, and E(β̂3) = 0 (for any values of β0, β1, and β2). Even though β̂3 itself will never be exactly zero, its average value across all random samples will be zero.

The conclusion of the preceding example is much more general: including one or more irrelevant variables in a multiple regression model, or overspecifying the model, does not affect the unbiasedness of the OLS estimators. Does this mean it is harmless to include irrelevant variables? No. As we will see in Section 3.4, including irrelevant variables can have undesirable effects on the variances of the OLS estimators.

Omitted Variable Bias: The Simple Case

Now suppose that, rather than including an irrelevant variable, we omit a variable that actually belongs in the true (or population) model.
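Before deriving the omitted-variable bias, the preceding claim about irrelevant regressors can be checked by simulation. The sketch below uses assumed parameter values (β0 = 1, β1 = .5, β2 = −.3, β3 = 0) chosen for illustration, with x3 deliberately correlated with x1, and averages the OLS estimates over many random samples to approximate their expected values:

```python
import numpy as np

rng = np.random.default_rng(2)

n, reps = 200, 2000
estimates = np.zeros((reps, 4))

for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    x3 = 0.6 * x1 + rng.normal(size=n)   # correlated with x1, but beta3 = 0
    u = rng.normal(size=n)
    y = 1.0 + 0.5 * x1 - 0.3 * x2 + u    # x3 has no partial effect on y
    X = np.column_stack([np.ones(n), x1, x2, x3])
    estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]

# Averages across samples are close to (1.0, 0.5, -0.3, 0.0), consistent
# with E(b0) = beta0, E(b1) = beta1, E(b2) = beta2, and E(b3) = 0.
print(estimates.mean(axis=0))
```

Any single sample's β̂3 is almost never exactly zero, but its average across repeated samples is. We return now to the opposite problem: omitting a variable that does belong in the population model.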
This is often called the problem of excluding a relevant variable or underspecifying the model. We claimed in Chapter 2 and earlier in this chapter that this problem generally causes the OLS estimators to be biased. It is time to show this explicitly and, just as importantly, to derive the direction and size of the bias.

Deriving the bias caused by omitting an important variable is an example of misspecification analysis. We begin with the case where the true population model has two explanatory variables and an error term:

y = β0 + β1x1 + β2x2 + u,    (3.40)

and we assume that this model satisfies Assumptions MLR.1 through MLR.4.

Suppose that our primary interest is in β1, the partial effect of x1 on y. For example, y is hourly wage (or log of hourly wage), x1 is education, and x2 is a measure of innate ability. In order to get an unbiased estimator of β1, we should run a regression of y on x1 and x2 (which gives unbiased estimators of β0, β1, and β2). However, due to our ignorance or data unavailability, we estimate the model by excluding x2. In other words, we perform a simple regression of y on x1 only, obtaining the equation

ỹ = β̃0 + β̃1x1.    (3.41)

We use the symbol "~" rather than "^" to emphasize that β̃1 comes from an underspecified model.

When first learning about the omitted variable problem, it can be difficult to distinguish between the underlying true model, (3.40) in this case, and the model that we actually estimate, which is captured by the regression in (3.41). It may seem silly to omit the variable x2 if it belongs in the model, but often we have no choice. For example, suppose that wage is determined by

wage = β0 + β1educ + β2abil + u.    (3.42)

Since ability is not observed, we instead estimate the model

wage = β0 + β1educ + v,

where v = β2abil + u. The estimator of β1 from the simple regression of wage on educ is what we are calling β̃1.

We derive the expected value of β̃1 conditional on the sample values of x1 and x2. Deriving this expectation is not difficult because β̃1 is just the OLS slope estimator from a
simple regression, and we have already studied this estimator extensively in Chapter 2. The difference here is that we must analyze its properties when the simple regression model is misspecified due to an omitted variable.

As it turns out, we have done almost all of the work to derive the bias in the simple regression estimator of β1. From equation (3.23), we have the algebraic relationship β̃1 = β̂1 + β̂2δ̃1, where β̂1 and β̂2 are the slope estimators (if we could have them) from the multiple regression of yi on xi1, xi2 (i = 1, …, n), and δ̃1 is the slope from the simple regression of xi2 on xi1 (i = 1, …, n). Because δ̃1 depends only on the independent variables in the sample, we treat it as fixed (nonrandom) when computing E(β̃1). Further, since the model in (3.40) satisfies Assumptions MLR.1 to MLR.4, we know that β̂1 and β̂2 would be unbiased for β1 and β2, respectively. Therefore,

E(β̃1) = E(β̂1 + β̂2δ̃1) = E(β̂1) + E(β̂2)δ̃1 = β1 + β2δ̃1,    (3.45)

which implies the bias in β̃1 is

Bias(β̃1) = E(β̃1) − β1 = β2δ̃1.    (3.46)

Because the bias in this case arises from omitting the explanatory variable x2, the term on the right-hand side of equation (3.46) is often called the omitted variable bias.

From equation (3.46), we see that there are two cases where β̃1 is unbiased. The first is pretty obvious: if β2 = 0, so that x2 does not appear in the true model (3.40), then β̃1 is unbiased. We already know this from the simple regression analysis in Chapter 2. The second case is more interesting. If δ̃1 = 0, then β̃1 is unbiased for β1, even if β2 ≠ 0.

Because δ̃1 is the sample covariance between x1 and x2 over the sample variance of x1, δ̃1 = 0 if and only if x1 and x2 are uncorrelated in the sample. Thus, we have the important conclusion that, if x1 and x2 are uncorrelated in the sample, then β̃1 is unbiased. This is not surprising: in Section 3.2, we showed that the simple regression estimator β̃1 and the multiple regression estimator β̂1 are the same when x1 and x2 are uncorrelated in the sample. [We can also show that β̃1 is unbiased without conditioning on the xi2: if E(x2|x1) = E(x2), then,
for estimating β1, leaving x2 in the error term does not violate the zero conditional mean assumption for the error, once we adjust the intercept.]

When x1 and x2 are correlated, δ̃1 has the same sign as the correlation between x1 and x2: δ̃1 > 0 if x1 and x2 are positively correlated, and δ̃1 < 0 if x1 and x2 are negatively correlated. The sign of the bias in β̃1 depends on the signs of both β2 and δ̃1 and is summarized in Table 3.2 for the four possible cases when there is bias.

Table 3.2: Summary of Bias in β̃1 When x2 Is Omitted in Estimating Equation (3.40)

            Corr(x1, x2) > 0    Corr(x1, x2) < 0
β2 > 0      positive bias       negative bias
β2 < 0      negative bias       positive bias

Table 3.2 warrants careful study. For example, the bias in β̃1 is positive if β2 > 0 (x2 has a positive effect on y) and x1 and x2 are positively correlated; the bias is negative if β2 > 0 and x1 and x2 are negatively correlated; and so on.

Table 3.2 summarizes the direction of the bias, but the size of the bias is also very important. A small bias of either sign need not be a cause for concern. For example, if the return to education in the population is 8.6% and the bias in the OLS estimator is .1% (a tenth of one percentage point), then we would not be very concerned. On the other hand, a bias on the order of three percentage points would be much more serious. The size of the bias is determined by the sizes of β2 and δ̃1.

In practice, since β2 is an unknown population parameter, we cannot be certain whether β2 is positive or negative. Nevertheless, we usually have a pretty good idea about the direction of the partial effect of x2 on y. Further, even though the sign of the correlation between x1 and x2 cannot be known if x2 is not observed, in many cases we can make an educated guess about whether x1 and x2 are positively or negatively correlated.

In the wage equation (3.42), by definition, more ability leads to higher productivity and therefore higher wages: β2 > 0. Also, there are reasons to believe that educ and abil are
positively correlated: on average, individuals with more innate ability choose higher levels of education. Thus, the OLS estimates from the simple regression equation wage = β0 + β1educ + v are, on average, too large. This does not mean that the estimate obtained from our sample is too big. We can only say that, if we collect many random samples and obtain the simple regression estimates each time, then the average of these estimates will be greater than β1.

Example 3.6: Hourly Wage Equation

Suppose the model log(wage) = β0 + β1educ + β2abil + u satisfies Assumptions MLR.1 through MLR.4. The data set in WAGE1.RAW does not contain data on ability, so we estimate β1 from the simple regression

log(wage)^ = .584 + .083 educ
n = 526, R² = .186.

This is the result from only a single sample, so we cannot say that .083 is greater than β1; the true return to education could be lower or higher than 8.3% (and we will never know for sure). Nevertheless, we know that the average of the estimates across all random samples would be too large.

As a second example, suppose that, at the elementary school level, the average score for students on a standardized exam is determined by

avgscore = β0 + β1expend + β2povrate + u,    (3.48)

where expend is expenditure per student and povrate is the poverty rate of the children in the school. Using school district data, we only have observations on the percentage of students with a passing grade and per-student expenditures; we do not have information on poverty rates. Thus, we estimate β1 from the simple regression of avgscore on expend.

We can again obtain the likely bias in β̃1. First, β2 is probably negative: there is ample evidence that children living in poverty score lower, on average, on standardized tests. Second, the average expenditure per student is probably negatively correlated with the poverty rate: the higher the poverty rate, the lower the average per-student spending, so that Corr(x1, x2) < 0. From Table 3.2, β̃1 will have a positive bias. This observation has important implications. It could be that the true effect of spending is
zero; that is, β1 = 0. However, the simple regression estimate of β1 will usually be greater than zero, and this could lead us to conclude that expenditures are important when they are not.

When reading and performing empirical work in economics, it is important to master the terminology associated with biased estimators. In the context of omitting a variable from model (3.40), if E(β̃1) > β1, then we say that β̃1 has an upward bias. When E(β̃1) < β1, β̃1 has a downward bias. These definitions are the same whether β1 is positive or negative. The phrase biased toward zero refers to cases where E(β̃1) is closer to zero than β1. Therefore, if β1 is positive, then β̃1 is biased toward zero if it has a downward bias. On the other hand, if β1 < 0, then β̃1 is biased toward zero if it has an upward bias.

Omitted Variable Bias: More General Cases

Deriving the sign of omitted variable bias when there are multiple regressors in the estimated model is more difficult. We must remember that correlation between a single explanatory variable and the error generally results in all OLS estimators being biased. For example, suppose the population model

y = β0 + β1x1 + β2x2 + β3x3 + u    (3.49)

satisfies Assumptions MLR.1 through MLR.4. But we omit x3 and estimate the model as

ỹ = β̃0 + β̃1x1 + β̃2x2.    (3.50)

Now, suppose that x2 and x3 are uncorrelated, but that x1 is correlated with x3. In other words, x1 is correlated with the omitted variable, but x2 is not. It is tempting to think that, while β̃1 is probably biased based on the derivation in the previous subsection, β̃2 is unbiased because x2 is uncorrelated with x3. Unfortunately, this is not generally the case: both β̃1 and β̃2 will normally be biased. The only exception to this is when x1 and x2 are also uncorrelated.

Even in the fairly simple model above, it can be difficult to obtain the direction of bias in β̃1 and β̃2. This is because x1, x2, and x3 can all be pairwise correlated. Nevertheless, an approximation is often practically useful. If we assume that x1 and x2 are
uncorrelated, then we can study the bias in β̃1 as if x2 were absent from both the population and the estimated models. In fact, when x1 and x2 are uncorrelated, it can be shown that

E(β̃1) = β1 + β3 · [Σ(xi1 − x̄1)xi3] / [Σ(xi1 − x̄1)²],

where the sums are over i = 1, …, n. This is just like equation (3.45), but β3 replaces β2 and x3 replaces x2 in regression (3.44). Therefore, the bias in β̃1 is obtained by replacing β2 with β3 and x2 with x3 in Table 3.2. If β3 > 0 and Corr(x1, x3) > 0, the bias in β̃1 is positive, and so on.

As an example, suppose we add exper to the wage model:

wage = β0 + β1educ + β2exper + β3abil + u.

If abil is omitted from the model, the estimators of both β1 and β2 are biased, even if we assume exper is uncorrelated with abil. We are mostly interested in the return to education, so it would be nice if we could conclude that β̃1 has an upward or a downward bias due to omitted ability. This conclusion is not possible without further assumptions. As an approximation, let us suppose that, in addition to exper and abil being uncorrelated, educ and exper are also uncorrelated. (In reality, they are somewhat negatively correlated.) Since β3 > 0 and educ and abil are positively correlated, β̃1 would have an upward bias, just as if exper were not in the model.

The reasoning used in the previous example is often followed as a rough guide for obtaining the likely bias in estimators in more complicated models. Usually, the focus is on the relationship between a particular explanatory variable, say x1, and the key omitted factor. Strictly speaking, ignoring all other explanatory variables is a valid practice only when each one is uncorrelated with x1, but it is still a useful guide. Appendix 3A contains a more careful analysis of omitted variable bias with multiple explanatory variables.

3.4 The Variance of the OLS Estimators

We now obtain the variance of the OLS estimators so that, in addition to knowing the central tendencies of the β̂j, we also have a measure of the spread in its sampling distribution. Before
finding the variances, we add a homoskedasticity assumption, as in Chapter 2. We do this for two reasons. First, the formulas are simplified by imposing the constant error variance assumption. Second, in Section 3.5, we will see that OLS has an important efficiency property if we add the homoskedasticity assumption.

In the multiple regression framework, homoskedasticity is stated as follows.

Assumption MLR.5 (Homoskedasticity)

The error u has the same variance given any values of the explanatory variables. In other words, Var(u|x1, …, xk) = σ².

Assumption MLR.5 means that the variance in the error term, u, conditional on the explanatory variables, is the same for all combinations of outcomes of the explanatory variables. If this assumption fails, then the model exhibits heteroskedasticity, just as in the two-variable case.

In the equation

wage = β0 + β1educ + β2exper + β3tenure + u,

homoskedasticity requires that the variance of the unobserved error, u, does not depend on the levels of education, experience, or tenure. That is, Var(u|educ, exper, tenure) = σ². If this variance changes with any of the three explanatory variables, then heteroskedasticity is present.

Assumptions MLR.1 through MLR.5 are collectively known as the Gauss-Markov assumptions (for cross-sectional regression). So far, our statements of the assumptions are suitable only when applied to cross-sectional analysis with random sampling. As we will see, the Gauss-Markov assumptions for time series analysis, and for other situations such as panel data analysis, are more difficult to state, although there are many similarities.

In the discussion that follows, we will use the symbol x to denote the set of all independent variables, (x1, …, xk). Thus, in the wage regression with educ, exper, and tenure as independent variables, x = (educ, exper, tenure). Then we can write Assumptions MLR.1 and MLR.4 as

E(y|x) = β0 + β1x1 + β2x2 + … + βkxk,

and Assumption MLR.5 is the same as Var(y|x) = σ². Stating the assumptions in this way clearly illustrates how Assumption MLR.5 differs greatly from Assumption MLR.4.
Assumption MLR.4 says that the expected value of y, given x, is linear in the parameters, but it certainly depends on x₁, x₂, …, xₖ. Assumption MLR.5 says that the variance of y, given x, does not depend on the values of the independent variables.

We can now obtain the variances of the β̂ⱼ, where we again condition on the sample values of the independent variables. The proof is in the appendix to this chapter.

Theorem 3.2 (Sampling Variances of the OLS Slope Estimators): Under Assumptions MLR.1 through MLR.5, conditional on the sample values of the independent variables,

Var(β̂ⱼ) = σ² / [SSTⱼ(1 − Rⱼ²)],   (3.51)

for j = 1, 2, …, k, where SSTⱼ = Σᵢ₌₁ⁿ (xᵢⱼ − x̄ⱼ)² is the total sample variation in xⱼ, and Rⱼ² is the R-squared from regressing xⱼ on all other independent variables (and including an intercept).

Before we study equation (3.51) in more detail, it is important to know that all of the Gauss-Markov assumptions are used in obtaining this formula. Whereas we did not need the homoskedasticity assumption to conclude that OLS is unbiased, we do need it to validate equation (3.51).

The size of Var(β̂ⱼ) is practically important. A larger variance means a less precise estimator, and this translates into larger confidence intervals and less accurate hypothesis tests (as we will see in Chapter 4). In the next subsection, we discuss the elements comprising (3.51).

The Components of the OLS Variances: Multicollinearity

Equation (3.51) shows that the variance of β̂ⱼ depends on three factors: σ², SSTⱼ, and Rⱼ². Remember that the index j simply denotes any one of the independent variables (such as education or poverty rate). We now consider each of the factors affecting Var(β̂ⱼ) in turn.

The Error Variance, σ². From equation (3.51), a larger σ² means larger variances for the OLS estimators. This is not at all surprising: more "noise" in the equation (a larger σ²) makes it more difficult to estimate the partial effect of any of the independent variables on y, and this is reflected in higher variances for the OLS slope estimators. Because σ² is a feature of the population, it has nothing to do with the
sample size. It is the one component of (3.51) that is unknown. We will see later how to obtain an unbiased estimator of σ².

For a given dependent variable y, there is really only one way to reduce the error variance, and that is to add more explanatory variables to the equation (take some factors out of the error term). Unfortunately, it is not always possible to find additional legitimate factors that affect y.

The Total Sample Variation in xⱼ, SSTⱼ. From equation (3.51), we see that the larger the total variation in xⱼ is, the smaller is Var(β̂ⱼ). Thus, everything else being equal, for estimating βⱼ we prefer to have as much sample variation in xⱼ as possible. We already discovered this in the simple regression case in Chapter 2. Although it is rarely possible for us to choose the sample values of the independent variables, there is a way to increase the sample variation in each of the independent variables: increase the sample size. In fact, when sampling randomly from a population, SSTⱼ increases without bound as the sample size gets larger and larger. This is the component of the variance that systematically depends on the sample size.

When SSTⱼ is small, Var(β̂ⱼ) can get very large, but a small SSTⱼ is not a violation of Assumption MLR.3. Technically, as SSTⱼ goes to zero, Var(β̂ⱼ) approaches infinity. The extreme case of no sample variation in xⱼ, SSTⱼ = 0, is not allowed by Assumption MLR.3.

The Linear Relationships among the Independent Variables, Rⱼ². The term Rⱼ² in equation (3.51) is the most difficult of the three components to understand. This term does not appear in simple regression analysis because there is only one independent variable in such cases. It is important to see that this R-squared is distinct from the R-squared in the regression of y on x₁, x₂, …, xₖ: Rⱼ² is obtained from a regression involving only the independent variables in the original model, where xⱼ plays the role of a dependent variable.

Consider first the k = 2 case: y = β₀ + β₁x₁ + β₂x₂ + u. Then

Var(β̂₁) = σ²/[SST₁(1 − R₁²)],

where R₁² is the R-squared from the simple regression of x₁ on x₂ (and an intercept, as always). Because the R-squared measures goodness-of-fit, a value of R₁² close to one indicates that x₂ explains much of the variation in x₁ in the sample. This means that x₁ and x₂ are highly correlated.

As R₁² increases to one, Var(β̂₁) gets larger and larger. Thus, a high degree of linear relationship between x₁ and x₂ can lead to large variances for the OLS slope estimators. (A similar argument applies to β̂₂.) See Figure 3.1 for the relationship between Var(β̂₁) and the R-squared from the regression of x₁ on x₂.

In the general case, Rⱼ² is the proportion of the total variation in xⱼ that can be explained by the other independent variables appearing in the equation. For a given σ² and SSTⱼ, the smallest Var(β̂ⱼ) is obtained when Rⱼ² = 0, which happens if, and only if, xⱼ has zero sample correlation with every other independent variable. This is the best case for estimating βⱼ, but it is rarely encountered.

The other extreme case, Rⱼ² = 1, is ruled out by Assumption MLR.3, because Rⱼ² = 1 means that, in the sample, xⱼ is a perfect linear combination of some of the other independent variables in the regression. A more relevant case is when Rⱼ² is "close" to one. From equation (3.51) and Figure 3.1, we see that this can cause Var(β̂ⱼ) to be large: Var(β̂ⱼ) → ∞ as Rⱼ² → 1. High (but not perfect) correlation between two or more independent variables is called multicollinearity.

[Figure 3.1: Var(β̂₁) as a function of R₁².]

Before we discuss the multicollinearity issue further, it is important to be very clear on one thing: a case where Rⱼ² is close to one is not a violation of Assumption MLR.3.

Since multicollinearity violates none of our assumptions, the "problem" of multicollinearity is not really well defined. When we say that multicollinearity arises for estimating βⱼ when Rⱼ² is "close" to one, we put "close" in quotation marks because there is no absolute number that we can cite to conclude that multicollinearity is a problem. For
example, Rⱼ² = .9 means that 90% of the sample variation in xⱼ can be explained by the other independent variables in the regression model. Unquestionably, this means that xⱼ has a strong linear relationship to the other independent variables. But whether this translates into a Var(β̂ⱼ) that is too large to be useful depends on the sizes of σ² and SSTⱼ. As we will see in Chapter 4, for statistical inference, what ultimately matters is how big β̂ⱼ is in relation to its standard deviation.

Just as a large value of Rⱼ² can cause a large Var(β̂ⱼ), so can a small value of SSTⱼ. Therefore, a small sample size can lead to large sampling variances, too. Worrying about high degrees of correlation among the independent variables in the sample is really no different from worrying about a small sample size: both work to increase Var(β̂ⱼ). The famous University of Wisconsin econometrician Arthur Goldberger, reacting to econometricians' obsession with multicollinearity, has (tongue in cheek) coined the term micronumerosity, which he defines as the "problem of small sample size." (For an engaging discussion of multicollinearity and micronumerosity, see Goldberger [1991].)

Although the problem of multicollinearity cannot be clearly defined, one thing is clear: everything else being equal, for estimating βⱼ it is better to have less correlation between xⱼ and the other independent variables. This observation often leads to a discussion of how to "solve" the multicollinearity problem. In the social sciences, where we are usually passive collectors of data, there is no good way to reduce variances of unbiased estimators other than to collect more data. For a given data set, we can try dropping other independent variables from the model in an effort to reduce multicollinearity. Unfortunately, dropping a variable that belongs in the population model can lead to bias, as we saw in Section 3.3.

Perhaps an example at this point will help clarify some of the issues raised concerning multicollinearity. Suppose we are interested in estimating the effect of various school expenditure categories on student performance. It is likely that expenditures on teacher salaries, instructional materials, athletics, and
so on are highly correlated. (Wealthier schools tend to spend more on everything.) Not surprisingly, it can be difficult to estimate the effect of any particular expenditure category on student performance when there is little variation in one category that cannot largely be explained by variations in the other expenditure categories (this leads to high Rⱼ² for each of the expenditure variables). Such problems of multicollinearity can be mitigated by collecting more data, but in a sense we have imposed the problem on ourselves: we are asking questions that may be too subtle for the available data to answer with any precision. We can probably do much better by changing the scope of the analysis and lumping all expenditure categories together, since we would then no longer be trying to estimate the partial effect of each separate category.

Another important point is that a high degree of correlation between certain independent variables can be irrelevant as to how well we can estimate other parameters in the model. For example, consider a model with three independent variables:

y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + u,

where x₂ and x₃ are highly correlated. Then Var(β̂₂) and Var(β̂₃) may be large. But the amount of correlation between x₂ and x₃ has no direct effect on Var(β̂₁). In fact, if x₁ is uncorrelated with x₂ and x₃, then R₁² = 0 and Var(β̂₁) = σ²/SST₁, regardless of how much correlation there is between x₂ and x₃. If β₁ is the parameter of interest, we do not really care about the amount of correlation between x₂ and x₃.

[Question 3.4: Suppose you postulate a model explaining final exam score in terms of class attendance. Thus, the dependent variable is final exam score, and the key explanatory variable is number of classes attended. To control for student abilities and efforts outside the classroom, you include among the explanatory variables cumulative GPA, SAT score, and measures of high school performance. Someone says, "You cannot hope to learn anything from this exercise because cumulative GPA, SAT score, and high school performance are likely to be highly collinear." What should be your response?]

The previous observation is important because economists often include many controls in order to isolate the causal effect of a particular variable. For example, in looking at the relationship between loan approval rates and percentage of minorities in a neighborhood, we might include variables like average income, average housing value, measures of creditworthiness, and so on, because these factors need to be accounted for in order to draw causal conclusions about discrimination. Income, housing prices, and creditworthiness are generally highly correlated with each other. But high correlations among these controls do not make
it more difficult to determine the effects of discrimination.

Some researchers find it useful to compute statistics intended to determine the severity of multicollinearity in a given application. Unfortunately, it is easy to misuse such statistics because, as we have discussed, we cannot specify how much correlation among explanatory variables is "too much." Some multicollinearity "diagnostics" are omnibus statistics in the sense that they detect a strong linear relationship among any subset of explanatory variables. For reasons that we just saw, such statistics are of questionable value because they might reveal a "problem" simply because two control variables, whose coefficients we do not care about, are highly correlated. Probably the most common omnibus multicollinearity statistic is the so-called condition number, which is defined in terms of the full data matrix and is beyond the scope of this text. (See, for example, Belsley, Kuh, and Welsch [1980].)

Somewhat more useful, but still prone to misuse, are statistics for individual coefficients. The most common of these is the variance inflation factor (VIF), which is obtained directly from equation (3.51). The VIF for slope coefficient j is simply

VIFⱼ = 1/(1 − Rⱼ²),

precisely the term in Var(β̂ⱼ) that is determined by correlation between xⱼ and the other explanatory variables. VIFⱼ is the factor by which Var(β̂ⱼ) is higher because xⱼ is not uncorrelated with all other explanatory variables. Because VIFⱼ is a function of Rⱼ² (indeed, Figure 3.1 is essentially a graph of VIF₁), our previous discussion can be cast entirely in terms of the VIF. For example, if we had the choice, we would like VIFⱼ to be smaller (other things equal). But we rarely have the choice. If we think certain explanatory variables need to be included in a regression to infer causality of xⱼ, then we are hesitant to drop them, and whether we think VIFⱼ is "too high" cannot really affect that decision. If, say, our main interest is in the causal effect of x₁ on y, then we should ignore entirely the VIFs of other
coefficients.

Finally, setting a cutoff value for VIF above which we conclude multicollinearity is a "problem" is arbitrary and not especially helpful. Sometimes the value 10 is chosen: if VIFⱼ is above 10 (equivalently, Rⱼ² is above .9), then we conclude that multicollinearity is a "problem" for estimating βⱼ. But a VIFⱼ above 10 does not mean that the standard deviation of β̂ⱼ is too large to be useful, because the standard deviation also depends on σ and SSTⱼ, and the latter can be increased by increasing the sample size. Therefore, just as with looking at the size of Rⱼ² directly, looking at the size of VIFⱼ is of limited use, although one might want to do so out of curiosity.

Variances in Misspecified Models

The choice of whether to include a particular variable in a regression model can be made by analyzing the tradeoff between bias and variance. In Section 3.3, we derived the bias induced by leaving out a relevant variable when the true model contains two explanatory variables. We continue the analysis of this model by comparing the variances of the OLS estimators.

Write the true population model, which satisfies the Gauss-Markov assumptions, as

y = β₀ + β₁x₁ + β₂x₂ + u.

We consider two estimators of β₁. The estimator β̂₁ comes from the multiple regression

ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂.   (3.52)

In other words, we include x₂, along with x₁, in the regression model. The estimator β̃₁ is obtained by omitting x₂ from the model and running a simple regression of y on x₁:

ỹ = β̃₀ + β̃₁x₁.   (3.53)

When β₂ ≠ 0, equation (3.53) excludes a relevant variable from the model and, as we saw in Section 3.3, this induces a bias in β̃₁ unless x₁ and x₂ are uncorrelated. On the other hand, β̂₁ is unbiased for β₁ for any value of β₂, including β₂ = 0. It follows that, if bias is used as the only criterion, β̂₁ is preferred to β̃₁.

The conclusion that β̂₁ is always preferred to β̃₁ does not carry over when we bring variance into the picture. Conditioning on the values of x₁ and x₂ in the sample, we have, from (3.51),

Var(β̂₁) = σ²/[SST₁(1 − R₁²)],   (3.54)
where SST₁ is the total variation in x₁, and R₁² is the R-squared from the regression of x₁ on x₂. Further, a simple modification of the proof in Chapter 2 for two-variable regression shows that

Var(β̃₁) = σ²/SST₁.   (3.55)

Comparing (3.55) to (3.54) shows that Var(β̃₁) is always smaller than Var(β̂₁), unless x₁ and x₂ are uncorrelated in the sample, in which case the two estimators β̃₁ and β̂₁ are the same. Assuming that x₁ and x₂ are not uncorrelated, we can draw the following conclusions:

1. When β₂ ≠ 0, β̃₁ is biased, β̂₁ is unbiased, and Var(β̃₁) < Var(β̂₁).
2. When β₂ = 0, β̃₁ and β̂₁ are both unbiased, and Var(β̃₁) < Var(β̂₁).

From the second conclusion, it is clear that β̃₁ is preferred if β₂ = 0. Intuitively, if x₂ does not have a partial effect on y, then including it in the model can only exacerbate the multicollinearity problem, which leads to a less efficient estimator of β₁. A higher variance for the estimator of β₁ is the cost of including an irrelevant variable in a model.

The case where β₂ ≠ 0 is more difficult. Leaving x₂ out of the model results in a biased estimator of β₁. Traditionally, econometricians have suggested comparing the likely size of the bias due to omitting x₂ with the reduction in the variance (summarized in the size of R₁²) to decide whether x₂ should be included. However, when β₂ ≠ 0, there are two favorable reasons for including x₂ in the model. The most important of these is that any bias in β̃₁ does not shrink as the sample size grows; in fact, the bias does not necessarily follow any pattern. Therefore, we can usefully think of the bias as being roughly the same for any sample size. On the other hand, Var(β̃₁) and Var(β̂₁) both shrink to zero as n gets large, which means that the multicollinearity induced by adding x₂ becomes less important as the sample size grows. In large samples, we would prefer β̂₁.

The other reason for favoring β̂₁ is more subtle. The variance formula in (3.55) is conditional on the values of xᵢ₁ and xᵢ₂ in the sample, which provides the best scenario for β̃₁. When β₂ ≠ 0, the variance of β̃₁ conditional only on x₁ is
larger than that presented in (3.55). Intuitively, when β₂ ≠ 0 and x₂ is excluded from the model, the error variance increases because the error effectively contains part of x₂. But (3.55) ignores the error variance increase because it treats both regressors as nonrandom. A full discussion of which independent variables to condition on would lead us too far astray. It is sufficient to say that (3.55) is too generous when it comes to measuring the precision in β̃₁.

Estimating σ²: Standard Errors of the OLS Estimators

We now show how to choose an unbiased estimator of σ², which then allows us to obtain unbiased estimators of Var(β̂ⱼ).

Because σ² = E(u²), an unbiased "estimator" of σ² is the sample average of the squared errors: n⁻¹Σᵢ₌₁ⁿ uᵢ². Unfortunately, this is not a true estimator, because we do not observe the uᵢ. Nevertheless, recall that the errors can be written as uᵢ = yᵢ − β₀ − β₁xᵢ₁ − β₂xᵢ₂ − … − βₖxᵢₖ, and so the reason we do not observe the uᵢ is that we do not know the βⱼ. When we replace each βⱼ with its OLS estimator, we get the OLS residuals:

ûᵢ = yᵢ − β̂₀ − β̂₁xᵢ₁ − β̂₂xᵢ₂ − … − β̂ₖxᵢₖ.

It seems natural to estimate σ² by replacing uᵢ with the ûᵢ. In the simple regression case, we saw that this leads to a biased estimator. The unbiased estimator of σ² in the general multiple regression case is

σ̂² = (Σᵢ₌₁ⁿ ûᵢ²)/(n − k − 1) = SSR/(n − k − 1).   (3.56)

We already encountered this estimator in the k = 1 case in simple regression.

The term n − k − 1 in (3.56) is the degrees of freedom (df) for the general OLS problem with n observations and k independent variables. Since there are k + 1 parameters in a regression model with k independent variables and an intercept, we can write

df = n − (k + 1)   (3.57)
   = (number of observations) − (number of estimated parameters).

This is the easiest way to compute the degrees of freedom in a particular application: count the number of parameters, including the intercept, and subtract this amount from the number of observations. (In the rare case that an intercept is not estimated, the number of
parameters decreases by one.)

Technically, the division by n − k − 1 in (3.56) comes from the fact that the expected value of the sum of squared residuals is E(SSR) = (n − k − 1)σ². Intuitively, we can figure out why the degrees of freedom adjustment is necessary by returning to the first order conditions for the OLS estimators. These can be written as Σᵢ₌₁ⁿ ûᵢ = 0 and Σᵢ₌₁ⁿ xᵢⱼûᵢ = 0, where j = 1, 2, …, k. Thus, in obtaining the OLS estimates, k + 1 restrictions are imposed on the OLS residuals. This means that, given n − (k + 1) of the residuals, the remaining k + 1 residuals are known: there are only n − (k + 1) degrees of freedom in the residuals. (This can be contrasted with the errors uᵢ, which have n degrees of freedom in the sample.)

For reference, we summarize this discussion with Theorem 3.3. We proved this theorem for the case of simple regression analysis in Chapter 2 (see Theorem 2.3). (A general proof that requires matrix algebra is provided in Appendix E.)

Theorem 3.3 (Unbiased Estimation of σ²): Under the Gauss-Markov assumptions MLR.1 through MLR.5, E(σ̂²) = σ².

The positive square root of σ̂², denoted σ̂, is called the standard error of the regression (SER). The SER is an estimator of the standard deviation of the error term. This estimate is usually reported by regression packages, although it is called different things by different packages. (In addition to SER, σ̂ is also called the standard error of the estimate and the root mean squared error.)

Note that σ̂ can either decrease or increase when another independent variable is added to a regression (for a given sample). This is because, although SSR must fall when another explanatory variable is added, the degrees of freedom also falls by one. Because SSR is in the numerator and df is in the denominator, we cannot tell beforehand which effect will dominate.

For constructing confidence intervals and conducting tests in Chapter 4, we will need to estimate the standard deviation of β̂ⱼ, which is just the square root of the variance:

sd(β̂ⱼ) = σ/[SSTⱼ(1 − Rⱼ²)]¹ᐟ².

Since σ is
unknown, we replace it with its estimator, σ̂. This gives us the standard error of β̂ⱼ:

se(β̂ⱼ) = σ̂/[SSTⱼ(1 − Rⱼ²)]¹ᐟ².   (3.58)

Just as the OLS estimates can be obtained for any given sample, so can the standard errors. Since se(β̂ⱼ) depends on σ̂, the standard error has a sampling distribution, which will play a role in Chapter 4.

We should emphasize one thing about standard errors. Because (3.58) is obtained directly from the variance formula in (3.51), and because (3.51) relies on the homoskedasticity Assumption MLR.5, it follows that the standard error formula in (3.58) is not a valid estimator of sd(β̂ⱼ) if the errors exhibit heteroskedasticity. Thus, while the presence of heteroskedasticity does not cause bias in the β̂ⱼ, it does lead to bias in the usual formula for Var(β̂ⱼ), which then invalidates the standard errors. This is important because any regression package computes (3.58) as the default standard error for each coefficient (with a somewhat different representation for the intercept). If we suspect heteroskedasticity, then the usual OLS standard errors are invalid, and some corrective action should be taken. We will see in Chapter 8 what methods are available for dealing with heteroskedasticity.

3.5 Efficiency of OLS: The Gauss-Markov Theorem

In this section, we state and discuss the important Gauss-Markov Theorem, which justifies the use of the OLS method rather than using a variety of competing estimators. We know one justification for OLS already: under Assumptions MLR.1 through MLR.4, OLS
is unbiased. However, there are many unbiased estimators of the βⱼ under these assumptions (for example, see Problem 3.13). Might there be other unbiased estimators with variances smaller than the OLS estimators?

If we limit the class of competing estimators appropriately, then we can show that OLS is best within this class. Specifically, we will argue that, under Assumptions MLR.1 through MLR.5, the OLS estimator β̂ⱼ for βⱼ is the best linear unbiased estimator (BLUE). To state the theorem, we need to understand each component of the acronym "BLUE." First, we know what an estimator is: it is a rule that can be applied to any sample of data to produce an estimate. We also know what an unbiased estimator is: in the current context, an estimator, say β̃ⱼ, of βⱼ is an unbiased estimator of βⱼ if E(β̃ⱼ) = βⱼ for any β₀, β₁, …, βₖ.

What about the meaning of the term "linear"? In the current context, an estimator β̃ⱼ of βⱼ is linear if, and only if, it can be expressed as a linear function of the data on the dependent variable:

β̃ⱼ = Σᵢ₌₁ⁿ wᵢⱼyᵢ,   (3.59)

where each wᵢⱼ can be a function of the sample values of all the independent variables. The OLS estimators are linear, as can be seen from equation (3.22).

Finally, how do we define "best"? For the current theorem, best is defined as smallest variance. Given two unbiased estimators, it is logical to prefer the one with the smallest variance (see Appendix C).

Now, let β̂₀, β̂₁, …, β̂ₖ denote the OLS estimators in model (3.31) under Assumptions MLR.1 through MLR.5. The Gauss-Markov Theorem says that, for any estimator β̃ⱼ that is linear and unbiased, Var(β̂ⱼ) ≤ Var(β̃ⱼ), and the inequality is usually strict. In other words, in the class of linear unbiased estimators, OLS has the smallest variance (under the five Gauss-Markov assumptions). Actually, the theorem says more than this. If we want to estimate any linear function of the βⱼ, then the corresponding linear combination of the OLS estimators achieves the smallest variance among all linear unbiased estimators. We conclude with a theorem, which is proven in Appendix 3A.

Theorem 3.4 (Gauss-Markov Theorem): Under Assumptions MLR.1 through MLR.5, β̂₀, β̂₁, …, β̂ₖ are the best linear unbiased estimators (BLUEs) of β₀, β₁, …, βₖ, respectively.

It is because of this theorem that Assumptions MLR.1 through MLR.5 are known as the Gauss-Markov assumptions (for cross-sectional analysis).

The importance of the Gauss-Markov Theorem is that, when the standard set of assumptions holds, we need not look for alternative unbiased estimators of the form in (3.59): none will be better than OLS. Equivalently, if we are presented with an estimator that is both linear
and unbiased, then we know that the variance of this estimator is at least as large as the OLS variance; no additional calculation is needed to show this.

For our purposes, Theorem 3.4 justifies the use of OLS to estimate multiple regression models. If any of the Gauss-Markov assumptions fail, then this theorem no longer holds. We already know that failure of the zero conditional mean assumption (Assumption MLR.4) causes OLS to be biased, so Theorem 3.4 also fails. We also know that heteroskedasticity (failure of Assumption MLR.5) does not cause OLS to be biased. However, OLS no longer has the smallest variance among linear unbiased estimators in the presence of heteroskedasticity. In Chapter 8, we analyze an estimator that improves upon OLS when we know the brand of heteroskedasticity.

Summary

1. The multiple regression model allows us to effectively hold other factors fixed while examining the effects of a particular independent variable on the dependent variable. It explicitly allows the independent variables to be correlated.

2. Although the model is linear in its parameters, it can be used to model nonlinear relationships by appropriately choosing the dependent and independent variables.

3. The method of ordinary least squares is easily applied to estimate the multiple regression model. Each slope estimate measures the partial effect of the corresponding independent variable on the dependent variable, holding all other independent variables fixed.

4. R² is the proportion of the sample variation in the dependent variable explained by the independent variables, and it serves as a goodness-of-fit measure. It is important not to put too much weight on the value of R² when evaluating econometric models.

5. Under the first four Gauss-Markov assumptions (MLR.1 through MLR.4), the OLS estimators are unbiased. This implies that including an irrelevant variable in a model has no effect on the unbiasedness of the intercept and other slope estimators. On the other
hand, omitting a relevant variable causes OLS to be biased. In many circumstances, the direction of the bias can be determined.

6. Under the five Gauss-Markov assumptions, the variance of an OLS slope estimator is given by Var(β̂ⱼ) = σ²/[SSTⱼ(1 − Rⱼ²)]. As the error variance σ² increases, so does Var(β̂ⱼ), while Var(β̂ⱼ) decreases as the sample variation in xⱼ, SSTⱼ, increases. The term Rⱼ² measures the amount of collinearity between xⱼ and the other explanatory variables. As Rⱼ² approaches one, Var(β̂ⱼ) is unbounded.

7. Adding an irrelevant variable to an equation generally increases the variances of the remaining OLS estimators because of multicollinearity.

8. Under the Gauss-Markov assumptions (MLR.1 through MLR.5), the OLS estimators are the best linear unbiased estimators (BLUEs).

The Gauss-Markov Assumptions

The following is a summary of the five Gauss-Markov assumptions that we used in this chapter. Remember, the first four were used to establish unbiasedness of OLS, whereas the fifth was added to derive the usual variance formulas and to conclude that OLS is best linear unbiased.

Assumption MLR.1 (Linear in Parameters): The model in the population can be written as y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ + u, where β₀, β₁, …, βₖ are the unknown parameters (constants) of interest and u is an unobservable random error or disturbance term.

Assumption MLR.2 (Random Sampling): We have a random sample of n observations, {(xᵢ₁, xᵢ₂, …, xᵢₖ, yᵢ): i = 1, 2, …, n}, following the population model in Assumption MLR.1.

Assumption MLR.3 (No Perfect Collinearity): In the sample (and therefore in the population), none of the independent variables is constant, and there are no exact linear relationships among the independent variables.

Assumption MLR.4 (Zero Conditional Mean): The error u has an expected value of zero given any values of the independent variables. In other words, E(u|x₁, x₂, …, xₖ) = 0.

Assumption MLR.5 (Homoskedasticity): The error u has the same variance given any value of the explanatory variables. In other words, Var(u|x₁, …, xₖ) = σ².

Key Terms

Best Linear
Unbiased Estimator (BLUE); Biased Toward Zero; Ceteris Paribus; Degrees of Freedom (df); Disturbance; Downward Bias; Endogenous Explanatory Variable; Error Term; Excluding a Relevant Variable; Exogenous Explanatory Variable; Explained Sum of Squares (SSE); First Order Conditions; Gauss-Markov Assumptions; Gauss-Markov Theorem; Inclusion of an Irrelevant Variable; Intercept; Micronumerosity; Misspecification Analysis; Multicollinearity; Multiple Linear Regression Model; Multiple Regression Analysis; OLS Intercept Estimate; OLS Regression Line; OLS Slope Estimate; Omitted Variable Bias; Ordinary Least Squares; Overspecifying the Model; Partial Effect; Perfect Collinearity; Population Model; Residual; Residual Sum of Squares; Sample Regression Function (SRF); Slope Parameter; Standard Deviation of β̂ⱼ; Standard Error of β̂ⱼ; Standard Error of the Regression (SER); Sum of Squared Residuals (SSR); Total Sum of Squares (SST); True Model; Underspecifying the Model; Upward Bias; Variance Inflation Factor (VIF)

Problems

3.1 Using the data in GPA2.RAW on 4,137 college students, the following equation was estimated by OLS:

colgpa = 1.392 − .0135 hsperc + .00148 sat
n = 4,137, R² = .273,

where colgpa is measured on a four-point scale, hsperc is the percentile in the high school graduating class (defined so that, for example, hsperc = 5 means the top 5% of the class), and sat is the combined math and verbal scores on the student achievement test.
(i) Why does it make sense for the coefficient on hsperc to be negative?
(ii) What is the predicted college GPA when hsperc = 20 and sat = 1,050?
(iii) Suppose that two high school graduates, A and B, graduated in the same percentile from high school, but Student A's SAT score was 140 points higher (about one standard deviation in the sample). What is the predicted difference in college GPA for these two students? Is the difference large?
(iv) Holding hsperc fixed, what difference in SAT scores leads to a predicted colgpa difference of .50, or one-half of a grade point? Comment on your answer.

3.2 The data in
WAGE2.RAW on working men was used to estimate the following equation:

educ = 10.36 − .094 sibs + .131 meduc + .210 feduc
n = 722, R² = .214,

where educ is years of schooling, sibs is number of siblings, meduc is mother's years of schooling, and feduc is father's years of schooling.
(i) Does sibs have the expected effect? Explain. Holding meduc and feduc fixed, by how much does sibs have to increase to reduce predicted years of education by one year? (A noninteger answer is acceptable here.)
(ii) Discuss the interpretation of the coefficient on meduc.
(iii) Suppose that Man A has no siblings, and his mother and father each have 12 years of education. Man B has no siblings, and his mother and father each have 16 years of education. What is the predicted difference in years of education between B and A?

3.3 The following model is a simplified version of the multiple regression model used by Biddle and Hamermesh (1990) to study the tradeoff between time spent sleeping and working and to look at other factors affecting sleep:

sleep = β₀ + β₁totwrk + β₂educ + β₃age + u,

where sleep and totwrk (total work) are measured in minutes per week and educ and age are measured in years. (See also Computer Exercise C3.3.)
(i) If adults trade off sleep for work, what is the sign of β₁?
(ii) What signs do you think β₂ and β₃ will have?
(iii) Using the data in SLEEP75.RAW, the estimated equation is

sleep = 3,638.25 − .148 totwrk − 11.13 educ + 2.20 age
n = 706, R² = .113.

If someone works five more hours per week, by how many minutes is sleep predicted to fall? Is this a large tradeoff?
(iv) Discuss the sign and magnitude of the estimated coefficient on educ.
(v) Would you say totwrk, educ, and age explain much of the variation in sleep? What other factors might affect the time spent sleeping? Are these likely to be correlated with totwrk?

3.4 The median starting salary for new law school graduates is determined by

log(salary) = β₀ + β₁LSAT + β₂GPA + β₃log(libvol) + β₄log(cost) + β₅rank + u,

where LSAT is the median LSAT score for the graduating class, GPA is the median college GPA for the class, libvol is the number of volumes in the law school library, cost is the annual cost of attending law school, and rank is a law school ranking (with rank = 1 being the best).
(i) Explain why we expect β₅ ≤ 0.
(ii) What signs do you expect for the other slope parameters? Justify your answers.
(iii) Using the data in LAWSCH85.RAW, the estimated equation is

log(salary) = 8.34 + .0047 LSAT + .248 GPA + .095 log(libvol) + .038 log(cost) − .0033 rank
n = 136, R² = .842.

What is the predicted ceteris paribus difference in salary for schools with a median GPA different by one point? (Report your answer as a percentage.)
(iv) Interpret the coefficient on the variable log(libvol).
(v) Would you say it is better to attend a higher ranked law school? How much is a difference in ranking of 20 worth in terms of predicted starting salary?

3.5 In a study relating college grade point average to time spent in various activities, you distribute a survey to several students. The students are asked how many hours they spend each week in four activities: studying, sleeping, working, and leisure. Any activity is put into one of the four categories, so that for each student, the sum of hours in the four activities must be 168.
(i) In the model GPA = β₀ + β₁study + β₂sleep + β₃work + β₄leisure + u, does it make sense to hold sleep, work, and leisure fixed while changing study?
(ii) Explain why this model violates Assumption MLR.3.
(iii) How could you reformulate the model so that its parameters have a useful interpretation and it satisfies Assumption MLR.3?

3.6 Consider the multiple regression model containing three independent variables, under Assumptions MLR.1 through MLR.4:

y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + u.

You are interested in estimating the sum of the parameters on x₁ and x₂; call this θ₁ = β₁ + β₂.
(i) Show that θ̂₁ = β̂₁ + β̂₂ is an unbiased estimator of θ₁.
(ii) Find Var(θ̂₁) in terms of Var(β̂₁), Var(β̂₂), and Corr(β̂₁, β̂₂).

3.7 Which of the following can cause OLS estimators to be biased?
(i) Heteroskedasticity.
(ii) Omitting an important variable.
(iii) A sample
correlation coefficient of .95 between two independent variables, both included in the model.

Suppose that average worker productivity at manufacturing firms (avgprod) depends on two factors, average hours of training (avgtrain) and average worker ability (avgabil):

avgprod = β₀ + β₁avgtrain + β₂avgabil + u.

Assume that this equation satisfies the Gauss-Markov assumptions. If grants have been given to firms whose workers have less than average ability, so that avgtrain and avgabil are negatively correlated, what is the likely bias in β̃₁ obtained from the simple regression of avgprod on avgtrain?

The following equation describes the median housing price in a community in terms of amount of pollution (nox, for nitrous oxide) and the average number of rooms in houses in the community (rooms):

log(price) = β₀ + β₁log(nox) + β₂rooms + u.

(i) What are the probable signs of β₁ and β₂? What is the interpretation of β₁? Explain.
(ii) Why might nox (or, more precisely, log(nox)) and rooms be negatively correlated? If this is the case, does the simple regression of log(price) on log(nox) produce an upward or a downward biased estimator of β₁?
(iii) Using the data in HPRICE2.RAW, the following equations were estimated:

log(price) = 11.71 − 1.043 log(nox), n = 506, R² = .264.
log(price) = 9.23 − .718 log(nox) + .306 rooms, n = 506, R² = .514.

Is the relationship between the simple and multiple regression estimates of the elasticity of price with respect to nox what you would have predicted, given your answer in part (ii)? Does this mean that −.718 is definitely closer to the true elasticity than −1.043?

Suppose that you are interested in estimating the ceteris paribus relationship between y and x₁. For this purpose, you can collect data on two control variables, x₂ and x₃. (For concreteness, you might think of y as final exam score, x₁ as class attendance, x₂ as GPA up through the previous semester, and x₃ as SAT or ACT score.) Let β̃₁ be the simple regression estimate from y on x₁ and let β̂₁ be the multiple regression estimate from y on x₁, x₂, x₃.
(i) If x₁ is highly correlated with x₂ and x₃ in the sample, and x₂ and
x₃ have large partial effects on y, would you expect β̃₁ and β̂₁ to be similar or very different? Explain.

Chapter 3 Multiple Regression Analysis: Estimation

(ii) If x₁ is almost uncorrelated with x₂ and x₃, but x₂ and x₃ are highly correlated, will β̃₁ and β̂₁ tend to be similar or very different? Explain.
(iii) If x₁ is highly correlated with x₂ and x₃, and x₂ and x₃ have small partial effects on y, would you expect se(β̃₁) or se(β̂₁) to be smaller? Explain.
(iv) If x₁ is almost uncorrelated with x₂ and x₃, x₂ and x₃ have large partial effects on y, and x₂ and x₃ are highly correlated, would you expect se(β̃₁) or se(β̂₁) to be smaller? Explain.

Suppose that the population model determining y is

y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + u,

and this model satisfies Assumptions MLR.1 through MLR.4. However, we estimate the model that omits x₃. Let β̃₀, β̃₁, and β̃₂ be the OLS estimators from the regression of y on x₁ and x₂. Show that the expected value of β̃₁ (given the values of the independent variables in the sample) is

E(β̃₁) = β₁ + β₃ (Σᵢ r̂ᵢ₁xᵢ₃)/(Σᵢ r̂ᵢ₁²),

where the r̂ᵢ₁ are the OLS residuals from the regression of x₁ on x₂, and the sums run over i = 1, …, n. [Hint: The formula for β̃₁ comes from equation (3.22). Plug yᵢ = β₀ + β₁xᵢ₁ + β₂xᵢ₂ + β₃xᵢ₃ + uᵢ into this equation. After some algebra, take the expectation treating xᵢ₃ and r̂ᵢ₁ as nonrandom.]

The following equation represents the effects of tax revenue mix on subsequent employment growth for the population of counties in the United States:

growth = β₀ + β₁shareP + β₂shareI + β₃shareS + other factors,

where growth is the percentage change in employment from 1980 to 1990, shareP is the share of property taxes in total tax revenue, shareI is the share of income tax revenues, and shareS is the share of sales tax revenues. All of these variables are measured in 1980. The omitted share, shareF, includes fees and miscellaneous taxes. By definition, the four shares add up to one. Other factors would include expenditures on education, infrastructure, and so on (all measured in 1980).
(i) Why must we omit one of the tax share variables from the equation?
(ii) Give a careful
interpretation of β₁.

(i) Consider the simple regression model y = β₀ + β₁x + u under the first four Gauss-Markov assumptions. For some function g(x), for example g(x) = x² or g(x) = log(1 + x²), define zᵢ = g(xᵢ). Define a slope estimator as

β̃₁ = (Σᵢ (zᵢ − z̄)yᵢ)/(Σᵢ (zᵢ − z̄)xᵢ),

where the sums run over i = 1, …, n. Show that β̃₁ is linear and unbiased. (Remember, because E(u|x) = 0, you can treat both xᵢ and zᵢ as nonrandom in your derivation.)
(ii) Add the homoskedasticity assumption, MLR.5. Show that

Var(β̃₁) = σ² (Σᵢ (zᵢ − z̄)²)/(Σᵢ (zᵢ − z̄)xᵢ)².

(iii) Show directly that, under the Gauss-Markov assumptions, Var(β̂₁) ≤ Var(β̃₁), where β̂₁ is the OLS estimator. [Hint: The Cauchy-Schwartz inequality in Appendix B implies that

(n⁻¹ Σᵢ (zᵢ − z̄)(xᵢ − x̄))² ≤ (n⁻¹ Σᵢ (zᵢ − z̄)²)(n⁻¹ Σᵢ (xᵢ − x̄)²);

notice that we can drop x̄ from the sample covariance.]

COMPUTER EXERCISES

C3.1 A problem of interest to health officials (and others) is to determine the effects of smoking during pregnancy on infant health. One measure of infant health is birth weight; a birth weight that is too low can put an infant at risk for contracting various illnesses. Since factors other than cigarette smoking that affect birth weight are likely to be correlated with smoking, we should take those factors into account. For example, higher income generally results in access to better prenatal care, as well as better nutrition for the mother. An equation that recognizes this is

bwght = β₀ + β₁cigs + β₂faminc + u.

(i) What is the most likely sign for β₂?
(ii) Do you think cigs and faminc are likely to be correlated? Explain why the correlation might be positive or negative.
(iii) Now, estimate the equation with and without faminc, using the data in BWGHT.RAW. Report the results in equation form, including the sample size and R-squared. Discuss your results, focusing on whether adding faminc substantially changes the estimated effect of cigs on bwght.

C3.2 Use the data in HPRICE1.RAW to estimate the model

price = β₀ + β₁sqrft + β₂bdrms + u,

where price is the house price measured in thousands of dollars.
(i) Write out the results in equation form.
(ii) What
is the estimated increase in price for a house with one more bedroom, holding square footage constant?
(iii) What is the estimated increase in price for a house with an additional bedroom that is 140 square feet in size? Compare this to your answer in part (ii).
(iv) What percentage of the variation in price is explained by square footage and number of bedrooms?
(v) The first house in the sample has sqrft = 2,438 and bdrms = 4. Find the predicted selling price for this house from the OLS regression line.
(vi) The actual selling price of the first house in the sample was $300,000 (so price = 300). Find the residual for this house. Does it suggest that the buyer underpaid or overpaid for the house?

C3.3 The file CEOSAL2.RAW contains data on 177 chief executive officers and can be used to examine the effects of firm performance on CEO salary.
(i) Estimate a model relating annual salary to firm sales and market value. Make the model of the constant elasticity variety for both independent variables. Write the results out in equation form.
(ii) Add profits to the model from part (i). Why can this variable not be included in logarithmic form? Would you say that these firm performance variables explain most of the variation in CEO salaries?
(iii) Add the variable ceoten to the model in part (ii). What is the estimated percentage return for another year of CEO tenure, holding other factors fixed?
(iv) Find the sample correlation coefficient between the variables log(mktval) and profits. Are these variables highly correlated? What does this say about the OLS estimators?

C3.4 Use the data in ATTEND.RAW for this exercise.
(i) Obtain the minimum, maximum, and average values for the variables atndrte, priGPA, and ACT.
(ii) Estimate the model

atndrte = β₀ + β₁priGPA + β₂ACT + u,

and write the results in equation form. Interpret the intercept. Does it have a useful meaning?
(iii) Discuss the estimated slope coefficients. Are there any surprises?
(iv) What is the predicted atndrte if priGPA = 3.65 and ACT = 20? What do you make
of this result? Are there any students in the sample with these values of the explanatory variables?
(v) If Student A has priGPA = 3.1 and ACT = 21 and Student B has priGPA = 2.1 and ACT = 26, what is the predicted difference in their attendance rates?

C3.5 Confirm the partialling out interpretation of the OLS estimates by explicitly doing the partialling out for Example 3.2. This first requires regressing educ on exper and tenure and saving the residuals, r̂₁. Then, regress log(wage) on r̂₁. Compare the coefficient on r̂₁ with the coefficient on educ in the regression of log(wage) on educ, exper, and tenure.

C3.6 Use the data set in WAGE2.RAW for this problem. As usual, be sure all of the following regressions contain an intercept.
(i) Run a simple regression of IQ on educ to obtain the slope coefficient, say, δ̃₁.
(ii) Run the simple regression of log(wage) on educ, and obtain the slope coefficient, β̃₁.
(iii) Run the multiple regression of log(wage) on educ and IQ, and obtain the slope coefficients, β̂₁ and β̂₂, respectively.
(iv) Verify that β̃₁ = β̂₁ + β̂₂δ̃₁.

C3.7 Use the data in MEAP93.RAW to answer this question.
(i) Estimate the model

math10 = β₀ + β₁log(expend) + β₂lnchprg + u,

and report the results in the usual form, including the sample size and R-squared. Are the signs of the slope coefficients what you expected? Explain.
(ii) What do you make of the intercept you estimated in part (i)? In particular, does it make sense to set the two explanatory variables to zero? (Hint: Recall that log(1) = 0.)
(iii) Now run the simple regression of math10 on log(expend), and compare the slope coefficient with the estimate obtained in part (i). Is the estimated spending effect now larger or smaller than in part (i)?
(iv) Find the correlation between log(expend) and lnchprg. Does its sign make sense to you?
(v) Use part (iv) to explain your findings in part (iii).

C3.8 Use the data in DISCRIM.RAW to answer this question. These are ZIP code-level data on prices for various items at fast-food restaurants, along with characteristics of the ZIP code population, in New
Jersey and Pennsylvania. The idea is to see whether fast-food restaurants charge higher prices in areas with a larger concentration of blacks.
(i) Find the average values of prpblck and income in the sample, along with their standard deviations. What are the units of measurement of prpblck and income?
(ii) Consider a model to explain the price of soda, psoda, in terms of the proportion of the population that is black and median income:

psoda = β₀ + β₁prpblck + β₂income + u.

Estimate this model by OLS and report the results in equation form, including the sample size and R-squared. (Do not use scientific notation when reporting the estimates.) Interpret the coefficient on prpblck. Do you think it is economically large?
(iii) Compare the estimate from part (ii) with the simple regression estimate from psoda on prpblck. Is the discrimination effect larger or smaller when you control for income?
(iv) A model with a constant price elasticity with respect to income may be more appropriate. Report estimates of the model

log(psoda) = β₀ + β₁prpblck + β₂log(income) + u.

If prpblck increases by .20 (20 percentage points), what is the estimated percentage change in psoda? (Hint: The answer is 2.xx, where you fill in the xx.)
(v) Now add the variable prppov to the regression in part (iv). What happens to the estimated coefficient on prpblck?
(vi) Find the correlation between log(income) and prppov. Is it roughly what you expected?
(vii) Evaluate the following statement: "Because log(income) and prppov are so highly correlated, they have no business being in the same regression."

C3.9 Use the data in CHARITY.RAW to answer the following questions.
(i) Estimate the equation

gift = β₀ + β₁mailsyear + β₂giftlast + β₃propresp + u

by OLS and report the results in the usual way, including the sample size and R-squared. How does the R-squared compare with that from the simple regression that omits giftlast and propresp?
(ii) Interpret the coefficient on mailsyear. Is it bigger or smaller than the corresponding simple regression coefficient?
(iii) Interpret the coefficient on propresp. Be careful to notice the units of measurement of
propresp.
(iv) Now add the variable avggift to the equation. What happens to the estimated effect of mailsyear?
(v) In the equation from part (iv), what has happened to the coefficient on giftlast? What do you think is happening?

Appendix 3A

3A.1 Derivation of the First Order Conditions in Equation (3.13)

The analysis is very similar to the simple regression case. We must characterize the solutions to the problem

min over b₀, b₁, …, bₖ of Σᵢ (yᵢ − b₀ − b₁xᵢ₁ − … − bₖxᵢₖ)²,

where, throughout this appendix, all sums run over i = 1, …, n. Taking the partial derivatives with respect to each of the bⱼ (see Appendix A), evaluating them at the solutions, and setting them equal to zero gives

−2 Σᵢ (yᵢ − β̂₀ − β̂₁xᵢ₁ − … − β̂ₖxᵢₖ) = 0
−2 Σᵢ xᵢⱼ(yᵢ − β̂₀ − β̂₁xᵢ₁ − … − β̂ₖxᵢₖ) = 0, for all j = 1, …, k.

Canceling the −2 gives the first order conditions in (3.13).

3A.2 Derivation of Equation (3.22)

To derive (3.22), write xᵢ₁ in terms of its fitted value and its residual from the regression of x₁ on x₂, …, xₖ: xᵢ₁ = x̂ᵢ₁ + r̂ᵢ₁, for all i = 1, …, n. Now, plug this into the second equation in (3.13):

Σᵢ (x̂ᵢ₁ + r̂ᵢ₁)(yᵢ − β̂₀ − β̂₁xᵢ₁ − … − β̂ₖxᵢₖ) = 0. (3.60)

By the definition of the OLS residual ûᵢ, since x̂ᵢ₁ is just a linear function of the explanatory variables xᵢ₂, …, xᵢₖ, it follows that Σᵢ x̂ᵢ₁ûᵢ = 0. Therefore, equation (3.60) can be expressed as

Σᵢ r̂ᵢ₁(yᵢ − β̂₀ − β̂₁xᵢ₁ − … − β̂ₖxᵢₖ) = 0. (3.61)

Since the r̂ᵢ₁ are the residuals from regressing x₁ on x₂, …, xₖ, Σᵢ xᵢⱼr̂ᵢ₁ = 0, for all j = 2, …, k. Therefore, (3.61) is equivalent to Σᵢ r̂ᵢ₁(yᵢ − β̂₁xᵢ₁) = 0. Finally, we use the fact that Σᵢ x̂ᵢ₁r̂ᵢ₁ = 0, which means that β̂₁ solves

Σᵢ r̂ᵢ₁(yᵢ − β̂₁r̂ᵢ₁) = 0.

Now, straightforward algebra gives (3.22), provided, of course, that Σᵢ r̂ᵢ₁² > 0; this is ensured by Assumption MLR.3.

3A.3 Proof of Theorem 3.1

We prove Theorem 3.1 for β̂₁; the proof for the other slope parameters is virtually identical. (See Appendix E for a more succinct proof using matrices.) Under Assumption MLR.3, the OLS estimators exist, and we can write β̂₁ as in (3.22). Under Assumption MLR.1, we can write yᵢ as in (3.32); substitute this for yᵢ in (3.22). Then, using Σᵢ r̂ᵢ₁ = 0, Σᵢ xᵢⱼr̂ᵢ₁ = 0 for all j = 2, …, k, and Σᵢ xᵢ₁r̂ᵢ₁ = Σᵢ r̂ᵢ₁², we have

β̂₁ = β₁ + (Σᵢ r̂ᵢ₁uᵢ)/(Σᵢ r̂ᵢ₁²). (3.62)

Now,
under Assumptions MLR.2 and MLR.4, the expected value of each uᵢ, given all independent variables in the sample, is zero. Since the r̂ᵢ₁ are just functions of the sample independent variables, it follows that

E(β̂₁|X) = β₁ + (Σᵢ r̂ᵢ₁E(uᵢ|X))/(Σᵢ r̂ᵢ₁²) = β₁ + (Σᵢ r̂ᵢ₁ · 0)/(Σᵢ r̂ᵢ₁²) = β₁,

where X denotes the data on all independent variables and E(β̂₁|X) is the expected value of β̂₁, given xᵢ₁, …, xᵢₖ, for all i = 1, …, n. This completes the proof.

3A.4 General Omitted Variable Bias

We can derive the omitted variable bias in the general model in equation (3.31) under the first four Gauss-Markov assumptions. In particular, let the β̂ⱼ, j = 0, 1, …, k, be the OLS estimators from the regression using the full set of explanatory variables. Let the β̃ⱼ, j = 0, 1, …, k − 1, be the OLS estimators from the regression that leaves out xₖ. Let δ̃ⱼ, j = 1, …, k − 1, be the slope coefficient on xⱼ in the auxiliary regression of xᵢₖ on xᵢ₁, xᵢ₂, …, xᵢ,ₖ₋₁, i = 1, …, n. A useful fact is that

β̃ⱼ = β̂ⱼ + β̂ₖδ̃ⱼ. (3.63)

This shows explicitly that, when we do not control for xₖ in the regression, the estimated partial effect of xⱼ equals the partial effect when we include xₖ, plus the partial effect of xₖ on ŷ times the partial relationship between the omitted variable, xₖ, and xⱼ, j < k. Conditional on the entire set of explanatory variables, X, we know that the β̂ⱼ are all unbiased for the corresponding βⱼ, j = 1, …, k. Further, since δ̃ⱼ is just a function of X, we have

E(β̃ⱼ|X) = E(β̂ⱼ|X) + E(β̂ₖ|X)δ̃ⱼ = βⱼ + βₖδ̃ⱼ. (3.64)

Equation (3.64) shows that β̃ⱼ is biased for βⱼ, unless βₖ = 0 (in which case xₖ has no partial effect in the population) or δ̃ⱼ equals zero (which means that xᵢₖ and xᵢⱼ are partially uncorrelated in the sample).

The key to obtaining equation (3.64) is equation (3.63). To show equation (3.63), we can use equation (3.22) a couple of times. For simplicity, we look at j = 1. Now, β̃₁ is the slope coefficient in the simple regression of yᵢ on r̃ᵢ₁, i = 1, …, n, where the r̃ᵢ₁ are the OLS residuals from the regression of xᵢ₁ on xᵢ₂, …, xᵢ,ₖ₋₁. Consider the numerator of the expression for β̃₁: Σᵢ r̃ᵢ₁yᵢ. But, for each i, we can write yᵢ = β̂₀ + β̂₁xᵢ₁ + … + β̂ₖxᵢₖ + ûᵢ,
and plug in for yᵢ. Now, by properties of the OLS residuals, the r̃ᵢ₁ have zero sample average and are uncorrelated with xᵢ₂, …, xᵢ,ₖ₋₁ in the sample. Similarly, the ûᵢ have zero sample average and zero sample correlation with xᵢ₁, …, xᵢₖ. It follows that the r̃ᵢ₁ and ûᵢ are uncorrelated in the sample, since the r̃ᵢ₁ are just linear combinations of xᵢ₁, xᵢ₂, …, xᵢ,ₖ₋₁. So

Σᵢ r̃ᵢ₁yᵢ = β̂₁ Σᵢ r̃ᵢ₁xᵢ₁ + β̂ₖ Σᵢ r̃ᵢ₁xᵢₖ.

Now, Σᵢ r̃ᵢ₁xᵢ₁ = Σᵢ r̃ᵢ₁², which is also the denominator of β̃₁. Therefore, we have shown that

β̃₁ = β̂₁ + β̂ₖ (Σᵢ r̃ᵢ₁xᵢₖ)/(Σᵢ r̃ᵢ₁²) = β̂₁ + β̂ₖδ̃₁.

This is the relationship we wanted to show.

3A.5 Proof of Theorem 3.2

Again, we prove this for j = 1. Write β̂₁ as in equation (3.62). Now, under MLR.5, Var(uᵢ|X) = σ², for all i = 1, …, n. Under random sampling, the uᵢ are independent, even conditional on X, and the r̂ᵢ₁ are nonrandom conditional on X. Therefore,

Var(β̂₁|X) = (Σᵢ r̂ᵢ₁² Var(uᵢ|X))/(Σᵢ r̂ᵢ₁²)² = (Σᵢ r̂ᵢ₁²)σ²/(Σᵢ r̂ᵢ₁²)² = σ²/(Σᵢ r̂ᵢ₁²).

Now, since Σᵢ r̂ᵢ₁² is the sum of squared residuals from regressing x₁ on x₂, …, xₖ, Σᵢ r̂ᵢ₁² = SST₁(1 − R₁²). This completes the proof.

3A.6 Proof of Theorem 3.4

We show that, for any other linear unbiased estimator β̃₁ of β₁, Var(β̃₁) ≥ Var(β̂₁), where β̂₁ is the OLS estimator. The focus on j = 1 is without loss of generality. For β̃₁ as in equation (3.59), we can plug in for yᵢ to obtain

β̃₁ = β₀ Σᵢ wᵢ₁ + β₁ Σᵢ wᵢ₁xᵢ₁ + β₂ Σᵢ wᵢ₁xᵢ₂ + … + βₖ Σᵢ wᵢ₁xᵢₖ + Σᵢ wᵢ₁uᵢ.

Now, since the wᵢ₁ are functions of the xᵢⱼ,

E(β̃₁|X) = β₀ Σᵢ wᵢ₁ + β₁ Σᵢ wᵢ₁xᵢ₁ + β₂ Σᵢ wᵢ₁xᵢ₂ + … + βₖ Σᵢ wᵢ₁xᵢₖ + Σᵢ wᵢ₁E(uᵢ|X)
= β₀ Σᵢ wᵢ₁ + β₁ Σᵢ wᵢ₁xᵢ₁ + β₂ Σᵢ wᵢ₁xᵢ₂ + … + βₖ Σᵢ wᵢ₁xᵢₖ,

because E(uᵢ|X) = 0, for all i = 1, …, n, under MLR.2 and MLR.4. Therefore, for E(β̃₁|X) to equal β₁ for any values of the parameters, we must have

Σᵢ wᵢ₁ = 0, Σᵢ wᵢ₁xᵢ₁ = 1, Σᵢ wᵢ₁xᵢⱼ = 0, j = 2, …, k. (3.66)

Now, let r̂ᵢ₁ be the residuals from the regression of xᵢ₁ on xᵢ₂, …, xᵢₖ. Then, from (3.66), it follows that

Σᵢ wᵢ₁r̂ᵢ₁ = 1, (3.67)

because xᵢ₁ = x̂ᵢ₁ + r̂ᵢ₁ and, by (3.66), Σᵢ wᵢ₁x̂ᵢ₁ = 0 (x̂ᵢ₁ is just a linear function of an intercept and xᵢ₂, …, xᵢₖ). Now, consider the difference between Var(β̃₁|X) and Var(β̂₁|X) under MLR.1 through MLR.5:

σ² Σᵢ wᵢ₁² − σ²/(Σᵢ r̂ᵢ₁²). (3.68)

Because of (3.67), we can write
the difference in (3.68), without σ², as

Σᵢ wᵢ₁² − (Σᵢ wᵢ₁r̂ᵢ₁)²/(Σᵢ r̂ᵢ₁²). (3.69)

But (3.69) is simply

Σᵢ (wᵢ₁ − γ̂₁r̂ᵢ₁)², (3.70)

where γ̂₁ = (Σᵢ wᵢ₁r̂ᵢ₁)/(Σᵢ r̂ᵢ₁²), as can be seen by squaring each term in (3.70), summing, and then canceling terms. Because (3.70) is just the sum of squared residuals from the simple regression of wᵢ₁ on r̂ᵢ₁ (remember that the sample average of r̂ᵢ₁ is zero), (3.70) must be nonnegative. This completes the proof.

CHAPTER 4

Multiple Regression Analysis: Inference

This chapter continues our treatment of multiple regression analysis. We now turn to the problem of testing hypotheses about the parameters in the population regression model, and we pay particular attention to determining whether a group of independent variables can be omitted from a model.

4.1 Sampling Distributions of the OLS Estimators

Up to this point, we have formed a set of assumptions under which OLS is unbiased; we have also derived and discussed the bias caused by omitted variables. In Section 3.4, we obtained the variances of the OLS estimators under the Gauss-Markov assumptions. In Section 3.5, we showed that this variance is smallest among linear unbiased estimators. Knowing the expected value and variance of the OLS estimators is useful for describing the precision of the OLS estimators. However, in order to perform statistical inference, we need to know more than just the first two moments of the β̂ⱼ; we need to know the full sampling distribution of the β̂ⱼ. Even under the Gauss-Markov assumptions, the distribution of the β̂ⱼ can have virtually any shape.

When we condition on the values of the independent variables in our sample, it is clear that the sampling distributions of the OLS estimators depend on the underlying distribution of the errors. To make the sampling distributions of the β̂ⱼ tractable, we now assume that the unobserved error is normally distributed in the population. We call this the normality assumption.
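The two key facts developed so far can be checked numerically. The short sketch below (not from the text; all data are simulated and the parameter values are illustrative choices) first verifies the partialling out identity of equation (3.22), which says that the multiple regression slope on x₁ equals the slope from regressing y on the residuals of x₁ on the other regressors, and then simulates the sampling distribution of β̂₁ with normal errors, confirming that it centers on the true β₁.

```python
# Illustrative sketch (simulated data; true betas, n, and reps are
# made-up values chosen only for demonstration).
import numpy as np

rng = np.random.default_rng(0)
n = 200
beta = np.array([1.0, 0.5, -0.3])       # assumed true beta0, beta1, beta2
x1 = rng.normal(size=n)
x2 = 0.4 * x1 + rng.normal(size=n)      # x1 and x2 are correlated
X = np.column_stack([np.ones(n), x1, x2])

def ols(Z, y):
    """OLS coefficients from regressing y on the columns of Z."""
    return np.linalg.lstsq(Z, y, rcond=None)[0]

# (1) Partialling out, one sample: equation (3.22) holds exactly.
y = X @ beta + rng.normal(size=n)
b_full = ols(X, y)                      # full multiple regression
W = np.column_stack([np.ones(n), x2])   # regress x1 on (1, x2) ...
r1 = x1 - W @ ols(W, x1)                # ... and keep the residuals
b1_partial = r1 @ y / (r1 @ r1)         # slope of y on r1
assert abs(b_full[1] - b1_partial) < 1e-10   # identical, up to rounding

# (2) Monte Carlo: with normal errors, beta1-hat centers on beta1 = 0.5.
draws = np.array(
    [ols(X, X @ beta + rng.normal(size=n))[1] for _ in range(2000)]
)
print(round(draws.mean(), 2))           # close to the true beta1 of 0.5
```

The exact agreement in step (1) is algebraic, not statistical: it holds in every sample, which is why equation (3.22) can be used to prove unbiasedness and the variance formula. The Monte Carlo in step (2) only illustrates the distributional claim and would change with a different seed.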