# Econometrics II ECOE 60303

This 0 page Class Notes was uploaded by Jazmyne Ernser DDS on Sunday November 1, 2015. The Class Notes belongs to ECOE 60303 at University of Notre Dame taught by William Evans in Fall. Since its upload, it has received 13 views. For similar materials see /class/232708/ecoe-60303-university-of-notre-dame in Economy at University of Notre Dame.

Date Created: 11/01/15

## Discrete and Categorical Data

William N. Evans, Department of Economics/MPRC, University of Maryland

### Part I: Introduction

#### Introduction

- The workhorse statistical model in the social sciences is the multivariate regression model.
- Ordinary least squares (OLS): $y_i = \beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2 + \dots + x_{ki}\beta_k + \varepsilon_i$

#### Linear model

- $y_i = \alpha + \beta x_i + \varepsilon_i$
- $\alpha$ and $\beta$ are population values; they represent the true relationship between x and y.
- Unfortunately, these values are unknown.
- The job of the researcher is to estimate these values.
- Notice that if we differentiate y with respect to x, we obtain $dy/dx = \beta$.
- $\beta$ represents how much y will change for a fixed change in x:
  - Increase in income for more education
  - Change in crime or bankruptcy when slots are legalized
  - Increase in test score if you study more

#### Put some concreteness on the problem

- State of Maryland budget problems:
  - Drop in revenues
  - Expensive K-12 school spending initiatives
- Short-term solution: raise the tax on cigarettes by 84 cents/pack.
- Problem: a tax hike will reduce consumption of the taxable product.
- Question for the state: as taxes are raised, how much will cigarette consumption fall?
- Simple model: $y_i = \alpha + \beta x_i + \varepsilon_i$
- Suppose y is a state's per capita consumption of cigarettes.
- x represents taxes on cigarettes.
- Question: how much will y fall if x is increased by 84 cents/pack?
- Problem: there are many reasons why people smoke; cost is but one of them.
- Data:
  - y: state per capita cigarette consumption for the years 1980-1997
  - x: tax (state + federal) in real cents per pack
  - Scatter plot of the data
  - Negative covariance between the variables: when $x > \bar{x}$, it is more likely that $y < \bar{y}$; when $x < \bar{x}$, it is more likely that $y > \bar{y}$.
- Goal: pick values of $\alpha$ and $\beta$ that best fit the data.
  - Define "best fit" in a moment.

#### Notation

- True model: $y_i = \alpha + \beta x_i + \varepsilon_i$
  - We observe data points $(y_i, x_i)$.
  - The parameters $\alpha$ and $\beta$ are unknown.
  - The actual error $\varepsilon_i$ is unknown.
- Estimated model:
  - $(a, b)$ are estimates for the parameters $(\alpha, \beta)$.
  - $e_i$ is an estimate of $\varepsilon_i$, where $e_i = y_i - a - b x_i$.
- How do you estimate a and b?

#### Objective: Minimize sum of squared errors
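As a rough illustration of choosing a and b to minimize the sum of squared errors, the Python sketch below uses the standard closed-form solution for a single regressor, which follows from setting both first derivatives of the SSE to zero. The function names `ols_fit` and `sse` and the tax and consumption numbers are made up for illustration; they are not the Maryland data or part of the course programs.

```python
def ols_fit(x, y):
    """Return (a, b) that minimize sum((y_i - a - b*x_i)**2) for one regressor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Setting the derivative with respect to b to zero gives b = S_xy / S_xx
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    # Setting the derivative with respect to a to zero gives a = y_bar - b * x_bar
    a = y_bar - b * x_bar
    return a, b


def sse(x, y, a, b):
    """Sum of squared errors for candidate values a and b."""
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))


# Hypothetical data: tax in cents per pack, per capita packs per year
tax = [20.0, 30.0, 40.0, 50.0, 60.0]
packs = [130.0, 122.0, 118.0, 109.0, 101.0]

a, b = ols_fit(tax, packs)
print("a =", a, "b =", b)
print("SSE at the estimates:", sse(tax, packs, a, b))
print("SSE after nudging b:", sse(tax, packs, a, b - 0.05))
```

Nudging either estimate away from the fitted values raises the SSE, which is what "best fit" means under this objective.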
- $\min \sum_i e_i^2 = \sum_i (y_i - a - b x_i)^2$
- Minimize the sum of squared errors (SSE).
- Treat positive and negative errors equally:
  - Over- or under-predicting by 5 is the same magnitude of error.
- Quadratic form:
  - The optimal values for a and b are those that make the first derivative equal zero.
  - Functions reach min or max values when derivatives are zero.

[Figures: Cigarette Consumption and Taxes, per capita packs/year (two plots)]

- The model has a lot of nice features:
  - Statistical properties are easy to establish.
  - Optimal estimates are easy to obtain.
  - Parameter estimates are easy to interpret.
  - The model maximizes prediction: if you minimize SSE, you maximize $R^2$.
- The model does well as a first-order approximation to lots of problems.

#### Discrete and Qualitative Data

- The OLS model works well when y is a continuous variable:
  - Income, wages, test scores, weight, GDP
- It does not have as many nice properties when y is not continuous.
- Example: doctor visits
  - Integer values
  - Low counts for most people
  - Mass of observations at zero

#### Downside of forcing non-standard outcomes into the OLS world

- Can predict outside the allowable range, e.g., negative MD visits.
- Does not describe the data generating process well, e.g., mass of observations at zero.
- Violates many properties of OLS, e.g., heteroskedasticity.

#### This talk

- Look at situations when the data generating process does not lend itself well to OLS models.
- Mathematically describe the data generating process.
- Show how we use different optimization procedures to obtain estimates.
- Describe the statistical properties.
- Show how to interpret parameters.
- Illustrate how to estimate the models with the popular program STATA.

#### Types of data generating processes we will consider

- Dichotomous events (yes or no):
  - 1 = yes, 0 = no
  - Graduate high school? Work? Are obese? Smoke?
- Ordinal data:
  - Self-reported health (fair, poor, good, excellent)
  - Strongly disagree, disagree, agree, strongly agree
- Count data:
  - Doctor visits, lost workdays, fatality counts
- Duration data:
  - Time to failure, time to death, time
to re-employment

#### Recommended Textbooks

- Jeffrey Wooldridge, "Econometric Analysis of Cross Section and Panel Data"
  - Lots of insight and mathematical/statistical detail
  - Very good examples
- William Greene, "Econometric Analysis"
  - More topics
  - Somewhat dated examples

#### Course web page

- www.bsos.umd.edu/econ/evans/jpsm.html
- Contains:
  - These notes
  - All STATA programs and data sets
  - A couple of "Introduction to STATA" handouts
  - Links to some useful web sites

#### STATA Resources: Discrete Outcomes

- "Regression Models for Categorical Dependent Variables Using STATA", J. Scott Long and Jeremy Freese
  - Available for sale from the STATA website for $52 (www.stata.com)
- Post-estimation subroutines that translate results
  - Do not need to buy the book to use the subroutines
- In the STATA command line, type `net search spost`
- This will give you a list of available programs to download
- One is `spostado` from http://www.indiana.edu/~jslsoc/stata
- Click on the link and install the files

### Part II: A brief introduction to STATA

#### STATA

- Very fast, convenient, well-documented, cheap, and flexible statistical package
- Excellent for cross-section/panel data projects; not as great for time series, but getting better
- Not as easy to manipulate large data sets from flat files as SAS
- I usually clean data in SAS and estimate models in STATA
- Key characteristic of STATA:
  - All data must be loaded into RAM
  - Computations are very fast
  - But the size of the project is limited by available memory
- Results can be generated two different ways:
  - Command line
  - Write a program (.do), then submit it from the command line

#### Sample program to get you started

- `cps87_or.do`
- Program gets you to the point where you can:
  - Load data into memory
  - Construct new variables
  - Get simple statistics
  - Run a basic regression
  - Store the results on a disk

#### Data (`cps87_or.dta`)

- Random sample of data from the 1987 Current Population Survey outgoing rotation group
- Sample selection:
  - Males
  - Ages 21-64
  - Working 30+ hours/week
- 19,906 observations

#### Major caveat

- Hardest thing to learn/do:
get data from some other source and get it into a STATA data set.
- We skip over that part.
- All the data sets are loaded into a STATA data file that can be called by saying `use` followed by the data file name.

#### Housekeeping at the top of the program

From program:

    * this line defines the semicolon as the
    * end of line delimiter
    # delimit ;
    * set memory for 10 meg;
    set memory 10m;
    * write results to a log file;
    * the replace option writes over old log files;
    log using cps87_or.log, replace;
    * open stata data set;
    use c:\bill\stata\cps87_or;
    * list variables and labels in data set;
    desc;

#### Constructing new variables

- Use the `gen` command to generate new variables.
- Syntax: `gen new_variable_name = math statement`
- Easily construct new variables via:
  - Algebraic operations
  - Math/trig functions (ln, exp, etc.)
  - Logical operators (when true, = 1; when false, = 0)

From program:

    * generate new variables;
    * lines 1-2 illustrate basic math functions;
    * line 6 illustrates the AND statement;
    * after you construct new variables, compress the data;

#### Getting basic statistics

- `desc`: describes variables in the data set
- `sum`: gets summary statistics
- `tab`: produces frequency tables of discrete variables

From program:

    * get descriptive statistics;
    sum;
    * get detailed descriptives for continuous variables;
    sum earnwke, detail;
    * get frequencies of discrete variables;
    tabulate unionm;
    tabulate race;
    * get two-way table of frequencies;
    tabulate region smsa, row column cell;

#### Results from sum

[Output: Obs, Mean, Std. Dev., Min, and Max for age, race, educ, unionm, and smsa; 19,906 observations]

#### Detailed summary

[Output: detailed summary of usual weekly earnings (percentiles, smallest and largest values, mean, std. dev., variance, skewness, kurtosis)]

#### Results for tab

| 1 = union member, 2 = otherwise | Freq. | Percent | Cum. |
|---|---|---|---|
| 1 | 4,597 | 23.09 | 23.09 |
| 2 | 15,309 | 76.91 | 100.00 |
| Total | 19,906 | 100.00 | |

#### 2x2 Table

#### Running a regression

- Syntax: `reg dependent-variable independent-variables`
- Example from program:

From program:

    * run simple regression;
    reg earnwkl age age2 educ nonwhite union;

#### Analysis of variance

- $R^2 = 0.3085$
  - Variables explain 31% of the variation in log weekly earnings.
- $F(5, 19900)$
  - Tests the hypothesis that all covariates (except the constant) are jointly zero.

#### Interpret results

- $y = \beta_0 + \beta_1 x_1 + \varepsilon$
- $dy/dx_1 = \beta_1$
- But in this case $y = \ln(W)$, where W is weekly wages.
- $d\ln(W)/dx = (dW/W)/dx = \beta_1$
  - Percentage change in wages given a change in x
- For each additional year of education, wages increase by 6.9%.
- Non-whites earn 17.2% less than whites.
- Union members earn 13% more than non-union members.

### Part III: Some notes about probability distributions

#### Continuous Distributions

- Random variables with an infinite number of possible values
- Examples: units of measure (time, weight, distance)
- Many discrete outcomes can be treated as continuous, e.g., SAT scores

#### How to describe a continuous random variable

- The Probability Density Function (PDF)
- The PDF for a random variable x is defined as $f(x)$, where $f(x) \ge 0$ and $\int f(x)\,dx = 1$.
- Calculus review: the integral of a function gives the "area under the curve".

[Figure: graph of y = f(x)]

#### Cumulative Distribution Function (CDF)

- Suppose x is a measure like distance or time.
- $0 \le x \le \infty$
- We may be interested in $\Pr(x \le a)$.
- CDF: $F(a) = \int_0^a f(x)\,dx = \Pr(x \le a)$
- What if we consider all values? $\Pr(x \le \infty) = \int_0^\infty f(x)\,dx = 1$

#### Properties of CDF

- Note that $\Pr(x \le b) + \Pr(x > b) = 1$.
- $\Pr(x > b) = 1 - \Pr(x \le b)$
- Many times, it is easier to work with complements.

#### General notation for continuous distributions

- The PDF is described by lower case, such as $f(x)$.
- The CDF is defined as upper case, such as $F(a)$.

#### Standard Normal Distribution

- Most frequently used continuous distribution
- Symmetric "bell-shaped" distribution
- As we will show, the normal has useful properties.
- Many variables we observe in the real world look normally distributed.
- Can translate normal into standard normal.

#### Examples of variables that look normally distributed

- IQ scores
- SAT scores
- Heights of females
- Log income
- Average gestation (weeks of pregnancy)
- As we will show in a few weeks, sample means are normally distributed.

#### Standard Normal Distribution

- PDF: $\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$
- For $-\infty \le z \le \infty$

#### Notation

- $\phi(z)$ is the standard normal PDF evaluated at z.
- $\Phi(a) = \Pr(z \le a) = \int_{-\infty}^{a} \phi(z)\,dz$

[Figures: standard normal PDF]

- Notice that:
  - Normal is symmetric: $\phi(a) = \phi(-a)$
  - Normal is unimodal
  - Median = mean
  - Area under curve = 1
  - Almost all area is between (-3, 3)
- Evaluations of the CDF are done with:
  - Statistical functions (Excel, SAS, etc.)
  - Tables

#### Standard Normal CDF

[Table excerpts: areas under the standard normal curve]

- $\Pr(z \le 0.98) = \Phi(0.98) = 0.8365$
- $\Pr(z \le 1.41) = \Phi(1.41) = 0.9207$
- $\Pr(z > 1.17) = 1 - \Pr(z \le 1.17) = 1 - \Phi(1.17) = 1 - 0.8790 = 0.1210$
- $\Pr(0.1 \le z \le 1.9) = \Pr(z \le 1.9) - \Pr(z \le 0.1) = \Phi(1.9) - \Phi(0.1) = 0.9713 - 0.5398 = 0.4315$

#### Important Properties of Normal Distribution

- $\Pr(z \le A) = \Phi(A)$
- $\Pr(z > A) = 1 - \Phi(A)$
- $\Pr(z \le -A) = \Phi(-A)$
- $\Pr(z > -A) = 1 - \Phi(-A) = \Phi(A)$

### Section IV: Maximum likelihood estimation

#### Maximum likelihood estimation

- Observe n independent outcomes, all drawn from the same distribution.
- $(y_1, y_2, y_3, \dots, y_n)$
- $y_i$ is drawn from $f(y_i; \theta)$, where $\theta$ is an unknown parameter for the PDF f.
- Recall the definition of independence: if a and b are independent, Prob(a and b) = Pr(a)Pr(b).
- Because all the draws are independent, the probability these particular n values of y would be drawn at random is called the "likelihood function", and it equals
  - $L = \Pr(y_1)\Pr(y_2)\cdots\Pr(y_n)$
  - $L = f(y_1; \theta) f(y_2; \theta) f(y_3; \theta) \cdots f(y_n; \theta)$
- MLE: pick a value for $\theta$ that best represents the chance these n values of y would have been generated randomly.
- To maximize L, maximize a monotonic function of L.
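A small Python sketch may help with the idea that maximizing L and maximizing a monotonic function of L pick the same parameter value. It assumes, purely for illustration, that each observation is drawn from a normal density with unknown mean theta and variance 1, so that f(y; theta) is the standard normal PDF evaluated at y - theta. The data values, the grid of candidate values, and the function names are hypothetical.

```python
import math


def phi(z):
    """Standard normal PDF."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def likelihood(theta, ys):
    """L = f(y_1; theta) * f(y_2; theta) * ... * f(y_n; theta)."""
    value = 1.0
    for y in ys:
        value *= phi(y - theta)
    return value


def log_likelihood(theta, ys):
    """ln L = sum of ln f(y_i; theta)."""
    return sum(math.log(phi(y - theta)) for y in ys)


# Hypothetical sample and a grid of candidate values for theta
ys = [1.2, 0.4, 2.1, 1.7, 0.9]
grid = [i / 100 for i in range(0, 301)]

best_by_L = max(grid, key=lambda t: likelihood(t, ys))
best_by_lnL = max(grid, key=lambda t: log_likelihood(t, ys))

print("theta maximizing L:   ", best_by_L)
print("theta maximizing ln L:", best_by_lnL)
print("sample mean:          ", sum(ys) / len(ys))
```

Under this normal assumption both searches land on the grid point equal to the sample mean, and working with the sum of logs avoids multiplying many small numbers together.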
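- Recall: $\ln(abcd) = \ln(a) + \ln(b) + \ln(c) + \ln(d)$
- Max $\mathcal{L} = \ln(L) = \ln f(y_1; \theta) + \ln f(y_2; \theta) + \dots + \ln f(y_n; \theta) = \sum_i \ln f(y_i; \theta)$
- Pick $\theta$ so that $\mathcal{L}$ is maximized.

[Figure: $\mathcal{L}(\theta)$ plotted against $\theta$]

#### Example: Poisson

- Suppose y measures counts, such as doctor visits.
- $y_i$ is drawn from a Poisson distribution.
- $f(y_i; \lambda) = e^{-\lambda} \lambda^{y_i} / y_i!$ for $\lambda > 0$
- $E[y_i] = Var[y_i] = \lambda$
- Given n observations $(y_1, y_2, y_3, \dots, y_n)$, pick the value of $\lambda$ that maximizes $\mathcal{L}$.
- Max $\mathcal{L} = \sum_i \ln f(y_i; \lambda) = \sum_i \ln(e^{-\lambda} \lambda^{y_i} / y_i!) = \sum_i [-\lambda + y_i \ln(\lambda) - \ln(y_i!)] = -n\lambda + \ln(\lambda) \sum_i y_i - \sum_i \ln(y_i!)$
- $\mathcal{L} = -n\lambda + \ln(\lambda) \sum_i y_i - \sum_i \ln(y_i!)$
- $d\mathcal{L}/d\lambda = -n + (1/\lambda) \sum_i y_i = 0$
- Solve for $\lambda$: $\hat{\lambda} = \sum_i y_i / n = \bar{y}$, the sample mean of y.
- In most cases, however, we cannot find a closed-form solution for the parameter in $\ln f(y_i; \theta)$.
- Must "search" over all possible solutions.
- How does the search work?
  - Start with a candidate value of $\theta$.
  - Calculate $d\mathcal{L}/d\theta$.
  - If $d\mathcal{L}/d\theta > 0$, increasing $\theta$ will increase $\mathcal{L}$, so we increase $\theta$ some.
  - If $d\mathcal{L}/d\theta < 0$, decreasing $\theta$ will increase $\mathcal{L}$, so we decrease $\theta$ some.
  - Keep changing $\theta$ until $d\mathcal{L}/d\theta = 0$.
  - How far you step when you change $\theta$ is determined by a number of different factors.

[Figure: $\mathcal{L}(\theta)$ with the regions where $d\mathcal{L}/d\theta > 0$ and $d\mathcal{L}/d\theta < 0$]

#### Properties of MLE estimates

- Sometimes called efficient estimation: you can never generate a smaller variance than the one obtained by MLE.
- Parameter estimates are distributed as a normal distribution when sample sizes are large.
- Therefore, if we divide the parameter by its standard error, the result should be normally distributed with mean zero and variance 1 if the null (= 0) is correct.

### Section V: Dichotomous outcomes

#### Dichotomous Data

- Suppose data is discrete but there are only 2 outcomes.
- Examples:
  - Graduate high school or not
  - Patient dies or not
  - Working or not
  - Smoker or not
- In data, $y_i = 1$ if yes, $y_i = 0$ if no.

#### How to model the data generating process

- There are only two outcomes.
- Research question: what factors impact whether the event occurs?
- To answer, we will model the probability the outcome occurs: $\Pr(Y_i = 1)$ when $y_i = 1$, or $\Pr(Y_i = 0) = 1 - \Pr(Y_i = 1)$ when $y_i = 0$.
- Think of the problem from an MLE perspective.
- Likelihood for the i-th observation: $L_i = \Pr(Y_i = 1)^{y_i} [1 - \Pr(Y_i = 1)]^{(1 - y_i)}$
  - When $y_i = 1$, the only relevant part is $\Pr(Y_i = 1)$.
  - When $y_i = 0$, the only relevant part is $1 - \Pr(Y_i = 1)$.
- $\mathcal{L} = \sum_i \ln(L_i)$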
$= \sum_i \{ y_i \ln[\Pr(y_i = 1)] + (1 - y_i) \ln[\Pr(y_i = 0)] \}$
- Notice that up to this point, the model is generic. The log likelihood function will be determined by the assumptions concerning how we determine $\Pr(y_i = 1)$.

#### Modeling the probability

- There is some process (biological, social, decision theoretic, etc.) that determines the outcome y.
- Some of the variables impacting the outcome are observed, some are not.
- Requires that we model how these factors impact the probabilities.
- Model from a "latent variable" perspective.
- Consider a woman's decision to work.
- $y_i^*$ = the person's net benefit to work
- Two components of $y_i^*$:
  - Characteristics that we can measure: education, age, income of spouse, prices of child care
  - Some we cannot measure: how much you like spending time with your kids, how much you like/hate your job
- We aggregate these two components into one equation: $y_i^* = \beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2 + \dots + x_{ki}\beta_k + \varepsilon_i = x_i\beta + \varepsilon_i$
  - $x_i\beta$: measurable characteristics, but with uncertain weights
  - $\varepsilon_i$: random unmeasured characteristics
- Decision rule: the person will work if $y_i^* > 0$ (if net benefits are positive).
- $y_i = 1$ if $y_i^* > 0$
  - $y_i^* = x_i\beta + \varepsilon_i > 0$ only if $\varepsilon_i > -x_i\beta$
- $y_i = 0$ if $y_i^* \le 0$
  - $y_i^* = x_i\beta + \varepsilon_i \le 0$ only if $\varepsilon_i \le -x_i\beta$

#### How to interpret $\varepsilon$?

- When we look at certain people, we have expectations about whether y should equal 1 or 0.
- These expectations do not always hold true.
- The error $\varepsilon$ represents deviations from what we expect.
- Go back to the work example; suppose $x_i\beta$ is big. We observe a woman with:
  - High wages
  - Low husband's income
  - Low cost of child care
- We would expect this person to work, UNLESS there is some unmeasured variable that counteracts this.
- For example:
  - Suppose a mom really likes spending time with her kids, or she hates her job.
  - The unmeasured benefit of working is then a big negative value of $\varepsilon_i$.
- If we observe them working, there is a certain range of values that $\varepsilon_i$ must have been in excess of: $y_i = 1$ if $\varepsilon_i > -x_i\beta$.
- Consider the opposite. Suppose we observe someone NOT working. Then $\varepsilon_i$ must not have been big (or it was a big negative number), since $y_i = 0$ if $\varepsilon_i \le -x_i\beta$.

#### The Probabilities

- The estimation procedure used is determined by the assumed distribution of $\varepsilon$.
- What is the probability we observe someone with $y_i = 1$?
  - Use the definition of the CDF.
  - $\Pr(y_i = 1) = \Pr(y_i^* > 0) = \Pr(\varepsilon_i > -x_i\beta) = 1 - F(-x_i\beta)$
- What is the probability we observe someone with $y_i = 0$?
  - Use the definition of the CDF.
  - $\Pr(y_i = 0) = \Pr(y_i^* \le 0) = \Pr(\varepsilon_i \le -x_i\beta) = F(-x_i\beta)$
- Two standard models: $\varepsilon$ is either
  - normal, or
  - logistic.

#### Normal (probit) Model

- $\varepsilon$ is distributed as a standard normal:
  - Mean zero
  - Variance 1
- Evaluate probability ($y_i = 1$): $\Pr(y_i = 1) = \Pr(\varepsilon_i > -x_i\beta) = 1 - \Phi(-x_i\beta)$
  - Given symmetry: $1 - \Phi(-x_i\beta) = \Phi(x_i\beta)$
- Evaluate probability ($y_i = 0$): $\Pr(y_i = 0) = \Pr(\varepsilon_i \le -x_i\beta) = \Phi(-x_i\beta)$
  - Given symmetry: $\Phi(-x_i\beta) = 1 - \Phi(x_i\beta)$
- Summary:
  - $\Pr(y_i = 1) = \Phi(x_i\beta)$
  - $\Pr(y_i = 0) = 1 - \Phi(x_i\beta)$
- Notice that $\Phi(a)$ is increasing in a. Therefore, if one of the x's increases the probability of observing y, we would expect the coefficient on that variable to be positive.
- The standard normal assumption (variance = 1) is not critical.
- In practice, the variance may not equal 1, but given the math of the problem, we cannot identify the variance. It is absorbed into the parameter estimates.

#### Logit

- CDF: $F(a) = \exp(a) / (1 + \exp(a))$
  - Symmetric, unimodal distribution
  - Looks a lot like the normal
  - Incredibly easy to evaluate the CDF and PDF
  - Mean of zero, variance > 1 (more variance than the standard normal)
- Evaluate probability ($y_i = 1$): $\Pr(y_i = 1) = \Pr(\varepsilon_i > -x_i\beta) = 1 - F(-x_i\beta)$
  - Given symmetry: $1 - F(-x_i\beta) = F(x_i\beta)$
  - $F(x_i\beta) = \exp(x_i\beta) / (1 + \exp(x_i\beta))$
- Evaluate probability ($y_i = 0$): $\Pr(y_i = 0) = \Pr(\varepsilon_i \le -x_i\beta) = F(-x_i\beta)$
  - Given symmetry: $F(-x_i\beta) = 1 - F(x_i\beta)$
  - $1 - F(x_i\beta) = 1 / (1 + \exp(x_i\beta))$
- When $\varepsilon$ has a logistic distribution:
  - $\Pr(y_i = 1) = \exp(x_i\beta) / (1 + \exp(x_i\beta))$
  - $\Pr(y_i = 0) = 1 / (1 + \exp(x_i\beta))$

#### Example: Workplace smoking bans

- Smoking supplements to the 1991 and 1993 National Health Interview Survey
- Asked all respondents whether they currently smoke
- Asked workers about workplace tobacco policies
- Sample: workers
- Key variables: current smoking and whether they faced a workplace ban
- Data: `workplace1.dta`
- Sample program: `workplace1.do`
- Results: `workplace1.log`

#### Description of variables in data

`desc` output (variable name and variable label):
- smoker: is current smoker
- worka: has workplace smoking bans
- age: age in years
- male: male
- black: black
- hispanic: hispanic
- hsgrad: is hs graduate
- somecol: has some college
- college: college

#### Summary statistics

[Output: Obs, Mean, Std. Dev., Min, and Max for smoker, worka, age, male, black, hispanic, incomel, hsgrad, somecol, and college; 16,258 observations, with a smoker mean of 0.2516]

#### Running a probit

From program:

    probit smoker age incomel male black hispanic hsgrad somecol college worka;

- The first variable after `probit` is the discrete outcome; the rest of the variables are the independent variables.
- Includes a constant as a default.

#### Running a logit

From program:

    logit smoker age incomel male black hispanic hsgrad somecol college worka;

- Same as probit, just change the first word.

#### Running linear probability

From program:

    reg smoker age incomel male black hispanic hsgrad somecol college worka, robust;

- Simple regression
- Standard errors are incorrect (heteroskedasticity).
- The `robust` option produces standard errors that allow for an arbitrary form of heteroskedasticity.

#### Probit Results

#### How to measure fit?

- Regression (OLS):
  - Minimize the sum of squared errors
  - Or, maximize $R^2$
  - The model is designed to maximize predictive capacity.
- Not the case with probit/logit:
  - MLE models pick distribution parameters so as to best describe the data generating process.
  - May or may not "predict" the outcome well.

#### Pseudo R2

- $LL_k$: log likelihood with all variables
- $LL_1$: log likelihood with only a constant
- $0 > LL_k > LL_1$, so $|LL_k| < |LL_1|$
- Pseudo $R^2 = 1 - |LL_k / LL_1|$
- Bounded between 0 and 1
- Not anything like an $R^2$ from a regression

#### Predicting Y

- Let b be the estimated value of $\beta$.
- For any candidate vector $x_i$, we can predict probabilities $P_i$: $P_i = \Phi(x_i b)$.
- Once you have $P_i$, pick a threshold value T so that you predict
  - $Y_p = 1$ if $P_i > T$
  - $Y_p = 0$ if $P_i \le T$
- Then compare the fraction correctly predicted.
- Question: what value to pick for T?
- Can pick 0.5:
  - Intuitive. More likely to engage in the
activity than to not engage in it.
  - However, when $\bar{y}$ is small, this criterion does a poor job of predicting $Y_i = 1$.
  - However, when $\bar{y}$ is close to 1, this criterion does a poor job of picking $Y_i = 0$.

From program:

    * predict probability of smoking;
    predict pred_prob_smoke;
    * get detailed descriptive data about predicted prob;
    sum pred_prob, detail;
    * predict binary outcome with 50% cutoff;
    gen pred_smoke1=pred_prob_smoke>.5;
    label variable pred_smoke1 "predicted smoking, 50% cutoff";
    * compare actual values;
    tab smoker pred_smoke1, row col cell;

[Output of `sum pred_prob, detail`: percentiles, smallest and largest values, mean, std. dev., variance, skewness, and kurtosis of the predicted probabilities]

- Notice a few things:
  - The sample mean of the predicted probabilities is close to the sample mean outcome.
  - 99% of the probabilities are less than 0.5.
  - Should predict few smokers if we use a 50% cutoff.

[Output: two-way table of actual smoking (smoker) by predicted smoking at the 50% cutoff]

- Check the on-diagonal elements.
- The last number in each 2x2 element is the fraction in the cell.
- The model correctly predicts 74.75 + 0.15 = 74.90% of the obs.
- It only predicts a small fraction of smokers.
- Do not be amazed by the 75 percent correct prediction.
- If you said everyone has a $\bar{y}$ chance of smoking (a case of no covariates), you would be correct $\max[\bar{y}, 1 - \bar{y}]$ percent of the time.
- In this case, 25.16% smoke.
- If everyone had the same chance of smoking, we would assign everyone $\Pr(y = 1) = 0.2516$.
- We would be correct for the $1 - 0.2516 = 0.7484$ share of people who do not smoke.

#### Key points about prediction

- MLE models are not designed to maximize prediction.
- Should not be surprised that they do not predict well.
- In this case, these are not particularly good measures of predictive capacity.

#### Translating coefficients in probit: Continuous Covariates

- $\Pr(y_i = 1) = \Phi[\beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2 + \dots + x_{ki}\beta_k]$
- Suppose that $x_{1i}$ is a continuous variable.
- $\partial \Pr(y_i = 1) / \partial x_{1i} = ?$
- What is the change in the probability of an event given a change in $x_{1i}$?

#### Marginal Effect

- $\partial \Pr(y_i = 1) / \partial x_{1i} = \beta_1 \phi[\beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2 + \dots + x_{ki}\beta_k]$
- Notice two things: the marginal effect is a function of the other parameters and of the values of x.

#### Translating Coefficients: Discrete Covariates

- $\Pr(y_i = 1) = \Phi[\beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2 + \dots + x_{ki}\beta_k]$
- Suppose that $x_{2i}$ is a dummy variable (1 if yes, 0 if no).
- A marginal effect makes no sense: we cannot change $x_{2i}$ by a little amount. It is either 1 or 0.
- Redefine the variable of interest. Compare outcomes with and without $x_{2i}$.
- $y_1 = \Pr(y_i = 1 \mid x_{2i} = 1) = \Phi[\beta_0 + x_{1i}\beta_1 + \beta_2 + x_{3i}\beta_3 + \dots]$
- $y_0 = \Pr(y_i = 1 \mid x_{2i} = 0) = \Phi[\beta_0 + x_{1i}\beta_1 + x_{3i}\beta_3 + \dots]$
- Marginal effect = $y_1 - y_0$: the difference in probabilities with and without $x_{2i}$.

#### In STATA

- Marginal effects for continuous variables, and change in probabilities for dichotomous variables
- STATA picks sample means for the x's.

#### STATA command for Marginal Effects

- `mfx compute;`
- Must come after the model is estimated, while the estimates are still active in the program.

#### Interpret results

- A 10% increase in income will reduce smoking by 0.29 percentage points.
- A 10-year increase in age will decrease smoking rates by 0.4 percentage points.
- Those with a college degree are 21.5 percentage points less likely to smoke.
- Those that face a workplace smoking ban have a 6.7 percentage point lower probability of smoking.
- Do not confuse percentage point and percent differences:
  - A 6.7 percentage point drop is 29% of the sample mean of 24 percent.
  - Blacks have smoking rates that are 3.2 percentage points lower than others, which is 13 percent of the sample mean.

#### Comparing Marginal Effects

| Variable | LP | Probit | Logit |
|---|---|---|---|
| age | -0.00040 | -0.00048 | -0.00048 |
| incomel | -0.0289 | -0.0287 | -0.0276 |
| male | 0.0167 | 0.0168 | 0.0172 |
| black | -0.0321 | -0.0357 | -0.0342 |
| hispanic | -0.0658 | -0.0706 | -0.0602 |
| hsgrad | -0.0533 | -0.0661 | -0.0514 |
| college | -0.2149 | -0.2406 | -0.2121 |
| worka | -0.0669 | -0.0661 | -0.0658 |

#### Marginal effects for specific characteristics

- Can generate marginal effects for a specific x.

From program:

    prchange, x(age=40 black=0 hispanic=0 hsgrad=0 somecol=0 worka=0);

- If an x is not specified,
STATA will use the sample mean (e.g., log income in this case).
- Make sure that when you specify a particular dummy variable (= 1), you set the rest to zero.

[Output: `probit`: Changes in Predicted Probabilities for smoker (columns min->max, 0->1, -+1/2, -+sd/2, MargEfct) for age, incomel, male, black, hispanic, hsgrad, and worka]

#### Testing significance of individual parameters

- In large samples, MLE estimates are normally distributed.
- Null hypothesis: $\beta_j = 0$
- If the null is true and the sample is large, $\hat{\beta}_j$ is distributed as a normal with variance $\sigma_j^2$.
- Using notes from before, if we divide $\hat{\beta}_j$ by its standard deviation, we get a standard normal.
- $\hat{\beta}_j / se(\hat{\beta}_j)$ should be N(0, 1).
- $\hat{\beta}_j / se(\hat{\beta}_j)$ = z-score
- 95% of the distribution of a N(0, 1) is between (-1.96, 1.96).
- Reject the null if |z-score| > 1.96.
- Only age is statistically insignificant (cannot reject the null).

#### When will results differ?

- Normal and logit CDFs look:
  - Similar in the mid-point of the distribution
  - Different in the tails
- You obtain more observations in the tails of the distribution when:
  - Sample sizes are large
  - $\bar{y}$ approaches 1 or 0
- These situations will produce more differences in estimates.

#### Some nice properties of the Logit

- Outcome: y = 1 or 0
- Treatment: x = 1 or 0
- Other covariates: z
- Context:
  - y = whether a baby is born with a low weight birth
  - x = whether the mom smoked or not during pregnancy
- Risk ratio: $RR = \text{Prob}(y = 1 \mid x = 1) / \text{Prob}(y = 1 \mid x = 0)$
  - Differences in the probability of an event when x is and is not observed
  - How much does smoking elevate the chance your child will be a low weight birth?
- Let $Y_{yx}$ be the probability y = 1 or 0 given x = 1 or 0.
- Think of the risk ratio the following way:
  - $Y_{11}$ is the probability y = 1 when x = 1.
  - $Y_{10}$ is the probability y = 1 when x = 0.
- $Y_{11} = RR \cdot Y_{10}$
- Odds ratio: $OR = A/B = [Y_{11}/Y_{01}] / [Y_{10}/Y_{00}]$
  - $A = \Pr(y = 1 \mid x = 1) / \Pr(y = 0 \mid x = 1)$ = odds of y occurring if you are a smoker
  - $B = \Pr(y = 1 \mid x = 0) / \Pr(y = 0 \mid x = 0)$ = odds of y happening if you are not a smoker
- What are the relative odds of y happening if you
do or do not experience x?
- Suppose $\Pr(y_i = 1) = F(\beta_0 + \beta_1 x_i + \beta_2 z_i)$ and F is the logistic function.
- Can show that $OR = \exp(\beta_1) = e^{\beta_1}$.
- This number is typically reported by most statistical packages.
- Details:
  - $Y_{11} = \exp(\beta_0 + \beta_1 + \beta_2 z) / (1 + \exp(\beta_0 + \beta_1 + \beta_2 z))$
  - $Y_{10} = \exp(\beta_0 + \beta_2 z) / (1 + \exp(\beta_0 + \beta_2 z))$
  - $Y_{01} = 1 / (1 + \exp(\beta_0 + \beta_1 + \beta_2 z))$
  - $Y_{00} = 1 / (1 + \exp(\beta_0 + \beta_2 z))$
  - $Y_{11}/Y_{01} = \exp(\beta_0 + \beta_1 + \beta_2 z)$
  - $Y_{10}/Y_{00} = \exp(\beta_0 + \beta_2 z)$
  - $OR = A/B = [Y_{11}/Y_{01}] / [Y_{10}/Y_{00}] = \exp(\beta_0 + \beta_1 + \beta_2 z) / \exp(\beta_0 + \beta_2 z) = \exp(\beta_1)$
- Suppose y is rare ($\bar{y}$ close to 0):
  - $\Pr(y = 0 \mid x = 1)$ and $\Pr(y = 0 \mid x = 0)$ are both close to 1, so they cancel.
- Therefore, when $\bar{y}$ is close to 0:
  - Odds Ratio $\approx$ Risk Ratio
- Why is this nice?

#### Population attributable risk

- Average outcome in the population:
- $\bar{y} = (1 - \bar{x}) Y_{10} + \bar{x} Y_{11} = (1 - \bar{x}) Y_{10} + \bar{x}(RR) Y_{10}$
- Average outcomes are a weighted average of outcomes for x = 0 and x = 1.
- What would the average outcome be in the absence of x (e.g., reduce smoking rates to 0)?
- $Y_a = Y_{10}$

#### Population Attributable Risk

- PAR
- Fraction of the outcome attributed to x
- The difference between the current rate and the rate that would exist without x, divided by the current rate
- $PAR = (\bar{y} - Y_a) / \bar{y} = (RR - 1)\bar{x} / [(1 - \bar{x}) + RR \cdot \bar{x}]$

#### Example: Maternal Smoking and Low Weight Births

- 6% of births are low weight:
  - < 2,500 grams (5.5 lbs)
  - Average birth is 3,300 grams
- Maternal smoking during pregnancy has been identified as a key cofactor:
  - 13% of mothers smoke
  - This number was falling about 1 percentage point per year during the 1980s/90s
  - Doubles the chance of a low weight birth

#### Natality detail data

- Census of all births (4 million/year)
- Annual files starting in the 60s
- Information about:
  - Baby (birth weight, length, date, sex, plurality, birth injuries)
  - Demographics (age, race, marital status, education of mom)
  - Birth (who delivered, method of delivery)
  - Health of mom (smoked/drank during pregnancy, weight gain)
- Smoking not available from CA or NY
- 3 million usable observations
- I pulled a 0.5% random sample from 1995
- About 12,500 obs
- Variables: birth weight (grams), smoked, married, 4-level race, 5-level education, mother's age at birth

[Output: two-way table of low birth weight (birth weight < 2,500 grams) by smoking during pregnancy]
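The odds-ratio result for the logit can be checked numerically. The Python sketch below builds the four probabilities from the logistic CDF and confirms that the odds ratio equals exp(beta1), and that the risk ratio is close to the odds ratio only when the outcome is rare. The coefficient values and the function names are hypothetical, chosen only to illustrate the algebra; they are not estimates from the natality data.

```python
import math


def logistic_cdf(a):
    """F(a) = exp(a) / (1 + exp(a))."""
    return math.exp(a) / (1.0 + math.exp(a))


def odds_and_risk_ratio(b0, b1, b2, z):
    """Return (odds ratio, risk ratio) for x = 1 versus x = 0 at a given z."""
    y11 = logistic_cdf(b0 + b1 + b2 * z)  # Pr(y = 1 | x = 1)
    y10 = logistic_cdf(b0 + b2 * z)       # Pr(y = 1 | x = 0)
    y01 = 1.0 - y11                       # Pr(y = 0 | x = 1)
    y00 = 1.0 - y10                       # Pr(y = 0 | x = 0)
    odds_ratio = (y11 / y01) / (y10 / y00)
    risk_ratio = y11 / y10
    return odds_ratio, risk_ratio


# Hypothetical coefficients: b1 is the treatment effect, b2 multiplies z
b1, b2, z = 0.7, 0.1, 1.0

# Rare outcome: a very negative constant makes Pr(y = 1) small
or_rare, rr_rare = odds_and_risk_ratio(-6.0, b1, b2, z)

# Common outcome: a constant of zero puts Pr(y = 1) near one half
or_common, rr_common = odds_and_risk_ratio(0.0, b1, b2, z)

print("exp(b1):", math.exp(b1))
print("rare outcome   OR, RR:", or_rare, rr_rare)
print("common outcome OR, RR:", or_common, rr_common)
```

The odds ratio does not depend on the constant or on z, while the risk ratio does, which is why the approximation OR ≈ RR is only safe when the outcome is rare.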
- Notice a few things:
  - 13.7% of women smoke.
  - 6% have a low weight birth.
- Pr(LBW | Smoke) = 10.28%
- Pr(LBW | ~Smoke) = 5.36%
- RR = Pr(LBW | Smoke) / Pr(LBW | ~Smoke) = 0.1028 / 0.0536 = 1.92

#### Logit results: Odds Ratios

- Smoked:
  - exp(0.674) = 1.96
  - Smokers are twice as likely to have a low weight birth.
- _Irace4_2 (Blacks):
  - exp(0.707) = 2.02
  - Blacks are twice as likely to have a low weight birth.

#### Asking for odds ratios

- `logistic y x1 x2;`
- In this case: `xi: logistic lowbw smoked age married i.educ5 i.race4;`

[Output: odds ratios from the logistic command]

#### PAR

- $PAR = (RR - 1)\bar{x} / [(1 - \bar{x}) + RR \cdot \bar{x}]$
- $\bar{x} = 0.137$
- $RR = 1.96$
- $PAR = 0.116$
- 11.6% of low weight births are attributed to maternal smoking.

#### Hypothesis Testing in MLE models

- MLE estimates are asymptotically normally distributed, one of the properties of MLE.
- Therefore, standard t-tests of hypotheses will work as long as samples are "large".
- What "large" means is open to question.
- What to do when samples are "small"? Table that for a moment.

#### Testing a linear combination of parameters

- Suppose you have a probit model: $\Phi[\beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2 + x_{3i}\beta_3 + \dots]$
- Test a linear combination of parameters.
- Simplest example: test that a subset are zero.
- $\beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$
- To fix the discussion:
  - N observations
  - K parameters
  - J restrictions (count the equals signs, J = 4)

#### Wald Test

- Based on the fact that the parameters are distributed asymptotically normal
- Probability theory review:
  - Suppose you have m draws from a standard normal distribution ($z_i$).
  - $M = z_1^2 + z_2^2 + \dots + z_m^2$
  - M is distributed as a chi-square with m degrees of freedom.
- The Wald test constructs a "quadratic form" suggested by the test you want to perform.
- This combination, because it contains squares of the true parameters, should, if the hypothesis is true, be distributed as a chi-square with J degrees of freedom.
- If the test statistic is "large" relative to the degrees of freedom of the test, we reject, because
there is a low probability we would have drawn that value at random from the distribution.

#### Reading critical values from a table

- All stats books will report the percentiles of a chi-square:
  - Vertical axis: degrees of freedom
  - Horizontal axis: percentiles
  - Entry is the value below which that percentile of the distribution falls.
- Example: suppose 4 restrictions.
- 95% of a chi-square distribution with 4 degrees of freedom falls below 9.488.
- So there is only a 5% chance that a number drawn at random will exceed 9.488.
- If your test statistic is below, cannot reject the null.
- If your test statistic is above, reject the null.

[Table: percentiles of the chi-squared distribution by degrees of freedom]

#### Wald test in STATA

- Default test in MLE models
- Easy to do. Look at program: `test hsgrad somecol college;`
- Does not estimate the "restricted" model
- "Lower power" than other tests, i.e., high chance of a false negative

Output of `test hsgrad somecol college;`:

    ( 1)  hsgrad = 0
    ( 2)  somecol = 0
    ( 3)  college = 0
          chi2(  3) =  504.78
        Prob > chi2 =    0.0000

- Notice the high value of the test statistic. There is a low chance that a variable drawn at random from a chi-square with three degrees of freedom will be this large.
- Reject the null.

#### -2 Log likelihood test

From program:

    * how to run the same tests with a -2 log like test;
    * estimate the unrestricted model and save the estimates in urmodel;
    probit smoker age incomel male black hispanic hsgrad somecol college worka;
    estimates store urmodel;
    * estimate the restricted model. save results in rmodel;
    probit smoker age incomel male black hispanic worka;
    estimates store rmodel;
    lrtest urmodel rmodel;

- I prefer the -2 log likelihood test:
  - Estimates the restricted and unrestricted model
  - Therefore, has more power than a Wald test
- In most cases, they give the same "decision" (reject/not reject).

### Section VI: Categorical Data

#### Ordered Probit

- Many discrete outcomes are responses to questions that have a natural ordering but no quantitative interpretation.
• Examples
  - Self reported health status (excellent, very good, good, fair, poor)
  - Do you agree with the following statement: strongly agree, agree, disagree, strongly disagree
• Can use the same type of model as in the previous section to analyze these outcomes
• Another "latent variable" model
• Key to the model: there is a monotonic ordering of the qualitative responses

Self reported health status
• Excellent, very good, good, fair, poor
• Coded as 1, 2, 3, 4, 5 on the National Health Interview Survey
• We will code as 5, 4, 3, 2, 1 (easier to think of this way)
• Asked on every major health survey
• Important predictor of health outcomes, e.g., mortality
• Key question: what predicts health status?
• Important to note: the numbers 1-5 mean nothing in terms of their value, just an ordering to show you the lowest to highest
• The example below is easily adapted to include categorical variables with any number of outcomes

Model
• $y_i^*$ = latent index of reported health
• The latent index measures your own scale of health. Once $y_i^*$ crosses a certain value you report poor, then fair, then good, then very good, then excellent health
• $y_i = (1, 2, 3, 4, 5)$ for (poor, fair, good, very good, excellent)
• Interval decision rule
  - $y_i = 1$ if $y_i^* \le \mu_1$
  - $y_i = 2$ if $\mu_1 < y_i^* \le \mu_2$
  - $y_i = 3$ if $\mu_2 < y_i^* \le \mu_3$
  - $y_i = 4$ if $\mu_3 < y_i^* \le \mu_4$
  - $y_i = 5$ if $y_i^* > \mu_4$
• As with logit and probit models, we will assume $y_i^*$ is a function of observed and unobserved variables
• $y_i^* = \beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2 + \dots + x_{ki}\beta_k + \varepsilon_i$
• $y_i^* = x_i\beta + \varepsilon_i$
• The threshold values $(\mu_1, \mu_2, \mu_3, \mu_4)$ are unknown. We do not know the value of the index necessary to push you from very good to excellent
• In theory, the threshold values are different for everyone
• Computer will not only estimate the $\beta$'s, but also the thresholds (average across people)
• As with probit and logit, the model will be determined by the assumed distribution of $\varepsilon$
• In practice, most people pick normal, generating an "ordered probit" (I have no idea why)
• We will generate the math for the probit version

Probabilities
• Let's do the outliers, $\Pr(y_i = 1)$ and $\Pr(y_i = 5)$, first
• $\Pr(y_i = 1)$
• $= \Pr(y_i^* \le \mu_1)$
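• $= \Pr(x_i\beta + \varepsilon_i \le \mu_1)$
• $= \Pr(\varepsilon_i \le \mu_1 - x_i\beta)$
• $= \Phi[\mu_1 - x_i\beta] = 1 - \Phi[x_i\beta - \mu_1]$

• $\Pr(y_i = 5) = \Pr(y_i^* > \mu_4)$
  $= \Pr(x_i\beta + \varepsilon_i > \mu_4)$
  $= \Pr(\varepsilon_i > \mu_4 - x_i\beta)$
  $= 1 - \Phi[\mu_4 - x_i\beta] = \Phi[x_i\beta - \mu_4]$

Sample one for y=3
• $\Pr(y_i = 3) = \Pr(\mu_2 < y_i^* \le \mu_3)$
  $= \Pr(y_i^* \le \mu_3) - \Pr(y_i^* \le \mu_2)$
  $= \Pr(x_i\beta + \varepsilon_i \le \mu_3) - \Pr(x_i\beta + \varepsilon_i \le \mu_2)$
  $= \Pr(\varepsilon_i \le \mu_3 - x_i\beta) - \Pr(\varepsilon_i \le \mu_2 - x_i\beta)$
  $= \Phi[\mu_3 - x_i\beta] - \Phi[\mu_2 - x_i\beta]$
  $= 1 - \Phi[x_i\beta - \mu_3] - 1 + \Phi[x_i\beta - \mu_2]$
  $= \Phi[x_i\beta - \mu_2] - \Phi[x_i\beta - \mu_3]$

Summary
• $\Pr(y_i = 1) = 1 - \Phi[x_i\beta - \mu_1]$
• $\Pr(y_i = 2) = \Phi[x_i\beta - \mu_1] - \Phi[x_i\beta - \mu_2]$
• $\Pr(y_i = 3) = \Phi[x_i\beta - \mu_2] - \Phi[x_i\beta - \mu_3]$
• $\Pr(y_i = 4) = \Phi[x_i\beta - \mu_3] - \Phi[x_i\beta - \mu_4]$
• $\Pr(y_i = 5) = \Phi[x_i\beta - \mu_4]$

Likelihood function
• There are 5 possible choices for each person
• Only 1 is observed
• $\mathcal{L} = \sum_i \ln[\Pr(y_i = k)]$, where k is the outcome observed for person i

Programming example
• Cancer control supplement to the 1994 National Health Interview Survey
• Question: what observed characteristics predict self reported health (1-5 scale)?
• 1 = poor, 5 = excellent
• Key covariates: income, education, age, current and former smoking status
• Programs: sr_health_status.do, .dta, .log

    desc

    variable     type    label
    male         byte    1 if male
    age          byte    age in years
    educ         byte    years of education
    smoke        byte    current smoker
    smoke5       byte    smoked in past 5 years
    black        float   1 if respondent is black
    othrace      float   1 if other race (white is ref)
    sr_health    float   1-5 self reported health, 5=excel, 1=poor
    famincl      float   log family income

    tab sr_health

In STATA
    oprobit sr_health male age educ famincl black othrace smoke smoke5

Interpret coefficients
• Marginal effects/changes in probabilities are now a function of 2 things
  - Point of expansion (x's)
  - Frame of reference for outcome (y)
• STATA
  - Picks mean values for x's
  - You pick the value of y

Continuous x's
• Consider y=5
• $d\Pr(y_i = 5)/dx_i = d\Phi[x_i\beta - \mu_4]/dx_i = \beta\phi[x_i\beta - \mu_4]$
• Consider y=3
• $d\Pr(y_i = 3)/dx_i = \beta\phi[x_i\beta - \mu_2] - \beta\phi[x_i\beta - \mu_3]$

Discrete x's
• $x_i\beta = \beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2 + \dots + x_{ki}\beta_k$
  - $x_{2i}$ is yes or no (1 or 0)
• $\Delta\Pr(y_i = 5) =$
• $\Phi[\beta_0 + x_{1i}\beta_1 + \beta_2 + x_{3i}\beta_3 + \dots + x_{ki}\beta_k - \mu_4]$
• $- \Phi[\beta_0 + x_{1i}\beta_1 + x_{3i}\beta_3 + \dots + x_{ki}\beta_k - \mu_4]$
• Change in the probabilities between $x_{2i} = 1$ and $x_{2i} = 0$

Ask for marginal effects
    mfx compute, predict(outcome(5))

Interpret the results
• Males are 4.7 percentage points more likely to report excellent
• Each year of age decreases chance of reporting excellent by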
0.7 percentage points
• Current smokers are 7.5 percentage points less likely to report excellent health

Minor notes about estimation
• Wald tests/-2 log likelihood tests are done the exact same way as in PROBIT and LOGIT
• Tests of individual parameters are done the same way (z-score)
• Use PRCHANGE to calculate marginal effects for a specific person
    prchange, x(age=40 black=0 othrace=0 smoke=0 smoke5=0 educ=16)
  - When a variable is NOT specified (famincl), STATA takes the sample mean
• PRCHANGE will produce results for all outcomes

[STATA output: prchange results for outcomes 1-5; values not recoverable.]

Section VII
Count Data Models

Introduction
• Many outcomes of interest are integer counts
  - Doctor visits
  - Lost work days
  - Cigarettes smoked per day
  - Missed school days
• OLS models can easily handle some integer models
• Example
  - SAT scores are essentially integer values
  - Few at "tails"
  - Distribution is fairly continuous
  - OLS models well
• In contrast, suppose
  - High fraction of zeros
  - Small positive values
• OLS models will
  - Predict negative values
  - Do a poor job of predicting the mass of observations at zero
• Example
  - Dr. visits in past year, Medicare patients (65+)
  - 1987 National Medical Expenditure Survey
  - Top code (for now) at 10
  - 17% have no visits

[Table: frequency distribution of doctor visits, 0 to 10 (top coded); 915 respondents (17.13%) report zero visits; remaining cells are not reliably recoverable.]

Poisson Model
• $y_i$ is drawn from a Poisson distribution
• Poisson parameter varies across observations
• $f(y_i; \lambda_i) = e^{-\lambda_i}\lambda_i^{y_i} / y_i!$ for $\lambda_i > 0$
• $E[y_i] = \mathrm{Var}[y_i] = \lambda_i = f(x_i, \beta)$
• $\lambda_i$ must be positive at all times
• Therefore, we CANNOT let $\lambda_i = x_i\beta$
• Let $\lambda_i = \exp(x_i\beta)$
• $\ln(\lambda_i) = x_i\beta$
• $d\ln(\lambda_i)/dx_i = \beta$
• Remember that $d\ln(\lambda_i) = d\lambda_i/\lambda_i$
• Interpret $\beta$ as the percentage change in mean outcomes for a change in x

Problems with Poisson
• Variance grows with the mean: $E[y_i]$ =
$\mathrm{Var}[y_i] = \lambda_i = f(x_i, \beta)$
• Most data sets have "over dispersion", where the variance grows faster than the mean
• In the dr. visits sample, $\bar{y} = 5.6$, $s = 6.7$
• Imposing mean = variance is a severe restriction, and it tends to understate standard errors

Negative Binomial Model
• $\Pr(y_i) = \dfrac{\Gamma(y_i + \nu_i)}{\Gamma(y_i + 1)\,\Gamma(\nu_i)} \left(\dfrac{\delta}{1+\delta}\right)^{y_i} \left(\dfrac{1}{1+\delta}\right)^{\nu_i}$
• Where $\nu_i = \exp(x_i\beta)$ and $\delta \ge 0$
• $E[y_i] = \delta\nu_i = \delta\exp(x_i\beta)$
• $\mathrm{Var}[y_i] = \delta(1+\delta)\nu_i$
• $\mathrm{Var}[y_i]/E[y_i] = 1 + \delta$
• $\delta$ must always be $\ge 0$
• In this case, the variance grows faster than the mean
• If $\delta = 0$, the model collapses into the Poisson
• Always estimate negative binomial
• If you cannot reject the null that $\delta = 0$, report the Poisson estimates
• Notice that $\ln(E[y_i]) = \ln(\delta) + \ln(\nu_i)$, so
• $d\ln(E[y_i])/dx_i = \beta$
• Parameters have the same interpretation as in the Poisson model

In STATA
• POISSON estimates a MLE model for Poisson
  - Syntax: POISSON y independent variables
• NBREG estimates MLE negative binomial
  - Syntax: NBREG y independent variables

Interpret results for Poisson
• Those with a CHRONIC condition have 50% more mean MD visits
• Those in EXCELent health have 78% fewer MD visits
• BLACKS have 33% fewer visits than whites
• Income elasticity is 0.021: a 10% increase in income generates a 0.21% increase in visits

Negative Binomial
• Interpret results the same way as Poisson
• Look at coefficient/standard error on delta
• Ho: delta = 0 (Poisson model is correct)
• In this case, delta = 5.21, standard error is 0.15, easily reject null
• Var/Mean = 1 + delta = 6.21. Poisson is misspecified; should see very small standard errors in the wrong model

Selected Results, Count Models
Parameter (Standard Error)

    Variable    Poisson            Negative Binomial
    Age65        0.214 (0.026)      0.103 (0.055)
    Age70        0.787 (0.026)      0.204 (0.054)
    Chronic      0.500 (0.014)      0.509 (0.029)
    Excel       -0.784 (0.031)     -0.527 (0.059)
    Ln(Inc)      0.021 (0.007)      0.038 (0.016)

Section VIII
Duration Data

Introduction
• Sometimes we have data on length of time of a particular event or "spells"
  - Time until death
  - Time on unemployment
  - Time to complete a PhD
• Techniques we will discuss were originally used to examine lifespan of objects like light bulbs or machines. These models are often
referred to as "time to failure"

Notation
• T is a random variable that indicates duration (time until death, finding a new job, etc.)
• t is the realization of that variable
• $f(t)$ is a PDF that describes the process that determines the time to failure
• CDF is $F(t)$, which represents the probability an event will happen by time t
  $F(t) = \Pr(T \le t) = \int_0^t f(s)\,ds$
• $F(t)$ represents the probability that the event happens by time t
• What is the probability a person will die on or before the 65th birthday?
• Survivor function: what is the chance you live past t?
• $S(t) = 1 - F(t)$
• If 10% of a cohort dies by their 65th birthday, 90% will die sometime after their 65th birthday
• Hazard function, $\lambda(t)$
• What is the probability the spell will end at time t, given that it has already lasted t?
• What is the chance you find a new job in month 12 given that you've been unemployed for 12 months already?
  $\lambda(t) = \lim_{h \to 0} \dfrac{\Pr(t \le T < t + h \mid T \ge t)}{h}$
• PDF, CDF (failure function), survivor function and hazard function are all related
• $\lambda(t) = f(t)/S(t) = f(t)/(1 - F(t))$
• We focus on the hazard rate because its relationship to time indicates "duration dependence"
• Example: suppose the longer someone is out of work, the lower the chance they will exit unemployment ("damaged goods")
• This is an example of duration dependence: the probability of exiting a state of the world is a function of the length of the spell
• Mathematically
  - $d\lambda(t)/dt = 0$: there is no duration dependence
  - $d\lambda(t)/dt > 0$: there is positive duration dependence; the probability the spell will end increases with time
  - $d\lambda(t)/dt < 0$: there is negative duration dependence; the probability the spell will end decreases over time
• Your choice is to pick a form for $f(t)$ that has positive, negative or no duration dependence

Different Functional Forms
• Exponential
  - $\lambda(t) = \lambda$
  - Hazard is the same over time, a "memoryless" process
• Weibull
  - $F(t) = 1 - \exp(-\gamma t^{\rho})$ where $\rho, \gamma > 0$
  - $\lambda(t) = \rho\gamma t^{\rho - 1}$
  - if $\rho > 1$, increasing hazard
  - if $\rho < 1$, decreasing hazard
  - if $\rho = 1$, exponential
• Others: lognormal, log-logistic, Gompertz
• Little more difficult; can examine when you get comfortable with Weibull

A note
about most data sets
• Most data sets have censored spells
  - Follow people over time
  - All will eventually die, but some do not in your period of analysis
  - Incomplete spells or censored data
• Must build into the log likelihood function
• Let $t_i$ be the duration we observe for all people
• Some people die, and we know they lived until period $t_i$
• Others are observed for $t_i$ periods, but do not die
• Let $d_i = 1$ if data is a complete spell
• $d_i = 0$ if incomplete
• Recall that $f(s)$ is the PDF for someone who dies at period s
  $F(t) = \Pr(s \le t) = \int_0^t f(s)\,ds$
• $F(t)$ is the probability you die by t
• $1 - F(t)$ is the probability you die after t
• If $d_i = 1$ then we observe $f(t_i)$, someone who died in period $t_i$
• If $d_i = 0$ then someone lived past period $t_i$, and the probability of that is $1 - F(t_i)$
  $\mathcal{L} = \sum_i \{ d_i \ln[f(t_i)] + (1 - d_i)\ln[1 - F(t_i)] \}$

Introducing covariates
• Look at exponential
• $\lambda(t) = \lambda$
• Allow this to vary across people
• $\lambda_i(t) = \lambda_i$
• But, like Poisson, $\lambda_i$ is always positive
• Let $\lambda_i = \exp(\beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2 + \dots + x_{ki}\beta_k)$
• In the Weibull, $\lambda(t) = \rho\gamma t^{\rho - 1}$
• Allow it to vary across people: $\lambda_i(t) = \rho\gamma_i t^{\rho - 1}$
• $\gamma_i = \exp(\beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2 + \dots + x_{ki}\beta_k)$

Interpreting Coefficients
• This is the same for both Weibull and Exponential
• In Weibull, $\lambda_i(t) = \rho\gamma_i t^{\rho - 1}$
• Suppose $x_{1i}$ is a dummy variable
• When $x_{1i} = 1$, then $\gamma_{i1} = e^{\beta_0 + \beta_1 + x_{2i}\beta_2 + \dots + x_{ki}\beta_k}$
• When $x_{1i} = 0$, then $\gamma_{i0} = e^{\beta_0 + x_{2i}\beta_2 + \dots + x_{ki}\beta_k}$
• When you construct the ratio $\gamma_{i1}/\gamma_{i0}$, all the other parameters cancel, so
• $\dfrac{\rho\gamma_{i1}t^{\rho - 1} - \rho\gamma_{i0}t^{\rho - 1}}{\rho\gamma_{i0}t^{\rho - 1}} = e^{\beta_1} - 1$
• Percentage change in the hazard when $x_{1i}$ turns from 0 to 1
• STATA prints out $e^{\beta_1}$; just subtract 1

Suppose $x_{2i}$ is continuous
• Suppose we increase $x_{2i}$ by 1 unit
• $\gamma_{i1} = \exp(\beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2 + \dots + x_{ki}\beta_k)$
• $\gamma_{i2} = \exp(\beta_0 + x_{1i}\beta_1 + (x_{2i} + 1)\beta_2 + \dots + x_{ki}\beta_k)$
• Can show that
• $(\gamma_{i2} - \gamma_{i1})/\gamma_{i1} = e^{\beta_2} - 1$
• Percentage change in the hazard for a 1 unit increase in $x_2$

NHIS Multiple Cause of Death
• NHIS
  - Annual survey of 60K households
  - Data on individuals
  - Self-reported health, DR visits, lost workdays, etc.
• MCOD
  - Linked NHIS respondents from 1986-1994 to National Death Index through Dec 31, 1995
  - Identified whether respondent died and of what cause
• Our
sample
  - Males, 50-70, who were married at the time of the survey
  - 1987-1989 surveys
  - Give everyone 5 years (60 months) of followup

Key Variables
• max_mths: maximum months in the survey
• Diedin5: respondent died during the 5 years of followup
• Note: if diedin5 = 0, then max_mths = 60. Diedin5 identifies whether the data is censored or not

[STATA output: summary statistics for max_mths and diedin5 (25,554 observations each); remaining values are not reliably recoverable.]

Duration Data in STATA
• Need to identify which is the duration data
    stset length, failure(failvar)
• Length = duration variable
• Failvar = 1 when durations end in failure, = 0 for censored values
• If all data is uncensored, omit failure(failvar)
• In our case
    stset max_mths, failure(diedin5)

Getting Kaplan-Meier Curves
• Tabular presentation of results
    sts list
• Graphical presentation
    sts graph
• Results by subgroup
    sts graph, by(educ)

[Figure: Kaplan-Meier survival estimate; horizontal axis: analysis time.]

[Figure: Kaplan-Meier survival estimates, by educ (Educ 1, Educ 2, Educ 3, Educ 4); horizontal axis: analysis time.]

MLE of duration model with Covariates
• Basic syntax
    streg covariates, d(distribution)
    streg age_s_yrs black hispanic _Ie* _In*, d(weibull)
• In this model, STATA will print out $\exp(\beta)$
• If you want the coefficients, add the nohr option (no hazard ratio)

Weibull coefficients

[STATA output: Weibull coefficient estimates; values not recoverable.]

• The sign of the parameters is informative
  - Hazard increasing in age
  - Blacks, Hispanics have higher mortality rates
  - Hazard decreases with income and education
• The parameter $\rho = 1.17$
  - Check 95% confidence interval (1.13, 1.21). Can reject null $\rho = 1$ (exponential)
  - Hazard is increasing over time

Hazard ratios

Interpret coefficients
• Age: every year, hazard increases by 4.6%
• Blacks have 61% greater hazard than whites
• Hispanics have 14% greater hazard than non-Hispanics
• Educ 2, 3, 4 are dummies for 9-11, 12-15 and 16+ years of school
• Educ 3: those with 12-15 years of educ have 0.93 - 1 = -0.07, or a 7% lower hazard than those with <9 years of school
• Educ 4: those with a college degree have 0.88 - 1 = -0.12, or a 12% lower hazard than those with <9 years of
school
• Income 2-5 are dummies for people with 10-20K, 20-30K, 30-40K, >40K
• Income 2: those with 10-20K have 0.83 - 1 = -0.17, or a 17% lower hazard than those with income <10K
• Income 5: those with >40K in income have 0.58 - 1 = -0.42, or a 42% lower hazard than those with income <10K

Topics not covered
• Time varying covariates
• Competing risk models
  - Die from multiple causes
• Cox proportional hazard model
  - Heterogeneity in baseline hazard

Introduction to STATA
ECONOMICS 30331
Bill Evans
Fall 2008

This handout provides a very brief introduction to STATA, a convenient and versatile econometrics package. In a few short years, STATA has become one of the leading programs used by researchers in applied microeconomics. STATA was written by economists, so it is more intuitive for researchers in our field. It is fast and relatively easy to use. STATA's speed advantage comes from the fact that all data is loaded into RAM. Consequently, the amount of available memory restricts the size of the problem. Given the size of the data sets we will use in class and the available memory on typical machines, this will not prove to be a constraint.

All the STATA data files, sample programs, this handout, etc. will be available for download from the course web page: http://www.nd.edu/~wevans1/econ30331.htm. In the lower right hand side of the page is a link to STATA programs and data files.

This outline demonstrates those STATA procedures necessary for the course. However, this handout only scratches the surface of STATA's capabilities. The text is written so that you should be able to follow along on a computer with STATA and gradually build up to the point where you can generate simple statistics. My suggestion is that you print out this tutorial, find a computer with STATA, enter the program, then follow along with the tutorial.

Some places on the web where you can learn more about STATA include:
STATA FAQs: http://www.stata.com/support/faqs/
Resources for learning STATA
The STATA listserv: http://www.stata.com/statalist/
UCLA's resources for
learning STATA: http://www.ats.ucla.edu/stat/stata/

STATA Availability

STATA is available in all clusters and classrooms on campus. If you want your own copy of STATA, a one-year site license for STATA 10/IC can be purchased through the STATA Grad Purchase plan. The web site is http://www.stata ... html and the cost is $95. This is not required for class, but if you want to use STATA on your own laptop/desktop, this is the only available avenue.

Once you are into STATA

When you first enter STATA, the screen will look like Figure 1 below. You will notice that there are four boxes on the screen. Area A is called the command line. This is where you will type executable statements. Area B is the variable list. Once you load a data set into STATA, all the variables available to you will be listed in the box. Area C is the review box, and it will contain a history of all the commands executed during this STATA session. Area D is where any results will be reported.

[Figure 1: The STATA main window, showing the command line (A), variable list (B), review box (C) and results area (D).]

The command line is the active area of the screen where you will be typing all your commands. The contents of the other boxes will be determined by what you type here. Once in STATA, the cursor should be blinking in the command line, indicating to you that the program is waiting to accept input. Commands are executed by hitting return after you have typed the command. Throughout this tutorial, anything written in COURIER FONT is a command that should be executed through the command line.

There are two ways to produce statistics in STATA. First, you can write executable statements line by line from the command line and execute the code. Alternatively, you can write an entire program that contains a group of executable statements, then submit the program from the command line. In the text below, we will use the line-by-line interactive approach, but in Appendix 1 I provide a STATA program, cps87.do, that generates all the results in this tutorial. At the end of the handout I
also outline how to execute the batch program. The results from this single program are reported in Appendix 2. Please refer to these results when you want to see the output from any particular line-by-line statement in the tutorial below.

From the command line, you can ask for help at any point. Suppose you wanted some information about how to describe the contents of data sets. From the command line, you would type

    help describe

then hit return. A popup box appears that outlines the syntax for the describe command. Notice also that the command you executed is now in the Review box (C). If at any time you want to reuse a command that has already been executed, click once on the command with the mouse and the text appears in the command line.

The Basics of STATA

In any software package, in order to generate statistical results, you must do three things:
1. Read in raw data from another format and store it in a form usable by the statistical package
2. Manipulate the data (delete observations, create new variables) as needed
3. Generate the statistics

As you will quickly learn, the bulk of your time will be spent on tasks 1 and 2. Generating results in most software packages is trivial; getting the data in a form that is usable is what takes time. Over the next few sections, I will illustrate some ways that STATA handles each of the tasks above.

STATA assumes that all external files are stored in the default subdirectory (folder). What that default directory is depends on how your particular machine is set up. What I recommend is that you construct a subdirectory for your STATA work and, once in STATA, change the default folder. So, for example, suppose that you have constructed a folder d:\bill\econ30331 for your STATA work. From the command line, you would type

    cd d:\bill\econ30331

and hit return. Now STATA will look in this folder for all data sets and write all results to this folder as well.

When you are working interactively, you may want to save a log of your activity: a list of all the commands and results
from your current STATA session that are posted in the results section (area D in Figure 1). You can construct a log by typing the following command

    log using stata_log_1.log, replace

and hitting return. The log will be written to the file stata_log_1.log, and the replace option tells the program to overwrite an existing file with that name. At the end of your session, you will type

    log close

and hit return to close the log file. Please note that STATA commands, data set names and variable names are case sensitive.

Opening a STATA data file

A data set contains a collection of variables that describe different units. Think of the data set as a matrix having columns and rows. The rows are separate observations (people, companies, cities, time periods), while each column is a different variable that describes a specific characteristic of the observations in the sample. What all statistical software packages are designed to manipulate are the columns in the matrix, or the different variables. So you may want to know average earnings across the respondents in your sample, or what fraction of people voted in the last election, or the correlation between income and years of education for people.

For many projects, you will have access to a data file that is already in STATA format and ready for use by the program. STATA data files always have a .dta extension, and loading them into STATA is straightforward. Before we demonstrate this, a quick word about memory. STATA is a very fast program because it requires that all data be read into RAM. Therefore, the constraint on the program is usually the available RAM. The program will not let you load the data set if there is not enough RAM. How much RAM is allocated to STATA at the start is a function of the machine you are using. However, you can allocate more RAM to STATA at any time during your STATA session. For most of the class projects, 2 meg of RAM should be sufficient, so before you get started, set the RAM by typing from the command line

    set memory 2m

and hit return. If the
class projects need more RAM, I will let you know.

On the class web page is a STATA data set, cps87.dta. Please download that into the folder you are using for this class. Suppose the data set has been downloaded and the folder is named c:\bill\econ30331. To load the data into STATA, simply type

    use cps87

and hit return. The data are now available for use in STATA. Once the data set is in memory, you can construct new variables, delete particular observations and generate statistics. After you have constructed new variables, you can save the revised data set under a new name

    save cps87update

or you can save the new data set under the old name by typing

    save cps87, replace

If you no longer need the data set, you can clear it out of memory by typing

    clear

and hitting return. Before going on in this tutorial, please clear this data out of memory.

Reading raw data into STATA

For most empirical projects, you will receive data in some format like ASCII and you must read the data into a STATA data file. This can be accomplished in a variety of ways, and what I illustrate below is but one way to take data from a spreadsheet like EXCEL and transport the data into a STATA data set.

Pictured below in Figure 2 are the first 32 lines from the EXCEL data set cps87.xls. This data set is available for download from the class web page. Please download the file to the folder you are using for this class. The data file is a matrix with 7 columns and 19906 rows. Each row represents data for another observation (person) and each column is a new variable. The data is taken from the 1987 Current Population Survey, and this data file consists of males aged 21-64 who worked full time (>30 hours per week) at the time of the survey. The variables in order are age, race, years_educ, union_status, smsa_size, region, and weekly_earn, and detailed descriptions of the variables are provided in Table 1 below. Looking at the variable definitions in Table 1, the first observation is from a 55 year old white man with 12
years of education, who is not a union member, lives in one of the 19 largest standard metropolitan statistical areas, in the West, and makes $750 per week.

EXCEL stores data differently than STATA, so we must transform the EXCEL file into something usable for STATA. This is done by saving the data in a format called comma delimited data, or CSV format. In CSV format, each row is stored on a different line and variables are separated by commas. When in EXCEL, from the toolbar along the top choose File, then Save As, and in the "Save as Type" section of the Save As box, choose CSV (Comma delimited) (*.csv). This will construct a data set named cps87.csv. If you open the data set in a program editor, you will see the data as it appears in Figure 3. Notice that the first row contains variable names while the other rows contain the data. All rows are on different lines and all variables are separated by commas.

Make sure the cps87.csv file has been added to the default folder you are using for this class. STATA is all set up to read a data set in this format into a STATA data file. From the command line, if you type

    insheet using cps87.csv, comma

then hit return, the data will be loaded into STATA. This command tells STATA to read in the file cps87.csv and that all the variables are separated by commas. You will notice in the results box that 7 variables and 19906 observations were loaded into STATA. You will notice in the variables box that 7 variables are listed.

It is good programming practice to LABEL your variables. Labels are short descriptions of the variables that are helpful for later use. To provide a label for the variable age, you would type

    label var age "age in years"

then hit return. Sample labels for the other 6 variables are listed below. You may want to type these now or at some other time.

    label var race "1 if white non-Hisp, 2 if black non-Hisp, 3 if Hispanic"
    label var years_educ "years of completed education"
    label var union_status "1 if in union, 2 otherwise"
    label var smsa_size "1 if largest 19 smsa, 2 if other smsa, 3 not in
smsa"
    label var region "1 if northeast, 2 if midwest, 3 if south, 4 if west"
    label var weekly_earn "usual weekly earnings, up to $999"

At any time, you can get a list of all of the variables in your data set by typing

    describe

and hitting return. A description of the data set so far is printed in block A of the results in Appendix 2.

[Figure 2: Contents of cps87.xls (screenshot of the first rows of the data in EXCEL).]

Table 1: Detailed Variable Definitions for cps87.xls

    Variable    Definition
    AGE         Age in years
    RACE        1 if white non-Hispanic, 2 if black non-Hispanic, 3 if Hispanic
    EDUC        Years of completed education (maximum is 18)
    UNIONM      1 if a union member, 2 otherwise
    SMSA        1 if live in one of the 19 largest Standard Metropolitan Statistical Areas (SMSA), 2 if live in other SMSA, 3 if live in non-SMSA
    REGION      1 if live in Northeast, 2 if live in Midwest, 3 if live in South, 4 if live in West
    EARNWKE     Usual weekly earnings (nominal 1987 dollars, maximum is 999)

Figure 3: Contents of cps87.csv

    age,race,years_educ,union_status,smsa_size,region,week_earn
    55,1,12,2,1,4,750
    57,1,16,2,1,4,690
    30,3,12,2,1,4,240
    34,1,18,2,1,4,800
    31,1,16,2,1,4,999
    32,1,18,2,1,4,750
    39,3,17,2,1,4,240
    55,3,12,1,1,4,440
    39,1,12,2,1,4,999
    52,3,0,2,2,4,420
    55,1,16,2,1,4,850
    45,1,15,2,1,4,830
    34,1,14,2,1,4,596
    30,1,14,2,1,4,563
    42,1,14,2,1,4,625
    49,1,18,2,1,4,500
    27,1,12,2,1,4,700
    35,1,16,2,1,4,999
    30,1,12,1,1,4,439
    40,3,12,2,1,4,462

Generating new variables in STATA

Additional variables can easily be created with the gen command. The syntax for gen is

    gen new_variable_name = mathematical expression

The new variable name is the name of the newly created variable, and it must follow STATA naming conventions. The basic rules for variable names are:
STATA is case-sensitive.
Names can contain no more than 32
characters.
They can contain letters, numbers, or underscores (_).
Spaces or other special characters like &, $, etc. are not allowed.
The first character must be a letter or underscore, not a number.

Below are five examples of the gen statement that construct new variables from the data set we just loaded into memory.

    gen age2=age*age
    gen ln_weekly_earn=ln(weekly_earn)
    gen union=union_status==1
    gen nonwhite=((race==2)|(race==3))
    gen big_ne=((region==1)&(smsa_size==1))

The first two lines use standard mathematical operators to construct new variables. Here, we construct age squared (forget for now why we are interested in this variable) and the natural log of usual weekly earnings. We usually analyze ln(earnings) rather than earnings because the latter is a skewed variable, while the former is approximately normally distributed.

One of the most common variables in applied work is a "dummy variable" that equals 1 or 0, separating people into two groups (male or female, black or white, etc.). These variables are easy to construct with the use of logical operators. Logical operators are of the form

    gen y=(logical statement)

which constructs a new variable y that equals 1 when the logical statement is true and zero otherwise. The last three variables listed above demonstrate how to use logical operators. The variable union equals 1 for union members and zero otherwise. Notice that two equal signs (==) must be used when exact equality is indicated in a logical statement. Combinations of logical statements can be used to construct dummy variables. The vertical line | represents "or" and the & sign represents "and". The variable nonwhite equals 1 if race equals 2 OR 3, and big_ne equals 1 if a respondent comes from a big SMSA in the Northeast census region.

After the variables are constructed, I add a set of variable LABELs. The syntax for labels is illustrated in the next five lines.

    label var age2 "age squared"
    label var ln_weekly_earn "ln usual earnings per week"
    label var union "1=in union, 0 otherwise"
    label var
nonwhite "1=nonwhite, 0=white"
    label var big_ne "1=live in big smsa from northeast, 0=otherwise"

Getting descriptive statistics

Once you have the correct collection of variables in your STATA data file, you may want to construct some simple descriptive statistics. Summary statistics (mean, min, max and standard deviation) are produced with the sum command. So the command

    sum

gets descriptive statistics for all variables. If you only want information for a subset of variables like age and education, then add the variables after the sum command

    sum age years_educ

and hit return. If you want more detailed information on a particular variable (quantiles, medians, skewness, kurtosis, etc.), use the sum command, list the variables and ask for detailed calculations

    sum weekly_earn age, detail

generates detailed statistics for only two variables. Results from these three exercises are reported in blocks B, C and D, respectively, in Appendix 2. In Block B, note that the average age is 37.97 years and 23% of workers are in unions. In Block D, note that median weekly earnings are 449 dollars but average earnings are higher at 488.26.

Summary statistics for subsamples of the population are easily calculated as well. For example, suppose one wanted to look at average weekly earnings across different racial and ethnic groups. First you would sort the data by race

    sort race

then ask to have the means calculated for the racial subgroups

    by race: sum weekly_earn

The by variable option must be ended with a colon, and the data must be sorted in order for this option to work. The by option can be used with virtually all of STATA's commands. Results from this exercise are reported in Block E of Appendix 2. Note that average earnings for whites, blacks and Hispanics are $506, $383 and $369.

Suppose instead that one needed sample means for those with at least a high school education. In this case, the if statement can be used as an option, and the sample is restricted to those people for whom the if statement is true. So, for example,

    sum weekly_earn
if years_educ>=12

will only generate sample means for those people with 12 or more years of education. The observations with years_educ<12 have not been deleted from the sample; rather, they were simply not used in the previous command. These results are in Block F in Appendix 2, and note that average earnings increase to $509.62 when lower educated workers are excluded.

You can obtain complete distributions for discrete variables by using the TABULATE command. For example, if you want to know the fraction of people by racial/ethnic group, you would type

    tabulate race

and hit return. These results are reported in Block G in Appendix 2: 85.9 percent of the sample is white non-Hispanic, 8.25% are black non-Hispanic, while 5.83% are Hispanic. You can construct two-way contingency tables by listing the two variables in the TABULATE command. For example, with the line

    tabulate region smsa_size, row column

and hitting return, STATA will count the number of observations for all 12 unique groups of region and SMSA. The row and column options to the command tell STATA to also report row and column percentages. The results from this exercise are reported in Block H of Appendix 2. Notice in this case that 2906 observations have region=1 (Northeast) and smsa=1 (one of the 19 largest SMSAs), while 1133 observations have region=4 (West) and smsa=3 (non-SMSA).

Testing whether means in two subsamples are the same

The simplest statistical test that can be performed is to examine whether the means from two different groups are the same. In this case, we will examine weekly earnings for union and nonunion workers. The difference in means across samples is tested with a t-test, and the syntax is

    ttest weekly_earn, by(union)

The results from this exercise are reported in Block I of the results. In this case, notice that the mean earnings among union workers is 515.28, while the mean earnings for nonunion workers is 480.15, and therefore the difference across the two groups (nonunion minus union) is -35.13. The t-statistic on this difference is -7.35. The 95% critical
value of a t-test with 19,904 degrees of freedom is 1.96, so we can easily reject the null hypothesis that the means across the two subsamples are the same, which is also indicated by the low p-value on the t-test.

#### Running a simple OLS regression

The most often estimated model in labor economics is the human capital earnings function. Log weekly wages have been shown to be roughly linear in education and quadratic in age. In the next few lines we run a simple OLS regression. Basic regressions are generated by the reg command, and the syntax is simple: the first variable after reg is the dependent variable and all other variables are independent variables. In this example there are five covariates: age, age2, yearseduc, union and nonwhite. STATA automatically adds a constant to every model unless otherwise specified. The regression statement in the sample program is as follows:

    reg lnweeklyearn age age2 yearseduc nonwhite union

The results from this example are reported in Block J of Appendix 2. We will not interpret these results at this time.

In many empirical models, observations can be grouped into discrete categories. Sometimes the number of categories is small (e.g., race and sex). Sometimes the categories are numerous (states and countries). In a sample with people from 50 states, adding state dummy variables requires the construction of 49 variables. STATA has an automated procedure that will construct the discrete variables and add them to a model. Placed before the REG command, the XI prefix signals to STATA that the variables written as i.name are categorical and should be expanded into a set of dummy variables, with one category omitted as the reference group.

#### Clearing and closing

Once you are done with your interactive STATA session, you can close the log file by typing log close and hitting return. Also, in order to exit you must clear the data out of memory, which can be done by typing clear. You can clear the data out of memory at this point.

#### Running do programs

The text above describes an interactive STATA session where lines of code are typed in the command line and submitted one at a time. An interactive session
is an excellent way to learn STATA: you see the errors right away and you adjust as you go along. However, as you get more proficient in your programming, you will want to write STATA programs and submit them as a batch job. STATA programs can be written in any ASCII editor such as Wordpad or Notepad, and the files must have a .do extension. All of the lines of code discussed above have been collected in a STATA do program called cps87.do, and a copy of this program is contained in Appendix 1 below. The program is also available for download from the class web page. Please download this file to the default folder you are using for this class.

STATA reads each line of this program as a separate executable statement. Note that between the executable statements there are lines that begin with *. These stars indicate that the line is a comment and is not an executable command. It is good programming practice to include comments in your programs. This helps you when you go back to a program after a long delay, and detailed comments help anyone else who reads your program understand what you are up to.

A few lines into the program you will notice the line

    set more off

When you execute a program, STATA will fill up one screen's worth of text, then wait for the operator to hit return in order to proceed. The command above turns this feature off.

If you have a copy of the comma-delimited data set cps87.csv and a copy of the STATA program cps87.do in your default folder, you can execute the STATA batch program by typing

    do cps87

and hitting return. The command do will look for the cps87.do file and execute the commands line by line. The results from this program should be identical to those in Appendix 2.

#### Handling errors

If your program has errors, open it in any ASCII editor, then edit and save the program. You will need to close any open log from the command line by typing log close, and clear any active variables in memory. You are then ready to rerun your program. If you hit the "page up"
key, you will notice that previously entered commands appear in the command line. This is a quick way of recalling lines of code.

#### Exiting STATA

To exit STATA, go to the command line, type CLEAR and hit return, which clears all variables from memory, then type EXIT and hit return.

#### Appendix A: cps87.do

    * set the memory to 2 meg
    set memory 2m
    * set it such that the computer does not need the
    * operator to hit the return key to continue
    set more off
    * write results to a log file
    log using cps87.log,replace
    * read in raw data from comma delimited data
    insheet using cps87.csv, comma
    * label the variables
    label var age "age in years"
    label var race "1 if white non-Hisp, 2 if black non-Hisp, 3 if Hispanic"
    label var yearseduc "years of completed education"
    label var unionstatus "1 if in union, 2 otherwise"
    label var smsasize "1 if largest 19 smsa, 2 if other smsa, 3 not in smsa"
    label var region "1 if northeast, 2 if midwest, 3 if south, 4 if west"
    label var weeklyearn "usual weekly earnings, up to 999"
    * describe what is in the data set
    describe
    * generate new variables
    * lines 1-2 illustrate basic math functions
    * line 3 illustrates a logical operator
    * line 4 illustrates the OR statement
    * line 5 illustrates the AND statement
    gen age2=age*age
    gen lnweeklyearn=ln(weeklyearn)
    gen union=unionstatus==1
    gen nonwhite=((race==2)|(race==3))
    gen bigne=((region==1)&(smsa==1))
    label var age2 "age squared"
    label var lnweeklyearn "log earnings per week"
    label var union "1=in union, 0 otherwise"
    label var nonwhite "1=nonwhite, 0=white"
    label var bigne "1= live in big smsa from northeast, 0=otherwise"
    * get descriptive statistics for all variables
    sum
    * get statistics for only a subset of variables
    sum age yearseduc
    * get detailed descriptive statistics for a subset of variables
    sum weeklyearn age, detail
    * to get means across different subgroups in the sample, first sort
    * the data, then generate summary statistics by subgroup
    sort race
    by race: sum weeklyearn
    * get weekly earnings for only those with a
    * high school education
    sum weeklyearn if yearseduc>=12
    * get frequencies of discrete variables
    tabulate race
    * get two-way table of frequencies
    tabulate region smsa, row column
    * test whether means are the same across two subsamples
    ttest weeklyearn, by(union)
    * run simple regression
    reg lnweeklyearn age age2 yearseduc nonwhite union
    * run regression adding smsa, region and race fixed effects
    xi: reg lnweeklyearn age age2 yearseduc union i.race i.region i.smsa
    * close log file
    log close
    * see ya

#### Appendix B: Results, cps87.log

          log:  d:\bill\stata\cps87.log
     log type:  text
    opened on:  12 Aug 2008, 12:22:05

    . * read in raw data from comma delimited data
    . insheet using cps87.csv, comma
    (7 vars, 19906 obs)

    . * label the variables
    . label var age "age in years"
    . label var race "1 if white non-Hisp, 2 if black non-Hisp, 3 if Hispanic"
    . label var yearseduc "years of completed education"
    . label var unionstatus "1 if in union, 2 otherwise"
    . label var smsasize "1 if largest 19 smsa, 2 if other smsa, 3 not in smsa"
    . label var region "1 if northeast, 2 if midwest, 3 if south, 4 if west"
    . label var weeklyearn "usual weekly earnings, up to 999"

    . * describe what is in the data set
    . describe

    Contains data
      obs:        19,906
     vars:             7
     size:       318,496 (88.6% of memory free)

                  storage  display
    variable name   type   format    variable label
    age             byte   %8.0g     age in years
    race            byte   %8.0g     1 if white non-Hisp, 2 if black non-Hisp, 3 if Hispanic
    yearseduc       byte   %8.0g     years of completed education
    unionstatus     byte   %8.0g     1 if in union, 2 otherwise
    smsasize        byte   %8.0g     1 if largest 19 smsa, 2 if other smsa, 3 not in smsa
    region          byte   %8.0g     1 if northeast, 2 if midwest, 3 if south, 4 if west
    weeklyearn      int    %8.0g     usual weekly earnings, up to 999

    Sorted by:
         Note:  dataset has changed since last saved

    . * generate new variables
    . * lines 1-2 illustrate basic math functions
    . * line 3 illustrates a logical operator
    . * line 4 illustrates the OR statement
    . * line 5 illustrates the AND statement
    . gen age2=age*age
    . gen lnweeklyearn=ln(weeklyearn)
    . gen union=unionstatus==1
    . gen nonwhite=((race==2)|(race==3))
    . gen bigne=((region==1)&(smsa==1))
    . label var age2 "age squared"
    . label var lnweeklyearn "log earnings per week"
    . label var union "1=in union, 0 otherwise"
    . label var nonwhite "1=nonwhite, 0=white"
    . label var bigne "1= live in big smsa from northeast, 0=otherwise"

    . * get descriptive statistics for all variables
    . sum

        Variable |     Obs        Mean   Std. Dev.        Min        Max
    -------------+------------------------------------------------------
             age |   19906    37.96619   11.15348          21         64
            race |   19906    1.199136    .525493           1          3
       yearseduc |   19906    13.16126   2.795234           0         18
     unionstatus |   19906    1.769065   .4214418           1          2
        smsasize |   19906    1.908369   .7955814           1          3
          region |   19906    2.462373   1.079514           1          4
      weeklyearn |   19906     488.264   236.4713          60        999
            age2 |   19906    1565.826   912.4383         441       4096
    lnweeklyearn |   19906    6.067307    .513047    4.094345   6.906755
           union |   19906    .2309354   .4214418           0          1
        nonwhite |   19906    .1408118   .3478361           0          1
           bigne |   19906    .1409625   .3479916           0          1

    . * get statistics for only a subset of variables
    . sum age yearseduc

        Variable |     Obs        Mean   Std. Dev.        Min        Max
    -------------+------------------------------------------------------
             age |   19906    37.96619   11.15348          21         64
       yearseduc |   19906    13.16126   2.795234           0         18

    . * get detailed descriptive statistics for a subset of variables
    . sum weeklyearn age, detail

                  usual weekly earnings, up to 999
    -------------------------------------------------------------
          Percentiles      Smallest
     1%          128             60
     5%          178             60
    10%          210             60       Obs               19906
    25%          300             63       Sum of Wgt.       19906

    50%          449                      Mean            488.264
                            Largest       Std. Dev.      236.4713
    75%          615            999
    90%          865            999       Variance        55918.7
    95%          999            999       Skewness        .668646
    99%          999            999       Kurtosis       2.632356

                            age in years
    -------------------------------------------------------------
          Percentiles      Smallest
     1%                          21
     5%           23             21
    10%           24             21       Obs               19906
    25%           29             21       Sum of Wgt.       19906

    50%           36                      Mean           37.96619
                            Largest       Std. Dev.      11.15348
    75%           46             64
    90%           55             64       Variance       124.4001
    95%           59             64       Skewness       .4571929
    99%           63             64       Kurtosis       2.224794

    . * to get means across different subgroups in the sample, first sort
    . * the data, then generate summary statistics by subgroup
    . sort race
    . by race: sum weeklyearn

[The summary tables for the individual race groups are illegible in this copy of the log.]

    . * get weekly earnings for only those with a
    . * high school education
    . sum weeklyearn if yearseduc>=12

        Variable |     Obs        Mean   Std. Dev.        Min        Max
    -------------+------------------------------------------------------
      weeklyearn |   17129    509.6206   238.1675          60        999

    . * get frequencies of discrete variables
    . tabulate race

    1 if white non-Hisp, 2 if black non-Hisp, 3 if Hispanic
                |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |     17,103       85.92       85.92
              2 |      1,642        8.25       94.17
              3 |      1,161        5.83      100.00
    ------------+-----------------------------------
          Total |     19,906      100.00

    . * get two-way table of frequencies
    . tabulate region smsa, row column

    Key: frequency / row percentage / column percentage
    Rows:    region   (1 if northeast, 2 if midwest, 3 if south, 4 if west)
    Columns: smsasize (1 if largest 19 smsa, 2 if other smsa, 3 not in smsa)

        region |         1          2          3 |     Total
    -----------+---------------------------------+----------
             1 |     2,806      1,349        842 |     4,997
               |     56.15      27.00      16.85 |    100.00
               |     38.46      18.89      15.39 |     25.10
    -----------+---------------------------------+----------
             2 |     1,501      1,742      1,592 |     4,835
               |     31.04      36.03      32.93 |    100.00
               |     20.58      24.40      29.10 |     24.29
    -----------+---------------------------------+----------
             3 |     1,501      2,542      1,904 |     5,947
               |     25.24      42.74      32.02 |    100.00
               |     20.58      35.60      34.80 |     29.88
    -----------+---------------------------------+----------
             4 |     1,487      1,507      1,133 |     4,127
               |     36.03      36.52      27.45 |    100.00
               |     20.38      21.11      20.71 |     20.73
    -----------+---------------------------------+----------
         Total |     7,295      7,140      5,471 |    19,906
               |     36.65      35.87      27.48 |    100.00
               |    100.00     100.00     100.00 |    100.00

    . * test whether means are the same across two subsamples
    . ttest weeklyearn, by(union)

    Two-sample t test with equal variances

       Group |    Obs       Mean   Std. Err.  Std. Dev.   [95% Conf. Interval]
    ---------+----------------------------------------------------------------
           0 |  15309   480.1503   2.017734   249.6532    476.1953   484.1053
           1 |   4597   515.2845   2.705061   183.4063    509.9813   520.5878
    ---------+----------------------------------------------------------------
    combined |  19906    488.264   1.676048   236.4713    484.9788   491.5492
    ---------+----------------------------------------------------------------
        diff |         -35.13423   3.969334              -42.91446    -27.354

        diff = mean(0) - mean(1)                              t =  -8.8514
    Ho: diff = 0                             degrees of freedom =    19904

        Ha: diff < 0            Ha: diff != 0             Ha: diff > 0
     Pr(T < t) = 0.0000    Pr(|T| > |t|) = 0.0000     Pr(T > t) = 1.0000

    . * run simple regression
    . reg lnweeklyearn age age2 yearseduc nonwhite union

          Source |        SS      df        MS        Number of obs =   19906
    -------------+-------------------------------     F(  5, 19900) = 1775.70
           Model |  1616.39963      5  323.279927     Prob > F      =  0.0000
        Residual |  3622.93905  19900  .182057239     R-squared     =  0.3085
    -------------+-------------------------------     Adj R-squared =  0.3083
           Total |  5239.33869  19905  .263217216     Root MSE      =  .42668

    lnweeklyearn |      Coef.  Std. Err.       t   P>|t|   [95% Conf. Interval]
    -------------+-------------------------------------------------------------
             age |   .0679808   .0020033   33.93   0.000   .0640542   .0719075
            age2 |  -.0006778   .0000245  -27.69   0.000  -.0007258  -.0006299
       yearseduc |    .069219   .0011256   61.50   0.000   .0670127   .0714252
        nonwhite |  -.1716133   .0089118  -19.26   0.000  -.1890812  -.1541453
           union |   .1301547   .0072923   17.85   0.000   .1158612   .1444481
           _cons |   3.630805   .0394126   92.12   0.000   3.553553   3.708057

    . * run regression adding smsa, region and race fixed effects
    . xi: reg lnweeklyearn age age2 yearseduc union i.race i.region i.smsa
    i.race        _Irace_1-3       (naturally coded; _Irace_1 omitted)
    i.region      _Iregion_1-4     (naturally coded; _Iregion_1 omitted)
    i.smsasize    _Ismsasize_1-3   (naturally coded; _Ismsasize_1 omitted)

          Source |        SS      df        MS        Number of obs =   19906
    -------------+-------------------------------     F( 11, 19894) =  920.86
           Model |  1767.66908     11  160.697189     Prob > F      =  0.0000
        Residual |  3471.66961  19894  .174508375     R-squared     =  0.3374
    -------------+-------------------------------     Adj R-squared =  0.3370
           Total |  5239.33869  19905  .263217216     Root MSE      =  .41774

    lnweeklyearn |      Coef.  Std. Err.       t   P>|t|   [95% Conf. Interval]
    -------------+-------------------------------------------------------------
             age |    .070194   .0019645   35.73   0.000   .0663435   .0740446
            age2 |  -.0007052    .000024  -29.37   0.000  -.0007522  -.0006581
       yearseduc |   .0643064   .0011285   56.98   0.000   .0620944   .0665184
           union |   .1131485    .007257   15.59   0.000   .0989241   .1273729
        _Irace_2 |  -.2329794   .0110958  -21.00   0.000   -.254728  -.2112308
        _Irace_3 |  -.1795253   .0134073  -13.39   0.000  -.2058047  -.1532458
      _Iregion_2 |  -.0088962   .0085926   -1.04   0.301  -.0257383    .007946
      _Iregion_3 |  -.0281747    .008443   -3.34   0.001  -.0447238  -.0116257
      _Iregion_4 |   .0318053   .0089802    3.54   0.000   .0142034   .0494071
    _Ismsasize_2 |  -.1225607   .0072078  -17.00   0.000  -.1366886  -.1084328
    _Ismsasize_3 |  -.2054124   .0078651  -26.12   0.000  -.2208287  -.1899961
           _cons |    3.76812   .0391241   96.31   0.000   3.691434   3.844807

    . * close log file
    . log close
          log:  d:\bill\stata\cps87.log
     log type:  text
    closed on:  12 Aug 2008, 12:22:06

### Local Average Treatment Effects (Angrist, Imbens and Rubin)

Economics 626

Equation of interest:

$$y_i = \alpha + \beta x_i + \varepsilon_i \qquad (1)$$

where $y$ is wages and $x$ is Vietnam veteran status. Because of potential correlation between $x$ and $\varepsilon$, the model will be estimated by instrumental variables. Let the instrument for $x_i$ be $z_i$, a discrete variable that equals 1 or 0 according to draft eligibility status. The IV, or Wald, estimate for $\beta$ is

$$\hat\beta_{IV} = \frac{(\bar y \mid z_i = 1) - (\bar y \mid z_i = 0)}{(\bar x \mid z_i = 1) - (\bar x \mid z_i = 0)} \qquad (2)$$

Terms:

- $y_i(z_i, x_i)$: the value of $y$ for a given $(z, x)$ pair
- $x_i(z_i)$: the value of $x$ given $z$

Monotonicity assumption: $x_i(1) \ge x_i(0)$.

| | $z_i = 0$: $x_i(0) = 0$ | $z_i = 0$: $x_i(0) = 1$ |
|---|---|---|
| $z_i = 1$: $x_i(1) = 0$ | Never takers: $x_i(1) - x_i(0) = 0$ and $y_i(1,0) - y_i(0,0) = 0$ | Empty: no one will enter the military because they received a high lottery number |
| $z_i = 1$: $x_i(1) = 1$ | Compliers: $x_i(1) - x_i(0) > 0$ and $y_i(1,1) - y_i(0,0) = \Delta$ | Always takers: $x_i(1) - x_i(0) = 0$ and $y_i(1,1) - y_i(0,1) = 0$ |

The value of the denominator in the Wald estimate is generated by a regression of $x$ on $z$. In this case
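the value of $(\bar x \mid z_i = 1) - (\bar x \mid z_i = 0)$ will be determined by those who change their behavior because they received a low draft lottery number (the compliers). The behavior of the never takers and the always takers will not be altered by the instrument. The same can be said for the numerator. Therefore, the IV estimate of $\beta$ measures the impact of veteran status on wages for those whose behavior was changed by the receipt of a low draft lottery number. This is the Local Average Treatment Effect.

### Measurement Error, OLS and Fixed Effects

Economics 626, Bill Evans, Spring 1999

OLS model: $y_i = \alpha + x_i\beta + \varepsilon_i$. In the algebra below the variables are treated as deviations from their means, so the intercept drops out and $\hat\beta = \sum_i x_i y_i / \sum_i x_i^2$.

#### Measurement error in y

Let $y_i^*$ be the true value and $y_i$ the recorded value, with $y_i = y_i^* + w_i$, where $w_i$ is an iid random error with $E[w_i] = 0$, $\mathrm{Var}(w_i) = \sigma_w^2$ and $\mathrm{Cov}(w_i, x_i) = 0$.

True model: $y_i^* = x_i\beta + \varepsilon_i$.

$$
\begin{aligned}
\hat\beta &= \frac{\sum_i x_i y_i}{\sum_i x_i^2} = \frac{\sum_i x_i (y_i^* + w_i)}{\sum_i x_i^2} = \frac{\sum_i x_i y_i^* + \sum_i x_i w_i}{\sum_i x_i^2} \\
&= \frac{\sum_i x_i (x_i\beta + \varepsilon_i) + \sum_i x_i w_i}{\sum_i x_i^2} = \beta + \frac{\sum_i x_i \varepsilon_i}{\sum_i x_i^2} + \frac{\sum_i x_i w_i}{\sum_i x_i^2}
\end{aligned}
$$

$$\operatorname{plim} \hat\beta = \beta + \frac{\sigma_{x\varepsilon}}{\sigma_x^2} + \frac{\sigma_{xw}}{\sigma_x^2} = \beta$$

#### Measurement error in x

Let $x_i^*$ be the true value and $x_i$ the recorded value, with $x_i = x_i^* + v_i$, where $v_i$ is an iid random error with $E[v_i] = 0$, $\mathrm{Var}(v_i) = \sigma_v^2$ and $\mathrm{Cov}(v_i, x_i^*) = 0$.

True model: $y_i = x_i^*\beta + \varepsilon_i$.

$$
\hat\beta = \frac{\sum_i x_i y_i}{\sum_i x_i^2} = \frac{\sum_i (x_i^* + v_i)(x_i^*\beta + \varepsilon_i)}{\sum_i (x_i^* + v_i)^2} = \frac{\beta\sum_i x_i^{*2} + \sum_i x_i^*\varepsilon_i + \beta\sum_i v_i x_i^* + \sum_i v_i\varepsilon_i}{\sum_i x_i^{*2} + 2\sum_i x_i^* v_i + \sum_i v_i^2}
$$

$$
\operatorname{plim} \hat\beta = \frac{\beta\sigma_{x^*}^2 + \sigma_{x^*\varepsilon} + \beta\sigma_{x^*v} + \sigma_{v\varepsilon}}{\sigma_{x^*}^2 + 2\sigma_{x^*v} + \sigma_v^2} = \beta\,\frac{\sigma_{x^*}^2}{\sigma_{x^*}^2 + \sigma_v^2} = \beta\theta, \qquad 0 \le \theta \le 1
$$

With measurement error in $x$, the OLS estimate is biased towards zero.

#### Measurement error in x in a fixed-effects model

$$y_{it} = x_{it}^*\beta + u_i + \varepsilon_{it}, \qquad i = 1, 2, \dots, n, \quad t = 1, 2$$

Solve the fixed-effects model with first differences:

$$\Delta y_i = y_{i2} - y_{i1}, \qquad \Delta x_i^* = x_{i2}^* - x_{i1}^*, \qquad \Delta\varepsilon_i = \varepsilon_{i2} - \varepsilon_{i1}$$

$$\Delta y_i = \Delta x_i^*\beta + \Delta\varepsilon_i$$

Let $x_{it}$ be the recorded value and $x_{it}^*$ the true value, with $x_{it} = x_{it}^* + v_{it}$, where $v_{it}$ is an iid random error with $E[v_{it}] = 0$, $\mathrm{Var}(v_{it}) = \sigma_v^2$, $\mathrm{Cov}(v_{it}, x_{it}^*) = 0$ and $\mathrm{Cov}(v_{i1}, v_{i2}) = 0$. Then $\Delta x_i = \Delta x_i^* + \Delta v_i$.

True model: $\Delta y_i = \Delta x_i^*\beta + \Delta\varepsilon_i$.

$$
\hat\beta_{fe} = \frac{\sum_i \Delta x_i \Delta y_i}{\sum_i \Delta x_i^2} = \frac{\sum_i (\Delta x_i^* + \Delta v_i)(\Delta x_i^*\beta + \Delta\varepsilon_i)}{\sum_i (\Delta x_i^* + \Delta v_i)^2} = \frac{\beta\sum_i \Delta x_i^{*2} + \beta\sum_i \Delta x_i^*\Delta v_i + \sum_i \Delta\varepsilon_i\Delta x_i^* + \sum_i \Delta v_i\Delta\varepsilon_i}{\sum_i \Delta x_i^{*2} + 2\sum_i \Delta x_i^*\Delta v_i + \sum_i \Delta v_i^2}
$$

$$\operatorname{plim} \hat\beta_{fe} = \beta\,\frac{\sigma_{\Delta x^*}^2}{\sigma_{\Delta x^*}^2 + \sigma_{\Delta v}^2}$$

What are $\sigma_{\Delta x^*}^2$ and $\sigma_{\Delta v}^2$? Since $x_{it} = x_{it}^* + v_{it}$ and $v_{it}$ is iid,

$$\mathrm{Var}(v_{i1}) = \mathrm{Var}(v_{i2}) = \sigma_v^2, \qquad \mathrm{Var}(x_{i1}^*) = \mathrm{Var}(x_{i2}^*) = \sigma_{x^*}^2, \qquad \mathrm{Var}(x_{i1}) = \mathrm{Var}(x_{i2}) = \sigma_v^2 + \sigma_{x^*}^2$$

$$\mathrm{Var}(\Delta v_i) = \mathrm{Var}(v_{i1}) + \mathrm{Var}(v_{i2}) = 2\sigma_v^2$$

$$
\begin{aligned}
\mathrm{Var}(\Delta x_i) &= \mathrm{Var}(x_{i2}) + \mathrm{Var}(x_{i1}) - 2\,\mathrm{Cov}(x_{i1}, x_{i2}) \\
&= \mathrm{Var}(x_{i2}) + \mathrm{Var}(x_{i1}) - 2\rho_{12}\,[\mathrm{Var}(x_{i2})\mathrm{Var}(x_{i1})]^{0.5} \\
&= 2(\sigma_v^2 + \sigma_{x^*}^2) - 2\rho_{12}(\sigma_v^2 + \sigma_{x^*}^2) = 2(\sigma_v^2 + \sigma_{x^*}^2)(1 - \rho_{12})
\end{aligned}
$$

where $\rho_{12}$ is the correlation between the recorded values $x_{i1}$ and $x_{i2}$. Since $\Delta x_i = \Delta x_i^* + \Delta v_i$, we have $\mathrm{Var}(\Delta x_i) = \mathrm{Var}(\Delta x_i^*) + \mathrm{Var}(\Delta v_i)$, and so

$$\mathrm{Var}(\Delta x_i^*) = \mathrm{Var}(\Delta x_i) - \mathrm{Var}(\Delta v_i) = 2(\sigma_v^2 + \sigma_{x^*}^2)(1 - \rho_{12}) - 2\sigma_v^2$$

$$
\begin{aligned}
\operatorname{plim} \hat\beta_{fe} &= \beta\,\frac{\sigma_{\Delta x^*}^2}{\sigma_{\Delta x^*}^2 + \sigma_{\Delta v}^2} = \beta\,\frac{2(\sigma_v^2 + \sigma_{x^*}^2)(1 - \rho_{12}) - 2\sigma_v^2}{2(\sigma_v^2 + \sigma_{x^*}^2)(1 - \rho_{12}) - 2\sigma_v^2 + 2\sigma_v^2} \\
&= \beta\,\frac{(\sigma_v^2 + \sigma_{x^*}^2)(1 - \rho_{12}) - \sigma_v^2}{(\sigma_v^2 + \sigma_{x^*}^2)(1 - \rho_{12})} = \beta\left[1 - \frac{\sigma_v^2}{(\sigma_v^2 + \sigma_{x^*}^2)(1 - \rho_{12})}\right] = \beta\theta_{fe}
\end{aligned}
$$

As $\sigma_v^2$ grows, $\theta_{fe}$ approaches 0.

Compare OLS to fixed effects: the OLS factor can be written as $\theta = 1 - \sigma_v^2 / (\sigma_v^2 + \sigma_{x^*}^2)$, so if $\rho_{12} > 0$ then $\theta_{fe} < \theta$; that is, the attenuation bias in fixed effects will be larger than the bias in OLS. The problem of measurement error is magnified in a fixed-effects context.

### Test of Overidentifying Restrictions: 2SLS Models

Economics 626, Bill Evans

#### Background: Generalized Method of Moments (GMM) estimation

There are $n$ observations of data, indexed by $t$. $m_t$ is a moment condition of dimension $g$ that is a function of $z_t$ (data) and $\beta$ (unknown parameter):

$$m = \frac{1}{n}\sum_t m_t(z_t, \beta)$$

GMM estimation: $\min q = m'w^{-1}m$, where $w = \mathrm{cov}(m)$. Let $b$ be the estimate of $\beta$. It can be shown that $\mathrm{cov}(b) = [G'w^{-1}G]^{-1}$, where $G = \partial m / \partial\beta$.

#### Example: OLS as a special case of GMM

$$Y = X\beta + \varepsilon, \qquad E[\varepsilon] = 0, \qquad \mathrm{cov}(\varepsilon) = \sigma^2 I, \qquad E[X'\varepsilon] = 0$$

Define $m = X'\varepsilon$. Then

$$w = \mathrm{cov}(m) = E[(X'\varepsilon)(X'\varepsilon)'] = E[X'\varepsilon\varepsilon'X] = \sigma^2 X'X$$

$$\min q = m'w^{-1}m = \frac{1}{\sigma^2}(X'\varepsilon)'(X'X)^{-1}X'\varepsilon = \frac{1}{\sigma^2}\,\varepsilon'X(X'X)^{-1}X'\varepsilon$$

Let $\hat\varepsilon = Y - Xb$:

$$q = m(b)'w^{-1}m(b) = \frac{1}{\sigma^2}\,\hat\varepsilon'X(X'X)^{-1}X'\hat\varepsilon = \frac{1}{\sigma^2}(Y - Xb)'X(X'X)^{-1}X'(Y - Xb)$$

It is easy to show that $b = (X'X)^{-1}X'Y$. Also $G = \partial m / \partial b = \partial[X'(Y - Xb)] / \partial b$, so $G = -X'X$ and

$$\mathrm{cov}(b) = [G'w^{-1}G]^{-1} = [X'X(\sigma^2 X'X)^{-1}X'X]^{-1} = \sigma^2(X'X)^{-1}$$

#### General test of overidentification

- $k$ = number of parameters (length of $b$)
- $g$ = number of restrictions (length of $m$), with $g \ge k$
- if $g > k$ the system is overidentified
- if $g = k$ the system is exactly identified

$$q = m(b)'w^{-1}m(b) \sim \chi^2_{g-k}$$

under the null that the model is correctly specified. When $g = k$, $q = 0$.

Example: consider the OLS case. By construction $X'\hat\varepsilon = X'(Y - Xb) = 0$, so $q = 0$.

#### Two-stage least squares as GMM

$$Y = X\beta + \varepsilon \quad (\beta \text{ is a } k \times 1 \text{ vector}), \qquad E[\varepsilon] = 0, \qquad \mathrm{cov}(\varepsilon) = \sigma^2 I, \qquad E[X'\varepsilon] \ne 0, \qquad E[Z'\varepsilon] = 0$$

$$X = Z\gamma + u \quad (\gamma \text{ is a } g \times k \text{ matrix}), \qquad \hat\gamma = (Z'Z)^{-1}Z'X, \qquad \hat X = Z\hat\gamma = Z(Z'Z)^{-1}Z'X$$

Define $m = Z'\varepsilon$. Then

$$w = \mathrm{cov}(m) = E[(Z'\varepsilon)(Z'\varepsilon)'] = E[Z'\varepsilon\varepsilon'Z] = \sigma^2 Z'Z$$

$$\min q = m'w^{-1}m = \frac{1}{\sigma^2}(Z'\varepsilon)'(Z'Z)^{-1}Z'\varepsilon = \frac{1}{\sigma^2}\,\varepsilon'Z(Z'Z)^{-1}Z'\varepsilon$$

Let $\hat\varepsilon = Y - Xb$:

$$\min q = \frac{1}{\sigma^2}(Y - Xb)'Z(Z'Z)^{-1}Z'(Y - Xb)$$

#### Overidentification test in the 2SLS case

$$m(b) = Z'\hat\varepsilon = Z'(Y - Xb), \qquad w^{-1} = \frac{(Z'Z)^{-1}}{\sigma^2}$$

$$q = \frac{\hat\varepsilon'Z(Z'Z)^{-1}Z'\hat\varepsilon}{\hat\sigma^2}, \qquad \hat\sigma^2 = \frac{\hat\varepsilon'\hat\varepsilon}{n} \ \text{(a consistent estimate)}$$

$$q = \frac{\hat\varepsilon'Z(Z'Z)^{-1}Z'\hat\varepsilon}{\hat\varepsilon'\hat\varepsilon / n} = n\,\frac{\hat\varepsilon'Z(Z'Z)^{-1}Z'\hat\varepsilon}{\hat\varepsilon'\hat\varepsilon}$$

Consider the synthetic regression $\hat\varepsilon = Z\delta + \xi$. The estimate of $\delta$ is $\hat\delta = (Z'Z)^{-1}Z'\hat\varepsilon$. Define $\tilde\varepsilon = Z\hat\delta = Z(Z'Z)^{-1}Z'\hat\varepsilon$; this is the predicted value of $\hat\varepsilon$ from the synthetic regression. The sum of squares of this value is, by definition,

$$\tilde\varepsilon'\tilde\varepsilon = \hat\varepsilon'Z(Z'Z)^{-1}Z'Z(Z'Z)^{-1}Z'\hat\varepsilon = \hat\varepsilon'Z(Z'Z)^{-1}Z'\hat\varepsilon$$

- $\hat\varepsilon'\hat\varepsilon$ = SST, the total sum of squares
- $\tilde\varepsilon'\tilde\varepsilon$ = SSM, the sum of squares from the model
- $R^2$ = SSM/SST

so by definition

$$q = n\,\frac{\hat\varepsilon'Z(Z'Z)^{-1}Z'\hat\varepsilon}{\hat\varepsilon'\hat\varepsilon} = n\,\frac{\mathrm{SSM}}{\mathrm{SST}} = nR^2$$

where $R^2$ is from the second-stage regression of the predicted errors on all exogenous factors in the model.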
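The equivalence between the GMM criterion and the nR² form of the overidentification statistic can be checked numerically. The block below is a minimal sketch in Python with NumPy, not part of the original handout: the simulated data, the seed, the coefficient values, and the helper names `project` and `overid_test` are all assumptions made for illustration. It estimates 2SLS on simulated data with three valid instruments and one endogenous regressor, then computes the statistic both directly and through the synthetic regression of the 2SLS residuals on Z.

```python
import numpy as np

def project(Z, v):
    """Return Z (Z'Z)^{-1} Z'v, the projection of v onto the columns of Z."""
    return Z @ np.linalg.solve(Z.T @ Z, Z.T @ v)

def overid_test(y, X, Z):
    """Overidentification statistic for 2SLS (hypothetical helper).

    Returns the statistic computed two ways, plus g - k degrees of freedom:
      q_direct = n * e'Z(Z'Z)^{-1}Z'e / e'e
      q_nr2    = n * R^2 from regressing the 2SLS residuals e on Z
    """
    n = y.shape[0]
    X_hat = project(Z, X)                          # first-stage fitted values
    b = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)  # 2SLS coefficients
    e_hat = y - X @ b                              # residuals use the actual X
    e_tilde = project(Z, e_hat)                    # fitted values, synthetic regression
    q_direct = n * (e_hat @ e_tilde) / (e_hat @ e_hat)
    r2 = (e_tilde @ e_tilde) / (e_hat @ e_hat)     # SSM / SST as defined above
    return q_direct, n * r2, Z.shape[1] - X.shape[1]

# Simulated data (assumed for illustration): x is endogenous because it shares u with eps.
rng = np.random.default_rng(0)
n = 2000
instruments = rng.normal(size=(n, 3))
u = rng.normal(size=n)
eps = 0.5 * u + rng.normal(size=n)
x = instruments @ np.array([1.0, 0.5, -0.5]) + u
y = 1.0 + 0.5 * x + eps

X = np.column_stack([np.ones(n), x])            # k = 2 parameters
Z = np.column_stack([np.ones(n), instruments])  # g = 4 moment conditions

q_direct, q_nr2, df = overid_test(y, X, Z)
# Exactly identified case: keep the constant and one instrument, so g = k.
q_exact, _, df_exact = overid_test(y, X, Z[:, :2])
```

With g = 4 and k = 2, the statistic would be compared with a chi-square distribution with 2 degrees of freedom (5 percent critical value of about 5.99). In the exactly identified case the statistic is zero up to rounding error, matching the remark that q = 0 when g = k.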
