Adv Busn & Econ Forecasting (ECON 6218)
These 53 pages of class notes were uploaded by Ora Rath on Sunday, October 25, 2015. The notes belong to ECON 6218 at the University of North Carolina - Charlotte, taught by Staff in Fall. For similar materials see /class/229022/econ-6218-university-of-north-carolina-charlotte in Economics at the University of North Carolina - Charlotte.
Maximum Likelihood Estimation

Contents
1 Introduction to MLE
2 Numerical optimization of the likelihood function
3 The MLE estimation of ARMA processes
  3.1 Introduction to estimation of ARMA models
  3.2 The likelihood function: AR(1) process
  3.3 Simulation example
  3.4 The conditional likelihood function for an AR(p) process
  3.5 The conditional likelihood function for an MA(1) process
  3.6 The conditional likelihood function for an MA(q) process
  3.7 The conditional likelihood function for an ARMA(p,q) process
4 Statistical inference with MLE
  4.1 Distribution of ML estimators
  4.2 Example: AR(1) process
  4.3 Likelihood ratio test
  4.4 The Wald test
  4.5 Lagrange multiplier test
  4.6 Relation between tests
5 Vector and matrix differentiation

References
- Greene, W. H. (2000). Econometric Analysis. Prentice-Hall. Chapter 4.
- Hamilton, J. D. (1994). Time Series Analysis. Princeton University Press. Chapter 5.
- Heij, C., P. de Boer, P. H. Franses, and H. K. van Dijk (2004). Econometric Methods with Applications in Business and Economics. Oxford University Press. Chapter 4.
- Lutkepohl, H. (1993). Introduction to Multiple Time Series Analysis.
- Myung, I. J. (2003). Tutorial on maximum likelihood estimation. Journal of Mathematical Psychology, 47, 90-100.

1 Introduction to MLE

Let f(y|p) denote the probability density function (pdf) that specifies the probability of observing y given the parameter p.
In general, y can be a T × 1 vector and p can be a K × 1 vector of parameters. For illustration purposes, assume that y and p are scalars.

Suppose that the data y represent the number of successes in a sequence of n Bernoulli trials (tossing a coin n times), and the probability of a success on one trial is p. The pdf in this case is given by

f(y|n, p) = [n!/(y!(n−y)!)] p^y (1−p)^{n−y},  y = 0, 1, ..., n.

For n = 10 (10 trials) and p = 0.2 the pdf becomes

f(y|n = 10, p = 0.2) = [10!/(y!(10−y)!)] 0.2^y (1−0.2)^{10−y}.

The pdf for this random variable is presented in Figure 1, which also shows the probability density functions for p = 0.5 and p = 0.7.

Knowledge of the pdf of y is very useful and is the main object of interest for an econometrician. If you know the pdf of y, then you know the expected value of y, its standard deviation, and the most likely value to occur, and you can construct confidence intervals for forecasting, etc. The question becomes which pdf in Figure 1 you should use to find the expected value of y and the other quantities of interest. To choose the right pdf you need to know the true value of p. This is a problem, because you do not know the true value of p and must guess (estimate) it. The estimation of p is done using observed realizations of y.

Maximum likelihood estimation deals with the inverse problem: given observations y and a model (a candidate distribution function that depends on unknown parameters), find the ONE pdf among ALL possible probability density functions. For our simple example, this means a researcher needs to find ONE value of p among all possible values on the [0, 1] interval, which will give the desired pdf. The principle of maximum likelihood estimation states that the desired probability distribution is the one that makes the observed data "most likely." From Figure 1 you can see that when p = 0.2, y = 2 is more likely to occur than y = 5 (0.302 vs. 0.026), and for p = 0.5, y = 5 is the most likely value to observe.
For p = 0.7, the most likely value to observe is y = 7. The maximum likelihood principle then implies that when the realization y = 2 is observed, the desired pdf has parameter value p = 0.2, and when the realization y = 5 is observed, the desired pdf has parameter value p = 0.5. In general, to estimate the unknown parameter p you need to form a likelihood function L(p|y), which shows the likelihood of the parameter p given the realizations y. The value of p that maximizes the likelihood function is the MLE estimate of the unknown parameter p. Notice that the pdf is a function of y given the value of the parameter p, while the likelihood is a function of the parameter p given the data y.

Example. Suppose that you observe the realization y = 7 when n = 10. From Figure 1 you can see that the realization y = 7 is the most likely to occur when p = 0.7; p = 0.2 and p = 0.5 indicate the values 2 and 5 as the most likely. This implies that p = 0.7 must be the maximum likelihood estimate of p.

[Figure 1: The probability of y successes in n = 10 trials, for p = 0.2, p = 0.5, and p = 0.7. Given a realization y = 2 the MLE estimate of p is 0.2, because for p = 0.2 the value y = 2 is the most likely to occur; given y = 5 the MLE estimate of p is 0.5; given y = 7 the MLE estimate of p is 0.7.]

Next, let us look at the likelihood function

L(p|n = 10, y = 7) = f(y = 7|n = 10, p) = [10!/(7! 3!)] p^7 (1 − p)^3.   (1)

The likelihood function in (1) is presented in Figure 2. You can see from the graph that it is maximized at p = 0.7, which produces the MLE estimate.

[Figure 2: The likelihood function L(p|n = 10, y = 7) as a function of the parameter p, with its maximum at p = 0.7.]
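The grid search over p behind Figure 2 can be reproduced in a few lines of Python. This is an illustrative sketch, not part of the original notes; the function name `binom_likelihood` and the grid resolution are my own choices.

```python
import math

def binom_likelihood(p, y=7, n=10):
    """Likelihood L(p | n, y) = C(n, y) p^y (1-p)^(n-y)."""
    return math.comb(n, y) * p**y * (1 - p)**(n - y)

# Evaluate the likelihood on a grid of candidate values of p in (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=binom_likelihood)
print(p_hat)  # 0.7, the maximizer, equal to y/n
```

A finer grid would sharpen the estimate further, but for a one-dimensional parameter this simple search already lands on p = y/n.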
The likelihood function given the observed y = 7 and sample size n = 10 is

L(p|n = 10, y = 7) = f(y = 7|n = 10, p) = [10!/(7! 3!)] p^7 (1 − p)^3,   (2)

and it is maximized at p = 0.7, which is the MLE estimate.

In practice, when the number of unknown parameters in the pdf is large (more than 3), you cannot really plot the likelihood function in a figure. You need to use either analytical or numerical methods to find the MLE estimate. Since the likelihood function and its logarithm are maximized at the same value of p, for computational convenience the MLE estimate is obtained by maximizing the log-likelihood function ln L(p|y). Assuming that the log-likelihood function is differentiable, p_MLE must satisfy the first-order condition (FOC)

∂ ln L(p|y)/∂p |_{p = p_MLE} = 0.

To ensure that p_MLE is a maximum of the log-likelihood function, you also need to check the second-order condition ∂² ln L(p|y)/∂p² |_{p = p_MLE} < 0.

Example (cont.) The log-likelihood for (2) is

ln L(p|n = 10, y = 7) = ln 10! − ln 7! − ln 3! + 7 ln p + 3 ln(1 − p).

The derivative of the log-likelihood is

∂ ln L/∂p = 7/p − 3/(1 − p),

and the FOC 7(1 − p_MLE) − 3 p_MLE = 0, i.e. 7 − 10 p_MLE = 0, gives

p_MLE = 7/10,

which is the MLE estimate that we found by looking at Figure 2. You can check that the second-order condition for a maximum is satisfied. In general, you should be able to show that for any n and y and the binomial probability density function, the log-likelihood becomes

ln L(p|n, y) = ln n! − ln y! − ln(n − y)! + y ln p + (n − y) ln(1 − p),

the FOC is y/p − (n − y)/(1 − p) = 0, and the MLE estimate is

p_MLE = y/n.
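The closed-form solution p_MLE = y/n can be checked numerically: the score (the derivative of the log-likelihood) should vanish at the MLE. A hypothetical sketch, not from the notes; the helper names `loglik` and `score` are my own.

```python
import math

def loglik(p, y, n):
    # ln L = ln C(n, y) + y ln p + (n - y) ln(1 - p)
    return math.log(math.comb(n, y)) + y * math.log(p) + (n - y) * math.log(1 - p)

def score(p, y, n, h=1e-7):
    # numerical first derivative of the log-likelihood (central difference)
    return (loglik(p + h, y, n) - loglik(p - h, y, n)) / (2 * h)

y, n = 7, 10
p_mle = y / n   # closed-form solution of the first-order condition
print(abs(score(p_mle, y, n)) < 1e-4)  # True: the score vanishes at the MLE
```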
So far we have assumed that there is only one observation of y. Let us look at an example with several random variables.

Example. Let Y_1, Y_2, ..., Y_T denote iid random variables with pdf

f(y_t; p) = p^{y_t} (1 − p)^{1 − y_t},  y_t = 0, 1  (and 0 elsewhere),   (3)

where 0 ≤ p ≤ 1. The random variable Y_t takes on the value 1 when the experiment is a success and 0 when it is a failure. We need the joint probability density function that Y_1 = y_1, Y_2 = y_2, ..., Y_T = y_T. Since Y_1, Y_2, ..., Y_T are independent,

f(y_1, y_2, ..., y_T | p) = p^{y_1}(1 − p)^{1−y_1} p^{y_2}(1 − p)^{1−y_2} × ... × p^{y_T}(1 − p)^{1−y_T} = p^{Σ_{t=1}^T y_t} (1 − p)^{T − Σ_{t=1}^T y_t}.   (4)

The likelihood function has the same form as equation (4), but it is written as a function of p rather than of y_1, y_2, ..., y_T:

L(p|y_1, ..., y_T) = p^{Σ_{t=1}^T y_t} (1 − p)^{T − Σ_{t=1}^T y_t}.

The log-likelihood function becomes

ln L(p|y_1, ..., y_T) = (Σ_{t=1}^T y_t) ln p + (T − Σ_{t=1}^T y_t) ln(1 − p),

and the FOC is

∂ ln L(p|y)/∂p = (Σ y_t)/p − (T − Σ y_t)/(1 − p) = 0.

Provided that p is not equal to 0 or 1 (the boundary values of p), this is equivalent to

(1 − p) Σ_{t=1}^T y_t = p (T − Σ_{t=1}^T y_t),

whose solution produces the MLE for p:

p_MLE = (Σ_{t=1}^T y_t)/T.

MLE has many optimal properties: (1) consistency, (2) efficiency, (3) invariance, (4) asymptotic normality.

2 Numerical optimization of the likelihood function

Very often the log-likelihood function cannot be maximized analytically, and you need numerical methods to find the values of the parameters that maximize it. Let me change notation in this section. In practice we observe realizations of T random variables, Y_1 = y_1, Y_2 = y_2, ..., Y_T = y_T, which constitute our sample. The vector of unknown parameters is denoted γ (a K × 1 vector), and the likelihood function is denoted L(γ). The maximum likelihood estimator is found by solving the optimization problem

max_γ  L(γ|y_1, y_2, ..., y_T).

In regression analysis, MLE typically requires specifying a model, i.e. a particular distribution for the white noise process ε_t, which allows a researcher to formulate the joint probability distribution function of the observed sample,

f(y_1, ..., y_T | γ).   (5)

While the joint pdf is written as a function of y_1, y_2, ..., y_T, the likelihood function is the joint pdf written as a function of the unknown parameters γ:

L(γ|y_1, ..., y_T) = f(y_1, ..., y_T | γ).   (6)

For computational convenience the likelihood function is written in log form. Usually, if the likelihood function is not quadratic in the parameters γ, the optimal value of γ cannot be written as an explicit formula in terms of the data y_1, y_2, ..., y_T; in this case a researcher needs to use numerical optimization.
Define the column vector of gradients G(γ) = ∂L(γ)/∂γ and the Hessian matrix H(γ) = −∂²L(γ)/∂γ∂γ′. Optimal values of γ are characterized by the first-order conditions

G(γ̂) = 0  (a K × 1 system).

When K = 1, as in the examples of the first section, the gradient is simply the first derivative with respect to the single parameter γ.

Example. Suppose that the objective (likelihood) function has the form L(γ) = −γ². The gradient in this case is simply the derivative with respect to γ, G(γ) = −2γ. The objective function and gradient are represented in Figure 3. You may notice that the objective function is maximized at the value γ_MLE such that G(γ_MLE) = 0.

Any numerical optimization algorithm is an iterative procedure consisting of several steps:

1. Start: make a guess γ_0 about the initial estimate of γ.
2. Improve and repeat: based on the value γ_0, determine an improved estimate γ_1; iterating these improvements gives a sequence of estimates γ_1, γ_2, .... An example of the improve-and-repeat step is illustrated in Figure 4.
3. Stop: stop the iterations when the improvement becomes sufficiently small, i.e. when |L(γ_{m+1}) − L(γ_m)| < c, where c is some convergence criterion (a very small number).

There are many numerical optimization algorithms available:

1. Grid search; may be used for a one-dimensional problem.
2. The Newton-Raphson algorithm, in which the estimates of γ are improved according to the formula

γ_{m+1} = γ_m + H(γ_m)^{−1} G(γ_m).   (7)

The iterations of the Newton-Raphson algorithm for a one-dimensional problem are illustrated in Figure 4.
3. The steepest ascent algorithm; may require a very large number of iterations.
4. The Davidon-Fletcher-Powell algorithm.
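The Newton-Raphson update in (7) can be sketched for a one-dimensional objective. This is an illustrative sketch, not part of the notes; note the sign convention H(γ) = −d²L/dγ², so the update adds H^{−1}G.

```python
def G(g):   # gradient of L(g) = -g**2
    return -2.0 * g

def H(g):   # Hessian with the sign convention H = -d2L/dg2
    return 2.0

g = 4.0                      # starting value g_0
for _ in range(50):
    step = G(g) / H(g)       # Newton-Raphson update: g_{m+1} = g_m + H^{-1} G
    g = g + step
    if abs(step) < 1e-12:    # convergence criterion
        break

print(g)  # 0.0, the maximizer, reached after one step
```

Because the objective is quadratic, a single Newton step lands exactly on the optimum, mirroring the worked example below.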
Example. Consider the objective function L(γ) = −γ². The gradient is G(γ) = −2γ, and the Hessian is the negative of the second derivative with respect to γ, H(γ) = 2. We know that for this simple objective function γ = 0 is the maximizing value. Let us check how the Newton-Raphson algorithm finds the maximum, beginning the iterations with the starting value γ_0 = 4, which gives the objective value L(γ_0) = −4² = −16. Use equation (7) to find γ_1:

γ_1 = γ_0 + H(γ_0)^{−1} G(γ_0) = 4 + (1/2)(−2 × 4) = 0,

and the value of the objective function is L(γ_1) = 0. Notice that L(γ_1) − L(γ_0) = 16, which means that the value of the objective function increased quite a bit. The next iteration gives

γ_2 = γ_1 + H(γ_1)^{−1} G(γ_1) = 0 + (1/2)(−2 × 0) = 0,

and you may notice that L(γ_2) − L(γ_1) = 0: the value of the objective function did not increase. It means that the algorithm has converged on a solution. The iterations stop, and γ = 0 is the value that maximizes the objective function, exactly the same answer as the analytical solution. Notice that the Newton-Raphson algorithm found the correct solution in one step; in this example that happens no matter what the starting value γ_0 is. This is not true in general, and the algorithm may require many iterations before it converges.

[Figure 3: Illustration of the optimization problem.]

3 The MLE estimation of ARMA processes

3.1 Introduction to estimation of ARMA models

Consider an ARMA model

y_t = c + φ_1 y_{t−1} + φ_2 y_{t−2} + ... + φ_p y_{t−p} + ε_t + θ_1 ε_{t−1} + ... + θ_q ε_{t−q},

where ε_t is a white noise process. Let γ = (c, φ_1, ..., φ_p, θ_1, ..., θ_q, σ²) denote the vector of population parameters to be estimated, and suppose that we have observed a sample y_1, ..., y_p, y_{p+1}, ..., y_T. Two popular approaches to estimating an AR(p) model are:

1. Ordinary least squares (OLS):

min Σ_t e_t²,   (8)

where e_t = y_t − c − φ_1 y_{t−1} − ... − φ_p y_{t−p}. With β = (c, φ_1, ..., φ_p)′ and regressor rows x_t′ = (1, y_{t−1}, ..., y_{t−p}), the estimator is β̂_OLS = (X′X)^{−1}X′Y with σ̂² = ê′ê/(T − k), and β̂_OLS ≈ N(β_0, σ̂²(X′X)^{−1}). For AR(p) models, OLS produces consistent, asymptotically normal (CAN) estimators of the true model parameters.

2. Maximum likelihood estimation (MLE):

max_γ  L(γ|y_1, ..., y_T).

[Figure 4: Illustration of three Newton-Raphson iterations finding the optimum of an objective function. The graph shows the gradient G(γ) of the objective function as a function of the parameter γ. The algorithm starts at γ_0; γ_1, γ_2, and γ_MLE denote the estimates obtained in the first, second, and third iterations. γ_MLE is the optimal value because at this value G(γ_MLE) = 0.]
OLS cannot be applied to estimate a general ARMA(p,q) model, but one may use a nonlinear version of OLS called nonlinear least squares. Therefore there are two popular approaches for the estimation of ARMA(p,q) models:

1. Nonlinear least squares (NLS):

min Σ_t e_t²,   (9)

where

e_t = y_t − c − φ_1 y_{t−1} − ... − φ_p y_{t−p} − θ_1 e_{t−1} − ... − θ_q e_{t−q}.

The difference between OLS estimation in (8) and NLS estimation in (9) is that the computation of the error term in (9) is nonlinear: the lagged errors e_{t−1}, ..., e_{t−q} are themselves functions of the parameters, so substituting them recursively involves products of parameters. There is no explicit formula for the NLS estimators of c, φ, θ, or σ²; one has to find the parameter estimates using an optimization algorithm.

2. MLE:

max_γ  L(γ|y_1, ..., y_T).

MLE typically requires specifying a particular distribution for the white noise process ε_t, which allows a researcher to formulate the probability distribution function (likelihood function) of the observed sample,

f(y_1, ..., y_T | γ).   (10)

Finding the MLE estimates of the population parameters γ conceptually involves two steps: (1) the likelihood function (10) must be calculated; (2) the values of γ that maximize this function must be found. Intuitively, the parameters are chosen in such a way that the observed data become as likely, or "probable," as possible.

3.2 The likelihood function: AR(1) process

Consider a Gaussian AR(1) process

y_t = c + φ y_{t−1} + ε_t,  with ε_t ~ iid N(0, σ²).

The vector of population parameters to be estimated is γ = (c, φ, σ²). The probability density function of ε_t is

f(ε_t) = (2πσ²)^{−1/2} exp(−ε_t²/(2σ²)).

Note that ε_t = y_t − c − φ y_{t−1} = (1 − φL)(y_t − μ), where μ = c/(1 − φ). Then, conditional on y_{t−1}, the density of y_t becomes

f(y_t | y_{t−1}; γ) = (2πσ²)^{−1/2} exp(−(y_t − c − φ y_{t−1})²/(2σ²)) |J|,   (11)

where |J| is the absolute value of the Jacobian of the transformation from ε_t to y_t, and |J| = 1. If we know the parameters γ, we can compute the value of the likelihood in (11) for t = 2, ..., T. The computation of the likelihood for the first period, t = 1, is problematic.
The probability density function of y_1 conditional on y_0 is

f(y_1 | y_0; γ) = (2πσ²)^{−1/2} exp(−(y_1 − c − φ y_0)²/(2σ²)),   (12)

and we need to know the observation y_0 in order to compute the likelihood in (12). There are two ways to treat the first observation.

The first approach (popular and relatively easy) is to assume that y_1 is fixed and find the conditional likelihood function. In this case the conditional likelihood function of the entire sample is

f(y_2, ..., y_T | y_1; γ) = Π_{t=2}^T f(y_t | y_{t−1}; γ) = (2πσ²)^{−(T−1)/2} Π_{t=2}^T exp(−(y_t − c − φ y_{t−1})²/(2σ²)).

Notice that the computation of the likelihood starts from period t = 2, because we assume that y_1 is fixed and only y_2, ..., y_T are random. The conditional log-likelihood of the entire sample is

L(γ) = ln f(y_2, ..., y_T | y_1; γ)
     = −((T−1)/2) ln(2π) − ((T−1)/2) ln(σ²) − Σ_{t=2}^T (y_t − c − φ y_{t−1})²/(2σ²)
     ∝ −((T−1)/2) ln(σ²) − (1/(2σ²))(Y − Xβ)′(Y − Xβ),   (13)

where Y = (y_2, y_3, ..., y_T)′ is a (T−1) × 1 vector, X is the (T−1) × 2 matrix with rows x_t′ = (1, y_{t−1}), and β = (c, φ)′ is the 2 × 1 vector of model parameters. The first-order conditions for the parameters β are

∂L(γ)/∂β = (1/σ²) X′(Y − Xβ) = 0  ⇒  β̂_MLE = (X′X)^{−1}X′Y,

so the MLE and OLS estimators of β coincide. The first-order condition for the parameter σ² is

∂L(γ)/∂σ² = −(T−1)/(2σ²) + (1/(2σ⁴))(Y − Xβ)′(Y − Xβ) = 0  ⇒  σ̂²_MLE = (Y − Xβ̂)′(Y − Xβ̂)/(T − 1).

Notice that the OLS estimator of σ² for this AR(1) model is

σ̂²_OLS = (Y − Xβ̂)′(Y − Xβ̂)/(T − 1 − k),

where k is the number of regressors in the model (k = 2 for an AR(1) model). There is a T − 1 term instead of T in the denominator because there are T − 1 observations in the sample.
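The claim that the conditional MLE of (c, φ) coincides with OLS can be checked by simulation, solving the 2 × 2 normal equations by hand. An illustrative sketch, not from the notes; all parameter values here are arbitrary.

```python
import random

random.seed(1)
c, phi, sigma = 1.2, 0.8, 2.0
T = 5_000

# Simulate a Gaussian AR(1): y_t = c + phi*y_{t-1} + e_t
y = [c / (1 - phi)]                      # start at the unconditional mean
for _ in range(T - 1):
    y.append(c + phi * y[-1] + random.gauss(0, sigma))

# Conditional MLE = OLS of y_t on (1, y_{t-1}); solve the normal equations
x, yy = y[:-1], y[1:]
n = len(x)
sx, sy = sum(x), sum(yy)
sxx = sum(v * v for v in x)
sxy = sum(a * b for a, b in zip(x, yy))
phi_hat = (n * sxy - sx * sy) / (n * sxx - sx * sx)
c_hat = (sy - phi_hat * sx) / n

print(round(c_hat, 2), round(phi_hat, 2))  # close to (1.2, 0.8)
```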
The second approach (which may be quite complicated) is to find the unconditional density of y_1 and the exact likelihood function of the entire observed sample. Because ε_t is Gaussian white noise, the density of y_1 also follows a normal distribution. Recall that E(y_1) = μ = c/(1 − φ) and E(y_1 − μ)² = σ²/(1 − φ²). Therefore

f(y_1; γ) = ((1 − φ²)/(2πσ²))^{1/2} exp(−(1 − φ²)(y_1 − c/(1 − φ))²/(2σ²)).

You may notice that the unconditional probability density function of y_1 does not depend on the unobserved y_0, while the conditional pdf (12) does. The unconditional (exact) likelihood of the entire sample is

f(y_1, y_2, ..., y_T; γ) = f(y_1; γ) Π_{t=2}^T f(y_t | y_{t−1}; γ),

and the log-likelihood of the sample becomes

L(γ) = −(T/2) ln(2π) + (1/2) ln(1 − φ²) − (T/2) ln(σ²) − (1 − φ²)(y_1 − c/(1 − φ))²/(2σ²) − Σ_{t=2}^T (y_t − c − φ y_{t−1})²/(2σ²).   (14)

In contrast to the conditional maximum likelihood estimates, the exact maximum likelihood estimates are difficult to compute, and numerical methods are needed. However, if the sample size T is sufficiently large, the exact MLE and the conditional MLE turn out to have the same large-sample distribution, provided that |φ| < 1. When |φ| > 1, the conditional MLE continues to provide consistent estimates, whereas maximization of (14) does not.

3.3 Simulation example

Consider an AR(1) process y_t = c + φ y_{t−1} + ε_t with ε_t ~ iid N(0, σ²). In the simulation I set the population parameters to c = 1.2, φ = 0.8, σ² = 4, and the sample size to T = 200. Assume that the vector of population parameters to be estimated consists of the single parameter γ = φ, because we know the parameters c and σ². I have generated the AR(1) process described above and have plotted the approximate (conditional) log-likelihood function (13) and the exact log-likelihood function (14) in Figure 5.

3.4 The conditional likelihood function for an AR(p) process

Consider an AR(p) process

y_t = c + φ_1 y_{t−1} + ... + φ_p y_{t−p} + ε_t,  with ε_t ~ iid N(0, σ²).

The vector of population parameters to be estimated is γ = (c, φ_1, φ_2, ..., φ_p, σ²). The log-likelihood function conditional on the first p observations assumes the simple form

L(γ) = ln f(y_{p+1}, ..., y_T | y_p, ..., y_1; γ)
     = −((T−p)/2) ln(2π) − ((T−p)/2) ln(σ²) − Σ_{t=p+1}^T (y_t − c − φ_1 y_{t−1} − ... − φ_p y_{t−p})²/(2σ²)
     ∝ −((T−p)/2) ln(σ²) − (1/(2σ²))(Y − Xβ)′(Y − Xβ),   (15)

where Y = (y_{p+1}, y_{p+2}, ..., y_T)′ is a (T−p) × 1 vector, X is the (T−p) × (p+1) matrix with rows x_t′ = (1, y_{t−1}, ..., y_{t−p}), and β = (c, φ_1, ..., φ_p)′ is the (p+1) × 1 vector of model parameters. The conditional likelihood function for the AR(p) process is similar to that for the AR(1) process: the MLE estimates of the parameters c, φ_1, φ_2, ..., φ_p are the same as the OLS estimates, and the MLE estimate of the variance is

σ̂²_MLE = (Y − Xβ̂)′(Y − Xβ̂)/(T − p),

while the OLS estimate divides by T − p − k.

3.5 The conditional likelihood function for an MA(1) process

Consider the Gaussian MA(1) process

y_t = μ + ε_t + θ ε_{t−1},  with ε_t ~ iid N(0, σ²).
The population parameters to be estimated are γ = (μ, θ, σ²). If the value ε_{t−1} were known with certainty, then

y_t | ε_{t−1} ~ N(μ + θ ε_{t−1}, σ²),

f(y_t | ε_{t−1}; γ) = (2πσ²)^{−1/2} exp(−(y_t − μ − θ ε_{t−1})²/(2σ²)).

Recall that given ε_0 = 0 and y_1, ..., y_t, we can compute ε_1, ..., ε_t recursively using

ε_t = y_t − μ − θ ε_{t−1}.

Therefore

f(y_t | y_{t−1}, ..., y_1, ε_0 = 0; γ) = (2πσ²)^{−1/2} exp(−(y_t − μ − θ ε_{t−1})²/(2σ²)).

[Figure 5: The exact and approximate (conditional) log-likelihood functions for an AR(1) model in which only the single parameter φ is unknown, plotted against the value of φ. The likelihood function is a function of all unknown parameters; with more than two unknown parameters it cannot be graphed. The approximate log-likelihood is the blue line and the exact log-likelihood is the red line. For this particular example the two are virtually the same; note, however, that sometimes the approximate log-likelihood may give a poor approximation to the exact log-likelihood.]

Notice the difference in the information set of the conditional pdf of y_t between the AR(1) and MA(1) models: in the MA(1) model we condition not only on past observations of y but also on ε_0 = 0. The sample conditional likelihood function becomes

f(y_T, ..., y_1 | ε_0 = 0; γ) = f(y_1 | ε_0 = 0; γ) Π_{t=2}^T f(y_t | y_{t−1}, ..., y_1, ε_0 = 0; γ)
  = (2πσ²)^{−T/2} Π_{t=1}^T exp(−(y_t − μ − θ ε_{t−1})²/(2σ²)).

The conditional log-likelihood is

L(γ) = ln f(y_T, ..., y_1 | ε_0 = 0; γ) = −(T/2) ln(2π) − (T/2) ln(σ²) − Σ_{t=1}^T (y_t − μ − θ ε_{t−1})²/(2σ²).

For a particular value of γ we thus calculate the sequence of ε_t implied by the data and then the value of the conditional log-likelihood. The value γ̂ that maximizes the conditional likelihood is the conditional MLE estimate; the MLE estimates are found using numerical optimization algorithms. For conditional MLE to give reasonable answers, the MA process should be invertible, |θ| < 1.
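The recursive computation of ε_t and the conditional log-likelihood for the MA(1) model can be combined with a simple grid search over θ (with μ and σ² treated as known, for simplicity). An illustrative sketch, not part of the notes; the parameter values are arbitrary.

```python
import math, random

random.seed(2)
mu, theta, sigma = 0.0, 0.5, 1.0
T = 3_000

# Simulate an invertible Gaussian MA(1): y_t = mu + e_t + theta*e_{t-1}
e = [random.gauss(0, sigma) for _ in range(T + 1)]
y = [mu + e[t] + theta * e[t - 1] for t in range(1, T + 1)]

def cond_loglik(th, mu=0.0, sig2=1.0):
    """Conditional log-likelihood with e_0 = 0, residuals built recursively."""
    ll, e_prev = 0.0, 0.0
    for yt in y:
        et = yt - mu - th * e_prev          # e_t = y_t - mu - theta*e_{t-1}
        ll += -0.5 * math.log(2 * math.pi * sig2) - et * et / (2 * sig2)
        e_prev = et
    return ll

# Grid search over the invertibility region |theta| < 1
grid = [i / 100 for i in range(-90, 91)]
theta_hat = max(grid, key=cond_loglik)
print(theta_hat)  # close to the true value 0.5
```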
3.6 The conditional likelihood function for an MA(q) process

Consider the Gaussian MA(q) process

y_t = μ + ε_t + θ_1 ε_{t−1} + ... + θ_q ε_{t−q},  with ε_t ~ iid N(0, σ²).

The population parameters to be estimated are γ = (μ, θ_1, ..., θ_q, σ²). Given the starting values ε_0 = ε_{−1} = ... = ε_{−q+1} = 0 and y_1, ..., y_t, we can compute ε_1, ..., ε_t recursively:

ε_t = y_t − μ − θ_1 ε_{t−1} − ... − θ_q ε_{t−q}.

Therefore

f(y_t | y_{t−1}, ..., y_1, ε_0 = ε_{−1} = ... = ε_{−q+1} = 0; γ) = (2πσ²)^{−1/2} exp(−(y_t − μ − θ_1 ε_{t−1} − ... − θ_q ε_{t−q})²/(2σ²)).

The conditional log-likelihood is

L(γ) = −(T/2) ln(2π) − (T/2) ln(σ²) − Σ_{t=1}^T (y_t − μ − θ_1 ε_{t−1} − ... − θ_q ε_{t−q})²/(2σ²).

Again, the MA process should be invertible, i.e. the roots of 1 + θ_1 z + ... + θ_q z^q = 0 should lie outside the unit circle.

3.7 The conditional likelihood function for an ARMA(p,q) process

Consider a covariance-stationary Gaussian ARMA(p,q) process

y_t = c + φ_1 y_{t−1} + ... + φ_p y_{t−p} + ε_t + θ_1 ε_{t−1} + ... + θ_q ε_{t−q},  with ε_t ~ iid N(0, σ²).

The population parameters to be estimated are γ = (c, φ_1, ..., φ_p, θ_1, ..., θ_q, σ²). Setting the initial values y_0 = y_{−1} = ... = y_{−p+1} = ȳ and ε_0 = ε_{−1} = ... = ε_{−q+1} = 0, where ȳ = (1/T) Σ_{t=1}^T y_t, and given y_1, ..., y_t, we can compute ε_1, ..., ε_T recursively:

ε_t = y_t − c − φ_1 y_{t−1} − ... − φ_p y_{t−p} − θ_1 ε_{t−1} − ... − θ_q ε_{t−q}.

The conditional log-likelihood may be written as

L(γ) = −(T/2) ln(2π) − (T/2) ln(σ²) − (1/(2σ²)) Σ_{t=1}^T ε_t²,

where the ε_t are computed using the recursive algorithm. The MA part of the process should be invertible. Instead of setting ε_0 = ... = ε_{−q+1} = 0 and y_0 = ... = y_{−p+1} = ȳ, one may assume that y_1, y_2, ..., y_p are fixed.

4 Statistical inference with MLE

To estimate model parameters by the method of ML, one proceeds as follows.

- Step 1: Formulate the log-likelihood. A researcher needs to specify the form of the likelihood function L(γ). For given data Y and X, this should be a known function of γ, so that for every choice of γ the value of L(γ) can be computed.
- Step 2: Maximize the log-likelihood. For the observed data Y and X, the log-likelihood L(γ) is maximized with respect to the parameters γ. This is often a nonlinear optimization problem for which no closed-form analytical solution is available, so numerical optimization is required.
- Step 3: Asymptotic tests. Approximate t-values and F-tests for the ML estimates γ̂_MLE can be obtained from the fact that this estimator is consistent and approximately normally distributed, with covariance matrix determined by the information matrix. One may also use the likelihood ratio test, the Wald test, or the Lagrange multiplier test.

4.1 Distribution of ML estimators
Maximum likelihood estimators have asymptotically optimal statistical properties. Under some mild conditions (the observed data are strictly stationary; neither the estimate γ̂_MLE nor the true value γ_0 falls on a boundary of the allowable parameter space; and the model, i.e. the joint pdf of the data, is correctly specified), the maximum likelihood estimator is:

1. Consistent: γ̂_MLE converges in probability to γ_0.
2. Asymptotically efficient: √T(γ̂_MLE − γ_0) has the smallest covariance matrix among all consistent estimators of γ_0.
3. Asymptotically normal: under regularity conditions that allow generalizations of the central limit theorem to hold,

√T(γ̂_MLE − γ_0) →d N(0, I_0^{−1}),

where I_0 is the asymptotic information matrix evaluated at γ_0, i.e. I_0 = lim_{T→∞} I_T(γ_0), and I_T(γ_0) is the information matrix evaluated at γ = γ_0 for sample size T. This means that asymptotically, conventional t- and F-tests can be based on the approximate distribution

γ̂_MLE ≈ N(γ_0, (1/T) I_T^{−1}(γ̂_MLE)),   (16)

where the information matrix I_T is

I_T(γ̂_MLE) = −(1/T) E[∂²L(γ)/∂γ∂γ′] |_{γ = γ̂_MLE},   (17)

and L(γ) = Σ_{t=1}^T ln f(y_t | Y_{t−1}; γ), with Y_{t−1} denoting the history of observations on y through date t − 1.

4.2 Example: AR(1) process

y_t = c + φ y_{t−1} + ε_t,  t = 1, ..., T,  or  Y = Xβ + ε,

where ε_t ~ iid N(0, σ²), Y is a T × 1 vector, X is the T × 2 matrix with rows (1, y_{t−1}), and β = (c, φ)′ is a 2 × 1 vector. The parameters to be estimated are γ = (c, φ, σ²). The conditional likelihood function for the AR(1) process was given in (13):

L(γ) = −(T/2) ln(2π) − (T/2) ln(σ²) − (1/(2σ²))(Y − Xβ)′(Y − Xβ).

The first-order conditions for maximization are ∂L(γ)/∂γ = 0, where

∂L(γ)/∂β = (1/σ²)(X′Y − X′Xβ),
∂L(γ)/∂σ² = −T/(2σ²) + (1/(2σ⁴))(Y − Xβ)′(Y − Xβ).

To find the variance-covariance matrix of the MLE in (16), we need to compute I_T(γ̂_MLE) as defined in (17):

I_T(γ̂_MLE) = −(1/T) E [ ∂²L/∂β∂β′   ∂²L/∂β∂σ² ;  ∂²L/∂σ²∂β′   ∂²L/∂σ²∂σ² ] |_{γ = γ̂_MLE},   (18)

where

∂²L(γ)/∂β∂β′ = −(1/σ²) X′X,   (19)
∂²L(γ)/∂σ²∂σ² = T/(2σ⁴) − (1/σ⁶)(Y − Xβ)′(Y − Xβ),   (20)
∂²L(γ)/∂β∂σ² = −(1/σ⁴) X′(Y − Xβ).   (21)

Taking the expectations of (19)–(21), the I_T(γ̂_MLE) in (18) becomes

I_T(γ̂_MLE) = [ (1/(Tσ̂²)) X′X   0 ;  0   1/(2σ̂⁴) ].
Substituting this information matrix into (16), one may find that

β̂_MLE ≈ N(β_0, σ̂²(X′X)^{−1}),
σ̂²_MLE ≈ N(σ², 2σ̂⁴/T).

Note that β̂_MLE has the same distribution as the OLS estimator. Also notice that the distribution of σ̂²_MLE is approximately normal when the sample size T is large. The normal distribution does not rule out negative values of σ̂²_MLE, but the probability of a negative variance estimate is very, very small when the sample size is large.

4.3 Likelihood ratio test

Suppose that a researcher is interested in testing m linear or nonlinear restrictions of the form r(γ) = 0, i.e. the null hypothesis is

H0: r(γ) = 0  vs.  H1: r(γ) ≠ 0.

The LR test requires two optimizations: ML estimation under the null hypothesis and ML estimation under the alternative. Denote the ML estimator under the null hypothesis by γ̂_R and under the alternative by γ̂_MLE. The likelihood ratio (LR) test is based on the loss of log-likelihood that results when the restrictions are imposed; the idea of the LR test is presented in Figure 6. The LR statistic has the following form:

LR = 2[log L(γ̂_MLE) − log L(γ̂_R)] ~ χ²_m.

For the standard linear regression model the LR test becomes

LR = T log(ε̂_R′ε̂_R / ε̂′ε̂) = T log(1 + mF/(T − k)),

where F is the usual F-statistic for checking the linear restrictions.

[Figure 6: Demonstration of the hypothesis tests. In this example the restriction r(γ) = 0 has a linear form, aγ + b = 0, which implies γ̂_R = −b/a. The likelihood ratio test compares the values of the likelihood function at γ̂_MLE and γ̂_R; the Wald test checks whether the restriction is likely to be satisfied at the unrestricted estimate, i.e. how far a·γ̂_MLE + b is from zero; the Lagrange multiplier test checks the slope of the log-likelihood function, d ln L(γ)/dγ: if the restriction is valid, the slope of the log-likelihood function should be near zero at the restricted estimate.]
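For the binomial example of Section 1 (y = 7 successes in n = 10 trials), the LR statistic for the single restriction H0: p = 0.5 can be computed directly. An illustrative sketch, not part of the notes.

```python
import math

def loglik(p, y=7, n=10):
    """Binomial log-likelihood ln L(p | n, y), constants included."""
    return (math.log(math.comb(n, y))
            + y * math.log(p) + (n - y) * math.log(1 - p))

p_mle = 7 / 10        # unrestricted MLE
p_r = 0.5             # value imposed under H0: p = 0.5 (one restriction, m = 1)
LR = 2 * (loglik(p_mle) - loglik(p_r))
print(round(LR, 2))   # 1.65, below the 5% chi-square(1) critical value 3.84,
                      # so H0 is not rejected in this small sample
```

The constant ln C(n, y) cancels in the difference, so it could equally be dropped from `loglik`.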
4.4 The Wald test

The Wald test requires only one optimization, of the unrestricted model. This test considers how far the restrictions are from being satisfied at the unrestricted estimator γ̂_MLE; the idea of the Wald test is presented in Figure 6. The Wald statistic has the following form:

W = T r(γ̂_MLE)′ [R I_T^{−1}(γ̂_MLE) R′]^{−1} r(γ̂_MLE) ~ χ²_m,   (22)

where R = ∂r(γ)/∂γ′ evaluated at γ = γ̂_MLE, and I_T(γ̂_MLE) is the information matrix for sample size T defined in (18).

4.5 Lagrange multiplier test

The Lagrange multiplier (LM) test is also known as the score test. This test considers whether the gradient (also called the "score") of the unrestricted likelihood function is sufficiently close to zero at the restricted estimate γ̂_R; the test is depicted in Figure 6. The Lagrange multiplier statistic has the following form:

LM = (1/T) [∂L(γ)/∂γ]′ I_T^{−1}(γ) [∂L(γ)/∂γ] ~ χ²_m,   (23)

where all the expressions are evaluated at γ = γ̂_R. Therefore the LM test requires only the estimation of the restricted model, under the null hypothesis. The LM test is popular because it is very easy to compute. It can be shown that in many cases (nonlinear models, nonlinear restrictions, non-normal disturbances) the LM test can be computed as follows:

1. Step 1: Estimate the restricted model and compute the corresponding residuals ε̂_{R,t} = y_t − f(x_t, γ̂_R), where for a linear model f(x_t, γ̂_R) = x_t′β̂_R.
2. Step 2: Perform an auxiliary regression of ε̂_R on all the variables in the unrestricted model. In nonlinear models y = f(x, γ) + ε the regressors are given by ∂f/∂γ′, and the regressors may be of a different nature for some models.
3. Step 3: Compute LM = TR² ~ χ²_m, where R² is the R² of the regression in Step 2.

4.6 Relation between tests

For testing linear restrictions in a linear model, the following inequalities may be derived:

LM ≤ LR ≤ W,

which means that the corresponding p-values satisfy P_LM ≥ P_LR ≥ P_W. Thus if the LM test rejects the null hypothesis, the same holds true for the LR and Wald tests; if the Wald test fails to reject the null hypothesis, then the same holds true for the LM and LR tests.
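The three-step TR² recipe for the LM test can be sketched for an AR(1) example, testing H0: φ = 0. An illustrative sketch, not part of the notes; the data are generated with φ = 0.8, so the test should reject.

```python
import random

random.seed(3)
T, phi = 2_000, 0.8
# Data generated with phi = 0.8, so H0: phi = 0 is false
y = [0.0]
for _ in range(T):
    y.append(phi * y[-1] + random.gauss(0, 1))
y = y[1:]

# Step 1: estimate the restricted model (y_t = c + e_t) -> demeaned residuals
ybar = sum(y) / len(y)
eR = [v - ybar for v in y]

# Step 2: regress the restricted residuals on (1, y_{t-1})
x, r = y[:-1], eR[1:]
n = len(x)
sx, sr = sum(x), sum(r)
sxx = sum(v * v for v in x)
sxr = sum(a * b for a, b in zip(x, r))
b = (n * sxr - sx * sr) / (n * sxx - sx * sx)
a = (sr - b * sx) / n
fitted = [a + b * v for v in x]
rbar = sr / n
ss_tot = sum((v - rbar) ** 2 for v in r)
ss_res = sum((ri - fi) ** 2 for ri, fi in zip(r, fitted))
r2 = 1 - ss_res / ss_tot

# Step 3: LM = T * R^2, compared with the chi-square(1) critical value 3.84
LM = n * r2
print(LM > 3.84)  # True: H0 is rejected
```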
5 Vector and matrix differentiation

Let L(γ) be a scalar function of the n × 1 vector γ = (γ_1, ..., γ_n)′. Then

∂L(γ)/∂γ = (∂L/∂γ_1, ..., ∂L/∂γ_n)′  and  ∂L(γ)/∂γ′ = (∂L/∂γ_1, ..., ∂L/∂γ_n)

are the n × 1 and 1 × n vectors of first partial derivatives, respectively, and

∂²L(γ)/∂γ∂γ′ = [ ∂²L/∂γ_1∂γ_1 ... ∂²L/∂γ_1∂γ_n ; ... ; ∂²L/∂γ_n∂γ_1 ... ∂²L/∂γ_n∂γ_n ]

is the n × n Hessian matrix of second-order partial derivatives. For example, if γ = (γ_1, γ_2)′ and L(γ) = γ_1² + 2γ_1γ_2, then

∂L(γ)/∂γ = (2γ_1 + 2γ_2, 2γ_1)′  and  ∂²L(γ)/∂γ∂γ′ = [ 2 2 ; 2 0 ].

Let X be an m × n matrix and β an n × 1 vector. Then

∂(Xβ)/∂β′ = X,  ∂(β′X′)/∂β = X′,  ∂(β′X′Xβ)/∂β = 2X′Xβ,

and

∂[(Y − Xβ)′(Y − Xβ)]/∂β = −2X′(Y − Xβ).

Cointegration

References
- Cochrane, J. H. (1997). Time Series for Macroeconomics and Finance. Lecture notes, Chapter 11. http://www.gsb.uchicago.edu/fac/john.cochrane/research/Papers/timeser1.pdf
- Hamilton, J. D. (1994). Time Series Analysis. Princeton University Press.
- Heij, C., P. de Boer, P. H. Franses, and H. K. van Dijk (2004). Econometric Methods with Applications in Business and Economics. Oxford University Press. Chapter 7.
- Hendry, D. F., and K. Juselius (2001). Explaining cointegration analysis: Part II. The Energy Journal, 22(1), 75-120.

1 Introduction

To illustrate the idea of cointegration we will start with a simple model. Let p_t be a 2 × 1 vector of transaction prices for the same asset on two different markets:

p_{1t} = p*_t + u_{1t},
p_{2t} = p*_t + u_{2t},
p*_t = p*_{t−1} + ε_t,

where ε_t ~ iid(0, σ_ε²) and u_{it} ~ iid(0, σ_u²). Here p_{1t} may be thought of as the stock price of IBM on the NYSE and p_{2t} as the stock price of IBM on NASDAQ, with u_{1t} and u_{2t} being market-specific noise. Both prices depend on the efficient price p*_t, which follows a random walk. Note that in this model

(1 − L)p_{1t} = ε_t + u_{1t} − u_{1,t−1}  and  (1 − L)p_{2t} = ε_t + u_{2t} − u_{2,t−1}.

Therefore, while the prices p_{1t} and p_{2t} are I(1), both Δp_{1t} and Δp_{2t} are I(0). Figure 1 presents an example of observations on p_{1t} and p_{2t} along with the stochastic trend p*_t, and Figure 2 presents only the prices p_{1t} and p_{2t}. You can see that even though both prices have a unit root, they move close to each other because they share the same stochastic trend. Figure 3 illustrates that the linear combination p_{1t} − p_{2t} is stationary.
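The common-stochastic-trend mechanism can be checked by simulation: each price inherits the growing variance of the random walk, while the spread p_{1t} − p_{2t} does not. An illustrative sketch, not part of the notes; all parameter values are arbitrary.

```python
import random

random.seed(4)
T = 2_000
# Common efficient price follows a random walk; each market adds iid noise
p_star, p1, p2 = [0.0], [], []
for _ in range(T):
    p_star.append(p_star[-1] + random.gauss(0, 1))
    p1.append(p_star[-1] + random.gauss(0, 0.5))
    p2.append(p_star[-1] + random.gauss(0, 0.5))

def var(x):
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)

spread = [a - b for a, b in zip(p1, p2)]   # cointegrating combination (1, -1)
# The spread's variance stays bounded (about 2 * 0.25 = 0.5), while each
# price's sample variance is dominated by the random-walk component
print(var(spread) < var(p1))  # True
```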
p11 7 p21 is stationary Cointegration is a generalization of unit root to vector models as a single series cannot be coin tegratedi Cointegration analysis is designed to nd linear combinations of variables that remove unit rootsi Suppose that two series are each integrated with the following MA representation lt17Lgtyz altLgtut 17Lwt bLvt In general linear combinations of y and wt will also have a unit rootsi But if there is some linear combination y 7 Own that is stationary y and wt are said to be conitegmted and a E 1 79 is their caintegmting vectm Cointegration vectors are of considerable interest when they exist since they determine 10 relations that hold between variables that are individually nonstationaryi As an example we may look at the real GNP and consumption Each of these series probably has a unit root but the ratio of consumption to real GNP is stable over the long periods of time Therefore log consumption minus log GNP is stationary and log GNP and consumption are cointegratedi Other possible examples include the dividendprice ratio or money and prices However cointegration as such does not say anything about the direction of causality 2 Cointegrating regressions Cointegrating vectors are 77superconsistent77 which means that you can estimate them by OLS even when ther right hand side variables are correlated with the error term and the estimates converge at a faster rate than usual OLS estimatesi Suppose y and wt are cointegrated so that y 7 9w is stationary Estimate the following model using OLS regression y 5w E 1 OLS estimates of B converge to 9 even if the errors are correlated with w Note that if y and wt are each individuall 11 but are not cointegrated then the regression in equation 1 results in spurious regressioni Therefore you have to check whether the estimated residuals are 11 or 10 1 will discuss it later in the notes x x 100 150 200 250 Figure 1 RW component and prices 100 Figure 2 150 Prices p1 and p2 200 0 50 100 150 200 250 Figure 3 
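The two-market price model behind Figures 1-3 is easy to simulate. The sketch below is a minimal illustration in plain numpy; the seed, sample size, and unit noise variances are my own choices, not values from the notes. It builds the shared random-walk efficient price, forms the two observed prices, and checks that each price wanders while the spread p1 - p2 = u1 - u2 stays bounded.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500

# Efficient price: random walk p_t = p_{t-1} + e_t
e = rng.normal(0.0, 1.0, T)
p = np.cumsum(e)

# Observed prices: efficient price plus market-specific noise
u1 = rng.normal(0.0, 1.0, T)
u2 = rng.normal(0.0, 1.0, T)
p1 = p + u1
p2 = p + u2

# The spread removes the common stochastic trend: p1 - p2 = u1 - u2
spread = p1 - p2

var_p1 = p1.var()          # dominated by the wandering random walk component
var_spread = spread.var()  # stays near Var(u1 - u2) = 2
print(var_p1, var_spread)
```

The sample variance of the spread hovers around 2 (the variance of u1 - u2) no matter how long the sample, while the sample variance of either price level grows with the sample because of the common stochastic trend.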
3 Useful representations for running cointegrated VARs

3.1 VAR(1) with two variables

The main ideas of the relation between a VAR and the error correction model (ECM) may be seen by looking at a VAR(1) model with two variables:

y_t = Φ y_{t-1} + ε_t    (2)

where y_t = (y_{1t}, y_{2t})' is a 2×1 vector and ε_t is a 2×1 vector of error terms. Subtracting y_{t-1} from the LHS and RHS of equation (2), equation (2) becomes

Δy_t = Π y_{t-1} + ε_t    (3)

where Π = Φ - I is a 2×2 matrix. Depending on the rank of the matrix Π,¹ there are three possible cases:

1. rank(Π) = 0. This happens only when all elements of Π are zeros, i.e. Π = 0. In this case Φ = I and model (3) becomes Δy_t = ε_t, which means that both variables y_{1t} and y_{2t} follow random walks and there exist two stochastic trends for the two variables.

2. rank(Π) = 2. This happens only if det(Π) ≠ 0, i.e. det(Φ - I) ≠ 0, which implies that the eigenvalues of Φ are less than 1.² Because the eigenvalues of Φ are less than one, both variables y_{1t} and y_{2t} are stationary.

3. rank(Π) = 1. In this case det(Π) = 0 and one eigenvalue of Φ is 1. This implies that both variables y_{1t} and y_{2t} have stochastic trends, but unlike the case rank(Π) = 0, the variables y_{1t} and y_{2t} share the same stochastic trend. The fact that Π has rank 1 means that the second column is a multiple of the first column:

Π = [ γ1  -θγ1 ; γ2  -θγ2 ] = γ α',   γ = (γ1, γ2)' (2×1),   α' = (1, -θ) (1×2),

where γ = (γ1, γ2)' is a 2×1 vector of coefficients and α = (1, -θ)' is a 2×1 cointegrating vector. In this case the ECM (3) may be written as

Δy_{1t} = γ1 (y_{1,t-1} - θ y_{2,t-1}) + ε_{1t}
Δy_{2t} = γ2 (y_{1,t-1} - θ y_{2,t-1}) + ε_{2t}

or, defining u_t = y_{1t} - θ y_{2t},

Δy_{1t} = γ1 u_{t-1} + ε_{1t}
Δy_{2t} = γ2 u_{t-1} + ε_{2t}

The term u_{t-1} is called the error correction term, y_{1,t-1} - θ y_{2,t-1} is called the cointegrating relation, and γ1 and γ2 are called the adjustment coefficients.

¹The rank of a matrix is the maximum number of independent rows (or the maximum number of independent columns). A square matrix A (n×n) is non-singular (det(A) ≠ 0) only if its rank is equal to n.
²Notice that the eigenvalues of the identity matrix I are ones.

3.2 Autoregressive representations

We start with the autoregressive representation of the levels of y_t, Φ(L)y_t = ε_t:

y_t = Φ1 y_{t-1} + Φ2 y_{t-2} + ... + Φp y_{t-p} + ε_t,

where Φ(L) = I - Φ1 L - ... - Φp L^p. Applying the BN decomposition Φ(L) = Φ(1)L + (1 - L)Φ*(L), we obtain

[Φ(1)L + (1 - L)Φ*(L)] y_t = ε_t
Δy_t = Π y_{t-1} + Σ_{j=1}^{p-1} Γ_j Δy_{t-j} + ε_t

where Π = -Φ(1) = -(I - Φ1 - ... - Φp) and the Γ_j are functions of Φ1, ..., Φp. The matrix Π controls the cointegration properties:

1. Π has full rank: any linear combination of y_t is stationary, so y_t is stationary in levels. In this case we run a normal VAR in levels.

2. Π has rank between 0 and full rank: some linear combinations of y_t are stationary. The VAR in levels is consistent but inefficient (if you know the cointegrating vector), and the VAR in differences is misspecified in this case.

3. Π has rank zero: no linear combination of y_t is stationary, and Δy_t is stationary with no cointegration. In this case we run a normal VAR in differences.

3.3 Error correction representation

If Π has less than full rank, we can express it as Π = γα'. If there are K cointegrating vectors, then the rank of Π is K, and γ and α each have K columns. The system can then be rewritten as

Δy_t = γ α' y_{t-1} + Σ_{j=1}^{p-1} Γ_j Δy_{t-j} + ε_t    (6)

where α' is a K×N matrix of cointegrating vectors. It is not easy to estimate this model when all the cointegrating vectors in α are unknown.

4 Representation of a cointegrated system

Let y_t be an N×1 first-difference-stationary vector time series. The elements of y_t are cointegrated if there is at least one vector α (a cointegrating vector) such that α'y_t is stationary in levels. Since the difference of y_t is stationary, it has a moving average representation

(1 - L) y_t = A(L) ε_t    (4)

Since stationarity of α'y_t is an extra restriction, it must imply a restriction on A(L).

4.1 Multivariate Beveridge-Nelson decomposition

The multivariate Beveridge-Nelson decomposition looks like the univariate BN decomposition:

y_t = c_t + z_t,   (1 - L) z_t = A(1) ε_t    (5)
c_t = A*(L) ε_t,   A*_j = - Σ_{k=j+1}^∞ A_k    (6)

The restriction on A(1) implied by cointegration: the elements of y_t are cointegrated with cointegrating vectors α_i iff α_i' A(1) = 0. This implies that the rank of A(1) equals the number of elements of y_t minus the number of cointegrating vectors α_i. There are three cases for A(1):

1. A(1) = 0: y_t is stationary in levels; all linear combinations of y_t are stationary in levels.
2. A(1) has less than full rank: (1 - L)y_t is stationary and some linear combinations α'y_t are stationary.
3. A(1) has full rank: (1 - L)y_t is stationary and no linear combinations of y_t are stationary.

4.2 Impulse response function

A(1) is the limiting impulse response of the levels of the vector y_t. To see how cointegration affects A(1), consider a simple case, α = (1, -1)'. The reduced rank of A(1) means α'A(1) = 0, i.e.

(1, -1) [ A(1)_{yy}  A(1)_{yw} ; A(1)_{wy}  A(1)_{ww} ] = 0.

Therefore A(1)_{yy} = A(1)_{wy} and A(1)_{yw} = A(1)_{ww}: each variable's long-run response to a shock must be the same.

5 Testing for cointegration

There are several ways to decide whether variables can be modeled as cointegrated:

1. Use expert knowledge and economic theory.
2. Graph the series and see whether they appear to have a common stochastic trend.
3. Perform statistical tests for cointegration. We will consider residual-based tests for cointegration and cointegration rank tests (tests of rank(Π)).

All three methods should be used in practice.

5.1 Testing for cointegration when the cointegrating vector is known

Sometimes a researcher may know the cointegrating vector based on economic theory. For example, the hypothesis of purchasing power parity implies that

P_t = S_t × P*_t,

where P_t is an index of the price level in the US, S_t is the exchange rate (Italian Lira), and P*_t is a price index for Italy. Taking logs, this equation can be written as

p_t = s_t + p*_t.

A weaker version of the hypothesis is that the variable v_t defined by

v_t = p_t - s_t - p*_t

is stationary, even though the individual elements p_t, s_t, p*_t are all I(1). In this case the cointegrating vector α = (1, -1, -1)' is known. Testing for cointegration in this case consists of several steps:

1. Verify that p_t, s_t, and p*_t are each individually I(1). This will be true if (a) you test for a unit root in the levels of these series and cannot reject the null hypothesis of a unit root (ADF or other unit root tests), and (b) you test for a unit root in the first differences of these series and reject the null hypothesis of a unit root.

2. Test whether the series v_t is stationary or not.

5.2 Testing for cointegration when the cointegrating vector is unknown

Consider an example in which two series y_t and x_t are cointegrated with cointegrating vector α = (1, -θ)', so that v_t = y_t - θx_t is stationary. However, the cointegrating coefficient θ is not known. The Engle-Granger Augmented Dickey-Fuller (EG-ADF) test for cointegration consists of the following steps:

1. Verify that y_t and x_t are each individually I(1).

2. Estimate the cointegrating coefficient θ by OLS estimation of the regression

y_t = μ + θ x_t + v_t    (7)

3. Use a Dickey-Fuller t-test (with intercept but no time trend) to test for a unit root in the residuals v̂_t from this regression. Since the residuals are estimated in the first step, different critical values must be used for the unit root test. Critical values for the EG-ADF statistic are given in Table 1, which is taken from Stock and Watson (2002).

Table 1: Critical values for the Engle-Granger ADF statistic

Number of X's in equation (7)   10%     5%      1%
1                               -3.12   -3.41   -3.96
2                               -3.52   -3.80   -4.36
3                               -3.84   -4.16   -4.73
4                               -4.20   -4.49   -5.07

If x_t and y_t are cointegrated, then the OLS estimator of the coefficient in regression (7) is superconsistent. However, the OLS estimator has a non-normal distribution, and inferences based on its t-statistic can be misleading. To avoid this problem, Stock and Watson (1993) developed the dynamic OLS (DOLS) estimator of θ, based on the following regression:

y_t = μ + θ x_t + Σ_{j=-p}^{p} δ_j Δx_{t-j} + u_t

If x_t and y_t are cointegrated, statistical inferences about θ and the δ_j based on HAC standard errors are valid. If x_t were strictly exogenous, then the coefficient on x_t, θ, would be the long-run cumulative multiplier, that is, the long-run effect on y of a change in x. Recall the long-run multiplier between oil and gasoline prices in the paper by Borenstein et al. (1997).

5.3 The Johansen trace test

The Johansen trace test is an LR test of the null hypothesis rank(Π) = K against the alternative rank(Π) ≥ K + 1:

1. Step 1: Test H0: K = 0 against H1: K ≥ 1. This is the test that there is no cointegration and that there are n stochastic trends. If H0 is not rejected, there is no cointegration.
If H0 is rejected, continue with Step 2.

2. Step 2: Test H0: K = 1 against H1: K ≥ 2. If H0 is not rejected, then there is a single cointegrating relation and there are n - 1 common trends. If H0 is rejected, continue with Step 3.

3. Iteratively test H0: rank(Π) = K against H1: rank(Π) ≥ K + 1. Continue until the first time that H0 is not rejected; the number of cointegrating relations is then K, and the number of common trends is N - K.

6 Running VARs

Consider a multivariate model consisting of two variables x_t and w_t, which are individually I(1). One may model these two variables using one of the following: (1) a VAR in levels; (2) a VAR in first differences; (3) an ECM representation.

With cointegration, a pure VAR in differences,

Δx_t = a(L) Δx_{t-1} + b(L) Δw_{t-1} + ε_t
Δw_t = c(L) Δx_{t-1} + d(L) Δw_{t-1} + v_t,

is misspecified: looking at the error correction form, there is a missing regressor, the error correction term α_x x_{t-1} + α_w w_{t-1}. This is a problem.

A pure VAR in levels is a little unconventional, since the variables in the model are nonstationary. The VAR in levels is not misspecified and the estimates are consistent, but the coefficients may have nonstandard distributions and they are not efficient: if there is cointegration, it imposes restrictions on the coefficients that are not imposed in a pure VAR in levels.

Cochrane (1994) suggests that one way to impose cointegration is to run an error correction VAR:

Δx_t = γ_x (α_x x_{t-1} + α_w w_{t-1}) + a(L) Δx_{t-1} + b(L) Δw_{t-1} + ε_t
Δw_t = γ_w (α_x x_{t-1} + α_w w_{t-1}) + c(L) Δx_{t-1} + d(L) Δw_{t-1} + v_t

This specification imposes that x_t and w_t are cointegrated with cointegrating vector α. It is very useful if you know that the variables are cointegrated and you know the cointegrating vector. Otherwise you have to pretest for cointegration and estimate the cointegrating vector in a separate step.

Another difficulty with the error correction form is that it does not fit nicely into standard VAR packages. A way to use standard packages is to estimate the companion form, in which the system is written in terms of Δx_t and the error correction term:

Δx_t = a(L) Δx_{t-1} + b(L)(α_x x_{t-1} + α_w w_{t-1}) + ε_t
α_x x_t + α_w w_t = c(L) Δx_{t-1} + d(L)(α_x x_{t-1} + α_w w_{t-1}) + v_t

We need to know the cointegrating vector to use this procedure. There is much debate as to which approach is best:

- When you do not really know whether there is cointegration or what the cointegrating vector is, the VAR in levels seems to be better.
- When you know that there is cointegration and what the cointegrating vector is, the error correction model or the VAR in companion form is better.

Figure 4: GNP and PDI.

Nonstationary Time Series Processes

Contents

1 Introduction
1.1 Motivation for trends
1.2 Detrending of a time series with a linear deterministic trend
1.3 Detrending of a time series with a stochastic trend
2 Random Walk Models
2.1 Statistical issues
2.2 Inappropriate detrending
2.3 Spurious (nonsense) regressions
3 Unit Root and Stationary Processes
3.1 Response to shocks in TS and DS models
3.2 Comparison of forecasts of TS and DS processes
3.3 Random walk components and stochastic trends
3.4 Forecast error variance
4 Trend Estimation and Forecasting
4.1 Forecasting a deterministic trend
4.2 Forecasting a stochastic trend
4.3 Forecasting of ARMA models with deterministic trends
4.4 Forecasting of ARIMA models
5 Testing for a Unit Root
5.1 The Dickey-Fuller and Augmented Dickey-Fuller tests
5.2 Avoiding the problems caused by stochastic trends
6 Structural Breaks
6.1 Testing for breaks
6.2 Zivot and Andrews (1992) testing procedure
6.3 Avoiding the problems caused by breaks

References

- Cochrane, J. H. (1997). Time series for macroeconomics and finance. Lecture Notes, Chapter 10. http://www.gsb.uchicago.edu/fac/john.cochrane/research/Papers/timeser1.pdf
- Hamilton, J. D. (1994). Time Series Analysis. Princeton University Press, Chapters 15, 17.
- Heij, C., P. de Boer, P. H. Franses, and H. K. van Dijk (2004). Econometric Methods with Applications in Business and Economics. Oxford University Press, Chapter 7.
- Hendry, D. F. and K. Juselius (2000). Explaining cointegration analysis: Part I. The Energy Journal, 21(1), 1-42.
- Stock, J. H. and M. W. Watson (2003). Introduction to Econometrics. Addison-Wesley, Chapter 12.

1 Introduction

In our analysis so far we have assumed that the variables in the models we have analyzed (univariate ARMA models, VAR models, SEM models) are stationary. This is a technical assumption that allows us to write any ARMA(p,q) model,

y_t = c + φ1 y_{t-1} + ... + φp y_{t-p} + ε_t + θ1 ε_{t-1} + ... + θq ε_{t-q},   φ(L) y_t = c + θ(L) ε_t,

where φ(L) and θ(L) are appropriately defined polynomials, as an MA(∞) process

y_t = μ + ψ(L) ε_t,

where y_t may be a scalar or a vector of n variables at time period t. If the time series is not covariance stationary, then the MA(∞) representation is not possible and the parameters (μ, ψ1, ψ2, ...) become meaningless. This is bad: it means that we cannot use the tools discussed in class to construct point and interval forecasts, construct impulse response functions, or conduct statistical tests about parameter restrictions.

A glance at graphs of most economic time series suffices to reveal the invalidity of the covariance stationarity assumption: economies evolve, grow, and change over time in both real and nominal terms. Sometimes changes in a time series are smooth and hard to detect; sometimes they are abrupt and can be noticed by examining time plots or scatter plots. Moreover, if you examine the accuracy of economic forecasts, you may notice that they are very wrong quite often, although that should occur relatively infrequently in a stationary process. The practical problem of an econometrician is to find relationships among variables y_t and x_t that survive for a relatively long period of time, so that they can be used for forecasting and policy analysis.

Four questions immediately arise with respect to stationarity or nonstationarity of a time series:¹

1. How important is the assumption of stationarity for modeling and inference? "Very." When data means and variances are nonconstant, observations come from different distributions over time, creating difficult problems for empirical modeling. It implies that relationships among variables y_t and x_t are changing over time, and we need to model how these relationships change.

2. What is the effect of incorrectly assuming stationarity? "Potentially hazardous." Assuming constant means and variances when that is false can induce serious statistical mistakes. If the variables in y_t are not stationary, then conventional hypothesis tests, confidence intervals, and forecasts can be very unreliable and misleading. Standard asymptotic distribution theory often does not apply to regressions involving variables with unit roots, and inference may be misleading if this is ignored. For example, conventional t-tests and F-tests are no longer valid.

3. What are the sources of nonstationarity? "Many and varied." There are many possible reasons for nonstationarity of a time series: the evolution of the economy, legislative changes, technological changes, political events, etc.

4. Can empirical analysis be transformed so that stationarity becomes a valid assumption? "Sometimes, depending on the source of nonstationarity." Nonstationarity can sometimes be removed by transformations such as linear detrending or differencing.

¹These questions and answers are presented by Hendry and Juselius (2000).

1.1 Motivation for trends

A nonstationary process is one which violates the stationarity requirement, so its means and variances are nonconstant over time. A trend is a persistent long-term movement of a variable over time; a time series fluctuates around its trend.

1. Deterministic trends. A deterministic trend is a nonrandom function of the time variable t. For example, a deterministic trend might be linear in time, or linear and quadratic in time. A model with a linear trend has the form

y_t = α + βt + ψ(L) ε_t,

while a model with linear and quadratic trends has the form

y_t = α + β1 t + β2 t² + ψ(L) ε_t.

You may notice that the mean of y_t varies with time because it depends on the value of the time trend t (or t²). Also notice that changes in deterministic trends are nonrandom and known: a change in a linear trend is Δt = t - (t-1) = 1, and a change in a quadratic trend is Δt² = t² - (t-1)² = 2t - 1. While a change in a linear trend is constant and a change in a quadratic trend increases with time, both changes are known with certainty.

2. Stochastic trends. A stochastic trend is random and varies over time. An example of a model with a stochastic trend is

y_t = δ + y_{t-1} + ψ(L) ε_t,   (1 - L) y_t = δ + ψ(L) ε_t.

What is the stochastic trend in this model? The trend is the lagged value y_{t-1}, and a change in the trend is the change Δy_{t-1}. You are not used to thinking of a lagged value as a trend, but there is nothing that prevents it from being one. The difference between a stochastic trend and a linear or quadratic deterministic trend is that the changes of a stochastic trend over time are random, whereas those of a deterministic trend are known with certainty.

3. Permanence of shocks. Macroeconomists used to detrend data and regarded business cycles as the stationary deviations from that trend. However, some economists have investigated whether GNP is better described as a random walk model with a stochastic trend or as a process with a deterministic trend.

Statistical issues. In practice, we need to remove nonstationarity before estimating a model for a time series. How we remove nonstationarity depends on whether a variable has a deterministic trend or a stochastic trend. A difficulty in empirical work is that it is not easy to distinguish stochastic trends from deterministic trends.

1.2 Detrending of a time series with a linear deterministic trend

Consider a model with a deterministic trend,

y_t = α + βt + ψ(L) ε_t,

and assume that you know the true parameters α, β. The variable y_t is nonstationary because it has a deterministic trend. A deterministic trend may be removed by subtracting it from y_t: let

ỹ_t = y_t - βt,

which implies that the model for ỹ_t has the form

ỹ_t = α + ψ(L) ε_t.

The variable ỹ_t is called a detrended variable because it does not have a deterministic trend.

1.3 Detrending of a time series with a stochastic trend

Consider a model with a stochastic trend,

y_t = δ + y_{t-1} + ψ(L) ε_t,

and assume that you know the true parameter δ. The variable y_t is nonstationary because it has a stochastic trend. A stochastic trend may be removed by differencing: let

ỹ_t = Δy_t = y_t - y_{t-1},

which implies that the model for ỹ_t has the form

ỹ_t = δ + ψ(L) ε_t.

The variable ỹ_t (the first difference) is also called a detrended variable because it does not have a stochastic trend.

2 Random Walk Models

A basic random walk model is

y_t = y_{t-1} + e_t,   E_{t-1} e_t = 0,

where E e_t² = σ² and E_{t-1} e_t denotes the expected value of the error term e_t given all available information at period t-1. A random walk model is a model with a stochastic trend. It implies that E_t y_{t+1} = y_t, i.e. the best forecast of the value y_{t+1} given all available information at period t is the current value y_t. Random walks have a number of interesting properties:

1. The impulse response function (IRF) of a random walk is one at all horizons, while the IRF of a stationary process eventually dies out (converges to zero). An economic interpretation of this fact is that shocks to a random walk time series have a permanent effect.

2. The forecast error variance of a random walk grows linearly with the forecast horizon:

Var(y_{t+h} | y_t) = E[(y_{t+h} - y_t)²] = h σ².

As the forecast horizon h increases without bound, the forecast error variance increases without bound. This implies that interval forecasts become very wide.

3. The autocovariances γ_j = cov(y_t, y_{t-j}) of a random walk (and hence the autocorrelations as well) are not defined. Recall that autocovariances and autocorrelations are used to measure the dependence between y_t and its lagged values. You cannot use autocorrelations to measure the dependence between y_t and y_{t-j} for a random walk process.

2.1 Statistical issues

Suppose a series is generated by a random walk

y_t = y_{t-1} + e_t,

where e_t ~ iid(0, σ²). You might want to test for a random walk by running the regression

y_t = μ + φ y_{t-1} + e_t

by OLS and testing whether φ = 1 and μ = 0, using individual t-statistics or an F-test. Whether μ = 0 is not really important, but we do want to test whether φ = 1. It turns out that using individual t-statistics or an F-test is not correct, because the assumptions underlying the usual asymptotic theory for OLS estimates and test statistics are violated.

2.2 Inappropriate detrending

Things get even more complicated with a trend in the model. Suppose the true model is a random walk without a deterministic trend,

y_t = μ + y_{t-1} + e_t,

but you assume that y_t has a deterministic trend instead of a stochastic trend. Because you assume a deterministic trend, you remove a linear trend from the model by computing ŷ_t = y_t - b̂t, where b̂ is the estimated trend parameter. Next you fit an AR(1) model to the transformed series, ŷ_t = φ ŷ_{t-1} + e_t, i.e. you fit the model

(1 - φL)(y_t - b̂t) = e_t    (1)

But model (1) can be written as follows:

(1 - φL) y_t = (1 - φL) b̂t + e_t
y_t = b̂t - φb̂(t-1) + φ y_{t-1} + e_t
y_t = φb̂ + b̂(1 - φ)t + φ y_{t-1} + e_t
y_t = a + γt + φ y_{t-1} + e_t,

where a = φb̂ and γ = b̂(1 - φ). Therefore, removing a linear trend first and then estimating an AR model is equivalent to running y_t directly on a time trend and lagged y_t. In this case φ̂ is biased downward and the standard OLS errors are misleading.

2.3 Spurious (nonsense) regressions

Suppose there are two series x_t and y_t that are generated by independent random walks:

y_t = y_{t-1} + e_t,   x_t = x_{t-1} + v_t,   E[e_t v_s] = 0 for all t, s.    (2)

Because the error terms e_t and v_t are uncorrelated at all leads and lags, x_t has no predictive power in a linear model for y_t.
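A small Monte Carlo makes the spurious regression point concrete. The sketch below is illustrative only; the sample size, number of replications, seed, and the 1.96 cutoff are my own choices. It regresses one independent random walk on another and records how often the conventional t-test on β "rejects" at the nominal 5% level; the rejection rate comes out far above 5%.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n_sims = 100, 500
rejections = 0

for _ in range(n_sims):
    # Two independent random walks: no true relation between y and x
    y = np.cumsum(rng.normal(size=T))
    x = np.cumsum(rng.normal(size=T))

    # OLS of y on a constant and x
    X = np.column_stack([np.ones(T), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (T - 2)
    cov = s2 * np.linalg.inv(X.T @ X)
    t_beta = beta[1] / np.sqrt(cov[1, 1])

    # Conventional (but invalid here) 5% two-sided t-test
    if abs(t_beta) > 1.96:
        rejections += 1

reject_rate = rejections / n_sims
print(reject_rate)  # far above the nominal 0.05
```

With T = 100 the "significance" rate is typically well above one half, which is exactly the nonsense-regression phenomenon: the usual t-statistic formulas presume stationary regressors.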
are uncorrelated it means that I does not have any predictive power in a linear model for y However suppose that you run y on I by OLS y a 51 u Because the assumption of stationarity for classical regression is violated we tend to see 77signi cant77 6 more often than OLS formulas say we should In practice it implies that we may conclude that I has a signi cant effect on y more often than our selected signi cance level 3 Unit root and stationary process A more general process than a pure random walk may have the following form 9 M M71 e a1671 aqet7q 1 My M aL z 3 where aL 1 a1L aqL2 is a lag polynomial The process in 3 is called a unit root or a difference stationary DS processes In the simplest case aL 1 the DS process becomes a random walk with drift 9 Myz71 z Alternatively we may consider a process to be stationary around a linear trend yr W bade 4 This process is called trend stationary TS process A TS process can be considered as a special case of a DS model If 12L contains a unit root we can Write this polynomial as bL 17 LaL Where aL is an appropriately de ned lag polynomiali We can Write a trend stationary model 4 as a DS model y MtbLet lt17 my 7 lt17 Lgt111lt17 LgtbltLgteE 1Lyz MaL z Therefore if the true process for a timeseries is a trendstationary process model the DS model is still valid and stationary One can think about unit roots as the study of the implications for levels of a process that is stationary in differences Therefore it is very important to keep track of Whether you are thinking about the level of the process y or its rst difference 31 Response to Shocks in TS and DS Models The impulse response function of a TS process is determined by a MA polynomial 12L iiei bj is the jth period ahead response by aytj J 86 For a DS model aj gives the response of the difference 1 7 Lytj to a shock at time t 9ampij Be 0391 How do we compute the response of the level yt v to a shock in e for a differencestationary process Notice that yt v may be presented 
as the sum of y and differences from period t l to t j 92H y yz1 9 92472 9471 92H 9147171 The response of yt v is the sum of the response of the differences for all periods from t to t j The response of yt v to a shock at period t E v 6 fly yz1 9 9472 92H 9471 ytj71l Bet Bet Byz 39 691 971 6yz1 9 6yzj yzj71 Bet Bet Bet EL a0a1maj a 7 Where th gt 1 gig at 32 Comparison of forecasts of TS and DS processes The h periodahead forecast of a trendstationary process 4 process is as follows 2444 1415 h like bh1 z71 bh25172 5 The hperiodahead forecast of a difference stationary process 3 can be computed as follows 1444 A wh z A th71lz A t1lt y M like bh1 z71 bh25172 M bhilet bhetil bh15172 1 M 516 126271 1136272 y 1444 h y 1m bhil b1et bh1bhmb2 z71m 6 To see the difference between forecasts for TS and DS processes consider a case in which 121 b2 0 Then the h period ahead forecast for trend stationary and difference stationary process are T51 1444 t h DS Hh t Mh y Next compare the forecast error for the TS and DS processes For a TS process the forecast error is yth thz 1475 h th b1 th71 b2 th72 bh71 t1 like bh1 z71 7016 h bhet bh1 til rh b1 Lh71 bh71 z1 The MSE of the forecast for a TS process is a Eyzs Qwslzy 1 5 b 52102 For a DS process the forecast error becomes yzh 1444 A9144 Ayz1 9 A whlz A t1lt 9 th 1 b1 th71 1 111 b2 th72 1 111 112 bh71 t1 The MSE of the forecast for a DS process is 62 Eyzh 7 mm 1 1 b12 1 b1 b22 1 b1 b2 bh712l0392 For a TS process as the forecast horizon h increases the added uncertainty from forecasting further into future becomes negligible limhaooEyzh 9144402 1 I 11 1902 and the limiting MSE is just the unconditional variance of the stationary component 12L This is not true for the DS process The MSE for a DS process does not converge to any xed value as h goes to in nity To summarize for a TS process the MSE reaches a nite bound as the forecast horizon becomes large whereas for a unit root process the MSE eventually grows linearly with the forecast 
horizon 33 Random walk components and stochastic trends A useful fact Every DS process can be written as a sum of a random walk process and a stationary component A decomposition with a nice properties is the Beveridge Nelson decomposition lf 1 7 Lyt a aLet then we can write 9 Ct 2 where 2 M 271 a161 00 ct aquot Let a 7 Z ak kj1 To see that this decomposition is true we need to notice that any lag polynomial aL can be written as ad a11 LaL a Z ak kj1 To see this just write it out a a0 a1 a2 a3 l 7 La L 7a1 7a2 7a3 a1L a2L a3L 7 7a2L 7a3L and the term aL remain when you cancel all the terms ere are many ways to decompose a unit root into stationary and random walk components The BN decomposition is a popular choice because it has a special property the random walk component is a sensible de nition of the 77trend77 in y The component 2 is the limiting forecast of future y La today7s y plus all future expected changes in y In the BN decomposition the innovations to the stationary and random walk components are perfectly correlated Consider an arbitrary combination of stationary and random walk components 9 Z Ct 2 M 271 U C bLet 8 It can be shown that in every decomposition of y into stationary and random walk components the variance of changes to the random walk component is the same al203 34 Forecast error variance Since the unit root process is composed of a stationary plus random walk component the unit root process has the same variance of forecasts behavior as the random walk when the horizon is long enough 4 Trend estimation and forecasting 41 Forecasting a deterministic trend Consider the liner deterministic model y a tet t 12T The h step ahead forecast is given by 1444 51 805 h where d and 3 are OLS estimates of the parameters a and The forecast variance may be computed using the following formula 1 t h 7 Elyzh tht 02 1 1 2 w 02 t Em1m V where the last approximation is valid ift the period at which forecast is constructed is large relative to the forecast 
horizon h 42 Forecasting a stochastic trend Consider the random walk with drift y ayt1et t 23T Let 3 be an estimate of 1 obtained from the following regression model Ayn a 5 The h step ahead forecast is given by 91 y 34h The forecast variance may be computed using the following formula A h2 Elyzh 7 yum 02 h if 1 m h02 where the last approximation is valid if t is large relative to h 43 Forecasting of ARMA models With deterministic trends The basic models for deterministic and stochastic trend ignore possible shortrun uctuations in the series Consider the following ARMA model with deterministic tren Lyz a B75 9L z where the polynomial satis es the stationarity condition and 9L satis es the invertibility condi tion The forecast is constructed as follows 1 Linear detrending Estimate the following regression model yz5152t5z and compute z y 7 81 7 Sgt 2 Estimate an appropriate ARMApq model for covariance stationary variable 2 Lzt 9Let 3 Use estimated 131 qu and 31 M q to construct h periodahead forecasts of 21 214W 4 Given 431 42 and 31 will compute MA representation parameters 12 17izlg77ll h and compute the MSE of 214W 0 at 1 w w 21002 5 Construct the h periodahead forecast of y as follows 1444 3444 81 82 t h 6 Approximate the MSE of Hh t by the MSE of 214W 44 Forecasting of ARIMA models Consider the forecasting of time series that are integrated of order 1 that are described by the ARlMAplq model L1 My a 9L6z where the polynomial satis es the stationarity condition and 9L satis es the invertibility condi tion The forecast is constructed as follows 1 Remove stochastic trend by computing the rst difference of yt ie 2 Ayn 2 Estimate an appropriate ARMApq model for a covariance stationary variable 2 Lzt 9Let 3 Use estimated 131 qu and 31 M q to construct h periodahead forecasts of 21 214W 4 Given 431 42 and 31 will compute MA representation parameters 12 17izlg77ll h and compute the MSE of 214W 0 at 1 w w 21002 5 Construct the h periodahead forecast of y as follows h 
1444 y 221444 j1 6 Construct MSE of the forecast Hh t 6i 1 1 1 1 1131 22 1 1131 2 lama Example Figure 1 presents the time seris of the level and rst di erence of the retail gasoline price Figure 2 presents constructed forecasts for both variables 11 Retail Gasoline Price Level I I centsgal I I I I I I 100 200 300 400 500 600 Retail Gasoline Price Difference I I oNAm centsgal I I I I I I 100 200 300 400 500 600 Figure l The retail gasoline price and the rst difference of the reatil gasoline price The test for unit root does not reject the null hypothesis that the level of retail gasoline price has a unit root which implies that the gasoline price is nonestationary The test of unit root for the rst difference of the retail gasoline price rejects the null of unit root implying that the rst difference is a stationary process 5 Testing for a unit root Although it might be interesting to know Whether a time series has a unit root several paper have argued that the question can not be answered on the basis of a nite sample of observations Nevertheless you Will have to conduct test of unit root in doing empirical projects it can be done using informal or informal methods The informal methods involve inspecting a time series plot of the data and computing the autocorrelation coef cients If a series has a stochastic trend the rst autocorrelation coef cient Will be near one A small rst autocorrelation coef cient combined With a time series plot that has no apparent trend suggest that the series does not have a trend DickeyFuller DF test is a most popular formal statistical procedure for unit root testingi 51 The DickeyFuller and Augmented DickeyFuller Test The starting point for the DF test is the autoregressive model of order one AR1 y a P9171 6 9 If p l y is nonstationary and contains a stochastic trendi Therefore Within the AR1 model the hypothesis that y has a trend can be tested by testing Hozp1 vsiH11pltl This test is most easily implemented by estimating a modi ed 
version of (9): subtract y_{t-1} from both sides and let δ = ρ - 1. Then model (9) becomes

Δy_t = a + δ·y_{t-1} + ε_t,   (10)

and we test the hypothesis

H_0: δ = 0   vs.   H_1: δ < 0.

The OLS t-statistic in (10) testing δ = 0 is known as the Dickey-Fuller statistic.

Figure 2: The forecasts of the first difference and level of the retail gasoline price for forecast horizons h = 1, 2, ..., 36.

Notice that the confidence interval for the level forecast (a non-stationary process) keeps widening as the forecast horizon h increases. The confidence interval for the first difference of the retail gasoline price (a stationary process) widens for small values of h, but it does not widen further for large values of h.

The extension of the DF test to the AR(p) model, known as the augmented Dickey-Fuller (ADF) test, is a test of the null hypothesis H_0: δ = 0 against the one-sided alternative H_1: δ < 0 in the following regression:

Δy_t = a + δ·y_{t-1} + γ_1·Δy_{t-1} + ... + γ_p·Δy_{t-p} + ε_t.   (11)

Under the null hypothesis, y_t has a stochastic trend; under the alternative hypothesis, y_t is stationary. If instead the alternative hypothesis is that y_t is stationary around a deterministic linear time trend, then this trend must be added as an additional regressor in model (11), and the DF regression becomes

Δy_t = a + μ·t + δ·y_{t-1} + γ_1·Δy_{t-1} + ... + γ_p·Δy_{t-p} + ε_t.   (12)

The ADF statistic is the OLS t-statistic testing δ = 0 in equation (12). The ADF statistic does not have a normal distribution, even in large samples. Critical values for the one-sided ADF test depend on whether the test is based on equation (11) or (12); they are given in Table 1.

Table 1: Large-sample critical values for the ADF statistic

Deterministic regressors      10%      5%      1%
Intercept only               -2.57   -2.86   -3.43
Intercept and time trend
-3.12   -3.41   -3.96

Table 2: Summary of DF tests for unit roots in the absence of serial correlation

Case 1. Estimated regression: y_t = ρ·y_{t-1} + u_t. True process: y_t = y_{t-1} + u_t, u_t ~ N(0, σ²).
  T·(ρ̂_T - 1) has the distribution described under the heading "Case 1" in Table B.5;
  the t-statistic (ρ̂_T - 1)/σ̂_{ρ̂_T} has the distribution described under Case 1 in Table B.6.

Case 2. Estimated regression: y_t = a + ρ·y_{t-1} + u_t. True process: y_t = y_{t-1} + u_t, u_t ~ N(0, σ²).
  T·(ρ̂_T - 1) has the distribution described under Case 2 in Table B.5;
  the t-statistic has the distribution described under Case 2 in Table B.6;
  the OLS F-test of the joint hypothesis that a = 0 and ρ = 1 has the distribution described under Case 2 in Table B.7.

Case 3. Estimated regression: y_t = a + ρ·y_{t-1} + u_t. True process: y_t = a + y_{t-1} + u_t with a ≠ 0, u_t ~ N(0, σ²).
  In this case the OLS estimates are asymptotically Gaussian, so the t-statistic for ρ = 1 can be compared with standard normal critical values.

Case 4. Estimated regression: y_t = a + ρ·y_{t-1} + δ·t + u_t. True process: y_t = a + y_{t-1} + u_t (a arbitrary), u_t ~ N(0, σ²).
  T·(ρ̂_T - 1) has the distribution described under Case 4 in Table B.5;
  the t-statistic has the distribution described under Case 4 in Table B.6;
  the OLS F-test of the joint hypothesis that ρ = 1 and δ = 0 has the distribution described under Case 4 in Table B.7.

Table 17.1 of Hamilton (p. 502) presents a summary of DF tests for unit roots in the absence of serial correlation, for testing the null hypothesis of a unit root against several different alternative hypotheses. It is very important for you to understand what your alternative hypothesis is when conducting unit root tests. I reproduce this table here, but you need to check Hamilton's book for the critical values of the DF statistic for the different cases; the critical values are presented in the Appendix of the book.

5.2 Avoiding the problems caused by stochastic trends

The most reliable way to handle a trend in a series is to transform the series so that it does not have the trend. If the series has a stochastic trend (unit root), then the first difference of the series does not have a trend. In practice, you can rarely be sure whether a series has a stochastic trend or not. Recall that a failure to reject the null hypothesis does not necessarily mean that the null hypothesis is true; it simply means that
there is not enough evidence to conclude that it is false. Therefore, failure to reject the null hypothesis of a unit root using the ADF test does not mean that the series actually has a unit root. Having said that, even though failure to reject the null hypothesis of a unit root does not mean the series has a unit root, it can still be reasonable to approximate the true autoregressive root as equaling one and to use the first difference of the series rather than its level.

6 Structural breaks

Another type of nonstationarity arises when the population regression function changes over the sample period. This may occur because of changes in economic policy, changes in the structure of the economy or an industry, or events that change the dynamics of specific industries or firm-related quantities (inventories, sales, production, etc.). If such changes, called breaks, occur, then regression models that neglect those changes lead to misleading inference or forecasting.

Breaks may result from a discrete change (or changes) in the population regression coefficients at distinct dates, or from a gradual evolution of the coefficients over a longer period of time. Discrete breaks may be a result of major changes in economic policy or in the economy (oil shocks), while "gradual" breaks (population parameters that evolve slowly over time) may be a result of slow evolution of economic policy. If a break occurs in the population parameters during the sample, then the OLS regression estimates over the full sample will estimate a relationship that holds only "on average".

6.1 Testing for Breaks

Tests for breaks in the regression parameters depend on whether the break date is known or not. If the date of the hypothesized break in the coefficients is known, then the null hypothesis of no break can be tested using a dummy variable. Consider the following model:

y_t = β_0 + β_1·y_{t-1} + δ_1·x_{t-1} + γ_0·D_t(τ) + γ_1·D_t(τ)×y_{t-1} + γ_2·D_t(τ)×x_{t-1} + u_t,

where τ denotes the hypothesized break date and D_t(τ) is a binary variable that equals zero before the break date and
one after, i.e., D_t(τ) = 0 if t ≤ τ and D_t(τ) = 1 if t > τ. Under the null hypothesis of no break, γ_0 = γ_1 = γ_2 = 0, and the hypothesis of a break can be tested using the F-statistic. This is called a Chow test for a break at a known break date. If there are more variables or more lags, this test can be extended by constructing binary-variable interaction terms for all the regressors. This approach can also be modified to check for a break in a subset of the coefficients.

The break date is unknown in most applications, but you may suspect that a break occurred sometime between two dates, τ_0 and τ_1. The Chow test can be modified to handle this by testing for a break at all possible dates τ between τ_0 and τ_1 and then using the largest of the resulting F-statistics to test for a break at an unknown date. This modified test is often called the Quandt likelihood ratio (QLR) statistic, or the sup-Wald statistic:

QLR = max[F(τ_0), F(τ_0 + 1), ..., F(τ_1)].

Since the QLR statistic is the largest of many F-statistics, its distribution is not the same as that of an individual F-statistic. The critical values for the QLR statistic must be obtained from a special distribution. This distribution depends on the number of restrictions being tested, m, and on τ_0/T and τ_1/T, the endpoints of the subsample over which the F-statistics are computed, expressed as fractions of the total sample size.

For the large-sample approximation to the distribution of the QLR statistic to be a good one, the subsample endpoints τ_0 and τ_1 cannot be too close to the ends of the sample. That is why the QLR statistic is computed over a "trimmed" subset of the sample. A popular choice is to use 15% trimming, that is, to set τ_0 = 0.15·T and τ_1 = 0.85·T. With 15% trimming, the F-statistic is computed for break dates in the central 70% of the sample. Table 3 presents the critical values for the QLR statistic computed with 15% trimming. This table is from Stock and Watson (2003), and you should check the book for the complete table. The QLR test can detect a single break, multiple discrete breaks, and a slow evolution
of the regression parameters. If there is a distinct break in the regression function, the date at which the largest Chow statistic occurs is an estimator of the break date.

Table 3: Critical values of the QLR statistic with 15% trimming

Number of restrictions (m)    10%     5%      1%
 1                            7.12    8.68   12.16
 2                            5.00    5.86    7.78
 3                            4.09    4.71    6.02
 4                            3.59    4.09    5.12
 5                            3.26    3.66    4.53
 6                            3.02    3.37    4.12
 7                            2.84    3.15    3.82
 8                            2.69    2.98    3.57
 9                            2.58    2.84    3.38
10                            2.48    2.71    3.23

6.2 Zivot and Andrews (1992) testing procedure

Sometimes you may suspect that a series either has a unit root or is a trend-stationary process with a structural break at some unknown period of time, and you would want to test the null hypothesis of a unit root against the alternative of a trend-stationary process with a structural break. This is exactly the hypothesis tested by the Zivot-Andrews test. In this testing procedure, the null hypothesis is a unit root process without any structural breaks, and the alternative hypothesis is a trend-stationary process with a possible structural change occurring at an unknown point in time. Zivot and Andrews suggest estimating the following regression:

x_t = μ + θ·DU_t(τ) + β·t + γ·DT_t(τ) + α·x_{t-1} + Σ_{i=1}^{k} c_i·Δx_{t-i} + ε_t,   (13)

where τ = T_B/T is the break fraction, DU_t(τ) = 1 if t > T_B and 0 otherwise, DT_t(τ) = t - T_B if t > T_B and 0 otherwise, and x_t is the time series of interest. This regression allows both the slope and the intercept to change at date T_B. Note that for t ≤ T_B, model (13) becomes

x_t = μ + β·t + α·x_{t-1} + Σ_{i=1}^{k} c_i·Δx_{t-i} + ε_t,

while for t > T_B, model (13) becomes

x_t = (μ + θ) + β·t + γ·(t - T_B) + α·x_{t-1} + Σ_{i=1}^{k} c_i·Δx_{t-i} + ε_t.

Model (13) is estimated by OLS with the break point ranging over the sample, and the t-statistic for testing α = 1 is computed. The minimum t-statistic is reported. The 1%, 5%, and 10% critical values are -5.34, -4.80, and -4.58, respectively. The appropriate number of lags in differences, k, is estimated for each value of τ.

6.3 Avoiding the problems caused by breaks

The appropriate way to adjust for a break in the population parameters depends on the source of the break. If a distinct
break occurs at a specific date, this break will be detected with high probability by the QLR statistic, and the break date can be estimated. The regression model can then be estimated using a break dummy variable. If there is a distinct break, then inference on the regression coefficients can proceed as usual, using t-statistics for hypothesis testing. Forecasts can be produced using the estimated regression model that applies to the end of the sample. The problem is more difficult if the break is not distinct and the parameters evolve slowly over time; in this case, state-space modelling is required.
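The break-scan logic behind the QLR (sup-Wald) statistic of Section 6.1 can be sketched in code. This is a minimal sketch, assuming the simplest possible specification, a pure shift in the mean (so m = 1 restriction); the function name, the simulated data, and the use of NumPy are my choices, not part of the notes.

```python
import numpy as np

def qlr_statistic(y, trim=0.15):
    """Sup-Wald (QLR) scan for a single mean shift at an unknown date.

    For every candidate break date tau in the trimmed central portion of
    the sample, compute the Chow F-statistic for H0: no shift in the mean
    (m = 1 restriction), and return the largest F together with the date
    at which it occurs.  This is a simplified mean-shift illustration;
    the notes' version allows breaks in all regression coefficients."""
    T = len(y)
    lo, hi = int(trim * T), int((1 - trim) * T)   # 15% trimming by default
    ssr_r = np.sum((y - y.mean()) ** 2)           # restricted SSR (no break)
    best_F, best_tau = -np.inf, None
    for tau in range(lo, hi):
        y1, y2 = y[:tau], y[tau:]
        ssr_u = np.sum((y1 - y1.mean()) ** 2) + np.sum((y2 - y2.mean()) ** 2)
        F = (ssr_r - ssr_u) / (ssr_u / (T - 2))   # Chow F: 1 restriction, 2 params
        if F > best_F:
            best_F, best_tau = F, tau
    return best_F, best_tau

# Simulated series with a clear mean shift at t = 120 out of T = 200.
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0.0, 1.0, 120), rng.normal(2.0, 1.0, 80)])

qlr, tau_hat = qlr_statistic(y)
```

Comparing `qlr` with the m = 1 row of Table 3 (12.16 at the 1% level) rejects the no-break null for this simulated series, and `tau_hat` lands near the true break date, in line with the remark above that the date of the largest Chow statistic is an estimator of the break date.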