Class notes for CSE 6740 (Computational Data Analysis) at Georgia Institute of Technology - Main Campus, taught by Alexander Gray, Fall 2015. Uploaded by Dell Spencer IV on November 2, 2015.
CSE 6740 Lecture 19: How Do I Treat Temporal Data? Time Series Analysis
Alexander Gray, agray@cc.gatech.edu
Georgia Institute of Technology

Outline:
- Time Series
- Univariate Linear Methods
- Extensions

Time Series

General concepts in time series analysis. A time series is a sequence of observations ordered in time, in which each value may depend on previous ones.

Parts of a Time Series

Some sources of variation:
- Seasonality. For example, unemployment is typically high in the winter and lower in summer.
- Trend. A long-term change in the mean level. Example of a trend:
    X_t = m_t + ε_t,  (1)
  where
    m_t = β_0 + β_1 t  (2)
  and ε_t is zero-mean noise.

Seasonality can be modeled in different ways, including:
  X_t = m_t + s_t + ε_t  (additive)  (3)
  X_t = m_t s_t + ε_t  (multiplicative)  (4)
  X_t = m_t s_t ε_t  (multiplicative)  (5)
Note that the third model can be made into the first one with a logarithmic transformation.

Stationarity

A time series is said to be stationary, roughly speaking, if there is no systematic change in mean (no trend) or variance, and if strictly periodic variations have been removed. In other words, the properties of one section of the data look much like those of any other section. (Strictly speaking, only models are stationary, not data.)

Differencing

Differencing is a pre-processing operation which is effective for removing a trend. It is performed on the time series X_1, ..., X_N to obtain a new time series y_2, ..., y_N by
  y_t = X_t - X_{t-1} = ∇X_t.  (6)
Occasionally second-order differencing is required:
  ∇²X_t = ∇X_t - ∇X_{t-1} = X_t - 2X_{t-1} + X_{t-2}.  (7)
A seasonal effect can be removed with seasonal differencing, e.g.
  ∇_{12} X_t = X_t - X_{t-12}.  (8)

Stochastic Process

A stochastic process (or random process) is a sequence of random variables X_1, ..., X_t, ..., X_N ordered in time. The mean function is defined by
  μ(t) = E[X_t].  (9)
The variance function is defined by
  σ²(t) = Var(X_t).  (10)
The autocovariance function is defined by
  γ(t_1, t_2) = E[(X_{t_1} - μ(t_1))(X_{t_2} - μ(t_2))],  (11)
which generalizes the variance function when t_1 = t_2.

Stationary Process

A time series is said to be strictly stationary if the joint distribution of X_{t_1}, ..., X_{t_N} is the same as the joint distribution of X_{t_1+τ}, ..., X_{t_N+τ} for all t_1, ..., t_N and τ.
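The differencing operations in equations (6)-(8) can be sketched directly with NumPy. This is a minimal illustration on a synthetic series (the trend slope, seasonal period, and noise level below are made-up assumptions, not from the notes):

```python
import numpy as np

# Synthetic series: linear trend (slope 0.5) + period-4 seasonal pattern + noise
rng = np.random.default_rng(0)
t = np.arange(200)
x = 0.5 * t + 10 * np.sin(2 * np.pi * t / 4) + rng.normal(0, 1, size=t.size)

# First-order differencing, eq. (6): y_t = x_t - x_{t-1}
y = np.diff(x)

# Seasonal differencing with period 4 removes the seasonal component exactly;
# the linear trend becomes the constant 4 * 0.5 = 2 plus differenced noise
z = x[4:] - x[:-4]
print(np.mean(z))  # close to 2, the differenced trend

# A further first difference, as in eq. (7), leaves approximately stationary noise
w = np.diff(z)
```

Note that each round of differencing shortens the series (here by 1 and by the seasonal period, respectively), matching the index ranges in equation (6).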
In other words, shifting the time origin by an amount τ has no effect on the joint distributions, which thus must depend only on the intervals between t_1, ..., t_N.

Strict stationarity, using the case N = 1, also implies that the distribution of X_t is the same for all t, so μ(t) = μ and σ²(t) = σ². Using the case N = 2, it also implies that γ(t_1, t_2) depends only on the lag τ, i.e. γ(t_1, t_2) = γ(τ). If E[X_t] = μ and Cov(X_t, X_{t+τ}) = γ(τ), we say the process is second-order stationary (or weakly stationary).

Univariate Linear Methods

Some linear models for a single time series.

Purely Random Process

A process is called a purely random process (or white noise) if it consists of a sequence of random variables Z_t which are IID. Suppose they are normally distributed with mean zero and variance σ_Z². Then
  Cov(X_t, X_{t+τ}) = γ(τ) = σ_Z² for τ = 0, and 0 otherwise.

Random Walk

Suppose that Z_t is a process with mean μ and variance σ_Z². A process X_t is called a random walk if
  X_t = X_{t-1} + Z_t.  (12)
Starting the process with X_1 = Z_1, we have E[X_t] = tμ and Var(X_t) = tσ_Z². Since the mean and variance change with t, the process is non-stationary. However, note that first-order differencing, ∇X_t = X_t - X_{t-1} = Z_t, forms a purely random process, making it stationary.

Moving Average Process

Suppose that Z_t is a purely random process with mean zero and variance σ_Z². A process X_t is called a moving average process of order q (an MA(q) process) if
  X_t = β_0 Z_t + β_1 Z_{t-1} + ... + β_q Z_{t-q}.  (14)
This process is weakly stationary.

Autoregressive Process

Suppose that Z_t is a purely random process with mean zero and variance σ_Z². A process X_t is called an autoregressive process of order p (an AR(p) process) if
  X_t = α_1 X_{t-1} + ... + α_p X_{t-p} + Z_t.  (15)
The "auto" refers to the fact that the prediction is based on past values of X_t.

AR vs. MA

Let's look at the AR(1) process, or Markov process:
  X_t = αX_{t-1} + Z_t.  (16)
By successive substitution we have
  X_t = α(αX_{t-2} + Z_{t-1}) + Z_t  (17)
      = α²(αX_{t-3} + Z_{t-2}) + αZ_{t-1} + Z_t  (18)
      = Z_t + αZ_{t-1} + α²Z_{t-2} + ...,  (19)
provided -1 < α < 1, so that the sum converges. In general there is a duality between AR and MA processes, i.e. each process can be written as the other.
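The AR(1)-to-MA(∞) expansion above can be checked numerically. This sketch simulates an AR(1) recursion and reconstructs one value from a truncated MA(∞) sum (the coefficient and series length are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.6          # |alpha| < 1, so the MA(inf) sum converges
n = 500
z = rng.normal(0, 1, size=n)

# AR(1) recursion: X_t = alpha * X_{t-1} + Z_t, started at X_0 = Z_0
x = np.empty(n)
x[0] = z[0]
for t in range(1, n):
    x[t] = alpha * x[t - 1] + z[t]

# MA(inf) form: X_t = Z_t + alpha Z_{t-1} + alpha^2 Z_{t-2} + ...
# Truncated at lag 50, since alpha^50 is negligible.
k = 50
t0 = n - 1
ma = sum(alpha**j * z[t0 - j] for j in range(k + 1))
print(x[t0] - ma)  # near zero: the two representations agree
```

The geometric decay of alpha^j is what makes the truncation harmless; for alpha close to 1 a much longer truncation window would be needed.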
ARMA Process

A mixed autoregressive/moving-average process containing p AR terms and q MA terms is said to be an ARMA process of order (p, q):
  X_t = α_1 X_{t-1} + ... + α_p X_{t-p} + Z_t + β_1 Z_{t-1} + ... + β_q Z_{t-q}.  (20)
It turns out that a stationary time series can often be adequately modelled by an ARMA model with fewer parameters than a pure AR or MA process alone.

ARIMA Process

The integrated ARMA process accounts for non-stationarity by differencing first. Replacing X_t by W_t = ∇^d X_t, we obtain an ARIMA(p, d, q) model:
  W_t = α_1 W_{t-1} + ... + α_p W_{t-p} + Z_t + β_1 Z_{t-1} + ... + β_q Z_{t-q}.  (21)
The seasonal ARIMA (SARIMA) process includes seasonal differencing.

Continuous Time

It is interesting to consider the continuous-time version of these models, which turns out to be much less tractable mathematically. Consider an AR(1) process X_t = αX_{t-1} + Z_t. Then
  (1 - α)X_t + α∇X_t = Z_t.  (22)
This has the continuous analog
  dX(t)/dt + aX(t) = Z(t),  (23)
where Z(t) is continuous white noise; this is the Langevin equation. One interpretation of this analogy is that we are effectively learning differential equations.

State Space Model

We assume the data are obtained from a linear combination of some hidden state variables s_t, plus noise:
  X_t = m_t^T s_t + ε_t,  (24)
where m_t is considered to be known and ε_t ~ N(0, σ_ε²). We also assume we know how s_t changes in time:
  s_t = A_t s_{t-1} + c_t,  (25)
where the matrix A_t is considered known and c_t is multivariate normal with zero mean vector and known covariance matrix C_t. This Markov model is called a state space model.

It turns out that many models can be represented in state space form, including all ARIMA models. As an example, consider the AR(2) model X_t = α_1 X_{t-1} + α_2 X_{t-2} + Z_t. By defining s_t = (X_t, α_2 X_{t-1})^T, we have X_t = (1, 0) s_t, with m_t = (1, 0)^T, and
  s_t = ( α_1  1 ; α_2  0 ) s_{t-1} + (1, 0)^T Z_t,  (26)
since s_{t-1} = (X_{t-1}, α_2 X_{t-2})^T.

Kalman Filter

In the usual formulation, s_t is the unknown quantity to be estimated. The Kalman filter (or linear dynamical system) is a way to estimate s_t when a new observation in the time series becomes available, i.e. in an online fashion, consisting of two stages: prediction and correction.
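The AR(2) state-space embedding in equation (26) can be verified numerically: simulate the AR(2) recursion directly, then run the state-space recursion with the same noise, and check that the observed components coincide (the coefficients below are arbitrary stationary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
a1, a2 = 0.5, 0.3      # illustrative AR(2) coefficients
n = 100
z = rng.normal(0, 1, size=n)

# Direct AR(2) recursion: X_t = a1 X_{t-1} + a2 X_{t-2} + Z_t
x = np.zeros(n)
for t in range(2, n):
    x[t] = a1 * x[t - 1] + a2 * x[t - 2] + z[t]

# State-space form (26): s_t = A s_{t-1} + (1, 0)^T Z_t, X_t = (1, 0) s_t,
# with hidden state s_t = (X_t, a2 X_{t-1})^T
A = np.array([[a1, 1.0],
              [a2, 0.0]])
s = np.zeros(2)        # corresponds to s_1 = (X_1, a2 X_0)^T = (0, 0)^T
xs = np.zeros(n)
for t in range(2, n):
    s = A @ s + np.array([z[t], 0.0])
    xs[t] = s[0]       # X_t = (1, 0) s_t

print(np.allclose(x, xs))
```

The second state component carries a2 X_{t-1} forward one step, which is exactly how the 2x2 matrix A reproduces the two-lag recursion with a one-lag (Markov) state update.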
Because it is online, it can follow the movement of a system where the underlying model is evolving through time. Suppose we've observed a univariate time series up to time t - 1, and ŝ_{t-1} is the best (minimum-MSE) estimator for s_{t-1} up to this time.

The prediction stage is concerned with forecasting s_t from data up to time t - 1, to obtain a prediction
  ŝ_{t|t-1} = A_t ŝ_{t-1}  (27)
with covariance matrix
  P_{t|t-1} = A_t P_{t-1} A_t^T + C_t.  (28)
When the new observation at time t, X_t, has been observed, the estimator for s_t can then be corrected to account for this additional information.

At time t - 1 the best forecast of X_t is given by m_t^T ŝ_{t|t-1}, so that the prediction error is
  e_t = X_t - m_t^T ŝ_{t|t-1}.  (29)
Using this error feedback, we can update the various quantities by
  ŝ_t = ŝ_{t|t-1} + K_t e_t,  (30)
  P_t = P_{t|t-1} - K_t m_t^T P_{t|t-1},  (31)
  K_t = P_{t|t-1} m_t (m_t^T P_{t|t-1} m_t + σ_ε²)^{-1}.  (32)
K_t is called the Kalman gain matrix.

Extensions

Multivariate and nonlinear models.

Linear Multivariate Formulations

The univariate state-space model is easily generalized to the case where X_t is a vector, by making m_t a matrix and ε_t a vector. By simply using lagged variables (past values), we can construct the usual data matrix for regression and use, say, standard linear regression. Note that the machinery we have been discussing is more general: it can handle linear coefficients which change in time, for example.

An AR model can easily use past values from other time series to predict values of the time series of interest. We can also account for feedback between different time series by using simultaneous equations; the natural extension to this setting is called vector ARMA (VARMA). Inspired by the idea that the relationship between two or more stocks might remain constant, co-integration models focus on finding stationary linear combinations of time series variables.

Nonlinear Models

A locally linear approximation to a nonlinear system can be obtained by a simple Taylor-expansion derivation, resulting in a nonlinear model called the extended Kalman filter.
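Returning to the linear state-space setting, the prediction and correction equations (27)-(32) can be sketched as a minimal filter with a scalar observation. The dynamics matrix, noise levels, and simulated trajectory below are all illustrative assumptions, not quantities from the notes:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative 2-D state model: s_t = A s_{t-1} + c_t, observed as X_t = m^T s_t + eps_t
A = np.array([[1.0, 1.0],    # toy position-velocity dynamics (an assumption)
              [0.0, 1.0]])
C = 0.01 * np.eye(2)         # state-noise covariance C_t
m = np.array([1.0, 0.0])     # observation vector m_t (observe position only)
sig2 = 0.5                   # observation-noise variance

# Simulate a hidden trajectory and noisy observations
n = 100
s_true = np.zeros((n, 2))
obs = np.zeros(n)
for t in range(1, n):
    s_true[t] = A @ s_true[t - 1] + rng.multivariate_normal(np.zeros(2), C)
    obs[t] = m @ s_true[t] + rng.normal(0, np.sqrt(sig2))

# Kalman recursion: prediction (27)-(28), then correction (29)-(32)
s_hat = np.zeros(2)
P = np.eye(2)
for t in range(1, n):
    s_pred = A @ s_hat                          # (27)
    P_pred = A @ P @ A.T + C                    # (28)
    e = obs[t] - m @ s_pred                     # (29) prediction error
    K = P_pred @ m / (m @ P_pred @ m + sig2)    # (32) Kalman gain
    s_hat = s_pred + K * e                      # (30) corrected state estimate
    P = P_pred - np.outer(K, m) @ P_pred        # (31) corrected covariance

print(abs(s_hat[0] - s_true[-1, 0]))  # filtered position tracks the hidden state
```

The gain K balances the predicted-state covariance against the observation noise: large sig2 makes the filter trust its prediction, small sig2 makes it trust the new observation.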
Instead of propagating a single state mean and covariance, we can propagate a set of possible states represented as points. One variant, using a few deterministically chosen points, is called unscented Kalman filtering. An approach using many points to obtain a nonparametric-ish approximation of the state density is called particle filtering. It has been useful for cases where there are multiple modes, for example due to spurious measurements.

A framework which allows parametric forms beyond linear models and Gaussians is called the generalized state space model. Another viewpoint is that of graphical models, which allow easy specification of many of the models we've discussed, at some level. Graphical models which are repeated over time, such as Kalman filters and hidden Markov models, are called dynamic Bayesian networks.

Piecewise linear modelling, corresponding to different regimes of the time series, is another way of achieving nonlinearity. Mixtures of Kalman filters, where regimes are defined by a discrete hidden-Markov-model-like process (switching state-space linear dynamical systems), comprise one way to do this. Threshold autoregressive models are another way.

We can always generalize the linear AR model with some class of nonlinear functions from past values to the current value. Various ways of making linear parameters change over time also constitute overall nonlinear models. In econometrics, models which account for time-varying variance, called generalized autoregressive conditional heteroscedasticity (GARCH) models, have been explored heavily.

Other Topics

- Temporal cross-validation: h-block cross-validation leaves a gap of h points between each point to be predicted and the rest of the data used for training.
- Fourier space: good for periodicities.
- Chaos: gave us phase plots.

CSE 6740 Lecture 5: How Do I Learn Any Density? Model Selection and Nonparametric Estimation
Alexander Gray, agray@cc.gatech.edu
Georgia Institute of Technology

Today:
1. Model selection and generalization: how do we minimize future error? Two Gaussians or three?
2. Nonparametric estimation: what if I don't want to specify a simple parametric form?
3. Kernel density estimation: how can I estimate a density nonparametrically?

Model Selection and Generalization

How do we minimize future error? Two Gaussians or three?

Model Class Selection

How many components K should we use? We could say that our model class is the set of all possible mixtures, with a parameter K thrown in. However, this is no longer a finite number of parameters; we'll consider this scenario (nonparametric estimation) separately. So we usually consider parameter selection (e.g. choosing θ = {μ, Σ}) separately from model selection (e.g. choosing K, or choosing between two different kinds of distributions). The extent to which these two processes should be different or the same is subtle, and we will talk about this more.

Training and Testing

[Figure: a training dataset is used to fit the model; a separate test dataset is used to apply it. Recall the meaning of training and testing.]

Training and Test Error

Test error (or generalization error, or prediction risk) is the expected error over an independent test sample drawn from the same distribution as that of the training data:
  R(M) = E[L(Y, f̂_M(X))].  (1)
Training error is the average loss over the training data:
  R̂_tr(M) = (1/N) Σ_{i=1}^N L(y_i, f̂_M(x_i)).  (2)

Our goal: find the model M which minimizes the test error, inf_M R(M). This is called model selection.

[Figure: test error and training error as a function of model complexity; training error decreases monotonically, while test error is U-shaped.]

We call choosing a suboptimal model overfitting (or underfitting). In terms of L2 loss (coming up), too-simple models give too much bias, and too-complex models give too much variance.
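The picture just described, training error falling monotonically with complexity while test error is U-shaped, can be reproduced with a toy polynomial-regression experiment; the true function, noise level, and degree range are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

# True function is a cubic; we fit polynomials of increasing degree.
def f(x):
    return x**3 - 2 * x

n = 60
x_tr = rng.uniform(-2, 2, n)
y_tr = f(x_tr) + rng.normal(0, 1, n)
x_te = rng.uniform(-2, 2, 1000)
y_te = f(x_te) + rng.normal(0, 1, 1000)

tr_err, te_err = [], []
for deg in range(1, 10):
    coef = np.polyfit(x_tr, y_tr, deg)   # least-squares fit of degree `deg`
    tr_err.append(np.mean((np.polyval(coef, x_tr) - y_tr) ** 2))
    te_err.append(np.mean((np.polyval(coef, x_te) - y_te) ** 2))

# Training error can only decrease as the model class grows (nested models);
# test error does not share this property.
print(tr_err[0], tr_err[-1], te_err[0], te_err[2])
```

Since the degree-d polynomials are nested model classes, training error is non-increasing in d, which is exactly why it is a downward-biased, untrustworthy estimate of the risk.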
The training error is a downward-biased estimate of the prediction risk: E[R̂_tr(M)] < R(M).

Model Selection and Assessment

To perform model selection, we just need to know the relative values of the test error for different models. Asymptotic approximations can sometimes be useful for comparing the test error for different models; these are generally not good estimators of the actual values of the errors. Asymptotic arguments are generally not good enough to give us good finite-sample estimates, except possibly for very simple models such as linear regression. For this we will turn to resampling methods. Of course, if we have a good direct estimator of the test error, we can use it for model selection. We should when we can.

Optimism of the Training Error

The optimism of the training error is
  op(M) = -bias(R̂_tr(M))  (3)
        = R(M) - E[R̂_tr(M)]  (4)
        = (2/N) Σ_{i=1}^N Cov(ŷ_i, y_i)  (5)
        = (2/N) Σ_{i=1}^N [E(ŷ_i y_i) - E(ŷ_i) E(y_i)].  (6)
In other words, the amount by which E[R̂_tr(M)] underestimates R(M) depends on how strongly y_i affects its own prediction. The harder we fit the data, the greater the optimism will be.

In general we have
  R(M) = E[R̂_tr(M)] + op(M),  (7)
  i.e. risk = lack of fit + complexity penalty.  (8)
Thus, to select the model, we can (1) obtain an estimate ôp(M), or (2) directly estimate R(M) some other way.

Asymptotic Approach

First, let's try to estimate the optimism asymptotically. Under squared-error loss with the error model Y = f(X) + ε, where the error ε has zero mean and variance σ²,
  op(M) = (2/N) Σ_{i=1}^N Cov(ŷ_i, y_i) = 2 (|M|/N) σ²,  (9)
where |M| is the number of parameters of the model.

Mallows' Cp Statistic

This leads to an estimate of R(M) called Mallows' Cp statistic:
  Cp(M) = R̂_tr(M) + 2 (|M|/N) σ̂²,  (10)
where σ̂² is obtained from the MSE of a low-bias (complex) model.

AIC Statistic

Suppose we are using the log-likelihood for our loss.
Then
  l(M) = log L(M) = Σ_{i=1}^N log f(x_i; θ̂_M).  (11)
(Actually we use -2 l(M).) It turns out that, as N → ∞,
  -2 E[log f(X; θ̂)] ≈ -(2/N) E[l(M)] + 2 (|M|/N),  (12)
where θ̂ is the MLE and l(M) is the log-likelihood on the training data.

This leads to an estimate of R(M) called Akaike's Information Criterion (AIC):
  AIC(M) = R̂_tr(M) + 2 (|M|/N) σ̂²,  (13)
where σ̂² is obtained from the MSE of a low-bias (complex) model. This is the same as Mallows' Cp statistic, except that it holds under broader assumptions; it is a generalization. Note, however, that this does not hold in general, for example for 0-1 loss. Note also that AIC is not consistent, i.e. it does not choose the right model asymptotically.

Finite-Sample Risk Estimation

We can also try to directly estimate the test error using the actual data we have. Suppose we had infinite data: we could use a chunk of it to train a model, and a chunk of it (a validation set) to estimate the test error of the model. However, it's never clear when we have enough data to be able to throw some away. So we simulate validation sets by resampling. Cross-validation and the bootstrap are examples of this approach; they directly estimate R(M).

Cross-Validation

V-fold cross-validation splits the data into V roughly equal-sized chunks, using each chunk in turn as the validation set while taking the remainder as the training set. Denote by f̂_M^{-v} the estimate using the model M trained on the data with the v-th part removed. Then the cross-validation estimate of the test error is
  CV(M) = (1/N) Σ_{v=1}^V Σ_{i=1}^{N_v} L(y_i, f̂_M^{-v}(x_i)).  (14)
V-fold cross-validation requires running V training optimizations.

[Figure: the V train-set/test-set splits used in V-fold cross-validation, one fold held out per iteration.]

Choice of V: V = N is called leave-one-out cross-validation (LOOCV).
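Equation (14) can be sketched directly. This is a minimal V-fold cross-validation loop for choosing a polynomial degree under squared-error loss; the dataset and the candidate degrees are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data: quadratic signal plus noise
x = rng.uniform(-1, 1, 100)
y = 1 + 2 * x - 3 * x**2 + rng.normal(0, 0.3, 100)

def cv_error(deg, V=10):
    """V-fold CV estimate of the squared-error risk of a degree-`deg` polynomial fit."""
    idx = rng.permutation(x.size)
    folds = np.array_split(idx, V)
    sse = 0.0
    for v in range(V):
        test = folds[v]
        train = np.concatenate([folds[u] for u in range(V) if u != v])
        coef = np.polyfit(x[train], y[train], deg)
        sse += np.sum((np.polyval(coef, x[test]) - y[test]) ** 2)
    return sse / x.size   # (1/N) * total held-out loss, as in eq. (14)

errors = {deg: cv_error(deg) for deg in range(1, 7)}
best = min(errors, key=errors.get)
print(best)  # the degree minimizing the CV risk estimate
```

Note that every point is held out exactly once, so the normalization in equation (14) is by N, not by V.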
LOOCV is approximately unbiased, though some argue it can have high variance. LOOCV and AIC can be shown to be asymptotically approximately equivalent. It is computationally intensive in general; however, for linear models of the form ŷ = Ay for some matrix A, under squared-error loss, there is a convenient approximation called generalized cross-validation. Having too small a value for V will overestimate R(M), because smaller training sets yield poorer estimators. Often V = 10 is chosen as a compromise.

Main Things You Should Know

- What a mixture of Gaussians is
- What the EM algorithm is for, and its iterative form
- What AIC is
- What cross-validation is