Statistical Analysis of Time Series
Statistical Analysis of Time Series STAT 635
Popular in Course
Popular in Statistics
verified elite notetaker
This 109 page Class Notes was uploaded by Alison Vandervort on Monday September 21, 2015. The Class Notes belongs to STAT 635 at Ohio State University taught by Staff in Fall. Since its upload, it has received 4 views. For similar materials see /class/209999/stat-635-ohio-state-university in Statistics at Ohio State University.
Reviews for Statistical Analysis of Time Series
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 09/21/15
STATISTICS 635 SUMMER 2005 FORECASTING ARMA PROCESSES As A Man Thinketh in His Heart So Is He Proverbs 23 7 ACF AND PACF OF ARMA PROCESSES EH What is the pattern of ACF and PACF for MAltqgt processes ACF fur MA1Wth mayo 7 FACF fur MA1wth 0 u 7 n2 nn n2 no nn nn H P am no n2 nn n2 no Figure 1 ACF and PACF MA1 with X Z 0724 ACF furMA1Wth masks 7 and thetazu 1 FACF furMA1wth 0 1 u 7 a 0 2 u 1 Fawn n no n2 nn n2 no nn nn H Figure 2 ACF and PACF MA2 with X Z 0724 0122 EH Remark The ACF of an MAq tends to cut off after lag q7 while its PACF tends to tail off STATISTICAL ANALYSIS OF TIME SERIES 1 STATISTICS 635 SUMMER 2005 ACF AND PACF OF ARMA PROCESSES EH What Is the pattern of ACF and PACF for ARltpgt processes ACF furAR1 Wm pmu 5 FACF AR1wth an o 5 Him AIM III III III I no as M AM I11 In no mi U Figure 3 ACF and PACF of an AR1 process X 05xt1 Z AR2 with phi 1 o7 and phi 2 01 AR2with phi 1 o7 and phi 2 01 ACF U Famai ACF Figure 4 ACF and PACF of an AR2 process X 07xt1 7 01xt2 Zt EH Remark The ACF of an ARltpgt tends to tail off7 While Its PACF tends to cut off after lag p Note The PACF of MA models tends to behave llhe the ACF of AR models whlle the PA CF of AR models tends to behave llhe the ACF of MA models STATISTICAL ANALYSIS OF TIME SERIES 2 STATISTICS 635 SUMMER 2005 ACF AND PACF OF ARMA PROCESSES EH What Is the pattern of ACF and PACF for ARMAQ q processes ACF Dr an ARMAKZ2 FACF Dr an ARMAG2 m 412 I11 III In III U Imam In 412 I11 III In III Figure 5 ACF and PACF of an ARMA27 2 process xt708xt1701xt2 zt07zt101zt2 ACF of an ARMA22 PACF of an ARMA22 ACF FamaI ACF Figure 6 ACF and PACF of an ARMA27 2 process xt07xt101xt2 zt7082t1701ztg EH Remark The ACF of an ARMAQ q tends to tail off7 and Its PACF also tends to tail off STATISTICAL ANALYSIS OF TIME SERIES 3 STATISTICS 635 SUMMER 2005 Example A LOOK AT THE FAMOUS SUNSPOTS DATA Pim uIAnnuaI sunspms Imm 177EHEE9 ACF ofthe Annual SunSpots Data from 1770 1569 l I an EU m an an I m Figure 7 Plot and ACF Of the yearly sunspots from 1770 1869 EH The ACF strongly suggests a lingering dependency EH Plausible tO assume underlying autoregressive mechanism PACF of the Annual SunSpots Data from 17701869 Partial ACF Figure 8 PACF Of the yearly sunspots from 1770 1869 EH What does the PACF suggest STATISTICAL ANALYSIS OF TIME SERIES 4 STATISTICS 635 SUMMER 2005 FORECASTING AN ARMAp q PROCESS EH Given the series up to time point 717 ie XT Xn X714 Xl EH Predict a future value of Xnh for h 1 2 EH We developed the Best Linear Predictor BLP in Chapter 2 131th a0 aan aan1 anXl 71 a0 Z az39XnJrliz39 21 The two determining equations of BLP can be summarized as E Error gtlt PredictorVariable 0 EH We introduced the Durbin Levinson Algorithm D Naturally suitable for ARltpgt processes D Provides an estimate of the PACF as its byproduct D Computationally e icient Provides an 9012 solution where di rect methods would require 9013 D Useful for estimating the order of an ARp EH We introduced the Innovations Algorithm D Strength More general7 does not require stationarity D Naturally suitable for MAltqgt processes B Good for ARMA models With q gt O D Can be made even better See textbook STATISTICAL ANALYSIS OF TIME SERIES 5 STATISTICS 635 SUMMER 2005 GENERAL APPROACH To FITTING ARMA PROCESSES EH Obtain the observations EH Do the observations come from a stationary process D Does the ACF decay to zero rapidly D lf need be7 perform transformation to obtain stationarity EH What is the order of the process D What is the autoregressive order p D What is the moving average order q EH Assuming p and q known and xed7 we have to D Estimate Q51Q52quot39 ng7 then 61 62 g and 02 D Total of p q 1 parameters EH How do we estimate the parameters of the ARMAQ q E Maximum Likelihood Estimation D Numerically requires good preliminary estimates b Durbin Levinson for pure autoregressive processes b Innovations vvhen q gt O in ARMAQ q EH Remember to perform goodness of t Remark Since the methods assume zero mean processes7 it is good practice to subtract the sample mean out prior to using the technique STATISTICAL ANALYSIS OF TIME SERIES 6 STATISTICS 635 SUMMER 2005 REVISITING THE DURBIN LEVINSON ALGORITHM Step 1 Initialize 2500 O and 1310 70 Step 2 For n 2 17 calculate Where7 for n 2 2 71k Cbnil Cbnngbnil ik Step 3 For n 2 17 calculate 1321 P 1lt1 in lt3gt WHAT IS THE ALGORITHM REALLY DOING Goal Solve an ARltpgt problem D L proceeds as follows EH Solve ARC problem rst EH Then solve the larger ARlt2gt problem7 using results from AR1 EH Then solve the larger ARlt3gt problem7 using results from AR2 EH So on EH Solve the original ARltpgt problem7 using results from ARltp I STATISTICAL ANALYSIS OF TIME SERIES 7 STATISTICS 635 SUMMER 2005 Exercise FINDING SOME USEFUL ACF EH Consider the causal ARlt2gt process Xt 1Xt71 2Xt72 Zt EH Find Its ACF using the method of difference equation STATISTICAL ANALYSIS OF TIME SERIES 8 STATISTICS 635 SUMMER 2005 Exercise PREDICTING AN AR2 PROCESS EH Consider the causal ARlt2gt process Xt 1Xt71 2Xt72 Zt EH Expressing in matrix vector fornI gives 71 i 70 71 71 2 71 70 72 1 vlt0 vlt1gt 71 117 2 61 i p 17p A 5523 EH The ACF derived earlier to simplify the above solution EH What do you conclude EH Apply the Durbin Levinson algorithm and compare STATISTICAL ANALYSIS OF TIME SERIES 9 STATISTICS 635 SUMMER 2005 PREDICTING h STEP AHEAD EH Assume that we still have our 71 observations Xm X714 Xl EH The best linear prediction h step ahead is given by X5311 lt25an 2 gtXn71w 52 2X1 Where a all ax aw and h h h 1 h 1T n Mlt gt M 71 gtgt and all satisfy anbff 7791 EH The mean square error in this case is given by P531 E Xnm X512 70 lt7 MgtTF7217 M EH What is the key difference between the h step ahead and the one step ahead Each one has to find his peace from within And peace to be real must be unaffected by outside circumstances Mahatma Gandhi STATISTICAL ANALYSIS OF TIME SERIES 10 STATISTICS 635 SUMMER 2005 STAT635 LECTURE OUTLINE 2 Discovery consists of seeing what everybody has seen and thinking what nobody else has thought Jonathan Swift COVARIANCES AUTOCOVARIANCES AND AUTOCORRELATIONS REMINDER IfIEX2 lt oo EEO2 lt oo IEZ2 lt 00 and a b c are real constants then covaX bY c7Z acovXZ bcovYZ De nition 1 Let Xi be a stationary time series With mean function uX The autocovariance function ACVF of Xi at lag h is 7X00 C0VXth7 Xt E iltXth MXXt MXii The autocorrelation function ACF of Xi at lag h is pXh corrXthXt Note They measure the amount of dependence between Xi and XtLh SOME PROPERTIES D 7X0 VXt D yXUL 7X h for each h B When Xi is Gaussian the distribution is completely speci ed by uX and 7X ii pX0 1 D pXh pX h for each h D 1 S pXh S 1 for each h STATISTICAL ANALYSIS OF TIME SERIES 1 STATISTICS 635 SUMMER 2005 MORE EXAMPLES OF TIME SERIES MODELS THE RANDOM WALK PROCESS De nition 2 A random walk is a series de ned by the equation XtZ1Z2Zt t12 Where Zt N WN0 02 A random walk is thus obtained by cumulatively summing iid random variables Randum WaIk gtz1z2 Z WIMZ MD was Find D IEXt and VXt D If 5 gt t then covZSXt El yXt hi B Is the series Xi stationary B Is there any transformation that can change that STATISTICAL ANALYSIS OF TIME SERIES 2 STATISTICS 635 SUMMER 2005 FIRST ORDER MOVING AVERAGE OR MA1 PROCESS De nition 3 A series Xi is a first order moving average or MA1 process if Xi Zt6ZH t0i1i2 Where D Z N WN002 D 0 is a real valued constant It is easy to show that the autocovariance function ACVF is 02162 ifh0 7Xtht 020 ifh i1 0 if h gt 1 D Clearly an MA1 process is a stationary process Also the autocorrelation function ACF is I ifh0 pxth7tpxh 0102I ifhi1 0 if h gt 1 EXAMPLE OF AN MA1 PROCESS WITH 0 9 AC 4n2 n2 as In I I I I I STATISTICAL ANALYSIS OF TIME SERIES 3 STATISTICS 635 SUMMER 2005 FIRST ORDER AUTOREGRESSIVE OR AR1 PROCESS De nition 4 A series Xi is a first order autoregressive or AR1 process if XI q XHZt t0i1i2 Where D Z N WN002 D lt 1 D Z is uncorrelated With X8 for each 5 lt t It is easy to show that the autocovariance function ACVF is 2 W 1 27 h0i1i2 D An AR1 process is a stationary process The autocorrelation function ACF of an AR1 is given by pxlthgt W h 0 i1 i2 EXAMPLE OF AN AR1 PROCESS WITH q 72 ACF 4n2 n2 as In I I I I I STATISTICAL ANALYSIS OF TIME SERIES 4 STATISTICS 635 SUMMER 2005 ESTIMATION OF THE MEAN OF A TIME SERIES B Let Xi be a stationary process with mean u and ACVF 7 and let X1X2 Xn be a sample from the process Xi D A natural estimator of u is the sample mean X i 1 ix 7 n t1 D The sample mean X is an unbiased estimator of u 1Ele M D X is not always a consistent estimator of u because X may not be close in probability to u as n gt 00 U More efficient estimator of u can be constructed for instance if one knows something about the ACVF ESTIMATION OF THE AUTOCOVARIANCE FUNCTION D A natural estimator of the ACVF 70L IE Xt uXth M is 1 nilhl 7 7 Ilhn WZXtihi XXt XI nlthltn t1 Q at is an unbiased estimator of 70L when X is unbiased Q Unfortunately the matrix Fn j2j1 is not always nonnegative definite D The most common estimator of Wat is the sample autocovariance function nilhl flhZXtlhlXXt X7 nlthltn Q is a biased estimator of 70L Q The matrix Fn jZ71 is nonnegative definite Q MSEWUL is typically smaller than MSEWUL Q Normally gt 0 as h gt oo STATISTICAL ANALYSIS OF TIME SERIES 5 STATISTICS 635 SUMMER 2005 ESTIMATION OF THE AUTOCORRELATION FUNCTION ACF D The sample autocorrelation function is nlthltn D Heuristic Use and only if m S i and n 2 50 n D Use the fact that A i lhl We 1 n w to argue why large values of lead to unstable estimates of A TEST FOR IID NOISE USING THE SAMPLE ACF D Note For iid noise with nite variance we have for h y 0 pm N N lt0 D Steps of the diagnostic for iid noise Q Plot the lag h versus Q Draw two horizontal lines at ll96 C These two lines are drawn automatically in R Q You should have about 95 of the the values h 12 within the lines if the noise is indeed iid D Which of the following depicts an HD noise STATISTICAL ANALYSIS OF TIME SERIES 6 STATISTICS 635 SUMMER 2005 EXAMPLE OF R CODE FOR SIMULATING IID NOISE Simulate 144 IID NO1 random variables Xlt rnorm14401 Let s have a plot of 2 panels high by 1 Also make the teXt 07 times smaller parmfrowc21 ceXO7 Plot the resulting time series plotX Xlabquottimequot ylabquotIID noisequot typequotlquot Calculate and plot the ACF acf X IIDnaIse I I I I I 42410123 EI 2D 4D EEI Eu IEIEI IZEI MEI UmE Series x COMMENTS STATISTICAL ANALYSIS OF TIME SERIES 7 STATISTICS 635 SUMMER 2005 EXAMPLE OF R CODE FOR SIMULATING AN MA1 PROCESS Simulate an MA1 process of length 144 with theta 9 Xlt arimasimlist mac 9 n144 Let s have a plot of 2 panels high by 1 Also make the teXt 07 times smaller par mfrowc2 1 ceXO7 Plot the resulting time series plotX Xlabquottimequot ylabquotMA1 processquot typequotlquot Calculate and plot the ACF acf X MA1pracess 20 In a In 20 I I I I D In 4D EEI an Inn IZEI MEI Ume Series x COMMENTS STATISTICAL ANALYSIS OF TIME SERIES 8 STATISTICS 635 SUMMER 2005 EXAMPLE OF R CODE FOR SIMULATING AN AR1 PROCESS Simulate an AR1 process of length 144 with phi 072 Xlt arimasimlistarc 072 n144 Let s have a plot of 2 panels high by 1 Also make the teXt 07 times smaller par mfrowc2 1 ceXO7 Plot the resulting time series plotX Xlabquottimequot ylabquotAR1 processquot typequotlquot Calculate and plot the ACF acf X ARIpmcess 41 a I 2 3 43 I I D In 4D EEI an Inn IZEI MEI Ume Series x COMMENTS STATISTICAL ANALYSIS OF TIME SERIES 9 STATISTICS 635 SUMMER 2005 MODELS WITH TREND AND SEASONAL COMPONENTS A GOOD PRACTICE IN TIME SERIES ANALYSIS D Step 1 Plot the data D Step 2 Investigate the existence of Q Apparent discontinuities such as sudden change in level or behavior C Could it be that the series needs to be broken into homogeneous segments C ls the variance increasing with time Q Outliers D Step 3 Perform any suitable transformation e g logarithm to stabilize the variance if the graph suggests that D Step 4 Inspect the graph to nd out if it suggests the possibility of representing the data as a realization of the classical decomposition model Xtmt8tYt 1 where Q mt is the trend component Q st is a seasonal component with period d and 251 st 0 Q Yt is a random noise component with 0 THE STANDARD PROCEDURE HAS TWO STEPS D Estimate and extract the deterministic components mt and st Q A popular model for st is the harmonic regression model k st Z Olj COSlt27Tfjtgt j sin27rfjt j1 C The Cg1 j 1 h are the parameters to be estimated C The fj control the frequency of the periodicity D Analyze the residual noise component Yt Q The aim is to hopefully model the residual Yt as a stationary time series STATISTICAL ANALYSIS OF TIME SERIES 10 STATISTICS 635 SUMMER 2005 A NONSEASONAL MODEL WITH TREND If our plot reveals no seasonal effect but does suggest the existence of a trend then we may use the simpler decomposition Xt mt Yr with 0 To construct a model for the data we consider two methods D METHOD 1 Trend estimation Q Fit a polynomial trend by Least squares regression Q Subtract the tted trend from the data Q Find an appropriate stationary time series model for the residuals D METHOD 2 Differencing Q Eliminate the trend directly by differencing Q Find an appropriate stationary time series model for the residuals Method 2 has the advantage that it typically uses fewer parameters and does not rest on the assumption that the trend remains the same throughout the observation time METHOD 12 TREND ESTIMATION BY LEAST SQUARES D A polynomial of degree k in t is posited as a model for mt ie mt 50 it 2t2 quot ktk k 51 j0 D The coe icients parameters are estimated by the least squares method ie nding those 53 that minimize the residual sum of squares n n k 2 RSS Z Xt mt2 Z X ZBjtj i1 2 1 D Computing In R this is done with the function lm Type helplm STATISTICAL ANALYSIS OF TIME SERIES 11 STATISTICS 635 SUMMER 2005 MORE ON LEAST SQUARES ESTIMATION OF TREND X17X27 39 39 39 7XnT7 Y Y17YVZ7 39 39 39 7YnT7 and6 807817827 39 39 39 78kT7 the model can be written in matrix form as XA6Y Where 1 t1 1 A 1 t2 15726 1 tn 1576 The ordinary least squares estimator of 6 is therefore I ATA 1ATX Remark Since the process is typically not an HD noise process the statistical properties of 6 will be different from the the results encountered in basic regression courses METHOD 1 APPLIED TO THE LAKE HURON DATA D The plot of the Lake Huron data set seemed to suggest a linear downward trend so that one could posit the following simple linear regression model mt 50 5115 B One could use the scaled index If 1 2 711 98 or the more meaningful original index If 1875 1876 7 1972 D Some of the R commands used dataLakeHuron Load the Lake Huron time series years lt 18751972 Create a variable named years lmhuronlt lmLakeHuron years Fit the linear regression model plotyearsLakeHuron type quotlquot Xlabquotyearquot ylabquotLevelquot ablinelmhuronlty2 Add the line to Huron plot acf residlmhuron Plot the ACF of the residuals STATISTICAL ANALYSIS OF TIME SERIES 12 STATISTICS 635 SUMMER 2005 D Some of the R commands used summarylmhuron Summarize the model obtained Call lmformula LakeHuron years Residuals Min 1Q Median 3Q MaX 2509970 0727260 0000829 0744024 2535650 Coefficients Estimate Std Error t value Prgtt Intercept 625554918 7764293 80568 lt 2e16 years 0024201 0004036 5996 355e08 Signif codes 0 0001 001 005 01 1 Residual standard error 113 on 96 degrees of freedom Multiple R Squared 02725 Adjusted R squared 02649 F statistic 3595 on 1 and 96 DF p value 3545e 08 D Plot of trend and plot of sample ACF Series residlmhuron Level W 221 578 579 580 581 582 I I I I I 577 I 576 I I I I I I man man mm mm wen n In 2U an AD yea Lag D Judging from the ACF plot is an HD noise process realistic adequate for Yt in the Lake Huron data STATISTICAL ANALYSIS OF TIME SERIES 13 STATISTICS 635 SUMMER 2005 NONSTATIONARY AND SEASONAL TIME SERIES MODELS ARIMA MODELS FOR NONSTATIONARY TIME SERIES EH So far7 we have considered the ARMA family of models7 which rested on the crucial assumption of stationarity EH We now consider a more general family that allows the modeling of nonstationary time series through the application of differencing EH The simplest example of a nonstationary time series transformed into a stationary one after differencing is the random walk encountered in chapter 1 Recall that7 we de ned the random walk Xt as Xt Xt1 Zt whereZt N WNO 02 Xt so de ned turned out to be a nonstationary ARC process However7 VXt with VXt Xt Xt1 is a stationary process7 being just the white noise Zt STATISTICAL ANALYSIS OF TIME SERIES 1 STATISTICS 635 SUMMER 2005 EH Consider the following nonseasonal model with trend Xt mt Yt where mt is a polynomial of order k and Yt is a stationary process D Xt is nonstationary since it has trend polynomial D With mt 60 lt kt we have VkXt kl k VkYt D VkXt is a stationary process with mean Sig7 since VkYt is stationary D The operation VkXt removed the trend and yielded a stationary time series that can be analyzed with the ARMA machinery De nition 1 Hal is a nonnegative integer then Xt is an ARIMAp d q process if Yt 1 BdXt is a causal ARMAp q process lt is clear from the de nition that BXt E B1 BdXt 6BZt Z N WNO 02 33 Note ARIMA stands for Integrated ARMA If E VdXt u y 07 then the model can be written as B1 BdXt oz 6BZt whereault1 1 pgt STATISTICAL ANALYSIS OF TIME SERIES 2 STATISTICS 635 SUMMER 2005 EH Since zd has a unit root With multiplicity d7 the process is still nonstationary even if the roots of are different from 1 However7 VdXt is stationary EXAMPLE 1 Simulate an ARIMA110 With gb 09 and 02 1 X lt arimasimlistorder c110 at 09 n 200 oquot mama quot1 mm mm uh n9 3 my 1 Sa vle ACF man mummy man an as and Ian 1 saw PAcr Man mmmm um um Sand gm 1 INTUITIVE UNDERSTANDING ARIMA PROCESSES EH The need for ARIMA arises from the fact that the series Xt in itself is nonstationary EH The key idea is that repeated applications of differencing operator V 1 B ultinIately yields a transformed series Yt VdXt that exhibits stationarity EH ln the ARIMA equation ltBgtXt E Bgtlt1 BgtdXt 909 Zt N WN0 02 1 the integer d 2 O is the number of applications of differencing STATISTICAL ANALYSIS OF TIME SERIES 3 STATISTICS 635 SUMMER 2005 one analltallnn m milligram nn mum1ny saw ACF m amuumed nn ARIMAHJM EH ARMApq processes studied earlier are stationary7 and oorre spond to d 0 ie no need to apply differencing to get stationarity EH ARIMA models are useful for representing data with trend D Why is this the case EH The parameters 9b 0 and 02 are based on the observed differences 1 BdXt PARAMETER ESTIMATION WITH R EH Based on the original series X we use p 17 d 1 and q O7 M1lt arimaX orderc110 We nd 83 089 with segi5 00818 Also7 8 2 089247 and AIC 55033 EH Based on the differenoed Yt VXt7 we use p 17 d O and q O7 M2lt arimaY orderc100 We nd 83 088 with segi5 00322 Also7 8 2 089067 and AIC 55192 STATISTICAL ANALYSIS OF TIME SERIES 4 STATISTICS 635 SUMMER 2005 EH The diagnostics checks in both cases support the plausibility of the Chosen model Note lrI light of its PACF7 it is plausible to t an ARlt2gt to this clata7 ie M3lt arimaX orderc 2 O 0 The result of which is gzgl 189 with segi51 003107 Q32 O89 with segi52 003137 Also7 amp2 087457 arid AIC 55982 Note lrI factorized form7 this ARlt2gt is very similar to the ARIMA1 1 0 EH Clearly7 ARIMA1 1 O can be written as 1 08931 BXt Z Z N WNO 08924 EH The ARlt2gt can be written as 1 1893 08932Xt 1 08931 BXt Z with Z N WNO 08745 STATISTICAL ANALYSIS OF TIME SERIES 5 STATISTICS 635 SUMMER 2005 Problem The ARlt2gt tting assumes stationarity7 yet the resulting coef cients are very similar to those of the nonstationary generating process Fact From a sample of nite length7 it Will be extremely di icult to distinguish between a nonstationary process for which Which O and a process Which has very similar coe icients bur for Which gbquot has all of its zeros outside the unit circle Lesson Applying differencing until the resulting series has a sample ACF that decays rapidly EH The di erenced data can be tted by a lovv order ARMA process EH The autoregressive polynomial gbquot Will have zeros that are comfort ably outside the unit circle as desired IDENTIFICATION TECHNIQUES EH Preliminary transformations These are mainly aimed at forming sta tionary series7 since the techniques described assume the stationarity of the process being studied Typical easily detectable indicators of nonstationarity are D Unstable variance D Trend D Seasonal variations First7 here are some variance stabilizing transformations STATISTICAL ANALYSIS OF TIME SERIES 6 STATISTICS 635 SUMMER 2005 D Logarithmic transformation This is one of the more common transformations used whenever Ut is a series whose standard deviation tends to increase linearly with the mean Vt ln Ut D General Box Cox transformation WT Ut 2 O gt 07 ant Ut gt O O fAltUtgt Transformations for elimination of trend and seasonal variations through D Classical decomposition D Differencing D Fitting of a sum of harmonics and a polynomial EH Model Identification and Estimation Let Xt be the mean corrected transformed time series obtained from preliminary transformations E Main guiding question What is the most satisfactory ARMAQ q for Xt 9 If p and q are known7 then simply estimate 9b 0 and 02 D Model size determination In general however7 we will need to nd p and q from the data by minimizing 2p q 1n n p q 2 Note Some software packages allow the speci cation of a range AICC 21nL p 9g snip Hqn for p and q STATISTICAL ANALYSIS OF TIME SERIES 7 STATISTICS 635 SUMMER 2005 EH Identification and estimation in practice 1 00M C OTHgt OON Q3 Plot your data and inspect to determine the type of preliminary transformation that is appropriate e g What is a reasonable value of d Transform the data is necessary Plot ACF and PACF to get rough estimates of p and or q Note You may need to repeat and several times before nding an ACF andor PACF that exhibits a rapidly decaying behavior which is our indicator of the reasonahility of the assumption of stationarity Choose a range for p and q if necessary Estimate the parameters 7 0 and 02 for each plausible model Perform a residual analysis for each plausible model Narrow everything down to models with good diagnostic check Select the model with the lowest AIC Perform prediction based on this best model Caution When performing differencing7 one should be careful as some correlation may be unnecessarily introduced as a result of overdifferenc ing For example7 from an uncorrelated th7 we get VZt Zt Zt1 that is correlated STATISTICAL ANALYSIS OF TIME SERIES 8 STATISTICS 635 SUMMER 2005 SEASONAL ARMA MODELS EH Consider the following seasonal model Without trend Xt 5t Yt Where is a stationary process St has the rst property 5t Stirs Where 5 denotes the length of the seasonal period Also St is i Sj O j1 EH Recall the lag 5 differencing operator V5 We have VSXt Xt XFS Xt BSXt 1 BSXt EH Applying V5 to Xt de ned above7 we get Vth 5t Yt Stirs K75 VsYt EH The seasonal ARMA Inodel denoted by ARMAltP Q57 With BSDQ BSZt Where 132 1 ltIgt1z ltIgt2z2 ltIgtpzp and 921 1z 2z2 QZQ are respectively7 the seasonal AR operator and the seasonal MA operator7 With period 5 STATISTICAL ANALYSIS OF TIME SERIES 9 STATISTICS 635 SUMMER 2005 CAUSALITY AND INVERTIBILITY OF SEASONAL ARMA EH Just like with ARMAQ q7 the ARMAP Q5 model is causal only if the roots of zs lie outside the unit circle EH ln the same way7 the ARMAP Q5 model is invertible only if the roots of zs lie outside the unit circle EXAMPLE OF SEASONAL ARMA EH Example 1 Consider the seasonal ARMA1 D12 We can write 1 B Xt 1 B Zt or equivalently Xt DXtilQ Z Zt712 EH For our seasonal ARMAltL 112 to be causal7 we require ltlgt lt 17 and for invertibility7 we require lt 1 EH When we studied the nonseasonal ARMA1 1 process we had Xt CbXtil Zt 0Zt71 Note The difference here is that lag 1 is now replaced by lag 12 EB A seasonal ARMAO Q5 reduces to a seasonal lIAQS7 and A seasonal ARMAP OS reduces to a seasonal ARltPgtS STATISTICAL ANALYSIS OF TIME SERIES 10 STATISTICS 635 SUMMER 2005 MORE EXAMPLES OF NONSEASONAL ARMA EH Example 2 Compute the ACVF and the ACF of the following seasonal MAlt1gt12 With simple calculations we get for 739 0 77 for 739 i otherwise and mi EH Example 3 Compute the ACVF and the ACF of the following seasonal ARlt1gtS Xt Xtis Zt Calculations similar to those used for nonseasonal ARlt1gt7 yield for 739 0 77 for 739 i otherwise and therefore for 739 0 97 for 739 i otherwise STATISTICAL ANALYSIS OF TIME SERIES 11 STATISTICS 635 SUMMER 2005 SUMMARY OF ACF AND PACF BEHAVIORS FOR SEASONAL ARMA ARPs MAQS ARMMR QL ACF Tails off at lags k5 Cuts off after lags Q5 Tails off at lags k5 PACF Cuts off after lags P5 Tails off at lags k5 Tails off at lags k5 EH The values of ACF and PACF are zero at non seasonal lagS 739 y k5 EH Here7 5 is the length of the period7 and k 12 MIXED SEASONAL ARMA MODELS EH The seasonal and non seasonal ARMA models can be combined The resulting model has equation ltIgtltBSgt ltBgtXI 9ltBsgt6ltBgtz EH This resulting model is called the mixed seasonal ARlVIA7 and is denoted by ARMAp q x P Qgts EH The behavior of a mixed seasonal ARMA model is a combina tion of the the behaviors of its seasonal and nonseasonal constituents EH EXAMPLE Consider the model ARMAO1 gtlt 1012 D Write down the equation of this model E Find the expression of the ACF When ltlgt lt l and 6 lt l STATISTICAL ANALYSIS OF TIME SERIES STATISTICS 635 SUMMER 2005 SEASONALAEUAAABACDELS De nition 2 If d and D are nonnegative integers then Xt is a seasonal ARIMAp d q x P D Q process With period 5 if the di erenced time series Yt 1 Bd1 BSDXt is a causal ARMA process de ned by BltIgtBSYt 6BBSZt Z N WNO 02 2 Where gt1 2 W 132 1 ltIgt1z ltIgt2z2 ltIgtpzp 621 61z 62z2 6qu 921 1z 222 QZQ NOTEWORTHY ASPECTS OF SARIMA PROCESSES EH The process is causal if and only if y O and z y O for z g 1 EH ln practice7 the value of D is rarely more than 17 and P and Q are typically less than 3 EH Seasonal ARIMA models allow for for randomness in the seasonal pattern from one cycle to the next7 making it more general than the decomposition seen earlier STATISTICAL ANALYSIS OF TIME SERIES 13 STATISTICS 635 SUMMER 2005 SARIMA MODELS FOR MONTHLY DATA OVER YEARS YearMonth 1 2 12 1 Y1 Y2 YI2 2 Y13 YIA Y24 3 325 326 39 39 39 336 7quot lfl12r71 B12r71 quot YI2r EH Each column of the table may itself be Viewed a realization of a time series7 so that we have a total of 12 such realizations EH Suppose that each of the 12 time series is generated by the same ARMAltP model7 ie for t O 7quot 17 Yj12t FriHugh 39 39 39 DPYj12t7P Uj12t 1Uj12t71 39 39 39 QUj12tiQ 3 where Uj12t71gt t gt1gtOgt1gtquot39 EH Since the same ARMAltP model is assumed for j 1 127 we can rewrite 3 as Bmm enemm 5 Uj12t1 t 10 1 N WNO02 for each j with j 1 2 12 5 will be referred to as the between year model STATISTICAL ANALYSIS OF TIME SERIES 14 STATISTICS 635 SUMMER 2005 STAT635 LECTURE OUTLINE 4 There is nothing training cannot do Nothing is above its reach It can turn bad morals to good it can destroy bad principles and recreate good ones it can lift men to angelship M ark Twain STATIONARY PROCESSES EH Making predictions in time series analysis presupposes that something does not vary with time EH In a deterministic setting a linear function is used for extrapolation if the rst derivative is found to be constant EH As we saw in the previous chapter the time invariance of aspects of a time series was captured by the fundamental concept of stationarity EH If the random component of a time series is stationary then one can develop powerful techniques to forecast its future values Let Xi be a stationary time series with mean function u ES The autocovariance function ACVF of Xi at lag h is Wt covXthXt IE MW We MI ES The autocorrelation function ACF of Xi at lag his h corrXth Xi The ACVF and the ACF play a very important role in stationary time series STATISTICAL ANALYSIS OF TIME SERIES STATISTICS 635 SUMMER 2005 NONNEGATIVE DEFINITE FUNCTIONS A real valued function E de ned on the integers is nonnegative definite if iiaZMt jaj Z 0 2391 j1 for all positive integers n and vectors a a1 a2 anT BASIC PROPERTIES OF THE ACVF 7 Let Xi be a stationary time series with autocovariance function ACVF The following hold true 53 70 VOW 2 0 EH 7h 3 70 for all h EH 70L 7 h for all h so that is even function EH is nonnegative definite WHAT FUNCTIONS QUALIFY AS ACVF OR ACF EH Theorem 1 A real valued function de ned on the integers is the ACVF ofa stationary time series if and only if it is even and nonnegative definite EH Theorem 2 A real valued function de ned on the integers is the ACF ofa stationary time series if and only if it is even and nonnegative definite and p0 l STATISTICAL ANALYSIS OF TIME SERIES 2 STATISTICS 635 SUMMER 2005 STATIONARY PROCESSES HAVE NONNEGATIVE ACVF EH Let Xi be a stationary time series with ACVF De ne Y En OZtXt tl where n is some integer and 011012 ozn are reals constants EH Using the linearity property of covariances it is easy to show that TL vY v 2 ago t1 TL 2 as atcovX8Xt 31 tl ZZasatys t 31 tl ES The above result is a standard property of autocovariance functions EH Since VY is a variance we have VY Z 0 EH Therefore iiasatys t Z 0 31 tl for all positive integers n and vectors a 011012 oznT ES The ACVF of any stationary time series Xi is nonnegative definite FROM A NONNEGATIVE FUNCTION TO ITS STATIONARY PROCESS ES There exists a stationary time series with ACVF E if E is even realvalued and nonnegative definite This proof is lengthier than the previous one ES If E is even realvalued and nonnegative definite then there exists a sta tionary Gaussian time series Xi with mean 0 and ACVF STATISTICAL ANALYSIS OF TIME SERIES 3 STATISTICS 635 SUMMER 2005 Simple little Exercise Let Y and Z be two uncorrelated random variables both with mean 0 and variance 1 De ne Xi Ycos6t Zsin6t 1 Find the mean of the process Xi 2 Find the autocovariance function of Xi 3 Is Xi a stationary process 4 Is the function KUL cos6h a nonnegative de nite function Remark To verify that a given function is nonnegative de nite it is often easier to to nd a stationary process that has the given function as its ACVF than to verify the conditions of the above theorem STATISTICAL ANALYSIS OF TIME SERIES 4 STATISTICS 635 SUMMER 2005 STATIONARITY REVISITED De nition 1 A time series Xi is said to be weakly stationary if the vector X1 XnT and the time shifted vector X1h 7XnhT have the same mean vectors and the same covariance matrices for all integers h and n 2 1 In other words a time series Xi is weakly stationary if i aXt is independent of t ie aXt aX for all t ii 7Xt h t is independent oft for each h h is called the lag De nition 2 A time series Xi is said to be strictly stationary if the vector X1 XnT and the time shifted vector X1h 7XnhT have the same joint distribution function ie d X1 an X1h Xan for all integers h and n 2 1 Consequences of strict stationarity ES The random variable Xi are identically distributed as XtXthT 2 X1 X1hT for all t and h EH Xi is weakly stationary if ElXt lt 00 for all t EH Weak stationarity does not imply strict stationarity EH An iid sequence is strictly stationary Note Remember that stationarity is crucial to forecasting STATISTICAL ANALYSIS OF TIME SERIES 5 STATISTICS 635 SUMMER 2005 EXCURSION To THE GAUSSIAN WORLD The multivariate normal distribution is one of the most important distribution in the whole eld of statistics and also plays an important role in time series analysis ES Let X X1X2 XnT be a random vector X has a multivariate normal distribution or multivariate Gaussian distribution with mean vector L and variance covariance matrix 2 if 1 m lt2wgt 2ltdetlt2gtgt12exp ltx mTz x m EH Notation X N NL 2 EH If X has a multivariate normal distribution then the conditional distribution of any set of components given any order is again multivariate normal EH For instance we can partition as follows X 1 X and L lEX XO 2 As a result 2 E E covX 11 12 E312 2322 where IN lEXi and 22 lEXi uiXj INUT EH X and X9 are independent if and only if 212 0 ES The conditional distribution of X given Xe X0 is N 0 21222721062 Ital E311 2122521221 So clearly EX1X2 X2 1 2122231062 0 STATISTICAL ANALYSIS OF TIME SERIES 6 STATISTICS 635 SUMMER 2005 SIMPLE LITTLE EXERCISE Let XT X1X2 be a 2 x 1 random vector Let X have a bivariate normal distribution with mean vector LT 1 ml and covariance matrix 2 011 012 012 022 1 Find the conditional mean IEX2X1 X1 2 Find the conditional variance VX2X1 X1 GAUSSIAN TIME SERIES EH Xi is a Gaussian time series if all of its joint distributions are multivariate normal ie if for any collection of integers t1t239 tn the random vector Xil 22 XZn has a multivariate normal distribution EH If Xi is a Gaussian time series then all of its joint distributions are com pletely determined by the mean function Mt ElXt and the autocovariance function 8t covX3 Xi EH If the process is stationary then ut u and Mt ht 70L for all t so that X1 XnT i X1h XnhT for all integers h and n gt 0 EH For a Gaussian time series strict stationarity is equivalent to weak station arity STATISTICAL ANALYSIS OF TIME SERIES 7 STATISTICS 635 SUMMER 2005 ON THE ROLE OF THE ACF IN PREDICTION EH Suppose that Xi is a stationary Gaussian time series with ACF p mean u and variance 02 ES Suppose that we have observed Xn EH Goal Find the function of Xn that gives us the best predictor of Xnh BE A natural and computationally convenient of best 7 is to specify our required predictor to be the function of Xn that minimizes the mean squared error m X71 argnrpeig new mltXngtgt2l Fact 1 The function m of Xn that minimizes EKXWL mXn2l is the conditional mean mXn Eanthnl u phXn M With the corresponding mean squared error being EKXWL mXn2l 021 p002 PROOF D STATISTICAL ANALYSIS OF TIME SERIES 8 STATISTICS 635 SUMMER 2005 ASPECTS OF PREDICTION WITH GAUSSIAN TIME SERIES EH For Gaussian time series prediction of Xnh in terms of Xn is more accurate as ph gets closer to 1 EH In the limit of p gt ll the mean squared error approaches 0 Remarks ES The above predictor relies on the joint normality of Xnh and Xn PREDICTION WITH NON GAUSSIAN TIME SERIES EH For time series with nonnormal joint distributions the derivation of the best predictor is much more complicated EH Question What do practitioners do EH Popular wisdom Do not look for the best predictor Instead look for the best linear predictor which is the best predictor of the form Xn aXn b EH In MSE terms 6Xn arg IEXnh aXn by ES It turns out that the best linear predictor is 13Xn u phXn M EH Xn depends only on the mean and the ACF of the process Xi EH Xn can be derived without detailed knowledge of the joint distributions EH If Xi is Gaussian Xn and mXn are the same In general MSEltmltXngtgt MSElt ltXngtgt STATISTICAL ANALYSIS OF TIME SERIES 9 STATISTICS 635 SUMMER 2005 SOME USEFUL TERMINOLOGY De nition 3 A stationary time series Xi is said to be q dependent if X8 and Xi are independent whenever t 5 gt q EH Example An iid sequence is O dependent De nition 4 A stationary time series Xi is said to be q correlated if7h 0 whenever h gt q For example BE A white noise process is 0 correlated ES The MA1 process is I correlated CONSTRUCTION OF STRICTLY STATIONARY TIME SERIES EH Let Z be an iid sequence which we know to be strictly stationary EH De ne the series Xi as follows Xt gZt Ztih 39 quot 7Zt7q for some real valued function g ES The time series Xi is said to constructed by ltering the iid sequence Zt using filter EH Clearly Xi is strictly stationary since Zth7 39 39 39 7ZthiqlT i Z757 39 39 39 7Ztiq for all integers h EH Xi is clearly q dependent Remark The above method provides one of the simplest ways to construct time series that are strictly stationary STATISTICAL ANALYSIS OF TIME SERIES 10 STATISTICS 635 SUMMER 2005 THE MAq PROCESS AND RELATED RESULTS EH Let Z be a WN002 process Let 60 1 Let q be any integer and consider the real constants 01 02 64 with 6g y 0 De ne X Z 61Zt71 62Zt72 39 39 39 qutiq OOZI 61Zt71 62Zt72 39 39 39 qutiq q ZOJZH j0 EH Xi is known as the moving average of order q or the MAq process ES The MAq process is q correlated Theorem 3 If Xi is a stationary q correlated time series with a zero mean then it can be represented as a MAq process LINEAR PROCESSES BE A set of real constants IJj j E Z is absolutely summable if 00 Z M lt 00 j7oo ES The time series Xi is a linear process if it has the representation X Z TJthej j7oo for all t where Z N WN0 02 and IJj is absolutely summable EH We can write Xi as a linear combination of all the noise terms the past current and future EH Absolute summability of the constants guarantees that the in nite sum con verges STATISTICAL ANALYSIS OF TIME SERIES STATISTICS 635 SUMMER 2005 Proposition 1 Let be a stationary time series With mean 0 and covariance function 7y 1162700 lt 00 then the time series XI 2 ijH me j7oo is stationary With mean 0 and au tocovariance function 00 00 7X00 Z Z j kwh k J j7ooj7oo In the special case Where Xi is a linear process 7X00 Z j jh02 j7oo B is the backward shift operator BXt XH STATISTICAL ANALYSIS OF TIME SERIES 12 STATISTICS 635 SUMMER 2005 STAT635 LECTURE OUTLINE 3 When you truly see all fears and illusions vanish and the beauty of truth is revealed to you Unknown EXAMPLE OF R CODE FOR SIMULATING AN AR1 PROCESS Simulate an AR1 process of length 144 with phi 072 Xlt arimasimlistarc 072 n144 Let s have a plot of 2 panels high by 1 Also make the teXt 07 times smaller parmfrowc21CeXO7 Plot the resulting time series plotX Xlabquottimequot ylabquotAR1 processquot typequotlquot Calculate and plot the ACF acfX 410123 lllllll ARM process l l D In 4D EEI Eu lEIEI izu MEI time Series x STATISTICAL ANALYSIS OF TIME SERIES 1 STATISTICS 635 SUMMER 2005 MODELS WITH TREND AND SEASONAL COMPONENTS A GOOD PRACTICE IN TIME SERIES ANALYSIS D Step 1 Plot the data D Step 2 Investigate the existence of Q Apparent discontinuities such as sudden change in level or behavior C Could it be that the series needs to be broken into homogeneous segments C ls the variance increasing with time Q Outliers D Step 3 Perform any suitable transformation e g logarithm to stabilize the variance if the graph suggests that D Step 4 Inspect the graph to nd out if it suggests the possibility of representing the data as a realization of the classical decomposition model Xtmt8tYt 1 where Q mt is the trend component Q st is a seasonal component with period d and 251 st 0 Q Yt is a random noise component with 0 THE STANDARD PROCEDURE HAS TWO STEPS D Estimate and extract the deterministic components mt and st Q A popular model for st is the harmonic regression model k st Z Olj COSlt27Tfjtgt j sin27rfjt j1 C The Cg1 j 1 h are the parameters to be estimated C The fj control the frequency of the periodicity D Analyze the residual noise component Yt Q The aim is to hopefully model the residual Yt as a stationary time series STATISTICAL ANALYSIS OF TIME SERIES 2 STATISTICS 635 SUMMER 2005 A NONSEASONAL MODEL WITH TREND If our plot reveals no seasonal effect but does suggest the existence of a trend then we may use the simpler decomposition Xt mt Yr with 0 To construct a model for the data we consider two methods D METHOD 1 Trend estimation Q Fit a polynomial trend by Least squares regression Q Subtract the tted trend from the data Q Find an appropriate stationary time series model for the residuals D METHOD 2 Differencing Q Eliminate the trend directly by differencing Q Find an appropriate stationary time series model for the residuals Method 2 has the advantage that it typically uses fewer parameters and does not rest on the assumption that the trend remains the same throughout the observation time METHOD 12 TREND ESTIMATION BY LEAST SQUARES D A polynomial of degree k in t is posited as a model for mt ie mt 50 it 2t2 quot ktk k 51 j0 D The coe icients parameters are estimated by the least squares method ie nding those 53 that minimize the residual sum of squares n n k 2 RSS Z Xt mt2 Z X ZBjtj i1 2 1 D Computing In R this is done with the function lm Type helplm STATISTICAL ANALYSIS OF TIME SERIES 3 STATISTICS 635 SUMMER 2005 MORE ON LEAST SQUARES ESTIMATION OF TREND X17X27 39 39 39 7XnT7 Y Y17YVZ7 39 39 39 7YnT7 and6 807817827 39 39 39 78kT7 the model can be written in matrix form as XA6Y Where 1 t1 1 A 1 t2 15726 1 tn 1576 The ordinary least squares estimator of 6 is therefore I ATA 1ATX Remark Since the process is typically not an HD noise process the statistical properties of 6 will be different from the the results encountered in basic regression courses METHOD 1 APPLIED TO THE LAKE HURON DATA D The plot of the Lake Huron data set seemed to suggest a linear downward trend so that one could posit the following simple linear regression model mt 50 5115 B One could use the scaled index If 1 2 711 98 or the more meaningful original index If 1875 1876 7 1972 D Some of the R commands used dataLakeHuron Load the Lake Huron time series years lt 18751972 Create a variable named years lmhuronlt lmLakeHuron years Fit the linear regression model plotyearsLakeHuron type quotlquot Xlabquotyearquot ylabquotLevelquot ablinelmhuronlty2 Add the line to Huron plot acf residlmhuron Plot the ACF of the residuals STATISTICAL ANALYSIS OF TIME SERIES 4 STATISTICS 635 SUMMER 2005 D Some of the R commands used summarylmhuron Summarize the model obtained Call lmformula LakeHuron years Residuals Min 1Q Median 3Q MaX 2 509970 0 727260 0000829 0744024 2 535650 Coefficients Estimate Std Error t value Prgtt Intercept 625554918 7764293 80568 lt 2e16 years 0024201 0004036 5996 355e08 Signif codes 0 0001 001 005 01 1 Residual standard error 113 on 96 degrees of freedom Multiple R Squared 02725 Adjusted R squared 02649 F statistic 3595 on 1 and 96 DF p value 3545e 08 D Plot of trend and plot of sample ACF Series residlmhuron 550 581 552 I I I n mm m 221 579 I Am n 575 I 577 I 576 I I I I I I man man mm mm wen n In 2U an AD yea Lag D Judging from the ACF plot is an HD noise process realistic adequate for Yt in the Lake Huron data STATISTICAL ANALYSIS OF TIME SERIES 5 STATISTICS 635 SUMMER 2005 TREND ESTIMATION BY LINEAR FILTERING D General idea of moving average filtering Imagine an operator that for each point Xi in the time series constructs a corresponding point de ned as a weighted average of an in nite number of points around Xt X75 1 X7 Z antj j7oo Q The above is a moving average filter Q The coe icients aj de ne a filter Q The lter is de ned to remove the noise since it averages leaving the smooth part of the series which should be an estimate of the trend Q Different lters a will yield different transforms of the original series B Finite simple moving average filter Imagine the above idea applied locally with the weights a being all equal Clearly that means mapping each Xi to Wt where 1 4 W E X 1 lt t lt t 2q1j7q tj7 n q Q Finite because it is nite and simple because all the weights are equal 2q172q17i 72q1 D Finite simple moving average filter for linear trend 1 1 1 Q With that our lter is simply Q Assume that mt is approximately linear on the interval t qt q Q Assume also that the average of error terms is zero on t q t q ie 1 q Y1m0 2W M Q For the nonseasonal series with trend ie Xi mt Yt we have 1 q 1 4 W E E Y m t 2q117 mt12q117 t mt 1774 1774 for q 1 S t S n q The moving thus provides us with trend estimates 1 q m E X1 Igtgn t iq t j q q STATISTICAL ANALYSIS OF TIME SERIES 6 STATISTICS 635 SUMMER 2005 DEALING WITH BOUNDARY CONDITION D Xi is not observed for t S 0 or t gt 11 Therefore the above summation are not de ned for t S q or t gt n q E In practice one may set Xi X1 for t lt I and Xi Xn for t gt n MOVING AVERAGE FILTERING OF LAKE HURON SERIES dataLakeHuron Load the LakeHuron series 1h lt filterLakeHuron filterrep11111 For q 5 MA lterwith q5 MA lterwith q7 I I I I I I I I I I man 19m mm mm Iaen man man mm mm Iaen TIME TIME MA lterwith q11 MA lterwith q17 a a 582 I I I 580 581 582 I I I mm If set 5 579 I 7 I 578 I 577 I 577 I 576 I 576 I I I I I I I I I man 19m mm mm Iaen man man mm mm Iaen TIME TIME D Notice There is a bias variance tradeoff at play here Q As q increases the estimate of the trend has STATISTICAL ANALYSIS OF TIME SERIES 7 STATISTICS 635 SUMMER 2005 THE SPENCER 15 POINT MA FILTER D This speci c lter is de ned as aj 0 07a17a27a37a47a57a67a7 l 746746213 5 6 3 320 with 0 1 07 gt 7 and a1 aij lt 7 E Special feature This lter passes through polynomials of degree 3 without distortion Q Provide an explanation in problem 12 EXPONENTIAL SMOOTHING D De nition For any xed oz 6 01 de ne m t 12 n mt OZXt1 Olmt1 t2737quot397n and m X1 D The recursion implies mt OZlt1 athj OZVTle t2 2 D Clearly that de nes a weighted moving average of Xi XtL1 D The moving is clearly one sided unlike the ones encountered earlier D The weights of are al ozj and since oz 6 0 1 these weights are decreasing exponentially except for the last one hence the term exponential smoothing STATISTICAL ANALYSIS OF TIME SERIES 8 STATISTICS 635 SUMMER 2005 METHOD 2 DIFFERENCING The method of elimination of trend by differencing makes use of the lag l differ ence operator V given by VXt Xi Xt1 1 BXt where B is the backward shift operator BX XH D Powers of B and V lntuitively meaning several recursive applications of the same operator 31th Xtij Vth VltVj71Xt Z l and VOXt X75 Q Treat then as polynomials Example Power 2 ngt WWX 1 B1 BX 1 2B 36X X 2Xt71 Xt72 D Clearly for Xi mt Yt with mt 2270 cjtj v IX kick v IY Q If is stationary with mean 0 then VkXt is a stationary process with mean kick Q Nice implication Given any series Xi one can keep on applying the lad 1 difference operator until one gets VkXt that is realistically stationary Q This is indeed reasonable since polynomials usually provide decent ap proximations for many functions D In R differencing is easily done by calling the function diff D Homework Exercise 5 Apply the differencing technique to the Lake Huron data plot the acf and comment STATISTICAL ANALYSIS OF TIME SERIES 9 STATISTICS 635 SUMMER 2005 ESTIMATION OF A SEASONAL MODEL WITH TREND D We now consider the model with both seasonal effect and trend XtmtstYt tl2n with 0 and SHd 5t and 2151 0 D Step 1 Estimation of the trend by an MA lter of length d Q Ifd2ql then A 1 4 Q If d 2g then 1 4 1 17q1 D Step 2 Estimation of the seasonal component Q Suppose that n kd Q Compute the average of the deviations wkkjd mkjd7 Qltkden q Q Since these average deviations do not necessarily sum to zero the esti mate of 5k is d kwk d 1ZwZ k12 d i1 and k kd for k gt d D Deseasonalize dtt t7 t1727 7 l L D Re estimate the trend from the deseasonalized data dt by tting a least squares polynomial trend m to dt D Extract an estimate of the noise series as follows YIxtmt5t7 t1727quot397n STATISTICAL ANALYSIS OF TIME SERIES 10 STATISTICS 635 SUMMER 2005 DIFFERENCING OF A SEASONAL MODEL WITH TREND D The lag d differencing operator vdX X XH 1 36X D With Xi mt st Yt where st has period d VdXt mt mtid Yt Ytid D So VdXt admits a decomposition into Q A trend component mt mpd Q A noise term Yt Kid U That s good news because D The trend mt mt7d can be eliminated using methods described earlier Q Application of power of the lag l difference operator V TESTING THE ESTIMATED NOISE SEQUENCE D For large n and an iid sequence Y1 Y2 Yn with nite variance the sample autocorrelations are approximately iid with distribution N0 1 D The Portmanteau test lts advantage over the HD test mentioned earlier is that here we get to use a single statistic h Q 712 j1 Fiom the fact that N N01 forj 1 h it is clear that Q 26 Therefore reject the iid hypothesis at level oz if Q N X12Qh B Your responsibility is to read about the other techniques in Section 16 STATISTICAL ANALYSIS OF TIME SERIES 11 STATISTICS 635 SUMMER 2005 A TEST FOR IID NOISE USING THE SAMPLE ACF D Note For iid noise with nite variance we have for h y 0 pm N N lt0 D Steps of the diagnostic for iid noise Q Plot the lag h versus Q Draw two horizontal lines at ll96 C These two lines are drawn automatically m R Q You should have about 95 of the the values h 12 within the lines if the noise is indeed iid D Which of the following depicts an HD noise Series X Series X ACF 4 What should we do when the noise process is found to be iid STATISTICAL ANALYSIS OF TIME SERIES 12 An Introduction to R Notes on R A Programming Environment for Data Analysis and Graphics Version 210 2005 04 18 W N Venables D M Smith and the R Development Core Team Copyright 1990 W N Venables Copyright 1992 W N Venables amp D M Smith Copyright 1997 R Gentleman amp R lhaka Copyright 19977 1998 M Maechler Copyright 199972005 R Development Core Team Permission is granted to make and distribute verbatim copies of this manual provided the copy right notice and this permission notice are preserved on all copies Permission is granted to copy and distribute modi ed versions of this manual under the condi tions for verbatim copying7 provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one Permission is granted to copy and distribute translations of this manual into another language7 under the above conditions for modi ed versions7 except that this permission notice may be stated in a translation approved by the R Development Core Team ISBN 3 900051127 Table of Contents Preface 1 1 Introduction and preliminaries 2 11 The R environment 2 12 Related software and documentation 2 13 R and statistics 2 14 R and the window system 3 15 Using R interactively 3 16 An introductory session 4 17 Getting help with functions and features 4 18 R commands7 case sensitivity7 etc 4 19 Recall and correction of previous commands 5 110 Executing commands from or diverting output to a le 5 111 Data permanency and removing objects 5 2 Simple manipulations numbers and vectors 7 21 Vectors and assignment 7 22 Vector arithmetic 7 23 Generating regular sequences 8 24 Logical vectors 9 25 Missing values 9 26 Character vectors 10 27 Index vectors selecting and modifying subsets of a data set 10 28 Other types of objects 11 3 Objects their modes and attributes 13 31 Intrinsic attributes mode and length 13 32 Changing the length of an object 14 33 Getting and setting attributes 14 34 The class of an object 14 4 Ordered and unordered factors 16 41 A speci c example 16 42 The function tapply and ragged arrays 16 43 Ordered factors 17 5 Arrays and matrices 18 51 Arrays 18 52 Array indexing Subsections of an array 18 53 Index arrays 19 54 The array function 20 541 Mixed vector and array arithmetic The recycling rule 20 55 The outer product of two arrays 20 56 Generalized transpose of an array 21 57 Matrix facilities 21 571 Matrix multiplication 22 ii 572 Linear equations and inversion 22 573 Eigenvalues and eigenvectors 22 574 Singular value decomposition and determinants 23 575 Least squares tting and the QR decomposition 23 58 Forming partitioned matrices7 cbindO and rbindO 24 59 The concatenation function7 c 7 with arrays 24 510 Frequency tables from factors 24 6 Lists and data frames 26 61 Lists 26 62 Constructing and modifying lists 26 621 Concatenating lists 27 63 Data frames 27 631 Making data frames 27 632 attach and detach 27 633 Working with data frames 28 634 Attaching arbitrary lists 28 635 Managing the search path 29 7 Reading data from les 30 71 The readtable function 30 72 The scan function 31 73 Accessing builtin datasets 31 731 Loading data from other R packages 31 74 Editing data 32 8 Probability distributions 33 81 R as a set of statistical tables 33 82 Examining the distribution of a set of data 33 83 One and two sample tests 36 9 Grouping loops and conditional execution 40 91 Grouped expressions 40 92 Control statements iii 11 Statistical models in R 50 111 De ning statistical models formulae 50 1111 Contrasts 52 112 Linear models 53 113 Generic functions for extracting model information 53 114 Analysis of variance and model comparison 54 1141 ANOVA tables 54 115 Updating tted models 54 116 Generalized linear models 55 1161 Families 55 1162 The glmO function 56 117 Nonlinear least squares and maximum likelihood models 58 1171 Least squares 58 1172 Maximum likelihood 59 118 Some non standard models 60 12 Graphical procedures 61 121 High level plotting commands 61 1211 The plot function 61 1212 Displaying multivariate data 62 1213 Display graphics 62 1214 Arguments to high level plotting functions 63 122 Low level plotting commands 64 1221 Mathematical annotation 65 1222 Hershey vector fonts 65 123 Interacting with graphics 65 124 Using graphics parameters 66 1241 Permanent changes The par function 66 1242 Temporary changes Arguments to graphics functions 67 125 Graphics parameters list 67 1251 Graphical elements 68 1252 Axes and tick marks 68 1253 Figure margins 69 1254 Multiple gure environment 70 126 Device drivers 71 1261 PostScript diagrams for typeset documents 72 1262 Multiple graphics devices 72 127 Dynamic graphics 73 13 Packages 74 131 Standard packages 74 132 Contributed packages and ORAN 74 133 Namespaces 74 Appendix A A sample session 76 Appendix B Invoking R 79 B1 lnvoking R from the command line 79 B2 lnvoking R under Windows 82 B3 lnvoking R under Mac OS X 82 iV Appendix C The commandline editor 83 01 Preliminaries 83 02 Editing actions 83 03 Command line editor summary 83 Appendix D Function and variable index 85 Appendix E Concept index 88 Appendix F References 90 Preface 1 Preface This introduction to R is derived from an original set of notes describing the S and S PLUS environments written by Bill Venables and David M Smith Insightful Corporation We have made a number of small changes to re ect differences between the R and S programs and expanded some of the material We would like to extend warm thanks to Bill Venables and David Smith for granting permission to distribute this modi ed version of the notes in this way and for being a supporter of R from way back Comments and corrections are always welcome Please address email correspondence to Rcore Rprojectorg Suggestions to the reader Most R novices will start with the introductory session in Appendix A This should give some familiarity with the style of R sessions and more importantly some instant feedback on what actually happens Many users will come to R mainly for its graphical facilities In this case Chapter 12 Graphics page 61 on the graphics facilities can be read at almost any time and need not wait until all the preceding sections have been digested Chapter 1 Introduction and preliminaries 2 1 Introduction and preliminaries 11 The R environment R is an integrated suite of software facilities for data manipulation calculation and graphical display Among other things it has 0 an effective data handling and storage facility 0 a suite of operators for calculations on arrays in particular matrices a large coherent integrated collection of intermediate tools for data analysis graphical facilities for data analysis and display either directly at the computer or on hard copy and a well developed simple and effective programming language called S which includes conditionals loops user de ned recursive functions and input and output facilities Indeed most of the system supplied functions are themselves written in the S language The term environmen 7 is intended to characterize it as a fully planned and coherent system rather than an incremental accretion of very speci c and in exible tools as is frequently the case with other data analysis software R is very much a vehicle for newly developing methods of interactive data analysis It has developed rapidly and has been extended by a large collection of packages However most programs written in R are essentially ephemeral written for a single piece of data analysis 12 Related software and documentation R can be regarded as an implementation of the S language which was developed at Bell Labora tories by Rick Becker John Chambers and Allan Wilks and also forms the basis of the S PLUS systems The evolution of the S language is characterized by four books by John Chambers and coauthors For R the basic reference is The New S T M 39 A P g 39 g r39 39 t for Data Analysis and Graphics by Richard A Becker John M Chambers and Allan R Wilks The new features of the 1991 release of S are covered in Statistical Models in S edited by John M Chambers and Trevor J Hastie The formal methods and classes of the methods package are based on those described in Programming with Data by John M Chambers See Appendix F References page 90 for precise references There are now a number of books which describe how to use R for data analysis and statistics and documentation for SS PLUS can typically be used with R keeping the differences between the S implementations in mind See section What documentation exists for R77 in The R statistical system FAQ 13 R and statistics Our introduction to the R environment did not mention statistics yet many people use R as a statistics system We prefer to think of it of an environment within which many classical and modern statistical techniques have been implemented A few of these are built into the base R environment but many are supplied as packages There are about 25 packages supplied with R called standard and recommended packages and many more are available through the CRAN family of Internet sites via http CRMIRprojevtorg and elsewhere More details on packages are given later see Chapter 13 Packages page 7 4 Most classical statistics and much of the latest methodology is available for use with R but users may need to be prepared to do a little work to nd it Chapter 1 Introduction and preliminaries 3 There is an important difference in philosophy between S and hence R and the other main statistical systems In S a statistical analysis is normally done as a series of steps with intermediate results being stored in objects Thus whereas SAS and SPSS will give copious output from a regression or discriminant analysis R will give minimal output and store the results in a t object for subsequent interrogation by further R functions 14 R and the window system The most convenient way to use R is at a graphics workstation running a windowing system This guide is aimed at users who have this facility In particular we will occasionally refer to the use of R on an X window system although the vast bulk of what is said applies generally to any implementation of the R environment Most users will nd it necessary to interact directly with the operating system on their computer from time to time In this guide we mainly discuss interaction with the operating system on UNIX machines If you are running R under Windows or MacOS you will need to make some small adjustments Setting up a workstation to take full advantage of the customizable features of R is a straight forward if somewhat tedious procedure and will not be considered further here Users in dif culty should seek local expert help 15 Using R interactively When you use the R program it issues a prompt when it expects input commands The default prompt is gt which on UNIX might be the same as the shell prompt and so it may appear that nothing is happening However as we shall see it is easy to change to a different R prompt if you wish We will assume that the UNIX shell prompt is In using R under UNIX the suggested procedure for the rst occasion is as follows 1 Create a separate sub directory say work to hold data les on which you will use R for this problem This will be the working directory whenever you use R for this particular problem mkdir work Cd work Start the R program with the command R At this point R commands may be issued see later to F To quit the R program the command is gt 10 At this point you will be asked whether you want to save the data from your R session On some systems this will bring up a dialog box and on others you will receive a text prompt to which you can respond yes no or cancel a single letter abbreviation will do to save the data before quitting quit without saving or return to the R session Data which is saved will be available in future R sessions Further R sessions are simple 1 Make work the working directory and start the program as before Cd work R 2 Use the R program terminating with the q command at the end of the session To use R under Windows the procedure to follow is basically the same Create a folder as the working directory and set that in the Start In7 eld in your R shortcut Then launch R by double clicking on the icon Chapter 1 Introduction and preliminaries 4 16 An introductory session Readers wishing to get a feel for R at a computer before proceeding are strongly advised to work through the introductory session given in Appendix A A sample session page 76 17 Getting help with functions and features R has an inbuilt help facility similar to the man facility of UNIX To get more information on any speci c named function for example solve the command is gt helpsolve An alternative is gt solve For a feature speci ed by special characters the argument must be enclosed in double or single quotes making it a character string This is also necessary for a few words with syntactic meaning including if for and function gt helpquot IIquot Either form of quote mark may be used to escape the other as in the string quotIt 5 important Our convention is to use double quote marks for preference On most R installations help is available in HTML format by running gt help start which will launch a Web browser that allows the help pages to be browsed with hyperlinks On UNIX subsequent help requests are sent to the HTML based help system The Search Engine and Keywords7 link in the page loaded by helpstart is particularly useful as it is contains a high level concept list which searches though available functions It can be a great way to get your bearings quickly and to understand the breadth of what R has to offer The help search command allows searching for help in various ways try help search for details and examples The examples on a help topic can normally be run by gt example topic Windows versions of R have other optional help systems use gt help for further details 18 R commands case sensitivity etc Technically R is an expression language with a very simple syntax It is case sensitive as are most UNIX based packages so A and a are different symbols and would refer to different variables The set of symbols which can be used in R names depends on the operating system and country within which R is being run technically on the locale in use Normally all alphanumeric symbols are allowed and in some countries this includes accented letters plus and with the restriction that a name must start with or a letter and if it starts with the second character must not be a digit Elementary commands consist of either p 39 or 39 If an e piession is given as a command it is evaluated printed and the value is lost An assignment also evaluates an expression and passes the value to a variable but the result is not automatically printed Commands are separated either by a semi colon 7 or by a newline Elementary commands can be grouped together into one compound expression by braces T and Comments can be put almost 1 anywhere starting with a hashmark everything to the end of the line is a comment 1 not inside strings nor within the argument list of a function de nition Chapter 1 Introduction and preliminaries 5 If a command is not complete at the end of a line R will give a different prompt by default on second and subsequent lines and continue to read input until the command is syntactically complete This prompt may be changed by the user We will generally omit the continuation prompt and indicate continuation by simple indenting 19 Recall and correction of previous commands Under many versions of UNIX and on Windows R provides a mechanism for recalling and re executing previous commands The vertical arrow keys on the keyboard can be used to scroll forward and backward through a command history Once a command is located in this way the cursor can be moved within the command using the horizontal arrow keys and characters can be removed with the key or added with the other keys More details are provided later see Appendix C The command line editor page 83 The recall and editing capabilities under UNIX are highly customizable You can nd out how to do this by reading the manual entry for the readline library Alternatively the Emacs text editor provides more general support mechanisms via ESS Emacs Speaks Statistics for working interactively with R See section R and Emacs77 in The R statistical system FAQ 110 Executing commands from or diverting output to a le If commands are stored in an external le say commandsR in the working directory work they may be executed at any time in an R session with the command gt sourcequotcommandsRquot For Windows Source is also available on the File menu The function sink gt sinkquotrecordlisquot will divert all subsequent output from the console to an external le recordlis The com mand gt sinkO restores it to the console once again 111 Data permanency and removing objects The entities that R creates and manipulates are known as objects These may be variables arrays of numbers character strings functions or more general structures built from such components During an R session objects are created and stored by name we discuss this process in the next session The R command gt ob j ects O alternatively ls can be used to display the names of most of the objects which are currently stored within R The collection of objects currently stored is called the workspace To remove objects the function rm is available gt rmx y z ink junk temp foo bar All objects created during an R sessions can be stored permanently in a le for use in future R sessions At the end of each R session you are given the opportunity to save all the currently available objects If you indicate that you want to do this the objects are written to a le called RData 2 in the current directory and the command lines used in the session are saved to a le called Rhistory 2 The leading dot in this le name makes it invisible in normal le listings in UNIX Chapter 1 Introduction and preliminaries 6 When R is started at later time from the same directory it reloads the workspace from this le At the same time the associated commands history is reloaded It is recommended that you should use separate working directories for analyses conducted with R It is quite common for objects With names x and y to be created during an analysis Names like this are often meaningful in the context of a single analysis7 but it can be quite hard to decide what they might be when the several analyses have been conducted in the same directory Chapter 2 Simple manipulations numbers and vectors 7 2 Simple manipulations numbers and vectors 21 Vectors and assignment R operates on named data structures The simplest such structure is the numeric uector which is a single entity consisting of an ordered collection of numbers To set up a vector named x say consisting of ve numbers namely 104 56 31 64 and 217 use the R command gt x lt c104 56 31 64 217 This is an assignment statement using the function c which in this context can take an arbitrary number of vector arguments and whose value is a vector got by concatenating its arguments end to end1 A number occurring by itself in an expression is taken as a vector of length one Notice that the assignment operator lt which consists of the two characters lt less than and minus occurring strictly sideby side and it points to the object receiving the value of the expression In most contexts the operator can be used as a alternative Assignment can also be made using the function assign An equivalent way of making the same assignment as above is with gt assignquotxquot c104 56 31 64 217 The usual operator lt can be thought of as a syntactic short cut to this Assignments can also be made in the other direction using the obvious change in the assign ment operator So the same assignment could be made using gt c104 56 31 64 217 gt x If an expression is used as a complete command the value is printed and lost2 So now if we were to use the command gt 1 the reciprocals of the ve values would be printed at the terminal and the value of x of course unchanged The further assignment gt y lt cx 0 X would create a vector y with 11 entries consisting of two copies of x with a zero in the middle place 22 Vector arithmetic Vectors can be used in arithmetic expressions in which case the operations are performed element by element Vectors occurring in the same expression need not all be of the same length If they are not the value of the expression is a vector with the same length as the longest vector which occurs in the expression Shorter vectors in the expression are recycled as often as need be perhaps fractionally until they match the length of the longest vector In particular a constant is simply repeated So with the above assignments the command gt v lt 2x y 1 generates a new vector v of length 11 constructed by adding together element by element 2 repeated 22 times y repeated just once and 1 repeated 11 times The elementary arithmetic operators are the usual and quot for raising to a power In addition all of the common arithmetic functions are available log exp sin cos tan sqrt 1 With other than Vector types of argument such as list mode arguments the action of CO is rather different See Section 621 Concatenating lists page 27 2 Actually it is still available as Lastvalue before any other statements are executed Chapter 2 Simple manipulations numbers and vectors 8 and so on all have their usual meaning max and min select the largest and smallest elements of a vector respectively range is a function whose value is a vector of length two namely 6 minx maxx lengthx is the number of elements in x sumx gives the total of the elements in x and prodx their product Two statistical functions are meanx which calculates the sample mean which is the same as sumxlengthx and varx which gives sum xmeanx quot2 lengthx 1 or sample variance If the argument to varO is an n by p matrix the value is a p by p sample covariance matrix got by regarding the rows as independent p variate sample vectors sort x returns a vector of the same size as x with the elements arranged in increasing order however there are other more exible sorting facilities available see order or sortlist which produce a permutation to do the sorting Note that max and min select the largest and smallest values in their arguments even if they are given several vectors The parallel maximum and minimum functions pmax and pmin return a vector of length equal to their longest argument that contains in each element the largest smallest element in that position in any of the input vectors For most purposes the user will not be concerned if the numbers in a numeric vector are integers reals or even complex lnternally calculations are done as double precision real numbers or double precision complex numbers if the input data are complex To work with complex numbers supply an explicit complex part Thus sqrt 17 will give NaN and a warning but sqrt 170i will do the computations as complex numbers 23 Generating regular sequences R has a number of facilities for generating commonly used sequences of numbers For example 1 30 is the vector Cl 2 29 80 The colon operator has highest priority within an ex pression so for example 21 15 is the vector C2 4 28 30 Put n lt 10 and compare the sequences 1 n 1 and 1 n 1 The construction 30 1 may be used to generate a sequence backwards The function seq is a more general facility for generating sequences It has ve arguments only some of which may be speci ed in any one call The rst two arguments if given specify the beginning and end of the sequence and if these are the only two arguments given the result is the same as the colon operator That is seq2 10 is the same vector as 2 10 Parameters to ser and to many other R functions can also be given in named form in which case the order in which they appear is irrelevant The rst two parameters may be named fromva1ue and toVa1ue thus seq130 seqfrom1 to30 and seqto30 from1 are all the same as 130 The next two parameters to seq may be named byva1ue and lengthva1ue which specify a step size and a length for the sequence respectively If neither of these is given the default by1 is assumed For example gt seq5 5 by2 gt SS generates in 3 the vector c 50 48 46 46 48 50 Similarly gt 54 lt seqlength51 from5 by2 generates the same vector in s4 Chapter 2 Simple manipulations numbers and vectors 9 The fth parameter may be named alongvector which if used must be the only parameter and creates a sequence 1 2 lengthvector or the empty sequence if the vector is empty as it can be A related function is repO which can be used for replicating an object in various complicated ways The simplest form is gt 55 lt repx times5 which will put ve copies of x end to end in 55 Another useful version is gt 56 lt repx each5 which repeats each element of x ve times before moving on to the next 24 Logical vectors As well as numerical vectors R allows manipulation of logical quantities The elements of a logical vector can have the values TRUE FALSE and NA for not available see below The rst two are often abbreviated as T and F respectively Note however that T and F are just variables which are set to TRUE and FALSE by default but are not reserved words and hence can be overwritten by the user Hence you should always use TRUE and FALSE Logical vectors are generated by conditions For example gt temp lt x gt 13 sets temp as a vector of the same length as x with values FALSE corresponding to elements of x where the condition is not met and TRUE where it is The logical operators are lt lt gt gt for exact equality and for inequality In addition if c1 and c2 are logical expressions then c1 amp c2 is their intersection and cl c2 is their union or and c1 is the negation of c1 Logical vectors may be used in ordinary arithmetic in which case they are coerced into numeric vectors FALSE becoming O and TRUE becoming 1 However there are situations where logical vectors and their coerced numeric counterparts are not equivalent for example see the next subsection 25 Missing values In some cases the components of a vector may not be completely known When an element or value is not available or a missing value in the statistical sense a place within a vector may be reserved for it by assigning it the special value NA In general any operation on an NA becomes an NA The motivation for this rule is simply that if the speci cation of an operation is incomplete the result cannot be known and hence is not available The function isnax gives a logical vector of the same size as x with value TRUE if and only if the corresponding element in x is NA gt z lt c13NA ind lt isnaz Notice that the logical expression x NA is quite different from isnax since NA is not really a value but a marker for a quantity that is not available Thus x NA is a vector of the same length as x all of whose values are NA as the logical expression itself is incomplete and hence undecidable Note that there is a second kind of missing values which are produced by numerical com putation the so called Not a Number NaN values Examples are gt 00 Chapter 2 Simple manipulations numbers and vectors 10 gt Inf Inf which both give NaN since the result cannot be de ned sensibly In summary isnaxx is TRUE both for NA and NaN values To differentiate these is nanxx is only TRUE for NaNs Missing values are sometimes printed as ltNAgt when character vectors are printed without quotes 26 Character vectors Character quantities and character vectors are used frequently in R for example as plot labels Where needed they are denoted by a sequence of characters delimited by the double quote character eg quotxvaluesquot quotNew iteration resultsquot Character strings are entered using either double quot or single quotes but are printed using double quotes or sometimes without quotes They use C style escape sequences using as the escape character so is entered and printed as and inside double quotes quot is entered as quot Other useful escape sequences are n newline t tab and b backspace Character vectors may be concatenated into a vector by the c function examples of their use will emerge frequently The paste function takes an arbitrary number of arguments and concatenates them one by one into character strings Any numbers given among the arguments are coerced into character strings in the evident way that is in the same way they would be if they were printed The arguments are by default separated in the result by a single blank character but this can be changed by the named parameter sepstring which changes it to string possibly empty For example gt labs lt pastecquotXquot quotYquot 110 sepquotquot makes labs into the character vector CquotXlquot quotY2quot quotX3quot quotY4quot quotX5quot quotY6quot quotX7quot quotY8quot quotX9quot quotY10quot Note particularly that recycling of short lists takes place here too thus CquotXquot quotYquot is repeated 5 times to match the sequence 1103 27 Index vectors selecting and modifying subsets of a data set Subsets of the elements of a vector may be selected by appending to the name of the vector an index vector in square brackets More generally any expression that evaluates to a vector may have subsets of its elements similarly selected by appending an index vector in square brackets immediately after the expression Such index vectors can be any of four distinct types 1 A logical vector In this case the index vector must be of the same length as the vector from which elements are to be selected Values corresponding to TRUE in the index vector are selected and those corresponding to FALSE are omitted For example gt y lt xisnax creates or recreates an object y which will contain the non missing values of x in the same order Note that if x has missing values y will be shorter than x Also gt x1 isnax amp xgt0 gt 2 creates an object 2 and places in it the values of the vector x1 for which the corresponding value in x was both non missing and positive t collapsess joins the arguments into a single Character string putting ss in between There are pas s more tools for Character manipulation see the help for sub and substring Chapter 2 Simple manipulations numbers and vectors 11 2 A vector of positive integral quantities In this case the values in the index vector must lie in the set 1 2 lengthx The corresponding elements of the vector are selected and concatenated in that order in the result The index vector can be of any length and the result is of the same length as the index vector For example x 6 is the sixth component of x and gt x 1 10 selects the rst 10 elements of x assuming lengthx is not less than 10 Also gt cquotxquotquotyquot repc1221 times4 an admittedly unlikely thing to do produces a character vector of length 16 consisting of quotxquot quotyquot quotyquot quotxquot repeated four times OJ A vector of negative integral quantities Such an index vector speci es the values to be excluded rather than included Thus gt y lt x 15 gives y all but the rst ve elements of x q A vector of character strings This possibility only applies where an object has a names attribute to identify its components In this case a sub vector of the names vector may be used in the same way as the positive integral labels in item 2 further above gt fruit lt 65 10 1 20 gt namesfruit lt Cquotorangequot quotbananaquot quotapplequot quotpeachquot gt lunch lt fruit Cquotapplequot quotorangequot The advantage is that alphanumeric names are often easier to remember than numeric indices This option is particularly useful in connection with data frames as we shall see later An indexed expression can also appear on the receiving end of an assignment in which case the assignment operation is performed only on those elements of the vector The expression must be of the form vector indexvect0r as having an arbitrary expression in place of the vector name does not make much sense here The vector assigned must match the length of the index vector and in the case of a logical index vector it must again be the same length as the vector it is indexing For example gt xisnax lt 0 replaces any missing values in x by zeros and gt yy lt O lt yy lt O has the same effect as gt y lt absy 28 Other types of objects Vectors are the most important type of object in R but there are several others which we will meet more formally in later sections 0 matrices or more generally arrays are multi dimensional generalizations of vectors In fact they are vectors that can be indexed by two or more indices and will be printed in special ways See Chapter 5 Arrays and matrices page 18 0 factors provide compact ways to handle categorical data See Chapter 4 Factors page 16 0 lists are a general form of vector in which the various elements need not be of the same type and are often themselves vectors or lists Lists provide a convenient way to return the results of a statistical computation See Section 61 Lists page 26 Chapter 2 Simple manipulations numbers and vectors 12 0 data frames are matrix like structures in which the columns can be of different types Think of data frames as data matrices7 with one row per observational unit but with possibly both numerical and categorical variables Many experiments are best described by data frames the treatments are categorical but the response is numeric See Section 63 Data frames page 27 0 functions are themselves objects in R which can be stored in the project s workspace This provides a simple and convenient way to extend R See Chapter 10 Writing your own functions page 42 Chapter 3 Objects their modes and attributes 13 3 Objects their modes and attributes 31 Intrinsic attributes mode and length The entities R operates on are technically known as objects Examples are vectors of numeric real or complex values vectors of logical values and vectors of character strings These are known as atomic structures since their components are all of the same type or mode namely numericl complex logical character and raw Vectors must have their values all of the same mode Thus any given vector must be un ambiguously either logical numeric complex character or raw The only apparent exception to this rule is the special value listed as NA for quantities not available but in fact there are several types of NA Note that a vector can be empty and still have a mode For example the empty character string vector is listed as character0 and the empty numeric vector as numeri c O R also operates on objects called lists which are of mode list These are ordered sequences of objects which individually can be of any mode lists are known as recursive rather than atomic structures since their components can themselves be lists in their own right The other recursive structures are those of mode function and expression Functions are the objects that form part of the R system along with similar user written functions which we discuss in some detail later Expressions as objects form an advanced part of R which will not be discussed in this guide except indirectly when we discuss formulae used with modeling in R By the mode of an object we mean the basic type of its fundamental constituents This is a special case of a property of an object Another property of every object is its length The functions modeobject and lengthobject can be used to nd out the mode and length of any de ned structure2 Further properties of an object are usually provided by attributesobject see Section 33 Getting and setting attributes page 14 Because of this mode and length are also called intrinsic attributes of an object For example if z is a complex vector of length 100 then in an expression modez is the character string complex and lengthz is 100 R caters for changes of mode almost anywhere it could be considered sensible to do so and a few where it might not be For example with gt z lt 09 we could put gt digits lt ascharacterz after which digits is the character vector cquot0quot quot1quot quot2quot quot9quot A further coercion or change of mode reconstructs the numerical vector again gt d lt asintegerdigits Now d and z are the same3 There is a large collection of functions of the form as something for either coercion from one mode to another or for investing an object with some other attribute it may not already possess The reader should consult the different help les to become familiar with them 1 numeric mode is actually an amalgam of two distinct modes namely integer and double precision as explained in the manual 2 Note however that lengthobject does not always contain intrinsic useful information eg when object is a function 3 In general coercion from numeric to character and back again will not be exactly reversible because of roundoif errors in the character representation Chapter 3 Objects their modes and attributes 14 32 Changing the length of an object An empty object may still have a mode For example gt e lt numeric makes e an empty vector structure of mode numeric Similarly character is a empty character vector and so on Once an object of any size has been created new components may be added to it simply by giving it an index value outside its previous range Thus gt e 3 lt 17 now makes e a vector of length 3 the rst two components of which are at this point both NA This applies to any structure at all provided the mode of the additional components agrees with the mode of the object in the rst place This automatic adjustment of lengths of an object is used often for example in the scan function for input see Section 72 The scan function page 31 Conversely to truncate the size of an object requires only an assignment to do so Hence if alpha is an object of length 10 then gt alpha lt alpha 2 1 5 makes it an object of length 5 consisting of just the former components with even index The old indices are not retained of course We can then retain just the rst three values by gt lengthalpha lt 3 and vectors can be extended by missing values in the same way 33 Getting and setting attributes The function attributes object gives a list of all the non intrinsic attributes currently de ned for that object The function attrobject name can be used to select a speci c attribute These functions are rarely used except in rather special circumstances when some new attribute is being created for some particular purpose for example to associate a creation date or an operator with an R object The concept however is very important Some care should be exercised when assigning or deleting attributes since they are an integral part of the object system used in R When it is used on the left hand side of an assignment it can be used either to associate a new attribute with object or to change an existing one For example gt attrz quotdim lt c 10 10 allows R to treat 2 as if it were a 10 by 10 matrix 34 The class of an object All objects in R have a class reported by the function class For simple vectors this is just the mode for example numeric logical character or quotlistquot but matrix array factor and dataframe are other possible values A special attribute known as the class of the object is used to allow for an object oriented style4 of programming in R For example if an object has class data frame it will be printed in a certain way the plot function will display it graphically in a certain way and other so called generic functions such as summary will react to it as an argument in a way sensitive to its class To remove temporarily the effects of class use the function unclass O For example if winter has the class data frame then 4 A diHerent style using formal or S4 classes is provided in package methods Chapter 3 Objects their modes and attributes 15 gt winter will print it in data frame form which is rather like a matrix whereas gt unclasswinter will print it as an ordinary list Only in rather special situations do you need to use this facility but one is when you are learning to come to terms with the idea of class and generic functions Generic functions and classes will be discussed further in Section 109 Object orientation page 48 but only brie y Chapter 4 Ordered and unordered factors 16 4 Ordered and unordered factors A factor is a vector object used to specify a discrete classi cation grouping of the components of other vectors of the same length R provides both ordered and unordered factors While the real application of factors is with model formulae see Section 1111 Contrasts page 52 we here look at a speci c example 41 A speci c example Suppose for example we have a sample of 30 tax accountants from all the states and territories of Australia1 and their individual state of origin is speci ed by a character vector of state mnemonics as gt state lt Cntasn quotsaw nqldn vvnswn vvnswn nntn quotwan quotwan nqldn quotVicquot nnsw quotVicquot nqldn nqldn quotsaw quottasquot 5 5 5 5 5 5 5 5 saw nntn quotwan quotVicquot nqldn vvnswn vvnswn quotwan saw quotactquot vvnswn quotVicquot quotVicquot nactn 5 Notice that in the case of a character vector sorted means sorted in alphabetical order A factor is similarly created using the factor function gt statef lt factorstate The print function handles factors slightly differently from other objects gt statef 1 tas sa qld nsw nsw nt wa wa qld Vic nsw Vic qld qld sa 16 tas sa nt wa Vic qld nsw nsw wa sa act nsw Vic Vic act Levels act nsw nt qld sa tas Vic wa To nd out the levels of a factor the function levels can be used gt levelsstatef quotactquot quot11st quotmtquot qudn quotsaw quottasquot quotVicquot quotwan 42 The function tapplyO and ragged arrays To continue the previous example suppose we have the incomes of the same tax accountants in another vector in suitably large units of money gt incomes lt c60 49 40 61 64 60 59 54 62 69 70 42 56 61 61 61 58 51 48 65 49 49 41 48 52 46 59 46 58 43 To calculate the sample mean income for each state we can now use the special function tapplyO gt incmeans lt tapplyincomes statef mean giving a means vector with the components labelled by the levels act nsw nt 1d sa tas Vic wa 44 500 57 333 55 500 53 600 55 000 60 500 56 000 52 250 The function tapply is used to apply a function here mean to each group of components of the rst argument here incomes de ned by the levels of the second component here statef2 1 Readers should note that there are eight states and territories in Australia namely the Australian Capital Territory New South Wales the Northern Territory Queensland South Australia Tasmania Victoria and Western Australia 2 Note that tapplyO also works in this case when its second argument is not a factor eg tapplyincomes state 7 and this is true for quite a few other functions since arguments are coerced to factors when necessary using asfactor Chapter 4 Ordered and unordered factors 17 as if they were separate vector structures The result is a structure of the same length as the levels attribute of the factor containing the results The reader should consult the help document for more details Suppose further we needed to calculate the standard errors of the state income means To do this we need to write an R function to calculate the standard error for any given vector Since there is an builtin function varO to calculate the sample variance such a function is a very simple one liner speci ed by the assignment gt stderr lt functionx sqrt var x lengthx Writing functions will be considered later in Chapter 10 Writing your own functions page 42 and in this case was unnecessary as R also has a builtin function sd After this assignment the standard errors are calculated by gt incster lt tapplyincomes statef stderr and the values calculated are then gt incster act nsw nt 1d sa tas Vic wa 15 43102 45 41061 27386 05 5244 26575 As an exercise you may care to nd the usual 95 con dence limits for the state mean incomes To do this you could use tapply once more with the length function to nd the sample sizes and the qt function to nd the percentage points of the appropriate 25 distributions You could also investigate R s facilities for t tests The function tapply can also be used to handle more complicated indexing of a vector by multiple categories For example we might wish to split the tax accountants by both state and sex However in this simple instance just one factor what happens can be thought of as follows The values in the vector are collected into groups corresponding to the distinct entries in the factor The function is then applied to each of these groups individually The value is a vector of function results labelled by the levels attribute of the factor The combination of a vector and a labelling factor is an example of what is sometimes called a rugged army since the subclass sizes are possibly irregular When the subclass sizes are all the same the indexing may be done implicitly and much more ef ciently as we see in the next section 43 Ordered factors The levels of factors are stored in alphabetical order or in the order they were speci ed to factor if they were speci ed explicitly Sometimes the levels will have a natural ordering that we want to record and want our statistical analysis to make use of The orderedO function creates such ordered factors but is otherwise identical to factor For most purposes the only difference between ordered and unordered factors is that the former are printed showing the ordering of the levels but the contrasts generated for them in tting linear models are different Chapter 5 Arrays and matrices 18 5 Arrays and matrices 51 Arrays An array can be considered as a multiply subscripted collection of data entries for example numeric R allows simple facilities for creating and handling arrays and in particular the special case of matrices A dimension vector is a vector of non negative integers If its length is k then the array is k dimensional eg a matrix is a 2 dimensional array The dimensions are indexed from one up to the values given in the dimension vector A vector can be used by R as an array only if it has a dimension vector as its dim attribute Suppose for example 2 is a vector of 1500 elements The assignment gt dimz lt c35100 gives it the dim attribute that allows it to be treated as a 3 by 5 by 100 array Other functions such as matrix and array are available for simpler and more natural looking assignments as we shall see in Section 54 The array function page 20 The values in the data vector give the values in the array in the same order as they would occur in FORTRAN that is column major order77 with the rst subscript moving fastest and the last subscript slowest For example if the dimension vector for an array say a is 6 342 then there are 3 x 4 x 2 24 entries in a and the data vector holds them in the order a1 1 1 a21 1 a242 a342 Arrays can be one dimensional such arrays are usually treated in the same way as vectors including when printing but the exceptions can cause confusion 52 Array indexing Subsections of an array lndividual elements of an array may be referenced by giving the name of the array followed by the subscripts in square brackets separated by commas More generally subsections of an array may be speci ed by giving a sequence of index vectors in place of subscripts however if any index position is given an empty index vector then the full range of that subscript is taken Continuing the previous example a2 is a 4 x 2 array with dimension vector 6 42 and data vector containing the values ca211 a221 a231 a241 a212 a222 a232 a242 in that order a stands for the entire array which is the same as omitting the subscripts entirely and using a alone For any array say Z the dimension vector may be referenced explicitly as dimZ on either side of an assignment Also if an array name is given with just one subscript or index vector then the corresponding values of the data vector only are used in this case the dimension vector is ignored This is not the case however if the single index is not a vector but itself an array as we next discuss Chapter 5 Arrays and matrices 19 53 Index arrays As well as an index vector in any subscript position an array may be used with a single index army in order either to assign a vector of quantities to an irregular collection of elements in the array or to extract an irregular collection as a vector A matrix example makes the process clear In the case of a doubly indexed array an index matrix may be given consisting of two columns and as many rows as desired The entries in the index matrix are the row and column indices for the doubly indexed array Suppose for example we have a 4 by 5 array X and we wish to do the following 0 Extract elements X 1 3 X 2 2 and X 3 1 as a vector structure and 0 Replace these entries in the array X by zeroes In this case we need a 3 by 2 subscript array as in the following example gt x lt array120 dimc 45 Generate a 4 by 5 array gt x 1 2 3 4 5 1 1 5 9 13 17 2 2 6 10 14 18 3 3 7 11 15 19 4 4 8 12 16 20 gt i lt arrayc1331 dimc32 gt i i is a 3 by 2 index array 1 2 1 1 3 2 2 2 3 3 1 gt xi Extract those elements 1 9 6 3 gt xi lt O Replace those elements by zeros gt x 1 2 3 4 5 1 1 5 o 13 17 2 2 0 1o 14 18 3 o 7 11 15 19 4 4 8 12 16 20 gt As a less trivial example suppose we wish to generate an unreduced design matrix for a block design de ned by factors blocks b levels and varieties v levels Further suppose there are n plots in the experiment We could proceed as follows Xb lt matrix0 n b Xv lt matrix0 n v ib lt cbind1n blocks iv lt cbind1n varieties Xb ib lt 1 Xv iv lt 1 gt X lt cbindXb Xv V VVVVV To construct the incidence matrix N say we could use gt N lt crossprodXb Xv However a simpler direct way of producing this matrix is to use table gt N lt tableblocks varieties Chapter 5 Arrays and matrices 20 54 The array function As well as giving a vector structure a dim attribute arrays can be constructed from vectors by the array function which has the form gt Z lt arraydatavector dimvect0r For example if the vector h contains 24 or fewer numbers then the command gt Z lt arrayh dimc342 would use h to set up 3 by 4 by 2 array in Z If the size of h is exactly 24 the result is the same as gt dimZ lt C342 However if h is shorter than 24 its values are recycled from the beginning again to make it up to size 24 see Section 541 The recycling rule page 20 As an extreme but common example gt Z lt array0 c342 makes Z an array of all zeros At this point dimZ stands for the dimension vector 6 342 and Z 1 24 stands for the data vector as it was in h and Z with an empty subscript or Z with no subscript stands for the entire array as an array Arrays may be used in arithmetic expressions and the result is an array formed by element by element operations on the data vector The dim attributes of operands generally need to be the same and this becomes the dimension vector of the result So if A B and C are all similar arrays then gtDlt 2ABC1 makes D a similar array with its data vector being the result of the given element by element operations However the precise rule concerning mixed array and vector calculations has to be considered a little more carefully 541 Mixed vector and array arithmetic The recycling rule The precise rule affecting element by element mixed calculations with vectors and arrays is somewhat quirky and hard to nd in the references From experience we have found the following to be a reliable guide 0 The expression is scanned from left to right Any short vector operands are extended by recycling their values until they match the size of any other operands As long as short vectors and arrays only are encountered the arrays must all have the same dim attribute or an error results Any vector operand longer than a matrix or array operand generates an error If array structures are present and no error or coercion to vector has been precipitated the result is an array structure with the common dim attribute of its array operands 55 The outer product of two arrays An important operation on arrays is the outer product If a and b are two numeric arrays their outer product is an array whose dimension vector is obtained by concatenating their two dimension vectors order is important and whose data vector is got by forming all possible products of elements of the data vector of a with those of b The outer product is formed by the special operator 700 gt ab lt a 700 b An alternative is Chapter 5 Arrays and matrices 21 gt ab lt outera b quotquot The multiplication function can be replaced by an arbitrary function of two variables For example if we wished to evaluate the function m y cosy1 2 over a regular grid of values with z and y coordinates de ned by the R vectors x and y respectively we could proceed as follows gt f lt functionx y cosy1 xquot2 gt z lt outerx y f In particular the outer product of two ordinary vectors is a doubly subscripted array that is a matrix of rank at most 1 Notice that the outer product operator is of course non commutative De ning your own R functions will be considered further in Chapter 10 Writing your own functions page 42 An example Determinants of 2 by 2 singledigit matrices As an arti cial but cute example consider the determinants of 2 by 2 matrices ab cd where each entry is a non negative integer in the range 0 1 9 that is a digit The problem is to nd the determinants ad 7 be of all possible matrices of this form and represent the frequency with which each value occurs as a high density plot This amounts to nding the probability distribution of the determinant if each digit is chosen independently and uniformly at random A neat way of doing this uses the outer function twice gt d lt outer09 09 gt fr lt tableouterd d quotquot gt plot as numeric names fr fr typequothquot xlabquotDeterminantquot ylabquotFrequencyquot Notice the coercion of the names attribute of the frequency table to numeric in order to recover the range of the determinant values The obvious way of doing this problem with for loops to be discussed in Chapter 9 Loops and conditional execution page 40 is so inef cient as to be impractical It is also perhaps surprising that about 1 in 20 such matrices is singular 56 Generalized transpose of an array The function aperma perm may be used to permute an array a The argument perm must be a permutation of the integers 1 k where k is the number of subscripts in a The result of the function is an array of the same size as a but with old dimension given by perm j becoming the new j th dimension The easiest way to think of this operation is as a generalization of transposition for matrices Indeed if A is a matrix that is a doubly subscripted array then B given by gt B lt apermA c21 is just the transpose of A For this special case a simpler function 00 is available so we could have used B lt tA 57 Matrix facilities As noted above a matrix is just an array with two subscripts However it is such an important special case it needs a separate discussion R contains many operators and functions that are available only for matrices For example tX is the matrix transpose function as noted above The functions nrowA and ncolA give the number of rows and columns in the matrix A respectively Chapter 5 Arrays and matrices 22 571 Matrix multiplication The operator is used for matrix multiplication An 71 by 1 or 1 by 71 matrix may of course be used as an n vector if in the context such is appropriate Conversely vectors which occur in matrix multiplication expressions are automatically promoted either to row or column vectors whichever is multiplicatively coherent if possible although this is not always unambiguously possible as we see later If for example A and B are square matrices of the same size then gt A B is the matrix of element by element products and gt A 796 B is the matrix product If x is a vector then gt x A x is a quadratic form L The function crossprod forms crossproducts meaning that crossprodX y is the same as tX y but the operation is more efficient If the second argument to crossprod is omitted it is taken to be the same as the rst The meaning of diagO depends on its argument diagv where v is a vector gives a diagonal matrix with elements of the vector as the diagonal entries On the other hand di agM where M is a matrix gives the vector of main diagonal entries of M This is the same convention as that used for diagO in MATLAB Also somewhat confusingly if k is a single numeric value then diag k is the k by k identity matrix 572 Linear equations and inversion Solving linear equations is the inverse of matrix multiplication When after gt b lt A x only A and b are given the vector x is the solution of that linear equation system In R gt solveAb solves the system returning x up to some accuracy loss Note that in linear algebra formally x A lb where A 1 denotes the inverse of A which can be computed by solve A but rarely is needed Numerically it is both inefficient and potentially unstable to compute x lt solve A b instead of solveAb The quadratic form x A lx which is used in multivariate computations should be computed by something like x solve Ax rather than computing the inverse of A 573 Eigenvalues and eigenvectors 1 The function eigenSm t the and s of a symmetric matrix Sm The result of this function is a list of two components named values and vectors The assignment 1 Note that x 3939 x is ambiguous as it could mean either xx or xx where x is the column form In such cases the smaller matrix seems implicitly to be the interpretation adopted so the scalar xx is in this case the result The matrix xx may be calculated either by cbindx 3939 x or x 3939 rbindx since the result of rbindo 0r cbindo is always a matrix However the best way to compute xx or xx is crossprodx or x 39o39 K respectively 0 EVen better would be to form a matrix square root B with A BB and nd the squared length of the solution of By x perhaps using the Cholesky or eigendecomposition of A Chapter 5 Arrays and matrices 23 gt ev lt eigenSm will assign this list to ev Then evval is the vector of eigenvalues of Sm and evvec is the matrix of corresponding eigenvectors Had we only needed the eigenvalues we could have used the assignment gt evals lt eigenSmvalues evals now holds the vector of eigenvalues and the second component is discarded If the expression gt eigenSm is used by itself as a command the two components are printed with their names For large matrices it is better to avoid computing the eigenvectors if they are not needed by using the expression gt evals lt eigenSm onlyvalues TRUEvalues 574 Singular value decomposition and determinants The function svdM takes an arbitrary matrix argument M and calculates the singular value decomposition of M This consists of a matrix of orthonormal columns U with the same column space as M a second matrix of orthonormal columns V whose column space is the row space of M and a diagonal matrix of positive entries D such that M U D tV D is actually returned as a vector of the diagonal elements The result of svdM is actually a list of three components named d u and v with evident meanings If M is in fact square then it is not hard to see that gt absdetM lt prodsvdMd calculates the absolute value of the determinant of M If this calculation were needed often with a variety of matrices it could be de ned as an R function gt absdet lt functionM prodsvdMd after which we could use absdet as just another R function As a further trivial but potentially useful example you might like to consider writing a function say trO to calculate the trace of a square matrix Hint You will not need to use an explicit loop Look again at the diagO function R has a builtin function det to calculate a determinant including the sign and another determinant to give the sign and modulus optionally on log scale 575 Least squares tting and the QR decomposition The function lsfitO returns a list giving results of a least squares tting procedure An assignment such as gt ans lt lsfitX y gives the results of a least squares t where y is the vector of observations and X is the design matrix See the help facility for more details and also for the follow up function lsdiag for among other things regression diagnostics Note that a grand mean term is automatically in cluded and need not be included explicitly as a column of X Further note that you almost always will prefer using lm see Section 112 Linear models page 53 to lsfitO for regression modelling Another closely related function is qu and its allies Consider the following assignments gt Xplus lt qrX gt b lt qrcoef Xplus y gt fit lt qrfittedXplus y gt res lt qrresidXplus y Chapter 5 Arrays and matrices 24 These compute the orthogonal projection of y onto the range of X in fit the projection onto the orthogonal complement in res and the coef cient vector for the projection in b that is b is essentially the result of the MATLAB backslash operator It is not assumed that X has full column rank Redundancies will be discovered and removed as they are found This alternative is the older low level way to perform least squares calculations Although still useful in some contexts it would now generally be replaced by the statistical models features as will be discussed in Chapter 11 Statistical models in R page 50 58 Forming partitioned matrices Cbind and rbind As we have already seen informally matrices can be built up from other vectors and matrices by the functions cbindO and rbind Roughly cbindO forms matrices by binding together matrices horizontally or column wise and rbind vertically or row wise In the assignment gt X lt cbindarg1 arg2 arg3 the arguments to cbindO must be either vectors of any length or matrices with the same column size that is the same number of rows The result is a matrix with the concatenated arguments argJ arg2 forming the columns If some of the arguments to cbindO are vectors they may be shorter than the column size of any matrices present in which case they are cyclically extended to match the matrix column size or the length of the longest vector if no matrices are given The function rbind does the corresponding operation for rows In this case any vector argument possibly cyclically extended are of course taken as row vectors Suppose X1 and X2 have the same number of rows To combine these by columns into a matrix X together with an initial column of is we can use gt X lt Cbind1 X1 X2 The result of rbind or cbindO always has matrix status Hence cbindx and rbindx are possibly the simplest ways explicitly to allow the vector x to be treated as a column or row matrix respectively 59 The concatenation function 60 with arrays It should be noted that whereas cbindO and rbind are concatenation functions that respect dim attributes the basic 6 function does not but rather clears numeric objects of all dim and dimnames attributes This is occasionally useful in its own right The of cial way to coerce an array back to a simple vector object is to use asvector gt vec lt asvectorX However a similar result can be achieved by using c with just one argument simply for this side effect gt vec lt c X There are slight differences between the two but ultimately the choice between them is largely a matter of style with the former being preferable 510 Frequency tables from factors Recall that a factor de nes a partition into groups Similarly a pair of factors de nes a two way cross classi cation and so on The function table allows frequency tables to be calcu lated from equal length factors If there are k factor arguments the result is a k way array of frequencies Chapter 5 Arrays and matrices Suppose for example that statef is a factor giving the state code for each entry in a data vector The assignment gt statefr lt tablestatef gives in statefr a table of frequencies of each state in the sample The frequencies are ordered and labelled by the levels attribute of the factor This simple case is equivalent to but more convenient than gt statefr lt tapplystatef statef length Further suppose that incomef is a factor giving a suitably de ned income class77 for each entry in the data vector for example with the cut function gt factorcutincomes breaks 351007 gt incomef Then to calculate a two way table of frequencies gt tableincomefstatef statef incomef act nsw nt qld sa tas ViC wa 3545 1 1 0 1 0 0 4555 1 1 1 1 2 0 1 3 5565 0 3 1 3 2 2 2 1 6575 0 1 0 0 0 0 1 0 Extension to higher way frequency tables is immediate Chapter 6 Lists and data frames 26 6 Lists and data frames 61 Lists An R list is an object consisting of an ordered collection of objects known as its components There is no particular need for the components to be of the same mode or type and for example a list could consist of a numeric vector a logical value a matrix a complex vector a character array a function and so on Here is a simple example of how to make a list gt Lst lt list namequotFredquot wifequotMaryquot nochildren3 child agesc 4 7 9 Components are always numbered and may always be referred to as such Thus if Lst is the name of a list with four components these may be individually referred to as Lst1 Lst 2 L51 3 and L51 4 lf further L51 4 is a vector subscripted array then Lst4 1 is its rst entry If Lst is a list then the function lengthLst gives the number of top level components it has Components of lists may also be named and in this case the component may be referred to either by giving the component name as a character string in place of the number in double square brackets or more conveniently by giving an expression of the form gt namecomponentname for the same thing This is a very useful convention as it makes it easier to get the right component if you forget the number So in the simple example given above Lstname is the same as L51 1 and is the string quotFredquot Lstwife is the same as L51 2 and is the string quotMaryquot Lstchildages 1 is the same as Lst4 1 and is the number 4 Additionally one can also use the names of the list components in double square brackets ie Lst quotnamequot is the same as Lstname This is especially useful when the name of the component to be extracted is stored in another variable as in gt x lt quotnamequot Lstx It is very important to distinguish L51 1 from L51 1 is the operator used to select a single element whereas is a general subscripting operator Thus the former is the rst object in the list Lst and if it is a named list the name is not included The latter is a sublist of the list Lst consisting of the rst entry only If it is a named list the names are transferred to the sublist The names of components may be abbreviated down to the minimum number of letters needed to identify them uniquely Thus Lstcoefficients may be minimally speci ed as Lstcoe and Lstcovariance as Lstcov The vector of names is in fact simply an attribute of the list like any other and may be handled as such Other structures besides lists may of course similarly be given a names attribute also 62 Constructing and modifying lists New lists may be formed from existing objects by the function list An assignment of the form Chapter 6 Lists and data frames 27 gt Lst lt list name1object1 namemobjectm sets up a list Lst of m components using objectJ objectm for the components and giving them names as speci ed by the argument names which can be freely chosen If these names are omitted the components are numbered only The components used to form the list are copied when forming the new list and the originals are not affected Lists like any subscripted object can be extended by specifying additional components For example gt Lst 5 lt listmatrixMat 621 Concatenating lists When the concatenation function c is given list arguments the result is an object of mode list also whose components are those of the argument lists joined together in sequence gt listABC lt ClistA listB listC Recall that with vector objects as arguments the concatenation function similarly joined together all arguments into a single vector structure In this case all other attributes such as dim attributes are discarded 63 Data frames A data frame is a list with class quotdata frame There are restrictions on lists that may be made into data frames namely 0 The components must be vectors numeric character or logical factors numeric matrices lists or other data frames Matrices lists and data frames provide as many variables to the new data frame as they have columns elements or variables respectively Numeric vectors logicals and factors are included as is and character vectors are coerced to be factors whose levels are the unique values appearing in the vector Vector structures appearing as variables of the data frame must all have the same length and matrix structures must all have the same row size A data frame may for many purposes be regarded as a matrix with columns possibly of differing modes and attributes It may be displayed in matrix form and its rows and columns extracted using matrix indexing conventions 63 1 Making data frames Objects satisfying the restrictions placed on the columns components of a data frame may be used to form one using the function data frame gt accountants lt dataframe homestatef lootincomes shotincomef A list whose components conform to the restrictions of a data frame may be coerced into a data frame using the function asdata frame The simplest way to construct a data frame from scratch is to use the readtable function to read an entire data frame from an external le This is discussed further in Chapter 7 Reading data from les page 30 632 attach and detach The notation such as accountantsstatef for list components is not always very convenient A useful facility would be somehow to make the components of a list or data frame temporarily visible as variables under their component name without the need to quote the list name explicitly each time Chapter 6 Lists and data frames 28 The attachO function as well as having a directory name as its argument may also have a data frame Thus suppose lentils is a data frame with three variables lentilsu lentilsv lentilsw The attach gt attachlentils places the data frame in the search path at position 2 and provided there are no variables 11 v or w in position 1 u v and w are available as variables from the data frame in their own right At this point an assignment such as gt u lt vw does not replace the component 11 of the data frame but rather masks it with another variable 11 in the working directory at position 1 on the search path To make a permanent change to the data frame itself the simplest way is to resort once again to the notation gt lentilsu lt vw However the new value of component 11 is not visible until the data frame is detached and attached again To detach a data frame use the function gt detachO More precisely this statement detaches from the search path the entity currently at position 2 Thus in the present context the variables 11 v and w would be no longer visible except under the list notation as lentilsu and so on Entities at positions greater than 2 on the search path can be detached by giving their number to detach but it is much safer to always use a name for example by detachlentils or detachquotlentilsquot Note With the current release of R lists and data frames can only be attached at position 2 or above It is not possible to directly assign into an attached list or data frame thus to some extent they are static 633 Working With data frames A useful convention that allows you to work with many different problems comfortably together in the same working directory is o gather together all variables for any well de ned and separate problem in a data frame under a suitably informative name 0 when working with a problem attach the appropriate data frame at position 2 and use the working directory at level 1 for operational quantities and temporary variables before leaving a problem add any variables you wish to keep for future reference to the data frame using the form of assignment and then detachO nally remove all unwanted variables from the working directory and keep it as clean of left over temporary variables as possible In this way it is quite simple to work with many problems in the same directory all of which have variables named x y and z for example 634 Attaching arbitrary lists attachO is a generic function that allows not only directories and data frames to be attached to the search path but other classes of object as well In particular any object of mode quotlistquot may be attached in the same way gt attachany old list Anything that has been attached can be detached by detach by position number or prefer ably by name Chapter 6 Lists and data frames 29 635 Managing the search path The function search shows the current search path and so is a very useful way to keep track of which data frames and lists and packages have been attached and detached Initially it gives gt search 1 quot GlobalEnvquot quotAutoloadsquot quotpackagezbasequot where GlobalEnv is the workspace1 After lentils is attached we have gt search 1 quot GlobalEnvquot lentils quotAutoloadsquot quotpackage basequot gt ls2 quotuquot quotVquot quotwquot and as we see ls or obj ects can be used to examine the contents of any position on the search path Finally7 we detach the data frame and con rm it has been removed from the search path gt detachquotlentilsquot gt search 1 quot GlobalEnvquot quotAutoloadsquot quotpackagezbasequot 1 See the online help for autoload for the meaning of the second term Chapter 7 Reading data from les 30 7 Reading data from les Large data objects will usually be read as values from external les rather than entered during an R session at the keyboard R input facilities are simple and their requirements are fairly strict and even rather in exible There is a clear presumption by the designers of R that you will be able to modify your input les using other tools such as le editors or Perll to t in with the requirements of R Generally this is very simple If variables are to be held mainly in data frames as we strongly suggest they should be an entire data frame can be read directly with the readtable function There is also a more primitive input function scan that can be called directly For more details on importing data into R and also exporting data see the B Data Im portExport manual 71 The readtab1e function To read an entire data frame directly the external le will normally have a special form 0 The rst line of the le should have a name for each variable in the data frame 0 Each additional line of the le has as its rst item a row label and the values for each variable If the le has one fewer item in its rst line than in its second this arrangement is presumed to be in force So the rst few lines of a le to be read as a data frame might look as follows lnput le form with names and row labels Price Floor Area Rooms Age Centheat 01 5200 1110 830 5 62 no 02 5475 1280 710 5 75 no 03 5750 1010 1000 5 42 no 04 5750 1310 690 6 88 no 05 5975 930 900 5 19 yes By default numeric items except row labels are read as numeric variables and non numeric variables such as Centheat in the example as factors This can be changed if necessary The function readtable can then be used to read the data frame directly gt HousePrice lt read table quothouses dataquot Often you will want to omit including the row labels directly and use the default labels In this case the le may omit the row label column as in the following lnput le form without row labels Price Floor Area Rooms Age Cent heat 5200 1110 830 5 62 no 5475 1280 710 5 75 no 5750 1010 1000 5 42 no 5750 1310 690 6 88 no 5 1 9 yes 59 75 93 0 900 1 Under UNIX the utilities Sed or Awk can be used Chapter 7 Reading data from les 31 The data frame may then be read as gt HousePrice lt readtablequothouses dataquot headerTRUE where the headerTRUE option speci es that the rst line is a line of headings and hence by implication from the form of the le that no explicit row labels are given 72 The scan function Suppose the data vectors are of equal length and are to be read in parallel Further suppose that there are three vectors the rst of mode character and the remaining two of mode numeric and the le is inputdat The rst step is to use scan to read in the three vectors as a list as follows gt inp lt scanquotinput datquot listquot quot 0 O The second argument is a dummy list structure that establishes the mode of the three vectors to be read The result held in inp is a list whose components are the three vectors read in To separate the data items into three separate vectors use assignments like gt label lt inp1 x lt inp2 y lt inp3ll More conveniently the dummy list can have named components in which case the names can be used to access the vectors read in For example gt inp lt scanquotinput datquot listidquot quot x0 y0 If you wish to access the variables separately they may either be re assigned to variables in the working frame gt label lt inpid x lt inpx y lt inpy or the list may be attached at position 2 of the search path see Section 634 Attaching arbitrary lists page 28 lfthe second argument is a single value and not a list a single vector is read in all components of which must be of the same mode as the dummy value gt X lt matrixscanquotlightdatquot O ncol5 byrowTRUE There are more elaborate input facilities available and these are detailed in the manuals 73 Accessing builtin datasets Around 100 datasets are supplied with R in package datasets and others are available in packages including the recommended packages supplied with R To see the list of datasets currently available use data As from R version 200 all the datasets suppied with R are available directly by name However many packages still use the earlier convention in which data was also used to load datasets into R for example datainfert and this can still be used with the standard packages as in this example In most cases this will load an R object of the same name However in a few cases it loads several objects so see the on line help for the object to see what to expect 731 Loading data from other R packages To access data from a particular package use the package argument for example datapackagequotrpartquot dataPuromycin packagequotdatasetsquot If a package has been attached by library its datasets are automatically included in the search User contributed packages can be a rich source of datasets Chapter 7 Reading data from les 32 74 Editing data When invoked on a data frame or matrix7 edit brings up a separate spreadsheet like environment for editing This is useful for making small changes once a data set has been read The command gt xnew lt editxold will allow you to edit your data set xold7 and on completion the changed object is assigned to xnew If you want to alter the original dataset xold7 the simplest way is to use fixxold7 which is equivalent to xold lt editxold Use gt xnew lt edit data frame to enter new data via the spreadsheet interface Chapter 8 Probability distributions 33 8 Probability distributions 81 R as a set of statistical tables One convenient use of R is to provide a comprehensive set of statistical tables Functions are provided to evaluate the cumulative distribution function PX x the probability density function and the quantile function given q the smallest x such that PX z gt q and to simulate from the distribution Distribution R name additional arguments beta beta shapel shape2 ncp binomial binom size prob Cauchy cauchy location scale chi squared chisq df ncp exponential exp rate f df 1 df 1 ncp gamma gamma shape scale geometric geom prob hypergeometric hyper m n k log normal lnorm meanlog sdlog logistic logis location scale negative binomial nbinom size prob normal norm mean sd Poisson poi s lambda Student s t t df ncp uniform unif min max Weibull weibull shape scale Wilcoxon wilcox m n Pre x the name given here by d for the density p for the CDF q for the quantile function and r for simulation random deviates The rst argument is x for dxxx q for pxxx p for qxxx and n for rxxx except for rhyper and rwilcox for which it is The non centrality parameter ncp is currently only available for the CDFs most pdfs and a few other cases see the on line help for details The pxxx and qxxx functions all have logical arguments lowertail and logp and the dxxx ones have log This allows eg getting the cumulative or integrated hazard function 11175 10g1 Ft7 by pXXXt lowertail FALSE logp TRUE or more accurate log likelihoods by dxxx log TRUE directly In addition there are functions ptukey and qtukey for the distribution of the studentized range of samples from a normal distribution Here are some examples gt 2 tailed p value for t distribution gt 2pt243 df 13 gt upper 1 point for an F2 7 distribution gt qf099 2 7 82 Examining the distribution of a set of data Given a univariate set of data we can examine its distribution in a large number of ways The simplest is to examine the numbers Two slightly different summaries are given by summary and fivenum and a display of the numbers by stem a stem and leaf77 plot Chapter 8 Probability distributions 34 gt attachfaithful gt summaryeruptions Min lst Qu Median Mean 3rd Qu Max 1600 2163 4000 3488 4454 5100 gt fivenumeruptions 1 16000 21585 40000 44585 51000 gt stemeruptions The decimal point is 1 digits to the left of the I 16 I 070355555588 18 lllI l888 20 I 00002223378800035778 22 I 0002335578023578 24 I 00228 26 I 23 28 I 080 30 I 7 32 I 2337 34 I 250077 36 I 0000823577 38 I 2333335582225577 40 I 0000003357788888002233555577778 42 I 03335555778800233333555577778 44 I l8888 46 I 0000233357700000023578 48 I 00000022335800333 50 I 0370 A stem and leaf plot is like a histogram7 and R has a function hist to plot histograms gt histeruptions make the bins smaller7 make a plot of density gt hist eruptions seq1 6 5 2 O 2 probTRUE gt linesdensityeruptions bw0 1 gt rugeruptions show the actual data points More elegant density plots can be made by density and we added a line produced by density in this example The bandwidth bw was chosen by trial and error as the default gives Chapter 8 Probability distributions 35 too much smoothing it usually does for interesting densities Better automated methods of bandwidth choice are available7 and in this example bw quotSJquot gives a good result Histogram of eruptions Relawe Frequency m n l iii 539 HlllllllllllllHllllllllllllllllll ll lll l llllllll W 15 2D 25 an 35 An 45 an empuuns We can plot the empirical cumulative distribution function by using the function ecdf gt plot ecdf eruptions do pointsFALSE verticalsTRUE This distribution is obviously far from any standard distribution How about the right hand mode7 say eruptions of longer than 3 minutes Let us t a normal distribution and overlay the tted CDF gt long lt eruptions eruptions gt 3 gt plot ecdf long do pointsFALSE verticalsTRUE gt x lt seq3 54 001 gt linesx pnormx meanmeanlong sdsqrtvarlong lty3 ec dflo ng Quantile quantile Q Q plots can help us examine this more carefully parptyquotsquot arrange for a square figure region qqnormlong qqlinelong Chapter 8 Probability distributions 36 which shows a reasonable t but a shorter right tail than one would expect from a normal distribution Let us compare this with some simulated data from a 25 distribution Normal Q Q Plot Sample Ouanhles Theuveucal Ouarmles x lt rt250 df 5 qqnorm x qqlinex which will usually if it is a random sample show longer tails than expected for a normal We can make a Q Q plot against the generating distribution by qqplot qt ppoints250 df 5 x xlab quotQQ plot for t dsnquot qqline x Finally7 we might want a more formal test of agreement with normality or not R provides the Shapiro Wilk test gt shapiro test long ShapiroWilk normality test data long W 09793 pvalue 001052 and the Kolmogorov Smirnov test gt ks testlong quotpnormquot mean meanlong sd sqrtvarlong Onesample KolmogorovSmirnov test data long D 00661 pvalue 04284 alternative hypothesis twosided Note that the distribution theory is not valid here as we have estimated the parameters of the normal distribution from the same sample 83 One and twosample tests So far we have compared a single sample to a normal distribution A much more common operation is to compare aspects of two samples Note that in R7 all classical tests including the ones used below are in package stats which is normally loaded Consider the following sets of data on the latent heat of the fusion of ice calgm from Rice 19957 p490 Chapter 8 Probability distributions 37 Method A 7998 8004 8002 8004 8003 8003 8004 7997 8005 8003 8002 8000 8002 Method B 8002 7994 7998 7997 7997 8003 7995 7997 Boxplots provide a simple graphical comparison of the two samples A lt scan 7998 8004 8002 8004 8003 8003 8004 7997 8005 8003 8002 8000 8002 B lt scan 8002 7994 7998 7997 7997 8003 7995 7997 boxplot A B which indicates that the rst group tends to give higher results than the second an 02 an m l l an an 79 99 l a 79 99 l 79 94 l To test for the equality of the means of the two examples7 we can use an unpaired t test by gt ttestA B Welch Two Sample ttest data A and B t 32499 df 12027 pvalue 000694 alternative hypothesis true difference in means is not equal to 0 95 percent confidence interval 001385526 007018320 sample estimates mean of x mean of y 8002077 7997875 which does indicate a signi cant difference7 assuming normality By default the R function does not assume equality of variances in the two samples in contrast to the similar S PLUS ttest function We can use the F test to test for equality in the variances7 provided that the two samples are from normal populations gt var test A B F test to compare two variances Chapter 8 Probability distributions 38 data A and B F 05837 num df 12 denom df 7 pvalue 03938 alternative hypothesis true ratio of variances is not equal to 1 95 percent confidence interval 01251097 21052687 sample estimates ratio of variances 05837405 which shows no evidence of a signi cant difference7 and so we can use the classical t test that assumes equality of the variances gt ttestA B varequalTRUE Two Sample ttest data A and B t 34722 df 19 pvalue 0002551 alternative hypothesis true difference in means is not equal to 0 95 percent confidence interval 001669058 006734788 sample estimates mean of x mean of y 8002077 7997875 All these tests assume normality of the two samples The two sample Wilcoxon or Mann Whitney test only assumes a common continuous distribution under the null hypothesis gt wilcoxtestA B Wilcoxon rank sum test with continuity correction data A and B W 89 pvalue 0007497 alternative hypothesis true mu is not equal to 0 Warning message Cannot compute exact pvalue with ties in wilcoxtestA B Note the warning there are several ties in each sample7 which suggests strongly that these data are from a discrete distribution probably due to rounding There are several ways to compare graphically the two samples We have already seen a pair of boxplots The following gt plot ecdf A do pointsFALSE verticalsTRUE xlimrange A B gt plot ecdf B do pointsFALSE verticalsTRUE addTRUE will show the two empirical CDFs7 and qqplot will perform a Q Q plot of the two samples The Kolmogorov Smirnov test is of the maximal vertical distance between the two ecdf s7 assuming a common continuous distribution gt kstestA B Twosample KolmogorovSmirnov test data A and B D 05962 p value 005919 Chapter 8 Probability distributions alternative hypothesis twosided Warning message cannot compute correct pvalues with ties in kstestA B Chapter 9 Grouping loops and conditional execution 40 9 Grouping loops and conditional execution 91 Grouped expressions R is an expression language in the sense that its only command type is a function or expression which returns a result Even an assignment is an expression whose result is the value assigned and it may be used wherever any expression may be used in particular multiple assignments are possible Commands may be grouped together in braces expr1 exprm in which case the value of the group is the result of the last expression in the group evaluated Since such a group is also an expression it may for example be itself included in parentheses and used a part of an even larger expression and so on 92 Control statements 921 Conditional execution if statements The language has available a conditional construction of the form gt if expr1 expr2 else expr3 where exprJ must evaluate to a single logical value and the result of the entire expression is then evident The short circuit operators ampamp and I are often used as part of the condition in an if statement Whereas amp and apply element wise to vectors ampamp and apply to vectors of length one and only evaluate their second argument if necessary There is a vectorized version of the ifelse construct the ifelse function This has the form ifelsecondition a b and returns a vector of the length of its longest argument with elements ai if condition i is true otherwise bi 922 Repetitive execution for loops repeat and while There is also a for loop construction which has the form gt for name in expr1 expr2 where name is the loop variable expril is a vector expression often a sequence like 1 20 and expr2 is often a grouped expression with its sub expressions written in terms of the dummy name expri2 is repeatedly evaluated as name ranges through the values in the vector result of expril As an example suppose ind is a vector of class indicators and we wish to produce separate plots of y versus x within classes One possibility here is to use coplot1 which will produce an array of plots corresponding to each level of the factor Another way to do this now putting all plots on the one display is as follows gt xc lt splitx ind gt yc lt splity ind gt for i in 1lengthyc plotxc ill yC ill 5 abline lsfit xc i yc i Note the function split which produces a list of vectors obtained by splitting a larger vector according to the classes speci ed by a factor This is a useful function mostly used in connection with boxplots See the help facility for further details 1 to be discussed later or use xyplot from package lattice Chapter 9 Grouping7 loops and conditional execution 41 Warning for loops are used in R code much less often than in compiled languages Code that takes a whole object7 View is likely to be both clearer and faster in R Other looping facilities include the gt repeat expr statement and the gt while condition expr statement The break statement can be used to terminate any loop7 possibly abnormally This is the only way to terminate repeat loops The next statement can be used to discontinue one particular cycle and skip to the nex 77 Control statements are most often used in connection with functions which are discussed in Chapter 10 Writing your own functions7 page 427 and where more examples will emerge
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'