analysis
Giovanni Petris
Fall 2006
1 Introduction
Our software of choice is R. R and accompanying manuals are available for free downloading
at http://www.r-project.org. To get started with R I suggest that you read the first few
sections of An Introduction to R. This is one of the manuals that come with the program.
You can access it through the Help menu of R. It is highly recommended that you try all
the examples in R. They will help you learn the concepts, give you a little programming
experience, and give you facility with a very flexible statistical software package. And don’t
just try the examples as written. Vary them a little; play around with them; experiment!
In this Lab we will gain familiarity with R by considering ways of performing descriptive
analysis of time series data. In particular, we will learn the following:
will give you the list of all functions and built-in data sets somehow related to time series
analysis. As a more specific example, if you don’t recall how the function that draws lag
plots is called, try
Once you discover that it is called lag.plot, you can query R about its usage using the
help function in one of the two equivalent forms:
> help("lag.plot")
> ?lag.plot
The help function can also be used to obtain a description of any of the data sets that
are shipped with R. Use it to see the content of the data sets used in the examples in the
following sections.
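As a minimal sketch of these help facilities, the lines below search the help pages, list matching function names, and open the description of a data set. The choice of help.search and apropos as search tools, and the search strings, are assumptions made for illustration.

```r
# Search the installed help pages for topics related to time series.
# (Assigned to a variable so that the result can be inspected or printed later.)
hits <- help.search("time series")

# List the objects on the search path whose names contain "lag".
lag_names <- apropos("lag")
lag_names

# Open the description of one of the built-in data sets.
help("AirPassengers")
```

Printing hits shows the matching help topics together with the package each one belongs to.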
> plot(AirPassengers)
Plot it and see what it looks like. A log transformation in this case works well to stabilize
the variance of the series. You can plot the transformed data as follows (the resulting plot
is shown in Fig 1):
> plot(log(AirPassengers))
[Figure 1: log(AirPassengers) plotted against Time.]
By default R plots time series with a continuous line. Sometimes you want to identify the
data points on this continuous line. Try plot(log(AirPassengers), type='o') (the 'o' is
for overplotted: points and line together, that is). If you don’t like the default character used to draw
points, you can change it with the optional argument pch, as in plot(log(AirPassengers), type='o',
A plot that you can draw to look at patterns of dependency between consecutive
observations is a lag plot, in which you plot $x_t$ versus $x_{t-1}$. This can be generalized to
observations more than one time lag apart. Look at the plot produced by the command
lag.plot(LakeHuron, do.lines=FALSE). You should see that the dots tend to cluster around
a straight line, which implies that $x_{t-1}$ is a good (linear) predictor for $x_t$.
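To put a number on what the lag plot shows, one can pair each observation with its predecessor and compute their sample correlation and the least-squares line. This is only a sketch: the variable names x_now and x_prev are introduced here for illustration.

```r
# Lag-1 scatterplot of the LakeHuron series, drawn with points only.
lag.plot(LakeHuron, lags = 1, do.lines = FALSE)

# Pair each observation x_t with its predecessor x_{t-1}.
x <- as.numeric(LakeHuron)
n <- length(x)
x_now  <- x[2:n]        # x_t,     t = 2, ..., n
x_prev <- x[1:(n - 1)]  # x_{t-1}, t = 2, ..., n

# Sample correlation between x_t and x_{t-1}.
r1 <- cor(x_now, x_prev)
r1

# Least-squares line predicting x_t from x_{t-1}.
fit <- lm(x_now ~ x_prev)
coef(fit)
```

A correlation close to 1 corresponds to points that cluster tightly around the fitted line.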
[Figure: decomposition plot with panels seasonal, trend, and remainder, plotted against time.]
Instead of estimating the trend and/or seasonal component, we can difference a series
to make it approximately stationary. The function diff can be used in R for this purpose.
(Footnote 1: Another very useful feature of R’s help system!)
Consider again the logged AirPassengers time series. By default, diff takes one difference
at lag one. This would remove the trend but leave a seasonal component (try it!). On the
other hand, a lag 12 difference removes both seasonality and trend (why?), as Figure 3 shows.
[Figure 3: lag-12 difference of log(AirPassengers) plotted against Time.]
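The two kinds of differencing described above can be compared side by side with a short sketch. The variable names lx, d1 and d12 are illustrative, not part of the original lab.

```r
# Logged series, as used in the text.
lx <- log(AirPassengers)

# Default call: one difference at lag one.
d1 <- diff(lx)

# One difference at lag 12, i.e. x_t - x_{t-12}.
d12 <- diff(lx, lag = 12)

# Sampling frequency of the series (number of observations per unit of time).
frequency(lx)

# Plot the two differenced series one above the other.
op <- par(mfrow = c(2, 1))
plot(d1, main = "Lag 1 difference")
plot(d12, main = "Lag 12 difference")
par(op)
```

Each difference shortens the series by the lag used, so d1 has one observation fewer than lx and d12 has twelve fewer.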
Consider now the data set JohnsonJohnson. Make an appropriate transformation and
difference it so as to remove the seasonal component and the trend. You may use the function
frequency to get the sampling frequency of a time series.
Do it now!
(Intercept) time(AirPassengers)
-230.1878355 0.1205806
> plot(log(AirPassengers), type = "o")
> abline(m)
Take a look at time(AirPassengers) and make sure you understand the time unit that
goes with the slope 0.12.
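The coefficients printed above are consistent with a straight line fitted to the logged series against its time index. The call below is an assumption reconstructed from the coefficient names and from the plot command, so treat it as a sketch rather than the exact original code.

```r
# Fit a straight line to the logged series, using the time index as regressor.
# ASSUMPTION: the response is log(AirPassengers), as suggested by the plot command.
m <- lm(log(AirPassengers) ~ time(AirPassengers))
coef(m)

# Inspect the first few values of the time index to see its unit.
head(as.numeric(time(AirPassengers)), 13)

# On the original scale, the slope corresponds to a multiplicative
# growth factor per unit of time.
exp(coef(m)[[2]])
```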
Now take a look at the data set UKgas. Use lm to fit an appropriate curve to the data.
Plot the data and the curve superimposed on the same display. Do not transform the data
to draw the plot. You may want to use the command lines to superimpose the fitted curve.
Do it now!
In addition to linear models, in R you can use many sophisticated nonlinear smoothers
to estimate the trend of a time series.
As you can see, in this case too the default choice made by R was not very appropriate.
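One way to explore how much a smoother's default setting matters is to overlay fits with different amounts of smoothing. The sketch below assumes lowess as the smoother and f = 0.2 as an arbitrary alternative span; both are illustrative choices and may differ from the smoother used in the original lab.

```r
# Logged series and its time index as plain numeric vectors.
lx <- log(AirPassengers)
tt <- as.numeric(time(lx))
yy <- as.numeric(lx)

# lowess with its default smoother span (f = 2/3).
s_default <- lowess(tt, yy)

# lowess with a smaller span (ASSUMPTION: 0.2 is an arbitrary illustrative value).
s_small <- lowess(tt, yy, f = 0.2)

# Overlay both smooth curves on the data.
plot(lx)
lines(s_default, lty = 2)
lines(s_small, lty = 3)
legend("topleft", legend = c("f = 2/3 (default)", "f = 0.2"), lty = c(2, 3))
```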
5 Simulating time series data
Why would somebody want to simulate a time series? Aren’t there enough real data sets
around? Well, one answer is that simulating from a specific model gives you a feeling for the
typical behavior of observations coming from that model. There are other reasons that have
to do with Monte Carlo studies, the bootstrap, and more.
R has a fairly extensive set of functions that you can use to generate independent random
variables from the most common distributions, like Normal, Gamma, Beta, Poisson, Binomial,
etc. For example, to generate a hundred iid $N(0, 3)$ random variables (mean 0, variance 3),
you just have to issue the command rnorm(100, sd=sqrt(3)). This gives you a hundred observations
from a Gaussian White Noise with variance 3. To generate time series from ARIMA models,
R provides the function arima.sim. The principal arguments it takes are the length of the
series, n, and the model, specified as a list with components ar, ma and, optionally, order.
The following chunk of code generates and plots an AR(1) and an MA(3) process. I omit the
plots, but try it out and see what you get. Try simulating the same process several times to
familiarize yourself with sampling variability.
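A minimal sketch of such a simulation follows. The series length, the coefficient values and the seed are assumptions chosen only for illustration; they are not taken from the original lab.

```r
# ASSUMPTION: the seed is arbitrary; remove this line to see sampling variability.
set.seed(1)

# Gaussian white noise with variance 3 (rnorm takes the standard deviation).
wn <- rnorm(100, sd = sqrt(3))
var(wn)  # sample variance, which fluctuates around 3

# AR(1) process; ASSUMPTION: coefficient 0.8 and length 200 are illustrative.
ar1 <- arima.sim(n = 200, model = list(ar = 0.8))

# MA(3) process; ASSUMPTION: the three coefficients are illustrative.
ma3 <- arima.sim(n = 200, model = list(ma = c(0.5, 0.4, 0.3)))

# Plot the two simulated series one above the other.
op <- par(mfrow = c(2, 1))
plot(ar1, main = "Simulated AR(1)")
plot(ma3, main = "Simulated MA(3)")
par(op)
```

Rerunning the arima.sim calls without resetting the seed produces a new realization of the same process each time.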
Do it now!
By successive substitutions, one obtains the alternative representation

$$x_t = x_0 + \sum_{k=1}^{t} w_k.$$

The second representation suggests the following as a possible way of generating a random
walk, setting for simplicity $x_0 = 0$.
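A minimal sketch of this approach uses a cumulative sum of simulated noise. The length of 100, the standard normal noise and the seed are assumptions made for illustration.

```r
# ASSUMPTION: the seed is arbitrary; change or remove it to obtain other paths.
set.seed(2)

# White noise terms w_1, ..., w_100 (ASSUMPTION: standard normal, length 100).
w <- rnorm(100)

# With x_0 = 0, x_t is the sum of w_1, ..., w_t: a cumulative sum.
x <- cumsum(w)

# Plot the simulated random walk.
plot.ts(x)
```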
Try doing it and plotting the resulting series. Repeat the exercise several times and look
at the different shapes the plot can take. If you want a random walk with drift, follow the
same procedure, but include the drift parameter as the mean of the normal random variables.
The following gives a random walk with drift equal to 0.5.
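A minimal sketch of such a simulation is given here. The length of 100, the unit noise variance and the seed are assumptions made for illustration; only the drift of 0.5 comes from the text.

```r
# ASSUMPTION: the seed is arbitrary.
set.seed(3)

# Noise terms with mean 0.5 (the drift) and, by assumption, unit variance.
w_drift <- rnorm(100, mean = 0.5)

# Random walk with drift, started at zero.
x_drift <- cumsum(w_drift)

# Plot the path together with the line 0.5 * t, the expected value at time t.
plot.ts(x_drift)
abline(a = 0, b = 0.5, lty = 2)
```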