
Introduction to time series

A time series is a set of observations, each one being recorded at a specific time t.

A discrete-time time series is one in which the set of times at which observations are made is a discrete set, as is the case, for example, when observations are made at fixed time intervals. Continuous-time time series are obtained when observations are recorded continuously over some time interval.

The analysis of experimental data that have been observed at different points in time leads to new and unique problems in statistical modeling and inference. The obvious correlation introduced by the sampling of adjacent points in time can severely restrict the applicability of the many conventional statistical methods traditionally dependent on the assumption that these adjacent observations are independent and identically distributed.
• The first step in any time series investigation always involves careful examination of the recorded
data plotted over time. This scrutiny often suggests the method of analysis as well as statistics that will
be of use in summarizing the information in the data.
• Some examples….
• where {Nt} is a sequence of independent normal
random variables, with mean 0 and variance 0.25.
Such a series is often referred to as signal plus
noise, the signal being the smooth function, in this
case.
The population of the U.S.A., measured at 10-year intervals.
The graph suggests the possibility of fitting a quadratic or exponential trend to the
data.

Population of the U.S.A., 1790–1990


• A 0.1 second (1000 point) sample of recorded speech for the phrase aaa . . . hhh; we note the repetitive nature of the signal and the rather regular periodicities. One current problem of great interest is computer recognition of speech, which would require converting this particular signal into the recorded phrase aaa . . . hhh. Spectral analysis can be used in this context to produce a signature of this phrase that can be compared with signatures of various library syllables to look for a match.
• As a final example, the series
represents two phases or arrivals
along the surface, denoted by P (t = 1,
. . . , 1024) and S (t = 1025, . . . , 2048),
at a seismic recording station. The
general problem of interest is in
distinguishing or discriminating
between waveforms generated by
earthquakes and those generated by
explosions. Features that may be
important are the rough amplitude
ratios of the first phase P to the
second phase S, which tend to be
smaller for earthquakes than for
explosions.
A General Approach to Time Series Modeling

• Plot the series and examine the main features of the graph, checking in particular whether there is

  • a trend,
  • a seasonal component,
  • any apparent sharp changes in behavior,
  • any outlying observations.

• Remove the trend and seasonal components to get stationary residuals.

• Choose a model to fit the residuals.

• Forecasting will be achieved by forecasting the residuals and then inverting the transformations described above to arrive at forecasts of the original series.
Unemployment data

Classical decomposition:
• Trend
• Trend plus seasonal variation
• Residuals
Some Zero-Mean Models
• White noise
• Gaussian noise
• Random walk
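As a minimal sketch of these zero-mean models (series length, seed, and noise variance are arbitrary choices, not taken from the slides), white noise can be simulated directly and a random walk obtained as its cumulative sum:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Gaussian white noise: iid N(0, 1) draws (unit variance chosen arbitrarily).
w = rng.normal(loc=0.0, scale=1.0, size=n)

# Random walk: x_t = x_{t-1} + w_t, i.e. the cumulative sum of the noise.
x = np.cumsum(w)
```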
Moving Averages and Filtering
We might replace the white noise series wt by a moving average that smooths the series.

A linear combination of values in a time series is referred to, generically, as a filtered series

Consider the example:

It represents a regression or prediction of the current value xt of a time series as a function of the past
two values of the series, and, hence, the term autoregression is suggested for this model
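As a hedged sketch of the two ideas above (the three-point weights and the autoregressive coefficients are common textbook choices assumed here, since the slide's formulas are not reproduced):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
w = rng.normal(size=n)                        # white noise w_t

# Moving average filter: v_t = (w_{t-1} + w_t + w_{t+1}) / 3, a smoothed series.
v = np.convolve(w, np.ones(3) / 3, mode="same")

# Autoregression of the current value on the past two values:
# x_t = phi1 * x_{t-1} + phi2 * x_{t-2} + w_t (coefficients are illustrative).
phi1, phi2 = 1.0, -0.9
x = np.zeros(n)
for t in range(2, n):
    x[t] = phi1 * x[t - 1] + phi2 * x[t - 2] + w[t]
```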
Residuals
The previous plot suggests trying a model of the form:

where mt is a slowly changing function, known as the trend component and Yt has zero mean

A useful technique for estimating mt is the method of least squares. In the least squares procedure we attempt to fit
a parametric family of functions, e.g.:
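A minimal sketch of the least squares idea for the trend component, fitting a quadratic in time (the parametric family and the synthetic data are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
t = np.arange(n, dtype=float)

# Synthetic series: quadratic trend m_t plus zero-mean noise Y_t (illustrative).
x = 0.02 * t**2 + 0.5 * t + 3.0 + rng.normal(scale=5.0, size=n)

# Least squares fit of m_t = b0 + b1*t + b2*t^2.
Z = np.column_stack([np.ones(n), t, t**2])
beta, *_ = np.linalg.lstsq(Z, x, rcond=None)

m_hat = Z @ beta        # estimated trend
resid = x - m_hat       # estimate of the zero-mean component Y_t
```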
Trend
Trend and Seasonal Models
Residuals
Summary

• Plot the time series: look for trends, seasonal components, step changes, outliers.
• Transform data so that residuals are stationary:
  • Estimate and subtract Tt, St
  • Differencing
  • Nonlinear transformations (log, square root, etc.)
• Fit model to residuals.


Basic statistics
A complete description of a time series, observed as a collection of n random variables at arbitrary time
points t1, t2, . . . , tn, for any positive integer n, is provided by the joint distribution function, evaluated as the
probability that the values of the series are jointly less than the n constants, c1, c2, . . . , cn

Although the joint distribution function describes the data completely, it is an unwieldy tool for displaying
and analyzing time series data. The distribution function must be evaluated as a function of n arguments,
so any plotting of the corresponding multivariate density functions is virtually impossible

Usually we use the marginal distribution functions or the marginal density functions.
The mean function is defined as:

The mean function depends on t

The autocovariance function is defined as the second moment product


The autocovariance measures the linear dependence between two points on the same series
observed at different times. Very smooth series exhibit autocovariance functions that stay large even
when the t and s are far apart, whereas choppy series tend to have autocovariance functions that are
nearly zero for large separations.

For s = t, the autocovariance reduces to the variance of the series at time t.

As in classical statistics, it is more convenient to deal with a measure of association between −1 and
1, and this leads to the following definition.

The autocorrelation function (ACF) is defined as:


As for classic linear correlation:

Moving to multiple time series….

The cross-covariance function between two series, xt and yt, is:

Depends on s and t
The cross-correlation function (CCF) is given by

Depends on s and t
Stationarity
Trend Stationarity
Estimation of Correlation

• Although theoretical autocorrelation and cross-correlation functions are useful for describing the properties of certain hypothesized models, most of the analyses must be performed using sampled data.

• This limitation means that only the sampled points x1, x2, . . . , xn are available for estimating the mean, autocovariance, and autocorrelation functions.

• From the point of view of classical statistics, this poses a problem because we will typically not have iid copies of xt available for estimating the covariance and correlation functions.
If a time series is stationary, the mean function is constant, so we can estimate it by the sample mean.

The sum in the sample autocovariance runs only up to t = n − h because we cannot run past n.

The sample autocorrelation function is defined as:

A simulated sequence of 200 iid normal random variables with mean 0 and variance 1, and its sample autocorrelation.
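A small sketch of these estimators, computing the sample mean and sample ACF of simulated iid noise directly from their definitions (lag range and series length are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)            # simulated iid N(0, 1) noise

xbar = x.mean()                   # sample mean

def sample_acf(x, max_lag):
    """Sample autocorrelation rho_hat(h) for h = 0, ..., max_lag."""
    n = len(x)
    xc = x - x.mean()
    gamma0 = np.sum(xc * xc) / n
    rho = np.empty(max_lag + 1)
    for h in range(max_lag + 1):
        rho[h] = (np.sum(xc[: n - h] * xc[h:]) / n) / gamma0   # sum stops at n - h
    return rho

rho_hat = sample_acf(x, max_lag=20)
```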
Vector-Valued and Multidimensional Series

We frequently encounter situations in which the relationships between a number of jointly measured
time series are of interest.

Ex. Multiple sensors which record temperatures in geographic regions

vector time series

p univariate time series

p × p autocovariance matrix

The elements of the matrix Γ(h) are the cross-covariance functions.

Now, the sample autocovariance matrix of the vector series xt is the p × p matrix of sample cross-covariances,

with
Time Series Regression

We focus on classical multiple linear regression in a time series context

by assuming some output or dependent time series, say, xt, for t = 1, . . . , n, is being influenced by a collection of possible inputs or independent series, say, zt1, zt2, . . . , ztq, where we first regard the inputs as fixed and known.

{wt} is a random error or noise process consisting of independent and identically distributed (iid) normal variables with mean zero and variance σw².
In ordinary least squares (OLS), we minimize the error sum of squares

OLS
The multiple linear regression model described by (2.1) can be conveniently written in a more general
notation by defining the column vectors

OLS estimation finds the coefficient vector β that minimizes the error sum of squares
OLS

The minimized error sum of squares denoted SSE, can be written as:
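A hedged sketch of OLS in this setting with statsmodels (the inputs, coefficients, and noise level are made up for illustration); the fitted object also exposes the SSE and the information criteria discussed below:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 120
t = np.arange(n, dtype=float)

# Illustrative inputs: a time trend and an exogenous series.
z1 = t
z2 = rng.normal(size=n)
x = 2.0 + 0.05 * z1 + 1.5 * z2 + rng.normal(scale=0.5, size=n)

Z = sm.add_constant(np.column_stack([z1, z2]))   # design matrix with intercept
fit = sm.OLS(x, Z).fit()                          # minimizes the error sum of squares

print(fit.params)        # estimated coefficient vector beta
print(fit.ssr)           # minimized error sum of squares (SSE)
print(fit.aic, fit.bic)  # information criteria (see below)
```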
Models to isolate or select the best subset of independent
variables
Suppose a proposed model specifies that only a subset of r < q independent variables is influencing the dependent variable xt.
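The comparison of the reduced model against the full model is based on an F statistic; its standard form (an assumption here, consistent with the degrees of freedom quoted below) is:

\[
F = \frac{(SSE_r - SSE)/(q - r)}{SSE/(n - q - 1)}
\]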

where SSEr is the error sum of squares under the reduced model.

The statistic has a central F-distribution with q − r and n − q − 1 degrees of freedom.


If H0 : βr+1 = · · · = βq = 0 is true, then SSE ≈ SSEr because the estimates of those βs will be close to 0, and the reduced model is the correct model.

These tests have been used in the past in a stepwise manner, where variables are added or deleted when
the values from the F-test either exceed or fail to exceed some predetermined levels. The procedure, called
stepwise multiple regression, is useful in arriving at a set of useful variables.

Suppose we consider a normal regression model with k coefficients and denote the maximum likelihood
estimator for the variance as

Akaike suggested measuring the goodness of fit for this particular model by balancing the error of the fit
against the number of parameters in the model; we define the following
Akaike’s Information Criterion
(AIC)

k is the number of parameters in the model and n is the sample size.

The value of k yielding the minimum AIC specifies the best model.

The idea is that minimizing the error variance would be a reasonable objective; however, it decreases as k increases. Therefore, we ought to penalize the error variance by a term proportional to the number of parameters.
We may also derive a correction term based on Bayesian arguments, as in
Schwarz which leads to the following.

Bayesian Information Criterion (BIC)

Notice that the penalty term in BIC is much larger than in AIC; consequently, BIC tends to choose smaller models. Various simulation studies have tended to verify that BIC does well at getting the correct order in large samples, whereas AICc tends to be superior in smaller samples where the relative number of parameters is large.
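For reference, one common way of writing the two criteria (texts differ by scaling constants, so treat this as an assumed form rather than the slides' exact one):

\[
\mathrm{AIC} = -2\log\hat{L} + 2k, \qquad
\mathrm{BIC} = -2\log\hat{L} + k\log n,
\]

where \(\hat{L}\) is the maximized likelihood, k the number of estimated parameters, and n the sample size; the k log n penalty in BIC exceeds 2k once n > e², which is why BIC favors smaller models.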
where we adjust temperature for its mean, T̄ = 74.26,
We note that each model does substantially better than the one before
it and that the model including temperature, temperature squared, and
particulates does the best, accounting for some 60% of the variability
and with the best value for AIC and BIC (because of the large sample
size, AIC and AICc are nearly the same).
We may measure the proportion of variation accounted for by all the
variables using:
In general, it is necessary for time series data to be stationary so that averaging lagged products over time, as
in the previous section, will be a sensible thing to do. With time series data, it is the dependence between the
values of the series that is important to measure; we must, at least, be able to estimate autocorrelations with
precision. It would be difficult to measure that dependence if the dependence structure is not regular or is
changing at every time point. Hence, to achieve any meaningful statistical analysis of time series data, it will be
crucial that, if nothing else, the mean and the autocovariance functions satisfy the conditions of stationarity (for
at least some reasonable stretch of time)

The easiest form of nonstationarity to work with is the trend stationary model, where the process has stationary behavior around a trend. We may write this type of model as:

where xt are the observations, μt denotes the trend, and yt is a stationary process

Here, we should obtain a reasonable estimate of the trend component, and then work with the residuals
Consider this data:

a straight line might be useful for detrending the data

To obtain the detrended series we simply subtract the estimates from the observations
we might model trend as a stochastic component using the random walk with drift model:

where wt is white noise and is independent of yt

In this case, differencing the data, xt , yields a stationary process:

Stationary component

One advantage of differencing over detrending to remove trend is that no parameters are estimated
in the differencing operation. One disadvantage, however, is that differencing does not yield an
estimate of the stationary process yt. If an estimate of yt is essential, then detrending may be more
appropriate
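A brief sketch contrasting the two approaches on a series with a linear trend (slope, intercept, and noise model are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
t = np.arange(n, dtype=float)

# Trend-stationary example: linear trend plus stationary noise y_t.
y = rng.normal(scale=1.0, size=n)
x = 1.0 + 0.25 * t + y

# Detrending: fit a straight line by least squares and subtract it.
coef = np.polyfit(t, x, deg=1)
detrended = x - np.polyval(coef, t)   # an estimate of the stationary part y_t

# Differencing: no parameters are estimated, but the result estimates the
# increments of the series rather than y_t itself.
differenced = np.diff(x)
```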
Because differencing plays a central role in time series analysis, it receives its own notation. The first
difference is denoted as

Backshift operator

We define the backshift operator by:

The idea of an inverse operator can also be given if we require B−1B = 1,

forward-shift operator
We may extend the notion further. For example, the second difference becomes

Differences of order d are defined as:


Often, obvious aberrations are present that can contribute nonstationary as well as nonlinear behavior
in observed time series. In such cases, transformations may be useful to equalize the
variability over the length of a single series. A particularly useful transformation is:

which tends to suppress larger fluctuations that occur over portions of the series where the underlying values are larger.
Smoothing in the Time Series Context

Consider the time series xt

Kernel Smoothing in the Time Series Context

Kernel smoothing is a moving average smoother that uses a weight function, or kernel, to average the
observations

Where:

is a kernel function

This estimator, which was originally explored by Parzen [148] and Rosenblatt [170], is often called the
Nadaraya–Watson estimator (Watson [207]).

the normal kernel:
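The kernel formula itself is omitted above; a hedged sketch of a Nadaraya–Watson-type smoother using the standard normal density K(z) = (2π)^(−1/2) exp(−z²/2) as kernel, with time as the predictor and an arbitrary bandwidth b:

```python
import numpy as np

def normal_kernel(z):
    """Standard normal density used as the kernel."""
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def kernel_smooth(x, b):
    """Nadaraya-Watson smoother of x_t against time t with bandwidth b."""
    n = len(x)
    t = np.arange(n, dtype=float)
    out = np.empty(n)
    for i in range(n):
        w = normal_kernel((t[i] - t) / b)    # weights centered at time t_i
        out[i] = np.sum(w * x) / np.sum(w)   # weighted average of the observations
    return out

# Illustrative use: smooth a noisy sine wave with bandwidth b = 5.
rng = np.random.default_rng(6)
t = np.arange(200)
x = np.sin(2 * np.pi * t / 50) + rng.normal(scale=0.5, size=200)
x_smooth = kernel_smooth(x, b=5.0)
```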


An obvious way to deal with the data would be to fit a polynomial regression in time. For example, a cubic polynomial would have

where
Smoothing splines
A related method is smoothing splines, which minimizes a compromise between the fit and
the degree of smoothness given by

where mt is a cubic spline with a knot at each t and primes denote differentiation. The degree of smoothness is controlled by λ > 0; the larger the value of λ, the smoother the fit.

Think of taking a long drive where mt is the position of your car at time t. In this case, mt'' is the instantaneous acceleration/deceleration, and ∫ (mt'')² dt is a measure of the total amount of acceleration and deceleration on your trip. A smooth drive would be one where a constant velocity is maintained (i.e., mt'' = 0). A choppy ride would be one where the driver is constantly accelerating and decelerating, as beginning drivers tend to do.
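A minimal sketch with SciPy's UnivariateSpline, whose smoothing parameter s plays a role analogous to λ (larger s gives a smoother fit); the data and the value of s are illustrative assumptions:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(7)
t = np.arange(200, dtype=float)
x = np.sin(2 * np.pi * t / 60) + rng.normal(scale=0.4, size=200)

# Cubic smoothing spline (k=3); s bounds the residual sum of squares and so
# controls the fit/smoothness trade-off, much as lambda does in the penalty form.
spline = UnivariateSpline(t, x, k=3, s=30.0)
m_hat = spline(t)   # the estimated smooth function m_t
```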
ARIMA models

Classical regression is often insufficient for explaining all of the interesting dynamics of a time series. For
example, the ACF of the residuals of the simple linear regression fit to the price of chicken data (see
Example 2.4) reveals additional structure in the data that regression did not capture. Instead, the
introduction of correlation that may be generated through lagged linear relations leads to proposing the
autoregressive (AR) and autoregressive moving average (ARMA) models that were presented in
Whittle [209]. Adding nonstationary models to the mix leads to the autoregressive integrated moving
average (ARIMA) model popularized in the landmark work by Box and Jenkins [30].

Autoregressive models are based on the idea that the current value of the series, xt , can be
explained as a function of p past values, xt−1, xt−2, . . . , xt−p, where p determines the number of steps
into the past needed to forecast the current value
The extent to which it might be possible to forecast a real data series from its own past values can be
assessed by looking at the autocorrelation and lagged scatterplot matrices :
Scatterplot matrix relating current SOI values, St , to past
SOI values, St−h, at lags h = 1, 2, . . . , 12.
A useful form follows by using the backshift operator to write the AR(p) model, as:

The autoregressive operator is defined to be:

Thus:
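In the usual notation (an assumed reconstruction, since the slide's formulas are not reproduced), the AR(p) model and the autoregressive operator are:

\[
x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \cdots + \phi_p x_{t-p} + w_t,
\]
\[
\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p,
\qquad\text{so that}\qquad
\phi(B)\,x_t = w_t .
\]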
The AR(1) Model

This method suggests that, by continuing to iterate backward, and provided that

we can represent an AR(1) model as a linear process:


and autocovariance function:

And autocorrelation:
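The standard AR(1) results (assumed here to match the omitted formulas) are, provided |φ| < 1:

\[
x_t = \sum_{j=0}^{\infty} \phi^{\,j} w_{t-j}, \qquad
\gamma(h) = \frac{\sigma_w^2\,\phi^{|h|}}{1 - \phi^2}, \qquad
\rho(h) = \phi^{|h|} .
\]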
Moving Average Models
As an alternative to the autoregressive representation in which the xt on the left-hand side of the equation are
assumed to be combined linearly, the moving average model of order q, abbreviated as MA(q), assumes the
white noise wt on the right-hand side of the defining equation are combined linearly to form the observed data

We may also write the MA(q) process in the equivalent form:
The moving average operator is

For an MA(1), xt is correlated with xt−1, but not with xt−2, xt−3, . . .. Contrast this with the case of the AR(1) model, in which the correlation between xt and xt−k is never zero.
Autoregressive Moving Average Models

We now proceed with the general development of autoregressive, moving average, and mixed autoregressive moving average (ARMA) models for stationary time series.

The parameters p and q are called the autoregressive and the moving average orders,
respectively
To aid in the investigation of ARMA models, it will be useful to write them using the AR operator, and the MA
operator. In particular, the ARMA(p, q) model can then be written in concise form as:
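With the AR operator φ(B) and the MA operator θ(B) = 1 + θ1B + · · · + θqB^q, the concise form is (an assumed reconstruction of the omitted formula):

\[
\phi(B)\,x_t = \theta(B)\,w_t .
\]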
Autocorrelation and Partial Autocorrelation

the ACF of an MA(q) process,

Because xt is a finite linear combination of white noise terms, the process is stationary with mean:

The cutting off of γ(h) after q lags is the signature of the MA(q) model
Dividing by γ(0) yields the ACF of an MA(q):
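The standard form of this ACF (assumed here, with the convention θ0 = 1) is:

\[
\rho(h) =
\begin{cases}
\dfrac{\sum_{j=0}^{q-h}\theta_j\,\theta_{j+h}}{1 + \theta_1^2 + \cdots + \theta_q^2}, & 1 \le h \le q,\\[1.5ex]
0, & h > q .
\end{cases}
\]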

The ACF of an ARMA(p, q) process is more complicated….
The ACF of an AR(p)

then ρ(h) dampens to zero exponentially fast as h → ∞, either directly or in a sinusoidal fashion.

In the second case, the time series will appear to be cyclic in nature.
The Partial Autocorrelation Function (PACF)

We have seen that for MA(q) models, the ACF will be zero for lags greater than q. Thus, the ACF provides a
considerable amount of information about the order of the dependence when the process is a moving average
process. If the process, however, is ARMA or AR, the ACF alone tells us little about the orders of dependence.
Hence, it is worthwhile pursuing a function that will behave like the ACF of MA models, but for AR models,
namely, the partial autocorrelation function (PACF).

The idea is that ρXY |Z measures the correlation between X and Y with the linear effect of Z removed (or
partialled out).
The correlation between xt and xt−2 is not zero, as it would be for an MA(1), because xt is dependent
on xt−2 through xt−1

Suppose we break this chain of dependence by removing (or partialling out) the effect of xt−1. That is, we consider the correlation between xt − φxt−1 and xt−2 − φxt−1, because this is the correlation between xt and xt−2 with the linear dependence of each on xt−1 removed.
Hence, the tool we need is partial autocorrelation, which is the correlation between xs and xt with the linear
effect of everything “in the middle” removed.
For an AR(p), the partial autocorrelation is 0 for time lags greater than p.
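A short sketch contrasting the two tools on a simulated AR(2) series using statsmodels' acf and pacf estimators (the AR coefficients are illustrative): the ACF tails off while the PACF cuts off after lag p = 2.

```python
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

rng = np.random.default_rng(8)
n = 1000
phi1, phi2 = 1.0, -0.9           # illustrative causal AR(2) coefficients
w = rng.normal(size=n)
x = np.zeros(n)
for t in range(2, n):
    x[t] = phi1 * x[t - 1] + phi2 * x[t - 2] + w[t]

acf_vals = acf(x, nlags=20)      # tails off (damped sinusoid for these phis)
pacf_vals = pacf(x, nlags=20)    # approximately zero beyond lag 2
```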
Moving average
ARMA
In forecasting, the goal is to predict future values of a time series, xn+m, m = 1, 2, . . ., based on the data collected to the present, x1:n = {x1, x2, . . . , xn}.

Throughout this section, we will assume xt is stationary and the model parameters are known.

The minimum mean square error predictor of xn+m is

because the conditional expectation minimizes the mean square error

where g(x1:n) is a function of the observations x1:n


First, we will restrict attention to predictors that are linear functions of the data, that is, predictors of the form

if n = m = 1, then x^1_2 is the one-step-ahead linear forecast of x2 given x1;

if n = 2, x^2_3 is the one-step-ahead linear forecast of x3 given x1 and x2.

Linear predictors of the previous form that minimize the mean square prediction error are called best linear
predictors (BLPs). Linear prediction depends only on the second-order moments of the process, which are
easy to estimate from the data.
First, consider one-step-ahead prediction.
Given {x1, . . ., xn}, we wish to forecast the value of the time series at the next time point, xn+1.

one-step-ahead forecast in vector notation:


thus the elements of φn are unique, and are given by

The mean square one-step-ahead prediction error is
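Written out in the standard vector notation (assumed to correspond to the omitted expressions), with Γn = {γ(k − j)} the n × n autocovariance matrix and γn = (γ(1), . . . , γ(n))′, the prediction equations, forecast, and error are:

\[
\Gamma_n \phi_n = \gamma_n, \qquad
x^{\,n}_{n+1} = \sum_{j=1}^{n} \phi_{nj}\, x_{n+1-j},
\]
\[
P^{\,n}_{n+1} = E\big(x_{n+1} - x^{\,n}_{n+1}\big)^2 = \gamma(0) - \gamma_n'\,\Gamma_n^{-1}\gamma_n .
\]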


Prediction for an AR(2)

Suppose we have a causal AR(2) process xt = φ1xt−1 + φ2xt−2 + wt, and one
observation x1
Using matrix notation:

the one-step-ahead prediction of x2 based on x1 is


n=1

Now, suppose n=2

we want the one-step-ahead prediction of x3 based on the two observations x1 and x2; i.e., x^2_3 = φ21 x2 + φ22 x1.
It should be apparent from the model that

it follows that:

Continuing in this way, it is easy to verify that, for n ≥ 2,

if the time series is a causal AR(p) process, then, for n ≥ p,


For ARMA models in general, the prediction equations will not be as simple as in the pure AR case. In addition, for large n, the direct approach is prohibitive because it requires the inversion of a large matrix. A solution that does not require matrix inversion is the Durbin–Levinson algorithm.
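A hedged sketch of the Durbin–Levinson recursion itself, computing the one-step-ahead prediction coefficients and error variance from a given autocovariance sequence without inverting any matrix (the example autocovariances correspond to an assumed AR(1)-like process):

```python
import numpy as np

def durbin_levinson(gamma, order):
    """One-step-ahead prediction coefficients phi_{n,k} and prediction error
    variance from autocovariances gamma(0), ..., gamma(order), with no matrix
    inversion (Durbin-Levinson recursion)."""
    gamma = np.asarray(gamma, dtype=float)
    rho = gamma / gamma[0]                 # autocorrelations
    phi = np.zeros(order)
    mse = gamma[0]                         # P_1^0 = gamma(0)
    for n in range(1, order + 1):
        if n == 1:
            phi_nn = rho[1]
        else:
            num = rho[n] - np.sum(phi[: n - 1] * rho[n - 1:0:-1])
            den = 1.0 - np.sum(phi[: n - 1] * rho[1:n])
            phi_nn = num / den
        # phi_{n,k} = phi_{n-1,k} - phi_nn * phi_{n-1,n-k}, k = 1, ..., n-1
        prev = phi[: n - 1].copy()
        phi[: n - 1] = prev - phi_nn * prev[::-1]
        mse *= (1.0 - phi_nn**2)           # P_{n+1}^n = P_n^{n-1}(1 - phi_nn^2)
        phi[n - 1] = phi_nn
    return phi, mse

# Example: gamma(h) = 0.8**h, as for an AR(1) with phi = 0.8 and unit variance.
gamma = 0.8 ** np.arange(6)
phi, mse = durbin_levinson(gamma, order=5)   # phi ~ [0.8, 0, 0, 0, 0], mse ~ 0.36
```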

We assume xt is a causal and invertible ARMA(p, q) process.
We consider two types of forecasts

For ARMA models, it is easier to calculate the predictor of xn+m assuming we have the complete history of the process. We will denote the predictor of xn+m based on the infinite past as
Now, write xn+m in its causal and invertible forms:
Accordingly, we assume that:

because
Prediction is accomplished recursively starting with the one-step-ahead predictor, m = 1, and then continuing for
m = 2, 3,
Given an autocovariance function, is there a unique stochastic process that has it?
It can be shown that there may be several stationary stochastic processes with the same autocovariance function. Only one of them, however, is invertible.
A stochastic process is said to be invertible if the noise at time t can be reconstructed from the present and past observations xt, xt−1, xt−2, . . .
All AR processes are invertible, while this is not always true for MA processes. Conversely, all MA processes are stationary, but this is not always true for AR processes.
For an MA process to be invertible, it is necessary that all the roots of the MA polynomial lie outside the unit circle, i.e., that all the roots have modulus greater than one.
The invertibility condition is used to uniquely identify a stochastic process starting from the estimated autocorrelation function.
the Dickey–Fuller test tests the null hypothesis that a unit root is present in an autoregressive (AR)
time series model.

The alternative hypothesis is different depending on which version of the test is used, but is usually stationarity or
trend-stationarity.

μt is stationary

We can write:
Because under the null hypothesis the regressor is nonstationary, the test statistic does not follow the standard t-distribution, so standard t critical values cannot be used. The statistic has its own distribution, tabulated in the Dickey–Fuller table.

H0 : δ = 0
H1 : δ < 0
Each version of the test has its own critical value which depends on the size of the sample.

Dickey–Fuller distribution
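In practice the test is usually run through a library routine; a minimal sketch with statsmodels' adfuller, applied to a simulated random walk (illustrative input), is:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(9)
x = np.cumsum(rng.normal(size=500))   # random walk: the unit root should not be rejected

stat, pvalue, usedlag, nobs, crit, icbest = adfuller(x)
print(f"ADF statistic = {stat:.3f}, p-value = {pvalue:.3f}")
print("critical values:", crit)       # taken from the Dickey-Fuller distribution
```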
Issues in ARMA Models

1. parameter-redundant models,


2. stationary AR models that depend on the future, and
3. MA models that are not unique
Causality

Previously it was discovered that the random walk is not stationary

We might wonder whether there is a stationary AR(1) process with

We could get a stationary process as follows:

…..
this result suggests the stationary future dependent AR(1) model

Unfortunately, this model is useless because it requires us to know the future to be able to predict the future.
Non-uniqueness of MA Models and Invertibility
Parameter Redundancy

To aid in the investigation of ARMA models, it will be useful to write them using the AR operator and the MA operator. In particular, the ARMA(p, q) model can then be written in concise form as:
ARMA Model

The concise form of the model points to a potential problem in that we can unnecessarily complicate the model
by multiplying both sides by another operator, say:
Example

That is, in addition to the original definition, we will also require that φ and θ have no common factors. So, the
process, xt = .5xt−1 − .5wt−1 + wt , is not referred to as an ARMA(1, 1) process because, in its reduced form, xt is
white noise.
Integrated Models for Nonstationary Data

we saw that if xt is a random walk, xt = xt−1 + wt, then differencing xt yields the stationary series ∇xt = wt.

In many situations, time series can be thought of as being composed of two components, a nonstationary
trend component and a zero-mean stationary component.
Another model that leads to first differencing is the case in which μt is stochastic and slowly varying
according to a random walk.

where vt is stationary. In this case

Stochastic trend models can also lead to higher order differencing. For example, suppose
we focus on the frequency domain approach to time series analysis.

We argue that the concept of regularity of a series can best be expressed in terms of periodic variations
of the underlying phenomenon that produced the series

We measure frequency, ω, in cycles per time point, rather than the alternative λ = 2πω, which would give radians per point. Of descriptive interest is the period of a time series, defined as the number of points in a cycle, i.e., 1/ω.
Cyclical Behavior and Periodicity

The general notion of periodicity can be made more precise by introducing some terminology. In order to
define the rate at which a series oscillates, we first define a cycle as one complete period of a sine or
cosine function defined over a unit time interval

we consider the periodic process:


for purposes of data analysis, it is easier to use a trigonometric identity and write:

based on:
noting that cov(U1, U2) = 0.
The appearance of the series is a function of its frequency, ω:
For ω = 1, the series makes one cycle per time unit; for ω = .50, the series makes a cycle every two time
units; for ω = .25, every four units, and so on.
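The periodic process and the trigonometric identity referred to above are, in their usual form (an assumed reconstruction):

\[
x_t = A\cos(2\pi\omega t + \phi) = U_1\cos(2\pi\omega t) + U_2\sin(2\pi\omega t),
\qquad U_1 = A\cos\phi,\; U_2 = -A\sin\phi,
\]

and if U1 and U2 are uncorrelated with mean 0 and common variance σ², the autocovariance is γ(h) = σ² cos(2πωh).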

In general, for data that occur at discrete time points, we will need at least two points to determine a cycle, so
the highest frequency of interest is .5 cycles per point. This frequency is called the folding frequency and
defines the highest frequency that can be seen in discrete sampling
Consider a generalization that allows mixtures of periodic series with multiple frequencies and
amplitudes,

the autocovariance function of this process is:


The autocovariance function is the sum of periodic components with weights proportional to the variances σk².

Hence, xt is a mean-zero stationary process with variance:
Estimation and the Periodogram

for t = 1, . . . , n and suitably chosen coefficients.

We then define the scaled periodogram to be


it indicates which frequency components are large in magnitude and which components are small

The scaled periodogram is simply the sample variance at each frequency component and consequently is an
estimate of σj2 corresponding to the sinusoid oscillating at a frequency of ωj = j/n. These particular frequencies
are called the Fourier or fundamental frequencies.

Large values of P(j/n) indicate which frequencies ωj = j/n are predominant in the series, whereas small values
of P(j/n) may be associated with noise.
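A minimal sketch of computing the scaled periodogram with the FFT, under the convention P(j/n) = (4/n²)|Σ_t x_t e^(−2πitj/n)|² (assumed here); the two embedded sinusoids are illustrative:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 256
t = np.arange(n)

# Illustrative series: two sinusoids plus noise (frequencies 6/n and 10/n).
x = (2.0 * np.cos(2 * np.pi * t * 6 / n)
     + 3.0 * np.sin(2 * np.pi * t * 10 / n)
     + rng.normal(scale=1.0, size=n))

# Scaled periodogram at the Fourier frequencies j/n.
d = np.fft.fft(x)
P = 4.0 / n**2 * np.abs(d) ** 2
freqs = np.arange(n) / n

keep = freqs <= 0.5                  # up to the folding frequency
freqs, P = freqs[keep], P[keep]
# Large values of P flag the dominant frequencies (here near 6/n and 10/n).
```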

You might also like