5 Estimation of Correlation
The webpages for the texts and some help on using R for time series analysis can be found at
https://nickpoison.github.io/.
#install.packages(
# pkgs = "remotes"
#)
#remotes::install_github(
# repo = "nickpoison/astsa/astsa_build"
#)
options(
digits = 3,
scipen = 99
)
rm(
list = ls()
)
The previous videos focused on theoretical autocorrelation and cross-correlation. Here we will shift from theory
to analyzing sampled data.
$x_1, x_2, \ldots, x_n$
Typically we will not have iid observations of $x_t$ for estimating the values.
When a time series is stationary we can estimate the constant mean $\mu_t = \mu$ using the sample mean.
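$$E(\bar{x}) = \mu$$

$$
\begin{aligned}
\operatorname{var}(\bar{x}) &= \operatorname{var}\left(\frac{1}{n}\sum_{t=1}^{n} x_t\right) \\
&= \frac{1}{n^2}\operatorname{cov}\left(\sum_{s=1}^{n} x_s, \sum_{t=1}^{n} x_t\right) \\
&= \frac{1}{n^2}\sum_{s=1}^{n}\sum_{t=1}^{n}\operatorname{cov}(x_s, x_t) \\
&= \frac{1}{n^2}\sum_{s=1}^{n}\sum_{t=1}^{n}\gamma_x(s-t) \\
&= \frac{1}{n^2}\sum_{h=-n}^{n}(n-|h|)\,\gamma_x(h)
\end{aligned}
$$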
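If the covariance terms vanish for $h \neq 0$ (for example, when $x_t$ is white noise), this reduces to
$$\operatorname{var}(\bar{x}) = \frac{1}{n}\gamma_x(0) = \sigma_x^2/n$$
Note that $\operatorname{var}(\bar{x})$ may be larger or smaller than $\sigma_x^2/n$ when the covariance terms are non-zero.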
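As a quick numerical check of the variance formula, the sketch below evaluates $\frac{1}{n^2}\sum_h (n-|h|)\gamma_x(h)$ for a moving average series with negative and with positive lag-one covariance, and compares both with $\sigma_x^2/n$. The helper names `var_xbar` and `ma1_gamma`, the coefficient 0.7, and the choice n = 50 are illustrative assumptions, not part of the original notes.

```r
# Variance of the sample mean computed from an autocovariance function.
# `var_xbar` is a hypothetical helper; gamma_fun(h) must be vectorised over h.
var_xbar <- function(n, gamma_fun) {
  h <- -(n - 1):(n - 1)                 # lags with non-zero weight n - |h|
  sum((n - abs(h)) * gamma_fun(h)) / n^2
}

# Assumed example: x_t = w_t + theta * w_{t-1} with var(w_t) = 1,
# so gamma(0) = 1 + theta^2, gamma(+/-1) = theta, and gamma(h) = 0 otherwise.
ma1_gamma <- function(theta) {
  function(h) ifelse(h == 0, 1 + theta^2, ifelse(abs(h) == 1, theta, 0))
}

n <- 50
v_neg <- var_xbar(n, ma1_gamma(-0.7))   # negative lag-1 covariance
v_pos <- var_xbar(n, ma1_gamma(0.7))    # positive lag-1 covariance
v_iid <- (1 + 0.7^2) / n                # sigma_x^2 / n, covariances ignored

c(negative = v_neg, iid = v_iid, positive = v_pos)
```

Under these assumptions the negative covariance pulls the variance of the sample mean below $\sigma_x^2/n$ and the positive covariance pushes it above.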
$\{(x_t, x_{t+h});\ t = 1, \ldots, n-h\}$
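The lagged pairs above are what a sample autocorrelation at lag $h$ is estimated from. The following is a minimal sketch, assuming simulated AR(1) data and a hypothetical helper `sample_acf`, that builds the pairs explicitly and compares a hand-written estimate (divisor $n$, the convention used by `stats::acf`) with the built-in function and with the ordinary Pearson correlation of the pairs.

```r
# Sample autocorrelation at lag h written out from the definition.
# `sample_acf` is a hypothetical helper, not a function from astsa.
sample_acf <- function(x, h) {
  n <- length(x)
  xc <- x - mean(x)
  gamma_h <- sum(xc[(1 + h):n] * xc[1:(n - h)]) / n   # divisor n, not n - h
  gamma_0 <- sum(xc^2) / n
  gamma_h / gamma_0
}

set.seed(1)                                            # arbitrary seed
x <- as.numeric(arima.sim(model = list(ar = 0.6), n = 200))  # assumed data

h <- 1
n <- length(x)
pairs_h <- cbind(x_t = x[1:(n - h)], x_t_plus_h = x[(1 + h):n])

r_manual  <- sample_acf(x, h)
r_builtin <- acf(x, lag.max = h, plot = FALSE)$acf[h + 1]
r_pairs   <- cor(pairs_h[, 1], pairs_h[, 2])           # Pearson correlation of the pairs

c(manual = r_manual, builtin = r_builtin, pairs = r_pairs)
```

The hand-written value should agree with `acf()`, while the Pearson correlation of the pairs is close but not identical because it centres and scales each column of pairs separately.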
Southern Oscillation Index (SOI) for a period of 453 months ranging over the years 1950-1987.
The format is: Time-Series [1:453] from 1950 to 1988: 0.377 0.246 0.311 0.104 -0.016 0.235 0.137 0.191
-0.016 0.29 …
Data furnished by Dr. Roy Mendelssohn of the Pacific Fisheries Environmental Laboratory, NOAA (personal
communication).
data(
list = "soi",
package = "astsa"
)
acf1_soi <- astsa::acf1(
series = soi,
max.lag = 6,
plot = TRUE
)
astsa::tsplot(
x = lag(
x = soi,
k = -1
),
y = soi,
col = 4,
type = 'p',
xlab = 'lag(soi,-1)'
)
legend(
x = "topleft",
legend = acf1_soi[1],
bg = "white",
adj = 0.45,
cex = 0.85
)
astsa::tsplot(
x = lag(
x = soi,
k = -6
),
y = soi,
col = 4,
type = 'p',
xlab = 'lag(soi,-6)'
)
legend(
x = "topleft",
legend = acf1_soi[6],
bg = "white",
adj = 0.25,
cex = 0.8
)
Using this approximate distribution we obtain a method of assessing whether peaks in $\hat{\rho}_x(h)$ are significant by
determining whether the peak is outside the interval
$$\pm \frac{2}{\sqrt{n}}$$
For a white noise sequence, approximately 95% of the sample ACFs will be in this interval.
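The 95% claim can be checked by simulation. The sketch below generates Gaussian white noise repeatedly, computes the sample ACF at lags 1 to 20, and reports the proportion of values inside $\pm 2/\sqrt{n}$. The series length, number of replications, number of lags, and seed are all arbitrary assumptions.

```r
set.seed(1)        # arbitrary seed
n       <- 500     # length of each white noise series (assumed)
n_rep   <- 200     # number of simulated series (assumed)
max_lag <- 20      # lags examined per series (assumed)

# One column per simulated series, one row per lag 1..max_lag
acf_vals <- replicate(n_rep, {
  w <- rnorm(n)
  acf(w, lag.max = max_lag, plot = FALSE)$acf[-1]   # drop lag 0
})

bound  <- 2 / sqrt(n)
inside <- mean(abs(acf_vals) <= bound)   # proportion inside +/- 2/sqrt(n)
inside
```

The proportion should land close to 0.95, which is why values outside the dashed lines on an ACF plot are treated as evidence against white noise.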
Many statistical modeling procedures depend on reducing a time series to a white noise series using various
kinds of transformations.
$$x_t = \begin{cases} 1 & \text{if heads} \\ -1 & \text{if tails} \end{cases}$$
$$y_t = 5 + x_t - 0.7x_{t-1}$$
Using the standard covariance formula, we see that $\rho_y(1) = \frac{-0.7}{1.49} \approx -0.47$.
Since the $x_t$ are independent, the autocovariance of $y_t$ is zero when the shift is greater than one (no non-zero addends
in the summation).
We generate two sets of data with different sample sizes. Notice that for the smaller sample the sample ACF is
larger for the shifts that should be approximately zero.
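The code that generated `y1` and `y2` does not appear in these notes, so the following is only a sketch of how such data could be simulated and compared with the theoretical ACF. The seed is hypothetical, and the sample sizes (10 and 100 after dropping the first filtered value) are an assumption carried over from the later variant of this example. `ARMAacf()` and `filter()` are from the base `stats` package.

```r
# Theoretical ACF of y_t = 5 + x_t - 0.7 x_{t-1} (an MA(1) with coefficient -0.7)
rho_theory <- ARMAacf(ma = -0.7, lag.max = 4)

set.seed(2024)  # hypothetical seed; the seed behind the printed output is unknown
x1 <- sample(x = c(-1, 1), size = 11,  replace = TRUE)   # simulated coin tosses
x2 <- sample(x = c(-1, 1), size = 101, replace = TRUE)

# One-sided filter gives x_t - 0.7 x_{t-1}; the first value is NA and is dropped
y1 <- 5 + stats::filter(x = x1, filter = c(1, -0.7), sides = 1)[-1]
y2 <- 5 + stats::filter(x = x2, filter = c(1, -0.7), sides = 1)[-1]

# Sample ACF at lags 1..4 for the small and the large sample
acf_y1 <- acf(y1, lag.max = 4, plot = FALSE)$acf[-1]
acf_y2 <- acf(y2, lag.max = 4, plot = FALSE)$acf[-1]

round(rbind(theory = rho_theory[-1], n10 = acf_y1, n100 = acf_y2), 2)
```

Because the seed is an assumption, the sample means and sample ACF values from this sketch will not match the values printed in the text.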
## [1] 5.26
astsa::tsplot(
x = y1,
type = 's'
) # plot 1st series
points(
x = y1,
pch = 19
)
acf(
x = y1,
lag.max = 10,
plot = TRUE
)
## [1] 4.98
astsa::tsplot(
x = y2,
type = 's'
) # plot 2nd series
points(
x = y2,
pch = 19
)
acf(
x = y2,
lag.max = 10,
plot = TRUE
)
Here is the version from the other text: same idea, but the y values are 2, 4, 6, 8, like the children's cheer.
set.seed(
seed = 823
)
x1 = sample(
x = c(-2,2),
size = 11,
replace = TRUE
)
x2 = sample(
x = c(-2,2),
size = 101,
replace = TRUE
)
y1 = 5 + filter(
x = x1,
sides = 1,
filter = c(1,-.5)
)[-1]
y2 = 5 + filter(
x = x2,
sides = 1,
filter = c(1,-.5)
)[-1]
astsa::tsplot(
x = y1,
type = "s",
col = 4,
xaxt = "n",
yaxt = "n"
)
points(
x = y1,
pch = 21,
cex = 1.1,
bg = 6
)
axis(
side = 1,
at = 1:10
)
axis(
side = 2,
at = seq(
from = 2,
to = 8,
by = 2
),
las = 1
)
astsa::tsplot(
x = y2,
type = "s",
col = 4,
xaxt = "n",
yaxt = "n"
)
points(
x = y2,
pch = 21,
cex = 1.1,
bg = 6
)
axis(
side = 1,
at = seq(
from = 0,
to = 100,
by = 10
)
)
axis(
side = 2,
at = seq(
from = 2,
to = 8,
by = 2
),
las = 1
)
acf(
x = y1,
lag.max = 4,
plot = TRUE
)
acf(
x = y2,
lag.max = 4,
plot = TRUE
)
The series appears to contain a sequence of repeating short signals. The ACF confirms this, showing repeating
peaks spaced about 106–109 points.
The distance between the repeating signals is known as the pitch period and is a fundamental parameter of
interest in systems that encode and decipher speech.
Because the series is sampled at 10,000 points per second, the pitch period appears to be between 0.0106
and 0.0109 seconds. To compute the sample ACF in R, use acf(speech, 250).
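The conversion from the lag of the ACF peak to a pitch period can be sketched as follows. To keep the example self-contained it does not use the real `speech` data; instead it builds a synthetic stand-in (a damped oscillation that restarts every 106 points), which is purely an assumption for illustration. Only the sampling rate of 10,000 points per second comes from the text.

```r
set.seed(1)          # arbitrary seed
fs     <- 10000      # sampling rate from the text (points per second)
period <- 106        # assumed spacing of the synthetic pulses
n      <- 1020

# Synthetic stand-in for the speech series (NOT the real data):
# a damped oscillation restarting every `period` points, plus a little noise
u <- (seq_len(n) - 1) %% period
x <- exp(-u / 15) * sin(2 * pi * u / 12) + rnorm(n, sd = 0.05)

r <- acf(x, lag.max = 250, plot = FALSE)$acf[-1]   # sample ACF at lags 1..250

search_lags  <- 50:150                             # window around the first repeat
peak_lag     <- search_lags[which.max(r[search_lags])]
pitch_period <- peak_lag / fs                      # lag in points -> seconds

c(peak_lag = peak_lag, pitch_period = pitch_period)
```

With the real data, the same steps applied to `acf(speech, 250, plot = FALSE)` would locate the peak that the text places at about 106 to 109 points.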
A small 0.1 second (1000 points) sample of recorded speech for the phrase “aaa…hhh”.
The format is: Time-Series [1:1020] from 1 to 1020: 1814 1556 1442 1416 1352 …
data(
list = "speech",
package = "astsa"
)
astsa::acf1(
series = speech,
max.lag = 250
)
## [1] 0.92 0.71 0.42 0.12 -0.14 -0.34 -0.47 -0.54 -0.53 -0.46 -0.31 -0.12
## [13] 0.08 0.24 0.33 0.34 0.30 0.22 0.14 0.05 -0.02 -0.10 -0.16 -0.20
## [25] -0.21 -0.19 -0.13 -0.07 -0.01 0.04 0.07 0.09 0.09 0.09 0.06 0.03
## [37] -0.02 -0.07 -0.10 -0.12 -0.12 -0.11 -0.09 -0.07 -0.05 -0.03 -0.01 0.00
## [49] 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.00 -0.01 -0.03
## [61] -0.05 -0.07 -0.09 -0.10 -0.11 -0.10 -0.07 -0.04 -0.01 0.02 0.05 0.06
## [73] 0.07 0.07 0.06 0.03 -0.01 -0.06 -0.11 -0.15 -0.17 -0.16 -0.13 -0.07
## [85] -0.01 0.06 0.13 0.19 0.23 0.24 0.22 0.14 0.03 -0.11 -0.24 -0.34
## [97] -0.39 -0.38 -0.32 -0.20 -0.04 0.14 0.33 0.50 0.62 0.67 0.62 0.48
## [109] 0.30 0.10 -0.08 -0.22 -0.31 -0.36 -0.35 -0.30 -0.22 -0.10 0.02 0.12
## [121] 0.19 0.22 0.20 0.16 0.11 0.06 0.01 -0.03 -0.08 -0.11 -0.13 -0.13
## [133] -0.10 -0.07 -0.04 -0.01 0.02 0.03 0.05 0.05 0.04 0.02 -0.01 -0.04
## [145] -0.06 -0.08 -0.08 -0.08 -0.07 -0.06 -0.05 -0.04 -0.02 -0.01 0.00 0.01
## [157] 0.00 0.00 -0.01 -0.01 0.00 0.00 0.01 0.01 0.00 -0.01 -0.03 -0.04
## [169] -0.05 -0.06 -0.06 -0.07 -0.06 -0.05 -0.03 -0.01 0.02 0.03 0.04 0.04
## [181] 0.04 0.03 0.01 -0.01 -0.04 -0.07 -0.09 -0.09 -0.08 -0.05 -0.01 0.03
## [193] 0.06 0.08 0.09 0.10 0.10 0.08 0.03 -0.03 -0.09 -0.15 -0.18 -0.18
## [205] -0.15 -0.10 -0.03 0.04 0.13 0.21 0.30 0.35 0.37 0.35 0.28 0.19
## [217] 0.08 -0.01 -0.10 -0.16 -0.19 -0.20 -0.18 -0.15 -0.09 -0.02 0.04 0.08
## [229] 0.11 0.12 0.11 0.09 0.07 0.04 0.01 -0.02 -0.05 -0.07 -0.08 -0.08
## [241] -0.07 -0.06 -0.04 -0.03 -0.01 0.00 0.01 0.01 0.00 -0.01
$$\rho_{xy}(h) = \frac{\gamma_{xy}(h)}{\sqrt{\gamma_x(0)\,\gamma_y(0)}}$$
Notice that the use of the time-independent sample mean implies that the series are stationary.
$$-1 \le \hat{\rho}_{xy}(h) \le 1$$
The sample cross-correlation function can be examined graphically as a function of lag h to search for leading
or lagging relations.
$$\sigma_{\hat{\rho}_{xy}} = \frac{1}{\sqrt{n}}$$
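To make the lag convention concrete, here is a minimal sketch that computes the sample cross-correlation by hand and compares it with `stats::ccf`, whose value at lag $h$ estimates the correlation between $x_{t+h}$ and $y_t$. The helper `sample_ccf`, the simulated series, and the three-step delay are assumptions chosen for illustration.

```r
# Sample cross-correlation at lag h from the definition (hypothetical helper):
# gamma_xy(h) = (1/n) * sum over t of (x_{t+h} - xbar) * (y_t - ybar)
sample_ccf <- function(x, y, h) {
  n  <- length(x)
  xc <- x - mean(x)
  yc <- y - mean(y)
  if (h >= 0) {
    gamma_xy <- sum(xc[(1 + h):n] * yc[1:(n - h)]) / n
  } else {
    gamma_xy <- sum(xc[1:(n + h)] * yc[(1 - h):n]) / n
  }
  gamma_xy / sqrt(mean(xc^2) * mean(yc^2))   # divide by sqrt(gamma_x(0) * gamma_y(0))
}

set.seed(1)                        # arbitrary seed
n <- 200
w <- rnorm(n + 3)
y <- w[4:(n + 3)]                  # y_t = w_{t+3}
x <- w[1:n] + rnorm(n, sd = 0.5)   # x_t = w_t + noise, so x_{t+3} tracks y_t

lags      <- -10:10
r_manual  <- sapply(lags, function(h) sample_ccf(x, y, h))
r_builtin <- as.numeric(ccf(x, y, lag.max = 10, plot = FALSE)$acf)

peak_lag <- lags[which.max(abs(r_manual))]   # positive lag: y leads x
c(peak_lag = peak_lag, band = 2 / sqrt(n))
```

In this construction `y` leads `x` by three steps, so the peak falls at a positive lag; a peak at a negative lag would instead indicate that `x` leads `y`.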
The autocorrelation and cross-correlation functions can be used to analyze the joint behavior of two
stationary series.
We considered simultaneous monthly readings of the SOI and the number of new fish (Recruitment) computed
from a model. The figures show the autocorrelation and cross-correlation functions for these two series.
Both of the autocorrelation plots show periodicities corresponding to the correlation between values separated
by 12 units.
Observations 12 months or one year apart are strongly positively correlated, as are observations 24, 36, and 48 months apart.
Observations separated by six months are negatively correlated, showing that positive excursions tend to be
associated with negative excursions six months removed.
The cross-correlation plot, however, shows some departure from the cyclic component of each series and there
is an obvious peak at h = −6.
This result implies that SOI measured at time t − 6 months is associated with the Recruitment series at time t .
We could say the SOI leads the Recruitment series by six months.
The sign of the CCF is negative, leading to the conclusion that the two series move in different directions; that
is, increases in SOI lead to decreases in Recruitment and vice versa. (There is a non-linear relationship we will
discover later.)
The dashed lines shown on the plots are at $\pm 2/\sqrt{453} \approx \pm 0.094$, but since neither series is noise, these lines do
not apply.
data(
list = "rec",
package = "astsa"
)
astsa::acf1(
series = soi,
main = "Southern Oscillation Index"
)
## [1] 0.60 0.37 0.21 0.05 -0.11 -0.19 -0.18 -0.10 0.05 0.22 0.36 0.41
## [13] 0.31 0.10 -0.06 -0.17 -0.29 -0.37 -0.32 -0.19 -0.04 0.15 0.31 0.35
## [25] 0.25 0.10 -0.03 -0.16 -0.28 -0.37 -0.32 -0.16 -0.02 0.17 0.33 0.39
## [37] 0.30 0.16 0.00 -0.13 -0.24 -0.27 -0.25 -0.13 0.06 0.21 0.38 0.40
astsa::acf1(
series = rec,
main = "Recruitment"
)
## [1] 0.92 0.78 0.63 0.48 0.36 0.26 0.18 0.13 0.09 0.07 0.06 0.02
## [13] -0.04 -0.12 -0.19 -0.24 -0.27 -0.27 -0.24 -0.19 -0.11 -0.03 0.03 0.06
## [25] 0.06 0.02 -0.02 -0.06 -0.09 -0.12 -0.13 -0.11 -0.05 0.02 0.08 0.12
## [37] 0.10 0.06 0.01 -0.02 -0.03 -0.03 -0.02 0.01 0.06 0.12 0.17 0.20
astsa::ccf2(
x = soi,
y = rec,
main = "SOI vs Recruitment"
)
To use the large-sample distribution of cross-correlation property, at least one of the series must be white
noise.
If this is not the case, there is no simple way to tell if a cross-correlation estimate is significantly different from
zero.
Hence in the SOI and Recruitment example we were guessing at the linear dependence relationship.
$$x_t = 2\cos\left(2\pi t\,\tfrac{1}{12}\right) + w_{t1}$$
$$y_t = 2\cos\left(2\pi [t+5]\,\tfrac{1}{12}\right) + w_{t2}$$
The middle two figures show the autocorrelation plots, both of which exhibit the cyclic nature of each series.
The second-to-last plot is the cross-correlation plot, which appears to show cross-correlation even though the
series are independent.
The last plot is the cross-correlation plot between $x_t$ and the prewhitened $y_t$, which shows that the two sequences
are uncorrelated.
By prewhitening $y_t$, we mean that the signal has been removed from the data by running a regression of $y_t$ on
$\cos(2\pi t/12)$ and $\sin(2\pi t/12)$ and then subtracting off the predicted values, $y_t - \hat{y}_t$ (the raw residuals).
set.seed(
seed = 823
)
n = 120
t = 1:n
X = ts(
data = 2*cos(x = 2*pi*t/12) + rnorm(n = n),
freq = 12
)
Y = ts(
data = 2*cos(x = 2*pi*(t+5)/12) + rnorm(n = n),
freq = 12
)
Yw = resid(
object = lm(
formula = Y~ cos(2*pi*t/12) + sin(2*pi*t/12),
na.action = NULL
)
)
astsa::tsplot(
x = X
)
astsa::tsplot(
x = Y
)
astsa::acf1(
series = X,
max.lag = 48,
ylab = 'ACF(X)'
)
## [1] 0.58 0.32 -0.01 -0.30 -0.56 -0.68 -0.60 -0.35 0.01 0.33 0.57 0.61
## [13] 0.50 0.33 0.03 -0.33 -0.53 -0.63 -0.50 -0.28 0.03 0.25 0.48 0.56
## [25] 0.47 0.29 -0.04 -0.27 -0.45 -0.52 -0.48 -0.23 -0.03 0.27 0.44 0.45
## [37] 0.39 0.27 -0.02 -0.24 -0.42 -0.46 -0.37 -0.22 0.01 0.20 0.39 0.43
astsa::acf1(
series = Y,
max.lag = 48,
ylab = 'ACF(Y)'
)
## [1] 0.56 0.39 -0.03 -0.35 -0.57 -0.68 -0.56 -0.31 0.00 0.31 0.57 0.60
## [13] 0.55 0.28 -0.01 -0.30 -0.47 -0.57 -0.50 -0.31 0.00 0.24 0.46 0.60
## [25] 0.42 0.31 -0.01 -0.22 -0.45 -0.52 -0.45 -0.27 0.00 0.21 0.48 0.46
## [37] 0.43 0.21 0.05 -0.22 -0.37 -0.46 -0.39 -0.22 -0.06 0.20 0.33 0.44
astsa::ccf2(
x = X,
y = Y,
max.lag = 24
)
astsa::ccf2(
x = X,
y = Yw,
max.lag = 24,
ylim = c(-.6,.6)
)
Here is another, simpler example: the series are trend stationary with just a hint of trend, but the result is the same.
set.seed(
seed = 823
)
n = 250
t = 1:n
X = 0.01*t + rnorm(n = n,mean = 0,sd = 2)
Y = 0.01*t + rnorm(n = n)
astsa::tsplot(
x = cbind(X,Y),
spaghetti = TRUE,
col = astsa::astsa.col(
col = c(4,2),
alpha = 0.7
),
lwd = 2,
ylab = 'data'
)
astsa::ccf2(
x = X,
y = Y,
ylim = c(-.3,.3),
col = 4,
lwd = 2
)
Yw = astsa::detrend(
series = Y
) # whiten Y by removing trend
astsa::ccf2(
x = X,
y = Yw,
ylim = c(-0.3,0.3),
col = 4,
lwd = 2
)