
Tutorial 9 - Solutions

1. The us_employment file contains the total employment in different industries in the
United States. Using the industry: "Leisure and Hospitality":
a. Plot the data. Do the data need transforming? If so, find a suitable
transformation.

library(fpp3)

leisure <- us_employment |>
  filter(Title == "Leisure and Hospitality")
leisure |>
  autoplot(Employed)

This series clearly needs transforming – the variation at the beginning of the series is
small and increases substantially over time. We can use a log transformation, or a
Box-Cox transformation with λ chosen by the guerrero feature.

leisure |> features(Employed, guerrero)


## # A tibble: 1 x 2
## Series_ID lambda_guerrero
## <chr> <dbl>
## 1 CEU7000000001 0.00113

Lambda is very close to zero so a log transformation will work well.
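For example, we could plot the logged series to check that the variance is now roughly constant over time:

# Plot the log-transformed series; the seasonal variation should now be stable
leisure |>
  autoplot(log(Employed))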


b. Produce an STL decomposition of the data and describe the
trend and seasonality.

leisure |>
  model(STL(log(Employed))) |>
  components() |>
  autoplot()

The trend is roughly linear on a log scale.


The seasonality changes a lot over time, which is not surprising given that the series
spans about 80 years. The seasonal pattern changed in the 1990s to its current shape,
but the seasonal component hasn't fully captured that change, as can be seen from the
large movements in the remainder series during this period.

c. Are the data stationary? If not, find an appropriate differencing
which yields stationary data.
The data have both trend and seasonality, so they are certainly not stationary. Since
there are both trend and seasonality, we should try both a first and a seasonal
difference.

leisure |>
autoplot(log(Employed) |> difference(lag=12) |> difference())
The double differenced logged data is close to stationary, although the variance has
decreased over time.
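We could also confirm how much differencing is needed with unit root tests (a sketch; unitroot_nsdiffs suggests the number of seasonal differences, and unitroot_ndiffs the number of further first differences):

# How many seasonal differences are needed?
leisure |> features(log(Employed), unitroot_nsdiffs)
# After one seasonal difference, how many first differences are needed?
leisure |>
  mutate(log_diff = difference(log(Employed), 12)) |>
  features(log_diff, unitroot_ndiffs)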

d. Identify a couple of ARIMA models that might be useful in describing the time
series. Which of your models is the best according to their AICc values?

leisure |>
  gg_tsdisplay(log(Employed) |> difference(lag = 12) |> difference(),
               plot_type = "partial")
Using our guide for identifying terms – examining the ACF plot (for MA terms), we can
see that the last significant spike in the early lags is at lag 4, and that there is a
single seasonal spike at lag 12. Recall that we differenced the data twice – first and
seasonal differences. So a suggested model is ARIMA(0,1,4)(0,1,1)[12]. Note, the order
in brackets is (AR, I, MA), which we denote using (p,d,q) for the non-seasonal part and
(P,D,Q) for the seasonal part. The non-seasonal part is in the first brackets, (0,1,4)
in this case: p=0 because we are fitting a pure MA model (so AR(0)); d=1 because the
data are differenced once; and q=4 because of the last spike at lag 4 among the early
lags (there are also spikes after lag 4, which we ignore here).

The seasonal part is in the second brackets, (0,1,1) in this case. Again, P=0 because
we are fitting a pure MA model (so AR(0)); D=1 because the data are seasonally
differenced once; and Q=1 because there is only one significant seasonal lag, at lag 12
– this would have been Q=2 if we had spikes at both lags 12 and 24.

Alternatively, and in a similar way, examining the PACF plot (for AR terms), we can
see that the last significant spike in the early lags is at lag 3, so we use a
non-seasonal AR(3); and we have seasonal spikes at lags 12 and 24, so we can use a
seasonal AR(2) – together, ARIMA(3,1,0)(2,1,0)[12]. You can experiment with different
models, as this is just a guide.

Which of your models is the best according to their AICc values?

fit <- leisure |>
  model(
    arima310210 = ARIMA(log(Employed) ~ pdq(3,1,0) + PDQ(2,1,0)),
    arima014011 = ARIMA(log(Employed) ~ pdq(0,1,4) + PDQ(0,1,1))
  )
glance(fit)

The ARIMA(0,1,4)(0,1,1) model seems to be better, as it has the lower AICc
(-7260 vs -7258).

e. Examine the residuals, do they resemble white noise? If not, try to find another
ARIMA model which fits better.

fit |> select(arima014011) |>
  gg_tsresiduals()

There is significant autocorrelation at a few lags (for white noise we would expect at
most one or two of the roughly 20 plotted lags to be significant by chance).
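We could make this check formal with a Ljung-Box test (a sketch; dof = 5 corresponds to the five estimated MA coefficients of ARIMA(0,1,4)(0,1,1), and lag = 24, two seasonal periods, is a common choice):

# Ljung-Box test on the ARIMA(0,1,4)(0,1,1) residuals (dof = 4 + 1 = 5)
fit |> select(arima014011) |>
  augment() |>
  features(.innov, ljung_box, dof = 5, lag = 24)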
We will now use the ARIMA() function to select a model automatically:

fit <- leisure |>
  model(
    arima310210 = ARIMA(log(Employed) ~ pdq(3,1,0) + PDQ(2,1,0)),
    arima014011 = ARIMA(log(Employed) ~ pdq(0,1,4) + PDQ(0,1,1)),
    auto = ARIMA(log(Employed))
  )
glance(fit)
fit |> select(auto) |> report()

The automatically selected ARIMA(1,1,2)(2,1,1)[12] model is better than either of the
previous choices, as it has the lowest AICc (-7281).
For interpreting the coefficients, note that the seasonal and non-seasonal parts of
the model are simply multiplied together, as shown below.
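For example, writing $w_t = \log(\text{Employed})$ and using the backshift operator $B$, the ARIMA(1,1,2)(2,1,1)[12] model above is

$$(1 - \phi_1 B)(1 - \Phi_1 B^{12} - \Phi_2 B^{24})(1 - B)(1 - B^{12})\, w_t = (1 + \theta_1 B + \theta_2 B^2)(1 + \Theta_1 B^{12})\, \varepsilon_t,$$

where each seasonal polynomial multiplies the corresponding non-seasonal one.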
Note on transformations: since we have transformed the data here, we need to reverse
the transformation (back-transform) to obtain forecasts on the original scale. For
example, if you take the log of y, the reverse is exp. In the same way, we can
back-transform a Box-Cox (or any other) transformation by inverting the Box-Cox
formula (see Section 3.1 of fpp2 on transformations). In most cases this will not be
necessary, as fable back-transforms the forecasts automatically.
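For reference, if $w_t$ denotes the Box-Cox transformed series, the back-transformation is

$$y_t = \begin{cases} \exp(w_t) & \text{if } \lambda = 0, \\ (\lambda w_t + 1)^{1/\lambda} & \text{otherwise.} \end{cases}$$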
We can now examine the new model’s residuals:

fit |> select(auto) |>
  gg_tsresiduals()
The residuals look much the same as for the previous model. Note, however, that the
significant autocorrelations do not occur at the early lags or the seasonal lags. This
indicates that point forecasts from this model could be good, but prediction intervals
computed assuming a normal distribution may be inaccurate; one alternative is sketched
below.
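One option in that case (a sketch; bootstrap is an argument of fable's forecast()) is to compute prediction intervals by bootstrapping the residuals rather than assuming normality:

# Bootstrapped prediction intervals avoid the normality assumption
fit |> select(auto) |>
  forecast(h = "3 years", bootstrap = TRUE)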

f. Forecast the next 3 years of data and comment on the results.

fc <- fit |>
  forecast(h = "3 years")
fc |>
  filter(.model == "auto") |>
  autoplot(us_employment |> filter(year(Month) > 2000))
Remember to always put what you are forecasting in context. Here you are forecasting
employment in the Leisure and Hospitality industry in the US. This industry has
obviously been affected by COVID and might continue to be affected in the near future,
so we need to be careful with our long-term forecast of increased employment.

2. Use the Australian production of electricity from aus_production for the following
questions.

a. Plot the data. Do the data need transforming? If so, find a suitable
transformation.

aus_production |> autoplot(Electricity)


Yes, these need transforming to make the variance more constant.

lambda <- aus_production |>
  features(Electricity, guerrero) |>
  pull(lambda_guerrero)
aus_production |>
  autoplot(box_cox(Electricity, lambda))
The Guerrero method suggests a Box-Cox transformation with parameter λ = 0.53 (λ = 0
would have given similar results). (You can also find lambda = 0.53 in the Environment
pane.)

b. Are the data stationary? If not, find an appropriate differencing which yields
stationary data.

The trend and seasonality show that the data are not stationary.
We will first try a seasonal difference, with lag 4 since the data are quarterly.

aus_production |>
gg_tsdisplay(box_cox(Electricity, lambda) |> difference(4), plot_type = "partial")
It seems we could have stopped at the seasonal difference alone, but we will take a
first-order difference as well to bring the series closer to stationarity.

aus_production |>
  gg_tsdisplay(box_cox(Electricity, lambda) |> difference(4) |> difference(1),
               plot_type = "partial")
Taking the first difference as well has only a slight additional effect on stationarity.
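A unit root test on the seasonally differenced series gives a more formal answer (a sketch; unitroot_ndiffs reports how many further first differences are needed):

# Are further first differences needed after the seasonal difference?
aus_production |>
  mutate(elec_sdiff = difference(box_cox(Electricity, lambda), 4)) |>
  features(elec_sdiff, unitroot_ndiffs)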

c. Identify a couple of ARIMA models that might be useful in describing the time
series. Which of your models is the best according to their AIC values?

Using our guide for identifying terms – examining the ACF plot (for MA terms), we can
see a significant spike in the early lags at lag 1, and seasonal spikes at lags 4 and
8 (remember the data are quarterly). Recall that we differenced the data twice – first
and seasonal differences. So a suggested model is ARIMA(0,1,1)(0,1,2)[4]. Note that
the spike at lag 4 could also be treated as non-seasonal, so a non-seasonal MA(4) is
another option.

Examining the PACF plot (for AR terms), we can see a spike in the early lags at lag 1,
so we use a non-seasonal AR(1); and there are seasonal spikes at lags 4, 8, 12 and 16,
so we can use a seasonal AR(4) – together, ARIMA(1,1,0)(4,1,0)[4]. You can experiment
with different models, as this is just a guide, for example ARIMA(1,1,0)(0,1,2)[4]. We
will also use the automatic selector to compare models.

fit <- aus_production |>
  model(
    arima011012 = ARIMA(box_cox(Electricity, lambda) ~ 0 + pdq(0, 1, 1) + PDQ(0, 1, 2)),
    arima110410 = ARIMA(box_cox(Electricity, lambda) ~ 0 + pdq(1, 1, 0) + PDQ(4, 1, 0)),
    arima110012 = ARIMA(box_cox(Electricity, lambda) ~ 0 + pdq(1, 1, 0) + PDQ(0, 1, 2)),
    auto = ARIMA(box_cox(Electricity, lambda))
  )
fit |> select(auto) |>
  report()
glance(fit)

Automatic model selection has also taken a first-order difference, so all four models
have the same order of differencing and their AICc values are directly comparable
(AICc cannot be used to compare models with different orders of differencing).
ARIMA(1,1,4)(0,1,1) was selected as it has the lowest AICc.
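To see the full ranking, we can sort the glance() output by AICc:

# Rank the fitted models by AICc (smallest is best)
glance(fit) |>
  arrange(AICc) |>
  select(.model, AICc)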

d. Examine the residuals, do they resemble white noise? If not, try to find
another ARIMA model which fits better.

fit |> select(auto) |>
  gg_tsresiduals()
The residuals resemble white noise; only one lag (22) is significant.

fit |> select(auto) |>
  augment() |>
  features(.innov, ljung_box, dof = 6, lag = 12)

## # A tibble: 1 x 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 auto 8.55 0.201

The Ljung-Box test has a large p-value of 0.201 (larger than 5%), so we cannot reject
the null hypothesis (H0) that the residuals are white noise. This confirms what the
residual plot above showed. (Here dof = 6 because the selected model has six estimated
coefficients: p + q + P + Q = 1 + 4 + 0 + 1.)

e. Forecast the next 24 months of data using your preferred model.

fit |> select(auto) |>
  forecast(h = "2 years") |>
  autoplot(aus_production)
This seems like a reasonable forecast.

f. Repeat this exercise with Gas production.

#Q2f
aus_production |> autoplot(Gas)
# Clearly a transformation is needed
lambda <- aus_production |>
  features(Gas, guerrero) |>
  pull(lambda_guerrero)
aus_production |>
  autoplot(box_cox(Gas, lambda))
lambda
# Variance seems more stable after the Box-Cox transformation with lambda of 0.12.
# You can also find this number (lambda) in the Environment pane.
#b
# The data will need differencing

aus_production |>
  gg_tsdisplay(box_cox(Gas, lambda) |> difference(4) |> difference(1),
               plot_type = "partial")
# First and seasonal differencing make the series stationary
#c
# Try different models. Looking at the ACF and PACF we can try arima013011 or
# arima310210
fit <- aus_production |> model(
  arima013011 = ARIMA(box_cox(Gas, lambda) ~ 0 + pdq(0, 1, 3) + PDQ(0, 1, 1)),
  arima310210 = ARIMA(box_cox(Gas, lambda) ~ 0 + pdq(3, 1, 0) + PDQ(2, 1, 0)),
  auto = ARIMA(box_cox(Gas, lambda))
)
fit |> select(auto) |>
  report()
# Auto selection is ARIMA(2,1,2)(1,1,1)[4]
glance(fit)

# We can compare the AIC of the different models as all have the same order of
# differencing. It seems that our arima013011 is actually a bit better than the
# auto-selected model (it has a lower AIC).

#d Examine the residuals, do they resemble white noise? If not, try to
# find another ARIMA model which fits better.

fit |> select(auto) |>
  gg_tsresiduals()
fit |> select(auto) |>
  augment() |>
  features(.innov, ljung_box, dof = 6, lag = 12)

# Residuals seem to resemble white noise: the ljung_box p-value is higher than 5%
# and only one lag (lag 8) is significant
#e Forecast the next 24 months of data using your preferred model
fit |> select(auto) |>
forecast(h = "2 years") |>
autoplot(aus_production)

#seems to be a reasonable forecast
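# Since arima013011 had the slightly lower AIC, we could also forecast with it
# for comparison (a sketch):
fit |> select(arima013011) |>
  forecast(h = "2 years") |>
  autoplot(aus_production)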
