
Project – Time Series Forecasting

Project Report


Table of Contents

1 Project Objective
2 Time Series Forecast – Step by step approach
   2.1 Environment Set up and Data Import
   2.2 Exploratory analysis and Identification of components
   2.3 Decomposition and De-seasonalisation
   2.4 Test for stationary series
   2.5 Exploring Auto and Partial Correlation
   2.6 Developing the ARIMA Model and Forecasting
   2.7 Accuracy of the model
3 Conclusion
4 Appendix A – Source Code

1 Project Objective

For this assignment, you are requested to download the forecast package in R. The
package contains methods and tools for displaying and analysing univariate time series
forecasts, including exponential smoothing via state-space models and automatic ARIMA
modelling. Explore the gas (Australian monthly gas production) dataset in the forecast
package to do the following:

• Read the data as a time series object in R. Plot the data.
• What do you observe? Which components of the time series are present in this dataset? What is the periodicity of the dataset?
• Is the time series stationary? Inspect visually and conduct an ADF test. Write down the null and alternate hypotheses for the stationarity test. De-seasonalise the series if seasonality is present.
• Develop an ARIMA model to forecast the next 12 periods. Use both manual and auto.arima (show and explain all the steps).
• Report the accuracy of the model.

2 Time Series Forecast – Step by step approach

A typical data exploration and forecasting activity consists of the following steps:

1. Environment Set up and Data Import


2. Exploratory analysis and Identification of components
3. Decomposition and De-seasonalisation
4. Test for stationary series
5. Exploring Auto and Partial Correlation
6. Developing the ARIMA model and forecasting
7. Accuracy of the model

2.1 Environment Set up and Data Import


Setting a working directory at the start of the R session makes importing and exporting
data files and code files easier. The working directory is simply the location (folder) on the
PC where the data, code and other files related to the project are kept.
The following libraries are needed for the time series analysis and forecasting of the data;
any other libraries required later will be loaded as the need arises.
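A minimal set-up sketch is shown below; the working-directory path is only a placeholder, not the actual project folder.

# Placeholder path - replace with the actual project folder
setwd("~/projects/time-series-forecasting")

library(forecast)  # time series tools, auto.arima and the gas dataset
library(tseries)   # adf.test for the stationarity check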

The given dataset is available within the forecast library, as described in the question.

Please refer to Appendix A for the full source code.

2.2 Exploratory analysis and Identification of components


The data was viewed and studied before further analysis. The object imported from the
library is already a time series, so no conversion is needed; its details, such as start point,
end point and frequency, were checked.

The series starts in January 1956 and ends in August 1995. There are 476 observations in
total, and the frequency is 12, which means the data is recorded every month, so the
periodicity of the data can be taken as 12. There are no missing values in the series.
The summary statistics of the data were then checked, and the data was plotted to
visualise its characteristics.
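A short sketch of these checks; the commented results match the outputs reported in Appendix A.

myts <- gas          # gas ships with the forecast package
class(myts)          # "ts" - already a time series object
start(myts)          # 1956 1  -> January 1956
end(myts)            # 1995 8  -> August 1995
frequency(myts)      # 12      -> monthly data
length(myts)         # 476 observations
summary(myts)        # summary statistics
plot(myts, main = "Time Series Plot")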

From the plot, it is visible that the series contains a clear trend and seasonality; cyclic
components are not very visible in the data. There is no trend in the initial part of the data:
the trend starts around 1970 and persists for the rest of the series. As time passes, gas
production increases evidently, and there is a definite seasonal component throughout the
data; the seasonality is also magnified once the trend sets in.
How the data changed over the years within each calendar month is another important
observation, and it can be checked in R using the function "monthplot".

The month plot shows that the early data is roughly constant in every month, and that the
values increase rapidly after a point, which is evidently around 1970.
The periodicity of the data can be analysed further using the autocorrelation function,
which makes the periodicity present in the data visible in a plot.
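Both views are one-liners:

monthplot(myts)  # behaviour of each calendar month over the years
acf(myts)        # autocorrelation reveals the periodicity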

2.3 Decomposition and De-seasonalisation

The components present in the time series can be identified further by decomposing it
into trend, seasonality and remainder. This can be done in R, and each component can be
plotted independently.

Initially the seasonality was assumed to be constant (a periodic seasonal window), which
left a seasonal pattern in the remainder. The seasonal window was therefore set to a
numeric value to reduce this error; different values were tried, and 7 looked best among them.
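A sketch of the decomposition step, using the seasonal window that worked best here:

decomp7 <- stl(myts, s.window = 7)  # LOESS-based decomposition
plot(decomp7, main = "Components of the time series data")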

Seasonality can be an issue while forecasting, so the seasonal component is removed from the
data. De-seasonalisation helps us avoid the errors that could arise from the seasonal
component. Since the data is already decomposed, adding the trend and remainder
components gives the de-seasonalised data.
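The de-seasonalised series is then just the sum of two of the decomposed columns:

# Columns of decomp7$time.series: seasonal, trend, remainder
DSdata <- decomp7$time.series[, "trend"] + decomp7$time.series[, "remainder"]
ts.plot(DSdata, myts, col = c("red", "blue"))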

2.4 Test for stationary series

Before developing an ARIMA model, we need to check whether the data is stationary. If the
series is not stationary, the model is unlikely to give a proper forecast; a non-stationary
series can be converted to a stationary one by taking the difference of the series.
Visual inspection shows that the mean and variance are not constant over the series: there
are very rapid variations in the data, so the mean and variance change significantly.
Before proceeding further, it is important to confirm formally whether the series is stationary.
For that, we use the Augmented Dickey-Fuller (ADF) test. For this test, the null and
alternative hypotheses are as follows:

H0: The series is not stationary
H1: The series is stationary

If the ADF test returns a p-value less than 0.05, the null hypothesis is rejected and the
series can be treated as stationary. If the p-value is greater than 0.05, we fail to reject the
null hypothesis, so the series is non-stationary and must be transformed (for example, by
differencing) before modelling.
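The test itself is a single call; the commented p-value is the one reported in Appendix A.

adf.test(myts)  # p-value = 0.2764 -> fail to reject H0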

Here the p-value is 0.2764, far above 0.05, so we fail to reject the null hypothesis: the
series is not stationary. The data is converted to a stationary series by taking the
first-order difference of the series.

After differencing the series, the ADF test was conducted again. The p-value is now below
0.01, so the null hypothesis is rejected: the differenced series is stationary.
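A sketch of the differencing step and the re-test:

difdata <- diff(myts)  # first-order difference
adf.test(difdata)      # p-value < 0.01 -> reject H0: stationary now
plot(difdata)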

From the graph, it is evident that the series has changed substantially: the mean stays
similar throughout the data, the spread looks significantly better than in the original data,
and the seasonality and trend are no longer prominent.

2.5 Exploring Auto and Partial Correlation


The differenced (stationary) series was further analysed for autocorrelation and partial
autocorrelation, which measure how values in the series are correlated with their own past
values at various lags. Autocorrelation shows how the current value depends on past values,
while partial autocorrelation shows the influence of each lag after removing the effect of
all the shorter lags in between.

From the plot, it is observed that there is still some periodicity in the data. The
autocorrelation up to lag 20 was computed and plotted in R. The correlations are significant
at all but 3 lags, as the ACF values there lie outside the blue lines that mark the
significance threshold.
Similarly, the partial autocorrelation was checked.

Here too, traces of the periodicity of the data can be seen, as the correlation values show a
seasonality-like pattern. The significance drops off after the 12th lag, so the maximum
number of predictors will be around 10 to 12.
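Both correlograms are computed from the differenced series:

acf(difdata, lag.max = 20)   # autocorrelation up to 20 lags
pacf(difdata, lag.max = 20)  # partial autocorrelation up to 20 lags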

2.6 Developing the ARIMA Model and Forecasting

The ARIMA model was generated in R by trying different p, d and q values. Among the
candidates, ARIMA(p = 1, d = 1, q = 0) was the simplest model: p = 1 is suggested by the
high correlation at the first lag, and d is kept at 1 because the data needed one differencing
to become stationary. If the model needed a higher-order difference series, the d value
could be increased, at the cost of a more complex model. The moving-average order q is
kept at 0. With this, the model was fitted and a forecast was made for 12 months.
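A sketch of the manual fit and forecast:

arima110 <- arima(myts, order = c(1, 1, 0))  # p = 1, d = 1, q = 0
arima110_forecast <- forecast(arima110, h = 12)
ts.plot(myts, fitted(arima110), col = c("blue", "red"))  # actual vs fitted
plot(arima110_forecast)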

Once the model was fitted, the fitted values were extracted to check how well the model
follows the actual values. From the plot, it is visible that the fitted values follow the
actual values quite well.

To see the distribution of the residuals, a histogram was plotted; it looked approximately
like a normal distribution.

To check whether the residuals of the fit are white noise, their autocorrelation was
examined. A small amount of periodicity is present, but it could arguably be ignored as the
significance is comparatively small.

Once the fitting is satisfactory, the model is used to forecast the next 12 months, and the
forecast values looked as follows.

The point forecasts for the next 12 months look almost identical after the 7th point, which
is expected, since the forecasts of a simple AR(1) model converge quickly to a constant
level. The forecast values were then plotted with 80% and 95% confidence intervals.


The model was validated using the Ljung-Box test.

The results are not satisfactory: the p-value is very low, so we cannot conclude that the
residuals are pure white noise. We therefore try the auto.arima method.
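The validation step in code:

# H0: residuals are independently distributed (white noise)
Box.test(arima110_forecast$residuals, lag = 20, type = "Ljung-Box")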

With auto.arima, the function itself searches over the parameter values and suggests the
best possible model.
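A minimal sketch; unlike the manual model, auto.arima also selects the seasonal orders.

autoarima <- auto.arima(myts)
autoarima  # selects ARIMA(2,1,1)(0,1,1)[12] for this series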

After developing the model, the fit was studied to check how closely it follows the actual
values.

From the graph, it is clear that the fit is very close to the actual values, so the error will
be very small.
The residuals of the fitted model were then checked: the errors look random and follow an
approximately normal distribution centred around 0. The autocorrelation of the residuals
was also examined to confirm that no significant trend or periodicity remains; the
remaining correlations are negligible.
So the 12-month prediction can be made using this model.

The forecast can be produced once we are satisfied with the model.

The forecast values were plotted with 80% and 95% intervals as follows. The forecast
blends in with the history, which means the predicted values follow the same
characteristics as the actual data.

To check whether the residuals are white noise or still carry some structure, the Ljung-Box
test was conducted. This time the p-value is high, so we fail to reject the null hypothesis:
the residuals are randomly distributed.

2.7 Accuracy of the model

The accuracy of the model can be assessed using various metrics, such as MAPE, MAE and MSE.

The mean absolute percentage error is found to be 3.9%, which is a very good value. From
this we can say that the model built using the auto.arima method is really good, and the
fitted values are very close to the actual values.
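These metrics reduce to simple means of the fitted-value errors. The sketch below assumes the auto.arima fit from Appendix A; the MLmetrics functions used there compute the same quantities.

fit <- fitted(autoarima)
mean(abs((myts - fit) / myts))  # MAPE ~ 0.039 (3.9%)
mean(abs(myts - fit))           # MAE  ~ 893
mean((myts - fit)^2)            # MSE  ~ 2.49e6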

3 Conclusion
The data was imported and studied using exploratory data analysis techniques. The series
was plotted and checked for visible trend, seasonality and other characteristics; both
seasonality and trend were clearly visible. The components were then identified by
decomposing the data into trend, seasonality and remainder. After that, the data was freed
from the seasonal effect by keeping only the trend and remainder components, and the
output was plotted to observe the difference.
The series was then checked for stationarity, first visually and then using the Augmented
Dickey-Fuller test. The test failed to reject the null hypothesis, so the series turned out to
be non-stationary. The series was made stationary by taking the first difference, and the
test was conducted again; this time it confirmed that the differenced series is stationary,
even though some components of periodicity were still visible in it.
An ARIMA model was built in R by trying various p, d and q values, settling on a simple
ARIMA model. The fitted values of the ARIMA model were compared with the actual values,
and the result was satisfactory. However, when the 12-month forecast was examined, the
output was not satisfactory, as the residuals failed to show randomness.
An auto.arima model was then used to build a better model, and it too gave a satisfactory
result when the actual values were compared with the fitted values. The model was used to
forecast 12 months ahead, and the prediction looked good visually. The accuracy of the
prediction was calculated using MAPE, MAE, MSE and other metrics; the MAPE value was
3.9%, which is really promising for a forecast model.
The auto.arima model gave a better result than the manual ARIMA model because
auto.arima evaluates many combinations of p, d and q values to optimise the model for
minimum error. Building a comparable model manually by trial and error would not be easy.

4 Appendix A – Source Code

project.R
VIJITH

2019-08-02
#setting up the working directory
setwd("F:/BABI/Time series forecasting/project")

# Calling required libraries


library(forecast)

## Registered S3 method overwritten by 'xts':


## method from
## as.zoo.xts zoo

## Registered S3 method overwritten by 'quantmod':


## method from
## as.zoo.data.frame zoo

## Registered S3 methods overwritten by 'forecast':


## method from

## fitted.fracdiff fracdiff
## residuals.fracdiff fracdiff

library(quantmod)

## Loading required package: xts

## Loading required package: zoo

##
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':


##
## as.Date, as.Date.numeric

## Loading required package: TTR

## Version 0.4-0 included new data defaults. See ?getSymbols.

library(tseries)

# Importing the data


myts <- gas
class(myts)

## [1] "ts"

start(myts)

## [1] 1956 1

end(myts)

## [1] 1995 8

length(myts)

## [1] 476

frequency(myts)

## [1] 12

periodicity(myts)

## Monthly periodicity from Jan 1956 to Aug 1995

length(na.omit(myts))

## [1] 476

summary(myts)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 1646 2675 16788 21415 38629 66600

plot(myts,main="Time Series Plot")



monthplot(myts)

acf(myts)

#Decomposition of the components.


decomp <- stl(myts, s.window = "periodic")  # constant seasonal component
plot(decomp)

decomp5 <- stl(myts, s.window = 5)
plot(decomp5)

decomp25 <- stl(myts, s.window = 25)
plot(decomp25)

decomp7 <- stl(myts, s.window = 7)
plot(decomp7, main = "Components of the time series data")

head(decomp7$time.series)

## seasonal trend remainder


## Jan 1956 -362.8624 2018.888 52.974659
## Feb 1956 -421.7285 2023.650 44.078340
## Mar 1956 -253.9505 2028.413 19.537938
## Apr 1956 -162.8653 2033.175 7.690425
## May 1956 182.6463 2038.217 -47.862905
## Jun 1956 312.7999 2043.258 -35.058210

DSdata <- decomp7$time.series[, 2] + decomp7$time.series[, 3]  # trend + remainder
ts.plot(DSdata, myts, col = c("red", "blue"), main = "Comparison of data with and without seasonality")

# Checking for stationarity (Augmented Dickey-Fuller test)


#Null Hypothesis (H0): Series is not stationary
#Alternate Hypothesis (H1): Series is stationary
adf.test(myts)

##
## Augmented Dickey-Fuller Test
##
## data: myts
## Dickey-Fuller = -2.7131, Lag order = 7, p-value = 0.2764
## alternative hypothesis: stationary

difdata<- diff(myts)
adf.test(difdata)

## Warning in adf.test(difdata): p-value smaller than printed p-value

##
## Augmented Dickey-Fuller Test
##
## data: difdata
## Dickey-Fuller = -19.321, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

plot(difdata)

# Autocorrelation of the data


acf(difdata, lag.max = 10)

acf(difdata, lag.max = 20, plot = FALSE)

##
## Autocorrelations of series 'difdata', by lag
##
## 0.0000 0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500
## 1.000 0.295 0.241 -0.063 -0.263 -0.377 -0.572 -0.423 -0.236 -0.059
## 0.8333 0.9167 1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833
## 0.246 0.414 0.679 0.362 0.200 -0.022 -0.223 -0.368 -0.543 -0.409

## 1.6667
## -0.252

# Partial AutoCorrelation
pacf(difdata, lag.max = 20)

pacf(difdata, lag.max = 20, plot = FALSE)

##
## Partial autocorrelations of series 'difdata', by lag
##
## 0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500 0.8333
## 0.295 0.168 -0.193 -0.282 -0.242 -0.440 -0.324 -0.206 -0.305 -0.183
## 0.9167 1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833 1.6667
## -0.059 0.358 0.068 -0.040 0.004 0.041 0.051 -0.025 0.000 -0.055

#Simple ARIMA with p=1, d=1, q=0


arima110 <- arima(myts, order = c(1, 1, 0))
arima110

##
## Call:
## arima(x = myts, order = c(1, 1, 0))
##
## Coefficients:
## ar1
## 0.2994
## s.e. 0.0440
##
## sigma^2 estimated as 7230640: log likelihood = -4425.08, aic = 8854.16

arima110fit <- fitted(arima110)
ts.plot(myts, arima110fit, col = c("blue", "red"))

arima110_forecast <- forecast(arima110, h = 12)
arima110_forecast

## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Sep 1995 58094.15 54648.08 61540.22 52823.84 63364.46
## Oct 1995 57507.38 51857.05 63157.71 48865.94 66148.81
## Nov 1995 57331.70 49926.37 64737.03 46006.23 68657.17
## Dec 1995 57279.10 48410.87 66147.33 43716.32 70841.89
## Jan 1996 57263.35 47128.12 67398.59 41762.85 72763.86
## Feb 1996 57258.64 45994.45 68522.83 40031.54 74485.73
## Mar 1996 57257.23 44966.36 69548.09 38459.97 76054.49
## Apr 1996 57256.81 44018.37 70495.24 37010.37 77503.24
## May 1996 57256.68 43134.04 71379.32 35657.96 78855.39
## Jun 1996 57256.64 42301.96 72211.32 34385.43 80127.85
## Jul 1996 57256.63 41513.81 72999.45 33180.07 81333.19
## Aug 1996 57256.63 40763.29 73749.96 32032.25 82481.00

plot(arima110_forecast)

hist(arima110$residuals)

acf(arima110$residuals)

#validate ARIMA -110


Box.test(arima110_forecast$residuals, lag=20, type="Ljung-Box")

##
## Box-Ljung test
##
## data: arima110_forecast$residuals
## X-squared = 529.75, df = 20, p-value < 2.2e-16

# auto.arima
autoarima <- auto.arima(myts)
autoarima

## Series: myts
## ARIMA(2,1,1)(0,1,1)[12]
##
## Coefficients:
## ar1 ar2 ma1 sma1
## 0.3756 0.1457 -0.8620 -0.6216
## s.e. 0.0780 0.0621 0.0571 0.0376
##
## sigma^2 estimated as 2587081: log likelihood=-4076.58
## AIC=8163.16 AICc=8163.29 BIC=8183.85

auto.arima(myts, ic = "aic", trace = TRUE)

##
## Fitting models using approximations to speed things up...
##
## ARIMA(2,1,2)(1,1,1)[12] : 7975.349
## ARIMA(0,1,0)(0,1,0)[12] : 8195.019
## ARIMA(1,1,0)(1,1,0)[12] : 8058.749
## ARIMA(0,1,1)(0,1,1)[12] : 7967.206
## ARIMA(0,1,1)(0,1,0)[12] : 8099.363
## ARIMA(0,1,1)(1,1,1)[12] : 7981.228
## ARIMA(0,1,1)(0,1,2)[12] : 7969.036

## ARIMA(0,1,1)(1,1,0)[12] : 8022.449
## ARIMA(0,1,1)(1,1,2)[12] : 7983.344
## ARIMA(0,1,0)(0,1,1)[12] : 8046.55
## ARIMA(1,1,1)(0,1,1)[12] : 7962.501
## ARIMA(1,1,1)(0,1,0)[12] : 8099.564
## ARIMA(1,1,1)(1,1,1)[12] : 7976.677
## ARIMA(1,1,1)(0,1,2)[12] : 7964.495
## ARIMA(1,1,1)(1,1,0)[12] : 8021.922
## ARIMA(1,1,1)(1,1,2)[12] : 7978.528
## ARIMA(1,1,0)(0,1,1)[12] : 7989.013
## ARIMA(2,1,1)(0,1,1)[12] : 7959.637
## ARIMA(2,1,1)(0,1,0)[12] : 8084.795
## ARIMA(2,1,1)(1,1,1)[12] : 7973.753
## ARIMA(2,1,1)(0,1,2)[12] : 7961.349
## ARIMA(2,1,1)(1,1,0)[12] : 8023.178
## ARIMA(2,1,1)(1,1,2)[12] : Inf
## ARIMA(2,1,0)(0,1,1)[12] : 7984.839
## ARIMA(3,1,1)(0,1,1)[12] : 7962.143
## ARIMA(2,1,2)(0,1,1)[12] : 7961.233
## ARIMA(1,1,2)(0,1,1)[12] : 7959.927
## ARIMA(3,1,0)(0,1,1)[12] : 7977.683
## ARIMA(3,1,2)(0,1,1)[12] : 7964.138
##
## Now re-fitting the best model(s) without approximations...
##
## ARIMA(2,1,1)(0,1,1)[12] : 8163.16
##
## Best model: ARIMA(2,1,1)(0,1,1)[12]

## Series: myts
## ARIMA(2,1,1)(0,1,1)[12]
##
## Coefficients:
## ar1 ar2 ma1 sma1
## 0.3756 0.1457 -0.8620 -0.6216
## s.e. 0.0780 0.0621 0.0571 0.0376
##
## sigma^2 estimated as 2587081: log likelihood=-4076.58
## AIC=8163.16 AICc=8163.29 BIC=8183.85

autoarimafitted <- fitted(autoarima)
ts.plot(myts, autoarimafitted, col = c("blue", "red"))

hist(autoarima$residuals)

acf(autoarima$residuals)

autoarima_forecast <- forecast(autoarima, h = 12)


autoarima_forecast

## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Sep 1995 56907.83 54846.53 58969.13 53755.35 60060.32
## Oct 1995 52476.28 50158.98 54793.57 48932.28 56020.27
## Nov 1995 49719.79 47202.83 52236.76 45870.42 53569.16
## Dec 1995 43473.70 40830.29 46117.11 39430.96 47516.45
## Jan 1996 43318.41 40575.77 46061.05 39123.91 47512.91
## Feb 1996 43601.71 40776.80 46426.62 39281.38 47922.04
## Mar 1996 46668.11 43770.45 49565.78 42236.52 51099.71
## Apr 1996 49376.36 46411.96 52340.76 44842.70 53910.02
## May 1996 57536.25 54509.03 60563.46 52906.53 62165.97
## Jun 1996 62184.69 59097.40 65271.98 57463.09 66906.29
## Jul 1996 65795.74 62650.40 68941.08 60985.35 70606.13
## Aug 1996 63391.54 60189.71 66593.37 58494.76 68288.31

plot(autoarima_forecast)

#validate auto arima


Box.test(autoarima_forecast$residuals, type="Ljung-Box")

##
## Box-Ljung test
##
## data: autoarima_forecast$residuals
## X-squared = 0.0024719, df = 1, p-value = 0.9603

# Accuracy of the model


library(MLmetrics)

##
## Attaching package: 'MLmetrics'

## The following object is masked from 'package:base':


##
## Recall

library(Metrics)

##
## Attaching package: 'Metrics'

## The following object is masked from 'package:forecast':


##
## accuracy

MAPE(autoarimafitted,myts)

## [1] 0.03900233

MAE(autoarimafitted,myts)

## [1] 893.4504

MSE(autoarimafitted,myts)

## [1] 2494685
