Project Report
1 Project Objective
For this assignment, you are requested to download the forecast package in R. The package contains methods and tools for displaying and analysing univariate time series forecasts, including exponential smoothing via state-space models and automatic ARIMA modelling. Explore the gas (Australian monthly gas production) dataset in the forecast package to do the following:
The given dataset is available within the forecast library as described in the question. The data is already a time series object, so there is no need to convert it. The series starts in January 1956 and ends in August 1995, a total of 476 observations. The frequency is 12, meaning one observation is recorded every month, so the periodicity of the data can be taken as 12. As it is a regular time series, there are no missing values.
On further analysis, the five-number summary of the data was checked. Then the data was plotted to visualise its characteristics.
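These exploration steps can be sketched in R as follows (a minimal sketch; the variable name myts matches the code appendix):

```r
# the gas dataset (Australian monthly gas production) ships with the forecast package
library(forecast)

myts <- gas
start(myts)       # 1956 1
end(myts)         # 1995 8
frequency(myts)   # 12 (monthly data)
summary(myts)     # five-number summary plus the mean
plot(myts, xlab = "Year", ylab = "Production",
     main = "Australian monthly gas production")
```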
From the plot, it is visible that there is a certain level of trend and seasonality, while cyclic components are not very apparent. There is no trend in the initial part of the data; from around 1970 a trend starts and carries on throughout the series. As time passes, gas production evidently increases, and there is a definite seasonal component throughout the data, with the seasonal swings magnified as the trend sets in.
How the data changed over the years with respect to each month is another important observation, and it can be checked in R using the function monthplot. Here the month plot shows that the values are roughly constant in every month during the initial period and then increase rapidly after a point, which is evidently around 1970.
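A minimal sketch of the month plot (assuming the series is loaded as myts):

```r
# one panel per calendar month; the horizontal line in each panel
# is that month's mean across all years
monthplot(myts, xlab = "Month", ylab = "Production")
```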
The periodicity of the data can be analysed further using the autocorrelation function, which lets us plot the periodicity present in the data.
The components present in the time series can be further identified by decomposing it into trend, seasonality and remainder. This can be done in R, and each component can be plotted independently.
When the seasonality was treated as constant, a seasonal component leaked into the remainder. So the seasonal window was redefined to a finite value to keep this error low. Different values for the window were tried, and 7 looked best among them.
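The decomposition step can be sketched as follows (mirroring the code appendix, with the chosen window of 7):

```r
# STL decomposition; s.window controls how quickly the seasonal
# component is allowed to change over time
decomp7 <- stl(myts, s.window = 7)
plot(decomp7, main = "Components of the time series data")
```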
Seasonality can be an issue while forecasting, so the seasonal component is removed from the data. De-seasonalisation helps to get rid of errors which may occur due to the seasonal component. As the data is already decomposed, adding the trend and remainder components gives the deseasonalised data.
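A minimal sketch of the de-seasonalisation (the stl output columns are seasonal, trend and remainder):

```r
# deseasonalised series = trend + remainder
DSdata <- decomp7$time.series[, "trend"] + decomp7$time.series[, "remainder"]
ts.plot(DSdata, myts, col = c("red", "blue"),
        main = "Comparison of data with and without seasonality")
```

The forecast package's seasadj(decomp7) would give the same series in one call.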
Before developing an ARIMA model we need to check whether the data is stationary. If it is not a stationary series, the model is unlikely to give a proper forecast. A non-stationary series can be converted to a stationary one by taking the difference of the series.
From visual inspection, it is visible that the mean and variance are not constant over the series. There are very rapid variations in the data, and hence the mean and variance change significantly.
But before proceeding further it is important to confirm whether the series is stationary. For that, we use the Augmented Dickey-Fuller (ADF) test. For the test, the null and alternative hypotheses are as follows:
Null hypothesis: the series has a unit root (it is non-stationary).
Alternative hypothesis: the series is stationary.
If the ADF test gives a p-value less than 0.05, the null hypothesis is rejected and the series can be treated as stationary, needing no modification. If the p-value is greater than 0.05, the null hypothesis cannot be rejected and the series is treated as non-stationary.
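A minimal sketch of the test (adf.test comes from the tseries package):

```r
library(tseries)

# H0: the series has a unit root (non-stationary)
# H1: the series is stationary
adf.test(myts)   # reported p-value: 0.2764
```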
The p-value is 27.64%, which is much higher than 5%. So the null hypothesis cannot be rejected: the series is not stationary. The data is converted to a stationary series by taking the difference series of order 1.
After converting it to a stationary series by taking the difference of the series, the ADF test was conducted again. The p-value becomes less than 0.01 and the null hypothesis is rejected, which means the series is now stationary.
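The differencing and re-test can be sketched as:

```r
difdata <- diff(myts)   # first difference of the series
adf.test(difdata)       # reported p-value < 0.01: stationary
plot(difdata)
```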
From the graph, it is evident that the series has certainly changed: the mean stays similar throughout the data, and the spread of the data looks significantly better than in the original series. The seasonality and trend are no longer very prominent.
From the data, it is observed that there are factors of periodicity in the series. The autocorrelation up to lag 20 was computed and plotted in R. The correlation values are significant at all but 3 points, as the ACF values lie outside the blue lines which mark the significance threshold.
Similarly, the partial autocorrelation was checked.
Here also we can see traces of periodicity, as the correlation values show seasonality-like characteristics. The significance drops after the 12th lag, so the maximum number of predictors will be 10 or 12.
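A minimal sketch of the two correlograms:

```r
acf(difdata, lag.max = 20)    # autocorrelation up to 20 lags
pacf(difdata, lag.max = 20)   # partial autocorrelation up to 20 lags
```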
The ARIMA model was generated in R by trying different p, d and q values, and among them ARIMA(1, 1, 0) was the simplest model. p = 1 gives good accuracy as the correlation is very high at lag 1. d is kept as 1 since the original data was non-stationary; if the model needed a higher-order difference series, the d value could be increased, but that would also increase the complexity of the model. The moving-average order q is kept as 0. With this, the model was built and a 12-month forecast was made.
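The fit and 12-month forecast can be sketched as follows (order c(1, 1, 0), matching the code appendix; forecast() is from the forecast package):

```r
arima110 <- arima(myts, order = c(1, 1, 0))
arima110_forecast <- forecast(arima110, h = 12)
plot(arima110_forecast)   # shaded bands are the 80% and 95% intervals
```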
Once the model was made, the fitted values were extracted to check how well the model follows the actual values. From the plot it is visible that the fitted values follow the actual values quite well.
To see the distribution of the residuals, a histogram was plotted, and it looked like a normal distribution.
To check whether the residuals of the fit are white noise, the autocorrelation of the residuals was taken. A small amount of periodicity is present, which might be ignored as its significance is comparatively small.
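These residual checks can be sketched as:

```r
hist(arima110$residuals)   # roughly bell-shaped if the errors are normal
acf(arima110$residuals)    # spikes outside the bands indicate leftover structure
```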
Once the fitting is over and satisfactory, the model is used to forecast the next 12 months, and the forecasted values look as follows.
The forecast points for the next 12 months look similar after the 7th point, which was expected. The forecasted values are then plotted with confidence intervals of 80% and 95%.
The results are not satisfactory, as the p-value is very low and the test fails to show that the remainder is pure white noise. So we can try the auto-ARIMA method instead.
In auto-ARIMA, the function itself finds the parameters and suggests the best possible model.
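A minimal sketch of the automatic search:

```r
# auto.arima searches over p, d, q (and seasonal orders) and keeps
# the model with the best information criterion (AICc by default)
autoarima <- auto.arima(myts)
autoarima   # reports ARIMA(2,1,1)(0,1,1)[12] for this series
```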
After developing the model, the fit was studied to check how closely it follows the actual values.
From the graph, it is clear that the fit is very close to the actual values and the error will be very small.
The residuals of the fitted model were checked, and the errors follow randomness and a roughly normal distribution spread around 0. The autocorrelation of the residuals was also checked to confirm that no significant trend or periodicity is present; the residuals are negligible in nature. So the 12-month prediction can be done using this model.
Forecasting can be done once we are satisfied with the model.
The forecasted values were plotted with intervals of 80% and 95% as follows. The forecast blends in with the history, which means the predicted values follow the same characteristics as the actual data.
To check whether the residuals are white noise or carry some remaining structure, the Box-Ljung test was conducted. From the test, the null hypothesis is not rejected, so the residuals are randomly distributed.
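A sketch of the test (the default lag of 1 is assumed here, consistent with df = 1 in the reported output):

```r
autoarima_forecast <- forecast(autoarima, h = 12)

# H0: the residuals are independently distributed (white noise)
Box.test(autoarima_forecast$residuals, type = "Ljung-Box")
# reported p-value: 0.9603, so H0 is not rejected
```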
The accuracy of the model can be measured using various metrics, such as MAPE, MAE and MSE. The mean absolute percentage error is found to be 3.9%, which is a very good value. From this we can say that the model built using the auto-ARIMA method is really good and the forecasted values are very close to the actual values.
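The accuracy metrics can be sketched as follows (MAPE, MAE and MSE here come from the MLmetrics package, as in the code appendix):

```r
library(MLmetrics)

fitted_vals <- fitted(autoarima)
MAPE(fitted_vals, myts)   # ~0.039, i.e. about 3.9%
MAE(fitted_vals, myts)    # ~893
MSE(fitted_vals, myts)    # ~2494685
```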
3 Conclusion
The data was imported and studied using exploratory data analysis techniques. The series was plotted and checked for visible trend, seasonality and other characteristics; visible seasonality and trend were present. The components were then identified by decomposing the data into trend, seasonality and remainder. After that, the seasonality effect was removed by adding just the trend and remainder components, and the output was plotted to observe the difference.
The series was checked for stationarity, first visually and then using the Augmented Dickey-Fuller test. The test indicated a non-stationary series. The series was made stationary by taking the difference series, and the test was conducted again; this time it showed that the differenced series is stationary. Even so, the series still showed some components of periodicity.
An ARIMA model was built in R using various p, d and q values, settling on a simple ARIMA model. The fitted values of the ARIMA model were compared with the actual values and the result was satisfactory. However, when the forecasting was done for 12 months, the output was not satisfactory, as the remainder failed to show randomness.
An auto-ARIMA model was used to build a better model, and it also gave a satisfactory result when the actual values were compared with the fitted values. The model was then used to forecast 12 months, and visually it looked like a good prediction. The accuracy of the prediction was calculated using MAPE, MAE, MSE and other methods; the MAPE value was 3.9%, which is really promising for a forecast model.
The auto-ARIMA model gave a better result than the manual ARIMA model, as auto-ARIMA evaluated various combinations of p, d and q values to optimise the model for minimum error. Building such a model manually by trial and error would not have been easy.
project.R
VIJITH
2019-08-02
#setting up the working directory
setwd("F:/BABI/Time series forecasting/project")
library(forecast)
##   fitted.fracdiff      fracdiff
##   residuals.fracdiff   fracdiff
library(quantmod)
##
## Attaching package: 'zoo'
library(tseries)
# the gas dataset from the forecast package; it is already a ts object
myts <- gas
class(myts)
## [1] "ts"
start(myts)
## [1] 1956 1
end(myts)
## [1] 1995 8
length(myts)
## [1] 476
frequency(myts)
## [1] 12
periodicity(myts)
length(na.omit(myts))
## [1] 476
summary(myts)
monthplot(myts)
acf(myts)
decomp5<- stl(myts,s.window = 5)
plot(decomp5)
decomp9<-stl(myts,s.window = 25)
plot(decomp9)
decomp7<- stl(myts,s.window = 7)
plot(decomp7, main = "Components of the time series data")
head(decomp7$time.series)
DSdata<-(decomp7$time.series[,2]+decomp7$time.series[,3])
ts.plot(DSdata, myts, col = c("red", "blue"), main = "Comparison of data with and without seasonality")
adf.test(myts)
##
##  Augmented Dickey-Fuller Test
##
## data:  myts
## Dickey-Fuller = -2.7131, Lag order = 7, p-value = 0.2764
## alternative hypothesis: stationary
difdata<- diff(myts)
adf.test(difdata)
##
## Augmented Dickey-Fuller Test
##
## data: difdata
## Dickey-Fuller = -19.321, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
plot(difdata)
acf(difdata, lag.max = 20, plot = FALSE)
##
## Autocorrelations of series 'difdata', by lag
##
## 0.0000 0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500
## 1.000 0.295 0.241 -0.063 -0.263 -0.377 -0.572 -0.423 -0.236 -0.059
## 0.8333 0.9167 1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833
## 0.246 0.414 0.679 0.362 0.200 -0.022 -0.223 -0.368 -0.543 -0.409
## 1.6667
## -0.252
# Partial AutoCorrelation
pacf(difdata, lag.max = 20)
##
## Partial autocorrelations of series 'difdata', by lag
##
## 0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500 0.8333
## 0.295 0.168 -0.193 -0.282 -0.242 -0.440 -0.324 -0.206 -0.305 -0.183
## 0.9167 1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833 1.6667
## -0.059 0.358 0.068 -0.040 0.004 0.041 0.051 -0.025 0.000 -0.055
arima110 <- arima(myts, order = c(1, 1, 0))
arima110
##
## Call:
## arima(x = myts, order = c(1, 1, 0))
##
## Coefficients:
##          ar1
##       0.2994
## s.e.  0.0440
##
## sigma^2 estimated as 7230640:  log likelihood = -4425.08,  aic = 8854.16
arima110fit <- fitted(arima110)
ts.plot(myts,arima110fit,col=c("blue","red"))
arima110_forecast<-forecast(arima110, h =12)
arima110_forecast
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Sep 1995 58094.15 54648.08 61540.22 52823.84 63364.46
## Oct 1995 57507.38 51857.05 63157.71 48865.94 66148.81
## Nov 1995 57331.70 49926.37 64737.03 46006.23 68657.17
## Dec 1995 57279.10 48410.87 66147.33 43716.32 70841.89
## Jan 1996 57263.35 47128.12 67398.59 41762.85 72763.86
## Feb 1996 57258.64 45994.45 68522.83 40031.54 74485.73
## Mar 1996 57257.23 44966.36 69548.09 38459.97 76054.49
## Apr 1996 57256.81 44018.37 70495.24 37010.37 77503.24
## May 1996 57256.68 43134.04 71379.32 35657.96 78855.39
## Jun 1996 57256.64 42301.96 72211.32 34385.43 80127.85
## Jul 1996 57256.63 41513.81 72999.45 33180.07 81333.19
## Aug 1996 57256.63 40763.29 73749.96 32032.25 82481.00
plot(arima110_forecast)
hist(arima110$residuals)
acf(arima110$residuals)
Box.test(arima110_forecast$residuals, lag = 20, type = "Ljung-Box")
##
##  Box-Ljung test
##
## data:  arima110_forecast$residuals
## X-squared = 529.75, df = 20, p-value < 2.2e-16
#auto-arima
autoarima<- auto.arima(myts)
autoarima
## Series: myts
## ARIMA(2,1,1)(0,1,1)[12]
##
## Coefficients:
## ar1 ar2 ma1 sma1
## 0.3756 0.1457 -0.8620 -0.6216
## s.e. 0.0780 0.0621 0.0571 0.0376
##
## sigma^2 estimated as 2587081: log likelihood=-4076.58
## AIC=8163.16 AICc=8163.29 BIC=8183.85
auto.arima(myts, trace = TRUE)
##
##  Fitting models using approximations to speed things up...
##
## ARIMA(2,1,2)(1,1,1)[12] : 7975.349
## ARIMA(0,1,0)(0,1,0)[12] : 8195.019
## ARIMA(1,1,0)(1,1,0)[12] : 8058.749
## ARIMA(0,1,1)(0,1,1)[12] : 7967.206
## ARIMA(0,1,1)(0,1,0)[12] : 8099.363
## ARIMA(0,1,1)(1,1,1)[12] : 7981.228
## ARIMA(0,1,1)(0,1,2)[12] : 7969.036
## ARIMA(0,1,1)(1,1,0)[12] : 8022.449
## ARIMA(0,1,1)(1,1,2)[12] : 7983.344
## ARIMA(0,1,0)(0,1,1)[12] : 8046.55
## ARIMA(1,1,1)(0,1,1)[12] : 7962.501
## ARIMA(1,1,1)(0,1,0)[12] : 8099.564
## ARIMA(1,1,1)(1,1,1)[12] : 7976.677
## ARIMA(1,1,1)(0,1,2)[12] : 7964.495
## ARIMA(1,1,1)(1,1,0)[12] : 8021.922
## ARIMA(1,1,1)(1,1,2)[12] : 7978.528
## ARIMA(1,1,0)(0,1,1)[12] : 7989.013
## ARIMA(2,1,1)(0,1,1)[12] : 7959.637
## ARIMA(2,1,1)(0,1,0)[12] : 8084.795
## ARIMA(2,1,1)(1,1,1)[12] : 7973.753
## ARIMA(2,1,1)(0,1,2)[12] : 7961.349
## ARIMA(2,1,1)(1,1,0)[12] : 8023.178
## ARIMA(2,1,1)(1,1,2)[12] : Inf
## ARIMA(2,1,0)(0,1,1)[12] : 7984.839
## ARIMA(3,1,1)(0,1,1)[12] : 7962.143
## ARIMA(2,1,2)(0,1,1)[12] : 7961.233
## ARIMA(1,1,2)(0,1,1)[12] : 7959.927
## ARIMA(3,1,0)(0,1,1)[12] : 7977.683
## ARIMA(3,1,2)(0,1,1)[12] : 7964.138
##
## Now re-fitting the best model(s) without approximations...
##
## ARIMA(2,1,1)(0,1,1)[12] : 8163.16
##
## Best model: ARIMA(2,1,1)(0,1,1)[12]
## Series: myts
## ARIMA(2,1,1)(0,1,1)[12]
##
## Coefficients:
## ar1 ar2 ma1 sma1
## 0.3756 0.1457 -0.8620 -0.6216
## s.e. 0.0780 0.0621 0.0571 0.0376
##
## sigma^2 estimated as 2587081: log likelihood=-4076.58
## AIC=8163.16 AICc=8163.29 BIC=8183.85
autoarimafitted=fitted(autoarima)
ts.plot(myts,autoarimafitted,col=c("blue","red"))
hist(autoarima$residuals)
acf(autoarima$residuals)
autoarima_forecast <- forecast(autoarima, h = 12)
autoarima_forecast
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Sep 1995 56907.83 54846.53 58969.13 53755.35 60060.32
## Oct 1995 52476.28 50158.98 54793.57 48932.28 56020.27
## Nov 1995 49719.79 47202.83 52236.76 45870.42 53569.16
## Dec 1995 43473.70 40830.29 46117.11 39430.96 47516.45
## Jan 1996 43318.41 40575.77 46061.05 39123.91 47512.91
## Feb 1996 43601.71 40776.80 46426.62 39281.38 47922.04
## Mar 1996 46668.11 43770.45 49565.78 42236.52 51099.71
## Apr 1996 49376.36 46411.96 52340.76 44842.70 53910.02
## May 1996 57536.25 54509.03 60563.46 52906.53 62165.97
## Jun 1996 62184.69 59097.40 65271.98 57463.09 66906.29
## Jul 1996 65795.74 62650.40 68941.08 60985.35 70606.13
## Aug 1996 63391.54 60189.71 66593.37 58494.76 68288.31
plot(autoarima_forecast)
Box.test(autoarima_forecast$residuals, type = "Ljung-Box")
##
##  Box-Ljung test
##
## data:  autoarima_forecast$residuals
## X-squared = 0.0024719, df = 1, p-value = 0.9603
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
library(Metrics)
##
## Attaching package: 'Metrics'
MAPE(autoarimafitted,myts)
## [1] 0.03900233
MAE(autoarimafitted,myts)
## [1] 893.4504
MSE(autoarimafitted,myts)
## [1] 2494685