
Time Series Forecasting Project

Australian Monthly Gas Production Dataset

Presented By:
Shakshi Narang
Table Of Contents

1. Project Objective
2. Assumptions
3. Steps Of Performing ARIMA & Auto ARIMA
 Reading Data & Visualization
 Data Pre Processing
 Checking Stationarity
 Determining D Value
 Determining P & Q Value
 Fitting ARIMA Model
 Performing BOX COX Transformation
 Making Prediction
4. Accuracy Of Model
5. Appendix R Code
1) Project Objective
This project analyses the Australian monthly gas production dataset “gas” from the R package “forecast”.

Monthly gas production of Australia between 1956 and 1995 was released by the Australian Bureau of Statistics in time series format.
The objective is to read the data from the forecast package and analyse it by plotting, observing and conducting the applicable tests.
The project also requires building models and forecasting 12 months ahead using ARIMA and Auto ARIMA models.
The best model for prediction is chosen by comparing the performance measures of the models.

The data set looks as below:

Variable          Description
Year              Year of production
Month             Month of production
Gas Production    Number of units of gas produced during the specified month and year
2) Assumptions
A few assumptions are made:
 The sample size is adequate to apply time series techniques to the dataset.
 The Australian gas production time series data was downloaded from the ‘forecast’ package in R.
 Components of the time series are not known.
 Stationarity of the time series is not known.
 Seasonality of the time series is not known.

3) Steps Of ARIMA Analysis


 Load the Data & Visualization
 Pre-processing the data
 Check/Make Series Stationary
 Determine D Value
 Determine P & Q Value
 Fit ARIMA model
 Compare models using accuracy measures
 Make Prediction
 Predict values on validation set
 Calculate MAPE/RMSE
 Auto ARIMA Model
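The two accuracy measures named in the steps above can be written out by hand. The sketch below uses base R only; the function names `mape` and `rmse` and the three sample values are illustrative assumptions, not taken from the report.

```r
# Hypothetical helpers: the names and sample values are illustrative only.

# Mean Absolute Percentage Error, in percent.
mape <- function(actual, predicted) {
  mean(abs((actual - predicted) / actual)) * 100
}

# Root Mean Squared Error, in the units of the data.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

actual    <- c(100, 200, 400)
predicted <- c(110, 180, 400)

mape(actual, predicted)  # absolute percentage errors are 10, 10 and 0
rmse(actual, predicted)  # squared errors are 100, 400 and 0
```

MAPE is scale-free, so it is convenient for comparing models, while RMSE stays in the units of the series.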
 Reading Data & Visualization:

The production of gas in Australia increased over a period of 40 years. The plot shows a significant upward trend and some seasonality, but also a very high variance. Because the timeline spans 40 years, it remains to be seen how relevant the older historical data is.
Historical view of Gas Dataset
A large number of the lower values (<10000), i.e. those reflecting low gas production, come from the early years. Whether these lower values help in accurately forecasting production in 1996 remains to be seen.
The lowest gas production was seen in Feb 1956, whereas the highest monthly production was recorded in July 1995. There is therefore a huge gap in production across the years.

The month-plot for Australian gas production shows a clear increasing trend across the months between 1956 and 1995, with visible fluctuations seen mainly during the last 5-10 years.
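The reading and plotting step could look roughly like the following. This is a minimal sketch, assuming the forecast package is installed; the variable name `gas_ts` and the plot labels are illustrative choices, not the report's own code.

```r
library(forecast)  # assumed installed; provides the monthly `gas` series

gas_ts <- forecast::gas  # `gas_ts` is an illustrative name

# Basic structure of the series.
start(gas_ts)
end(gas_ts)
frequency(gas_ts)
summary(gas_ts)

# Historical view and month-plot.
plot(gas_ts, main = "Australian monthly gas production",
     ylab = "Gas production")
monthplot(gas_ts, ylab = "Gas production")

# Time points (as decimal years) of the lowest and highest production.
time(gas_ts)[which.min(gas_ts)]
time(gas_ts)[which.max(gas_ts)]
```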
 Pre-processing the Data
Visual Analysis
Visual inspection of the plot shows an upward trend with semi-annual seasonality that persists throughout the time series.
The seasonal component at the beginning of the series is smaller than the seasonal component later in the series.
To account for this, the data is log transformed as follows:
Log Transformation
Decomposition
Seasonal plot
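The log transformation, decomposition and seasonal plot named above might be produced along these lines. This is a sketch under the assumption that a classical `decompose()` was used; the report does not say which decomposition method was applied, and the variable names are illustrative.

```r
library(forecast)  # assumed installed; provides `gas` and seasonplot()

# The log transform stabilises a seasonal swing that grows with the level.
gas_log <- log(forecast::gas)
plot(gas_log, main = "Log of monthly gas production")

# Classical additive decomposition on the log scale (an assumed choice;
# stl() would be an alternative). Additive on logs corresponds to a
# multiplicative structure on the original scale.
gas_dec <- decompose(gas_log)
plot(gas_dec)

# Seasonal plot: one line per year, months on the x-axis.
seasonplot(gas_log, main = "Seasonal plot (log scale)")
```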
 Checking Series Stationarity

Stationarity

Fitting an ARIMA model requires the series to be stationary. A series is said to be stationary when its mean, variance and autocovariance do not change over time.

So we will determine whether our time series is stationary.

We will start with visual inspection.

This doesn’t look stationary at all. We can run a formal test to check stationarity more rigorously.
We will use the Augmented Dickey-Fuller (ADF) test, which tests the null hypothesis that the series is non-stationary. It is included in the tseries package.
Hypothesis
H0- Non Stationary
Ha- Stationary
If the p-value is more than 0.05, we fail to reject the null hypothesis (H0) and treat the series as non-stationary.

Here, the p-value is more than 0.05, so we fail to reject the null hypothesis (H0) and the data is treated as non-stationary.
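A minimal sketch of the ADF test described above, assuming the tseries and forecast packages are installed. Whether the report tested the raw or the log-transformed series is not stated; the raw series is used here as an assumption.

```r
library(forecast)  # assumed installed; provides `gas`
library(tseries)   # assumed installed; provides adf.test()

# H0: the series is non-stationary; Ha: the series is stationary.
adf_raw <- adf.test(forecast::gas, alternative = "stationary")
print(adf_raw)

# A p-value above 0.05 means H0 cannot be rejected.
adf_raw$p.value
```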
Stationarity – Differencing
Since the data is non-stationary, we need to “difference” it until we obtain a stationary time series. We can do this with the “diff” function in R.
 Determining D Value
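After differencing, we ran adf.test again to check the stationarity of our data. The p-value is below 0.05.

The p-value is significant, so the null hypothesis is rejected in favour of the alternative hypothesis: the differenced series is stationary.


Since one round of differencing was enough, the d value of our ARIMA model is 1.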
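The differencing step and the repeated test can be sketched as follows. The variable names are illustrative, the raw series is again an assumption, and `ndiffs()` is shown only as an optional cross-check that the report does not mention.

```r
library(forecast)  # assumed installed; provides `gas` and ndiffs()
library(tseries)   # assumed installed; provides adf.test()

# First-order differencing: y[t] - y[t-1].
gas_diff <- diff(forecast::gas, differences = 1)
plot(gas_diff, main = "First difference of gas production")

# Repeat the ADF test on the differenced series.
adf_diff <- adf.test(gas_diff, alternative = "stationary")
print(adf_diff)

# Optional cross-check: estimated number of differences needed.
ndiffs(forecast::gas)
```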
 Determine p & q Values

A large amount of correlation exists and we can see that there is a seasonal pattern. The p and q values are taken as 2 and 2 respectively.
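Candidate p and q values are usually read from the PACF and ACF of the differenced series. The sketch below assumes that is how the correlation was inspected here; the `lag.max` value is an arbitrary illustrative choice.

```r
library(forecast)  # assumed installed; provides `gas`

gas_diff <- diff(forecast::gas)

# ACF: significant early lags hint at the MA order q;
# spikes at multiples of 12 months point to seasonality.
acf_d <- acf(gas_diff, lag.max = 48, main = "ACF of differenced series")

# PACF: significant early lags hint at the AR order p.
pacf_d <- pacf(gas_diff, lag.max = 48, main = "PACF of differenced series")
```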
 Fitting ARIMA Model
Conclusion of the ARIMA model built with (p,d,q) = (2,1,2):

o MAPE observed is 6.992583.


o Histogram suggests that the residuals are approximately normally distributed.
o Compared with the original plot, the current model is good.
o Autocorrelation in residuals shows that correlation exists at various lags from lag 4 onwards.
o Box-Ljung test – p-value is well below 0.05, so the residuals are dependent.
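A fit of this kind, with its residual diagnostics and MAPE, could be sketched as below. The train/validation split (last 12 months held out) is an assumption, since the report does not state how the validation set was formed, so the numbers need not match those quoted above.

```r
library(forecast)  # assumed installed

gas_ts <- forecast::gas
n <- length(gas_ts)

# Assumed split: hold out the last 12 months for validation.
train <- subset(gas_ts, end = n - 12)
test  <- subset(gas_ts, start = n - 11)

# Non-seasonal ARIMA(2,1,2).
fit_arima <- Arima(train, order = c(2, 1, 2))
summary(fit_arima)

# Residual plot, ACF, histogram and Ljung-Box test in one call.
checkresiduals(fit_arima)

# MAPE on the training and validation sets.
fc_arima <- forecast(fit_arima, h = length(test))
accuracy(fc_arima, test)[, "MAPE"]
```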
Now we build another ARIMA model by adding seasonality to the previous model.
Conclusion of the SARIMA model built with (p,d,q)(P,D,Q) = (2,1,2)(1,1,2):

o MAPE obtained is 3.9086.


o Histogram suggests that the residuals are approximately normally distributed.
o Compared with the original plot – good.
o Autocorrelation in residuals – correlation exists from lag 4 onwards, but at fewer lags than in the previous model. A better outcome than the previous model.
o Box-Ljung test – p-value is well below 0.05, so the residuals are dependent. The p-value is slightly better than for the previous model.
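The seasonal model and an explicit Ljung-Box test can be sketched as follows. The 12-month holdout, the lag of 24 and the `fitdf` value (one degree of freedom per estimated ARMA coefficient) are assumptions made for illustration, not settings taken from the report.

```r
library(forecast)  # assumed installed

gas_ts <- forecast::gas
n <- length(gas_ts)
train <- subset(gas_ts, end = n - 12)   # assumed 12-month holdout
test  <- subset(gas_ts, start = n - 11)

# SARIMA(2,1,2)(1,1,2); the seasonal period defaults to frequency(train).
fit_sarima <- Arima(train, order = c(2, 1, 2), seasonal = c(1, 1, 2))

# Ljung-Box test on the residuals. H0: residuals are independent.
# fitdf = 2 + 2 + 1 + 2 estimated AR/MA/SAR/SMA coefficients.
lb <- Box.test(residuals(fit_sarima), lag = 24,
               type = "Ljung-Box", fitdf = 7)
print(lb)

# Validation MAPE for comparison with the non-seasonal model.
accuracy(forecast(fit_sarima, h = length(test)), test)[, "MAPE"]
```

A p-value below 0.05 in this test means the residuals still carry autocorrelation that the model has not captured.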
 Auto ARIMA
Conclusion of the Auto ARIMA model, which selected (p,d,q)(P,D,Q) = (2,1,1)(0,1,1)[12]:
o MAPE observed is 3.900233.
o Histogram suggests that the residuals are approximately normally distributed.
o Compared with the original plot, the current model is good.
o Autocorrelation in residuals is almost the same as for the previous SARIMA model.
o Box-Ljung test – p-value is well below 0.05, so the residuals are dependent. The p-value is still no better than for the previously built models.
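Automatic order selection can be sketched with `auto.arima()`. Default search settings and a 12-month holdout are assumed here, so the selected orders may differ from the ones reported above.

```r
library(forecast)  # assumed installed

gas_ts <- forecast::gas
n <- length(gas_ts)
train <- subset(gas_ts, end = n - 12)   # assumed 12-month holdout
test  <- subset(gas_ts, start = n - 11)

# Stepwise search over (p,d,q)(P,D,Q) with default settings.
fit_auto <- auto.arima(train)

# Selected orders: p, d, q, P, D, Q and the seasonal period.
arimaorder(fit_auto)

checkresiduals(fit_auto)
accuracy(forecast(fit_auto, h = length(test)), test)[, "MAPE"]
```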
Performing Box-Cox Transformation
Conclusion of the Box-Cox transformation (best model ARIMA(0,1,1)(0,1,1)[12]) is given below:
o MAPE observed is 0.5343536, which is lower than for the previous model.
o Histogram suggests that the residuals are approximately normally distributed.
o Compared with the previous plot, this model is good.
o Autocorrelation in residuals is close to that of the Auto ARIMA model.
o Box-Ljung test – p-value is well below 0.05, so this model is still not better than the previous models.
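A Box-Cox fit could be sketched as below, assuming lambda is estimated with `BoxCox.lambda()` and passed to `Arima()`; the report does not show how the transformation was applied. With `lambda` set inside `Arima()`, forecasts are back-transformed, so the MAPE here is on the original scale. A MAPE computed on transformed values would not be directly comparable with the other models.

```r
library(forecast)  # assumed installed

gas_ts <- forecast::gas
n <- length(gas_ts)
train <- subset(gas_ts, end = n - 12)   # assumed 12-month holdout
test  <- subset(gas_ts, start = n - 11)

# Estimate the Box-Cox parameter from the training data.
lam <- BoxCox.lambda(train)
lam

# ARIMA(0,1,1)(0,1,1)[12] fitted on the Box-Cox transformed series.
fit_bc <- Arima(train, order = c(0, 1, 1), seasonal = c(0, 1, 1),
                lambda = lam)

checkresiduals(fit_bc)

# Forecasts are back-transformed to the original scale.
fc_bc <- forecast(fit_bc, h = length(test))
accuracy(fc_bc, test)[, "MAPE"]
```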

As this model is also not better than the previous models, we will take a subset of the original data and perform all the tests on it.
Extracting subset of Original Data
Finally, we took a subset of the original time series, using data from the year 1990 onwards, because the models on the full series were inaccurate and had many residual issues. In particular, the residuals of the ARIMA and Auto ARIMA models were not white noise.
Conclusion of the Model:
o MAPE observed is 8.588294.
o Histogram suggests that the residuals are approximately normally distributed.
o Compared with the previous plot, this plot is good.
o Autocorrelation in residuals shows significant correlation, with no improvement compared to the previous Auto ARIMA model.
o Box-Ljung test – p-value is well below 0.05, so the residuals are dependent. We can try to improve this by adding seasonality to the model.
Conclusion of the Model:
o MAPE observed is 4.139159.
o Histogram suggests that the residuals are approximately normally distributed.
o This plot is the best so far.
o No correlation is observed in the residual autocorrelation, making this the best fit so far.
o Box-Ljung test – p-value is well above 0.05, so the residuals are independent. The p-value is the best so far.
Fitting Auto Arima on Subset Model
The model built by running Auto ARIMA on the subset of the data proved to be the best model so far. It is identical to the previously built SARIMA model (0,1,1)(1,1,0) and gives exactly the same result.
Conclusion of the Model:
o MAPE observed is 4.139159.
o Histogram suggests that the residuals are approximately normally distributed.
o Compared with the original plot, this model also proved to be good.
o No correlation is observed in the residual autocorrelation, making this the best fit so far.
o Box-Ljung test – p-value is well above 0.05, so the residuals are independent. The p-value is the best so far.
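The subset analysis could be sketched as follows. The start of January 1990 follows the text; the 12-month holdout inside the subset and the variable names are assumptions, so the selected model and MAPE may not match the values reported above.

```r
library(forecast)  # assumed installed

# Keep only the observations from January 1990 onwards.
gas_sub <- window(forecast::gas, start = c(1990, 1))
n_sub <- length(gas_sub)

# Assumed split: hold out the last 12 months of the subset.
train_sub <- subset(gas_sub, end = n_sub - 12)
test_sub  <- subset(gas_sub, start = n_sub - 11)

# Manual SARIMA(0,1,1)(1,1,0) and automatic selection on the same data.
fit_sub      <- Arima(train_sub, order = c(0, 1, 1), seasonal = c(1, 1, 0))
fit_auto_sub <- auto.arima(train_sub)

# Compare the orders of the two models.
arimaorder(fit_sub)
arimaorder(fit_auto_sub)

checkresiduals(fit_sub)
accuracy(forecast(fit_sub, h = length(test_sub)), test_sub)[, "MAPE"]
```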

Now we can run the forecast on this model.


 Make Prediction
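The 12-month forecast could be produced roughly as follows. This sketch refits the SARIMA(0,1,1)(1,1,0) model on the whole subset from 1990 before forecasting, which is an assumption; the report does not say whether the final model was refitted on all available data.

```r
library(forecast)  # assumed installed

gas_sub <- window(forecast::gas, start = c(1990, 1))

# Refit the chosen model on the full subset (assumed step).
fit_final <- Arima(gas_sub, order = c(0, 1, 1), seasonal = c(1, 1, 0))

# Forecast the next 12 months with 80% and 95% prediction intervals.
fc_final <- forecast(fit_final, h = 12)
print(fc_final)

plot(fc_final, main = "12-month forecast of gas production",
     ylab = "Gas production")
```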
4) Accuracy of Model

So finally, to build the best model, we used a subset of the original data. With the full 40 years of historical data, which has high variance, we found correlated residuals, residuals that were not white noise and inaccurate models, which indicates that the older data was strongly affecting the accuracy of our models. As we were forecasting only the next 12 months, the data from the last 5 years was sufficiently stationary and was enough to build the model, predict the values and plot the forecast accordingly.
Thank You
