Presented By:
Shakshi Narang
Table Of Contents
1. Project Objective
2. Assumptions
3. Steps Of Performing ARIMA & Auto ARIMA
Reading Data & Visualization
Data Pre Processing
Checking Stationarity
Determining D Value
Determining P & Q Value
Fitting ARIMA Model
Performing BOX COX Transformation
Making Prediction
4. Accuracy Of Model
5. Appendix R Code
1) Project Objective
This project analyses the Australian monthly gas production dataset “gas” from the R package “forecast”.
The monthly gas production of Australia between 1956 and 1995 was released by the Australian Bureau of Statistics and is available in time series format.
The objective is to read the data from the forecast package and analyse it by plotting, observing and conducting the applicable tests.
The project also involves building models and forecasting 12 months ahead using ARIMA and Auto ARIMA models.
The best model for prediction is chosen by comparing the performance measures of the models.
Variable Description
Variable          Description
Year              Year of production
Month             Month of production
Gas Production    Number of units of gas produced during the specified month and year
2) Assumptions
A few assumptions are considered:
The sample size is adequate to apply time series techniques to the dataset.
The Australian gas production time series data was downloaded from the ‘forecast’ package in R.
The components of the time series are not known.
The stationarity of the time series is not known.
The seasonality of the time series is not known.
Gas production in Australia has increased over a period of about 40 years. The plot shows a significant upward trend and some seasonality, but also a very high variance. Because the timeline spans 40 years, it remains to be seen how relevant the older historical data is.
Historical view of Gas Dataset
A large number of the lower values (<10000), which reflect low gas production, come from the early years. Whether these lower values would help in accurately forecasting production in 1996 remains to be seen.
The lowest gas production was seen in Feb 1956, whereas the highest monthly production was recorded in July 1995. There is therefore a huge gap in production over the years.
The month plot for Australian gas production shows a clear increasing trend in the months between 1956 and 1995, with visible fluctuations seen mainly during the last 5-10 years.
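The observations above can be reproduced with a short R sketch. It assumes the forecast package is installed and that the series was loaded directly as `forecast::gas`; the plot titles and the variable names `lowest_at` and `highest_at` are illustrative and not taken from the original appendix.

```r
# Sketch only: assumes the 'forecast' package (which ships 'gas') is installed.
library(forecast)

gas <- forecast::gas  # monthly time series (ts) object

# Span and frequency of the series.
print(start(gas))
print(end(gas))
print(frequency(gas))

# Time plot of the full history and a month plot (one sub-series per month).
plot(gas, main = "Australian monthly gas production", ylab = "Gas production")
monthplot(gas, main = "Month plot", ylab = "Gas production")

# Dates (as decimal years) of the lowest and highest monthly production.
t_index <- time(gas)
lowest_at  <- t_index[which.min(gas)]
highest_at <- t_index[which.max(gas)]
print(c(lowest = lowest_at, highest = highest_at))
```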
Pre Processing the Data
Visual Analysis
Visual inspection of the plot helps us understand that there is an upward trend with semi-annual seasonality, which is observed throughout most of the time series.
The seasonal component at the beginning of the series is smaller than the seasonal component later in the series.
To account for this, the data is log transformed.
Log Transformation
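A minimal sketch of the log transformation, assuming the series is taken from `forecast::gas`. It also illustrates that a log transform is the Box-Cox transform with lambda = 0; whether the original analysis used `log()` or `BoxCox()` is not stated, so both are shown.

```r
# Sketch only: assumes the 'forecast' package is installed.
library(forecast)

gas <- forecast::gas

# Natural log of the series; this damps the growing seasonal swings.
log_gas <- log(gas)

# The same transform expressed as a Box-Cox transform with lambda = 0.
log_gas_bc <- BoxCox(gas, lambda = 0)

plot(log_gas, main = "Log of monthly gas production", ylab = "log(Gas production)")
```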
Decomposition
Seasonal plot
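The decomposition and seasonal plot can be sketched as follows. This assumes a classical additive decomposition of the log-transformed series via `decompose()`; the original may have used a different method (for example `stl()`), so treat the choice as an assumption.

```r
# Sketch only: assumes the 'forecast' package is installed.
library(forecast)

gas <- forecast::gas
log_gas <- log(gas)

# Classical additive decomposition into trend, seasonal and random parts.
dec <- decompose(log_gas)
plot(dec)

# Seasonal plot: one line per year, plotted against the month.
seasonplot(gas, year.labels = TRUE, main = "Seasonal plot of gas production")
```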
Checking Series Stationarity
Stationarity
Fitting an ARIMA model requires the series to be stationary. A series is said to be stationary when its mean, variance and autocovariance are invariant over time.
The series does not look stationary at all. A formal test lets us check stationarity in a more rigorous way.
We use the Augmented Dickey-Fuller (ADF) test, which tests the null hypothesis that the series is non-stationary. It is included in the tseries package.
Hypothesis
H0- Non Stationary
Ha- Stationary
If the p-value is more than 0.05, we fail to reject the null hypothesis (H0) and treat the series as non-stationary.
Here, the p-value is more than 0.05, so we fail to reject the null hypothesis (H0) and conclude that the data is non-stationary.
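A sketch of the stationarity check described above, assuming the forecast and tseries packages are installed and that the test was run on the untransformed series with the default settings of `adf.test()` (alternative hypothesis "stationary"). The 0.05 threshold is the one stated in the text.

```r
# Sketch only: assumes 'forecast' and 'tseries' are installed.
library(forecast)
library(tseries)

gas <- forecast::gas

# H0: the series is non-stationary; Ha: the series is stationary.
adf_raw <- adf.test(gas)
print(adf_raw)

# Decision rule from the text, at the 5% level.
if (adf_raw$p.value > 0.05) {
  message("Fail to reject H0: treat the series as non-stationary")
} else {
  message("Reject H0: the series appears stationary")
}
```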
Stationarizing by Differencing
Since the data is non-stationary, we need to “difference” it until we obtain a stationary time series. We can do this with the “diff” function in R.
Determining D Value
After differencing, we ran adf.test again to check the stationarity of the data. The p-value is below 0.05, so the differenced series is stationary and d = 1.
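The differencing step and the repeated test can be sketched like this. It assumes a single first difference of the untransformed series; `ndiffs()` from the forecast package is added only as a cross-check and is not mentioned in the original.

```r
# Sketch only: assumes 'forecast' and 'tseries' are installed.
library(forecast)
library(tseries)

gas <- forecast::gas

# First difference: y[t] - y[t-1]; the result is one observation shorter.
gas_diff <- diff(gas, differences = 1)
plot(gas_diff, main = "First difference of gas production")

# Repeat the ADF test on the differenced series.
adf_diff <- adf.test(gas_diff)
print(adf_diff)

# Cross-check (assumption, not from the original): suggested number of differences.
d_suggested <- ndiffs(gas)
print(d_suggested)
```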
A large amount of autocorrelation exists, and a seasonal pattern is visible. The p and q values are taken as 2 and 2 respectively.
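Candidate p and q values are commonly read from the PACF and ACF of the differenced series. The sketch below assumes that is how the values were chosen here; the original does not show the plots, and the choice of 48 lags is illustrative.

```r
# Sketch only: uses base R 'stats' plus the 'gas' data from 'forecast'.
library(forecast)

gas_diff <- diff(forecast::gas)

# ACF: significant early lags hint at the MA order q;
# spikes at multiples of 12 lags point to seasonality.
acf_out <- acf(gas_diff, lag.max = 48, main = "ACF of differenced series")

# PACF: significant early lags hint at the AR order p.
pacf_out <- pacf(gas_diff, lag.max = 48, main = "PACF of differenced series")
```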
Fitting ARIMA Model
Conclusion of the ARIMA model built with (p,d,q) = (2,1,2)
This model is also no better than the previous model, so we take a subset of the original data and perform all the tests on it.
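A sketch of fitting the (2,1,2) model on the full series with `Arima()` from the forecast package. Fitting on the untransformed data and using `method = "ML"` are assumptions; the original does not state either.

```r
# Sketch only: assumes the 'forecast' package is installed.
library(forecast)

gas <- forecast::gas

# Non-seasonal ARIMA(2,1,2); maximum likelihood estimation is an assumption.
fit_212 <- Arima(gas, order = c(2, 1, 2), method = "ML")
print(summary(fit_212))

# Residual diagnostics: time plot, ACF, histogram and a Ljung-Box test.
checkresiduals(fit_212)
```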
Extracting subset of Original Data
Finally, we took a subset of the original time series, keeping the data from the year 1990 onwards, because the models built on the full series were inaccurate and had many residual issues. In particular, the residuals of the ARIMA and Auto ARIMA models did not behave like white noise.
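Extracting the subset can be sketched with `window()`. The start of January 1990 is an assumption based on "from the year 1990"; the name `gas_sub` is illustrative.

```r
# Sketch only: assumes the 'forecast' package is installed.
library(forecast)

gas <- forecast::gas

# Keep observations from January 1990 to the end of the series.
gas_sub <- window(gas, start = c(1990, 1))

print(start(gas_sub))
print(end(gas_sub))
plot(gas_sub, main = "Gas production from 1990", ylab = "Gas production")
```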
Conclusion of the Model :
o The MAPE observed is 8.588294.
o The histogram suggests that the residuals are approximately normally distributed.
o Compared to the previous plot, this plot is good.
o The autocorrelation plot of the residuals shows significant correlation, so there is no improvement over the previous Auto ARIMA model.
o Box-Ljung test: the p-value is well below 0.05, so the residuals are not independent. The model can be refined further by adding seasonality to it.
Conclusion of the Model :
o The MAPE observed is 4.139159.
o The histogram suggests that the residuals are approximately normally distributed.
o This plot is the best so far.
o No significant autocorrelation is observed in the residuals, which makes this the best fit so far.
o Box-Ljung test: the p-value is well above 0.05, so there is no evidence that the residuals are dependent. This is the best p-value so far.
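The model summarized above appears to be the seasonal ARIMA (0,1,1)(1,1,0) that the text names when comparing it with Auto ARIMA. A sketch of fitting it on the 1990 subset and computing the measures listed follows. Whether the reported MAPE is in-sample or from a holdout is not stated, so the in-sample value from `accuracy()` is shown as an assumption, and the lag of 12 in the Ljung-Box test is illustrative.

```r
# Sketch only: assumes the 'forecast' package is installed.
library(forecast)

gas_sub <- window(forecast::gas, start = c(1990, 1))

# Seasonal ARIMA(0,1,1)(1,1,0) with a 12-month period (taken from the ts).
fit_sarima <- Arima(gas_sub, order = c(0, 1, 1), seasonal = c(1, 1, 0))

# In-sample accuracy measures; MAPE is one of the columns.
acc <- accuracy(fit_sarima)
mape <- acc["Training set", "MAPE"]
print(mape)

# Ljung-Box test on the residuals; fitdf = 2 for the two estimated coefficients.
lb <- Box.test(residuals(fit_sarima), lag = 12, type = "Ljung-Box", fitdf = 2)
print(lb)
```

A p-value above 0.05 in the Ljung-Box test means there is no evidence of remaining autocorrelation in the residuals.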
Fitting Auto Arima on Subset Model
The model built by running Auto ARIMA on the subset of the data proved to be the best model so far. It is identical to the previously built SARIMA model (0,1,1)(1,1,0) and gives exactly the same results.
Conclusion of the Model :
o The MAPE observed is 4.139159.
o The histogram suggests that the residuals are approximately normally distributed.
o Compared to the plot for the original data, this model also proved to be good.
o No significant autocorrelation is observed in the residuals, which makes this the best fit so far.
o Box-Ljung test: the p-value is well above 0.05, so there is no evidence that the residuals are dependent. This is the best p-value so far.
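A sketch of the Auto ARIMA fit on the subset, followed by the 12-month forecast the project calls for. Default `auto.arima()` settings are an assumption, and the order it selects may depend on the package version, so the selected order is printed rather than assumed.

```r
# Sketch only: assumes the 'forecast' package is installed.
library(forecast)

gas_sub <- window(forecast::gas, start = c(1990, 1))

# Automatic order selection with default settings (an assumption).
fit_auto <- auto.arima(gas_sub)
print(arimaorder(fit_auto))
print(accuracy(fit_auto))

# Forecast the next 12 months with prediction intervals.
fc <- forecast(fit_auto, h = 12)
print(fc)
plot(fc, main = "12-month forecast of gas production")
```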
To build the best model, we finally used a subset of the original data. On the full 40 years of historical data, which has high variance, the models showed correlated residuals that did not behave like white noise, as well as poor accuracy. This indicates that the older data was hurting the accuracy of our models. Since we were forecasting only the next 12 months, the data from the last 5 years was considerably more stable and was enough to build the model, predict the values and plot the forecast.
Thank You