
Project – Time Series

Forecasting
Rose Dataset

By Somya Dhar
TABLE OF CONTENTS (for ROSE dataset)

1. Problem : Time Series Forecasting using the Rose.csv dataset

1.1 Read the data as an appropriate Time Series and plot the data

1.2 Perform EDA and time series decomposition

1.3 Split the data into training and test

1.4 Build various exponential smoothing models and evaluate them using RMSE on the test data

1.5 Check the stationarity of the data, state the hypotheses for the statistical test, check the new data for stationarity and comment

1.6 Build an automated version of the ARIMA/SARIMA model using the lowest Akaike Information Criterion (AIC) and evaluate it using RMSE

1.7 Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF and evaluate them on the test data using RMSE

1.8 Build a table with all the models built along with parameters and the respective RMSE values on the test data

1.9 Build the most optimum model(s) on the complete data and predict 12 months into the future with appropriate confidence intervals/bands

1.10 Comment on the model thus built, report the findings and suggest measures the company should take for future sales
LIST OF TABLES (for ROSE dataset)
1.1 Rose dataset as a time series
1.2 Describing the Rose dataset
1.3 Monthly sales across years
1.4 Split of the data into training and test
1.5 First and last few rows of train and test data for the ROSE dataset
1.6 First and last few rows of train and test data : Linear Regression
1.7 Test RMSE : Linear Regression
1.8 First and last few rows of train and test data : Naïve Model
1.9 Test RMSE : Naïve Model
2.1 First few rows of test data : Simple Average
2.2 First few rows of test data : Moving Average
2.3 Test RMSE : Moving Average for 2, 4, 6 and 9 point
2.4 Test RMSE : 0.995 Simple Exponential Model
2.5 Test RMSE : 0.3 Simple Exponential Model
2.6 Test RMSE for different Alpha and Beta : Double Exponential Model
2.7 Test RMSE : 0.3 and 0.3 Double Exponential Model
2.8 Test RMSE : 0.079, 0.40 and 0.00087 Triple Exponential Model
2.9 Test RMSE : Full model forecasting
3.1 Upper and lower confidence bands at 95% confidence level
3.2 Test RMSE : Full model forecasting
3.3 ARIMA model : Akaike Information Criterion (AIC) on the training data
3.4 ARIMA (0,1,2) model results
3.5 Test RMSE : ARIMA (0,1,2) model
3.6 ARIMA (4,1,2) model results
3.7 Test RMSE : ARIMA (4,1,2) model
3.8 SARIMA : Setting the seasonality as 12
3.9 SARIMA (0,1,2)*(2,1,2,12) model results
4.1 Upper and lower confidence bands at 95% confidence level
4.2 SARIMAX (1,1,2)*(2,0,2,6) model results
4.3 Upper and lower confidence bands at 95% confidence level
4.4 SARIMAX (4,1,4)*(1,0,0,12) model results
4.5 Test RMSE of all models
4.6 SARIMAX (0,1,2)*(2,0,2,6) model results
4.7 Upper and lower confidence bands at 95% confidence level
TABLE OF FIGURES :

Figure 1.1 : Plotting the time series for the Rose dataset
Figure 1.2 : Checking datatypes for the Rose dataset
Figure 1.3 : Null value check for the Rose dataset
Figure 1.4 : Dataset check after treating null values
Figure 1.5 : Shape of the dataset
Figure 1.6 : Boxplot of ROSE wine across different years
Figure 1.7 : Monthly boxplot for the Rose dataset
Figure 1.8 : Graphical month plot of the given Time Series
Figure 1.9 : Monthly sales across years plot
Figure 2.1 : Average Rose wine sales and percent change
Figure 2.2 : Decomposing the Time Series : Additive
Figure 2.3 : Decomposing the Time Series : Multiplicative
Figure 2.4 : Graphical representation of train and test data
Figure 2.5 : Model 1 : Linear Regression
Figure 2.6 : Plot : Linear Regression
Figure 2.7 : Plot : Naïve Model
Figure 2.8 : Plot : Simple Average
Figure 2.9 : Plot : Moving Average for 2, 4, 6 and 9 point
Figure 3.1 : Plot : Moving Average for training and testing data
Figure 3.2 : Plot : 0.995 Simple Exponential Smoothing prediction
Figure 3.3 : Plot : 0.3 Simple Exponential Smoothing prediction
Figure 3.4 : Plot : 0.3 and 0.3 Double Exponential Smoothing prediction
Figure 3.5 : Plot : 0.079, 0.40 and 0.00087 Triple Exponential Smoothing prediction
Figure 3.6 : Plot of all three Exponential Smoothing prediction models
Figure 3.7 : Plot of the full forecasting model
Figure 3.8 : Plot of all three Exponential Smoothing prediction models
Figure 3.9 : Plot of the forecast along with the confidence band
Figure 4.1 : Test for stationarity of the series using the Dickey-Fuller test
Figure 4.2 : Plot after making the Time Series stationary
Figure 4.3 : Test for stationarity of the series using the Dickey-Fuller test
Figure 4.4 : ACF plots
Figure 4.5 : PACF plots
Figure 4.6 : Plot for SARIMA : seasonality as 12
Figure 4.7 : Plot for SARIMA : seasonality as 6
Figure 4.8 : ACF plots
Figure 4.9 : PACF plots
Figure 5.1 : PACF plots
Figure 5.2 : Plot for SARIMA : seasonality as 6 (future prediction)
Figure 5.3 : Plot for future prediction graph


Problem 1: Time Series Forecasting

For this assignment, data on sales of different types of wine in the 20th century are to be analyzed. Both datasets come from the same company but cover different wines. As an analyst at ABC Estate Wines, you are tasked with analyzing and forecasting wine sales in the 20th century.

Data set for the Problem: Rose.csv

Q 1.1. Read the data as an appropriate Time Series data and plot the data.

Answer 1.1 :

Step a : Import the libraries

Step b : Reading/loading the data as an appropriate time series and checking the head of the data :

(Table 1.1)

Observation : Time series dataset looks good based on initial records seen in top 5 count.

Step c : Plotting the time series

(Figure 1.1)
Observation :

The plot depicts a decreasing trend of sales over the period of 1980 to 1995.
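The loading step above can be sketched as follows. A tiny inline stand-in replaces the real Rose.csv so the snippet is self-contained; the column names "YearMonth" and "Rose" are assumptions, not confirmed by the report:

```python
import io

import pandas as pd

# Inline stand-in for Rose.csv; the real file holds monthly sales
# from 1980 to 1995. Column names "YearMonth" and "Rose" are assumptions.
csv_text = """YearMonth,Rose
1980-01,112
1980-02,118
1980-03,129
"""

# Parse the month column as dates and use it as the index.
df = pd.read_csv(io.StringIO(csv_text),
                 parse_dates=["YearMonth"], index_col="YearMonth")
df = df.asfreq("MS")  # declare a monthly (month-start) frequency
print(df.head())
```

With the real file, `pd.read_csv("Rose.csv", ...)` followed by `df.plot()` would reproduce the time series plot of Figure 1.1.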

Q 1.2. Perform appropriate Exploratory Data Analysis to understand the data and also perform
decomposition.

Answer 1.2 :

Performing EDA to understand the data :

#Describing the data :

(Table 1.2)

There are 187 records in total, with a minimum sale of 28.0, a maximum of 267.0, and a 50th percentile (median) of 85.

#Checking data types of ROSE dataset :

(Figure 1.2)

The index is in datetime format and the sales column has a float datatype.

#Checking Null values :

(Figure 1.3)
There are 2 missing values.

To replace the null values, we use the interpolate function with the spline method to generate values for the missing entries. Both missing values occur in 1994.

#In order to treat the missing values, we will make use of Interpolate method :

Checking dataset after treating null values :

(Figure 1.4)

No missing values are found
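A minimal sketch of the interpolation step, on an illustrative series with two gaps. The report uses the spline method, which requires SciPy, so a dependency-free linear interpolation is used in the executed line:

```python
import numpy as np
import pandas as pd

# Illustrative series with two missing values (values are made up).
s = pd.Series([45.0, 52.0, np.nan, 48.0, np.nan, 40.0])

# The report interpolates with a spline, e.g.:
#   s.interpolate(method="spline", order=2)   # requires SciPy
# A linear interpolation is used here as a stand-in.
filled = s.interpolate(method="linear")
print(int(filled.isnull().sum()))  # 0 -> no missing values remain
```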

#Checking shape of dataset :

(Figure 1.5)

There are 187 rows and 1 column in the dataset.

#Plotting a boxplot to understand the sales of ROSE wine across different years and within different
months across years. :

(Figure 1.6)
The above plot shows the trend over the period 1980 to 1995, with outliers in most of the years.

#Monthly boxplot :

(Figure 1.7)

The above plot shows the pattern of sales across months: December has the highest sales and January the lowest.

#Graphical month plot of the given Time Series :

(Figure 1.8)
The above picture shows the monthly graphical representation.

#Plotting a graph of monthly sales across years :

(Table 1.3)

 December 1980 had the highest sales of all years.

 May 1995 had the least sales.

#Plotting a graph for Monthly sales across years Plot :


(Figure 1.9)

 February has decent sales relative to the other months.

# Group by date and get average Rose wine sales and percent change :

(Figure 2.1)

 Average sales drop over the years.

 The percentage change also dips as the years pass.
#Decomposing the Time Series and plotting the different components :

We now have all the values, so we can decompose the series to inspect its seasonality, trend and residual components. There are two methods: additive and multiplicative.

Additive Decomposition :

(Figure 2.2)

 As per the 'additive' decomposition, there is a decreasing trend in Rose wine sales: from 1980 to 1994 sales have only fallen.
 Seasonality is also clearly visible in the seasonal component, which forms peaks of differing heights every year.
 The residuals appear scattered around the 0 level, indicating that the series is not additive.

#Multiplicative Decomposition :
(Figure 2.3)

Trend and seasonality are present just as in the additive model, but the residuals plot clearly shows the data concentrated around a single point. Hence it can be concluded that the series is multiplicative.

Q 1.3. Split the data into training and test. The test data should start in 1991.

Answer 1.3 :

The training set contains the data before 1991 and the test set the data from 1991 onwards.

Train has 132 rows and test has 55 rows.

(Table 1.4)

First few and last few rows of train and test data :
(Table 1.5)
#Graphical representation of train and test data :

(Figure 2.4)

As seen in the above graph, the training set covers the period before 1991 and the test set the period from 1991 onwards.
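The split can be sketched as a simple date cut; with a monthly index running January 1980 to July 1995 this reproduces the 132/55 row counts stated above:

```python
import pandas as pd

# Stand-in monthly frame covering Jan 1980 - Jul 1995 (187 rows).
idx = pd.date_range("1980-01-01", periods=187, freq="MS")
df = pd.DataFrame({"Rose": range(187)}, index=idx)

train = df[df.index < "1991-01-01"]   # data before 1991
test = df[df.index >= "1991-01-01"]   # data from 1991 onwards
print(len(train), len(test))  # 132 55
```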

Q 1.4. Build various exponential smoothing models on the training data and evaluate the model using
RMSE on the test data.

Answer 1.4 :

#Model 1: Linear Regression

(Figure 2.5)
(Table 1.6)
(Figure 2.7)

The predicted trend is downward, indicating a further decrease in Rose wine sales: roughly 60 units in 1991, falling to about 40 units by 1995.

Defining the accuracy metrics :

Model Evaluation: Test RMSE

(Table 1.7)

The RMSE on the test data is 15.275. The value is not very high, but since the model does not capture seasonality, it is not suitable for predictions on the Rose time series.
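A sketch of the regression-on-time model, with the time index as the single feature; the data and resulting RMSE are illustrative, not the report's 15.275:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Illustrative trending series split 132/55 as in the report.
n = 187
y = 150 - 0.5 * np.arange(n) + np.random.default_rng(1).normal(0, 5, n)
t = np.arange(1, n + 1).reshape(-1, 1)  # time index as the only feature

t_train, t_test = t[:132], t[132:]
y_train, y_test = y[:132], y[132:]

lr = LinearRegression().fit(t_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, lr.predict(t_test)))
```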

#Model 2: Naive Approach:

Head (5 rows) of Test data :

Tail (5 rows) of Train data :


(Table 1.8)

(Figure 2.8)

Model Evaluation : Defining the accuracy metrics :

(Table 1.9)
The RMSE is 79.738, which is higher than that of the linear regression model. This model is too simple: it ignores seasonality and yields a very high RMSE.
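The naive approach can be sketched in a few lines: every test point is forecast as the last observed training value (numbers are illustrative):

```python
import numpy as np

y_train = np.array([55.0, 47.0, 52.0, 44.0])
y_test = np.array([40.0, 42.0, 39.0])

# Repeat the last training observation across the test horizon.
naive_pred = np.full_like(y_test, y_train[-1])
rmse = np.sqrt(np.mean((y_test - naive_pred) ** 2))
print(naive_pred)  # [44. 44. 44.]
```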

# Method 3: Simple Average :

For the simple average method, we forecast using the average of the training values :

Checking head of Testing data :

(Table 2.1)

(Figure 2.9)

Model Evaluation : Defining the accuracy metrics :

The RMSE is 53.41: lower than the naïve model but higher than the regression model, and the seasonality component is still not captured.
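A sketch of the simple average method: the forecast for every test point is the mean of the training data (numbers are illustrative):

```python
import numpy as np

y_train = np.array([55.0, 47.0, 52.0, 44.0])

# The same training mean is repeated across the test horizon.
pred = np.full(3, y_train.mean())
print(pred)  # [49.5 49.5 49.5]
```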
#Method 4: Moving Average(MA) :

For the moving average model, we calculate rolling means (moving averages) over different intervals. The best interval is the one with the maximum accuracy (i.e., the minimum error).

For the moving average, we average over the entire data.

Moving average head :

(Table 2.2)

Trailing moving averages :

(Table 2.3)

Plotting on the whole data :


(Figure 3.1)

Training and testing data are prepared with the criterion that the test data starts from 1991 onwards.

(Figure 3.2)

On the test data (orange line), the predicted values of the 2-point trailing moving average (green line) fit best.

Model Evaluation : Done only on the test data.


(Table 2.3)

Plotting on both Training and Test data :

(Figure 3.3)

The comparison plot shows the best-fitting model (brown line), the 2-point trailing moving average, tracking the actual test values closely.
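The trailing moving averages can be sketched with pandas' rolling mean; the 2, 4, 6 and 9 point intervals match Table 2.3 (series values are illustrative):

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 13.0, 15.0, 14.0])

# Right-aligned (trailing) moving averages for each interval width.
for window in (2, 4, 6, 9):
    ma = s.rolling(window).mean()
    print(window, ma.iloc[-1])
```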
#Method 5: Simple Exponential Smoothing :

Best Parameters of this model :

Head of the data :

(Table 2.4)

Plotting on both the Training and Test data :

(Figure 3.3)
#Model Evaluation for Alpha = 0.995 : Simple Exponential Smoothing :

Simple exponential smoothing values for different values of alpha :

(Table 2.4)

Plotting on both the Training and Test data :


(Figure 3.4)

(Table 2.5)

#Method 6: Double Exponential Smoothing (Holt's Model) :

Level and Trend are accounted for in this model.

Double exponential smoothing values for different values of alpha and beta :
(Table 2.6)

Plotting on both the Training and Test data :

(Figure 3.5)

For alpha = 0.3 and beta = 0.3, the RMSE is 265.592, which is higher than simple exponential smoothing.
(Table 2.7)

#Method 7: Triple Exponential Smoothing (Holt - Winter's Model) : Level, Trend and Seasonality are
accounted for in this model.

Best Parameters of this model :

(Figure 3.6)
#Model Evaluation for Alpha = 0.995 : Triple Exponential Smoothing :

Test RMSE values :

(Table 2.8)

Test RMSE values till now :

Triple exponential with different alpha, beta and gamma values :


Plotting on both the Training and Test data using brute force alpha, beta and gamma determination

(Figure 3.7)

Test RMSE of all models :


The below picture shows the graphical representation of all three models :

(Figure 3.8)

#Full model forecasting :

RMSE of full model

(Table 2.9)

Getting the predictions for the same number of times stamps that are present in the test data :

(Figure 3.8)
In the below code, we have calculated the upper and lower confidence bands at 95% confidence level.

Here we are taking the multiplier to be 1.96 as we want to plot with respect to 95% confidence intervals.

(Table 3.1)

Plot the forecast along with the confidence band :

(Figure 3.9)

The above picture shows a declining trend of sales till 2000.
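The band computation described above can be sketched directly: the 1.96 multiplier scales the residual standard error on either side of the point forecast (numbers are illustrative):

```python
import numpy as np

forecast = np.array([50.0, 49.0, 48.0])
resid_std = 5.0  # std. dev. of in-sample residuals (illustrative)

# 95% confidence band: point forecast +/- 1.96 standard errors.
lower = forecast - 1.96 * resid_std
upper = forecast + 1.96 * resid_std
```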

Q 1.5. Check for the stationarity of the data on which the model is being built on using appropriate
statistical tests and also mention the hypothesis for the statistical test. If the data is found to be non-
stationary, take appropriate steps to make it stationary. Check the new data for stationarity and
comment.

Note: Stationarity should be checked at alpha = 0.05.

Answer 1.5 :

Check for stationarity of the whole Time Series data.


Test for stationarity of the series - using the Dickey-Fuller test

The hypotheses for the Dickey-Fuller test are as follows :

Ho: the Time Series is Non-Stationary

Ha: the Time Series is Stationary

(Figure 4.1)

The p-value is 0.3439, which is more than 0.05, hence we fail to reject the null hypothesis: the data are not stationary, which means we have to difference the time series.

Hence, we will apply differencing.

At the 5% significance level the Time Series is non-stationary.

Let us take a difference of order 1 and check whether the Time Series is stationary or not :
(Figure 4.3)

The p-value is now less than 0.05, so the series is stationary after one order of differencing (d = 1).

Q 1.6. Build an automated version of the ARIMA/SARIMA model in which the parameters are selected
using the lowest Akaike Information Criteria (AIC) on the training data and evaluate this model on the
test data using RMSE.

Answer 1.6 :

The following loop evaluates combinations of the p and q parameters in the range 0 to 2 (range(0, 3) in code).

We keep d = 1, as one difference of the series is needed to make it stationary.

Calculating Akaike Information Criteria (AIC) on the training data :

(Table 3.3)

Sort the above AIC values in the ascending order to get the parameters for the minimum AIC value :
(Table 3.4)

Run the automated model over the same combinations of p and q (0 to 2), keeping d = 1.

The ARIMA model with the lowest AIC was built; its test RMSE is 15.6246.

(Table 3.5)

Q 1.7. : Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training data
and evaluate this model on the test data using RMSE.

Answer 1.7 :

Based on ACF and PACF values :

Following graphs show ACF plot :


(Figure 4.5)

PACF plot :
(Figure 4.6)

The following loop helps us in getting a combination of different parameters of p and q in the range of 0
and 2

We have kept the value of d as 1 as we need to take a difference of the series to make it stationary.
(Table 3.6)

The PACF plot summarizes the correlation of an observation with lag values that is not accounted for by prior lagged observations.

Based on the given plots, we chose the AR parameter p = 4, the moving average parameter q = 2 and d = 1.

The RMSE for this ARIMA model is 33.970.


(Table 3.7)

Auto SARIMA model is built using seasonality as 12 :

Setting the seasonality as 12 for the first iteration of the auto SARIMA model :

Fitting SARIMA model on below combinations of parameters :


(Table 3.8)

Sort values by AIC :


(Table 3.9)

The model returns 6 parameters, as seen above, of which 2 are significant with p-values less than 0.05.

The AIC returned is 774.969, which is lower than the auto ARIMA model's AIC once the seasonality factor of 12 is included.

(Figure 4.7)
 The diagnostics report two tests: Jarque-Bera (JB) and the Ljung-Box test.
 JB test: the null hypothesis is that the residuals are normally distributed; the p-value is greater than 0.05, so the residuals appear normally distributed.
 Ljung-Box test: we fail to reject the null hypothesis, so the errors/residuals are independent of each other.
 Heteroskedasticity would mean the residuals are related to the independent variables; here the p-value is high, so the residuals are not heteroskedastic.
 Looking at the model diagrams, there is no remaining seasonality.
 The KDE plot of the residuals looks normally distributed.
 In the Normal Q-Q plot, the residuals follow the linear trend of samples drawn from a standard normal distribution.

The time series residuals have low correlation with lagged versions of themselves.

(Table 4.1)

Auto SARIMA model is built using seasonality as 6 :

Fitting SARIMA model on below combinations of parameters :


Sort values by AIC :

(Table 4.2)
(Figure 4.8)

The model returns 6 parameters, as seen above, of which 6 are significant with p-values less than 0.05. The AIC value returned is 1041.656 for the auto SARIMA model with a seasonality factor of 6.

 The diagnostics again report the JB and Ljung-Box tests.
 JB test: the p-value is high, so the residuals appear normally distributed.
 Ljung-Box test: we fail to reject the null hypothesis, so the errors/residuals are independent of each other.
 Heteroskedasticity would mean the residuals are related to the independent variables; here the p-value is high, so the residuals are not heteroskedastic.
 Looking at the model diagrams, there is no remaining seasonality.
 The KDE plot of the residuals looks normally distributed.
 In the Normal Q-Q plot, the residuals follow the linear trend of samples drawn from a standard normal distribution.
 The time series residuals have low correlation with lagged versions of themselves.
(Table 4.3)

Build a version of the SARIMA model for which the best parameters are selected by looking at the ACF
and the PACF plots.

#Seasonality at 6 :

ACF :
(Figure 4.9)

PACF :

(Figure 5.1)

Looking at the ACF and PACF plots of the differenced series, we see the first significant value at lag 4 in the ACF and at the same lag 4 in the PACF, which suggests p = 4 and q = 4.

There is also a large value at lag 12 in the ACF plot, which suggests a season of S = 12; since this lag is positive it suggests P = 1 and Q = 0. Since this is a differenced series, we set d = 1 for SARIMA, and since the seasonal pattern is not stable over time we set D = 0.

Altogether this gives a SARIMA(4,1,4)(1,0,0)[12] model. Next we fit SARIMA with these values on our training data.
#Seasonal period as 12 :

(Table 4.4)

(Figure 5.2)
 The diagnostics report the JB and Ljung-Box tests.
 JB test: the p-value is below 0.05, so the residuals are not normally distributed.
 Ljung-Box test: we fail to reject the null hypothesis, so the errors/residuals are independent of each other.
 Heteroskedasticity: the p-value is less than 0.05, so the residuals are heteroskedastic.
 Looking at the model diagrams, there is no remaining seasonality.
 The KDE plot of the residuals is not normally distributed.
 In the Normal Q-Q plot, the residuals follow the linear trend of samples drawn from a standard normal distribution.
 The time series residuals have low correlation with lagged versions of themselves.

Q 1.8. Build a table with all the models built along with their corresponding parameters and the
respective RMSE values on the test data.

Answer 1.8 :

RMSE Values Sorted :

(Table 4.5)
Q 1.9. Based on the model-building exercise, build the most optimum model(s) on the complete data
and predict 12 months into the future with appropriate confidence intervals/bands.

Answer 1.9 :

#Building the most optimum model on the Full Data : Future prediction into 12 months :

(Table 4.6)

(Figure 5.3)
#Evaluate the model on the whole data and predict 12 months into the future (till the end of next year) :

(Table 4.7)

Test RMSE of this model :

Plotting the prediction graph :

(Figure

Q 1.10. Comment on the model thus built and report your findings and suggest the measures that the
company should be taking for future sales.

Answer 1.10 :

From the above models and conclusions, we can suggest the following business insights and
recommendations :

Based on the test RMSE scores, the Triple Exponential Smoothing model with alpha = 0.3, beta = 0.3 and gamma = 0.4 is the most suitable, as it has the lowest RMSE.

Time series analysis involves understanding the inherent nature of the series so that you are better informed to create meaningful and accurate forecasts.

Inference :

 Rose wine sales have shown a decreasing trend on a year-on-year basis.

 December month has the highest sales in a year for Rose wine as well.

 The model was built based on trend and seasonality. The future prediction is in line with the previous years' pattern.

Recommendations :

 The company should encourage sales in summer and other seasons through discounts and other offers that attract more customers, mainly during the first 3 months of the year and July-September.

 Rose wine sales are seasonal.

 Rose wines sell most during March, August and October through December.

 The company should plan ahead and keep enough stock for March, August and October through December to capitalize on the demand.

 To increase sales, the company should plan promotional offers during the low-sale periods.
