Forecasting
Rose Dataset
By Somya Dhar
TABLE OF CONTENTS (for the ROSE dataset)
1.1 Read the data as an appropriate Time Series and plot the data.
1.2 Perform appropriate Exploratory Data Analysis to understand the data and also perform decomposition.
1.3 Split the data into training and test. The test data should start in 1991.
1.4 Build various exponential smoothing models and evaluate the models using RMSE on the test data.
1.5 Check for the stationarity of the data and mention the hypothesis for the statistical test. Check the new data for stationarity and comment.
1.6 Build an automated version of the ARIMA/SARIMA model with parameters selected using the lowest AIC on the training data, and evaluate it on the test data using RMSE.
1.7 Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF and evaluate these models on the test data using RMSE.
1.8 Build a table with all the models built along with parameters and the respective RMSE values on the test data.
1.9 Build the most optimum model(s) on the complete data and predict 12 months into the future with appropriate confidence intervals/bands.
1.10 Comment on the model thus built, report your findings and suggest the measures that the company should be taking for future sales.
LIST OF TABLES (for the ROSE dataset)
1.1 Rose dataset as a time series
1.2 Describing the Rose dataset
1.3 Plotting a graph of monthly sales across years
1.4 Splitting the data into training and test
1.5 First few and last few rows of train and test data for the ROSE dataset
1.6 First few and last few rows of train and test data : Linear Regression
1.7 Test RMSE : Linear Regression
1.8 First few and last few rows of train and test data : Naïve Model
1.9 Test RMSE : Naïve Model
2.1 First few rows of test data : Simple Average
2.2 First few rows of test data : Moving Average
2.3 Test RMSE : Moving Average for 2, 4, 6 and 9 point
2.4 Test RMSE : 0.995 Simple Exponential Model
2.5 Test RMSE : 0.3 Simple Exponential Model
2.6 Test RMSE for different Alpha and Beta : Double Exponential Model
2.7 Test RMSE : 0.3 and 0.3 Double Exponential Model
2.8 Test RMSE : 0.079, 0.40 and 0.00087 Triple Exponential Model
2.9 Test RMSE : Full model forecasting
3.1 Upper and lower confidence bands at 95% confidence level
3.2 Test RMSE : Full model forecasting
3.3 ARIMA model : Akaike Information Criterion (AIC) on the training data
3.4 ARIMA (0,1,2) model results
3.5 Test RMSE : ARIMA (0,1,2) model
3.6 ARIMA (4,1,2) model results
3.7 Test RMSE : ARIMA (4,1,2) model
3.8 SARIMA : Setting the seasonality as 12
3.9 SARIMA (0,1,2)*(2,1,2,12) model results
4.1 Upper and lower confidence bands at 95% confidence level
4.2 SARIMAX (1,1,2)*(2,0,2,6) model results
4.3 Upper and lower confidence bands at 95% confidence level
4.4 SARIMAX (4,1,4)*(1,0,0,12) model results
4.5 Test RMSE : all models
4.6 SARIMAX (0,1,2)*(2,0,2,6) model results
4.7 Upper and lower confidence bands at 95% confidence level
TABLE OF FIGURES :
Figure 1.1 : Plotting the time series for the Rose dataset
Figure 1.6 : Boxplot of ROSE wine across different years
Figure 1.8 : Graphical month plot of the given Time Series
Figure 2.9 : Plot : Moving Average for 2, 4, 6 and 9 point
Figure 3.1 : Plot : Moving Average for training and testing data
Figure 3.4 : Plot : 0.3 and 0.3 Double Exponential Smoothing prediction
Figure 3.5 : Plot : 0.079, 0.40 and 0.00087 Triple Exponential Smoothing prediction
Figure 3.6 : Plot of all three Exponential Smoothing prediction models
Figure 3.8 : Plot of all three Exponential Smoothing prediction models
Figure 3.9 : Plot of the forecast along with the confidence band
Figure 4.1 : Plot for the stationarity test of the series, using the Dickey-Fuller test
Figure 4.3 : Plot for the stationarity test of the series, using the Dickey-Fuller test
For this assignment, sales data for different types of wine in the 20th century is to be analyzed. Both series come from the same company but cover different wines. As an analyst at ABC Estate Wines, you are tasked with analyzing and forecasting wine sales in the 20th century.
Q 1.1. Read the data as an appropriate Time Series data and plot the data.
Answer 1.1 :
Reading/loading the data as an appropriate time series and checking the head of the data :
(Table 1.1)
Observation : The time series looks correct based on the first five records shown.
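The loading step above can be sketched as follows. The column names (`YearMonth`, `Rose`) are assumptions, since the actual file layout is not shown here; a small inline frame stands in for the CSV:

```python
import pandas as pd

# Hypothetical frame standing in for the Rose CSV; real column names may differ.
raw = pd.DataFrame({
    "YearMonth": ["1980-01", "1980-02", "1980-03", "1980-04"],
    "Rose": [112.0, 118.0, 129.0, 99.0],
})

# Parse the period column into a DatetimeIndex so pandas treats it as a time
# series, then declare the monthly (month-start) frequency.
ts = raw.set_index(pd.to_datetime(raw.pop("YearMonth"))).asfreq("MS")
print(ts.head())
```

With a real file, the same effect comes from `pd.read_csv(..., parse_dates=["YearMonth"], index_col="YearMonth")` followed by `asfreq("MS")`.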
(Figure 1.1)
Observation :
The plot depicts a decreasing trend of sales over the period of 1980 to 1995.
Q 1.2. Perform appropriate Exploratory Data Analysis to understand the data and also perform
decomposition.
Answer 1.2 :
(Table 1.2)
There are 187 records in total, with minimum sales of 28.0, maximum sales of 267.0, and a median (50th percentile) of 85.
(Figure 1.2)
(Figure 1.3)
There are 2 missing values.
We now have to replace the null values, so we use the interpolate function with the spline method to generate replacement values. As noted, both missing values occur in 1994.
#In order to treat the missing values, we will make use of Interpolate method :
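A minimal sketch of that treatment, on a toy series with two gaps (the values and dates here are assumptions, not the actual 1994 records; spline interpolation in pandas requires SciPy):

```python
import numpy as np
import pandas as pd

# Toy monthly series with two missing values, mimicking the 1994 gaps.
idx = pd.date_range("1994-01-01", periods=6, freq="MS")
sales = pd.Series([45.0, np.nan, 48.0, np.nan, 52.0, 50.0], index=idx)

# Spline interpolation estimates the gaps from the shape of the surrounding
# curve; only the NaN positions are replaced, observed values are untouched.
filled = sales.interpolate(method="spline", order=2)
print(filled)
```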
(Figure 1.4)
(Figure 1.5)
#Plotting a boxplot to understand the sales of ROSE wine across different years and within different
months across years. :
(Figure 1.6)
The above picture shows the yearly spread of sales over the period from 1990 to 1995, with outliers in most of the years.
#Monthly boxplot :
(Figure 1.7)
The above picture shows the monthly pattern of sales : December has the highest sales and January the lowest.
(Figure 1.8)
The above picture shows the monthly graphical representation.
(Table 1.3)
# Group by date and get the average Rose wine sales and the percent change :
(Figure 2.1)
We now have all the values, so we can decompose the series to check the seasonality, trend and residual components. There are 2 methods : additive and multiplicative.
Additive Decomposition :
(Figure 2.2)
As per the additive decomposition, there is a decreasing trend in sales of Rose wine : from 1980 to 1994, sales have only fallen.
Seasonality is also clearly visible in the seasonal component, where the trend lines form peaks of different heights every year.
The residuals are scattered away from the 0 level, indicating that the series is not additive.
#Multiplicative Decomposition :
(Figure 2.3)
The trend and seasonality are present just as in the additive model, but the residuals plot clearly shows the data concentrated around one point. Hence it can be concluded that the series is multiplicative.
Q 1.3. Split the data into training and test. The test data should start in 1991.
Answer 1.3 :
The data has been split into train (before 1991) and test (1991 onwards).
(Table 1.4)
First few and last few rows of train and test data :
(Table 1.5)
#Graphical representation of train and test data :
(Figure 2.4)
As seen in the graph above, the train set contains data before 1991 and the test set contains data from 1991 onwards.
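The split rule can be sketched with boolean indexing on the DatetimeIndex (the frame below uses random placeholder values over the dataset's 1980-1995 span):

```python
import numpy as np
import pandas as pd

# Toy monthly frame over the dataset's span; values are random placeholders.
idx = pd.date_range("1980-01-01", "1995-07-01", freq="MS")
df = pd.DataFrame({"Rose": np.random.default_rng(0).uniform(30, 270, len(idx))},
                  index=idx)

# Everything before 1991 goes to train; 1991 onwards goes to test.
train = df[df.index < "1991-01-01"]
test = df[df.index >= "1991-01-01"]
print(len(train), len(test))
```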
Q 1.4. Build various exponential smoothing models on the training data and evaluate the model using
RMSE on the test data.
Answer 1.4 :
(Figure 2.5)
(Table 1.6)
(Figure 2.7)
The predicted trend is downward, indicating a further decrease in sales of Rose wine : around 60 units in 1991, falling to around 40 units by 1995.
(Table 1.7)
The RMSE on the test data is 15.275. The value is not very high, but since seasonality is not captured by the model, it is not suitable for predictions on the Rose time series.
(Figure 2.8)
(Table 1.9)
The RMSE is 79.738, which is higher than the linear regression model. This model is too simple : it does not account for seasonality and has a very high RMSE.
For this particular simple average method, we will forecast using the average of the training values :
(Table 2.1)
(Figure 2.9)
The RMSE is 53.41, which is lower than the Naïve model but higher than the regression model, and the forecast has no seasonality component.
#Method 4: Moving Average(MA) :
For the moving average model, we calculate rolling means (moving averages) over different intervals. The best interval is the one with the maximum accuracy (i.e. the minimum error) on the test data.
For the moving average, the rolling means are computed over the entire data.
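The rolling means described above can be sketched with `Series.rolling`; the series below is a random placeholder for the Rose sales:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
idx = pd.date_range("1980-01-01", periods=60, freq="MS")
sales = pd.Series(rng.uniform(40.0, 150.0, 60), index=idx)

# Trailing moving averages for the 2-, 4-, 6- and 9-point intervals used in
# the report; the value at time t is the mean of observations t-w+1 .. t.
ma = {w: sales.rolling(window=w).mean() for w in (2, 4, 6, 9)}
for w, series in ma.items():
    print(w, round(float(series.dropna().iloc[-1]), 2))
```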
(Table 2.2)
(Table 2.3)
Following the criterion that the testing data should start from 1991 onwards, the training and testing data are prepared.
(Figure 3.2)
On the testing data set (orange trend line), the predicted values of the 2-point trailing moving average (green trend line) fit best.
(Figure 3.3)
The comparison plot shows the best-fitting model, the 2-point trailing moving average (brown line), aptly fitting the actual test values.
#Method 5: Simple Exponential Smoothing :
(Table 2.4)
(Figure 3.3)
#Model Evaluation for Alpha = 0.995 : Simple Exponential Smoothing :
(Table 2.4)
(Table 2.5)
Double exponential smoothing values for different values of alpha and beta :
(Table 2.6)
(Figure 3.5)
For alpha and beta values of 0.3 and 0.3 respectively, the RMSE is 265.59, higher than simple exponential smoothing.
(Table 2.7)
#Method 7: Triple Exponential Smoothing (Holt - Winter's Model) : Level, Trend and Seasonality are
accounted for in this model.
(Figure 3.6)
#Model Evaluation for Alpha = 0.995 : Triple Exponential Smoothing :
(Table 2.8)
(Figure 3.7)
(Figure 3.8)
(Table 2.9)
Getting the predictions for the same number of time stamps as are present in the test data :
(Figure 3.8)
Here we calculate the upper and lower confidence bands at the 95% confidence level, taking the multiplier to be 1.96 since we want 95% confidence intervals.
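The band computation reduces to the point forecast plus/minus 1.96 standard errors; the forecast values and residual standard deviation below are placeholders:

```python
import numpy as np
import pandas as pd

# Hypothetical point forecasts and residual standard error from a fitted model.
forecast = pd.Series([55.0, 52.0, 50.0, 47.0])
sigma = 8.0   # assumed residual standard deviation

# 95% band: 1.96 is the two-sided z-multiplier for a 95% interval.
bands = pd.DataFrame({
    "lower": forecast - 1.96 * sigma,
    "upper": forecast + 1.96 * sigma,
})
print(bands)
```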
(Table 3.1)
(Figure 3.9)
Q 1.5. Check for the stationarity of the data on which the model is being built on using appropriate
statistical tests and also mention the hypothesis for the statistical test. If the data is found to be non-
stationary, take appropriate steps to make it stationary. Check the new data for stationarity and
comment.
Answer 1.5 :
(Figure 4.1)
The p-value is 0.3439, which is greater than 0.05, hence we cannot reject the null hypothesis. The data is not stationary, which means we have to difference the time series.
Let us take a difference of order 1 and check whether the Time Series is stationary or not :
(Figure 4.3)
Q 1.6. Build an automated version of the ARIMA/SARIMA model in which the parameters are selected
using the lowest Akaike Information Criteria (AIC) on the training data and evaluate this model on the
test data using RMSE.
Answer 1.6 :
The following loop gives combinations of the p and q parameters in the range 0 to 2 (i.e. range(0, 3)).
We keep d at 1, as we need to take one difference of the series to make it stationary.
(Table 3.3)
Sort the above AIC values in the ascending order to get the parameters for the minimum AIC value :
(Table 3.4)
Running the automated model over the combinations of p and q in the range 0 to 2, with d kept at 1 to make the series stationary, the ARIMA model with the lowest AIC gives a test RMSE of 15.62.
(Table 3.5)
Q 1.7. : Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training data
and evaluate this model on the test data using RMSE.
Answer 1.7 :
PACF plot :
(Figure 4.6)
The following loop gives combinations of the p and q parameters in the range 0 to 2.
We keep d at 1, as we need to take one difference of the series to make it stationary.
(Table 3.6)
The PACF plot summarizes the correlations for an observation with lag values that is not accounted for
by prior lagged observations.
We chose the AR parameter p value 4, Moving average parameter q value 2 and d value 1 based on the
given plots.
Setting the seasonality as 12 for the first iteration of the auto SARIMA model :
The model returns 6 parameters as seen above, of which 2 are significant with p-values less than 0.05.
The AIC value returned is 774.969, which is lower than the auto ARIMA model's AIC once the seasonality factor of 12 is included.
(Figure 4.7)
By looking at the diagnostics above, two tests have been run : the Jarque-Bera (JB) test and the Ljung-Box test.
JB test : the null hypothesis is that the residuals are normally distributed; the p-value is greater than 0.05, hence the residuals are normally distributed.
Ljung-Box test : the errors/residuals are independent of each other, since we fail to reject the null hypothesis.
Heteroskedasticity would mean the residual variance is not constant; here the p-value is high, hence the residuals are not heteroskedastic.
Looking at the model diagrams, there is no remaining seasonality.
The KDE plot of the residuals looks normally distributed.
In the Normal Q-Q plot, the residuals follow the straight line of samples taken from a standard normal distribution.
The time series residuals have low correlation with their own lagged versions.
(Table 4.1)
(Table 4.2)
(Figure 4.8)
The model returns 6 parameters as seen above, all 6 of which are significant with p-values less than 0.05.
The AIC value returned is 1041.66 for the SARIMA model taking 6 as the seasonality factor.
By looking at the diagnostics above, two tests have been run : the Jarque-Bera (JB) test and the Ljung-Box test.
JB test : the null hypothesis is that the residuals are normally distributed; the p-value is high, hence the residuals are normally distributed.
Ljung-Box test : the errors/residuals are independent of each other, since we fail to reject the null hypothesis.
Heteroskedasticity would mean the residual variance is not constant; here the p-value is high, hence the residuals are not heteroskedastic.
Looking at the model diagrams, there is no remaining seasonality.
The KDE plot of the residuals looks normally distributed.
In the Normal Q-Q plot, the residuals follow the straight line of samples taken from a standard normal distribution.
The time series residuals have low correlation with their own lagged versions.
(Table 4.3)
Build a version of the SARIMA model for which the best parameters are selected by looking at the ACF
and the PACF plots.
#Seasonality at 6 :
ACF :
(Figure 4.9)
PACF :
(Figure 5.1)
Looking at the ACF and PACF plots of the differenced series, we see the first significant value at lag 4 in the ACF and at the same lag 4 in the PACF, which suggests p = 4 and q = 4.
We also have a large value at lag 12 in the ACF plot, which suggests our season is S = 12, and since this lag is positive it suggests P = 1 and Q = 0. Since this is a differenced series, we set d = 1 for SARIMA, and since the seasonal pattern is not stable over time we set D = 0.
Altogether this gives us a SARIMA(4,1,4)(1,0,0)[12] model. Next we run SARIMA with these values to fit a model on our training data.
#Seasonal period as 12 :
(Table 4.4)
(Figure 5.2)
By looking at the diagnostics above, two tests have been run : the Jarque-Bera (JB) test and the Ljung-Box test.
JB test : the null hypothesis is that the residuals are normally distributed; the p-value is below 0.05, hence the residuals are not normally distributed.
Ljung-Box test : the errors/residuals are independent of each other, since we fail to reject the null hypothesis.
Heteroskedasticity : the p-value is less than 0.05, hence the residuals are heteroskedastic.
Looking at the model diagrams, there is no remaining seasonality.
The KDE plot of the residuals is not normally distributed.
In the Normal Q-Q plot, the residuals follow the straight line of samples taken from a standard normal distribution.
The time series residuals have low correlation with their own lagged versions.
Q 1.8. Build a table with all the models built along with their corresponding parameters and the
respective RMSE values on the test data.
Answer 1.8 :
(Table 4.5)
Q 1.9. Based on the model-building exercise, build the most optimum model(s) on the complete data
and predict 12 months into the future with appropriate confidence intervals/bands.
Answer 1.9 :
#Building the most optimum model on the Full Data : Future prediction into 12 months :
(Table 4.6)
(Figure 5.3)
#Evaluate the model on the whole data and predict 12 months into the future (till the end of next year) :
(Table 4.7)
(Figure
Q 1.10. Comment on the model thus built and report your findings and suggest the measures that the
company should be taking for future sales.
Answer 1.10 :
From the above models and conclusions, we can suggest the following business insights and
recommendations :
Inference :
December has the highest sales in a year for Rose wine as well.
The model was built based on the trend and seasonality, and the future prediction is in line with the pattern of previous years.
Recommendations :
The company should encourage sales in summer and the other low seasons with discounts and other offers that bring in more customers, mainly during the first 3 months of the year and July-September.
Rose wines sell highly during March and from August/October through December. The company should plan ahead and keep enough stock for those months to capitalize on the demand.
To increase sales, the company should plan promotional offers during the low-sales periods.