Submitted By:
Sparkling Dataset 2
Contents
Problem Statement
For this particular assignment, the data of different types of wine sales in the 20th century is to be analyzed. Both of these
data are from the same company but of different wines. As an analyst in the ABC Estate Wines, you are tasked to analyze
and forecast Wine Sales in the 20th century.
1. Read the data as an appropriate Time Series data and plot the data.
2. Perform appropriate Exploratory Data Analysis to understand the data and also perform decomposition.
3. Split the data into training and test. The test data should start in 1991.
4. Build all the exponential smoothing models on the training data and evaluate the model using RMSE on the test data.
Other additional models such as regression, naïve forecast models, simple average models, moving average models
should also be built on the training data and check the performance on the test data using RMSE.
5. Check for the stationarity of the data on which the model is being built on using appropriate statistical tests and also
mention the hypothesis for the statistical test. If the data is found to be non-stationary, take appropriate steps to make
it stationary. Check the new data for stationarity and comment. Note: Stationarity should be checked at alpha = 0.05.
6. Build an automated version of the ARIMA/SARIMA model in which the parameters are selected using the lowest
Akaike Information Criteria (AIC) on the training data and evaluate this model on the test data using RMSE.
7. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training data and evaluate this
model on the test data using RMSE.
8. Build a table with all the models built along with their corresponding parameters and the respective RMSE values on
the test data.
9. Based on the model-building exercise, build the most optimum model(s) on the complete data and predict 12 months
into the future with appropriate confidence intervals/bands.
10. Comment on the model thus built and report your findings and suggest the measures that the company should be
taking for future sales.
1. Read the data as an appropriate Time Series data and plot the data.
Fig:6. Sparkling time series plot without time-stamp
Note
The x-axis does not show time-stamp values, so the date range has to be supplied manually.
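A minimal sketch of supplying the date range manually, assuming the raw file holds one sales value per month with no usable date column. The column name `Sparkling`, the index name `Time_Stamp` and the random values are illustrative stand-ins, not the actual data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the raw file: one sales column, no usable dates.
# 187 rows = January 1980 through July 1995 inclusive.
rng = np.random.default_rng(0)
df = pd.DataFrame({"Sparkling": rng.integers(1000, 7000, size=187)})

# Build month-start time stamps and use them as the index.
df.index = pd.date_range(start="1980-01-01", periods=len(df), freq="MS")
df.index.name = "Time_Stamp"

print(df.index.min().date(), df.index.max().date())
```

With a proper DatetimeIndex in place, plotting the frame puts calendar dates on the x-axis.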
2. Perform appropriate Exploratory Data Analysis to understand the data and also perform decomposition.
Yearly Boxplot
Fig:9. Sparkling Yearly box plot
Monthly Boxplot
Month plot
Pivot table
Fig:12. Sparkling sales across years
• Monthly sales of Sparkling wine are provided from January 1980 to July 1995.
• The given data file has been read as is, and a date range (valid time-stamp) has been applied to the data as an
index.
• Sales of Sparkling wine rise and fall over the period, with no consistent trend.
• Sparkling wine has been consistently popular with customers over the years.
• According to the data, on average, 2402 units of Sparkling wine were sold each month. Approximately 50% of
monthly sales fell between 1605 and 2549 units. The maximum sale reported in a month is 7242 units and
the minimum is 1070.
• Based on the yearly boxplot, the typical (median) monthly sale of Sparkling has remained relatively steady over
the period, at or a little below 2000 units.
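The summary figures and the year-by-month pivot table can be produced along the following lines. This is a sketch on synthetic values, so the printed numbers will not match the report; the column names `sales`, `year` and `month` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series standing in for the Sparkling data.
rng = np.random.default_rng(1)
idx = pd.date_range("1980-01-01", periods=187, freq="MS")
sales = pd.Series(rng.integers(1000, 7000, size=187), index=idx, name="Sparkling")

# Mean, quartiles, minimum and maximum in one call.
summary = sales.describe()

# Year-by-month table: one row per year, one column per calendar month.
table = pd.DataFrame({"sales": sales.to_numpy(), "year": idx.year, "month": idx.month})
pivot = table.pivot_table(values="sales", index="year", columns="month")

print(summary[["mean", "25%", "75%", "min", "max"]])
print(pivot.shape)
```

The last row of the pivot (1995) has empty cells for August to December because the data ends in July.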
Decade Plot
Fig:13. Sparkling Decade plot
Yearly Plot
Quarterly plot
Daily Plot
Fig:16. Sparkling Daily plot
ECDF
Inference:
• From the Empirical CDF plot, it seems that in about 80% of months at most 3000 units of Sparkling wine were sold.
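An empirical CDF reports, for each sales level, the fraction of months with sales at or below that level. The sketch below uses ten made-up monthly values (not the real data) to show how such a reading is computed; `ecdf` and `share_at_or_below` are helper names introduced for the example.

```python
import numpy as np

def ecdf(values):
    """Return sorted values and the fraction of observations <= each value."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

def share_at_or_below(values, threshold):
    """Fraction of observations that do not exceed the threshold."""
    return float(np.mean(np.asarray(values, dtype=float) <= threshold))

# Hypothetical monthly sales, for illustration only.
sales = [1200, 1500, 1700, 2100, 2400, 2600, 2800, 2950, 4100, 6800]
x, y = ecdf(sales)

# 8 of the 10 values are at or below 3000.
print(share_at_or_below(sales, 3000))
```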
Month on month Percent change
Decomposition
The decomposition of time series is a statistical task that deconstructs a time series into several components.
There are various forces that may affect the observations in a time series. The three important components are:
I. Trend (Long term movement)
II. Seasonal component: Intra-year stable fluctuations repeatable over the entire length of series
III. Irregular component (Random movements)
Additive
Multiplicative
Inference:
3. Split the data into training and test. The test data should start in 1991.
Fig:24. Sparkling train and test split
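A sketch of the split, assuming a monthly series indexed by date; the values here are placeholders. Everything before 1991 goes to training and everything from January 1991 onward goes to test.

```python
import numpy as np
import pandas as pd

# Placeholder values on the same monthly index as the report's data.
idx = pd.date_range("1980-01-01", periods=187, freq="MS")
sales = pd.Series(np.arange(187, dtype=float), index=idx, name="Sparkling")

train = sales[sales.index.year < 1991]
test = sales[sales.index.year >= 1991]

print(len(train), len(test))
print(train.index[-1].date(), test.index[0].date())
```

Splitting by date rather than at random keeps the test period strictly after the training period, which is what a forecast evaluation needs.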
4. Build all the exponential smoothing models on the training data and evaluate the model using RMSE on the test
data. Other additional models such as regression, naïve forecast models, simple average models, moving average
models should also be built on the training data and check the performance on the test data using RMSE.
Fig:27. Sparkling -Naïve Approach
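The naïve approach repeats the last training value over the whole test horizon, and the simple average approach repeats the training mean. A sketch with toy numbers (not the Sparkling data) together with the RMSE used to score every model; the function names are introduced for the example.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def naive_forecast(train, horizon):
    """Repeat the last observed training value."""
    return np.full(horizon, train[-1], dtype=float)

def simple_average_forecast(train, horizon):
    """Repeat the mean of the training values."""
    return np.full(horizon, np.mean(train), dtype=float)

# Toy data for illustration only.
train = [10.0, 12.0, 14.0]
test = [13.0, 15.0]

print(rmse(test, naive_forecast(train, len(test))))
print(rmse(test, simple_average_forecast(train, len(test))))
```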
Fig:29. Sparkling -Moving Average
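A trailing moving average predicts each month from the mean of the months just before it. The sketch below is one possible construction and may differ from how the figure above was produced: it shifts the rolling mean by one step so that a month is never part of its own prediction. The window of 2 and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy monthly series for illustration only.
idx = pd.date_range("1980-01-01", periods=8, freq="MS")
sales = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0], index=idx)

def trailing_ma_prediction(series, window):
    """Mean of the previous `window` observations, aligned to the month predicted."""
    return series.rolling(window).mean().shift(1)

pred = trailing_ma_prediction(sales, window=2)

# Score only the last three months, treated here as the test period.
test = sales.iloc[-3:]
errors = test - pred.loc[test.index]
rmse = float(np.sqrt((errors ** 2).mean()))

print(rmse)
```

Repeating the calculation for several window lengths and comparing the test RMSE is a simple way to choose the window.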
• On the second iteration, the model was executed without passing a value for alpha and used parameters
‘optimized=True, use_brute=True’
Fig:31. Sparkling-SES
Fig:32. Sparkling-DES
The RMSE value for the DES model is 1778.564 (Alpha=0.1, Beta=0.1).
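A model with both an Alpha (level) and a Beta (trend) parameter is double exponential smoothing, also known as Holt's linear trend method. The recursion can be sketched by hand as below; the initial level and trend are a simple assumption made for the example, and library implementations may initialise differently.

```python
import numpy as np

def holt_forecast(train, alpha, beta, horizon):
    """Double exponential smoothing (Holt's linear trend), additive form."""
    y = np.asarray(train, dtype=float)
    # Simple initial values (an assumption for this sketch).
    level, trend = y[0], y[1] - y[0]
    for value in y[1:]:
        prev_level = level
        # Blend the new observation with the previous one-step projection.
        level = alpha * value + (1 - alpha) * (prev_level + trend)
        # Blend the latest change in level with the previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # h-step-ahead forecast: last level plus h times the last trend.
    return level + trend * np.arange(1, horizon + 1)

# Toy data on a perfectly straight line, for illustration only.
train = [2.0, 4.0, 6.0, 8.0, 10.0]
print(holt_forecast(train, alpha=0.1, beta=0.1, horizon=3))
```

On a perfectly linear series the forecast simply continues the line, whatever the smoothing values.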
• The Triple Exponential Smoothing model (Holt-Winters) is applicable when the data has both trend and
seasonality. The Sparkling data contains a slight trend and significant seasonality.
• On the second iteration, the model was allowed to choose the optimized values using the parameters
‘optimized=True, use_brute=True’.
• The model chosen as the final one, in comparison with all models, is the one with alpha 0.1, beta 0.9 and
gamma 0.6, which has the lowest RMSE value, 338.458.
Fig:33. Sparkling-TES
The RMSE value for the TES model is 473.152 (Alpha=0.112, Beta=0.037, Gamma=0.493).
The RMSE value for the TES model is 338.458 (Alpha=0.1, Beta=0.9, Gamma=0.6).
5. Check for the stationarity of the data on which the model is being built on using appropriate statistical tests and
also mention the hypothesis for the statistical test. If the data is found to be non-stationary, take appropriate steps to
make it stationary. Check the new data for stationarity and comment. Note: Stationarity should be checked at alpha =
0.05.
The Augmented Dickey-Fuller (ADF) test is one of the most commonly used statistical tests for checking whether a
given time series is stationary. Its null hypothesis is that the series has a unit root (is non-stationary); the alternative
hypothesis is that the series is stationary.
Fig:34. Sparkling- Stationarity check on the whole data
We see that at the 5% significance level, the above time series is non-stationary. Let us take a difference of order 1
and check whether the series becomes stationary.
Fig:38. Sparkling- Partial Autocorrelation
Inference:
• From the first plot, the PACF cuts off at lag 1, so the AR order p is 1.
• From the second plot, the PACF of the differenced series, the AR order p is 3.
6. Build an automated version of the ARIMA/SARIMA model in which the parameters are selected using the lowest
Akaike Information Criteria (AIC) on the training data and evaluate this model on the test data using RMSE.
▪ The RMSE value for ARIMA(3,1,3) is 1374.037.
Least AIC:
Fig:42. Sparkling- Auto SARIMA (1,1,2)(2,0,2,6)
Least AIC:
Fig:44. Sparkling- Auto SARIMA (1,1,2)(1,0,2,12)
Inference:
• The diagnostic plot for the model shows an approximately normal distribution of residuals, with most values
close to zero.
• The correlogram shows the autocorrelation of the residuals; no lags are significant beyond the
confidence band.
7. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training data and evaluate
this model on the test data using RMSE.
- The Auto-Regressive parameter in a SARIMA model is 'p', the lag after which the PACF plot cuts off;
here p = 2.
- The Moving-Average parameter in a SARIMA model is 'q', the lag after which the ACF plot cuts off;
here q = 2.
Fig:45. Sparkling- Man_SARIMA_6(2,1,2)(2,0,2,6)
Inference:
• The diagnostic plot for the model is as below, which clearly shows a normal distribution of residuals, where
more values are around zero.
8. Build a table (create a data frame) with all the models built along with their corresponding parameters and the
respective RMSE values on the test data.
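A sketch of such a table as a pandas data frame, filled only with test RMSE values quoted in the text of this report. The row labels are written for the example, and models whose figures appear only in the original tables are not included.

```python
import pandas as pd

# Test RMSE values as quoted in the text of this report.
rows = [
    ("Exponential smoothing (Alpha=0.1, Beta=0.1)", 1778.564),
    ("Exponential smoothing (Alpha=0.112, Beta=0.037, Gamma=0.493)", 473.152),
    ("Exponential smoothing (Alpha=0.1, Beta=0.9, Gamma=0.6)", 338.458),
    ("ARIMA(3,1,3)", 1374.037),
]

results = (
    pd.DataFrame(rows, columns=["Model", "Test RMSE"])
    .sort_values("Test RMSE")   # best (lowest RMSE) first
    .set_index("Model")
)

print(results)
```

Sorting by test RMSE puts the best-performing model in the first row, which makes the final selection easy to read off.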
9. Based on the model-building exercise, build the most optimum model(s) on the complete data and predict 12
months into the future with appropriate confidence intervals/bands.
Two candidate models are built on the whole data:
A. The first is Man_SARIMA_12(3,1,1)(3,1,1,12), with non-seasonal orders p = 3, d = 1, q = 1 and seasonal
orders P = 3, D = 1, Q = 1 at a period of 12.
RMSE value is 547.310
B. The second is Triple Exponential Smoothing with additive seasonality, with the parameters
𝛼 = 0.1, 𝛽 = 0.9 and 𝛾 = 0.6.
Therefore, the optimum model for the whole data is TES with parameters 𝛼 = 0.1, 𝛽 = 0.9 and 𝛾 = 0.6.
Sparkling wine sales for the next 12 months have been forecast using this model.
Fig:49. Sparkling- Full data forecasting plot
10. Comment on the model thus built and report your findings and suggest the measures that the company should be
taking for future sales.
• Based on a comparison of the test RMSE values of all models, Triple Exponential Smoothing (Holt-Winters)
is selected for the final 12-month forecast.
• The ABC wine company is recommended to ramp up their procurement and production line in accordance
with the above forecasts.
*************************End of Report**************************