
Business Analytics Report

Submitted to: Concerned faculty at


Great Learning
The University of Texas at Austin

Submitted By:

Mr. Charit Sharma


PGPDSBA Online July 2021
Subject: - Time Series Forecasting – SPARKLING DATA SET

Sparkling Dataset 2

Problem Statement

For this particular assignment, data on the sales of different types of wine in the 20th century is to be analyzed. Both
datasets are from the same company but cover different wines. As an analyst at ABC Estate Wines, you are tasked to analyze
and forecast wine sales in the 20th century.

1. Read the data as an appropriate Time Series data and plot the data.
2. Perform appropriate Exploratory Data Analysis to understand the data and also perform decomposition.
3. Split the data into training and test. The test data should start in 1991.
4. Build all the exponential smoothing models on the training data and evaluate the model using RMSE on the test data.
Other additional models such as regression, naïve forecast models, simple average models, moving average models
should also be built on the training data and check the performance on the test data using RMSE.
5. Check for the stationarity of the data on which the model is being built on using appropriate statistical tests and also
mention the hypothesis for the statistical test. If the data is found to be non-stationary, take appropriate steps to make
it stationary. Check the new data for stationarity and comment. Note: Stationarity should be checked at alpha = 0.05.
6. Build an automated version of the ARIMA/SARIMA model in which the parameters are selected using the lowest
Akaike Information Criteria (AIC) on the training data and evaluate this model on the test data using RMSE.
7. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training data and evaluate this
model on the test data using RMSE.
8. Build a table with all the models built along with their corresponding parameters and the respective RMSE values on
the test data.
9. Based on the model-building exercise, build the most optimum model(s) on the complete data and predict 12 months
into the future with appropriate confidence intervals/bands.
10. Comment on the model thus built and report your findings and suggest the measures that the company should be
taking for future sales.

1. Read the data as an appropriate Time Series data and plot the data.

Display top 5 records

Fig:1. Sparkling first 5 records

Display bottom 5 records

Fig:2. Sparkling last 5 records

Check for Duplicate values

Sparkling data doesn’t have duplicate records.

Check for Missing values

Sparkling data doesn’t have null values.

Shape of the Sparkling data set

The Data set has 187 rows and 2 columns

Creating a time stamp and converting the data to an actual time series

Fig:6. Sparkling time series plot without time-stamp

Note
The X-axis does not show time-stamp values, so the date range has to be passed manually.
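A minimal sketch of building such a date range by hand is given here. It uses only the Python standard library, which is an assumption made for illustration (the report does not show its own code); the helper name month_starts is hypothetical.

```python
from datetime import date

def month_starts(start_year, start_month, periods):
    """Return `periods` consecutive month-start dates."""
    out = []
    for k in range(periods):
        months = (start_month - 1) + k
        out.append(date(start_year + months // 12, months % 12 + 1, 1))
    return out

# 187 monthly observations starting in January 1980, as stated in the report.
index = month_starts(1980, 1, 187)
print(index[0], index[-1])
```

The resulting list can be used as the time-stamp index when plotting the series.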

Fig:7. Sparkling time series plot with time-stamp

2. Perform appropriate Exploratory Data Analysis to understand the data and also perform decomposition.

Descriptive Statistics of data

Fig:8. Sparkling Descriptive Stats

Yearly Boxplot

Fig:9. Sparkling Yearly box plot

Monthly Boxplot

Fig:10. Sparkling Monthly box plot

Month plot

Fig:11. Sparkling Month plot

Pivot table

Table:1. Sparkling pivot

Monthly Sales across years

Fig:12. Sparkling sales across years

Analysis Observations from Sparkling wine data:

• Monthly sales of Sparkling wine are provided from January 1980 to July 1995.

• The given data file has been read as is, and a date range (valid time-stamp) has been applied to the data as an
index.

• The sale of Sparkling wine rises and falls over the period, showing no consistent trend.

• Sparkling wine has been consistently popular with customers over the years.

• According to the data, on average, 2402 units of Sparkling wine were sold each month over the period.
In approximately 50% of months, sales ranged from 1605 units to 2549 units. The maximum sale
reported in a month is 7242 units and the minimum is 1070.

• Based on the yearly boxplot, the median monthly sale of Sparkling wine has remained relatively steady over the
period, at or a little below 2000 units.

Decade Plot

Fig:13. Sparkling Decade plot

Yearly Plot

Fig:14. Sparkling Yearly plot

Quarterly plot

Fig:15. Sparkling Quarterly plot

Daily Plot

Fig:16. Sparkling Daily plot

ECDF

Fig:17. Sparkling ECDF

Inference:
• From the empirical CDF plot, it seems that about 80% of months had Sparkling wine sales of at most roughly 3000 units.
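The value of an empirical CDF at x is the share of observations at or below x. A minimal sketch of that calculation follows; the sample values are illustrative only and are not the report's data.

```python
def ecdf(values, x):
    """Fraction of observations less than or equal to x."""
    return sum(v <= x for v in values) / len(values)

sample = [1200, 1800, 2500, 3100, 6000]  # illustrative values, not the report's data
print(ecdf(sample, 3000))  # share of months with sales of at most 3000 units
```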

Avg Sparkling wine sales per month

Fig:18. Average Sparkling sales per month

Month on month Percent change

Fig:19. Sparkling month on month percent change
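The month-on-month percent change behind such a plot can be sketched as below. This is an illustrative standard-library version with made-up values; the report does not show how its figure was computed.

```python
def pct_change(values):
    """Month-on-month percent change; the first entry has no predecessor."""
    out = [None]
    for prev, cur in zip(values, values[1:]):
        out.append((cur - prev) / prev * 100.0)
    return out

changes = pct_change([100.0, 110.0, 99.0])  # illustrative values only
print(changes)
```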

Decomposition
The decomposition of time series is a statistical task that deconstructs a time series into several components.
There are various forces that may affect the observations in a time series. The three important components are:
I. Trend (Long term movement)
II. Seasonal component: Intra-year stable fluctuations repeatable over the entire length of series
III. Irregular component (Random movements)
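The three components can be illustrated with a classical additive decomposition, where each observation is treated as trend + seasonal + irregular. The sketch below is a hand-written version under stated assumptions: a 12-month period, a centered moving average for the trend, and a synthetic series rather than the Sparkling data. The function names are hypothetical.

```python
def centered_ma(y, m=12):
    """2 x m centered moving average (m even); None where the window does not fit."""
    half = m // 2
    trend = [None] * len(y)
    for t in range(half, len(y) - half):
        window = y[t - half:t + half + 1]
        trend[t] = (0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]) / m
    return trend

def additive_decompose(y, m=12):
    """Split y into trend, seasonal effects (one per month) and irregular parts."""
    trend = centered_ma(y, m)
    detrended = [[] for _ in range(m)]
    for t, (obs, tr) in enumerate(zip(y, trend)):
        if tr is not None:
            detrended[t % m].append(obs - tr)
    raw = [sum(d) / len(d) for d in detrended]
    mean_raw = sum(raw) / m
    seasonal = [r - mean_raw for r in raw]  # force seasonal effects to sum to zero
    resid = [None if tr is None else obs - tr - seasonal[t % m]
             for t, (obs, tr) in enumerate(zip(y, trend))]
    return trend, seasonal, resid

# Synthetic series: linear trend plus a fixed 12-month pattern (not the report's data).
pattern = [-3, -2, -1, 0, 1, 2, 3, 2, 1, 0, -1, -2]
series = [100 + 0.5 * t + pattern[t % 12] for t in range(48)]
trend, seasonal, resid = additive_decompose(series)
```

A multiplicative decomposition follows the same steps with division in place of subtraction.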

Additive

Fig:20. Sparkling - Additive

Multiplicative

Fig:21. Sparkling - Multiplicative

Inference:

• There is no consistent trend shown in the plot of the trend component.

Fig:22. Additive- deseasonalized

Fig:23. Multiplicative- deseasonalized

3. Split the data into training and test. The test data should start in 1991.

Fig:24. Sparkling train and test split

Fig:25. Sparkling train and test plot
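A minimal sketch of the split follows, with the test set starting in January 1991. The sales values are placeholders; only the date range (187 months from January 1980) is taken from the report.

```python
from datetime import date

# Rebuild the monthly index: 187 months starting January 1980 (as stated in the report).
dates = [date(1980 + k // 12, k % 12 + 1, 1) for k in range(187)]
sales = list(range(187))  # placeholder values standing in for the Sparkling column

train = [(d, v) for d, v in zip(dates, sales) if d.year < 1991]
test = [(d, v) for d, v in zip(dates, sales) if d.year >= 1991]
print(len(train), len(test))
```

With this date range the training set holds 132 months (1980 to 1990) and the test set holds 55 months (January 1991 to July 1995).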

4. Build all the exponential smoothing models on the training data and evaluate the model using RMSE on the test
data. Other additional models such as regression, naïve forecast models, simple average models, moving average
models should also be built on the training data and check the performance on the test data using RMSE.

Model 1 Linear Regression


We regress the 'Sparkling' variable against the order of occurrence of each observation. The training data has to be
modified to include this time order before fitting the linear regression.

Fig:26. Sparkling -Linear Regression

▪ The RMSE value for Linear Regression model is 1389.135
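The idea of regressing on time order and scoring with RMSE can be sketched as follows. This is an illustrative least-squares fit on synthetic numbers using the standard library; it is not the report's code, and the helper names are hypothetical.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Synthetic example: time order 1..n is the only regressor (not the report's data).
train_y = [3.0, 5.0, 7.0, 9.0]
test_y = [11.0, 13.0]
train_t = list(range(1, len(train_y) + 1))
test_t = list(range(len(train_y) + 1, len(train_y) + len(test_y) + 1))

intercept, slope = fit_line(train_t, train_y)
forecast = [intercept + slope * t for t in test_t]
print(rmse(test_y, forecast))
```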

Model 2 Naïve Approach


In the naïve model, the prediction for the next period is the same as the current observation. The prediction for the
period after that equals the prediction for the next period, which is again the current observation. Every forecast in the
test period is therefore equal to the last value of the training data.

Fig:27. Sparkling -Naïve Approach

▪ The RMSE value for Naïve approach model is 3864.279
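A minimal sketch of the naïve forecast and its RMSE, on made-up numbers rather than the Sparkling data:

```python
import math

def naive_forecast(train, horizon):
    """Repeat the last observed training value for every future period."""
    return [train[-1]] * horizon

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

train = [10.0, 12.0, 15.0]  # illustrative values only
test = [14.0, 16.0]
forecast = naive_forecast(train, len(test))
print(forecast, rmse(test, forecast))
```

Because the forecast is a flat line, the naïve model cannot follow a seasonal pattern, which is consistent with its large RMSE on this data.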

Model 3 Simple Average


In simple average method, we will forecast by using the average of the training values.

Fig:28. Sparkling -Simple Average

▪ The RMSE value for Simple Average model is 1275.082
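The simple average forecast can be sketched in the same way; the numbers below are illustrative and not taken from the report.

```python
import math

def simple_average_forecast(train, horizon):
    """Use the mean of all training values as the forecast for every future period."""
    mean = sum(train) / len(train)
    return [mean] * horizon

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

train = [10.0, 12.0, 14.0]  # illustrative values only
test = [11.0, 13.0]
forecast = simple_average_forecast(train, len(test))
print(forecast, rmse(test, forecast))
```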

Model 4 Moving Average


In the moving average model, we calculate rolling means (or moving averages) for different intervals. The
best interval is the one with the maximum accuracy (or the minimum error). Here, the rolling means are computed
over the entire data.
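A rolling mean for a given interval can be sketched as below. The function name and the sample values are illustrative assumptions; the intervals actually tried in the report are not stated in the text.

```python
def trailing_mean(values, window):
    """Rolling mean over the last `window` points; None until the window is full."""
    out = [None] * len(values)
    for t in range(window - 1, len(values)):
        out[t] = sum(values[t - window + 1:t + 1]) / window
    return out

series = [1.0, 2.0, 3.0, 4.0, 5.0]  # illustrative values only
for window in (2, 4):
    print(window, trailing_mean(series, window))
```

Each candidate interval yields its own series of rolling means, and the interval with the lowest test RMSE would be preferred.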

Fig:29. Sparkling -Moving Average

The RMSE value for Moving average model is,

Fig:30. Model Comparison plot

Model 5 Simple Exponential Smoothing


• The autofit model picked 0.049 as the smoothing parameter.

• On the second iteration, the model was executed without passing a value for alpha and used parameters
‘optimized=True, use_brute=True’

Fig:31. Sparkling-SES

The RMSE value for the SES model with Alpha=0.049 is 1316.035
The RMSE value for the SES model with Alpha=0.1 is 1375.393
The RMSE value for the SES model with Alpha=0.2 is 1595.206
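The role of alpha can be seen in a hand-written version of the simple exponential smoothing recursion. This is a sketch under one stated assumption: the level is initialised at the first observation. Library implementations may initialise differently, so results will not match the figures above exactly.

```python
def ses_forecast(train, alpha, horizon):
    """Simple exponential smoothing: flat forecast at the last smoothed level."""
    level = train[0]  # assumption: initialise the level at the first observation
    for y in train[1:]:
        level = alpha * y + (1 - alpha) * level
    return [level] * horizon

print(ses_forecast([10.0, 20.0], 0.5, 3))  # illustrative values only
```

A small alpha, such as the 0.049 picked by the autofit, gives a heavily smoothed level that reacts slowly to new observations.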

Model 6 Double Exponential Smoothing (Holt’s Model)

Fig:32. Sparkling-DES
The RMSE value for the DES model with Alpha=0.1 and Beta=0.1 is 1778.564
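Holt's method adds a trend term to the smoothed level. The sketch below is a hand-written version with simple initial values (first observation for the level, first difference for the trend); these starting values are an assumption and may differ from what the report's library used.

```python
def holt_forecast(train, alpha, beta, horizon):
    """Double exponential smoothing (Holt): level plus linear trend."""
    level, trend = train[0], train[1] - train[0]  # simple initial values (an assumption)
    for y in train[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]

# On a perfectly linear synthetic series the forecast continues the line.
forecast = holt_forecast([2.0, 4.0, 6.0, 8.0], alpha=0.1, beta=0.1, horizon=2)
print(forecast)
```

Since the forecast is a straight line, this model cannot capture seasonality, which may explain its high RMSE on this data.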

Model 7 Triple Exponential Smoothing (Holt-Winter’s Model)

• The Triple Exponential Smoothing model (Holt-Winter’s Model) is applicable when the data has both trend and
seasonality. The Sparkling data contains a slight trend and significant seasonality.

• On the second iteration, the model was allowed to choose the optimized values using the parameters
‘optimized=True, use_brute=True’.

• The best model chosen as final one in comparison with all models is the one with alpha 0.1, beta 0.9 and
gamma 0.6 which has the least RMSE value as 338.458

Fig:33. Sparkling-TES

The RMSE value for the TES model with Alpha=0.112, Beta=0.037 and Gamma=0.493 is 473.152
The RMSE value for the TES model with Alpha=0.1, Beta=0.9 and Gamma=0.6 is 338.458
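The way alpha, beta and gamma act on level, trend and seasonality can be sketched with a hand-written additive Holt-Winters recursion. The initialisation used here (level from the first season's mean, trend from the change in seasonal means, seasonal effects from first-season deviations) is an assumption, so this sketch will not reproduce the RMSE values above.

```python
def holt_winters_additive(train, m, alpha, beta, gamma, horizon):
    """Triple exponential smoothing with additive trend and additive seasonality."""
    # Assumed initialisation: level = mean of season 1, trend from the change in
    # seasonal means, seasonal effects = first-season deviations from the level.
    level = sum(train[:m]) / m
    trend = (sum(train[m:2 * m]) - sum(train[:m])) / (m * m)
    season = [train[i] - level for i in range(m)]
    for t in range(m, len(train)):
        y, s = train[t], season[t % m]
        prev_level, prev_trend = level, trend
        level = alpha * (y - s) + (1 - alpha) * (prev_level + prev_trend)
        trend = beta * (level - prev_level) + (1 - beta) * prev_trend
        season[t % m] = gamma * (y - prev_level - prev_trend) + (1 - gamma) * s
    n = len(train)
    return [level + h * trend + season[(n + h - 1) % m] for h in range(1, horizon + 1)]

# Synthetic, purely seasonal series with period 4 (not the report's data).
pattern = [5.0, -2.0, -3.0, 0.0]
train = [50.0 + pattern[t % 4] for t in range(12)]
forecast = holt_winters_additive(train, m=4, alpha=0.1, beta=0.9, gamma=0.6, horizon=4)
print(forecast)
```

For monthly data such as the Sparkling series, the period m would be 12.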

5. Check for the stationarity of the data on which the model is being built on using appropriate statistical tests and
also mention the hypothesis for the statistical test. If the data is found to be non-stationary, take appropriate steps to
make it stationary. Check the new data for stationarity and comment. Note: Stationarity should be checked at alpha =
0.05.

The Augmented Dickey-Fuller test (ADF test) is one of the most commonly used statistical tests for checking whether a
given time series is stationary.

H0: Time series is non-stationary

H1: Time series is stationary

Stationarity should be checked at alpha = 0.05

Fig:34. Sparkling- Stationarity check on the whole data

We see that at the 5% significance level, the above time series is non-stationary. Let us take a difference of order 1
and check whether the series becomes stationary.

After Difference of order 1,

Fig:35. Sparkling- Difference of order 1 on the whole data
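The two steps used here, first-order differencing and the decision rule at alpha = 0.05, can be sketched as follows. The p-value itself would come from an ADF test routine, which is not reimplemented in this sketch; the helper names and sample values are illustrative assumptions.

```python
def difference(values, lag=1):
    """Difference of the given lag: y[t] - y[t - lag]."""
    return [values[t] - values[t - lag] for t in range(lag, len(values))]

def reject_unit_root(p_value, alpha=0.05):
    """True when H0 (series is non-stationary) is rejected at significance level alpha."""
    return p_value < alpha

series = [3.0, 5.0, 7.0, 9.0, 11.0]  # synthetic trending series, not the report's data
diffed = difference(series)
print(diffed)
```

If the ADF p-value of the differenced series is below 0.05, H0 is rejected and the differenced series is treated as stationary.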


Fig:36. Sparkling- Autocorrelation

Fig:37. Sparkling- Differenced Autocorrelation


Inference:
• From the first plot, ACF order is 1, therefore q value is 1.
• From the second plot differenced ACF, we got the q value as 2.
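Reading a cut-off from an ACF plot amounts to comparing each sample autocorrelation with an approximate 95% bound of 1.96 divided by the square root of n. A minimal sketch of that calculation, on a synthetic series and with hypothetical function names:

```python
import math

def acf(values, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]

def significance_bound(n):
    """Approximate 95% bound for the autocorrelations of white noise."""
    return 1.96 / math.sqrt(n)

series = [1.0, -1.0] * 4  # synthetic alternating series, not the report's data
rho = acf(series, 2)
print(rho, significance_bound(len(series)))
```

Lags whose autocorrelation lies outside the bound are treated as significant when choosing q.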

Fig:38. Sparkling- Partial Autocorrelation

Fig:39. Sparkling- Differenced Partial Autocorrelation

Inference:
• From the first plot, PACF order is 1, therefore p value is 1.
• From the second plot differenced PACF, we got the p value as 3.

6. Build an automated version of the ARIMA/SARIMA model in which the parameters are selected using the lowest
Akaike Information Criteria (AIC) on the training data and evaluate this model on the test data using RMSE.

Model 8 Automated ARIMA

▪ The RMSE value is, ARIMA(3,1,3) 1374.037
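Selecting parameters by lowest AIC is a grid search: fit every candidate order, record its AIC, and keep the minimum. The sketch below shows only that selection logic. The function toy_aic is a hypothetical stand-in with a made-up AIC surface so that the example runs on its own; in practice it would be replaced by a routine that fits an ARIMA model on the training data and returns its AIC.

```python
from itertools import product

def best_by_aic(orders, aic_of):
    """Return (order, aic) with the smallest AIC; `aic_of` fits a model and returns its AIC."""
    scored = [(order, aic_of(order)) for order in orders]
    return min(scored, key=lambda pair: pair[1])

# Candidate (p, d, q) orders with d fixed at 1, matching the first-order differencing.
grid = [(p, 1, q) for p, q in product(range(4), repeat=2)]

def toy_aic(order):
    """Hypothetical stand-in for a real model fit; the values are made up."""
    p, _, q = order
    return (p - 3) ** 2 + (q - 3) ** 2 + 2 * (p + q)

best = best_by_aic(grid, toy_aic)
print(best)
```

The order printed here reflects only the toy surface and says nothing about the Sparkling data.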

Model 9 Automated SARIMA

Least AIC:

Fig:40. Sparkling- Auto SARIMA Least AIC

Fig:42. Sparkling- Auto SARIMA (1,1,2)(2,0,2,6)

RMSE value is 627.381

Least AIC:

Fig:43. Sparkling- Auto SARIMA Least AIC

Fig:44. Sparkling- Auto SARIMA (1,1,2)(1,0,2,12)

RMSE value is, 528.612

Inference:
• The diagnostic plot for the model shows approximately normally distributed residuals, with most values
around zero.
• The correlogram shows the autocorrelation of the residuals, and no points lie significantly outside the
confidence band.

7. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training data and evaluate
this model on the test data using RMSE.

Model 10 Manual ARIMA

RMSE value of Man_ARIMA(3,1,2) is, 1379.049

Model 11 Manual SARIMA 6

• The auto-regressive parameter in a SARIMA model is 'p', which comes from the significant lag after which
the PACF plot cuts off; here it is 2.
• The moving-average parameter in a SARIMA model is 'q', which comes from the significant lag after
which the ACF plot cuts off; here it is 2.

Fig:45. Sparkling- Man_SARIMA_6(2,1,2)(2,0,2,6)

RMSE value is, 657.894

Model 12 Manual SARIMA 12

Fig:46. Sparkling- Man_SARIMA_12(3,1,1)(3,1,1,12)

RMSE value is, 333.541

Inference:
• The diagnostic plot for the model shows approximately normally distributed residuals, with most values
around zero.

8. Build a table (create a data frame) with all the models built along with their corresponding parameters and the
respective RMSE values on the test data.

Fig:47. Sparkling-Test RMSE values of all models
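The test RMSE values reported in the sections above can be collected and ranked as in the sketch below. The moving average model is left out because its RMSE value is not stated in the text; the SARIMA value 627.380819506515 is rounded to three decimals.

```python
# Test RMSE values as reported in this document.
results = [
    ("Linear Regression", 1389.135),
    ("Naive", 3864.279),
    ("Simple Average", 1275.082),
    ("SES (alpha=0.049)", 1316.035),
    ("SES (alpha=0.1)", 1375.393),
    ("SES (alpha=0.2)", 1595.206),
    ("DES (alpha=0.1, beta=0.1)", 1778.564),
    ("TES (alpha=0.112, beta=0.037, gamma=0.493)", 473.152),
    ("TES (alpha=0.1, beta=0.9, gamma=0.6)", 338.458),
    ("Auto ARIMA(3,1,3)", 1374.037),
    ("Auto SARIMA(1,1,2)(2,0,2,6)", 627.381),
    ("Auto SARIMA(1,1,2)(1,0,2,12)", 528.612),
    ("Manual ARIMA(3,1,2)", 1379.049),
    ("Manual SARIMA(2,1,2)(2,0,2,6)", 657.894),
    ("Manual SARIMA(3,1,1)(3,1,1,12)", 333.541),
]

ranked = sorted(results, key=lambda row: row[1])  # lowest RMSE first
for name, value in ranked:
    print(f"{name:<45}{value:>10.3f}")
```

Ranked this way, Manual SARIMA(3,1,1)(3,1,1,12) and TES with alpha 0.1, beta 0.9 and gamma 0.6 are the two lowest, which is why both are rebuilt on the full data.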

9. Based on the model-building exercise, build the most optimum model(s) on the complete data and predict 12
months into the future with appropriate confidence intervals/bands.

Two candidate models are rebuilt on the whole data:

A. The best model on the test data is Man_SARIMA_12(3,1,1)(3,1,1,12), with non-seasonal order (p, d, q) = (3, 1, 1) and
seasonal order (P, D, Q, s) = (3, 1, 1, 12).

RMSE value is 547.310

B. We see that the next best model is Triple Exponential Smoothing with additive seasonality with the parameters
α = 0.1, β = 0.9 and γ = 0.6.

RMSE value is 465.568

RMSE of TES < RMSE of SARIMA

Therefore the optimum model for the whole data is TES with parameters 𝛼 = 0.1, 𝛽 = 0.9 and 𝛾 = 0.6

We have forecast Sparkling wine sales for the next 12 months using TES with parameters 𝛼 = 0.1, 𝛽 = 0.9
and 𝛾 = 0.6

Fig:48. Sparkling-Confidence intervals on full data
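One common way to attach a band to an exponential smoothing forecast is to add and subtract 1.96 times the standard deviation of the in-sample residuals. The sketch below shows that calculation. It assumes roughly normal residuals with constant variance, and it is not confirmed by the text that the report's bands were computed this way; the numbers are illustrative.

```python
import statistics

def forecast_band(forecast, residuals, z=1.96):
    """Approximate 95% band: forecast plus or minus z times the residual standard deviation."""
    spread = z * statistics.stdev(residuals)
    return [(f - spread, f + spread) for f in forecast]

residuals = [1.0, -1.0, 1.0, -1.0]  # illustrative in-sample errors, not the report's
band = forecast_band([100.0, 110.0], residuals)
print(band)
```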

Fig:49. Sparkling- Full data forecasting plot

10. Comment on the model thus built and report your findings and suggest the measures that the company should be
taking for future sales.

• Although Man_SARIMA_12 has the lowest test RMSE (333.541), Triple Exponential Smoothing (Holt-Winter’s)
has the lower RMSE on the full data and is therefore selected for the final prediction of the next 12 months.

• TES forecast on the Sparkling Full Data, RMSE is 465.568

• ABC Estate Wines is advised to ramp up its procurement and production lines in accordance
with the above forecasts.

*************************End of Report**************************
