Business Report

Project - Time Series Forecasting - Shoe Sales Analysis

Divjyot Shah Singh

Date: 01/04/2022

Table of Contents

1. Executive Summary
2. Introduction
3. Data Details
Q1 Read the data as an appropriate Time Series data and plot the data
1.1 Reading the Data
1.2 Plotting the Data
2. Perform appropriate Exploratory Data Analysis to understand the data and also perform decomposition
2.1 EDA
Null Value Check
Duplicate Value Check
Data Description
Yearly Box Plots
Monthly Box Plots
Monthly Sales across Years
2.2 Decomposition
3. Split the data into training and test. The test data should start in 1991.
4. Build various exponential smoothing models on the training data and evaluate the model using RMSE on the test data. Other models such as Regression, Naïve forecast models and simple average models should also be built on the training data and check the performance on the test data using RMSE
4.1 Linear Regression
4.2 Naïve Model
4.3 Simple Average Model
4.4 Moving Average Model
4.5 Simple Exponential Smoothing (SES)
4.6 Double Exponential Smoothing (DES)
4.7 Triple Exponential Smoothing (TES)
4.8 Summary of all Models
5. Check for the stationarity of the data on which the model is being built on using appropriate statistical tests and also mention the hypothesis for the statistical test. If the data is found to be non-stationary, take appropriate steps to make it stationary. Check the new data for stationarity and comment. Note: Stationarity should be checked at alpha = 0.05
6. Build an automated version of the ARIMA/SARIMA model in which the parameters are selected using the lowest Akaike Information Criteria (AIC) on the training data and evaluate this model on the test data using RMSE.
6.1 ARIMA Model
6.2 SARIMA Model
7. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training data and evaluate this model on the test data using RMSE.
7.1 ACF and PACF plots
8. Build a table with all the models built along with their corresponding parameters and the respective RMSE values on the test data.
9. Based on the model-building exercise, build the most optimum model(s) on the complete data and predict 12 months into the future with appropriate confidence intervals/bands.
10. Comment on the model thus built and report your findings and suggest the measures that the company should be taking for future sales.

1. Executive Summary

As an analyst at the IJK shoe company, the task is to forecast the sales of pairs of shoes
for the 12 months following the end of the data. Monthly shoe sales data are available
from January 1980 to July 1995.

2. Introduction

The intent of this project is to perform a forecasting analysis on the shoe sales dataset.
I analyse the dataset using Linear Regression, Naïve, Simple Average and Moving Average
models, Simple, Double and Triple Exponential Smoothing, and ARIMA/SARIMA models. The
dataset contains 187 entries. I then build the most optimum model(s) on the complete data
and predict 12 months into the future with appropriate confidence intervals/bands.

3. Data Details

The dataset contains two columns: the first gives the year and month, and the second gives
the sales quantity recorded for that month.

YearMonth Shoe_Sales
1980-01 85
1980-02 89
1980-03 109
1980-04 95
1980-05 91
1980-06 95
1980-07 96

Table 1: Shoes Sales Dataset Details



Q1 Read the data as an appropriate Time Series data and plot the data

1.1 Reading the Data

I have imported the data series. As we can observe, each entry has a YearMonth value with
it, which is not really a data point but an index for the sales entry. So in reality the
dataset has a single column, which contains the quantity of shoes sold in that particular
month. While reading the dataset, I passed arguments so that the first column, which is
the date column, is parsed as dates, and squeeze indicates to the system that this is a
one-column series.

Figure 1: Reading Shoes sales Dataset

The dataset runs from January 1980 to July 1995, so there are 187 entries in total.
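The reading step can be illustrated with a minimal sketch. It uses only the Python standard library rather than the tooling behind Figure 1, and it assumes a space-delimited layout like the one shown in Table 1; the actual file name and delimiter are not given in this report, so `SAMPLE` and `read_series` are hypothetical names used for illustration only.

```python
import csv
import io
from datetime import date

# First rows of the dataset, copied from Table 1. The real file's name and
# delimiter are not stated in the report; a single space is assumed here.
SAMPLE = """YearMonth Shoe_Sales
1980-01 85
1980-02 89
1980-03 109
1980-04 95
1980-05 91
1980-06 95
1980-07 96
"""

def read_series(text):
    """Return a list of (month_start_date, sales) pairs sorted by date."""
    reader = csv.DictReader(io.StringIO(text), delimiter=" ")
    series = []
    for row in reader:
        # "1980-01" -> year 1980, month 1; the day is fixed to the 1st.
        year, month = (int(part) for part in row["YearMonth"].split("-"))
        series.append((date(year, month, 1), int(row["Shoe_Sales"])))
    return sorted(series)

series = read_series(SAMPLE)
print(series[0], series[-1])
```

The date acts as the index and the sales quantity is the only data column, which is the one-column series idea described above.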

1.2 Plotting the Data

For plotting, I have loaded the dataset with no arguments (and hence without parsing the
dates), so I need to provide a time stamp myself. I have removed the YearMonth variable
and added a time stamp to the dataset.

I have plotted the time series below.

Figure 2: Shoe Sales Time Series Plot

As we can observe from the above plot, shoe sales were on an upward trend till 1988 and on
a downward trend from 1988 onwards. A seasonal pattern is also visible in the graph. We
will explore the trend and seasonality further during decomposition, which gives a more
detailed view of these two factors.

2. Perform appropriate Exploratory Data Analysis to understand the data and also perform decomposition

2.1 EDA

Null Value Check

Performing a Null value check on the time series, I got:

Figure 3: Null Value Check



Duplicate Value Check

There are no duplicate entries in the dataset: each value corresponds to a different time
index, so these are all sales figures for different months.

Data Description

Figure 4: Shoes sales Time Series Data Description

As we can see from the above, the shoe sales series looks skewed. The standard deviation
is high, which is consistent with the significant difference between the minimum and the
maximum. The mean also differs from the median, which again points to skewness. As
mentioned earlier, there are 187 records in total in the dataset.

Yearly Box Plots

Following is the yearly box plot for the shoe sales time series:

Figure 5: Yearly Box Plots

As we can observe from the above plot, shoe sales show an upward trend till 1987 and a
downward trend post 1988. The highest sales can be observed in 1987 and the lowest in
1980. The variation in monthly sales appears to be highest in 1985 and lowest in 1984.

There are outliers in the yearly sales data; however, as this is a time series, we can
leave the outliers untreated.

Monthly Box Plots

Following is the monthly box plot for the shoe sales time series:

Figure 6: Monthly Box Plots

The Monthly Box Plots show a clear seasonal pattern in the time series, with sales rising
in the last quarter of the year. Sales are more or less consistent till June, pick up
from July, show some stagnancy in September and then pick up again from October (i.e. the
last quarter). The monthly sales data are skewed in almost every month.

Monthly Sales across Years

The monthly sales across years can be seen in the following pivot table and the
associated graph:

Figure 7: Monthly Sales across Years

As can be observed from the above table and graph, December drives the highest sales
figures, with November second highest. A seasonal pattern is visible in the graph above.

2.2 Decomposition

I have provided the decomposed elements of the time series below:

Figure 8: Additive Decomposition

Figure 9: Multiplicative Decomposition

We can see the decomposition of the time series above. I have tried both additive and
multiplicative decomposition to determine whether the shoe sales series is multiplicative
or additive.

From the above, the time series is multiplicative in nature and has a seasonal component.

The plots also indicate that the sales are unstable and not uniform over time, with an
apparent seasonal pattern.

3. Split the data into training and test. The test data should start in 1991.

I have split the time series into Train and Test datasets below. The question specifies
that the Test data should start in 1991.

Figure 10: Training and Test Datasets for Shoes Time Series

I have also confirmed, using the head and tail functions on the Train and Test datasets,
that the Train dataset ends in 1990 and the Test dataset starts in 1991. The Train data
frame has 132 observations and the Test data frame has 55 observations.
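The observation counts can be checked with a small sketch that rebuilds the monthly index from January 1980 to July 1995 and splits it at 1991. The helper names `month_range` and `split_by_year` are hypothetical and use plain Python; this is not the code behind Figure 10.

```python
from datetime import date

def month_range(start_year, start_month, end_year, end_month):
    """List the first day of every month from start to end, inclusive."""
    months = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        months.append(date(year, month, 1))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months

def split_by_year(index, first_test_year):
    """Split a date index into (train, test); test starts in first_test_year."""
    train = [d for d in index if d.year < first_test_year]
    test = [d for d in index if d.year >= first_test_year]
    return train, test

index = month_range(1980, 1, 1995, 7)
train, test = split_by_year(index, 1991)
print(len(index), len(train), len(test))  # 187 132 55
```

Eleven full years (1980 to 1990) give 132 training months, and four full years plus seven months (January 1991 to July 1995) give 55 test months.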

I have also plotted the Train and Test data frames below:

Figure 11: Plot for Training and Test data frames

In the above plot, the blue part depicts the Train dataset (January 1980 - December 1990)
and the orange part depicts the Test dataset (January 1991 - July 1995).



4. Build various exponential smoothing models on the training data and evaluate the model using RMSE on the test data. Other models such as Regression, Naïve forecast models and simple average models should also be built on the training data and check the performance on the test data using RMSE

In this section I run the various available models on the time series dataset, starting
with the Linear Regression model.

4.1 Linear Regression

The extracts of Training and Test time stamps for the Linear Regression can be seen

below:

Figure 12: Training and Test data for Linear Regression

Following are the results from a Linear Regression model on the dataset:

Figure 13: Linear Regression Outcome

The regression plot above depicts the regression on the training set as the red line and
that on the test set as the green line. As we can observe from the plot, shoe sales show
an upward trend on the training data and a downward trend on the test data.

For the regression-on-time forecast on the test data,

RMSE = 266.276 | MAPE = 110.88
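For reference, the two error metrics reported throughout this section can be sketched as follows. This assumes the usual definitions, with MAPE expressed in percent (which matches the scale of the values reported here); the exact functions used for the report are not shown, and the toy numbers are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Assumes no actual value is 0."""
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100 * sum(ratios) / len(ratios)

# Toy values (hypothetical), not taken from the shoe sales data.
actual = [100, 200, 400]
predicted = [110, 180, 400]
print(round(rmse(actual, predicted), 3), round(mape(actual, predicted), 2))
```

RMSE is in the same units as the sales figures, while MAPE is relative, so a MAPE above 100 means the forecasts are off by more than the actual values on average.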

The summarized performance of the model run on the dataset can be seen below:

Figure 14: Performance of the Linear Regression Model

4.2 Naïve Model

The extracts of Training and Test data for the Naïve Model can be seen below:

Figure 15: Training and Test data for Naive Model

Following is the result from running a Naïve Model:



Figure 16: Naive Model Outcome

For the Naïve model forecast on the test data,

RMSE = 2450.121 | MAPE = 101.47

Figure 17: Performance of the two Models

As can be seen from the performance above, the Naïve model is not suitable for the shoe
sales dataset, since every forecast simply repeats the last observation of the training
data.

4.3 Simple Average Model

The extracts of Training and Test data for the Simple Average Model can be seen below:

Figure 18: Training and Test data for Simple Average Model

Following are the results from running a Simple Average Model:

Figure 19: Simple Average Model Outcome

For the Simple Average model forecast on the test data,

RMSE = 63.985 | MAPE = 21.86

The summarized performance of the models run on the dataset can be seen below:

Figure 20: Performance of the three Models



As can be seen from the performance above, the Simple Average model has the best
performance among the three models run so far.
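The two baselines from sections 4.2 and 4.3 can be sketched in a few lines. This is an illustration of the standard definitions rather than the code used for the figures: the Naïve forecast repeats the last training value, and the Simple Average forecast repeats the training mean. The toy training data reuses the first rows of Table 1, and the function names are hypothetical.

```python
def naive_forecast(train, horizon):
    """Repeat the last training observation for every future step."""
    return [train[-1]] * horizon

def simple_average_forecast(train, horizon):
    """Repeat the mean of the training data for every future step."""
    mean = sum(train) / len(train)
    return [mean] * horizon

# First rows of Table 1, used here as toy training data.
train = [85, 89, 109, 95, 91, 95, 96]
print(naive_forecast(train, 3))  # [96, 96, 96]
print(simple_average_forecast(train, 2))
```

Both forecasts are flat lines. The Naïve line depends entirely on where the training data happens to end, while the Simple Average line reflects the overall level of the training period.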

4.4 Moving Average Model

The Moving Average data for the dataset can be seen below:

Figure 21: Moving Average Model Data

Following are the results from running the Moving Average models on the dataset:

Figure 22: Moving Average Model Outcome

For the 2-point Moving Average model forecast on the test data, RMSE = 45.948 | MAPE = 14.32

For the 4-point Moving Average model forecast on the test data, RMSE = 57.872 | MAPE = 19.48

For the 6-point Moving Average model forecast on the test data, RMSE = 63.456 | MAPE = 22.38

For the 9-point Moving Average model forecast on the test data, RMSE = 67.723 | MAPE = 23.33

The summarized performance of the models run on the dataset can be seen below:

Figure 23: Summarized Performance of the Models

I have applied 2, 4, 6 and 9-point trailing averages on the dataset.

As we can observe from the above plots, all of the trailing averages lie below the actual
train and test data, and the 9-point trailing average lies lowest of all. The 2-point
trailing moving average tracks the actual data most closely, which is corroborated by the
RMSE scores of the moving average models.

As can be seen from the summarized performance, the 2-point moving average has the best
performance of all the models run on the dataset so far.
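A trailing moving average of the kind described above can be sketched as follows. The function name and the choice to leave the first `window - 1` positions empty are assumptions made for illustration; the values are the first rows of Table 1.

```python
def trailing_moving_average(values, window):
    """Mean of the current point and the previous window - 1 points.

    Positions with fewer than `window` points available are returned as None.
    """
    averaged = []
    for i in range(len(values)):
        if i + 1 < window:
            averaged.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            averaged.append(sum(chunk) / window)
    return averaged

sales = [85, 89, 109, 95, 91, 95, 96]
print(trailing_moving_average(sales, 2))
# [None, 87.0, 99.0, 102.0, 93.0, 93.0, 95.5]
```

A larger window averages over more history and therefore reacts more slowly to recent changes, which is in line with the 9-point average tracking the actual data least closely.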

4.5 Simple Exponential Smoothing (SES)

The SES parameters for the dataset can be seen below:

Figure 24: SES Parameters

Following is the result from running a SES Model on the dataset:

Figure 25: Simple Exponential Smoothing Outcome

For Alpha = 0.605 Simple Exponential Smoothing model forecast on the test data,

RMSE = 196.405 | MAPE = 79.92

The summarized performance of the models run on the dataset can be seen below:

Figure 26: Summarized Performance of the Models

The SES model is meant for data that has no trend or seasonality. I still applied it to
this dataset to see how the model performs in this case.

I used Alpha = 0.605 for the SES model and, as expected, it did not perform well: its
RMSE of 196.405 is well above those of the Simple Average and Moving Average models.
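The SES recursion can be sketched in plain Python to show why its forecast is a flat line. This is a minimal illustration, not the fitted model from Figure 24; initialising the level with the first observation is an assumption, and the training values are the first rows of Table 1.

```python
def ses_forecast(train, alpha, horizon):
    """Simple exponential smoothing.

    level_t = alpha * y_t + (1 - alpha) * level_(t-1)
    Every future step is forecast as the final level.
    """
    level = train[0]  # assumption: start the level at the first observation
    for y in train[1:]:
        level = alpha * y + (1 - alpha) * level
    return [level] * horizon

train = [85, 89, 109, 95, 91, 95, 96]
forecast = ses_forecast(train, alpha=0.605, horizon=3)
print(forecast)
```

Because the forecast is constant, SES cannot follow the downward trend or the seasonal swings in the test period, which is consistent with its weak RMSE here.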

4.6 Double Exponential Smoothing (DES)

The DES parameters for the dataset can be seen below:

Figure 27: DES Parameters

Following is the result from running a DES model on the dataset:

Figure 28: Double Exponential Smoothing Outcome

For Alpha = 0.1, Beta = 0.1 Double Exponential Smoothing model forecast on the test
data, RMSE = 76.91

The summarized performance of the models run on the dataset can be seen below:

Figure 29: Summarized Performance of the Models

The DES model is meant for data that has level and trend but no seasonality. I started
with a grid search, which found that Alpha = 0.1 and Beta = 0.1 give the lowest RMSE and
MAPE. With an RMSE of 76.91, the DES model performs far better than SES (196.405),
although the 2-point moving average (45.948) remains the best model so far.
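The grid search idea can be sketched with a hand-written version of Holt's linear (double exponential) smoothing. This is an illustrative sketch under stated assumptions, not the code behind Figure 27: the initialisation (first value for the level, first difference for the trend), the 0.1 to 0.9 grid, the function names and the toy data are all hypothetical.

```python
import itertools
import math

def holt_forecast(train, alpha, beta, horizon):
    """Holt's linear (double exponential) smoothing forecast."""
    # Assumption: level starts at the first value, trend at the first difference.
    level, trend = train[0], train[1] - train[0]
    for y in train[1:]:
        previous_level = level
        level = alpha * y + (1 - alpha) * (previous_level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return [level + step * trend for step in range(1, horizon + 1)]

def grid_search(train, test, grid):
    """Return (rmse, alpha, beta) with the lowest test RMSE over the grid."""
    best = None
    for alpha, beta in itertools.product(grid, repeat=2):
        predicted = holt_forecast(train, alpha, beta, len(test))
        score = math.sqrt(
            sum((a - p) ** 2 for a, p in zip(test, predicted)) / len(test)
        )
        if best is None or score < best[0]:
            best = (score, alpha, beta)
    return best

# Toy data (hypothetical), not the shoe sales series.
train = [10, 12, 15, 13, 17, 19]
test = [21, 22]
grid = [round(0.1 * k, 1) for k in range(1, 10)]
best_rmse, best_alpha, best_beta = grid_search(train, test, grid)
print(best_rmse, best_alpha, best_beta)
```

Unlike SES, the forecast here is a sloped line (level plus a multiple of the trend), so the model can follow a trend but still has no seasonal term.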

4.7 Triple Exponential Smoothing (TES)

The TES parameters for the dataset can be seen below:

Figure 30: TES Parameters

The TES train and test data can be seen below:

Figure 31: TES Model Train and Test data

Following is the result from running a TES model on the dataset:

Figure 32: Triple Exponential Smoothing Outcome

For Alpha = 0.606, Beta = 0, Gamma = 0.262 Triple Exponential Smoothing model forecast
on the test data, RMSE = 133.703

The summarized performance of the models run on the dataset can be seen below:

Figure 33: Summarized Performance of the Models

4.8 Summary of all Models

Now that we have run all the planned models, let’s view the summary of their performance
on the dataset:

Figure 34: Sorted Model Performance Summary

For this dataset, the 2-point trailing moving average gives the best RMSE and MAPE among
all the models.

5. Check for the stationarity of the data on which the model is being built on using appropriate statistical tests and also mention the hypothesis for the statistical test. If the data is found to be non-stationary, take appropriate steps to make it stationary. Check the new data for stationarity and comment. Note: Stationarity should be checked at alpha = 0.05

I have tested the shoe sales series for stationarity using the Augmented Dickey-Fuller
(ADF) test at Alpha = 0.05. The hypotheses are:

H0: the shoe sales series is non-stationary (it has a unit root).
H1: the shoe sales series is stationary.

Figure 35: Stationarity

As we can observe from the above, the p-value is greater than alpha, so we fail to reject
H0: the series is non-stationary and has to be made stationary. A stationary series is
one whose properties do not depend on the time at which it is observed, so
non-stationarity is a hint of a trend/seasonality element in the dataset. After taking
the first difference between consecutive observations, the p-value is less than 0.05, so
we reject H0 and treat the differenced series as stationary.
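The differencing step can be sketched as below. Only the differencing itself is shown; the ADF test would be run separately on the result (for example with a statistics library), and that part is not reproduced here. The function name is hypothetical and the values are the first rows of Table 1.

```python
def difference(values, lag=1):
    """Return values[t] - values[t - lag] for every t where both exist."""
    return [values[i] - values[i - lag] for i in range(lag, len(values))]

sales = [85, 89, 109, 95, 91, 95, 96]
print(difference(sales))  # [4, 20, -14, -4, 4, 1]
```

The differenced series is one observation shorter than the original and describes month-on-month changes, which removes a slowly moving level from the data.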



6. Build an automated version of the ARIMA/SARIMA model in which the parameters are selected using the lowest Akaike Information Criteria (AIC) on the training data and evaluate this model on the test data using RMSE.

6.1 ARIMA Model


Figure 36: Running Automated ARIMA Model

Following are the results of the ARIMA model on the dataset:

Figure 37: Results of Automated ARIMA Model

As we can see from the above, the lowest AIC is 1479.147, recorded for (p, d, q) =
(4, 1, 3). The p-values of the MA1 and MA2 coefficients are 0 and 0.013, both below 0.05,
so these terms are statistically significant. The RMSE and MAPE values are:

RMSE: 205.555 | MAPE: 83.41

6.2 SARIMA Model

Following is the outcome of the SARIMA model run on the data:

Figure 38: SARIMA Model

As can be observed, the model with (p, d, q) = (2, 1, 1) has the lowest AIC, which is 14.
The p-values of ar.S.L12 and ma.S.L12 are less than 0.05, which makes them statistically
significant. The RMSE and MAPE values are:

RMSE: 70.723 | MAPE: 24.48

7. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training data and evaluate this model on the test data using RMSE.

7.1 ACF and PACF plots

An autocorrelation (ACF) plot represents the correlation of the series with lags of
itself. A partial autocorrelation (PACF) plot represents the correlation between the
series and a lag of itself that is not explained by correlations at all lower-order lags.
Spikes that fall inside the blue region (the confidence band) are not statistically
different from zero, so ideally all the spikes fall in the blue region.
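The quantities behind an ACF plot can be sketched in plain Python. The band shown is the common white-noise approximation of plus or minus 1.96 divided by the square root of n; plotting libraries may compute the shaded region differently, so treat this as an assumption. The function names are hypothetical and the demo values are the first rows of Table 1.

```python
import math

def acf(values, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(values)
    mean = sum(values) / n
    denominator = sum((v - mean) ** 2 for v in values)
    return [
        sum((values[t] - mean) * (values[t - k] - mean) for t in range(k, n))
        / denominator
        for k in range(max_lag + 1)
    ]

def confidence_bound(n, z=1.96):
    """Approximate 95% band for a white-noise series: +/- z / sqrt(n)."""
    return z / math.sqrt(n)

sales = [85, 89, 109, 95, 91, 95, 96]
correlations = acf(sales, max_lag=3)
bound = confidence_bound(len(sales))
# Lags whose autocorrelation falls outside the band are flagged as significant.
significant = [k for k, r in enumerate(correlations) if k > 0 and abs(r) > bound]
print(correlations, bound, significant)
```

The lag after which the spikes drop inside the band is the cut-off point that is read from the plot when choosing the MA order (from the ACF) or the AR order (from the PACF).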


Figure 39: ACF and PACF result

The above shows the ACF and PACF, respectively, for the stationary time series. The plots
indicate that an MA(1) model would be appropriate for the time series, because the ACF
cuts off after lag 1 while the PACF decays slowly.

Following is the outcome of the SARIMA model run on the data:

Figure 40: SARIMA model

Following is the outcome of the ARIMA model run on the data:

Figure 41: ARIMA model

8. Build a table with all the models built along with their corresponding parameters and the respective RMSE values on the test data.

I have sorted the models based on lowest RMSE and MAPE values on test data.

Figure 42: RMSE and MAPE values on test data for all the model runs

We can observe that the 2-point trailing moving average has the lowest RMSE and MAPE on
the test data and hence is the best model.

9. Based on the model-building exercise, build the most optimum model(s) on the complete data and predict 12 months into the future with appropriate confidence intervals/bands.

We can plot the real and the forecasted sales for the time series.

Figure 43: Forecasted sales

Figure 44: Lower and Upper Confidence interval bands



Figure 45: Lower and Upper Confidence interval forecasted plot

10. Comment on the model thus built and report your findings and suggest the measures that the company should be taking for future sales.

• The company should come up with discount offers in the months of January to May, as
  sales are low in these months.
• The company can also revisit its pricing for shoes, as the yearly box plots showed many
  outliers.
• Increase the sample size.
• Increase the number of independent variables.
• Try more combinations of variables to see if the accuracy of the model can be improved.
