RHEA.S.M
PGPDSBA Online Sep_B 2021
Table of Contents
1. Problem 1:
1.1. Objective
1.2. Descriptive and Exploratory Data Analysis
1.2.1. Descriptive Data analysis:
1.2.2. Time Series Data- Plotted:
1.2.3. Exploratory Data Analysis:
1.3. Splitting of Train and Test data
1.4. Building Different models and checking RMSE
1.4.1. Linear Regression:
1.4.2. Naïve Model:
1.4.3. Simple Average Forecast:
1.4.4. Moving Average Forecast:
1.4.5. Simple Exponential Smoothening:
1.4.6. Double Exponential Smoothening:
1.4.7. Triple Exponential Smoothening:
1.4.8. Triple Exponential Smoothening (Multiplicative):
1.5. Checking for Stationarity
1.6. ARIMA and SARIMA using lowest AIC method:
1.7. ARIMA and SARIMA based on the cut-off points of ACF and PACF:
1.8. Comparing RMSE values
1.9. Building of optimum model and 12 month forecast
1.10. Findings and Suggestions
List of Figures
Figure No.  Name
Fig 1   Time Series Plot – Shoe Sales
Fig 2   Monthly Box plot of Shoe Sales
Fig 3   Monthly Shoe Sales across the years
Fig 4   Time Series Plot along with Mean and Median
Fig 5   Multiplicative Decomposition of dataset
Fig 6   Additive Decomposition of dataset
Fig 7   Shoe Sales- Train and Test split
Fig 8   Linear Regression
Fig 9   Naïve Model
Fig 10  Simple Average Forecast
Fig 11  Trailing Moving Average Forecast
Fig 12  Single Exponential Smoothening
Fig 13  Single and Double Exponential Smoothening
Fig 14  Simple, Double and Triple Exponential Smoothening
Fig 15  Simple, Double and Triple Exponential Smoothening (Multiplicative)
Fig 16  Stationarity of Shoe Sales at lag 1
Fig 17  AIC-ARIMA(2,1,3) A. Summary, B. Graph and C. Diagnostics
Fig 18  AIC- SARIMA(0,1,2) (1, 0, 2, 12) A. Summary, B. Graph and C. Diagnostics
Fig 19  Autocorrelation of Differenced Data
Fig 20  Partial Autocorrelation of Differenced Data
Fig 21  ACF/PACF- ARIMA(3,1,1) A. Summary, B. Graph and C. Diagnostics
Fig 22  ACF/PACF- SARIMA(3,1,1) (2, 0, 4, 12) A. Summary, B. Graph and C. Diagnostics
Fig 23  Optimum Model Forecast for next 12 months
List of Tables
Table No.  Name
Table 1    Summary of Descriptive statistics information
Table 2    Train and Test Split
Table 3    Summary Results of all models
1. Problem 1:
1.1. Objective
The objective of the problem is to build an optimum model to forecast the sales of
pairs of shoes for the 12 months following the point where the data currently ends.
We also have to comment on the model thus built, report our findings and suggest
the measures that the company should take for future sales.
Background: You are an analyst in the IJK shoe company and you are expected to
forecast the sales of the pairs of shoes for the upcoming 12 months from where the
data ends. The data for the pair of shoe sales have been given to you from January
1980 to July 1995.
Data Dictionary:
YearMonth: Month and Year of Shoe Sales
Shoe_Sales: The monthly sale of shoes
1.2.1. Descriptive Data analysis:
The dataset has been read and stored as a data frame for further analysis.
The data set consists of 2 columns and 187 entries. There are no null values
present.
The first column represents the date on which the Shoe Sales were recorded,
while the second column holds the Sales figures, which are numerical.
Table 1 shows the head(), tail(), info() and description of the dataset at
hand.
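The inspection steps described above can be sketched in pandas as follows. This is an illustration only: the file name is not given in the report, so the sketch reads three made-up rows from an in-memory string laid out like the data dictionary, and the "YYYY-MM" date format is an assumption.

```python
import io

import pandas as pd

# Made-up rows standing in for the real file (its name is not given in the report).
# Column names follow the data dictionary; the "YYYY-MM" format is an assumption.
csv_text = """YearMonth,Shoe_Sales
1980-01,100
1980-02,120
1980-03,90
"""

df = pd.read_csv(io.StringIO(csv_text))
df["YearMonth"] = pd.to_datetime(df["YearMonth"], format="%Y-%m")
df = df.set_index("YearMonth")

print(df.head())      # first rows
print(df.tail())      # last rows
df.info()             # column types and non-null counts
print(df.describe())  # count, mean, std and quartiles

null_count = int(df["Shoe_Sales"].isnull().sum())
print("Null values:", null_count)
```

Setting the date column as the index is a design choice that makes the later year-based slicing and monthly frequency handling simpler.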
1.2.2. Time Series Data- Plotted:
Additionally, since the mean is higher than the median, we conclude that the
distribution is positively skewed.
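A mean above the median is a rule of thumb for positive skew rather than a guarantee, so it can help to check the sample skewness as well. The sketch below uses made-up numbers, not the shoe sales data.

```python
import pandas as pd

# Made-up values with one large observation pulling the mean upwards.
sales = pd.Series([10, 12, 11, 13, 12, 40])

mean_value = sales.mean()
median_value = sales.median()
skewness = sales.skew()  # sample skewness; positive means a longer right tail

print("mean:", mean_value, "median:", median_value, "skewness:", skewness)
```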
DECOMPOSITION OF THE DATASET:
(ii) Additive Decomposition Of The Dataset:
Data is represented in terms of addition of seasonality, trend, cyclical
and residual components. Used where change is measured in absolute
quantity.
Since we are looking at change in absolute quantity for this particular dataset we
move on with using the additive model.
1.3. Splitting of Train and Test data
The train-test split is used to estimate how well a prediction model performs on
data it has not seen. It is a fast and easy procedure: the model is fitted on the
train data, and its forecasts are then compared with the actual values in the test
data.
The dataset has been split at the year 1991, so the train data ends in December 1990
and the test data starts from January 1991.
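A split at 1991 can be sketched as below. The index reproduces the January 1980 to July 1995 range stated in the background; the sales values are placeholders, since the real figures are not reproduced here.

```python
import numpy as np
import pandas as pd

# Monthly index for January 1980 to July 1995 (187 months); values are placeholders.
idx = pd.date_range("1980-01-01", "1995-07-01", freq="MS")
df = pd.DataFrame({"Shoe_Sales": np.arange(len(idx))}, index=idx)

train = df[df.index.year < 1991]   # 1980 to 1990
test = df[df.index.year >= 1991]   # January 1991 onwards

print(len(df), len(train), len(test))  # 187 132 55
```

Splitting by date rather than at random keeps the time order intact, which matters because the test period must lie entirely after the training period.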
Table-2: Train and Test Split
Train data Head of the dataset: Test data Head of the dataset:
Train data Tail of the dataset: Test data Tail of the dataset:
1.4. Building Different models and checking RMSE
Figure-9 Naïve Model
The RMSE value is lowest for the Naïve model so far (this is the naïve forecast,
which repeats the last observed value, and not the Naïve Bayes classifier). But
since the forecast is constant through the years, it isn’t an ideal model for our
dataset.
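The constant forecast mentioned above comes from how a naive forecast works: every future value is set to the last observed training value. A minimal sketch with made-up numbers, including the RMSE used throughout this section:

```python
import numpy as np

# Made-up train and test values; NOT the shoe sales data.
train = np.array([10.0, 12.0, 14.0, 13.0])
test = np.array([15.0, 11.0, 13.0])

# Naive forecast: repeat the last training observation for every test period.
naive_forecast = np.full(len(test), train[-1])

# Root mean squared error between actual and forecast values.
rmse = np.sqrt(np.mean((test - naive_forecast) ** 2))
print(naive_forecast, rmse)
```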
The method is very simple. We average the data by month, quarter or year to get
the average for each period, and then work out what percentage each period average
is of the grand average.
Model Type            RMSE
RegressionOnTime      266.2765
NaiveModel            245.1213
SimpleAverageModel    63.98457
The RMSE values seem to be lowest for the Simple Average Method so far. But
since the forecast is constant through the years, it isn’t an ideal model for our
dataset.
The RMSE values seem to be lowest for the 2 point Trailing Moving Average
Method so far.
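A trailing moving average can be sketched with pandas `rolling`. The values are made up. Note that in this sketch each value averages the current and the preceding observations, so a rolling mean computed over the full series follows the actual data closely and behaves more like a smoother than a multi-year forecast; whether the report computed it this way is an assumption.

```python
import pandas as pd

# Made-up values; NOT the shoe sales data.
sales = pd.Series([10.0, 12.0, 14.0, 13.0, 15.0, 11.0])

# Trailing windows: each value is the mean of the current and previous points.
trailing_2 = sales.rolling(window=2).mean()
trailing_4 = sales.rolling(window=4).mean()

print(pd.DataFrame({"sales": sales, "trailing_2": trailing_2, "trailing_4": trailing_4}))
```

The first `window - 1` entries are NaN because there are not yet enough observations to fill the window.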
Figure-13 Simple and Double Exponential Smoothening
Figure-15 Simple, Double and Triple Exponential Smoothening (Multiplicative)
Model Type RMSE
RegressionOnTime 266.2765
NaiveModel 245.1213
SimpleAverageModel 63.98457
2pointTrailingMovingAverage 45.94874
4pointTrailingMovingAverage 57.87269
6pointTrailingMovingAverage 63.45689
9pointTrailingMovingAverage 67.72365
SimpleExponentialSmoothing 196.4048
DoubleExponentialSmoothing 266.1612
TripleExponentialSmoothing 128.9925
TripleExponentialSmoothingMultiplicative 83.73405
The RMSE values seem to be lowest for the 2 point Trailing Moving Average
Method so far.
1.5. Checking for Stationarity
When the Augmented Dickey-Fuller (ADF) test was applied to the series we got a
p-value of 0.801, which is higher than 0.05; hence we fail to reject the null
hypothesis and conclude that the series is not stationary.
We now have to take a first-order difference of the dataset and check for
stationarity again.
The p-value after first-order differencing is 0.0361 < 0.05; hence we now reject the
null hypothesis and conclude that the differenced series is stationary.
Below is a graphic representation of the same. The test statistic value is
-3.532, while the number of lags used is 12.
Now that the data is stationary we can move on to building the ARIMA and
SARIMA models.
1.6. ARIMA and SARIMA using lowest AIC method:
An ARIMA model consists of the Auto-Regressive (AR) part and the Moving
Average (MA) part, applied after we have made the Time Series stationary by taking
the correct degree/order of differencing.
ARIMA models can also be built keeping the Akaike Information Criterion (AIC) in
mind. In this case, we choose the ‘p’ and ‘q’ values (the AR and MA orders
respectively) that give the lowest AIC value. The lower the AIC, the better the
model.
The code tries different orders of ‘p’ and ‘q’ to arrive at this conclusion.
Remember, even when choosing ‘p’ and ‘q’ this way, we must make sure that the
series is stationary.
The formula for calculating the AIC is 2k – 2ln(L), where k is the number of
parameters to be estimated and L is the maximized value of the likelihood function.
For the SARIMA models, we can also estimate ‘p’, ‘q’, ‘P’ and ‘Q’ by looking at the
lowest AIC values.
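As a small worked example of the AIC formula, the sketch below compares two hypothetical models. The parameter counts and log-likelihoods are made up purely to show how the 2k penalty can outweigh a slightly better fit.

```python
def aic(num_params, log_likelihood):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * num_params - 2 * log_likelihood


# Hypothetical candidates: (number of parameters k, maximized log-likelihood ln L).
candidates = {"model_a": (3, -100.0), "model_b": (5, -99.5)}

scores = {name: aic(k, ll) for name, (k, ll) in candidates.items()}
best = min(scores, key=scores.get)

print(scores)  # {'model_a': 206.0, 'model_b': 209.0}
print(best)    # model_a
```

Here model_b fits marginally better but uses two more parameters, so its AIC is higher and model_a is preferred.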
ARIMA:
i. We first create a grid of all possible combinations of (p,d,q), with ‘p’ and ‘q’
each taking the values 0 to 3 (range(0,4)) and ‘d’ held constant at 1.
Model: (0, 1, 1)
Model: (0, 1, 2)
Model: (0, 1, 3)
Model: (1, 1, 0)
Model: (1, 1, 1)
Model: (1, 1, 2)
Model: (1, 1, 3)
Model: (2, 1, 0)
Model: (2, 1, 1)
Model: (2, 1, 2)
Model: (2, 1, 3)
Model: (3, 1, 0)
Model: (3, 1, 1)
Model: (3, 1, 2)
Model: (3, 1, 3)
ii. We then fit an ARIMA model for each of the above combinations and choose
the one with the lowest AIC value.
param AIC
11 (2, 1, 3) 1480.805493
15 (3, 1, 3) 1482.566450
5 (1, 1, 1) 1492.487187
6 (1, 1, 2) 1494.423859
9 (2, 1, 1) 1494.431498
2 (0, 1, 2) 1494.964605
3 (0, 1, 3) 1495.148474
14 (3, 1, 2) 1495.655855
13 (3, 1, 1) 1496.346864
7 (1, 1, 3) 1496.385878
10 (2, 1, 2) 1496.410739
1 (0, 1, 1) 1497.050322
12 (3, 1, 0) 1498.930309
8 (2, 1, 0) 1498.950483
4 (1, 1, 0) 1501.643124
0 (0, 1, 0) 1508.283772
iii. The lowest AIC for ARIMA is clearly that of (2, 1, 3), with an AIC of 1480.81.
We now fit this model on the train data and forecast on the test set, which gives
us the ARIMA Summary, graph and diagnostic results.
Figure-17 AIC-ARIMA(2,1,3) A. Summary, B. Graph and C. Diagnostics
iv. We finally check the accuracy of the model with the help of the RMSE and
MAPE calculated.
SARIMA:
ii. We then fit a SARIMA model for each of the above combinations and choose
the one with the lowest AIC value.
80 (2, 1, 2) (2, 0, 2, 12) 1158.630324
iii. The lowest AIC for SARIMA is clearly that of (0, 1, 2) (1, 0, 2, 12), with an
AIC of 1156.165429. We now fit this model on the train data and forecast on the
test set, which gives us the SARIMA Summary, graph and diagnostic results. These
can be seen in Figure-18 below.
iv. We finally check the accuracy of the model with the help of the RMSE and
MAPE calculated. AIC-SARIMA has the lowest RMSE and MAPE up until now.
Figure-18 AIC- SARIMA(0,1,2) (1, 0, 2, 12) A. Summary, B. Graph and C. Diagnostics
1.7. ARIMA and SARIMA based on the cut-off points of ACF and PACF:
An ARIMA model consists of the Auto-Regressive (AR) part and the Moving
Average (MA) part after we have made the Time Series stationary by taking the
correct degree/order of differencing.
The AR order is selected by looking at where the PACF plot cuts off (for
appropriate confidence interval bands) and the MA order is selected by looking at
where the ACF plot cuts off (for appropriate confidence interval bands).
The correct degree or order of differencing gives us the value of ‘d’, while the ‘p’
value is the order of the AR model and the ‘q’ value is the order of the MA
model.
For SARIMA, the seasonal parameter ‘F’ can be determined by looking at the
ACF plot. The ACF plot is expected to show spikes at multiples of ‘F’, thereby
indicating the presence of seasonality.
Also, for Seasonal models, the ACF and the PACF plots are going to behave a bit
differently and they will not always continue to decay as the number of lags
increases.
ARIMA:
i. We are to observe the ACF and PACF plots. We get the ‘p’ value from the
PACF and the ‘q’ value from the ACF plot. The following are the plots at d=1:
Figure-19 Autocorrelation of Differenced Data
ii. We then fit an ARIMA(3,1,1) model. These values have been read from the
ACF and PACF plots. This gives us the ARIMA Summary, graph and diagnostic
results.
Figure-21 ACF/PACF- ARIMA(3,1,1) A. Summary, B. Graph and C. Diagnostics
iii. We finally check the accuracy of the model with the help of the RMSE and
MAPE calculated. AIC-SARIMA has lowest RMSE and MAPE up until now.
SARIMA:
i. We again observe the ACF and PACF plots. We get the ‘p’ value from the
PACF and the ‘q’ value from the ACF plot, using Figures 19 and 20 above at d=1
with a seasonal frequency of 12. We additionally find P, D and Q from the same
plots by looking for seasonal peaks.
ii. We then fit a SARIMA(3,1,1) (2, 0, 4, 12) model. These values have been read
from the ACF and PACF plots. This gives us the SARIMA Summary, graph and
diagnostic results.
Figure-22 ACF/PACF- SARIMA(3,1,1) (2, 0, 4, 12) A. Summary, B. Graph and C. Diagnostics
iii. We finally check the accuracy of the model with the help of the RMSE and
MAPE calculated. AIC-SARIMA has lowest RMSE and MAPE up until now.
SimpleExponentialSmoothing     196.4048
NaiveModel                     245.1213
DoubleExponentialSmoothing     266.1612
RegressionOnTime               266.2765
We see that the best model, with the least RMSE, is the 2 point Trailing Moving Average,
followed by all the other moving averages and the simple average too. At 6th place we
see AIC-SARIMA(0, 1, 2)(1, 0, 2, 12).
Since the RMSE values are not too far apart from 1st to 6th place, we choose
AIC-SARIMA(0, 1, 2)(1, 0, 2, 12) for ease of computation and accurate predictability.
Additionally, ARIMA models are computationally efficient and give us accurate
predictions.
It also takes MAPE into consideration, and it is always a good idea to have more than
one accuracy measure.
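The two accuracy measures used for the ARIMA and SARIMA models can be computed as in the sketch below. The actual and forecast values are made up, and the function names are hypothetical.

```python
import numpy as np


def rmse(actual, forecast):
    """Root mean squared error, in the units of the data."""
    actual, forecast = np.asarray(actual, dtype=float), np.asarray(forecast, dtype=float)
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))


def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    actual, forecast = np.asarray(actual, dtype=float), np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs((actual - forecast) / actual)) * 100)


# Made-up values; NOT the shoe sales data.
actual = [100, 200, 400]
forecast = [110, 180, 400]

rmse_value = rmse(actual, forecast)
mape_value = mape(actual, forecast)
print(round(rmse_value, 4), round(mape_value, 4))
```

RMSE penalises large errors more heavily, while MAPE expresses the error relative to the size of the actual values, which is why the two can rank models differently.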
Industry wide, exponential smoothing and ARIMA models are the more popular choices
for model building. Exponential smoothing depends upon the assumption of
exponentially decreasing weights for past data, while ARIMA works by transforming a
time series into a stationary series, studying the nature of the stationary series
through ACF and PACF, and then accounting for auto-regressive and moving average
effects in the time series, if present.
1.9. Building of optimum model and 12 month forecast
We build the optimum model with AIC-SARIMA(0, 1, 2)(1, 0, 2, 12), as per the
explanation provided above.
1.10. Findings and Suggestions
The data set contains a total of 187 entries across 2 variables. The first
column represents the date on which the Shoe Sales were recorded, while the
second column represents the Sales itself. There are no null values in the
dataset.
There are outliers present in April and May. This tells us there were some
sales made in those months that were out of the usual.
Sales tend to pick up more in the second half of the year than in the first.
December records the highest shoe sales.
The spike may be due to the Holiday season, when shoes may be popular
purchases either for personal use or as gifts.
In the monthly trend we see that December is the most popular month for Shoe
Sales, and in the yearly trend we see that sales peaked between 1986 and
1988. This peak may be due to widespread interest and innovations introduced
to attract customers, thus boosting sales.
From the forecast we see a clear peak, showing better sales than the year
prior. Hence, the manufacturers must ensure they hold enough stock, and more
than in the preceding year.
The company can boost sales higher than forecasted if it focuses on
advertising and on launching new, unique types of shoes.
By launching new shoes the company can entice customers with products that
are one of a kind, giving the manufacturers a first mover advantage.
This will boost sales for a while, after which a decision can be taken to
discontinue shoe types that are not popular. This will help save important
resources that can be used elsewhere.
There is hope for the year on year spike to peak again, because shoes are a
necessity and the commodity will never lose its importance.