
Retail Sales Forecasting

Revathy Prabhakaran
EXECUTIVE SUMMARY

• Problem/data - Predict future sales of Rossmann stores from the available historical sales data.
• Methods and models - Time series analysis using the following models:
• 1. Holt-Winters
• 2. ARIMA
• 3. SARIMA
• 4. FB Prophet
• Results/errors of these models - Forecasting performance is measured using the root mean squared error (RMSE).
• Findings - Based on the above models, we found that SARIMA provides the best forecast for our weekly resampled dataset of the Rossmann stores.
Exploratory Data Analysis
Data Preparation & Pre-processing
Correlation Analysis

We can see a strong positive correlation between the amount of Sales and
Customers visiting the store. We can also observe a positive correlation
between a running promotion (Promo = 1) and number of customers.
Sales are highly correlated with the Customers and Open features and
moderately correlated with Promo.
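As a rough illustration of how such a correlation table can be produced, here is a minimal sketch using pandas. The data is synthetic and the column names (`Sales`, `Customers`, `Open`, `Promo`) are assumed to match the Rossmann training file; the numbers it prints are not the figures from this analysis.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data; all values are made up.
rng = np.random.default_rng(0)
n = 500
is_open = rng.binomial(1, 0.85, n)
promo = rng.binomial(1, 0.4, n)
customers = is_open * (600 + 150 * promo + rng.normal(0, 50, n))
sales = customers * 9 + is_open * rng.normal(0, 300, n)

df = pd.DataFrame({"Sales": sales, "Customers": customers,
                   "Open": is_open, "Promo": promo})

# Pairwise Pearson correlations between the numeric columns.
corr = df.corr()
print(corr["Sales"].round(2))
```

Reading the `Sales` column of `corr` gives the strength of the linear relationship between sales and each other feature.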
Sales trend over the months

* For 2013, sales spike in November and December, which indicates
a seasonality factor in the data.
* For 2014, we see the same seasonal trend in November and
December.
* Sales due to promotions peak in March and June. For 2015 we
see the same trend as in 2013 and 2014.
Sales trend over days
We resample the entire dataset to the daily level; the observations are below.

We can see from the trend that there are no promotions on the weekends, i.e., Saturday
and Sunday, which makes sense as stores want to earn maximum profit during the
time when people do their house chores.

Sales tend to increase on Sunday because people shop during the weekend. We can also
see that the maximum sales happen on Mondays, when there are promotional offers.
Conclusion of EDA

• Store Type A is the best-selling and most crowded store type.
• Store Type B has the highest sale per customer.
• Customers tend to buy more on Mondays when there are ongoing
promotional offers, and on Saturday/Sunday when there is no
promotion at all.
• The second promotion (Promo2) does not seem to contribute to an
increase in sales.
Time Series Analysis & Predictive Modelling
For data understanding, we will consider one store from each store type
(a, b, c, d) to represent its respective group. It also makes sense to
down-sample the data from days to weeks using the resample method to see
the present trends more clearly.
We can see from the above plots that sales for Store Types A and C tend to
peak at the end of the year (Christmas season) and then decline after the
holidays.
We are not able to see a similar trend in Store Type D because no data is
available for that time period (stores closed).
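The down-sampling from days to weeks described above can be sketched with pandas `resample`. This is only an illustration: the four columns stand for one hypothetical representative store per store type, the values are random, and the weekly mean is an assumed aggregation.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sales for one representative store per store type.
dates = pd.date_range("2013-01-01", "2013-12-31", freq="D")
rng = np.random.default_rng(1)
daily = pd.DataFrame(
    {store_type: rng.normal(5000, 800, len(dates)) for store_type in "abcd"},
    index=dates,
)

# Down-sample from days to weeks; each weekly value is the mean of that week's days.
weekly = daily.resample("W").mean()
print(weekly.head())
```

Averaging within each week smooths out the day-of-week pattern, which makes longer-term trends easier to see in a plot.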
Checking Stationarity of the data
In order to use time series forecasting models, we need to ensure that our time series data is
stationary, i.e., it has a constant mean, a constant variance and an autocovariance that does not depend on time.
• Rolling mean:
• Dickey-Fuller test:
Seasonal decomposition of the data

After plotting the graphs for all the store types, we can see that there is
seasonality and trend present in our data.
So, we'll use forecasting models that take both of these factors into
consideration, for example Holt-Winters, SARIMA and Prophet.
Data Preparation & Pre-processing
Treatment of Missing values & Joining Train and Store data

We have 172,871 observations in which the stores were closed or had zero sales.
We can drop these rows for data analysis, but we can keep them for predictive
modelling because our models will be able to understand the trend behind them.
There are 1115 stores in total. The Sales feature has a volatility (standard
deviation) of 3849.93 and a mean of 5773.82; the Customers feature has a
volatility of 464.41 and a mean of 633.15.
Checking days when the stores were closed & whether there was a
school holiday when the store was closed
Distribution of sales and customers across store types
Columns: Customers, Sales and SalePerCustomer, each broken down by store type (a, b, c, d).

From the table above, we can see that stores of type 'a' and 'd' have
the highest total sales, and stores of type 'a' and 'd' also have the
highest sale per customer
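A per-store-type summary like this one can be sketched with a pandas `groupby`. The rows below are made up, and computing SalePerCustomer as total sales divided by total customers is an assumption; the original analysis may have averaged a per-row ratio instead.

```python
import pandas as pd

# Tiny made-up sample; the real data has one row per store per day.
df = pd.DataFrame({
    "StoreType": ["a", "a", "b", "c", "d", "d"],
    "Sales":     [100, 200, 90, 120, 150, 250],
    "Customers": [10, 20, 6, 12, 10, 30],
})

# Total sales and customers per store type, then sales per customer.
summary = df.groupby("StoreType")[["Sales", "Customers"]].sum()
summary["SalePerCustomer"] = summary["Sales"] / summary["Customers"]
print(summary)
```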
Model Development & Validations
Forecasting a Time Series
We tried the following four modelling approaches:
Holt-Winters
ARIMA
SARIMA
FB Prophet
Evaluation Metrics
• Two popular metrics for measuring the performance of regression (continuous variable)
models are MAE and RMSE.
• MAE - Mean Absolute Error: the average of the absolute differences between the predicted values and
the observed values.
• RMSE - Root Mean Square Error: the square root of the average of the squared differences between the
predicted values and the observed values.
• MAE is easier to understand and interpret, but RMSE works well in situations where large errors are
undesirable, because squaring penalizes large errors more heavily.
• So, let's choose RMSE as the metric to measure the performance of our models.
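To make the difference between the two metrics concrete, here is a small sketch in plain Python. The function names and sample numbers are illustrative only. Both forecasts have the same MAE, but the one with a single large miss has a higher RMSE.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: square root of the average squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [100, 200, 300, 400]
pred_even = [110, 190, 310, 390]     # every forecast is off by 10
pred_outlier = [100, 200, 300, 440]  # three exact forecasts, one off by 40

print(mae(actual, pred_even), rmse(actual, pred_even))
print(mae(actual, pred_outlier), rmse(actual, pred_outlier))
```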
Forecasting a Time Series
Model 1: Holt-Winters
Autocorrelation plots
From the daily data, we see a strong correlation between the sales on a
given day and the sales seven days earlier (lag 7).
Hence it's clear that we need to down-sample the data to a weekly
frequency. In the plots based on weekly mean data, we find intricate
autocorrelation patterns. Thus, the orders for both the seasonal and
non-seasonal AR and MA terms cannot be decisively chosen, complicating
our process.
This demonstrates how real-life statistics can be much messier than
textbook examples, and we have to make do with the uncertainty of our
own choices. Fortunately, we can avoid any unscientific guessing by
employing an optimization method called Grid Search.
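The lag-7 pattern in daily data can be illustrated with a short sketch. The series is synthetic, with a made-up day-of-week effect, and `Series.autocorr` from pandas is used here for a single lag instead of a full autocorrelation plot.

```python
import numpy as np
import pandas as pd

# Synthetic daily sales with a hypothetical day-of-week effect (Monday=0 ... Sunday=6).
rng = np.random.default_rng(4)
idx = pd.date_range("2013-01-01", periods=364, freq="D")
weekday_effect = np.array([1.3, 1.1, 1.0, 1.0, 1.1, 0.9, 0.2])[idx.dayofweek.to_numpy()]
daily = pd.Series(5000 * weekday_effect + rng.normal(0, 200, len(idx)), index=idx)

# Correlation of the series with itself shifted by 7 days and by 3 days.
lag7 = daily.autocorr(lag=7)
lag3 = daily.autocorr(lag=3)
print(f"lag 7: {lag7:.2f}, lag 3: {lag3:.2f}")

# Averaging within each week removes the day-of-week cycle.
weekly = daily.resample("W").mean()
```

A lag-7 autocorrelation much larger than neighbouring lags is the signature of a weekly cycle, which is what motivates working with weekly aggregates.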
Hyperparameter tuning of the ARIMA model
• As discussed previously, we have three non-seasonal parameters (p, d and q) for the SARIMA model,
alongside their seasonal counterparts (P, D and Q) and the seasonal period. To choose the best
combination of these parameters, we'll use a grid search. The best combination of parameters
gives the lowest AIC score.
• Perform grid search
• This algorithm simply conducts an exhaustive search over all the combinations of parameters.
The best one among them is chosen according to a loss function of our choice. In our
case, we use the popular Akaike Information Criterion (AIC), as is standard in the ARMA
modelling process.
Model 2: ARIMA
Checking diagnostic plots - SARIMA
Model 4: FB Prophet
Model Evaluation
• Holt-Winters
• ARIMA
• FB Prophet
Business recommendations and next steps
Recommendation:
The Rossmann data set is better suited to weekly forecasting, so instead of forecasting on a daily
basis the business should look at weekly forecasts, which is what we have done in all our
models. Also, the business should favor oversupplying inventory rather than undersupplying, as
drug store products tend to have a long shelf life and can always be sold at a later date. In the case of
undersupply, there is the certainty of a missed sales opportunity, as drug products
tend to be urgent and necessary for customers.
Next steps:
• We should use a SARIMAX model, which is expected to do better than SARIMA. In SARIMAX, we
pass exogenous variables that have an impact on sales, such as those we discovered during EDA.
• Fit the Prophet model better. We could customize the Prophet model further to improve
its predictions, e.g., by passing school holiday and state holiday data, as Prophet is
known to make good use of holiday data in prediction. We can also use automatic
changepoint detection to improve it.
• We did the analysis on the weekly mean of the sales data. We can also use the sales data for each store type
and run different models to make better predictions for each type of store. Other hierarchical
forecasting models can also be used.
THANK YOU
