
BIG MART SALES PREDICTION
Team 6
Kiran S (2113050)
Viswa M (2113112)
Danel Hilton W (2111011)
Pramod Raja (2111074)
Prasana (2111075)
Problem Statement
• The primary aim of any business is to maximize the sale of its products.
• But businesses find it difficult to predict product sales.
• Product properties have to be considered for sale prediction.
• Big Mart has collected data regarding:
  • Sale of 1559 products
  • Attributes of each product
• Aim of the project:
  • To categorize products according to their properties.
  • To predict the sale value of products.
Data Collection
• Data Source:
https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/
• Two sets of data are available:
  • Training set (independent variables with sale results)
  • Test set (only independent variables)
• The training data contains 6113 records
• To build a model using the training set
• To predict the sale value for the test set
Data Collection
• Item_Identifier, Outlet_Identifier
  • Nominal variables
  • Cannot be used for analysis

• Item_Weight, Item_Visibility, Item_MRP, Item_Outlet_Sales
  • Quantitative variables
  • Can be used directly for analysis

• Item_Type, Item_Fat_Content, Outlet_Size, Outlet_Location_Type, Outlet_Type
  • Categorical variables
  • Converted to dummy variables for analysis

• Outlet_Establishment_Year
  • Nominal variable
  • Can be converted into a quantitative variable for analysis
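
The dummy-variable step for the categorical columns could look roughly like the sketch below. It assumes pandas; the three sample rows and their values are hypothetical and only illustrate the encoding. Dropping the first level of each variable is also an assumption, chosen so that the dropped level acts as the baseline in a regression.

```python
import pandas as pd

# Hypothetical three-row sample; the values are for illustration only.
sample = pd.DataFrame({
    "Outlet_Size": ["Medium", "Small", "High"],
    "Outlet_Type": ["Supermarket Type1", "Grocery Store", "Supermarket Type3"],
    "Item_MRP": [107.86, 45.20, 230.10],
})

# drop_first=True keeps k-1 dummies per variable, so the dropped level
# (alphabetically first) becomes the baseline absorbed by the intercept.
encoded = pd.get_dummies(
    sample,
    columns=["Outlet_Size", "Outlet_Type"],
    drop_first=True,
    dtype=int,
)
print(encoded.columns.tolist())
```

With this choice, "High" and "Grocery Store" become the reference levels, and each remaining dummy coefficient is read as a difference from that reference.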
Preprocessing the data
• Null values are present in the Item_Weight column.
• Removing those rows would lose 1400 data points.
• So the null values were filled by interpolation.

• Check for outliers in the Sales variable

• 135 outliers found.
• Outliers removed.
• Original shape: (6113, 13)
• New shape: (5978, 13)
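
A minimal sketch of these two preprocessing steps, assuming pandas. The slides do not say which interpolation method or outlier rule was used, so linear interpolation and the 1.5 × IQR rule are assumptions here, and the `preprocess` helper and the demo rows are hypothetical.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Fill missing weights by linear interpolation between neighbouring rows;
    # limit_direction="both" also fills gaps at the start and end.
    out["Item_Weight"] = out["Item_Weight"].interpolate(
        method="linear", limit_direction="both"
    )
    # Assumed outlier rule: keep sales within 1.5 * IQR of the quartiles.
    q1, q3 = out["Item_Outlet_Sales"].quantile([0.25, 0.75])
    iqr = q3 - q1
    keep = out["Item_Outlet_Sales"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return out[keep].reset_index(drop=True)

# Hypothetical demo rows: two missing weights and one extreme sale value.
demo = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 8.9, np.nan, 12.0],
    "Item_Outlet_Sales": [2100.0, 1800.0, 2500.0, 1950.0, 2300.0, 13000.0],
})
clean = preprocess(demo)
print(demo.shape, "->", clean.shape)
```

Interpolating instead of dropping rows keeps every record with a known sale value, which is the reason the slide gives for not deleting the null rows.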
Transforming the data
• Outlet establishment year is available.
• To use the data:
  • Find the difference between the current year and the year of establishment
  • Create a new column called “Years”
• Data is now pre-processed and transformed.
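
The "Years" column can be derived in one line, as sketched below with pandas. The reference year is an assumption because the slides do not state which "current year" was used; the sample rows are hypothetical.

```python
import pandas as pd

REFERENCE_YEAR = 2023  # assumption: the "current year" is not stated in the slides

# Hypothetical establishment years, for illustration only.
outlets = pd.DataFrame({"Outlet_Establishment_Year": [1985, 1999, 2009]})
outlets["Years"] = REFERENCE_YEAR - outlets["Outlet_Establishment_Year"]
print(outlets)
```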
Descriptive Analysis

• Mean Sales Rs. 2198


• Cumulative proportion of Medium &
Small outlet data is large (84.5%).
• Cumulative proportion of Tier 1 &
Tier 3 location data is large (84.64%).
• Proportion of Type 1 Supermarket
data is large (61.61%).
• Mean Sale value of Tier 2 & 3 is higher than Tier 1
• Supermarket Type 3 performs well in sales.
Data visualization
• Grocery store sales are nowhere near supermarket sales.
• Sales value of Medium and High size outlets is better than that of Small size outlets.
Clustering Analysis
• Clustering based on item weight and item MRP.
• Standardization done.
• 4 clusters to be chosen.
Clustering Analysis
4 Clusters
• Cluster 0: High weight, High Price
• Cluster 1: Low weight, Low Price
• Cluster 2: High weight, Low Price
• Cluster 3: Low weight, High Price
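
The clustering step (standardize item weight and item MRP, then form 4 clusters) can be sketched as follows. The slides do not name the algorithm, so k-means from scikit-learn is an assumption, and the data are synthetic points around four hypothetical centres.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic (Item_Weight, Item_MRP) pairs around four hypothetical centres.
centres = np.array([[18.0, 200.0], [7.0, 60.0], [18.0, 60.0], [7.0, 200.0]])
X = np.vstack([c + rng.normal(scale=[1.0, 10.0], size=(50, 2)) for c in centres])

# Standardize so that weight and MRP contribute on the same scale.
X_std = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_std)
labels = kmeans.labels_
print(np.bincount(labels))
```

Cluster numbers returned by k-means are arbitrary, so names such as "High weight, High Price" have to be assigned by inspecting each cluster's centre after fitting.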
Creating Models for each Cluster
Cluster 0: High weight, High Price products

• VIF < 10 for all IVs: no multicollinearity

• Supermarket Type 2 & Grocery are similar
• Supermarket Type 3 provides higher sale value compared to Type 1
• Outlet location is not important
• Outlet size is not important
• Item type is not important
• Years has a negative coefficient (customers prefer new stores)
• Accuracy of prediction: 70.84%

$$\text{Sale Value} = 2146.1966 - 51.63\,(\text{Years}) + 1843.74\,(\text{Supermarket Type 1}) + 2738.33\,(\text{Supermarket Type 3})$$
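
The VIF check reported for this model can be illustrated with a small sketch. It uses only NumPy and the textbook definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing variable j on the other independent variables. The `vif` helper and the synthetic columns are hypothetical and are not the deck's actual computation.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Design matrix: intercept plus every column except column j.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)  # nearly a copy of a

vif_independent = vif(np.column_stack([a, b]))
vif_collinear = vif(np.column_stack([a, b, c]))
print(vif_independent)
print(vif_collinear)
```

Independent columns give values close to 1, while the near-duplicate pair pushes its VIFs far above the threshold of 10 used on the slide.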
Cluster 1: Low weight, Low Price products

• VIF < 10 for all IVs: no multicollinearity

• Outlet type is very important
• Supermarket Type 3 provides higher sales
• Tier 1 & Tier 2 have similar sale value
• Outlet size is important
• Small outlet provides higher sale value
• Item type is not important
• Years has a negative coefficient (customers prefer new stores)
• Accuracy of prediction: 63.52%

$$\text{Sale Value} = 666.2371 - 26.71\,(\text{Years}) + 447.34\,(\text{Medium}) + 560.29\,(\text{Small}) + 780.57\,(\text{Tier 3}) + 902.94\,(\text{Supermarket Type 1}) + 1438.07\,(\text{Supermarket Type 3})$$
Cluster 2: High weight, Low Price products

• VIF < 10 for all IVs

• Outlet type is important
• Supermarket Type 3 provides higher sale value
• Tier 1 & Tier 2 have similar sale value
• Outlet size is important
• Small outlet provides higher sale value
• If soft drinks are sold in this category, it will have a negative effect on sales.
• Breads under this category provide higher sale value.
• All other item types have similar sale value
• Accuracy of prediction: 61%

$$\text{Sale Value} = -778.47 + 369.10\,(\text{Bread}) - 452.22\,(\text{Soft drinks}) + 611.58\,(\text{Medium}) + 638.82\,(\text{Small}) + 469.1\,(\text{Tier 3}) + 1373.44\,(\text{Supermarket Type 1}) + 1786.51\,(\text{Supermarket Type 3})$$


Cluster 3: Low weight, High Price products

• VIF < 10 for all IVs

• Supermarket Type 2 & Grocery are similar
• Supermarket Type 1 provides higher sale value
• Tier 1 & Tier 2 have similar sale value
• Outlet size is not important
• Soft drinks under this category have a negative effect on sales.
• Accuracy of prediction: 68.61%

$$\text{Sale Value} = 540.72 - 490.78\,(\text{Soft drinks}) + 1439.93\,(\text{Tier 3}) + 2135.39\,(\text{Supermarket Type 1}) + 2055.7\,(\text{Supermarket Type 3})$$
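
The four cluster equations can be applied mechanically once a product's cluster and dummy values are known. The sketch below copies the coefficients from the equations on the slides; the dictionary layout, the `predict_sale_value` helper, and the example product are hypothetical. A dummy that is not supplied is treated as 0, that is, as the baseline level.

```python
# Coefficients copied from the four cluster equations on the slides.
CLUSTER_MODELS = {
    0: {"intercept": 2146.1966, "Years": -51.63,
        "Supermarket Type 1": 1843.74, "Supermarket Type 3": 2738.33},
    1: {"intercept": 666.2371, "Years": -26.71, "Medium": 447.34,
        "Small": 560.29, "Tier 3": 780.57,
        "Supermarket Type 1": 902.94, "Supermarket Type 3": 1438.07},
    2: {"intercept": -778.47, "Bread": 369.10, "Soft drinks": -452.22,
        "Medium": 611.58, "Small": 638.82, "Tier 3": 469.1,
        "Supermarket Type 1": 1373.44, "Supermarket Type 3": 1786.51},
    3: {"intercept": 540.72, "Soft drinks": -490.78, "Tier 3": 1439.93,
        "Supermarket Type 1": 2135.39, "Supermarket Type 3": 2055.7},
}

def predict_sale_value(cluster: int, features: dict) -> float:
    """Evaluate the linear equation of the given cluster."""
    model = CLUSTER_MODELS[cluster]
    return model["intercept"] + sum(
        coef * features.get(name, 0.0)
        for name, coef in model.items()
        if name != "intercept"
    )

# Hypothetical product: Cluster 0, sold in a 10-year-old Supermarket Type 3 outlet.
value = predict_sale_value(0, {"Years": 10, "Supermarket Type 3": 1})
print(round(value, 2))  # 2146.1966 - 516.30 + 2738.33 = 4368.23
```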
Next step: Prediction
• When product attributes are given, the product's cluster must be identified before its sale value can be predicted, because each cluster has its own regression model.
• So, classification is done before regression.
• Use the same train set to predict cluster type (without including sale value).
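
A rough sketch of this two-stage idea, assuming scikit-learn: a classifier first assigns the cluster from the attributes alone, then that cluster's regression predicts the sale value. The data are synthetic, and the features, cluster rule, `predict` helper, and model choices are hypothetical stand-ins rather than the deck's actual pipeline.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 2))  # stand-ins for standardized weight and MRP
# Hypothetical cluster labels 0..3 defined by the sign of each feature.
cluster = (X[:, 0] > 0).astype(int) * 2 + (X[:, 1] > 0).astype(int)
sales = 1000 + 300 * cluster + 50 * X[:, 1] + rng.normal(scale=10, size=n)

# Stage 1: learn cluster membership from the attributes only (no sale value).
clf = GradientBoostingClassifier(random_state=0).fit(X, cluster)

# Stage 2: one regression model per cluster.
regs = {
    k: LinearRegression().fit(X[cluster == k], sales[cluster == k])
    for k in range(4)
}

def predict(x_new: np.ndarray) -> tuple[int, float]:
    row = x_new.reshape(1, -1)
    k = int(clf.predict(row)[0])
    return k, float(regs[k].predict(row)[0])

k, value = predict(np.array([1.2, -0.8]))
print(k, round(value, 1))
```

Keeping the sale value out of stage 1 matters because it is unknown for new products; the classifier may only use attributes that are available at prediction time.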
Classification models under consideration

Classification models in consideration:
• Classification and Regression Trees (CART)
• Random Forest
• k-Nearest Neighbour Algorithm (KNN)
• Adaptive Boosting
• Gradient Boosting
• Naïve Bayes classifier
• Support Vector Machine (SVM)

Model validation
• Training set data has been split into 70% train set & 30% test set.
Comparing accuracy of models

S.No  Name of the model                            Accuracy
1     Classification and Regression Trees (CART)   99.83%
2     Random Forest                                99.88%
3     k-Nearest Neighbour Algorithm (KNN)          96.15%
4     Adaptive Boosting                            99.78%
5     Gradient Boosting                            99.9%
6     Naïve Bayes classifier                       88.30%
7     Support Vector Machine (SVM)                 98.39%
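
A comparison of this kind could be run along the following lines, assuming scikit-learn and a 70/30 split. The data are synthetic, all models use default settings, and Gaussian Naive Bayes is an assumed variant, so the printed accuracies will not match the table.

```python
import numpy as np
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 2))  # synthetic stand-ins for the item attributes
y = (X[:, 0] > 0).astype(int) * 2 + (X[:, 1] > 0).astype(int)  # hypothetical clusters

# 70% train set, 30% test set, keeping class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

models = {
    "CART": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Adaptive Boosting": AdaBoostClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
}
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {acc:.2%}")
```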
Testing accuracy of prediction
• The best model for classification is Gradient Boosting.
• Item clusters have been predicted using the Gradient Boosting classifier.
• Accuracy of predicting the test set: 99.98%
Predicting the test set

INPUT:
Item_Identifier: FDW58
Item_Weight: 20.75
Item_Fat_Content: Low Fat
Item_Visibility: 0.007564836
Item_Type: Snack Foods
Item_MRP: 107.8622
Outlet_Identifier: OUT049
Years: 24
Outlet_Size: Medium
Outlet_Location_Type: Tier 1
Outlet_Type: Supermarket Type 1

OUTPUT 1:
Cluster 2 (High Weight, Low Price product)

OUTPUT 2:
Predicted Sale Value: Rs. 1437.47
Actionable Insights from Data analysis

Recommendations for maximizing sale value

High weight, High Price Products
• Sell at Supermarket Type 3
• Sell in all locations
• Sell in all outlet sizes
• Try to sell in new outlets

Low weight, Low Price Products
• Sell at Supermarket Type 3
• Sell in Tier 3 location
• Sell in Small size outlets
• Try to sell in new outlets

High weight, Low Price Products
• Sell at Supermarket Type 3
• Sell in Tier 3 location
• Sell in Small size outlets
• Bread is the highest selling product.
• Don’t sell soft drinks in this category.

Low weight, High Price Products
• Sell at Supermarket Type 1
• Sell in Tier 3 location
• Sell in all outlet sizes
• Don’t sell soft drinks in this category.
THANK YOU
