TOPIC
BLACK FRIDAY SALES PREDICTION
Submitted By
SUPRIYA R
DR. P. THANGAVEL
Professor
UNIVERSITY OF MADRAS
DEPARTMENT OF COMPUTER SCIENCE
GUINDY CAMPUS, CHENNAI - 600 025
BONAFIDE CERTIFICATE
Submitted for the University Semester Examination held during June 2021 at the
Department of Computer Science, University of Madras, Guindy, Chennai – 600
025.
Examiner
ACKNOWLEDGEMENT
I am proud and privileged to express my sincere gratitude to
Dr. P. THANGAVEL, Professor, Department of Computer Science,
University of Madras, Guindy Campus, Chennai, for his consent
and for sharing the fruit of his knowledge through constant
guidance and support.
SUPRIYA R
ABSTRACT
A sales prediction helps every business make better
business decisions. It helps in overall business planning,
budgeting, and risk management. Sales prediction allows
companies to efficiently allocate resources for future growth and
manage their cash flow.
With this particular Black Friday sale analysis, we are most
interested in figuring out how much a customer will spend based
on certain attributes such as their Age group, City Category, etc.
In this project, we tried to analyze the data using various
analysis methods and the factors that affect sales. Here we deal
with various regression models involving both independent and
dependent variables. Finally, we reduced the error, which helps us
predict more accurately. The goal is to understand customer
purchase behavior (specifically, purchase amount) against various
products of different categories, based on purchase summaries
shared for selected high-volume products.
TABLE OF CONTENTS
S.NO CONTENTS
1 INTRODUCTION
2 FEATURE ENGINEERING
6 CONCLUSION
7 BIBLIOGRAPHY
INTRODUCTION
For several decades, Black Friday has been recognized as the
largest shopping day of the year in the US. It is the Friday after
Thanksgiving, and for American consumers it kicks off the Christmas
holiday shopping season. For most retailers, it is the busiest day of
the year. Black Friday is traditionally known for long lines of customers
waiting outdoors in cold weather before opening hours. Sales are so
high on Black Friday that it has become a crucial day for stores and the
economy in general, with approximately 30% of all annual retail sales
occurring in the period from Black Friday onwards. It is unofficially a
public holiday in more than 20 states and is considered the start of the
US Christmas shopping season.
FEATURE ENGINEERING
The main aim of Feature Engineering is to analyze the data as much as
possible which might involve creating additional features out of the
existing features which have a better correlation with the Target
Variable. Thus, it helps the model perform better.
In addition to that, it was observed that some products were sold more
as compared to other products which could be a result of the product
being more popular than the others and can be assumed to have
a significant amount of bias as compared to others.
PROBLEM STATEMENT
A retail company “ABC Private Limited” wants to understand the
customer purchase behavior (specifically, purchase amount) against
various products of different categories. They have shared purchase
summaries of various customers for selected high-volume products from
last month.
The data set also contains customer demographics (age, gender,
marital status, city_type, stay_in_current_city), product details
(product_id and product category), and Total purchase_amount from last
month.
They want to build a model to predict the purchase amount of
customers against various products which will help them to create a
personalized offer for customers against different products.
DATA
The dataset consists of a train set and a test set. We start by
looking at the training and test data.
We can observe that there are missing values in the columns
'Product_Category_2' and 'Product_Category_3' of both the train
and test datasets. These two features are additional categories of a
product, and we will need to fill in those missing values.
ALGORITHM AND IMPLEMENTATION:
It can be seen that we have 550,068 rows in our data and most of the
columns are non-null except for 'Product_Category_2' and
'Product_Category_3'. We need to handle the missing data in these
columns. But before that, we will take a look at how these columns
affect the target and then handle them accordingly.
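The row count and null check described here can be reproduced with a pandas null count; the tiny frame below merely stands in for the real train set (column names follow the text, the values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the train set (the real file would come from pd.read_csv)
train = pd.DataFrame({
    'User_ID': [1000001, 1000002, 1000003],
    'Product_Category_1': [3, 1, 12],
    'Product_Category_2': [4.0, np.nan, 14.0],
    'Product_Category_3': [np.nan, np.nan, 16.0],
    'Purchase': [8370, 15200, 1422],
})

# Count missing values per column to find the features that need filling
missing = train.isnull().sum()
print(missing)
```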
Step 2: A closer look at the features
1. User_ID: A distinct ID is given to the customers to identify them
uniquely.
2. Product_ID: A distinct ID is given to products to identify
them uniquely.
3. Gender: M or F can be used as a binary variable.
4. Age: Age is given in bins with 6 categories.
5. Occupation: The type of occupation of the user; it is already
masked.
6. City_Category: The category of the city out of A, B, C.
Should be used as a categorical variable.
7. Stay_In_Current_City_Years: It has 5 values: 0, 1, 2, 3, 4+ and
may be used as a categorical variable.
8. Marital_Status: 0: Unmarried and 1: Married. It is expected that
marital status does affect the Purchase value.
9. Product_Category_1: The primary category that a product
belongs to. It can be a useful feature as a certain category of
products are sold more often than others.
10. Product_Category_2: The Secondary category of a product.
If there is no secondary category this will be Null.
11. Product_Category_3: The Tertiary Category of a product.
This will be only occupied when Category 1 and 2 are
occupied. Also, if a product does not have a tertiary category, it
will be Null.
12. Purchase: This is the target variable.
Now that we have understood our data, we can start visualizing it
and gain some more insights.
Step 3: EDA using Visualization
There are a large number of possibilities when it comes to analyzing
the data using visualization. We will first understand how different
features affect the target and then how the combinations of these features
affect the target.
AGE
We can see the distribution of the various Age groups in our data.
Customers of age 26–35 were the largest group, with around 40%
of the total customers, while people of age 0–17 were the smallest, with
only 2.75%.
We can therefore infer that people of the age group 26–35 shopped the
most, followed by 36–45, 18–25, 51–55, 55+ and then 0–17. It is easy to
speculate on this data. Since people of age 0–17 are usually dependent
on elders, their numbers as customers are the lowest. Also, since people
of the age group 26–35 are generally independent and have income
sources, they make up the largest population in our data.
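The age-group shares quoted above come from a simple normalized value count; a sketch with made-up rows:

```python
import pandas as pd

# Illustrative Age column; the real data has ~550k rows in bins like these
ages = pd.Series(['26-35', '26-35', '26-35', '36-45', '18-25', '0-17'])

# normalize=True turns raw counts into fractions of all customers
dist = ages.value_counts(normalize=True)
print(dist)
```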
Despite the disparity in the number of customers of the different age
groups, we can see below that the values of the Average Purchase
amount of different age groups (Average value) and the Purchase
amount of an average person in the age group (Median value) are nearly
the same. Also, it is crucial to note that although the age group 26–35
makes up the largest share of our data, the largest average amount spent
is by people of the age group 51–55. A general reason could be that they
no longer need to save up and can freely spend whatever amount they wish.
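The average purchase (mean) and the purchase of an average person (median) per age group fall out of one groupby aggregation; the numbers below are toy values, not the report's:

```python
import pandas as pd

df = pd.DataFrame({
    'Age': ['26-35', '26-35', '51-55', '51-55'],
    'Purchase': [9000, 9200, 9500, 9700],
})

# Mean and median purchase amount for each age group
stats = df.groupby('Age')['Purchase'].agg(['mean', 'median'])
print(stats)
```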
GENDER
It is also very crucial to understand how the individual genders shopped
in this sale. This can be a very important feature as there can be some
major differences between the shopping behavior of the different
genders. In Figure 3.2.1 we can see the distribution of Male (M) and
Female (F) customers in the data. Males account for 75% of the shopping
while Females account for just 25%. This is a peculiar observation, as
one would not expect such a great disparity between the genders; the
company should investigate why this disparity exists and what can be
done to address it.
Not only did Males shop more, they also spent more on average than
Females. The amount spent by an average male is also higher than that
of an average female, though not by much.
MARITAL STATUS
We can see about 60% of the customers were unmarried and 40% were
married. It is possible that the commodities that married people prefer to
buy did not have attractive offers and perhaps the company can work on
that in the next sale. It is also possible that couples choose not to fritter
away their income in the sale and focus more on themselves and their
family.
CITY CATEGORY
OCCUPATION
STAY IN CURRENT CITY YEARS
Here we can see that most of our customers are those people who have
been staying in the same city for the past 1 year (35.24%). And the least
are those who just moved in (13.53 %). There can be some obvious reasons
for this observation. People who have been in a city for one year are
likely to stay longer, so they freely take part in the Black Friday
Sale and may buy some things for the house, while those who just
moved in need more time to settle in. It is also possible that those who
have stayed in the city for 4+ years are either planning to move out or
are tired of the sales in the city and so choose not to shop as much.
We are done with the univariate analysis of our data. But there was a
variable ‘Marital Status’ for which we could not figure out how it affects
our target. So, we will do bivariate analysis and gain a deeper insight.
We see the comparison of Marital Status and Stay in Current City Years
(SCCY)
w.r.t. Average Purchase. For SCCY 0, the average purchase values are
nearly the same. For SCCY 1 and SCCY 2, there is just a slight
difference between married and unmarried. For SCCY 3, however,
there is a little more difference between married (9,170.6) and
unmarried (9,362.9). For SCCY 4+, there is again just a slight
difference between married and unmarried.
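The bivariate comparison above is a pivot of average Purchase over stay-years and marital status; a sketch where the values only echo the two numbers quoted in the text:

```python
import pandas as pd

df = pd.DataFrame({
    'Marital_Status': [0, 1, 0, 1],
    'Stay_In_Current_City_Years': ['3', '3', '1', '1'],
    'Purchase': [9362, 9170, 9300, 9280],
})

# Average purchase for every (stay-years, marital-status) combination
pivot = df.pivot_table(values='Purchase',
                       index='Stay_In_Current_City_Years',
                       columns='Marital_Status', aggfunc='mean')
print(pivot)
```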
PRODUCT CATEGORIES
There are three columns for Product Categories and two of them contain
Null Values. We will have to deal with the null values. But before that
let us understand what these categories tell
us. A product can belong to a single category (Primary Category), to
two categories (Primary + Secondary Category), or to a maximum of
three categories (Primary + Secondary + Tertiary Category). Does
belonging to more than one category affect the purchase amount of a
product? Let's find out.
Now let’s say that there is a product that has a primary category 10 and
has a second category as well.
The data shows us the different secondary product categories and the
average purchase amount when the primary category is 10. So, if a
product has category 10, it either does not belong to any other category
(Null) or belongs to one of the following categories: 14, 16, 13, 15,
or 11. We can see that if the product does not belong to any other
category, it has the maximum average purchase value: 20,295. And if the
product belongs to both category 10 and category 11, its average purchase
amount decreases significantly to 19,206, which cannot be ignored. So,
we can say that Product_Category_2, when combined with
Product_Category_1, affects the Purchase value.
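That breakdown amounts to filtering on Product_Category_1 == 10 and grouping by Product_Category_2, keeping NaN as its own "no secondary category" group; illustrative rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Product_Category_1': [10, 10, 10, 5],
    'Product_Category_2': [np.nan, 11.0, 11.0, 14.0],
    'Purchase': [20295, 19000, 19412, 7000],
})

# Keep only primary category 10, then average purchase per secondary
# category; dropna=False keeps the "no secondary category" (NaN) group
cat10 = df[df['Product_Category_1'] == 10]
avg = cat10.groupby('Product_Category_2', dropna=False)['Purchase'].mean()
print(avg)
```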
Now let us assume we have a product with a category 10 and a category
13.
Does having a third category affect the product’s purchase amount?
The data depicts that, in our given data, if a product has categories 10
and 13, it may either have no third category (Null) or belong to category
16. We can see a huge disparity between the avg. purchase amounts of
products without a third category and those with a third category (16).
For every row that has a non-zero cell in either Product category 2 or
Product Category 3 (suppose row 2, highlighted), we take that non-zero
value ‘i’ and replace the ‘ith’ column in that row with 1. We do this for
all our data.
This is what the Final data will look like once we are done with all the
data points.
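The replacement scheme described above (take each non-null category value i and set indicator column i to 1) can be sketched as below; the `Cat_` column names and the category range 1–18 are my assumptions, not from the text:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Product_Category_2': [14.0, np.nan, 11.0],
    'Product_Category_3': [16.0, np.nan, np.nan],
})

# One indicator column per possible category value, all zero to start
for cat in range(1, 19):
    df[f'Cat_{cat}'] = 0

# For each row, flip the indicator of every non-null category value i
for col in ['Product_Category_2', 'Product_Category_3']:
    for idx, val in df[col].dropna().items():
        df.loc[idx, f'Cat_{int(val)}'] = 1

print(df[['Cat_11', 'Cat_14', 'Cat_16']])
```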
Now we will code the above steps for the actual data set. Let’s have a
look at the columns in our data now.
data_df_onehot = pd.get_dummies(df_onehot,
    columns=['Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years'],
    prefix=['Age', 'Occupation', 'City', 'Stay'])
We also want to use the Product_ID column, but it cannot be used as-is
since it is of type 'P00…'. So, we will first remove 'P00' from the
column and then use it.
data_df_onehot['Product_ID'] = data_df_onehot['Product_ID'].str.replace('P00', '')
For effective model building, we can standardize the dataset
using Feature Scaling. This can be done with StandardScaler() from
sklearn’s preprocessing library. Now we separate the target variable
from our dataset and then the dataset is split into training data and
testing data in the ratio 80:20 using the train_test_split() command.
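Those two steps, standardizing and the 80:20 split, look roughly like this on stand-in arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in feature matrix and target for the encoded dataset
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

# Standardize every feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# 80:20 split into training and testing data
train_data, test_data, train_labels, test_labels = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42)
print(train_data.shape, test_data.shape)
```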
First, we import xgboost and then convert our data into the DMatrix
format used by XGBoost. The algorithm can also be used without
converting to a DMatrix; however, I will do it anyway.

import xgboost as xgb
dtrain = xgb.DMatrix(train_data, label=train_labels)
dtest = xgb.DMatrix(test_data, label=test_labels)
Since there are a lot of data points, it takes some time to train the model.
Our model went through the complete data 819 times and found the best
score at the 809th round. The Test-RMSE is 2510 without hyperparameter
tuning, which is pretty good.
I have only done a rough parameter tuning, as it was taking a lot of time
to tune the model using cross-validation. I feel a better tuning of the
model is possible than the one below, but the purpose is to show how I
did the tuning.
We get the best score with a max_depth of 9 and min_child_weight of 7,
so let's update our params.
Subsample and Column Sample by Tree
These parameters control how the dataset is sampled at each boosting
round.
Instead of using the whole training set every time, we can build a tree
on slightly different data at each step, which makes it less likely to
overfit a single sample or feature.
Let’s see if we can get better results by tuning those parameters together.
Step 7: Evaluation of the Model
Here is what our final dictionary of parameters looks like:
params = {'colsample_bytree': 0.7,
'eta': 0.2,
'eval_metric': 'rmse',
'max_depth': 9,
'min_child_weight': 7,
'objective': 'reg:squarederror',
'subsample': 1}
Let’s train a model with it and see how well it does on our test set!
Well, isn't that an improvement? Not only did the number of iterations
go down from 819 to 607, but the Test-RMSE also dropped from
2510.89 to 2497.255. Now we can use this model to predict on the test
data and then submit the result for checking.
This places us at position 355, which is in the top 14% of all
participants.
DECISION TREE
◈ Decision trees learn how to best split the dataset into smaller and
smaller subsets to predict the target value.
◈ The splitting criterion used is "mse", the mean squared error.
RANDOM FOREST
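The text gives no code for this section; a minimal random-forest regressor on synthetic data, with settings that are my assumptions rather than the project's, would look like:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(60, 3)
y = X[:, 0] * 10  # simple synthetic target

# Many decision trees trained on bootstrap samples; predictions are averaged
forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(X, y)
preds = forest.predict(X[:5])
print(preds.shape)
```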
MULTIVARIATE ANALYSIS
XGBoost Learning Model
MULTIVARIATE ANALYSIS
This technique is best suited for use when we have multiple categorical
independent variables and two or more metric dependent variables.
CORRELATION
Finally, let's take a look at the relationships between numeric
features and other numeric features.
Correlation is a value between -1 and 1 that represents how
closely values for two separate features move in unison.
A positive correlation means that as one feature increases, the
other increases, e.g., a child's age and height.
A negative correlation means that as one feature increases, the
other decreases, e.g., hours spent studying and the number of parties
attended.
Correlations near -1 or 1 indicate a strong relationship.
Those closer to 0 indicate a weak relationship.
0 indicates no relationship.
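The two textbook examples above translate directly into a small correlation matrix; the numbers are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'age': [5, 8, 11, 14],
    'height': [110, 128, 146, 160],   # rises with age: positive correlation
    'hours_studied': [1, 3, 5, 8],
    'parties': [9, 6, 4, 1],          # falls as study hours rise: negative
})

# Pairwise correlations between all numeric columns
corr = df.corr()
print(corr)
```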
WHAT DO WE NEED TO DO?
The color bar on the right explains the meaning of the heatmap -
Dark colors indicate strong negative correlations and light colors
indicate strong positive correlations.
Perhaps the most helpful way to interpret this correlation heatmap
is to first find features that are correlated with our target variable
by scanning the last column.
In this case, it doesn't look like many features are strongly
correlated with the target variable.
There seems to be a negative correlation between the columns
'Purchase' and 'Product_Category_1'.
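Scanning the target column of df.corr() (the heatmap itself would be drawn with seaborn's sns.heatmap) can be sketched on synthetic data; the rule generating Purchase below is an assumption made only to mimic the negative relationship the text observes:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({
    'Product_Category_1': rng.randint(1, 19, 100),
    'Occupation': rng.randint(0, 21, 100),
})
# Synthetic rule: higher primary categories are cheaper, echoing the text
df['Purchase'] = (20000 - df['Product_Category_1'] * 800
                  + rng.randint(0, 500, 100))

# Read correlations with the target straight off the last column
target_corr = df.corr()['Purchase'].sort_values()
print(target_corr)
```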
We performed data analysis by visualizing the data, in particular by
visualizing the statistical relationships between the different variables.
MODEL PREPARATION AND PREDICTIONS
I used several regression techniques, but XGBRegressor produced
the best results. It gave me an RMSE score of 2896 initially.
After applying hyperparameter optimization with several values, it
gives an RMSE score of around 2739. But there are still other ways
of improving this, such as extracting more interesting features, trying
other ML models, and trying deep learning models.
I hope this information was helpful. I know there might be a lot of
things I have missed, so any suggestions or improvements are
appreciated.
CONCLUSION
BIBLIOGRAPHY
[1] J. McDermott, "Black Friday spending statistics 2018," Apr. 2019. [Online]. Available: https://www.finder.com/black-friday-statistics
[2] E. Swilley and R. E. Goldsmith, "Black Friday and Cyber Monday: Understanding consumer intentions on two major shopping days," Journal of Retailing and Consumer Services, vol. 20, no. 1, pp. 43-50, 2013.
[3] J. Boyd Thomas and C. Peters, "An exploratory investigation of Black Friday consumption rituals," International Journal of Retail & Distribution Management, vol. 39, no. 7, pp. 522-537, 2011.
[4] L. Simpson, L. Taylor, K. O'Rourke, and K. Shaw, "An analysis of consumer behavior on Black Friday," American International Journal of Contemporary Research, 2011.
[5] G. C. Bell, M. R. Weathers, S. O. Hastings, and E. B. Peterson, "Investigating the celebration of Black Friday as a communication ritual," Journal of Creative Communications, vol. 9, no. 3, pp. 235-251, 2014.
[6] H. J. Kwon and T. M. Brinthaupt, "The motives, characteristics, and experiences of US Black Friday shoppers," Journal of Global Fashion Marketing, vol. 6, no. 4, pp. 292-302, 2015.
[7] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 785-794, 2016.
[8] A. Guzman, Derivatives and Integrals of Multivariable Functions. Springer Science & Business Media, 2012.
[9] M. Dagdoug, "Black Friday: A study of sales through consumer behaviors," Jul. 2018. [Online]. Available: https://www.kaggle.com/mehdidag/black-friday