TOPIC
BLACK FRIDAY SALES PREDICTION
Submitted By
SUPRIYA R
DR. P. THANGAVEL
Professor
UNIVERSITY OF MADRAS
DEPARTMENT OF COMPUTER SCIENCE
GUINDY CAMPUS, CHENNAI - 600 025
BONAFIDE CERTIFICATE
Submitted for the University Semester Examination held during June 2021 at the
Department of Computer Science, University of Madras, Guindy, Chennai – 600
025.
Examiner
ACKNOWLEDGEMENT
I am proud and privileged to express my sincere gratitude to
Dr. P. THANGAVEL, Professor, Department of Computer Science,
University of Madras, Guindy Campus, Chennai, for his consent
and for sharing the fruit of his knowledge through constant
guidance and support.
SUPRIYA R
ABSTRACT
A sales prediction helps every business make better
business decisions. It helps in overall business planning,
budgeting, and risk management. Sales prediction allows
companies to efficiently allocate resources for future growth and
manage their cash flow.
With this particular Black Friday sale analysis, we are most
interested in figuring out how much a customer will spend based
on certain attributes such as their Age group, City Category, etc.
In this project, we tried to analyze the data using various
analysis methods and the factors that affect sales. Here we deal
with various regression models involving both independent and
dependent variables. Finally, we reduced the error, which helps us
predict more accurately. The goal is to understand customer
purchase behavior (specifically, purchase amount) against various
products of different categories, based on purchase summaries
shared for selected high-volume products.
TABLE OF CONTENTS
S.NO CONTENTS
1 INTRODUCTION
2 FEATURE ENGINEERING
6 CONCLUSION
7 BIBLIOGRAPHY
INTRODUCTION
For several decades, Black Friday has been recognized as the
largest shopping day of the year in the US. It is the Friday after
Thanksgiving, and for American consumers it kicks off the Christmas
holiday shopping season. For most retailers, it is the busiest day of
the year. Black Friday is traditionally known for long lines of customers
waiting outdoors in cold weather before opening hours. Sales are so
high on Black Friday that it has become a crucial day for stores and the
economy in general, with approximately 30% of all annual retail sales
occurring in the period from Black Friday onwards. It is unofficially a
public holiday in more than 20 states and is considered the start of the
US Christmas shopping season.
FEATURE ENGINEERING
The main aim of Feature Engineering is to analyze the data as much as
possible which might involve creating additional features out of the
existing features which have a better correlation with the Target
Variable. Thus, it helps the model perform better.
In addition to that, it was observed that some products were sold more
as compared to other products which could be a result of the product
being more popular than the others and can be assumed to have
a significant amount of bias as compared to others.
PROBLEM STATEMENT
A retail company “ABC Private Limited” wants to understand the
customer purchase behavior (specifically, purchase amount) against
various products of different categories. They have shared purchase
summaries of various customers for selected high-volume products from
last month.
The data set also contains customer demographics (age, gender,
marital status, city_type, stay_in_current_city), product details
(product_id and product category), and Total purchase_amount from last
month.
They want to build a model to predict the purchase amount of
customers against various products which will help them to create a
personalized offer for customers against different products.
DATA
The dataset consists of a train set and a test set. We start by
looking at the training and test data.
We can observe that there are missing values in the columns
'Product_Category_2' and 'Product_Category_3' of both the train
and test datasets. These two features are additional categories of a
product, and we will need to fill in those missing values.
ALGORITHM AND IMPLEMENTATION:
It can be seen that we have 550,068 rows in our data and most of the
columns are non-null except for 'Product_Category_2' and
'Product_Category_3'. We need to handle the missing data in these
columns. But before that, we will take a look at how these columns
affect the target and then handle them accordingly.
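The row count and null check described here can be reproduced with a pandas null count; the tiny frame below merely stands in for the real train set (column names follow the text, the values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the train set (the real file would come from pd.read_csv)
train = pd.DataFrame({
    'User_ID': [1000001, 1000002, 1000003],
    'Product_Category_1': [3, 1, 12],
    'Product_Category_2': [4.0, np.nan, 14.0],
    'Product_Category_3': [np.nan, np.nan, 16.0],
    'Purchase': [8370, 15200, 1422],
})

# Count missing values per column to find the features that need filling
missing = train.isnull().sum()
print(missing)
```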
Step 2: A closer look at the features
1. User_ID: A distinct ID is given to the customers to identify them
uniquely.
2. Product_ID: A distinct ID is given to products to identify
them uniquely.
3. Gender: M or F can be used as a binary variable.
4. Age: Age is given in bins with 6 categories.
5. Occupation: The type of occupation of the user; it is already
masked.
6. City_Category: The category of the city out of A, B, C.
Should be used as a categorical variable.
7. Stay_In_Current_City_Years: It has 5 values: 0, 1, 2, 3, 4+ and
may be used as a categorical variable.
8. Marital_Status: 0: Unmarried and 1: Married. It is expected that
marital status does affect the Purchase value.
9. Product_Category_1: The primary category that a product
belongs to. It can be a useful feature as a certain category of
products are sold more often than others.
10. Product_Category_2: The Secondary category of a product.
If there is no secondary category this will be Null.
11. Product_Category_3: The Tertiary Category of a product.
This will be only occupied when Category 1 and 2 are
occupied. Also, if a product does not have a tertiary category, it
will be Null.
12. Purchase: This is the target variable.
Now that we have understood our data, we can start visualizing it
and gain some more insights.
Step 3: EDA using Visualization
There are a large number of possibilities when it comes to analyzing
the data using visualization. We will first understand how different
features affect the target and then how the combinations of these features
affect the target.
AGE
We can see the distribution of the various Age groups in our data.
Customers of age 26–35 were the largest group, with around 40%
of the total customers, while people of age 0–17 were the smallest, with
only 2.75%.
We can therefore infer that people of the age group 26–35 shopped the
most, followed by 36–45, 18–25, 51–55, 55+ and then 0–17. It is easy to
speculate on this data. Since people of age 0–17 are usually dependent
on elders, their numbers as customers are the lowest. Also, since people
of the age group 26–35 are generally independent and have income
sources, they make up the largest population in our data.
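The age-group shares quoted above come from a simple normalized value count; a sketch with made-up rows:

```python
import pandas as pd

# Illustrative Age column; the real data has ~550k rows in bins like these
ages = pd.Series(['26-35', '26-35', '26-35', '36-45', '18-25', '0-17'])

# normalize=True turns raw counts into fractions of all customers
dist = ages.value_counts(normalize=True)
print(dist)
```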
Despite the disparity in the number of customers of the different age
groups, we can see below that the values of the Average Purchase
amount of different age groups (Average value) and the Purchase
amount of an average person in the age group (Median value) are nearly
the same. Also, it is crucial to note that although the age group 26–35
makes up the largest share of our data, the largest average amount spent
is by people of the age group 51–55. A general reason could be that they
no longer need to save up and can freely spend whatever amount they wish.
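The average purchase (mean) and the purchase of an average person (median) per age group fall out of one groupby aggregation; the numbers below are toy values, not the report's:

```python
import pandas as pd

df = pd.DataFrame({
    'Age': ['26-35', '26-35', '51-55', '51-55'],
    'Purchase': [9000, 9200, 9500, 9700],
})

# Mean and median purchase amount for each age group
stats = df.groupby('Age')['Purchase'].agg(['mean', 'median'])
print(stats)
```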
GENDER
It is also very crucial to understand how the individual genders shopped
in this sale. This can be a very important feature as there can be some
major differences between the shopping behavior of the different
genders. In Figure 3.2.1 we can see the distribution of Male (M) and
Female (F) customers in the data. Males account for 75% of the shopping
while Females account for just 25%. This is a peculiar observation, as
one would not expect such a great disparity between the genders; the
company should investigate why this disparity exists and what can be
done to address it.
Not only did Males shop more, they also spent more on average than
Females. The amount spent by an average male is also higher than that
of an average female, though not by much.
MARITAL STATUS
We can see about 60% of the customers were unmarried and 40% were
married. It is possible that the commodities that married people prefer to
buy did not have attractive offers and perhaps the company can work on
that in the next sale. It is also possible that couples choose not to fritter
away their income in the sale and focus more on themselves and their
family.
CITY CATEGORY
OCCUPATION
STAY IN CURRENT CITY YEARS
Here we can see that most of our customers are those people who have
been staying in the same city for the past 1 year (35.24%). And the least
are those who just moved in (13.53 %). There can be some obvious reasons
for this observation. People who have been in a city for one year are
likely to stay longer, so they freely take part in the Black Friday
Sale and may buy some things for the house, while those who just
moved in need more time to settle in. It is also possible that those who
have stayed in the city for 4+ years are either planning to move out or
are tired of the sales in the city and so choose not to shop as much.
We are done with the univariate analysis of our data. But there was a
variable ‘Marital Status’ for which we could not figure out how it affects
our target. So, we will do bivariate analysis and gain a deeper insight.
We see the comparison of Marital Status and Stay in Current City Years
(SCCY)
w.r.t. Average Purchase. For SCCY 0, the average purchase values are
nearly the same. For SCCY 1 and SCCY 2, there is just a slight
difference between married and unmarried. For SCCY 3, however,
there is a little more difference between married (9,170.6) and
unmarried (9,362.9). For SCCY 4+, there is again just a slight
difference between married and unmarried.
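The bivariate comparison above is a pivot of average Purchase over stay-years and marital status; a sketch where the values only echo the two numbers quoted in the text:

```python
import pandas as pd

df = pd.DataFrame({
    'Marital_Status': [0, 1, 0, 1],
    'Stay_In_Current_City_Years': ['3', '3', '1', '1'],
    'Purchase': [9362, 9170, 9300, 9280],
})

# Average purchase for every (stay-years, marital-status) combination
pivot = df.pivot_table(values='Purchase',
                       index='Stay_In_Current_City_Years',
                       columns='Marital_Status', aggfunc='mean')
print(pivot)
```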
PRODUCT CATEGORIES
There are three columns for Product Categories and two of them contain
Null Values. We will have to deal with the null values. But before that
let us understand what these categories tell
us. A product can belong to a single category (Primary Category), to
two categories (Primary + Secondary Category), or to a maximum of
three categories (Primary + Secondary + Tertiary Category). Does
belonging to more than one category affect the purchase amount of a
product? Let's find out.
Now let’s say that there is a product that has a primary category 10 and
has a second category as well.
The data shows us the different secondary product categories and the
average purchase amount when the primary category is 10. So, if a
product has category 10, it either does not belong to any other category
(Null) or belongs to one of the following categories: 14, 16, 13, 15,
or 11. We can see that if the product does not belong to any other
category, it has the maximum average purchase value: 20,295. And if the
product belongs to both category 10 and category 11, its average purchase
amount decreases significantly to 19,206, which cannot be ignored. So,
we can say that Product_Category_2, when combined with
Product_Category_1, affects the Purchase value.
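That breakdown amounts to filtering on Product_Category_1 == 10 and grouping by Product_Category_2, keeping NaN as its own "no secondary category" group; illustrative rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Product_Category_1': [10, 10, 10, 5],
    'Product_Category_2': [np.nan, 11.0, 11.0, 14.0],
    'Purchase': [20295, 19000, 19412, 7000],
})

# Keep only primary category 10, then average purchase per secondary
# category; dropna=False keeps the "no secondary category" (NaN) group
cat10 = df[df['Product_Category_1'] == 10]
avg = cat10.groupby('Product_Category_2', dropna=False)['Purchase'].mean()
print(avg)
```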
Now let us assume we have a product with a category 10 and a category
13.
Does having a third category affect the product’s purchase amount?
The data depicts that, in our given data, if a product has categories 10
and 13, it may either have no third category (Null) or belong to category
16. We can see a huge disparity between the avg. purchase amounts of
products without a third category and those with a third category (16).
For every row that has a non-zero cell in either Product category 2 or
Product Category 3 (suppose row 2, highlighted), we take that non-zero
value ‘i’ and replace the ‘ith’ column in that row with 1. We do this for
all our data.
This is what the Final data will look like once we are done with all the
data points.
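The replacement scheme described above (take each non-null category value i and set indicator column i to 1) can be sketched as below; the `Cat_` column names and the category range 1–18 are my assumptions, not from the text:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Product_Category_2': [14.0, np.nan, 11.0],
    'Product_Category_3': [16.0, np.nan, np.nan],
})

# One indicator column per possible category value, all zero to start
for cat in range(1, 19):
    df[f'Cat_{cat}'] = 0

# For each row, flip the indicator of every non-null category value i
for col in ['Product_Category_2', 'Product_Category_3']:
    for idx, val in df[col].dropna().items():
        df.loc[idx, f'Cat_{int(val)}'] = 1

print(df[['Cat_11', 'Cat_14', 'Cat_16']])
```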
Now we will code the above steps for the actual data set. Let’s have a
look at the columns in our data now.
data_df_onehot = pd.get_dummies(df_onehot,
    columns=['Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years'],
    prefix=['Age', 'Occupation', 'City', 'Stay'])
We also want to use the Product_ID column, but it cannot be used as-is
since it is of type 'P00…'. So, we will first remove 'P00' from the
column and then use it.
data_df_onehot['Product_ID'] = data_df_onehot['Product_ID'].str.replace('P00', '')
For effective model building, we can standardize the dataset
using Feature Scaling. This can be done with StandardScaler() from
sklearn’s preprocessing library. Now we separate the target variable
from our dataset and then the dataset is split into training data and
testing data in the ratio 80:20 using the train_test_split() command.
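Those two steps, standardizing and the 80:20 split, look roughly like this on stand-in arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in feature matrix and target for the encoded dataset
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

# Standardize every feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# 80:20 split into training and testing data
train_data, test_data, train_labels, test_labels = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42)
print(train_data.shape, test_data.shape)
```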
First, we import xgboost and then convert our data into the DMatrix
format used by XGBoost. The algorithm can also be used without
converting to a DMatrix; however, I will do it anyway.

import xgboost as xgb
dtrain = xgb.DMatrix(train_data, label=train_labels)
dtest = xgb.DMatrix(test_data, label=test_labels)
Since there are a lot of data points, it takes some time to train the model.
Our model went through the complete data 819 times and found the best
score at the 809th round. The Test-RMSE is 2510 without hyperparameter
tuning, which is pretty good.
I have only done a rough parameter tuning, as it was taking a lot of time
to tune the model using cross-validation. I feel a better tuning of the
model is possible than the one below, but the purpose is to show how I
did the tuning.
We get the best score with a max_depth of 9 and min_child_weight of 7,
so let's update our params.
Subsample and Column Sample by Tree
These parameters control how the dataset is sampled at each boosting
round.
Instead of using the whole training set every time, we can build a tree
on slightly different data at each step, which makes it less likely to
overfit a single sample or feature.
Let’s see if we can get better results by tuning those parameters together.
Step 7: Evaluation of the Model
Here is what our final dictionary of parameters looks like:
params = {'colsample_bytree': 0.7,
'eta': 0.2,
'eval_metric': 'rmse',
'max_depth': 9,
'min_child_weight': 7,
'objective': 'reg:squarederror',
'subsample': 1}
Let’s train a model with it and see how well it does on our test set!
Well, isn't that an improvement? Not only did the number of iterations
go down from 819 to 607, but the Test-RMSE also dropped from
2510.89 to 2497.255. Now we can use this model to predict on the test
data and then submit the result for checking.
This places us at position 355, which is in the top 14% of all
participants.
DECISION TREE
◈ Decision trees learn how to best split the dataset into smaller and
smaller subsets to predict the target value.
◈ The splitting criterion used is "mse", the mean squared error.
RANDOM FOREST
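The text gives no code for this section; a minimal random-forest regressor on synthetic data, with settings that are my assumptions rather than the project's, would look like:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(60, 3)
y = X[:, 0] * 10  # simple synthetic target

# Many decision trees trained on bootstrap samples; predictions are averaged
forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(X, y)
preds = forest.predict(X[:5])
print(preds.shape)
```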
MULTIVARIATE ANALYSIS
XGBoost Learning Model
MULTIVARIATE ANALYSIS
This technique is best suited for use when we have multiple categorical
independent variables and two or more metric dependent variables.
CORRELATION
Finally, let's take a look at the relationships between numeric
features and other numeric features.
Correlation is a value between -1 and 1 that represents how
closely values for two separate features move in unison.
A positive correlation means that as one feature increases, the
other increases, e.g., a child's age and height.
A negative correlation means that as one feature increases, the
other decreases, e.g., hours spent studying and the number of parties
attended.
Correlations near -1 or 1 indicate a strong relationship.
Those closer to 0 indicate a weak relationship.
0 indicates no relationship.
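The two textbook examples above translate directly into a small correlation matrix; the numbers are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'age': [5, 8, 11, 14],
    'height': [110, 128, 146, 160],   # rises with age: positive correlation
    'hours_studied': [1, 3, 5, 8],
    'parties': [9, 6, 4, 1],          # falls as study hours rise: negative
})

# Pairwise correlations between all numeric columns
corr = df.corr()
print(corr)
```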
WHAT DO WE NEED TO DO?
The color bar on the right explains the meaning of the heatmap -
Dark colors indicate strong negative correlations and light colors
indicate strong positive correlations.
Perhaps the most helpful way to interpret this correlation heatmap
is to first find features that are correlated with our target variable
by scanning the last column.
In this case, it doesn't look like many features are strongly
correlated with the target variable.
There seems to be a negative correlation between the columns
'Purchase' and 'Product_Category_1'.
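Scanning the target column of df.corr() (the heatmap itself would be drawn with seaborn's sns.heatmap) can be sketched on synthetic data; the rule generating Purchase below is an assumption made only to mimic the negative relationship the text observes:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({
    'Product_Category_1': rng.randint(1, 19, 100),
    'Occupation': rng.randint(0, 21, 100),
})
# Synthetic rule: higher primary categories are cheaper, echoing the text
df['Purchase'] = (20000 - df['Product_Category_1'] * 800
                  + rng.randint(0, 500, 100))

# Read correlations with the target straight off the last column
target_corr = df.corr()['Purchase'].sort_values()
print(target_corr)
```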
We performed data analysis by visualizing the data, in particular by
visualizing the statistical relationships between the different variables.
MODEL PREPARATION AND PREDICTIONS
I used several regression techniques, but XGBRegressor produced
the best results. It gave me an RMSE score of 2896 initially.
After applying hyperparameter optimization with several values, it
gives an RMSE score of around 2739. But there are still other ways
of improving this, such as extracting more interesting features, trying
other ML models, and trying deep learning models.
I hope this information was helpful. I know there might be a lot of
things I have missed, so any suggestions or improvements are
appreciated.
CONCLUSION
BIBLIOGRAPHY
[1] J. McDermott, "Black Friday spending statistics 2018," Apr. 2019. [Online]. Available: https://www.finder.com/black-friday-statistics
[2] E. Swilley and R. E. Goldsmith, "Black Friday and Cyber Monday: Understanding consumer intentions on two major shopping days," Journal of Retailing and Consumer Services, vol. 20, no. 1, pp. 43-50, 2013.
[3] J. Boyd Thomas and C. Peters, "An exploratory investigation of Black Friday consumption rituals," International Journal of Retail & Distribution Management, vol. 39, no. 7, pp. 522-537, 2011.
[4] L. Simpson, L. Taylor, K. O'Rourke, and K. Shaw, "An analysis of consumer behavior on Black Friday," American International Journal of Contemporary Research, 2011.
[5] G. C. Bell, M. R. Weathers, S. O. Hastings, and E. B. Peterson, "Investigating the celebration of Black Friday as a communication ritual," Journal of Creative Communications, vol. 9, no. 3, pp. 235-251, 2014.
[6] H. J. Kwon and T. M. Brinthaupt, "The motives, characteristics, and experiences of US Black Friday shoppers," Journal of Global Fashion Marketing, vol. 6, no. 4, pp. 292-302, 2015.
[7] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 785-794, 2016.
[8] A. Guzman, Derivatives and Integrals of Multivariable Functions. Springer Science & Business Media, 2012.
[9] M. Dagdoug, "Black Friday: A study of sales through consumer behaviors," Jul. 2018. [Online]. Available: https://www.kaggle.com/mehdidag/black-friday