
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

DATA SCIENCE

MACHINE LEARNING DOMAIN PROJECTS FOR
REGRESSION, CLASSIFICATION AND CLUSTERING
USING VARIOUS DATASETS

ABES ENGINEERING COLLEGE, GHAZIABAD

DATA SCIENCE PROJECT TRAINING REPORT SUBMITTED IN PARTIAL
FULFILLMENT OF THE DEGREE OF BACHELOR OF TECHNOLOGY,
2020-2024

Date : 19 July 2022

Name of the Student : DEEP SHARAN
University Roll No : 2000321540025
ACKNOWLEDGEMENT

I would like to acknowledge the contributions of the following people, without whose help and
guidance this report would not have been completed.
I acknowledge, with respect and gratitude, the counsel and support of Dr. Sanjay Kumar Singh,
Professor, IT Department, whose expertise, guidance, support, encouragement, and enthusiasm
have made this report possible. His feedback vastly improved the quality of this report and
made the work an enthralling experience. I am indeed proud and fortunate to be supported by him. I am
also thankful to Prof. (Dr.) Amit Sinha, H.O.D of the Information Technology Department, and
Dr. Kanika Gupta, A.H.O.D of the Information Technology Department, for their constant
encouragement, valuable suggestions, moral support, and blessings.
Although it is not possible to name everyone individually, I shall ever remain indebted to the faculty
members of ABES Engineering College, Ghaziabad for their persistent support and cooperation
extended during this work.
This acknowledgement would remain incomplete if I failed to express my deep sense of obligation to
my parents and God for their constant blessings and encouragement.

NAME : DEEP SHARAN

ROLL NUMBER: 2000321540025

CONTENTS

➢ Acknowledgement

➢ Student’s Declaration

➢ Regression

➢ Dataset

➢ Project
Regression

Regression, one of the most common types of machine learning models,
estimates the relationships between variables. Whereas classification models
identify which category an observation belongs to, regression models estimate a
numeric value.

In the context of machine learning and data science, regression specifically refers
to the estimation of a continuous dependent variable or response from a list of
input variables, or features. There are a variety of regression techniques, ranging
from the simplest (linear regression), through classic statistical regression
models (Lasso, Elastic Net, etc.), to more complex techniques including gradient
boosting and neural networks.

Why is Regression Important?

Regression Analysis is a statistical technique that evaluates the relationship
between two or more variables.

Regression Analysis helps enterprises understand what their data points
represent and use them wisely, in coordination with different business
analytics techniques, in order to make better decisions.

Regression Analysis helps an individual understand how the typical value of
the dependent variable changes when one of the independent variables is
varied while the other independent variables remain unchanged. Therefore,
this powerful statistical tool is used by Business Analysts and other data
professionals to remove unwanted variables and choose only the important
ones.

The benefit of regression analysis is that it allows data crunching that
helps businesses make better decisions. A greater understanding of the
variables can impact the success of a business in the coming weeks,
months, and years.

The regression method of forecasting, as the name implies, is used for
forecasting and for finding the causal relationship between variables.
From a business point of view, the regression method of forecasting can
help an individual working with data in the following ways:

• Predicting sales in the near and long term.

• Understanding demand and supply.

• Understanding inventory levels.

• Reviewing and understanding how variables impact all of these factors.

In addition, businesses can use regression methods to answer questions
such as:

• Why did customer service calls drop in the past months?

• What will sales look like in the next six months?

• Which ‘marketing promotion’ method should be chosen?

• Whether to expand the business or to create and market a new product.

The ultimate benefit of regression analysis is to determine which
independent variables have the most effect on a dependent variable. It
also helps to determine which factors can be ignored and which should
be emphasized.

Meaning of Regression

Let us understand the concept of regression with an example.

Consider a situation where you conduct a case study on several college
students, to understand whether students with a high CGPA also get a high
GRE score.

Our first job is to collect the details of the GRE scores and CGPAs of all
the students of a college in a tabular form. The GRE scores and the
CGPAs are listed in the 1st and 2nd columns, respectively.

To understand the relationship between CGPA and GRE score, we draw a
scatter plot.

Here, we can see a linear relationship between CGPA and GRE score in
the scatter plot: as the CGPA increases, the GRE score also increases.
Thus, a student with a high CGPA is likely to have a greater chance of
getting a high GRE score.

However, suppose a question arises such as “If the CGPA of a student is 8.51,
what will the GRE score of the student be?”. We need to find the relationship
between these two variables to answer this question. This is where
regression plays its role.

In a regression algorithm, we usually have one dependent variable and
one or more independent variables, and we try to regress the dependent
variable "Y" (in this case, GRE score) on the independent variable "X"
(in this case, CGPA). In layman's terms, we are trying to understand how
the value of "Y" changes with respect to a change in "X". A minimal
sketch of such a model is shown below.
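
As an illustrative sketch only, with invented CGPA and GRE numbers (not
taken from any real study), a simple linear regression can answer such a
question:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: CGPA (feature) and GRE score (target)
cgpa = np.array([[6.8], [7.4], [8.0], [8.6], [9.2], [9.8]])
gre = np.array([290, 301, 308, 316, 324, 333])

model = LinearRegression()
model.fit(cgpa, gre)

# Predict the GRE score for a student with CGPA 8.51
print(model.predict([[8.51]]))        # falls between the scores for CGPA 8.0 and 8.6
print(model.intercept_, model.coef_)  # the constants "a" and "b" of the fitted line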

Let us now understand the concept of dependent and independent variables.

Dependent and Independent variables

In data science, variables refer to the properties or characteristics of
certain events or objects.

There are mainly two types of variables in regression analysis, which
are as follows:

• Independent variables – These are the variables that are manipulated or
altered by researchers, and whose effects are later measured and compared.
They are also referred to as predictor variables, because they predict or
forecast the values of the dependent variable in a regression model.
• Dependent variables – These are the variables that measure the effect of
the independent variables on the testing units. It is safe to say that
dependent variables depend completely on the independent variables. They
are also referred to as predicted variables, because their values are
predicted from the independent, or predictor, variables.

When an individual is looking for a relationship between two variables,
he is trying to determine what factors make the dependent variable
change. For example, consider a scenario where a student's score is the
dependent variable. It could depend on many independent factors, such as
the amount of study he did, how much sleep he had the night before
the test, or even how hungry he was during the test.

In data models, independent variables can have different names such
as “regressors”, “explanatory variable”, “input variable”, “controlled
variable”, etc. On the other hand, dependent variables are called
“regressand”, “response variable”, “measured variable”, “observed
variable”, “responding variable”, “explained variable”, “outcome
variable”, “experimental variable”, or “output variable”.

Below are a few examples to understand the usage and significance of
dependent and independent variables in a wider sense:

• Suppose you want to estimate the cost of living of a person using a
regression model. In that case, you need to take as independent variables
factors such as salary, age, marital status, etc. The cost of living of a
person is highly dependent on these factors; thus, it is designated as the
dependent variable.
• Another scenario is a student's poor performance in an examination. The
independent variables could be factors such as poor memory, inattentiveness
in class, irregular attendance, etc. Since these factors affect the
student's score, the dependent variable in this case is the student's score.
• Suppose you want to measure the effect of different quantities of nutrient
intake on the growth of a newborn child. In that case, you need to
consider the amount of nutrient intake as the independent variable. In
contrast, the dependent variable will be the growth of the child, which
can be measured by factors such as height, weight, etc.

Let us now understand the concept of a regression line.

What is a Regression Line?

In the field of statistics, a regression line is a line that best describes
the behaviour of a dataset, such that the overall distance from the line
to the points (variable values) plotted on a graph is the smallest. In
layman's words, it is the line that best fits the trend of a given set of data.

Regression lines are mainly used in forecasting procedures. The
significance of the line is that it describes the interrelation of a
dependent variable “Y” with one or more independent variables “X”, and it
is chosen to minimize the squared deviations of the predictions.

If we take two variables, X and Y, there will be two regression lines:

• Regression line of Y on X: This gives the most probable Y values from the
given values of X.
• Regression line of X on Y: This gives the most probable values of X from
the given values of Y.

The correlation between the variables X and Y depends on the distance
between the two regression lines. The degree of correlation is higher if
the regression lines are nearer to each other; in contrast, the degree of
correlation is lower if the regression lines are farther from each
other.

If the two regression lines coincide, i.e. only a single line exists,
correlation tends to be either perfect positive or perfect negative.
However, if the variables are independent, then the correlation is zero,
and the lines of regression will be at right angles.

Regression lines are widely used in the financial sector and business
procedures. Financial Analysts use linear regression techniques to
predict prices of stocks, commodities and perform valuations, whereas
businesses employ regressions for forecasting sales, inventories, and
many other variables essential for business strategy and planning.

What is the Regression Equation?

In statistics, the Regression Equation is the algebraic expression of a
regression line. In simple terms, it is used to predict the values of the
dependent variable from the given values of the independent variable.

Let us consider one regression line, say Y on X, and another line, say X
on Y; there will then be one regression equation for each regression
line:

• Regression Equation of Y on X:

This equation depicts the variations in the dependent variable Y for
given changes in the independent variable X. The expression is as
follows:

Ye = a + bX

Where,

• Ye is the (estimated) dependent variable,
• X is the independent variable,
• a and b are the two unknown constants that determine the position of
the line.

The parameter “a” indicates the distance of the line above or below the
origin, i.e. the level of the fitted line, whereas the parameter "b"
indicates the slope, i.e. the change in the value of the dependent
variable Y for one unit of change in the independent variable X.

The parameters "a" and "b" can be calculated using the least squares
method. According to this method, the line is drawn in such a way that
the sum of the squares of the vertical deviations of the observed Y from
the calculated values of Y is the least. In other words, the best-fitted
line is obtained when ∑(Y − Ye)² is minimum.

To calculate the values of the parameters “a” and “b”, we need to
simultaneously solve the following two normal equations:

∑Y = Na + b∑X

∑XY = a∑X + b∑X²

• Regression Equation of X on Y:

This equation depicts the variations in the variable X for given changes
in the variable Y, where X is now treated as the dependent variable. The
expression is as follows:

Xe = a + bY

Where,

• Xe is the (estimated) dependent variable,
• Y is the independent variable,
• a and b are the two unknown constants that determine the position of
the line.

Again, in this equation, the parameter “a” indicates the distance of the
line above or below the origin, i.e. the level of the fitted line, whereas
the parameter "b" indicates the slope, i.e. the change in the value of the
dependent variable X for one unit of change in the independent variable
Y.

To calculate the values of parameters “a” and “b” in this equation, we
need to simultaneously solve the following two normal equations:

∑X = Na + b∑Y

∑XY = a∑Y + b∑Y²

Please note that the regression lines can be completely determined only
if we obtain the values of the constants “a” and “b”. A small numerical
sketch of this calculation is given below.
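
To make this concrete, here is a minimal sketch (with invented X and Y
values, purely for illustration) that solves the two normal equations for
"a" and "b" using NumPy and cross-checks the result with np.polyfit:

import numpy as np

# Hypothetical observations (for illustration only)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
N = len(X)

# Normal equations of Y on X:
#   sum(Y)  = N*a + b*sum(X)
#   sum(XY) = a*sum(X) + b*sum(X^2)
A = np.array([[N,       X.sum()],
              [X.sum(), (X**2).sum()]])
rhs = np.array([Y.sum(), (X * Y).sum()])
a, b = np.linalg.solve(A, rhs)

print(a, b)                 # level "a" and slope "b" of the fitted line
print(np.polyfit(X, Y, 1))  # same line: returns [slope, intercept]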

When Should I Use Regression Analysis?

Regression Analysis is mainly used to describe the relationships between
a set of independent variables and the dependent variable. It generates a
regression equation whose coefficients correspond to the relationship
between each independent variable and the dependent variable.

Analyze a wide variety of relationships

You can use regression analysis for many purposes, for example:

• To model multiple independent variables.
• To include continuous and categorical variables.
• To use polynomial terms for curve fitting.
• To evaluate interaction terms, i.e. to examine whether the effect of one
independent variable depends on the value of another variable.
Regression Analysis can untangle very critical problems where the
variables are entwined. Consider yourself to be a researcher studying
any of the following:

• What impact do socio-economic status and race have on educational
achievement?
• Do education and IQ affect earnings?
• Do exercise habits and diet affect weight?
• Do drinking coffee and smoking cigarettes reduce the mortality rate?
• Does a particular exercise have an impact on bone density?

These research questions create a huge amount of data that entwines
numerous independent and dependent variables and raises questions about
their influence on each other. It is an important task to untangle this
web of related variables, find out which variables are statistically
essential, and determine the role of each of these variables. To answer
all these questions and rescue us in this game of variables, we take the
help of regression analysis in all such scenarios.

Control the independent variables

Regression analysis describes how the changes in each independent
variable are related to the changes in the dependent variable, and in
doing so it statistically controls for every other variable in the
regression model.

In the process of regression analysis, it is crucial to isolate the role
of each variable. Consider a scenario where you participated in an
exercise intervention study, aiming to determine whether the intervention
was responsible for increasing the subjects' bone mineral density. To do
so, you need to isolate the role of the exercise intervention from other
factors that can impact bone density, such as diet or other physical
activity.

To perform this task, you need to reduce the effect of the confounding
variables. Regression analysis estimates the effect that a change in one
independent variable has on the dependent variable while all other
independent variables are held constant. This process allows you to
understand each independent variable's role without interference from the
other variables in the regression model.

Now, let us understand how regression can help control the other
variables in the process.

According to a recent study on the effect of coffee consumption on
mortality, the initial results depicted that the higher the intake of
coffee, the higher the risk of death. However, the researchers did not
include in their first model the fact that many coffee drinkers smoke.
After smoking was included in the model, the regression results were
quite different from the initial results: coffee intake lowers the risk
of mortality, while smoking increases it.

This model isolates the role of each variable while holding the other
variables constant. You can examine the effect of coffee intake while
controlling the smoking factor. On the other hand, you can also look at
smoking while controlling for coffee intake.

This particular example shows how omitting a significant variable can
produce misleading results by leaving its effect uncontrolled. This
warning applies mainly to observational studies, where the effects of
omitted significant variables can be unbalanced. In true experiments,
randomization tends to distribute the effects of such variables equally,
which minimizes this omitted variable bias. A small simulation of this
effect is sketched below.
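
To illustrate this point with synthetic data (the numbers below are
invented and do not reproduce the actual coffee study), the following
sketch builds a dataset in which smoking drives the risk score and is
correlated with coffee intake, then compares the coffee coefficient with
and without controlling for smoking:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data: smokers tend to drink more coffee,
# and only smoking truly raises the mortality risk score.
smoking = rng.binomial(1, 0.4, n)
coffee = rng.poisson(1 + 2 * smoking)             # cups per day
risk = 5 * smoking - 0.5 * coffee + rng.normal(0, 1, n)

# Model 1: coffee only (smoking omitted) -> coffee looks harmful
m1 = LinearRegression().fit(coffee.reshape(-1, 1), risk)
print("coffee coef, smoking omitted :", m1.coef_[0])

# Model 2: coffee + smoking -> coffee's coefficient turns protective
X = np.column_stack([coffee, smoking])
m2 = LinearRegression().fit(X, risk)
print("coffee coef, smoking included:", m2.coef_[0])

Running this, the first coefficient comes out positive (misleadingly
"harmful") while the second is close to the true negative value, which is
exactly the omitted variable bias described above.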

PROJECT BASED LEARNING AND IMPLEMENTATION
DATASET: BLACK FRIDAY

Description:

Black Friday is a colloquial term for the Friday after Thanksgiving in the
United States. It traditionally marks the start of the Christmas shopping
season, and many stores offer highly promoted sales at discounted prices
and often open early, sometimes as early as midnight or even on
Thanksgiving.
On Black Friday, big shopping giants like Amazon, Flipkart, etc. lure
customers by offering discounts and deals on different product categories,
ranging from electronic items and clothing to kitchen appliances and décor.
Various researchers have carried out studies to predict such sales, and the
analysis of this data serves as a basis for offering discounts on various
product items. With the purpose of analyzing and predicting the sales, five
models have been used. The Black Friday Sales Dataset available on Kaggle
has been used for the analysis and prediction. The models used for
prediction are Linear Regression, Lasso Regression, Ridge Regression,
Decision Tree Regressor, and Random Forest Regressor, with Mean Squared
Error (MSE) as the performance evaluation measure. Random Forest Regressor
outperforms the other models with the least MSE score; an illustrative
sketch of this comparison is given below.
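
As a hedged, illustrative sketch only (assuming X_train and y_train have
been prepared as in the notebook that follows; the exact preprocessing and
scores are not reproduced here), the five models named above could be
compared on MSE as follows:

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# X_train, y_train: the preprocessed Black Friday features and target
models = {
    "Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(),
    "Ridge Regression": Ridge(),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=0),
    "Random Forest Regressor": RandomForestRegressor(random_state=0),
}

for name, model in models.items():
    # cross_val_score returns negated MSE, so flip the sign back
    mse = -cross_val_score(model, X_train, y_train, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"{name:25s} MSE: {mse:.2f}")

Cross-validated MSE gives a fairer comparison between the models than a
single train/test split would.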

CODE:-

import numpy as np
import pandas as pd
import seaborn as sns
In [2]:
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")
In [3]:
train_data.head() # head() shows the first 5 rows
Out[3]:
   User_ID Product_ID Gender   Age  Occupation City_Category Stay_In_Current_City_Years  Marital_Status  Product_Category_1  Product_Category_2  Product_Category_3  Purchase
0  1000001  P00069042      F  0-17          10             A                          2               0                   3                 NaN                 NaN      8370
1  1000001  P00248942      F  0-17          10             A                          2               0                   1                 6.0                14.0     15200
2  1000001  P00087842      F  0-17          10             A                          2               0                  12                 NaN                 NaN      1422
3  1000001  P00085442      F  0-17          10             A                          2               0                  12                14.0                 NaN      1057
4  1000002  P00285442      M   55+          16             C                         4+               0                   8                 NaN                 NaN      7969

In [4]:
test_data.head()

Out[4]:
   User_ID Product_ID Gender    Age  Occupation City_Category Stay_In_Current_City_Years  Marital_Status  Product_Category_1  Product_Category_2  Product_Category_3
0  1000004  P00128942      M  46-50           7             B                          2               1                   1                11.0                 NaN
1  1000009  P00113442      M  26-35          17             C                          0               0                   3                 5.0                 NaN
2  1000010  P00288442      F  36-45           1             B                         4+               1                   5                14.0                 NaN
3  1000010  P00145342      F  36-45           1             B                         4+               1                   4                 9.0                 NaN
4  1000011  P00053842      F  26-35           1             C                          1               0                   4                 5.0                12.0

In [5]:
train_data.info() # summarizes all columns: non-null counts and dtypes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User_ID 550068 non-null int64
1 Product_ID 550068 non-null object
2 Gender 550068 non-null object
3 Age 550068 non-null object
4 Occupation 550068 non-null int64
5 City_Category 550068 non-null object
6 Stay_In_Current_City_Years 550068 non-null object
7 Marital_Status 550068 non-null int64
8 Product_Category_1 550068 non-null int64
9 Product_Category_2 376430 non-null float64
10 Product_Category_3 166821 non-null float64
11 Purchase 550068 non-null int64

dtypes: float64(2), int64(5), object(5)
memory usage: 50.4+ MB
In [6]:
test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233599 entries, 0 to 233598
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User_ID 233599 non-null int64
1 Product_ID 233599 non-null object
2 Gender 233599 non-null object
3 Age 233599 non-null object
4 Occupation 233599 non-null int64
5 City_Category 233599 non-null object
6 Stay_In_Current_City_Years 233599 non-null object
7 Marital_Status 233599 non-null int64
8 Product_Category_1 233599 non-null int64
9 Product_Category_2 161255 non-null float64
10 Product_Category_3 71037 non-null float64
dtypes: float64(2), int64(4), object(5)
memory usage: 19.6+ MB
In [7]:
# Dropping the columns with many null values, plus the identifier columns
train_data.drop(['Product_Category_2','Product_Category_3','User_ID','Product_ID'],axis=1,inplace=True)
test_data.drop(['Product_Category_2','Product_Category_3','User_ID','Product_ID'],axis=1,inplace=True)
In [8]:
train_data.head(10) # to check the first n rows, insert the value in brackets
Out[8]:
  Gender    Age  Occupation City_Category Stay_In_Current_City_Years  Marital_Status  Product_Category_1  Purchase
0      F   0-17          10             A                          2               0                   3      8370
1      F   0-17          10             A                          2               0                   1     15200
2      F   0-17          10             A                          2               0                  12      1422
3      F   0-17          10             A                          2               0                  12      1057
4      M    55+          16             C                         4+               0                   8      7969
5      M  26-35          15             A                          3               0                   1     15227
6      M  46-50           7             B                          2               1                   1     19215
7      M  46-50           7             B                          2               1                   1     15854
8      M  46-50           7             B                          2               1                   1     15686
9      M  26-35          20             A                          1               1                   8      7871

In [9]:
test_data.head(10)
Out[9]:
  Gender    Age  Occupation City_Category Stay_In_Current_City_Years  Marital_Status  Product_Category_1
0      M  46-50           7             B                          2               1                   1
1      M  26-35          17             C                          0               0                   3
2      F  36-45           1             B                         4+               1                   5
3      F  36-45           1             B                         4+               1                   4
4      F  26-35           1             C                          1               0                   4
5      M  46-50           1             C                          3               1                   2
6      M  46-50           1             C                          3               1                   1
7      M  46-50           1             C                          3               1                   2
8      M  26-35           7             A                          1               0                  10
9      M  18-25          15             A                         4+               0                   5

In [10]:
sns.scatterplot(data=train_data,x='Occupation',y='Purchase')
Out[10]:
<AxesSubplot:xlabel='Occupation', ylabel='Purchase'>

In [11]:
sns.scatterplot(data=test_data,x='Occupation',y='Gender')
Out[11]:
<AxesSubplot:xlabel='Occupation', ylabel='Gender'>

In [12]:
sns.barplot(data=train_data,x='Occupation',y='Purchase')
Out[12]:
<AxesSubplot:xlabel='Occupation', ylabel='Purchase'>

In [13]:
sns.barplot(data=test_data,y='Occupation',x='Gender')

Out[13]:
<AxesSubplot:xlabel='Gender', ylabel='Occupation'>

In [14]:
sns.barplot(data=train_data,x='Marital_Status',y='Purchase')
Out[14]:
<AxesSubplot:xlabel='Marital_Status', ylabel='Purchase'>

In [15]:
sns.barplot(data=train_data,x='City_Category',y='Purchase')

Out[15]:
<AxesSubplot:xlabel='City_Category', ylabel='Purchase'>

In [16]:
sns.barplot(data=train_data,x='Age',y='Purchase')
Out[16]:
<AxesSubplot:xlabel='Age', ylabel='Purchase'>

In [17]:

# temporarily concatenating the train and test dataframes
df = pd.concat([train_data.assign(ind="train"),
                test_data.assign(ind="test")], ignore_index=True)
In [18]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 783667 entries, 0 to 783666
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 783667 non-null object
1 Age 783667 non-null object
2 Occupation 783667 non-null int64
3 City_Category 783667 non-null object
4 Stay_In_Current_City_Years 783667 non-null object
5 Marital_Status 783667 non-null int64
6 Product_Category_1 783667 non-null int64
7 Purchase 550068 non-null float64
8 ind 783667 non-null object
dtypes: float64(1), int64(3), object(5)
memory usage: 53.8+ MB
In [19]:
# One hot encoding multiple categorical columns in combined dataset
complete_dataset = pd.get_dummies(data=df, drop_first=True,
                                  columns=['Gender','Age','City_Category','Stay_In_Current_City_Years'])
In [20]:
# Splitting the above dataset into train and test dataset
train_data2 = complete_dataset[complete_dataset["ind"].eq("train")].copy()
test_data2 = complete_dataset[complete_dataset["ind"].eq("test")].copy().reset_index(drop=True)

# Removing the unwanted column ind used for marking train and test data
train_data2.drop('ind',axis=1,inplace=True)
test_data2.drop('ind',axis=1,inplace=True)
In [21]:
# Splitting the training dataset into X_train, y_train
X_train, y_train = train_data2.drop("Purchase",axis=1), train_data2["Purchase"]

X_test = test_data2
y_train.head()
Out[21]:
0 8370.0
1 15200.0
2 1422.0
3 1057.0
4 7969.0

Name: Purchase, dtype: float64
In [22]:
# Using feature selection to select the best features for regression
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

bestfeatures = SelectKBest(score_func=f_regression, k=8)
fit = bestfeatures.fit(X_train, y_train)

# Creating a new dataframe with the columns and their corresponding scores
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X_train.columns)
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']
featureScores.sort_values("Score",ascending=False).head(12)
Out[22]:
                           Specs         Score
2             Product_Category_1  73684.939778
11               City_Category_C   2055.228480
3                       Gender_M   2010.442472
0                     Occupation    238.831156
10               City_Category_B    200.691083
8                      Age_51-55    120.383197
4                      Age_18-25     42.903923
6                      Age_36-45     24.746972
13  Stay_In_Current_City_Years_2     15.790576
7                      Age_46-50      6.050446
9                        Age_55+      4.637896
14  Stay_In_Current_City_Years_3      2.402775
In [23]:
X_train = X_train[["Product_Category_1","City_Category_C","Gender_M","Occupation","City_Category_B"]]
In [24]:
pip install LightGBM
Requirement already satisfied: LightGBM in c:\users\acer\appdata\local\progra
ms\python\python310\lib\site-packages (3.3.2)
Requirement already satisfied: scikit-learn!=0.22.0 in c:\users\acer\appdata\
local\programs\python\python310\lib\site-packages (from LightGBM) (1.1.1)
Requirement already satisfied: scipy in c:\users\acer\appdata\local\programs\
python\python310\lib\site-packages (from LightGBM) (1.8.1)
Requirement already satisfied: numpy in c:\users\acer\appdata\local\programs\
python\python310\lib\site-packages (from LightGBM) (1.22.4)
Requirement already satisfied: wheel in c:\users\acer\appdata\local\programs\
python\python310\lib\site-packages (from LightGBM) (0.37.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\acer\appdata\
local\programs\python\python310\lib\site-packages (from scikit-learn!=0.22.0-
>LightGBM) (3.1.0)
Requirement already satisfied: joblib>=1.0.0 in c:\users\acer\appdata\local\p
rograms\python\python310\lib\site-packages (from scikit-learn!=0.22.0->LightG
BM) (1.1.0)
Note: you may need to restart the kernel to use updated packages.
WARNING: There was an error checking the latest version of pip.
In [25]:
pip install XGBoost
Requirement already satisfied: XGBoost in c:\users\acer\appdata\local\program
s\python\python310\lib\site-packages (1.6.1)
Requirement already satisfied: numpy in c:\users\acer\appdata\local\programs\
python\python310\lib\site-packages (from XGBoost) (1.22.4)
Requirement already satisfied: scipy in c:\users\acer\appdata\local\programs\
python\python310\lib\site-packages (from XGBoost) (1.8.1)
Note: you may need to restart the kernel to use updated packages.
WARNING: There was an error checking the latest version of pip.
In [26]:
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
In [27]:
regressor = LGBMRegressor()
xgb_regressor = XGBRegressor()
In [28]:
regressor.fit(X_train,y_train)
xgb_regressor.fit(X_train,y_train)

Out[28]:
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
early_stopping_rounds=None, enable_categorical=False,
eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
importance_type=None, interaction_constraints='',
learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0,
num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
reg_lambda=1, ...)
In [29]:
from sklearn.model_selection import ShuffleSplit, cross_validate

cv_split = ShuffleSplit(n_splits=10, test_size=.3, train_size=.7, random_state=0)
# Note: for a regressor, cross_validate scores with R^2 by default
cv_results = cross_validate(xgb_regressor, X_train[["Product_Category_1"]], y_train, cv=cv_split)
In [30]:
print(cv_results["test_score"].mean())
0.6364900199920567
In [31]:
X_test = test_data2[["Product_Category_1","City_Category_C","Gender_M","Occupation","City_Category_B"]]
In [32]:
y_pred = regressor.predict(X_test)
In [33]:
y_pred
Out[33]:
array([13804.62771151, 10207.34592312, 6158.6480155 , ...,
13571.47548186, 19648.31996153, 2450.37255679])
In [34]:
test_data2["Purchase"] = y_pred
In [35]:
test_data2.head()
Out[35]:
   Occupation  Marital_Status  Product_Category_1      Purchase  Gender_M  Age_18-25  Age_26-35  Age_36-45  Age_46-50  Age_51-55  Age_55+  City_Category_B  City_Category_C  Stay_In_Current_City_Years_1  Stay_In_Current_City_Years_2  Stay_In_Current_City_Years_3  Stay_In_Current_City_Years_4+
0           7               1                   1  13804.627712         1          0          0          0          1          0        0                1                0                             0                             1                             0                              0
1          17               0                   3  10207.345923         1          0          1          0          0          0        0                0                1                             0                             0                             0                              0
2           1               1                   5   6158.648015         0          0          0          1          0          0        0                1                0                             0                             0                             0                              1
3           1               1                   4   2380.859256         0          0          0          1          0          0        0                1                0                             0                             0                             0                              1
4           1               0                   4   2448.489456         0          0          1          0          0          0        0                0                1                             1                             0                             0                              0

In [36]:
import math
import os
# Kaggle helper cell: lists all input files (no output when run locally)
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
In [37]:
from sklearn.datasets import make_blobs
from sklearn.datasets import make_circles
from sklearn.preprocessing import StandardScaler
import seaborn as sns
In [40]:
# Generate sample data
X, y = make_circles(n_samples=1000, factor=0.009, noise=0.15)

#-------------------------------------------------------------
ads_arr = StandardScaler().fit_transform(X)
sns.scatterplot(x=ads_arr[:,0],y=ads_arr[:,1],hue=y)
Out[40]:
<AxesSubplot:>

In [41]:
# Distance calculation UDF (p=1: Manhattan, p=2: Euclidean)
def minkowski_(point_a, point_b, p=2):

    if p == 1:
        # Manhattan distance
        dist = np.sum(abs(point_a - point_b))
    else:
        # Euclidean distance (default)
        dist = np.sqrt(np.sum(np.square(point_a - point_b)))

    return dist

#------------------------------------------------------------------
# UDF for calculating the distance from a point to every other point
def distance_to_all(curr_vec, data, p_=2):

    distance_list = []

    for vec_idx in range(len(data)):
        dist = minkowski_(point_a=curr_vec, point_b=data[vec_idx], p=p_)
        distance_list.append(dist)

    return distance_list
In [49]:
def core_border_noise_mapping(data=ads_arr, min_points=10, epsilon=0.38):

    # Initializing trays for collecting labels & dictionaries for key-value combinations
    core = []
    interim = []
    density_reachable = []
    idx_dict = {}   # Mapping of each point to its directly density-reachable points (within epsilon)
    nmin_dict = {}  # Count of total density-reachable points within epsilon for each point

    #-----------------------------------------------------------------------------------------
    for idx in range(len(data)):  # For each point of data

        current_arr = data[idx]
        current_to_all = np.array(distance_to_all(curr_vec=current_arr, data=data, p_=2))  # Calculating distances

        inside_ = np.argwhere(current_to_all < epsilon).ravel()  # Filtering for within epsilon (not inclusive)
        length = len(inside_)
        idx_dict.update({idx: inside_})   # Points which are within eps wrt the 'idx' point
        nmin_dict.update({idx: length})   # Number of points which are within eps wrt the 'idx' point

    #-----------------------------------------------------------------------------------------
    # Copy of the mapping dict with removal of self distance (a point has 0 distance to itself)
    idx_dict_updated = {}
    for key, val in idx_dict.items():
        val_ = val[val != key]
        idx_dict_updated.update({key: val_})

    #-----------------------------------------------------------------------------------------
    # Classifying between core and non-core (interim) points through the min_points parameter
    for (key, value) in nmin_dict.items():

        if value >= min_points:
            core.append(key)
        else:
            interim.append(key)

    #----------------------------------------------------
    print('Total core points :', len(core))
    print('Total interim points :', len(interim))

    #----------------------------------------------------
    # Calculating the directly density-reachable points (all points within eps of any point)
    for key_ in idx_dict_updated.keys():
        val = list(idx_dict_updated[key_])
        density_reachable += val

    density_reachable = list(set(density_reachable))  # Unique through conversion to set
    print('Total density reachable points :', len(density_reachable))

    #----------------------------------------------------
    noise = []
    border = []
    # Classifying between border and noisy points
    for idx in interim:
        if idx in density_reachable:
            border.append(idx)
        else:
            noise.append(idx)

    print('Total noisy points :', len(noise))
    print('Total border points :', len(border))
    #----------------------------------------------------

    return core, border, noise, idx_dict_updated
In [50]:
#!pip install kneed # If not already installed
from sklearn.neighbors import NearestNeighbors # Nearest neighbour calculator
#from kneed import KneeLocator

nbrs = NearestNeighbors(n_neighbors=6, algorithm='auto', metric='minkowski', p=2, n_jobs=-1).fit(ads_arr)
distances, indices = nbrs.kneighbors(ads_arr)

#-----------------------------------------------------------------------------------------
distances = np.sort(distances[:,-1], axis=0)
i = np.arange(len(distances))
#knee = KneeLocator(i, distances, S=1, curve='convex', direction='increasing', interp_method='polynomial')

#-----------------------------------------------------------------------------------------
sns.set_style('darkgrid')
ax = sns.lineplot(x=range(0,len(ads_arr)), y=distances, color='g')
ax.set(xlabel='No of Points', ylabel='Distance of kth neighbor',
       title='Elbow Curve of kth distance-vs-point')
#knee.plot_knee()
Out[50]:
[Text(0.5, 0, 'No of Points'),
 Text(0, 0.5, 'Distance of kth neighbor'),
 Text(0.5, 1.0, 'Elbow Curve of kth distance-vs-point')]

In [51]:
epsilon = 0.31  # Chosen from the kth nearest-neighbour distance (elbow) graph
min_points = 5

core, border, noise, idx_dict_updated = core_border_noise_mapping(data=ads_arr, min_points=min_points, epsilon=epsilon)
Total core points : 963
Total interim points : 37
Total density reachable points : 999
Total noisy points : 1
Total border points : 36
In [56]:
def expand_clusters(point, neighbors_, border_, core_, idx_dict_updated_):

    # Start of cluster allotment: label the seed point
    clusters[point] = counter

    i = 0  # Initializing

    # Walk the tray of directly density-reachable elements; it grows as new core
    # points are absorbed, so density-reachable & density-connected points are covered
    while i < len(neighbors_):

        nextPoint = neighbors_[i]

        if (nextPoint in border_) and (nextPoint not in counter_assign):
            # In border AND not allotted already
            clusters[nextPoint] = counter
            counter_assign.append(nextPoint)

        elif (nextPoint in core_) and (nextPoint not in counter_assign):
            # In core AND not allotted already
            clusters[nextPoint] = counter
            counter_assign.append(nextPoint)

            nextNeighbors = list(idx_dict_updated_[nextPoint])
            # Updating the tray with the new directly density-reachable list,
            # since the point just added to the cluster is a core point
            neighbors_ = neighbors_ + nextNeighbors

        i += 1  # Next element in the tray
In [57]:
clusters = np.array([np.nan]*len(ads_arr))  # Initializing clusters with the required size and nan values
print('Cluster length :', clusters.shape)

counter = 0  # Start of allotment counter
counter_assign = []  # Tracker for assignment (to prevent over-writing of clusters)

for idx in range(len(ads_arr)):  # For each data point

    if idx not in counter_assign:  # Checking if that point is not assigned already

        print('---------------------------------------- idx :', idx)

        if idx in noise:  # If the point is in the noise list, no allotment required
            clusters[idx] = -1  # Allotment for noisy points
            counter_assign.append(idx)

        else:  # If core or border point

            print('------------------ counter :', counter)

            neighbors = list(idx_dict_updated[idx])  # Directly density-reachable points for that point

            expand_clusters(point=idx,
                            neighbors_=neighbors,
                            border_=border,
                            core_=core,
                            idx_dict_updated_=idx_dict_updated)  # Greedy search to allot the cluster till its edge

            counter += 1  # Changing cluster label value


Cluster length : (1000,)
---------------------------------------- idx : 0
------------------ counter : 0
---------------------------------------- idx : 4
------------------ counter : 1
---------------------------------------- idx : 33
------------------ counter : 2
---------------------------------------- idx : 38
---------------------------------------- idx : 104
------------------ counter : 3
---------------------------------------- idx : 241
------------------ counter : 4
---------------------------------------- idx : 654
------------------ counter : 5
---------------------------------------- idx : 861
------------------ counter : 6
In [58]:
print('Total clusters :',len(np.unique(clusters)))
print('Unique cluster values :',np.unique(clusters))
Total clusters : 8
Unique cluster values : [-1. 0. 1. 2. 3. 4. 5. 6.]

In [59]:
sns.scatterplot(x=ads_arr[:,0],y=ads_arr[:,1],hue=clusters)
Out[59]:
<AxesSubplot:>

In [60]:
from sklearn.cluster import DBSCAN

#-----------------------------------------------------------------------------------------
dbscan = DBSCAN(eps=epsilon, min_samples=min_points, metric='euclidean', n_jobs=-1)
dbscan.fit(ads_arr)
dbscan_labels = dbscan.labels_
In [61]:
sns.scatterplot(x=ads_arr[:,0],y=ads_arr[:,1],hue=dbscan_labels)
Out[61]:
<AxesSubplot:>

CONCLUSION

With traditional methods not being of much help to business growth in terms of
revenue, the use of machine learning approaches proves to be an important
element in shaping a business plan that takes into consideration the shopping
patterns of consumers.
Projecting sales with respect to several factors, including last year's sales,
helps businesses adopt suitable strategies to increase the sales of goods that
are in demand. The dataset used for the experimentation is thus the Black Friday
Sales Dataset from Kaggle [9].

The models used are Linear Regression, Lasso Regression, Ridge Regression,
Decision Tree Regressor, and Random Forest Regressor. The evaluation measure
used is Mean Squared Error (MSE).
Based on Table II, the Random Forest Regressor is best suited for the prediction
of sales on the given dataset.
Thus the proposed model will predict customer purchases on Black Friday and
give the retailer insight into customers' choice of products. This will allow
discounts based on customer-centric choices, thus increasing the benefit to the
retailer as well as the customer.

Thank You

