DATA SCIENCE
ACKNOWLEDGEMENT
I would like to acknowledge the contributions of the following people, without whose help and
guidance this report would not have been completed.
I acknowledge, with respect and gratitude, the counsel and support of Dr. Sanjay Kumar Singh,
Professor, IT Department, whose expertise, guidance, encouragement, and enthusiasm
have made this report possible. His feedback vastly improved the quality of this report and
provided an enthralling experience. I am indeed proud and fortunate to be supported by him. I am
also thankful to Prof. (Dr.) Amit Sinha, H.O.D. of the Information Technology Department, and
Dr. Kanika Gupta, A.H.O.D. of the Information Technology Department, for their constant
encouragement, valuable suggestions, moral support, and blessings.
Although it is not possible to name everyone individually, I shall ever remain indebted to the faculty
members of ABES Engineering College, Ghaziabad, for their persistent support and cooperation
extended during this work.
This acknowledgement would remain incomplete if I failed to express my deep sense of obligation to
my parents and God for their consistent blessings and encouragement.
CONTENTS
➢ Acknowledgement
➢ Student’s Declaration
➢ Regression
➢ Dataset
➢ Project
Regression
In the context of machine learning and data science, regression refers specifically
to the estimation of a continuous dependent variable, or response, from a list of
input variables, or features. There are a variety of regression techniques, ranging
from the simplest (linear regression), through regularized linear
models (Lasso, Elastic Net, etc.), to more complex techniques including gradient
boosting and neural networks.
Why is Regression Important?
Data Science
The regression method of forecasting, as the name implies, is used for
forecasting and for finding the causal relationship between variables.
From a business point of view, the regression method of forecasting can
help an individual working with data answer questions such as:
• What will sales look like in the next six months?
Meaning of Regression
Our first job is to collect the GRE scores and CGPAs of all
the students of a college in tabular form. The GRE scores and the
CGPAs are listed in the 1st and 2nd columns, respectively.
Here, we can see a linear relationship between CGPA and GRE score in
the scatter plot. This indicates that as the CGPA increases, the GRE score
also tends to increase. In other words, a student with a high CGPA
is likely to have a high GRE score.
However, suppose a question arises such as “If the CGPA of a student is 8.51, what
will be the GRE score of the student?” To answer it, we need to find the relationship
between these two variables. This is where
regression plays its role.
In a regression algorithm, we usually have one dependent variable and
one or more independent variables, and we try to regress the
dependent variable "Y" (in this case, GRE score) on the independent
variable "X" (in this case, CGPA). In layman's terms, we are trying to
understand how the value of "Y" changes with respect to a change in "X".
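The idea of regressing GRE score on CGPA can be sketched in a few lines. The six (CGPA, GRE) pairs below are invented purely for illustration and do not come from any real college; np.polyfit with degree 1 is only one of several ways to fit a straight line.

```python
import numpy as np

# Hypothetical (CGPA, GRE) pairs, made up for illustration only
cgpa = np.array([6.8, 7.4, 7.9, 8.3, 8.8, 9.2])
gre = np.array([295.0, 302.0, 308.0, 314.0, 321.0, 327.0])

# Fit GRE = a + b * CGPA by least squares (a degree-1 polynomial)
b, a = np.polyfit(cgpa, gre, deg=1)

# Estimate the GRE score of a student whose CGPA is 8.51
predicted_gre = a + b * 8.51
print(f"slope b = {b:.2f}, intercept a = {a:.2f}")
print(f"estimated GRE at CGPA 8.51: {predicted_gre:.1f}")
```

The slope b is the estimated change in GRE score per one-point change in CGPA, which is exactly the "change in Y with respect to a change in X" described above.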
Dependent and Independent variables
Below are a few examples to understand the usage and significance of
dependent and independent variables in a wider sense:
If we take two variables, X and Y, there will be two regression lines:
• Regression line of Y on X: This gives the most probable Y values from the
given values of X.
• Regression line of X on Y: This gives the most probable values of X from
the given values of Y.
If the two regression lines coincide, i.e. only a single line exists,
correlation tends to be either perfect positive or perfect negative.
However, if the variables are independent, then the correlation is zero,
and the lines of regression will be at right angles.
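The link between the two regression lines and the correlation can be checked numerically. The sketch below uses randomly generated (hypothetical) data and a small helper named slope, which is an illustrative name and not part of any library. It relies on the standard identity that the product of the two slopes equals the square of the correlation coefficient r.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.6 * x + rng.normal(scale=0.8, size=500)  # hypothetical, partly correlated data

def slope(u, v):
    """Least-squares slope of the regression line of v on u."""
    return np.cov(u, v, ddof=1)[0, 1] / np.var(u, ddof=1)

b_yx = slope(x, y)  # slope of the regression line of Y on X
b_xy = slope(y, x)  # slope of the regression line of X on Y
r = np.corrcoef(x, y)[0, 1]

# The product of the two slopes equals r squared
print(b_yx * b_xy, r ** 2)
```

When r is +1 or -1 the product is 1 and the two lines coincide. When r is 0 both slopes are 0, so the lines are Y = mean(Y) and X = mean(X), which are at right angles.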
Regression lines are widely used in the financial sector and in business
procedures. Financial analysts use linear regression techniques to
predict the prices of stocks and commodities and to perform valuations, whereas
businesses employ regression to forecast sales, inventories, and
many other variables essential for business strategy and planning.
What is the Regression Equation?
Let us consider one regression line, say Y on X and another line, say X
on Y, then there will be one regression equation for each regression
line:
• Regression Equation of Y on X:
Ye = a + bX
Where Ye is the value of Y estimated for a given X, "a" is the intercept,
and "b" is the slope of the line.
The parameters "a" and "b" can be calculated using the least squares
method. According to this method, the line is drawn so that it passes as
close as possible to all the plotted points. In mathematical terms, the sum of the
squares of the vertical deviations of observed Y from the calculated
values of Y is the least. In other words, the best-fitted line is obtained
when ∑ (Y − Ye)² is the minimum. This condition leads to the following two
normal equations, where N is the number of observations:
∑ Y = Na + b ∑ X
∑ XY = a ∑ X + b ∑ X²
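• Regression Equation of X on Y:
Xe = a + bY
Where Xe is the value of X estimated for a given Y, and "a" and "b" are the
intercept and slope of this second line (in general they differ from those
of the line of Y on X).
To calculate the values of parameters “a” and “b” in this equation, we
need to simultaneously solve the following two normal equations:
∑ X = Na + b ∑ Y
∑ XY = a ∑ Y + b ∑ Y²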
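The normal equations form a 2x2 linear system in "a" and "b", so they can be solved directly. The sketch below does this for the line of Y on X using five made-up observations (an assumption for illustration) and cross-checks the result against np.polyfit.

```python
import numpy as np

# Hypothetical observations, chosen only to illustrate the arithmetic
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
N = len(X)

# Normal equations for Y on X as a linear system in (a, b):
#   sum(Y)  = N * a      + b * sum(X)
#   sum(XY) = a * sum(X) + b * sum(X^2)
coeffs = np.array([[N, X.sum()],
                   [X.sum(), (X ** 2).sum()]])
rhs = np.array([Y.sum(), (X * Y).sum()])
a, b = np.linalg.solve(coeffs, rhs)

# Cross-check against NumPy's own least-squares line fit
b_ref, a_ref = np.polyfit(X, Y, deg=1)
print(f"a = {a:.4f}, b = {b:.4f}")
```

For the line of X on Y, the same code applies with the roles of X and Y swapped.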
When Should I Use Regression Analysis?
and rescue us in this game of variables, we need to take the help of
regression analysis for all the scenarios.
Now, let us understand how regression can help control the other
variables in the process.
quite different from the initial results. It depicted that coffee intake
lowers the risk of mortality while smoking increases it.
This model isolates the role of each variable while holding the other
variables constant. You can examine the effect of coffee intake while
controlling the smoking factor. On the other hand, you can also look at
smoking while controlling for coffee intake.
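The effect of holding another variable constant can be illustrated with simulated data. Everything in the sketch below is an assumption made for illustration: the numbers are randomly generated, not taken from any study, and the helper ols is a hypothetical name. The simulation is built so that smokers drink more coffee, smoking raises the risk, and coffee slightly lowers it.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Simulated (not real) data: smokers tend to drink more coffee
smoking = rng.normal(size=n)
coffee = 0.8 * smoking + 0.6 * rng.normal(size=n)
# Assumed ground truth: smoking raises risk, coffee lowers it slightly
risk = 2.0 * smoking - 0.5 * coffee + 0.5 * rng.normal(size=n)

def ols(features, target):
    """Least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

# Coffee alone appears harmful because it stands in for smoking
coffee_only = ols(coffee, risk)[1]
# With smoking in the model, the coffee coefficient turns negative
coffee_adjusted = ols(np.column_stack([coffee, smoking]), risk)[1]
print(f"coffee only: {coffee_only:.2f}, controlling for smoking: {coffee_adjusted:.2f}")
```

The sign of the coffee coefficient flips once smoking enters the model, which is the sense in which a multiple regression isolates the role of each variable.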
PROJECT BASED LEARNING AND IMPLEMENTATION
DATASET: BLACK FRIDAY
Description:
Black Friday is a colloquial term for the Friday after Thanksgiving in the
United States. It traditionally marks the start of the Christmas shopping
season in the United States. Many stores offer highly promoted sales at
discounted prices and often open early, sometimes as early as midnight
or even on Thanksgiving.
On Black Friday, big shopping giants like Amazon, Flipkart,
etc. lure customers by offering discounts and deals on different product
categories. The product categories include electronic items,
clothing, kitchen appliances, and décor. Various researchers have worked on
predicting sales, and the analysis of such data serves
as a basis for offering discounts on various product items. With the
purpose of analyzing and predicting the sales, we have used five
models. The Black Friday Sales Dataset available on Kaggle
has been used for analysis and prediction purposes. The models used
for prediction are Linear Regression, Lasso Regression, Ridge Regression,
Decision Tree Regressor, and Random Forest Regressor. Mean Squared
Error (MSE) is used as the performance evaluation measure. Random
Forest Regressor outperforms the other models with the lowest MSE score.
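A comparison of this kind can be sketched as follows. This is a minimal illustration on synthetic data from make_regression, not a reproduction of the reported results: the data, the split, and all hyperparameters are assumptions, and the ranking on such data need not match the one reported for the Black Friday dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; the real features would come from train.csv
X, y = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(alpha=1.0),
    "Ridge Regression": Ridge(alpha=1.0),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=0),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model and record its MSE on the held-out split
mse_scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    mse_scores[name] = mean_squared_error(y_val, model.predict(X_val))

for name, score in sorted(mse_scores.items(), key=lambda item: item[1]):
    print(f"{name}: MSE = {score:.1f}")
```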
CODE:-
import numpy as np
import pandas as pd
import seaborn as sns
In [2]:
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")
In [3]:
train_data.head() #head gives data of first 5 rows
Out[3]:
   User_ID  Product_ID  Gender    Age  Occupation  City_Category  Stay_In_Current_City_Years  Marital_Status  Product_Category_1  Product_Category_2  Product_Category_3  Purchase
0  1000001   P00069042       F   0-17          10              A                           2               0                   3                 NaN                 NaN      8370
1  1000001   P00248942       F   0-17          10              A                           2               0                   1                 6.0                14.0     15200
2  1000001   P00087842       F   0-17          10              A                           2               0                  12                 NaN                 NaN      1422
3  1000001   P00085442       F   0-17          10              A                           2               0                  12                14.0                 NaN      1057
4  1000002   P00285442       M    55+          16              C                          4+               0                   8                 NaN                 NaN      7969
In [4]:
test_data.head()
Out[4]:
   User_ID  Product_ID  Gender    Age  Occupation  City_Category  Stay_In_Current_City_Years  Marital_Status  Product_Category_1  Product_Category_2  Product_Category_3
0  1000004   P00128942       M  46-50           7              B                           2               1                   1                11.0                 NaN
1  1000009   P00113442       M  26-35          17              C                           0               0                   3                 5.0                 NaN
2  1000010   P00288442       F  36-45           1              B                          4+               1                   5                14.0                 NaN
3  1000010   P00145342       F  36-45           1              B                          4+               1                   4                 9.0                 NaN
4  1000011   P00053842       F  26-35           1              C                           1               0                   4                 5.0                12.0
In [5]:
train_data.info() # summary of the columns, their non-null counts and dtypes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User_ID 550068 non-null int64
1 Product_ID 550068 non-null object
2 Gender 550068 non-null object
3 Age 550068 non-null object
4 Occupation 550068 non-null int64
5 City_Category 550068 non-null object
6 Stay_In_Current_City_Years 550068 non-null object
7 Marital_Status 550068 non-null int64
8 Product_Category_1 550068 non-null int64
9 Product_Category_2 376430 non-null float64
10 Product_Category_3 166821 non-null float64
11 Purchase 550068 non-null int64
dtypes: float64(2), int64(5), object(5)
memory usage: 50.4+ MB
In [6]:
test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233599 entries, 0 to 233598
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User_ID 233599 non-null int64
1 Product_ID 233599 non-null object
2 Gender 233599 non-null object
3 Age 233599 non-null object
4 Occupation 233599 non-null int64
5 City_Category 233599 non-null object
6 Stay_In_Current_City_Years 233599 non-null object
7 Marital_Status 233599 non-null int64
8 Product_Category_1 233599 non-null int64
9 Product_Category_2 161255 non-null float64
10 Product_Category_3 71037 non-null float64
dtypes: float64(2), int64(4), object(5)
memory usage: 19.6+ MB
In [7]:
# Dropping the columns with null values and the identifier columns in train and test data
cols_to_drop = ['Product_Category_2','Product_Category_3','User_ID','Product_ID']
train_data.drop(cols_to_drop,axis=1,inplace=True)
test_data.drop(cols_to_drop,axis=1,inplace=True)
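The share of missing values behind this decision can be worked out from the non-null counts printed by train_data.info(). The sketch below only redoes that arithmetic with the counts copied from the output; on a loaded DataFrame, train_data.isnull().mean() would give the same shares.

```python
# Counts copied from the train_data.info() output
total_rows = 550068
non_null = {"Product_Category_2": 376430, "Product_Category_3": 166821}

# Share of missing entries per column
missing_share = {col: 1 - count / total_rows for col, count in non_null.items()}
for col, share in missing_share.items():
    print(f"{col}: {share:.1%} missing")
```

Roughly a third of Product_Category_2 and over two thirds of Product_Category_3 are missing, which is the motivation for dropping both columns rather than imputing them.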
In [8]:
train_data.head(10) # pass a number to head() to choose how many rows are shown
Out[8]:
   Gender    Age  Occupation  City_Category  Stay_In_Current_City_Years  Marital_Status  Product_Category_1  Purchase
0       F   0-17          10              A                           2               0                   3      8370
1       F   0-17          10              A                           2               0                   1     15200
2       F   0-17          10              A                           2               0                  12      1422
3       F   0-17          10              A                           2               0                  12      1057
4       M    55+          16              C                          4+               0                   8      7969
5       M  26-35          15              A                           3               0                   1     15227
6       M  46-50           7              B                           2               1                   1     19215
7       M  46-50           7              B                           2               1                   1     15854
8       M  46-50           7              B                           2               1                   1     15686
9       M  26-35          20              A                           1               1                   8      7871
In [9]:
test_data.head(10)
Out[9]:
   Gender    Age  Occupation  City_Category  Stay_In_Current_City_Years  Marital_Status  Product_Category_1
0       M  46-50           7              B                           2               1                   1
1       M  26-35          17              C                           0               0                   3
2       F  36-45           1              B                          4+               1                   5
3       F  36-45           1              B                          4+               1                   4
4       F  26-35           1              C                           1               0                   4
5       M  46-50           1              C                           3               1                   2
6       M  46-50           1              C                           3               1                   1
7       M  46-50           1              C                           3               1                   2
8       M  26-35           7              A                           1               0                  10
9       M  18-25          15              A                          4+               0                   5
In [10]:
sns.scatterplot(data=train_data,x='Occupation',y='Purchase')
Out[10]:
<AxesSubplot:xlabel='Occupation', ylabel='Purchase'>
In [11]:
sns.scatterplot(data=test_data,x='Occupation',y='Gender')
Out[11]:
<AxesSubplot:xlabel='Occupation', ylabel='Gender'>
In [12]:
sns.barplot(data=train_data,x='Occupation',y='Purchase')
Out[12]:
<AxesSubplot:xlabel='Occupation', ylabel='Purchase'>
In [13]:
sns.barplot(data=test_data,y='Occupation',x='Gender')
Out[13]:
<AxesSubplot:xlabel='Gender', ylabel='Occupation'>
In [14]:
sns.barplot(data=train_data,x='Marital_Status',y='Purchase')
Out[14]:
<AxesSubplot:xlabel='Marital_Status', ylabel='Purchase'>
In [15]:
sns.barplot(data=train_data,x='City_Category',y='Purchase')
Out[15]:
<AxesSubplot:xlabel='City_Category', ylabel='Purchase'>
In [16]:
sns.barplot(data=train_data,x='Age',y='Purchase')
Out[16]:
<AxesSubplot:xlabel='Age', ylabel='Purchase'>
In [17]:
# temporarily concatenating the train and test dataframes
df = pd.concat([train_data.assign(ind="train"),
test_data.assign(ind="test")],ignore_index=True)
In [18]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 783667 entries, 0 to 783666
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 783667 non-null object
1 Age 783667 non-null object
2 Occupation 783667 non-null int64
3 City_Category 783667 non-null object
4 Stay_In_Current_City_Years 783667 non-null object
5 Marital_Status 783667 non-null int64
6 Product_Category_1 783667 non-null int64
7 Purchase 550068 non-null float64
8 ind 783667 non-null object
dtypes: float64(1), int64(3), object(5)
memory usage: 53.8+ MB
In [19]:
# One hot encoding multiple categorical columns in combined dataset
complete_dataset = pd.get_dummies(data=df,drop_first=True, columns=['Gender',
'Age','City_Category','Stay_In_Current_City_Years'])
In [20]:
# Splitting the above dataset into train and test dataset
train_data2 = complete_dataset[complete_dataset["ind"].eq("train")].copy()
test_data2 = complete_dataset[complete_dataset["ind"].eq("test")].copy().reset_index(drop=True)
# Removing the unwanted column ind used for marking train and test data
train_data2.drop('ind',axis=1,inplace=True)
test_data2.drop('ind',axis=1,inplace=True)
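The reason for encoding the concatenated frame, rather than each frame on its own, can be seen on a toy example. The two small frames below are hypothetical: the test frame deliberately lacks one category, so separate encoding would give the two sets different columns.

```python
import pandas as pd

# Toy frames (hypothetical values): the test set lacks City_Category "C"
train = pd.DataFrame({"City_Category": ["A", "B", "C"], "Purchase": [100, 200, 300]})
test = pd.DataFrame({"City_Category": ["A", "B"]})

# Encoding the test frame on its own loses the City_Category_C column
separate_test = pd.get_dummies(test, columns=["City_Category"], drop_first=True)

# Encoding the concatenated frame keeps train and test columns aligned
combined = pd.concat([train.assign(ind="train"), test.assign(ind="test")],
                     ignore_index=True)
encoded = pd.get_dummies(combined, columns=["City_Category"], drop_first=True)
train_enc = encoded[encoded["ind"].eq("train")].drop(columns="ind")
test_enc = (encoded[encoded["ind"].eq("test")]
            .drop(columns=["ind", "Purchase"])
            .reset_index(drop=True))

print(list(separate_test.columns))
print(list(test_enc.columns))
```

A model fitted on the training columns can only score test rows that carry the same columns in the same order, which the combined encoding guarantees.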
In [21]:
# Splitting the training dataset into X_train, y_train
X_train, y_train = train_data2.drop("Purchase",axis=1), train_data2["Purchase"]
X_test = test_data2.drop("Purchase",axis=1) # Purchase is NaN for every test row
y_train.head()
Out[21]:
0 8370.0
1 15200.0
2 1422.0
3 1057.0
4 7969.0
Name: Purchase, dtype: float64
In [22]:
# Using feature selection to select the best features for regression
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
                           Specs         Score
2             Product_Category_1  73684.939778
11               City_Category_C   2055.228480
3                       Gender_M   2010.442472
0                     Occupation    238.831156
10               City_Category_B    200.691083
8                      Age_51-55    120.383197
4                      Age_18-25     42.903923
6                      Age_36-45     24.746972
13  Stay_In_Current_City_Years_2     15.790576
7                      Age_46-50      6.050446
9                        Age_55+      4.637896
14  Stay_In_Current_City_Years_3      2.402775
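The listing shows only the imports and the resulting scores, so the scoring step itself is not visible. One way such a "Specs"/"Score" table could be produced is sketched below. This is an assumption, not the notebook's actual cell: it uses synthetic features named strong, weak and irrelevant, and takes the column names Specs and Score from the output shown.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
# Synthetic stand-in features; only "strong" and "weak" influence the target
X_demo = pd.DataFrame({
    "strong": rng.normal(size=1000),
    "weak": rng.normal(size=1000),
    "irrelevant": rng.normal(size=1000),
})
y_demo = 5.0 * X_demo["strong"] + 0.5 * X_demo["weak"] + rng.normal(size=1000)

# Score every feature with the univariate F-test for regression
selector = SelectKBest(score_func=f_regression, k="all").fit(X_demo, y_demo)
feature_scores = pd.DataFrame({"Specs": X_demo.columns, "Score": selector.scores_})
print(feature_scores.nlargest(3, "Score"))
```

f_regression scores each feature separately by the strength of its linear relationship with the target, so a high score says nothing about interactions between features.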
In [23]:
X_train = X_train[["Product_Category_1","City_Category_C","Gender_M","Occupation","City_Category_B"]]
In [24]:
pip install LightGBM
Requirement already satisfied: LightGBM in c:\users\acer\appdata\local\progra
ms\python\python310\lib\site-packages (3.3.2)
Requirement already satisfied: scikit-learn!=0.22.0 in c:\users\acer\appdata\
local\programs\python\python310\lib\site-packages (from LightGBM) (1.1.1)
Requirement already satisfied: scipy in c:\users\acer\appdata\local\programs\
python\python310\lib\site-packages (from LightGBM) (1.8.1)
Requirement already satisfied: numpy in c:\users\acer\appdata\local\programs\
python\python310\lib\site-packages (from LightGBM) (1.22.4)
Requirement already satisfied: wheel in c:\users\acer\appdata\local\programs\
python\python310\lib\site-packages (from LightGBM) (0.37.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\acer\appdata\
local\programs\python\python310\lib\site-packages (from scikit-learn!=0.22.0-
>LightGBM) (3.1.0)
Requirement already satisfied: joblib>=1.0.0 in c:\users\acer\appdata\local\p
rograms\python\python310\lib\site-packages (from scikit-learn!=0.22.0->LightG
BM) (1.1.0)
Note: you may need to restart the kernel to use updated packages.
WARNING: There was an error checking the latest version of pip.
In [25]:
pip install XGBoost
Requirement already satisfied: XGBoost in c:\users\acer\appdata\local\program
s\python\python310\lib\site-packages (1.6.1)
Requirement already satisfied: numpy in c:\users\acer\appdata\local\programs\
python\python310\lib\site-packages (from XGBoost) (1.22.4)
Requirement already satisfied: scipy in c:\users\acer\appdata\local\programs\
python\python310\lib\site-packages (from XGBoost) (1.8.1)
Note: you may need to restart the kernel to use updated packages.
WARNING: There was an error checking the latest version of pip.
In [26]:
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
In [27]:
regressor = LGBMRegressor()
xgb_regressor = XGBRegressor()
In [28]:
regressor.fit(X_train,y_train)
xgb_regressor.fit(X_train,y_train)
Out[28]:
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
early_stopping_rounds=None, enable_categorical=False,
eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
importance_type=None, interaction_constraints='',
learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0,
num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
reg_lambda=1, ...)
In [29]:
from sklearn.metrics import mean_squared_error # regression metric; accuracy_score applies only to classification
from sklearn.model_selection import ShuffleSplit, cross_validate
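The cells that use ShuffleSplit and cross_validate are not shown in the listing. A minimal sketch of how they are typically combined for a regressor follows. It rests on several assumptions: synthetic data replaces X_train and y_train, scikit-learn's GradientBoostingRegressor stands in for the LightGBM and XGBoost models so that the example needs no extra packages, and the split sizes are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import ShuffleSplit, cross_validate

# Synthetic stand-in for (X_train, y_train)
X_cv, y_cv = make_regression(n_samples=500, n_features=5, noise=15.0, random_state=0)

# Stand-in model; any scikit-learn compatible regressor can be passed instead
model = GradientBoostingRegressor(random_state=0)

# Five random 80/20 splits of the training data
splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
results = cross_validate(model, X_cv, y_cv, cv=splitter,
                         scoring="neg_mean_squared_error")

# scikit-learn reports the negated MSE so that higher is always better
mse_per_split = -results["test_score"]
print("MSE per split:", mse_per_split.round(1))
print("mean MSE:", mse_per_split.mean().round(1))
```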
   Occupation  Marital_Status  Product_Category_1      Purchase  Gender_M  Age_18-25  Age_26-35  Age_36-45  \
0           7               1                   1  13804.627712         1          0          0          0
1          17               0                   3  10207.345923         1          0          1          0
2           1               1                   5   6158.648015         0          0          0          1
3           1               1                   4   2380.859256         0          0          0          1
4           1               0                   4   2448.489456         0          0          1          0

   Age_46-50  Age_51-55  Age_55+  City_Category_B  City_Category_C  \
0          1          0        0                1                0
1          0          0        0                0                1
2          0          0        0                1                0
3          0          0        0                1                0
4          0          0        0                0                1

   Stay_In_Current_City_Years_1  Stay_In_Current_City_Years_2  Stay_In_Current_City_Years_3  Stay_In_Current_City_Years_4+
0                             0                             1                             0                              0
1                             0                             0                             0                              0
2                             0                             0                             0                              1
3                             0                             0                             0                              1
4                             1                             0                             0                              0
In [36]:
import math
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
In [37]:
from sklearn.datasets import make_blobs
from sklearn.datasets import make_circles
from sklearn.preprocessing import StandardScaler
import seaborn as sns
In [40]:
# Generate sample data
X, y = make_circles(n_samples=1000, factor=0.009, noise=0.15)
#-------------------------------------------------------------
ads_arr = StandardScaler().fit_transform(X)
sns.scatterplot(x=ads_arr[:,0],y=ads_arr[:,1],hue=y)
Out[40]:
<AxesSubplot:>
In [41]:
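#distance calculation UDF
def minkowski_(point_a,point_b,p=2):
    if p==1:
        print('----> Manhattan')
        dist = np.sum(abs(point_a-point_b))
        print('Manual Distance :',dist)
    elif p==2:
        #print('----> Euclidean')
        dist = np.sqrt(np.sum(np.square(point_a-point_b)))
        #print('Manual Distance :',dist)
    else:
        #General Minkowski distance of order p, so that dist is always defined
        dist = np.sum(abs(point_a-point_b)**p)**(1/p)
    return dist
#------------------------------------------------------------------
#UDF for Calculation of distance from a point to every other point
def distance_to_all(curr_vec,data,p_=2):
    #curr_vec = X_arr[0] #example
    distance_list = []
    #data = X_arr[0:5]
    #Loop assumed here; the listing does not show how distance_list is filled
    for other_vec in data:
        distance_list.append(minkowski_(point_a=curr_vec,point_b=other_vec,p=p_))
    return distance_list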
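The two special cases handled above are instances of one formula: the Minkowski distance of order p is the p-th root of the sum of the absolute coordinate differences raised to the power p. The sketch below writes that formula once and checks it on a 3-4-5 triangle. The function name minkowski_distance is illustrative and is not part of the notebook.

```python
import numpy as np

def minkowski_distance(point_a, point_b, p=2):
    """Minkowski distance of order p between two equal-length vectors."""
    diff = np.abs(np.asarray(point_a, dtype=float) - np.asarray(point_b, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
manhattan = minkowski_distance(a, b, p=1)  # |3| + |4|
euclidean = minkowski_distance(a, b, p=2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

With p=1 the formula reduces to the Manhattan distance and with p=2 to the Euclidean distance, which is the metric the rest of the notebook uses.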
In [49]:
def core_border_noise_mapping(data=ads_arr,min_points=10,epsilon=0.38):
    #-----------------------------------------------------------------------------------------
    #Mapping each point to the indices of all points within epsilon of it
    #(the initialisation and the np.where step are assumed; the listing omits them)
    idx_dict = {}
    for idx in range(len(data)): #For each point of data
        current_arr = data[idx]
        current_to_all = np.array(distance_to_all(curr_vec=current_arr,data=data,p_=2)) #Calculating distance
        idx_dict.update({idx : np.where(current_to_all<=epsilon)[0]})
    #-----------------------------------------------------------------------------------------
    idx_dict_updated = {} #Copy of the mapping dict with removal of self distance (a point has 0 distance with itself)
    for key,val in idx_dict.items():
        val_ = val[val!=key]
        idx_dict_updated.update({key : val_})
    #-----------------------------------------------------------------------------------------
    #Number of neighbours of each point (assumed definition of nmin_dict)
    nmin_dict = {key : len(val) for key,val in idx_dict_updated.items()}
    core = []
    interim = []
    #Classifying between Core and non-core (interim) points through nmin parameter
    for (key,value) in nmin_dict.items():
        if value>=min_points:
            core.append(key)
        else:
            interim.append(key)
    #----------------------------------------------------
    print('Total core points :',len(core))
    print('Total interim points :',len(interim))
    #----------------------------------------------------
    #Calculating the directly density reachable points (All points which are within eps of any point)
    density_reachable = []
    for key_ in idx_dict_updated.keys():
        val = list(idx_dict_updated[key_])
        density_reachable += val
    density_reachable = set(density_reachable) #Unique indices, for fast membership tests
    print('Total density reachable points :',len(density_reachable))
    #----------------------------------------------------
    noise = []
    border = []
    #Classifying between border and noisy points
    for idx in interim:
        if idx in density_reachable:
            border.append(idx)
        else:
            noise.append(idx)
    print('Total noisy points :',len(noise))
    print('Total border points :',len(border))
    return core,border,noise,idx_dict_updated
In [50]:
#!pip install kneed #If not already installed
from sklearn.neighbors import NearestNeighbors #Nearest Neighbor Calculator
#from kneed import KneeLocator
nbrs = NearestNeighbors(n_neighbors=6,algorithm='auto',metric='minkowski',p=2,n_jobs=-1).fit(ads_arr)
distances, indices = nbrs.kneighbors(ads_arr)
#-------------------------------------------------------------------------------------------------
distances = np.sort(distances[:,-1], axis=0)
i = np.arange(len(distances))
#knee = KneeLocator(i, distances, S=1, curve='convex', direction='increasing', interp_method='polynomial')
#-------------------------------------------------------------------------------------------------
sns.set_style('darkgrid')
ax = sns.lineplot(x=range(0,len(ads_arr)),y=distances,color='g')
ax.set(xlabel='No of Points',ylabel='Distance of kth neighbor',title='Elbow Curve of kth distance-vs-point')
#knee.plot_knee()
Out[50]:
[Text(0.5, 0, 'No of Points'),
Text(0, 0.5, 'Distance of kth neighbor'),
Text(0.5, 1.0, 'Elbow Curve of kth distance-vs-point')]
In [51]:
epsilon = 0.31 #Calculate through Nearest Neighbours Distance graph
min_points = 5
core,border,noise,idx_dict_updated = core_border_noise_mapping(data=ads_arr,min_points=min_points,epsilon=epsilon)
Total core points : 963
Total interim points : 37
Total density reachable points : 999
Total noisy points : 1
Total border points : 36
In [56]:
def expand_clusters(point, neighbors_, border_, core_, idx_dict_updated_):
    i = 0 #Initializing
    nextPoint = neighbors_[i]
    counter_assign.append(nextPoint)
    #print('Next point else :',nextPoint)
    nextNeighbors = list(idx_dict_updated_[nextPoint])
    #return clusters_
In [57]:
clusters = np.array([np.nan]*len(ads_arr)) #Initializing clusters with required size and nan value
print('Cluster length :',clusters.shape)
for idx in range(len(ads_arr)): #For each data point
    expand_clusters(point=idx,
                    neighbors_=neighbors,
                    border_=border,
                    core_=core,
                    idx_dict_updated_=idx_dict_updated) #Greedy search algo to allot cluster till the edge
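Since the cluster-expansion cells are only partly shown, the overall procedure can be illustrated with a compact, self-contained sketch. This is not the notebook's own implementation: the function name dbscan_sketch is hypothetical, random_state is added for repeatability, and, following scikit-learn's convention, a point counts itself among its neighbours when compared with min_samples (the manual mapping above excludes the point itself).

```python
from collections import deque

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_circles
from sklearn.preprocessing import StandardScaler

def dbscan_sketch(data, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(data)
    # Pairwise Euclidean distances; quadratic in memory, fine for small n
    dists = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    # Core points have at least min_samples points (self included) within eps
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])
    labels = np.full(n, -1)
    cluster_id = 0
    for start in range(n):
        if not is_core[start] or labels[start] != -1:
            continue  # clusters are only started from unlabelled core points
        labels[start] = cluster_id
        queue = deque([start])
        while queue:
            point = queue.popleft()
            for nb in neighbors[point]:
                if labels[nb] == -1:
                    labels[nb] = cluster_id
                    if is_core[nb]:  # border points join but do not expand
                        queue.append(nb)
        cluster_id += 1
    return labels

# Same kind of data as in the notebook, with a fixed seed (an assumption)
X_demo, _ = make_circles(n_samples=1000, factor=0.009, noise=0.15, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

labels = dbscan_sketch(X_demo, eps=0.31, min_samples=5)
reference = DBSCAN(eps=0.31, min_samples=5).fit(X_demo).labels_
print("clusters found:", len(set(labels) - {-1}))
print("agreement with scikit-learn:", (labels == reference).mean())
```

Points that are never reached from a core point keep the label -1, which corresponds to the noise points in the mapping above.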
In [59]:
sns.scatterplot(x=ads_arr[:,0],y=ads_arr[:,1],hue=clusters)
Out[59]:
<AxesSubplot:>
In [60]:
from sklearn.cluster import DBSCAN
#-------------------------------------------------------------------------------------
dbscan = DBSCAN(eps=epsilon,min_samples=min_points,metric='euclidean',n_jobs=-1)
dbscan.fit(ads_arr)
dbscan_labels = dbscan.labels_
In [61]:
sns.scatterplot(x=ads_arr[:,0],y=ads_arr[:,1],hue=dbscan_labels)
Out[61]:
<AxesSubplot:>
CONCLUSION
With traditional methods being of limited help to business growth in terms of
revenue, machine learning approaches prove to be an important
tool for shaping a business plan that takes the shopping
pattern of consumers into consideration.
Projecting sales from several factors, including last year's sales,
helps businesses adopt suitable strategies for increasing the sales of goods that
are in demand. The dataset used for the experimentation is the Black Friday
Sales Dataset from Kaggle [9].
The models used are Linear Regression, Lasso Regression, Ridge Regression,
Decision Tree Regressor, and Random Forest Regressor. The evaluation measure
used is Mean Squared Error (MSE).
Based on Table II, Random Forest Regressor is the most suitable model for predicting
sales on the given dataset.
Thus the proposed model will predict customer purchases on Black Friday and
give the retailer insight into customers' choice of products. This will result in
discounts based on customer-centric choices, thus benefiting the
retailer as well as the customer.
Thank You