
DSA -II Assignment

Q1. Plots of K-means and DBSCAN clusters on two Principal Components of PCA

Q2. Impact of changing the kernel of the SVM on the SVM performance

• Popular kernels are: Polynomial Kernel, Gaussian Kernel, Radial Basis Function (RBF), Laplace
RBF Kernel, Sigmoid Kernel, ANOVA RBF Kernel.

• Choosing the right kernel is crucial: if the implied feature-space transformation does not suit
the data, the model can perform very poorly.

• SVM performance improved after using a kernel. Recall was best with the Sigmoid kernel,
while precision was best with the Polynomial kernel (see the table below and the sketch after it).

Kernel       Recall   Precision
RBF          0.126    0.46
Polynomial   0.027    0.90
Sigmoid      0.412    0.23
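
The notebook itself only reports these SVM numbers, so the following is a minimal sketch of how
such a kernel comparison could be run, assuming the same x_train/y_train and x_test/y_test splits
created for the tree-based models in the code section:

In [ ]:
from sklearn.svm import SVC
from sklearn.metrics import recall_score, precision_score

# Compare recall/precision across kernels (sketch; split names assumed from later cells)
for kernel in ["rbf", "poly", "sigmoid"]:
    svm_clf = SVC(kernel=kernel, random_state=100)
    svm_clf.fit(x_train, y_train)
    y_pred = svm_clf.predict(x_test)
    print(kernel,
          "Recall:", round(recall_score(y_test, y_pred), 3),
          "Precision:", round(precision_score(y_test, y_pred), 3))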
Q3. Improvement in the recall seen on each algorithm (RF, XGB, LightGBM) because
of hyperparameter tuning

Yes, the recall value improved after hyperparameter tuning:

Model Name      Recall (Before HT)   Recall (After HT)
Random forest   0.35                 0.39
XGBoost         0.38                 0.382
LightGBM        0.37                 0.4
*HT: Hyperparameter tuning

Grid search results for RF, XGB & LightGBM appear in the code section; a recall-focused variant is sketched below.
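
The grid searches in the code section optimize GridSearchCV's default scorer, which is accuracy.
As a hedged sketch, tuning directly for recall instead (reusing the Random Forest grid from the
code section) would look like this:

In [ ]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Same grid as the code section, but scored on recall rather than accuracy
parameters = {'n_estimators': [100, 200, 300],
              'max_depth': [6, 8, 12],
              'criterion': ['entropy', 'gini']}
grid = GridSearchCV(RandomForestClassifier(random_state=100),
                    param_grid=parameters, scoring='recall')
grid.fit(x_train, y_train)
print(grid.best_params_, grid.best_score_)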


Q4. Improvement on the recall on the best accuracy across all algorithms

Yes, there has been an increase in accuracy along with the recall improvement:

Model Name      Model accuracy   Recall (Before HT)   Recall (After HT)
Random forest   0.793            0.35                 0.39
XGBoost         0.758            0.38                 0.382
LightGBM        0.747            0.37                 0.4

RandomForest Classifier results


XGBoost Classifier results

LightGBM Classifier results

Q5. Compare the order of feature importance across all the algorithms and mention the
differences observed

Random forest         XGBoost               LightGBM
Frequency             Cluster_1             Average_Ticket_Size
Average_Ticket_Size   Cluster_0             Recency
Recency               Cluster_-1            Frequency
Cluster_1             Frequency             Cluster_1
Cluster_0             Recency               Cluster_0
Cluster_-1            Average_Ticket_Size   Cluster_-1
Cluster_2             Cluster_2             Cluster_2

(Features ranked from most to least important for each model.)
The top three features in the Random Forest classifier are Frequency, Average_Ticket_Size
and Recency, which contribute to whether a customer will churn or not.

The top three features in the XGBoost classifier are Cluster_1, Cluster_0 and
Cluster_-1.

The top three features in the LightGBM classifier are Average_Ticket_Size, Recency and
Frequency. A sketch for reassembling this ranking from the fitted models follows.
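
As a minimal sketch (assuming the fitted classifiers rf_clst_clf, model_xgb_clf and
model_lgbm_clf and the x_train columns from the code section), the ranking table above can be
rebuilt like this:

In [ ]:
import pandas as pd

# Rank features from most to least important for each fitted model
def ranked_features(model, columns):
    return list(pd.Series(model.feature_importances_, index=columns)
                  .sort_values(ascending=False).index)

rank_df = pd.DataFrame({
    'Random forest': ranked_features(rf_clst_clf, x_train.columns),
    'XGBoost': ranked_features(model_xgb_clf, x_train.columns),
    'LightGBM': ranked_features(model_lgbm_clf, x_train.columns),
})
print(rank_df)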

Thank you.

For Python code, see next page


In [1]:
import pandas as pd

import numpy as np

import seaborn as sns

import time

import matplotlib.pyplot as plt

import pickle

import warnings

warnings.filterwarnings("ignore")

Loading Data
In [2]:
df = pickle.load( open( "data_all_onlineretail.p", "rb" ))

Data Processing:
In [3]:
df.head()

Out[3]:    Invoice StockCode                          Description  Quantity          InvoiceDate  Price Customer ID         Country
        0   489434     85048  15CM CHRISTMAS GLASS BALL 20 LIGHTS        12  2009-12-01 07:45:00   6.95       13085  United Kingdom
        1   489434    79323P                   PINK CHERRY LIGHTS        12  2009-12-01 07:45:00   6.75       13085  United Kingdom
        2   489434    79323W                  WHITE CHERRY LIGHTS        12  2009-12-01 07:45:00   6.75       13085  United Kingdom
        3   489434     22041          RECORD FRAME 7" SINGLE SIZE        48  2009-12-01 07:45:00   2.10       13085  United Kingdom
        4   489434     21232       STRAWBERRY CERAMIC TRINKET BOX        24  2009-12-01 07:45:00   1.25       13085  United Kingdom

In [4]:
df['Quantity'] = df['Quantity'].astype(float)

df['Price'] = df['Price'].astype(float)

df['InvoiceDate'] = pd.to_datetime(df["InvoiceDate"])

In [5]:
df.dtypes

Out[5]: Invoice object

StockCode object

Description object

Quantity float64

InvoiceDate datetime64[ns]

Price float64

Customer ID object

Country object

dtype: object

In [6]:
df.isna().sum()

Out[6]: Invoice 0

StockCode 0

Description 4382

Quantity 0

InvoiceDate 0

Price 0

Customer ID 243007

Country 0

dtype: int64

In [7]:
# Removing those missing customer ID's:

df_v2 = df.loc[~df['Customer ID'].isna()].reset_index(drop=True)

In [8]:
df_v2.shape

Out[8]: (824364, 8)

In [9]:
df_v2.isna().sum()

Out[9]: Invoice 0

StockCode 0

Description 0

Quantity 0

InvoiceDate 0

Price 0

Customer ID 0

Country 0

dtype: int64

In [10]:
# Creating Revenue columns:

df_v2['Revenue'] = df_v2['Quantity']*df_v2['Price']

In [11]:
df_UK = df_v2.loc[df_v2['Country']=="United Kingdom"].reset_index(drop=True)

df_UK.shape

Out[11]: (741301, 9)

In [12]:
df_UK.head()

Out[12]:   Invoice StockCode                          Description  Quantity          InvoiceDate  Price Customer ID         Country  Revenue
        0   489434     85048  15CM CHRISTMAS GLASS BALL 20 LIGHTS      12.0  2009-12-01 07:45:00   6.95       13085  United Kingdom     83.4
        1   489434    79323P                   PINK CHERRY LIGHTS      12.0  2009-12-01 07:45:00   6.75       13085  United Kingdom     81.0
        2   489434    79323W                  WHITE CHERRY LIGHTS      12.0  2009-12-01 07:45:00   6.75       13085  United Kingdom     81.0
        3   489434     22041          RECORD FRAME 7" SINGLE SIZE      48.0  2009-12-01 07:45:00   2.10       13085  United Kingdom    100.8
        4   489434     21232       STRAWBERRY CERAMIC TRINKET BOX      24.0  2009-12-01 07:45:00   1.25       13085  United Kingdom     30.0

In [13]:
# Creating Monthly columns:

df_UK['Date'] = df_UK['InvoiceDate'].dt.date

df_UK['Month'] = df_UK['InvoiceDate'].dt.month

df_UK['Quarter'] = df_UK['InvoiceDate'].dt.quarter

df_UK['Year'] = df_UK['InvoiceDate'].dt.year

In [14]:
df_UK.head()

Out[14]:   Invoice StockCode                          Description  Quantity          InvoiceDate  Price Customer ID         Country  Revenue        Date ...
        0   489434     85048  15CM CHRISTMAS GLASS BALL 20 LIGHTS      12.0  2009-12-01 07:45:00   6.95       13085  United Kingdom     83.4  2009-12-01 ...
        1   489434    79323P                   PINK CHERRY LIGHTS      12.0  2009-12-01 07:45:00   6.75       13085  United Kingdom     81.0  2009-12-01 ...
        2   489434    79323W                  WHITE CHERRY LIGHTS      12.0  2009-12-01 07:45:00   6.75       13085  United Kingdom     81.0  2009-12-01 ...
        3   489434     22041          RECORD FRAME 7" SINGLE SIZE      48.0  2009-12-01 07:45:00   2.10       13085  United Kingdom    100.8  2009-12-01 ...
        4   489434     21232       STRAWBERRY CERAMIC TRINKET BOX      24.0  2009-12-01 07:45:00   1.25       13085  United Kingdom     30.0  2009-12-01 ...

In [15]:
df_UK[['Month','Quarter','Year']].drop_duplicates().reset_index(drop=True).sort_values(['Year','Month'])

Out[15]: Month Quarter Year

0 12 4 2009

1 1 1 2010

2 2 1 2010

3 3 1 2010

4 4 2 2010

5 5 2 2010

6 6 2 2010

7 7 3 2010

8 8 3 2010

9 9 3 2010

10 10 4 2010

11 11 4 2010

12 12 4 2010

13 1 1 2011

14 2 1 2011

15 3 1 2011

16 4 2 2011

17 5 2 2011

18 6 2 2011

19 7 3 2011

20 8 3 2011

21 9 3 2011

22 10 4 2011

23 11 4 2011

24 12 4 2011

In [16]:
len(df_UK['Customer ID'].unique())

Out[16]: 5410

Creating RECENCY, FREQUENCY, MONETARY fields at customer level
In [17]:
df_daily_revenue = df_UK.groupby(['Customer ID','Date','Month','Quarter','Year']).agg({'Revenue':'sum'}).reset_index()

In [18]:
df_daily_revenue.head()

Out[18]: Customer ID Date Month Quarter Year Revenue

0 12346 2009-12-14 12 4 2009 90.0

1 12346 2009-12-18 12 4 2009 23.5

2 12346 2010-01-04 1 1 2010 45.0

3 12346 2010-01-14 1 1 2010 22.5


4 12346 2010-01-22 1 1 2010 22.5

In [19]:
df_daily_revenue['Frequency'] = 1

In [20]:
CURRENT_DATE = df_daily_revenue['Date'].max()

print(CURRENT_DATE)

2011-12-09

In [21]:
df_cust = df_daily_revenue.groupby(['Customer ID']).agg({'Revenue':'sum','Frequency':'sum','Date':'max'}).reset_index()

In [22]:
df_cust.head()

Out[22]: Customer ID Revenue Frequency Date

0 12346 -64.68 11 2011-01-18

1 12608 415.79 1 2010-10-31

2 12745 723.85 2 2010-08-10

3 12746 230.85 3 2010-06-30

4 12747 9164.59 31 2011-12-07

In [23]:
df_cust['Recency'] = CURRENT_DATE - df_cust['Date']

In [24]:
df_cust['Recency'] = df_cust['Recency'].astype('timedelta64[D]')
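
Note: astype('timedelta64[D]') works on the pandas version used here but is deprecated in recent
pandas releases. An equivalent that avoids the cast, as a sketch against the same df_cust frame,
would be:

In [ ]:
# Recency in whole days via datetime arithmetic and the .dt.days accessor
df_cust['Recency'] = (pd.to_datetime(CURRENT_DATE) - pd.to_datetime(df_cust['Date'])).dt.days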

In [25]:
df_cust.drop(columns = ['Date'], inplace=True)

In [26]:
df_cust['Average_Ticket_Size'] = df_cust['Revenue'] /df_cust['Frequency']

In [27]:
df_cust

Out[27]: Customer ID Revenue Frequency Recency Average_Ticket_Size

0 12346 -64.68 11 325.0 -5.880000

1 12608 415.79 1 404.0 415.790000

2 12745 723.85 2 486.0 361.925000

3 12746 230.85 3 527.0 76.950000

4 12747 9164.59 31 2.0 295.631935



... ... ... ... ... ...

5405 18283 2736.65 19 3.0 144.034211

5406 18284 436.68 2 429.0 218.340000

5407 18285 427.00 1 660.0 427.000000

5408 18286 1188.43 3 476.0 396.143333

5409 18287 4177.89 7 42.0 596.841429

5410 rows × 5 columns

Segmentation of customers based on RFM variables:
In [28]:
from sklearn.cluster import KMeans

from sklearn.cluster import DBSCAN

from sklearn.preprocessing import StandardScaler

from sklearn.preprocessing import MinMaxScaler

from sklearn import metrics

from sklearn.decomposition import PCA

from sklearn.manifold import TSNE

Scaling of Data
In [29]:
# Scaling the data

X_Vars = df_cust[['Frequency','Recency','Average_Ticket_Size']]

# Rescaling each feature to [0, 1] with MinMax scaling:

scaler = MinMaxScaler()

scaler.fit(X_Vars)

X_Vars_Scaled = scaler.transform(X_Vars)

In [30]:
X_Vars.head()

Out[30]: Frequency Recency Average_Ticket_Size

0 11 325.0 -5.880000

1 1 404.0 415.790000

2 2 486.0 361.925000

3 3 527.0 76.950000

4 31 2.0 295.631935

In [31]:
X_Vars_Scaled

Out[31]: array([[0.04854369, 0.4403794 , 0.66938676],

[0. , 0.54742547, 0.68062986],

[0.00485437, 0.65853659, 0.67919364],

...,

[0. , 0.89430894, 0.68092876],

[0.00970874, 0.64498645, 0.68010602],

[0.02912621, 0.05691057, 0.68545728]])

Implementing Kmeans
In [32]:
# Implementing Kmeans Algorithm:

# Creating the KMeans object and fitting

model_kmeans = KMeans(n_clusters=3)

model_kmeans.fit(X_Vars_Scaled)

# Predicting the cluster labels

labels = model_kmeans.predict(X_Vars_Scaled)

print(labels)

[2 2 1 ... 1 1 0]

In [33]:
set(labels)

Out[33]: {0, 1, 2}

In [34]:
model_kmeans.cluster_centers_

Out[34]: array([[0.03861861, 0.07094753, 0.67855074],

[0.00330358, 0.81538168, 0.67362668],

[0.01018753, 0.47373381, 0.67740482]])

In [35]:
# Finding the final centroids

centroids = model_kmeans.cluster_centers_

# Evaluating the quality of clusters

s = metrics.silhouette_score(X_Vars_Scaled, labels, metric='euclidean')

print(f"Silhouette Coefficient of Kmeans Clusters: {s:.2f}")

# plotting the clusters using two variables:

plt.scatter(X_Vars_Scaled[:, 0], X_Vars_Scaled[:, 1], c=labels, cmap="rainbow")

plt.show()

Silhouette Coefficient of Kmeans Clusters: 0.66

In [36]: df_cust['Kmeans_Cluster_ID'] = labels

In [37]:
df_cust

Out[37]: Customer ID Revenue Frequency Recency Average_Ticket_Size Kmeans_Cluster_ID

0 12346 -64.68 11 325.0 -5.880000 2

1 12608 415.79 1 404.0 415.790000 2

2 12745 723.85 2 486.0 361.925000 1

3 12746 230.85 3 527.0 76.950000 1

4 12747 9164.59 31 2.0 295.631935 0

... ... ... ... ... ... ...

5405 18283 2736.65 19 3.0 144.034211 0

5406 18284 436.68 2 429.0 218.340000 2

5407 18285 427.00 1 660.0 427.000000 1

5408 18286 1188.43 3 476.0 396.143333 1

5409 18287 4177.89 7 42.0 596.841429 0

5410 rows × 6 columns

In [38]:
# Finding the optimal K using SSD:

K = range(2,10)

sum_of_squared_distances = []

Silhoutte_Scores =[]

# Using Scikit Learn’s KMeans Algorithm to find sum of squared distances

for k in K:

model = KMeans(n_clusters=k).fit(X_Vars_Scaled)

s = metrics.silhouette_score(X_Vars_Scaled, model.predict(X_Vars_Scaled), metric='euclidean')


Silhoutte_Scores.append(s)

sum_of_squared_distances.append(model.inertia_)

plt.plot(K, sum_of_squared_distances, "bx-")

plt.xlabel("K values")

plt.ylabel("Sum of Squared Distances")

plt.title("Elbow Method")

plt.show()

In [39]:
Silhoutte_Scores

Out[39]: [0.703901183483134,

0.6577730374846207,

0.6227769084393752,

0.5972158551285756,

0.5568580013514581,

0.5487585756394782,

0.522675329365951,

0.5005652012801658]

In [40]:
silh_score_df_KMeans = pd.DataFrame(zip(K,K,Silhoutte_Scores,sum_of_squared_distances), columns = ["K_Values","Number_Of_Clusters","SilhoutteScore","sum_of_squared_distances"])

silh_score_df_KMeans

Out[40]: K_Values Number_Of_Clusters SilhoutteScore sum_of_squared_distances

0 2 2 0.703901 100.670793

1 3 3 0.657773 54.270376

2 4 4 0.622777 33.335792

3 5 5 0.597216 26.762086

4 6 6 0.556858 21.446518

5 7 7 0.548759 17.229781

6 8 8 0.522675 14.135381

7 9 9 0.500565 11.653419

Implementing DBSCAN
In [41]:
# Implementing DBSCAN (initial model):

model_dbscan = DBSCAN(eps=0.5, min_samples=10)

model_dbscan.fit(X_Vars_Scaled)

labels_dbscan = model_dbscan.labels_

# label=-1 means the point is an outlier; the rest of the values represent the cluster ID
print(labels_dbscan)

[0 0 0 ... 0 0 0]

In [42]:
print(set(labels_dbscan))

{0}
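
With the default eps=0.5 on MinMax-scaled data, every point lands inside one dense region, so this
initial DBSCAN returns a single cluster; the eps sweep below searches for a value that actually
separates the customers.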

In [43]:
eps_range = np.arange(0.1,1,0.01)

silhoutte_scores = []

N_Clusters = []

# Sweeping DBSCAN's eps and recording cluster counts and silhouette scores

for e in eps_range:

print(e)

model = DBSCAN(eps=e, min_samples=5).fit(X_Vars_Scaled)

nclusters = len(set(model.labels_)) - (1 if -1 in model.labels_ else 0)


try:

sil_score = metrics.silhouette_score(X_Vars_Scaled, model.labels_)


except:

sil_score = 0

print("An exception occurred")

N_Clusters.append(nclusters)

silhoutte_scores.append(sil_score)

0.10
0.11
0.12
...
0.39
0.40
An exception occurred
0.41
An exception occurred
...
0.99
An exception occurred

(Output condensed: every eps from 0.40 to 0.99 yields a single cluster, so the silhouette
computation fails and "An exception occurred" is printed.)

In [44]:
silh_score_df_DBSCAN = pd.DataFrame(zip(eps_range,N_Clusters,silhoutte_scores), columns = ["Eps_Value","Number_Of_Clusters","SilhoutteScore"])

In [45]:
silh_score_df_DBSCAN

Out[45]: Eps_Value Number_Of_Clusters SilhoutteScore

0 0.10 1 0.553983

1 0.11 1 0.553983

2 0.12 1 0.553983

3 0.13 2 0.417510

4 0.14 2 0.417510

... ... ... ...

85 0.95 1 0.000000

86 0.96 1 0.000000

87 0.97 1 0.000000

88 0.98 1 0.000000

89 0.99 1 0.000000

90 rows × 3 columns

In [46]:
silh_score_df_DBSCAN.loc[silh_score_df_DBSCAN['Number_Of_Clusters']==2]

Out[46]: Eps_Value Number_Of_Clusters SilhoutteScore

3 0.13 2 0.417510

4 0.14 2 0.417510

5 0.15 2 0.408073

6 0.16 2 0.401897

7 0.17 2 0.401654

8 0.18 2 0.401654

9 0.19 2 0.401654

In [47]:
# Implementing DBSCAN with the tuned eps:

model_dbscan = DBSCAN(eps=0.13, min_samples=10)


model_dbscan.fit(X_Vars_Scaled)

labels_dbscan = model_dbscan.labels_

# label=-1 means the point is an outlier; the rest of the values represent the cluster ID
print(set(labels_dbscan))

{0, -1}

In [48]:
df_cust['DBSCAN_ClusterId'] = labels_dbscan

In [49]:
plt.scatter(X_Vars_Scaled[:, 0], X_Vars_Scaled[:, 1], c=labels_dbscan, cmap="rainbow")

plt.show()

Implementing PCA dimensionality reduction


In [50]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)

pca_scale_results = pca.fit_transform(X_Vars_Scaled)

pca_df_scale = pd.DataFrame(pca_scale_results, columns=['pca1', 'pca2'])

plt.figure(figsize = (10,10))

plt.scatter(pca_df_scale.iloc[:,0], pca_df_scale.iloc[:,1], alpha=1, facecolor='lightslategray')
plt.xlabel('pca1')

plt.ylabel('pca2')

plt.show()

In [51]:
pca_df_scale

Out[51]: pca1 pca2

0 0.161949 0.031558

1 0.271621 -0.010256

2 0.382255 0.001154

3 0.437466 0.009116

4 -0.280744 0.102640

... ... ...

5405 -0.275916 0.044490

5406 0.305177 -0.003526

5407 0.617887 0.010357

5408 0.368436 0.005215



5409 -0.219777 -0.010212

5410 rows × 2 columns

Implementing DBSCAN on PCA


In [52]:
eps_range = np.arange(0.01,0.2,0.01)

silhoutte_scores = []

N_Clusters = []

# Sweeping DBSCAN's eps on the PCA components

for e in eps_range:

print(e)

model = DBSCAN(eps=e, min_samples=5).fit(pca_df_scale)

nclusters = len(set(model.labels_)) - (1 if -1 in model.labels_ else 0)


try:

sil_score = metrics.silhouette_score(pca_df_scale, model.labels_)

except:

sil_score = 0

print("An exception occurred")

N_Clusters.append(nclusters)

silhoutte_scores.append(sil_score)

0.01
0.02
0.03
...
0.19

In [53]:
silh_score_df_DBSCAN_pca = pd.DataFrame(zip(eps_range,N_Clusters,silhoutte_scores), columns = ["Eps_Value","Number_Of_Clusters","SilhoutteScore"])

In [54]:
silh_score_df_DBSCAN_pca

Out[54]: Eps_Value Number_Of_Clusters SilhoutteScore

0 0.01 6 -0.032801

1 0.02 3 -0.250099

2 0.03 1 0.412203

3 0.04 1 0.595276

4 0.05 1 0.595276

5 0.06 1 0.595276

6 0.07 1 0.595276

7 0.08 1 0.595276

8 0.09 1 0.603558

9 0.10 1 0.637769

10 0.11 1 0.637769

11 0.12 1 0.637769

12 0.13 2 0.253860

13 0.14 2 0.253860

14 0.15 2 0.253860

15 0.16 2 0.253860

16 0.17 2 0.253393

17 0.18 2 0.253393

18 0.19 2 0.253393

In [55]:
silh_score_df_DBSCAN_pca['SilhoutteScore'].max()

Out[55]: 0.637768576052979

In [56]:
pca_df = pca_df_scale.copy()

In [57]:
# Implementing DBSCAN on the PCA components:

model_dbscan = DBSCAN(eps=0.01, min_samples=10)


model_dbscan.fit(pca_df_scale)

labels_dbscan = model_dbscan.labels_

set(labels_dbscan)

pca_df['DBSCAN_pca_cluster_ID'] =labels_dbscan

In [58]:
plt.figure(figsize = (8,8))

sns.scatterplot(pca_df.iloc[:,0], pca_df.iloc[:,1], hue=pca_df['DBSCAN_pca_cluster_ID'], palette='Set1')
plt.legend()

plt.show()

Implementing KMeans on PCA


In [59]:
# Finding the optimal K using SSD:

K = range(2,10)

sum_of_squared_distances = []

Silhoutte_Scores =[]

# Using Scikit Learn’s KMeans Algorithm to find sum of squared distances

for k in K:

model = KMeans(n_clusters=k, random_state=100).fit(pca_df_scale)

s = metrics.silhouette_score(pca_df_scale, model.predict(pca_df_scale), metric='euclidean')


Silhoutte_Scores.append(s)

sum_of_squared_distances.append(model.inertia_)

plt.plot(K, sum_of_squared_distances, "bx-")

plt.xlabel("K values")

plt.ylabel("Sum of Squared Distances")

plt.title("Elbow Method")

plt.show()

silh_score_df_KMeans = pd.DataFrame(zip(K,K,Silhoutte_Scores), columns = ["K_Values","Number_Of_Clusters","SilhoutteScore"])


silh_score_df_KMeans

Out[59]: K_Values Number_Of_Clusters SilhoutteScore

0 2 2 0.705901

1 3 3 0.661167

2 4 4 0.627432

3 5 5 0.592428

4 6 6 0.565769

5 7 7 0.558290

6 8 8 0.527178

7 9 9 0.512092

In [60]:
# Implementing Kmeans Algorithm:

# Creating the KMeans object and fitting

model_kmeans = KMeans(n_clusters=3)

model_kmeans.fit(pca_df_scale)

# Predicting the cluster labels

labels_kmeans_pca = model_kmeans.predict(pca_df_scale)

print(labels_kmeans_pca)

pca_df['KMEANS_pca_cluster_ID'] =labels_kmeans_pca

[2 2 1 ... 1 1 0]

In [61]:
plt.figure(figsize = (8,8))

sns.scatterplot(pca_df.iloc[:,0], pca_df.iloc[:,1], hue=labels_kmeans_pca, palette='Set1')
plt.legend()

plt.show()

In [62]:
pca_df

Out[62]: pca1 pca2 DBSCAN_pca_cluster_ID KMEANS_pca_cluster_ID

0 0.161949 0.031558 -1 2

1 0.271621 -0.010256 0 2

2 0.382255 0.001154 0 1

3 0.437466 0.009116 0 1

4 -0.280744 0.102640 1 0

... ... ... ... ...

5405 -0.275916 0.044490 1 0

5406 0.305177 -0.003526 0 2

5407 0.617887 0.010357 0 1

5408 0.368436 0.005215 0 1

5409 -0.219777 -0.010212 1 0

5410 rows × 4 columns

In [63]:
df_cust[['DBSCAN_pca_cluster_ID','KMEANS_pca_cluster_ID']] = pca_df[['DBSCAN_pca_cluster_ID','KMEANS_pca_cluster_ID']]

In [64]:
df_cust

Out[64]: Customer ID  Revenue  Frequency  Recency  Average_Ticket_Size  Kmeans_Cluster_ID  DBSCAN_ClusterId ...

0 12346 -64.68 11 325.0 -5.880000 2

1 12608 415.79 1 404.0 415.790000 2

2 12745 723.85 2 486.0 361.925000 1

3 12746 230.85 3 527.0 76.950000 1

4 12747 9164.59 31 2.0 295.631935 0

... ... ... ... ... ... ...

5405 18283 2736.65 19 3.0 144.034211 0

5406 18284 436.68 2 429.0 218.340000 2

5407 18285 427.00 1 660.0 427.000000 1

5408 18286 1188.43 3 476.0 396.143333 1

5409 18287 4177.89 7 42.0 596.841429 0

5410 rows × 9 columns

In [65]:
df_churn = pd.read_csv("Churn_Indicator.csv", dtype=str)

df_churn

Out[65]: Customer ID Final_Churn_Ind

0 12346 False

1 12608 False

2 12745 False

3 12746 True

4 12747 False

... ... ...

3935 18283 False

3936 18284 False

3937 18285 False

3938 18286 False

3939 18287 False

3940 rows × 2 columns


In [66]:
df_cust = df_cust.merge(df_churn, how='left', on = ['Customer ID'])

In [67]:
df_cust['Final_Churn_Ind'].fillna('False', inplace=True)

In [68]:
df_cust['Churn_Ind'] = df_cust['Final_Churn_Ind']=='True'
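
Because Churn_Indicator.csv was read with dtype=str, Final_Churn_Ind holds the literal strings
'True'/'False' (which is why the fillna above uses the string 'False'); the comparison converts it
into a proper boolean Churn_Ind column.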

In [69]:
df_cust.drop(columns = ['Final_Churn_Ind'], inplace= True)

In [70]:
df_cust

Out[70]: Customer ID  Revenue  Frequency  Recency  Average_Ticket_Size  Kmeans_Cluster_ID  DBSCAN_ClusterId ...

0 12346 -64.68 11 325.0 -5.880000 2

1 12608 415.79 1 404.0 415.790000 2

2 12745 723.85 2 486.0 361.925000 1

3 12746 230.85 3 527.0 76.950000 1

4 12747 9164.59 31 2.0 295.631935 0

... ... ... ... ... ... ...

5405 18283 2736.65 19 3.0 144.034211 0

5406 18284 436.68 2 429.0 218.340000 2

5407 18285 427.00 1 660.0 427.000000 1

5408 18286 1188.43 3 476.0 396.143333 1

5409 18287 4177.89 7 42.0 596.841429 0

5410 rows × 10 columns

In [71]:
df_cust.set_index(['Customer ID'],inplace=True)

In [72]:
df_cust

Out[72]:             Revenue  Frequency  Recency  Average_Ticket_Size  Kmeans_Cluster_ID  DBSCAN_ClusterId ...
         Customer ID

12346 -64.68 11 325.0 -5.880000 2 0

12608 415.79 1 404.0 415.790000 2 0

12745 723.85 2 486.0 361.925000 1 0



12746 230.85 3 527.0 76.950000 1 0

12747 9164.59 31 2.0 295.631935 0 0

... ... ... ... ... ... ...

18283 2736.65 19 3.0 144.034211 0 0

18284 436.68 2 429.0 218.340000 2 0

18285 427.00 1 660.0 427.000000 1 0

18286 1188.43 3 476.0 396.143333 1 0

18287 4177.89 7 42.0 596.841429 0 0

5410 rows × 9 columns

In [73]:
x = df_cust[['Frequency', 'Recency', 'Average_Ticket_Size', 'DBSCAN_pca_cluster_ID']]

y = df_cust['Churn_Ind']

One-hot Encoding (creating dummy binary variables for categorical variables)
In [74]:
df_clust_dummy = pd.get_dummies(x['DBSCAN_pca_cluster_ID'], prefix='Cluster')

x[list(df_clust_dummy.columns)] = df_clust_dummy

In [75]:
x.drop(columns = ['DBSCAN_pca_cluster_ID'], inplace=True)

In [76]:
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test= train_test_split(x,y,test_size=0.3,random_state = 0)

In [77]:
# pct of train counts: Shows imbalanced dataset

print(y_train.value_counts())

y_train.value_counts(normalize =True)

False 3019

True 768

Name: Churn_Ind, dtype: int64

Out[77]: False 0.797201

True 0.202799

Name: Churn_Ind, dtype: float64

Balancing the imbalanced class with SMOTE


In [78]: #Balancing the class with SMOTE:

from imblearn.over_sampling import SMOTE

In [79]:
smt = SMOTE(sampling_strategy= 0.6, random_state = 100, k_neighbors = 5)

xtrain_smt, ytrain_smt = smt.fit_resample(x_train, y_train)
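
With 3,019 majority and 768 minority samples in the training split, sampling_strategy=0.6 asks
SMOTE to synthesize minority samples until the minority count reaches 0.6 × 3,019 ≈ 1,811, which
yields the 4,830-row resampled set shown below.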

In [80]:
xtrain_smt

Out[80]: Frequency Recency Average_Ticket_Size Cluster_-1 Cluster_0 Cluster_1 Cluster_2

0 5 10.000000 113.390000 0 0 1 0

1 1 590.000000 1049.660000 0 1 0 0

2 7 15.000000 233.178571 0 0 1 0

3 1 414.000000 226.700000 0 1 0 0

4 1 313.000000 403.300000 0 0 1 0

... ... ... ... ... ... ... ...

4825 1 575.678565 157.701723 0 1 0 0

4826 16 202.197448 579.772449 1 0 0 0

4827 19 22.773875 509.542828 0 0 1 0

4828 6 414.401542 469.348669 0 1 0 0

4829 1 620.480470 190.776118 0 1 0 0

4830 rows × 7 columns

In [81]:
ytrain_smt.value_counts(normalize =True)

Out[81]: False 0.625052

True 0.374948

Name: Churn_Ind, dtype: float64

Model Training on balanced Data


In [82]:
x_train = xtrain_smt

y_train = ytrain_smt

In [83]:
from sklearn.metrics import confusion_matrix, classification_report, precision_score, recall_score

In [84]:
def give_classifcation_metrics(actual, predicted):

# confusion matrix

matrix = confusion_matrix(actual,predicted, labels=[1,0])

df_cfmatrix = pd.DataFrame(matrix, index=[1,0], columns=[1,0] )

print('Confusion matrix : \n',df_cfmatrix,'\n')

# outcome values order in sklearn

tp, fn, fp, tn = confusion_matrix(actual,predicted,labels=[1,0]).reshape(-1)

tp_outcome_ser = pd.Series([tp, fn, fp, tn], index=['tp', 'fn', 'fp', 'tn'])

print('Outcome values : \n',tp_outcome_ser,'\n')

# Recall and Precision metrics

recall = recall_score(actual,predicted)

precision = precision_score(actual,predicted)

print('Recall : \n',recall,'\n')

print('Precision: \n',precision,'\n')

# classification report for precision, recall f1-score and accuracy

clf_report = classification_report(actual,predicted,labels=[1,0])

print('Classification report : \n',clf_report)

return(df_cfmatrix,tp_outcome_ser,recall,precision, clf_report)

RandomForest Grid Search & classification


In [85]:
from sklearn.model_selection import GridSearchCV

from sklearn.ensemble import RandomForestClassifier

rf_clst_clf=RandomForestClassifier()

parameters = {'random_state': [100, 200, 300],

'n_estimators': [100, 200, 300],

'max_depth': [6, 8, 12],

'criterion': ['entropy', 'gini']}

grid = GridSearchCV(estimator=rf_clst_clf, param_grid = parameters)

grid.fit(x_train, y_train)

print(" Results from Grid Search " )

print("\n The best estimator across ALL searched params:\n", grid.best_estimator_)

print("\n The best score across ALL searched params:\n", grid.best_score_)

print("\n The best parameters across ALL searched params:\n", grid.best_params_)

Results from Grid Search

The best estimator across ALL searched params:

RandomForestClassifier(max_depth=12, n_estimators=200, random_state=100)

The best score across ALL searched params:

0.7643892339544514

The best parameters across ALL searched params:

{'criterion': 'gini', 'max_depth': 12, 'n_estimators': 200, 'random_state': 100}

In [86]:
from sklearn.ensemble import RandomForestClassifier

rf_clst_clf = RandomForestClassifier(random_state=100, n_estimators=100, max_depth=6, criterion='gini')
# Train the model using the training sets

rf_clst_clf.fit(x_train,y_train)

Out[86]: RandomForestClassifier(max_depth=6, random_state=100)

In [87]:
# Predictions on the test data:

y_predicted_rf = rf_clst_clf.predict(x_test)

y_pred_probs = rf_clst_clf.predict_proba(x_test)

model_accuracy = rf_clst_clf.score(x_test, y_test)

print(model_accuracy)

RF_clf_output = give_classifcation_metrics(y_test,y_predicted_rf)

0.7935921133703019

Confusion matrix :

1 0

1 135 213

0 122 1153

Outcome values :

tp 135

fn 213

fp 122

tn 1153

dtype: int64

Recall :

0.3879310344827586

Precision:

0.5252918287937743

Classification report :

precision recall f1-score support

1 0.53 0.39 0.45 348

0 0.84 0.90 0.87 1275

accuracy 0.79 1623

macro avg 0.68 0.65 0.66 1623

weighted avg 0.78 0.79 0.78 1623

In [88]:
Importance = pd.DataFrame({'feature': list(x_train.columns),

'importance': rf_clst_clf.feature_importances_}).\

sort_values('importance', ascending = False)

In [89]:
Importance

Out[89]: feature importance

0 Frequency 0.388836

2 Average_Ticket_Size 0.220750

1 Recency 0.165903

5 Cluster_1 0.141582

4 Cluster_0 0.038268

3 Cluster_-1 0.024941

6 Cluster_2 0.019721

XGBoost Grid Search & classification


In [90]:
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

model_xgb_clf=XGBClassifier()

parameters = {'random_state': [100, 200, 300],

'n_estimators': [100,200,300], }

grid = GridSearchCV(estimator=model_xgb_clf, param_grid = parameters)

grid.fit(x_train, y_train)

print(" Results from Grid Search " )

print("\n The best estimator across ALL searched params:\n", grid.best_estimator_)

print("\n The best score across ALL searched params:\n", grid.best_score_)

print("\n The best parameters across ALL searched params:\n", grid.best_params_)

[18:59:05] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.0/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.

(The same warning repeats for every fit in the grid search.)

Results from Grid Search

The best estimator across ALL searched params:

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,

colsample_bynode=1, colsample_bytree=1, enable_categorical=False,

gamma=0, gpu_id=-1, importance_type=None,

interaction_constraints='', learning_rate=0.300000012,

max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,

monotone_constraints='()', n_estimators=200, n_jobs=8,

num_parallel_tree=1, predictor='auto', random_state=100,

reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,

tree_method='exact', validate_parameters=1, verbosity=None)

The best score across ALL searched params:

0.772463768115942

The best parameters across ALL searched params:

{'n_estimators': 200, 'random_state': 100}

In [91]:
model_xgb_clf=XGBClassifier(n_estimators=100, random_state=100 ).fit(x_train,y_train)

[18:59:20] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.0/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.

In [92]:
# Predictions on the test data:

y_predicted_xgb = model_xgb_clf.predict(x_test)

y_pred_probs = model_xgb_clf.predict_proba(x_test)

model_accuracy = model_xgb_clf.score(x_test, y_test)

print(model_accuracy)

XGB_clf_output = give_classifcation_metrics(y_test,y_predicted_xgb)

0.758471965495995

Confusion matrix :

1 0

1 133 215

0 177 1098

Outcome values :

tp 133

fn 215

fp 177

tn 1098

dtype: int64

Recall :

0.382183908045977

Precision:

0.4290322580645161

Classification report :

precision recall f1-score support

1 0.43 0.38 0.40 348

0 0.84 0.86 0.85 1275

accuracy 0.76 1623

macro avg 0.63 0.62 0.63 1623

weighted avg 0.75 0.76 0.75 1623

In [93]:
Importance = pd.DataFrame({'feature': list(x_train.columns),

'importance': model_xgb_clf.feature_importances_}).\

sort_values('importance', ascending = False)

In [94]:
Importance

Out[94]: feature importance

5 Cluster_1 0.508694

4 Cluster_0 0.184712

3 Cluster_-1 0.121934

0 Frequency 0.080018

1 Recency 0.052784

2 Average_Ticket_Size 0.051857

6 Cluster_2 0.000000

LGBM Grid Search & classification


In [95]:
from lightgbm import LGBMClassifier

model_lgbm_clf = LGBMClassifier()

parameters = {'random_state': [100, 200, 300],

'n_estimators': [100,200,300,500,1000], }

grid = GridSearchCV(estimator=model_lgbm_clf, param_grid = parameters)

grid.fit(x_train, y_train)

print(" Results from Grid Search " )

print("\n The best estimator across ALL searched params:\n", grid.best_estimator_)

print("\n The best score across ALL searched params:\n", grid.best_score_)

print("\n The best parameters across ALL searched params:\n", grid.best_params_)

Results from Grid Search

The best estimator across ALL searched params:

LGBMClassifier(n_estimators=1000, random_state=100)

The best score across ALL searched params:

0.7853002070393375

The best parameters across ALL searched params:

{'n_estimators': 1000, 'random_state': 100}

In [96]: model_lgbm_clf = LGBMClassifier(n_estimators=1000, random_state=100).fit(x_train,y_train)

In [97]:
# Predictions on the test data:

y_predicted_lgbm = model_lgbm_clf.predict(x_test)

y_pred_probs = model_lgbm_clf.predict_proba(x_test)

model_accuracy = model_lgbm_clf.score(x_test, y_test)

print(model_accuracy)

lgbm_clf_output = give_classifcation_metrics(y_test,y_predicted_lgbm)

0.7479975354282193

Confusion matrix :

1 0

1 138 210

0 199 1076

Outcome values :

tp 138

fn 210

fp 199

tn 1076

dtype: int64

Recall :

0.39655172413793105

Precision:

0.4094955489614243

Classification report :

precision recall f1-score support

1 0.41 0.40 0.40 348

0 0.84 0.84 0.84 1275

accuracy 0.75 1623

macro avg 0.62 0.62 0.62 1623

weighted avg 0.75 0.75 0.75 1623

In [98]:
Importance = pd.DataFrame({'feature': list(x_train.columns),

'importance': model_lgbm_clf.feature_importances_}).\

sort_values('importance', ascending = False)

In [99]:
Importance

Out[99]: feature importance

2 Average_Ticket_Size 12814

1 Recency 11827

0 Frequency 4996

5 Cluster_1 154

4 Cluster_0 108

3 Cluster_-1 97

6 Cluster_2 4
THANK YOU VAMSEE SIR