
DSA -II Assignment

Q1. Plots of K-means and DBSCAN clusters on two Principal Components of PCA

Q2. Impact of changing the kernel of the SVM on the SVM performance

• Popular kernels are: Polynomial Kernel, Gaussian Kernel, Radial Basis Function (RBF), Laplace
RBF Kernel, Sigmoid Kernel, ANOVA RBF Kernel.

• Choosing the right kernel is crucial: if the implied feature-space transformation does not suit
the data, the model can perform very poorly.

• SVM performance improved after using a kernel. Recall was best with the Sigmoid kernel,
while precision was best with the Polynomial kernel (see the table below and the sketch after it).

Kernel       Recall   Precision
RBF          0.126    0.46
Polynomial   0.027    0.90
Sigmoid      0.412    0.23
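
The notebook itself only reports these SVM numbers, so the following is a minimal sketch of how
such a kernel comparison could be run, assuming the same x_train/y_train and x_test/y_test splits
created for the tree-based models in the code section:

In [ ]:
from sklearn.svm import SVC
from sklearn.metrics import recall_score, precision_score

# Compare recall/precision across kernels (sketch; split names assumed from later cells)
for kernel in ["rbf", "poly", "sigmoid"]:
    svm_clf = SVC(kernel=kernel, random_state=100)
    svm_clf.fit(x_train, y_train)
    y_pred = svm_clf.predict(x_test)
    print(kernel,
          "Recall:", round(recall_score(y_test, y_pred), 3),
          "Precision:", round(precision_score(y_test, y_pred), 3))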
Q3. Improvement in the recall seen on each algorithm (RF, XGB, LightGBM) because
of hyperparameter tuning

Yes, the recall value improved after hyperparameter tuning:

Model Name      Recall (Before HT)   Recall (After HT)
Random forest   0.35                 0.39
XGBoost         0.38                 0.382
LightGBM        0.37                 0.4
*HT: Hyperparameter tuning

Grid search results for RF, XGB & LightGBM appear in the code section; a recall-focused variant is sketched below.
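
The grid searches in the code section optimize GridSearchCV's default scorer, which is accuracy.
As a hedged sketch, tuning directly for recall instead (reusing the Random Forest grid from the
code section) would look like this:

In [ ]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Same grid as the code section, but scored on recall rather than accuracy
parameters = {'n_estimators': [100, 200, 300],
              'max_depth': [6, 8, 12],
              'criterion': ['entropy', 'gini']}
grid = GridSearchCV(RandomForestClassifier(random_state=100),
                    param_grid=parameters, scoring='recall')
grid.fit(x_train, y_train)
print(grid.best_params_, grid.best_score_)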


Q4. Improvement on the recall on the best accuracy across all algorithms

Yes, there has been an increase in accuracy along with the recall improvement:

Model Name      Model accuracy   Recall (Before HT)   Recall (After HT)
Random forest   0.793            0.35                 0.39
XGBoost         0.758            0.38                 0.382
LightGBM        0.747            0.37                 0.4

RandomForest Classifier results


XGBoost Classifier results

LightGBM Classifier results

Q5. Compare the order of feature importance across all the algorithms and mention the
differences observed

Random forest         XGBoost               LightGBM
Frequency             Cluster_1             Average_Ticket_Size
Average_Ticket_Size   Cluster_0             Recency
Recency               Cluster_-1            Frequency
Cluster_1             Frequency             Cluster_1
Cluster_0             Recency               Cluster_0
Cluster_-1            Average_Ticket_Size   Cluster_-1
Cluster_2             Cluster_2             Cluster_2

(Features ranked from most to least important for each model.)
The top three features in the Random Forest classifier are Frequency, Average_Ticket_Size
and Recency, which contribute to whether a customer will churn or not.

The top three features in the XGBoost classifier are Cluster_1, Cluster_0 and
Cluster_-1.

The top three features in the LightGBM classifier are Average_Ticket_Size, Recency and
Frequency. A sketch for reassembling this ranking from the fitted models follows.
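
As a minimal sketch (assuming the fitted classifiers rf_clst_clf, model_xgb_clf and
model_lgbm_clf and the x_train columns from the code section), the ranking table above can be
rebuilt like this:

In [ ]:
import pandas as pd

# Rank features from most to least important for each fitted model
def ranked_features(model, columns):
    return list(pd.Series(model.feature_importances_, index=columns)
                  .sort_values(ascending=False).index)

rank_df = pd.DataFrame({
    'Random forest': ranked_features(rf_clst_clf, x_train.columns),
    'XGBoost': ranked_features(model_xgb_clf, x_train.columns),
    'LightGBM': ranked_features(model_lgbm_clf, x_train.columns),
})
print(rank_df)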

Thank you.

For Python code, see next page


In [1]:
import pandas as pd

import numpy as np

import seaborn as sns

import time

import matplotlib.pyplot as plt

import pickle

import warnings

warnings.filterwarnings("ignore")

Loading Data
In [2]:
df = pickle.load( open( "data_all_onlineretail.p", "rb" ))

Data Processing:
In [3]:
df.head()

Out[3]:    Invoice StockCode                          Description  Quantity          InvoiceDate  Price Customer ID         Country
        0   489434     85048  15CM CHRISTMAS GLASS BALL 20 LIGHTS        12  2009-12-01 07:45:00   6.95       13085  United Kingdom
        1   489434    79323P                   PINK CHERRY LIGHTS        12  2009-12-01 07:45:00   6.75       13085  United Kingdom
        2   489434    79323W                  WHITE CHERRY LIGHTS        12  2009-12-01 07:45:00   6.75       13085  United Kingdom
        3   489434     22041          RECORD FRAME 7" SINGLE SIZE        48  2009-12-01 07:45:00   2.10       13085  United Kingdom
        4   489434     21232       STRAWBERRY CERAMIC TRINKET BOX        24  2009-12-01 07:45:00   1.25       13085  United Kingdom

In [4]:
df['Quantity'] = df['Quantity'].astype(float)

df['Price'] = df['Price'].astype(float)

df['InvoiceDate'] = pd.to_datetime(df["InvoiceDate"])

In [5]:
df.dtypes

Out[5]: Invoice object

StockCode object

Description object

Quantity float64

InvoiceDate datetime64[ns]

Price float64

Customer ID object

Country object

dtype: object

In [6]:
df.isna().sum()

Out[6]: Invoice 0

StockCode 0

Description 4382

Quantity 0

InvoiceDate 0

Price 0

Customer ID 243007

Country 0

dtype: int64

In [7]:
# Removing those missing customer ID's:

df_v2 = df.loc[~df['Customer ID'].isna()].reset_index(drop=True)

In [8]:
df_v2.shape

Out[8]: (824364, 8)

In [9]:
df_v2.isna().sum()

Out[9]: Invoice 0

StockCode 0

Description 0

Quantity 0

InvoiceDate 0

Price 0

Customer ID 0

Country 0

dtype: int64

In [10]:
# Creating Revenue columns:

df_v2['Revenue'] = df_v2['Quantity']*df_v2['Price']

In [11]:
df_UK = df_v2.loc[df_v2['Country']=="United Kingdom"].reset_index(drop=True)

df_UK.shape

Out[11]: (741301, 9)

In [12]:
df_UK.head()

Out[12]:   Invoice StockCode                          Description  Quantity          InvoiceDate  Price Customer ID         Country  Revenue
        0   489434     85048  15CM CHRISTMAS GLASS BALL 20 LIGHTS      12.0  2009-12-01 07:45:00   6.95       13085  United Kingdom     83.4
        1   489434    79323P                   PINK CHERRY LIGHTS      12.0  2009-12-01 07:45:00   6.75       13085  United Kingdom     81.0
        2   489434    79323W                  WHITE CHERRY LIGHTS      12.0  2009-12-01 07:45:00   6.75       13085  United Kingdom     81.0
        3   489434     22041          RECORD FRAME 7" SINGLE SIZE      48.0  2009-12-01 07:45:00   2.10       13085  United Kingdom    100.8
        4   489434     21232       STRAWBERRY CERAMIC TRINKET BOX      24.0  2009-12-01 07:45:00   1.25       13085  United Kingdom     30.0

In [13]:
# Creating Monthly columns:

df_UK['Date'] = df_UK['InvoiceDate'].dt.date

df_UK['Month'] = df_UK['InvoiceDate'].dt.month

df_UK['Quarter'] = df_UK['InvoiceDate'].dt.quarter

df_UK['Year'] = df_UK['InvoiceDate'].dt.year

In [14]:
df_UK.head()

Out[14]:   Invoice StockCode                          Description  Quantity          InvoiceDate  Price Customer ID         Country  Revenue        Date ...
        0   489434     85048  15CM CHRISTMAS GLASS BALL 20 LIGHTS      12.0  2009-12-01 07:45:00   6.95       13085  United Kingdom     83.4  2009-12-01 ...
        1   489434    79323P                   PINK CHERRY LIGHTS      12.0  2009-12-01 07:45:00   6.75       13085  United Kingdom     81.0  2009-12-01 ...
        2   489434    79323W                  WHITE CHERRY LIGHTS      12.0  2009-12-01 07:45:00   6.75       13085  United Kingdom     81.0  2009-12-01 ...
        3   489434     22041          RECORD FRAME 7" SINGLE SIZE      48.0  2009-12-01 07:45:00   2.10       13085  United Kingdom    100.8  2009-12-01 ...
        4   489434     21232       STRAWBERRY CERAMIC TRINKET BOX      24.0  2009-12-01 07:45:00   1.25       13085  United Kingdom     30.0  2009-12-01 ...

In [15]:
df_UK[['Month','Quarter','Year']].drop_duplicates().reset_index(drop=True).sort_values(['Year','Month'])

Out[15]: Month Quarter Year

0 12 4 2009

1 1 1 2010

2 2 1 2010

3 3 1 2010

4 4 2 2010

5 5 2 2010

6 6 2 2010

7 7 3 2010

8 8 3 2010

9 9 3 2010

10 10 4 2010

11 11 4 2010

12 12 4 2010

13 1 1 2011

14 2 1 2011

15 3 1 2011

16 4 2 2011

17 5 2 2011

18 6 2 2011

19 7 3 2011

20 8 3 2011

21 9 3 2011

22 10 4 2011

23 11 4 2011

24 12 4 2011

In [16]:
len(df_UK['Customer ID'].unique())

Out[16]: 5410

Creating RECENCY, FREQUENCY, MONETARY fields at customer level
In [17]:
df_daily_revenue = df_UK.groupby(['Customer ID','Date','Month','Quarter','Year']).agg({'Revenue':'sum'}).reset_index()

In [18]:
df_daily_revenue.head()

Out[18]: Customer ID Date Month Quarter Year Revenue

0 12346 2009-12-14 12 4 2009 90.0

1 12346 2009-12-18 12 4 2009 23.5

2 12346 2010-01-04 1 1 2010 45.0

3 12346 2010-01-14 1 1 2010 22.5


4 12346 2010-01-22 1 1 2010 22.5

In [19]:
df_daily_revenue['Frequency'] = 1

In [20]:
CURRENT_DATE = df_daily_revenue['Date'].max()

print(CURRENT_DATE)

2011-12-09

In [21]:
df_cust = df_daily_revenue.groupby(['Customer ID']).agg({'Revenue':'sum','Frequency':'sum','Date':'max'}).reset_index()

In [22]:
df_cust.head()

Out[22]: Customer ID Revenue Frequency Date

0 12346 -64.68 11 2011-01-18

1 12608 415.79 1 2010-10-31

2 12745 723.85 2 2010-08-10

3 12746 230.85 3 2010-06-30

4 12747 9164.59 31 2011-12-07

In [23]:
df_cust['Recency'] = CURRENT_DATE - df_cust['Date']

In [24]:
df_cust['Recency'] = df_cust['Recency'].astype('timedelta64[D]')
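
Note: astype('timedelta64[D]') works on the pandas version used here but is deprecated in recent
pandas releases. An equivalent that avoids the cast, as a sketch against the same df_cust frame,
would be:

In [ ]:
# Recency in whole days via datetime arithmetic and the .dt.days accessor
df_cust['Recency'] = (pd.to_datetime(CURRENT_DATE) - pd.to_datetime(df_cust['Date'])).dt.days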

In [25]:
df_cust.drop(columns = ['Date'], inplace=True)

In [26]:
df_cust['Average_Ticket_Size'] = df_cust['Revenue'] /df_cust['Frequency']

In [27]:
df_cust

Out[27]: Customer ID Revenue Frequency Recency Average_Ticket_Size

0 12346 -64.68 11 325.0 -5.880000

1 12608 415.79 1 404.0 415.790000

2 12745 723.85 2 486.0 361.925000

3 12746 230.85 3 527.0 76.950000

4 12747 9164.59 31 2.0 295.631935



... ... ... ... ... ...

5405 18283 2736.65 19 3.0 144.034211

5406 18284 436.68 2 429.0 218.340000

5407 18285 427.00 1 660.0 427.000000

5408 18286 1188.43 3 476.0 396.143333

5409 18287 4177.89 7 42.0 596.841429

5410 rows × 5 columns

Segmentation of customers based on RFM variables:
In [28]:
from sklearn.cluster import KMeans

from sklearn.cluster import DBSCAN

from sklearn.preprocessing import StandardScaler

from sklearn.preprocessing import MinMaxScaler

from sklearn import metrics

from sklearn.decomposition import PCA

from sklearn.manifold import TSNE

Scaling of Data
In [29]:
# Scaling the data

X_Vars = df_cust[['Frequency','Recency','Average_Ticket_Size']]

# Rescaling each feature to [0, 1] with MinMax scaling:

scaler = MinMaxScaler()

scaler.fit(X_Vars)

X_Vars_Scaled = scaler.transform(X_Vars)

In [30]:
X_Vars.head()

Out[30]: Frequency Recency Average_Ticket_Size

0 11 325.0 -5.880000

1 1 404.0 415.790000

2 2 486.0 361.925000

3 3 527.0 76.950000

4 31 2.0 295.631935

In [31]:
X_Vars_Scaled

Out[31]: array([[0.04854369, 0.4403794 , 0.66938676],

[0. , 0.54742547, 0.68062986],

[0.00485437, 0.65853659, 0.67919364],

...,

[0. , 0.89430894, 0.68092876],

[0.00970874, 0.64498645, 0.68010602],

[0.02912621, 0.05691057, 0.68545728]])

Implementing Kmeans
In [32]:
# Implementing Kmeans Algorithm:

# Creating the KMeans object and fitting

model_kmeans = KMeans(n_clusters=3)

model_kmeans.fit(X_Vars_Scaled)

# Predicting the cluster labels

labels = model_kmeans.predict(X_Vars_Scaled)

print(labels)

[2 2 1 ... 1 1 0]

In [33]:
set(labels)

Out[33]: {0, 1, 2}

In [34]:
model_kmeans.cluster_centers_

Out[34]: array([[0.03861861, 0.07094753, 0.67855074],

[0.00330358, 0.81538168, 0.67362668],

[0.01018753, 0.47373381, 0.67740482]])

In [35]:
# Finding the final centroids

centroids = model_kmeans.cluster_centers_

# Evaluating the quality of clusters

s = metrics.silhouette_score(X_Vars_Scaled, labels, metric='euclidean')

print(f"Silhouette Coefficient of Kmeans Clusters: {s:.2f}")

# plotting the clusters using two variables:

plt.scatter(X_Vars_Scaled[:, 0], X_Vars_Scaled[:, 1], c=labels, cmap="rainbow")

plt.show()

Silhouette Coefficient of Kmeans Clusters: 0.66

In [36]: df_cust['Kmeans_Cluster_ID'] = labels

In [37]:
df_cust

Out[37]: Customer ID Revenue Frequency Recency Average_Ticket_Size Kmeans_Cluster_ID

0 12346 -64.68 11 325.0 -5.880000 2

1 12608 415.79 1 404.0 415.790000 2

2 12745 723.85 2 486.0 361.925000 1

3 12746 230.85 3 527.0 76.950000 1

4 12747 9164.59 31 2.0 295.631935 0

... ... ... ... ... ... ...

5405 18283 2736.65 19 3.0 144.034211 0

5406 18284 436.68 2 429.0 218.340000 2

5407 18285 427.00 1 660.0 427.000000 1

5408 18286 1188.43 3 476.0 396.143333 1

5409 18287 4177.89 7 42.0 596.841429 0

5410 rows × 6 columns

In [38]:
# Finding the optimal K using SSD:

K = range(2,10)

sum_of_squared_distances = []

Silhoutte_Scores =[]

# Using Scikit Learn’s KMeans Algorithm to find sum of squared distances

for k in K:

model = KMeans(n_clusters=k).fit(X_Vars_Scaled)

s = metrics.silhouette_score(X_Vars_Scaled, model.predict(X_Vars_Scaled), metric='euclidean')


Silhoutte_Scores.append(s)

sum_of_squared_distances.append(model.inertia_)

plt.plot(K, sum_of_squared_distances, "bx-")

plt.xlabel("K values")

plt.ylabel("Sum of Squared Distances")

plt.title("Elbow Method")

plt.show()

In [39]:
Silhoutte_Scores

Out[39]: [0.703901183483134,

0.6577730374846207,

0.6227769084393752,

0.5972158551285756,

0.5568580013514581,

0.5487585756394782,

0.522675329365951,

0.5005652012801658]

In [40]:
silh_score_df_KMeans = pd.DataFrame(zip(K,K,Silhoutte_Scores,sum_of_squared_distances), columns = ["K_Values","Number_Of_Clusters","SilhoutteScore","sum_of_squared_distances"])

silh_score_df_KMeans

Out[40]: K_Values Number_Of_Clusters SilhoutteScore sum_of_squared_distances

0 2 2 0.703901 100.670793

1 3 3 0.657773 54.270376

2 4 4 0.622777 33.335792

3 5 5 0.597216 26.762086

4 6 6 0.556858 21.446518

5 7 7 0.548759 17.229781

6 8 8 0.522675 14.135381

7 9 9 0.500565 11.653419

Implementing DBSCAN
In [41]:
# Implementing DBSCAN (initial model):

model_dbscan = DBSCAN(eps=0.5, min_samples=10)

model_dbscan.fit(X_Vars_Scaled)

labels_dbscan = model_dbscan.labels_

# label=-1 means the point is an outlier; the rest of the values represent the cluster ID
print(labels_dbscan)

[0 0 0 ... 0 0 0]

In [42]:
print(set(labels_dbscan))

{0}
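
With the default eps=0.5 on MinMax-scaled data, every point lands inside one dense region, so this
initial DBSCAN returns a single cluster; the eps sweep below searches for a value that actually
separates the customers.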

In [43]:
eps_range = np.arange(0.1,1,0.01)

silhoutte_scores = []

N_Clusters = []

# Sweeping DBSCAN's eps and recording cluster counts and silhouette scores

for e in eps_range:

print(e)

model = DBSCAN(eps=e, min_samples=5).fit(X_Vars_Scaled)

nclusters = len(set(model.labels_)) - (1 if -1 in model.labels_ else 0)


try:

sil_score = metrics.silhouette_score(X_Vars_Scaled, model.labels_)


except:

sil_score = 0

print("An exception occurred")

N_Clusters.append(nclusters)

silhoutte_scores.append(sil_score)

0.10
0.11
0.12
...
0.39
0.40
An exception occurred
0.41
An exception occurred
...
0.99
An exception occurred

(Output condensed: every eps from 0.40 to 0.99 yields a single cluster, so the silhouette
computation fails and "An exception occurred" is printed.)

In [44]:
silh_score_df_DBSCAN = pd.DataFrame(zip(eps_range,N_Clusters,silhoutte_scores), columns = ["Eps_Value","Number_Of_Clusters","SilhoutteScore"])

In [45]:
silh_score_df_DBSCAN

Out[45]: Eps_Value Number_Of_Clusters SilhoutteScore

0 0.10 1 0.553983

1 0.11 1 0.553983

2 0.12 1 0.553983

3 0.13 2 0.417510

4 0.14 2 0.417510

... ... ... ...

85 0.95 1 0.000000

86 0.96 1 0.000000

87 0.97 1 0.000000

88 0.98 1 0.000000

89 0.99 1 0.000000

90 rows × 3 columns

In [46]:
silh_score_df_DBSCAN.loc[silh_score_df_DBSCAN['Number_Of_Clusters']==2]

Out[46]: Eps_Value Number_Of_Clusters SilhoutteScore

3 0.13 2 0.417510

4 0.14 2 0.417510

5 0.15 2 0.408073

6 0.16 2 0.401897

7 0.17 2 0.401654

8 0.18 2 0.401654

9 0.19 2 0.401654

In [47]:
# Implementing DBSCAN with the tuned eps:

model_dbscan = DBSCAN(eps=0.13, min_samples=10)


model_dbscan.fit(X_Vars_Scaled)

labels_dbscan = model_dbscan.labels_

# label=-1 means the point is an outlier; the rest of the values represent the cluster ID
print(set(labels_dbscan))

{0, -1}

In [48]:
df_cust['DBSCAN_ClusterId'] = labels_dbscan

In [49]:
plt.scatter(X_Vars_Scaled[:, 0], X_Vars_Scaled[:, 1], c=labels_dbscan, cmap="rainbow")

plt.show()

Implementing PCA dimensionality reduction


In [50]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)

pca_scale_results = pca.fit_transform(X_Vars_Scaled)

pca_df_scale = pd.DataFrame(pca_scale_results, columns=['pca1', 'pca2'])

plt.figure(figsize = (10,10))

plt.scatter(pca_df_scale.iloc[:,0], pca_df_scale.iloc[:,1], alpha=1, facecolor='lightslategray')
plt.xlabel('pca1')

plt.ylabel('pca2')

plt.show()

In [51]:
pca_df_scale

Out[51]: pca1 pca2

0 0.161949 0.031558

1 0.271621 -0.010256

2 0.382255 0.001154

3 0.437466 0.009116

4 -0.280744 0.102640

... ... ...

5405 -0.275916 0.044490

5406 0.305177 -0.003526

5407 0.617887 0.010357

5408 0.368436 0.005215



5409 -0.219777 -0.010212

5410 rows × 2 columns

Implementing DBSCAN on PCA


In [52]:
eps_range = np.arange(0.01,0.2,0.01)

silhoutte_scores = []

N_Clusters = []

# Sweeping DBSCAN's eps on the PCA components

for e in eps_range:

print(e)

model = DBSCAN(eps=e, min_samples=5).fit(pca_df_scale)

nclusters = len(set(model.labels_)) - (1 if -1 in model.labels_ else 0)


try:

sil_score = metrics.silhouette_score(pca_df_scale, model.labels_)

except:

sil_score = 0

print("An exception occurred")

N_Clusters.append(nclusters)

silhoutte_scores.append(sil_score)

0.01
0.02
0.03
...
0.19

In [53]:
silh_score_df_DBSCAN_pca = pd.DataFrame(zip(eps_range,N_Clusters,silhoutte_scores), columns = ["Eps_Value","Number_Of_Clusters","SilhoutteScore"])

In [54]:
silh_score_df_DBSCAN_pca

Out[54]: Eps_Value Number_Of_Clusters SilhoutteScore

0 0.01 6 -0.032801

1 0.02 3 -0.250099

2 0.03 1 0.412203

3 0.04 1 0.595276

4 0.05 1 0.595276

5 0.06 1 0.595276

6 0.07 1 0.595276

7 0.08 1 0.595276

8 0.09 1 0.603558

9 0.10 1 0.637769

10 0.11 1 0.637769

11 0.12 1 0.637769

12 0.13 2 0.253860

13 0.14 2 0.253860

14 0.15 2 0.253860

15 0.16 2 0.253860

16 0.17 2 0.253393

17 0.18 2 0.253393

18 0.19 2 0.253393

In [55]:
silh_score_df_DBSCAN_pca['SilhoutteScore'].max()

Out[55]: 0.637768576052979

In [56]:
pca_df = pca_df_scale.copy()

In [57]:
# Implementing DBSCAN on the PCA components:

model_dbscan = DBSCAN(eps=0.01, min_samples=10)


model_dbscan.fit(pca_df_scale)

labels_dbscan = model_dbscan.labels_

set(labels_dbscan)

pca_df['DBSCAN_pca_cluster_ID'] =labels_dbscan

In [58]:
plt.figure(figsize = (8,8))

sns.scatterplot(pca_df.iloc[:,0], pca_df.iloc[:,1], hue=pca_df['DBSCAN_pca_cluster_ID'], palette='Set1')
plt.legend()

plt.show()

Implementing KMeans on PCA


In [59]:
# Finding the optimal K using SSD:

K = range(2,10)

sum_of_squared_distances = []

Silhoutte_Scores =[]

# Using Scikit Learn’s KMeans Algorithm to find sum of squared distances

for k in K:

model = KMeans(n_clusters=k, random_state=100).fit(pca_df_scale)

s = metrics.silhouette_score(pca_df_scale, model.predict(pca_df_scale), metric='euclidean')


Silhoutte_Scores.append(s)

sum_of_squared_distances.append(model.inertia_)

plt.plot(K, sum_of_squared_distances, "bx-")

plt.xlabel("K values")

plt.ylabel("Sum of Squared Distances")

plt.title("Elbow Method")

plt.show()

silh_score_df_KMeans = pd.DataFrame(zip(K,K,Silhoutte_Scores), columns = ["K_Values","Number_Of_Clusters","SilhoutteScore"])


silh_score_df_KMeans

Out[59]: K_Values Number_Of_Clusters SilhoutteScore

0 2 2 0.705901

1 3 3 0.661167

2 4 4 0.627432

3 5 5 0.592428

4 6 6 0.565769

5 7 7 0.558290

6 8 8 0.527178

7 9 9 0.512092

In [60]:
# Implementing Kmeans Algorithm:

# Creating the KMeans object and fitting

model_kmeans = KMeans(n_clusters=3)

model_kmeans.fit(pca_df_scale)

# Predicting the cluster labels

labels_kmeans_pca = model_kmeans.predict(pca_df_scale)

print(labels_kmeans_pca)

pca_df['KMEANS_pca_cluster_ID'] =labels_kmeans_pca

[2 2 1 ... 1 1 0]

In [61]:
plt.figure(figsize = (8,8))

sns.scatterplot(pca_df.iloc[:,0], pca_df.iloc[:,1], hue=labels_kmeans_pca, palette='Set1')
plt.legend()

plt.show()

In [62]:
pca_df

Out[62]: pca1 pca2 DBSCAN_pca_cluster_ID KMEANS_pca_cluster_ID

0 0.161949 0.031558 -1 2

1 0.271621 -0.010256 0 2

2 0.382255 0.001154 0 1

3 0.437466 0.009116 0 1

4 -0.280744 0.102640 1 0

... ... ... ... ...

5405 -0.275916 0.044490 1 0

5406 0.305177 -0.003526 0 2

5407 0.617887 0.010357 0 1

5408 0.368436 0.005215 0 1

5409 -0.219777 -0.010212 1 0

5410 rows × 4 columns

In [63]:
df_cust[['DBSCAN_pca_cluster_ID','KMEANS_pca_cluster_ID']] = pca_df[['DBSCAN_pca_cluster_ID','KMEANS_pca_cluster_ID']]

In [64]:
df_cust

Out[64]: Customer ID  Revenue  Frequency  Recency  Average_Ticket_Size  Kmeans_Cluster_ID  DBSCAN_ClusterId ...

0 12346 -64.68 11 325.0 -5.880000 2

1 12608 415.79 1 404.0 415.790000 2

2 12745 723.85 2 486.0 361.925000 1

3 12746 230.85 3 527.0 76.950000 1

4 12747 9164.59 31 2.0 295.631935 0

... ... ... ... ... ... ...

5405 18283 2736.65 19 3.0 144.034211 0

5406 18284 436.68 2 429.0 218.340000 2

5407 18285 427.00 1 660.0 427.000000 1

5408 18286 1188.43 3 476.0 396.143333 1

5409 18287 4177.89 7 42.0 596.841429 0

5410 rows × 9 columns

In [65]:
df_churn = pd.read_csv("Churn_Indicator.csv", dtype=str)

df_churn

Out[65]: Customer ID Final_Churn_Ind

0 12346 False

1 12608 False

2 12745 False

3 12746 True

4 12747 False

... ... ...

3935 18283 False

3936 18284 False

3937 18285 False

3938 18286 False

3939 18287 False

3940 rows × 2 columns


In [66]:
df_cust = df_cust.merge(df_churn, how='left', on = ['Customer ID'])

In [67]:
df_cust['Final_Churn_Ind'].fillna('False', inplace=True)

In [68]:
df_cust['Churn_Ind'] = df_cust['Final_Churn_Ind']=='True'
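
Because Churn_Indicator.csv was read with dtype=str, Final_Churn_Ind holds the literal strings
'True'/'False' (which is why the fillna above uses the string 'False'); the comparison converts it
into a proper boolean Churn_Ind column.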

In [69]:
df_cust.drop(columns = ['Final_Churn_Ind'], inplace= True)

In [70]:
df_cust

Out[70]: Customer ID  Revenue  Frequency  Recency  Average_Ticket_Size  Kmeans_Cluster_ID  DBSCAN_ClusterId ...

0 12346 -64.68 11 325.0 -5.880000 2

1 12608 415.79 1 404.0 415.790000 2

2 12745 723.85 2 486.0 361.925000 1

3 12746 230.85 3 527.0 76.950000 1

4 12747 9164.59 31 2.0 295.631935 0

... ... ... ... ... ... ...

5405 18283 2736.65 19 3.0 144.034211 0

5406 18284 436.68 2 429.0 218.340000 2

5407 18285 427.00 1 660.0 427.000000 1

5408 18286 1188.43 3 476.0 396.143333 1

5409 18287 4177.89 7 42.0 596.841429 0

5410 rows × 10 columns

In [71]:
df_cust.set_index(['Customer ID'],inplace=True)

In [72]:
df_cust

Out[72]:             Revenue  Frequency  Recency  Average_Ticket_Size  Kmeans_Cluster_ID  DBSCAN_ClusterId ...
         Customer ID

12346 -64.68 11 325.0 -5.880000 2 0

12608 415.79 1 404.0 415.790000 2 0

12745 723.85 2 486.0 361.925000 1 0



12746 230.85 3 527.0 76.950000 1 0

12747 9164.59 31 2.0 295.631935 0 0

... ... ... ... ... ... ...

18283 2736.65 19 3.0 144.034211 0 0

18284 436.68 2 429.0 218.340000 2 0

18285 427.00 1 660.0 427.000000 1 0

18286 1188.43 3 476.0 396.143333 1 0

18287 4177.89 7 42.0 596.841429 0 0

5410 rows × 9 columns

In [73]:
x = df_cust[['Frequency', 'Recency', 'Average_Ticket_Size', 'DBSCAN_pca_cluster_ID']]

y = df_cust['Churn_Ind']

One-hot Encoding (creating dummy binary variables for categorical variables)
In [74]:
df_clust_dummy = pd.get_dummies(x['DBSCAN_pca_cluster_ID'], prefix='Cluster')

x[list(df_clust_dummy.columns)] = df_clust_dummy

In [75]:
x.drop(columns = ['DBSCAN_pca_cluster_ID'], inplace=True)

In [76]:
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test= train_test_split(x,y,test_size=0.3,random_state = 0)

In [77]:
# pct of train counts: Shows imbalanced dataset

print(y_train.value_counts())

y_train.value_counts(normalize =True)

False 3019

True 768

Name: Churn_Ind, dtype: int64

Out[77]: False 0.797201

True 0.202799

Name: Churn_Ind, dtype: float64

Balancing the imbalanced class with SMOTE


In [78]: #Balancing the class with SMOTE:

from imblearn.over_sampling import SMOTE

In [79]:
smt = SMOTE(sampling_strategy= 0.6, random_state = 100, k_neighbors = 5)

xtrain_smt, ytrain_smt = smt.fit_resample(x_train, y_train)
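
With 3,019 majority and 768 minority samples in the training split, sampling_strategy=0.6 asks
SMOTE to synthesize minority samples until the minority count reaches 0.6 × 3,019 ≈ 1,811, which
yields the 4,830-row resampled set shown below.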

In [80]:
xtrain_smt

Out[80]: Frequency Recency Average_Ticket_Size Cluster_-1 Cluster_0 Cluster_1 Cluster_2

0 5 10.000000 113.390000 0 0 1 0

1 1 590.000000 1049.660000 0 1 0 0

2 7 15.000000 233.178571 0 0 1 0

3 1 414.000000 226.700000 0 1 0 0

4 1 313.000000 403.300000 0 0 1 0

... ... ... ... ... ... ... ...

4825 1 575.678565 157.701723 0 1 0 0

4826 16 202.197448 579.772449 1 0 0 0

4827 19 22.773875 509.542828 0 0 1 0

4828 6 414.401542 469.348669 0 1 0 0

4829 1 620.480470 190.776118 0 1 0 0

4830 rows × 7 columns

In [81]:
ytrain_smt.value_counts(normalize =True)

Out[81]: False 0.625052

True 0.374948

Name: Churn_Ind, dtype: float64

Model Training on balanced Data


In [82]:
x_train = xtrain_smt

y_train = ytrain_smt

In [83]:
from sklearn.metrics import confusion_matrix, classification_report, precision_score, recall_score

In [84]:
def give_classifcation_metrics(actual, predicted):

# confusion matrix

matrix = confusion_matrix(actual,predicted, labels=[1,0])

df_cfmatrix = pd.DataFrame(matrix, index=[1,0], columns=[1,0] )

print('Confusion matrix : \n',df_cfmatrix,'\n')

# outcome values order in sklearn

tp, fn, fp, tn = confusion_matrix(actual,predicted,labels=[1,0]).reshape(-1)

tp_outcome_ser = pd.Series([tp, fn, fp, tn], index=['tp', 'fn', 'fp', 'tn'])

print('Outcome values : \n',tp_outcome_ser,'\n')

# Recall and Precision metrics

recall = recall_score(actual,predicted)

precision = precision_score(actual,predicted)

print('Recall : \n',recall,'\n')

print('Precision: \n',precision,'\n')

# classification report for precision, recall f1-score and accuracy

clf_report = classification_report(actual,predicted,labels=[1,0])

print('Classification report : \n',clf_report)

return(df_cfmatrix,tp_outcome_ser,recall,precision, clf_report)

RandomForest Grid Search & classification


In [85]:
from sklearn.model_selection import GridSearchCV

from sklearn.ensemble import RandomForestClassifier

rf_clst_clf=RandomForestClassifier()

parameters = {'random_state': [100, 200, 300],

'n_estimators': [100, 200, 300],

'max_depth': [6, 8, 12],

'criterion': ['entropy', 'gini']}

grid = GridSearchCV(estimator=rf_clst_clf, param_grid = parameters)

grid.fit(x_train, y_train)

print(" Results from Grid Search " )

print("\n The best estimator across ALL searched params:\n", grid.best_estimator_)

print("\n The best score across ALL searched params:\n", grid.best_score_)

print("\n The best parameters across ALL searched params:\n", grid.best_params_)

Results from Grid Search

The best estimator across ALL searched params:

RandomForestClassifier(max_depth=12, n_estimators=200, random_state=100)

The best score across ALL searched params:

0.7643892339544514

The best parameters across ALL searched params:

{'criterion': 'gini', 'max_depth': 12, 'n_estimators': 200, 'random_state': 100}

In [86]:
from sklearn.ensemble import RandomForestClassifier

rf_clst_clf = RandomForestClassifier(random_state=100, n_estimators=100, max_depth=6, criterion='gini')
# Train the model using the training sets

rf_clst_clf.fit(x_train,y_train)

Out[86]: RandomForestClassifier(max_depth=6, random_state=100)

In [87]:
# Predictions on the test data:

y_predicted_rf = rf_clst_clf.predict(x_test)

y_pred_probs = rf_clst_clf.predict_proba(x_test)

model_accuracy = rf_clst_clf.score(x_test, y_test)

print(model_accuracy)

RF_clf_output = give_classifcation_metrics(y_test,y_predicted_rf)

0.7935921133703019

Confusion matrix :

1 0

1 135 213

0 122 1153

Outcome values :

tp 135

fn 213

fp 122

tn 1153

dtype: int64

Recall :

0.3879310344827586

Precision:

0.5252918287937743

Classification report :

precision recall f1-score support

1 0.53 0.39 0.45 348

0 0.84 0.90 0.87 1275

accuracy 0.79 1623

macro avg 0.68 0.65 0.66 1623

weighted avg 0.78 0.79 0.78 1623

In [88]:
Importance = pd.DataFrame({'feature': list(x_train.columns),

'importance': rf_clst_clf.feature_importances_}).\

sort_values('importance', ascending = False)

In [89]:
Importance

Out[89]: feature importance

0 Frequency 0.388836

2 Average_Ticket_Size 0.220750

1 Recency 0.165903

5 Cluster_1 0.141582

4 Cluster_0 0.038268

3 Cluster_-1 0.024941

6 Cluster_2 0.019721

XGBoost Grid Search & classification


In [90]:
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

model_xgb_clf=XGBClassifier()

parameters = {'random_state': [100, 200, 300],

'n_estimators': [100,200,300], }

grid = GridSearchCV(estimator=model_xgb_clf, param_grid = parameters)

grid.fit(x_train, y_train)

print(" Results from Grid Search " )

print("\n The best estimator across ALL searched params:\n", grid.best_estimator_)

print("\n The best score across ALL searched params:\n", grid.best_score_)

print("\n The best parameters across ALL searched params:\n", grid.best_params_)

[18:59:05] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.0/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.

(The same warning repeats for every fit in the grid search.)

Results from Grid Search

The best estimator across ALL searched params:

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,

colsample_bynode=1, colsample_bytree=1, enable_categorical=False,

gamma=0, gpu_id=-1, importance_type=None,

interaction_constraints='', learning_rate=0.300000012,

max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,

monotone_constraints='()', n_estimators=200, n_jobs=8,

num_parallel_tree=1, predictor='auto', random_state=100,

reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,

tree_method='exact', validate_parameters=1, verbosity=None)

The best score across ALL searched params:

0.772463768115942

The best parameters across ALL searched params:

{'n_estimators': 200, 'random_state': 100}

In [91]:
model_xgb_clf=XGBClassifier(n_estimators=100, random_state=100 ).fit(x_train,y_train)

[18:59:20] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.0/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.

In [92]:
# Predictions on the test data:

y_predicted_xgb = model_xgb_clf.predict(x_test)

y_pred_probs = model_xgb_clf.predict_proba(x_test)

model_accuracy = model_xgb_clf.score(x_test, y_test)

print(model_accuracy)

XGB_clf_output = give_classifcation_metrics(y_test,y_predicted_xgb)

0.758471965495995

Confusion matrix :

1 0

1 133 215

0 177 1098

Outcome values :

tp 133

fn 215

fp 177

tn 1098

dtype: int64

Recall :

0.382183908045977

Precision:

0.4290322580645161

Classification report :

precision recall f1-score support

1 0.43 0.38 0.40 348

0 0.84 0.86 0.85 1275

accuracy 0.76 1623

macro avg 0.63 0.62 0.63 1623

weighted avg 0.75 0.76 0.75 1623

In [93]:
Importance = pd.DataFrame({'feature': list(x_train.columns),

'importance': model_xgb_clf.feature_importances_}).\

sort_values('importance', ascending = False)

In [94]:
Importance

Out[94]: feature importance

5 Cluster_1 0.508694

4 Cluster_0 0.184712

3 Cluster_-1 0.121934

0 Frequency 0.080018

1 Recency 0.052784

2 Average_Ticket_Size 0.051857

6 Cluster_2 0.000000

LGBM Grid Search & classification


In [95]:
from lightgbm import LGBMClassifier

model_lgbm_clf = LGBMClassifier()

parameters = {'random_state': [100, 200, 300],

'n_estimators': [100,200,300,500,1000], }

grid = GridSearchCV(estimator=model_lgbm_clf, param_grid = parameters)

grid.fit(x_train, y_train)

print(" Results from Grid Search " )

print("\n The best estimator across ALL searched params:\n", grid.best_estimator_)

print("\n The best score across ALL searched params:\n", grid.best_score_)

print("\n The best parameters across ALL searched params:\n", grid.best_params_)

Results from Grid Search

The best estimator across ALL searched params:

LGBMClassifier(n_estimators=1000, random_state=100)

The best score across ALL searched params:

0.7853002070393375

The best parameters across ALL searched params:

{'n_estimators': 1000, 'random_state': 100}

In [96]: model_lgbm_clf = LGBMClassifier(n_estimators=1000, random_state=100).fit(x_train,y_train)

In [97]:
# Predictions on the test data:

y_predicted_lgbm = model_lgbm_clf.predict(x_test)

y_pred_probs = model_lgbm_clf.predict_proba(x_test)

model_accuracy = model_lgbm_clf.score(x_test, y_test)

print(model_accuracy)

lgbm_clf_output = give_classifcation_metrics(y_test,y_predicted_lgbm)

0.7479975354282193

Confusion matrix :

1 0

1 138 210

0 199 1076

Outcome values :

tp 138

fn 210

fp 199

tn 1076

dtype: int64

Recall :

0.39655172413793105

Precision:

0.4094955489614243

Classification report :

precision recall f1-score support

1 0.41 0.40 0.40 348

0 0.84 0.84 0.84 1275

accuracy 0.75 1623

macro avg 0.62 0.62 0.62 1623

weighted avg 0.75 0.75 0.75 1623

In [98]:
Importance = pd.DataFrame({'feature': list(x_train.columns),

'importance': model_lgbm_clf.feature_importances_}).\

sort_values('importance', ascending = False)

In [99]:
Importance

Out[99]: feature importance

2 Average_Ticket_Size 12814

1 Recency 11827

0 Frequency 4996

5 Cluster_1 154

4 Cluster_0 108

3 Cluster_-1 97

6 Cluster_2 4
THANK YOU VAMSEE SIR