Lalit Suryavanshi - DSA-II - Assignment
Q2. Impact of changing the kernel of the SVM on the SVM performance
• Popular kernels are: the polynomial kernel, Gaussian kernel, radial basis function (RBF), Laplace RBF kernel, sigmoid kernel, and ANOVA RBF kernel.
• Choosing the right kernel is crucial: if the transformation is wrong, the model can perform very poorly.
• SVM performance improved after using a kernel. Recall was best with the sigmoid kernel and precision was best with the polynomial kernel (refer to the table); a comparison sketch follows below.
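A minimal sketch of such a kernel comparison with scikit-learn, assuming a prepared x_train/x_test/y_train/y_test split (the variable names are assumptions, not from the original):
from sklearn.svm import SVC
from sklearn.metrics import recall_score, precision_score

# fit one SVM per kernel and compare recall/precision on the held-out set
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    y_pred = SVC(kernel=kernel).fit(x_train, y_train).predict(x_test)
    print(kernel, 'recall:', recall_score(y_test, y_pred),
          'precision:', precision_score(y_test, y_pred))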
Yes, the recall value improved after hyperparameter tuning:

Model Name     Model accuracy  Recall Value (Before HT)  Recall Value (After HT)
Random forest  0.793           0.35                      0.39
XGBoost        0.758           0.38                      0.382
LightGBM       0.747           0.37                      0.4

*HT: Hyperparameter tuning
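The tuning itself was presumably done with a grid search; a sketch of how recall-oriented tuning could look, assuming GridSearchCV with scoring='recall' (the actual grid and scoring are not shown in the source):
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators': [100, 200, 300], 'max_depth': [6, 8, 12]}
grid = GridSearchCV(RandomForestClassifier(random_state=100),
                    param_grid, scoring='recall', cv=5)
grid.fit(x_train, y_train)   # x_train/y_train from the churn split below
print(grid.best_params_, grid.best_score_)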
Q5. Compare the order of feature importance across all the algorithms and mention the differences observed
The three most important features in the XGBoost classifier are Cluster_1, Cluster_0 and Cluster_-1, i.e. the DBSCAN cluster-membership dummies contribute most to whether a customer will churn.
The three most important features in the LightGBM classifier are Average_Ticket_Size, Recency and Frequency, i.e. the raw RFM features contribute most.
For the random forest (see In [89] below) the order is Frequency, Average_Ticket_Size, Recency, which is closer to LightGBM's ordering. The main difference observed is therefore whether the cluster dummies or the raw RFM variables dominate the model. A sketch for comparing the three side by side follows.
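A sketch of a side-by-side comparison, assuming the three fitted models from the notebook below (rf_clst_clf, model_xgb_clf, model_lgbm_clf):
import pandas as pd

imp = pd.DataFrame({'feature': list(x_train.columns),
                    'RF': rf_clst_clf.feature_importances_,
                    'XGB': model_xgb_clf.feature_importances_,
                    'LGBM': model_lgbm_clf.feature_importances_})
print(imp.sort_values('XGB', ascending=False))
Note that LightGBM reports raw split counts by default while random forest and XGBoost report normalized scores, so only the orderings, not the magnitudes, are directly comparable.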
Thank you.
import numpy as np
import pandas as pd
import time
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
# (pandas, matplotlib and seaborn imports added; they are used throughout the notebook)
import warnings
warnings.filterwarnings("ignore")
Loading Data
In [2]:
df = pickle.load(open("data_all_onlineretail.p", "rb"))
Data Processing:
In [3]:
df.head()
Out[3]:
   Invoice StockCode          Description  Quantity          InvoiceDate  Price Customer ID         Country
1   489434    79323P   PINK CHERRY LIGHTS        12  2009-12-01 07:45:00   6.75       13085  United Kingdom
2   489434    79323W  WHITE CHERRY LIGHTS        12  2009-12-01 07:45:00   6.75       13085  United Kingdom
In [4]:
df['Quantity'] = df['Quantity'].astype(float)
df['Price'] = df['Price'].astype(float)
df['InvoiceDate'] = pd.to_datetime(df["InvoiceDate"])
In [5]:
df.dtypes
Out[5]: Invoice object
StockCode object
Description object
Quantity float64
InvoiceDate datetime64[ns]
Price float64
Customer ID object
Country object
dtype: object
In [6]:
df.isna().sum()
Out[6]: Invoice 0
StockCode 0
Description 4382
Quantity 0
InvoiceDate 0
Price 0
Customer ID 243007
Country 0
dtype: int64
In [7]:
# Removing those missing customer ID's:
# (code reconstructed; the original cell body was lost in extraction)
df_v2 = df.dropna(subset=['Customer ID']).reset_index(drop=True)
In [8]:
df_v2.shape
Out[8]: (824364, 8)
In [9]:
df_v2.isna().sum()
Out[9]: Invoice 0
StockCode 0
Description 0
Quantity 0
InvoiceDate 0
Price 0
Customer ID 0
Country 0
dtype: int64
In [10]:
# Creating Revenue columns:
df_v2['Revenue'] = df_v2['Quantity']*df_v2['Price']
In [11]:
df_UK = df_v2.loc[df_v2['Country']=="United Kingdom"].reset_index(drop=True)
df_UK.shape
Out[11]: (741301, 9)
In [12]:
df_UK.head()
Out[12]:
   Invoice StockCode                          Description  Quantity          InvoiceDate  Price Customer ID         Country  Revenue
0   489434     85048  15CM CHRISTMAS GLASS BALL 20 LIGHTS      12.0  2009-12-01 07:45:00   6.95       13085  United Kingdom     83.4
4   489434     21232       STRAWBERRY CERAMIC TRINKET BOX      24.0  2009-12-01 07:45:00   1.25       13085  United Kingdom     30.0
In [13]:
# Creating Monthly columns:
df_UK['Date'] = df_UK['InvoiceDate'].dt.date
df_UK['Month'] = df_UK['InvoiceDate'].dt.month
df_UK['Quarter'] = df_UK['InvoiceDate'].dt.quarter
df_UK['Year'] = df_UK['InvoiceDate'].dt.year
In [14]:
df_UK.head()
Out[14]:
   Invoice StockCode                          Description  Quantity          InvoiceDate  Price Customer ID         Country  Revenue        Date
0   489434     85048  15CM CHRISTMAS GLASS BALL 20 LIGHTS      12.0  2009-12-01 07:45:00   6.95       13085  United Kingdom     83.4  2009-12-01
2   489434    79323W                  WHITE CHERRY LIGHTS      12.0  2009-12-01 07:45:00   6.75       13085  United Kingdom     81.0  2009-12-01
3   489434     22041          RECORD FRAME 7" SINGLE SIZE      48.0  2009-12-01 07:45:00   2.10       13085  United Kingdom    100.8  2009-12-01
4   489434     21232       STRAWBERRY CERAMIC TRINKET BOX      24.0  2009-12-01 07:45:00   1.25       13085  United Kingdom     30.0  2009-12-01
In [15]:
df_UK[['Month','Quarter','Year']].drop_duplicates().reset_index(drop=True).sort_values(['Year','Month'])  # (call completed; sort keys assumed)
   Month  Quarter  Year
0 12 4 2009
1 1 1 2010
2 2 1 2010
3 3 1 2010
4 4 2 2010
5 5 2 2010
6 6 2 2010
7 7 3 2010
8 8 3 2010
9 9 3 2010
10 10 4 2010
11 11 4 2010
12 12 4 2010
13 1 1 2011
14 2 1 2011
15 3 1 2011
16 4 2 2011
17 5 2 2011
18 6 2 2011
19 7 3 2011
20 8 3 2011
21 9 3 2011
22 10 4 2011
23 11 4 2011
24 12 4 2011
In [16]:
len(df_UK['Customer ID'].unique())
Out[16]: 5410
In [17]:
# Aggregating to daily revenue per customer (cell reconstructed; the original was lost in extraction):
df_daily_revenue = df_UK.groupby(['Customer ID','Date']).agg({'Revenue':'sum'}).reset_index()
In [18]:
df_daily_revenue.head()
In [19]:
df_daily_revenue['Frequency'] = 1
In [20]:
CURRENT_DATE = df_daily_revenue['Date'].max()
print(CURRENT_DATE)
2011-12-09
In [21]:
df_cust = df_daily_revenue.groupby(['Customer ID']).agg({'Revenue':'sum','Frequency':'sum','Date':'max'}).reset_index()  # ('Date':'max' reconstructed; it is used for Recency below)
In [22]:
df_cust.head()
In [23]:
df_cust['Recency'] = CURRENT_DATE - df_cust['Date']
In [24]:
df_cust['Recency'] = df_cust['Recency'].astype('timedelta64[D]')
In [25]:
df_cust.drop(columns = ['Date'], inplace=True)
In [26]:
df_cust['Average_Ticket_Size'] = df_cust['Revenue'] /df_cust['Frequency']
In [27]:
df_cust
Standardization of Data
In [29]:
# Standardization of the data (min-max scaling to [0, 1]):
from sklearn.preprocessing import MinMaxScaler  # (import added)
X_Vars = df_cust[['Frequency','Recency','Average_Ticket_Size']]
scaler = MinMaxScaler()
scaler.fit(X_Vars)
X_Vars_Scaled = scaler.transform(X_Vars)
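For reference, MinMaxScaler maps each column to [0, 1] via x_scaled = (x - min) / (max - min); a quick check of that equivalence (a sketch, not from the original notebook):
manual = (X_Vars - X_Vars.min()) / (X_Vars.max() - X_Vars.min())
assert np.allclose(manual.values, X_Vars_Scaled)   # both give the same scaled matrix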
In [30]:
X_Vars.head()
   Frequency  Recency  Average_Ticket_Size
0 11 325.0 -5.880000
1 1 404.0 415.790000
2 2 486.0 361.925000
3 3 527.0 76.950000
4 31 2.0 295.631935
In [31]:
X_Vars_Scaled
...,
Implementing Kmeans
In [32]:
# Implementing Kmeans Algorithm:
from sklearn.cluster import KMeans  # (import added)
model_kmeans = KMeans(n_clusters=3)
model_kmeans.fit(X_Vars_Scaled)
labels = model_kmeans.predict(X_Vars_Scaled)
print(labels)
[2 2 1 ... 1 1 0]
In [33]:
set(labels)
Out[33]: {0, 1, 2}
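Since fit followed by predict on the same data is exactly what KMeans.fit_predict does, the two calls above can be collapsed into one (equivalent alternative):
labels = model_kmeans.fit_predict(X_Vars_Scaled)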
In [34]:
model_kmeans.cluster_centers_
In [35]:
# Finding the final centroids
centroids = model_kmeans.cluster_centers_
# (plotting code reconstructed; the original cell body was truncated)
plt.scatter(X_Vars_Scaled[:, 0], X_Vars_Scaled[:, 1], c=labels, cmap="rainbow")
plt.scatter(centroids[:, 0], centroids[:, 1], c='black', marker='x')
plt.show()
In [36]:
# (cell reconstructed: attach the KMeans labels, used as Kmeans_Cluster_ID later)
df_cust['Kmeans_Cluster_ID'] = labels
In [37]:
df_cust
In [38]:
# Finding the optimal K using SSD:
from sklearn.metrics import silhouette_score  # (import added)
K = range(2,10)
sum_of_squared_distances = []
Silhoutte_Scores =[]
for k in K:
    model = KMeans(n_clusters=k).fit(X_Vars_Scaled)
    sum_of_squared_distances.append(model.inertia_)
    Silhoutte_Scores.append(silhouette_score(X_Vars_Scaled, model.labels_))  # (append reconstructed)
plt.plot(K, sum_of_squared_distances, 'bx-')  # (plot call reconstructed)
plt.xlabel("K values")
plt.title("Elbow Method")
plt.show()
In [39]:
Silhoutte_Scores
Out[39]: [0.703901183483134,
0.6577730374846207,
0.6227769084393752,
0.5972158551285756,
0.5568580013514581,
0.5487585756394782,
0.522675329365951,
0.5005652012801658]
In [40]:
silh_score_df_KMeans = pd.DataFrame(zip(K, K, Silhoutte_Scores, sum_of_squared_distances),
                                    columns=['K','Number_Of_Clusters','SilhoutteScore','SSD'])  # (column names reconstructed)
silh_score_df_KMeans
   K  Number_Of_Clusters  SilhoutteScore         SSD
0 2 2 0.703901 100.670793
1 3 3 0.657773 54.270376
2 4 4 0.622777 33.335792
3 5 5 0.597216 26.762086
4 6 6 0.556858 21.446518
5 7 7 0.548759 17.229781
6 8 8 0.522675 14.135381
7 9 9 0.500565 11.653419
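The best K by silhouette can be read off programmatically; a sketch assuming the (reconstructed) column names above:
best_k = silh_score_df_KMeans.loc[silh_score_df_KMeans['SilhoutteScore'].idxmax()]
print(best_k)   # K=2 scores highest here; the notebook nevertheless uses K=3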
Implementing DBSCAN
In [41]:
# Implementing DBSCAN: (Initial Model)
from sklearn.cluster import DBSCAN  # (import added)
model_dbscan = DBSCAN()  # (reconstructed; default parameters assumed)
model_dbscan.fit(X_Vars_Scaled)
labels_dbscan = model_dbscan.labels_
# label=-1 means the point is an outlier. The rest of the values represent the label/cluster id.
print(labels_dbscan)
[0 0 0 ... 0 0 0]
In [42]:
print(set(labels_dbscan))
{0}
In [43]:
eps_range = np.arange(0.1,1,0.01)
silhoutte_scores = []
N_Clusters = []
for e in eps_range:
    print(e)
    sil_score = 0
    try:  # (loop body reconstructed from the printed output below)
        model_dbscan = DBSCAN(eps=e).fit(X_Vars_Scaled)
        labels_e = model_dbscan.labels_
        nclusters = len(set(labels_e)) - (1 if -1 in labels_e else 0)
        sil_score = silhouette_score(X_Vars_Scaled, labels_e)
    except:
        print("An exception occurred")
    N_Clusters.append(nclusters)
    silhoutte_scores.append(sil_score)
0.1
0.11
0.12
...
0.39
(For every eps from 0.40 to 0.99, "An exception occurred" was printed: DBSCAN finds a single cluster there, so the silhouette score cannot be computed.)
In [44]:
silh_score_df_DBSCAN = pd.DataFrame(zip(eps_range,N_Clusters,silhoutte_scores), columns=['Eps_Value','Number_Of_Clusters','SilhoutteScore'])
In [45]:
silh_score_df_DBSCAN
    Eps_Value  Number_Of_Clusters  SilhoutteScore
0 0.10 1 0.553983
1 0.11 1 0.553983
2 0.12 1 0.553983
3 0.13 2 0.417510
4 0.14 2 0.417510
... ... ... ...
85 0.95 1 0.000000
86 0.96 1 0.000000
87 0.97 1 0.000000
88 0.98 1 0.000000
89 0.99 1 0.000000
90 rows × 3 columns
In [46]:
silh_score_df_DBSCAN.loc[silh_score_df_DBSCAN['Number_Of_Clusters']==2]
    Eps_Value  Number_Of_Clusters  SilhoutteScore
3 0.13 2 0.417510
4 0.14 2 0.417510
5 0.15 2 0.408073
6 0.16 2 0.401897
7 0.17 2 0.401654
8 0.18 2 0.401654
9 0.19 2 0.401654
In [47]:
# Implementing DBSCAN: (Final Model)
model_dbscan = DBSCAN(eps=0.1).fit(X_Vars_Scaled)  # (reconstructed; eps=0.1 assumed from the best silhouette score in the table above)
labels_dbscan = model_dbscan.labels_
# label=-1 means the point is an outlier. The rest of the values represent the label/cluster id.
print(set(labels_dbscan))
{0, -1}
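Label -1 marks DBSCAN noise points, so this model found one cluster plus outliers; the outliers can be counted directly (a sketch, not in the original):
print((labels_dbscan == -1).sum(), "customers flagged as outliers")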
In [48]:
df_cust['DBSCAN_ClusterId'] = labels_dbscan
In [49]:
plt.scatter(X_Vars_Scaled[:, 0], X_Vars_Scaled[:, 1], c=labels_dbscan, cmap="rainbow")
plt.show()
In [50]:
# PCA to two components (cell marker, import and DataFrame construction reconstructed):
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca_scale_results = pca.fit_transform(X_Vars_Scaled)
pca_df_scale = pd.DataFrame(pca_scale_results, columns=['pca1','pca2'])
plt.figure(figsize = (10,10))
plt.scatter(pca_df_scale.iloc[:,0], pca_df_scale.iloc[:,1], alpha=1, facecolor='lightslategray')
plt.xlabel('pca1')
plt.ylabel('pca2')
plt.show()
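A quick sanity check on how much of the scaled features' variance the two components retain (not in the original notebook):
print(pca.explained_variance_ratio_, pca.explained_variance_ratio_.sum())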
In [51]:
pca_df_scale
       pca1      pca2
0  0.161949  0.031558
1  0.271621 -0.010256
2  0.382255  0.001154
3  0.437466  0.009116
4 -0.280744  0.102640
In [52]:
eps_range = np.arange(0.01, 0.2, 0.01)  # (range reconstructed from the printed values below)
silhoutte_scores = []
N_Clusters = []
for e in eps_range:
    print(e)
    try:  # (try body reconstructed, mirroring the earlier sweep)
        model_dbscan = DBSCAN(eps=e).fit(pca_df_scale)
        labels_e = model_dbscan.labels_
        nclusters = len(set(labels_e)) - (1 if -1 in labels_e else 0)
        sil_score = silhouette_score(pca_df_scale, labels_e)
    except:
        sil_score = 0
    N_Clusters.append(nclusters)
    silhoutte_scores.append(sil_score)
0.01
0.02
...
0.19
In [53]:
silh_score_df_DBSCAN_pca = pd.DataFrame(zip(eps_range,N_Clusters,silhoutte_scores), columns=['Eps_Value','Number_Of_Clusters','SilhoutteScore'])
In [54]:
silh_score_df_DBSCAN_pca
    Eps_Value  Number_Of_Clusters  SilhoutteScore
0 0.01 6 -0.032801
1 0.02 3 -0.250099
2 0.03 1 0.412203
3 0.04 1 0.595276
4 0.05 1 0.595276
5 0.06 1 0.595276
6 0.07 1 0.595276
7 0.08 1 0.595276
8 0.09 1 0.603558
9 0.10 1 0.637769
10 0.11 1 0.637769
11 0.12 1 0.637769
12 0.13 2 0.253860
13 0.14 2 0.253860
14 0.15 2 0.253860
15 0.16 2 0.253860
16 0.17 2 0.253393
17 0.18 2 0.253393
18 0.19 2 0.253393
In [55]:
silh_score_df_DBSCAN_pca['SilhoutteScore'].max()
Out[55]: 0.637768576052979
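The eps behind that maximum can be looked up in the same frame (a sketch using the column names above):
best_row = silh_score_df_DBSCAN_pca.loc[silh_score_df_DBSCAN_pca['SilhoutteScore'].idxmax()]
print(best_row)   # eps around 0.10-0.12 gives the 0.6378 silhouette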
In [56]:
pca_df = pca_df_scale.copy()
In [57]:
# Implementing DBSCAN: (Final Model on the PCA data)
# (the refit with the chosen parameters was lost in extraction; model_dbscan is assumed to be refit on pca_df_scale before this point)
labels_dbscan = model_dbscan.labels_
set(labels_dbscan)
pca_df['DBSCAN_pca_cluster_ID'] = labels_dbscan
In [58]:
plt.figure(figsize = (8,8))
sns.scatterplot(pca_df.iloc[:,0], pca_df.iloc[:,1], hue=pca_df['DBSCAN_pca_cluster_ID'], palette='Set1')  # (call completed; palette assumed)
plt.legend()
plt.show()
In [59]:
# Finding the optimal K using SSD on the PCA data (cell marker and loop body reconstructed, mirroring In [38]):
K = range(2,10)
sum_of_squared_distances = []
Silhoutte_Scores =[]
for k in K:
    model = KMeans(n_clusters=k).fit(pca_df_scale)
    sum_of_squared_distances.append(model.inertia_)
    Silhoutte_Scores.append(silhouette_score(pca_df_scale, model.labels_))
plt.plot(K, sum_of_squared_distances, 'bx-')
plt.xlabel("K values")
plt.title("Elbow Method")
plt.show()
silh_score_df_KMeans = pd.DataFrame(zip(K, K, Silhoutte_Scores), columns=['K','Number_Of_Clusters','SilhoutteScore'])
silh_score_df_KMeans
   K  Number_Of_Clusters  SilhoutteScore
0 2 2 0.705901
1 3 3 0.661167
2 4 4 0.627432
3 5 5 0.592428
4 6 6 0.565769
5 7 7 0.558290
6 8 8 0.527178
7 9 9 0.512092
In [60]:
# Implementing Kmeans Algorithm:
model_kmeans = KMeans(n_clusters=3)
model_kmeans.fit(pca_df_scale)
labels_kmeans_pca = model_kmeans.predict(pca_df_scale)
print(labels_kmeans_pca)
pca_df['KMEANS_pca_cluster_ID'] =labels_kmeans_pca
[2 2 1 ... 1 1 0]
In [61]:
plt.figure(figsize = (8,8))
sns.scatterplot(pca_df.iloc[:,0], pca_df.iloc[:,1], hue=labels_kmeans_pca, palette='Set1')
plt.legend()
plt.show()
In [62]:
pca_df
       pca1      pca2  DBSCAN_pca_cluster_ID  KMEANS_pca_cluster_ID
0  0.161949  0.031558  -1  2
1  0.271621 -0.010256   0  2
2  0.382255  0.001154   0  1
3  0.437466  0.009116   0  1
4 -0.280744  0.102640   1  0
In [63]:
df_cust[['DBSCAN_pca_cluster_ID','KMEANS_pca_cluster_ID']] = pca_df[['DBSCAN_pca_cluster_ID','KMEANS_pca_cluster_ID']]
In [64]:
df_cust
Out[64]: (df_cust with columns Customer ID, Revenue, Frequency, Recency, Average_Ticket_Size, Kmeans_Cluster_ID, DBSCAN_ClusterId, DBSCAN_pca_cluster_ID, KMEANS_pca_cluster_ID; rows truncated in extraction)
In [65]:
df_churn = pd.read_csv("Churn_Indicator.csv", dtype=str)
df_churn
  Customer ID Final_Churn_Ind
0 12346 False
1 12608 False
2 12745 False
3 12746 True
4 12747 False
In [66]:
# (cell reconstructed: merge the churn indicator onto df_cust; it is used in the next cells)
df_cust = df_cust.merge(df_churn, on='Customer ID', how='left')
In [67]:
df_cust['Final_Churn_Ind'].fillna('False', inplace=True)
In [68]:
df_cust['Churn_Ind'] = df_cust['Final_Churn_Ind']=='True'
In [69]:
df_cust.drop(columns = ['Final_Churn_Ind'], inplace= True)
In [70]:
df_cust
Out[70]: (df_cust as above, now with a boolean Churn_Ind column; rows truncated in extraction)
In [71]:
df_cust.set_index(['Customer ID'],inplace=True)
In [72]:
df_cust
(output truncated in extraction; df_cust is now indexed by Customer ID)
In [73]:
x = df_cust[['Frequency', 'Recency', 'Average_Ticket_Size', 'DBSCAN_pca_cluster_ID']]
y = df_cust['Churn_Ind']
In [74]:
# One-hot encoding the DBSCAN cluster id (cell reconstructed; prefix inferred from the Cluster_* feature names below):
df_clust_dummy = pd.get_dummies(x['DBSCAN_pca_cluster_ID'], prefix='Cluster')
x[list(df_clust_dummy.columns)] = df_clust_dummy
In [75]:
x.drop(columns = ['DBSCAN_pca_cluster_ID'], inplace=True)
In [76]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test= train_test_split(x,y,test_size=0.3,random_state = 0)
In [77]:
# pct of train counts: Shows imbalanced dataset
print(y_train.value_counts())
y_train.value_counts(normalize =True)
False 3019
True 768
False 0.797201
True 0.202799
In [79]:
from imblearn.over_sampling import SMOTE  # (import added)
smt = SMOTE(sampling_strategy= 0.6, random_state = 100, k_neighbors = 5)
xtrain_smt, ytrain_smt = smt.fit_resample(x_train, y_train)  # (resampling call reconstructed; xtrain_smt/ytrain_smt are used below)
In [80]:
xtrain_smt
   Frequency     Recency  Average_Ticket_Size  Cluster_-1  Cluster_0  Cluster_1  Cluster_2
0 5 10.000000 113.390000 0 0 1 0
1 1 590.000000 1049.660000 0 1 0 0
2 7 15.000000 233.178571 0 0 1 0
3 1 414.000000 226.700000 0 1 0 0
4 1 313.000000 403.300000 0 0 1 0
In [81]:
ytrain_smt.value_counts(normalize =True)
False 0.625052
True 0.374948
In [82]:
x_train = xtrain_smt  # (reconstructed: the SMOTE-resampled features are used for training below)
y_train = ytrain_smt
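This matches what sampling_strategy=0.6 promises: the minority:majority ratio becomes 0.6, i.e. a minority share of 0.6 / 1.6 = 0.375 (a quick arithmetic check):
print(0.6 / 1.6)   # 0.375, matching the 0.374948 above (rounding from integer counts)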
In [83]:
from sklearn.metrics import confusion_matrix, classification_report, precision_score, recall_score
In [84]:
def give_classifcation_metrics(actual, predicted):
    # confusion matrix (body reconstructed to match the printed output below)
    cf_matrix = confusion_matrix(actual, predicted, labels=[1, 0])
    df_cfmatrix = pd.DataFrame(cf_matrix, index=[1, 0], columns=[1, 0])
    print('Confusion matrix : \n', df_cfmatrix, '\n')
    tp, fn, fp, tn = cf_matrix.ravel()
    tp_outcome_ser = pd.Series([tp, fn, fp, tn], index=['tp', 'fn', 'fp', 'tn'])
    print('Outcome values : \n', tp_outcome_ser, '\n')
    recall = recall_score(actual,predicted)
    precision = precision_score(actual,predicted)
    print('Recall : \n',recall,'\n')
    print('Precision: \n',precision,'\n')
    clf_report = classification_report(actual,predicted,labels=[1,0])
    print('Classification report : \n', clf_report)
    return(df_cfmatrix,tp_outcome_ser,recall,precision, clf_report)
In [85]:
# Hyperparameter tuning with GridSearchCV (cell reconstructed; cv is an assumption):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
rf_clst_clf=RandomForestClassifier()
param_grid = {'n_estimators': [100,200,300],
              'max_depth' : [6,8,12],
              'criterion': ['entropy','gini']}
grid = GridSearchCV(rf_clst_clf, param_grid, cv=5)
grid.fit(x_train, y_train)
print(grid.best_score_)
0.7643892339544514
In [86]:
from sklearn.ensemble import RandomForestClassifier
rf_clst_clf=RandomForestClassifier(random_state=100, n_estimators=100, max_depth=6, criterion='entropy')  # (criterion value truncated in the source; 'entropy' assumed)
# Train the model using the training set:
rf_clst_clf.fit(x_train,y_train)
In [87]:
# Predictions on the test data:
y_predicted_rf = rf_clst_clf.predict(x_test)
y_pred_probs = rf_clst_clf.predict_proba(x_test)
model_accuracy = rf_clst_clf.score(x_test, y_test)  # (line reconstructed; the printed value follows)
print(model_accuracy)
RF_clf_output = give_classifcation_metrics(y_test,y_predicted_rf)
0.7935921133703019
Confusion matrix :
1 0
1 135 213
0 122 1153
Outcome values :
tp 135
fn 213
fp 122
tn 1153
dtype: int64
Recall :
0.3879310344827586
Precision:
0.5252918287937743
Classification report :
In [88]:
Importance = pd.DataFrame({'feature': list(x_train.columns),
                           'importance': rf_clst_clf.feature_importances_}).\
             sort_values('importance', ascending=False)  # (sort call reconstructed; the output below is sorted)
In [89]:
Importance
0 Frequency 0.388836
2 Average_Ticket_Size 0.220750
1 Recency 0.165903
5 Cluster_1 0.141582
4 Cluster_0 0.038268
3 Cluster_-1 0.024941
6 Cluster_2 0.019721
In [90]:
from xgboost import XGBClassifier  # (import added; grid cell reconstructed)
model_xgb_clf=XGBClassifier()
param_grid = {'n_estimators': [100,200,300], }
grid = GridSearchCV(model_xgb_clf, param_grid, cv=5)
grid.fit(x_train, y_train)
print(grid.best_estimator_, grid.best_score_)
XGBClassifier(..., interaction_constraints='', learning_rate=0.300000012, ...)
0.772463768115942
In [91]:
model_xgb_clf=XGBClassifier(n_estimators=100, random_state=100 ).fit(x_train,y_train)
In [92]:
# Predictions on the test data:
y_predicted_xgb = model_xgb_clf.predict(x_test)
y_pred_probs = model_xgb_clf.predict_proba(x_test)
model_accuracy = model_xgb_clf.score(x_test, y_test)  # (line reconstructed)
print(model_accuracy)
XGB_clf_output = give_classifcation_metrics(y_test,y_predicted_xgb)
0.758471965495995
Confusion matrix :
1 0
1 133 215
0 177 1098
Outcome values :
tp 133
fn 215
fp 177
tn 1098
dtype: int64
Recall :
0.382183908045977
Precision:
0.4290322580645161
Classification report :
In [93]:
Importance = pd.DataFrame({'feature': list(x_train.columns),
                           'importance': model_xgb_clf.feature_importances_}).\
             sort_values('importance', ascending=False)  # (sort call reconstructed)
In [94]:
Importance
5 Cluster_1 0.508694
4 Cluster_0 0.184712
3 Cluster_-1 0.121934
0 Frequency 0.080018
1 Recency 0.052784
2 Average_Ticket_Size 0.051857
6 Cluster_2 0.000000
In [95]:
from lightgbm import LGBMClassifier  # (import added; grid cell reconstructed)
model_lgbm_clf = LGBMClassifier()
param_grid = {'n_estimators': [100,200,300,500,1000], }
grid = GridSearchCV(model_lgbm_clf, param_grid, cv=5)
grid.fit(x_train, y_train)
print(grid.best_estimator_, grid.best_score_)
LGBMClassifier(n_estimators=1000, random_state=100)
0.7853002070393375
In [96]:
model_lgbm_clf = LGBMClassifier(n_estimators=1000, random_state=100).fit(x_train, y_train)  # (refit cell reconstructed from the best estimator above)
In [97]:
# Predictions on the test data:
y_predicted_lgbm = model_lgbm_clf.predict(x_test)
y_pred_probs = model_lgbm_clf.predict_proba(x_test)
model_accuracy = model_lgbm_clf.score(x_test, y_test)  # (line reconstructed)
print(model_accuracy)
lgbm_clf_output = give_classifcation_metrics(y_test,y_predicted_lgbm)
0.7479975354282193
Confusion matrix :
1 0
1 138 210
0 199 1076
Outcome values :
tp 138
fn 210
fp 199
tn 1076
dtype: int64
Recall :
0.39655172413793105
Precision:
0.4094955489614243
Classification report :
In [98]:
Importance = pd.DataFrame({'feature': list(x_train.columns),
                           'importance': model_lgbm_clf.feature_importances_}).\
             sort_values('importance', ascending=False)  # (sort call reconstructed)
In [99]:
Importance
2 Average_Ticket_Size 12814
1 Recency 11827
0 Frequency 4996
5 Cluster_1 154
4 Cluster_0 108
3 Cluster_-1 97
6 Cluster_2 4
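The integer values here are LightGBM's default 'split' importances (how often a feature is used for splitting), which is why they are counts rather than the fractions the other two models report; gain-based importances can be pulled from the underlying booster (a sketch, not in the original):
imp_gain = model_lgbm_clf.booster_.feature_importance(importance_type='gain')
print(dict(zip(x_train.columns, imp_gain)))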
THANK YOU VAMSEE SIR