
Workshop No. 5: Unsupervised Learning

Hierarchical Clustering and K-Means


By Souhir Mabrouk

1. Description
This case requires developing a customer segmentation to define a marketing strategy. The sample dataset summarizes the usage behavior of about 9,000 active credit card holders over the last 6 months. The file is at the customer level, with 18 behavioral variables.
Following is the data dictionary for the credit card dataset:
CUST_ID: Identification of the credit card holder (categorical)
BALANCE: Balance amount left in the account to make purchases
BALANCE_FREQUENCY: How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
PURCHASES: Amount of purchases made from the account
ONEOFF_PURCHASES: Maximum purchase amount done in one go
INSTALLMENTS_PURCHASES: Amount of purchases done in installments
CASH_ADVANCE: Cash advance given to the user
PURCHASES_FREQUENCY: How frequently purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
ONEOFF_PURCHASES_FREQUENCY: How frequently purchases are happening in one go (1 = frequently purchased, 0 = not frequently purchased)
PURCHASES_INSTALLMENTS_FREQUENCY: How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
CASH_ADVANCE_FREQUENCY: How frequently the cash advance is being paid
CASH_ADVANCE_TRX: Number of transactions made with "cash advance"
PURCHASES_TRX: Number of purchase transactions made
CREDIT_LIMIT: Credit card limit for the user
PAYMENTS: Amount of payments done by the user
MINIMUM_PAYMENTS: Minimum amount of payments made by the user
PRC_FULL_PAYMENT: Percentage of full payment paid by the user
TENURE: Tenure of the credit card service for the user
1. Load your dataset.
2. Use hierarchical clustering to identify the inherent groupings within your data.
3. Plot the clusters.
4. Plot the dendrogram.
5. Use k-means clustering. Try different k values and select the best one.
6. Plot the clusters.
7. Compare the two results.
Bonus: search for another validation metric.

2. Hierarchical clustering
Hierarchical clustering is a popular method for grouping objects. It creates groups so that objects within a group are similar to each other and different from objects in other groups. Clusters are visually represented in a hierarchical tree called a dendrogram.

Strategies for hierarchical clustering generally fall into two categories:

• Agglomerative: This is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
• Divisive: This is a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate distance d, such as the Euclidean distance, between single observations of the data set, and a linkage criterion, which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.
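To make the linkage idea concrete, here is a toy sketch (the five points below are illustrative, not taken from the assignment data). SciPy's linkage function returns a matrix with one row per merge, recording which clusters were joined and at what distance:

import numpy as np
from scipy.cluster.hierarchy import linkage

# five illustrative 2-D points: two tight pairs and one far-away point
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])

# complete linkage: the distance between two clusters is the largest
# pairwise distance between their members
Z = linkage(X, method='complete')

# each row of Z: [cluster_i, cluster_j, merge_distance, new_cluster_size]
print(Z)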
3. Code in Python
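The snippets below assume df_norm, a cleaned and normalized copy of the dataset. One minimal way to build it (the file name 'CC GENERAL.csv' and min-max scaling are assumptions here; adapt them to your own preprocessing):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# load the credit card dataset (the file name is an assumption)
df = pd.read_csv('CC GENERAL.csv')

# drop the identifier and fill missing values with column medians
df = df.drop(columns='CUST_ID')
df = df.fillna(df.median())

# scale every feature to [0, 1] so no variable dominates the distances
scaler = MinMaxScaler()
df_norm = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)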
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc
from sklearn.cluster import AgglomerativeClustering

# fit agglomerative clustering with complete linkage
# (older scikit-learn versions name the metric parameter 'affinity')
model = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage='complete')
clust_labels = model.fit_predict(df_norm)
clust_labels

# columns used to visualize the clusters
best_cols = ["BALANCE", "PURCHASES", "CASH_ADVANCE", "CREDIT_LIMIT", "PAYMENTS", "MINIMUM_PAYMENTS"]
df_norm['cluster'] = clust_labels
best_cols.append('cluster')
# make a Seaborn pairplot colored by cluster
sns.pairplot(df_norm[best_cols], hue='cluster')
-------------------------
# keep the labels in a dataframe for inspection
agglom = pd.DataFrame(clust_labels)
agglom
--------------------------
# plot the dendrogram (drop the label column so only features are used)
plt.figure(figsize=(20, 20))
plt.title('Visualising the data')
Dendrogram = shc.dendrogram(shc.linkage(df_norm.drop(columns='cluster'), method='complete'))
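If you want flat cluster labels directly from the dendrogram's hierarchy, SciPy's fcluster can cut the tree at a chosen number of clusters; a small sketch reusing the same complete-linkage matrix:

from scipy.cluster.hierarchy import fcluster

# recompute the linkage on the feature columns only
Z = shc.linkage(df_norm.drop(columns='cluster'), method='complete')

# cut the tree so that exactly 3 flat clusters remain (labels are 1-based)
flat_labels = fcluster(Z, t=3, criterion='maxclust')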

4. K-means code
from sklearn.cluster import KMeans

# select best columns
best_cols_3 = ["BALANCE", "PURCHASES", "CASH_ADVANCE", "CREDIT_LIMIT", "PAYMENTS", "MINIMUM_PAYMENTS"]

# dataframe with best columns
data_final_3 = pd.DataFrame(df_norm[best_cols_3])
print('New dataframe with best columns has just been created. Data shape: ' + str(data_final_3.shape))

# inertia plotter function: fit k-means for each k and plot the elbow curve
def inertia_plot(clust, X, start=2, stop=20):
    inertia = []
    for x in range(start, stop):
        km = clust(n_clusters=x)
        labels = km.fit_predict(X)
        inertia.append(km.inertia_)
    plt.figure(figsize=(12, 6))
    plt.plot(range(start, stop), inertia, marker='o')
    plt.xlabel('Number of Clusters')
    plt.ylabel('Inertia')
    plt.title('Inertia plot with K')
    plt.xticks(list(range(start, stop)))
    plt.show()

inertia_plot(KMeans, data_final_3)

# apply KMeans clustering with the chosen k
alg_3 = KMeans(n_clusters=3)
label_3 = alg_3.fit_predict(data_final_3)

# create a 'cluster' column
data_final_3['cluster_3'] = label_3
best_cols_3.append('cluster_3')

# make a Seaborn pairplot colored by cluster
sns.pairplot(data_final_3[best_cols_3], hue='cluster_3')
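For the bonus, one common alternative to inertia is the silhouette score, which rates how well each point sits inside its cluster (closer to 1 is better) and also allows a direct comparison between the k-means and hierarchical results (step 7). A sketch, assuming the variables defined above:

from sklearn.metrics import silhouette_score

# score k-means for several k values on the feature columns only
features = data_final_3.drop(columns='cluster_3')
for k in range(2, 8):
    labels = KMeans(n_clusters=k).fit_predict(features)
    print('k-means, k =', k, '-> silhouette =', round(silhouette_score(features, labels), 3))

# same metric for the hierarchical labels, for a direct comparison
hier = AgglomerativeClustering(n_clusters=3, linkage='complete').fit_predict(features)
print('hierarchical, k = 3 -> silhouette =', round(silhouette_score(features, hier), 3))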
