
Unveiling Customer Diversity: Exploring Agglomerative Clustering for Grocery Mart Segmentation

ABSTRACT
This project explores the application of agglomerative clustering on customer
data taken from a grocery mart’s database. The aim of this project is to segment
the customers into distinct clusters based on purchasing behaviour. The dataset
has been streamlined using dimensionality reduction methods, followed by
agglomerative clustering to identify clusters. The analysis produced four distinct
customer segments, which were profiled on factors such as family structure,
income level and spending patterns. These insights offer opportunities to
develop targeted marketing strategies that meet the needs of each customer
segment and so improve the effectiveness of marketing in the retail industry.

INTRODUCTION
Clustering is the process of grouping a set of physical or abstract objects into
classes of similar objects. A cluster is a collection of data objects that are
similar to one another and dissimilar to the objects in other clusters. A cluster
of data objects can be treated collectively as one group, so clustering may be
considered a form of data compression. Although classification is an effective
means of distinguishing groups or classes of objects, it requires the often-costly
collection and labelling of a large set of training tuples or patterns, which the
classifier uses to model each group. Clustering needs no such labels. It is also
called data segmentation in some applications because it partitions large data
sets into groups according to their similarity (Ali and Kadhum, 2017).

The reasons for customer segmentation are elaborated below:


• Understanding Customer Mindset: The most important stakeholders for
a business, especially in the retail industry, are its customers. It is pivotal
for the management to understand the needs and preferences of their
customers to make the business successful.
• Targeted Marketing Strategies: Deploying marketing strategies without
understanding the customers and their preferences will not bring success.
By segmenting the customers, tailor-made marketing campaigns can be
deployed to target each distinct segment.
• High Revenue: This is one of the main goals of any customer
segmentation process. Higher revenue follows from the combined effect
of the advantages above.

Customer segmentation can be done by cluster analysis, an unsupervised
machine learning technique. Clusters are formed based on similarity or distance
measures. Various clustering methods are available, such as:
1. K-means clustering,
2. Hierarchical clustering,
3. Density-based clustering (DBSCAN),
4. Agglomerative clustering (the bottom-up form of hierarchical clustering),
and so on. For this project, agglomerative clustering has been used.
Agglomerative clustering is a hierarchical clustering algorithm that starts by
treating each data point as a separate cluster and iteratively merges the two
closest clusters until the specified number of clusters remains. In this analysis,
4 was chosen as the optimal number of clusters by plotting the silhouette scores.
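
The bottom-up merging described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the code used in the project: the average-linkage rule, the function name `agglomerative` and the sample customer values are all assumptions made for the example, and the naive search is far slower than a library implementation.

```python
from itertools import combinations
from math import dist


def agglomerative(points, n_clusters):
    """Naive bottom-up clustering with average linkage (illustration only)."""
    # Start with every point in its own cluster; clusters hold point indices.
    clusters = [[i] for i in range(len(points))]

    def average_linkage(a, b):
        # Mean pairwise Euclidean distance between the members of two clusters.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    # Repeatedly merge the two closest clusters until n_clusters remain.
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]

    # Convert the membership lists into one label per point.
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


# Hypothetical (income in thousands, amount spent) pairs forming two groups.
customers = [(20, 5), (22, 6), (21, 4), (80, 60), (82, 58), (79, 61)]
labels = agglomerative(customers, n_clusters=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The first three customers end up in one cluster and the last three in the other, because merges always join the pair of clusters with the smallest average distance.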

DATA COLLECTION AND RESEARCH METHODOLOGY


The dataset, named “marketing_campaign.csv”, was taken from Kaggle.com.
Kaggle is an online platform for data scientists and machine learning practitioners.
Akash Patel. 2023. Customer Personality Analysis. Kaggle. Customer Personality Analysis
(kaggle.com)

The dataset consists of 2240 data points and 29 attributes. It can be categorised
into the following subsets:
1. Customer’s Information:
• ID – Customer’s unique identifier.
• Year_Birth – Customer’s birth year.
• Education – Education level.
• Marital_Status – Customer’s marital status.
• Income – Yearly household income.
• Kidhome – Number of children at home.
• Teenhome – Number of teenagers at home.
• Dt_Customer – Date of customer’s enrolment with company.
• Recency – Number of days since customer’s last purchase.
• Complain – 1 if customer complained in last 2 years, 0 otherwise.

2. Products (amount spent on different products in the last 2 years):
• MntWines – Amount spent on wines.
• MntFruits – Amount spent on fruits.
• MntMeatProducts – Amount spent on meat products.
• MntFishProducts – Amount spent on fish products.
• MntSweetProducts – Amount spent on sweets.
• MntGoldProds – Amount spent on gold products.
3. Promotions
• NumDealsPurchases – Number of purchases made with a discount.
• AcceptedCmp1 – 1 if customer accepted the offer in the 1st campaign, 0
otherwise.
• AcceptedCmp2 – 1 if customer accepted the offer in the 2nd campaign, 0
otherwise.
• AcceptedCmp3 – 1 if customer accepted the offer in the 3rd campaign, 0
otherwise.
• AcceptedCmp4 – 1 if customer accepted the offer in the 4th campaign, 0
otherwise.
• AcceptedCmp5 – 1 if customer accepted the offer in the 5th campaign, 0
otherwise.
• Response – 1 if customer accepted the offer in the last campaign, 0
otherwise.

4. Place
• NumWebPurchases – Number of purchases through website.
• NumCatalogPurchases – Number of purchases made using
catalogue.
• NumStorePurchases – Number of purchases made directly in
stores.
• NumWebVisitsMonth – Number of visits to company website in
last month.
For this project, the model has been built in Python, one of the most widely
used programming languages for machine learning applications. The code for
the agglomerative clustering model was executed in Jupyter Notebook, a
web-based interactive computing environment that supports Python.
The dataset was imported into the notebook and cleaned to deal with missing
values. After cleaning, new features were engineered to support the
dimensionality reduction performed later in the project.
The features were then plotted. The plots showed a few clear outliers in the
Income and Age features, and these were removed.
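
The cleaning, feature engineering and outlier removal steps can be sketched with pandas as follows. The report does not define the engineered features, so every definition here is an assumption inferred from the feature names used later in the profiling section (“Age”, “Spent”, “Living_with”, “Children”, “Family_Size”, “Is_parent”). The reference year, the outlier thresholds and the small inline table standing in for “marketing_campaign.csv” are likewise hypothetical.

```python
import pandas as pd

# Tiny stand-in for the real dataset; all values are made up for illustration.
raw = pd.DataFrame({
    "Year_Birth": [1957, 1984, 1893, 1975, 1990],
    "Marital_Status": ["Single", "Married", "Together", "Divorced", "Married"],
    "Income": [58138.0, None, 60182.0, 666666.0, 26646.0],
    "Kidhome": [0, 1, 0, 1, 1],
    "Teenhome": [0, 1, 0, 0, 0],
    "MntWines": [635, 11, 426, 9, 11],
    "MntMeatProducts": [546, 6, 127, 18, 20],
})

REFERENCE_YEAR = 2023          # assumption: year used to compute age
MAX_AGE, MAX_INCOME = 90, 600_000  # assumption: outlier cut-offs


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    # Cleaning: drop rows with a missing income.
    out = df.dropna(subset=["Income"]).copy()

    # Feature engineering (assumed definitions).
    out["Age"] = REFERENCE_YEAR - out["Year_Birth"]
    spend_cols = [c for c in out.columns if c.startswith("Mnt")]
    out["Spent"] = out[spend_cols].sum(axis=1)
    out["Living_with"] = out["Marital_Status"].map(
        lambda s: "Partner" if s in {"Married", "Together"} else "Alone"
    )
    out["Children"] = out["Kidhome"] + out["Teenhome"]
    out["Family_Size"] = out["Living_with"].map({"Alone": 1, "Partner": 2}) + out["Children"]
    out["Is_parent"] = (out["Children"] > 0).astype(int)

    # Outlier removal on Age and Income.
    out = out[(out["Age"] < MAX_AGE) & (out["Income"] < MAX_INCOME)]
    return out.reset_index(drop=True)


clean = engineer_features(raw)
print(clean[["Age", "Income", "Spent", "Family_Size", "Is_parent"]])
```

Of the five sample rows, one is dropped for a missing income, one for an implausible age and one for an extreme income, leaving two customers.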

DATA PREPROCESSING AND DIMENSIONALITY REDUCTION


First, the correlation amongst the features was plotted (excluding the
categorical attributes).
The data is quite clean, and the newly engineered features are included.
Following this, the categorical features were label encoded and the
features were scaled using a standard scaler.
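
To make the two preprocessing steps concrete, here is a dependency-free sketch of what label encoding and standard scaling do. It is an illustration under assumptions, not the project's code: the helper names and sample values are hypothetical, and in practice a library encoder and scaler would normally be used. Standard scaling is z = (x - mean) / std, computed here with the population standard deviation.

```python
from statistics import fmean, pstdev


def label_encode(values):
    """Map each distinct category to an integer, in sorted order of the categories."""
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def standard_scale(values):
    """Rescale to zero mean and unit variance: z = (x - mean) / std."""
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical sample values.
education = ["Graduate", "Postgraduate", "Undergraduate", "Graduate"]
income = [30000.0, 50000.0, 70000.0, 50000.0]

codes, mapping = label_encode(education)
scaled_income = standard_scale(income)

print(codes)  # [0, 1, 2, 0]
print([round(z, 3) for z in scaled_income])  # [-1.414, 0.0, 1.414, 0.0]
```

Scaling matters here because both PCA and distance-based clustering are sensitive to the units of each feature; without it, a large-valued feature such as income would dominate.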
A subset of the dataset was then created for dimensionality reduction using
Principal Component Analysis (PCA). A large number of features is difficult
to work with, so dimensionality reduction was applied.
Dimensionality reduction is the process of reducing the number of random
variables under consideration by obtaining a set of principal variables.
After PCA, the dimensions were reduced to 3. In the summary statistics of the
reduced data, the mean of all three components is close to zero, which is
expected because PCA centres the data, and each component has a non-zero
standard deviation. The data points are therefore spread around the origin
along every retained component rather than collapsed onto a single direction.
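
The reduction to three components can be illustrated with a short NumPy sketch based on the singular value decomposition. This is a sketch under assumptions rather than the project's implementation: the function name `pca_project` is hypothetical and the input is synthetic random data standing in for the scaled feature matrix.

```python
import numpy as np


def pca_project(X, n_components=3):
    """Project the rows of X onto the top principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    scores = X_centered @ Vt[:n_components].T
    explained_ratio = (S**2 / np.sum(S**2))[:n_components]
    return scores, explained_ratio


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))   # synthetic stand-in for the scaled features
X[:, 1] += 2 * X[:, 0]          # add correlation so one direction dominates

scores, ratio = pca_project(X, n_components=3)
print(scores.shape)             # (200, 3)
print(scores.mean(axis=0))      # each component mean is ~0
print(scores.std(axis=0))       # spread decreases from PC1 to PC3
```

Because the data are centred before projection, each component has a mean of zero up to floating-point error, which matches the behaviour noted in the summary statistics.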
The 3D projection of the data in the reduced dimensions is as follows:

[Figure: 3D scatter plot showing a projection of the high-dimensional data onto
three principal components, PC1, PC2 and PC3. These axes capture the most
important information from the original data set. The axis labels run from -6 to 6
for PC1, -4 to 4 for PC2, and -2 to 2 for PC3.]

CLUSTERING AND MODEL EVALUATION
Steps involved in clustering:
• Plotting silhouette scores to determine the optimal number of clusters.
• Agglomerative clustering.
• Examining the clusters via a scatter plot.

Based on the silhouette plot, the optimal number of clusters chosen is 4. This is
the point just before the curve plateaus, and it indicates that the clusters have
high cohesion.
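
One way to produce the silhouette scores for a range of cluster counts is sketched below with scikit-learn. The data are synthetic (four well-separated groups in three dimensions, standing in for the PCA output), the range of k values is an assumption, and `AgglomerativeClustering` is used with its default settings because the report does not state the linkage. The sketch simply picks the k with the highest score, which is a simplification of reading the curve by eye.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the three PCA components: four separated groups.
rng = np.random.default_rng(42)
centers = np.array(
    [[0, 0, 0], [10, 10, 0], [-10, 10, 5], [10, -10, -5]], dtype=float
)
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 3)) for c in centers])

# Fit agglomerative clustering for each candidate k and record the score.
scores = {}
for k in range(2, 9):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("best k:", best_k)
```

The silhouette score lies between -1 and 1, with higher values indicating clusters that are compact and well separated from one another.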

The clusters have then been examined via a scatter plot.


The clusters were plotted as a bar graph to check their distribution.
The customers appear to be fairly evenly distributed across the clusters.
Income vs Spending Plot:

Based on the graph, we can infer that:
Group 0: High spending & average income
Group 1: High spending & high income
Group 2: Low spending & low income
Group 3: High spending & low income
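
A numerical summary per cluster can back up a reading taken from the plot. The following is a minimal sketch with made-up rows, not the project's data or code: the tuple layout (cluster label, yearly income, total spent) and the function name `profile_clusters` are assumptions for illustration.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical rows: (cluster label, yearly income, total spent).
rows = [
    (0, 60000, 900), (0, 64000, 1100),
    (1, 85000, 1500), (1, 90000, 1700),
    (2, 30000, 100), (2, 34000, 140),
]


def profile_clusters(rows):
    """Return the size, mean income and mean spending of each cluster."""
    grouped = defaultdict(list)
    for cluster, income, spent in rows:
        grouped[cluster].append((income, spent))
    return {
        cluster: {
            "size": len(members),
            "mean_income": fmean(m[0] for m in members),
            "mean_spent": fmean(m[1] for m in members),
        }
        for cluster, members in sorted(grouped.items())
    }


profiles = profile_clusters(rows)
for cluster, stats in profiles.items():
    print(cluster, stats)
```

Comparing the per-cluster means against the overall means gives a simple basis for labels such as "high spending" or "low income".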

Distribution of Clusters based on products:


We can infer that cluster 1 is our biggest set of customers, closely followed by
cluster 0.

Exploration of past campaigns:

Based on the plot, we can infer that no customer has taken part in all 5
campaigns. The overall response is underwhelming.
Deals offered:

The deals offered have done well. The best outcomes are seen with clusters 0
and 3. Clusters 1 and 2 have not been attracted as much.

CUSTOMER SEGMENT PROFILING AND INTERPRETATION


To profile the customers into segments, 9 plots were made.
The following features were plotted against expenditure (“Spent”):

1. “Kidhome”,
2. “Teenhome”,
3. “Customer_for”,
4. “Age”,
5. “Children”,
6. “Family_Size”,
7. “Is_parent”,
8. “education”,
9. “Living_with”
Based on these plots, the following information can be deduced about the
customers:
Cluster 0:
• A parent.
• At least 2 and at most 4 members in the family.
• Single parents are a subset of this group.
• Most have a teenager at home.
• Relatively older.

Cluster 1:
• Not a parent.
• At most 2 family members.
• Slight majority of couples.
• Span all ages.
• High income.

Cluster 2:
• Mostly parents.
• At most 3 members in the family.
• Most have one child.
• Relatively younger.

Cluster 3:
• A parent.
• At least 2 and at most 5 members in the family.
• Most have a teenager at home.
• Relatively older.
• Lower-income group.

DISCUSSION AND CONCLUSION


Understanding its customer base is extremely important for any business
organisation. Customer segmentation is one way to gain a deeper
understanding of customer behaviour, and it is one of the important
applications of cluster analysis among many spread across different
domains. Sales and marketing efforts can be designed for each cluster of
customers to achieve a high return on investment. Unsupervised machine
learning algorithms such as agglomerative clustering can be applied easily
using Python libraries to summarise and visualise the clusters. The current
research applied agglomerative clustering to the Marketing Campaign dataset
and identified four customer clusters. These clusters can help the marketing
team of the Grocery Mart to address each segment differently and improve
profitability.

REFERENCES:
1. H. H. Ali and L. E. Kadhum, ‘K-Means Clustering Algorithm Applications in Data Mining and Pattern
Recognition’, Int. J. Sci. Res., vol. 6, no. 8, pp. 1577–1584, 2017.
2. A. Patel, ‘Customer Personality Analysis’, Kaggle, 2023. Customer Personality Analysis (kaggle.com)
