
Customer Segmentation Using Hierarchical Agglomerative Clustering


Phan Duy Hung, FPT University, Hanoi, Vietnam, hungpd2@fe.edu.vn
Nguyen Thi Thuy Lien, FPT University, Hanoi, Vietnam, liennttmse0096@fpt.edu.vn
Nguyen Duc Ngoc, FPT University, Hanoi, Vietnam, ngocndmse0093@fpt.edu.vn

ABSTRACT
Customer segmentation plays an important role in customer relationship management. It allows companies to design and establish different strategies to maximize the value of customers. Customer segmentation refers to grouping customers into different categories based on shared characteristics such as age, location, spending habits and so on. Similarly, clustering means putting things together in such a way that similar types of things remain in the same group. In this study, a machine learning (ML) hierarchical agglomerative clustering (HAC) algorithm is implemented in the R programming language to perform customer segmentation on a credit card dataset in order to determine appropriate marketing strategies.

CCS Concepts
• Applied computing → Electronic data interchange
• Information systems → Clustering and classification.

Keywords
Customer segmentation; Clustering; Agglomerative Hierarchical; Machine learning.

1. INTRODUCTION
1.1 Problem and motivation
Data mining and analysis allows the extraction of knowledge from historical data, and thus the formulation of a background for predicting future outcomes. In data mining, agglomerative clustering algorithms are widely used because of their flexibility and conceptual simplicity [1]. They work by grouping data objects into a tree of clusters [2].

The purpose of customer segmentation is to divide the user base into smaller groups that can be targeted with specialized content and offers [3]. Customer segmentation enables companies to address each specific group of customers in the most efficient way. For decades, investors have relied on customer segmentation models built around basic demographic data such as age, income, education, gender and so on. In fact, these are poor predictors of consumer behavior. Instead, banks and credit unions should segment consumers by their level of digital sophistication and financial acumen [4].

Artificial neural networks and k-means clustering are widely used for market segmentation [5]. The k-means algorithm is simple and efficient; for small values of k it behaves like a linear-time algorithm and has good performance. On the other hand, hierarchical algorithms provide better quality clusters than k-means. In particular, for small datasets where performance is not critical, hierarchical clustering is the superior method [7].

1.2 Related works
There is much research on hierarchical agglomerative clustering. In [8], Li et al. propose a Q-criterion based hierarchical clustering algorithm named HACNJ. HACNJ has the same complexity as basic hierarchical clustering, with a time complexity of O(n³) and a space complexity of O(n²), which are prohibitive when dealing with large datasets. Through experiments on the Iris dataset, they verified that HACNJ is effective.

Yogita Rani and Harish Rohil in [2] presented a detailed discussion of some improved hierarchical clustering algorithms. These algorithms are intended to overcome the limitations of pure hierarchical clustering algorithms. The mentioned algorithms include CURE (Clustering Using REpresentatives), BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), ROCK (RObust Clustering using linKs), the CHAMELEON algorithm, linkage algorithms, Leaders-Subleaders and Bisecting k-means.

In [5], Kamthania et al. proposed a model for formulating business strategies based on users' interests and location. They used the Principal Component Analysis (PCA) technique followed by the k-mode clustering algorithm. Furthermore, the geographical location was fetched from an e-commerce website for data visualization. The paper proposed a complete and simplified system, from data pre-processing to visualization, suitable for small businesses. This system identifies the popularity of each product over a time period and, from there, targets the potential customers. In the future, the architecture proposed by the authors can be extended to integrate with real-time analytical tools, and the proposed system can be scaled up by adopting HDFS and MapReduce techniques to build a fully-fledged production system.

These studies have mentioned some aspects of clustering, machine learning, and market segmentation, but they all have some limitations when put into practice or when applied to other defined datasets. Besides, there are other possible approaches worth taking a look at. In this paper, with a small dataset, we implement customer segmentation using a hierarchical clustering algorithm.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICISS 2019, March 16-19, 2019, Tokyo, Japan
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6103-3/19/03…$15.00
https://doi.org/10.1145/3322645.3322677

1.3 Contribution
The main contribution of this paper is an experimental combination of machine learning in the R language, in particular the hierarchical agglomerative clustering algorithm, with several ancillary techniques. A cloud environment helps to solve the problem of large data volumes. The results will show that this method is suitable for practical implementation.

The remainder of the paper is organized as follows: Section 2 describes data collection, environment preparation and data normalization techniques. Section 3 then covers the implementation using the hierarchical agglomerative clustering algorithm. Finally, Section 4 concludes the paper and provides some of the authors' perspectives.

2. DATA COLLECTION AND PREPARATION
A Google Cloud virtual machine with 26 GB of RAM and 4 vCPUs is used as the experimental environment. The operating system and frameworks are: Ubuntu 16.04, R 3.5.2, Python 3.7.1 and Java 1.8.

In this paper, a credit card dataset is used. The dataset [9] summarizes the usage behavior of about 9000 active credit card holders over 6 months. The data includes 18 features, of which CREDIT_LIMIT, PURCHASES and PAYMENTS are used for the customer segmentation task.

Before performing clustering, some preprocessing steps have to be carried out to clean and rearrange the data. The first step is checking the sparsity and the amount of missing data. There are many packages in the R language that support these tasks; the aggr() function in the VIM package helps to list missing values visually [10]. The missing values in the credit card dataset (in the MINIMUM_PAYMENTS and CREDIT_LIMIT columns) are replaced by the median [11].

The second step is to check for outlier values. Treatment or modification of outliers or extreme values in observations is not a standard procedure; however, outliers can strongly bias or change the estimates and predictions, so it is necessary to deal with these exceptions [12]. Box plots are used to identify the outliers: any data point lying outside the upper or lower fence lines is considered an outlier (see Figure 1). These extreme values are then trimmed out [11].

Figure 1. Trimming outliers: (a) data with outliers; (b) data after trimming outliers.
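The preprocessing described above can be sketched in a few lines of R. This is only an illustrative sketch: the file name, the object names (cc, cc_sel, cc_trim, cc_scaled), the restriction to the three selected columns and the final scale() call are assumptions made for this illustration, not code taken from the paper.

library(VIM)

# Load the Kaggle credit card dataset (file name is an assumption)
cc <- read.csv("CC GENERAL.csv", stringsAsFactors = FALSE)

# Keep the three features used for segmentation
cc_sel <- cc[, c("CREDIT_LIMIT", "PURCHASES", "PAYMENTS")]

# Visualize the amount and pattern of missing data
aggr(cc_sel, numbers = TRUE, sortVars = TRUE)

# Replace missing values in each column by that column's median
for (col in names(cc_sel)) {
  med <- median(cc_sel[[col]], na.rm = TRUE)
  cc_sel[[col]][is.na(cc_sel[[col]])] <- med
}

# Trim observations outside the box plot fences (1.5 * IQR rule)
inside_fences <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  x >= q[1] - 1.5 * iqr & x <= q[2] + 1.5 * iqr
}
cc_trim <- cc_sel[Reduce(`&`, lapply(cc_sel, inside_fences)), ]

# Standardize the variables before computing distances (assumed normalization step)
cc_scaled <- scale(cc_trim)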


3. IMPLEMENTATION
3.1 Assessing Clustering Tendency
A fundamental problem in cluster analysis is that clustering methods can return clusters even when the dataset contains no meaningful clusters. Hence, assessing the clustering tendency, i.e. the feasibility of the clustering analysis, is important to evaluate whether the dataset is clusterable.

In [13], Bezdek and Hathaway developed the visual assessment of cluster tendency (VAT) technique. This technique can be performed in R quite easily. First, compute the dissimilarity matrix between observations using the dist() function. Next, use the fviz_dist() function in the factoextra package to display the dissimilarity matrix. The result of applying this technique to the selected dataset is shown in Figure 2. Red indicates low dissimilarity and blue indicates high dissimilarity; the figure confirms that the credit card dataset is clusterable.

3.2 Determining the Optimal Number of Clusters
In [14], Kassambara describes three different methods for determining the optimal number of clusters, including direct methods and statistical testing methods.

Elbow method: The oldest is the Elbow method, a visual method. The idea is to start with k = 2 and keep increasing k by 1 at each step, computing the total within-cluster sum of squares (WSS). The chosen value of k is the point at which the WSS drops dramatically; after that point the curve reaches a plateau as k increases further [15]. Figure 3 shows the optimal number of clusters obtained with the Elbow method.
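A minimal sketch of these two steps with the factoextra package is shown below; it assumes the scaled data from the preprocessing sketch above is stored in cc_scaled, a name introduced in this article for illustration only.

library(factoextra)

# Dissimilarity matrix between observations (Euclidean distance by default)
d <- dist(cc_scaled)

# VAT-style ordered heat map of the dissimilarity matrix (Figure 2)
fviz_dist(d, show_labels = FALSE)

# Elbow method: total within-cluster sum of squares for k = 1..10,
# using hierarchical clustering (hcut) as the clustering function;
# method can also be "silhouette" or "gap_stat"
fviz_nbclust(cc_scaled, FUNcluster = hcut, method = "wss", k.max = 10)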

Figure 2. Dissimilarity matrix.

Figure 3. Elbow optimal clustering.

Average silhouette method: This method measures the quality of a clustering by determining how well each object lies within its cluster. A clustering is good if it has a high average silhouette width. Figure 4 shows the visualized result of the Average silhouette method. The optimal number of clusters is 2, the point with the highest average silhouette width.

Figure 4. Optimal number of clusters by the Silhouette method.

Gap statistic method: The third method is the Gap statistic method, introduced by Tibshirani et al. [17]. This method compares the total intra-cluster variation for different values of k with their expected values under a null reference distribution of the data (i.e. a distribution with no obvious clustering). Figure 5 shows the result of the Gap statistic method computed in R.

Figure 5. Optimal number of clusters by the Gap statistic method.

These methods can be performed in R using the fviz_nbclust() function. In the case of the credit card dataset, the Average silhouette method suggests 2 clusters while the Elbow method and the Gap statistic method suggest 3 clusters. Hence, 3 is chosen as the number of clusters at which to cut the dendrogram.

3.3 Dendrogram
Finally, agglomerative hierarchical clustering analysis is performed using R's hclust() function. The resulting dendrogram is shown in Figure 6 below.
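Continuing the sketch above, the final clustering and the cut into three clusters might look as follows; the ward.D2 linkage is an assumption, since the paper does not state which linkage criterion was used.

# Agglomerative hierarchical clustering on the dissimilarity matrix
# (the linkage method is assumed here; the paper does not specify it)
hc <- hclust(d, method = "ward.D2")

# Dendrogram (Figure 6)
plot(hc, labels = FALSE, hang = -1, main = "Cluster dendrogram")

# Cut the tree into the 3 clusters suggested by the Elbow and Gap statistic methods
clusters <- cutree(hc, k = 3)
table(clusters)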

The final goal of this study is customer segmentation, which serves as a basis for building marketing strategies. The distribution of observations across the clusters is easily seen in the scatter plot matrix (SPLOM) in Figure 7.

From the results, the first cluster is a group of customers with medium to high credit limits who do not make many purchases. This group accounts for a relatively large number of customers but, unfortunately, does not seem to be a suitable target for marketing strategies. The second cluster comprises the customers with the lowest credit limits and the smallest purchases, and hence they have low payments. Customers in the third cluster are those with high spending; their credit limits range from low to high. The number of customers in this group is also quite large, so a marketing strategy that targets this group might be effective.

Depending on the dendrogram, the number of clusters can be adjusted to produce more meaningful groups for analysts.

Figure 6. Cluster dendrogram.

Figure 7. Scatter plot matrix (SPLOM) of the clusters.
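A scatter plot matrix like the one in Figure 7 can be produced, for example, with the lattice package; this sketch reuses the cc_trim and clusters objects introduced in the earlier sketches, names assumed here rather than taken from the paper.

library(lattice)

# Scatter plot matrix of the three features, colored by cluster assignment
splom(~cc_trim, groups = factor(clusters),
      auto.key = list(columns = 3, title = "Cluster"),
      pch = 20, cex = 0.4)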

4. CONCLUSION AND PERSPECTIVES
The strong growth of marketing services in recent years has given data analysts new tasks and challenges, requiring more advanced and complex analytical methods that also provide meaningful results for future strategic planning.

This study uses the hierarchical agglomerative clustering algorithm on a credit card dataset to perform customer segmentation. Based on the obtained results, analysts can promote appropriate marketing strategies that are more profitable. The proposed system includes a set of methods covering all stages from data preprocessing to result visualization. The drawback of this method is that it is quite slow and hardware dependent; therefore, it is advisable to use a cloud environment for large datasets.

This paper can also serve as a reference for many fields in data analytics, for example bioinformatics [22, 23, 24] and pattern recognition [25, 26].

5. REFERENCES

[1] Kettani, O., Ramdani, F., Tadili, B. (2014). An Agglomerative Clustering Method for Large Data Sets. International Journal of Computer Applications, 92(14), 975–8887.
[2] Rani, Y., Rohil, H. (2013). A Study of Hierarchical Clustering Algorithm. International Journal of Information and Computation Technology, 3(11), 1225–1232.
[3] Konstantinos, T., Antonios, C. (2010). Data Mining Techniques in CRM: Inside Customer Segmentation. 10.1002/9780470685815.ch7.
[4] Machauer, A., Morgner, S. (2001). Segmentation of bank customers by expected benefits and attitudes. International Journal of Bank Marketing, 19(1), 6–18.
[5] Kamthania, D., Pawa, A., Madhavan, S. (2018). Market Segmentation Analysis and Visualization using K-Mode Clustering Algorithm for E-Commerce Business. Journal of Computing and Information Technology, 26(1), 57–68.
[6] Wilks, D. S. (2011). Cluster Analysis. International Geophysics, 100, 603–616.
[7] Kaushik, M., Mathur, B. (2014). Comparative Study of K-means and Hierarchical Clustering Techniques. International Journal of Software & Hardware Research in Engineering, 2(6), 93–98.
[8] Jianfu, L., Jianshuang, L., Huaiqing, H. (2011). A Simple and Accurate Approach to Hierarchical Clustering. Journal of Computational Information Systems, 7(7), 2577–2584.
[9] Credit Card Dataset for Clustering | Kaggle. 2 Mar. 2018, https://www.kaggle.com/arjunbhasin2013/ccdata. Accessed 9 Jan. 2019.
[10] Package 'VIM' - The R Project for Statistical Computing. 11 Apr. 2017, https://cran.r-project.org/web/packages/VIM/VIM.pdf. Accessed 9 Jan. 2019.
[11] Jianfu, L., Jianshuang, L., Huaiqing, H. (2011). A Simple and Accurate Approach to Hierarchical Clustering. Journal of Computational Information Systems, 7(7), 2577–2584.
[12] Outlier Treatment With R | Multivariate Outliers - r-statistics.co. http://r-statistics.co/Outlier-Treatment-With-R.html. Accessed 9 Jan. 2019.
[13] Bezdek, J. C., Hathaway, R. J. (2002). VAT: a tool for visual assessment of (cluster) tendency. Proceedings of the 2002 International Joint Conference on Neural Networks, IJCNN'02 (Cat. No.02CH37290), 2225–2230.
[14] Kassambara, A. (2015). Practical Guide To Cluster Analysis in R, pp. 1–38.
[15] Kodinariya, T. M., Makwana, P. R. (2013). Review on determining number of Cluster in K-Means Clustering. International Journal of Advance Research in Computer Science and Management Studies, 1(6), 2321–7782.
[16] Kaufman, L., Rousseeuw, P. J. (2009). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics, Wiley.
[17] Tibshirani, R., Walther, G., Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 63(2), 411–423.
[18] Base package | R Documentation. https://www.rdocumentation.org/packages/base/versions/3.5.1. Accessed 1 Dec. 2018.
[19] Package 'stringr'. https://cran.r-project.org/web/packages/stringr/stringr.pdf. Accessed 1 Dec. 2018.
[20] Package 'standardize' - The R Project for Statistical Computing. https://cran.r-project.org/web/packages/standardize/standardize.pdf. Accessed 1 Dec. 2018.
[21] Package 'factoextra' - The R Project for Statistical Computing. https://cran.r-project.org/web/packages/factoextra/factoextra.pdf. Accessed 1 Dec. 2018.
[22] Hung, P. D., Hanh, T. D., Diep, V. T. (2018). Breast Cancer Prediction using Spark MLlib and ML packages. ICBRA, 5th International Conference on Bioinformatics Research and Applications, Hong Kong. https://doi.org/10.1145/3309129.3309133.
[23] Hung, P. D. (2018). Detection of Central Sleep Apnea based on a single-lead ECG. ICBRA, 5th International Conference on Bioinformatics Research and Applications, Hong Kong. https://doi.org/10.1145/3309129.3309132.
[24] Hung, P. D. (2018). Central Sleep Apnea Detection Using an Accelerometer. Proceedings of the 2018 International Conference on Control and Computer Vision (ICCCV '18), ACM, New York, NY, USA, 106–111. https://doi.org/10.1145/3232651.3232660.
[25] Nam, N. T., Hung, P. D. (2018). Pest detection on Traps using Deep Convolutional Neural Networks. Proceedings of the 2018 International Conference on Control and Computer Vision (ICCCV '18), ACM, New York, NY, USA, 33–38. https://doi.org/10.1145/3232651.3232661.
[26] Hung, P. D., Linh, D. Q. (2019). Implementing an Android Application for Automatic Vietnamese Business Card Recognition. Pattern Recognition and Image Analysis, ISSN 1054-6618, 29(1), 203–213.

