Final Examination of Multivariate Statistics-03 (2021)

European Bank Credit Card Holders Segmentation for Customer Retention
using Principal Component Analysis and K-Means Clustering
Diva T. Mughni, Fakhira F. Putri, Rifa Salsabila, Victoria E.B. Dianita, Yohanna G. Christie
Department of Industrial Engineering, Faculty of Engineering, Universitas Indonesia, Depok 16424,
Indonesia
Abstract: This paper applies Principal Component Analysis and K-Means Clustering to segment European bank credit card holders and to describe the characteristics of each segment, so that an appropriate treatment strategy can be devised for each. The secondary data come from a recorded system of European credit card users that spans several banks, and the data were processed with the R software. The analysis yields three clusters: Bourjois, Jane and Joes, and The Red Flags. Bourjois is the cluster with an outstanding transaction history, a large credit limit, and a record of full payment. Jane and Joes are customers who make ordinary purchases with full payment and have an average credit limit and a moderate payment period. The Red Flags are customers with a fairly small credit limit, a lower percentage of full payment, and a shorter payment period. For customer retention purposes, it is recommended that banks focus on Bourjois, who are the most attractive prospects.
Keywords: Principal Component Analysis, K-Means Clustering, Customer Segmentation

1. Introduction
These days, people tend to be consumptive in order to keep up with trends, so it is not surprising that many people are in debt because their expenses exceed their income. Many ways of obtaining a loan are available, and one of the most popular is the credit card. In 2017, Canada was the country with the most credit card users in the world, and America ranked 6th (Statista, 2017). The situation is different in European countries, which have a contrasting culture of choosing a form of credit. Even though Europeans have nearly the same level of household debt as Americans, their use of credit cards is relatively low. This may be because many European startups, such as Klarna, provide point-of-sale financing that allows customers to pay for their purchases in installments. As a result, customers prefer to take purchase-by-purchase loans rather than maintain a revolving line of credit.
Not only is the number of credit card users low, but the value of credit card transactions in the UK has also been decreasing every year. The rocketing popularity of contactless payment methods as a replacement for cash makes people less interested in using credit cards as their main means of payment. In particular, the £45 spending limit on contactless payments, a preventive measure to protect customers from fraud, makes European citizens feel more secure in carrying out their daily transactions.
Against this background, this research is conducted to find clusters of credit card holders using data collected by a recorded system of European credit card users that spans several banks, which was published on Kaggle in 2017. Secondary data are used because time and place constraints prevented the collection of more recent data. The research aims to categorize customers of European bank companies into segments that define the basic customer characteristics, where customers are viewed as members of relatively homogeneous groups described by common characteristics relevant to marketing.

2. Literature Review
The variables chosen for the study, which include the transaction behaviour of the customers, are adjusted according to the study objective of retaining existing bank customers. The adjustment is informed by an assessment of the variables in the existing dataset and by previous clustering literature within the banking industry with different settings, datasets, and clustering purposes.
Birir et al. (2020) applied principal component analysis and k-means clustering to cluster bank customers in Africa for the purpose of marketing to prospective bank customers. There are 60 variables in total, which contain the purchase information of customers alongside the purchase location and the merchant groups associated with the purchase.
Rao et al. (2010) used k-means clustering and Bayesian classification for bank customer retention initiatives. There are seven variables, which include the inflow and outflow of the transactions as well as the personal information of the customers.
Farajian et al. (2010) combined k-means clustering and the apriori algorithm to group bank customers and create customer profiles in order to gain more competitive advantage in the banking industry. The study uses 11 variables that contain the transaction information and the customer profile.
Leung (2009) suggested the use of k-means clustering to group bank customers for credit card scoring purposes. The study uses a total of 10 variables that consist of the transaction history and the tenure.
Chang et al. (2018) employed clustering with a bottom-up method to understand the transaction patterns and existing demographic information in support of marketing strategies in a Chinese bank. The 18 variables used include demographic information and transaction behaviour.


3. Methods

3.1 Research Design
This research uses a dataset from European banks covering the credit card usage of 8950 registered users over six months in 2017. The data were recorded by the banks' system. The dataset consists of 17 variables: balance, balance frequency, purchases, one-off purchases, installment purchases, cash advance, purchases frequency, one-off purchases frequency, purchase installment frequency, cash advance frequency, cash advance transactions, purchases transactions, credit limit, payments, minimum payments, percentage of full payment, and tenure.
Before the analysis, the data must be checked for missing values. In this dataset, all missing values were replaced with the mean of their respective columns. The next step is to detect outliers in each variable. All variables except TENURE contain outliers. These outliers were replaced with the first and third quartiles of each variable because the residuals are not normally distributed (Appendix 1). TENURE shows no outliers because its values lie close together in its histogram (Appendix 2).
Before performing PCA, the variables must be standardized so that differences in scale, dispersion, and mean do not affect the clustering results. Each variable is therefore rescaled to a mean of zero and a standard deviation of one. Once standardized, the variables are ready for the next stage.
The method used in this research is K-means clustering, because the study focuses on examining the clusters rather than on comparing a wide array of clustering methods. In addition, the dataset contains more than 1,000 observations, and its outliers have already been treated. K-means clustering partitions the data into a pre-set number of clusters and then iterates to assign the objects to clusters.

3.2 Checking Assumption of Cluster Analysis
Before clustering, the dataset must satisfy the assumptions of cluster analysis. The dataset consists of 8950 credit card users and contains no outliers, so it satisfies the first requirement: the sample is representative of its population. Second, there must be no multicollinearity between variables, which can be checked with the VIF test. A variable with 5<VIF<10 indicates problematic collinearity. The VIF test shows nine variables with 5<VIF<10, and the correlation matrix shows high correlation between variables. Component analysis must therefore be applied to these data to reduce the impact of multicollinearity.

3.3 Checking Assumption of Component Analysis
Component analysis requires that the dataset be sufficient for the analysis. To determine whether it is, this research conducts the Bartlett test of sphericity and the KMO test. The hypotheses of the Bartlett test are:
H0 : Variables are uncorrelated (the correlation matrix equals the identity matrix)
H1 : Variables are correlated (the correlation matrix differs from the identity matrix)
Reject H0 if p value < 0.05. Meanwhile, the KMO test is used to determine the level of correlation between variables as well as whether the dataset is sufficient for component analysis. The hypotheses are:
H0 : Data is sufficient to do component analysis
H1 : Data is not sufficient to do component analysis
Reject H0 if overall MSA < 0.50. After all assumptions have been fulfilled, we can carry out factor analysis using Principal Component Analysis (PCA).

3.4 Identification of Principal Component Analysis
To select the optimal number of Principal Components (PCs) to retain, we can use either the percentage of variance criterion or the scree test criterion. The percentage of variance criterion is an approach based on achieving a specified cumulative percentage of total variance extracted by successive factors. The purpose is to ensure practical significance for the derived factors by ensuring that they explain at least a specified amount of variance. No absolute threshold has been adopted for all applications. In this study, we treat at least 95 percent of variance as satisfactory. The cumulative percentage of the total variance of the PCs is computed using RStudio.

3.5 Visualization of Variable’s Loadings using Dot Plots
PCA is a tool for graphical analysis and visualization of the data, and it can reveal important insights. To better understand the customer segmentation, we visualize the variables' loadings.
We construct dotplots that display the values of the principal component loadings. A loading describes the relationship between an original variable and a new principal component. The factor-loading matrix contains the factor loading of
each variable on each factor. From the loadings plots, we identify the components of each of the 10 PCs, taking factor loadings of ±0.50 as generally necessary for practical significance.

4. Result and Discussion

4.1 Multicollinearity
The VIF test indicates multicollinearity between the variables: 9 variables have 5<VIF<10.
A correlation matrix was then used to show the high correlation between variables. Values of 1 and -1 indicate high correlation (solid blue and red), while a value of 0 indicates low correlation between variables (white).

4.2 Principal Component Analysis Assumptions Checking

4.2.1 Bartlett Test of Sphericity
The reported p value is 0, which means that the p value is very small (p < 0.001). We therefore reject the null hypothesis, which indicates that the variables are correlated.

4.2.2 Measure of Sampling Adequacy
Because the overall MSA and the MSA of every variable are greater than 0.5, H0 is not rejected: the data are sufficient for component analysis. As an additional insight, the overall MSA of 0.72 places the correlation between variables at the average level (0.7<KMO<0.8). It can therefore be concluded that the data are sufficient for component analysis.

4.3 Principal Components Identification
From the plot of the cumulative percent of variance, we can see that PC1, PC2, PC3, PC4, PC5, PC6, PC7, PC8, PC9, and PC10 explain about 95.28% of the variance in the data. The greater the variance explained by a PC, the more information is summarized by that PC. The variance explained by each of the five principal components is shown in the table below:
Next, we can use the scree plot to select the optimal number of PCs, as shown in the figure above. In the plot, we look for an elbow. There is an elbow after approximately the tenth principal component, so 10 PCs seem to be the optimal number. Since both the percentage of variance criterion and the scree plot criterion indicate that ten principal components should be retained for further analysis, we use 10 as the optimal number of PCs. The dataset initially had 17 variables; applying PCA reduces it to 10 components.
We generated the dotplots of PC1, PC2, PC3, PC4, PC5, PC6, PC7, PC8, PC9, and PC10 as shown in Figure 2 below. From the dot plots, we can identify the components of each of the 10 PCs as follows:
Component of PC5:
● Percent of full payment paid by user
Component of PC6:
● Tenure of credit card service for user
Component of PC7:
● Limit of Credit Card for user
Component of PC9:
● Amount of Payment done by user
Component of PC10:
● Minimum amount of payments made by user
● Percent of full payment paid by user

There is no component that can be identified and considered significant from PC1, PC2, PC3, PC4, and PC8.
The retained principal components are also shown to be valid by the sample-split validation result. The split samples yield equivalent loadings, which signifies the stability of the rotation result regardless of the sample used. Hence, having 5 principal components that include five variables is appropriate for the study.
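The two retention rules used in this section (a cumulative variance threshold of 95 percent and an absolute loading of at least 0.50) can be sketched as follows. This is a minimal illustration in Python rather than the R workflow used in the study. The function names are hypothetical, and the eigenvalues and loadings are made-up numbers chosen for readability, not the results of this paper.

```python
from itertools import accumulate


def components_to_retain(eigenvalues, threshold=0.95):
    """Return (k, shares): the smallest k whose leading PCs reach the threshold.

    `eigenvalues` are the PC variances sorted in descending order.
    """
    running = list(accumulate(eigenvalues))      # cumulative variance
    total = running[-1]
    shares = [value / total for value in running]
    for k, share in enumerate(shares, start=1):
        if share >= threshold:
            return k, shares
    return len(shares), shares


def significant_loadings(loadings, cutoff=0.50):
    """Map each PC to the variables whose absolute loading meets the cutoff."""
    return {
        pc: [var for var, value in by_var.items() if abs(value) >= cutoff]
        for pc, by_var in loadings.items()
    }


# Illustrative values only (assumed, NOT taken from the study).
eigenvalues = [4.0, 2.5, 1.5, 1.0, 0.6, 0.4]
n_components, shares = components_to_retain(eigenvalues)

loadings = {
    "PC1": {"CREDIT_LIMIT": 0.31, "TENURE": -0.12},
    "PC2": {"CREDIT_LIMIT": 0.72, "TENURE": 0.05},
    "PC3": {"CREDIT_LIMIT": -0.10, "TENURE": -0.88},
}
significant = significant_loadings(loadings)

print(n_components)   # number of PCs needed to reach 95% of the variance
print(significant)    # variables that pass the 0.50 cutoff on each PC
```

With these assumed numbers, the first four components explain 90 percent of the variance and the first five explain 96 percent, so five are retained. A component such as the hypothetical PC1, on which no variable reaches the cutoff, is left without an identified component, which mirrors the situation reported for PC1, PC2, PC3, PC4, and PC8.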
4.4 Bank Customer Clusters for Customer Retention
Three clusters account for most of the variability in the total within-cluster sum of squares, as shown by the elbow. The clusters are named Bourjois, Jane & Joes, and The Red Flags, each with distinct characteristics based on the mean and mean centroid results.
Cluster Bourjois has relatively higher means on all of the variables, namely credit limit, payments, minimum payments, PRC full payment, and tenure. This represents a customer segment with an outstanding transaction history: customers who have large credit limits, tend to purchase with full payment, and prefer to purchase in installments with a fairly long payment period.
Cluster Jane & Joes is interpreted as the customer segment with a good transaction history and potential that can still be maximized. These customers commonly purchase with full payment and have an average credit limit with a moderate payment period. Meanwhile, cluster The Red Flags is characterized by an inadequate transaction history: customers who have a fairly small credit limit and a lower percentage of full payment, and who prefer a shorter payment period.
The distinct characteristics of the clusters are also shown by the cluster visualization, which displays the unique centroid of each cluster (Fig. 6).
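The elbow criterion used to choose the number of clusters can be sketched as follows, together with the average silhouette width that Section 4.5 uses for validation. This is a minimal illustration in Python, not the R code of the study. It assumes plain Lloyd's algorithm with a naive deterministic seeding (R's default k-means variant may differ), and the twelve two-dimensional points are a hypothetical toy dataset with three obvious groups.

```python
import math


def kmeans(points, k, iterations=100):
    """Plain Lloyd's algorithm; returns (centroids, labels, wcss)."""
    # Naive deterministic seeding (an assumption): k points spread through the list.
    step = len(points) / k
    centroids = [points[int(i * step)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    wcss = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, wcss


def average_silhouette(points, labels):
    """Mean of s = (b - a) / max(a, b); needs at least two non-empty clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    widths = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


# Hypothetical toy data: three well-separated groups of four points each.
group_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
group_b = [(10, 10), (10, 11), (11, 10), (11, 11)]
group_c = [(0, 10), (0, 11), (1, 10), (1, 11)]
points = group_a + group_b + group_c

wcss_by_k = {}
silhouette_by_k = {}
for k in range(1, 5):
    _, labels, wcss = kmeans(points, k)
    wcss_by_k[k] = wcss
    if k >= 2:
        silhouette_by_k[k] = average_silhouette(points, labels)

for k, wcss in wcss_by_k.items():
    print(k, round(wcss, 2), silhouette_by_k.get(k))
```

On this toy data the within-cluster sum of squares falls sharply up to k = 3 and only slightly afterwards, which is the elbow, and the average silhouette width is higher for k = 3 than for k = 2. Real data rarely separate this cleanly, so the plot and the silhouette widths should be read together.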

Fig. 2 - Factor Loadings of the 10 optimal Principal Components


Fig. 4 - The Elbow Method Result

Fig. 5 - Mean and Mean Centroid of The Clusters

Fig. 6 - Clusters Visualization in Two and Three Dimensions

4.5 Data Validation
The silhouette method is used to measure how similar an object is to its own cluster compared with other clusters. The validation shows that the data are well clustered: the average silhouette width of every cluster lies in the range 0<x<1 (Fig. 7 & 8). Furthermore, the silhouette technique gives a more precise score and number of k for the k-means algorithm than the elbow method.

5. Conclusion
In this paper, we investigated the application of principal component analysis and K-means clustering to the segmentation of European bank credit card holders for customer retention. We identified active customers in order to apply appropriate marketing strategies to each homogeneous group.
As the results show, Principal Component Analysis (PCA), an unsupervised statistical technique, can be used for dimension reduction and yields 5 significant variables: PRC_FULL_PAYMENT, TENURE, MINIMUM_PAYMENT, CREDIT_LIMIT, and PAYMENT. The PCA reduces the number of variables in our dataset from 17 to 5 and reduces the multicollinearity between highly correlated variables in the dataset.
Moreover, K-means clustering, a nonhierarchical method, is employed together with the elbow method. The nonhierarchical method is preferred because of the focus on examining clusters, the large dataset of more than 1,000 observations, and the outliers that may cause concern. The final result of the K-means clustering is 3 homogeneous clusters, each with unique characteristics: the Bourjois cluster, with a high credit limit, amount of payment, and % full payment and a long tenure; the Jane & Joes cluster, with a medium credit limit, amount of payment, and % full payment and a moderate tenure; and The Red Flags, with a small credit limit, amount of payment, and % full payment and a short tenure.
For future research, the researchers recommend marketing that focuses on customer group 1, which prefers to purchase in installments with a fairly long payment period but is less interested in cash advances.

6. Recommendations for Marketing Strategies
Based on the results of our clustering analysis, we recommend the following marketing strategies to the credit card company for each customer segment:
● Customers in Group 1 - Bourjois
Customers in this group are recognized as having the most potential. They are worth offering higher reward treatments over a long period and need to be retained because this group has the largest number of customers.
● Customers in Group 2 - Jane & Joes
Customers in this group are considered a low-risk segment. It is advisable to offer them a lower interest rate on purchase transactions and a variety of special deals to encourage their consumption habits.
● Customers in Group 3 - The Red Flags
Customers in this group are considered the riskiest. Premium and loyalty cards would be beneficial for each potential transaction, with a focus on a small selection of discounted items.

Fig. 7 - Cluster Sizing

Fig. 8 - Silhouette Validation


References


Hair, J., Black, W., Babin, B., & Anderson, R. (2009). Multivariate data analysis (8th ed.).

Muhoza, E. (n.d.). Using Unsupervised Machine Learning Techniques for Behavioral-based Credit Card Users Segmentation in Africa. IEEE Xplore Full-text PDF. Retrieved December 13, 2021, from https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9142602 (Main Journal)

Jessica, Saritha, et al. (2010). Clustering Methods for Credit Card using Bayesian Rules based on K-means Classification. Retrieved December 13, 2021, from https://thesai.org/Downloads/Volume1No4/Paper%2016-Clustering%20Methods%20for%20Credit%20Card%20using%20Bayesian%20rules%20based%20on%20K-means%20classification.pdf

Chang, Min-Suk, et al. (2018). A Customer Segmentation Scheme Base on Big Data in a Bank. Retrieved December 13, 2021, from https://www.koreascience.or.kr/article/JAKO201808962641880.pdf

Bhasin, Arjun. (2017). Credit Card Dataset for Clustering. Retrieved December 3, 2021, from https://www.koreascience.or.kr/article/JAKO201808962641880.pdf (Data sources)

Foottit, Ian. (2019). The Future of Credit: A European Perspective. Retrieved December 3, 2021, from https://www2.deloitte.com/content/dam/Deloitte/uk/Documents/financial-services/deloitte-uk-mastercard-the-future-of-credit-a-european-perspective-2019-report-digital-updated.pdf

Farajian, Ali. (2019). Selecting Smart Strategies Based on Big Data Techniques and Space Matrix (FASE Model). Retrieved December 3, 2021, from https://www.researchgate.net/publication/331961376_Selecting_Smart_Strategies_Based_on_Big_Data_Techniques_and_SPACE_Matrix_FASE_model

Lilly, Chris. (2019). UK Credit Card Data and Statistics. Retrieved December 13, 2021, from https://www.merchantsavvy.co.uk/uk-credit-card-statistics/

Liu, Xian. (2021). Predicting Bank Failures: A Synthesis of Literatures and Directions for Future Research. Retrieved December 22, 2021, from https://www.mdpi.com/1911-8074/14/10/474/pdf
