1.3 Contribution
The main contribution of this paper is an experimental combination of machine learning in the R language, in particular the hierarchical agglomerative clustering algorithm, with several ancillary techniques. The cloud environment helps to solve the problem of large data volumes. The results will show that this method is suitable for practical implementation.
The remainder of the paper is organized as follows: Section 2 describes data collection, environment preparation and data normalization techniques. Section 3 then covers the implementation using the hierarchical agglomerative clustering algorithm. Finally, Section 4 concludes the paper and provides some of the authors' perspectives.
2. DATA COLLECTION AND PREPARATION
A Google Cloud virtual machine with 26 GB of RAM and 4 vCPUs is used as the experiment environment. The operating system and frameworks are: Ubuntu 16.04, R language 3.5.2, Python 3.7.1, Java 1.8.
In this paper, a credit card dataset is used. The dataset [9] summarizes the usage behavior of about 9000 active credit card holders over 6 months. It includes 18 features, of which CREDIT_LIMIT, PURCHASES and PAYMENTS will be used to perform the customer segmentation task.
Before performing clustering, some preprocessing steps have to be performed to clean and rearrange the data. The first step is to check the sparsity and the amount of missing data. Many packages in the R language support these tasks. The aggr() function in the VIM package helps to list missing values visually [10]. The missing values in the Credit Card dataset (in the MINIMUM_PAYMENTS and CREDIT_LIMIT columns) are replaced by the median [11].
The second step is to check for outlier values. Treatment or change of outlier or extreme values in observations is not a standard procedure. However, such values can strongly bias or change the estimates and predictions. Therefore, it is necessary to deal with these exceptions [12]. Box plots are used to identify the outliers: any data point that lies outside the upper or lower fence lines is considered an outlier (see Figure 1). These extreme values are then trimmed out [11].
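The two preprocessing steps described in this section, median imputation and box-plot trimming, can be illustrated with a minimal sketch. It is written in Python (one of the frameworks listed in the environment) using only the standard library, and it is not the R code used in the experiments. The fence multiplier of 1.5 times the interquartile range is the usual box-plot convention and is an assumption here, as is the quartile interpolation method. The helper names and the toy values are hypothetical.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def tukey_fences(values, k=1.5):
    """Return (lower, upper) box-plot fences: Q1 - k*IQR and Q3 + k*IQR.

    k=1.5 is the conventional multiplier (an assumption, not from the paper).
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def trim_outliers(rows, column):
    """Drop rows whose value in `column` lies outside the fences."""
    lower, upper = tukey_fences([r[column] for r in rows])
    return [r for r in rows if lower <= r[column] <= upper]


# Toy values for illustration only; not taken from the Kaggle dataset.
credit_limit = [1000.0, 1200.0, None, 1500.0, 1100.0, 1300.0, 9000.0]
filled = impute_median(credit_limit)
rows = [{"CREDIT_LIMIT": v} for v in filled]
kept = trim_outliers(rows, "CREDIT_LIMIT")
```

In this toy example the missing entry is filled with the median of the six observed values, and the single extreme value falls above the upper fence and is dropped. Trimming on several columns would repeat trim_outliers once per column.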
Figure 2. Dissimilarity matrix.
Figure 4. Optimal number of clusters by Silhouette method.
The final goal of this study is customer segmentation, which serves as a basis for building marketing strategies. The distribution of observations across clusters is easily seen in the Scatter Plot Matrix (SPLOM) in Figure 7.
From the result above, we see that the first cluster is a group of customers with medium to high credit limits who do not make many purchases. This group accounts for a relatively large number of customers. Unfortunately, it does not seem to be a suitable target for marketing strategies. The second cluster comprises the customers with the lowest credit limits and smallest purchases, and hence they have low payments. Customers in the third cluster are those with high spending. Their credit limits range from low to high. The number of customers in this group is also quite large. A marketing strategy that targets this group might be effective.
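A reading of the clusters like the one above is usually backed by a per-cluster summary. The sketch below, in standard-library Python, computes the size of each cluster and the mean of the three features used in this study. The function name, the sample customers and their labels are hypothetical and serve only as an illustration; they are not results from the paper.

```python
from collections import defaultdict
from statistics import mean

FEATURES = ("CREDIT_LIMIT", "PURCHASES", "PAYMENTS")


def profile_clusters(rows, labels):
    """Return {cluster: {"size": n, feature: mean, ...}} for labelled rows."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    profiles = {}
    for label, members in sorted(groups.items()):
        profile = {"size": len(members)}
        for feature in FEATURES:
            profile[feature] = mean(m[feature] for m in members)
        profiles[label] = profile
    return profiles


# Made-up customers and cluster labels, for illustration only.
rows = [
    {"CREDIT_LIMIT": 8000, "PURCHASES": 200, "PAYMENTS": 900},
    {"CREDIT_LIMIT": 7000, "PURCHASES": 100, "PAYMENTS": 700},
    {"CREDIT_LIMIT": 1000, "PURCHASES": 50, "PAYMENTS": 100},
    {"CREDIT_LIMIT": 4000, "PURCHASES": 3000, "PAYMENTS": 2500},
]
labels = [1, 1, 2, 3]
profiles = profile_clusters(rows, labels)
```

Comparing the mean CREDIT_LIMIT, PURCHASES and PAYMENTS across clusters, together with the cluster sizes, gives the kind of evidence on which the interpretation of each group rests.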
Depending on the dendrogram, the number of clusters can be
adjusted to the dataset to produce more meaningful groups for
analysts.
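To make the effect of the cluster count concrete, the following is a minimal pure-Python sketch of agglomerative clustering that stops merging once k clusters remain, which corresponds to cutting the dendrogram at k groups. It assumes Euclidean distance and average linkage; neither choice is confirmed by this section. It is a naive illustration for small inputs, not the R implementation used in the experiments, and the sample points are invented.

```python
from itertools import combinations
from math import dist


def average_linkage(a, b, points):
    """Mean Euclidean distance between all cross-cluster point pairs."""
    total = sum(dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerative(points, k):
    """Merge the two closest clusters until only k remain; return labels."""
    clusters = [[i] for i in range(len(points))]  # start with singletons
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], points),
        )
        clusters[a].extend(clusters[b])  # a < b, so index a stays valid
        del clusters[b]
    labels = [0] * len(points)
    for label, members in enumerate(clusters, start=1):
        for i in members:
            labels[i] = label
    return labels


# Invented, already-scaled (CREDIT_LIMIT, PURCHASES, PAYMENTS) triples.
points = [(0.9, 0.1, 0.2), (0.8, 0.1, 0.1), (0.1, 0.0, 0.0),
          (0.2, 0.1, 0.1), (0.5, 0.9, 0.8), (0.6, 1.0, 0.9)]
labels_3 = agglomerative(points, k=3)
labels_2 = agglomerative(points, k=2)
```

With k=3 the three pairs of nearby points form separate groups. Lowering k to 2 merges the two low-purchase groups, which shows how the choice of cut changes the segments offered to analysts.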
4. CONCLUSION AND PERSPECTIVES
The strong rising trend of marketing services in recent years has given data analysts new tasks and challenges, requiring more advanced and complex analytical methods that also provide meaningful results for future strategic planning.
This study uses the hierarchical agglomerative clustering algorithm on a credit card dataset to perform customer segmentation. Based on the obtained results, analysts can promote appropriate marketing strategies that are more profitable. The proposed system includes a set of methods covering all stages from data preprocessing to result visualization. The drawback of this method is that it is quite slow and hardware dependent. Therefore, it is advisable to use a cloud environment for large datasets.
This paper can also serve as a reference for many fields in Data Analytics, for example, Bioinformatics [22,23,24] and Pattern Recognition [25,26].
5. REFERENCES
[1] Kettani, O., Ramdani, F., Tadili, B. (2014). An Agglomerative Clustering Method for Large Data Sets. International Journal of Computer Applications, 92(14), 975–8887.
[2] Rani, Y., Rohil, H. (2013). A Study of Hierarchical Clustering Algorithm. International Journal of Information and Computation Technology, 3(11), 1225–1232.
[3] Konstantinos, T., Antonios, C. (2010). Data Mining Techniques in CRM: Inside Customer Segmentation. 10.1002/9780470685815.ch7.
[4] Machauer, A., Morgner, S. (2001). Segmentation of bank customers by expected benefits and attitudes. International Journal of Bank Marketing, Vol. 19 Issue: 1, pp. 6-18.
[5] Kamthania, D., Pawa, A., Madhavan, S. (2018). Market Segmentation Analysis and Visualization using K-Mode Clustering Algorithm for E-Commerce Business. Journal of Computing and Information Technology, 26(1), 57–68.
[6] Wilks, D. S. (2011). Cluster Analysis. International Geophysics, 100, 603–616.
[7] Kaushik, M., Mathur, B. (2014). Comparative Study of K means and Hierarchical Clustering Techniques. International Journal of Software & Hardware Research in Engineering, 2(6), 93–98.
[8] Jianfu, L., Jianshuang, L., Huaiqing, H. (2011). A Simple and Accurate Approach to Hierarchical Clustering. Journal of Computational Information Systems, 7(7), 2577–2584.
[9] "Credit Card Dataset for Clustering | Kaggle." 2 Mar. 2018, https://www.kaggle.com/arjunbhasin2013/ccdata. Accessed 9 Jan. 2019.
[10] "Package 'VIM' - The R Project for Statistical Computing." 11 Apr. 2017, https://cran.r-project.org/web/packages/VIM/VIM.pdf. Accessed 9 Jan. 2019.
[11] Jianfu, L., Jianshuang, L., Huaiqing, H. (2011). A Simple and Accurate Approach to Hierarchical Clustering. Journal of Computational Information Systems, 7(7), 2577–2584.
[12] "Outlier Treatment With R | Multivariate Outliers - r-statistics.co." http://r-statistics.co/Outlier-Treatment-With-R.html. Accessed 9 Jan. 2019.
[13] Bezdek, J. C., Hathaway, R. J. (n.d.). VAT: a tool for visual assessment of (cluster) tendency. Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN’02 (Cat. No.02CH37290), 2225–2230.
[14] Kassambara, A. 2015. Practical Guide To Cluster Analysis in R. pp. 1–38, 2015.
[15] Kodinariya, T. M., Makwana, P. R. (2013). Review on determining number of Cluster in K-Means Clustering. International Journal of Advance Research in Computer Science and Management Studies, 1(6), 2321–7782.
[16] Kaufman L., Rousseeuw P. J., Finding Groups in Data: An Introduction to Cluster Analysis, Wiley Series in Probability and Statistics, Wiley, 2009.
[17] Tibshirani, R., Walther, G., Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 63(2), 411–423.
[16] Kassambara, A. 2015. Practical Guide To Cluster Analysis in R. pp. 1–38, 2015.
[17] Package 'VIM' - The R Project for Statistical Computing. 2017, https://cran.r-project.org/web/packages/VIM/VIM.pdf. Accessed: 2018-12-01.
[18] Base package | R Documentation. https://www.rdocumentation.org/packages/base/versions/3.5.1. Accessed: 2018-12-01.
[19] Package 'stringr'. https://cran.r-project.org/web/packages/stringr/stringr.pdf. Accessed: 2018-12-01.
[20] Package 'standardize' - The R Project for Statistical Computing. https://cran.r-project.org/web/packages/standardize/standardize.pdf. Accessed: 2018-12-01.
[21] Package 'factoextra' - The R Project for Statistical Computing. https://cran.r-project.org/web/packages/factoextra/factoextra.pdf. Accessed: 2018-12-01.
[22] Hung, P. D., Hanh, T. D., and Diep, V. T. (2018). Breast Cancer Prediction using Spark MLlib and ML packages. ICBRA, 5th International Conference on Bioinformatics Research and Applications, 12, 2018, Hong Kong. https://doi.org/10.1145/3309129.3309133.
[23] Hung, P. D. (2018). Detection of Central Sleep Apnea based on a single-lead ECG. ICBRA, 5th International Conference on Bioinformatics Research and Applications, 12, 2018, Hong Kong. https://doi.org/10.1145/3309129.3309132.
[24] Hung, P. D. (2018). Central Sleep Apnea Detection Using an Accelerometer. Proceedings of the 2018 International Conference on Control and Computer Vision (ICCCV '18). ACM, New York, NY, USA, 106-111. DOI: https://doi.org/10.1145/3232651.3232660.
[25] Nam, N. T., Hung, P. D. 2018. Pest detection on Traps using Deep Convolutional Neural Networks. Proceedings of the 2018 International Conference on Control and Computer Vision (ICCCV '18). ACM, New York, NY, USA, 33-38. DOI: https://doi.org/10.1145/3232651.3232661.
[26] Hung, P. D., Linh, D. Q. 2019. Implementing an Android Application for Automatic Vietnamese Business Card Recognition. Pattern Recognition and Image Analysis, ISSN 1054-6618 29 (1), 203-213.