
African Journal of Business Management Vol. 5 (20), pp. 8307-8312, 16 September, 2011


Available online at http://www.academicjournals.org/AJBM
DOI: 10.5897/AJBM11.476
ISSN 1993-8233 ©2011 Academic Journals

Full Length Research Paper

A comparative study of data mining techniques in predicting consumers’ credit card risk in banks

Ling Kock Sheng1* and Teh Ying Wah2

1Faculty of Computer Science and Information Technology, University of Malaya, 50603 Kuala Lumpur, Malaysia.
2Department of Information Science, Faculty of Computer Science and Information Technology,
University of Malaya, 50603 Kuala Lumpur, Malaysia.
Accepted 21 June, 2011

This paper investigates the use of batch and incremental classifiers, namely logistic regression, neural
networks, C5, naïve bayes updateable, IBk (instance-based learner, k nearest neighbour) and raced
incremental logit boost, to find the classifier that best improves the predictive accuracy of
consumers’ credit card risk for a bank in Malaysia. Before the models are generated for comparison, the
initial set of data is loaded into an ETL (extraction, transformation, loading) system developed to
perform feature selection, or attribute relevancy analysis, using the ID3 algorithm, compiling the subset
of data with the highest information gain and gain ratio. An extended test applies equal-length binning
to some attributes to find out whether it affects the relevancy of each attribute. The selected subset of
24 months of data is used to generate various data mining models using different training and testing
sizes and binning sizes. C5 emerged consistently as the technique that generated the best models,
with an average predictive accuracy as high as 94.68%. Sample sizes, equal-length binning sizes and
training and testing sizes are all shown to affect accuracy to different degrees.

Key words: Data mining techniques, predictive accuracy, incremental learning schemes.

*Corresponding author. E-mail: lingks99@yahoo.com.

INTRODUCTION

The credit card industry in Malaysia has undergone some major changes in the last decade as
competition mounted and credit risks escalated. The central bank has taken measures to tighten credit
spending by imposing taxes on any new credit card issued. The government foresees that such action is
needed to prevent any build-up of credit bubbles and the economic implications that would ensue should
the banking system collapse.
Banks are still very much attracted to the credit card business as it is one of the most profitable
services to engage in, though it is competitive. It can command lucrative margins, with interest charges
as high as 1.8% a month. It is therefore important for banks to manage their risks properly to maximize
margins. The risk mitigation process in banks is more often than not based on information that they can
find or mine from their historical database about borrowers and their tendency to default on payment.
Improving the predictive accuracy of customers’ prompt payment is essential in the drive to ensure that
the bank remains resilient.

Objectives of the study

This research seeks to discover the data mining tools and techniques commonly used in banks in
Malaysia. It also employs these commonly used tools and data mining techniques, including incremental
learning schemes, in an attempt to provide an improved classifier, ETL and data mining solution to help
the bank mitigate the risk in its portfolio.

LITERATURE REVIEW

Depending on the commercial uses of credit scoring, the methodology to construct credit scoring models
varies from bank to bank. It may involve, firstly, a sample of historical records classified as "good"
and "bad" (or as bad loss, bad profit and good risk, depending on the number of categories required)
according to their repayment
performance over a given period. Next, data could be obtained from internal or external sources, such as
credit bureau reports. Finally, statistical or other quantitative analysis is performed on the data to
derive a credit scoring model (Koh et al., 2006).
With the right credit scoring model, the bank can evaluate new or existing customer profiles accurately,
enabling it to minimize potential risks that might be looming. Such scoring models, together with the
information provided by CCRIS (central credit reference information system) from the Central Bank and
by other information service providers, form the basis on which credit rating systems were established.
Credit rating systems are used to categorize the risk worthiness of a person as high, medium or low.
This supports decisions on accepting, extending or rejecting any credit request.

Data mining tools, techniques and credit risks evaluation

Various data mining techniques have been employed by the banking and credit card industry in managing
credit risks. Sinha and Zhao (2008) use several techniques to examine performance on business problems,
mostly involving binary classification into two categories, that is, bankrupt or non-bankrupt, bad
credit or good credit, and others. In classification, a set of training data is used as input to build a
model describing the predetermined set of data classes. Once the predictive accuracy of the model is
acceptable, the model can be used to predict future data tuples (Fayyad et al., 1996; Lee, 2008).
Traditionally, logistic regression and discriminant analysis are the most widely used approaches to
create scoring models in the industry (Yang, 2007). There is, however, ample evidence of the use of other
data mining techniques in credit evaluation, such as bankruptcy prediction (Sung et al., 1999; Kim and
McLeod, 1999; Ryu and Yue, 2005), credit risk assessment (Doumpos et al., 2002) and credit evaluation
(Sinha and Zhao, 2008). The techniques employed include neural networks (Back et al., 1996; Jo and Han,
1997), logistic regression (Desai et al., 1996; Xiao et al., 2006) and decision trees (Koh et al., 2004).
The use of incremental learning schemes in credit scoring or credit evaluation is, however, less common.
Existing applications, which mostly use static models, fail to adapt when the environment or population
changes over time (Yang, 2007). The problem of updating the scoring model incrementally is not resolved.
Yang adopted the incremental kernel method to build an adaptive scoring system.

Predictive accuracy of a classifier

The accuracy of a classifier refers to the ability of a given classifier to correctly predict the class
label of new or previously unseen data. This can be estimated using one or more test sets (Han and
Kamber, 2006).
Among the data mining techniques covered in this study, that is, logistic regression, decision tree,
neural networks, naïve bayes updateable, IBk and raced incremental logit boost, each technique tends to
perform better than the rest under different circumstances. Decision trees had better predictive
accuracy in a study by Koh et al. (2004) on credit scoring, whereas neural networks had better
predictive accuracy in a study by Zurada and Lonial (2005) on bad debt recovery.
The prevalent use of credit cards in recent years has created a large stream of data for banks to
analyse. Incremental schemes are continually being explored to handle this development, as it is
becoming more difficult to work with batch classifiers, which require far more system resources and
effort. Changes in consumers’ spending patterns, prompted by changes in the economic landscape or vice
versa, have also brought about faster changes to the risk profile of this business.
The most notable applications of incremental learning are credit risk evaluation (Li and Xin, 2009),
credit scoring using the incremental kernel method (Yang, 2007) and network security using IBk and naïve
bayes updateable (Gandhi and Srivatsa, 2010).

ETL in data mining

According to Kimball and Caserta (2004), the ETL system or tool is very important in data mining as it
consumes 70% of the resources required for data mining. The three core areas of the ETL process are
extraction, transformation and loading. Extraction is the activity of extracting data from various
sources and collating it into a target database. Transformation, or data pre-processing, is the
foundation for data analysis and mining (Zhang et al., 2007). Today’s real-world databases are highly
susceptible to noisy, missing and inconsistent data due to their typically huge size (Han and Kamber,
2006). Any flawed data passed on to the mining process and then to a decision support system will result
in highly inaccurate reports and output, severely affecting business decisions and the business itself.
Data therefore has to be verified before mining (Hsu, 2009).

RESEARCH DESIGN

This research design is made up primarily of a multi-methodological approach (Nunamaker et al., 1990,
1991) to the construction of the data mining solution. A quantitative methodology is adopted to approach
the investigation systematically: a survey questionnaire examines the adoption and maturity level of
data mining tools and techniques in the banking industry in Malaysia, while prototyping and
experimentation are used for the development of the ETL and data mining solution. The targeted
population of this survey encompasses the credit cards department of
all the banks in Malaysia. Banks offering credit card services include nine local anchor banks and seven
fully qualified foreign banks.

Sample sizes for the survey

The targeted population of the survey encompasses the credit cards department of all the banks in
Malaysia. Banks offering credit card services include nine local anchor banks and seven fully qualified
foreign banks.

System modelling

The ETL system is designed and constructed with features that allow users to reconstruct the data with
different equal-length binning sizes and to compute information gain and gain ratios according to the
ID3 algorithm. These features are needed in this research to build the right data sample and to locate
the classifier with the best accuracy level.
Three batch classifiers and three incremental classifiers were used for this comparison. The batch
classifiers are C5.1, neural net and logistic regression; the incremental classifiers are naïve bayes
updateable, IBk and raced incremental logit boost. With all data fully substantiated, five attributes
were used: age, location, sex, accumulated credit amount and credit limit/income level. Each attribute
is assigned as an “input”, or predictor field, whereas the “class info” field is set as the “output”, or
predicted field, a binary classification of two categories (prompt payment or default in payment) for
the machine-learning process of the data mining tool. Initially, various partition sizes are used for
the entire batch of data. The training and testing samples are split 50 to 50%, 80 to 20% and 90 to 10%,
and binning partition sizes of 2, 10, 25 and 50 are used depending on the attribute. A randomly selected
subsample of 5000 records with various partition sizes is used for training and testing the models. To
achieve the best predictive accuracy, each model is trained and tested for its highest score.
The test is then extended to a larger subsample of 120,000 records representing 24 months of data from
5000 customers’ records. Each month of data is added incrementally. The equal-length binning size used
is also extended to 70, except for attributes with only a binary classification of two categories. The
training and testing partition sizes are set to 90 and 10% respectively.
The last part of the test uses incremental learning schemes to ascertain whether smaller batches or
streaming records would further improve predictive accuracy by repeatedly rebuilding the model from each
instance of data provided. The subsample size of 120,000 records, equal-length binning of 70 and
training and testing sizes of 90 and 10% remain. The incremental classifiers used for the test were
naïve bayes updateable, IBk and raced incremental logit boost.

RESULTS

Descriptive analysis

The survey results indicated a progressive adoption of data mining techniques to evaluate credit risks
in banks, with 90% of them indicating some use in various capacities. The most common technique is
confirmed to be logistic regression, as traditionally employed. Use, at about 60%, is found
predominantly at the regional headquarters rather than at branch level. Most respondents are, however,
not sure about the predictive accuracy achieved on the prompt payment of customers.

Information gain and gain ratios

The computed information gain, shown in Figure 1, and the gain ratios for different equal-length binning
sizes suggest that the higher the partition size, the higher the information gain and gain ratios will
be.

Figure 1. Computed information gain on different binning sizes of attributes.

[Figure 2 chart. Title: Data mining techniques - Predictive accuracy for incremental data. Vertical axis: overall accuracy in % (Sum of OverallAccuracy), 85 to 96. Horizontal axis: testing partitions 1 to 24 for the C5, Logistic and Neural net models.]

Figure 2. Testing results for using various batch classifiers on the entire subsample at every increment of 5000 records
(each month).
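
The evaluation summarized in Figure 2, in which one month of records is added at a time and a batch model is rebuilt and tested on a 90/10 split after each addition, can be sketched in Python as follows. This is only an illustration under stated assumptions: the monthly batches are synthetic, the labelling rule in `make_month` is invented, and a small categorical naive Bayes classifier stands in for the C5, logistic regression and neural net models that the study built with Clementine and Weka. All names in the block are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


class CategoricalNaiveBayes:
    """Small stand-in classifier for categorical inputs (not the paper's C5, LR or NN)."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
        self.values_seen = defaultdict(set)       # feature index -> distinct values
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.value_counts[(i, label)][value] += 1
                self.values_seen[i].add(value)
        return self

    def predict(self, row):
        total = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_label in self.class_counts.items():
            score = math.log(n_label / total)
            for i, value in enumerate(row):
                # Laplace smoothing keeps unseen values from zeroing the probability.
                numerator = self.value_counts[(i, label)][value] + 1
                denominator = n_label + len(self.values_seen[i])
                score += math.log(numerator / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def make_month(rng, n=200):
    """Synthetic month of (features, label) records; the rule below is invented."""
    batch = []
    for _ in range(n):
        age_bin, limit_bin = rng.randrange(5), rng.randrange(5)
        risky = age_bin <= 1 and limit_bin >= 3
        if rng.random() < 0.1:  # 10% label noise
            risky = not risky
        batch.append(((age_bin, limit_bin), "default" if risky else "prompt"))
    return batch


def evaluate_cumulative(monthly_batches, train_fraction=0.9, seed=0):
    """Rebuild and test a batch model each time one more month is added to the sample."""
    rng = random.Random(seed)
    pool, scores = [], []
    for batch in monthly_batches:
        pool.extend(batch)
        shuffled = pool[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train, test = shuffled[:cut], shuffled[cut:]
        model = CategoricalNaiveBayes().fit([r for r, _ in train], [y for _, y in train])
        correct = sum(model.predict(r) == y for r, y in test)
        scores.append(correct / len(test))
    return scores


months = [make_month(random.Random(m)) for m in range(6)]
for month, acc in enumerate(evaluate_cumulative(months), start=1):
    print(f"months 1..{month}: test accuracy {acc:.3f}")
```

Each entry of the returned list corresponds to one point on a curve such as those in Figure 2. The numbers printed here come from synthetic data and say nothing about the accuracies reported in the study.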

Data mining techniques compared

The testing and training results for using one batch of the subsample of 5000 records, and for using the
entire subsample of 120,000 records spanning 24 months of 5000 records each, are shown in Figure 2. The
results using the incremental schemes are presented further below.

Testing results of batch classifiers using a small dataset

C5 emerged as the best classifier with a 91.46% accuracy level when tested. All models were trained and
tested using a sample of 5000 records, with 90% of the sample used for training and 10% randomly picked
for testing.
The results also show that the equal-length binning sizes of various attributes and the testing
partition sizes do affect predictive accuracy. The equal-length binning of 10, 10, 2, 50 and 25 for
attributes 1 to 5 (age, location, sex, amount owing and credit limit/income level) respectively gave the
neural network the single highest predictive accuracy of 92.46%, as shown in Table 1.

Testing results of batch classifiers using a large dataset

The training and testing results show a marked improvement in the accuracy level for C5 when the entire
subsample of 120,000 records is used to generate and test the model. The accuracy level for C5 improves
as more data are added progressively in batches to form one large sample, as shown in Figure 2. This
result remains consistent when two different data mining tools (Clementine’s C5 and Weka’s J48) are used
to generate the models. The highest level of accuracy achieved is 94.68%, using Clementine’s C5
classifier with a subsample of 120,000 records. The other data mining techniques do not become
progressively better as the sample size increases.

Testing results of incremental learning schemes using a large dataset

Instance-based IBk has the best predictive accuracy of the three incremental learning models generated,
with an accuracy rate of 93.63%. Naïve bayes updateable and raced incremental logit boost could only
reach accuracy levels of 90.24 and 90.23% respectively. This means that the incremental learning schemes
do not perform better than the C5 or J48 batch classifier with this set of data and these settings.

LIMITATIONS AND ENHANCEMENTS

One of the obvious limitations encountered in this research is the limited number of attributes used.
There is always a possibility that some other combination of attributes and data subset could provide
better information gain and gain ratios, improving the overall predictive accuracy of the model. The
limited number of attributes also resulted in a limited dimension in which data

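Table 1. Training and testing accuracy using various batch classifiers, equal-length bin sizes and training and testing partition sizes.

Data mining technique - Training and testing accuracy (%)
Bin size lists the equal-length bin sizes for attributes 1 to 5 (age, location, sex, amount owing, credit limit/income level). DM: C5 = C5 decision tree, LR = logistic regression, NN = neural network. Columns 50, 80 and 90 give the training partition size (%).

Bin size DM 50 80 90 Train (AVG) 50 80 90 Test (AVG)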
10, 10, 2, 50, 25 C5 93.52 93.06 93.34 93.31 91.20 91.77 90.95 91.30
10, 10, 2, 50, 25 LR 92.14 92.69 92.60 92.48 91.99 90.58 90.34 90.97
10, 10, 2, 50, 25 NN 92.95 92.91 93.00 92.96 92.46 91.37 90.34 91.39
20, 10, 2, 10, 25 C5 92.87 92.89 92.85 92.87 92.34 91.17 89.74 91.08
20, 10, 2, 10, 25 LR 92.01 92.74 92.69 92.48 92.22 91.67 89.74 91.21
20, 10, 2, 10, 25 NN 92.50 92.61 92.89 92.67 91.91 91.37 90.54 91.27
20, 10, 2, 25, 25 C5 93.97 92.71 93.47 93.38 92.30 90.48 90.95 91.24
20, 10, 2, 25, 25 LR 91.97 92.41 92.58 92.32 91.83 91.17 90.34 91.11
20, 10, 2, 25, 25 NN 92.79 92.79 92.69 92.76 91.95 90.77 90.74 91.16
20, 10, 2, 50, 25 C5 92.71 92.86 93.34 92.97 92.11 91.87 90.95 91.64
20, 10, 2, 50, 25 LR 92.14 92.69 92.58 92.47 91.91 90.58 90.34 90.94
20, 10, 2, 50, 25 NN 92.67 92.69 92.89 92.75 92.26 91.37 90.14 91.26
25, 10, 2, 50, 25 C5 92.71 92.86 93.34 92.97 92.11 91.87 90.95 91.64
25, 10, 2, 50, 25 LR 92.14 92.69 92.58 92.47 91.87 90.58 90.34 90.93
25, 10, 2, 50, 25 NN 92.42 92.64 92.78 92.61 91.44 90.48 90.54 90.82
50, 10, 2, 50, 25 C5 92.71 92.86 93.47 93.01 92.11 91.87 90.95 91.64
50, 10, 2, 50, 25 LR 92.14 92.69 92.58 92.47 91.87 90.58 90.34 90.93
50, 10, 2, 50, 25 NN 93.11 92.86 92.94 92.97 92.22 91.17 90.74 91.38
50, 25, 2, 50, 25 C5 92.71 93.51 93.34 93.19 92.11 91.47 91.15 91.57
50, 25, 2, 50, 25 LR 92.14 92.71 92.60 92.48 91.87 90.58 90.34 90.93
50, 25, 2, 50, 25 NN 92.87 92.89 92.58 92.78 92.03 91.37 90.14 91.18
50, 50, 2, 50, 10 C5 92.75 93.41 93.00 93.05 92.11 91.47 91.35 91.64
50, 50, 2, 50, 10 LR 92.38 92.54 92.49 92.47 91.79 90.58 90.74 91.04
50, 50, 2, 50, 10 NN 92.83 92.79 92.78 92.80 92.38 90.58 90.34 91.10
50, 50, 2, 50, 25 C5 92.75 93.51 93.34 93.20 92.11 91.47 90.74 91.44
50, 50, 2, 50, 25 LR 92.22 92.71 92.60 92.51 91.79 90.58 90.34 90.90
50, 50, 2, 50, 25 NN 92.67 92.71 92.89 92.76 92.26 90.87 90.34 91.16
50, 50, 2, 50, 50 C5 92.58 93.41 93.25 93.08 92.18 91.47 90.54 91.40
50, 50, 2, 50, 50 LR 92.25 92.71 92.58 92.52 91.79 90.58 90.54 90.97
50, 50, 2, 50, 50 NN 92.83 92.81 92.98 92.87 92.30 90.87 90.34 91.17

mining tools can be applied.
The various equal-length binning sizes appeared to have an effect on predictive accuracy. The higher the
number of records, or the percentage of the sample used for testing, the better the training and testing
accuracy seem to be.
This observation and relationship could be validated if more records were obtained for testing the
models. This could further improve predictive accuracy.
C5, or decision tree, seems to have the best predictive accuracy not only among the batch classifiers
but also compared with the incremental ones. It would be useful if an incremental decision tree
technique were made available, to see whether it would exceed what the current tool achieves.

Conclusion

The successful completion of this research has provided new insights into the use of data mining tools
and techniques in a bank. Factors such as the sample size, the equal-length partition sizes and the
training and testing partition sizes have had an effect on predictive accuracy.
The objectives put forward were successfully met with the identification of a batch classifier and model
that attain a level of predictive accuracy higher than initially expected.

REFERENCES

Back B, Laitinen T, Sere K (1996). Neural Networks and genetic algorithms for bankruptcy predictions. Expert Syst. Appl., 11(4): 407-413.
Desai VS, Crook JN, Overstreet GAJ (1996). A comparison of neural networks and linear scoring models in the credit union environment. Eur. J. Oper. Res., 95: 24-37.
Doumpos M, Kosmidou K, Baourakis G, Zopounidis C (2002). Credit risk assessment using a multicriteria hierarchical discrimination approach: a comparative analysis. Eur. J. Oper. Res., 138(2): 392-412.

Fayyad U, Piatetsky-Shapiro G, Smyth P (1996). The KDD process for extracting useful knowledge from volumes of data. Commun. ACM., 39(11).
Gandhi M, Srivatsa SK (2010). Classification Algorithms in Comparing Classifier Categories to Predict the Accuracy of the Network Intrusion Detection – A Machine Learning Approach. Adv. Comp. Sci. Technol., 3(3): 321-334.
Koh HC, Tan WC, Goh CP (2004). Credit scoring using data mining techniques. Singap. Manag. Rev., 26(2): 25-47.
Koh HC, Tan WC, Goh CP (2006). A two-step method to construct credit scoring models with data mining techniques. Int. J. Bus. Infor., 1(1): 966-118.
Han J, Kamber M (2006). Data Mining: Concepts and Techniques. San Diego: Academic Press.
Hsu C (2009). Data Mining to improve industrial standards and enhance production and marketing: An empirical study in apparel industry. Expert Syst. Appl., 36(3).
Jo H, Han I (1997). Bankruptcy prediction using case-based reasoning, neural networks, and discriminant analysis. Expert Syst. Appl., 13(2): 97-108.
Kim CN, McLeod R (1999). Expert, linear models, and nonlinear models of expert decision making in bankruptcy prediction: a lens model analysis. J. Manage. Infor. Syst., 16(1): 189-206.
Kimball R, Caserta J (2004). The data warehouse ETL toolkit. Indianapolis, Indiana: Wiley Publishing.
Lee J (2008). A new approach of top-down induction of decision trees for knowledge discovery. ProQuest Dissertations and Theses, Iowa State University.
Li J, Xin X (2009). The Identification of Bank Customer Credit Risk. Comput. Sci. Eng., 2: 191-195.
Nunamaker W, Chen M, Purdin T (1990-91). Systems development in information systems research. J. Manag. Inform. Syst., 7(3): 89-106.
Ryu YU, Yue WT (2005). Firm bankruptcy prediction: Experimental comparison of isotonic separation and other classification approaches. IEEE Trans. Syst. Man Cybern., 35(5): 727-737.
Sinha AP, Zhao H (2008). Incorporating domain knowledge into data mining classifiers: An application in indirect lending. Decis. Support Syst., 46(1): 287-299.
Sung TK, Chang N, Lee G (1999). Dynamics of modeling in data mining: interpretive approach to bankruptcy prediction. J. Manage. Inform. Syst., 16(1): 63-85.
Xiao W, Zhao Q, Fei Q (2006). A comparative study of data mining methods in consumer loans credit scoring management. J. Syst. Sci. Syst. Eng., 15(4): 419-435.
Yang Y (2007). Adaptive credit scoring with kernel learning methods. Eur. J. Oper. Res., 183(3): 1521-1536.
Zhang F, Yang B, Song W, Lee L (2007). Intelligent decision support system based on data mining: Foreign trading case study. IEEE Int. Conf. Cont. Autom., 1486-1491.
Zurada J, Lonial S (2005). Comparison of The Performance of Several Data Mining Methods for Bad Debt Recovery. J. Appl. Bus. Res., 21(2): 37-53.
