
INTEGRAL, Vol. 8 No. 2, Oktober 2003

TOWARDS THE USE OF C4.5 ALGORITHM FOR CLASSIFYING BANKING DATASET


Veronica S. Moertini
Department of Computer Science, Faculty of Mathematics and Natural Sciences, Universitas Katolik Parahyangan, Bandung. E-mail: wurjanto@bdg.centrin.net.id

Abstract
C4.5 is a well-known algorithm used for classifying datasets. It induces decision trees and rules from datasets, which may contain categorical and numerical attributes. The rules can be used to predict categorical attribute values of new records. This paper discusses an overview of data classification and its techniques, the basic methods of the C4.5 algorithm, and the process and result analysis of an experiment that utilizes C4.5 for classifying a banking dataset. C4.5 performs well in classifying the dataset, but more data needs to be collected in order to obtain useful rules.

Intisari
C4.5 is a widely known algorithm used to classify data with numerical and categorical attributes. The result of the classification process, in the form of rules, can be used to predict discrete-valued attributes of new records. This paper discusses data classification techniques in general, the basic methodology of the C4.5 algorithm, and the process and result analysis of an experiment that uses C4.5 to classify banking data. C4.5 works well, but to obtain useful rules, more complete data needs to be collected.

Received: 27 June 2003. Approved for publication: 10 July 2003.

1. Introduction


Databases are rich with hidden information that can be used for making intelligent business decisions. Classification is one form of data analysis that can be used to extract models describing important data classes or to predict categorical labels. An example of such a model is one that categorizes bank loan applications as either safe or risky [1]. Bank databases are rich with data. Banks can take advantage of the data they have to characterize the behavior of their customers [2] and, based on that behavior, take business actions such as holding on to good customers and weeding out the bad ones [3]. An experiment in analyzing a banking dataset with the goal of generating knowledge regarding bank customers has been conducted. The task chosen is to classify the customers, and the technique used in the experiment is mainly the C4.5 algorithm. This paper discusses an overview of data classification and its techniques, the basic methods of the C4.5 algorithm, and the process and result analysis of the experiment in utilizing C4.5 for classifying the banking dataset.


2. Data Classification
Data classification is a two-step process (see Figure 1). In the first step, a model is built describing a predetermined set of data classes or concepts. The model is constructed by analyzing database tuples (records) described by attributes. Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute. In the context of classification, data tuples are also referred to as samples, examples or objects. The data tuples are analyzed collectively to build the model from the training data set. The individual tuples making up the training set are referred to as training samples and are randomly selected from the sample population. Since the class label of each training sample is provided, this step is also known as supervised learning (i.e., the learning of the model is supervised in that it is told to which class each training sample belongs). In the second step (Figure 1.b), the model is used for classification.
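As a concrete illustration of this two-step process, the following Python sketch builds a model on a small set of training samples and then estimates its accuracy on held-out test samples. The toy data, attribute values and the use of scikit-learn's decision tree are illustrative assumptions, not taken from the paper's experiment.

```python
# Minimal sketch of the two-step classification process described above,
# using scikit-learn's DecisionTreeClassifier as the learning algorithm.
# The data and attribute values here are hypothetical.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Tuples described by two numeric attributes; y holds the class label attribute.
X = [[25, 30000], [47, 52000], [33, 41000], [19, 12000], [58, 80000], [22, 15000]]
y = ["risky", "safe", "safe", "risky", "safe", "risky"]

# Step 1: learn the model from randomly selected training samples (supervised learning).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: use held-out test samples to estimate accuracy, then classify new tuples.
print("estimated accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("new applicant ->", model.predict([[29, 20000]]))
```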

Figure 1. The data classification process: (a) Learning: Training data are analyzed by a classification algorithm. The class label attribute is credit_rating, and the learned model is represented in the form of classification rules. (b) Classification: Test data are used to estimate the accuracy of the classification rules. If the accuracy is acceptable, the rules can be used to classify new data tuples [1].


3. Preparing Data for Classification


The following preprocessing steps may be applied to the data in order to help improve the accuracy, efficiency and scalability of the classification process:
- Data cleaning: removing noise and treating missing values. In real datasets, noise can be viewed as legitimate records having abnormal behavior.
- Relevance analysis: removing any irrelevant or redundant attributes from the learning process.
- Data transformation: the data can be generalized to higher-level concepts. The data may also be normalized.
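The sketch below illustrates how these three preprocessing steps might look in Python with pandas on a small, hypothetical customer table; the column names and treatment choices are assumptions for illustration only.

```python
# Sketch of the three preprocessing steps (cleaning, relevance analysis,
# transformation) on a hypothetical customer table using pandas.
import pandas as pd

df = pd.DataFrame({
    "age":      [25, 47, None, 58],               # missing value to be treated
    "balance":  [1200.0, 5300.0, 800.0, 99999.0], # last value behaves abnormally
    "cust_id":  [1, 2, 3, 4],                     # irrelevant for learning
    "label":    ["junior", "classic", "classic", "gold"],
})

# Data cleaning: fill the missing value and clip the abnormally large balance.
df["age"] = df["age"].fillna(df["age"].median())
df["balance"] = df["balance"].clip(upper=df["balance"].quantile(0.95))

# Relevance analysis: drop attributes that cannot help the classifier.
df = df.drop(columns=["cust_id"])

# Data transformation: min-max normalize the numeric attributes.
for col in ["age", "balance"]:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

print(df)
```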

4. Overview of Data Classification Techniques

There are several basic techniques for data classification. Among them are decision tree induction, Bayesian classification and Bayesian belief networks [1], neural networks [5,6,7], and association-based classification. There are also other approaches to classification which are less commonly used in commercial data mining systems, such as the k-nearest neighbor classifier, case-based reasoning, genetic algorithms, rough sets [4] and fuzzy logic techniques. The basic algorithm for decision tree induction is a greedy algorithm that constructs decision trees from the dataset in a top-down, recursive, divide-and-conquer manner. Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given sample belongs to a particular class. There are two types of Bayesian classifiers: the naive Bayesian classifier and Bayesian belief networks. Naive Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. Bayesian belief networks are graphical models which, unlike naive Bayesian classifiers, allow the representation of dependencies among subsets of attributes. Neural networks that are common for data classification are of the backpropagation type. Backpropagation learns by iteratively processing a set of training samples, comparing the network's prediction for each sample with the actual known class label. For each training sample, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual class.
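The independence assumption of the naive Bayesian classifier can be made concrete with a small sketch: the score of each class is the class prior multiplied by the per-attribute conditional probabilities. The probability tables below are hypothetical, not taken from the paper.

```python
# Sketch of the naive Bayesian assumption: P(class | a1, a2) is scored as
# P(class) * P(a1 | class) * P(a2 | class), i.e. the attributes are treated
# as independent given the class. All probability values are hypothetical.
priors = {"safe": 0.7, "risky": 0.3}
p_income_given_class = {"safe": {"high": 0.6, "low": 0.4},
                        "risky": {"high": 0.2, "low": 0.8}}
p_student_given_class = {"safe": {"yes": 0.3, "no": 0.7},
                         "risky": {"yes": 0.5, "no": 0.5}}

def naive_bayes_predict(income, student):
    scores = {}
    for c in priors:
        scores[c] = (priors[c]
                     * p_income_given_class[c][income]
                     * p_student_given_class[c][student])
    return max(scores, key=scores.get)   # class with the highest score

print(naive_bayes_predict("low", "yes"))
```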

5. Classification by Decision Tree Induction, ID3 and C4.5 Algorithm


Decision trees are powerful and popular tools for classification and prediction [3]. The attractiveness of tree-based methods is due in large part to the fact that, in contrast to neural networks, decision trees represent rules. Rules can readily be expressed in a language that humans can understand, or in a database access language such as SQL. In some applications, the accuracy of a classification or prediction is the only thing that matters, for example in selecting (or predicting) the most promising customers; in such cases, neural networks can be used. In other situations, however, the ability to explain the reason for a decision is crucial; for example, rejecting loan applicants requires some explanation. There are a variety of algorithms for building decision trees. The most popular ones are CART, CHAID and C4.5 [3]. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions. The top-most node in a tree is the root node. An example of a tree is given in Figure 2.


Figure 2. A decision tree example.

Decision Tree Induction
This section discusses a well-known decision tree induction algorithm, C4.5, by first introducing the basic methods of its predecessor, the ID3 algorithm, and then describing the enhancements applied in C4.5. As has been mentioned previously, the basic algorithm for decision tree induction is a greedy algorithm that constructs decision trees in a top-down, recursive, divide-and-conquer manner. Figure 3 shows the basic algorithm of ID3; a compact Python sketch of the same procedure is given after the figure. The basic strategy is as follows [1]:
- The tree starts as a single node representing the training samples (step 1).
- If the samples are all of the same class, then the node becomes a leaf and is labeled with that class (steps 2 and 3).
- Otherwise, the algorithm uses an entropy-based measure known as information gain as a heuristic for selecting the attribute that will best separate the samples into individual classes (step 6). This attribute becomes the test or decision attribute at the node (step 7). (All of the attributes must be categorical, i.e. discrete-valued. Continuous-valued attributes must be discretized.)
- A branch is created for each known value of the test attribute, and the samples are partitioned accordingly (steps 8-10).
- The algorithm uses the same process recursively to form a decision tree for the samples at each partition. Once an attribute has occurred at a node, it need not be considered in any of the node's descendants (step 13).
- The recursive partitioning stops only when any one of the following conditions is true:
  o All the samples for a given node belong to the same class (steps 2 and 3), or
  o There are no remaining attributes on which the samples may be further partitioned (step 4). In this case, majority voting is employed (step 5). This involves converting the given node into a leaf and labeling it with the majority class among its samples. Alternatively, the class distribution of the node samples may be stored.
  o There are no samples for the branch test-attribute = ai (step 11). In this case, a leaf is created with the majority class in samples (step 12).


Algorithm: Generate_decision_tree.
Narrative: Generate a decision tree from the given training data.
Input: The training samples, samples, represented by discrete-valued attributes; the set of candidate attributes, attribute-list.
Output: A decision tree.
Method:
(1) create a node N;
(2) if samples are all of the same class, C, then
(3)     return N as a leaf node labeled with the class C;
(4) if attribute-list is empty then
(5)     return N as a leaf node labeled with the most common class in samples; // majority voting
(6) select test-attribute, the attribute among attribute-list with the highest information gain;
(7) label node N with test-attribute;
(8) for each known value ai of test-attribute
(9)     grow a branch from node N for the condition test-attribute = ai;
(10)    let si be the set of samples in samples for which test-attribute = ai; // a partition
(11)    if si is empty then
(12)        attach a leaf labeled with the most common class in samples;
(13)    else attach the node returned by Generate_decision_tree(si, attribute-list minus test-attribute);
Figure 3. Basic algorithm for inducing a decision tree from training samples [1].
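The following Python sketch mirrors the Generate_decision_tree procedure of Figure 3 for discrete-valued attributes. It is a minimal rendering for illustration, not the original C4.5 implementation; the gain computation used here anticipates the formulas given in the next subsection, and branches are grown only for attribute values that actually occur in the partition, so the empty-partition case (steps 11-12) does not arise.

```python
# Compact sketch of the Generate_decision_tree procedure of Figure 3.
# Samples are dicts; the class label is stored under the key "class".
from collections import Counter
from math import log2

def entropy(samples):
    counts = Counter(s["class"] for s in samples)
    total = len(samples)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def info_gain(samples, attr):
    total = len(samples)
    values = set(s[attr] for s in samples)
    remainder = sum(
        (len(part) / total) * entropy(part)
        for part in ([s for s in samples if s[attr] == v] for v in values))
    return entropy(samples) - remainder

def generate_decision_tree(samples, attribute_list):
    classes = [s["class"] for s in samples]
    if len(set(classes)) == 1:                        # steps 2-3: pure node
        return classes[0]
    if not attribute_list:                            # steps 4-5: majority voting
        return Counter(classes).most_common(1)[0][0]
    test_attr = max(attribute_list, key=lambda a: info_gain(samples, a))  # step 6
    node = {}                                         # step 7: node labeled test_attr
    for value in set(s[test_attr] for s in samples):  # steps 8-10: branch and partition
        subset = [s for s in samples if s[test_attr] == value]
        remaining = [a for a in attribute_list if a != test_attr]
        node[(test_attr, value)] = generate_decision_tree(subset, remaining)  # step 13
    return node

data = [{"income": "high", "student": "no",  "class": "safe"},
        {"income": "high", "student": "yes", "class": "safe"},
        {"income": "low",  "student": "no",  "class": "risky"},
        {"income": "low",  "student": "yes", "class": "safe"}]
print(generate_decision_tree(data, ["income", "student"]))
```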

Attribute Selection Measure
The information gain measure is used to select the test attribute at each node in the tree. Such a measure is referred to as an attribute selection measure or a measure of the goodness of split. The attribute with the highest information gain (or greatest entropy reduction) is chosen as the test attribute for the current node. Let S be a set consisting of s data samples. Suppose the class label attribute has m distinct values defining m distinct classes, Ci (for i = 1, ..., m). Let si be the number of samples of S in class Ci. The expected information needed to classify a given sample is given by

$$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

where $p_i$ is the probability that an arbitrary sample belongs to class Ci and is estimated by $s_i/s$. The log function to base 2 is used because the information is encoded in bits. Let attribute A have v distinct values, {a1, a2, ..., av}. Attribute A can be used to partition S into v subsets, {S1, S2, ..., Sv}, where Sj contains those samples in S that have value aj of A. If A were selected as the test attribute (that is, the best attribute for splitting), then these subsets would correspond to the branches grown from the node containing the set S.


Let sij be the number of samples of class Ci in a subset Sj. The entropy, or expected information based on the partitioning into subsets by A, is given by

$$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})$$

The term $\frac{s_{1j} + \cdots + s_{mj}}{s}$ acts as the weight of the jth subset and is the number of samples in the subset (having value aj of A) divided by the total number of samples in S. The smaller the entropy value, the greater the purity of the subset partitions. For a given subset Sj,

$$I(s_{1j}, s_{2j}, \ldots, s_{mj}) = -\sum_{i=1}^{m} p_{ij} \log_2(p_{ij})$$

where $p_{ij} = s_{ij} / |S_j|$ is the probability that a sample in Sj belongs to class Ci. The encoding information that would be gained by branching on A is

$$\mathrm{Gain}(A) = I(s_1, s_2, \ldots, s_m) - E(A)$$

In other words, Gain(A) is the expected reduction in entropy caused by knowing the value of attribute A. The algorithm computes the information gain of each attribute. The attribute with the highest information gain is chosen as the test attribute for the given set S. A node is created and labeled with the attribute, branches are created for each value of the attribute, and the samples are partitioned accordingly.

Tree Pruning
When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods address this problem of overfitting the data. Such methods typically use statistical measures to remove the least reliable branches, generally resulting in faster classification and an improvement in the ability of the tree to correctly classify independent test data. There are two common approaches to tree pruning: prepruning and postpruning. In the prepruning approach, a tree is pruned by halting its construction early (by deciding not to further split or partition the subset of training samples at a given node). Upon halting, the node becomes a leaf. In the postpruning approach, a tree is pruned after it is fully grown. A tree node is pruned by removing its branches. The lowest unpruned node becomes a leaf and is labeled with the most frequent class among its former branches.

Extracting Classification Rules from Decision Trees
The knowledge represented in decision trees can be extracted and represented in the form of IF-THEN rules. One rule is created for each path from the root to a leaf node. Each attribute-value pair along a given path forms a conjunction in the rule antecedent (IF part). The leaf node holds the class prediction, forming the rule consequent (THEN part). The IF-THEN rules may be easier for humans to understand, especially if the given tree is very large.

C4.5: An Enhancement to ID3
Several enhancements to the basic decision tree (ID3) algorithm have been proposed. C4.5 (discussed in detail in [8]), a successor algorithm to ID3, proposes mechanisms for three types of attribute test:
1. The standard test on a discrete attribute, with one outcome and branch for each possible value of that attribute.


2. A more complex test, based on a discrete attribute, in which the possible values are allocated to a variable number of groups, with one outcome for each group rather than for each value.
3. If attribute A has continuous numeric values, a binary test with outcomes A ≤ Z and A > Z, based on comparing the value of A against a threshold value Z. Given v values of A, v-1 possible splits are considered in determining Z, namely the midpoints between each pair of adjacent values.
The information gain measure is biased in that it tends to prefer attributes with many values. C4.5 proposes the gain ratio, which considers the probability of each attribute value.
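The bias of information gain towards many-valued attributes, and the correction made by the gain ratio, can be seen in a small sketch. It uses the usual definition of the gain ratio as Gain(A) divided by the split information of A; the tiny dataset and attribute names are hypothetical.

```python
# Sketch contrasting information gain with C4.5's gain ratio
# (gain ratio = Gain(A) / SplitInfo(A)). An ID-like attribute with a distinct
# value per record gets maximal gain but a much lower gain ratio.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gain_and_gain_ratio(values, labels):
    total = len(labels)
    gain = entropy(labels)
    split_info = 0.0
    for v in set(values):
        idx = [i for i, x in enumerate(values) if x == v]
        weight = len(idx) / total
        gain -= weight * entropy([labels[i] for i in idx])
        split_info -= weight * log2(weight)
    return gain, (gain / split_info if split_info else 0.0)

labels    = ["good", "good", "bad", "bad"]
record_id = ["c1", "c2", "c3", "c4"]        # one distinct value per record
has_order = ["yes", "yes", "no", "no"]      # a genuinely informative attribute

print("record_id:", gain_and_gain_ratio(record_id, labels))  # gain 1.0, ratio 0.5
print("has_order:", gain_and_gain_ratio(has_order, labels))  # gain 1.0, ratio 1.0
```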

6. Experiment
An experiment is conducted with the goal of finding the steps needed to utilize the C4.5 algorithm for classifying a real banking dataset, discovering the rules generated from the dataset, and interpreting their meaning.

Banking Dataset Description
The original banking dataset used for the experiment is obtained from [10]. It consists of several text files as described in [9]. The data is then exported and stored in an Access database. The database contains data related to a bank's clients, and its schema is given in Figure 4. Figure 4 shows the relations Account, Client, Disposition, PermanentOrder, Transaction, Loan, CreditCard and District, which are related to one another. There are 4,500 tuples in Account, 5,369 in Client, 5,369 in Disposition, 6,471 in PermanentOrder, 1,056,320 in Transaction, 682 in Loan, 892 in CreditCard, and 77 in District. A detailed description of the data can be found in [9].

Classifying the Banking Dataset
Suppose the bank marketing managers need to classify the customers who hold credit cards, so that they could offer the right card to bank customers who currently hold no credit card. Also, the loan division needs to classify the customers who have loans, so that they could predict whether new loan applicants would be good customers. The tasks chosen in analyzing the data are therefore to classify customers who hold a credit card and customers who have a loan. The data is considered to be clean and complete, so no treatment is applied to improve its quality. To select the relevant data from the database, two datasets are created. One is used for analyzing credit card holders and the other for analyzing loan owners. The original C4.5 program requires three files as its inputs: filename.names, filename.data and filename.test [8]. Filename.names contains the definition of the label attribute and the names of the attributes, with either their categorical values or their continuous type. Filename.data contains the training data (each line contains one tuple) and filename.test contains the test data (each line contains one tuple).

Dataset for card holders: The data considered relevant for analysis is the data stored in tables Client, District, Account, Transaction, Loan and CreditCard. The tables are joined by properly constructed SQL statements. The attributes selected are birth number from table Client; the sum of amount from table Loan; the sum of order id from table PermanentOrder; the average of balance from table Transaction; A4, A10 and A11 from table District; and type from table CreditCard. From the result of the join operation, the age and gender of the customers are then computed from birth number.


The result is then exported to two text files: card.data, which contains 810 lines, and card.test, which contains 82 lines or tuples.
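A sketch of how such files might be produced from a table of customer records is given below. The attribute names are illustrative rather than the exact ones used in the experiment, and the .names layout follows the usual C4.5 convention of listing the class values first and then one attribute definition per line.

```python
# Sketch of producing C4.5 input files from a pandas DataFrame.
# Attribute names are hypothetical; the class label is the last column,
# so it ends up as the last field of each .data/.test line.
import pandas as pd

df = pd.DataFrame({
    "Age":        [19, 35, 52, 24],
    "AvgBalance": [1500.0, 8200.0, 20500.0, 3100.0],
    "CardType":   ["junior", "classic", "gold", "classic"],   # class label
})

train, test = df.iloc[:3], df.iloc[3:]

with open("card.names", "w") as f:
    f.write(", ".join(sorted(df["CardType"].unique())) + ".\n")  # class values
    f.write("Age: continuous.\n")
    f.write("AvgBalance: continuous.\n")

# One tuple per line, attribute values followed by the class label.
train.to_csv("card.data", header=False, index=False)
test.to_csv("card.test", header=False, index=False)
```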

Figure 4. The database schema in MS Access, showing the names of the relations and the relationships among them.

Dataset for loan owners: The data considered relevant for analysis is the data stored in tables Client, District, Account, PermanentOrder, Loan and CreditCard. Transaction data, actually, could be useful in classifying loan owners. Unfortunately, the transaction data stored in table Transaction is not complete: the table contains only part of the transactions done by some of the customers (not all of the tuples in Loan are related to tuples in Transaction), so it could not be used. The selected tables are then joined by properly constructed SQL statements. The attributes selected are birth number from table Client; A4, A10 and A11 from table District; the sum of order id and the sum of amount from table PermanentOrder; type from table CreditCard; and the sum of amount, the duration, and the status from table Loan. The loan statuses A and C are converted to good, and B and D are converted to bad (please see [9] for the description of loan status). The result is then exported to two text files: loan.data, which contains 600 lines, and loan.test, which contains 83 lines or tuples. The chosen datasets are not normalized and are not generalized to higher-level concepts, as the database schema does not show hierarchies.

The result of presenting the training and test data of the card dataset to the C4.5 program (downloaded from [11]) is given in Figure 5. It turns out that C4.5 classifies the data by the attribute age only. Of the 810 records of training data, 131 are classified as junior card holders and 679 as classic card holders. The evaluation on training and test data (Figure 6) shows that some of the customers are misclassified: 79 customers who hold a gold card are classified as classic card holders.


This happens due to tree pruning, which has been discussed in Section 5. The error percentage on the training data is 9.8% and on the test data is 11%. If this error is acceptable, then the rules given in Figure 5(b) can be applied to new customer records for predicting the type of card that a customer would buy. However, it can easily be seen from the rules that they are already known and would not predict any gold card holder. Therefore, these rules, despite their low error percentage, would not be applicable or useful in making business decisions, and would not help the bank's managers improve their marketing strategies. To generate better rules, clearly, more data that tells more about the bank customers needs to be gathered.
(a)
C4.5 [release 8] decision tree generator
-----------------------------------------
Read 810 cases (8 attributes) from card.data

Decision Tree:
Age <= 20.0 : junior (131.0)
Age > 20.0 : classic (679.0/79.0)

Tree saved

(b)
C4.5 [release 8] rule generator
-------------------------------
Final rules from tree 0:

Rule 1:
    Age <= 20.0
    -> class junior [98.9%]

Rule 2:
    Age > 20.0
    -> class classic [87.4%]

Default class: classic

Figure 5. The output of C4.5 algorithm for card dataset: (a) Decision tree. (b) Rules generated from the tree.
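Because the extracted model reduces to a single threshold on age, applying the rules of Figure 5(b) to new records is trivial; the following sketch makes this explicit and shows that, as noted above, the rules can never predict a gold card holder.

```python
# Minimal sketch of applying the rules of Figure 5(b) to new customer records.
def predict_card_type(age):
    if age <= 20.0:
        return "junior"   # Rule 1
    return "classic"      # Rule 2 / default class; gold is never predicted

for age in (17, 20, 21, 45):
    print(age, "->", predict_card_type(age))
```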

(a)
Evaluation on training data (810 items):
Tested 810, errors 79 (9.8%)   <<

  (a)   (b)   (c)    <- classified as
 ----  ----  ----
         79          (a): class gold
        600          (b): class classic
              131    (c): class junior

(b)
Evaluation on test data (82 items):
Tested 82, errors 9 (11.0%)   <<

  (a)   (b)   (c)    <- classified as
 ----  ----  ----
          9          (a): class gold
         59          (b): class classic
               14    (c): class junior

Figure 6. The evaluation on (a) training data and (b) test data of the card dataset.

The result of presenting the training and test data of the loan dataset to C4.5 is given in Figure 7. Here, C4.5 generates a few decision trees and rules using a few attributes. As can be seen in Figure 7(b), the attributes used in the rules are NoPermOrder, PermOrderAmt and AvgSalary. NoPermOrder denotes the number of permanent order services that a customer subscribes to. One of the purposes of subscribing to this service is actually to pay loans periodically (for example, monthly) and automatically; therefore, loan owners may subscribe to this service only after they are granted loans. PermOrderAmt states the amount to be deducted from the customer's account for this service, so it too may only exist after loan owners have loans. AvgSalary is the average salary of the district where the customer lives. This may be a useful attribute in characterizing loan owners, but the rules using this attribute are rather suspicious: Rule 5 states that customers living in districts having an average salary greater than 9624 are bad customers, while Rule 2 states that customers living in districts having an average salary of at most 9624 are good customers. These two rules need further investigation to prove their correctness.
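One way to carry out such an investigation is to recompute, directly on the exported loan.data file, how many tuples each rule covers and how many of those carry the predicted class. The sketch below does this with pandas; the column names are assumptions that merely mirror the attributes discussed above, so they would need to be adjusted to the actual file layout.

```python
# Sketch of re-checking Rules 2 and 5 on the exported loan data: for each
# rule, count covered tuples (coverage) and how many of them carry the
# predicted class (accuracy). The column names are hypothetical.
import pandas as pd

loan = pd.read_csv("loan.data", header=None,
                   names=["Age", "Gender", "NofInhabitans", "AvgSalary", "A10",
                          "NoPermOrder", "PermOrderAmt", "LoanAmt", "Status"])

rules = {
    "Rule 5 -> Bad":  (loan["AvgSalary"] > 9624) & (loan["NoPermOrder"] <= 1)
                      & (loan["PermOrderAmt"] > 7512.7),
    "Rule 2 -> Good": (loan["AvgSalary"] <= 9624) & (loan["PermOrderAmt"] > 7742),
}
targets = {"Rule 5 -> Bad": "Bad", "Rule 2 -> Good": "Good"}

for name, cond in rules.items():
    covered = loan[cond]
    hits = (covered["Status"] == targets[name]).sum()
    print(name, "coverage:", len(covered), "accuracy:",
          round(hits / len(covered), 2) if len(covered) else "n/a")
```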


Other than the error percentage, one can also see in Figure 8 that most of the loan owners are good ones. Therefore, in analyzing the bank dataset, it may be more appropriate to focus the analysis on the bad customers and gather more facts about them.

(a)
C4.5 [release 8] decision tree generator
-----------------------------------------
Read 600 cases (9 attributes) from loan.data

Decision Tree:
NoPermOrder > 1.0 : Good (385.0/18.0)
NoPermOrder <= 1.0 :
|   PermOrderAmt <= 7512.7 : Good (189.0/38.0)
|   PermOrderAmt > 7512.7 :
|   |   PermOrderAmt <= 7742.0 : Bad (6.0)
|   |   PermOrderAmt > 7742.0 :
|   |   |   AvgSalary > 9624.0 : Bad (6.0/1.0)
|   |   |   AvgSalary <= 9624.0 :
|   |   |   |   NofInhabitans > 70699.0 : Good (9.0)
|   |   |   |   NofInhabitans <= 70699.0 :
|   |   |   |   |   NofInhabitans <= 45714.0 : Good (3.0/1.0)
|   |   |   |   |   NofInhabitans > 45714.0 : Bad (2.0)

(b)
C4.5 [release 8] rule generator
-------------------------------
Read 600 cases (9 attributes) from loan

Processing tree 0
Final rules from tree 0:

Rule 1:
    NoPermOrder <= 1.0
    PermOrderAmt > 7512.7
    PermOrderAmt <= 7742.0
    -> class Bad [79.4%]

Rule 5:
    AvgSalary > 9624.0
    NoPermOrder <= 1.0
    PermOrderAmt > 7512.7
    -> class Bad [66.2%]

Rule 6:
    NoPermOrder > 1.0
    -> class Good [94.4%]

Rule 2:
    AvgSalary <= 9624.0
    PermOrderAmt > 7742.0
    -> class Good [91.1%]

Default class: Good

Figure 7. The output of C4.5 algorithm for loan dataset: (a) Decision tree. (b) Rules generated from the tree.

(a)
Evaluation on training data (600 items):
Tested 600, errors 60 (10.0%)   <<

  (a)   (b)    <- classified as
 ----  ----
  529     1    (a): class Good
   59    11    (b): class Bad

(b)
Evaluation on test data (83 items):
Tested 83, errors 7 (8.4%)   <<

  (a)   (b)    <- classified as
 ----  ----
   76          (a): class Good
    7          (b): class Bad

Figure 8. The evaluation on (a) training data and (b) test data of the loan dataset.


Another experiment, with the intention of visualizing and then clustering the two datasets, has also been conducted. The techniques used are the Self-Organizing Map (SOM) and the K-Means algorithm. However, due to space limitations, the results could not be presented in this paper. The clustering results show similarities with the results of the tree induction experiment: for the card dataset, only the attributes age and card type are important, whereas for the loan dataset, the attributes NoPermOrder, PermOrderAmt and loan status play a significant role in forming clusters.

7. Conclusion
The C4.5 algorithm performs well in constructing decision trees and extracting rules from the banking dataset. However, a graphical-user-interface-based application that implements the C4.5 algorithm is needed in order to provide ease of use and better visualization of the decision trees for the users. The application should also provide features for accessing databases directly, as most business data is stored in databases. From the experiment results, it can be learned that a few of the attributes are unused in classifying. There are also attributes used in the resulting rules that carry little meaning for making business decisions. Hence, it can be concluded that selecting the proper attributes from the dataset plays a significant role in data classification. For classifying a banking dataset, a banking knowledge base and statistical methods for analyzing the attributes relevant to the tasks must be employed. In order to discover new, meaningful and actionable knowledge from the banking dataset, more data needs to be collected. The data might be data related to the customers, such as detailed demographic data, and various as well as complete transactional data.

8. References
[1] Han, Jiawei; Kamber, Micheline; Data Mining: Concepts and Techniques, Morgan Kaufmann Pub., USA, 2001.
[2] IBM; Mellon Bank Forecasts a Bright Future for Data Mining, Data Management Solutions Banking, http://www.software.ibm.com/data, 1998.
[3] Berry, M.J.; Linoff, G.; Data Mining Techniques for Marketing, Sales and Customer Support, John Wiley & Sons Inc., USA, 1997.
[4] Hu, Xiaohua; Using Rough Sets Theory and Database Operations to Construct a Good Ensemble of Classifiers for Data Mining Applications, IEEE ICDM Proceedings, December 2001.
[5] Brause, R.; Langsdorf, T.; Hepp, M.; Neural Data Mining for Credit Card Fraud Detection, J.W. Goethe-University, Frankfurt, Germany.
[6] Kao, L.J.; Chiu, C.C.; Mining the Customer Credit by Using the Neural Network Model with Classification and Regression Tree Approach, IEEE Transactions on Data Engineering and Knowledge Discovery, Vol. 1, p. 923, 2001.
[7] Syeda, M.; Zhang, Y.Q.; Pan, Y.; Parallel Granular Neural Networks for Fast Credit Card Fraud Detection, IEEE Transactions on Neural Networks, Vol. 2, p. 572, 2002.


[8] Quinlan, J. Ross; C4.5: Programs for Machine Learning, Morgan Kaufmann Pub., USA, 1993.
[9] Berka, Petr; Guide to the Financial Data Set, Laboratory for Intelligent Systems, Univ. of Economics, Prague, Czech Republic, http://lisp.vse.cz/pkdd99.
[10] http://lisp.vse.cz/pkdd99.
[11] http://www.mkp.com/c45.
[12] Connolly, Thomas; Begg, Carolyn; Database Systems: A Practical Approach to Design, Implementation and Management, 3rd ed., Addison Wesley Pub., USA, 2002.
