You are on page 1of 10

Expert Systems with Applications 41 (2014) 1937–1946

Contents lists available at ScienceDirect

Expert Systems with Applications
journal homepage: www.elsevier.com/locate/eswa

Hybrid decision tree and naïve Bayes classifiers for multi-class
classification tasks
Dewan Md. Farid a, Li Zhang a,⇑, Chowdhury Mofizur Rahman b, M.A. Hossain a, Rebecca Strachan a
a
Computational Intelligence Group, Department of Computer Science and Digital Technology, Northumbria University, Newcastle upon Tyne, UK
b
Department of Computer Science & Engineering, United International University, Bangladesh

a r t i c l e i n f o a b s t r a c t

Keywords: In this paper, we introduce two independent hybrid mining algorithms to improve the classification
Data mining accuracy rates of decision tree (DT) and naïve Bayes (NB) classifiers for the classification of multi-class
Classification problems. Both DT and NB classifiers are useful, efficient and commonly used for solving classification
Hybrid problems in data mining. Since the presence of noisy contradictory instances in the training set may
Decision tree
cause the generated decision tree suffers from overfitting and its accuracy may decrease, in our first
Naïve Bayes classifier
proposed hybrid DT algorithm, we employ a naïve Bayes (NB) classifier to remove the noisy troublesome
instances from the training set before the DT induction. Moreover, it is extremely computationally expen-
sive for a NB classifier to compute class conditional independence for a dataset with high dimensional
attributes. Thus, in the second proposed hybrid NB classifier, we employ a DT induction to select a
comparatively more important subset of attributes for the production of naïve assumption of class con-
ditional independence. We tested the performances of the two proposed hybrid algorithms against those
of the existing DT and NB classifiers respectively using the classification accuracy, precision, sensitivity-
specificity analysis, and 10-fold cross validation on 10 real benchmark datasets from UCI (University of
California, Irvine) machine learning repository. The experimental results indicate that the proposed
methods have produced impressive results in the classification of real life challenging multi-class prob-
lems. They are also able to automatically extract the most valuable training datasets and identify the
most effective attributes for the description of instances from noisy complex training databases with
large dimensions of attributes.
Ó 2013 Elsevier Ltd. All rights reserved.

1. Introduction construction of overfitting or fragile classifiers. Thus, data prepro-
cessing techniques are needed, where the data are prepared for
During the past decade, a sufficient number of data mining mining. It can improve the quality of the data, thereby helping to
algorithms have been proposed by the computational intelligence improve the accuracy and efficiency of the mining process. There
researchers for solving real world classification and clustering are a number of data preprocessing techniques available such as
problems (Farid et al., 2013; Liao, Chu, & Hsiao, 2012; Ngai, Xiu, (a) data cleaning: removal of noisy data, (b) data integration:
& Chau, 2009). Generally, classification is a data mining function merging data from multiple sources, (c) data transformations: nor-
that describes and distinguishes data classes or concepts. The goal malization of data, and (d) data reduction: reducing the data size
of classification is to accurately predict class labels of instances by aggregating and eliminating redundant features.
whose attribute values are known, but class values are unknown. This paper presents two independent hybrid algorithms for
Clustering is the task of grouping a set of instances in such a way scaling up the classification accuracy of decision tree (DT) and
that instances within a cluster have high similarities in comparison naïve Bayes (NB) classifiers in multi-class classification problems.
to one another, but are very dissimilar to instances in other clus- DT is a classification tool commonly used in data mining tasks such
ters. It analyzes instances without consulting a known class label. as ID3 (Quinlan, 1986), ID4 (Utgoff, 1989), ID5 (Utgoff, 1988), C4.5
The instances are clustered based on the principle of maximizing (Quinlan, 1993), C5.0 (Bujlow, Riaz, & Pedersen, 2012), and CART
the intraclass similarity and minimizing the interclass similarity. (Breiman, Friedman, Stone, & Olshen, 1984). The goal of DT is to
The performance of data mining algorithms in most cases depends create a model that predicts the value of a target class for an un-
on dataset quality, since low-quality training data may lead to the seen test instance based on several input features (Loh & Shih,
1997; Safavian & Landgrebe, 1991; Turney, 1995). Amongst other
⇑ Corresponding author. Tel.: +44 191 243 7089.
data mining methods, DTs have various advantages: (a) simple to
E-mail address: li.zhang@northumbria.ac.uk (L. Zhang).
understand, (b) easy to implement, (c) requiring little prior

0957-4174/$ - see front matter Ó 2013 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.eswa.2013.08.089

Section 6 concludes accuracy. DTFS ex- missing attribute values. It is also an important mining classi. The influential of instances from a large amount of noisy training data with high factor of attributes was found in their work. (b) strong (naïve) independence assumptions. Carrasco-Ochoa. Their one-against-all method constructed M of class conditional independence.5 decision tree classifier and a one-against-all meth- most important attributes selected by DT with their corresponding od to improve the classification accuracy for multi-class classifica- weights are employed for the calculation of the naïve assumption tion problems. each of which sep- of the proposed hybrid algorithms against those of existing DT arated one class from all of the rest. The ith C4. 2010). then DTFS updated the va- to use. The G-FDT tree used the Gini Index as the split measure In this section. (2009) proposed a method to resolve one of the exceptions in basic ically extract the most representative high quality training datasets decision tree induction algorithms when the class prediction at a and identify the most important attributes for the characterization leaf node cannot be determined by majority voting. fier for data mining and applied in many real world classification DTFS expended or updated the leaf node according to the class la- problems because of its high classification performance.1. such as the bagging approach of a DT classifier and a categorize the training dataset. presents our proposed two hybrid algorithms for the multi-class Chen and Hung (2009) presented an associative classification classification problems respectively based on DT and NB classifiers. (c) handling lue of the input tree branch of this leaf node. fast splitting attribute selection (DTFS). 2010). the NB classifier also has several advantages such as (a) easy training instances from only one class. sensitivity-specificity analysis. dependability of the attribute value on the class label. it is also noted that to com. the number of instances stored in a leaf node reached its limit. If the leaf node of DT contained to DT. we review recent research on decision trees and to choose the most appropriate splitting attribute for each node naïve Bayes classifiers for various real world multi-class classifica- in the decision tree. Sanchez. Chandra and Varghese (2009) proposed a fuzzy decision tree Gini Index based (G-FDT) algorithm to fuzzify the decision boundary 2. Otherwise. bute with the highest gain measure as the splitting attribute. DT may suffer found similarities between instances using clustering algorithms from overfitting due to the presence of such noisy instances and and also selected target attributes. (b) only one scan of the training data required. repository (Frank & Asuncion. Gini Index was computed using the fuzzy-membership values of the attribute corresponding to a split value and fuzzy-membership 2. Otherwise. The benchmark datasets from UCI (University of California. Moreover. for classifying instances in large datasets with a large number of where a large number of variables in datasets were being consid- variables. When a threshold for the pute class conditional independence using a NB classifier is extre. When classifying new instances overview of the work related to DT and NB classifiers. tree (ACT) that combined the advantages of both associative clas- Section 5 provides experimental results and a comparison against sification and decision trees. Related work without converting the numeric attributes into fuzzy linguistic terms. Decision trees values of the instances.5 decision tree clas- and NB classifiers using the classification accuracy. to improve the classifi- superfluous nodes and branches are removed in order to improve cation accuracy. It the learning tree for decision making. A naïve DTs for large datasets. an algorithm for building (e) robust. In order to avoid storing all the instances into the DT. which gave the dimensional attributes. Rahman. 2010). The experi. and (f) dealing with large and noisy datasets. mental way. Section 3 using this pruned tree. The first proposed hybrid DT algorithm finds the trouble. most crucial subset of attributes using a DT induction. duced a decision tree construction method based on adjusted clus- moves these instances from the training set before constructing ter analysis classification called classification by clustering (CbC). the tion problems. sifier was trained with all the training instances of the ith class sitivity–specificity analysis. (d) able to handle both numerical and categorical data. Irvine) ma. for a DT classifier and a NB classifier for multi-class classification DTFS always considered a small number of instances in the main tasks. & Rahman. and 10-fold cross validation on 10 real with positive labels and all the others with negative labels. memory for the building of a DT. For the construction of the decision tree. ACT followed a simple heuristic which selected the attri- the findings and proposes directions for future work. Section 2 gives an was pruned based on this factor. number of instances stored in a cluster was reached.1938 D. and (c) independent feature models (Farid. sen. / Expert Systems with Applications 41 (2014) 1937–1946 knowledge. and 10-fold cross mental results prove that the proposed methods have produced validation on three datasets taken from the UCI machine learning very promising results in the classification of real world challeng. Farid et al. They proposed a novel combination of DTs and evolutionary decision trees: (a) the growth of the tree to enable it to accurately methods. Our second proposed hybrid NB algorithm finds the ate value of the target attribute. cation accuracy. The split-points were chosen as the mid- points of attribute values where the class information changed. The weights Polat and Gunes (2009) proposed a hybrid classification system of the selected attributes by DT are also calculated. whereby back-propagation neural network method. DTFS used this attribute selection technique Bayes (NB) classifier is a simple probabilistic classifier based on: to expand nodes and process all the training instances in an incre- (a) Bayes theorem. When 2011. Balamurugan and Rajaram ing multi-class problems. Franco-Arcega. Aviad and Roy (2011) also intro- some instances in the training data using a NB classifier and re. performance of this hybrid classifier was tested using the classifi- chine learning repository (Frank & Asuncion. The ACT tree was built using a set of existing DT and NB algorithms using 10 real benchmark datasets associative classification rules with high classification predictive from UCI machine learning repository. DTFS stored at most N number of instances in a leaf node. Decision tree classification provides a rapid and useful solution Aitkenhead (2008) presented a co-evolving decision tree method. Then it calculated the target its accuracy may decrease. In this approach.Md. There are two common issues for the construction of ered. all the in- mely computationally expensive for a dataset with many stances in each cluster were classified with respect to the appropri- attributes. Section 4 rately than the basic assignment by traditional DT algorithms. We evaluate the performances number of binary C4. panded the leaf node by choosing a splitting attribute and created In this paper. tree and also handled comparatively a wider range of values and Diaz. and (b) the pruning stage. The DT The rest of the paper is organized as follows. Then only these based on a C4. These methods also allow us to automat. Such methods evolved the structure of a decision classification accuracy.5 decision tree classifiers. Finally. the class labels can be assigned more accu- introduces the basic DT and NB classification techniques. 2010. precision. we propose two hybrid algorithms respectively one branch for each class of the stored instances. Similar bels of the instances stored in it. and Martinez-Trinidad (2011) presented decision trees using data types. Lee & Isa. and (d) continuous data. attributes distribution for each cluster. .

InfoA ðDÞ ¼  InfoðDj Þ ð2Þ jDj conditional independent component analysis (PC-ICA) method for j¼1 naïve Bayes classification in microarray data analysis. pj. (3).5 classifier is a successor of ID3 method called extended naïve Bayes (ENB) for the classification classifier (Quinlan. socioeconomic status and experience) and performance probability that xi 2 D. the expected information needed to correctly model was tested using socio-demographic (age. Di A subset of a training dataset T A decision tree The HNB method was based on the idea of creating another layer that represented a hidden parent of each attribute. partition. gender. Fan. Koc. Aij An attribute’s value Wi The weight of attribute Ai tacks. D. The goal of DT is to iteratively and over-fitting for the classification of gene expression datasets. Tavallaee. xi 2 D. and Ruz (2012) also presented an approach selection measure (Quinlan. where all instances in R-NBC used logarithms of probabilities rather than multiplying each Di belong to the same class. . In Eq. Jamain & Hand. xi 2 D. The HNB classifier Ci A class label was an extended version of a basic NB classier. / Expert Systems with Applications 41 (2014) 1937–1946 1939 2. Poh. A. & Ghor. (1). The results showed that the socio-demographic attributes were not suitable for predicting sale X m InfoðDÞ ¼  pi log 2 ðpi Þ ð1Þ performances. Farid et al. When handling numeric attributes. effective contacts and finished records) jCi. Given a exclusively to sales and telemarketing based on a NB classifier. training dataset. ENB used a normal NB algorithm to calcu- gously with Info(D) as shown in Eq. The root node of DT is chosen to predict the performance of sales agents of a call centre dedicated based on the highest information gain of the attribute.e. xi. underflow to identify Ci of an instance. from D based estimate approach for providing solutions to over-fitting problems. Valle. Dichotomiser). error rate and the training datasets into smaller subsets until all the subsets be- misclassification cost compared with the traditional NB classifier on long to a single class. on the partitioning by attributes. In this section. The mixed types of data included categor- information gain using a ‘‘split information’’ value defined analo- ical and numeric data. and Zhou (2010) proposed a partition. (1). and is estimated by (logged hours.Dj/jDj. . Classification is one of the most popular data mining techniques GainðAÞ GainRatioðAÞ ¼ ð5Þ that can be used for intelligent decision making. It applies a kind of normalization to of mixed types of data. Huang. Generally. Hsu. Table 1 summarizes the most com- quency of class Cj 2 D. 1986). each DT is a rule set. ent partitions so that independent component analysis (ICA) can be done within each partition. The naïve Bayes classifier is also widely used for classification Symbol Term problems in data mining and machine learning fields because of xi A data point or instance its simplicity and impressive classification accuracy. monly used symbols and terms throughout the paper. (4). Eq. which uses information theory as its attribute bani. Naïve Bayes classifiers Table 1 Commonly used symbols and terms. we SplitInfoðAÞ discuss some basic techniques for data classification using decision Eq. (2) shows InfoA(D) jD j It did not require any prior feature selection approaches in the field calculation. belongs to a class. (7). It relaxed the condi. DT recursively partitions a superior overall performance in terms of accuracy. late the probabilities of categorical attributes. This training dataset. X m GiniðDÞ ¼ 1  p2j ð6Þ 3. InfoA(D) is the expected infor- probabilities to handle the underflow problem and employed an mation required to correctly classify an instance. The attribute with the maximum Gain Ratio is selected as the 3. which is defined in Eq. . Chandra and Gupta (2011) proposed a robust naïve Bayes clas. Info(D) is the average amount of information needed sifier (R-NBC) to overcome two major limitations i. D. where pi is the status. D A training dataset tional independence assumption imposed on the basic NB classifier. Ci. PC-ICA spited the small-size data samples into differ- is shown in Eq. talked hours. D. information as attributes of individuals.Md. .. Ci. is given in Eq. 2009. D2. marital classify an instance. PC-ICA also attempted to do ICA-based GainðAÞ ¼ InfoðDÞ  InfoA ðDÞ ð3Þ feature extraction within each partition that may consist of several The Gain Ratio is an extension to the information gain approach. Supervised classification splitting attribute. 2000. It further ex- Information gain is defined as the difference between the tended the class-conditional independent component analysis (CC- original information requirement and the new requirement that ICA) method. which provides a rapid and effective method for classifying Eq. (6) defines the gini for a dataset. Decision tree induction j¼1 A decision tree classifier is typically a top-down greedy ap. Varas. (5). Bagheri. A C4. and Chang (2008) presented a classification also used in DT such as C4. Dn}. The common DT algorithm is ID3 (Iterative the KDD 99 dataset (McHugh. Lu.5. The goodness of a split of D into subsets D1 and D2 is defined by proach. The HNB multiclass classification model exhibited 2008). . The influences from all of the other attributes can thus be easily combined through conditional probabilities by estimating the attributes from the data instances (Chandra & Paul Varghese. where jDjj acts as the weight of the jth partition. It especially significantly improved the accuracy for the C Total number of classes in a training dataset detection of denial-of-services (DoS) attacks. X A set of instances and Sarkani (2012) applied a hidden naïve Bayes (HNB) classifier to Ai An attribute a network intrusion detection system (NIDS) to classify network at. 2009). but operational records proved to be useful for the i¼1 prediction of the performances of sales agents. into subsets. it adopted the statistical theory to discrete X n   jDj j jDj j the numeric attributes into symbols by considering the average S plitInfoA ðDÞ ¼   log 2 ð4Þ j¼1 jDj jDj and variance of numeric values. classes. is the fre- tree and naïve Bayes classifiers. 1993).2. where. {D1. of microarray data analysis where a large number of attributes X n jDj j were considered.1. D. Mazzuchi.

(12). the class Ci for which P(XjCi)P(Ci) is the maximum probability. the NB classifier predicts that the instance X belongs to the class Ci. rC i Þ ð11Þ class Ci for which P(CijX) is maximized is called the Maximum 1 ðxpÞ2  2 Posteriori Hypothesis. lC i is the mean and rC i is the standard deviation of PðC i jXÞ ¼ ð8Þ PðXÞ the values of the attribute Ak for all training instances in the class Ci. (9).Md. if and therefore maximize P(XjCi).j – i. PðXjC i Þ ¼ Pðxk jC i Þ ð9Þ k¼1 sponding probabilities for those attributes when calculating the likelihood of membership for each class. Therefore. if and only the classes are equally likely. P(C1) = P(C2) =    = P(Cm). . lC i . Sunny Mild Normal Strong Yes Overcast Mild High Strong Yes Probability Value Overcast Hot Normal Weak Yes P(Play = Yes) 9/14 = 0. Moreover. Tables 3 and 4 respectively tabulate the prior probabilities for each class and conditional probabilities for each attribute value gener- Table 2 ated using the playing tennis dataset shown in Table 2. P(xnjCi) can be Given a training dataset. .375 . xn}. that is.. The attribute values can be dis. To compute P(XjCi) in a dataset with many attributes In Eq. the classifier will predict that X belongs sumed to have a Gaussian distribution with a mean l and standard to the class with the highest posterior probability. X2. PðXjC i Þ ¼ Pðx1 jC i Þ  Pðx2 jC i Þ      Pðxn jC i Þ ð10Þ quires the class conditional independence.. . the attri- represented as. divided by crete or continuous. A2. X. . (13). The NB classifier probabilities are not known. independent of one another. butes in training datasets can be categorical or continuous-valued. in order to estimate P(xkjCi). . we consider a small dataset in Table 2 described by four attributes namely Outlook. has a particular class label If Ak is a continuous-valued attribute. Aih}. rÞ ¼ pffiffiffiffiffiffiffi e 2r ð12Þ 2pr PðXjC i ÞPðC i Þ In Eq. x2.Dj/jDj. if and only if P(CijX) > P(CjjX) for 1 6 j 6 m. . 3. X 2 D. l.2. Tian. A NB classifier can easily Y n handle missing attribute values by simply omitting the corre. . Temperature. each data record is easily estimated from the training instances. these probabilities P(x1jCi). 2010). Xn}. (8). Naïve Bayes classification tion of class conditional independence is made in order to reduce A naïve Bayes classifier is a simple probabilistic based method. . . In Eq. advantages: (a) easy to use. 1 shows the decision tree model constructed using the playing tennis dataset shown in Table 2. Xi = {x1. then P(xkjCi) is the number attribute values {Ai1. Ak.P(x2jCi). . Humid- ity. . butes {A1. maximize P(XjCi)P(Ci). Each attribute has several unique attribute values. The attributes are conditionally which can predict the class membership probabilities (Chen. 2009. C2. Fig. (11). . . The class prior probabilities are calculated by P(Ci) = jCi. i. is categorical. bute on a given class is independent of those of other attributes. To predict the class label of instance classes. D = {X1. . then Ak is typically as- Ci. An} and each attribute Ai contains the following If the attribute value. given the class label of the instance. If the class prior X. The Play column in Table 2 represents the class category of each instance. together with In Bayes theorem shown in Eq. i. as P(X) is a constant for all xk. It indicates whether a particular weather condition is suitable or not for playing tennis. .e. conditioned deviation r. The proposed hybrid learning algorithms Sunny Hot High Weak No Sunny Hot High Strong No In this paper. Thus. PðXjC i ÞPðC i Þ > PðXjC j ÞPðC j Þ ð13Þ where jCi. Farid et al. the number of instances belonging to the class Ci 2 D. of instances in the class Ci 2 D with the value xk for Ak. gðx. Otherwise. . . The NB classifier also re. the effect of an attri. Cm}. The playing tennis dataset. xk refers to the value of attribute Ak for instance X. For a test instance. The Pðxk jC i Þ ¼ gðxk . Ai2.Dj. Each training instance. Fig. That is the predicted class label is is extremely computationally expensive. A decision tree generated using the playing tennis dataset. only P(XjCi)P(Ci) needs to be maximized. D contains the following attri. D also contains a set of classes C = {C1. then it is commonly assumed that predicts that the class label of instance X is the class Ci. Farid & Rahman.Dj is the number of training instances belonging to the class Ci in D. defined respectively by the following two equations: on X. the split with the best gini value is chosen. and Wind.P(XjCi)P(Ci) is evaluated for each class Ci 2 D.642 Rain Mild High Strong No P(Play = No) 5/14 = 0. . . These proposed algorithms are described in the following Overcast Cool Normal Strong Yes Sunny Mild High Weak No Sunny Cool Normal Weak Yes Table 3 Rain Mild Normal Weak Yes Prior probabilities for each class generated using the playing tennis dataset. Outlook Temperature Humidity Wind Play 4. the naïve assump. we have proposed two independent hybrid Overcast Hot High Weak Yes algorithms respectively for decision tree and naïve Bayes classifiers Rain Mild High Weak Yes Rain Cool Normal Weak Yes to improve the classification accuracy in multi-class classification Rain Cool Normal Strong No tasks. . 1 6 j 6 m and j – i. computation in evaluating P(XjCi). 1. & Qu. / Expert Systems with Applications 41 (2014) 1937–1946 n1 n2 ginisplit ðDÞ ¼ þ ð7Þ nðginiðD1 ÞÞ nðginiðD2 ÞÞ In this way.1940 D. . jCi. Huang. To illustrate the operation of DT. which represent the weather condition of a particu- lar day. . .e.. It has several Thus. Eqs. . and (b) only one scan of the training data required for probability generation. That is. Now we can bring these two quantities to Eq. (9) and (10) are used to produce P(XjCi).

each instance. the number P(Wind = WeakjPlay = No) 2/5 = 0. do attributes to improve efficiency. Ai.4 noise free data.222 splitting attribute with the maximum information gain value as P(Temperature = HotjPlay = No) 2/5 = 0. .666 a DT algorithm depends on the size of training dataset. xn}. training dataset. we select the best P(Temperature = HotjPlay = Yes) 2/9 = 0. Cm}. For the decision tree generation. Decision tree.333 The algorithm continues recursively by adding new subtrees to P(Temperature = CooljPlay = No) 1/5 = 0. .444 mined. It is developed based on a basic C4.333 P(Wind = StrongjPlay = No) 3/5 = 0. We have found some in. bility. P(AijjCi). xi2. A decision tree is a classification attribute. attributes {A1. . . The time and space complexity of P(Humidity = NormaljPlay = No) 1/5 = 0. . P(AijjCi). and the size of the generated tree. and thus decrease its accuracy. attribute values {Ai1. . We calculate the prior prob- 23: end if ability.3 making using the updated training dataset D with those purely P(Outlook = RainjPlay = No) 2/5 = 0. The result is a 19: if stopping point reached for this path. Then we re- move all the misclassified training instances from the dataset D. We the integration of a decision tree in order to find a subset of attri- calculate the prior and class conditional probabilities using this butes with attribute weighting. instance based on these probabilities. 2: Find the prior probabilities. Each attribute. xi 2 D. for each class.0 P(Outlook = RainjPlay = Yes) 3/9 = 0. it is used to classify each test instance. xn} // Training dataset. P(Ci). .333 in the reduced training set all belong to the same class. x2. .1 and 4. C2. P(Temperature = MildjPlay = No) 2/5 = 0. Each attribute con- indicate that they belong to ‘‘Class = yes’’. each training instance is 12: end for represented as xi = {xi1. . for each attribute value (even if it is numeric) in D. some of these examples either contain contradictory characteristics. This class P(Humidity = HighjPlay = No) 4/5 = 0. An}. Decision tree induction Input: D = {x1. It seems that there is some C2.4 of attributes. Ai2.6 After removing those misclassified/troublesome instances from P(Outlook = OvercastjPlay = Yes) 4/9 = 0. do noisy training data at an initial stage to avoid overfitting. a predicate that can be applied to the attribute associated with the 17: for each arc do parent. Then we calculate the P(ClassjD) for each class determination. . brid DT induction. Algorithm 1 is used to describe the proposed Output: T. A2.2. xi 2 D. The proposed hybrid algorithm for a naïve Bayes classifier blesome training examples. we discuss the proposed Algorithm 1 of the hy- 10: Remove xi from D. . D = {x1.2 P(Wind = WeakjPlay = Yes) 6/9 = 0. The presence of Probability Value such noisy training instances is more likely to cause a DT classifier P(Outlook = SunnyjPlay = Yes) 2/9 = 0. with the highest posterior probability. where each . we first apply a basic NB classifier to 22: T0 = DTBuild(D). xih} and D contains the following 13: T = . 21: else For the training dataset. Once the root node of DT has been deter- P(Temperature = MildjPlay = Yes) 4/9 = 0. D = {A1. xi.444 the training dataset. . xih}. Ai2. An}. / Expert Systems with Applications 41 (2014) 1937–1946 1941 Table 4 noise within these data.2. 3: end for It embeds a DT classifier to identify a subset of most important 4: for each attribute value. . The algorithm terminates when the instances P(Humidity = HighjPlay = Yes) 3/9 = 0.2 each branching arc. Aih}.4 P(Temperature = CooljPlay = Yes) 3/9 = 0. 5: Find the class conditional probabilities. P(Wind = StrongjPlay = Yes) 3/9 = 0. . There are two basic steps for the 20: T0 = Create a leaf node and label it with an appropriate development of a DT based application: (a) building the DT from a class. or carry exceptional In this section. xi. The proposed hybrid decision tree algorithm 8: Find the posterior probability. . is built. D. A set of classes C = {C1. . . Thus these misclassified in- dataset. . .4 the root node of the tree. . x2.- dataset they are labeled as ‘‘Class = no’’. . and (b) applying the DT to a test dataset. D. D. Ai. Ci. (b) each arc labeled with label. There is a set of attributes used to stances where the probabilities calculated using the NB classifier describe the training data. Farid et al. The training data also belong to a 15: T = Create the root node and label it with the splitting set of classes C = {C1. stances are regarded as troublesome examples. Aij 2 D. . 11: end if Given a training dataset. these misclassified instances tend to be the trou. . xi 2 D. Ci. we present a hybrid naïve Bayes classifier with features. classify each training instance. . which employs a NB classifier to remove any Method: 1: for each class. For example. P(Ci). In our experiments. and (c) each leaf node labeled with a class. Sections 4. Ci 2 D and the class conditional proba- 24: T = Add T0 to arc. xi 2 D. is selected as the final classification for the instance.1. . . Algorithm 2 is used for the construction of a hybrid NB classifier. The class. . tree associated with D that has the following properties: (a) each 16: T = Add arc to the root node for each split predicate and internal node labeled with an attribute. 4. .666 lines the proposed DT algorithm. .. 25: end for Then we classify each training instance. D. xi. D. hybrid DT induction. Aik}. P(Cijxi) 9: if xi is misclassified. . using these proba- bilities. A2. . Cn} is also used to label the training instances. Ci 2 D. do In this section. the child nodes and its arcs are created and added to the DT. . do 4. Once the tree 18: D = Dataset created by applying splitting predicate to D. which contains a set of training instances and their associated class labels.6 Algorithm 1. P(Cijxi). However in the training tains attribute values Ai = {Ai1. Suppose there is a training dataset with two classes. . . 6: end for 7: for each training instance.222 to become overfitting. which leads to contradictory results in Conditional probabilities for each attribute value calculated using the playing tennis comparison with the original labels.8 is then used to label the corresponding leaf node. which play more important roles in example training dataset. classification for that instance. In a given training dataset. we subsequently build a DT for decision P(Outlook = OvercastjPlay = No) 0/5 = 0. . contains values {xi1.5 algorithm. . contains the following 14: Determine best splitting attribute. P(Outlook = SunnyjPlay = No) 3/5 = 0. . . .Md. Algorithm 2 out- P(Humidity = NormaljPlay = Yes) 6/9 = 0. . xi.

For example. then crucial roles in the final classification. 2. we calculate the prior probability.. sponding weights. by counting how often Ci occurs in D.1. P(Ci). Each dataset is roughly equivalent to a two-dimensional Pðxi jC i Þ ¼ PðAij jC i ÞW i ð14Þ spreadsheet or a database table. First of all. Wi. The weights of the se. fects on class conditional probability calculation as an exponential 3. xi 2 D. d. which ef. proposed hybrid NB algorithm. 26: Find the class conditional probabilities. proposed hybrid decision tree and naïve Bayes classifiers. Naïve Bayes classifier 9.1942 D. Wi. For each attribute. Dataset descriptions. we calculate the minimum 19: end if depth. 10. Table 5 Method: 1: T = .. P(Cijxi). 16: Wi = 0. the The performances of both of the proposed hybrid decision tree weights of the selected attributes also influence their class condi- and naïve Bayes algorithms are tested on 10 real benchmark data- tional probability calculation as exponential parameters. If the attri. Ai. Image Segmentation Data (Image seg. Ai 2 T. Wi. 18: d as the minimum depth of Ai 2 T. Moreover. . Breast cancer 9 Nominal 286 2 Contact lenses 4 Nominal 24 3 4: T = Add arc to the root node for each split predicate and Diabetes 8 Real 768 2 label. 19 Real 1500 7 8: T0 = Create a leaf node and label it with an appropriate Tic-Tac-Toe 9 Nominal 958 2 class. 2: Determine the best splitting attribute. do Subsequently. Fitting Contact Lenses Database (Contact lenses). Breast Cancer Data (Breast cancer). Aij 2 xi. from the training dataset.e. Glass Identification Database (Glass). That is we use the DT classifier to 14: for each attribute. Wi – 0) and classify each instance xi 2 D using these probabilities. Ci2. (14). in this 10: T0 = DTBuild(D). Att. . we generate a basic decision tree. This is done only for those attributes that appear in the DT. . Wi refers to the weight of the attribute. 11: end if T. D. and present the evaluation results for both of the can be estimated by counting how often each Aij occurs in Ci 2 D. parameter. This is conducted by combining the effects of different attribute values. xn} // Training data. in the tree. lected attributes by the DT are also used as exponential parameters 27: end for (see Eq. Output: A classification Model..Md. probability. then Image seg. Input: D = {x1. . and collect the attributes appearing 12: T = Add T0 to arc. (14). Ai. Aij 2 Ai. x2. To classify an instance. Datasets make the prediction. can be In this section. do find the subset of the attributes in the training dataset which play 15: if Ai is not tested in T. Large Soybean Database (Soybean). Tic-Tac-Toe Endgame Data (Tic-Tac-Toe).e. Iris Plants Database (Iris plants). Datasets No of Att.P(Ci) and P(AijjCi) from D are used to 5. we calculate the class conditional probabilities using 25: for each attribute value. xi 2 D. Ai. xi. 1984 United States Congressional Voting Records Database the proposed hybrid algorithm for the naïve Bayes classifier. Soybean 35 Nominal 683 19 Vote 16 Nominal 435 2 7: ifstopping point reached for this path. Farid et al. class conditional probabilities of those unselected attributes by 31: end for DT will not be generated and employed in the classification result production. (Vote). P(Cijxi). where the attribute. Pima Indians Diabetes Database (Diabetes). The posterior 5. jCi). is tested in the DT and initial- 20: end for ize the weight. Otherwise. Cik} also has some values. . . Ai 2 D. we need P(Ci) for each Ci. We esti- sets from UCI machine learning repository (Frank & Asuncion. d bute. The class. the root node of the tree has a 23: end for higher weight value in comparison with those of its child nodes. 2010). Table 5 describes the datasets used in experimental analy- Y n sis. do way. we initialize the weight. As mentioned earlier. and P(xi. for each 5. is then found for Ci. . P(Ci). Ci. for each attribute. PðAij j C i ÞW i . the 30: Find the posterior probability. Similarly. the probability P(AijjCi) also environments. 4. and estimate the likelihood that xi is in each Ci. 17: else bute.. DT is applied as an attribute selection 13: end for and attribute weighting method. Ai. of the attri. . 8. Ci 2 D. Ai 2 D and Wi – 0. the number of occurrences of each attribute value. NSL-KDD 41 Real & Nominal 25192 23 9: else . After the tree construction. we describe the test datasets and experimental counted to determine P(Ai). / Expert Systems with Applications 41 (2014) 1937–1946 class Ci = {Ci1. do only those attributes selected by DT (i. Ci. Aij. In Eq. Ai 2 D. Experiments class. the importance of the attributes is measured by their corre- 22: Find the prior probabilities. In this 21: for each class. In this approach. NSL-KDD Dataset (NSL-KDD).) Algorithm 2. is not tested in the DT then the weight. est probability is used to label the instance. Types Instances Classes 3: T = Create the root node and label it with the splitting attribute. Wi = 0) will not 29: for each instance. Glass 9 Real 214 7 5: for each arc do Iris plants 4 Real 150 3 6: D = Dataset created by applying splitting predicate to D. T. is initialized to zero. 24: for each attribute. with the value of p1ffiffid. 6. The 10 datasets are: j¼1 1. Ai 2 D. of the attribute. with the high.e. Algorithm 2 outlines 7. Other 28: end for attributes which are not selected by the DT (i. and W i ¼ p1ffiffi. I. (14)) for class conditional probability calculation. mate P(xijCi) by Eq. do be considered in the final result probability calculation. To calculate P(Cijxi).

We consider the weighted average for pre- accuracy ¼ ð16Þ TP þ TN þ FP þ FN cision. The detailed results for the weighted average of precision. since it is able to identify the most jXj important and discriminative subset of attributes for the If classify(x) = x  c then assess(x) = 1 else assess(x) = 0.6 86.33 precision ‘‘%’’ of positive predictions that are correct Image seg. The classification accuracy is measured either algorithm. x.66 79.66 73. This repeats 10 times and we take a mean accuracy rate. sensitivity and specificity analysis for each dataset. true negatives. The TP weighted average is similar to an arithmetic mean. and 10-fold cross validation to measure the performances of the proposed algorithms (Algorithms 1 and TP þ TN 2) using all 10 datasets.33 93. and 10- generated directly from the original training instances using C4. dataset by 4. and the remaining partitions are collectively used to mances of the classic C4.60 97.26 83.94 77.83 FP xi predicted to be in Ci but is not actually in it Iris plants 98 96 100 98.. the accuracy estimate is the overall number of correct classifications from the k iterations. Farid et al.34 98.98%. where X is production of naïve assumption of class conditional independence the set of instances to be classified. 10-fold cross validation breaks data into 10 sets on 10 datasets using 10-fold cross validation.5% using tioned into k mutually exclusive subsets or ‘‘folds’’. x 2 X and x  c is the class of in.63 accuracy ‘‘%’’ of predictions that are correct Vote 97. (15) or by Eq. and visualization.83 98.0 81. or the recall rate). .24 90. Training and test.55 TN xi not predicted to be in Ci and is not actually in it Glass 96. Table 7 divided by the total number of instances in the initial dataset.1 in accuracy rates of the basic C4. and specificity (also 8. Dk. regression. the tradi- In k-fold cross-validation. by Eq.7 99.5 fold cross validation.83 95. and false negatives.44 . This DT classifier gen- rules. in Table 7.5 classifier and the proposed DT algorithm train the classifier. It is a collection of machine learning 4.83 100 TP xi predicted to be in Ci and is actually in it Diabetes 84.7 98. rizes the symbols and terms used throughout in Eqs. We implement both of the proposed algo- datasets shown in Table 5. precision. respectively. Moreover. sensitivity-specificity analysis. NetBeans IDE is the first posed hybrid algorithms for each of the 10 training datasets. Evaluated using all the instances in the 10 datasets.5 DT achieved an average accuracy rate of 83.66 99. . the tic-tac-toe dataset by called the true positive rate. the tic-tac-toe dataset by ing software (Hall et al. the partition Di is reserved of the 10 datasets. Table 6 summa. the sensitiv ity ¼ ð18Þ TP þ FN weighting for the evaluation of this research is calculated using TN the number of instances belonging to one class divided by the total specificity ¼ ð19Þ TN þ FP number of instances in one dataset. (16)–(19) and their meanings. (16).82 95. xi 2 X ð15Þ of all 10 training datasets.0 GHz processor (2 MB Cache.87% and the contact lenses called the true negative rate). the image seg. Weka3 contains tools for data by 3.D2. D.81 specificity ‘‘%’’ of negative instances that are predicted as negative NSL-KDD 73.11 76. the initial data are randomly parti. dataset by 4. The proposed DT algorithm (Algorithm each of which has an approximately equal size. which is open source data min- tion of the diabetes dataset by 7. Among all the 10 datasets. 1) obtained an average accuracy rate of 88. of size N/10.66 FN xi not predicted to be in Ci but is actually in it Soybean 96.2. ..Md.html). The code for the basic versions of the DT and NB (Algorithm 1) outperforms the C4. It trains the classifier on 9 datasets and tests it using the remaining one dataset.5 DT classifier for the classifica- classifiers is adopted from Weka3. the contact lenses dataset by 4. (17)– proposed NB algorithm respectively improves the classification (19) are used for the calculations of precision. IDE providing support for JDK 7 and Java EE 6 (http://netbeans. association from each dataset before the DT induction. sensitivity. 10-fold cross validation. In iteration i. 800 MHz the classification accuracy on the training sets of the 10 benchmark FSB) and 1 GB of RAM. clustering. sensitivity and specificity analysis where.9 Symbol Intuitive Meaning Contact lenses 91.85 96.85 93. Eqs. positives. false for the experiments conducted are presented in Tables 8–11.5 and NB classifiers.org/ The results in Table 7 indicate that the proposed DT algorithm index. TP some data points contribute more than others. we evaluated the performances of proposed algorithms The experiments were conducted using a machine with an Intel (Algorithms 1 and 2) against existing DT and NB classifiers using Core 2 Duo Processor 2. Tables 8 and 9 respectively tabulate the perfor- as the test set. tional C4. We use NetBeans IDE 7. comparing to the basic NB classifier. classify(x) returns the classification of x. Secondly. Experimental setup Firstly.26 55. D1.55%. the proposed NB algorithm (Algorithm 2) PjXj also outperforms the traditional NB classifier for the classification i¼1 assessðxi Þ accuracy ¼ . Table 7 summarizes the classification rithms (Algorithms 1 and 2) in Java. scribed as 100% sensitivity and 100% specificity. TN. precision. we have used the classi- to carry more generalization capabilities comparing to the DT fication accuracy.89%. Also.17% and the NSL-KDD dataset algorithms for data mining tasks. Training C4.3% for the classification ing are performed k times. / Expert Systems with Applications 41 (2014) 1937–1946 1943 5. TP.53 sensitivity ‘‘%’’ of positive instances that are predicted as positive Tic-Tac-Toe 93. 99. classification. Algorithm 1 is capable of identifying the noisy instances pre-processing. where instead precision ¼ ð17Þ TP þ FP of each of the data points contributing equally to the final average.17%. Results and discussion 5. For example.7%. training dataset is less likely to become overfitting and thus able To test the proposed hybrid methods. specificity. we have used classification accuracy.5 DT NB Proposed DT Proposed NB Table 6 datasets classifier classifier classifier (%) classifier(%) Symbols used in Eqs.37 79.23%. (16)–(19). The classification accuracies of classifiers using training datasets.3. For classification. sensitivity (also rates of the glass dataset by 18. FP and FN denote true positives. The mining algorithms in Weka3 can be erated from the updated noise free high quality representative either applied directly to a dataset or called from our own coding.33 93.30 91.43 78. and the pro- Redhat enterprise Linux 5 for Java coding. 2009). (%) (%) (Algorithm 1) (Algorithm 2) Breast cancer 96. the stance.73 69.66 95. A perfect classifier would be de.

742 0.67%.042 Image seg.758 0. the results shown in Tables 8 and 9 indicate that the proposed DT algorithm outperforms the traditional C4.036 Vote 94.3% using 10-fold cross validation. Overall.055 Image seg.735 0.5 DT and the proposed DT classifi- Tic-Tac-Toe 88. sensitivity and specificity values for the sets using 10-fold cross validation.957 0. Breast cancer 71.185 the classification accuracy rates of the contact lenses dataset by Contact lenses 91.5 DT classi- fier in all the test cases.716 0.96 0.96 0.02 Soybean 91.842 0.91 0.098 Image seg.46 0.891 0.87 0.241 Contact lenses 83.149 Tic-Tac-Toe 78. / Expert Systems with Applications 41 (2014) 1937–1946 Table 8 Table 11 The classification accuracies and precision. Fig.82 0.) Breast cancer 75. the proposed NB algorithm has respectively improved Breast cancer 81.70 0.67 0.145 Glass 66.819 0.) avg.755 0.057 by 9.853 0.931 0.866 0.5 DT and the sifier (Algorithm 2) obtained an average accuracy rate of 86.1 0.06 0. 2 and 3). Algorithm 1 is able to automatically remove noisy Table 10 instances from training datasets for DT generation to avoid overfit- The classification accuracies and precision. and between a basic NB and the proposed NB classifiers for NSL-KDD 81.87 0.738 0.19 0.291 classifiers in all the test cases (see Figs.708 0.836 0. 2 and 3 respectively show the comparison of classification Image seg. Comparing to the traditional avg. proposed DT classifiers on 10 datasets with 10-fold cross validation.826 0.15 0.944 0.691 0.767 0.) avg.764 0.851 0.909 0.5 The classification accuracies and precision.79 0.96 0.485 0. 95.07 0.245 Breast cancer 75.176 Tables 10 and 11 respectively tabulate the performances of the Table 9 traditional NB classifier and the proposed NB algorithm on 10 data- The classification accuracies and precision.022 Figs.762 0.851 0.) avg.514 Iris plants 96 0.763 0.39 0.977 0.04 Iris plants 98 0.83 0.33 0.12%.3 NSL-KDD 76.11 0.826 0.8% on average comparing to the traditional DT classifier.533 0.261 Diabetes 85.945 0. Furthermore. 2. proposed hybrid DT classifier with 10-fold cross validation. 96.849 0. Algorithm 2 is capable of identifying the most discriminative Datasets Classification Precision Sensitivity Specificity subset of attributes for classification.055 Vote 90.083 16.166 Contact lenses 87.849 0. Overall. 85. In comparison with the traditional DT.62 0. the contact lenses dataset by 8. and the NSL-KDD dataset by 6.901 0.81%.) avg.758 0.283 They respectively outperform the traditional C4.66 0.21 NSL-KDD 71.013 Soybean 92.875 0.67 0.696 0. Diabetes 76.66 0.789 0. the glass dataset by 9.209 9.18 each dataset with 10-fold cross validation.50 0.986 0.965 0.814 0.905 0.48 0.236 Glass 48.189 Tic-Tac-Toe 69.523 0. proposed hybrid NB classifier with 10-fold cross validation.) avg.034 accuracy rates between the C4.98 0.819 0.92 0. The results shown in Tables 10 and 11 indicate that the Datasets Classification Precision Sensitivity Specificity proposed NB algorithm (Algorithm 2) also outperforms the basic accuracy (%) (weighted (weighted (weighted NB algorithm for all the test cases. and the breast cancer dataset by 5.496 0.94%.27 0.) and 2) for the classification of challenging real benchmark datasets.711 0.752 0.83 0. Datasets Classification Precision Sensitivity Specificity Datasets Classification Precision Sensitivity Specificity accuracy (%) (weighted (weighted (weighted accuracy (%) (weighted (weighted (weighted avg. while the proposed NB clas.682 0.872 0.97 0. It thus possesses more robustness and generalization capabil- classifier with 10-fold cross validation.75 0.Md. Fig.977 0.893 0. the tic-tac-toe dataset by Diabetes 79.96 0.82 0.5 DT and NB Contact lenses 70.331 Glass 52.52 0.73 0.1944 D. 4 also shows the comparison of classification accuracy rates of all classifiers on 10 datasets with 10-fold cross validation.98 0.708 0.916 0.834 0.786 0. The evaluation results prove accuracy (%) (weighted (weighted (weighted the efficiency of the proposed DT and NB algorithms (Algorithm 1 avg.30 0.59 0.237 improved the classification accuracy rates for all the 10 datasets Iris plants 98. Vote 97.88 0.854 0.963 0.788 0.45%. it has improved the classification accu- racy rates for the classification of the above 10 datasets by 4.11%.7%. Algorithm 2 has Glass 76.) avg.881 0. sensitivity and specificity values for a C4. the diabetes dataset by 9.33 0.) NB classifier.711 0.03 0.148 Tic-Tac-Toe 85.) avg.81 0. sensitivity and specificity values for a NB ting. Farid et al.986 0. ities.047 Vote 96.) avg.32 0.) avg.963 0. The comparison of classification accuracy rates between the C4. sensitivity and specificity values for the classifier with 10-fold cross validation.33%.4% on average in comparison to the classic NB classifier.125 Diabetes 73.288 NSL-KDD 82.476 Iris plants 96 0.53 0.703 0.762 0. . 81.065 Soybean 94.29%. we have also evaluated the traditional NB classi- fier using all the 10 datasets and achieved an average accuracy rate of 77.50 0.668 0. the proposed DT classifier has respectively improved the classification accuracy rates of the NSL-KDD dataset by 10.823 0.118 ers.852 0.27 0.871 0.41 0.237 Moreover.04 Soybean 92.833 0.11 0.

A. Hall. McHugh. Partition-conditional ICA for Bayesian classification of microarray data. 12–19. (2008).. & Sarkani. Testing intrusion detection systems: a critique of the 1998 and algorithms were tested against those of the traditional DT and NB 1999 darpa intrusion detection system evaluations as performed by lincoln classifiers using the classification accuracy. D. J. 36. 1080–1083. Application of data mining techniques mark datasets from UCI machine learning repository. & Asuncion.. 81–106. J. Jamain. In future work. & Rajaram.. Expert Systems with Applications.0 machine learning algorithm. 1059–1069.. sensitivity. & Hsiao. Strachan..ics.. Y.5: Programs for machine learning. C. 23–31. & Hand. B. 1293–1298. Riaz. 39. Frank. Farid. Tavallaee. T. A survey of decision tree classifier methodology. http:// www. (1995). A.. 8549–8559. B. 24. Carrasco-Ochoa. & Varghese. of the International Conference on Computing.. 815–840. Harbi.. 4. Turney. M. Hsu. Safavian. Expert Systems with Applications. D. (2012). based on a hidden naïve Bayes multiclass classifier. Farid. Chandra. Using decision trees to summarize associative classification rules. 13492–13500. & Qu. M. K.. precision. W. (2011).. & Ghorbani. 5895–5906. A. UCI machine learning repository. C. 35. L. & Gunes. Effective solution for unhandled exception in decision tree induction algorithms. Z. The in customer relationship management: A literature review and classification review article.. E. & Isa. A. 39. Friedman.edu/ml.-L. 25. (2009). Rahman. C4.H. R. P. T.. J. L. G.. Journal of Computers..-P. (2008). & Rahman. Frank. IEEE Transactions on Systems.. Rahman. 660–674. classification problems. C. T. 34. Classification and regression trees. International Journal of Network Security & Its Applications. R. Feature selection for text classification with naïve Bayes. T. S.ac. Sexton. (2011). (1993). Induction of decision tree. Chen. 87–112. Expert Systems with Applications. L.. (2009). S. K. the proposed hybrid NB classifiers on 10 datasets with 10-fold cross validation.-C. specificity analysis. Holmes. & Rahman. 38. A. L. R. (2012). R. factor facility for na’’ıve Bayes classification. Adaptive intrusion detection based on boosting and naïve Bayesian classifier. The comparison of classification accuracy rates between the basic NB and Chandra. 2. rithms. Hossain.. 11303–11311.. Robust approach for estimating probabilities in naïve Bayesian classifier for gene expression data. (2009). Expert Systems with Applications. Farid... Conclusions mixed data. (2010). In Proc.. A novel hybrid intelligent method based on C4.-Y. R. I. A. Mining supervised classification performance studies: A meta-analytic investigation. 7. 38. R.. Expert Systems with Applications. E. multi-class classification tasks under dynamic feature sets. L.. In Proc. Anomaly network intrusion detection based on improved self adaptive Bayesian algorithm. & Gupta. 12113–12119. J. Y. P.. P.. A. J. genetic algorithms. Machine Learning. D.. laboratory.-L. Franco-Arcega. & Shih. A co-evolving decision tree classification method. J. M.-H. Chu. An adaptive ensemble classifier for mining concept drifting data streams. S. instances from the training set before the DT induction. 369–409. 8220–8228.Md. Expert Systems with Applications. S. (2009). L. 2645). D. (1991).. Lu.T. 5432–5435. G. M. 37. 262–294. Information Sciences.. Reutemann.-W. A. Quinlan. Expert Systems with Applications. ACM Transaction on Information and System Security. Combining naïve Bayes and decision tree for adaptive intrusion detection.. & Roy. & Chang. J.. Bujlow. J. Lee. P. Bagheri.. S. Split selection methods for classification tree. Mazzuchi.cs. The performances of the proposed Sininca. 37. Networking and Knowl- edge) project (Grant No.. (1986). M. J. of the 2nd IEEE Int.2013. H. (2008).waikato. M. J. 11. Extended naïve Bayes classifier for 6. W. Poh. experimental results showed that the proposed methods have pro. Huang. 1587–1592.-H. 14290–14300. Man and Cybernetics. Chandra. & Hung. W. (2013). on Computational Intelligence in Security and Defense Applications (pp.. Stone. & Chau. 21. & Pedersen. Classsification by clustering decision tree-like classifier based on adjusted clusters. Zhang. The comparison of classification accuracy rates of all classifiers on each archive.. (2010). Chapman and Hall/CRC. Data mining techniques and applications – a decade review from 2000 to 2011. & Paul Varghese. Expert Systems with Applications. / Expert Systems with Applications 41 (2014) 1937–1946 1945 of Excellence for Learning.. Decision tree induction using a fast splitting attribute selection for large datasets. 2592–2602. 12–15. .. Cost-sensitive classification: Empirical evaluation of a hybrid We appreciate the support for this research received from the genetic decision tree induction algorithm. P. 36. M. Y. Xiu..07.nz/ml/weka/. M. Expert Systems with proved the classification accuracy rates of both DT and NB classifiers Applications.2013. (2009). 53–58). Expert Systems with Applications. (2009). (2011). Expert Systems with Applications. other classification algo. A method for classification of network traffic based on C5.. Expert Systems with Applications. The proposed methods im. Aviad.. & Rahman. 179. Farid. A detailed analysis of the Acknowledgment KDD CUP 99 data set. Sanchez-Diaz. 36.. D. A. M. Expert Systems with second proposed hybrid NB classifier used a DT induction to select Applications. P. D. will be used to deal with real-time Publishers Inc. Networking and Communications (ICNC) (pp.. 3. J. (2010). Expert Systems with Applications. 40. 3. References Aitkenhead. D. H. 38. 36.5 duced impressive results for the classification of real life challenging decision tree classifier and one-against-all approach for multi-class multi-class problems.. X. Z. (2000). L. D. The first proposed hybrid DT algo. rough Quinlan. while the Liao. Conf. B. A.. The weka data mining software: An update. & Witten.. 237–241). Chen. 5. Tian.. 2338–2351. 1. such as naïve Bayes tree (NBTree). & Olshen. Z. Pfahringer. & Martinez-Trinidad. A network intrusion detection system algorithms for DT and NB classifiers. M. Ngai. M. G.A. D. International Journal of Computer Applications. D. M. B. Moving towards efficient decision tree construction.. (2009). Fuzzifying gini index based decision trees. Statistica conditional independence. Expert Systems with Applications. Accessed 29. Balamurugan.. Expert Systems with Applications. and 10-fold cross validation on 10 real bench. Morgan Kaufman set approaches and fuzzy logic.07. Y. 18–25.uci. & Zhou. we have proposed two independent hybrid Koc. & Landgrebe. P. (2012).. SIGKDD Explorations.. C.M. (2011). http:// Fig. Accessed 29. a subset of attributes for the production of naïve assumption of class Loh. (2010). M. E. Polat. 36. A. Breiman. (1997).-H. Fan. dataset with 10-fold cross validation. N. (1984). 36. In this paper. 8188–8192. K. M.. Huang. B. Journal of Artificial Intelligence European Union (EU) sponsored (Erasmus Mundus) cLINK (Centre Research. rithm used a NB classifier to remove the noisy troublesome 8471–8478. (2009). Automatically computed document dependent weighting in multi-class classification tasks. Fig. F. Farid et al. G.. C. K. S. (2010). (2009).. Innovation. et al. Journal of Classification.

. 4. Varas. E. using a naïve Bayes classifier. Ann Arbor. of the fifth National Conference on Machine Learning. 107–120). Farid et al. / Expert Systems with Applications 41 (2014) 1937–1946 Utgoff. Michigan. & Ruz. USA (pp.Md. Valle. P.E. M. P. Expert Systems with Applications. Incremental induction of decision trees. Machine Learning. (1988). (2012). S.. Utgoff. G.. A. Job performance prediction in a call center 161–186. In Proc. (1989). A.1946 D. 9939–9945. ID5: An incremental ID3. 39.