
A Neuro-Fuzzy Decision Tree Algorithm Using C4.5

Dr. S. K. Sudarsanam,
Professor and Program Chair,
VIT Business School Chennai,
VIT University,
India

Abstract: Decision trees are powerful and popular models for machine learning since they are
easily understandable, quite intuitive, and produce graphical models that can also be expressed as
rules. Fuzzy decision trees combine the powerful models of decision trees with the interpretability
and the ability to process uncertainty and imprecision of fuzzy systems (Cintra et al., 2012). A
neuro-fuzzy decision tree improves on a fuzzy decision tree's classification accuracy and
classification rules. In this paper, a Neuro-Fuzzy Decision Tree algorithm using C4.5 is proposed.
The paper aims to show that the Neuro-Fuzzy decision tree performs better than traditional
decision tree models.

Keywords: Fuzzy Logic, Neural Network, Neuro-Fuzzy Design Toolbox, MATLAB 8.0, Fuzzy
Controller, Rule Viewer, Image Viewer, Rule Editor, Decision Tree, Fuzzy Neural Network
Model.

1. Introduction : Machine learning is concerned with the development of methods for
extracting patterns from data in order to make decisions based on those patterns. Decision trees
are popular machine learning methods. They are easily understandable and intuitive, and can
be represented as graphical models that can be converted into rules. They are decision support
tools that use a tree-like predictive model which maps observations about an item through several
levels of the tree until a final conclusion is reached regarding the outcome of the desired function.
One of the most used algorithms for constructing decision trees has long been the ID3
method introduced by Quinlan (1986). This algorithm tries to construct the smallest classification
tree based on the set of training data available. The main issue is that it takes discrete data as
input and also produces a discrete output class. Also, the user needs good prior knowledge of
the dataset in order to create the partitions of the dataset. C4.5, CART and C5.0 improved the
ID3 algorithm by removing these drawbacks. In this paper, we propose a Neuro-Fuzzy network
algorithm which extends the C4.5 algorithm. A comparative study of the results of the various
algorithms is provided.

2. Decision Tree Algorithms :


ID3 Algorithm : The ID3 algorithm begins with the original set S as the root node. On each
iteration, the algorithm runs through every unused attribute of the set S and calculates
the entropy H(S) (or, equivalently, the information gain IG(S)) of that attribute. It then selects
the attribute with the smallest entropy (or largest information gain). The set S is then split by the
selected attribute (e.g. age less than 50, age between 50 and 100, age greater than 100)
to produce subsets of the data. The algorithm continues to recurse on each subset,
considering only attributes never selected before.
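
For reference (these are the standard definitions, not reproduced from the paper), the entropy of a set S with class proportions p_c and the information gain of splitting S on an attribute A with values v are:

```latex
H(S) = -\sum_{c \in C} p_c \log_2 p_c
\qquad\qquad
IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
```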
Recursion on a subset may stop in one of these cases:

a) every element in the subset belongs to the same class (+ or -); the node is then turned
into a leaf and labelled with the class of the examples;
b) there are no more attributes to be selected, but the examples still do not belong to the
same class (some are + and some are -); the node is then turned into a leaf and
labelled with the most common class of the examples in the subset;
c) there are no examples in the subset, which happens when no example in the parent set
matched a specific value of the selected attribute, for example if there was no example
with age >= 100; a leaf is then created and labelled with the most common class of the
examples in the parent set.

Throughout the algorithm, the decision tree is constructed with each non-terminal node
representing the selected attribute on which the data was split, and terminal nodes
representing the class label of the final subset of this branch.
C4.5 Algorithm : C4.5 builds decision trees from a set of training data in the same way
as ID3, using the concept of information entropy. The training data is a set S = {s_1, s_2, ...}
of already classified samples. Each sample s_i consists of a p-dimensional vector
(x_{1,i}, x_{2,i}, ..., x_{p,i}), where the x_{j,i} represent attribute values or features of the
sample, together with the class in which s_i falls.

At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its
set of samples into subsets enriched in one class or the other. The splitting criterion is the
normalized information gain (the information gain, i.e. the reduction in entropy, normalized by
the split information). The attribute with the highest normalized information gain is chosen to
make the decision. The C4.5 algorithm then recurses on the smaller sublists.
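
As an illustration, here is a minimal sketch of C4.5's gain-ratio criterion for a discrete attribute (the names and data layout are our own illustrative choices; the paper itself used off-the-shelf tools):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """C4.5 splitting criterion: information gain normalized by split information."""
    n = len(labels)
    # Partition the labels by the value of the chosen attribute.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    # Information gain: entropy before the split minus weighted entropy after.
    gain = entropy(labels) - sum(len(p) / n * entropy(p) for p in partitions.values())
    # Split information penalizes attributes with many distinct values.
    split_info = -sum((len(p) / n) * math.log2(len(p) / n) for p in partitions.values())
    return gain / split_info if split_info > 0 else 0.0
```

For example, with labels ['+', '-'] split perfectly by an attribute into two one-element subsets, gain_ratio returns 1.0.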
This algorithm has a few base cases:
a) All the samples in the list belong to the same class. When this happens, C4.5 simply
creates a leaf node for the decision tree saying to choose that class.
b) None of the features provides any information gain. In this case, C4.5 creates a
decision node higher up the tree using the expected value of the class.
c) An instance of a previously unseen class is encountered. Again, C4.5 creates a decision
node higher up the tree using the expected value.
C4.5 made a number of improvements to ID3. Some of these are:

a) Handling both continuous and discrete attributes - in order to handle continuous
attributes, C4.5 creates a threshold and then splits the list into those samples whose
attribute value is above the threshold and those whose value is less than or equal to it
(a sketch of this threshold search follows this list).
b) Handling training data with missing attribute values - C4.5 allows attribute values to
be marked as ? for missing. Missing attribute values are simply not used in gain and
entropy calculations.
c) Handling attributes with differing costs.
d) Pruning trees after creation - C4.5 goes back through the tree once it has been created
and attempts to remove branches that do not help by replacing them with leaf nodes.
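
A minimal sketch of how such a threshold can be chosen for a continuous attribute (again our own illustrative code): candidate thresholds are the observed values, and the one yielding the highest information gain wins.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Pick the binary split threshold on a continuous attribute
    that maximizes information gain, as C4.5 does."""
    n = len(labels)
    base = entropy(labels)
    best_gain, best_t = -1.0, None
    for t in sorted(set(values)):
        below = [l for v, l in zip(values, labels) if v <= t]
        above = [l for v, l in zip(values, labels) if v > t]
        if not below or not above:
            continue  # a one-sided split carries no information
        gain = base - (len(below) / n) * entropy(below) - (len(above) / n) * entropy(above)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain
```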

The Classification and Regression Trees (CART) methodology was introduced by
Breiman et al. (1984) and is a classification technique which uses supervised learning to build
a decision tree. Decision trees are represented by a set of questions which split a data set into
smaller and smaller parts; as a result, decision trees partition a data set into mutually
exclusive regions. When decision trees are used for classification problems, these trees are
called classification trees.

Quinlan went on to create C5.0 and See5 (C5.0 for Unix/Linux, See5 for Windows), which he
markets commercially. C5.0 offers a number of improvements on C4.5.
Some of these are:
- Speed - C5.0 is significantly faster than C4.5 (several orders of magnitude).
- Memory usage - C5.0 is more memory efficient than C4.5.
- Smaller decision trees - C5.0 gets similar results to C4.5 with considerably
smaller decision trees.
- Support for boosting - boosting improves the trees and gives them more accuracy.
- Weighting - C5.0 allows you to weight different cases and misclassification types.
- Winnowing - a C5.0 option automatically winnows the attributes to remove those
that may be unhelpful.
For our comparative study, we use the Neuro-Fuzzy algorithm to generate a decision
tree and then compare the results of the various algorithms.

3. Fuzzy Systems and Neural Networks :

Fuzzy logic deals with situations where you are unable to answer a plain Yes or No, not
because you need more information (you have all the information you need) but because the
situation itself makes either Yes or No inappropriate. Answers such as One, Some, A Few, or
Mostly are fuzzy answers, lying somewhere between Yes and No; they handle the genuine
ambiguity in descriptions or presentations of reality. With fuzzy logic the answer is Maybe, and
its value ranges anywhere from 0 (No) to 1 (Yes).
A crisp value takes No or Yes only; a fuzzy value ranges over No, Slightly, Somewhat, Sort Of,
A Few, Mostly, Yes, Absolutely.

Zadeh (1965) proposed fuzzy set theory, an important concept for dealing with
uncertainty-based information.

The main components associated with fuzzy systems are:

- Fuzzification
- Fuzzy rule base
- Defuzzification

Fuzzification : Fuzzification refers to the transformation of crisp inputs into membership
degrees which express how well the input belongs to the linguistically defined terms. Expert
judgement and experience can be used to define the degree of membership function for a
particular variable. During fuzzification, a fuzzy logic controller receives input data, also known
as the fuzzy variable, and analyzes it according to user-defined charts called membership
functions (Klir and Yuan, 1995).
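
As a simple illustration (our own sketch, not the toolbox-generated functions), a triangular membership function can map a crisp petal length to a degree of membership in a linguistic term such as 'short'; the breakpoints below are illustrative, not taken from the paper:

```python
def triangular(x, a, b, c):
    """Triangular membership function: 0 at a and c, rising to 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Degree to which a petal length of 1.6 cm is 'short'
# (breakpoints 1.0, 1.5, 2.5 are hypothetical).
mu_short = triangular(1.6, 1.0, 1.5, 2.5)  # = 0.9
```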
Fuzzy rule base: The rule base describes the criticality level of the system for each combination
of input variables. The rules are often expressed in 'If-Then' form and are formulated in
linguistic terms using two approaches: (i) expert knowledge and expertise, and (ii) a fuzzy model
of the process (Zimmermann, 1996). For the IRIS data, for example, a rule might read: if Petal
Length is short and Petal Width is narrow, then the class is setosa.

Defuzzification: The defuzzification process examines all of the rule outcomes after they have
been aggregated and then computes a value that will be the final output of the fuzzy
controller. During defuzzification, the controller converts the fuzzy output into a real-life data
value.
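
For a Sugeno-type controller such as the one used later in this paper, defuzzification reduces to a weighted average of the rule outputs z_i, weighted by the rule firing strengths w_i (this is the standard form, stated here for completeness):

```latex
y = \frac{\sum_i w_i \, z_i}{\sum_i w_i}
```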

A fuzzy controller is used to convert the four fuzzy inputs into a fuzzy output. For this problem,
the Sugeno controller provides better results than the Mamdani controller. The inputs considered
are Sepal Length, Sepal Width, Petal Length and Petal Width (of the IRIS dataset).

The inputs, controller and output are designed using the neuro-fuzzy design tool of MATLAB.
Fuzzy membership functions for the input and output variables and the fuzzy rule engine are
generated by the neuro-fuzzy design toolbox.

Artificial neural networks (ANNs) are simplified models of the interconnections between cells of
the brain. They are defined by Wasserman and Schwartz (1988) as "highly simplified models of
the human nervous system, exhibiting abilities such as learning, generalization and abstraction."
Recent technological advances have made ANN models a viable alternative for many
decision problems, and they have the potential to improve the models of numerous financial
activities such as forecasting financial distress in firms. A general description of neural networks
is found in Rumelhart, Hinton and Williams (1986).

Neural networks are typically organized in layers. Layers are made up of a number of
interconnected 'nodes' which contain an 'activation function'. Patterns are presented to the
network via the 'input layer', which communicates to one or more 'hidden layers' where the actual
processing is done via a system of weighted 'connections'. The hidden layers then link to an
'output layer' where the answer is output.

[Figure: A neural network model]

As per Josef Thomas Burger, most ANNs contain some form of 'learning rule' which modifies the
weights of the connections according to the input patterns they are presented with. In a sense,
ANNs learn by example as do their biological counterparts. Although there are many different
kinds of learning rules used by neural networks, this discussion is concerned only with one:
the delta rule. The delta rule is often utilized by the most common class of ANNs, called
'backpropagational neural networks' (BPNNs). Backpropagation is an abbreviation for the
backwards propagation of error.

With the delta rule, as with other types of backpropagation, 'learning' is a supervised process that
occurs with each cycle or 'epoch' (i.e. each time the network is presented with a new input
pattern) through a forward activation flow of outputs and the backwards error propagation of
weight adjustments. More simply, when a neural network is initially presented with a pattern, it
makes a random 'guess' as to what it might be. It then sees how far its answer was from the actual
one and makes an appropriate adjustment to its connection weights.

Within each hidden layer node is a sigmoidal activation function which polarizes network
activity and helps it to stabilize.
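
In standard textbook notation (not taken from the paper), the sigmoid activation and the delta rule's weight update, with learning rate \eta, target t_j, output o_j and input x_i, are:

```latex
\sigma(x) = \frac{1}{1 + e^{-x}},
\qquad
\Delta w_{ij} = \eta \,(t_j - o_j)\, x_i
```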

Backward propagation performs a gradient descent within the solution's vector space towards a
'global minimum' along the steepest vector of the error surface. The global minimum is the
theoretical solution with the lowest possible error. The error surface itself is a hyperparaboloid in
shape. In most problems, the solution space is quite irregular, with numerous 'pits' and 'hills'
which may cause the network to settle into a 'local minimum' which is not the best overall
solution.

Since the nature of the error space cannot be known a priori, neural network analysis often
requires a large number of individual runs to determine the best solution. Most learning rules
have built-in mathematical terms to assist in this process, controlling the 'speed' (the beta
coefficient) and the 'momentum' of the learning. The speed of learning is actually the rate of
convergence between the current solution and the global minimum.
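
In the usual formulation (again the standard textbook form), the learning rate \eta and momentum \alpha enter the weight update at step t as:

```latex
\Delta w(t) = -\eta \, \frac{\partial E}{\partial w} + \alpha \, \Delta w(t-1)
```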

Once the neural network is trained at sufficient epoch levels with a low error tolerance, the
network can be used to test and predict on new data. The data is processed through the hybrid
network (with the fuzzy inference system and rules, and the trained neural network).

New inputs are presented to the input layer, where they filter into and are processed by the
middle layers as though training were taking place; however, at this point the output is retained
and no backward propagation occurs. The output of a forward propagation run is the predicted
model for the data, which can then be used for further analysis and interpretation.

4. Neuro-Fuzzy Decision Tree Model :

The following are the steps involved in building the Neuro-Fuzzy Decision Tree using the
neuro-fuzzy design toolbox of MATLAB. A simplified code sketch of this workflow follows the
list.
a) IRIS data is used as the input data for the tool. The data has four inputs, viz. Petal
Length, Petal Width, Sepal Length and Sepal Width.
b) Partition the collected data into training data and testing data, and load both into the
Load Data frame of the ANFIS (adaptive neuro-fuzzy inference system) editor, using the
neuro-fuzzy design toolbox of MATLAB.
c) Generate the fuzzy membership functions for the input variables, use the Sugeno fuzzy
controller, and generate the fuzzy rules using the Generate FIS frame. The Sub
Clustering radio button option is chosen, as a previous study shows the sub-clustering
option generates a better fuzzy inference system.
d) Train the FIS model: the two ANFIS parameter optimization methods available
for FIS training are backpropagation and hybrid. The hybrid option is chosen, as it
combines backpropagation with the least-squares method. Error tolerance is used to
create a training stopping criterion, which is related to the error size; training stops once
the training data error remains within this tolerance. It is best left set to 0, as we do not
know in advance how the training error is going to behave. The number of training
epochs is set to 30; it can be increased to improve the results.
e) Finally, the Test FIS frame is used to evaluate the trained model on the test data.
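
The paper's model was built interactively in the MATLAB toolbox. Purely as an illustration of the underlying idea, here is a minimal zeroth-order Sugeno sketch in Python with grid-partitioned Gaussian memberships and least-squares consequents; all names and parameter choices are ours, and real ANFIS additionally tunes the membership parameters and, here, used subtractive clustering rather than grid partitioning:

```python
import itertools
import numpy as np

def gauss(x, c, s):
    """Gaussian membership degree of x in a fuzzy set with center c and width s."""
    return np.exp(-0.5 * ((x - c) / s) ** 2)

def firing_strengths(X, centers, widths, rules):
    """Normalized rule firing strengths: product of one membership per input."""
    W = np.array([[np.prod([gauss(x[j], centers[j][r[j]], widths[j])
                            for j in range(len(r))]) for r in rules] for x in X])
    return W / W.sum(axis=1, keepdims=True)

def fit_sugeno(X, y, n_mf=2):
    """Fit the constant consequents of a zeroth-order Sugeno FIS by least squares."""
    d = X.shape[1]
    # Grid-partition each input's range with n_mf Gaussian membership functions.
    centers = [np.linspace(X[:, j].min(), X[:, j].max(), n_mf) for j in range(d)]
    widths = [(X[:, j].max() - X[:, j].min()) / n_mf + 1e-9 for j in range(d)]
    rules = list(itertools.product(range(n_mf), repeat=d))  # n_mf**d rules
    W = firing_strengths(X, centers, widths, rules)
    consequents, *_ = np.linalg.lstsq(W, y, rcond=None)
    return centers, widths, rules, consequents

def predict_sugeno(X, centers, widths, rules, consequents):
    """Weighted average of rule consequents (Sugeno defuzzification)."""
    return firing_strengths(X, centers, widths, rules) @ consequents

# Usage on IRIS: encode the class as 0, 1, 2, fit on the training partition,
# then round predict_sugeno's output to the nearest class index.
```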

The following snapshots of the ANFIS model in action were produced (figures omitted): the
membership function editor, the surface viewer, and the ANFIS output on the testing data.

As we can observe, the prediction accuracy of the Neuro-Fuzzy Decision Tree is 100%.


The following table provides the predictions of the various decision tree techniques. We ran the
algorithms in R Studio and noted the following results; rows give the actual class and columns
the predicted class:

Algorithm    Actual class   setosa   versicolor   virginica   Accuracy
ID3          setosa           50         0            0
             versicolor        0        49            1
             virginica         0         2           48          98%
C4.5         setosa           50         0            0
             versicolor        0        49            2
             virginica         0         1           48          98%
C5.0         setosa           50         0            0
             versicolor        0        50            0
             virginica         0         0           50         100%
Neural Net   setosa           50         0            0
             versicolor        0        50            0
             virginica         0         0           50         100%

Comparing the decision tree algorithms, we can note that the Neuro-Fuzzy model, the neural net
and C5.0 perform perfectly on the IRIS dataset. An illustrative equivalent of this experiment
follows.
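
The comparison above was run in R Studio. As an illustrative equivalent (our own sketch, using an entropy-based CART tree and a multilayer perceptron rather than the exact R implementations), the same kind of confusion matrices can be produced in Python with scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import cross_val_predict

X, y = load_iris(return_X_y=True)
models = {
    # An entropy-criterion tree stands in for ID3/C4.5; sklearn implements CART.
    "tree (entropy)": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "neural net": MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0),
}
for name, model in models.items():
    pred = cross_val_predict(model, X, y, cv=10)  # out-of-sample predictions
    print(name, accuracy_score(y, pred))
    print(confusion_matrix(y, pred))
```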

5. Conclusion and Future Research Directions : The Neuro-Fuzzy network model is able to
build a decision tree which predicts the output of the IRIS data (4 inputs) better than C4.5 and
the other algorithms considered.

The following suggestions may be considered for future research:

a) Support Vector Machines and genetic algorithms can be considered for decision tree
generation.

b) A combination of a genetic algorithm and an NFIS (neuro-fuzzy inference system) can
be considered.
c) Neural network training methods can be further improved.

d) A broader comparison of decision tree algorithms over different types of data samples,
viz. small, large and high-volume data samples, can be carried out, and suggestions for
choosing the right decision tree algorithm based on the size of the dataset can be
developed.

References:
Breiman, L., Friedman, J. H., Olshen, R., & Stone, C. (1984). Classification and Regression
Trees. Wadsworth and Brooks.
Chang, R. L. P., & Pavlidis, T. (1977). Fuzzy decision tree algorithms. IEEE Transactions on
Systems, Man and Cybernetics, 7, 28-35.
Cintra, M. E., Monard, M. C., & Camargo, H. A. (2012). FuzzyDT: a fuzzy decision tree
algorithm based on C4.5.
Frank, A., & Asuncion, A. (2010). UCI machine learning repository.
Hawley, D. D., & Johnson, J. D. (1994). Artificial neural networks: past, present, and future:
an overview of the structure and training of artificial learning systems. In A. Whinston &
J. D. Johnson (Eds.), Advances in Artificial Intelligence in Economics, Finance and
Management (Vol. 1, pp. 1-22). JAI Press.
Hawley, D. D., Johnson, J. D., & Raina, D. (1990). Artificial neural systems: a new tool for
financial decision making. Financial Analysts Journal, November/December, 62-72.
Klir, G. J., & Yuan, B. (1995). Fuzzy Sets and Fuzzy Logic: Theory and Application. Prentice
Hall PTR, New Jersey, U.S.A.
Olaru, C., & Wehenkel, L. (2003). A complete fuzzy decision tree technique. Fuzzy Sets and
Systems, 138(2), 221-254.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81-106. Reprinted in
Shavlik and Dieterich (Eds.), Readings in Machine Learning.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning (Morgan Kaufmann Series in
Machine Learning). Morgan Kaufmann, 1st edition.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by
back-propagating errors. Nature, 323, 533-536.
Wasserman, P. D., & Schwartz, T. (1988). Neural networks, part 1. IEEE Expert, Spring, 10-15.
Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8(3), 338-353.
Zimmermann, H. (1996). Fuzzy Set Theory and its Applications (3rd ed.). Kluwer Academic
Publishers, London.
