
Implementation of ID3 – Decision Tree Algorithm

Sharad Verma*, Nikita Jain**


*Sharadverma@live.in, Truba College of Engineering & Science / Computer Engineering, Indore, India
**nikitaj.01@gmail.com, Truba College of Engineering & Science / Computer Engineering, Indore, India

Abstract

In this paper we address the decision tree learning algorithm, which has been used successfully in expert systems for capturing knowledge. The main task performed in these systems is to apply inductive methods to the given attribute values of an unknown object in order to determine its appropriate classification according to decision tree rules. We focus on the problem of decision tree learning with the popular ID3 algorithm. Classification algorithms have a wide range of applications, such as churn prediction, fraud detection, artificial intelligence, and credit card rating. Many classification algorithms are available in the literature, but decision trees are the most commonly used because they are easier to implement and to understand than other classification algorithms.

Keywords: Data mining, Decision trees & ID3 Algorithm.

1. Introduction

A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision.

Decision trees are commonly used for gaining information for the purpose of decision-making. A decision tree starts with a root node, from which users take actions. From this node, users split each node recursively according to the decision tree learning algorithm. The final result is a decision tree in which each branch represents a possible scenario of decision and its outcome. We demonstrate this on ID3, a well-known and influential algorithm for the task of decision tree learning. We note that extensions of ID3 are widely used in real market applications.

ID3 is a simple decision tree learning algorithm developed by Ross Quinlan (1983). The basic idea of the ID3 algorithm is to construct the decision tree by employing a top-down, greedy search through the given sets to test each attribute at every tree node. In order to select the attribute that is most useful for classifying a given set, we introduce a metric: information gain.

To find an optimal way to classify a learning set, we need to minimize the number of questions asked (i.e. minimize the depth of the tree). Thus, we need some function which can measure which questions provide the most balanced splitting. The information gain metric is such a function.

1.1 Entropy

In information theory, entropy is a measure of the uncertainty about a source of messages. The more uncertain a receiver is about a source of messages, the more information that receiver will need in order to know what message has been sent.

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{k} p_i \log p_i$$

Here k is the number of classes and p_i is the proportion of the instances in S that belong to class i. The logarithm is conventionally taken to base 2, so that entropy is measured in bits.

1.2 Information Gain

Information gain measures the expected reduction in entropy. As mentioned before, to minimize the depth of the decision tree we need to select the optimal attribute for splitting each node as we traverse a tree path. It follows that the attribute with the greatest entropy reduction is the best choice. We define information gain as the expected reduction of entropy related to a specified attribute when splitting a decision tree node.

$$\mathrm{Gain}(S, S_1, \ldots, S_r) = \mathrm{Entropy}(S) - \sum_{j=1}^{r} \frac{|S_j|}{|S|}\,\mathrm{Entropy}(S_j)$$

Here S_1, ..., S_r are the subsets into which the attribute partitions S.

For inductive learning, decision tree learning is attractive for 3 reasons:

1. A decision tree is a good generalization for unobserved instances, but only if the instances are described in terms of features that are correlated with the target concept.

2. The methods are computationally efficient, with cost proportional to the number of observed training instances.

3. The resulting decision tree provides a representation of the concept that appeals to humans because it renders the classification process self-evident.

1.3 Related Work

In this paper, we have focused on the problem of minimizing test cost while maximizing accuracy. In some settings, it is more appropriate to minimize misclassification costs instead of maximizing accuracy. For the two-class problem, Elkan gives a method to minimize misclassification costs given classification probability estimates. Bradford et al. compare pruning algorithms to minimize misclassification costs. As both of these methods act independently of the decision tree growing process, they can be incorporated with our algorithms (although we leave this as future work). Ling et al. propose a cost-sensitive decision tree algorithm that optimizes both accuracy and cost. However, the cost-insensitive version of their algorithm (i.e. the algorithm run if all feature costs are zero) reduces to a splitting criterion that maximizes accuracy, which is well known to be inferior to the information gain and gain ratio criteria. Integrating machine learning with program understanding is an active area of current research. Systems that analyze root cause errors in distributed systems and systems that find bugs using dynamic predicates may both benefit from cost-sensitive learning to decrease monitoring overhead costs.

2. Classification by Decision Tree Learning

This section briefly describes the machine learning and data mining problem of classification and ID3, a well-known algorithm for it. The presentation here is simplified and very brief, and we refer the reader to Mitchell [1] for an in-depth treatment of the subject. The ID3 algorithm for generating decision trees was first introduced by Quinlan in [7] and has since become a very popular learning tool.

2.1 The Classification Problem

The aim of a classification problem is to classify transactions into one of a discrete set of possible categories. The input is a structured database comprised of attribute-value pairs. Each row of the database is a transaction and each column is an attribute taking on different values. One of the attributes in the database is designated as the class attribute; the set of possible values for this attribute are the classes. We wish to predict the class of a transaction by viewing only the non-class attributes. This can then be used to predict the class of new transactions for which the class is unknown.

For example, the weather problem is a toy data set which we will use to understand how a decision tree is built. It is reproduced with slight modifications in Witten and Frank (1999), and concerns the conditions under which some hypothetical outdoor game may be played. In this dataset, there are five categorical attributes: outlook, temperature, humidity, windy, and play. We are interested in building a system which will enable us to decide whether or not to play the game on the basis of the weather conditions, i.e. we wish to predict the value of play using outlook, temperature, humidity, and windy. We can think of the attribute we wish to predict, i.e. play, as the output attribute, and the other attributes as input attributes.

2.2 Decision Trees and the ID3 Algorithm

The main ideas behind the ID3 algorithm are:

1. Each non-leaf node of a decision tree corresponds to an input attribute, and each arc to a possible value of that attribute. A leaf node corresponds to the expected value of the output attribute when the input attributes are described by the path from the root node to that leaf node.

2. In a "good" decision tree, each non-leaf node should correspond to the input attribute which is the most informative about the output attribute amongst all the input attributes not yet considered in the path from the root node to that node. This is because we would like to predict the output attribute using the smallest possible number of questions on average.

The ID3 algorithm assumes that each attribute is categorical, that is, it contains discrete data only, in contrast to continuous data such as age or height. The principle of the ID3 algorithm is as follows. The tree is constructed top-down in a recursive fashion. At the root, each attribute is tested to determine how well it alone classifies the transactions. The "best" attribute (to be discussed below) is then chosen and the remaining transactions are partitioned by it. ID3 is then recursively called on each partition (which is a smaller database containing only the appropriate transactions and without the splitting attribute).

2.2.1 ID3 algorithm is best suited for:

1. Instances are represented as attribute-value pairs.
2. The target function has discrete output values.
3. Attribute values should be nominal.
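
The recursive procedure of Section 2.2, together with the entropy and information gain measures of Sections 1.1 and 1.2, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this paper. The function names (entropy, partition, information_gain, id3), the nested-dictionary tree representation, and the base-2 logarithm are choices made here for illustration. The 14 rows are the commonly cited nominal version of the weather data, which the paper itself does not list.

```python
import math
from collections import Counter


def entropy(rows, target):
    """Entropy (base 2, an assumption) of the class attribute over rows."""
    counts = Counter(row[target] for row in rows)
    total = len(rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def partition(rows, attr):
    """Group rows by the value they take for attr."""
    parts = {}
    for row in rows:
        parts.setdefault(row[attr], []).append(row)
    return parts


def information_gain(rows, attr, target):
    """Entropy(S) minus the size-weighted entropy of the subsets S_j."""
    total = len(rows)
    remainder = sum(len(subset) / total * entropy(subset, target)
                    for subset in partition(rows, attr).values())
    return entropy(rows, target) - remainder


def id3(attributes, target, rows):
    """Return a class label (leaf) or a dict {attribute: {value: subtree}}."""
    classes = Counter(row[target] for row in rows)
    # Leaf: no attributes left (majority class) or all rows share one class.
    if not attributes or len(classes) == 1:
        return classes.most_common(1)[0][0]
    # Greedy step: split on the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    return {best: {value: id3(remaining, target, subset)
                   for value, subset in partition(rows, best).items()}}


# Assumed data: the widely reproduced 14-row nominal weather table.
COLUMNS = ["outlook", "temperature", "humidity", "windy", "play"]
RAW = """sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no"""
WEATHER = [dict(zip(COLUMNS, line.split())) for line in RAW.splitlines()]

tree = id3(COLUMNS[:-1], "play", WEATHER)
print(tree)
```

On these rows the sketch selects outlook at the root, since it has the largest gain (about 0.247 bits against a class entropy of about 0.940 bits). It then splits the sunny branch on humidity and the rainy branch on windy, while the overcast branch is already a pure "yes" leaf.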

Figure 1: The ID3 Algorithm for Decision Tree Learning

ID3(R, C, T), where R is the set of input attributes, C is the class attribute, and T is the set of transactions:

1. If R is empty, return a leaf-node with the class value assigned to the most transactions in T.
2. If T consists of transactions which all have the same value c for the class attribute, return a leaf-node with the value c (finished classification path).
3. Otherwise,
(a) Determine the attribute that best classifies the transactions in T, let it be A.
(b) Let a1, ..., am be the values of attribute A and let T(a1), ..., T(am) be a partition of T such that every transaction in T(ai) has the attribute value ai.
(c) Return a tree whose root is labeled A (this is the test attribute) and has edges labeled a1, ..., am such that for every i, the edge ai goes to the tree ID3(R − {A}, C, T(ai)).

What remains is to explain how the best predicting attribute is chosen. This is the central principle of ID3 and is based on information theory. The entropy of the class attribute expresses the difficulty of prediction. We know the class of a set of transactions when the class entropy for them equals zero. The idea is therefore to check which attribute reduces the entropy of the class attribute to the greatest degree. This results in a greedy algorithm which searches for a small decision tree consistent with the database. The bias favoring short descriptions of a hypothesis is based on Occam's razor. As a result of this, decision trees are usually relatively small, even for large databases.

2.3 Advantages of using ID3

• Understandable prediction rules are created from the training data.
• Builds the tree quickly.
• Builds a short tree.
• Only enough attributes need to be tested until all data is classified.
• Finding leaf nodes enables test data to be pruned, reducing the number of tests.
• The whole dataset is searched to create the tree.

3. Conclusion

We conclude that ID3 works fairly well on classification problems whose datasets have nominal attribute values. It also works well in the case of missing attribute values, but the way missing attributes are handled governs the performance of the algorithm. Neglecting instances with missing attribute values leads to a higher error rate than treating the missing value as a separate value. Decision tree induction is one of the classification techniques used in decision support systems and machine learning. With the decision tree technique, the training data set is recursively partitioned using a depth-first (Hunt's method) or breadth-first greedy technique (Shafer et al., 1996) until each partition is pure, i.e. all its instances belong to the same class/leaf node (Hunt et al., 1966 and Shafer et al., 1996). The decision tree model is preferred among other classification algorithms because it is an eager learning algorithm and easy to implement.

4. References

[1] Tom M. Mitchell (1997). Machine Learning. Singapore: McGraw-Hill.

[2] Usama et al. "On the Handling of Continuous-Values Attributes in Decision Tree Generation". University of Michigan, Ann Arbor.

[3] R. Chmielewski et al. "Global Discretization of Continuous Attributes as Preprocessing for Machine Learning". Int. Journal of Approximate Reasoning, 1996.

[4] Dan Ventura et al. "An Empirical Comparison of Discretization Methods". Proceedings of the Tenth International Symposium on Computer and Information Sciences, pp. 443-450, 1995.

[5] Karmaker et al. "Incorporating an EM-Approach for Handling Missing Attribute-Values in Decision Tree Induction".

[6] Stuart Russell and Peter Norvig (1995). Artificial Intelligence: A Modern Approach. New Jersey: Prentice Hall.

[7] J. R. Quinlan (1986). "Induction of Decision Trees". Machine Learning, pp. 81-106.

[8] M. R. Civanlar and H. J. Trussell. "Constructing membership functions using statistical data". Fuzzy Sets and Systems, vol. 18, 1986, pp. 1-14.