
*Sharadverma@live.in, Truba College of Engineering & Science / Computer Engineering, INDORE, INDIA
**nikitaj.01@gmail.com, Truba College of Engineering & Science / Computer Engineering, INDORE, INDIA

Abstract

In this paper we address the decision tree learning algorithm, which has been used successfully in expert systems for capturing knowledge. The main task performed in these systems is applying inductive methods to the given attribute values of an unknown object in order to determine an appropriate classification according to decision tree rules. We focus on the problem of decision tree learning with the popular ID3 algorithm. Such algorithms have a wide range of applications, including churn prediction, fraud detection, artificial intelligence, and credit card rating. Although many classification algorithms are available in the literature, decision trees are the most commonly used because they are easy to implement and easier to understand than other classification algorithms.

Keywords: Data mining, Decision trees & ID3 Algorithm.

1. Introduction

A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision. Decision trees are commonly used for gaining information for the purpose of decision-making. A decision tree starts with a root node on which users take actions. From this node, users split each node recursively according to the decision tree learning algorithm. The final result is a decision tree in which each branch represents a possible scenario of a decision and its outcome. We demonstrate this on ID3, a well-known and influential algorithm for the task of decision tree learning. We note that extensions of ID3 are widely used in real market applications.

ID3 is a simple decision tree learning algorithm developed by Ross Quinlan (1983). The basic idea of the ID3 algorithm is to construct the decision tree by employing a top-down, greedy search through the given sets to test each attribute at every tree node. In order to select the attribute that is most useful for classifying a given set, we introduce a metric: information gain.

To find an optimal way to classify a learning set, we need to minimize the questions asked (i.e. minimize the depth of the tree). Thus, we need a function that can measure which questions provide the most balanced splitting. The information gain metric is such a function.

1.1 Entropy

In information theory, entropy is a measure of the uncertainty about a source of messages. The more uncertain a receiver is about a source of messages, the more information that receiver will need in order to know what message has been sent. For a set S partitioned into k classes occurring with proportions p_1, ..., p_k:

Entropy(S) = -\sum_{i=1}^{k} p_i \log p_i
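As a concrete illustration, here is a minimal Python sketch of this formula; the function name and the choice of base-2 logarithms are our own (the paper does not fix a log base):

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A 9-to-5 class split, the class proportions of the classic weather data:
print(entropy(["yes"] * 9 + ["no"] * 5))  # ~0.940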

1.2 Information Gain

Information gain measures the expected reduction in entropy. As mentioned above, to minimize the decision tree depth we need to select the optimal attribute for splitting each tree node as we traverse a tree path, and it follows that the attribute offering the largest entropy reduction is the best choice. We define information gain as the expected reduction of entropy related to a specified attribute when splitting a decision tree node. If splitting S on that attribute yields the partition S_1, ..., S_r:

Gain(S, S_1, \ldots, S_r) = Entropy(S) - \sum_{j=1}^{r} \frac{|S_j|}{|S|} Entropy(S_j)
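A direct Python rendering of this definition, building on the entropy sketch above; the function name and the dict-per-transaction row representation are our own choices:

def information_gain(rows, attribute, class_attr):
    """Expected entropy reduction from splitting rows on attribute.

    Each row is a dict mapping attribute names to values.
    """
    total = len(rows)
    base = entropy([r[class_attr] for r in rows])
    # Partition the rows by the value of the splitting attribute.
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attribute], []).append(r)
    # Weighted entropy of the partition, as in the Gain formula.
    remainder = sum(
        (len(part) / total) * entropy([r[class_attr] for r in part])
        for part in partitions.values()
    )
    return base - remainder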

For inductive learning, decision tree learning is attractive for three reasons:

1. A decision tree is a good generalization for unobserved instances, provided the instances are described in terms of features that are correlated with the target concept.

2. The methods are computationally efficient, with cost proportional to the number of observed training instances.

3. The resulting decision tree provides a representation of the concept that appeals to humans because it renders the classification process self-evident.

1.3 Related Work

In this paper, we have focused on the problem of minimizing test cost while maximizing accuracy. In some settings, it is more appropriate to minimize misclassification costs instead of maximizing accuracy. For the two-class problem, Elkan gives a method to minimize misclassification costs given classification probability estimates. Bradford et al. compare pruning algorithms that minimize misclassification costs. As both of these methods act independently of the decision tree growing process, they can be incorporated with our algorithms (although we leave this as future work). Ling et al. propose a cost-sensitive decision tree algorithm that optimizes both accuracy and cost. However, the cost-insensitive version of their algorithm (i.e. the algorithm run when all feature costs are zero) reduces to a splitting criterion that maximizes accuracy, which is well known to be inferior to the information gain and gain ratio criteria. Integrating machine learning with program understanding is an active area of current research. Systems that analyze root-cause errors in distributed systems and systems that find bugs using dynamic predicates may both benefit from cost-sensitive learning to decrease monitoring overhead.

2. Classification by Decision Tree Learning

This section briefly describes the machine learning and data mining problem of classification, and ID3, a well-known algorithm for it. The presentation here is rather simplistic and very brief, and we refer the reader to Mitchell [12] for an in-depth treatment of the subject. The ID3 algorithm for generating decision trees was first introduced by Quinlan in [15] and has since become a very popular learning tool.

2.1 The Classification Problem

The aim of a classification problem is to classify transactions into one of a discrete set of possible categories. The input is a structured database comprised of attribute-value pairs. Each row of the database is a transaction and each column is an attribute taking on different values. One of the attributes in the database is designated as the class attribute, the set of possible values for this attribute being the classes. We wish to predict the class of a transaction by viewing only the non-class attributes. This can then be used to predict the class of new transactions for which the class is unknown.

For example, the weather problem is a toy data set which we will use to understand how a decision tree is built. It is reproduced with slight modifications in Witten and Frank (1999), and concerns the conditions under which some hypothetical outdoor game may be played. In this dataset there are five categorical attributes: outlook, temperature, humidity, windy, and play. We are interested in building a system which will enable us to decide whether or not to play the game on the basis of the weather conditions, i.e. we wish to predict the value of play using outlook, temperature, humidity, and windy. We can think of the attribute we wish to predict, i.e. play, as the output attribute, and the other attributes as inputs.
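To make the setup concrete, such a transaction database might be represented in Python as follows; the rows are only a hypothetical excerpt in the spirit of the weather data, not the actual dataset, and the variable names are ours:

# Each transaction is a dict; "play" is the designated class attribute.
weather = [
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "false", "play": "no"},
    {"outlook": "overcast", "temperature": "hot", "humidity": "high", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "temperature": "mild", "humidity": "high", "windy": "true", "play": "no"},
]
input_attributes = ["outlook", "temperature", "humidity", "windy"]
class_attribute = "play"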

2.2 Decision Trees and the ID3 Algorithm

The main ideas behind the ID3 algorithm are:

1. Each non-leaf node of a decision tree corresponds to an input attribute, and each arc to a possible value of that attribute. A leaf node corresponds to the expected value of the output attribute when the input attributes are described by the path from the root node to that leaf node.

2. In a "good" decision tree, each non-leaf node should correspond to the input attribute which is the most informative about the output attribute amongst all the input attributes not yet considered in the path from the root node to that node. This is because we would like to predict the output attribute using the smallest possible number of questions on average.

The ID3 algorithm assumes that each attribute is categorical, that is, containing discrete data only, in contrast to continuous data such as age, height, etc. The principle of the ID3 algorithm is as follows. The tree is constructed top-down in a recursive fashion. At the root, each attribute is tested to determine how well it alone classifies the transactions. The "best" attribute (to be discussed below) is then chosen and the remaining transactions are partitioned by it. ID3 is then recursively called on each partition (which is a smaller database containing only the appropriate transactions and without the splitting attribute). ID3 is therefore suited to problems in which:

1. Instances are represented as attribute-value pairs.
2. The target function has discrete output values.
3. Attribute values are nominal.

Figure 1: The ID3 Algorithm for Decision Tree Learning

ID3(R, C, T) — R: the remaining input attributes; C: the class attribute; T: the transactions.
1. If R is empty, return a leaf-node with the class value assigned to the most transactions in T.
2. If T consists of transactions which all have the same value c for the class attribute, return a leaf-node with the value c (finished classification path).
3. Otherwise,
(a) Determine the attribute that best classifies the transactions in T; let it be A.
(b) Let a_1, ..., a_m be the values of attribute A and let T(a_1), ..., T(a_m) be a partition of T such that every transaction in T(a_i) has the attribute value a_i.
(c) Return a tree whose root is labeled A (this is the test attribute) and has edges labeled a_1, ..., a_m such that for every i, the edge a_i goes to the tree ID3(R − {A}, C, T(a_i)).
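The following Python sketch is one way to realize Figure 1, reusing the entropy and information_gain helpers sketched in Sections 1.1 and 1.2; the representation choices (dicts for transactions, a nested dict for the tree) are ours:

from collections import Counter

def majority_class(rows, class_attr):
    """Class value assigned to the most transactions in rows (step 1)."""
    return Counter(r[class_attr] for r in rows).most_common(1)[0][0]

def id3(attributes, class_attr, rows):
    """ID3(R, C, T): returns a class label (leaf) or a nested-dict tree."""
    classes = {r[class_attr] for r in rows}
    if len(classes) == 1:      # step 2: one class remains -> leaf
        return classes.pop()
    if not attributes:         # step 1: R is empty -> majority-class leaf
        return majority_class(rows, class_attr)
    # Step 3(a): the best attribute is the one with the highest gain.
    best = max(attributes, key=lambda a: information_gain(rows, a, class_attr))
    # Step 3(b): partition the transactions by their value of `best`.
    partitions = {}
    for r in rows:
        partitions.setdefault(r[best], []).append(r)
    # Step 3(c): recurse on each partition without the splitting attribute.
    remaining = [a for a in attributes if a != best]
    return {best: {value: id3(remaining, class_attr, part)
                   for value, part in partitions.items()}}

Called as id3(input_attributes, class_attribute, weather) on the toy rows above, this returns a nested dict whose top-level key is the attribute chosen for the root.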

What remains is to explain how the best predicting attribute is chosen. This is the central principle of ID3 and is based on information theory. The entropy of the class attribute clearly expresses the difficulty of prediction: we know the class of a set of transactions when the class entropy for them equals zero. The idea is therefore to check which attribute reduces the entropy of the class attribute to the greatest degree. This results in a greedy algorithm which searches for a small decision tree consistent with the database. The bias favoring short descriptions of a hypothesis is based on Occam's razor. As a result, decision trees are usually relatively small, even for large databases.

2.3 Advantages of using ID3

- Understandable prediction rules are created from the training data.
- Builds the fastest tree.
- Builds a short tree.
- Only needs to test enough attributes until all data is classified.
- Finding leaf nodes enables test data to be pruned, reducing the number of tests.
- The whole dataset is searched to create the tree.

3. Conclusion

This paper concludes that ID3 works fairly well on classification problems whose datasets have nominal attribute values. It also works well in the presence of missing attribute values, but the way missing values are handled governs the performance of the algorithm: neglecting instances with missing values for an attribute leads to a higher error rate than treating the missing value as a separate value. Decision tree induction is one of the classification techniques used in decision support systems and machine learning. With the decision tree technique, the training data set is recursively partitioned using a depth-first (Hunt's method) or breadth-first greedy technique (Shafer et al., 1996) until each partition is pure, i.e. all of its transactions belong to the same class/leaf node (Hunt et al., 1966; Shafer et al., 1996). The decision tree model is preferred over other classification algorithms because it is an eager learning algorithm and is easy to implement.

4. References

[1] Tom M. Mitchell (1997). Machine Learning. Singapore: McGraw-Hill.
[2] "... Attributes in Decision Tree Generation". University of Michigan, Ann Arbor.
[3] R. Chmielewski et al. "Global Discretization of Continuous Attributes as Preprocessing for Machine Learning". Int. Journal of Approximate Reasoning, 1996.
[4] Dan Ventura et al. "An Empirical Comparison of Discretization Methods". Proceedings of the Tenth International Symposium on Computer and Information Sciences, pp. 443-450, 1995.
[5] Karmaker et al. "Incorporating an EM-Approach for Handling Missing Attribute-Values in Decision Tree Induction".
[6] "... A Modern Approach". New Jersey: Prentice Hall.
[7] J. R. Quinlan (1986). "Induction of Decision Trees". Machine Learning, Vol. 1, pp. 81-106.
[8] M. R. Civanlar and H. J. Trussell. "Constructing membership functions using statistical data". Fuzzy Sets and Systems, vol. 18, 1986, pp. 1-14.
