Decision Tree

Under the Guidance of

Dr. C.N. Ravi Kumar, HOD, Dept. of Computer Science, SJCE, Mysore

Nitin Jain, 5VZ07SSZ04, IV Sem, Software Engineering, SJCE, Mysore

Content
• Decision Tree
• The Construction Principle
• The Best Split
• Splitting Indices
• Splitting Criteria
• Decision Tree Construction Algorithms
• CART
• ID3
• C4.5

Content
• CHAID
• Decision Tree Construction with Presorting
• Rain Forest
• Approximate Method
• CLOUDS
• BOAT
• Pruning Technique
• Integration of Pruning and Construction
• Conclusion

Decision Tree
• It is a classification scheme that generates a tree and a set of rules, representing the model of the different classes, from a given data set.
• Training set: used to derive the classifier.
• Test set: used to measure the accuracy of the classifier.

Example of a decision tree [figure]

Rules to classify [figure]

A Decision Tree [figure]

Rule 1 = 50%, Rule 2 = 50%, Rule 3 = 66%

Example 2 [figure]

Advantages of Decision Tree Classification
• Easy to understand.
• Can handle both numerical and categorical attributes.
• Provides a clear indication of which fields are important for prediction and classification.

Shortcomings of Decision Tree Classification
• Some decision tree algorithms can deal only with binary-valued target classes.
• The process of growing the tree is computationally expensive.

Principle of Tree Construction
• Construction phase: the initial decision tree is constructed in this phase.
• Pruning phase: in this stage the lower branches are removed to improve the performance.
• Processing phase: the pruned tree is processed further to improve the performance.

OVERFIT
• A tree T is said to overfit the training data if there exists some other tree T’ which is a simplification of T, such that T has a smaller error than T’ over the training set but T’ has a smaller error than T over the entire distribution of instances.

OVERFIT Leads to
• Overfitting leads to difficulties when there is noise in the training data, or when the number of training examples is too small.
• An incorrect model.
• Increased space requirements and more computational resources.
• An unnecessary collection of features.
• A tree that is difficult to comprehend.

The Best Split
• Evaluation of the splits for each attribute and selection of the best split, i.e. determination of the splitting attribute.
• Determination of the splitting condition on the selected splitting attribute.
• Partitioning of the data using the best split.

The Best Split
• The splitting depends on the domain of the attribute being numerical or categorical.
• The best split is the one that does the best job of separating the records into groups in which a single class predominates.
• But how do we determine the best one?

Splitting INDICES
• ENTROPY
• INFORMATION FOR PARTITION
• GAIN
• GINI INDEX

ENTROPY
• Entropy provides an information-theoretic approach to measuring the goodness of a split.
• The classes are ‘yes’ or ‘no’ in our example.
• Entropy(P) = -[p1 log(p1) + p2 log(p2) + … + pn log(pn)], where P = (p1, p2, …, pn) is the probability distribution of the classes.
• Entropy gives an idea of how to split on an attribute of a tree.

Example: Entropy [figure]

INFORMATION FOR PARTITION
• If T is partitioned, based on the value of a non-class attribute X, into sets T1, T2, …, Tn, then the information needed to identify the class of an element of T becomes the weighted average of the information needed to identify the class of an element of Ti, i.e. the weighted average Info(X, T) = Σ (|Ti| / |T|) · Info(Ti), summed over i = 1, …, n.

GAIN
• The information gain represents the difference between the information needed to identify an element of T and the information needed to identify an element of T after the value of attribute X is obtained.
• Gain(X, T) = Info(T) - Info(X, T)

GAIN RATIO and GINI INDEX [formula slide]

CART (Classification And Regression Tree)
• CART (Classification And Regression Tree) is one of the popular methods of building decision trees in the machine learning community.
• CART builds a binary decision tree by splitting the records at each node, according to a function of a single attribute.
• CART uses the gini index for determining the best split.
• The initial split produces two nodes, each of which we now attempt to split in the same manner as the root node.
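The slides do not spell out the gini formula, so the sketch below uses the standard definition, gini(T) = 1 - Σ pj², and scores a binary split by the size-weighted average of the two sides. The function names `gini` and `gini_split` are hypothetical, chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Gini index of a node: 1 minus the sum of squared class proportions.
 * counts[i] is the number of records of class i at the node. */
double gini(const unsigned counts[], size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += counts[i];
    if (total == 0)
        return 0.0;

    double sum_sq = 0.0;
    for (size_t i = 0; i < n; i++) {
        double p = (double)counts[i] / total;
        sum_sq += p * p;
    }
    return 1.0 - sum_sq;
}

/* Gini index of a binary split: the two child nodes weighted by size.
 * A lower value indicates a better split. */
double gini_split(const unsigned left[], const unsigned right[], size_t n)
{
    unsigned n_left = 0, n_right = 0;
    for (size_t i = 0; i < n; i++) {
        n_left += left[i];
        n_right += right[i];
    }
    unsigned total = n_left + n_right;
    assert(total > 0);

    return (double)n_left / total * gini(left, n)
         + (double)n_right / total * gini(right, n);
}
```

A candidate split that sends each class entirely to one side scores 0, while an evenly mixed two-class node scores 0.5, so a CART-style builder would prefer the candidate with the lowest `gini_split` value.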

CART
• At the end of the tree-growing process, every record of the training set has been assigned to some leaf of the full decision tree. Each leaf can now be assigned a class and an error rate.
• The error rate of a leaf node is the percentage of incorrect classifications at that node.
• The error rate of an entire decision tree is a weighted sum of the error rates of all the leaves. Each leaf’s contribution to the total is the error rate at that leaf multiplied by the probability that a record will end up there.
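The weighted sum described above can be written out as a small C sketch. It assumes the probability of a record reaching a leaf is estimated by the fraction of training records at that leaf; `struct leaf` and `tree_error_rate` are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* One leaf of a fully grown tree (illustrative layout). */
struct leaf {
    unsigned records; /* training records that ended up at this leaf */
    unsigned errors;  /* of those, how many the leaf's class gets wrong */
};

/* Error rate of the whole tree: for every leaf, its own error rate
 * multiplied by the estimated probability of a record reaching it. */
double tree_error_rate(const struct leaf leaves[], size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += leaves[i].records;
    assert(total > 0);

    double rate = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (leaves[i].records == 0)
            continue;
        double leaf_rate = (double)leaves[i].errors / leaves[i].records;
        double reach_prob = (double)leaves[i].records / total;
        rate += leaf_rate * reach_prob;
    }
    return rate;
}
```

For three leaves holding 10, 30 and 60 records with 1, 6 and 3 errors respectively (made-up numbers), the tree error rate is 0.1 × 0.1 + 0.2 × 0.3 + 0.05 × 0.6 = 0.10.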

ID3
• Quinlan introduced ID3, Iterative Dichotomizer 3, for constructing decision trees from data.
• In ID3, each node corresponds to a splitting attribute and each arc is a possible value of that attribute. At each node the splitting attribute is selected to be the most informative among the attributes not yet considered in the path from the root.
• Entropy is used to measure how informative a node is. This algorithm uses the criterion of information gain to determine the goodness of a split. The attribute with the greatest information gain is taken as the splitting attribute, and the data set is split for all distinct values of the attribute.

C4.5
• C4.5 is an extension of ID3 that accounts for unavailable values, continuous attribute value ranges, pruning of decision trees and rule derivation.
• In building a decision tree, we can deal with training sets that have records with unknown attribute values by evaluating the gain, or the gain ratio, for an attribute by considering only those records where those attribute values are available.

CHAID
• CHAID, proposed by Kass in 1980, is a derivative of AID (Automatic Interaction Detection), proposed by Hartigan in 1975.
• CHAID attempts to stop growing the tree before overfitting occurs, whereas the above algorithms generate a fully grown tree and then carry out pruning as a postprocessing step. In that sense, CHAID avoids the pruning phase.

Rain Forest
• RainForest, like other algorithms, is a top-down decision tree induction scheme. It is similar to SPRINT, but instead of attribute lists it uses a different structure called the AVC-set.
• The attribute list has one tuple corresponding to each record of the training data set, but the AVC-set has one entry for each distinct value of an attribute. Thus, the size of the AVC-set for any attribute is much smaller.
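To make the size argument concrete, here is a minimal sketch of building an AVC-set for one categorical attribute. It assumes attribute values and class labels are already encoded as small non-negative integers; the function name `build_avc_set` and the flat row-major layout are assumptions of this example, not the RainForest authors' data structure.

```c
#include <assert.h>
#include <stddef.h>

/* Build the AVC-set of one attribute: for each distinct attribute value,
 * the number of records of each class.
 * values[r] in [0, n_values) and labels[r] in [0, n_classes) describe
 * record r; avc must hold n_values * n_classes counters and is filled so
 * that avc[v * n_classes + c] counts records with value v and class c. */
void build_avc_set(const int values[], const int labels[], size_t n_records,
                   unsigned avc[], size_t n_values, size_t n_classes)
{
    for (size_t i = 0; i < n_values * n_classes; i++)
        avc[i] = 0;

    for (size_t r = 0; r < n_records; r++) {
        assert(values[r] >= 0 && (size_t)values[r] < n_values);
        assert(labels[r] >= 0 && (size_t)labels[r] < n_classes);
        avc[(size_t)values[r] * n_classes + (size_t)labels[r]]++;
    }
}
```

The result has n_values × n_classes counters however many records are scanned, whereas an attribute list grows with the number of records.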

CLOUDS (Classification of Large or Out-of-core Data Sets)
• CLOUDS is a kind of approximate version of the SPRINT method. It also uses a breadth-first strategy to build the decision tree.
• CLOUDS uses the gini index for evaluating the split index of the attributes. There are different criteria for splitting the categorical and numerical attributes.
• The method of finding the gini index for the categorical attributes is the same as that used in the SPRINT algorithm. Categorical attributes do not require any sorting; the method refers only to the count matrix.
• We discuss the steps in which CLOUDS differs from the SPRINT method.

BOAT (Bootstrap Optimistic Algorithm for Tree Construction)
• BOAT (Bootstrap Optimistic Algorithm for Tree Construction) is another approximate algorithm based on sampling.
• As the name suggests, it is based on a well-known statistical principle called bootstrapping.
• We take a very large sample D from the training data set T, so that D fits into the main memory.
• Now, using the bootstrapping technique, we take many small samples with replacement of D as D1, D2, …, Dk and construct, by any of the known methods, decision trees DT1, DT2, …, DTk, respectively, for each of these samples.

BOAT
• From these trees, called bootstrap trees, we construct a single decision tree for the sample data set D. We call this tree the sample tree.
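The sampling step that produces each Di can be sketched as drawing record indices with replacement. This is only an illustration of bootstrapping, not of BOAT's tree-combination step; `bootstrap_sample` is a hypothetical helper, and the standard `rand()` generator is used for brevity even though `rand() % n` is slightly biased.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Draw k record indices uniformly, with replacement, from 0..n-1.
 * The same index may appear more than once in out[], which is what
 * distinguishes a bootstrap sample from a plain subset of D. */
void bootstrap_sample(size_t n, size_t out[], size_t k)
{
    assert(n > 0);
    for (size_t i = 0; i < k; i++)
        out[i] = (size_t)rand() % n;
}
```

Calling `srand` with a fixed seed beforehand makes the samples reproducible; each call then yields the index set of one small sample Di, from which a bootstrap tree DTi could be built by any tree-construction method.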

Pruning Technique
• The decision tree built using the training set deals correctly with most of the records in the training set. This is inherent to the way it was built.
• Overfitting is one of the unavoidable situations that may arise due to such construction. Moreover, if the tree becomes very deep, lopsided or bushy, the rules yielded by the tree become unmanageable and difficult to comprehend.
• Therefore, a pruning phase is invoked after the construction to arrive at a (sort of) optimal decision tree.

COST OF ENCODING TREE
• The cost of encoding the structure of the tree: the structure of the tree can be encoded by using a single bit per node, in order to specify whether a node of the tree is an internal node or a leaf node. A non-leaf node can be represented by 1 and a leaf node by 0.
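As an illustration of the one-bit-per-node encoding, the sketch below writes the structure of a binary tree in preorder, using '1' for a non-leaf node and '0' for a leaf. The binary-tree assumption, the preorder traversal, and the names `struct node` and `encode_structure` are choices made for this example; the slide only fixes the meaning of the bit.

```c
#include <assert.h>
#include <stddef.h>

/* Structure-only view of a binary decision tree node (illustrative). */
struct node {
    struct node *left;
    struct node *right;
};

/* Write the preorder encoding of t into out starting at pos and return
 * the position after the last bit written. */
static size_t encode_at(const struct node *t, char out[], size_t pos,
                        size_t cap)
{
    assert(t != NULL && pos + 1 < cap); /* keep room for the final '\0' */
    if (t->left == NULL && t->right == NULL) {
        out[pos] = '0'; /* leaf node */
        return pos + 1;
    }
    assert(t->left != NULL && t->right != NULL);
    out[pos] = '1'; /* internal node */
    pos = encode_at(t->left, out, pos + 1, cap);
    return encode_at(t->right, out, pos, cap);
}

/* Encode the whole tree as a string of '0'/'1' characters and return the
 * structure cost in bits, which equals the number of nodes. */
size_t encode_structure(const struct node *root, char out[], size_t cap)
{
    size_t bits = encode_at(root, out, 0, cap);
    out[bits] = '\0';
    return bits;
}
```

A root whose left child is an internal node with two leaves and whose right child is a leaf is encoded as 11000, for a structure cost of 5 bits.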

Now that we have formulated the cost of a tree, we next turn our attention to computing the minimum cost subtree of the tree constructed in the building phase. The main idea is that if the minimum cost at a node is the cost of the node itself, then splitting the node would result in a higher cost, and hence the subtree at this node is removed.
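The pruning rule can be sketched as a bottom-up recursion. The example assumes a binary tree, one structure bit per node, and per-node costs (`leaf_cost` for encoding the records if the node is kept as a leaf, `split_cost` for encoding its split) that have already been computed elsewhere; these field names and the function `prune_min_cost` are hypothetical, and the cost formulas themselves are not reproduced in the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Node annotated with precomputed encoding costs (illustrative fields). */
struct node {
    double leaf_cost;  /* cost of the node's records if it is a leaf */
    double split_cost; /* cost of describing the split at this node */
    struct node *left;
    struct node *right;
};

/* Return the minimum cost of the subtree rooted at t, pruning as it goes.
 * If keeping t as a leaf is no more expensive than keeping its split, the
 * children are detached. Nodes are assumed to be owned elsewhere, so
 * nothing is freed here. */
double prune_min_cost(struct node *t)
{
    assert(t != NULL);
    double as_leaf = t->leaf_cost + 1.0; /* +1 for the structure bit */
    if (t->left == NULL && t->right == NULL)
        return as_leaf;

    assert(t->left != NULL && t->right != NULL);
    double as_subtree = t->split_cost + 1.0
                      + prune_min_cost(t->left)
                      + prune_min_cost(t->right);
    if (as_leaf <= as_subtree) {
        t->left = NULL;
        t->right = NULL;
        return as_leaf;
    }
    return as_subtree;
}
```

Because the children are processed first, a node is compared against the already-minimized cost of its subtrees, so the result is the minimum cost subtree under the stated cost model.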

Integration of Pruning and Construction
• A natural question that arises is whether we can incorporate the pruning step within the construction process of the decision tree, instead of pruning being a post-processing step after the tree is fully built.
• In other words, is it possible to identify a node that need not be expanded right at the time of construction, as it is likely to be pruned eventually?

SUMMARY: AN IDEAL ALGORITHM
• After studying different strategies for building a decision tree for a large database, let us try to build an ideal decision tree algorithm based on these principles. We are given a training data set.
• First of all, we must discover the concept hierarchy of each of the attributes. This technique is essentially similar to dimension modeling.
• This study helps in removing certain attributes from the database.

AN IDEAL ALGORITHM
• As we move up this hierarchy, some of the numerical attributes may turn into categorical ones.
• Then we identify the numerical and categorical attributes. Just having the numerical data type does not necessarily imply that the attribute is numerical (for example, pincode).
• We take a large sample of the training set so that the sample fits into the memory. Sampling involves select operations (tuple sampling) and project operations (attribute removal).
• Construct the attribute list for each attribute.
• Construct a generalized attribute list for each attribute, based on the dimension hierarchy.

#include <stdio.h>

/* Weather decision tree written as a flat chain of conditions. */
int main(void)
{
    int h, t, out, windy;

    printf("1 for Sunny\n2 for Overcast\n3 for Rain\n");
    scanf("%d", &out);
    printf("ENTER THE TEMP, HUMIDITY\n");
    scanf("%d%d", &t, &h);
    printf("ENTER 1 for WINDY 0 for not windy\n");
    scanf("%d", &windy);

    if (out == 1 && h <= 75)
        printf("PLAY");
    else if (out == 1)              /* sunny and humidity above 75 */
        printf("NO PLAY");
    else if (out == 2)
        printf("PLAY");
    else if (out == 3 && windy == 0)
        printf("PLAY");
    else if (out == 3 && windy == 1)
        printf("NO PLAY");
    return 0;
}

#include <stdio.h>

/* Same tree, asking only for the attribute needed on each branch. */
int main(void)
{
    int humi, windy;
    int out;

    printf("Enter 1 for Sunny\n2 for Overcast\n3 for Rain\n");
    scanf("%d", &out);
    if (out == 1) {
        printf("ENTER THE HUMIDITY\n");
        scanf("%d", &humi);
        if (humi > 75)
            printf("no play");
        else
            printf("play");
    }
    if (out == 2) {
        printf("play");
    }
    if (out == 3) {
        printf("enter wind condition 1 for windy 0 for no wind\n");
        scanf("%d", &windy);    /* the address-of operator is required */
        if (windy == 0)
            printf("play");
        else
            printf("no play");
    }
    return 0;
}

BIBLIOGRAPHY
• Data Mining & Ware House by Arun K Pujari
• SPRINT: A Scalable Parallel Classifier for Data Mining, John Shafer, Rakesh Agrawal, Manish Mehta, IBM Almaden Research Center
• Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation, in Introduction to Data Mining by Tan, Steinbach, Kumar
• Advanced Data Mining Techniques by David L Olson, Dursun Delen
• Data Mining Concept and Techniques by Jiawei Han, Micheline Kamber
• Web resources: Google, Wikipedia

THANK YOU