
Why Use a Decision Tree?

The Decision Tree is one of the main Data Mining techniques and a useful Machine Learning algorithm, since it can be applied to a wide variety of problems. Here are a few reasons to use a Decision Tree:
• It is considered one of the most understandable Machine Learning algorithms, and its results are easy to interpret.
• It can be used for both classification and regression problems.
• It works effectively with non-linear data.
• Constructing a Decision Tree is a relatively quick process, since each node uses only one feature to split the data.

What Is a Decision Tree?

A Decision Tree is a Supervised Machine Learning algorithm that looks like an inverted tree, wherein each internal node represents a predictor variable (feature), each link between nodes represents a decision, and each leaf node represents an outcome (response variable).

Let's say that you hosted a huge party and you want to know how many of your guests were non-vegetarians. To solve this problem, let's create a simple Decision Tree.

A Decision Tree has the following structure:
• Root Node: The root node is the starting point of the tree. The first split is performed here.
• Internal Nodes: Each internal node represents a decision point (predictor variable) that eventually leads to the prediction of the outcome.
• Leaf/Terminal Nodes: Leaf nodes represent the final class of the outcome, which is why they are also called terminal nodes.
• Branches: Branches are the connections between nodes and are drawn as arrows. Each branch represents a response such as yes or no.
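
The structure described above can be sketched as a small data type. This is only an illustration, not a definitive implementation: the `Node` class, the `predict` helper, and the feature names `eats_meat` and `eats_fish` are all hypothetical and were chosen to match the party example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """One node of a decision tree (hypothetical structure for illustration)."""
    feature: Optional[str] = None   # predictor tested at this node; None for a leaf
    outcome: Optional[str] = None   # predicted class; set only on leaf nodes
    branches: Dict[str, "Node"] = field(default_factory=dict)  # response -> child node

    def is_leaf(self) -> bool:
        return not self.branches


def predict(node: Node, record: Dict[str, str]) -> Optional[str]:
    """Walk from the root to a leaf, following the branch that matches each response."""
    while not node.is_leaf():
        node = node.branches[record[node.feature]]
    return node.outcome


# Root node: the first split. Feature names are made up for this sketch.
root = Node(
    feature="eats_meat",
    branches={
        "yes": Node(outcome="non-vegetarian"),           # leaf node
        "no": Node(                                      # internal node
            feature="eats_fish",
            branches={
                "yes": Node(outcome="non-vegetarian"),   # leaf node
                "no": Node(outcome="vegetarian"),        # leaf node
            },
        ),
    },
)

guest = {"eats_meat": "no", "eats_fish": "no"}
print(predict(root, guest))  # prints "vegetarian"
```

Each dictionary key in `branches` plays the role of a branch (a yes/no response), and a node with no branches is a leaf.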
Training Set

Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125K      No
2     No        Medium    100K      No
3     No        Small     70K       No
4     Yes       Medium    120K      No
5     No        Large     95K       Yes
6     No        Medium    60K       No
7     Yes       Large     220K      No
8     No        Small     85K       Yes
9     No        Medium    75K       No
10    No        Small     90K       Yes

Test Set

Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
12    Yes       Medium    80K       ?
13    Yes       Large     110K      ?
14    No        Small     95K       ?
15    No        Large     67K       ?

Induction: Training Set -> Learning Algorithm -> Learn Model -> Model
Deduction: Test Set -> Apply Model (using the learned Model)
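
To make the induction and deduction steps concrete, here is a minimal sketch in Python. It does not use the learning algorithm from these slides: as a stand-in it learns a one-level tree (a "decision stump") that picks the single categorical attribute whose majority-vote rule makes the fewest errors on the training set. The function names `learn_model` and `apply_model` are hypothetical, and the numeric column Attrib3 is left out to keep the example short.

```python
from collections import Counter, defaultdict

# Training set from the table above: (Attrib1, Attrib2, Class).
# Attrib3 (the numeric column) is omitted in this sketch.
training_set = [
    ("Yes", "Large", "No"), ("No", "Medium", "No"), ("No", "Small", "No"),
    ("Yes", "Medium", "No"), ("No", "Large", "Yes"), ("No", "Medium", "No"),
    ("Yes", "Large", "No"), ("No", "Small", "Yes"), ("No", "Medium", "No"),
    ("No", "Small", "Yes"),
]
# Test set (Tid 11-15): (Attrib1, Attrib2); the class is unknown.
test_set = [("No", "Small"), ("Yes", "Medium"), ("Yes", "Large"),
            ("No", "Small"), ("No", "Large")]


def learn_model(records):
    """Induction: choose the attribute whose majority-vote rule has the fewest training errors."""
    best = None
    for attr in range(len(records[0]) - 1):
        # Count the classes seen for each value of this attribute.
        counts = defaultdict(Counter)
        for rec in records:
            counts[rec[attr]][rec[-1]] += 1
        # Rule: each attribute value predicts its most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[rec[attr]] != rec[-1] for rec in records)
        if best is None or errors < best[0]:
            best = (errors, attr, rule)
    return best[1], best[2]


def apply_model(model, record):
    """Deduction: look up the predicted class for a previously unseen record."""
    attr, rule = model
    return rule[record[attr]]


model = learn_model(training_set)
predictions = [apply_model(model, rec) for rec in test_set]
print(model)
print(predictions)
```

The predictions are only what this simplified stump produces; they are not class labels given in the slides, where the test set classes remain unknown.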
• Given a collection of records (the training set):
  – Each record contains a set of attributes (x), plus one additional attribute, the class (y).
• Find a model that predicts the class as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
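
The split into training and test sets, and the accuracy measured on the test set, can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `train_test_split` and `accuracy`, the 30% test fraction, and the "always predict No" baseline model are all hypothetical choices, not part of the slides.

```python
import random


def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (training set, test set)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(model, test_set):
    """Fraction of test records (x, y) for which the predicted class model(x) equals y."""
    correct = sum(1 for x, y in test_set if model(x) == y)
    return correct / len(test_set)


# Labeled records as (x, y): x = (Attrib1, Attrib2, Attrib3 in thousands), y = class.
records = [
    (("Yes", "Large", 125), "No"), (("No", "Medium", 100), "No"),
    (("No", "Small", 70), "No"), (("Yes", "Medium", 120), "No"),
    (("No", "Large", 95), "Yes"), (("No", "Medium", 60), "No"),
    (("Yes", "Large", 220), "No"), (("No", "Small", 85), "Yes"),
    (("No", "Medium", 75), "No"), (("No", "Small", 90), "Yes"),
]

train, test = train_test_split(records)


def always_no(x):
    """Hypothetical baseline model: ignores the attributes and predicts "No"."""
    return "No"


print(len(train), len(test))      # prints "7 3"
print(accuracy(always_no, test))  # value depends on which records land in the test set
```

Fixing the seed makes the split reproducible, which helps when comparing models on the same test set.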
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Predicting tumor cells as benign or malignant
• There are many techniques/algorithms for carrying out classification.
• In this chapter we will study only decision trees.
• In Chapter 5 we will study other techniques, including some very modern and effective ones.
Training Data

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Model: Decision Tree
Splitting attributes: Refund, MarSt (marital status), TaxInc (taxable income)

Refund?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
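
The tree above can be written directly as nested conditions and checked against the training data. This is a sketch, not the author's code: the function name `classify` is hypothetical, incomes are expressed in thousands, and because the tree only labels the branches "< 80K" and "> 80K", the handling of exactly 80K is an assumption (here it goes to the "No" side).

```python
def classify(refund, marital_status, taxable_income_k):
    """Predict Cheat ("Yes" or "No") with the decision tree above; income is in thousands."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income.
    # Assumption: exactly 80K is not covered by the tree; this sketch treats it as "No".
    return "Yes" if taxable_income_k > 80 else "No"


# Training data: (Refund, Marital Status, Taxable Income in thousands, Cheat).
training_data = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

n_correct = sum(
    classify(refund, marital, income) == cheat
    for refund, marital, income, cheat in training_data
)
print(f"{n_correct} of {len(training_data)} training records classified correctly")
# prints "10 of 10 training records classified correctly"
```

The tree reproduces the class of every training record, which is expected because it was built from this data; accuracy on unseen records is what the test set measures.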


Test Data

Refund   Marital Status   Taxable Income   Cheat
No       Married          80K              ?

Start from the root of the tree.

Refund?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
Following the branches for the test record (Refund = No, Marital Status = Married, Taxable Income = 80K):

Refund = No -> MarSt = Married -> NO

Assign Cheat to "No".
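
The walk from the root to a leaf can also be traced in code. The following is a minimal sketch: the dictionary layout, the `trace` helper, and the branch-selection functions are hypothetical, and sending exactly 80K down the "< 80K" branch is an assumption, since the tree does not say which side that value belongs to (it does not affect this record, which never reaches the TaxInc node).

```python
def income_branch(record):
    # Assumption: exactly 80K is not covered by the tree; group it with "< 80K".
    return "> 80K" if record["TaxInc"] > 80 else "< 80K"


def marital_branch(record):
    return "Married" if record["MarSt"] == "Married" else "Single, Divorced"


# Each internal node has a name, a function choosing the branch, and its children.
# Leaves are plain strings ("NO" or "YES").
tax_inc = {"name": "TaxInc", "test": income_branch,
           "branches": {"< 80K": "NO", "> 80K": "YES"}}
mar_st = {"name": "MarSt", "test": marital_branch,
          "branches": {"Single, Divorced": tax_inc, "Married": "NO"}}
root = {"name": "Refund", "test": lambda record: record["Refund"],
        "branches": {"Yes": "NO", "No": mar_st}}


def trace(node, record):
    """Start at the root and follow matching branches until a leaf is reached."""
    path = []
    while isinstance(node, dict):
        label = node["test"](record)
        path.append(f"{node['name']} = {label}")
        node = node["branches"][label]
    return node, path


test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}  # income in thousands
leaf, path = trace(root, test_record)
print(" -> ".join(path), "->", leaf)
# prints "Refund = No -> MarSt = Married -> NO"
```

The leaf "NO" corresponds to assigning Cheat = "No" to the test record.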
Continued
Book:
Data Mining: Concepts, Models and Techniques
By Florin Gorunescu

Chapter 4
