
Tietämyksen muodostaminen (Knowledge discovery)

Classification
Decision tree induction

Kati Iltanen
Computer Sciences, School of Information Sciences, University of Tampere

Classification

Aim: to predict the value of a qualitative attribute

class labels (the values of the target attribute, class) are predicted

Every case belongs to one of the mutually exclusive classes. This class is known.

supervised learning

The classification method classifies the training data based on the attribute values and class labels.

constructs a model

Classification

The model is used for classifying new data.


The model is evaluated (e.g. by its accuracy and by a subjective assessment). If the model is acceptable, it is used to classify cases whose class labels are not known.

new data (unknown data, previously unseen data)

Classification methods

Decision trees, rules, k nearest neighbour method, naive Bayesian classifier, neural networks, ...

Application examples: To give a diagnosis suggestion on the basis of the symptoms and test results of a patient

To predict the paying capacity of a loan applicant

Constructing and testing a classifier


Training data
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learning algorithm

Classifier (Model)
Known class labels

Test data
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

IF rank = professor OR years > 6 THEN tenured = yes ELSE tenured = no


Predicted classes for the test cases: no, yes, yes, yes
The model misclassifies the second test case (it gives yes, the known class is no).
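The testing step above can be sketched in a few lines of Python. This is only an illustration: the names `TEST_DATA` and `classify` are assumptions made for the example, and the rule is the one shown on the slide.

```python
# Test data from the slide: (name, rank, years, known TENURED label).
TEST_DATA = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]


def classify(rank, years):
    """IF rank = professor OR years > 6 THEN tenured = yes ELSE tenured = no."""
    return "yes" if rank == "Professor" or years > 6 else "no"


# Apply the model to every test case and compare with the known labels.
predictions = [classify(rank, years) for _, rank, years, _ in TEST_DATA]
known = [label for _, _, _, label in TEST_DATA]
accuracy = sum(p == k for p, k in zip(predictions, known)) / len(TEST_DATA)

print(predictions)  # ['no', 'yes', 'yes', 'yes']
print(accuracy)     # 0.75
```

Three of the four test cases are classified correctly, so the accuracy on this small test set is 75%.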

Using the classifier


Classifier (Model)
IF rank = professor OR years > 6 THEN tenured = yes ELSE tenured = no

New data: (Jeff, Professor, 4) Tenured? Yes

Decision tree induction

TDIDT (Top Down Induction of Decision Trees)

Inductive learning: general knowledge from separate cases


Cases are described using fixed-length attribute vectors. Each case belongs to one class. Classes are mutually exclusive. The class of a case is known: supervised learning

Knowledge is represented in the form of a decision tree.

A decision tree is a classification model.

The tree is constructed in a top-down manner (from the root to the leaves)

Decision tree induction

Training data: Saturday mornings


Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

outlook = sunny:
:...humidity = high: N
:   humidity = normal: P
outlook = overcast: P
outlook = rain:
:...windy = true: N
    windy = false: P

Decision tree

Classes: P = play tennis, N = don't play tennis (Quinlan 86)


Decision tree

Inner nodes contain tests based on attributes (test nodes).
Branches correspond to the outcomes of the tests (attribute values).
Leaf nodes (leaves) contain the class information (one class or a class distribution).


Decision tree

Classification of a new case starts from the root of the tree.


The attribute assigned to the root node is examined and the branch corresponding to the attribute value is followed. This process continues until a leaf node is encountered.

The leaf predicts the class of the new case.
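The traversal described above can be sketched as follows. The nested-tuple representation of the tree and the names `TENNIS_TREE` and `classify` are assumptions made for this illustration; the tree itself is the Tennis tree shown on the slides.

```python
# An inner node is (attribute, {outcome: subtree}); a leaf is a class label.
TENNIS_TREE = ("outlook", {
    "sunny": ("humidity", {"high": "N", "normal": "P"}),
    "overcast": "P",
    "rain": ("windy", {"true": "N", "false": "P"}),
})


def classify(tree, case):
    """Follow the branches from the root until a leaf is reached.

    Returns the predicted class and the classification path, i.e. the
    list of (attribute, value) tests that were actually examined.
    """
    path = []
    node = tree
    while not isinstance(node, str):      # stop when a leaf is encountered
        attribute, branches = node
        value = case[attribute]
        path.append((attribute, value))
        node = branches[value]
    return node, path


case = {"outlook": "sunny", "temperature": "cool",
        "humidity": "normal", "windy": "false"}
label, path = classify(TENNIS_TREE, case)
print(label)  # P
print(path)   # [('outlook', 'sunny'), ('humidity', 'normal')]
```

Only two of the four attributes are tested for this case, which is why the returned path is shorter than the attribute vector.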

Decision tree

The classification path from the root to a leaf gives an explanation for the decision. The number of tested attributes depends on the classification path.

It is not necessary to test all the attributes in all the paths.

A classification path: a conjunction of constraints set on attributes
A decision tree: a disjunction of the classification paths

Building a decision tree

Building a decision tree is a two step process

Tree construction

A complete (fully-grown) tree is built based on the training data. (prepruning: the growth of tree is restricted)

Tree pruning

postpruning: branches are pruned from a complete tree (or from a prepruned tree)


TDIDT: Basic algorithm

A decision tree is constructed in a top-down recursive divide-and-conquer manner.
In the beginning, all the training examples are at the root.
If the stopping criterion is fulfilled, a leaf node is formed.
If the stopping criterion is not fulfilled, the best attribute is selected according to some criterion (a greedy algorithm) and
  a test node is formed
  cases are divided into subsets according to the values of the chosen attribute
  a decision tree is formed recursively for each subset

TDIDT: Basic algorithm

Generate a decision tree:
  Create a node N
  if (stopping criterion is fulfilled)
      Make a leaf node (node N)
  else
      Choose the best attribute and make a test node (node N) that tests the chosen attribute
      Divide cases into subsets according to the values of the chosen attribute
      Generate a decision tree for each subset

TDIDT: Key questions


How to select the best attribute?
How to specify the attribute test condition?
  How to form inner nodes and branches?
When to stop the recursive splitting?
How to form decision nodes (leaves)?
How to prune a tree?

Attribute selection criterion


How to select the best attribute?

Adequacy of attributes

Attributes are adequate for the classification task, if all the cases having the same attribute values belong to the same class.

If the attributes are adequate it is always possible to construct a decision tree which correctly classifies all the training data.

Usually there exist several correctly classifying decision trees. In the worst case, there is a leaf in the tree for each of the training cases.


Simple decision tree

A simple decision tree for the Tennis playing classification task


Complex decision tree


A complex decision tree for the same classification task


Attribute selection criterion

The aim is to generate simple (small) decision trees.


This derives from the principle called Occam's razor: if there are two models having the same accuracy on the training data, the smaller (simpler) one can be seen as more general and thus better.
Smaller trees: more general, easier to understand and possibly more accurate in classifying unseen cases

Try to generate simple trees by generating simple nodes. The complexity of a node is
  at its largest when the node has an equal number of cases from every class
  at its smallest when the node has cases from one class only

Heuristic attribute selection measures (measures of goodness of split) are used. These aim to generate homogeneous (pure) child nodes (subsets).

TDIDT algorithm family

CLS (Concept Learning System)


E.B. Hunt (1950s and 1960s)
To simulate human problem solving methods
Analysing the content of English texts, medical diagnostics

ID3 (Iterative Dichotomizer 3)


J.R. Quinlan (end of the 1970s)
Chess endgames
Applications from medical diagnostics to scouting

Other early decision tree algorithms


CART (Classification and Regression Trees) (1984)
Assistant (1984)

C4.5, C5, See5


Descendants of ID3
Address issues arising in real world classification tasks
C4.5 is one of the most widely used machine learning algorithms and is frequently used as a reference algorithm in machine learning research

ID3

Assumes that

attributes are categorical and have a small number of possible values
the class (the target attribute) has two possible values
  applicable to classification tasks with two classes
attributes are adequate
data contain no missing values

ID3 selects the best attribute according to a criterion called information gain

Criterion selects an attribute that maximises information gain (or minimises entropy)


ID3: Attribute selection criterion

Let

S be a training set that contains s cases (s is the number of cases)
the class attribute C have values C1, ..., Cm (m is the number of classes)
  In ID3 m = 2
si be the number of cases belonging to the class Ci in the training set S, and
p(Ci) = si / s the relative frequency of the class Ci in S

ID3: Attribute selection criterion

The expected information needed to classify an arbitrary case in S (or entropy of C in S) is

$$H(C) = -\sum_{i=1}^{m} p(C_i) \log_2 p(C_i)$$

The logarithm is base 2 because the information is coded in bits.
We define in this context that if p(Ci) = 0 then p(Ci) log2 p(Ci) returns 0 (zero).

Reminder on logarithms:

$$\log_k a = x \iff k^x = a, \qquad k, l \in \mathbb{R}^+ \setminus \{1\},\ a \in \mathbb{R}^+$$

$$\log_l a = \frac{\log_k a}{\log_k l}$$

ID3: Attribute selection criterion


Four example nodes with six cases each (number of cases per class):

C1 = 0, C2 = 6:  $p(C_1) = 0/6 = 0$, $p(C_2) = 6/6 = 1$, $H(C) = -0 \log_2 0 - 1 \log_2 1 = -0 - 0 = 0$
C1 = 1, C2 = 5:  $p(C_1) = 1/6$, $p(C_2) = 5/6$, $H(C) = -(1/6) \log_2 (1/6) - (5/6) \log_2 (5/6) = 0.65$
C1 = 2, C2 = 4:  $p(C_1) = 2/6$, $p(C_2) = 4/6$, $H(C) = -(2/6) \log_2 (2/6) - (4/6) \log_2 (4/6) = 0.92$
C1 = 3, C2 = 3:  $p(C_1) = 3/6$, $p(C_2) = 3/6$, $H(C) = -(3/6) \log_2 (3/6) - (3/6) \log_2 (3/6) = 1$

Minimum (= 0) when all cases belong to the same class
Maximum (= log2 m) when cases are equally distributed among the classes (m = number of classes)
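The entropy values of the four example nodes can be checked with a short Python sketch. The function name `entropy` and its interface (a list of class counts) are assumptions made for this illustration; the convention 0 log2 0 = 0 is handled by skipping empty classes.

```python
from math import log2


def entropy(counts):
    """H(C) in bits for a node with the given number of cases per class.

    Classes with zero cases are skipped, which implements the convention
    that p log2 p is taken to be 0 when p = 0.
    """
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


# The four six-case nodes from the slide.
for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(entropy(counts), 2))
```

The loop prints entropies of 0, 0.65, 0.92 and 1 for the four nodes, matching the minimum for a pure node and the maximum log2 2 = 1 for an evenly split two-class node.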

ID3: Attribute selection criterion


Let an attribute A have the values Aj, j = 1, ..., v.
Let the set S be divided into subsets {S1, S2, ..., Sv} according to the values of the attribute A.
The expected information needed to classify an arbitrary case in the branch corresponding to the value Aj is

$$H(C \mid A_j) = -\sum_{i=1}^{m} p(C_i \mid A_j) \log_2 p(C_i \mid A_j)$$

Consider only those cases having the value Aj for the attribute A and calculate p(Ci) in the set of these cases.

ID3: Attribute selection criterion

The expected information needed to classify an arbitrary case when using the attribute A as root is

$$H(C \mid A) = \sum_{j=1}^{v} p(A_j)\, H(C \mid A_j)$$

p(Aj) is the relative frequency of the cases having value Aj for the attribute A in the set S

Information gained by branching on the attribute A is

$$I(C \mid A) = H(C) - H(C \mid A)$$

ID3 chooses the attribute resulting in the greatest information gain as the attribute for the root of the decision tree.
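A minimal Python sketch of the criterion, assuming that each candidate attribute is summarised by the class counts in each of its branches. The function names and the two toy count tables are hypothetical and only illustrate the two extremes of the measure.

```python
from math import log2


def entropy(counts):
    """H in bits for a list of class counts (0 log2 0 is taken as 0)."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(branch_counts):
    """I(C|A) = H(C) - H(C|A).

    branch_counts holds, for each value A_j of the attribute A, the list
    of class counts among the cases having that value.
    """
    class_totals = [sum(column) for column in zip(*branch_counts)]
    s = sum(class_totals)
    h_c = entropy(class_totals)
    # Weight the entropy of each branch by its relative frequency p(A_j).
    h_c_given_a = sum(sum(b) / s * entropy(b) for b in branch_counts)
    return h_c - h_c_given_a


# Hypothetical attributes on six cases (three per class):
perfect = [[3, 0], [0, 3]]   # every branch is pure
useless = [[1, 1], [2, 2]]   # every branch mirrors the parent distribution

print(information_gain(perfect))
print(information_gain(useless))
```

An attribute that produces only pure branches gains the full entropy of the parent node (here 1 bit), while an attribute whose branches repeat the parent's class distribution gains nothing; ID3 would pick the former.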

ID3: Tests

Tests in the inner nodes take the form of

A = Aj

An attribute A has the value Aj

Outcomes of a test are mutually exclusive. There is a separate branch in the tree for each possible outcome.

ID3: Stopping criterion


ID3 assumes that attributes are adequate. It splits the data in recursive fashion, until all the cases of a node belong to the same class. The class of a leaf node is defined on the basis of the class of the cases in the node.

If the leaf is empty (there are no cases with some particular value of an attribute), the class is unknown (the leaf is labelled as null).

Example: ID3 (1)


Playing tennis (Quinlan 86)

Case  Outlook   Temperature  Humidity  Windy  Class
1     sunny     hot          high      false  N
2     sunny     hot          high      true   N
3     overcast  hot          high      false  P
4     rain      mild         high      false  P
5     rain      cool         normal    false  P
6     rain      cool         normal    true   N
7     overcast  cool         normal    true   P
8     sunny     mild         high      false  N
9     sunny     cool         normal    false  P
10    rain      mild         normal    false  P
11    sunny     mild         normal    true   P
12    overcast  mild         high      true   P
13    overcast  hot          normal    false  P
14    rain      mild         high      true   N

Cases: Saturday mornings

Classes: P = positive (play), N = negative (don't play)

Example: ID3 (2)

Class P (positive class): play tennis

9 cases

Class N (negative class): don't play tennis

5 cases

The expected information needed to classify an arbitrary case in S is

$$H(C) = -\frac{9}{14} \log_2 \frac{9}{14} - \frac{5}{14} \log_2 \frac{5}{14} = 0.940$$

Example: ID3 (3)

The expected information required for each of the subtrees after using the attribute Outlook to split the set S into 3 subsets

outlook   P  N  H(C|Aj)
sunny     2  3  0.971
overcast  4  0  0
rain      3  2  0.971

sunny:    $H(C \mid A_1) = -\frac{2}{5} \log_2 \frac{2}{5} - \frac{3}{5} \log_2 \frac{3}{5} = 0.971$
overcast: $H(C \mid A_2) = -\frac{4}{4} \log_2 \frac{4}{4} - \frac{0}{4} \log_2 \frac{0}{4} = 0$
rain:     $H(C \mid A_3) = -\frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5} = 0.971$

Example: ID3 (4)

The expected information needed to classify an arbitrary case for the tree with the attribute Outlook as root is

$$H(C \mid A) = \frac{5}{14} \cdot 0.971 + \frac{4}{14} \cdot 0 + \frac{5}{14} \cdot 0.971 = 0.694$$

The information gained by branching on the attribute Outlook (A) is

$$I(C \mid A) = H(C) - H(C \mid A) = 0.940 - 0.694 = 0.246$$

Example: ID3 (5)

The information gain for other candidate attributes is calculated similarly


I(C|temperature) = 0.029
I(C|humidity) = 0.151
I(C|windy) = 0.048

The attribute resulting in the greatest information gain is chosen as the attribute for the root of the decision tree.

I(C|outlook) = 0.246


Example: ID3 (6)

The attribute Outlook has been chosen and the cases have been divided into subsets according to their values of the Outlook attribute.

outlook = sunny
  Cases
  (1, sunny, hot, ..., N)
  (2, sunny, hot, ..., N)
  (8, sunny, mild, ..., N)
  (9, sunny, cool, ..., P)
  (11, sunny, mild, ..., P)

outlook = overcast
  Cases
  (3, overcast, hot, ..., P)
  (7, overcast, cool, ..., P)
  (12, overcast, mild, ..., P)
  (13, overcast, hot, ..., P)

outlook = rain
  Cases
  (4, rain, mild, ..., P)
  (5, rain, cool, ..., P)
  (6, rain, cool, ..., N)
  (10, rain, mild, ..., P)
  (14, rain, mild, ..., N)

Example: ID3 (7)

The branch corresponding to the outcome sunny is built next.

Cases:
(1, sunny, hot, high, false, N)
(2, sunny, hot, high, true, N)
(8, sunny, mild, high, false, N)
(9, sunny, cool, normal, false, P)
(11, sunny, mild, normal, true, P)

Calculate the expected information

$$H(C) = -\frac{2}{5} \log_2 \frac{2}{5} - \frac{3}{5} \log_2 \frac{3}{5} = 0.971$$

and the information gain for all candidate attributes...



Example: ID3 (8)


Temperature  P  N  H(C|Aj)
hot          0  2  0
mild         1  1  1
cool         1  0  0

$$H(C \mid Temperature) = \frac{2}{5} \cdot 0 + \frac{2}{5} \cdot 1 + \frac{1}{5} \cdot 0 = 0.400$$
$$I(C \mid Temperature) = 0.971 - 0.400 = 0.571$$

Humidity  P  N  H(C|Aj)
high      0  3  0
normal    2  0  0

$$H(C \mid Humidity) = \frac{3}{5} \cdot 0 + \frac{2}{5} \cdot 0 = 0$$
$$I(C \mid Humidity) = 0.971 - 0 = 0.971$$

Windy  P  N  H(C|Aj)
false  1  2  0.918
true   1  1  1

$$H(C \mid Windy) = \frac{3}{5} \cdot 0.918 + \frac{2}{5} \cdot 1 = 0.951$$
$$I(C \mid Windy) = 0.971 - 0.951 = 0.020$$

Example: ID3 (9)

Humidity is chosen.
Cases are sent down to the high and normal branches.
The cases in the high branch are all of the same class: a leaf node is formed.
The same situation holds in the normal branch.

outlook = sunny:
:...humidity = high: N
:   humidity = normal: P
outlook = overcast: (not built yet)
outlook = rain: (not built yet)

Branches for overcast and rain are built in a similar way.

Example: ID3 (10)

Complete decision tree and classification of a new case (Outlook: rain, Temperature: hot, Humidity: high, Windy: true) Play tennis?

outlook = sunny:
:...humidity = high: N
:   humidity = normal: P
outlook = overcast: P
outlook = rain:
:...windy = true: N
    windy = false: P
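The whole example can be reproduced with a compact Python sketch of ID3. This is an illustrative implementation under stated assumptions, not Quinlan's original code: the names (`id3`, `gain`, `partition`, `classify`) and the tuple-based tree representation are hypothetical, branches are created only for attribute values that actually occur in a node (so no empty null leaves arise), and the majority-class fallback when the attributes run out is an addition that ID3's adequacy assumption makes unnecessary for this data.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
# The 14 Saturday mornings of the Tennis example, one case per line.
ROWS = """\
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N"""
CASES = [dict(zip(ATTRIBUTES + ["class"], row.split())) for row in ROWS.splitlines()]


def entropy(cases):
    """H(C) for the class distribution of the given cases."""
    counts = Counter(case["class"] for case in cases)
    return sum(-n / len(cases) * log2(n / len(cases)) for n in counts.values())


def partition(cases, attribute):
    """Divide cases into subsets according to the values of the attribute."""
    subsets = {}
    for case in cases:
        subsets.setdefault(case[attribute], []).append(case)
    return subsets


def gain(cases, attribute):
    """Information gain I(C|A) = H(C) - H(C|A)."""
    subsets = partition(cases, attribute).values()
    return entropy(cases) - sum(len(s) / len(cases) * entropy(s) for s in subsets)


def id3(cases, attributes):
    """Return a leaf (class label) or a test node (attribute, branches)."""
    classes = Counter(case["class"] for case in cases)
    if len(classes) == 1 or not attributes:
        # Pure node: form a leaf. The majority-class fallback for exhausted
        # attributes is an assumption added for robustness.
        return classes.most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(cases, a))
    remaining = [a for a in attributes if a != best]
    branches = {value: id3(subset, remaining)
                for value, subset in partition(cases, best).items()}
    return (best, branches)


def classify(tree, case):
    """Follow the tests from the root until a leaf is reached."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[case[attribute]]
    return tree


tree = id3(CASES, ATTRIBUTES)
new_case = {"outlook": "rain", "temperature": "hot", "humidity": "high", "windy": "true"}
print(tree)
print(classify(tree, new_case))  # path: outlook = rain, windy = true
```

The sketch builds the same tree as the slides (outlook at the root, humidity under sunny, windy under rain). For the new case the path outlook = rain, windy = true ends in the leaf N, so the tree predicts "don't play". Computed without intermediate rounding, the gain for outlook is 0.2467; the 0.246 on the slides comes from subtracting the rounded values 0.940 and 0.694.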


Real world classification tasks

Real world data can be mixed.

Attributes may have different scales (both qualitative and quantitative).

Data may contain


missing values
noise (erroneous values)
exceptional values or value combinations

ID3 does not address issues arising in real world classification tasks.

Modifications to the original algorithm are needed.


C4.5

Descendant of the ID3 algorithm (Quinlan 1993)

Upgrades:
  Gain ratio attribute selection criterion
  Tests for value groups and quantitative attributes
  No requirement of fully adequate attributes
  Probabilistic approach for handling of missing values
  Pruning
    Prepruning and postpruning
  Converting trees to rules

C4.5 Attribute selection criterion

The information gain criterion has a tendency to favour attributes with many outcomes.

However, attributes of this kind may be less relevant in prediction than attributes having a smaller number of outcomes.

An extreme example is an attribute that is used as an identifier. Identifiers have unique values, resulting in pure nodes, but they don't have predictive power.

To overcome this problem, a gain ratio criterion has been developed.

C4.5 Gain ratio selection criterion

A gain ratio is calculated as

$$GR(C \mid A) = \frac{I(C \mid A)}{H(A)},$$

where I(C|A) is the information gain obtained from testing the attribute A and H(A) is the expected information needed to sort out the value of the attribute A, i.e. the uncertainty relating to the value of the attribute A:

$$H(A) = -\sum_{j=1}^{v} p(A_j) \log_2 p(A_j)$$

where p(Aj) is the probability of the value Aj (the relative frequency of the value Aj)

C4.5 Gain ratio selection criterion

The gain ratio criterion selects the attribute having the highest gain ratio among those attributes whose information gain is at least the average information gain over all the attributes examined.

In other words, the information gain of the selected attribute also has to be large.

C4.5 Gain ratio selection criterion

Let's calculate the gain ratio for the Outlook attribute of the Tennis example. The information gain I(C|A) for the attribute Outlook is 0.246. Calculate the expected information for the Outlook attribute:

outlook   freq.
sunny     5
overcast  4
rain      5

$$H(A) = -\frac{5}{14} \log_2 \frac{5}{14} - \frac{4}{14} \log_2 \frac{4}{14} - \frac{5}{14} \log_2 \frac{5}{14} = 1.577$$

The gain ratio for the attribute Outlook is

$$GR(C \mid A) = \frac{I(C \mid A)}{H(A)} = \frac{0.246}{1.577} = 0.156$$
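The arithmetic of this example can be checked with a short Python sketch. The variable names are assumptions made for the illustration, and the information gain 0.246 is taken directly from the slide rather than recomputed.

```python
from math import log2


def entropy(counts):
    """H in bits for a list of counts (0 log2 0 is taken as 0)."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


# Frequencies of the Outlook values: sunny, overcast, rain.
outlook_frequencies = [5, 4, 5]
h_a = entropy(outlook_frequencies)   # H(A), uncertainty about the value of A

info_gain = 0.246                    # I(C|A) as given on the slide
gain_ratio = info_gain / h_a

print(round(h_a, 3))         # 1.577
print(round(gain_ratio, 3))  # 0.156
```

Note that H(A) depends only on how the cases are distributed over the attribute values, not on their classes, which is why an identifier-like attribute with many distinct values gets a large denominator.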



C4.5 Test types

One branch for each possible attribute value
  Outlook: Sunny | Overcast | Rain

Value groups
  Outlook: {Sunny, Overcast} | {Rain}

Thresholds for quantitative attributes
  Humidity: <= 75 | > 75

C4.5 Value groups

Tests based on qualitative attributes can take the form of
  outlook in {sunny, overcast}
  outlook = rain

Why value groups?

To avoid too small subsets of cases

Useful patterns may become undetectable because of the scarcity of data

To assess equitably qualitative attributes that vary in their numbers of possible values

Gain ratio criterion is biased to prefer attributes having a small number of possible values


C4.5 Value groups

Appropriate value groups can be determined on the basis of domain knowledge.

For each appropriate grouping, an additional attribute is formed in the preprocessing phase.
This approach is economical from a computational viewpoint.
Problem: the appropriateness of a grouping may depend on the context (the part of the tree). A "constant" grouping may be too crude.

C4.5 Value groups


In C4.5, values are merged to groups in an iterative manner. A greedy method


At first, each value forms its own group. Then, all possible pairs of groups are formed.

A grouping yielding the highest gain ratio is chosen.

Process continues until just two value groups remain, or until no such merger would result in a better division of the training data.

Aims to find a grouping which results in the highest gain ratio.


Example on the next slide:

Michalski's Soybean data: 35 attributes, 19 classes, 683 training cases
Attribute stem canker with four values: none, below soil, above soil, above 2nd node

C4.5 Value groups

1) Partition into four one-value groups
2) Two one-value groups are merged
3) Based on the results of section 2, "above soil" and "above 2nd node" are merged

No merger in section 3 improves the situation, so the process stops.
Final groups: {none}, {below soil}, {above soil, above 2nd node}

C4.5 Value groups

From the overall viewpoint, the aim is to get simpler and more accurate trees.
The advantage of value grouping depends on the application domain.
The search for value groups can require a substantial increase in computation.

C4.5 Quantitative attributes

Tests based on quantitative attributes employ thresholds.


The value of the attribute A is compared to some threshold Z:
  A <= Z, A > Z

The threshold is defined dynamically. Cases are first sorted on the values of the attribute A being considered:

A1, A2, ..., Aw

The midpoint of adjacent values Ak and Ak+1

$$\frac{A_k + A_{k+1}}{2}$$

is a possible threshold Z that divides the cases of the training set S into two subsets.

C4.5 Quantitative attributes


There are w-1 candidate thresholds. The best threshold is the one that results in the largest gain ratio. The largest value of the attribute A in the training set that does not exceed the best midpoint is chosen as the threshold.

All the threshold values appearing in the tree actually occur in the training data.

After finding the threshold, the quantitative attribute can be compared to qualitative and to other quantitative attributes in the usual way.


C4.5 Quantitative attributes

Finding the threshold value Z dynamically during the tree construction:


Cases are first sorted on the values of the attribute A.

A      32  46  52  58
Class  P   N   P   P

Midpoints of successive values are possible thresholds.

Candidate Z   A <= Z: P  N   A > Z: P  N
39                    1  0          2  1
49                    1  1          2  0
55                    2  1          1  0

The gain ratio is calculated for each candidate threshold. The best candidate is the one resulting in the highest gain ratio. Choose as the threshold the biggest value of A in the training set that does not exceed the best candidate (midpoint).

The candidate threshold 49 yields the highest gain ratio, and, thus, 46 is chosen as the threshold: A <= 46, A > 46
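The threshold search can be sketched in Python as follows. The function `best_threshold` and its return format are assumptions made for this illustration; skipping midpoints between equal adjacent values is also an assumption, since the slides only consider distinct sorted values.

```python
from math import log2


def entropy(counts):
    """H in bits for a list of counts (0 log2 0 is taken as 0)."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def class_counts(labels, classes):
    return [labels.count(c) for c in classes]


def best_threshold(values, labels):
    """Return (gain ratio, best midpoint, chosen threshold) for attribute A."""
    pairs = sorted(zip(values, labels))          # sort cases on the values of A
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    classes = sorted(set(ys))
    n = len(ys)
    h_c = entropy(class_counts(ys, classes))
    best = None
    for k in range(n - 1):
        if xs[k] == xs[k + 1]:
            continue                             # no midpoint between equal values
        midpoint = (xs[k] + xs[k + 1]) / 2
        left, right = ys[:k + 1], ys[k + 1:]     # A <= Z and A > Z
        h_cond = (len(left) / n * entropy(class_counts(left, classes))
                  + len(right) / n * entropy(class_counts(right, classes)))
        ratio = (h_c - h_cond) / entropy([len(left), len(right)])
        if best is None or ratio > best[0]:
            # xs[k] is the largest training value not exceeding the midpoint.
            best = (ratio, midpoint, xs[k])
    return best


ratio, midpoint, threshold = best_threshold([32, 46, 52, 58], ["P", "N", "P", "P"])
print(round(ratio, 3), midpoint, threshold)  # 0.311 49.0 46
```

For the four cases on the slide, the gain ratios of the candidates 39, 49 and 55 are about 0.151, 0.311 and 0.151, so the midpoint 49 wins and the training value 46 becomes the threshold.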


C4.5 Quantitative attributes


Tennis playing (Quinlan 86)

Case  Outlook   Temperature  Humidity  Windy  Class
1     sunny     hot          85        false  N
2     sunny     hot          90        true   N
3     overcast  hot          78        false  P
4     rain      mild         96        false  P
5     rain      cool         80        false  P
6     rain      cool         70        true   N
7     overcast  cool         65        true   P
8     sunny     mild         95        false  N
9     sunny     cool         70        false  P
10    rain      mild         80        false  P
11    sunny     mild         70        true   P
12    overcast  mild         90        true   P
13    overcast  hot          75        false  P
14    rain      mild         96        true   N

Humidity has been measured using a quantitative scale

C4.5 Quantitative attributes

An example of a decision tree built from the Tennis data in which the attribute humidity has been measured using a quantitative scale.

Tree built with the qualitative humidity attribute:

outlook = overcast: P
outlook = sunny:
:...humidity = high: N
:   humidity = normal: P
outlook = rain:
:...windy = true: N
    windy = false: P

Tree built with the quantitative humidity attribute:

outlook = overcast: P
outlook = sunny:
:...humidity <= 75: P
:   humidity > 75: N
outlook = rain:
:...windy = true: N
    windy = false: P

C4.5 ordinal attributes

Ordinal attributes can be handled either in the same way as nominal attributes or in the same way as quantitative attributes. Processing of quantitative attributes is based on the ordering of values. Values of ordinal attributes have a natural order, and, thus, the approach employed for quantitative attributes can be utilised with ordinal attributes, too.

C4.5 - stopping criterion

Stopping criteria

All the cases in a node belong to the same class
No cases in a node
None of the attributes improves the situation in a node
The number of cases in a node is too small for continuing the splitting process:
  Every test must have at least two outcomes having the minimum number of cases. The default value for the minimum number of cases is 2.

C4.5 - Leaves

A leaf can contain

cases all belonging to a single class Cj:

The class Cj is associated with the leaf

no cases:

The most frequent class (the majority class) at the parent of the leaf is associated with the leaf.

cases belonging to a mixture of classes:

The most frequent class (the majority class) at the leaf is associated with the leaf.


C4.5 - Missing values


Real world data often have missing attribute values.
Missing values may be e.g. filled in (imputed) with
  the mode, median or mean of the complete cases of a class
  estimates given by some more "intelligent" method
before running the decision tree program. However, imputation is not unproblematic.

Algorithms can be amended to cope with missing values
  in the tree construction
    selecting tests
    sending cases to subtrees
  when the tree is used in prediction
    submitting cases to subtrees

C4.5 - Missing values

Missing values are taken into account when calculating the information gain

$$I(C \mid A) = p(A_{known}) \cdot (H(C) - H(C \mid A)) + p(A_{unknown}) \cdot 0 = p(A_{known}) \cdot (H(C) - H(C \mid A))$$

where p(Aknown) is the probability that the value of the attribute A is known (i.e. the relative frequency of those cases for which the value of the attribute A is known)

and when calculating the expected information H(A) needed to test the value of the attribute A.

Let an attribute A have the values A1, A2, ..., Av. Missing values are now treated as a separate value, the value v+1.

$$H(A) = -\sum_{j=1}^{v+1} p(A_j) \log_2 p(A_j)$$

C4.5 - Missing values

Let us assume that the Tennis example has one missing value
Tennis playing (Quinlan 86)

Case  Outlook   Temperature  Humidity  Windy  Class
1     sunny     hot          85        false  N
2     sunny     hot          90        true   N
3     overcast  hot          78        false  P
4     rain      mild         96        false  P
5     rain      cool         80        false  P
6     rain      cool         70        true   N
7     overcast  cool         65        true   P
8     sunny     mild         95        false  N
9     sunny     cool         70        false  P
10    rain      mild         80        false  P
11    sunny     mild         70        true   P
12    ?         mild         90        true   P     (missing value)
13    overcast  hot          75        false  P
14    rain      mild         96        true   N

C4.5 - Missing values

The information gain for the Outlook attribute is calculated on the basis of the 13 cases having known value.

$$H(C) = -\frac{8}{13} \log_2 \frac{8}{13} - \frac{5}{13} \log_2 \frac{5}{13} = 0.961$$

outlook   P  N  H(C|Aj)
sunny     2  3  0.971
overcast  3  0  0
rain      3  2  0.971

$$H(C \mid A) = \frac{5}{13} \cdot 0.971 + \frac{3}{13} \cdot 0 + \frac{5}{13} \cdot 0.971 = 0.747$$

$$I(C \mid A) = \frac{13}{14} \cdot (0.961 - 0.747) = 0.199$$

C4.5 - Missing values

The expected information needed to test the value of the Outlook attribute is calculated:

$$H(A) = -\frac{5}{14} \log_2 \frac{5}{14} - \frac{3}{14} \log_2 \frac{3}{14} - \frac{5}{14} \log_2 \frac{5}{14} - \frac{1}{14} \log_2 \frac{1}{14} = 1.809$$

The four terms correspond to the values sunny, overcast, rain and ? (unknown).

The gain ratio for the Outlook attribute is

$$GR(C \mid A) = \frac{I(C \mid A)}{H(A)} = \frac{0.199}{1.809} = 0.110$$
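The numbers of this missing-value example can be verified with a short Python sketch. The variable names and the dictionary of branch counts are assumptions made for the illustration; the counts themselves are those of the 13 cases with a known Outlook value.

```python
from math import log2


def entropy(counts):
    """H in bits for a list of counts (0 log2 0 is taken as 0)."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


# Class counts [P, N] per Outlook value among the cases with a known value.
known_branches = {"sunny": [2, 3], "overcast": [3, 0], "rain": [3, 2]}
n_unknown = 1                                    # case 12
n_known = sum(sum(b) for b in known_branches.values())
n_total = n_known + n_unknown

# H(C) and H(C|A) are computed on the known cases only.
class_totals = [sum(col) for col in zip(*known_branches.values())]
h_c = entropy(class_totals)
h_c_given_a = sum(sum(b) / n_known * entropy(b) for b in known_branches.values())

# The gain is scaled by the relative frequency of known values.
info_gain = n_known / n_total * (h_c - h_c_given_a)

# For H(A), "unknown" is treated as one additional attribute value.
h_a = entropy([sum(b) for b in known_branches.values()] + [n_unknown])
gain_ratio = info_gain / h_a

print(round(h_c, 3), round(h_c_given_a, 3), round(info_gain, 3))
print(round(h_a, 3), round(gain_ratio, 3))
```

The sketch reproduces the slide values 0.961, 0.747, 0.199, 1.809 and 0.110. Compared with the complete data, the missing value both lowers the gain and raises H(A), so the gain ratio drops from 0.156 to 0.110.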



C4.5 Missing values


When cases are sent to subtrees, a weight is given for each case.
If the tested attribute value is known, the case is sent to the branch corresponding to the outcome Oi with the weight w = 1.
Otherwise, a fraction of the case is sent to each branch Oi with the weight w = p(Oi).

p(Oi) is the probability (the relative frequency) of the outcome Oi in the current node. The case is divided between the possible outcomes {O1, O2, ..., Ov} of the test.

The 13 cases with a known value for the Outlook attribute are sent to the corresponding sunny, overcast or rain branches with the weight w = 1. Case 12 is divided between the sunny, overcast and rain branches:

outlook = sunny:    Case 12 with w = 5/13
outlook = overcast: Case 12 with w = 3/13
outlook = rain:     Case 12 with w = 5/13

C4.5 Missing values


Cases in the sunny branch (outlook = sunny):

Case no  Outlook  Temperature  Humidity  Windy  Class  Weight
1        sunny    hot          85        FALSE  N      1
2        sunny    hot          90        TRUE   N      1
8        sunny    mild         95        FALSE  N      1
9        sunny    cool         70        FALSE  P      1
11       sunny    mild         70        TRUE   P      1
12       ?        mild         90        TRUE   P      5/13 = 0.4

The number of cases in a node is now interpreted as the sum of weights of (fractional) cases in the node.

There may be whole cases and fractional cases in a node.

A case came to a node with the weight w. It is sent to the node(s) of the next level with the weight

$$w' = \begin{cases} w \cdot 1 = w & \text{(the value of the attribute of the current node is known)} \\ w \cdot p(O_i) & \text{(the value of the attribute of the current node is unknown)} \end{cases}$$

C4.5 Missing values

Cases in the sunny branch:


Case no  Outlook  Temperature  Humidity  Windy  Class  Weight
1        sunny    hot          85        FALSE  N      1
2        sunny    hot          90        TRUE   N      1
8        sunny    mild         95        FALSE  N      1
9        sunny    cool         70        FALSE  P      1
11       sunny    mild         70        TRUE   P      1
12       ?        mild         90        TRUE   P      5/13 = 0.4

outlook = sunny:
:...humidity <= 75: P
:   humidity > 75: N

Let us assume that this subset is partitioned further by the test on humidity.
The branch humidity <= 75 has cases from the single class P.
The branch humidity > 75 has cases from both classes (class P 0.4/3.4 and class N 3/3.4).

Since no test improves the situation further, a leaf is made (the most frequent class in the node gives the class label).

C4.5 Missing values

A decision tree constructed from the data having a missing value:


outlook = overcast: P (3.2)
outlook = sunny:
:...humidity <= 75: P (2)
:   humidity > 75: N (3.4/0.4)
outlook = rain:
:...windy = true: N (2.4/0.4)
    windy = false: P (3)

The tree is similar to the tree constructed from the original data, but now some leaves have a marking (N/E):

N is the sum of fractional cases belonging to the leaf
E is the sum of those cases misclassified by the leaf (i.e. the sum of fractional cases belonging to classes other than the one suggested by the leaf)

The majority class gives the class label of the node.

The majority class = the biggest class in the node

C4.5 Missing values

Classification of a new case

If the new case has a missing value for the attribute tested in the current node, the case is divided between the outcomes of the test. Now the case has multiple classification paths from the root to leaves, and, therefore, a "classification" is a class distribution. The majority class is the predicted class.

C4.5 Missing values

A case having a missing value is classified:
Outlook: sunny, temperature: mild, humidity: ?, windy: false

outlook = overcast: P (3.2)
outlook = sunny:
:...humidity <= 75: P (2)
:   humidity > 75: N (3.4/0.4)
outlook = rain:
:...windy = true: N (2.4/0.4)
    windy = false: P (3)

If the humidity were less than or equal to 75, the class for the case would be P.
If the humidity were greater than 75, the class for the case would be N with the probability of 3/3.4 (88%) and P with the probability of 0.4/3.4 (12%).

Results from the two humidity branches are summed for the final class distribution:

class P: (2.0/5.4) × 100% + (3.4/5.4) × 12% = 44%
  2 cases of 5.4 training cases belonged to the humidity <= 75 branch, and in this branch the probability of the class P is 100%
  3.4 cases of 5.4 training cases belonged to the humidity > 75 branch, and in this branch the probability of the class P is 12%
class N: (3.4/5.4) × 88% = 56%
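The class distribution for this case can be recomputed with a small Python sketch. The dictionary layout and names are assumptions made for the illustration; the leaf weights are the ones shown in the tree, with case 12's fractional weight rounded to 0.4 as on the slides.

```python
# Weighted class counts in the two leaves under outlook = sunny.
leaves = {
    "humidity <= 75": {"P": 2.0, "N": 0.0},
    "humidity > 75": {"P": 0.4, "N": 3.0},
}

# Total weight of training cases that reached the humidity test (5.4).
total = sum(sum(counts.values()) for counts in leaves.values())

distribution = {"P": 0.0, "N": 0.0}
for counts in leaves.values():
    leaf_weight = sum(counts.values())
    branch_probability = leaf_weight / total       # fraction of the case sent here
    for cls, weight in counts.items():
        # Probability of the class within the leaf, scaled by the branch share.
        distribution[cls] += branch_probability * weight / leaf_weight

predicted = max(distribution, key=distribution.get)
print({cls: round(p, 2) for cls, p in distribution.items()})  # {'P': 0.44, 'N': 0.56}
print(predicted)                                              # N
```

The case is split between the two humidity branches in proportion 2 : 3.4, and the majority class of the resulting distribution, N, is the prediction.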

Underfitting and overfitting


Overfitting: the test error rate starts to increase while the training error rate continues to decrease

Underfitting: when the model is too simple, both training and test errors are large

Overfitting

The built decision tree may overfit the training data.

The tree is complex. Its lowest branches reflect noise and outliers occurring in the training data.
Lower classification accuracy on unseen cases

Reasons for overfitting

Noise and outliers
Inadequate attributes
Too small training data
A local maximum in the greedy search

Pruning

Overfitting can be overcome by pruning. Pruning generally results in
  a faster classification
  a better classification accuracy on unseen cases

Pruning decreases the accuracy on the training data.

Prepruning

Stop the tree construction early.

Postpruning

Let the tree grow and remove branches from the fully grown tree.

In a combined approach both pre- and postpruning are used.

Pruning

(a) The branch marked with a star may be partly based on erroneous or exceptional cases.

(b) The tree growth has been stopped. (prepruning)

(c) The tree has been grown to a "full" tree, after which it has been pruned. (postpruning)

Pruning

The tree growth can be limited in many ways. Define a minimum for the number of cases in a node.

If the number of cases in a node is below the minimum, the recursive division of the example set is stopped and a leaf is formed.

The leaf is labeled with the majority class or the class distribution.

Define a threshold for the attribute selection criterion.

The problem: the definition of a suitable threshold


too high a threshold: oversimplification, useful attributes are discarded
too low a threshold: no simplification at all (or little simplification)

Postpruning

Usually it is more profitable to let the tree grow complete and prune it afterwards than halt the tree growth.

If the tree growth is halted, all the branches growing from a node are lost. Postpruning allows saving some of the branches.

Postpruning requires more calculation than prepruning but it usually results in more reliable trees than prepruning. In postpruning, parts of the tree, whose removal does not decrease the classification accuracy on unseen cases, are discarded.


Postpruning

Postpruning is based on classification errors made by the tree.

the error rate of a leaf is E/N
  N is the number of training cases belonging to the leaf
  E is the number of cases that do not belong to the class suggested by the leaf
the error rate of the whole tree: E and N are summed over all the leaves

a predicted error rate: the error rate on new cases


Postpruning

The basic idea of postpruning:

Start from the bottom of the tree and examine each subtree that is not a leaf. If replacement of the subtree with a leaf (or with its most frequently used branch) would reduce the predicted error rate, then prune the tree accordingly.

When the error rate of any subtree decreases, the error rate of the whole tree decreases as well.

There can be cases from several classes in a leaf, and, thus, the leaf is labeled with the majority class. The error rate can be predicted by using the training set or a new set of cases.

Not a topic of this course
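The bottom-up idea described on this slide can be sketched as follows. This is a simplified illustration under stated assumptions, not the C4.5 method: errors are estimated by counting misclassifications on a separate set of cases (one of the two options mentioned above), only replacement with a leaf is handled, and the nested-dictionary tree format with the keys `attr`, `children`, `majority` and `label` is hypothetical.

```python
def classify(node, case):
    """Follow the tests from the given node down to a leaf and return its label."""
    while "label" not in node:
        node = node["children"][case[node["attr"]]]
    return node["label"]


def count_errors(node, cases):
    """Number of (case, true_class) pairs that the (sub)tree misclassifies."""
    return sum(1 for case, true_class in cases if classify(node, case) != true_class)


def prune(node, cases):
    """Bottom-up pruning: handle the children first, then the node itself.

    The tree is modified in place and the (possibly replaced) node is returned.
    """
    if "label" in node:
        return node
    for value in list(node["children"]):
        reaching = [(c, t) for c, t in cases if c[node["attr"]] == value]
        node["children"][value] = prune(node["children"][value], reaching)
    leaf = {"label": node["majority"]}
    # Replace the subtree only if the leaf would reduce the estimated errors.
    if count_errors(leaf, cases) < count_errors(node, cases):
        return leaf
    return node


tree = {
    "attr": "a", "majority": "pos",
    "children": {
        "y": {"label": "pos"},
        "n": {"attr": "b", "majority": "neg",
              "children": {"y": {"label": "pos"}, "n": {"label": "neg"}}},
    },
}
pruning_set = [
    ({"a": "y", "b": "n"}, "pos"),
    ({"a": "y", "b": "y"}, "pos"),
    ({"a": "n", "b": "y"}, "neg"),
    ({"a": "n", "b": "y"}, "neg"),
    ({"a": "n", "b": "n"}, "neg"),
]
pruned = prune(tree, pruning_set)
```

In this toy example the subtree testing `b` misclassifies two cases, whereas a leaf labeled with its majority class misclassifies none, so it is replaced. The root is kept because replacing it with a leaf would increase the number of errors.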



C4.5 - Pruning

Prepruning

Every test must have at least two outcomes having the minimum number of cases.

Because of the missing values, the minimum number of cases is actually the minimum for the summed weights of the cases.

The default value for the number of cases is 2.
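The prepruning rule above can be written as a small check. This is a sketch only: the function name `split_is_allowed` and its argument layout are hypothetical, and the weights are floats because they are summed case weights rather than plain counts.

```python
def split_is_allowed(outcome_weights, min_cases=2.0):
    """True if at least two outcomes carry a summed case weight >= min_cases.

    outcome_weights -- summed weights of the cases falling into each outcome
    min_cases       -- minimum weight per outcome (the slide gives 2 as the default)
    """
    large_enough = [w for w in outcome_weights if w >= min_cases]
    return len(large_enough) >= 2


allowed = split_is_allowed([151.0, 1.0, 16.0])   # two outcomes reach the minimum
rejected = split_is_allowed([9.0, 1.5])          # only one outcome reaches it
```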

Postpruning

A very pessimistic method based on estimated error rates

How to calculate the very pessimistic estimates is not a topic of this course. However, the idea of the pruning is presented on the next slides.


C4.5 - postpruning

Example: original, complete tree (Quinlan 93)

Congressional voting data, UCI Machine Learning Repository

physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n:
:   :...education spending = n: democrat (6.0)
:       education spending = y: democrat (9.0)
:       education spending = u: republican (1.0)
physician fee freeze = y:
:...synfuels corporation cutback = n: republican (97.0/3.0)
    synfuels corporation cutback = u: republican (4.0)
    synfuels corporation cutback = y:
    :...duty free exports = y: democrat (2.0)
        duty free exports = u: republican (1.0)
        duty free exports = n:
        :...education spending = n: democrat (5.0/2.0)
            education spending = y: republican (13.0/2.0)
            education spending = u: democrat (1.0)
physician fee freeze = u:
:...water project cost sharing = n: democrat (0.0)
    water project cost sharing = y: democrat (4.0)
    water project cost sharing = u:
    :...mx missile = n: republican (0.0)
        mx missile = y: democrat (3.0/1.0)
        mx missile = u: republican (2.0)
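The E/N error rate defined earlier can be applied to the leaves of this complete tree. The sketch below assumes that in the unpruned tree a leaf printed as (N/E) has N training cases of which E are misclassified, and that a leaf printed as (N) has E = 0. The list `leaves` and the function `tree_error_rate` are illustrative names.

```python
# (N, E) per leaf, read from the complete tree above.
leaves = [
    # physician fee freeze = n
    (151.0, 0.0), (1.0, 0.0), (6.0, 0.0), (9.0, 0.0), (1.0, 0.0),
    # physician fee freeze = y
    (97.0, 3.0), (4.0, 0.0), (2.0, 0.0), (1.0, 0.0),
    (5.0, 2.0), (13.0, 2.0), (1.0, 0.0),
    # physician fee freeze = u
    (0.0, 0.0), (4.0, 0.0), (0.0, 0.0), (3.0, 1.0), (2.0, 0.0),
]


def tree_error_rate(leaves):
    """Error rate of the whole tree: E and N are summed over all the leaves."""
    total_cases = sum(n for n, _ in leaves)
    total_errors = sum(e for _, e in leaves)
    return total_errors / total_cases


training_error_rate = tree_error_rate(leaves)
```

With these 17 leaves the sums are 8 errors over 300 training cases, which agrees with the training-set result reported for the complete tree later in the slides.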


C4.5 - postpruning

Pruned tree

The original tree had 17 leaves, the pruned one has 5 leaves.

physician fee freeze = n: democrat (168.0/2.6)
physician fee freeze = y: republican (123.0/13.9)
physician fee freeze = u:
:...mx missile = n: democrat (3.0/1.1)
    mx missile = y: democrat (4.0/2.2)
    mx missile = u: republican (2.0/1.0)

Branches n and y: subtrees have been replaced with leaves
Branch u: the subtree has been replaced with the most frequently used subtree

Leaf republican (123.0/13.9): 123 training cases in the leaf
If 123 new cases were classified, 13.9 cases would be misclassified (a very pessimistic estimate)

C4.5 - postpruning

physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n:
:   :...education spending = n: democrat (6.0)
:       education spending = y: democrat (9.0)
:       education spending = u: republican (1.0)

The subtree has been replaced with a leaf

physician fee freeze = n: democrat (168.0/2.6)

168 training cases in the leaf. One of them is misclassified by the leaf.
If 168 new cases were classified, 2.6 cases would be misclassified (a very pessimistic estimate)

C4.5 - postpruning

physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n:
:   :...education spending = n: democrat (6.0)
:       education spending = y: democrat (9.0)
:       education spending = u: republican (1.0)

The very pessimistic estimate for the education spending subtree: the sum of predicted errors is 3.273

First, the subtree has been replaced with a leaf

:   adoption of the budget resolution = n: democrat (16.0/2.512)

One training case is misclassified.
The very pessimistic estimate: the number of predicted errors is 2.512

physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n: democrat (16.0/2.512)


C4.5 - postpruning

physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n: democrat (16.0/2.512)

The very pessimistic estimate: the sum of predicted errors is 4.642

Then, the subtree has been replaced with a leaf

physician fee freeze = n: democrat (168.0/2.6)
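The two pruning steps shown on these slides follow the same comparison, which can be replayed with the numbers given above. This is a minimal sketch: the function name `replace_with_leaf` is hypothetical, and the predicted error values are taken from the slides as given, not computed here.

```python
def replace_with_leaf(subtree_predicted_errors, leaf_predicted_errors):
    """Prune when a single leaf is predicted to make fewer errors than the subtree."""
    return leaf_predicted_errors < subtree_predicted_errors


# (subtree, sum of predicted errors over its leaves, predicted errors of one leaf)
steps = [
    ("education spending", 3.273, 2.512),
    ("adoption of the budget resolution", 4.642, 2.6),
]
decisions = [(name, replace_with_leaf(subtree, leaf)) for name, subtree, leaf in steps]
```

Both comparisons favour the leaf (2.512 < 3.273 and 2.6 < 4.642), which is why the whole branch physician fee freeze = n ends up as a single leaf.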


C4.5 - postpruning

Interpretation of the numbers (N/E) in a pruned tree

N is the number of training cases in the leaf
E is the number of predicted errors if a set of N unseen cases were classified by the tree.

The sum of the predicted errors over the leaves, divided by the size of the training set (the number of training cases), provides an immediate estimate of the error rate of the pruned tree on new cases.

20.8/300 = 0.069 (6.9%) (The pruned tree is predicted to misclassify 6.9% of new cases.)
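The figure 20.8/300 can be reproduced from the (N/E) values of the five leaves of the pruned tree shown earlier. The variable names in this short sketch are illustrative.

```python
# (N, E) per leaf of the pruned tree: N training cases, E predicted errors.
pruned_leaves = [
    (168.0, 2.6),   # physician fee freeze = n: democrat
    (123.0, 13.9),  # physician fee freeze = y: republican
    (3.0, 1.1),     # mx missile = n: democrat
    (4.0, 2.2),     # mx missile = y: democrat
    (2.0, 1.0),     # mx missile = u: republican
]

predicted_errors = sum(e for _, e in pruned_leaves)
training_cases = sum(n for n, _ in pruned_leaves)
predicted_error_rate = predicted_errors / training_cases
```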


C4.5 - postpruning

Results for the Congressional voting data

                          Complete tree       Pruned tree
                          Nodes  Errors       Nodes  Errors
Training set (300 cases)  25     8 (2.7%)     7      13 (4.3%)
Test set (135 cases)      25     7 (5.2%)     7      4 (3.0%)

10-fold cross-validation gives an error rate of 5.3% on new cases (the average predicted, very pessimistic error rate on new cases is 5.6%).

DTI - pros

Construction of a tree does not (necessarily) require any parameter setting
Can handle high dimensional data
Can handle heterogeneous data
Nonparametric approach
Representation form is intuitive, relatively easy to interpret


DTI - pros

Learning and classification steps are simple and fast

Learning: the complexity depends on the number of nodes, cases and attributes

In each node: O(n · p); with quantitative attributes O(p · n · log n)
n = number of cases in the node
p = number of attributes

Classification: O(w), where w is the maximum depth of the tree

An eager method: training is computationally more expensive than classification

Quite robust to the presence of noise
In general, good classification accuracy comparable with other classification methods
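The O(w) cost of classification can be seen by walking a tree from the root to a leaf: one test is evaluated per level. The sketch below encodes the pruned Congressional voting tree from the earlier slides as nested dictionaries. The dictionary layout and the function `classify` are illustrative assumptions, not the See5 data format.

```python
pruned_tree = {
    "attr": "physician fee freeze",
    "children": {
        "n": "democrat",
        "y": "republican",
        "u": {
            "attr": "mx missile",
            "children": {"n": "democrat", "y": "democrat", "u": "republican"},
        },
    },
}


def classify(tree, case):
    """Return (class label, number of tests evaluated on the way to the leaf)."""
    tests = 0
    node = tree
    while isinstance(node, dict):  # internal node: apply its test
        node = node["children"][case[node["attr"]]]
        tests += 1
    return node, tests


short_path = classify(pruned_tree, {"physician fee freeze": "n"})
long_path = classify(pruned_tree, {"physician fee freeze": "u", "mx missile": "u"})
```

No case needs more than two tests here, because the longest path in this tree contains two test nodes.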

DTI - other issues

Decision tree algorithms divide the training data into smaller and smaller subsets in a recursive fashion.

Problems

Data fragmentation
Repetition
Replication

Data fragmentation

Number of instances at the leaf nodes can be too small to make any statistically significant decision


DTI - other issues

Repetition

An attribute is repeatedly tested along some branch of the decision tree

Replication
A decision tree contains duplicate subtrees

DTI - other issues - decision boundary

The border line between two neighbouring regions of different classes is known as the decision boundary.
The decision boundary is parallel to the axes because each test condition involves a single attribute at a time.

DTI - other issues - multivariate split

Figure: a split on the test condition x + y < 1 separating the two classes

Multivariate splits are based on a combination of attributes
More expressive representation
The use of multivariate splits can prevent problems of fragmentation, repetition and replication.
Finding the optimal test condition is computationally expensive.
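The gain in expressiveness can be illustrated with four points whose sides of the boundary x + y < 1 cannot be reproduced by any single test on x alone or on y alone. The points and the helper `single_attribute_test_exists` are made up for this sketch.

```python
points = [(0.05, 0.9), (0.9, 0.05), (0.9, 0.9), (0.2, 0.2)]
target = [x + y < 1 for x, y in points]  # side of the multivariate split


def single_attribute_test_exists(points, target, axis):
    """Can one test 'attribute < t' (or its complement) reproduce target?"""
    values = sorted({p[axis] for p in points})
    # Thresholds at the observed values, plus one above the maximum,
    # cover every distinct way of dividing the points on this attribute.
    thresholds = values + [values[-1] + 1.0]
    for t in thresholds:
        side = [p[axis] < t for p in points]
        if side == target or [not s for s in side] == target:
            return True
    return False


x_only = single_attribute_test_exists(points, target, axis=0)
y_only = single_attribute_test_exists(points, target, axis=1)
```

Both checks fail for this data, so a univariate tree would need several axis-parallel tests to represent what the multivariate test expresses in one node.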


DTI - other issues

Decision tree induction is a widely studied topic - many kinds of enhancements to the basic algorithm have been developed.

challenges arising from real world data: quantitative attributes, missing values, noise, outliers
multivariate decision trees
incremental decision tree induction

updatable decision trees

scalable decision tree induction

Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed


DTI - other issues

C4.5 is a kind of reference algorithm used in machine learning research. In this course we will use See5, a descendant of C4.5.

A demonstration version of See5 is freely available from

http://www.rulequest.com/download.html

The source code of C4.5 (written in C) is freely available for research and teaching from

http://www.rulequest.com/Personal/c4.5r8.tar.gz


References

These slides are partly based on the slides of the books:

Han J, Kamber M. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006. http://www-sal.cs.uiuc.edu/~hanj/bk2/
Tan P-N, Steinbach M, Kumar V. Introduction to Data Mining. Addison-Wesley, 2006. http://www-users.cs.umn.edu/~kumar/dmbook/

Hand D, Mannila H, Smyth P. Principles of Data Mining. MIT Press, 2001.
Mitchell TM. Machine Learning. McGraw-Hill, 1997.
Quinlan JR. Induction of decision trees. Machine Learning 1: 81-106, 1986.
Quinlan JR. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
Quinlan JR. See5. http://www.rulequest.com
