Timu a11 Classification Decision Tree Induction

Attribution Non-Commercial (BY-NC)

Classification

Class labels (the values of the target attribute, the class) are predicted.

Every case belongs to one of a set of mutually exclusive classes. This class is known: supervised learning.

A classification method analyses the training data (attribute values and class labels) and constructs a model.

Classification

The model is evaluated (e.g. by its accuracy and a subjective estimate). If the model is acceptable, it is used to classify cases whose class labels are not known.

Classification methods

Decision trees, rules, the k-nearest-neighbour method, the naïve Bayesian classifier, neural networks, ...

Application examples: To give a diagnosis suggestion on the basis of the symptoms and test results of a patient

Training data

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learning algorithm

Classifier (Model)

Known class labels

Test data

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

The model predicts: no, yes, yes, yes. It misclassifies the second test case (gives yes, while the known class is no).

4

Classifier (Model)

IF rank = professor OR years > 6 THEN tenured = yes ELSE tenured = no
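This rule can be written directly as a small function and applied to the test data above; the function and variable names are ours, not part of the slides:

```python
# A sketch of the learned classifier above (function and data names are ours).
def tenured(rank, years):
    """IF rank = professor OR years > 6 THEN tenured = yes ELSE tenured = no."""
    return "yes" if rank == "Professor" or years > 6 else "no"

test_data = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

predictions = [tenured(rank, years) for _, rank, years, _ in test_data]
# Merlisa (7 years) is predicted "yes" although her known class is "no".
```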

Cases are described using fixed-length attribute vectors. Each case belongs to one class. Classes are mutually exclusive. The class of a case is known: supervised learning

The tree is constructed in a top-down manner (from the root to the leaves)

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

Decision tree

Inner nodes contain tests based on attributes (test nodes). Branches correspond to the outcomes of the tests (attribute values). Leaf nodes (leaves) contain the class information (one class or a class distribution).

Decision tree

The attribute assigned to the root node (here outlook, with branches sunny, overcast and rain; the overcast branch leads directly to the leaf P) is examined and the branch corresponding to the attribute value is followed. This process continues until a leaf node is encountered.

Decision tree

The classification path from the root to a leaf gives an explanation for the decision. The number of tested attributes depends on the classification path.

A classification path: a conjunction of constraints set on attributes A decision tree: a disjunction of the classification paths

10

Tree construction

A complete (fully grown) tree is built based on the training data. (Prepruning: the growth of the tree is restricted.)

Tree pruning

postpruning: branches are pruned from a complete tree (or from a prepruned tree)

11

A decision tree is constructed in a top-down recursive divide-andconquer manner. In the beginning, all the training examples are at the root. If the stopping criterion is fulfilled, a leaf node is formed. If the stopping criterion is not fulfilled, the best attribute is selected according to some criterion (a greedy algorithm) and

a test node is formed; the cases are divided into subsets according to the values of the chosen attribute; and a decision tree is formed recursively for each subset

12

Generate a decision tree
(1) Create a node N
(2) if (stopping criterion is fulfilled)
(3)     Make a leaf node (node N)
(4) else
(5)     Choose the best attribute and make a test node (node N) that tests the chosen attribute
(6)     Divide the cases into subsets according to the values of the chosen attribute
(7)     Generate a decision tree for each subset
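The pseudocode above can be sketched in Python; the attribute-selection function is left as a parameter, since the slides introduce the selection criteria only later (names are ours):

```python
from collections import Counter

# A sketch of the recursive divide-and-conquer construction above,
# parameterised by a `best_attribute` scoring function (assumed given).
def build_tree(cases, attributes, best_attribute):
    classes = [c["class"] for c in cases]
    # Stopping criterion: all cases in one class, or no attributes left.
    if len(set(classes)) == 1 or not attributes:
        return {"leaf": Counter(classes).most_common(1)[0][0]}
    attr = best_attribute(cases, attributes)          # greedy choice
    node = {"test": attr, "branches": {}}
    for value in {c[attr] for c in cases}:            # one branch per value
        subset = [c for c in cases if c[attr] == value]
        rest = [a for a in attributes if a != attr]
        node["branches"][value] = build_tree(subset, rest, best_attribute)
    return node
```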

13

How to select the best attribute? How to specify the attribute test condition?

When to stop the recursive splitting? How to form decision nodes (leaves)? How to prune a tree?

14

Attributes are adequate for the classification task, if all the cases having the same attribute values belong to the same class.

If the attributes are adequate it is always possible to construct a decision tree which correctly classifies all the training data.

Usually there exist several correctly classifying decision trees. In the worst case, there is a leaf in the tree for each of the training cases.

15

16

A complex decision tree for the same classification task

17

Derives from the principle called Occam's razor: if there are two models having the same accuracy on the training data, the smaller (simpler) one can be seen as more general and thus better. Smaller trees: more general, easier to understand and possibly more accurate in classifying unseen cases.

Try to generate simple trees by generating simple nodes. The complexity of a node is

at its largest when the node has an equal number of cases from every class
at its smallest when the node has cases from one class only

Heuristic attribute selection measures (measures of the goodness of a split) are used. These aim to generate homogeneous (pure) child nodes (subsets).

18

E.B. Hunt (1950s and 1960s): to simulate human problem-solving methods; analysing the content of English texts, medical diagnostics

J.R. Quinlan (end of the 1970s): chess endgames; applications from medical diagnostics to scouting

C4.5 and other descendants of ID3 address issues arising in real-world classification tasks. C4.5 is one of the most widely used machine learning algorithms, frequently used as a reference algorithm in machine learning research.

19

ID3

Assumes that

attributes are categorical and have a small number of possible values the class (the target attribute) has two possible values

ID3 selects the best attribute according to a criterion called information gain

Criterion selects an attribute that maximises information gain (or minimises entropy)

20

Let

S be a training set that contains s cases (s is the number of cases) and the class attribute C have the values C1, …, Cm (m is the number of classes)

In ID3 m = 2

si be the number of cases belonging to the class Ci in the training set S and p(Ci) = si /s the relative frequency of the class Ci in S

21

The entropy of the class distribution is

H(C) = − Σ_{i=1}^{m} p(Ci) log2 p(Ci)

The base-2 logarithm is used because the information is coded in bits. We define in this context that if p(Ci) = 0, then p(Ci) log2 p(Ci) = 0 (zero).

(Recall: log_k a = x ⇔ k^x = a, where k ∈ R+ \ {1} and a ∈ R+.)

22

Examples (six cases, two classes):

C1: 0, C2: 6  →  H(C) = 0
C1: 1, C2: 5  →  p(C1) = 1/6, p(C2) = 5/6, H(C) = −(1/6) log2(1/6) − (5/6) log2(5/6) = 0.65
C1: 2, C2: 4  →  p(C1) = 2/6, p(C2) = 4/6, H(C) = −(2/6) log2(2/6) − (4/6) log2(4/6) = 0.92
C1: 3, C2: 3  →  p(C1) = 3/6, p(C2) = 3/6, H(C) = −(3/6) log2(3/6) − (3/6) log2(3/6) = 1

H(C) reaches its maximum (= log2 m) when the cases are equally distributed among the classes

m = number of classes
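The entropy values above can be reproduced with a few lines of Python (a minimal sketch; 0 · log2 0 is treated as 0, as defined earlier):

```python
from math import log2

# Entropy of a class distribution given as class counts (0 * log2 0 -> 0).
def entropy(counts):
    total = sum(counts)
    s = sum((c / total) * log2(c / total) for c in counts if c > 0)
    return -s if s else 0.0

# The four six-case distributions from the example above:
for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(entropy(counts), 2))   # 0.0, 0.65, 0.92, 1.0
```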

23

Let an attribute A have the values Aj, j = 1, …, v. Let the set S be divided into the subsets {S1, S2, …, Sv} according to the values of the attribute A. The expected information needed to classify an arbitrary case in the branch corresponding to the value Aj is

H(C|Aj) = − Σ_{i=1}^{m} p(Ci) log2 p(Ci)

where only the cases having the value Aj for the attribute A are considered when calculating p(Ci).

24

The expected information needed to classify an arbitrary case when using the attribute A as the root is

H(C|A) = Σ_{j=1}^{v} p(Aj) H(C|Aj)

where p(Aj) is the relative frequency of the cases having the value Aj for the attribute A in the set S.

The information gain is

I(C|A) = H(C) − H(C|A)

ID3 chooses the attribute resulting in the greatest information gain as the attribute for the root of the decision tree.
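As a check, a sketch of I(C|A) computed on the weather data of the example (only the outlook and windy attributes are encoded here for brevity; the data-structure names are ours):

```python
from math import log2

# Information gain I(C|A) = H(C) - H(C|A), a sketch.
def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(cases, attr):
    labels = [c["class"] for c in cases]
    rest = 0.0
    for value in {c[attr] for c in cases}:
        subset = [c["class"] for c in cases if c[attr] == value]
        rest += len(subset) / len(cases) * entropy(subset)   # p(Aj) * H(C|Aj)
    return entropy(labels) - rest

# The 14 training cases (outlook, windy, class); other attributes omitted.
data = [("sunny","false","N"),("sunny","true","N"),("overcast","false","P"),
        ("rain","false","P"),("rain","false","P"),("rain","true","N"),
        ("overcast","true","P"),("sunny","false","N"),("sunny","false","P"),
        ("rain","false","P"),("sunny","true","P"),("overcast","true","P"),
        ("overcast","false","P"),("rain","true","N")]
cases = [{"outlook": o, "windy": w, "class": c} for o, w, c in data]

print(round(info_gain(cases, "outlook"), 3))
# ≈ 0.247; the slides round intermediate values and report 0.246.
```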

25

ID3: Tests

A = Aj

The outcomes of a test are mutually exclusive. There is a separate branch in the tree for each possible outcome.

26

ID3 assumes that attributes are adequate. It splits the data in recursive fashion, until all the cases of a node belong to the same class. The class of a leaf node is defined on the basis of the class of the cases in the node.

If the leaf is empty (there are no cases with some particular value of an attribute), the class is unknown (the leaf is labelled as null).

27

Playing tennis (Quinlan 86)

No.  Outlook   Temperature  Humidity  Windy  Class
1    sunny     hot          high      false  N
2    sunny     hot          high      true   N
3    overcast  hot          high      false  P
4    rain      mild         high      false  P
5    rain      cool         normal    false  P
6    rain      cool         normal    true   N
7    overcast  cool         normal    true   P
8    sunny     mild         high      false  N
9    sunny     cool         normal    false  P
10   rain      mild         normal    false  P
11   sunny     mild         normal    true   P
12   overcast  mild         high      true   P
13   overcast  hot          normal    false  P
14   rain      mild         high      true   N

28

9 cases of class P, 5 cases of class N

H(C) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

29

The expected information required for each of the subtrees after using the attribute Outlook to split the set S into 3 subsets:

          P  N
sunny     2  3
overcast  4  0
rain      3  2

sunny:    H(C|sunny)    = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.971
overcast: H(C|overcast) = 0
rain:     H(C|rain)     = −(3/5) log2(3/5) − (2/5) log2(2/5) = 0.971

30

The expected information needed to classify an arbitrary case for the tree with the attribute Outlook as root is

H(C|A) = (5/14) · 0.971 + (4/14) · 0 + (5/14) · 0.971 = 0.694

31

The attribute resulting in the greatest information gain is chosen as the attribute for the root of the decision tree.

I(C|outlook) = 0.940 − 0.694 = 0.246

32

The attribute Outlook has been chosen and the cases have been divided into subsets according to their values of the Outlook attribute.

outlook = sunny. Cases: (1, sunny, hot, …, N), (2, sunny, hot, …, N), (8, sunny, mild, …, N), (9, sunny, cool, …, P), (11, sunny, mild, …, P)

outlook = overcast. Cases: (3, overcast, hot, …, P), (7, overcast, cool, …, P), (12, overcast, mild, …, P), (13, overcast, hot, …, P)

outlook = rain. Cases: (4, rain, mild, …, P), (5, rain, cool, …, P), (6, rain, cool, …, N), (10, rain, mild, …, P), (14, rain, mild, …, N)

33

Cases: (1, sunny, hot, high, false, N) (2, sunny, hot, high, true, N) (8, sunny, mild, high, false, N) (9, sunny, cool, normal, false, P) (11, sunny, mild, normal, true, P)

34

The sunny branch (5 cases: 2 P, 3 N; H(C) = 0.971):

Temperature  P  N  H(C|Aj)
hot          0  2  0
mild         1  1  1
cool         1  0  0

H(C|Temperature) = (2/5) · 0 + (2/5) · 1 + (1/5) · 0 = 0.400
I(C|Temperature) = 0.971 − 0.400 = 0.571

Humidity  P  N  H(C|Aj)
high      0  3  0
normal    2  0  0

H(C|Humidity) = (3/5) · 0 + (2/5) · 0 = 0
I(C|Humidity) = 0.971 − 0 = 0.971

35

Humidity is chosen

Cases are sent down the high and normal branches. The cases in the high branch are all of the same class: a leaf node is formed. The same holds for the normal branch.

The branches for overcast and rain are built in a similar way.

36

Complete decision tree and classification of a new case (Outlook: rain, Temperature: hot, Humidity: high, Windy: true) Play tennis?

37

ID3 does not address issues arising in real world classification tasks.

38

C4.5

Gain ratio attribute selection criterion
Tests for value groups and quantitative attributes
No requirement of fully adequate attributes
Probabilistic approach for handling missing values
Pruning

39

The information gain criterion has a tendency to favour attributes with many outcomes.

However, such attributes may be less relevant for prediction than attributes having a smaller number of outcomes.

An extreme example is an attribute that is used as an identifier. Identifiers have unique values, resulting in pure nodes, but they don't have predictive power.

40

The gain ratio is defined as

I(C|A) / H(A)

where I(C|A) is the information gain obtained from testing the attribute A, and H(A) is the expected information needed to sort out the value of the attribute A, i.e. the uncertainty relating to the value of the attribute A:

H(A) = − Σ_{j=1}^{v} p(Aj) log2 p(Aj)

where p(Aj) is the probability of the value Aj (the relative frequency of the value Aj).

41

The gain ratio criterion selects the attribute having the highest gain ratio among of those attributes whose information gain is at least the average information gain over all the attributes examined.

42

Let's calculate the gain ratio for the Outlook attribute of the Tennis example. The information gain I(C|A) for the attribute Outlook is 0.246. Calculate the expected information for the Outlook attribute:

H(A) = −(5/14) log2(5/14) − (4/14) log2(4/14) − (5/14) log2(5/14) = 1.577

The gain ratio is 0.246 / 1.577 = 0.156.
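The same numbers can be checked in Python (a sketch; the gain 0.246 is taken from the slides):

```python
from math import log2

# Sketch: split information H(A) and the gain ratio for Outlook.
def split_info(value_counts):
    n = sum(value_counts)
    return -sum((v / n) * log2(v / n) for v in value_counts if v > 0)

h_outlook = split_info([5, 4, 5])      # sunny, overcast and rain frequencies
gain_ratio = 0.246 / h_outlook         # information gain 0.246 from the slides
print(round(h_outlook, 3), round(gain_ratio, 3))   # 1.577 0.156
```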

43

Test forms (figure): a multiway test on a qualitative attribute (Outlook: sunny / overcast / rain), a test on value groups (Outlook: {Sunny, Overcast} / {Rain}) and a threshold test on a quantitative attribute (Humidity: <= 75 / > 75).

44

Tests based on qualitative attributes can take the form outlook in {sunny, overcast}, outlook = rain.

Value groups help to assess equitably qualitative attributes that vary in their numbers of possible values.

(The gain ratio criterion is biased to prefer attributes having a small number of possible values.)

45

For each appropriate grouping, an additional attribute is formed in the preprocessing phase. This approach is economical from a computational viewpoint. Problem: the appropriateness of a grouping may depend on the context (the part of the tree). A constant grouping may be too crude.

46

At first, each value forms its own group. Then, all possible mergers of a pair of groups are tried.

The process continues until just two value groups remain, or until no such merger would result in a better division of the training data.

Example on the next slide:

Michalski's Soybean data: 35 attributes, 19 classes, 683 training cases. The attribute stem canker has four values: none, below soil, above soil, above 2nd node.

47

1) Partition into four one-value groups
2) Two one-value groups are merged
3) Based on the results of step 2, "above soil" and "above 2nd node" are merged

No merger after step 3 improves the situation, so the process stops. Final groups: {none}, {below soil}, {above soil, above 2nd node}

48

From the overall viewpoint, the aim is to get simpler and more accurate trees. The advantage of value groupings depends on the application domain. The search for value groups can require a substantial increase in computation.

49

The threshold is defined dynamically. The cases are first sorted on the values of the attribute A being considered, giving the ordered distinct values A1 < A2 < … < Aw. Each midpoint

Z = (Ak + Ak+1) / 2

is a possible threshold that divides the cases of the training set S into two subsets.

50

There are w-1 candidate thresholds. The best threshold is the one that results in the largest gain ratio. The largest value of the attribute A in the training set that does not exceed the best midpoint is chosen as the threshold.

All the threshold values appearing in the tree actually occur in the training data.

After finding the threshold, the quantitative attribute can be compared to qualitative and to other quantitative attributes in the usual way.

51

A      32  46  52  58
Class  P   N   P   P

Candidate thresholds (midpoints): Z = 39, 49, 55

      A <= Z     A > Z
Z     P   N      P   N
39    1   0      2   1
49    1   1      2   0
55    2   1      1   0

Midpoints of successive values are the possible thresholds. The gain ratio is calculated for each candidate threshold; the best candidate is the one resulting in the highest gain ratio. The threshold is then the biggest value of A in the training set that does not exceed the best candidate (midpoint).

Here the candidate threshold 49 yields the highest gain ratio, and thus 46 is chosen as the threshold: A <= 46, A > 46.
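A sketch of the threshold search; for brevity the candidates are scored with plain information gain rather than the gain ratio used by C4.5 (both select Z = 49 on this data), and the function names are ours:

```python
from math import log2

# Sketch of dynamic threshold selection for a quantitative attribute.
def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    vs = [v for v, _ in pairs]
    ls = [l for _, l in pairs]
    best = None
    for k in range(len(vs) - 1):
        z = (vs[k] + vs[k + 1]) / 2                  # candidate midpoint
        left, right = ls[:k + 1], ls[k + 1:]
        gain = entropy(ls) - (len(left) / len(ls) * entropy(left)
                              + len(right) / len(ls) * entropy(right))
        if best is None or gain > best[0]:
            best = (gain, z)
    # Report the largest actual value not exceeding the best midpoint.
    return max(v for v in vs if v <= best[1])

print(best_threshold([32, 46, 52, 58], ["P", "N", "P", "P"]))   # 46
```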

52

No.  Outlook   Temperature  Humidity  Windy  Class
1    sunny     hot          85        false  N
2    sunny     hot          90        true   N
3    overcast  hot          78        false  P
4    rain      mild         96        false  P
5    rain      cool         80        false  P
6    rain      cool         70        true   N
7    overcast  cool         65        true   P
8    sunny     mild         95        false  N
9    sunny     cool         70        false  P
10   rain      mild         80        false  P
11   sunny     mild         70        true   P
12   overcast  mild         90        true   P
13   overcast  hot          75        false  P
14   rain      mild         96        true   N

53

An example of a decision tree built from the Tennis data in which the attribute humidity has been measured using a quantitative scale.

outlook = overcast: P outlook = sunny: :...humidity = high: N : humidity = normal: P outlook = rain: :...windy = true: N windy = false: P

outlook = overcast: P outlook = sunny: :...humidity <= 75: P : humidity > 75: N outlook = rain: :...windy = true: N windy = false: P

54

Ordinal attributes can be handled either in the same way as nominal attributes or in the same way as quantitative attributes. The processing of quantitative attributes is based on the ordering of the values. Since the values of ordinal attributes have a natural order, the approach employed for quantitative attributes can be utilised for ordinal attributes, too.

55

Stopping criteria

All the cases in a node belong to the same class No cases in a node None of the attributes improves the situation in a node The number of cases in a node is too small for continuing the splitting process:

Every test must have at least two outcomes having the minimum number of cases. The default value for the number of cases is 2.

56

C4.5 - Leaves

If the leaf has no cases, the most frequent class (the majority class) at the parent of the leaf is associated with the leaf.

Otherwise, the most frequent class (the majority class) at the leaf is associated with the leaf.

57

Real-world data often have missing attribute values. Missing values may be e.g. filled in (imputed) with

the mode, median or mean of the complete cases of a class
estimates given by some more "intelligent" method

before running the decision tree program. However, imputation is not unproblematic.

58

Missing values are taken into account when calculating the information gain:

I(C|A) = p(A known) · (H(C) − H(C|A)) + p(A unknown) · 0
       = p(A known) · (H(C) − H(C|A))

where p(A known) is the probability that the value of the attribute A is known (i.e. the relative frequency of those cases for which the value of the attribute A is known),

and when calculating the expected information H(A) needed to test the value of the attribute A.

Let an attribute A have the values A1, A2, …, Av. Missing values are now treated as a value of their own, the value Av+1:

H(A) = − Σ_{j=1}^{v+1} p(Aj) log2 p(Aj)

59

Let us assume that the Tennis example (Quinlan 86) has one missing value: the Outlook value of case 12 is unknown.

No.  Outlook   Temperature  Humidity  Windy  Class
1    sunny     hot          85        false  N
2    sunny     hot          90        true   N
3    overcast  hot          78        false  P
4    rain      mild         96        false  P
5    rain      cool         80        false  P
6    rain      cool         70        true   N
7    overcast  cool         65        true   P
8    sunny     mild         95        false  N
9    sunny     cool         70        false  P
10   rain      mild         80        false  P
11   sunny     mild         70        true   P
12   ?         mild         90        true   P
13   overcast  hot          75        false  P
14   rain      mild         96        true   N

60

The information gain for the Outlook attribute is calculated on the basis of the 13 cases having known value.

outlook   P  N  H(C|Aj)
sunny     2  3  0.971
overcast  3  0  0
rain      3  2  0.971

61

The expected information needed to test the value of the Outlook attribute is calculated:

H(A) = −(5/14) log2(5/14) − (3/14) log2(3/14) − (5/14) log2(5/14) − (1/14) log2(1/14) = 1.809
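The calculations for the missing-value example can be sketched as follows; the resulting gain value is our computation from the counts on the previous slide, not a number stated on the slides:

```python
from math import log2

# Sketch of the missing-value calculations for Outlook (case 12 unknown).
def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

h_c = entropy([8, 5])                   # the 13 known cases: 8 P, 5 N
h_c_a = (5/13) * entropy([2, 3]) \
      + (3/13) * entropy([3, 0]) \
      + (5/13) * entropy([3, 2])        # sunny, overcast, rain subsets
gain = (13/14) * (h_c - h_c_a)          # weighted by p(A known) = 13/14
h_a = entropy([5, 3, 5, 1])             # missing values as a value of their own
print(round(gain, 3), round(h_a, 3))    # 0.199 1.809
```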

62

When cases are sent to the subtrees, a weight is given to each case. If the tested attribute value is known, the case is sent to the branch corresponding to the outcome Oi with the weight w = 1. Otherwise, a fraction of the case is sent to each branch Oi with the weight w = p(Oi).

p(Oi) is the probability (the relative frequency) of the outcome Oi in the current node. The case is divided between the possible outcomes {O1, O2, …, Ov} of the test.

The 13 cases with a known value for the Outlook attribute are sent to the corresponding sunny, overcast or rain branch with the weight w = 1. Case 12 is divided between the sunny, overcast and rain branches.
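The fractional weights for case 12 follow directly from the known-value frequencies (a sketch; the variable names are ours):

```python
# Sketch: the fractional weights used when case 12 (Outlook unknown) is split.
known = {"sunny": 5, "overcast": 3, "rain": 5}   # known-value frequencies
total = sum(known.values())                      # 13 known cases
weights = {branch: count / total for branch, count in known.items()}
print({b: round(w, 3) for b, w in weights.items()})
# The fractions sum to 1: the whole case is distributed over the branches.
```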

63

The outlook = sunny branch:

Case no  Outlook  Temperature  Humidity  Windy  Class  Weight
1        sunny    hot          85        false  N      1
2        sunny    hot          90        true   N      1
8        sunny    mild         95        false  N      1
9        sunny    cool         70        false  P      1
11       sunny    mild         70        true   P      1
12       ?        mild         90        true   P      5/13 ≈ 0.4

The number of cases in a node is now interpreted as the sum of the weights of the (fractional) cases in the node.

A case that came to a node with the weight w is sent to the node(s) of the next level with the weight

w' = w · 1 = w       (the value of the attribute tested at the current node is known)
w' = w · p(Oi)       (the value of the attribute tested at the current node is unknown)

64


Let us assume that this subset is partitioned further by the test on humidity. The branch humidity <= 75 has cases from the single class P. The branch humidity > 75 has cases from both classes (class P 0.4/3.4 and class N 3/3.4).

Since no test improves the situation further, a leaf is made (the most frequent class in the node gives the class label).

65

outlook = overcast: P (3.2) outlook = sunny: :...humidity <= 75: P (2) : humidity > 75: N (3.4/0.4) outlook = rain: :...windy = true: N (2.4/0.4) windy = false: P (3)

The tree is like the tree constructed from the original data, but now some leaves have a marking (N/E):

N is the sum of the fractional cases belonging to the leaf
E is the sum of those cases misclassified by the leaf (i.e. the sum of the fractional cases belonging to classes other than the one suggested by the leaf)

The majority class gives the class label of the node.

66

If a new case has a missing value for the attribute tested in the current node, the case is divided between the outcomes of the test. The case then has multiple classification paths from the root to the leaves, and therefore the "classification" is a class distribution. The majority class is the predicted class.

67

A case having a missing value is classified Outlook: sunny, temperature: mild, humidity: ?, windy: false

outlook = overcast: P (3.2) outlook = sunny: :...humidity <= 75: P (2) : humidity > 75: N (3.4/0.4) outlook = rain: :...windy = true: N (2.4/0.4) windy = false: P (3)

If the humidity were less than or equal to 75, the class for the case would be P If the humidity were greater than 75, the class for the case would be N with the probability of 3/3.4 (88%) and P with the probability of 0.4/3.4 (12%).

The results from the humidity <= 75 and humidity > 75 branches are summed for the final class distribution:

class P: (2.0/5.4) · 100% + (3.4/5.4) · 12% = 44%
(2 of the 5.4 training cases belonged to the humidity <= 75 branch, where the probability of class P is 100%; 3.4 of the 5.4 training cases belonged to the humidity > 75 branch, where the probability of class P is 12%)

class N: (3.4/5.4) · 88% = 56%
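The same distribution as a few lines of Python (a sketch using the leaf counts from the pruned tree above; the data-structure names are ours):

```python
# Sketch: final class distribution for the case with unknown humidity.
# Each leaf contributes (training weight / total) * P(class P in leaf).
leaves = {
    "humidity <= 75": (2.0, 2.0 / 2.0),     # 2 cases, all of class P
    "humidity > 75":  (3.4, 0.4 / 3.4),     # 3.4 cases, 0.4 of them class P
}
total = sum(n for n, _ in leaves.values())   # 5.4 cases reached the test
p_class_p = sum((n / total) * p for n, p in leaves.values())
print(round(p_class_p, 2), round(1 - p_class_p, 2))   # 0.44 0.56
```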

68

Overfitting: the test error rate starts to increase while the training error rate continues to decrease.

Underfitting: when the model is too simple, both the training and test errors are large.

69

Overfitting

The tree is complex. Its lowest branches reflect noise and outliers occurring in the training data, which lowers the classification accuracy on unseen cases.

Causes: noise and outliers, inadequate attributes, too small a training set, a local maximum in the greedy search.

70

Pruning

Prepruning: restrict the growth of the tree during construction.

Postpruning: let the tree grow and remove branches from the fully grown tree.

71

Pruning

(a) The branch marked with a star may be partly based on erroneous or exceptional cases.

(c) The tree has been grown fully ("the full tree"), after which it has been pruned (postpruning).

72

Pruning

The tree growth can be limited in many ways. Define a minimum for the number of cases in a node.

If the number of cases in a node is below the minimum, the recursive division of the example set is stopped and a leaf is formed.

The leaf is labeled with the majority class or the class distribution.

too high a threshold: oversimplification (useful attributes are discarded)
too low a threshold: no simplification at all (or little simplification)

73

Postpruning

Usually it is more profitable to let the tree grow to completion and prune it afterwards than to halt the tree growth.

If the tree growth is halted, all the branches growing from a node are lost. Postpruning allows saving some of the branches.

Postpruning requires more calculation than prepruning but it usually results in more reliable trees than prepruning. In postpruning, parts of the tree, whose removal does not decrease the classification accuracy on unseen cases, are discarded.

74

Postpruning

N is the number of training cases belonging to the leaf. E is the number of those cases that do not belong to the class suggested by the leaf. The error rate of the whole tree is obtained by summing E and N over all the leaves.

75

Postpruning

Start from the bottom of the tree and examine each subtree that is not a leaf. If replacement of the subtree with a leaf (or with its most frequently used branch) would reduce the predicted error rate, then prune the tree accordingly.

When the error rate of any of the subtrees is reduced, the error rate of the whole tree is also reduced.

There can be cases from several classes in a leaf, and, thus, the leaf is labeled with the majority class. The error rate can be predicted by using the training set or a new set of cases.
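The bottom-up replacement logic described above can be sketched as follows; the pessimistic error estimator is assumed given, since its calculation is outside the scope of these slides (names are ours):

```python
# Bottom-up sketch of error-based postpruning: replace a subtree with a leaf
# when the leaf's predicted errors do not exceed the sum over the subtree.
# `predicted_errors` is assumed given (C4.5 uses a pessimistic estimate).
def prune(node, predicted_errors):
    if "leaf" in node:                       # leaves cannot be pruned further
        return node
    node["branches"] = {v: prune(child, predicted_errors)
                        for v, child in node["branches"].items()}
    subtree_errors = sum(predicted_errors(c) for c in node["branches"].values())
    as_leaf = {"leaf": node["majority"]}     # candidate replacement leaf
    return as_leaf if predicted_errors(as_leaf) <= subtree_errors else node
```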

76

C4.5 - Pruning

Prepruning

Every test must have at least two outcomes having the minimum number of cases.

Because of the missing values, the minimum number of cases is actually the minimum for the summed weights of the cases.

Postpruning

A "very pessimistic" method based on estimated error rates. How to calculate the very pessimistic estimates is not a topic of this course; however, the idea of the pruning is presented on the next slides.

77

C4.5 - postpruning

physician fee freeze = n: :...adoption of the budget resolution = y: democrat (151.0) : adoption of the budget resolution = u: democrat (1.0) : adoption of the budget resolution = n: : :...education spending = n: democrat (6.0) : education spending = y: democrat (9.0) : education spending = u: republican (1.0) physician fee freeze = y: :...synfuels corporation cutback = n: republican (97.0/3.0) synfuels corporation cutback = u: republican (4.0) synfuels corporation cutback = y: :...duty free exports = y: democrat (2.0) duty free exports = u: republican (1.0) duty free exports = n: :...education spending = n: democrat (5.0/2.0) education spending = y: republican (13.0/2.0) education spending = u: democrat (1.0) physician fee freeze = u: :...water project cost sharing = n: democrat (0.0) water project cost sharing = y: democrat (4.0) water project cost sharing = u: :...mx missile = n: republican (0.0) mx missile = y: democrat (3.0/1.0) mx missile = u: republican (2.0)

78

C4.5 postpruning

Pruned tree

The original tree had 17 leaves, the pruned one has 5 leaves.

Subtrees have been replaced with leaves

physician fee freeze = n: democrat (168.0/2.6) physician fee freeze = y: republican (123.0/13.9) physician fee freeze = u: :...mx missile = n: democrat (3.0/1.1) mx missile = y: democrat (4.0/2.2) mx missile = u: republican (2.0/1.0)

A subtree has been replaced with its most frequently used branch.

There are 123 training cases in the leaf. If 123 new cases were classified, 13.9 cases would be misclassified (a very pessimistic estimate).

79

C4.5 postpruning

physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n:
:   :...education spending = n: democrat (6.0)
:       education spending = y: democrat (9.0)
:       education spending = u: republican (1.0)

The subtree has been replaced with a leaf: physician fee freeze = n: democrat (168.0/2.6). There are 168 training cases in the leaf; one of them is misclassified by the leaf. If 168 new cases were classified, 2.6 cases would be misclassified (a very pessimistic estimate).

80

C4.5 postpruning

physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n:
:   :...education spending = n: democrat (6.0)
:       education spending = y: democrat (9.0)
:       education spending = u: republican (1.0)

The very pessimistic estimate: the sum of the predicted errors over the education spending leaves is 3.273.

If the subtree is replaced with the leaf

adoption of the budget resolution = n: democrat (16.0/2.512)

one training case is misclassified, and the very pessimistic estimate of the number of predicted errors is 2.512. Since 2.512 < 3.273, the subtree is pruned:

physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n: democrat (16.0/2.512)

81

C4.5 postpruning

physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n: democrat (16.0/2.512)

82

C4.5 postpruning

N is the number of training cases in the leaf. E is the number of predicted errors if a set of N unseen cases were classified by the tree. The sum of the predicted errors over the leaves, divided by the size of the training set (the number of training cases), provides an immediate estimate of the error rate of the pruned tree on new cases:

20.8 / 300 = 0.069 (6.9%) (the pruned tree is predicted to misclassify 6.9% of new cases)

83

C4.5 - postpruning

Results for the Congressional voting data

Training set (300 cases):
Complete tree: 25 nodes, 8 errors (2.7%)
Pruned tree: 7 nodes, 13 errors (4.3%)

Unseen cases: the pruned tree makes 7 errors (5.2%).

10-fold cross-validation gives an error rate of 5.3% on new cases (the average predicted, very pessimistic error rate on new cases is 5.6%).

84

DTI - pros

Construction of a tree does not (necessarily) require any parameter setting
Can handle high-dimensional data
Can handle heterogeneous data
Nonparametric approach
The representation form is intuitive and relatively easy to interpret

85

DTI - pros

Learning: the complexity depends on the number of nodes, cases and attributes.

In each node: O(p · n); with quantitative attributes O(p · n · log n), where n = the number of cases in the node and p = the number of attributes.

Classification: O(w), where w is the maximum depth of the tree. An "eager" method: training is computationally more expensive than classification.

Quite robust to the presence of noise In general, good classification accuracy comparable with other classification methods

86

Decision tree algorithms divide the training data into smaller and smaller subsets in a recursive fashion. Problems

Data fragmentation

Number of instances at the leaf nodes can be too small to make any statistically significant decision

87

Repetition: the same attribute is tested several times along a path of the tree.

Replication: a decision tree contains duplicate subtrees.

88

The borderline between two neighbouring regions of different classes is known as the decision boundary. The decision boundary is parallel to the axes because each test condition involves a single attribute at a time.

89

Example (figure): the oblique split x + y < 1 separates Class = + from Class = −.

Multivariate splits are based on a combination of attributes and give a more expressive representation. The use of multivariate splits can prevent the problems of fragmentation, repetition and replication, but finding the optimal test condition is computationally expensive.

90

Decision tree induction is a widely studied topic - different kinds of enhancements to the basic algorithm have been developed:

challenges arising from real-world data: quantitative attributes, missing values, noise, outliers
multivariate decision trees
incremental decision tree induction

Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed

91

C4.5 is a kind of reference algorithm used in machine learning research. In this course we will use See5, a descendant of C4.5:

http://www.rulequest.com/download.html

The source code of C4.5 (written in C) is freely available for research and teaching from

http://www.rulequest.com/Personal/c4.5r8.tar.gz

92

References

These slides are partly based on the slides accompanying the books:

Han J, Kamber M. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006. http://www-sal.cs.uiuc.edu/~hanj/bk2/
Tan P-N, Steinbach M, Kumar V. Introduction to Data Mining. Addison-Wesley, 2006. http://www-users.cs.umn.edu/~kumar/dmbook/

Hand D, Mannila H, Smyth P. Principles of Data Mining. MIT Press, 2001.
Mitchell TM. Machine Learning. McGraw-Hill, 1997.
Quinlan JR. Induction of decision trees. Machine Learning 1: 81-106, 1986.
Quinlan JR. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
Quinlan JR. See5. http://www.rulequest.com

93
