
DECISION TREES

Introduction

It is a method that induces concepts from examples (inductive learning)

It is the most widely used & practical learning method

The learning is supervised: i.e. the classes or categories of the data instances are known

It represents concepts as decision trees (which can be rewritten as if-then rules)

The target function can be Boolean or discrete valued

Decision Tree Representation

1. Each node corresponds to an attribute
2. Each branch corresponds to an attribute value
3. Each leaf node assigns a classification

Example

[Figure: a decision tree for the concept PlayTennis. The root tests Outlook (Sunny, Overcast, Rain); the Sunny branch tests Humidity (High, Normal) and the Rain branch tests Wind (Strong, Weak)]

Decision Tree Representation

Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances

Each path from the tree root to a leaf corresponds to a conjunction of attribute tests (one rule for classification)

The tree itself corresponds to a disjunction of these conjunctions (a set of rules for classification)

An unknown observation is classified by testing its attributes and reaching a leaf node
Basic Decision Tree Learning Algorithm

Most algorithms for growing decision trees are variants of a basic algorithm

An example of this core algorithm is the ID3 algorithm developed by Quinlan (1986)

It employs a top-down, greedy search through the space of possible decision trees

Basic Decision Tree Learning Algorithm

First of all we select the best attribute to be tested at the root of the tree

For making this selection each attribute is evaluated using a statistical test to determine how well it alone classifies the training examples

We have:
- 14 observations (D1 … D14)
- 4 attributes:
  • Outlook
  • Temperature
  • Humidity
  • Wind
- 2 classes (Yes, No)

[Figure: the 14 observations D1–D14 shown as a single, unpartitioned set]

Basic Decision Tree Learning Algorithm

[Figure: Outlook is selected at the root; the observations D1–D14 are partitioned among the Sunny, Overcast and Rain branches]

The selection process is then repeated using the training examples associated with each descendant node to select the best attribute to test at that point in the tree
[Figure: the partially grown tree, with the training examples that reach the Sunny branch of Outlook highlighted]

What is the "best" attribute to test at this point? The possible choices are Temperature, Wind & Humidity

Basic Decision Tree Learning Algorithm

This forms a greedy search for an acceptable decision tree, in which the algorithm never backtracks to reconsider earlier choices

Which Attribute is the Best Classifier?

The central choice in the ID3 algorithm is selecting which attribute to test at each node in the tree

We would like to select the attribute which is most useful for classifying examples

For this we need a good quantitative measure

For this purpose a statistical property, called information gain, is used

Which Attribute is the Best Classifier?: Definition of Entropy

In order to define information gain precisely, we begin by defining entropy, a measure commonly used in information theory

Entropy characterizes the impurity of an arbitrary collection of examples

Which Attribute is the Best Classifier?: Definition of Entropy

Suppose we have four possible values of a variable X: A, B, C, D

These values are independent and occur randomly

You might transmit these values over a binary serial link by encoding each reading with two bits:

A = 00   B = 01   C = 10   D = 11

We might see something like this: 0100001001001110110011

Someone tells you that their probabilities of occurrence are not equal:

p(A) = 1/2
p(B) = 1/4
p(C) = 1/8
p(D) = 1/8

It is now possible to invent a coding that only uses 1.75 bits on average per symbol for the transmission, e.g.

A = 0   B = 10   C = 110   D = 111
Which Attribute is the Best Classifier?: Definition of Entropy

Suppose X can have m values, V1, V2, …, Vm, with probabilities: p1, p2, …, pm

The smallest number of bits, on average, per value, needed to transmit a stream of values of X is

- p1 log2(p1) - p2 log2(p2) - … - pm log2(pm) = - Σi pi log2(pi)

If all p's are equal for a given m, we need the highest number of bits for transmission

If there are m possible values of an attribute, then the entropy can be as large as log2(m)

If one p = 1 and all other p's are 0, then we need 0 bits (i.e. we don't need to transmit anything)
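A small Python sketch of this quantity (the helper name and the probabilities below simply mirror the A/B/C/D example; the code itself is not part of the original slides):

```python
import math

def entropy(probs):
    """H = -sum(p * log2(p)), skipping zero probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0  bits: four equally likely values
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits: the unequal A/B/C/D example
print(entropy([1.0, 0.0, 0.0, 0.0]))       # 0 bits: one certain value (prints -0.0)
```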

Which Attribute is the Best Classifier?: Definition of Entropy

This formula is called Entropy H:

H(X) = - Σi pi log2(pi)

High Entropy means that the examples have more equal probabilities of occurrence (and are therefore not easily predictable)

Low Entropy means easy predictability

Which Attribute is the Best Classifier?: Information Gain

Suppose we are trying to predict output Y (Like Film Gladiator) & we have input X (College Major = v)

[Figure: the examples split by Major into the branches Math, CS and History]

Which Attribute is the Best Classifier?: Information Gain

We have H(X) = 1.5 and H(Y) = 1.0

Conditional Entropy H(Y | X = v): the Entropy of Y among only those records in which X = v

Conditional Entropy of Y for the example:
H(Y | X = Math) = 1.0
H(Y | X = History) = 0
H(Y | X = CS) = 0

[Figure: the examples split by Major into Math, CS and History]
Which Attribute is the Best Classifier?: Information Gain

Average Conditional Entropy of Y:

H(Y | X) = Σv P(X = v) H(Y | X = v)

For the example, H(Y | X) = (0.5)(1.0) + (0.25)(0) + (0.25)(0) = 0.5

Information Gain is the expected reduction in entropy caused by partitioning the examples according to an attribute's value

Info Gain (Y | X) = H(Y) – H(Y | X) = 1.0 – 0.5 = 0.5

For transmitting Y, this is how many bits would be saved if both sides of the line knew X

In general, we write Gain (S, A), where S is the collection of examples & A is an attribute
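The conditional-entropy bookkeeping can be sketched in a few lines of Python. The eight (Major, Likes Gladiator) records below are hypothetical, chosen only so that they reproduce the slide's numbers H(Y) = 1.0, H(Y | X) = 0.5 and Gain = 0.5; none of this code comes from the original deck.

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum(p * log2(p)) over the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, x_index, y_index):
    """Gain(Y | X) = H(Y) - sum_v P(X = v) * H(Y | X = v)."""
    y = [r[y_index] for r in rows]
    h_y_given_x = 0.0
    for v in set(r[x_index] for r in rows):
        subset = [r[y_index] for r in rows if r[x_index] == v]
        h_y_given_x += (len(subset) / len(rows)) * entropy(subset)
    return entropy(y) - h_y_given_x

data = [("Math", "Yes"), ("Math", "Yes"), ("Math", "No"), ("Math", "No"),
        ("CS", "Yes"), ("CS", "Yes"), ("History", "No"), ("History", "No")]
print(info_gain(data, 0, 1))   # 0.5
```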

Which Attribute is the Best Classifier?: Information Gain

Let's investigate the attribute Wind

The collection of examples has 9 positive values and 5 negative ones

Eight (6 positive and 2 negative ones) of these examples have the attribute value Wind = Weak

Six (3 positive and 3 negative ones) of these examples have the attribute value Wind = Strong

The information gain obtained by separating the examples according to the attribute Wind is calculated as:

Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S_Weak) - (6/14) Entropy(S_Strong)
             = 0.940 - (8/14)(0.811) - (6/14)(1.000)
             = 0.048

where Entropy(S) = Entropy([9+, 5-]) = 0.940, Entropy(S_Weak) = Entropy([6+, 2-]) = 0.811, and Entropy(S_Strong) = Entropy([3+, 3-]) = 1.000

We calculate the Info Gain for each attribute and select the attribute having the highest Info Gain
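The arithmetic above can be checked with a short script (the entropy helper is an assumption of this sketch, not code from the slides):

```python
import math

def entropy(pos, neg):
    """Entropy of a collection with pos positive and neg negative examples."""
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c)

h_s      = entropy(9, 5)   # ~0.940: the whole collection [9+, 5-]
h_weak   = entropy(6, 2)   # ~0.811: Wind = Weak  [6+, 2-]
h_strong = entropy(3, 3)   #  1.000: Wind = Strong [3+, 3-]

gain_wind = h_s - (8/14) * h_weak - (6/14) * h_strong
print(round(gain_wind, 3))   # ~0.048
```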
Select Attributes which Minimize Disorder

Make the decision tree by selecting tests which minimize disorder (maximize gain)

Select Attributes which Minimize Disorder

The formula can be converted from log2 to log10:

log_x(M) = log10(M) · log_x(10) = log10(M) / log10(x)

Hence log2(Y) = log10(Y) / log10(2)

Example

Which attribute should be selected as the first test?

"Outlook" provides the most information
Example

The process of selecting a new attribute is now repeated for each (non-terminal) descendant node, this time using only the training examples associated with that node

Attributes that have been incorporated higher in the tree are excluded, so that any given attribute can appear at most once along any path through the tree

This process continues for each new leaf node until either:

1. Every attribute has already been included along this path through the tree

2. The training examples associated with a leaf node have zero entropy
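Putting the pieces together, the growth procedure described above can be summarised as a short recursive sketch. This is only an illustration under assumed conventions (examples as dictionaries, trees as (attribute, {value: subtree}) pairs); it is not the deck's own ID3 code.

```python
import math
from collections import Counter

def entropy(examples, target):
    n = len(examples)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(e[target] for e in examples).values())

def info_gain(examples, attr, target):
    remainder = 0.0
    for v in set(e[attr] for e in examples):
        subset = [e for e in examples if e[attr] == v]
        remainder += (len(subset) / len(examples)) * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, target):
    """Top-down greedy growth; returns a class label (leaf) or
    an (attribute, {value: subtree}) pair (internal node)."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:               # zero entropy: stop
        return labels[0]
    if not attributes:                      # every attribute already used: stop
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, a, target))
    branches = {}
    for v in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == v]
        branches[v] = id3(subset, [a for a in attributes if a != best], target)
    return (best, branches)
```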

Example

[Figure: the resulting decision tree]

From Decision Trees to Rules

Next Step: Make rules from the decision tree

After making the identification tree, we trace each path from the root node to a leaf node, recording the test outcomes as antecedents and the leaf node classification as the consequent

For our example we have:

If the Outlook is Sunny and the Humidity is High then No
If the Outlook is Sunny and the Humidity is Normal then Yes
...
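A minimal sketch of this tracing step, assuming the (attribute, {value: subtree}) tree representation used in the earlier sketch (again, illustrative code rather than anything from the slides):

```python
def tree_to_rules(tree, antecedents=()):
    """One (antecedents, consequent) rule per root-to-leaf path."""
    if not isinstance(tree, tuple):                  # leaf: a class label
        return [(list(antecedents), tree)]
    attribute, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, antecedents + ((attribute, value),)))
    return rules

# e.g. [([('Outlook', 'Sunny'), ('Humidity', 'High')], 'No'), ...]
```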

Hypothesis Space Search

ID3 can be characterized as searching a space of hypotheses for one that fits the training examples

The space searched is the set of possible decision trees

ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space

It begins with an empty tree, then considers more and more elaborate hypotheses in search of a decision tree that correctly classifies the training data

The evaluation function that guides this hill-climbing search is the information gain measure
Hypothesis Space Search

Some points to note:

• The hypothesis space of all decision trees is a complete space. Hence the target function is guaranteed to be present in it.

• ID3 maintains only a single current hypothesis as it searches through the space of decision trees. By determining only a single hypothesis, ID3 loses the capabilities that follow from explicitly representing all consistent hypotheses. For example, it does not have the ability to determine how many alternative decision trees are consistent with the training data, or to pose new instance queries that optimally resolve among these competing hypotheses.

Hypothesis Space Search

• ID3 performs no backtracking, therefore it is susceptible to converging to locally optimal solutions

• ID3 uses all training examples at each step to refine its current hypothesis. This makes it less sensitive to errors in individual training examples. However, this requires that all the training examples are present right from the beginning and the learning cannot be done incrementally over time.

Learning Bias during Induction

• Given a collection of training examples, there are typically many decision trees consistent with these examples

• Describing the inductive bias of ID3 means describing the basis by which it chooses one of these consistent hypotheses over the others

Learning Bias during Induction

• We cannot describe the bias precisely, but we can say approximately that:

- Its selection prefers shorter trees over longer ones

- Trees that place high info. gain attributes close to the root are preferred over those that do not

• We can say "it absolutely prefers shorter trees over longer ones" if there is an algorithm such that:

- It begins with an empty tree and searches breadth first through progressively more complex trees, first considering "all" trees of depth 1, then "all" trees of depth 2, etc.

- Once it finds a decision tree consistent with the training data, it returns the smallest consistent tree
Learning Bias during Induction

• Is this bias for shorter trees a sound basis for generalization beyond the training data?

• William of Occam, in the year 1320, thought of the following bias (called Occam's razor): Prefer the simplest hypothesis that fits the data

• One argument in its favor is that because there are fewer short hypotheses than long ones, it is less likely that a short hypothesis coincidentally fits the training data

• Example: Let there be a small set of 20 training examples. We might expect to be able to find many more 500-node decision trees consistent with these examples than 5-node decision trees. We might therefore believe that a 5-node tree is less likely to be a statistical coincidence, and prefer this hypothesis over the 500-node hypothesis.

Decision Trees: Issues in Learning

Practical issues in learning decision trees include:

• How deeply to grow the decision tree
• Handling continuous attributes
• Choosing an appropriate attribute selection measure
• Handling training data with missing attribute values
• Handling attributes with differing costs

Avoiding Over-fitting the Data

The ID3 algorithm grows each branch of the tree just deeply enough to perfectly classify the training examples

While this is sometimes a reasonable strategy, in fact it can lead to difficulties when there is noise in the data, or when the number of training examples is too small to produce a representative sample of the true target function

In either of these cases, ID3 can produce trees that over-fit the training examples

Avoiding Over-fitting the Data

A hypothesis over-fits the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances (i.e. including instances beyond the training set)

[Figure: the impact of over-fitting in a typical application of decision tree learning]
Avoiding Over-fitting the Data

Over-fitting can occur when the training examples contain random errors or noise

Example: If we have the following incorrect training observation

Sunny, Hot, Normal Humidity, Strong Wind = No Tennis

The decision tree will become more complex to accommodate this training observation

Avoiding Over-fitting the Data

Over-fitting can also occur when a small number of training examples is associated with leaf nodes

In this case, it is possible for coincidental regularities to occur, in which some attribute happens to partition the examples very well, despite being unrelated to the actual target function

Whenever such coincidental regularities exist, there is a risk of over-fitting

Example: If Day is an attribute, and we have only one or two observations for each day

There are several approaches to avoid over-fitting

One popular approach is to prune over-fit trees

A key question is: what criterion is to be used to determine the correct final tree size?

A commonly used practice is to use a separate set of examples, distinct from the training examples (called a validation set), for post-pruning nodes

Avoiding Over-fitting the Data

In this approach, the available observations are separated into two sets:

- A training set: which is used to learn the decision tree
- A validation set: which is used to prune the tree

The motivation: Even though the learner may be misled by random errors and coincidental regularities within the training set, the validation set is unlikely to exhibit the same random fluctuations

Therefore, the validation set can be expected to provide a safety check against over-fitting the spurious characteristics of the training set

Of course, it is important that the validation set be large enough to itself provide a statistically significant sample of the instances

One common heuristic is to withhold one-third of the available examples for the validation set, using the other two-thirds for training
Avoiding Over-fitting the Data: Reduced Error Pruning

One approach is called "reduced error pruning"

It is a form of backtracking in the hill-climbing search of the decision tree hypothesis space

It considers each of the decision nodes in the tree to be a candidate for pruning

Pruning a decision node consists of
- removing the sub-tree rooted at that node, making it a leaf node, and
- assigning it the most common classification of the training examples affiliated with that node

Nodes are removed only if the resulting pruned tree performs no worse than the original over the validation set

This has the effect that any leaf node added due to coincidental regularities in the training set is likely to be pruned, because these same coincidences are unlikely to occur in the validation set

Nodes are pruned iteratively, always choosing the node whose removal most increases the decision tree accuracy over the validation set

Avoiding Over-fitting the Data: Reduced Error Pruning

Pruning of nodes continues until further pruning is harmful (i.e. decreases the accuracy of the tree over the validation set)

Here, the available data has been split into three sub-sets:
- the training examples
- the validation examples for pruning
- the test examples used to provide an unbiased estimate of the accuracy of the pruned tree
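A rough Python sketch of reduced-error pruning. It assumes internal nodes are dictionaries of the form {"attr": ..., "branches": {value: child}, "majority": most common training classification at the node} with plain class labels as leaves, and for simplicity it never prunes the root itself; these conventions are assumptions of the sketch, not something defined on the slides.

```python
def classify(node, example):
    """Follow attribute tests until a leaf (a plain class label) is reached."""
    while isinstance(node, dict):
        node = node["branches"].get(example[node["attr"]], node["majority"])
    return node

def accuracy(node, examples, target):
    return sum(classify(node, e) == e[target] for e in examples) / len(examples)

def prunable(node):
    """(parent, branch value, child) triples for every internal child node."""
    triples = []
    if isinstance(node, dict):
        for value, child in node["branches"].items():
            triples.extend(prunable(child))
            if isinstance(child, dict):
                triples.append((node, value, child))
    return triples

def reduced_error_prune(root, validation, target):
    """Iteratively turn into a leaf the internal node whose removal most
    improves (and never hurts) accuracy over the validation set."""
    while True:
        base = accuracy(root, validation, target)
        best_gain, best_edit = None, None
        for parent, value, child in prunable(root):
            parent["branches"][value] = child["majority"]   # tentatively prune
            gain = accuracy(root, validation, target) - base
            parent["branches"][value] = child               # undo
            if best_gain is None or gain > best_gain:
                best_gain, best_edit = gain, (parent, value, child)
        if best_edit is None or best_gain < 0:
            return root
        parent, value, child = best_edit
        parent["branches"][value] = child["majority"]       # commit the best prune
```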

Avoiding Over-fitting the Data: Reduced Error Pruning

The major drawback of this approach is that when data is limited, withholding part of it for the validation set reduces even further the number of examples available for training

Avoiding Over-fitting the Data: Rule Post-Pruning

Rule post-pruning involves the following steps:

1. Infer the decision tree from the training set (allowing over-fitting to occur)
2. Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node
3. Prune (generalize) each rule by removing any preconditions whose removal improves its estimated accuracy
4. Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances
Avoiding Over-fitting the Data: Rule Post-Pruning

Example: If (Outlook = sunny) and (Humidity = high) then Play Tennis = no

Rule post-pruning would consider removing the preconditions one by one

It would select whichever of these removals produced the greatest improvement in estimated rule accuracy, then consider pruning the second precondition as a further pruning step

No pruning is done if it reduces the estimated rule accuracy

The main advantage of this approach:

Each distinct path through the decision tree produces a distinct rule

Hence removing a precondition in a rule does not mean that it has to be removed from other rules as well

In contrast, in the previous approach, the only two choices would be to remove the decision node completely, or to retain it in its original form


Continuous Valued Attributes

• If an attribute has continuous values, we can dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals

• In particular, for an attribute A that is continuous valued, the algorithm can dynamically create a new Boolean attribute Ac that is true if A < c and false otherwise

• The only question is how to select the best value for the threshold c

• Example: Let the training examples associated with a particular node have the following values for the continuous valued attribute Temperature and the target attribute Play Tennis

Temperature: 40  48  60  72  80  90
Play Tennis: No  No  Yes Yes Yes No

Continuous Valued Attributes

• We sort the examples according to the continuous attribute A

• Then we identify adjacent examples that differ in their target classification

• We generate a set of candidate thresholds midway between the corresponding values of A

• These candidate thresholds can then be evaluated by computing the information gain associated with each

• In the current example, there are two candidate thresholds, corresponding to the values of Temperature at which the value of Play Tennis changes: (48 + 60)/2 and (80 + 90)/2

• The information gain is computed for each of these attributes, Temperature > 54 and Temperature > 85, and the best is selected (Temperature > 54)
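The threshold-selection step for the Temperature example might look like this in Python (only the slide's six values are used; the helper functions are assumptions of this sketch):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_for_threshold(values, labels, c):
    """Information gain of the Boolean test value > c."""
    below = [y for v, y in zip(values, labels) if v <= c]
    above = [y for v, y in zip(values, labels) if v > c]
    n = len(labels)
    remainder = (len(below) / n) * entropy(below) + (len(above) / n) * entropy(above)
    return entropy(labels) - remainder

temperature = [40, 48, 60, 72, 80, 90]            # already sorted
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]

# candidate thresholds: midpoints where the target classification changes
candidates = [(temperature[i] + temperature[i + 1]) / 2
              for i in range(len(temperature) - 1)
              if play_tennis[i] != play_tennis[i + 1]]
print(candidates)                                  # [54.0, 85.0]

best = max(candidates, key=lambda c: gain_for_threshold(temperature, play_tennis, c))
print(best)                                        # 54.0 (Temperature > 54 wins)
```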

Continuous Valued Attributes

• This dynamically created Boolean attribute can then compete with the other discrete valued candidate attributes available for growing the decision tree

• An extension to this approach is to split the continuous attribute into multiple intervals rather than just two (i.e. the attribute becomes multi-valued, instead of Boolean)

Training Examples with Missing Attribute Values

• In certain cases, the available data may have some examples with missing values for some attributes

• In such cases the missing attribute value can be estimated based on other examples for which this attribute has a known value

• Suppose Gain(S, A) is to be calculated at node n in the decision tree to evaluate whether the attribute A is the best attribute to test at this decision node

• Suppose that <x, c(x)> is one of the training examples with the value A(x) unknown
Training Examples with Missing Attribute Values

• One strategy for filling in the missing value: Assign it the value most common for the attribute A among the training examples at node n

• Alternatively, we might assign it the most common value among the examples at node n that have the classification c(x)

• The training example using the estimated value can then be used directly by the decision tree learning algorithm

• Another procedure is to assign a probability to each of the possible values of A (rather than assigning only the highest probability value)

• These probabilities can be estimated by observing the frequencies of the various values of A among the examples at node n

• For example, given a Boolean attribute A, if node n contains six known examples with A = 1 and four with A = 0, then we would say the probability that A(x) = 1 is 0.6 and the probability that A(x) = 0 is 0.4

Training Examples with Missing Attribute Values

• A fractional 0.6 of instance x is distributed down the branch for A = 1, and a fractional 0.4 of x down the other tree branch

• These fractional examples, along with the other "integer" examples, are used for the purpose of computing information Gain

• This method for handling missing attribute values is used in C4.5

Classification of Instances with Missing Attribute Values

• The fractioning of examples can also be applied to classify new instances whose attribute values are unknown

• In this case, the classification of the new instance is simply the most probable classification, computed by summing the weights of the instance fragments classified in different ways at the leaf nodes of the tree
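A small sketch of fraction-based classification. It assumes internal nodes carry the branch frequencies observed at that node, i.e. nodes of the form (attribute, {value: (branch_fraction, subtree)}) with class labels as leaves; this weighted representation is an assumption of the sketch, not something defined on the slides.

```python
from collections import defaultdict

def classify_fractional(node, example, weight=1.0, totals=None):
    """Accumulate {label: weight}; a missing value splits the instance
    across branches in proportion to the stored branch fractions."""
    if totals is None:
        totals = defaultdict(float)
    if not isinstance(node, tuple):                 # leaf: collect this fragment
        totals[node] += weight
        return totals
    attribute, branches = node
    value = example.get(attribute)                  # None means "missing"
    if value in branches:
        _, subtree = branches[value]
        classify_fractional(subtree, example, weight, totals)
    else:
        for fraction, subtree in branches.values():
            classify_fractional(subtree, example, weight * fraction, totals)
    return totals

# prediction = max(totals, key=totals.get)   # the most probable classification
```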

Handling Attributes with Differing Costs

• In some learning tasks, the attributes may have associated costs

• For example, we may have attributes such as Temperature, Biopsy Result, Pulse, Blood Test Result, etc.

• These attributes vary significantly in their costs (monetary costs, patient comfort, time involved)

• In such tasks, we would prefer decision trees that use low-cost attributes where possible, relying on high-cost attributes only when needed to provide reliable classifications

• In ID3, attribute costs can be taken into account by introducing a cost term into the attribute selection measure

• For example, we might divide the Gain by the cost of the attribute, so that lower-cost attributes would be preferred

• Such cost-sensitive measures do not guarantee finding an optimal cost-sensitive decision tree

• However, they do bias the search in favor of low-cost attributes
Handling Attributes with Differing Costs

• Another example of a cost-sensitive selection measure is:

Gain²(S, A) / Cost(A)

where S = collection of examples & A = attribute

• Yet another selection measure can be

(2^Gain(S, A) – 1) / (Cost(A) + 1)^w

where w ∈ [0, 1] is a constant that determines the relative importance of cost versus information gain

Alternate Measures for Selecting Attributes

• There is a problem in the information gain measure. It favors attributes with many values over those with few values

• Example: An attribute "Date" would have the highest information gain (as it would alone perfectly fit the training data)

• To cushion this problem the Info. Gain is divided by a term called "Split Info"
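For illustration only, the two cost-sensitive measures can be written directly in Python (the gain and cost numbers in the comparison are invented, not taken from the slides):

```python
def measure_gain_squared(gain, cost):
    """Gain^2(S, A) / Cost(A)"""
    return gain ** 2 / cost

def measure_exponential(gain, cost, w=0.5):
    """(2^Gain(S, A) - 1) / (Cost(A) + 1)^w, with w in [0, 1]."""
    return (2 ** gain - 1) / (cost + 1) ** w

# a cheap, weakly informative attribute vs. an expensive, highly informative one
cheap, expensive = (0.15, 1.0), (0.60, 50.0)
print(measure_gain_squared(*cheap), measure_gain_squared(*expensive))
print(measure_exponential(*cheap), measure_exponential(*expensive))
```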

Alternate Measures for Selecting Attributes

Split Info (S, A) = - Σi (|Si| / |S|) log2(|Si| / |S|)

where Si is the subset of S for which A has value vi

Note that the attribute A can take on c different values, e.g. if A = Outlook, then v1 = Sunny, v2 = Rain, v3 = Overcast

When the Gain is divided by the Split Information, the measure is called Gain Ratio

• Example:

Let there be 100 training examples at a node A1, with 100 branches (one sliding down each branch)

Split Info (S, A1) = - 100 * (1/100) * log2(1/100) = log2(100) = 6.64

Let there be 100 training examples at a node A2, with 2 branches (50 sliding down each branch)

Split Info (S, A2) = - 2 * (50/100) * log2(0.5) = 1
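A couple of lines of Python reproduce the two Split Info values (the helper names are assumptions of this sketch):

```python
import math

def split_info(subset_sizes):
    """-sum(|Si|/|S| * log2(|Si|/|S|)) over the branches of an attribute."""
    total = sum(subset_sizes)
    return -sum((s / total) * math.log2(s / total) for s in subset_sizes if s)

def gain_ratio(gain, subset_sizes):
    return gain / split_info(subset_sizes)

print(split_info([1] * 100))   # ~6.64: 100 examples, one per branch
print(split_info([50, 50]))    # 1.0  : 100 examples, two equal branches
```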

Alternate Measures for Selecting Attributes

• Problem with this solution!!!

The denominator can be zero or very small when |Si| ≈ |S| for one of the Si

To avoid selecting attributes purely on this basis, we can adopt some heuristic such as first calculating the Gain of each attribute, then applying the Gain Ratio test only to those attributes with above-average Gain

Decision Boundaries

[Figure: decision boundaries]
Advantages

• Easy Interpretation: They reveal relationships between the rules, which can be derived from the tree. Because of this it is easy to see the structure of the data.

• We can occasionally get clear interpretations of the categories (classes) themselves from the disjunction of rules produced, e.g. Apple = (green AND medium) OR (red AND medium)

Advantages

• Classification is rapid & computationally inexpensive

• Trees provide a natural way to incorporate prior knowledge from human experts

Disadvantages

• They may generate very complex (long) rules, which are very hard to prune

• They can generate a large number of rules. Their number can become excessively large unless pruning techniques are used to make them more comprehensible.

• They require large amounts of memory to store the entire tree for deriving the rules

Disadvantages

• They do not easily support incremental learning. Although ID3 would still work if examples are supplied one at a time, it would grow a new decision tree from scratch every time a new example is given.

• There may be portions of the concept space which are not labeled
e.g. If low income and bad credit history then high risk
but what about low income and good credit history?

Appropriate Problems for Decision Tree Learning

• Instances are represented by discrete attribute-value pairs (though the basic algorithm has been extended to real-valued attributes as well)

• The target function has discrete output values

• Disjunctive hypothesis descriptions may be required

• The training data may contain errors

• The training data may contain missing attribute values
Reference

Sections 3.1 – 3.7.1 of T. Mitchell, Machine Learning (McGraw-Hill, 1997)
