
DECISION TREE LEARNING

1
Decision Tree Classification Algorithm
• Decision Tree is a supervised learning technique that can be used for both
classification and regression problems, but it is mostly preferred for solving
classification problems.
• It is a tree-structured classifier, where internal nodes represent the features of a
dataset, branches represent the decision rules, and each leaf node represents
the outcome.
• There are two types of nodes: decision nodes and leaf nodes. Decision
nodes are used to make decisions and have multiple branches, whereas leaf
nodes are the outputs of those decisions and do not contain any further branches.
• The decisions or tests are performed on the basis of the features of the given
dataset.
• It is a graphical representation for obtaining all possible solutions to a
problem/decision based on the given conditions.

2
Example Tree

3
Prediction By Decision Tree
• An instance is classified by starting at the root node of the tree, testing the
attribute specified by this node, then moving down the tree branch
corresponding to the value of the attribute in the given example. This process is
then repeated for the subtree rooted at the new node.
• For example, the instance
<Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong>
is sorted down the tree, and the tree predicts that PlayTennis = No.
• In general, decision trees represent a disjunction of conjunctions of constraints
on the attribute values of instances. For the tree above, the paths predicting PlayTennis = Yes correspond to
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
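
The traversal described above can be written directly as nested conditionals. The Python sketch below encodes the example PlayTennis tree; the function name predict_play_tennis is just for illustration, and Temperature is accepted but never tested because the tree does not use it.

```python
# A minimal sketch of the example PlayTennis tree, encoded as nested conditionals.
def predict_play_tennis(outlook, temperature, humidity, wind):
    if outlook == "Sunny":
        # Under Sunny, the tree tests Humidity.
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Rain":
        # Under Rain, the tree tests Wind.
        return "Yes" if wind == "Weak" else "No"
    raise ValueError("Unknown Outlook value: " + str(outlook))

# The instance from the slide: <Sunny, Hot, High, Strong> -> "No"
print(predict_play_tennis("Sunny", "Hot", "High", "Strong"))
```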

4
APPROPRIATE PROBLEMS FOR DECISION
TREE LEARNING
• Instances are represented by attribute-value pairs. For example, the attribute
Temperature can take a small number of disjoint possible values (e.g., Hot,
Mild, Cold).
• The target function has discrete output values (e.g., Yes or No) for each
example.
• The training data may contain errors.
• The training data may contain missing attribute values.
• Decision tree learning has therefore been applied to problems such as
learning to classify medical patients by their disease, equipment
malfunctions by their cause, and loan applicants by their likelihood of
defaulting on payments.

5
Decision Tree Terminologies
• Root Node: The root node is where the decision tree starts. It represents the entire
dataset, which then gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split
further after a leaf node is reached.
• Splitting: Splitting is the process of dividing a decision node (or the root node) into
sub-nodes according to the given conditions.
• Branch/Sub-Tree: A subtree formed by splitting the tree.
• Pruning: Pruning is the process of removing unwanted branches from the tree.
• Parent/Child node: A node that is split into sub-nodes is called the parent node of
those sub-nodes, and the sub-nodes are called its child nodes.

6
How to create a Decision Tree?
• Step-1: Begin the tree with the root node, say S, which contains the
complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection
Measure (ASM).
• Step-3: Divide S into subsets, one for each possible value of the best
attribute.
• Step-4: Generate the decision tree node that contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the
dataset created in Step-3. Continue this process until a stage is reached
where the nodes cannot be classified further; such final nodes are called
leaf nodes.

7
Attribute Selection Measures
• While implementing a decision tree, the main issue is how to
select the best attribute for the root node and for the sub-nodes.
• To solve such problems, we use a technique called
an Attribute Selection Measure (ASM).
• There are two popular techniques for ASM, which are:
• Information Gain
• Gini Index

8
Information Gain and Entropy
• Information gain measures how well a given attribute separates the training
examples according to their target classification.
• The information gain is defined precisely using a measure from information theory
called entropy.
• Entropy characterizes the (im)purity of an arbitrary collection of examples.
• Given a collection S containing positive and negative examples of some target
concept, the entropy of S relative to this boolean classification is

  Entropy(S) = -p⊕ log2(p⊕) - p⊖ log2(p⊖)

• where p⊕ is the proportion of positive examples in S and p⊖ is the proportion of
negative examples in S.
• In all calculations involving entropy we define 0 log 0 to be 0.
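
A minimal Python sketch of this entropy formula (the function name entropy is illustrative; it works for any number of classes and treats 0·log 0 as 0 by skipping zero counts):

```python
import math

def entropy(labels):
    """Entropy of a collection of class labels, in bits (0 log 0 is taken as 0)."""
    total = len(labels)
    ent = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        if p > 0:
            ent -= p * math.log2(p)
    return ent

# The [9+, 5-] collection used in the example that follows:
print(entropy(["+"] * 9 + ["-"] * 5))   # approximately 0.940
```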
9
Training Data Example: Goal is to Predict When This
Player Will Play Tennis?

10
Example for Entropy
• Suppose S is a collection of 14 examples (as shown in previous table) of
some boolean concept, including 9 positive and 5 negative examples (we
adopt the notation [9+, 5-] to summarize such a sample of data).

• Then the entropy of S relative to this boolean classification is

  Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

• The entropy is 0 if all members of S belong to the same class. For example, if all
members are positive (p⊕ = 1), then p⊖ is 0, and
Entropy(S) = -1·log2(1) - 0·log2(0) = -1·0 - 0·log2(0) = 0.
• Note the entropy is 1 when the collection contains an equal number of
positive and negative examples.
11
Example for Entropy
• If the collection contains unequal numbers of positive and negative examples,
the entropy is between 0 and 1. The figure below shows the form of the entropy
function relative to a boolean classification, as p⊕ varies between 0 and 1.

12
Entropy
• More generally, if the target attribute can take on c different values, then the
entropy of S relative to this c-wise classification is defined as

  Entropy(S) = Σ_{i=1..c} -p_i log2(p_i)

• where p_i is the proportion of S belonging to class i. Note the logarithm is still
base 2 because entropy is a measure of the expected encoding length
measured in bits.

13
INFORMATION GAIN MEASURES: THE EXPECTED
REDUCTION IN ENTROPY
• Information gain is simply the expected reduction in entropy caused by
partitioning the examples according to an attribute.
• More precisely, the information gain Gain(S, A) of an attribute A, relative to a
collection of examples S, is defined as

  Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

• where Values(A) is the set of all possible values for attribute A, and S_v is the
subset of S for which attribute A has value v (i.e., S_v = {s ∈ S | A(s) = v}).
• Note the first term in the equation is just the entropy of the original collection S,
and the second term is the expected value of the entropy after S is partitioned
using attribute A.
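
A small Python sketch of Gain(S, A) over a list of attribute-value dictionaries. The names information_gain and the data layout are assumptions for illustration; the usage line reproduces the Wind example on the next slide.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)."""
    total = len(examples)
    gain = entropy(labels)
    for value in {x[attribute] for x in examples}:
        subset = [lab for x, lab in zip(examples, labels) if x[attribute] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Wind example: S = [9+, 5-], with [6+, 2-] Weak and [3+, 3-] Strong.
examples = [{"Wind": "Weak"}] * 8 + [{"Wind": "Strong"}] * 6
labels = ["+"] * 6 + ["-"] * 2 + ["+"] * 3 + ["-"] * 3
print(round(information_gain(examples, labels, "Wind"), 3))   # 0.048
```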
14
Example
• Suppose S is a collection of training-example days described by attributes including Wind, which
can have the values Weak or Strong. As before, assume S is a collection containing 14 examples,
[9+, 5-].
• Of these 14 examples, suppose 6 of the positive and 2 of the negative examples have Wind =
Weak, and the remainder have Wind = Strong. The information gain due to sorting the original 14
examples by the attribute Wind may then be calculated as

  Values(Wind) = {Weak, Strong}
  S = [9+, 5-], S_Weak = [6+, 2-], S_Strong = [3+, 3-]

  Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S_Weak) - (6/14) Entropy(S_Strong)
                = 0.940 - (8/14)(0.811) - (6/14)(1.00)
                = 0.048

15
Example: Which attribute is the better classifier?

16
Gain(S, Temperature) = 0.94 - (4/14)(1.0) - (6/14)(0.9183) - (4/14)(0.8113) ≈ 0.029
ID3 (Iterative Dichotomiser 3) Algorithm for Decision Trees
1. Calculate the Information Gain of each feature.
2. If all rows do not belong to the same class, split the
dataset S into subsets using the feature for which the Information Gain is
maximum.
3. Make a decision tree node using the feature with the maximum Information
Gain.
4. If all rows belong to the same class, make the current node a leaf node
with the class as its label.
5. Repeat for the remaining features until we run out of features, or the
decision tree consists entirely of leaf nodes.
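
As a rough illustration of these steps, here is a minimal ID3 sketch in Python for purely categorical attributes. It builds a nested-dictionary tree of the form {attribute: {value: subtree_or_label}}; the helper names (entropy, information_gain, id3) and the data layout are assumptions for this sketch, not the original ID3 implementation.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    remainder = 0.0
    for v in {r[attr] for r in rows}:
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == v]
        remainder += (len(subset) / len(rows)) * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attributes):
    # Leaf cases: all examples share one class, or there is nothing left to split on.
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Steps 1-4: choose the attribute with maximum information gain and make it a node.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    # Step 5: recurse on the subset of examples for each value of the chosen attribute.
    for v in {r[best] for r in rows}:
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [lab for r, lab in zip(rows, labels) if r[best] == v]
        tree[best][v] = id3(sub_rows, sub_labels, [a for a in attributes if a != best])
    return tree

# Usage (assuming `rows` is a list of dicts and `labels` the matching class values):
#   tree = id3(rows, labels, ["Outlook", "Temperature", "Humidity", "Wind"])
```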
29
Decision Tree Classification Task

31
Example of a Decision Tree
(Training data: two categorical attributes, Home Owner and Marital Status; one continuous attribute, Income; and a class label, Defaulted.)

Splitting attributes:

  Home Owner?
    Yes -> NO
    No  -> MarSt
             Married          -> NO
             Single, Divorced -> Income
                                   < 80K -> NO
                                   > 80K -> YES

Model: Decision Tree
32
Another Example of Decision Tree
(The same training data can also be fit by a different tree.)

  MarSt
    Married          -> NO
    Single, Divorced -> Home Owner?
                          Yes -> NO
                          No  -> Income
                                   < 80K -> NO
                                   > 80K -> YES

There could be more than one tree that fits the same
data!

33
Apply Model to Test Data
Start from the root of the tree and route the test record down the branches that match its attribute values:

  Home Owner?
    Yes -> NO
    No  -> MarSt
             Married          -> NO
             Single, Divorced -> Income
                                   < 80K -> NO
                                   > 80K -> YES

The test record has Home Owner = No and MarSt = Married, so the path ends at the NO leaf:
Assign Defaulted to "No"
39
ID3 Capabilities and Limitations
1. ID3's hypothesis space of all decision trees is a complete space of finite discrete-valued functions, relative to
the available attributes. Because every finite discrete-valued function can be represented by some decision
tree, ID3 avoids one of the major risks of methods that search incomplete hypothesis spaces (such as
methods that consider only conjunctive hypotheses): that the hypothesis space might not contain the target
function.
2. ID3 uses all training examples at each step in the search to make statistically based decisions regarding how
to refine its current hypothesis. This contrasts with methods that make decisions incrementally, based on
individual training examples (e.g., FIND-S or CANDIDATE-ELIMINATION). One advantage of using
statistical properties of all the examples (e.g., information gain) is that the resulting search is much less
sensitive to errors in individual training examples.
3. ID3 maintains only a single current hypothesis as it searches through the space of decision trees. This
contrasts, for example, with the earlier CANDIDATE-ELIMINATION algorithm, which maintains the
set of all hypotheses consistent with the available training examples. As a consequence, ID3 does not have the
ability to determine how many alternative decision trees are consistent with the available training data.
4. ID3 in its pure form performs no backtracking in its search. Once it selects an attribute to test at a
particular level in the tree, it never backtracks to reconsider this choice.

40
Inductive Bias-ID3 Decision Tree Learning
1. Given a collection of training examples, there are typically many decision
trees consistent with these examples.
2. Describing the inductive bias of ID3 therefore consists of describing the basis
by which it chooses one of these consistent hypotheses over the others.
3. Which of these decision trees does ID3 choose?

41
Inductive Bias-ID3 Decision Tree Learning
1. Approximate inductive bias of ID3
- Shorter trees are preferred over larger trees
2. A closer approximation to the inductive bias of ID3
- Shorter trees are preferred over longer trees
- Trees that place high information gain attributes close to the root are
preferred over those that do not

42
Why Prefer Short Hypotheses?
1. Argument in favour:
- There are fewer short hypotheses than long ones.
- If a short hypothesis fits the data, it is unlikely to be a coincidence.
2. Argument against:
- Not every short hypothesis is a reasonable one.
Occam's Razor: "The simplest explanation is usually the best one."

43
Hypothesis space search in decision tree learning
1. ID3 can be characterized as searching a space of
hypotheses for one that fits the training examples. The
hypothesis space searched by ID3 is the set of possible
decision trees.
2. ID3 performs a simple-to-complex, hill-climbing search
through this hypothesis space, beginning with the empty
tree, then considering progressively more elaborate
hypotheses in search of a decision tree that correctly
classifies the training data.
3. The evaluation function that guides this hill-climbing
search is the information gain measure.

44
Creating Decision Tree using Gini Index
To create a decision tree we need to decide two things:
•Which attribute/feature should be placed at the root node?
•Which features will act as internal nodes or leaf nodes?

•This can be done using Information Gain, the Gini Index, or some other measure of randomness or
information impurity.

45
What is Gini Index?
• Gini Index or Gini impurity measures the degree or probability of a particular element
being wrongly classified when it is chosen randomly.

But what is actually meant by ‘impurity’?

• If all the elements belong to a single class, then the node can be called pure. For a
binary classification, the Gini Index varies between 0 and 0.5,
where,
'0' denotes that all elements belong to a single class, i.e., there exists only one class
(pure), and
'0.5' denotes that the elements are equally distributed across the classes
(maximally impure).

46
Formula of Gini Index
• The formula of the Gini Index is as follows:

  Gini = 1 - Σ_i (p_i)^2

where,
'p_i' is the probability of an object being classified to a particular class.

While building the decision tree, we would prefer to choose the
attribute/feature with the least Gini Index as the root node.
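
A minimal Python sketch of this formula (the function name gini_index is illustrative):

```python
def gini_index(labels):
    """Gini = 1 - sum of p_i^2 over the classes in a collection of labels."""
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

# A pure node has Gini 0; a 50/50 split of two classes has Gini 0.5.
print(gini_index(["Up", "Up", "Up"]))   # 0.0
print(gini_index(["Up", "Down"]))       # 0.5
```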

47
Gini Vs Entropy
The Gini Index and the Entropy have three main differences:
• The Gini Index takes values in the interval [0, 0.5], whereas the interval of the Entropy is [0, 1]. In the following
figure, both of them are represented. The Gini Index has also been plotted multiplied by two, to show
concretely that the differences between them are not very significant.

(Figure: Representation of Gini Index and Entropy)

• Computationally, entropy is more complex since it makes use of logarithms; consequently, the calculation
of the Gini Index is faster.
• If your data probability distribution is exponential or Laplace, entropy outperforms Gini.

To give an example, suppose there are two events, one with probability 0.01 and the other with probability 0.99.
With Gini, the probabilities are squared: 0.01² + 0.99² = 0.0001 + 0.9801, which
means the lower probability plays essentially no role; everything is governed by the majority probability.
With entropy, 0.01·log(0.01) + 0.99·log(0.99) = 0.01·(−2) + 0.99·(−0.00436) = −0.02 − 0.0043 (using base-10
logarithms), so the lower-probability event is given noticeably more weight.
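
A quick numeric check of the example above. Base-10 logarithms are assumed for the entropy figures quoted in the slide; the base-2 value is also shown for comparison.

```python
import math

p = [0.01, 0.99]
gini = 1 - sum(q * q for q in p)                   # ~0.0198: dominated by the majority class
entropy_b10 = -sum(q * math.log10(q) for q in p)   # ~0.0243 (base-10, as in the slide)
entropy_b2 = -sum(q * math.log2(q) for q in p)     # ~0.0808 (bits)
print(gini, entropy_b10, entropy_b2)
```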
Example of Gini Index
We will now calculate the Gini Index with the
following attributes:
•Past Trend
•Open Interest
•Trading Volume

Past Trend   Open Interest   Trading Volume   Return
Positive     Low             High             Up
Negative     High            Low              Down
Positive     Low             High             Up
Positive     High            High             Up
Negative     Low             High             Down
Positive     Low             Low              Down
Negative     High            High             Down
Negative     Low             High             Down
Positive     Low             Low              Down
Positive     High            High             Up
50
Gini Index for Past Trend
Since the past trend is positive 6 times out of 10 and negative 4
times, the calculation is as follows:
P(Past Trend = Positive) = 6/10
P(Past Trend = Negative) = 4/10

If (Past Trend = Positive & Return = Up), probability = 4/6
If (Past Trend = Positive & Return = Down), probability = 2/6
Gini Index = 1 - ((4/6)^2 + (2/6)^2) = 0.45

If (Past Trend = Negative & Return = Up), probability = 0
If (Past Trend = Negative & Return = Down), probability = 4/4
Gini Index = 1 - ((0)^2 + (4/4)^2) = 0

The weighted sum of the Gini Indices is:

Gini Index for Past Trend = (6/10)(0.45) + (4/10)(0) = 0.27
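
The weighted Gini calculations on these slides can be reproduced with a short Python sketch. The dictionary keys (PastTrend, OpenInterest, TradingVolume, Return) and function names are illustrative assumptions; the data is the 10-row table above.

```python
def gini(labels):
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def weighted_gini(rows, target, attr):
    """Weighted Gini of `attr` with respect to `target` over a list of dict rows."""
    total = len(rows)
    score = 0.0
    for v in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == v]
        score += (len(subset) / total) * gini(subset)
    return score

# The 10-row trading dataset from the table above.
data = [
    {"PastTrend": "Positive", "OpenInterest": "Low",  "TradingVolume": "High", "Return": "Up"},
    {"PastTrend": "Negative", "OpenInterest": "High", "TradingVolume": "Low",  "Return": "Down"},
    {"PastTrend": "Positive", "OpenInterest": "Low",  "TradingVolume": "High", "Return": "Up"},
    {"PastTrend": "Positive", "OpenInterest": "High", "TradingVolume": "High", "Return": "Up"},
    {"PastTrend": "Negative", "OpenInterest": "Low",  "TradingVolume": "High", "Return": "Down"},
    {"PastTrend": "Positive", "OpenInterest": "Low",  "TradingVolume": "Low",  "Return": "Down"},
    {"PastTrend": "Negative", "OpenInterest": "High", "TradingVolume": "High", "Return": "Down"},
    {"PastTrend": "Negative", "OpenInterest": "Low",  "TradingVolume": "High", "Return": "Down"},
    {"PastTrend": "Positive", "OpenInterest": "Low",  "TradingVolume": "Low",  "Return": "Down"},
    {"PastTrend": "Positive", "OpenInterest": "High", "TradingVolume": "High", "Return": "Up"},
]
for attr in ("PastTrend", "OpenInterest", "TradingVolume"):
    print(attr, round(weighted_gini(data, "Return", attr), 2))
# Expected (two decimals): PastTrend 0.27, OpenInterest 0.47, TradingVolume 0.34
```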

51
Calculating the Gini Index for Open Interest
The open interest is high 4 times and low 6 times out of a total of 10 times,
and is calculated as follows:
P(Open Interest = High) = 4/10
P(Open Interest = Low) = 6/10

If (Open Interest = High & Return = Up), probability = 2/4
If (Open Interest = High & Return = Down), probability = 2/4
Gini Index = 1 - ((2/4)^2 + (2/4)^2) = 0.5

If (Open Interest = Low & Return = Up), probability = 2/6
If (Open Interest = Low & Return = Down), probability = 4/6
Gini Index = 1 - ((2/6)^2 + (4/6)^2) = 0.45

The weighted sum of the Gini Indices is:

Gini Index for Open Interest = (4/10)(0.5) + (6/10)(0.45) = 0.47

52
Calculating the Gini Index for Trading Volume
Trading volume is high 7 times and low 3 times, and is calculated as
follows:
P(Trading Volume = High) = 7/10
P(Trading Volume = Low) = 3/10

If (Trading Volume = High & Return = Up), probability = 4/7
If (Trading Volume = High & Return = Down), probability = 3/7
Gini Index = 1 - ((4/7)^2 + (3/7)^2) = 0.49

If (Trading Volume = Low & Return = Up), probability = 0
If (Trading Volume = Low & Return = Down), probability = 3/3
Gini Index = 1 - ((0)^2 + (1)^2) = 0

The weighted sum of the Gini Indices is:

Gini Index for Trading Volume = (7/10)(0.49) + (3/10)(0) = 0.34

53
Attributes/Features   Gini Index
Past Trend            0.27
Open Interest         0.47
Trading Volume        0.34

Past Trend has the least Gini Index, so it is placed at the root node:

  Past Trend
    Positive
    Negative

54
Calculating the Gini Index of Open Interest for Positive Past Trend
Among the examples with a positive past trend, open interest is high 2 times out of 6 and low 4
times out of 6:
P(Open Interest = High) = 2/6
P(Open Interest = Low) = 4/6

•If (Open Interest = High & Return = Up), probability = 2/2
•If (Open Interest = High & Return = Down), probability = 0
Gini Index = 1 - ((2/2)^2 + (0)^2) = 0

•If (Open Interest = Low & Return = Up), probability = 2/4
•If (Open Interest = Low & Return = Down), probability = 2/4
Gini Index = 1 - ((2/4)^2 + (2/4)^2) = 0.50

•The weighted sum of the Gini Indices is:

Gini Index for Open Interest = (2/6)(0) + (4/6)(0.50) = 0.33

55
For the Positive branch of Past Trend:

Attributes/Features   Gini Index
Open Interest         0.33
Trading Volume        0

Trading Volume has the least Gini Index, so it becomes the next node under the Positive branch. The resulting tree is:

  Past Trend
    Positive -> Trading Volume
                  High -> Up
                  Low  -> Down
    Negative -> Down

56
Issues in Decision Tree
1. Overfitting the data
2. Incorporating Continuous valued attributes
3. Handling training examples with missing attribute values
4. Handling attributes with differing costs
5. Alternative measures for selecting attributes

57
Overfitting of Data in Decision Tree
Algorithm
A hypothesis overfits the training examples if some other hypothesis that fits
the training examples less well actually performs better over the entire
distribution of instances (i.e., including instances beyond the training set).

Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit
the training data if there exists some alternative hypothesis h' ∈ H, such that
h has smaller error than h' over the training examples, but h' has a smaller
error than h over the entire distribution of instances.

Overfitting may occur in decision trees when

• there is noise in the data, or
• the number of training examples is too small to produce a
representative sample of the true target function.
58
Overfitting of Data in Decision Tree
Algorithm

59
How can it be possible for tree h to fit the training examples better than h', but for it
to perform more poorly over subsequent examples?

Suppose a training example is added with an error in its label. The correct example would be

(Outlook = Sunny, Temperature = Hot, Humidity = Normal, Wind = Strong, PlayTennis = Yes)

but it is recorded incorrectly as

(Outlook = Sunny, Temperature = Hot, Humidity = Normal, Wind = Strong, PlayTennis = No)

60
Tree (h’) for error free data

61
Tree (h) for erroneous data

D15: (Outlook = Sunny, Temperature = Hot, Humidity = Normal, Wind = Strong, PlayTennis = No)
62
Overfitting of Data in Decision Tree
Algorithm
• The result of adding D15 is that ID3 will output a decision tree (h) that is more
complex than the original tree (h').
• Therefore h will fit the collection of training examples perfectly, whereas the
simpler h' will not.
• However, given that the new decision node is simply a consequence of fitting
the noisy training example, we expect h' to outperform h over subsequent
data drawn from the same instance distribution.
• Overfitting is possible even when the training data are noise-free, especially
when small numbers of examples are associated with leaf nodes.

63
Avoiding Overfitting the Data
These can be grouped into two classes:
1. Approaches that stop growing the tree earlier, before it reaches the point where
it perfectly classifies the training data.
2. Approaches that allow the tree to overfit the data, and then post-prune the tree.

• The first of these approaches might seem more direct, but the second approach of
post-pruning overfit trees has been found to be more successful in practice. This is
due to the difficulty in the first approach of estimating precisely when to stop
growing the tree.
• There are two common variants of the second (post-pruning) approach: Reduced Error
Pruning and Rule Post-Pruning.
64
REDUCED ERROR PRUNING
• Consider each of the decision nodes in the tree to be a candidate for pruning.
• Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf node,
and assigning it the most common classification of the training examples affiliated with that node.
• Nodes are removed only if the resulting pruned tree performs no worse than the original over the
validation set.
• This has the effect that any leaf node added due to coincidental regularities in the training set is likely
to be pruned, because these same coincidences are unlikely to occur in the validation set.
• Nodes are pruned iteratively, always choosing the node whose removal most increases the decision
tree's accuracy over the validation set.
• Pruning of nodes continues until further pruning is harmful (i.e., decreases the accuracy of the tree over
the validation set).

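A simplified Python sketch of reduced-error pruning over the nested-dictionary tree format used in the ID3 sketch earlier. It is an illustrative implementation under assumptions (greedy pruning, majority label taken from the training examples reaching each node, a default label for unseen attribute values), not the textbook's exact procedure.

```python
import copy
from collections import Counter

def predict(tree, row, default="No"):
    # Walk a dict-based tree ({attr: {value: subtree_or_label}}) down to a label.
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(row.get(attr), default)
    return tree

def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == lab for r, lab in zip(rows, labels)) / len(rows)

def _nodes(tree, path=()):
    """Yield the path (sequence of (attr, value) steps) to every decision node."""
    if isinstance(tree, dict):
        yield path
        attr = next(iter(tree))
        for value, child in tree[attr].items():
            yield from _nodes(child, path + ((attr, value),))

def _replace(tree, path, leaf):
    """Return a copy of the tree with the subtree at `path` replaced by `leaf`."""
    new = copy.deepcopy(tree)
    node = new
    for attr, value in path[:-1]:
        node = node[attr][value]
    if path:
        attr, value = path[-1]
        node[attr][value] = leaf
        return new
    return leaf

def _majority_at(train_rows, train_labels, path):
    """Most common training label among examples that reach the node at `path`."""
    rows, labels = train_rows, train_labels
    for attr, value in path:
        keep = [i for i, r in enumerate(rows) if r.get(attr) == value]
        rows = [rows[i] for i in keep]
        labels = [labels[i] for i in keep]
    pool = labels if labels else train_labels
    return Counter(pool).most_common(1)[0][0]

def reduced_error_prune(tree, train_rows, train_labels, val_rows, val_labels):
    """Greedily turn decision nodes into leaves while validation accuracy does not drop."""
    while True:
        best_tree, best_acc = None, accuracy(tree, val_rows, val_labels)
        for path in _nodes(tree):
            leaf = _majority_at(train_rows, train_labels, path)
            candidate = _replace(tree, path, leaf)
            acc = accuracy(candidate, val_rows, val_labels)
            if acc >= best_acc:            # prune only if no worse on validation data
                best_tree, best_acc = candidate, acc
        if best_tree is None:
            return tree
        tree = best_tree
```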
65
RULE POST-PRUNING
1. Infer the decision tree from the training set, growing the tree until the training data is fit as well as
possible and allowing overfitting to occur.
2. Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root
node to a leaf node.
3. Prune (generalize) each rule by removing any preconditions that result in improving its estimated
accuracy.
4. Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying
subsequent instances.
As an example consider the decision tree in Figure 3.1. In rule post pruning, one rule is generated for each leaf
node in the tree. Each attribute test along the path from the root to the leaf becomes a rule antecedent
(precondition) and the classification at the leaf node becomes the rule consequent (postcondition).
For example, the leftmost path of the tree in Figure 3.1 is translated into the rule
IF (Outlook = Sunny) and (Humidity = High)
THEN PlayTennis = No
Each such rule is pruned by removing any antecedent, or precondition, whose removal does not worsen its
estimated accuracy. For the rule above, rule post-pruning would consider removing the precondition
(Outlook = Sunny) or the precondition (Humidity = High).
It would select whichever of these pruning steps produced the greatest improvement in estimated rule accuracy,
then consider pruning the second precondition as a further pruning step. No pruning step is performed if it
reduces the estimated rule accuracy.
Incorporating Continuous-Valued Attributes
• This can be accomplished by dynamically defining new discrete valued attributes that partition the continuous
attribute value into a discrete set of intervals
• As an example, suppose we wish to include the continuous-valued attribute Temperature in describing the
training example days in the learning task of Table 3.2.
• Suppose further that the training examples associated with a particular node in the decision tree have the
following values for Temperature and the target attribute PlayTennis.
• By sorting the examples according to the continuous attribute A, then identifying adjacent examples that
differ in their target classification, we can generate a set of candidate thresholds midway between the
corresponding values of A.
• In the current example, there are two candidate thresholds, corresponding to the values of Temperature at
which the value of PlayTennis changes: (48 + 60)/2 = 54 and (80 + 90)/2 = 85.
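
A small Python sketch of this thresholding step. The full Temperature/PlayTennis value list is not reproduced on the slide, so the data below is an assumption based on the textbook example it references (it yields the thresholds 54 and 85 quoted above).

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(pairs, pairs[1:])
            if la != lb]

# Assumed Temperature / PlayTennis values for the examples at this node:
temps = [40, 48, 60, 72, 80, 90]
play  = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temps, play))   # [54.0, 85.0]
```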
Alternative Measures for Selecting Attributes
• There is a natural bias in the information gain measure that favors attributes with many values over those
with few values.
• As an extreme example, consider the attribute Date, which has a very large number of possible values (e.g.,
March 4, 1979).
• If we were to add this attribute to the data in Table 3.2, it would have the highest information gain of any
of the attributes. This is because Date alone perfectly predicts the target attribute over the training data.
• Thus, it would be selected as the decision attribute for the root node of the tree and would lead to a (quite
broad) tree of depth one, which perfectly classifies the training data.
• Of course, this decision tree would fare poorly on subsequent examples, because it is not a useful
predictor despite the fact that it perfectly separates the training data.
• One way to avoid this difficulty is to select decision attributes based on some measure other than
information gain. One alternative measure that has been used successfully is the gain ratio.
• The gain ratio measure penalizes attributes such as Date by incorporating a term, called split
information, that is sensitive to how broadly and uniformly the attribute splits the data:

  SplitInformation(S, A) = - Σ_{i=1..c} (|S_i|/|S|) log2(|S_i|/|S|)

  GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

  where S_1 through S_c are the c subsets of examples resulting from partitioning S by the c-valued attribute A.
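
A minimal Python sketch of split information and gain ratio over a list of dictionary rows. Function names are illustrative; Gain(S, A) is assumed to be computed separately, for example with the information-gain sketch shown earlier.

```python
import math
from collections import Counter

def split_information(rows, attr):
    """SplitInformation(S, A): the entropy of S with respect to the values of A itself."""
    total = len(rows)
    counts = Counter(r[attr] for r in rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def gain_ratio(gain, rows, attr):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A); `gain` is Gain(S, A)."""
    si = split_information(rows, attr)
    return gain / si if si > 0 else 0.0

# A Date-like attribute with n distinct values has SplitInformation = log2(n),
# which sharply reduces its gain ratio even though its raw information gain is maximal.
```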
Handling Training Examples with Missing
Attribute Values
• The available data may be missing values for some attributes.
• Consider the situation in which Gain(S, A) is to be calculated at node n in the decision tree to
evaluate whether the attribute A is the best attribute to test at this decision node. Suppose
that (x, c(x)) is one of the training examples in S and that the value A(x) is unknown.
• One strategy for dealing with the missing attribute value is to assign it the value that is most
common among training examples at node n.
• Alternatively, we might assign it the most common value among examples at node n that have
the classification c(x).
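
A small Python sketch of both strategies for the examples at a node, represented as dictionaries with None marking a missing attribute value. The names and data layout are assumptions for illustration.

```python
from collections import Counter

def _most_common(values):
    return Counter(values).most_common(1)[0][0]

def fill_missing(rows, labels, attr, per_class=False):
    """Fill missing values (None) of `attr` among the examples at a node.
    per_class=False: use the most common value over all examples at the node.
    per_class=True:  use the most common value among examples with the same class c(x).
    Assumes at least one observed value exists in the relevant pool."""
    filled = []
    for row, label in zip(rows, labels):
        if row.get(attr) is not None:
            filled.append(row)
            continue
        if per_class:
            pool = [r[attr] for r, lab in zip(rows, labels)
                    if lab == label and r.get(attr) is not None]
        else:
            pool = [r[attr] for r in rows if r.get(attr) is not None]
        filled.append(dict(row, **{attr: _most_common(pool)}))
    return filled
```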
Handling Attributes with Differing Costs
• In some learning tasks the instance attributes may have associated costs.
• For example, in learning to classify medical diseases we might describe patients in terms of attributes such
as Temperature, BiopsyResult, Pulse, BloodTestResults, etc.
• These attributes vary significantly in their costs, both in terms of monetary cost and cost to patient
comfort.
• In such tasks, we would prefer decision trees that use low-cost attributes where possible, relying on
high-cost attributes only when needed to produce reliable classifications.
• This can be done, without sacrificing classification accuracy, by replacing the information gain attribute
selection measure with a measure that also takes the attribute cost Cost(A) into account.
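
The exact cost-sensitive measure intended on this slide is not shown, so the sketch below follows two commonly cited textbook variants and should be read as illustrative only.

```python
def squared_gain_over_cost(gain, cost):
    """Gain(S, A)^2 / Cost(A): favours high-gain, low-cost attributes
    (a commonly cited cost-sensitive measure; assumed here, not taken from the slide)."""
    return gain ** 2 / cost

def exponential_gain_over_cost(gain, cost, w=0.5):
    """(2^Gain(S, A) - 1) / (Cost(A) + 1)^w, with w in [0, 1] controlling how strongly
    the cost is weighted (another commonly cited variant; also an assumption here)."""
    return (2 ** gain - 1) / (cost + 1) ** w
```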
