
MACHINE LEARNING

• Humans can supply rules to logical reasoning programs

• Another way is to impart to the machines themselves the ability
  to construct rules. This is called Machine Learning

• The machine is given raw data and is supposed to form rules
  (i.e. a model or concept) about the process from which the data
  was generated

• Two types, depending on the kind of data:
  Symbolic machine learning (symbolic data)
  Numeric machine learning (numeric data)

1
DECISION TREES

Most widely used learning method

It is a method that induces concepts from examples

The learning is supervised: i.e. the classes or categories of the
data instances are known

It represents concepts as decision trees, a representation that
allows us to determine the classification of an object by testing
its values for certain properties

2
IDENTIFICATION TREES

We may think of each property of an instance as contributing a
certain amount of information to its classification.

For example, if our goal is to determine the species of an animal,
the discovery that it lays eggs contributes a certain amount of
information to that goal.

3
Definition

● A decision tree is a classifier in the form of a tree structure
  – Decision node: specifies a test on a single attribute
  – Leaf node: indicates the value of the target attribute
  – Arc/edge: one outcome (split value) of the parent attribute’s test
  – Path: a conjunction of tests leading to the final decision

● Decision trees classify instances or examples by starting at the
  root of the tree and moving through it until a leaf node is
  reached.
Why decision tree?

■ Decision trees are powerful and popular tools for
  classification and prediction.
■ Decision trees represent rules, which can be understood by
  humans and used in knowledge systems such as databases.
Key requirements

■ Attribute-value description: each object or case must be
  expressible in terms of a fixed collection of properties or
  attributes (e.g., hot, mild, cold).
■ Predefined classes (target values): the target function has
  discrete output values (Boolean or multiclass).
■ Sufficient data: enough training cases must be provided to
  learn the model.
Concept Learning

■ E.g., learn the concept “edible mushroom”
  ◻ The target function has two values: T or F
■ Represent concepts as decision trees
■ Use hill-climbing search through the space of decision trees
  ◻ Start with a simple concept
  ◻ Refine it into a more complex concept as needed

7
IDENTIFICATION TREES

From Data to Identification Trees

Step 1: Collect data (the sunburn training set, shown in full on slide 25)

8
IDENTIFICATION TREES

From Identification Trees to Rules

Step 2: Create Decision Tree

Hair
  Blonde → Lotion Used
    No  → Sunburned (Sarah, Annie)
    Yes → No sunburn (Dana, Katie)
  Red → Sunburned (Emily)
  Brown → No sunburn (Alex, Pete, John)
9
IDENTIFICATION TREES

From Identification Trees to Rules

Step 3: Make rules from the identification tree

For our example we have:

If the person’s hair is blonde
   and the person uses lotion
then the person is not sunburned

If the person’s hair is blonde
   and the person uses no lotion
then the person is sunburned

10
IDENTIFICATION TREES

From Identification Trees to Rules

Step 3: Make rules from the identification tree

For our example we have:

If the person’s hair is red
then the person is sunburned

If the person’s hair is brown
then the person is not sunburned

11
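As a minimal sketch, the four extracted rules can be written directly as ordinary code; attribute values follow the data table on slide 25, and the function name is ours, not part of the slides.

```python
# A minimal sketch: the four rules extracted from the identification tree,
# written as a plain Python function. Attribute values follow the table on
# slide 25 (hair, lotion); class labels are "Sunburned" / "None".
def classify_sunburn(hair: str, lotion: str) -> str:
    if hair == "Blonde" and lotion == "Yes":
        return "None"          # blonde + lotion -> not sunburned
    if hair == "Blonde" and lotion == "No":
        return "Sunburned"     # blonde + no lotion -> sunburned
    if hair == "Red":
        return "Sunburned"
    if hair == "Brown":
        return "None"
    raise ValueError("no rule fires for this combination of attributes")

# Example: Sarah is blonde and uses no lotion.
print(classify_sunburn("Blonde", "No"))   # -> Sunburned
```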
IDENTIFICATION TREES

From Identification Trees to Rules

Step 4: Optimize the rules (prune the antecedents)

To simplify a rule, you ask whether any of the antecedents can be
eliminated without changing what the rule does on the samples.

Example:
If hair is blonde and person uses lotion then no sunburn

If we eliminate the 1st antecedent and check the rule over the
whole database, we find that there are no misclassifications.
Hence we can drop this antecedent as unnecessary.
12
IDENTIFICATION TREES

From Identification Trees to Rules

Step 4: Optimize the rules (prune the antecedents)

Example:
If hair is blonde and person uses lotion then no sunburn

If we eliminate the 1st antecedent and check the shortened rule
(“if person uses lotion then no sunburn”) over the whole database,
we find that there are no misclassifications. Hence we can drop
this antecedent as unnecessary.

                  R1 = Sunburn    R2 = No Sunburn
  Lotion = Yes        l = 0            m = 3

13
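A minimal sketch of this pruning check over the sunburn data from slide 25; function and variable names are ours. It reproduces the counts shown on slides 13 and 14.

```python
# A minimal sketch of the antecedent-pruning check, assuming the sunburn
# data set from slide 25. We drop one antecedent from the rule
# "blonde AND lotion -> no sunburn" and count misclassifications of the
# shortened rule over the whole database.
DATA = [  # (hair, lotion, result)
    ("Blonde", "No",  "Sunburned"), ("Blonde", "Yes", "None"),
    ("Brown",  "Yes", "None"),      ("Blonde", "No",  "Sunburned"),
    ("Red",    "No",  "Sunburned"), ("Brown",  "No",  "None"),
    ("Brown",  "No",  "None"),      ("Blonde", "Yes", "None"),
]

def misclassified(antecedent, conclusion):
    """Count covered samples whose actual class disagrees with the rule."""
    return sum(1 for hair, lotion, result in DATA
               if antecedent(hair, lotion) and result != conclusion)

# Full rule: blonde AND lotion -> None ; shortened rules drop one antecedent.
print(misclassified(lambda h, l: h == "Blonde" and l == "Yes", "None"))  # 0
print(misclassified(lambda h, l: l == "Yes", "None"))  # 0 -> safe to drop "blonde"
print(misclassified(lambda h, l: h == "Blonde", "None"))  # 2 -> cannot drop "uses lotion"
```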
IDENTIFICATION TREES

From Identification Trees to Rules

Step 4: Optimize the rules (prune the antecedents)

Example:
If hair is blonde and person uses lotion then no sunburn

If we eliminate the 2nd antecedent, the resulting shortened rule
(“if hair is blonde then no sunburn”) is not consistent with the
data, hence that antecedent cannot be eliminated.

                  R1 = Sunburn    R2 = No Sunburn
  Blonde = Yes        l = 2            m = 2

14
IDENTIFICATION TREES

From Identification Trees to Rules

Step 4: Optimize the rules (prune the antecedents)

Consider the following table

                  R1 = Canary    R2 = Crow
  Color = Yellow    l = 1000       m = 0

It is evident that the rule
    If color is yellow then bird is canary
is sufficient

15
IDENTIFICATION TREES

From Identification Trees to Rules

Step 4: Optimize the rules (prune the antecedents)

However, if we have

                  R1 = Canary    R2 = Crow
  Color = Yellow    l = 999        m = 1

we observe that the rule is not consistent with the data, and more
antecedents would have to be incorporated (by decision tree
expansion) to make the rule valid for the whole data set.

However, we should ask whether doing so is worth the trouble.
Is the simplified rule worth an occasional error?
Also, might the single conflicting entry just be noise?

16
IDENTIFICATION TREES

From Identification Trees to Rules

Step 4: Optimize the rules (eliminate unnecessary rules)

Rules leading to one label can be replaced by a default rule:

    If no other rule applies
    then label x

However, this involves a very big assumption: that all of the
uncovered concept space belongs to label x.

This may mean a misclassification of an unknown instance.

17
IDENTIFICATION TREES

From Identification Trees to Rules

Step 4: Optimize the rules (eliminate unnecessary rules)

Without the default rule, we would have known that a new instance
does not fire any rule, and it could be set aside to be classified
manually.

This sometimes leads to the formation of new rules for the
category of that instance.

It may also lead to the discovery of a new, as yet unknown,
category.

18
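A minimal sketch of the difference between using a default rule and setting unmatched instances aside; the rule set and attribute names are illustrative, not from the slides.

```python
# A minimal sketch contrasting a rule base with and without a default rule.
# Rules are (condition, label) pairs; the conditions are hypothetical
# predicates over an instance dictionary.
RULES = [
    (lambda x: x["color"] == "Yellow", "Canary"),
    (lambda x: x["color"] == "Black",  "Crow"),
]

def classify(instance, default=None):
    for condition, label in RULES:
        if condition(instance):
            return label
    # No rule fires: either fall back on the default label, or return
    # None so the instance can be set aside for manual classification.
    return default

print(classify({"color": "Green"}))                  # None -> set aside; maybe a new category
print(classify({"color": "Green"}, default="Crow"))  # "Crow" -> possible misclassification
```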
IDENTIFICATION TREES

From Identification Trees to Rules

Step 4: Optimize the rules (eliminate unnecessary rules)

Consider the following table

                  R1 = Canary    R2 = Crow
  Color = Yellow    l = 1000       m = 0
  Color = Black     n = 0          o = 1000

For this we have to have the following rules:
    If Color is Yellow then bird is Canary
    If Color is Black then bird is Crow

19
IDENTIFICATION TREES

From Identification Trees to Rules

Step 4: Optimize the rules (eliminate unnecessary rules)

If Color is Yellow then bird is Canary
If Color is Black then bird is Crow

Both rules are necessary. If we drop one of them we will not be
able to classify 50% of the samples.

20
IDENTIFICATION TREES

From Identification Trees to Rules

Step 4: Optimize the rules (eliminate unnecessary rules)

Now consider the following table

                  R1 = Canary    R2 = Crow
  Color = Yellow    l = 999        m = 0
  Color = Black     n = 0          o = 1

In this case we have the rules:
    If Color is Yellow then Bird is Canary
    If Color is Black then Bird is Crow

21
IDENTIFICATION TREES

From Identification Trees to Rules

Step 4: Optimize the rules (eliminate unnecessary rules)

In this case we have the rules:
    If Color is Yellow then Bird is Canary
    If Color is Black then Bird is Crow

If we drop the 2nd rule, we would misclassify only 1 sample in
1000 samples.

We should ask ourselves:
Is the rule elimination worth an occasional error?
Also, maybe the single entry is noise?

22
IDENTIFICATION TREES

Identification Trees

For any dimension (attribute), we partition the set of training
examples into subsets.

The number of subsets is the number of values of the dimension;
the examples in a subset all have the same value for that
dimension.

For example: risk is a dimension.
It has three values: high, moderate, low.

This has the effect of cutting the example space with
hyper-planes.
23
IDENTIFICATION TREES

Identification Trees

The dimension to be partitioned first is the one with the lowest
average disorder of the resulting partitions.

The space has been partitioned sufficiently when all the training
examples in a partition have the same class (category) label.

To classify an instance whose class label is undetermined, we
simply place it in our concept space and note the label of the
partition which encloses it.

24
ID3 Example
Name Hair Height Weight Lotion Result

Sarah Blonde Average Light No Sunburned

Dana Blonde Tall Average Yes None

Alex Brown Short Average Yes None

Annie Blonde Short Average No Sunburned

Emily Red Average Heavy No Sunburned

Pete Brown Tall Heavy No None

John Brown Average Heavy No None

Katie Blonde Short Light Yes None

25
Average Disorder
■ Average Disorder = Σb (Nb / Nt) × Σc ( -(Nbc / Nb) log2 (Nbc / Nb) )

  ◻ where Nb is the number of samples in branch b,
  ◻ Nt is the total number of samples in all branches,
  ◻ Nbc is the total number of samples in branch b of class c

26
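A minimal sketch of this formula in Python (function names are ours), checked against the Hair attribute of the sunburn data on slide 25.

```python
# A minimal sketch of the average-disorder formula above, evaluated for the
# Hair attribute of the sunburn data (slide 25). Only the Hair and Result
# columns are needed here.
from collections import Counter
from math import log2

hair   = ["Blonde", "Blonde", "Brown", "Blonde", "Red", "Brown", "Brown", "Blonde"]
result = ["Sunburned", "None", "None", "Sunburned", "Sunburned", "None", "None", "None"]

def branch_disorder(labels):
    """-(sum over classes c) (Nbc/Nb) log2(Nbc/Nb) for one branch."""
    nb = len(labels)
    return -sum((nbc / nb) * log2(nbc / nb) for nbc in Counter(labels).values())

def average_disorder(values, labels):
    """(sum over branches b) (Nb/Nt) * branch_disorder(branch b)."""
    nt = len(values)
    return sum((values.count(v) / nt) *
               branch_disorder([l for x, l in zip(values, labels) if x == v])
               for v in set(values))

print(average_disorder(hair, result))   # 0.5, as computed on the next slides
```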
ID3 Example (cont.)

Attribute Name Attribute Values Attribute Occurrences

Hair Blonde 4

Brown 3

Red 1

27
ID3 Example (cont.)
■ Blonde = 4/8 (-2/4 log2 2/4 -2/4 log2 2/4)
= 4/8 (0.5 + 0.5)
= 0.5
■ Brown = 3/8 (-3/3 log2 3/3)
= 3/8 (-1 log2 1)
=0
■ Red = 1/8 (-1 log2 1)
=0
28
ID3 Example (cont.)
■ Average Disorder (Hair) = Blonde + Brown + Red
                          = 0.5 + 0 + 0
                          = 0.5

29
ID3 Example (cont.)
■ Similarly Average Disorder for other
attributes can be calculated; which turns
out to be
■ Average Disorder (Hair) = 0.5
■ Average Disorder (Height) = 0.6886
■ Average Disorder (Weight) = 0.9386
■ Average Disorder (Lotion)= 0.6067

30
ID3 Example (cont.)
■ The most homogeneous attribute (lowest average disorder) is
  Hair, so Hair is the first test. The tree so far:

  Hair
    Blonde
    Red
    Brown

31
ID3 Example (cont.)
■ With red and brown hair colors, the training set is completely
  classified. The only remaining problem is the blonde branch.
Attribute Name Attribute Values Attribute Occurrences

Height (with hair = blonde) Tall 1

Average 1

Short 2

32
ID3 Example (cont.)
■ Tall = 1/4 (-1 log2 1)
=0
■ Average = 1/4 (-1 log2 1)
=0
■ Short = 2/4 (-1/2 log2 1/2 -1/2 log2 1/2)
= 2/4 (0.5 + 0.5)
= 0.5
■ Average Disorder (Height with “hair = blonde”) = 0 + 0 + 0.5
                                                 = 0.5

33
ID3 Example (cont.)
■ Similarly for other attributes but with hair =
blonde the average disorder is:
■ Average Disorder (Height with “hair =
blonde”) = 0.5
■ Average Disorder (Weight with “hair =
blonde”) = 1
■ Average Disorder (Lotion with “hair =
blonde”) = 0
34
ID3 Example (cont.)
■ Lotion has the minimum average disorder, so it will be the next
  test. Now the tree becomes:

  Hair
    Blonde → Lotion Used
      No
      Yes
    Red
    Brown

35
ID3 Example (cont.)
Hair
  Blonde → Lotion Used
    No  → Sunburned (Sarah, Annie)
    Yes → No sunburn (Dana, Katie)
  Red → Sunburned (Emily)
  Brown → No sunburn (Alex, Pete, John)

36
ID3 Example (cont.)
■ Rules Extraction
■ IF the person’s hair color is blonde
  AND the person uses lotion
  THEN no sunburn

■ IF the person’s hair color is blonde
  AND the person uses no lotion
  THEN the person turns red (sunburn)
37
ID3 Example (cont.)
■ IF the person’s hair color is red
  THEN the person turns red (sunburn)

■ IF the person’s hair color is brown
  THEN no sunburn

38
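The whole procedure of slides 25–38 can be condensed into a short recursive routine. The sketch below is not from the slides; it is one possible ID3-style implementation using the average-disorder criterion, and it recovers the same tree from the sunburn data.

```python
# A minimal recursive ID3 sketch using the average-disorder criterion from
# slide 26 and the sunburn data from slide 25. Names (id3, disorder, DATA,
# ...) are illustrative, not from the slides.
from collections import Counter
from math import log2

DATA = [
    {"Hair": "Blonde", "Height": "Average", "Weight": "Light",   "Lotion": "No",  "Result": "Sunburned"},
    {"Hair": "Blonde", "Height": "Tall",    "Weight": "Average", "Lotion": "Yes", "Result": "None"},
    {"Hair": "Brown",  "Height": "Short",   "Weight": "Average", "Lotion": "Yes", "Result": "None"},
    {"Hair": "Blonde", "Height": "Short",   "Weight": "Average", "Lotion": "No",  "Result": "Sunburned"},
    {"Hair": "Red",    "Height": "Average", "Weight": "Heavy",   "Lotion": "No",  "Result": "Sunburned"},
    {"Hair": "Brown",  "Height": "Tall",    "Weight": "Heavy",   "Lotion": "No",  "Result": "None"},
    {"Hair": "Brown",  "Height": "Average", "Weight": "Heavy",   "Lotion": "No",  "Result": "None"},
    {"Hair": "Blonde", "Height": "Short",   "Weight": "Light",   "Lotion": "Yes", "Result": "None"},
]

def disorder(rows):
    counts, n = Counter(r["Result"] for r in rows), len(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def average_disorder(rows, attr):
    return sum(len(branch) / len(rows) * disorder(branch)
               for v in {r[attr] for r in rows}
               for branch in [[r for r in rows if r[attr] == v]])

def id3(rows, attrs):
    labels = {r["Result"] for r in rows}
    if len(labels) == 1:                 # partition is homogeneous: make a leaf
        return labels.pop()
    best = min(attrs, key=lambda a: average_disorder(rows, a))
    return {best: {v: id3([r for r in rows if r[best] == v], attrs - {best})
                   for v in {r[best] for r in rows}}}

tree = id3(DATA, {"Hair", "Height", "Weight", "Lotion"})
print(tree)
# {'Hair': {'Blonde': {'Lotion': {'No': 'Sunburned', 'Yes': 'None'}},
#           'Red': 'Sunburned', 'Brown': 'None'}}   (branch order may vary)
```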
IDENTIFICATION TREES

Evaluating ID3

Bad Data
Data is bad if one set of attributes has two different
outcomes

Missing Data
Data is missing if an attribute is not present, perhaps
because it was too expensive to obtain

39
Entropy

■ A measure of homogeneity of the set of examples.

■ Given a set S of positive and negative examples of


some target concept (a 2-class problem), the entropy
of set S relative to this binary classification is

E(S) = - p(P)log2 p(P) – p(N)log2 p(N)


■ Suppose S has 25 examples, 15 positive and 10
negatives [15+, 10-]. Then the entropy of S relative to
this classification is

E(S) = -(15/25) log2(15/25) - (10/25) log2(10/25) ≈ 0.971
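A minimal sketch of this two-class entropy (the function name is ours), reproducing the [15+, 10-] example above.

```python
# A minimal sketch of E(S) = -p(P) log2 p(P) - p(N) log2 p(N),
# reproducing the [15+, 10-] example above.
from math import log2

def entropy(pos: int, neg: int) -> float:
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:               # convention: 0 * log2(0) = 0
            e -= p * log2(p)
    return e

print(round(entropy(15, 10), 3))  # 0.971
print(entropy(10, 0))             # 0.0  (outcome is certain)
print(entropy(12, 12))            # 1.0  (maximum uncertainty)
```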


Some Intuitions
■ The entropy is 0 if the outcome is “certain”.
■ The entropy is maximum if we have no knowledge of the system
  (i.e., any outcome is equally possible).

[Figure: entropy of a 2-class problem as a function of the
proportion of one of the two classes]
Information Gain

■ Information gain measures the expected reduction in entropy, or
  uncertainty, from splitting on an attribute A:

  Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) Entropy(Sv)

  ◻ Values(A) is the set of all possible values for attribute A,
    and Sv is the subset of S for which attribute A has value v:
    Sv = {s in S | A(s) = v}
  ◻ the first term in the equation for Gain is just the entropy of
    the original collection S
  ◻ the second term is the expected value of the entropy after S
    is partitioned using attribute A
■ It is simply the expected reduction in
entropy caused by partitioning the
examples according to this attribute.
■ It is the number of bits saved when
encoding the target value of an arbitrary
member of S, by knowing the value of
attribute A.
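As a minimal sketch, the same quantity can be computed directly over a list of examples stored as dictionaries; the attribute and target names here are illustrative, not fixed by the slides.

```python
# A minimal sketch of Gain(S, A) computed directly over a list of examples.
# Examples are dictionaries; `target` is the class attribute.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, target="PlayTennis"):
    labels = [e[target] for e in examples]
    gain = entropy(labels)                                   # Entropy(S)
    for v in {e[attribute] for e in examples}:               # Values(A)
        sv = [e[target] for e in examples if e[attribute] == v]  # Sv
        gain -= len(sv) / len(examples) * entropy(sv)
    return gain
```

Applied to the 14-example tennis table on the next slide, this gives Gain(Outlook) ≈ 0.25, matching the worked computation on slide 55.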
Training Examples
Day Outlook Temp. Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Weak Yes
D8 Sunny Mild High Weak No
D9 Sunny Cold Normal Weak Yes
D10 Rain Mild Normal Strong Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

45
Example: “Good day for tennis”

■ Attributes of instances
◻ Outlook = {rainy (r), overcast (o), sunny (s)}
◻ Temperature = {cool (c), medium (m), hot
(h)}
◻ Humidity = {normal (n), high (h)}
◻ Wind = {weak (w), strong (s)}
■ Class value
◻ Play Tennis? = {don’t play (n), play (y)}
■ Feature = attribute with one value
  E.g., outlook = sunny

46
Decision Tree Representation
Good day for tennis?
Leaves = classification; arcs = choice of value for the parent attribute

Outlook
  Sunny    → Humidity
               High   → Don’t play
               Normal → Play
  Overcast → Play
  Rain     → Wind
               Strong → Don’t play
               Weak   → Play

A decision tree is equivalent to logic in disjunctive normal form:

Play ⇔ (Sunny ∧ Normal) ∨ Overcast ∨ (Rain ∧ Weak)

47
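Since the slide notes the tree is equivalent to a DNF formula, here is a minimal sketch (function names are ours) that expresses the tree as nested conditionals and checks the equivalence over every combination of attribute values.

```python
# A minimal sketch of the tree above as nested conditionals, plus a check
# that it agrees with the DNF formula
#   Play <=> (Sunny and Normal) or Overcast or (Rain and Weak).
def tree_play(outlook: str, humidity: str, wind: str) -> bool:
    if outlook == "Sunny":
        return humidity == "Normal"
    if outlook == "Overcast":
        return True
    if outlook == "Rain":
        return wind == "Weak"
    raise ValueError("unknown outlook value")

def dnf_play(outlook: str, humidity: str, wind: str) -> bool:
    return ((outlook == "Sunny" and humidity == "Normal")
            or outlook == "Overcast"
            or (outlook == "Rain" and wind == "Weak"))

# The two formulations agree on every combination of attribute values.
assert all(tree_play(o, h, w) == dnf_play(o, h, w)
           for o in ("Sunny", "Overcast", "Rain")
           for h in ("High", "Normal")
           for w in ("Weak", "Strong"))
```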
Numeric Attributes

Use thresholds to convert numeric attributes into discrete values:

Outlook
  Sunny    → Humidity
               >= 75%    → Don’t play
               < 75%     → Play
  Overcast → Play
  Rain     → Wind
               >= 10 MPH → Don’t play
               < 10 MPH  → Play

48
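A minimal sketch of the thresholding idea, assuming the 75% humidity and 10 MPH wind cut-offs shown above; the function name is ours.

```python
# A minimal sketch of turning a numeric attribute into a discrete one with
# a threshold, as the slide suggests for humidity and wind speed.
def discretize(value: float, threshold: float, low: str, high: str) -> str:
    """Map a numeric reading to one of two discrete labels."""
    return high if value >= threshold else low

print(discretize(82, 75, "Normal", "High"))   # humidity 82%  -> "High"
print(discretize(6, 10, "Weak", "Strong"))    # wind 6 MPH    -> "Weak"
```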
DT Learning as Search

■ Nodes:        decision trees
■ Operators:    tree refinement (sprouting the tree)
■ Initial node: the smallest tree possible (a single leaf)
■ Heuristic:    information gain
■ Goal:         the best tree possible (???)

What is the simplest tree?

Day  Outlook Temp Humid Wind Play?
d1   s       h    h     w    n
d2   s       h    h     s    n
d3   o       h    h     w    y
d4   r       m    h     w    y
d5   r       c    n     w    y
d6   r       c    n     s    n
d7   o       c    n     s    y
d8   s       m    h     w    n
d9   s       c    n     w    y
d10  r       m    n     w    y
d11  s       m    n     s    y
d12  o       m    h     s    y
d13  o       h    n     w    y
d14  r       m    h     s    n

How good is the simplest tree (the majority class)?
Majority class [9+, 5-]: correct on 9 examples, incorrect on 5
examples.
50
Successors

Candidate splits at the root: Humid, Wind, Outlook, Temp.

Which attribute should we use to split?


51
Disorder is bad
Homogeneity is good
[Figure: candidate splits compared, ranging from bad (mixed
classes) through better to good (homogeneous subsets)]

52
Entropy (disorder) is bad
Homogeneity is good
■ Let S be a set of examples
■ Entropy(S) = -P log2(P) - N log2(N)
◻ P is proportion of pos example
◻ N is proportion of neg examples
◻ 0 log 0 == 0
■ Example: S has 9 pos and 5 neg examples
  Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14)
                    = 0.940

53
Information Gain
■ Measure of expected reduction in entropy
■ Resulting from splitting along an attribute

  Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) Entropy(Sv)

  where
  Entropy(S) = -P log2(P) - N log2(N)

54
Tree Induction Example

■ Entropy of data S
  Info(S) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.94

■ Split data by attribute Outlook
  S [9+, 5-] → Sunny    [2+, 3-]
               Overcast [4+, 0-]
               Rain     [3+, 2-]

  Gain(Outlook) = 0.94 - 5/14 [-2/5 log2(2/5) - 3/5 log2(3/5)]
                       - 4/14 [-4/4 log2(4/4) - 0/4 log2(0/4)]
                       - 5/14 [-3/5 log2(3/5) - 2/5 log2(2/5)]
                = 0.94 - 0.69 = 0.25

  (A parent node p is split into k partitions; ni is the number of
  records in partition i.)

55
Tree Induction Example

■ Split data by attribute Temperature
  S [9+, 5-] → <15   [3+, 1-]
               15-25 [4+, 2-]
               >25   [2+, 2-]

  Gain(Temperature) = 0.94 - 4/14 [-3/4 log2(3/4) - 1/4 log2(1/4)]
                           - 6/14 [-4/6 log2(4/6) - 2/6 log2(2/6)]
                           - 4/14 [-2/4 log2(2/4) - 2/4 log2(2/4)]
                    = 0.94 - 0.91 = 0.03

56
Tree Induction Example

■ Split data by attribute Humidity
  S [9+, 5-] → High   [3+, 4-]
               Normal [6+, 1-]

  Gain(Humidity) = 0.94 - 7/14 [-3/7 log2(3/7) - 4/7 log2(4/7)]
                        - 7/14 [-6/7 log2(6/7) - 1/7 log2(1/7)]
                 = 0.94 - 0.79 = 0.15

■ Split data by attribute Wind
  S [9+, 5-] → Weak   [6+, 2-]
               Strong [3+, 3-]

  Gain(Wind) = 0.94 - 8/14 [-6/8 log2(6/8) - 2/8 log2(2/8)]
                    - 6/14 [-3/6 log2(3/6) - 3/6 log2(3/6)]
             = 0.94 - 0.89 = 0.05

57
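The four gains can be reproduced from the branch class counts alone. Below is a minimal sketch (names are ours) that matches the arithmetic on slides 55–57.

```python
# A minimal sketch reproducing the four gains above from the class counts
# of each branch, as listed on slides 55-57.
from math import log2

def entropy(pos: int, neg: int) -> float:
    total = pos + neg
    return -sum(c / total * log2(c / total) for c in (pos, neg) if c)

def gain(parent, branches):
    """parent = (pos, neg); branches = list of (pos, neg), one per attribute value."""
    n = sum(parent)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q) for p, q in branches)

S = (9, 5)
print(round(gain(S, [(2, 3), (4, 0), (3, 2)]), 2))  # Outlook     -> 0.25
print(round(gain(S, [(3, 1), (4, 2), (2, 2)]), 2))  # Temperature -> 0.03
print(round(gain(S, [(3, 4), (6, 1)]), 2))          # Humidity    -> 0.15
print(round(gain(S, [(6, 2), (3, 3)]), 2))          # Wind        -> 0.05
```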
Tree Induction Example

Outlook   Temperature  Humidity  Wind    Play Tennis
Sunny     >25          High      Weak    No
Sunny     >25          High      Strong  No
Overcast  >25          High      Weak    Yes
Rain      15-25        High      Weak    Yes
Rain      <15          Normal    Weak    Yes
Rain      <15          Normal    Strong  No
Overcast  <15          Normal    Strong  Yes
Sunny     15-25        High      Weak    No
Sunny     <15          Normal    Weak    Yes
Rain      15-25        Normal    Weak    Yes
Sunny     15-25        Normal    Strong  Yes
Overcast  15-25        High      Strong  Yes
Overcast  >25          Normal    Weak    Yes
Rain      15-25        High      Strong  No

Gain(Outlook)     = 0.25
Gain(Temperature) = 0.03
Gain(Humidity)    = 0.15
Gain(Wind)        = 0.05

Outlook has the highest gain, so it becomes the root test:

Outlook
  Sunny    → ??
  Overcast → Yes
  Rain     → ??

58
■ Entropy of branch Sunny
  Info(Sunny) = -2/5 log2(2/5) - 3/5 log2(3/5) = 0.97

■ Split Sunny branch by attribute Temperature
  Sunny [2+, 3-] → <15   [1+, 0-]
                   15-25 [1+, 1-]
                   >25   [0+, 2-]
  Gain(Temperature) = 0.97 - 1/5 [-1/1 log2(1/1) - 0/1 log2(0/1)]
                           - 2/5 [-1/2 log2(1/2) - 1/2 log2(1/2)]
                           - 2/5 [-0/2 log2(0/2) - 2/2 log2(2/2)]
                    = 0.97 - 0.4 = 0.57

■ Split Sunny branch by attribute Humidity
  Sunny [2+, 3-] → High   [0+, 3-]
                   Normal [2+, 0-]
  Gain(Humidity) = 0.97 - 3/5 [-0/3 log2(0/3) - 3/3 log2(3/3)]
                        - 2/5 [-2/2 log2(2/2) - 0/2 log2(0/2)]
                 = 0.97 - 0 = 0.97

■ Split Sunny branch by attribute Wind
  Sunny [2+, 3-] → Weak   [1+, 2-]
                   Strong [1+, 1-]
  Gain(Wind) = 0.97 - 3/5 [-1/3 log2(1/3) - 2/3 log2(2/3)]
                    - 2/5 [-1/2 log2(1/2) - 1/2 log2(1/2)]
             = 0.97 - 0.95 = 0.02
59
Tree Induction Example

Humidity has the highest gain on the Sunny branch, so the tree
becomes:

Outlook
  Sunny    → Humidity
               High   → No
               Normal → Yes
  Overcast → Yes
  Rain     → ??

60
■ Entropy of branch Rain
  Info(Rain) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.97

■ Split Rain branch by attribute Temperature
  Rain [3+, 2-] → <15   [1+, 1-]
                  15-25 [2+, 1-]
                  >25   [0+, 0-]
  Gain(Temperature) = 0.97 - 2/5 [-1/2 log2(1/2) - 1/2 log2(1/2)]
                           - 3/5 [-2/3 log2(2/3) - 1/3 log2(1/3)]
                           - 0/5 [ empty branch, contributes 0 ]
                    = 0.97 - 0.95 = 0.02

■ Split Rain branch by attribute Humidity
  Rain [3+, 2-] → High   [1+, 1-]
                  Normal [2+, 1-]
  Gain(Humidity) = 0.97 - 2/5 [-1/2 log2(1/2) - 1/2 log2(1/2)]
                        - 3/5 [-2/3 log2(2/3) - 1/3 log2(1/3)]
                 = 0.97 - 0.95 = 0.02

■ Split Rain branch by attribute Wind
  Rain [3+, 2-] → Weak   [3+, 0-]
                  Strong [0+, 2-]
  Gain(Wind) = 0.97 - 3/5 [-3/3 log2(3/3) - 0/3 log2(0/3)]
                    - 2/5 [-0/2 log2(0/2) - 2/2 log2(2/2)]
             = 0.97 - 0 = 0.97

61
Issues

■ Missing data
■ Real-valued attributes
■ Many-valued features
■ Evaluation
■ Overfitting

62
Strengths

■ Can generate understandable rules
■ Perform classification without much computation
■ Can handle continuous and categorical variables
■ Provide a clear indication of which fields are most important
  for prediction or classification
Weaknesses

■ Not suitable for predicting continuous attributes.
■ Perform poorly with many classes and small data sets.
■ Computationally expensive to train.
  ◻ At each node, each candidate splitting field must be sorted
    before its best split can be found.
  ◻ In some algorithms, combinations of fields are used and a
    search must be made for optimal combining weights.
  ◻ Pruning algorithms can also be expensive since many candidate
    sub-trees must be formed and compared.
■ Do not handle non-rectangular regions well.
