
Machine Learning, Expert Systems and Fuzzy Logic
Prepared by Kristian Guillaumier
Dept. of Intelligent Computer Systems
University of Malta
2011
All content in these slides adapted from:
– [1] Machine Learning: Tom Mitchell (get this book).
– [2] Introduction to Machine Learning: Ethem Alpaydin.
– [3] Introduction to Expert Systems: Peter Jackson.
– [4] An Introduction to Fuzzy Logic and Fuzzy Sets: Buckley, Eslami.
– [5] Pattern Recognition and Machine Learning: Christopher Bishop.
– [6] Grammatical Inference: Colin de la Higuera.
– [7] Artificial Intelligence: Negnevitsky.
– Miscellaneous web references.

Kristian Guillaumier, 2011 2
CONCEPT LEARNING, FIND-S,
CANDIDATE ELIMINATION
(main source: Mitchell [1])
Kristian Guillaumier, 2011 3
Note on Induction
• If a large number of items I have seen so far all possess
some property, then ALL items possess that property.
• So far the sun has always risen…
• Are all swans white…

• We never know whether our induction is true (we
have not proven it).

• In machine learning:
– Input: A number of training examples for some function.
– Output: a hypothesis that approximates the function.
Kristian Guillaumier, 2011 4
Concept Learning
• Learning: inducing general functions from specific training
examples (+ve and/or –ve).
• Concept learning: induce the definition of a general category (e.g.
‘cat’) from a sample of +ve and –ve training data.
• Search a space of potential hypotheses (hypothesis space) for a
hypothesis that best fits the training data provided.
• Each concept can be viewed as the description of some subset
(there is a general-to-specific ordering) defined over a larger set.
E.g.:
– Cat ⊂ Feline ⊂ Animal ⊂ Object
• A boolean function over the larger set. E.g. the IsCat() function over
the set of animals.
• Concept learning ≡ learning this function from training data.
Kristian Guillaumier, 2011 5
Example
• Learn the concept: good days on which I like to swim.
• We want to learn the function:
– IsGoodDay(input) → true/false.
• Our hypothesis is represented as a conjunction of constraints on
attributes. The attributes and their values are:
– Sky → Sunny/Rainy/Cloudy.
– AirTemp → Warm/Cold.
– Humidity → High/Normal.
– Wind → Strong/Weak.
– Water → Warm/Cool.
– Forecast → Same/Change.
• Other possible constraints (values) for an attribute:
– ? means ‘I don’t care’: any value is acceptable.
– ∅ means no value is acceptable.
(e.g. in “Sky → Sunny/Rainy”, Sky is the attribute and Sunny/Rainy are the constraint values.)
Kristian Guillaumier, 2011 6
Example
• Our hypothesis, then, is a vector of constraints on
these attributes:
<Sky, AirTemp, Humidity, Wind, Water, Forecast>

• An example of a hypothesis (true only on warm days
with normal humidity) is:
<?, Warm, Normal, ?, ?, ?>

• The most general hypothesis is:
<?, ?, ?, ?, ?, ?>

• The most specific hypothesis is:
<∅, ∅, ∅, ∅, ∅, ∅>

Kristian Guillaumier, 2011 7
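A hypothesis of this form is straightforward to represent in code. The following minimal Python sketch (not from the slides; names are illustrative) encodes a hypothesis as a tuple of constraints, using '?' for "any value" and None for the empty constraint ∅, and tests whether an instance satisfies it:

def satisfies(h, x):
    """Return True iff instance x satisfies hypothesis h.
    A constraint is satisfied if it is '?' or equals the attribute's value;
    None (the empty constraint) is satisfied by nothing."""
    return all(c == "?" or c == v for c, v in zip(h, x))

h = ("?", "Warm", "Normal", "?", "?", "?")   # "only warm days with normal humidity"
x = ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same")
print(satisfies(h, x))                       # True

most_general = ("?",) * 6                    # <?, ?, ?, ?, ?, ?> accepts every instance
most_specific = (None,) * 6                  # <∅, ∅, ∅, ∅, ∅, ∅> accepts no instance
print(satisfies(most_general, x), satisfies(most_specific, x))   # True False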
Example
• Training Examples:
Sky AirTemp Humidity Wind Water Forecast IsGoodDay
Sunny Warm Normal Strong Warm Same Yes
Sunny Warm High Strong Warm Same Yes
Rainy Cold High Strong Warm Change No
Sunny Warm High Strong Cool Change Yes
Kristian Guillaumier, 2011 8
Notation
• The set of all items over which the concept is defined is called the set
of instances, X.
– The set of all days represented by the attributes AirTemp,
Humidity, …
– The set of all animals, etc.
• An instance in X is denoted by x (x ∈ X).
• The concept to be learnt (e.g. cats over animals, good days
to swim over all days) is called the target concept, denoted
by c (note that c is the target function).
– c is a Boolean-valued function defined over the instances X.
– i.e. the function c takes an instance x ∈ X and, in our example,
• c(x) = 1 if IsGoodDay is Yes
• c(x) = 0 if IsGoodDay is No.


Kristian Guillaumier, 2011 9
Notation
• When learning the target concept, c, the learner
is given a training set that consists of:
– A number of instances x from X.
– For each instance, we have the value of the target
concept c(x).
– Instances where c(x) = 1 are called +ve training
examples (members of target concept).
– Instances where c(x) = 0 are called –ve training
examples (non-members).
• A training example is usually denoted by:
<x, c(x)>
An instance and its target concept value
Kristian Guillaumier, 2011 10
Notation
• Given the set of training examples, we want to
learn (hypothesize) the target concept c.
• H is the set of all hypotheses that we are
considering (all the possible combinations of
<Sky, AirTemp, Humidity, Wind, Water,
Forecast>).
• Each hypothesis in H is denoted by h and is usually
a Boolean-valued function h : X → {0, 1}.
• The goal of the learner is to find an h such that:
– h(x) = c(x) for all x ∈ X.


Kristian Guillaumier, 2011 11
Concept Learning Task (Mitchell pg. 22)
Kristian Guillaumier, 2011 12
The Inductive Learning Hypothesis
• Our main goal is to find a hypothesis h that is identical to c for all
x ∈ X (for every possible instance).
• The only information we have on c is the training data.
• What about ‘unseen’ instances (where we don’t have training data)?
At best we can guarantee that our learner will learn a hypothesis
that fits the training data exactly (not good enough).
• We make an assumption… the inductive learning hypothesis…

• Any hypothesis that approximates the target function well over a
sufficiently large set of training examples will also approximate
the target function well for other unobserved/unseen examples.




Kristian Guillaumier, 2011 13
Size and Ordering of the Search Space
• We must search in the hypothesis space for the best one (that matches c).
• The size of the hypothesis space is defined by the hypothesis representation.
• Recall:
– Sky → Sunny/Rainy/Cloudy (3 options).
– AirTemp → Warm/Cold (2 options).
– Humidity → High/Normal (2 options).
– Wind → Strong/Weak (2 options).
– Water → Warm/Cool (2 options).
– Forecast → Same/Change (2 options).
• Which means that we have 3×2×2×2×2×2 = 96 distinct instances.
• With the addition of the symbols ? and ∅ we have 5×4×4×4×4×4 = 5120
syntactically distinct hypotheses.
• However, note that any hypothesis that contains one or more ∅’s by definition
classifies every instance as –ve. So the number of semantically distinct
hypotheses is:
– 1 + 4×3×3×3×3×3 = 973
(each factor counts the distinct attribute values plus 1 for the ?; the leading 1 stands
for all the hypotheses containing ∅, which are all equivalent to “always negative”.)
Kristian Guillaumier, 2011 14
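The counts above are easy to check mechanically. A small Python sketch (assuming the attribute sizes listed on this slide):

sizes = [3, 2, 2, 2, 2, 2]              # number of values per attribute

instances = 1
for k in sizes:
    instances *= k                      # 3*2*2*2*2*2 = 96 distinct instances

syntactic = 1
for k in sizes:
    syntactic *= k + 2                  # each attribute also allows '?' and the empty constraint

semantic = 1
for k in sizes:
    semantic *= k + 1                   # the attribute values plus '?'
semantic += 1                           # plus one hypothesis standing for "contains an empty constraint"

print(instances, syntactic, semantic)   # 96 5120 973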
Size and Ordering of the Search Space
• Consider the following hypotheses:
– h1: <Sunny, ?, ?, Strong, ?, ?>
– h2: <Sunny, ?, ?, ?, ?, ?>
• Consider the sets of instances that are classified as +ve by h1 and by h2.
• Clearly, since h2 has fewer constraints, it will classify more instances
as +ve than h1. In fact, anything that is classified as +ve by h1 will also be
classified as +ve by h2.
• We say that h2 ‘is more general than’ h1.
• This allows us to organize the search space (order the set) according
to this relationship between the hypotheses in it.
• This ordering concept is very important because there are concept
learning algorithms that make use of this ordering.
Kristian Guillaumier, 2011 15
Size and Ordering of the Search Space
• For any instance x ∈ X and hypothesis h ∈ H,
– x satisfies h iff h(x) = 1.
• The “is more general than or equal to” relationship is denoted by ≥g.
• Given two hypotheses hj and hk, hj ≥g hk iff any
instance that satisfies hk also satisfies hj:

hj ≥g hk   iff   ∀x ∈ X, (hk(x) = 1) → (hj(x) = 1)
Kristian Guillaumier, 2011 16
Size and Ordering of the Search Space
• Just as we have defined ≥g, it is useful to
define: strictly more general than (>g), more
specific than or equal to (≤g), …

Kristian Guillaumier, 2011 17
Size and Ordering of the Search Space
Kristian Guillaumier, 2011 18
Set of all instances
Each hypothesis corresponds
to a subset of X (arrows).
h2 ‘contains’ h1 and h3 because it is more general.
h1 and h3 are not more general or specific than each other
Size and Ordering of the Search Space
• Notes:
– ≥g is a partial order over the hypothesis space H (it is reflexive, antisymmetric,
and transitive).

• Reflexivity, or Reflexive Relation
– A reflexive relation is a binary relation R over a set S where every element in S is related to
itself.
– That is, ∀x ∈ S, xRx holds true.
– For example, the ≤ relation over Z+ is reflexive because ∀x ∈ Z+, x ≤ x.
• Transitivity, or Transitive Relation
– A transitive relation is a binary relation R over a set S where ∀a, b, c ∈ S: aRb ∧ bRc ⇒ aRc.
– For example, the ≤ relation over Z+ is transitive because ∀a, b, c ∈ Z+, if a ≤ b and b ≤ c then
a ≤ c.
• Antisymmetry, or Antisymmetric Relation
– An antisymmetric relation is a binary relation R over a set S where ∀a, b ∈ S, aRb ∧ bRa ⇒ a = b.
– An equivalent way of stating this is that ∀a, b ∈ S, aRb ∧ a ≠ b ⇒ ¬bRa.
– For example, the ≤ relation over Z+ is antisymmetric because ∀a, b ∈ Z+, if a ≤ b and b ≤ a
then a = b.
Kristian Guillaumier, 2011 19
The Find-S Algorithm
Initialise h to the most specific hypothesis in H.

For each +ve training instance x
    For each attribute constraint a_i in h
        If constraint a_i is satisfied by x Then
            Do nothing
        Else
            Replace a_i in h by the next more
            general constraint that is satisfied by x

Return h
Kristian Guillaumier, 2011 20
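As a minimal Python sketch (not the slides' own code), Find-S for this conjunctive representation can be written as follows; '?' stands for "any value" and None for the empty constraint ∅:

def find_s(examples):
    """examples: list of (instance_tuple, label) pairs; only the +ve ones are used."""
    n = len(examples[0][0])
    h = [None] * n                       # start with the most specific hypothesis
    for x, positive in examples:
        if not positive:                 # Find-S ignores -ve examples
            continue
        for i, value in enumerate(x):
            if h[i] is None:             # empty constraint: adopt the observed value
                h[i] = value
            elif h[i] != value:          # conflicting value: generalise to '?'
                h[i] = "?"
    return tuple(h)

training = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), True),
]
print(find_s(training))   # ('Sunny', 'Warm', '?', 'Strong', '?', '?'), as in the trace that follows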
The Find-S Algorithm
• Initialise h to the most specific hypothesis:
h = <∅, ∅, ∅, ∅, ∅, ∅>
• Start with the first +ve training example:
x = <Sunny, Warm, Normal, Strong, Warm, Same>
• Consider the first attribute, Sky. Our hypothesis says ∅ (most
specific) but our training example says ‘Sunny’ (more general), so
the attribute in the hypothesis does not satisfy the attribute in x.
Pick the next more general attribute value after ∅, which is ‘Sunny’.
• Repeat for all attributes AirTemp, Humidity, … until we get:
h = <Sunny, Warm, Normal, Strong, Warm, Same>
• After having covered the 1st training example, h is more general
than what we started with. However, it is still very specific – it will
classify all possible instances as –ve except the one +ve training
example it has seen.
• Continue with the next +ve training example.
Kristian Guillaumier, 2011 21
The Find-S Algorithm
• The next example is:
x = <Sunny, Warm, High, Strong, Warm, Same>
• Recall that so far:
h = <Sunny, Warm, Normal, Strong, Warm, Same>
• Loop through all the attributes: 1st satisfied – do nothing, 2nd satisfied – do
nothing, 3rd not satisfied – pick the next more general constraint, etc.
x = <Sunny, Warm, High, Strong, Warm, Same>
h = <Sunny, Warm, Normal, Strong, Warm, Same>

new h…
h = <Sunny, Warm, ?, Strong, Warm, Same>

• We complete the algorithm to get the hypothesis:
h = <Sunny, Warm, ?, Strong, ?, ?>


Kristian Guillaumier, 2011 22
The Find-S Algorithm - Observation
• Remember that the algorithm skipped the 3rd training
example because it was –ve.
• However, we observe that the hypothesis we had
generated so far was already consistent with this –ve training
example.
• After considering the 2nd training example, our
hypothesis was:
h = <Sunny, Warm, ?, Strong, Warm, Same>
• The 3rd training example (that was skipped) was:
x = <Rainy, Cold, High, Strong, Warm, Change>
• Note that h is already consistent with this training
example.
Kristian Guillaumier, 2011 23
The Find-S Algorithm - Observation
• As long as the hypothesis space H contains the hypothesis that describes
the target concept c and there are no errors in the training data, the
current hypothesis can never be inconsistent with a –ve training example.

• To see why:
– h is the most specific hypothesis in H that is consistent with the currently
observed training examples.
– Since we assume the target concept c is in H and is (obviously) consistent with
the positive training examples, c must be ≥g h.
– But c will never cover a –ve example, so neither will h (by the definition of ≥g).
Clearly, if a more general hypothesis does not misclassify a –ve example, the
more specific one cannot misclassify it either.
Kristian Guillaumier, 2011 24
Consider Animal ≥g Cat:

If cat then animal.
If animal then maybe cat.
If not cat then maybe not animal.
If not animal then not cat (if a –ve example is correctly classified by the more general hypothesis, it is also correctly classified by the more specific one).
The Find-S Algorithm – Issues Raised
• Find-S is guaranteed to find the most specific h ∈ H that is
consistent with the +ve and –ve training examples, assuming
that the training examples are correct (no noise).
• Issues:
– We don’t know whether we found the only hypothesis that is
consistent. There might be more hypotheses that are consistent.
– Find-S will find the most specific hypothesis consistent with the
training data. Why not the most general consistent hypothesis?
Why not something in between?
– How do we know if the training data is consistent? In real-life
cases, training data may contain noise or errors.
– There might be more than one maximally specific hypothesis –
which one do we pick?

Kristian Guillaumier, 2011 25
The Candidate Elimination (CE)
Algorithm
• Note that although the hypothesis output by Find-S is consistent
with the training data, it is only one of the possibly many hypotheses
that are consistent.
• CE will output (a description of) all the hypotheses consistent with
the training data.
• Interestingly, it does so without enumerating the whole space.

• CE finds all describable hypotheses that are consistent with the
observed training examples.

• Defn: h is consistent with a set of training examples D iff h(x) = c(x)
for each example <x, c(x)> in D.

Consistent(h, D) ≡ ∀<x, c(x)> ∈ D, h(x) = c(x)
Kristian Guillaumier, 2011 26
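As a tiny sketch, the consistency test itself is a one-liner over the tuple representation used in the earlier sketches (here D is assumed to be a list of (instance, label) pairs):

def satisfies(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def consistent(h, D):
    """h is consistent with D iff h(x) = c(x) for every <x, c(x)> in D."""
    return all(satisfies(h, x) == c_x for x, c_x in D)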
The Candidate Elimination (CE)
Algorithm
• The subset of all the hypotheses that are consistent
with the training data (what CE finds) is called the
version space WRT the hypothesis space H and the
training data D – it contains all the possible, consistent
versions of the target concept.

• Defn: the version space, denoted VS_{H,D}, with respect to
the hypothesis space H and the training data D, is the
subset of hypotheses from H consistent with the
training data in D.

VS_{H,D} ≡ {h ∈ H | Consistent(h, D)}

Kristian Guillaumier, 2011 27
The List-Then-Eliminate (LTE)
Algorithm
• A possible representation of a version space is a
listing of all the elements (hypotheses) in it.

List-Then-Eliminate:

VersionSpace = a list of every hypothesis in H
For each training example <x, c(x)>:
    remove from VersionSpace any hypothesis h
    where h(x) ≠ c(x)

Return VersionSpace


Kristian Guillaumier, 2011 28
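Because the conjunctive hypothesis space of the running example is tiny, List-Then-Eliminate can actually be run. A Python sketch, assuming the attribute values listed earlier (hypotheses containing ∅ are skipped, since they cannot be consistent with any +ve example):

from itertools import product

DOMAINS = [["Sunny", "Rainy", "Cloudy"], ["Warm", "Cold"], ["High", "Normal"],
           ["Strong", "Weak"], ["Warm", "Cool"], ["Same", "Change"]]

def satisfies(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def list_then_eliminate(examples):
    # enumerate every conjunctive hypothesis made of attribute values and '?'
    version_space = list(product(*[values + ["?"] for values in DOMAINS]))
    for x, label in examples:            # keep only hypotheses with h(x) = c(x)
        version_space = [h for h in version_space if satisfies(h, x) == label]
    return version_space

training = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), True),
]
print(len(list_then_eliminate(training)))   # 6 consistent hypotheses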
The List-Then-Eliminate (LTE)
Algorithm
• LTE can be applied whenever the hypothesis
space is finite (not always the case).
• It has the advantage of simplicity and the fact
that it will always work (guaranteed to output all
the hypotheses consistent with the training data).

• However, enumerating all the hypotheses in H is
unrealistic for all but the most trivial cases.
• We need a more compact representation.
Kristian Guillaumier, 2011 29
Compact Representation of a Version
Space
• Recall that in our previous example, Find-S
found the hypothesis:
h = <Sunny, Warm, ?, Strong, ?, ?>
• This is only one of 6 possible hypotheses that
are consistent with the training examples.
• We can illustrate the 6 possible hypotheses in
the next diagram…
Kristian Guillaumier, 2011 30
Compact Representation of a Version
Space
Kristian Guillaumier, 2011 31
Compact Representation of a Version
Space
Kristian Guillaumier, 2011 32
Most specific.
Most general.
An arrow represents the ≥g relation.
Compact Representation of a Version
Space
Kristian Guillaumier, 2011 33
Given only the 2 sets S and G, we can generate all the hypotheses ‘in between’.

Try it!
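One way to "try it" in code: generate every hypothesis that lies between the two boundaries, i.e. at least as general as some member of S and at most as general as some member of G. A Python sketch, assuming the S and G boundaries obtained later for the running example:

from itertools import product

S = [("Sunny", "Warm", "?", "Strong", "?", "?")]
G = [("Sunny", "?", "?", "?", "?", "?"), ("?", "Warm", "?", "?", "?", "?")]
DOMAINS = [["Sunny", "Rainy", "Cloudy"], ["Warm", "Cold"], ["High", "Normal"],
           ["Strong", "Weak"], ["Warm", "Cool"], ["Same", "Change"]]

def more_general_eq(h1, h2):
    """h1 >=g h2 for hypotheses built from attribute values and '?' (no empty constraints)."""
    return all(a == "?" or a == b for a, b in zip(h1, h2))

version_space = [h for h in product(*[values + ["?"] for values in DOMAINS])
                 if any(more_general_eq(h, s) for s in S)     # h is at least as general as some s
                 and any(more_general_eq(g, h) for g in G)]   # and at most as general as some g
print(len(version_space))                                     # the 6 hypotheses of the diagram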
Compact Representation of a Version
Space
• Intuitively we see that by having these general and specific ‘boundaries’
we can generate the whole version space (check sketch proof in Mitchell).
• A few definitions.
• Defn: the general boundary G (remember that G is a set) WRT a
hypothesis space H and training data D is the set of maximally general
members of H consistent with D.

G ≡ {g ∈ H | Consistent(g, D) ∧ (¬∃g′ ∈ H)[(g′ >g g) ∧ Consistent(g′, D)]}

• Defn: the specific boundary S WRT a hypothesis space H and training data
D is the set of minimally general (maximally specific) members of H
consistent with D.

S ≡ {s ∈ H | Consistent(s, D) ∧ (¬∃s′ ∈ H)[(s >g s′) ∧ Consistent(s′, D)]}

Kristian Guillaumier, 2011 34
(Back to) The Candidate Elimination
(CE) Algorithm
• CE computes the version space containing all
the hypotheses in H that are consistent with
the observed training data D.
• First we initialise G (remember it is a set) to
contain the most general hypothesis possible:
G0 = {<?, ?, ?, ?, ?, ?>}
• Then we initialise S to contain the most
specific hypothesis possible:
S0 = {<∅, ∅, ∅, ∅, ∅, ∅>}
Kristian Guillaumier, 2011 35
The Candidate Elimination (CE)
Algorithm
• So far the two boundaries delimit the whole
hypothesis space (every h in H is between G0 and S0).
• As each training example is considered, the
boundary sets S and G are generalised and
specialised respectively, to eliminate from the
version space any hypothesis in H that is
inconsistent.
• At the end we’ll end up with the correct
boundary sets.
Kristian Guillaumier, 2011 36
The Candidate Elimination (CE)
Algorithm
Kristian Guillaumier, 2011 37
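The algorithm itself appeared as a figure on this slide (the version given in Mitchell [1]). The sketch below is an illustrative Python rendering of the same idea for conjunctive hypotheses, written for this running example rather than taken from the slides; None stands for the empty constraint ∅ and '?' for "any value".

DOMAINS = [["Sunny", "Rainy", "Cloudy"], ["Warm", "Cold"], ["High", "Normal"],
           ["Strong", "Weak"], ["Warm", "Cool"], ["Same", "Change"]]

def satisfies(h, x):
    """h(x) = 1 iff every constraint is '?' or equals the attribute value (None matches nothing)."""
    return all(c == "?" or c == v for c, v in zip(h, x))

def more_general_eq(h1, h2):
    """h1 >=g h2: constraint-wise, h1 is at least as general as h2."""
    return all(a == "?" or b is None or a == b for a, b in zip(h1, h2))

def min_generalization(s, x):
    """The minimal generalisation of s that covers the +ve instance x."""
    new = list(s)
    for i, v in enumerate(x):
        if new[i] is None:
            new[i] = v                   # empty constraint: adopt the observed value
        elif new[i] not in ("?", v):
            new[i] = "?"                 # conflicting value: relax to '?'
    return tuple(new)

def min_specializations(g, x):
    """The minimal specialisations of g that exclude the -ve instance x."""
    out = []
    for i, values in enumerate(DOMAINS):
        if g[i] == "?":
            out.extend(g[:i] + (v,) + g[i + 1:] for v in values if v != x[i])
    return out

def candidate_elimination(examples):
    S = [tuple([None] * len(DOMAINS))]   # most specific boundary
    G = [tuple(["?"] * len(DOMAINS))]    # most general boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if satisfies(g, x)]
            new_S = []
            for s in S:
                if satisfies(s, x):
                    new_S.append(s)
                else:
                    h = min_generalization(s, x)
                    if any(more_general_eq(g, h) for g in G):      # keep only if bounded above by G
                        new_S.append(h)
            new_S = list(dict.fromkeys(new_S))
            S = [s for s in new_S if not any(s != t and more_general_eq(s, t) for t in new_S)]
        else:
            S = [s for s in S if not satisfies(s, x)]
            new_G = []
            for g in G:
                if not satisfies(g, x):
                    new_G.append(g)
                else:
                    for h in min_specializations(g, x):
                        if any(more_general_eq(h, s) for s in S):  # keep only if bounded below by S
                            new_G.append(h)
            new_G = list(dict.fromkeys(new_G))
            G = [g for g in new_G if not any(g != t and more_general_eq(t, g) for t in new_G)]
    return S, G

training = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), True),
]
S, G = candidate_elimination(training)
print(S)   # [('Sunny', 'Warm', '?', 'Strong', '?', '?')]
print(G)   # [('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')]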
The Candidate Elimination (CE)
Algorithm
• After initialisation:
S0 = {<∅, ∅, ∅, ∅, ∅, ∅>}
G0 = {<?, ?, ?, ?, ?, ?>}
• Consider the first training example:


• It is positive so:
Kristian Guillaumier, 2011 38
Sky AirTemp Humidity Wind Water Forecast IsGoodDay
Sunny Warm Normal Strong Warm Same Yes
The Candidate Elimination (CE) Algorithm
• Part 1: all hypotheses in G are consistent with the
training example, so we don’t remove anything.
• Part 2:
– There is only one s in S (<∅, ∅, ∅, ∅, ∅, ∅>), which is
inconsistent:
• Remove <∅, ∅, ∅, ∅, ∅, ∅> from S, leaving S = {}.
• Add to S all minimal generalisations h of s.
– i.e. we add <Sunny, Warm, Normal, Strong, Warm, Same> to S.
• Remove from S any hypothesis that is more general than any
other hypothesis in S.
– There is only 1 hypothesis in S so we do nothing.
Kristian Guillaumier, 2011 39
Sky AirTemp Hum. Wind Water Forecast IsGoodDay
Sunny Warm Normal Strong Warm Same Yes
• So far we got:
Kristian Guillaumier, 2011 40
The Candidate Elimination (CE) Algorithm
• Read the second training example. It is positive as well, so:
Kristian Guillaumier, 2011 41
The Candidate Elimination (CE) Algorithm
Sky AirTemp Hum Wind Water Forecast IsGoodDay
Sunny Warm High Strong Warm Same Yes
• Part 1: all hypotheses in G are consistent with the
training example, so we don’t remove anything.
• Part 2:
– There is only one s in S (<Sunny, Warm, Normal, Strong,
Warm, Same>) which is inconsistent:
• Remove it from S, leaving S={}.
• Add to S all minimal generalisations h of s.
– i.e. we add <Sunny, Warm, ?, Strong, Warm, Same> to S.
• Remove from S any hypothesis that is more general than any other
hypothesis in S.
– There is only 1 hypothesis in S so we do nothing.
• So far we got:
Kristian Guillaumier, 2011 42
The Candidate Elimination (CE) Algorithm
The Candidate Elimination (CE)
Algorithm
• We notice that the role of +ve training examples is to
make the S boundary more general and the role of
the –ve training examples is to make the G boundary
more specific.
• Consider the 3rd training example, which is –ve.
Kristian Guillaumier, 2011 43
Sky AirTemp Humidity Wind Water Forecast IsGoodDay
Rainy Cold High Strong Warm Change No
• Step 1: S contains <Sunny, Warm, ?, Strong, Warm,
Same>, which is consistent because it labels the training
example as ‘NO’ – we do nothing.
• Step 2:
– The only hypothesis in G that is not consistent with d is
<?,?,?,?,?,?>, because it labels the example as ‘YES’. Remove it, leaving
G = {}.
– Add to G all minimal specialisations of g.
– Continued…

Kristian Guillaumier, 2011 44
The Candidate Elimination (CE) Algorithm
Sky AirT Hum Wind Water Fore IsGoodDay
Rainy Cold High Strong Warm Change No
The Candidate Elimination (CE) Algorithm
• The g we removed is <?,?,?,?,?,?>, all the minimal
specialisations of it would be (remember we want to
label the training example as ‘NO’):
• Sky: <Sunny, ?, ?, ?, ?, ?>, <Cloudy, ?, ?, ?, ?, ?>
• Air Temp: <?,Warm,?,?,?,?>
• Humidity: <?,?,Normal,?,?,?>
• Wind: <?,?,?,Weak,?,?>
• Water: <?,?,?,?,Cool,?>
• Forecast: <?,?,?,?,?,Same>
• However, not all these minimal specialisations go
into the new G.

Kristian Guillaumier, 2011 45
Sky AirT Hum Wind Water Fore IsGoodDay
Rainy Cold High Strong Warm Change No
• Only <Sunny, ?, ?, ?, ?, ?>, <?,Warm,?,?,?,?> and <?,?,?,?,?,Same>
go into the new G.
• <Cloudy, ?, ?, ?, ?, ?>, <?,?,Normal,?,?,?>, <?,?,?,Weak,?,?> and
<?,?,?,?,Cool,?> are not part of the new G.
• Why?
• Because each of them is inconsistent with the previously encountered training
examples (so far we have seen training items 1 and 2):
– <Cloudy, ?, ?, ?, ?, ?> is inconsistent with training items 1 and 2.
– <?,?,Normal,?,?,?> is inconsistent with training item 2.
– <?,?,?,Weak,?,?> is inconsistent with training items 1 and 2.
– <?,?,?,?,Cool,?> is inconsistent with training items 1 and 2.


Kristian Guillaumier, 2011 46
The Candidate Elimination (CE) Algorithm
Sky AirTemp Humidity Wind Water Forecast IsGoodDay
Sunny Warm Normal Strong Warm Same Yes
Sunny Warm High Strong Warm Same Yes
Rainy Cold High Strong Warm Change No
Sunny Warm High Strong Cool Change Yes
The Candidate Elimination (CE)
Algorithm
• So far we got:

Kristian Guillaumier, 2011 47
The Candidate Elimination (CE)
Algorithm
• After processing the 4th training item, we get:
Kristian Guillaumier, 2011 48
The Candidate Elimination (CE)
Algorithm
• The entire version space derived from the
boundaries is:
Kristian Guillaumier, 2011 49
Converging
• CE will converge towards the target concept if:
– There are no errors in the training data.
– The target concept is in H.
• The target concept is exactly learnt when S and G converge
to a single and identical hypothesis.
• If the training data contains errors e.g. a +ve example is
incorrectly labeled as –ve:
– The algorithm will remove the correct target concept from the
version space.
– Eventually, if we are presented with enough training data, we
will detect an inconsistency because the G and S boundaries will
converge to an empty version space (i.e. there is no hypothesis
in H that is consistent with all the training examples).


Kristian Guillaumier, 2011 50
Requesting Training Examples
• So far, our algorithm was given a set
containing labeled training data.
• Suppose that our algorithm can come up with
an instance and ask (query) an external oracle
to label it.
• What instance should the algorithm come up
with for an answer from the oracle?
Kristian Guillaumier, 2011 51
Requesting Training Examples
• Consider the version space we got from the 4 fixed training examples we
had.
• What training example would we like to have, to refine it further?
• We should come up with an instance that will be classified as +ve by some
hypotheses and –ve by others, to further reduce the size of the version
space.
Kristian Guillaumier, 2011 52
Requesting Training Examples
• Suppose we request the training example:
<Sunny, Warm, Normal, Weak, Warm, Same>
• 3 hypotheses would classify it as +ve and 3 would classify it as
–ve:
Kristian Guillaumier, 2011 53
Requesting Training Examples
• So if we ask the oracle to classify:
<Sunny, Warm, Normal, Weak, Warm, Same>
• We’d either generalise the S boundary or
specialise the G boundary, and shrink the size of
the version space (make it converge).
• In general, the optimal instance we’d like the
oracle to classify (the best training example to
have next) is the one that would halve the size of
the version space.
• If we can always ask for such an instance, we can
converge to the target concept in about log2(|VS|) queries.
Kristian Guillaumier, 2011 54
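A small sketch of this idea: assuming the 6-hypothesis version space of the running example, scan all 96 instances and pick one on which the version-space vote is as close to an even split as possible.

from itertools import product

DOMAINS = [["Sunny", "Rainy", "Cloudy"], ["Warm", "Cold"], ["High", "Normal"],
           ["Strong", "Weak"], ["Warm", "Cool"], ["Same", "Change"]]
VERSION_SPACE = [
    ("Sunny", "Warm", "?", "Strong", "?", "?"),
    ("Sunny", "?",    "?", "Strong", "?", "?"),
    ("Sunny", "Warm", "?", "?",      "?", "?"),
    ("?",     "Warm", "?", "Strong", "?", "?"),
    ("Sunny", "?",    "?", "?",      "?", "?"),
    ("?",     "Warm", "?", "?",      "?", "?"),
]

def satisfies(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def positive_votes(x):
    return sum(satisfies(h, x) for h in VERSION_SPACE)

# choose the instance whose vote is closest to an even split of the version space
query = min(product(*DOMAINS), key=lambda x: abs(positive_votes(x) - len(VERSION_SPACE) / 2))
print(query, positive_votes(query))   # an instance on which the version space splits 3 / 3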
Partially Learned Concepts
• Partially learned = we didn’t converge to the
target concept (S and G are not the same).
• Our previous example is a partially learned
concept:
Kristian Guillaumier, 2011 55
Partially Learned Concepts
• It is possible to classify unseen examples with a degree of certainty.
• Suppose we want to classify the instance (not in training data)…

<Sunny, Warm, Normal, Strong, Cool, Change>

• …using our partially learned concept.
Kristian Guillaumier, 2011 56
Notice that every hypothesis in
the version space classifies the
unseen instance as +ve.

So all the hypotheses classify it
as +ve with the same confidence
as if only the target concept
had remained (i.e. as if we had
converged).
Partially Learned Concepts
Kristian Guillaumier, 2011 57
Sky AirTemp Humidity Wind Water Forecast Vote of the version space
Sunny Warm Normal Strong Cool Change All hypotheses classify it as +ve.
Rainy Cold Normal Weak Warm Same All hypotheses classify it as –ve.
Sunny Warm Normal Weak Warm Same 3 +ve, 3 –ve (50/50): need more training examples. Note: this is an optimal query to request from an oracle.
Sunny Cold Normal Strong Warm Same 2 +ve, 4 –ve: possibly take a majority vote and output a confidence level.
Inductive Bias
• Recall that our system, so far, works assuming the
target concept exists in our hypothesis space.
• Also recall that our hypothesis space allowed only
for conjunctions (AND) of attribute values.
• There is no way to allow for a disjunction of
values – we cannot say “Sky=Cloudy OR
Sky=Sunny”.
• Consider what would happen if in fact, I like
swimming if it is cloudy or sunny. I’d get
something like…
Kristian Guillaumier, 2011 58
Inductive Bias
Kristian Guillaumier, 2011 59
• CE will converge to an empty version space: the target concept is not in
the hypothesis space. To see why:
• The most specific hypothesis that classifies the first two examples as +ve
is:

<?, Warm, Normal, Strong, Cool, Change>

• Although it is maximally specific for the first 2 examples, it is already too
general: it will classify the 3rd example as +ve too.
• The problem is that we biased our learner to consider only hypotheses
that are conjunctions.
Sky AirTemp Humidity Wind Water Forecast IsGoodDay
Sunny Warm Normal Strong Cool Change Yes
Cloudy Warm Normal Strong Cool Change Yes
Rainy Warm Normal Strong Cool Change No
Unbiased Learning
• Let’s see what happens if, to make sure that the target concept
definitely exists in the hypothesis space, we define the hypothesis
space to contain every possible concept.
– This means that it must be possible to represent every possible subset of X.
• In our previous example (containing 6 attributes), the size of the
instance space is 96.
• How many possible concepts can be defined over this set of
instances?
• The power set!
– Recall that the size of a power set is (in general) 2^|X|.
– So there are 2^96 (ouch) possible concepts that can be learnt from our
instance space.
– We had seen that by introducing ? and ∅, we allowed for only 973 possible
concepts, which is ≪ 2^96 (we had a very strong bias).

Kristian Guillaumier, 2011 60
Unbiased Learning
• Let’s define a new hypothesis space H′ that can
represent every subset of instances, i.e.

H′ = P(X), the power set of X.
• To do this we allow hypotheses in H′ to be arbitrary combinations of
disjunctions, conjunctions and negations of the earlier constraints. E.g. the
target concept “Sky=Sunny OR Sky=Cloudy” would be:
<Sunny,?,?,?,?,?> ∨ <Cloudy,?,?,?,?,?>

• So we can use CE knowing that our target concept will
definitely exist in the hypothesis space. But…
• We create a new problem. Our learner will learn how
to classify exactly the instances presented as training
examples and will not generalise beyond them!
Kristian Guillaumier, 2011 61
Unbiased Learning
• To see why, suppose I have 5 training examples d1, d2,
d3, d4, d5, and that d1, d2, d3 are +ve examples and d4,
d5 are –ve examples.
• The S boundary will become a disjunction of the +ve
examples (since that is the most specific possible
hypothesis that covers them):
S = {(d1 ∨ d2 ∨ d3)}
• The G boundary will become a negation (ruling out) of
the negative training examples:
G = {¬(d4 ∨ d5)}
• So the only unambiguously classifiable instances are
those that were provided as training examples.


Kristian Guillaumier, 2011 62
Unbiased Learning
• What would happen if we used the partially learned concept
and took a vote?
• Instances that were originally in the training data will be
classified unambiguously (obviously).
• Any other instance not in the training data will be classified
as +ve by half of the hypotheses in the version space and as
–ve by the other half.
– Note that H is the power set of X.
– Let x be some unobserved instance (not in the training data).
– Then there is some h in the version space that covers x (classifies it as +ve).
– But there is also a corresponding h′ in the version space that is identical
except that it classifies x as –ve (¬h on x).

Kristian Guillaumier, 2011 63
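The 50/50 effect is easy to demonstrate on a toy instance space. A Python sketch (the four-instance space and its labels are hypothetical, not from the slides), where a hypothesis in the unbiased space H′ is simply the subset of X it labels +ve:

from itertools import chain, combinations

X = ["x1", "x2", "x3", "x4"]            # a tiny instance space
observed = {"x1": True, "x2": False}    # training data: one +ve and one -ve example

def powerset(items):
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

# keep every subset of X that agrees with the observed labels
version_space = [h for h in map(set, powerset(X))
                 if all((x in h) == label for x, label in observed.items())]

for x in X:
    pos = sum(x in h for h in version_space)
    print(x, pos, len(version_space) - pos)
# x1 4 0 and x2 0 4 (the training examples are classified unambiguously),
# but x3 2 2 and x4 2 2: every unseen instance is +ve for exactly half the version space.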
More on Bias
• Straight from Mitchell [1]:
A learner that makes no a priori
assumptions regarding the identity of the
target concept has no rational basis for
classifying any unseen instances.
• (In fact, CE worked because we biased it with
the assumption that the target concept can be
represented by a conjunction of attribute
values.)

Kristian Guillaumier, 2011 64
More on Bias
• Consider:
– L = a learning algorithm.
– L has a set of training data Dc = {<x, c(x)>}.
– c = some target concept.
– xi is some instance we wish to classify.
– L(xi, Dc) = the classification (+ve/–ve) that L assigns to xi after
learning from the training data Dc.
• The inductive inference step is:
(Dc ∧ xi) ≻ L(xi, Dc)
• where a ≻ b denotes that b is inductively inferred from a.
• So the inductive inference step reads: “given the training
data Dc and the instance xi as inputs to L, we can
inductively infer the classification of the instance.”

Kristian Guillaumier, 2011 65
More on Bias
Kristian Guillaumier, 2011 66
Sky AirTemp Humidity Wind Water Forecast IsGoodDay
Sunny Warm Normal Strong Warm Same Yes
Sunny Warm High Strong Warm Same Yes
Rainy Cold High Strong Warm Change No
Sunny Warm High Strong Cool Change Yes

(Dc ∧ xi) ≻ L(xi, Dc)
More on Bias
• Because L is an inductive learning algorithm, in
general we cannot prove that the result L(xi, Dc)
is correct, i.e. the classification of the example
does not necessarily follow deductively from the
training data (it cannot be proven from the data alone).
• However, we can add a number of assumptions
to our system so that the classification would
follow deductively.
• The inductive bias of L is defined as this set of
assumptions.
Kristian Guillaumier, 2011 67
More on Bias
• Let B = these assumptions (e.g. “the hypothesis
space is made up only of conjunctions of
attribute values”).
• Then the inductive bias of L is B, giving:
(B ∧ Dc ∧ xi) ⊢ L(xi, Dc)

• where the notation a ⊢ b denotes that b
follows deductively from a (b is provable from
a).
Kristian Guillaumier, 2011 68
Defn. of Inductive Bias
Consider a concept learning algorithm L for the set of instances X.

Let c be an arbitrary concept over X and let Dc = {<x, c(x)>} be an
arbitrary set of training examples of c.

Let L(xi, Dc) denote the classification assigned to the instance xi by L
after training on the data Dc.

The Inductive Bias of L is any minimal set of assertions B such that for
any target concept c and corresponding training examples Dc:

(∀xi ∈ X)[(B ∧ Dc ∧ xi) ⊢ L(xi, Dc)]
Kristian Guillaumier, 2011 69
The Inductive Bias of CE
• Let us specify what L(xi, Dc) means for CE (how
classification works).
– Given training data Dc, CE will compute the version space
VS_{H,Dc}.
– Then it will classify a new instance xi by taking a vote
amongst the hypotheses in this version space.
– A classification is output (+ve or –ve) only if all the
hypotheses in the version space unanimously agree.
Otherwise no classification is output (“I can’t tell from the
training data”).
• The inductive bias of CE is that the target concept c is
contained in the hypothesis space, i.e. c ∈ H.
• Why?
Kristian Guillaumier, 2011 70
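A sketch of this voting rule (illustrative names, using the same tuple representation as the earlier sketches):

def satisfies(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def ce_classify(version_space, x):
    """Unanimous vote over the version space, as described above."""
    votes = [satisfies(h, x) for h in version_space]
    if all(votes):
        return True       # every hypothesis agrees: +ve
    if not any(votes):
        return False      # every hypothesis agrees: -ve
    return None           # the hypotheses disagree: "I can't tell from the training data"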
The Inductive Bias of CE
• 1:
– Notice that if we assume that c ∈ H, then it follows
deductively (we can prove) that c ∈ VS_{H,Dc}.
• 2:
– Recall that we defined the classification L(xi, Dc) to be
a unanimous vote amongst all hypotheses in VS_{H,Dc}.
– Thus, if L outputs the classification L(xi, Dc), then so
does every hypothesis h ∈ VS_{H,Dc}, including the
hypothesis c ∈ VS_{H,Dc}.
– Therefore c(xi) = L(xi, Dc).
Kristian Guillaumier, 2011 71

All content in these slides adapted from:
• Most material in these slides from:
– – – –
– – – –

[1] Machine Learning: Tom Mitchell (get this book). [2] Introduction to Machine Learning: Ethem Alpaydin. [3] Introduction to Expert Systems: Peter Jackson. [4] An introduction to Fuzzy Logic and Fuzzy Sets: Buckley, Elsami. [5] Pattern Recognition and Machine Learning: Christopher Bishop. [6] Grammatical Inference: Colin de la Higuera. [7] Artificial Intelligence: Negnevitsky. Miscellaneous web references.
Kristian Guillaumier, 2011 2

(main source: Mitchell [1])

CONCEPT LEARNING, FIND-S, CANDIDATE ELIMINATION
Kristian Guillaumier, 2011 3

Kristian Guillaumier.Note on Induction • If a large number of items I have seen so far all possess some property. • In machine learning: – Input: A number of training examples for some function. then ALL items possess that property. • So far the sun has always risen… • Are all swans white… • We never known whether our induction is true (we have not proven it). – Output: a hypothesis that approximates the function. 2011 4 .

Kristian Guillaumier.g.Concept Learning • Learning: inducing general functions from specific training examples (+ve and/or –ve). E. 2011 5 . the IsCat() function over the set of animals. • Search a space of potential hypotheses (hypothesis space) for a hypothesis that best fits the training data provided. • Concept learning: induce the definition of a general category (e. E. ‘cat’) from a sample of +ve and –ve training data. • Each concept can be viewed as the description of some subset (there is a general-to-specific ordering) defined over a larger set.g.: – Cat  Feline  Animal  Object • A boolean function over the larger set.g. • Concept learning  learning this function from training data.

Example • Learn the concept: Good days when I like to swim. 2011 6 . Attributes: – – – – – – Sky  Sunny/Rainy/Cloudy. Forecast  Same/Change. Wind  Strong/Weak. –  no value is acceptable. Kristian Guillaumier. AirTemp  Warm/Cold. • Our hypothesis is represented as a conjunction of constraints on attributes. • We want to learn the function: – IsGoodDay(input)  true/false. Water  Warm cold. • Other possible contraints (values) for an attribute: – ? ‘I don’t care’. any value is acceptable. Humidity  High/Normal. Attribute Constraints Sky  Sunny/Rainy.

?. Water. . is a vector of constraints on these attributes: <Sky. . .Example • Our hypothesis. Warm. Wind. Forecast> • An example of an hypothesis is (only on warm days with normal humidity): <?. ?> • The most specific hypothesis is: < . ?> • The most general hypothesis is: <?. Humidity.  > Kristian Guillaumier. ?. ?. then. ?. 2011 7 . AirTemp. . Normal. ?. ?.

2011 8 .Example • Training Examples: Sky Sunny Sunny Rainy Sunny AirTemp Warm Warm Cold Warm Humidity Normal High High High Wind Strong Strong Strong Strong Water Warm Warm Warm Cool Forecast Same Same Changes Changes IsGoodDay Yes Yes No Yes Kristian Guillaumier.

the function c. 2011 9 . Kristian Guillaumier. Humidity. cats over animals. takes an instance x  X and in our example • c(x) = 1 if IsGoodDay is Yes • c(x) = 0 if IsGoodDay is no. – i. … – The set of all animals. good days to swim over all days) is called the target concept denoted by c (note that c is the target function).e. • The concept to be learnt (e.Notation • The set of all items over which the concept is called the set of instances.g. X. – The set of all days represented by the attributes AirTemp. – c is a Boolean-valued function defined over the instances X. etc… • An instance in X is denoted by x (x  X).

the learner is given a training set that consists of: – A number of instances x from X. we have the value of the target concept c(x).Notation • When learning the target concept. 2011 10 . c(x)> An instance and its target concept value Kristian Guillaumier. • A training example is usually denoted by: <x. – Instances where c(x) = 0 are called –ve training examples (non-members). c. – Instances where c(x) = 1 are called +ve training examples (members of target concept). – For each instance.

1}.Notation • Given the set of training examples. AirTemp. 2011 11 . Wind. we want to learn (hypothesize) the target concept c. • Each hypothesis in H denoted by h and is usually a boolean valued function h:X{0. • H is the set of all hypotheses that we are considering (all the possible combinations of <Sky. Forecast>). Kristian Guillaumier. Water. • Goal of the learner is to find an h such that: – h(x) = c(x) for all x  X. Humidity.

Concept Learning Task (Mitchell pg. 2011 12 . 22) Kristian Guillaumier.

Kristian Guillaumier. • What about ‘unseen’ instances (where we don’t have training data). • The only information we have on c is the training data.The Inductive Learning Hypothesis • Our main goal is to find a hypothesis h that is identical to c for all x  X (for every instance possible). At best we can guarantee that our learner will learn a hypothesis that learns the training data exactly (not good). 2011 13 . • We make an assumption… the inductive learning hypothesis… • Any hypothesis that approximates the target function well over a sufficiently large set of training examples will also approximate the target function well for other unobserved/unseen examples.

Size and Ordering of the Search Space • • • We must search in the hypothesis space for the best one (that matches c). Humidity  High/Normal (2 options). Recall: – – – – – – Sky  Sunny/Rainy/Cloudy (3 options). 2011 14 . • • • Which means that we have 322222=96 distinct instances. So the number of semantically distinct instances is: – 1 + 433333 = 973 Distinct possibilities + 1 for the ? 1 representing all the definitely negative ones since they contain  Kristian Guillaumier. Wind  Strong/Weak (2 options). Forecast  Same/Change (2 options). The size of the hypothesis space is defined by the hypothesis representation. However note that any instance that contains one or more ’s by definition always means that it is classified –vely. Due to the addition of the symbols ? and  we have 544444=5120 syntactically distinct instances. AirTemp  Warm/Cold (2 options). Water  Warm cold (2 options).

• Clearly. clearly. • This allows us to organize the search space (order the set) according to this relationship between instances (hypotheses) in it. • This ordering concept is very important because there are concept learning algorithms that make use of this ordering. since h2 has less constraints.Size and Ordering of the Search Space • Consider the following hypotheses: – h1: <Sunny. ?. Strong. ?. ?> • Consider the sets of instances that are classified as +ve by h1 and h2. ?. ?. ?. anything that is classified as +ve by h1 will also be classified as +ve by h2. ?. Kristian Guillaumier. ?> – h2: <Sunny. ?. it will classify more instances as +ve than h1. • We say that h2 ‘is more general than’ h1. 2011 15 .

2011 16 . – x satisfies h iff h(x) = 1. • The “is more general or equal to” relationship is denoted by g. hk(x)=1  hj(x) = 1 Kristian Guillaumier.Size and Ordering of the Search Space • For any instance xX and hypothesis hH. hj g hk iff any instance that satisfies hk also satisfies hj. • Given two hypothesis hj and hk. hj g hk iff xX.

… Kristian Guillaumier. 2011 17 . more specific or equal to. it is useful to define: strictly more general than.Size and Ordering of the Search Space • Just as we have defined g .

2011 18 . h2 ‘contains’ h1 and h3 because it is more general.Size and Ordering of the Search Space Set of all instances Each hypothesis corresponds to a subset of X (arrows). h1 and h3 are not more general or specific than each other Kristian Guillaumier.

Size and Ordering of the Search Space • Notes: – g is a partial order over the hypothesis space H (it is reflexive. c ∈ S: aRb ⋀ bRc ⇒ aRc. or Reflexive Relation – A reflexive relation is a binary relation R over a set S where every element in S is related to itself. and transitive). the ≤ relation over Z+ is antisymmetric because ∀ a. Kristian Guillaumier. • Antisymmetry. c ∈ Z+. • Reflectivity. b ∈ S. or Antisymmetric Relation – An antisymmetric relation is a binary relation R over a set S where ∀ a. the ≤ relation over Z+ is transitive because ∀ a. ∀ x ∈ S. – For example. – For example. 2011 19 . or Transitive Relation – A transitive relation is a binary relation R over a set S where ∀ a. xRx holds true. if a ≤ b and b ≤ c then a ≤ c. – An equivalent way of stating this is that ∀ a. – For example. antisymmetric. • Transitivity. – That is. aRb ⋀ bRa ⇒ a = b. b. x ≤ x. c ∈ Z+. b. b. b ∈ S. the ≤ relation over Z+ is reflexive because ∀ x ∈ Z+. aRb ⋀ a ≠ b⇒ ¬bRa. if a ≤ b and b ≤ a then a = b.

The Find-S Algorithm Initialise h to the most specific hypothesis in H. For each +ve training instance x For each attribute constraint ai in h If constraint ai is satisfied by x Then Do Nothing Else Replace ai in h by the next more general constraint that is satisfied by x Return h Kristian Guillaumier. 2011 20 .

Normal. Same> • After having covered the 1st training example. However it is still very specifiy – it will classify all possible instances to –ve except the one +ve training example it has seen. • Continue with the next +ve training example. Strong. h is more general than what we started with. Pick the next more general attribute value from  which is ‘Sunny’. Warm. • Repeat for all attributes AirTemp. . . Warm. … until we get: h = <Sunny. Strong.  > • Start with first +ve training example: • Consider the first attribute ‘sky’. Our hypothesis says  (most specific) but our training example says ‘Sunny’ (more general) so the attribute in the hypothesis does not satisfy the attribute in x. Warm. Same> x = <Sunny. Normal.The Find-S Algorithm • Init h to most specific hypothesis: h = < . . 2011 21 . . humidity. Warm. Kristian Guillaumier.

Warm. Warm. Warm. Strong. Strong. Same> Loop thru all the attributes. Same> • We complete the algorithm to get the hypothesis: h = <Sunny. 2nd nothing. High. Strong. Warm. Normal. Warm. Normal. High. ?. Same> • Recall that so far: • h = <Sunny. Strong. ?> Kristian Guillaumier. Warm. Warm. ?. Same> new h… h = <Sunny. 2011 22 . Strong. 3rd not satisfied – pick next more general.The Find-S Algorithm • The next example is: x = <Sunny. Warm. Same> h = <Sunny. Warm. Strong. 1st satisfied – do nothing. Warm. Warm. ?. etc… satisfied – do x = <Sunny.

Kristian Guillaumier. Cold. Change> • Note that h is already consistent with the training example. Warm. ?. Warm. Strong. High. Same> • The 3rd training example (that was skipped) was: x = <Rainy. our hypothesis was: h = <Sunny. • After considering the 2nd training example. • However we observe that the hypothesis we had generated so far was consistent with this –ve training example.Observation • Remember that the algorithm skipped the 3rd training example because it was –ve.The Find-S Algorithm . Warm. Strong. 2011 23 .

• To see why: – h is the most specify hypothesis in H that is consistent with the currently observed training examples. 2011 24 . if the a more general hypothesis will not misclassify –ve example. – But c will never cover a –ve example so neither will h (by definition of g). the more specific one cannot misclassify it either. – Since we assume the target concept c is in H and is consistent (obviously) with the positive training examples. Clearly.Observation • As long as the hypothesis space H contains the hypothesis that describes the target concept c and there are no errors in the training data. the current hypothesis can never be inconsistent with a –ve training example.The Find-S Algorithm . Consider Animal g Cat If cat then animal If animal then maybe cat If not cat then maybe not animal If not animal then not cat (if –ve is correctly classified by general then it is correctly classified by specific) Kristian Guillaumier. then c must be g than h.

• Issues: – We don’t know whether we found the only hypothesis that is consistent. Why not the most general hypothesis consistent? Why not something in between. 2011 25 . – How do we know if the training data is consistent? In real-life cases. – Find-S will find the most specific hypothesis consistent with the training data. There might be more hypotheses that are consistent. training data may contain noise or errors. – There might be more than one maximally specific hypothesis – which one do we pick? Kristian Guillaumier.The Find-S Algorithm – Issues Raised • Find-S is guaranteed to find the most specific hH that is consistent with the +ve and –ve training examples assuming that the training examples are correct (no noise).

D)   <x. c(x)>  D. c(x)> in D. h(x) = c(x) Kristian Guillaumier. • Defn: h is consistent with a set of training examples D iff h(x) = c(x) for each example <x. it does so without enumerating the whole space. that although the hypothesis output by Find-S is consistent with the training data. Consistent(h. it is one of the.The Candidate Elimination (CE) Algorithm • Note. • CE will output (a description of) all the hypothesis consistent with the training data. 2011 26 . • CE finds all describable hypotheses that are consistent with the observed training examples. • Interestingly. many hypothesis that is consistent. possibly.

VSH. consistent versions of the target concept.D)} Kristian Guillaumier.D  {hH|Consistent(h.The Candidate Elimination (CE) Algorithm • The subset of all the hypotheses that are consistent with the training data (what CE finds) is called the version space WRT the hypothesis space H and the training data D – it contains all the possible.D with respect to the hypothesis space H and the training data D is the subset of hypotheses from H consistent with the training data in D. 2011 27 . • Defn: the version space denoted VSH.

2011 28 .The List-Then-Eliminate (LTE) Algorithm • A possible representation of a version space is a listing of all the elements (hypotheses) in it. List-Then-Eliminate: VersionSpace = Generate all hypothesis in H For each training example d: remove from VersionSpace any hypothesis h where h(x) <> c(x) Return VersionSpace Kristian Guillaumier.

• It has the advantage of simplicity and the fact that it will always work (guaranteed to output all the hypotheses consistent with the training data). 2011 29 . • However. Kristian Guillaumier.The List-Then-Eliminate (LTE) Algorithm • LTE can be applied whenever the hypothesis space is finite (not always the case). • We need a more compact representation. enumerating all the hypotheses in H is unrealistic for all but the most trivial cases.

Find-S found the hypothesis: h = <Sunny. ?> • This is only one of 6 possible hypotheses that are consistent with the training examples. ?. 2011 30 . Warm. • We can illustrate the 6 possible hypothesis in the next diagram… Kristian Guillaumier. ?. Strong.Compact Representation of a Version Space • Recall that in our previous example.

2011 6 31 .Compact Representation of a Version Space 1 2 3 4 5 Kristian Guillaumier.

Kristian Guillaumier.Compact Representation of a Version Space Most specific. Most general. Arrow represents the g relation. 2011 32 .

2011 33 . we can generate all the hypotheses ‘in between’.Compact Representation of a Version Space Given only the 2 sets S and G. Try it! Kristian Guillaumier.

D)]} Kristian Guillaumier. G  {gH |Consistent(g. D)  (s’H)[(s>gs’)Consistent(s’.Compact Representation of a Version Space • Intuitively we see that by having these general and specific ‘boundaries’ we can generate the whole version space (check sketch proof in Mitchell). • A few definitions. 2011 34 . G  {sH |Consistent(s.D)]} • Defn: the specific boundary S WRT a hypothesis space H and training data D is the set of minimally general (maximally specific) members of H consistent with D. D)  (g’H)*(g’>gg)Consistent(g’. • Defn: the general boundary G (remember that G is a set) WRT a hypothesis space H and training data D is the set of maximally general members of H consistent with D.

G0 = {<?.?.?.?>} • Then we initialise S to contain the most specific hypothesis possible. • First we initialise G (remember it is a set) to contain the most general hypothesis possible.>} Kristian Guillaumier.?... 2011 35 .(Back to) The Candidate Elimination (CE) Algorithm • CE computes the version space containing all the hypotheses in H that are consistent with the observed D.?.. S0 = {<..

• At the end we’ll end up with the correct boundary sets. • As each training example is considered the boundary sets S and G are generalised and specialised respectively to eliminate from the version space any hypothesis in H that is inconsistent. Kristian Guillaumier.The Candidate Elimination (CE) Algorithm • So far the two boundaries delimit the whole hypothesis space (every h in H is between G0 and S0). 2011 36 .

2011 37 .The Candidate Elimination (CE) Algorithm Kristian Guillaumier.

2011 38 ..>} G0 = {<?.?.The Candidate Elimination (CE) Algorithm • After init: S0 = {<..?.?..?>} • Consider the first training example: Sky Sunny AirTemp Warm Humidity Normal Wind Strong Water Warm Forecast Same IsGoodDay Yes • It is positive so: Kristian Guillaumier..?.

Normal Wind Strong Water Warm Forecast Same Yes • Part 1: all hypotheses in G are consistent with the training example.e. Same> to S..The Candidate Elimination (CE) Algorithm Sky Sunny AirTemp Warm Hum.. • Remove from S any hypothesis that is more general than any other hypothesis in S.>) which is inconsistent: • Remove <. • Part 2: – There is only one s in S (<... Warm. Kristian Guillaumier. Normal.. – i. so we don’t remove anything... Strong.. Warm. – There is only 1 hypothesis in S so we do nothing. we add <Sunny. • Add to S all minimal generalisations h of s. 2011 39 .> from S. leaving S={}.

2011 40 .The Candidate Elimination (CE) Algorithm • So far we got: Kristian Guillaumier.

• Part 2: – There is only one s in S (<Sunny. Strong. It is positive as well so. Same>) which is inconsistent: • Remove it from S. Strong. Sky Sunny AirTemp Warm Hum High Wind Strong Water Warm Forecast Same Yes • Part 1: all hypotheses in G are consistent with the training example. – i. Kristian Guillaumier. so we don’t remove anything. • Remove from S any hypothesis that is more general than any other hypothesis in S. Same> to S. 2011 41 . Normal.The Candidate Elimination (CE) Algorithm • Read the second training example. Warm. ?. Warm.e. • Add to S all minimal generalisations h of s. we add <Sunny. Warm. Warm. – There is only 1 hypothesis in S so we do nothing. leaving S={}.

2011 42 .The Candidate Elimination (CE) Algorithm • So far we got: Kristian Guillaumier.

• Consider the 3rd training example. which is –ve.The Candidate Elimination (CE) Algorithm • We notice that the role of +ve training examples is to make the S boundary more general and the role of the –ve training examples is to make the G boundary more specific. 2011 43 . Sky Rainy AirTemp Cold Humidity High Wind Strong Water Warm Forecast Changes IsGoodDay No Kristian Guillaumier.

– Continued… Kristian Guillaumier.?. Same> which is consistent because it labels the training example as ‘NO’ – we do nothing. • Step 2: – The hypotheses in G that are not consistent with d is <?.?. Warm. Remove it. Warm. ?.?.?> because it labels it as ‘YES’.The Candidate Elimination (CE) Algorithm Sky Rainy AirT Cold Hum High Wind Strong Water Warm Fore Changes No • Step 1: S contains <Sunny. – All to G all minimal specialisations of g. 2011 44 .?. Strong. leaving G = {}.

?.?> <?. ?.?.?.?. <Cloudy.?>. ?.Normal. ?. 2011 45 .?> <?. not all these minimal specialisations go into the new G.?.?.The Candidate Elimination (CE) Algorithm Sky Rainy AirT Cold Hum High Wind Strong Water Warm Fore Changes No • The g we removed is <?. ?> <?.?. ?.?.?.?.?. ?>.?.Weak. all the minimal specialisations of it would be (remember we want to label the training example as ‘NO’): • • • • • • Sky: Air Temp: Humidity: Wind: Water: Forecast: <Sunny.?> <?.?. Kristian Guillaumier.Cold.?.?.Same> • However.?.?.?> <?. ?. ?.Warm.?.?. ?. ?.?.

?.Cold. ?.?. <?.?. ?.?.?. <?.Warm.?. ?.The Candidate Elimination (CE) Algorithm • Only <Sunny.?.?.?.?.?>.?.?> is inconsistent with training item 1 and 2. ?> is inconsistent with training item 1 and 2.?.?. ?.?. <?.?. • Why? • Because it is inconsistent with the previously encountered training examples (so far we saw training items 1 and 2).?. ?. <?. • <Cloudy. ?. ?.?.Weak.?.?. ?.?.Normal.?> and <?.?> and <?.?. – – – – Sky <Cloudy.?.Weak.?> is inconsistent with training item 1 and 2.?> is inconsistent with training item 2.?.?.Same> go into the new G. AirTemp Humidity Wind Water Forecast IsGoodDay Sunny Sunny Rainy Sunny Warm Warm Cold Warm Normal High High High Strong Strong Strong Strong Kristian Guillaumier. 2011 Warm Warm Warm Cool Same Same Changes Changes Yes Yes No Yes 46 .Cold. <?.Normal. ?.?> are not part of the new G. ?>. ?>. ?. ?.?.?. <?.

The Candidate Elimination (CE) Algorithm • So far we got: Kristian Guillaumier. 2011 47 .

The Candidate Elimination (CE) Algorithm • After processing the 4th training item. we get: Kristian Guillaumier. 2011 48 .

The Candidate Elimination (CE) Algorithm
• The entire version space derived from the boundaries is:

Kristian Guillaumier, 2011

49

Converging
• CE will converge towards the target concept if:
– There are no errors in the training data. – The target concept is in H.

• The target concept is exactly learnt when S and G converge to a single and identical hypothesis. • If the training data contains errors e.g. a +ve example is incorrectly labeled as –ve:
– The algorithm will remove the correct target concept from the version space. – Eventually, if we are presented with enough training data, we will detect an inconsistency because the G and S boundaries will converge to an empty version space (i.e. there is no hypothesis in H that is consistent with all the training examples).

Kristian Guillaumier, 2011

50

Requesting Training Examples
• So far, our algorithm was given a set containing labeled training data. • Suppose that our algorithm can come up with an instance and ask (query) an external oracle to label it. • What instance should the algorithm come up with for an answer from the oracle?

Kristian Guillaumier, 2011

51

Requesting Training Examples • Consider the version space we got from the 4 fixed training examples we had? • What training example would we like to have to further refine it? • We should come up with an instance that will classified as +ve by some hypotheses and –ve by others to further reduce the size of the version space. Kristian Guillaumier. 2011 52 .

warm. normal. warm.Requesting Training Examples • Suppose we request the training example: <sunny. same> • 3 hypothesis would classify it as +ve and 3 would classify it as –ve: Kristian Guillaumier. light. 2011 53 .

warm. the optimal instance we’d like the oracle to classify (the best training example to have next) is the one that would half the size of the version space. Kristian Guillaumier. warm. 2011 54 . same> • We’d either generalise the S boundary or specialise the G boundary and shrink the size of the version space (make it converge). normal. light. • In general. • If we have this option we can converge to the target concept in Log2 time.Requesting Training Examples • So if we ask the oracle to classify: <sunny.

2011 55 . • Our previous example is a partially learned concept: Kristian Guillaumier.Partially Learned Concepts • Partially learned = we didn’t converge to the target concept (S and G are not the same).

warm.Partially Learned Concepts • • It is possible to classify unseen examples with a degree of certainty. change> • …using our partially learned concept. strong. So all the hypothesis classified it as +ve with the same confidence as if there would have been only the target concept remaining (converged). Suppose we want to classify the instance (not in training data)… <sunny. cool. 2011 56 . Kristian Guillaumier. Notice that every hypothesis in the version space classifies the unseen instance as +ve. normal.

Kristian Guillaumier. Possible take a majority vote and output a confidence level. 2 +ve. Need more training examples. 4-ve. 2011 57 . Note: This is an optimal query to request from an oracle.Partially Learned Concepts All hypothesis in version space classify as +ve Sky Sunny Rainy Sunny Sunny AirTemp Warm Cold Warm Cold Humidity Normal Normal Normal Normal Wind Strong Light Light Strong Water Cool Warm Warm Warm Forecast Change Same Same Same ? ? ? ? All hypothesis in version space classify as -ve 50/50.

• There is no way to allow for a disjunction of values – we cannot say “Sky=Cloudy OR Sky=Sunny”. • Consider what would happen if in fact. I like swimming if it is cloudy or sunny. • Also recall that our hypothesis space allowed only for conjunctions (AND) of attribute values. works assuming the target concept exists in our hypothesis space. I’d get something like… Kristian Guillaumier. so far.Inductive Bias • Recall that our system. 2011 58 .

Inductive Bias

Sky     AirTemp  Humidity  Wind    Water  Forecast  IsGoodDay
Sunny   Warm     Normal    Strong  Cool   Change    Y
Cloudy  Warm     Normal    Strong  Cool   Change    Y
Rainy   Warm     Normal    Strong  Cool   Change    N

• CE will converge to an empty version space: the target concept is not in the hypothesis space. To see why:
• The most specific hypothesis that classifies the first two examples as +ve is:
<?, Warm, Normal, Strong, Cool, Change>
• Although it is maximally specific for the first 2 examples, it is already too general: it will classify the 3rd example as +ve too.
• The problem is that we biased our learner to consider only hypotheses that are conjunctions.

Kristian Guillaumier, 2011 59
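To make the “already too general” point concrete, here is a small sketch (with hypothetical helper names) that computes the most specific conjunctive hypothesis covering the two +ve examples and checks that it also covers the –ve one:

# Sketch: least generalisation of conjunctive hypotheses ('?' matches anything).
def generalise(h, x):
    # Relax each constraint that disagrees with instance x to '?'.
    return tuple(hc if hc == xc else '?' for hc, xc in zip(h, x))

def matches(h, x):
    return all(hc == '?' or hc == xc for hc, xc in zip(h, x))

pos1 = ('Sunny',  'Warm', 'Normal', 'Strong', 'Cool', 'Change')
pos2 = ('Cloudy', 'Warm', 'Normal', 'Strong', 'Cool', 'Change')
neg  = ('Rainy',  'Warm', 'Normal', 'Strong', 'Cool', 'Change')

s = generalise(pos1, pos2)
print(s)                  # ('?', 'Warm', 'Normal', 'Strong', 'Cool', 'Change')
print(matches(s, neg))    # True -> the -ve example is (wrongly) covered too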

Unbiased Learning
• Let’s see what happens if, to make sure that the target concept definitely exists in the hypothesis space, we define the hypothesis space to contain every possible concept.
– This means that it must be possible to represent every possible subset of X.
• The powerset!!!
– Recall that the size of a powerset is (in general) 2^|X|.
• In our previous example (containing 6 attributes), the size of the instance space is 96.
• How many possible concepts can be defined over this set of instances?
– There are 2^96 (ouch) possible concepts that can be learnt from our instance space.
– We had seen that by introducing ? and ∅, we allowed for only 973 possible concepts, which is <<<<<< 2^96 (we had a very strong bias).

Kristian Guillaumier, 2011 60
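The counting on this slide can be checked in a couple of lines (a sketch of the arithmetic, assuming the 6 attributes of the earlier swimming example):

instances = 3 * 2 * 2 * 2 * 2 * 2      # Sky has 3 values, the other 5 attributes have 2 -> 96 instances
unbiased_concepts = 2 ** instances      # every subset of the instance space: 2**96 concepts
conjunctive_hypotheses = 1 + 4 * 3**5   # the all-empty hypothesis plus 4*3*3*3*3*3 = 973
print(instances, conjunctive_hypotheses, unbiased_concepts)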

Unbiased Learning
• Let’s define a new hypothesis space H′ that can represent every subset of instances, i.e. H′ = 𝒫(X).
• To do this we allow H′ to contain any combination of disjunctions, conjunctions and negations.
• E.g. the target concept “Sky=Sunny OR Sky=Cloudy” would be:
<Sunny, ?, ?, ?, ?, ?> ∨ <Cloudy, ?, ?, ?, ?, ?>
• So we can use CE knowing that our target concept will definitely exist in the hypothesis space. But…
• We create a new problem: our learner will learn how to classify exactly the instances presented as training examples and will not generalise beyond them!

Kristian Guillaumier, 2011 61

Unbiased Learning
• To see why, suppose I have 5 training examples d1, d2, d3, d4, d5, and that d1, d2, d3 are +ve examples and d4, d5 are –ve examples.
• The S boundary will become a disjunction of the +ve examples (since it is the most specific possible hypothesis that covers them):
S = {(d1 ∨ d2 ∨ d3)}
• The G boundary will become a negation (rule out) of the negative training examples:
G = {¬(d4 ∨ d5)}
• So the only unambiguously classifiable instances are those that were provided as training examples.

Kristian Guillaumier, 2011 62

Unbiased Learning
• What would happen if we use the partially learned concept and take a vote?
• Instances that were originally in the training data will be classified unambiguously (obviously).
• Any other instance not in the training data will be classified as +ve by exactly half of the hypotheses in the version space and as –ve by the other half.
– Note that H is the power set of X.
– Let x be some unobserved instance (not in the training data).
– Then there is some h in the version space that covers x.
– But there is also a corresponding h′ in the version space that is identical to h except that it does not cover x, so the two classify x differently.

Kristian Guillaumier, 2011 63
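A tiny sketch of this argument (the instance names and the set-based representation of H′ are assumptions): hypotheses are subsets of X, the version space keeps the subsets consistent with the training data, and an unobserved instance ends up covered by exactly half of them.

from itertools import combinations

X = ['d1', 'd2', 'd3', 'd4', 'd5', 'x']          # 'x' is an unobserved instance
positives, negatives = {'d1', 'd2', 'd3'}, {'d4', 'd5'}

# H' = the power set of X: every subset of X is a hypothesis.
all_hypotheses = [frozenset(c) for r in range(len(X) + 1)
                  for c in combinations(X, r)]

# Consistent hypotheses contain all +ve examples and none of the -ve ones.
version_space = [h for h in all_hypotheses
                 if positives <= h and not (negatives & h)]

votes_for_x = sum('x' in h for h in version_space)
print(len(version_space), votes_for_x)            # 2 1 -> exactly half say +ve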

More on Bias
• Straight from Mitchell [1]: a learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances.
• (In fact, CE worked because we biased it with the assumption that the target concept can be represented by a conjunction of attribute values.)

Kristian Guillaumier, 2011 64

More on Bias
• Consider:
– L = a learning algorithm.
– c = some target concept.
– L has a set of training data Dc = {<x, c(x)>}.
– xi is some instance we wish to classify.
– L(xi, Dc) = the classification (+ve/–ve) that L assigns to xi after learning from the training data Dc.
• The inductive inference step is:
(Dc ∧ xi) ≻ L(xi, Dc)
• Where a ≻ b denotes that b is inductively inferred from a.
• So the inductive inference step reads: “given the training data Dc and the instance xi, we can inductively infer the classification of the instance.”

Kristian Guillaumier, 2011 65

More on Bias

Sky    AirTemp  Humidity  Wind    Water  Forecast  IsGoodDay
Sunny  Warm     Normal    Strong  Warm   Same      Yes
Sunny  Warm     High      Strong  Warm   Same      Yes
Rainy  Cold     High      Strong  Warm   Changes   No
Sunny  Warm     High      Strong  Cool   Changes   Yes

(Dc ∧ xi) ≻ L(xi, Dc)

Kristian Guillaumier, 2011 66

More on Bias
• Because L is an inductive learning algorithm, the classification of the example does not, in general, follow deductively from the training data (it cannot be proven), i.e. we cannot prove that the result L(xi, Dc) is correct.
• However, we can add a number of assumptions to our system so that the classification would follow deductively.
• The inductive bias of L is defined as these assumptions.

Kristian Guillaumier, 2011 67

More on Bias
• Let B = these assumptions (e.g. the hypothesis space is made up only of conjunctions of attribute values).
• Then the inductive bias of L is B, giving:
(B ∧ Dc ∧ xi) ⊢ L(xi, Dc)
• Where the notation a ⊢ b denotes that b follows deductively from a (b is provable from a).

Kristian Guillaumier, 2011 68

Defn. of Inductive Bias
• Consider a concept learning algorithm L for the set of instances X.
• Let c be an arbitrary concept over X and let Dc = {<x, c(x)>} be an arbitrary set of training examples of c.
• Let L(xi, Dc) denote the classification assigned to the instance xi by L after training on the data Dc.
• The Inductive Bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples Dc:
(∀xi ∈ X)[(B ∧ Dc ∧ xi) ⊢ L(xi, Dc)]

Kristian Guillaumier, 2011 69

The Inductive Bias of CE
• Let us specify what L(xi, Dc) means for CE (how classification works):
– Given training data Dc, CE will compute the version space VS_{H,Dc}.
– Then it will classify a new instance xi by taking a vote amongst the hypotheses in this version space.
– A classification will be output (+ve or –ve) only if all the hypotheses in the version space unanimously agree. Otherwise no classification is output (“I can’t tell from the training data”).
• The inductive bias of CE is that the target concept c is contained in the hypothesis space, i.e. c ∈ H.
• Why?

Kristian Guillaumier, 2011 70

The Inductive Bias of CE
• 1:
– Notice that if we assume that c ∈ H, then it follows deductively (we can prove) that c ∈ VS_{H,Dc}, since the version space contains every hypothesis in H that is consistent with Dc, and c is consistent with Dc.
• 2:
– Recall that we defined the classification L(xi, Dc) to be a unanimous vote amongst all the hypotheses in VS_{H,Dc}.
– Thus, if L outputs the classification L(xi, Dc), then so does every hypothesis h ∈ VS_{H,Dc}, including the hypothesis c ∈ VS_{H,Dc}.
– Therefore c(xi) = L(xi, Dc).

Kristian Guillaumier, 2011 71
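As a closing sketch (an assumption, not code from the slides), the classification rule L(xi, Dc) of CE described above (a unanimous vote over the version space) could look like this:

# Sketch: output +ve/-ve only when every hypothesis in the version space agrees.
def matches(h, x):
    return all(hc == '?' or hc == xc for hc, xc in zip(h, x))

def classify(version_space, x):
    votes = [matches(h, x) for h in version_space]
    if all(votes):
        return '+ve'
    if not any(votes):
        return '-ve'
    return None   # "I can't tell from the training data"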