
**Systems and Fuzzy Logic**

Prepared by Kristian Guillaumier

Dept. of Intelligent Computer Systems

University of Malta

2011

• Most material in these slides adapted from:

– [1] Machine Learning: Tom Mitchell (get this book).

– [2] Introduction to Machine Learning: Ethem Alpaydin.

– [3] Introduction to Expert Systems: Peter Jackson.

– [4] An Introduction to Fuzzy Logic and Fuzzy Sets: Buckley, Eslami.

– [5] Pattern Recognition and Machine Learning: Christopher Bishop.

– [6] Grammatical Inference: Colin de la Higuera.

– [7] Artificial Intelligence: Negnevitsky.

– Miscellaneous web references.

Kristian Guillaumier, 2011 2

CONCEPT LEARNING, FIND-S, CANDIDATE ELIMINATION

(main source: Mitchell [1])


Note on Induction

• If a large number of items I have seen so far all possess

some property, then ALL items possess that property.

• So far the sun has always risen…

• Are all swans white…

• We never know whether our induction is true (we have not proven it).

• In machine learning:

– Input: A number of training examples for some function.

– Output: a hypothesis that approximates the function.


Concept Learning

• Learning: inducing general functions from specific training

examples (+ve and/or –ve).

• Concept learning: induce the definition of a general category (e.g.

‘cat’) from a sample of +ve and –ve training data.

• Search a space of potential hypotheses (hypothesis space) for a

hypothesis that best fits the training data provided.

• Each concept can be viewed as the description of some subset

(there is a general-to-specific ordering) defined over a larger set.

E.g.:

– Cat ⊂ Feline ⊂ Animal ⊂ Object

• A boolean function over the larger set. E.g. the IsCat() function over

the set of animals.

• Concept learning = learning this function from training data.


Example

• Learn the concept: Good days when I like to swim.

• We want to learn the function:

– IsGoodDay(input) → true/false.

• Our hypothesis is represented as a conjunction of constraints on

attributes. Attributes:

– Sky → Sunny/Rainy/Cloudy.

– AirTemp → Warm/Cold.

– Humidity → High/Normal.

– Wind → Strong/Weak.

– Water → Warm/Cold.

– Forecast → Same/Change.

• Other possible constraints (values) for an attribute:

– ? → ‘I don’t care’, any value is acceptable.

– ∅ → no value is acceptable.

(Diagram: an attribute, e.g. Sky, with its constraints, e.g. Sunny/Rainy.)


Example

• Our hypothesis, then, is a vector of constraints on

these attributes:

<Sky, AirTemp, Humidity, Wind, Water, Forecast>

• An example of a hypothesis is (only on warm days with normal humidity):

<?, Warm, Normal, ?, ?, ?>

• The most general hypothesis is:

<?, ?, ?, ?, ?, ?>

• The most specific hypothesis is:

<∅, ∅, ∅, ∅, ∅, ∅>

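The hypothesis representation just described is easy to sketch in code. The following is a minimal illustration, not from the slides: the tuple encoding, the use of None to stand in for the ∅ constraint, and the function name satisfies are my own assumptions.

```python
# A hypothesis is a tuple with one constraint per attribute:
#   a literal value -> that value is required
#   '?'             -> any value is acceptable ("don't care")
#   None            -> stands in for the empty constraint (no value acceptable)
def satisfies(h, x):
    """h(x) = 1 iff every attribute of instance x matches its constraint."""
    return all(c == '?' or c == v for c, v in zip(h, x))

most_general = ('?',) * 6
most_specific = (None,) * 6
h = ('?', 'Warm', 'Normal', '?', '?', '?')   # only warm days with normal humidity
day = ('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same')

print(satisfies(h, day))              # True
print(satisfies(most_general, day))   # True
print(satisfies(most_specific, day))  # False: None matches no value
```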

Example

• Training Examples:

Sky AirTemp Humidity Wind Water Forecast IsGoodDay

Sunny Warm Normal Strong Warm Same Yes

Sunny Warm High Strong Warm Same Yes

Rainy Cold High Strong Warm Changes No

Sunny Warm High Strong Cool Changes Yes


Notation

• The set of all items over which the concept is defined is called the set of instances, X.

– The set of all days represented by the attributes AirTemp,

Humidity, …

– The set of all animals, etc…

• An instance in X is denoted by x (x ∈ X).

• The concept to be learnt (e.g. cats over animals, good days

to swim over all days) is called the target concept denoted

by c (note that c is the target function).

– c is a Boolean-valued function defined over the instances X.

– i.e. the function c takes an instance x ∈ X and, in our example:

• c(x) = 1 if IsGoodDay is Yes.

• c(x) = 0 if IsGoodDay is No.


Notation

• When learning the target concept, c, the learner

is given a training set that consists of:

– A number of instances x from X.

– For each instance, we have the value of the target

concept c(x).

– Instances where c(x) = 1 are called +ve training

examples (members of target concept).

– Instances where c(x) = 0 are called –ve training

examples (non-members).

• A training example is usually denoted by:

<x, c(x)>

An instance and its target concept value


Notation

• Given the set of training examples, we want to

learn (hypothesize) the target concept c.

• H is the set of all hypotheses that we are

considering (all the possible combinations of

<Sky, AirTemp, Humidity, Wind, Water,

Forecast>).

• Each hypothesis in H is denoted by h and is usually a boolean-valued function h: X → {0,1}.

• Goal of the learner is to find an h such that:

– h(x) = c(x) for all x ∈ X.


Concept Learning Task (Mitchell pg. 22)


The Inductive Learning Hypothesis

• Our main goal is to find a hypothesis h that is identical to c for all x ∈ X (for every instance possible).

• The only information we have on c is the training data.

• What about ‘unseen’ instances (where we don’t have training data)? At best we can guarantee that our learner will learn a hypothesis that fits the training data exactly (not good).

• We make an assumption… the inductive learning hypothesis…

• Any hypothesis that approximates the target function well over a

sufficiently large set of training examples will also approximate

the target function well for other unobserved/unseen examples.


Size and Ordering of the Search Space

• We must search in the hypothesis space for the best one (that matches c).

• The size of the hypothesis space is defined by the hypothesis representation.

• Recall:

– Sky → Sunny/Rainy/Cloudy (3 options).

– AirTemp → Warm/Cold (2 options).

– Humidity → High/Normal (2 options).

– Wind → Strong/Weak (2 options).

– Water → Warm/Cold (2 options).

– Forecast → Same/Change (2 options).

• Which means that we have 3×2×2×2×2×2=96 distinct instances.

• Due to the addition of the symbols ? and ∅ we have 5×4×4×4×4×4 = 5120 syntactically distinct hypotheses.

• However, note that any hypothesis containing one or more ∅’s by definition classifies every instance –vely, so all such hypotheses are semantically equivalent. The number of semantically distinct hypotheses is therefore:

– 1 + 4×3×3×3×3×3 = 973

(the 1 counts all the definitely-negative hypotheses containing ∅; each attribute contributes its distinct values + 1 for the ?)

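The three counts above can be reproduced with a few lines of arithmetic (a quick sanity check, nothing more):

```python
# Instance space: Sky has 3 values, the other 5 attributes have 2 each.
n_instances = 3 * 2 * 2 * 2 * 2 * 2
print(n_instances)          # 96

# Hypotheses: each attribute also admits '?' and the empty constraint.
n_syntactic = 5 * 4 * 4 * 4 * 4 * 4
print(n_syntactic)          # 5120

# Every hypothesis containing an empty constraint classifies everything
# -vely, so they all collapse into a single semantic hypothesis.
n_semantic = 1 + 4 * 3 * 3 * 3 * 3 * 3
print(n_semantic)           # 973
```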

Size and Ordering of the Search Space

• Consider the following hypotheses:

– h1: <Sunny, ?, ?, Strong, ?, ?>

– h2: <Sunny, ?, ?, ?, ?, ?>

• Consider the sets of instances that are classified as +ve by h1 and

h2.

• Clearly, since h2 has fewer constraints, it will classify more instances as +ve than h1.

• Anything that is classified as +ve by h1 will also be classified as +ve by h2.

• We say that h2 ‘is more general than’ h1.

• This allows us to organize the search space (order the set) according

to this relationship between instances (hypotheses) in it.

• This ordering concept is very important because there are concept

learning algorithms that make use of this ordering.


Size and Ordering of the Search Space

• For any instance x ∈ X and hypothesis h ∈ H, x satisfies h iff h(x) = 1.

• The “is more general than or equal to” relation is denoted by ≥g.

• Given two hypotheses hj and hk, hj ≥g hk iff any instance that satisfies hk also satisfies hj:

hj ≥g hk iff ∀x ∈ X, hk(x) = 1 → hj(x) = 1
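For conjunctive hypotheses, the ≥g definition reduces to an attribute-wise check. The following sketch uses my own encoding ('?' wildcard, None for the ∅ constraint); the function names are mine, not from the slides:

```python
def covers_constraint(a, b):
    """Constraint a is at least as general as constraint b when a is '?',
    a equals b, or b is the empty constraint (None), which admits nothing."""
    return a == '?' or a == b or b is None

def more_general_or_equal(hj, hk):
    """hj >=g hk: every instance that satisfies hk also satisfies hj."""
    return all(covers_constraint(a, b) for a, b in zip(hj, hk))

h1 = ('Sunny', '?', '?', 'Strong', '?', '?')
h2 = ('Sunny', '?', '?', '?', '?', '?')
print(more_general_or_equal(h2, h1))  # True: h2 is more general than h1
print(more_general_or_equal(h1, h2))  # False
```

Note that the check is purely syntactic, which is exactly why the partial order can be used to organise the search space without enumerating instances.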

Size and Ordering of the Search Space

• Just as we have defined ≥g, it is useful to define: strictly more general than, more specific than or equal to, …

Size and Ordering of the Search Space


(Diagram: the set of all instances X; each hypothesis corresponds to a subset of X (arrows). h2 ‘contains’ h1 and h3 because it is more general; h1 and h3 are neither more general nor more specific than each other.)

Size and Ordering of the Search Space

• Notes:

– ≥g is a partial order over the hypothesis space H (it is reflexive, antisymmetric, and transitive).

• Reflexivity, or Reflexive Relation

– A reflexive relation is a binary relation R over a set S where every element in S is related to itself.

– That is, ∀x ∈ S, xRx holds true.

– For example, the ≤ relation over Z⁺ is reflexive because ∀x ∈ Z⁺, x ≤ x.

• Transitivity, or Transitive Relation

– A transitive relation is a binary relation R over a set S where ∀a, b, c ∈ S: aRb ⋀ bRc ⇒ aRc.

– For example, the ≤ relation over Z⁺ is transitive because ∀a, b, c ∈ Z⁺, if a ≤ b and b ≤ c then a ≤ c.

• Antisymmetry, or Antisymmetric Relation

– An antisymmetric relation is a binary relation R over a set S where ∀a, b ∈ S, aRb ⋀ bRa ⇒ a = b.

– An equivalent way of stating this is that ∀a, b ∈ S, aRb ⋀ a ≠ b ⇒ ¬bRa.

– For example, the ≤ relation over Z⁺ is antisymmetric because ∀a, b ∈ Z⁺, if a ≤ b and b ≤ a then a = b.

The Find-S Algorithm

Initialise h to the most specific hypothesis in H.

For each +ve training instance x

For each attribute constraint ai in h

If constraint ai is satisfied by x Then

Do nothing

Else

Replace ai in h by the next more general
constraint that is satisfied by x

Return h
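For conjunctive hypotheses, “the next more general constraint satisfied by x” is unique (a literal value for ∅, then '?'), so the whole algorithm collapses to a few lines. A Python sketch under my usual encoding (None for ∅, '?' wildcard; find_s is my own name), which reproduces the trace on the next slides:

```python
def find_s(examples, n_attrs=6):
    """Find-S sketch: start from the most specific hypothesis and minimally
    generalise it on every +ve example. -ve examples are skipped entirely."""
    h = [None] * n_attrs                  # most specific: matches nothing
    for x, label in examples:
        if not label:                     # skip -ve training examples
            continue
        for i, value in enumerate(x):
            if h[i] is None:              # empty constraint: adopt the value
                h[i] = value
            elif h[i] != value:           # conflicting value: relax to '?'
                h[i] = '?'
    return h

train = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
    (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), True),
    (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Changes'), False),
    (('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Changes'), True),
]
print(find_s(train))   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```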

The Find-S Algorithm

• Init h to the most specific hypothesis:

h = <∅, ∅, ∅, ∅, ∅, ∅>

• Start with the first +ve training example:

x = <Sunny, Warm, Normal, Strong, Warm, Same>

• Consider the first attribute ‘Sky’. Our hypothesis says ∅ (most specific) but our training example says ‘Sunny’ (more general), so the attribute in the hypothesis does not satisfy the attribute in x. Pick the next more general attribute value from ∅, which is ‘Sunny’.

• Repeat for all attributes AirTemp, Humidity, … until we get:

h = <Sunny, Warm, Normal, Strong, Warm, Same>

• After having covered the 1st training example, h is more general than what we started with. However, it is still very specific – it will classify all possible instances as –ve except the one +ve training example it has seen.

• Continue with the next +ve training example.

The Find-S Algorithm

• The next example is:

x = <Sunny, Warm, High, Strong, Warm, Same>

• Recall that so far:

h = <Sunny, Warm, Normal, Strong, Warm, Same>

• Loop through all the attributes: 1st satisfied – do nothing, 2nd satisfied – do nothing, 3rd not satisfied – pick the next more general value, etc…

x = <Sunny, Warm, High, Strong, Warm, Same>

h = <Sunny, Warm, Normal, Strong, Warm, Same>

new h…

h = <Sunny, Warm, ?, Strong, Warm, Same>

• We complete the algorithm to get the hypothesis:

h = <Sunny, Warm, ?, Strong, ?, ?>


The Find-S Algorithm - Observation

• Remember that the algorithm skipped the 3rd training example because it was –ve.

• However we observe that the hypothesis we had

generated so far was consistent with this –ve training

example.

• After considering the 2nd training example, our hypothesis was:

h = <Sunny, Warm, ?, Strong, Warm, Same>

• The 3rd training example (that was skipped) was:

x = <Rainy, Cold, High, Strong, Warm, Change>

• Note that h is already consistent with the training

example.


The Find-S Algorithm - Observation

• As long as the hypothesis space H contains the hypothesis that describes

the target concept c and there are no errors in the training data, the

current hypothesis can never be inconsistent with a –ve training example.

• To see why:

– h is the most specific hypothesis in H that is consistent with the currently observed training examples.

– Since we assume the target concept c is in H and is (obviously) consistent with the positive training examples, then c must be ≥g h.

– But c will never cover a –ve example, so neither will h (by definition of ≥g). Clearly, if the more general hypothesis does not misclassify a –ve example, the more specific one cannot misclassify it either.

Consider Animal ≥g Cat:

If cat then animal.

If animal then maybe cat.

If not cat then maybe not animal.

If not animal then not cat (if a –ve is correctly classified by the general hypothesis, it is correctly classified by the specific one).

The Find-S Algorithm – Issues Raised

• Find-S is guaranteed to find the most specific h ∈ H that is consistent with the +ve and –ve training examples, assuming that the training examples are correct (no noise).

• Issues:

– We don’t know whether we found the only hypothesis that is

consistent. There might be more hypotheses that are consistent.

– Find-S will find the most specific hypothesis consistent with the training data. Why not the most general consistent hypothesis? Why not something in between?

– How do we know if the training data is consistent? In real-life

cases, training data may contain noise or errors.

– There might be more than one maximally specific hypothesis –

which one do we pick?


The Candidate Elimination (CE)

Algorithm

• Note that although the hypothesis output by Find-S is consistent with the training data, it is only one of possibly many hypotheses that are consistent.

• CE will output (a description of) all the hypotheses consistent with the training data.

• Interestingly, it does so without enumerating the whole space.

• CE finds all describable hypotheses that are consistent with the

observed training examples.

• Defn: h is consistent with a set of training examples D iff h(x) = c(x) for each example <x, c(x)> in D.

Consistent(h, D) ≡ ∀ <x, c(x)> ∈ D, h(x) = c(x)
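The Consistent(h, D) definition can be checked mechanically. A sketch under my tuple encoding ('?' wildcard); the names h_matches and consistent are my own:

```python
def h_matches(h, x):
    """h(x) = 1 iff every attribute of x matches its constraint ('?' = any)."""
    return all(c == '?' or c == v for c, v in zip(h, x))

def consistent(h, D):
    """Consistent(h, D) iff h(x) = c(x) for every <x, c(x)> in D."""
    return all(h_matches(h, x) == label for x, label in D)

D = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
    (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), True),
    (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Changes'), False),
    (('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Changes'), True),
]
print(consistent(('Sunny', 'Warm', '?', 'Strong', '?', '?'), D))  # True
print(consistent(('?',) * 6, D))   # False: it also covers the -ve example
```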

The Candidate Elimination (CE)

Algorithm

• The subset of all the hypotheses that are consistent

with the training data (what CE finds) is called the

version space WRT the hypothesis space H and the

training data D – it contains all the possible, consistent

versions of the target concept.

• Defn: the version space, denoted VS_H,D, with respect to the hypothesis space H and the training data D, is the subset of hypotheses from H consistent with the training data in D.

VS_H,D ≡ {h ∈ H | Consistent(h, D)}

The List-Then-Eliminate (LTE)

Algorithm

• A possible representation of a version space is a

listing of all the elements (hypotheses) in it.

List-Then-Eliminate:

VersionSpace = a list of all hypotheses in H

For each training example <x, c(x)>:

remove from VersionSpace any hypothesis h
where h(x) ≠ c(x)

Return VersionSpace
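For this toy example the hypothesis space is small enough that List-Then-Eliminate is actually feasible. A sketch under my tuple encoding (the single all-∅ hypothesis is omitted, since it cannot be consistent once a +ve example has been seen):

```python
from itertools import product

def h_matches(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

def consistent(h, D):
    return all(h_matches(h, x) == label for x, label in D)

# Each attribute ranges over its observed values plus the '?' wildcard.
choices = [('Sunny', 'Rainy', 'Cloudy', '?'), ('Warm', 'Cold', '?'),
           ('High', 'Normal', '?'), ('Strong', 'Weak', '?'),
           ('Warm', 'Cool', '?'), ('Same', 'Changes', '?')]

D = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
    (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), True),
    (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Changes'), False),
    (('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Changes'), True),
]

# Enumerate the whole space, keep only the consistent hypotheses.
version_space = [h for h in product(*choices) if consistent(h, D)]
print(len(version_space))   # 6 - matching the version-space diagram later on
```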

The List-Then-Eliminate (LTE)

Algorithm

• LTE can be applied whenever the hypothesis

space is finite (not always the case).

• It has the advantage of simplicity and the fact

that it will always work (guaranteed to output all

the hypotheses consistent with the training data).

• However, enumerating all the hypotheses in H is

unrealistic for all but the most trivial cases.

• We need a more compact representation.


Compact Representation of a Version

Space

• Recall that in our previous example, Find-S

found the hypothesis:

h = <Sunny, Warm, ?, Strong, ?, ?>

• This is only one of 6 possible hypotheses that

are consistent with the training examples.

• We can illustrate the 6 possible hypotheses in the next diagram…

Compact Representation of a Version

Space


Compact Representation of a Version

Space


(Diagram: the hypotheses ordered from most specific to most general; arrows represent the ≥g relation.)

Compact Representation of a Version

Space


Given only the 2 sets S and G, we can generate all the hypotheses ‘in between’.

Try it!

Compact Representation of a Version

Space

• Intuitively, we see that by having these general and specific ‘boundaries’ we can generate the whole version space (check the sketch proof in Mitchell).

• A few definitions.

• Defn: the general boundary G (remember that G is a set) WRT a hypothesis space H and training data D is the set of maximally general members of H consistent with D.

G ≡ {g ∈ H | Consistent(g, D) ⋀ (¬∃ g’ ∈ H)[(g’ >g g) ⋀ Consistent(g’, D)]}

• Defn: the specific boundary S WRT a hypothesis space H and training data D is the set of minimally general (maximally specific) members of H consistent with D.

S ≡ {s ∈ H | Consistent(s, D) ⋀ (¬∃ s’ ∈ H)[(s >g s’) ⋀ Consistent(s’, D)]}

(Back to) The Candidate Elimination

(CE) Algorithm

• CE computes the version space containing all the hypotheses in H that are consistent with the observed D.

• First we initialise G (remember it is a set) to contain the most general hypothesis possible:

G0 = {<?,?,?,?,?,?>}

• Then we initialise S to contain the most specific hypothesis possible:

S0 = {<∅,∅,∅,∅,∅,∅>}

The Candidate Elimination (CE)

Algorithm

• So far the two boundaries delimit the whole hypothesis space (every h in H is between G0 and S0).

• As each training example is considered the

boundary sets S and G are generalised and

specialised respectively to eliminate from the

version space any hypothesis in H that is

inconsistent.

• At the end we’ll end up with the correct

boundary sets.


The Candidate Elimination (CE)

Algorithm


The Candidate Elimination (CE)

Algorithm

• After init:

S0 = {<∅,∅,∅,∅,∅,∅>}

G0 = {<?,?,?,?,?,?>}

• Consider the first training example:

• It is positive, so:

Sky AirTemp Humidity Wind Water Forecast IsGoodDay

Sunny Warm Normal Strong Warm Same Yes

The Candidate Elimination (CE) Algorithm

• Part 1: all hypotheses in G are consistent with the

training example, so we don’t remove anything.

• Part 2:

– There is only one s in S (<∅,∅,∅,∅,∅,∅>), which is inconsistent:

• Remove <∅,∅,∅,∅,∅,∅> from S, leaving S = {}.

• Add to S all minimal generalisations h of s.

– i.e. we add <Sunny, Warm, Normal, Strong, Warm, Same> to S.

• Remove from S any hypothesis that is more general than any

other hypothesis in S.

– There is only 1 hypothesis in S so we do nothing.

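The “add all minimal generalisations” step is easy to state in code, because for conjunctive hypotheses the minimal generalisation covering a +ve instance is unique. A sketch (my encoding and names, with None standing in for ∅):

```python
def min_generalization(h, x):
    """Minimally generalise hypothesis h so that it covers the +ve instance x
    (the Part-2 step of CE for the S boundary)."""
    new_h = []
    for c, v in zip(h, x):
        if c is None:          # empty constraint: adopt the observed value
            new_h.append(v)
        elif c == v:           # already satisfied: keep as is
            new_h.append(c)
        else:                  # conflicting value: relax to '?'
            new_h.append('?')
    return tuple(new_h)

s0 = (None,) * 6
s1 = min_generalization(s0, ('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'))
print(s1)   # ('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same')
s2 = min_generalization(s1, ('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'))
print(s2)   # ('Sunny', 'Warm', '?', 'Strong', 'Warm', 'Same')
```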

Sky AirTemp Hum. Wind Water Forecast IsGoodDay

Sunny Warm Normal Strong Warm Same Yes

• So far we got:


The Candidate Elimination (CE) Algorithm

• Read the second training example. It is positive as well, so:


The Candidate Elimination (CE) Algorithm

Sky AirTemp Hum Wind Water Forecast IsGoodDay

Sunny Warm High Strong Warm Same Yes

• Part 1: all hypotheses in G are consistent with the

training example, so we don’t remove anything.

• Part 2:

– There is only one s in S (<Sunny, Warm, Normal, Strong,

Warm, Same>) which is inconsistent:

• Remove it from S, leaving S={}.

• Add to S all minimal generalisations h of s.

– i.e. we add <Sunny, Warm, ?, Strong, Warm, Same> to S.

• Remove from S any hypothesis that is more general than any other

hypothesis in S.

– There is only 1 hypothesis in S so we do nothing.

• So far we got:


The Candidate Elimination (CE)

Algorithm

• We notice that the role of +ve training examples is to

make the S boundary more general and the role of

the –ve training examples is to make the G boundary

more specific.

• Consider the 3rd training example, which is –ve.


Sky AirTemp Humidity Wind Water Forecast IsGoodDay

Rainy Cold High Strong Warm Changes No

• Step 1: S contains <Sunny, Warm, ?, Strong, Warm,

Same> which is consistent because it labels the training

example as ‘NO’ – we do nothing.

• Step 2:

– The hypotheses in G that are not consistent with d is

<?,?,?,?,?,?> because it labels it as ‘YES’. Remove it, leaving

G = {}.

– Add to G all minimal specialisations of g.

– Continued…


The Candidate Elimination (CE) Algorithm

Sky AirT Hum Wind Water Fore

Rainy Cold High Strong Warm Changes No

The Candidate Elimination (CE) Algorithm

• The g we removed is <?,?,?,?,?,?>, all the minimal

specialisations of it would be (remember we want to

label the training example as ‘NO’):

• Sky: <Sunny, ?, ?, ?, ?, ?>, <Cloudy, ?, ?, ?, ?, ?>

• Air Temp: <?,Warm,?,?,?,?>

• Humidity: <?,?,Normal,?,?,?>

• Wind: <?,?,?,Weak,?,?>

• Water: <?,?,?,?,Cold,?>

• Forecast: <?,?,?,?,?,Same>

• However, not all these minimal specialisations go

into the new G.


Sky AirT Hum Wind Water Fore

Rainy Cold High Strong Warm Changes No

• Only <Sunny, ?, ?, ?, ?, ?>, <?,Warm,?,?,?,?> and <?,?,?,?,?,Same>

go into the new G.

• <Cloudy, ?, ?, ?, ?, ?>, <?,?,Normal,?,?,?>, <?,?,?,Weak,?,?> and

<?,?,?,?,Cold,?> are not part of the new G.

• Why?

• Because they are inconsistent with the previously encountered training examples (so far we have seen training items 1 and 2).

– <Cloudy, ?, ?, ?, ?, ?> is inconsistent with training item 1 and 2.

– <?,?,Normal,?,?,?> is inconsistent with training item 2.

– <?,?,?,Weak,?,?> is inconsistent with training item 1 and 2.

– <?,?,?,?,Cold,?> is inconsistent with training item 1 and 2.

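The minimal specialisations of a hypothesis against a –ve instance can likewise be enumerated mechanically: for each '?', substitute every attribute value that disagrees with the instance. A sketch (my encoding and names; the filtering against previously seen data described above happens separately, against the S boundary, not inside this function):

```python
def min_specializations(g, domains, x):
    """Minimal specialisations of g that reject the -ve instance x:
    for every '?' in g, substitute each attribute value differing from x."""
    result = []
    for i, c in enumerate(g):
        if c == '?':
            for value in domains[i]:
                if value != x[i]:
                    result.append(g[:i] + (value,) + g[i + 1:])
    return result

domains = [('Sunny', 'Rainy', 'Cloudy'), ('Warm', 'Cold'), ('High', 'Normal'),
           ('Strong', 'Weak'), ('Warm', 'Cold'), ('Same', 'Changes')]
neg = ('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Changes')
specs = min_specializations(('?',) * 6, domains, neg)
print(len(specs))   # 7: two for Sky, one for each of the other five attributes
```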

The Candidate Elimination (CE) Algorithm

Sky AirTemp Humidity Wind Water Forecast IsGoodDay

Sunny Warm Normal Strong Warm Same Yes

Sunny Warm High Strong Warm Same Yes

Rainy Cold High Strong Warm Changes No

Sunny Warm High Strong Cool Changes Yes

The Candidate Elimination (CE)

Algorithm

• So far we got:


The Candidate Elimination (CE)

Algorithm

• After processing the 4th training item, we get:

The Candidate Elimination (CE)

Algorithm

• The entire version space derived from the

boundaries is:

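Putting the pieces together, the whole CE run on the four training examples fits in a short script. This is my own sketch of the algorithm as described in these slides (tuple hypotheses, '?' wildcard, None standing in for ∅), not code from the course; under those assumptions it reproduces the S and G boundaries derived above.

```python
def covers(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

def more_general(hj, hk):
    """hj >=g hk (None is the empty constraint, satisfied by nothing)."""
    return all(a == '?' or a == b or b is None for a, b in zip(hj, hk))

def min_generalization(s, x):
    """Unique minimal generalisation of s that covers the +ve instance x."""
    return tuple(v if c is None else (c if c == v else '?')
                 for c, v in zip(s, x))

def min_specializations(g, domains, x):
    """Minimal specialisations of g that reject the -ve instance x."""
    return [g[:i] + (v,) + g[i + 1:]
            for i, c in enumerate(g) if c == '?'
            for v in domains[i] if v != x[i]]

def candidate_elimination(examples, domains):
    S = {(None,) * len(domains)}        # most specific boundary
    G = {('?',) * len(domains)}         # most general boundary
    for x, positive in examples:
        if positive:
            G = {g for g in G if covers(g, x)}          # drop inconsistent g
            S = {min_generalization(s, x) for s in S}   # generalise S
            S = {s for s in S if any(more_general(g, s) for g in G)}
        else:
            S = {s for s in S if not covers(s, x)}      # drop inconsistent s
            G = {h for g in G
                 for h in (min_specializations(g, domains, x)
                           if covers(g, x) else [g])
                 if any(more_general(h, s) for s in S)}
            # keep only the maximally general members of G
            G = {g for g in G
                 if not any(h != g and more_general(h, g) for h in G)}
    return S, G

domains = [('Sunny', 'Rainy', 'Cloudy'), ('Warm', 'Cold'), ('High', 'Normal'),
           ('Strong', 'Weak'), ('Warm', 'Cool'), ('Same', 'Changes')]
train = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
    (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), True),
    (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Changes'), False),
    (('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Changes'), True),
]
S, G = candidate_elimination(train, domains)
print(S)   # the single hypothesis <Sunny, Warm, ?, Strong, ?, ?>
print(G)   # the two hypotheses <Sunny,?,?,?,?,?> and <?,Warm,?,?,?,?>
```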

Converging

• CE will converge towards the target concept if:

– There are no errors in the training data.

– The target concept is in H.

• The target concept is exactly learnt when S and G converge

to a single and identical hypothesis.

• If the training data contains errors e.g. a +ve example is

incorrectly labeled as –ve:

– The algorithm will remove the correct target concept from the

version space.

– Eventually, if we are presented with enough training data, we

will detect an inconsistency because the G and S boundaries will

converge to an empty version space (i.e. there is no hypothesis

in H that is consistent with all the training examples).


Requesting Training Examples

• So far, our algorithm was given a set

containing labeled training data.

• Suppose that our algorithm can come up with

an instance and ask (query) an external oracle

to label it.

• What instance should the algorithm come up with to ask the oracle to label?

Requesting Training Examples

• Consider the version space we got from the 4 fixed training examples we had.

• What training example would we like to have to further refine it?

• We should come up with an instance that will be classified as +ve by some hypotheses and –ve by others, to further reduce the size of the version space.

Requesting Training Examples

• Suppose we request the training example:

<sunny, warm, normal, light, warm, same>

• 3 hypotheses would classify it as +ve and 3 would classify it as –ve:

Requesting Training Examples

• So if we ask the oracle to classify:

<sunny, warm, normal, light, warm, same>

• We’d either generalise the S boundary or

specialise the G boundary and shrink the size of

the version space (make it converge).

• In general, the optimal instance we’d like the oracle to classify (the best training example to have next) is the one that would halve the size of the version space.

• If we have this option, we can converge to the target concept in log2(|VS|) queries.

Partially Learned Concepts

• Partially learned = we didn’t converge to the

target concept (S and G are not the same).

• Our previous example is a partially learned

concept:


Partially Learned Concepts

• It is possible to classify unseen examples with a degree of certainty.

• Suppose we want to classify the instance (not in training data)…

<sunny, warm, normal, strong, cool, change>

• …using our partially learned concept.


Notice that every hypothesis in the version space classifies the unseen instance as +ve.

So all the hypotheses classify it as +ve with the same confidence as if only the target concept had remained (i.e. as if S and G had converged).

Sky AirTemp Humidity Wind Water Forecast

Sunny Warm Normal Strong Cool Change ? → all hypotheses in the version space classify it as +ve.

Rainy Cold Normal Light Warm Same ? → all hypotheses in the version space classify it as –ve.

Sunny Warm Normal Light Warm Same ? → 50/50 split; we need more training examples. (Note: this is an optimal query to request from an oracle.)

Sunny Cold Normal Strong Warm Same ? → 2 +ve, 4 –ve; possibly take a majority vote and output a confidence level.

Inductive Bias

• Recall that our system, so far, works assuming the

target concept exists in our hypothesis space.

• Also recall that our hypothesis space allowed only

for conjunctions (AND) of attribute values.

• There is no way to allow for a disjunction of

values – we cannot say “Sky=Cloudy OR

Sky=Sunny”.

• Consider what would happen if in fact, I like

swimming if it is cloudy or sunny. I’d get

something like…


Inductive Bias


• CE will converge to an empty version space: the target concept is not in

the hypothesis space. To see why:

• The most specific hypothesis that classifies the first two examples as +ve

is:

<?, Warm, Normal, Strong, Cool, Change>

• Although it is maximally specific for the first 2 examples, it is already too general: it will classify the 3rd example as +ve too.

• The problem is that we biased our learner to consider only hypotheses

that are conjunctions.

Sky AirTemp Humidity Wind Water Forecast

Sunny Warm Normal Strong Cool Change Y

Cloudy Warm Normal Strong Cool Change Y

Rainy Warm Normal Strong Cool Change N

Unbiased Learning

• Let’s see what happens if, to make sure that the target concept definitely exists in the hypothesis space, we define the hypothesis space to contain every possible concept.

– This means that it is possible to represent every possible subset of X.

• In our previous example (containing 6 attributes), the size of the

instance space is 96.

• How many possible concepts can be defined over this set of instances?

• The powerset!!!

– Recall that the size of a powerset is (in general) 2^|X|.

– So there are 2^96 (ouch) possible concepts that can be learnt from our instance space.

– We had seen that by introducing ? and ∅, we allowed for only 973 possible concepts, which is vastly fewer than 2^96 (we had a very strong bias).
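A quick arithmetic check of the comparison just made:

```python
n_instances = 3 * 2**5           # 96 distinct instances
n_concepts = 2 ** n_instances    # one concept per subset of X: 2^96
n_conjunctive = 1 + 4 * 3**5     # 973 semantically distinct conjunctive hypotheses

print(n_instances, n_conjunctive)
print(n_concepts > 10**28)       # True: 2^96 is astronomically larger than 973
```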

Unbiased Learning

• Let’s define a new hypothesis space H′ that can represent every subset of instances, i.e. H′ = P(X), the powerset of X.

• To do this, we allow H′ to contain any combination of disjunctions, conjunctions, and negations. E.g. the target concept “Sky=Sunny OR Sky=Cloudy” would be:

<Sunny,?,?,?,?,?> ∨ <Cloudy,?,?,?,?,?>

• So we can use CE knowing that our target concept will

definitely exist in the hypothesis space. But…

• We create a new problem. Our learner will learn how

to classify exactly the instances presented as training

examples and not generalise beyond them!


Unbiased Learning

• To see why, suppose I have 5 training examples d1, d2, d3, d4, d5, and that d1, d2, d3 are +ve examples and d4, d5 are –ve examples.

• The S boundary will become a disjunction of the +ve examples (since it is the most specific possible hypothesis that covers the examples):

S = {(d1 ∨ d2 ∨ d3)}

• The G boundary will become a negation (ruling out) of the negative training examples:

G = {¬(d4 ∨ d5)}

• So the only unambiguously classifiable instances are those that were provided as training examples.

Unbiased Learning

• What would happen if we use the partially learned concept and take a vote?

• Instances that were originally in the training data will be

classified unambiguously (obviously).

• Any other instance not in the training data will be classified as +ve by half of the hypotheses in the version space and as –ve by the other half.

– Note that H is the power set of X.

– x is some unobserved instance (not in the training data).

– Then there is some h in the version space that covers x.

– But there is also a corresponding h’ in the version space that is identical to h except that it classifies x oppositely.

More on Bias

• Straight from Mitchell [1]:

A learner that makes no a priori

assumptions regarding the identity of the

target concept has no rational basis for

classifying any unseen instances.

• (in fact, CE worked because we biased it with

the assumption that the target concept can be

represented by a conjunction of attribute

values).


More on Bias

• Consider:

– L = a learning algorithm.

– Dc = {<x, c(x)>} = a set of training data.

– c = some target concept.

– xi = some instance we wish to classify.

– L(xi, Dc) = the classification (+ve/–ve) that L assigns to xi after learning from the training data Dc.

• The inductive inference step is:

(Dc ⋀ xi) ≻ L(xi, Dc)

• Where a ≻ b denotes that b is inductively inferred from a.

• So the inductive inference step reads “given the training data Dc and the instance xi as inputs to L, we can inductively infer the classification of the instance.”

More on Bias


Sky AirTemp Humidity Wind Water Forecast IsGoodDay

Sunny Warm Normal Strong Warm Same Yes

Sunny Warm High Strong Warm Same Yes

Rainy Cold High Strong Warm Changes No

Sunny Warm High Strong Cool Changes Yes

(Dc ⋀ xi) ≻ L(xi, Dc)

More on Bias

• Because L is an inductive learning algorithm, in general, we cannot prove that the result L(x_i, D_c) is correct. I.e. the classification of the example does not necessarily follow deductively from the training data (it cannot be proven).

• However, we can add a number of assumptions to our system so that the classification would follow deductively.

• The inductive bias of L is defined as these assumptions.

Kristian Guillaumier, 2011 67

More on Bias

• Let B = these assumptions (e.g. the hypothesis space is made up only of conjunctions of attribute values).

• Then the inductive bias of L is B, giving:

(B ∧ D_c ∧ x_i) ⊢ L(x_i, D_c)

• Where the notation a ⊢ b denotes that b follows deductively from a (b is provable from a).

Kristian Guillaumier, 2011 68

Defn. of Inductive Bias

Consider a concept learning algorithm L for the set of instances X.

Let c be an arbitrary concept over X and let D_c = {<x, c(x)>} be an arbitrary set of training examples of c.

Let L(x_i, D_c) denote the classification assigned to the instance x_i by L after training on the data D_c.

The Inductive Bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples D_c:

(∀x_i ∈ X)[(B ∧ D_c ∧ x_i) ⊢ L(x_i, D_c)]

Kristian Guillaumier, 2011 69

The Inductive Bias of CE

• Let us specify what L(x_i, D_c) means for CE (how classification works):

– Given training data D_c, CE will compute the version space VS_{H,D_c}.

– Then it will classify a new instance x_i by taking a vote amongst the hypotheses in this version space.

– A classification will be output (+ve or –ve) only if all the hypotheses in the version space unanimously agree. Otherwise no classification is output ("I can't tell from the training data").

• The inductive bias of CE is that the target concept c is contained in the hypothesis space, i.e. c ∈ H.

• Why?
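The unanimous-vote rule can be sketched directly. This is a minimal illustration, not from the slides: the version space here is a made-up set of boolean predicates over integers standing in for hypotheses.

```python
def classify_by_vote(version_space, x):
    """Classify x only if every hypothesis in the version space agrees;
    otherwise return None ("I can't tell from the training data")."""
    votes = {h(x) for h in version_space}
    return votes.pop() if len(votes) == 1 else None

# A toy version space: three hypotheses consistent with some training data.
vs = [lambda x: x > 0, lambda x: x > -1, lambda x: x >= 0]

print(classify_by_vote(vs, 5))    # unanimous +ve -> True
print(classify_by_vote(vs, -3))   # unanimous -ve -> False
print(classify_by_vote(vs, 0))    # hypotheses disagree -> None
```

Note that the rule abstains rather than guesses: disagreement anywhere in the version space means the training data does not yet determine the label.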

Kristian Guillaumier, 2011 70

The Inductive Bias of CE

• 1:

– Notice that if we assume that c ∈ H, then it follows deductively (we can prove) that c ∈ VS_{H,D_c}.

• 2:

– Recall that we defined the classification L(x_i, D_c) to be a unanimous vote amongst all hypotheses in VS_{H,D_c}.

– Thus, if L outputs the classification L(x_i, D_c), then so does every hypothesis h ∈ VS_{H,D_c}, including c itself (since c ∈ VS_{H,D_c}).

– Therefore c(x_i) = L(x_i, D_c).
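The second step is small enough to state formally. A sketch in Lean (my own formalization, not from the slides): if c belongs to the version space and every hypothesis in the version space assigns x the same label, then that label must be c(x).

```lean
-- The version space is modelled as a predicate VS on hypotheses
-- (boolean classifiers on α). If c is in the version space and the
-- vote on x is unanimous, the output label must equal c x.
theorem unanimous_vote_correct {α : Type} (VS : (α → Bool) → Prop)
    (c : α → Bool) (x : α) (label : Bool)
    (hc : VS c) (hall : ∀ h, VS h → h x = label) :
    c x = label :=
  hall c hc
```

The proof is a one-liner precisely because the unanimity hypothesis applies to c as a member of the version space.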

Kristian Guillaumier, 2011 71

**All content in these slides adapted from:
**

• Most material in these slides from:

– – – –

– – – –

[1] Machine Learning: Tom Mitchell (get this book). [2] Introduction to Machine Learning: Ethem Alpaydin. [3] Introduction to Expert Systems: Peter Jackson. [4] An introduction to Fuzzy Logic and Fuzzy Sets: Buckley, Elsami. [5] Pattern Recognition and Machine Learning: Christopher Bishop. [6] Grammatical Inference: Colin de la Higuera. [7] Artificial Intelligence: Negnevitsky. Miscellaneous web references.

Kristian Guillaumier, 2011 2

(main source: Mitchell [1])

**CONCEPT LEARNING, FIND-S, CANDIDATE ELIMINATION
**

Kristian Guillaumier, 2011 3

Kristian Guillaumier.Note on Induction • If a large number of items I have seen so far all possess some property. • In machine learning: – Input: A number of training examples for some function. then ALL items possess that property. • So far the sun has always risen… • Are all swans white… • We never known whether our induction is true (we have not proven it). – Output: a hypothesis that approximates the function. 2011 4 .

Kristian Guillaumier.g.Concept Learning • Learning: inducing general functions from specific training examples (+ve and/or –ve). E. 2011 5 . the IsCat() function over the set of animals. • Search a space of potential hypotheses (hypothesis space) for a hypothesis that best fits the training data provided. • Concept learning: induce the definition of a general category (e. E. ‘cat’) from a sample of +ve and –ve training data. • Each concept can be viewed as the description of some subset (there is a general-to-specific ordering) defined over a larger set.g.: – Cat Feline Animal Object • A boolean function over the larger set.g. • Concept learning learning this function from training data.

Example • Learn the concept: Good days when I like to swim. 2011 6 . Attributes: – – – – – – Sky Sunny/Rainy/Cloudy. Forecast Same/Change. Wind Strong/Weak. – no value is acceptable. Kristian Guillaumier. AirTemp Warm/Cold. • Our hypothesis is represented as a conjunction of constraints on attributes. • We want to learn the function: – IsGoodDay(input) true/false. Water Warm cold. • Other possible contraints (values) for an attribute: – ? ‘I don’t care’. any value is acceptable. Humidity High/Normal. Attribute Constraints Sky Sunny/Rainy.

?. Water. . is a vector of constraints on these attributes: <Sky. . .Example • Our hypothesis. Warm. Wind. Forecast> • An example of an hypothesis is (only on warm days with normal humidity): <?. ?> • The most specific hypothesis is: < . ?> • The most general hypothesis is: <?. Humidity. > Kristian Guillaumier. ?. ?. then. ?. 2011 7 . AirTemp. . Normal. ?. ?.

2011 8 .Example • Training Examples: Sky Sunny Sunny Rainy Sunny AirTemp Warm Warm Cold Warm Humidity Normal High High High Wind Strong Strong Strong Strong Water Warm Warm Warm Cool Forecast Same Same Changes Changes IsGoodDay Yes Yes No Yes Kristian Guillaumier.

the function c. 2011 9 . Kristian Guillaumier. Humidity. cats over animals. takes an instance x X and in our example • c(x) = 1 if IsGoodDay is Yes • c(x) = 0 if IsGoodDay is no. – i. … – The set of all animals. good days to swim over all days) is called the target concept denoted by c (note that c is the target function).e. • The concept to be learnt (e.Notation • The set of all items over which the concept is called the set of instances.g. X. – The set of all days represented by the attributes AirTemp. – c is a Boolean-valued function defined over the instances X. etc… • An instance in X is denoted by x (x X).

the learner is given a training set that consists of: – A number of instances x from X. we have the value of the target concept c(x).Notation • When learning the target concept. 2011 10 . c(x)> An instance and its target concept value Kristian Guillaumier. • A training example is usually denoted by: <x. – Instances where c(x) = 0 are called –ve training examples (non-members). c. – Instances where c(x) = 1 are called +ve training examples (members of target concept). – For each instance.

1}.Notation • Given the set of training examples. AirTemp. 2011 11 . Wind. we want to learn (hypothesize) the target concept c. • Each hypothesis in H denoted by h and is usually a boolean valued function h:X{0. • H is the set of all hypotheses that we are considering (all the possible combinations of <Sky. Forecast>). Kristian Guillaumier. Water. • Goal of the learner is to find an h such that: – h(x) = c(x) for all x X. Humidity.

Concept Learning Task (Mitchell pg. 2011 12 . 22) Kristian Guillaumier.

Kristian Guillaumier. • What about ‘unseen’ instances (where we don’t have training data). • The only information we have on c is the training data.The Inductive Learning Hypothesis • Our main goal is to find a hypothesis h that is identical to c for all x X (for every instance possible). At best we can guarantee that our learner will learn a hypothesis that learns the training data exactly (not good). 2011 13 . • We make an assumption… the inductive learning hypothesis… • Any hypothesis that approximates the target function well over a sufficiently large set of training examples will also approximate the target function well for other unobserved/unseen examples.

Size and Ordering of the Search Space • • • We must search in the hypothesis space for the best one (that matches c). Humidity High/Normal (2 options). Recall: – – – – – – Sky Sunny/Rainy/Cloudy (3 options). 2011 14 . • • • Which means that we have 322222=96 distinct instances. So the number of semantically distinct instances is: – 1 + 433333 = 973 Distinct possibilities + 1 for the ? 1 representing all the definitely negative ones since they contain Kristian Guillaumier. Wind Strong/Weak (2 options). Forecast Same/Change (2 options). The size of the hypothesis space is defined by the hypothesis representation. However note that any instance that contains one or more ’s by definition always means that it is classified –vely. Due to the addition of the symbols ? and we have 544444=5120 syntactically distinct instances. AirTemp Warm/Cold (2 options). Water Warm cold (2 options).

• Clearly. clearly. • This allows us to organize the search space (order the set) according to this relationship between instances (hypotheses) in it. • This ordering concept is very important because there are concept learning algorithms that make use of this ordering. since h2 has less constraints.Size and Ordering of the Search Space • Consider the following hypotheses: – h1: <Sunny. ?. Strong. ?. ?> • Consider the sets of instances that are classified as +ve by h1 and h2. ?. ?. ?. anything that is classified as +ve by h1 will also be classified as +ve by h2. ?. Kristian Guillaumier. ?> – h2: <Sunny. ?. it will classify more instances as +ve than h1. • We say that h2 ‘is more general than’ h1. 2011 15 .

2011 16 . – x satisfies h iff h(x) = 1. • The “is more general or equal to” relationship is denoted by g. hk(x)=1 hj(x) = 1 Kristian Guillaumier.Size and Ordering of the Search Space • For any instance xX and hypothesis hH. hj g hk iff any instance that satisfies hk also satisfies hj. • Given two hypothesis hj and hk. hj g hk iff xX.

… Kristian Guillaumier. 2011 17 . more specific or equal to. it is useful to define: strictly more general than.Size and Ordering of the Search Space • Just as we have defined g .

2011 18 . h2 ‘contains’ h1 and h3 because it is more general.Size and Ordering of the Search Space Set of all instances Each hypothesis corresponds to a subset of X (arrows). h1 and h3 are not more general or specific than each other Kristian Guillaumier.

Size and Ordering of the Search Space • Notes: – g is a partial order over the hypothesis space H (it is reflexive. c ∈ S: aRb ⋀ bRc ⇒ aRc. or Reflexive Relation – A reflexive relation is a binary relation R over a set S where every element in S is related to itself. and transitive). the ≤ relation over Z+ is antisymmetric because ∀ a. Kristian Guillaumier. • Antisymmetry. c ∈ Z+. • Reflectivity. b ∈ S. or Antisymmetric Relation – An antisymmetric relation is a binary relation R over a set S where ∀ a. the ≤ relation over Z+ is transitive because ∀ a. ∀ x ∈ S. – For example. – For example. 2011 19 . or Transitive Relation – A transitive relation is a binary relation R over a set S where ∀ a. xRx holds true. if a ≤ b and b ≤ c then a ≤ c. – An equivalent way of stating this is that ∀ a. – For example. antisymmetric. • Transitivity. – That is. aRb ⋀ bRa ⇒ a = b. b. x ≤ x. c ∈ Z+. b. b. b ∈ S. the ≤ relation over Z+ is reflexive because ∀ x ∈ Z+. aRb ⋀ a ≠ b⇒ ¬bRa. if a ≤ b and b ≤ a then a = b.

The Find-S Algorithm Initialise h to the most specific hypothesis in H. For each +ve training instance x For each attribute constraint ai in h If constraint ai is satisfied by x Then Do Nothing Else Replace ai in h by the next more general constraint that is satisfied by x Return h Kristian Guillaumier. 2011 20 .

Normal. Same> • After having covered the 1st training example. However it is still very specifiy – it will classify all possible instances to –ve except the one +ve training example it has seen. • Continue with the next +ve training example. Strong. h is more general than what we started with. Pick the next more general attribute value from which is ‘Sunny’. Warm. • Repeat for all attributes AirTemp. . . Warm. … until we get: h = <Sunny. Strong. > • Start with first +ve training example: • Consider the first attribute ‘sky’. Our hypothesis says (most specific) but our training example says ‘Sunny’ (more general) so the attribute in the hypothesis does not satisfy the attribute in x. Warm. Same> x = <Sunny. Normal.The Find-S Algorithm • Init h to most specific hypothesis: h = < . . 2011 21 . . humidity. Warm. Kristian Guillaumier.

Warm. Warm. Warm. Strong. Strong. Same> Loop thru all the attributes. Same> • We complete the algorithm to get the hypothesis: h = <Sunny. 2nd nothing. High. Strong. Warm. Normal. Warm. Normal. High. ?. Same> • Recall that so far: • h = <Sunny. Strong. ?> Kristian Guillaumier. Warm. Warm. ?. Same> new h… h = <Sunny. 2011 22 . Strong. 3rd not satisfied – pick next more general.The Find-S Algorithm • The next example is: x = <Sunny. Warm. Same> h = <Sunny. Warm. Strong. 1st satisfied – do nothing. Warm. Warm. ?. etc… satisfied – do x = <Sunny.

Kristian Guillaumier. Cold. Change> • Note that h is already consistent with the training example. Warm. ?. Warm. Strong. High. Same> • The 3rd training example (that was skipped) was: x = <Rainy. our hypothesis was: h = <Sunny. • After considering the 2nd training example. • However we observe that the hypothesis we had generated so far was consistent with this –ve training example.Observation • Remember that the algorithm skipped the 3rd training example because it was –ve.The Find-S Algorithm . Warm. Strong. 2011 23 .

• To see why: – h is the most specify hypothesis in H that is consistent with the currently observed training examples. 2011 24 . if the a more general hypothesis will not misclassify –ve example. – But c will never cover a –ve example so neither will h (by definition of g). the more specific one cannot misclassify it either. – Since we assume the target concept c is in H and is consistent (obviously) with the positive training examples. Clearly.Observation • As long as the hypothesis space H contains the hypothesis that describes the target concept c and there are no errors in the training data. the current hypothesis can never be inconsistent with a –ve training example.The Find-S Algorithm . Consider Animal g Cat If cat then animal If animal then maybe cat If not cat then maybe not animal If not animal then not cat (if –ve is correctly classified by general then it is correctly classified by specific) Kristian Guillaumier. then c must be g than h.

• Issues: – We don’t know whether we found the only hypothesis that is consistent. Why not the most general hypothesis consistent? Why not something in between. 2011 25 . – How do we know if the training data is consistent? In real-life cases. – Find-S will find the most specific hypothesis consistent with the training data. There might be more hypotheses that are consistent. training data may contain noise or errors. – There might be more than one maximally specific hypothesis – which one do we pick? Kristian Guillaumier.The Find-S Algorithm – Issues Raised • Find-S is guaranteed to find the most specific hH that is consistent with the +ve and –ve training examples assuming that the training examples are correct (no noise).

D) <x. c(x)> D. c(x)> in D. h(x) = c(x) Kristian Guillaumier. • Defn: h is consistent with a set of training examples D iff h(x) = c(x) for each example <x. it does so without enumerating the whole space. that although the hypothesis output by Find-S is consistent with the training data. Consistent(h. it is one of the.The Candidate Elimination (CE) Algorithm • Note. • CE will output (a description of) all the hypothesis consistent with the training data. 2011 26 . • CE finds all describable hypotheses that are consistent with the observed training examples. • Interestingly. many hypothesis that is consistent. possibly.

VSH. consistent versions of the target concept.D)} Kristian Guillaumier.D {hH|Consistent(h.The Candidate Elimination (CE) Algorithm • The subset of all the hypotheses that are consistent with the training data (what CE finds) is called the version space WRT the hypothesis space H and the training data D – it contains all the possible.D with respect to the hypothesis space H and the training data D is the subset of hypotheses from H consistent with the training data in D. 2011 27 . • Defn: the version space denoted VSH.

2011 28 .The List-Then-Eliminate (LTE) Algorithm • A possible representation of a version space is a listing of all the elements (hypotheses) in it. List-Then-Eliminate: VersionSpace = Generate all hypothesis in H For each training example d: remove from VersionSpace any hypothesis h where h(x) <> c(x) Return VersionSpace Kristian Guillaumier.

• It has the advantage of simplicity and the fact that it will always work (guaranteed to output all the hypotheses consistent with the training data). 2011 29 . • However. Kristian Guillaumier.The List-Then-Eliminate (LTE) Algorithm • LTE can be applied whenever the hypothesis space is finite (not always the case). • We need a more compact representation. enumerating all the hypotheses in H is unrealistic for all but the most trivial cases.

Find-S found the hypothesis: h = <Sunny. ?> • This is only one of 6 possible hypotheses that are consistent with the training examples. ?. 2011 30 . Warm. • We can illustrate the 6 possible hypothesis in the next diagram… Kristian Guillaumier. ?. Strong.Compact Representation of a Version Space • Recall that in our previous example.

2011 6 31 .Compact Representation of a Version Space 1 2 3 4 5 Kristian Guillaumier.

Kristian Guillaumier.Compact Representation of a Version Space Most specific. Most general. Arrow represents the g relation. 2011 32 .

2011 33 . we can generate all the hypotheses ‘in between’.Compact Representation of a Version Space Given only the 2 sets S and G. Try it! Kristian Guillaumier.

D)]} Kristian Guillaumier. G {gH |Consistent(g. D) (s’H)[(s>gs’)Consistent(s’.Compact Representation of a Version Space • Intuitively we see that by having these general and specific ‘boundaries’ we can generate the whole version space (check sketch proof in Mitchell). • A few definitions. 2011 34 . G {sH |Consistent(s.D)]} • Defn: the specific boundary S WRT a hypothesis space H and training data D is the set of minimally general (maximally specific) members of H consistent with D. D) (g’H)*(g’>gg)Consistent(g’. • Defn: the general boundary G (remember that G is a set) WRT a hypothesis space H and training data D is the set of maximally general members of H consistent with D.

G0 = {<?.?.?.?>} • Then we initialise S to contain the most specific hypothesis possible. • First we initialise G (remember it is a set) to contain the most general hypothesis possible.>} Kristian Guillaumier.?... 2011 35 .(Back to) The Candidate Elimination (CE) Algorithm • CE computes the version space containing all the hypotheses in H that are consistent with the observed D.?.. S0 = {<..

• At the end we’ll end up with the correct boundary sets. • As each training example is considered the boundary sets S and G are generalised and specialised respectively to eliminate from the version space any hypothesis in H that is inconsistent. Kristian Guillaumier.The Candidate Elimination (CE) Algorithm • So far the two boundaries delimit the whole hypothesis space (every h in H is between G0 and S0). 2011 36 .

2011 37 .The Candidate Elimination (CE) Algorithm Kristian Guillaumier.

2011 38 ..>} G0 = {<?.?.The Candidate Elimination (CE) Algorithm • After init: S0 = {<..?.?..?>} • Consider the first training example: Sky Sunny AirTemp Warm Humidity Normal Wind Strong Water Warm Forecast Same IsGoodDay Yes • It is positive so: Kristian Guillaumier..?.

Normal Wind Strong Water Warm Forecast Same Yes • Part 1: all hypotheses in G are consistent with the training example.e. Same> to S..The Candidate Elimination (CE) Algorithm Sky Sunny AirTemp Warm Hum.. • Remove from S any hypothesis that is more general than any other hypothesis in S.>) which is inconsistent: • Remove <. • Part 2: – There is only one s in S (<... Warm. Kristian Guillaumier. Normal.. – i. so we don’t remove anything... Strong.. Warm. – There is only 1 hypothesis in S so we do nothing. we add <Sunny. • Add to S all minimal generalisations h of s. 2011 39 .> from S. leaving S={}.

2011 40 .The Candidate Elimination (CE) Algorithm • So far we got: Kristian Guillaumier.

• Part 2: – There is only one s in S (<Sunny. Strong. It is positive as well so. Same>) which is inconsistent: • Remove it from S. Strong. Sky Sunny AirTemp Warm Hum High Wind Strong Water Warm Forecast Same Yes • Part 1: all hypotheses in G are consistent with the training example. – i. Kristian Guillaumier. so we don’t remove anything. • Remove from S any hypothesis that is more general than any other hypothesis in S. Same> to S. 2011 41 . Normal.The Candidate Elimination (CE) Algorithm • Read the second training example. Warm. ?. Warm.e. • Add to S all minimal generalisations h of s. we add <Sunny. Warm. Warm. – There is only 1 hypothesis in S so we do nothing. leaving S={}.

2011 42 .The Candidate Elimination (CE) Algorithm • So far we got: Kristian Guillaumier.

• Consider the 3rd training example. which is –ve.The Candidate Elimination (CE) Algorithm • We notice that the role of +ve training examples is to make the S boundary more general and the role of the –ve training examples is to make the G boundary more specific. 2011 43 . Sky Rainy AirTemp Cold Humidity High Wind Strong Water Warm Forecast Changes IsGoodDay No Kristian Guillaumier.

– Continued… Kristian Guillaumier.?. Same> which is consistent because it labels the training example as ‘NO’ – we do nothing. • Step 2: – The hypotheses in G that are not consistent with d is <?.?. Warm. Remove it. Warm. ?.?.?> because it labels it as ‘YES’.The Candidate Elimination (CE) Algorithm Sky Rainy AirT Cold Hum High Wind Strong Water Warm Fore Changes No • Step 1: S contains <Sunny. – All to G all minimal specialisations of g. 2011 44 .?. Strong. leaving G = {}.

?.?> <?. ?.?.?.?. <Cloudy.?>. ?.Normal. ?. 2011 45 .?> <?. not all these minimal specialisations go into the new G.?.?.The Candidate Elimination (CE) Algorithm Sky Rainy AirT Cold Hum High Wind Strong Water Warm Fore Changes No • The g we removed is <?. ?> <?.?. ?.?.?.?.?. ?>.?.Weak. all the minimal specialisations of it would be (remember we want to label the training example as ‘NO’): • • • • • • Sky: Air Temp: Humidity: Wind: Water: Forecast: <Sunny.?> <?.?. Kristian Guillaumier.Cold.?.?.Same> • However.?.?.?> <?. ?. ?.Warm.?.?. ?. ?.?.

?.Cold. ?.?. <?.?. ?.?.?. <?.Warm.?. ?.The Candidate Elimination (CE) Algorithm • Only <Sunny.?.?.?.?.?>.?.?> is inconsistent with training item 1 and 2. ?> is inconsistent with training item 1 and 2.?.?. ?.?. <?.?. • Why? • Because it is inconsistent with the previously encountered training examples (so far we saw training items 1 and 2).?. ?. <?. • <Cloudy. ?. ?.?.Weak.?.?. ?.?.Normal.?> and <?.?> and <?.?. – – – – Sky <Cloudy.?.Weak.?> is inconsistent with training item 1 and 2.?> is inconsistent with training item 2.?.?.Same> go into the new G. AirTemp Humidity Wind Water Forecast IsGoodDay Sunny Sunny Rainy Sunny Warm Warm Cold Warm Normal High High High Strong Strong Strong Strong Kristian Guillaumier. 2011 Warm Warm Warm Cool Same Same Changes Changes Yes Yes No Yes 46 .Cold. <?.Normal. ?.?> are not part of the new G. ?>. ?>. ?. ?.?.?. <?.

The Candidate Elimination (CE) Algorithm • So far we got: Kristian Guillaumier. 2011 47 .

The Candidate Elimination (CE) Algorithm • After processing the 4th training item. we get: Kristian Guillaumier. 2011 48 .

**The Candidate Elimination (CE) Algorithm
**

• The entire version space derived from the boundaries is:

Kristian Guillaumier, 2011

49

Converging

• CE will converge towards the target concept if:

– There are no errors in the training data. – The target concept is in H.

• The target concept is exactly learnt when S and G converge to a single and identical hypothesis. • If the training data contains errors e.g. a +ve example is incorrectly labeled as –ve:

– The algorithm will remove the correct target concept from the version space. – Eventually, if we are presented with enough training data, we will detect an inconsistency because the G and S boundaries will converge to an empty version space (i.e. there is no hypothesis in H that is consistent with all the training examples).

Kristian Guillaumier, 2011

50

**Requesting Training Examples
**

• So far, our algorithm was given a set containing labeled training data. • Suppose that our algorithm can come up with an instance and ask (query) an external oracle to label it. • What instance should the algorithm come up with for an answer from the oracle?

Kristian Guillaumier, 2011

51

Requesting Training Examples • Consider the version space we got from the 4 fixed training examples we had? • What training example would we like to have to further refine it? • We should come up with an instance that will classified as +ve by some hypotheses and –ve by others to further reduce the size of the version space. Kristian Guillaumier. 2011 52 .

warm. normal. warm.Requesting Training Examples • Suppose we request the training example: <sunny. same> • 3 hypothesis would classify it as +ve and 3 would classify it as –ve: Kristian Guillaumier. light. 2011 53 .

warm. the optimal instance we’d like the oracle to classify (the best training example to have next) is the one that would half the size of the version space. Kristian Guillaumier. warm. 2011 54 . same> • We’d either generalise the S boundary or specialise the G boundary and shrink the size of the version space (make it converge). normal. light. • In general. • If we have this option we can converge to the target concept in Log2 time.Requesting Training Examples • So if we ask the oracle to classify: <sunny.

2011 55 . • Our previous example is a partially learned concept: Kristian Guillaumier.Partially Learned Concepts • Partially learned = we didn’t converge to the target concept (S and G are not the same).

warm.Partially Learned Concepts • • It is possible to classify unseen examples with a degree of certainty. change> • …using our partially learned concept. strong. So all the hypothesis classified it as +ve with the same confidence as if there would have been only the target concept remaining (converged). Suppose we want to classify the instance (not in training data)… <sunny. cool. 2011 56 . Kristian Guillaumier. Notice that every hypothesis in the version space classifies the unseen instance as +ve. normal.

Kristian Guillaumier. Possible take a majority vote and output a confidence level. 2 +ve. Need more training examples. 4-ve. 2011 57 . Note: This is an optimal query to request from an oracle.Partially Learned Concepts All hypothesis in version space classify as +ve Sky Sunny Rainy Sunny Sunny AirTemp Warm Cold Warm Cold Humidity Normal Normal Normal Normal Wind Strong Light Light Strong Water Cool Warm Warm Warm Forecast Change Same Same Same ? ? ? ? All hypothesis in version space classify as -ve 50/50.

• There is no way to allow for a disjunction of values – we cannot say “Sky=Cloudy OR Sky=Sunny”. • Consider what would happen if in fact. I like swimming if it is cloudy or sunny. • Also recall that our hypothesis space allowed only for conjunctions (AND) of attribute values. works assuming the target concept exists in our hypothesis space. I’d get something like… Kristian Guillaumier. so far.Inductive Bias • Recall that our system. 2011 58 .

Strong.Inductive Bias Sky Sunny Cloudy Rainy AirTemp Warm Warm Warm Humidity Normal Normal Normal Wind Strong Strong Strong Water Cool Cool Cool Forecast Change Change Change Y Y N • CE will converge to an empty version space: the target concept is not in the hypothesis space. Normal. Cool. 2011 59 . To see why: • The most specific hypothesis that classifies the first two examples as +ve is: <?. Kristian Guillaumier. Change> • Although it is maximally specific for the frst 2 examples. it is already to general: it will classify the 3rd example as +ve too. • The problem is that we biased our learner to consider only hypotheses that are conjunctions. Warm.

– This means that it is possible to represent every possible subset of X.Unbiased Learning • Let’s see what happens if to make sure that the target concept definitely exists in the hypothesis space. • In our previous example (containing 6 attributes). we define the hypothesis space to contain every possible concept. Kristian Guillaumier. – We had seen that by introducing ? and . • The powerset!!! – Recall that the size of a powerset (in general) is 2|X|. the size of the instance space is 96. 2011 60 . – So there are 296 (ouch) possible concepts that can be learnt from our instance space. • How many possible concepts can be defined over this set of instances. we allowed for 973 possible concepts which is <<<<<< 296 (we had a very strong bias) .

i. • To do this we allow H’ to allow for any combination of disjunctions.?.?.?. conjunctions and negations.?> <cloudy. Our learner will learn how to classify exactly the instances presented as training examples and not generalise beyond them! Kristian Guillaumier.?. 2011 61 .e. the target concept “Sky=Sunny OR Sky=Cloudy” would be: <sunny.?. E.?.?.g. 𝐻 ′ = 𝒫(𝑋).Unbiased Learning • Let’s define a new hypothesis space 𝐻′ that can represent every subset of instances.?.?> • So we can use CE knowing that our target concept will definitely exist in the hypothesis space. But… • We create a new problem.

d5 are –ve examples. 2011 62 . • The S boundary will become a disjunction of the +ve examples (since it is the most specific possible hypothesis that covers the examples): S = {(d1 d2 d3)} • The G boundary will become a negation (rule out) of the negative training examples: G = {(d4 d5)} • So the only unambiguously classifiable instances are those that were provided as training examples. d3 are +ve examples and d4. d2. d5. d3. d4. d2. Kristian Guillaumier. suppose. I have 5 training examples d1.Unbiased Learning • To see why. And that d1.

2011 63 . – – – – Note that H is the power set of X.Unbiased Learning • What would happen if we use the partially learned concept and take a vote. Then there is some h in the version space that covers x. • Instances that were originally in the training data will be classified unambiguously (obviously). Kristian Guillaumier. But there is also a corresponding h’ in the version space that covers the same x except for its classification (h). x is some unobserved instance (not in training data). • Any other instance not in the training data will be classified as +ve by half of the hypothesis in the version space and as –ve by the other half of the hypothesis in the version space.

Kristian Guillaumier. CE worked because we biased it with the assumption that the target concept can be represented by a conjunction of attribute values).More on Bias • Straight from: A learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances. 2011 64 . • (in fact.

c = some target concept. Dc) • Where a ≻ b denotes that b is inductively inferred from a. • So the inductive inference step reads “given the training data Dc and the instance xi. xi is some instance we wish to classify. • The inductive inference step is: (Dc xi) ≻ L(xi. Has a set of training data Dc = {<x. L(xi. as inputs to L. c(x)>}. Dc) = the classification (+ve/-ve) that L assigns to xi after learning from training data Dc. 2011 65 .” Kristian Guillaumier.More on Bias • Consider: – – – – – L = a learning algorithm. we can inductively infer the classification of the instance.

2011 66 .More on Bias Sky Sunny Sunny Rainy Sunny AirTemp Warm Warm Cold Warm Humidity Normal High High High Wind Strong Strong Strong Strong Water Warm Warm Warm Cool Forecast Same Same Changes Changes IsGoodDay Yes Yes No Yes (Dc xi) ≻ L(xi. Dc) Kristian Guillaumier.

More on Bias • Because L is an inductive learning algorithm. Kristian Guillaumier. we can add a number of assumptions to our system so that the classification would follow deductively. we cannot prove that the result L(xi. 2011 67 . the classification of the example does not necessarily deductively follow from the training data (can be proven).e. • However. • The inductive bias of L is defined as these assumptions. Dc) is correct. I. in general.

More on Bias

• Let B = these assumptions (e.g. that the hypothesis space is made up only of conjunctions of attribute values), giving:

  (B ∧ Dc ∧ xi) ⊢ L(xi, Dc)

• Where the notation a ⊢ b denotes that b follows deductively from a (b is provable from a).
• Then the inductive bias of L is B.

Defn. of Inductive Bias

Consider a concept learning algorithm L for the set of instances X. Let c be an arbitrary concept over X and let Dc = {<x, c(x)>} be an arbitrary set of training examples of c. Let L(xi, Dc) denote the classification assigned to the instance xi by L after training on the data Dc. The inductive bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples Dc:

  (∀xi ∈ X)[(B ∧ Dc ∧ xi) ⊢ L(xi, Dc)]

The Inductive Bias of CE

• Let us specify what L(xi, Dc) means for CE (how classification works):
  – Given training data Dc, CE will compute the version space VS_{H,Dc}.
  – It will then classify a new instance xi by taking a vote amongst the hypotheses in this version space.
  – A classification will be output (+ve or -ve) if all the hypotheses in the version space unanimously agree. Otherwise no classification is output ("I can't tell from the training data").
• The inductive bias of CE is that the target concept c is contained in the hypothesis space, i.e. c ∈ H.
• Why?

The Inductive Bias of CE

• 1:
  – Notice that if we assume that c ∈ H, then it follows deductively (we can prove) that c ∈ VS_{H,Dc}.
• 2:
  – Recall that we defined the classification L(xi, Dc) to be a unanimous vote amongst all the hypotheses in VS_{H,Dc}.
  – Thus, if L outputs the classification L(xi, Dc), then so does every hypothesis h ∈ VS_{H,Dc}, including the hypothesis c ∈ VS_{H,Dc}.
  – Therefore c(xi) = L(xi, Dc).
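CE's voting scheme can be sketched on the training data from the earlier table (an illustrative sketch, not the slides' own code). Hypotheses are conjunctions of attribute values, one value or '?' per attribute; as simplifying assumptions, each attribute domain is taken to be just the values appearing in that table, and the always-negative ∅ hypothesis is omitted.

```python
from itertools import product

# Training data from the earlier table.
domains = [("Sunny", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong",), ("Warm", "Cool"), ("Same", "Changes")]
train = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),    True),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),    True),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Changes"), False),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Changes"), True),
]

def covers(h, x):
    """A conjunctive hypothesis covers x if every slot is '?' or matches."""
    return all(hv == "?" or hv == xv for hv, xv in zip(h, x))

# Enumerate every conjunctive hypothesis, then keep the ones consistent
# with all of the training examples: this is the version space VS_{H,Dc}.
hypotheses = product(*[vals + ("?",) for vals in domains])
version_space = [h for h in hypotheses
                 if all(covers(h, x) == label for x, label in train)]

def classify(xi):
    """Unanimous vote amongst the version-space hypotheses."""
    votes = [covers(h, xi) for h in version_space]
    if all(votes):
        return "+ve"
    if not any(votes):
        return "-ve"
    return None  # "I can't tell from the training data"

print(len(version_space))  # 6
print(classify(("Sunny", "Warm", "Normal", "Strong", "Cool", "Changes")))  # +ve
print(classify(("Sunny", "Cold", "Normal", "Strong", "Warm", "Same")))     # None
```

The first query is covered by every hypothesis in the version space, so the vote is unanimous; the second splits the vote, so CE outputs no classification, exactly as described above.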
