
3. NCC

1. Nearest Center Classifier
2. Bayes Classifier
3. Minimum Average Risk Bayes Classifier
4. K-Nearest Neighbour Rule
5. Decision Boundaries
6. (Feature Selection)

45

Nearest Center Classifier


• We begin with the simplest type, nearest centre classifier
(also known as minimum distance classifier).
• For each object class ωk, we find the centroid mk of the set of
samples {x(k)1, x(k)2, ..., x(k)N } belonging to class ωk.
• For any object feature vector, x, find the distances of x to each
of the class centroids (or prototypes).
• Then we classify an unknown object, x, as belonging to the
class whose centroid (mk ) is closest (i.e. having minimum
distance) to the unknown object’s feature vector.

Notation : bold-face small letters denote column vectors.

46
Nearest Center Classifier (NCC)
Example (2-D feature space with 3 classes):
ω1 : m1 = (x1(1), x2(1))T
ω2 : m2 = (x1(2), x2(2))T
ω3 : m3 = (x1(3), x2(3))T

(Figure: the three class centroids m1, m2, m3 and an unknown feature vector x in the x1-x2 plane; being closest to m3, x is classified as belonging to class ω3.)

From the figure, x is closest to m3
since ||x - m3||2 < ||x - m1||2
and ||x - m3||2 < ||x - m2||2.
Therefore, classify x as belonging to ω3.

||.||2 is the L2 norm or the Euclidean norm of a vector,
e.g. if v = [v1 v2 v3]T, then || v ||2 = √(v12 + v22 + v32).

47

• Each pattern class is represented by a prototype :


      mk = (1/Nk) Σ_{i=1}^{Nk} x_i(k),   k = 1, 2, ..., c   (c classes)

where mk is the mean vector for class ωk,
Nk is the number of pattern vectors from class ωk,
c is the total number of distinct classes.

Given an unknown pattern vector x, find the closest prototype by finding the minimum Euclidean distance
Di(x) = || x - mi ||,  or equivalently the minimum Di(x)2 = || x - mi ||2.

Note: from here on, || · || is taken to be the L2 norm (Euclidean norm).

48
where x → ωi if Di(x) < Dj(x) for all j = 1, 2, ..., c ; j ≠ i

or, x → ωi if Di2(x) < Dj2(x) for all j = 1, 2, ..., c ; j ≠ i


--------------------------------
How can we simplify the decision function of NCC ?

Since ||v||2 = vTv and aTb = bTa,

Dj(x)2 = || x - mj ||2 = (x - mj)T(x - mj) = xTx - 2xTmj + (mj)Tmj

Observe that i) xTx is common to all Dj(x)2 and can be discarded without
affecting which j minimises Dj(x)2;
ii) the minimum of a function is equivalent to the maximum of the negative of that function.

49

Thus to save computation time, finding the minimum


wrt j of Dj(x)2 is equivalent to finding the maximum
of dj(x) where
dj(x) = 2xTmj - (mj)Tmj

Hence the Nearest Center Classifier computes dj(x)
for j = 1, 2, ..., c (c classes). And for the i such that

di(x) > dj(x) for all j = 1, 2, ..., c and j ≠ i,

NCC assigns the sample with feature vector x to class ωi.
(A small numeric sketch follows.)

50
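As an illustration (not from the slides), here is a minimal Python/NumPy sketch with made-up centroids showing that picking the class with the minimum Euclidean distance Dj(x) is the same as picking the class with the maximum dj(x) = 2xTmj - (mj)Tmj:

```python
import numpy as np

# Hypothetical class centroids (one row per class) and a query feature vector.
M = np.array([[4.0, 1.0],    # m1
              [1.5, 0.5],    # m2
              [6.0, 2.5]])   # m3
x = np.array([2.0, 0.8])

# Nearest Center Classifier via squared Euclidean distances D_j(x)^2.
D2 = np.sum((M - x) ** 2, axis=1)

# Equivalent linear discriminants d_j(x) = 2 x^T m_j - m_j^T m_j.
d = 2 * (M @ x) - np.sum(M * M, axis=1)

assert np.argmin(D2) == np.argmax(d)           # identical decision
print("assign x to class", np.argmin(D2) + 1)  # classes numbered 1..c
```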
NCC Example : Classify Iris versicolor and Iris setosa by petal
length, x1, and petal width, x2.

51

Assume that the centroid for each class is found to be


versicolor ω1 : m1 = (4.3, 1.3)T
setosa ω2 : m2 = (1.5, 0.3)T

Let x = (x1, x2)T be the feature for an unknown sample.


Then
d1(x) = xTm1 - (1/2)(m1)Tm1 = 4.3 x1 + 1.3 x2 - 10.1
d2(x) = xTm2 - (1/2)(m2)Tm2 = 1.5 x1 + 0.3 x2 - 1.17

(Note: scaling dj(x) by 1/2 does not change which class attains the maximum.)

In matrix form:

⎡ 4.3  1.3  -10.1 ⎤ ⎡ x1 ⎤   ⎡ d1 ⎤
⎢                 ⎥ ⎢ x2 ⎥ = ⎢    ⎥
⎣ 1.5  0.3  -1.17 ⎦ ⎣ 1  ⎦   ⎣ d2 ⎦
52
♠ If the petal of an unknown iris has length 2 cm and
width 0.5 cm, i.e. x = (2, 0.5)T, which iris is it from ?

⎡ 4.3  1.3  -10.1 ⎤ ⎡ 2   ⎤   ⎡ -0.85 ⎤
⎢                 ⎥ ⎢ 0.5 ⎥ = ⎢       ⎥
⎣ 1.5  0.3  -1.17 ⎦ ⎣ 1   ⎦   ⎣  1.98 ⎦

Therefore, it is Iris setosa.

♠ If a petal of an unknown iris has length 5 cm and
width 1.5 cm, i.e. x = (5, 1.5)T, which iris is it from ?

⎡ 4.3  1.3  -10.1 ⎤ ⎡ 5   ⎤   ⎡ 13.35 ⎤
⎢                 ⎥ ⎢ 1.5 ⎥ = ⎢       ⎥
⎣ 1.5  0.3  -1.17 ⎦ ⎣ 1   ⎦   ⎣  6.78 ⎦

Therefore, it is Iris versicolor.

53
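A minimal sketch (Python/NumPy, not part of the slides) that reproduces the matrix-form computation above for the two test petals:

```python
import numpy as np

# Rows hold [coeff of x1, coeff of x2, constant] of d1 and d2,
# i.e. d_k(x) = m_k^T x - 0.5 m_k^T m_k with m1 = (4.3, 1.3)^T, m2 = (1.5, 0.3)^T.
W = np.array([[4.3, 1.3, -10.1],
              [1.5, 0.3, -1.17]])

for petal in ([2.0, 0.5], [5.0, 1.5]):
    d = W @ np.append(petal, 1.0)   # augmented vector (x1, x2, 1)
    label = "Iris versicolor (w1)" if d[0] > d[1] else "Iris setosa (w2)"
    print(petal, d.round(2), "->", label)
# Expected: [-0.85  1.98] -> setosa ;  [13.35  6.78] -> versicolor
```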

The main advantage of NCC is that we only need to store the class
centres. The computation time for classification is also very
fast*, since we only need to calculate the distance of an unknown
vector from each of the class centres and find the nearest one.
NCC worked well in the previous example. Why ?
In fact, MDC (minimum distance classification) performs quite poorly in real applications.
Under what condition will NCC work well or not work well ?

*For a training set of N samples used to design a classifier that distinguishes
between c classes, the distance of each unknown vector needs to be
checked only against each of the c class centres, i.e. only c distance
computations and not N. Usually, N >> c.
In certain applications, N could be in the thousands while c may be 10. We need
N distance calculations for the Nearest Neighbour Classifier (to be studied later).

54
Linear Decision Boundary

Two classes are said to be linearly separable if one can


find a line (hyperplane) that separates them.

Note that the decision boundary of the nearest center
classifier is the perpendicular bisector of the line segment joining
the two class prototypes.

At the decision boundary (surface), d1(x) = d2(x)

By defining d12(x) = d1(x) - d2(x),


then d12(x) = 0 yields the decision boundary for the
minimum distance classifier.

55

Hence for the above problem :


At the decision boundary (surface), d1(x) = d2(x)
∴ d12(x) = d1(x) − d2(x) = 2.8 x1 + 1.0 x2 − 8.9 = 0
(See graph on page 33)

for Iris classification,


x → ω1 if d12(x) > 0 (why ?)
x → ω2 if d12(x) < 0 (why ?)

Examples :
x = (2, 0.5)T; d12((2, 0.5)T) = − 2.8 → ω2 : Iris setosa
x = (5, 1.5)T; d12((5, 1.5)T) = 6.6 → ω1 : Iris versicolor

56
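A small sketch (Python/NumPy, using the centroids from the Iris example) checking that d12(x) = 0 is the perpendicular bisector of the segment joining m1 and m2:

```python
import numpy as np

m1, m2 = np.array([4.3, 1.3]), np.array([1.5, 0.3])

def d12(x):
    # d12(x) = d1(x) - d2(x) with d_k(x) = m_k^T x - 0.5 ||m_k||^2
    return (m1 - m2) @ x - 0.5 * (m1 @ m1 - m2 @ m2)

midpoint = (m1 + m2) / 2
print(d12(midpoint))               # ~0: the midpoint lies on the boundary
print(d12(np.array([2.0, 0.5])))   # ~ -2.8 -> w2 (Iris setosa)
print(d12(np.array([5.0, 1.5])))   # ~ +6.6 -> w1 (Iris versicolor)
```

The normal of this boundary is m1 - m2, which is why the boundary is perpendicular to the segment joining the two centroids.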
The k-Nearest Neighbour Rule (k-NNR)

• The k-NNR is a very simple rule which works quite


satisfactorily.
• Method : All samples of all classes are used. Given a
measurement x, find the k nearest samples to x. From these
k samples, check how many of them belong to each class.
The measurement x will be assigned to the class with the
most samples in the k nearest samples.
• In particular, for the case when k = 1, we call this the
nearest neighbour rule (NNR) or nearest neighbour
classifier.

Do not mix up nearest neighbour rule and nearest center rule.


107
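A minimal k-NN sketch (Python/NumPy; Euclidean distance and a simple majority vote, with made-up toy data):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=3):
    """Assign x to the class holding the majority among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every sample
    nearest = np.argsort(dists)[:k]               # indices of the k nearest samples
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy two-class data in a 2-D feature space (made-up values).
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
y_train = np.array([1, 1, 2, 2])

print(knn_classify(X_train, y_train, np.array([0.9, 0.8]), k=3))  # -> 2
print(knn_classify(X_train, y_train, np.array([0.1, 0.1]), k=1))  # -> 1 (the NNR case)
```

Note that every training sample's distance is computed for each query, in contrast to the c distances needed by the Nearest Center Classifier.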

(Figure: a 2-D x1-x2 feature space containing samples from Class 1, Class 2 and Class 3; the unknown measurement x has nearest samples A, B and C.)

In the above three-class distribution in the 2-D x1-x2 feature
space, x is an unknown measurement to be classified
into one of the 3 classes.
Then the 3-NNR yields {A, B, C} as the three samples nearest
to x. Since A and B belong to class 1 → x ∈ Class 1.

For NNR, it is nearest to sample C. ∴ x ∈ Class 2.


108
Principle of the k-NNR Classifier
As k gets larger, the k nearest samples start to reflect the (a posteriori)
class probabilities in the local region of the unknown measurement.
Thus k-NNR tends towards a local MAP classifier as k gets larger.

Thus k-NNR is useful when there are many training


samples available (dense) and k is set to a large value.
But if the samples are sparse, there may not be k samples in a small local
region. To obtain k samples, a much larger region may be needed, which
makes the probability estimates inaccurate. So one may not be able to set
k too large.

By setting k = 1, NNR acts like a classifier that learns from instances;


i.e. an instance-based learner. [Rote learner]

109

The k-NNR, although simple, requires a lot of computation.


For each unknown sample, we have to calculate its distance to
every training sample and then find the minimum. Its computational
complexity is O(n), where n is the total number of training
samples of all classes. (Recall that the Nearest Center Classifier only needs
c distance calculations, where c is the number of classes.)
The underlying principle of NNR is that samples that are “closest” in the
feature space are most likely to belong to the same class.

The NNR is sub-optimal when compared with the Bayes Classifier. It can
be proved that as n → ∞,
    error rate (Bayes)  ≤  error rate (NNR)  ≤  2 × error rate (Bayes)
Thus the performance of NNR is quite good. Its drawback is its
high computational complexity.
110
(Figure: a two-class sample distribution in the x1-x2 feature space.)

Q: Which classification method would you choose,


giving reasons, for the above problem -- minimum distance
classification or k-NNR ?
Q: Can you think of a better way to classify these two classes,
assuming the above sample distribution is representative of the true
sample distribution ?

111

Topics to be covered

1. Minimum Distance Classifier (nearest center) ✓
2. Bayes Classifier ✓
3. Minimum Average Risk Classifier ✓
4. K-Nearest Neighbour Rule ✓
5. Decision Boundaries
6. (Feature Selection) ✓

112
Support Vector Machines

Recognition
A. Template matching
B. Learning-from-Examples approaches
   1. Nearest Center Classifier
   2. Bayes Classifier (MAP, ML)
   3. Minimum Average Risk Bayes Classifier
   4. K-Nearest Neighbour Rule
   5. Optimal Decision Boundaries
C. Feature Selection

153

Optimal Decision Boundaries

It is desirable that two object classes can be separated by a curve (or a hypersurface
in higher dimensions). When this is possible, the classifier has learnt in general.
(But note that most classification problems are not separable.)

If we can find a line (hyperplane) to separate the two classes, then
they are said to be linearly separable.
Recall that the nearest center classifier generates a linear decision boundary.

(Figure: an example of two classes that are linearly separable, with a linear decision boundary in the x1-x2 plane.)

154
(Figure: an example of a non-linear separation boundary between Class 1 and Class 2 in the x1-x2 plane.)

For some other cases, the two classes are separable but not
linearly separable (e.g. in the previous example). We say that the
two classes are non-linearly separable.
For example, the 0-1 loss function Minimum Average Risk Bayes classifier,
with Gaussian distributions of the feature space, yields quadratic decision
boundaries in general.

155

Suppose two classes are linearly separable in the feature space. Is
there an "optimal" decision line (hyperplane) that separates the
two classes for a given set of training samples of each class ?
[Yes and, moreover, the optimal solution is unique.]

Support Vector Machine

"The support vector machine is a
linear machine that constructs a hyperplane as the decision
surface in such a way that the margin of separation between one
class examples and the other class examples is maximised."
- [Neural Networks, by Simon Haykin]

156
Example : Given the two Classes 1 and 2 with training
samples as shown.
Observe that there are infinitely many possible lines that can separate
(or dichotomise) the two classes.
Where is the "optimal" decision line (hyperplane) ?

(Figure: training samples of Class 1 and Class 2 in the x1-x2 plane.)

157

Recall : Design a classifier that generalizes well.

Generalization capability : to learn from a finite set of
training data and be able to classify unseen data.

(Figure: Class 1 and Class 2 samples in the x1-x2 plane.)

Among all the infinitely many linear separators, there is one which
has the best generalization capability --- the SVM.
158
Vapnik et al. proposed that the optimal separating hyperplane
is the one that yields the maximum margin between the two
classes.

(Figure: the optimal line (hyperplane) separating Class 1 and Class 2 in the
x1-x2 plane; U is the maximum margin of separation of each class from the
optimal hyperplane, and the samples A, B and C lie on the margins.)

By observation, the maximum separation is U and the optimal
hyperplane is as shown. A, B and C are called support vectors.
159

"Support vectors are therefore the most important samples, as
they are the sample points that decide the optimal hyperplane."

For further information on how to calculate this optimal
hyperplane, refer to "Neural Networks" by Simon Haykin.
[Also note that one can relax the condition so that the two
classes need not be separable.]

To handle non-linearly separable situations, a support vector
machine should have a preprocessing kernel transformation
to convert the initial feature space into an abstract feature space
so that, in the latter, the two classes are linearly separable. This
transformation may not easily be found.

160
But some typical kernel transformations are of the form:
a) ( xTxi + 1 )p                 --- polynomial learning machine
b) exp( -|| x - xi ||2 / 2σ2 )   --- radial-basis function network
c) tanh( β0 xTxi + β1 )          --- two-layer perceptron
(A short code sketch of these kernels is given after this slide.)

Cover's Theorem on the Separability of Patterns (1965)
"A complex pattern classification problem cast in a high-
dimensional space non-linearly is more likely to be
linearly separable than in a low-dimensional space."

Strategy: According to Cover, if the class distributions in the
original feature space are non-separable, transform them into a high-
dimensional kernel space in which there is a high probability that the
class distributions are linearly separable. Then, since a linear SVM gives
the best generalisation for linearly separable distributions, we can
apply the SVM in the kernel space.
161
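As a sketch (Python/NumPy; the parameter values p, σ, β0, β1 are arbitrary choices for illustration), the three kernels listed above could be written as:

```python
import numpy as np

def poly_kernel(x, xi, p=2):
    return (x @ xi + 1.0) ** p                                  # (x^T xi + 1)^p

def rbf_kernel(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))  # exp(-||x - xi||^2 / 2 sigma^2)

def tanh_kernel(x, xi, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (x @ xi) + beta1)                    # tanh(beta0 x^T xi + beta1)

x, xi = np.array([1.0, 0.0]), np.array([0.5, 0.5])
print(poly_kernel(x, xi), rbf_kernel(x, xi), tanh_kernel(x, xi))
```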

We now look at an example to show how a non-linearly
separable case can be converted into a linearly separable case
by a kernel transformation.

Example 1 [The classic XOR Problem]

Let x1 and x2 be the features of the feature vector [x1, x2]T. For
class 1, (0,0)T and (1,1)T are the samples, while (1,0)T and
(0,1)T belong to class 2.

(Figure: the XOR example in the x1-x2 plane, showing the Class 1 and Class 2 samples.)

The two classes are obviously not linearly separable;
they are non-linearly separable though.

162
Take the transformations defined by:
φ1(x) = exp( -|| x - t1 ||2 ) ;  φ2(x) = exp( -|| x - t2 ||2 )
where t1 = (1,1)T and t2 = (0,0)T are the centers of the two
kernel functions φ1 and φ2. For each sample x, find its φ1(x) and
φ2(x).

Input pattern x     kernel feature φ1(x)    kernel feature φ2(x)    Class
p = (0,0)T          0.1353                  1                       Class 1
q = (1,1)T          1                       0.1353                  Class 1
r = (0,1)T          0.3678                  0.3678                  Class 2
s = (1,0)T          0.3678                  0.3678                  Class 2

(A short sketch reproducing these values is given after this slide.)
163
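A small sketch (Python/NumPy) reproducing the kernel feature values tabulated above (the slides truncate e^-1 = 0.3679 to 0.3678):

```python
import numpy as np

t1, t2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])   # kernel centers
phi = lambda x, t: np.exp(-np.sum((x - t) ** 2))       # Gaussian kernel feature

samples = {"p": (0, 0), "q": (1, 1), "r": (0, 1), "s": (1, 0)}
for name, x in samples.items():
    x = np.asarray(x, dtype=float)
    print(name, round(phi(x, t1), 4), round(phi(x, t2), 4))
# p 0.1353 1.0 ; q 1.0 0.1353 ; r 0.3679 0.3679 ; s 0.3679 0.3679
```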

Constructing the new feature space yields:

(Figure: the φ1-φ2 kernel feature space. The Class 1 samples p and q lie on the
supporting hyperplane h(x) = 1, the Class 2 samples r and s lie on the supporting
hyperplane h(x) = -1, and the SVM solution is the optimal hyperplane h(x) = 0.)

In the new kernel space, the two classes are now linearly separable !

164
Given the support vectors, construct the optimal hyperplane

If we know a sufficient number of support vectors, then to
obtain the decision boundary, let the equation of the optimal
hyperplane be :

h(x) = k1φ1(x) + k2φ2(x) + b = 0    [optimal hyperplane]

If we let the maximum margin be normalised to unity, then
h(x) = 1 is the supporting hyperplane for class 1 and
h(x) = -1 is the supporting hyperplane for class 2.

Then solve for k1, k2 and b from the table of data. Since there are
3 unknowns, we need to have three support vectors.
165

We have normalised the margins (i.e. let them be ±1).

The optimal separating hyperplane h(x) = 0 is
h(x) = k1φ1(x) + k2φ2(x) + b = 0       ---(a)
The two supporting hyperplanes are
k1φ1(x) + k2φ2(x) + b = 1              ---(b)
k1φ1(x) + k2φ2(x) + b = -1             ---(c)

Points that lie on the optimal hyperplane satisfy Eq(a).
Support vectors satisfy either Eq(b) or Eq(c).
Points that don't satisfy (b) or (c) are not support vectors.

166
So, for the problem, we obtain three equations
using the three points which are support vectors:

for point p,  0.1353 k1 + k2 + b = 1            ---(i)
for point q,  k1 + 0.1353 k2 + b = 1            ---(ii)
for point r,  0.3678 k1 + 0.3678 k2 + b = -1    ---(iii)

You may use the elimination method to solve
for the unknowns k1, k2 and b.
Or, rearranging into matrix form, we obtain ...

167

⎡ 0.1353  1       1 ⎤ ⎡ k1 ⎤   ⎡  1 ⎤
⎢ 1       0.1353  1 ⎥ ⎢ k2 ⎥ = ⎢  1 ⎥
⎣ 0.3678  0.3678  1 ⎦ ⎣ b  ⎦   ⎣ -1 ⎦

Solving,  k1 = 5.0038 ,  k2 = 5.0038 ,  b = -4.6808

Thus the optimal decision boundary is
h(x) = 5.0038 φ1(x) + 5.0038 φ2(x) - 4.6808 = 0

For a given x, if h(x) is positive, then we classify x as Class 1.
168
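A quick numeric check (Python/NumPy) of the solution above: solve the 3×3 system from the three support vectors and evaluate h on the kernel features of the four XOR samples:

```python
import numpy as np

A = np.array([[0.1353, 1.0,    1.0],    # point p (class 1, h = +1)
              [1.0,    0.1353, 1.0],    # point q (class 1, h = +1)
              [0.3678, 0.3678, 1.0]])   # point r (class 2, h = -1)
rhs = np.array([1.0, 1.0, -1.0])

k1, k2, b = np.linalg.solve(A, rhs)
print(k1, k2, b)                        # approx 5.0038, 5.0038, -4.6808

h = lambda phi1, phi2: k1 * phi1 + k2 * phi2 + b
for name, (p1, p2) in {"p": (0.1353, 1.0), "q": (1.0, 0.1353),
                       "r": (0.3678, 0.3678), "s": (0.3678, 0.3678)}.items():
    print(name, round(h(p1, p2), 3))    # p, q -> +1 ; r, s -> -1
```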
Example : Suppose the optimal separating hyperplane is

h(x) = 5.0038 x1 + 5.0038 x2 - 4.6808 = 0

Which class does the feature vector x = [1 2]T belong to ?

h(x = [1 2]T) = 5.0038 × 1 + 5.0038 × 2 - 4.6808
              = 10.33  (greater than +1)
Therefore, referring to the previous problem, x = [1 2]T
comes from Class 1.

Q : Show that x = [1 -2]T comes from Class 2.

169

The previous problem was to find the optimal separating
hyperplane, given the support vectors.

In general the SVM is not so easily solved as depicted in the above example.
In practice, we don't know which of the input data are support vectors. The
usual approach is to find the optimal hyperplane first. This is
computationally very expensive and involves a quadratic optimization
problem with constraints. There is available software such as SVMlight,
SMO, Core Vector Machine, etc. For very large data sets, you can
resort to the Core Vector Machine or Sequential Bootstrapped SVM.

Once the optimal separating hyperplane has been found,
finding the support vectors of both classes is simple.
How is this done ?

170
Hyperplanes

aX + bY + cZ = d is a plane equation in 3-D space.

The column vector N = [a b c]T gives the normal direction of the plane.
The unit normal, n, is thus
n = N / || N ||
where || N || = √( a2 + b2 + c2 ) is the Euclidean (L2) norm of N.

The distance p of the plane from the origin is p = d / || N ||.

a1x1 + a2x2 + ... + anxn = d is an (n-1)-dim hyperplane in an
n-dim space.
171

Consider first ax + by + cz = 0.

Define N = [a b c]T and x = [X Y Z]T.
Then n = N / || N || is the unit normal.

Hence, the equation of the plane can be put into a vector form as:
NTx = 0,  i.e.  || N || nTx = 0,  or  nTx = 0.

This shows that any x that lies on the plane is orthogonal (90°) to n.

Note that x = 0 satisfies the plane equation; this means that
ax + by + cz = 0 is a plane that passes through the origin.
So p = 0 for this case.
172
Recall : aTb = || a || || b || cos θ, where θ is the angle between the two
vectors.
Rearranging, we have
aTb / || b || = || a || cos θ

(Figure: vector a projected onto the direction of b; the projected length is A = || a || cos θ.)

The component of any vector a in the direction of b is thus given by
aTb / || b || = || a || cos θ.   Note that b / || b || is a unit vector in the direction of b.
173

Therefore, for ax + by + cz = d,

in vector form:  NTx = d,  or  || N || nTx = d,

so  nTx = d / || N ||,

which (from the previous property) means that d / || N ||
is the distance p of the plane from the origin:

p = d / || N ||
174
What is the distance of any vector z = [z1 z2 z3]T from the hyperplane
ax1 + bx2 + cx3 = d ?

First find the unit normal n to the hyperplane:

n = [a b c]T / √( a2 + b2 + c2 )

Then find the component pz of z in the direction of n:

pz = nTz = [a b c] [z1 z2 z3]T / √( a2 + b2 + c2 )
175

The distance p of the plane from the origin, which is also in the
direction of n, is as before equal to

p = d / √( a2 + b2 + c2 )

Then the distance of z from the plane is

pz - p = [a b c] [z1 z2 z3]T / √( a2 + b2 + c2 )  -  d / √( a2 + b2 + c2 )

176
Problem Statement : A hyperplane h(x) = 0 is given by
h(x) = a1x1 + a2x2 + a3x3 + b = 0
where x ∈ R3 and x = [x1 x2 x3]T.
Let a = [a1 a2 a3]T; then the unit normal n to the
hyperplane is
n = a / || a ||.
Then the perpendicular distance q of a vector v ∈ R3 from
the hyperplane is given by
q = nTv + b / || a ||
(A short code sketch of this formula is given after this slide.)
177
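A short sketch (Python/NumPy; the plane and test points are illustrative choices, not from the slides) of the perpendicular-distance formula stated above; writing it as (aTv + b)/||a|| gives the same value as nTv + b/||a||:

```python
import numpy as np

def signed_distance(a, b, v):
    """Signed perpendicular distance of point v from the hyperplane a^T x + b = 0."""
    n = a / np.linalg.norm(a)                 # unit normal
    return n @ v + b / np.linalg.norm(a)      # = (a^T v + b) / ||a||

# Illustrative plane x1 + x2 + x3 - 3 = 0, i.e. a = [1, 1, 1], b = -3.
a, b = np.array([1.0, 1.0, 1.0]), -3.0
print(signed_distance(a, b, np.array([1.0, 1.0, 1.0])))   # 0.0 (lies on the plane)
print(signed_distance(a, b, np.array([2.0, 2.0, 2.0])))   # sqrt(3) ~ 1.732
```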

Proof :
(Figure: a hyperplane h(x) = 0 with unit normal n, a vector v, its projection pv onto n, and the distance p of the hyperplane from the origin.)

The component pv of v in the direction n is given by
pv = nTv

The (directional) distance p of the hyperplane from the origin is given by
p = - b / || a ||
and is along the direction of a (or n).

So the distance q of v from the hyperplane is
q = pv - p = nTv + b / || a ||

Directional distance : p = -2 means the plane is a distance of 2 units from the
origin but in the opposite direction of n.
178
Non-linearly separable distributions --- apply a kernel transformation

Example : The distributions of the two classes in the x1-x2 feature
space are non-linearly separable, as shown below. Suggest, by way
of kernel transformations, how we can make the two classes
linearly separable from one another.

(Figure: two non-linearly separable class distributions in the x1-x2 plane, with an unknown point x.)

=> Need to introduce more than two kernel functions.
179

Non-separable distributions

(Figure: overlapping distributions of Class A and Class B in the x1-x2 plane.)

What can be done with this overlapping case ?

=> Interested to know more ? Study Machine Learning.

180
Bayes
Classifier

59

Probability: Some Basic Definitions and Axioms

P(A∧B) reads as "the prob of events A AND B happening".

P(A | B) reads as "prob of event A happening given that event B has happened".

P(A∨ B) reads as "prob. of A OR B happening".

60
If A and B are independent, then
P(A ∧ B) = P(A) P(B)    (since P(A|B) = P(A))

If A and B are mutually exclusive, then
P(A ∧ B) = 0  (and thus P(A|B) = 0 and P(B|A) = 0), and
P(A ∨ B) = P(A) + P(B)

61

Example : Check first the two conditions –
i) A1 ∪ A2 = U (universe)
ii) A1 ∩ A2 = φ (disjoint)

(Venn diagram: the universe U partitioned into A1 and A2, with event B split into B∧A1 and B∧A2.)

Then we can apply the total probability formula :

P(B) = P(B ∧ A1) + P(B ∧ A2)
     = P(B|A1)P(A1) + P(B|A2)P(A2)

62
Bayes Theorem

Since P(A ∧ B) = P(A |B) P(B)


and also P(A ∧ B) = P(B|A) P(A)
Then
P(A|B) P(B) = P(B|A) P(A)
This is known as Bayes Theorem. It is widely used
in various decision making processes.
Why is Bayes Theorem important ?

Note : P(A ∧ B) is the probability that event A and B happen

63

Posterior and Prior Probabilities


Q1 : If I have an envelope addressed to one of the students in a certain
class, do you guess it is for a male or a female student ?
What did you base your answer on ?

Suppose there are 20 male students and 8 female students in
the classroom. Two male students wear glasses while half
of the female students do so.

Q2 : If I have an envelope addressed to one of the students, do you decide that


it is for a male or a female student ? What is your answer based on ?

Q3 : If it is for a student wearing glasses, would you guess it is for a


male or female student ? What do you base your reasoning on ?

64
For Q2, we should base our choice/decision on class probabilities
since that is the only information available.
For Q3, we use conditional probability. Compute P(Male|glasses)
and P(Female|glasses) and choose the one with the higher
probability. We call these probabilities a posteriori probabilities.
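As a tiny numeric sketch (Python) of Q3, using the classroom numbers given above (20 males of whom 2 wear glasses, 8 females of whom 4 wear glasses):

```python
p_male, p_female = 20 / 28, 8 / 28        # class ("prior") probabilities
p_g_given_m, p_g_given_f = 2 / 20, 4 / 8  # likelihoods of wearing glasses

# Unnormalised posteriors: P(class | glasses) is proportional to P(glasses | class) P(class)
post_m = p_g_given_m * p_male             # = 2/28
post_f = p_g_given_f * p_female           # = 4/28
print("guess:", "male" if post_m > post_f else "female")   # -> female
```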

Q4 : If it was for a student who is 1.2 m tall, would you guess a


male or female student now ?

In the case of Q4, we cannot make a qualified choice if we
do not know who in the class is 1.2 m tall,
and there could be quite a few. (Imagine that
the class consists of all the students of NTU.)
What can be done?

65

One possibility is first to do an experiment on the height
distributions of male and female NTU students. We can take,
say, 100 random samples of each gender class and plot their
distributions. This experiment could be done prior to the 1.2 m
measurement being given.

We thus obtain the p(height|male) and p(height|female)
probability density distributions. These prior distributions
are known as likelihood probability density functions
(or simply likelihood functions).
66
(Figure: likelihood probability densities for Female and Male versus height in metres, with the measurement 1.2 m marked.)
Note that P(M|height) and P(F|height), which are a posteriori
probabilities, are not known. But by Bayes theorem, they can be found
indirectly using the prior distributions. Can you now see how ?

Using the above "prior" experiments, we can now get an idea of how likely
it will be a male (or female) student with a height of 1.2 m.
So this is when the Bayes Theorem is useful.

You might guess that prior (before) and posteriori (after) are with respect to the (height) measurement.
67

Bayes Classifier
Given a classification task of c classes, ω1, ω2, … , ωc, and an
unknown pattern, x, we form the c conditional probabilities

P(ωk | x),   k = 1, 2, …, c

[Read P(ω|x) = 0.3 as "given the feature vector x, the probability that its class is ω is 0.3".]

These are known as a posteriori probabilities. If x is classified as
belonging to the "most probable" class, i.e. the class with the maximum a posteriori
probability (MAP), we call this the MAP classifier.

This may be represented as   max_k P(ωk | x),   k = 1, 2, …, c

or, classify x as class ωi if
P(ωi | x) > P(ωj | x)    ∀ j = 1, 2, …, c ; j ≠ i

68
Bayes Classifier for the Two-Class Problem
Specifically for a 2-class case, the MAP decision is :
if P(ω1 | x) > P(ω2 | x), x is classified as ω1
if P(ω1 | x) < P(ω2 | x), x is classified as ω2

This may be represented as

P(ω1 | x)  ≷  P(ω2 | x)      [MAP]   ( >  ⇒ ω1 ,  <  ⇒ ω2 )

69

In practice, it is usually not possible to know the a posteriori
probability, P(ωk | x), so we resort to the Bayes Rule:

P(ωk | x) = p(x | ωk) P(ωk) / p(x)

P(ωk) is the class probability. Sometimes it is possible
to estimate it from the available training samples. In the
case of the probability of occurrence of the alphabet "t" in a
book, one could actually scan through part or the whole of the
book to find this.

{ Recall P(A, B) = P(A | B) P(B) = P(B | A) P(A) }

70
P(ωk | x) = p(x | ωk) P(ωk) / p(x)

In most cases, it is not easy to find P(ωk). For example,
what is the probability of occurrence of apples relative to oranges
in an orchard ? In that case, we often assume the classes are all
of equal probability*. This corresponds to a worst-case
analysis, with the underlying principle of maximum
uncertainty for maximum entropy: the entropy of the classes
is maximum when they have equal probability of occurrence.

* The Dempster–Shafer theory has something more to say about
this. Check it out if interested to know more.
71

P(x | ωi) is known as a likelihood function. We can obtain this
prior to the measurement (of x) by experiments or otherwise.
For example, if xA is the average hair length of student A, then
for a two-class problem, we can determine p(xA | male class) and
p(xA | female class) experimentally. We can randomly pick, say,
100 males and 100 females and obtain histograms of their average
hair lengths.

For the pdf of a test vector, p(x), this is again usually not easy
to determine. (What is the probability of a student having 0.3 m
long hair and brown eyes ?) But for the classification
problem, it can be dropped, since it does not affect the
classification result: p(x) is a common term for all classes.

72
For a 2-class problem, the MAP decision rule becomes :

P(ω1 | x)  ≷  P(ω2 | x)                          [Bayes]

p(x | ω1) P(ω1) / p(x)  ≷  p(x | ω2) P(ω2) / p(x)

Discarding the common p(x):

p(x | ω1) P(ω1)  ≷  p(x | ω2) P(ω2)      ( >  ⇒ ω1 ,  <  ⇒ ω2 )

73

MAP for the multi-class problem

One can easily extend Bayes classification to the multi-class recognition task.

The MAP decision rule becomes

max_k P(ωk | x) = max_k p(x | ωk) P(ωk) / p(x)
                = max_k p(x | ωk) P(ωk)

Or, classify x as belonging to class ωi if

p(x | ωi) P(ωi) > p(x | ωj) P(ωj)    ∀ j = 1, …, c and j ≠ i

74
Now if we further assume equal class probabilities,
the multi-class case becomes

max_k p(x | ωk) P(ωk) = max_k p(x | ωk)      [ML]

This is referred to as the maximum likelihood (ML)
classifier (since we are maximising over the likelihood
functions/distributions only).

Thus MAP becomes ML when the class
probabilities are equal.

75
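A minimal sketch (Python/NumPy; the 1-D Gaussian likelihoods and all numbers are made-up assumptions, not from the slides) showing how MAP uses p(x|ωk)P(ωk) and reduces to ML when the class probabilities are equal:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Two classes with assumed likelihood densities p(x | w_k) and class probabilities P(w_k).
mus, sigmas = np.array([1.2, 1.7]), np.array([0.1, 0.1])   # e.g. heights of two groups
priors = np.array([0.3, 0.7])

x = 1.44
likelihoods = gaussian_pdf(x, mus, sigmas)

map_class = np.argmax(likelihoods * priors) + 1   # MAP: maximise p(x|w_k) P(w_k)
ml_class = np.argmax(likelihoods) + 1             # ML: equal priors assumed
print("MAP:", map_class, "ML:", ml_class)         # here MAP -> 2 but ML -> 1,
                                                  # because the unequal priors tip the decision
```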

Topics Covered

1. Minimum Distance Classifier (nearest center) ✓
2. Bayes Classifier ✓
3. Minimum Average Risk Bayes Classifier
4. K-Nearest Neighbour Rule
5. Decision Boundaries
6. (Feature Selection)

76
