Nearest Center Classifier (NCC)
Example: x is classified as belonging to class ω3, being closest to m3.

The class means are:
ω1 : m1 = (x1(1), x2(1))T
ω2 : m2 = (x1(2), x2(2))T
ω3 : m3 = (x1(3), x2(3))T

[Figure: a 2-D feature space with 3 classes, showing the class means m1, m2, m3 and the unknown pattern x.]

From the figure, x is closest to m3, since ||x − m3||² < ||x − m1||² and ||x − m3||² < ||x − m2||².
Therefore, classify x as belonging to ω3.
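As a minimal sketch of this rule (not part of the original notes; the function name and the example means are illustrative), the nearest-centre decision can be implemented directly from the distances to the class means:

```python
import numpy as np

def ncc_classify(x, means):
    """Return the index of the class whose mean is nearest to x (Euclidean distance)."""
    x = np.asarray(x, dtype=float)
    dists = [np.linalg.norm(x - np.asarray(m, dtype=float)) for m in means]
    return int(np.argmin(dists))

# Illustrative class means m1, m2, m3 in a 2-D feature space
means = [np.array([0.0, 0.0]),   # m1 (class 1)
         np.array([4.0, 0.0]),   # m2 (class 2)
         np.array([2.0, 3.0])]   # m3 (class 3)

x = np.array([2.2, 2.5])
print("x is assigned to class %d" % (ncc_classify(x, means) + 1))   # nearest mean is m3
```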
The decision rule is: x → ωi if Di(x) < Dj(x) for all j = 1, 2, ..., c, j ≠ i, where Dj(x) = ||x − mj|| is the distance from x to the class mean mj.

Observe that
i) xTx is common to every Dj(x)² and can be discarded without affecting which j minimises Dj(x)²;
ii) the minimum of a function is equivalent to the maximum of the negative of that function.
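Putting these two observations together (a worked step added for clarity, using the same notation): expanding the squared distance gives

D_j(x)² = ||x − m_j||² = xTx − 2 m_jTx + m_jTm_j ,

so, after discarding the common term xTx and negating, minimising D_j(x)² over j is equivalent to maximising the linear discriminant

d_j(x) = m_jTx − ½ m_jTm_j .

This is the form of the discriminants d1 and d2 used in the iris example below.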
NCC Example : Classify Iris versicolor and Iris setosa by petal
length, x1, and petal width, x2.
d1(x) = xTm1 − ½ m1Tm1 = 4.3 x1 + 1.3 x2 − 10.1
d2(x) = xTm2 − ½ m2Tm2 = 1.5 x1 + 0.3 x2 − 1.17

In matrix form:

$$\begin{pmatrix} 4.3 & 1.3 & -10.1 \\ 1.5 & 0.3 & -1.17 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ 1 \end{pmatrix} = \begin{pmatrix} d_1 \\ d_2 \end{pmatrix}$$
♠ If the petal of an unknown iris has length 2 cm and
width 0.5cm, i.e. x = (2, 0.5)T, which iris is it from ?
$$\begin{pmatrix} 4.3 & 1.3 & -10.1 \\ 1.5 & 0.3 & -1.17 \end{pmatrix} \begin{pmatrix} 2 \\ 0.5 \\ 1 \end{pmatrix} = \begin{pmatrix} -0.85 \\ 1.98 \end{pmatrix}$$

Since d2 > d1, it is Iris setosa.
♠ If a petal of an unknown iris has length 5 cm and
width 1.5cm, i.e. x = (5, 1.5)T, which iris is it from ?
$$\begin{pmatrix} 4.3 & 1.3 & -10.1 \\ 1.5 & 0.3 & -1.17 \end{pmatrix} \begin{pmatrix} 5 \\ 1.5 \\ 1 \end{pmatrix} = \begin{pmatrix} 13.35 \\ 6.78 \end{pmatrix}$$

Since d1 > d2, it is Iris versicolor.
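The two classifications above can be checked numerically with a short sketch (NumPy; not part of the original notes):

```python
import numpy as np

# Discriminant matrix from the example: each row is [m_j, -0.5*||m_j||^2]
W = np.array([[4.3, 1.3, -10.1],    # d1: Iris versicolor
              [1.5, 0.3, -1.17]])   # d2: Iris setosa

def classify(petal_length, petal_width):
    """Evaluate d1 and d2 on the augmented vector (x1, x2, 1) and pick the larger."""
    d = W @ np.array([petal_length, petal_width, 1.0])
    return ("Iris versicolor" if d[0] > d[1] else "Iris setosa"), d

print(classify(2, 0.5))   # ('Iris setosa',     [-0.85, 1.98])
print(classify(5, 1.5))   # ('Iris versicolor', [13.35, 6.78])
```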
The main advantage of NCC is that we only need to store the class centres. Classification is also very fast, since we only need to compute the distance of an unknown vector from each of the class centres and find the nearest one.

NCC worked well in the previous example. Why?
In fact, the minimum-distance classifier (MDC, i.e. the NCC) performs quite poorly in real applications.
Under what conditions will NCC work well or not work well?
Linear Decision Boundary
Examples, using d12(x) = d1(x) − d2(x) = 2.8 x1 + 1.0 x2 − 8.93 (so the linear decision boundary is d12(x) = 0):
x = (2, 0.5)T : d12((2, 0.5)T) = −2.8 → ω2 : Iris setosa
x = (5, 1.5)T : d12((5, 1.5)T) = 6.6 → ω1 : Iris versicolor
The k-Nearest Neighbour Rule (k-NNR)
[Figure: a 2-D feature space (X1, X2) with samples from Class 1, Class 2 and Class 3. An unknown pattern x is shown together with nearby training samples A, B and C.]
The NNR is sub-optimal when compared with the Bayes classifier. It can be proved that, as n → ∞,

error rate_Bayes < error rate_NNR < 2 × error rate_Bayes.

Thus the performance of the NNR is quite good. Its drawback is its high computational complexity.
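As an illustrative sketch (names and data are mine, not from the notes), a brute-force k-NN rule can be written as follows; it also shows where the cost comes from, since every test vector is compared against all n stored training samples:

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Classify x by majority vote among its k nearest training samples (Euclidean distance)."""
    dists = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]              # indices of the k closest stored samples
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Illustrative training data: two classes in a 2-D feature space
X_train = np.array([[0.0, 0.0], [0.5, 0.2], [0.2, 0.4],    # class 1
                    [3.0, 3.0], [2.8, 3.2], [3.3, 2.9]])   # class 2
y_train = np.array([1, 1, 1, 2, 2, 2])

print(knn_classify([0.3, 0.1], X_train, y_train, k=3))     # -> 1
```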
[Figure: a two-class sample distribution in the (X1, X2) feature space, showing samples of Class 1 and Class 2.]
Topics to be covered
Support Vector Machines

Recognition
A. Template matching
B. Learning-from-Examples approaches
   1. Nearest Center Classifier
   2. Bayes Classifier (MAP, ML)
   3. Minimum Average Risk Bayes Classifier
   4. K-Nearest Neighbour Rule
   5. Optimal Decision Boundaries
C. Feature Selection
Optimal Decision Boundaries

It is desirable that two object classes can be separated by a curve (or a hypersurface in higher dimensions). When this is possible, the classifier has learnt in general. (But note that most classification problems are not separable.)
Example of two classes being linearly separable:
[Figure: Class 1 and Class 2 samples in the (X1, X2) plane, separated by a linear decision boundary.]
Example of non-linear separation:
[Figure: Class 1 and Class 2 samples in the (X1, X2) plane, separated by a non-linear boundary.]
For some other cases, the two classes are separable but not linearly separable (e.g. in the previous example). We say that the two classes are non-linearly separable.
For example, the 0-1 loss function Minimum Average Risk Bayes classifier, with Gaussian distributions over the feature space, yields quadratic decision boundaries in general.
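To see why the boundaries are quadratic (a standard derivation added here for completeness, not taken from the slides), take the logarithm of p(x|ωi)P(ωi) for Gaussian class-conditional densities with means μi and covariances Σi:

$$g_i(x) = -\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$

The decision boundary between classes i and j is the set g_i(x) = g_j(x), which is quadratic in x; it reduces to a hyperplane (a linear boundary) when Σ_i = Σ_j.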
Example: Given the two Classes 1 and 2 with training samples as shown. Observe that there are infinitely many lines that can separate (or dichotomise) the two classes. Where is the "optimal" decision line (hyperplane)?
[Figure: training samples of Class 1 and Class 2 in the (X1, X2) plane.]
But some typical kernel transformations are of the form:
a) ( xTxi + 1 )^p --- polynomial learning machine
b) exp( −||x − xi||² / 2σ² ) --- radial-basis function network
c) tanh( β xTxi + γ ) --- two-layer perceptron
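A minimal sketch of these three kernels (the parameter names p, sigma, beta and gamma are illustrative defaults, not values from the notes):

```python
import numpy as np

def poly_kernel(x, xi, p=2):
    """Polynomial learning machine kernel: (x . xi + 1)^p."""
    return (np.dot(x, xi) + 1.0) ** p

def rbf_kernel(x, xi, sigma=1.0):
    """Radial-basis function kernel: exp(-||x - xi||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(xi, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def tanh_kernel(x, xi, beta=1.0, gamma=0.0):
    """Two-layer perceptron (sigmoid) kernel: tanh(beta * x . xi + gamma)."""
    return np.tanh(beta * np.dot(x, xi) + gamma)

x, xi = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(x, xi), rbf_kernel(x, xi), tanh_kernel(x, xi))
```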
Cover's Theorem on the Separability of Patterns (1965):
"A complex pattern classification problem cast in a high-dimensional space non-linearly is more likely to be linearly separable than in a low-dimensional space."
Take the transformations defined by:

φ1(x) = e^(−||x − t1||²) ;  φ2(x) = e^(−||x − t2||²)

where t1 = (1,1)T and t2 = (0,0)T are the centres of the two kernel functions φ1 and φ2. For each sample x, find its φ1(x) and φ2(x).
Input pattern x    φ1(x)     φ2(x)     Class
p = (0,0)          0.1353    1         Class 1
q = (1,1)          1         0.1353    Class 1
r = (0,1)          0.3678    0.3678    Class 2
s = (1,0)          0.3678    0.3678    Class 2
[Figure: the φ1–φ2 feature space. The Class 1 points q and p lie on the supporting hyperplane h(x) = +1, while the Class 2 points r and s (which map to the same feature point) lie on the supporting hyperplane h(x) = −1.]
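A small sketch (names mine) that reproduces the table above by evaluating the two Gaussian kernel features for each input pattern:

```python
import numpy as np

t1 = np.array([1.0, 1.0])   # centre of phi1
t2 = np.array([0.0, 0.0])   # centre of phi2

def phi(x):
    """Map an input pattern x to its kernel features (phi1(x), phi2(x))."""
    x = np.asarray(x, dtype=float)
    return (np.exp(-np.sum((x - t1) ** 2)),
            np.exp(-np.sum((x - t2) ** 2)))

for name, x in [("p", (0, 0)), ("q", (1, 1)), ("r", (0, 1)), ("s", (1, 0))]:
    p1, p2 = phi(x)
    print(f"{name} = {x}: phi1 = {p1:.4f}, phi2 = {p2:.4f}")
# p: 0.1353, 1.0000   q: 1.0000, 0.1353   r, s: 0.3679, 0.3679
```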
Given the support vectors, construct the optimal hyperplane

h(x) = k1 φ1(x) + k2 φ2(x) + b.

Then solve for k1, k2 and b from the table of data. Since there are 3 unknowns, we need to have three support vectors.
So, for the problem, we obtain the three equations using the three points which are support vectors.
$$\begin{bmatrix} 0.1353 & 1 & 1 \\ 1 & 0.1353 & 1 \\ 0.3678 & 0.3678 & 1 \end{bmatrix} \begin{bmatrix} k_1 \\ k_2 \\ b \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix}$$
Solving, k1 = 5.0038, k2 = 5.0038, b = −4.6808.
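A quick numerical check of this solution (a NumPy sketch, not part of the original notes):

```python
import numpy as np

# Rows: the augmented feature vectors (phi1, phi2, 1) of the support vectors p, q and r/s
A = np.array([[0.1353, 1.0,    1.0],
              [1.0,    0.1353, 1.0],
              [0.3678, 0.3678, 1.0]])
t = np.array([1.0, 1.0, -1.0])   # targets: h(x) = +1 for Class 1, -1 for Class 2

k1, k2, b = np.linalg.solve(A, t)
print(k1, k2, b)   # approximately 5.0038, 5.0038, -4.6808
```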
Example: Suppose the optimal separating hyperplane is

h(x) = 5.0038 φ1(x) + 5.0038 φ2(x) − 4.6808 = 0.
Training solvers include SMO, the Core Vector Machine, etc. For very large data sets, you can resort to the Core Vector Machine or a Sequential Bootstrapped SVM.
Hyperplanes

Consider first ax + by + cz = 0.

Define N = [a, b, c]T and x = [X, Y, Z]T.

Then n = N / ||N|| is the unit normal, where ||N|| = √(a² + b² + c²) is the Euclidean (L2) norm of N.

Hence, the equation of the plane can be put into vector form as:

NTx = 0, i.e. ||N|| nTx = 0, or nTx = 0.

This shows that any x that lies on the plane is orthogonal (90°) to n.

Note that x = 0 satisfies the plane equation; this means that ax + by + cz = 0 is a plane that passes through the origin. So p = 0 for this case.
Recall: aTb = ||a|| ||b|| cos θ, where θ is the angle between the two vectors.

Rearranging, we have

aTb / ||b|| = ||a|| cos θ.

[Figure: projection of vector a onto vector b; the projected length is A = ||a|| cos θ.]

The component of any vector a in the direction of b is thus given by aT( b / ||b|| ) = ||a|| cos θ. Note that b / ||b|| is a unit vector in the direction of b.
Therefore, for ax + by + cz = d, in vector form:

NTx = d, or ||N|| nTx = d

nTx = d / ||N||

p = d / ||N||

where p is the perpendicular distance of the plane from the origin.
What is the distance of any vector z = [z1 z2 z3]T from the hyperplane ax1 + bx2 + cx3 = d ?
$$p = \frac{d}{\sqrt{a^2 + b^2 + c^2}}$$

$$p - p_z = \frac{d}{\sqrt{a^2 + b^2 + c^2}} - \frac{1}{\sqrt{a^2 + b^2 + c^2}}\begin{bmatrix} a & b & c \end{bmatrix}\begin{bmatrix} z_1 \\ z_2 \\ z_3 \end{bmatrix}$$

where p_z = nTz is the component of z along the unit normal, so the (signed) distance of z from the hyperplane is p − p_z.
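A small sketch of this computation (the function name and the example plane are mine):

```python
import numpy as np

def distance_to_plane(z, normal, d):
    """Signed distance of point z from the plane normal . x = d."""
    normal = np.asarray(normal, dtype=float)
    return (d - np.dot(normal, np.asarray(z, dtype=float))) / np.linalg.norm(normal)

# Plane 2*x1 + 3*x2 + 6*x3 = 12 and point z = (1, 1, 1): distance = (12 - 11)/7
print(distance_to_plane([1.0, 1.0, 1.0], [2.0, 3.0, 6.0], 12.0))   # 0.142857...
```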
Problem Statement: A hyperplane h(x) = 0 is given by

h(x) = a1x1 + a2x2 + a3x3 + b = 0

where x ∈ R³ and x = [x1 x2 x3]T.

Let a = [a1 a2 a3]T; then the unit normal n to the hyperplane is

n = a / ||a||.

Then the perpendicular distance q of a vector v ∈ R³ to the hyperplane is given by

q = nTv + b / ||a||.
Proof: The component p_v of v in the direction of n is given by p_v = nTv. Every point x on the hyperplane satisfies nTx = −b / ||a||, so

q = p_v + b / ||a|| = nTv + b / ||a||.

[Figure: the hyperplane with unit normal n, a point v, and its projection onto n.]
=> Need to introduce more than two kernel functions.
Non-separable distributions

[Figure: overlapping sample distributions of Class A and Class B in the (x1, x2) feature space.]
Bayes Classifier
P(A | B) reads as "prob of event A happening given that event B has happened".
If A and B are independent, P(A ∧ B) = P(A) P(B)
since P(A|B) = P(A).
If A and B are mutually exclusive, P(A ∧ B ) = 0
and thus P(A|B) = 0 and P(B|A) = 0.
Given events A1 and A2 such that
i) A1 ∪ A2 = U (universe)
ii) A1 ∩ A2 = φ (disjoint),
we can apply the total probability formula:

P(B) = P(B ∧ A1) + P(B ∧ A2) = P(B | A1) P(A1) + P(B | A2) P(A2).
Bayes Theorem
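In the notation above, Bayes' theorem states:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

where P(B) can be expanded using the total probability formula.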
For Q2, we should base our choice/decision on class probabilities
since that is the only information available.
For Q3, we use conditional probability. Compute P(Male|glasses)
and P(Female|glasses) and choose the one with the higher
probability. We call these probabilities a posteriori probabilities.
[Figure: probability density of height for female and male students, with the measurement 1.2 m marked on the height axis.]
Hence, P(M | height) and P(F | height), which are a posteriori probabilities, are not known directly. But by Bayes' theorem, they can be found indirectly using the prior distributions. Can you now see how?

Using the above "prior" experiments, we can now get an idea of how likely a student with a height of 1.2 m is to be male (or female). So this is when the Bayes theorem is useful.

You might guess that prior (before) and posteriori (after) are with respect to the (height) measurement.
Bayes Classifier

Given a classification task of c classes, ω1, ω2, …, ωc, and an unknown pattern, x, we form the c conditional probabilities

P(ωk | x),  k = 1, 2, …, c.

Read P(ω | x) = 0.3 as "given feature vector x, the probability that its class is ω is 0.3".
Bayes Classifier for the Two-Class Problem
Specifically for a 2-class case, MAP decision is :
if P (ω1 | x ) > P (ω2 | x ), x is classified as ω1
if P (ω1 | x ) < P (ω2 | x ), x is classified as ω2
This may be represented as

$$P(\omega_1 \mid x) \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; P(\omega_2 \mid x) \qquad \text{(MAP)}$$
$$P(\omega_k \mid x) = \frac{p(x \mid \omega_k)\, P(\omega_k)}{p(x)}$$
{ Recall P(A, B) = P ( A | B ) P ( B ) = P ( B | A ) P ( A ) }
$$P(\omega_k \mid x) = \frac{p(x \mid \omega_k)\, P(\omega_k)}{p(x)}$$

The pdf of the test vector, p(x), is again usually not easy to determine. (What is the probability of a student having 0.3 m long hair and brown eyes?) But for the classification problem it can be dropped, since it does not affect the classification result: p(x) is a common constant term for all classes.
For a 2-class problem, the MAP decision rule becomes:

$$P(\omega_1 \mid x) \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; P(\omega_2 \mid x)$$

$$\frac{p(x \mid \omega_1)\, P(\omega_1)}{p(x)} \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; \frac{p(x \mid \omega_2)\, P(\omega_2)}{p(x)} \qquad \text{(discard the common } p(x)\text{)}$$

$$p(x \mid \omega_1)\, P(\omega_1) \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; p(x \mid \omega_2)\, P(\omega_2)$$
For the multi-class case, decide ωi if

$$p(x \mid \omega_i)\, P(\omega_i) > p(x \mid \omega_j)\, P(\omega_j) \qquad \forall\, j = 1, \ldots, c,\; j \neq i.$$
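A minimal sketch of this MAP rule (the 1-D Gaussian class-conditional densities and priors below are illustrative, not from the notes):

```python
import numpy as np

def gauss_pdf(x, mean, std):
    """1-D Gaussian density, used here as an illustrative class-conditional p(x|w_k)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Illustrative classes: (mean, std, prior)
classes = [(1.60, 0.06, 0.5),   # class 1, e.g. female height
           (1.75, 0.07, 0.5)]   # class 2, e.g. male height

def map_classify(x):
    """Pick the class maximising p(x|w_k) * P(w_k); p(x) is dropped as a common factor."""
    scores = [gauss_pdf(x, m, s) * prior for (m, s, prior) in classes]
    return int(np.argmax(scores)) + 1

print(map_classify(1.62))   # -> 1
print(map_classify(1.80))   # -> 2
```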
Now if we further assume equal class probabilities, P(ωi) = 1/c, the multi-class case becomes the maximum-likelihood (ML) rule: decide ωi if p(x | ωi) > p(x | ωj) for all j ≠ i.
Topics Covered
6. (Feature Selection)