This action might not be possible to undo. Are you sure you want to continue?
https://www.scribd.com/doc/74136767/StprtoolPatternRecognition
06/21/2013
text
original
MACHINE PERCEPTION
CZECH TECHNICAL
UNIVERSITY
R
E
S
E
A
R
C
H
R
E
P
O
R
T
I
S
S
N
1
2
1
3

2
3
6
5
Statistical Pattern Recognition
Toolbox for Matlab
User’s guide
Vojtˇech Franc and V´aclav Hlav´ aˇc
¦xfrancv,hlavac¦@cmp.felk.cvut.cz
CTU–CMP–2004–08
June 24, 2004
Available at
ftp://cmp.felk.cvut.cz/pub/cmp/articles/FrancTR200408.pdf
The authors were supported by the European Union under project
IST200135454 ECVision, the project IST200132184 ActIPret and
by the Czech Grant Agency under project GACR 102/03/0440 and
by the Austrian Ministry of Education under project CONEX GZ
45.535.
Research Reports of CMP, Czech Technical University in Prague, No. 8, 2004
Published by
Center for Machine Perception, Department of Cybernetics
Faculty of Electrical Engineering, Czech Technical University
Technick´a 2, 166 27 Prague 6, Czech Republic
fax +420 2 2435 7385, phone +420 2 2435 7637, www: http://cmp.felk.cvut.cz
Statistical Pattern Recognition Toolbox for Matlab
Vojtˇech Franc and V´aclav Hlav´ aˇc
June 24, 2004
Contents
1 Introduction 3
1.1 What is the Statistical Pattern Recognition Toolbox and how it has been
developed? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 The STPRtool purpose and its potential users . . . . . . . . . . . . . . 3
1.3 A digest of the implemented methods . . . . . . . . . . . . . . . . . . . 4
1.4 Relevant Matlab toolboxes of others . . . . . . . . . . . . . . . . . . . . 5
1.5 STPRtool document organization . . . . . . . . . . . . . . . . . . . . . 5
1.6 How to contribute to STPRtool? . . . . . . . . . . . . . . . . . . . . . 5
1.7 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Linear Discriminant Function 8
2.1 Linear classiﬁer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Linear separation of ﬁnite sets of vectors . . . . . . . . . . . . . . . . . 10
2.2.1 Perceptron algorithm . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 Kozinec’s algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.3 Multiclass linear classiﬁer . . . . . . . . . . . . . . . . . . . . . 14
2.3 Fisher Linear Discriminant . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Generalized Anderson’s task . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.1 Original Anderson’s algorithm . . . . . . . . . . . . . . . . . . . 19
2.4.2 General algorithm framework . . . . . . . . . . . . . . . . . . . 21
2.4.3 εsolution using the Kozinec’s algorithm . . . . . . . . . . . . . 21
2.4.4 Generalized gradient optimization . . . . . . . . . . . . . . . . . 22
3 Feature extraction 25
3.1 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Kernel Principal Component Analysis . . . . . . . . . . . . . . . . . . . 31
3.4 Greedy kernel PCA algorithm . . . . . . . . . . . . . . . . . . . . . . . 32
3.5 Generalized Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . 33
3.6 Combining feature extraction and linear classiﬁer . . . . . . . . . . . . 34
1
4 Density estimation and clustering 41
4.1 Gaussian distribution and Gaussian mixture model . . . . . . . . . . . 41
4.2 MaximumLikelihood estimation of GMM . . . . . . . . . . . . . . . . 43
4.2.1 Complete data . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.2 Incomplete data . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3 Minimax estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.4 Probabilistic output of classiﬁer . . . . . . . . . . . . . . . . . . . . . . 48
4.5 Kmeans clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5 Support vector and other kernel machines 57
5.1 Support vector classiﬁers . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2 Binary Support Vector Machines . . . . . . . . . . . . . . . . . . . . . 58
5.3 OneAgainstAll decomposition for SVM . . . . . . . . . . . . . . . . . 63
5.4 OneAgainstOne decomposition for SVM . . . . . . . . . . . . . . . . . 64
5.5 Multiclass BSVM formulation . . . . . . . . . . . . . . . . . . . . . . . 66
5.6 Kernel Fisher Discriminant . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.7 Preimage problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.8 Reduced set method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6 Miscellaneous 76
6.1 Data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.2 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.3 Bayesian classiﬁer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.4 Quadratic classiﬁer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.5 Knearest neighbors classiﬁer . . . . . . . . . . . . . . . . . . . . . . . 82
7 Examples of applications 85
7.1 Optical Character Recognition system . . . . . . . . . . . . . . . . . . 85
7.2 Image denoising . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
8 Conclusions and future planes 94
2
Chapter 1
Introduction
1.1 What is the Statistical Pattern Recognition Tool
box and how it has been developed?
The Statistical Pattern Recognition Toolbox (abbreviated STPRtool) is a collection of
pattern recognition (PR) methods implemented in Matlab.
The core of the STPRtool is comprised of statistical PR algorithms described in
the monograph Schlesinger, M.I., Hlav´ aˇc, V: Ten Lectures on Statistical and Structural
Pattern Recognition, Kluwer Academic Publishers, 2002 [26]. The STPRtool has been
developed since 1999. The ﬁrst version was a result of a diploma project of Vojtˇech
Franc [8]. Its author became a PhD student in CMP and has been extending the
STPRtool since. Currently, the STPRtool contains much wider collection of algorithms.
The STPRtool is available at
http://cmp.felk.cvut.cz/cmp/cmp software.html.
We tried to create concise and complete documentation of the STPRtool. This
document and associated help ﬁles in .html should give all information and parameters
the user needs when she/he likes to call appropriate method (typically a Matlab m
ﬁle). The intention was not to write another textbook. If the user does not know
the method then she/he will likely have to read the description of the method in a
referenced original paper or in a textbook.
1.2 The STPRtool purpose and its potential users
• The principal purpose of STPRtool is to provide basic statistical pattern recog
nition methods to (a) researchers, (b) teachers, and (c) students. The STPRtool
is likely to help both in research and teaching. It can be found useful by practi
tioners who design PR applications too.
3
• The STPRtool oﬀers a collection of the basic and the stateofthe art statistical
PR methods.
• The STPRtool contains demonstration programs, working examples and visual
ization tools which help to understand the PR methods. These teaching aids are
intended for students of the PR courses and their teachers who can easily create
examples and incorporate them into their lectures.
• The STPRtool is also intended to serve a designer of a new PR method by
providing tools for evaluation and comparison of PR methods. Many standard
and stateofthe art methods have been implemented in STPRtool.
• The STPRtool is open to contributions by others. If you develop a method
which you consider useful for the community then contact us. We are willing to
incorporate the method into the toolbox. We intend to moderate this process to
keep the toolbox consistent.
1.3 A digest of the implemented methods
This section should give the reader a quick overview of the methods implemented in
STPRtool.
• Analysis of linear discriminant function: Perceptron algorithm and multi
class modiﬁcation. Kozinec’s algorithm. Fisher Linear Discriminant. A collection
of known algorithms solving the Generalized Anderson’s Task.
• Feature extraction: Linear Discriminant Analysis. Principal Component Anal
ysis (PCA). Kernel PCA. Greedy Kernel PCA. Generalized Discriminant Analy
sis.
• Probability distribution estimation and clustering: Gaussian Mixture
Models. ExpectationMaximization algorithm. Minimax probability estimation.
Kmeans clustering.
• Support Vector and other Kernel Machines: Sequential Minimal Opti
mizer (SMO). Matlab Optimization toolbox based algorithms. Interface to the
SV M
light
software. Decomposition approaches to train the Multiclass SVM clas
siﬁers. Multiclass BSVM formulation trained by Kozinec’s algorithm, Mitchell
DemyanovMolozenov algorithm and Nearest Point Algorithm. Kernel Fisher
Discriminant.
4
1.4 Relevant Matlab toolboxes of others
Some researchers in statistical PR provide code of their methods. The Internet oﬀers
many software resources for Matlab. However, there are not many compact and well
documented packages comprised of PR methods. We found two useful packages which
are brieﬂy introduced below.
NETLAB is focused to the Neural Networks applied to PR problems. The classical
PR methods are included as well. This Matlab toolbox is closely related to the
popular PR textbook [3]. The documentation published in book [21] contains de
tailed description of the implemented methods and many working examples. The
toolbox includes: Probabilistic Principal Component Analysis, Generative Topo
graphic Mapping, methods based on the Bayesian inference, Gaussian processes,
Radial Basis Functions (RBF) networks and many others.
The toolbox can be downloaded from
http://www.ncrg.aston.ac.uk/netlab/.
PRTools is a Matlab Toolbox for Pattern Recognition [7]. Its implementation is
based on the object oriented programming principles supported by the Matlab
language. The toolbox contains a wide range of PR methods including: analysis
of linear and nonlinear classiﬁers, feature extraction, feature selection, methods
for combining classiﬁers and methods for clustering.
The toolbox can be downloaded from
http://www.ph.tn.tudelft.nl/prtools/.
1.5 STPRtool document organization
The document is divided into chapters describing the implemented methods. The last
chapter is devoted to more complex examples. Most chapters begin with deﬁnition of
the models to be analyzed (e.g., linear classiﬁer, probability densities, etc.) and the list
of the implemented methods. Individual sections have the following ﬁxed structure: (i)
deﬁnition of the problem to be solved, (ii) brief description and list of the implemented
methods, (iii) reference to the literature where the problem is treaded in details and
(iv) an example demonstrating how to solve the problem using the STPRtool.
1.6 How to contribute to STPRtool?
If you have an implementation which is consistent with the STPRtool spirit and you
want to put it into public domain through the STPRtool channel then send an email
to xfrancv@cmp.felk.cvut.cz.
5
Please prepare your code according to the pattern you ﬁnd in STPRtool implemen
tations. Provide piece of documentation too which will allow us to incorporate it to
STPRtool easily.
Of course, your method will be bound to your name in STPRtool.
1.7 Acknowledgements
First, we like to thank to Prof. M.I. Schlesinger from the Ukrainian Academy of
Sciences, Kiev, Ukraine who showed us the beauty of pattern recognition and gave us
many valuable advices.
Second, we like to thank for discussions to our colleagues in the Center for Machine
Perception, the Czech Technical University in Prague, our visitors, other STPRtool
users and students who used the toolbox in their laboratory exercises.
Third, thanks go to several Czech, EU and industrial grants which indirectly con
tributed to STPRtool development.
Forth, the STPRtool was rewritten and its documentation radically improved in
May and June 2004 with the ﬁnancial help of ECVision European Research Network
on Cognitive Computer Vision Systems (IST200135454) lead by D. Vernon.
6
Notation
Uppercase bold letters denote matrices, for instance A. Vectors are implicitly column
vectors. Vectors are denoted by lowercase bold italic letters, for instance x. The
concatenation of column vectors w ∈ R
n
, z ∈ R
m
is denoted as v = [w; z] where the
column vector v is (n + m)dimensional.
¸x x
) Dot product between vectors x and x
.
Σ Covariance matrix.
µ Mean vector (mathematical expectation).
S Scatter matrix.
A ⊆ R
n
ndimensional input space (space of observations).
¸ Set of c hidden states – class labels ¸ = ¦1, . . . , c¦.
T Feature space.
φ: A → T Mapping from input space A to the feature space T.
k: A A →R Kernel function (Mercer kernel).
T
XY
Labeled training set (more precisely multiset) T
XY
=
¦(x
1
, y
1
), . . . , (x
l
, y
l
)¦.
T
X
Unlabeled training set T
X
= ¦x
1
, . . . , x
l
¦.
q: A → ¸ Classiﬁcation rule.
f
y
: A →R Discriminant function associated with the class y ∈ ¸.
f: A →R Discriminant function of the binary classiﬁer.
[ [ Cardinality of a set.
  Euclidean norm.
det(()) Matrix determinant.
δ(i, j) Kronecker delta δ(i, j) = 1 for i = j and δ(i, j) = 0
otherwise.
_
Logical or.
7
Chapter 2
Linear Discriminant Function
The linear classiﬁcation rule q: A ⊆ R
n
→ ¸ = ¦1, 2, . . . , c¦ is composed of a set of
discriminant functions
f
y
(x) = ¸w
y
x) + b
y
, ∀y ∈ ¸ ,
which are linear with respect to both the input vector x ∈ R
n
and their parameter
vectors w ∈ R
n
. The scalars b
y
, ∀y ∈ ¸ introduce bias to the discriminant functions.
The input vector x ∈ R
n
is assigned to the class y ∈ ¸ its corresponding discriminant
function f
y
attains maximal value
y = argmax
y
∈Y
f
y
(x) = argmax
y
∈Y
(¸w
y
x) + b
y
) . (2.1)
In the particular binary case ¸ = ¦1, 2¦, the linear classiﬁer is represented by a
single discriminant function
f(x) = ¸w x) + b ,
given by the parameter vector w ∈ R
n
and bias b ∈ R. The input vector x ∈ R
n
is
assigned to class y ∈ ¦1, 2¦ as follows
q(x) =
_
1 if f(x) = ¸w x) + b ≥ 0 ,
2 if f(x) = ¸w x) + b < 0 .
(2.2)
The datatype used to describe the linear classiﬁer and the implementation of the
classiﬁer is described in Section 2.1. Following sections describe supervised learning
methods which determine the parameters of the linear classiﬁer based on available
knowledge. The implemented learning methods can be distinguished according to the
type of the training data:
I. The problem is described by a ﬁnite set T = ¦(x
1
, y
1
), . . . , (x
l
, y
l
)¦ containing pairs
of observations x
i
∈ R
n
and corresponding class labels y
i
∈ ¸. The methods
using this type of training data is described in Section 2.2 and Section 2.3.
8
II. The parameters of class conditional distributions are (completely or partially)
known. The Anderson’s task and its generalization are representatives of such
methods (see Section 2.4).
The summary of implemented methods is given in Table 2.1.
References: The linear discriminant function is treated throughout for instance in
books [26, 6].
Table 2.1: Implemented methods: Linear discriminant function
linclass Linear classiﬁer (linear machine).
andrerr Classiﬁcation error of the Generalized Anderson’s
task.
androrig Original method to solve the AndersonBahadur’s
task.
eanders Epsilonsolution of the Generalized Anderson’s task.
ganders Solves the Generalized Anderson’s task.
ggradander Gradient method to solve the Generalized Anderson’s
task.
ekozinec Kozinec’s algorithm for epsoptimal separating hyper
plane.
mperceptr Perceptron algorithm to train multiclass linear ma
chine.
perceptron Perceptron algorithm to train binary linear classiﬁer.
fld Fisher Linear Discriminant.
fldqp Fisher Linear Discriminant using Quadratic Program
ming.
demo linclass Demo on the algorithms learning linear classiﬁers.
demo anderson Demo on Generalized Anderson’s task.
2.1 Linear classiﬁer
The STPRtool uses a speciﬁc structure array to describe the binary (see Table 2.2)
and the multiclass (see Table 2.3) linear classiﬁer. The implementation of the linear
classiﬁer itself provides function linclass.
Example: Linear classiﬁer
The example shows training of the Fisher Linear Discriminant which is the classical
example of the linear classiﬁer. The Riply’s data set riply trn.mat is used for training.
The resulting linear classiﬁer is then evaluated on the testing data riply tst.mat.
9
Table 2.2: Datatype used to describe binary linear classiﬁer.
Binary linear classiﬁer (structure array):
.W [n 1] The normal vector w ∈ R
n
of the separating hyper
plane f(x) = ¸w x) + b.
.b [1 1] Bias b ∈ R of the hyperplane.
.fun = ’linclass’ Identiﬁes function associated with this data type.
Table 2.3: Datatype used to describe multiclass linear classiﬁer.
Multiclass linear classiﬁer (structure array):
.W [n c] Matrix of parameter vectors w
y
∈ R
n
, y = 1, . . . , c of
the linear discriminant functions f
y
(x) = ¸w
y
x) +b.
.b [c 1] Parameters b
y
∈ R, y = 1, . . . , c.
.fun = ’linclass’ Identiﬁes function associated with this data type.
trn = load(’riply_trn’); % load training data
tst = load(’riply_tst’); % load testing data
model = fld( trn ); % train FLD classifier
ypred = linclass( tst.X, model ); % classify testing data
cerror( ypred, tst.y ) % evaluate testing error
ans =
0.1080
2.2 Linear separation of ﬁnite sets of vectors
The input training data T = ¦(x
1
, y
1
), . . . , (x
l
, y
l
)¦ consists of pairs of observations
x
i
∈ R
n
and corresponding class labels y
i
∈ ¸ = ¦1, . . . , c¦. The implemented methods
solve the task of (i) training separating hyperplane for binary case c = 2, (ii) training
εoptimal hyperplane and (iii) training the multiclass linear classiﬁer. The deﬁnitions
of these tasks are given below.
The problem of training the binary (c = 2) linear classiﬁer (2.2) with zero training
error is equivalent to ﬁnding the hyperplane H = ¦x: ¸w x) +b = 0¦ which separates
the training vectors of the ﬁrst y = 1 and the second y = 2 class. The problem is
formally deﬁned as solving the set of linear inequalities
¸w x
i
) + b ≥ 0 , y
i
= 1 ,
¸w x
i
) + b < 0 , y
i
= 2 .
(2.3)
with respect to the vector w ∈ R
n
and the scalar b ∈ R. If the inequalities (2.3) have a
solution then the training data T is linearly separable. The Perceptron (Section 2.2.1)
10
and Kozinec’s (Section 2.2.2) algorithm are useful to train the binary linear classiﬁers
from the linearly separable data.
If the training data is linearly separable then the optimal separating hyperplane
can be deﬁned the solution of the following task
(w
∗
, b
∗
) = argmax
w,b
m(w, b)
= argmax
w,b
min
_
min
i∈I
1
¸w x
i
) + b
w
, min
i∈I
2
−
¸w x
i
) + b
w
_
,
(2.4)
where 1
1
= ¦i: y
i
= 1¦ and 1
2
= ¦i: y
i
= 2¦ are sets of indices. The optimal separating
hyperplane H
∗
= ¦x ∈ R
n
: ¸w
∗
x) + b
∗
= 0¦ separates training data T with maxi
mal margin m(w
∗
, b
∗
). This task is equivalent to training the linear Support Vector
Machines (see Section 5). The optimal separating hyperplane cannot be found exactly
except special cases. Therefore the numerical algorithms seeking the approximate so
lution are applied instead. The εoptimal solution is deﬁned as the vector w and scalar
b such that the inequality
m(w
∗
, b
∗
) −m(w, b) ≤ ε , (2.5)
holds. The parameter ε ≥ 0 deﬁnes closeness to the optimal solution in terms of the
margin. The εoptimal solution can be found by Kozinec’s algorithm (Section 2.2.2).
The problem of training the multiclass linear classiﬁer c > 2 with zero training
error is formally stated as the problem of solving the set of linear inequalities
¸w
y
i
x
i
) + b
y
i
> ¸w
y
x
i
) + b
y
, i = 1, . . . , l , y
i
,= y , (2.6)
with respect to the vectors w
y
∈ R
n
, y ∈ ¸ and scalars b
y
∈ R, y ∈ ¸. The task (2.6)
for linearly separable training data T can be solved by the modiﬁed Perceptron (Sec
tion 2.2.3) or Kozinec’s algorithm. An interactive demo on the algorithms separating
the ﬁnite sets of vectors by a hyperplane is implemented in demo linclass.
2.2.1 Perceptron algorithm
The input is a set T = ¦(x
1
, y
1
), . . . , (x
l
, y
l
)¦ of binary labeled y
i
∈ ¦1, 2¦ training
vectors x
i
∈ R
n
. The problem of training the separating hyperplane (2.3) can be
formally rewritten to a simpler form
¸v z
i
) > 0 , i = 1, . . . , l , (2.7)
where the vector v ∈ R
n+1
is constructed as
v = [w; b] , (2.8)
11
and transformed training data Z = ¦z
1
, . . . , z
l
¦ are deﬁned as
z
i
=
_
[x
i
; 1] if y
i
= 1 ,
−[x
i
; 1] if y
i
= 2 .
(2.9)
The problem of solving (2.7) with respect to the unknown vector v ∈ R
n+1
is equivalent
to the original task (2.3). The parameters (w, b) of the linear classiﬁer are obtained
from the found vector v by inverting the transformation (2.8).
The Perceptron algorithm is an iterative procedure which builds a series of vectors
v
(0)
, v
(1)
, . . . , v
(t)
until the set of inequalities (2.7) is satisﬁed. The initial vector v
(0)
can
be set arbitrarily (usually v = 0). The Novikoﬀ theorem ensures that the Perceptron
algorithm stops after ﬁnite number of iterations t if the training data are linearly
separable. Perceptron algorithm is implemented in function perceptron.
References: Analysis of the Perceptron algorithm including the Novikoﬀ’s proof of
convergence can be found in Chapter 5 of the book [26]. The Perceptron from the
neural network perspective is described in the book [3].
Example: Training binary linear classiﬁer with Perceptron
The example shows the application of the Perceptron algorithm to ﬁnd the binary
linear classiﬁer for synthetically generated 2dimensional training data. The generated
training data contain 50 labeled vectors which are linearly separable with margin 1.
The found classiﬁer is visualized as a separating hyperplane (line in this case) H =
¦x ∈ R
2
: ¸w x) + b = 0¦. See Figure 2.1.
data = genlsdata( 2, 50, 1); % generate training data
model = perceptron( data ); % call perceptron
figure;
ppatterns( data ); % plot training data
pline( model ); % plot separating hyperplane
2.2.2 Kozinec’s algorithm
The input is a set T = ¦(x
1
, y
1
), . . . , (x
l
, y
l
)¦ of binary labeled y
i
∈ ¦1, 2¦ training
vectors x
i
∈ R
n
. The Kozinec’s algorithm builds a series of vectors w
(0)
1
, w
(1)
1
, . . . , w
(t)
1
and w
(0)
2
, w
(1)
2
, . . . , w
(t)
2
which converge to the vector w
∗
1
and w
∗
2
respectively. The
vectors w
∗
1
and w
∗
2
are the solution of the following task
w
∗
1
, w
∗
2
= argmin
w
1
∈X
1
,w
2
∈X
2
w
1
−w
2
 ,
where A
1
stands for the convex hull of the training vectors of the ﬁrst class A
1
=
¦x
i
: y
i
= 1¦ and A
2
for the convex hull of the second class likewise. The vector w
∗
=
12
−55 −50 −45 −40 −35 −30
−55
−50
−45
−40
−35
−30
Figure 2.1: Linear classiﬁer trained by the Perceptron algorithm
w
∗
1
−w
∗
2
and the bias b
∗
=
1
2
(w
∗
2

2
−w
∗
1

2
) determine the optimal hyperplane (2.4)
separating the training data with the maximal margin.
The Kozinec’s algorithm is proven to converge to the vectors w
∗
1
and w
∗
2
in inﬁnite
number of iterations t = ∞. If the εoptimal optimality stopping condition is used
then the Kozinec’s algorithm converges in ﬁnite number of iterations. The Kozinec’s
algorithm can be also used to solve a simpler problem of ﬁnding the separating hyper
plane (2.3). Therefore the following two stopping conditions are implemented:
I. The separating hyperplane (2.3) is sought for (ε < 0). The Kozinec’s algorithm is
proven to converge in a ﬁnite number of iterations if the separating hyperplane
exists.
II. The εoptimal hyperplane (2.5) is sought for (ε ≥ 0). Notice that setting ε = 0
force the algorithm to seek the optimal hyperplane which is generally ensured to
be found in an inﬁnite number of iterations t = ∞.
The Kozinec’s algorithm which trains the binary linear classiﬁer is implemented in the
function ekozinec.
References: Analysis of the Kozinec’s algorithm including the proof of convergence
can be found in Chapter 5 of the book [26].
Example: Training εoptimal hyperplane by the Kozinec’s algorithm
The example shows application of the Kozinec’s algorithm to ﬁnd the (ε = 0.01)
optimal hyperplane for synthetically generated 2dimensional training data. The gen
erated training data contains 50 labeled vectors which are linearly separable with
13
margin 1. The found classiﬁer is visualized (Figure 2.2) as a separating hyperplane
H = ¦x ∈ R
2
: ¸w x) +b = 0¦ (straight line in this case). It could be veriﬁed that the
found hyperplane has margin model.margin which satisﬁes the εoptimality condition.
data = genlsdata(2,50,1); % generate data
% run ekozinec
model = ekozinec(data, struct(’eps’,0.01));
figure;
ppatterns(data); % plot data
pline(model); % plot found hyperplane
−10 −8 −6 −4 −2 0 2 4 6 8 10
−10
−8
−6
−4
−2
0
2
4
6
8
Figure 2.2: εoptimal hyperplane trained with the Kozinec’s algorithm.
2.2.3 Multiclass linear classiﬁer
The input training data T = ¦(x
1
, y
1
), . . . , (x
l
, y
l
)¦ consists of pairs of observations
x
i
∈ R
n
and corresponding class labels y
i
∈ ¸ = ¦1, . . . , c¦. The task is to design
the linear classiﬁer (2.1) which classiﬁes the training data T without error. The prob
lem of training (2.1) is equivalent to solving the set of linear inequalities (2.6). The
inequalities (2.6) can be formally transformed to a simpler set
¸v z
y
i
) > 0 , i = 1, . . . , l , y = ¸ ¸ ¦y
i
¦ , (2.10)
where the vector v ∈ R
(n+1)c
contains the originally optimized parameters
v = [w
1
; b
1
; w
2
; b
2
; . . . ; w
c
; b
c
] .
14
The set of vectors z
y
i
∈ R
(n+1)c
, i = 1, . . . , l, y ∈ ¸ ¸ ¦y
i
¦ is created from the training
set T such that
z
y
i
(j) =
⎧
⎨
⎩
[x
i
; 1] , for j = y
i
,
−[x
i
; 1] , for j = y ,
0 , otherwise.
(2.11)
where z
y
i
(j) stands for the jth slot between coordinates (n+1)(j −1)+1 and (n+1)j.
The described transformation is known as the Kesler’s construction.
The transformed task (2.10) can be solved by the Perceptron algorithm. The simple
form of the Perceptron updating rule allows to implement the transformation implicitly
without mapping the training data T into (n + 1)cdimensional space. The described
method is implemented in function mperceptron.
References: The Kesler’s construction is described in books [26, 6].
Example: Training multiclass linear classiﬁer by the Perceptron
The example shows application of the Perceptron rule to train the multiclass linear
classiﬁer for the synthetically generated 2dimensional training data pentagon.mat. A
user’s own data can be generated interactively using function createdata(’finite’,10)
(number 10 stands for number of classes). The found classiﬁer is visualized as its sep
arating surface which is known to be piecewise linear and the class regions should be
convex (see Figure 2.3).
data = load(’pentagon’); % load training data
model = mperceptron(data); % run training algorithm
figure;
ppatterns(data); % plot training data
pboundary(model); % plot decision boundary
2.3 Fisher Linear Discriminant
The input is a set T = ¦(x
1
, y
1
), . . . , (x
l
, y
l
)¦ of binary labeled y
i
∈ ¦1, 2¦ training
vectors x
i
∈ R
n
. Let 1
y
= ¦i: y
i
= y¦, y ∈ ¦1, 2¦ be sets of indices of training
vectors belonging to the ﬁrst y = 1 and the second y = 2 class, respectively. The class
separability in a direction w ∈ R
n
is deﬁned as
F(w) =
¸w S
B
w)
¸w S
W
w)
, (2.12)
where S
B
is the betweenclass scatter matrix
S
B
= (µ
1
−µ
2
)(µ
1
−µ
2
)
T
, µ
y
=
1
[1
y
[
i∈Iy
x
i
, y ∈ ¦1, 2¦ ,
15
−1 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1
−1
−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
Figure 2.3: Multiclass linear classiﬁer trained by the Perceptron algorithm.
and S
W
is the within class scatter matrix deﬁned as
S
W
= S
1
+S
2
, S
y
=
i∈Yy
(x
i
−µ
y
)(x
i
−µ
y
)
T
, y ∈ ¦1, 2¦ .
In the case of the Fisher Linear Discriminant (FLD), the parameter vector w of the
linear discriminant function f(x) = ¸w x) + b is determined to maximize the class
separability criterion (2.12)
w = argmax
w
F(w
) = argmax
w
¸w
S
B
w
)
¸w
S
W
w
)
. (2.13)
The bias b of the linear rule must be determined based on another principle. The
STPRtool implementation, for instance, computes such b that the equality
¸w µ
1
) + b = −(¸w µ
2
) + b) ,
holds. The STPRtool contains two implementations of FLD:
I. The classical solution of the problem (2.13) using the matrix inversion
w = S
W
−1
(µ
1
−µ
2
) .
This case is implemented in the function fld.
16
II. The problem (2.13) can be reformulated as the quadratic programming (QP) task
w = argmin
w
¸w
S
W
w
) ,
subject to
¸w
(µ
1
−µ
2
)) = 2 .
The QP solver quadprog of the Matlab Optimization Toolbox is used in the
toolbox implementation. This method can be useful when the matrix inversion
is hard to compute. The method is implemented in the function fldqp.
References: The Fisher Linear Discriminant is described in Chapter 3 of the book [6]
or in the book [3].
Example: Fisher Linear Discriminant
The binary linear classiﬁer based on the Fisher Linear Discriminant is trained on
the Riply’s data set riply trn.mat. The QP formulation fldqp of FLD is used.
However, the function fldqp can be replaced by the standard approach implemented
in the function fld which would yield the same solution. The found linear classiﬁer is
visualized (see Figure 2.4) and evaluated on the testing data riply tst.mat.
trn = load(’riply_trn’); % load training data
model = fldqp(trn); % compute FLD
figure;
ppatterns(trn); pline(model); % plot data and solution
tst = load(’riply_tst’); % load testing data
ypred = linclass(tst.X,model); % classify testing data
cerror(ypred,tst.y) % compute testing error
ans =
0.1080
2.4 Generalized Anderson’s task
The classiﬁed object is described by the vector of observations x ∈ A ⊆ R
n
and a
binary hidden state y ∈ ¦1, 2¦. The class conditional distributions p
XY
(x[y), y ∈ ¦1, 2¦
are known to be multivariate Gaussian distributions. The parameters (µ
1
, Σ
1
) and
(µ
2
, Σ
2
) of these class distributions are unknown. However, it is known that the
parameters (µ
1
, Σ
1
) belong to a certain ﬁnite set of parameters ¦(µ
i
, Σ
i
): i ∈ 1
1
¦.
Similarly (µ
2
, Σ
2
) belong to a ﬁnite set ¦(µ
i
, Σ
i
): i ∈ 1
2
¦. Let q: A ⊆ R
n
→ ¦1, 2¦
17
−1.5 −1 −0.5 0 0.5 1
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
Figure 2.4: Linear classiﬁer based on the Fisher Linear Discriminant.
be a binary linear classiﬁer (2.2) with discriminant function f(x) = ¸w x) + b. The
probability of misclassiﬁcation is deﬁned as
Err(w, b) = max
i∈I
1
∪I
2
ε(w, b, µ
i
, Σ
i
) ,
where ε(w, b, µ
i
, Σ
i
) is probability that the Gaussian random vector x with mean
vector µ
i
and the covariance matrix Σ
i
satisﬁes q(x) = 1 for i ∈ 1
2
or q(x) = 2 for
i ∈ 1
1
. In other words, it is the probability that the vector x will be misclassiﬁed by
the linear rule q.
The Generalized Anderson’s task (GAT) is to ﬁnd the parameters (w
∗
, b
∗
) of the
linear classiﬁer
q(x) =
_
1 , for f(x) = ¸w
∗
x) + b
∗
≥ 0 ,
2 , for f(x) = ¸w
∗
x) + b
∗
< 0 ,
such that the error Err(w
∗
, b
∗
) is minimal
(w
∗
, b
∗
) = argmin
w,b
Err(w, b) = argmin
w,b
max
i∈I
1
∪I
2
ε(w, b, µ
i
, Σ
i
) . (2.14)
The original Anderson’s task is a special case of (2.14) when [1
1
[ = 1 and [1
2
[ = 1.
The probability ε(w, b, µ
i
, Σ
i
) is proportional to the reciprocal of the Mahalanobis
distance r
i
between the (µ
i
, Σ
i
) and the nearest vector of the separating hyperplane
H = ¦x ∈ R
n
: ¸w x) + b = 0¦, i.e.,
r
i
= min
x∈H
¸(µ
i
−x) (Σ
i
)
−1
(µ
i
−x)) =
¸w µ
i
) + b
_
¸w Σ
i
w)
.
18
The exact relation between the probability ε(w, b, µ
i
, Σ
i
) and the corresponding Ma
halanobis distance r
i
is given by the integral
ε(w, b, µ
i
, Σ
i
) =
_
∞
r
i
1
√
2π
e
−
1
2
t
2
dt . (2.15)
The optimization problem (2.14) can be equivalently rewritten as
(w
∗
, b
∗
) = argmax
w,b
F(w, b) = argmax
w,b
min
i∈I
1
∪I
2
¸w µ
i
) + b
_
¸w Σ
i
w)
.
which is more suitable for optimization. The objective function F(w, b) is proven to
be convex in the region where the probability of misclassiﬁcation Err(w, b) is less than
0.5. However, the objective function F(w, b) is not diﬀerentiable.
The STPRtool contains implementations of the algorithm solving the original An
derson’s task as well as implementations of three diﬀerent approaches to solve the
Generalized Anderson’s task which are described bellow. An interactive demo on the
algorithms solving the Generalized Anderson’s task is implemented in demo anderson.
References: The original Anderson’s task was published in [1]. A detailed description
of the Generalized Anderson’s task and all the methods implemented in the STPRtool
is given in book [26].
2.4.1 Original Anderson’s algorithm
The algorithm is implemented in the function androrig. This algorithm solves the
original Anderson’s task deﬁned for two Gaussians only [1
1
[ = [1
2
[ = 1. In this case,
the optimal solution vector w ∈ R
n
is obtained by solving the following problem
w = ((1 −λ)Σ
1
+ λΣ
2
)
−1
(µ
1
−µ
2
) ,
1 −λ
λ
=
¸
¸w Σ
2
w)
¸w Σ
1
w)
,
where the scalar 0 < λ < 1 is unknown. The problem is solved by an iterative algorithm
which stops while the condition
¸
¸
¸
¸
γ −
1 −λ
λ
¸
¸
¸
¸
≤ ε , γ =
¸
¸w Σ
2
w)
¸w Σ
1
w)
,
is satisﬁed. The parameter ε deﬁnes the closeness to the optimal solution.
Example: Solving the original Anderson’s task
19
The original Anderson’s task is solved for the Riply’s data set riply trn.mat. The
parameters (µ
1
, Σ
1
) and (µ
2
, Σ
2
) of the Gaussian distribution of the ﬁrst and the
second class are obtained by the MaximumLikelihood estimation. The found linear
classiﬁer is visualized and the Gaussian distributions (see Figure 2.5). The classiﬁer is
validated on the testing set riply tst.mat.
trn = load(’riply_trn’); % load training data
distrib = mlcgmm(data); % estimate Gaussians
model = androrig(distrib); % solve Anderson’s task
figure;
pandr( model, distrib ); % plot solution
tst = load(’riply_tst’); % load testing data
ypred = linclass( tst.X, model ); % classify testing data
cerror( ypred, tst.y ) % evaluate error
ans =
0.1150
−0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
gauss 1
0.929
gauss 2
1.37
Figure 2.5: Solution of the original Anderson’s task.
20
2.4.2 General algorithm framework
The algorithm is implemented in the function ganders. This algorithm repeats the
following steps:
I. An improving direction (∆w, ∆b) in the parameter space is computed such that
moving along this direction increases the objective function F. The quadprog
function of the Matlab optimization toolbox is used to solve this subtask.
II. The optimal movement k along the direction (∆w, ∆b). The optimal movement k
is determined by 1Dsearch based on the cutting interval algorithm according to
the Fibonacci series. The number of interval cuttings of is an optional parameter
of the algorithm.
III. The new parameters are set
w
(t+1)
: = w
(t)
+ k∆w , b
(t+1)
: = b
(t)
+ k∆b .
The algorithm iterates until the minimal change in the objective function is less than
the prescribed ε, i.e., the condition
F(w
(t+1)
, b
(t+1)
) −F(w
(t)
, b
(t)
) > ε ,
is violated.
Example: Solving the Generalized Anderson’s task
The Generalized Anderson’s task is solved for the data mars.mat. Both the ﬁrst
and second class are described by three 2Ddimensional Gaussian distributions. The
found linear classiﬁer as well as the input Gaussian distributions are visualized (see
Figure 2.6).
distrib = load(’mars’); % load input Gaussians
model = ganders(distrib); % solve the GAT
figure;
pandr(model,distrib); % plot the solution
2.4.3 εsolution using the Kozinec’s algorithm
This algorithm is implemented in the function eanders. The εsolution is deﬁned as a
linear classiﬁer with parameters (w, b) such that
F(w, b) < ε
21
−5 −4 −3 −2 −1 0 1 2 3 4 5
−4
−3
−2
−1
0
1
2
3
4
5
gauss 1
0.0461
gauss 2
0.0483
gauss 3
0.0503
gauss 4
0.0421
gauss 5
0.047
gauss 6
0.0503
Figure 2.6: Solution of the Generalized Anderson’s task.
is satisﬁed. The prescribed ε ∈ (0, 0.5) is the desired upper bound on the probability
of misclassiﬁcation. The problem is transformed to the task of separation of two sets
of ellipsoids which is solved the Kozinec’s algorithm. The algorithm converges in ﬁnite
number of iterations if the given εsolution exists. Otherwise it can get stuck in an
endless loop.
Example: εsolution of the Generalized Anderson’s task
The εsolution of the generalized Anderson’s task is solved for the data mars.mat.
The desired maximal possible probability of misclassiﬁcation is set to 0.06 (it is 6%).
Both the ﬁrst and second class are described by three 2Ddimensional Gaussian distri
butions. The found linear classiﬁer and the input Gaussian distributions are visualized
(see Figure 2.7).
distrib = load(’mars’); % load Gaussians
options = struct(’err’,0.06’); % maximal desired error
model = eanders(distrib,options); % training
figure; pandr(model,distrib); % visualization
2.4.4 Generalized gradient optimization
This algorithm is implemented in the function ggradandr. The objective function
F(w, b) is convex but not diﬀerentiable thus the standard gradient optimization cannot
22
−5 −4 −3 −2 −1 0 1 2 3 4 5
−4
−3
−2
−1
0
1
2
3
4
5
gauss 1
0.0473
gauss 2
0.0496
gauss 3
0.0516
gauss 4
0.0432
gauss 5
0.0483
gauss 6
0.0516
Figure 2.7: εsolution of the Generalized Anderson’s task with maximal 0.06 probability
of misclassiﬁcation.
be used. However, the generalized gradient optimization method can be applied. The
algorithm repeats the following two steps:
I. The generalized gradient (∆w, ∆b) is computed.
II. The optimized parameters are updated
w
(t+1)
: = w
(t)
+ ∆w , b
(t+1)
= b
(t)
+ ∆b .
The algorithm iterates until the minimal change in the objective function is less than
the prescribed ε, i.e., the condition
F(w
(t+1)
, b
(t+1)
) −F(w
(t)
, b
(t)
) > ε ,
is violated. The maximal number of iterations can be used as the stopping condition
as well.
Example: Solving the Generalized Anderson’s task using the generalized
gradient optimization
The generalized Anderson’s task is solved for the data mars.mat. Both the ﬁrst
and second class are described by three 2Ddimensional Gaussian distributions. The
maximal number of iterations is limited to tmax=10000. The found linear classiﬁer as
well as the input Gaussian distributions are visualized (see Figure 2.8).
23
distrib = load(’mars’); % load Gaussians
options = struct(’tmax’,10000); % max # of iterations
model = ggradandr(distrib,options); % training
figure;
pandr( model, distrib ); % visualization
−5 −4 −3 −2 −1 0 1 2 3 4 5
−4
−3
−2
−1
0
1
2
3
4
5
gauss 1
0.0475
gauss 2
0.0498
gauss 3
0.0518
gauss 4
0.0434
gauss 5
0.0485
gauss 6
0.0518
Figure 2.8: Solution of the Generalized Anderson’s task after 10000 iterations of the
generalized gradient optimization algorithm.
24
Chapter 3
Feature extraction
The feature extraction step consists of mapping the input vector of observations x ∈ R
n
onto a new feature description z ∈ R
m
which is more suitable for given task. The linear
projection
z = W
T
x +b , (3.1)
is commonly used. The projection is given by the matrix W [n m] and bias vector
b ∈ R
m
. The Principal Component Analysis (PCA) (Section 3.1) is a representative of
the unsupervised learning method which yields the linear projection (3.1). The Linear
Discriminant Analysis (LDA) (Section 3.2) is a representative of the supervised learning
method yielding the linear projection. The linear data projection is implemented in
function linproj (See examples in Section 3.1 and Section 3.2). The data type used
to describe the linear data projection is deﬁned in Table 3.
The quadratic data mapping projects the input vector x ∈ R
n
onto the feature
space R
m
where m = n(n + 3)/2. The quadratic mapping φ: R
n
→R
m
is deﬁned as
φ(x) = [ x
1
, x
2
, . . . , x
n
,
x
1
x
1
, x
1
x
2
, . . . , x
1
x
n
,
x
2
x
2
, . . . , x
2
x
n
,
.
.
.
x
n
x
n
]
T
,
(3.2)
where x
i
, i = 1, . . . , n are entries of the vector x ∈ R
n
. The quadratic data mapping
is implemented in function qmap.
The kernel functions allow for nonlinear extensions of the linear feature extrac
tion methods. In this case, the input vectors x ∈ R
n
are mapped into a new higher
dimensional feature space T in which the linear methods are applied. This yields a
nonlinear (kernel) projection of data which has generally deﬁned as
z = A
T
k(x) +b , (3.3)
25
where A [l m] is a parameter matrix and b ∈ R
m
a parameter vector. The vector
k(x) = [k(x, x
1
), . . . , k(x, x
l
)]
T
is a vector of kernel functions centered in the training
data or their subset. The x ∈ R
n
stands for the input vector and z ∈ R
m
is the
vector of extracted features. The extracted features are projections of φ(x) onto m
vectors (or functions) from the feature space T. The Kernel PCA (Section 3.3) and
the Generalized Discriminant Analysis (GDA) (Section 3.5) are representatives of the
feature extraction methods yielding the kernel data projection (3.3). The kernel data
projection is implemented in the function kernelproj (See examples in Section 3.3
and Section 3.5). The data type used to describe the kernel data projection is deﬁned
in Table 3. The summary of implemented methods for feature extraction is given in
Table 3.1
Table 3.1: Implemented methods for feature extraction
linproj Linear data projection.
kerproj Kernel data projection.
qmap Quadratic data mapping.
lin2quad Merges linear rule and quadratic mapping.
lin2svm Merges linear rule and kernel projection.
lda Linear Discriminant Analysis.
pca Principal Component Analysis.
pcarec Computes reconstructed vector after PCA projection.
gda Generalized Discriminant Analysis.
greedykpca Greedy Kernel Principal Component Analysis.
kpca Kernel Principal Component Analysis.
kpcarec Reconstructs image after kernel PCA.
demo pcacomp Demo on image compression using PCA.
Table 3.2: Datatype used to describe linear data projection.
Linear data projection (structure array):
.W [n m] Projection matrix W = [w
1
, . . . , w
m
].
.b [m1] Bias vector b.
.fun = ’linproj’ Identiﬁes function associated with this data type.
3.1 Principal Component Analysis
Let T
X
= ¦x
1
, . . . , x
l
¦ be a set of training vectors from the ndimensional input space
R
n
. The set of vectors T
Z
= ¦z
1
, . . . , z
l
¦ is a lower dimensional representation of the
26
Table 3.3: Datatype used to describe kernel data projection.
Kernel data projection (structure array):
.Alpha [l m] Parameter matrix A [l m].
.b [m1] Bias vector b.
.sv.X [n l] Training vectors ¦x
1
, . . . , x
l
¦.
.options.ker [string] Kernel identiﬁer.
.options.arg [1 p] Kernel argument(s).
.fun = ’kernelproj’ Identiﬁes function associated with this data type.
input training vectors T
X
in the mdimensional space R
m
. The vectors T
Z
are obtained
by the linear orthonormal projection
z = W
T
x +b , (3.4)
where the matrix W [n m] and the vector b [m 1] are parameters of the projec
tion. The reconstructed vectors T
˜
X
= ¦ ˜ x
1
, . . . , ˜ x
l
¦ are computed by the linear back
projection
˜ x = W(z −b) , (3.5)
obtained by inverting (3.4). The mean square reconstruction error
ε
MS
(W, b) =
1
l
l
i=1
x
i
− ˜ x
i

2
, (3.6)
is a function of the parameters of the linear projections (3.4) and (3.5). The Principal
Component Analysis (PCA) is the linear orthonormal projection (3.4) which allows
for the minimal mean square reconstruction error (3.6) of the training data T
X
. The
parameters (W, b) of the linear projection are the solution of the optimization task
(W, b) = argmin
W
,b
ε
MS
(W
, b
) , (3.7)
subject to
¸w
i
w
j
) = δ(i, j) , ∀i, j ,
where w
i
, i = 1, . . . , m are column vectors of the matrix W = [w
1
, . . . , w
m
] and
δ(i, j) is the Kronecker delta function. The solution of the task (3.7) is the matrix
W = [w
1
, . . . , w
m
] containing the m eigenvectors of the sample covariance matrix
which have the largest eigen values. The vector b equals to W
T
µ, where µ is the
sample mean of the training data. The PCA is implemented in the function pca. A
demo showing the use of PCA for image compression is implemented in the script
demo pcacomp.
27
References: The PCA is treated throughout for instance in books [13, 3, 6].
Example: Principal Component Analysis
The input data are synthetically generated from the 2dimensional Gaussian dis
tribution. The PCA is applied to train the 1dimensional subspace to approximate
the input data with the minimal reconstruction error. The training data, the recon
structed data and the 1dimensional subspace given by the ﬁrst principal component
are visualized (see Figure 3.1).
X = gsamp([5;5],[1 0.8;0.8 1],100); % generate data
model = pca(X,1); % train PCA
Z = linproj(X,model); % lower dim. proj.
XR = pcarec(X,model); % reconstr. data
figure; hold on; axis equal; % visualization
h1 = ppatterns(X,’kx’);
h2 = ppatterns(XR,’bo’);
[dummy,mn] = min(Z);
[dummy,mx] = max(Z);
h3 = plot([XR(1,mn) XR(1,mx)],[XR(2,mn) XR(2,mx)],’r’);
legend([h1 h2 h3], ...
’Input vectors’,’Reconstructed’, ’PCA subspace’);
2 3 4 5 6 7 8
2.5
3
3.5
4
4.5
5
5.5
6
6.5
7
7.5
Input vectors
Reconstructed
PCA subspace
Figure 3.1: Example on the Principal Component Analysis.
28
3.2 Linear Discriminant Analysis
Let T
XY
= ¦(x
1
, y
1
), . . . , (x
l
, y
l
)¦, x
i
∈ R
n
, y ∈ ¸ = ¦1, 2, . . . , c¦ be labeled set
of training vectors. The withinclass S
W
and betweenclass S
B
scatter matrices are
described as
S
W
=
y∈Y
S
y
, S
y
=
i∈Iy
(x
i
−µ
i
)(x
i
−µ
i
)
T
,
S
B
=
y∈Y
[1
y
[(µ
y
−µ)(µ
y
−µ)
T
,
(3.8)
where the total mean vector µ and the class mean vectors µ
y
, y ∈ ¸ are deﬁned as
µ =
1
l
l
i=1
x
i
, µ
y
=
1
[1
y
[
i∈Iy
x
i
, y ∈ ¸ .
The goal of the Linear Discriminant Analysis (LDA) is to train the linear data projec
tion
z = W
T
x ,
such that the class separability criterion
F(W) =
det(
˜
S
B
)
det(
˜
S
W
)
=
det(S
B
)
det(S
W
)
,
is maximized. The
˜
S
B
is the betweenclass and
˜
S
W
is the withinclass scatter matrix
of projected data which is deﬁned similarly to (3.8).
References: The LDA is treated for instance in books [3, 6].
Example 1: Linear Discriminant Analysis
The LDA is applied to extract 2 features from the Iris data set iris.mat which
consists of the labeled 4dimensional data. The data after the feature extraction step
are visualized in Figure 3.2.
orig_data = load(’iris’); % load input data
model = lda(orig_data,2); % train LDA
ext_data = linproj(orig_data,model); % feature extraction
figure; ppatterns(ext_data); % plot ext. data
Example 2: PCA versus LDA
This example shows why the LDA is superior to the PCA when features good for
classiﬁcation are to be extracted. The synthetical data are generated from the Gaussian
mixture model. The LDA and PCA are trained on the generated data. The LDA and
29
−1.5 −1 −0.5 0 0.5 1 1.5
−0.5
−0.4
−0.3
−0.2
−0.1
0
0.1
0.2
0.3
0.4
0.5
Figure 3.2: Example shows 2dimensional data extracted from the originally 4
dimensional Iris data set using Linear Discriminant Analysis.
PCA directions are displayed as connection of the vectors projected to the trained LDA
and PCA subspaces and reprojected back to the input space (see Figure 3.3a). The
extracted data using the LDA and PCA model are displayed with the Gaussians ﬁtted
by the MaximumLikelihood (see Figure 3.3b).
% Generate data
distrib.Mean = [[5;4] [4;5]]; % mean vectors
distrib.Cov(:,:,1) = [1 0.9; 0.9 1]; % 1st covariance
distrib.Cov(:,:,2) = [1 0.9; 0.9 1]; % 2nd covariance
distrib.Prior = [0.5 0.5]; % Gaussian weights
data = gmmsamp(distrib,250); % sample data
lda_model = lda(data,1); % train LDA
lda_rec = pcarec(data.X,lda_model);
lda_data = linproj(data,lda_model);
pca_model = pca(data.X,1); % train PCA
pca_rec = pcarec(data.X,pca_model);
pca_data = linproj( data,pca_model);
figure; hold on; axis equal; % visualization
ppatterns(data);
30
h1 = plot(lda_rec(1,:),lda_rec(2,:),’r’);
h2 = plot(pca_rec(1,:),pca_rec(2,:),’b’);
legend([h1 h2],’LDA direction’,’PCA direction’);
figure; hold on;
subplot(2,1,1); title(’LDA’); ppatterns(lda_data);
pgauss(mlcgmm(lda_data));
subplot(2,1,2); title(’PCA’); ppatterns(pca_data);
pgauss(mlcgmm(pca_data));
3.3 Kernel Principal Component Analysis
The Kernel Principal Component Analysis (Kernel PCA) is the nonlinear extension of
the ordinary linear PCA. The input training vectors T
X
= ¦x
1
, . . . , x
l
¦, x
i
∈ A ⊆ R
n
are mapped by φ: A → T to a high dimensional feature space T. The linear PCA
is applied on the mapped data T
Φ
= ¦φ(x
1
), . . . , φ(x
l
)¦. The computation of the
principal components and the projection on these components can be expressed in
terms of dot products thus the kernel functions k: A A →R can be employed. The
kernel PCA trains the kernel data projection
z = A
T
k(x) +b , (3.9)
such that the reconstruction error
ε
KMS
(A, b) =
1
l
l
i=1
φ(x
i
) −
˜
φ(x
i
)
2
, (3.10)
is minimized. The reconstructed vector
˜
φ(x) is given as a linear combination of the
mapped data T
Φ
˜
φ(x) =
l
i=1
β
i
φ(x
i
) , β = A(z −b) . (3.11)
In contrast to the linear PCA, the explicit projection from the feature space T to the
input space A usually do not exist. The problem is to ﬁnd the vector x ∈ A its image
φ(x) ∈ T well approximates the reconstructed vector
˜
φ(x) ∈ T. This procedure is
implemented in the function kpcarec for the case of the Radial Basis Function (RBF)
kernel. The procedure consists of the following step:
I. Project input vector x
in
∈ R
n
onto its lower dimensional representation z ∈ R
m
using (3.9).
31
II. Computes vector x
out
∈ R
n
which is good preimage of the reconstructed vector
˜
φ(x
in
), such that
x
out
= argmin
x
φ(x) −
˜
φ(x
in
)
2
.
References: The kernel PCA is described at length in book [29, 30].
Example: Kernel Principal Component Analysis
The example shows using the kernel PCA for data denoising. The input data are
synthetically generated 2dimensional vectors which lie on a circle and are corrupted by
the Gaussian noise. The kernel PCA model with RBF kernel is trained. The function
kpcarec is used to ﬁnd preimages of the reconstructed data (3.11). The result is
visualized in Figure 3.4.
X = gencircledata([1;1],5,250,1); % generate circle data
options.ker = ’rbf’; % use RBF kernel
options.arg = 4; % kernel argument
options.new_dim = 2; % output dimension
model = kpca(X,options); % compute kernel PCA
XR = kpcarec(X,model); % compute reconstruced data
figure; % Visualization
h1 = ppatterns(X);
h2 = ppatterns(XR, ’+r’);
legend([h1 h2],’Input vectors’,’Reconstructed’);
3.4 Greedy kernel PCA algorithm
The Greedy Kernel Principal Analysis (Greedy kernel PCA) is an eﬃcient algorithm
to compute the ordinary kernel PCA. Let T
X
= ¦x
1
, . . . , x
l
¦, x
i
∈ R
n
be the set of
input training vectors. The goal is to train the kernel data projection
z = A
T
k
S
(x) +b , (3.12)
where A [d m] is the parameter matrix, b [m 1] is the parameter vector and
k
S
= [k(x, s
1
), . . . , k(x, s
l
)]
T
are the kernel functions centered in the vectors T
S
=
¦s
1
, . . . , s
d
¦. The vector set T
S
is a subset of training data T
X
. In contrast to the
ordinary kernel PCA, the subset T
S
does not contain all the training vectors T
X
thus
the complexity of the projection (3.12) is reduced compared to (3.9). The objective of
the Greedy kernel PCA is to minimize the reconstruction error (3.11) while the size d
of the subset T
S
is kept small. The Greedy kernel PCA algorithm is implemented in
the function greedykpca.
32
References: The implemented Greedy kernel PCA algorithm is described in [11].
Example: Greedy Kernel Principal Component Analysis
The example shows using kernel PCA for data denoising. The input data are
synthetically generated 2dimensional vectors which lie on a circle and are corrupted
with the Gaussian noise. The Greedy kernel PCA model with RBF kernel is trained.
The number of vectors deﬁning the kernel projection (3.12) is limited to 25. In contrast
the ordinary kernel PCA (see experiment in Section 3.3) requires all the 250 training
vectors. The function kpcarec is used to ﬁnd preimages of the reconstructed data
obtained by the kernel PCA projection. The training and reconstructed vectors are
visualized in Figure 3.5. The vector of the selected subset T
X
are visualized as well.
X = gencircledata([1;1],5,250,1); % generate training data
options.ker = ’rbf’; % use RBF kernel
options.arg = 4; % kernel argument
options.new_dim = 2; % output dimension
options.m = 25; % size of sel. subset
model = greedykpca(X,options); % run greedy algorithm
XR = kpcarec(X,model); % reconstructed vectors
figure; % visualization
h1 = ppatterns(X);
h2 = ppatterns(XR,’+r’);
h3 = ppatterns(model.sv.X,’ob’,12);
legend([h1 h2 h3], ...
’Training set’,’Reconstructed set’,’Selected subset’);
3.5 Generalized Discriminant Analysis
The Generalized Discriminant Analysis (GDA) is the nonlinear extension of the ordi
nary Linear Discriminant Analysis (LDA). The input training data T
XY
= ¦(x
1
, y
1
), . . . , (x
l
, y
l
)¦,
x
i
∈ A ⊆ R
n
, y ∈ ¸ = ¦1, . . . , c¦ are mapped by φ: A → T to a high dimen
sional feature space T. The ordinary LDA is applied on the mapped data T
ΦY
=
¦(φ(x
1
), y
1
), . . . , (φ(x
l
), y
l
)¦. The computation of the projection vectors and the pro
jection on these vectors can be expressed in terms of dot products thus the kernel
functions k: A A → R can be employed. The resulting kernel data projection is
deﬁned as
z = A
T
k(x) +b ,
33
where A [l m] is the matrix and b [m 1] the vector trained by the GDA. The
parameters (A, b) are trained to increase the betweenclass scatter and decrease the
withinclass scatter of the extracted data T
ZY
= ¦(z
1
, y
1
), . . . , (z
l
, y
l
)¦, z
i
∈ R
m
. The
GDA is implemented in the function gda.
References: The GDA was published in the paper [2].
Example: Generalized Discriminant Analysis
The GDA is trained on the Iris data set iris.mat which contains 3 classes of
4dimensional data. The RBF kernel with kernel width σ = 1 is used. Notice, that
setting σ too low leads to the perfect separability of the training data T
XY
but the kernel
projection is heavily overﬁtted. The extracted data ¸
ZY
are visualized in Figure 3.6.
orig_data = load(’iris’); % load data
options.ker = ’rbf’; % use RBF kernel
options.arg = 1; % kernel arg.
options.new_dim = 2; % output dimension
model = gda(orig_data,options); % train GDA
ext_data = kernelproj(orig_data,model); % feature ext.
figure; ppatterns(ext_data); % visualization
3.6 Combining feature extraction and linear classi
ﬁer
Let T
XY
= ¦(x
1
, y
1
), . . . , (x
l
, y
l
)¦ be a training set of observations x
i
∈ A ⊆ R
n
and
corresponding hidden states y
i
∈ ¸ = ¦1, . . . , c¦. Let φ: A →R
m
be a mapping
φ(x) = [φ
1
(x), . . . , φ
m
(x)]
T
deﬁning a nonlinear feature extraction step (e.g., quadratic data mapping or kernel
projection). The feature extraction step applied to the input data T
XY
yields extracted
data T
ΦY
= ¦(φ(x
1
), y
1
), . . . , (φ(x
l
), y
l
)¦. Let f(x) = ¸w φ(x)) + b be a linear
discriminant function found by an arbitrary training method on the extracted data
T
ΦY
. The function f can be seen as a nonlinear function of the input vectors x ∈ A,
as
f(x) = ¸w φ(x)) + b =
m
i=1
w
i
φ
i
(x) + b .
In the case of the quadratic data mapping (3.2) used as the feature extraction step,
the discriminant function can be written in the following form
f(x) = ¸x Ax) +¸x b) + c . (3.13)
34
The discriminant function (3.13) deﬁnes the quadratic classiﬁer described in Section 6.4.
Therefore the quadratic classiﬁer can be trained by (i) applying the quadratic data
mapping qmap, (ii) training a linear rule on the mapped data (e.g., perceptron, fld)
and (iii) combining the quadratic mapping and the linear rule. The combining is
implement in function lin2quad.
In the case of the kernel projection (3.3) applied as the feature extraction step, the
resulting discriminant function has the following form
f(x) = ¸α k(x)) + b . (3.14)
The discriminant function (3.14) deﬁnes the kernel (SVM type) classiﬁer described
in Section 5. This classiﬁer can be trained by (i) applying the kernel projection
kernelproj, (ii) training a linear rule on the extracted data and (iii) combing the kernel
projection and the linear rule. The combining is implemented in function lin2svm.
Example: Quadratic classiﬁer trained the Perceptron
This example shows how to train the quadratic classiﬁer using the Perceptron al
gorithm. The Perceptron algorithm produces the linear rule only. The input training
data are linearly nonseparable but become separable after the quadratic data map
ping is applied. Function lin2quad is used to compute the parameters of the quadratic
classiﬁer explicitly based on the found linear rule in the feature space and the mapping.
Figure 3.7 shows the decision boundary of the quadratic classiﬁer.
% load training data living in the input space
input_data = load(’vltava’);
% map data to the feature space
map_data = qmap(input_data);
% train linear rule in the feature sapce
lin_model = perceptron(map_data);
% compute parameters of the quadratic classifier
quad_model = lin2quad(lin_model);
% visualize the quadratic classifier
figure; ppatterns(input_data);
pboundary(quad_model);
Example: Kernel classiﬁer from the linear rule
This example shows how to train the kernel classiﬁer using arbitrary training algo
rithm for linear rule. The Fisher Linear Discriminant was used in this example. The
35
greedy kernel PCA algorithm was applied as the feature extraction step. The Riply’s
data set was used for training and evaluation. Figure 3.8 shows the found kernel clas
siﬁer, the training data and vectors used in the kernel expansion of the discriminant
function.
% load input data
trn = load(’riply_trn’);
% train kernel PCA by greedy algorithm
options = struct(’ker’,’rbf’,’arg’,1,’new_dim’,10);
kpca_model = greedykpca(trn.X,options);
% project data
map_trn = kernelproj(trn,kpca_model);
% train linear classifier
lin_model = fld(map_trn);
% combine linear rule and kernel PCA
kfd_model = lin2svm(kpca_model,lin_model);
% visualization
figure;
ppatterns(trn); pboundary(kfd_model);
ppatterns(kfd_model.sv.X,’ok’,13);
% evaluation on testing data
tst = load(’riply_tst’);
ypred = svmclass(tst.X,kfd_model);
cerror(ypred,tst.y)
ans =
0.0970
36
0 1 2 3 4 5 6 7 8
1
2
3
4
5
6
7
LDA direction
PCA direction
(a)
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
0
0.2
0.4
0.6
0.8
1
1.2
1.4
LDA
−4 −3 −2 −1 0 1 2 3 4 5
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
PCA
(b)
Figure 3.3: Comparison between PCA and LDA. Figure (a) shows input data and
found LDA (red) and PCA (blue) directions onto which the data are projected. Figure
(b) shows the class conditional Gaussians estimated from data projected data onto
LDA and PCA directions.
37
−8 −6 −4 −2 0 2 4 6 8 10
−8
−6
−4
−2
0
2
4
6
8
10
Input vectors
Reconstructed
Figure 3.4: Example on the Kernel Principal Component Analysis.
−8 −6 −4 −2 0 2 4 6 8 10
−6
−4
−2
0
2
4
6
8
Training set
Reconstructed set
Selected set
Figure 3.5: Example on the Greedy Kernel Principal Component Analysis.
38
−0.6 −0.5 −0.4 −0.3 −0.2 −0.1 0 0.1 0.2 0.3
−0.2
−0.15
−0.1
−0.05
0
0.05
0.1
0.15
Figure 3.6: Example shows 2dimensional data extracted from the originally 4
dimensional Iris data set using the Generalized Discriminant Analysis. In contrast
to LDA (see Figure 3.2 for comparison), the class clusters are more compact but at the
expense of a higher risk of overﬁtting.
−0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
Figure 3.7: Example shows quadratic classiﬁer found by the Perceptron algorithm on
the data mapped to the feature by the quadratic mapping.
39
−1.5 −1 −0.5 0 0.5 1
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
Figure 3.8: Example shows the kernel classiﬁer found by the Fisher Linear Discriminant
algorithm on the data mapped to the feature by the kernel PCA. The greedy kernel
PCA was used therefore only a subset of training data was used to determine the kernel
projection.
40
Chapter 4
Density estimation and clustering
The STPRtool represents a probability distribution function P
X
(x), x ∈ R
n
as a
structure array which must contain ﬁeld .fun (see Table 4.2). The ﬁeld .fun speciﬁes
function which is used to evaluate P
X
(x) for given set of realization ¦x
1
, . . . , x
l
¦ of a
random vector. Let the structure model represent a given distribution function. The
Matlab command
y = feval(X,model.fun)
is used to evaluate y
i
= P
X
(x
i
). The matrix X [n l] contains column vectors
¦x
1
, . . . , x
l
¦. The vector y [1 l] contains values of the distribution function. In
particular, the Gaussian distribution and the Gaussian mixture models are described
in Section 4.1. The summary of implemented methods for dealing with probability
density estimation and clustering is given in Table 4.1
4.1 Gaussian distribution and Gaussian mixture model
The value of the multivariate Gaussian probability distribution function in the vector
x ∈ R
n
is given by
P
X
(x) =
1
(2π)
n
2
_
det(Σ)
exp(−
1
2
(x −µ)
T
Σ
−1
(x −µ)) , (4.1)
where µ [n 1] is the mean vector and Σ [n n] is the covariance matrix. The
STPRtool uses speciﬁc datatype to describe a set of Gaussians which can be labeled,
i.e., each pair (µ
i
, Σ
i
) is assigned to the class y
i
∈ ¸. The covariance matrix Σ is
always represented as the square matrix [n n], however, the shape of the matrix can
be indicated in the optimal ﬁeld .cov type (see Table 4.3). The datatype is described
in Table 4.4. The evaluation of the Gaussian distribution is implemented in the function
41
Table 4.1: Implemented methods: Density estimation and clustering
gmmsamp Generates sample from Gaussian mixture model.
gsamp Generates sample from Gaussian distribution.
kmeans Kmeans clustering algorithm.
pdfgauss Evaluates for multivariate Gaussian distribution.
pdfgauss Evaluates Gaussian mixture model.
sigmoid Evaluates sigmoid function.
emgmm ExpectationMaximization Algorithm for Gaussian
mixture model.
melgmm Maximizes Expectation of LogLikelihood for Gaus
sian mixture.
mlcgmm Maximal Likelihood estimation of Gaussian mixture
model.
mlsigmoid Fitting a sigmoid function using ML estimation.
mmgauss Minimax estimation of Gaussian distribution.
rsde Reduced Set Density Estimator.
demo emgmm Demo on ExpectationMaximization (EM) algorithm.
demo mmgauss Demo on minimax estimation for Gaussian.
demo svmpout Fitting a posteriori probability to SVM output.
Table 4.2: Datatype used to represent probability distribution function.
Probability Distribution Function (structure array):
.fun [string] The name of a function which evaluates probability
distribution y = feval( X, model.fun ).
pdfgauss. The random sampling from the Gaussian distribution is implemented in the
function gsamp.
The Gaussian Mixture Model (GMM) is weighted sum of the Gaussian distributions
P
X
(x) =
y∈Y
P
Y
(y)P
XY
(x[y) , (4.2)
where P
Y
(y), y ∈ ¸ = ¦1, . . . , c¦ are weights and P
XY
(x[y) is the Gaussian distribu
tion (4.1) given by parameters (µ
y
, Σ
y
). The datatype used to represent the GMM
is described in Table 4.5. The evaluation of the GMM is implemented in the function
pdfgmm. The random sampling from the GMM is implemented in the function gmmsamp.
Example: Sampling from the Gaussian distribution
The example shows how to represent the Gaussian distribution and how to sample
data it. The sampling from univariate and bivariate Gaussians is shown. The distribu
tion function of the univariate Gaussian is visualized and compared to the histogram
42
Table 4.3: Shape of covariance matrix.
Parameter cov type:
’full’ Full covariance matrix.
’diag’ Diagonal covariance matrix.
’spherical’ Spherical covariance matrix.
Table 4.4: Datatype used to represent the labeled set of Gaussians.
Labeled set of Gaussians (structure array):
.Mean [n k] Mean vectors ¦µ
1
, . . . , µ
k
¦.
.Cov [n n k] Covariance matrices ¦Σ
1
, . . . , Σ
k
¦.
.y [1 k] Labels ¦y
1
, . . . , y
k
¦ assigned to Gaussians.
.cov type [string] Optional ﬁeld which indicates shape of the covariance
matrix (see Table 4.3).
.fun = ’pdfgauss’ Identiﬁes function associated with this data type.
of the sample data (see Figure 4.1a). The isoline of the distribution function of the
bivariate Gaussian is visualized together with the sample data (see Figure 4.1b).
% univariate case
model = struct(’Mean’,1,’Cov’,2); % Gaussian parameters
figure; hold on;
% plot pdf. of the Gaussisan
plot([4:0.1:5],pdfgauss([4:0.1:5],model),’r’);
[Y,X] = hist(gsamp(model,500),10); % compute histogram
bar(X,Y/500); % plot histogram
% bivariate case
model.Mean = [1;1]; % Mean vector
model.Cov = [1 0.6; 0.6 1]; % Covariance matrix
figure; hold on;
ppatterns(gsamp(model,250)); % plot sampled data
pgauss(model); % plot shape
4.2 MaximumLikelihood estimation of GMM
Let the probability distribution
P
XY Θ
(x, y[θ) = P
XY Θ
(x[y, θ)P
Y Θ
(y[θ) ,
43
Table 4.5: Datatype used to represent the Gaussian Mixture Model.
Gaussian Mixture Model (structure array):
.Mean [n c] Mean vectors ¦µ
1
, . . . , µ
c
¦.
.Cov [n n c] Covariance matrices ¦Σ
1
, . . . , Σ
c
¦.
.Prior [c 1] Weights of Gaussians ¦P
K
(1), . . . , P
K
(c)¦.
.fun = ’pdfgmm’ Identiﬁes function associated with this data type.
−4 −3 −2 −1 0 1 2 3 4 5 6
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
−2 −1 0 1 2 3 4 5
−3
−2
−1
0
1
2
3
4
5
gauss 1
0.127
(a) (b)
Figure 4.1: Sampling from the univariate (a) and bivariate (b) Gaussian distribution.
be known up to parameters θ ∈ Θ. The vector of observations x ∈ R
n
and the hidden
state y ∈ ¸ = ¦1, . . . , c¦ are assumed to be realizations of random variables which
are independent and identically distributed according to the P
XY Θ
(x, y[θ). The con
ditional probability distribution P
XY Θ
(x[y, θ) is assumed to be Gaussian distribution.
The θ involves parameters (µ
y
, Σ
y
), y ∈ ¸ of Gaussian components and the values
of the discrete distribution P
Y Θ
(y[θ), y ∈ ¸. The marginal probability P
XΘ
(x[θ) is
the Gaussian mixture model (4.2). The MaximumLikelihood estimation of the pa
rameters θ of the GMM for the case of complete and incomplete data is described in
Section 4.2.1 and Section 4.2.2, respectively.
4.2.1 Complete data
The input is a set T
XY
= ¦(x
1
, y
1
), . . . , (x
l
, y
l
)¦ which contains both vectors of ob
servations x
i
∈ R
n
and corresponding hidden states y
i
∈ ¸. The pairs (x
i
, y
i
) are
assumed to samples from random variables which are independent and identically dis
tributed (i.i.d.) according to P
XY Θ
(x, y[θ). The logarithm of probability P(T
XY
[θ)
is referred to as the loglikelihood L(θ[T
XY
) of θ with respect to T
XY
. Thanks to the
44
i.i.d. assumption the loglikelihood can be factorized as
L(θ[T
XY
) = log P(T
XY
[θ) =
l
i=1
log P
XY Θ
(x
i
, y
i
[θ) .
The problem of MaximumLikelihood (ML) estimation from the complete data T
XY
is
deﬁned as
θ
∗
= argmax
θ∈Θ
L(θ[T
XY
) = argmax
θ∈Θ
l
i=1
log P
XY Θ
(x
i
, y
i
[θ) . (4.3)
The ML estimate of the parameters θ of the GMM from the complete data set T
XY
has an analytical solution. The computation of the ML estimate is implemented in the
function mlcgmm.
Example: MaximumLikelihood estimate from complete data
The parameters of the Gaussian mixture model is estimated from the complete
Riply’s data set riply trn. The data are binary labeled thus the GMM has two Gaus
sians components. The estimated GMM is visualized as: (i) shape of components of the
GMM (see Figure 4.2a) and (ii) contours of the distribution function (see Figure 4.2b)
.
data = load(’riply_trn’); % load labeled (complete) data
model = mlcgmm(data); % ML estimate of GMM
% visualization
figure; hold on;
ppatterns(data); pgauss(model);
figure; hold on;
ppatterns(data);
pgmm(model,struct(’visual’,’contour’));
4.2.2 Incomplete data
The input is a set T
X
= ¦x
1
, . . . , x
l
¦ which contains only vectors of observations
x
i
∈ R
n
without knowledge about the corresponding hidden states y
i
∈ ¸. The
logarithm of probability P(T
X
[θ) is referred to as the loglikelihood L(θ[T
X
) of θ with
respect to T
X
. Thanks to the i.i.d. assumption the loglikelihood can be factorized as
L(θ[T
X
) = log P(T
X
[θ) =
l
i=1
y∈Y
P
XY Θ
(x
i
[y
i
, θ)P
Y Θ
(y
i
[θ) .
45
The problem of MaximumLikelihood (ML) estimation from the incomplete data T
X
is
deﬁned as
θ
∗
= argmax
θ∈Θ
L(θ[T
X
) = argmax
θ∈Θ
l
i=1
y∈Y
P
XY Θ
(x
i
[y
i
, θ)P
Y Θ
(y
i
[θ) . (4.4)
The ML estimate of the parameters θ of the GMM from the incomplete data set
T
XY
does not have an analytical solution. The numerical optimization has to be used
instead.
The ExpectationMaximization (EM) algorithm is an iterative procedure for ML
estimation from the incomplete data. The EM algorithm builds a sequence of parameter
estimates
θ
(0)
, θ
(1)
, . . . , θ
(t)
,
such that the loglikelihood L(θ
(t)
[T
X
) monotonically increases, i.e.,
L(θ
(0)
[T
X
) < L(θ
(1)
[T
X
) < . . . < L(θ
(t)
[T
X
)
until a stationary point L(θ
(t−1)
[T
X
) = L(θ
(t)
[T
X
) is achieved. The EM algorithm is a
local optimization technique thus the estimated parameters depend on the initial guess
θ
(0)
. The random guess or Kmeans algorithm (see Section 4.5) are commonly used to
select the initial θ
(0)
. The EM algorithm for the GMM is implemented in the function
emgmm. An interactive demo on the EM algorithm for the GMM is implemented in
demo emgmm.
References: The EM algorithm (including convergence proofs) is described in book [26].
A nice description can be found in book [3]. The monograph [18] is entirely devoted
to the EM algorithm. The ﬁrst papers describing the EM are [25, 5].
Example: MaximumLikelihood estimate from incomplete data
The estimation of the parameters of the GMM is demonstrated on a synthetic
data. The ground truth model is the GMM with two univariate Gaussian components.
The ground truth model is sampled to obtain a training (incomplete) data. The EM
algorithm is applied to estimate the GMM parameters. The random initialization is
used by default. The ground truth model, sampled data and the estimate model are
visualized in Figure 4.3a. The plot of the loglikelihood function L(θ
(t)
[T
X
) with respect
to the number of iterations t is visualized in Figure 4.3b.
% ground truth Gaussian mixture model
true_model.Mean = [2 2];
true_model.Cov = [1 0.5];
true_model.Prior = [0.4 0.6];
sample = gmmsamp(true_model, 250);
46
% ML estimation by EM
options.ncomp = 2; % number of Gaussian components
options.verb = 1; % display progress info
estimated_model = emgmm(sample.X,options);
% visualization
figure;
ppatterns(sample.X);
h1 = pgmm(true_model,struct(’color’,’r’));
h2 = pgmm(estimated_model,struct(’color’,’b’));
legend([h1(1) h2(1)],’Ground truth’, ’ML estimation’);
figure; hold on;
xlabel(’iterations’); ylabel(’loglikelihood’);
plot( estimated_model.logL );
4.3 Minimax estimation
The input is a set T
X
= ¦x
1
, . . . , x
l
¦ contains vectors x
i
∈ R
n
which are realizations of
random variable distributed according to P
XΘ
(x[θ). The vectors T
X
are assumed to be
a realizations with high value of the distribution function P
XΘ
(x[θ). The P
XΘ
(x[θ)
is known up to the parameter θ ∈ Θ. The minimax estimation is deﬁned as
θ
∗
= argmax
θ∈Θ
min
x∈T
X
log P
XΘ
(x[θ) . (4.5)
If a procedure for the MaximumLikelihood estimate (global solution) of the distri
bution P
XΘ
(x[θ) exists then a numerical iterative algorithm which converges to the
optimal estimate θ
∗
can be constructed. The algorithm builds a series of estimates
θ
(0)
, θ
(1)
, . . . , θ
(t)
which converges to the optimal estimate θ
∗
. The algorithm converges
in a ﬁnite number of iterations to such an estimate θ that the condition
min
x∈T
X
log P
XΘ
(x[θ
∗
) − min
x∈T
X
log P
XΘ
(x[θ) < ε , (4.6)
holds for ε > 0 which deﬁnes closeness to the optimal solution θ
∗
. The inequality (4.6)
can be evaluated eﬃciently thus it is a natural choice for the stopping condition of
the algorithm. The STPRtool contains an implementation of the minimax algorithm
for the Gaussian distribution (4.1). The algorithm is implemented in the function
mmgauss. An interactive demo on the minimax estimation for the Gaussian distribution
is implemented in the function demo mmgauss.
References: The minimax algorithm for probability estimation is described and ana
lyzed in Chapter 8 of the book [26].
47
Example: Minimax estimation of the Gaussian distribution
The bivariate Gaussian distribution is described by three samples which are known
to have high value of the distribution function. The minimax algorithm is applied
to estimate the parameters of the Gaussian. In this particular case, the minimax
problem is equivalent to searching for the minimal enclosing ellipsoid. The found
solution is visualized as the isoline of the distribution function at the point P
XΘ
(x[θ)
(see Figure 4.4).
X = [[0;0] [1;0] [0;1]]; % points with high p.d.f.
model = mmgauss(X); % run minimax algorithm
% visualization
figure; ppatterns(X,’xr’,13);
pgauss(model, struct(’p’,exp(model.lower_bound)));
4.4 Probabilistic output of classiﬁer
In general, the discriminant function of an arbitrary classiﬁer does not have meaning of
probability (e.g., SVM, Perceptron). However, the probabilistic output of the classiﬁer
can help in postprocessing, for instance, in combining more classiﬁers together. Fitting
a sigmoid function to the classiﬁer output is a way how to solve this problem.
Let T
XY
= ¦(x
1
, y
1
), . . . , (x
l
, y
l
)¦ is the training set composed of the vectors x
i
∈
A ⊆ R
n
and the corresponding binary hidden states y
i
∈ ¸ = ¦1, 2¦. The training
set T
XY
is assumed to be identically and independently distributed from underlying
distribution. Let f: A ⊆ R
n
→R be a discriminant function trained from the data T
XY
by an arbitrary learning method. The aim is to estimate parameters of a posteriori
distribution P
Y FΘ
(y[f(x), θ) of the hidden state y given the value of the discriminant
function f(x). The distribution is modeled by a sigmoid function
P
Y FΘ
(1[f(x), θ) =
1
1 + exp(a
1
f(x) + a
2
)
,
which is determined by parameters θ = [a
1
, a
2
]
T
. The loglikelihood function is deﬁned
as
L(θ[T
XY
) =
l
i=1
log P
Y FΘ
(y
i
[f(x
i
), θ) .
The parameters θ = [a
1
, a
2
]
T
can be estimated by the maximumlikelihood method
θ = argmax
θ
L(θ
[T
XY
) = argmax
θ
l
i=1
log P
Y FΘ
(y
i
[f(x
i
), θ
) . (4.7)
48
The STPRtool provides an implementation mlsigmoid which uses the Matlab Opti
mization toolbox function fminunc to solve the task (4.7). The function produces the
parameters θ = [a
1
, a
2
]
T
which are stored in the datatype describing the sigmoid (see
Table 4.6). The evaluation of the sigmoid is implemented in function sigmoid.
Table 4.6: Datatype used to describe sigmoid function.
Sigmoid function (structure array):
.A [2 1] Parameters θ = [a
1
, a
2
]
T
of the sigmoid function.
.fun = ’sigmoid’ Identiﬁes function associated with this data type.
References: The method of ﬁtting the sigmoid function is described in [22].
Example: Probabilistic output for SVM
This example is implemented in the script demo svmpout. The example shows how
to train the probabilistic output for the SVM classiﬁer. The Riply’s data set riply trn
is used for training the SVM and the ML estimation of the sigmoid. The ML estimated
of Gaussian mixture model (GMM) of the SVM output is used for comparison. The
GMM allows to compute the a posteriori probability for comparison to the sigmoid
estimate. The example is implemented in the script demo svmpout. Figure 4.5a shows
found probabilistic models of a posteriori probability of the SVM output. Figure 4.5b
shows the decision boundary of SVM classiﬁer for illustration.
% load training data
data = load(’riply_trn’);
% train SVM
options = struct(’ker’,’rbf’,’arg’,1,’C’,10);
svm_model = smo(data,options);
% compute SVM output
[dummy,svm_output.X] = svmclass(data.X,svm_model);
svm_output.y = data.y;
% fit sigmoid function to svm output
sigmoid_model = mlsigmoid(svm_output);
% ML estimation of GMM model of SVM output
gmm_model = mlcgmm(svm_output);
% visulization
figure; hold on;
49
xlabel(’svm output f(x)’); ylabel(’p(y=1f(x))’);
% sigmoid model
fx = linspace(min(svm_output.X), max(svm_output.X), 200);
sigmoid_apost = sigmoid(fx,sigmoid_model);
hsigmoid = plot(fx,sigmoid_apost,’k’);
ppatterns(svm_output);
% GMM model
pcond = pdfgauss( fx, gmm_model);
gmm_apost = (pcond(1,:)*gmm_model.Prior(1))./...
(pcond(1,:)*gmm_model.Prior(1)+(pcond(2,:)*gmm_model.Prior(2)));
hgmm = plot(fx,gmm_apost,’g’);
hcomp = pgauss(gmm_model);
legend([hsigmoid,hgmm,hcomp],’P(y=1f(x)) MLSigmoid’,...
’P(y=1f(x)) MLGMM’,’P(f(x)y=1) MLGMM’,’P(f(x)y=2) MLGMM’);
% SVM decision boundary
figure; ppatterns(data); psvm(svm_model);
4.5 Kmeans clustering
The input is a set T
X
= ¦x
1
, . . . , x
l
¦ which contains vectors x
i
∈ R
n
. Let θ =
¦µ
1
, . . . , µ
c
¦, µ
i
∈ R
n
be a set of unknown cluster centers. Let the objective function
F(θ) be deﬁned as
F(θ) =
1
l
l
i=1
min
y=1,...,c
µ
y
−x
i

2
.
The small value of F(θ) indicates that the centers θ well describe the input set T
X
.
The task is to ﬁnd such centers θ which describe the set T
X
best, i.e., the task is to
solve
θ
∗
= argmin
θ
F(θ) = argmin
θ
1
l
l
i=1
min
y=1,...,c
µ
y
−x
i

2
.
The Kmeans
1
is an iterative procedure which iteratively builds a series θ
(0)
, θ
(1)
, . . . , θ
(t)
.
The Kmeans algorithm converges in a ﬁnite number of steps to the local optimum of
the objective function F(θ). The found solution depends on the initial guess θ
(0)
which
1
The K in the name Kmeans stands for the number of centers which is here denoted by c.
50
is usually selected randomly. The Kmeans algorithm is implemented in the function
kmeans.
References: The Kmeans algorithm is described in the books [26, 6, 3].
Example: Kmeans clustering
The Kmeans algorithm is used to ﬁnd 4 clusters in the Riply’s training set. The
solution is visualized as the cluster centers. The vectors classiﬁed to the corresponding
centers are distinguished by color (see Figure 4.6a). The plot of the objective function
F(θ
(t)
) with respect to the number of iterations is shown in Figure 4.6b.
data = load(’riply_trn’); % load data
[model,data.y] = kmeans(data.X, 4 ); % run kmeans
% visualization
figure; ppatterns(data);
ppatterns(model.X,’sk’,14);
pboundary( model );
figure; hold on;
xlabel(’t’); ylabel(’F’);
plot(model.MsErr);
51
−1.5 −1 −0.5 0 0.5 1
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
gauss 1
0.981
gauss 2
1.45
(a)
−1.5 −1 −0.5 0 0.5 1
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
(b)
Figure 4.2: Example of the MaximumLikelihood of the Gaussian Mixture Model from
the incomplete data. Figure (a) shows components of the GMM and Figure (b) shows
contour plot of the distribution function.
52
−5 −4 −3 −2 −1 0 1 2 3 4 5
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
Ground truth
ML estimation
(a)
0 5 10 15 20 25
−460
−458
−456
−454
−452
−450
−448
−446
−444
−442
iterations
l
o
g
−
l
i
k
e
l
i
h
o
o
d
(b)
Figure 4.3: Example of using the EM algorithm for ML estimation of the Gaussian
mixture model. Figure (a) shows ground truth and the estimated GMM and Figure(b)
shows the loglikelihood with respect to the number of iterations of the EM.
53
−0.4 −0.2 0 0.2 0.4 0.6 0.8 1
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
gauss 1
0.304
Figure 4.4: Example of the Minimax estimation of the Gaussian distribution.
54
−4 −3 −2 −1 0 1 2 3 4 5
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
svm output f(x)
p
(
y
=
1

f
(
x
)
)
P(y=1f(x)) ML−Sigmoid
P(y=1f(x)) ML−GMM
P(f(x)y=1) ML−GMM
P(f(x)y=2) ML−GMM
(a)
−1.5 −1 −0.5 0 0.5 1
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
(b)
Figure 4.5: Example of ﬁtting a posteriori probability to the SVM output. Figure (a)
shows the sigmoid model and GMM model both ﬁtted by ML estimated. Figure (b)
shows the corresponding SVM classiﬁer.
55
−1.5 −1 −0.5 0 0.5 1
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
(a)
0 2 4 6 8 10 12
0.04
0.06
0.08
0.1
0.12
0.14
0.16
t
F
(b)
Figure 4.6: Example of using the Kmeans algorithm to cluster the Riply’s data set.
Figure (a) shows found clusters and Figure (b) contains the plot of the objective func
tion F(θ
(t)
) with respect to the number of iterations.
56
Chapter 5
Support vector and other kernel
machines
5.1 Support vector classiﬁers
The binary support vector classiﬁer uses the discriminant function f: A ⊆ R
n
→R of
the following form
f(x) = ¸α k
S
(x)) + b . (5.1)
The k
S
(x) = [k(x, s
1
), . . . , k(x, s
d
)]
T
is the vector of evaluations of kernel functions
centered at the support vectors o = ¦s
1
, . . . , s
d
¦, s
i
∈ R
n
which are usually subset of
the training data. The α ∈ R
l
is a weight vector and b ∈ R is a bias. The binary
classiﬁcation rule q: A → ¸ = ¦1, 2¦ is deﬁned as
q(x) =
_
1 for f(x) ≥ 0 ,
2 for f(x) < 0 .
(5.2)
The datatype used by the STPRtool to represent the binary SVM classiﬁer is described
in Table 5.2.
The multiclass generalization involves a set of discriminant functions f
y
: A ⊆ R
n
→
R, y ∈ ¸ = ¦1, 2, . . . , c¦ deﬁned as
f
y
(x) = ¸α
y
k
S
(x)) + b
y
, y ∈ ¸ .
Let the matrix A = [α
1
, . . . , α
c
] be composed of all weight vectors and b = [b
1
, . . . , b
c
]
T
be a vector of all biases. The multiclass classiﬁcation rule q: A → ¸ = ¦1, 2, . . . , c¦ is
deﬁned as
q(x) = argmax
y∈Y
f
y
(x) . (5.3)
The datatype used by the STPRtool to represent the multiclass SVM classiﬁer is
described in Table 5.3. Both the binary and the multiclass SVM classiﬁers are imple
mented in the function svmclass.
57
The majority voting strategy is other commonly used method to implement the
multiclass SVM classiﬁer. Let q
j
: A ⊆ R
n
→ ¦y
1
j
, y
2
j
¦, j = 1, . . . , g be a set of g binary
SVM rules (5.2). The jth rule q
j
(x) classiﬁes the inputs x into class y
1
j
∈ ¸ or y
2
j
∈ ¸.
Let v(x) be a vector [c 1] of votes for classes ¸ when the input x is to be classiﬁed.
The vector v(x) = [v
1
(x), . . . , v
c
(x)]
T
is computed as:
Set v
y
= 0, ∀y ∈ ¸.
For j = 1, . . . , g do
v
y
(x) = v
y
(x) + 1 where y = q
j
(x).
end.
The majority votes based multiclass classiﬁer assigns the input x into such class y ∈ ¸
having the majority of votes
y = argmax
y
=1,...,g
v
y
(x) . (5.4)
The datatype used to represent the majority voting strategy for the multiclass SVM
classiﬁer is described in Table 5.4. The strategy (5.4) is implemented in the function
mvsvmclass. The implemented methods to deal with the SVM and related kernel
classiﬁers are enlisted in Table 5.1
References: Books describing Support Vector Machines are for example [31, 4, 30]
5.2 Binary Support Vector Machines
The input is a set T
XY
= ¦(x
1
, y
1
), . . . , (x
l
, y
l
)¦ of training vectors x
i
∈ R
n
and cor
responding hidden states y
i
∈ ¸ = ¦1, 2¦. The linear SVM aims to train the linear
discriminant function f(x) = ¸w x) + b of the binary classiﬁer
q(x) =
_
1 for f(x) ≥ 0 ,
2 for f(x) < 0 .
The training of the optimal parameters (w
∗
, b
∗
) is transformed to the following quadratic
programming task
(w
∗
, b
∗
) = argmin
w,b
1
2
w
2
+ C
l
i=1
ξ
p
i
, (5.5)
¸w x
i
) + b ≥ +1 −ξ
i
, i ∈ 1
1
,
¸w x
i
) + b ≤ −1 + ξ
i
, i ∈ 1
2
,
ξ
i
≥ 0 , i ∈ 1
1
∪ 1
2
.
58
Table 5.1: Implemented methods: Support vector and other kernel machines.
svmclass Support Vector Machines Classiﬁer.
mvsvmclass Majority voting multiclass SVM classiﬁer.
evalsvm Trains and evaluates Support Vector Machines classi
ﬁer.
bsvm2 Multiclass BSVM with L2soft margin.
oaasvm Multiclass SVM using OneAgainstAll decomposi
tion.
oaosvm Multiclass SVM using OneAgainstOne decomposi
tion.
smo Sequential Minimal Optimization for binary SVM
with L1soft margin.
svmlight Interface to SV M
light
software.
svmquadprog SVM trained by Matlab Optimization Toolbox.
diagker Returns diagonal of kernel matrix of given data.
kdist Computes distance between vectors in kernel space.
kernel Evaluates kernel function.
kfd Kernel Fisher Discriminant.
kperceptr Kernel Perceptron.
minball Minimal enclosing ball in kernel feature space.
rsrbf Reduced Set Method for RBF kernel expansion.
rbfpreimg RBF preimage by Schoelkopf’s ﬁxedpoint algorithm.
rbfpreimg2 RBF preimage problem by Gradient optimization.
rbfpreimg3 RBF preimage problem by KwokTsang’s algorithm.
demo svm Demo on Support Vector Machines.
The 1
1
= ¦i: y
i
= 1¦ and 1
2
= ¦i: y
i
= 2¦ are sets of indices. The C > 0 stands for the
regularization constant. The ξ
i
≥ 0, i ∈ 1
1
∪ 1
2
are slack variables used to relax the
inequalities for the case of nonseparable data. The linear p = 1 and quadratic p = 2
term for the slack variables ξ
i
is used. In the case of the parameter p = 1, the L1soft
margin SVM is trained and for p = 2 it is denoted as the L2soft margin SVM.
The training of the nonlinear (kernel) SVM with L1soft margin corresponds to
solving the following QP task
β
∗
= argmax
β
¸β 1) −
1
2
¸β Hβ) , (5.6)
subject to
¸β γ) = 1 ,
β ≥ 0 ,
β ≤ 1C .
59
Table 5.2: Datatype used to describe binary SVM classiﬁer.
Binary SVM classiﬁer (structure array):
.Alpha [d 1] Weight vector α.
.b [1 1] Bias b.
.sv.X [n d] Support vectors o = ¦s
1
, . . . , s
d
¦.
.options.ker [string] Kernel identiﬁer.
.options.arg [1 p] Kernel argument(s).
.fun = ’svmclass’ Identiﬁes function associated with this data type.
Table 5.3: Datatype used to describe multiclass SVM classiﬁer.
Multiclass SVM classiﬁer (structure array):
.Alpha [d c] Weight vectors A = [α
1
, . . . , α
c
].
.b [c 1] Vector of biased b = [b
1
, . . . , b
c
]
T
.
.sv.X [n d] Support vectors o = ¦s
1
, . . . , s
d
¦.
.options.ker [string] Kernel identiﬁer.
.options.arg [1 p] Kernel argument(s).
.fun = ’svmclass’ Identiﬁes function associated with this data type.
The 0 [l 1] is vector of zeros and the 1 [l 1] is vector of ones. The vector γ [l 1]
and the matrix H [l l] are constructed as follows
γ
i
=
_
1 if y
i
= 1 ,
−1 if y
i
= 2 ,
H
i,j
=
_
k(x
i
, x
j
) if y
i
= y
j
,
−k(x
i
, x
j
) if y
i
,= y
j
,
where k: A A → R is selected kernel function. The discriminant function (5.1) is
determined by the weight vector α, bias b and the set of support vectors o. The
set of support vectors is denoted as o = ¦x
i
: i ∈ 1
SV
¦, where the set of indices is
1
SV
= ¦i: β
i
> 0¦. The weight vectors are compute as
α
i
=
_
β
i
if y
i
= 1 ,
−β
i
if y
i
= 2 ,
for i ∈ 1
SV
. (5.7)
The bias b can be computed from the KarushKuhnTucker (KKT) conditions as the
following constrains
f(x
i
) = ¸α k
S
(x
i
)) + b = +1 for i ∈ ¦j: y
j
= 1, 0 < β
j
< C¦ ,
f(x
i
) = ¸α k
S
(x
i
)) + b = −1 for i ∈ ¦j: y
j
= 2, 0 < β
j
< C¦ ,
must hold for the optimal solution. The bias b is computed as an average over all the
constrains such that
b =
1
[1
b
[
i∈I
b
γ
i
−¸α k
S
(x
i
)) .
60
Table 5.4: Datatype used to describe the majority voting strategy for multiclass SVM
classiﬁer.
Majority voting SVM classiﬁer (structure array):
.Alpha [d g] Weight vectors A = [α
1
, . . . , α
g
].
.b [g 1] Vector of biased b = [b
1
, . . . , b
g
]
T
.
.bin y [2 g] Targets for the binary rules. The vector bin y(1,:)
contains [y
1
1
, y
1
2
, . . . , y
1
g
] and the bin y(2,:) contains
[y
2
1
, y
2
2
, . . . , y
2
g
].
.sv.X [n d] Support vectors o = ¦s
1
, . . . , s
d
¦.
.options.ker [string] Kernel identiﬁer.
.options.arg [1 p] Kernel argument(s).
.fun = ’mvsvmclass’ Identiﬁes function associated with this data type.
where 1
b
= ¦i: 0 < β
i
< C¦ are the indices of the vectors on the boundary.
The training of the nonlinear (kernel) SVM with L2soft margin corresponds to
solving the following QP task
β
∗
= argmax
β
¸β 1) −
1
2
¸β (H+
1
2C
E)β) , (5.8)
subject to
¸β γ) = 1 ,
β ≥ 0 ,
where E is the identity matrix. The set of support vectors is set to o = ¦x
i
: i ∈ 1
SV
¦
where the set of indices is 1
SV
= ¦i: β
i
> 0¦. The weight vector α is given by (5.7).
The bias b can be computed from the KKT conditions as the following constrains
f(x
i
) = ¸α k
S
(x
i
)) + b = +1 −
α
i
2C
for i ∈ ¦i: y
j
= 1, β
j
> 0¦ ,
f(x
i
) = ¸α k
S
(x
i
)) + b = −1 +
α
i
2C
for i ∈ ¦i: y
j
= 2, β
j
> 0¦ ,
must hold at the optimal solution. The bias b is computed as an average over all the
constrains thus
b =
1
[1
SV
[
i∈I
SV
γ
i
(1 −
α
i
2C
) −¸α k
S
(x
i
)) .
The binary SVM solvers implemented in the STPRtool are enlisted in Table 5.5.
An interactive demo on the binary SVM is implemented in demo svm.
References: The Sequential Minimal Optimizer (SMO) is described in the books [23,
27, 30]. The SV M
light
is described for instance in the book [27].
Example: Training binary SVM.
61
Table 5.5: Implemented binary SVM solvers.
Function description
smo Implementation of the Sequential Minimal Optimizer (SMO)
used to train the binary SVM with L1soft margin. It can
handle moderate size problems.
svmlight Matlab interface to SV M
light
software (Version: 5.00) which
trains the binary SVM with L1soft margin. The executable
ﬁle svm learn for the Linux OS must be installed. It can be
downloaded from http://svmlight.joachims.org/. The
SV M
light
is very powerful optimizer which can handle large
problems.
svmquadprog SVM solver based on the QP solver quadprog of the Mat
lab optimization toolbox. It trains both L1soft and L2soft
margin binary SVM. The function quadprog is a part of the
Matlab Optimization toolbox. It useful for small problems
only.
The binary SVM is trained on the Riply’s training set riply trn. The SMO solver
is used in the example but other solvers (svmlight, svmquadprog) can by used as well.
The trained SVM classiﬁer is evaluated on the testing data riply tst. The decision
boundary is visualized in Figure 5.1.
trn = load(’riply_trn’); % load training data
options.ker = ’rbf’; % use RBF kernel
options.arg = 1; % kernel argument
options.C = 10; % regularization constant
% train SVM classifier
model = smo(trn,options);
% model = svmlight(trn,options);
% model = svmquadprog(trn,options);
% visualization
figure;
ppatterns(trn); psvm(model);
tst = load(’riply_tst’); % load testing data
ypred = svmclass(tst.X,model); % classify data
cerror(ypred,tst.y) % compute error
62
ans =
0.0920
−1.5 −1 −0.5 0 0.5 1
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
Figure 5.1: Example showing the binary SVM classiﬁer trained on the Riply’s data.
5.3 OneAgainstAll decomposition for SVM
The input is a set T
XY
= ¦(x
1
, y
1
), . . . , (x
l
, y
l
)¦ of training vectors x
i
∈ A ⊆ R
n
and corresponding hidden states y
i
∈ ¸ = ¦1, 2, . . . , c¦. The goal is to train the
multiclass rule q: A ⊆ R
n
→ ¸ deﬁned by (5.3). The problem can be solved by
the OneAgainstAll (OAA) decomposition. The OAA decomposition transforms the
multiclass problem into a series of c binary subtasks that can be trained by the binary
SVM. Let the training set T
y
XY
= ¦(x
1
, y
1
), . . . , (x
l
, y
l
)¦ contain the modiﬁed hidden
states deﬁned as
y
i
=
_
1 for y = y
i
,
2 for y ,= y
i
.
The discriminant functions
f
y
(x) = ¸α
y
k
S
(x)) + b
y
, y ∈ ¸ ,
are trained by the binary SVM solver from the set T
y
XY
, y ∈ ¸. The OAA decomposition
is implemented in the function oaasvm. The function allows to specify any binary SVM
63
solver (see Section 5.2) used to solve the binary problems. The result is the multiclass
SVM classiﬁer (5.3) represented by the datatype described in Table 5.3.
References: The OAA decomposition to train the multiclass SVM and comparison
to other methods is described for instance in the paper [12].
Example: Multiclass SVM using the OAA decomposition
The example shows the training of the multiclass SVM using the OAA decomposi
tion. The SMO binary solver is used to train the binary SVM subtasks. The training
data and the decision boundary is visualized in Figure 5.2.
data = load(’pentagon’); % load data
options.solver = ’smo’; % use SMO solver
options.ker = ’rbf’; % use RBF kernel
options.arg = 1; % kernel argument
options.C = 10; % regularization constant
model = oaasvm(data,options ); % training
% visualization
figure;
ppatterns(data);
ppatterns(model.sv.X,’ko’,12);
pboundary(model);
5.4 OneAgainstOne decomposition for SVM
The input is a set T
XY
= ¦(x
1
, y
1
), . . . , (x
l
, y
l
)¦ of training vectors x
i
∈ A ⊆ R
n
and
corresponding hidden states y
i
∈ ¸ = ¦1, 2, . . . , c¦. The goal is to train the multiclass
rule q: A ⊆ R
n
→ ¸ based on the majority voting strategy deﬁned by (5.4). The
oneagainstone (OAO) decomposition strategy can be used train such a classiﬁer. The
OAO decomposition transforms the multiclass problem into a series of g = c(c −1)/2
binary subtasks that can be trained by the binary SVM. Let the training set T
j
XY
=
¦(x
1
, y
1
), . . . , (x
l
j
, y
l
j
)¦ contain the training vectors x
i
∈ 1
j
= ¦i: y
i
= y
1
_
y
i
= y
2
¦
and the modiﬁed the hidden states deﬁned as
y
i
=
_
1 for y
1
j
= y
i
,
2 for y
2
j
= y
i
,
i ∈ 1
j
.
The training set T
j
XY
, j = 1, . . . , g is constructed for all g = c(c −1)/2 combinations of
classes y
1
j
∈ ¸ and y
2
j
∈ ¸ ¸¦y
1
j
¦. The binary SVM rules q
j
, j = 1, . . . , g are trained on
the data T
j
XY
. The OAO decomposition for multiclass SVM is implemented in function
64
−1 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1
−1
−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
Figure 5.2: Example showing the multiclass SVM classiﬁer build by oneagainstall
decomposition.
oaosvm. The function oaosvm allows to specify an arbitrary binary SVM solver to train
the subtasks.
References: The OAO decomposition to train the multiclass SVM and comparison
to other methods is described for instance in the paper [12].
Example: Multiclass SVM using the OAO decomposition
The example shows the training of the multiclass SVM using the OAO decomposi
tion. The SMO binary solver is used to train the binary SVM subtasks. The training
data and the decision boundary is visualized in Figure 5.3.
data = load(’pentagon’); % load data
options.solver = ’smo’; % use SMO solver
options.ker = ’rbf’; % use RBF kernel
options.arg = 1; % kernel argument
options.C = 10; % regularization constant
model = oaosvm(data,options); % training
% visualization
figure;
ppatterns(data);
ppatterns(model.sv.X,’ko’,12);
pboundary(model);
65
−1 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1
−1
−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
Figure 5.3: Example showing the multiclass SVM classiﬁer build by oneagainstone
decomposition.
5.5 Multiclass BSVM formulation
The input is a set T
XY
= ¦(x
1
, y
1
), . . . , (x
l
, y
l
)¦ of training vectors x
i
∈ A ⊆ R
n
and
corresponding hidden states y
i
∈ ¸ = ¦1, 2, . . . , c¦. The goal is to train the multiclass
rule q: A ⊆ R
n
→ ¸ deﬁned by (5.3). In the linear case, the parameters of the multi
class rule can be found by solving the multiclass BSVM formulation which is deﬁned
as
(W
∗
, b
∗
, ξ
∗
) = argmin
W,b
1
2
y∈Y
(w
y

2
+ b
2
y
) + C
i∈I
y∈Y\{y}
(ξ
i
)
p
. ¸¸ .
F(W,b,ξ)
, (5.9)
subject to
¸w
y
i
x
i
) + b
y
i
−(¸w
y
x
i
) + b
y
) ≥ 1 −ξ
y
i
, i ∈ 1 , y ∈ ¸ ¸ ¦y
i
¦ ,
ξ
y
i
≥ 0 , i ∈ 1 , y ∈ ¸ ¸ ¦y
i
¦ ,
where W = [w
1
, . . . , w
c
] is a matrix of normal vectors, b = [b
1
, . . . , b
c
]
T
is vector of
biases, ξ is a vector of slack variables and 1 = [1, . . . , l] is set of indices. The L1soft
margin is used for p = 1 and L2soft margin for p = 2. The letter B in BSVM stands
for the bias term added to the objective (5.9).
In the case of the L2soft margin p = 2, the optimization problem (5.9) can be
66
expressed in its dual form
A
∗
= argmax
A
i∈I
y∈Y\{y
i
}
α
y
i
−
1
2
i∈I
y∈Y\{y
i
}
j∈I
u∈Y\{y
j
}
α
y
i
α
u
j
h(y, u, i, j) , (5.10)
subject to
α
y
i
≥ 0 , i ∈ 1 , y ∈ ¸ ¸ ¦y
i
¦ ,
where A = [α
1
, . . . , α
c
] are optimized weight vectors. The function h: ¸¸11 →R
is deﬁned as
h(y, u, i, j) = (¸x
i
x
j
)+1)(δ(y
i
, y
j
)+δ(y, u)−δ(y
i
, u)−δ(y
j
, y))+
δ(y, u)δ(i, j)
2C
. (5.11)
The criterion (5.10) corresponds to the singleclass SVM criterion which is simple to
optimize. The multiclass rule (5.3) is composed of the set of discriminant functions
f
y
: A →R which are computed as
f
y
(x) =
i∈I
¸x
i
x)
u∈Y\{y
i
}
α
u
i
(δ(u, y
i
) −δ(u, y)) + b
y
, y ∈ ¸ , (5.12)
where the bias b
y
, y ∈ ¸ is given by
b
y
=
i∈I
y∈Y\{y
i
}
α
u
i
(δ(y, y
i
) −δ(y, u)) , y ∈ ¸ .
The nonlinear (kernel) classiﬁer is obtained by substituting the selected kernel k: A
A →R for the dot products ¸xx
) to (5.11) and (5.12). The training of the multiclass
BSVM classiﬁer with L2soft margin is implemented in the function bsvm2. The opti
mization task (5.10) can be solved by most the optimizers. The optimizers implemented
in the function bsvm2 are enlisted bellow:
MitchellDemyanovMalozemov (solver = ’mdm’).
Kozinec’s algorithm (solver = ’kozinec’).
Nearest Point Algorithm (solver = ’npa’).
The above mentioned algorithms are of iterative nature and converge to the optimal
solution of the criterion (5.9). The upper bound F
UB
and the lower bound F
LB
on the
optimal value F(W
∗
, b
∗
, ξ
∗
) of the criterion (5.9) can be computed. The bounds are
used in the algorithms to deﬁne the following two stopping conditions:
Relative tolerance:
F
UB
−F
LB
F
LB
+ 1
≤ ε
rel
.
67
Absolute tolerance: F
UB
≤ ε
abs
.
References: The multiclass BSVM formulation is described in paper [12]. The trans
formation of the multiclass BSVM criterion to the singleclass criterion (5.10) imple
mented in the STPRtool is published in paper [9]. The MitchellDemyanovMalozemov
and Nearest Point Algorithm are described in paper [14]. The Kozinec’s algorithm
based SVM solver is described in paper [10].
Example: Multiclass BSVM with L2soft margin.
The example shows the training of the multiclass BSVM using the NPA solver. The
other solvers can be used instead by setting the variable options.solver (other options
are ’kozinec’, ’mdm’). The training data and the decision boundary is visualized in
Figure 5.4.
data = load(’pentagon’); % load data
options.solver = ’npa’; % use NPA solver
options.ker = ’rbf’; % use RBF kernel
options.arg = 1; % kernel argument
options.C = 10; % regularization constant
model = bsvm2(data,options); % training
% visualization
figure;
ppatterns(data);
ppatterns(model.sv.X,’ko’,12);
pboundary(model);
5.6 Kernel Fisher Discriminant
The input is a set T
XY
= ¦(x
1
, y
1
), . . . , (x
l
, y
l
)¦ of training vectors x
i
∈ A ⊆ R
n
and
corresponding values of binary state y
i
∈ ¸ = ¦1, 2¦. The Kernel Fisher Discriminant
is the nonlinear extension of the linear FLD (see Section 2.3). The input training
vectors are assumed to be mapped into a new feature space T by a mapping function
φ: A → T. The ordinary FLD is applied on the mapped data training data T
ΦY
=
¦(φ(x
1
), y
1
), . . . , (φ(x
l
), y
l
)¦. The mapping is performed eﬃciently by means of kernel
functions k: A A → R being the dot products of the mapped vectors k(x, x
) =
¸φ(x) φ(x
)). The aim is to ﬁnd a direction ψ =
l
i=1
α
i
φ(x
i
) in the feature space
T given by weights α = [α
1
, . . . , α
l
]
T
such that the criterion
F(α) =
¸α Mα)
¸α (N+ µE)α)
=
¸α m)
2
¸α (N+ µE)α)
, (5.13)
68
−1 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1
−1
−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
Figure 5.4: Example showing the multiclass BSVM classiﬁer with L2soft margin.
is maximized. The criterion (5.13) is the feature space counterpart of the crite
rion (2.13) deﬁned in the input space A. The vector m [l 1], matrices N [l l]
M [l l] are constructed as
m
y
=
1
[1
y
[
K1
y
, y ∈ ¸ , m = m
1
−m
2
,
N = KK
T
−
y∈Y
[1
y
[m
y
m
T
y
, M = (m
1
−m
2
)(m
1
−m
2
)
T
,
where the vector 1
y
has entries i ∈ 1
y
equal to and remaining entries zeros. The kernel
matrix K [l l] contains kernel evaluations K
i,j
= k(x
i
, x
j
). The diagonal matrix
µE in the denominator of the criterion (5.13) serves as the regularization term. The
discriminant function f(x) of the binary classiﬁer (5.2) is given by the found vector α
as
f(x) = ¸ψ φ(x)) + b = ¸α k(x)) + b , (5.14)
where the k(x) = [k(x
1
, x), . . . , k(x
l
, x)]
T
is vector of evaluations of the kernel. The
discriminant function (5.14) is known up to the bias b which is determined by the linear
SVM with L1soft margin is applied on the projected data T
ZY
= ¦(z
1
, y
1
), . . . , (z
l
, y
l
)¦.
The projections ¦z
1
, . . . , z
l
¦ are computed as
z
i
= ¸α k(x
i
)) .
The KFD training algorithm is implemented in the function kfd.
69
References: The KFD formulation is described in paper [19]. The detailed analysis
can be found in book [30].
Example: Kernel Fisher Discriminant
The binary classiﬁer is trained by the KFD on the Riply’s training set riply trn.mat.
The found classiﬁer is evaluated on the testing data riply tst.mat. The training data
and the found kernel classiﬁer is visualized in Figure 5.5. The classiﬁer is visualized as
the SVM classiﬁer, i.e., the isolines with f(x) = +1 and f(x) = −1 are visualized and
the support vectors (whole training set) are marked as well.
trn = load(’riply_trn’); % load training data
options.ker = ’rbf’; % use RBF kernel
options.arg = 1; % kernel argument
options.C = 10; % regularization of linear SVM
options.mu = 0.001; % regularization of KFD criterion
model = kfd(trn, options) % KFD training
% visualization
figure;
ppatterns(trn);
psvm(model);
% evaluation on testing data
tst = load(’riply_tst’);
ypred = svmclass( tst.X, model );
cerror( ypred, tst.y )
ans =
0.0960
5.7 Preimage problem
Let T
X
= ¦x
1
, . . . , x
l
¦ be a set of vector from the input space A ⊆ R
n
. The kernel
methods works with the data nonlinearly mapped φ: A → T into a new feature space
T. The mapping is performed implicitly by using the kernel functions k: A A → R
which are dot products of the input data mapped into the feature space T such that
k(x
a
, x
b
) = ¸φ(x
a
) φ(x
b
)). Let ψ ∈ T be given by the following kernel expansion
ψ =
l
i=1
α
i
φ(x
i
) , (5.15)
70
−1.5 −1 −0.5 0 0.5 1
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
Figure 5.5: Example shows the binary classiﬁer trained based on the Kernel Fisher
Discriminant.
where α = [α
1
, . . . , α
l
]
T
is a weight vector of real coeﬃcients. The preimage problem
aims to ﬁnd an input vector x ∈ A, its feature space image φ(x) ∈ T would well
approximate the expansion point Ψ ∈ T given by (5.15). In other words, the pre
image problem solves mapping from the feature space T back to the input space A.
The preimage problem can be stated as the following optimization task
x = argmin
x
φ(x) −ψ
2
= argmin
x
φ(x
) −
l
i=1
α
i
φ(x
i
)
2
= argmin
x
k(x
, x
) −2
l
i=1
α
i
k(x
, x
i
) +
l
i=1
l
j=1
α
i
α
j
k(x
i
, x
j
) .
(5.16)
Let the Radial Basis Function (RBF) k(x
a
, x
b
) = exp(−0.5x
a
−x
b

2
/σ
2
) be assumed.
The preimage problem (5.16) for the particular case of the RBF kernel leads to the
following task
x = argmax
x
l
i=1
α
i
exp(−0.5x
−x
i

2
/σ
2
) . (5.17)
The STPRtool provides three optimization algorithms to solve the RBF preimage
problem (5.17) which are enlisted in Table 5.6. However, all the methods are guaranteed
to ﬁnd only a local optimum of the task (5.17).
71
Table 5.6: Methods to solve the RBF preimage problem implemented in the STPRtool.
rbfpreimg An iterative ﬁxed point algorithm proposed in [28].
rbfpreimg2 A standard gradient optimization method using the
Matlab optimization toolbox for 1D search along the
gradient.
rbfpreimg3 A method exploiting the correspondence between dis
tances in the input space and the feature space pro
posed in [16].
References: The preimage problem and the implemented algorithms for its solution
are described in [28, 16].
Example: Preimage problem
The example shows how to visualize a preimage of the data mean in the feature
induced by the RBF kernel. The result is visualized in Figure 5.6.
% load input data
data = load(’riply_trn’);
% define the mean in the feature space
num_data = size(data,2);
expansion.Alpha = ones(num_data,1)/num_data;
expansion.sv.X = data.X;
expansion.options.ker = ’rbf’;
expansion.options.arg = 1;
% compute preimage problem
x = rbfpreimg(expansion);
% visualization
figure;
ppatterns(data.X);
ppatterns(x,’or’,13);
72
−1.5 −1 −0.5 0 0.5 1
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
Figure 5.6: Example shows training data and the preimage of their mean in the feature
space induced by the RBF kernel.
5.8 Reduced set method
The reduced set method is useful to simplify the kernel classiﬁer trained by the SVM
or other kernel machines. Let the function f: A ⊆ R
n
→R be given as
f(x) =
l
i=1
α
i
k(x
i
, x) + b = ¸ψ φ(x)) + b , (5.18)
where T
X
= ¦x
1
, . . . , x
l
¦ are vectors from R
n
, k: A A → R is a kernel function,
α = [α
1
, . . . , α
l
]
T
is a weight vector of real coeﬃcients and b ∈ R is a scalar. Let
φ: A → T be a mapping from the input space A to the feature space T associated
with kernel function such that k(x
a
, x
b
) = ¸φ(x
a
) φ(x
b
)). The function f(x) can be
expressed in the feature space T as the linear function f(x) = ¸ψ φ(x)) + b. The
reduced set method aims to ﬁnd a new function
˜
f(x) =
m
i=1
β
i
k(s
i
, x) + b = ¸φ(x)
m
i=1
β
i
φ(s
i
)
. ¸¸ .
˜
ψ
) + b , (5.19)
such that the expansion (5.19) is shorter, i.e., m < l, and well approximates the original
one (5.18). The weight vector β = [β
1
, . . . , β
m
]
T
and the vectors T
S
= ¦s
1
, . . . , s
m
¦,
s
i
∈ R
n
determine the reduced kernel expansion (5.19). The problem of ﬁnding the
73
reduced kernel expansion (5.19) can be stated as the optimization task
(β, T
S
) = argmin
β
,T
S

˜
ψ −ψ
2
= argmin
β
,T
S
_
_
_
_
_
m
i=1
β
i
φ(s
i
) −
l
i=1
α
i
φ(x
i
)
_
_
_
_
_
2
. (5.20)
Let the Radial Basis Function (RBF) k(x
a
, x
b
) = exp(−0.5x
a
−x
b

2
/σ
2
) be assumed.
In this case, the optimization task (5.20) can be solved by an iterative greedy algo
rithm. The algorithm converts the task (5.20) into a series of m preimage tasks (see
Section 5.7). The reduced set method for the RBF kernel is implemented in function
rsrbf.
References: The implemented reduced set method is described in [28].
Example: Reduced set method for SVM classiﬁer
The example shows how to use the reduced set method to simplify the SVM clas
siﬁer. The SVM classiﬁer is trained from the Riply’s training set riply trn/mat. The
trained kernel expansion (discriminant function) is composed of 94 support vectors.
The reduced expansion with 10 support vectors is found. Decision boundaries of the
original and the reduced SVM are visualized in Figure 5.7. The original and reduced
SVM are evaluated on the testing data riply tst.mat.
% load training data
trn = load(’riply_trn’);
% train SVM classifier
model = smo(trn,struct(’ker’,’rbf’,’arg’,1,’C’,10));
% compute the reduced rule with 10 support vectors
red_model = rsrbf(model,struct(’nsv’,10));
% visualize decision boundaries
figure; ppatterns(trn);
h1 = pboundary(model,struct(’line_style’,’r’));
h2 = pboundary(red_model,struct(’line_style’,’b’));
legend([h1(1) h2(1)],’Original SVM’,’Reduced SVM’);
% evaluate SVM classifiers
tst = load(’riply_tst’);
ypred = svmclass(tst.X,model);
err = cerror(ypred,tst.y);
red_ypred = svmclass(tst.X,model);
74
red_err = cerror(red_ypred,tst.y);
fprintf([’Original SVM: nsv = %d, err = %.4f\n’ ...
’Reduced SVM: nsv = %d, err = %.4f\n’], ...
model.nsv, err, red_model.nsv, red_err);
Original SVM: nsv = 94, err = 0.0960
Reduced SVM: nsv = 10, err = 0.0960
−1.5 −1 −0.5 0 0.5 1
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
Original SVM
Reduced SVM
Figure 5.7: Example shows the decision boundaries of the SVM classiﬁer trained by
SMO and the approximated SVM classiﬁer found by the reduced set method. The
original decision rule involves 94 support vectors while the reduced one only 10 support
vectors.
75
Chapter 6
Miscellaneous
6.1 Data sets
The STPRtool provides several datasets and functions to generate synthetic data. The
datasets stored in the Matlab data ﬁle are enlisted in Table 6.1. The functions which
generate synthetic data and scripts for converting external datasets to the Matlab are
given in Table 6.2.
Table 6.1: Datasets stored in /data/ directory of the STPRtool.
/riply trn.mat Training part of the Riply’s data set [24].
/riply tst.mat Testing part of the Riply’s data set [24].
/iris.mat Well known Fisher’s data set.
/anderson task/*.mat Input data for the demo on the Generalized Ander
son’s task demo anderson.
/binary separable/*.mat Input data for the demo on training linear classiﬁers
from ﬁnite vector sets demo linclass.
/gmm samles/*.mat Input data for the demo on the Expectation
Maximization algorithm for the Gaussian mixture
model demo emgmm.
/multi separable/*.mat Linearly separable multiclass data in 2D.
/ocr data/*.mat Examples of handwritten numerals. A description is
given in Section 7.1.
/svm samples/*.mat Input data for the demo on the Support Vector Ma
chines demo svm.
76
Table 6.2: Data generation.
createdata GUI which allows to create interactively labeled vec
tors sets and labeled sets of Gaussian distributions. It
operates in 2dimensional space only.
genlsdata Generates linearly separable binary data with pre
scribed margin.
gmmsamp Generates data from the Gaussian mixture model.
gsamp Generates data from the multivariate Gaussian distri
bution.
gencircledata Generates data on circle corrupted by the Gaussian
noise.
usps2mat Converts U.S. Postal Service (USPS) database of
handwritten numerals to the Matlab ﬁle.
6.2 Visualization
The STPRtool provides tools for visualization of common models and data. The list
of functions for visualization is given in Table 6.3. Each function contains an example
of using.
Table 6.3: Functions for visualization.
pandr Visualizes solution of the Generalized Anderson’s
task.
pboundary Plots decision boundary of given classiﬁer in 2D.
pgauss Visualizes set of bivariate Gaussians.
pgmm Visualizes the bivariate Gaussian mixture model.
pkernelproj Plots isolines of kernel projection.
plane3 Plots plane in 3D.
pline Plots line in 2D.
ppatterns Visualizes patterns as points in the feature space. It
works for 1D, 2D and 3D.
psvm Plots decision boundary of binary SVM classiﬁer in
2D.
showim Displays images entered as column vectors.
77
6.3 Bayesian classiﬁer
The object under study is assumed to be described by a vector of observations x ∈ A
and a hidden state y ∈ ¸. The x and y are realizations of random variables with joint
probability distribution P
XY
(x, y). A decision rule q: A → T takes a decision d ∈ T
based on the observation x ∈ A. Let W: T¸ →R be a loss function which penalizes
the decision q(x) ∈ T when the true hidden state is y ∈ ¸. Let A ⊆ R
n
and the sets
¸ and T be ﬁnite. The Bayesian risk R(q) is an expectation of the value of the loss
function W when the decision rule q is applied, i.e.,
R(q) =
_
X
y∈Y
P
XY
(x, y)W(q(x), y) dx . (6.1)
The optimal rule q
∗
which minimizes the Bayesian risk (6.1) is referred to as the
Bayesian rule
q
∗
(x) = argmin
y
y∈Y
P
XY
(x, y)W(q(x), y) , ∀x ∈ A .
The STPRtool implements the Bayesian rule for two particular cases:
Minimization of misclassiﬁcation: The set of decisions T coincides to the set of
hidden states ¸ = ¦1, . . . , c¦. The 0/1loss function
W
0/1
(q(x), y) =
_
0 for q(x) = y ,
1 for q(x) ,= y .
(6.2)
is used. The Bayesian risk (6.1) with the 0/1loss function corresponds to the
expectation of misclassiﬁcation. The rule q: A → ¸ which minimizes the expec
tation of misclassiﬁcation is deﬁned as
q(x) = argmax
y∈Y
P
Y X
(y[x) ,
= argmax
y∈Y
P
XY
(x[y)P
Y
(y) .
(6.3)
Classiﬁcation with rejectoption: The set of decisions T is assumed to be T =
¸ ∪ ¦dont know¦. The loss function is deﬁned as
W
ε
(q(x), y) =
⎧
⎨
⎩
0 for q(x) = y ,
1 for q(x) ,= y ,
ε for q(x) = dont know ,
(6.4)
where ε is penalty for the decision dont know. The rule q: A → ¸ which mini
mizes the Bayesian risk with the loss function (6.4) is deﬁned as
q(x) =
⎧
⎨
⎩
argmax
y∈Y
P
XY
(x[y)P
Y
(y) if 1 −max
y∈Y
P
Y X
(y[x) < ε ,
dont know if 1 −max
y∈Y
P
Y X
(y[x) ≥ ε .
(6.5)
78
To apply the optimal classiﬁcation rules one has to know the classconditional
distributions P
XY
and a priory distribution P
Y
(or their estimates). The datatype
used by the STPRtool is described in Table 6.4. The Bayesian classiﬁers (6.3) and (6.5)
are implemented in function bayescls.
Table 6.4: Datatype used to describe the Bayesian classiﬁer.
Bayesian classiﬁer (structure array):
.Pclass [cell 1 c] Class conditional distributions P
XY
. The distribution
P
XY
(x[y) is given by Pclassy which uses the data
type described in Table 4.2.
.Prior [1 c] Discrete distribution P
Y
(y), y ∈ ¸.
.fun = ’bayescls’ Identiﬁes function associated with this data type.
References: The Bayesian decision making with applications to PR is treated in
details in book [26].
Example: Bayesian classiﬁer
The example shows how to design the plugin Bayesian classiﬁer for the Riply’s
training set riply trn.mat. The classconditional distributions P
XY
are modeled by
the Gaussian mixture models (GMM) with two components. The GMM are estimated
by the EM algorithm emgmm. The a priori probabilities P
Y
(y), y ∈ ¦1, 2¦ are estimated
by the relative occurrences (it corresponds to the ML estimate). The designed Bayesian
classiﬁer is evaluated on the testing data riply tst.mat.
% load input training data
trn = load(’riply_trn’);
inx1 = find(trn.y==1);
inx2 = find(trn.y==2);
% Estimation of classconditional distributions by EM
bayes_model.Pclass{1} = emgmm(trn.X(:,inx1),struct(’ncomp’,2));
bayes_model.Pclass{2} = emgmm(trn.X(:,inx2),struct(’ncomp’,2));
% Estimation of priors
n1 = length(inx1); n2 = length(inx2);
bayes_model.Prior = [n1 n2]/(n1+n2);
% Evaluation on testing data
tst = load(’riply_tst’);
ypred = bayescls(tst.X,bayes_model);
cerror(ypred,tst.y)
79
ans =
0.0900
The following example shows how to visualize the decision boundary of the found
Bayesian classiﬁer minimizing the misclassiﬁcation (solid black line). The Bayesian
classiﬁer splits the feature space into two regions corresponding to the ﬁrst class and
the second class. The decision boundary of the rejectoptions rule (dashed line) is
displayed for comparison. The rejectoption rule splits the feature space into three
regions corresponding to the ﬁrst class, the second class and the region in between
corresponding to the dont know decision. The visualization is given in Figure 6.1.
% Visualization
figure; hold on; ppatterns(trn);
bayes_model.fun = ’bayescls’;
pboundary(bayes_model);
% Penalization for don’t know decision
reject_model = bayes_model;
reject_model.eps = 0.1;
% Vislualization of rejetoption rule
pboundary(reject_model,struct(’line_style’,’k’));
6.4 Quadratic classiﬁer
The quadratic classiﬁcation rule q: A ⊆ R
n
→ ¸ = ¦1, 2, . . . , c¦ is composed of a set
of discriminant functions
f
y
(x) = ¸x A
y
x) +¸b
y
x) + c
y
, ∀y ∈ ¸ ,
which are quadratic with respect to the input vector x ∈ R
n
. The quadratic discrimi
nant function f
y
is determined by a matrix A
y
[n n], a vector b
y
[n 1] and a scalar
c
y
[1 1]. The input vector x ∈ R
n
is assigned to the class y ∈ ¸ its corresponding
discriminant function f
y
attains maximal value
y = argmax
y
∈Y
f
y
(x) = argmax
y
∈Y
(¸x A
y
x) +¸b
y
x) + c
y
) . (6.6)
In the particular binary case ¸ = ¦1, 2¦, the quadratic classiﬁer is represented by
a single discriminant function
f(x) = ¸x Ax) +¸b x) + c ,
80
−1.5 −1 −0.5 0 0.5 1
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
Figure 6.1: Figure shows the decision boundary of the Bayesian classiﬁer (solid line)
and the decision boundary of the rejectoption rule with ε = 0.1 (dashed line). The
classconditional probabilities are modeled by the Gaussian mixture models with two
components estimated by the EM algorithm.
given by a matrix A [n n], a vector b [n 1] and a scalar c [1 1]. The input vector
x ∈ R
n
is assigned to class y ∈ ¦1, 2¦ as follows
q(x) =
_
1 if f(x) = ¸x Ax) +¸b x) + c ≥ 0 ,
2 if f(x) = ¸x Ax) +¸b x) + c < 0 .
(6.7)
The STPRtool uses a speciﬁc structure array to describe the binary (see Table 6.5)
and the multiclass (see Table 6.6) quadratic classiﬁer. The implementation of the
quadratic classiﬁer itself provides function quadclass.
Table 6.5: Datatype used to describe binary quadratic classiﬁer.
Binary linear quadratic (structure array):
.A [n n] Parameter matrix A.
.b [n 1] Parameter vector b.
.c [1 1] Parameter scalar c.
.fun = ’quadclass’ Identiﬁes function associated with this data type.
Example: Quadratic classiﬁer
The example shows how to train the quadratic classiﬁer using the Fisher Linear Dis
criminant and the quadratic data mapping. The classiﬁer is trained from the Riply’s
81
Table 6.6: Datatype used to describe multiclass quadratic classiﬁer.
Multiclass quadratic classiﬁer (structure array):
.A [n n c] Parameter matrices A
y
, y ∈ ¸.
.b [n c] Parameter vectors b
y
, y ∈ ¸.
.c [1 c] Parameter scalars c
y
, y ∈ ¸.
.fun = ’quadclass’ Identiﬁes function associated with this data type.
training set riply trn.mat. The found quadratic classiﬁer is evaluated on the test
ing data riply trn.mat. The boundary of the quadratic classiﬁer is visualized (see
Figure 6.2).
trn = load(’riply_trn’); % load input data
quad_data = qmap(trn); % quadratic mapping
lin_model = fld(quad_data); % train FLD
% make quadratic classifier
quad_model = lin2quad(lin_model);
% visualization
figure;
ppatterns(trn); pboundary(quad_model);
% evaluation
tst = load(’riply_tst’);
ypred = quadclass(tst.X,quad_model);
cerror(ypred,tst.y)
ans =
0.0940
6.5 Knearest neighbors classiﬁer
Let T
XY
= ¦(x
1
, y
1
), . . . , (x
l
, y
l
)¦ be a set of prototype vectors x
i
∈ A ⊆ R
n
and
corresponding hidden states y
i
∈ ¸ = ¦1, . . . , c¦. Let R
n
(x) = ¦x
: x −x
 ≤ r
2
¦ be
a ball centered in the vector x in which lie K prototype vectors x
i
, i ∈ ¦1, . . . , l¦, i.e.,
[¦x
i
: x
i
∈ R
n
(x)¦[ = K. The Knearest neighbor (KNN) classiﬁcation rule q: A → ¸
is deﬁned as
q(x) = argmax
y∈Y
v(x, y) , (6.8)
82
−1.5 −1 −0.5 0 0.5 1
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
Figure 6.2: Figure shows the decision boundary of the quadratic classiﬁer.
where v(x, y) is number of prototype vectors x
i
with hidden state y
i
= y which lie in
the ball x
i
∈ R
n
(x). The classiﬁcation rule (6.8) is computationally demanding if the
set of prototype vectors is large. The datatype used by the STPRtool to represent the
Knearest neighbor classiﬁer is described in Table 6.7.
Table 6.7: Datatype used to describe the Knearest neighbor classiﬁer.
KNN classiﬁer (structure array):
.X [n l] Prototype vectors ¦x
1
, . . . , x
l
¦.
.y [1 l] Labels ¦y
1
, . . . , y
l
¦.
.fun = ’knnclass’ Identiﬁes function associated with this data type.
References: The Knearest neighbors classiﬁer is described in most books on PR, for
instance, in book [6].
Example: Knearest neighbor classiﬁer
The example shows how to create the (K = 8)NN classiﬁcation rule from the
Riply’s training data riply trn.mat. The classiﬁer is evaluated on the testing data
riply tst.mat. The decision boundary is visualized in Figure 6.3.
% load training data and setup 8NN rule
trn = load(’riply_trn’);
model = knnrule(trn,8);
83
% visualize decision boundary and training data
figure; ppatterns(trn); pboundary(model);
% evaluate classifier
tst = load(’riply_tst’);
ypred = knnclass(tst.X,model);
cerror(ypred,tst.y)
ans =
0.1160
−1.5 −1 −0.5 0 0.5 1
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
Figure 6.3: Figure shows the decision boundary of the (K = 8)nearest neighbor
classiﬁer.
84
Chapter 7
Examples of applications
We intend to demonstrate how methods implemented in the STPRtool can be used to
solve a more complex applications. The applications selected are the optical charac
ter recognition (OCR) system based on the multiclass SVM (Section 7.1) and image
denoising system using the kernel PCA (Section 7.2).
7.1 Optical Character Recognition system
The use of the STPRtool is demonstrated on a simple Optical Character Recognition
(OCR) system for handwritten numerals from 0 to 9. The STPRtool provides GUI
which allows to use standard computer mouse as an input device to draw images of
numerals. The GUI is intended just for demonstration purposes and the scanner or
other device would be used in practice instead. The multiclass SVM are used as the
base classiﬁer of the numerals. A numeral to be classiﬁed is represented as a vector
x containing pixel values taken from the numeral image. The training stage of the
OCR system involves (i) collection of training examples of numerals and (ii) training
the multiclass SVM. To see the trained OCR system run the demo script demo ocr.
The GUI for drawing numerals is implemented in a function mpaper. The function
mpaper creates a form composed of 50 cells to which the numerals can be drawn (see
Figure 7.1a). The drawn numerals are normalized to size 16 16 pixels (diﬀerent
size can be speciﬁed). The function mpaper is designed to invoke a speciﬁed function
whenever the middle mouse button is pressed. In particular, the functions save chars
and ocr fun are called from the mpaper to process drawn numerals. The function
save chars saves drawn numerals in the stage of collecting training examples and the
function ocr fun is used to recognize the numerals if the trained OCR is applied.
The set of training examples has to be collected before the OCR is trained. The
function collect chars invokes the form for drawing numerals and allows to save the
numerals to a speciﬁed ﬁle. The STPRtool contains examples of numerals collected
85
from 6 diﬀerent persons. Each person drew 50 examples of each of numerals from
’0’ to ’9’. The numerals are stored in the directory /data/ocr numerals/. The ﬁle
name format name xx.mat (the character can be omitted from the name) is used.
The string name is name of the person who drew the particular numeral and xx stands
for a label assigned to the numeral. The labels from y = 1 to y = 9 corresponds to the
numerals from ’1’ to ’9’ and the label y = 10 is used for numeral ’0’. For example,
to collect examples of numeral ’1’ from person honza cech
collect_chars(’honza_cech1’)
was called. Figure 7.1a shows the form ﬁlled with the examples of the numeral ’1’.
The Figure 7.1b shows numerals after normalization which are saved to the speci
ﬁed ﬁle honza cech.mat. The images of normalized numerals are stored as vectors
to a matrix X of size [256 50]. The vector dimension n = 256 corresponds to
16 16 pixels of each image numeral and l = 50 is the maximal number of numerals
drawn to a single form. If the ﬁle name format name xx.mat is met then the function
mergesets(Directory,DataFile) can be used to merge all the ﬁle to a single one.
The resulting ﬁle contains a matrix X of all examples and a vector y containing labels
are assigned to examples according the number xx. The string Directory speciﬁes the
directory where the ﬁles name xx.mat are stored and the string FileName is a ﬁle name
to which the resulting training set is stored.
Having the labeled training set T
XY
= ¦(x
1
, y
1
), . . . , (x
l
, y
l
)¦ created the multiclass
SVM classiﬁer can be trained. The methods to train the multiclass SVM are described
in Chapter 5. In particular, the OneAgainstOne (OAO) decomposition was used as
it requires the shortest training time. However, if a speed of classiﬁcation is the major
objective the multiclass BSVM formulation is often a better option as it produce less
number of support vectors. The SVM has two free parameters which have to be tuned:
the regularization constant C and the kernel function k(x
a
, x
b
) with its argument.
The RBF kernel k(x
a
, x
b
) = exp(−0.5x
a
− x
b

2
/σ
2
) determined by the parameter
σ was used in the experiment. Therefore, the free SVM parameters are C and σ. A
common method to tune the free parameters is to use the crossvalidation to select the
best parameters from a preselected set Θ = ¦(C
1
, σ
1
), . . . , (C
k
, σ
k
)¦. The STPRtool
contains a function evalsvm which evaluates the set of SVM parameters Θ based on the
crossvalidation estimation of the classiﬁcation error. The function evalsvm returns the
SVM model its parameters (C
∗
, σ
∗
) turned out to be have the lowest crossvalidation
error over the given set Θ. The parameter tuning is implemented in the script tune ocr.
The parameters C = ∞ (data are separable) and σ = 5 were found to be the best.
After the parameters (C
∗
, σ
∗
) are tuned the SVM classiﬁer is trained using all
training data available. The training procedure for OCR classiﬁer is implemented in
the function train ocr. The directory /demo/ocr/models/ contains the SVM model
trained by OAO decomposition strategy (function oaosvm) and model trained based
on the multiclass BSVM formulation (function bsvm2).
86
The particular model used by OCR is speciﬁed in the function ocr fun which
implicitly uses the ﬁle ocrmodel.mat stored in the directory /demo/ocr/. To run the
resulting OCR type demo ocr or call
mpaper(struct(’fun’,’ocr_fun’))
If one intends to modify the classiﬁer or the way how the classiﬁcation results are
displayed then modify the function ocr fun. A snapshot of running OCR is shown in
Figure 7.2.
7.2 Image denoising
The aim is to train a model of a given class of images in order to restore images
corrupted by noise. The U.S. Postal Service (USPS) database of images of hand
written numerals [17] was used as the class of images to model. The additive Gaussian
noise was assumed as the source of image degradation. The kernel PCA was used to
model the image class. This method was proposed in [20] and further developed in [15].
The STPRtool contains a script which converts the USPS database to the Matlab
data ﬁle. It is implemented in the function usps2mat. The corruption of the USPS
images by the Gaussian noise is implemented in script make noisy data. The script
make noisy data produces the Matlab ﬁle usps noisy.mat its contains is summarized
in Table 7.1.
The aim is to train a model of images of all numerals at ones. Thus the unlabeled
training data set T
X
= ¦x
1
, . . . , x
l
¦ is used. The model should describe the nonlinear
subspace A
img
⊆ R
n
where the images live. It is assumed that the nonlinear subspace
A
img
forms a linear subspace T
img
of a new feature space T. A link between the
input space A and the feature space T is given by a mapping φ: A → T which is
implicitly determined by a selected kernel function k: A A →R. The kernel function
corresponds to the dot product in the feature space, i.e., k(x
a
, x
b
) = ¸φ(x
a
) φ(x
b
)).
The kernel PCA can be used to ﬁnd the basis vectors of the subspace T
img
from the
input training data T
X
. Having the kernel PCA model of the subspace T
img
it can be
used to map the images x ∈ A into the subspace A
img
. A noisy image ˆ x is assumed to
lie outside the subspace A
img
. The denoising procedure is based on mapping the noisy
image ˆ x into the subspace A
img
. The idea of using the kernel PCA for image denoising
is demonstrated in Figure 7.3. The example which produced the ﬁgure is implemented
in demo kpcadenois.
The subspace T
img
is modeled by the kernel PCA trained from the set T
X
mapped
into the feature space T by a function φ: A → T. Let T
Φ
= ¦φ(x
1
), . . . , φ(x
l
)¦ denote
the feature space representation of the training images T
X
. The kernel PCA computes
the lower dimensional representation T
Z
= ¦z
1
, . . . , z
l
¦, z
i
∈ R
m
of the mapped images
87
Table 7.1: The content of the ﬁle usps noisy.mat produced by the script
make noisy data.
Structure array trn containing the training data:
gnd X [256 7291] Training images of numerals represented as column
vectors. The input are 7291 grayscale images of size
16 16.
X [256 7291] Training images corrupted by the Gaussian noise with
signaltonoise ration SNR = 1 (it can be changed).
y [1 7291] Labels y ∈ ¸ = ¦1, . . . , 10¦ of training images.
Structure array tst containing the testing data:
gnd X [256 2007] Testing images of numerals represented as column vec
tors. The input are 2007 grayscale images of size
16 16.
X [256 2007] Training images corrupted by the Gaussian noise with
signaltonoise ration SNR = 1 (it can be changed).
y [1 2007] Labels y ∈ ¸ = ¦1, . . . , 10¦ of training images.
T
Φ
to minimize the mean square reconstruction error
ε
KMS
=
l
i=1
φ(x
i
) −
˜
φ(x
i
)
2
,
where
˜
φ(x) =
l
i=1
β
i
φ(x) , β = A(z −b) , z = A
T
k(x) +b . (7.1)
The vector k(x) = [k(x
1
, x), . . . , k(x
l
, x)]
T
contains kernel functions evaluated at the
training data T
X
. The matrix A [l m] and the vector b [m 1] are trained by the
kernel PCA algorithm. Given a projection
˜
φ(x) onto the kernel PCA subspace an
image x
∈ R
n
of the input space can be recovered by solving the preimage problem
x
= argmin
x
φ(x) −
˜
φ(x)
2
. (7.2)
The Radial Basis Function (RBF) kernel k(x
a
, x
b
) = exp(−0.5x
a
− x
b

2
/σ
2
) was
used here as the preimage problem (7.2) can be well solved for this case. The whole
procedure of reconstructing the corrupted image ˆ x ∈ R
n
involves (i) using (7.1) to
compute z which determines
˜
φ(x) and (ii) solving the preimage problem (7.2) which
yields the resulting x
. This procedure is implemented in function kpcarec.
In contrast to the linear PCA, the kernel PCA allows to model the nonlinearity
which is very likely in the case of images. However, the nonlinearity hidden in
88
the parameter σ of the used RBF kernel has to be tuned properly to prevent over
ﬁtting. Another parameter is the dimension m of the subspace kernel PCA subspace
to be trained. The best parameters (σ
∗
, m
∗
) was selected out of a prescribed set
Θ = ¦(σ
1
, m
1
), . . . , (σ
k
, m
k
)¦ such that the crossvalidation estimate of the mean square
error
ε
MS
(σ, m) =
_
X
P
X
(x)
_
x −x

2
_
dx ,
was minimal. The x stands for the ground truth image and x
denotes the image
restored from noisy one ˆ x using (7.1) and (7.2). The images are assumed to be gen
erated from distribution P
X
(x). Figure 7.4 shows the idea of tuning the parameters
of the kernel PCA. The linear PCA was used to solve the same task for comparison.
The same idea was used to tune the output dimension of the linear PCA. Having the
parameters tuned the resulting kernel PCA and linear PCA are trained on the whole
training set. The whole training procedure of the kernel PCA is implemented in the
script train kpca denois. The training for linear PCA is implemented in the script
train lin denois. The greedy kernel PCA Algorithm 3.4 is used for training the
model. It produces sparse model and can deal with large data sets. The results of
tuning the parameters are displayed in Figure 7.5. The ﬁgure was produced by a script
show denois tuning.
The function kpcarec is used to denoise images using the kernel PCA model. In
the case of the linear PCA, the function pcarec was used. An example of denoising of
the testing data of USPS is implemented in script show denoising. Figure 7.6 shows
examples of ground truth images, noisy images, images restored by the linear PCA and
kernel PCA.
89
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
Control: left_button − draw, middle_button − classify, right_button − erase.
(a)
(b)
Figure 7.1: Figure (a) shows handwritten training examples of numeral ’1’ and Figure
(b) shows corresponding normalized images of size 16 16 pixels.
90
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
Control: left_button − draw, middle_button − classify, right_button − erase.
(a)
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
1 2 3 4 5 6 7 8 9 0
4 0
5 3 1 9
6 2 8
(b)
Figure 7.2: A snapshot of running OCR system. Figure (a) shows handwritten nu
merals and Figure (b) shows recognized numerals.
91
−4 −3 −2 −1 0 1 2 3 4 5 6 7
−4
−3
−2
−1
0
1
2
3
4
5
6
7
Ground truth
Noisy examples
Corrupted
Reconstructed
Figure 7.3: Demonstration of idea of image denoising by kernel PCA.
Data
generator
model
model
Distortion
KPCA Preimage
problem
θ ∈ Θ
Error
e
r
r
o
r
θ
assessment
Figure 7.4: Schema of tuning parameters of kernel PCA model used for image restora
tion.
92
0 10 20 30 40 50 60 70 80 90 100
10
11
12
13
14
15
16
17
18
dim
ε
M
S
3.5 4 4.5 5 5.5 6 6.5 7 7.5 8
6
8
10
12
14
16
18
20
σ
ε
M
S
dim = 50
dim = 100
dim = 200
dim = 300
(a) (b)
Figure 7.5: Figure (a) shows plot of ε
MS
with respect to output dimension of lin
ear PCA. Figure(b) shows plot of ε
MS
with respect to output dimension and kernel
argument of the kernel PCA.
Ground truth
Numerals with added Gaussian noise
Linear PCA reconstruction
Kernel PCA reconstruction
Figure 7.6: The results of denosing the handwritten images of USPS database.
93
Chapter 8
Conclusions and future planes
The STPRtool as well as its documentation is being updating. Therefore we are grateful
for any comment, suggestion or other feedback which would help in the development
of the STPRtool.
There are many relevant methods and proposals which have not been implemented
yet. We welcome anyone who likes to contribute to the STPRtool in any way. Please
send us an email and your potential contribution.
The future planes involve the extensions of the toolbox by the following methods:
• Feature selection methods.
• AdaBoost and related learning methods.
• Eﬃcient optimizers to for large scale Support Vector Machines learning problems.
• Model selection methods for SVM.
• Probabilistic Principal Component Analysis (PPCA) and mixture of PPCA esti
mated by the ExpectationMaximization (EM) algorithm.
• Discrete probability models estimated by the EM algorithm.
94
Bibliography
[1] T.W. Anderson and R.R. Bahadur. Classiﬁcation into two multivariate normal
distributions with diﬀerrentia covariance matrices. Anals of Mathematical Statis
tics, 33:420–431, June 1962.
[2] G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel ap
proach. Neural Computation, 12(10):2385–2404, 2000.
[3] C.M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford,
Great Britain, 3th edition, 1997.
[4] N. Cristianini and J. ShaweTaylor. Support Vector Machines. Cambridge Uni
versity Press, 2000.
[5] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incom
plete data via the EM Algorithm. Journal of the Royal Statistical Society, 39:185–
197, 1977.
[6] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classiﬁcation. John Wiley &
Sons, 2nd. edition, 2001.
[7] R.P.W. Duin. Prtools: A matlab toolbox for pattern recognition, 2000.
[8] V. Franc. Programov´e n´astroje pro rozpozn´ av´ an´ı (Pattern Recognition Program
ming Tools, In Czech). Master’s thesis,
ˇ
Cesk´e vysok´e uˇcen´ı technick´e, Fakulta
elektrotechnick´a, Katedra kybernetiky, February 2000.
[9] V. Franc and V. Hlav´aˇc. Multiclass support vector machine. In R. Kasturi,
D. Laurendeau, and Suen C., editors, 16th International Conference on Pattern
Recognition, volume 2, pages 236–239. IEEE Computer Society, 2002.
[10] V. Franc and V. Hlav´ aˇc. An iterative algorithm learning the maximal margin
classiﬁer. Pattern Recognition, 36(9):1985–1996, 2003.
95
[11] V. Franc and V. Hlav´ aˇc. Greedy algorithm for a training set reduction in the
kernel methods. In N. Petkov and M.A. Westenberg, editors, Computer Analysis
of Images and Patterns, pages 426–433, Berlin, Germany, 2003. Springer.
[12] C.W. Hsu and C.J. Lin. A comparison of methods for multiclass support vector
machins. IEEE Transactions on Neural Networks, 13(2), March 2002.
[13] I.T. Jollife. Principal Component Analysis. SpringerVerlag, New York, 1986.
[14] S.S. Keerthi, S.K. Shevade, C. Bhattacharya, and K.R.K. Murthy. A fast itera
tive nearest point algorithm for support vector machine classiﬁer design. IEEE
Transactions on Neural Networks, 11(1):124–136, January 2000.
[15] Kim Kwang, In, Franz Matthias, O., and Sch¨olkopf Bernhard. Kernel hebbian
algorithm for singlefram superresolution. In Leonardis Aleˇs and Bischof Horst,
editors, Statisical Learning in Computer Vision, ECCV Workshop. Springer, May
2004.
[16] J.T. Kwok and I.W. Tsang. The preimage problem in kernel methods. In Pro
ceedings of the Twentieth International Conference on Machine Learning (ICML
2003), pages 408–415, Washington, D.C., USA, August 2003.
[17] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and
L.J Jackel. Backpropagation applied to handwritten zip code recognition. Neural
Computation, 1:541–551, 1989.
[18] G. McLachlan and T. Krishnan. The EM Algorithm and Extensions. John Wiley
& Sons, New York, 1997.
[19] S. Mika, G. R¨ atsch, J. Weston, B. Sch¨olkpf, and K. M¨ uller. Fisher discriminant
analysis with kernel. In Y.H. Hu, J. Larsen, and S. Wilson, E. Douglas, editors,
Neural Networks for Signal Processing, pages 41–48. IEEE, 1999.
[20] S. Mika, B. Sch¨ olkopf, A. Smola, K.R. M¨ uller, M. Scholz, and G. R¨ atsch. Kernel
pca and denoising in feature spaces. In M.S. Kearns, S.A. Solla, and D.A. Cohn,
editors, Advances in Neural Information Processing Systems 11, pages 536 – 542,
Cambridge, MA, 1999. MIT Press.
[21] I.T. Nabney. NETLAB : algorithms for pattern recognition. Advances in pattern
recognition. Springer, London, 2002.
[22] J. Platt. Probabilities for sv machines. In A.J. Smola, P.J. Bartlett, B. Scholkopf,
and D. Schuurmans, editors, Advances in Large Margin Classiﬁers (Neural Infor
mation Processing Series). MIT Press, 2000.
96
[23] J.C. Platt. Sequential minimal optimizer: A fast algorithm for training support
vector machines. Technical Report MSRTR9814, Microsoft Research, Redmond,
1998.
[24] B.D. Riply. Neural networks and related methods for classiﬁcation (with discu
sion). J. Royal Statistical Soc. Series B, 56:409–456, 1994.
[25] M.I. Schlesinger. A connection between learning and selflearning in the pattern
recognition (in Russian). Kibernetika, 2:81–88, 1968.
[26] M.I. Schlesinger and V. Hlav´ aˇc. Ten lectures on statistical and structural pattern
recognition. Kluwer Academic Publishers, 2002.
[27] B. Sch¨olkopf, C. Burges, and A.J. Smola. Advances in Kernel Methods – Support
Vector Learning. MIT Press, Cambridge, 1999.
[28] B. Sch¨olkopf, P. Knirsch, and C. Smola, A. Burges. Fast approximation of support
vector kernel expansions, and an interpretation of clustering as approximation in
feature spaces. In P. Levi, M. Schanz, R.J. Ahler, and F. May, editors, Mus
tererkennung 199820. DAGM., pages 124–132, Berlin, Germany, 1998. Springer
Verlag.
[29] B. Scholkopf, A. Smola, and K.R. Muller. Nonlinear component analysis as a
kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.
[30] B. Sch¨olkopf and A.J. Smola. Learning with Kernels. The MIT Press, MA, 2002.
[31] V. Vapnik. The nature of statistical learning theory. Springer Verlag, 1995.
97
Statistical Pattern Recognition Toolbox for Matlab
Vojtˇch Franc and V´clav Hlav´ˇ e a ac June 24, 2004
Contents
1 Introduction 1.1 What is the Statistical Pattern Recognition Toolbox and how it has been developed? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 The STPRtool purpose and its potential users . . . . . . . . . . . . . . 1.3 A digest of the implemented methods . . . . . . . . . . . . . . . . . . . 1.4 Relevant Matlab toolboxes of others . . . . . . . . . . . . . . . . . . . . 1.5 STPRtool document organization . . . . . . . . . . . . . . . . . . . . . 1.6 How to contribute to STPRtool? . . . . . . . . . . . . . . . . . . . . . 1.7 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Linear Discriminant Function 2.1 Linear classiﬁer . . . . . . . . . . . . . . . . . . 2.2 Linear separation of ﬁnite sets of vectors . . . . 2.2.1 Perceptron algorithm . . . . . . . . . . . 2.2.2 Kozinec’s algorithm . . . . . . . . . . . . 2.2.3 Multiclass linear classiﬁer . . . . . . . . 2.3 Fisher Linear Discriminant . . . . . . . . . . . . 2.4 Generalized Anderson’s task . . . . . . . . . . . 2.4.1 Original Anderson’s algorithm . . . . . . 2.4.2 General algorithm framework . . . . . . 2.4.3 εsolution using the Kozinec’s algorithm 2.4.4 Generalized gradient optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3 3 4 5 5 5 6 8 9 10 11 12 14 15 17 19 21 21 22 25 26 29 31 32 33 34
3 Feature extraction 3.1 Principal Component Analysis . . . . . . . . . . . 3.2 Linear Discriminant Analysis . . . . . . . . . . . 3.3 Kernel Principal Component Analysis . . . . . . . 3.4 Greedy kernel PCA algorithm . . . . . . . . . . . 3.5 Generalized Discriminant Analysis . . . . . . . . . 3.6 Combining feature extraction and linear classiﬁer
1
6. . 4. . . . . . . . . . . . . . . . . . . . . . . . . . .2. . . . . . . 4. . . . . . . . . . . . . . . .1 Optical Character Recognition system . . 5. . . . . . . . .4 Probabilistic output of classiﬁer . . . . . . . . 6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3 OneAgainstAll decomposition for SVM . . . . . .2 Binary Support Vector Machines . 41 41 43 44 45 47 48 50 57 57 58 63 64 66 68 70 73 76 76 77 78 80 82 85 85 87 94 7 Examples of applications 7. . . . . . . . . . 6 Miscellaneous 6. . . . . . . . . . 5. 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7. . . . 5 Support vector and other kernel machines 5. . . . . . . . . . . . . .2. . . .2 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5. . .4 Density estimation and clustering 4. . . . . . . . .1 Gaussian distribution and Gaussian mixture model 4. . . . . . . . . . . . . . . . . . 5. . . . . . . . . . . . . .5 Knearest neighbors classiﬁer . . . . . . . . 4. . . . . . . . 4. .5 Kmeans clustering .3 Minimax estimation . . . . . .3 Bayesian classiﬁer . . . . . . . . . . . . . . . . . . . 6. . . . . . 8 Conclusions and future planes 2 . . . . . . . . .7 Preimage problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1 Data sets . . .6 Kernel Fisher Discriminant . . . . . . . . . . . . . 5.2 Incomplete data . . . . . . . . . . . . . . . . . . . .1 Support vector classiﬁers . . . . . . . . . . . . . . . . . . .2 MaximumLikelihood estimation of GMM . . . . . . . 5. . . . . . . . . . . . . . . . . . . . . . . . . . . .4 Quadratic classiﬁer . .4 OneAgainstOne decomposition for SVM . . . . . . . . . . . . . . . . . . . . . . .5 Multiclass BSVM formulation . . . . . . . . . . . . . . . . . . . 6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1 Complete data . . . . . . . . . . . . .8 Reduced set method .2 Image denoising . . . .
Chapter 1 Introduction
1.1 What is the Statistical Pattern Recognition Toolbox and how it has been developed?
The Statistical Pattern Recognition Toolbox (abbreviated STPRtool) is a collection of pattern recognition (PR) methods implemented in Matlab. The core of the STPRtool is comprised of statistical PR algorithms described in the monograph Schlesinger, M.I., Hlav´ˇ, V: Ten Lectures on Statistical and Structural ac Pattern Recognition, Kluwer Academic Publishers, 2002 [26]. The STPRtool has been e developed since 1999. The ﬁrst version was a result of a diploma project of Vojtˇch Franc [8]. Its author became a PhD student in CMP and has been extending the STPRtool since. Currently, the STPRtool contains much wider collection of algorithms. The STPRtool is available at http://cmp.felk.cvut.cz/cmp/cmp software.html. We tried to create concise and complete documentation of the STPRtool. This document and associated help ﬁles in .html should give all information and parameters the user needs when she/he likes to call appropriate method (typically a Matlab mﬁle). The intention was not to write another textbook. If the user does not know the method then she/he will likely have to read the description of the method in a referenced original paper or in a textbook.
1.2
The STPRtool purpose and its potential users
• The principal purpose of STPRtool is to provide basic statistical pattern recognition methods to (a) researchers, (b) teachers, and (c) students. The STPRtool is likely to help both in research and teaching. It can be found useful by practitioners who design PR applications too.
3
• The STPRtool oﬀers a collection of the basic and the stateofthe art statistical PR methods. • The STPRtool contains demonstration programs, working examples and visualization tools which help to understand the PR methods. These teaching aids are intended for students of the PR courses and their teachers who can easily create examples and incorporate them into their lectures. • The STPRtool is also intended to serve a designer of a new PR method by providing tools for evaluation and comparison of PR methods. Many standard and stateofthe art methods have been implemented in STPRtool. • The STPRtool is open to contributions by others. If you develop a method which you consider useful for the community then contact us. We are willing to incorporate the method into the toolbox. We intend to moderate this process to keep the toolbox consistent.
1.3
A digest of the implemented methods
This section should give the reader a quick overview of the methods implemented in STPRtool. • Analysis of linear discriminant function: Perceptron algorithm and multiclass modiﬁcation. Kozinec’s algorithm. Fisher Linear Discriminant. A collection of known algorithms solving the Generalized Anderson’s Task. • Feature extraction: Linear Discriminant Analysis. Principal Component Analysis (PCA). Kernel PCA. Greedy Kernel PCA. Generalized Discriminant Analysis. • Probability distribution estimation and clustering: Gaussian Mixture Models. ExpectationMaximization algorithm. Minimax probability estimation. Kmeans clustering. • Support Vector and other Kernel Machines: Sequential Minimal Optimizer (SMO). Matlab Optimization toolbox based algorithms. Interface to the SV M light software. Decomposition approaches to train the Multiclass SVM classiﬁers. Multiclass BSVM formulation trained by Kozinec’s algorithm, MitchellDemyanovMolozenov algorithm and Nearest Point Algorithm. Kernel Fisher Discriminant.
4
1.4
Relevant Matlab toolboxes of others
Some researchers in statistical PR provide code of their methods. The Internet oﬀers many software resources for Matlab. However, there are not many compact and well documented packages comprised of PR methods. We found two useful packages which are brieﬂy introduced below. NETLAB is focused to the Neural Networks applied to PR problems. The classical PR methods are included as well. This Matlab toolbox is closely related to the popular PR textbook [3]. The documentation published in book [21] contains detailed description of the implemented methods and many working examples. The toolbox includes: Probabilistic Principal Component Analysis, Generative Topographic Mapping, methods based on the Bayesian inference, Gaussian processes, Radial Basis Functions (RBF) networks and many others. The toolbox can be downloaded from http://www.ncrg.aston.ac.uk/netlab/. PRTools is a Matlab Toolbox for Pattern Recognition [7]. Its implementation is based on the object oriented programming principles supported by the Matlab language. The toolbox contains a wide range of PR methods including: analysis of linear and nonlinear classiﬁers, feature extraction, feature selection, methods for combining classiﬁers and methods for clustering. The toolbox can be downloaded from http://www.ph.tn.tudelft.nl/prtools/.
1.5
STPRtool document organization
The document is divided into chapters describing the implemented methods. The last chapter is devoted to more complex examples. Most chapters begin with deﬁnition of the models to be analyzed (e.g., linear classiﬁer, probability densities, etc.) and the list of the implemented methods. Individual sections have the following ﬁxed structure: (i) deﬁnition of the problem to be solved, (ii) brief description and list of the implemented methods, (iii) reference to the literature where the problem is treaded in details and (iv) an example demonstrating how to solve the problem using the STPRtool.
1.6
How to contribute to STPRtool?
If you have an implementation which is consistent with the STPRtool spirit and you want to put it into public domain through the STPRtool channel then send an email to xfrancv@cmp.felk.cvut.cz. 5
we like to thank for discussions to our colleagues in the Center for Machine Perception. Third. 1. Ukraine who showed us the beauty of pattern recognition and gave us many valuable advices. other STPRtool users and students who used the toolbox in their laboratory exercises. Second. Vernon.I.7 Acknowledgements First. thanks go to several Czech. Of course. we like to thank to Prof. 6 . EU and industrial grants which indirectly contributed to STPRtool development. the STPRtool was rewritten and its documentation radically improved in May and June 2004 with the ﬁnancial help of ECVision European Research Network on Cognitive Computer Vision Systems (IST200135454) lead by D. M. our visitors. Forth. Schlesinger from the Ukrainian Academy of Sciences. Kiev. Provide piece of documentation too which will allow us to incorporate it to STPRtool easily. the Czech Technical University in Prague. your method will be bound to your name in STPRtool.Please prepare your code according to the pattern you ﬁnd in STPRtool implementations.
. Discriminant function associated with the class y ∈ Y. Discriminant function of the binary classiﬁer. . . for instance A. Vectors are implicitly column vectors. Covariance matrix. Labeled training set (more precisely multiset) TXY = {(x1 . .Notation Uppercase bold letters denote matrices. Vectors are denoted by lowercase bold italic letters. Euclidean norm. Matrix determinant. . . Mapping from input space X to the feature space F . z] where the column vector v is (n + m)dimensional. . y1 ). (xl . Feature space. . xl }. The concatenation of column vectors w ∈ Rn . yl )}. c}. j) Dot product between vectors x and x . Scatter matrix. . Classiﬁcation rule. Set of c hidden states – class labels Y = {1. ndimensional input space (space of observations). x·x Σ µ S X ⊆ Rn Y F φ: X → F k: X × X → R TXY TX q: X → Y fy : X → R f: X → R · · det(()·) δ(i. z ∈ Rm is denoted as v = [w. Logical or. Unlabeled training set TX = {x1 . 7 . . for instance x. Cardinality of a set. j) = 1 for i = j and δ(i. Kronecker delta δ(i. j) = 0 otherwise. . . Kernel function (Mercer kernel). Mean vector (mathematical expectation).
2 and Section 2. The input vector x ∈ Rn is assigned to the class y ∈ Y its corresponding discriminant function fy attains maximal value y = argmax fy (x) = argmax ( wy · x + by ) . 2}. . ∀y ∈ Y . ∀y ∈ Y introduce bias to the discriminant functions.1) In the particular binary case Y = {1.1. The scalars by . 8 . yl )} containing pairs of observations xi ∈ Rn and corresponding class labels yi ∈ Y. the linear classiﬁer is represented by a single discriminant function f (x) = w · x + b .3.2) The datatype used to describe the linear classiﬁer and the implementation of the classiﬁer is described in Section 2. . 2 if f (x) = w · x + b < 0 . (2. . y ∈Y y ∈Y (2. . which are linear with respect to both the input vector x ∈ Rn and their parameter vectors w ∈ Rn . c} is composed of a set of discriminant functions fy (x) = w y · x + by . .Chapter 2 Linear Discriminant Function The linear classiﬁcation rule q: X ⊆ Rn → Y = {1. (xl . 2. given by the parameter vector w ∈ Rn and bias b ∈ R. 2} as follows q(x) = 1 if f (x) = w · x + b ≥ 0 . The problem is described by a ﬁnite set T = {(x1 . The input vector x ∈ Rn is assigned to class y ∈ {1. . . The methods using this type of training data is described in Section 2. y1 ). . Following sections describe supervised learning methods which determine the parameters of the linear classiﬁer based on available knowledge. The implemented learning methods can be distinguished according to the type of the training data: I.
andrerr Classiﬁcation error of the Generalized Anderson’s task. Example: Linear classiﬁer The example shows training of the Fisher Linear Discriminant which is the classical example of the linear classiﬁer. ekozinec Kozinec’s algorithm for epsoptimal separating hyperplane.mat is used for training.1 Linear classiﬁer The STPRtool uses a speciﬁc structure array to describe the binary (see Table 2.4). fldqp Fisher Linear Discriminant using Quadratic Programming. 6]. demo anderson Demo on Generalized Anderson’s task. androrig Original method to solve the AndersonBahadur’s task.1. fld Fisher Linear Discriminant. ggradander Gradient method to solve the Generalized Anderson’s task. The implementation of the linear classiﬁer itself provides function linclass. References: The linear discriminant function is treated throughout for instance in books [26. The parameters of class conditional distributions are (completely or partially) known. perceptron Perceptron algorithm to train binary linear classiﬁer. demo linclass Demo on the algorithms learning linear classiﬁers. 2.II. ganders Solves the Generalized Anderson’s task. 9 . The Anderson’s task and its generalization are representatives of such methods (see Section 2. The summary of implemented methods is given in Table 2.2) and the multiclass (see Table 2. Table 2.3) linear classiﬁer. The Riply’s data set riply trn.1: Implemented methods: Linear discriminant function linclass Linear classiﬁer (linear machine). eanders Epsilonsolution of the Generalized Anderson’s task. The resulting linear classiﬁer is then evaluated on the testing data riply tst.mat. mperceptr Perceptron algorithm to train multiclass linear machine.
2) with zero training error is equivalent to ﬁnding the hyperplane H = {x: w · x + b = 0} which separates the training vectors of the ﬁrst y = 1 and the second y = 2 class. y = 1. .b [1 × 1] Bias b ∈ R of the hyperplane.2 Linear separation of ﬁnite sets of vectors The input training data T = {(x1 .b [c × 1] Parameters by ∈ R. . The deﬁnitions of these tasks are given below. . If the inequalities (2.fun = ’linclass’ Identiﬁes function associated with this data type.1080 % % % % % load training data load testing data train FLD classifier classify testing data evaluate testing error 2. Binary linear classiﬁer (structure array): .X.Table 2. The Perceptron (Section 2. yl )} consists of pairs of observations xi ∈ Rn and corresponding class labels yi ∈ Y = {1. . .fun = ’linclass’ Identiﬁes function associated with this data type. .1) 10 . c of the linear discriminant functions fy (x) = w y · x + b. Table 2.3) with respect to the vector w ∈ Rn and the scalar b ∈ R.3) have a solution then the training data T is linearly separable. . model = fld( trn ). yi = 1 .y ) ans = 0. . tst = load(’riply_tst’). model ).W [n × c] Matrix of parameter vectors w y ∈ Rn . . . y1 ).W [n × 1] The normal vector w ∈ Rn of the separating hyperplane f (x) = w · x + b. . yi = 2 . The problem of training the binary (c = 2) linear classiﬁer (2. Multiclass linear classiﬁer (structure array): . (xl .2. . . . ypred = linclass( tst. cerror( ypred. . The problem is formally deﬁned as solving the set of linear inequalities w · xi + b ≥ 0 . . trn = load(’riply_trn’). tst.3: Datatype used to describe multiclass linear classiﬁer. y = 1. (ii) training εoptimal hyperplane and (iii) training the multiclass linear classiﬁer. c. c}. . The implemented methods solve the task of (i) training separating hyperplane for binary case c = 2. . (2.2: Datatype used to describe binary linear classiﬁer. . w · xi + b < 0 . .
b∗ ). i = 1. where the vector v ∈ Rn+1 is constructed as v = [w. b∗ ) = argmax m(w. b) ≤ ε . l .2.2. The optimal separating hyperplane H∗ = {x ∈ Rn : w∗ · x + b∗ = 0} separates training data T with maximal margin m(w ∗ .b = argmax min min i∈I1 w · xi + b w · xi + b . (2. The problem of training the separating hyperplane (2. . yl )} of binary labeled yi ∈ {1. . b∗ ) − m(w.2.6) for linearly separable training data T can be solved by the modiﬁed Perceptron (Section 2. . 2. . .1 Perceptron algorithm The input is a set T = {(x1 . The parameter ε ≥ 0 deﬁnes closeness to the optimal solution in terms of the margin. . The εoptimal solution can be found by Kozinec’s algorithm (Section 2.2). l . . The task (2. . (2. b] . The εoptimal solution is deﬁned as the vector w and scalar b such that the inequality m(w ∗ .4) where I1 = {i: yi = 1} and I2 = {i: yi = 2} are sets of indices. min − w w i∈I2 . This task is equivalent to training the linear Support Vector Machines (see Section 5).8) i = 1. (2. . y1). (xl . (2.3) can be formally rewritten to a simpler form v · zi > 0 .6) with respect to the vectors wy ∈ Rn .3) or Kozinec’s algorithm. yi = y . . y ∈ Y and scalars by ∈ R.2.and Kozinec’s (Section 2. 11 (2. . If the training data is linearly separable then the optimal separating hyperplane can be deﬁned the solution of the following task (w∗ . Therefore the numerical algorithms seeking the approximate solution are applied instead. y ∈ Y. An interactive demo on the algorithms separating the ﬁnite sets of vectors by a hyperplane is implemented in demo linclass.5) holds. . The optimal separating hyperplane cannot be found exactly except special cases. 2} training vectors xi ∈ Rn . b) w. The problem of training the multiclass linear classiﬁer c > 2 with zero training error is formally stated as the problem of solving the set of linear inequalities w yi · xi + byi > wy · xi + by .2) algorithm are useful to train the binary linear classiﬁers from the linearly separable data.b w.7) .
w2 which converge to the vector w ∗ and w ∗ respectively. w∗ = 1 2 argmin w1 ∈X1 . w1 . % generate training data % call perceptron % plot training data % plot separating hyperplane 2. . 1).3). The Kozinec’s algorithm builds a series of vectors w 1 . 2} training (0) (1) (t) vectors xi ∈ Rn . .1. References: Analysis of the Perceptron algorithm including the Novikoﬀ’s proof of convergence can be found in Chapter 5 of the book [26]. . . The Novikoﬀ theorem ensures that the Perceptron algorithm stops after ﬁnite number of iterations t if the training data are linearly separable. The 1 2 vectors w∗ and w ∗ are the solution of the following task 1 2 w∗.7) with respect to the unknown vector v ∈ Rn+1 is equivalent to the original task (2. . . Example: Training binary linear classiﬁer with Perceptron The example shows the application of the Perceptron algorithm to ﬁnd the binary linear classiﬁer for synthetically generated 2dimensional training data. 50.8). 1] if yi = 1 . The found classiﬁer is visualized as a separating hyperplane (line in this case) H = {x ∈ R2 : w · x + b = 0}. 1] if yi = 2 .7) is satisﬁed. . figure. Perceptron algorithm is implemented in function perceptron.2 Kozinec’s algorithm The input is a set T = {(x1 . . where X1 stands for the convex hull of the training vectors of the ﬁrst class X1 = {xi : yi = 1} and X2 for the convex hull of the second class likewise. . . The parameters (w. . . . . model = perceptron( data ). .w2 ∈X2 w1 − w2 . . See Figure 2. pline( model ). yl )} of binary labeled yi ∈ {1. The initial vector v (0) can be set arbitrarily (usually v = 0). v (t) until the set of inequalities (2. The generated training data contain 50 labeled vectors which are linearly separable with margin 1. The vector w∗ = 12 . The Perceptron from the neural network perspective is described in the book [3].9) The problem of solving (2. v (1) . y1). (xl . (2. data = genlsdata( 2. b) of the linear classiﬁer are obtained from the found vector v by inverting the transformation (2. w 2 . . . w 1 (0) (1) (t) and w 2 . The Perceptron algorithm is an iterative procedure which builds a series of vectors (0) v . z l } are deﬁned as zi = [xi .2.and transformed training data Z = {z 1 . ppatterns( data ). . −[xi . .
The Kozinec’s algorithm is proven to converge to the vectors w∗ and w ∗ in inﬁnite 1 2 number of iterations t = ∞.1: Linear classiﬁer trained by the Perceptron algorithm w∗ − w ∗ and the bias b∗ = 1 ( w ∗ 2 − w∗ 2 ) determine the optimal hyperplane (2. The Kozinec’s algorithm can be also used to solve a simpler problem of ﬁnding the separating hyperplane (2. The separating hyperplane (2. Notice that setting ε = 0 force the algorithm to seek the optimal hyperplane which is generally ensured to be found in an inﬁnite number of iterations t = ∞.−30 −35 −40 −45 −50 −55 −55 −50 −45 −40 −35 −30 Figure 2. Therefore the following two stopping conditions are implemented: I.4) 1 2 2 1 2 separating the training data with the maximal margin.3) is sought for (ε < 0).5) is sought for (ε ≥ 0). If the εoptimal optimality stopping condition is used then the Kozinec’s algorithm converges in ﬁnite number of iterations. Example: Training εoptimal hyperplane by the Kozinec’s algorithm The example shows application of the Kozinec’s algorithm to ﬁnd the (ε = 0. II.3). The εoptimal hyperplane (2. The generated training data contains 50 labeled vectors which are linearly separable with 13 . The Kozinec’s algorithm which trains the binary linear classiﬁer is implemented in the function ekozinec. References: Analysis of the Kozinec’s algorithm including the proof of convergence can be found in Chapter 5 of the book [26]. The Kozinec’s algorithm is proven to converge in a ﬁnite number of iterations if the separating hyperplane exists.01)optimal hyperplane for synthetically generated 2dimensional training data.
The inequalities (2.10) where the vector v ∈ R(n+1)c contains the originally optimized parameters v = [w1 .margin 1. .01)). . . (xl .1) is equivalent to solving the set of linear inequalities (2.2. The task is to design the linear classiﬁer (2. . . yl )} consists of pairs of observations xi ∈ Rn and corresponding class labels yi ∈ Y = {1. It could be veriﬁed that the found hyperplane has margin model. l . b2 . b1 .margin which satisﬁes the εoptimality condition. (2. struct(’eps’.3 Multiclass linear classiﬁer The input training data T = {(x1 . . bc ] . y = Y \ {yi } . .6). y1 ). % plot found hyperplane 8 6 4 2 0 −2 −4 −6 −8 −10 −10 −8 −6 −4 −2 0 2 4 6 8 10 Figure 2. . data = genlsdata(2. . i i = 1. . The found classiﬁer is visualized (Figure 2. .50. ppatterns(data). . % plot data pline(model). w c . . .0. 14 . c}.2: εoptimal hyperplane trained with the Kozinec’s algorithm. % generate data % run ekozinec model = ekozinec(data.1) which classiﬁes the training data T without error. .2) as a separating hyperplane H = {x ∈ R2 : w · x + b = 0} (straight line in this case). 2.6) can be formally transformed to a simpler set v · zy > 0 . The problem of training (2.1). figure. . w 2 .
The transformed task (2. . .11) where z y (j) stands for the jth slot between coordinates (n + 1)(j − 1) + 1 and (n + 1)j. Let Iy = {i: yi = y}. y1).The set of vectors z y ∈ R(n+1)c . µy = 15 . 2} . References: The Kesler’s construction is described in books [26. (2. ppatterns(data). w · SW w 1 xi . The described method is implemented in function mperceptron. data = load(’pentagon’). . ∈ Y \ {yi } is created from the training for j = yi . pboundary(model). . . % load training data % run training algorithm % plot training data % plot decision boundary 2.10) can be solved by the Perceptron algorithm. otherwise. respectively. i The described transformation is known as the Kesler’s construction. The simple form of the Perceptron updating rule allows to implement the transformation implicitly without mapping the training data T into (n + 1)cdimensional space. for j = y . . 6]. 1] . z i (j) = ⎩ 0. 2} training vectors xi ∈ Rn .3 Fisher Linear Discriminant The input is a set T = {(x1 .12) where SB is the betweenclass scatter matrix SB = (µ1 − µ2 )(µ1 − µ2 )T . A user’s own data can be generated interactively using function createdata(’finite’. Iy  i∈I y (2. 1] . (xl .10) (number 10 stands for number of classes).mat. l. .3). y ∈ {1. y ∈ {1. 2} be sets of indices of training vectors belonging to the ﬁrst y = 1 and the second y = 2 class. Example: Training multiclass linear classiﬁer by the Perceptron The example shows application of the Perceptron rule to train the multiclass linear classiﬁer for the synthetically generated 2dimensional training data pentagon. y i set T such that ⎧ ⎨ [xi . i = 1. figure. . The found classiﬁer is visualized as its separating surface which is known to be piecewise linear and the class regions should be convex (see Figure 2. y −[xi . yl )} of binary labeled yi ∈ {1. model = mperceptron(data). The class separability in a direction w ∈ Rn is deﬁned as F (w) = w · SB w .
13) using the matrix inversion w = SW −1 (µ1 − µ2 ) . The STPRtool contains two implementations of FLD: I. holds.8 −0.8 0. computes such b that the equality w · µ1 + b = −( w · µ2 + b) .4 −0.6 0. the parameter vector w of the linear discriminant function f (x) = w · x + b is determined to maximize the class separability criterion (2. y ∈ {1.2 −0.12) w = argmax F (w ) = argmax w w w · SB w .8 −1 −1 −0.4 −0.4 0.4 0. 16 .2 0 −0.2 0 0.2 0. and SW is the within class scatter matrix deﬁned as SW = S1 + S2 . for instance. The STPRtool implementation.3: Multiclass linear classiﬁer trained by the Perceptron algorithm. This case is implemented in the function fld.6 0. Sy = i∈Yy (xi − µy )(xi − µy )T . 2} .8 1 Figure 2.6 −0. The classical solution of the problem (2.6 −0.1 0.13) The bias b of the linear rule must be determined based on another principle. In the case of the Fisher Linear Discriminant (FLD). w · SW w (2.
y ∈ {1. The class conditional distributions pXY (xy). The found linear classiﬁer is visualized (see Figure 2. Let q: X ⊆ Rn → {1. 2} are known to be multivariate Gaussian distributions. However. Σi ): i ∈ I1 }. Example: Fisher Linear Discriminant The binary linear classiﬁer based on the Fisher Linear Discriminant is trained on the Riply’s data set riply trn. Σ2 ) of these class distributions are unknown. The method is implemented in the function fldqp. Σi ): i ∈ I2 }. cerror(ypred.4 Generalized Anderson’s task The classiﬁed object is described by the vector of observations x ∈ X ⊆ Rn and a binary hidden state y ∈ {1. tst = load(’riply_tst’). This method can be useful when the matrix inversion is hard to compute. Σ1 ) belong to a certain ﬁnite set of parameters {(µi . the function fldqp can be replaced by the standard approach implemented in the function fld which would yield the same solution. Similarly (µ2 . Σ2 ) belong to a ﬁnite set {(µi . However.4) and evaluated on the testing data riply tst. w subject to w · (µ1 − µ2 ) = 2 .1080 % load training data % compute FLD % % % % plot data and solution load testing data classify testing data compute testing error 2. figure.13) can be reformulated as the quadratic programming (QP) task w = argmin w · SW w . ypred = linclass(tst. it is known that the parameters (µ1 . References: The Fisher Linear Discriminant is described in Chapter 3 of the book [6] or in the book [3]. pline(model).mat. The QP formulation fldqp of FLD is used. 2} 17 . The problem (2. Σ1 ) and (µ2 . The parameters (µ1 .model).X.y) ans = 0. The QP solver quadprog of the Matlab Optimization Toolbox is used in the toolbox implementation. model = fldqp(trn). ppatterns(trn).mat. trn = load(’riply_trn’).tst. 2}.II.
14) when I1  = 1 and I2  = 1. 18 . b. b∗ ) of the linear classiﬁer 1 . µi .5 1 Figure 2. b∗ ) = argmin Err(w. such that the error Err(w∗ . b. Σi ) . The probability ε(w.5 0 0. be a binary linear classiﬁer (2. µi . Σi ) .4: Linear classiﬁer based on the Fisher Linear Discriminant.b w. Σi ) and the nearest vector of the separating hyperplane H = {x ∈ Rn : w · x + b = 0}.b i∈I1 ∪I2 (2.e. for f (x) = w∗ · x + b∗ < 0 . µi .1. µi . r i = min (µi − x) · (Σi )−1 (µi − x) = x∈H w · µi + b w · Σi w . i∈I1 ∪I2 where ε(w.2 0 −0.2) with discriminant function f (x) = w · x + b. The Generalized Anderson’s task (GAT) is to ﬁnd the parameters (w ∗ . In other words.2 1 0. q(x) = 2 . for f (x) = w∗ · x + b∗ ≥ 0 . b. b.6 0. b∗ ) is minimal (w ∗ . i. b) = max ε(w.14) The original Anderson’s task is a special case of (2.4 0.8 0.. Σi ) is probability that the Gaussian random vector x with mean vector µi and the covariance matrix Σi satisﬁes q(x) = 1 for i ∈ I2 or q(x) = 2 for i ∈ I1 . The probability of misclassiﬁcation is deﬁned as Err(w. it is the probability that the vector x will be misclassiﬁed by the linear rule q. w. b) = argmin max ε(w.5 −1 −0.2 −1. Σi ) is proportional to the reciprocal of the Mahalanobis distance r i between the (µi .
b) is less than 0. b) is not diﬀerentiable.4. 1−λ = λ w · Σ2 w . the objective function F (w. µi .1 Original Anderson’s algorithm The algorithm is implemented in the function androrig. 2. The problem is solved by an iterative algorithm which stops while the condition γ− 1−λ ≤ε. b.5. An interactive demo on the algorithms solving the Generalized Anderson’s task is implemented in demo anderson. b) = argmax min w.b w. References: The original Anderson’s task was published in [1]. 2π (2. b. In this case. µi . However.The exact relation between the probability ε(w. Example: Solving the original Anderson’s task 19 . w · Σi w which is more suitable for optimization. The parameter ε deﬁnes the closeness to the optimal solution. the optimal solution vector w ∈ Rn is obtained by solving the following problem w = ((1 − λ)Σ1 + λΣ2 )−1 (µ1 − µ2 ) .15) The optimization problem (2. w · Σ1 w where the scalar 0 < λ < 1 is unknown. A detailed description of the Generalized Anderson’s task and all the methods implemented in the STPRtool is given in book [26]. Σi ) and the corresponding Mahalanobis distance r i is given by the integral ε(w. λ γ= w · Σ2 w . The objective function F (w.b i∈I1 ∪I2 w · µi + b . This algorithm solves the original Anderson’s task deﬁned for two Gaussians only I1  = I2  = 1.14) can be equivalently rewritten as (w ∗ . w · Σ1 w is satisﬁed. The STPRtool contains implementations of the algorithm solving the original Anderson’s task as well as implementations of three diﬀerent approaches to solve the Generalized Anderson’s task which are described bellow. b∗ ) = argmax F (w. Σi ) = ∞ ri 1 2 1 √ e− 2 t dt . b) is proven to be convex in the region where the probability of misclassiﬁcation Err(w.
20 .2 0 0.2 0. % ypred = linclass( tst. % tst = load(’riply_tst’).8 −0.5: Solution of the original Anderson’s task.1150 load training data estimate Gaussians solve Anderson’s task plot solution load testing data classify testing data evaluate error 0.1 −0. pandr( model.5 0.The original Anderson’s task is solved for the Riply’s data set riply trn.6 Figure 2. % distrib = mlcgmm(data). Σ1 ) and (µ2 .929 0.X. The found linear classiﬁer is visualized and the Gaussian distributions (see Figure 2. Σ2 ) of the Gaussian distribution of the ﬁrst and the second class are obtained by the MaximumLikelihood estimation. model ). trn = load(’riply_trn’).2 0. % figure.3 0.6 −0.37 0.4 gauss 1 0.mat. % cerror( ypred.9 1.4 −0.8 0.4 0. tst.mat. distrib ).6 0. The classiﬁer is validated on the testing set riply tst. % model = androrig(distrib). The parameters (µ1 .5).7 gauss 2 0.y ) % ans = 0.
4.3 εsolution using the Kozinec’s algorithm This algorithm is implemented in the function eanders. The algorithm iterates until the minimal change in the objective function is less than the prescribed ε.2. The new parameters are set w(t+1) : = w (t) + k∆w . The number of interval cuttings of is an optional parameter of the algorithm. b) < ε 21 . ∆b) in the parameter space is computed such that moving along this direction increases the objective function F .mat. The optimal movement k is determined by 1Dsearch based on the cutting interval algorithm according to the Fibonacci series. distrib = load(’mars’). II. b(t+1) ) − F (w (t) . the condition F (w(t+1) . The εsolution is deﬁned as a linear classiﬁer with parameters (w.6).4. model = ganders(distrib). figure. % load input Gaussians % solve the GAT % plot the solution 2. The quadprog function of the Matlab optimization toolbox is used to solve this subtask. is violated. ∆b). b(t) ) > ε . The found linear classiﬁer as well as the input Gaussian distributions are visualized (see Figure 2. b(t+1) : = b(t) + k∆b .2 General algorithm framework The algorithm is implemented in the function ganders. Example: Solving the Generalized Anderson’s task The Generalized Anderson’s task is solved for the data mars. This algorithm repeats the following steps: I. The optimal movement k along the direction (∆w.e. Both the ﬁrst and second class are described by three 2Ddimensional Gaussian distributions.. b) such that F (w. An improving direction (∆w.distrib). III. i. pandr(model.
is satisﬁed.06 (it is 6%).4. The problem is transformed to the task of separation of two sets of ellipsoids which is solved the Kozinec’s algorithm. = load(’mars’).0483 4 3 0.0461 gauss 2 gauss 3 2 gauss 1 0. Otherwise it can get stuck in an endless loop.6: Solution of the Generalized Anderson’s task. The desired maximal possible probability of misclassiﬁcation is set to 0. distrib options model = figure. pandr(model. The objective function F (w. b) is convex but not diﬀerentiable thus the standard gradient optimization cannot 22 . = struct(’err’. The algorithm converges in ﬁnite number of iterations if the given εsolution exists.7). % % % % load Gaussians maximal desired error training visualization 2. 0.0503 1 0 gauss 5 0.5) is the desired upper bound on the probability of misclassiﬁcation.06’).mat.0.0421 gauss 6 gauss 4 0.options). Example: εsolution of the Generalized Anderson’s task The εsolution of the generalized Anderson’s task is solved for the data mars. The prescribed ε ∈ (0.4 Generalized gradient optimization This algorithm is implemented in the function ggradandr. The found linear classiﬁer and the input Gaussian distributions are visualized (see Figure 2. Both the ﬁrst and second class are described by three 2Ddimensional Gaussian distributions.distrib).0503 −1 −2 −3 −4 −5 −4 −3 −2 −1 0 1 2 3 4 5 Figure 2.047 0. eanders(distrib.5 0.
5 4 3 gauss 2 0.8). Example: Solving the Generalized Anderson’s task using the generalized gradient optimization The generalized Anderson’s task is solved for the data mars. The generalized gradient (∆w. The maximal number of iterations is limited to tmax=10000.e. b(t) ) > ε . is violated. the condition F (w(t+1) .. II. 23 . The maximal number of iterations can be used as the stopping condition as well. The algorithm repeats the following two steps: I.0516 −3 0. be used.mat.0473 0.0516 gauss 3 2 1 0. b(t+1) ) − F (w (t) . ∆b) is computed.7: εsolution of the Generalized Anderson’s task with maximal 0. the generalized gradient optimization method can be applied. The optimized parameters are updated w(t+1) : = w (t) + ∆w .06 probability of misclassiﬁcation. However.0483 gauss 5 0 −1 gauss 6 −2 gauss 4 0. b(t+1) = b(t) + ∆b . The found linear classiﬁer as well as the input Gaussian distributions are visualized (see Figure 2.0432 −4 −5 −4 −3 −2 −1 0 1 2 3 4 5 Figure 2.0496 gauss 1 0. i. Both the ﬁrst and second class are described by three 2Ddimensional Gaussian distributions. The algorithm iterates until the minimal change in the objective function is less than the prescribed ε.
distrib = load(’mars’).0434 −4 −5 −4 −3 −2 −1 0 1 2 3 4 5 Figure 2. options = struct(’tmax’.0475 1 0 gauss 5 0. pandr( model.0485 gauss 6 gauss 4 −2 −3 0. model = ggradandr(distrib.0518 −1 0.0518 gauss 2 gauss 3 3 2 gauss 1 0. % load Gaussians % max # of iterations % training % visualization 5 4 0.8: Solution of the Generalized Anderson’s task after 10000 iterations of the generalized gradient optimization algorithm.options). distrib ).10000).0498 0. 24 . figure.
. x2 . The linear data projection is implemented in function linproj (See examples in Section 3. . . This yields a nonlinear (kernel) projection of data which has generally deﬁned as z = AT k(x) + b . n are entries of the vector x ∈ Rn . . The linear projection z = WT x + b . . The quadratic data mapping is implemented in function qmap. the input vectors x ∈ Rn are mapped into a new higher dimensional feature space F in which the linear methods are applied. . where xi . .1) is a representative of the unsupervised learning method which yields the linear projection (3. The quadratic mapping φ: Rn → Rm is deﬁned as φ(x) = [ x1 . . The kernel functions allow for nonlinear extensions of the linear feature extraction methods. The projection is given by the matrix W [n × m] and bias vector b ∈ Rm . . . x1 x1 . .1). The Linear Discriminant Analysis (LDA) (Section 3.Chapter 3 Feature extraction The feature extraction step consists of mapping the input vector of observations x ∈ Rn onto a new feature description z ∈ Rm which is more suitable for given task. (3. . . . 25 (3. The Principal Component Analysis (PCA) (Section 3. . .3) (3.2) .2). The quadratic data mapping projects the input vector x ∈ Rn onto the feature space Rm where m = n(n + 3)/2.1) is commonly used. i = 1. x1 x2 . . In this case. x2 x2 . x1 xn . xn xn ]T .1 and Section 3. . . x2 xn . xn .2) is a representative of the supervised learning method yielding the linear projection. The data type used to describe the linear data projection is deﬁned in Table 3.
fun = ’linproj’ Identiﬁes function associated with this data type. . .3 and Section 3. The summary of implemented methods for feature extraction is given in Table 3. Table 3. The Kernel PCA (Section 3.3).1 Table 3.3) and the Generalized Discriminant Analysis (GDA) (Section 3.5).W [n × m] Projection matrix W = [w1 . pca Principal Component Analysis. . . pcarec Computes reconstructed vector after PCA projection. Linear data projection (structure array): . greedykpca Greedy Kernel Principal Component Analysis.1: Implemented methods for feature extraction linproj Linear data projection. . . . . kerproj Kernel data projection. . . lin2quad Merges linear rule and quadratic mapping. x1 ).where A [l × m] is a parameter matrix and b ∈ Rm a parameter vector. k(x. . The vector k(x) = [k(x. The extracted features are projections of φ(x) onto m vectors (or functions) from the feature space F . . . lda Linear Discriminant Analysis. .5) are representatives of the feature extraction methods yielding the kernel data projection (3.1 Principal Component Analysis Let TX = {x1 . z l } is a lower dimensional representation of the 26 . xl )]T is a vector of kernel functions centered in the training data or their subset. lin2svm Merges linear rule and kernel projection. . . wm ]. The set of vectors TZ = {z 1 . gda Generalized Discriminant Analysis. The kernel data projection is implemented in the function kernelproj (See examples in Section 3. xl } be a set of training vectors from the ndimensional input space Rn . . The data type used to describe the kernel data projection is deﬁned in Table 3. kpcarec Reconstructs image after kernel PCA. qmap Quadratic data mapping. demo pcacomp Demo on image compression using PCA. .b [m × 1] Bias vector b. kpca Kernel Principal Component Analysis.2: Datatype used to describe linear data projection. 3. The x ∈ Rn stands for the input vector and z ∈ Rm is the vector of extracted features.
.arg [1 × p] Kernel argument(s). .X [n × l] Training vectors {x1 . . j) is the Kronecker delta function. i = 1. where µ is the sample mean of the training data. .7) subject to wi · wj = δ(i. . . The vector b equals to WT µ.options. . j) . . A demo showing the use of PCA for image compression is implemented in the script demo pcacomp. .3: Datatype used to describe kernel data projection. ∀i.Table 3.b [m × 1] Bias vector b. . .sv. .b (3.7) is the matrix W = [w 1 . The parameters (W. Kernel data projection (structure array): .6) is a function of the parameters of the linear projections (3.4) which allows for the minimal mean square reconstruction error (3. .4). The vectors TZ are obtained by the linear orthonormal projection z = WT x + b . wm ] and δ(i. .5) obtained by inverting (3. . . The PCA is implemented in the function pca.Alpha [l × m] Parameter matrix A [l × m]. . The Principal Component Analysis (PCA) is the linear orthonormal projection (3. . xl }. . b ) . The reconstructed vectors TX = {x1 . b) = l l ˜ xi − xi i=1 2 . W .4) where the matrix W [n × m] and the vector b [m × 1] are parameters of the projec˜ ˜ tion.4) and (3.ker [string] Kernel identiﬁer. . input training vectors TX in the mdimensional space Rm . (3. The mean square reconstruction error 1 εM S (W. . b) = argmin εM S (W .6) of the training data TX . . m are column vectors of the matrix W = [w1 . . (3. (3.fun = ’kernelproj’ Identiﬁes function associated with this data type.5). 27 . . b) of the linear projection are the solution of the optimization task (W. where w i . w m ] containing the m eigenvectors of the sample covariance matrix which have the largest eigen values. .options. j . xl } are computed by the linear back ˜ projection ˜ x = W(z − b) . The solution of the task (3.
proj.’Reconstructed’. h3 = plot([XR(1.5 7 6. % visualization h1 = ppatterns(X.8 1].mx)].mx] = max(Z). .5 5 4. The training data. hold on.’kx’). The PCA is applied to train the 1dimensional subspace to approximate the input data with the minimal reconstruction error. h2 = ppatterns(XR.model). legend([h1 h2 h3]. data figure. 7. 3.1).[XR(2. reconstr.mx)].[1 0.mn) XR(1. 28 .5 3 2.References: The PCA is treated throughout for instance in books [13.model).5]. axis equal. X = gsamp([5.. [dummy.1).5 6 5.5 2 3 4 5 6 7 Input vectors Reconstructed PCA subspace 8 Figure 3.mn) XR(2. 6].0. ’PCA subspace’). the reconstructed data and the 1dimensional subspace given by the ﬁrst principal component are visualized (see Figure 3.. % % % % generate data train PCA lower dim. Example: Principal Component Analysis The input data are synthetically generated from the 2dimensional Gaussian distribution. model = pca(X.1: Example on the Principal Component Analysis.mn] = min(Z).’bo’). [dummy.’r’).5 4 3.100). ’Input vectors’. XR = pcarec(X. Z = linproj(X.8.
The SB is the betweenclass and SW is the withinclass scatter matrix of projected data which is deﬁned similarly to (3. . figure. y ∈ Y = {1. y ∈ Y are deﬁned as 1 µ= l l xi .mat which consists of the labeled 4dimensional data. The LDA and 29 .2 Linear Discriminant Analysis Let TXY = {(x1 . % % % % load input data train LDA feature extraction plot ext. xi ∈ Rn . The data after the feature extraction step are visualized in Figure 3. (xl . ext_data = linproj(orig_data. such that the class separability criterion F (W) = ˜ det(SB ) det(SB ) .2). . .2.8) SB = y∈Y Iy (µy − µ)(µy − µ)T . data Example 2: PCA versus LDA This example shows why the LDA is superior to the PCA when features good for classiﬁcation are to be extracted. y ∈ Y .3. . orig_data = load(’iris’). Sy = i∈Iy (xi − µi )(xi − µi )T . (3. y1). . 2. Example 1: Linear Discriminant Analysis The LDA is applied to extract 2 features from the Iris data set iris. = ˜ det(SW ) det(SW ) ˜ ˜ is maximized. The withinclass SW and betweenclass SB scatter matrices are described as SW = y∈Y Sy . Iy  i∈I y The goal of the Linear Discriminant Analysis (LDA) is to train the linear data projection z = WT x . The synthetical data are generated from the Gaussian mixture model. c} be labeled set of training vectors.model). The LDA and PCA are trained on the generated data. yl )}. .8). References: The LDA is treated for instance in books [3. . . where the total mean vector µ and the class mean vectors µy . 6]. ppatterns(ext_data). i=1 µy = 1 xi . model = lda(orig_data.
pca_model = pca(data.5 Figure 3. The extracted data using the LDA and PCA model are displayed with the Gaussians ﬁtted by the MaximumLikelihood (see Figure 3.3 0.Mean = [[5. pca_rec = pcarec(data. distrib.4 −0.9. 30 % % % % % mean vectors 1st covariance 2nd covariance Gaussian weights sample data % train LDA % train PCA % visualization . lda_model = lda(data.3b). PCA directions are displayed as connection of the vectors projected to the trained LDA and PCA subspaces and reprojected back to the input space (see Figure 3.Cov(:.250).4] [4. figure.2) = [1 0.X.pca_model). lda_data = linproj(data.lda_model).5]. 0.2 0.5 0.1) = [1 0.1 −0.9 1].Prior = [0. % Generate data distrib.4 0. ppatterns(data).3a).5 −1 −0.5 0 0.9.X. axis equal. hold on. distrib.X.0.5 1 1. pca_data = linproj( data.1 0 −0.:. lda_rec = pcarec(data.2: Example shows 2dimensional data extracted from the originally 4dimensional Iris data set using Linear Discriminant Analysis.9 1]. data = gmmsamp(distrib. 0.1).lda_model).3 −0.pca_model).:.1).2 −0.5]].Cov(:.5 −1. distrib.5 0.
h2 = plot(pca_rec(1. .3 Kernel Principal Component Analysis The Kernel Principal Component Analysis (Kernel PCA) is the nonlinear extension of the ordinary linear PCA. The procedure consists of the following step: I. (3. This procedure is implemented in the function kpcarec for the case of the Radial Basis Function (RBF) kernel. the explicit projection from the feature space F to the input space X usually do not exist.1. figure. 3. . The reconstructed vector φ(x) is given as a linear combination of the mapped data TΦ l ˜ φ(x) = i=1 βi φ(xi ) .2).’r’). . (3.1). subplot(2. .h1 = plot(lda_rec(1.:). 31 . φ(xl )}. The computation of the principal components and the projection on these components can be expressed in terms of dot products thus the kernel functions k: X × X → R can be employed. such that the reconstruction error 1 εKM S (A. .11) In contrast to the linear PCA.’PCA direction’).:). title(’LDA’).10) ˜ is minimized. b) = l l (3. The kernel PCA trains the kernel data projection z = AT k(x) + b . hold on. subplot(2.:). ppatterns(pca_data). title(’PCA’). .’b’).1. pgauss(mlcgmm(pca_data)).:). xl }.lda_rec(2. The input training vectors TX = {x1 . β = A(z − b) .pca_rec(2. . pgauss(mlcgmm(lda_data)). ppatterns(lda_data). The problem is to ﬁnd the vector x ∈ X its image ˜ φ(x) ∈ F well approximates the reconstructed vector φ(x) ∈ F . .’LDA direction’.9) ˜ φ(xi ) − φ(xi ) i=1 2 . The linear PCA is applied on the mapped data TΦ = {φ(x1 ). Project input vector xin ∈ Rn onto its lower dimensional representation z ∈ Rm using (3.9). xi ∈ X ⊆ Rn are mapped by φ: X → F to a high dimensional feature space F . legend([h1 h2].
The function kpcarec is used to ﬁnd preimages of the reconstructed data (3.11) while the size d of the subset TS is kept small. . .1]. (3. The input data are synthetically generated 2dimensional vectors which lie on a circle and are corrupted by the Gaussian noise. % kernel argument options. The vector set TS is a subset of training data TX . Example: Kernel Principal Component Analysis The example shows using the kernel PCA for data denoising. . The result is visualized in Figure 3. % use RBF kernel options. sl )]T are the kernel functions centered in the vectors TS = {s1 .’Reconstructed’). % generate circle data options. such that ˜ xout = argmin φ(x) − φ(xin ) 2 . In contrast to the ordinary kernel PCA. . Let TX = {x1 .9).options). . . The Greedy kernel PCA algorithm is implemented in the function greedykpca. k(x. sd }.250. % output dimension model = kpca(X.arg = 4. xi ∈ Rn be the set of input training vectors. X = gencircledata([1. .ker = ’rbf’. h2 = ppatterns(XR. the subset TS does not contain all the training vectors TX thus the complexity of the projection (3. . ’+r’).new_dim = 2. s1 ).’Input vectors’. .model). 30]. . The kernel PCA model with RBF kernel is trained.5. Computes vector xout ∈ Rn which is good preimage of the reconstructed vector ˜ φ(xin ). . 3.1). The goal is to train the kernel data projection z = AT kS (x) + b .12) where A [d × m] is the parameter matrix. The objective of the Greedy kernel PCA is to minimize the reconstruction error (3. x References: The kernel PCA is described at length in book [29. . xl }. 32 .11). % compute reconstruced data figure. b [m × 1] is the parameter vector and kS = [k(x.4 Greedy kernel PCA algorithm The Greedy Kernel Principal Analysis (Greedy kernel PCA) is an eﬃcient algorithm to compute the ordinary kernel PCA.4. % compute kernel PCA XR = kpcarec(X. legend([h1 h2]. % Visualization h1 = ppatterns(X).12) is reduced compared to (3.II.
arg = 4. In contrast the ordinary kernel PCA (see experiment in Section 3. The function kpcarec is used to ﬁnd preimages of the reconstructed data obtained by the kernel PCA projection. The computation of the projection vectors and the projection on these vectors can be expressed in terms of dot products thus the kernel functions k: X × X → R can be employed. The Greedy kernel PCA model with RBF kernel is trained. . y1). y ∈ Y = {1.’ob’. . The ordinary LDA is applied on the mapped data TΦY = {(φ(x1 ). 3. xi ∈ X ⊆ Rn . % % % % % use RBF kernel kernel argument output dimension size of sel. % visualization h1 = ppatterns(X). legend([h1 h2 h3]. .3) requires all the 250 training vectors. .5 Generalized Discriminant Analysis The Generalized Discriminant Analysis (GDA) is the nonlinear extension of the ordinary Linear Discriminant Analysis (LDA).m = 25.model). yl )}.’+r’). 33 . h2 = ppatterns(XR. . The input data are synthetically generated 2dimensional vectors which lie on a circle and are corrupted with the Gaussian noise.References: The implemented Greedy kernel PCA algorithm is described in [11].1). (xl . . . % generate training data options. . model = greedykpca(X. ’Training set’. Example: Greedy Kernel Principal Component Analysis The example shows using kernel PCA for data denoising. .250. options.options). The training and reconstructed vectors are visualized in Figure 3. c} are mapped by φ: X → F to a high dimensional feature space F . . The number of vectors deﬁning the kernel projection (3. X = gencircledata([1.5. (φ(xl ). h3 = ppatterns(model. The vector of the selected subset TX are visualized as well.’Reconstructed set’. The input training data TXY = {(x1 .. The resulting kernel data projection is deﬁned as z = AT k(x) + b . . . yl )}.sv.. y1 ).ker = ’rbf’.new_dim = 2.12). options. subset run greedy algorithm XR = kpcarec(X.12) is limited to 25.5.X. options.1].’Selected subset’). % reconstructed vectors figure. .
g. . y1 ).6.mat which contains 3 classes of 4dimensional data.2) used as the feature extraction step.options). .model).new_dim = 2. . . . . (z l .6 Combining feature extraction and linear classiﬁer Let TXY = {(x1 . as m f (x) = w · φ(x) + b = i=1 wiφi (x) + b . Notice. . Let φ: X → Rm be a mapping φ(x) = [φ1 (x). options. z i ∈ Rm . . (φ(xl ). yl )} be a training set of observations xi ∈ X ⊆ Rn and corresponding hidden states yi ∈ Y = {1. the discriminant function can be written in the following form f (x) = x · Ax + x · b + c . (xl . . In the case of the quadratic data mapping (3. . The GDA is implemented in the function gda.where A [l × m] is the matrix and b [m × 1] the vector trained by the GDA. The extracted data YZY are visualized in Figure 3. . References: The GDA was published in the paper [2]. Example: Generalized Discriminant Analysis The GDA is trained on the Iris data set iris. .arg = 1. visualization 3. The function f can be seen as a nonlinear function of the input vectors x ∈ X . .. The feature extraction step applied to the input data TXY yields extracted data TΦY = {(φ(x1 ). options. b) are trained to increase the betweenclass scatter and decrease the withinclass scatter of the extracted data TZY = {(z 1 . orig_data = load(’iris’). % % % % % % % load data use RBF kernel kernel arg. ext_data = kernelproj(orig_data. options. The RBF kernel with kernel width σ = 1 is used. yl )}. ppatterns(ext_data). .13) . . . yl )}. . Let f (x) = w · φ(x) + b be a linear discriminant function found by an arbitrary training method on the extracted data TΦY . φm (x)]T deﬁning a nonlinear feature extraction step (e. . . y1 ). y1 ). output dimension train GDA feature ext. quadratic data mapping or kernel projection).ker = ’rbf’. model = gda(orig_data. 34 (3. figure. that setting σ too low leads to the perfect separability of the training data TXY but the kernel projection is heavily overﬁtted. . c}. The parameters (A.
The Perceptron algorithm produces the linear rule only.14) The discriminant function (3.4.g. % load training data living in the input space input_data = load(’vltava’). pboundary(quad_model). % compute parameters of the quadratic classifier quad_model = lin2quad(lin_model). (3.13) deﬁnes the quadratic classiﬁer described in Section 6. Figure 3. fld) and (iii) combining the quadratic mapping and the linear rule.7 shows the decision boundary of the quadratic classiﬁer. The Fisher Linear Discriminant was used in this example.. ppatterns(input_data). perceptron. Function lin2quad is used to compute the parameters of the quadratic classiﬁer explicitly based on the found linear rule in the feature space and the mapping. (ii) training a linear rule on the mapped data (e. The 35 . The combining is implemented in function lin2svm.3) applied as the feature extraction step. In the case of the kernel projection (3. Therefore the quadratic classiﬁer can be trained by (i) applying the quadratic data mapping qmap.The discriminant function (3. The combining is implement in function lin2quad. The input training data are linearly nonseparable but become separable after the quadratic data mapping is applied. % visualize the quadratic classifier figure. Example: Quadratic classiﬁer trained the Perceptron This example shows how to train the quadratic classiﬁer using the Perceptron algorithm. (ii) training a linear rule on the extracted data and (iii) combing the kernel projection and the linear rule. the resulting discriminant function has the following form f (x) = α · k(x) + b . % train linear rule in the feature sapce lin_model = perceptron(map_data). % map data to the feature space map_data = qmap(input_data). Example: Kernel classiﬁer from the linear rule This example shows how to train the kernel classiﬁer using arbitrary training algorithm for linear rule.14) deﬁnes the kernel (SVM type) classiﬁer described in Section 5. This classiﬁer can be trained by (i) applying the kernel projection kernelproj.
’new_dim’. The Riply’s data set was used for training and evaluation.options).X.y) ans = 0.X.kfd_model).greedy kernel PCA algorithm was applied as the feature extraction step. Figure 3. kpca_model = greedykpca(trn. the training data and vectors used in the kernel expansion of the discriminant function. cerror(ypred.10).’arg’. ppatterns(kfd_model. % train kernel PCA by greedy algorithm options = struct(’ker’.sv. % load input data trn = load(’riply_trn’).13).X.’rbf’.’ok’.lin_model). % project data map_trn = kernelproj(trn.0970 36 . pboundary(kfd_model).kpca_model). ppatterns(trn).8 shows the found kernel classiﬁer. % evaluation on testing data tst = load(’riply_tst’). % train linear classifier lin_model = fld(map_trn). % combine linear rule and kernel PCA kfd_model = lin2svm(kpca_model. ypred = svmclass(tst.tst. % visualization figure.1.
7
LDA direction PCA direction
6
5
4
3
2
1
0
1
2
3
4
5
6
7
8
(a)
LDA 1.4 1.2 1 0.8 0.6 0.4 0.2 0 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2
PCA 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 −4 −3 −2 −1 0 1 2 3 4 5
(b) Figure 3.3: Comparison between PCA and LDA. Figure (a) shows input data and found LDA (red) and PCA (blue) directions onto which the data are projected. Figure (b) shows the class conditional Gaussians estimated from data projected data onto LDA and PCA directions.
37
10 Input vectors Reconstructed 8
6
4
2
0
−2
−4
−6
−8 −8
−6
−4
−2
0
2
4
6
8
10
Figure 3.4: Example on the Kernel Principal Component Analysis.
8 Training set Reconstructed set Selected set 6
4
2
0
−2
−4
−6 −8
−6
−4
−2
0
2
4
6
8
10
Figure 3.5: Example on the Greedy Kernel Principal Component Analysis.
38
0.15
0.1
0.05
0
−0.05
−0.1
−0.15
−0.2 −0.6
−0.5
−0.4
−0.3
−0.2
−0.1
0
0.1
0.2
0.3
Figure 3.6: Example shows 2dimensional data extracted from the originally 4dimensional Iris data set using the Generalized Discriminant Analysis. In contrast to LDA (see Figure 3.2 for comparison), the class clusters are more compact but at the expense of a higher risk of overﬁtting.
0.8
0.6
0.4
0.2
0
−0.2
−0.4
−0.6 −0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
Figure 3.7: Example shows quadratic classiﬁer found by the Perceptron algorithm on the data mapped to the feature by the quadratic mapping.
39
1. The greedy kernel PCA was used therefore only a subset of training data was used to determine the kernel projection.5 1 Figure 3. 40 .2 1 0.6 0.5 −1 −0.2 0 −0.8: Example shows the kernel classiﬁer found by the Fisher Linear Discriminant algorithm on the data mapped to the feature by the kernel PCA.2 −1.8 0.4 0.5 0 0.
fun (see Table 4. each pair (µi . . .4. .fun) is used to evaluate yi = PX (xi ).fun speciﬁes function which is used to evaluate PX (x) for given set of realization {x1 . The datatype is described in Table 4. The STPRtool uses speciﬁc datatype to describe a set of Gaussians which can be labeled. xl } of a random vector. . i. x ∈ Rn as a structure array which must contain ﬁeld . 2 det(Σ) (4. Σi) is assigned to the class yi ∈ Y. xl }. The matrix X [n × l] contains column vectors {x1 . The covariance matrix Σ is always represented as the square matrix [n × n]. The summary of implemented methods for dealing with probability density estimation and clustering is given in Table 4.1. Let the structure model represent a given distribution function.model. The vector y [1 × l] contains values of the distribution function.e.1 Gaussian distribution and Gaussian mixture model The value of the multivariate Gaussian probability distribution function in the vector x ∈ Rn is given by PX (x) = 1 (2π) n 2 1 exp(− (x − µ)T Σ−1 (x − µ)) .1) where µ [n × 1] is the mean vector and Σ [n × n] is the covariance matrix. The Matlab command y = feval(X. The ﬁeld . In particular.Chapter 4 Density estimation and clustering The STPRtool represents a probability distribution function PX (x). The evaluation of the Gaussian distribution is implemented in the function 41 .. . .cov type (see Table 4. the Gaussian distribution and the Gaussian mixture models are described in Section 4.3). the shape of the matrix can be indicated in the optimal ﬁeld . .1 4.2). however. .
The evaluation of the GMM is implemented in the function pdfgmm. pdfgauss Evaluates Gaussian mixture model. . rsde Reduced Set Density Estimator. mlsigmoid Fitting a sigmoid function using ML estimation. Probability Distribution Function (structure array): . gsamp Generates sample from Gaussian distribution.fun ).1) given by parameters (µy . (4.2: Datatype used to represent probability distribution function. melgmm Maximizes Expectation of LogLikelihood for Gaussian mixture. . demo mmgauss Demo on minimax estimation for Gaussian. The sampling from univariate and bivariate Gaussians is shown. The datatype used to represent the GMM is described in Table 4. The random sampling from the Gaussian distribution is implemented in the function gsamp. sigmoid Evaluates sigmoid function. c} are weights and PXY (xy) is the Gaussian distribution (4. pdfgauss. Σy ).5. The distribution function of the univariate Gaussian is visualized and compared to the histogram 42 . emgmm ExpectationMaximization Algorithm for Gaussian mixture model. model. demo emgmm Demo on ExpectationMaximization (EM) algorithm. Example: Sampling from the Gaussian distribution The example shows how to represent the Gaussian distribution and how to sample data it. y ∈ Y = {1.1: Implemented methods: Density estimation and clustering gmmsamp Generates sample from Gaussian mixture model. pdfgauss Evaluates for multivariate Gaussian distribution. kmeans Kmeans clustering algorithm.2) where PY (y). . demo svmpout Fitting a posteriori probability to SVM output. mmgauss Minimax estimation of Gaussian distribution. . mlcgmm Maximal Likelihood estimation of Gaussian mixture model. Table 4. The Gaussian Mixture Model (GMM) is weighted sum of the Gaussian distributions PX (x) = y∈Y PY (y)PXY (xy) . The random sampling from the GMM is implemented in the function gmmsamp.Table 4.fun [string] The name of a function which evaluates probability distribution y = feval( X.
. hold on. µk }.250)).’Cov’. .X] = hist(gsamp(model. .1a). The isoline of the distribution function of the bivariate Gaussian is visualized together with the sample data (see Figure 4. 0. yθ) = PXY Θ (xy. ’spherical’ Spherical covariance matrix.6. . . % univariate case model = struct(’Mean’.Cov = [1 0. % Gaussian parameters figure. pgauss(model). .fun = ’pdfgauss’ Identiﬁes function associated with this data type.1. Σk }. hold on.1:5].pdfgauss([4:0. % plot pdf. of the sample data (see Figure 4. % plot histogram % bivariate case model.10).Cov [n × n × k] Covariance matrices {Σ1 . .500).3: Shape of covariance matrix. .1]. Labeled set of Gaussians (structure array): .4: Datatype used to represent the labeled set of Gaussians. [Y.model). ’diag’ Diagonal covariance matrix. . model. Table 4. . figure. ppatterns(gsamp(model.Mean [n × k] Mean vectors {µ1 .Mean = [1. of the Gaussisan plot([4:0. Parameter cov type: ’full’ Full covariance matrix. .3).6 1]. yk } assigned to Gaussians.Table 4.2 MaximumLikelihood estimation of GMM PXY Θ (x. .y [1 × k] Labels {y1 . . % Mean vector % Covariance matrix % plot sampled data % plot shape 4.1b). 43 Let the probability distribution .cov type [string] Optional ﬁeld which indicates shape of the covariance matrix (see Table 4. .2).1:5].’r’).Y/500). . % compute histogram bar(X. θ)PY Θ (yθ) . .
. .3 3 0. The θ involves parameters (µy . . y1).127 0 0.35 5 4 0.fun = ’pdfgmm’ Identiﬁes function associated with this data type. PK (c)}. yθ).25 2 0. The marginal probability PXΘ (xθ) is the Gaussian mixture model (4. (xl . yl )} which contains both vectors of observations xi ∈ Rn and corresponding hidden states yi ∈ Y. be known up to parameters θ ∈ Θ. . .1: Sampling from the univariate (a) and bivariate (b) Gaussian distribution.2. . . Σc }. y ∈ Y of Gaussian components and the values of the discrete distribution PY Θ (yθ). c} are assumed to be realizations of random variables which are independent and identically distributed according to the PXY Θ (x. yθ). .2. The vector of observations x ∈ Rn and the hidden state y ∈ Y = {1. .05 −2 0 −4 −3 −2 −1 0 1 2 3 4 5 6 −3 −2 −1 0 1 2 3 4 5 (a) (b) Figure 4. y ∈ Y. . Gaussian Mixture Model (structure array): .1 Complete data The input is a set TXY = {(x1 . .15 gauss 1 0. Σy ). The MaximumLikelihood estimation of the parameters θ of the GMM for the case of complete and incomplete data is described in Section 4. .Table 4. θ) is assumed to be Gaussian distribution. The pairs (xi .i.1 −1 0. . . Thanks to the 44 .d. . . The logarithm of probability P (TXY θ) is referred to as the loglikelihood L(θTXY ) of θ with respect to TXY .2 1 0. The conditional probability distribution PXY Θ (xy.Cov [n × n × c] Covariance matrices {Σ1 .Mean [n × c] Mean vectors {µ1 . .2. yi) are assumed to samples from random variables which are independent and identically distributed (i. . respectively.2). .5: Datatype used to represent the Gaussian Mixture Model.2. . 0. 4. .) according to PXY Θ (x. µc }. .1 and Section 4. .Prior [c × 1] Weights of Gaussians {PK (1).
’contour’)). The estimated GMM is visualized as: (i) shape of components of the GMM (see Figure 4. The problem of MaximumLikelihood (ML) estimation from the complete data TXY is deﬁned as l θ = argmax L(θTXY ) = argmax θ∈Θ θ∈Θ i=1 ∗ log PXY Θ (xi . % load labeled (complete) data model = mlcgmm(data). . ppatterns(data). Example: MaximumLikelihood estimate from complete data The parameters of the Gaussian mixture model is estimated from the complete Riply’s data set riply trn.struct(’visual’. 45 . .2 Incomplete data The input is a set TX = {x1 .2b) . Thanks to the i.2a) and (ii) contours of the distribution function (see Figure 4.3) The ML estimate of the parameters θ of the GMM from the complete data set TXY has an analytical solution.i. . The data are binary labeled thus the GMM has two Gaussians components. θ)PY Θ (yi θ) . assumption the loglikelihood can be factorized as l L(θTX ) = log P (TX θ) = i=1 y∈Y PXY Θ (xi yi. pgmm(model. yiθ) . assumption the loglikelihood can be factorized as l L(θTXY ) = log P (TXY θ) = i=1 log PXY Θ (xi . The computation of the ML estimate is implemented in the function mlcgmm. (4. data = load(’riply_trn’). . 4.i. ppatterns(data). The logarithm of probability P (TX θ) is referred to as the loglikelihood L(θTX ) of θ with respect to TX . xl } which contains only vectors of observations xi ∈ Rn without knowledge about the corresponding hidden states yi ∈ Y. hold on.i. figure. hold on. yiθ) .2. % ML estimate of GMM % visualization figure.d. pgauss(model).d.
Example: MaximumLikelihood estimate from incomplete data The estimation of the parameters of the GMM is demonstrated on a synthetic data. true_model. sample = gmmsamp(true_model.Mean = [2 2]. θ (1) .The problem of MaximumLikelihood (ML) estimation from the incomplete data TX is deﬁned as l θ = argmax L(θTX ) = argmax θ∈Θ θ∈Θ i=1 y∈Y ∗ PXY Θ (xi yi. . θ)PY Θ (yi θ) . The random guess or Kmeans algorithm (see Section 4.e.Prior = [0. The ground truth model is the GMM with two univariate Gaussian components. (4..Cov = [1 0. such that the loglikelihood L(θ (t) TX ) monotonically increases. An interactive demo on the EM algorithm for the GMM is implemented in demo emgmm. The EM algorithm is a local optimization technique thus the estimated parameters depend on the initial guess θ (0) . < L(θ (t) TX ) until a stationary point L(θ (t−1) TX ) = L(θ (t) TX ) is achieved.3a. The ground truth model is sampled to obtain a training (incomplete) data.6]. . The plot of the loglikelihood function L(θ (t) TX ) with respect to the number of iterations t is visualized in Figure 4. The EM algorithm builds a sequence of parameter estimates θ (0) . The ground truth model. The EM algorithm for the GMM is implemented in the function emgmm. sampled data and the estimate model are visualized in Figure 4. . 5]. . A nice description can be found in book [3]. The ﬁrst papers describing the EM are [25. i. The EM algorithm is applied to estimate the GMM parameters. . 250). 46 . . References: The EM algorithm (including convergence proofs) is described in book [26]. L(θ (0) TX ) < L(θ (1) TX ) < .5].4 0. The ExpectationMaximization (EM) algorithm is an iterative procedure for ML estimation from the incomplete data. true_model.3b.4) The ML estimate of the parameters θ of the GMM from the incomplete data set TXY does not have an analytical solution. θ (t) .5) are commonly used to select the initial θ (0) . % ground truth Gaussian mixture model true_model. The random initialization is used by default. The numerical optimization has to be used instead. The monograph [18] is entirely devoted to the EM algorithm.
hold on.3 Minimax estimation The input is a set TX = {x1 . The PXΘ (xθ) is known up to the parameter θ ∈ Θ. xl } contains vectors xi ∈ Rn which are realizations of random variable distributed according to PXΘ (xθ). . . x∈TX (4. . References: The minimax algorithm for probability estimation is described and analyzed in Chapter 8 of the book [26].1). The algorithm builds a series of estimates θ (0) . θ (1) . .struct(’color’. The inequality (4.’Ground truth’. . figure.6) can be evaluated eﬃciently thus it is a natural choice for the stopping condition of the algorithm.verb = 1.% ML estimation by EM options.’b’)).logL ). 47 .5) If a procedure for the MaximumLikelihood estimate (global solution) of the distribution PXΘ (xθ) exists then a numerical iterative algorithm which converges to the optimal estimate θ ∗ can be constructed. 4. The minimax estimation is deﬁned as θ ∗ = argmax min log PXΘ (xθ) .options). The vectors TX are assumed to be a realizations with high value of the distribution function PXΘ (xθ). θ∈Θ x∈TX (4. .ncomp = 2. . ppatterns(sample. The algorithm is implemented in the function mmgauss. ’ML estimation’). The algorithm converges in a ﬁnite number of iterations to such an estimate θ that the condition x∈TX min log PXΘ (xθ ∗ ) − min log PXΘ (xθ) < ε . plot( estimated_model. h2 = pgmm(estimated_model.6) holds for ε > 0 which deﬁnes closeness to the optimal solution θ ∗ .X). % number of Gaussian components options. An interactive demo on the minimax estimation for the Gaussian distribution is implemented in the function demo mmgauss. % visualization figure. xlabel(’iterations’). legend([h1(1) h2(1)]. The STPRtool contains an implementation of the minimax algorithm for the Gaussian distribution (4. h1 = pgmm(true_model.struct(’color’. . θ (t) which converges to the optimal estimate θ ∗ .X.’r’)). % display progress info estimated_model = emgmm(sample. ylabel(’loglikelihood’).
θ) . θ ) . pgauss(model. a2 ]T .Example: Minimax estimation of the Gaussian distribution The bivariate Gaussian distribution is described by three samples which are known to have high value of the distribution function. Let f : X ⊆ Rn → R be a discriminant function trained from the data TXY by an arbitrary learning method. Let TXY = {(x1 . Fitting a sigmoid function to the classiﬁer output is a way how to solve this problem. The parameters θ = [a1 . θ) = 1 . The distribution is modeled by a sigmoid function PY F Θ (1f (x). y1 ).0] [1. the probabilistic output of the classiﬁer can help in postprocessing. (4. the discriminant function of an arbitrary classiﬁer does not have meaning of probability (e. X = [[0. . .f. The loglikelihood function is deﬁned as l L(θTXY ) = i=1 log PY F Θ (yif (xi ). The aim is to estimate parameters of a posteriori distribution PY F Θ (yf (x).’xr’. . the minimax problem is equivalent to searching for the minimal enclosing ellipsoid. for instance.g. . 1 + exp(a1 f (x) + a2 ) which is determined by parameters θ = [a1 . in combining more classiﬁers together.4). % run minimax algorithm % visualization figure.4 Probabilistic output of classiﬁer In general. model = mmgauss(X).0] [0.exp(model. % points with high p. In this particular case. 2}. However.13). a2 ]T can be estimated by the maximumlikelihood method l θ = argmax L(θ TXY ) = argmax θ θ i=1 log PY F Θ (yi f (xi ). The training set TXY is assumed to be identically and independently distributed from underlying distribution.lower_bound))). struct(’p’.d. ppatterns(X. SVM.1]]. The found solution is visualized as the isoline of the distribution function at the point PXΘ (xθ) (see Figure 4. (xl . Perceptron).7) 48 . yl )} is the training set composed of the vectors xi ∈ X ⊆ Rn and the corresponding binary hidden states yi ∈ Y = {1. 4.. The minimax algorithm is applied to estimate the parameters of the Gaussian. θ) of the hidden state y given the value of the discriminant function f (x).
svm_model).fun = ’sigmoid’ Identiﬁes function associated with this data type. 49 . The function produces the parameters θ = [a1 .X] = svmclass(data.X. % visulization figure. % train SVM options = struct(’ker’. The ML estimated of Gaussian mixture model (GMM) of the SVM output is used for comparison. .options). References: The method of ﬁtting the sigmoid function is described in [22]. Sigmoid function (structure array): .7). % load training data data = load(’riply_trn’).y = data. svm_output. Example: Probabilistic output for SVM This example is implemented in the script demo svmpout. The example is implemented in the script demo svmpout. The evaluation of the sigmoid is implemented in function sigmoid.’arg’.6: Datatype used to describe sigmoid function.svm_output. The example shows how to train the probabilistic output for the SVM classiﬁer. Table 4. Figure 4.A [2 × 1] Parameters θ = [a1 . a2 ]T which are stored in the datatype describing the sigmoid (see Table 4. hold on.1.y. % ML estimation of GMM model of SVM output gmm_model = mlcgmm(svm_output). % compute SVM output [dummy. Figure 4. The Riply’s data set riply trn is used for training the SVM and the ML estimation of the sigmoid. % fit sigmoid function to svm output sigmoid_model = mlsigmoid(svm_output).The STPRtool provides an implementation mlsigmoid which uses the Matlab Optimization toolbox function fminunc to solve the task (4.10).6).5a shows found probabilistic models of a posteriori probability of the SVM output. a2 ]T of the sigmoid function.’rbf’. svm_model = smo(data. The GMM allows to compute the a posteriori probability for comparison to the sigmoid estimate.5b shows the decision boundary of SVM classiﬁer for illustration.’C’.
X). .:)*gmm_model.c The small value of F (θ) indicates that the centers θ well describe the input set TX .’P(y=1f(x)) MLSigmoid’.Prior(1)+(pcond(2./. hsigmoid = plot(fx. % sigmoid model fx = linspace(min(svm_output.’P(f(x)y=2) MLGMM’). Let θ = {µ1 . 200)...sigmoid_apost. Let the objective function F (θ) be deﬁned as l 1 F (θ) = min µy − xi 2 ..e.. . θ (t) .. gmm_apost = (pcond(1.’k’). ppatterns(data). ylabel(’p(y=1f(x))’). % SVM decision boundary figure. (pcond(1.X). The task is to ﬁnd such centers θ which describe the set TX best.. µc }.5 Kmeans clustering The input is a set TX = {x1 . . i. .’g’). The found solution depends on the initial guess θ (0) which 1 The K in the name Kmeans stands for the number of centers which is here denoted by c. max(svm_output. hcomp = pgauss(gmm_model). .Prior(2))). hgmm = plot(fx.c l i=1 θ θ The Kmeans1 is an iterative procedure which iteratively builds a series θ (0) . . . . xl } which contains vectors xi ∈ Rn .. 4. . .sigmoid_model).hcomp].. θ = argmin F (θ) = argmin y=1.. legend([hsigmoid. ’P(y=1f(x)) MLGMM’. . µi ∈ Rn be a set of unknown cluster centers.hgmm..xlabel(’svm output f(x)’). θ (1) . gmm_model). the task is to solve l 1 ∗ min µy − xi 2 . .:)*gmm_model. psvm(svm_model).gmm_apost.Prior(1))..:)*gmm_model.’P(f(x)y=1) MLGMM’. 50 . % GMM model pcond = pdfgauss( fx.. l i=1 y=1. sigmoid_apost = sigmoid(fx.. ppatterns(svm_output). The Kmeans algorithm converges in a ﬁnite number of steps to the local optimum of the objective function F (θ)..
hold on. The solution is visualized as the cluster centers.6a).14). ppatterns(model. References: The Kmeans algorithm is described in the books [26. % load data [model. Example: Kmeans clustering The Kmeans algorithm is used to ﬁnd 4 clusters in the Riply’s training set. ylabel(’F’). The plot of the objective function F (θ (t) ) with respect to the number of iterations is shown in Figure 4. plot(model.y] = kmeans(data. data = load(’riply_trn’). 4 ). xlabel(’t’).X.is usually selected randomly.MsErr). % run kmeans % visualization figure.data. 6. 3].6b. The vectors classiﬁed to the corresponding centers are distinguished by color (see Figure 4.’sk’. figure. pboundary( model ). The Kmeans algorithm is implemented in the function kmeans.X. ppatterns(data). 51 .
8 0.4 gauss 1 0.2 −1.1.5 0 0.4 0.5 −1 −0.8 gauss 2 0.2 1 0.6 0.2 0 −0.2: Example of the MaximumLikelihood of the Gaussian Mixture Model from the incomplete data. Figure (a) shows components of the GMM and Figure (b) shows contour plot of the distribution function.2 1 0.2 −1.981 0. 52 .45 0.2 0 −0.5 −1 −0.5 0 0.5 1 (a) 1.5 1 (b) Figure 4.6 1.
3: Example of using the EM algorithm for ML estimation of the Gaussian mixture model.0.3 0.15 0.4 Ground truth ML estimation 0.35 0. 53 .25 0. Figure (a) shows ground truth and the estimated GMM and Figure(b) shows the loglikelihood with respect to the number of iterations of the EM.1 0.2 0.05 0 −5 −4 −3 −2 −1 0 1 2 3 4 5 (a) −442 −444 −446 −448 log−likelihood −450 −452 −454 −456 −458 −460 0 5 10 iterations 15 20 25 (b) Figure 4.
4: Example of the Minimax estimation of the Gaussian distribution.1 0.4 −0.2 0.6 0.2 −0. 54 .8 1 Figure 4.4 0.6 0.2 0 −0.4 −0.2 0 0.4 gauss 1 0.304 0.8 0.
5 0 0. Figure (b) shows the corresponding SVM classiﬁer.5 −1 −0.4 0.8 0.2 1 0.3 0.2 0.6 0.5: Example of ﬁtting a posteriori probability to the SVM output.1 P(y=1f(x)) ML−Sigmoid P(y=1f(x)) ML−GMM P(f(x)y=1) ML−GMM P(f(x)y=2) ML−GMM 0. Figure (a) shows the sigmoid model and GMM model both ﬁtted by ML estimated.6 p(y=1f(x)) 0.2 −1.5 1 (b) Figure 4.4 0.5 0.1 0 −4 −3 −2 −1 0 1 svm output f(x) 2 3 4 5 (a) 1.8 0. 55 .2 0 −0.7 0.9 0.
06 0.5 −1 −0.2 1 0.4 0.2 −1. 56 .5 0 0.12 0.08 0.2 0 −0.1 F 0.5 1 (a) 0.16 0. Figure (a) shows found clusters and Figure (b) contains the plot of the objective function F (θ (t) ) with respect to the number of iterations.1.8 0.04 0 2 4 6 t 8 10 12 (b) Figure 4.6 0.14 0.6: Example of using the Kmeans algorithm to cluster the Riply’s data set.
s1 ). 57 . 2} is deﬁned as q(x) = 1 for f (x) ≥ 0 .3) q(x) = argmax fy (x) . . 2. y∈Y The datatype used by the STPRtool to represent the multiclass SVM classiﬁer is described in Table 5. k(x.2. Let the matrix A = [α1 . c} deﬁned as fy (x) = αy · kS (x) + by . . (5. . . . . αc ] be composed of all weight vectors and b = [b1 . . . . The multiclass classiﬁcation rule q: X → Y = {1. bc ]T be a vector of all biases.1) The kS (x) = [k(x. .3. . . sd }. 2 for f (x) < 0 . .1 Support vector classiﬁers The binary support vector classiﬁer uses the discriminant function f : X ⊆ Rn → R of the following form f (x) = α · kS (x) + b . . The binary classiﬁcation rule q: X → Y = {1. . y∈Y. . 2. y ∈ Y = {1. . . .Chapter 5 Support vector and other kernel machines 5. . The multiclass generalization involves a set of discriminant functions fy : X ⊆ Rn → R. . .2) The datatype used by the STPRtool to represent the binary SVM classiﬁer is described in Table 5. Both the binary and the multiclass SVM classiﬁers are implemented in the function svmclass. c} is deﬁned as (5. si ∈ Rn which are usually subset of the training data. . . sd )]T is the vector of evaluations of kernel functions centered at the support vectors S = {s1 . (5. The α ∈ Rl is a weight vector and b ∈ R is a bias.
. The linear SVM aims to train the linear discriminant function f (x) = w · x + b of the binary classiﬁer q(x) = 1 for f (x) ≥ 0 . (xl .1 References: Books describing Support Vector Machines are for example [31. ξi ≥ 0 . .. j = 1.. .. The majority votes based multiclass classiﬁer assigns the input x into such class y ∈ Y having the majority of votes y = argmax vy (x) . . .4. yj }. (5. . 2 for f (x) < 0 . Let qj : X ⊆ Rn → {yj .The majority voting strategy is other commonly used method to implement the 1 2 multiclass SVM classiﬁer. The vector v(x) = [v1 (x). g do vy (x) = vy (x) + 1 where y = qj (x).5) 2 w. w · xi + b ≤ −1 + ξi . (5. The jth rule qj (x) classiﬁes the inputs x into class yj ∈ Y or yj ∈ Y. . .2 Binary Support Vector Machines The input is a set TXY = {(x1 . w · xi + b ≥ +1 − ξi . 2}. Let v(x) be a vector [c × 1] of votes for classes Y when the input x is to be classiﬁed. g be a set of g binary 1 2 SVM rules (5. end. vc (x)]T is computed as: Set vy = 0. 4. . ∀y ∈ Y. i ∈ I1 ∪ I2 . . b∗ ) is transformed to the following quadratic programming task l 1 ∗ ∗ 2 (w . The training of the optimal parameters (w ∗ . yl )} of training vectors xi ∈ Rn and corresponding hidden states yi ∈ Y = {1.2).g The datatype used to represent the majority voting strategy for the multiclass SVM classiﬁer is described in Table 5. .4) is implemented in the function mvsvmclass. 58 . . . The implemented methods to deal with the SVM and related kernel classiﬁers are enlisted in Table 5.b i=1 i ∈ I1 . The strategy (5. . y1 ). . For j = 1.4) y =1. .. b ) = argmin w + C ξip . i ∈ I2 . 30] 5.
β ≥ 0. β ≤ 1C . rbfpreimg RBF preimage by Schoelkopf’s ﬁxedpoint algorithm. oaasvm Multiclass SVM using OneAgainstAll decomposition. The linear p = 1 and quadratic p = 2 term for the slack variables ξi is used. 2 (5.6) subject to β·γ = 1. In the case of the parameter p = 1. minball Minimal enclosing ball in kernel feature space. rsrbf Reduced Set Method for RBF kernel expansion. oaosvm Multiclass SVM using OneAgainstOne decomposition. smo Sequential Minimal Optimization for binary SVM with L1soft margin. i ∈ I1 ∪ I2 are slack variables used to relax the inequalities for the case of nonseparable data. mvsvmclass Majority voting multiclass SVM classiﬁer. demo svm Demo on Support Vector Machines. kfd Kernel Fisher Discriminant. svmlight Interface to SV M light software. kdist Computes distance between vectors in kernel space. The ξi ≥ 0. kernel Evaluates kernel function.1: Implemented methods: Support vector and other kernel machines. The I1 = {i: yi = 1} and I2 = {i: yi = 2} are sets of indices. rbfpreimg3 RBF preimage problem by KwokTsang’s algorithm. svmquadprog SVM trained by Matlab Optimization Toolbox. diagker Returns diagonal of kernel matrix of given data. evalsvm Trains and evaluates Support Vector Machines classiﬁer. The training of the nonlinear (kernel) SVM with L1soft margin corresponds to solving the following QP task β ∗ = argmax β · 1 − β 1 β · Hβ . 59 . bsvm2 Multiclass BSVM with L2soft margin. svmclass Support Vector Machines Classiﬁer. The C > 0 stands for the regularization constant. the L1soft margin SVM is trained and for p = 2 it is denoted as the L2soft margin SVM. kperceptr Kernel Perceptron.Table 5. rbfpreimg2 RBF preimage problem by Gradient optimization.
.sv.Alpha [d × c] Weight vectors A = [α1 . . where the set of indices is ISV = {i: βi > 0}.X [n × d] Support vectors S = {s1 .sv. The bias b is computed as an average over all the constrains such that 1 γi − α · kS (xi ) . . The discriminant function (5. Hi. xj ) if yi = yj .ker [string] Kernel identiﬁer. must hold for the optimal solution. . . xj ) if yi = yj . . The set of support vectors is denoted as S = {xi : i ∈ ISV }. −k(xi .Alpha [d × 1] Weight vector α. αc ].X [n × d] Support vectors S = {s1 . . .ker [string] Kernel identiﬁer. . . for i ∈ ISV . . −1 if yi = 2 .options. 0 < βj < C} . .j = k(xi . sd }.b [1 × 1] Bias b.options. . bc ]T . b= Ib  i∈I b 60 .arg [1 × p] Kernel argument(s).b [c × 1] Vector of biased b = [b1 . .fun = ’svmclass’ Identiﬁes function associated with this data type. Multiclass SVM classiﬁer (structure array): . . . The weight vectors are compute as αi = βi if yi = 1 . where k: X × X → R is selected kernel function. (5. .7) The bias b can be computed from the KarushKuhnTucker (KKT) conditions as the following constrains f (xi ) = α · kS (xi ) + b = +1 f (xi ) = α · kS (xi ) + b = −1 for i ∈ {j: yj = 1. . The vector γ [l × 1] and the matrix H [l × l] are constructed as follows γi = 1 if yi = 1 .3: Datatype used to describe multiclass SVM classiﬁer. .options. bias b and the set of support vectors S. Binary SVM classiﬁer (structure array): . −βi if yi = 2 . . Table 5. . . . for i ∈ {j: yj = 2. . 0 < βj < C} . sd }. .1) is determined by the weight vector α. The 0 [l × 1] is vector of zeros and the 1 [l × 1] is vector of ones. .arg [1 × p] Kernel argument(s).Table 5.fun = ’svmclass’ Identiﬁes function associated with this data type.2: Datatype used to describe binary SVM classiﬁer.options.
The weight vector α is given by (5. .options.8) subject to β·γ = 1. bg ]T .ker [string] Kernel identiﬁer. . . References: The Sequential Minimal Optimizer (SMO) is described in the books [23. . . . The training of the nonlinear (kernel) SVM with L2soft margin corresponds to solving the following QP task β ∗ = argmax β · 1 − β 1 1 β · (H + E)β . 61 . .Table 5. .b [g × 1] Vector of biased b = [b1 . . . . 27.5.bin y [2 × g] Targets for the binary rules. β ≥ 0. y2 . αg ]. Majority voting SVM classiﬁer (structure array): .options. . . .7). yg ] and the bin y(2. βj > 0} . must hold at the optimal solution. y2 .:) 1 1 1 contains [y1 .sv.fun = ’mvsvmclass’ Identiﬁes function associated with this data type. The bias b can be computed from the KKT conditions as the following constrains f (xi ) = α · kS (xi ) + b = +1 − αi 2C αi f (xi ) = α · kS (xi ) + b = −1 + 2C for i ∈ {i: yj = 1. . . . .X [n × d] Support vectors S = {s1 . Example: Training binary SVM.Alpha [d × g] Weight vectors A = [α1 . 2 2C (5.arg [1 × p] Kernel argument(s). The vector bin y(1. . . 30].:) contains 2 2 2 [y1 . yg ]. . . . sd }. The set of support vectors is set to S = {xi : i ∈ ISV } where the set of indices is ISV = {i: βi > 0}.4: Datatype used to describe the majority voting strategy for multiclass SVM classiﬁer. γi (1 − ISV  i∈I 2C SV The binary SVM solvers implemented in the STPRtool are enlisted in Table 5. . . where Ib = {i: 0 < βi < C} are the indices of the vectors on the boundary. An interactive demo on the binary SVM is implemented in demo svm. for i ∈ {i: yj = 2. The bias b is computed as an average over all the constrains thus 1 αi b= ) − α · kS (xi ) . βj > 0} . The SV M light is described for instance in the book [27]. . where E is the identity matrix.
svmquadprog) can by used as well.options). % train model = % model = % model = % % % % load training data use RBF kernel kernel argument regularization constant SVM classifier smo(trn.X. SVM solver based on the QP solver quadprog of the Matlab optimization toolbox.options). Matlab interface to SV M light software (Version: 5. options.org/. The binary SVM is trained on the Riply’s training set riply trn. psvm(model). ppatterns(trn).5: Implemented binary SVM solvers.options). It trains both L1soft and L2soft margin binary SVM. options.00) which trains the binary SVM with L1soft margin. It can be downloaded from http://svmlight. The function quadprog is a part of the Matlab Optimization toolbox.tst.ker = ’rbf’.model).arg = 1. The decision boundary is visualized in Figure 5.Function smo svmlight svmquadprog Table 5.joachims. svmlight(trn. cerror(ypred. The executable ﬁle svm learn for the Linux OS must be installed.1. trn = load(’riply_trn’).C = 10. It can handle moderate size problems. svmquadprog(trn. options. % visualization figure. It useful for small problems only. The trained SVM classiﬁer is evaluated on the testing data riply tst.y) % load testing data % classify data % compute error 62 . tst = load(’riply_tst’). The SMO solver is used in the example but other solvers (svmlight. ypred = svmclass(tst. The SV M light is very powerful optimizer which can handle large problems. description Implementation of the Sequential Minimal Optimizer (SMO) used to train the binary SVM with L1soft margin.
The discriminant functions fy (x) = αy · kS (x) + by . .3 OneAgainstAll decomposition for SVM The input is a set TXY = {(x1 . (xl . . y are trained by the binary SVM solver from the set TXY . 5. .2 −1. . .5 1 Figure 5. . y ∈ Y.4 0. yl )} contain the modiﬁed hidden states deﬁned as 1 for y = yi .1: Example showing the binary SVM classiﬁer trained on the Riply’s data. 2. c}. . The OAA decomposition is implemented in the function oaasvm. The goal is to train the multiclass rule q: X ⊆ Rn → Y deﬁned by (5. yl )} of training vectors xi ∈ X ⊆ Rn and corresponding hidden states yi ∈ Y = {1. The problem can be solved by the OneAgainstAll (OAA) decomposition. . y1 ). .8 0. y∈Y.ans = 0. .0920 1.2 1 0.6 0.5 −1 −0. . (xl .5 0 0. Let the training set TXY = {(x1 .3). . y1 ). yi = 2 for y = yi .2 0 −0. The OAA decomposition transforms the multiclass problem into a series of c binary subtasks that can be trained by the binary y SVM. The function allows to specify any binary SVM 63 .
. . . pboundary(model). % visualization figure. % % % % % % load data use SMO solver use RBF kernel kernel argument regularization constant training 5.3. options. ppatterns(data). 2.solver (see Section 5. . options. model = oaasvm(data. ylj )} contain the training vectors xi ∈ I j = {i: yi = y 1 yi = y 2} and the modiﬁed the hidden states deﬁned as yi = 1 1 for yj = yi .4 OneAgainstOne decomposition for SVM The input is a set TXY = {(x1 .sv.3) represented by the datatype described in Table 5. yl )} of training vectors xi ∈ X ⊆ Rn and corresponding hidden states yi ∈ Y = {1. . The OAO decomposition for multiclass SVM is implemented in function 64 .arg = 1. . y1). data = load(’pentagon’). The goal is to train the multiclass rule q: X ⊆ Rn → Y based on the majority voting strategy deﬁned by (5. g is constructed for all g = c(c − 1)/2 combinations of 1 2 1 classes yj ∈ Y and yj ∈ Y \ {yj }. 2 2 for yj = yi . c}.options ). The oneagainstone (OAO) decomposition strategy can be used train such a classiﬁer. options. j The training set TXY . .C = 10. . .4). The binary SVM rules qj .12).solver = ’smo’.’ko’. . Let the training set TXY = {(x1 . . . i ∈ Ij . j = 1.ker = ’rbf’.2. The training data and the decision boundary is visualized in Figure 5. . The OAO decomposition transforms the multiclass problem into a series of g = c(c − 1)/2 j binary subtasks that can be trained by the binary SVM.X. . . . g are trained on j the data TXY . options. The SMO binary solver is used to train the binary SVM subtasks. . . j = 1. (xlj . y1 ).2) used to solve the binary problems. The result is the multiclass SVM classiﬁer (5. (xl . . Example: Multiclass SVM using the OAA decomposition The example shows the training of the multiclass SVM using the OAA decomposition. References: The OAA decomposition to train the multiclass SVM and comparison to other methods is described for instance in the paper [12]. ppatterns(model. .
data = load(’pentagon’). options.4 −0.8 0.solver = ’smo’.2 0 −0.4 −0.8 1 Figure 5.4 0.8 −1 −1 −0. The training data and the decision boundary is visualized in Figure 5.6 −0. ppatterns(model.2 0 0.ker = ’rbf’.options). Example: Multiclass SVM using the OAO decomposition The example shows the training of the multiclass SVM using the OAO decomposition. The function oaosvm allows to specify an arbitrary binary SVM solver to train the subtasks.6 −0.sv.12).3. model = oaosvm(data.2: Example showing the multiclass SVM classiﬁer build by oneagainstall decomposition.2 −0. ppatterns(data).C = 10. % visualization figure. options. options.6 0. options.2 0.1 0.arg = 1. % % % % % % load data use SMO solver use RBF kernel kernel argument regularization constant training 65 .6 0.4 0. pboundary(model). oaosvm. References: The OAO decomposition to train the multiclass SVM and comparison to other methods is described for instance in the paper [12].X.8 −0. The SMO binary solver is used to train the binary SVM subtasks.’ko’.
6 0. i ∈ I . 2.b i∈I y∈Y\{y} F (W. In the linear case. The L1soft margin is used for p = 1 and L2soft margin for p = 2. .5 Multiclass BSVM formulation The input is a set TXY = {(x1 . (5. c}. . bc ]T is vector of biases. . the parameters of the multiclass rule can be found by solving the multiclass BSVM formulation which is deﬁned as 1 (W∗ . 5. .2 0. where W = [w 1 . The letter B in BSVM stands for the bias term added to the objective (5. . ξ∗ ) = argmin ( wy 2 + b2 ) + C (ξi )p . (xl . y ∈ Y \ {yi } . y ∈ Y \ {yi } . yl )} of training vectors xi ∈ X ⊆ Rn and corresponding hidden states yi ∈ Y = {1. .4 −0. . . b∗ . .8 0. l] is set of indices.4 −0.4 0. In the case of the L2soft margin p = 2. .3: Example showing the multiclass SVM classiﬁer build by oneagainstone decomposition. .6 0.8 −0.2 0 −0. . .2 0 0. . the optimization problem (5.ξ) subject to w yi · xi + byi − ( w y · xi + by ) ≥ 1 − ξiy . . . ξiy ≥ 0 .9) y 2 y∈Y W. y1 ).4 0.6 −0.b. . . ξ is a vector of slack variables and I = [1. . . The goal is to train the multiclass rule q: X ⊆ Rn → Y deﬁned by (5.9) can be 66 . wc ] is a matrix of normal vectors.9). b = [b1 .2 −0.8 1 Figure 5.8 −1 −1 −0.3).6 −0.1 0. i ∈ I .
11) and (5. y∈Y. u)−δ(yi . yi) − δ(y. The upper bound FU B and the lower bound FLB on the optimal value F (W∗. yj )+δ(y. b∗ .12). FLB + 1 67 . (5. u)−δ(yj . yi) − δ(u. αc ] are optimized weight vectors. j) = ( xi ·xj +1)(δ(yi . j) . The nonlinear (kernel) classiﬁer is obtained by substituting the selected kernel k: X × X → R for the dot products x·x to (5.12) where the bias by . i. The above mentioned algorithms are of iterative nature and converge to the optimal solution of the criterion (5. The optimization task (5. i ∈ I . i. The bounds are used in the algorithms to deﬁne the following two stopping conditions: Relative tolerance: FU B − FLB ≤ εrel .11) 2C The criterion (5. The multiclass rule (5. Kozinec’s algorithm (solver = ’kozinec’). j) .10) subject to y αi ≥ 0 . u)δ(i. (5. . The optimizers implemented in the function bsvm2 are enlisted bellow: MitchellDemyanovMalozemov (solver = ’mdm’). u)) . ξ ∗ ) of the criterion (5. u. u. The function h: Y×Y×I×I → R is deﬁned as h(y.9). i∈I y∈Y\{yi } j∈I u∈Y\{yj } (5. y ∈ Y is given by by = i∈I y∈Y\{yi } u αi (δ(y. y))+ δ(y. y∈Y. The training of the multiclass BSVM classiﬁer with L2soft margin is implemented in the function bsvm2. y ∈ Y \ {yi} . . .expressed in its dual form A∗ = argmax A i∈I y∈Y\{yi } y αi − 1 2 y u αi αj h(y.9) can be computed. Nearest Point Algorithm (solver = ’npa’). y)) + by .3) is composed of the set of discriminant functions fy : X → R which are computed as fy (x) = i∈I xi · x u∈Y\{yi } u αi (δ(u.10) can be solved by most the optimizers.10) corresponds to the singleclass SVM criterion which is simple to optimize. . where A = [α1 .
x ) = φ(x) · φ(x ) . 2}. . data = load(’pentagon’). ’mdm’). α · (N + µE)α α · (N + µE)α 68 (5. yl )}. .C = 10. model = bsvm2(data.10) implemented in the STPRtool is published in paper [9]. options. y1). (xl . The example shows the training of the multiclass BSVM using the NPA solver. .arg = 1. αl ]T such that the criterion F (α) = α·m 2 α · Mα = . . . . pboundary(model). The Kozinec’s algorithm based SVM solver is described in paper [10].’ko’.options). % % % % % % load data use NPA solver use RBF kernel kernel argument regularization constant training 5.4. The Kernel Fisher Discriminant is the nonlinear extension of the linear FLD (see Section 2. ppatterns(data).sv. The other solvers can be used instead by setting the variable options. The input training vectors are assumed to be mapped into a new feature space F by a mapping function φ: X → F . (φ(xl ).3). ppatterns(model.ker = ’rbf’. The ordinary FLD is applied on the mapped data training data TΦY = {(φ(x1 ). The training data and the decision boundary is visualized in Figure 5. The transformation of the multiclass BSVM criterion to the singleclass criterion (5.13) . . References: The multiclass BSVM formulation is described in paper [12].12). % visualization figure.X.solver (other options are ’kozinec’. .Absolute tolerance: FU B ≤ εabs . y1 ). The mapping is performed eﬃciently by means of kernel functions k: X × X → R being the dot products of the mapped vectors k(x. . options. . . yl )} of training vectors xi ∈ X ⊆ Rn and corresponding values of binary state yi ∈ Y = {1. options.6 Kernel Fisher Discriminant The input is a set TXY = {(x1 . Example: Multiclass BSVM with L2soft margin. The aim is to ﬁnd a direction ψ = l αi φ(xi ) in the feature space i=1 F given by weights α = [α1 . The MitchellDemyanovMalozemov and Nearest Point Algorithm are described in paper [14].solver = ’npa’. options. .
2 0 −0.4 −0.14) is known up to the bias b which is determined by the linear SVM with L1soft margin is applied on the projected data TZY = {(z1 . m = m1 − m2 . .13) serves as the regularization term.6 0. .8 1 Figure 5. The criterion (5. (5. . The projections {z1 . y1).2 0. .13) deﬁned in the input space X . The kernel matrix K [l × l] contains kernel evaluations Ki. The discriminant function f (x) of the binary classiﬁer (5. xj ). Iy  N = KKT − y∈Y y∈Y.14) where the k(x) = [k(x1 . The discriminant function (5. . (zl . yl )}.2) is given by the found vector α as f (x) = ψ · φ(x) + b = α · k(x) + b . The vector m [l × 1]. zl } are computed as zi = α · k(xi ) .4 −0. . x). Iy my mT . . is maximized.4 0.13) is the feature space counterpart of the criterion (2. matrices N [l × l] M [l × l] are constructed as my = 1 K1y . y where the vector 1y has entries i ∈ Iy equal to and remaining entries zeros.8 −0.6 0.j = k(xi . k(xl . 69 . . . . x)]T is vector of evaluations of the kernel.8 −1 −1 −0.2 −0. M = (m1 − m2 )(m1 − m2 )T .4 0.4: Example showing the multiclass BSVM classiﬁer with L2soft margin.2 0 0. . The KFD training algorithm is implemented in the function kfd. .6 −0.1 0.8 0. The diagonal matrix µE in the denominator of the criterion (5.6 −0.
options.001. the isolines with f (x) = +1 and f (x) = −1 are visualized and the support vectors (whole training set) are marked as well. model ).References: The KFD formulation is described in paper [19]. .C = 10.arg = 1. ppatterns(trn).. psvm(model). The kernel methods works with the data nonlinearly mapped φ: X → F into a new feature space F . The training data and the found kernel classiﬁer is visualized in Figure 5. options.7 Preimage problem Let TX = {x1 .mat. . options) % visualization figure. xb ) = φ(xa ) · φ(xb ) . xl } be a set of vector from the input space X ⊆ Rn . The found classiﬁer is evaluated on the testing data riply tst.ker = ’rbf’. Let ψ ∈ F be given by the following kernel expansion l ψ= i=1 αi φ(xi ) .mat.e. i.15) . tst. The mapping is performed implicitly by using the kernel functions k: X × X → R which are dot products of the input data mapped into the feature space F such that k(xa . % evaluation on testing data tst = load(’riply_tst’). ypred = svmclass( tst. The detailed analysis can be found in book [30]. options. The classiﬁer is visualized as the SVM classiﬁer.mu = 0. model = kfd(trn. cerror( ypred. Example: Kernel Fisher Discriminant The binary classiﬁer is trained by the KFD on the Riply’s training set riply trn. .0960 % % % % % % load training data use RBF kernel kernel argument regularization of linear SVM regularization of KFD criterion KFD training 5.X.y ) ans = 0. options. trn = load(’riply_trn’). 70 (5. .5.
6 0. 71 .5 −1 −0.2 −1. .16) for the particular case of the RBF kernel leads to the following task l x = argmax x i=1 αi exp(−0. αl ]T is a weight vector of real coeﬃcients. xj ) . where α = [α1 .5 1 Figure 5. In other words.15).5 xa −xb 2 /σ 2 ) be assumed. its feature space image φ(x) ∈ F would well approximate the expansion point Ψ ∈ F given by (5. xb ) = exp(−0. all the methods are guaranteed to ﬁnd only a local optimum of the task (5.4 0.17) The STPRtool provides three optimization algorithms to solve the RBF preimage problem (5. the preimage problem solves mapping from the feature space F back to the input space X . . Let the Radial Basis Function (RBF) k(xa .2 0 −0. The preimage problem can be stated as the following optimization task x = argmin φ(x) − ψ x l 2 = argmin φ(x ) − x i=1 αi φ(xi ) l 2 (5. xi ) + i=1 j=1 αi αj k(xi . .5 0 0. .6. The preimage problem aims to ﬁnd an input vector x ∈ X . However.8 0.5: Example shows the binary classiﬁer trained based on the Kernel Fisher Discriminant.2 1 0. The preimage problem (5. (5.5 x − xi 2 /σ 2 ) .16) l l = argmin k(x .17).1.17) which are enlisted in Table 5. x ) − 2 x i=1 αi k(x .
X = data.options. ppatterns(x.6.Table 5.options. rbfpreimg An iterative ﬁxed point algorithm proposed in [28]. % visualization figure. expansion. expansion.1)/num_data. % compute preimage problem x = rbfpreimg(expansion). The result is visualized in Figure 5.13).X). 72 .2).Alpha = ones(num_data.arg = 1. % load input data data = load(’riply_trn’). expansion.6: Methods to solve the RBF preimage problem implemented in the STPRtool. expansion.X. rbfpreimg3 A method exploiting the correspondence between distances in the input space and the feature space proposed in [16]. rbfpreimg2 A standard gradient optimization method using the Matlab optimization toolbox for 1D search along the gradient.’or’. ppatterns(data. References: The preimage problem and the implemented algorithms for its solution are described in [28. Example: Preimage problem The example shows how to visualize a preimage of the data mean in the feature induced by the RBF kernel.sv.ker = ’rbf’. 16]. % define the mean in the feature space num_data = size(data.
. . βm ]T and the vectors TS = {s1 . k: X × X → R is a kernel function. . . Let φ: X → F be a mapping from the input space X to the feature space F associated with kernel function such that k(xa .1.18). xl } are vectors from Rn . Let the function f : X ⊆ Rn → R be given as l f (x) = i=1 αi k(xi . .2 1 0. ˜ ψ (5.18) where TX = {x1 . The function f (x) can be expressed in the feature space F as the linear function f (x) = ψ · φ(x) + b. 5. m < l.e. si ∈ Rn determine the reduced kernel expansion (5.. . .8 0. . sm }.5 0 0.2 0 −0. . .19) is shorter. α = [α1 .6: Example shows training data and the preimage of their mean in the feature space induced by the RBF kernel.4 0.19) such that the expansion (5. . x) + b = ψ · φ(x) + b . . The problem of ﬁnding the 73 . xb ) = φ(xa ) · φ(xb ) .6 0.2 −1. The weight vector β = [β1 . αl ]T is a weight vector of real coeﬃcients and b ∈ R is a scalar.5 1 Figure 5. . i. . The reduced set method aims to ﬁnd a new function m m ˜ f (x) = i=1 βi k(si .8 Reduced set method The reduced set method is useful to simplify the kernel classiﬁer trained by the SVM or other kernel machines. . x) + b = φ(x) · i=1 βi φ(si ) + b . and well approximates the original one (5.19). (5.5 −1 −0. .
X.’arg’.5 xa −xb 2 /σ 2 ) be assumed.’r’)). h1 = pboundary(model.struct(’line_style’.19) can be stated as the optimization task m l 2 ˜ (β. The reduced expansion with 10 support vectors is found. In this case.’rbf’.TS i=1 βi φ(si ) − i=1 αi φ(xi ) . the optimization task (5. The algorithm converts the task (5. TS ) = argmin ψ − ψ β . (5. err = cerror(ypred. Example: Reduced set method for SVM classiﬁer The example shows how to use the reduced set method to simplify the SVM classiﬁer. legend([h1(1) h2(1)].20) Let the Radial Basis Function (RBF) k(xa . % train SVM classifier model = smo(trn.X.struct(’line_style’.model).’Original SVM’. xb ) = exp(−0. % load training data trn = load(’riply_trn’).20) can be solved by an iterative greedy algorithm.’C’.mat.10)). % visualize decision boundaries figure.7).struct(’nsv’.reduced kernel expansion (5.struct(’ker’. ppatterns(trn).TS 2 = argmin β . The original and reduced SVM are evaluated on the testing data riply tst.7. References: The implemented reduced set method is described in [28]. % evaluate SVM classifiers tst = load(’riply_tst’).’Reduced SVM’). ypred = svmclass(tst. Decision boundaries of the original and the reduced SVM are visualized in Figure 5.tst.20) into a series of m preimage tasks (see Section 5.10)).1.’b’)). The trained kernel expansion (discriminant function) is composed of 94 support vectors. The reduced set method for the RBF kernel is implemented in function rsrbf. 74 .y). The SVM classiﬁer is trained from the Riply’s training set riply trn/mat. % compute the reduced rule with 10 support vectors red_model = rsrbf(model.model). red_ypred = svmclass(tst. h2 = pboundary(red_model.
tst. err = %. The original decision rule involves 94 support vectors while the reduced one only 10 support vectors.nsv.4f\n’].0960 1. err = 0.5 1 Figure 5...5 −1 −0.4 0. err = 0.nsv. Original SVM: nsv = 94. 75 . fprintf([’Original SVM: nsv = %d. red_model. ’Reduced SVM: nsv = %d..y).5 0 0.4f\n’ . .2 Original SVM Reduced SVM 1 0.red_err = cerror(red_ypred. model.0960 Reduced SVM: nsv = 10. err.7: Example shows the decision boundaries of the SVM classiﬁer trained by SMO and the approximated SVM classiﬁer found by the reduced set method. err = %.8 0.2 0 −0.2 −1.. red_err).6 0.
mat /binary separable/*. Linearly separable multiclass data in 2D. Input data for the demo on training linear classiﬁers from ﬁnite vector sets demo linclass.1: Datasets /riply trn. Testing part of the Riply’s data set [24].mat /anderson task/*. Examples of handwritten numerals.1. /multi separable/*. The datasets stored in the Matlab data ﬁle are enlisted in Table 6.1. Input data for the demo on the Support Vector Machines demo svm. Input data for the demo on the Generalized Anderson’s task demo anderson. Well known Fisher’s data set.mat /gmm samles/*. Training part of the Riply’s data set [24]. Table 6.mat /ocr data/*. The functions which generate synthetic data and scripts for converting external datasets to the Matlab are given in Table 6.mat 76 .Chapter 6 Miscellaneous 6.mat /iris.2. Input data for the demo on the ExpectationMaximization algorithm for the Gaussian mixture model demo emgmm.1 Data sets The STPRtool provides several datasets and functions to generate synthetic data.mat stored in /data/ directory of the STPRtool.mat /riply tst.mat /svm samples/*. A description is given in Section 7.
Generates linearly separable binary data with prescribed margin. GUI which allows to create interactively labeled vectors sets and labeled sets of Gaussian distributions. pgmm Visualizes the bivariate Gaussian mixture model. Generates data from the Gaussian mixture model.createdata genlsdata gmmsamp gsamp gencircledata usps2mat Table 6. Visualizes solution of the Generalized Anderson’s task. psvm Plots decision boundary of binary SVM classiﬁer in 2D. The list of functions for visualization is given in Table 6. plane3 Plots plane in 3D. pgauss Visualizes set of bivariate Gaussians. It works for 1D.3. showim Displays images entered as column vectors. pkernelproj Plots isolines of kernel projection. pandr 77 .3: Functions for visualization. Postal Service (USPS) database of handwritten numerals to the Matlab ﬁle.2 Visualization The STPRtool provides tools for visualization of common models and data. 6. Converts U. 2D and 3D.2: Data generation. Generates data on circle corrupted by the Gaussian noise. pline Plots line in 2D. It operates in 2dimensional space only.S. Generates data from the multivariate Gaussian distribution. Each function contains an example of using. Table 6. ppatterns Visualizes patterns as points in the feature space. pboundary Plots decision boundary of given classiﬁer in 2D.
y∈Y y∈Y (6. Let W : D × Y → R be a loss function which penalizes the decision q(x) ∈ D when the true hidden state is y ∈ Y. . y). The rule q: X → Y which minimizes the expectation of misclassiﬁcation is deﬁned as q(x) = argmax PY X (yx) . ∀x ∈ X .6. The rule q: X → Y which minimizes the Bayesian risk with the loss function (6. i. y∈Y (6. Let X ⊆ Rn and the sets Y and D be ﬁnite. y)W (q(x).3 Bayesian classiﬁer The object under study is assumed to be described by a vector of observations x ∈ X and a hidden state y ∈ Y. The x and y are realizations of random variables with joint probability distribution PXY (x. y) dx . (6. . A decision rule q: X → D takes a decision d ∈ D based on the observation x ∈ X ..5) q(x) = ⎩ dont know if 1 − max PY X (yx) ≥ ε . 1 for q(x) = y . The loss function is deﬁned as ⎧ ⎨ 0 for q(x) = y . The Bayesian risk (6. 1 for q(x) = y . y∈Y 78 .1) with the 0/1loss function corresponds to the expectation of misclassiﬁcation.4) Wε (q(x). y)W (q(x).3) = argmax PXY (xy)PY (y) . The STPRtool implements the Bayesian rule for two particular cases: Minimization of misclassiﬁcation: The set of decisions D coincides to the set of hidden states Y = {1. R(q) = X y∈Y PXY (x.1) is referred to as the Bayesian rule q ∗ (x) = argmin y y∈Y PXY (x. . y) . y) = ⎩ ε for q(x) = dont know . The 0/1loss function W0/1 (q(x). (6.1) The optimal rule q ∗ which minimizes the Bayesian risk (6. y∈Y Classiﬁcation with rejectoption: The set of decisions D is assumed to be D = Y ∪ {dont know}. . where ε is penalty for the decision dont know.e. c}.4) is deﬁned as ⎧ ⎨ argmax PXY (xy)PY (y) if 1 − max PY X (yx) < ε . y) = 0 for q(x) = y .2) is used. (6. The Bayesian risk R(q) is an expectation of the value of the loss function W when the decision rule q is applied.
The distribution PXY (xy) is given by Pclassy which uses the datatype described in Table 4. inx1 = find(trn.Pclass [cell 1 × c] Class conditional distributions PXY .5) are implemented in function bayescls.y==2).X(:.y==1).2)).Pclass{1} = emgmm(trn. Table 6.struct(’ncomp’.2)).2. The designed Bayesian classiﬁer is evaluated on the testing data riply tst.struct(’ncomp’.mat.3) and (6. Bayesian classiﬁer (structure array): . The classconditional distributions PXY are modeled by the Gaussian mixture models (GMM) with two components.X.X(:. y ∈ {1. . inx2 = find(trn. bayes_model.inx1).Prior [1 × c] Discrete distribution PY (y).To apply the optimal classiﬁcation rules one has to know the classconditional distributions PXY and a priory distribution PY (or their estimates). 2} are estimated by the relative occurrences (it corresponds to the ML estimate). The GMM are estimated by the EM algorithm emgmm.inx2). cerror(ypred. bayes_model. The Bayesian classiﬁers (6.bayes_model). n2 = length(inx2). ypred = bayescls(tst.mat.Prior = [n1 n2]/(n1+n2).tst. The a priori probabilities PY (y). % Estimation of priors n1 = length(inx1).fun = ’bayescls’ Identiﬁes function associated with this data type. . References: The Bayesian decision making with applications to PR is treated in details in book [26]. % load input training data trn = load(’riply_trn’). % Evaluation on testing data tst = load(’riply_tst’). The datatype used by the STPRtool is described in Table 6. y ∈ Y.4.4: Datatype used to describe the Bayesian classiﬁer.y) 79 . Example: Bayesian classiﬁer The example shows how to design the plugin Bayesian classiﬁer for the Riply’s training set riply trn.Pclass{2} = emgmm(trn. % Estimation of classconditional distributions by EM bayes_model.
bayes_model. which are quadratic with respect to the input vector x ∈ Rn . a vector by [n × 1] and a scalar cy [1 × 1]. 2.0900 The following example shows how to visualize the decision boundary of the found Bayesian classiﬁer minimizing the misclassiﬁcation (solid black line).’k’)). . % Vislualization of rejetoption rule pboundary(reject_model. The decision boundary of the rejectoptions rule (dashed line) is displayed for comparison.1. ppatterns(trn). . ∀y ∈ Y . % Penalization for don’t know decision reject_model = bayes_model. hold on.1.fun = ’bayescls’. reject_model.4 Quadratic classiﬁer The quadratic classiﬁcation rule q: X ⊆ Rn → Y = {1. The rejectoption rule splits the feature space into three regions corresponding to the ﬁrst class. 80 .ans = 0.eps = 0.6) In the particular binary case Y = {1. 2}. The Bayesian classiﬁer splits the feature space into two regions corresponding to the ﬁrst class and the second class. . y ∈Y y ∈Y (6. The input vector x ∈ Rn is assigned to the class y ∈ Y its corresponding discriminant function fy attains maximal value y = argmax fy (x) = argmax ( x · Ay x + by · x + cy ) . c} is composed of a set of discriminant functions fy (x) = x · Ay x + by · x + cy . The quadratic discriminant function fy is determined by a matrix Ay [n × n]. the quadratic classiﬁer is represented by a single discriminant function f (x) = x · Ax + b · x + c . . % Visualization figure.struct(’line_style’. pboundary(bayes_model). The visualization is given in Figure 6. the second class and the region in between corresponding to the dont know decision. 6.
The input vector x ∈ Rn is assigned to class y ∈ {1.6 0.2 1 0.1: Figure shows the decision boundary of the Bayesian classiﬁer (solid line) and the decision boundary of the rejectoption rule with ε = 0. . The implementation of the quadratic classiﬁer itself provides function quadclass. The classconditional probabilities are modeled by the Gaussian mixture models with two components estimated by the EM algorithm.5: Datatype used to describe binary quadratic classiﬁer.7) The STPRtool uses a speciﬁc structure array to describe the binary (see Table 6.1.4 0. Binary linear quadratic (structure array): .c [1 × 1] Parameter scalar c. given by a matrix A [n × n].5 −1 −0. 2 if f (x) = x · Ax + b · x + c < 0 .fun = ’quadclass’ Identiﬁes function associated with this data type.A [n × n] Parameter matrix A. The classiﬁer is trained from the Riply’s 81 .6) quadratic classiﬁer.5 0 0.5) and the multiclass (see Table 6. (6. . Example: Quadratic classiﬁer The example shows how to train the quadratic classiﬁer using the Fisher Linear Discriminant and the quadratic data mapping.8 0.2 0 −0.2 −1. a vector b [n × 1] and a scalar c [1 × 1].5 1 Figure 6. 2} as follows q(x) = 1 if f (x) = x · Ax + b · x + c ≥ 0 .1 (dashed line).b [n × 1] Parameter vector b. Table 6. .
. .fun = ’quadclass’ Identiﬁes function associated with this data type.. pboundary(quad_model). . lin_model = fld(quad_data). .mat.X. ppatterns(trn). . % evaluation tst = load(’riply_tst’).0940 % load input data % quadratic mapping % train FLD 6.8) y∈Y 82 . Let Rn (x) = {x : x − x ≤ r 2 } be a ball centered in the vector x in which lie K prototype vectors xi . (6.Table 6.y) ans = 0. % visualization figure. y ∈ Y. . y1 ). . y ∈ Y. quad_data = qmap(trn). % make quadratic classifier quad_model = lin2quad(lin_model). . training set riply trn.A [n × n × c] Parameter matrices Ay . Multiclass quadratic classiﬁer (structure array): .mat.c [1 × c] Parameter scalars cy . The Knearest neighbor (KNN) classiﬁcation rule q: X → Y is deﬁned as q(x) = argmax v(x. The found quadratic classiﬁer is evaluated on the testing data riply trn. yl )} be a set of prototype vectors xi ∈ X ⊆ Rn and corresponding hidden states yi ∈ Y = {1. c}.2). trn = load(’riply_trn’). .6: Datatype used to describe multiclass quadratic classiﬁer. y ∈ Y.5 Knearest neighbors classiﬁer Let TXY = {(x1 . . i. {xi : xi ∈ Rn (x)} = K. i ∈ {1.b [n × c] Parameter vectors by . . . ypred = quadclass(tst. cerror(ypred. . .e. The boundary of the quadratic classiﬁer is visualized (see Figure 6. (xl .quad_model).tst. y) . l}. .
8). % load training data and setup 8NN rule trn = load(’riply_trn’). in book [6].8) is computationally demanding if the set of prototype vectors is large. yl }. . for instance. The decision boundary is visualized in Figure 6. model = knnrule(trn. . References: The Knearest neighbors classiﬁer is described in most books on PR. 83 .5 −1 −0.mat.2: Figure shows the decision boundary of the quadratic classiﬁer.3.X [n × l] Prototype vectors {x1 . .1. xl }.5 0 0. The classiﬁcation rule (6.6 0.mat. .5 1 Figure 6. . Example: Knearest neighbor classiﬁer The example shows how to create the (K = 8)NN classiﬁcation rule from the Riply’s training data riply trn. where v(x. .4 0.fun = ’knnclass’ Identiﬁes function associated with this data type. . y) is number of prototype vectors xi with hidden state yi = y which lie in the ball xi ∈ Rn (x).7: Datatype used to describe the Knearest neighbor classiﬁer. . The classiﬁer is evaluated on the testing data riply tst.2 −1.2 0 −0.7. The datatype used by the STPRtool to represent the Knearest neighbor classiﬁer is described in Table 6. Table 6. .y [1 × l] Labels {y1 .2 1 0. . KNN classiﬁer (structure array): .8 0.
% visualize decision boundary and training data figure. ypred = knnclass(tst.4 0. % evaluate classifier tst = load(’riply_tst’).3: Figure shows the decision boundary of the (K = 8)nearest neighbor classiﬁer.tst.5 0 0.5 1 Figure 6.2 0 −0. 84 . pboundary(model).5 −1 −0.model). ppatterns(trn).8 0.1160 1.2 −1. cerror(ypred.6 0.y) ans = 0.2 1 0.X.
The applications selected are the optical character recognition (OCR) system based on the multiclass SVM (Section 7. A numeral to be classiﬁed is represented as a vector x containing pixel values taken from the numeral image. The function save chars saves drawn numerals in the stage of collecting training examples and the function ocr fun is used to recognize the numerals if the trained OCR is applied.1a). The training stage of the OCR system involves (i) collection of training examples of numerals and (ii) training the multiclass SVM. The GUI for drawing numerals is implemented in a function mpaper. The function mpaper creates a form composed of 50 cells to which the numerals can be drawn (see Figure 7. The function collect chars invokes the form for drawing numerals and allows to save the numerals to a speciﬁed ﬁle. The multiclass SVM are used as the base classiﬁer of the numerals. In particular. the functions save chars and ocr fun are called from the mpaper to process drawn numerals. The set of training examples has to be collected before the OCR is trained. The STPRtool contains examples of numerals collected 85 . 7. The GUI is intended just for demonstration purposes and the scanner or other device would be used in practice instead.2). The STPRtool provides GUI which allows to use standard computer mouse as an input device to draw images of numerals.1) and image denoising system using the kernel PCA (Section 7. To see the trained OCR system run the demo script demo ocr.Chapter 7 Examples of applications We intend to demonstrate how methods implemented in the STPRtool can be used to solve a more complex applications. The function mpaper is designed to invoke a speciﬁed function whenever the middle mouse button is pressed. The drawn numerals are normalized to size 16 × 16 pixels (diﬀerent size can be speciﬁed).1 Optical Character Recognition system The use of the STPRtool is demonstrated on a simple Optical Character Recognition (OCR) system for handwritten numerals from 0 to 9.
The vector dimension n = 256 corresponds to 16 × 16 pixels of each image numeral and l = 50 is the maximal number of numerals drawn to a single form. . The Figure 7. σ ∗ ) turned out to be have the lowest crossvalidation error over the given set Θ.1a shows the form ﬁlled with the examples of the numeral ’1’. The ﬁle name format name xx. A common method to tune the free parameters is to use the crossvalidation to select the best parameters from a preselected set Θ = {(C1 .mat. Having the labeled training set TXY = {(x1 . . The string name is name of the person who drew the particular numeral and xx stands for a label assigned to the numeral. If the ﬁle name format name xx. After the parameters (C ∗ . (Ck . Therefore. The methods to train the multiclass SVM are described in Chapter 5. 86 . .from 6 diﬀerent persons. σ1 ).mat is met then the function mergesets(Directory. y1 ). The function evalsvm returns the SVM model its parameters (C ∗ . . σ ∗ ) are tuned the SVM classiﬁer is trained using all training data available. The SVM has two free parameters which have to be tuned: the regularization constant C and the kernel function k(xa . The labels from y = 1 to y = 9 corresponds to the numerals from ’1’ to ’9’ and the label y = 10 is used for numeral ’0’. The STPRtool contains a function evalsvm which evaluates the set of SVM parameters Θ based on the crossvalidation estimation of the classiﬁcation error. .5 xa − xb 2 /σ 2 ) determined by the parameter σ was used in the experiment. . σk )}. In particular.1b shows numerals after normalization which are saved to the speciﬁed ﬁle honza cech. Each person drew 50 examples of each of numerals from ’0’ to ’9’. The training procedure for OCR classiﬁer is implemented in the function train ocr. The numerals are stored in the directory /data/ocr numerals/. yl )} created the multiclass SVM classiﬁer can be trained.DataFile) can be used to merge all the ﬁle to a single one. if a speed of classiﬁcation is the major objective the multiclass BSVM formulation is often a better option as it produce less number of support vectors. xb ) = exp(−0. the free SVM parameters are C and σ. The images of normalized numerals are stored as vectors to a matrix X of size [256 × 50]. . Figure 7. to collect examples of numeral ’1’ from person honza cech collect_chars(’honza_cech1’) was called. the OneAgainstOne (OAO) decomposition was used as it requires the shortest training time. . The resulting ﬁle contains a matrix X of all examples and a vector y containing labels are assigned to examples according the number xx. (xl . The parameters C = ∞ (data are separable) and σ = 5 were found to be the best. The directory /demo/ocr/models/ contains the SVM model trained by OAO decomposition strategy (function oaosvm) and model trained based on the multiclass BSVM formulation (function bsvm2).mat (the character can be omitted from the name) is used. However. The parameter tuning is implemented in the script tune ocr. The RBF kernel k(xa . The string Directory speciﬁes the directory where the ﬁles name xx. xb ) with its argument.mat are stored and the string FileName is a ﬁle name to which the resulting training set is stored. For example.
mat stored in the directory /demo/ocr/. φ(xl )} denote the feature space representation of the training images TX .3. The idea of using the kernel PCA for image denoising is demonstrated in Figure 7. . To run the resulting OCR type demo ocr or call mpaper(struct(’fun’. The kernel PCA can be used to ﬁnd the basis vectors of the subspace Fimg from the input training data TX . Thus the unlabeled training data set TX = {x1 . . . Postal Service (USPS) database of images of handwritten numerals [17] was used as the class of images to model. The kernel PCA computes the lower dimensional representation TZ = {z 1 . The kernel function corresponds to the dot product in the feature space. A snapshot of running OCR is shown in Figure 7. . xb ) = φ(xa ) · φ(xb ) . i. The additive Gaussian noise was assumed as the source of image degradation. xl } is used. z i ∈ Rm of the mapped images 87 .mat its contains is summarized in Table 7. The model should describe the nonlinear subspace Ximg ⊆ Rn where the images live. z l }. It is assumed that the nonlinear subspace Ximg forms a linear subspace Fimg of a new feature space F . . .. It is implemented in the function usps2mat.’ocr_fun’)) If one intends to modify the classiﬁer or the way how the classiﬁcation results are displayed then modify the function ocr fun. 7. . Having the kernel PCA model of the subspace Fimg it can be ˆ used to map the images x ∈ X into the subspace Ximg . The U.S. This method was proposed in [20] and further developed in [15].e. The kernel PCA was used to model the image class. A link between the input space X and the feature space F is given by a mapping φ: X → F which is implicitly determined by a selected kernel function k: X × X → R. . . A noisy image x is assumed to lie outside the subspace Ximg . The subspace Fimg is modeled by the kernel PCA trained from the set TX mapped into the feature space F by a function φ: X → F . The denoising procedure is based on mapping the noisy ˆ image x into the subspace Ximg . The example which produced the ﬁgure is implemented in demo kpcadenois. . . The aim is to train a model of images of all numerals at ones.1.2. The corruption of the USPS images by the Gaussian noise is implemented in script make noisy data. The STPRtool contains a script which converts the USPS database to the Matlab data ﬁle. . The script make noisy data produces the Matlab ﬁle usps noisy. Let TΦ = {φ(x1 ).2 Image denoising The aim is to train a model of a given class of images in order to restore images corrupted by noise.The particular model used by OCR is speciﬁed in the function ocr fun which implicitly uses the ﬁle ocrmodel. k(xa .
However.1) The vector k(x) = [k(x1 . i=1 β = A(z − b) . The input are 7291 grayscale images of size 16 × 16. (7. Given a projection φ(x) onto the kernel PCA subspace an n image x ∈ R of the input space can be recovered by solving the preimage problem ˜ x = argmin φ(x) − φ(x) x 2 .2) can be well solved for this case. The whole ˆ procedure of reconstructing the corrupted image x ∈ Rn involves (i) using (7. xb ) = exp(−0. This procedure is implemented in function kpcarec. x)]T contains kernel functions evaluated at the training data TX .mat produced by the script make noisy data.2) which yields the resulting x .1) to ˜ compute z which determines φ(x) and (ii) solving the preimage problem (7. In contrast to the linear PCA. . . x).2) The Radial Basis Function (RBF) kernel k(xa . where ˜ φ(x) = l βi φ(x) . Structure array tst containing the testing data: gnd X [256 × 2007] Testing images of numerals represented as column vectors. . the kernel PCA allows to model the nonlinearity which is very likely in the case of images. y [1 × 7291] Labels y ∈ Y = {1. 10} of training images. X [256 × 7291] Training images corrupted by the Gaussian noise with signaltonoise ration SNR = 1 (it can be changed). . . . .1: The content of the ﬁle usps noisy. The input are 2007 grayscale images of size 16 × 16. k(xl . (7. . The matrix A [l × m] and the vector b [m × 1] are trained by the ˜ kernel PCA algorithm. 10} of training images. Structure array trn containing the training data: gnd X [256 × 7291] Training images of numerals represented as column vectors. . .Table 7. z = AT k(x) + b . . the nonlinearity hidden in 88 . TΦ to minimize the mean square reconstruction error l εKM S = i=1 ˜ φ(xi ) − φ(xi ) 2 . y [1 × 2007] Labels y ∈ Y = {1. . X [256 × 2007] Training images corrupted by the Gaussian noise with signaltonoise ration SNR = 1 (it can be changed).5 xa − xb 2 /σ 2 ) was used here as the preimage problem (7.
m1 ). The images are assumed to be generated from distribution PX (x). The ﬁgure was produced by a script show denois tuning. mk )} such that the crossvalidation estimate of the mean square error PX (x) x − x 2 dx . m) = X was minimal. Another parameter is the dimension m of the subspace kernel PCA subspace to be trained. . images restored by the linear PCA and kernel PCA. The whole training procedure of the kernel PCA is implemented in the script train kpca denois.6 shows examples of ground truth images. noisy images. 89 . Having the parameters tuned the resulting kernel PCA and linear PCA are trained on the whole training set. . The best parameters (σ ∗ .2).4 shows the idea of tuning the parameters of the kernel PCA. . . The training for linear PCA is implemented in the script train lin denois. The linear PCA was used to solve the same task for comparison. An example of denoising of the testing data of USPS is implemented in script show denoising.the parameter σ of the used RBF kernel has to be tuned properly to prevent overﬁtting. The results of tuning the parameters are displayed in Figure 7. εM S (σ.5. The x stands for the ground truth image and x denotes the image ˆ restored from noisy one x using (7. m∗ ) was selected out of a prescribed set Θ = {(σ1 . The function kpcarec is used to denoise images using the kernel PCA model. (σk . In the case of the linear PCA. Figure 7.4 is used for training the model. the function pcarec was used.1) and (7. The same idea was used to tune the output dimension of the linear PCA. It produces sparse model and can deal with large data sets. The greedy kernel PCA Algorithm 3. Figure 7.
25 0. 90 .2 0.6 0.4 0. right_button − erase.15 0.5 0.8 0. middle_button − classify.2 0.1 0.35 0.5 0.7 0.3 0. 0.9 1 (a) (b) Figure 7.1 0.Control: left_button − draw.05 0 0 0.1: Figure (a) shows handwritten training examples of numeral ’1’ and Figure (b) shows corresponding normalized images of size 16 × 16 pixels.3 0.45 0.4 0.
right_button − erase.15 0.7 0. 91 .2: A snapshot of running OCR system.45 0.35 0.3 0.3 0.1 0.45 0.4 0 3 2 0.1 0.25 0.3 0.9 1 (b) Figure 7.2 0.2 0.25 0.05 0 0 0.1 0.5 0. 0.15 0.2 0.9 1 (a) 0.5 0.8 1 9 8 0.4 0.2 0.1 0.05 0 0 1 2 3 4 5 6 7 8 9 0 4 5 6 0.35 0.6 0.7 0. middle_button − classify.5 0.4 0.6 0.5 0. Figure (a) shows handwritten numerals and Figure (b) shows recognized numerals.4 0.3 0.8 0.Control: left_button − draw.
3: Demonstration of idea of image denoising by kernel PCA.7 6 5 4 3 2 1 0 −1 −2 −3 −4 −4 Ground truth Noisy examples Corrupted Reconstructed −3 −2 −1 0 1 2 3 4 5 6 7 Figure 7. Distortion model KPCA model θ∈Θ Figure 7.4: Schema of tuning parameters of kernel PCA model used for image restoration. Preimage problem error θ Data generator Error assessment 92 .
5: Figure (a) shows plot of εM S with respect to output dimension of linear PCA.6: The results of denosing the handwritten images of USPS database.5 7 7.5 13 12 11 10 0 10 20 30 40 50 dim 60 70 80 90 100 4 4.5 5 5.5 8 (a) (b) Figure 7. Ground truth Numerals with added Gaussian noise Linear PCA reconstruction Kernel PCA reconstruction Figure 7. Figure(b) shows plot of εM S with respect to output dimension and kernel argument of the kernel PCA. 93 .5 σ 6 6.18 20 dim = 50 dim = 100 dim = 200 dim = 300 17 18 16 16 15 14 εMS 14 εMS 12 10 8 6 3.
• AdaBoost and related learning methods. 94 . The future planes involve the extensions of the toolbox by the following methods: • Feature selection methods. Please send us an email and your potential contribution. We welcome anyone who likes to contribute to the STPRtool in any way. • Discrete probability models estimated by the EM algorithm.Chapter 8 Conclusions and future planes The STPRtool as well as its documentation is being updating. • Model selection methods for SVM. Therefore we are grateful for any comment. There are many relevant methods and proposals which have not been implemented yet. • Probabilistic Principal Component Analysis (PPCA) and mixture of PPCA estimated by the ExpectationMaximization (EM) algorithm. • Eﬃcient optimizers to for large scale Support Vector Machines learning problems. suggestion or other feedback which would help in the development of the STPRtool.
Fakulta e c ı e elektrotechnick´.W.P. pages 236–239. and D. Support Vector Machines. Neural Networks for Pattern Recognition.M. Pattern Classiﬁcation. volume 2. 2001.G. 1997. 3th edition. Cesk´ vysok´ uˇen´ technick´. P. Duda. 2003. [5] A. Duin. Neural Computation. Cristianini and J. Prtools: A matlab toolbox for pattern recognition.O. Baudat and F. 1977. edition.R. Franc. Hlav´ˇ. Laurendeau. 39:185– 197. 2000. 2000. Master’s thesis. Rubin. a [9] V. Bahadur. Laird. Multiclass support vector machine. [2] G. [4] N.M. Anals of Mathematical Statistics. Kasturi. 33:420–431. Pattern Recognition. [7] R. Anouar. Anderson and R. 2002.. An iterative algorithm learning the maximal margin classiﬁer. and D.P. editors. ac [10] V. In Czech). Classiﬁcation into two multivariate normal distributions with diﬀerrentia covariance matrices.E. Stork.Bibliography [1] T. ac D. Franc and V. June 1962. Clarendon Press. In R. Dempster. [8] V. N. Hart. and Suen C. Bishop. February 2000. Great Britain. Katedra kybernetiky. 16th International Conference on Pattern Recognition. ShaweTaylor. 2000. 95 . John Wiley & Sons. Maximum likelihood from incomplete data via the EM Algorithm. Cambridge University Press. [6] R.W. IEEE Computer Society. 12(10):2385–2404. Generalized discriminant analysis using a kernel approach. Journal of the Royal Statistical Society. Franc and V. Oxford.B. 36(9):1985–1996. Hlav´ˇ. [3] C. 2nd. Programov´ n´stroje pro rozpozn´v´n´ (Pattern Recognition Programe a a a ı ˇ e ming Tools.
R. [12] C. Advances in Neural Information Processing Systems 11. [22] J. Springer. Neural Computation. Shevade. In M. Springer. D.S. S. Mika. E. P.T. Denker. Smola.T. Franz Matthias. Bhattacharya. and K. Backpropagation applied to handwritten zip code recognition. Smola. USA. Statisical Learning in Computer Vision. In Leonardis Aleˇ and Bischof Horst. NETLAB : algorithms for pattern recognition. A. M. New York. editors. Nabney. and K. editors. MIT Press. 2002. Probabilities for sv machines. Westenberg. pages 41–48. Sch¨lkpf.[11] V. 11(1):124–136. Advances in pattern recognition. Douglas. Krishnan. Scholkopf.R. May 2004. [15] Kim Kwang. Computer Analysis of Images and Patterns. March 2002. Principal Component Analysis.A.W. [16] J. MIT Press. In. Wilson. Larsen. LeCun. Petkov and M. R¨tsch. Fisher discriminant analysis with kernel. O.K. Murthy. Lin. August 2003. IEEE. 1999.. s editors. 13(2). M¨ ller. New York. B. Boser. Howard. and D. John Wiley & Sons. 1997. Cohn. K. a o u [19] S. Berlin.A. editors.S. Schuurmans. Springer. Kwok and I. D. Kearns.S. J. January 2000. 1986. 2003.C. Tsang. pages 536 – 542. B. R¨tsch.J. G. J. Solla. pages 408–415. Hsu and C. [20] S. Platt. and D. 1989. Jollife. [21] I. Kernel o u a pca and denoising in feature spaces. [17] Y. [13] I. Henderson. Advances in Large Margin Classiﬁers (Neural Information Processing Series). Keerthi.. M¨ ller. Sch¨lkopf. A fast iterative nearest point algorithm for support vector machine classiﬁer design. The EM Algorithm and Extensions. Hlav´ˇ. IEEE Transactions on Neural Networks. Franc and V. In Y. B. SpringerVerlag. Bartlett. editors.J Jackel. MA.W.K. Scholz. ECCV Workshop. Kernel hebbian o algorithm for singlefram superresolution. C. In A. and S. 1:541–551.R. W. Hu.J. A comparison of methods for multiclass support vector machins.J. and Sch¨lkopf Bernhard. Washington.A. S. London. Weston.E. The preimage problem in kernel methods. pages 426–433. In Proceedings of the Twentieth International Conference on Machine Learning (ICML2003). J. and L. Greedy algorithm for a training set reduction in the ac kernel methods. Neural Networks for Signal Processing. Cambridge. Mika. B. In N. [14] S. and G. IEEE Transactions on Neural Networks.H. 1999. Hubbard. McLachlan and T.T. 2000. [18] G. Germany. 96 .
Series B. 1968. 1995. Schlesinger. [24] B. Platt.I. 1994. and an interpretation of clustering as approximation in feature spaces. Neural networks and related methods for classiﬁcation (with discusion). Sequential minimal optimizer: A fast algorithm for training support vector machines. 97 .J. Burges. pages 124–132.I. 1998. In P. The nature of statistical learning theory. Knirsch. A. Sch¨lkopf. Ahler. 2:81–88.C. editors. SpringerVerlag. Technical Report MSRTR9814. Muller. [28] B. Sch¨lkopf. 10:1299–1319. A connection between learning and selflearning in the pattern recognition (in Russian). M. MIT Press. Cambridge. DAGM.D. MA. Schanz. Fast approximation of support o vector kernel expansions. Ten lectures on statistical and structural pattern ac recognition. [26] M. J. C. Scholkopf. May. and C. Levi. Kibernetika. and A. Mustererkennung 199820.J. [30] B. 1998. P. Smola. Springer Verlag.J. Neural Computation. Hlav´ˇ. A. Royal Statistical Soc. Smola. 1999. 2002. and F. [25] M. Riply. Redmond. Burges. Learning with Kernels.R. o [31] V. 56:409–456. Berlin. Sch¨lkopf and A. Microsoft Research. and K. Nonlinear component analysis as a kernel eigenvalue problem. Germany. Smola. Schlesinger and V. Advances in Kernel Methods – Support o Vector Learning. [27] B. [29] B. Smola. 2002. The MIT Press. 1998. R. Kluwer Academic Publishers..[23] J. Vapnik.
This action might not be possible to undo. Are you sure you want to continue?