Networks
Johan Suykens
KU Leuven
September 2015
Abstract
In many application areas massive and growing volumes of data
are available which can be further explored and analysed in order to
obtain improved models, extract knowledge and automate processes.
Typical examples include pattern recognition, biomedical applications
and bio-informatics, signal processing and system identification, in-
dustrial processes, fraud detection, webmining, e-commerce, financial
engineering etc. For each of these areas artificial neural networks
constitute an important methodology for system analysis and design.
Neural networks are universal approximators, possess a parallel ar-
chitecture, can be trained either in batch mode or on-line from given
patterns and are a powerful class of methods for nonlinear modelling.
There exist both methods of supervised and unsupervised learning.
In this course a number of important classical and advanced meth-
ods for datamining and neural networks are discussed. Popular tech-
niques in neural networks (such as multilayer perceptrons and radial
basis function networks) are presented with aspects of architectures,
learning, optimization, on-line versus batch training, generalization,
validation, feedforward and recurrent networks, statistical interpre-
tations, pruning, variance reduction, discriminant functions, density
estimation and regularization theory. Special attention is paid to ef-
ficient and reliable methods for classification and function estimation
and for mining large data sets. Emphasis is also put on preprocess-
ing, feature selection, dimensionality reduction and incorporation of
expert knowledge. Besides classical neural network techniques, more
advanced methods such as Bayesian inference, statistical learning
theory and support vector machines (kernel methods) are also
explained. With respect to unsupervised learning, clustering
algorithms (and related methods based on expectation-maximization),
vector quantization and self-organizing maps are presented.
Contents

Foreword
1 Introduction
7 Conclusions
Chapter 1
Introduction
[Figure: the knowledge discovery process: DATA → Target Data (selection) → Preprocessed Data (preprocessing) → Transformed Data (transformation) → Patterns (data mining) → Knowledge (interpretation), leading to a report or model extracted from the database.]
Neural Networks
Journals
• Neural Networks
http://www.elsevier.com/locate/neunet
• Neural Computation
http://neco.mitpress.org/
• Neurocomputing
http://www.journals.elsevier.com/neurocomputing/
• Machine Learning
http://link.springer.com/journal/10994
Conferences
• Bayesian inference
http://wol.ra.phy.cam.ac.uk/mackay/
x1
x2
y
x n-1
x
n
Figure 1.4: Multilayer perceptron with one hidden layer, output y and
input vector x.
• Bioinformatics:
A new term has been coined for the communities of molecular
biology, engineering and computer science: bioinformatics. The
term bioinformatics refers to the creation and advancement of
algorithms, computational and statistical techniques, and theory
to solve formal and practical problems inspired from the man-
agement and analysis of biological data. The explosion in the
rate of acquisition of biomedical data and advances in molecular
genetics technologies, such as DNA microarrays, now make it
possible to obtain a “global” view of the cell. For example, the biological
molecular state of a cell can now be investigated by measuring
the simultaneous expression of tens of thousands of genes using
DNA microarrays.
• Biomedicine:
Ovarian cancer is the most lethal cancer of the female repro-
ductive system [61]. According to the American Cancer So-
ciety, almost 15,000 women died of ovarian cancer in the US
in 2004. Only four types of cancer were more lethal. On the
other hand, many women present with benign ovarian tumours,
which could be managed conservatively or removed with minimally
invasive surgery.
[Figure: normalized load versus hour of the day.]
Notations
The notations in this course will be defined locally within each chapter
or section of a chapter.
Chapter 2

Neural Networks and Modelling
[Figure: a single neuron with inputs x_1, ..., x_n, weights w_1, ..., w_n and bias b:]

a = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b
y = f(a)
y = W σ(V x + β)   (2.2)

β^{(1)} ∈ R^{n_{h1}}, where n_{h1} and n_{h2} denote the numbers of neurons in the hidden layers. In elementwise notation this becomes

y_i = Σ_{r=1}^{n_{h2}} w_{ir} σ( Σ_{s=1}^{n_{h1}} v^{(2)}_{rs} σ( Σ_{j=1}^{m} v^{(1)}_{sj} x_j + β^{(1)}_s ) + β^{(2)}_r ),   i = 1, ..., l   (2.5)
where the upper indices indicate the layer numbers. Sometimes the
inputs are considered to be part of a so-called input layer. However, in
order to specify the number of layers of a network and avoid confusion
it is preferred to mention the number of hidden layers and to define the
number of layers as the number of hidden layers plus
the output layer.
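As an illustration of Eqs. (2.2) and (2.5), the forward pass of an MLP with two hidden layers can be sketched in a few lines of NumPy; the tanh activation and the layer sizes below are illustrative assumptions, not values from the text:

```python
import numpy as np

def mlp_forward(x, V1, beta1, V2, beta2, W):
    """Forward pass of an MLP with two hidden layers (cf. Eq. (2.5)):
    y = W sigma(V2 sigma(V1 x + beta1) + beta2), linear output layer."""
    sigma = np.tanh                      # chosen activation function
    h1 = sigma(V1 @ x + beta1)           # first hidden layer, size nh1
    h2 = sigma(V2 @ h1 + beta2)          # second hidden layer, size nh2
    return W @ h2                        # l outputs

# toy dimensions: m = 3 inputs, nh1 = 5, nh2 = 4 hidden units, l = 2 outputs
rng = np.random.default_rng(0)
m, nh1, nh2, l = 3, 5, 4, 2
V1, beta1 = rng.standard_normal((nh1, m)), rng.standard_normal(nh1)
V2, beta2 = rng.standard_normal((nh2, nh1)), rng.standard_normal(nh2)
W = rng.standard_normal((l, nh2))
x = rng.standard_normal(m)
y = mlp_forward(x, V1, beta1, V2, beta2, W)
```

The matrix form (2.2) and the elementwise form (2.5) give identical outputs; the code uses the matrix form for clarity.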
[Figure: common activation functions: sigmoid, tanh, saturation (sat), sign and Gaussian.]
put these problems within the right perspective [17, 18, 55].
It is well-known that discrete time linear systems with input vector u ∈ R^m, output vector y ∈ R^l and state vector x ∈ R^n can be represented in state space form as

x_{k+1} = A x_k + B u_k
y_k = C x_k .   (2.7)
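A minimal simulation of the linear state space model (2.7); the stable 2-state system below is an illustrative made-up example:

```python
import numpy as np

def simulate_lss(A, B, C, u, x0):
    """Simulate x_{k+1} = A x_k + B u_k, y_k = C x_k (Eq. (2.7))."""
    x = np.asarray(x0, dtype=float)
    ys = []
    for uk in u:
        ys.append(C @ x)       # output before the state update
        x = A @ x + B @ uk     # state update
    return np.array(ys)

# stable 2-state system (eigenvalues 0.8 and 0.5) with a step input
A = np.array([[0.8, 0.1], [0.0, 0.5]])
B = np.array([[1.0], [0.5]])
C = np.array([[1.0, 0.0]])
u = [np.array([1.0])] * 20
y = simulate_lss(A, B, C, u, x0=np.zeros(2))
```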
parameterized by an MLP as
[Figures: simulated state trajectories x_i versus discrete time k, and corresponding phase plots in the (x_1, x_2) plane.]
for the nonlinear case where wk , vk denote the process noise and ob-
servation (or measurement) noise, respectively.
ŷk = f (yk−1, yk−2 , ..., yk−p, uk−1, uk−2, ..., uk−p) (2.13)
with zk|k−p = [yk−1 ; yk−2; ...; yk−p; uk−1; uk−2; ...; uk−p].
In the so-called NARMAX model structure (Nonlinear ARMAX: AutoRegressive
Moving Average with eXogenous input) one also tries to model the
noise influence by additionally including past noise terms in the regression vector.
[Figure: system identification: the model is driven by the same inputs as the system and trained to match the system outputs.]
f(x) = Σ_{j=1}^{2n+1} χ_j ( Σ_{i=1}^{n} φ_{ij}(x_i) )

where χ_j, φ_{ij} are continuous functions and the φ_{ij} are also monotone.
[Figure: the modelling flow: model structure → parameterization → cost function in the unknown weights → training → testing.]

[Figure: one-step ahead prediction at time k: measured outputs y_{k−p}, ..., y_{k−1}, y_k and predicted outputs ŷ_{k−p}, ..., ŷ_{k−1}, ŷ_k, ŷ_{k+1}.]
A link between the Sprecher Theorem and MLP neural networks was
made by Hecht-Nielsen.
These results also hold for networks with multiple outputs. The fol-
lowing Theorem is more specific about the kind of activation functions
that are allowed.
This means that for a given number of training data, the number of
parameters to be estimated is huge (which should be avoided as we
where the upper index is the layer index and the lower indices indicate
the neuron within a layer and the pattern index. x^l_{i,p} denotes the i-th
component of the output vector of layer l for pattern p and w^l_{ij}
the ij-th entry of the interconnection matrix of layer l. Eventually, the
activation function σ(·) might also change from layer to layer. Before
we are in a position to formulate the backpropagation algorithm we
first have to define so-called δ variables

δ^l_{i,p} = ∂E_p / ∂ξ^l_{i,p}   (2.21)
where E_p = (1/2) Σ_{i=1}^{N_L} (x^d_{i,p} − x^L_{i,p})^2 is the error for pattern p and x^d_{i,p}
denotes the desired output (note that this method is supervised).
The objective function (sometimes called energy function) that one
usually optimizes for the neural network is the mean squared error
(MSE) on the training set of patterns:

min_{w^l_{ij}} E = (1/P) Σ_{p=1}^{P} E_p ,   E_p = (1/2) Σ_{i=1}^{N_L} (x^d_{i,p} − x^L_{i,p})^2 .   (2.22)
with learning rate η. Note that the last equation is a backward
recursive relation on the δ^l_{i,p} variables in the layer index l. The
backpropagation algorithm is an elegant method to obtain analytic
expressions for the gradient of the cost function defined on a feedforward
network with many layers. One could imagine that obtaining
expressions for the gradient in the case of one hidden layer is straight-
forward. However, suppose one has a network with e.g. 100 layers, it
is clear then that obtaining an expression for the gradient becomes far
from trivial, while by applying this generalized delta rule it becomes
straightforward. The special structure appearing in this generalized
delta rule is due to the layered structure of the network. In order to
fix the ideas, an application of the generalized delta rule is shown for
an MLP with one hidden layer (L = 2) in Fig.2.12. In this case the
equations become
∆w^2_{ij} = η δ^2_{i,p} x^1_{j,p}
∆w^1_{ij} = η δ^1_{i,p} x^0_{j,p}   (2.24)

δ^2_{i,p} = (x^d_{i,p} − x^2_{i,p}) σ′(ξ^2_{i,p})
δ^1_{i,p} = ( Σ_{r=1}^{N_2} δ^2_{r,p} w^2_{ri} ) σ′(ξ^1_{i,p}).
∆w^l_{ij}(k+1) = η δ^l_{i,p} x^{l−1}_{j,p} + α ∆w^l_{ij}(k)   (2.25)
where k is the iteration step and 0 < α < 1. Often also an adaptive
learning rate η is taken. The previous change in the interconnection
weights is taken into account in this learning rule.
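The generalized delta rule (2.24) for an MLP with one hidden layer can be sketched as follows; bias terms are omitted for brevity and tanh is used as the activation function (illustrative choices, not prescribed by the text):

```python
import numpy as np

def forward(x0, W1, W2):
    """One-hidden-layer MLP without bias terms: x2 = sigma(W2 sigma(W1 x0))."""
    xi1 = W1 @ x0; x1 = np.tanh(xi1)
    xi2 = W2 @ x1; x2 = np.tanh(xi2)
    return xi1, x1, xi2, x2

def delta_rule_grads(x0, xd, W1, W2):
    """Generalized delta rule, Eq. (2.24) with eta = 1: returns the updates
    Delta_W2 = outer(delta2, x1), Delta_W1 = outer(delta1, x0)."""
    xi1, x1, xi2, x2 = forward(x0, W1, W2)
    dsig = lambda xi: 1.0 - np.tanh(xi) ** 2      # sigma'(xi) for tanh
    delta2 = (xd - x2) * dsig(xi2)                # output-layer deltas
    delta1 = (W2.T @ delta2) * dsig(xi1)          # backpropagated hidden deltas
    return np.outer(delta2, x1), np.outer(delta1, x0)

rng = np.random.default_rng(1)
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
x0, xd = rng.standard_normal(3), rng.standard_normal(2)
dW2, dW1 = delta_rule_grads(x0, xd, W1, W2)
```

The updates equal minus the gradient of E_p, which can be verified numerically by finite differences.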
[Figure 2.12: application of the generalized delta rule for an MLP with one hidden layer: forward pass x^0 → x^1 → x^2 through the weights w^1_{ij}, w^2_{ij}; backward pass of the deltas δ^2 → δ^1.]
ŷk+1 = f (zk|k−n ), zk|k−n = [yk ; yk−1; ... ; yk−n ; uk ; uk−1 ; ...; uk−n]
(2.26)
where f(·) is parameterized by an MLP. The training set consists
of input patterns {z_{k|k−n}}_{k=1}^{N} and output patterns {y_{k+1}}_{k=1}^{N} with N
given data points. For the objective function

min_{w^l_{ij}} (1/2N) Σ_{k=1}^{N} (y_{k+1} − ŷ_{k+1})^2   (2.27)

one has

E = (1/N) Σ_{k=1}^{N} E_k ,   E_k = (1/2) (y_{k+1} − ŷ_{k+1})^2 .   (2.28)
Figure 2.13: In neural networks with too many hidden units, overfitting
takes place during the optimization process: the error on an
independent test set starts to increase while the training set error
keeps decreasing.
by

∂f/∂(∆x) = g + H ∆x = 0  →  ∆x = −H^{−1} g.   (2.31)
One can see that the optimal step is determined by the gradient and
the inverse of the Hessian matrix. A sequence of points in the search
space {x0 , x1 , x2 , ...} is generated by applying these search directions
and calculating the optimal stepsize along these directions. It is well-
known that the Newton method converges quadratically which is much
faster than the steepest descent algorithm.
Unfortunately, one encounters a number of problems when trying
to apply the Newton method to the training of neural networks. A first
problem is that it often occurs that the Hessian has zero eigenvalues
which means that one cannot take the inverse of the matrix. A second
problem is that computing the second order derivatives analytically
for neural networks is very complicated. We have seen that even the
calculation of the gradient by means of the backpropagation method is
not that simple. One can overcome these two problems by considering
Levenberg-Marquardt and quasi-Newton methods.
One also has to be aware of the fact that there are very many local
minima solutions when training neural networks. MLP architectures
contain many symmetries (sign flips of the weights and
permutations of the hidden units). In general, for n_h hidden units
one has n_h! 2^{n_h} weight symmetries that lead to the
same input/output mapping of the network. One also starts from
small random interconnection weights in order to avoid (too) bad local
minima solutions.
min_{∆x} f(x) = f(x_0) + g^T ∆x + (1/2) ∆x^T H ∆x.   (2.32)

L(∆x, λ) = f(x_0) + g^T ∆x + (1/2) ∆x^T H ∆x + (1/2) λ (∆x^T ∆x − 1)   (2.33)
and the solution follows from ∂L/∂(∆x) = 0, ∂L/∂λ = 0 where

∂L/∂(∆x) = g + H ∆x + λ ∆x = 0  →  ∆x = −[H + λI]^{−1} g.   (2.34)
Note that for λ = 0 this corresponds to the Newton method and
for λ → ∞ this becomes a steepest descent algorithm. By adding a
positive definite matrix λI to H with λ positive, one can always invert
this matrix by taking λ sufficiently large. At every iteration step a
suitable value for λ is selected. Note that for λ small this method will
converge faster. In the case of a sum squared error cost function (as
used for neural networks) the Hessian H has a special structure which
can be exploited. Often an approximation is taken then for H based
on a Jacobian matrix.
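A Levenberg-Marquardt step as in Eq. (2.34) can be sketched as follows; note how a zero eigenvalue of H, which would break the pure Newton step, is handled by the λI term (the matrices are illustrative):

```python
import numpy as np

def lm_step(g, H, lam):
    """Levenberg-Marquardt step, Eq. (2.34): dx = -(H + lam*I)^{-1} g.
    lam = 0 gives the Newton step; large lam approaches (scaled)
    steepest descent."""
    n = len(g)
    return -np.linalg.solve(H + lam * np.eye(n), g)

# Hessian with a zero eigenvalue: the Newton step does not exist,
# but the regularized step is well defined for lam > 0
H = np.array([[2.0, 0.0], [0.0, 0.0]])
g = np.array([1.0, 1.0])
step = lm_step(g, H, lam=1e-2)
```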
p_0 = −g_0
x_{k+1} = x_k + α_k p_k ,   α_k s.t. min_{α_k} f(x_k + α_k p_k)   (line search)
p_{k+1} = −g_{k+1} + β_k p_k   (2.44)

with

β_k = (g_{k+1}^T g_{k+1}) / (g_k^T g_k)   (Fletcher-Reeves)
β_k = (g_{k+1}^T (g_{k+1} − g_k)) / (g_k^T g_k)   (Polak-Ribière).   (2.45)
Note that in the equation pk+1 = −gk+1 + βk pk one also has a momen-
tum effect, but a difference with the backpropagation with momentum
term is that βk is now automatically adjusted during the iterative pro-
cess. In these algorithms often a restart procedure is applied after n
steps, i.e. one resets the search direction again to p0 = −g0 . Modi-
fied versions of these algorithms have been successfully applied to the
optimization of neural networks [48]. The advantages of conjugate
gradient methods are that they are faster than backpropagation and
that no storage of matrices is required.
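On a quadratic cost f(x) = (1/2) x^T Q x − b^T x the line search in (2.44) has a closed form, so the Polak-Ribière variant can be sketched compactly (the matrices below are illustrative; for a neural network cost the line search would be approximate):

```python
import numpy as np

def cg_minimize(Q, b, x0, iters=10):
    """Conjugate gradient with Polak-Ribiere beta (Eq. (2.45)) on the
    quadratic f(x) = 0.5 x^T Q x - b^T x, with exact line search."""
    x = x0.astype(float)
    g = Q @ x - b                               # gradient g_0
    p = -g                                      # p_0 = -g_0
    for _ in range(iters):
        alpha = -(g @ p) / (p @ (Q @ p))        # exact line search
        x = x + alpha * p
        g_new = Q @ x - b
        beta = g_new @ (g_new - g) / (g @ g)    # Polak-Ribiere
        p = -g_new + beta * p
        g = g_new
        if np.linalg.norm(g) < 1e-12:           # converged
            break
    return x

Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_star = cg_minimize(Q, b, np.zeros(2))   # exact in at most n = 2 steps
```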
the chain rule. Finally, one can see that one has to simulate the
augmented system consisting of the state space model and the sensitivity
model with state vectors x and ∂x(t)/∂α, respectively.
A similar reasoning holds for input/output representations of sys-
tems. Consider for example a second order scalar nonlinear differential
equation
ÿ + F (α, ẏ) + y = u (2.50)
where F (·) is some nonlinear function depending on ẏ and α. The
initial condition y(0) = y10 , ẏ(0) = y20 and α some scalar parameter
to be adjusted. Let us take a cost function
J(α) = ∫_0^T [y(t) − d(t)]^2 dt   (2.51)

z̈ + (∂F/∂ẏ) ż + z = − ∂F/∂α   (2.53)
J(θ) = (1/2N) Σ_{k=1}^{N} ǫ_k(θ)^T ǫ_k(θ)   (2.55)

where

∂ǫ_k(θ)/∂θ = ∂[y_k − ŷ_k(θ)]/∂θ = − ∂ŷ_k(θ)/∂θ   (2.57)

and ∂ŷ_k(θ)/∂θ follows from the sensitivity model. The neural state space model is of the form

x̂_{k+1} = Φ(x̂_k, u_k, ǫ_k; α) ,   x̂_0 = x_0 given
ŷ_k = Ψ(x̂_k, u_k; β).   (2.58)
[Figure: the sensitivity model takes the model quantities ∂Φ/∂α, ∂Ψ/∂β and generates the output sensitivities ∂ŷ/∂α, ∂ŷ/∂β.]
∂Ψ/∂β :
   ∂Ψ_i/∂w_{CD,jl} = δ_{ji} tanh(ρ_l)
   ∂Ψ_i/∂v_{C,jl} = w_{CD,ij} (1 − tanh^2(ρ_j)) x̂_l
   ∂Ψ_i/∂v_{D,jl} = w_{CD,ij} (1 − tanh^2(ρ_j)) u_l
   ∂Ψ_i/∂β_{CD,j} = w_{CD,ij} (1 − tanh^2(ρ_j))

∂Φ/∂x̂_k :
   ∂Φ_i/∂x̂_r = Σ_j w_{AB,ij} (1 − tanh^2(φ_j)) v_{A,jr}

∂Ψ/∂x̂_k :
   ∂Ψ_i/∂x̂_r = Σ_j w_{CD,ij} (1 − tanh^2(ρ_j)) v_{C,jr} .
φ_{u^{2}′ ǫ^2}(k) = 0   ∀k

where φ_{xz} denotes the cross-correlation function between x_k and z_k,
and u^{2}′(k) = u^2(k) − ū^2(k), with ū^2(k) the time average (mean)
of u^2(k). The first two tests are common in the area of linear
system identification. In practice one works with normalized correlations
(−1 ≤ φ_{ψ1 ψ2}(τ) ≤ 1)
φ_{ψ1 ψ2}(τ) = ( Σ_{k=1}^{N−τ} ψ_1(k) ψ_2(k+τ) ) / ( Σ_{k=1}^{N} ψ_1^2(k) Σ_{k=1}^{N} ψ_2^2(k) )^{1/2}   (2.63)
for two sequences ψ_1(k), ψ_2(k) with discrete time index k. One then
defines 95% confidence bands as ±1.96/√N with N the number of data.
The identified model should be rejected if the correlations do not stay
within these bands. In that case one should try to look for other inputs,
or other past input and output values, to incorporate into the model.
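The normalized correlation (2.63) with its 95% confidence band 1.96/√N can be sketched as follows, here applied to white residuals, for which most lags should stay inside the band:

```python
import numpy as np

def norm_xcorr(psi1, psi2, tau_max):
    """Normalized cross-correlation phi_{psi1 psi2}(tau), Eq. (2.63),
    for tau = 0, ..., tau_max."""
    N = len(psi1)
    denom = np.sqrt(np.sum(psi1 ** 2) * np.sum(psi2 ** 2))
    return np.array([np.sum(psi1[:N - tau] * psi2[tau:]) / denom
                     for tau in range(tau_max + 1)])

rng = np.random.default_rng(2)
N = 1000
resid = rng.standard_normal(N)            # white residuals
phi = norm_xcorr(resid, resid, tau_max=20)
band = 1.96 / np.sqrt(N)                  # 95% confidence band
inside = np.abs(phi[1:]) < band           # phi(0) = 1 by construction
```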
A second well-known method is to apply a hypothesis test. Sup-
pose Ω(k) is an s-dimensional vector valued function of past inputs,
outputs and prediction errors. A convenient choice is Ω(k) = [ω(k); ω(k−
1); ... ; ω(k − s + 1)] with ω a function of input or output data. As the
null hypothesis one could take “The data are generated by the obtained
model”. If this hypothesis is true then the statistic ζ defined by
with Γ^T Γ = (1/N) Σ_{k=1}^{N} Ω(k) Ω^T(k) and µ = (1/N) Σ_{k=1}^{N} Ω(k) ǫ(k) / σ_ǫ,
is asymptotically chi-squared distributed with s degrees of freedom. Here σ_ǫ^2
denotes the variance of the residuals ǫ. The model is regarded as adequate
then if
ζ < χ_s^2(α)   (2.65)

where χ_s^2(α) is the critical value of the chi-squared distribution with
s degrees of freedom at significance level α (e.g. α = 0.05, giving a
95% acceptance region).
Chapter 3

Neural Networks and Classification

y = sign(v^T x + b) = sign(w^T z)   (3.1)
[Figure: perceptron: inputs x_1, ..., x_n with weights v_1, ..., v_n and bias b; the output is compared with the desired output d to form the error. The decision boundary is the hyperplane v̂^T x + b = 0 in the input space.]
Perceptron algorithm
1. Choose c > 0
3. The training cycle begins here. Present the i-th input and com-
pute the corresponding output
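A sketch of the perceptron training cycle, assuming targets d_i ∈ {−1, +1} and the standard updates v ← v + c d_i x_i, b ← b + c d_i on misclassified patterns (the toy data are made up):

```python
import numpy as np

def perceptron_train(X, d, c=1.0, max_epochs=100):
    """Perceptron learning: cycle through the patterns and, whenever
    sign(v^T x + b) differs from the target d_i in {-1, +1}, update
    v <- v + c*d_i*x_i and b <- b + c*d_i (learning gain c > 0)."""
    v = np.zeros(X.shape[1]); b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, di in zip(X, d):
            if np.sign(v @ xi + b) != di:
                v += c * di * xi
                b += c * di
                errors += 1
        if errors == 0:        # all patterns correct: training cycle ends
            break
    return v, b

# linearly separable toy problem
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
d = np.array([1, 1, -1, -1])
v, b = perceptron_train(X, d)
```

For linearly separable data the perceptron convergence theorem guarantees that the loop terminates with zero errors.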
[Figure 3.4: four points in the (x_1, x_2) plane labelled as in the XOR problem: this dichotomy is not linearly separable.]
with (N choose M) = N! / ((N − M)! M!) the number of combinations
of M objects selected from a total of N. Some important conclusions at this point
are that if N < d + 1 any labelling of points will always lead to linear
separability and for larger d it becomes likely that more dichotomies
are linearly separable. If one takes e.g. N = 4 and d = 2 one of the
possible dichotomies corresponds to the XOR problem configuration.
According to Fig.3.4 one can indeed see that not all problems are
linearly separable. However, if one considers the same problem e.g. for
d = 5 instead of d = 2 the problem becomes linearly separable. The
problem of how to select a good decision boundary will be discussed
later. Cover’s theorem should rather be considered as a theorem about
existence of hyperplanes and not about whether this hyperplane is
good or bad in terms of generalization.
[Figure: probability that a random dichotomy is linearly separable versus N/(d + 1), for d = 1, d = 20 and d → ∞.]
The output values {−1, +1} (or {0, 1}) denote then the two classes.
In the case of multiple outputs, one can encode 2l classes by means of
l outputs, e.g. for l = 2 one has the following option
y1 y2
Class 1 +1 +1
Class 2 +1 −1
Class 3 −1 +1
Class 4 −1 −1
but on the other hand one can also utilize one output per class:
y1 y2 y3 y4
Class 1 +1 −1 −1 −1
Class 2 −1 +1 −1 −1
Class 3 −1 −1 +1 −1
Class 4 −1 −1 −1 +1
which is better from the viewpoint of information theory. For example
in trying to recognize the letters of the alphabet by an MLP one can
take 26 outputs (Fig.3.5). Training can be done then in the same way
as for regression with the class labels as target values for the outputs.
After training of the classifier, decisions are made by the classifier as
follows:
y = sign[W tanh(V x + β)] (3.4)
[Figure 3.5: recognition of the letters of the alphabet with an MLP: 7 × 5 pixel input patterns and one output per class.]
user profile etc. When a mobile phone is stolen one might discriminate
between fraud and non-fraud by checking such features.
Now, typically the number of examples of non-fraud (data of class
C1 ) is much larger than the examples of fraud (data of class C2 ),
suppose e.g. that there would be 1000 times more examples of non-
fraudulent calls than fraudulent ones. In a probabilistic setting one
could say then that the prior class probabilities are P (C1 ) = 1000/1001,
P (C2 ) = 1/1001. At this stage the best classification rule would be
P (C1 ) > P (C2 ) meaning that we would always classify a new case as
belonging to class C1 according to this classification rule. The question
is then whether there exists a formalism which can combine this prior
knowledge with information of a training data set. The answer is Yes:
Bayesian decision theory can help us at this point.
Suppose that we consider fraud detection based on the frequency of
phone calls and assume that we characterize this feature as x_1 (the total
feature vector x is d-dimensional) in terms of discrete values X_l for
l = 1, ..., L where L denotes the total number of discrete values that
this variable can take. One can then consider the joint probability
P (Ck , Xl ), the conditional probability P (Xl |Ck ) (i.e. the probability
that an observation takes value Xl given it belongs to class Ck ), the
conditional probability P (Ck |Xl ) (i.e. the probability that the class
is Ck given that an observation takes value Xl ) and the prior class
probability P (Ck ). For the case of a binary classification problem one
has k = 1, 2.
One can write then

P(C_k, X_l) = P(X_l | C_k) P(C_k)

but also

P(C_k, X_l) = P(C_k | X_l) P(X_l).

Hence

P(C_k | X_l) = P(X_l | C_k) P(C_k) / P(X_l)   (3.5)

or conceptually

Posterior = (Likelihood × Prior) / Normalization.
Using this Bayes theorem, the classification can be based now on the
posterior probability P (Ck |Xl ) instead of the prior probabilities P (Ck )
only. The probability of misclassification is minimized by selecting the
class for which the posterior P (Ck |Xl ) is maximal.
P(C_k | x) = p(x | C_k) P(C_k) / p(x) ,   k = 1, ..., c ,   x ∈ R^d

Σ_{k=1}^{c} P(C_k | x) = 1   (normalization)   (3.8)

p(x) = Σ_{k=1}^{c} p(x | C_k) P(C_k)   (unconditional density).
[Figure: classifier design: data → preprocessing → definition of the feature space → classifier design (training) → categorization into pattern classes.]
Posterior = (Likelihood × Prior) / Normalization.   (3.9)

or

p(x | C_k) P(C_k) > p(x | C_j) P(C_j)   ∀ j ≠ k.
P(error) = P(x ∈ R_2, C_1) + P(x ∈ R_1, C_2)
         = P(x ∈ R_2 | C_1) P(C_1) + P(x ∈ R_1 | C_2) P(C_2)
         = ∫_{R_2} p(x | C_1) P(C_1) dx + ∫_{R_1} p(x | C_2) P(C_2) dx

P(correct) = Σ_{k=1}^{c} P(x ∈ R_k | C_k) P(C_k)   (3.11)
           = Σ_{k=1}^{c} ∫_{R_k} p(x | C_k) P(C_k) dx.
[Figure: the weighted class-conditional densities p(x | C_1)P(C_1) and p(x | C_2)P(C_2) over x, and the decision regions R_1, R_2.]
The decision boundaries yk (x) = yj (x) are not influenced by the choice
of the monotonic function.
In the case of a binary classification problem one may take a re-
formulation by a single discriminant function:
with class C1 if y(x) ≥ 0 and class C2 if y(x) < 0. Instead of using two
discriminant functions one can take a single one.
In the case of two classes this means that one selects class 1 if

L_{11} p(x | C_1) P(C_1) + L_{21} p(x | C_2) P(C_2) < L_{12} p(x | C_1) P(C_1) + L_{22} p(x | C_2) P(C_2)   (3.19)

or

(L_{21} − L_{22}) p(x | C_2) P(C_2) < (L_{12} − L_{11}) p(x | C_1) P(C_1).   (3.20)

Usually one has L_{ij} > L_{ii}. One obtains then the likelihood ratio

l_{12}(x) = p(x | C_1) / p(x | C_2) > ( P(C_2) / P(C_1) ) · ( (L_{21} − L_{22}) / (L_{12} − L_{11}) ) = θ_{12}.   (3.21)
The classification is made as follows: class C_1 if l_{12}(x) > θ_{12} and class
C_2 if l_{12}(x) < θ_{12}. A typical choice for the loss matrix is

L_{kj} = 1 − δ_{kj}

with δ_{kj} the Kronecker delta. For c = 2 one has L_{11} = L_{22} = 0,
L_{12} = L_{21} = 1. This means that the loss is equal to 1 if the pattern
is placed in the wrong class and zero if the pattern is placed in
the correct class. This corresponds again to the minimal
misclassification decision rule.
[Figure: contour of a Gaussian density in the (x_1, x_2) plane with principal axes u_1, u_2 and widths λ_1^{1/2}, λ_2^{1/2}.]
Case Σ_1 = Σ_2 = Σ

Case Σ_1 ≠ Σ_2
Similar arguments hold for the case of binary variables with Bernoulli
distribution.
For class-conditional densities with Σ_1 = Σ_2 = Σ we had

p(x | C_k) = 1 / ( (2π)^{d/2} |Σ|^{1/2} ) exp{ −(1/2) (x − µ_k)^T Σ^{−1} (x − µ_k) }.

This leads to a posterior of the form of a logistic sigmoid acting on a linear function of x,

P(C_1 | x) = 1 / ( 1 + exp(−(w^T x + b)) )

with

w = Σ^{−1} (µ_1 − µ_2)
b = −(1/2) µ_1^T Σ^{−1} µ_1 + (1/2) µ_2^T Σ^{−1} µ_2 + log( P(C_1) / P(C_2) ).   (3.33)
Note that the bias term b depends on the prior class probabilities
(which are often unknown in practice). Hence, different prior class
probabilities will lead to a translational shift of the hyperplane (or
straight line in the case of a two dimensional feature space).
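The discriminant (3.33) can be computed directly from the class means, the common covariance and the priors; the numbers below are an illustrative toy example:

```python
import numpy as np

def linear_discriminant(mu1, mu2, Sigma, P1, P2):
    """Linear discriminant for equal covariances, Eq. (3.33):
    w = Sigma^{-1}(mu1 - mu2),
    b = -0.5 mu1^T Sigma^{-1} mu1 + 0.5 mu2^T Sigma^{-1} mu2 + log(P1/P2).
    Classify x as C1 when w^T x + b > 0."""
    Si = np.linalg.inv(Sigma)
    w = Si @ (mu1 - mu2)
    b = -0.5 * mu1 @ Si @ mu1 + 0.5 * mu2 @ Si @ mu2 + np.log(P1 / P2)
    return w, b

mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
Sigma = np.eye(2)
w, b = linear_discriminant(mu1, mu2, Sigma, P1=0.5, P2=0.5)
```

Changing only the priors leaves w unchanged and shifts b, i.e. the hyperplane is translated, exactly as noted in the text.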
p(θ | χ) = p(χ | θ) p(θ) / p(χ).   (3.36)

For independent data points x_n (n = 1, ..., N) one has the likelihood
p(χ | θ) = Π_{n=1}^{N} p(x_n | θ). The normalization factor

p(χ) = ∫ p(θ′) Π_{n=1}^{N} p(x_n | θ′) dθ′

ensures ∫ p(θ | χ) dθ = 1. Fig. 3.10 illustrates the process of obtaining a
sharper estimate of the posterior p(θ | χ) by combining the prior p(θ)
with the data χ.
In non-parametric methods typically a Gaussian function is
placed at each data point x_n (n = 1, ..., N):

p̃(x) = (1/N) Σ_{n=1}^{N} 1 / ( (2πh^2)^{d/2} ) exp{ −‖x − x_n‖_2^2 / (2h^2) }   (3.37)
Figure 3.10: Bayesian inference: starting from a prior p(θ) with a large
uncertainty on the estimate θ, the data χ are used in order to generate
a posterior p(θ | χ) which becomes more accurate.
[Figure: Parzen window estimates with h too large (oversmoothed) and h too small (too spiky).]
with

Σ_{j=1}^{M} P(j) = 1 ,   0 ≤ P(j) ≤ 1   (3.39)

and normalization of the component densities p(x | j) such that ∫ p(x | j) dx = 1,
where x ∈ R^d and N data points are given. The mixing parameters
P(j) correspond to the prior probability that a data point has been
generated from component j of the mixture.
A network interpretation of the mixture model is given in Fig. 3.12.
It is closely related to RBF networks if one takes a Gaussian mixture

[Figure 3.12: network interpretation of the mixture model: the component densities p(x | 1), ..., p(x | M) of the input x ∈ R^d are combined with mixing weights P(1), ..., P(M) into p(x).]

p(x | j) = 1 / ( (2πσ_j^2)^{d/2} ) exp{ −‖x − µ_j‖_2^2 / (2σ_j^2) }   (3.40)
By writing P(j) = exp(γ_j) / Σ_{j′=1}^{M} exp(γ_{j′}), the so-called softmax
function, this can be done. There are then no constraints on the
values γ_j, which can take any real value.
[Figure: original distribution and whitened distribution in the (x_1, x_2) plane.]
Σ̂ uj = λj uj . (3.49)
[Figure: gene expression data matrix with G the number of genes and N the number of experiments.]
        Negative   Positive
True    TN         TP
False   FN         FP
meaning that the error made can be expressed in terms of the smallest
eigenvalues of the neglected components.
Figure 3.16: ROC curve for three classifiers A,B,C. The classifier C has
the best performance. It has the largest area under the ROC curve.
An ROC curve is obtained by moving the threshold of a classifier.
1. The ROC curve dates back to the Second World War, where it was used for
radar signals. After a paper by Swets was published in Science, it became
popular in the medical world.
[Figure: distributions of the classifier output for Class 1 and Class 2, with a varying decision threshold.]
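An ROC curve as described above can be obtained by sweeping the threshold over the classifier scores; a sketch, assuming a {0, 1} label coding (the scores below are made-up values):

```python
import numpy as np

def roc_curve(scores, labels):
    """ROC by sweeping the decision threshold over the scores: for each
    threshold, the fraction of class-1 examples above it is the sensitivity
    (true positive rate) and the fraction of class-0 examples above it is
    the false positive rate."""
    order = np.argsort(-scores)                   # descending scores
    labels = np.asarray(labels)[order]
    tpr = np.concatenate(([0.0], np.cumsum(labels) / labels.sum()))
    fpr = np.concatenate(([0.0], np.cumsum(1 - labels) / (1 - labels).sum()))
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve (trapezoidal rule)."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

# perfectly separated scores give area 1; a random classifier gives ~0.5
scores = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.1])
labels = np.array([1, 1, 1, 0, 0, 0])
fpr, tpr = roc_curve(scores, labels)
```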
Chapter 4

Learning and Generalization
one has
(4.5)
The first term means that at the local minimum w ∗ of the error func-
tion, we have
y(x; w ∗) = ht|xi, (4.6)
meaning that the output approximates the conditional average of the
target data. The second term represents the intrinsic noise on the
data and sets a lower limit on the achievable error.
By taking the expectation over the ensemble of the data sets one gets
tn = h(xn ) + ǫn (4.11)
with true function h(x) and y(x) an estimate of h(x). Consider the
following two extremes:
1. Fix y(x) = g(x) independent of any data set. Then:
y
h(x)
g(x)
y
h(x)
y = a1 x + a2 x2 + a3 x3 + b (4.12)
Aθ = B, e = Aθ − B (4.14)
where A† denotes the pseudo inverse matrix. The results are shown
on Fig.4.2 and Fig.4.3 for the training and test set and for different
degrees of the polynomials. One can observe that for the higher order
polynomial (order 7) the solution starts oscillating. Let us now modify
the least squares cost function by an additional term which aims at
keeping the norm of the solution vector small. This technique is called
regularization or ridge regression in the context of linear systems.
One solves

min_θ J_ridge(θ) = J_LS(θ) + (1/2) λ ‖θ‖_2^2 ,   λ > 0.   (4.18)

The condition for optimality is given by

∂J_ridge/∂θ = A^T A θ + λ θ − A^T B = 0   (4.19)

with solution

θ_ridge = (A^T A + λ I)^{−1} A^T B.   (4.20)
This technique is useful when A^T A is ill conditioned. The results of
ridge regression for the order 7 polynomial model are shown in Fig. 4.4
for different values of the regularization constant λ. Fig. 4.5
conceptually shows the bias-variance trade-off in terms of the regularization
constant λ. A large value of λ decreases the variance but leads to a larger
bias. The value of λ is chosen as a trade-off by minimizing
the sum of the variance and the squared bias contributions.
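The ridge solution (4.20) applied to a polynomial model; the design matrix, noise level and λ values below are illustrative assumptions in the spirit of the order-7 example:

```python
import numpy as np

def ridge_fit(A, B, lam):
    """Ridge regression, Eq. (4.20): theta = (A^T A + lam*I)^{-1} A^T B."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ B)

# order-7 polynomial design matrix on noisy toy data
rng = np.random.default_rng(4)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)
A = np.vander(x, 8, increasing=True)       # columns 1, x, ..., x^7
theta_ls = ridge_fit(A, y, lam=0.0)        # ordinary least squares
theta_ridge = ridge_fit(A, y, lam=0.1)     # shrunken solution
```

Increasing λ shrinks the norm of the solution vector at the price of a larger training residual, which is the bias-variance trade-off discussed above.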
[Figure: training set MSE (log scale) versus the order of the polynomial.]
the model becomes more constrained by the data (which reduces the
variance). However, in daily practice, the data sets are given and one
has to do the best possible given the available amount of information.
In order to obtain models with good generalization ability we discuss
now methods of regularization, early stopping and cross-validation.
4.3.1 Regularization
For a given data set and model one often applies regularization
Ẽ(w) = E(w) + ν Ω(w)   (4.21)

to the original cost function E

E(w) = (1/N) Σ_{n=1}^{N} [t_n − y(x_n; w)]^2   (4.22)

with ν a positive real regularization constant. Usually a weight decay
term is taken (similar to ridge regression)

Ω(w) = (1/2) Σ_i w_i^2   (4.23)
[Figures 4.2-4.4: polynomial fits of order 1, 3 and 7 to the data on [0, 1], and ridge regression for the order 7 polynomial with λ = 10^{−6} and λ = 0.1.]
[Figure 4.5: bias-variance trade-off: (bias)^2, variance and their sum E as a function of log λ.]
where the vector w contains all interconnection weights and bias terms
of the neural network model. This additional term aims at keeping
the interconnection weight values small.
Let us analyse now the influence of this weight decay term. Con-
sider the case of a quadratic cost function which can be related to a
Taylor expansion at a local point in the weight space. We have
E(w) = E_0 + b^T w + (1/2) w^T H w   (4.24)

with E_0 a constant, and the regularized version

Ẽ(w̃) = E(w̃) + (1/2) ν w̃^T w̃.   (4.25)
The gradients of these cost functions are

∂E/∂w = b + H w = 0
∂Ẽ/∂w̃ = b + H w̃ + ν w̃ = 0.   (4.26)

H u_j = λ_j u_j   (4.27)
[Figure: training set and validation set error versus iteration step; training is stopped when the validation error starts to increase.]
Figure 4.6: Training, validation and test set, where the validation set
is used for early stopping of the training process. The designer is
responsible for making a good choice!
4.3.3 Cross-validation
Working with a specific validation set has two main disadvantages.
First the results might be quite sensitive with respect to the specific
data points belonging to that validation set. Secondly when working
with a training, validation and test set a part of the training data can
no longer be used for training as it belongs then to the validation set.
A good procedure is then to apply cross-validation (Wahba,
1975) (Fig. 4.7). One divides the training set into S segments
and in each run trains on S − 1 segments. The error on the
segments that were left out over the S runs then serves as
the validation set performance. A typical choice (which is both
computationally attractive and of good statistical quality) is S = 10 (called
10-fold cross-validation). In the extreme case one can take S = N
meaning that one has N runs with N − 1 data points (called leave-
one-out cross-validation). This is only recommended for small data
sets and certainly not for datamining applications with millions of
data points.
[Figure 4.7: S-fold cross-validation: in each of the S runs one segment is omitted from training and used for validation.]
4.4 Pruning
In order to improve the generalization performance of the trained mod-
els one can remove interconnection weights that are irrelevant. This
procedure is called pruning. We discuss here methods of optimal brain
damage, optimal brain surgeon and weight elimination.
In Optimal Brain Damage (Le Cun, 1990) one considers the
error cost function change due to small changes in the interconnection
weights:
δE ≃ Σ_i (∂E/∂w_i) δw_i + (1/2) Σ_i Σ_j H_{ij} δw_i δw_j + O(δw^3)   (4.35)

where H_{ij} = ∂^2 E / ∂w_i ∂w_j is the Hessian. One takes the following assumption
after convergence:

δE ≃ (1/2) Σ_i H_{ii} δw_i^2   (4.36)

and one measures the relative importance of the interconnection weights
by the quantities H_{ii} w_i^2 / 2, which are also called saliency values.
The algorithm looks as follows:
2. Train the network in the usual way until some stopping criterion
is satisfied.
e_i^T δw + w_i = 0   (4.38)

∂L/∂λ = e_i^T δw + w_i = 0  →  λ e_i^T H^{−1} e_i = λ [H^{−1}]_{ii} = −w_i.   (4.41)

This results into

δw = − ( w_i / [H^{−1}]_{ii} ) H^{−1} e_i   (4.42)

and

δE_i = (1/2) w_i^2 / [H^{−1}]_{ii} .   (4.43)
The pruning algorithm looks then as follows:
3. Evaluate δEi for each value of i and select the value of i which
gives the smallest increase in error.
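The Optimal Brain Surgeon saliencies (4.43) can be computed directly from the weights and the inverse Hessian; the diagonal Hessian below is used purely for illustration:

```python
import numpy as np

def obs_saliencies(w, H):
    """Optimal Brain Surgeon saliencies, Eq. (4.43):
    delta_E_i = 0.5 * w_i^2 / [H^{-1}]_{ii}; the weight with the smallest
    saliency is the cheapest one to prune."""
    Hinv = np.linalg.inv(H)
    return 0.5 * w ** 2 / np.diag(Hinv)

w = np.array([0.01, 1.5, -0.8])            # illustrative weight values
H = np.diag([4.0, 2.0, 1.0])               # illustrative (diagonal) Hessian
sal = obs_saliencies(w, H)
prune = np.argmin(sal)                     # index of the weight to remove
```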
Ẽ = E + ν Σ_i ( (w_i/c)^2 / (1 + (w_i/c)^2) ).   (4.44)
where ǫi (x) is the error function related to the i-th network. The
average sum-of-squares error for the individual model yi (x) is then
where
E_AV = (1/L) Σ_{i=1}^{L} E_i = (1/L) Σ_{i=1}^{L} E[ǫ_i^2].   (4.52)
∂L/∂λ = 1^T α − 1 = 0  →  λ = 1 / (1^T C^{−1} 1)   (4.58)

with 1 the vector of ones.
[Figure: committee network: the outputs y_1(x), ..., y_L(x) of the individual models are combined with weights α_1, ..., α_L into y_COM(x).]
α = C^{−1} 1 / (1^T C^{−1} 1)   (4.59)
1. At http://www.boosting.org/ much material is available: tutorials, papers and
software.
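The optimal committee weights (4.59) follow directly from the error covariance matrix C of the individual models; the C below is an illustrative example in which the second model has the smaller error variance:

```python
import numpy as np

def committee_weights(C):
    """Optimal committee weights, Eq. (4.59): alpha = C^{-1} 1 / (1^T C^{-1} 1),
    with C the error covariance matrix of the L individual models."""
    one = np.ones(C.shape[0])
    Cinv_one = np.linalg.solve(C, one)
    return Cinv_one / (one @ Cinv_one)

C = np.array([[1.0, 0.2], [0.2, 0.5]])
alpha = committee_weights(C)   # larger weight for the more accurate model
```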
meaning that

Posterior = (Likelihood × Prior) / Evidence.
The Bayes rule can also be used for model comparison purposes.
Consider two alternative models H_1, H_2 and data D:

P(H_1 | D) = P(D | H_1) P(H_1) / P(D)
P(H_2 | D) = P(D | H_2) P(H_2) / P(D).   (4.63)

One obtains

P(H_1 | D) / P(H_2 | D) = ( P(H_1) / P(H_2) ) · ( P(D | H_1) / P(D | H_2) )   (4.64)
which means in fact that Bayes' theorem automatically embodies Occam's razor, as illustrated in Fig.4.9. Indeed, suppose that equal prior probabilities P(H1) = P(H2) hold. Assume that the model H1 makes a limited range of predictions, expressed by the evidence P(D|H1). The more powerful model H2, with more free parameters, is then able to predict a larger variety of data sets. If the data fall in region C1 on Fig.4.9, the simpler model H1 then becomes more probable (larger P(H1|D)) according to Bayes' rule.
[Fig. 4.9: the evidence P(D|H1) is concentrated on a limited region C1 of possible data sets D, while P(D|H2) is spread thinly over many more data sets; for data in C1 the simpler model H1 obtains the larger evidence.]
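Equation (4.64) can be illustrated with hypothetical numbers: equal priors, but H1 concentrates its evidence on the region C1 where the observed data fall, while H2 spreads its evidence over many more possible data sets:

```python
# Hypothetical illustration of the posterior odds (4.64)
p_H1, p_H2 = 0.5, 0.5      # equal prior probabilities
p_D_H1 = 0.10              # H1 concentrates its evidence on region C1
p_D_H2 = 0.01              # H2 spreads P(D|H2) over many possible data sets

posterior_odds = (p_H1 / p_H2) * (p_D_H1 / p_D_H2)
print(posterior_odds)      # approximately 10: H1 is ten times more probable
```

Even though H2 could also explain the data, the simpler model wins because its evidence is not diluted over data sets that were never observed.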
with

E_D(w) = (1/2) Σ_n Σ_i [ti^(n) − yi(x^(n); w)]²
E_W(w) = (1/2) Σ_i wi².    (4.66)
with

P(D|w, α, β, H) = (1/Z_D(β)) exp(−β E_D)
P(w|α, β, H) = (1/Z_W(α)) exp(−α E_W)    (4.68)

where Z_M, Z_W, Z_D are normalization factors. Gaussian noise on the targets is assumed here.
For binary classification networks one has targets t(n) ∈ {0, 1}.
One can consider the neural network output y(x; w) as a probability
with

G(w) = Σ_n [ t^(n) log y(x^(n); w) + (1 − t^(n)) log(1 − y(x^(n); w)) ]    (4.70)
4 Netlab software available from http://www.ncrg.aston.ac.uk/netlab/
[Figure: a committee-of-experts architecture in which the expert outputs are combined with weights α1, α2, ..., αm.]
[Figure: an auto-encoder x → G(x) → z → F(z) → x̂.]

x̂ = F(z)    (5.2)

min E = (1/N) Σ_{i=1}^N ||xi − x̂i||₂²
      = (1/N) Σ_{i=1}^N ||xi − F(G(xi))||₂².    (5.3)
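A linear instance of (5.2)-(5.3) fits in a few lines: take G(x) = U^T x and F(z) = U z, with U the top principal directions of the data, which minimizes the average reconstruction error (5.3) over all linear maps (this is exactly PCA). The data set below is synthetic, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 200, 1
t = rng.uniform(-1, 1, N)
# points near a line in 2-D, i.e. intrinsic dimensionality 1 plus small noise
X = np.column_stack([t, 2.0 * t]) + rng.normal(0, 0.01, (N, 2))

Xc = X - X.mean(axis=0)
U = np.linalg.svd(Xc, full_matrices=False)[2][:d].T   # top-d principal directions
Z = Xc @ U                                            # z = G(x): encode
Xhat = Z @ U.T                                        # x_hat = F(z): reconstruct
E = np.mean(np.sum((Xc - Xhat) ** 2, axis=1))         # cost (5.3), close to zero here
```

With nonlinear maps F and G (e.g. MLPs) the same cost (5.3) yields a nonlinear auto-encoder, which can capture curved low-dimensional structure that linear PCA misses.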
[Figure: principal directions u1, u2 of a data cloud in the (x1, x2) plane.]
Figure 5.2: The data set in two dimensions has intrinsic dimensionality
1. The data can be explained in terms of the single parameter η, while
linear PCA is unable to detect the lower dimensionality.
[Figures: the data of Fig.5.2 along a curve parametrized by η in the (x1, x2) plane; and two clusters C1, C2 with local principal directions u1, u2.]
Note that in this method one has to choose the number of centers K (Fig.5.5). The method also depends on the initial choice of the cluster centers. The performance of the method is characterized by the performance indices Jj for each of the clusters j = 1, 2, ..., K, which can be combined into a single performance measure. The K-means algorithm can also be considered as a rough approximation to the E-step of the EM algorithm for a mixture of Gaussians. Density estimation methods such as mixture models can indeed also be considered as unsupervised learning. There also exist many other clustering methods, e.g. the ISODATA algorithm, hierarchical clustering, agglomerative clustering, divisive clustering, etc. [6, 9, 14].
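A minimal K-means sketch on synthetic data (two well-separated Gaussian clouds; here one initial center is deliberately taken from each cloud so that the sketch runs deterministically, but in general the initial choice matters, as noted above):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
K = 2
centers = X[[0, 50]].copy()   # one seed per cloud; random seeding is the usual choice

for _ in range(20):
    # assignment step: each point goes to its nearest center
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    labels = d2.argmin(axis=1)
    # update step: each center becomes the mean of its cluster
    centers = np.array([X[labels == j].mean(axis=0) for j in range(K)])

# per-cluster performance indices J_j (within-cluster sums of squares)
J = np.array([((X[labels == j] - centers[j]) ** 2).sum() for j in range(K)])
```

The indices J can be summed into the single performance measure mentioned in the text; comparing this sum over several random initializations is a common way to cope with the sensitivity to the initial centers.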
[Fig. 5.5: data points and cluster centers in a two-dimensional feature space.]
2. Update centers.

[Figures: Voronoi regions R1, R2, R3 with centers c1, c2, c3; and a two-dimensional SOM grid with map coordinates (z1, z2) and neurons ψ1, ψ2, ..., ψ16.]
2. Update centers, where

F(z, σ) = Σ_{i=1}^N Kσ(z, zi) xi / Σ_{i=1}^N Kσ(z, zi)

with

Kσ(z, zi) = exp(−||z − zi||²/2σ²).

3. Decrease σ:

σ(k) = σinitial (σfinal/σinitial)^(k/kmax)
at iteration step k. The initial value of σ is chosen such that the neighborhood covers all the units. The final value controls the smoothness of the mapping.
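The batch steps above can be sketched for a one-dimensional map, where the map coordinates zi are simply the unit indices (data, map size and schedule values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, (200, 2))            # data in the unit square
m, kmax = 10, 30                           # 1-D map of m units, kmax batch steps
z = np.arange(m, dtype=float)              # map coordinates z_i of the units
W = X[rng.choice(200, m, replace=False)]   # prototype vectors, random init
sig_init, sig_final = 3.0, 0.5

for k in range(kmax):
    sigma = sig_init * (sig_final / sig_init) ** (k / kmax)   # 3. decrease sigma
    # find the best-matching unit (nearest prototype) for every data point
    bmu = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1).argmin(axis=1)
    # 2. update every prototype as a neighborhood-weighted mean of the data,
    #    weighting data point i by K_sigma(z_j, z_bmu(i)) on the map
    Kz = np.exp(-(z[:, None] - z[bmu][None, :]) ** 2 / (2 * sigma**2))  # m x N
    W = (Kz @ X) / Kz.sum(axis=1, keepdims=True)
```

Because each updated prototype is a convex combination of data points, the prototypes always stay inside the convex hull of the data; the shrinking σ gradually turns the global ordering phase into local fine-tuning.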
The batch version is usually faster than the on-line version and is
often preferred. The initialization of the SOM can be done either
at random or based upon the two principal eigenvectors from PCA
analysis. Missing values within the data set are usually excluded from
the distance calculations.
For the neighborhood functions several choices are possible. In the
case of the choice Kσ (z, zi ) = exp(−kz − zi k2 /2σ 2 ) the neighborhood
size is controlled by σ.
One also often works with neighborhood sizes 1, 2 or 3 for the neurons,
which is illustrated in Fig.5.9.
SOMs allow for nice visualizations. It is important, however, to carefully interpret the results after the training of the SOM. One gets insight by looking at the color or black/white maps. Depending on the color definition by the user, dark areas might mean that the data are very dense and clustered in that region (many data close to each other). This information is obtained by calculating distances between the prototype vectors. In case the class labels of the data are given (supervised information) one can also show these on the SOM map. In the so-called WEBSOM, the SOM method has been applied to problems of webmining where millions of documents have been processed (Fig.5.10) [43].
Figure 5.10: Example of a SOM map after training where insight about
the clustering of the data is obtained from the color or black/white
regions. This figure illustrates a result from webmining by means of
the WEBSOM http://websom.hut.fi/websom/.
(G + λI)c = y (5.9)
For a stabilizer

||Pf||² = Σ_{m=0}^M am ||O^m f||²,   ||O^m f||² = Σ_{i1...im} ∫_{R^n} ( ∂^m f(x)/∂x_{i1}...∂x_{im} )² dx    (5.12)

one obtains as optimal solution the Radial Basis Function neural network (RBF) with Gaussian activation function.
An example for n = 2 is

||O²f||² = ∫_{R²} [ (∂²f/∂x1²)² + 2(∂²f/∂x1∂x2)² + (∂²f/∂x2²)² ] dx1 dx2.    (5.13)
min_f H[f] = Σ_{i=1}^N (yi − f(xi))² + λ ||Pf||²    (5.18)
[Figure: surface plots of a Gaussian and a thin-plate-spline radial basis function.]
with y = W x. The conditions for optimality are given by ∂H[f*]/∂cα = 0, ∂H[f*]/∂tα = 0, ∂H[f*]/∂W = 0 for α = 1, ..., nh. In the case λ = 0 the gradient is given by

∂H[f*]/∂cα = −2 Σ_{i=1}^N Δi G(||xi − tα||²_W)

∂H[f*]/∂tα = 4 cα Σ_{i=1}^N Δi G′(||xi − tα||²_W) W^T W (xi − tα)    (5.22)

∂H[f*]/∂W = −4 W Σ_{α=1}^{nh} Σ_{i=1}^N cα Δi G′(||xi − tα||²_W) Qi,α

c = (G^T G + λg)⁻¹ G^T y,   y = W x.
[Figure: fuzzy membership grades for 'cold', 'warm' and 'hot' as functions of temperature (10, 20, 30, 40).]
6.1 Motivation
Despite the fact that classical neural networks (MLPs, RBF networks) have nice properties such as universal approximation, and that reliable algorithms presently exist for this class of techniques, they still have a number of persistent drawbacks. A first problem is the existence of many local minima. Although many of these local solutions can actually be good solutions, this is often inconvenient, e.g. from a statistical perspective. Another problem is how to choose the number of hidden units (Fig.6.1).
The theory of Support Vector Machines (SVMs) sheds a new light on these problems. Support vector machines have been introduced by Vapnik. In fact the original idea of linear SVMs already dates back to the sixties, but it became more important and popular in recent years when extensions to general nonlinear SVMs were made [20, 21]. In SVMs one works with kernel based representations of the network, allowing linear, polynomial, spline, RBF and other kernels. Several operations on kernels are allowed and for specific applications such as textmining string kernels can be used. The so-

1 At http://www.kernel-machines.org/ much material about tutorials, papers and software is available.
6.2.1 Margin
[Figures: two classes ('+' and 'x') in the (x1, x2) plane; among the many possible separating hyperplanes, the SVM maximizes the distance to the nearest points of both classes; the separating hyperplane w^T x + b = 0 lies between the margin hyperplanes w^T x + b = +1 and w^T x + b = −1.]
which is equivalent to

yk [w^T xk + b] ≥ 1,  k = 1, ..., N.    (6.2)
One obtains

∂L/∂w = 0  →  w = Σ_{k=1}^N αk yk xk
∂L/∂b = 0  →  Σ_{k=1}^N αk yk = 0    (6.6)

and the resulting classifier

y(x) = sign[ Σ_{k=1}^N αk yk xk^T x + b ].    (6.7)
Support Vector Machines 135
By replacing the expression for w in the Lagrangian one obtains the fol-
lowing Quadratic Programming (QP) problem (Dual Problem) which
solves the problem in the Lagrange multipliers
max_α Q(α) = −(1/2) Σ_{k,l=1}^N yk yl xk^T xl αk αl + Σ_{k=1}^N αk    (6.8)

such that

Σ_{k=1}^N αk yk = 0,  αk ≥ 0.    (6.9)
Note that this problem is solved in α, not in w. One can prove that the solution of the QP problem is global and unique. The data related to nonzero αk are called support vectors; in other words, these data points contribute to the sum in the classifier model. A drawback, however, is that the size of the QP problem matrix grows with the number of data points N: when one has 1,000,000 data points, the matrix involved in the QP problem is of size 1,000,000 × 1,000,000, which is too huge for computer memory storage.
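To make the dual problem (6.8)-(6.9) concrete, the sketch below solves it for four points with projected gradient ascent (a crude substitute for a proper QP solver; the data, learning rate and iteration count are arbitrary illustration choices):

```python
import numpy as np

# Tiny linearly separable two-class problem
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_kl = y_k y_l x_k^T x_l from (6.8)
alpha = np.zeros(N)
for _ in range(5000):
    alpha += 0.01 * (1.0 - Q @ alpha)       # gradient ascent on Q(alpha)
    alpha -= y * (alpha @ y) / N            # enforce sum_k alpha_k y_k = 0
    alpha = np.maximum(alpha, 0.0)          # enforce alpha_k >= 0

w = (alpha * y) @ X                          # w = sum_k alpha_k y_k x_k  (6.6)
sv = alpha > 1e-5                            # support vectors: nonzero alpha_k
b = np.mean(y[sv] - X[sv] @ w)               # from the active margin constraints
pred = np.sign(X @ w + b)                    # classifier (6.7) on the training data
```

Only the points nearest the decision boundary end up with nonzero αk; they are the support vectors appearing in the sum of (6.7).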
subject to

yk [w^T xk + b] ≥ 1 − ξk,  k = 1, ..., N
ξk ≥ 0,  k = 1, ..., N.    (6.12)
[Figure: two overlapping classes in the (x1, x2) plane; the soft-margin SVM still maximizes the distance to the nearest points while tolerating points on the wrong side via slack variables.]
with Lagrangian

L(w, b, ξ; α, ν) = J(w, ξ) − Σ_{k=1}^N αk { yk [w^T xk + b] − 1 + ξk } − Σ_{k=1}^N νk ξk    (6.13)

and Lagrange multipliers αk ≥ 0, νk ≥ 0 for k = 1, ..., N. The solution is given by the saddle point of the Lagrangian:

max_{α,ν} min_{w,b,ξ} L(w, b, ξ; α, ν).    (6.14)
One obtains

∂L/∂w = 0  →  w = Σ_{k=1}^N αk yk xk
∂L/∂b = 0  →  Σ_{k=1}^N αk yk = 0    (6.15)
∂L/∂ξk = 0  →  0 ≤ αk ≤ c,  k = 1, ..., N
such that

Σ_{k=1}^N αk yk = 0
0 ≤ αk ≤ c,  k = 1, ..., N.    (6.17)
one has

∫ K(x, y) g(x) g(y) dx dy ≥ 0.    (6.20)
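For a kernel satisfying the Mercer condition, every Gram matrix built from it is positive semi-definite, which can be checked empirically (the RBF kernel and the random point set below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 3))
sigma = 1.5

# RBF kernel Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2 * sigma**2))
eigs = np.linalg.eigvalsh(K)   # all eigenvalues are nonnegative (up to rounding)
```

The discrete analogue of (6.20) is g^T K g ≥ 0 for every vector g, which is exactly the statement that K has no negative eigenvalues.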
[Figure: the nonlinear map ϕ(·) takes the data from the input space, where the classes '+' and 'x' are not linearly separable, to a high dimensional feature space where a linear separation is possible; all computations are done via the kernel K(x, y) = ϕ(x)^T ϕ(y).]
subject to

yk [w^T ϕ(xk) + b] ≥ 1 − ξk,  k = 1, ..., N
ξk ≥ 0,  k = 1, ..., N.    (6.24)
One constructs the Lagrangian

L(w, b, ξ; α, ν) = J(w, ξ) − Σ_{k=1}^N αk { yk [w^T ϕ(xk) + b] − 1 + ξk } − Σ_{k=1}^N νk ξk    (6.25)

with Lagrange multipliers αk ≥ 0, νk ≥ 0 (k = 1, ..., N). The solution is given by the saddle point of the Lagrangian:

max_{α,ν} min_{w,b,ξ} L(w, b, ξ; α, ν).    (6.26)
One obtains

∂L/∂w = 0  →  w = Σ_{k=1}^N αk yk ϕ(xk)
∂L/∂b = 0  →  Σ_{k=1}^N αk yk = 0    (6.27)
∂L/∂ξk = 0  →  0 ≤ αk ≤ c,  k = 1, ..., N.
such that

Σ_{k=1}^N αk yk = 0
0 ≤ αk ≤ c,  k = 1, ..., N.    (6.29)
Note that w and ϕ(xk) are never calculated; all calculations are done in the dual space. We make use of the Mercer condition by choosing a kernel

K(xk, xl) = ϕ(xk)^T ϕ(xl).    (6.30)
Finally, the nonlinear SVM classifier takes the form

y(x) = sign[ Σ_{k=1}^N αk yk K(x, xk) + b ].    (6.31)

For an RBF kernel this becomes

y(x) = sign[ Σ_{k=1}^N αk yk exp{−||x − xk||₂²/σ²} + b ]
     = sign[ Σ_{k∈S_SV} αk yk exp{−||x − xk||₂²/σ²} + b ]    (6.32)
Figure 6.6: In the abstract figure the encircled points are support vec-
tors. These points have a non-zero support value αk . The decision
boundary can be expressed in terms of these support vectors (which
explains the terminology). In standard QP type support vector ma-
chines all support vectors are located close to the decision boundary.
where S_SV denotes the set of support vectors. This means that each hidden unit corresponds to a support vector (non-zero support value αk) and the number of hidden units equals the number of non-zero support values. The support vectors also have a nice geometrical meaning (Fig.6.6): they are located close to the decision boundary, and the decision boundary can be expressed in terms of these support vectors (which explains the terminology).
f(x) = w^T x + b    (6.33)
For standard SVM function estimation one employs the so-called Vapnik ǫ-insensitive loss function3

|y − f(x)|ǫ = 0 if |y − f(x)| ≤ ǫ, and |y − f(x)| − ǫ otherwise    (6.35)

shown in Fig.6.7.
By taking such a cost function one can formulate the following optimization problem:

min (1/2) w^T w    (6.36)

subject to |yk − w^T xk − b| ≤ ǫ, or equivalently

yk − w^T xk − b ≤ ǫ
w^T xk + b − yk ≤ ǫ.    (6.37)
3 SVM theory can be extended to any convex cost function. Historically, the SVM results were first derived for Vapnik's ǫ-insensitive loss function. In general, the choice of a 1-norm in the cost function is more robust than a 2-norm, e.g. with respect to outliers and non-Gaussian noise on the data.
Figure 6.8: Tube of ǫ-accuracy and points which cannot meet this
accuracy, motivating the use of slack variables.
subject to

yk − w^T xk − b ≤ ǫ + ξk
w^T xk + b − yk ≤ ǫ + ξk*    (6.39)
ξk, ξk* ≥ 0.
subject to

Σ_{k=1}^N (αk − αk*) = 0    (6.44)
αk, αk* ∈ [0, c].
The resulting SVM for linear function estimation is

f(x) = w^T x + b    (6.45)

with w = Σ_{k=1}^N (αk − αk*) xk, such that

f(x) = Σ_{k=1}^N (αk − αk*) xk^T x + b.    (6.46)

The support vector expansion is sparse in the sense that many support values will be zero.
subject to

yk − w^T ϕ(xk) − b ≤ ǫ + ξk
w^T ϕ(xk) + b − yk ≤ ǫ + ξk*    (6.49)
ξk, ξk* ≥ 0

subject to

Σ_{k=1}^N (αk − αk*) = 0    (6.51)
αk, αk* ∈ [0, c].
One applies the Mercer condition K(xk, xl) = ϕ(xk)^T ϕ(xl), which gives

f(x) = Σ_{k=1}^N (αk − αk*) K(x, xk) + b.    (6.52)
In matrix form (rows separated by semicolons):

[ I 0 0 −Z^T ; 0 0 0 −Y^T ; 0 0 γI −I ; Z Y I 0 ] [w; b; e; α] = [0; 0; 0; ~1]    (6.56)

After elimination of w and e this reduces to

[ 0 Y^T ; Y Ω + γ⁻¹I ] [b; α] = [0; ~1]    (6.57)

where

Ω = Z Z^T    (6.58)

and Mercer's condition is applied. Writing the system as

[ 0 Y^T ; Y H ] [ξ1; ξ2] = [d1; d2]    (6.60)

with H = Ω + γ⁻¹I, ξ1 = b, ξ2 = α, d1 = 0, d2 = ~1, it can be solved as

[ s 0 ; 0 H ] [ξ1; ξ2 + H⁻¹Y ξ1] = [−d1 + Y^T H⁻¹d2; d2]    (6.61)

with s = Y^T H⁻¹ Y.
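Since (6.57) is just a linear system, an LS-SVM classifier can be trained with a single solve. A sketch on synthetic two-class data (the RBF kernel and the values of γ and σ are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.4, (20, 2)), rng.normal(1, 0.4, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
N, gamma, sigma = len(y), 10.0, 1.0

# RBF kernel Gram matrix K(x_k, x_l), as in (6.32)
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / sigma**2)
Omega = (y[:, None] * y[None, :]) * K           # Omega_kl = y_k y_l K(x_k, x_l)

# Assemble and solve the linear system [0, y^T; y, Omega + I/gamma][b; alpha] = [0; 1]
A = np.zeros((N + 1, N + 1))
A[0, 1:] = y
A[1:, 0] = y
A[1:, 1:] = Omega + np.eye(N) / gamma
sol = np.linalg.solve(A, np.hstack([0.0, np.ones(N)]))
b, alpha = sol[0], sol[1:]

# LS-SVM classifier: y(x) = sign(sum_k alpha_k y_k K(x, x_k) + b)
pred = np.sign(K @ (alpha * y) + b)
```

In contrast to the QP of the standard SVM, here every training point gets a (generally nonzero) support value αk, which is what motivates the pruning procedure discussed below.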
One obtains the linear system

[ 0 ~1^T ; ~1 Ω + γ⁻¹I ] [b; α] = [0; y]    (6.67)

with y = [y1; ...; yN], ~1 = [1; ...; 1], α = [α1; ...; αN], and by applying Mercer's condition

Ωkl = ϕ(xk)^T ϕ(xl) = K(xk, xl),  k, l = 1, ..., N.    (6.68)
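The function-estimation system (6.67) is equally easy to solve; a sketch on a noisy sinc toy problem (the values of γ, σ and the data set are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
N, gamma, sigma = 100, 100.0, 1.0
x = np.linspace(-5, 5, N)
y = np.sinc(x) + rng.normal(0, 0.05, N)         # noisy sinc targets

# Omega_kl = K(x_k, x_l) with an RBF kernel
Omega = np.exp(-(x[:, None] - x[None, :]) ** 2 / sigma**2)

# Assemble and solve the linear system (6.67): [0, 1^T; 1, Omega + I/gamma][b; alpha] = [0; y]
A = np.zeros((N + 1, N + 1))
A[0, 1:] = 1.0
A[1:, 0] = 1.0
A[1:, 1:] = Omega + np.eye(N) / gamma
sol = np.linalg.solve(A, np.hstack([0.0, y]))
b, alpha = sol[0], sol[1:]

yhat = Omega @ alpha + b                         # fitted values at the training points
rmse = np.sqrt(np.mean((yhat - y) ** 2))
```

The ridge term I/γ keeps the system well conditioned; larger γ gives a closer fit to the noisy targets, smaller γ a smoother estimate.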
LS-SVM pruning

1. Compute the LS-SVM for the N training data.

[Figure: the sorted |αk| spectrum over k = 1, ..., N; data points with small support values can be omitted.]
[Figures: LS-SVM estimates on toy data sets before and after pruning, together with the sorted |αk| spectra used to select the points to omit.]
[Figure: hyperparameter selection: for each candidate pair (γi, σi) an LS-SVM is trained on the training set and its validation error is computed; the pair with the smallest validation error is selected and finally assessed on the test set.]
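Hyperparameter pairs (γ, σ) can thus be selected by a small grid search: train one LS-SVM per candidate pair as in (6.67) and keep the pair with the smallest validation error (the grids and the data set below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(-5, 5, 150)
y = np.sinc(x) + rng.normal(0, 0.1, 150)
xtr, ytr, xval, yval = x[:100], y[:100], x[100:], y[100:]   # train/validation split

def lssvm_fit_predict(xtr, ytr, xq, gamma, sigma):
    """Solve the LS-SVM system (6.67) on (xtr, ytr) and predict at the points xq."""
    N = len(xtr)
    Omega = np.exp(-(xtr[:, None] - xtr[None, :]) ** 2 / sigma**2)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = A[1:, 0] = 1.0
    A[1:, 1:] = Omega + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.hstack([0.0, ytr]))
    b, alpha = sol[0], sol[1:]
    Kq = np.exp(-(xq[:, None] - xtr[None, :]) ** 2 / sigma**2)
    return Kq @ alpha + b

# Validation error for every candidate pair (gamma, sigma)
errs = {(gamma, sigma): np.mean((lssvm_fit_predict(xtr, ytr, xval, gamma, sigma)
                                 - yval) ** 2)
        for gamma in [0.1, 10.0, 1000.0] for sigma in [0.3, 1.0, 3.0]}
best = min(errs, key=errs.get)
```

In practice one would use cross-validation rather than a single split, and reserve a separate test set for the final assessment of the selected model.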
Conclusions
Bibliography
[3] Cherkassky V., Mulier F., Learning from data: concepts, theory
and methods, John Wiley and Sons, 1998.
[5] Devroye L., Györfi L., Lugosi G., A Probabilistic Theory of Pat-
tern Recognition, NY: Springer, 1996.
[6] Duda R.O., Hart P.E., Stork D. G., Pattern Classification (2ed.),
Wiley, 2001.
[9] Hastie T., Tibshirani R., Friedman J., The elements of statistical
learning, Springer-Verlag, 2001.
[15] Ritter H., Martinetz T., Schulten K., Neural Computation and
Self-Organizing Maps: An Introduction, Addison-Wesley, Read-
ing, MA, 1992.
[16] Schölkopf B., Burges C., Smola A., Advances in Kernel Methods:
Support Vector Learning, MIT Press, Cambridge, MA, December
1998.
[19] Suykens J.A.K., Van Gestel T., De Brabanter J., De Moor B.,
Vandewalle J., Least Squares Support Vector Machines, World
Scientific, Singapore, 2002.
[24] Bassett D.E., Eisen M.B., Boguski M.S., “Gene expression in-
formatics - it’s all in your mine,” Nature Genetics, supplement,
Vol.21, pp.51-55, Jan 1999.
[25] Bengio Y., “Learning deep architectures for AI,” Foundations and
trends in Machine Learning, 2(1): 1-127, 2009.
[26] Brown P., Botstein D., “Exploring the new world of the genome
with DNA microarrays,” Nature Genetics, supplement, Vol.21,
pp.33-37, Jan 1999.
[28] Chen M., Mao S., Liu Y., “Big Data: A Survey,” Mobile Networks
and Applications, 19(2), 171-209, April 2014.
[29] Chen S., Billings S., Grant P., “Nonlinear system identification
using neural networks,” International Journal of Control, Vol.51,
No.6, pp.1191-1214, 1990.
[30] Chen S., Billings S., “Neural networks for nonlinear dynamic sys-
tem modelling and identification,” International Journal of Con-
trol, Vol.56, No.2, pp.319-346, 1992.
[32] Espinoza M., Suykens J.A.K., Belmans R., De Moor B., “Electric
Load Forecasting,” IEEE Control Systems Magazine, Vol. 27, No.
5, pp. 43-57, Oct. 2007.
[34] Fayyad U., Haussler D., Stolorz P. “Mining Scientific data,” Com-
munications of the ACM, Vol.39, No.11, pp.51-57, 1996.
[36] Glymour C., Madigan D., Pregibon D., Smyth P., “Statistical
inference and data mining,” Communications of the ACM, Vol.39,
No.11, pp.35-41, 1996.
[37] Guyon I., Matic N., Vapnik V., “Discovering informative pat-
terns and data cleaning,” in U.M. Fayyad, G. Piatetsky-Shapiro,
P. Smyth, and R. Uthurusamy, Eds., Advances in Knowledge Dis-
covery and Data Mining, pp. 181-203, MIT Press, 1996.
[38] Heckerman D., “A tutorial on learning with Bayesian networks,”
Technical Report MSR-TR-95-06, Microsoft Research, March,
1995.
[39] Hornik K., Stinchcombe M., White H., “Multilayer feedforward
networks are universal approximators,” Neural Networks, Vol.2,
pp.359-366, 1989.
[40] Jain A., Mao J., Mohiuddin K., “Artificial neural networks: a
tutorial,” IEEE Computer, Vol.29, No.3, pp.31-44, 1996.
[41] Jones N., “Computer science: The learning machines,” Nature,
news feature, 8 Jan 2014.
[42] Kohonen T., “The self-organizing map,” Proc. IEEE, Vol.78,
No.9, pp.1464-1480, 1990.
[43] Kohonen T., Kaski S., Lagus K., Salojärvi J., Paatero V., Saarela
A., “Organization of a massive document collection,” IEEE
Transactions on Neural Networks (special issue on neural net-
works for data mining and knowledge discovery), Vol.11, No.3,
pp. 574-586, 2000.
[44] Lerouge E., Moreau Y., Verrelst H., Vandewalle J., Stoermann
C., Gosset P., Burge P., “Detection and management of fraud in
UMTS networks”, in Proc. of the Third International Conference
on The Practical Application of Knowledge Discovery and Data
Mining (PADD99), London, UK, Apr. 1999, pp. 127-148.
[45] MacKay D.J.C., “Bayesian interpolation,” Neural Computation,
4(3): 415-447, 1992.
[46] MacKay D.J.C., “A practical Bayesian framework for backpropa-
gation networks,” Neural Computation, 4(3): 448-472, 1992.
[47] Mjolsness E., DeCoste D., “Machine Learning for Science: State
of the Art and Future Prospects,” Science, Vol.293, pp. 2051-
2055, 2001.
[48] Møller M.F., “A scaled conjugate gradient algorithm for fast su-
pervised learning,” Neural Networks, Vol.6, pp.525-533, 1993.
[49] Morgan N., Bourlard H., “Continuous speech recognition: an in-
troduction to the hybrid HMM/connectionist approach,” IEEE
Signal Processing Magazine, pp.25-42, May 1995.
[50] Narendra K.S., Parthasarathy K., “Gradient methods for the
optimization of dynamical systems containing neural networks,”
IEEE Transactions on Neural Networks, Vol.2, No.2, pp.252-262,
1991.
[51] Poggio T., Girosi F., “Networks for approximation and learning,”
Proceedings of the IEEE, Vol.78, No.9, pp.1481-1497, 1990.
[52] Reed R., “Pruning algorithms - a survey,” IEEE Transactions on
Neural Networks, Vol.4, No.5, pp.740-747, 1993.
[53] Rumelhart D.E., Hinton G.E., Williams R.J., “Learning represen-
tations by back-propagating errors,” Nature, Vol.323, pp.533-536,
1986.
[54] Schölkopf B., Sung K.-K., Burges C., Girosi F., Niyogi P., Poggio
T., Vapnik V., “Comparing support vector machines with Gaus-
sian kernels to radial basis function classifiers,” IEEE Transac-
tions on Signal Processing, Vol.45, No.11, pp.2758-2765, 1997.
[55] Sjöberg J., Zhang Q., Ljung L., Benveniste A., Delyon B., Glo-
rennec P., Hjalmarsson H., Juditsky A., “Nonlinear black-box
modeling in system identification: a unified overview,” Automat-
ica, Vol.31, No.12, pp.1691-1724, 1995.
[56] Smola A., Schölkopf B., “A Tutorial on Support Vector Regres-
sion,” NeuroCOLT Technical Report NC-TR-98-030, Royal Hol-
loway College, University of London, UK, 1998.
[57] Smola A., Schölkopf B., Müller K.-R., “The connection between
regularization operators and support vector kernels,” Neural Net-
works, 11, 637-649, 1998.
[58] Suykens J.A.K., Vandewalle J., De Moor B., “NLq theory: check-
ing and imposing stability of recurrent neural networks for nonlin-
ear modelling,” IEEE Transactions on Signal Processing (special
issue on neural networks for signal processing), Vol.45, No.11, pp.
2682-2691, Nov. 1997.
[60] Suykens J.A.K., “Least squares support vector machines for clas-
sification and nonlinear modelling,” Neural Network World (Spe-
cial Issue on PASE 2000), Vol.10, No.1-2, pp.29-48, 2000.
[61] Van Calster B., Timmerman D., Lu C., Suykens J.A.K., Valentin
L., Van Holsbeke C., Amant F., Vergote I., Van Huffel S., “Preop-
erative diagnosis of ovarian tumors using Bayesian kernel-based
methods”, Ultrasound in Obstetrics and Gynecology, vol. 29, no.
5, May 2007, pp. 496-504.
[62] van der Smagt P.P., “Minimisation methods for training feedfor-
ward neural networks,” Neural Networks, Vol.7, No.1, pp.1-11,
1994.
[63] Van Gestel T., Suykens J.A.K., Baesens B., Viaene S., Vanthienen
J., Dedene G., De Moor B., Vandewalle J., “Benchmarking Least
Squares Support Vector Machine Classifiers,” Machine Learning,
vol. 54, no. 1, Jan. 2004, pp. 5-32.
[65] Zadeh L.A., “Fuzzy logic, neural networks and soft computing,”
Communications of the ACM, Vol.37, No.3, pp.77-84, 1994.