Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
Alessandro E.P. Villa, Włodzisław Duch, Péter Érdi, Francesco Masulli, Günther Palm (Eds.)
Volume Editors
Włodzisław Duch
Nicolaus Copernicus University, Department of Informatics
87-100, Toruń, Poland
E-mail: wduch@is.umk.pl
Péter Érdi
Kalamazoo College, Center for Complex Systems Studies
Kalamazoo, MI 49006, USA
E-mail: peter.erdi@kzoo.edu
Francesco Masulli
Università di Genova, Dipartimento di Informatica e Scienze dell’Informazione
16146 Genoa, Italy
E-mail: masulli@disi.unige.it
Günther Palm
Universität Ulm, Institut für Neuroinformatik
89069 Ulm, Germany
E-mail: guenther.palm@uni-ulm.de
in the spirit of ICANN and the goals promoted by ENNS. All posters remained
on display during the three days of the conference with a mandatory presenter
standing near odd numbers on Thursday 13th and near even numbers on Friday
14th. This year the organizers decided to slash the registration fee and focus on the core ICANN activities, at the cost of excluding the lunches. This scheme proved successful and attracted many foreign participants, coming from 35 different countries and all continents, in particular at the graduate and postgraduate levels.
This was the first ICANN after the death of Prof. John Gerald Taylor (JGT),
the first president and co-founder of the European Neural Network Society
(ENNS). John was born in Hayes, Kent, on August 18, 1931. He obtained a PhD
in Theoretical Physics from Christ’s College, Cambridge (1950–1956), where he
was strongly influenced by the teaching of Paul Dirac. John G. Taylor started research in neural networks in 1969 and contributed to many, if not all, of its subfields. In 1971 he was appointed to the established Chair in Applied Mathematics at King's College London, where he founded and directed the Centre for Neural Networks. His research interests were wide, ranging over high-energy physics, superstrings, quantum field theory and quantum gravity, neural computation, the neural bases of behavior, and mathematical modelling in neurobiology.
After observing the metal “bending” skills of Uri Geller in 1974, Prof. J.G. Tay-
lor became interested in parapsychology and sought to establish whether there
is an electromagnetic basis for the phenomenon. After careful investigation, characterized by initial enthusiasm and later skepticism, he came to the conclusion,
expressed in his book Science and the Supernatural (1980), that the paranor-
mal cannot be reconciled with modern physics. After Francis Crick’s hypothesis
(1984) on the internal attentional searchlight role played by the thalamic reticu-
lar nucleus, Prof. Taylor became involved in developing a higher cognitive level
model of consciousness, using the most recent results on attention to describe it
as an engineering control system. This led him to the CODAM (attention copy)
model of consciousness. In 2007, Prof. Taylor developed the first program of its
kind in the hedge funds industry using artificial intelligence techniques to create
portfolios of hedge funds. He also trained as an actor and performed in plays
and films, wrote several science fiction plays, as well as directing stage produc-
tions in Oxford and Cambridge. Throughout his career Prof. Taylor encouraged
young scientists to follow their curiosity in their search for a better understand-
ing of nature and he served on numerous PhD dissertation juries around the
world. This brief biographical sketch of John G. Taylor is not intended to be exhaustive; it is an attempt to present an exceptional person, humble and ordinary yet out of the ordinary, who was part of our community from the very beginning. At the ICANN conferences Prof. Taylor spent much time in the poster sessions interacting with the participants, and his presence at the oral sessions was often marked by his questions and comments. Attendees of past ICANN conferences remember that at the banquet dinner Prof. Taylor usually gave a short speech that was a condensed summary of his elegance and humor.
I had the privilege of his friendship during the past twenty years and I am sure
that many of us will remember stories about Prof. John Gerald Taylor. Dear
John, thank you for your legacy, it is now up to us to pursue your effort, make
it grow and flourish.
Committees
General Chair Alessandro E.P. Villa
Special Sessions Chair Marco Tomassini
Tutorials Chair Lorenz Goette
Competitions Chair Giacomo Indiveri
Program Chairs
Wlodek Duch
Péter Érdi
Francesco Masulli
Guenther Palm
Alessandro E.P. Villa
Additional Reviewers
Fabio Babiloni
Simone Bassis
Fülöp Bazso
Francesco Camastra
Alessandro Di Nuovo
Simona Doboli
Alessio Ferone
Maurizio Filippone
Stefan Heinrich
Hassan Mahmoud
Alfredo Petrosino
Ramin Pichevar
Marina Resta
Alessandro Rozza
Justus Schwabedal
Vladyslav Shaposhnyk
Giorgio Valentini
Eleni Vasilaki
Jan K. Woike
Sean Wood
Registration Committee
Paulo Monteiro
Table of Contents – Part II
Bioinformatics (A2)
Rademacher Complexity and Structural Risk Minimization: An Application to Human Gene Expression Datasets ..... 491
Luca Oneto, Davide Anguita, Alessandro Ghio, and Sandro Ridella

Diffusion Maps and Local Models for Wind Power Prediction ..... 565
Ángela Fernández Pascual, Carlos M. Alaíz, Ana Mª González Marcos, Julia Díaz García, and José R. Dorronsoro

The Capacity and the Versatility of the Pulse Coupled Neural Network in the Image Matching ..... 223
Yuta Ishida, Masato Yonekawa, and Hiroaki Kurokawa

The Spherical Hidden Markov Self Organizing Map for Learning Time Series Data ..... 563
Gen Niina and Hiroshi Dozono
1 Introduction
A complex-valued MLP (multilayer perceptron) has attractive potential that a real-valued MLP does not. For example, a complex-valued MLP can be used naturally in fields where complex values are indispensable, and it can naturally fit a periodic or unbounded function.
Our preliminary experiments showed that the parameter search space of a complex-valued MLP is full of crevasse-like forms having huge condition numbers, much the same as in a real-valued MLP [8]. In such an extraordinary space, it is hard for the usual gradient-based searches such as BP to find excellent solutions because the search easily gets stuck. Recently, a higher-order search method has been proposed to obtain better performance for a complex-valued MLP [1].
This paper proposes a totally new search method for a complex-valued MLP,
which utilizes eigen vector descent and reducibility mapping [3,9], aiming to sta-
bly find excellent solutions in such an extraordinary search space full of crevasse-
like forms. Our experiments showed that the proposed method worked well for
two data sets generated by an unbounded function and Bessel functions.
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 1–8, 2012.
c Springer-Verlag Berlin Heidelberg 2012
S. Suzumura and R. Nakano
\[
f_i^{\mu} = \sum_{j=0}^{J} w_{ij}^{(2)} z_j^{\mu}, \qquad z_j^{\mu} = g(h_j^{\mu}), \qquad h_j^{\mu} = \sum_{k=0}^{K} w_{jk}^{(1)} x_k^{\mu} \tag{1}
\]
The following activation function g(h) is employed: g(h) = 1/(1 + i + e^{-h}). This function has unbounded and periodic features. Since an activation function plays an important role in a complex-valued MLP, many activation functions have been proposed so far [4,5]. Our function is quite similar to that proposed by [7].
Then the complex Hessian H^c and the real Hessian H^r are defined respectively as below. Note that H^c is Hermitian and H^r is symmetric. The former is more convenient to calculate than the latter.
\[
H^c = \frac{\partial}{\partial c}\left( \frac{\partial E}{\partial c} \right)^{H}, \qquad H^r = \frac{\partial}{\partial r}\left( \frac{\partial E}{\partial r} \right)^{T} \tag{3}
\]
What kind of landscapes does the search space of a complex-valued MLP have?
Here search space means the error surface the weights of a complex-valued MLP
form. To the best of our knowledge, little is known about these landscapes. Since the search space is usually high-dimensional, the eigenvalues of the Hessian will give us the exact nature of the landscapes.
Complex-Valued MLP Search Utilizing EVD and RM
Steepest Descent
Now the sum-of-squares error E is formally defined below, where y_i^μ is the teacher signal for data point μ. The error is a real-valued scalar.
\[
E = \sum_{\mu}^{N} \sum_{i}^{I} \delta_i^{\mu}\, \overline{\delta_i^{\mu}}, \qquad \delta_i^{\mu} = f_i^{\mu} - y_i^{\mu} \tag{4}
\]
\[
\frac{\partial E}{\partial w_{jk}^{(1)}} = \sum_{\mu}^{N} \sum_{i}^{I} \delta_i^{\mu}\, w_{ij}^{(2)}\, g'(h_j^{\mu})\, x_k^{\mu}, \qquad \frac{\partial E}{\partial w_{ij}^{(2)}} = \sum_{\mu}^{N} \delta_i^{\mu}\, \overline{z_j^{\mu}} \tag{5}
\]
Steepest descent uses the gradient as the search direction, multiplied by a learning rate. Since a constant learning rate does not work well in crevasse-like forms, a line search [6] is employed to obtain an adaptive learning rate.
Let λ_m and v_m be the m-th eigenvalue and eigenvector of H^r, respectively. The main idea of eigen vector descent is to consider each eigenvector as a candidate of the search direction. Let η_m be the suitable step length in the direction v_m. Putting together the result of each direction, we get the real update direction Δr = Σ_m^{2M} η_m v_m. The complex update Δc is obtained from Δr as Δc = J Σ_m^{2M} η_m v_m, where J = (I  iI). Substituting this into eq. (6), we get the following; the basis {v_m} is assumed to be orthonormal.
\[
E(c + \Delta c) \approx E(c) + \sum_{m}^{2M} \left( \frac{\partial E}{\partial c} \right)^{H} J\, v_m\, \eta_m + \frac{1}{2} \sum_{m}^{2M} \lambda_m \eta_m^2 \tag{8}
\]
By minimizing the above with respect to ηm , we get the suitable step length
ηm . When λm < 0, the above ηm gives the maximal point; then, ηm is selected
so as to reduce E. Moreover, we check if ηm surely reduces E, and if that does
not hold, we set ηm = 0. Thus, the weight update rule of eigen vector descent is
given as below.
\[
w^{new} \leftarrow w^{old} + \sum_{m}^{2M} \Delta w_m, \qquad \Delta w_m = \eta_m\, (I \;\; iI)\, v_m \tag{9}
\]
3. while J ≤ J_max do
   3.1 Apply reducibility mapping to get w(J+1) from w(J), where the free weights {w_{J+1,k}^{(1)}} are left undetermined.
   3.2 for l = 1, 2, ..., L_max do
       a. Initialize the free weights {w_{J+1,k}^{(1)}} randomly.
       b. Call Crevasse Search, and let E(J+1) and w(J+1) be the error and weights after learning.
       c. if E(J) − E(J+1) > θ E(J) then break end if
       end for
   3.3 J ← J + 1.
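The outer loop above can be sketched as follows; the helper signatures are hypothetical stand-ins for Crevasse Search, the reducibility mapping, and the random free-weight initialization:

```python
def csrm(w_init, crevasse_search, reducibility_map, init_free,
         J_max=10, L_max=50, theta=1e-6):
    """Outer loop of the CS+RM procedure (helper signatures assumed:
    crevasse_search returns (error, weights), reducibility_map grows
    w(J) into w(J+1) with free weights open, init_free randomizes the
    free weights)."""
    E, w = crevasse_search(w_init)                # learn the initial model
    J = 1
    while J <= J_max:
        w_grown = reducibility_map(w)             # step 3.1
        for _ in range(L_max):                    # step 3.2
            cand = init_free(w_grown)             # step 3.2a
            E_new, w_new = crevasse_search(cand)  # step 3.2b
            if E - E_new > theta * E:             # step 3.2c: accepted
                E, w = E_new, w_new
                break
        J += 1                                    # step 3.3
    return E, w

# toy usage with stub helpers (purely illustrative)
cs = lambda w: (1.0 / len(w), w)                  # "error" shrinks with size
rm = lambda w: w + [0.0]                          # add one free weight
E_fin, w_fin = csrm([1.0], cs, rm, lambda w: w)
```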
Figures 2 and 3 show the learning processes of steepest descent and the proposed method, respectively. The error of the best solution found by steepest descent is around 10^0, while the proposed method found solutions whose errors are around 10^{-12}, much better than steepest descent. In Fig. 3 we see that reducibility mapping (red circles) successively triggered error reductions, guiding the search into new promising search fields.
The generalization of the complex-valued MLP learned by the proposed method was evaluated. Test points with the equal interval 0.001 were used (ten times finer than the training data) in the range x ∈ [−2, 2] (twice as wide as the training data). Thus, both interpolation and extrapolation capabilities were checked. Figure 4 shows excellent fitting; Fig. 5, showing the first quadrant on a double log scale, reveals some mismatches only where the real parts are very small, around 10^{-3}.
Fig. 2. Transition of training error in the learning process of the unbounded function by steepest descent
Fig. 3. Transition of training error in the learning process of the unbounded function by the proposed method
\[
J_\alpha(x) = \left( \frac{x}{2} \right)^{\alpha} \sum_{k=0}^{\infty} \frac{(-x^2/4)^k}{k!\,\Gamma(\alpha + k + 1)}, \qquad Y_\alpha(x) = \frac{J_\alpha(x)\cos(\alpha\pi) - J_{-\alpha}(x)}{\sin(\alpha\pi)} \tag{10}
\]
Table 2. Experimental conditions

items                                      values
max number of hidden units J_max           10
max number of CS iterations N_max          2000
max number of RM search iterations L_max   50
error improvement threshold θ              10^{-6}
value range of initial weights             [−10, 10]
[Figure panels: x from 5 to 40, vertical range −3 to 1; legend: Bessel function J_α(x), predicted Bessel function J_α(x)]
Fig. 7. True values of Bessel function J_α(x), α = 1, 2, ..., 5
Fig. 8. Output of complex-valued MLP for unknown data of Bessel function J_α(x), α = 1, 2, ..., 5. The number of hidden units J was 3, 5, 7 from top to bottom.
We used a complex-valued MLP which takes as input the real variable x and the integer order α and outputs J_α(x) and Y_α(x). The variable x changes from 1 to 20 with the equal interval 0.1, and α is set to 1, 2, and 3; thus, the sample size is 191 × 3 = 573. Generalization was evaluated using points from 1 to 40 (twice the range of the training data) together with α = 1, 2, 3, 4, 5, where α = 4, 5 require extrapolation. Table 2 shows the experimental conditions.
Figure 6 shows the learning process of the proposed method. We see again that reducibility mapping (red circles) triggered error reductions, nicely guiding the search. Figure 7 shows true values of the Bessel function J_α(x), while Fig. 8 shows the output of the complex-valued MLP learned by the proposed method. From Fig. 8, a small J (= 3) gives rather poor fitting and poor extrapolation, while a large J (= 7) gives unstable fitting. Excellent fitting and extrapolation were obtained for J = 5. Much the same tendency was observed for Y_α(x).
6 Conclusion
The paper proposed a new search method called CS+RM for a complex-valued MLP, which makes use of eigen vector descent and reducibility mapping. Our experiments using an unbounded function and Bessel functions showed that the proposed method worked well with nice generalization.
References
1. Amin, M. F., Amin, M.I., Al-Nuaimi, A.Y.H., Murase, K.: Wirtinger Calcu-
lus Based Gradient Descent and Levenberg-Marquardt Learning Algorithms in
Complex-Valued Neural Networks. In: Lu, B.-L., Zhang, L., Kwok, J. (eds.)
ICONIP 2011, Part I. LNCS, vol. 7062, pp. 550–559. Springer, Heidelberg (2011)
2. Kreutz-Delgado, K.: The complex gradient operator and the CR-calculus. ECE275A Lecture Supplement, Fall (2006)
3. Fukumizu, K., Amari, S.: Local minima and plateaus in hierarchical structure of
multilayer perceptrons. Neural Networks 13(3), 317–327 (2000)
4. Kim, T., Adali, T.: Approximation by fully complex multilayer perceptrons. Neural
Computation 15(7), 1641–1666 (2003)
5. Kuroe, Y., Yoshida, M., Mori, T.: On Activation Functions for Complex-Valued
Neural Networks–Existence of Energy Functions. In: Kaynak, O., Alpaydın, E.,
Oja, E., Xu, L. (eds.) ICANN 2003 and ICONIP 2003. LNCS, vol. 2714, pp. 985–
992. Springer, Heidelberg (2003)
6. Luenberger, D.G.: Linear and nonlinear programming. Addison-Wesley (1984)
7. Leung, H., Haykin, S.: The complex backpropagation algorithm. IEEE Trans. Sig-
nal Processing 39(9), 2101–2104 (1991)
8. Nakano, R., Satoh, S., Ohwaki, T.: Learning method utilizing singular region of
multilayer perceptron. In: Proc. 3rd Int. Conf. on Neural Computation Theory and
Applications, pp. 106–111 (2011)
9. Nitta, T.: Reducibility of the complex-valued neural network. Neural Information
Processing - Letters and Reviews 2(3), 53–56 (2004)
10. Sussmann, H.J.: Uniqueness of the weights for minimal feedforward nets with a
given input-output map. Neural Networks 5(4), 589–593 (1992)
Theoretical Analysis of Function of Derivative
Term in On-Line Gradient Descent Learning
1 Introduction
Learning in neural networks can be formulated as optimization of an objective
function that quantifies the system’s performance. An important property of
feed-forward networks is their ability to learn a rule from examples. Statistical
mechanics has been successfully used to study this property, mainly for the
simple perceptron [1,2,3]. A compact description of the learning dynamics can
be obtained by using statistical mechanics, which uses a large input dimension
N and provides an accurate model of mean behavior for a realistic N [2,3,4].
Several studies have investigated ways to accelerate the learning process [5,6,7]. For example, slow convergence due to plateaus occurs in the learning process using a gradient descent algorithm. In gradient descent learning, the
K. Hara et al.
parameters are updated in the direction of the steepest descent of the objective function, and the derivative of the output is taken into account. Fahlman [8] proposed, on the basis of empirical studies, a "simple method" in which the derivative term is replaced with a constant, thereby speeding up convergence. However, such results should be supported by theoretical analysis to show their generality.
In this paper, we theoretically analyze the effect of using the simple method by
using statistical mechanics methods and derive coupled differential equations of
the order parameters depicting its learning behavior. We validate the analytical
solutions by comparing them with those of computer simulation. Then we com-
pare the behavior of the true gradient descent method and the simple method
from the theoretical point of view. Our results show that the simple method
leads to faster convergence up to an optimum learning rate; beyond this rate it leads to slower convergence. We also show that, in the presence of output noise, the optimum learning rate changes, which means the simple method is not robust in noisy circumstances. Consequently, how the derivative term affects the learning speed and the robustness to noise is clarified.
2 Formulation
In this section, we formulate teacher and student networks and a gradient descent
algorithm in which the derivative term is replaced with a constant. We use a
teacher-student formulation, so we assume the existence of a teacher network
that produces the desired outputs. Teacher output t is the target of student
output s.
Consider a teacher and a student that are perceptrons with connection weights B = (B_1, ..., B_N) and J^m = (J_1^m, ..., J_N^m), respectively, where m denotes the number of learning iterations. We assume that the teacher and student perceptrons receive the N-dimensional input ξ^m = (ξ_1^m, ..., ξ_N^m), that the teacher outputs t^{(m)} = g(y_m), and that the student outputs s^{(m)} = g(x_m). Here, g(·) is the output function, y_m is the inner potential of the teacher, calculated using y_m = Σ_{i=1}^{N} B_i ξ_i^m, and x_m is the inner potential of the student, calculated using x_m = Σ_{i=1}^{N} J_i^m ξ_i^m.
We assume that the elements ξ_i^m of the independently drawn input ξ^m are uncorrelated random variables with zero mean and unit variance; that is, the i-th element of the input is drawn from a probability distribution P(ξ_i). The thermodynamic limit N → ∞ is also assumed. The statistics of the inputs in the thermodynamic limit are ⟨ξ_i^m⟩ = 0, ⟨(ξ_i^m)²⟩ = 1, and ‖ξ^m‖ = √N, where ⟨···⟩ denotes an average and ‖·‖ denotes the norm of a vector. Each element B_i, i = 1 ∼ N, is drawn from a probability distribution with zero mean and 1/N variance. With the assumption of the thermodynamic limit, the statistics of the teacher weight vector are ⟨B_i⟩ = 0, ⟨(B_i)²⟩ = 1/N, and ‖B‖ = 1. The distribution of the inner potential y_m follows a Gaussian distribution with zero mean and unit variance in the thermodynamic limit. For the sake of analysis, we assume that each element of J_i^0, which is the initial value of the student vector J^m, is also drawn from a probability distribution with zero mean and 1/N variance.
The student weight vector is updated using the gradient descent algorithm:
\[
J^{m+1} = J^m + \frac{\eta}{N}\,\bigl(g(y_m) - g(x_m)\bigr)\, g'(x_m)\, \xi^m, \tag{2}
\]
where η is the learning step size and g'(x) is the derivative of the output function g(x).
3 Theory
In this section, we first show why the local property of the derivative of the
output slows convergence and then derive equations that depict the learning
dynamics.
We use a sigmoid function as the output function of the perceptrons: g(x) = erf(x/√2). The derivative of this function is g'(x) = √(2/π) exp(−x²/2). Since g'(x) is a Gaussian function, it decreases quickly along x. As explained in the previous section, the distribution of the inner potential P(x) follows a Gaussian distribution with zero mean and unit variance in the thermodynamic limit N → ∞. Consequently, g'(x) for non-zero x is very small, so the update of the student weight from (2) is very small, which reduces the convergence speed.
We expand g'(x) ∝ exp(−x²/2) ∼ 1 − x²/2 + x⁴/8 − ··· and use the first term. When the first term is used as a constant, the update for non-zero x becomes larger. A better approach might be to use a constant value a instead of 1 (the first term). We thus modify the learning equation to include a constant term, where n denotes output noise:
\[
J^{m+1} = J^m + \frac{\eta_a}{N}\left(\mathrm{erf}\!\left(\frac{y_m}{\sqrt{2}}\right) - \mathrm{erf}\!\left(\frac{x_m}{\sqrt{2}}\right) - n\right)\xi^m = J^m + \frac{\eta_a}{N}\,\delta\,\xi^m. \tag{3}
\]
We replace η with η_a for simplicity.
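The two update rules, eq. (2) and eq. (3), can be simulated directly at finite N; a minimal sketch, assuming numpy and with all helper names ours:

```python
import math
import numpy as np

def online_update(J, B, eta, rng, simple=True, a=1.0, sigma=0.0):
    """One on-line step of the student J against the teacher B.

    simple=False: true gradient, eq. (2), with g(x) = erf(x/sqrt(2)).
    simple=True : eq. (3), derivative g'(x) replaced by the constant a
    (folded into eta in the text).
    """
    N = len(B)
    xi = rng.standard_normal(N)             # input: zero mean, unit variance
    y = float(B @ xi)                       # teacher inner potential
    x = float(J @ xi)                       # student inner potential
    n = sigma * rng.standard_normal()       # output noise
    delta = math.erf(y / math.sqrt(2)) - math.erf(x / math.sqrt(2)) - n
    gprime = math.sqrt(2 / math.pi) * math.exp(-x * x / 2)
    coeff = a * delta if simple else delta * gprime
    return J + (eta / N) * coeff * xi

# usage: the teacher-student overlap R = B . J grows during learning
rng = np.random.default_rng(0)
N = 500
B = rng.standard_normal(N) / math.sqrt(N)   # ||B|| ~ 1
J = rng.standard_normal(N) / math.sqrt(N)   # ||J^0|| ~ 1
R0 = float(B @ J)
for _ in range(4 * N):                      # alpha = m/N = 4
    J = online_update(J, B, eta=0.5, rng=rng)
R_final = float(B @ J)
```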
The generalization error is
\[
\epsilon_g = \left\langle \frac{1}{2}\,\bigl(g(y_m) - g(x_m) - n\bigr)^2 \right\rangle
= \frac{1}{\pi}\sin^{-1}\frac{1}{2} + \frac{1}{\pi}\sin^{-1}\frac{Q^2}{1+Q^2} - \frac{2}{\pi}\sin^{-1}\frac{R}{\sqrt{2(1+Q^2)}} + \frac{\sigma^2}{2}, \tag{4}
\]
where R = B · J^m and Q² = J^m · J^m are the order parameters. Their dynamics obey
\[
\frac{dQ^2}{d\alpha} = 2\eta\,\langle \delta x \rangle + \eta^2 \langle \delta^2 \rangle, \tag{5}
\]
\[
\frac{dR}{d\alpha} = \eta\,\langle \delta y \rangle, \tag{6}
\]
where δ = erf(y/√2) − erf(x/√2) − n, and α is time defined as α = m/N; we assume the limit N → ∞. Note that (5) and (6) are macroscopic equations while (2) and (3) are microscopic equations. By calculating the three averages ⟨δx⟩, ⟨δ²⟩, and ⟨δy⟩, we get two closed differential equations:
\[
\frac{dR}{d\alpha} = \frac{\eta}{\sqrt{\pi}}\left(1 - \frac{2R}{\sqrt{2(1+Q^2)}}\right), \tag{7}
\]
\[
\frac{dQ^2}{d\alpha} = \frac{2\eta}{\sqrt{\pi}}\left(R - \frac{2Q^2}{\sqrt{2(1+Q^2)}}\right)
+ \eta^2\left\{ \frac{2}{\pi}\left(\sin^{-1}\frac{1}{2} + \sin^{-1}\frac{Q^2}{1+Q^2} - 2\sin^{-1}\frac{R}{\sqrt{2(1+Q^2)}}\right) + \sigma^2 \right\}. \tag{8}
\]
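The closed equations (7) and (8), together with the generalization error of eq. (4), can be integrated numerically; a minimal Euler sketch (function names ours; noise-free default σ² = 0, starting from Q² = 1, R = 0 as in the experiments):

```python
import math

def eps_g(Q2, R, sigma2=0.0):
    """Generalization error of eq. (4)."""
    return (math.asin(0.5) / math.pi
            + math.asin(Q2 / (1.0 + Q2)) / math.pi
            - 2.0 * math.asin(R / math.sqrt(2.0 * (1.0 + Q2))) / math.pi
            + sigma2 / 2.0)

def integrate(eta, alpha_max=20.0, dalpha=1e-3, sigma2=0.0):
    """Euler integration of the order-parameter flow (7)-(8)."""
    Q2, R = 1.0, 0.0
    for _ in range(int(alpha_max / dalpha)):
        s = math.sqrt(2.0 * (1.0 + Q2))
        dR = eta / math.sqrt(math.pi) * (1.0 - 2.0 * R / s)
        dQ2 = (2.0 * eta / math.sqrt(math.pi) * (R - 2.0 * Q2 / s)
               + eta * eta * (2.0 / math.pi * (math.asin(0.5)
                                               + math.asin(Q2 / (1.0 + Q2))
                                               - 2.0 * math.asin(R / s))
                              + sigma2))
        R += dalpha * dR
        Q2 += dalpha * dQ2
    return eps_g(Q2, R, sigma2)

# by alpha = 20, eta = 0.5 is well converged while eta = 0.1 lags behind
e_05 = integrate(0.5)
e_01 = integrate(0.1)
```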
4 Results
In this section, we first present the results for noise-free cases and compare the
analytical solutions with those of computer simulation. We then present and
discuss the results for noisy cases. In the figures presented here, the horizontal axis is continuous time α = m/N, where m is the learning iteration. The vertical axis for the analytical solutions is the generalization error ε_g; for the simulation solutions, it is the mean squared error over N inputs.
Figure 1 shows the results for the noise-free cases. The learning step size η is 0.1, 0.5, 2.7, 3.0 or 5.0, and we set ‖B‖ = 1, ‖J⁰‖ = 1, and ‖ξ‖ = √N. In the simulations, N = 1000. The curves in the figure show the analytical solutions, and the symbols show the simulation results: "+" is for η = 0.1, "×" for 0.5, "∗" for 2.7, "□" for 3.0, and "◦" for 5.0. The close agreement between the analytical and simulation results validates the analytical results.
[Fig. 1: generalization error (vertical axis, 0–0.45) versus α (horizontal axis, 0–20) for the noise-free cases]
Next, we compare the true gradient descent method with the simple method
using analytical solutions. As reported by Biehl and Schwarze, the optimum
learning step size is ηopt ≈ 2.7 [2]. With this in mind, we compare the generaliza-
tions for learning step size η = 0.1, 0.5, 3.0, and 5.0. Figure 2 shows the results.
In the figures, ”T” is the true gradient descent method, and ”P” is the simple
method. For η = 0.1 and 0.5, the generalization error with the simple method
decreases faster than with the true gradient descent method. For η = 3.0, the
generalization error with the true gradient descent method decreases faster than
with the simple method. With both methods, the generalization error approaches
zero. For η = 5.0, the residual generalization error with the simple method is
larger than with the true gradient descent method.
Figure 3 shows the results for η = η_opt = 2.7 for both the analytical and simulation solutions. Label "T" denotes the results of the true gradient method, and label "P" those of the simple method. The analytical solutions agree with the simulation
ones, meaning that the generalization error is reduced at the same rate with both
methods when ηopt is used. Therefore, when the learning step size is η < ηopt ,
the generalization error with the simple method decreases faster than with the
true gradient descent one, and the generalization error with the true gradient
descent method decreases faster than with the simple one when η > ηopt .
Next, we present and discuss the results for noisy cases. Figure 4 shows the
results. The learning step size η is 0.1, 0.5, 2.7, 3.0 or 5.0, and we set ‖B‖ = 1, ‖J⁰‖ = 1, and ‖ξ‖ = √N. In the simulation, N = 1000. The curves in the figures
[Four panels, one per η ∈ {0.1, 0.5, 3.0, 5.0}: generalization error (0–0.35) versus α (0–20); curves labelled (T) and (P)]
Fig. 2. Comparison of generalization error between true gradient descent and simple
methods
[Generalization error (0–0.4) versus α (0–20); curves labelled T and P]
Fig. 3. Comparison of asymptotic property between simple and true gradient descent
methods for both analytical and simulation solutions
show the analytical solutions, and the symbols show the simulation solutions: "+" is for η = 0.1, "×" for 0.5, "∗" for 2.7, "□" for 3.0, and "◦" for 5.0. As shown in the figures, the presence of noise greatly increases the residual error for η ≥ 2.7. The optimum learning step size is no longer η = 2.7; the fastest convergence is attained with η = 0.5.
[Two panels: generalization error (0–0.45) versus α (0–20)]
Fig. 4. Comparison of learning behavior between noise-free (left) and noise cases (right)
5 Conclusion
We have analyzed the simple method using a constant value ”a” instead of
g (x). We derived closed order parameter differential equations depicting the
dynamic behavior of the learning system and solved for the generalization error
by theoretical analysis. The analytical solutions were confirmed by the simulation
results. We found that the generalization error decreases faster with the simple
method than with the true gradient descent method when the learning step size is fixed at η < η_opt. When η > η_opt, the generalization error decreases more slowly with the simple method, and the residual error is larger than with the true gradient descent method. The addition of output noise changed the optimum learning rate, meaning that the simple method is not robust in noisy circumstances. Consequently, how the derivative term affects the learning speed and the robustness to noise has been clarified.
The order parameter equations ((5) and (6)) are derived from learning equation
(3). To obtain the deterministic differential equation for Q, we square both sides
of (3) and then average the terms in the equation by using the distribution of
P (x, y). Since Q has a self-averaging property, we get
\[
\left(Q^{(m+1)}\right)^2 = \left(Q^{(m)}\right)^2 + \frac{2\eta}{N}\,\langle \delta x \rangle + \frac{\eta^2}{N}\,\langle \delta^2 \rangle, \tag{9}
\]
\[
\frac{dQ^2}{d\alpha} = \frac{4\eta}{\pi}\,\frac{1}{1+Q^2}\left(\frac{R}{\sqrt{2(1+Q^2)-R^2}} - \frac{Q^2}{\sqrt{1+2Q^2}}\right)
+ \frac{2\eta^2}{\pi}\,\frac{1}{\sqrt{1+2Q^2}}\left\{ \frac{2}{\pi}\left(\sin^{-1}\frac{1+2(Q^2-R^2)}{2(1+2Q^2-R^2)} + \sin^{-1}\frac{Q^2}{1+3Q^2} - 2\sin^{-1}\frac{R}{\sqrt{2(1+2Q^2-R^2)(1+3Q^2)}}\right) + \sigma^2 \right\}. \tag{10}
\]
References
1. Krogh, A., Hertz, J., Palmer, R.G.: Introduction to the Theory of Neural Compu-
tation. Addison-Wesley, Redwood City (1991)
2. Biehl, M., Schwarze, H.: Learning by on-line gradient descent. Journal of Physics
A: Mathematical and General Physics 28, 643–656 (1995)
3. Saad, D., Solla, S.A.: On-line learning in soft-committee machines. Physical Review
E 52, 4225–4243 (1995)
4. Hara, K., Katahira, K., Okanoya, K., Okada, M.: Statistical Mechanics of On-Line
Node-perturbation Learning. Information Processing Society of Japan, Transactions
on Mathematical Modeling and Its Applications 4(1), 72–81 (2011)
5. Fukumizu, K.: A Regularity Condition of the Information Matrix of a Multilayer
Perceptron Network. Neural Networks 9(5), 871–879 (1996)
6. Rattray, M., Saad, D.: Incorporating Curvature Information into On-line learning.
In: Saad, D. (ed.) On-line Learning in Neural Networks, pp. 183–207. Cambridge
University Press, Cambridge (1998)
7. Amari, S.: Natural gradient works efficiently in learning. Neural Computation 10,
251–276 (1998)
8. Fahlman, S.E.: An Empirical Study of Learning Speed in Back-Propagation Net-
works, CMU-CS-88-162 (1988)
9. Williams, C.K.I.: Computation with Infinite Neural Networks. Neural Computa-
tion 10, 1203–1216 (1998)
Some Comparisons of Networks with Radial
and Kernel Units
Věra Kůrková
1 Introduction
Originally, artificial neural networks were built from biologically inspired perceptrons. Later, other types of computational units became popular in neurocomputing, mainly due to their good mathematical properties. Among them,
radial-basis-function (RBF) units introduced by Broomhead and Lowe [1] and
kernel units introduced by Girosi and Poggio [2] became most popular. In partic-
ular, kernel units with symmetric positive semidefinite kernels have been widely
used due to their good classification properties [3]. In contrast to RBF networks,
where both centers and widths are adjustable, in networks with units defined by
symmetric kernels, all units have the same fixed width determined by the choice
of the kernel. Both computational models have their advantages. RBF networks
are known to be universal approximators [4,5]. In addition to the capability of
RBF networks to approximate arbitrarily well all reasonable real-valued func-
tions, model complexity of RBF networks is often lower than complexity of
traditional linear approximators (see, e.g., [6,7,8] for some estimates). On the
other hand, kernel models with symmetric positive semidefinite kernels benefit
from geometrical properties of Hilbert spaces generated by these kernels. These
properties allow application of maximal margin classification [3], generate suit-
able stabilizers for modeling of generalization in terms of regularization [9], and
lead to mathematical description of theoretically optimal solutions of learning
tasks [10,11,12]. Thus both types of computational models, the one with units
having fixed widths and the one with units having variable widths, have their
advantages.
where the set G is called a dictionary [13], n is the number of hidden units, and
R denotes the set of real numbers. The set of input-output functions of networks
with an arbitrary number of hidden units is denoted
\[
\mathrm{span}\, G := \left\{ \sum_{i=1}^{n} w_i g_i \,\middle|\, w_i \in \mathbb{R},\ g_i \in G,\ n \in \mathbb{N}_+ \right\},
\]
\[
G_K(X, Y) := \{ K(\cdot, y) : X \to \mathbb{R} \mid y \in Y \}
\]
the Euclidean norm on R^d, v is called a center, and 1/b a width. Thus RBF units generate dictionaries G_{Bψ}(X, Y) := {ψ(b(· − v)) : X → R | b ∈ R₊, v ∈ Y}. So for RBF units, the sets of inputs X differ from the sets of parameters Y, as in addition to centers, widths also vary. Fixing a width b > 0, we get from an RBF kernel Bψ a symmetric kernel Bψ_b : R^d × R^d → R.
We investigate radial and symmetric kernel units in terms of scaled kernels. For K : R^d × R^d → R and a > 0, we denote by K^a : R^d × R^d → R the kernel defined as K^a(x, y) := K(ax, ay). When it is clear from the context, we also use K^a to denote the restriction of K^a to X × X, where X ⊆ R^d. Thus a symmetric kernel K induces two dictionaries: a dictionary with fixed widths
\[
G_K(X) := \{ K(\cdot, y) : X \to \mathbb{R} \mid y \in X \}
\]
Our next theorem shows that for convolution kernels with certain properties
of their Fourier transforms, variability of widths is not a necessary condition for
the universal approximation. Networks with such kernel units even with fixed
widths can approximate arbitrarily well all functions from L2 (Rd ).
Recall that a convolution kernel K is induced by translations of a one-variable
function k : Rd → Rd , i.e., K(x, y) = k(x−y), and so GK (X) := {k(.−y) |y ∈ Y }.
The convolution is an operation defined as
$f * h\,(x) := \int_{\mathbb{R}^d} f(y)\, h(x - y)\, dy .$
$\left| \int_{\mathbb{R}^d} f_0(y)\, h(y)\, dy \right| \le \|f_0\|_{L^2} \|h\|_{L^2} = 0$, which is a contradiction. □
Proof. (i) Extending functions from L2 (X) to L2 (Rd ) by setting their values
equal to zero outside of X and restricting their approximations from span GK (Rd )
to X, we get the statement from Theorem 1.
(ii) The statement follows from (i) as for X compact, C(X) ⊂ L2 (X). □
Note that Theorem 1 and Corollary 1 imply the universal approximation prop-
erty of Gaussian kernel networks with any fixed width, both in (L2 (Rd ), ‖·‖L2 )
Networks with Radial and Kernel Units 21
and in (C(X), ‖·‖sup ) with X compact. Indeed, for any a > 0, the Fourier
transform of the scaled d-dimensional Gaussian function satisfies
$\widehat{e^{-a^2\|\cdot\|^2}} = (\sqrt{2}\,a)^{-d}\, e^{-\|\cdot\|^2/(4a^2)}$
[16, p. 186]. So our results provide an alternative to Mhaskar's
proof of the universal approximation property of Gaussian networks with fixed
widths. Moreover, our proof technique applies to a wider class of kernels than
Gaussians and holds in both L2 (Rd ) and C(X). In particular, it applies to all
convolution kernels induced by functions with positive Fourier transforms. Such
kernels are known to be positive definite and thus they play an important role
in classification and generalization [18].
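The fixed-width universal approximation property can be illustrated numerically. The sketch below is illustrative only (it is not from the paper; the target function, width, and unit counts are arbitrary choices): it fits a smooth target by least squares over a dictionary of Gaussian units sharing one fixed width, and the empirical L2 error shrinks as hidden units are added.

```python
import numpy as np

# Illustrative sketch (not from the paper): least-squares fitting over a
# dictionary of FIXED-width Gaussian kernel units K(x, v) = exp(-b^2 (x - v)^2).
# All names, constants, and the target function are arbitrary choices.

def gaussian_design(x, centers, b):
    # design matrix of fixed-width Gaussian units evaluated at inputs x
    return np.exp(-((b * (x[:, None] - centers[None, :])) ** 2))

x = np.linspace(-3.0, 3.0, 200)
target = np.sin(2.0 * x) * np.exp(-0.1 * x ** 2)   # an arbitrary smooth target

b = 2.0                      # one fixed width shared by every unit
errs = {}
for n in (5, 20, 80):        # number of hidden units
    centers = np.linspace(-3.0, 3.0, n)
    G = gaussian_design(x, centers, b)
    w, *_ = np.linalg.lstsq(G, target, rcond=None)
    errs[n] = float(np.sqrt(np.mean((G @ w - target) ** 2)))
    print(f"n={n:3d}  RMSE={errs[n]:.4f}")
```

With the width held fixed, only the number and placement of centers improves the fit, mirroring the theorem's claim that varying widths are not necessary.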
(ii) for all α > 0, there exists a unique argminimum $f^\alpha$ of $E_{z,\alpha,K}$ over $H_K(X)$,
which satisfies $f^\alpha = \sum_{i=1}^{m} c_i^\alpha K_{u_i}$, where $c^\alpha = (c_1^\alpha, \ldots, c_m^\alpha) = (K_m[u] + \alpha I_m)^{-1} v$;
(iii) $\lim_{\alpha \to 0} \|f^\alpha - f\|_K = 0$.
By [16, p. 183], we have $\|f\|_{K^a}^2 = (2\pi)^{d/2}\, a^d \int_{\mathbb{R}^d} \hat{f}(s)^2 \big(\hat{k}(s/a)\big)^{-1}\, ds$. Hence
$\frac{\|f\|_{K^a}^2}{\|f\|_{K^b}^2} = \frac{a^d \int_{\mathbb{R}^d} \hat{f}(s)^2 \big(\hat{k}(s/a)\big)^{-1}\, ds}{b^d \int_{\mathbb{R}^d} \hat{f}(s)^2 \big(\hat{k}(s/b)\big)^{-1}\, ds} .$
As $\hat{k}$ is nonincreasing, $b \le a$ implies $\big(\hat{k}(s/a)\big)^{-1} \le \big(\hat{k}(s/b)\big)^{-1}$.
Thus $\frac{\|f\|_{K^a}}{\|f\|_{K^b}} \le \left(\frac{a}{b}\right)^{d/2}$. □
References
1. Broomhead, D.S., Lowe, D.: Multivariable functional interpolation and adaptive
networks. Complex Systems 2, 321–355 (1988)
2. Girosi, F., Poggio, T.: Regularization algorithms for learning that are equivalent
to multilayer networks. Science 247(4945), 978–982 (1990)
3. Cortes, C., Vapnik, V.N.: Support-vector networks. Machine Learning 20, 273–297
(1995)
4. Park, J., Sandberg, I.: Universal approximation using radial–basis–function net-
works. Neural Computation 3, 246–257 (1991)
5. Park, J., Sandberg, I.: Approximation and radial basis function networks. Neural
Computation 5, 305–316 (1993)
6. Kainen, P.C., Kůrková, V., Sanguineti, M.: Complexity of Gaussian radial basis
networks approximating smooth functions. J. of Complexity 25, 63–74 (2009)
7. Gnecco, G., Kůrková, V., Sanguineti, M.: Some comparisons of complexity in
dictionary-based and linear computational models. Neural Networks 24(1), 171–
182 (2011)
8. Gnecco, G., Kůrková, V., Sanguineti, M.: Can dictionary-based computational
models outperform the best linear ones? Neural Networks 24(8), 881–887 (2011)
9. Girosi, F.: An equivalence between sparse approximation and support vector ma-
chines. Neural Computation 10, 1455–1480 (1998) (AI memo 1606)
10. Cucker, F., Smale, S.: On the mathematical foundations of learning. Bulletin of
AMS 39, 1–49 (2002)
11. Poggio, T., Smale, S.: The mathematics of learning: dealing with data. Notices of
AMS 50, 537–544 (2003)
12. Kůrková, V.: Inverse problems in learning from data. In: Kaslik, E., Sivasundaram,
S. (eds.) Recent advances in dynamics and control of neural networks. Cambridge
Scientific Publishers (to appear)
13. Gribonval, R., Vandergheynst, P.: On the exponential convergence of matching
pursuits in quasi-incoherent dictionaries. IEEE Trans. on Information Theory 52,
255–261 (2006)
14. Pietsch, A.: Eigenvalues and s-Numbers. Cambridge University Press, Cambridge
(1987)
15. Mhaskar, H.N.: Versatile Gaussian networks. In: Proceedings of IEEE Workshop
of Nonlinear Image Processing, pp. 70–73 (1995)
16. Rudin, W.: Functional Analysis. McGraw-Hill (1991)
17. Friedman, A.: Foundations of Modern Analysis. Dover, New York (1982)
18. Schölkopf, B., Smola, A.J.: Learning with Kernels – Support Vector Machines,
Regularization, Optimization and Beyond. MIT Press, Cambridge (2002)
19. Bertero, M.: Linear inverse and ill-posed problems. Advances in Electronics and
Electron Physics 75, 1–120 (1989)
20. Girosi, F., Jones, M., Poggio, T.: Regularization theory and neural networks ar-
chitectures. Neural Computation 7, 219–269 (1995)
21. Wahba, G.: Spline Models for Observational Data. SIAM, Philadelphia (1990)
22. Loustau, S.: Aggregation of SVM classifiers using Sobolev spaces. Journal of Ma-
chine Learning Research 9, 1559–1582 (2008)
23. Fine, T.L.: Feedforward Neural Network Methodology. Springer, Heidelberg (1999)
24. Kůrková, V., Neruda, R.: Uniqueness of functional representations by Gaussian
basis function networks. In: Proceedings of ICANN 1994, pp. 471–474. Springer,
London (1994)
Multilayer Perceptron for Label Ranking
1 Introduction
In many real-world applications, assigning a single label to an example is not
enough. For instance, when trading in the stock market based on recommenda-
tions from financial analysts, predicting who is the best analyst does not suffice
because 1) he/she may not make a recommendation in the near future and 2)
we may prefer to take into account recommendations of multiple analysts, to be
on the safe side [1]. Hence, to support this approach, a model should predict a
ranking of analysts rather than suggesting a single one. Such a situation can be
modeled as a Label Ranking (LR) problem: a form of preference learning, aiming
to predict a mapping from examples to rankings of a finite set of labels [2].
Recently, quite a few solutions have been proposed for the label ranking prob-
lem [2], including one based on the Multilayer Perceptron algorithm (MLP) [4].
MLP is a type of neural network architecture, which has been applied in a super-
vised learning context using the error back-propagation (BP) learning algorithm.
In this paper, we take a different approach from the simple adaptation proposed ear-
lier [4]: we adapt the BP learning mechanism to LR. More specifically, we inves-
tigate how the error signal used by BP can incorporate information from the LR loss
function. We introduce six approaches and evaluate their (relative) performance.
We also show some preliminary experimental results that indicate whether our
new method could compete with state-of-the-art LR methods.
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 25–32, 2012.
c Springer-Verlag Berlin Heidelberg 2012
26 G. Ribeiro et al.
2 Preliminaries
Throughout this paper, we assume a training set T = {xn , πn } consisting
of t examples xn and their associated label rankings πn . Such a ranking is a
permutation of a finite set of k labels L = {λ1 , . . . , λk }, taken from
the permutation space ΩL . Each example xn consists of m attributes xn =
{a1 , . . . , am } and is taken from the example space X. The position of λa in a
ranking πn is denoted by πn (a) and assumes a value in the set {1, . . . , k}.
An MLP consists of three types of layers: the input
layer, the hidden layer(s) and the output layer. Each layer has one or more
neurons. Every neuron i is connected to the j neurons of the next layer by a set
of weighted links denoted by w1i , . . . , wji . At the input layer, {a1 , a2 , . . . , am }
represent m input signals associated with the m attributes. At the hidden and
output layers, each neuron j receives a linear combination of the input signals,
with output given by $v_j = \sum_{i=0}^{m} w_{ji} a_i$. The linear combinations are transformed
into output signals using an activation function ϕ(vj ). These signals are sent in
a forward direction layer by layer to the output layer which delivers an output
yj for each output neuron j. In classification, each class is associated with an
output neuron and the prediction is typically given by the one with the highest
activation level.
The goal is to define the values of the connection weights that return the
outputs with the lowest error, i.e., the output that is most similar to the desired value,
d(n). One method to learn the weights is BP, which propagates errors in a
backward direction from the output layer to the input layer, updating the weight
connections if an error is detected at the output layer. A weight correction on the
nth training example is defined in terms of the error signals cj (n) for each output
neuron j. Considering a sequential mode in which the weights are updated after
every training example, the predicted output yj (n) is compared with the desired
target dj (n), and the individual error ej (n) is estimated as follows: ej (n) =
dj (n) − yj (n). In a typical NN, the error signal is equal to the individual error,
because the predicted output is directly compared with the target. The correction
is given by Δwji (n) = ηδj (n)yi (n), where η is the learning rate, yi (n) is the
output signal of the previous neuron i and the local gradient δj is defined by
$\delta_j(n) = e_j(n)\,\varphi'(v_j(n))$. For a hidden neuron i, the local gradient is defined in a
recursive form by $\delta_i(n) = \varphi_i'(v_i(n)) \sum_j \delta_j(n)\, w_{ji}(n)$.
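The update rules above can be sketched for a single hidden layer with sigmoid activations, for which φ'(v) = y(1 − y). Everything below (layer sizes, data, and the target function) is made up purely for illustration; it is not the paper's MLP-LR setup.

```python
import numpy as np

# Minimal sketch of the sequential-mode BP update described above, for one
# hidden layer with sigmoid activations (so phi'(v) = y(1 - y)). The sizes,
# data, and target function are made up purely for illustration.
rng = np.random.default_rng(1)
m, h, k = 3, 4, 2                        # inputs, hidden neurons, outputs
W1 = rng.normal(0.0, 1.0, (h, m + 1))    # +1 column holds the bias weight
W2 = rng.normal(0.0, 1.0, (k, h + 1))
eta = 0.5                                # learning rate

X = rng.normal(0.0, 1.0, (50, m))
D = (X[:, :2] > 0).astype(float)         # arbitrary desired outputs d_j(n)

def forward(x):
    a = np.append(1.0, x)                          # input signals plus bias
    y1 = 1.0 / (1.0 + np.exp(-(W1 @ a)))           # hidden outputs
    a1 = np.append(1.0, y1)
    y2 = 1.0 / (1.0 + np.exp(-(W2 @ a1)))          # network outputs y_j
    return a, y1, a1, y2

def mse():
    return float(np.mean([(d - forward(x)[3]) ** 2 for x, d in zip(X, D)]))

mse_before = mse()
for epoch in range(200):
    for x, d in zip(X, D):                         # sequential (per-example) mode
        a, y1, a1, y2 = forward(x)
        delta2 = (d - y2) * y2 * (1.0 - y2)        # delta_j = e_j phi'(v_j)
        delta1 = y1 * (1.0 - y1) * (W2[:, 1:].T @ delta2)   # hidden gradients
        W2 += eta * np.outer(delta2, a1)           # Delta w_ji = eta delta_j y_i
        W1 += eta * np.outer(delta1, a)
mse_after = mse()
print(mse_before, mse_after)
```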
To prevent the MLP learning from getting stuck in a local optimum we use
random-restart hill climbing, by generating new random weights wji ∼ N (0, 1).
For each restart we present every example in the training set to the learning
process for a user-defined number of epochs. The weights associated
with the best performance are returned.
Local Approach (LA). The error signal is the individual error of each output
neuron, cj (n) = ej (n) = πn (j) − π̂n (j), as in the original MLP. The LR error,
eτ , is only used to evaluate the activation of the BP.
Global Approach (GA). The error signal is defined in terms of the LR error. In
this case, it is simply given by cj (n) = eτ (n).
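The two error signals can be sketched side by side. The paper derives the LR error eτ from Kendall's τ; here we assume, purely for illustration, the form eτ = 1 − τ, and the example rankings are made up.

```python
import numpy as np

def kendall_tau(pi, pi_hat):
    # Kendall tau between two rankings given as label positions pi(a)
    k = len(pi)
    s = 0
    for a in range(k):
        for b in range(a + 1, k):
            s += np.sign(pi[a] - pi[b]) * np.sign(pi_hat[a] - pi_hat[b])
    return 2.0 * s / (k * (k - 1))

pi = np.array([1, 2, 3, 4])        # true ranking (positions of the labels)
pi_hat = np.array([2, 1, 3, 4])    # predicted ranking, one adjacent swap

# Local Approach: per-output-neuron individual errors
c_local = pi - pi_hat                      # c_j(n) = pi_n(j) - pi_hat_n(j)
# Global Approach: every output neuron receives the same ranking-level error;
# e_tau = 1 - tau is an illustrative assumption, not the paper's exact loss
e_tau = 1.0 - kendall_tau(pi, pi_hat)
c_global = np.full(len(pi), e_tau)
print(c_local, c_global)
```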
Score-Based Signed Global Approach (SSGA). The motivation for SSGA is the
same as for WSGA. The difference is that we rank the output neuron scores yj
instead of the input weights. The positions of the weights, pw (j), are replaced in
eq. 1 with the positions of the scores, ps (j).
4 Experimental Results
The goal is to compare the performance of the proposed approaches on dif-
ferent datasets. The datasets used for the evaluation are from the KEBI Data
Repository [12] hosted by the Philipps University of Marburg. These datasets,
which are commonly used for LR, are presented in Table 1. Our approach starts
Adaptation of a Neural Network for Label Ranking 29
by normalizing all attributes, and separating the dataset into a training and
a test set. On each dataset we tested the six approaches with h = 3 hidden
neurons, η = 0.2, using 5 epochs with 5 random restarts. The error estimation
methodology is 10-fold cross-validation. The results are presented in terms of the
similarity between the rankings πi and π̂i with the Kendall τ coefficient, which
is equivalent to the error measure described in Section 2.
In Table 2, we show the resulting τ -values for each approach, and associated
rank (lower is better) per dataset. The bottom row shows the average rank for
each approach, which allows us to compare the relative performance of the ap-
proaches using the Friedman test with post-hoc Nemenyi test [13]. The Friedman
test shows that the average ranks are significantly unequal (with α = 1%). Then
the Nemenyi test gives us a critical difference of CD = 2.225 (with α = 1%).
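The Nemenyi critical difference follows the formula from Demšar [13], CD = q_α √(k(k+1)/(6N)), where k is the number of approaches, N the number of datasets, and q_α a critical value from the studentized range table. A hand-checkable sketch (the inputs below are arbitrary, not the paper's values):

```python
import math

def critical_difference(q_alpha, k, N):
    # Nemenyi post-hoc test: two approaches differ significantly if their
    # average ranks differ by more than CD = q_alpha * sqrt(k(k+1)/(6N));
    # q_alpha must be looked up in the studentized range statistic table
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * N))

# hand-checkable example: q=2.0, k=3 approaches, N=8 datasets
# CD = 2.0 * sqrt(3*4 / 48) = 2.0 * 0.5 = 1.0
print(critical_difference(2.0, 3, 8))
```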
Fig. 1. (a) Boxplot of the results according to the approaches. (b) Results per number of epochs on the Iris dataset.
The test implies that for each pair of approaches Ai and Aj , if Ri < Rj − CD,
then Ai is significantly better than Aj . Hence we can see from the table that
approaches LA and CA significantly outperform all other approaches except for
SSGA. However, at α = 10% the critical difference becomes CD = 1.712, so at
this significance level CA significantly outperforms SSGA too.
As we can see from Table 2, not all approaches have a very high τ -value
for all datasets. Notice, however, that these experiments are performed with
a rather arbitrary set of parameters. Varying parameters such as the number
of hidden neurons in the MLP, the number of epochs used when learning the
neural network, and the number of random restarts, could benefit performance.
To illustrate this, Figure 1b displays the variation of τ -values for the different
approaches on the Iris dataset, when varying the number of epochs. As we
can see, we can subtantially improve the results when tweaking the number of
epochs. For some approaches using more epochs is better, but for others this
monotonicity does not hold. We see similar behavior when varying the number
of stages and hidden neurons. Hence, we expect that much better results can be
gained with the new approaches when the parameter space is properly explored
for each dataset, but this is beyond the scope of this paper.
In Table 3, we compare the performance of approaches LA and CA with pub-
lished results of the state-of-the-art algorithms equal width apriori label ranking
(EW), minimum entropy apriori label ranking (ME) [10], constraint classifica-
tion (CC), instance-based label ranking (IBLR) and ranking trees (LRT) [8, 10],
in terms of Kendall’s τ coefficient. Notice that the new methods do not gener-
ally outperform the current state-of-the-art methods, but they do achieve results
that are often of the same magnitude. Since the results for the new approaches
are obtained without any form of parameter optimization, we feel confident that
exploration of the parameter space can yield a competitive algorithm.
To learn more about our results, we crafted a metalearning dataset from Ta-
bles 1 and 3. We performed a Subgroup Discovery [14, 15] run using the dataset
characteristics from Table 1 as search space, and mined for local patterns wherein
the rank of LA or CA deviates from the average over all datasets. Such a run re-
sults in a set of conditions on dataset characteristics, under which our approaches
perform unusually well or badly, giving pointers for further research.
The most convincing metasubgroup under which both LA and CA perform
well, is defined by m ≥ 13. Datasets belonging to this subgroup are indicated
by bold blue names in Table 3. When the dataset at hand has relatively many
attributes, our approaches have relatively many input signals in the MLP. Hence
there are many more connections with the hidden layer, and much more interac-
tions between the neurons in the network. Apparently, this increased complexity
of the MLP adds subtlety to its predictions, which allows the MLP-LR method
to induce more accurate representations of the underlying concepts.
5 Conclusions
Empirical results indicate that the two methods that directly incorporate the
individual errors perform significantly better than the methods that focus on the
LR error. However, the best results are obtained by combining both errors (CA).
A comparison with results published for other methods additionally indicates
that our method has the potential to compete with other methods. This holds
even though no parameter tuning was carried out, which is known to be essential
for learning accurate networks. Our method becomes more competitive when the
data contains more attributes; this increases the amount of input neurons, and
the MLP-LR predictions benefit from the more complex network. As future
work, apart from parameter tuning we will investigate other ways of combining
the local and global errors and we will investigate how to give more importance
to higher ranks.
References
1. Aiguzhinov, A., Soares, C., Serra, A.P.: A Similarity-Based Adaptation of Naive
Bayes for Label Ranking: Application to the Metalearning Problem of Algorithm
Recommendation. In: Discovery Science (2010)
2. Vembu, S., Gärtner, T.: Label Ranking Algorithms: A Survey. In: Fürnkranz, J.,
Hüllermeier, E. (eds.) Preference Learning. Springer (2010)
3. Hüllermeier, E., Fürnkranz, J.: On loss functions in label ranking and risk mini-
mization by pairwise learning. JCSS 76(1), 49–62 (2010)
4. Kanda, J., Carvalho, A.C.P.L.F., Hruschka, E.R., Soares, C.: Using Meta-learning
to Classify Traveling Salesman Problems. In: SBRN (2010)
5. Brinker, K., Hüllermeier, E.: Label Ranking in Case-Based Reasoning. In: Weber,
R.O., Richter, M.M. (eds.) ICCBR 2007. LNCS (LNAI), vol. 4626, pp. 77–91.
Springer, Heidelberg (2007)
6. Dekel, O., Manning, C.D., Singer, Y.: Log-linear models for label ranking. In:
Advances in Neural Information Processing Systems (2003)
7. Hüllermeier, E., Fürnkranz, J., Cheng, W., Brinker, K.: Label ranking by learning
pairwise preferences. Artificial Intelligence 172(16-17), 1897–1916 (2008)
8. Cheng, W., Dembczynski, K., Hüllermeier, E.: Label Ranking Methods based on
the Plackett-Luce Model. In: ICML (2010)
9. Cheng, W., Huhn, J.C., Hüllermeier, E.: Decision tree and instance-based learning
for label ranking. In: ICML (2009)
10. de Sá, C.R., Soares, C., Jorge, A.M., Azevedo, P., Costa, J.: Mining Association
Rules for Label Ranking. In: Huang, J.Z., Cao, L., Srivastava, J. (eds.) PAKDD
2011, Part II. LNCS, vol. 6635, pp. 432–443. Springer, Heidelberg (2011)
11. Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd edn. Prentice Hall (1998)
12. KEBI Data Repository,
http://www.uni-marburg.de/fb12/kebi/research/repository
13. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of
Machine Learning Research 7, 1–30 (2006)
14. Klösgen, W.: Subgroup Discovery. In: Handbook of Data Mining and Knowledge
Discovery, ch. 16.3. Oxford University Press, New York (2002)
15. Pieters, B.F.I., Knobbe, A., Džeroski, S.: Subgroup Discovery in Ranked Data, with
an Application to Gene Set Enrichment. In: Proc. Preference Learning Workshop
(PL 2010) at ECML PKDD (2010)
Electricity Load Forecasting:
A Weekday-Based Approach
1 Introduction
Electricity load forecasting is the task of predicting the electricity load (demand)
based on previous electricity loads and other variables such as weather conditions. It
is important for the management of power systems, including daily decision making,
dispatching of generators, setting the minimum reserve and planning maintenance. In
this paper we focus on 5-minute-ahead prediction from previous 5-minute electricity
load data. This is an example of 1-step-ahead and very short-term prediction, and is
especially useful in competitive electricity markets, to help the market operator and
participants in their transactions. The overall goal is to ensure reliable operation of the
electricity system while minimizing the costs.
There are two main groups of approaches for electricity load prediction: the
traditional statistical approaches, which are linear and model-based, such as
exponential smoothing and autoregressive integrated moving average, and the more
recent machine learning approaches, with neural network-based approaches being
most popular. Taylor’s work [1-3] is the most prominent example of the first group.
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 33–41, 2012.
© Springer-Verlag Berlin Heidelberg 2012
34 I. Koprinska, M. Rana, and V.G. Agelidis
He studied a number of methods for very short-term and short-term prediction based
on exponential smoothing and autoregressive integrated moving average using British
and French data, and found that the best method for 5-minute-ahead prediction was
double seasonal Holt-Winters exponential smoothing. Notable examples of the
second group are [4, 5], which used Backpropagation Neural Networks (BPNN), and
[6-8], which also used other prediction algorithms such as support vector machines and
Linear Regression (LR). For example, in [5] Shamsollahi et al. used a BPNN for 5-
minute ahead load forecasting. The data was preprocessed by applying a logarithmic
differencing of the consecutive loads; the BPNN’s architecture consisted of one
hidden layer and one output node; the stopping criterion was based on the use of a
validation set. They obtained excellent accuracy and their method was integrated
into the New England energy market system.
Most of the previous work has focused on building global models. Exceptions are
[7] where local models for each season were created and the wavelet-based
approaches [9, 10] where the load was decomposed into different frequency
components and a local model was built for each of them. In this paper we consider
another type of local prediction models - models based on the day of the week.
The key idea is to exploit the differences in the load profiles for the different days
of the week, e.g. it is well known that the load during the weekend is smaller than the
load during the working days. If a global model is built, it will treat all days in the
same way and will capture an average of the previous dependencies for all days. For
example, in [6, 7] we found that one of the most important predictors is the load from
the previous day at the same time as the prediction time. It is an important predictor,
on average, if a single prediction model is built for all days of the week. However,
this predictor is not equally important for all days of the week. For example, the load
profile on Monday is more likely to be similar to the load profile on the previous
Friday, not the previous day (Sunday). Similarly, the load profile on Saturday is more
likely to be similar to the load profile on the previous Saturday and Sunday, not the
previous day (Friday).
The key contributions of this paper are:
1) We propose a new approach for building local weekday-based model using
autocorrelation analysis. It is a generic approach and can be applied to other time
series and local components, not only to electricity load data and day of the week.
2) We compare the performance of the local model with the performance of a
global model, i.e. one single model for all days of the week. We conduct a
comprehensive evaluation using two years of Australian electricity data.
the autocorrelation features in a slightly different way (different number of peaks and
neighborhood size). Our approach consists of two main steps: 1) selecting features
using autocorrelation analysis (local and global selection) and 2) building prediction
models (local and global) using the LR and BPNN algorithms.
The autocorrelation function shows the correlation of a time series with itself at
different time lags. It is used to investigate the cyclic nature of a time series and is
appropriate for electricity load data as there are well defined daily and weekly cycles.
The first graph in Fig. 1 (“global”) shows the autocorrelation function of the
electricity load in 2006 for the state of New South Wales (NSW) in Australia. Values
close to 1 or -1 (i.e. peaks) indicate high positive or negative autocorrelation and
values close to 0 indicate lack of autocorrelation. The data is highly correlated; the
strongest dependence is at lag 1 (i.e. values that are 1 lag apart), the second strongest
dependence is at lag 2016 (i.e. values that are exactly 1 week apart) and so on.
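This cyclic structure is easy to reproduce on synthetic data. The sketch below is not the NSW data: it builds an artificial 5-minute series with daily (288-lag) and weekly (2016-lag) cycles, with made-up amplitudes, and checks the autocorrelation at the corresponding lags.

```python
import numpy as np

# Synthetic check (not the NSW data): a 5-minute series with daily (288-lag)
# and weekly (2016-lag) cycles shows autocorrelation peaks at those lags.
rng = np.random.default_rng(0)
n = 6 * 2016                                  # six weeks of 5-minute values
t = np.arange(n)
load = (10.0 * np.sin(2 * np.pi * t / 288)    # daily cycle
        + 3.0 * np.sin(2 * np.pi * t / 2016)  # weekly cycle
        + rng.normal(0.0, 1.0, n))            # noise

def autocorr(x, lag):
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

for lag in (1, 144, 288, 2016):
    print(f"lag {lag:4d}: r = {autocorr(load, lag):+.3f}")
```

As expected, the autocorrelation is strongly positive at one day and one week, and negative at half a day, where the daily cycle is in anti-phase.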
Fig. 1. Autocorrelation function for the global model and the local (weekday) models for
Monday, Friday and Saturday
To form a feature set we extract load variables from the seven highest peaks and
their neighbourhoods. The number of peaks and the size of the neighbourhoods are
selected empirically. As the higher peaks indicate stronger dependence and more
informative variables, we extract more variables from them than from the lower
peaks. More specifically, we extracted the following 37 variables:
• from peak 1 (the highest peak): the peak and the 10 lags before it; note that there
are no lags after it (11 features).
• from peaks 2 and 3: the peak and the three lags before and after it (7 features each)
• from peaks 4-7: the peak and the surrounding 1 lag before and after it (3 features
each).
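The extraction rules above can be sketched as follows; the peak lags are those reported for the global model in Table 1, and the helper name is ours.

```python
# Sketch of the feature extraction rules above: from the seven highest
# autocorrelation peaks (strongest first), take each peak lag plus its
# neighbourhood, giving 37 lags in total.
def feature_lags(peaks):
    lags = set()
    lags.update(range(peaks[0], peaks[0] + 11))    # peak 1: peak + 10 lags before it
    for p in peaks[1:3]:
        lags.update(range(p - 3, p + 4))           # peaks 2-3: peak +- 3 lags
    for p in peaks[3:7]:
        lags.update(range(p - 1, p + 2))           # peaks 4-7: peak +- 1 lag
    return sorted(lags)

global_peaks = [1, 2016, 288, 4032, 6048, 1728, 2304]   # lags from Table 1 (global)
lags = feature_lags(global_peaks)
print(len(lags))   # 11 + 2*7 + 4*3 = 37 features
```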
Using the selected features we build prediction models that learn from the training
data. As prediction algorithms we used LR and BPNN; BPNN is the most popular
algorithm for load forecasting and LR is the algorithm we found to work best in
previous work [6, 7]. LR assumes linear decision boundary and uses the least squares
method to find the line of best fit. We used a variant of stepwise regression with
backward elimination based on the M5 method. BPNN is a classical neural network
trained with the backpropagation algorithm and capable of producing complex non-
linear decision boundaries. We used 1 hidden layer; to tune the BPNN parameters we
experimented with different number of hidden nodes, learning rate, momentum and
maximum number of epochs and report the best results that we obtained.
We build one model for each day of the week, e.g. one for Monday, one for Tuesday
and so on. It is used to predict the load only for this day of the week. The
autocorrelation analysis and feature selection are conducted separately for each
weekday. Fig. 1 shows the autocorrelation function for three of the days (Monday,
Friday and Saturday). Table 1 shows the location of the highest peaks and Table 2
shows the extracted features for each of the local models.
Table 1. The seven highest autocorrelation peaks for the global and local models
Peak number 1 2 3 4 5 6 7
global 1 2016 288 4032 6048 1728 2304
same day 1 week 1 day 2 weeks 3 weeks 6 days 8 days
local 1 2016 4032 6048 854 1152 1440
Monday same day 1 week 2 weeks 3 weeks 3 days 4 days 5 days
local 1 2016 4032 6048 288 1152 1440
Tuesday same day 1 week 2 weeks 3 weeks 1 day 4 days 5 days
local 1 2016 4032 6048 288 576 1440
Wednesday same day 1 week 2 weeks 3 weeks 1 day 2 days 5 days
local 1 2016 4032 6048 288 576 864
Thursday same day 1 week 2 weeks 3 weeks 1 day 2 days 3 days
local 1 2016 4032 6048 288 576 864
Friday same day 1 week 2 weeks 3 weeks 1 day 2 days 3 days
local 1 2016 4032 6048 288 1728 8064
Saturday same day 1 week 2 weeks 3 weeks 1 day 6 days 4 weeks
local 1 2016 4032 6048 288 8064 2304
Sunday same day 1 week 2 weeks 3 weeks 1 day 4 weeks 8 days
Electricity Load Forecasting: A Weekday-Based Approach 37
2.3 Comparison of the Features Selected in the Global and Local Models
The two strongest dependencies are the same for all prediction models – at the same
day and 1 week before. However, there are considerable differences in the remaining
5 strongest dependencies. For the global model, they are (in decreasing order) at 1
day, 2 weeks, 3 weeks, 6 days and 8 days. Hence, there is a mixture between daily
and weekly dependencies as the global model captures the dependencies for all days
of the week and represents an overall average dependence measure.
In contrast, for the local models, the weekly dependencies are stronger than the
daily, with all of them having 2 weeks and 3 weeks as the third and fourth highest
dependencies, followed by mainly daily dependencies for peaks 5-7. The daily
dependencies for the different weekdays are as expected. For example, the load for
Monday correlates with other working days – the previous Friday, Thursday and
Wednesday and not with the previous Sunday which is the third strongest predictor in
the global model. The load for Sunday correlates with other weekend days – the
Saturday before, Sunday 4 weeks ago and Saturday 1 week ago. The load for Tuesday
correlates with the other workdays - the previous Monday, Friday and Thursday.
Hence, the features extracted by the local models are meaningful and represent
better the load dependencies for the respective day of the week than the global model
which averages these dependencies for all days of the week.
Table 2. Selected features to predict the load Xt+1 for the global and local models
model Selected features to predict Xt+1:
global Xt-10 to Xt, XWt-3 to XWt+3, XDt-3 to XDt+3, XW2t-1 to XW2t+1,
XW3t-1 to XW3t+1, XD6t-3 to XD6t+3, XD8t-3 to XD8t+3
local Xt-10 to Xt, XWt-3 to XWt+3, XW2t-3 to XW2t+3, XW3t-1 to XW3t+1,
Monday XD3t-1 to XD3t+1, XD4t-1 to XD4t+1, XD5t-1 to XD5t+1
local Xt-10 to Xt, XWt-3 to XWt+3, XW2t-3 to XW2t+3, XW3t-1 to XW3t+1,
Tuesday XDt-1 to XDt+1, XD4t-1 to XD4t+1, XD5t-1 to XD5t+1
local Xt-10 to Xt, XWt-3 to XWt+3, XW2t-3 to XW2t+3, XW3t-1 to XW3t+1,
Wednesday XDt-1 to XDt+1, XD2t-1 to XD2t+1, XD5t-1 to XD5t+1
local Xt-10 to Xt, XWt-3 to XWt+3, XW2t-3 to XW2t+3, XW3t-1 to XW3t+1,
Thursday XDt-1 to XDt+1, XD2t-1 to XD2t+1, XD3t-1 to XD3t+1
local Xt-10 to Xt, XWt-3 to XWt+3, XW2t-3 to XW2t+3, XW3t-1 to XW3t+1,
Friday XDt-1 to XDt+1, XD2t-1 to XD2t+1, XD3t-1 to XD3t+1
local Xt-10 to Xt, XWt-3 to XWt+3, XW2t-3 to XW2t+3, XW3t-1 to XW3t+1,
Saturday XDt-1 to XDt+1, XD6t-1 to XD6t+1, XW4t-1 to XW4t+1
local Xt-10 to Xt, XWt-3 to XWt+3, XW2t-3 to XW2t+3, XW3t-1 to XW3t+1,
Sunday XDt-1 to XDt+1, XW4t-1 to XW4t+1, XD8t-1 to XD8t+1
where: Xt – load on forecast day at time t; XDt, XD2t, XD3t, XD4t, XD5t, XD6t, XD8t –
load 1, 2, 3, 4, 5, 6 and 8 days before the forecast day at time t, XWt, XW2t, XW3t,
XW4t - load 1, 2, 3 and 4 weeks before the forecast day at time t.
The 2006 data was used as training data (99,067 instances) and the 2007 data was used as testing data (105,119 instances).
To measure the predictive accuracy, we used the Mean Absolute Error (MAE) and
the Mean Absolute Percentage Error (MAPE):
$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| L\_actual_i - L\_forecast_i \right|, \qquad
\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \frac{\left| L\_actual_i - L\_forecast_i \right|}{L\_actual_i}\ [\%]$
where L_actuali and L_forecasti are the actual and forecasted load at the 5-minute lag
i and n is the total number of predicted loads. MAE is a standard metric used by the
research community and MAPE is widely used by the industry forecasters.
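Written out in code (the load values below are made up for the example):

```python
import numpy as np

# The two error measures, written out; the load values are made up.
def mae(actual, forecast):
    # mean absolute error, in the units of the load (MW)
    return float(np.mean(np.abs(actual - forecast)))

def mape(actual, forecast):
    # mean absolute percentage error, in percent
    return float(100.0 * np.mean(np.abs(actual - forecast) / actual))

actual = np.array([8000.0, 8200.0, 8500.0])
forecast = np.array([7950.0, 8260.0, 8450.0])
print(mae(actual, forecast), mape(actual, forecast))
```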
The performance of our models was compared with four naïve baselines where the
predicted load value was: 1) the mean of the class variable in the training data (Bmean),
2) the load from the previous lag (Bplag), 3) the load from the previous day at the same
time (Bpday) and 4) the load from the previous week at the same time (Bpweek).
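A minimal sketch of the four baselines, assuming the 5-minute resolution (288 lags per day, 2016 per week); the series and function name here are synthetic illustrations, not the paper's implementation.

```python
import numpy as np

# Sketch of the four naive baselines for a 5-minute load series.
DAY, WEEK = 288, 2016          # number of 5-minute lags in a day / a week

def baseline_forecasts(train, test):
    # forecasts for the test part, given the full history before each lag
    series = np.concatenate([train, test])
    idx = np.arange(len(train), len(series))      # positions of the test lags
    return {
        "Bmean":  np.full(len(test), train.mean()),   # mean of training loads
        "Bplag":  series[idx - 1],                    # load at the previous lag
        "Bpday":  series[idx - DAY],                  # same time, previous day
        "Bpweek": series[idx - WEEK],                 # same time, previous week
    }

train = np.arange(3000.0)       # toy "load" values, long enough for one week
test = np.arange(3000.0, 3100.0)
b = baseline_forecasts(train, test)
print(b["Bplag"][:3], b["Bpday"][:3])
```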
Table 3 shows the accuracy results of the global model and the local weekday model.
In the global case, one prediction model is built and is then used to predict the load
for all examples in the test data. In the local case, seven prediction models are built,
one for each day of the week. Each model is used to predict the load only for the test
examples from the respective day (e.g. the model for Monday predicts the Mondays in
the test set) and the reported result is the average of these predictions for the test data.
Table 3. Accuracy of the global and local prediction models, and the baselines
global local baselines
LR BPNN LR BPNN Bmean Bplag Bpday Bpweek
MAE [MW] 25.07 25.06 24.73 25.47 1159.42 41.24 453.88 451.03
MAPE [%] 0.286 0.286 0.282 0.291 13.484 0.473 5.046 4.940
The best model was the local LR achieving MAPE=0.282%. It was slightly more
accurate than the global LR model (MAPE=0.286%) and this difference was
statistically significant at p<0.001 (Wilcoxon rank sum test). The local BPNN was
slightly less accurate than the global BPNN model (MAPE=0.291% and 0.286%) and
again this difference was statistically significant at p<0.001. The only pair-wise
accuracy difference that was not statistically significant was between LR global and
BPNN global. All global and local models considerably outperformed all baselines.
The computational complexity of the local model is higher than the global model
as it requires training seven models instead of one. As the training is done offline, this
is not a problem for LR which is very fast to train (approx. 10 sec) but is prohibitive
for BPNN which requires hours.
A closer examination of the selected features shows that all prediction models have
65% common and 35% different features. Specifically, there are 24 common features:
11 from the same day, 7 from 1 week ago, 3 from 2 weeks ago and 3 from 3 weeks
ago. The results show that the 35% different features, with the prediction algorithms
used, are not sufficient to make a big difference in the predictive accuracy.
In order to understand the differences between the global and local model for each
day of the week and also to investigate if there are days of the week associated with
higher and lower accuracy, we calculated the accuracy separately for each day of the
week, see Table 4. A comparison of the global and local models shows that their
accuracies are very similar for all days of the week, for both LR and BPNN. A
comparison along the days of the week shows that it was easiest to predict the load on
Wednesday, followed by Thursday, then Sunday, Saturday and Friday, then Monday
and it was most difficult to predict the load on Tuesday. The difference between
Tuesday and Wednesday requires further investigation; it might be due to public
holidays or random disturbances from large loads.
In the global model, LR and BPNN were equally accurate. In the local model, LR
was more accurate than BPNN. LR has another advantage over BPNN – it is much
faster to train (seconds versus hours). These results are consistent with our previous
work [6, 7] showing that LR is more accurate and faster than BPNN and thus more
suitable for load prediction. There is a debate in the literature [12-13] about the need
to use nonlinear prediction models such as BPNN for forecasting electricity load data.
Table 4. Accuracy of the global and local prediction models for each weekday using LR and
BPNN. The best result for each day and each prediction algorithm is shown in bold.

LR, global
            Mon    Tue    Wed    Thu    Fri    Sat    Sun
MAE [MW]    26.91  27.20  20.17  23.85  26.11  25.66  25.54
MAPE [%]    0.304  0.302  0.225  0.261  0.288  0.306  0.314

LR, local
            Mon    Tue    Wed    Thu    Fri    Sat    Sun
MAE [MW]    25.71  27.35  20.02  23.59  25.54  25.45  25.45
MAPE [%]    0.291  0.303  0.223  0.258  0.282  0.303  0.314

BPNN, global
            Mon    Tue    Wed    Thu    Fri    Sat    Sun
MAE [MW]    26.70  27.22  20.34  23.90  25.98  25.72  25.53
MAPE [%]    0.301  0.302  0.227  0.261  0.287  0.307  0.315

BPNN, local
            Mon    Tue    Wed    Thu    Fri    Sat    Sun
MAE [MW]    27.19  27.92  20.81  23.66  25.93  25.88  26.87
MAPE [%]    0.307  0.310  0.232  0.259  0.286  0.307  0.334
In summary, the performance of the global model and the local weekday model
was very similar. There was a small and statistically significant gain in accuracy when
building local (versus global) model and using LR as a prediction algorithm. LR was
also very fast to train and predict new data, so both local and global prediction models
with LR are suitable for practical applications.
40 I. Koprinska, M. Rana, and V.G. Agelidis
5 Conclusions
We considered the task of predicting the electricity load every 5 minutes from
previous 5-minute loads. We presented a new approach for building weekday-based
prediction models using local autocorrelation feature selection and machine learning
algorithms (LR and BPNN). We compared the performance of the local weekday
model with a global (non-weekday dependent) model using two full years of
Australian electricity data. We found that the performance of the local and global
models was comparable. The local model when used with LR obtained a small and
statistically significant gain in accuracy over the global model and also achieved the
highest overall accuracy of MAPE=0.282%. Both models, global and local, with LR
are accurate and fast to train and are thus suitable for practical applications. Our
approach for building local models using autocorrelation analysis is not limited to
electricity load data and day of the week; it is a generic approach that can be applied
to other time series and local characteristics. Future work will include learning
weights for the individual features and investigating the effectiveness of local
prediction models for holidays and irregular days.
References
1. Taylor, J.W.: An evaluation of methods for very short-term load forecasting using
minute-by-minute British data. International Journal of Forecasting 24, 645–658 (2008)
2. Taylor, J.W.: Short-term electricity demand forecasting using double seasonal exponential
smoothing. Journal of the Operational Research Society 54, 799–805 (2003)
3. Taylor, J.W.: Triple seasonal methods for short-term electricity demand forecasting.
European Journal of Operational Research 204, 139–152 (2010)
4. Charytoniuk, W., Chen, M.-S.: Very short-term load forecasting using artificial neural
networks. IEEE Transactions on Power Systems 15, 263–268 (2000)
5. Shamsollahi, P., Cheung, K.W., Chen, Q., Germain, E.H.: A neural network based VSTLF
for the interim ISO New England electricity market system. In: 22nd IEEE PICA Conf.,
pp. 217–222 (2001)
6. Sood, R., Koprinska, I., Agelidis, V.: Electricity load forecasting based on autocorrelation
analysis. In: International Joint Conference on Neural Networks (IJCNN), Barcelona, pp.
1772–1779 (2010)
7. Koprinska, I., Rana, M., Agelidis, V.: Yearly and Seasonal Models for Electricity Load
Forecasting. In: International Joint Conference on Neural Networks (IJCNN 2011), USA
(2011)
8. Chen, B.J., Chang, M.-W., Lin, C.-J.: Load forecasting using support vector machines: a
study on EUNITE competition 2001. IEEE Transactions on Power Systems 19, 1821–1830
(2001)
9. Reis, A.J.R., Alvis, A.P., da Silva, P.A.: Feature Extraction via Multiresolution Analysis
for Short-Term Load Forecasting. IEEE Transactions on Power Systems 20, 189–198
(2005)
10. Chen, Y., Luh, P.B., Guan, C., Zhao, Y., Michel, L.D., Coolbeth, M.A.: Short-term load
forecasting: similar day-based wavelet neural network. IEEE Trans. on Power
Systems 25, 322–330 (2010)
11. Australian Energy Market Operator (AEMO), http://www.aemo.com.au
12. Hippert, H.S., Pedreira, C.E., Souza, R.C.: Neural Networks for short-term load
forecasting: a review and evaluation. IEEE Transactions on Power Systems 16(1), 44–55
(2001)
13. Darbellay, G.A., Slama, M.: Forecasting the short-term demand for electricity - Do neural
networks stand a better chance? International Journal of Forecasting 16, 71–83 (2000)
Adaptive Exploration Using Stochastic Neurons
1 Introduction
One of the most challenging tasks in reinforcement learning (RL) is balancing
the amounts of exploration and exploitation [1]. If the behavior of an agent is
too exploratory, the outcome of randomly selected bad actions prevents it from
maximizing short-term reward. In contrast, if an agent is too exploitative, the
selection of only sub-optimal actions prevents it from maximizing long-term reward,
because the outcome of the truly optimal actions is underestimated. Consequently,
the optimal balance lies somewhere in between and depends on many parameters,
such as the learning rate, the discount factor, the learning progress, and of course
the learning problem itself.
Many different approaches exist for trading off exploration and exploitation.
Based on a single exploration parameter, some basic policies select random ac-
tions either uniformly (ε-Greedy) or value-sensitively (Softmax) [1], or by a
combination of both [2], with the advantage of not requiring any exploratory
data to be memorized. In contrast, approaches utilizing counters in every state
(exploration bonuses) direct the exploration process towards finding the optimal
policy in polynomial time under certain circumstances [3, 4]. Nevertheless, basic
policies can be effective when a proper exploration parameter is configured, as
has been shown, for example, in board games with huge discrete state spaces
such as Othello [5] or English Draughts [6]. For such state spaces, utility func-
tions are hard to approximate, and conducting experiments to determine a
proper exploration parameter can be time consuming. A non-convergent counter
function is even harder to approximate than a convergent utility function [7]. In-
terestingly, Daw et al. revealed in biologically motivated studies on exploratory
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 42–49, 2012.
c Springer-Verlag Berlin Heidelberg 2012
2 Methodology
The learning problems considered in this paper can be described as Markov
Decision Processes (MDPs) [1], which consist of a set of states, S, and a
set of possible actions within each state, A(s) ∈ A, ∀s ∈ S. A stochastic transi-
tion function P(s, a, s′) describes the (stochastic) behavior of the environment,
i.e. the probability of reaching successor state s′ after selecting action a ∈ A(s)
in state s. The selection of an action is rewarded by a numerical signal from
the environment, r ∈ R, used for evaluating the utility of the selected action.
The goal of an agent is to find an optimal policy, π∗ : S → A, maximizing the
cumulative reward. In the following, S is allowed to be continuous, but A is
assumed to be a finite set of actions. Action-selection decisions are taken at
regular time steps, t ∈ {1, 2, . . . , T}, until a maximum number of T actions is
exceeded or a terminal state is reached.
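The basic policies mentioned in the introduction, ε-Greedy and Softmax [1], select among the finite action set based on the utility estimates. A minimal sketch (our own illustration, not code from the paper):

```python
import numpy as np

def epsilon_greedy(q, eps, rng):
    """With probability eps pick a uniformly random action, else the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

def softmax(q, tau, rng):
    """Sample an action with probability proportional to exp(Q / tau)."""
    z = np.exp((q - np.max(q)) / tau)   # shift by max for numerical stability
    p = z / z.sum()
    return int(rng.choice(len(q), p=p))
```

Small `eps`/`tau` makes both policies nearly greedy; large values make them nearly uniform.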
At the end of episode i, the components of θ are adapted along the gradient
with respect to the outcome ρ of the current episode. For improving the future
performance of π(ae, ·, ·), the policy's outcome is measured as the cumulative
reward of the current episode.
∂ ln g(ae, μ, σ) / ∂μ = (ae − μ) / σ²                         (5)
∂ ln g(ae, μ, σ) / ∂σ = ((ae − μ)² − σ²) / σ³ ,               (6)
and a reasonable algorithm for adapting μ and σ has the following form
Δμ = αR (ρ − ρ̄) (ae − μ) / σ²                                (7)
Δσ = αR (ρ − ρ̄) ((ae − μ)² − σ²) / σ³ .                      (8)
The learning rate αR has to be chosen appropriately, e.g. as a small positive
constant, αR = ασ² [9]. The baseline ρ̄ is adapted by a simple reinforcement-
comparison scheme
ρ̄ ← ρ̄ + α(ρ − ρ̄) .                                          (9)
Analytically, Eqn. 7 shifts the mean μ towards ae in the case of ρ ≥ ρ̄; other-
wise, μ is shifted in the opposite direction. Similarly, Eqn. 8 adapts the standard
deviation σ such that the probability of ae occurring is increased if ρ ≥ ρ̄ and
decreased otherwise (see the proof in [9]). In simple terms, the standard
deviation controls exploration in the space of ae.
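Equations 7–9 translate directly into an episodic update routine. The sketch below is our own transcription; the small clipping constant enforcing the bounding requirement discussed next is an assumption, not a value from the paper:

```python
def adapt(mu, sigma, rho_bar, a_e, rho, alpha, alpha_R=None):
    """One episodic update of mu, sigma and the baseline (Eqs. 7-9)."""
    if alpha_R is None:
        alpha_R = alpha * sigma ** 2            # small positive constant, as in [9]
    d = a_e - mu
    mu      += alpha_R * (rho - rho_bar) * d / sigma ** 2              # Eq. 7
    sigma   += alpha_R * (rho - rho_bar) * (d ** 2 - sigma ** 2) / sigma ** 3  # Eq. 8
    rho_bar += alpha * (rho - rho_bar)          # reinforcement comparison, Eq. 9
    return mu, max(sigma, 1e-3), rho_bar        # keep sigma bounded away from 0
```

If the episode outcome beats the baseline (ρ ≥ ρ̄), μ moves towards the sampled exploration parameter a_e; otherwise it moves away.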
Importantly, the proper functioning of the proposed algorithm depends on several
requirements. In order to limit the search for reasonable parameters, the explo-
ration parameter, mean, and standard deviation must be bounded to obtain
reasonable performance. Furthermore, if the learning problem has more than
one starting state, all parameters must be associated with each occurring
starting state, i.e. μ → μ(s), σ → σ(s) and ρ̄ → ρ̄(s), since path costs might
affect ρ unevenly. However, if a learning problem has just one starting
state, all utilized parameters can be treated as global parameters.
4 Experiments
The presented approach is evaluated in two environments using Q-learning and
Sarsa. First, a variation of the cliff-walking problem [1] is proposed as the
non-stationary cliff-walking problem comprising a non-stationary environment.
46 M. Tokic and G. Palm
Fig. 1. The non-stationary cliff-walking problem (a) and the mountain-car problem
with two goals (b)
Results. Figure 2(a) shows the reward per episode, averaged over 500 runs, for
the non-stationary cliff-walking problem. Averages of the mean μ and standard
deviation σ are shown in Figure 2(b). It can be observed that VDBE-Softmax
maximizes the reward per episode. MBE shows the best performance among the
remaining three basic policies, with the advantage of not requiring any further
exploratory data to be memorized, as is the case for VDBE-Softmax. The Sarsa
learning algorithm shows better results for all four investigated policies. All REC
policies achieve a much higher reward in episode 3000 than a pure greedy policy,
which converges to a reward per episode of 0 and is the optimal policy only
for the first 1000 episodes. In contrast, a pure random policy converges to a
reward per episode of −2750.
In the well-known mountain-car problem [1], an underpowered car must drive back
and forth up the hillsides to build up enough inertia to overcome gravity. In the
modification of the original learning problem presented here, two goal states are
utilized as depicted in Fig. 1(b), which are rewarded differently upon arrival, because a
simple greedy policy leads to optimal performance in the original formulation
of the learning problem. The bounded state variables are continuous,
consisting of the position of the car, −1.2 ≤ x ≤ 0.3, and its velocity, −0.07 ≤
ẋ ≤ 0.07. The car's dynamics are described by the difference equations
xt+1 = bound(xt + ẋt+1)
ẋt+1 = bound(ẋt + 0.001 at − 0.0025 cos(3xt)) .               (10)
At each discrete time step, the agent can choose one of seven actions, at ∈
{−1.0, −0.66, −0.33, 0, 0.33, 0.66, 1.0}, each rewarded by rt+1 = −1, except for
reaching the right goal, which is rewarded by rt+1 = 1000. An episode terminates
when one of the two goals has been reached or when a maximum number
of actions, Tmax = 10000, is exceeded. At the beginning of each episode, the car
is positioned in the valley at position x = −0.5 with initial velocity ẋ = 0.0.
The state space is approximated by a 100 × 100 matrix, thus making the actual
state only partially observable. Since the learning problem is episodic, no
discounting (γ = 1) is used. Finally, all utility values are optimistically initialized
to Qt=0(s, a) = 0 and learned using a step-size parameter of α = 0.7.
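Equation 10 and the episode setup can be transcribed directly; in this sketch the bound operator is taken to clip the state variables to the stated ranges:

```python
import math

def step(x, v, a):
    """One transition of the two-goal mountain-car dynamics (Eq. 10)."""
    # velocity is updated first from the old position, then clipped ...
    v = min(max(v + 0.001 * a - 0.0025 * math.cos(3 * x), -0.07), 0.07)
    # ... and the new velocity drives the (clipped) position update
    x = min(max(x + v, -1.2), 0.3)
    return x, v

# An episode starts at rest in the valley:
x, v = -0.5, 0.0
for _ in range(100):
    x, v = step(x, v, 1.0)   # always push right, for illustration
```

The clipping reproduces the bounded state space −1.2 ≤ x ≤ 0.3, −0.07 ≤ ẋ ≤ 0.07.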
Results. The results averaged over 200 runs are shown in Figure 3. Similar
to the non-stationary cliff-walking problem, episodic adaptation of MBE out-
performs ε-Greedy and Softmax. Furthermore, the Sarsa algorithm proves ad-
vantageous only in combination with the ε-Greedy and Softmax policies, in con-
trast to MBE and VDBE-Softmax, which behave more efficiently in combination
with Q-learning. In the first phase of learning, a degradation of performance is
recognizable for episodic MBE and VDBE-Softmax, which is due to first learning
a path to the left goal and only afterwards the path to the better right goal.
For comparison, a greedy policy converges to an average reward per episode
of 114, whereas a pure random policy converges to −5440.
Fig. 3. The mountain-car problem with two goals: Averaged reward using Q-learning
and Sarsa (smoothed)
References
[1] Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press,
Cambridge (1998)
[2] Wiering, M.: Explorations in Efficient Reinforcement Learning. PhD thesis, Uni-
versity of Amsterdam, Amsterdam (1999)
[3] Thrun, S.B.: Efficient exploration in reinforcement learning. Technical Report
CMU-CS-92-102, Carnegie Mellon University, Pittsburgh, USA (1992)
[4] Auer, P.: Using confidence bounds for exploitation-exploration trade-offs. The
Journal of Machine Learning Research 3, 397–422 (2002)
[5] van Eck, N.J., van Wezel, M.: Application of reinforcement learning to the game
of Othello. Computers and Operations Research 35, 1999–2017 (2008)
[6] Faußer, S., Schwenker, F.: Learning a strategy with neural approximated temporal-
difference methods in English draughts. In: Proceedings of the 20th International
Conference on Pattern Recognition, ICPR 2010, pp. 2925–2928. IEEE Computer
Society (2010)
[7] Rummery, G.A., Niranjan, M.: On-line Q-learning using connectionist systems.
Technical Report CUED/F-INFENG/TR 166, Cambridge University (1994)
[8] Daw, N.D., O’Doherty, J.P., Dayan, P., Seymour, B., Dolan, R.J.: Cortical sub-
strates for exploratory decisions in humans. Nature 441(7095), 876–879 (2006)
[9] Williams, R.J.: Simple statistical gradient-following algorithms for connectionist
reinforcement learning. Machine Learning 8, 229–256 (1992)
[10] Watkins, C.: Learning from Delayed Rewards. PhD thesis, University of Cam-
bridge, England (1989)
[11] Grzes, M., Kudenko, D.: Online learning of shaping rewards in reinforcement learn-
ing. Neural Networks 23(4), 541–550 (2010)
[12] Tokic, M., Palm, G.: Value-Difference Based Exploration: Adaptive Control be-
tween Epsilon-Greedy and Softmax. In: Bach, J., Edelkamp, S. (eds.) KI 2011.
LNCS, vol. 7006, pp. 335–346. Springer, Heidelberg (2011)
Comparison of Long-Term Adaptivity
for Neural Networks
1 Introduction
Artificial Neural Networks (ANNs) for function approximation can be used in
modern process monitoring and process control. One of the most common ap-
proaches is to use an ANN to predict future values of a measurement [1]. If this
prediction is used in combination with a controller, the system is known as Model
Predictive Control (MPC).
One of the most important problems with MPC in industrial applications is a
gradually changing environment, which results in a drift of the properties and
dynamics of the process. This change is known as concept drift [2]. Hence, the
prognosis of the ANN worsens the more time has passed since the ANN was trained.
To counter this worsening, the ANN has to be retrained with new data to
adapt to changes in the process. In this paper, we compare different methods
for retraining neural networks (in particular Multi-Layer Perceptrons) with new
data. We are especially interested in the long-term effects of the different re-
training approaches. We compare all methods on artificial data and on data
obtained from several industrial cement production plants.
The remainder of this paper is organized as follows: in Sect. 2 we present a Model
Predictive Control scenario and the data we used to benchmark the different
algorithms. Then, in Sect. 3, we give a brief review of related approaches
to adapting neural networks for function approximation to new data and compare
Fig. 1. Free lime value prognosis at a cement plant. The plot shows three days of plant
operation with laboratory measurements and neural network prognosis of the free lime
value. While a laboratory measurement is only available every four hours, the prognosis
is available the whole time.
The free lime value [3] is a major quality criterion of the cement. If a good pre-
diction of this value is possible, the whole production process can be stabilized.
52 F.-F. Steege and H.-M. Groß
This prognosis can only be used for control purposes if it is accurate. Obtaining a
correct prognosis over a long time range of two or more years is very demanding,
as the process dynamics of the cement plant change, as mentioned before. Figure 2
shows long-term changes of an important process measurement from a cement
plant over a period of two years. To test the capability of different long-term
network adaptation algorithms, we used data from three cement plants. The
target for the approximation was the free lime value of the cement produced by
the plant. Network inputs were signals obtained from the process, such as kiln
rotation speed, kiln temperature, and raw meal feed.
Fig. 2. Time plot of kiln inlet meal temperature over the period of two years. The
measurement is low-pass filtered with a sliding time window of one week. Time spans
without signal denote stops of the plant. Over the whole time, a change in the level
and dynamics of the signal of about 80 degrees is visible.
3 State-of-the-Art
One of the first publications to mention the difficulties of long-term adaptivity
of artificial neural networks in changing environments is [4]. The author anal-
ysed the problem of catastrophic interference in neural networks. Catastrophic
interference describes the phenomenon that learning new facts disrupts per-
formance on previously learned old facts; however, in [4] the author did not
propose solutions for how the problem could be solved or avoided.
In [5] the author proposes the FLORA framework, which accepts only certain
samples for training a classifier. Old samples that do not fit the current
window are not used for training. A method to weight old and new samples for
a two-layer network is introduced in [6], using a forgetting function which
reduces the influence of old samples in the training process.
Other approaches use not just one approximator but an ensemble. Examples
are Learn++ [7], dynamic weighted majority [8], incremental adaptive learning
[9], and iRGLVQ [10]. All ensemble-based approaches are similar in their use of
more than one model, where each model is trained with different data. They can
be distinguished by the basic model used and the way the ensemble members are
combined to compute an approximation.
When comparing these different approaches, two major problems arise: first,
every approach uses a different basic approximation model that is adapted
to concept drift. In [10] Learning Vector Quantisation (LVQ) is used, in [9,7]
Multi-Layer Perceptrons (MLPs), in [6] a combination of two subnetworks, in [5]
attribute-value logic, and in [8] an Incremental Tree Inducer (ITI) and Bayes
learners. The second problem is that most results are obtained on artificial data
or on real data with artificially induced concept drift.
In the following, we compare algorithms on industrial data, use the same basic
approximator (MLP), and compare the results of the different adaptation algo-
rithms. For purposes of illustration, and to allow other researchers to reproduce
our results, we also use three artificial datasets for benchmarking.
In Sec. 3 we showed that there are two paradigms for adapting a model to
concept drift: 1. data accumulation and 2. ensemble learning. In this section, we
propose algorithms that apply these paradigms to the training of Multi-Layer Percep-
trons as function approximators for the purpose of Model Predictive Control.
Every approach starts with the same preconditions: there is an initial dataset
Sinit with si ∈ Sinit, si = (x1, . . . , xm, y), where x1, . . . , xm are the input val-
ues/measurements for m input dimensions and y is the target value of the
respective sample. This dataset is used to train an MLP Ninit with the Levenberg-
Marquardt training algorithm.
The principle of data accumulation is to have one model which is adapted when
a certain amount of new data Snew is available. We applied three variations of
this concept:
– data acc.1: create a dataset Saccu = Sinit ∪ Snew and retrain Ninit with
dataset Saccu
– data acc.2: retrain Ninit only with dataset Snew; ignore old data
– data acc.3: split Snew into training data Snew^train and validation data Snew^val;
create a new MLP Nnew and train it with Snew^train;
if the approximation error of Nnew on Snew^val is lower than the approximation
error of Ninit on Snew^val, delete Ninit and use Nnew
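The three variants can be sketched as a single dispatch function, with `train` and `error` as placeholders for the Levenberg-Marquardt MLP training and validation error used in the paper (the function and argument names are our own):

```python
def retrain(strategy, train, error, net_init, s_init, s_new_train, s_new_val):
    """Apply one of the three data-accumulation variants to a new data batch.
    `train(net, data)` fits a fresh or existing model, `error(net, val)` returns
    its validation error; both stand in for the paper's MLP training."""
    if strategy == "acc1":                  # retrain on all data seen so far
        return train(net_init, s_init + s_new_train + s_new_val)
    if strategy == "acc2":                  # sliding window: new data only
        return train(net_init, s_new_train + s_new_val)
    if strategy == "acc3":                  # accept the new net only if it wins
        net_new = train(None, s_new_train)
        if error(net_new, s_new_val) < error(net_init, s_new_val):
            return net_new
        return net_init
```

For illustration, a "model" that simply stores the mean of its training data already exercises all three branches.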
Each of the three concepts is repeated every time a new dataset Snew becomes
available (the exact number of samples sufficient to form a new dataset depends
on the data used). The first concept uses all samples/information acquired over
time but may suffer from catastrophic interference as described in [4]. The second
concept is a very basic version of the sliding time window used in [5]. The third
concept tries to compensate for the drawbacks of the first two: a new (retrained)
model is only accepted if its results are better than those of the old model.
5 Experiments
In this section, we apply the adaptation algorithms explained in Sec. 4 to data
with concept drift. We use two different types of data. The first three data sets
are obtained from rotary kiln cement production plants. The target y for the
MLP prognosis is the free lime value which indicates the quality of the cement
produced [3]. Five to ten different measurements from each kiln, such as kiln
inlet temperature, secondary air temperature, raw meal feed, etc. (see [3] for a
detailed description of the measurements), are used as input values for a sample
si = (x1, . . . , xm, y). We use two years of data from each plant, which results in
2,100–4,000 samples, depending on the sampling rate of the laboratory (from three
up to eight hours) and the revision times of the plants.
For purposes of illustration, and to allow other researchers to reproduce
our results, we also use three simple artificial datasets. We generate a target y(t)
from five input signals x1(t), . . . , x5(t) as shown in Equation 1:
Each input xi(t) is randomly sampled from a Gaussian distribution N(0, 1).
We add noise d(t) sampled from N(0, 0.3). Linear concept drift is induced
by the parameter α, which changes over the simulation time. We apply three
different variations of the concept drift:
1. α changes linearly from 0.1 to 1 and is reset to 0.1 at certain times; this
corresponds to slagging in industrial plants, which grows over time but is set
back to a low level after a plant revision
2. α changes linearly from 0.1 to 1
3. α does not change, which corresponds to a process without concept drift
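The three drift schedules for α translate directly into code. Since Equation 1 is not reproduced in the text above, the sample generator below uses a hypothetical linear target as a stand-in, not the paper's actual formula:

```python
import numpy as np

def alpha_schedule(T, variant, period=None):
    """alpha over T simulated steps for the three drift variations described above.
    `period` (variant 1 only) is the number of steps between plant revisions."""
    t = np.arange(T)
    if variant == 3:                         # 3. no concept drift
        return np.full(T, 0.1)
    if variant == 2:                         # 2. linear drift from 0.1 to 1
        return 0.1 + 0.9 * t / (T - 1)
    # 1. linear drift, reset to 0.1 after every `period` steps (plant revision)
    return 0.1 + 0.9 * (t % period) / (period - 1)

def make_sample(alpha, rng):
    """Hypothetical stand-in for Eq. 1: a linear mixture of the five inputs
    whose first weight drifts with alpha, plus noise d(t) ~ N(0, 0.3)."""
    x = rng.standard_normal(5)               # x_i(t) ~ N(0, 1)
    d = rng.normal(0.0, 0.3)
    y = alpha * x[0] + x[1:].sum() + d
    return x, y
```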
Fig. 3. Progress of parameter α used to model concept drift in three artificial datasets
For the prognosis of the target, we use a Multi-Layer Perceptron with one
hidden layer of five neurons. The training algorithm is standard Levenberg-
Marquardt training as included in the Neural Network library of Matlab. All
networks (except for approach data acc.1) are trained/retrained with 250 samples
of data, of which the last 50 samples are used for validation. For data acc.1,
we use a growing training set which includes all samples available since the
start of retraining. For ensemble2, we apply a k-means clusterer with k = 10
to cluster the training set of each network. The test performed in ensemble1
to determine the best model is carried out with the last 50 samples observed.
No additional pruning algorithms are used, as we focus on the effects the
different adaptation algorithms have on the long-term prognosis error.
Table 1 shows the results on the different datasets.
Table 1. Median prognosis error eQ50% over 200 network-training trials. The two best
results for each data set are marked with a grey background. The number of plant
revisions (resets of the concept drift) and the sums of errors over the plant and
artificial datasets are also listed.

eQ50%       plant1  plant2  plant3  Σplant  art.data1  art.data2  art.data3  Σart.
no adapt.   0.635   0.871   0.782   2.288   0.309      0.334      0.135      0.778
data acc.1  0.492   0.766   0.714   1.972   0.220      0.217      0.124      0.561
data acc.2  0.701   0.773   0.768   2.242   0.201      0.156      0.124      0.481
data acc.3  0.520   0.801   0.779   2.100   0.249      0.167      0.134      0.550
ensemble1   0.478   0.795   0.749   2.022   0.193      0.168      0.134      0.495
ensemble2   0.524   0.850   0.793   2.167   0.306      0.275      0.185      0.766
revisions   3       4       0               3          0          0
For evaluation of the results, we repeated every simulation 200 times. The
mean prognosis error over the whole time period was calculated. Afterwards,
we compared the median eQ50% of all 200 trials for each concept and data set.
We chose the median rather than the mean because approximately 1% of the
trained networks produce a very high error due to disadvantageous initialisa-
tion, which would influence the mean error of all 200 simulations disproportionately.
Prognosis without adaptation of the network produces the worst result. This
was expected, as it does not counter the concept drift. ensemble2 also performs
very badly. This is a result of the imprecise representation of the input space we
chose with the k-means clustering. If a better method were found to map and
compare input/output relations in trained MLPs, this approach would surely
produce better results. The potential of ensembles is revealed by ensemble1,
which is the second-best method of the six approaches we tested. Only if the
concept does not change (art.data3), or if there is no revision of the plant included
in the data (art.data2, plant3), do the data accumulation approaches outperform
this ensemble approach.
Of the three data accumulation approaches, data acc.1 performs best
on real-world data. This is surprising, since data acc.1 uses all available data,
which results in ambiguous data due to the changing parameters (boiler slagging
in the plants and α in the artificial data). Nonetheless, the prediction obtained with
unambiguous data but fewer training samples is worse. We expect the results of
data acc.2/3 to improve if the sampling rate is increased and more samples
become available within the time window used.
On plant3 the differences between the approaches are smaller than on the other
plants. The reason is that in plant3, other sensor measurements than in plant1
and plant2 had to be used because of the plant architecture. Hence the overall
prognosis quality decreases and the differences between the adaptation concepts
disappear.
6 Conclusion
Concept drift influences the quality of neural network prognosis in industrial
combustion processes. Through growing boiler slagging and the use of different
fuels, the prognosis of important performance figures worsens if the networks
used are not adapted to the changing data. We applied different approaches to
adapt networks to concept drift over long time ranges. The best approach de-
pends on the type of concept drift. If the dynamics and properties of the plant
change very slowly and old states do not reappear, it is advantageous to use a
sliding-window technique and data accumulation to constantly retrain a single
network with new data.
If changes in the dynamics and properties appear very abruptly and old states
reappear (due to revisions of a plant or a small selection of used fuels), ensemble
learning with more than one model is superior to the other concepts.
Our future work concentrates on improving the use of ensemble methods.
Furthermore, we want to apply the approaches to other industrial MPC problems
and compare the results with those obtained on the cement plants.
References
1. Agachi, P.S., Nagy, Z.K., Cristea, M.V., Imre-Lucaci, A.: Model based control:
Case Studies in Process Engineering. Wiley-VCH (2006)
2. Tsymbal, A.: The problem of concept drift: Definitions and related work. Technical
Report, Department of Computer Science, Trinity College: Dublin, Ireland (2004)
3. Alsop, P.A.: The Cement Plant Operations Handbook, 5th edn. Tradeship Publi-
cations Ltd. (2007)
4. McCloskey, M., Cohen, N.: Catastrophic interference in connectionist networks:
The sequential learning problem. Psychology of Learning and Motivation 24, 109–
164 (1989)
5. Widmer, G., Kubat, M.: Learning in the presence of concept drift and hidden
context. Machine Learning 23(1), 69–101 (1996)
6. Pérez-Sánchez, B., Fontenla-Romero, O., Guijarro-Berdiñas, B.: An incremental
learning method for neural networks in adaptive environments. In: Int. Joint Conf.
on Neural Networks (IJCNN 2010), pp. 1–8 (2010)
7. Elwell, R., Polikar, R.: Incremental Learning of Concept Drift in Nonstationary
Environments. IEEE Transactions on Neural Networks 22(10), 1517–1531 (2011)
8. Kolter, J.Z., Maloof, M.A.: Dynamic weighted majority: A new ensemble method
for tracking concept drift. In: Proc. IEEE Int. Conf. on Data Mining (ICDM 2003),
pp. 123–130 (2003)
9. He, H.: Self-Adaptive Systems for Machine Intelligence. John Wiley & Sons (2011)
10. Kirstein, S., Wersing, H., Gross, H.-M., Koerner, E.: A life-long learning vector
quantization approach for interactive learning of multiple categories. Neural Net-
works 28, 90–105 (2012)
Simplifying ConvNets for Fast Learning
1 Introduction
Fig. 1. (a) a typical ConvNet architecture with two feature extraction stages; (b) Fusion
of convolution and sub-sampling layers
in convolutional layers. Thus, in the rest of this paper, we will only consider Ci
layers with identity activation function. We will also consider average pooling
layers Si performing a sub-sampling by two. For a Ci layer, its input map size
Win × Hin , its output map size Wi × Hi , and the following Si sub-sampled
output map size SWi × SHi are connected to the convolution kernel size Ki by:
(Wi , Hi ) = (Win − Ki + 1, Hin − Ki + 1) and (SWi , SHi ) = (Wi /2, Hi /2).
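The size relations above can be checked with a short sketch (the helper names are illustrative, not from the paper):

```python
def conv_valid_size(w_in, h_in, k):
    """Output map size of a 'valid' KxK convolution layer Ci."""
    return w_in - k + 1, h_in - k + 1

def subsample_size(w, h):
    """Output map size of an Si layer sub-sampling by two."""
    return w // 2, h // 2

# Example: a 32x32 input map through a 5x5 convolution, then sub-sampling by two
w1, h1 = conv_valid_size(32, 32, 5)   # (28, 28)
sw1, sh1 = subsample_size(w1, h1)     # (14, 14)
print((w1, h1), (sw1, sh1))
```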
Since these layers rely on local receptive fields, the complexity of the back-propagation delta-rule algorithm for a given element is proportional to its output map size and the cardinality of its connections with the following layer, that is, proportional to (Wi × Hi) for Ci layers and (SWi × SHi × Ki+1²) for Si layers.
Weight sharing in these layers implies a complexity of the weight update algorithm that is proportional to output map and kernel sizes: i.e. (Wi × Hi × Ki²) for Ci layers, and (SWi × SHi) for Si layers.
In the remainder of this section, we present our proposition to learn modified
ConvNets where Ci and Si layers are replaced by equivalent convolutional filters,
and compare the back-propagation complexity of these layers.
Fig. 2. (a) Separable convolution layers; (b) fused separable convolution and sub-sampling layers
Although separable filters are broadly used in image processing, as far as we know, no study has been published on learning separable filters within ConvNet architectures.
We thus propose to restrict the hypothesis space by using only separable convolutions in ConvNets, directly learning two successive 1D filters. Although horizontal and vertical filters commute in the feed-forward pass, back-propagation in ConvNets may lead to different trained weights. Thus, we will evaluate either a horizontal convolution Chi whose output map size is Wi × Hin, followed by a vertical one Cvi (Figure 2(a)), or a vertical convolution Cvi whose output map size is Win × Hi, followed by a horizontal one Chi. No activation function is used in Chi and Cvi layers.
We denote the first configuration Chi ∗ Cvi and the second Cvi ∗ Chi. The delta-rule complexity of the Chi ∗ Cvi configuration is proportional to (Wi Hin Ki + Wi Hi), since the Chi layer is connected to the Cvi layer, which is itself connected to the Si layer. The weight update complexity is proportional to (Wi (Hin + Hi) Ki). The complexity of the Cvi ∗ Chi configuration is obtained by exchanging W and H.
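A minimal numpy sketch of the Chi ∗ Cvi idea (helper names are hypothetical): a separable 2D filter is applied as two successive 1D passes, and the feed-forward output is the same whichever pass runs first, as the text notes.

```python
import numpy as np

def conv1d_h(x, kh):
    """'Valid' horizontal 1D convolution applied to every row (Chi)."""
    return np.stack([np.convolve(row, kh[::-1], mode='valid') for row in x])

def conv1d_v(x, kv):
    """'Valid' vertical 1D convolution applied to every column (Cvi)."""
    return conv1d_h(x.T, kv).T

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
kh, kv = rng.standard_normal(3), rng.standard_normal(3)

out_hv = conv1d_v(conv1d_h(img, kh), kv)   # Chi * Cvi
out_vh = conv1d_h(conv1d_v(img, kv), kh)   # Cvi * Chi
assert np.allclose(out_hv, out_vh)         # feed-forward passes commute
print(out_hv.shape)                        # (6, 6): (Hin-K+1, Win-K+1)
```

Back-propagation through the two orderings can nevertheless produce different trained weights, which is why both configurations are evaluated in the paper.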
The hypothesis space represented by these separable convolutional filters is a
more restricted set than the one of classical ConvNets.
Our third proposition is to combine the two previous kinds of filters and learn fused separable convolution and sub-sampling layers. These consist of either a horizontal convolution CShi with a horizontal step of two, whose output map size is SWi × Hin, followed by a vertical one CSvi with a vertical step of two and an activation function; or a vertical convolution CSvi with a vertical step of two, whose output map size is Win × SHi, followed by a horizontal one CShi with a horizontal step of two and an activation function.
We denote the first configuration CShi ∗ CSvi and the second CSvi ∗ CShi. The CShi ∗ CSvi configuration is depicted in Figure 2(b), underlining its equivalence with a traditional (Ci, Si) couple or a CSi layer.
62 F. Mamalet and C. Garcia
3 Experiments
The main goal of these experiments is not to propose novel convolutional ar-
chitectures for the following tasks, but to compare learning capabilities with
Fig. 3. Feature maps obtained with simplified convolutional filters (upper left: CSi ;
bottom left: CShi ∗ CSvi ; right: Chi ∗ Cvi )
A Modified Artificial Fish Swarm Algorithm
for the Optimization of Extreme Learning
Machines
Abstract. Neural networks have been widely applied to many real-world pattern classification problems. During the training phase, any neural network can suffer from a loss of generalization caused by overfitting, which strongly biases the learning process. In this work we use the Extreme Learning Machine (ELM), an algorithm for training single-hidden-layer neural networks, and propose a novel swarm-based method for optimizing its weights and improving generalization performance. The algorithm combines the basic Artificial Fish Swarm Algorithm (AFSA) with features from Differential Evolution (crossover and mutation) to improve the quality of the solutions during the search process. The results of the simulations demonstrate the good generalization capacity of the best individuals obtained in the training phase.
1 Introduction
Artificial Neural Networks (ANNs) have been widely applied in recent years. ANNs need to be trained by an algorithm to gather enough information about the solution space of a given problem. Traditional gradient descent algorithms such as backpropagation and Levenberg-Marquardt have low training speed and may easily get stuck in local minima [1]. An alternative is the Extreme Learning Machine (ELM) [2], a fast learning algorithm for training neural networks with a single hidden layer.
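As a reminder of how ELM works, here is a minimal numpy sketch (illustrative only, not the authors' code): the input weights and biases are drawn at random once, and only the output weights are solved in closed form with a pseudoinverse.

```python
import numpy as np

def elm_train(X, T, q, rng=np.random.default_rng(0)):
    """Train a single-hidden-layer network with q hidden neurons.
    W and b are fixed random input weights/biases; only the output
    weights beta are learned, via the Moore-Penrose pseudoinverse."""
    W = rng.standard_normal((X.shape[1], q))
    b = rng.standard_normal(q)
    H = np.tanh(X @ W + b)          # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ T    # least-squares output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Toy regression: approximate y = sin(x)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
T = np.sin(X)
W, b, beta = elm_train(X, T, q=20)
err = np.mean((elm_predict(X, W, b, beta) - T) ** 2)
print(err)
```

Because the hidden layer is random, the quality of a trained ELM depends heavily on the drawn input weights, which is exactly what the swarm-based optimization below targets.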
The knowledge in a given ANN is represented by its weights, and thus its performance will vary depending on how well those weights cover the solution space. Finding weights that improve the performance of an ANN is not an easy task, due to the many possible weight configurations. In the literature, many search algorithms have been applied to ANNs in order to find a set of weights that improves the performance of the system. Algorithms such as Particle Swarm Optimization (PSO) [3], Genetic Algorithms (GA) [3], and the Artificial Fish Swarm Algorithm (AFSA) [4] may be applied to find the best solution (set of weights) in the solution space.
The ELM optimization has been studied in [5,6,7,8] where algorithms such as
PSO, Differential Evolution (DE) and GA are applied.
In this work we propose a hybrid model based on the optimization of ELM
weights by a modified AFSA algorithm called Modified Artificial Fish Swarm
Algorithm for the Optimization of Extreme Learning Machines (MAFSA-ELM).
3 Proposed Method
The objective of the proposed algorithm is to find the best initial input weights for the ELM algorithm through a hybrid search algorithm. Since the AFSA algorithm does not produce good results in comparison with other existing algorithms [10], a modification to the original algorithm is proposed in this work.
68 J.F.L. de Oliveira and T.B. Ludermir
One of the limitations of the AFSA algorithm comes from the visual parameter, which allows a given fish to gather information only about other fish inside its visual field. With this restriction, some fish can sink into a local minimum while only a few reach a better region, and the absence of information from fish near a global minimum may reduce the search capacity of the algorithm.
The MAFSA-ELM has the basic AFSA behaviors as in [4]; however, the Crossover-Mutation phase from the DE algorithm is used as an additional behavior, without the influence of the visual and step parameters.
The configurable parameters of the algorithm are: the number of fish in the population N; the position of each fish Si, which in this work is defined as a weight matrix W and a bias vector B; an objective function Y = f(x), which is the accuracy rate of the ELM on the validation set; the visual range visual, which determines the capacity to locate neighbor fish; the number of neighbor fish nf; a distance measure Dij between fish; a swim measure step for each fish; the crowd factor δ of a given region; and the number of tries (trynumber) for the prey behavior.
The visual parameter will influence the behaviors of each fish determining
the environment conditions. From the definition of this parameter a fish can
determine whether the region is crowded. According to the current environment
conditions, the fish evaluates the results from the behaviors and chooses one to
execute and update the fish position.
Optimization algorithms such as PSO base their search process on previous movements and experiences of the population in the problem environment. The AFSA and MAFSA techniques, however, rely not only on previous movements but also on the current state of the population; thus they need more parameters to measure that state in order to determine how the movements will be performed.
The proposed method is based on the hybridization of the AFSA, DE, and ELM algorithms, called MAFSA-ELM, where each fish in the MAFSA algorithm represents an ELM network and the search is conducted through the behaviors of each fish. We also study the performance of the hybridization of AFSA and ELM alone, which we call AFSA-ELM. This algorithm does not include the Crossover-Mutation behavior, and it was implemented to verify its performance against the proposed method.
In order to determine whether a certain fish is within the visual field of another fish, the distance between them is calculated using a Euclidean distance. The output weights of the ELM algorithm are unknown a priori, hence the distance is calculated over the input weights and biases, as shown in Equation (1):

d_pq = sqrt( Σ_{i=1}^{Q} Σ_{j=1}^{n} (W_ij(p) − W_ij(q))² + Σ_{i=1}^{Q} (b_i(p) − b_i(q))² )   (1)
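The distance between two fish can be sketched as follows (a hypothetical helper, assuming each fish stores a Q×n input weight matrix W and a bias vector b of length Q, and that the distance is Euclidean as the text states):

```python
import numpy as np

def fish_distance(Wp, bp, Wq, bq):
    """Euclidean distance between two fish (ELM networks),
    computed over input weights and biases only, since the
    output weights are unknown a priori."""
    return np.sqrt(np.sum((Wp - Wq) ** 2) + np.sum((bp - bq) ** 2))

rng = np.random.default_rng(1)
Wp, Wq = rng.standard_normal((5, 3)), rng.standard_normal((5, 3))
bp, bq = rng.standard_normal(5), rng.standard_normal(5)
print(fish_distance(Wp, bp, Wq, bq))
```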
The original AFSA algorithm has some basic behaviors executed by each fish, whose fitness is calculated through the objective function Y = f(x). The behavior that produces the highest accuracy rate is the one used to update the fish position. The basic behaviors are: Follow, Swarm, Prey, and Leap.
3.1 Follow
Let S = [S1, S2, ..., SN] be a vector with the position of each fish in the swarm, and let Smax be the position found by the fish with the best food concentration. The updating process is shown in Equation (2), applied when the region is not crowded (nf/N < δ) and the best-positioned fish is in a region of better food concentration than the current fish (Ymax > Yi):

Follow(Si) = Si + rand() · step · (Smax − Si) / ||Smax − Si||   (2)

If these conditions are not satisfied, the update is performed as shown in Equation (3).
3.2 Prey
In this behavior, the fish chooses a position inside its visual field; if after a number of tries (trynumber) it does not find a region with better food concentration, it moves randomly. If the selected fish has a better food concentration, the update is performed as shown in Equation (4):

Prey(Si) = Si + rand() · step · (Sj − Si) / ||Sj − Si||   (4)

If the fish does not find a neighbor with better food concentration after trynumber tries, it moves randomly through the space, as shown in Equation (5):

Prey(Si) = Leap(Si)   (5)
3.3 Swarm
This behavior is based on the number of neighbor fish: a central position Sc is calculated using the positions of all fish. The updating process is shown in Equation (6), applied when the region is not crowded (nf/N < δ) and the central position is in a region of better food concentration than the current fish (Yc > Yi):

Swarm(Si) = Si + rand() · step · (Sc − Si) / ||Sc − Si||   (6)

Otherwise, the updating process is done as follows:

Swarm(Si) = Prey(Si)   (7)
3.4 Leap
The Leap behavior consists of random movements, independent of the rest of the swarm; it is a stochastic behavior of the fish:

Leap(Si) = Si + rand() · step   (8)
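The movement rules above share a common structure, which can be summarized in a compact sketch (illustrative only; fitness evaluation and parameter handling are simplified, and the random direction in Leap is an assumption):

```python
import numpy as np

rng = np.random.default_rng(2)

def leap(s, step):
    """Random move, independent of the rest of the swarm (Eq. 8);
    a random direction is assumed here."""
    return s + rng.random() * step * rng.standard_normal(s.shape)

def move_toward(s, target, step):
    """Shared update rule of Follow (Eq. 2), Prey (Eq. 4) and Swarm
    (Eq. 6): a random-length step along the unit vector toward the
    target position (Smax, Sj or Sc respectively)."""
    d = target - s
    return s + rng.random() * step * d / np.linalg.norm(d)

s = np.zeros(4)
best = np.ones(4)
print(move_toward(s, best, step=0.6))  # e.g. Follow toward the best fish
```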
3.5 Crossover-Mutation
This behavior is part of the MAFSA-ELM search strategy. In order to avoid sinking into local minima and to improve the performance of the algorithm, we randomly select three fish in the swarm and combine them using the basic Crossover and Mutation strategies of the DE algorithm [11]. This behavior is not restricted by the visual parameter, ensuring that any fish in the swarm can be selected and that the information necessary to escape possible local minima can be gathered. The step parameter does not influence this behavior either; thus the global search capacity of the algorithm is increased.
The mutation phase follows the standard DE formulation [11].
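The standard DE operators [11], as they would apply here, can be sketched as follows (the paper's exact variant may differ; the amplitude factor F and crossover rate CR match the experimental settings given below):

```python
import numpy as np

rng = np.random.default_rng(3)

def de_mutate_crossover(swarm, i, F=1.0, CR=0.5):
    """Pick three distinct fish at random (visual and step play no
    role here), build a mutant V = S_r1 + F*(S_r2 - S_r3), then
    binomially cross V with the current fish S_i."""
    n = len(swarm)
    r1, r2, r3 = rng.choice([j for j in range(n) if j != i], 3, replace=False)
    v = swarm[r1] + F * (swarm[r2] - swarm[r3])      # mutation
    mask = rng.random(v.shape) < CR                  # crossover mask
    mask[rng.integers(v.size)] = True                # keep at least one mutant gene
    return np.where(mask, v, swarm[i])

swarm = rng.standard_normal((30, 8))   # 30 fish, 8 weights each
trial = de_mutate_crossover(swarm, 0)
print(trial.shape)
```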
4 Experiments
The experiments were performed on four datasets from the UCI Machine Learning Repository [12]. The data from each dataset were randomly split into a training set (50%), a validation set (25%), and a test set (25%) for each of 30 iterations. For each iteration, all classifiers receive the same training, validation, and test sets. All attributes were normalized into the interval [0, 1]. The simulations were performed with 10, 15, and 20 hidden neurons, and the configuration that produced the best result was selected. The AFSA and MAFSA parameters were initialized based on several simulations with distinct settings, from which the best configuration was selected. The parameters are: number of fish N = 30, step = 0.6, crowd factor δ = 0.8, amplitude factor for the mutation F = 1, and crossover rate CR = 0.5. For the experiments using the PSO-ELM method, we used the same configuration presented in [5] with some modifications, to match the parameters of the other techniques such as population size and maximum number of iterations (C1 = 2, C2 = 2, w = 0.9, 30 particles, 50 iterations). The parameters of the E-ELM algorithm are the same as presented in [6]; however, the total number of individuals was increased to 30. Table 1 gives the following results on the test set: the mean accuracy rate, the standard deviation SD, and the number of hidden neurons Q.
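The evaluation protocol above can be sketched as follows (illustrative only; the toy data stands in for a UCI dataset):

```python
import numpy as np

def split_and_normalize(X, y, rng):
    """Random 50/25/25 split into training/validation/test sets,
    with attributes min-max normalized into [0, 1]."""
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    idx = rng.permutation(len(X))
    n_tr, n_va = len(X) // 2, len(X) // 4
    tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

rng = np.random.default_rng(0)
X, y = rng.random((214, 9)), rng.integers(0, 6, 214)   # Glass-sized toy data
train, val, test = split_and_normalize(X, y, rng)
print(len(train[0]), len(val[0]), len(test[0]))  # 107 53 54
```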
Table 1. Results for the Glass, Ionosphere, Sonar and Vehicle datasets

(a) Glass
Technique    Mean ± SD       Q
ELM          63.45 ± 5.39    20
RBF [1]      62.42 ± 4.21    20
LM [1]       52.55 ± 16.70   15
AFSA-ELM     64.21 ± 5.34    20
MAFSA-ELM    64.96 ± 4.87    20
E-ELM        63.52 ± 5.97    20
PSO-ELM      64.65 ± 5.77    15

(b) Ionosphere
Technique    Mean ± SD       Q
ELM          84.24 ± 3.41    20
RBF [1]      84.65 ± 2.39    20
LM [1]       88.16 ± 11.36   20
AFSA-ELM     87.15 ± 4.08    20
MAFSA-ELM    88.10 ± 4.06    20
E-ELM        88.06 ± 4.22    20
PSO-ELM      85.11 ± 3.70    20
For all datasets we used the Wilcoxon signed-rank hypothesis test at the 5% significance level for statistical comparison of the results. The results for the Glass dataset (Figure 1a and Table 1a) show that the proposed technique achieved lower validation errors. Through the hypothesis test we concluded that PSO-ELM and MAFSA-ELM achieved similar results, and that both were superior to the remaining methods.
In the Ionosphere dataset (Figure 1b and Table 1b), the MAFSA-ELM and E-ELM methods performed similarly in test-set accuracy and validation error. In this dataset the MAFSA-ELM technique also outperformed the traditional AFSA-ELM algorithm.
In the Sonar dataset (Figure 1c and Table 1c), the MAFSA-ELM, E-ELM, and PSO-ELM methods achieved similar results; however, the MAFSA-ELM algorithm obtained the lowest validation error. In this dataset MAFSA-ELM also achieved better results than AFSA-ELM. On the Vehicle dataset (Figure 1d and Table 1d), PSO-ELM, MAFSA-ELM, and E-ELM achieved similar classification accuracies, and PSO-ELM and MAFSA-ELM had similar validation errors.
References
1. Haykin, S.: Neural networks: a comprehensive foundation. Prentice Hall PTR, Up-
per Saddle River (1994)
2. Huang, G.B., Wang, D.H., Lan, Y.: Extreme learning machines: a survey. Interna-
tional Journal of Machine Learning and Cybernetics, 1–16 (2011)
3. Engelbrecht, A.P.: Fundamentals of computational swarm intelligence, vol. 1. Wi-
ley, NY (2005)
4. Wang, C.R., Zhou, C.L., Ma, J.W.: An improved artificial fish-swarm algorithm
and its application in feed-forward neural networks. In: Proceedings of 2005 Inter-
national Conference on Machine Learning and Cybernetics, vol. 5, pp. 2890–2894.
IEEE (2005)
5. Xu, Y., Shu, Y.: Evolutionary extreme learning machine–based on particle swarm
optimization. In: Advances in Neural Networks, ISNN 2006, pp. 644–652 (2006)
6. Zhu, Q.Y., Qin, A.K., Suganthan, P.N., Huang, G.B.: Evolutionary extreme learn-
ing machine. Pattern Recognition 38(10), 1759–1763 (2005)
7. Saraswathi, S., Sundaram, S., Sundararajan, N., Zimmermann, M., Nilsen-Hamilton, M.: ICGA-PSO-ELM approach for accurate multiclass cancer classification. IEEE/ACM Transactions on Computational Biology and Bioinformatics 8(2), 452–463 (2011)
8. Qu, Y., Shang, C., Wu, W., Shen, Q.: Evolutionary fuzzy extreme learning machine
for mammographic risk analysis. International Journal of Fuzzy Systems 13(4)
(2011)
9. Rao, C.R., Mitra, S.K.: Generalized inverse of matrices and its applications. Wiley,
NY (1971)
10. Yazdani, D., Nadjaran Toosi, A., Meybodi, M.: Fuzzy Adaptive Artificial Fish
Swarm Algorithm. In: Li, J. (ed.) AI 2010. LNCS, vol. 6464, pp. 334–343. Springer,
Heidelberg (2010)
11. Storn, R., Price, K.: Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization 11(4), 341–359 (1997)
12. Blake, C.L., Merz, C.J.: UCI repository of machine learning databases (1998)
Robust Training of Feedforward Neural Networks
Using Combined Online/Batch Quasi-Newton Techniques
Hiroshi Ninomiya
1 Introduction
Neural network techniques have been recognized as a useful tool for function approximation problems with high nonlinearity [1]. For example, they are useful for microwave modeling and design, in which neural networks can be trained on Electro-Magnetic (EM) data over a range of geometrical parameters; the trained networks then become models providing fast solutions of the EM behavior [2][3].
Training is the most important step in developing a neural network model. Gradient-based algorithms such as backpropagation and quasi-Newton are popularly used for this purpose [1]. For a given set of training data, a gradient algorithm operates in one of two modes: online (stochastic) or batch. In the online mode, the synaptic weights of all neurons in the network are adjusted sequentially, pattern by pattern. In the batch mode, by contrast, the adjustments to all synaptic weights are made over the whole set of training data, with the result that a more accurate estimate of the gradient vector is utilized. Despite its disadvantages, the online form is the most frequently used for training multilayer perceptrons, particularly for large-scale problems, and it also has better global search ability than batch-mode training, being less prone to getting trapped in local minima [1]. The quasi-Newton method, one of the most efficient optimization techniques [4], is widely utilized as a robust training algorithm for highly nonlinear function approximation using
feedforward neural networks [1]-[3]. Most of these methods operate in batch mode. On the other hand, the online quasi-Newton training algorithm referred to as oBFGS (online BFGS, based on the Broyden-Fletcher-Goldfarb-Shanno formula [4]) was introduced in [5] as an algorithm for machine learning with huge training data sets. This algorithm works with gradients obtained from small subsamples (mini-batches) of the training data and can greatly reduce computational requirements on huge, redundant data sets. However, when applied to highly nonlinear function modeling and neural network training, oBFGS still converges too slowly, and the optimization error cannot be effectively reduced within finite time in spite of its advantages [6].
Recently, the Improved Online BFGS (ioBFGS) was developed for neural network training [6]. The gradient in ioBFGS is calculated from a variable number of training samples: the samples used for a weight update are automatically increased from a mini-batch to all samples as the quasi-Newton iteration progresses. That is, ioBFGS gradually changes from online to batch during the iteration. This algorithm overcomes both the problem of local minima in prevailing quasi-Newton batch-mode training and the slow convergence of existing stochastic-mode training.
This paper describes a robust quasi-Newton-based training algorithm in which the online and batch error functions are combined by a weighting coefficient. This coefficient is adjusted so that the algorithm gradually changes from online to batch; in other words, the transition from online to batch is parameterized over the quasi-Newton iteration. The parameterized method not only has an effect similar to ioBFGS, but also facilitates the analysis of the algorithm through an analogy with the Langevin algorithm, a gradient-based continuous optimization method incorporating the Simulated Annealing (SA) concept [7]. This technique, called poBFGS (Parameterized Online BFGS), substantially improves the quality of solutions during global optimization compared with the other quasi-Newton-based algorithms. The algorithm is tested on several function approximation problems with high nonlinearity.
E = (1/N_T) Σ_{p∈T} E_p, and E_p = (1/2) e_p²,   (2)

where T denotes a training data set and N_T is the number of sample pairs within T.
Training is the most important step in developing a neural network model.
Gradient-based algorithms such as backpropagation and the quasi-Newton method are popularly used for this purpose [1]. Among the gradient-based algorithms, the objective function of (2) is minimized by the following iterative formula:

w_{k+1} = w_k − μ_k g_k,   (3)

where k is the iteration count and g_k is the gradient vector. The gradient vectors of the online and batch training algorithms are defined as g = ∂E_p/∂w and g = ∂E/∂w, respectively. The learning rate μ is either a positive number, for backpropagation, or a positive definite matrix H, for the (quasi-)Newton method. The quasi-Newton method is considered in this paper because the method, in which the positive definite matrix H is updated using the BFGS formula, is one of the most efficient optimization algorithms [4] and a commonly used training method for highly nonlinear function problems [1]-[3].
Most quasi-Newton methods operate in batch mode. The batch BFGS (BFGS) depends on the initial values of w, giving good results only if the initial guess is suitable. On the other hand, oBFGS, in which the training data set is divided into Seg subsamples (mini-batches), was introduced in [5]. Seg denotes the number of mini-batches; a mini-batch is called a "segment" in this paper and includes N_T/Seg training samples. The gradient g of oBFGS is then calculated from the training samples in a segment, and the positive definite matrix H is updated using the BFGS formula. oBFGS improved efficiency over BFGS for convex optimization with large data sets, as reported in [5]. However, oBFGS still converges too slowly, and the optimization error cannot be effectively reduced within finite time, when applied to highly nonlinear function modeling and neural network training. A notable recent advance in this direction is the improved online quasi-Newton training method called Improved Online BFGS (ioBFGS) [6]. ioBFGS combines the following aspects of the two existing BFGS variants. First, in the early stage of training, the weight vector is updated using a mini-batch. Next, the mini-batch size of oBFGS is gradually increased by overlapping multiple segments. Finally, a mini-batch includes all training samples, and the algorithm becomes batch BFGS. The details of this strategy for increasing the mini-batch size are given in [6]. ioBFGS can thus make use not only of the strong global search ability of online BFGS, namely its capability to avoid local minima, but also of the strong local search ability of batch BFGS, by systematically combining online with batch. In this paper, a robust training algorithm built on the same concept as ioBFGS, that is, changing from online to batch, is described for highly nonlinear function modeling. In this algorithm, online and batch error functions are combined by a weighting coefficient, which is adjusted so that the algorithm gradually changes from online to batch; in other words, the transition from online to batch is parameterized over the quasi-Newton iteration. This algorithm not only has an effect similar to ioBFGS, but also facilitates analysis through an analogy with the Langevin algorithm (LA), a gradient-based continuous optimization method incorporating the SA concept [8]. This algorithm is referred to as Parameterized Online BFGS (poBFGS).
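The key idea, a weighting coefficient that slides the objective from online to batch, can be sketched as follows (conceptual only: the λ schedule, the segment choice, and the surrounding BFGS machinery are placeholders, not the paper's exact formulation):

```python
import numpy as np

def combined_error(w, segments, error_fn, lam):
    """E(w) = lam * E_online(segment) + (1 - lam) * E_batch.
    As lam decays toward 0 with the iteration count, the objective
    gradually changes from online (one mini-batch) to batch (all
    samples), mirroring the online-to-batch transition of poBFGS."""
    seg = segments[np.random.randint(len(segments))]  # one mini-batch
    e_online = error_fn(w, seg)
    e_batch = np.mean([error_fn(w, s) for s in segments])
    return lam * e_online + (1.0 - lam) * e_batch

def lam_schedule(k, c=0.9):
    """Example annealing-style schedule: lam decays geometrically
    with the quasi-Newton iteration count k (c is a guess)."""
    return c ** k
```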
4 Simulation Results
to all algorithms with different starting values of w. Each trained neural network was evaluated by the average training error and the average computational time in seconds.
<example A> First of all, the following functions are considered [6][10][11][12]:

f1(x1, x2) = 1.9 [1.35 + e^{x1} sin(13 (x1 − 0.6)²) e^{−x2} sin(7 x2)]   (11)

f2(x1, x2) = 1.3356 [1.5 (1 − x1) + e^{2x1−1} sin(3π (x1 − 0.6)²) + e^{3(x2−0.5)} sin(4π (x2 − 0.9)²)]   (12)

f3(x) = (π/n) [10 sin²(π x1) + Σ_{i=1}^{n−1} (xi − 1)² (1 + 10 sin²(π x_{i+1})) + (xn − 1)²],  xi ∈ [−4, 4]   (13)
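The benchmark functions of (11)–(13) can be written as a short sketch; since the extracted equations are damaged, these forms are taken from the standard definitions of the benchmarks in [10][11][12] and should be treated as assumptions:

```python
import numpy as np

def f_interaction(x1, x2):
    """Complicated Interaction function (Eq. 11)."""
    return 1.9 * (1.35 + np.exp(x1) * np.sin(13 * (x1 - 0.6) ** 2)
                  * np.exp(-x2) * np.sin(7 * x2))

def f_additive(x1, x2):
    """Additive function (Eq. 12)."""
    return 1.3356 * (1.5 * (1 - x1)
                     + np.exp(2 * x1 - 1) * np.sin(3 * np.pi * (x1 - 0.6) ** 2)
                     + np.exp(3 * (x2 - 0.5)) * np.sin(4 * np.pi * (x2 - 0.9) ** 2))

def f_levy(x):
    """Levy function (Eq. 13), x_i in [-4, 4]; highly multimodal."""
    x = np.asarray(x, dtype=float)
    n = x.size
    return (np.pi / n) * (10 * np.sin(np.pi * x[0]) ** 2
                          + np.sum((x[:-1] - 1) ** 2
                                   * (1 + 10 * np.sin(np.pi * x[1:]) ** 2))
                          + (x[-1] - 1) ** 2)

print(f_levy(np.ones(5)))  # the global minimum of this Levy variant is at x = 1
```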
f1 and f2 are 2-dimensional benchmark problems, referred to as the Complicated Interaction (Fig. 1) and Additive (Fig. 2) functions, respectively [10][11]. In the f1 and f2 problems, the training set includes 1,680 samples within [−1, 1]². f3 is the Levy function [12]. This function is commonly used as a benchmark problem for multimodal function optimization; that is, it has a huge number of local minima, as shown in Fig. 3 even for n = 2. As a result, the Levy function can be regarded as a highly nonlinear function for neural network modeling. Moreover, its input dimension n can be chosen arbitrarily. Therefore two Levy examples are considered here, with parameters (n, N_T) = (5, 1,000) and (10, 2,000), respectively. The numbers of hidden neurons for the four problems are 27, 9, 20, and 40, respectively. The maximum iteration count is 2 10 for all algorithms. The cooling parameters of laBFGS and poBFGS are experimentally set to 0.7 for f1 and f2, and 0.2 for the two Levy problems. The simulation results are illustrated
in Table 1. Several mini-batch sizes were tested for oBFGS and ioBFGS; a mini-batch includes N_T/Seg training samples. The table shows that poBFGS and laBFGS/G obtain slightly smaller errors than BFGS, and results similar to ioBFGS, on the first problem. However, poBFGS and ioBFGS reduce the error compared with BFGS and laBFGS/G on the remaining problems, without taking extra computational time. The results of ioBFGS and poBFGS are better than those of the other BFGS-based algorithms, indicating that the strategy of increasing the number of training samples is effective for these problems. On the other hand, oBFGS cannot obtain sufficiently small errors; that is, oBFGS easily gets stuck in local minima on these problems, although its computational times are short. Furthermore, Gaussian random noise is shown to be more effective than uniform noise for laBFGS.
Fig. 1. The Complicated Interaction function. Fig. 2. The Additive function. Fig. 3. The Levy function (n = 2). Fig. 4. The example B function.
<example B> Next, (14) is used as a function approximation problem with high nonlinearity [6][13]:

, , 1 2 sin 4,4 , (14)

where x, y, and z are the input variables of the neural network. (14) can be regarded as a highly nonlinear function because the three variables (x, y, and z) appear inside a sine function as frequency elements. In particular, the EM behavior of microwave circuits is quite similar to such test functions [3][6]. For (14), three examples are considered, and two maximum iteration counts are used for each example, namely 2 10 and 1 10.
Table 2. Results for example B: average error and computational time (s), shown as smaller-count/larger-count values for the two maximum iteration counts

Algorithm     Error (ex. 1)  Time s (ex. 1)  Error (ex. 2)  Time s (ex. 2)  Error (ex. 3)  Time s (ex. 3)
BFGS          2.52/1.82      102/483         2.05/2.03      2363/9163       0.829/0.748    8324/41611
ioBFGS(10)    1.23/1.15      111/556         1.10/0.909     1708/9068       4.20/0.806     7416/45902
laBFGS/G      3.54/2.13      117/550         1.53/1.56      1901/9208       0.764/0.376    8248/41094
poBFGS        1.01/0.599     105/581         1.75/0.573     1853/10652      0.796/0.321    8355/41749
parameters of laBFGS/G and poBFGS are set to 0.7 and 0.9, respectively. In the second example (Fig. 4), the network has two inputs, x and y; here x, y ∈ [−1, 1] are variables and z is fixed to 0. The neural network is 2-45-1, and the training set includes 3,320 training points. The cooling parameters are the same as in the first example. A more complicated problem is used as the third example: the structure of the network is 3-55-1, with x, y ∈ [−1, 1] and z ∈ [−0.5, 0.5] as variables, and the training set includes 10,080 training points. The cooling parameters are set to 0.05 and 0.5, respectively. The mini-batch of ioBFGS is set to 10 for
three examples. The simulation results are illustrated in Table 2. and s are
shown as ⁄ for each maximum iteration count in this table. From the
table, it is demonstrated for that poBFGS can obtain the smallest error among
the tested algorithms for each maximum iteration count, and . This
indicates that the quality of solution by poBFGS can be more certain than other
algorithms with respect to randomly chosen starting point. From the results of ,
when the maximum iteration count is , the smallest error can be obtained by
using ioBFGS(10). On the other hand, the minimum error was obtained by poBFGS
when the maximum iteration count was . This means that poBFGS has strong
ability to search a global minimum covering a wider solution area and without being
trapped into local minimum by taking much
more iteration. Namely, poBFGS is a robust Table 3. Comparison training with
algorithm less dependent on the choice of testing errors of and
initial guesses of . Furthermore, it is
confirmed that the error of by poBFGS 100 400 1,680 3,320
was also the smallest among four algorithms. 0.335 1.82 1.46 2.03
As a result, poBFGS can obtain the small BFGS
2.80 1.66 2.74 1.81
training errors with much certainty while it is 0.190 0.599 0.589 0.573
poBFGS
impossible for the other algorithms to obtain 1.62 0.558 2.51 0.545
such small training errors.
<example C> Finally, the generalization abilities of neural networks trained using
BFGS and poBFGS are studied. In this simulation, new training data sets, which are
smaller than those of ex. B, are generated; their sizes are set to 100 and 1,680,
respectively, and the neural networks have the same structures as in ex. B. Each
trained neural network is evaluated by its testing error. The testing error denotes
the generalization ability, that is, the ability of a trained neural network to
produce estimates for inputs never seen during training. It is calculated by (2)
using a data set of 10,000 randomly selected input-output pairs. The training and
testing errors of the trained neural networks are presented in Table 3.
82 H. Ninomiya
From the table, the testing errors obtained with the small training data sets are
larger than those of the networks trained in ex. B, although the training errors
are low compared with the results of ex. B. On the other hand, the testing errors
of the networks trained in ex. B (400 and 3,320 training points) are almost exactly
the same as the training errors. This means that the small data sets used here are
insufficient for useful neural models, and an adequate number of training samples
is necessary to generate an accurate neural model. At the same time, the increased
number of training samples makes neural network training especially difficult.
As a result, poBFGS is practical and useful for highly nonlinear function modeling
and neural network training.
5 Conclusions
In this paper we have presented a robust training technique for feedforward neural
networks. The technique combines the global optimization capability of the online
BFGS method with the fast and strong local search capability of the batch BFGS
method. Furthermore, an analogy between the proposed algorithm and the Langevin
algorithm was considered. The method overcomes the problem of local minima in
conventional gradient-based neural network training. It is robust, and provides
high-quality training and testing solutions regardless of the initial values. It
thus helps provide accurate neural network models for highly nonlinear function
approximation problems.
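The two-stage idea can be sketched in code. The following is a minimal illustration, not the authors' poBFGS implementation: a noisy mini-batch phase with a cooling noise level stands in for the online, globally exploring stage, and SciPy's L-BFGS-B routine stands in for the batch BFGS refinement. The network size, data, learning rate, and cooling schedule are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.uniform(-np.pi, np.pi, (200, 1))
Y = np.sin(X)

def unpack(w):  # parameters of a 1-5-1 network from a flat vector
    W1, b1 = w[:5].reshape(5, 1), w[5:10]
    W2, b2 = w[10:15].reshape(1, 5), w[15]
    return W1, b1, W2, b2

def loss(w, X, Y):
    W1, b1, W2, b2 = unpack(w)
    H = np.tanh(X @ W1.T + b1)          # hidden layer
    out = H @ W2.T + b2
    return float(np.mean((out - Y) ** 2))

def grad(w, X, Y, eps=1e-6):            # numerical gradient, for brevity
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w); e[i] = eps
        g[i] = (loss(w + e, X, Y) - loss(w - e, X, Y)) / (2 * eps)
    return g

w = rng.normal(0, 0.5, 16)
w0 = w.copy()                           # remember the random initial guess
# Stage 1: noisy mini-batch search with a cooling noise level (global phase).
noise = 0.1
for t in range(200):
    idx = rng.choice(len(X), 10, replace=False)        # mini-batch
    w -= 0.2 * grad(w, X[idx], Y[idx]) + noise * rng.normal(size=w.size)
    noise *= 0.99                                      # cooling
# Stage 2: deterministic full-batch quasi-Newton refinement (local phase).
res = minimize(loss, w, args=(X, Y), jac=grad, method="L-BFGS-B")
print(loss(w0, X, Y), loss(res.x, X, Y))
```

The batch phase can only decrease the error from wherever the noisy phase left off, which mirrors the global-then-local division of labour described above.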
In the future, the validity of the proposed algorithm for real-world problems
such as microwave circuit modeling [2][3][6] will be demonstrated.
References
1. Haykin, S.: Neural Networks and Learning Machines, 3rd edn. Pearson (2009)
2. Zhang, Q.J., Gupta, K.C., Devabhaktuni, V.K.: Artificial neural networks for RF and
microwave design-from theory to practice. IEEE Trans. Microwave Theory and Tech. 51,
1339–1350 (2003)
3. Ninomiya, H., Wan, S., Kabir, H., Zhang, X., Zhang, Q.J.: Robust training of microwave
neural network models using combined global/local optimization techniques. In: IEEE
MTT-S International Microwave Symposium (IMS) Digest, pp. 995–998 (June 2008)
4. Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer (2006)
5. Schraudolph, N.N., Yu, J., Günter, S.: A stochastic quasi-Newton method for online
convex optimization. In: Proc. 11th Intl. Conf. Artificial Intelligence and Statistics (2007)
6. Ninomiya, H.: An improved online quasi-Newton method for robust training and its
application to microwave neural network models. In: Proc. IEEE&INNS/IJCNN 2010, pp.
792–799 (July 2010)
7. Gelfand, S.B., Mitter, S.K.: Recursive stochastic algorithms for global optimization
in R^d. SIAM J. Control and Optimization 29(5), 999–1018 (1991)
8. Corana, A., Marchesi, M., Martini, C., Ridella, S.: Minimizing multimodal functions of
continuous variables with the Simulated Annealing algorithm. ACM Trans. Math.
Soft. 13(3), 262–280 (1987)
9. Rögnvaldsson, T.: On Langevin updating in multilayer perceptrons. Neur. Comp. 6(5),
916–926 (1994)
Robust Training of Feedforward Neural Networks 83
10. Kwok, T.Y., Yeung, D.Y.: Objective functions for training new hidden units in
constructive neural networks. IEEE Trans. Neural Networks 8(5), 630–645 (1997)
11. Ma, L., Khorasani, K.: New training strategies for constructive neural networks with
application to regression problems. Neural Networks 17, 589–609 (2004)
12. Levy, A., Montalvo, A., Gomez, S., Calderon, A.: Topics in global optimization. Lecture
Notes in Mathematics, vol. 909. Springer, New York (1981)
13. Benoudjit, N., Archambeau, C., Lendasse, A., Lee, J., Verleysen, M.: Width optimization
of the Gaussian kernels in radial basis function networks. In: Proc. Eur. Symp. Artif.
Neural Netw., pp. 425–432 (April 2002)
Estimating a Causal Order among Groups
of Variables in Linear Models
1 Introduction
Many techniques have recently been developed for inferring causal relationships
from data over a set of random variables [1,2,3,4,5,6,7,8,9]. While most of this
work has focused on uncovering connections among scalar random variables,
in many actual cases each of the variables of interest may consist of multiple
related, but distinct, measurements. For instance, in fMRI data analysis one
is often interested in the functional connectivity among brain regions, and for
each such region-of-interest one has data measured from a set of multiple voxels.
Typically, in these cases, some aggregate of each area is computed, after which
the standard approaches are directly applicable. However, it can be shown that
not only may information be lost when computing aggregates, but the outputs
of such methods may not even be correct in the large sample limit.
A simple example illustrating one of the problems inherent with working with
aggregates is the following. Consider three sets of variables with causal connec-
tions X → Y → Z, i.e. the variables in X may influence the variables in Y, but
not directly the variables in Z, and the variables in Y may influence the variables
in Z. In this case, each variable x ∈ X is independent of each z ∈ Z conditional
on the full set of mediating variables Y. However, when replacing the variables
of each group with their respective mean value (the typical aggregate used),
denoted by x̄, ȳ, and z̄, in general x̄ and z̄ are not conditionally independent
given ȳ [1,10]. Thus, it is important to develop methods for causal discovery that
exploit the full information available, as opposed to only aggregates of the data.
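The loss of conditional independence under aggregation can be checked numerically. The sketch below (dimensions and coefficients are illustrative choices, not taken from the paper) builds groups X → Y → Z with two variables each and compares the partial correlation of x1 and z1 given the full mediating group Y with the one given only the group mean: the former is near zero, the latter clearly is not.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
x1, x2 = rng.normal(size=(2, n))
y1 = x1 + rng.normal(size=n)                     # Y depends on X
y2 = x2 + rng.normal(size=n)
z1 = 2.0 * y1 + 0.2 * y2 + rng.normal(size=n)    # Z depends on Y only

def residual(v, conditioners):
    """Residual of v after OLS regression on the given conditioning variables."""
    C = np.column_stack([np.ones(len(v))] + list(conditioners))
    beta, *_ = np.linalg.lstsq(C, v, rcond=None)
    return v - C @ beta

# Partial correlation of x1 and z1: full group Y versus its mean only.
p_full = np.corrcoef(residual(x1, [y1, y2]), residual(z1, [y1, y2]))[0, 1]
ybar = (y1 + y2) / 2
p_mean = np.corrcoef(residual(x1, [ybar]), residual(z1, [ybar]))[0, 1]
print(p_full, p_mean)   # conditioning on the mean leaves clear dependence
```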
Towards this end, in this paper we extend two existing approaches [5,6] de-
signed for causal discovery among scalar random variables to the case of random
vectors (i.e. groups of variables), both exploiting any kind of non-Gaussianity
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 84–91, 2012.
c Springer-Verlag Berlin Heidelberg 2012
present in the data. We also extend a recent method [8] for inferring the causal
relationship among two arbitrarily distributed multi-dimensional variables to an
arbitrary number of such variables. After describing the resulting algorithms, we
evaluate and compare their performance in numerical simulations.
with B_{k_i,k_j} arbitrary (real) matrices of dimension n_{k_i} × n_{k_j}, containing the
direct effects from group X_{k_j} to group X_{k_i}. The vectors of disturbance terms
e_{k_i} are assumed to be zero mean, and mutually independent over groups, i.e.
e_{k_i} ⊥⊥ e_{k_j} for i ≠ j, but are allowed to be dependent within each group. If we
arrange the groups in a causal order K and define x = (x_{k_1}, ..., x_{k_G}) and e =
(e_{k_1}, ..., e_{k_G}), we can rewrite Equation (1) in matrix form as x = Bx + e with B
a lower block triangular matrix. The model reduces to standard LiNGAM (Linear
Non-Gaussian Acyclic Model, [4,5]) when ∀g : n_g = 1 and all disturbances e are
non-Gaussian. It also includes the model of [6] when G = 2, n_1 = n_2 = 1 and the
disturbances are non-Gaussian. Finally, it contains as a special case the noisy
model of [8] when G = 2 (but with no restriction on the n_g and e).
We assume that all variables in x are observed, and that the grouping of these
variables is known. Given merely observations of x generated by Model (1) (i.e.
B and e are unknown), we want to estimate the unknown causal order K among
these groups. We denote the data matrix of observations over the variables x as
X = (X1 , . . . , XG )T , where each column corresponds to one observation and each
row to one variable. The observations are grouped according to the G groups,
arranged in a random order, such that the first n1 rows correspond to group X1 ,
the following n2 rows to group X2 , and so on.
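Data from this model family can be generated directly from the definitions above; a sketch, with group sizes, distributions, and coefficient scales chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
sizes = [2, 3, 2]                       # n_g for the G = 3 groups, in causal order
G, p = len(sizes), sum(sizes)
starts = np.cumsum([0] + sizes)

# Block lower triangular connection matrix B: group i receives from groups j < i.
B = np.zeros((p, p))
for i in range(G):
    for j in range(i):
        B[starts[i]:starts[i+1], starts[j]:starts[j+1]] = \
            rng.normal(0, 0.5, (sizes[i], sizes[j]))

# Within-group dependent, non-Gaussian disturbances: linearly mixed Laplace sources.
n = 1000
e = np.concatenate(
    [rng.normal(0, 1, (ng, ng)) @ rng.laplace(size=(ng, n)) for ng in sizes], axis=0)

X = np.linalg.solve(np.eye(p) - B, e)   # x = Bx + e  =>  x = (I - B)^{-1} e

perm = rng.permutation(G)               # hide the causal order by permuting groups
rows = np.concatenate([np.arange(starts[g], starts[g+1]) for g in perm])
X_obs = X[rows]
print(X_obs.shape)
```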
We note that our model family is equivalent to that given by [7]. The main
difference between our approach and theirs is that they do not assume to know
which variable belongs to what group, which results in algorithms exponential in
the number of involved variables, whereas our algorithm explicitly builds upon
such knowledge, allowing to construct computationally and statistically more
efficient algorithms, polynomial in the number of groups.
86 D. Entner and P.O. Hoyer
3 Method
The overall algorithm for finding a causal order among the groups follows the
approach introduced in [5]. We first search for an exogenous group (Section 3.1),
and then ‘regress out’ the effect of this group on all other groups (Section 3.2).
We iterate this process to generate a full causal order over the G groups.
To obtain the p-values p_ji we can test for joint dependence of the two vectors
x_j and r_i^(j) using the Hilbert Schmidt Independence Criterion (HSIC, [12]),
which, however, requires many samples to detect dependencies for high-dimensional
vectors. Alternatively, we can perform pairwise tests of each variable in
x_j against each variable in r_i^(j) using nonlinear correlations, and combine the
resulting n_j × n_i p-values appropriately. Details are left to the online appendix.
As pointed out in Section 2, this is just a special case of our more general model.
The (normalized) ratio of the log likelihoods for the two possible causal models is
given by R(x, y) = (log L(x → y) − log L(y → x)) /m, where m is the sample size
and L the likelihood of the specified direction, under some suitable assumption
on the distributions of the disturbances. If the true underlying causal direction
is x → y, then R(x, y) > 0 in the large sample limit. Symmetrically, if x ← y,
then R(x, y) < 0 in the limit.
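For a single pair of scalars, R can be computed by fitting both regressions and evaluating the likelihoods under an assumed disturbance distribution. The sketch below uses a Laplace assumption, which is our illustrative choice; the text only requires "some suitable assumption on the distributions of the disturbances".

```python
import numpy as np

def laplace_loglik(u):
    """Mean log-likelihood of a sample under a fitted Laplace density."""
    mu = np.median(u)
    b = np.mean(np.abs(u - mu)) + 1e-12
    return -np.log(2 * b) - np.mean(np.abs(u - mu)) / b

def R(x, y):
    """Normalized log-likelihood ratio of the models x -> y and y -> x."""
    bxy = np.dot(x, y) / np.dot(x, x)        # OLS slope for y = b x + e
    byx = np.dot(y, x) / np.dot(y, y)        # OLS slope for x = b y + e
    ll_xy = laplace_loglik(x) + laplace_loglik(y - bxy * x)
    ll_yx = laplace_loglik(y) + laplace_loglik(x - byx * y)
    return ll_xy - ll_yx                     # mean log-liks: already per sample

rng = np.random.default_rng(3)
x = rng.laplace(size=50000)
y = 0.8 * x + rng.laplace(size=50000)
print(R(x, y) > 0, R(y, x) < 0)              # positive for the true direction
```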
To use the ratio R(·, ·) to find an exogenous group X_j, the naïve approach is to
calculate R(x_k^(j), x_l^(i)) for each pair with x_k^(j) ∈ X_j, k = 1, ..., n_j,
x_l^(i) ∈ X_i, l = 1, ..., n_i, i ≠ j, and combine these measures. However, even if
X_j is exogenous, these pairs do not necessarily meet the model assumption because
of the dependent error terms within each group, and hence there is no guarantee of
correctness even in the large sample limit. This approach, termed the Naïve
Pairwise Measure, may however have a statistical advantage for small sample
sizes (see Section 4).
To obtain a consistent method (simply termed Pairwise Measure in Section 4), we
replace the second variable of the pairs (x_k^(j), x_l^(i)) with a quantity which
guarantees that the model assumption is met if X_j is exogenous: We first estimate
the regression model

    x_l^(i) = Σ_{k̃=1}^{n_j} b̂_{lk̃} x_k̃^(j) + r_{l,(i)} .

If X_j is exogenous then the regression coefficients b̂_{lk̃} are consistent
estimators of the true total effects (when marginalizing out any intermediate
groups). Hence, defining

    z_{k,l}^(i) := x_l^(i) − Σ_{k̃=1, k̃≠k}^{n_j} b̂_{lk̃} x_k̃^(j) = b̂_{lk} x_k^(j) + r_{l,(i)}

yields a pair (x_k^(j), z_{k,l}^(i)) meeting the model assumption of [6] if X_j is
exogenous. Thus, in this case, R(x_k^(j), z_{k,l}^(i)) > 0 in the limit, for all
k, l, and i ≠ j. On the contrary, if X_j is not exogenous the measure can take
either sign, and simulations show that it is unlikely to always obtain a positive
one. A way to combine the ratios is suggested in [6], which can be modified for
the group case as

    μ(j) = [1 / (n_j Σ_{i≠j} n_i)] Σ_{k=1}^{n_j} Σ_{i≠j} Σ_{l=1}^{n_i} min{0, R(x_k^(j), z_{k,l}^(i))}² .   (3)
That is, we penalize each negative value according to its squared magnitude and
adjust for the group sizes. We select the group minimizing this measure as the
exogenous one.
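Once the ratios R(x_k^(j), z_{k,l}^(i)) are available, the combination in Equation (3) is a small computation; a sketch (the toy numbers are arbitrary):

```python
import numpy as np

def mu(ratio_blocks, n_j):
    """Equation (3): penalize negative ratios by their squared magnitude.

    ratio_blocks: list over the groups i != j of arrays of shape (n_j, n_i)
                  holding R(x_k^(j), z_{k,l}^(i)).
    """
    total_ni = sum(block.shape[1] for block in ratio_blocks)
    penalty = sum(np.sum(np.minimum(0.0, block) ** 2) for block in ratio_blocks)
    return penalty / (n_j * total_ni)

# Toy check: a group whose ratios are all positive gets measure 0,
# while negative ratios contribute their squared magnitudes.
blocks_exo = [np.array([[0.3, 0.7], [0.2, 0.9]])]
blocks_not = [np.array([[-0.5, 0.7], [0.2, -0.1]])]
print(mu(blocks_exo, 2), mu(blocks_not, 2))
```

The group minimizing this measure would then be selected as the exogenous one.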
Trace Method. Our third method for finding an exogenous group is based on
the approach of [8,9], termed the Trace Method, designed to infer the causal order
among two groups of variables X and Y with nx and ny variables, respectively.
If the underlying true causality is given by X → Y, the model is defined as y =
Bx+e, where the connection matrix B is chosen independently of the covariance
matrix of the regressors Σ := cov(x, x), and the disturbances e are independent
of x. Note that this method is based purely on second-order statistics and does
not make any assumptions about the distribution of the error terms e, as opposed
to the previous two approaches where we needed non-Gaussianity. The measure
to infer the causal direction defined in [8] is given by
ΔX→Y := log tr(B̂Σ̂B̂T )/ny − log tr(Σ̂)/nx − log tr(B̂B̂T )/ny (4)
where tr(·) denotes the trace of a matrix, Σ̂ an estimate of the covariance matrix
of x, and B̂ the OLS estimate of the connection matrix from x to y. The measure
for the backward direction ΔY→X is calculated similarly by exchanging B̂ with
the OLS estimate of the connection matrix from y to x and Σ̂ with the estimated
covariance matrix of y. If the correct direction is given by X → Y, Janzing et
al. [8] (i) conclude that ΔX→Y ≈ 0, (ii) show for the special case of B being an
orthogonal matrix and the covariance matrix of e being λI, that ΔY→X < 0,
and (iii) show for the noise free case that ΔY→X ≥ 0. Hence, the underlying
direction is inferred to be the one yielding Δ closer to zero [8]. In particular, if
|ΔX→Y | / |ΔY→X | < 1, then the direction is judged to be X → Y.
We suggest using the Trace Method to find an exogenous group Xj among G
groups in the following way. For each j, we calculate the measures Δ_{X_j→X_i} and
Δ_{X_i→X_j} for all i ≠ j, and infer as exogenous group the one minimizing

    μ(j) = Σ_{i≠j} ( Δ_{X_j→X_i} / Δ_{X_i→X_j} )² .   (5)
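The trace measure of Equation (4) and the selection rule of Equation (5) need only OLS estimates and matrix traces. A sketch, with data matrices holding one variable per row and illustrative sizes:

```python
import numpy as np

def delta(X, Y):
    """Equation (4): trace measure for the causal direction X -> Y.

    X, Y: data matrices of shape (n_x, m) and (n_y, m), one row per variable.
    """
    ny = Y.shape[0]
    C = np.cov(np.vstack([Y, X]))
    Sigma = C[ny:, ny:]                      # covariance of the regressors x
    B = C[:ny, ny:] @ np.linalg.inv(Sigma)   # OLS estimate of y = Bx + e
    return (np.log(np.trace(B @ Sigma @ B.T) / ny)
            - np.log(np.trace(Sigma) / X.shape[0])
            - np.log(np.trace(B @ B.T) / ny))

def mu_trace(groups, j):
    """Equation (5): squared ratios of forward and backward trace measures."""
    return sum((delta(groups[j], groups[i]) / delta(groups[i], groups[j])) ** 2
               for i in range(len(groups)) if i != j)

rng = np.random.default_rng(4)
m = 20000
X = rng.normal(size=(10, m))                             # exogenous group
Y = rng.normal(size=(10, 10)) @ X + 0.3 * rng.normal(size=(10, m))
print(abs(delta(X, Y)) < abs(delta(Y, X)))               # forward closer to zero
print(mu_trace([X, Y], 0) < mu_trace([X, Y], 1))         # group 0 is selected
```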
Another approach is to apply the methods of Section 3.1 for finding an exoge-
nous group to N data sets, each of which consists of G groups formed by taking
subsets of the variables of the corresponding original groups. We then calculate
measures μ_n^(j), j = 1, ..., G, n = 1, ..., N, as in Equations (2), (3) or (5), for
each such data set separately, and pick the group X_{j*} which minimizes the sum
over these sets to be an exogenous one, i.e.

    j* = arg min_j Σ_{1≤n≤N} μ_n^(j) ,   (6)

where μ_n^(j) is the measure of group j in the nth data set. We then can proceed
as before.
4 Simulations
Together, the methods of Section 3 provide a diverse toolbox for inferring the
model of Section 2. Here, we provide simulations to evaluate the performance of
the variants of Algorithm 1, and compare it to a few ad hoc methods. Matlab
code is available at http://www.cs.helsinki.fi/u/entner/GroupCausalOrder/
We generate models following Equation (1) by randomly creating the connec-
tion matrices Bki ,kj , i > j with, on average, s% of the entries being nonzero
and additionally ensure that at least one entry is nonzero, to ensure a complete
graph over the groups. To obtain the disturbance terms eki for each group, we
linearly mix random samples from various independent non-Gaussian variables
as to obtain dependent error terms within each group. Finally, we generate the
sample matrix X and randomly block-permute the rows (groups) to hide the
generating causal order from the inference algorithms.
Fig. 1. Sample size (x-axis) against error rate (y-axis) for various model sizes and
algorithms, as indicated in the legends (abbreviations: GDL = GroupDirectLiNGAM;
nlcorr, HSIC: nonlinear correlation or HSIC as independence test; TrMeth. = Trace
Method; PwMeas. = Pairwise Measure; ICA-L = modified ICA-LiNGAM approach;
DL = DirectLiNGAM on the mean-variables; 10sets = Equation (6) on N = 10 data
sets; L2reg = L2 -regularization for covariance matrix). The dashed black line indicates
the number of mistakes made when randomly guessing an order.
Finally, we test the strategies described in Section 3.3 for handling low sample
sizes in high dimensions on 50 models with 3 groups of 100 variables each, using
200, 500 and 1000 samples, and s = 5%. For L2 -regularization, we choose the
parameter λ using 10-fold cross validation on the covariance matrix. When taking
subgroups, we use N = 10 data sets, with each subgroup containing ten variables.
The error rates are shown in Figure 1 (b) (we only show the L2 -regularized results
if they were better than without regularization). Unreliable estimates of the
covariance matrix seem to affect especially the Trace Method, and the Pairwise
Measure on the smaller sample sizes. On the smallest sample, using subsets seems
to be advantageous for most methods; however, the best performing approach
is the Naïve Pairwise Measure, which does not appear to be
consistent, whereas GroupDirectLiNGAM and the Pairwise Measure are.
In general, the simulations show that the introduced method often correctly
identifies the true causal order, and clearly outperforms the simple ad hoc ap-
proaches. It is left to future work to study the performance in cases of model
violations as well as to apply the method to real world data.
Acknowledgments. We thank Ali Bahramisharif and Aapo Hyvärinen for dis-
cussion. The authors were supported by Academy of Finland project #1255625.
References
1. Spirtes, P., Glymour, C., Scheines, R.: Causation, Prediction, and Search, 2nd edn.
MIT Press (2000)
2. Pearl, J.: Causality: Models, Reasoning, and Inference, 2nd edn. Cambridge Uni-
versity Press (2009)
3. Chickering, D.M., Meek, C.: Finding optimal Bayesian networks. In: UAI (2002)
4. Shimizu, S., Hoyer, P.O., Hyvärinen, A., Kerminen, A.J.: A linear non-Gaussian
acyclic model for causal discovery. JMLR 7, 2003–2030 (2006)
5. Shimizu, S., Inazumi, T., Sogawa, Y., Hyvärinen, A., Kawahara, Y., Washio, T.,
Hoyer, P.O., Bollen, K.: DirectLiNGAM: A direct method for learning a linear
non-Gaussian structural equation model. JMLR 12, 1225–1248 (2011)
6. Hyvärinen, A.: Pairwise measures of causal directions in linear non-Gaussian acyclic
models. JMLR W.&C.P. 13, 1–16 (2010)
7. Kawahara, Y., Bollen, K., Shimizu, S., Washio, T.: GroupLiNGAM: Linear non-
Gaussian acyclic models for sets of variables. arXiv:1006.5041v1 (June 2010)
8. Janzing, D., Hoyer, P.O., Schölkopf, B.: Telling cause from effect based on high-
dimensional observations. In: ICML (2010)
9. Zscheischler, J., Janzing, D., Zhang, K.: Testing whether linear equations are
causal: A free probability theory approach. In: UAI (2011)
10. Scheines, R., Spirtes, P.: Causal structure search: Philosophical foundations and
problems. In: NIPS 2008 Workshop: Causality: Objectives and Assessment (2008)
11. Fisher, R.A.: Statistical Methods for Research Workers, 11th edn. Oliver and Boyd,
London (1950)
12. Gretton, A., Fukumizu, K., Teo, C.H., Song, L., Schölkopf, B., Smola, A.J.: A
kernel statistical test of independence. In: NIPS (2008)
13. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data
Mining, Inference and Prediction, 2nd edn. Springer (2008)
14. Shimizu, S.: Joint estimation of linear non-Gaussian acyclic models. Neurocomputing
81, 104–107 (2012)
Training Restricted Boltzmann Machines
with Multi-tempering: Harnessing Parallelization
1 Introduction
Since the recent popularity of deep neural architectures for learning [2], Re-
stricted Boltzmann Machines (RBM’s; [6,5]), which are the building blocks of
Deep Belief Networks [7], have been studied extensively. An RBM is an undi-
rected graphical model with a bipartite connection structure. It consists of a layer
of visible units and a layer of hidden units and can be trained in an unsupervised
way to model the distribution of a dataset. After training, the activations of the
hidden units can be used as features for applications such as classification or
clustering. Unfortunately, the likelihood gradient of RBM’s is intractable and
needs to be approximated.
Most approximations for RBM training are based on sampling methods.
RBM’s have an independence structure that makes it efficient to apply Gibbs
sampling. However, the efficiency of Gibbs sampling depends on the rate at which
independent samples are generated. This property is known as the mixing rate.
While Gibbs samplers will eventually generate samples from the true underlying
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 92–99, 2012.
c Springer-Verlag Berlin Heidelberg 2012
distribution they approximate, they can get stuck in local modes. This is espe-
cially problematic for distributions that contain many modes that are separated
by regions where the probability density is very low.
In this paper, we investigate two methods for improving both the mixing rate
of the sampler and the quality of the gradient estimates at each sampling step.
These two methods are extensions for the so-called Replica Exchange method
and were recently proposed for statistical physics simulations [1]. The first ex-
tension allows every possible pair of replicas to swap positions to increase the
number of sampling chains that can be used in parallel. The second extension
is to use a weighted average of the replicas that are simulated in parallel. The
weights are chosen in a way that is consistent with the exchange mechanism.
where N_h and N_v are, respectively, the number of hidden and the number of
visible units. The symbols W, a and b denote trainable weight and bias parameters.
This function can be used to define a Gibbs probability distribution of the form
p(v) = Σ_h e^{−E(v,h)} / Z, where Z is the partition function, which is given by
Z = Σ_{h,v} e^{−E(v,h)}.
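For a very small RBM, the sums over h and v can be enumerated exactly, making the definitions concrete. The sketch assumes the standard energy E(v, h) = −aᵀv − bᵀh − vᵀWh, which is consistent with the parameters W, a, b above but is an assumption here since the energy equation itself is not reproduced in this excerpt:

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
Nv, Nh = 4, 3
W = rng.normal(0, 0.5, (Nv, Nh))
a = rng.normal(0, 0.1, Nv)      # visible biases
b = rng.normal(0, 0.1, Nh)      # hidden biases

def energy(v, h):
    # Assumed standard RBM energy: E(v, h) = -a'v - b'h - v'Wh
    return -a @ v - b @ h - v @ W @ h

states_v = [np.array(s) for s in itertools.product([0, 1], repeat=Nv)]
states_h = [np.array(s) for s in itertools.product([0, 1], repeat=Nh)]

Z = sum(np.exp(-energy(v, h)) for v in states_v for h in states_h)
p_v = np.array([sum(np.exp(-energy(v, h)) for h in states_h) / Z
                for v in states_v])
print(p_v.sum())   # the marginal probabilities sum to one
```

For realistic layer sizes this enumeration is exactly what becomes intractable, which is why the gradient below must be approximated.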
The gradient of this log-likelihood is given by

    ∂ log p(v)/∂θ = −⟨∂E(v, h)/∂θ⟩_{p(h|v)} + ⟨∂E(v, h)/∂θ⟩_{p(v,h)} ,

where θ is an element in the set of parameters {W, a, b}. The first term of this
gradient can be evaluated analytically in RBM’s but the second term needs to
be approximated. This second term is the gradient of the partition function and
will be referred to as the model expectation.
3 Training RBM’s
The most commonly used training method for RBM’s is the Contrastive Di-
vergence (CD; [6]) algorithm. During training, a Gibbs sampler is initialized at
a sample from the data and run for a couple of iterations. The last sample of
the chain is used to replace the intractable model expectation. This strategy
assumes that many of the low-energy configurations that contribute most to the
model expectation can be found near the data. However, it is very likely that
94 P. Brakel, S. Dieleman, and B. Schrauwen
there are many other valleys of low energy. Furthermore, the algorithm does not
necessarily optimize the likelihood function at all.
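The CD procedure just described can be sketched for a binary RBM as follows; the sizes, the statistics returned, and k are illustrative choices, not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(6)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd_k(V, W, a, b, k=1):
    """One CD-k gradient estimate from a mini-batch V of binary visible vectors."""
    Hp = sigmoid(V @ W + b)                       # positive-phase hidden probs
    pos = V.T @ Hp
    Vk = V
    for _ in range(k):                            # k steps of Gibbs sampling
        H = (rng.random(Hp.shape) < sigmoid(Vk @ W + b)).astype(float)
        Vk = (rng.random(V.shape) < sigmoid(H @ W.T + a)).astype(float)
    Hk = sigmoid(Vk @ W + b)                      # negative phase from the chain end
    neg = Vk.T @ Hk
    n = V.shape[0]
    return (pos - neg) / n, (V - Vk).mean(0), (Hp - Hk).mean(0)

Nv, Nh, batch = 6, 4, 32
W = rng.normal(0, 0.1, (Nv, Nh)); a = np.zeros(Nv); b = np.zeros(Nh)
V = (rng.random((batch, Nv)) < 0.5).astype(float)
dW, da, db = cd_k(V, W, a, b, k=1)
print(dW.shape, da.shape, db.shape)
```

Here the chain end Vk replaces the intractable model expectation, exactly the substitution criticized above.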
In Persistent Contrastive Divergence (PCD) learning [13], a Markov chain
is updated after every parameter update during training and used to provide
samples that approximate the model expectation. The difference with normal
CD is that the chain is not reset at a data point after every update, but keeps on
running so it can find low energy regions that are far away from the data. Given
infinite training time, this algorithm optimizes the true likelihood. However, as
training progresses and the model parameters get larger, the energy landscape
becomes more rough. This will decrease the size of the steps the chain takes and
increase the chance that the chain gets stuck in local modes of the distribution.
To obtain better mixing rates for the sampling chains in PCD, the Fast PCD
algorithm was proposed [12]. This algorithm uses a copy of the model that is
trained using a higher learning rate to obtain samples. The training itself is in
this case pushing chains out of local modes. Unfortunately, the training algorithm
is now not necessarily converging to the true likelihood anymore.
Another way to improve the mixing rate is Replica Exchange Monte Carlo
[11], also referred to as Parallel Tempering (PT). Recently, PT has been ap-
plied to RBM training as well [4]. This algorithm runs various chains in parallel
that sample from replicas of the system of interest that operate under different
temperatures. Chains that operate at lower temperatures can escape from local
modes by jumping to locations of similar energy that have been proposed by
chains that operate at higher temperatures. A serial version of this idea has also
been proposed for training RBM’s [9].
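A minimal sketch of one neighbour-swap round as used in standard PT; the states and energies are placeholders (in RBM training, E(x_i) would be the energy of replica i's state under its tempered model):

```python
import numpy as np

rng = np.random.default_rng(7)

def pt_swap_neighbours(states, energies, betas):
    """One round of neighbour swaps using the Metropolis acceptance rule
    r_ij = exp(E(X) - E(X_hat(i,j))) = exp((beta_i - beta_j)(E(x_i) - E(x_j)))."""
    for i in range(len(states) - 1):
        j = i + 1
        log_r = (betas[i] - betas[j]) * (energies[i] - energies[j])
        if np.log(rng.random()) < log_r:          # accept the proposed exchange
            states[i], states[j] = states[j], states[i]
            energies[i], energies[j] = energies[j], energies[i]
    return states, energies

betas = np.linspace(1.0, 0.8, 5)            # beta_1 = 1 is the model of interest
states = [np.full(3, float(i)) for i in range(5)]
energies = list(rng.normal(size=5))
energies0 = sorted(energies)                # multiset of energies before swapping
states, energies = pt_swap_neighbours(states, energies, betas)
print(sorted(energies) == energies0)        # swaps only permute the replicas
```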
One downside of PT for training RBM’s is that the number of parallel sam-
pling chains that can be used by this algorithm is limited. One can use many
chains in PT to cover more temperatures. This will cause more swaps between
neighbouring chains to be accepted because they are closer together. However,
it will also take more sequential updates before a certain replica moves back and
forth between the lowest and the highest temperatures. Another disadvantage of
PT is that only the chain with the lowest temperature is actually used to gather
statistics for the learning algorithm.
4 Multi-tempering
To increase the number of parallel chains that PT can effectively use, we propose
Multiple Replica Exchange methods for RBM training. These methods have
already been shown to work well in statistical physics [3,1]. To prevent the use
of very different names for similar algorithms, we will refer to this method as
Multi-Tempering (MT). Since MT is a modification of PT Markov Chain Monte
Carlo, it is necessary to describe the original algorithm in some more detail.
The idea behind PT is to run several Markov chains in parallel and treat
this set of chains as one big chain that generates samples from a distribution
with augmented variables. Transition steps in this combined chain can now also
include possible exchanges among the sub chains. Let X = {x1 , · · · , xM } be
the state of a Markov chain that consists of the states of M sub chains that
operate under inverse temperatures {β_1, ..., β_M}, where β_1 = 1 corresponds to
the model we want to compute expectations for. The combined energy of this system
is given by E(X) = Σ_{i=1}^{M} β_i E(x_i). The difference in total energy that
results from switching two arbitrary sub chains with indices i, j is given by

    E(X̂(i, j)) − E(X) = (β_i − β_j)(E(x_j) − E(x_i)) ,

where X̂(·) denotes the new state of the combined chain that results from the
exchange indicated by its arguments¹. If i and j are selected uniformly and
forced to be neighbours, the Metropolis-Hastings acceptance probability is given
by r_ij = exp(E(X) − E(X̂(i, j))). This is the acceptance criterion that is used
in standard Parallel Tempering.
In Multi-Tempering [1], index i is selected uniformly and index j is selected
with a probability that is based on the difference in total energy the proposed
exchange would cause:

    p(j|i) = r_ij / Σ_{j'=1}^{M} r_ij' .   (4)

Given the selection probabilities p(j|i) from Equation 4 and the acceptance
probabilities A(i, j|X), one can compute a weighted average to estimate the
gradient of the intractable likelihood term. This average is given by

    g_1 = Σ_{j=1}^{M} [ (1 − A(i, j)) g(x_1) + A(i, j) g(x_j) ] p(j|i) ,   (6)
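Equations (4) and (6) translate directly into code. In the sketch below, g holds one gradient statistic vector per replica, and the acceptance probability A(i, j) is taken to be min(1, r_ij), which is our assumption since the defining equation for A is not reproduced in this excerpt:

```python
import numpy as np

def mt_estimate(betas, energies, g, i):
    """Selection probabilities (Eq. 4) and weighted gradient estimate (Eq. 6)."""
    E = np.asarray(energies, dtype=float)
    # r_ij = exp(E(X) - E(X_hat(i, j))) for every candidate partner j (r_ii = 1).
    r = np.exp((betas[i] - betas) * (E[i] - E))
    p = r / r.sum()                      # Eq. (4): selection probability p(j|i)
    A = np.minimum(1.0, r)               # assumed acceptance probability min(1, r_ij)
    # Eq. (6): mix the beta = 1 chain's statistic g[0] with each proposed replica's.
    g1 = np.sum(p[:, None] * ((1.0 - A)[:, None] * g[0] + A[:, None] * g), axis=0)
    return p, g1

rng = np.random.default_rng(8)
betas = np.linspace(1.0, 0.8, 4)         # beta_1 = 1 is the model of interest
energies = rng.normal(size=4)
g = rng.normal(size=(4, 5))              # one statistic vector per replica
p, g1 = mt_estimate(betas, energies, g, i=0)
print(p.sum(), g1.shape)
```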
6 Experiments
All experiments were done on the MNIST dataset. This dataset is a collection of
70,000 28 × 28 grayscale images of handwritten digits that has been split into a
train set of 50,000 images and test and validation sets of 10,000 images each. The
pixel intensities were scaled between 0 and 1 and interpreted as probabilities
from which binary values were sampled whenever a datapoint was required.

¹ So X̂(i, j, k) would mean that i is first swapped with j and subsequently, the
sample at position j is swapped with the one at position k.
First, it was investigated how the MT and the PT algorithms behave with
different numbers of parallel chains by looking at the rate at which replicas travel
from the highest temperature chain to the one with the lowest temperature. Ten
RBM’s with 500 hidden units were trained with PCD using a linearly decay-
ing learning rate with a starting value .002 for 500 epochs. Subsequently, both
sampling methods were run for 10000 iterations and the number of times that
a replica was passed all the way from the highest to the lowest temperature
chain was counted. This experiment was done for different numbers of parallel
chains. The inverse temperatures were uniformly spaced between .8 and 1. In
preliminary experiments, we found that almost no returns from the highest to
the lowest temperature occurred for any algorithm for much larger intervals.
The second experiment was done to get some insight into the mixing rates
of the sampling methods and their success at approximating the gradient of the
partition function. A small RBM with 15 hidden units was trained on the MNIST
dataset using the PCD algorithm. The different sampling methods were now run
for 20000 iterations while their estimates of the gradient were compared with
the true gradient, which had been computed analytically. Because the success of
the samplers partially depends on their random initialization, we repeated this
experiment 10 times.
Finally, to see how the different sampling algorithms perform at actual train-
ing, a method called annealed importance sampling (AIS) [8,10] was used to
estimate the likelihood of the data under the trained models. PCD, PT, MT
and MTw were each used to train 10 RBM models on the train data for 500
epochs. Each method used 100 chains in parallel. The inverse temperatures for
the Tempering methods were linearly spaced between .85 and 1 as we expected a
slightly more conservative temperature range would be needed to make PT com-
petitive. We used no weight decay and the order of magnitude of the starting
learning rates was determined using a validation set. The learning rate decreased
linearly after every epoch.
Fig. 1 displays the results of the first experiment. The number of returns is a lot
higher for MT at the start and seems to go down at a slightly slower rate than
for PT. This allows a larger number of chains to be used before the number of
returns becomes negligible.
As Fig. 2 shows, the MT estimator was most successful at approximating
the gradient of the partition function of the RBM with 15 hidden units. To
our surprise, the MT estimator also performed better than the MTw estimator.
However, it seems that the algorithms that used a single chain to compute the
expectations (MT and PT), fluctuate more than the ones that use averages
(MTw and PCD).
Fig. 1. Number of returns for parallel tempering and multiple replica exchange as a
function of the number of parallel chains that are used
Fig. 2. Mean Square Error (MSE) between the approximated and the true gradients of
the partition function of an RBM with 15 units as a function of the number of samples
Table 1. Means and standard deviations of the AIS estimates of the likelihood of the
MNIST test set for different training methods. Means are based on 10 experiments
with different random initializations.
Table 1 displays the AIS estimates of the likelihood for the MNIST test set
for each of the training methods. MTw outperforms all other methods on this
task. The standard deviations of the results are quite high and MT, PT and
PCD do not seem to differ much in performance. The fact that MT and PT use
only a single chain to estimate the gradient seems to be detrimental. This is
not in line with the results for the gradient estimates for the 15-unit RBM. It
could be that larger RBMs benefit more from the higher stability of gradient
estimates based on averages than small RBMs do. The results suggest that, due
to its relative simplicity, PCD with averaged parallel chains is preferable to
tempering algorithms that use only a single chain for estimation, but that MTw
is an interesting alternative.
During MT training, we also recorded the transition indices for further inspection.
As can be seen in Fig. 3a, which shows a matrix in which each entry {i, j} represents
the number of times that a swap occurred between chains i and j, many of the
exchanges are quite large. While there seems to be a bottleneck that is difficult
to cross, it is clear that some particles still make it to the other side once in a
while. In Fig. 3b, one can see that occasionally some very large jumps occur that
span almost the entire temperature range.
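The chain-exchange machinery underlying these swap counts can be sketched with the standard Metropolis exchange criterion used in replica-exchange/parallel-tempering samplers. This is a generic sketch: the function names are ours, and the neighbour-only sweep shown here corresponds to plain PT, not to the multi-tempering scheme of the paper, which also allows non-neighbour exchanges.

```python
import math
import random

def swap_accept_prob(E_i, E_j, beta_i, beta_j):
    """Metropolis probability of exchanging the states of two tempered
    chains with energies E_i, E_j at inverse temperatures beta_i, beta_j."""
    return min(1.0, math.exp((beta_i - beta_j) * (E_i - E_j)))

def try_swaps(energies, betas, rng=random.random):
    """Attempt swaps between all neighbouring temperature slots and return
    the resulting assignment of chain indices to slots (the bookkeeping
    behind the transition indices discussed in the text)."""
    order = list(range(len(betas)))
    for k in range(len(betas) - 1):
        i, j = order[k], order[k + 1]
        if rng() < swap_accept_prob(energies[i], energies[j], betas[k], betas[k + 1]):
            order[k], order[k + 1] = j, i
    return order
```

Counting how often a given chain index travels from the hottest slot back to the coldest one with this bookkeeping yields the "number of returns" statistic plotted in Fig. 1.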
8 Conclusion
We proposed two methods to improve parallel tempering training for RBMs and
showed that their combination leads to improved performance on learning a
generative model of the MNIST dataset. We also showed that the MTw algorithm
allows more chains to be used in parallel and directly improves the gradient
estimates for a small RBM. While the weighted average did not seem to improve
the mixing rate, it seemed to stabilize training. For future work, it would be
interesting to see how the sampling algorithms compare when the RBMs are used
for pre-training a Deep Belief Network.
References
1. Athènes, M., Calvo, F.: Multiple-Replica Exchange with Information Retrieval.
Chemphyschem. 9(16), 2332–2339 (2008)
2. Bengio, Y.: Learning deep architectures for AI. Foundations and Trends in Machine
Learning 2(1), 1–127 (2009), also published as a book. Now Publishers (2009)
3. Brenner, P., Sweet, C.R., VonHandorf, D., Izaguirre, J.A.: Accelerating the Replica
Exchange Method through an Efficient All-Pairs Exchange. The Journal of Chem-
ical Physics 126(7), 074103 (2007)
4. Desjardins, G., Courville, A.C., Bengio, Y., Vincent, P., Delalleau, O.: Tempered
Markov chain Monte Carlo for training of restricted Boltzmann machines. Journal
of Machine Learning Research - Proceedings Track 9, 145–152 (2010)
5. Freund, Y., Haussler, D.: Unsupervised Learning of Distributions on Binary Vectors
Using Two Layer Networks. Tech. rep., Santa Cruz, CA, USA (1994)
6. Hinton, G.E.: Training Products of Experts by Minimizing Contrastive Divergence.
Neural Computation 14(8), 1771–1800 (2002)
7. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief
nets. Neural Computation 18(7), 1527–1554 (2006)
8. Neal, R.M.: Annealed importance sampling. Statistics and Computing 11, 125–139
(2001)
9. Salakhutdinov, R.: Learning in Markov random fields using tempered transitions.
In: Bengio, Y., Schuurmans, D., Lafferty, J.D., Williams, C.K.I., Culotta, A. (eds.)
NIPS, pp. 1598–1606. Curran Associates, Inc. (2009)
10. Salakhutdinov, R., Murray, I.: On the quantitative analysis of Deep Belief Net-
works. In: McCallum, A., Roweis, S. (eds.) Proceedings of the 25th Annual Inter-
national Conference on Machine Learning (ICML 2008), pp. 872–879. Omnipress
(2008)
11. Swendsen, R.H., Wang, J.S.: Replica Monte Carlo Simulation of Spin-Glasses.
Physical Review Letters 57(21), 2607–2609 (1986)
12. Tieleman, T., Hinton, G.: Using Fast Weights to Improve Persistent Contrastive Di-
vergence. In: Proceedings of the 26th International Conference on Machine Learn-
ing, pp. 1033–1040. ACM, New York (2009)
13. Tieleman, T.: Training restricted Boltzmann machines using approximations to the
likelihood gradient. In: Proceedings of the International Conference on Machine
Learning (2008)
A Computational Geometry Approach
for Pareto-Optimal Selection of Neural Networks
1 Introduction
Multi-objective (MOBJ) learning of Artificial Neural Networks (ANNs) provides
an alternative approach for implementing Structural Risk Minimization (SRM)
[1]. Its basic principle is to explicitly minimize two separate objective functions,
one related to the empirical risk (training error) and the other to the network
complexity, usually represented by the norm of the network weights [3,4,5,6]. It
is known from Optimization Theory, however, that the minimization of these two
conflicting objective functions does not yield a single minimum but results, instead,
in a set of Pareto-optimal (PO) solutions [10]. Similarly to Support Vector Machines
(SVMs) [11] and other regularization learning approaches, the choice of a
PO solution is analogous to selecting the regularization parameter, which provides
a balance between smoothness and dataset fitness. The selection of the PO
solution and of the regularization parameter in SVMs should be accomplished
according to an additional decision criterion. In SVM learning, cross-validation is
often adopted.
Some PO selection strategies have been proposed in the literature in the con-
text of MOBJ learning. Current approaches include searching the Pareto-set for
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 100–107, 2012.
c Springer-Verlag Berlin Heidelberg 2012
Computational Geometry Approach for Pareto-Optimal Model Selection 101
the solution that minimizes the error on a validation set [4], making assumptions
about the error distribution based on prior knowledge [7], and assuming no correlation
between the error and the approximation function [2]. However, these strategies
are only valid under restricted conditions and cannot be regarded as general. In
any case, the selection method embodies the criterion that guides the search
towards a given kind of solution. For instance, margin maximization is a well-accepted
criterion for model selection in classification and regression problems.
Nevertheless, margin maximization with SVMs depends on setting the regularization
parameter first, since the solution of the corresponding quadratic optimization
problem can only be accomplished after the learning and kernel parameters
are set [11].
In this paper we present a parameterless PO selection method that is based on
the geometrical definition of the separation margin, which is estimated according
to concepts borrowed from Computational Geometry [8]. The Gabriel Graph
(GG) [14] is adopted to construct a model of the graph formed by the
input data and their relative distances. Once the graph model is constructed,
it is possible to identify those patterns that are in the separation margin and
then to point out the PO solution that maximizes a given smoothness metric
defined according to the margin patterns. The results presented in this paper
show that its performance on benchmark UCI datasets is similar to that obtained
by SVMs and LS-SVMs (Least Squares Support Vector Machines) on the same
data.
The remainder of this paper is organized as follows. Section 2 presents the
underlying principles for decision making in MOBJ learning and the main
motivations for PO selection. Section 3 extends the MOBJ section and shows the
main principles of multi-criteria decision making. Section 4 presents the quality
function proposed in this paper, followed by results and conclusions in the final
two sections.
2 MOBJ Learning
It is well accepted that the general supervised learning problem can be formulated
as the minimization of two, sometimes conflicting, objective functions, one
related to the learning-set error and the other related to the model complexity [1].
This general formulation of learning thus has a bi-objective nature since, for most
problems, there is no single set of model parameters that concurrently minimizes
the two objectives. In any case, the two objectives
should be minimized and the learning problem, according to this formulation,
can be stated as: “find the minimum complexity model that fits the learning set
with minimum error”. Learning algorithms differ in how this general statement
is implemented, and many approaches that attempt to solve the problem have
appeared in the literature over the last decades. However, after the widespread
acceptance of Statistical Learning Theory (SLT) [1] as a general framework for
learning, the popularity of SVMs and the formal proof that the “magnitude of
the weights is more important than the number of weights” [12], algorithms that
102 L.C.B. Torres, C.L. Castro, and A.P. Braga
minimize both the learning set error and the norm of network weights became
popular for ANNs learning. For instance, MOBJ learning [4] can be described
according to the multi-objective formulation that follows.
Given the dataset $D = \{x_i, y_i\}_{i=1}^{N}$, MOBJ learning aims at solving the
optimization problem of Equation (1) [4]:

$$\min_{w} \begin{cases} J_1(w) = \sum_{i=1}^{N} \left(y_i - f(x_i, w)\right)^2 \\ J_2(w) = \|w\| \end{cases} \qquad (1)$$

where $f(x_i, w)$ is the output of the model for the input pattern $x_i$, $w$ is the
vector of network parameters (weights), $y_i$ is the target response for $x_i$, and
$\|\cdot\|$ is the Euclidean norm operator.
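As a concrete illustration of this bi-objective trade-off, the sketch below evaluates a pool of candidate weight vectors of a linear model and keeps only the non-dominated ones. The linear model, the candidate pool and the function names are our own illustrative assumptions, not the authors' MOBJ algorithm.

```python
import numpy as np

def objectives(W, X, y):
    """J1: sum of squared errors of a linear model y ~ X @ w; J2: weight norm.
    W holds one candidate weight vector per row (illustrative linear model)."""
    preds = X @ W.T                               # (N, n_candidates)
    J1 = ((y[:, None] - preds) ** 2).sum(axis=0)  # empirical risk per candidate
    J2 = np.linalg.norm(W, axis=1)                # complexity per candidate
    return J1, J2

def pareto_mask(J1, J2):
    """Boolean mask of Pareto-optimal candidates: a candidate is kept unless
    some other candidate is at least as good in both objectives and strictly
    better in at least one."""
    n = len(J1)
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        dominated = (J1 <= J1[i]) & (J2 <= J2[i]) & ((J1 < J1[i]) | (J2 < J2[i]))
        if dominated.any():
            mask[i] = False
    return mask
```

The surviving candidates trace out the Pareto set from which a single solution must then be selected by an additional decision criterion, which is the topic of the following sections.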
3 Decision Making
to obtain a quality function that does not depend on external parameters and
that can be assessed directly from the dataset.
The concept of separation margin is well understood, especially for a separable
dataset. It is defined by the distance of the nearest patterns, or support vectors
in SVMs’ terminology, of each class to the separation hyperplane in feature space
[1]. The hyperplane should separate the dataset evenly or, in other words, the
distances of the support vectors to the separation hyperplane should be maximal
and equal for the two classes. When the dataset is not linearly separable in feature
space, slack variables determine the tolerance allowed in the overlapping region
between classes. In practice, the effect of formulating the problem according to
slack variables is to transform the problem into a linearly separable one, so that
the margin concept above can be applied. Our quality measure function aims,
therefore, at identifying the patterns that are in the overlapping region directly
from the dataset by applying concepts from Computational Geometry. Once the
overlapping patterns are identified, similarly to the slack variables formulation,
they are not considered in margin estimation and PO selection.
Considering that the PO solutions have been already generated, the proposed
selection strategy is accomplished in three distinct phases. The first one aims at
identifying the separation region between the two classes. This is carried out by
identifying the edges of a Gabriel Graph [8] that have patterns from different
classes at their vertices. The corresponding patterns at the extremes of the border
edges are analogous to the support vectors of an SVM, although we should make
it clear that we do not claim that they correspond exactly to the actual support
vectors that would have been obtained from an SVM solution. They will simply
be called border patterns here, although their importance will be similar to that
of the support vectors of SVMs. Our selection strategy aims at choosing the maximum
margin separator from the PO solutions or, in other words, the closest one to the
mean of border patterns. So, in the second phase the mean-vector of each pair of
border patterns is obtained, so that the selection procedure can be accomplished
in the last phase. Each one of the three phases will be described next.
$$(v_i, v_j) \in E \iff \delta^2(v_i, v_j) \le \delta^2(v_i, z) + \delta^2(v_j, z) \quad \forall\, z \in V,\ v_i, v_j \ne z \qquad (3)$$
Fig. 1. PO selection for the problem of Figure 1(a). (a) Pareto optimal solutions for
a binary classification problem. (b) Border edges, mean separation vectors and the
closest PO solution. (c) Chosen solution from the Pareto-set.
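The graph construction and the border-edge filter of the first phase can be sketched with a brute-force check of the Gabriel condition in Eq. (3); the O(n³) approach and the function names are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def gabriel_edges(X):
    """Edges (i, j) satisfying the Gabriel condition of Eq. (3):
    d^2(vi, vj) <= d^2(vi, z) + d^2(vj, z) for every other point z."""
    n = len(X)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            others = [z for z in range(n) if z != i and z != j]
            if all(D2[i, j] <= D2[i, z] + D2[j, z] for z in others):
                edges.append((i, j))
    return edges

def border_edges(edges, labels):
    """Phase one of the selection strategy: keep the edges whose endpoints
    belong to different classes (the 'border patterns')."""
    return [(i, j) for i, j in edges if labels[i] != labels[j]]
```

Phase two then takes the midpoint of each border edge, and phase three picks the Pareto-optimal solution closest to those midpoints, as described in the text.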
5 Results
Prior to presenting the efficiency of the method on benchmark problems, we
first show the results for a two-dimensional synthetic dataset known as the
“two-moons” problem. This example is interesting because the non-Gaussian class
distributions present an additional challenge for classification models and also
because the actual graph for PO selection can be visualized. The results are
shown in the graph of Fig. 2. The dataset for the classification problem is shown
in Fig. 2(a), the corresponding Gabriel Graph in Fig. 2(b) and the final solution
selected from the Pareto set in Fig. 2(c), where the solution obtained from a
validation dataset is also shown. It can be observed that the Gabriel Graph
solution has a larger margin than the one obtained with validation.
Next, experiments were carried out with the following binary datasets from
the UCI repository: Statlog Australian Credit (acr), Statlog German Credit (gcr),
Statlog Heart Disease (hea), Ionosphere (ion), Pima Indians Diabetes (pid), Sonar
(snr) and Wisconsin Breast Cancer (wbc). Table 1 shows the characteristics
of each database, where $N_{Tr/Vc}$ is the amount of data used for training or
cross-validation, $N_{test}$ is the test set size and $N$ is the total number of samples.
The numbers of numerical and categorical attributes are denoted by $n_{num}$ and
$n_{cat}$ respectively, and $n$ is the total number of attributes. All datasets were
normalized to mean $\bar{x} = 0$ and standard deviation $\sigma = 1$. In order to achieve
representative results, 10 random permutations were generated for each dataset.
Then, each permutation was split into training (or cross-validation) (2/3) and
test (1/3) subsets.
Fig. 2. (a) Dataset for the two-moons classification problem. (b) Gabriel Graph. (c)
Selected solution with Gabriel Graph compared with the one selected according to a
validation set.
The results were compared with the benchmarks of the LS-SVM algorithm
presented in [9], and also with an SVM implemented with the LIBSVM
toolbox [15]. Following [9], an RBF kernel was selected for the LS-SVMs; the
regularization parameter γ and kernel parameter ϕ were selected with a 10-fold
cross-validation grid-search procedure. In the case of SVMs, we used the standard
C-SVC formulation with an RBF kernel. The same methodology (grid search with
10-fold cross-validation) was used to choose the corresponding SVM γ and ϕ
parameters. The best parameters for each dataset are shown in Table 2.
The results obtained with the datasets of Table 1 for SVMs and LS-SVMs
were then compared with those obtained with multiobjective learning of Multi-
layer Perceptrons (MLPs) [4]. The final selection strategy from the PO solutions
was the one described in Section 4. Mean accuracy and standard deviation for
Table 3. Results
acr gcr hea ion pid snr wbc
all methods are presented in Table 3. Although a statistical test was not performed,
since a numerical comparison of the results was not the main goal of this
paper, inspection of Table 3 shows that the methods have similar performances
on all datasets.
References

Learning Parameters of Linear Models in Compressed Parameter Space

1 Introduction
Many real-world applications make use of high-dimensional inputs, e.g., images
or multi-dimensional electroencephalography (EEG) signals. If we want to apply
machine learning to such applications, we usually have to optimize a large
number of parameters. The optimization process requires a large amount of
computational resources and/or a long training time.
One way of reducing the training time is to first project the input signal
onto a subspace of lower dimension. This idea is motivated by work in the
area of compressed sensing [3]. Another way of reducing the training time is to
learn the parameters of the model at hand in a compressed parameter space. This
approach, which we consider in this paper, makes it possible to optimize fewer
parameters than would otherwise be required, without affecting the input data.
The paper is organized as follows: we first give a review of related work,
then continue with the discussion of the closed form solution for weighted sum
This work was supported by the German Bundesministerium für Wirtschaft und
Technologie (BMWi, grant FKZ 50 RA 1012 and grant FKZ 50 RA 1011).
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 108–115, 2012.
c Springer-Verlag Berlin Heidelberg 2012
Learning Parameters of Linear Models in Compressed Parameter Space 109
of squared errors in compressed parameter space for regression, and show its
equivalence to learning in compressed input space. Afterwards we discuss learn-
ing in compressed parameter space for classification problems and proceed to
learning in compressed parameter space for reinforcement learning. Finally, we
present some preliminary experimental results.
$$\varphi_m = \left[\varphi_m(0),\ \varphi_m\!\left(\tfrac{1}{N}\right),\ \varphi_m\!\left(\tfrac{2}{N}\right),\ \ldots,\ \varphi_m\!\left(\tfrac{k}{N}\right),\ \ldots,\ \varphi_m(1)\right]^T, \qquad (5)$$

$$d_m^{(n)} = \varphi_m^T \phi(x^{(n)}). \qquad (6)$$

Let $d^{(n)} = [d_0^{(n)}, d_1^{(n)}, \ldots, d_M^{(n)}]^T$ and $\alpha = [\alpha_0, \alpha_1, \ldots, \alpha_M]^T$. The sum of
weighted square errors is now given by

$$e = \frac{1}{2} \sum_n \lambda_n \left(\alpha^T d^{(n)} - c^{(n)}\right)^2. \qquad (7)$$
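Under this reading, Eqs. (5)–(7) amount to projecting each original feature vector through the rows φ_m and solving a weighted least-squares problem for α. The following is a minimal numpy sketch, with a random projection standing in for the sampled basis functions (an assumption on our part; the paper constructs Φ from function values on a grid):

```python
import numpy as np

def compressed_targets(Phi, features):
    """d^(n) = [phi_m^T phi(x^(n))]_m: project the original feature vectors
    into the (M+1)-dimensional compressed space (cf. Eq. (6))."""
    return features @ Phi.T               # shape (N, M+1)

def weighted_lstsq(D, c, lam):
    """Minimize e = 1/2 sum_n lam_n (alpha^T d^(n) - c^(n))^2 (cf. Eq. (7))
    via the weighted normal equations."""
    W = np.diag(lam)
    A = D.T @ W @ D
    b = D.T @ W @ c
    return np.linalg.solve(A, b)
```

Note that only the (M+1)-dimensional vector α is optimized, however large the original feature dimension is, which is the point of learning in the compressed parameter space.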
One can see that this is equivalent to learning in a compressed input space of
lower dimension, where the training set is given by $\{(d^{(n)}, c^{(n)})\}_{n=1}^{N}$.
This view of the problem has the advantage that it can easily be applied to classifiers
with more complex optimization algorithms, such as support vector machines.¹

¹ Other forms of error functions can also be considered.
where $d$ is a random variable whose realizations are $\{d^{(1)}, d^{(2)}, \ldots, d^{(N)}\}$ and
$P(d = d^{(n)}) = \lambda_n$. For a weighted error with regularization terms, one can show
that the vector $\alpha = [\alpha_0, \alpha_1, \ldots, \alpha_M]^T$ minimizing the error function satisfies the
equation

$$\tilde{\Omega}\alpha + \bar{\Omega}\alpha = \gamma, \qquad (11)$$

where $\tilde{\Omega} = \Omega$, $\bar{\Omega} = \mathrm{diag}(\nu_0 p_0, \nu_1 p_1, \ldots, \nu_M p_M)$ and $p_m = \varphi_m^T \varphi_m$. If $\varphi_m$ is
a unit vector then $p_m = 1$. The analysis presented in this section remains the
same for a linear model of the form

$$y(x; w) = \sum_{k=1}^{L} w_k x_k + w_0 = w^T \phi(x), \qquad (12)$$

$$\varphi_m = \left[\varphi_m(0),\ \varphi_m\!\left(\tfrac{1}{L}\right),\ \varphi_m\!\left(\tfrac{2}{L}\right),\ \ldots,\ \varphi_m\!\left(\tfrac{k}{L}\right),\ \ldots,\ \varphi_m(1)\right]^T, \qquad (13)$$

with $k \in \{0, 1, 2, \ldots, L\}$ and $m \in \{0, 1, 2, \ldots, M\}$.
we do not have a closed-form solution for the classification problem, we make use
of iterative reweighted least squares as follows:
1. Initialize the vector $\alpha$ and the basis functions $\varphi_0, \ldots, \varphi_M$, and set $\alpha_{old} = \alpha$.
2. Generate the weighting vector $\lambda = [y_1(1-y_1), y_2(1-y_2), \ldots, y_N(1-y_N)]^T$.
Normalize $\lambda$ as $\lambda \leftarrow \lambda / \sum_n \lambda_n$, so that $\sum_n \lambda_n = 1$.
3. Use equation (9) to solve for $\alpha$.
4. If $\|\alpha - \alpha_{old}\| \le \epsilon$, stop; else set $\alpha_{old} = \alpha$ and go to step 2. The quantity $\epsilon$ is
a small positive real number used to stop the iteration.
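Since Eq. (9) is not reproduced in this excerpt, the sketch below substitutes the standard Newton-style IRLS update for logistic regression in its place; the weighting λ_n = y_n(1 − y_n), its normalization, and the stopping rule follow steps 1–4 above, while the update formula itself and the function names are our own assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls(D, c, eps=1e-8, max_iter=100):
    """Iteratively reweighted least squares on compressed inputs D with
    binary targets c, following the enumerated steps in the text."""
    alpha = np.zeros(D.shape[1])
    for _ in range(max_iter):
        y = sigmoid(D @ alpha)
        r = np.maximum(y * (1.0 - y), 1e-12)   # step 2: lam_n = y_n (1 - y_n)
        lam = r / r.sum()                      # step 2: normalize so sum = 1
        # Newton working response (standard logistic IRLS, substituted for
        # the paper's Eq. (9), which is not reproduced here)
        z = D @ alpha + (c - y) / r
        W = np.diag(lam)
        alpha_new = np.linalg.solve(D.T @ W @ D + 1e-10 * np.eye(D.shape[1]),
                                    D.T @ W @ z)
        if np.linalg.norm(alpha_new - alpha) <= eps:   # step 4
            return alpha_new
        alpha = alpha_new
    return alpha
```

Because the normalization of λ rescales both sides of the normal equations equally, it leaves the solution of each inner step unchanged; it only keeps the weights on a fixed scale as required by step 2.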
112 Y. Kassahun et al.
In the experiment, two kinds of visual stimuli were presented to the test person:
irrelevant “standards” and relevant “targets”. When a target was presented,
the test person had to react with a movement of the right arm. The ratio between
standards and targets was 8:1. The data was acquired from 8 different subjects
in 3 distinct sessions per subject. It was recorded at a 5 kHz sampling rate with 136
electrodes, of which 124 were used to record EEG data, 4 to record electrooculogram
(EOG) data, and 8 to record electromyogram data. For the experiments we used
64 EEG electrodes (in accordance with the extended 10-20 system, with reference
at electrode FCz). The data was acquired using an actiCap system (Brain Products
GmbH, Munich, Germany) and amplified by four 32-channel BrainAmp DC
amplifiers (Brain Products GmbH, Munich, Germany). To estimate the effect of the
compression on the classification performance, we performed a stratified 2-fold
cross-validation on all datasets and repeated this experiment 100 times. All epochs
of the data were processed as follows: (1) standardization (the mean of the data in
the epoch was subtracted and the result divided by the standard deviation);
(2) decimation to 25 Hz (the data was first filtered with an anti-aliasing filter and
afterwards subsampled); (3) low-pass filtering with a cut-off frequency of 4 Hz;
(4) standardization per feature; (5) compression; (6) classification with an SVM,
since we showed that learning in compressed parameter space is equivalent to
learning in compressed input space. To improve the classification performance of
the SVM, in each training attempt 7 different complexity values were investigated
and the best one was chosen with 5-fold cross-validation. Figure 1 shows the
classification performance (balanced accuracies [13]) and training times for
different compression rates. As can be seen from the figure, it is possible to reduce
the training times of SVMs by learning in compressed input space. Note that the
training time is reduced by a factor of approximately 11, with only a slight loss
of performance, for an SVM trained in compressed input space at a compression
rate of 80% (fraction left 0.2).
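With standards outnumbering targets 8:1, plain accuracy would be dominated by the majority class; balanced accuracy [13], used in Figure 1, averages the per-class recalls instead. A minimal sketch (the function name is ours):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls; insensitive to the 8:1 class imbalance
    between 'standards' and 'targets' described in the text."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)
```

A classifier that always predicts "standard" would score 8/9 in plain accuracy on a 8:1 split but only 0.5 in balanced accuracy, which is why the latter is the appropriate metric here.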
Fig. 1. Balanced accuracy (a) and training times (b) versus fraction of data left for
training. A fraction of one means that the original data is used. The training time
corresponds to steps (4), (5) and (6) described above.
The augmented neural network has been tested on the difficult versions of the
single and double pole balancing without velocities benchmarks [4], and has
achieved significantly better results on these benchmarks than the published
results of other algorithms to date. Table 1 shows the performance of learning in
compressed parameter space for both the single and double pole balancing experiments
without velocity information. For this experiment, evolution of the augmented
neural network in compressed parameter space significantly outperforms the
evolution of a recurrent neural network in compressed parameter space. The increase
in performance is due to the simplification of the neural networks through αβ filters.
For CoSyNE, the performance in the compressed parameter space got worse,
which we suspect is due to the recurrent connections in the recurrent
neural networks used to solve the problems.
Table 1. Results for the single and double pole-balancing benchmarks. Average over
50 independent evolutions. DOF stands for discrete orthogonal functions.
8 Conclusion
For supervised learning, we have shown that it is possible to accelerate training
in compressed input space. For reinforcement learning, we have shown that by
evolving the parameters of the augmented neural network in a compressed pa-
rameter space, it is possible to accelerate neuroevolution for partially observable
domains. The results presented for reinforcement learning are preliminary since
(1) the problem considered is not difficult and (2) the number of parameters to
optimize in compressed parameter space is not large. Therefore, the method has
to be tested on more complex problems to assess the feasibility of evolving
augmented neural networks in compressed parameter space.
In the future, we would like to extend the method to non-linear models such as
neural networks, and test the method on standard benchmark problems for both
supervised and reinforcement learning.
References
1. Calderbank, R., Jafarpour, S., Schapire, R.: Compressed learning: Universal sparse
dimensionality reduction and learning in the measurement domain. Technical re-
port (2009)
2. Davenport, M.A., Duarte, M.F., Wakin, M.B., Laska, J.N., Takhar, D., Kelly, K.F.,
Baraniuk, R.G.: The smashed filter for compressive classification and target recog-
nition. In: Proceedings of Computational Imaging V at SPIE Electronic Imaging,
San Jose, CA (January 2007)
3. Donoho, D.L.: Compressed sensing. IEEE Transactions on Information The-
ory 52(4), 1289–1306 (2006)
4. Gomez, F.J., Miikkulainen, R.: Robust non-linear control through neuroevolution.
Technical Report AI-TR-03-303, Department of Computer Sciences, The Univer-
sity of Texas, Austin, USA (2002)
5. Gomez, F.J., Schmidhuber, J., Miikkulainen, R.: Efficient Non-linear Control
Through Neuroevolution. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.)
ECML 2006. LNCS (LNAI), vol. 4212, pp. 654–662. Springer, Heidelberg (2006)
6. Hansen, N., Ostermeier, A.: Completely derandomized self-adaptation in evolution
strategies. Evolutionary Computation 9(2), 159–195 (2001)
7. Haupt, J., Castro, R., Nowak, R., Fudge, G., Yeh, A.: Compressive sampling for
signal classification. In: Proceedings of the 40th Asilomar Conference on Signals,
Systems and Computers, pp. 1430–1434 (2006)
8. Kassahun, Y., de Gea, J., Edgington, M., Metzen, J.H., Kirchner, F.: Accelerating
neuroevolutionary methods using a Kalman filter. In: Proceedings of the 10th Annual
Conference on Genetic and Evolutionary Computation (GECCO), pp. 1397–1404.
ACM, New York (2008)
9. Koutník, J., Gomez, F., Schmidhuber, J.: Searching for minimal neural networks
in Fourier space. In: Baum, E., Hutter, M., Kitzelmann, E. (eds.) Proceedings of
the Third Conference on Artificial General Intelligence (AGI), pp. 61–66. Atlantic
Press (2010)
10. Koutník, J., Gomez, F.J., Schmidhuber, J.: Evolving neural networks in compressed
weight space. In: Proceedings of the Genetic and Evolutionary Computation
Conference (GECCO), pp. 619–626. ACM, New York (2010)
11. Maillard, O., Munos, R.: Compressed least-squares regression. In: Bengio, Y., Schu-
urmans, D., Lafferty, J., Williams, C.K.I., Culotta, A. (eds.) Advances in Neural
Information Processing Systems (NIPS), pp. 1213–1221 (2009)
12. Schaul, T., Schmidhuber, J.: Towards Practical Universal Search. In: Proceedings
of the Third Conference on Artificial General Intelligence (AGI), Lugano (2010)
13. Velez, D.R., White, B.C., Motsinger, A.A., Bush, W.S., Ritchie, M.D., Williams,
S.M., Moore, J.H.: A balanced accuracy function for epistasis modeling in im-
balanced datasets using multifactor dimensionality reduction. Genetic Epidemiol-
ogy 31(4), 306–315 (2007)
14. Zander, T.O., Kothe, C.: Towards passive brain computer interfaces: applying brain
computer interface technology to human machine systems in general. Journal of
Neural Engineering 8(2), 025005 (2011)
15. Zhou, S., Lafferty, J.D., Wasserman, L.A.: Compressed regression. In: Platt, J.C.,
Koller, D., Singer, Y., Roweis, S.T. (eds.) Advances in Neural Information Pro-
cessing Systems (NIPS), Curran Associates, Inc. (2008)
Control of a Free-Falling Cat
by Policy-Based Reinforcement Learning
1 Introduction
A nonlinear dynamical system is said to be 'nonholonomic' if its constraints cannot
be reduced to algebraic equations consisting only of generalized coordinates
$x \in \mathbb{R}^n$ and time $t$ [1]; that is, it cannot be reduced to a form like
$h(x, t) = 0 \in \mathbb{R}^m$ $(n \ge m)$ but is represented by differential equations like
$h(x, \dot{x}, \ddot{x}, t) = 0 \in \mathbb{R}^m$. Cars, space robots, submarines, and other underactuated
systems are examples of nonholonomic systems. According to Brockett's theorem,
however, nonholonomic systems cannot be asymptotically stabilized by static
and smooth feedback control, indicating the difficulty of designing a controller
for such systems [2]. There have been many studies of the control of nonholonomic
systems. However, they are mostly heuristic approaches specialized for target
tasks, and it is difficult to generalize such heuristic approaches into a general and
unified control method. Moreover, the accurate dynamics of target systems must
be known to establish such a heuristic controller, although they are often unknown
or only partially known in practical situations.
In this study, we propose a reinforcement learning (RL) approach, which is a
kind of autonomous control method and is applicable even when the detailed
dynamics of target systems are unknown. It enables an adaptive controller (an agent) to
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 116–123, 2012.
c Springer-Verlag Berlin Heidelberg 2012
Control of a Free-Falling Cat by Policy-Based Reinforcement Learning 117
acquire the optimal controller (policy) that maximizes the cumulative or average
reward through trial and error with the target system. RL approaches are classified
into two major categories: one consists of value-function-based methods, in
which the policy is updated so as to enlarge the value function according to
the scenario of policy iteration, and the other consists of policy search methods,
in which policy parameters are directly updated so as to increase the objective
function. The latter category, policy search methods, is further divided in
two in terms of optimization techniques. One is the policy gradient method [6],
in which the policy parameters are updated based on the approximate gradient
of the objective function. In the other class, any kind of 'meta-heuristic'
optimization technique, such as a genetic algorithm (GA), can be used. GAs have been
successfully applied to optimize control policies, represented by means of instances,
and have enabled the control of nonholonomic systems [7][8]. While a GA can
be applied even when the objective function is not differentiable with respect to
the policy parameters, there is almost no guideline for optimization with respect
to high-dimensional policy parameters; in a GA, the dependence of the objective
function on the parameters is obscure.
In contrast, in policy gradient methods, more efficient optimization can be
performed because the (approximate) gradient of the objective function with
respect to the policy parameters represents the knowledge of the target system,
i.e., how the objective function depends on the parameters to be optimized.
Thus, in this study, we propose to use a policy gradient method to exploit the
basic knowledge of the target system. In particular, GPOMDP [8], one of the
policy gradient methods, is used. As a typical and interesting example
of nonholonomic systems, we focus on a falling-cat system. Even when a cat
begins falling on its back, it can twist its body, put its feet down and land
safely. Such a cat's motion in the air is called a falling-cat motion. During the
falling-cat motion, its angular momentum must be conserved, which constitutes a
nonholonomic constraint [9]. To fully utilize the inherent properties of the system,
we also propose to use a stochastic policy incorporating normalized radial basis
functions, which is suitable for control in a periodic state space. We will show
that our approach, which makes use of these inherent properties of the system,
enables faster learning than the existing GA method.
from the external force, the total angular momentum of the cylinders should be
conserved along the whole movement of the model, which provides a nonholonomic
constraint. Thus, although the state is represented by three-dimensional
variables, its degrees of freedom are constrained to two. So, the two angular velocities,
u = [ψ̇, γ̇], are assumed to be directly controllable. In a continuous-time domain,
this model is described by

$$\dot{x} = G(x)u, \qquad G(x) = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ f_\psi(\psi, \gamma) & f_\gamma(\psi, \gamma) \end{bmatrix}, \qquad (1)$$

$$f_\psi(\psi, \gamma) = \frac{\cos\gamma\,\sin^2\gamma\,[\rho + (1 - \epsilon)\cos^2\psi]}{(1 - \sin^2\gamma\,\cos^2\psi)\,[1 + (\rho - \cos^2\psi)\sin^2\gamma]},$$

$$f_\gamma(\psi, \gamma) = \frac{\cos\psi\,\sin\psi\,\sin\gamma\,(1 - \epsilon + \rho\sin^2\gamma)}{(1 - \sin^2\gamma\,\cos^2\psi)\,[1 + (\rho - \cos^2\psi)\sin^2\gamma]},$$

where $\rho$ and $\epsilon$ are scalars that determine the system dynamics and are set to
$\rho = 3$ and $\epsilon = 0.01$, respectively, following the existing study [11]. As can be
seen from this equation, the system dynamics are highly nonlinear although the
object structure is quite simple.
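The dynamics of Eq. (1) can be simulated with a simple forward Euler step; ρ and ε follow the values given in the text, while the step size and the function names are our own illustrative choices.

```python
import math

RHO, EPS = 3.0, 0.01   # rho and epsilon from the text

def f_psi(psi, gamma):
    num = math.cos(gamma) * math.sin(gamma) ** 2 * (RHO + (1 - EPS) * math.cos(psi) ** 2)
    den = ((1 - math.sin(gamma) ** 2 * math.cos(psi) ** 2)
           * (1 + (RHO - math.cos(psi) ** 2) * math.sin(gamma) ** 2))
    return num / den

def f_gamma(psi, gamma):
    num = math.cos(psi) * math.sin(psi) * math.sin(gamma) * (1 - EPS + RHO * math.sin(gamma) ** 2)
    den = ((1 - math.sin(gamma) ** 2 * math.cos(psi) ** 2)
           * (1 + (RHO - math.cos(psi) ** 2) * math.sin(gamma) ** 2))
    return num / den

def step(x, u, dt=0.02):
    """One Euler step of x_dot = G(x) u for x = (psi, gamma, phi):
    psi and gamma are driven directly by u, while phi follows the
    nonholonomic constraint through f_psi and f_gamma."""
    psi, gamma, phi = x
    dpsi, dgamma = u
    dphi = f_psi(psi, gamma) * dpsi + f_gamma(psi, gamma) * dgamma
    return (psi + dt * dpsi, gamma + dt * dgamma, phi + dt * dphi)
```

The sketch makes the nonholonomy concrete: the controller never acts on φ directly, yet φ changes through the velocity-dependent coupling in the third row of G(x).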
In our simulation, this continuous-time system is observed, in terms of state
and reward, every 0.02 sec. A controller (policy) produces a control signal (an
action) immediately after the observation, and the control signal is continuously
applied until the next observation for the interval of 0.02 sec. Since the system
is observed intermittently, it can be approximated by a discrete-time system in
which a state xt and a reward rt are observed, and an action ut is taken at each
time step t = 0, 1, · · · . The initial state is fixed at x0 = [0, 0, 0], where the cat is
in the supine position. A control sequence of 5.0 sec from the initial state is defined
as an episode.
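The kinematics of Eq. (1) can be integrated numerically. The sketch below holds a control for one 0.02 sec observation interval using Euler substeps; it assumes the reconstructed placement of ε in the denominators of f_ψ and f_γ, and the function names are illustrative:

```python
import numpy as np

RHO, EPS = 3.0, 0.01  # rho and epsilon as set in [11]

def f_psi(psi, gamma):
    num = np.cos(gamma) * np.sin(gamma) ** 2 * (RHO + (1 - EPS) * np.cos(psi) ** 2)
    den = (1 - EPS * np.sin(gamma) ** 2 * np.cos(psi) ** 2) * \
          (1 + (RHO - EPS * np.cos(psi) ** 2) * np.sin(gamma) ** 2)
    return num / den

def f_gamma(psi, gamma):
    num = np.cos(psi) * np.sin(psi) * np.sin(gamma) * (1 - EPS + RHO * np.sin(gamma) ** 2)
    den = (1 - EPS * np.sin(gamma) ** 2 * np.cos(psi) ** 2) * \
          (1 + (RHO - EPS * np.cos(psi) ** 2) * np.sin(gamma) ** 2)
    return num / den

def step(x, u, dt=0.02, substeps=20):
    """Hold control u = [dpsi, dgamma] for dt seconds; Euler integration."""
    psi, gamma, phi = x
    dpsi, dgamma = u
    h = dt / substeps
    traj = []
    for _ in range(substeps):
        # phi_dot = f_psi * psi_dot + f_gamma * gamma_dot, from Eq. (1)
        dphi = f_psi(psi, gamma) * dpsi + f_gamma(psi, gamma) * dgamma
        psi, gamma, phi = psi + h * dpsi, gamma + h * dgamma, phi + h * dphi
        traj.append([psi, gamma, phi])
    return np.array(traj)
```

With 20 substeps per interval, the returned intermediate states correspond to the x_{t,k} appearing in the reward definition.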
The objective of RL is defined as the maximization of the average reward
η = \frac{1}{T}\sum_{t=0}^{T-1} r_{t+1}. Here, the reward r_{t+1} = r(x_t, u_t) is given by the sum of
instantaneous rewards R(x_{t,k}) along the trajectory between time steps t
and t+1, that is, r(x_t, u_t) = \sum_{k=0}^{K-1} R(x_{t,k}), where x_{t,k} denotes the k-th intermediate
state on the local trajectory, with x_{t,0} = x_t and x_{t,K} = x_{t+1}. The function

R(x) = \frac{\lambda_1}{(x-x_g)^\top \Lambda (x-x_g) + 1}

becomes maximal when the system is at the goal state
x_g = [2π, 0, π], i.e., when the cat is in the prone position. Here, Λ = diag(λ_ψ, λ_γ, λ_φ) contains
weight parameters that represent the importance of the three state variables,
and was set to [λ_ψ, λ_γ, λ_φ] = [0.6, 0.6, 0.6]. In the later simulation, K and λ_1 were
set to K = 20 and λ_1 = 10, respectively.
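Under these definitions, the reward computation is straightforward; a minimal sketch (the diagonal form of Λ follows from the stated weights, and names are illustrative):

```python
import numpy as np

LAM1 = 10.0                                 # lambda_1
LAMBDA = np.diag([0.6, 0.6, 0.6])           # importance weights for [psi, gamma, phi]
X_GOAL = np.array([2 * np.pi, 0.0, np.pi])  # prone position

def instantaneous_reward(x):
    # R(x) = lambda_1 / ((x - x_g)^T Lambda (x - x_g) + 1)
    d = x - X_GOAL
    return LAM1 / (d @ LAMBDA @ d + 1.0)

def step_reward(intermediate_states):
    # r(x_t, u_t): sum of R over the K intermediate states x_{t,k}
    return sum(instantaneous_reward(x) for x in intermediate_states)
```

The reward is maximal (equal to λ₁ = 10) exactly at the goal state and decays smoothly with the weighted squared distance from it.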
Control of a Free-Falling Cat by Policy-Based Reinforcement Learning 119
3 Method
Our RL algorithm is based on GPOMDP [10], a policy gradient method. In this
section, we describe the learning algorithm and its implementation.
To represent the system’s periodic character with respect to the system angles, the basis
function b_j(x) is described by cosine functions as
b_j(x) = \exp[\sigma\{\cos(\psi-\psi_j^b)+\cos(\gamma-\gamma_j^b)+\cos(\phi-\phi_j^b)\}], \qquad (3)
where σ and x_j^b = [ψ_j^b, γ_j^b, φ_j^b] denote the width and center of the basis function,
respectively. The basis centers x_j^b (j = 1, · · · , N) were arranged on a regular grid
independently for each dimension. Putting l centers in the interval [0, 2π] for ψ,
m centers in the interval [−π, π] for γ, and n centers in the interval [0, 2π] for φ,
we have in total l × m × n = N grid points, leading to the basis function vector b(x).
For representing the distribution center, w_i(p_i)^⊤ b(x_t), we used the normalized
radial basis function (RBF) vector b(x), rather than the original RBF outputs b_j(x). The
normalized RBF provides good interpolation without being affected by the allocation
of the centers {x_j^b | j = 1, · · · , N}, whereas the original RBF outputs
tend to be small if the input x is far from all of the basis centers. By using
the basis function vector and the weight vector described above, it is guaranteed
that not only the action value u_{t,i} but also the distribution mode w_i(p_i)^⊤ b(x_t)
are constrained within the interval [−l_i, l_i].
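A sketch of the periodic basis of Eq. (3) with the normalization applied afterwards; the grid construction and function names are illustrative:

```python
import numpy as np

def make_centers(l, m, n):
    # regular grid: l centers for psi in [0, 2pi), m for gamma in [-pi, pi),
    # n for phi in [0, 2pi); N = l * m * n centers in total
    psi_c = np.linspace(0, 2 * np.pi, l, endpoint=False)
    gam_c = np.linspace(-np.pi, np.pi, m, endpoint=False)
    phi_c = np.linspace(0, 2 * np.pi, n, endpoint=False)
    grid = np.array(np.meshgrid(psi_c, gam_c, phi_c, indexing="ij"))
    return grid.reshape(3, -1).T          # shape (N, 3)

def normalized_basis(x, centers, sigma=1.0):
    # b_j(x) = exp[sigma * (cos(psi - psi_j) + cos(gamma - gamma_j) + cos(phi - phi_j))]
    b = np.exp(sigma * np.cos(x[None, :] - centers).sum(axis=1))
    return b / b.sum()                    # normalized RBF vector
```

Because every term is a cosine of an angle difference, the basis is invariant under shifting any state variable by 2π, which is exactly the periodicity the text calls for.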
3.2 GPOMDP
In this study, we use GPOMDP [10], one of the policy gradient methods, for RL.
In the policy gradient approach, policy parameters are updated so as to increase
the objective function, the average reward per episode, without estimating the
value function.
In the policy gradient method, policy parameters are updated by

\theta_{h+1} = \theta_h + \beta_{h+1} \nabla_\theta \eta, \qquad (4)

where h counts the number of parameter updates and \beta_h = \frac{\beta_0}{\delta L h + 1} is the reciprocal-linearly
scaled learning rate; β_0, δ, and L are an initial learning rate, a
decay constant, and the number of episodes used for calculating the gradient, respectively,
all pre-determined. ∇_θ η denotes the derivative of the average
reward η = η(θ) with respect to θ.
However, ∇θ η cannot be calculated analytically since the calculation of the
gradient of the average reward requires the unknown state transition probability.
Therefore, we approximate ∇θ η by GPOMDP. GPOMDP is advantageous, be-
cause no information of the system is necessary in the gradient estimation and the
required memory capacity is small. In our implementation, each policy parameter
was updated independently, and the average of L gradients ∇_θ η_v (v = 1, · · · , L),
each estimated from one of the L episodes ξ_v = [x_{0:T}^v, u_{0:T−1}^v], is used for
the policy update in order to suppress the variance of the gradient estimate
and to permit parallel computation.
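A minimal sketch of the GPOMDP gradient estimate for one episode. The discount controlling the eligibility trace is GPOMDP's bias–variance parameter from [10], not the learning rate β of Eq. (4); ∇_θ log π must be supplied by the policy, so the callback name here is an assumption:

```python
import numpy as np

def gpomdp_gradient(episode, grad_log_pi, discount=0.9):
    """Estimate the average-reward gradient from one episode.

    episode: sequence of (x_t, u_t, r_{t+1}) tuples.
    grad_log_pi(x, u): gradient of log pi_theta(u | x) w.r.t. theta.
    """
    z = 0.0      # eligibility trace
    delta = 0.0  # accumulated gradient estimate
    for x, u, r in episode:
        z = discount * z + grad_log_pi(x, u)  # decay trace, add score function
        delta = delta + r * z                  # weight trace by observed reward
    return delta / len(episode)
```

Averaging such per-episode estimates over the L episodes ξ_v gives the gradient used in the parameter update, which also parallelizes naturally across episodes.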
4 Simulations
Our RL algorithm was evaluated by numerical simulations.
4.2 Result
Fig. 3(a) shows the learning curve, in which the abscissa and ordinate denote
the number of training episodes and the average reward obtained in a single
episode, respectively. The average reward increases rapidly up to 10,000 episodes
and then slowly after that, suggesting that RL obtained a control that enables the
system to reach the goal state by around 10,000 episodes, and thereafter a better
control that achieves the goal faster. Interestingly, this learning process is fairly
stable, as can be seen in the relatively small standard deviation over the 10
training runs. After 100,000 training episodes, the controller drives the system
to the goal state in about 2.0 sec (Fig. 3(b)). This efficient control can also be
seen in Fig. 4; within 1.6 sec the target prone position is realized, although the
control trajectory from the initial supine position to the goal prone position is
not simple.
Fig. 3. (a) Learning curve; the mean and standard deviation over 10 simulation runs are
shown. (b) A control trajectory, the series of state variables (upper panel) and of
control signals (lower panel), after 100,000 training episodes. Here, the deterministic
policy, in which the mean action w_i(p_i)^⊤ b(x_t) is always taken, was used. In the upper
panel, the goal-state values x_g of the three state variables are depicted by thin straight lines.
(c) Fitness comparison between the GA study [8] and ours.
Fig. 4. Snapshots of a control trajectory after 100,000 training episodes, every 0.02 sec
from the initial state x_0 to the last state at 5.0 sec. The two small pins on the cylinders
indicate the direction of the cat's feet.
Fig. 3(c) shows the average fitness curves of our RL method (solid line) and
of the GA method [8] (dashed line). Both fitness curves are averages over 10
simulation runs. Although directly optimizing the fitness function, as the GA does,
is advantageous for increasing the fitness itself, our objective function, the average
reward, is highly correlated with the fitness function. The higher fitness suggests
that our algorithm obtained a policy that reaches the goal with smaller control
inputs.
5 Conclusion
A unified control law which can handle any nonholonomic systems has not yet
been discovered. In this study, we proposed an RL approach to the control prob-
lem of systems which cannot be well controlled by usual feedback control and
References
1. Nakamura, Y.: Nonholonomic robot systems, Part 1: what’s a nonholonomic robot?
Journal of RSJ 11, 521–528 (1993)
2. Brockett, R.W.: Asymptotic stability and feedback stabilization. Progress in Math-
ematics 27, 181–208 (1983)
3. Mita, T.: Introduction to nonlinear control Theory-Skill control of underactuated
robots. SHOKODO Co., Ltd. (2000) (in Japanese)
4. Murray, R.M., Sastry, S.S.: Nonholonomic motion planning: steering using sinu-
soids. IEEE Transactions on Automatic Control 38, 700–716 (1993)
5. Holamoto, S., Funasako, T.: Feedback control of a planar space robot using a
moving manifold. Journal of RSJ 25, 745–751 (1993)
6. Peters, J., Schaal, S.: Reinforcement learning of motor skills with policy gradients.
Neural Networks 21, 682–697 (2008)
7. Miyamae, A., et al.: Instance-based policy learning by real-coded genetic algo-
rithms and its application to control of nonholonomic systems. Transactions of the
Japanese Society for Artificial Intelligence 24, 104–115 (2009)
8. Tsuchiya, C., et al.: SLIP: A sophisticated learner for instance-based policy using
hybrid GA. Transactions of SICE 42, 1344–1352 (2006)
9. Nakamura, Y., Mukherjee, R.: Nonholonomic path planning of space robots via a
bidirectional approach. IEEE Transactions on Robotics and Automation 7, 500–514
(1991)
10. Baxter, J., Bartlett, P.L.: Infinite-horizon policy-gradient estimation. Journal of
Artificial Intelligence Research 15, 319–350 (2001)
11. Ge, X., Chen, L.: Optimal control of nonholonomic motion planning for a free-
falling cat. Applied Mathematics and Mechanics 28, 601–607 (2007)
Gated Boltzmann Machine in Texture Modeling
1 Introduction
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 124–131, 2012.
© Springer-Verlag Berlin Heidelberg 2012
that, one can improve the understanding of objects in complex real-world recog-
nition tasks.
In this paper, a new type of building block for deep networks is explored
for texture modeling. The new model captures the local relationships within a
texture in a biologically plausible manner. Instead of searching exhaustively
over the whole image patch, we propose to search for local structures in a
smaller region of interest. Also, due to the complexity of the model, a novel
learning scheme for it is proposed.
2 Background
2.1 Co-occurrence Matrices in Texture Classification
The co-occurrence matrix [6] measures how frequently a pair of pixels with a certain
offset takes particular values. Modeling co-occurrence matrices instead of pixels
immediately brings the analysis to a more abstract level, and it has therefore
been used in texture modeling.
The co-occurrence matrix C is defined over an {M × N} image I, where
{1 . . . N_g} gray-scale levels are used to model pixel intensities. Under this
assumption, the size of C is {N_g × N_g}. Each entry of C is defined by

c_{ij} = \sum_{m=1}^{M} \sum_{n=1}^{N} \begin{cases} 1 & \text{if } I(m,n) = i \text{ and } I(m+\delta_x, n+\delta_y) = j \\ 0 & \text{otherwise} \end{cases} \qquad (1)
Different offset schemes for {δx , δy } result in different co-occurrence matrices. For
instance, one can look for textural pattern over an image with offset {−1, 0} or
{0, 1}. These different co-occurrence matrices typically have information about
the texture from different orientations. Therefore, a set of invariant features can
be obtained by having several different co-occurrence matrices together.
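The definition above can be sketched directly as a naive double loop; library routines for this exist, but the sketch below simply mirrors the formula (names are illustrative):

```python
import numpy as np

def cooccurrence(img, offset, levels):
    """Co-occurrence matrix C for a gray-level image.

    img:    2-D integer array with values in {0, ..., levels - 1}
    offset: (delta_x, delta_y) displacement between the pixel pair
    """
    dx, dy = offset
    M, N = img.shape
    C = np.zeros((levels, levels), dtype=int)
    for m in range(M):
        for n in range(N):
            mm, nn = m + dx, n + dy
            if 0 <= mm < M and 0 <= nn < N:   # count only in-bounds pairs
                C[img[m, n], img[mm, nn]] += 1
    return C
```

Stacking C for several offsets, e.g. {−1, 0} and {0, 1}, yields the orientation-dependent feature set described above.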
Fig. 1. Model structures: (a) GRBM(X), (b) GGBM, (c) GRBM(X,T)
3 Proposed Method
Combining the nature of texture information and GGBM, a modified GGBM
especially suitable for texture modeling is proposed. To start with, we consider
a slightly modified general gated Boltzmann machine where there are pair-wise
connections between all sets of nodes. This model has the most comprehensive
information about the input vectors. Accordingly, the energy function of the
model is written as
E(\mathbf{x}, \mathbf{y}, \mathbf{h}) = -\sum_{ijk} \frac{x_i y_j}{\sigma_i \sigma_j} h_k w_{ijk} - \sum_{ij} \frac{x_i y_j}{\sigma_i \sigma_j} u_{ij} + \sum_i \frac{(x_i - b_i^x)^2}{4\sigma_i^2} + \sum_j \frac{(y_j - b_j^y)^2}{4\sigma_j^2} - \sum_k h_k c_k - \sum_{ik} \frac{x_i h_k v_{ik}^{(1)}}{2\sigma_i^2} - \sum_{jk} \frac{y_j h_k v_{jk}^{(2)}}{2\sigma_j^2} \qquad (4)

where u_{ij}, v_{ik}^{(1)}, and v_{jk}^{(2)} are additional parameters that model the pair-wise
connections between the two sets of visible neurons {x, y} and the hidden neurons h. Instead
of looking for an image transformation, we seek the internal structure of
texture information. Therefore, the same patch of image is fed to the two sets
of visible neurons, that is, x = y. Accordingly, the weights V and biases b of the
two sets of visible neurons are tied: V = V^(1) = V^(2) and b = b^x = b^y.
Also, a unified variance σ² = σ_i² = σ_j² is learned to reduce the complexity of the
model further. Even so, the model remains complex, as the weight tensor w_{ijk}
still needs huge learning efforts. As x = y, xi and yj can be considered a pair
of pixels, and hk is learned to model this interaction. Given an image patch, the
traditional GGBM will go through all the combinations of such pairs. This is
highly redundant, as the texture is repetitive within a very small region. Recalling
that the co-occurrence matrix summarizes the interaction of pairs of pixels
over a certain area, this structure can be introduced into GGBM. In order to do
that, we assume w_{ijk} = w_{dk}, such that the weight w_{ijk} depends only on the
displacement d and the hidden neuron h_k, where d represents the offset from i to j.
Similarly, u_{ij} = u_d. One can think of w_{dk} and u_d as a convolutional model
applied only over the local regions of image patches. Convolutional approximations
have proven rather successful in other applications such as image recognition
tasks [10]. It is further assumed that w_{dk} = 0 for large displacements d.
After these simplifications, the energy function (4) becomes

E(\mathbf{x}, \mathbf{y}, \mathbf{h}) = -\frac{1}{\sigma^2}\sum_{ijk} x_i y_j h_k w_{d_{ij} k} - \frac{1}{\sigma^2}\sum_{ij} x_i y_j u_{d_{ij}} + \frac{1}{2\sigma^2}\sum_i (x_i - b_i)^2 - \frac{1}{\sigma^2}\sum_{ik} x_i h_k v_{ik} - \sum_k h_k c_k \qquad (5)
We also define a related but much simpler model as follows. Firstly, we define
auxiliary variables t_d = \sum_i x_i y_{i+d}, where d is the offset between pixels i and j
as before. This formulation stems from the principle of the co-occurrence matrix,
where each feature is related only to particular pairs of pixels in the image.
These computations can be done as a preprocessing step. Secondly, we learn
a GRBM using the concatenation of vectors [x, t] as data. We call this model
the GRBM(X,T) and illustrate it in Figure 1c. In the figure, the dashed line
represents t being computed from x.
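For a flattened patch x, the preprocessing step can be sketched as follows; this is a 1-D sketch for brevity, whereas in the model d is a 2-D displacement within the patch:

```python
import numpy as np

def pair_features(x, offsets):
    # t_d = sum_i x_i * x_{i+d}; with x = y this matches the auxiliary variables
    return np.array([float(np.dot(x[:-d], x[d:])) for d in offsets])
```

The concatenation [x, pair_features(x, offsets)] then serves as one training vector for the GRBM(X,T).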
When we write the energy function of GRBM(X,T)
E(\mathbf{x}, \mathbf{t}, \mathbf{h}) = -\frac{1}{\sigma^2}\left(\sum_{ik} x_i h_k v_{ik} + \sum_{dk} t_d h_k w_{dk}\right) - \sum_k h_k c_k + \frac{1}{2\sigma^2}\left(\sum_i (x_i - b_i)^2 + \sum_d (t_d - u_d)^2\right), \qquad (8)
we notice the similarities to the GGBM energy function in Equation (5). Each
parameter has its corresponding counterpart. The only remaining difference is
E(\mathbf{x}, \mathbf{t}, \mathbf{h}) - E(\mathbf{x}, \mathbf{y}, \mathbf{h}) = \frac{1}{2\sigma^2}\sum_d t_d^2 + \text{const.} \qquad (9)
It turns out p(h|x, y) can be written in the exact same form as in Equation (7).
Since learning higher order Boltzmann machines is known to be quite difficult,
we propose to use this related model as a way for learning them. So in practice we
first train a GRBM(X,T), and then convert the parameters to the GGBM model.
Actually, in texture classification, the converted model produces exactly the same
hidden activations h and thus the same classification results. On the other hand,
in the texture reconstruction problem, the GRBM(X,T) model cannot be used
directly, since t cannot be computed from partial observations.
We noticed experimentally that the converted GGBM model needs to be
further regularized, since the regularizing terms t_d² in the energy function of
GRBM(X,T) are dropped, as seen in Equation (9). We simply converted w_{dk}
and u_d by scaling them with a constant factor smaller than 1, and chose the
constant that gave the smallest validation reconstruction error.
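The scale selection described above amounts to a small grid search; a sketch in which `recon_error` is an assumed callback that rescales w_{dk} and u_d by the candidate factor and returns the validation reconstruction error:

```python
import numpy as np

def select_scale(candidates, recon_error):
    # pick the scaling factor (< 1) with the smallest validation error
    errors = [recon_error(c) for c in candidates]
    return candidates[int(np.argmin(errors))]
```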
4 Experiments
KTH Texture Dataset. This dataset [1] has 11 different textures, 4 different
samples for each texture, and 108 different images for each sample. Each image
is of size {200 × 200}, and the patch size is again selected as 20 × 20. Only the
108 images from sample a2 of each texture are used: 54 for generating training
samples and 54 for generating testing samples. 118,800 patches are used for
extracting the features, 11,000 patches are used for training a classifier, and
1,100 samples are used for testing. The best result is obtained with the proposed
method. Note that a poorer overall performance is expected, as the variations
within the training samples make the problem harder. The detailed results are
shown in Table 1b.
Fig. 2. The texture reconstruction experiment. The first row shows random samples
with missing centers. The second row shows the reconstructions from the GRBM model,
and the reconstructions from the proposed model are shown in the third row. The original
samples are shown in the last row.
5 Conclusions
In this paper, we tackled the problem of modeling texture information. We
proposed a modified version of GGBM and a simpler learning algorithm for it.
² Available at http://www.nada.kth.se/cvap/databases/kth-tips/
From the experimental results, we can argue that the proposed model is beneficial
for modeling structured information such as textures. Among all the results,
the highest accuracies are obtained by the features learned from the proposed
model. Although these accuracies are not state-of-the-art, the proposed model
opens up the possibility of successfully modeling texture information using a
higher-order Boltzmann machine.
References
1. Caputo, B., Hayman, E., Mallikarjuna, P.: Class-Specific Material Categorisation.
In: Int. Conf. on Computer Vision, pp. 1597–1604 (2005)
2. Cho, K., Raiko, T., Ilin, A.: Gaussian-Bernoulli Deep Boltzmann Machine. In: NIPS
2011 Workshop on Deep Learning and Unsupervised Feature Learning (2011)
3. Cho, K., Ilin, A., Raiko, T.: Improved Learning of Gaussian-Bernoulli Restricted
Boltzmann Machines. In: Int. Conf. on Artificial Neural Networks, pp. 10–17 (2011)
4. Cireşan, D.C., Meier, U., Gambardella, L.M., Schmidhuber, J.: Deep, Big, Simple
Neural Nets for Handwritten Digit Recognition. Neural Comput. 22(12) (2010)
5. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A Li-
brary for Large Linear Classification. JMLR, 1871–1874 (2008)
6. Haralick, R.M., Shanmugam, K., Dinstein, I.: Textural Features for Image Classi-
fication. IEEE Trans. Syst., Man, Cybern. 3(6), 610–621 (1973)
7. Hinton, G., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural
networks. Science 313(5786), 504–507 (2006)
8. Hinton, G., Salakhutdinov, R.: Discovering Binary Codes for Documents by Learn-
ing Deep Generative Models. Topics in Cognitive Science 3(1), 74–91 (2010)
9. Kivinen, J., Williams, C.: Multiple Texture Boltzmann Machines. JMLR
W&CP 22, 638–646 (2012)
10. Lee, H., Grosse, R., Ranganath, R., Ng, A.Y.: Convolutional deep belief networks
for scalable unsupervised learning of hierarchical representations. In: Int. Conf.
Machine Learning, p. 77 (2009)
11. Liu, L., Fieguth, P.: Texture Classification from Random Features. IEEE Trans.
Pattern Anal. Mach. Intell. 34(3), 574–586 (2012)
12. Memisevic, R., Hinton, G.E.: Learning to Represent Spatial Transformations with
Factored Higher-Order Boltzmann Machines. Neural Comput. 22(6), 1473–1492
(2010)
13. Ranzato, M., Krizhevsky, A., Hinton, G.E.: Factored 3-Way Restricted Boltzmann
Machines For Modeling Natural Images. JMLR W&CP 9, 621–628 (2010)
14. Varma, M., Zisserman, A.: A Statistical Approach to Material Classification Using
Image Patch Exemplars. IEEE Trans. Pattern Anal. Mach. Intell. 31(11), 2032–
2047 (2009)
Neural PCA and Maximum Likelihood Hebbian
Learning on the GPU
1 Introduction
Modern many-core GPUs have been successfully used to accelerate a variety of
meta-heuristics and bio-inspired algorithms [6,12,13] including different types
of artificial neural networks [1,10,11,14,15,17,18,20,22,24]. To fully utilize the
parallel hardware, the algorithms have to be carefully adapted to data-parallel
architecture of the GPUs [21].
Artificial neural networks (ANNs) performing PCA and MLHL are known to
be useful for the analysis of high dimensional data [5,25]. Their main aim is to
identify interesting projections of high dimensional data to lower dimensional
subspaces that reveal hidden structure of the data sets. Due to the relative
simplicity of their operations and generally real-valued data structures, such a
networks are suitable for a parallel implementation on multi-core systems and on
the GPUs that reach peak performance of hundreds and thousands giga FLOPS
(floating-point operations per second) at low costs.
This study presents a design and evaluation of a novel fine-grained data-
parallel implementation of an ANN for PCA and MLHL for the nVidia Compute
Unified Device Architecture (CUDA) platform.
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 132–139, 2012.
© Springer-Verlag Berlin Heidelberg 2012
The activation is fed back through the same weights and subtracted from the
inputs (where the inhibition takes place):
e_j = x_j - \sum_{i=1}^{M} W_{ij} y_i, \quad \forall j \qquad (2)
After that, simple Hebbian learning is performed between input and outputs:
The effect of the negative feedback is the stability of the network's learning.
This network is capable of finding the principal components of the input data [9]
in a manner equivalent to Oja's Subspace Algorithm [19]; hence the weights will
not find the actual Principal Components but a basis of the subspace spanned by
these components.
Maximum Likelihood Hebbian Learning [2,3,4,8] is based on the previous
PCA-type rule and can be described as a family of learning rules based on the
following equations: a feedforward step (1) followed by a feedback step (2) and
then a weight change, which is as follows:
MLHL has also been linked to the standard statistical method of Exploratory
Projection Pursuit (EPP) [4,7].
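The feedforward, feedback, and weight-update steps can be sketched as follows. Since the update equations themselves are elided above, the simple Hebbian rule ΔW = η y eᵀ and the MLHL rule ΔW = η y sign(e)|e|^{p−1} are assumed from the cited literature [2,3,4,8,9]; this is a plain NumPy sketch, not the GPU implementation:

```python
import numpy as np

def train_negative_feedback(X, m, iters=1000, lr=1e-5, p=None, seed=0):
    """Negative Feedback Network; if p is given, the MLHL update is used.

    X: (samples, dim) data matrix; m: dimension of the target subspace.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, X.shape[1])) * 0.01  # random weight matrix
    for _ in range(iters):
        x = X[rng.integers(len(X))]   # select a random input row x
        y = W @ x                      # feedforward step (1)
        e = x - W.T @ y                # feedback / inhibition step (2)
        if p is None:
            W += lr * np.outer(y, e)   # simple Hebbian weight change (assumed rule)
        else:
            # MLHL weight change (assumed rule)
            W += lr * np.outer(y, np.sign(e) * np.abs(e) ** (p - 1))
    return W
```

With learning rate 0.00001, p = 2.2, and 100,000 iterations this mirrors the parameter setting used in the experiments.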
2 GPU Computing
Modern graphics hardware has gained an important role in the area of paral-
lel computing. The data-parallel architecture of the GPUs is suitable for vec-
tor and matrix algebra operations and it is nowadays widely used for scientific
134 P. Krömer et al.
computing. The GPUs and general purpose GPU (GPGPU) programming have
established a new platform for neural computation. The usage of the GPUs to
accelerate neural information processing and artificial neural networks pre-dates
the inception of general purpose GPU APIs [1,11,17,18]. At that time, the data
structures were mapped directly to native GPU concepts such as textures and
the operations were implemented using vertex and pixel shaders of the GPUs.
Often, the ANNs were implemented using graphic oriented shader programs,
OpenGL functions, or DirectX functions to accelerate ANN operations. For ex-
ample, a 20 times accelerated feedforward network on the GPU was presented
by Oh and Jung in [18]. Martı́nez-Zarzuela et al. [17] proposed a 33 times faster
GPU-based fuzzy ART network. Ho et al. [11] developed a simulator of cellular
neural network on the GPU that was 8 to 17 times faster than a corresponding
CPU version, and Brandstetter and Artusi [1] designed a 3- to 72-fold faster
radial basis function network powered by the GPUs.
The GPGPU APIs have simplified the development of neural algorithms and
ANNs for the graphics hardware significantly [10,16] and a variety of neuro-
computing algorithms were ported to the GPUs [10,14,15,16,18,20,22,24]. The
CUDA platform was used to achieve 46 to 63 times faster learning of a feedfor-
ward ANN by the backpropagation algorithm by Sierra-Canto et al. [24] while
Lopes and Ribeiro [14] reported a 10 to 40 times faster implementation of the multiple
backpropagation training of feedforward and multiple feedforward ANNs.
Guzhva et al. [10] presented a coarse-grained implementation of the multilayer
perceptron (MLP) on the CUDA platform that operated a set of MLPs in parallel
50 times faster than a sequential CPU-based implementation. The training of a
feedforward neural network by genetic algorithms was implemented on CUDA by
Patulea et al. [20] and it was 10 times faster than a sequential version of the same
algorithm. An application of a GPU-powered ANN for speech recognition is due
to Scanzio et al. [22]. The GPU technology accelerated the ANN approximately
6 times. Martı́nez-Zarzuela et al. [15] used the GPU to speedup a neural texture
classification process and achieved 16 to 26 times better performance than on
the CPU. In [16], the authors implemented a fuzzy ART network on the CUDA
platform and achieved a 57-fold peak speedup.
An example of the use of GPUs for unsupervised neural networks is due to
Shitara et al. [23]. Three different graphic cards were used to benchmark the
performance of the algorithm and it was shown that the GPUs can improve the
performance of the SOM up to 150 times for certain hardware configurations.
In this research, the CUDA platform is used to accelerate the training of the
Negative Feedback Network and also for MLHL.
[Figure: flowcharts of the CPU thread and the GPU kernels for the two algorithms. Both load the training data, preprocess it (computing the data mean and subtracting it; computing the covariance, its matrix square root and inverse), and generate a random weight matrix w; then, in each iteration, a random row x is selected and kernels are launched to compute y, compute e, and update w (with the MLHL update for MLHL learning).]
one contained 10000 records with 1024 attributes. Each record in the data set
can be seen as an n × n image with a single vertical or horizontal bar painted
by different shades of gray (represented by real values between 0.7 and 1). The
visualisation of the first 20 records of the 1024-dimensional data set is shown
in fig. 2. It can be expected that, in such a data set, the pictures with the bar
in the same position ought to form (at least) one cluster, i.e., there might be at
least n + n clusters. The randomized data sets used in this study contained 15
and 31 unique bar positions, respectively.
The data sets were processed by both the Negative Feedback Network and
MLHL on the CPU and the GPU with the following parameters: 100,000 iterations,
learning rate 0.00001, and the MLHL parameter p = 2.2.
In the experiment, the dimension of the target subspace m was set to the
powers of 2 from the interval [2, DIM ] (where DIM was the full dimension of the
data set) and the execution time of network training was measured. The results
are visualized in fig. 3. It clearly illustrates how the execution time grows with the
dimension of the target subspace m and with the number of attributes. These two
parameters define the complexity of the vector-matrix operations. As expected,
the CPU is faster for small m (m < 32 for 256-dimensional data and m < 16
for 1024-dimensional data) for the Negative Feedback Network. The MLHL on
the GPU was faster than the CPU-based implementation of the same algorithm
even for small values of m. The speedup obtained by the parallel implementation
for the 256-dimensional data set ranged from 1.4 for m = 32 to 5.5 for m = 256
for the Negative Feedback Network and from 1.5 to 6.1 for the MLHL. The
performance increase was more significant for the 1024-dimensional data set.
The improvement in the training time of the Negative Feedback Network on the
GPU ranged from 1.36 times faster training for m = 16 to 47.95 times faster training
for m = 512. The processing of the 1024-dimensional data set by the MLHL on
the GPU was between 2.18 and 47.81 times faster than on the CPU.
The performance results of both algorithms for the 1024-dimensional data
set on different hardware are visualized in fig. 3. It displays the dependency of
the execution time (y-axis, note the log scale) on the dimension of the target
subspace m and illustrates how the GPU versions of the algorithms outperform
the CPU versions by an order of magnitude for larger m.
The visual results of projecting the 1024-dimensional data set onto the 2-dimensional
subspace are shown for both methods in fig. 4. Figure 4a shows the
results of the projection by the neural PCA and fig. 4b shows the structure of the
same data processed by the MLHL. Points representing images that had a bar
in the same position were drawn in the same color. We can clearly see that both
[Plot: execution time in ms (log scale) versus the target dimension m for PCA and MLHL on an AMD Opteron 2.2 GHz CPU and an nVidia Tesla C2050 GPU.]
Fig. 3. Neural PCA (Negative Feedback Network) and MLHL execution time for the
1024-dimensional data set
projections have emphasized a structure in the data. The neural PCA version
clearly separated several clusters from the rest of the data, which populates
the center of the graph, while the MLHL led to a more regular pattern of 2D
clusters. This can serve as visual proof that the CUDA-C implementations of
both algorithms provide projections to lower-dimensional subspaces with good
structure.
4 Conclusions
This research introduced a fine-grained data-parallel implementation of two
types of ANNs, the Negative Feedback Network for PCA and the Maximum
Likelihood Hebbian Learning network. The GPU versions of the algorithms
achieved a significant speedup in training times on two high-dimensional
artificial data sets. When projecting to low-dimensional subspaces (m < 16), the
CPU version of the Negative Feedback Network was faster, but when projecting
the data to spaces of larger dimension, the GPU was up to 47.99 times faster
(for the 1024-dimensional data set and m = 1024). The projection through the
MLHL network was faster on the GPU for all m ∈ [2, DIM], ranging from a 2.1-fold
speedup for m = 8 to 47.81 times faster execution for m = 1024.
In the future, other variants of the MLHL will be implemented and the GPU
version will be used to process and analyze real world data sets.
Acknowledgements. This research is partially supported through a project of
the Spanish Ministry of Economy and Competitiveness [ref: TIN2010-21272-C02-
01] (funded by the European Regional Development Fund). This work was also
supported by the European Regional Development Fund in the IT4Innovations
Centre of Excellence project (CZ.1.05/1.1.00/02.0070) and by the Bio-Inspired
Methods: research, development and knowledge transfer project, reg. no. CZ.1.07
/2.3.00/20.0073 funded by Operational Programme Education for Competitive-
ness, co-financed by ESF and state budget of the Czech Republic.
References
1. Brandstetter, A., Artusi, A.: Radial basis function networks GPU-based implementation. IEEE Transactions on Neural Networks 19(12), 2150–2154 (2008)
2. Corchado, E., Fyfe, C.: Orientation selection using maximum likelihood hebbian
learning. Int. Journal of Knowledge-Based Intelligent Engineering 2(7) (2003)
3. Corchado, E., Han, Y., Fyfe, C.: Structuring global responses of local filters using
lateral connections. J. Exp. Theor. Artif. Intell. 15(4), 473–487 (2003)
4. Corchado, E., MacDonald, D., Fyfe, C.: Maximum and minimum likelihood heb-
bian learning for exploratory projection pursuit. Data Mining and Knowledge Dis-
covery 8, 203–225 (2004)
5. Corchado, E., Perez, J.C.: A three-step unsupervised neural model for visualizing
high complex dimensional spectroscopic data sets. Pattern Anal. Appl. 14(2), 207–
218 (2011)
6. De, P., Veronese, L., Krohling, R.A.: Swarm’s flight: accelerating the particles using
c-cuda. In: Proceedings of the Eleventh conference on Congress on Evolutionary
Computation, CEC 2009, pp. 3264–3270. IEEE Press, Piscataway (2009)
7. Friedman, J., Tukey, J.: A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers C-23(9), 881–890 (1974)
8. Fyfe, C., Corchado, E.: Maximum likelihood Hebbian rules. In: Verleysen, M. (ed.)
ESANN 2002, Proceedings of the 10th European Symposium on Artificial Neural
Networks, Bruges, Belgium, April 24-26, pp. 143–148 (2002)
9. Fyfe, C.: A neural network for pca and beyond. Neur. Proc. Letters 6, 33–41 (1997)
10. Guzhva, A., Dolenko, S., Persiantsev, I.: Multifold Acceleration of Neural Network
Computations Using GPU. In: Alippi, C., Polycarpou, M., Panayiotou, C., Ellinas,
G. (eds.) ICANN 2009, Part I. LNCS, vol. 5768, pp. 373–380. Springer, Heidelberg
(2009)
11. Ho, T.Y., Lam, P.M., Leung, C.S.: Parallelization of cellular neural networks on
gpu. Pattern Recogn. 41(8), 2684–2692 (2008)
12. Krömer, P., Platoš, J., Snášel, V., Abraham, A.: An Implementation of Differential
Evolution for Independent Tasks Scheduling on GPU. In: Corchado, E., Kurzyński,
M., Woźniak, M. (eds.) HAIS 2011, Part I. LNCS, vol. 6678, pp. 372–379. Springer,
Heidelberg (2011)
13. Langdon, W.B., Banzhaf, W.: A SIMD Interpreter for Genetic Programming
on GPU Graphics Cards. In: O’Neill, M., Vanneschi, L., Gustafson, S., Esparcia
Alcázar, A.I., De Falco, I., Della Cioppa, A., Tarantino, E. (eds.) EuroGP 2008.
LNCS, vol. 4971, pp. 73–85. Springer, Heidelberg (2008)
14. Lopes, N., Ribeiro, B.: GPU Implementation of the Multiple Back-Propagation
Algorithm. In: Corchado, E., Yin, H. (eds.) IDEAL 2009. LNCS, vol. 5788, pp.
449–456. Springer, Heidelberg (2009)
15. Martı́nez-Zarzuela, M., Dı́az-Pernas, F., Antón-Rodrı́guez, M., Dı́ez-Higuera, J.,
González-Ortega, D., Boto-Giralda, D., López-González, F., De La Torre, I.: Multi-
scale neural texture classification using the gpu as a stream processing engine.
Machine Vision and Applications 22, 947–966 (2011)
16. Martı́nez-Zarzuela, M., Pernas, F., de Pablos, A., Rodrı́guez, M., Higuera, J., Gi-
ralda, D., Ortega, D.: Adaptative Resonance Theory Fuzzy Networks Parallel Com-
putation Using CUDA. In: Cabestany, J., Sandoval, F., Prieto, A., Corchado, J.M.
(eds.) IWANN 2009, Part I. LNCS, vol. 5517, pp. 149–156. Springer, Heidelberg
(2009)
17. Martı́nez-Zarzuela, M., Dı́az Pernas, F., Dı́ez Higuera, J., Rodrı́guez, M.: Fuzzy
ART Neural Network Parallel Computing on the GPU. In: Sandoval, F., Prieto,
A.G., Cabestany, J., Graña, M. (eds.) IWANN 2007. LNCS, vol. 4507, pp. 463–470.
Springer, Heidelberg (2007)
18. Oh, K.S., Jung, K.: GPU implementation of neural networks. Pattern Recogni-
tion 37(6), 1311–1314 (2004)
19. Oja, E.: Neural networks, principal components, and subspaces. International Jour-
nal of Neural Systms 1(1), 61–68 (1989)
20. Patulea, C., Peace, R., Green, J.: Cuda-accelerated genetic feedforward-ann train-
ing for data mining. J. of Physics: Conference Series 256(1), 012014 (2010)
21. Sanders, J., Kandrot, E.: CUDA by Example: An Introduction to General-Purpose
GPU Programming, 1st edn. Addison-Wesley Professional (July 2010)
22. Scanzio, S., Cumani, S., Gemello, R., Mana, F., Laface, P.: Parallel implementation
of artificial neural network training for speech recognition. Pattern Recognition
Letters 31(11), 1302–1309 (2010)
23. Shitara, A., Nishikawa, Y., Yoshimi, M., Amano, H.: Implementation and eval-
uation of self-organizing map algorithm on a graphic processor. In: Parallel and
Distributed Computing and Systems 2009 (2009)
24. Sierra-Canto, X., Madera-Ramirez, F., Uc-Cetina, V.: Parallel training of a back-
propagation neural network using cuda. In: Proceedings of the 2010 Ninth In-
ternational Conference on Machine Learning and Applications, ICMLA 2010, pp.
307–312. IEEE Computer Society, Washington, DC (2010)
25. Zhang, K., Li, Y., Scarf, P., Ball, A.: Feature selection for high-dimensional machin-
ery fault diagnosis data using multiple models and radial basis function networks.
Neurocomputing 74(17), 2941–2952 (2011)
Construction of Emerging Markets Exchange Traded
Funds Using Multiobjective Particle Swarm Optimisation
1 Introduction
Emerging markets are increasingly being regarded as the new drivers of the global
economy and as a consequence more and more investors regard emerging markets
investments as a critical component in their portfolios. While many such investors
have chosen to focus on the 'Big Four' of Brazil, Russia, India and China, there are
opportunities beyond these, in particular in Andean countries rich in mineral
resources such as Colombia and Chile, whose economies are growing at an
accelerating pace. It is not always easy for foreign investors to gain exposure to these
markets, and one of the best ways is to invest in an exchange traded fund (ETF) that
replicates the behaviour of a representative index of stocks. However in setting up
such a fund it is necessary to consider both the transaction costs involved in buying
and selling the component assets and also the market impact of these transactions,
both of which may be larger in less developed economies. The aim of this work is to
show how multiobjective particle swarm optimisation can be used to implement a
new Andean index as an ETF with internal weightings adjusted to minimise tracking
error (how closely the fund mimics the behaviour of the index) while reducing
transaction costs and market impact.
Particle swarm optimisation (PSO) [1] is a population-based search algorithm that
has achieved popularity by being simple to implement, having low computational
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 140–147, 2012.
© Springer-Verlag Berlin Heidelberg 2012
costs, and by having been found effective in a wide variety of application areas,
including finance [2]. However in its multiobjective form (MOPSO), where the aim is
to search for solutions that optimally satisfy possibly conflicting demands such as
maximising profit while at the same time minimising risk, it has so far been used
relatively little in a financial context (for examples see [3], [4]) and in particular
neither MOPSO nor any other population-based multiobjective algorithm has been
applied to the problem of minimising an index tracking error while attempting to
enhance liquidity, the subject of the current work.
2 Methods
The relative simplicity of PSO and its quantum-behaved variant QPSO made them
natural candidates to be extended for multiobjective optimisation. The methods used
here are vector evaluated PSO (VEPSO) [5] and its quantum-behaved equivalent
VEQPSO [6], in which swarms seeking to optimise two conflicting objectives
exchange information by following each other's leaders. QPSO has the advantage of
needing fewer training parameters to be set—other than the number of iterations in
fact only one, the contraction-expansion coefficient β—compared to the original PSO,
which requires also a decision to be made about the balance between learning based
on each particle's own past best experience (cognitive contribution weighted by φ1)
and learning based on following the swarm's—or in the case of VEPSO a
neighbouring swarm's—best performing member (social contribution weighted by
φ2). The standard form of PSO also requires the specification of an iteration-
decreasing inertia weight W that balances the above forms of learning (exploitation of
the search space) with random search (exploration).
The equations used to update the velocity v_i^s and position x_i^s of particle i in
swarm s (where here s = 1, 2) in the two versions of the multiobjective PSO algorithm
are given in summary below. In these expressions p_{i,t}^s ('personal best') is the best
parameter position (in relation to the objective to be optimised by swarm s) found at
time t by particle i, g_t^s ('global best') is the best position found at this time by any
particle in swarm s, and 's+1', in the case of two-objective optimisation as here,
denotes addition mod 2 (i.e. the leader of the competitor swarm is followed).
VEPSO:
v_{i,t+1}^s = W v_{i,t}^s + φ1 β1 (p_{i,t}^s − x_{i,t}^s) + φ2 β2 (g_t^{s+1} − x_{i,t}^s),   x_{i,t+1}^s = x_{i,t}^s + v_{i,t+1}^s,   (1)
where β1, β2 are random numbers chosen uniformly from the interval [0,1].
142 M. Díez-Fernández, S. Alvarez Teleña, and D. Gorse
VEQPSO:
x_{i,t+1}^s = φ p_{i,t}^s + (1 − φ) g_t^{s+1} + (−1)^{ϑ(k)} β | m_s − x_{i,t}^s | ln(1/u),   (2a)
where φ, k, u are random numbers chosen uniformly from the interval [0,1],
ϑ(k) = 1 if k ≥ 0.5,  0 if k < 0.5,   (2b)
and m_s is the mean of the personal best positions of the N members of swarm s,
m_s = (1/N) Σ_{i=1}^{N} p_i^s.   (2c)
It has generally been found that QPSO is both faster and more effective in finding
good solutions than the original PSO [7]. However it will be shown in the Results that
neither form of PSO can be regarded as 'better' for the current problem, as VEPSO
and VEQPSO methods will be seen to contribute solutions appropriate to different
parts of the problem space.
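The two update rules described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: function names and default parameter values (w, phi1, phi2, beta) are our own choices.

```python
import math
import random

def vepso_step(x, v, pbest, gbest_other, w=0.7, phi1=1.5, phi2=1.5, rnd=random.random):
    """One VEPSO velocity/position update for a single particle.
    x, v, pbest are lists of floats; gbest_other is the best position of the
    *neighbouring* swarm, implementing the leader exchange described above."""
    new_v = [w * vd + phi1 * rnd() * (p - xd) + phi2 * rnd() * (g - xd)
             for xd, vd, p, g in zip(x, v, pbest, gbest_other)]
    new_x = [xd + vd for xd, vd in zip(x, new_v)]
    return new_x, new_v

def veqpso_step(x, pbest, gbest_other, mean_best, beta=0.75, rnd=random.random):
    """One quantum-behaved (VEQPSO) position update: no velocity is kept.
    mean_best is the mean of the swarm's personal bests (eq. 2c)."""
    new_x = []
    for xd, p, g, m in zip(x, pbest, gbest_other, mean_best):
        phi, u, k = rnd(), rnd(), rnd()
        attractor = phi * p + (1.0 - phi) * g
        sign = -1.0 if k >= 0.5 else 1.0  # the coin flip theta(k), eq. (2b)
        new_x.append(attractor + sign * beta * abs(m - xd) * math.log(1.0 / u))
    return new_x
```

Note how the quantum-behaved step needs only β to be set, which is the parameter-count advantage discussed in the text.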
The most usual way to assess and compare the results of k-objective optimisation
procedures (typically phrased as the need to minimise each of f1(x), f2(x),..., fk(x),
possibly subject to a number of external constraints) in an n-dimensional decision
variable space x = (x1, x2,..., xn) is via a Pareto front
PF* = { (f1(x), f2(x),..., fk(x)) | x ∈ P* },   (3)
where P* is the Pareto optimal set of nondominated solutions x, where one solution x
is said to dominate another, v, denoted x ≼ v, if it is better than v with respect to at
least one problem objective and no worse with respect to any of the others:
x ≼ v if and only if:
fi(x) ≤ fi(v) for all i ∈ {1,2,...,k} and fj(x) < fj(v) for at least one j ∈ {1,2,...,k}.   (4)
This method is adopted here for the case of k=2 (as here there are two quantities to be
minimised, index tracking error TE and a joint measure of transaction costs and
market impact, TC&MI) and n=8 (as here there are eight weights, one for each of the
assets in the prototype eight-component portfolio).
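The dominance test of eq. (4) and the extraction of the nondominated set translate compactly into code. The sketch below assumes minimisation and represents each solution by its tuple of objective values:

```python
def dominates(fx, fv):
    """True if objective vector fx dominates fv (minimisation, eq. 4):
    no worse in every objective and strictly better in at least one."""
    return (all(a <= b for a, b in zip(fx, fv))
            and any(a < b for a, b in zip(fx, fv)))

def pareto_front(points):
    """Return the nondominated subset of a list of objective vectors."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]
```

For the present application each point would be a (TE, TC&MI) pair.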
Table 1. Country and sector distributions for the stocks in the benchmark index
The measurement of tracking error (TE), the first of our objectives to be minimised, is
straightforward: it is the standard deviation of the difference between returns from the
above benchmark and from the constructed ETF. The second objective to be
minimised is denoted TC&MI and is the sum of transaction costs (TC) and market
impact (MI). Transaction costs are easy to define, being given as the bid-ask spread
(the difference between what one would pay to buy an asset and what one could sell it
for) and taxes on gains and dividends, if any, associated with the assets held.
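Taking the definition above literally, tracking error is the standard deviation of the per-period difference between benchmark and ETF returns. A minimal sketch (using the sample standard deviation; the paper does not state whether the sample or population form was used):

```python
import statistics

def tracking_error(index_returns, etf_returns):
    """Tracking error: standard deviation of the per-period difference
    between benchmark (index) returns and ETF returns."""
    diffs = [r_idx - r_etf for r_idx, r_etf in zip(index_returns, etf_returns)]
    return statistics.stdev(diffs)
```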
However it is considerably more difficult to obtain a workable definition of market
impact, and it is necessary in emerging markets to use local expertise (this is standard
practice in the industry) to assign a parameter γi to each asset i, which is calculated
linearly from the expert's estimation of the market impact that buying or selling a
specified amount of shares would have in the market. The market impact of a
modification to an n-asset portfolio is then calculated according to the following
formula
MI = Σ_{i=1}^{n} (wi × $budget) / (pricei × MeDVi) × γi ,   (5)
in which for each of the i=1..n assets wi is its weight in the portfolio (note wi=0 if no
transactions have been carried out for asset i); $budget is the total budget managed, in
US dollars; pricei is the closing price of the asset on the day of the transaction; MeDVi
is the median daily volume of transactions in that asset (the median being calculated
over three months of past data); and γi is the expert-derived parameter discussed
above. Median rather than average daily volume is used to avoid the effect of outliers
generated by 'block trades' in which large volumes of shares may change hands.
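The formula above translates directly into code. This is an illustrative sketch; parameter names mirror the symbols in the text, and the asset figures used in testing are hypothetical:

```python
def market_impact(weights, prices, medv, gammas, budget):
    """Market impact of a portfolio modification, eq. (5):
    MI = sum_i (w_i * budget) / (price_i * MeDV_i) * gamma_i,
    where medv holds each asset's median daily traded volume and
    gammas the expert-derived impact parameters."""
    return sum(w * budget / (p * v) * g
               for w, p, v, g in zip(weights, prices, medv, gammas))
```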
3 Results
As discussed in section 2.2, this work follows the usual methodology for
multiobjective optimisation problems in constructing a Pareto front of candidate
solutions. In the context of the present application a fund manager would be able to
choose from such a front a solution (weighted combination of the eight assets) that
emphasised either close tracking of the underlying index or maximal liquidity
(minimisation of transaction costs and market impact). However it was discovered
that while the Pareto fronts obtained by VEPSO and VEQPSO were of similar
quality, 0.968 and 0.936 respectively in terms of their hypervolume [8] (a measure of
the degree to which both objectives are being jointly achieved, being preferably as
large as possible in a situation such as this in which two or more quantities are to be
simultaneously minimised), they were significantly different in that VEPSO
predominantly found solutions with a low tracking error while VEQPSO in contrast
found solutions with low market impact and transaction costs. It was noteworthy that
no modification of the learning process—changes to learning parameters, running the
algorithms for more iterations, increasing or decreasing the number of particles in the
swarms, or attempting to add to the fronts by reinitialising the weights and re-running
the algorithm—was able to change this. Such an effect has not to our knowledge
been previously observed where standard and quantum behaved PSOs were being
compared for the same multiobjective problem, and the reasons why the algorithms
here appear to specialise in certain areas of the solution space are under investigation.
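For two objectives, the hypervolume indicator reduces to the area dominated by the front up to a reference point, computable with a simple sweep. The sketch below assumes minimisation and objectives normalised so that (1, 1) is a valid reference point; it is a generic illustration, not the paper's implementation:

```python
def hypervolume_2d(front, ref=(1.0, 1.0)):
    """Hypervolume (dominated area) of a 2-objective minimisation front,
    relative to reference point `ref`. Assumes the front is nondominated."""
    pts = sorted(front)           # ascending in f1, so f2 descends
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        hv += (ref[0] - f1) * (prev_f2 - f2)  # slab added by this point
        prev_f2 = f2
    return hv
```

Larger values indicate that both objectives are more fully jointly achieved, matching the comparison of 0.968 (VEPSO) versus 0.936 (VEQPSO) quoted above.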
It was decided on pragmatic grounds that the best solution set would be a merging
of those points derived from VEPSO and those derived from VEQPSO, and this
merged Pareto front is shown in Figure 1 below. Note that the two groups of points
are more strongly separated in terms of tracking error than in terms of TC&MI; this is
another feature not significantly affected by performing additional runs of the
algorithm or by modifying training parameters, and also appears to be a feature of the
application of multiobjective algorithms to this particular data set.
It was clearly of interest to look at the composition of the generated optimal ETF
portfolios as one moves through the Pareto front. Figure 2 shows how the proportions
of the assets allocated to the five industrial sectors (Figure 2a) and the three countries
(Figure 2b) vary as one moves along the x-axis (TE) of the merged Pareto front. Note
the large break along the TE axis in both figures; the parts of the curves to the left of
this are derived purely from VEPSO solutions, and those to the right from VEQPSO,
and reflect the division shown in Figure 1. As TE → 0 these proportions should
automatically approach those of the benchmark portfolio, which was observed to be
the case. As TE increases (corresponding on the Pareto front to a lowered TC&MI) in
the case of sector allocations one sees an increase in the proportion of assets assigned
to the utilities sector, and a corresponding decrease with respect to the other sectors.
In the case of country, one sees an equally marked increase in the allocations to Chile.
The case of Peru is interesting as allocations to this country initially fall, then rise
again for a time at higher allowed values of TE (lower TC&MI).
Figure 3 shows in more detail how portfolio composition, now in terms of the eight
individual assets, varies as the allowed tracking error (TE) increases. It can be seen
that just two of the assets would take over the portfolio in the limit of very high TE,
one of them Chile's national energy provider, the other the major bank in Colombia,
these being the component assets within their sectors that are the most liquid.
Figures 4a, 4b show equivalent variations as TC&MI increases. Note that the
variations seen in these figures are expected to be the converse of those seen for TE
variation in Figure 2 (in the sense that a behaviour associated with a low TE
corresponds to a high TC&MI, and vice versa) since the MOPSO algorithms play off
the minimisation of one objective against the other, and it can be observed that this is
broadly the case.
Fig. 3. Portfolio composition in relation to the eight included assets as a function of tracking
error
4 Discussion
It has been demonstrated that a combination of vector-evaluated PSO (VEPSO) and
its quantum-behaved equivalent VEQPSO can deliver an optimal trade-off between
tracking error minimisation and liquidity enhancement for a portfolio manager who
wishes to launch an ETF to track an index. The experimental results show a hybrid
Pareto front obtained from a combination of these algorithms produces the best range
of well-balanced Pareto-optimal solutions. Future research will be focused on a)
gaining a better understanding of why the two forms of MOPSO appear to specialise
so strongly in the minimisation of one or other of the objectives; b) experimenting
with a range of nonlinear market impact models to replace the linear one used here; c)
analysing the stability of the portfolio weights along the Pareto front in order to see
how robust these solutions are (it is expected VEQPSO will deliver a more steady
References
1. Kennedy, J., Eberhart, R.: Particle Swarm Optimization. In: Proceedings of the IEEE
International Conference on Neural Networks, pp. 1942–1948. IEEE Press, New York (1995)
2. Poli, R.: An Analysis of Publications on Particle Swarm Optimisation Applications.
Technical report, Department of Computer Science, University of Essex (2007)
3. Mishra, K.S., Panda, G., Meher, S.: Multi-objective Particle Swarm Optimization Approach
to Portfolio Optimization. In: 2009 World Congress on Nature and Biologically Inspired
Computing, pp. 1611–1614. IEEE Press, New York (2009)
4. Briza, A.C., Naval Jr., P.C.: Stock Trading System Based on the Multi-objective Particle
Swarm Optimization of Technical Indicators on End-of-Day Market Data. Applied Soft
Computing 11, 1191–1201 (2011)
5. Parsopoulos, K.E., Vrahatis, M.N.: Particle Swarm Optimization Method in Multiobjective
Problems. In: 2002 ACM Symposium on Applied Computing, pp. 603–607. ACM Press
(2002)
6. Omkar, S.N., Khandelwal, R., Ananth, T.V.S., Naik, G.N., Gopalakrishnan, S.: Quantum
Behaved Particle Swarm Optimization (QPSO) for Multi-objective Design Optimization of
Composite Structures. Expert Systems with Applications 36, 11312–11322 (2009)
7. Sun, J., Xu, W., Feng, B.: A Global Search Strategy of Quantum-Behaved Particle Swarm
Optimization. In: 2004 IEEE Conference on Cybernetics and Intelligent Systems, pp. 111–
116. IEEE Press, New York (2004)
8. Beume, N., Fonseca, C.M., López-Ibáñez, M., Paquete, L., Vahrenhold, J.: On the Complexity
of Computing the Hypervolume Indicator. IEEE Transactions on Evolutionary
Computation 13, 1075–1082 (2009)
The Influence of Supervised Clustering
for RBFNN Centers Definition:
A Comparative Study
1 Introduction
Radial Basis Function Neural Networks (RBFNNs) are universal approximators
and have been successfully applied to deal with a wide range of problems. The
architecture of an RBFNN is composed of an input layer, a hidden layer, and an
output layer. The number of neurons in the input layer is equal to the number
of attributes of the input vector. The hidden layer is composed of an arbitrary
number of RBFs (e.g. Gaussian RBFs), each one defined by a center and
a dispersion parameter. The response of each neuron in the output layer is a
weighted sum over the values from the hidden layer neurons.
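The architecture just described can be summarised in a few lines. This is a generic sketch of a Gaussian-RBF forward pass (function and parameter names are ours, not the paper's):

```python
import math

def rbfnn_predict(x, centers, dispersions, weights):
    """Forward pass of an RBFNN with Gaussian basis functions.
    Hidden activations: h_j = exp(-||x - c_j||^2 / (2 * rho_j^2));
    the output is the weighted sum of the hidden activations."""
    hidden = []
    for c, rho in zip(centers, dispersions):
        d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        hidden.append(math.exp(-d2 / (2.0 * rho ** 2)))
    return sum(w * h for w, h in zip(weights, hidden))
```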
RBFNNs can be trained by either a full or a quick learning scheme. In the
former, nonlinear optimization algorithms (e.g. gradient-descent-based) are used
to determine the whole set of parameters of an RBFNN: (i) location of each
center, (ii) dispersion of each RBF, and (iii) weights of the output layer. In this
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 148–155, 2012.
c Springer-Verlag Berlin Heidelberg 2012
2 Clustering Algorithms
The following clustering algorithms were considered, along with their correspond-
ing versions employing labeled data: (i) k-means, (ii) Neural-Gas (NG) and, (iii)
Adaptive Radius Immune Algorithm (ARIA). The first one was selected due to
its simplicity and wide usage. NG is also a fast clustering algorithm that can
be seen as a generalization of k-means, where each data point is assigned to
every prototype with different weights, based on their similarities. Both k-means
and NG already have supervised versions available in the literature. ARIA is
a self-adaptive algorithm which automatically determines the number of proto-
types based on the local data density. Moreover, ARIA intrinsically defines the
coverage of each prototype, which can clearly be adopted as the RBF disper-
sion. In [14], ARIA achieved good results when applied to determine the internal
structure of RBFNNs for regression problems.
150 A.R. Gonçalves et al.
In the supervised version of k-means [3], the value of k was divided among the
classes proportionally to the number of training samples per class, and k-means
was applied to each class individually. This variant of k-means is named here as
k-meansL .
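The proportional division of k among the classes can be sketched as below. The helper name and the rounding policy are our own choices; in k-meansL, plain k-means is then run separately on each class's samples with its allocated number of centers:

```python
from collections import Counter

def split_k_per_class(labels, k):
    """Divide k among classes proportionally to class frequency,
    guaranteeing at least one center per class (supervised k-means_L).
    Rounding may make the total deviate slightly from k."""
    counts = Counter(labels)
    n = len(labels)
    return {c: max(1, round(k * cnt / n)) for c, cnt in counts.items()}
```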
In NG [10], the neighborhood ranking is based on Euclidean distance between
the prototypes and the training samples. In the supervised version of NG [9], the
Euclidean distance was replaced by the F-measure to define the neighborhood
ranking. This variant of NG is named here as NGF .
For k-means, k-meansL , NG and NGF , the number of centers (neurons) was
estimated based on the Bayesian Information Criterion (BIC) [13]. Moreover,
the dispersion ρ of each center is calculated as ρ = dmax / √(2·k) [8], where dmax
is the largest distance among the centers, and k is the number of centers. For the
ARIA-based algorithms, the best RBFNN classifiers in terms of accuracy were
achieved when the dispersions were set equal to three times the values of the
adaptive radii.
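The dispersion heuristic ρ = dmax / √(2·k) can be sketched as follows (assuming centers are given as coordinate tuples; math.dist requires Python 3.8+):

```python
import math

def rbf_dispersion(centers):
    """Dispersion heuristic [8]: rho = d_max / sqrt(2k), where d_max is the
    largest pairwise Euclidean distance among the k chosen centers."""
    k = len(centers)
    d_max = max(math.dist(a, b)
                for i, a in enumerate(centers) for b in centers[i + 1:])
    return d_max / math.sqrt(2 * k)
```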
3 Experiments
In this section, we carried out an experimental analysis to compare, by means of
the accuracy of RBFNN classifiers, the supervised and unsupervised clustering
algorithms in determining the RBF centers.
For the real-world datasets, the minimum radius values were set as follows: Wpbc = 4,
Bupa = 1.5, Ionosphere = 5, Pima = 1.6, Sonar = 6, Transfusion = 0.5, Wine = 2.5,
Iris = 0.4 and Glass = 0.7. For the artificial ones, we used 0.8 for Artf1 to Artf6 and
0.6 for Artf7 to Artf9.
These values were obtained through a grid search procedure. The other parame-
ter values were: mutation rate μ = 1, decay rate γ = 0.9 and neighborhood size
Ns = 3. In NG and NGF , the initial step size was set to 0.5 and the initial
neighborhood range, λ, was defined as n/2, where n is the number of neurons.
For all algorithms, the number of iterations was fixed at 60.
Pairwise comparison results (sign of the difference and p-value) for the artificial datasets:

Algorithms          Artf1     Artf2     Artf3     Artf4     Artf5   Artf6   Artf7     Artf8   Artf9
ARIACS – ARIA       (∼) 0.72  (+) 0     (+) 1e-3  (∼) 0.88  (+) 0   (+) 0   (∼) 0.92  (+) 0   (+) 0
NGF – NG            (–) 2e-04 (–) 0     (–) 0     (–) 0     (–) 0   (–) 0   (–) 0     (+) 0   (+) 0
k-meansL – k-means  (∼) 0.87  (–) 0.02  (–) 0     (+) 8e-3  (+) 0   (+) 0   (+) 0     (+) 0   (+) 0

Pairwise comparison results for the real-world datasets:

Algorithms          Wpbc      Bupa      Ionosphere  Pima      Sonar     Transfusion  Wine      Iris   Glass
ARIACS – ARIA       (∼) 0.91  (+) 2e-4  (+) 0       (+) 0     (+) 0     (∼) 0.42     (∼) 0.42  (+) 0  (+) 9e-4
NGF – NG            (–) 4e-3  (+) 0     (–) 0       (–) 0     (+) 0     (–) 0        (–) 0     (–) 0  (–) 0
k-meansL – k-means  (∼) 0.93  (+) 0     (–) 0.02    (+) 0.03  (+) 2e-4  (∼) 0.58     (∼) 0.94  (+) 0  (+) 0
Table 5 shows the average and standard deviation of the classifiers’ accuracy
using the considered clustering algorithms for the real-world datasets. A pairwise
comparison was done to assess the effective impact of using labeled information
to define the centers of RBFs. Significantly better results are highlighted.
In most cases, NGF performed worse than the NG algorithm, indicating that
the F-measure maximization does not improve RBFNN accuracy. Unlike Eu-
clidean distance, F-measure does not necessarily preserve the topological order
of the clusters. Considering multiple prototypes per class, the assignment of a
data point to a distant or a near cluster (representing the same class) may have
the same F-measure, causing a misleading update of the prototypes.
It is possible to infer that the incorporation of labeled information in clus-
tering algorithms may not always lead to an improvement in RBFNN accuracy.
Depending on the problem complexity, the two already proposed supervised clus-
tering algorithms, k-meansL and NGF , worsen the RBFNN performance. On the
other hand, ARIACS achieved greater or equal performance when compared to
the original ARIA for all considered problems.
Acknowledgments. The authors would like to thank CNPq and CAPES for
the financial support.
References
1. Barra, T., Bezerra, G., de Castro, L., Von Zuben, F.: An Immunological Density-
Preserving Approach to the Synthesis of RBF Neural Networks for Classification.
In: IEEE International Joint Conference on Neural Networks, pp. 929–935 (2006)
2. Bezerra, G., Barra, T., De Castro, L., Von Zuben, F.: Adaptive Radius Immune
Algorithm for Data Clustering. In: Jacob, C., Pilat, M.L., Bentley, P.J., Timmis,
J.I. (eds.) ICARIS 2005. LNCS, vol. 3627, pp. 290–303. Springer, Heidelberg (2005)
3. Bruzzone, L., Prieto, D.: A technique for the selection of kernel-function param-
eters in RBF neural networks for classification of remote-sensing images. IEEE
Transactions on Geoscience and Remote Sensing 37(2), 1179–1184 (1999)
4. Cevikalp, H., Larlus, D., Jurie, F.: A supervised clustering algorithm for the ini-
tialization of RBF neural network classifiers. In: 15th IEEE Signal Processing and
Communications Applications, pp. 1–4 (2007)
5. Frank, A., Asuncion, A.: UCI machine learning repository (2010),
http://archive.ics.uci.edu/ml
6. Gan, M., Peng, H., Dong, X.: A hybrid algorithm to optimize RBF network archi-
tecture and parameters for nonlinear time series modeling. Applied Mathematical
Modelling (2011)
7. Guillén, A., Pomares, H., Rojas, I., González, J., Herrera, L., Rojas, F., Valenzuela,
O.: Studying possibility in a clustering algorithm for RBFNN design for function
approximation. Neural Computing and Applications 17(1), 75–89 (2008)
8. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice-Hall (1999)
9. Lamirel, J., Mall, R., Cuxac, P., Safi, G.: Variations to incremental growing neural
gas algorithm based on label maximization. In: IEEE International Joint Confer-
ence on Neural Networks (IJCNN), pp. 956–965 (2011)
10. Martinetz, T.M., Berkovich, S.G., Schulten, K.J.: “Neural-Gas” network for vector
quantization and its application to time-series prediction. IEEE Transactions on
Neural Networks 4(4), 558–569 (1993)
11. Okamoto, K., Ozawa, S., Abe, S.: A Fast Incremental Learning Algorithm of RBF
Networks with Long-Term Memory. In: Proceedings of the International Joint Con-
ference on Neural Networks, pp. 102–107 (2003)
12. Qian, Q., Chen, S., Cai, W.: Simultaneous clustering and classification over cluster
structure representation. Pattern Recognition 45(6), 2227–2236 (2012)
13. Spiegelhalter, D., Best, N., Carlin, B., Van Der Linde, A.: Bayesian measures of
model complexity and fit. Journal of the Royal Statistical Society. Series B: Sta-
tistical Methodology 64(4), 583–616 (2002)
14. Veroneze, R., Gonçalves, A.R., Von Zuben, F.J.: A Multiobjective Analysis of
Adaptive Clustering Algorithms for the Definition of RBF Neural Network Centers
in Regression Problems. In: Yin, H., Costa, J.A.F., Barreto, G. (eds.) IDEAL 2012.
LNCS, vol. 7435, pp. 127–134. Springer, Heidelberg (2012)
15. Wang, X., Syrmos, V.: Optimal cluster selection based on Fisher class separability
measure. In: Proceedings of American Control Conference, pp. 1929–1934 (2005)
Nested Sequential Minimal Optimization
for Support Vector Machines
DITEN – University of Genova, Via Opera Pia 11A, Genova, I-16145, Italy
{Alessandro.Ghio,Davide.Anguita,Luca.Oneto,Sandro.Ridella}@unige.it,
Carlotta.Schatten@smartlab.ws
1 Introduction
The Support Vector Machine (SVM) [15] is one of the state–of–the–art tech-
niques for classification problems. The learning phase of SVM consists in solving
a Convex Constrained Quadratic Programming (CCQP) problem to identify a
set of parameters; however, this training step does not conclude the SVM learn-
ing, as a set of hyperparameters must be tuned to reach the optimal performance
during the SVM model selection. This last tuning is not trivial: the most used,
effective and reliable approach in practice is to perform an exhaustive grid search
[5], where the CCQP problem is solved several times with different hyperparam-
eters settings.
As a consequence, identifying an efficient QP solver is of crucial importance
for speeding-up the SVM learning and several approaches have been proposed
in literature [14]. Two main categories of solvers exist: problem-oriented and
general purpose methods. Problem-oriented techniques make the most of the
characteristics of the problem or of the model to train: e.g., when classification
with a linear SVM is targeted, the LibLINEAR algorithm [3] is a very efficient
solver, which however cannot be used when Radial Basis Function kernels (such
as the Gaussian one) are exploited. In the framework of general purpose meth-
ods, one of the most well-known tools for solving the SVM CCQP problem is
the Sequential Minimal Optimization (SMO) algorithm [12,7]. SMO takes inspi-
ration from the decomposition method of Osuna et al. [11], which suggests to
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 156–163, 2012.
c Springer-Verlag Berlin Heidelberg 2012
solve SVM training problems by dividing the available dataset into an inactive
and an active part (namely the working set ). In particular, SMO pushes the
decomposition idea to the extreme, as it optimizes the smallest possible working
set, consisting of only two parameters selected according to proper heuristics [7].
The main advantage of SMO with respect to other general purpose methods (e.g.
SVMlight [6]) lies in the fact that solving such a simple problem can be done
analytically: thus, numerical QP optimization, which can be costly or slow, can
be completely avoided [13]. Moreover, SMO is easy to implement and is included
in the well-known LibSVM package [2], which helped to further spread the use
of this solver. However, an overall speed-up of the algorithm is still possible [13]: in
particular, performance can improve by exploiting working sets of larger cardi-
nality [9], as the number of accesses to memory, which represents one of the main
computational burden of SMO, can be reduced. Thus, the analytical solution for
a modified SMO algorithm, able to optimize three parameters at each iteration,
has been proposed in [9]: though efficient, this modified version of SMO (called
3PSMO) requires that a new optimization algorithm be implemented. Moreover,
its scalability to larger working sets is not straightforward.
In this paper, we propose an innovative Nested SMO (N–SMO) algorithm: we
firstly pick a subset of data by selecting the samples according to the heuristics
proposed in [7]; then, we apply the conventional SMO algorithm to optimize the
parameters on the selected subset, so that no ad hoc optimization procedures
must be implemented. In addition to being easily scalable and allowing the use
of widespread software libraries, our proposal outperforms the state-of-the-art
SMO implementation included in LibSVM, as shown by the tests in Section 4.
min_α g(α) = (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj qij − Σ_{i=1}^{n} αi   (1)
s.t. 0 ≤ αi ≤ C,   Σ_{i=1}^{n} yi αi = 0,
where Q = {qij} = {yi yj K(xi, xj)}, K(xi, xj) is the kernel function and C is
the hyperparameter that must be tuned during the model selection phase. The
SVM classifier is f(x) = Σ_{i=1}^{n} yi αi K(xi, x) + b, where b ∈ R is the bias. The
patterns for which αi > 0 are called Support Vectors (SVs), while the subset of
patterns for which 0 < αi < C are called True SVs (TSVs).
Osuna et al. [11] suggested to solve Problem (1) by selecting working sets
of smaller cardinality, which can be efficiently managed by the optimization
procedure. Let us define the following sets:
158 A. Ghio et al.
S = {α1, . . . , αn},   S^opt ⊆ S,   I = { i | αi ∈ S },   I^opt = { i | αi ∈ S^opt },   (2)
where |S| = n, |S^opt| ≤ n and |·| is the cardinality of the set. The algorithm
proposed in [11] randomly takes a subset S^opt of the αi ∈ S and optimizes
the subproblem defined by these variables. The procedure is then repeated until
all the Karush–Kuhn–Tucker (KKT) conditions of Eq. (1) are satisfied [11].
Platt [12], in particular, proposed to select working sets such that |S^opt| =
2, i.e. characterized by the minimum cardinality. As the selection of the two
parameters to optimize deeply affects the performance of the algorithm and
its rate of convergence, ad hoc strategies have been presented in [7]. In that
work, the authors propose to include in the working set the Most Violating Pair
(MVP), i.e. the two parameters corresponding to the samples which violate the
KKT conditions the most. A further improvement has been recently presented
in [4], which takes into account second order information regarding Problem (1)
and which is currently exploited by the latest versions of the LibSVM package [2].
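The first-order MVP criterion of [7] can be sketched in a few lines. The following is an illustrative Python helper (function name and convergence tolerance are ours, not from the paper), assuming the gradient of the dual objective, ∇g_i = Σ_j q_ij α_j − 1, is kept up to date:

```python
import numpy as np

def most_violating_pair(alpha, grad, y, C, tol=1e-3):
    """Select the Most Violating Pair for an SMO working set.

    alpha : current dual variables, shape (n,)
    grad  : gradient of the dual objective, grad_i = sum_j q_ij*alpha_j - 1
    y     : labels in {-1, +1}
    C     : box constraint
    """
    # I_up: indexes whose alpha may still move "up", I_low: may move "down"
    up = ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
    low = ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))

    score = -y * grad
    i = int(np.flatnonzero(up)[np.argmax(score[up])])
    j = int(np.flatnonzero(low)[np.argmin(score[low])])

    # The KKT conditions hold (to tolerance) when max over I_up <= min over I_low
    if score[i] - score[j] < tol:
        return None          # converged
    return i, j
```

At the first iteration (all α_i = 0, grad = −1) this returns one index per class; near the optimum it returns `None`.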
Then, by extending the criterion introduced in [7], the $m$ Most Violating (MV)
samples (if any) can be chosen as follows:
$$I_{up}^{m\text{-MV}} = \underset{k_1, \ldots, k_m}{\arg\max} \left\{ -y_k \nabla g_k\!\left(\alpha^{(t)}\right) \,\middle|\, k \in I_{up} \right\} \qquad (5)$$

$$I_{low}^{m\text{-MV}} = \underset{k_1, \ldots, k_m}{\arg\max} \left\{ y_k \nabla g_k\!\left(\alpha^{(t)}\right) \,\middle|\, k \in I_{low} \right\}, \qquad (6)$$
where $\max_{k_1,\ldots,k_m}$ selects the $m$ largest elements of a vector. Then, the indexes of
the patterns of the working set are $I^{opt} = I_{up}^{m\text{-MV}} \cup I_{low}^{m\text{-MV}}$ and the working set
is defined as $S^{opt} = \left\{ \alpha_i^{(t)} \mid i \in I^{opt} \right\}$. Then, we can optimize Problem (1) only
with respect to the parameters included in $S^{opt}$ by exploiting the conventional
SMO algorithm¹:
¹ The proof is omitted here due to space constraints.
Nested Sequential Minimal Optimization for Support Vector Machines 159
$$\min_{\alpha_i \in S^{opt}} \ \frac{1}{2} \sum_{i \in I^{opt}} \sum_{j \in I^{opt}} \alpha_i \alpha_j q_{ij} + \sum_{i \in I^{opt}} \left[ \left.\nabla g_i\right|_{\alpha^{(t)}} - \sum_{j \in I^{opt}} \alpha_j^{(t)} q_{ij} \right] \alpha_i \qquad (7)$$

$$\text{s.t.} \quad 0 \le \alpha_i \le C, \ i \in I^{opt}, \qquad \sum_{i \in I^{opt}} y_i \alpha_i = -\sum_{i \in I \setminus I^{opt}} y_i \alpha_i^{(t)}.$$
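The selection rule of Eqs. (5)–(6) amounts to taking the $m$ largest first-order violations on each side of the constraint set. An illustrative sketch (helper name is ours, not from the paper):

```python
import numpy as np

def m_most_violating(alpha, grad, y, C, m):
    """Pick the m most violating samples from I_up and I_low (Eqs. 5-6):
    the indexes maximizing -y*grad over I_up and y*grad over I_low."""
    score = -y * grad
    up = np.flatnonzero(((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1)))
    low = np.flatnonzero(((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1)))

    i_up = up[np.argsort(score[up])[::-1][:m]]   # m largest of -y*grad over I_up
    i_low = low[np.argsort(score[low])[:m]]      # m largest of +y*grad over I_low
    return np.union1d(i_up, i_low)               # indexes of the working set I_opt
```

With $m = 1$ this reduces to the Most Violating Pair of [7]; the union of the two index sets plays the role of $I^{opt}$.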
4 Experimental Results
This section is devoted to the comparison of the performance of N–SMO against
the state-of-the-art and widely used implementation of the conventional SMO
algorithm included in LibSVM [2] (and exploited by our method as well, see line
13 in Algorithm 1). The datasets used for the comparison are presented in Table
1, where nl is the number of patterns used for the learning phase while nt is the
number of samples reserved for testing purposes. As MNIST and NotMNIST are
multi-class datasets and we target two-class problems, we adopted an All-vs-All
approach and considered a subset of the resulting binary sets. The Test
Set (TS) approach is used for model selection purposes, where an exhaustive
grid search is used to explore several hyperparameter values. All the tests have
been performed on an Intel Core i5 processor (2.67 GHz) with 4 GB RAM. The
experiments presented in this section have been replicated 30 times with the
same setup in order to obtain statistically significant results.
As a first issue, in Fig. 1 we compare the performance of the two solvers by
considering the MNIST 1 vs 7 problem, where a Gaussian kernel is used. In
particular, the figure on the left compares the time, needed by the algorithms to
compute the solution, when m (i.e. the dimension of the working set) and C are
varied (the width of the Gaussian kernel is fixed to the optimal value, identified
during the model selection phase): when $C$ assumes either small ($< 10^{-2}$) or
large ($> 10^{2}$) values, N–SMO outperforms the LibSVM SMO (for which
Table 1. Datasets used for the comparison

    Dataset           Reference    nl       nt      d
    MNIST 1 vs 7      [8]          11000    2448    784
    MNIST 0 vs 1      [8]          10000    3074    784
    MNIST 3 vs 8      [8]          10000    2381    784
    NotMNIST A vs B   [1]          10000    2000    784
    NotMNIST C vs D   [1]          10000    2000    784
    NotMNIST I vs J   [1]          10000    2000    784
    Daimler           [10]         8000     1800    648
    Webspam unigram   [16]         10000    2000    254
m = 2). Similar results can be obtained on the other datasets in Table 1, but
are not reported here due to space constraints: in particular, after the extensive
numerical simulations we performed on the datasets of Table 1, m ≈ 400 seems
to represent the optimal trade-off for datasets characterized by a cardinality of
approximately 10000 samples in the range of interest for the hyperparameters.
The right plot in Fig. 1, instead, compares the number of accesses to memory
nacc : it is worth noting that, when C is large, N–SMO remarkably outperforms
SMO; on the contrary, when C is small, the number of accesses to memory is
similar, which surprisingly seems to contrast with the results on the training
time. We therefore deepened the analysis of the results, in order to better explore
the reasons for this unexpected behavior.
In Table 2, we present a more detailed list of results for the comparison of
LibSVM SMO against N–SMO, where different kernels (the linear, the Gaussian
[Fig. 1. Training time in seconds (left) and number of memory accesses nacc (right) as functions of log10 C, for working-set sizes m = 2, 20, 220, 420, 620, 820 and 1020.]
and the polynomial ones) are exploited for the SVM. In particular, we report
the results obtained for “extreme” values of the hyperparameters (C, the degree
of the polynomial p and the width of the Gaussian kernel γ) and for the opti-
mal values, identified during the model selection; the dimension of the working
set is fixed to m = 400. In addition to nacc and the time needed by the solver
to compute the solution, we also report the number of misclassifications nerr
made by the learned model on the test set, and the average distance dacc
between the indexes of the rows (or columns) of the matrix Q, read in memory
for updating the gradient value. It can be noted that, when C is small, dacc for
N–SMO is always noticeably smaller than the value obtained for the LibSVM
SMO, while nacc is similar for the two methods: this confirms that the caching
strategy of the computing system has a remarkable influence on the overall per-
formance of the algorithms, and that choosing larger working sets can help decrease
the computational time.
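For reference, the access-distance statistic dacc can be computed from the trace of row (or column) indexes of Q read from memory; a minimal sketch (the helper name is ours, not the authors'):

```python
def average_access_distance(row_indexes):
    """dacc: average absolute distance between the indexes of successively
    accessed rows of Q -- smaller values mean more cache-friendly access."""
    gaps = [abs(b - a) for a, b in zip(row_indexes, row_indexes[1:])]
    return sum(gaps) / len(gaps)
```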
Finally, the results obtained for all the datasets are presented in Table 3,
where only the values referred to the Gaussian kernel are reported due to space
constraints: conclusions analogous to the ones drawn for Table 2 apply.
5 Concluding Remarks
This is a preliminary work: N–SMO needs to be tested on a larger number of
datasets with different cardinalities, a strategy for tuning m must be devised,
and further comparisons with other state-of-the-art solvers must be performed
as well. Nevertheless, the N–SMO approach proved to be effective and, as such,
represents a basis for further research on these topics. To the authors' best
knowledge, two improvements look most promising: first, exploiting the working
set selection strategy proposed in [4] at line 4 of Algorithm 1; second (and
even more appealing), designing a customized caching algorithm for
very-large-cardinality problems.
References
1. Bulatov, Y.: notMNIST dataset (2011),
http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html
2. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM
Transactions on Intelligent Systems and Technology 2, 27:1–27:27 (2011)
3. Fan, R., Chang, K., Hsieh, C., Wang, X., Lin, C.: LIBLINEAR: A library for large linear
classification. The Journal of Machine Learning Research 9, 1871–1874 (2008)
4. Fan, R., Chen, P., Lin, C.: Working set selection using second order information for
training support vector machines. The Journal of Machine Learning Research 6,
1889–1918 (2005)
5. Hsu, C., Chang, C., Lin, C.: A practical guide to support vector classification
(2003)
6. Joachims, T.: Making large-scale SVM learning practical. In: Advances in Kernel
Methods (1999)
7. Keerthi, S., Shevade, S., Bhattacharyya, C., Murthy, K.: Improvements to Platt's
SMO algorithm for SVM classifier design. Neural Computation 13(3), 637–649 (2001)
8. Larochelle, H., Erhan, D., Courville, A., Bergstra, J., Bengio, Y.: An empirical
evaluation of deep architectures on problems with many factors of variation. In:
Proceedings of the International Conference on Machine Learning, pp. 473–480
(2007)
9. Lin, Y., Hsieh, J., Wu, H., Jeng, J.: Three-parameter sequential minimal optimiza-
tion for support vector machines. Neurocomputing 74(17), 3467–3475 (2011)
10. Munder, S., Gavrila, D.: An experimental study on pedestrian classification. IEEE
Transactions on Pattern Analysis and Machine Intelligence 28(11), 1863–1868
(2006)
11. Osuna, E., Freund, R., Girosi, F.: An improved training algorithm for support
vector machines. In: Proceedings of the Workshop Neural Networks for Signal
Processing (1997)
12. Platt, J.: Sequential minimal optimization: A fast algorithm for training sup-
port vector machines. In: Advances in Kernel Methods Support Vector Learning,
vol. 208, pp. 1–21 (1998)
13. Platt, J.: Using analytic QP and sparseness to speed training of support vector
machines. In: Advances in Neural Information Processing Systems, pp. 557–563
(1999)
14. Shawe-Taylor, J., Sun, S.: A review of optimization methodologies in support vector
machines. Neurocomputing 74(17), 3609–3618 (2011)
15. Vapnik, V.: Statistical learning theory. Wiley, New York (1998)
16. Webb, S., Caverlee, J., Pu, C.: Introducing the webb spam corpus: Using email
spam to identify web spam automatically. In: Proceedings of the Conference on
Email and Anti-Spam (2006)
Random Subspace Method and Genetic Algorithm
Applied to a LS-SVM Ensemble
Abstract. The Least Squares formulation of SVM (LS-SVM) finds the solution
by solving a set of linear equations instead of quadratic programming
implemented in SVM. The LS-SVMs provide some free parameters that have to
be correctly chosen in order to obtain a satisfactory performance. Many tools have been
developed to improve their performance, mainly the development of new
classification methods and the employment of ensembles. So, in this paper, our
proposal is to use both the theory of ensembles and a genetic algorithm to
enhance the LS-SVM classification. First, we randomly divide the problem into
subspaces to generate diversity among the classifiers of the ensemble. Then, we
apply a genetic algorithm to find the values of the LS-SVM parameters and also
to find the weights of the linear combination of the ensemble members, used to
take the final decision.
1 Introduction
The Least Squares Support Vector Machine (LS-SVM) is a reformulation of the
standard SVM [1] introduced by Suykens [2] that uses equality constraints instead of
inequality constraints implemented in the problem formulation. Both the SVMs and
the LS-SVMs provide some parameters that have to be tuned to reflect the
requirements of the given task, because if these are not correctly chosen, the
performance will not be satisfactory. Despite their already high performance, several
techniques have been employed in order to improve them further, either by developing new
training methods [3] or by creating ensembles [4].
The most popular ensemble learning methods are Bagging [5], Boosting [6] and
the Random Subspace Method (RSM) [7]. In Bagging, one samples the training set,
generating random independent bootstrap replicates [8], constructs the classifier on
each of these, and aggregates them by a simple majority vote in the final decision
rule. In Boosting, classifiers are constructed on weighted versions of the training set,
which are dependent on previous classification results. Initially, all objects have equal
weights, and the first classifier is constructed on this data set. Then, weights are
changed according to the performance of the classifier. Erroneously classified objects
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 164–171, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Random Subspace Method and Genetic Algorithm Applied to a LS-SVM Ensemble 165
get larger weights, and the next classifier is boosted on the reweighted training set. In
this way, a sequence of training sets and classifiers is obtained, which is then
combined by simple majority voting or by weighted majority voting in the final
decision. In the RSM, classifiers are constructed in random subspaces of the data
feature space. These classifiers are usually combined by simple majority voting in the
final decision rule.
Several approaches related to RSM can be found in [9, 10, 11,12]. In [9] Bryll et
al. discuss Attribute Bagging, a technique for improving the accuracy and the stability
of classifier ensembles induced using random subsets of features. The Input
Decimation method (ID) [10] generates subsets of the original feature space for
reducing the correlation among the base classifiers in an ensemble.
Each base classifier is presented with a single feature subset, each feature subset
containing features correlated with a single class. The RSM by Ho [11] is a forest
construction method, which builds a decision tree-based ensemble classifier. It
presents randomly selected subspaces of features to the individual classifiers, and then
combines their output using voting. The classifier ensemble entitled Classification
Ensembles from Random Partitions (CERP) was described in [12]. In CERP, each
base classifier is constructed from a different set of attributes determined by a
mutually exclusive random partitioning of the original feature space. CERP uses
optimal tree classifiers as the base classifiers and the majority voting as combiner.
In our previous work [13], we used a GA to analyze the importance of each SVM
in the ensemble by means of a weight vector. The diversity in the ensemble was
generated by providing different parameter values for each model. In this paper, we
propose another way to generate diversity in the ensemble and extend the use of the
GA; the resulting method is called RSGALS-SVM. We use the combination of the RSM
and a GA to enhance the classification of an LS-SVM ensemble. First, we use the RSM,
constructing models in random subspaces of an n-dimensional problem, so that each
LS-SVM will be responsible for the classification of a subproblem. Then, the GA is
used to minimize an error function: it finds effective values for the parameters
of each model in the ensemble and a weight vector measuring the importance of each
one in the final classification. That way, if, for example, there is one LS-SVM whose
decision surface works better than the others, the GA will find the weight vector so
that the final classification is the best possible. Finally, we compare the results of
the proposed method with those of some other algorithms.
This paper is organized as follows: Section 2 introduces the LS-SVM and some of
its characteristics, and also gives a brief explanation of genetic algorithms and
their mechanism. Section 3 describes our proposed method, while Section 4 presents
the experimental results and their analysis. Section 5 concludes the paper.
2 Theoretical Background
The SVM classifier takes the form

$$y(x) = \operatorname{sign}\left[ \sum_{k=1}^{N} \alpha_k\, y_k\, K(x, x_k) + b \right], \qquad (1)$$

where $\alpha_k$ are support values and $b$ is a real constant. For $K(\cdot, \cdot)$ one typically has the
following choices: $K(x, x_k) = x_k^T x$ (linear SVM); $K(x, x_k) = (x_k^T x + 1)^p$
(polynomial SVM of degree $p$); $K(x, x_k) = \exp\{-\|x - x_k\|_2^2 / \sigma^2\}$ (RBF SVM);
$K(x, x_k) = \tanh(\kappa\, x_k^T x + \theta)$ (MLP SVM), where $\sigma$, $\kappa$ and $\theta$ are constants.
For the case of two classes, one assumes

$$w^T \varphi(x_k) + b \ge +1 \ \text{ if } y_k = +1, \qquad w^T \varphi(x_k) + b \le -1 \ \text{ if } y_k = -1, \qquad (2)$$

which is equivalent to

$$y_k \left[ w^T \varphi(x_k) + b \right] \ge 1, \quad k = 1, \ldots, N, \qquad (3)$$

where $\varphi(\cdot)$ is a nonlinear function which maps the input space into a higher
dimensional space. LS-SVM classifiers as introduced in [2] are obtained as the solution to
the following optimization problem:

$$\min_{w, b, e} \ \mathcal{J}(w, e) = \frac{1}{2} w^T w + \gamma\, \frac{1}{2} \sum_{k=1}^{N} e_k^2 \qquad (4)$$

subject to the equality constraints

$$y_k \left[ w^T \varphi(x_k) + b \right] = 1 - e_k, \quad k = 1, \ldots, N, \qquad (5)$$

with the Lagrangian

$$\mathcal{L}(w, b, e; \alpha) = \mathcal{J}(w, e) - \sum_{k=1}^{N} \alpha_k \left\{ y_k \left[ w^T \varphi(x_k) + b \right] - 1 + e_k \right\}, \qquad (6)$$

where $\alpha_k$ are Lagrange multipliers, which can be either positive or negative due to the
equality constraints, as follows from the Karush–Kuhn–Tucker (KKT) conditions.
The conditions for optimality are

$$\frac{\partial \mathcal{L}}{\partial w} = 0 \ \to \ w = \sum_{k=1}^{N} \alpha_k y_k \varphi(x_k), \qquad \frac{\partial \mathcal{L}}{\partial b} = 0 \ \to \ \sum_{k=1}^{N} \alpha_k y_k = 0,$$

$$\frac{\partial \mathcal{L}}{\partial e_k} = 0 \ \to \ \alpha_k = \gamma\, e_k, \qquad \frac{\partial \mathcal{L}}{\partial \alpha_k} = 0 \ \to \ y_k \left[ w^T \varphi(x_k) + b \right] - 1 + e_k = 0, \quad k = 1, \ldots, N. \qquad (7)$$
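As shown in [2], eliminating $w$ and $e$ from the optimality conditions reduces LS-SVM training to a single linear system in $(b, \alpha)$. A minimal sketch of this standard step (illustrative code, not the authors' implementation):

```python
import numpy as np

def lssvm_train(X, y, gamma, kernel):
    """Train an LS-SVM by solving the linear KKT system obtained after
    eliminating w and e (Suykens & Vandewalle [2]):

        [ 0      y^T          ] [ b ]   [ 0 ]
        [ y   Omega + I/gamma ] [ a ] = [ 1 ],   Omega_kl = y_k y_l K(x_k, x_l)
    """
    n = len(y)
    K = np.array([[kernel(xk, xl) for xl in X] for xk in X])
    Omega = np.outer(y, y) * K
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]          # alpha, b

def lssvm_predict(X, y, alpha, b, kernel, x):
    """Evaluate sign(sum_k alpha_k y_k K(x, x_k) + b), cf. Eq. (1)."""
    return np.sign(sum(a * yk * kernel(xk, x)
                       for a, yk, xk in zip(alpha, y, X)) + b)
```

Unlike the QP of the standard SVM, this requires only one call to a dense linear solver, which is the computational advantage stressed in the abstract.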
3 RSGALS-SVM
Our main objective is to improve the LS-SVM ensemble performance through the
combination of RSM and GA. In order to create this set of LS-SVMs, we delved a
little into the ensemble theory.
In [15,16,17,18] we see that an effective ensemble should consist of a set of
models that are not only highly accurate, but ones that make their errors on different
parts of the input space as well. Thus, varying the feature subsets used by each
member of the ensemble should help promote this necessary diversity. From [4] we
see that, among the most popular SVM kernels, the one that allows for the highest
diversity is the Radial Basis Function kernel, because its Gaussian width
parameter, σ, allows detailed tuning.
Therefore, the combination of RSM and GA is used to generate highly accurate
models and promote disagreement among them. Given an n-dimensional problem, we
use the RSM to divide it randomly into M subspaces of the data feature space; each
LS-SVM will then be responsible for the classification of the problem based on the
information that its subspace provides.
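The random division into M disjoint feature subspaces can be sketched as follows (a hypothetical helper; the paper does not prescribe this exact procedure):

```python
import random

def partition_features(n_features, n_models, seed=0):
    """Randomly partition the feature indexes into M disjoint subspaces,
    one per LS-SVM of the ensemble."""
    idx = list(range(n_features))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[k::n_models]) for k in range(n_models)]
```

Each returned index list selects the columns of the data matrix on which one LS-SVM of the ensemble is trained.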
Once the division of the original problem into subspaces is defined, we specify how
the GA is used in this work. The GA acts on two different levels of the
ensemble: on the parameters and on the output of each model. At the first level, the GA
finds effective values of σ and of γ, the regularization term that controls the tradeoff
between allowing training errors and forcing rigid margins, for the M LS-SVMs. At the
second level, the GA finds a weight vector w, measuring the importance of each
LS-SVM in the final classification. The final classification is obtained by a simple linear
combination of the decision values of the LS-SVMs with the weight vector. This way,
the representation of each individual of our population is defined as a vector
containing the adjustable parameters and weights:
$$\left[\, \sigma_1, \ldots, \sigma_M, \ \gamma_1, \ldots, \gamma_M, \ w_1, \ldots, w_M \,\right],$$

where M is the number of LS-SVMs.
The fitness function of our GA is the error rate of the ensemble and can be written as:

$$E(\sigma_1, \ldots, \sigma_M, \gamma_1, \ldots, \gamma_M, w_1, \ldots, w_M) = \operatorname{err}(d, y), \quad y_k = \operatorname{sign}\left[ \sum_{i=1}^{M} w_i\, o_{ik} \right], \quad k = 1, \ldots, N, \qquad (10)$$
168 C. Padilha, A.D. Neto, and J. Melo
where d contains the desired output patterns, y contains the final hypotheses, o contains the
LS-SVMs' outputs for the given input patterns, and w is the weight vector.
So, we can formulate the optimization problem to be solved by the GA:

$$\min_{\sigma, \gamma, w} \ E(\sigma, \gamma, w) \qquad (11)$$

subject to

1. $\sum_{i=1}^{M} w_i = 1$
2. $\sigma_i, \gamma_i > 0$ and $w_i \ge 0$, $i = 1, \ldots, M$.
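The fitness evaluation of one individual, i.e. the error rate of the weighted combination of the LS-SVM decision values, can be sketched as follows (illustrative code, not the authors' implementation):

```python
def ensemble_error(outputs, weights, targets):
    """Error rate of the weighted linear combination of the LS-SVM
    decision values: the quantity minimized by the GA.
    outputs[k][i] is the decision value of model i on pattern k;
    targets[k] is the desired label in {-1, +1}."""
    errors = 0
    for o_k, d_k in zip(outputs, targets):
        combined = sum(w * o for w, o in zip(weights, o_k))
        y_k = 1 if combined >= 0 else -1       # final hypothesis
        errors += (y_k != d_k)
    return errors / len(targets)
```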
The initial population of the GA was generated randomly. We employed stochastic
uniform selection, in which the parents are laid out in sections whose sizes are
determined by the fitness scaling function, and one parent is chosen from each
uniform-sized step. The mutation function is Gaussian: a random number drawn from a
Gaussian distribution with mean zero and a variance that declines at each generation
is added to the individuals selected for mutation. The crossover function is
scattered crossover: a binary vector is generated, and its elements decide the
outcome of the crossover; if an element is 1, the corresponding gene of the child
comes from the first parent, and if it is 0, from the second parent. The size of the
population is 20, at each generation two elite individuals are kept for the next
generation, and the fraction generated by crossover is 0.8. The GA runs for 100
generations. Table 1 shows the method's pseudocode.
Given:
    P = {(x_1, y_1), ..., (x_N, y_N)}, x ∈ R^n, y ∈ {−1, +1}, the input set
Procedure:
    Generate the training set and the test set V from P
    Randomly divide P into M subspaces of features
    Generate M LS-SVMs to compose the ensemble; each one will be trained using one of those groups of features
    Call the GA to solve the optimization problem:
        min_{σ,γ,w} E(σ, γ, w)
        subject to
        1. Σ_{i=1}^{M} w_i = 1
        2. σ_i, γ_i > 0 and w_i ≥ 0, i = 1, ..., M
    Retrieve the optimal values for σ, γ and the optimal weight vector w
    Evaluate the ensemble using V with the same division made in P
Output: Final classification
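The two GA operators described above, scattered crossover and Gaussian mutation, can be sketched in a few lines (helper names and the use of Python's `random` module are illustrative; the decaying-variance schedule is not specified in the paper):

```python
import random

def scattered_crossover(p1, p2, rng):
    """Scattered crossover: a random binary mask decides, gene by gene,
    which parent each gene of the child comes from."""
    mask = [rng.random() < 0.5 for _ in p1]
    return [a if m else b for m, a, b in zip(mask, p1, p2)]

def gaussian_mutation(individual, sigma, rng):
    """Gaussian mutation: add zero-mean Gaussian noise to every gene;
    sigma would be shrunk from generation to generation."""
    return [g + rng.gauss(0.0, sigma) for g in individual]
```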
4 Experimental Results
To evaluate the performance of the proposed method, tests were performed using 11 two-
class benchmark data sets, as used in [19], that include various types of classification
problems (real-world and artificial).
training data, test data, and the number of different partitions for training and test data.
The data sets chosen vary across a number of dimensions including: the type of the
features in the data set (continuous, discrete or a mix of the two) and number of
examples in the data set. We compared the proposed method to a single RBF Network,
AdaBoost with RBF Networks and SVM (with Gaussian kernel) trained with GA and
their results were obtained from [19]. All results are averaged over all partitions of
every problem. In all tests, we used an ensemble composed of 5 LS-SVMs.
Table 3 shows the average recognition rates and their standard deviations of the
validation data sets by RBF, AdaBoost, SVM and proposed method.
In almost all tests, the results of AdaBoost are worse than those of the single
classifier. Analyzing these results, this is clearly due to the overfitting of
AdaBoost. In [19], the authors explain that if early stopping is used, this effect
is less drastic but still observable.
The averaged results of RSGALS-SVM are a bit better than the results achieved by
other classifiers in most tests (7/11). A significance test (95% t-test) was done, as
seen in Table 3, and it showed that the proposed method gives the best overall performance.
The results of SVM are often better than the results of RBF classifier.
Table 3. Comparison between the RSGALS-SVM, a single RBF classifier, AdaBoost (AB) and
Support Vector Machine trained with GA (GA-SVM). The best average recognition rate is
shown in boldface. The last two columns show the results of a significance test (95% t-test)
between AB/RSGALS-SVM and between RSGALS-SVM/GA-SVM, respectively.
5 Conclusion
In this work, we proposed two changes in relation to our previous work [13]: we
incorporated the RSM to perform the feature selection, creating diversity among the
LS-SVMs in the ensemble, and we extended the use of the GA to find good values for the
parameters (σ, γ). The search space of these parameters is enormous in complex
problems due to their large range of values, which is why we extended this global search
technique (GA) to find their values. We tested the previous work using 4 data sets
(Image, Ringnorm, Splice and Waveform) and this work obtained better results in all cases.
We compared the proposed method RSGALS-SVM to a single RBF classifier,
AdaBoost with RBF networks and GA-SVM (with Gaussian kernel) and it achieved
better results than these traditional classifiers in most tests.
Many improvements are possible and need to be explored. For example, we can
further expand the use of the GA to perform the feature selection, as in [20], while
keeping the fitness function used in this work.
References
1. Vapnik, V.: Statistical Learning Theory. John Wiley and Sons Inc., New York (1998)
2. Suykens, J.A.K., Vandewalle, J.: Least-Squares Support Vector Machine Classifiers.
Neural Processing Letters 9(3) (1999)
3. Osuna, E., Freund, R., Girosi, F.: An Improved Training Algorithm for Support Vector
Machines. In: NNSP 1997 (1997)
4. Lima, N., Dória Neto, A., Melo, J.: Creating an Ensemble of Diverse Support Vector
Machines Using Adaboost. In: Proceedings on International Joint Conference on Neural
Networks (2009)
5. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
6. Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: Proceedings
13th International Conference on Machine Learning, pp. 148–156 (1996)
7. Ho, T.K.: The Random subspace method for constructing decision forests. IEEE
Transactions Pattern Analysis and Machine Intelligence 20(8), 832–844 (1998)
8. Efron, B., Tibshirani, R.: An Introduction to the Bootstrap. Chapman & Hall, New York
(1993)
9. Bryll, R., Gutierrez-Osuna, R., Quek, F.: Attribute Bagging: Improving Accuracy of
Classifier Ensembles by using Random Feature Subsets. Pattern Recognition 36, 1291–
1302 (2003)
10. Oza, N.C., Tumer, K.: Input Decimation Ensembles: Decorrelation through
Dimensionality Reduction. In: Kittler, J., Roli, F. (eds.) MCS 2001. LNCS, vol. 2096, pp.
238–247. Springer, Heidelberg (2001)
11. Ho, T.K.: The Random Subspace Method for Constructing Decision Forests. IEEE
Transactions Pattern Analysis and Machine Intelligence 20, 832–844 (1998)
12. Ahn, H., Moon, H., Fazzari, M.J., Lim, N., Chen, J., Kodell, R.: Classification by
ensembles from random partitions of high-dimensional data. Computational Statistics and
Data Analysis 51, 6166–6179 (2007)
13. Padilha, C., Lima, N., Dória Neto, A., Melo, J.: A Genetic Approach to Support Vector
Machines in classification problems. In: Proceedings on International Joint Conference on
Neural Networks (2010)
14. Castro, L., Zuben, F.V.: Algoritmos Genéticos. Universidade Estadual de Campinas
(2002),
ftp://ftp.dca.fee.unicamp.br/pub/docs/
vonzuben/ia707_02/topico9_02.pdf
15. Kuncheva, L., Whitaker, C.: Measures in diversity in classifier ensembles and their
relationship with ensemble accuracy. Machine Learning 51(2), 181–207 (2003)
16. Hansen, L., Salamon, P.: Neural network ensembles. IEEE Transactions on Pattern
Analysis and Machine Intelligence 12, 993–1001 (1990)
17. Krogh, A., Vedelsby, J.: Neural network ensembles, cross validation, and active learning.
In: Advances in Neural Information Processing Systems, vol. 7, pp. 231–238. MIT Press,
Cambridge (1995)
18. Opitz, D., Shavlik, J.: Actively searching for an effective neural-network ensemble.
Connection Science 8(3/4), 337–353 (1996)
19. Rätsch, G., Onoda, T., Müller, K.-R.: Soft Margins for Adaboost. Machine Learning 42
(2001)
20. Opitz, D.: Feature Selection for Ensembles. In: Proceedings of the Sixteenth National
Conference on Artificial Intelligence (1999)
Text Recognition in Videos Using a Recurrent
Connectionist Approach
1 Introduction
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 172–179, 2012.
c Springer-Verlag Berlin Heidelberg 2012
Text Recognition Using a Connectionist Approach 173
2 Proposed Approach
The first task for video text recognition consists in detecting and extracting texts
from videos as described in [4]. Once extracted, text images are recognized by
means of two main steps as depicted in fig. 1: generation of text image repre-
sentations and text recognition. In the first step, images are scanned at different
scales so that, for each position in the image, four different windows are ex-
tracted. Each window is then represented by a vector of features learnt with a
convolutional neural network (ConvNet). Considering the different positions in
the scanning step and the four windows extracted each time, a sequence of learnt
feature vectors $X_0, \ldots, X_t, \ldots, X_p$ is thus generated to represent each image.
The second step of the proposed OCR is similar to the model presented in [7],
using a specific bidirectional recurrent neural network (BLSTM) able to learn
to recognize text making use of both future and past context. The recurrent
network is also characterized by a specific objective function (CTC) [7], that
allows the classification of non-segmented characters. Finally, the network’s out-
puts are decoded to obtain the recognized text. The following sections describe
these different steps and their interactions within the recognition scheme.
174 K. Elagouni et al.
(CTC) to build a model able to learn how to classify the feature sequences and
hence recognize texts. While the BLSTM handles long-range dependencies between
features, the CTC enables our scheme to avoid any explicit segmentation into
characters, and to learn to recognize jointly a sequence of classes and their
positions in the input data.
The basic idea of RNNs is to introduce recurrent connections which enable the
network to maintain an internal state and thus take into account the past
context. However, these models have a limited “memory” and are not able to
look far back into the past [8], becoming insufficient when dealing with long input
sequences, such as our feature sequences. To overcome this problem, the Long
Short-Term Memory (LSTM) [5] model was proposed to handle data with long
range interdependencies. A LSTM neuron contains a constant “memory cell”—
namely constant error carousel (CEC)—whose access is controlled by some mul-
tiplicative gates. For these reasons we chose to use the LSTM model to classify
our learnt feature sequences. Moreover, in our task of text recognition, the past
context is as important as the future one (i.e., both previous and next letters are
important to recognize the current letter). Hence, we propose to use a bidirec-
tional LSTM, which consists of two separate hidden layers of LSTM neurons.
The first one processes the forward pass, making use of the past context,
while the second serves for the backward pass, making use of the future context.
Both hidden layers are connected to the same output layer (cf. fig. 1).
Even though BLSTM networks are able to model long-range dependencies, as for
classical RNNs, they require pre-segmented training data to provide the correct
target at each timestep. The Connectionist Temporal Classification (CTC) is a
particular objective function defined [6] to extend the use of RNNs to the case of
non-segmented data. Given an input sequence, it allows the network to jointly
learn a sequence of labels and their positions in the input data. By considering
an additional class called “Blank”, the CTC transforms the BLSTM
network outputs into a conditional probability distribution over label sequences
(“Blank” and Characters). Once the network is trained, CTC activation outputs
can be decoded, removing the “Blank” timesteps, to obtain a sequence of labels
corresponding to a given input sequence. In our application, a best path decoding
algorithm is used to identify the most probable sequence of labels.
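The best-path decoding step can be sketched as follows (a minimal sketch; the class indexing and probability layout are illustrative):

```python
def ctc_best_path(probs, blank=0):
    """Best-path CTC decoding: take the argmax class at each timestep,
    collapse consecutive repeats, then drop the "Blank" class."""
    path = [max(range(len(p)), key=p.__getitem__) for p in probs]
    decoded, prev = [], None
    for c in path:
        if c != prev and c != blank:
            decoded.append(c)
        prev = c
    return decoded
```

Each surviving label corresponds to one of the peaks visible in the BLSTM output plot of Fig. 2.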
After testing several architectures, a BLSTM with two hidden layers of 150
neurons, each one containing recurrent connections with all the other LSTM cells
and fully connected to the input and the output layers, has been chosen. The
5.1 Datasets
Our experiments have been carried out on a dataset of 32 videos of French
news broadcast programs. Each video, encoded in the MPEG-4 (H.264) format at
720 × 576 resolution, is about 30 minutes long and contains around 400 words
which correspond to a set of 2200 characters (i.e., small and capital letters,
numbers and punctuation marks). Embedded texts can vary a lot in terms of
size (from 8 to 24 pixels of height), color, font and background. Four videos
were used to generate a dataset of 15168 images of single characters perfectly
segmented. This database—called CharDb—consists of 42 classes of characters
(26 letters, 10 numbers, the space character and 5 special characters; namely ’.’,
’-’, ’(’, ’)’ and ’:’) and is used to train the ConvNet described in section 3.2. The
remaining videos were annotated and divided into two sets: VidTrainDb and
VidTestDb containing respectively 20 and 8 videos. While the first one is used
to train the BLSTM, the second is used to test the complete OCR scheme.
Fig. 2. Example of recognized text: each class is represented by a color, the label “ ”
represents the class “space” and the gray curve corresponds to the class “Blank”
was evaluated on the remaining 10%. A very high recognition rate of 98.04%
was obtained. Learnt features were thus generated with the trained ConvNet
and used to feed the BLSTM. Fig. 2 illustrates an example of recognized text
and shows its corresponding BLSTM outputs where each recognized character
is represented by a peak. Even though extracted geometrical features achieve
good performance, for our application they seem less suited than learnt
features, which obtain a high character recognition rate of 97.18% (cf. table 1).
The main improvement is observed for text images with complex background,
for which the geometrical features introduced high inter-class confusions.
We further compare our complete OCR scheme to another previously pub-
lished method [4] and commercial OCR engines, namely ABBYY, Tesseract,
GNU OCR, and SimpleOCR. Using the detection and extraction modules pro-
posed in [4], these different systems were tested and their performances were
evaluated. As shown in table 2, the proposed OCR yields the best results and
outperforms commercial OCRs.
6 Conclusions
We have presented an OCR scheme adapted to the recognition of texts extracted
from digital videos. Using a multi-scale scanning scheme, a novel representation
References
1. Casey, R., Lecolinet, E.: A survey of methods and strategies in character segmen-
tation. PAMI 18(7), 690–706 (2002)
2. Chen, D., Odobez, J., Bourlard, H.: Text detection and recognition in images and
video frames. PR 37(3), 595–608 (2004)
3. Elagouni, K., Garcia, C., Mamalet, F., Sébillot, P.: Combining multi-scale character
recognition and linguistic knowledge for natural scene text OCR. In: DAS, pp. 120–
124 (2012)
4. Elagouni, K., Garcia, C., Sébillot, P.: A comprehensive neural-based approach for
text recognition in videos using natural language processing. In: ICMR (2011)
5. Gers, F., Schraudolph, N., Schmidhuber, J.: Learning precise timing with LSTM
recurrent networks. JMLR 3(1), 115–143 (2003)
6. Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal
classification: Labelling unsegmented sequence data with recurrent neural net-
works. In: ICML, pp. 369–376 (2006)
7. Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., Schmidhu-
ber, J.: A novel connectionist system for unconstrained handwriting recognition.
PAMI 31(5), 855–868 (2009)
8. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computa-
tion 9(8) (1997)
9. LeCun, Y., Bengio, Y.: Convolutional networks for images, speech, and time series.
In: The Handbook of Brain Theory and Neural Networks. MIT Press (1995)
10. Lienhart, R., Effelsberg, W.: Automatic text segmentation and text recognition for
video indexing. Multimedia Systems 8(1), 69–81 (2000)
11. Saidane, Z., Garcia, C.: Automatic scene text recognition using a convolutional
neural network. In: ICBDAR, pp. 100–106 (2007)
12. Yi, J., Peng, Y., Xiao, J.: Using multiple frame integration for the text recognition
of video. In: ICDAR, pp. 71–75 (2009)
An Investigation of Ensemble Systems Applied
to Encrypted and Cancellable Biometric Data
Isaac de L. Oliveira Filho, Benjamín R.C. Bedregal, and Anne M.P. Canuto
1 Introduction
The use of different approaches for the identification of individuals in user-access
systems reflects the relevance of information security in data storage. For exam-
ple, passwords, key phrases and identification numbers have traditionally been
used in the authentication process. However, they can be used in a fraudulent
way. In order to increase the security and robustness of identification systems, it
is important to use more elaborate approaches, such as biometric data. These
features are unique to each person, which increases the reliability, convenience and
universality of identification systems [3]. However, some issues still need to be
addressed in biometric-based identification systems. The main issues concern the
security of biometric identification systems,
since these systems need to ensure their integrity and public acceptance. For
biometric-based identification systems, security is even more important than for
the non-biometric systems, since a biometric is permanently associated with a
user and cannot be revoked or cancelled if compromised. Therefore, it is impor-
tant to avoid an explicit storage of biometric templates in the system, eliminating
any possibility of leakage of the original biometric trait.
Cancellable biometrics have been increasingly applied to address such security
issues [8]. The term commonly refers to the application of non-invertible
and repeatable modifications to the original biometric templates. However, the
use of transformation functions on biometric data still allows the improper use of
this information by unauthorized individuals. In [4], for instance, it was shown
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 180–188, 2012.
c Springer-Verlag Berlin Heidelberg 2012
Applying Ensembles on Encrypted and Cancellable Signature Data 181
that the use of ensemble systems on cancellable data achieved an accuracy level
similar to that of the original data. Therefore, if stolen, the original data could
not be obtained, but the transformed dataset could still be used with a reasonable
performance level.
Aiming to increase the security level of biometric datasets, we propose an analysis
of the use of cryptography and transformation methods on biometric data, focusing
on signature as the main biometric of this analysis. The main goal of this work is
to analyse elaborate classification structures on secure datasets. To do so, a
transformation function was applied to the original signature dataset, creating a
transformed dataset. Then, the latest version of a strong cryptosystem, initially
developed in [2], was applied to the transformed signature dataset, creating a
cryptographic/transformed dataset. A comparative analysis of these three datasets
is made, and the results show that the cryptosystem breaks the relationship between
attributes and patterns, decreasing the performance level of the ensemble systems.
In this way, it is possible to say that these data are more secure than data
protected by a transformation method alone. In addition, the use of a transformation
method guarantees that the original data cannot be recovered even if the encryption
is broken, providing robust and secure biometric-based identification systems.
In the context of biometric data, the unauthorized copy of stored data is proba-
bly the most dangerous threat, regarding the privacy and security of the users. In
order to offer security for biometric-based identification systems, the biometric
templates must be stored in a protected way. There are several template protec-
tion methods proposed in the literature. In [7], these methods were broadly di-
vided into two classes of methods, which are: biometric cryptosystem and feature
transformation functions. In the latter, a transformation function (f ) is applied
to the biometric template (T ) and only the transformed template (f (T )) is stored
in the database. These functions can be categorized as salting and non-invertible
transformations. In salting, the transformation function (f) is invertible, while f
is, as the name implies, non-invertible in the non-invertible transformations.
In this work, we will focus on the use of non-invertible transformation functions.
Hence, hereafter, the terms transformation function and template protection will
be taken as referring to the non-invertible transformation function.
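As an illustration of a non-invertible transformation function, the sketch below applies a key-seeded random projection to a feature vector: the projection is repeatable for a fixed key, loses information (so f(T) cannot be inverted back to T), and can be revoked by issuing a new key. This is not the transformation used in this paper; the function name, dimensions, and hashing scheme are illustrative assumptions.

```python
import hashlib
import numpy as np

def cancellable_transform(template: np.ndarray, user_key: str, out_dim: int = 32) -> np.ndarray:
    """Illustrative non-invertible transform f(T): a key-seeded random
    projection to a lower dimension. Only f(T) would be stored; the
    projection discards information, so T cannot be recovered, and
    changing the key 'cancels' the stored template."""
    seed = int.from_bytes(hashlib.sha256(user_key.encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    # Many-to-one projection: out_dim < len(template) makes f non-invertible.
    proj = rng.standard_normal((out_dim, template.size))
    return proj @ template

template = np.arange(100, dtype=float)          # toy biometric feature vector
t1 = cancellable_transform(template, "key-A")
t2 = cancellable_transform(template, "key-A")   # repeatable: same key, same output
t3 = cancellable_transform(template, "key-B")   # revocation: new key, new template
```

Because the stored vector depends on both the biometric and the key, a leaked template can be revoked by re-enrolling with a fresh key, which is the core property motivating cancellable biometrics.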
The use of signature template protection systems was first considered in [13],
based on the biometric cryptosystem approach (key generation cryptosystem).
In this method, a set of parametric features was extracted from the
acquired dynamic signatures and a hash function was applied to the feature
binary representation, exploiting some statistical properties of the enrolment
signatures. Another study can be found in [5]. In this work, an adaptation of the
fuzzy vault to signature protection has been proposed, employing a quantized
set of maxima and minima of the temporal functions mixed with chaff points in
order to provide security.
5 Comparative Results
This section presents the results of the experiments described in the previous
section. It is important to emphasize that we have used the user identification
task. In this case, once a biometric is provided, the classification systems
(classifiers and ensembles) output the class of the user. From a classification
point of view, a verification task is a two-class problem, while an identification
task is an N-class problem (where N is the number of users). Identification is
therefore a more complex and time-consuming task.
OriginalDataset
Size  Type  Ind          Sum          Voting       K-NN          SVM
3     Het   82.55±5.9    87.8±5.1     86.56±5.96   85.47±4.26    88.56±2.24
3     Hom   81.41±7.21   83.47±5.32   82.1±6.83    79.13±8.3     83.37±5.77
6     Het   81.66±5.57   88.29±6.26   86.14±6.86   87.26±4.82    89.5±2.55
6     Hom   80.59±7.70   84.03±5.18   81.30±6.51   79.54±6.94    83.43±5.82
12    Het   81.49±5.99   88.04±6.17   87.46±6.78   87.46±4.74    89.09±2.49
12    Hom   81.19±8.22   83.90±5.67   82.83±6.51   79.77±7.49    82.37±6.75

TransDataset
Size  Type  Ind          Sum          Voting       K-NN          SVM
3     Het   74.01±5.18   76.41±9.57   75.23±8.65   74.49±6.81    78.47±4.12
3     Hom   72.67±5.51   74.33±7.00   72.97±5.89   69.33±10.18   73.43±7.76
6     Het   73.25±5.01   76.66±9.44   74.84±9.09   75.74±6.62    79.33±4.25
6     Hom   71.92±5.21   74.37±7.04   72.00±5.24   69.07±8.05    73.33±7.85
12    Het   72.35±5.19   76.63±8.82   75.99±9.81   75.88±6.60    79.02±3.53
12    Hom   72.02±11.33  74.50±12.33  73.30±11.61  68.93±16.25   72.50±13.77

BaseCrypt
Size  Type  Ind          Sum          Voting       K-NN          SVM
3     Het   5.59±2.26    7.61±2.87    4.30±2.00    5.23±1.00     5.89±1.52
3     Hom   5.41±1.92    5.43±1.98    5.00±1.80    4.53±2.75     5.90±3.46
6     Het   5.42±2.02    8.18±2.80    5.86±2.15    5.87±1.34     6.44±1.50
6     Hom   5.36±1.82    5.70±2.20    4.90±1.51    4.73±2.26     4.80±2.86
12    Het   5.28±1.89    6.43±2.95    5.97±2.27    5.37±1.37     6.80±1.67
12    Hom   5.42±1.93    5.40±2.00    5.57±2.11    7.20±2.43     4.97±3.78
The pattern of behaviour of the ensemble systems is still the same, with the best
results obtained by the heterogeneous ensembles. However, there is a difference
in the best combination method, since ensembles combined by Sum obtained
the best accuracy level on the encrypted dataset. Thus, it is possible to reiterate
that ensembles can still take advantage of heterogeneous structures, even
in very difficult scenarios.
6 Conclusion
Considering the results provided by the use of ensemble systems on the three
datasets of this work, it was possible to determine the importance of using
ensembles on signature datasets, taking into account the good results (accuracy
rates) obtained by these systems. However, as biometric datasets require
confidentiality of the stored values, it is necessary to apply template protection
methods. In this paper, this was done by applying two methods: transformation
functions and the Papílio cryptosystem. Through this analysis, it is possible to
verify that the ensemble systems applied to the transformed database had better
results than those obtained on the encrypted dataset. This shows that Papílio
really breaks the interdependence of the values of each pattern in the dataset.
The Papílio method provided a greater level of complexity than the transformation
function by itself; therefore, the encrypted data cannot be used for classification
purposes (only for storage). In addition, the use of a transformation function
means that a single break of the cryptography algorithm does not give access to
the original data, but only to the transformed data. In this case, the biometric
data becomes more secure while keeping a reasonable level of performance, since
the transformed data is used for classification purposes.
This analysis suggests a hypothesis: a cryptosystem can be considered strong
when classification performance is drastically reduced, even when using elaborate
classification structures such as ensemble systems. In other words, the strength
of a cipher encryption method is inversely proportional to the efficiency of the
classification method. Therefore, the use of other cryptosystems and/or
transformation functions and their application to other modalities is the subject
of on-going research.
References
1. Akgün, M., Kavak, P., Demirci, H.: New Results on the Key Scheduling Algorithm
of RC4. In: Chowdhury, D.R., Rijmen, V., Das, A. (eds.) INDOCRYPT 2008.
LNCS, vol. 5365, pp. 40–52. Springer, Heidelberg (2008)
2. Araujo, F.S., Ramos, K.D., Bedregal, B.R., Silva, I.: Papílio cryptography algo-
rithm. In: International Symposium on Computational and Information Sciences
(2004)
3. Bringer, J., Chabanne, H., Kindarji, B.: The best of both worlds: Applying secure
sketches to cancellable biometrics. Science of Computer Programming 74(1-2), 43–
51 (2008)
4. Canuto, A.M., Fairhurst, M.C., Pintro, F., Junior, J.C.X., Neto, A.F., Gonçalves,
L.M.G.: Classifier ensembles and optimization techniques to improve the perfor-
mance of cancellable fingerprint. Int. J. of Hybrid Intelligent Systems 8(3), 143–154
(2011)
5. Freire-Santos, M., Fierrez-Aguilar, J., Ortega-Garcia, J.: Cryptographic key gen-
eration using handwritten signature. In: Biometric Technology for Human Identi-
fication III. SPIE, Int. Society for Optical Engineering, United States (2006)
6. Guest, R.: The repeatability of signatures. In: The 9th Int. Workshop on Frontiers
in Handwriting Recognition, IWFHR 2004, pp. 492–497 (2004)
7. Jain, A.K., Nandakumar, K., Nagar, A.: Biometric template security. Eurasip Jour-
nal on Advance in Signal Processing (2008)
8. Jin, A.T.B., Hui, L.M.: Cancelable biometrics. Scholarpedia (2010)
9. Daemen, J., Rijmen, V.: The Design of Rijndael: AES - The Advanced Encryption
Standard (2002)
10. Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms. Wiley
(2004)
11. Maiorana, E., Martinez-Diaz, M., Campisi, P., Ortega-Garcia, J., Neri, A.: Tem-
plate protection for hmm-based on-line signature authentication. In: IEEE Confer-
ence on Computer Vision and Pattern Recognition Workshops, CVPRW, pp. 1–6
(2008)
12. Rivest, R.L., Shamir, A., Adleman, L.: A method for obtaining digital signatures
and public-key cryptosystems. Commun. ACM 21(2), 120–126 (1978)
13. Vielhauer, C., Steinmetz, R., Mayerhofer, A.: Biometric hash based on statistical
features of online signatures. In: Proceedings of 16th International Conference on
Pattern Recognition, vol. 1, pp. 123–126 (2002)
14. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Tech-
niques, 2nd edn. Elsevier (2005)
New Dynamic Classifiers Selection Approach
for Handwritten Recognition
1 Introduction
For almost any real-world pattern recognition problem, a series of approaches and
procedures may be used to solve it. After more than 20 years of continuous and
intensive effort devoted to solving the challenges of handwriting recognition, progress
in recent years has been very promising [1].
Classical approaches to pattern recognition require the selection of an appropriate
set of features for representing input samples and the use of a powerful single
classifier. In recent years, in order to improve the recognition accuracy in complex
application domains, there has been a growing research activity in the study of
efficient methods for combining the results of many different classifiers [2], [3].
The application of an ensemble creation method, such as bagging [4], boosting
and random subspace, generates a set of classifiers C, where C = {C1, C2, . . . , Cn}.
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 189–196, 2012.
© Springer-Verlag Berlin Heidelberg 2012
190 N. Azizi, N. Farah, and A. Ennaji
Given such a pool of classifiers, classifier selection has focused on finding the
most relevant subset of classifiers E, rather than combining all L available
classifiers, where |E| ≤ L. Indeed, classifier selection relies on the idea that
either each member classifier is an expert in some local region of the feature
space or the component classifiers are redundant.
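Pool generation by bagging, as mentioned above, can be sketched as follows; this is a minimal illustration with a toy memorizing 1-NN base learner and toy data (all names and data are assumptions, not from the paper): each pool member is trained on a bootstrap replicate of the training set.

```python
import random

def train_1nn(samples):
    """A trivial 1-NN 'classifier': the model simply memorizes its sample."""
    return list(samples)

def predict_1nn(model, x):
    # Return the label of the closest memorized point (squared Euclidean).
    nearest = min(model, key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x)))
    return nearest[1]

def bagging_pool(train_set, n_classifiers, seed=0):
    """Generate a pool C = {C1, ..., Cn}: each member is trained on a
    bootstrap replicate (sampling with replacement) of the training set."""
    rng = random.Random(seed)
    pool = []
    for _ in range(n_classifiers):
        bootstrap = [rng.choice(train_set) for _ in range(len(train_set))]
        pool.append(train_1nn(bootstrap))
    return pool

train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((1.0, 1.0), "b"), ((0.9, 1.1), "b")]
pool = bagging_pool(train, n_classifiers=5)
votes = [predict_1nn(c, (0.05, 0.1)) for c in pool]
majority = max(set(votes), key=votes.count)   # simple majority-vote combination
```

Selection methods then operate on such a pool, either statically (choosing a fixed subset E) or dynamically, as discussed in the following paragraphs.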
Ensembles of classifiers (EoCs) exploit the idea that different classifiers can offer
complementary information about the patterns to be classified. It is desirable to
take advantage of the strengths of individual classifiers and to avoid their
weaknesses, resulting in improved classification accuracy. Both theoretical and
empirical research has demonstrated that a good ensemble can not only improve
generalization ability significantly, but also strengthen the robustness of the
classification system [2], [3]. EoCs have become a hotspot in machine learning and
pattern recognition and have been successfully applied in various fields,
including handwriting recognition [5], [6], speaker identification and face recognition.
In our previous work, we dealt with the recognition of handwritten Arabic words in
Algerian and Tunisian town names using single classifiers [6]. We later focused on
multiple classifier approaches. We tried several combination schemes for the same
application [8], [9] and, while studying the role of diversity in improving multiple
classifier systems (MCS), we argued that, in spite of the weak correlation between
diversity and performance, diversity might be useful to build ensembles of
classifiers. We demonstrated through experimentation that using diversity jointly
with performance to guide selection can avoid overfitting during the search. We
therefore proposed three new approaches based on static classifier selection, using
diversity measures and individual classifier accuracies to choose the best set of
classifiers [9]-[11].
The static classifier selection strategy, also called "overproduce and select",
suffers from a main problem: a fixed subset of classifiers defined using a
training/optimization dataset may not be well adapted for the classification of the
whole test set [12]. This problem is similar to searching for a universally best
individual classifier: due to differences among samples, there is no individual
classifier perfectly adapted to every test set. In dynamic classifier selection, on
the other hand, the competence of each classifier in the ensemble is calculated
during the classification phase and the most competent classifier is then selected
[7], [12], [13]. The competence of a classifier is usually defined in terms of its
estimated local accuracy [7]. Recently, dynamic ensemble of classifiers selection
(DES) methods have been developed. In these methods, a subset of classifiers is
first dynamically selected from the ensemble, and the selected classifiers are then
combined by majority voting. However, the computational requirements of the DES
methods developed so far are still high [14].
In this paper, we propose a new dynamic ensemble of classifiers selection
approach based on local reliability estimation. The proposed algorithm extracts the
best EoC for each test sample using a new measure computed for each class of every
classifier. That measure, named the Local-Reliability (L-Reliability) measure, is
calculated from information extracted from the confusion matrices constructed
during the training phase. Once an ensemble of classifiers (EoC) is selected by our
algorithm using the L-Reliability measure, two fusion methods, voting and weighted
voting, are applied to generate the final class label with the appropriate confidence.
The remainder of this paper is organized as follows. The next section describes
the DCS paradigm and the main idea of our proposed Dynamic Ensemble Classifier
Selection methodology based on Local Reliability (DECS-LR), together with the
proposed algorithm. The main results are presented in Section 3.
The main difference between the various DCS methods is the strategy employed to
generate the regions of competence and the proposed selection algorithm.
Among different DCS schemes, the most representative one is Dynamic Classifier
Selection by Local Accuracy (DCS-LA) [7].
Dynamic Classifier Selection by Local Accuracy explores a local neighbourhood of
each test instance to evaluate the base classifiers, where the neighbourhood is
defined as the k Nearest Neighbours (kNN) of the test instance in the evaluation
set EV. The intuitive assumption behind DCS-LA is quite straightforward: given a
test instance I, we find its neighbourhood δI in EV (using the Euclidean distance),
and the base classifier that has the highest accuracy in classifying the instances
of δI should also have the highest confidence in classifying I. Let Cj (j = 1,…,L)
be a classifier and I an unknown test instance. We first label I with all individual
classifiers (Cj; j = 1,…,L) and obtain L class labels C1(I),…,CL(I). If the
individual classifiers disagree, the
local accuracy is estimated for each classifier. Given EV, the local accuracy of
classifier Cj on the neighbourhood δI, LocCj(δI), is determined by the number of
local evaluation instances that classifier Cj assigns to their true class label,
over the total number of instances considered.
The final decision for δI is given by the base classifier with the maximum local
accuracy. This best classifier C* for classifying sample I can be selected by [16], [34]:

C* = argmax_j ( Σ_{xk∈δI} Wk · 1[Cj(xk) = yk] ) / ( Σ_{xk∈δI} Wk )    (1)

where Wk = 1/dk is the weight, and dk is the Euclidean distance between the test
pattern I and its neighbour sample xk.
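The DCS-LA selection rule above can be sketched as follows; this is an illustrative implementation using a distance-weighted local accuracy, where the classifiers, evaluation set, and toy data are assumptions made for the example, not the paper's experimental setup.

```python
import math

def dcs_la_select(classifiers, eval_set, instance, k=5):
    """Sketch of DCS-LA: rank each classifier by its distance-weighted
    accuracy on the k nearest evaluation samples of `instance`, and
    return the most locally competent one.
    classifiers: list of callables x -> label
    eval_set:    list of (features, true_label) pairs
    """
    dist = lambda a, b: math.dist(a, b)
    neighbours = sorted(eval_set, key=lambda s: dist(s[0], instance))[:k]
    best, best_score = None, -1.0
    for clf in classifiers:
        num = den = 0.0
        for x, y in neighbours:
            w = 1.0 / (dist(x, instance) + 1e-9)  # W = 1/d, guarded against d = 0
            num += w * (clf(x) == y)
            den += w
        if num / den > best_score:
            best, best_score = clf, num / den
    return best

# Toy example: clf_a applies an accurate rule, clf_b always answers "pos".
clf_a = lambda x: "pos" if x[0] > 0 else "neg"
clf_b = lambda x: "pos"
ev = [((1.0, 0.0), "pos"), ((-1.0, 0.0), "neg"),
      ((2.0, 1.0), "pos"), ((-2.0, 1.0), "neg")]
chosen = dcs_la_select([clf_a, clf_b], ev, (-1.5, 0.5), k=3)  # selects clf_a
```

For the test point in the negative region, the locally accurate rule wins, which is exactly the behaviour DCS-LA is designed to exploit.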
The advantage of using local accuracy is that, instead of using the entire
evaluation set, DCS-LA uses a local neighbourhood of the given test instance to
assess the reliability of the base classifiers. DCS-LA is an efficient mechanism
for selecting the "best" classifier.
We observe that the selection criterion used in local accuracy estimation takes
into account only the local accuracy of each classifier, without considering the
behaviour of the output classes of each classifier. This behaviour criterion may
add new information about the selected set in the evaluation region, and may thus
improve the classification rate.
To achieve this objective, and also to dynamically choose the best set of
classifiers rather than a single winning classifier, we propose a new algorithm
based on the definition of the DCS-LA method that makes it possible to select, for
each test pattern, the ensemble of classifiers that has the best chance of
correctly classifying that pattern. The proposed criterion uses a new measure,
named the Local Reliability measure, which is calculated in the k nearest
neighbours of the input pattern I (neighbourhood(I)), defined with respect to the
evaluation set.
To calculate the L-Reliability measure, we need to construct a confusion matrix
for each classifier during the training phase. The confusion matrix is a square
matrix with N rows (the computed class labels) and N columns (the predicted class
labels); each cell (d, f) contains the number of training samples assigned to
class label d given that the predicted class label is f.
We can also define the local classifier accuracy Ac(Ci) by Equation 2:

Ac(Ci) = (1/N) · Σ_{j=1}^{N} a_jj    (2)

where a_jj is the number of correct predictions for each class j (j = 1,…,N).
After the training phase, the proposed Local Reliability of each class j of each
classifier Ci is defined by the following equation:

L-Reliability(Ci, j) = ( a_jj · Ac(Ci) ) / Σ_{d=1, d≠j}^{N} a_dj    (3)
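Equations (2) and (3) can be sketched on a single classifier's confusion matrix as follows, assuming (as an interpretation of the definitions above) that rows index the computed class and columns the predicted class, and that every class has at least one off-diagonal confusion in its column so the denominator of Eq. (3) is nonzero; the function name and toy matrix are illustrative.

```python
import numpy as np

def local_reliability(conf):
    """Compute Ac(C_i) (Eq. 2) and L-Reliability(C_i, j) (Eq. 3) from an
    N x N confusion matrix of classifier C_i."""
    conf = np.asarray(conf, dtype=float)
    n_classes = conf.shape[0]
    diag = np.diag(conf)                 # a_jj: correct predictions per class
    ac = diag.sum() / n_classes          # Ac(C_i) = (1/N) * sum_j a_jj
    # Denominator of Eq. 3: off-diagonal mass of column j, i.e. samples
    # predicted as class j that belong to another class.
    col_off_diag = conf.sum(axis=0) - diag
    return ac, diag * ac / col_off_diag

conf_matrix = [[8, 1, 1],
               [2, 7, 1],
               [1, 2, 7]]
ac, rel = local_reliability(conf_matrix)
```

A class that is rarely confused with others (small column off-diagonal sum) receives a high reliability score, which is the intended behaviour of the measure.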
3 Experimental Results
The pool of classifiers used to validate the proposed approach is the same
ensemble of classifiers published in our previous work based on static classifier
selection, so that both sets of results can be compared. We have used the
following classification algorithms:
- 2 SVMs (Support Vector Machines) with the "one against all" strategy,
implemented with the LIBSVM library, version 2.7. The inputs of these SVMs are the
structural features; we used polynomial and Gaussian kernel functions.
- 3 kNNs (k-Nearest Neighbours, with k = 2, 3 and 5).
- 3 NNs (neural networks with different numbers of hidden-layer neurons).
- 2 HMMs (Hidden Markov Models, discrete and continuous, with a modified Viterbi
algorithm).
The individual performances of the classifiers on the IFN-ENIT, AL-LRI and MNIST
databases are summarized in Table 1.
During the training phase, the confusion matrix of each classifier Ci (i = 1,…,10)
and the Local Reliability of each output class aj (j = 1,…,48) for all classifiers
are calculated.
Before executing the DECS-LR algorithm, we need to set two parameters. The first
is the value of k, which represents the number of neighbours chosen for the local
decision set. The second is the ε threshold. A series of experiments was carried
out to determine the best value of k for the dynamic selection level proposed in
our approach, and to show whether or not DECS is better than the SECS (Static
Ensemble Classifier Selection) of our previous work on Arabic handwritten
recognition. For ensemble combination, we have tested two fusion methods, namely
majority voting and weighted voting.
Table 2 shows the performance of the various implemented MCSs based on the
proposed DECS-LR algorithm, compared with classical DCS by LA. For validation, we
have tested our approach on three databases: IFN-ENIT [15], the Algerian database
[11] and the MNIST digit database [16].
We can conclude that, with k equal to 4, our general methodology offers the best
accuracy for both fusion methods and the databases used, with 93.89% as the best
accuracy (weighted voting on the Algerian database). We must indicate that the
performance obtained by our novel DECS-LR algorithm is better than that of our
previous work, for which the best accuracy was 94.22% (from Table 2), and better
than classical DCS based on local accuracy estimation.
Table 2. Classification accuracies on the test set provided by our DECS-LR Algorithm using
Voting method
4 Conclusion
In this paper, a new DES strategy based on local accuracy estimation and a
proposed Local Reliability measure is introduced to improve the performance of
handwritten lexicon classification. This strategy, using the DECS-LR algorithm,
exploits local accuracy
References
1. Govindaraju, V., Krishnamurthy, R.K.: Holistic handwritten word recognition using
temporal features derived from off-line images. Pattern Recognition Letters 17(5), 537–
540 (1996)
2. Kittler, J., Hatef, M., Duin, R.P.W., Matas, J.: On combining classifiers. IEEE Trans.
Pattern Anal. Mach. Intell. 20(3), 226–238 (1998)
3. Kuncheva, L.I., Whitaker, C.J., Duin, R.P.W.: Is independence good for combining
classifiers? In: Proceedings of the 15th International Conference on Pattern
Recognition, Barcelona, Spain, pp. 168–171 (2000)
4. Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123–140 (1996)
5. Azizi, N., Farah, N., Khadir, M., Sellami, M.: Arabic Handwritten Word Recognition
Using Classifiers Selection and Features Extraction/Selection. In: 17th IEEE
Conference on Intelligent Information Systems, IIS 2009, Poland, pp. 735–742 (2009)
6. Azizi, N., Farah, N., Sellami, M., Ennaji, A.: Using Diversity in Classifier Set Selection
for Arabic Handwritten Recognition. In: El Gayar, N., Kittler, J., Roli, F. (eds.) MCS
2010. LNCS, vol. 5997, pp. 235–244. Springer, Heidelberg (2010)
7. Woods, K., Kegelmeyer, W.P., Bowyer, K.: Combination of multiple classifiers using
local accuracy estimates. IEEE Trans. Pattern Anal. Mach. Intell. 19(4), 405–410 (1997)
8. Azizi, N., Farah, N., Sellami, M.: Off-line handwritten word recognition using ensemble of
classifier selection and features fusion. Journal of Theoretical and Applied Information
Technology, JATIT 14(2), 141–150 (2010)
9. Azizi, N., Farah, N., Sellami, M.: Ensemble classifier construction for Arabic
handwritten recognition. In: The 7th IEEE International Workshop on Signal
Processing and Systems, WOSSPA 2011, Tipaza, Algeria, May 8-10 (2011)
10. Azizi, N., Farah, N., Sellami, M.: Progressive Algorithm for Classifier Ensemble
Construction Based on Diversity in Overproduce and Select Paradigm: Application to the
Arabic handwritten Recognition. In: The 2nd ICICS 2011, Jordan, May 22-24, pp. 27–33
(2011)
11. Farah, N., Souici, L., Sellami, M.: Classifiers combination and syntax analysis
for Arabic literal amount recognition. Engineering Applications of Artificial
Intelligence 19(1) (2006)
12. Dos Santos, E.M., Sabourin, R., Maupin, P.: A dynamic overproduce-and-choose
strategy for the selection of classifier ensembles. Pattern Recognition 41,
2993–3009 (2008)
13. Singh, S., Singh, M.: A dynamic classifier selection and combination approach to
image region labelling. Signal Processing: Image Communication 20, 219–231 (2005)
14. Woloszynski, T., Kurzynski, M.: A Measure of Competence Based on Randomized
Reference Classifier for Dynamic Ensemble Selection. In: ICPR 2010, Turkey, August 23-
26, pp. 4194–4198 (2010)
15. Pechwitz, M., Maergner, V.: Baseline estimation for Arabic handwritten words.
In: Frontiers in Handwriting Recognition, pp. 479–484 (2002)
16. http://yann.lecun.com/exdb/mnist/
Vector Perceptron Learning Algorithm
Using Linear Programming
1 Introduction
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 197–204, 2012.
c Springer-Verlag Berlin Heidelberg 2012
198 V. Kryzhanovskiy, I. Zhelavskaya, and A. Fonarev
The idea of using LP for perceptron learning was first put forward by Krauth
and Mezard in 1987 [9]. We have also suggested a learning rule based on LP,
described in [10]. The main and significant difference between these rules is the
number of objective variables and the variables themselves. In the Krauth and
Mezard learning rule, N values of synaptic weights are optimized. In the suggested
rule, we optimize M coefficients (having an indirect effect on the
interconnections), where M is the number of stored patterns. It is obvious that in
the regime of greater practical interest (M < N) our algorithm outperforms.
In the present paper, the idea of using LP for vector neural network learning
is considered for the first time. The number of synaptic coefficients increases
significantly for this type of network, so it becomes impossible to use the Krauth
and Mezard algorithm for solving high-dimensional problems (comparable to
practical tasks). But even at low dimensions, when this algorithm is applicable,
the suggested rule outperforms it by up to 10^4 times. It is quite clear that the LP
approach is considerably slower than the Hebb rule, but this disadvantage is
balanced out by a decrease in the probability of incorrect recognition of up to
50 times.
2 Problem Statement
Consider the following model problem. Suppose we have photos of some objects,
the reference patterns. They are grayscale images made under favorable conditions
(with a number of gray gradations Q). The system receives photos of these objects
as inputs (the photos are taken from the same angle, etc., so the problems of
scaling and others will not be covered here). These input photos differ from the
reference ones by distortions imposed as:

x̃i = xi + δ,    (1)
3 Model Description
3.1 Vector Perceptron
Let us consider a vector perceptron (VP) consisting of two layers of vector
neurons; each neuron of the input layer (N neurons) is connected with all output
layer neurons (n neurons). Input and output layer neurons have Q and q discrete
states, respectively (in the general case, Q ≠ q). The states of the input layer
neurons are described by the basis vectors {e_k}, k = 1,…,Q, in Q-dimensional
space, and the states of the output layer neurons by the basis vectors {v_l},
l = 1,…,q, in q-dimensional space. Vectors e_k and v_l are zero vectors with the
k-th and l-th component, respectively, set to one.
Let each reference vector X_m = (x_m1, x_m2, …, x_mN) be put in one-to-one
correspondence with the response vector Y_m = (y_m1, y_m2, …, y_mn), where
x_mj ∈ {e_k}, y_mi ∈ {v_l}, and m = 1,…,M. Then the synaptic connection between
the i-th and j-th neurons is assigned a q × Q matrix according to the generalized
Hebb rule:

W_ij = Σ_{m=1}^{M} r_m y_mi x_mj^T J,   i = 1,…,n,  j = 1,…,N,    (2)
Then the i-th neuron, like a spin located in a magnetic field, under the influence
of the local field H_i assumes the position closest to the direction of this field
(the neuron's state is discrete, which is why it cannot be oriented exactly along
the vector H_i). In other words, if the projection of the local field H_i on a
basis vector v_s is maximal, the neuron will be oriented along that basis vector.
Let it be, for instance, the projection on the basis vector v_3. Then the i-th
output neuron will switch to state 3, described by the basis vector v_3:

y_i = v_3    (4)

This procedure is carried out concurrently for all output neurons (i = 1,…,n).
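The storage rule (2) and the winner-take-all update (4) can be sketched as follows for a toy vector perceptron. The local field is computed here as H_i = Σ_j W_ij x_j, which is an assumption standing in for the local-field equation lost at the page break; all names, the unit pattern weights r_m, the default J = E, and the toy patterns are illustrative.

```python
import numpy as np

def basis(dim, k):
    """Basis vector e_k / v_l: zeros with a 1 at position k."""
    v = np.zeros(dim)
    v[k] = 1.0
    return v

def train_hebb(X, Y, Q, q, J=None, r=None):
    """Generalized Hebb rule (Eq. 2): W_ij = sum_m r_m * y_mi * x_mj^T * J.
    X: M x N input-state indices (0..Q-1); Y: M x n output-state indices
    (0..q-1). J defaults to the identity, i.e. the classic Potts case."""
    M, N = X.shape
    n = Y.shape[1]
    J = np.eye(Q) if J is None else J
    r = np.ones(M) if r is None else r
    W = np.zeros((n, N, q, Q))
    for m in range(M):
        for i in range(n):
            y = basis(q, Y[m, i])
            for j in range(N):
                x = basis(Q, X[m, j])
                W[i, j] += r[m] * np.outer(y, x) @ J
    return W

def recall(W, x_states):
    """Winner-take-all update (Eq. 4): each output neuron aligns with the
    basis vector on which its local field has the largest projection."""
    n, N, q, Q = W.shape
    out = np.zeros(n, dtype=int)
    for i in range(n):
        h = sum(W[i, j] @ basis(Q, x_states[j]) for j in range(N))
        out[i] = int(np.argmax(h))   # index of the winning basis vector v_l
    return out

X = np.array([[0, 1, 2], [2, 2, 0]])   # two reference patterns, N = 3, Q = 3
Y = np.array([[0], [1]])               # their responses, n = 1, q = 2
W = train_hebb(X, Y, Q=3, q=2)
```

Presenting each stored pattern then recovers its associated response index, illustrating the associative recall the text describes.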
so these messages can be typed with errors. For example, the input of the letter
"I" would most probably be followed by the incorrect input of the letters "U", "O"
or "K" (these letters are neighbours of "I" on the keyboard). When constructing a
neural network for input word recognition, it would be reasonable to take the
information about these most probable errors (nearby keys) into account. These
letters are clearly not neighbours in the alphabet, but in the present case they
are nearer to "I" than the "habitual" "G", "H" or "J". This information is
represented in the above-mentioned matrix of proximity measures J. Let us describe
it more formally.
J is a symmetric matrix of proximity measures between the states of the input-layer neurons; its elements J_kl = J_lk are the proximity measures between the states k and l, k, l = 1, ..., Q. No proximity measure is introduced between the states of the output-layer neurons. If J is the identity matrix E (i.e., J = E), expression (2) describes the weights of the classic Potts perceptron [1, 3, 4]. Therefore, to introduce the proximity measure into an already trained Potts perceptron, it is sufficient to modify the interconnection weights by multiplying them by the matrix J on the right.
The proximity measure matrix J may be defined either by the problem's specification or based on data analysis and the nature of the noise. To supply the VNN with information about the noise distribution, it is suggested to specify the proximity measures between neuron states as the probabilities of switching from one state to another under the influence of distortions:

J_kl = P_kl,  k, l = 1, ..., Q,  (5)

where P_kl is the probability of switching from state k to state l under the influence of distortions. For the model problem at hand, the matrix P is characterized by a single parameter σ_out, called the external environment parameter:
P_kl = (1 / (√(2π) σ_out)) · exp(−(k − l)² / (2σ_out²)),  (6)
The parameter σ_out is not known precisely; therefore we use an estimate of this parameter, σ_in. The parameter σ_in is an internal adjustable parameter of the model, chosen so that the recognition error is minimal. From general considerations one would expect σ_in = σ_out; however, as computer modeling shows, this equation holds only up to a multiplier: σ_in = c · σ_out, where 1 < c < 2.
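A proximity matrix of this Gaussian form can be built in a few lines (a Python/NumPy sketch rather than the paper's MATLAB; the function name and the example values Q = 16, σ = 1.3 are illustrative):

```python
import numpy as np

def proximity_matrix(Q, sigma):
    # J_kl = exp(-(k-l)^2 / (2*sigma^2)) / (sqrt(2*pi)*sigma), in the style of
    # Eq. (6), with sigma playing the role of the internal parameter sigma_in
    k = np.arange(Q)
    d2 = (k[:, None] - k[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

J = proximity_matrix(16, 1.3)  # symmetric; largest entries on the diagonal
```

Multiplying the trained Potts weights by this J on the right, as described above, injects the noise model into the perceptron.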
4 LP Learning Rule
In accordance with the algorithm described above, the conditions for correct recognition of all reference patterns may be presented as the system of M(q − 1) inequalities:

h_i(m)·y_mi − h_i(m)·v_l > Δ,   y_mi ≠ v_l,  m = 1, ..., M,  l = 1, ..., q,
0 < r_m < 1,  (7)
Δ > 0,
where h_i(m) is the local field on the i-th output neuron when the undistorted m-th reference pattern is presented, and y_mi is the expected response value.

The parameter Δ is introduced for better recognition stability: in the language of the fully connected Hopfield model, it is responsible for the depth and size of the basins of attraction of the local minima being formed. The larger Δ is during training, the higher the probability of correct recognition of noisy patterns. Therefore, it is necessary to find weight coefficients r_m such that system (7) holds for all reference patterns for the largest possible value of Δ. In this case, the depth of the local minima formed is the maximum possible.
Thus we have a linear programming problem with the set of constraints (7)
and with the following objective function:
It is required to find (M + 1) variables that are the solution to this linear pro-
gramming problem.
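As a toy illustration (not the authors' code), the LP over the variables (r_1, ..., r_M, Δ) can be set up with SciPy's `linprog`. The sizes below and the use of mutually orthogonal input patterns are assumptions chosen so that the optimum is easy to verify by hand:

```python
import numpy as np
from scipy.optimize import linprog

# Toy sizes: N inputs with Q states, n outputs with q states, M patterns
N, Q, n, q, M = 20, 4, 2, 3, 3

# Toy reference patterns: pattern m puts every input neuron in state m,
# so the patterns are mutually orthogonal and the optimum is r_m = 1.
X = np.zeros((M, N, Q))
for m in range(M):
    X[m, :, m] = 1.0
Y = np.zeros((M, n, q))
for m in range(M):
    for i in range(n):
        Y[m, i, (m + i) % q] = 1.0

J = np.eye(Q)  # identity proximity matrix recovers the classic Potts perceptron
# Scalar overlaps S[m2, m1] = sum_j x_{m2 j}^T J x_{m1 j}
S = np.einsum('ajk,kl,bjl->ab', X, J, X)

# Constraints (7): for every pattern m, output i and wrong state l,
#   sum_m' r_m' S[m', m] (y_{m'i} . y_{mi} - [y_{m'i}]_l) - Delta >= 0
A_ub, b_ub = [], []
for m in range(M):
    for i in range(n):
        target = int(np.argmax(Y[m, i]))
        for l in range(q):
            if l == target:
                continue
            coeff = np.array([S[m2, m] * (Y[m2, i] @ Y[m, i] - Y[m2, i, l])
                              for m2 in range(M)])
            A_ub.append(np.append(-coeff, 1.0))  # linprog wants A_ub z <= b_ub
            b_ub.append(0.0)

c = np.zeros(M + 1); c[-1] = -1.0            # maximize Delta == minimize -Delta
bounds = [(1e-6, 1.0)] * M + [(0.0, None)]   # 0 < r_m < 1, Delta >= 0
res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds)
# With orthogonal patterns every constraint reduces to N*r_m >= Delta,
# so the optimal Delta equals N
```

For realistic correlated patterns the constraints couple the r_m, and the LP automatically down-weights patterns that would otherwise shrink the margin.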
A similar idea was formulated by Krauth and Mezard [9]. Their work concerns binary neural networks; for a fair comparison, we extend their algorithm to vector multistate ones (MATLAB programs implementing all the methods can be found at [12]). The unknown quantities in the algorithm of Krauth and Mezard (by analogy with [10]) are the N·Q·q weight coefficients and the stability parameter Δ. For binary perceptrons the two algorithms (ours and theirs) are nearly equal in noise immunity and memory capacity; but for vector perceptrons they cannot be applied under the same conditions, since the inequality N·Q·q ≫ M always holds. Even at low values of the parameters, such as N = 100, Q = 20, q = 24, the resulting number of variables prohibits solving the formulated problem in a reasonable time. Note that the RAM requirements for solving the problem also increase with its size: at these low parameters the KM algorithm uses more than 19 GB of RAM, while the proposed one uses only 1 GB.
Fig. 1. The probability of incorrect recognition of noisy reference patterns as a function of the external environment parameter σ_out at the chosen parameters N = 50, q = 6, Q = 16, M = 60, σ_in = 1.3

Fig. 2. The ratio of the KM algorithm learning time t_KM to the proposed algorithm learning time t_OUR as a function of the problem size N at the parameters q = 6, Q = 16, σ_in = 1.3 and σ_out = 0.7
6 Conclusions
In this paper we have considered three algorithms for vector neural networks
learning: the Hebb rule, the Krauth-Mezard learning rule (generalized to vector
neural networks) and our algorithm. The last two algorithms involve the use of
Linear Programming.
It was shown that, despite its greater computational complexity compared with the Hebb rule, the use of LP for vector neural network learning is justified, because this approach reduces the incorrect-recognition probability by up to 50 times. (Note that applying linear programming to binary perceptron learning allows one to reach the theoretical maximum loading predicted by E. Gardner [13].)
The suggested algorithm was compared with the algorithm of Krauth and Mezard. The proposed algorithm differs from the KM one in having an essentially smaller number of objective variables. This has a positive effect on the learning rate: the suggested algorithm outperforms the KM one by several orders of magnitude (by up to 10⁴ times). Moreover, the stability of a neural network trained by this approach is 10-75 percent higher.
We want to highlight that we suggest here only a modification of the Hebb rule. Therefore, the generalization performance of the proposed rule is the same as that of the Hebb rule, in the sense that the proximity between two patterns (the distance between them) is measured by the Hamming distance. By using linear programming we increase noise immunity in particular, but the generalization performance remains unchanged.

Throughout the paper we refer to the KM algorithm generalized to vector neural networks. However, the algorithm itself is not presented here due to space limitations. Materials and MATLAB listings can be found at [12].
204 V. Kryzhanovskiy, I. Zhelavskaya, and A. Fonarev
References
1. Kanter, I.: Potts-glass models of neural networks. Physical Review A 37(7), 2739–
2742 (1988)
2. Cook, J.: The mean-field theory of a Q-state neural network model. Journal of
Physics A 22, 2000–2012 (1989)
3. Bolle, D., Dupont, P., Huyghebaert, J.: Thermodynamics properties of the q-state
Potts-glass neural network. Phys. Rev. A 45, 4194–4197 (1992)
4. Wu, F.: The Potts model. Review of Modern Physics 54, 235–268 (1982)
5. Kryzhanovsky, B., Mikaelyan, A.: On the Recognition Ability of a Neural Network
on Neurons with Parametric Transformation of Frequencies. Doklady Mathemat-
ics 65(2), 286–288 (2002)
6. Kryzhanovsky, B., Kryzhanovsky, V., Litinskii, L.: Machine Learning in Vec-
tor Models of Neural Networks. In: Koronacki, J., Raś, Z.W., Wierzchoń, S.T.,
Kacprzyk, J. (eds.) Advances in Machine Learning II. SCI, vol. 263, pp. 427–443.
Springer, Heidelberg (2010)
7. Kryzhanovskiy, V.: Binary Patterns Identification by Vector Neural Network with
Measure of Proximity between Neuron States. In: Honkela, T. (ed.) ICANN 2011,
Part II. LNCS, vol. 6792, pp. 119–126. Springer, Heidelberg (2011)
8. Austin, J., Turner, A., Lees, K.: Chemical Structure Matching Using Correlation
Matrix Memories. In: International Conference on Artificial Neural Networks, IEE
Conference Publication 470, Edinburgh, UK, September 7-10. IEE, London (1999)
9. Krauth, W., Mezard, M.: Learning algorithms with optimal stability in neural
networks. J. Phys. A: Math. Gen. 20, L745–L752 (1987)
10. Kryzhanovskiy, V., Zhelavskaya, I., Karandashev, J.: Binary Perceptron Learning
Algorithm Using Simplex-Method. In: Rutkowski, L., Korytkowski, M., Scherer, R.,
Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2012, Part I. LNCS,
vol. 7267, pp. 111–118. Springer, Heidelberg (2012)
11. Kryzhanovsky, B., Kryzhanovsky, V.: Binary Optimization: On the Probability of
a Local Minimum Detection in Random Search. In: Rutkowski, L., Tadeusiewicz,
R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2008. LNCS (LNAI), vol. 5097, pp.
89–100. Springer, Heidelberg (2008)
12. Center of Optical-Neural Technologies,
http://www.niisi.ru/iont/downloads/km/
13. Gardner, E., Derrida, B.: Optimal storage properties of neural network models. J.
Phys. A: Math. Gen. 21, 271–284 (1988)
A Robust Objective Function of Joint
Approximate Diagonalization
1 Introduction
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 205–212, 2012.
c Springer-Verlag Berlin Heidelberg 2012
206 Y. Matsuda and K. Yamaguchi
A significant advantage of JAD is its versatility. Because JAD utilizes the linear algebraic properties of the cumulants, it does not depend on the specific statistical properties of the source signals except for non-Gaussianity [2]. From the viewpoint of robustness, however, JAD lacks a theoretical foundation. Many ICA methods are based on probabilistic models, so their estimated results are guaranteed to be "optimal" within those models. On the other hand, JAD is theoretically valid only if every non-diagonal element ν_ijpq is equal to 0. In other words, it is not guaranteed in JAD that the V with minimal non-diagonal elements is more "desirable." This theoretical problem often causes a deficiency of robustness in practical applications.
In this paper, a new objective function of JAD is derived by an information theoretic approach in order to improve the robustness of JAD. The information theoretic approach was proposed previously in [7,8]; it incorporates a probabilistic model into JAD by regarding each non-diagonal element of the cumulants as an independent random variable. The approach can theoretically clarify the properties of JAD. It has also been shown that the approach can improve the efficiency and robustness of JAD in practice, by model selection [7] or by an approximation of the entropy [8]. However, the previous probabilistic models were too rough to exploit the information theoretic approach fully. In this paper, the robustness of JAD is improved further by using a more accurate approximation involving the diagonal elements. This paper is organized as follows. In Section 2, the information theoretic approach to the non-diagonal elements is briefly explained. In Section 3.1, a new objective function of JAD is proposed by applying the information theoretic approach to the diagonal elements, whose distributions are approximated as Gaussian with unknown variance. In addition, an optimization algorithm for the objective function is proposed in Section 3.2. In Section 4, numerical results on artificial datasets verify that the proposed method can improve the robustness when the sample size is small. This paper is concluded in Section 5.
variance 1/M. In other words, the true distribution is g_n-diag(ν) = exp(−ν² M/2) / √(2π/M). Here, the four conditions are given as follows:
1. Linear ICA Model: The linear ICA model X = AS holds, where the mean
and the variance of each source sim are 0 and 1, respectively.
2. Large Number of Samples: The sample size M is so large that the central
limit theorem holds.
3. Random Mixture: Each element aij in A is given randomly and indepen-
dently, whose mean and variance are 0 and 1/N , respectively.
4. Large Number of Signals: The number of signals N is sufficiently large.
The details of the proof are described in [8]. In brief, it is first proved that the distribution of each ν_ijpq is approximately Gaussian by the central limit theorem. Then, it is proved that E(ν_ijpq ν_klrs) is approximated as δ_ik δ_jl δ_pr δ_qs / M under these conditions, where δ_ik is the Kronecker delta. Though it was described only roughly in [8] how the objective function Ψ is derived from this theorem, a more rigorous derivation is given in this paper. First, it is assumed that the diagonal elements ν_iipq are distributed according to a sparse uniform distribution u(x) = c, because there is no prior knowledge. Regarding ν_jipq = ν_ijpq, the value is determined by algebraic symmetry, so any fixed prior distribution can be employed without essentially changing the likelihood. Here, the same uniform distribution u(x) = c is employed for simplicity. Then, the
true distribution P^ν_true(ν^pq) is given as

P^ν_true(ν^pq) = c^(N(N+1)/2) ∏_{i,j>i} g_n-diag(ν_ijpq).  (2)
By the transformation ν^pq = V C^pq V^T, the linear transformation matrix from the vectorized elements of C^pq to those of ν^pq is given as the Kronecker product V ⊗ V. Therefore, the distribution of C^pq with the parameter V is determined by

P^C(C^pq | V) = P^ν(ν^pq) / |V ⊗ V| = P^ν(ν^pq) / |V|^(2N),  (3)

where |V| is the determinant of V. The log-likelihood function is given as

ℓ(V) = Σ_{p,q>p} log P^C_true(C^pq | V) = −2N Σ_{p,q>p} log |V| + Σ_{p,q>p} log P^ν_true(ν^pq)
     ≅ −N²(N − 1) log |V| + Σ_{p,q>p} Σ_{i,j>i} log g_n-diag(ν_ijpq)
     ≅ −N²(N − 1) log |V| − (M/2) Σ_{p,q>p} Σ_{i,j>i} ν²_ijpq,  (4)
where some constant terms are neglected. In many JAD algorithms, V is con-
strained to be orthogonal by pre-whitening (in other words, |V | = 1). In this
case, the maximization of the likelihood in Eq. (4) is equivalent to the mini-
mization of the JAD objective function Ψ in Eq. (1).
3 Method
3.1 Objective Function
While the original objective function of JAD is derived by the information theoretic approach in Section 2, that derivation is not useful for improving the objective function: the diagonal elements are assumed to be distributed uniformly and independently, and this assumption gives no additional clues for estimating V. Here, the "true" distribution of the diagonal elements is focused on and a new objective function is derived. When V = A⁻¹ (the accurate estimation), the dominant term of a diagonal element ν_iipq without the estimation error is given as ν_iipq ≃ a_pi a_qi κ̄_iiii (κ̄_iiii is the unknown true kurtosis of the i-th source) [2,8]. Because each a_ij is assumed to be a normally and independently distributed random variable in Section 2, the dominant term of ν_iipq follows a normal product distribution with unknown variance. In addition, ν_iipq depends slightly on every ν_ijpq through a_pi and a_qi. However, in order to estimate the likelihood easily, independent Gaussian distributions are employed as
the approximations in this paper. Therefore, the distribution of ν_iipq (p < q) is approximated as an independent Gaussian with unknown variance σ_i²: g_diag(ν) = exp(−ν²/(2σ_i²)) / √(2πσ_i²). Then, Eq. (2) is rewritten as

P^ν_true(ν^pq) = c^(N(N−1)/2) ∏_i g_diag(ν_iipq) ∏_{i,j>i} g_n-diag(ν_ijpq).

Therefore, the log-likelihood depending on V and σ = (σ_i²) is given as

ℓ(V, σ) ≅ −N²(N − 1) log |V| − (M/2) Σ_{p,q>p} Σ_{i,j>i} ν²_ijpq − Σ_{p,q>p} Σ_i [ (log σ_i²)/2 + ν²_iipq/(2σ_i²) ],  (5)

and maximizing over each σ_i² gives

ℓ(V) ≅ −N²(N − 1) log |V| − (M/2) Σ_{p,q>p} Σ_{i,j>i} ν²_ijpq − (N(N − 1)/4) Σ_i log Σ_{p,q>p} ν²_iipq.  (6)

This is the new objective function of JAD. It is worth noting that Eq. (6) approaches the original JAD objective as the number of samples (M) becomes large relative to the number of parameters to be estimated (N²).
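A direct transcription of the (negated) objective above can be sketched as follows (Python/NumPy rather than the paper's MATLAB; the function name and the convention of passing the cumulant matrices as a list are ours):

```python
import numpy as np

def neg_loglik(V, C_list, M):
    """Negative of the new JAD objective: nu^{pq} = V C^{pq} V^T; off-diagonal
    elements are penalized with weight M/2, diagonal ones through the
    (N(N-1)/4) * sum_i log sum_{p,q>p} nu_iipq^2 term."""
    N = V.shape[0]
    _, logdet = np.linalg.slogdet(V)
    off, diag_sq = 0.0, np.zeros(N)
    iu = np.triu_indices(N, k=1)
    for C in C_list:                  # one cumulant matrix per pair (p, q > p)
        nu = V @ C @ V.T
        off += np.sum(nu[iu] ** 2)
        diag_sq += np.diag(nu) ** 2
    return (N**2 * (N - 1) * logdet + (M / 2) * off
            + (N * (N - 1) / 4) * np.sum(np.log(diag_sq)))
```

For an orthogonal V the log-determinant term vanishes, and minimization trades the squared off-diagonal elements off against the diagonal log-sum term.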
A Robust Objective Function of Joint Approximate Diagonalization 209
Note that the range of θ can be limited to [0, π/2) by symmetry. Therefore, ν̃²_iipq and ν̃²_jjpq are given by

ν̃²_iipq = α₁ sin 4θ + α₂ cos 4θ + α₃ sin 2θ + α₄ cos 2θ + α₅,  (10)
ν̃²_jjpq = α₁ sin 4θ + α₂ cos 4θ − α₃ sin 2θ − α₄ cos 2θ + α₅,  (11)

where

α₁ = Σ_{p,q>p} (ν_iipq ν_ijpq − ν_ijpq ν_jjpq) / 2,  (12)
α₂ = Σ_{p,q>p} (ν²_iipq + ν²_jjpq − 2 ν_iipq ν_jjpq − 4 ν²_ijpq) / 8,  (13)
α₃ = Σ_{p,q>p} (ν_iipq ν_ijpq + ν_jjpq ν_ijpq),  (14)
α₄ = Σ_{p,q>p} (ν²_iipq − ν²_jjpq) / 2,  (15)
α₅ = Σ_{p,q>p} (3 ν²_iipq + 3 ν²_jjpq + 2 ν_iipq ν_jjpq + 4 ν²_ijpq) / 8.  (16)
Note that these coefficients α₁, ..., α₅ can be calculated before the optimization of Φ_ij because they do not depend on θ. Unlike in the original JADE, Φ_ij cannot be minimized analytically because it includes logarithms. However, the optimal θ̂ is easily calculated numerically because the function has only the single parameter θ in [0, π/2). Though there is the possibility of finding a local optimum, the simple MATLAB function "fminbnd" is employed in this paper. In summary, the proposed method is given as follows:
1. Initialization. Whiten the given observed matrix X (orthogonalization) and calculate the cumulant matrices C^pq = (κ_ijpq) for every p and q > p. Besides, set V to the identity matrix.
2. Sweep. For every pair i and j > i,
(a) Calculate θ̂ minimizing Φ_ij in Eq. (8).
(b) Only if θ̂ is greater than a given small threshold ε, do the actual rotation of V and update every ν_ijpq depending on i or j by θ̂.
3. Convergence decision. If no pair has been actually rotated in the current sweep, end. Otherwise, go to the next sweep.
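Step 2(a) is a bounded one-dimensional minimization; in Python it can be sketched with SciPy's `minimize_scalar` in place of MATLAB's `fminbnd`. The coefficient values below are arbitrary placeholders, and the stand-in objective uses only the ν̃²_iipq form of Eq. (10), since the full Φ_ij is not reproduced here:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Placeholder coefficients standing in for alpha_1..alpha_5 of Eqs. (12)-(16)
a1, a2, a3, a4, a5 = 0.3, -0.5, 0.1, 0.2, 1.0

def objective(theta):
    # nu~^2_iipq as a function of the rotation angle, Eq. (10)
    return (a1 * np.sin(4 * theta) + a2 * np.cos(4 * theta)
            + a3 * np.sin(2 * theta) + a4 * np.cos(2 * theta) + a5)

res = minimize_scalar(objective, bounds=(0.0, np.pi / 2), method='bounded')
theta_hat = res.x  # rotate the pair (i, j) only if theta_hat exceeds the threshold
```

Because the search interval is a single period-limited range, a bounded Brent search is cheap; for strongly multimodal Φ_ij a coarse grid pre-scan could guard against local optima.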
4 Results
Here, the proposed method is compared with the original JADE in blind source separation of artificial sources. Regarding the source signals, half of them were generated by the Laplace distribution (super-Gaussian) and the other half by the uniform distribution (sub-Gaussian). All the sources were normalized (mean 0 and variance 1). JAD is known to be effective for such cases where sub- and super-Gaussian sources are mixed. The number of sources N was set to 24 and 30. The mixing matrix A was randomly generated, each element drawn from the standard normal distribution. The non-linearity parameter λ was empirically set to N(N − 1)/2M, which is half of the theoretical value and weakens the non-linearity. The small threshold ε was set to 10⁻⁸. All the experiments were averaged over 10 runs. The results are shown in Fig. 1. Fig. 1-(a) shows the transitions of the separating error along the sample size for the proposed method and the original JADE. Fig. 1-(b) shows the transitions of Ψ (the objective function of the original JADE). In order to clarify the difference between the two methods, the transitions of the t-statistics comparing the separating error of the proposed method with that of the original JADE are shown in Fig. 1-(c). The t-statistics were calculated under the assumption that there are two independent groups with the same variance, where the sample size of each group is 10 (the number of runs). Though the results fluctuated considerably, especially for N = 30, the t-statistics tend to be smaller than 0 for relatively small sample sizes (roughly under 1200 for N = 24 and 1800 for N = 30). In addition, the t-statistics are often below the t-test threshold at the 0.1 level, which shows that for small sample sizes the superiority of the proposed method is often statistically significant.
Fig. 1. Separating error and reduction rate along the sample size: The left and right
sides correspond to N = 24 and N = 30, respectively. (a) The transitions of Amari’s
separating error [1] along the sample size M by the proposed method (solid curves)
and the original JADE (dashed). (b) The transitions of Ψ by the proposed method
(solid) and the original JADE (dashed). (c) The transitions of the t-statistics comparing
the proposed method with the original JADE for the separating error (solid curves).
The dashed and dotted lines are the zero line and the t-test threshold (10% and left-
tailed), respectively. If the t-statistic is smaller than the threshold, the superiority of
the proposed method is statistically significant at the 0.1 level.
5 Conclusion
In this paper, we proposed a new objective function of JAD based on an information theoretic approach, and a JADE-like method minimizing that function. The numerical results show that the proposed method is effective when the number of samples is limited. We are planning to improve the proposed method by analyzing numerical results and elaborating the probabilistic model. In particular, we are planning to carry out extensive numerical experiments in order to find the optimal value of the non-linearity parameter λ and to estimate the accurate distribution of the diagonal element ν_iipq (which is roughly approximated as Gaussian in this paper). We are also planning to compare this method with other ICA methods, such as the extended infomax algorithm [6]. In addition, we are planning to apply this method to various practical applications as well as artificial datasets.
References
1. Amari, S., Cichocki, A.: A new learning algorithm for blind signal separation. In:
Touretzky, D., Mozer, M., Hasselmo, M. (eds.) Advances in Neural Information
Processing Systems 8, pp. 757–763. MIT Press, Cambridge (1996)
2. Cardoso, J.F.: High-order contrasts for independent component analysis. Neural
Computation 11(1), 157–192 (1999)
3. Cardoso, J.F., Souloumiac, A.: Blind beamforming for non Gaussian signals. IEE
Proceedings-F 140(6), 362–370 (1993)
4. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing: Learning
Algorithms and Applications. Wiley (2002)
5. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley
(2001)
6. Lee, T.W., Girolami, M., Sejnowski, T.J.: Independent component analysis using
an extended infomax algorithm for mixed subgaussian and supergaussian sources.
Neural Computation 11(2), 417–441 (1999)
7. Matsuda, Y., Yamaguchi, K.: An adaptive threshold in joint approximate diago-
nalization by assuming exponentially distributed errors. Neurocomputing 74, 1994–
2001 (2011)
8. Matsuda, Y., Yamaguchi, K.: An Information Theoretic Approach to Joint Approx-
imate Diagonalization. In: Lu, B.-L., Zhang, L., Kwok, J. (eds.) ICONIP 2011, Part
I. LNCS, vol. 7062, pp. 20–27. Springer, Heidelberg (2011)
TrueSkill-Based Pairwise Coupling
for Multi-class Classification
Jong-Seok Lee
1 Introduction
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 213–220, 2012.
c Springer-Verlag Berlin Heidelberg 2012
distinguish a class from the remaining classes [14]. Thus, the total number of binary classifiers is equal to the number of target classes, i.e., N = C. When a novel sample is given for classification, the class whose corresponding classifier shows the highest probability is chosen. In the "one-vs-one" approach, all C(C − 1)/2 pairwise combinations of the classes are considered, and for each pair a binary classifier f_ij is trained to distinguish class i from class j [12]. For a test sample x, each classifier provides its preference between the two classes it saw during training, in the form r_ij = P(x in class i | x in class i or j). The final classification decision is drawn from the outputs of all the classifiers. In [9], a Bradley-Terry model-based method was proposed to estimate the probability p_i that the sample belongs to class i by minimizing the weighted Kullback-Leibler distance between r_ij and q_ij = p_i/(p_i + p_j), i.e.,
min_{p_i} Σ_{i<j} w_ij [ r_ij log(r_ij / q_ij) + r_ji log(r_ji / q_ji) ],
subject to 0 ≤ p_i ≤ 1, i = 1, ..., C, and Σ_i p_i = 1,  (1)
where r_ji = 1 − r_ij, q_ij = 1 − q_ji, and w_ij is a weighting factor that can simply be set to unity or to other values, such as the number of training samples for classes i and j. An iterative algorithm to solve the above optimization problem was proposed in [9], and a more stable algorithm was developed in [16]. Finally, the "error-correcting output code" approach treats a multi-class problem as a communication task [4]. Binary classifiers are trained as in the one-vs-all and one-vs-one approaches, but each of them assigns some classes a positive label and the rest a negative label. The number of classifiers must be redundant, i.e., higher than the minimum number required to uniquely distinguish each class, so that classification errors by some classifiers can be recovered. When a test sample is inputted, the final decision is based on decoding the classifiers' outputs, e.g., by Hamming decoding. The approach was extended to allow binary classifiers to leave some classes out of both the positive and negative labels [1]. In [11,7], the aforementioned approaches were compared on various problems, showing that their performances are nearly the same. It was also observed that the one-vs-one approach is more practical than the others due to its reduced time complexity [11].
In this paper, we address the issue of the complexity of the one-vs-one approach during the classification of an unseen sample, which has rarely been addressed in prior work. Traditionally, C(C − 1)/2 pairwise classifications need to be performed to classify the sample. Then, an aggregation process takes place to combine the results, which requires additional complexity. For instance, the Bradley-Terry model-based method [9] performs an iterative estimation of the class probabilities according to the problem in (1), from which the final classification decision is made.
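The iterative estimation of [9] can be sketched as follows (a Python/NumPy sketch of the classic pairwise-coupling iteration with unit weights w_ij; the function name and convergence settings are ours):

```python
import numpy as np

def pairwise_coupling(R, n_iter=500, tol=1e-10):
    """R[i, j] = r_ij = P(x in class i | x in class i or j), R[j, i] = 1 - R[i, j].
    Iterates p_i <- p_i * (sum_j r_ij) / (sum_j q_ij) and renormalizes."""
    C = R.shape[0]
    p = np.full(C, 1.0 / C)
    off = ~np.eye(C, dtype=bool)          # exclude the i = j entries
    for _ in range(n_iter):
        q = p[:, None] / (p[:, None] + p[None, :])   # q_ij = p_i / (p_i + p_j)
        p_new = p * (R * off).sum(axis=1) / (q * off).sum(axis=1)
        p_new /= p_new.sum()
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new
    return p
```

If R is itself generated from some probability vector p (i.e., r_ij = p_i/(p_i + p_j)), that p is a fixed point of the iteration and is recovered from a uniform start.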
Therefore, we propose a novel aggregation method for the one-vs-one approach. In contrast to conventional methods, the "score" of each class label for
2 Proposed Method
As already mentioned, the proposed method is based on the one-vs-one approach: N = C(C − 1)/2 binary classifiers are constructed, and each of them is trained on the training data having the corresponding two class labels. When a test sample is given, binary classifications are conducted sequentially by choosing one of the N trained classifiers; during this process the score of each class label (and consequently the ranking of the classes) is updated at each binary classification step. The key components of the proposed method are the TrueSkill-based on-line ranking scheme and the prioritized match-making scheme for classifier selection, which are explained below.

The TrueSkill system [10] models a player's skill as a normally distributed random variable, characterized by the player's average skill and the degree of uncertainty (or variance) in that skill. A larger variance indicates that the player's performance is more unstable. In our case, a player can be considered a class label. Once the game between two players has finished and the winner and loser have thus been determined (in our case, once a binary classifier has performed classification of a given sample), their skills (μ) and degrees of uncertainty (σ) are updated according to the following rule:
μ_winner ← μ_winner + (σ²_winner / c) · v(Δμ/c, ε/c),  (2)
μ_loser ← μ_loser − (σ²_loser / c) · v(Δμ/c, ε/c),  (3)
σ²_winner ← σ²_winner · [1 − (σ²_winner / c²) · w(Δμ/c, ε/c)],  (4)
σ²_loser ← σ²_loser · [1 − (σ²_loser / c²) · w(Δμ/c, ε/c)],  (5)

with Δμ = μ_winner − μ_loser and ε the draw-margin parameter,
216 J.-S. Lee
where N and Φ are the probability density function and cumulative distribution function of the standard normal distribution, respectively. The parameter β² is a per-game variance; in other words, a larger value of β² means that a game's outcome depends less on the players' skills. According to the above updating rule, the winner's (loser's) skill is increased (decreased), while the variances of both players are decreased in order to reflect that the confidence about the skills increases as more games are played.

In order to speed up the convergence of the ranking, the following "match-making" procedure is used in the proposed method. At the beginning of classification, the scores of all the players are set to the same initial value, so matches are selected randomly, i.e., a classifier is selected randomly among the C(C − 1)/2 trained binary classifiers at each step. After a few random matches, candidates for matches are selected strategically as follows. First, among the players that have not completed all their matches with the other players, the one having the best score is selected (player 1). Then, among the players that have not yet played against player 1, the one having the largest chance of drawing is selected as the opponent of player 1 (player 2). A large chance of drawing between two players means that there exists a large ambiguity in their ranking. Therefore, by promoting the match between these players, such ambiguity can be reduced quickly.
The chance of drawing is a function of the skills (μ₁ and μ₂) and variances (σ₁² and σ₂²) of the two players [10], i.e.,

p_draw = (√2 β / c) · exp(−(μ₁ − μ₂)² / (2c²)).  (8)
3 Experiments
The performance of the proposed algorithm is evaluated on several multi-class classification problems. For comparison, the proposed algorithm without prioritized match-making, i.e., with a match always selected randomly at each classification step, is also evaluated. In addition, the method of [9], where the Bradley-Terry model-based method is applied after the full set of pairwise matches has been conducted, is considered.
Six real-world classification problems were chosen from the UCI Machine Learning Repository [6]. Table 1 summarizes the characteristics of the chosen
Table 1. Summary of the datasets for multi-class classification. Since the original Soybean dataset contains missing values, we used only the samples without missing attributes, as indicated in parentheses. The Soybean dataset specifies the training and test data, while the other datasets do not.
Table 2. Final accuracy and standard deviation values (%) of the Bradley-Terry model-based method, the TrueSkill-based method with random match selection, and the proposed TrueSkill-based method with prioritized match-making. Note that for the Soybean dataset, where the training and test data are fixed over the ten trials, the Bradley-Terry model-based method always produces the same results for the ten different random seeds, so its standard deviation is zero.

Dataset    Bradley-Terry model    TrueSkill (random match-making)    TrueSkill (prioritized match-making)
Zoo 94.60±4.99 94.60±4.99 94.60±4.99
Ecoli 83.75±3.13 83.75±3.13 83.81±3.11
Yeast 55.80±2.50 55.65±2.52 55.74±2.47
Vowel 94.24±1.36 94.38±1.25 94.30±1.38
Soybean 93.92±0.00 94.46±0.36 94.05±0.24
Letter 96.72±0.18 96.73±0.19 96.74±0.17
problems. Except for the Soybean dataset, there is no explicit separation of training and test data. Thus, we randomly divided the whole data of each dataset into halves, one for training and the other for testing.
SVMs with Gaussian kernels were employed as the binary classifiers, using the LIBSVM library [2]. In our method, we set the initial scores (μ) to 25, the initial variances (σ²) to (25/3)², β = 25/6, and ε = 0.5, as in [10]. The number of random matches at the beginning of our algorithm was set to five. We repeated each experiment ten times with different random seeds, and report the average performance below.
The final classification accuracies of the three methods are compared in Table 2. It can be seen that the final accuracies and their standard deviations for the three algorithms are almost the same, which validates the empirical convergence of the proposed TrueSkill-based methods.
As aforementioned, an important advantage of the proposed method is that
the number of matches can be reduced while the classification performance is
not degraded significantly. Table 3 shows the number of matches (as the ratio
with respect to the number of total matches) that is required to reach the 90%,
95%, 98%, and 100% of the final accuracy (i.e., the accuracy obtained after all the matches are performed). On average, the final accuracy is reached when 95.5% of the matches are performed; in other words, the complexity can be reduced by up to 4.5% without any performance loss. In particular, the maximum reduction (20%) was obtained for the Yeast dataset. If a slight loss of accuracy is permitted, the complexity reduction becomes larger; for relative accuracy errors of 2%, 5%, and 10%, the complexity can be reduced by 13.8%, 22%, and 43.8%, respectively. Higher reductions are more difficult to obtain for the Zoo and Letter problems than for the other problems, while the proposed method is most effective for the Yeast problem.
The evolution of the classification accuracy with respect to the number of
matches is shown in Fig. 1, where the two match-making methods are compared.
It is clearly seen that the prioritized match-making method helps the accuracy to
Fig. 1. Evolution of the classification accuracy with respect to the number of matches
for the two different match-making schemes. (a) Zoo (b) Ecoli (c) Yeast (d) Vowel (e)
Soybean (f) Letter.
converge to the final value quickly. The prioritized scheme produces stair-shaped curves because the class label with the highest score competes against the remaining classes until all of its matches are performed, which yields the flat regions in the curves. Because of this, random match-making is faster at the beginning. However, as the number of matches increases, the prioritized match-making scheme converges faster than random match-making, as was also shown in Table 3.
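The prioritized match selection described above can be sketched as a simple selection rule. This is our reading of the text; the data structures and names are our own:

```python
def next_match(scores, played):
    """Pick the next pairwise match: the currently top-rated class plays its
    remaining opponents before lower-rated pairs are considered.  `scores`
    maps class labels to ratings; `played` is a set of frozenset pairs."""
    order = sorted(scores, key=scores.get, reverse=True)
    for a in order:
        for b in order:
            if a != b and frozenset((a, b)) not in played:
                return a, b
    return None                      # every pair has been played
```

Until the top-rated class has played all opponents, every returned pair involves it; this is what produces the flat, stair-shaped segments in the accuracy curves.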
4 Conclusion
References
1. Allwein, E.L., Schapire, R.E., Singer, Y.: Reducing multiclass to binary: a unifying
approach for margin classifiers. J. Mach. Learn. Res. 1, 113–141 (2001)
2. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM
Trans. Intell. Syst. Techn. 2(3), 27:1–27:27 (2011),
http://www.csie.ntu.edu.tw/~cjlin/libsvm
3. Crammer, K., Singer, Y.: On the algorithmic implementation of multiclass kernel-
based vector machines. J. Mach. Learn. Res. 2, 265–292 (2001)
4. Dietterich, T.G., Bakiri, G.: Solving multiclass learning problems via error-
correcting output codes. J. Artif. Intell. Res. 2, 263–286 (1995)
5. Elo, A.E.: The Rating of Chessplayers, Past and Present. Arco Publishing, New
York (1986)
6. Frank, A., Asuncion, A.: UCI machine learning repository. School of Information
and Computer Science, University of California, Irvine (2010)
7. Garcı́a-Pedrajas, N., Ortiz-Boyer, D.: An empirical study of binary classifier fusion
methods for multiclass classification. Inform. Fusion 12, 111–130 (2011)
8. Glickman, M.E.: Parameter estimation in large dynamic paired comparison experiments. Appl. Statist. 48, 377–394 (1999)
9. Hastie, T., Tibshirani, R.: Classification by pairwise coupling. Ann. Stat. 26(2),
451–471 (1998)
10. Herbrich, R., Minka, T., Graepel, T.: Trueskill: a Bayesian skill rating system.
In: Adv. Neural Info. Process. Syst., vol. 19, pp. 569–576. MIT Press, Cambridge
(2007)
11. Hsu, C.W., Lin, C.J.: A comparison of methods for multiclass support vector ma-
chines. IEEE Trans. Neural Netw. 13(2), 415–425 (2002)
12. Kreßel, U.: Pairwise classification and support vector machines. In: Schölkopf, B.,
Burges, C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods-Support Vector
Learning, pp. 255–268. MIT Press, Cambridge (1999)
13. Lorena, A.C., de Carvalho, A.C.P.L.F., Gama, J.M.P.: A review on the combination
of binary classifiers in multiclass problems. Artif. Intell. Rev. 30, 19–37 (2008)
14. Rifkin, R., Klautau, A.: In defense of one-vs-all classification. J. Mach. Learn.
Res. 5, 101–141 (2004)
15. Weng, R.C., Lin, C.J.: A Bayesian approximation method for online ranking. J.
Mach. Learn. Res. 12, 267–300 (2011)
16. Wu, T.F., Lin, C.J., Weng, R.C.: Probability estimates for multi-class classification
by pairwise coupling. J. Mach. Learn. Res. 5, 975–1005 (2004)
Analogical Inferences in the Family Trees Task:
A Review
1 Introduction
In a work that has become a recurrent citation in the field, Hinton [1] paved the way for connectionist modelling of higher-order cognitive processes by exploring neural networks' ability to make analogical inferences to complete their knowledge of a domain by learning another with the same relational structure [2].
This ambitious goal was tackled by means of a pattern association task in
which two families with isomorphic genealogical trees (see Fig. 1) were involved.
Through supervised learning, a multilayer perceptron was trained to point at
the right relative given one family member and the relationship between them
as inputs. For example, given ‘Roberto’ and ‘son’ as inputs, the network had to
learn to point at ‘Emilio’, since Emilio is Roberto’s son.
In this so-called family trees task, only twelve kinship terms were allowed (mother, daughter, sister, wife, aunt, niece, and their male equivalents), so the relational structure of a family was defined by 52 relationships. These were coded as triplets of the form {person 1, relationship, person 2}, in which person 1 and relationship were the network's inputs and person 2 its desired output.
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 221–228, 2012.
© Springer-Verlag Berlin Heidelberg 2012
222 S. Varona-Moya and P.L. Cobos
Fig. 1. Isomorphic family trees of British and Italian relatives. The symbol ‘=’ means
“married to”. Taken from [1].
Hinton trained the network on only 100 of the 104 triplets concerning both
families and checked if it could guess the remaining four triplets. For each of
these unlearned triplets, the filler of the person 1 role had never been associated
with the filler of the person 2 role during training, but it had actually been
associated with other family members in the person 2 role. Thus, solving any of
those triplets would not be explicable in terms of correlation between fillers, but
in terms of some structure-based process, arguably an analogical inference.
Two simulations were run and subjected to this generalization test. Results
were surprisingly good: one of them pointed at the correct filler of the person 2
role in the four test triplets, whereas the other one got three test triplets right.
Besides, an indirect analysis of one simulation’s internal representations of fillers
of the person 1 role showed that they were organized around structural features
like generation and family branch, and that those corresponding to analogous
fillers (e.g., Charlotte and Sophia) were similar. So, it was concluded that the
network had systematically solved the test triplets by discovering both relational
structures and matching analogous concepts together.
However, despite the sound rationale supporting this finding, some aspects of the way it was obtained cast doubt on it. We therefore considered it advisable to replicate Hinton's work [1] in order to obtain stronger evidence for his conclusions.
The rest of the paper is organized as follows. First, we will describe the architecture of that perceptron. Then, the problematic aspects and the corresponding methodological changes that they impose will be identified. Finally, after detailing the learning procedure of our simulations, we will show and discuss the results obtained in our review.
2 Network Architecture
Hinton's [1] network was a five-layer feed-forward perceptron (see Fig. 2). Each neuron had a logistic transfer function.
The input layer was divided into two groups of 24 and 12 units, which coded the fillers of the person 1 and relationship roles, respectively. The output layer consisted of a single group of 24 units, which coded the filler of the person 2 role.
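A forward pass through such a network can be sketched as follows. The 24 person-1 units, 12 relationship units, and 24 output units are from the text; the hidden-layer sizes (6, 6, 12, 6) and the random initialization are assumptions loosely following common descriptions of Hinton's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))   # logistic transfer function

def init(n_in, n_out):
    return rng.normal(0.0, 0.3, (n_in, n_out)), np.zeros(n_out)

W_p, b_p = init(24, 6)    # person-1 encoding layer (size assumed)
W_r, b_r = init(12, 6)    # relationship encoding layer (size assumed)
W_c, b_c = init(12, 12)   # central hidden layer (size assumed)
W_d, b_d = init(12, 6)    # decoding layer (size assumed)
W_o, b_o = init(6, 24)    # output layer: filler of the person-2 role

def forward(person1, relation):
    h = np.concatenate([sig(person1 @ W_p + b_p), sig(relation @ W_r + b_r)])
    h = sig(h @ W_c + b_c)
    h = sig(h @ W_d + b_d)
    return sig(h @ W_o + b_o)

# One-hot example: person index 3 with relationship index 5
p = np.zeros(24); p[3] = 1.0
r = np.zeros(12); r[5] = 1.0
out = forward(p, r)   # 24 activations, one per candidate relative
```

Each output unit's activation can then be compared against Hinton's 0.5 criterion to decide whether the network "points at" the right relative.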
and compare them with the five triplets in which Charlotte acts as person 1:
As can be seen, except for one out of five cases, these members are perfectly
interchangeable as fillers of the person 1 role.
If, from these ten triplets, {Colin, father, James} were reserved for the generalization test, the similarity between Colin and Charlotte regarding the output pattern would be less marked, but they could still be considered almost equivalent fillers of the person 1 role.
Now, it is well known that perceptrons with one or more hidden layers come
up with similar internal representations of input patterns not only when these
are alike, but also when they are associated with the same output patterns. This
way they escape the so-called “tyranny of similarity” [3]. In the family trees task,
this means that those members who are mostly associated with the same fillers
of the person 2 role would have similar internal distributed representations.
Returning to our example, Colin and Charlotte would very likely have similar
hidden layer 1 representations. So, when the input compound {Colin, father}
were presented to the network, hidden layer 1 units would transform it into a
vector similar to that corresponding to the input compound {Charlotte, father},
which was associated with the output pattern of {James} during training. Thus,
the network would produce the correct answer with a high probability, but not
because of an analogy-based generalization from the Italian tree.
This example highlights the need to carefully select the test triplets to be
sure that they are only solvable through analogical inferences. But, as far as we
know, this aspect was not taken into account by Hinton [1]. As a matter of fact, he did not report which four test triplets were used; he did, however, report their correct answers. From these, we can deduce the triplets from which those four must have been selected. These are:
the generalization test should be significantly better after learning about both
families compared to the results after learning one family.
Thus, our simulations were first trained on 50 British triplets only (the remaining two were reserved for the generalization test) and then they were trained simultaneously on the 50 British and the 52 Italian triplets.
Besides, to provide evidence of the influence of test triplets on the network’s
performance, we also tested our simulations with two unsuitable triplets, {James,
father, Andrew} and {Penelope, daughter, Victoria}, expecting that they would
be easily solved due to the similarity of fillers of the person 1 role.
4 Learning Procedure
500 simulations were run to strengthen the external validity of our results. The
Neural Network Toolbox of MATLAB (Version 7.4.0.287) was used.
We followed all the procedural specifications detailed in [1] and used the same parameter values, except for the learning rate and the coefficient of the momentum term in the second phase of the learning process, since with Hinton's values we could not obtain a significant reduction of the cost function (a modified version [1] of the sum of squared errors [SSE]).
Instead, we used a learning rate of 0.04 and a momentum coefficient of .95 in that second phase, and allowed 50000 epochs of training¹. This way, in the case of the two
suitable test triplets, the 500 simulations that learned the 50 British triplets
achieved a mean (standard deviation in parentheses) SSE of 0.046 (0.005) and
of 0.127 (0.016) when they learned the 102 triplets concerning both families.
Likewise, in the case of the two unsuitable test triplets, the corresponding values
were 0.048 (0.006) and 0.129 (0.015), respectively. Thus, each training set was
almost perfectly learned irrespective of the generalization test triplets.
Performance on each generalization test was scored following Hinton’s [1] crite-
rion, according to which test triplets were considered solved only if the output
unit (or units) corresponding to the right answer had an activation level above
.5 and all the other output units were below or equal to .5.
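Hinton's scoring criterion can be stated compactly; this is a direct transcription of the rule above, and the function name is ours:

```python
def solved(output, correct_units, threshold=0.5):
    """Hinton's criterion: a test triplet counts as solved only if every
    correct output unit is above the threshold and every other unit is at or
    below it."""
    return (all(output[i] > threshold for i in correct_units) and
            all(o <= threshold
                for j, o in enumerate(output) if j not in correct_units))
```

Note that a triplet with one weakly active wrong unit (say 0.6) fails the criterion even if the correct unit is strongly active.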
Results corresponding to the two mentioned training conditions on the two
different generalization tests are displayed in Table 1. For the sake of brevity, we
will refer to the test constituted by the triplets {Christine, husband, Andrew} and {James, wife, Victoria}, which were thought to induce structure-based inferences, as generalization test A, and to that constituted by the triplets {James, father, Andrew} and {Penelope, daughter, Victoria}, which could arguably be solved by simpler processes, as generalization test B.
¹ Our training times contrast with Hinton's [1], who stated that one of his simulations achieved null SSE after 1500 epochs. However, according to Melz [4], Hinton himself declared that the training times reported in his paper [1] were "probably erroneous".
Table 1. Distribution of the 500 simulations under the two training conditions according to their performance on both generalization tests
χ²(1, N = 47) = 47, p < .001, but solving {Christine, husband, Andrew} was not, χ²(1, N = 572) = 0.25, p = .882. In other words, despite our careful selection of test triplets, only {James, wife, Victoria} was really sensitive to learning an analogous domain. On the contrary, the triplet {Christine, husband, Andrew} was mainly solvable due to grasping the relational structure of the British family.
Future work should be done to explain this inequality and to understand why
so few simulations benefited from learning the analogous family tree.
6 Conclusions
We reviewed one of the first works [1] that tackled connectionist modelling of analogical thinking, as some methodological aspects raised doubts about its results. We focused on the network's reportedly good capacity to acquire knowledge of a domain via generalizations from an analogous domain, and showed, through experimental manipulation, that while such good performance was probably an artifact, there is strong evidence that a multilayer perceptron is sensitive to being trained on analogous domains with a common relational structure.
References
1. Hinton, G.E.: Learning Distributed Representations of Concepts. In: Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pp. 1–12. Lawrence Erlbaum Associates, New Jersey (1986)
2. Clement, C.A., Gentner, D.: Systematicity as a Selection Constraint in Analogical
Mapping. Cognitive Sci. 15, 89–132 (1991)
3. McLeod, P., Plunkett, K., Rolls, E.T.: Introduction to Connectionist Modelling of
Cognitive Processes. Oxford University Press, New York (1998)
4. Melz, E.R.: Developing Microfeatures by Analogy. In: Proceedings of the Fourteenth
Annual Conference of the Cognitive Science Society, pp. 42–47. Lawrence Erlbaum
Associates, New Jersey (1992)
An Efficient Way of Combining SVMs for Handwritten
Digit Recognition
Renata F.P. Neves, Cleber Zanchettin, and Alberto N.G. Lopes Filho
1 Introduction
Nowadays the world is digital. Technology has become ubiquitous in people's lives, and some human tasks, such as handwriting recognition, voice recognition, and face recognition, are now machine tasks. The main recognition process [1][2] used in this kind of application requires the following steps: data acquisition; pre-processing to eliminate noise; segmentation, where the objects (text, numbers, faces, etc.) to be recognized are located and separated from the background; feature extraction, where the main features of each object are extracted; and finally recognition, or classification, where the objects are labeled based on their features.
This paper focuses on the classification task; we use handwritten digit recognition as a case study because this task exhibits several typical classification issues. For example, patterns can be ambiguous, or some features may be shared by more than one class. An example of this problem is presented in Fig. 1. In Fig. 1a and Fig. 1c the correct value of the image is seven, and in Fig. 1b, four. But Fig. 1a and Fig. 1b are similar and could be the same digit, and Fig. 1c could be confused with the digit one. Because of this, building a classifier that generalizes well is a hard task. In some cases the best choice is to use context information to differentiate one class from another.
Hidden Markov Models (HMMs) [3] are frequently used to analyze the context and improve the classifier's recognition rate, but their main disadvantage is the processing time; context-modeling techniques are usually slower. Thus, our research focuses on optimizing and combining classical approaches while trying to introduce more knowledge into the classifier.
A brief overview of the handwritten digit recognition research in recent years shows
that classical classifiers such as the multilayer perceptron (MLP) [5], k-nearest
neighbor (kNN) [2] and support vector machine (SVM) [6] are extensively used. Some
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 229–237, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Fig. 1. Images from the NIST database [4]: a) handwritten digit 7; b) handwritten digit 4; and c) handwritten digit 7 again.
The SVM [6] is considered one of the best binary classifiers, because it finds the maximum margin of separation between two classes. The fact that the SVM is a binary classifier is its greatest disadvantage, as most recognition tasks are multiclass problems. To address this, some authors combine SVMs [8] or use the SVM as a decision-maker classifier [9].
Based on these observations, this paper introduces a hierarchical SVM combination that provides a highly accurate recognition rate with a short response time when applied to handwritten digit recognition.
The present study is structured as follows: related works are presented in Section 2; the proposed SVM combination architecture in Section 3; the experiments and results in Section 4; and the conclusions in Section 5.
2 Related Works
Support Vector Machine (SVM) [6][5] is a binary classification technique. The training
phase consists of finding the support vectors for each class and creating a function that
represents an optimal separation margin between the support vectors of different
classes. Consequently, it is possible to obtain an optimal hyperplane for class separation.
Analyzing the SVM and the characteristics presented above, it seems similar to the perceptron [1] because it also tries to find a linear function to separate classes. But there are two main differences: the SVM finds the optimal linear function, while the perceptron seeks any linear separation function; and the SVM can deal with non-linearly separable data. To do so, the SVM uses a kernel function to increase the feature dimensionality and consequently make the data linearly separable.
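The kernel trick mentioned above can be illustrated with the RBF (Gaussian) kernel, which measures similarity in an implicitly higher-dimensional feature space. This is a generic illustration, not code from the paper:

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2): the inner product of
    the two points in an implicit higher-dimensional feature space, computed
    without ever constructing that mapping explicitly."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))
```

The kernel equals 1 for identical points and decays with distance, which is what lets a linear separator in the feature space act as a non-linear boundary in the original input space.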
There are two classical ways to work with multiple classes using SVMs: one-against-all and one-against-one [13]. In the one-against-all approach, one SVM is created for each
capitalized or not, and then defines if the upper or lower case forms of a character can
be included into a single class. The characters are then submitted to the SVM
recognizer to obtain the final classification.
In [19] the authors create an ensemble classifier, using gating networks to assemble the outputs of three different neural networks. In [20] a method is proposed that uses a new feature extraction technique based on recursive subdivision of the character image. Combinations of MLPs and SVMs are also used to recognize non-Western languages.
3 Proposed Algorithm
After analyzing the state of the art, we propose another simple SVM combination. The main idea is to create a hierarchical SVM structure. The first level is composed of a set of SVMs, one for each class pair, where each class belongs to exactly one pair. For example, in the case of digit recognition, we have 10 possible classes (outputs): 0 to 9. The first level will have five SVMs, one for each of these pairs: 0-1, 2-3, 4-5, 6-7 and 8-9.
The pattern is classified by each SVM in the first level. It is expected that the SVM trained on the pair containing the correct class classifies the sample correctly, while the others may choose either class of their pair. The second level combines the outputs obtained in the first level, using the same strategy as the previous level. The process continues until there is only one output. An example of this hierarchical structure is shown in Fig. 2, where the letters a, b, x, y and i represent the outputs given by the SVMs and the numbers in parentheses represent the pair that each SVM can differentiate.
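The resulting elimination tournament can be sketched as follows, assuming a `classify(a, b)` callable that stands in for the binary SVM trained on the pair (a, b) and returns the winning label for the current sample (the names are ours):

```python
from itertools import zip_longest

def tournament(classify, candidates):
    """Hierarchical pairwise elimination: each level pairs up the surviving
    class labels and keeps one winner per pair; an unpaired label gets a bye
    to the next level."""
    while len(candidates) > 1:
        survivors = []
        for a, b in zip_longest(candidates[0::2], candidates[1::2]):
            survivors.append(a if b is None else classify(a, b))
        candidates = survivors
    return candidates[0]
```

With 10 classes this needs 5 + 3 + 2 - 1 = 9 binary decisions per sample (5, then 3 with a bye, then 2 with a bye, then 1), versus 45 for the full one-against-one scheme, which is where the speed advantage comes from.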
The images were separated into isolated digits using an algorithm based on connected component labeling [10]; each label corresponds to an isolated digit. After segmentation, vertical and horizontal projections [15] were used to centralize the digit in the image. Images larger than 20x25 were cropped by removing the extra white borders; if the digit itself was still larger than 20x25, it was resized. The size (20x25) was selected because the majority of digits are approximately of this size.
Each digit was manually separated and labeled into classes to be used in the supervised training of the classifiers. The final database contains a total of 11,377 digits, with an average of about 1,150 digits per class. This digit database was separated into a training set of 7,925 samples (approximately 800 digits per class) and a test set of 3,452 samples (approximately 350 digits per class).
The feature vector is the same for all classifiers: the image matrix structured as a vector. Before converting it into a vector, however, the image was resized again to 12x15 in order to reduce the dimensionality of the feature vector, generating a vector with 180 binary features. This size was empirically defined based on previous experiments.
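The feature extraction step can be sketched as follows. The 12x15 target size and the 180 binary features are from the text; the nearest-neighbour subsampling is an assumption, since the paper does not specify the resizing method:

```python
import numpy as np

def to_feature_vector(binary_image):
    """Flatten a binarized digit crop into the 180-element feature vector by
    subsampling it to 15 rows x 12 columns and raveling the result."""
    img = np.asarray(binary_image)
    rows = np.linspace(0, img.shape[0] - 1, 15).round().astype(int)
    cols = np.linspace(0, img.shape[1] - 1, 12).round().astype(int)
    return img[np.ix_(rows, cols)].astype(np.uint8).ravel()   # 15 * 12 = 180
```

Applied to a 25x20 crop (25 rows, 20 columns), this yields exactly the 180 binary features described in the text.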
All algorithms were implemented using Matlab™ [16] version R2010a and trained after parameter selection. The same trained MLP and SVMs were used throughout the methods, as were the same kNN parameters.
the recognition rate criterion. The one-against-all approach is the only SVM approach that does not present a high recognition rate. However, analyzing it carefully, we can see that among the 634 errors, 629 samples had not been labeled: all SVMs returned the same classification, making it impossible to classify the pattern (the sample was rejected by the classifier). Among the 634 errors, only 15 were actual misclassifications. This approach can be a good choice when rejection is preferable to a classification error.
The proposed hierarchical SVM is the third best classifier for handwritten digits, but the difference between the first and the second is very small, so in the statistical tests they are equivalent. In this case, processing time becomes the deciding factor: the kNN-SVM and 45SVMs techniques are the slowest, and the proposed method is the fastest. Thus, the hierarchical method stands out for its high recognition rates and short processing time. In Fig. 4 the processing time and error rate of the evaluated methods were normalized and plotted together; here the best method is the one whose results lie closest to the origin. This analysis shows that the proposed method is the best of the evaluated methods.
5 Conclusion
This paper presents a method to combining SVMs applied to handwritten recognition
problem considering a short processing time and a highly accurate recognition rate.
After a brief study of the related works, it was found that classical classifiers are still
used to recognize handwritten texts because of their low processing times and high recognition rates. New approaches, such as those proposed by Neves et al. [8] and Zanchettin et al. [7], increase the recognition rates but also increase processing time and computational costs.
Based on these criteria, classical classifiers and classifiers built specifically for handwritten digit recognition were implemented and tested. The proposed approach presented the best results considering both processing time and recognition rate. The kNN-based techniques had the longest processing time and the highest recognition rate, while the hierarchical SVM combination obtained high recognition rates and the shortest processing time. As verified by the experiments, this SVM combination is the best choice when both criteria are important requirements.
Future work will consider handwritten characters and attempt to combine the proposed method with word classification methods.
References
1. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach. Prentice Hall Pearson
Education Inc. (2003)
2. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley-Interscience (2001)
3. Plötz, T., Fink, G.A.: Markov models for offline handwriting recognition: a survey. International Journal on Document Analysis and Recognition 12(4), 269–298 (2009)
4. NIST Special Database 19. Handprinted Forms and Characters Database,
http://www.nist.gov/srd/nistsd19.cfm
5. Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd edn. Prentice-Hall
(1999)
6. Vapnik, V.N.: Statistical Learning Theory. John Wiley and Sons, New York (1998)
7. Zanchettin, C., Bezerra, B.L.D., Azevedo, W.W.: A KNN-SVM Hybrid Model for Cursive Handwriting Recognition. In: IEEE Int. Joint Conf. on Neural Networks, Brisbane (2012)
8. Neves, R.F.P., Lopes-Filho, A.N.G., Mello, C.A.B., Zanchettin, C.: A SVM Based Off-
Line Handwritten Digit Recognizer. In: IEEE International Conference on Systems, Man,
and Cybernetics, vol. 1, pp. 510–515 (October 2011)
9. Bellili, A., Gilloux, M., Gallinari, P.: An MLP-SVM combination architecture for offline
handwritten digit recognition. Reduction of recognition errors by Support Vector Machines
rejection mechanisms. International Journal on Document Analysis and Recognition 5,
244–252 (2004)
10. Bhowmik, T.K., Ghanty, P., Roy, A., Parui, S.K.: SVM-based hierarchal architectures for
handwritten Bangla character recognition. International Journal on Document Analysis and
Recognition 12, 97–108 (2009)
11. Parsiavash, H., Mehran, R., Razzazi, F.: A robust free size OCR for omni-font
Persian/Arabic document using combined MLP/SVM. In: Proceedings of Iberoamerican
Congress on Pattern Recognition, pp. 601–610 (2005)
12. Camastra, F.: A SVM-based cursive character recognizer. Pattern Recognition 40, 3721–
3727 (2007)
13. Hsu, C.W., Lin, C.J.: A Comparison of Methods for Multiclass Support Vector Machines.
IEEE Transactions on Neural Networks 13(2), 415–425 (2002)
14. Parker, J.R.: Algorithms for Image Processing and Computer Vision. John Wiley and Sons
(1997)
15. Gonzalez, R., Woods, C., Richard, E.: Digital Image Processing. Addison-Wesley (1992)
16. Mathworks MatlabTM – The language of technical computing,
http://www.mathworks.com/products/matlab/
17. Ciresan, D.C., Meier, U., Gambardella, L.M., Schmidhuber, J.: Deep, big, simple neural
nets for handwritten digit recognition. Neural Computation 22, 3207–3220 (2010)
18. Camastra, F.: A SVM-based cursive character recognizer. Pattern Recognition 40, 3721–
3727 (2007)
19. Zhang, P., Bui, T.D., Suen, C.Y.: A novel cascade ensemble classifier system with a high
recognition performance on handwritten digits. Pattern Recognition 40, 3415–3429 (2007)
20. Vamvakas, G., Gatos, B., Perantonis, S.J.: Handwritten character recognition through two-
stage foreground sub-sampling. Pattern Recognition (43), 2807–2816 (2010)
Comparative Evaluation of Regression Methods
for 3D-2D Image Registration
Ana Isabel Rodrigues Gouveia¹,², Coert Metz³, Luís Freire⁴, and Stefan Klein³
¹ CICS-UBI – Health Sciences Research Centre, University of Beira Interior, Covilhã, Portugal
² Institute of Biophysics and Biomedical Engineering, University of Lisbon, Lisbon, Portugal
³ Biomedical Imaging Group Rotterdam, Depts. of Medical Informatics & Radiology, Erasmus MC, Rotterdam, the Netherlands
⁴ Escola Superior de Tecnologia da Saúde de Lisboa, Instituto Politécnico de Lisboa, Lisbon, Portugal
1 Introduction
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 238–245, 2012.
© Springer-Verlag Berlin Heidelberg 2012
measure, such iterative optimization procedures usually have a small capture range and therefore require initialization close to the sought pose [1, 2]. In our recently proposed registration-by-regression framework [3], image registration is treated as a nonlinear regression problem, as an alternative to the traditional iterative approach. An MLP was used as the regression model relating the DRR image to the transformation parameters of the 3D object.
In the literature, some authors have already treated medical image registration as a regression problem, but the application, the features, and the outputs predicted by the function differ from our approach. An MLP is used as the regression method in [4, 5], and in [6] different neural networks were studied, including RBF networks. Image registration using SVR [7] and k-NN [8] has been investigated as well.
In this paper we perform a comparative evaluation of seven different regression
techniques for the 3D-2D registration-by-regression problem, particularly for the
registration of 3D preoperative coronary CTA images to 2D intraoperative X-Ray
images. For reference, the results of a conventional registration method (i.e. based on
iterative optimization) obtained in [3] are reported as well.
2 Method
The 3D-2D registration method based on regression was first presented in [3]. The
regression model relates image features of the 2D projection image (Fig. 1) to the
transformation parameters of the 3D image (translation and rotation). Before the
intervention, a set of simulated 2D images (DRRs) is generated by applying random
transformations to the pre-interventional 3D image followed by projection of its
coronary artery segmentation. A set of features extracted from the DRR and their
corresponding transformation parameters form an input-output pair in the training set
for the learning process. During the intervention, the image features of the 2D
projection image are computed and fed as input to the regression function, which
returns the estimated 3D translation and rotation parameters of the 3D image.
Mathematically, given an input vector X = (X₁, …, X_P), we want to predict an output Y with a model f such that Y ≈ f(X). To estimate the parameters of the prediction model we use a set of measurements (xᵢ, yᵢ), for i = 1, …, N (the training data). For each transformation parameter (three rotation angles and three translations) an independent regression model is trained.
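Training one independent model per transformation parameter can be sketched as follows; a linear least-squares fit stands in here for the actual regression methods compared in the paper (MLP, RBF network, SVR, k-NN), and all names are ours:

```python
import numpy as np

def fit_linear(X, Y):
    """Least-squares fit, used only as a stand-in for any of the regression
    models compared in the paper."""
    Xb = np.hstack([X, np.ones((len(X), 1))])       # append a bias column
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return W

def fit_per_parameter(features, params):
    """Train one independent regression model per transformation parameter
    (three rotations and three translations), as described in the text.
    `features` is N x P; `params` is N x 6."""
    return [fit_linear(features, params[:, j]) for j in range(params.shape[1])]
```

At intervention time, the six fitted models are evaluated on the feature vector of the 2D projection image, yielding the six estimated pose parameters.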
The RBF network uses Orthogonal Least Squares as the training algorithm, where RBF
centres are chosen one by one from the input data. Each selected centre maximizes the
increment to the explained variance of the desired output [13].
Support Vector Machine. In Support Vector Regression (SVR) a linear regression
function in a high-dimensional feature space is computed as
f(x) = Σ_i (α_i − α_i*) K(x_i, x) + b, where the input data are mapped via a nonlinear
transformation according to the kernel function K(x_i, x) [14]. The ε-SVR model
[15] with an ε-insensitive loss function [16] and RBF kernel function is used.
Fig. 1. Geometry of a C-arm device, which makes 2D projection images of a 3D object
Fig. 2. Coronary CTA slice (a), coronary 3D CTA with segmented coronary arteries (b), 3D coronary artery model (c), and DRR obtained from the model (d)
242 A.R. Gouveia et al.
Prior to the evaluation of the regression models, some experiments were performed to
optimize k-NN, MLP, RBF and SVR. To this end, the image set of one of the patients
(referred to as patient 0) was used, whereas the sets of the nine remaining patients were
used for the evaluation in Section 3.5. For all patients, the set of 10000 images for the
construction of the regression model was split into two sets of 70% and 30%. The set
of 7000 images was used to train the regression models; the remaining 3000 images
were used for validation purposes, to select the tuning parameters.
For the k-NN model, the search range for the optimal value of k was limited to
[1, 50], based on experiments on patient 0 with a coarse grid-search in a larger range.
For every patient and for each transformation parameter, the optimal value of
k ∈ [1, 50] was chosen as the point where the validation error started to grow with
increasing k.
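The selection rule (stop at the k where the validation error first starts to grow) can be sketched in pure Python; the 1-D data layout and all function names here are illustrative assumptions, not the paper's implementation.

```python
def knn_predict(train, k, x):
    """Pure-Python k-NN regression: mean target of the k nearest training points."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(t for _, t in neighbours) / k

def validation_error(train, val, k):
    """Mean squared error of the k-NN regressor on the validation pairs."""
    return sum((knn_predict(train, k, x) - y) ** 2 for x, y in val) / len(val)

def select_k(train, val, k_max=50):
    """Return the k at which the validation error first starts to grow."""
    best_k, prev_err = 1, validation_error(train, val, 1)
    for k in range(2, k_max + 1):
        err = validation_error(train, val, k)
        if err > prev_err:        # error started to grow: keep the previous k
            break
        best_k, prev_err = k, err
    return best_k
```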
For the MLPs we tried several numbers of units for the hidden layer, {9, 18, 36,
54}, using the image set of patient 0. The number of hidden units was set to 36, which
leads to a topology of P=18 input units, 36 hidden units and 1 output unit. The
number of epochs was defined separately for each MLP by a stopping epoch (i.e., the
epoch when the validation error started to grow) with a maximum of 1000.
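The stopping-epoch criterion used for the MLPs can be written as a generic early-stopping loop; `step` and `validate` are hypothetical callbacks (one training epoch and the current validation error), not functions from the paper.

```python
def train_with_early_stopping(step, validate, max_epochs=1000):
    """Run training epochs until the validation error starts to grow, or until
    max_epochs is reached; return the stopping epoch.

    `step(epoch)` performs one training epoch and `validate()` returns the
    current validation error (both are hypothetical callbacks)."""
    prev_err = float("inf")
    for epoch in range(1, max_epochs + 1):
        step(epoch)
        err = validate()
        if err > prev_err:          # validation error started to grow
            return epoch - 1        # the stopping epoch
        prev_err = err
    return max_epochs
```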
For RBF networks, a two-level grid-search for the spread of radial basis functions
was performed for patient 0. First, a rough spread estimate was computed as
s = d_max/√(2M), where d_max represents the maximum distance between the inputs
and M the number of centres [10]. The
Comparative Evaluation of Regression Methods for 3D-2D Image Registration 243
grid search was performed in the range [s/2, 5s], and the initial estimate was
found to be optimal. For the other patients, the spread was set to s, with d_max
computed on their respective input sets. The optimum number of RBF centres was
determined for each patient (and each transformation parameter) as the point where
the validation error started to grow, with a maximum of 1000 neurons.
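The spread estimate and its search range can be sketched as below. Note the assumptions: the text's garbled formula is read here as the common heuristic s = d_max/√(2M) attributed to [10], with d_max the maximum pairwise distance between the inputs and M the number of centres; the function names and the evenly spaced grid are also illustrative.

```python
import math
from itertools import combinations

def spread_estimate(inputs, m):
    """Rough RBF spread estimate s = d_max / sqrt(2*m), where d_max is the
    maximum pairwise distance between the inputs (one common reading of the
    heuristic cited as [10]; m is assumed to be the number of centres)."""
    d_max = max(math.dist(a, b) for a, b in combinations(inputs, 2))
    return d_max / math.sqrt(2 * m)

def spread_grid(s, n_points=10):
    """Evenly spaced grid over the search range [s/2, 5s] used in the paper."""
    lo, hi = s / 2, 5 * s
    step = (hi - lo) / (n_points - 1)
    return [lo + i * step for i in range(n_points)]
```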
In the SVR case, three parameters need to be set: ε, C and γ [20]. All three are
tuned simultaneously. We use a coarse-to-fine grid-search as recommended by [21,
22], considering exponentially growing sequences of the parameter values. We
performed a wide range three-level grid-search for patient 0 and the values obtained
were used for the other patients. In the first level, we used the ranges
{2^-13, 2^-11, …, 2^5} for ε, {2^-1, 2^1, …, 2^15} for C, and {2^-9, 2^-7, …, 2^9} for γ.
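The exponentially growing first-level grids can be generated as follows; the `refine` helper is only a plausible sketch of how a subsequent grid level could be centred on the best exponent, since the paper does not spell out its finer levels.

```python
def exp_grid(low, high, step=2):
    """Exponentially growing sequence 2^low, 2^(low+step), ..., 2^high."""
    return [2.0 ** e for e in range(low, high + 1, step)]

def refine(best_exp, step):
    """Hypothetical second level: a finer grid (half the exponent step)
    centred on the best exponent from the previous level."""
    half = step / 2
    return [2.0 ** (best_exp + d) for d in (-half, 0.0, half)]

# First-level ranges used for patient 0 (from the text):
eps_grid   = exp_grid(-13, 5)   # epsilon in {2^-13, 2^-11, ..., 2^5}
c_grid     = exp_grid(-1, 15)   # C in {2^-1, 2^1, ..., 2^15}
gamma_grid = exp_grid(-9, 9)    # gamma in {2^-9, 2^-7, ..., 2^9}
```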
All regression experiments were performed using MATLAB 7.11.0.584. For SVR
we used version 3.1 of the LIBSVM tool [22].
3.5 Results
The mTRE values of the different regression methods and of the conventional
registration method, considering all patients except the one used for parameter
optimization, are shown in Figure 3. We also present, for each of the above patients,
the Regression Error Characteristic (REC) curves [23] for all methods (Fig. 4). These
REC curves estimate the cumulative distribution function of the error and are a
customization of receiver operating characteristic curves to regression. They plot the
error tolerance on the x-axis (expressed as mTRE in our case) and the accuracy of a
regression function on the y-axis (i.e. the percentage of points that lie within the
tolerance).
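An REC curve as described above reduces to a small computation: for each tolerance, the fraction of samples whose error is within it. A minimal sketch (the function name is illustrative):

```python
def rec_curve(errors, tolerances):
    """Regression Error Characteristic curve: for each error tolerance
    (mTRE in this paper), the fraction of samples whose error is within
    the tolerance -- an estimate of the error's cumulative distribution."""
    n = len(errors)
    return [(t, sum(e <= t for e in errors) / n) for t in tolerances]
```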
Fig. 3. Comparison of registration results for all regression methods and for a conventional registration method (Conv), considering all patients except patient 0 (which was used for parameter optimization). The graphic also shows the initial mTRE before registration.
Fig. 4. REC curves for all methods and for each patient, except patient 0 (which was used for parameter optimization)
References
1. Markelj, P., et al.: A review of 3D/2D registration methods for image-guided interventions.
Medical Image Analysis (2010)
2. van de Kraats, E.B., et al.: Standardized evaluation methodology for 2D – 3D registration.
IEEE Transactions on Medical Imaging 24, 1177–1189 (2005)
3. Gouveia, A.R., et al.: 3D-2D Image registration by nonlinear regression. In: IEEE-ISBI (in
press, 2012)
4. Freire, L.C., Gouveia, A.R., Godinho, F.M.: A Neural Network-Based Method for Affine
3D Registration of FMRI Time Series Using Fourier Space Subsets. In: Diamantaras, K.,
Duch, W., Iliadis, L.S. (eds.) ICANN 2010, Part I. LNCS, vol. 6352, pp. 22–31. Springer,
Heidelberg (2010)
5. Zhang, J., et al.: Rapid surface registration of 3D volumes using a neural network
approach. Image and Vision Computing 26, 201–210 (2008)
6. Wachowiak, M.P., et al.: A supervised learning approach to landmark-based elastic
biomedical image registration and interpolation. In: IEEE – IJCNN 2002, pp. 1625–1630
(2002)
7. Qi, W., et al.: Effective 2D-3D medical image registration using Support Vector Machine.
In: IEEE - EMBS 2008, pp. 5386–5389 (2008)
8. Banks, S., et al.: Accurate measurement of three-dimensional knee replacement kinematics
using single-plane fluoroscopy. IEEE TBE 43, 638–649 (1996)
9. Hastie, T., et al.: The elements of statistical learning: data mining, inference and
prediction. Springer (2009)
10. Haykin, S.: Neural networks: a comprehensive foundation. Prentice Hall, Delhi (1999)
11. Hagan, M.: Training feedforward networks with the Marquardt algorithm. IEEE
Transactions on Neural Networks 5, 2–6 (1994)
12. Sarle, W.: Neural network frequently asked questions (2005),
ftp://ftp.sas.com/pub/neural/FAQ.html
13. Chen, S., et al.: Orthogonal least squares learning algorithm for radial basis function
networks. IEEE Transactions on Neural Networks 2, 302–309 (1991)
14. Basak, D., et al.: Support Vector Regression. Neural Information Processing – Letters and
Reviews 11, 207–224 (2007)
15. Vapnik, V.N.: The nature of statistical learning theory (1995)
16. Smola, A.J., Schölkopf, B.: A tutorial on support vector regression. Statistics and
Computing 14, 199–222 (2004)
17. Metz, C.T., et al.: Alignment of 4D Coronary CTA with Monoplane X-Ray Angiography.
In: MICCAI 2011 (2011), http://www.bigr.nl/publication/694
18. Metz, et al.: GPU accelerated alignment of 3-D CTA with 2-D X-ray data for improved
guidance in coronary interventions. In: IEEE-ISBI (2009)
19. Fitzpatrick, J.M., West, J.B.: The distribution of target registration error in rigid-body
point-based registration. IEEE Transactions on Medical Imaging 20, 917–927 (2001)
20. Cherkassky, V., Ma, Y.: Practical selection of SVM parameters and noise estimation for
SVM regression. Neural Networks: the Official Journal of the INNS 17, 113–126 (2004)
21. Hsu, C., et al.: A Practical Guide to Support Vector Classification. Bioinformatics 1, 1–16
(2010)
22. Chang, C., et al.: LIBSVM: A Library for Support Vector Machines. Science, 1–39 (2011)
23. Bi, J., et al.: Regression error characteristic curves. In: ICML (2003)
A MDRNN-SVM Hybrid Model for Cursive Offline
Handwriting Recognition
1 Introduction
Despite more than 30 years of handwriting recognition research [15], [12], [21], [3],
developing a reliable, general-purpose system for unconstrained text line recognition
remains an open problem. This is a complex task due to the variation of existing
handwriting styles, noise in the acquisition process, and similarity among some classes [23]. In
this domain each writer has a different calligraphy, and it may change depending on the
writing material (type of pen, pencil or paper). The writer's emotional state, the
available paper space, and the time to write may also influence the handwritten
text. In the classical literature the classifiers used to perform handwriting recognition
are statistical, connectionist, and probability-based classifiers [3].
Recently, Recurrent Neural Networks (RNNs) have shown promising results in this
field [1]. The architecture proposed by Graves et al. [7] obtained the best results in
the ICDAR 2009 Handwriting Recognition Competition [1]. This model is composed
of a hierarchy of Multi-Dimensional Recurrent Neural Network (MDRNN) [9] layers
that use the Long Short-Term Memory (LSTM) method [8] and a Connectionist Tempo-
ral Classification (CTC) output layer for character and word recognition. The proposed
model is an offline recognition system that works on raw image pixels. As well as
being alphabet independent, such a system has the advantage of being globally
trainable, with the image features optimized along with the classifier.
Although the model does not need an explicit feature extraction procedure from the
digitized handwriting image, which simplifies the recognition process, this property may
cause misclassification of similar letters. This confusion can happen because the method
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 246–254, 2012.
c Springer-Verlag Berlin Heidelberg 2012
A MDRNN-SVM Model for Cursive Handwriting Recognition 247
does not have specific information about the discriminatory regions of specific letters (e.g. U
and V, Q and O, I and l, among others), which specialized feature extraction techniques
exploit to avoid misclassifications. In this paper, we evaluate the performance of the
original MDRNN on the handwritten character recognition problem and propose the use
of specialized Support Vector Machines (SVMs) [20] to improve the performance of the
MDRNN in cases of easily confused letters. The performance of the method is verified on the
C-Cube database and compared with different classifiers using the same database.
extraction techniques, using different approaches, were evaluated. Based on the results,
a combination of feature sets was proposed in order to achieve high recognition perfor-
mance. This combination was motivated by the observation that the sets of character-
istics are independent and complementary. The ensemble was built by combining
the outputs generated by the classifier on each set of features separately. The database
used for the experiments was C-Cube, and the classifier was a three-layer MLP network
trained with Resilient Backpropagation.
Bellili et al. [2] proposed a hybrid MLP-SVM method for unconstrained handwritten
digit recognition. This hybrid architecture is based on the idea that the correct digit
class almost systematically belongs to the two maximum MLP outputs and that some
pairs of digit classes constitute the majority of MLP misclassifications. Specialized
local SVMs are introduced to detect the correct class among these two classification
hypotheses.
Camastra [4] presented a cursive character recognizer that performs the character
classification using SVMs and a neural gas. The neural gas is used to verify whether the lower-
and upper-case versions of a certain letter can be joined into a single class or not. Once
this is done for every letter, SVMs perform the character recognition. SVMs compare
notably better, in terms of recognition rate, with popular neural classifiers such as
Learning Vector Quantization (LVQ) and MLP. The SVM recognition rate is among the
highest presented in the literature for cursive character recognition.
Neves et al. [11] presented a hierarchical combination of SVMs for handwritten
digit recognition, with promising classification results but a high classification time
due to the need for one SVM model per class pair.
Zanchettin et al. [23] presented a hybrid KNN-SVM method for cursive character
recognition. Specialized SVMs are introduced to significantly improve the performance
of KNN in handwriting recognition. This hybrid approach is based on the observation
that, when using KNN for handwritten character recognition, the correct
class is almost always one of the two nearest neighbors; the SVM is then used
to reduce misclassifications between these two candidate classes.
Similarities between some letters (e.g. ‘B and D’, ‘H and N’, ‘O and Q’) were
observed in [13] and [19]. In [23], similarities between the letters ‘O and D’,
‘B and R’, ‘D and B’, and ‘N and M’ were observed. In [5], a high error rate was detected
for characters with different ways of writing (e.g., ‘a and A’, ‘f and F’). In this paper we
perform experiments with the different ways of writing: upper, lower, and joined case.
The hierarchical structure proposed by Graves et al. [7] is composed of the MDRNN
using LSTM. The output layer uses connectionist temporal classification (CTC)
[10]. The concept of MDRNNs [9] is to replace the single recurrent connection found
in standard recurrent networks with as many connections as there are spatio-temporal
dimensions in the data. These connections allow the network to create a flexible internal
representation of the surrounding context, which is robust to localized distortions.
This architecture allows cyclic connections among the network nodes. These connec-
tions enable the network to retain the entire history of previous inputs; without them,
the network would only map each input to an output class. The key point is that the
recurrent connections act as a ‘memory’ of previous inputs that persists in the
network’s internal state, which can be used to influence the network output [8].
Figure 2 presents an RNN with one hidden layer. Unfortunately, for standard
RNN architectures, the range of accessible context is limited. The problem is that the in-
fluence of a given input on the hidden layer, and therefore on the network output, either
decays or blows up exponentially as it cycles around the network’s recurrent connec-
tions. This behavior results in what is generally called the Vanishing Gradient Problem,
illustrated in Figure 4(a). In this figure the luminance of the nodes indicates the sensitiv-
ity over time, i.e., how much previous inputs can still influence the processing of
the next input. The sensitivity decays exponentially over time as new inputs overwrite
the activations of the hidden units and the network ’forgets’ the first input.
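The exponential decay or blow-up described above can be illustrated with a toy computation: for a single scalar recurrent weight, the influence of an input after t steps scales like |w|^t. This is a deliberately simplified sketch, not the paper's analysis.

```python
def recurrent_sensitivity(w, steps):
    """Magnitude of an input's influence after `steps` cycles through a single
    recurrent weight w: it scales like |w|**steps, so it decays when |w| < 1
    and blows up when |w| > 1 -- the Vanishing Gradient Problem in miniature."""
    return abs(w) ** steps

decaying  = [recurrent_sensitivity(0.5, t) for t in range(5)]  # shrinks toward 0
exploding = [recurrent_sensitivity(2.0, t) for t in range(5)]  # grows without bound
```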
Fig. 1. Characters with structural similarity
Fig. 2. RNN with one hidden layer
Fig. 3. Illustration of the LSTM model [8]
To solve this problem, Graves et al. [7] proposed the use of the LSTM method. The
LSTM consists of a recurrent subnetwork known as a memory block. Each block
contains one or more self-connected memory cells and three multiplicative units: the
input, output and forget gates. The multiplicative gates allow LSTM memory cells to
store and access information over long periods of time, thereby avoiding the Vanishing
Gradient Problem. An LSTM unit is illustrated in Figure 3. Figure 4(b) presents
how the LSTM works over time with subsequent network inputs. A detailed description is
presented in [7].
To work with multiple dimensions, the authors suggested using a multi-dimensional
network, expanding the concept presented above. The basic idea of MDRNNs is to
replace the single recurrent connection found in standard RNNs with as many recurrent
connections as there are dimensions in the data. During the forward pass, at each point
in the data sequence, the hidden layer of the network receives both an external input and
its own activations from one step back along all dimensions. Therefore, in the character
recognition problem we use two recurrent connections, one per image dimension.
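The two-dimensional forward pass can be sketched as a toy linear scan over an image, where each hidden activation combines the external input with the hidden states one step back along both dimensions. The weights and the linear (activation-free) update are illustrative simplifications, not the MDRNN-LSTM equations.

```python
def mdrnn_forward(image, w_in=0.5, w_up=0.25, w_left=0.25):
    """Toy linear 2-D recurrent scan: each hidden activation combines the
    external input at (i, j) with the hidden activations one step back along
    both dimensions (above and to the left). Weights are illustrative only."""
    rows, cols = len(image), len(image[0])
    h = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            above = h[i - 1][j] if i > 0 else 0.0
            left = h[i][j - 1] if j > 0 else 0.0
            h[i][j] = w_in * image[i][j] + w_up * above + w_left * left
    return h
```

Note how a single bright pixel in the corner influences every later position in the scan, which is the "flexible internal representation of surrounding context" the text describes.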
The MDRNN and LSTM activation output is presented to the CTC output layer,
which is designed for sequence labeling with RNNs. Unlike other neural network output layers, it
does not require pre-segmented training data or postprocessing to transform its outputs
into transcriptions. Instead, it trains the network to directly estimate the conditional
probabilities of the possible classes given the input sequence.
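How a CTC layer turns per-timestep labels into a transcription without segmentation can be illustrated by the standard collapse step of best-path decoding (merge repeats, drop blanks). This is the generic CTC mechanism, not code from the paper.

```python
def ctc_collapse(path, blank="-"):
    """Best-path CTC decoding step: merge repeated labels, then drop blanks.
    This is how a per-timestep label sequence becomes a transcription without
    pre-segmented training data."""
    out = []
    prev = None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)
```

For example, a network emitting one label per image column can still produce a clean word even though letter boundaries were never marked.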
250 B.L.D. Bezerra, C. Zanchettin, and V.B. de Andrade
Despite the robustness and complexity of the MDRNN-LSTM model in different
problems [8], we have observed some misclassifications when the correct character class or
word has significant similarities to an incorrect class. In some cases it is still possible to
distinguish the classes without using context information. The reason this happens is, at
the same time, one of the most interesting advantages of the MDRNN-LSTM model: it
does not have to run a feature extraction step. In contrast, this step is generally needed
in recognition methods grounded in Bayes Decision Theory (BDT) [6]. According
to the Bayes decision rule, the class of a given observed feature vector must be chosen
based on the likelihood of the feature vector given the class, times the prior
probability of the class. Therefore, suppose we are able to define discriminant functions
which take as input well-defined features extracted from the image of some character
and compute as output the probability that the sample belongs to the respective class.
Then, the class associated with the discriminant function which produced
the highest output is the chosen class. Our idea is to combine both strategies in order to
improve the recognition results.
Fig. 4. (a) The Vanishing Gradient Problem; (b) the LSTM over time (‘O’ denotes an open gate and ‘-’ a closed gate)
If we adopt the previous approach for each class where the MDRNN-LSTM model
has more misclassifications, we maximize the chance of deciding the correct class
with this composed model, since it is specialized on the representative features ex-
tracted from the data. Additionally, taking into account that the majority of the confusions
observed in the MDRNN-LSTM model occur between pairs of classes (e.g., ‘U’ and ‘V’, ‘I’ and
‘l’, among others), we just need a dichotomizer [6] for each pair of classes misclassified
by the MDRNN-LSTM method. In order to determine the pair of classes for which we need to run
the respective dichotomizer, we select the two classes which receive the highest values
(the most probable) in the MDRNN output layer.
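The dispatch logic can be sketched as follows. The confidence threshold, the function names, and the dictionary of dichotomizers keyed by class pair are all hypothetical details (the paper describes "low confidence" without specifying a threshold):

```python
def classify_with_dichotomizers(mdrnn_outputs, dichotomizers, features,
                                confidence=0.9):
    """Hypothetical dispatch: take the two most probable classes from the
    MDRNN output layer; if the top class is confident enough, or no SVM
    dichotomizer exists for the pair, keep it; otherwise let the pair's SVM
    decide from the extracted features.

    `dichotomizers` maps a sorted class pair to a callable returning the
    winning class."""
    ranked = sorted(mdrnn_outputs, key=mdrnn_outputs.get, reverse=True)
    first, second = ranked[0], ranked[1]
    if mdrnn_outputs[first] >= confidence:
        return first                       # MDRNN is confident: no SVM needed
    svm = dichotomizers.get(tuple(sorted((first, second))))
    return svm(features) if svm else first
```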
Fig. 5. The SVM optimal margin separation [14]
Fig. 6. Features in a bi-dimensional space [14]
Fig. 7. Kernel function
The SVM [20] is a binary classification technique, i.e., a dichotomizer. We pro-
pose the use of the SVM as a class confirmation step for the MDRNN-LSTM method. The
SVM training consists of finding the support vectors for each class and creating a func-
tion that represents an optimal margin separation between the support vectors of dif-
ferent classes. Consequently, it is possible to obtain an optimal hyperplane for class
separation, as shown in Figure 5. The SVM thus finds the optimal linear separation
function between two classes and can still deal with non-linearly separable data (see Figure 6).
The SVM uses a kernel function to increase the feature dimensionality and consequently
make the data linearly separable, as shown in Figure 7.
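The dimensionality-raising idea can be shown with an explicit feature map on the classic XOR data, which is not linearly separable in the plane but becomes separable after lifting. The specific map `phi` is an illustrative choice, not the RBF kernel used in the experiments (kernels apply such maps implicitly via K(x, y) = φ(x)·φ(y)).

```python
def phi(x1, x2):
    """Explicit feature map lifting 2-D points into 3-D. XOR-labelled data,
    not linearly separable in the plane, becomes separable by the plane z = 0
    in the lifted space."""
    return (x1, x2, (2 * x1 - 1) * (2 * x2 - 1))

# XOR data: class +1 on the diagonal, class -1 off it.
data = [((0, 0), +1), ((1, 1), +1), ((0, 1), -1), ((1, 0), -1)]
# In the lifted space the sign of the third coordinate matches the label.
lifted = [(phi(*x), y) for x, y in data]
```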
The proposed system architecture is shown in Figure 8. In the proposed system, the
feature extraction step is performed only when the MDRNN-LSTM returns as output a class
with low confidence and this class is commonly confused with some other one. In this
case, we take the second most probable class returned by the MDRNN-LSTM and
use the appropriate SVM to resolve the confusion between the two options. In the experiments
we used the 34 features proposed by Camastra [4].
Different SVMs were derived for the pairs of classes (e.g. (U, V), (m, n), (N, n),
etc.) constituting the majority of the confusions observed with the MDRNN-LSTM
classifier. Different kernel functions (linear, polynomial and RBF) were tested, and the
best performances were obtained by SVMs trained with the RBF kernel function. The
pairs of classes with confusion were chosen based on the number of errors, taking as a
minimum 10% of the validation set size.
4 Experiments
The C-Cube is a public database available for download on the Cursive Character Chal-
lenge website (http://ccc.idiap.ch). The database consists of 57,293 files, including up-
percase and lowercase letters, manually extracted from the CEDAR and United States
Postal Service (USPS) databases. All images are binary and of variable size. The data
are unbalanced, and there is a large difference in the number of patterns among the letters.
Several feature extraction techniques for character recognition have been proposed in
the literature, and feature extraction is an important factor in achieving high accuracy rates [18]. Cruz et
al. [5] performed experiments with different feature extraction techniques on this
database. Camastra [4] used a clustering analysis to verify whether the upper- and lower-
case versions of the same letters are similar in shape. The letters (c, x, o, w, y, z, m, k,
j, u, n, f, v) presented the highest similarity between the two versions and were joined
into a single class in the experiments without much loss of generality.
The classification results for the split (upper and lower case) and joined cases are
shown in Table 1. The edge maps algorithm presented the overall best result. Most
feature sets presented better accuracy for the upper-case letters, with the exception of
the method proposed by Camastra, which performed better for lower case. This feature set
also presented the best accuracy (84.37%) for the lower case. The last two lines of Table
1 also present the results of the MDRNN and MDRNN+SVM. The MDRNN presented
promising results, especially in comparison with methods where a feature extraction
step is needed. Additionally, our proposed hybrid model statistically outperforms the
MDRNN model. In fact, our proposed model achieves the best rates on the upper and
joint databases and is statistically equivalent to the Camastra method on the
lower-case database. The hypothesis test used was the t-test at the 1% significance level.
In Tables 1 and 2 the boldface numbers are the best results found.
The best results obtained in recent years on the C-Cube database are displayed in
Table 2. In Thornton et al. [16] the HVQ with temporal pooling algorithm is a partial im-
plementation of hierarchical temporal memory. This biologically inspired model places
emphasis on the temporal aspect of pattern recognition and consequently parses all im-
ages as ‘movies’. In [17] the modified direction feature extraction technique combines
the use of direction features (DFs) and transition features (TFs) to produce recognition
rates that are generally better than either DFs or TFs used individually.
Camastra [4] presented a cursive character recognizer. The character classification
is achieved by using SVMs and a neural gas. The neural gas is used to verify whether the
lower- and upper-case versions of a certain letter can be joined into a single class or not.
Once this is done for every letter, the character recognition is performed by SVMs.
A method for increasing the recognition rates of handwritten characters by com-
bining MLP and SVM was presented in [22]. The experiments demonstrated that
combining MLP networks with SVM experts for the pairs of classes that constitute the
greatest confusions of the MLP improved performance in terms of recognition rate.
Table 1. Recognition rate by feature set for the upper and lower case separated [13]
Table 2. Recognition rates for the C-Cube database, joint case
In Zanchettin et al. [23] a combination of KNN and SVM is presented. The main idea
is to use the SVM as a decision-maker classifier to increase the kNN recognition rate.
The adaptation in this case is to take the two most frequent classes among the k nearest
neighbors and to use the SVM to decide between these two classes. It is a suitable
technique where a misclassification results in high costs. However, as this
technique depends on the kNN method, its main disadvantage is the processing time.
According to the results presented in Table 2, we conclude that the MDRNN model and our
proposed hybrid model are among the top methods for cursive character recognition. One
advantage of these methods over the others is that the MDRNN itself designs
and learns everything that is needed from the pixels of the image to distinguish the main
differences between the classes. Therefore, the training step is much easier than for other methods
and the classification step is faster, even in the proposed MDRNN-SVM model,
since the feature extraction and the classification step with the SVM occur only in cases
where the MDRNN output is in doubt.
5 Final Remarks
This paper evaluated the performance of the original MDRNN recurrent neural net-
work for handwritten character recognition on a well-known benchmark, the C-Cube
database. Additionally, it proposed the use of specialized SVMs to improve the per-
formance of the MDRNN in a hierarchical way. The performance of the method was ver-
ified and compared with different classifiers using the C-Cube database. The method
presented promising results in the classification task, and the proposed combination
improves the method’s performance and robustness, especially on the disjoint upper-
and lower-case letter databases.
As future work we suggest evaluating the performance of the MDRNN, and of the
proposed method against others, on a benchmark of isolated word images, varying
the number of training samples, the size of the classes in the dataset, the amount of noise in the
words, the resolution of the images, and other variables.
References
1. El Abed, H., Margner, V., Kherallah, M., Alimi, A.M.: ICDAR 2009 Handwriting Recogni-
tion Competition. In: Int. Conf. Document Analysis and Recognition, pp. 1388–1392 (2009)
2. Bellili, A., Gilloux, M., Gallinari, P.: An Hybrid MLP-SVM Handwritten Digit Recognizer.
In: Int. Conf. on Document Analysis and Recognition, pp. 28–32 (2001)
3. Bunke, H.: Recognition of cursive roman handwriting - past present and future. In: Proc. 7th
Int. Conf. on Document Analysis and Recognition, vol. 1, pp. 448–459 (2003)
4. Camastra, F.: A SVM-Based Cursive Character Recognizer. Pattern Recognition 40(12),
3721–3727 (2007)
5. Cruz, R.M.O., Cavalcanti, G.D.C., Tsang, I.R.: An Ensemble Classifier for Offline Cursive
Character Recognition using Multiple Feature Extraction Techniques. In: IEEE Int. Joint
Conf. on Neural Networks, pp. 744–751 (2010)
6. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. John Wiley and Sons (2001)
7. Graves, A., Fernández, S., Schmidhuber, J.: Multidimensional Recurrent Neural Networks.
In: Proc. of Int. Con. on Artificial Neural Networks, pp. 549–558 (2007)
8. Graves, A.: Supervised Sequence Labelling with Recurrent Neural Networks. Dissertation,
Technische Universität München, München (2008)
9. Graves, A., Schmidhuber, J.: Offline Handwriting Recognition with Multidimensional Re-
current Neural Networks. In: Adv. in Neural Information Proc. Syst., pp. 545–552 (2009)
10. Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory. Neural Computation 9(8), 1735–
1780 (1997)
11. Neves, R.F.P., Lopes, A.N.G., Mello, C.A.B., Zanchettin, C.: A SVM Based Off-line Hand-
written Digit Recognizer. In: IEEE Int. Conf. on Sys., Man, and Cyb., pp. 510–515 (2011)
12. Plamondon, R., Srihari, S.N.: On-line and Off-line Handwriting Recognition: A Comprehen-
sive Survey. IEEE Trans. Pattern Anal. Mach. Intell. 22(1), 63–84 (2000)
13. Rodrigues, R.J., Kupac, G.V., Thomé, A.C.G.: Character Feature Extraction using Polygo-
nal Projection Sweep (Contour Detection). In: Proc. Int. Work. Conf. on Artificial Neural
Networks, pp. 687–695 (2001)
14. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach. Prentice Hall Pearson
Education Inc. (2003)
15. Tappert, C., Suen, C., Wakahara, T.: The State of the Art in Online Handwriting Recognition.
IEEE Trans. on Patt. Analysis and Machine Intelligence 12(8), 787–808 (1990)
16. Thornton, J., Faichney, J., Blumenstein, M., Hine, T.: Character Recognition using Hierarchi-
cal Vector Quantization and Temporal Pooling. In: Proc. Australasian Joint Con. on Artificial
Intelligence, pp. 562–572 (2008)
17. Thornton, T., Blumenstein, M., Nguyen, V., Hine, T.: Offline Cursive Character Recognition:
A State-of-the-art Comparison. In: Conf. Int. Graphonomics Society (2009)
18. Trier, O.D., Jains, A.K., Taxt, T.: Feature Extraction Methods for Character Recognition - A
Survey. Pattern Recognition 29(4), 641–662 (1996)
19. Vamvakas, G., Gatos, B., Perantonis, S.J.: Handwritten Character Recognition Through Two-
stage Foreground Sub-sampling. Pattern Recognition (43), 2807–2816 (2010)
20. Vapnik, V.N.: Statistical Learning Theory. John Wiley and Sons, New York (1998)
21. Vinciarelli, A.: A Survey on Off-line Cursive Script Recognition. Pattern Recognition 35(7),
1433–1446 (2002)
22. Washington, W.A., Zanchettin, C.: A MLP-SVM Hybrid Model for Cursive Handwriting
Recognition. In: Proc. of Int. Joint Conf. on Neural Networks, pp. 843–850 (2011)
23. Zanchettin, C., Bezerra, B.L.D., Azevedo, W.W.: A KNN-SVM Hybrid Model for Cursive
Handwriting Recognition. In: IEEE Int. Joint Conf. on Neural Networks, Brisbane (2012)
Extraction of Prototype-Based Threshold Rules
Using Neural Training Procedure
Abstract. Complex neural and machine learning algorithms usually lack com-
prehensibility. Combination of sequential covering with prototypes based on
threshold neurons leads to a prototype-threshold based rule system. This kind
of knowledge representation can be quite efficient, providing solutions to many
classification problems with a single rule.
1 Introduction
Neural networks and other complex machine learning models usually lack compre-
hensibility. This property is very important in many applications, in-
cluding safety-critical ones. Also in technological applications, where the systems are
used to support industrial processes (see for example [1]), simple and understand-
able models are crucial to avoid dangerous situations and to raise process engineers’
confidence in the solutions. The second aspect of building a comprehensible model is
directly related to knowledge extraction. In medical applications or social science, the
data-driven models may be the only source of knowledge about certain processes. Thus
there are evident advantages to data-driven models which use a human-friendly knowl-
edge representation.
Four general approaches, which find comprehensible mappings f(x_i) → y_i, are usu-
ally considered for that purpose [2]:
– propositional logic using crisp logic rules (C-rules)
– fuzzy logic and fuzzy rule based systems (F-rules)
– prototype-based rules and logic (P-rules)
– first and higher-order predicate logic
C-rules are the most common form of user-friendly knowledge representation. They
avoid any ambiguity, which assures that there is only one possible interpretation of
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 255–262, 2012.
c Springer-Verlag Berlin Heidelberg 2012
256 M. Blachnik, M. Kordos, and W. Duch
the rules, although at the expense of several limitations. Continuous attributes usually
cannot be discretized in a natural way, crisp decision borders divide the input space into
hyperboxes that cannot easily cover some data distributions, and some simple forms of
knowledge (like "the majority vote") may be expressed only using a very large number
of logical conditions in propositional form. The problems with discretization have been
addressed by fuzzy logic, with fuzzy rule-based systems using membership functions
to represent the degree (in [0, 1] range) of fulfilling certain conditions [3]. This leads
to more flexible decision borders and allows for handling uncertainty in real-world data. However, F-rules cannot represent many forms of knowledge that predicate logic can express in a natural way.
Prototype-based rules try to capture the intuitive estimation of similarity in decision-making processes [4], as is done in case-based reasoning. P-rules, which are represented by reference vectors and similarity measures, are more suitable for learning from continuous data than predicate logic. The relation of P-rules to F-rules has been studied in [5,6], where it was shown that fuzzy rules can be converted to prototype rules with additive distance functions. P-rules handle symbolic attributes by applying heterogeneous distance measures such as the VDM distance [7]. Moreover, they can easily express some forms of knowledge, such as the selection of m of n true conditions, which limit the use of crisp and fuzzy rules.
P-rules can be expressed in two forms: as the nearest neighbor rules or as the
prototype-threshold based rules. This article addresses the problem of extracting the
prototype-threshold based rules from the data. A simple algorithm called nOPTDL, based on a combination of the sequential covering approach to rule extraction with neural-like training and representation of single rules, is presented below. It allows the use of gradient-based optimization methods to optimize the parameters of the neurons that represent single rules.
In the next section the details of prototype-threshold rules are discussed, and in Section 3 the nOPTDL algorithm is presented. Section 4 presents a few examples of nOPTDL performance and discusses the results. The last section concludes the paper and outlines directions for further research.
In this paper the training data is denoted as a set of tuples T = {[x1 , y1 ] , [x2 , y2 ] , . . . , [xn , yn ]}, where xi ∈ Rm is an input vector and yi ∈ {−1, 1} is a class label (only two-class problems are considered). A prototype-threshold rule has the form: if D(xi , p) ≤ θ then assign class l, where D(xi , p) expresses the distance between vector xi and the prototype p, θ is a threshold, and l is the class label associated with the rule.
There are several approaches to construct this type of rules. Perhaps the simplest one
is based on classical decision trees. First, conditions that define each branch of the tree
are used to define a separate distance function for a single prototype associated with the
root of the tree. A more common approach leading to a lower number of rules starts
Extraction of Prototype-Based Threshold Rules Using Neural Training Procedure 257
from a distance matrix D (q, w) that is used to construct new features for the decision
tree training. Each new feature represents the distance from a selected training vector,
so the number of new features added to the original set is equal to the number of training
instances (m = n). In this approach each node can contain either a single prototype-threshold rule or a combination of crisp conditions with distance-based conditions [8].
Another approach called ordered prototype-threshold decision list OPTDL is derived
from the sequential covering principle. In this approach, described in [9], the algorithm
starts from creating a single rule and then adds new rules such that each new rule covers
examples not classified by previously constructed rules (sketch (1)). The shape of the
decision border of a single rule depends on the distance function. Euclidean distance
function creates hyperspherical borders. To avoid unclassified regions the new rules
should overlap with each other. When a test vector falls in such an overlapping region a
unique decision is made by ordering the rules from the most general to the most specific.
The training algorithm starts by creating the most general rule, and each new rule is marked as more specific than the previous one. The decision-making process starts by analyzing the most specific rule; if its conditions are not fulfilled, more general rules are analyzed in turn. If an instance is not covered by any rule, the else clause is used to determine the default class label.
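The rule-ordering logic described above can be sketched as follows. This is a minimal illustration under our own assumptions — the Euclidean distance, hyperspherical rule regions and the example rules are not taken from the paper:

```python
import numpy as np

def optdl_predict(x, rules, default_label):
    """Classify x with an ordered prototype-threshold decision list.

    `rules` is ordered from the most general (created first) to the most
    specific (created last); following the text, the most specific rule is
    checked first and more general rules serve as fallbacks.
    Each rule is a (prototype, threshold, label) triple.
    """
    for prototype, threshold, label in reversed(rules):
        if np.linalg.norm(x - prototype) <= threshold:  # rule fires
            return label
    return default_label  # the "else" clause

rules = [(np.array([0.0, 0.0]), 5.0, -1),   # general rule
         (np.array([1.0, 1.0]), 1.0, +1)]   # more specific rule
print(optdl_predict(np.array([1.2, 0.9]), rules, -1))  # covered by the specific rule
print(optdl_predict(np.array([4.0, 0.0]), rules, -1))  # falls back to the general rule
print(optdl_predict(np.array([9.0, 9.0]), rules, -1))  # handled by the else clause
```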
shape of the decision border. For two overlapping Gaussian distributions with identi-
cal σ the optimal decision border is a hyperplane, but such decision border cannot be
created using prototypes restricted to the examples of the training set. Without that re-
striction a prototype can be moved to infinity and with appropriately large threshold
a good approximation to a linear decision border can be obtained. The next section presents the optimization procedures used to determine the position and the appropriate threshold of a prototype.
The neuron transfer function has the form z(x|p, θ) = σ(g(x)), where σ is the logistic function, p is the position of the prototype, D (x, p) is the distance function, α represents the exponent of the distance function (for Euclidean distance α = 2) and θ denotes the threshold or bias. The α parameter is used to add flexibility to distance functions, regulating their shape as a function of differences between vectors.
The inner part of the transfer function, g (x) = D (x, p)^α − θ, defines the area covered by an active neuron: vectors x that fall into this area give positive values g (x) > 0, and those outside give negative values g (x) < 0. The logistic function is used for smooth nonlinear normalization of the g (x) values to fit them into the range [0, 1].
Moving x away from the border defined by z(·) = 0.5 shifts this value towards 1 inside and towards 0 outside the area covered by the neuron, with a speed of change that depends on the slope of the logistic function and on the scaling of the distance function.
The objective function used to optimize the properties of the neuron is defined as:

E(p, θ) = Σ_{i∈C} z (xi |p, θ) · l · yi     (4)
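Assuming z(x|p, θ) is the logistic function applied to g(x), the neuron output and objective (4) can be sketched as below. The `slope` parameter and the sign orientation (z → 1 inside the covered area, as the text describes the border behaviour) are our assumptions, not details from the paper:

```python
import numpy as np

def neuron_output(x, p, theta, alpha=2.0, slope=1.0):
    """z(x|p, theta): logistic-normalized activation of a prototype neuron.
    Oriented so that z -> 1 inside the covered area and z -> 0 outside;
    `slope` sets the softness of the border."""
    g = theta - np.linalg.norm(x - p) ** alpha
    return 1.0 / (1.0 + np.exp(-slope * g))

def objective(X, y, p, theta, l=1):
    """E(p, theta) = sum_i z(x_i|p, theta) * l * y_i  (eq. 4); for simplicity
    the sum runs over all examples rather than the paper's set C."""
    return sum(neuron_output(x, p, theta) * l * yi for x, yi in zip(X, y))

p = np.zeros(2)
print(neuron_output(np.array([0.1, 0.0]), p, theta=1.0))  # > 0.5 (inside)
print(neuron_output(np.array([3.0, 0.0]), p, theta=1.0))  # < 0.5 (outside)
```

Because z is smooth in p and θ, gradient-based optimization of E is straightforward, which is the point of the neural representation of single rules.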
4 Numerical Experiments
In the first experiment the influence of the number of extracted rules on the classification accuracy of nOPTDL was analyzed. 10-fold crossvalidation was used to estimate the accuracy, repeating the test for different numbers of rules in the range k = 1, . . . , 10. The results are presented in Fig. 1. They show that the classification accuracy using just a single P-rule is sometimes as good as with many rules (heart, breast cancer). In other cases adding new rules improves accuracy up to a
Fig. 1. Classification accuracy and variance as a function of the number of nOPTDL rules
certain point, but for all datasets no more than 5 rules were needed to reach the maximum accuracy. This shows that the prototype-threshold form of knowledge representation can be quite efficient.
To compare the proposed nOPTDL algorithm with other state-of-the-art rule extraction algorithms, another test was performed using double crossvalidation: the inner crossvalidation was used to optimize the parameters of the given classification algorithm (for example, the number of rules in our system) and the outer crossvalidation was used to estimate the final accuracy. The testing procedure is presented in Fig. 4. Our nOPTDL algorithm has been compared with the previous version based on search strategies (sOPTDL), and also with the C4.5 decision tree [11] and the Ripper rule induction system [12]. The experiments were conducted using RapidMiner [13] with the Weka extension and with the Spider toolbox [14]. The parameters of both the C4.5 and Ripper algorithms were also optimized using double crossvalidation, optimizing pureness and the minimal weights of instances. The results are presented in Table 2.
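A minimal sketch of this double (nested) crossvalidation scheme, using scikit-learn and a decision tree in place of the paper's RapidMiner/Spider setup; the dataset and the parameter grid (tree complexity standing in for the number of rules) are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Inner crossvalidation selects the model complexity; the outer
# crossvalidation estimates accuracy on data never used for selection.
inner = GridSearchCV(DecisionTreeClassifier(random_state=0),
                     param_grid={"max_leaf_nodes": [2, 4, 8, 16]}, cv=5)
outer_scores = cross_val_score(inner, X, y, cv=10)
print(round(outer_scores.mean(), 3))
```

The key property is that hyperparameters are never tuned on the folds used for the final accuracy estimate, which avoids optimistic bias.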
Table 2. Comparison of the accuracy of the nOPTDL algorithm with C4.5 decision tree and
Ripper rule induction
For Heart disease the average accuracy of nOPTDL (1 rule) is over 5% higher than that of the C4.5 classifier (21 rules) and 3% higher than that of the Ripper algorithm (4 rules). A very good accuracy was also achieved on the Appendicitis dataset. The average accuracy on the Sonar dataset (4 rules) was also very high; however, its standard deviation, comparable to that obtained with the C4.5 decision tree, was much higher than the standard deviation of Ripper. Diabetes required 3 P-rules. In the other cases a single rule was sufficient. The results show that knowledge representation using a small number of P-rules is very efficient.
Further extensions of this algorithm, including beam search instead of the best first
search, should improve its quality. Our future work also includes adding local fea-
ture weights to each neuron to automatically adjust feature significance. Enforcing
regularization should increase the sparsity of the obtained feature weights and lead
to improvement of comprehensibility by filtering useless attributes and thus simplify
the extracted knowledge. Adopting appropriate distance measures and switching to the
Chebyshev distance (Linf norm) may allow for classical crisp rule extraction using the
same OPTDL family of algorithms.
Acknowledgment. The work was funded by the grant No. ATH 2/IV/GW/2011 from
the University of Bielsko-Biala and by project No. 4421/B/T02/2010/38 (N516 442138)
from the Polish Ministry of Science and Higher Education.
The software package is available on the web page of The Instance Selection and
Prototype Based Rules Project at http://www.prules.org
References
1. Wieczorek, T.: Neural modeling of technological processes. Silesian University of Technol-
ogy (2008)
2. Duch, W., Setiono, R., Zurada, J.: Computational intelligence methods for understanding of
data. Proceedings of the IEEE 92, 771–805 (2004)
3. Nauck, D., Klawonn, F., Kruse, R., Klawonn, F.: Foundations of Neuro-Fuzzy Systems. John
Wiley & Sons, New York (1997)
4. Duch, W., Grudziński, K.: Prototype based rules - new way to understand the data. In: IEEE
International Joint Conference on Neural Networks, pp. 1858–1863. IEEE Press, Washington
D.C. (2001)
5. Duch, W., Blachnik, M.: Fuzzy Rule-Based Systems Derived from Similarity to Prototypes.
In: Pal, N.R., Kasabov, N., Mudi, R.K., Pal, S., Parui, S.K. (eds.) ICONIP 2004. LNCS,
vol. 3316, pp. 912–917. Springer, Heidelberg (2004)
6. Kuncheva, L.: On the equivalence between fuzzy and statistical classifiers. International Jour-
nal of Uncertainty, Fuzziness and Knowledge-Based Systems 15, 245–253 (1996)
7. Wilson, D.R., Martinez, T.R.: Value difference metrics for continuously valued attributes. In:
Proceedings of the International Conference on Artificial Intelligence, Expert Systems and
Neural Networks, pp. 11–14 (1996)
8. Grąbczewski, K., Duch, W.: Heterogeneous Forests of Decision Trees. In: Dorronsoro, J.R. (ed.) ICANN 2002. LNCS, vol. 2415, pp. 504–509. Springer, Heidelberg (2002)
9. Blachnik, M., Duch, W.: Prototype-Based Threshold Rules. In: King, I., Wang, J., Chan, L.-
W., Wang, D. (eds.) ICONIP 2006. LNCS, vol. 4234, pp. 1028–1037. Springer, Heidelberg
(2006)
10. Asuncion, A., Newman, D.: UCI machine learning repository (2007),
http://www.ics.uci.edu/~mlearn/MLRepository.html
11. Quinlan, J.: C 4.5: Programs for machine learning. Morgan Kaufmann, San Mateo (1993)
12. Cohen, W.W.: Fast effective rule induction. In: Twelfth International Conference on Machine
Learning, pp. 115–123 (1995)
13. Rapid-I: Rapidminer, http://www.rapid-i.com
14. Weston, J., Elisseeff, A., BakIr, G., Sinz, F.: The spider,
http://www.kyb.tuebingen.mpg.de/bs/people/spider/
Instance Selection with Neural Networks
for Regression Problems
Abstract. The paper presents algorithms for instance selection for regression problems, based on the CNN and ENN solutions known from classification tasks. A comparative experimental study is performed on several datasets using multilayer perceptrons and k-NN algorithms with different parameters and their various combinations as the methods the selection is based on. Various similarity thresholds are also tested. The obtained results are evaluated taking into account the size of the resulting dataset and the regression accuracy obtained with a multilayer perceptron as the predictive model, and a final recommendation regarding instance selection for regression tasks is presented.
1 Introduction
1.1 Motivation
There are two motivations for undertaking research on instance selection for regression problems. The first one is theoretical: most research on instance selection done so far refers to classification problems, and the few papers on instance selection for regression tasks do not cover the topic thoroughly, especially regarding practical application to real-world datasets. Our second motivation is very practical: we have implemented in industry several computational intelligence systems for technological process optimization [1], which deal with regression problems and huge datasets, and there is a practical need to optimally reduce the number of instances in the datasets before building the prediction and rule models.
There are the following reasons to reduce the number of instances in the training dataset:
1. Some instance selection algorithms, such as ENN, which is discussed later, reduce noise in the dataset by eliminating outliers, thus improving the model performance.
2. Other instance selection algorithms, such as CNN, which is also discussed later, discard from the dataset instances that are too similar to each other, which simplifies the model and reduces the size of the data.
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 263–270, 2012.
c Springer-Verlag Berlin Heidelberg 2012
264 M. Kordos and M. Blachnik
3. The above two selection models can be combined to obtain the benefits of both.
4. Training performed on a smaller dataset is faster. Although reducing the dataset size also takes some time, it frequently needs to be done only once, before trying various models with various parameters to find the best model for the problem.
5. When using lazy-learning algorithms, such as k-NN, reducing the dataset size also reduces the prediction time.
6. Instance selection can be combined with prototype selection in prototype-based systems, e.g. prototype rule-based systems.
A further problem is the error measure: in classification tasks it is very straightforward, while in regression tasks it can be defined in several ways, and in practical solutions simple error definitions such as the MSE (mean square error) do not always work best [1].
Because of these challenges, there have been very few approaches in the literature to instance selection for regression problems, and those approaches were usually not verified on real-world datasets. Zhang [9] presented a method to select the input vectors while calculating the output with k-NN. Tolvi [10] presented a genetic algorithm to perform feature and instance selection for linear regression models. In their works, Guillen et al. [11] discussed the concept of mutual information used for the selection of prototypes in regression problems.
2 Methodology
The CNN (Condensed Nearest Neighbor) algorithm was proposed by Hart [2]. For classification problems, as shown in [4], CNN reduces the number of vectors roughly three times on average. CNN used for classification works in the following way: the algorithm starts with only one randomly chosen instance from the original dataset T, which is added to the new dataset P. Then each remaining instance from T is classified with the k-NN algorithm, using the k nearest neighbors from the dataset P. If the classification is correct, the instance is not added to the final dataset P; if the classification is wrong, the instance is added to P. Thus, the purpose of CNN is to reject those instances which do not bring any additional information into the classification process.
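The CNN procedure described above can be sketched as follows; this is a simplified illustration (the random ordering, tie-breaking and the toy data are our assumptions):

```python
import numpy as np

def cnn_select(X, y, k=1, seed=0):
    """Condensed Nearest Neighbor (Hart): an instance joins P only when
    the instances selected so far misclassify it with k-NN."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    keep = [order[0]]                       # start from one random instance
    for i in order[1:]:
        d = np.linalg.norm(X[keep] - X[i], axis=1)
        neighbors = np.array(keep)[np.argsort(d)[:k]]
        predicted = np.bincount(y[neighbors]).argmax()   # majority vote
        if predicted != y[i]:               # wrongly classified -> add to P
            keep.append(i)
    return np.array(keep)

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y = np.array([0, 0, 0, 1, 1, 1])
print(cnn_select(X, y, k=1))
```

On the two tight clusters above, only about one prototype per cluster survives, illustrating the condensation effect.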
The ENN (Edited Nearest Neighbor) algorithm was created by Wilson [3]. The main idea of ENN is to remove a given instance if its class differs from the majority class of its neighbors; thus ENN works as a noise filter. ENN starts from the entire original training set T. Each instance which is correctly classified by its k nearest neighbors is added to the new dataset P, and each wrongly classified instance is not added. Several variants of ENN exist: in Repeated ENN, proposed by Wilson, the ENN process is iterated as long as any instances are wrongly classified; in the All k-NN algorithm proposed by Tomek [12], ENN is repeated for all values of k from k = 1 to kmax.
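A corresponding sketch of ENN (again an illustration under our own assumptions, not the original implementation):

```python
import numpy as np

def enn_select(X, y, k=3):
    """Edited Nearest Neighbor (Wilson): drop an instance when its k nearest
    neighbors (excluding itself) vote for a different class."""
    keep = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                        # exclude the point itself
        neighbors = np.argsort(d)[:k]
        if np.bincount(y[neighbors]).argmax() == y[i]:
            keep.append(i)                   # correctly classified -> keep
    return np.array(keep)

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [0.15]])
y = np.array([0, 0, 0, 1, 1, 1, 1])          # the last point is label noise
print(enn_select(X, y))                      # the noisy point is filtered out
```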
2.2 RegENN and RegCNN: ENN and CNN for Regression Problems
The first step in modifying the CNN and ENN algorithms for regression tasks is to replace the wrong/correct classification decision with a distance measure and a similarity threshold that decide whether the examined vector can be considered similar to its neighbors. For that purpose we use the Euclidean measure and a threshold θ, which expresses the maximum difference between the output values of two vectors for them to be considered similar. Using θ proportional to the standard deviation of the outputs of several nearest neighbors of the vector xi reflects the speed of changes of the output around xi and allows adjusting the threshold to the local landscape, which, as the experiments showed,
allows for better compression of the dataset. We then changed the algorithm used to predict the output Y (xi ) from k-NN to an MLP (multilayer perceptron) neural network, which in many cases gave better results (see Table 1). Additionally, the best results were obtained when the MLP network was trained not on the entire dataset but only on the part of it in the area of the vector of interest. The algorithms are shown in the following pseudo-codes.
Here T is the training dataset, P is the set of selected prototypes, xi is the i-th vector, m is the number of vectors in the dataset, Y (xi ) is the real output value of vector xi , Ȳ (xi ) is the predicted output of vector xi , S is the set of nearest neighbors of vector xi , NN(A,x) is the algorithm trained on dataset A with vector x used as a test sample for which Y (xi ) is predicted (in our case NN(A,x) is implemented by k-NN or MLP), kNN is the k-NN algorithm returning the subset S of the several closest neighbors of xi , θ is the threshold of acceptance/rejection of the vector as a prototype, α is a coefficient (discussed in the experimental section) and std (Y (XS )) is the standard deviation of the outputs of the vectors in S.
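Based on the notation above, RegENN with the variable threshold θ = α · std(Y(XS)) can be sketched as follows; here plain k-NN averaging plays the role of NN(A, x), whereas the paper also uses an MLP trained on a neighborhood of xi:

```python
import numpy as np

def reg_enn(X, Y, k=5, alpha=5.0):
    """RegENN sketch: keep x_i when the neighborhood prediction Ybar(x_i)
    differs from Y(x_i) by less than theta = alpha * std(Y(X_S)).
    Plain k-NN averaging stands in for NN(A, x)."""
    keep = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        S = np.argsort(d)[:k]                # k nearest neighbors of x_i
        theta = alpha * np.std(Y[S])         # variable similarity threshold
        y_bar = Y[S].mean()                  # predicted output Ybar(x_i)
        if abs(y_bar - Y[i]) < theta:
            keep.append(i)                   # consistent with neighborhood
    return np.array(keep)

X = np.arange(10, dtype=float).reshape(-1, 1)
Y = np.arange(10, dtype=float)
Y[5] = 100.0                                 # inject an output outlier
print(reg_enn(X, Y))                         # index 5 is removed as noise
```

Because θ scales with the local output variability, smooth regions get a tight threshold while rapidly changing regions remain tolerant.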
– an MLP network trained on the entire training data within one fold of the crossvalidation process (that is, on 90 percent of the whole dataset; shown in the results section as CNN-MLP90 or ENN-MLP90)
– an MLP network trained on the 33 percent of the training vectors closest to the considered vector (shown in the results section as CNN-MLP30 or ENN-MLP30)
– an MLP network trained on the 11 percent of the training vectors closest to the considered vector (shown in the results section as CNN-MLP10 or ENN-MLP10)
Because of limited space, only the best results obtained for each model with variable θ
are shown in the table and comparison of results obtained with various θ (constant and
variable) are shown for one dataset and one method in a graphical form.
Fig. 2. Dependence of the MSE (MSE_CT: with constant θ, MSE_VT: with variable θ) and of the number of selected vectors (vect_CT: with constant θ, vect_VT: with variable θ) on the threshold θ (when it is constant) and on α (where θ = α · std (Y (XS )))
4 Conclusions
We presented extensions of CNN and ENN, called RegCNN and RegENN, that can be applied to regression tasks, and experimentally evaluated the influence of the θ and α parameters and of various learning methods within the selection algorithm on the number of selected vectors and on the prediction accuracy obtained with an MLP neural network on the reduced dataset. The general conclusion is that in most cases the best results are obtained using an MLP network trained on the subset of closest neighbors of the considered point. It was observed that, for standardized data, the θ used with CNN could on average be set to 0.1 of the MSE value obtained when performing prediction on the unreduced dataset (or α to 0.5), while the θ used with ENN could be set to 5 times the MSE value (or α to 5). The algorithms are not very sensitive to changes of α in terms of prediction accuracy, but RegENN in particular allows for better dataset compression with lower α. A variable θ allows for removing more vectors without affecting the prediction accuracy. The best results are obtained if ENN is applied to the dataset first and CNN afterwards.
It should be possible to significantly improve the results, first by tuning the parameters of the MLP network and using more efficient MLP training methods, such as the Levenberg-Marquardt algorithm, and second by using more advanced instance selection methods, such as those briefly presented in the introduction. These issues will be the area of our further research.
Acknowledgment. The work was sponsored by the grant ATH 2/IV/GW/2011 from
the University of Bielsko-Biala.
References
1. Kordos, M., Blachnik, M., Wieczorek, T.: Temperature Prediction in Electric Arc Furnace
with Neural Network Tree. In: Honkela, T. (ed.) ICANN 2011, Part II. LNCS, vol. 6792, pp.
71–78. Springer, Heidelberg (2011)
2. Hart, P.E.: The condensed nearest neighbor rule. IEEE Transactions on Information The-
ory 14, 515–516 (1968)
3. Wilson, D.: Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans-
actions on Systems, Man, and Cybernetics 2, 408–421 (1972)
4. Jankowski, N., Grochowski, M.: Comparison of Instances Selection Algorithms I. Algorithms
Survey. In: Rutkowski, L., Siekmann, J.H., Tadeusiewicz, R., Zadeh, L.A. (eds.) ICAISC
2004. LNCS (LNAI), vol. 3070, pp. 598–603. Springer, Heidelberg (2004)
5. Kuncheva, L., Bezdek, J.C.: Presupervised and postsupervised prototype classifier design.
IEEE Transactions on Neural Networks 10, 1142–1152 (1999)
6. Wilson, D.R., Martinez, T.R.: Reduction techniques for instance-based learning algorithms. Machine Learning 38, 251–268 (2000)
7. Salvador, G., Derrac, J., Ramon, C.: Prototype Selection for Nearest Neighbor Classifica-
tion: Taxonomy and Empirical Study. IEEE Transactions on Pattern Analysis and Machine
Intelligence 34, 417–435 (2012)
8. Kohavi, R., John, G.H.: Wrappers for Feature Subset Selection. AIJ special issue on relevance (May 1997)
9. Zhang, J., et al.: Intelligent selection of instances for prediction functions in lazy learning
algorithms. Artificial Intelligence Review 11, 175–191 (1997)
10. Tolvi, J.: Genetic algorithms for outlier detection and variable selection in linear regression
models. Soft Computing 8, 527–533 (2004)
11. Guillen, A., et al.: Applying Mutual Information for Prototype or Instance Selection in Re-
gression Problems. In: ESANN 2009 Proceedings (2009)
12. Tomek, I.: An experiment with the edited nearest-neighbor rule. IEEE Transactions on Sys-
tems, Man, and Cybernetics 6, 448–452 (1976)
13. Kordos, M., Blachnik, M., Strzempa, D.: Do We Need Whatever More Than k-NN? In:
Rutkowski, L., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2010,
Part I. LNCS (LNAI), vol. 6113, pp. 414–421. Springer, Heidelberg (2010)
14. Merz, C., Murphy, P.: UCI repository of machine learning databases (1998-2012),
http://www.ics.uci.edu/mlearn/MLRepository.html
15. http://www.rapid-i.com
16. http://www.kordos.com/icann2012
A New Distance for Probability Measures
Based on the Estimation of Level Sets
1 Introduction
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 271–278, 2012.
c Springer-Verlag Berlin Heidelberg 2012
272 A. Muñoz et al.
the estimation of density level-set regions. The article is organized as follows:
In Section 2 we present probability measures as generalized functions and then
we define general distances acting on the Schwartz distribution space. Section 3
presents a new distance built according to this point of view. Section 4 illustrates
the theory with some simulated and real data sets.
d(⟨P, φi⟩, ⟨Q, φi⟩) for i ∈ I, where d is some distance function. Our test functions will be indicator functions of α-level sets, introduced below.
Given a PM P with density function fP , minimum volume sets (or α-level sets) are defined by Sα (fP ) = {x ∈ X | fP (x) ≥ α}, such that P (Sα (fP )) = 1 − ν, where 0 < ν < 1. If we consider an ordered sequence α1 < . . . < αn , αi ∈ (0, 1), then Sαi+1 (fP ) ⊆ Sαi (fP ). Let us define Ai (P) = Sαi (fP ) − Sαi+1 (fP ), i ∈ {1, . . . , n − 1}. We can choose α1 ≈ 0 and αn ≥ maxx∈X fP (x) (which exists, given that X is compact and fP continuous); then ∪i Ai (P) ≈ Supp(P) = {x ∈ X | fP (x) ≠ 0} (equality takes place when n → ∞, α1 → 0 and αn → 1). Given the definition of the Ai , if Ai (P) = Ai (Q) for every i when n → ∞, then P = Q. Thus, taking φi = ½[Ai ] , our choice is d(⟨P, φi⟩, ⟨Q, φi⟩) = |⟨P, φi⟩ − ⟨Q, φi⟩| = |∫ ½[Ai ] dP − ∫ ½[Bi ] dQ| ≈ |∫ ½[Ai ] dμ − ∫ ½[Bi ] dμ|, where μ is the ambient measure and Bi = Ai (Q). Indeed, given the definition of level set and the choice of Ai , both P and Q are approximately constant on Ai and Bi , respectively, and so we are using the counting (ambient) measure.
Denote by △ the symmetric difference operator: A △ B = (A − B) ∪ (B − A). Consider φ1i = ½[Ai (P)−Ai (Q)] and φ2i = ½[Ai (Q)−Ai (P)] , and define di (P, Q) = |⟨P, φ1i⟩ − ⟨Q, φ1i⟩| + |⟨P, φ2i⟩ − ⟨Q, φ2i⟩|. From the previous discussion di (P, Q) ≈ μ (Ai (P) △ Ai (Q)), which motivates the following definition:

dα (P, Q) = Σ_{i=1}^{n−1} αi · μ (Ai (P) △ Ai (Q)) / μ (Ai (P) ∪ Ai (Q)),   (1)
where μ is the ambient measure. We use μ (Ai (P) △ Ai (Q)) in the numerator instead of di (P, Q) ≈ μ (Ai (P) △ Ai (Q)) for compactness; when n → ∞ both expressions are equivalent. Of course, we can calculate dα in eq. (1) only when we know the distribution function of both PMs. In practice two data samples generated from P and Q will be available, and we need to define a plug-in estimator: consider the estimators Âi (P) = Ŝαi (fP ) − Ŝαi+1 (fP ); then we can estimate dα (P, Q) by
d̂α (P, Q) = Σ_{i=1}^{n−1} αi · #(Âi (P) △S Âi (Q)) / #(Âi (P) ∪ Âi (Q)),   (2)

where #A indicates the number of points in A and △S indicates the set estimate of the symmetric difference, defined below.
Both dα (P, Q) and d̂α (P, Q), as currently defined, are semimetrics. A version satisfying the full (Euclidean) metric axioms will be addressed in work following the present one.
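A rough plug-in sketch of estimator (2): level sets are approximated by thresholding kernel density estimates, the αi are mapped to density quantiles on the pooled sample, and #(·) counts sample points. The quantile mapping and the use of pooled sample points are simplifying assumptions of ours, not the paper's set estimator:

```python
import numpy as np
from scipy.stats import gaussian_kde

def level_distance(sample_p, sample_q, alphas):
    """Plug-in sketch of eq. (2) for two samples of shape (n, d)."""
    f_p = gaussian_kde(sample_p.T)
    f_q = gaussian_kde(sample_q.T)
    pooled = np.vstack([sample_p, sample_q])
    dp, dq = f_p(pooled.T), f_q(pooled.T)     # estimated densities
    tp = np.quantile(dp, alphas)              # density thresholds for P
    tq = np.quantile(dq, alphas)              # density thresholds for Q
    dist = 0.0
    for i in range(len(alphas) - 1):
        A = (dp >= tp[i]) & (dp < tp[i + 1])  # points in A_i(P)
        B = (dq >= tq[i]) & (dq < tq[i + 1])  # points in A_i(Q)
        union = np.sum(A | B)
        if union > 0:
            dist += alphas[i] * np.sum(A ^ B) / union
    return dist

rng = np.random.default_rng(0)
a = rng.normal(0, 1, (300, 2))
c = rng.normal(3, 1, (300, 2))
alphas = np.linspace(0.1, 0.9, 5)
print(level_distance(a, a, alphas))  # identical samples give 0
print(level_distance(a, c, alphas))  # separated samples give a large value
```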
Example: Consider the distance from a point x ∈ R^d to its k-th nearest neighbour x(k) in the sample sn : M (x, sn ) = dk (x, sn ) = d(x, x(k) ); it is a sparsity measure. Note that dk is neither a density estimator nor is it one-to-one related to a density estimator. Thus, the definition of 'sparsity measure' is not trivial.
The Support Neighbour Machine [19] solves the following optimization problem:

max_{ρ,ξ}  νnρ − Σ_{i=1}^{n} ξi
s.t.  g(xi ) ≥ ρ − ξi ,  ξi ≥ 0,  i = 1, . . . , n,   (3)

where g(x) = M (x, sn ) is a sparsity measure, ν ∈ [0, 1], the ξi , i = 1, . . . , n, are slack variables and ρ is the threshold variable being optimized.
Fig. 1. Set estimate of the symmetric difference. (a) Data sets  and B̂. (b)  − B̂.
(c) B̂ − Â.
4 Experimental Work
Being the proposed distance intrinsically nonparametric, there are no ‘simple’
parameters (like mean and variance) on which we can concentrate our attention
to do exhaustive benchmarking. The strategy will be to compare the proposed
distance to other classical PM distances for some well known (and parametrized)
distributions, to get a first impression on its performance. Here we consider
Kullback-Leibler (KL) divergence, Kolmogorov-Smirnov (KS) distance and t-
test (T) measure (Hotelling test in the multivariate case).
We begin by testing our Level Distance (LD) in the case most favourable to the classical PM metrics: normal distributions. Consider two mixtures of normally distributed populations, α N(μ = −d^{−1/2} 1; Σ = 0.75 I) + (1 − α) N(μ = d^{−1/2} 1; Σ = I) and (1 − α) N(μ = −d^{−1/2} 1; Σ = I) + α N(μ = d^{−1/2} 1; Σ = 0.75 I), with α = 0.6, 1 the vector of ones and d the dimension considered, in order to compare the discrimination performance of the proposed distance with other classical multivariate distances: the Kullback-Leibler (KL) divergence and the t-test (T) measure (Hotelling test in the multivariate case). We found the minimum sample size n for which the PM metrics are able to discriminate between both samples. In all cases we fix the type I error at .05 and the type II error at .1. Table 1 reports the results; we can see that the Level Set Distance (LD) measure is more efficient (in terms of sample size) in all the dimensions considered.
Table 1. Minimum sample size for a 5% type I and 10% type II error

Metric   d=1    2     3     4     5    10     20     50    100
KL      1300  1700  1800  1900  2000  2700  >5000  >5000  >5000
T        750   800   900  1000  1100  1400   1500   2100   2800
LD       200   380   650   750   880  1350   1400   1800   2200
Fig. 2. MDS plot for texture groups. A representative of each class is plotted on the map.
5 Future Work
In the near future we will address the study of the geometry induced by the proposed measure and of its asymptotic properties. Exhaustive testing on a variety of data sets following different distributions is needed. We are also working on a variation of the LD distance that satisfies the Euclidean metric conditions.
References
1. Amari, S.-I., Barndorff-Nielsen, O.E., Kass, R.E., Lauritzen, S.L., Rao, C.R.: Differ-
ential Geometry in Statistical Inference. Lecture Notes-Monograph Series, vol. 10
(1987)
2. Amari, S., Nagaoka, H.: Methods of Information Geometry. American Mathemat-
ical Society (2007)
3. Atkinson, C., Mitchell, A.F.S.: Rao’s Distance Measure. The Indian Journal of
Statistics, Series A 43, 345–365 (1981)
4. Müller, A.: Integral Probability Metrics and Their Generating Classes of Functions.
Advances in Applied Probability 29(2), 429–443 (1997)
5. Banerjee, A., Merugu, S., Dhillon, I., Ghosh, J.: Clustering with Bregman Divergences. Journal of Machine Learning Research, 1705–1749 (2005)
6. Burbea, J., Rao, C.R.: Entropy differential metric, distance and divergence mea-
sures in probability spaces: A unified approach. Journal of Multivariate Analysis 12,
575–596 (1982)
7. Cohen, W.W., Ravikumar, P., Fienberg, S.E.: A Comparison of String Distance
Metrics for Name-matching Tasks. In: Proceedings of IJCAI 2003, pp. 73–78 (2003)
8. Devroye, L., Wise, G.L.: Detection of abnormal behavior via nonparametric esti-
mation of the support. SIAM J. Appl. Math. 38, 480–488 (1980)
9. Dryden, I.L., Koloydenko, A., Zhou, D.: Non-Euclidean statistics for covariance matrices, with applications to diffusion tensor imaging. The Annals of Applied Statistics 3, 1102–1123 (2009)
10. Rubner, Y., Tomasi, C., Guibas, L.J.: The Earth Mover's Distance as a Metric for Image Retrieval. International Journal of Computer Vision 40, 99–121 (2000)
11. Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method
for the two sample problem. In: Advances in Neural Information Processing Sys-
tems, pp. 513–520 (2007)
12. Hastie, T., Tibshirani, R., Friedman, J.: The elements of statistical learning, 2nd
edn. Springer (2009)
13. Hayashi, A., Mizuhara, Y., Suematsu, N.: Embedding Time Series Data for Clas-
sification. In: Perner, P., Imiya, A. (eds.) MLDM 2005. LNCS (LNAI), vol. 3587,
pp. 356–365. Springer, Heidelberg (2005)
14. Kylberg, G.: The Kylberg Texture Dataset v. 1.0. Centre for Image Analysis,
Swedish University of Agricultural Sciences and Uppsala University, Uppsala, Swe-
den, http://www.cb.uu.se/gustaf/texture/
15. Lebanon, G.: Metric Learnong for Text Documents. IEEE Trans. on Pattern Anal-
ysis and Machine Intelligence 28(4), 497–508 (2006)
278 A. Muñoz et al.
16. Mallat, S.: A Theory for Multiresolution Signal Decomposition: The Wavelet Rep-
resentation. IEEE Trans. on Pattern Analysis and Machine Intelligence 11(7), 674–
693
17. Marriot, P., Salmon, M.: Aplication of Differential Geometry to Econometrics.
Cambridge University Press (2000)
18. Moon, Y.I., Rajagopalan, B., Lall, U.: Estimation of mutual information using
kernel density estimators. Physical Review E 52(3), 2318–2321
19. Muñoz, A., Moguerza, J.M.: Estimation of High-Density Regions using One-
Class Neighbor Machines. IEEE Trans. on Pattern Analysis and Machine Intel-
ligence 28(3), 476–480
20. Ramsay, J.O., Silverman, B.W.: Applied Functional Data Analysis. Springer, New
York (2005)
21. Sriperumbudur, B.K., Fukumizu, K., Gretton, A., Scholkopf, B., Lanckriet, G.R.G.:
Non-parametric estimation of integral probability metrics. In: International Sym-
posium on Information Theory (2010)
22. Strichartz, R.S.: A Guide to Distribution Theory and Fourier Transforms. World
Scientific (1994)
23. Székely, G.J., Rizzo, M.L.: Testing for Equal Distributions in High Dimension.
InterStat (2004)
24. Ullah, A.: Entropy, divergence and distance measures with econometric applica-
tions. Journal of Statistical Planning and Inference 49, 137–162 (1996)
25. Xing, E.P., Ng, A.Y., Jordan, M.I., Russell, S.: Distance Metric Learning, with Ap-
plication to Clustering with Side-information. In: Advances in Neural Information
Processing Systems, pp. 505–512 (2002)
26. Zolotarev, V.M.: Probability metrics. Teor. Veroyatnost. i Primenen 28(2), 264–287
(1983)
Low Complexity Proto-Value Function Learning
from Sensory Observations with Incremental
Slow Feature Analysis
IDSIA-USI-SUPSI,
Galleria 2, 6928, Manno-Lugano, Switzerland
1 Introduction
A reinforcement learning [21] agent, which experiences the world through its continuous and high-dimensional sensory input stream, is exploring an unknown environment. It would like to be able to predict future rewards, i.e., learn a value function (VF), but, due to its complicated sensory input, VF learning must be preceded by learning a simplified perceptual representation.
There has been a plethora of work on learning representations for RL, specifically for Markov Decision Processes (MDPs); we can outline four types. 1. Top-Down Methods. Here, the representation/basis function parameter adaptation
is guided by the VF approximation error only [13,16]. 2. Spatial Unsupervised
Learning (UL). An unsupervised learner adapts to improve its own objective,
which treats each sample independently, e.g., minimize per-sample reconstruc-
tion error. The UL feeds into a reinforcement learner. UL methods used have in-
cluded nearest-neighbor type approximators [17] or autoencoder neural nets [11].
3. Hybrid Systems. Phases of spatial UL and top-down VF-based feedback are
interleaved [5,11]. 4. Spatiotemporal UL. Differs from the spatial UL type by
using a UL objective that takes into account how the samples change through
time. Such methods include the framework of Proto-Reinforcement Learning
(PRL) [15], and Slow Feature Analysis (SFA) [22,12].
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 279–287, 2012.
© Springer-Verlag Berlin Heidelberg 2012
280 M. Luciw and J. Schmidhuber
There are some potential drawbacks to types 1, 2, and 3. The top-down techniques bias their representation toward the reward function. They also require the
reward information for any representation learning to take place. In the spa-
tial UL techniques, the encoding need not capture the information important
for reward prediction — the underlying Markov process dynamics. The spatiotemporal UL methods do not have these drawbacks: they capture the state-transition dynamics, the representation is not biased by any particular reward function, and they can learn when reward information is not available.
In PRL, the features are called Proto-Value Functions (PVFs); theoretical
analysis shows just a few PVFs can capture the global characteristics of some
Markovian processes [3,4] and that just a few PVFs can be used as building
blocks to approximate value functions with low error. Sprekeler recently showed
how SFA can be considered a function approximation to learning PVFs [20], so
slow features (SFs) can have the same set of beneficial properties for representa-
tion learning for general RL. Kompella, Luciw and Schmidhuber recently devel-
oped an incremental method for updating a set of slow features (IncSFA; [10,9]),
with linear computational and space complexities.
The new algorithm in this paper is the combination of IncSFA and RL — here
we use a method based on temporal-differences (TD) for its local nature, but
other methods like LSTD [1] are possible — for incrementally learning a good
set of RL basis functions for value functions, as well as the value function itself.
The importance is twofold. First, the method gives a way to approximately
learn PVFs directly from sensory data. It does not need to build a transition model, adjacency matrix, or covariance matrix, and in fact never needs to know what state it is in. Second, it has linear complexity in the number of
input dimensions. The other methods that derive such features — batch SFA
and graphical embedding (Laplacian EigenMap) — have cubic complexity and
don’t scale up well to a large input dimension. Therefore our method is suited
to autonomous learning on sensory input streams (e.g., vision), which the other
methods are not suited for due to their computational and space complexities.
Due to space limits, we just skim over the background. See elsewhere
[21,15,3,22,20] for further details.
Value Function Approximation for MDPs. An MDP is a five-tuple (S, A, P, R, γ), where S is a set of states, A is a set of actions, P^a_{s,s'} is the probability of moving from state s to state s' under action a, R is the reward function, and γ is the discount factor. A linear approximation of the value function over J basis functions φ_j with coefficients θ_j is

    V(x; φ, θ) = Σ_{j=1}^{J} φ_j(x) θ_j.    (1)
We’re not so concerned with learning θ since there are good methods to do
this given suitable basis functions. Here we are interested in learning φ. We’d
like a compact set of mappings, which can deliver a reasonable approximation of
different possible value functions, and which could be learned in an unsupervised
way, i.e., without requiring the reward information. PVFs suit all these criteria,
but require that the state is known, so they do not fit Eq. 1.
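Given suitable basis functions φ_j, the coefficients θ can be learned with plain TD(0). A minimal sketch follows; the three-state chain MDP and the one-hot basis are hypothetical illustrations, not from the paper:

```python
import numpy as np

def td0_linear(episodes, phi, n_features, gamma=0.9, alpha=0.1):
    """TD(0) with a linear value function V(s) = phi(s) . theta.

    episodes: list of trajectories, each a list of (state, reward, next_state).
    phi: maps a state to an n_features-dimensional basis vector.
    """
    theta = np.zeros(n_features)
    for trajectory in episodes:
        for s, r, s_next in trajectory:
            # TD error: delta = r + gamma * V(s') - V(s)
            delta = r + gamma * phi(s_next) @ theta - phi(s) @ theta
            theta += alpha * delta * phi(s)  # move theta along the basis of s
    return theta

# Toy three-state chain 0 -> 1 -> 2 (terminal); reward 1 on entering state 2.
phi = lambda s: np.eye(3)[s]                 # one-hot basis, trivially fits Eq. 1
episodes = [[(0, 0.0, 1), (1, 1.0, 2)]] * 200
theta = td0_linear(episodes, phi, n_features=3)
# theta approaches [gamma * 1, 1, 0] = [0.9, 1.0, 0.0]
```

With a one-hot basis this reduces to tabular TD; the point of PVFs/slow features is to supply a far smaller, non-trivial φ while keeping the same θ update.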
Proto-Value Functions. PVFs capture global dynamic characteristics of the MDP in a low-dimensional space. The objective is to find a Φ that preserves similarity relationships between each pair of states s_t and s_t', with a small set of basis functions; formally, to minimize

    Ψ(φ_j) = Σ_{t,t'} A_{t,t'} (φ_j(s_t) − φ_j(s_t'))²

with unit norm and orthogonality constraints. In general A_{t,t'} is a matrix of similarities; for MDPs it is typically a binary adjacency matrix, where a one means the states are connected (i.e., transition probability higher than some threshold). The objective penalizes differences between mapped outputs of adjacent states, e.g., if states s_t and s_t' are connected, a good φ_j will have (y_t − y_t')² small.
Laplacian EigenMap (LEM) Procedure. Φ can be solved for through an eigenvalue problem [2,19]: the PVFs are the eigenvectors of the combinatorial Laplacian L, ordered by increasing eigenvalue, where

    L_{i,j} = degree(s_i)   if i = j,
              −1            if i ≠ j and s_i is adjacent to s_j,    (2)
              0             otherwise.
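The LEM procedure above can be sketched directly; a minimal numpy version on a hypothetical four-state chain graph:

```python
import numpy as np

def proto_value_functions(A, k):
    """Return the k smallest-eigenvalue eigenvectors of L = D - A (the PVFs).

    A: symmetric binary adjacency matrix of the state graph.
    """
    D = np.diag(A.sum(axis=1))
    L = D - A                             # Eq. (2): degrees on the diagonal, -1 for adjacent pairs
    eigvals, eigvecs = np.linalg.eigh(L)  # eigh returns ascending eigenvalues
    return eigvecs[:, :k]                 # first k PVFs; the first is constant

# Chain of four states: 0 - 1 - 2 - 3
A = np.zeros((4, 4))
for i in range(3):
    A[i, i + 1] = A[i + 1, i] = 1.0
pvf = proto_value_functions(A, 2)
# pvf[:, 0] is constant; pvf[:, 1] varies smoothly (monotonically) along the chain
```

Note this requires the full adjacency matrix, i.e., the states must be known, which is exactly the requirement the paper's incremental method avoids.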
Algorithm 1. IncSFA-TD(J, η, γ, α, T)
// Autonomously learn J slow features and VF approximation coefficients
// from whitened samples x ∈ R^I
1   {W, v_β, θ} ← Initialize()
    // W: matrix of slow feature column vectors w_1, ..., w_J
    // v_β: first principal component in difference space, with magnitude equal to its eigenvalue
    // θ: coefficients for the VF
2   for t ← 1 to ∞ do
3       x_prev ← x_curr                       // after t = 1
4       x_curr ← GetWhitenedObsv()
5       r ← ObserveReward()
6       if t > 1 then
7           ẋ ← x_curr − x_prev
8           v_β ← CCIPCA-Update(v_β, ẋ)       // for sequential-addition parameter
9           β ← ‖v_β‖
            // slow features update
10          l_1 ← 0
11          for i ← 1 to J do
12              w_i ← (1 − η) w_i − η [(ẋ · w_i) ẋ + l_i]
13              w_i ← w_i / ‖w_i‖
14              l_{i+1} ← β Σ_{j=1}^{i} (w_j · w_i) w_j
15          end
16          y_prev ← y_curr                   // after t = 2
17          y_curr ← x_curr^T W
18          if t > T then
19              δ ← r + (γ y_curr − y_prev) · θ    // TD error
20              θ ← θ + α δ y_prev                 // TD update
21          end
22      end
23      a ← SelectAction()
24  end
Whether the slow features (SFs) are good approximations of the PVFs depends on the relation of observations to states. If the state is not extractable
from each single observation, the problem becomes partially-observable (and
out of scope here). Even if the observation has the state information embedded
within, there may not be a linear mapping. Expanded function spaces [22] and
hierarchical networks [8] are typically used with SFA to deal with such cases,
and they can be used with IncSFA as well [14].
IncSFA is described in detail elsewhere [10,9]. We want to use it to develop φ in Eq. 1, but we also need something to learn θ. Since a motivation behind this work is to move toward biologically plausible, practical RL methods, we use TD learning, a simple local learning method for adapting the value function coefficients. The resulting algorithm, IncSFA-TD (see Alg. 1), is biologically plausible to the extent that it is local in space and time [18], and its updating equation (Line 12) has an anti-Hebbian form [6]. The input parameters are: J, the number of features to learn; η, the IncSFA learning rate; γ, the discount factor; α, the TD learning rate; and T, the time to start adapting the VF coefficients. For simplicity, the algorithm requires the observation to be drawn
from a whitened distribution. Note the original IncSFA also provides a method for
incrementally doing this whitening.
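Algorithm 1 can be sketched in Python. This is a simplified, hypothetical rendering, not the authors' code: the CCIPCA estimate of the sequential-addition scalar β is replaced by a fixed constant assumed to bound the derivative-covariance eigenvalues, whitening is done in batch beforehand, and action selection is omitted.

```python
import numpy as np

def incsfa_td(observations, rewards, J, eta=0.05, gamma=0.9, alpha=0.01, T=100, beta=4.0):
    """Simplified sketch of IncSFA-TD on pre-whitened observations.

    observations: (N, I) array of whitened samples; rewards: (N,) array.
    beta stands in for the CCIPCA-estimated sequential-addition scalar.
    """
    I = observations.shape[1]
    rng = np.random.default_rng(0)
    W = rng.standard_normal((I, J))          # slow feature column vectors
    W /= np.linalg.norm(W, axis=0)
    theta = np.zeros(J)                      # VF coefficients
    x_prev = y_prev = None
    for t, (x, r) in enumerate(zip(observations, rewards), start=1):
        if x_prev is not None:
            x_dot = x - x_prev               # temporal difference of inputs
            for i in range(J):
                w = W[:, i]
                # pull w_i away from the already-extracted slower features (Line 14)
                l = beta * sum((W[:, j] @ w) * W[:, j] for j in range(i))
                # anti-Hebbian minor-component step on the derivative signal (Line 12)
                w = (1 - eta) * w - eta * ((x_dot @ w) * x_dot + l)
                W[:, i] = w / np.linalg.norm(w)
            y = x @ W                        # current slow feature outputs
            if y_prev is not None and t > T:
                delta = r + (gamma * y - y_prev) @ theta   # TD error
                theta += alpha * delta * y_prev            # TD update
            y_prev = y
        x_prev = x
    return W, theta
```

On an input mixing one slowly varying signal with fast noise, the learned feature should recover the slow signal, since the minor component of the derivative covariance is the direction of least temporal variation.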
On Complexity. The following table compares the time and space complexities of three methods that will give approximately the same features — LEM (Laplacian EigenMap), BSFA (Batch SFA), and IncSFA — in terms of the number of samples n and the input dimension I:

    Method   Time                        Space
    LEM      O(n³) eigendecomposition    O(n²) graph Laplacian, plus the data
    BSFA     O(I³) eigendecomposition    O(I²) covariance matrices, plus the data
    IncSFA   linear in I per update      linear in I
The computational burden on BSFA and LEM is the one time cost of ma-
trix eigendecomposition, which has cubic complexity [7]. SFA uses covariance
matrices of sensory input, which scale with input dimension I. However LEM’s
graph Laplacian scales with the number of data points n. So the computational
complexity of batch SFA can be quite a bit less than LEM, especially for agents
that collect a lot of samples (since typically I << n). IncSFA has linear updating
complexity since it avoids batch-based eigendecomposition altogether. However,
as an incremental method, it will be less efficient with each data point. The
space burden in BSFA and LEM involves collecting the data and building the
matrices, which IncSFA avoids.
4 Experiment
Fig. 1. Upper left: sample observation (30 × 30 image) from the environment. The
IncSFA features developed from the exploration sequence directly from these observa-
tions are approximately the same as LEM features learned from eigendecomposition
of the graph Laplacian of the true transition model. Upper center/right: embedding
of a trajectory through the environment for both LEM features and IncSFA features
(UL refers to the upper left corner of the room and R refers to the small inner room).
Lower left: feature responses upon a grid of different images, where the agent is at
different possible positions, for each of LEM and IncSFA (best viewed in color). Lower
right: the agent goes to exploitation mode (maximize reward) using its value function
learned upon the incrementally developed slow features. The performance shoots up
to a nearly optimal level, for two different reward positions.
Setup. The agent explores the environment via random walk for 40,000 steps. After each step, the slow features are updated, with learning rate η = 0.0002. At t = 40,000, the reward appears, and the agent continues its random walk for 3,000 more steps, while learning the value function coefficients, with learning rate α = 0.0001. After t = 43,000, the agent enters exploitation mode, where
it picks the action that will take it to the most valuable possible next state
(using its current VF approximation), but with a 5% random action chance. To
avoid the agent staying at the reward in exploitation mode, when the reward
is reached, the agent teleports away. The features and coefficients continue to
adapt. To show some generality, the reward will be placed in two different places
(in two different instances) — inside the room or at the bottom center of the
image.
Results. The results are shown in Fig. 1. First, we want to show that the features learned incrementally from sensory data actually deliver a reasonable LEM embedding.
Visually compare the features of IncSFA, developed online and incrementally
on the high-dimensional noisy images, to eigenvectors of the graph Laplacian,
using the actual underlying transition model. Also note the similarity of the
graphical embeddings of a single trajectory through the entire room upon the
first three (non-constant) features for each of LEM and IncSFA. After going
into exploitation mode, the agent quickly reaches a near optimal level of reward
accumulation, for both reward functions. The features did not change signifi-
cantly in the roughly 3, 000 exploitative steps.
Discussion. We can discuss some other methods that might apply to this set-
ting. As mentioned, this environment was first developed elsewhere [11]. In that
work, deep autoencoder neural nets are trained to compress the observations, and the output of the bottleneck layer (the layer with the fewest neurons) becomes the
state representation for the agent. Neural-fitted reinforcement learning (NFQ)
learns the Q-function upon this state representation, and the NFQ net error (the
TD-error) backpropagates throughout the autoencoder, causing the state repre-
sentation to conform to a map-like embedding. This effect only emerges when
the Q-error is backpropagated; otherwise the autoencoder representation does
not resemble a map. In our case, the slow features learn a map representation
in the unsupervised phase and therefore do not need the reward information to
learn such a representation.
Another type of method that would apply is a nearest-neighbor prototype
state quantization, where new prototypes/states are added when the distance of
an observation from all existing prototypes exceeds some threshold. This type
of method provides distinct states for RL but does not provide an embedding.
Additionally, this method can lead to a large number of states, which increases
the search space for the RL.
One might want to try an incremental Principal Component Analysis (PCA),
which like SFA will also give a compressed code in a few features, but captures
directions of highest variance (a spatial encoding). SFA uses the temporal in-
formation to learn spatial features, i.e., it casts the data into a low-dimensional
space where similarity information is preserved. A low-dimensional map is quite
useful for planning and control, but PCA’s encoding does not necessarily have
these properties (it will be good for reconstructing the input).
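To make the contrast concrete, here is a minimal batch linear SFA sketch (not the paper's incremental IncSFA): both PCA and SFA start from the same covariance eigendecomposition, but SFA keeps the directions of least temporal variation rather than greatest spatial variance. The signals below are illustrative.

```python
import numpy as np

def batch_linear_sfa(X, J):
    """Minimal batch linear SFA: whiten X, then take the minor components
    of the derivative covariance. Cubic in the input dimension because of
    the two eigendecompositions."""
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / len(Xc)
    evals, evecs = np.linalg.eigh(C)
    S = evecs @ np.diag(evals ** -0.5)   # whitening matrix
    Z = Xc @ S
    Zdot = np.diff(Z, axis=0)            # temporal derivatives
    Cdot = Zdot.T @ Zdot / len(Zdot)
    dvals, dvecs = np.linalg.eigh(Cdot)  # ascending: slowest directions first
    return S @ dvecs[:, :J]              # maps centered raw input -> J slow features

rng = np.random.default_rng(0)
tg = np.arange(2000)
slow = np.sin(2 * np.pi * tg / 400)      # slowly varying signal, low variance
fast = 3.0 * rng.normal(size=2000)       # fast noise, high variance
X = np.column_stack([slow, fast])
W = batch_linear_sfa(X, 1)
y = (X - X.mean(axis=0)) @ W
# y[:, 0] recovers the slow signal; PCA's top component would pick the fast one
```

PCA on the same data would return the highest-variance direction (the fast noise), illustrating why a spatial encoding can miss exactly the structure that matters for planning.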
5 Conclusions
A real-world reinforcement learning agent doesn't get clean states, but messy observations. Learning to represent its perceptions in a way that aids its future reward prediction is at least as important as its method for learning a value function. For biological plausibility, the methods for learning representations and learning values need to be incremental and local in space and time. IncSFA and TD fulfill these criteria. We hope this method and the background we provided here influence autonomous real-world reinforcement learners.
References
1. Bradtke, S.J., Barto, A.G.: Linear least-squares algorithms for temporal difference
learning. Machine Learning 22(1), 33–57 (1996)
2. Chung, F.R.K.: Spectral graph theory. AMS Press, Providence (1997)
3. Coifman, R.R., Lafon, S., Lee, A.B., Maggioni, M., Nadler, B., Warner, F., Zucker,
S.W.: Geometric diffusions as a tool for harmonic analysis and structure definition
of data: Diffusion maps. Proceedings of the National Academy of Sciences of the
United States of America 102(21), 7426 (2005)
4. Coifman, R.R., Maggioni, M.: Diffusion wavelets. Applied and Computational Har-
monic Analysis 21(1), 53–94 (2006)
5. da Motta Salles Barreto, A., Anderson, C.W.: Restricted gradient-descent algo-
rithm for value-function approximation in reinforcement learning. Artificial Intel-
ligence 172(4-5), 454–482 (2008)
6. Dayan, P., Abbott, L.F.: Theoretical neuroscience: Computational and mathemat-
ical modeling of neural systems (2001)
7. Forsythe, G.E., Henrici, P.: The cyclic Jacobi method for computing the princi-
pal values of a complex matrix. Applied Mathematics and Statistics Laboratories,
Stanford University (1958)
8. Franzius, M., Sprekeler, H., Wiskott, L.: Slowness and sparseness lead to place,
head-direction, and spatial-view cells. PLoS Computational Biology 3(8), e166
(2007)
9. Kompella, V.R., Luciw, M.D., Schmidhuber, J.: Incremental slow feature analy-
sis: Adaptive low-complexity slow feature updating from high-dimensional input
streams. Neural Computation (accepted and to appear, 2012)
10. Kompella, V.R., Luciw, M., Schmidhuber, J.: Incremental slow feature analysis.
In: International Joint Conference on Artificial Intelligence (2011)
11. Lange, S., Riedmiller, M.: Deep auto-encoder neural networks in reinforcement
learning. In: International Joint Conference on Neural Networks, Barcelona, Spain
(2010)
12. Legenstein, R., Wilbert, N., Wiskott, L.: Reinforcement learning on slow features
of high-dimensional input streams. PLoS Computational Biology 6(8) (2010)
13. Lin, L.J.: Reinforcement learning for robots using neural networks. School of Com-
puter Science, Carnegie Mellon University (1993)
14. Luciw, M., Kompella, V.R., Schmidhuber, J.: Hierarchical incremental slow feature
analysis. In: Workshop on Deep Hierarchies in Vision (2012)
15. Mahadevan, S.: Proto-value functions: Developmental reinforcement learning. In:
Proceedings of the 22nd International Conference on Machine Learning, pp. 553–
560. ACM (2005)
16. Menache, I., Mannor, S., Shimkin, N.: Basis function adaptation in temporal dif-
ference reinforcement learning. Annals of Operations Research 134(1), 215–238
(2005)
17. Santamaria, J.C., Sutton, R.S., Ram, A.: Experiments with reinforcement learning
in problems with continuous state and action spaces. Adaptive Behavior 6(2), 163
(1997)
18. Schmidhuber, J.: A local learning algorithm for dynamic feedforward and recurrent
networks. Connection Science 1(4), 403–412 (1989)
19. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on
Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000)
20. Sprekeler, H.: On the relation of slow feature analysis and Laplacian eigenmaps.
Neural Computation, 1–16 (2011)
21. Sutton, R.S., Barto, A.G.: Reinforcement learning: An introduction, vol. 1. Cam-
bridge Univ. Press (1998)
22. Wiskott, L., Sejnowski, T.: Slow feature analysis: Unsupervised learning of invari-
ances. Neural Computation 14(4), 715–770 (2002)
Improving Neural Networks Classification
through Chaining
Due to the importance of classification, many techniques have been developed over the past decades to automate it. Artificial neural networks are amongst the best techniques used for classification [1]. They are a simplification of the human central nervous system and have been the subject of study for decades [2]. In a nutshell, neural networks are ideal for learning complex relations between inputs and classes. They consist of multilayered groups of interconnected components that connect to each other via weighted links. The training process alters these weights based on the produced error such that the new weights minimize the classification error [1].
Although neural networks are very effective, there remains interest in enhancing their effectiveness by increasing learning and convergence speed, estimating optimal structures, increasing classification accuracy, etc. Many techniques have been proposed to these ends. For instance, Yu et al. proposed dynamic learning rates for increasing the speed of learning, while cascade correlation [3], the upstart algorithm [4], and optimal brain damage [5] were for estimating network structures. In addition, the modular approach [6], ensemble neural networks [7], and the bagging and boosting techniques [8] worked toward increasing classification accuracy. In essence, the modular approach is concerned mainly with breaking down a problem into smaller sub-problems, to which separate networks are applied. The ensemble technique is to combine
This work is supported by NSERC, Canada.
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 288–295, 2012.
© Springer-Verlag Berlin Heidelberg 2012
the outputs of structurally or logically different networks trained to predict the same target through some technique, e.g., voting or averaging. Since varying neural networks make errors on different areas of the input space, a good combining technique will yield an ensemble that is more accurate and error tolerant than a typical network [7,8].
(Figure: panels (a) and (b) show two arrangements of the chained networks NN2, NN3, ..., NNn.)
In general, the intuition behind SLC and MLC is that a neural network’s predictions
can be used to correct the predictions of upcoming networks since these predictions
resulted from attributes that are indicative of the target classes. Therefore, the predic-
tions are highly correlated with the target classes. Using these predictions is expected
to further improve the predictability of the data.
In particular, the intuition behind SLC is that a network may not need the predictions
of all preceding networks in order to correct its classification. A network trained on the
predictions of a previous network produces predictions influenced by that knowledge.
Therefore, it seems reasonable that the predictions of the new network should replace
that of the previous network in the dataset. On the other hand, the intuition behind MLC is that it may be necessary for a network to have access to the predictions of all the previous networks, so that the training procedure is able to learn what is most beneficial for the network's own prediction.

290 K. Zaamout and J.Z. Zhang

Table 1. Summary of the datasets

Dataset  # Attributes  # Instances  Class Type (#)   Missing Values?  Data Characteristics
DAGA     17            72           Real             No               Time Series-Real
DSPF     27            1941         Categorical (7)  No               Integer-Continuous
DSYN     17            7280         Real             No               Time Series-Real
DRER     40            405          Categorical (2)  Yes              Integer
3 Experiment Setup
Four datasets were used to experiment with the proposed approaches, as shown in Table 1. The first set, Agassiz, denoted as DAGA, was obtained from an agriculture application¹ [9]. This dataset is a record of the yields of greenhouse tomato plants under controlled conditions. The second dataset is called the Synthetic dataset, denoted as DSYN, and was extrapolated from the Agassiz dataset by randomly increasing or decreasing each attribute's value by a random amount to create a synthetic value. The third one is called the Steel Plates Faults dataset, denoted as DSPF. It was obtained from the UCI machine learning repository². It records various aspects of steel plates, such as type of steel, thickness, luminosity, etc., which allows predicting various faults in the plates. The fourth dataset, called the Restaurant Reviews dataset, denoted as DRER, was collected from a website³. The data is intended to determine from customer reviews whether a customer will return to a restaurant or not.
Before building our chained neural networks, our datasets have undergone some pre-
processing tasks. For each dataset, three attribute selection algorithms, Principal Com-
ponent Analysis (PCA) [10], Correlation-based Feature Selection (CFS) [11], and Reli-
efF (ReliefF) [12], were applied to diversify the datasets. PCA is a multivariate analysis
technique that takes as an input a dataset of inter-correlated attributes with their in-
stances and produces a new dataset of reduced dimensionality while retaining most of
its properties. CFS selects a subset of attributes such that they are highly correlated
with the class attribute and the least correlated with each other. ReliefF ranks attributes
based on their relevance to the class attribute. A selected attribute would contain values
that are dissimilar for different classes and similar within the same class. The datasets produced by these selection algorithms, along with the original ones, will assist us in understanding and interpreting the performance of our chaining neural networks.
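Of the three attribute-selection steps, PCA is the only projection; a minimal numpy sketch follows (CFS and ReliefF select among existing attributes rather than projecting, and are omitted; the data here is synthetic and illustrative):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project a dataset onto its top principal components (dimensionality reduction)."""
    Xc = X - X.mean(axis=0)
    C = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eigh(C)           # eigenvalues in ascending order
    top = evecs[:, -n_components:][:, ::-1]    # top components, largest variance first
    return Xc @ top

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 4] = X[:, 0] + 0.01 * rng.normal(size=100)  # a nearly redundant attribute
X2 = pca_reduce(X, 4)                             # drop the near-zero-variance direction
```

Because one attribute is almost a copy of another, the discarded component carries almost no variance, so the reduced dataset retains nearly all of the original's properties, as described above.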
¹ Provided by Dr. David Ehret of the Pacific Agri-Food Research Centre in BC, Canada.
² http://archive.ics.uci.edu/ml/
³ Gathered from http://www.restaurantica.com/ by Taha Azizi, the University of Lethbridge, Canada.
Improving Neural Networks Classification through Chaining 291
Two filters were applied on the datasets: a nominal-to-binary filter and a data normalization/centering filter. The nominal-to-binary filter converts all nominal variables into binary variables. If a nominal variable consists of k values, and if the class is also nominal, this filter will transform the variable into k binary attributes. The data normalization filter transforms variables' values into a given range. The normalization filter applied here normalizes all variables' values in a given dataset, including the class, to values in the range [0, 1]. The need for data normalization and centering stems from the use of the sigmoid thresholding function in neural networks.
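The two filters can be sketched in a few lines (the paper uses Weka's implementations; this numpy version is only illustrative):

```python
import numpy as np

def nominal_to_binary(column):
    """Expand a nominal column with k distinct values into k binary indicator columns."""
    values = sorted(set(column))
    return np.array([[1.0 if v == val else 0.0 for val in values] for v in column])

def normalize_01(X):
    """Rescale every column linearly into [0, 1] (min-max normalization)."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return (X - lo) / span
```

For example, a nominal column with values {a, b, c} becomes three 0/1 columns, and min-max scaling maps each numeric column onto [0, 1] so the values sit in the effective range of the sigmoid.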
For each of the produced datasets and the original dataset, we create two styles of neural networks: an SLC-style chain, denoted SLCs, and an MLC-style chain, denoted MLCs. In order to determine the best performing network for each chain link, a "trial and error" process was conducted using a relatively small range of values for learning rates, momenta, epochs, etc. The best chain link is the one that reduces the error (mean absolute error, MAE) the most. The use of small ranges for training parameters is intended to speed up the chain-link selection process.
Fig. 2. DAGA SLC networks vs. regular neural networks
Fig. 3. DAGA MLC networks vs. regular neural networks
A total of 20 SLCs and MLCs chain links were attempted for each dataset. For SLC, only the best performing chain link was chosen. For MLC, a subset of chain links was selected, starting from the first chain link up to the chain link that reduced error the most. After the best numbers of chain links have been determined, the entire chain is trained using a larger range of parameters and evaluated using 10-fold cross-validation. Then, its performance is compared against a typical individual neural network. Our proposed approach was implemented on top of Weka⁴. All the experiments were run on a network of computers controlled by a distributed computing system called Condor⁵.
epochs after their structures have been determined. Tables 2, 3, 4, and 5 show a comparison between SLCs and MLCs performance for each dataset and its subsets by highlighting the lowest error values achieved. In these tables, we also show the numbers of chain links for SLCs and MLCs.
Table 2. DAGA chain links summary

              all     cfs     pca     reliefF
SLCs  Link #  19      16      14      15
      MAE     0.6252  0.7806  0.8611  0.8911
MLCs  Link #  6       6       6       20
      MAE     0.6672  0.7366  0.5821  0.7611

Table 3. DSPF chain links summary

              all     cfs     pca     reliefF
SLCs  Link #  10      7       19      4
      MAE     0.0858  0.0968  0.0913  0.0856
MLCs  Link #  7       4       2       5
      MAE     0.0887  0.0932  0.0943  0.0843
The power of our approach becomes evident when it is compared against typical neural networks. For DAGA, Figs. 2 and 3 show that both SLCs chains and MLCs chains have significantly outperformed their corresponding typical neural networks, and have done so early in the learning process, regardless of the subsets used. The same situation is observed in Figs. 6 and 7 for DRER.
Figs. 4 and 5 for DSPF show that the performances of SLCs and MLCs chains differ from the chains of the other datasets. In Fig. 4, three of the four SLCs chains, i.e., SLCcfs, SLCpca, and SLCreliefF, have slightly outperformed their corresponding typical neural network, and in Fig. 5 two MLCs chains, MLCcfs and MLCpca, have slightly outperformed their corresponding typical neural networks. However, NNall has significantly outperformed all SLCs and MLCs chains. This might be due to the fact that the data in DSPF is already of high quality, since it has been used for a long time. Any change to its attributes could only lead to poor classification performance.
Fig. 4. DSPF SLC networks vs. regular neural networks
Fig. 5. DSPF MLC networks vs. regular neural networks
From Table 2 and Fig. 2 it is interesting to note that the chain that reduced error the most in the chain-link selection process did not necessarily reduce error the most when trained. In the case of SLCs, for example, SLCall has reduced the
Fig. 6. DRER SLC networks vs. regular neural networks
Fig. 7. DRER MLC networks vs. regular neural networks
Table 4. DRER chain links summary

              all     cfs     pca     reliefF
SLCs  Link #  19      2       13      16
      MAE     0.0537  0.0461  0.0716  0.0367
MLCs  Link #  10      7       2       10
      MAE     0.0562  0.0469  0.0640  0.0387

Table 5. DSYN chain links summary

              all     cfs     pca     reliefF
SLCs  Link #  12      15      16      11
      MAE     4.9494  4.6457  4.7848  4.4919
MLCs  Link #  3       2       6       12
      MAE     5.0767  4.8037  4.9475  4.6738
It is noteworthy that the typical neural networks have demonstrated some overfitting patterns which disappeared or were reduced in our approach. For DAGA, in Fig. 2 (SLC) we can see that NNall, NNpca, and NNreliefF have shown a slight overfitting pattern which disappeared in the corresponding SLCs chains. On the other hand, Fig. 3 (MLC) shows that MLCpca demonstrated an overfitting pattern. The same situation is noted in Figs. 6 and 7 for DRER, where NNall, NNcfs, and NNpca have demonstrated overfitting that disappeared in SLCall and SLCpca, and was reduced in SLCcfs and MLCcfs. This could hint that our approach allows further training while delaying overfitting.
Fig. 8. DSYN SLC networks vs. regular neural networks
Fig. 9. DSYN MLC networks vs. regular neural networks
When applying SLC to DSYN, we expected that the process would fail and produce large errors from the start, since it is a synthetic dataset. Figs. 8 and 9 show that this is exactly the case: generally, the error increases as the number of epochs increases. Table 5 is shown here for the purpose of the chain selection process.
The performance of SLC and MLC varies when they were compared to each other.
Table 6 shows that MLC have outperformed SLC in three of the four chains of DAGA ,
namely M LCall , M LCpca , and M LCrelief F with considerable differences while SLC
have outperformed MLC in one subset, SLCcf s , with a marginal difference. In DSP F
M LCs performed better than SLCs in M LCcf s and M LCpca and worse than SLCs in
SLCall and SLCrelief F . MLC and SLC for DRER have performed relatively similar.
MLC have slightly outperformed SLC in M LCpca and M LCrelief F while the opposite
is true in SLCall and SLCcf s .
performed experiments to establish their effectiveness. Table 6 showed that MLC outperformed SLC and the typical neural networks in eight out of 16 cases, while SLC outperformed MLC and the typical networks in six out of 16, and the typical networks outperformed both variations in only two cases.
Our work aligns with previous efforts on increasing the effectiveness of neural networks, in particular their classification accuracy. It would be beneficial to compare our proposed chaining methods with similar previous results, e.g., the works in [7,8], to further explore their relative performance. Due to time and space considerations, we omit these comparisons in this initial report, but we plan to pursue this direction further; it will be our main focus in the near future.
In addition, we will further analyze our experimental results and provide theoretical justifications as to why our approach works, or under what conditions it fails. Moreover, we plan to investigate how to choose the number of chain links, or how to detect a single good chain link. This will lead to automating and speeding up the chaining process.
References
1. Zhang, G.P.: Neural networks for classification: a survey. IEEE Transactions on Systems,
Man, and Cybernetics, Part C: Applications and Reviews 30, 451–462 (2000)
2. Lippmann, R.P.: An introduction to computing with neural nets. IEEE ASSP Magazine 3,
4–22 (1987)
3. Fahlman, S., Lebiere, C.: The cascade-correlation learning architecture. In: Advances in Neu-
ral Information Processing Systems, vol. 2, pp. 524–532 (1990)
4. Frean, M.: The upstart algorithm: A method for constructing and training feedforward neural
networks. Neural Computation 2, 198–209 (1990)
5. Cun, Y.L., Denker, J., Solla, S.: Optimal brain damage. In: Advances in Neural Information
Processing Systems, pp. 598–605 (1990)
6. Lu, B., Ito, M.: Task decomposition and module combination based on class relations: a
modular neural network for pattern classification. IEEE Transactions on Neural Networks 10,
1244–1256 (1999)
7. Hansen, L.K., Salamon, P.: Neural network ensembles. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence 12, 993–1001 (1990)
8. Maclin, R.: An empirical evaluation of bagging and boosting. In: Proceedings of the Four-
teenth National Conference on Artificial Intelligence, pp. 546–551. AAAI Press (1997)
9. Helmer, T., Ehret, D.L., Bittman, S.: Cropassist, an automated system for direct measurement
of greenhouse tomato growth and water use. Computers and Electronics in Agriculture 48,
198–215 (2005)
10. Abdi, H., Williams, L.J.: Principal component analysis. Wiley Interdisciplinary Reviews:
Computational Statistics 2, 433–459 (2010)
11. Hall, M.A.: Correlation-based feature selection for discrete and numeric class machine learning. In: Proc. 17th International Conference on Machine Learning, pp. 359–366 (2000)
12. Kira, K., Rendell, L.A.: The feature selection problem: Traditional methods and a new algo-
rithm. In: AAAI, pp. 129–134. AAAI Press and MIT Press (1992)
Feature Ranking Methods
Used for Selection of Prototypes
1 Introduction
Prototype selection is frequently used as a preprocessing step in machine learning. Originally it was designed to improve the accuracy of nearest-neighbour-based classifiers, but as shown in [1], these methods are also very useful for improving the quality of other prediction methods not necessarily based on the nearest neighbour approach.
Prototype selection (also called instance selection) is a process of selecting or constructing new instances based on the original set of examples such that n' ≪ n, where n is the number of instances and n' is the number of instances after selection. This condition is satisfied by removing redundant examples or outliers, or by rejecting irrelevant data from the training datasets. It can be shown that instance selection may be treated as a form of regularization that leads to improved predictive accuracy. There may also be other benefits obtained from instance selection. An example are prototype-based rules (P-rules) proposed in [2]. P-rules are based on capturing similarity to reference examples, keeping the minimum number of prototypes that is sufficient to achieve an appropriate error rate. Such prototypes may then be treated as representatives of the original data, or transformed into a rule-based representation, where the antecedent part of a single rule defines similarity between a prototype and a given vector. P-rules with additive similarity functions may also be represented as fuzzy rules, but are even more general [3].
Nowadays, new data are acquired and stored from many sources (sensors, web logs, etc.), leading to tremendous numbers of samples. This poses a real challenge for data mining tasks, because most machine learning algorithms are not designed to meet such needs. Moreover, many of the algorithms require serial processing, without any chance
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 296–304, 2012.
c Springer-Verlag Berlin Heidelberg 2012
of distributing the training process among nodes in a computer cluster. One possible solution to this problem is data filtering. One may observe that in large datasets the percentage of irrelevant or redundant samples is very high. This allows the application of instance selection methods such as ENN [4], CNN [5], or IB3 [6] (a survey of almost 70 selection algorithms can be found in [7]). However, these are not optimized to preserve low computational complexity.
For many years, great importance has been attached to feature selection. This has allowed the development of very effective and efficient methods which, after appropriate adaptation, can be used for instance selection. A good example is feature ranking, one of the most efficient approaches, which we believe may also meet the requirements of instance filtering in large datasets. In this paper we therefore present how to adjust feature ranking methods with standard feature ranking coefficients to solve the instance selection challenge.
The paper is organized as follows: in the next section feature selection algorithms based on feature ranking are presented; in Section 3 we discuss the adaptation steps needed to create a prototype selection algorithm based on feature ranking; illustrative examples for several benchmark datasets are presented in Section 4, and conclusions are given in the final section.
In the paper the following notation will be used: T = {[x_1, y_1], [x_2, y_2], ..., [x_n, y_n]} is a training set of input vectors x_i ∈ R^m defined in the m-dimensional feature space f = {f_1, f_2, ..., f_m}; y_i is the label associated with vector x_i, and the label attribute is denoted by y.
Feature selection is one of the fast-growing areas of machine learning research. One of the reasons for its growing importance is the increasing size of datasets, which may consist of tens or hundreds of thousands of variables. The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data [8].
Feature selection techniques rely on two main procedures: a search strategy and a
feature quality evaluation. The search strategy controls the order of evaluations of the
quality of feature subsets. Computational complexity of both the search strategy and the
feature evaluation should be low to meet the needs of high scalability. Feature ranking is based on evaluating each feature independently, calculating a feature-class relevance index H(f_i, y) of some kind [9]. These indices are then sorted in descending order of quality and the m' best features are selected.
An overview and comparison of different ranking indices H(·) can be found in [10]. Below, a short description of three commonly used indices, based on the Correlation Coefficient (CC), Mutual Information (MI), and the Switch Index (SI), is given, although our approach may be used with any index.
Correlation Coefficient. The absolute value of the linear correlation coefficient is a
widely used ranking criterion.
298 M. Blachnik, W. Duch, and T. Maszczyk
R_{CC}(f_j, y) = \frac{\left| \sum_i (f_{ji} - \bar{f}_j)(y_i - \bar{y}) \right|}{\sqrt{\sum_i (f_{ji} - \bar{f}_j)^2 \sum_i (y_i - \bar{y})^2}} \in [0, 1] \qquad (1)
It is proportional to the dot product between f_j and y centered at their average values.
The Correlation Coefficient is applicable to binary and continuous variables or target
values, which makes it very versatile. Categorical variables can be handled by using
some coding method. The Correlation Coefficient is a measure of linear dependency
between variables. Irrelevant variables that are not correlated with the target should
have RCC value near zero, but a value near zero does not necessarily indicate that the
variable is irrelevant: a non-linear dependency may exist, which is not captured by this
measure.
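As a concrete sketch (our own, not from the paper; the function name and interface are our choices), the CC relevance index of Eq. (1) can be computed for all features at once:

```python
import numpy as np

def cc_ranking(X, y):
    """Rank features by the absolute Pearson correlation with the target.

    X: (n, m) feature matrix; y: (n,) binary or continuous target.
    Returns feature indices sorted best-first by R_CC.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = np.abs(Xc.T @ yc)                              # |centered dot product|
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = num / den                                        # R_CC in [0, 1]
    return np.argsort(r)[::-1]
```

A constant feature would give a zero denominator; a production version would guard against that.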
Mutual Information. This criterion measures the strength of dependencies between random variables. It may assess the "information content" even in cases where methods based on linear relations are prone to mistakes.
The MI(f; y) index for random variables f and y is defined as:

MI(f; y) = \int \int P_{fy}(f, y) \log \frac{P_{fy}(f, y)}{P_f(f)\, P_y(y)} \, df \, dy \qquad (2)
Accuracy of estimation of the joint probability P_{fy} from experimental data is usually poor [9], especially for a large number of classes. Discretization of continuous variables into boxes of size Δf Δy is commonly used to calculate discrete P_{fy} values (Parzen windows are an obvious alternative). The MI index can then be re-written as:

MI(f; y) = \sum_{r_f} \sum_{r_y} P_{fy}(r_f, r_y) \log \frac{P_{fy}(r_f, r_y)}{P_f(r_f)\, P_y(r_y)} \qquad (3)
where r_f and r_y are the intervals for the f and y variables. Estimating the MI dependence of two variables using N_f, N_y intervals requires estimation of N_f N_y probabilities, and estimation of the index for pairs of features f, g requires estimation of N_f N_g N_y probabilities, which is not only costly but also leads to significant errors unless a lot of training data is available.
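The discretized estimate of Eq. (3) can be sketched as follows (our own illustration; the binning scheme and function name are assumptions, not the paper's implementation):

```python
import numpy as np

def mi_index(f, y, bins=10):
    """Histogram estimate of MI(f; y) as in Eq. (3): the feature f is
    discretized into `bins` intervals, the labels y stay categorical."""
    edges = np.histogram_bin_edges(f, bins=bins)
    rf = np.digitize(f, edges[1:-1])               # interval index 0..bins-1
    classes, ry = np.unique(y, return_inverse=True)
    joint = np.zeros((bins, len(classes)))         # counts -> P_fy(r_f, r_y)
    for a, b in zip(rf, ry):
        joint[a, b] += 1.0
    joint /= joint.sum()
    pf = joint.sum(axis=1, keepdims=True)          # marginal P_f(r_f)
    py = joint.sum(axis=0, keepdims=True)          # marginal P_y(r_y)
    nz = joint > 0                                 # skip empty cells (0 log 0 = 0)
    return float((joint[nz] * np.log(joint[nz] / (pf @ py)[nz])).sum())
```

For a perfectly informative binary feature this returns log 2 (in nats); for an independent feature it returns 0.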
Switch Index. This simple measure is applicable to the typical classification problem with features f that can be ordered and discrete target variables y. Sorting the values of variable f in ascending order, one calculates how many times the value of variable y changes as subsequent f values are taken. If the correlation between feature values and labels is ideal, the number of switches is equal to the number of classes minus one; if there is no correlation, each increase of f may result in a change of y. The number of switches is normalized to fit the range [0, 1].
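The counting step above is easy to sketch. The paper does not spell out the exact normalization, so the one below (ideal case, i.e., classes − 1 switches, maps to 0; the maximal n − 1 switches maps to 1) is our assumption:

```python
import numpy as np

def switch_index(f, y):
    """Count label switches when instances are sorted by feature f,
    normalized to [0, 1]: 0 = classes perfectly separable along f,
    1 = maximal switching. The normalization is one plausible choice."""
    order = np.argsort(f, kind="stable")
    ys = np.asarray(y)[order]
    switches = int(np.sum(ys[1:] != ys[:-1]))
    c = len(np.unique(ys))                 # number of classes
    n = len(ys)
    return (switches - (c - 1)) / ((n - 1) - (c - 1))
```

Note that under this convention lower values indicate a better feature, so as a relevance index one would rank by, e.g., 1 minus this value.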
index H(f, f') to account for redundancy. The algorithm chooses features that maximize the difference between H(f, y) and all H(f, f'), where f' are the features already selected. The algorithm selects M features and is formalized as follows:
1. Set F ← "the initial set of features"; S ← ∅.
2. For each feature f ∈ F, compute H(f, y).
3. Find the feature f that maximizes H(f, y).
4. Move f from F to S: F ← F \ {f}; S ← {f}.
5. Repeat until |S| = M:
   – For all pairs of variables f ∈ F, f' ∈ S, calculate H(f; f').
   – Choose the feature f that maximizes H(f, y) − (β/|S|) Σ_{f'∈S} H(f; f').
   – Move f from F to S: F ← F \ {f}; S ← S ∪ {f}.
The parameter β regulates the relative importance of the feature-feature relevance H(f, f') for all already selected features with respect to the feature-class relevance H(f, y). The recommended β value is between 0.5 and 1 [11]. This ranking selects features that are highly relevant for the target and loosely correlated with the other selected features. This procedure is more costly than simple ranking, because calculation of the feature-feature indices requires O(m²n) operations. However, instead of all m features one can take into account only the top-ranked features that are worth considering.
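The selection loop above can be sketched as follows, assuming the relevance and redundancy indices have been precomputed (the interface is ours, not the paper's):

```python
def greedy_selection(H_rel, H_red, M, beta=0.7):
    """Battiti-style greedy selection.
    H_rel[j]    = H(f_j, y), feature-class relevance;
    H_red[j][k] = H(f_j; f_k), feature-feature redundancy.
    Returns the ordered list S of M selected feature indices."""
    F = set(range(len(H_rel)))
    S = []
    best = max(F, key=lambda j: H_rel[j])      # step 3: most relevant feature
    F.remove(best)
    S.append(best)
    while len(S) < M and F:                    # step 5: redundancy-penalized picks
        best = max(F, key=lambda j: H_rel[j]
                   - beta / len(S) * sum(H_red[j][k] for k in S))
        F.remove(best)
        S.append(best)
    return S
```

With β = 1, a feature strongly redundant with an already selected one is skipped in favour of a weakly relevant but independent feature.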
[Fig. 1: four panels plotting accuracy [%] against compression [%] (10 to 110) for the MI, CC, and SI ranking criteria.]
Fig. 1. Comparison of the influence of the ranking criterion on the accuracy of the RBIS algorithm; for Diabetes the differences were not significant
4 Illustrative Examples
The usefulness of the method presented in this paper has been evaluated on five popular datasets from the UCI repository [12], including Ionosphere, Sonar, Car, and Diabetes (also known as the "Pima Indians Diabetes" dataset). Only the last one has a relatively simple structure [13]. In our experiments, the relation between data compression and achievable accuracy was analyzed for each dataset. For that purpose the algorithm selected 20%, 40%, 60%, 80%, or 100% of the samples, and the accuracy of the system was evaluated. All calculations are wrapped in a cross-validation test to estimate the expected accuracy. For comparison, the Monte Carlo, ENN [4], CNN [5], and IB3 [6] instance selection algorithms are also used, along with methods based on editing the distance graph, such as Gabriel Editing (GE) and the Relative Neighbour Graph (RNG) [14,15].
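The compression-accuracy evaluation can be sketched as follows. RBIS itself is not fully specified in this excerpt, so a generic per-instance relevance score is taken as input, and a plain 1-NN classifier stands in for the evaluated system (both are our assumptions):

```python
import numpy as np

def accuracy_at_compression(X_tr, y_tr, score, keep_frac, X_te, y_te):
    """Keep the top-scored fraction of training instances and report
    1-NN accuracy on a held-out fold."""
    k = max(1, int(round(keep_frac * len(y_tr))))
    keep = np.argsort(score)[::-1][:k]              # best-ranked instances first
    Xp, yp = X_tr[keep], y_tr[keep]
    # brute-force 1-NN: squared Euclidean distances test x prototypes
    d = ((X_te[:, None, :] - Xp[None, :, :]) ** 2).sum(axis=-1)
    pred = yp[d.argmin(axis=1)]
    return float((pred == y_te).mean())
```

Sweeping `keep_frac` over 0.2, 0.4, ..., 1.0 inside each cross-validation fold reproduces the kind of compression-vs-accuracy curves shown in Fig. 1.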
In the first series of experiments, two different aspects of the RBIS algorithm are compared. First, the influence of the ranking criterion is analyzed. Results are presented graphically in Fig. 1.
Unlike the results of the feature ranking comparison [10], the ranking criterion used for instance selection has a strong influence on the results. In almost all cases the Switch
[Fig. 2: five panels, (a) through (e), plotting accuracy [%] against compression [%] (10 to 110) for redundancy rejection parameter values 0, 0.01, 0.1, and 0.5; panel (e) shows Diabetes.]
Fig. 2. Comparison of the influence of the redundancy rejection parameter on the accuracy of the RBIS algorithm
Index performed best, but for the Car data Mutual Information is much better. This may be related to the special properties of the distance function to which the indices were applied and, in the case of the Car data, to the discrete nature of its features; it also points to the need to select the optimal ranking coefficient for a given problem.
[Fig. 3: four panels plotting accuracy against compression for the MC, RBIS, and RBIS+ curves, with single operating points marked for ENN, CNN, IB3, GE, and RNG.]
Fig. 3. Comparison of the RBIS and RBIS+ algorithms to five state-of-the-art methods: ENN, CNN, IB3, GE, RNG
In the second series of experiments the effect of redundancy rejection was tested. To verify the performance of the algorithm described in Section 2.1, five β values were tested. The results presented in Fig. 2 show that for some data the redundancy filter may significantly affect the accuracy if high data compression is desired, although the computational complexity strongly increases in this case.
A comparison of the RBIS and RBIS+ algorithms to the state-of-the-art prototype selection algorithms ENN, CNN, IB3, GE and RNG is presented in Fig. 3. The results show that our algorithms based on instance ranking are quite competitive with these methods, allowing regulation of the number of prototypes left, while keeping computational complexity low (for RBIS).
5 Conclusions
The Ranking-Based Instance Selection (RBIS) algorithm for prototype selection based on feature ranking methods has been described. It follows the basic approach of feature selection, using instance ranking to determine the relevance of particular data samples. The use of the dualism between feature and instance selection in kernel spaces seems to be a novel idea worth further exploration.
Results of numerical experiments showed the importance of selecting a good ranking coefficient. Surprisingly, in almost all experiments the simple Switch Index achieved better results than the state-of-the-art indices based on Mutual Information or Pearson's Correlation Coefficient. This may be due to the properties of the obtained distance matrix. The results of the RBIS algorithm are competitive with the state-of-the-art methods, although for small numbers of prototypes CNN achieved slightly better results on the Sonar and Ionosphere data. This may be related to the problem of redundancy, because similar instances usually have similar ranking indices. The methodology of rejecting redundant instances, based on Battiti's feature selection algorithm, although computationally expensive, sometimes gives good results when selecting a small number of vectors. Room for improvement includes replacing the Euclidean distance with different kernels (e.g., Gaussian-weighted kernels that stress the importance of closer distances), as well as using clustering methods to reduce the influence of redundancy by selecting a single prototype from each cluster.
Acknowledgment. The work was sponsored by the Polish Ministry of Science and Higher Education, project No. 4421/B/T02/2010/38 (N516 442138). The software package is available at http://www.prules.org
References
1. Grochowski, M., Jankowski, N.: Comparison of Instance Selection Algorithms II. Results
and Comments. In: Rutkowski, L., Siekmann, J.H., Tadeusiewicz, R., Zadeh, L.A. (eds.)
ICAISC 2004. LNCS (LNAI), vol. 3070, pp. 580–585. Springer, Heidelberg (2004)
2. Duch, W., Grudziński, K.: Prototype based rules - new way to understand the data. In: IEEE
International Joint Conference on Neural Networks, pp. 1858–1863. IEEE Press, Washington
D.C. (2001)
3. Duch, W., Blachnik, M.: Fuzzy Rule-Based Systems Derived from Similarity to Prototypes.
In: Pal, N.R., Kasabov, N., Mudi, R.K., Pal, S., Parui, S.K. (eds.) ICONIP 2004. LNCS,
vol. 3316, pp. 912–917. Springer, Heidelberg (2004)
4. Wilson, D.L.: Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans.
Systems, Man and Cybernetics 2, 408–421 (1972)
5. Hart, P.E.: The condensed nearest neighbor rule. IEEE Transactions on Information The-
ory 114, 515–516 (1968)
6. Aha, D., Kibler, D., Albert, M.K.: Instance-based learning algorithms. Machine Learning 6,
37–66 (1991)
7. Salvador, G., Joaquin, D., Cano, J.R., Herrera, F.: Prototype selection for nearest neighbor
classification: Taxonomy and empirical study. IEEE Transactions on Pattern Analysis and
Machine Intelligence 34(3), 417–435 (2010)
8. Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L.: Feature extraction, foundations and applica-
tions. Springer, Heidelberg (2006)
9. Duch, W.: Filter methods. In: Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L. (eds.) Feature
Extraction, Foundations and Applications, pp. 89–118. Springer, Physica Verlag, Heidelberg,
Berlin (2006)
10. Duch, W., Wieczorek, T., Biesiada, J., Blachnik, M.: Comparison of feature ranking methods based on information entropy. In: Proc. of International Joint Conference on Neural Networks, pp. 1415–1420. IEEE Press, Budapest (2004)
11. Battiti, R.: Using mutual information for selecting features in supervised neural net learning.
IEEE Trans. on Neural Networks 5, 537–550 (1994)
12. Asuncion, A., Newman, D.: UCI machine learning repository (2007), http://www.ics.uci.edu/~mlearn/MLRepository.html
13. Duch, W., Maszczyk, T., Jankowski, N.: Make it cheap: learning with O(nd) complexity. In:
2012 IEEE World Congress on Computational Intelligence, Brisbane, Australia, pp. 132–135
(2012)
14. Bhattacharya, B.K., Poulsen, R.S., Toussaint, G.T.: Application of proximity graphs to edit-
ing nearest neighbor decision rule. In: Proc. Int. Symposium on Information Theory, Santa
Monica, CA, pp. 1–25 (1981)
15. Toussaint, G.T.: The relative neighborhood graph of a finite planar set. Pattern Recogni-
tion 12(4), 261–268 (1980)
A “Learning from Models” Cognitive Fault
Diagnosis System
1 Introduction
Unsupervised cognitive fault diagnosis systems for complex dynamic plants take advantage of machine learning algorithms to capture a description of the plant's nominal state and assess potential fault-induced changes by inspecting deviations from nominal conditions. Fault isolation and classification phases follow, exploiting information and features extracted from the available datastreams. As a consequence, these cognitive fault diagnosis systems are able to work with appreciable accuracy even when a model for the system under monitoring is partially or totally unavailable. In fact, the ability to learn the nominal condition during the operational life does not require any a priori knowledge about the nature of the fault and its time profile, hence making feasible the on-line generation and maintenance of the fault dictionary, i.e., the dictionary containing the fault signatures.
While the literature on fault diagnosis provides rather well-established techniques (e.g., see [3]), research on cognitive fault diagnosis systems is relatively new, with few works available [4,5,6,7], mainly focusing on specific application cases within an evolving framework. [3] (Chap. 16) suggests an unsupervised "clustering-labeling" approach based on SOMs which builds rules during training to distinguish nominal from faulty states. Similarly, [4] and [5] provide diagnostic mechanisms for a quality inspection system. An evolution-based fuzzy-neural approach to fault diagnosis for marine propulsion systems is presented in [6]. In
This research has been funded by the European Commission's 7th Framework Programme, under Grant Agreement INSFO-ICT-270428 (iSense).
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 305–313, 2012.
c Springer-Verlag Berlin Heidelberg 2012
306 C. Alippi, M. Roveri, and F. Trovò
[7], nominal states are separated from faulty ones by relying on a fuzzy c-means
algorithm. In general, these solutions confine the evolving approach solely to the
training phase.
The paper suggests an evolving model-free mechanism for cognitive fault diagnosis working in the space of model parameters. The approach can be briefly summed up as follows:
– the parameters of a suitable linear model are extracted during the training phase and used to characterize the nominal conditions of the plant; no assumption about the linearity of the process generating the data is made;
– when deviations from the nominal condition are detected, the fault dictionary is used to classify the fault. At the beginning no fault dictionary is given, and the algorithm automatically builds it over time by following an evolving mechanism. Procedures for the management of the dictionary, e.g., collapsing two equivalence classes, are also provided.
The novelty of the proposed approach resides in the justification of the use of linear models for building the fault dictionary and in the clustering-based evolving algorithm, which is able to automatically update the fault dictionary over time. During the operational life, the proposed algorithm detects whether new parameters describing the approximate model can be regarded as generated by the nominal state, by a previously identified type of fault (each cluster of parameters is associated with a class of faults), or by a new faulty state, or whether they are outliers.
Since the key point of the analysis refers to the use of linear dynamic systems, and given that the linear hypothesis can rarely be accepted in a real application scenario, we review in Section 2 the theoretical framework justifying such a choice. Section 3 describes the proposed algorithm for a cognitive fault diagnosis system, while Section 4 presents the experimental results.
2 Problem Formulation
In the following, we assume that the process P under investigation is unknown and time invariant. In reality, this assumption can be weakened to also admit a time-variant process, provided that the time variance is explored during the training phase (e.g., the nominal state can be characterized through a Markov process in the parameter space). In other words, we are assuming that the nominal state can be characterized through approximating linear models regardless of whether P is time variant or not.
We approximate P by considering discrete-time linear MISO models:

A(z)\, y(t) = \sum_{i=1}^{m} \frac{B_i(z)}{F_i(z)}\, u_i(t) + \frac{C(z)}{D(z)}\, d(t)
where y(t) ∈ R is the output of the system at time t, u(t) = (u1 (t), . . . , um (t)) ∈
Rm is the vector of input samples at time t, d(t) is an independent and identically
Evolving Fault Isolation 307
distributed (i.i.d.) random variable accounting for the noise, z is the time-shift
operator, A(z), Bi (z), C(z), D(z), and Fi (z) represent the z-transform functions,
whose parameter vectors are θA , θB1 , . . . , θBm , θC , θD , θF1 , . . . , θFm , respectively.
By using this notation, an element f_θ in the approximating linear model family M(θ) can be fully described by a parameter vector θ ∈ R^p encompassing the above parameter vectors. In the following, we assume that the model does not degenerate, an assumption which might eventually require an under-dimensioned model (here we are concerned with diagnosis ability more than approximation accuracy).
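To make the parameter extraction concrete, here is a sketch (ours, not the paper's implementation) for the simplest ARX member of the family above, i.e., C(z) = D(z) = F_i(z) = 1 with a single input; the least-squares estimate θ̂_N plays the role of the signature used for diagnosis:

```python
import numpy as np

def fit_arx(y, u, na=2, nb=2):
    """Least-squares fit of an ARX special case of the MISO family:
    y(t) = -a_1 y(t-1) - ... - a_na y(t-na) + b_1 u(t-1) + ... + b_nb u(t-nb).
    Returns theta = (a_1..a_na, b_1..b_nb), the model signature."""
    k = max(na, nb)
    rows = [[-y[t - j] for j in range(1, na + 1)]
            + [u[t - j] for j in range(1, nb + 1)]
            for t in range(k, len(y))]
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(y[k:]), rcond=None)
    return theta
```

Refitting θ on successive batches of (u, y) data and tracking the resulting vectors in parameter space is exactly the setting in which the clustering described below operates.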
A linear model places our framework on a solid mathematical ground [1,2]
despite the potentially introduced model bias ||fθ − P||. More specifically, let
M(θ) be a model family fθ parameterized in θ ∈ DM , DM being a compact
subset in R^p. Consider the loss function V_N(u, y, θ): R^m × R × R^p → R providing, given a training dataset composed of N samples {u(t), y(t)}_{t=1}^{N}, an estimate θ̂_N of the optimal parameter θ* = arg min_{θ∈D_M} lim_{N→+∞} W_N(θ), where W_N(θ) = E[V_N(u, y, θ)]. Under the hypothesis that:
– P satisfies the exponential stability for the closed loop system, i.e., it is
possible to generate accurate approximations of y(t) given time windows of
y(·) and u(·) without requiring data coming from the remote past;
– fθ is, at most, linear with respect to u(t), y(t) and three times differentiable
w.r.t. θ;
– VN (u, y, θ) has partial derivatives up to order 3 bounded by a constant;
– ∃β ∈ R+ , ∃N0 ∈ R+ s.t. WN (θ) > βI ∀θ ∈ DM , N ≥ N0 ;
from [1] we have that
\sqrt{N}\, P_N^{-1/2} \left( \hat{\theta}_N - \theta^* \right) \sim \mathcal{N}(0, I_p), \quad N \to \infty, \qquad (1)
It is worth noting that the term we add at each parameter vector insertion is proportional to the spatial (||θ̂_{N,i} − θ̂_{N,k}||) and temporal (|i − k|) proximities.
To decide whether or not to create a cluster, the algorithm verifies whether ∃k* ∈ X(O) s.t. ω_{k*}(i) ≥ η_o. By setting η_o differently, we decide how densely the parameter vectors should aggregate in order to identify a new cluster. The eventually created cluster Φ_{φ+1} is formed by the set O' ⊆ O with the longest and most recent temporally contiguous sequence of parameter vectors. O' becomes a new cluster and its elements are removed from O. Afterwards, the algorithm updates the parameter ω_k for all remaining points in O:
\omega_k(i) = \omega_k(i) - \sum_{h \in X(O')} \exp\left( -\frac{||\hat{\theta}_{N,k} - \hat{\theta}_{N,h}||}{p} - |k - h| - \alpha\,(i - \max(k, h)) \right),
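In code, the update reads roughly as follows. This is a sketch under our own bookkeeping assumptions: parameter vectors are stored row-wise in batch-index order, `outliers` holds the indices still in O, and `new_cluster` holds the indices moved into X(O'):

```python
import numpy as np

def update_outlier_weights(omega, thetas, outliers, new_cluster, i, alpha=0.001):
    """Decrease the aggregation weight omega[k] of each remaining outlier k
    by the spatial/temporal proximity terms contributed by the points h
    that were just absorbed into the new cluster."""
    p = thetas.shape[1]                      # dimension of parameter space
    for k in outliers:
        for h in new_cluster:
            dist = np.linalg.norm(thetas[k] - thetas[h])
            omega[k] -= np.exp(-dist / p - abs(k - h) - alpha * (i - max(k, h)))
    return omega
```

Points that were spatially and temporally close to the absorbed sequence lose most of their accumulated weight, so they no longer count toward triggering another cluster.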
4 Experimental Results
– p_c: the percentage of runs where the algorithm identified the correct number of clusters;
– ε: the percentage of parameter vectors assigned to the wrong cluster;
– p_o: the percentage of parameter vectors assigned to the outlier set O;
– τ: the delay (in terms of the number of batches) between the occurrence of the change and the creation of the corresponding cluster;
– n_c: the number of clusters at the end of the experiment (to compare it with the expected one).
All these values are computed at the end of each experiment; ε, p_o and τ are calculated only if the algorithm correctly detects the number of clusters, and are averaged over independent experiments. In the following, the clustering algorithm parameters are set to η_s = 3, η_o = 1 and α = 0.001.
                   Application D1                       Application D2
           δ      p_c    ε      p_o    τ      n_c     p_c    ε      p_o      τ       n_c
Profile 1  0.01   0%     N.a.   N.a.   N.a.   1.02    0%     N.a.   N.a.     N.a.    1.005
           0.05   45%    0.0300 0.0795 5.0926 4.86    12.5%  0.2680 0.0983   11.6667 2.055
           0.075  40%    0.0273 0.0807 4.8292 4.92    32.5%  0.0968 0.0774   6.3616  4.605
           0.1    44.5%  0.0269 0.0785 4.8801 4.865   39.5%  0.0337 0.0589   4.7137  4.93
           0.15   42%    0.0263 0.0778 4.8492 4.935   32.5%  0.0330 0.0674   5.0154  5.075
Profile 2  0.01   1%     0.474  0.1178 39.5   1.01    0%     N.a.   N.a.     N.a.    1
           0.05   82%    0.0169 0.0343 4.9512 2.21    34%    0.4225 0.0627   6.6912  1.43
           0.075  88%    0.0170 0.0363 4.9375 2.14    75%    0.1579 0.0414   5.46    2.295
           0.1    84%    0.0172 0.0359 4.9583 2.19    82.5%  0.0204 0.0361   5.0061  2.225
           0.15   72.5%  0.0168 0.0458 4.9517 2.375   84.5%  0.0155 0.035469 4.8994  2.185
We comment that the results for fault profile 2 show that the algorithm also works well in slowly drifting time-variant environments. The simulation results for Application D2 are particularly interesting, since they show the performance of the algorithm in the presence of a strong model bias. We appreciate that these results are coherent with those of the linear case when the magnitude δ is 0.075 or above. We note that the effect of this strong nonlinearity is an increase in the values of the covariance matrix for each state. For this reason, the magnitude δ must be equal to or above 0.075 for the change to be detected, while in Application D1 the algorithm provides good performance even when δ is as low as 0.05.
5 Conclusion
The paper presents an evolving mechanism for cognitive fault diagnosis able to detect and classify faults by automatically creating the fault dictionary (initially empty) during the operational phase. The novelty of the proposed approach resides in the theoretically grounded framework that allows us to work in the space of linear approximation models even if the system under monitoring is nonlinear. The experimental section shows the effectiveness of the proposed solution.
References
1. Ljung, L.: Convergence analysis of parametric identification methods. IEEE Trans-
actions on Automatic Control 23(5), 770–783 (1978)
2. Ljung, L., Caines, P.E.: Asymptotic normality of prediction error estimators for
approximate system models. IEEE Decision and Control 17, 927–932 (1978)
3. Isermann, R.: Fault-diagnosis systems, an introduction from fault detection to fault
tolerance. Springer (2006)
4. Fochem, M., Wischnewski, P., Hofmeier, R.: Quality control systems on the production line of tape deck chassis using self-organizing feature maps. In: Proc. 1st European Symp. on Applications of Intelligent Technologies (1997)
Evolving Fault Isolation 313
5. Naresh, R., Sharma, V., Vashisth, M.: An integrated neural fuzzy approach for fault diagnosis of transformers. IEEE Trans. Power Del. 23(4), 2017–2024 (2008)
6. Kuo, H.C., Chang, H.K.: A new symbiotic evolution-based fuzzy-neural approach to
fault diagnosis of marine propulsion systems. Engineering Applications of Artificial
Intelligence 17, 919–930 (2004)
7. Joentgen, A., Mikenina, L., Weber, R., Zeugner, A., Zimmermann, H.J.: Auto-
matic fault detection in gearboxes by dynamic fuzzy data analysis. Fuzzy Sets and
Systems 105, 123–132 (1999)
8. Huang, G., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: Theory and appli-
cations. Neurocomputing 70(1-3), 489–501 (2006)
9. Schrauwen, B., Verstraeten, D., Van Campenhout, J.: An overview of reservoir computing: theory, applications and implementations. In: Proc. European Symp. on Artificial Neural Networks, pp. 471–482 (2007)
10. Nasraoui, O., Rojas, C.: Robust clustering for tracking noisy evolving data streams.
In: Proc. SIAM Conf. Data Mining, pp. 618–622 (2006)
11. Song, Q., Kasabov, N.: ECM - A Novel On-line, Evolving Clustering Method and
Its Applications. Found. of cognitive science, pp. 631–682. MIT Press (2001)
12. Angelov, P., Filev, D.P., Kasabov, N.: Evolving intelligent systems: Methodology
and Applications, vol. 12. Wiley-IEEE Press (2010)
13. Johnson, R.A., Wichern, D.W.: Applied multivariate statistical analysis. Prentice Hall, Upper Saddle River (2002)
Improving ANNs Performance on Unbalanced
Data with an AUC-Based Learning Algorithm
Abstract. This paper investigates the use of the Area Under the ROC Curve (AUC) as an alternative criterion for model selection in classification problems with unbalanced datasets. A novel algorithm, named here AUCMLP, which incorporates AUC optimization into the Multi-layer Perceptron (MLP) learning process, is presented. The basic principle of AUCMLP is the solution of an optimization problem that targets ranking quality as well as the separability of the class distributions with respect to the decision threshold. Preliminary results achieved on real data point out that our approach is promising and can lead to better decision surfaces, especially under more severe unbalance conditions.
1 Introduction
Global squared error functions are often used in error-correction learning since they simplify the optimization problem, especially for algorithms based on gradient descent. Many current learning algorithms for Artificial Neural Networks (ANNs) have inherited this learning principle from Backpropagation [1]. Nevertheless, a global error function may fail to represent properly the true error of unbalanced classification problems. In such problems, the discrimination function tends to favor the majority class, since the global error function assumes uniform losses for all training samples, regardless of the prior probability of the corresponding class [2]. Model performance on each separate class is often considered as a final criterion for model assessment, but it is not usually embodied in the adaptive learning procedure.
Performance assessment and model selection for unbalanced learning problems have often been accomplished with the aid of the ROC (Receiver Operating Characteristic) curve [3], which represents the relationship between the true positive rate (TPrate) and the false positive rate (FPrate) of a family of classifiers resulting from different output thresholds. A more robust criterion extracted from the ROC curve is the AUC (Area Under the ROC Curve), a global metric over all thresholds that is independent of the class prior probabilities. Because of that,
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 314–321, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Improving ANNs with an AUC-Based Learning Algorithm 315
the AUC has been applied to ranking quality estimation [4] and also to highly
unbalanced learning problems [5].
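Since the AUC equals the probability that a randomly drawn positive example is ranked above a randomly drawn negative one, it can be computed directly as the normalized Wilcoxon-Mann-Whitney statistic. A minimal sketch (the helper name `auc_wmw` is ours, not from the paper):

```python
import numpy as np

def auc_wmw(scores_pos, scores_neg):
    """AUC as the normalized Wilcoxon-Mann-Whitney statistic: the
    fraction of positive/negative pairs ranked correctly (ties = 1/2)."""
    sp = np.asarray(scores_pos, dtype=float)
    sn = np.asarray(scores_neg, dtype=float)
    diff = sp[:, None] - sn[None, :]        # all N1*N2 pairwise differences
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# A perfect ranking yields AUC = 1.0 regardless of class imbalance:
print(auc_wmw([0.9, 0.8], [0.1, 0.2, 0.3, 0.4]))  # -> 1.0
```

Note that the statistic depends only on the ranking of the scores, not on the class priors, which is exactly why it suits unbalanced data.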
Despite being a more robust metric for unbalanced classification problems, AUC maximization is not guaranteed by global-error-minimization learning algorithms [6]. To guarantee AUC maximization, learning algorithms are expected to incorporate AUC optimization into the learning procedure, an approach that has been adopted by several learning algorithms [7,8,9]. It has also been shown [6,4] that, under specific conditions, the RankBoost algorithm computes a function that is equivalent to the AUC.
Although the aforementioned algorithms were proposed to maximize ranking in specific domains, such as Information Retrieval, their application in the context of unbalanced learning has not yet been investigated in the literature. Since the inherent properties of the AUC metric motivate its use for model selection in the presence of uneven data, it is natural to suppose that AUC optimization-based algorithms could represent an alternative to the well-known sampling and cost-sensitive learning approaches, which have been commonly used to increase the discrimination ability of ANNs [10,11]. This work investigates this hypothesis with a novel algorithm for Multi-layer Perceptrons (MLPs), named here AUCMLP, which embodies AUC optimization in the learning process.
Our goal is to adopt the AUC as a general cost function in order to improve MLPs' performance in representing classification functions, particularly those induced from unbalanced datasets. The main principles of AUCMLP are the solution of an optimization problem that, thanks to the AUC, is independent of the prior distributions, and its relationship with the quality of the classification ranking. In contrast with a global error cost function, it may yield better performance on (highly) unbalanced datasets, especially those with class overlapping.
The paper is organized as follows: Section 2 describes the foundations of our AUC-based learning approach for MLPs. Section 3 presents the methodology of the empirical study conducted to evaluate the effectiveness of our approach, together with a discussion of the results obtained. Finally, conclusions are provided in Section 4.
Let g(w) be the gradient vector associated with the current weight vector w. Each component of g(w) is given by the partial derivative of AUC(w) with respect to a network weight w, as described by the expression

\frac{\partial AUC(\mathbf{w})}{\partial w} = \frac{1}{N_1 N_2} \sum_{p=1}^{N_1} \sum_{q=1}^{N_2} \frac{\partial R(d_{pq}(\mathbf{w}))}{\partial w} \qquad (6)

where \frac{\partial R(d_{pq}(\mathbf{w}))}{\partial w} corresponds to the gradient scalar due to the presentation of the pair of examples x_p and x_q. For an arbitrary output-layer weight w_s^r, this term is obtained as follows

\frac{\partial R(d_{pq}(\mathbf{w}))}{\partial w_s^r} = \tau(-f_p + f_q - \kappa)^{\tau-1}\left[-\varphi'(v_p)\, w_s\, \varphi'(u_p^s)\, x_p^r + \varphi'(v_q)\, w_s\, \varphi'(u_q^s)\, x_q^r\right] \qquad (8)
The weight vector w can then be updated at each iteration in the direction opposite to the gradient vector, as follows

\Delta \mathbf{w} = -\eta\, g(\mathbf{w}) \qquad (9)

where \eta is a positive number (the learning rate) that controls the size of the update applied to the weight vector. A momentum constant 0 \leq \rho \leq 1 is also used in order to speed up convergence.
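To illustrate the update rule, here is a hedged sketch of gradient descent with momentum following the pair-averaging structure of eq. (6). It substitutes a logistic pairwise loss for the paper's R(·) and a plain linear scorer for the MLP; all names and the toy data are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def surrogate_auc_grad(w, Xp, Xn):
    """Gradient of a logistic pairwise surrogate loss, averaged over all
    N1*N2 positive/negative pairs as in eq. (6), for the linear scorer
    f(x) = w.x (a stand-in for the paper's MLP and R(.))."""
    d = (Xp @ w)[:, None] - (Xn @ w)[None, :]          # pairwise margins d_pq
    s = 1.0 / (1.0 + np.exp(np.clip(d, -60.0, 60.0)))  # sigma(-d_pq)
    # d/dw of mean_pq log(1 + exp(-d_pq)) = mean_pq -sigma(-d_pq) (x_p - x_q)
    return -(s[:, :, None] * (Xp[:, None, :] - Xn[None, :, :])).mean(axis=(0, 1))

# toy separable data: 20 positives around (+1,+1), 30 negatives around (-1,-1)
Xp = rng.normal(+1.0, 0.5, size=(20, 2))
Xn = rng.normal(-1.0, 0.5, size=(30, 2))
w, v, eta, rho = np.zeros(2), np.zeros(2), 0.5, 0.9
for _ in range(200):
    v = rho * v - eta * surrogate_auc_grad(w, Xp, Xn)  # momentum update
    w = w + v                                          # Delta w = -eta g(w)
```

After training, the empirical AUC of the linear scorer on the toy data is close to 1.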
The algorithms were configured following suggestions from the literature. For RBoost, the numbers of nearest neighbors used to adjust the sampling probabilities of the minority examples and to generate the synthetic data were set to 5 and 10, respectively; the number of boosting iterations and the scaling coefficient were set to 20 and 0.3, respectively [14]. For SMTTL, the number of nearest neighbors was set to 5 [13], while the value 3 was chosen for WWE [11]. Finally, as suggested in Section 2.1, the AUCMLP cost-function parameters κ and τ were set to 1.2 and 2 in all trials.
Tables 2 and 3 list the values of G-mean and AUC achieved by MLP, SMTTL,
WWE, RBoost and AUCMLP. Means and standard deviations were calculated
on 20 different test cases. The best performances achieved on each dataset are
highlighted in bold.
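For reference, the G-mean reported here is the geometric mean of the per-class accuracies; a minimal sketch (the function name and the example counts are illustrative, not from the paper):

```python
import math

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity; unlike plain
    accuracy, it collapses to 0 when either class is fully missed."""
    tpr = tp / (tp + fn)   # accuracy on the positive (minority) class
    tnr = tn / (tn + fp)   # accuracy on the negative (majority) class
    return math.sqrt(tpr * tnr)

# A majority-class classifier scores 0 despite 99% plain accuracy:
print(g_mean(tp=0, fn=10, tn=990, fp=0))  # -> 0.0
```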
As can be observed from Table 2, AUCMLP achieves better G-mean performance than the other methods in seven out of the ten tested datasets. Although no statistical test was performed to establish the significance of the results, the means and standard deviations achieved by all methods suggest that AUCMLP performs better on datasets with larger degrees of unbalance. The increase in performance with AUCMLP is most pronounced for the following datasets: euth, sat, a18-9, gls6, y5 and a19. In the particular case of a19, characterized by a huge imbalance degree (0.008), when both MLP
320 C.L. Castro and A.P. Braga
and WWE were unable to classify the positive data (TPR = 0.0), AUCMLP achieved a satisfactory performance.
Similar observations can be drawn from Table 3. AUCMLP performed better than the other algorithms in six out of the ten evaluated datasets. It should be noted, however, that the score gains obtained by AUCMLP on the less unbalanced datasets (iono, wpbc, veh, seg, euth) are lower than those obtained on sat, a18-9, y5 and a19. This suggests that AUCMLP can be more effective in optimizing ROC curves under more severe unbalance conditions. This remark agrees with the discussion previously conducted in [6]. In that study, the authors formally showed that algorithms that embody AUC optimization in the learning process should perform better than overall-error-based algorithms in situations with high levels of unbalance and class overlapping. They also argued that in roughly even conditions the optimization of an overall-error-based cost function also implies AUC optimization, which could explain the similar performances obtained on more balanced datasets, such as iono and wpbc.
Finally, average ROC curves were plotted by applying the threshold averaging technique [3] to the 20 test cases. Fig. 1 shows the ROC curves obtained for the data set a19, the set with the lowest imbalance degree (i.e., the most severely unbalanced one); it can be observed that AUCMLP generates a better ROC curve than the other algorithms.
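Threshold averaging evaluates every test run's ROC point at a common set of thresholds and averages the points. A minimal sketch, assuming scores are thresholded with ≥ and labels are in {0, 1} (the function names are ours):

```python
import numpy as np

def roc_point(scores, labels, thr):
    """(FPr, TPr) of the classifier 'score >= thr' (labels in {0, 1})."""
    pred = scores >= thr
    tpr = (pred & (labels == 1)).sum() / (labels == 1).sum()
    fpr = (pred & (labels == 0)).sum() / (labels == 0).sum()
    return fpr, tpr

def threshold_average(runs, thresholds):
    """Average ROC curve over test runs: at each common threshold,
    average the (FPr, TPr) points of the individual runs."""
    pts = np.array([[roc_point(s, y, t) for (s, y) in runs]
                    for t in thresholds])   # shape (T, n_runs, 2)
    return pts.mean(axis=1)                 # shape (T, 2): the mean curve
```

Averaging at matched thresholds, rather than at matched FPr values, is what distinguishes this technique from vertical averaging.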
[Figure: average ROC curves (TPr vs. FPr) for MLP, SMTTL, WWE, RBoost and AUCMLP on data set a19.]
4 Conclusion
It is well accepted in the literature that the bias induced by scarce and unbalanced learning sets is a direct consequence of minimizing the overall error rate. Changes to this parameter-estimation criterion aimed at improving the detection of the underrepresented class have been proposed in different forms. The most common strategies involve assigning unequal costs to the individual errors (the cost-sensitive learning approach), as well as modifying the class probability distributions via data sampling.
A different approach was considered in this paper. Motivated by the theoretical properties of the AUC, we evaluated the effectiveness of this metric as a criterion for neural model selection in unbalanced classification. Preliminary results suggest the validity of our initial hypothesis about the advantages of optimizing
AUC over other error-based methods. These results point out that our algorithm (AUCMLP), which embodies AUC optimization in the learning process, can be used to compensate for the bias imposed by the dominant class, leading to better decision surfaces, especially under severe unbalance conditions.
References
1. Rumelhart, D.E., McClelland, J.L.: Parallel distributed processing: Explorations
in the microstructure of cognition, vol. 1: Foundations. MIT Press (1986)
2. Lan, J., Hu, M.Y., Patuwo, E., Zhang, G.P.: An investigation of neural network
classifiers with unequal misclassification costs and group sizes. Decis. Support
Syst. 48, 582–591 (2010)
3. Fawcett, T.: An introduction to ROC analysis. Pat. Rec. Lett. 27, 861–874 (2006)
4. Rudin, C., Schapire, R.E.: Margin-based ranking and an equivalence between Ad-
aBoost and RankBoost. J. of Mach. Learn. Research 10, 2193–2232 (2009)
5. Bradley, A.P.: The use of the area under the ROC curve in the evaluation of
machine learning algorithms. Pattern Recognition 30, 1145–1159 (1997)
6. Cortes, C., Mohri, M.: AUC optimization vs. error rate minimization. In: Advances
in Neural Information Processing Systems 16. MIT Press, Cambridge (2004)
7. Yan, L., Dodier, R.H., Mozer, M., Wolniewicz, R.H.: Optimizing classifier performance via an approximation to the Wilcoxon-Mann-Whitney statistic. In: ICML 2003: Proceedings of the 20th Int. Conf. on Machine Learning, pp. 848–855 (2003)
8. Joachims, T.: A support vector method for multivariate performance measures. In:
ICML 2005: Proc. of the 22nd Int. Conf. on Machine learning, pp. 377–384 (2005)
9. Herschtal, A., Raskutti, B., Campbell, P.K.: Area under ROC optimization using
a ramp approximation. In: Proc. of 6th Int. Conf. on Data Mining, pp. 1–11 (2006)
10. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. on Knowledge
and Data Engineering 21, 1263–1284 (2009)
11. Khoshgoftaar, T.M., Hulse, J.V., Napolitano, A.: Supervised neural network mod-
eling: An empirical investigation into learning from imbalanced data with labeling
errors. IEEE Trans. on Neural Networks 21, 813–830 (2010)
12. Hanley, J.A., Mcneil, B.J.: The meaning and use of the area under a receiver
operating characteristic (ROC) curve. Radiology 143, 29–36 (1982)
13. Batista, G., Prati, R., Monard, M.: A study of the behavior of methods for bal-
ancing machine learning training data. SIGKDD Expl. Newsl. 6, 20–29 (2004)
14. Chen, S., He, H., Garcia, E.A.: RAMOBoost: Ranked minority oversampling in boosting. IEEE Trans. on Neural Networks 21, 1624–1642 (2010)
15. UCI machine learning repository, http://archive.ics.uci.edu/ml/
16. Wu, G., Chang, E.: KBA: Kernel boundary alignment considering imbalanced data
distribution. IEEE Trans. on Knowl. and Data Eng. 17, 786–795 (2005)
Learning Using Privileged Information
in Prototype Based Models
1 Introduction
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 322–329, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Learning Using Privileged Information in Prototype Based Models 323
Over the last few years, there has been considerable research on Distance Metric
Learning algorithms which aim to optimize a target distance for a given set
of data points under various types of constraints (given in the form of side
information), e.g. [7],[8]. In Information Theoretic Metric Learning (ITML) [8],
given a set of n points {x1 , ..., xn }, xi ∈ Rm , one learns a positive definite matrix
A defining the squared metric d_A(x_i, x_j) = (x_i − x_j)^T A (x_i − x_j) (see eq. (1)),
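The squared metric defined by a positive-definite A is straightforward to evaluate; a minimal sketch (`squared_metric` is an illustrative name, not from the paper):

```python
import numpy as np

def squared_metric(A, xi, xj):
    """d_A(x_i, x_j) = (x_i - x_j)^T A (x_i - x_j) for positive-definite A;
    A = I recovers the plain squared Euclidean distance."""
    d = xi - xj
    return float(d @ A @ d)

print(squared_metric(np.eye(2), np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # -> 2.0
```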
In this formulation we would like to modify dM so that the distances under the 'new' metric dC on X are enlarged for pairs of points with 'dissimilar' privileged information and shrunk for pairs with 'similar' privileged information.
In the ITML approach, two sets of pairs of data points from X are formed, corresponding to the 'similar' and 'dissimilar' data items: S+ = {(xi, xj) | xi and xj are judged to be 'similar'} and S− = {(xi, xj) | xi and xj are judged to be 'dissimilar'}. While in ITML similarity/dissimilarity is decided purely on the basis of class labels, we use proximity information in the privileged space X* as well.
In particular, assume we are given a global metric tensor M ∗ on X ∗ giving the
squared distance,
\|x\|_C^2 = x^T C x = x^T U^T U x = \tilde{x}^T \tilde{x} = \|\tilde{x}\|_2^2,
where (x̃ = U x) is the image of x under the basis transformation U . The layout
of the transformed points x̃i = U xi now reflects the ‘similarity/dissimilarity’ in-
formation from X ∗ . Data points with ‘similar’ privileged data representation will
now in general be closer than in the original data layout. Likewise, data points
with more distant privileged representations will tend to move further apart.
The GMLVQ algorithm (in its original form) is now applied to the transformed
data {(x̃1 , y1 ), ...., (x̃n , yn )}.
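The 'transformed basis' step can be sketched by factoring the learned metric tensor as C = UᵀU via a Cholesky decomposition, under the assumption that C is positive-definite (the function name is ours):

```python
import numpy as np

def transform_basis(C, X):
    """Factor C = U^T U and map each row x to x_tilde = U x, so that the
    Euclidean distance on the transformed data equals d_C on the original."""
    U = np.linalg.cholesky(C).T   # upper-triangular factor: C = U^T U
    return X @ U.T                # row x_i -> U x_i

C = np.array([[2.0, 0.0], [0.0, 0.5]])
X = np.array([[1.0, 1.0], [0.0, 2.0]])
Xt = transform_basis(C, X)
d = X[0] - X[1]
# ||x1 - x2||_C^2 equals ||x1_tilde - x2_tilde||_2^2:
assert np.isclose(d @ C @ d, ((Xt[0] - Xt[1]) ** 2).sum())
```

After this change of basis, GMLVQ can be run unchanged on the transformed points.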
2. Extended GMLVQ: GMLVQ is first run on the original training set without privileged information, yielding a global metric dM (given by metric tensor M) and a set of prototypes wj ∈ Rm, j = 1, 2, ..., L. Then, using the privileged information, the ITML technique finds a metric dC on X that replaces the metric dM originally found by GMLVQ. The metric dC now incorporates the privileged information. Finally, GMLVQ is run once more, with the metric tensor fixed to C, to modify the prototype positions.
The data set used in this section [1] represents a binary classification of digit images for which privileged information is available in the form of 'poetic descriptions' of the images [1]. We followed the experimental settings used in [1].
Original space X: Training inputs consist of the first 50 examples of digits '5' and '8' from the MNIST training data (100 data points); the test data consists of 1,866 samples of digits '5' and '8' from the MNIST test data.
Privileged space X*: Poetic descriptions of the first 50 examples of digits '5' and '8'. A poetic description of a digit image records what language experts saw and interpreted, expressed in their own words in the form of a poem. An example of a poetic description for the first image of '5' is given in [1]. The poetic descriptions were then translated into 21-dimensional feature vectors.
As in [1], we used training sets of increasing size 40, 50, ..., 100 (each training set containing the same number of digits '5' and '8'). The training subsets were sub-sampled randomly from the original 100 training inputs 5 times. Fig. 1a shows the mean number of misclassified points as a function of training set size. The results show that ITML-GMLVQ outperforms the standard GMLVQ, with slightly better performance for the 'transformed basis' over the 'extended GMLVQ' technique. A comparison of ITML-GMLVQ with the baseline SVM (trained without privileged information) and SVM+ is presented in Fig. 1b. ITML-GMLVQ achieves relative performance improvements of 14%, 6%, and 2% over the SVM, SVM+, and dSVM+, respectively.
[Figure: panels (a) and (b) showing the mean number of misclassifications vs. training digits size (40–90); legends: GMLVQ and GMLVQ-ITML (GMLVQ in transformation) in panel (a), SVM+ and dSVM+ in panel (b).]
Fig. 1. Mean number of misclassifications obtained by (a) GMLVQ based methods (b)
SVM based algorithms ([1]) and the ITML-GMLVQ on the MNIST data
Table 1. Mean misclassification rates (standard deviations across 10 runs are reported
in brackets) in the galaxy morphological classification
obtain, but is available for a number of galaxies. When classifying a new galaxy, the full spectral 'privileged' information will typically not be available.
Our dataset contained 20,000 galaxies extracted from the Galaxy Zoo project catalog [10] (galaxy IDs and their labels), characterized by 13 photometric features (in X) extracted, based on the galaxy IDs, from the SDSS DR7 data catalogues9. In addition, 8 privileged spectral features (in X*) were extracted from the MPA-JHU DR7 release of spectrum measurements10. On a set of this size, we found it infeasible to run extensive sets of experiments using the SVM+-based approaches. Galaxies were classified into three morphological classes: Elliptical, Spiral, and Irregular. We report in Table 1 the means and standard deviations of the performance measures across 10 experimental runs. In each run the galaxy set was randomly split into a training set (75%) and a test set (25%). Including the spectral privileged information in the training phase via the ITML-GMLVQ model reduces the misclassification rate, even though in the test phase the models are fed with the original features only. In general, ITML-GMLVQ achieves an average relative improvement over GMLVQ of about 50%, with slightly better performance for the 'extended GMLVQ' over the 'transformed basis' technique.
6 Conclusion
We have introduced a novel methodology, based on Information Theoretic Ap-
proach to metric adaptation [8], for learning with privileged information in the
9 http://cas.sdss.org/astro/en/tools/crossid/upload.asp
10 http://www.mpa-garching.mpg.de/SDSS/DR7/
References
1. Vapnik, V., Vashist, A.: A New Learning Paradigm: Learning Using Privileged
Information. In: Neural Networks (NNs), vol. 22(5-6), pp. 544–555. Elsevier Ltd.
(2009)
2. Cervantes, J., Li, X., Yu, W.: Multi-Class SVM for Large Data Sets Consider-
ing Models of Classes Distribution. In: International Conference on Data Mining
(DMIN), pp. 30–35. CSREA Press, Las Vegas (2008)
3. Cervantes, J., Li, X., Yu, W., Li, K.: Support Vector Machine Classification For
Large Data Sets Via Minimum Enclosing Ball Clustering. In: Neural Networks
(NNs): Algorithms and Applications, 4th International Symposium on Neural Net-
works 2008, vol. 71(4-6), pp. 611–619. Elsevier Science Publishers, Amsterdam
(2008)
4. Biehl, M., Hammer, B., Schneider, P., Villmann, T.: Metric Learning for Prototype-
Based Classification. In: Innovations in Neural Information Paradigms and Appli-
cations, vol. 247, pp. 183–199. Springer (2009)
5. Kohonen, T.: Learning Vector Quantization for Pattern Recognition. Technical re-
port, No (TKK-F-A601), Helsinki University of Technology. Espoo, Finland (1986)
6. Sato, A.S., Yamada, K.: Generalized Learning Vector Quantization. In: Touretzky,
D., Leen, T. (eds.) Advances in Neural Information Processing Systems (ANIPS),
vol. 7, pp. 423–429. MIT Press (1995)
7. Xing, E.P., Ng, A.Y., Jordan, M.I., Russell, S.: Distance Metric Learning with
Application to Clustering with Side-Information. In: Neural Information Processing
Systems, vol. 15, pp. 505–512. MIT Press (2002)
8. Davis, J.V., Kulis, B., Jain, P., Sra, S., Dhillon, I.S.: Information-Theoretic Met-
ric Learning. In: Proceedings of the 24th International Conference on Machine
Learning, ICML 2007, pp. 209–216. ACM Press, New York (2007)
9. Wijesinghe, D.B., Hopkins, A.M., Kelly, B.C., Welikala, N., Connolly, A.J.: Mor-
phological Classification of Galaxies and Its Relation to Physical Properties.
Monthly Notices of the Royal Astronomical Society (MNRAS) 404(4), 2077–2086
(2010)
10. Lintott, C., Schawinski, K., Slosar, A., Land, K., Bamford, S., Thomas, D., Rad-
dick, M., Nichol, B., Szalay, A., Andreescu, D.: Galaxy Zoo: Morphologies Derived
From Visual Inspection of Galaxies From The Sloan Digital Sky Survey. Monthly
Notices of the Royal Astronomical Society 389(3), 1179–1189 (2008) ISSN 0035-
8711
A Sparse Support Vector Machine Classifier
with Nonparametric Discriminants
1 Introduction
The Kernel Nonparametric Discriminant (KND) [4,16] improves over the Kernel Fisher Discriminant Analysis (KFD) [11] by relaxing the normality assumption [4]. KND measures the between-class scatter matrix on a local basis in the neighborhood of the decision boundary in the feature space. This is based on the observation that the normal vectors on the decision boundary are the most informative for discrimination [9]. We can consider KND as a classifier based on the 'near-global' characteristics of the data, realized by the κ nearest neighbors of each data point. Although KND removes the underlying assumptions of KFD and achieves better classification performance, it is not always easy to find an appropriate choice of κ-NNs on the decision boundary for all data points to obtain the best discrimination.
On the other hand, the Kernel Support Vector Machine (KSVM) is based
on the idea of maximizing the margin or degree of separation in the training
data. KSVM tries to find the optimal decision hyperplane using support vectors,
which are the training samples that approximate the hyperplane and are the
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 330–338, 2012.
© Springer-Verlag Berlin Heidelberg 2012
SSVMKND Classifier 331
most difficult patterns to classify [14]. In other words, they consist of those data
points which are closest to the optimal hyperplane. So it can be said that the
KSVM solution is based on the “local” variations of the training data. However,
KSVM does not take into consideration the “near-global” properties of the class
distribution on which the KND is based.
In this paper, we propose a novel SSVMKND model which combines the
KND and the KSVM methods. In that way, a decision boundary is obtained
which reflects both near-global characteristics (realized by the KND) of the
training data in feature space and its local properties (realized by the local
margin concept of the KSVM). The proposed method provides the following
significant advantages over our recently developed models [7,8] :
– Unlike the methods in [7,8], our proposed model forms a convex optimiza-
tion problem because the final matrix used to modify the objective function
is positive-definite. As a result, the method generates one global optimum
solution and existing numerical methods can be used to solve this problem
easily and efficiently.
– We provide a probabilistic viewpoint of the proposed method, showing that the optimization problem of SSVMKND can be formulated in terms of a sparsity-promoting Gaussian prior [13]. By contrast, the matrices in the objective functions in [7,8] are not positive-definite and cannot form Gaussian priors.
Moreover, the method in [8] uses the KFD [11], while the proposed method combines the KSVM and the KND. Since the KND relaxes the normality assumption, the proposed method is more robust. We also show that our method is a variation of the KSVM optimization problem, so that existing SVM implementations can be reused. The experimental results verify these claims. Finally, we provide an application of our method to face recognition, which has been one of the most challenging tasks in the pattern recognition area.
In the feature space, the KSVM tries to find the optimal decision hyperplane, the one with the largest minimal distance to any of the samples. To handle misclassifications even in the higher-dimensional space, the margin constraint is slackened and a penalty factor is introduced into the objective function to control the amount of slack.
the amount of slack. The KSVM optimization problem is:
N
1 T T
min w w+C max(0, 1 − ti (Φ (xi )w + w0 )) , (1)
w=0,w0 2 i=1
Here, max(0, 1−ti (ΦT (xi )w+w0 )) is the hinge loss function [14]. The loss factor
for misclassified samples is controlled by C.
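Evaluating the primal objective of eq. (1) is straightforward once a feature representation is fixed; a minimal sketch (the function name and toy data are our assumptions):

```python
import numpy as np

def ksvm_objective(w, w0, Phi, t, C):
    """Primal soft-margin objective of eq. (1): margin term plus the
    C-weighted hinge losses (Phi: feature matrix, t: labels in {-1,+1})."""
    margins = t * (Phi @ w + w0)
    hinge = np.maximum(0.0, 1.0 - margins)  # zero for well-classified points
    return 0.5 * w @ w + C * hinge.sum()

Phi = np.array([[1.0, 0.0], [-1.0, 0.0]])
t = np.array([1.0, -1.0])
print(ksvm_objective(np.array([1.0, 0.0]), 0.0, Phi, t, C=1.0))  # -> 0.5
```

With both toy points at margin exactly 1, only the regularization term contributes to the objective.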
Only a fraction of the total samples (the Support Vectors, or SVs) contribute to the solution of the KSVM [14]. Therefore, the KSVM considers only those data points which are close to the decision hyperplane; in other words, it considers only the local variations in the data samples. The overall distribution of the training samples is not taken into consideration. Incorporating some kind of global distribution information (e.g. the results of classifiers like the KND) can provide better classification.
In [4], the Nonparametric Discriminant Analysis (NDA) is proposed to remove
the normality assumption in the Linear Discriminant Analysis (LDA). The NDA
can be extended to the feature space F based on the ideas developed in extending
the LDA to the KFD [11]. We call this the Kernel Nonparametric Discriminant
(KND).
Instead of calculating the simple mean vectors, the nearest neighbor mean vec-
tors are calculated to formulate the between-class scatter matrix ∇ of the KND.
The motivation behind KND is the observation that essentially the nearest neigh-
bors represent the classification structure in the best way because the nearest
neighbor mean vectors represent the direction of the gradients of the respec-
tive class density functions in the feature space [4]. This way, the between-class
scatter matrix ∇ preserves the classification structure.
The KND makes no modification to the within-class scatter matrix Δ, which is therefore the same as in the KFD. With these definitions of ∇ and Δ, the KND method proceeds in the same way as other LDA-based methods, i.e., by computing the eigenvectors and eigenvalues of (Δ + βI)⁻¹∇ and taking the eigenvector corresponding to the largest eigenvalue to form the optimal decision hyperplane. The term βI is added before the inversion of Δ to tackle the small sample size problem [11].
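The discriminant direction can then be read off the regularized eigenproblem; a hedged NumPy sketch assuming the scatter matrices ∇ ('between') and Δ ('within') have already been computed:

```python
import numpy as np

def knd_direction(between, within, beta=1e-3):
    """Leading eigenvector of (Delta + beta*I)^{-1} Nabla; beta*I
    regularizes the inversion (the small sample size problem)."""
    m = within.shape[0]
    M = np.linalg.solve(within + beta * np.eye(m), between)
    vals, vecs = np.linalg.eig(M)                    # M not symmetric in general
    return np.real(vecs[:, np.argmax(np.real(vals))])

# toy check: with identity within-class scatter, the direction follows
# the dominant axis of the between-class scatter
v = knd_direction(np.diag([5.0, 1.0]), np.eye(2))
print(abs(v[0]) > 0.99)  # -> True
```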
The most significant advantage of the KND over the KFD is the removal of the normality assumption. Because of this, the KND is a robust classifier that can perform better on real-life datasets. However, finding the optimum number of nearest neighbors is not an easy task. Moreover, the KND captures only the near-global variations in the data distribution through the k nearest neighbors: unlike the KSVM, it gives no special consideration to the points that are crucial to classify (the local variations). This can result in a degradation of performance.
In our proposed SSVMKND method, we try to utilize both the local and the near-global variational information obtained from the KSVM and the KND. Our objective is to keep the method as close to the KSVM approach as possible, since the KSVM optimization approach is more robust than the discriminant approach of the KND, in the sense that it is less sensitive to the data distribution and to the small sample size problem. We modify the KSVM optimization problem (Equation (1)) to incorporate the scatter matrices from the KND. Hence, our method
can be described by the following convex optimization problem:

\min_{\mathbf{w} \neq 0,\, w_0} \; \frac{1}{2}\mathbf{w}^T\big(\eta\Delta(\nabla + \beta I)^{-1}\Delta\big)\mathbf{w} + \frac{1}{2}\mathbf{w}^T\mathbf{w} + C \sum_{i=1}^{N} \max\left(0,\, 1 - t_i\big(\Phi^T(\mathbf{x}_i)\mathbf{w} + w_0\big)\right). \qquad (2)
Here, the first term is added to the optimization problem to incorporate the KND scatter matrices. The coefficient \eta\Delta(\nabla + \beta I)^{-1}\Delta changes the orientation of the weight vector \mathbf{w} by incorporating the near-global variational information obtained by the KND. The parameter \eta, which can take values from 0 to \infty, is the key control parameter for finding the optimal weight-vector orientation. The second term \frac{1}{2}\mathbf{w}^T\mathbf{w} is the traditional KSVM term that retains the local variational information. A compact form of the problem is:

\min_{\mathbf{w} \neq 0,\, w_0} \; \frac{1}{2}\mathbf{w}^T\Theta\,\mathbf{w} + C \sum_{i=1}^{N} \max\left(0,\, 1 - t_i\big(\Phi^T(\mathbf{x}_i)\mathbf{w} + w_0\big)\right), \qquad (3)
4 Experimental Results
In this section we evaluate the proposed SSVMKND method against the KSVM, the KND and the KFD. For kernelization of the data we use the popular Gaussian RBF kernel [3]. For KND and KFD, after finding the optimal eigenvector, a Bayes classifier is used to perform the final classification. The control parameters of the different methods (e.g. the weight parameter η for SSVMKND and the RBF width parameter σ for all methods) are optimized by exhaustive search. If the optimization needs to be faster, more efficient methods can be used at the cost of a small degradation in accuracy.
We applied the classification algorithms to 7 real-world and artificial datasets obtained from the benchmark repository in [12]. One hundred partitions are available for each dataset, in which about 60% of the data is used for training and the rest for testing. We randomly picked 5 of these 100 partitions and repeated the random picking 5 times to obtain an unbiased average result.
Table 1 contains the average accuracy values and standard deviations obtained over all runs. We see that the SSVMKND method outperforms the KSVM, KND and KFD methods in all cases. Since the SSVMKND combines the local and near-global variations provided by the KSVM and the KND, respectively, it can classify the relatively difficult test samples.
The sparsity values (the ratio of the number of support vectors to the number of training samples) in Table 1 verify that the SSVMKND classifier indeed provides a sparse solution, as explained in Section 3.2. A sparse solution is important for real-world problems, where the data is usually high-dimensional: it ensures that the solution model has fewer parameters than a non-sparse solution [2] and can therefore directly yield a model with lower computational complexity when efficient mathematical-programming-based algorithms are used. We also observe that in all cases the
336 N.M. Khan et al.
Table 1. Average accuracy (standard deviation) and % sparsity values

Dataset      SSVMKND      KSVM         KND          KFD          % Sparsity (SSVMKND)  % Sparsity (KSVM)
Flare-Sonar  67.7 (0.47)  66.9 (0.41)  67.1 (0.65)  66 (0.40)    4.65                  5.25
German       78.1 (0.40)  77 (0.38)    76.3 (0.68)  75.7 (0.51)  39.1                  38.85
Heart        86.5 (2.21)  85.4 (2.3)   81.7 (1.58)  82.9 (2.13)  39.4                  36.47
Banana       89.8 (0.25)  89.6 (0.29)  89.6 (0.22)  89.5 (0.20)  60.5                  59.5
Diabetes     78.6 (0.50)  77.7 (0.69)  75.7 (0.90)  77.3 (1.03)  53.2                  52.8
Ringnorm     98.5 (0.04)  98.4 (0.04)  98.3 (0.03)  97.4 (0.07)  46                    38.5
Thyroid      97.3 (0.6)   96.5 (1.02)  97.1 (0.64)  96.8 (0.49)  35.7                  31.4
SSVMKND method provides better sparsity than the KSVM. This is because
of the nature of the Gaussian prior used. The prior in the KSVM method [13]
considers the underlying coefficients of the weight vectors to be uncorrelated.
However, in real-world applications, this assumption is not necessarily true. In
the SSVMKND, the prior (Section 3.2) incorporates the between-class and the
within-class scatter matrices from the KND. As a result, in our prior, the
coefficients are not considered uncorrelated. This is why the SSVMKND provides
better sparsity values than the KSVM.
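The sparsity measure used above amounts to counting the (numerically) nonzero dual coefficients of the trained classifier; a minimal sketch, where the tolerance `tol` is an assumed numerical threshold:

```python
import numpy as np

def sparsity(alpha, tol=1e-8):
    """Sparsity of an SVM solution: the fraction of training samples whose
    dual coefficient is nonzero, i.e. the fraction that are support vectors."""
    alpha = np.asarray(alpha)
    return np.count_nonzero(np.abs(alpha) > tol) / alpha.size
```

A lower value means fewer support vectors and hence a more compact decision function.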
Our purpose is to build a basic face recognition application that can act as a
"proof-of-concept" system for comparing the different methods.
Hence, we have chosen the well-established eigenfaces technique [1] for our
application, which essentially uses Principal Component Analysis (PCA) to project
the images onto a reduced subspace. To observe the effect of varying the PCA
dimension on the different classification methods, we have repeated our experiment
[Figure: classification accuracy (%) vs. PCA dimension (3, 6, …, 30) for SSVMKND, KSVM, KND, and KFD in two experiments.]
5 Conclusion
References
1. Belhumeur, P., Hespanha, J., Kriegman, D.: Eigenfaces vs. fisherfaces: recognition
using class specific linear projection. IEEE Transactions on Pattern Analysis and
Machine Intelligence 19, 711–720 (1997)
2. Camps-Valls, G., Bruzzone, L.: Kernel-based methods for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing 43(6), 1351–1362 (2005)
3. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines.
Cambridge University Press, Cambridge (2000)
4. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic
Press (2000)
5. Georghiades, A.: Yale face database (1997)
6. Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press (1990)
7. Khan, N., Ksantini, R., Ahmad, I., Boufama, B.: A novel SVM+NDA model for
classification with an application to face recognition. Pattern Recognition 45(1),
66–79 (2012)
8. Ksantini, R., Ahmad, I., Boufama, B., Khan, N.: A new combined KSVM and
KFD model for classification and recognition. In: Fifth International Conference
on Digital Information Management, pp. 188–193 (2010)
9. Lee, C., Landgrebe, D.: Feature Extraction Based on Decision Boundaries. IEEE
Transactions on Pattern Analysis and Machine Intelligence 15(4), 388–400 (1993)
10. Lyons, M., Budynek, J., Akamatsu, S.: Automatic classification of single facial
images. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(12),
1357–1362 (1999)
11. Mika, S., Ratsch, G., Weston, J., Scholkopf, B., Mullers, K.: Fisher Discriminant Analysis with Kernels. In: Neural Networks for Signal Processing, pp. 41–48 (August 1999)
12. Ratsch, G., Onoda, T., Muller, K.: Soft Margins for AdaBoost. Machine Learning 42(3), 287–320 (2000)
13. Sollich, P.: Bayesian Methods for Support Vector Machines: Evidence and Predictive Class Probabilities. Machine Learning 46(1-3), 21–52 (2002)
14. Vapnik, V.: Statistical Learning Theory. John Wiley & Sons (1998)
15. Xiong, T., Cherkassky, V.: A Combined SVM and LDA Approach for Classification.
In: International Joint Conference on Neural Networks, pp. 1455–1459 (2005)
16. You, D., Hamsici, O.C., Martinez, A.M.: Kernel Optimization in Discriminant
Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(3),
631–638 (2011)
Training Mahalanobis Kernels
by Linear Programming
Shigeo Abe
1 Introduction
Regular support vector machines do not assume a priori data distributions and
determine the decision boundary using only support vectors that are near the
boundary. However, if data of one class have a large variance and those of the
other class have a small variance, it may not be good to place the hyperplane in
the middle of the unbounded support vectors. In such a situation, instead of the
Euclidean distance, the Mahalanobis distance is sometimes effective [1,2,3,4,5].
There are two ways to incorporate the Mahalanobis distance into support vec-
tor machines: one is to reformulate support vector machines so that the margin is
measured by the Mahalanobis distance [1,2], and the other is to use Mahalanobis
kernels [3,4,5], which calculate the kernel value according to the Mahalanobis
distance between the associated two argument vectors.
Radial basis function (RBF) kernels are widely used because they usually give
good performance in most applications. To improve the generalization ability of
RBF kernels, generalized RBF kernels have been proposed, in which each input variable
has a weight in calculating the kernel value. Mahalanobis kernels are an extension
of generalized RBF kernels, and if the covariance matrix is restricted to a diagonal
matrix, Mahalanobis kernels reduce to generalized RBF kernels [4].
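A kernel of the kind discussed here can be sketched as follows; the exact normalization used in [4] may differ, and the weighting matrix `A` (typically a scaled inverse covariance matrix) is the caller's choice:

```python
import numpy as np

def mahalanobis_kernel(x, y, A):
    """K(x, y) = exp(-(x - y)^T A (x - y)).  A diagonal A yields a
    generalized RBF kernel; A = gamma * I recovers the ordinary RBF kernel."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-d @ A @ d))
```

With `A = np.diag(w)` each input variable gets its own weight `w[i]`, exactly the generalized-RBF special case mentioned above.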
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 339–346, 2012.
© Springer-Verlag Berlin Heidelberg 2012
340 S. Abe
where xi and xj belong to the same class but xk and xi (and thus xj ) belong to
different classes, and |P | denotes the number of elements in P . In [6], the details
of how to generate triplets are not described. Here, we consider generating the
set of triplets, P . For a training sample xr (r = 1, . . . , M ) belonging to a class,
we find the nearest xj belonging to the same class and the nearest xk belonging
to the other class and make (xr , xj , xk ) the triplet. In this way, we obtain P
with M triplets.
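The triplet-generation rule just described, together with the coefficients a_ir used in (4), can be sketched as follows (function and variable names are illustrative; each class is assumed to contain at least two samples):

```python
import numpy as np

def make_triplets(X, y):
    """For each sample x_r, pair it with the nearest same-class sample x_j
    and the nearest other-class sample x_k, and return the coefficients
    a_ir = (x_ir - x_ik)^2 - (x_ir - x_ij)^2 as row r of the output."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    A = np.empty_like(X)
    for r, (xr, yr) in enumerate(zip(X, y)):
        d = np.linalg.norm(X - xr, axis=1)
        d[r] = np.inf                               # exclude the sample itself
        same = np.where(y == yr, d, np.inf)         # distances within the class
        other = np.where(y != yr, d, np.inf)        # distances to the other class
        xj, xk = X[np.argmin(same)], X[np.argmin(other)]
        A[r] = (xr - xk) ** 2 - (xr - xj) ** 2
    return A   # shape (M, m): row r holds a_ir for i = 1..m
```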
The Mahalanobis distance between two data samples belonging to the same class needs
to be shorter than that between data of different classes. Thus, we define a
margin ρr for the rth triplet as follows:
$$\rho_r = (x_r - x_k)^\top R\,(x_r - x_k) - (x_r - x_j)^\top R\,(x_r - x_j) = \sum_{i=1}^{m} a_{ir} R_{ii}, \qquad (4)$$

where $a_{ir} = (x_{ir} - x_{ik})^2 - (x_{ir} - x_{ij})^2$ for i = 1, . . . , m, r = 1, . . . , M, $x_{ir}$ is the ith element of $x_r$, and $R_{ii}$ is the ith diagonal element of R.
We want ρr to be as large as possible, but, as in support vector machines, we allow
negative margins for some triplets. We thus formulate the following optimization
problem:
$$\text{maximize} \quad J_\rho(\rho, R, \xi) = \rho - C_\rho \sum_{r=1}^{M} \xi_r \qquad (5)$$
$$\text{subject to} \quad \sum_{i=1}^{m} R_{ii} = 1, \qquad (6)$$
$$R_{ii} \ge 0 \quad \text{for } i = 1, \dots, m, \qquad (7)$$
$$\sum_{i=1}^{m} a_{ir} R_{ii} \ge \rho - \xi_r \quad \text{for } r = 1, \dots, M, \qquad (8)$$
$$\xi_r \ge 0 \quad \text{for } r = 1, \dots, M, \quad \rho > 0, \qquad (9)$$
where ρ is the margin, ξr are slack variables that allow negative margins, Cρ is a
margin parameter, and (6) makes the left-hand side of (8) unique.
The above formulation is similar to the ν-LP SVM discussed in [7]. Because
the ν-LP SVM treats a two-class problem, air Rii in (8) is multiplied by yr which
takes 1 or −1. But this is a trivial difference. Thus, we call the above support
vector machine primal ν-LP SVM or ν-LP SVM for short.
According to [7], for positive ρ, the ν-LP SVM is equivalent to the following
formulation:
$$\text{minimize} \quad J(R, \xi) = \sum_{i=1}^{m} R_{ii} + C_M \sum_{r=1}^{M} \xi_r \qquad (10)$$
$$\text{subject to} \quad R_{ii} \ge 0 \quad \text{for } i = 1, \dots, m, \qquad (11)$$
$$\sum_{i=1}^{m} a_{ir} R_{ii} \ge 1 - \xi_r \quad \text{for } r = 1, \dots, M, \qquad (12)$$
$$\xi_r \ge 0 \quad \text{for } r = 1, \dots, M, \qquad (13)$$
where CM is a margin parameter.
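Since (10)–(13) form a standard linear program in the variables R_ii and ξ_r, they can be handed to an off-the-shelf LP solver; a sketch using `scipy.optimize.linprog` (the function name `train_lp_svm` is illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def train_lp_svm(a, C_M):
    """Solve (10)-(13): minimize sum_i R_ii + C_M sum_r xi_r subject to
    sum_i a_ir R_ii >= 1 - xi_r and R, xi >= 0.  `a` has shape (M, m)
    with a[r, i] = a_ir.  The LP variables are [R_11..R_mm, xi_1..xi_M]."""
    M, m = a.shape
    c = np.concatenate([np.ones(m), C_M * np.ones(M)])
    # Rewrite each constraint as  -sum_i a_ir R_ii - xi_r <= -1.
    A_ub = np.hstack([-a, -np.eye(M)])
    b_ub = -np.ones(M)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
    return res.x[:m], res.x[m:]   # diagonal of R, slack variables
```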
From (15) and (17), the dual ν-LP SVM has a feasible solution for Cρ ≥ 1/M ,
but does not have a feasible solution for 1/M > Cρ > 0. For the ν-LP SVM,
because of the slack variables ξi , a feasible solution always exists for Cρ > 0.
But for 1/M > Cρ > 0, the optimal solution of the ν-LP SVM is unbounded:
Theorem 2. For 0 < Cρ < 1/M , the solution of the ν-LP SVM is unbounded.
But for Cρ ≥ 1, Rii (i = 1, . . . , m) are the same:
Theorem 3. For Cρ ≥ 1 the optimal Rii (i = 1, . . . , m) are the same for different values of Cρ.
From Theorem 3, it is clear that for 1 ≥ Cρ ≥ 1/M , the bounded optimal
solutions exist for the primal and dual ν-LP SVMs. From (15) and (17), for the
optimal solution of the ν-LP SVM with 1/(j − 1) > Cρ ≥ 1/j (j = 2, . . . , M ),
at least j inequality constraints in (8) are active. Theorem 3 is further refined
as follows:
Theorem 4. If for Cρ = C0 (≥ 1/M ), the solution of the ν-LP SVM satisfies
ρ > 0 and ξ = 0, or ρ = 0, the same solution is obtained for Cρ = t C0 (t > 1).
Assume that for Cρ = 1/j (j = 2, . . . , M ) the solutions of the primal and
dual ν-LP SVMs are (ρ̄, R̄, ξ̄) and (z̄, δ̄), respectively. Then what can we say
about the solutions for Cρ = t/j (j/(j − 1) > t > 1)? From Theorem 4, if ρ̄ > 0
and ξ̄ = 0, or ρ̄ = 0, R does not change for Cρ = t/j. Then, for ρ̄ > 0 and
ξ̄ ≠ 0, can (ρ̄, R̄, ξ̄) and (z̄, δ̄), or (ρ̄, R̄, ξ̄) and (t z̄, t δ̄), be the solutions for
Cρ = t/j? Neither can: the solutions (ρ̄, R̄, ξ̄) and (z̄, δ̄) do not satisfy the
complementarity condition for ξ̄r > 0, and the solutions (ρ̄, R̄, ξ̄) and (t z̄, t δ̄)
do not satisfy the complementarity condition either. This means that the optimal R
may change for 1/(j − 1) > Cρ > 1/j.
3.2 LP SVMs
We now investigate the dependence of the solution of the LP SVM on the value of CM.
Because of the slack variables, the LP SVM has a feasible solution. In addition,
because the objective function given by (10) is restricted to be non-negative, the
optimal solution of the minimization problem always exists.
For a small CM value, the solution Rii = 0 (i = 1, . . . , m) is obtained as the
following theorem shows:
Theorem 5. Define
$$C_{\min} = \min_{\substack{i = 1, \dots, m \\ \sum_{r=1}^{M} a_{ir} > 0}} \; 1 \Big/ \sum_{r=1}^{M} a_{ir}. \qquad (18)$$
Then, if Cmin exists, for Cmin > CM > 0 the solution of the LP SVM is Rii = 0 (i = 1, . . . , m).
As Theorem 1 shows, the ν-LP SVM and LP SVM are equivalent when ρ is
positive and the objective function value for the ν-LP SVM is positive. To obtain
a solution with positive ρ and the positive objective function, Cρ needs to be
selected in [1/M, 1/j], where j ∈ {1, . . . , M −1}. But, there is no way of selecting
the value of j.
For the optimal solution with Cρ = 1/j, at least j constraints are active.
Therefore, controlling the number of active constraints is easy but again the
optimal j is not known in advance.
For the LP SVM, the lower bound of CM that gives nonzero R is Cmin given
by (18). But unlike the ν-LP SVM, there is no upper bound of CM . This is
because a zero-margin solution (i.e., Rii → ∞) is not obtained.
Using either the ν-LP SVM or the LP SVM, we need to optimize the value of
Cρ or CM by, e.g., cross-validation. Because with the LP SVM we do not need to
worry about an upper bound of CM, in the following we use the LP SVM for
Mahalanobis kernel training.
4 Computer Experiment
5 Conclusions
References
1. Lanckriet, G.R.G., El Ghaoui, L., Bhattacharyya, C., Jordan, M.I.: A robust minimax approach to classification. Journal of Machine Learning Research 3, 555–582 (2002)
2. Xue, H., Chen, S., Yang, Q.: Structural regularized support vector machine: A framework for structural large margin classifiers. IEEE Trans. Neural Networks 22(4), 573–587 (2011)
3. Grandvalet, Y., Canu, S.: Adaptive scaling for feature selection in SVMs. In: Neural
Information Processing Systems 15, pp. 569–576. MIT Press (2003)
4. Abe, S.: Training of Support Vector Machines with Mahalanobis Kernels. In: Duch,
W., Kacprzyk, J., Oja, E., Zadrożny, S. (eds.) ICANN 2005. LNCS, vol. 3697, pp.
571–576. Springer, Heidelberg (2005)
5. Wang, D., Yeung, D.S., Tsang, E.C.C.: Weighted Mahalanobis distance kernels for
support vector machines. IEEE Trans. Neural Networks 18(5), 1453–1462 (2007)
6. Shen, C., Kim, J., Wang, L.: Scalable large-margin Mahalanobis distance metric
learning. IEEE Trans. Neural Networks 21(9), 1524–1530 (2010)
7. Demiriz, A., Bennett, K.P., Shawe-Taylor, J.: Linear programming boosting via
column generation. Machine Learning 46(1-3), 225–254 (2002)
8. Abe, S.: Support Vector Machines for Pattern Classification. Springer, Heidelberg
(2010)
9. Asuncion, A., Newman, D.J.: UCI machine learning repository (2007), http://www.ics.uci.edu/~mlearn/MLRepository.html
10. USPS Dataset, http://www-i6.informatik.rwth-aachen.de/~keysers/usps.html
Correntropy-Based Document Clustering
via Nonnegative Matrix Factorization
1 Introduction
Large size of the data is one of the central issues in data analysis research. The size of
the data is always increasing; therefore it becomes more and more important to reduce
its size without losing its most essential features. There are several methods to reduce
the dimensionality of large data such as Principal Component Analysis (PCA),
Singular Value Decomposition (SVD) and Independent Component Analysis (ICA).
Recently defined Nonnegative Matrix Factorization (NMF) approach also allows to
reduce the number of attributes of the data [1, 4, 25].
The NMF technique tries to approximate a data matrix X with the product of low-rank
matrices W and H, such that X ≈ WH and the elements of W and H are
nonnegative. If the columns of X are data samples, then the columns of W can be
interpreted as basis vectors or parts from which the data samples are formed, while the columns
of H give the positions of the encoded samples in the feature space. It is also common to
find a clustering of the samples based on the values of the H matrix. Typically, the quality of
the approximation is measured by the Euclidean distance between the elements of X
and the elements of WH; however, measures derived from the Kullback-Leibler
divergence have also been studied in the literature [1-4, 9]. In this paper, we propose
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 347–354, 2012.
348 T. Ensari, J. Chorowski, and J.M. Zurada
$$\text{minimize} \quad f(X, WH) \qquad (1)$$
$$\text{subject to} \quad W \ge 0, \; H \ge 0,$$
where the Euclidean distance based loss is
$$f_{EucD}(X, WH) = 0.5 \sum_{i,j} \left(X_{ij} - (WH)_{ij}\right)^2 \qquad (2)$$
and the correntropy based loss is
$$f_{Corr}(X, WH) = -\sum_{i,j} \exp\!\left(-\frac{\left(X_{ij} - (WH)_{ij}\right)^2}{2\sigma^2}\right). \qquad (3)$$
¹ From http://people.csail.mit.edu/jrennie/20Newsgroups
[3, 8, 10, 12]. NMF has also been successfully applied to face recognition,
bioinformatics, text data mining and audio (speech) processing [2, 3, 5, 8-13].
Clustering is also one of the topics addressed with NMF and has been extensively
discussed in the literature [9, 12, 21, 22].
4 Experimental Results
We have compared the Euclidean distance and correntropy based NMF formulations
by evaluating the quality of clusters computed from the factorizations. We have used
the 20-newsgroups data set, which is one of the popular benchmarks for clustering and
classification of text data. It has approximately 11,000 documents taken from 20
different newsgroups pertaining to various subjects.
Minimizing the correntropy loss (3) uses its gradients with respect to W and H:
$$\nabla_W f_{Corr} = -\frac{1}{\sigma^2} \left[(X - WH) \circ G\right] H^\top, \qquad (4)$$
$$\nabla_H f_{Corr} = -\frac{1}{\sigma^2} W^\top \left[(X - WH) \circ G\right], \qquad (5)$$
where $G_{ij} = \exp\!\left(-\left(X_{ij} - (WH)_{ij}\right)^2 / (2\sigma^2)\right)$ and ∘ denotes the elementwise product.
We have used the popular PGD algorithm to minimize the Euclidean distance based
loss (2). Both algorithms were run with the same random initial matrices W and H.
We have used the same stopping criteria by setting a relative tolerance of 10 and by
allowing at most 1000 iterations. We refer to the results obtained when minimizing
correntropy as NMF-Corr and to those obtained when minimizing the Euclidean
distance as NMF-PGD (EucD).
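The two objective functions being compared can be sketched as follows; this is a minimal sketch of the Euclidean and correntropy losses, and the paper's exact scaling constants may differ:

```python
import numpy as np

def eucd_loss(X, W, H):
    """Euclidean-distance NMF loss: 0.5 * sum of squared residuals."""
    return 0.5 * np.sum((X - W @ H) ** 2)

def correntropy_loss(X, W, H, sigma):
    """Correntropy-based NMF loss: each residual contributes at most 1 in
    magnitude, so large residuals saturate instead of dominating."""
    E = X - W @ H
    return -np.sum(np.exp(-E ** 2 / (2 * sigma ** 2)))
```

The saturation behavior of the correntropy loss is what makes it robust to outlying entries of X, at the price of the extra parameter σ.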
$$e_j = -\sum_{i=1}^{L} p_{ij} \log_2 p_{ij}, \qquad (6)$$
where L is the number of classes and $p_{ij}$ is the probability that a member of cluster j belongs to class i. The entropy of the full data set is the sum of the entropies of each cluster weighted by the size of each cluster:
$$e = \sum_{j=1}^{K} \frac{m_j}{m}\, e_j, \qquad (7)$$
where K is the number of clusters, $m_j$ is the size of cluster j, and m is the total number of data points [24].
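The weighted cluster-entropy measure described above can be sketched as:

```python
import math
from collections import Counter

def clustering_entropy(clusters, classes):
    """Entropy of a clustering: per-cluster class entropy, weighted by the
    cluster size; lower is better, and 0 means every cluster is pure."""
    m = len(classes)
    total = 0.0
    for c in set(clusters):
        members = [cls for k, cls in zip(clusters, classes) if k == c]
        e = 0.0
        for count in Counter(members).values():
            p = count / len(members)
            e -= p * math.log2(p)
        total += len(members) / m * e
    return total
```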
(for σ = 1, σ = 0.5 and σ = 0.01) in Fig. 2. Here, "r" denotes the assumed number of
clusters and equals the rank of W and H. We vary it from 2 to 20 to track the clustering
performance. We show all entropy values in Fig. 2, but for brevity we illustrate only 10
data points in Table 1. Since lower entropy values indicate better clustering performance,
we see that NMF-Corr (σ = 0.5) demonstrates better clustering performance than
NMF-PGD (EucD) for every evaluated number of clusters.
Table 1. Entropy of the 20-Newsgroups data set with NMF-PGD (EucD) and NMF-Corr

Number of Clusters (r)   NMF-PGD (EucD)   NMF-Corr (σ = 1)   NMF-Corr (σ = 0.5)   NMF-Corr (σ = 0.01)
specifically the worst performance for σ = 0.01. This can be seen in Fig. 2 and Table 1.
The parameter σ controls the threshold of saturation of the correntropy loss for
the individual residues. When σ is large, no saturation will occur. On
the other hand, when σ is small nearly all residues saturate, and there may not be
enough information in the unsaturated ones to accurately learn the NMF
decomposition. The optimum value of σ depends on the distribution of the
residues. Determining the exact nature of this relationship warrants further studies.
[Table: number of documents from each of the 20 newsgroup categories (alt.atheism through talk.religion.misc) assigned to selected cluster ID numbers (1, 2, 4, 6, 7, 8).]
5 Conclusion
This paper introduces a correntropy measure for improved clustering via NMF. It uses
a correntropy-based objective function and compares its clustering performance
with that of the Euclidean distance objective function. We have applied this approach
to cluster the well-known 20-Newsgroups data set; the results show that minimizing
correntropy yields better clustering quality than the EucD objective function. We will
also test the limits of our approach with other data sets in future studies.
References
1. Lee, D.D., Seung, H.S.: Learning the Parts of Objects with Nonnegative Matrix
Factorization. Nature 401, 788–791 (1999)
2. Berry, M.W., Browne, M., Langville, A.N., Pauca, V.P., Plemmons, R.J.: Algorithms and
Applications for Approximate Nonnegative Matrix Factorization. Computational Statistics
and Data Analysis 52(1), 155–173 (2007)
3. Hoyer, P.O.: Non-negative Matrix Factorization with Sparseness Constraints. Journal of
Machine Learning Research 5, 1457–1469 (2004)
4. Lee, D.D., Seung, H.S.: Algorithms for Non-negative Matrix Factorization. In: Proc. of Advances in Neural Information Processing, vol. 13, pp. 556–562 (2001)
5. Schmidt, M.N., Olsson, R.K.: Single-channel Speech Separation Using Sparse Non-negative Matrix Factorization. In: Proc. of Interspeech, pp. 2614–2617 (2006)
6. Xu, J.W., Bakardjian, H., Cichocki, A., Principe, J.C.: A New Nonlinear Similarity
Measure for Multichannel Biological Signals. In: Proc. of Int. Joint Conf. on Neural
Networks, Orlando, Florida, USA, August 12-17 (2007)
7. Fevotte, C., Idier, J.: Algorithms for Nonnegative Matrix Factorization with the β-Divergence. Neural Computation 13(3), 1–24 (2010)
8. Choi, S.: Algorithms for Orthogonal Nonnegative Matrix Factorization. In: Proc. of the Int.
Joint Conf. on Neural Networks, Hong Kong, June 1-6, pp. 1828–1832 (2008)
9. Zhao, W., Ma, H., Li, N.: A Nonnegative Matrix Factorization Algorithm with Sparseness
Constraints. In: Proc. of the Int. Conf. on Machine Learning and Cybernetics, Guilin,
China
10. Schmidt, M.N., Winther, O., Hansen, L.K.: Bayesian Non-negative Matrix Factorization.
In: Adali, T., Jutten, C., Romano, J.M.T., Barros, A.K. (eds.) ICA 2009. LNCS, vol. 5441,
pp. 540–547. Springer, Heidelberg (2009)
11. Fevotte, C., Bertin, N., Durrieu, J.L.: Nonnegative Matrix Factorization with the Itakura-Saito Divergence. Neural Computation 21, 793–830 (2009)
12. Shahnaz, F., Berry, M.W., Pauca, V.P., Plemmons, R.J.: Document Clustering Using
Nonnegative Matrix Factorization. Int. Journal of Information Processing and
Management 42(2), 373–386 (2006)
13. Guillamet, D., Vitria, J., Schiele, B.: Introducing a Weighted Non-negative Matrix
Factorization for Image Classification. Pattern Recognition Letters 24, 2447–2454 (2003)
14. He, R., Zheng, W.S., Hu, B.G.: Maximum Correntropy Criterion for Robust Face Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(8), 1561–1576 (2011)
15. Liu, W., Pokharel, P.P., Principe, J.C.: Correntropy: Properties and Applications in Non-Gaussian Signal Processing. IEEE Trans. on Signal Processing 55(11), 5286–5298 (2007)
16. Singh, A., Principe, J.C.: Using Correntropy as a Cost Function in Linear Adaptive Filters.
In: Proc. of Int. Joint Conference on Neural Networks, Atlanta, USA, June 14-19 (2009)
17. He, R., Hu, B.G., Zheng, W.S., Kong, X.W.: Robust Principal Component Analysis Based
on Maximum Correntropy Criterion. IEEE Trans. on Image Processing 20(6) (2011)
18. Chalasani, R., Principe, J.H.: Self Organizing Maps with Correntropy Induced Metric. In:
Proc. of Int. Joint Conf. on Neural Networks, Spain, pp. 1–6 (2010)
19. Jeong, K.H., Principe, J.C.: Enhancing the Correntropy MACE Filter with Random
Projections. Neurocomputing 72(1-3), 102–111 (2008)
20. Matlab Software by Mark Schmidt,
http://www.di.ens.fr/~mschmidt/Software/minConf.html
21. Berry, M.W., Gillis, N., Glineur, F.: Document Classification Using Nonnegative Matrix
Factorization and Underapproximation. In: Int. Symp on Circuits and Systems, Taiwan
(2009)
22. Zhao, W., Ma, H., Li, N.: A New Non-negative Matrix Factorization Algorithm with
Sparseness Constraints. In: Proc. of the 2011 Int. Conf. on Machine Learning and
Cybernetics, Guilin, July 10-13 (2011)
23. Lin, C.J.: Projected Gradient methods for Non-Negative Matrix Factorization. Neural
Computation 19, 2756–2779 (2007)
24. Tan, P., Steinbach, M., Kumar, V.: Introduction to Data Mining. Pearson Addison Wesley
(2006)
25. Paatero, P.: Least Squares Formulation of Robust Non-negative Factor Analysis. Chemometrics and Intelligent Laboratory Systems 37, 23–35 (1997)
SOMM – Self-Organized Manifold Mapping
Abstract. The Self Organizing Map (SOM) [1] proposed by Kohonen has
proved to be remarkable in terms of its range of applications. It can be used for
high dimensional space visualization, pattern recognition, input space dimensionality
reduction and for generating prototypes to extrapolate information.
Basically, tasks conducted by the SOM method are closely related to input
space mapping that preserves topological and metric relationships between
samples. These maps are meant to create a low dimensional output representation
of a high dimensional input space. Although maps of more than two dimensions
can be created by SOM, it is common to work with the limit of one or two
dimensions. This work presents a methodology named SOMM (Self-Organized
Manifold Mapping) that can be useful to discover structures and clusters of an
input dataset, using the SOM map as a representation of the data distribution structure.
1 Introduction
Based on the biological principles of organization, Kohonen, in developing his well-known
SOM [1], postulates that there are good reasons to have the following aspects
of organization: a) grouping similar stimuli minimizes neural wiring, b) it creates a
robust and logical structure in the brain, avoiding "crosstalk", c) from information
organized by attributes, a natural manifold structure can emerge from input patterns, and
d) it reduces dimensionality by creating representations (prototypes) that preserve
neighborhood relationships between input patterns. Each representation, also known as a
prototype, retains the most important features that represent a group of input patterns. It
can be argued that patterns with high similarities are memorized and retrieved from
memory through similarities with features of input patterns. Then, the SOM map tends to
be a discretized nonlinear representation of the input distribution surface, because the
algorithm is based upon the Vector Quantization (VQ) [1] technique. In this sense,
SOM is characterized by mapping a high dimensional input space to a low dimensional
output space while preserving the input space topology. Although maps of more than two
dimensions can be created by SOM, it is usual to work with the limit of one or two
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 355–362, 2012.
356 E.C. Kitani, E. Del-Moral-Hernandez, and L.A. Silva
dimensions. However, there are some challenges in understanding the results of the SOM
map, since the neurons of the map have the same dimensionality as the original
space. Several visualization techniques to interpret and extract information from the
map have been proposed [2], [3], [4]. Nevertheless, most of them require some level of
human interpretation.
This paper proposes to address three aspects of SOM that have increasingly appeared
in the neural networks community and received attention: Manifold Learning,
Dimensionality Reduction and Intrinsic Dimension. Based on the assumption that the
map created by SOM is related to the input space distribution structure, we developed
a methodology that uses the neurons of the SOM map as support vectors for a
"walk" along the local manifold.
This paper is organized as follows: in Section 2, a brief review of manifold learning
and nonlinear dimensionality reduction techniques is presented. In Section 3, the
new proposal of how to interpret and understand the results of the SOM map based on the
input data manifolds is discussed. Section 4 presents some aspects of manifold learning
that are captured by the new methodology; Section 5 presents the computational
experiments using an artificial image database. Finally, Section 6 discusses the results
and concludes the paper.
2 Manifold Learning
Several studies have provided us with some insight about how to interpret the SOM
map [9], [10], [11]. One of the best known tools in this regard is the U-Matrix [4] that
provides a quantitative summary of the topological relationships between similar data
samples. The result of the U-Matrix map is a complex image (colored or monochro-
matic) indicating peaks and valleys that represent the Euclidean distances between
neighbor neurons. Essentially, the resulting map preserves the topological distribution
at the input space of the entire sample data considered.
However, to understand the relationship between the information captured in the
U-Matrix and the input samples, as well as to identify and explain the nature of the
groups or clusters defined by the manifolds, it would be helpful to represent all the
SOM neurons and their corresponding similarities and dissimilarities in the original
data space, instead of restricting such an analysis to the neighbor neurons. Based on
the principle of the locally optimal pathway and the idea of navigating on the neurons
that compose the SOM, in [8] we proposed the algorithm SOMM that seeks the path-
ways or manifolds described by the standard SOM. This work extends the understand-
ing of how SOMM extracts information from the manifold.
Considering that the neurons are support vectors of the data distribution structure, we
assume the possibility of using these vectors to "walk" on the surface of the manifold that
they represent. The SOMM algorithm constructs a graph G = (V, E), considering
the edges E as given by a path-cost function f for connectivity and the neurons of the
classical SOM map as vertices V. As pointed out above, a manifold can be defined as a
topological space that is locally Euclidean. Thus, the path-cost function f is assigned
to a distance πk = d(i, j) from neuron i to neuron j.
Starting from an initial neuron seed, the idea is to find the locally shortest path involving
the neuron seed and all neurons rk of the SOM map, incrementally
creating a list L = {r1, r2, …, rk} that represents such a pathway. Basically, the SOMM
algorithm is similar to the Dijkstra algorithm with multiple sources [13]. However, a
main difference is that SOMM does not find the minimum sum of path-costs from
neuron i to m; SOMM finds the minimum local path-cost πk from neuron i to its nearest
neighbor j. The next path cost πk+1 will start from neuron j, and so on. The SOMM
algorithm considers that the pathway started from neuron i will always create a path ahead
using the constraint of not going back to the previous neuron. However, the sub-graph
will end when path πk+1 indicates a neuron that already belongs to list L, except the
last path πk.
The motivation for this proposal was the result obtained by the visual reconstruction
of the neurons of the SOM map from face images, as described in [12]. The
SOMM algorithm can be described through the steps presented in Table 1.
Fig. 1. Two connected pathways with a common loop. The first one starts at neuron 1 and finally enters loop (11) → (3) → (4) → (11), whereas the second path starts at neuron (13) and ends in the same loop.
More specifically, in step (d), a list L is created and, in step (d.1), the algorithm inserts
each visited neuron into L. Given a neuron r, step (d.3) seeks the closest neuron
s* such that s* ∉ {Lc-1, Lc}; that means s* does not belong to the last visited edge
of the graph. This step implements a greedy algorithm that makes the locally optimal
choice at each stage, generating a locally optimal pathway that connects a subset of
SOM neurons. This is necessary because the idea is to generate a pathway that crosses
different clusters without losing the notion of similarity in the parameter space. If
s* ∈ L, then we have a loop, as the one exemplified in figure 1. In this case, the
pathway that starts at neuron 1 ends in loop (11)→(3)→(4)→(11). Step (e) completes
the pathway, which, in figure 1, is composed of the sequence: L =
(1)→(2)→(11)→(3)→(4)→(11).
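The greedy pathway construction described above can be sketched as follows; `somm_pathway` and its arguments are illustrative names, and edge cases are handled in the simplest way, so this is a sketch of the idea rather than the authors' implementation:

```python
import numpy as np

def somm_pathway(neurons, seed):
    """Greedy locally-optimal pathway over SOM codebook vectors: from the
    current neuron, always step to its nearest neighbor, excluding the last
    visited edge; the path ends when it reaches a neuron already in the list."""
    L = [seed]
    prev = None
    while True:
        r = L[-1]
        banned = {r, prev} if prev is not None else {r}
        candidates = [s for s in range(len(neurons)) if s not in banned]
        d = [np.linalg.norm(neurons[r] - neurons[s]) for s in candidates]
        s_star = candidates[int(np.argmin(d))]
        if s_star in L:          # loop detected: close the pathway and stop
            L.append(s_star)
            return L
        prev = r
        L.append(s_star)
```

On neurons placed at 0, 1, 2, 3 and 10 on a line, starting from the first neuron, the walk visits 0 → 1 → 2 → 3 and then closes back onto neuron 1, since it may not return to its immediate predecessor.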
In order to clarify how the SOMM methodology works, and mainly which aspects of the
data properties can be extracted by SOMM, we present the capability of SOMM to find
clusters and intrinsic dimensionality in an unsupervised way. Using two distinct
sets of artificial images, 10 ellipses and 10 squares with a smooth rotation of 10
degrees between consecutive images in a set, a SOM map with 20 neurons in a hexagonal 5×4
format was trained using the Matlab® SOM Toolbox released by [14]. Each artificial
image has a resolution of 229×229 pixels, with a white background and black lines.
The idea is to show how the standard SOM map is useful for extracting the
rotational relationship between the sample images in the training set. At first,
clusters are not directly visible in the SOM map. The SOMM methodology applies the
concept of manifolds, on which the Euclidean metric is locally valid. PCA (Principal
Component Analysis) was applied to the training set before the SOM processing, in
order to reduce memory and computation time. The projection matrix of the data in
PCA space was used to train the SOM network. However, when the neurons are returned
to the visual space (through reverse mapping from the PCA space back to the image
space), the clusters are not obvious. Figure 2(B) shows the visual reconstruction of
each neuron of the SOM map of figure 2(A) using the Backward Visualization method
described in [12]. The images that are not framed by a red square do not represent
any sample of the input space; they are called interpolating units and were not
associated with any input data at the end of the training phase.
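The PCA pre-processing, SOM training in PCA space, and reverse mapping back to image space can be sketched as follows. This is a minimal illustration with a hand-rolled PCA and a simplified one-dimensional SOM topology; the paper instead uses the Matlab SOM Toolbox [14] with a 5×4 hexagonal map, and all names below are ours:

```python
import numpy as np

def pca_fit(X, k):
    """Fit PCA: return the data mean and the top-k principal axes."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def pca_project(X, mu, comps):
    """Forward mapping: image space -> PCA space."""
    return (X - mu) @ comps.T

def pca_backward(Z, mu, comps):
    """Reverse mapping: PCA space -> image space (Backward Visualization style)."""
    return Z @ comps + mu

def train_som(Z, n_units=20, iters=2000, lr0=0.5, sigma0=2.0, seed=0):
    """A bare SOM with a simplified 1-D topology; returns the prototypes W."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_units, Z.shape[1])) * Z.std()
    grid = np.arange(n_units, dtype=float)
    for t in range(iters):
        x = Z[rng.integers(len(Z))]
        bmu = int(np.argmin(((W - x) ** 2).sum(axis=1)))   # best-matching unit
        frac = 1.0 - t / iters                              # decaying schedules
        h = np.exp(-((grid - bmu) ** 2) / (2 * (sigma0 * frac + 0.1) ** 2))
        W += lr0 * frac * h[:, None] * (x - W)
    return W
```

Training in PCA space and then applying `pca_backward` to the prototypes yields neuron "images" such as those in figure 2(B).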
In figures 2(A) and 2(B), neurons 5, 9, 15, and 19 of the SOM map are observed
to represent the squares, but the visual reconstruction does not convey this sense
of similarity. As the SOM works by Vector Quantization (VQ), neurons are
representations of the regions of the input space occupied by the samples of the
training set. The projections are extrapolations of input data that do not exist in
the training set. As mentioned before, manifold learning methods try to discover a
locally defined embedding function that describes the intrinsic similarities of the
data. Although the visual reconstruction in figure 2(B) is illustrative, it does not
provide clear visual evidence that the clusters were correctly determined. To better
assess these clusters, we applied the SOMM methodology. The visual result of the
clustering found by SOMM can be evaluated in figure 2(C), which shows the clusters
found by the methodology. Every neuron framed by a red rectangle represents a
prototype. The remaining neurons did not become prototypes at the end of the
training phase and do not represent any sample. Yet, for certain applications, they
can be considered interpolation units between two neurons. It can be observed that
SOMM captured a dynamic of movement embedded in the training set; this dynamic is
intrinsic to the manifold.
360 E.C. Kitani, E. Del-Moral-Hernandez, and L.A. Silva
Fig. 2. In (A), the SOM map after training with the ellipse and square image database. In
(B), the visual reconstruction of each neuron of the SOM map, and in (C), the clusters formed
by the SOMM methodology using the SOM map. All the images inside red squares are considered
prototypes. All numbers and labels were placed manually.
The computational experiment conducted with the artificial dataset showed that
SOMM can extract information from the SOM map and reorder it according to local
properties measured by Euclidean distance. This information is related to a degree
of similarity and an adjacency relation between neighboring neurons. This new
adjacency relation connects the neurons differently from the relation defined by the
initial SOM training. In other words, SOMM cuts the initial neighborhood
relationship between neurons and creates a new one, based on the final structure
learned from the training set.
This means that SOMM seeks local pathways on the manifold formed by all neurons,
based on the assumption that the neurons are distributed along the structure of the
input data. In [6], a computational experiment with ISOMAP was proposed using a set
of 698 face images artificially generated to represent different conditions of pose,
illumination and lighting direction. The purpose was to evaluate the capability of
ISOMAP to discover the intrinsic dimensionality embedded in that face database. The
present work uses the same database and conducts an experiment to evaluate the
manifold found by SOMM, using the SOM map as a reduced embedded structure of the
input data set. To reduce the computational effort, a pre-processing step using PCA
was applied to the face database. The final projection in PCA space created a new
matrix of size 698×697. As all the techniques used in this work rely on distance
measurements, the pre-processing with PCA is important in order to have normalized
data and avoid scale effects. The size of the SOM map was defined as 48 neurons in
ℜ697 and, after the training phase, Backward Visualization [12] is applied.
Figure 3 (left) shows the nearest face image from the data set for each neuron of
the SOM map. There is one neuron (39) that does not
SOMM – Self-Organized Manifold Mapping 361
Fig. 3. On the left, the neurons of the SOM map (8×6) were replaced by their nearest face
image from the training dataset. Six clusters found by the SOMM methodology can be seen on the
right. The white track lines in the SOM map represent the first pathway of the manifold, which
represents cluster 1, and the face images along that pathway. The remaining clusters represent
different regions of the manifold.
Additionally, the remaining clusters represent different regions of the manifold
and can be understood as regions where similar images lie. As each neuron represents
a group of input samples, a cluster can be seen as a visualization of that region of
the manifold surface, showing which input data samples are distributed along that
surface, and how. Not only was the direction of the pose clustered, but also
illumination differences, up-down head position and smooth changes across the
images. However, cluster number six seems to have a discontinuity. Analyzing the
first three principal components, which carry 60.1% of the global variance, it was
observed that the projections of that group of neurons belong to an overlap region;
it can be understood as a region lying on the border between clusters. This is a
different approach compared with ISOMAP [6] or equivalent methods, which try to
create a new and reduced mapping of the original manifold. The images show that they
lie near a local manifold, based on their distribution.
here), we constructed a neighborhood graph of the SOM neurons based on the
principle of the locally optimal path. Such a graph visualization method explicitly
provides information about the number of clusters that describe the sample data
under investigation, as well as the specific features extracted and explained by
them. We believe that the methodology proposed here can be a useful tool in SOM
analysis, providing an intuitive explanation of the topologically constrained
manifolds modeled by the SOM and highlighting latent variables such as left-right
pose, up-down pose and light direction, in all combinations of head pose.
Acknowledgements. The authors thank Dr. Gilson A. Giraldi and Dr. Carlos E.
Thomaz for their important and helpful discussions and collaboration during the pre-
vious work in [8] and [12].
References
1. Kohonen, T.: Self-organization and associative memory. Springer-Verlag New York, Inc.,
New York (1989)
2. Mayer, R., Rauber, A.: Visualising Clusters in Self-Organising Maps with Minimum
Spanning Trees. In: Diamantaras, K., Duch, W., Iliadis, L.S. (eds.) ICANN 2010, Part II.
LNCS, vol. 6353, pp. 426–431. Springer, Heidelberg (2010)
3. Pölzbauer, G., Rauber, A., Dittenbach, M.: Graph projection techniques for self-organizing
maps. In: ESANN 2005 European Symposium on Artificial Neural Networks, Bruges, pp.
533–538 (2005)
4. Ultsch, A.: Maps for the visualization of high-dimensional data spaces. In: Proc. of Work-
shop of Self Organizing Maps, pp. 225–228 (2003)
5. Sammon Jr., J.W.: A nonlinear mapping for data structure analysis. IEEE Transactions on
Computers C-18(5), 401–409 (1969)
6. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear
dimensionality reduction. Science 290, 2319–2323 (2000)
7. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding.
Science 290, 2323–2326 (2000)
8. Kitani, E.C., Del-Moral-Hernandez, E., Giraldi, A.G., Thomaz, C.E.: Exploring and under-
standing the high dimensional and sparse image face space: A self organized manifold
mapping. New approaches to characterization and recognition of faces, pp. 225–238. In-
tech Open Access Publisher (2011)
9. Brugger, D., Bogdan, M., Rosenstiel, W.: Automatic cluster detection in Kohonen’s SOM.
IEEE Transactions on Neural Networks 19(3), 442–459 (2008)
10. Bauer, H.U., Pawelzik, K.R.: Quantifying the neighborhood preservation of Self-
Organizing Feature Maps. IEEE Transactions on Neural Networks 3(4), 570–579 (1992)
11. Kiviluoto, K.: Topology preservation in self-organizing maps. In: IEEE International Con-
ference on Neural Networks, vol. 1, pp. 294–299 (1996)
12. Kitani, E.C., Del-Moral-Hernandez, E., Thomaz, C.E., Silva, L.A.: Visual interpretation of
Self Organizing Maps. In: IEEE-CS 11th Brazilian Symposium on Neural Networks
(SBRN), pp. 37–42. São Bernardo do Campo (2010)
13. Cormen, T.H., et al.: Introduction to algorithms, 2nd edn. MIT Press (2001)
14. Vesanto, J., et al.: SOM Toolbox for Matlab 5. Report A57, Helsinki University of
Technology, Helsinki (2000)
Self-Organizing Map and Tree Topology
for Graph Summarization
1 Introduction
Large graphs appear in several applications such as social networks, social
relations, interactions between proteins, etc. Currently, large graphs correspond to
real-world data with millions of nodes and edges. Thus, there is an increasing need
for graph summarization, for reasons related to either privacy protection or space
efficiency. It sometimes makes sense to replace the original graph with a summary,
which removes some details from the graph. Most work in graph summarization suggests
reducing the number of nodes by grouping similar nodes into the same clusters [1].
Other works propose compression techniques that consist of choosing super-nodes and
super-edges in the original graph [2]. In graph compression methods, nodes are
grouped based on the similarity of their relationships to other nodes, not by their
(direct) mutual relations [2].
The goal of summarization is to remove some information from the original large
graph to simplify its use. To address this problem, we propose SOM-tree to summarize
large graphs into smaller graphs. Our idea is to depict graphs on a 2D map using a
Self-Organizing Map (SOM). Each cell of our topological map consists of a tree-like
structure. SOM-based methods try to preserve the topological properties of the input
space; here, however, we are also interested in the tree-like topology of the data
gathered in each cell [3]. Thus, it is possible to summarize large graphs into
topological and tree-like organizations. The preliminary results obtained encourage
further evaluation with additional (and more sophisticated) graph datasets.
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 363–370, 2012.
© Springer-Verlag Berlin Heidelberg 2012
364 N.-Q. Doan, H. Azzag, and M. Lebbah
There is a large literature on graph clustering algorithms, which can be applied to
graph summarization because they tend to group similar nodes. We note important work
in community detection [4,5] and in modularity-based methods [6,7]. These graph
clustering algorithms have been used to detect community structures (dense
subgraphs) in large networks and are based only on node connectivities. In other
graph summarization methods [8,9], the authors use simple statistical approaches to
represent graph properties; this type of summary contains little information. In
fact, statistical measures are not sufficient to capture the information lying
within the graph structure.
In this paper, we present a new approach named SOM-tree, composed of a topological
and hierarchical organization for graph summarization. The proposed method provides
a summarized graph that contains the same number of nodes but fewer edges than the
original graph. Experts can observe not only the general topological view of the
graph nodes but also a hierarchical view of a particular part of the summarized
graph.
The remainder of this paper is organized as follows: Section 2 provides graph sum-
marization model. Section 3 is dedicated to the experiments that have been conducted
on three graph datasets. The paper ends with Section 4, which contains conclusions and
possible future works.
– treer : the tree associated with cell r. Each node of the tree represents a node of the input
graph xi .
– treexi : the subtree that has xi as its root together with all nodes recursively con-
nected to it (treexi ⊂ treer ).
– wr : the representative vector (or prototype) of cell r.
The objective function of the self-organizing map using the tree structure is written as follows:
\[
\mathcal{R}(\phi, \mathcal{W}) = \sum_{c=1}^{K} \sum_{x_i \in \mathrm{tree}_c} \sum_{r=1}^{K} \mathcal{K}^{T}\big(\delta(\phi(\mathrm{tree}_{x_i}), r)\big)\,\|x_i - w_r\|^2 \qquad (1)
\]
\[
\phi(x_i) = \arg\min_{r=1..K} \sum_{c=1}^{K} \mathcal{K}^{T}\big(\delta(r, c)\big)\,\|x_i - w_c\|^2 \qquad (2)
\]
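The assignment rule (2) picks, for each datum, the cell r that minimises the neighbourhood-weighted quantisation error summed over all cells. A small numpy sketch (δ and the kernel K^T are supplied by the caller; all names are ours):

```python
import numpy as np

def assign(x, W, delta, KT):
    """Eq. (2): winner r minimising the neighbourhood-weighted
    quantisation error of x over all cells c."""
    K = len(W)
    cost = np.array([
        sum(KT(delta(r, c)) * np.sum((x - W[c]) ** 2) for c in range(K))
        for r in range(K)
    ])
    return int(np.argmin(cost))
```

For a 1-D map, δ(r, c) = |r − c| and a Gaussian K^T recover the usual SOM winner selection.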
2. Tree construction: In this step we seek the best position of a given datum xi
(vi ∈ V ) in the treec associated with cell c. We use connection/disconnection rules
inspired by AntTree [3]. The particularity of the obtained tree is that each node,
whether a leaf or an internal node, represents a given datum xi . Here, Nxi denotes
the node associated with the datum xi , Nxpos represents the current node of the
tree, and Nxi+ is the node connected to Nxpos that is most similar (closest in
distance) to Nxi . TDist (Nxpos ) is the highest distance value observed among the
local neighborhood Npos ; xi is connected to Nxpos if and only if the connection of
Nxi further increases this value:
\[
T_{Dist}(N_{x_{pos}}) = \max_{j,k} \|N_{x_j} - N_{x_k}\|^2 = \max_{j,k} \|x_j - x_k\|^2 \qquad (3)
\]
In other words, the connection rules compare a node Nxi to its nearest node Nxi+ .
If the two nodes are sufficiently far apart (‖Nxi − Nxi+ ‖² > TDist (Nxpos )), the
node Nxi is connected at its current position Nxpos .
Otherwise, the node Nxi associated with the datum xi is moved toward the nearest
node Nxi+ . Therefore, the value TDist increases with each node connected to the
tree. In fact, each connection of a given datum xi implies a local minimization of
the corresponding TDist , and therefore a minimization of the cost function (1). At
the end of the tree construction step, each cell c of the map C is associated with a
treec . The connection rules are based on a nearest-neighbor approach: each datum is
connected to its nearest neighbor.
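The connect-or-move decision above can be sketched as follows. This is a simplified, non-authoritative reading of the AntTree-style rules: the tree is a plain dictionary from a node to its children, and one call decides a single step for one datum (all names are ours):

```python
import numpy as np

def t_dist(children, X):
    # Eq. (3): largest squared pairwise distance in the local neighbourhood
    return max((np.sum((X[j] - X[k]) ** 2)
                for j in children for k in children), default=0.0)

def step(x, pos, tree, X):
    """One connection step for datum x at current tree node `pos`.
    Returns ('connect', pos) or ('move', nearest_child)."""
    children = tree.get(pos, [])
    if not children:
        return 'connect', pos              # nothing local to compare against
    d2 = [np.sum((X[x] - X[c]) ** 2) for c in children]
    nearest = children[int(np.argmin(d2))]
    if min(d2) > t_dist(children, X):      # sufficiently far from all: connect
        return 'connect', pos
    return 'move', nearest                 # otherwise descend toward nearest
```

Iterating `step` until it returns `'connect'` places each datum in the tree of its cell.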
3. Optimization step: assuming that φ is fixed, this step minimizes R(φ, W) with
respect to W in the space Rd . It is easy to see that this minimization allows
defining the prototype of each treec using expression (4). Instead of centroids, we
choose leaders [12] as the representative vertices for all the vertices of a graph.
The leader is the node vi ∈ V (xi ∈ X) that has the highest degree in the original
graph G. The expression is defined as follows:
\[
L_r = \arg\max_{x_i \in \mathrm{tree}_r,\ v_i \in V} \deg(v_i) \qquad (4)
\]
where the local degree deg(vi ) is the number of edges incident to vi in the original
graph G.
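Leader selection (4) reduces to picking the highest-degree node among the vertices assigned to a cell; for example (a minimal sketch over an undirected edge list, names ours):

```python
from collections import Counter

def degrees(edges):
    """Node degrees in the original graph G (undirected edge list)."""
    d = Counter()
    for u, v in edges:
        d[u] += 1
        d[v] += 1
    return d

def leader(nodes, deg):
    """Eq. (4): the cell representative is its highest-degree node."""
    return max(nodes, key=lambda v: deg[v])
```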
3 Experimental Results
In this section, we perform experiments to evaluate the performance of our method
on real graph data. We first compare the efficiency of different clustering
algorithms: MST (Minimum Spanning Tree), which builds tree structures [13], and SOM.
In the case of SOM we adopt the same initial parameters. Here we fix d = √n for each
dataset to reduce the number of dimensions. The map size is respectively 3 × 3
(K = 9 clusters) for ”Adjective and noun”; 5 × 3 (K = 15) for ”Football Teams”; and
5 × 5 (K = 25) for ”Political blogs”. The motivation is to choose a map size larger
than the number of classes, so that the map is able to cover the graph space.
In order to evaluate the performance, we selected two criteria, each of which
should be maximized: Accuracy and Normalized Mutual Information [14]. The studied
databases are presented in Table 1 and are available at
http://www-personal.umich.edu/~mejn/netdata/. Results have been averaged over 10 runs.
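Normalized Mutual Information can be computed from the contingency between the ground-truth classes and the obtained clusters. The following self-contained sketch uses one common normalisation (the geometric mean of the entropies), which may differ in detail from the variant used in [14]:

```python
import numpy as np

def nmi(labels_true, labels_pred):
    """Normalized Mutual Information: MI(T, P) / sqrt(H(T) * H(P))."""
    t, p = np.asarray(labels_true), np.asarray(labels_pred)
    n, eps = len(t), 1e-12

    def entropy(x):
        _, counts = np.unique(x, return_counts=True)
        q = counts / n
        return -np.sum(q * np.log(q + eps))

    mi = 0.0
    for a in np.unique(t):
        for b in np.unique(p):
            pab = np.mean((t == a) & (p == b))       # joint probability
            if pab > 0:
                mi += pab * np.log(pab / ((t == a).mean() * (p == b).mean()))
    return mi / (np.sqrt(entropy(t) * entropy(p)) + eps)
```

A perfect clustering (up to label permutation) scores 1, and a clustering independent of the classes scores near 0.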
First of all, the quality of the clustering depends on the choice of d, but in this
paper we intend to optimize neither the value of d, nor the map size, nor their
influence on the quality measures. Looking at the columns associated with SOM-tree
(Self-Organizing Map-Tree),
we can notice that our approach provides comparable or better results. Our purpose
in this comparison is not to assert that our method is the best, but to show that
the SOM-tree method can obtain results as good as other clustering algorithms, with
the additional information provided by the topological and hierarchical organization.
We are interested in analyzing the connection between pairs of nodes in the new
representation. For this purpose we explore the structure of the graph by analyzing
the added and/or deleted direct links between pairs of nodes of the original graph G.
We show how our method simplifies the exploration of the original graph by offering
a user-friendly summarization and visualization. We use Tulip [15] as the framework
to visualize and analyze the graph. Using the GEM layout, we provide two types of
architecture for visualizing the graph:
1. Default: the graph is drawn from the original collection of edges and vertices.
To discriminate leaders from other nodes, the size of a node depends on its
degree.
2. Summarized: here we propose a new organization of the graph which has fewer
edges than the default organization, so that it is easier to read and interpret.
The new graph is drawn from graph nodes as well as map nodes; their structure,
however, has the form of a hierarchy and a topology. The map nodes, in black, are
located in the center, surrounded by trees. There are as many trees as map cells.
We note that each leader node is represented in the graph by a large node with the
symbol L. Every cluster is represented by one leader and one color.
map) contain less than 3% of the nodes of the original graph, and the other clusters
are very large compared to the three smallest ones.
Case of ”Football”: Unlike the others, this dataset has a more balanced distribu-
tion of data. The ”Football” dataset has 115 nodes classified into 10 different
classes; the obtained visualization is shown in Figure 2. The number of nodes is
quite balanced across cells. Most nodes belonging to the same class are grouped in
one cluster. We also observe several pure clusters, or trees, represented by the
leaders L1, L2, L6. In this case, the differences between the proposed decomposition
and the ground truth are not important. As additional information, the SOM-tree
organization and the original graph have several links in common: we observe that
80% of the direct edges in the SOM-tree organization are also direct links in the
original graph.
in this situation, to clusters that are better adapted to visual exploration.
Indeed, the SOM-tree organization contains only 4% of the direct edges of the
original graph, whereas the purity value is about 0.85. SOM-tree eliminates external
direct links (between two clusters) and replaces internal direct links by building a
path between the corresponding nodes in the same tree. The SOM-tree organization
also provides a decomposition of the graph that allows better interaction with the
data (Figure 3(b)). The external interactions are shown by the topological map; it
is very difficult to visualize the interactions between the nodes in the original
graph.
After studying the obtained summarization, we notice that the visual results given
by our method lead to important insight into the graph content. Our approach
improves the standard decomposition and visualization by building a topological map
and a tree topology of the data. Atypical nodes are clearly pinpointed with this
approach and can be further studied by the analyst. The summarization provides a
clear visualization in which analysts can easily navigate: a hierarchical visual
exploration, from the topological level down to the last level of the trees,
provides useful graph information.
4 Conclusion
In this paper, we have proposed a graph summarization method using a Self-
Organizing Map and a new hierarchical clustering. This novel method provides a new
look at self-organizing models, allowing better visualization of the graph
organization. The obtained graph captures both the hierarchical distribution of the
nodes and their structured topology. The experiments show that SOM-tree works well
on several real-world datasets. In practice, our model reduces the number of edges
and provides a tree-like graph for every cell of the map (grid). Furthermore, our
model offers a user-friendly visualization space which consists of both tree
structures and a topological map. The benefit of our approach is to permit the user
a fast data analysis using a smaller graph.
As future work, we will study the influence of the number of selected eigenvectors
and an incremental graph summarization; we will show how sub-graphs evolve over
time. Another perspective is to study biological and biomedical graphs, where
several problems involving hierarchical structure arise in the case of expressed
genes.
References
1. Navlakha, S., Rastogi, R., Shrivastava, N.: Graph summarization with bounded error. In:
SIGMOD Conference, pp. 419–432 (2008)
2. Toivonen, H., Zhou, F., Hartikainen, A., Hinkka, A.: Compression of weighted graphs. In:
Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, KDD 2011, pp. 965–973. ACM, New York (2011)
3. Azzag, H., Venturini, G., Oliver, A., Guinot, C.: A hierarchical ant based clustering al-
gorithm and its use in three real-world applications. European Journal of Operational Re-
search 179(3), 906–922 (2007)
4. Flake, G.W., Lawrence, S., Lee Giles, C.: Efficient identification of web communities. In:
Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, KDD 2000, pp. 150–160. ACM, New York (2000)
5. Ino, H., Kudo, M., Nakamura, A.: Partitioning of web graphs by community topology. In:
Proceedings of the 14th International Conference on World Wide Web, WWW 2005, pp.
661–669. ACM, New York (2005)
6. Rossi, F., Villa-Vialaneix, N.: Optimizing an organized modularity measure for topographic
graph clustering: A deterministic annealing approach. Neurocomput. 73, 1142–1163 (2010)
7. Newman, M.E.J.: Modularity and community structure in networks. Proceedings of the Na-
tional Academy of Sciences 103(23), 8577–8582 (2006)
8. Chakrabarti, D., Faloutsos, C.: Graph mining: Laws, generators, and algorithms. ACM Com-
put. Surv. 38 (June 2006)
9. Chakrabarti, D., Faloutsos, C., Zhan, Y.: Visualization of large networks with min-cut plots,
a-plots and r-mat. Int. J. Hum.-Comput. Stud. 65(5), 434–445 (2007)
10. Chung, F.R.K.: Spectral Graph Theory (CBMS Regional Conference Series in Mathematics,
No. 92). American Mathematical Society (February 1997)
11. von Luxburg, U.: A tutorial on spectral clustering. Statistics and Computing 17, 395–416 (2007)
12. Stanoev, A., Smilkov, D., Kocarev, L.: Identifying communities by influence dynamics in
social networks. CoRR, abs/1104.5247 (2011)
13. Grygorash, O., Zhou, Y., Jorgensen, Z.: Minimum spanning tree based clustering algorithms.
In: Proceedings of the 18th IEEE International Conference on Tools with Artificial Intelli-
gence, ICTAI 2006, pp. 73–81. IEEE Computer Society, Washington, DC (2006)
14. Strehl, A., Ghosh, J., Cardie, C.: Cluster ensembles - a knowledge reuse framework for com-
bining multiple partitions. Journal of Machine Learning Research 3, 583–617 (2002)
15. Auber, D.: Tulip: A huge graph visualisation framework. In: Mutzel, P., Jünger, M. (eds.)
Graph Drawing Softwares. Mathematics and Visualization, pp. 105–126. Springer (2003)
Variable-Sized Kohonen Feature Map
Probabilistic Associative Memory
1 Introduction
In the real world, it is very difficult to obtain all the information to be
learned in advance, so we need a model that can realize successive (additional)
learning. However, most conventional neural network models cannot realize successive
learning. As models that can, several associative memories based on the Kohonen
Feature Map (KFM) [1] have been proposed [2]–[6]. Although most of them can realize
one-to-many associations [4]–[6] and probabilistic associations [5], their storage
capacities depend on the number of neurons in the Map Layer, and they cannot learn
more new patterns than their original storage capacities allow.
In this paper, we propose the Variable-sized Kohonen Feature Map Probabilistic
Associative Memory (VKFMPAM). This model is based on the conventional Improved KFM
Probabilistic Associative Memory based on Weights Distribution [5] and the
Variable-sized KFM Associative Memory with Refractoriness based on Area
Representation [6]. The proposed model realizes probabilistic association for
training sets including one-to-many relations. Weight-fixed and semi-fixed neurons
are introduced, so that new patterns can be memorized. Moreover, when unknown
patterns are given, neurons can be added to the Map Layer if necessary.
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 371–378, 2012.
© Springer-Verlag Berlin Heidelberg 2012
372 H. Sato and Y. Osana
2.1 Structure
Figure 1 shows the structure of the proposed model. As shown in Fig.1, the
proposed model has two layers; (1) Input/Output Layer and (2) Map Layer,
and the Input/Output Layer is divided into some parts. In the proposed model,
neurons can be added in the Map Layer if necessary, so the distance between
neurons in the Map Layer is not equal.
(1) The connection weights are initialized randomly. In the proposed model, the
initial Map Layer has xmax × ymax neurons.
(2) The Euclidean distances d(X (p) , W i ) between the input vector X (p) and the
weight vector W i are calculated for all neurons in the Map Layer.
(3) If d(X (p) , W i ) > θt is satisfied for all neurons in the Map Layer, the input
pattern X (p) is regarded as an unknown pattern; go to (4). Otherwise, the
input pattern is regarded as one of the known patterns; go to (8).
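Steps (2)–(3) amount to a thresholded nearest-neighbour test; a minimal sketch (function and variable names are ours):

```python
import numpy as np

def classify_input(X_p, W, theta_t):
    """Compare the input with every Map Layer weight vector; if every
    distance exceeds the threshold, the pattern is unknown."""
    d = np.linalg.norm(W - X_p, axis=1)   # d(X^(p), W_i) for all neurons i
    if np.all(d > theta_t):
        return 'unknown', None
    return 'known', int(np.argmin(d))     # winner among the known patterns
```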
(4) The center neuron of the learning area is selected.
(4-1) If there is no weight-fixed neuron, the neuron c with the shortest Euclidean
distance is selected as the center of the learning area; go to (4-3).
Otherwise go to (4-2).
(4-2) It is checked whether the area for the input pattern X (p) can be taken
without overlapping the areas of the stored patterns. For the weight-
fixed neurons z, if
is satisfied, the neuron i can be a center of the learning area. Here, dmax is the
maximum distance between adjacent neurons, with dmax = 1, and dmin_z is the distance
from the weight-fixed neuron z to its nearest neuron. Diz is the constant which
decides the size of the area centered at neuron i in the direction of neuron z. In
the proposed model, the actual radius of the area centered at neuron z is given by
dmin_z Dzi , and diz is the distance between the neuron i and the weight-fixed
neuron z.
If there are some candidate neurons, go to (4-6). Otherwise, go to
(4-3).
(4-3) It is checked whether the area for the input pattern X (p) can be taken
without overlapping the areas of the stored patterns when the distance
between neurons in the areas of the stored patterns is reduced to
φn (dmin_z ). Here, φn (·) is given by
\[
\varphi_n(d) = \begin{cases} d/2^n, & d/2^n \ge d_{min} \\ d_{min}, & \text{otherwise} \end{cases} \qquad (2)
\]
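The reduction function φn of Eq. (2) simply halves the spacing n times, clamped below by dmin; a direct transcription:

```python
def phi(d, n, d_min):
    """Eq. (2): halve the spacing n times, but never below d_min."""
    return d / 2 ** n if d / 2 ** n >= d_min else d_min
```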
If this condition is satisfied, the neuron i can be a center of the learning area.
The neurons that satisfy the condition given by Eq.(3) for all weight-fixed neurons
are selected as candidates for the center of the learning area. If there are
candidate neurons, go to (4-6); otherwise, go to (4-4).
(4-4) It is checked whether the area for the input pattern X (p) can be taken
without overlapping the areas of the stored patterns when the distance
between neurons in the areas of the stored patterns is reduced to
φn (dmin_z ) and the distance between neurons in the area for the input
pattern is set to φn (dmax ).
For all weight-fixed neurons z,
(4-6) From the neurons selected as center candidates of the learning area in
(4-2)∼(4-4), the neuron c whose weight vector is at minimum Euclidean
distance from the input vector X (p) is selected.
(5) Neurons are added to the Map Layer if necessary.
(5-1) If the center candidates were selected in (4-3) or (4-4), the distance
between neurons in the areas of the stored patterns is reduced, and
neurons are added.
If the center candidates were selected in (4-3), for the area whose center
is the neuron z which satisfies
the distance between neurons in the area is reduced, and neurons are
added for the neurons i which satisfy
\[
d^{min}_z D_{zi} \ge d_{iz} \qquad (7)
\]
If a neuron already exists at (xi , yi ), no neuron is added there; the weight
vector of the neuron i is set to W i .
If the center candidates were selected in (4-4), new neurons are added in
the area whose center z satisfies
(5-2) If the center candidates were selected in (4-3) or (4-4), new neurons are
added in the area for the new pattern X (p) .
If the center candidates were selected in (4-3) and n ≥ 1, neurons are
added in the area whose center is the neuron c. Here, the neurons
corresponding to the neurons which satisfy
If a neuron already exists at (xi , yi ), no neuron is added there, and the
weight vector W i is generated randomly.
If the center candidates were selected in (4-4), new neurons are added in
the area whose center is the neuron c. Here, the neurons corresponding
to the neurons which satisfy
If a neuron already exists at (xi , yi ), no neuron is added there. The
weight vector W i is generated randomly.
(6) The input pattern X (p) is trained in the area whose center is the neuron c.
The connection weights which are not fixed are updated by
\[
W_i(t+1) = \begin{cases}
X^{(p)}, & d_{ci} \le d^{min}_c D_{ci} \ \text{and}\ d^{x}_{ci}/d^{min}_c \in Z \ \text{and}\ d^{y}_{ci}/d^{min}_c \in Z \\[4pt]
W_i(t) + H(d_{ci})\,\big(X^{(p)} - W_i(t)\big), & d^{min}_c D_{ci} < d_{ci} \le l\,(d^{min}_c D_{ci}) \ \text{and}\ d^{min}_z D_{zi} < d_{zi} \ (\forall z \in C_{fix}) \\[4pt]
W_i(t), & \text{otherwise}
\end{cases} \qquad (17)
\]
where Z is the set of integers, l is the coefficient which decides the neighbor-
hood area size, and Cfix is the set of the weight-fixed neurons. H(dci ) is the
neighborhood function, given by
\[
H(d_{ij}) = \frac{1}{1 + \exp\big((d_{ij} - D)/\varepsilon\big)} \qquad (18)
\]
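The neighbourhood function (18) is a sigmoid in the distance: close to 1 for neurons well inside the neighbourhood radius D, close to 0 far outside it, with ε controlling the sharpness of the transition. A direct transcription:

```python
import numpy as np

def H(d, D, eps):
    """Eq. (18): sigmoid neighbourhood, ~1 for d << D and ~0 for d >> D."""
    return 1.0 / (1.0 + np.exp((d - D) / eps))
```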
The probability that a pattern with a large area is recalled is higher than the
probability that a pattern with a small area is recalled. As a result, the
probabilistic association can be realized based on the weights distribution.
When the pattern X is given to the Input/Output Layer, the output of the neuron i
in the Map Layer, x_i^map, is calculated by
\[
x^{map}_i = \begin{cases} 1, & i = r \\ 0, & \text{otherwise} \end{cases} \qquad (20)
\]
where r is selected randomly from the neurons which satisfy
\[
\frac{1}{N^{in}} \sum_{k \in C_{in}} g(X_k - W_{ik}) \ge \theta^{map} \qquad (21)
\]
where N in is the number of neurons in the Input/Output Layer and Cin is the set of
neurons which receive the input in the Input/Output Layer. The function g(·) is
given by
\[
g(u) = \begin{cases} 1, & |u| < \theta_d \\ 0, & \text{otherwise} \end{cases} \qquad (22)
\]
where θd and θmap are thresholds.
When the pattern X is given to the Input/Output Layer, the output of the neuron k
in the Input/Output Layer, x_k^io, is given by
\[
x^{io}_k = \begin{cases}
W_{rk}, & X \text{ is an analog pattern} \\
1, & X \text{ is a binary pattern and } W_{rk} \ge 0.5 \\
0, & X \text{ is a binary pattern and } W_{rk} < 0.5
\end{cases} \qquad (23)
\]
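The recall procedure of Eqs. (20)–(23) can be sketched for the analog case as follows (array layout and names are ours; the binary case would additionally threshold the output at 0.5):

```python
import numpy as np

def recall(X, W, in_idx, theta_d, theta_map, rng=np.random.default_rng(0)):
    """Pick a Map Layer winner r at random among neurons whose weights
    match the given part of the pattern, then read out W_r."""
    g = (np.abs(X[in_idx] - W[:, in_idx]) < theta_d).astype(float)  # Eq. (22)
    score = g.mean(axis=1)                                          # Eq. (21)
    candidates = np.flatnonzero(score >= theta_map)
    r = rng.choice(candidates)                                      # Eq. (20)
    return W[r]                                                     # Eq. (23)
```

Because every matching neuron is an equally likely winner, a pattern stored with a larger area (more neurons) is recalled with proportionally higher probability, as described above.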
Fig. 2. One-to-Many Associations: (a) stored patterns; (b) recall when “lion” was given;
(c) recall when “crow” was given (snapshots at t = 1, 2, 4 and t = 501, 504, 514; pairs such
as lion–bear and lion–monkey are recalled).
Table 1. The Number of Recall Times.
Fig. 4. Storage Capacity: recall rate versus the number of stored patterns, for the proposed
model (100 neurons in the Map Layer) and the conventional model [5] (400 neurons in the Map
Layer).
Fig. 5. Robustness for Damaged Neurons/Noise: (a) recall rate versus the rate of damaged
neurons in the Map Layer; (b) recall rate versus the noise rate, for the proposed and
conventional [5] models.
4 Conclusions
In this paper, we have proposed the Variable-sized Kohonen Feature Map Prob-
abilistic Associative Memory, which realizes probabilistic association for training
sets including one-to-many relations. We carried out a series of computer
experiments and confirmed the effectiveness of the proposed model.
References
1. Kohonen, T.: Self-Organizing Maps. Springer (1994)
2. Ichiki, H., Hagiwara, M., Nakagawa, M.: Kohonen feature maps as a supervised
learning machine. In: ICNN, pp. 1944–1948 (1993)
3. Yamada, T., Hattori, M., Morisawa, M., Ito, H.: Sequential learning for associative
memory using Kohonen feature map. In: IJCNN, Washington D.C. (1999)
4. Abe, H., Osana, Y.: Kohonen feature map associative memory with area represen-
tation. In: IASTED AIA, Innsbruck (2006)
5. Noguchi, S., Osana, Y.: Improved Kohonen feature map probabilistic associative
memory based on weights distribution. In: IJCNN, Barcelona (2010)
6. Imabayashi, T., Osana, Y.: Variable-sized KFM associative memory with refractori-
ness based on area representation. In: SMC, San Antonio (2009)
Learning Deep Belief Networks
from Non-stationary Streams
1 Introduction
Machine learning typically assumes that the underlying process generating the
data is stationary. Moreover, the dataset must be sufficiently rich to represent this
process. These assumptions do not hold for non-stationary environments such as
time-variant streams of data (e.g., video). In different communities, a number of
approaches exist to deal with non-stationary streams of data: Adaptive Learning
[4], Evolving Systems [2], Concept Drift [15], Dataset Shift [12]. In all these
paradigms, incomplete knowledge of the environment is sufficient during the
training phase, since learning continues during run time.
Within the adaptive learning framework, there is a new set of issues to be
addressed when dealing with large amounts of continuous data online: limita-
tions on computational time and memory. In fast changing environments, even
a partially correct classification can be valuable.
Since its introduction [8,3], Deep Learning has proven to be an effective method
to improve the accuracy of Multi-Layer Perceptrons (MLPs) [6]. In particular,
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 379–386, 2012.
© Springer-Verlag Berlin Heidelberg 2012
380 R. Calandra et al.
Deep Belief Networks (DBNs) have been well-established and can be considered
the state of the art for artificial neural networks. To the best of our knowledge,
DBNs have not been used to incrementally learn from non-stationary streams
of data. Dealing with changing streams of data with classical DBNs requires storing
at least a subset of the previous observations, similar to other non-linear
approaches to incremental learning [10,1]. However, storing large amounts of
data can be impractical.
The contributions of this paper are two-fold. Firstly, we study the generative
capabilities of DBNs as a way to extract and transfer learned beliefs to other
DBNs. Secondly, based on the possibility of transferring knowledge between DBNs,
we introduce a novel approach called Adaptive Deep Belief Networks (ADBN)
to cope with changing streams of data while requiring only constant memory
consumption. With the ADBN it is possible to use the DBN parameters to
generate observations that mimic the original data, thus reducing the memory
requirement to storing only the model parameters. Moreover, the data compres-
sion properties of DBNs already provide an automatic selector of the relevant
extracted beliefs. To the best of our knowledge, the proposed ADBN is the first
approach toward generalizing Deep Learning to incremental learning.
...
...
...
Classifier Regeneration. DBNs can generate unlabeled samples that mimic the
distribution of the training inputs. When making use of the DBN/classifier con-
figuration it is still possible to generate unlabeled samples with the generative
connections. Furthermore, these samples can be used as a standard input for the
recognition connections and, thus, be classified. Hence, this procedure allows the
generation of datasets of labeled samples. Similarly to Belief Regeneration, we use this artificially generated dataset to train a second DBN/classifier, in what we call Classifier Regeneration.
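The regeneration procedure described above can be sketched as follows (a toy illustration, not the authors' code: `Model` is a stand-in for a DBN/classifier pair, with `generate` playing the role of the generative connections and `classify` the recognition connections):

```python
# Illustrative sketch of Classifier Regeneration: a trained generator/classifier
# pair produces an artificial labeled dataset, which then trains a fresh model.
import random

class Model:
    def __init__(self):
        self.samples = []          # toy "parameters": memorized labeled prototypes
    def fit(self, data):
        self.samples = list(data)  # labeled pairs (x, y)
    def generate(self, n, rng=random):
        # generative pass: unlabeled samples mimicking the training inputs
        return [rng.choice(self.samples)[0] for _ in range(n)]
    def classify(self, x):
        # recognition pass: label by the nearest memorized prototype
        return min(self.samples,
                   key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))[1]

def regenerate(old, n_samples):
    xs = old.generate(n_samples)                  # generate unlabeled samples
    labeled = [(x, old.classify(x)) for x in xs]  # label them with the classifier
    new = Model()
    new.fit(labeled)                              # train the second DBN/classifier
    return new
```

Only the old model's parameters are needed to produce the new training set; no original observations are stored, which is the property the ADBN exploits.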
4 Experiments
To evaluate the properties of our models we used the hand-written digit recog-
nition MNIST dataset [11] in our experiments. The MNIST dataset consists of a
training set of 60000 observations and a test set of 10000 observations, where every observation is an image of 28×28 binary pixels stored as a vector (no spatial
information was used).
To train the RBMs, we used the algorithm introduced by Cho et al. [5] that
makes use of Contrastive Divergence learning (CD-1). We used Gibbs sampling
to generate samples from a trained DBN. The reconstruction error over a dataset is defined as

\[
R(X) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{D} (X_{ij} - \hat{X}_{ij})^2,
\tag{1}
\]
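Eq. (1) transcribes directly to code (pure Python; `X` and `X_hat` are assumed to be N×D nested lists of observations and their reconstructions):

```python
def reconstruction_error(X, X_hat):
    # Eq. (1): mean over the N observations of the squared error summed
    # over the D pixels of each observation.
    N = len(X)
    return sum(
        sum((x - xh) ** 2 for x, xh in zip(row, row_hat))
        for row, row_hat in zip(X, X_hat)
    ) / N
```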
[Figure: reconstruction error of the regenerated DBN (Reg. DBN) compared to the original DBN as a function of the number of samples used for regeneration (10² to 10⁴).]
DBN trained with the full dataset. However, it is also computationally expensive
to generate many samples. In our experiment, there seems to be a clear threshold
at 750 samples above which the original DBN can be considered well approxi-
mated. A further indication is given by the visual inspection of the generated
samples from the regenerated DBN in Fig. 4. Above 750 samples there is little difference, for a human observer, between the samples generated from the original and the reconstructed DBN (top row of Fig. 4).
Fig. 5 shows that for chained regenerations the reconstruction error gradually increases with the number of sequential reconstructions. Similar conclusions are visually drawn from the generated samples in Fig. 6, where after 100 generations of regeneration (using 10000 samples at each generation) there is a visible degradation in the quality of the generated samples. The reason for this degradation is the error propagation between sequential regenerations.
However, fine-tuning a 100th generation DBN shows little decrease in terms
of classification accuracy compared to fine-tuning the original DBN, as shown
in Fig. 7. This result suggests that despite becoming humanly incomprehensible
(Fig. 6), the generated samples retain valuable features for training a DBN and
still prove to be useful during an eventual fine-tuning: Fine-tuning initialized
from a regenerated DBN led to a similar optimum as the original DBN.
[Fig. 5: reconstruction error of the regenerated DBN (Reg. DBN) compared to the original DBN as a function of the number of sequential regenerations (0 to 100).]
Adaptive Deep Belief Networks. We trained a DBN and classifier using 3 digits
(8,6,4) of the MNIST dataset. Every 50 fine-tuning iterations, we presented a
new batch of data containing samples from a novel digit to the ADBN. These
samples, together with the generated ones, were then used to re-train both the
DBN and the classifier, see Sec. 3. Fig. 10 shows the classification accuracy and
memory consumption of the ADBN on all 10 digits when adding new digits to the
data set. The accuracy increases, which means that the ADBN can successfully
learn from new data while at the same time retaining the knowledge previously
acquired. In Fig. 10, we compare to a DBN that is trained on all the previous observations, which leads to a higher classification accuracy but at the expense of
the memory consumption. While the DBN's memory consumption increases linearly (as we store more and more observations), the amount of memory required by the ADBN is constant: only the model parameters need to be stored.

Fig. 9. Comparison of the classification accuracies on the MNIST test set during the training of the classifier on top, using the original and regenerated classifiers after different N-generations

Fig. 10. Classification accuracy and memory consumption of the DBN and the ADBN on all 10 digits during the training phase (Init (8,6,4), add 0, add 3, add 5, add 2, add 7, add 1, add 9): initially trained with 3 digits (8,6,4); for every novel digit introduced, the DBN/classifier is regenerated. A classical DBN achieves higher classification accuracy but at the expense of memory consumption. In contrast, the ADBN requires only a constant memory.
The number of samples generated at each epoch of the ADBN training can be a sensitive parameter. In particular, the ratio between generated samples and novel observations can be used to modify the stability/plasticity trade-off.
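The update loop implied by this trade-off can be sketched as follows (our own hedged sketch, not the paper's implementation; `ToyModel`, `adbn_step`, and `replay_ratio` are illustrative names, with `replay_ratio` standing for the ratio between generated samples and novel observations):

```python
# Sketch of a constant-memory adaptive update: at each step, samples generated
# from the current model are mixed with the novel observations; the mixing
# ratio controls the stability (replay) / plasticity (novel data) trade-off.
class ToyModel:
    def __init__(self):
        self.params = []
    def fit(self, data):
        self.params = sorted(set(data))   # toy "parameters" summarizing the data
    def generate(self, n):
        return [self.params[i % len(self.params)] for i in range(n)]

def adbn_step(model, novel_batch, replay_ratio=1.0):
    """Retrain `model` on novel data plus self-generated replay.

    Only the model parameters persist between steps; past observations
    are never stored, so memory stays constant.
    """
    n_replay = int(replay_ratio * len(novel_batch))
    replay = model.generate(n_replay) if n_replay > 0 else []
    model.fit(list(novel_batch) + replay)
    return model
```

A larger `replay_ratio` favors stability (the old knowledge dominates), while a smaller one favors plasticity toward the novel digit.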
Finally, an interesting extension to our approach is the possibility of changing the topology of the network adaptively at run time, in order to adapt its capacity to the complexity of the environment.
Acknowledgments. We thank Olli Simula and Jan Peters for invaluable dis-
cussions and the friendly environments they provided to realize this paper. The
research leading to these results has received funding from the European Commu-
nity’s Seventh Framework Programme (FP7/2007-2013) under grant agreement
#270327 and by the DFG within grant #SE1042/1.
References
1. Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: On demand classification of data
streams. In: Proceedings of KDD 2004, pp. 503–508 (2004)
2. Angelov, P., Filev, D.P., Kasabov, N.: Evolving Intelligent Systems: Methodology
and Applications. Wiley-IEEE Press (2010)
3. Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training
of deep networks. In: Proceedings of NIPS 2006, vol. 19, pp. 153–160 (2006)
4. Bifet, A. (ed.): Proceedings of the 2010 Conference on Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams (2010)
5. Cho, K., Raiko, T., Ilin, A.: Enhanced gradient and adaptive learning rate for training restricted Boltzmann machines. In: Proceedings of ICML 2011, pp. 105–112 (2011)
6. Erhan, D., Courville, A., Bengio, Y., Vincent, P.: Why does unsupervised pre-
training help deep learning? In: Proceedings of AISTATS 2010, pp. 201–208.
7. Erhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., Vincent, P.: The difficulty of
training deep architectures and the effect of unsupervised pre-training. In: Pro-
ceedings of AISTATS 2009, pp. 153–160 (2009)
8. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief
nets. Neural Computation 18(7), 1527–1554 (2006)
9. Igel, C., Hüsken, M.: Improving the RPROP learning algorithm. In: Proceedings
of NC 2000, pp. 115–121 (2000)
10. Last, M.: Online classification of nonstationary data streams. Intelligent Data Anal-
ysis 6(2), 129–147 (2002)
11. LeCun, Y., Cortes, C.: MNIST handwritten digit database (2010),
http://yann.lecun.com/exdb/mnist/
12. Quiñonero Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D.: Dataset
Shift in Machine Learning. The MIT Press (2009)
13. Riedmiller, M., Braun, H.: A direct adaptive method for faster backpropagation
learning: the RPROP algorithm. In: IEEE International Conference on Neural
Networks, vol. 1, pp. 586–591 (1993)
14. Salakhutdinov, R.: Learning deep generative models. PhD thesis, University of
Toronto (2009)
15. Zliobaite, I.: Learning under concept drift: an overview. CoRR (2010)
Separation and Unification of Individuality
and Collectivity and Its Application to Explicit Class
Structure in Self-Organizing Maps
Ryotaro Kamimura
Abstract. In this paper, we propose a new type of learning method in which in-
dividuality and collectivity are separated and unified to control the characteristics
of neurons. This unification is expected to enhance the characteristics shared by
individual and collective outputs, while the characteristics specific to them are
weakened. We applied the method to self-organizing maps to demonstrate the
utility of unification. In self-organizing maps, the introduction of unification has
the effect of controlling cooperation among neurons. Experimental results on the
glass identification problem from the machine learning database showed that ex-
plicit class boundaries could be obtained by introducing the unification.
1 Introduction
In this paper, we propose a new type of learning method based upon the separation and
unification of individuality and collectivity of neurons. The utility of this unification can
be explained through self-organizing maps. The collectivity of neurons has received due
attention in the field of self-organizing maps [1], because SOMs have been concerned
with the collective behavior of neurons. One of the main principles for this interaction
is that neighboring neurons behave in the same way. We think that this property of neighboring neurons is related to the difficulty of visualizing final results with conventional SOMs. For example, conventional SOMs have been used to clarify
class boundaries. Naturally, class boundaries are based on some discontinuity between
neighboring neurons. In SOMs, the neighboring neurons cooperate with each other, and
discontinuity between neurons tends to be reduced. Because of this difficulty in representing class boundaries using conventional SOMs, many different kinds of visualization techniques have been proposed [2], [3], [4], [5], [6], to cite a few. However, even with these visualization methods, class structure still cannot be easily identified. This fact suggests that we need to weaken the collectivity of neurons
for improved visualization. Our method aims to control cooperation or collectivity by
introducing the individuality of neurons.
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 387–394, 2012.
© Springer-Verlag Berlin Heidelberg 2012
388 R. Kamimura
Fig. 1. Two examples of learning processes where individual outputs interact with collective outputs to produce unified outputs
individual output, while output from a collectively treated neuron becomes collective
output. We try to unify these two forms of output into a third form, namely unified out-
put. One of the main objectives of this unification is to enhance characteristics common
to individual and collective outputs, and to weaken the characteristics specific to each
output. For example, Figure 1(a) shows an example of the enhancement. Because the
individual and collective outputs are large in Figure 1(a1) and (a2), the unified output
becomes large in Figure 1(a3). This corresponds to the enhancement of characteristics
common to the two types of outputs. Concordantly, characteristics specific to the two
types of outputs should be weakened. This is an example of inhibition, as shown in
Figure 1(b). The collective output is small, see Figure 1(b1), while individual output
is large, as in Figure 1(b2). Because the two outputs are different from each other, the
unified output should be small, as in Figure 1(b3). Finally, we should note that in the
unification process, we try to weight individual outputs by collective outputs as shown
in Figure 1. Because we apply our method to self-organizing maps, collective output
plays a more important role in forming ordered maps.
where x^s and w_j represent L-dimensional input and weight column vectors, and L denotes the number of input units. The L × L matrix Λ is called a "scaling matrix," and the kl-th element of the matrix, denoted by (Λ)_{kl}, is defined by:

\[
(\Lambda)_{kl} = \delta_{kl}\,\frac{p(k)}{\sigma_\beta^2}, \qquad k, l = 1, 2, \cdots, L,
\tag{2}
\]

where σ_β is a spread parameter, and p(k) shows the importance of the k-th input unit and is initially set to 1/L, because we have no preference among input units. We represent the relation between the j-th neuron and the m-th neuron by φ_{jm}. Then, the collective output is defined by:

\[
y_j^s \propto \sum_{m=1}^{M} \phi_{jm} \exp\!\left( -\frac{1}{2} (x^s - w_m)^T \Lambda (x^s - w_m) \right).
\tag{3}
\]
The unified output is based upon individual and collective outputs. For the first approximation to the unified outputs, we consider the unified output as the individual output weighted by the normalized collective output:

\[
z(j \mid s) \propto q(j \mid s)\, \exp\!\left( -\frac{1}{2} (x^s - w_j)^T \Lambda (x^s - w_j) \right).
\tag{4}
\]
We use the normalized collective outputs,

\[
q(j \mid s) = \frac{\sum_{m=1}^{M} \phi_{jm} \exp\!\left( -\frac{1}{2} (x^s - w_m)^T \Lambda (x^s - w_m) \right)}{\sum_{j=1}^{M} \sum_{m=1}^{M} \phi_{jm} \exp\!\left( -\frac{1}{2} (x^s - w_m)^T \Lambda (x^s - w_m) \right)},
\tag{5}
\]
because the collective outputs are the sum of the outputs of all neighboring individual neurons and are much larger than the individual outputs. This equation shows that when the j-th collective and individual outputs are larger, the unified output becomes larger in turn. On the other hand, when the collective and individual outputs are different, the unified output is weakened. Thus, this equation principally describes the idea explained above.
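Eqs. (2)–(5) can be checked numerically with a small sketch (illustrative pure Python, not the paper's code; a diagonal Λ, a hand-picked neighborhood matrix `phi`, and the uniform initialization p(k) = 1/L are assumed):

```python
# Numerical sketch of Eqs. (2)-(5): individual Gaussian outputs are weighted
# by the normalized collective outputs to give the unified outputs z(j|s).
import math

def quad(x, w, lam):
    # (x - w)^T Lambda (x - w) with diagonal Lambda (Eq. (2))
    return sum(l * (xi - wi) ** 2 for l, xi, wi in zip(lam, x, w))

def unified_outputs(x, W, phi, sigma_beta=1.0, p=None):
    L, M = len(x), len(W)
    p = p or [1.0 / L] * L                      # p(k) = 1/L initially
    lam = [pk / sigma_beta ** 2 for pk in p]    # diagonal of the scaling matrix
    indiv = [math.exp(-0.5 * quad(x, W[j], lam)) for j in range(M)]
    collective = [sum(phi[j][m] * indiv[m] for m in range(M))
                  for j in range(M)]            # Eq. (3)
    total = sum(collective)
    q = [c / total for c in collective]         # Eq. (5): normalized collective
    return [q[j] * indiv[j] for j in range(M)]  # Eq. (4): unified outputs
```

With the input near the first weight vector, both its individual and collective outputs are large, so the unified output of that neuron dominates, matching the enhancement behaviour described above.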
To obtain the connection weights, we introduce the free energy function [7]:

\[
F = -2\sigma_\beta^2 \sum_{s=1}^{S} p(s) \log \sum_{j=1}^{M} q(j \mid s) \exp\!\left( -\frac{1}{2} (x^s - w_j)^T \Lambda (x^s - w_j) \right).
\tag{6}
\]
When the spread parameter σ_β is larger, the unified outputs tend to be more similar to the collective outputs. When the spread parameter is smaller, the quantization errors tend to be much smaller. By differentiating the free energy, we can compute the connection weights:
\[
w_j = \frac{\sum_{s=1}^{S} p^*(j \mid s)\, x^s}{\sum_{s=1}^{S} p^*(j \mid s)}.
\tag{9}
\]
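Eq. (9) is a responsibility-weighted mean of the inputs; a minimal sketch (with `resp[j][s]` standing in for p*(j|s), whose derivation is elided here):

```python
# Sketch of the update in Eq. (9): each weight vector becomes the
# responsibility-weighted mean of the input vectors.
def update_weights(X, resp):
    M, L = len(resp), len(X[0])
    W = []
    for j in range(M):
        total = sum(resp[j])
        W.append([
            sum(resp[j][s] * X[s][k] for s in range(len(X))) / total
            for k in range(L)
        ])
    return W
```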
where r_j and r_c denote the positions of the j-th and the c-th unit in the output space, and σ_γ is a spread parameter. Using this neighborhood function, we can compute the collective outputs:

\[
y_j^s \propto \sum_{m=1}^{M} h_{jm} \exp\!\left( -\frac{1}{2} (x^s - w_j)^T \Lambda (x^s - w_j) \right).
\tag{11}
\]
Fig. 2. U-matrices (a) by the conventional SOM, and the results of PCA applied to the weights by the SOM (b) and to the actual data (c), for the glass identification problem
clearer, represented in warmer colors. When the parameter β was increased to 11, as in
Figure 3(d), the class boundary became slightly differentiated. When the parameter β
was further increased to 13 and 15, as in Figures 3(e) and (f), the main boundary was
decomposed into minor class boundaries.
Figure 4 shows the results of the PCA on the connection weights when the parameter
β was increased from 5 (a) to 15 (f). When the parameter β was five, as in Figure 4(a),
connection weights were uniformly distributed. When the parameter was increased to
seven, as in Figure 4(b), a clear class appeared on the right-hand side. This class became even clearer when the parameter was increased to nine, as seen in Figure 4(c). When
the parameter was further increased from 11 in Figure 4(d) to 15 in Figure 4(f), the class
was decomposed into several subclasses.
Figure 5 shows quantization and topographic errors for training and testing patterns.
As shown in Figure 5(a1) and (a2), quantization errors decreased gradually as the pa-
rameter β was increased for the training and testing data. Figure 5(b1) shows the topo-
graphic errors for the training patterns. As shown in the figure, the topographic errors
were below the level obtained by the conventional SOM when the parameter β was
small. They increased gradually when the parameter β was increased. Figure 5(b2)
shows topographic errors as a function of the parameter β for the testing data. The
392 R. Kamimura
Fig. 3. U-matrices by the new method when the parameter β was increased from 5 (a) to 15 (f)
for the glass identification problem
topographic errors were much lower than those obtained with the conventional SOM.
When the parameter β increased beyond six, the topographic errors increased rapidly.
The experimental results showed that very clear class boundaries could be detected by our method. However, when the parameter β was increased, topographic errors increased as well, meaning that fidelity to the input patterns tended to decrease. Thus, we must carefully choose the value of the parameter β as a compromise between fidelity and clear class boundaries.
4 Conclusion
In this paper, we separated the individuality and collectivity of neurons, which were rep-
resented in terms of the different outputs from the neurons. Then, we tried to unify the
individuality and collectivity. Unification has the effect of controlling the
Fig. 4. The results of PCA applied to the connection weights when the parameter β was 10, 13,
and 15 for the glass identification problem
Fig. 5. Quantization errors (a) and topographic errors (b) for training (1) and testing (2) data for
the glass identification problem.
References
1. Kohonen, T.: Self-Organizing Maps. Springer (1995)
2. Sammon, J.W.: A nonlinear mapping for data structure analysis. IEEE Transactions on Com-
puters C-18(5), 401–409 (1969)
3. Ultsch, A., Siemon, H.P.: Kohonen self-organization feature maps for exploratory data analysis. In: Proceedings of International Neural Network Conference, pp. 305–308. Kluwer Academic Publishers, Dordrecht (1990)
4. Vesanto, J.: SOM-based data visualization methods. Intelligent Data Analysis 3, 111–126
(1999)
5. Kaski, S., Nikkila, J., Kohonen, T.: Methods for interpreting a self-organized map in data
analysis. In: Proceedings of European Symposium on Artificial Neural Networks, Bruges,
Belgium (1998)
6. Yin, H.: ViSOM-a novel method for multivariate data projection and structure visualization.
IEEE Transactions on Neural Networks 13(1), 237–243 (2002)
7. Kamimura, R.: Self-enhancement learning: target-creating learning and its application to
self-organizing maps. Biological Cybernetics, 1–34 (2011)
8. Frank, A., Asuncion, A.: UCI machine learning repository (2010)
9. Vesanto, J., Himberg, J., Alhoniemi, E., Parhankangas, J.: SOM Toolbox for Matlab. Tech. rep., Laboratory of Computer and Information Science, Helsinki University of Technology (2000)
10. Kiviluoto, K.: Topology preservation in self-organizing maps. In: Proceedings of the IEEE
International Conference on Neural Networks, pp. 294–299 (1996)
Autoencoding Ground Motion Data
for Visualisation
1 Introduction
Neural networks have been widely used in visualisation of high-dimensional data
[8,9,6]. A particular instance is the autoencoder [10,1,13]. Its architecture typi-
cally defines three hidden layers, with the middle layer being smaller than the
others in terms of number of neurons, thus commonly referred to as the ‘bot-
tleneck’. The input and output layers have the same number of neurons. The
autoencoder learns an identity mapping by training on targets identical to the
inputs y. Training is hampered by the bottleneck that forces the autoencoder to reduce the dimensionality of the inputs. Thus, the output ỹ is only an approximate reconstruction of the input y. Training minimises the L2 norm between inputs and reconstructions, ‖y − ỹ‖₂. If the number of neurons in the bottleneck is set to
two, the autoencoder can be used for visualisation. Inputs y induce activations
in the bottleneck that are taken as 2-D coordinates x in a two-dimensional space
where reduced representations of the inputs live. The autoencoder may be viewed
as the composition of an encoding fenc and a decoding fdec function. Encoding
maps inputs to coordinates, fenc (y) = x, while decoding approximately maps
coordinates back to inputs, fdec (x) = ỹ. The complete mapping is denoted as
f (y; w) = fdec (fenc (y)) = ỹ, where w are the weights of the autoencoder.
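The composition above can be illustrated with a minimal sketch (linear encoder/decoder for brevity; the actual autoencoder uses three hidden layers and non-linearities, and the weight matrices here are purely illustrative):

```python
# Minimal illustration of the composition f(y) = fdec(fenc(y)) with a
# 2-unit bottleneck: fenc maps inputs to 2-D coordinates x, fdec maps
# coordinates back to (approximate) reconstructions.
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def make_autoencoder(W_enc, W_dec):
    fenc = lambda y: matvec(W_enc, y)   # inputs -> 2-D coordinates x
    fdec = lambda x: matvec(W_dec, x)   # coordinates -> reconstruction
    f = lambda y: fdec(fenc(y))         # complete mapping
    return fenc, fdec, f
```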
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 395–402, 2012.
© Springer-Verlag Berlin Heidelberg 2012
396 N. Gianniotis et al.
Our data of interest are ground motions (GMs), i.e. the acceleration of ground
movement during an earthquake. Such data are manifestations of an underlying
physical model. However, the autoencoder does not guarantee that reconstruc-
tions ỹ respect the underlying physics. Interpreting a visualisation plot where the
2-D coordinates x map to items ỹ that make no physical sense is problematic.
Moreover, physical data embody domain knowledge that should be reflected in
the visualisation. This helps the analyst inspecting the visualisation understand
why a data item is projected to a particular location x. Being able to physically
interpret the results is important because it increases the analyst’s confidence
in the visualisation. Equally important, physical interpretation may also reveal
inconsistencies, informing the analyst that the visualisation is not meaningful.
In this work we focus on the visualisation of GMs. We extend the autoencoder
by coupling its outputs to a physical model that generates GMs. Effectively, we
add an output layer that physically constrains the reconstructions ỹ. Currently,
visualisation of GMs has been addressed with the application of SOM [12]. How-
ever, the model-free nature of SOM [9] does not permit a physical interpretation
of the visualisation. This work takes a step in this direction.
magnitude. We note that log PGA increases with magnitude and decreases with distance. Instead of working with continuous GM curves, we discretise them by keeping only 60 log PGA values over the grid of distance-magnitude pairs indicated with × markers in the GM examples in Fig. 4. Thus, each GM curve becomes a discrete vector y of 60 values.
The non-analytical nature of the stochastic model [5] does not permit cal-
culating derivatives with respect to its inputs θ. We want to use this physical
model as an output layer for the autoencoder. Training the autoencoder via
backpropagation requires calculating derivatives of the activations at all layers
with respect to the weights [2], thus rendering the stochastic model unsuitable.
Our solution was to emulate the stochastic model with a differentiable surrogate model. This was chosen to be a neural network of 12 neurons, trained on a dataset of pairs (θ, y) generated by the stochastic model. The trained neural network was tested on independent data and its error was judged low enough to accept it as a replacement for the stochastic model. Its weights were fixed
and no further adaptation took place when coupled to the autoencoder. We denote the surrogate neural network by the function g(θ) = y. As aforementioned, GM is normally distributed. Consequently, we define a noise model that outputs observed GM curves y^{obs} according to y^{obs} ∼ N(g(θ), σ²_{stoch}).
that “squashes” its inputs into [0, 1]: ψ(f) = [σ(f₁), σ(f₂), σ(f₃)]ᵀ.
– A is a 3 × 3 diagonal matrix that scales parameters to the appropriate range. Its elements are the ranges (θᵢ^max − θᵢ^min) of the parameters, so that A = diag((2800 − 630), (0.08 − 0.01), (100 − 1)).
– the 3 × 1 vector β holds the minimum value θᵢ^min of each parameter, and shifts parameters to the appropriate interval: β = [630, 0.01, 1]ᵀ.
iii) the output of the vector-valued mapping is finally fed to the surrogate model g.
[Figure: the extended architecture — the “old” autoencoder f maps the input y through the bottleneck x to outputs (f₁, f₂, f₃); the mapping ξ converts these to the physical parameters θ = (vs30, κ, sd), which feed the surrogate network g producing the reconstruction ỹ.]
h_dec(h_enc(y)). The introduced modifications may be seen as extending the autoencoder with extra non-adaptable layers.
As aforementioned, the surrogate model defines a noise model N(y, σ²_{stoch}). The similarity of a reconstruction ỹ to its input y is judged by the likelihood implied by the noise model. Thus, optimisation of the extended autoencoder proceeds by optimising the log-likelihood function, where the contribution of input yⁿ is defined as:

\[
E^n = \log \mathcal{N}(\tilde{y}^n \mid y^n, \sigma^2_{stoch}) = \log \mathcal{N}(h(y^n) \mid y^n, \sigma^2_{stoch}),
\tag{1}
\]
with the log-likelihood over all N inputs being \(E = \sum_{n=1}^{N} E^n\). Discarding constants, this simplifies to the negative L2 norm:

\[
E^n = -\frac{1}{2} (y^n - h(y^n))^T (y^n - h(y^n)),
\tag{2}
\]
which is also the objective optimised by the standard autoencoder.
The only adaptable part of h is the autoencoder f. Optimising cost function (2) requires differentiating it with respect to the free parameters w. Cost function (2)
is a composition of f, ξ, g, thus its derivatives can be obtained by using the
chain rule. For f we need the gradient ∇f (y) of its outputs with respect to its
weights w. Since f is the standard autoencoder, the gradient ∇f (y) is obtained
via standard back-propagation [2]. Surrogate model g is also a neural network
and its Jacobian J (g) is also obtained via back-propagation. For mapping ξ, the
Jacobian J(ξ) is easily calculated as the diagonal 3 × 3 matrix:

\[
J(\xi) = A\, \mathrm{diag}\big(\sigma(f_1)(1 - \sigma(f_1)),\ \sigma(f_2)(1 - \sigma(f_2)),\ \sigma(f_3)(1 - \sigma(f_3))\big).
\tag{3}
\]
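The squashing/scaling map and the Jacobian of Eq. (3) can be sketched as follows (illustrative Python using the A and β values quoted above; reading the mapping as ξ(f) = Aψ(f) + β is our assumption, consistent with Eq. (3)):

```python
# Sketch of the parameter mapping xi(f) = A * psi(f) + beta and its diagonal
# Jacobian J(xi) = A * diag(sigma(fi) * (1 - sigma(fi))) from Eq. (3).
import math

A_DIAG = [2800 - 630, 0.08 - 0.01, 100 - 1]   # parameter ranges (vs30, kappa, sd)
BETA = [630.0, 0.01, 1.0]                      # parameter minima

def sigma(u):
    return 1.0 / (1.0 + math.exp(-u))

def xi(f):
    # maps unconstrained outputs (f1, f2, f3) to valid physical parameters
    return [a * sigma(fi) + b for a, fi, b in zip(A_DIAG, f, BETA)]

def jacobian_xi(f):
    # diagonal of J(xi) in Eq. (3)
    return [a * sigma(fi) * (1.0 - sigma(fi)) for a, fi in zip(A_DIAG, f)]
```

Because the sigmoid is bounded, ξ always produces parameters inside the admissible intervals, which is what keeps the surrogate model's inputs physically valid during optimisation.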
Thus, the contribution of input yⁿ to the gradient is:

\[
G(n) = (y^n - h(y^n))\, J(g)\, J(\xi)\, (\nabla f(y))^T,
\tag{4}
\]

with the gradient over all N inputs being \(G = \sum_{n=1}^{N} G(n)\). Having obtained the gradient G, we use a gradient-based optimisation on cost function (2).
4 Magnification Factors
Since the 2-D coordinates x induced at the bottleneck are products of non-linear
mappings, the topographic map ‘stretches’ and ‘contracts’. Such manifestations
are known as magnification factors [3]. It means that as we traverse the map
by taking equally small steps, we may encounter regions where models change
slowly, but also regions where models change rapidly as we move. Thus, distances
between the projected data items are distorted, and a correct interpretation requires the calculation of magnification factors. Models such as the SOM measure magnifications via the U-matrix [15], while the Generative Topographic Map [3] uses tools from differential geometry. Here we also calculate magnification factors
[Figure: a small step Δx from a point x in the latent (x₁, x₂) space, used to illustrate magnification.]
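A finite-difference estimate of the local stretch (the Δx picture above) can be sketched as follows (illustrative; `decode` stands for any latent-to-data mapping such as f_dec, and the averaging over axes is our simplification):

```python
# Finite-difference magnification estimate: how much a small latent step
# of size eps stretches under a decoder-like map, averaged over latent axes.
import math

def magnification(decode, x, eps=1e-4):
    base = decode(x)
    stretches = []
    for d in range(len(x)):
        xp = list(x)
        xp[d] += eps                      # step along one latent axis
        moved = decode(xp)
        dist = math.sqrt(sum((m - b) ** 2 for m, b in zip(moved, base)))
        stretches.append(dist / eps)      # local stretch along this axis
    return sum(stretches) / len(stretches)
```

Regions where this value is large correspond to the bright, "stretched" areas of the magnification plot, where the model changes rapidly as we traverse the map.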
intervals within its respective range. Taking all combinations of these values, we generated via the stochastic model a dataset of 60³ = 216000 GMs. The architecture of autoencoder f was (inputs = 60) × (hidden = 10) × (bottleneck = 2) × (hidden = 10) × (out = 3).
Fig. 4 presents the visualisation obtained via the extended autoencoder. The
plot is annotated with GMs that are typical for their respective regions. The
main feature of the plot is the organisation of GMs in the 4 annotated regions.
In particular, as we move from region (1) towards region (2) we see that the
shapes of the encountered GMs change systematically, in that the curves fall lower and lower in terms of log PGA. We discern two trends: moving from region (1) towards (2) along the solid line, we find GMs whose log PGA-curves show greater slopes, while along the dashed line the log PGA-curves show lower slopes.

[Fig. 4: 2-D visualisation (x₁, x₂) obtained by the extended autoencoder, annotated with typical GM curves (log PGA over distances 20–100) for the four regions.]

Fig. 5. From left to right, parameter plots for vs30, κ and sd. We stress that the plots do not display the parameters of the visualised data items, but the parameters obtained via ξ(f_dec(x)) for any x. The parameter plots show how the physical model changes in order to accommodate the projections.
We also visualised the dataset of GMs using a state-of-the-art method, the
Gaussian Process Latent Variable Model (GPLVM) [11]. We empirically observed
that it produced the same level of topographic organisation as our approach, not
displayed here due to lack of space. This shows that a model-based approach like ours is not always privileged in terms of topographic organisation. However, the real advantage of a model-based over a model-free approach (e.g. autoencoder [10], SOM [9], GTM [4], GPLVM [11]) is that it provides an interpretation of how the visualisation arises. Any coordinate x in the visualisation plot maps via ξ(f_dec(x)) to a valid GM with parameters θ. Fig. 5 displays plots for parameters vs30, κ and sd corresponding to coordinates x in the visualisation plot. Bright and dark shades indicate high and low values respectively.
A data item y is mapped to a coordinate x, if x maps to a GM with parame-
ters θ that resembles y. Thus, the parameter plots in Fig. 5 help us understand
why data items are projected to their particular locations. Specifically: region
(1) corresponds to low values of vs30, which physically means stronger site
amplification. Region (2) corresponds to low sd and high vs30, generally leading to
GMs of low log P GA. These two findings are not surprising and meet our prior
physical expectations. However, κ is in general a parameter difficult to interpret.
Parameter κ separates GMs into region (3) of low κ and (4) of high κ that ex-
hibit high and moderate slopes respectively. The visualisation informs us that κ
affects the distance-dependency of log PGA. This effect is not a priori expected;
rather, it is an indirect effect that arises from the complex behaviour of the
stochastic model.
Fig. 3 displays the magnification factors. Bright and dark shades show high
and low magnifications respectively. We see that magnification is overall fairly
moderate apart from the lower-right of region (4), where the space is stretched,
i.e. the stochastic model is sensitive to parameter changes in this region. Such
changes are easily detected via magnification factors. Indeed, upon inspection
of the corner of region (4), we find GMs whose curves change rapidly towards
lower values in terms of log P GA, which explains the peaked lower-right corner.
6 Conclusions
We presented a new visualisation method for GMs based on the autoencoder that
allows interpreting the visualisation through a physical model which explains
why data are projected at their particular locations x. The parameter plots
clearly show how the physical model drives the visualisation which helps us better
understand the behaviour of the stochastic model. Magnification factors reveal
parameter sensitivities. Such interpretability can be achieved only by adopting
a model-based approach. Other data types can be accommodated by coupling a
suitable model to the autoencoder.
402 N. Gianniotis et al.
References
1. Baldi, P., Hornik, K.: Neural networks and principal component analysis: Learning
from examples without local minima. Neural Networks 2, 53–58 (1989)
2. Bishop, C.M.: Neural networks for pattern recognition. Oxford University Press
(1996)
3. Bishop, C.M., Svensén, M., Williams, C.K.I.: Magnification Factors for the GTM
Algorithm. In: Gerstner, W., Hasler, M., Germond, A., Nicoud, J.-D. (eds.) ICANN
1997. LNCS, vol. 1327, pp. 64–69. Springer, Heidelberg (1997)
4. Bishop, C.M., Svensén, M., Williams, C.K.I.: GTM: The generative topographic
mapping. Neural Computation 10(1), 215–234 (1998)
5. Boore, D.M.: Simulation of ground motion using the stochastic method. Pure and
Applied Geophysics 160, 635–676 (2003)
6. Hagenbuchner, M., Sperduti, A., Tsoi, A.C.: A self-organizing map for adaptive
processing of structured data. IEEE Transactions on Neural Networks 14(3), 491–
505 (2003)
7. Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning.
Springer (2001)
8. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural
networks. Science 313(5786), 504–507 (2006)
9. Kohonen, T.: The self-organizing map. Proceedings of the IEEE 78(9), 1464–1480
(1990)
10. Kramer, M.A.: Nonlinear principal component analysis using autoassociative neu-
ral networks. AICHE Journal 37, 233–243 (1991)
11. Lawrence, N.D.: Gaussian process latent variable models for visualisation of high
dimensional data. In: NIPS 16 (2004)
12. Scherbaum, F., Kuehn, N.M., Ohrnberger, M., Koehler, A.: Exploring the prox-
imity of ground-motion models using high-dimensional visualization techniques.
Earthquake Spectra 26(4), 1117–1138 (2010)
13. Tan, C.C., Eswaran, C.: Autoencoder Neural Networks: A Performance Study
Based on Image Reconstruction, Recognition and Compression. Lambert Academic
Publishing (2009)
14. Tiňo, P., Gianniotis, N.: Metric properties of structured data visualizations through
generative probabilistic modeling. In: IJCAI 2007, pp. 1083–1088 (2007)
15. Ultsch, A., Siemon, H.P.: Kohonen’s self organizing feature maps for exploratory
data analysis. In: INNC Paris, vol. 90, pp. 305–308 (1990)
Examining an Evaluation Mechanism
of Metaphor Generation with Experiments
and Computational Model Simulation
1 Introduction
The purpose of this paper is to examine the evaluation mechanism within
metaphor generation using a computational model. Metaphor generation is re-
garded as a process where an expression consisting of a TOPIC modified by
certain FEATURES (e.g. "a sad song and a song narrates"), referred to as the
input expression, becomes a metaphorical expression of the form "TOPIC like
VEHICLE" (e.g. "the song like tears"). Some computational models of metaphor
generation have already been developed, and, among these, some utilize linguistic
corpora. For instance, Kitada and Hagiwara[1] constructed a figurative compo-
sition support system that included a model of metaphor generation based on
an electronic dictionary. In contrast, Abe, Sakamoto and Nakagawa's model [2] is
based on the results of statistical language analysis, which is more objective than
existing dictionaries which must be compiled through the considerable efforts of
language professionals. Moreover, Terai and Nakagawa [3,4] constructed a model
This research is supported by MEXT's program "Promotion of Environmental
Improvement for Independence of Young Researchers" and a Grant-in-Aid for Young
Scientists (B) (23700160).
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 403–410, 2012.
c Springer-Verlag Berlin Heidelberg 2012
404 A. Terai, K. Abe, and M. Nakagawa
that incorporates the dynamic interaction among features by employing the results
of statistical language analysis. In particular, Terai and Nakagawa's model [4]
has a mechanism of metaphor evaluation and a simulation of that model was
slightly closer to human metaphor-generation performance than a model without
the evaluation mechanism[3].
The model[4] consists of a candidate metaphor generation process and a
metaphor evaluation process. Within the candidate metaphor generation pro-
cess, the model outputs candidate nouns for the vehicle. Within the metaphor
evaluation process, the vehicle candidate nouns are evaluated based on similar-
ities between the meaning of a metaphor including the candidate noun and the
meaning of an input expression, such that the metaphor that is most similar
to the input expression is output as the most adequate metaphor. Thus, the
model assumed a two-process structure where the evaluation mechanism starts
after the metaphor generation mechanism is complete. It is, however, feasible
that the generation mechanism and the evaluation mechanism operate simulta-
neously. Accordingly, the two-process assumption warrants further examination,
specifically with respect to the timing of the evaluation mechanism. Moreover,
previous versions of the model have not considered the degree of dis-similarity
between the topic and the vehicle within the metaphor generation mechanism,
even though such dissimilarities are taken into account in Kitada and Hagiwara's
model[1]. Thus, this paper improves the model of the metaphor generation mech-
anism by considering this dissimilarity, and subsequently investigates the timing
of the evaluation mechanism through a simulation of the improved model.
where $c_k^r$ indicates the kth latent class assumed within this method for the rth
type of modification data. The parameters ($P(n_i^r|c_k^r)$, $P(a_j^r|c_k^r)$, and $P(c_k^r)$) are
estimated as the values that maximize the log likelihood of the co-occurrence
frequency data between $n_i^r$ and $a_j^r$ using the EM algorithm. The statistical
language analysis is applied to each set of co-occurrence data, fixing the number of
latent classes at 200. $P(c_k^r|n_i^r)$ and $P(c_k^r|a_j^r)$ are computed using Bayes' theorem.
The 18,142 noun types ($n_h^*$) that are common to all four types of modification data
were used in the system. The meaning vector of the noun and the meaning vector
of the feature were estimated based on $P(c_k^r|n_i^r)$ and $P(c_k^r|a_j^r)$.
where $Oa(n_h^*)$ indicates the adequacy of the vehicle candidate given the input
expression, $\lambda_1$ denotes the influence of the adequacy and $\lambda_2$ the degree
of dissimilarity between the topic and the vehicle candidate noun. Output from
the model without the evaluation mechanism is a ranking of the vehicle candidate
nouns according to their adequacy values.
The meaning of the metaphor and the meaning of the input expression are
estimated according to Kintschs predication algorithm[6].
The meaning of a metaphor consisting of a topic and a candidate vehicle
($M(n_{h_0}^*, n_h^*)$), which reads "topic ($n_{h_0}^*$) like vehicle ($n_h^*$)", is estimated using
meaning vectors. First, the semantic neighborhood $N(n_h^*)$ of a candidate vehicle
($n_h^*$), of size $S_{nm}$, is computed on the basis of similarity to the vehicle, which
is represented by the cosine of the angles between meaning vectors. Next, $S_m$
nouns are selected from the semantic neighborhood $N(n_h^*)$ of the vehicle on
the basis of their similarity to the topic ($n_{h_0}^*$). Finally, a vector $V(M(n_{h_0}^*, n_h^*))$
is computed as the centroid of the meaning vectors for the topic, the vehicle
and the selected $S_m$ nouns. The computed vector $V(M(n_{h_0}^*, n_h^*))$ indicates the
assigned meaning of the topic ($n_{h_0}^*$) as a member of the ad-hoc category of the
vehicle $n_h^*$ in the metaphor $M(n_{h_0}^*, n_h^*)$.
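The neighbourhood-and-centroid construction above can be sketched in a few lines of numpy. This is an illustrative reading of the text, not the authors' implementation; the vector values and the tie-breaking order are assumptions.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two meaning vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def metaphor_vector(topic, vehicle, noun_vectors, s_nm=20, s_m=5):
    """Meaning vector of 'TOPIC like VEHICLE' following the predication
    scheme described in the text: take the s_nm nouns most similar to the
    vehicle, keep the s_m of them most similar to the topic, and return
    the centroid of topic, vehicle and the selected nouns."""
    # semantic neighbourhood N(vehicle): s_nm nearest nouns by cosine similarity
    sims_v = [cosine(v, vehicle) for v in noun_vectors]
    neighbourhood = sorted(range(len(noun_vectors)), key=lambda i: -sims_v[i])[:s_nm]
    # from N(vehicle), select the s_m nouns most similar to the topic
    selected = sorted(neighbourhood,
                      key=lambda i: -cosine(noun_vectors[i], topic))[:s_m]
    members = [topic, vehicle] + [noun_vectors[i] for i in selected]
    return np.mean(members, axis=0)  # centroid = meaning of the metaphor
```

The meaning of the input expression is built analogously from feature vectors, with a centroid of centroids when more than one feature is given.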
The meaning of the input expression ($L(n_{h_0}^*, a_{j_u}^{r_u}, \ldots)$), consisting of the input
topic ($n_{h_0}^*$) and its features ($a_{j_u}^{r_u}$), is also estimated. First, the semantic
neighborhood $N(a_{j_u}^{r_u})$ of a feature, of size $S_{nl}$, is computed on the basis of
the similarity to the feature ($a_{j_u}^{r_u}$), which is represented by the cosine of the
angles between feature vectors. Next, $S_l$ features are selected from the semantic
neighborhood $N(a_{j_u}^{r_u})$ of the feature on the basis of their similarities to the
topic ($n_{h_0}^*$), which are computed as the cosine of the angles between the meaning
vectors of the features and the topic using $P(c_k^{r_u}|a_{j_u}^{r_u})$ and $P(c_k^{r_u}|n_{h_0}^*)$. A vector
($V(L(n_{h_0}^*, a_{j_u}^{r_u}))$) is computed as the centroid of the meaning vectors for the
topic, an input feature and the selected $S_l$ features. When the input expression
has more than one feature, namely $u > 1$, the centroid is first computed for
each feature separately and then the centroid of these centroids is taken as
the meaning vector of the input expression ($V(L(n_{h_0}^*, a_{j_u}^{r_u}, \ldots))$).
The similarity between a metaphor including a candidate noun and the input
expression is represented as the cosine of the angle between the meaning vector
of the metaphor ($V(M(n_{h_0}^*, n_h^*))$) and the meaning vector of the input expression
($V(L(n_{h_0}^*, a_{j_u}^{r_u}, \ldots))$). The similarity is denoted $Oe(n_h^*)$.
where $\lambda_3$ indicates the extent of the evaluation. The nouns are sorted according
to their adequacy values ($Oe_s(n_h^*)$). The ranking indicates the adequacy of a
noun as a vehicle in the metaphor. In the model, the metaphor generation mech-
anism and the evaluation mechanism operate simultaneously. However, within
this model, all the nouns have to be evaluated. Thus, this model includes a higher
computational load than the two-process model.
Hybrid Model. The top P nouns are evaluated according to their adequacy
values as vehicle candidates ($Og(n_h^*)$). Only these P nouns are re-sorted according
to their adequacy values as a vehicle ($Oe_s(n_h^*)$), which are computed using
formula (4). The ranking of the other nouns is not changed. That is to say, the
adequacy values for nouns as vehicles are computed using the following formula (5):

$$Oe_h(n_h^*) = \begin{cases} \lambda_3\,Oe(n_h^*) + Og(n_h^*) & \text{if } n_h^* \in N_P,\\ Og(n_h^*) & \text{otherwise,}\end{cases}\qquad(5)$$
The model incurs a computational load similar to that of the two-process model.
After the generation mechanism, the top P nouns are evaluated, but within the
evaluation process the generation mechanism and the evaluation mechanism operate
simultaneously for the top P nouns.
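The piecewise scoring of formula (5) amounts to re-scoring only the top-P candidates. A minimal sketch of this hybrid ranking, with dictionary inputs standing in for the model's scores (names hypothetical):

```python
def hybrid_adequacy(og, oe, lam3, top_p):
    """Adequacy scores under the hybrid model (formula (5)): only the top-P
    candidates by generation adequacy Og also receive the evaluation term
    lambda_3 * Oe; the remaining nouns keep their generation score.
    og, oe: dicts mapping candidate noun -> Og(n) and Oe(n)."""
    ranked = sorted(og, key=og.get, reverse=True)
    top = set(ranked[:top_p])
    return {n: (lam3 * oe[n] + og[n]) if n in top else og[n] for n in og}
```

Sorting the returned scores reproduces the final vehicle ranking; candidates outside the top P can never gain from the evaluation term, which is what keeps the load close to the two-process model.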
3 Experiments
Two kinds of experiments were conducted in order to elucidate the timing of the
evaluation mechanism, using the same input expressions in Japanese: "a sad song
and a song narrates" and "hot words and severe words". One was an experiment that
included no time for thinking and the other was an experiment that included
thinking time. In the experiment without thinking time, the participants were 90
undergraduates. They were presented with the input expressions and were asked
to make a metaphorical expression of the form ”TOPIC like VEHICLE” and
to respond with the vehicles within 1.5 minutes. In contrast, in the experiment
with thinking time included, the participants were 71 undergraduates. They were
presented with the input expressions and were asked to think about the vehicles
during the thinking time of 3 minutes before responding. Then, they were asked
to provide their vehicle responses within 1.5 minutes.
We hypothesized that the influence of the evaluation mechanism would be
more strongly reflected in the results from the experiment including think-
ing time than in the results of the experiment without thinking time. If the
Table 1. The rank correlation coefficients between the experimental results and the
parameters. Significance at the 1% level (**), 5% level (*) and 10% level (†)
is shown.
5 Discussion
The model without evaluation exhibited the lowest correlation coefficients for
all conditions. It is, therefore, clear that some form of evaluation mechanism
is involved in metaphor generation. The two-process model provides the most
comprehensive explanation of the experimental results. However, there are no
major differences among the three types of models that include an evaluation
mechanism. From a cognitive perspective, the simultaneous evaluation model
would appear to necessitate excessive cognitive loads in order to evaluate all
candidate nouns, and it is, thus, rather unlikely that people generate metaphors
in a manner that is similar to that assumed in that model. Based purely on the
results obtained from the present study, it is not possible to distinguish between
the two-process model and the hybrid model in terms of which is better, because
of the limited set of just two input expressions. However, both models assume
that vehicle candidates are initially generated, and then, some of the candidates
are subsequently evaluated. Essentially, the results indicate that an evaluation
mechanism is executed while the generation mechanism is continuing to operate.
In addition, no differences due to the experimental contrast were observed in the
present results. The participants could have evaluated candidate nouns as much
in the without-thinking-time condition as in the thinking-time condition. However,
the results suggest that the evaluation mechanism may start at an early stage
of the generation mechanism, even though it is initiated after the generation
mechanism has begun its operation.
References
1. Kitada, J., Hagiwara, M.: Figurative composition support system using electronic
dictionaries. Transactions of Information Processing Society of Japan 42(5), 1232–
1241 (2001)
2. Abe, K., Sakamoto, K., Nakagawa, M.: A computational model of metaphor genera-
tion process. In: Proc. of the 28th Annual Meeting of the Cognitive Science Society,
pp. 937–942 (2006)
3. Terai, A., Nakagawa, M.: A Neural Network Model of Metaphor Generation with
Dynamic Interaction. In: Alippi, C., Polycarpou, M., Panayiotou, C., Ellinas, G.
(eds.) ICANN 2009, Part I. LNCS, vol. 5768, pp. 779–788. Springer, Heidelberg
(2009)
4. Terai, A., Nakagawa, M.: A Computational System of Metaphor Generation with
Evaluation Mechanism. In: Diamantaras, K., Duch, W., Iliadis, L.S. (eds.) ICANN
2010, Part II. LNCS, vol. 6353, pp. 142–147. Springer, Heidelberg (2010)
5. Kameya, Y., Sato, T.: Computation of probabilistic relationship between concepts
and their attributes using a statistical analysis of Japanese corpora. In: Proc. of
Symposium on Large-scale Knowledge Resources: LKR 2005, pp. 65–68 (2005)
6. Kintsch, W.: Metaphor comprehension: A computational theory. Psychonomic Bul-
letin & Review 7(2), 257–266 (2000)
7. Lakoff, G., Johnson, M.: Metaphors We Live By. University of Chicago Press (1980)
Pairwise Clustering with t-PLSI
1 Introduction
Probabilistic clustering has obtained promising results in many applications
(e.g. [1,3]). In particular, Probabilistic Latent Semantic Indexing (PLSI) [10],
that provides a nice factorizing structure for optimization and statistical in-
terpretation has attracted much research effort in the past decade. PLSI was
originally used for topic modeling and later also found a good application in
clustering (e.g. [5]).
Despite its success in many tasks, PLSI is restricted to multinomial data,
originally word counts in documents. That is, it assumes that the data are generated
from a multinomial distribution. Besides the nonnegative-integer limitation, the
multinomial assumption may not hold for data with other types of
noise.
In this paper we generalize PLSI with a more flexible formulation based on
nonnegative low-rank approximation. Maximizing the PLSI likelihood is equiv-
alent to minimizing the Kullback-Leibler (KL) divergence between the input
matrix and its approximation. KL-divergence was recently generalized to a fam-
ily called t-divergence for measuring the approximation error. The t-divergence
This work is supported by the Academy of Finland in the project Finnish Center of
Excellence in Computational Inference Research (COIN).
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 411–418, 2012.
c Springer-Verlag Berlin Heidelberg 2012
412 H. Zhang et al.
is more flexible than KL-divergence in the sense that more types of noise model,
e.g. data with a heavy-tailed distribution, can be accommodated [8]. Here we
integrate t-divergence in our new PLSI formulation and name the generalized
method t-PLSI.
As the algorithmic contribution, we propose a Majorization-Minimization
algorithm to solve the t-PLSI optimization problem. The t-divergence is con-
structed through the Fenchel dual of the log-partition function of t-exponential
family distributions [8]. The resulting convexity facilitates developing convenient
multiplicative variational algorithms for t-PLSI.
We apply t-PLSI to pairwise clustering analysis. Fourteen real-world datasets
are selected for comparing PLSI and the generalized method. Experimental re-
sults show that PLSI based on KL-divergence (t = 1) is not always the best. For
many selected datasets, t-PLSI can achieve better purities with other t values.
The rest of the paper is organized as follows. Section 2 reviews the PLSI model
and gives its formulation for the pairwise clustering framework. In Section 3,
we present the generalized PLSI based on t-divergence, including its learning
objective and optimization algorithm. Section 4 gives the experimental settings
and results. Conclusions and some future work are given in Section 5.
PLSI [10] was originally developed for text document analysis. Let C be an m×n
word-document matrix, where Cij is the number of times the ith word appears in
the jth document. PLSI assumes that the data is generated from a multinomial
distribution and maximizes the likelihood

$$\prod_{i=1}^{m}\prod_{j=1}^{n} P(w=i,\, d=j)^{C_{ij}},\qquad(1)$$

where

$$P(w=i,\, d=j) = \sum_{k=1}^{r} P(w=i\,|\,z=k)\,P(d=j\,|\,z=k)\,P(z=k),\qquad(2)$$
with conditional independence between the word variable w and the document
variable d given the latent class variable z. In the following, we rewrite the
probabilities in matrix notation for convenience: $\hat X = WSH$, where $\hat X_{ij} =
P(w=i, d=j)$, $W_{ik} = P(w=i|z=k)$, $H_{kj} = P(d=j|z=k)$ and $S$ is a diagonal
matrix with $S_{kk} = P(z=k)$.
Given X as the normalized version of C, i.e. $X_{ij} = C_{ij}/\sum_{ab} C_{ab}$, maximizing the
likelihood in Eq. (1) is equivalent to minimizing the Kullback-Leibler divergence

$$D_{KL}(X\,\|\,\hat X) = \sum_{ij} X_{ij}\log\frac{X_{ij}}{(WSH)_{ij}},\qquad(3)$$
subject to $W \ge 0$, $H \ge 0$, $\sum_{i=1}^{m} W_{ik} = 1$, $\sum_{k=1}^{r} S_{kk} = 1$, $\sum_{j=1}^{n} H_{kj} = 1$ (for
details, see [5]).
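The KL objective of Eq. (3) is straightforward to evaluate numerically. A minimal numpy sketch (conventions assumed: W is m×r and H is r×n, both stochastic as in the constraints; S diagonal):

```python
import numpy as np

def kl_plsi_objective(X, W, S, H):
    """KL-divergence D_KL(X || W S H) of Eq. (3).  X is the normalised
    co-occurrence matrix; W (m x r) column-stochastic, H (r x n)
    row-stochastic, S (r x r) diagonal with the class priors."""
    Xhat = W @ S @ H
    mask = X > 0  # 0 * log 0 = 0 by convention
    return float(np.sum(X[mask] * np.log(X[mask] / Xhat[mask])))
```

The divergence is zero exactly when the factorization reproduces X, and positive otherwise.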
In this paper we focus on the symmetric form of PLSI for pairwise clustering,
where X is taken as the affinity matrix of a weighted undirected graph that
represents pairwise similarities between data samples. No distinction between
“words” and “documents” is made like in the original PLSI, but data can be of
any type. In this case, H = W T .
where $q(x) = p(x)^t / \int p(x)^t\,dx$ is a normalization term called the escort distribution of
$p(x)$, and $\log_t$ is the inverse of the t-exponential function (see e.g. [7]):

$$\log_t(x) = \frac{x^{1-t} - 1}{1-t}\qquad(5)$$
for $t \in (0, 2)\setminus\{1\}$. When $t = 0$, $\log_t(x) = x - 1$. When $t = 2$, $\log_t(x) = 1 - 1/x$. When
$t \to 1$, $\log_t(x) = \log(x)$ and the t-divergence (4) reduces to the KL-divergence. It
is worth noticing that the logt decays towards 0 more slowly than the usual log
function for 1 < t < 2, which leads to the heavy-tailed nature of t-exponential
family and is desired for robustness.
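The deformed logarithm of Eq. (5) is a one-liner; the sketch below handles the $t \to 1$ limit explicitly so the function is continuous in t:

```python
import numpy as np

def log_t(x, t):
    """Deformed logarithm log_t(x) = (x^(1-t) - 1) / (1 - t) of Eq. (5);
    recovers the natural logarithm as t -> 1."""
    if abs(t - 1.0) < 1e-12:
        return np.log(x)
    return (np.power(x, 1.0 - t) - 1.0) / (1.0 - t)
```

For every t, log_t(1) = 0, and for 1 < t < 2 the function is bounded above, which is the slow decay underlying the heavy tails mentioned in the text.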
We now generalize the PLSI model by using the t-divergence. The discrete version of
the t-divergence between a normalized nonnegative matrix X and its approximation
$\hat X$ is given by

$$D_t(X\,\|\,\hat X) = \sum_{i,j}\left(q_{ij}\log_t X_{ij} - q_{ij}\log_t \hat X_{ij}\right),\qquad(6)$$
3.2 Optimization
The t-divergence in Eq. (7) is convex in $\hat X_{ij}$ for $t \in [0, 2]$. We can therefore apply
Jensen's inequality to develop a Majorization-Minimization algorithm which
iterates the following update rules:

$$W_{ik} \propto \begin{cases} W_{ik}\left[\left(A \oslash (WSW^T)^t\right)W\right]_{ik}^{\frac{1}{2t-1}} & 0 < t < 1,\\[4pt] W_{ik}\left[\left(A \oslash (WSW^T)^t\right)W\right]_{ik} & 1 < t < 2,\end{cases}\qquad(9)$$

$$S_{kk} \propto S_{kk}\left[W^T\left(A \oslash (WSW^T)^t\right)W\right]_{kk}^{\frac{1}{t}},\qquad(10)$$

where $A_{ij} = X_{ij}^t/\sum_{ab} X_{ab}^t$ and $\oslash$ denotes the element-wise division between two
matrices of equal size.
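One MM iteration of the updates (9)-(10) can be sketched as follows. This is a hedged reading of the (extraction-damaged) update rules, not the authors' reference implementation: only the 1 < t < 2 branch of (9) is shown, and the 0 < t < 1 branch would add a 1/(2t-1) power on the multiplicative factor.

```python
import numpy as np

def tplsi_step(X, W, S, t):
    """One Majorization-Minimization step for symmetric t-PLSI
    (our reading of updates (9)-(10), 1 < t < 2 branch)."""
    A = X**t / np.sum(X**t)              # A_ij = X_ij^t / sum_ab X_ab^t
    Xhat = W @ S @ W.T                   # symmetric case: H = W^T
    R = A / Xhat**t                      # element-wise A ./ (W S W^T)^t
    W_new = W * (R @ W @ S)
    W_new /= W_new.sum(axis=0, keepdims=True)   # enforce sum_i W_ik = 1
    s = np.diag(S) * np.diag(W.T @ R @ W)**(1.0 / t)
    S_new = np.diag(s / s.sum())                # enforce sum_k S_kk = 1
    return W_new, S_new
```

Each step preserves nonnegativity and the stochasticity constraints by explicit renormalization, which is what the proportionality signs in (9)-(10) stand for.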
4 Experiments
We have compared the clustering performance on a number of undirected graphs
between the original PLSI and the proposed PLSI based on the t-divergence. The
graphs have ground-truth classes. The evaluation criterion that we adopt is the
clustering purity, $\frac{1}{n}\sum_{k=1}^{r}\max_{1\le l\le q} n_k^l$, where $n_k^l$ is the number of vertices in
the partition k that belong to ground-truth class l. A larger purity in general
corresponds to a better clustering result.
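The purity criterion can be computed directly from two label vectors; a minimal sketch:

```python
import numpy as np

def purity(pred, truth):
    """Clustering purity: each predicted cluster k contributes its largest
    overlap with a single ground-truth class, normalised by the number of
    points.  Labels are assumed to be non-negative integers."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    total = 0
    for k in np.unique(pred):
        members = truth[pred == k]          # true classes inside cluster k
        total += np.max(np.bincount(members))
    return total / len(pred)
```

Purity is 1 exactly when every cluster is pure, i.e. contains points of a single ground-truth class.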
We have used fourteen datasets to evaluate the two compared methods. These
datasets can be retrieved from the Pajek database (http://vlado.fmf.uni-lj.si/pub/networks/data/),
Newman's collection (http://www-personal.umich.edu/~mejn/netdata/), or the UCI
machine learning repository (http://archive.ics.uci.edu/ml/index.html). Nine datasets
are sparse graphs and the remaining five are multivariate data. We preprocessed the
latter type into sparse similarity matrices by symmetrizing their
K-Nearest-Neighbor graphs (K = 15).
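The preprocessing step just described can be sketched as follows. Details the text does not fix, such as the distance metric (Euclidean assumed here) and binary edge weights, are assumptions:

```python
import numpy as np

def symmetric_knn_graph(D, k=15):
    """Turn multivariate data (rows of D) into a sparse similarity matrix:
    build a K-Nearest-Neighbor graph on pairwise Euclidean distances and
    symmetrize it (an edge survives if either endpoint selects the other)."""
    n = len(D)
    dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=-1)
    A = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dist[i])[1:k + 1]   # skip self at position 0
        A[i, nbrs] = 1.0
    return np.maximum(A, A.T)                 # symmetrize
```

The resulting matrix, once normalised to sum to one, plays the role of the affinity matrix X in the symmetric PLSI formulation.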
We set the number of clusters as the true number of classes for the two
clustering algorithms on all datasets. The parameters W and S were initialized
by following the Fresh Start procedure proposed by Ding et al. [6]. Table 1 shows
the clustering performances of the compared algorithms.
From the results in Table 1, we can see that the original PLSI based on KL-
divergence, i.e. t = 1, is not always the best. For certain datasets, the clustering
purity can be improved by using t-divergence with other t values. For example
when t = 0.2 in the case 0 < t < 1 (Table 1a), t-PLSI achieves a perfect
clustering result on the "strike" dataset with purity 1. As for the "cities" dataset,
the clustering purity is improved by 10% over that of the original PLSI.
Similar analysis can be made for the case 1 < t < 2 (Table 1b). Compared with
KL-divergence, t-divergence has an extra degree of flexibility given by the free
parameter t. In particular, as t → 2, the tail of the log_t function becomes heavier
than that of the usual log function, which makes the new method more robust
against noise.
5 Conclusions
We have studied the generalization of Probabilistic Latent Semantic Indexing
with t-divergence family. The generalized PLSI was formulated as a nonnega-
tive low-rank approximation problem. The formulation is more flexible and can
accommodate more types of noise in data. We have proposed a Majorization-
Minimization algorithm for optimizing the constrained objective. Empirical com-
parison shows that clustering performance in purity can be improved by using
the generalized method with suitable t-divergences other than KL-divergence.
The proposed generalization is not restricted to PLSI. The t-divergence could
be used in other nonnegative approximation problems with, for example, other
matrix factorization/decomposition or other constraints, where the flexibility
might also help to improve the performance.
An interesting question raised with the t-exponential family is whether we can
find its conjugate prior. The multinomial distribution that underlies PLSI has
Dirichlet as its conjugate prior. This conjugacy mainly accounts for the success
of generative topic modeling in recent years. Inference based on t-exponential
family might also benefit from its conjugate prior if they exist. That is, if we could
find a conjugate prior for t-PLSI, we may also apply the nonparametric Bayesian
treatment similar to Latent Dirichlet Allocation and thus avoid overfitting.
The t-divergence is related to two other divergence families, α-divergence and
Rényi divergence (see e.g. [3,4,13]). One of the major differences is normalization
on the input: α-divergence involves no normalization and could be problematic
when combined with prior information; Rényi divergence normalizes the input
before the power operation while t-divergence employs t-power before normal-
ization. The latter has the ability to smooth nonzero entries (for 0 < t < 1)
or exclude outliers (for 1 < t < 2). More thorough comparison among these
divergence families should be carried out in the future.
Another important and still open problem is how to select among various
t-divergences. This strongly depends on the nature of the data to be analyzed.
Usually a larger t leads to more exclusive approximation while a smaller t to more
inclusive approximation. Automatic selection among parameterized divergence
family generally requires extra information or criterion, for example, ground
truth data [2] or cross-validations with a fixed reference parameter [12].
(Majorization)

$$J(\hat W, S) \le \frac{1}{t-1}\sum_{ij} A_{ij}\sum_k \phi_{ijk}^{t}\left(\hat W_{ik}\, S_{kk}\, \hat W_{jk}\right)^{1-t} + \sum_k \lambda_k\left(\sum_i \hat W_{ik} - 1\right),$$

where $A_{ij} = X_{ij}^t/\sum_{ab} X_{ab}^t$ and $\phi_{ijk} = \frac{W_{ik} S_{kk} W_{jk}}{\sum_l W_{il} S_{ll} W_{jl}}$. Denote the r.h.s. by $G_0(\hat W, W)$.

Case 1: when $0 < t < 1$, we have $G_0(\hat W, W) \le G_1(\hat W, W)$, where

$$G_1(\hat W, W) = \frac{1}{t-1}\sum_{ij} A_{ij}\sum_k \phi_{ijk}^{t}\, S_{kk}^{1-t}\, W_{jk}^{1-t}\,\frac{\hat W_{ik}^{2(1-t)}}{W_{ik}^{1-t}} + \sum_k \lambda_k\left(\sum_i \hat W_{ik} - 1\right).$$

Case 2: when $1 < t < 2$, we have $G_0(\hat W, W) \le G_2(\hat W, W)$, where

$$G_2(\hat W, W) = \frac{1}{t-1}\sum_{ij} A_{ij}\sum_k \phi_{ijk}^{t}\, S_{kk}^{1-t}\, W_{ik}^{1-t}\, W_{jk}^{1-t}\left(1 + \log\frac{\hat W_{ik}^{1-t}\,\hat W_{jk}^{1-t}}{W_{ik}^{1-t}\,W_{jk}^{1-t}}\right) + \sum_k \lambda_k\left(\sum_i \hat W_{ik} - 1\right).$$

(Minimization)

For Case 1: when $0 < t < 1$,

$$\frac{\partial G_1}{\partial \hat W_{pq}} = -2\sum_j A_{pj}\,\phi_{pjq}^{t}\, S_{qq}^{1-t}\, W_{jq}^{1-t}\,\frac{\hat W_{pq}^{2(1-t)-1}}{W_{pq}^{1-t}} + \lambda_q = 0.$$

This gives $\hat W_{pq} = W_{pq}^{\frac{t-1}{2t-1}}\,\lambda_q^{-\frac{1}{2t-1}}\left(2\, S_{qq}^{1-t}\sum_j A_{pj}\,\phi_{pjq}^{t}\, W_{jq}^{1-t}\right)^{\frac{1}{2t-1}}$. By using $\sum_p \hat W_{pq} = 1$, we obtain

$$\lambda_q^{\frac{1}{2t-1}} = 2^{\frac{1}{2t-1}}\, S_{qq}^{\frac{1-t}{2t-1}}\sum_a W_{aq}^{\frac{t-1}{2t-1}}\left(\sum_j A_{aj}\,\phi_{ajq}^{t}\, W_{jq}^{1-t}\right)^{\frac{1}{2t-1}}.\qquad(12)$$

For Case 2: when $1 < t < 2$, setting $\partial G_2/\partial \hat W_{pq} = 0$ gives $\hat W_{pq} = \lambda_q^{-1}\, S_{qq}^{1-t}\sum_j A_{pj}\,\phi_{pjq}^{t}\, W_{pq}^{1-t}\, W_{jq}^{1-t}$. By using $\sum_p \hat W_{pq} = 1$, we obtain

$$\lambda_q = S_{qq}^{1-t}\sum_a W_{aq}^{1-t}\sum_j A_{aj}\,\phi_{ajq}^{t}\, W_{jq}^{1-t}.\qquad(13)$$
References
1. Arora, R., Gupta, M., Kapila, A., Fazel, M.: Clustering by left-stochastic ma-
trix factorization. In: International Conference on Machine Learning (ICML), pp.
761–768 (2011)
2. Choi, H., Choi, S., Katake, A., Choe, Y.: Learning alpha-integration with partially-
labeled data. In: Proc. of the IEEE International Conference on Acoustics, Speech,
and Signal Processing, pp. 14–19 (2010)
3. Cichocki, A., Lee, H., Kim, Y.D., Choi, S.: Non-negative matrix factorization with
α-divergence. Pattern Recognition Letters 29, 1433–1440 (2008)
4. Cichocki, A., Zdunek, R., Phan, A.H., Amari, S.: Nonnegative Matrix and Tensor
Factorizations: Applications to Exploratory Multi-way Data Analysis. John Wiley
(2009)
5. Ding, C., Li, T., Peng, W.: On the equivalence between non-negative matrix factor-
ization and probabilistic latent semantic indexing. Computational Statistics and
Data Analysis 52(8), 3913–3927 (2008)
6. Ding, C., Li, T., Jordan, M.: Convex and semi-nonnegative matrix factorizations.
IEEE Transactions on Pattern Analysis and Machine Intelligence 32(1), 45–55
(2010)
7. Ding, N., Vishwanathan, S.: t-logistic regression. In: Advances in Neural Informa-
tion Processing Systems, vol. 23, pp. 514–522 (2010)
8. Ding, N., Vishwanathan, S., Qi, Y.A.: t-divergence based approximate inference.
In: Advances in Neural Information Processing Systems, vol. 24, pp. 1494–1502
(2011)
9. Févotte, C., Bertin, N., Durrieu, J.L.: Nonnegative matrix factorization with the
Itakura-Saito divergence: With application to music analysis. Neural Computa-
tion 21(3), 793–830 (2009)
10. Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd
Annual International Conference on Research and Development in Information
Retrieval (SIGIR), pp. 50–57. ACM (1999)
11. Hunter, D.R., Lange, K.: A tutorial on MM algorithms. The American Statisti-
cian 58(1), 30–37 (2004)
12. Mollah, M., Sultana, N., Minami, M.: Robust extraction of local structures by the
minimum of beta-divergence method. Neural Networks 23, 226–238 (2010)
13. Yang, Z., Oja, E.: Unified development of multiplicative algorithms for linear and
quadratic nonnegative matrix factorization. IEEE Transactions on Neural Net-
works 22(12), 1878–1891 (2011)
Selecting β-Divergence for Nonnegative Matrix
Factorization by Score Matching
1 Introduction
Nonnegative Matrix Factorization (NMF) has been recognized as an important
tool for signal processing and data analysis. Ever since Lee and Seung’s pioneer-
ing work [13,14], many NMF variants have been proposed (e.g. [5,12]), together
with their diverse applications including text, images, music, bioinformatics, etc.
(see e.g. [11,6]). See [4] for a survey.
Divergence or approximation error measure is an important dimension in the
NMF learning objective. Originally the approximation error was measured by
squared Euclidean distance or generalized Kullback-Leibler divergence [14]. Later
these measures were unified and extended to a broader parametric divergence
family called β-divergence [12].
Compared to the vast number of NMF cost functions, our knowledge about
how to select the best one among them for given data is limited. It is known that a
large β leads to more robust but less efficient estimation. Some more qualitative
This work is supported by the Academy of Finland in the project Finnish Center of
Excellence in Computational Inference Research (COIN).
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 419–426, 2012.
c Springer-Verlag Berlin Heidelberg 2012
420 Z. Lu, Z. Yang, and E. Oja
discussion on the trade-off can be found in [2,3] and references therein. Currently,
automatic selection among a parametric divergence family generally requires
extra information, for example, ground truth data [1] or cross-validations with a
fixed reference parameter [15], which is infeasible in many applications, especially
for unsupervised learning.
In this paper we propose a new method for the automatic selection of β-
divergence for NMF which requires no ground truth or reference data. Our
method can reuse existing NMF algorithms for estimating the factorizing ma-
trices. We then insert the estimated matrices into a Tweedie distribution which
underlies the β-divergence family. The partition function or normalizing con-
stant in Tweedie distribution is intractable except for three special cases. To
overcome this difficulty, we employ a non-maximum-likelihood estimator called
Score Matching, which avoids calculating the partition function and has been
shown to be both consistent and computationally efficient.
Evaluating the estimation performance is not a trivial task. We verify our
method using both synthetic and real-world data. Experimental results show
that the estimated β conforms well with the underlying distribution used to
generate the synthetic data. An empirical study on music signals using our
method is consistent with previous research findings in music.
After this introduction, we recapitulate the essence of NMF based on β-
divergence in Section 2. Our main theoretical contribution, including the prob-
abilistic model, Score Matching background, and the selection criterion, is pre-
sented in Section 3. Experimental settings and results are given in Section 4.
Finally we conclude the paper and discuss some future work in Section 5.
\[
\min_{W \ge 0,\, H \ge 0} D_\beta(X \,\|\, \widehat{X}), \qquad \widehat{X} = WH, \qquad (1)
\]
where
\[
D_\beta(X \,\|\, \widehat{X}) = \frac{1}{\beta(\beta-1)} \sum_{ij}
\left( X_{ij}^{\beta} - \beta X_{ij} \widehat{X}_{ij}^{\,\beta-1}
+ (\beta-1)\, \widehat{X}_{ij}^{\,\beta} \right) \qquad (2)
\]
for β ∈ ℝ\{0, 1}. For example, when β = 2, the β-divergence becomes the squared
Euclidean distance; when β → 1, it reduces to the (generalized) Kullback-Leibler
divergence; and when β → 0, it reduces to the Itakura-Saito divergence. Problem (1)
is commonly solved by the multiplicative update rules
\[
W \leftarrow W \,.\!*\, \frac{\big(X \,.\!*\, (WH)^{\beta-2}\big) H^{T}}{(WH)^{\beta-1} H^{T}},
\qquad
H \leftarrow H \,.\!*\, \frac{W^{T}\big(X \,.\!*\, (WH)^{\beta-2}\big)}{W^{T} (WH)^{\beta-1}},
\]
where the division and the power are taken elementwise, and ‘.’ refers to ele-
mentwise multiplication. The cost function Dβ (X|W H) monotonically decreases
under the above updates for β ∈ [1, 2] and thus usually converges to a local
minimum [12]. Empirical convergence for β outside this range is also observed in
most cases. Moreover, there are several other, more comprehensive algorithms
for β-NMF (see e.g. [7]).
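As a sketch, the multiplicative updates for β-NMF can be written in a few lines of NumPy (a minimal illustrative implementation of ours, not the authors' code):

```python
import numpy as np

def beta_nmf(X, k, beta, n_iter=200, eps=1e-12, seed=0):
    """Minimal multiplicative-update NMF under the beta-divergence.

    For beta in [1, 2] the cost decreases monotonically under these
    updates; outside that range convergence is only empirical.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        Xh = W @ H + eps                     # current approximation WH
        H *= (W.T @ (X * Xh ** (beta - 2))) / (W.T @ Xh ** (beta - 1))
        Xh = W @ H + eps
        W *= ((X * Xh ** (beta - 2)) @ H.T) / (Xh ** (beta - 1) @ H.T)
    return W, H
```

For beta = 2 the power terms vanish and the updates reduce to Lee and Seung's classic Euclidean rules [14].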
It is known that the β-divergence is associated with the classical Tweedie
distributions, which are a special case of an exponential dispersion model [16].
Minimizing a β-divergence is usually equivalent to maximizing the likelihood of the
where
\[
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{i}
\left[ 2\, X_i(t)\, \psi_i + X_i(t)^2\, \partial_i \psi_i
+ \tfrac{1}{2}\, \psi_i^2\, X_i(t)^2 \right] \qquad (9)
\]
Hyvärinen [9,10] showed that the score matching estimator
\(\theta^{*} = \arg\min_{\theta} J(\theta)\) is consistent, i.e., \(\theta^{*}\)
approaches the true value as T → ∞.
for β ≠ 1; when β → 1, L(1) is obtained as the corresponding limit of L(β). The
best β selected by using score matching is the minimizer of this objective,
\[
\beta^{*} = \arg\min_{\beta} L(\beta). \qquad (13)
\]
4 Experiments
Next we empirically evaluate the performance of the proposed selection method.
The estimated β using the criterion in Eq. (13) is compared with the ground
truth or established results in a particular domain. Because the β-NMF objective
in Eq. (1) is non-convex, only local optimizers are available. We repeat the
optimization algorithms multiple times from different starting points to avoid
poor local optima.
For each configuration, we repeated the process ten times and recorded the mean
and standard deviation of the selected β's. We also tried different low ranks k
and input matrices of different sizes.
The results are shown in Table 1. We can see that our method can estimate
β quite accurately. For various matrix sizes and low ranks, the selected β’s are
very close to the ground truth values with small deviations.
Fig. 1. Score matching objectives for β-NMF on the piano signal: (top) coarse search
where objectives are shown in log-scale, (bottom) fine search
Selecting β-Divergence for NMF by Score Matching 425
Fig. 2. Score matching objectives for β-NMF on the jazz signal: (left) coarse search
where objectives are shown in log-scale, (right) fine search
al. [6]. In this way the matrices used as input in β-NMF are of sizes 513 × 676 for
piano and 129 × 9312 for jazz. We also followed their choice of the compression
rank: k = 5, 6 for piano and k = 10 for jazz. The search interval [−2, 4] was the
same as in the synthetic case. We first performed coarse search using step size 0.5
and then fine search in [−0.1, 0.15] with smaller steps. For each β, we repeated
the NMF algorithms ten times with different starting points and recorded the
mean and standard deviation of score matching objectives.
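The coarse-to-fine search with random restarts can be sketched as a small driver loop (schematic code of ours; `toy_criterion` is a stand-in with a minimum near β = 0 mimicking the music results — in the paper this slot would fit β-NMF and evaluate the score matching objective of Eq. (13)):

```python
import numpy as np

def select_beta(criterion, betas, n_restarts=10):
    """Pick the beta minimizing the mean criterion over random restarts."""
    means = [np.mean([criterion(b, seed) for seed in range(n_restarts)])
             for b in betas]
    return float(betas[int(np.argmin(means))])

# Stand-in criterion: a bowl with its minimum near beta = 0, plus a small
# restart-dependent perturbation standing in for NMF initialization noise.
def toy_criterion(beta, seed):
    noise = 1e-4 * np.random.default_rng(seed).standard_normal()
    return (beta - 0.02) ** 2 + noise

coarse_grid = np.arange(-2.0, 4.01, 0.5)             # coarse search on [-2, 4]
b_coarse = select_beta(toy_criterion, coarse_grid)
fine_grid = np.arange(b_coarse - 0.1, b_coarse + 0.151, 0.025)
b_fine = select_beta(toy_criterion, fine_grid)       # refined estimate
```

The two-stage grid mirrors the procedure in the text: step 0.5 over [-2, 4], then a finer grid around the coarse minimizer.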
The results for piano and jazz are shown in Figures 1 and 2, respectively.
In the coarse search, β = 0 is the only value that yields a negative score
matching objective, which is also the minimum. The fine search results confirm
that the minimum, or the elbow point, is very close to zero for both piano and
jazz. These results are consistent with the previous finding that the
Itakura-Saito divergence is well suited to music signals (e.g. [6]).
Several issues still need to be resolved in the future. The rank k was specified
in advance in our method, which may affect the result of the β-selection. It
would be interesting to see whether score matching also works for low-rank
selection. In addition, β-divergence is a scale-dependent measure, so scale
tuning of the data matrix might be required.
References
1. Choi, H., Choi, S., Katake, A., Choe, Y.: Learning alpha-integration with partially-
labeled data. In: Proc. of the IEEE International Conference on Acoustics, Speech,
and Signal Processing, pp. 14–19 (2010)
2. Cichocki, A., Amari, S.I.: Families of alpha-, beta- and gamma-divergences: Flexible
and robust measures of similarities. Entropy 12, 1532–1568 (2010)
3. Cichocki, A., Cruces, S., Amari, S.I.: Generalized alpha-beta divergences and their
application to robust nonnegative matrix factorization. Entropy 13, 134–170 (2011)
4. Cichocki, A., Zdunek, R., Phan, A.H., Amari, S.: Nonnegative Matrix and Tensor
Factorizations: Applications to Exploratory Multi-way Data Analysis. John Wiley
(2009)
5. Dhillon, I.S., Sra, S.: Generalized nonnegative matrix approximations with breg-
man divergences. In: Advances in Neural Information Processing Systems, vol. 18,
pp. 283–290 (2006)
6. Févotte, C., Bertin, N., Durrieu, J.L.: Nonnegative matrix factorization with the
Itakura-Saito divergence: With application to music analysis. Neural Computa-
tion 21(3), 793–830 (2009)
7. Févotte, C., Idier, J.: Algorithms for nonnegative matrix factorization with the
β-divergence. Neural Computation 23(9), 2421–2456 (2011)
8. Grunwald, P.D., Myung, I.J., Pitt, M.A.: Advances in Minimum Description
Length: Theory and Applications. MIT Press (2005)
9. Hyvärinen, A.: Estimation of non-normalized statistical models by score matching.
Journal of Machine Learning Research 6, 695–709 (2005)
10. Hyvärinen, A.: Some extensions of score matching. Computational Statistics &
Data Analysis 51(5), 2499–2512 (2007)
11. Kim, H., Park, H.: Sparse non-negative matrix factorizations via alternating non-
negativity-constrained least squares for microarray data analysis. Bioinformat-
ics 23(12), 1495 (2007)
12. Kompass, R.: A generalized divergence measure for nonnegative matrix factoriza-
tion. Neural Computation 19(3), 780–791 (2006)
13. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix fac-
torization. Nature 401, 788–791 (1999)
14. Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: Ad-
vances in Neural Information Processing Systems, vol. 13, pp. 556–562 (2001)
15. Mollah, M., Sultana, N., Minami, M.: Robust extraction of local structures by the
minimum of beta-divergence method. Neural Networks 23, 226–238 (2010)
16. Tweedie, M.: An index which distinguishes between some important exponential
families. In: Statistics: Applications and New Directions: Proc. Indian Statistical
Institute Golden Jubilee International Conference, pp. 579–604 (1984)
17. Xu, L.: Bayesian ying yang harmony learning. In: Arbib, M.A. (ed.) The Handbook
of Brain Theory and Neural Networks, 2nd edn., pp. 1231–1237. MIT Press (2002)
Neural Networks for Proof-Pattern Recognition
1 Introduction
Automated theorem proving has been applied to solve sophisticated mathe-
matical problems (e.g., verification of the Four-Colour Theorem in Coq), and
for industrial-scale software and hardware verification (verification of micro-
processors in ACL2). However, such “computer-generated” proofs require consid-
erable programming skills, and overall, are time-consuming and hence expensive.
Programs in automated provers may contain thousands of theorems of varying
sizes and complexities. Some proofs will require the programmer's intervention. In
this case, a manually found proof for one problematic lemma may serve as a
template for several other lemmas needing a manual proof. Automated discovery
of common proof-patterns using tools of statistical machine learning such as
neural networks could potentially provide the much-sought automation for
statistically similar proof-steps, as was argued e.g. in [1,2,3,4,10,11,12].
As classified in [1], applications of machine-learning assistants to mechanised
proofs can be divided into symbolic (akin to, e.g., inductive logic programming),
numeric (akin to neural networks or kernels), and hybrid. In this paper,
we focus on neural networks. The advantage of numeric methods over symbolic
ones is tolerance to noise and uncertainty, as well as the availability of powerful
learning functions. For example, standard multi-layer perceptrons with the error
back-propagation algorithm are capable of approximating any continuous function on a
finite-dimensional vector space with arbitrary precision. In this case, it is not
the power of the learning paradigm, but the feature selection and representation
method that sets the limits. Consider the following example.
Example 1. Let ListNat be a logic program defining lists of natural numbers:
1. nat(0) ← ; 2. nat(s(x)) ← nat(x);
The work was supported by EPSRC grants EP/F044046/2 and EP/J014222/1.
428 E. Komendantskaya and K. Lichota
Recursive logic programs, such as the program above, are traditionally
problematic for neural network representation, as they cannot be soundly
propositionalised and represented by the vectors of truth values, but see e.g. [5,7].
The method we present here is designed to steer away from these problems.
It covers (co-)recursive first-order logic programs. To manage (co-)recursion
efficiently, we use the formalism of coinductive proof trees for logic programs, see
[8]. The coinductive trees possess a more regular structure than e.g. SLD-trees.
In Section 2, we propose an original feature extraction algorithm for arbitrary
proof-trees. It allows us to capture intricate structural features of the proof-trees,
such as branching and dependencies between terms and predicates, as well as
internal dependencies between structures of terms appearing at different stages
of the proof. We implement the feature extraction algorithm: see [6].
The main advantages of the feature selection method we propose are its accuracy,
generality, and robustness to changes in classification tasks. In Section 3, we
test this method on a range of classification tasks and possible implementation
scenarios with very high rates of success. All our experiments involving neural
networks were carried out in the MATLAB Neural Network Toolbox (pattern-recognition
package), with a standard three-layer feed-forward network with sigmoid hidden
and output neurons. The network was trained with scaled conjugate gradient
back-propagation. Such networks can classify vectors arbitrarily well given
enough neurons in the hidden layer; we tested their performance with 40, 50, 60,
70, and 90 hidden neurons in all experiments. All the software, datasets, and a
detailed reference manual are available in [6].
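An illustrative stand-in for this setup is a minimal NumPy three-layer network with sigmoid hidden and output units (our own sketch, trained by plain batch gradient descent rather than the MATLAB toolbox's scaled conjugate gradient; all names are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, y, n_hidden=40, lr=0.5, n_epochs=3000, seed=0):
    """Train a three-layer feed-forward net on binary targets y in {0, 1}."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (X.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.5, (n_hidden, 1))
    b2 = np.zeros(1)
    t = y.reshape(-1, 1).astype(float)
    for _ in range(n_epochs):
        h = sigmoid(X @ W1 + b1)            # hidden activations
        p = sigmoid(h @ W2 + b2)            # output probabilities
        d2 = (p - t) / len(X)               # cross-entropy output delta
        d1 = (d2 @ W2.T) * h * (1.0 - h)    # back-propagated hidden delta
        W2 -= lr * (h.T @ d2); b2 -= lr * d2.sum(axis=0)
        W1 -= lr * (X.T @ d1); b1 -= lr * d1.sum(axis=0)
    return lambda Z: (sigmoid(sigmoid(Z @ W1 + b1) @ W2 + b2) > 0.5).ravel()
```

Feature vectors extracted from proof trees would take the place of the rows of `X`; the hidden-layer width plays the role of the 40–90 neurons tested above.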
Fig. 1. Coinductive trees for ListNat, for the goals li(c(x, c(y, z))) →
li(c(s(w), c(s(w), nil))) → li(c(s(0), c(s(0), nil))). We abbreviate cons by c and
list by li. The last tree is a success tree, which implies that the sequence of
derivation steps succeeded.
Fig. 2. Coinductive derivation for the goal G = stream(x) and the program Stream:
stream(x) →^{θ1} stream(scons(z, y)) →^{θ2} ... →^{θ3}
stream(scons(0, scons(y1, z1))), the last tree having subgoals bit(y1) and
stream(z1).
– A is the root of T .
– Each node in T is either an and-node or an or-node: Each or-node is given
by •. Each and-node is an atomic formula.
– For every and-node A occurring in T, if there exist exactly m > 0 distinct
clauses C_1, ..., C_m in P (a clause C_i has the form B^i ← B^i_1, ..., B^i_{n_i},
for some n_i), such that A = B^1 θ_1 = ... = B^m θ_m, for some substitutions
θ_1, ..., θ_m, then A has exactly m children given by or-nodes, such that,
for every i ∈ m, the ith or-node has n_i children given by and-nodes
B^i_1 θ_i, ..., B^i_{n_i} θ_i.
Fig. 3. Matrices M1 and M2 encode the left-most and right-most trees in Figure 1
Trees like the ones given in Figure 1 were positive examples, and trees like the
one in Figure 2 were negative. Bearing in mind the subtlety of the notion of a
success family, the accuracy of classification was astonishing (86%), cf. Figure 4.
Problems 4 and 5 have conceptual significance for future applications, see [4];
our experiments show high accuracy in recognition of such proofs. Overall, the
proposed method works well and applies to a variety of classification tasks.
IS 1. NN-tool can create a new neural network for every new logic program.
Example 5. Using our running examples, separate sets of feature vectors for
ListNat and Stream can be used to train two separate neural networks; see also
Figure 4 for experiments supporting this approach.
The obvious objection to such an approach is that creating a new neural network
for every new fragment of a big program development may be cumbersome; it will
not capture possible common patterns across different programs and fragments,
and it will handle badly the cases where some apparently disconnected
programs are bound by a newly added definition. The next two implementation
scenarios address these problems.
IS 2. NN-tool can use only one neural network, and re-train it irrespective of
the changes in program clauses, new predicates or proof structures.
X-Y — Problem        Initial accuracy for X   Test 1 on Y   Test 2   Test 3   Mixed data (X-Y)
List-Stream — P1     76.4%                    44.2%         51.9%    63.9%    67.1%
Stream-List — P1     84.3%                    36.7%         44.0%    67.0%    67.1%
List-Stream — P5     82.4%                    65.6%         80.0%    99.0%    80.1%
Stream-List — P5     79.0%                    43.5%         63.5%    85.9%    80.1%
Fig. 5. Gradual adaptation to new types of proofs. Letters X and Y stand for
the logic programs ListNat and Stream interchangeably; P1 and P5 stand for Problems 1
and 5. First, logic program X is taken, and the neural network's accuracy is shown
in the first column. Then these trained networks were used to classify examples of
the proofs for a new logic program Y. The accuracy drops at the start, see the
"Test 1" column. Further columns show how the neural network regains its accuracy
as it is trained and tested on more examples of type Y. For comparison, the last
column shows batch learning on mixed data without gradual adaptation.
In this case, the main question is how quickly the neural network will adapt
to new patterns determined by a new logic program. We designed an experiment
to test this, see Figure 5. It shows that gradual adaptation of a previously trained
neural network is at least as efficient as training on mixed data. In fact, for
Problem 5, it is more successful than training on mixed data!
Example 6. Suppose the NN-tool was used to work with proofs constructed for
two programs, ListNat and Stream, and maintains two corresponding neural
networks. However, a new clause is added by the user:
listream(x,y) ← list(x), stream(y).
This new program Listream subsumes both ListNat and Stream.
The old neural networks will not accept the changed feature vectors for new proofs,
as the additional new predicate changes the size of the feature vectors. In
this case, it is possible to extend the feature matrices, and thus treat the proof
features from different programs as features of one meta-proof. An example of
an extended feature matrix and the results of tests are given in Figure 6. Note
that the accuracy for Listream (Figure 6) exceeds the accuracy for Stream and List
separately (cf. Figure 4), despite the growth of the feature vectors.
Finally, as Figure 6 shows, we tried to mix Scenarios 2 and 3. It is encouraging
that for Problem 5, training on merged-matrix features outperformed
simple mixing of data sets (as in Figure 5). When working with extended feature
vectors, Listream outperformed the simpler merged-matrix data training.
This shows that the feature-selection method we present allows extensions that
capture significant and increasingly intricate proof-patterns.
4 Conclusions
The advantage of the learning method presented here lies in its ability to capture
intricate relational information hidden in proof trees, such as patterns arising
from interdependencies of predicate type, term structure, branching of proofs,
and ultimate proof success. This method allows us to apply neural networks to a
wide range of data mining tasks, and the universality of the method is another of
its advantages. We implemented it in [6]. Future work is to integrate the neural
network tool into one of the existing theorem provers. Another direction is to
apply the method to other kinds of contextually rich data, such as e.g. web pages.
References
1. Denzinger, J., Fuchs, M., Goller, C., Schulz, S.: Learning from previous proof ex-
perience: A survey. Technical report, Technische Universität München (1999)
2. Denzinger, J., Schulz, S.: Automatic acquisition of search control knowledge from
multiple proof attempts. Inf. Comput. 162(1-2), 59–79 (2000)
3. Duncan, H.: The use of Data-Mining for the Automatic Formation of Tactics. PhD
thesis, University of Edinburgh (2002)
4. Grov, G., Komendantskaya, E., Bundy, A.: A statistical relational learning chal-
lenge - extracting proof strategies from exemplar proofs. In: ICML 2012 Workshop
on Statistical Relational Learning, Edinburgh, July 30 (2012)
5. Hitzler, P., Hölldobler, S., Seda, A.K.: Logic programs and connectionist networks.
Journal of Applied Logic 2(3), 245–272 (2004)
6. Komendantskaya, E.: ML-CAP home page (2012),
http://www.computing.dundee.ac.uk/staff/katya/MLCAP-man/
7. Komendantskaya, E., Broda, K., d’Avila Garcez, A.: Neuro-symbolic Represen-
tation of Logic Programs Defining Infinite Sets. In: Diamantaras, K., Duch, W.,
Iliadis, L.S. (eds.) ICANN 2010, Part I. LNCS, vol. 6352, pp. 301–304. Springer,
Heidelberg (2010)
8. Komendantskaya, E., Power, J.: Coalgebraic semantics for derivations in logic pro-
gramming. In: CALCO. LNCS, pp. 268–282. Springer, Heidelberg (2011)
9. Lloyd, J.: Foundations of Logic Programming, 2nd edn. Springer (1987)
10. Lloyd, J.: Logic for Learning: Learning Comprehensible Theories from Structured
Data. Cognitive Technologies Series. Springer (2003)
11. Tsivtsivadze, E., Urban, J., Geuvers, H., Heskes, T.: Semantic graph kernels for
automated reasoning. In: SDM 2011, pp. 795–803. SIAM / Omnipress (2011)
12. Urban, J., Sutcliffe, G., Pudlák, P., Vyskočil, J.: MaLARea SG1 - Machine Learner
for Automated Reasoning with Semantic Guidance. In: Armando, A., Baumgartner,
P., Dowek, G. (eds.) IJCAR 2008. LNCS (LNAI), vol. 5195, pp. 441–456. Springer,
Heidelberg (2008)
Using Weighted Clustering and Symbolic Data
to Evaluate Institutes' Scientific Production
1 Introduction
Nowadays the use of databases is increasingly important, and databases grow
rapidly. In the classical approach, these databases are described by arrays of
quantitative or qualitative values, where each column represents a variable. In
particular, each individual takes a single value for each variable [1]. However,
this representation can be too restrictive for large databases. A possible solution
is to use techniques of Symbolic Data Analysis (SDA) to summarize the data using
symbolic variables that can take intervals, histograms, and even functions
as values, in order to account for the variability and/or uncertainty innate to the
data [1]. The main goal of SDA is to model information using symbolic data and
to extend classical data analysis and data mining techniques, such as clustering,
factorial techniques, and decision trees, to symbolic data. Therefore, symbolic
objects can be used to reduce the data, improve its understandability, and enhance
its retrieval.
In many situations, a basic concept of interest (the second level of observation)
can be observed in the data set by aggregating observations (the first level) that have
this concept in common, that is, a collection of individuals satisfies the concept of
interest [2]. Thus, to summarize the data set, categories of variables are selected
and considered as new statistical units. A widely used way to aggregate the observations
is to use interval-valued variables. For this, SDA has tools to generalize the
436 B.A. Pimentel, J.P. Nóbrega, and R.M.C.R. de Souza
standard approach using symbolic data [1], in special using the interval-valued
variables. Standard descriptive statistics (like mean, variance and histogram)
and methods (like K-Means, Fuzzy C-Means and Principal Component Analy-
sis) were extended by symbolic data and several works showed the importance
of this extension [1].
A few works show applications of SDA to actual databases. Neto and De
Carvalho [3] described an application on administrative management in cities
of the Pernambuco state (Brazil) that used interval-valued variables to
describe public services. Zuccolotto [4] presented the use of symbolic data
analysis on a database about job satisfaction of Italian workers through the
principal component analysis method. Silva et al. [5] made experiments using
information on web users, where the goal was to group them such that individuals
in the same group had similar behavior; for this, a dynamic clustering method for
interval-valued variables was used. Giusti and Grassini [6] presented experiments
that used symbolic data analysis methods and tools to cluster local areas of Italy
based on their economic specialization, where the concept of interest was the
local labor systems and the first-level observations were the municipalities.
The purpose of this work is to investigate the use of symbolic data analysis
methods and tools to classify, through a weighted clustering method, Brazilian
research institutes on the basis of their scientific production. The advantage of
this method is that the clustering algorithm is able to recognize clusters of
different shapes and sizes. The analysis is carried out on data derived from
the National Council for Scientific and Technological Development over the years
2006, 2007 and 2008. These data are aggregated by sub-area of knowledge (like
physics, mathematics, medicine, etc.) and institute in order to summarize the
original database and obtain new concepts of scientific production. Thus, a new
database is constructed in which each item represents a group of researchers
under the same institute and subject of research.
The paper proceeds as follows: Section 2 describes the scientific production
data considered in this paper and highlights the aggregation process adopted to
obtain symbolic data (second-level objects) from these data (first-level objects).
This section also presents the reasons for using symbolic data. Section 3 shows the
application of a weighted clustering method to the scientific production data.
The concluding remarks are given in Section 4.
The data were extracted from the National Council for Scientific and Techno-
logical Development (http://www.cnpq.br), which is an agency of the Ministry
of Science, Technology and Innovation whose mission is to promote scientific and
technological research and the training of human resources for research in the
country. Another important Brazilian agency is the Coordination for the Improvement
of Higher Level Personnel (http://www.capes.gov.br), whose main activity is to
evaluate the Brazilian research institutes. This agency evaluates the Brazilian
post-graduate courses based on the scientific production of the researchers.
Each interval is obtained in the following way: the interval x_h^j = [a_h^j, b_h^j],
with a_h^j ≤ b_h^j and a_h^j, b_h^j ∈ ℝ, is given by
[a_h^j, b_h^j] = [min{v_i^j}, max{v_i^j}] over all i ∈ Ω such that
c_i^1 = c_f^1 and c_i^3 = c_f^3 for f ≠ i, f ∈ Ω.
Example: consider a group of 4 researchers indexed by 1, ..., 4 under the same
institute "UFPE" and sub-area of knowledge "Biophysical". Given a continuous
numerical variable "International journal", these researchers have the following
values for this variable, respectively: 1.75, 0.25, 0.75 and 0.00. In order to describe
these researchers by a symbolic interval variable, the original data are
aggregated and the minimum and maximum generalization tools are applied to the
continuous values. Now this group of researchers has [0.00, 1.75] as the value of
the interval variable "International journal".
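The aggregation step above can be sketched as follows (illustrative code of ours; the function and field names are hypothetical):

```python
from collections import defaultdict

def aggregate_intervals(records):
    """Summarize first-level observations into [min, max] intervals per
    (institute, sub-area) group -- the second-level symbolic objects."""
    groups = defaultdict(list)
    for institute, subarea, value in records:
        groups[(institute, subarea)].append(value)
    return {key: (min(vals), max(vals)) for key, vals in groups.items()}

# The four researchers of the example above:
records = [("UFPE", "Biophysical", v) for v in (1.75, 0.25, 0.75, 0.00)]
intervals = aggregate_intervals(records)
# intervals[("UFPE", "Biophysical")] -> (0.0, 1.75)
```

Each key of the resulting dictionary is one second-level statistical unit, described by an interval per numeric variable.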
3 Application
In this work we used the dynamic clustering algorithm based on Hausdorff
adaptive distances [8] to partition the data set into K clusters. The main goal of this
partition is to analyze groups of sub-areas of knowledge of different institutions
which have similar properties according to their scientific production. In the
following section, there is a brief description of this algorithm.
Let m_i^j = (a_i^j + b_i^j)/2 be the midpoint of the interval x_i^j = [a_i^j, b_i^j]
and l_i^j = (b_i^j − a_i^j)/2 be half of its length. Consider
μ_k^j = median{m_i^j : i ∈ C_k} and ρ_k^j = median{l_i^j : i ∈ C_k}. The lower and
upper bounds α_k^j and β_k^j are given as:
\[
\alpha_k^j = \mu_k^j - \rho_k^j, \qquad \beta_k^j = \mu_k^j + \rho_k^j. \qquad (3)
\]
The parameter λ_k^j belongs to the weight vector λ_k = (λ_k^1, ..., λ_k^p) and
weights the distance between the object x_i and the prototype y_k according to the
cluster C_k and variable j. The weight vectors minimize the clustering criterion J
and can be calculated using the method of Lagrange multipliers. After some algebra
the parameter λ_k^j is calculated as:
\[
\lambda_k^j = \frac{\Big[\prod_{h=1}^{p}\Big(\sum_{i \in C_k}
\max\{|a_i^h - \alpha_k^h|,\, |b_i^h - \beta_k^h|\}\Big)\Big]^{1/p}}
{\sum_{i \in C_k} \max\{|a_i^j - \alpha_k^j|,\, |b_i^j - \beta_k^j|\}} \qquad (4)
\]
with the restrictions \(\prod_{j=1}^{p} \lambda_k^j = 1\) and \(\lambda_k^j > 0\).
The algorithm starts with an initial partition and alternates three steps until
convergence, when the criterion J reaches a stationary value or the partition does
not change (test = 0), representing a local minimum.
Schema of the weighted clustering algorithm
1. Initialization
Choose (randomly) a partition P = (C_1, ..., C_K) of Ω.
2. Representation step
(with the partition P = (C_1, ..., C_K) of Ω fixed)
a) Compute the prototype y_k = ([α_k^1, β_k^1], ..., [α_k^p, β_k^p]), where α_k^j
and β_k^j are calculated following equation (3);
b) For j = 1, ..., p and k = 1, ..., K, compute λ_k^j with equation (4).
3. Allocation step: definition of the partition
(with the prototypes y_k and the weights λ_k^j, j = 1, ..., p, k = 1, ..., K, fixed)
Fix test ← 0.
For i = 1 to n: define k* = argmin_{k=1,...,K} d_k(x_i, y_k); if i ∈ C_k and
k* ≠ k, set test ← 1, C_{k*} ← C_{k*} ∪ {i} and C_k ← C_k \ {i}.
4. Stopping criterion
If test = 0 then STOP, otherwise go to step 2.
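A minimal NumPy sketch of this schema (our own illustrative implementation of steps 1-4, assuming the prototype bounds are the medians μ ± ρ and the weights follow the geometric-mean form of eq. (4); it is not the authors' code):

```python
import numpy as np

def weighted_interval_clustering(A, B, K, init=None, max_iter=100, eps=1e-12):
    """Dynamic clustering of interval data with adaptive Hausdorff
    distances.  A and B are (n, p) arrays of lower/upper bounds."""
    n, p = A.shape
    labels = (np.random.default_rng(0).integers(0, K, n)
              if init is None else np.asarray(init))
    alpha = np.zeros((K, p)); beta = np.zeros((K, p)); lam = np.ones((K, p))
    for _ in range(max_iter):
        # Representation step: prototypes and adaptive weights per cluster
        for k in range(K):
            members = labels == k
            if not members.any():
                continue
            mu = np.median((A[members] + B[members]) / 2.0, axis=0)
            rho = np.median((B[members] - A[members]) / 2.0, axis=0)
            alpha[k], beta[k] = mu - rho, mu + rho          # eq. (3)
            d = np.maximum(np.abs(A[members] - alpha[k]),
                           np.abs(B[members] - beta[k])).sum(axis=0) + eps
            lam[k] = np.exp(np.log(d).mean()) / d           # eq. (4)
        # Allocation step: reassign each object to its nearest prototype
        dist = np.stack([(lam[k] * np.maximum(np.abs(A - alpha[k]),
                                              np.abs(B - beta[k]))).sum(axis=1)
                         for k in range(K)])
        new_labels = dist.argmin(axis=0)
        if np.array_equal(new_labels, labels):              # test = 0: stop
            break
        labels = new_labels
    return labels, alpha, beta, lam
```

The weight update uses the geometric mean of the per-variable dispersions, so the product of a cluster's weights is exactly 1, matching the constraint in eq. (4).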
3.2 Results
The Coordination for the Improvement of Higher Level Personnel initially considered
7 levels of categories of researchers. Level 1 means institutes with a very poor
performance in terms of scientific production, below the minimum standard value
of quality required, and level 7 means institutes that offer an excellent level
of scientific production. However, in this application, only levels 3 to 7 were
adopted, since levels 1 and 2 are below the minimum standard value of quality
required. Thus, in this work, the clustering algorithm aims to look for a partition
into 5 clusters.
The weighted clustering method is run until the convergence to a stationary
value of the adequacy criterion 200 times and the best result, according to the
adequacy criterion, is selected. The 5 clusters of the partition have the following
sizes, respectively: 935, 763, 1734, 600, 1598.
In order to evaluate the spread of each variable, Figures 1 and 2 present the
visualization, using the Zoom Star method [7], of the 5 prototypes found by the
weighted clustering algorithm. The Zoom Star method displays the area
between the upper-bound and lower-bound polygons. Here, each axis represents
the lower and upper bounds of the interval variables for a given class. Moreover,
the variables were standardized. The new boundaries are:
\[
a_k^j = \frac{\alpha_k^j - m^j}{n^j - m^j} \qquad \text{and} \qquad
b_k^j = \frac{\beta_k^j - m^j}{n^j - m^j}
\]
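Assuming m^j and n^j denote the minimum and maximum observed values of variable j (our reading; the fragment does not define them), the rescaling can be sketched as:

```python
import numpy as np

def standardize_bounds(alpha, beta, m, n):
    """Rescale prototype bounds per variable:
    a = (alpha - m) / (n - m), b = (beta - m) / (n - m)."""
    scale = n - m
    return (alpha - m) / scale, (beta - m) / scale

# One variable with prototype interval [2, 4] and observed range [0, 10]:
a, b = standardize_bounds(np.array([2.0]), np.array([4.0]),
                          np.array([0.0]), np.array([10.0]))
# a -> [0.2], b -> [0.4]
```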
Fig. 1. Prototypes of clusters 1, 2 and 3 according to the Zoom Star method
Fig. 2. Prototypes of clusters 4 and 5 according to the Zoom Star method
According to the figures above, cluster 4 has the highest scientific production.
However, this cluster represents only 10% of the symbolic data set. Cluster 3 has
the lowest production and is the largest cluster, representing 31% of the data.
Cluster 5 is similar to cluster 2, and together they comprise 42% of the data
set. Cluster 1 has medium scientific production in comparison with the other
clusters and represents 17%. In conclusion, these results show that advances
in scientific production are needed, since 73% of the researcher groups present
low production.
The clustering algorithm based on adaptive distances calculates a weight vector
for each cluster. Each weight represents the variability of a variable in relation to
the prototype. Low values of the weights indicate variables with high variability,
whereas high values indicate variables with low variability. Table 1 shows the
weight vectors for the 5 clusters provided by the clustering algorithm after
convergence. The registered software and technique variables have very high
weights in clusters 1, 3 and 5. These values are outliers, because there are many
values close to 0 for these variables, since the research institutes registered few
software products and techniques in the years 2006, 2007 and 2008. However,
these variables have the highest contribution values in cluster 4, where they are
not outliers. In cluster 2, the highest contribution values are for the following
variables: specialization guidelines finished, unregistered techniques, and summary
of journal. There are no weight outliers in cluster 2. Disregarding the weight
outliers, the unregistered product variable has the highest contribution in both
clusters 1 and 3, and the unregistered technique variable has the highest weight
in cluster 5.
4 Conclusion
References
1. Diday, E., Noirhomme-Fraiture, M.: Symbolic Data Analysis and the SODAS
Software. Wiley Interscience, Chichester (2008)
2. Billard, L., Diday, E.: Symbolic Data Analysis: Conceptual Statistics and Data
Mining. Wiley Interscience, Chichester (2006)
3. Neto, E.A.L., De Carvalho, F.A.T.: Symbolic Approach to Analyzing Administrative
Management. The Electronic Journal of Symbolic Data Analysis 1(1), 1–13 (2002)
4. Zuccolotto, P.: Principal components of sample estimates: an approach through
symbolic data analysis. Applied and Metallurgical Statistics 16, 173–192 (2006)
5. Silva, A.D., Lechevallier, Y., De Carvalho, F.A.T., Trousse, B.: Mining web usage
data for discovering navigation clusters. In: IEEE Symposium on Computers and
Communications, pp. 910–915 (2006)
6. Giusti, A., Grassini, L.: Cluster analysis of census data using the symbolic data
approach. Adv. Data Analysis and Classification 2, 163–176 (2008)
7. Noirhomme-Fraiture, M.: Visualization of large data sets: The Zoom Star solution.
Electron. J. Symbol. Data Anal. 0, 26–39 (2002)
8. De Carvalho, F.A.T., Souza, R.M.C.R., Chavent, M., Lechevallier, Y.: Adaptive
Hausdorff distances and dynamic clustering of symbolic data. Pattern Recognition
Letters 27(3), 167–179 (2006)
Comparison
of Input Data Compression Methods
in Neural Network Solution of Inverse Problem
in Laser Raman Spectroscopy of Natural Waters
Abstract. In their previous papers, the authors of this study suggested
and realized a method for simultaneous determination of the temperature
and salinity of seawater from laser Raman spectra with the
help of neural networks. Later, the method was improved for
determination of the temperature and salinity of natural water from Raman
spectra in the presence of fluorescence of dissolved organic matter, which
forms a pedestal under the Raman valence band. In this study, the method
has been further improved by compression of the input data. This paper
presents a comparison of various input data compression methods based on
feature selection and feature extraction, and their effect on the error of
determination of temperature and salinity.
1 Introduction
Knowledge of such seawater parameters as temperature (T) and salinity (S) is of great importance, because it helps to understand the evolution of climate change and to study the energy exchange between the water surface and the atmosphere. The necessity of global monitoring of T and S arises from a tendency observed in recent years: the shrinking of the polar icecaps due to global warming. Melting ice desalinates the surface layer of the ocean, which can trigger a reorganization of the oceanic current system and cause considerable climate changes, not only in polar areas but on a planetary scale.
It is obvious that ecological monitoring of natural waters, i.e. determination of such key parameters as T and S, requires rapid non-contact diagnostic methods that can be implemented in real time. Such properties are inherent in the non-contact radiometric method of determination of either S or T of the
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 443–450, 2012.
© Springer-Verlag Berlin Heidelberg 2012
444 S. Dolenko et al.
2 Experiment
The solution of the stated problem (determination of T and S taking into account the fluorescence of DOM) using NN was performed via an experiment-based approach [21]. This means that only experimental spectra were used for NN training. In this case no a priori constructed model is needed, and all specific features of the object are automatically taken into consideration.
To perform this study, an array of experimental spectra with different values of the parameters (temperature, salinity and concentration of DOM) was recorded. Solutions were prepared from bidistilled water, river humus and sea salt. Salinity was varied from 0 to 45 psu (step 5 psu), the concentration of humus from 0 to 350 mg/l, and temperature from 0 to 35 °C (step 5 °C).
Fig. 1. Scheme of experimental setup: 1 - argon laser (488 nm), 2 - beam splitter,
3 - laser power meter, 4 - focusing lens, 5 - thermo-stabilized cuvette, 6 - system of
thermo-stabilization, 7 - system of lenses, 8 - edge-filter, 9 - monochromator, 10 -
photomultiplier, 11 - CCD-camera, 12 - computer
All spectra used for work with NN were measured with 5 s camera exposure
time for valence bands and 10 s for low-frequency bands.
3 Methods
In the preceding study [22], the same experimental data array was used for NN determination of T and S in the presence of DOM. The best results were obtained with perceptrons with three hidden layers. Using only the Raman valence band, the best results obtained were a mean absolute error (MAE) of 1.2 °C for temperature determination and 1.5 psu for salinity determination.
Using both the valence band and the low-frequency region, it was possible to reduce the errors to 0.8 °C and 1.1 psu. Recall that the maximum MAE values that make a method interesting for practical applications are about 1 °C and 1 psu.
The purpose of the present study was to achieve the same level of results or better using only the valence band. Such an opportunity would be important, as recording the low-frequency region of Raman spectra requires more sophisticated and therefore more expensive experimental equipment.
It was planned to achieve this goal by reducing the initial dimensionality of the input data (1024 features, i.e. spectral channels). It is quite obvious that the actual dimensionality of the problem should be much lower. Therefore, different methods of feature selection and feature extraction were applied to achieve input data compression.
For all NN experiments in this study, a fixed NN architecture was used: a perceptron with a single 64-neuron hidden layer, a logistic activation function in the hidden layer and a linear activation function in the output layer. The learning rate was r = 0.01 and the momentum m = 0.5. Training was stopped 1000 epochs after the minimum error on the test set was reached. The results were estimated on the examination (out-of-sample) set. To account for random factors due to weight initialization, 5 NNs with different initial weights were trained for each experiment.
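This fixed architecture can be illustrated with a minimal forward-pass sketch (our own illustration, not the authors' code; the 8-input size and the weight ranges are arbitrary assumptions):

```python
import math
import random

def logistic(x):
    """Logistic (sigmoid) activation, as used in the hidden layer."""
    return 1.0 / (1.0 + math.exp(-x))

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a perceptron with one hidden layer:
    logistic activation in the hidden layer, linear output layer."""
    hidden = [logistic(sum(wi * xi for wi, xi in zip(w, x)) + b)
              for w, b in zip(w_hidden, b_hidden)]
    return [sum(wi * hi for wi, hi in zip(w, hidden)) + b
            for w, b in zip(w_out, b_out)]

# The paper's architecture: 64 hidden neurons, 2 outputs (T and S).
random.seed(0)
n_in, n_hidden, n_out = 8, 64, 2   # 8 inputs chosen only for illustration
w_hidden = [[random.uniform(-0.1, 0.1) for _ in range(n_in)]
            for _ in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [[random.uniform(-0.1, 0.1) for _ in range(n_hidden)]
         for _ in range(n_out)]
b_out = [0.0] * n_out

y = mlp_forward([0.5] * n_in, w_hidden, b_hidden, w_out, b_out)
print(len(y))  # two outputs: temperature and salinity
```

Training with the stated learning rate and momentum would proceed by standard backpropagation; only the forward structure is sketched here.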
1) Cross-correlation. The values of the cross-correlation (CC) of each input feature with the output ones were calculated. Then only the input features with CC exceeding a pre-defined threshold value (0.3) were used to solve the problem. The main shortcoming of this method is that linear correlation can capture only linear relationships between variables, thus failing to identify significant input features with a nonlinear influence on the determined output variable. The determined dependence of CC on the spectral shift corresponding to each feature is presented in Fig. 4.
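A minimal sketch of this selection step (our own illustration; the toy data are made up, and taking the absolute value of CC is an assumption, since the text does not state how negative correlations are handled):

```python
import math

def pearson(xs, ys):
    """Linear (Pearson) cross-correlation between two sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_by_cc(features, target, threshold=0.3):
    """Keep the indices of features whose |CC| with the target
    exceeds the threshold (0.3 in the paper)."""
    return [i for i, col in enumerate(features)
            if abs(pearson(col, target)) > threshold]

# Toy data: feature 0 tracks the target, feature 1 is nearly unrelated.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = [[1.1, 2.0, 2.9, 4.2, 5.0],   # strongly correlated
            [0.0, 5.0, 1.0, 4.0, 2.0]]   # weakly correlated
print(select_by_cc(features, target))    # → [0]
```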
poor for a small number of samples that can be provided from experiment. The
determined spectral dependence of CE is presented in Fig. 4.
3) General Regression NN (GRNN, [23]) with correcting coefficients for the smoothing factor of each input feature, as implemented in the NeuroShell 2 software package [24]. Only the input features with a correcting coefficient exceeding a pre-defined threshold value (0.5) were used to solve the problem. As there are obviously interconnections among the input features, and as the correction coefficients are determined using a genetic algorithm, the set of coefficients determined from a single launch of the algorithm is strongly influenced by random factors. Therefore, the procedure was applied recurrently several times, each new launch producing a narrower set of significant features. Each of the iteratively obtained sets was used to solve the problem. The dependence of MAE for T and S on the number of selected features is presented in Fig. 5.
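The underlying GRNN prediction with per-feature coefficients can be sketched as follows (our reading of Specht's formula; placing the coefficient inside the distance is an assumption about how NeuroShell 2 applies it):

```python
import math

def grnn_predict(x, train_x, train_y, sigma, coeffs):
    """General Regression NN (Specht, 1991) prediction with a per-feature
    correcting coefficient scaling each term of the distance: a feature
    with a coefficient near 0 barely contributes to the distance and is
    therefore effectively irrelevant to the prediction."""
    num, den = 0.0, 0.0
    for xi, yi in zip(train_x, train_y):
        d2 = sum((c * (a - b)) ** 2 for c, a, b in zip(coeffs, x, xi))
        w = math.exp(-d2 / (2.0 * sigma ** 2))
        num += w * yi
        den += w
    return num / den

train_x = [[0.0, 3.0], [1.0, -2.0], [2.0, 7.0]]
train_y = [10.0, 20.0, 30.0]
# A coefficient of 0.0 on the second feature removes it from the distance,
# so the wild value 100.0 below has no effect on the prediction.
coeffs = [1.0, 0.0]
pred = grnn_predict([1.0, 100.0], train_x, train_y, sigma=0.3, coeffs=coeffs)
print(pred)
```

With these inputs the query sits on the second training sample in the first feature, so the prediction is (very nearly) its target, 20.0.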
Fig. 5. Mean absolute error for T and S determination vs. number of features selected by GRNN
Fig. 6. Mean absolute error for T and S determination vs. binary logarithm of the number of features extracted by adjacent channel aggregation
4 Results
The best results obtained in this study for different methods of input data com-
pression are summarized in Table 1. The presented values are mean absolute
error on the out-of-sample set of data.
Table 1. Mean absolute error of problem solution for T and S on the out-of-sample
set of data for different methods of input data compression
5 Conclusion
This study was devoted to the comparison of various methods of feature selection and extraction for the NN solution of the inverse problem of determining seawater temperature and salinity from the valence band of the Raman spectrum, in the presence of fluorescence of dissolved organic matter in a wide range of concentrations.
The best results were obtained for feature extraction by aggregating every 16 adjacent spectral channels, producing 64 input features. This means that the practical spectral resolution required to solve the problem is as coarse as 32 cm−1, which can easily be achieved with inexpensive spectroscopy equipment.
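The winning channel-aggregation scheme is simple to sketch (an illustration, not the authors' code; the linear ramp stands in for a real spectrum):

```python
def aggregate_channels(spectrum, group=16):
    """Compress a spectrum by averaging each group of adjacent channels,
    e.g. 1024 channels -> 64 features with group=16, as in the paper's
    best-performing feature extraction scheme."""
    assert len(spectrum) % group == 0
    return [sum(spectrum[i:i + group]) / group
            for i in range(0, len(spectrum), group)]

spectrum = [float(i) for i in range(1024)]   # stand-in for a Raman spectrum
features = aggregate_channels(spectrum, 16)
print(len(features), features[0])            # → 64 7.5
```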
The obtained values of mean absolute error on the out-of-sample set of data
are 0.69±0.02 ◦ C and 0.76±0.04 psu, which are not much greater than the results
obtained by NN solution of the same problem with no dissolved organic matter.
References
1. Font, J., Camps, A., Borges, A., et al.: SMOS: The challenging measurement of sea surface salinity from space. Proc. IEEE 98(5), 649–665 (2010)
2. Turiel, A., Nieves, V., Garcia-Ladona, et al.: The multifractal structure of satellite
sea surface temperature maps can be used to obtain global maps of streamlines.
Ocean Sci. 5, 447–460 (2009)
3. Boutin, J., Waldteufel, P., Martin, N., et al.: Surface salinity retrieved from SMOS
measurements over the global ocean: Imprecisions due to sea surface roughness and
temperature uncertainties. J. Atmos. Ocean. Technol. 21, 1432–1447 (2004)
4. Eugenio, F., Marcello, J., Hernandez-Guerra, A., Rovaris, E.: Methodology to ob-
tain accurate sea surface temperature from locally received NOAA-14 data in the
Canary-Azores-Gibraltar area. Scientia Marina 65(1), 127–137 (2001)
5. Garcia-Santos, V., Valor, E., Caselles, V.: Determination of temperature by remote
sensing. J. of Mediterranean Meteorology and Climatology 7, 67–74 (2010)
6. Walrafen, G.E.: Raman Spectral Studies of Water Structure. J. Chem. Phys. 40,
3249–3256 (1964)
7. Walrafen, G.E.: Raman Spectral Studies of the Effects of Temperature on Water
and Electrolyte Solutions. J. Chem. Phys. 44, 1546–1558 (1966)
8. Walrafen, G.E.: Raman Spectral Studies of the Effects of Temperature on Water
Structure. J. Chem. Phys. 47, 114–126 (1967)
9. Chang, C.H., Young, L.A.: Seawater Temperature Measurement from Raman Spec-
tra. Avco Everett Research Laboratory, Inc., Interim technical report (1972)
10. Leonard, D., Chang, C., Yang, L.: Remote measurement of fluid temperature by
Raman scattered radiation. U.S. Patent 3.986.775, Class 356-75 (1974)
11. Leonard, D., Caputo, B., Hoge, F.: Remote sensing of subsurface water temperature
by Raman scattering. Applied Optics 18(11), 1732–1745 (1979)
12. Terpstra, P., Combes, D., Zwick, A.: Effect of salts on dynamics of water: A Raman
spectroscopy study. J. Chem. Phys. 92(1), 65–70 (1990)
13. Dolenko, T.A., Churina, I.V., Fadeev, V.V., Glushkov, S.M.: Valence band of liquid
water Raman scattering: some peculiarities and applications in the diagnostics of
water media. J. of Raman Spectroscopy 31(8-9), 863–870 (2000)
14. Sherer, J., Go, M., Kint, S.: Raman spectra and structure of water from 10 to 90.
J. Phys. Chem. 78(13), 1304–1313 (1974)
15. Burikov, S.A., Churina, I.V., Dolenko, S.A., et al.: New approaches to determina-
tion of temperature and salinity of seawater by laser Raman spectroscopy. In: 3rd
EARSeL Workshop on Remote Sensing of the Coastal Zone, pp. 298–305 (2003)
16. Karl, J., Ottmann, M., Hein, D.: Measuring water temperatures by means of linear
Raman spectroscopy. In: Proc. of the 9th International Symposium on Application
of Laser Techniques to Fluid Mechanics, vol. II, pp. 23.2.1–23.2.8 (1998)
17. Becucci, M., Cavalieri, S., Eramo, R., Fini, L., Materazzi, M.: Raman spectroscopy
for water temperature sensing. Laser Physics 9(1), 422–425 (1999)
18. Furić, K., Ciglenečki, I., Ćosović, B.: Raman spectroscopic study of sodium chloride
water solutions. J. Mol. Str., 550–551, 225–234 (2000)
19. Bekkiev, A., Gogolinskaya (Dolenko), T., Fadeev, V.: Simultaneous determination
of temperature and salinity of seawater by the method of laser Raman spectroscopy.
Soviet Physics Doklady 271(4), 849–853 (1983)
20. Shubina, D.M., Patsaeva, S.V., Yuzhakov, V.I., et al.: Fluorescence of organic mat-
ter dissolved in natural water. Water: Chemistry and Ecology 11, 31–37 (2009)
21. Gerdova, I.V., Churina, I.V., Dolenko, S.A., et al.: New Opportunities in Solution
of Inverse Problems in Laser Spectroscopy Due to Application of Artificial Neural
Networks. In: Proc. SPIE, vol. 4749, pp. 157–166 (2002)
22. Dolenko, T.A., Burikov, S.A., Sabirov, A.R., et al.: Remote determination of tem-
perature and salinity in consideration of dissolved organic matter in natural waters
using laser spectroscopy. In: EARSeL eProceedings, vol. 10(2), pp. 159–165 (2011)
23. Specht, D.: A General Regression Neural Network. IEEE Trans. on Neural Net-
works 2(6), 568–576 (1991)
24. NeuroShell 2, http://www.wardsystems.com/neuroshell2.asp
New Approach for Clustering Relational Data
Based on Relationship and Attribute
Information
Abstract. A wide range of the database systems in use today are based on the relational model. As a consequence, much of the information used by those systems is stored across multiple relational object types. However, most traditional machine learning algorithms were not originally designed to handle this type of data. Aiming at better ways of handling the relational particularities of the data, this paper proposes a new relational clustering method based on relationship and attribute information. In our method, attributes have weights associated with their importance between the object types. An empirical analysis is performed in order to evaluate the effectiveness of the proposed method, comparing it with two traditional methods for relational clustering. Three relational databases were used in the experiments.
1 Introduction
The vast majority of traditional clustering algorithms proposed in the literature are focused on flat data, also known as single-table data. In this type of data, information is organized into rows and columns, much like a spreadsheet. However, databases in the real world are much richer in structure, since they involve multi-type objects and their relationships.
Relational databases are powerful because they require few assumptions about how data is related or how it will be extracted from the database. As a result, the same database can be viewed in many different ways [5]. These characteristics have made them a popular choice for data storage, and, as a consequence, many different types of systems use relational databases. In a relational database, all data is stored in tables (object types). These have the same structure repeated in each row (like a spreadsheet), and also have links (foreign keys) that are used to establish the relations between them [11].
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 451–458, 2012.
© Springer-Verlag Berlin Heidelberg 2012
452 J.C. Xavier-Júnior et al.
2 Related Works
The main aim of clustering algorithms is to use attribute information to group objects (instances) that have similar attribute values. However, when dealing with relational data there are further types of information available that need to be used to distinguish groupings, such as relationship information. Clustering of relational data has been studied in several works [1], [9], [4], [8], [12].
In [1], the authors proposed a modelling approach that is capable of learning a clustering model from a relational database directly. In this approach, they presented a way of generalizing or mapping data with one-to-many relationships learned from a relational domain. The authors use DARA (Dynamic Aggregation of Relational Attributes) to convert the data from a relational model into a vector space model. DARA processes the relational data from tables, converting it into instances of binary values. In [9], the authors adapted graph-cutting algorithms to cluster only link attribute (relationship) information, only attribute information, or both by using a hybrid approach (a graph-partitioning algorithm with an attribute similarity metric). Their algorithm was only applied to synthetic datasets.
In [4], the authors proposed a two-stage clustering method for multi-type relational data, called TSMRC. To improve clustering quality, the authors proposed different similarity measures for the two stages. In TSMRC, only attribute values are considered when clustering the tables separately (first stage), and all relationships are considered during the second stage. The authors state that the method improves clustering efficiency and accuracy. However, it is not clear whether this method can cope with a considerably large number of tables, nor what the time cost of clustering them in two stages is.
In another work, [8], the authors proposed a probabilistic model for relational clustering, which also provides a principled framework to unify various important clustering tasks, including traditional attribute-based clustering, semi-
δc(ai, bi) = 0, if ai = bi;  1, if ai ≠ bi        (3)
The main aim of this method is to calculate and use weights for the interrelationships. In this case, we discard the attributes of the secondary table of a relational dataset. In other words, only the attributes of the main table are considered, along with its relational information. Thus, if two instances point to different instances in the secondary table, their similarity measure δ is 1, and it is then multiplied by the corresponding weight in Eq. 1.
Moreover, the weights can be considered a data-dependent parameter, and we use a simple method to calculate them in RCRAI. It is based on the similarity of each attribute of the secondary tables of the relationship and is given by the following equation.
wr = Σ_{i=1}^{m} MVi        (4)
The two-stage clustering algorithm for multi-type relational data (TSMRC) is a two-stage method for relational data. In the first stage of TSMRC, all object types are clustered separately by using both attribute and relationship information. In the second stage, all the resulting clusters of the first stage are merged according to their interrelationships.
The similarity used in the first stage is defined as follows:

SimAtt(Xij, Xik) = Σ_{r=1}^{p} |Xij^r − Xik^r| + λ Σ_{r=p+1}^{N} δ(Xij^r, Xik^r)        (5)

where Xij^r, Xik^r denote the r-th attribute of Xij and Xik, N is the number of attributes, p is the number of numeric attributes, and (N − p) is the number of categorical attributes. The function δ() is a difference function: if a and b are equal, δ(a, b) is 0; otherwise it is 1.
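This first-stage similarity can be sketched directly (our reading of Eq. 5; the absolute value on the numeric part and the toy attribute values are assumptions):

```python
def sim_att(x, y, p, lam=1.0):
    """First-stage TSMRC similarity as we read Eq. 5: absolute differences
    over the first p (numeric) attributes plus a lambda-weighted count of
    mismatches over the remaining (categorical) attributes."""
    numeric = sum(abs(a - b) for a, b in zip(x[:p], y[:p]))
    categorical = sum(1 for a, b in zip(x[p:], y[p:]) if a != b)
    return numeric + lam * categorical

# Two objects with 2 numeric and 2 categorical attributes.
x = [1.0, 2.0, "red", "small"]
y = [1.5, 2.0, "red", "large"]
s = sim_att(x, y, p=2, lam=0.5)
print(s)   # 0.5 numeric difference + 0.5 * 1 mismatch = 1.0
```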
In the second stage, the similarity between two object types is defined as follows:

Sim(Cip, Cjq) = [|Rinter(Cip, Cjq)| · |Xi| · |Xj|] / [|Cip| · |Cjq| · |Rinter(Xi, Xj)|], if i ≠ j;  0, if i = j        (6)

where Cip and Cjq are two clusters of Xi and Xj obtained in the first stage; |Cip| and |Cjq| are the numbers of objects in Cip and Cjq; |Rinter(Cip, Cjq)| is the number of interrelationships between Cip and Cjq; |Xi| and |Xj| are the numbers of objects in Xi and Xj; and |Rinter(Xi, Xj)| is the number of interrelationships between Xi and Xj.
The Agglomerative hierarchical clustering algorithm was used in both stages
according to the authors.
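The second-stage similarity can be sketched directly from Eq. 6 (an illustration with made-up counts, not the authors' implementation):

```python
def sim_clusters(r_inter_c, n_xi, n_xj, n_cip, n_cjq, r_inter_x, i, j):
    """Second-stage TSMRC similarity between clusters of two object types,
    as we read Eq. 6: 0 for clusters of the same type, otherwise the
    cluster-to-cluster interrelationship count normalized by the cluster
    sizes and the total number of links between the two object types."""
    if i == j:
        return 0.0
    return (r_inter_c * n_xi * n_xj) / (n_cip * n_cjq * r_inter_x)

# Toy example: 6 links between the two clusters; object types of 100 and
# 50 objects; clusters of 10 and 5 objects; 300 links between the types.
s = sim_clusters(r_inter_c=6, n_xi=100, n_xj=50, n_cip=10, n_cjq=5,
                 r_inter_x=300, i=0, j=1)
print(s)   # (6*100*50) / (10*5*300) = 2.0
```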
4 Experimental Setup
Finding available relational databases is not an easy task. For this reason, we used only three databases in our experiments. The first one is a movie database, which is widely used in experiments related to relational clustering methods. It is available at the UCI repository (http://archive.ics.uci.edu/ml/datasets/Movie). This database is typically relational, with the data divided into multiple tables (object types). It consists of five tables (Movie, Studio, Director, Actor and Casts). The original database has more than 10,000 movies. However, we selected only 500 movies divided into ten categories: action (50), adventure (50), comedy (50), crime (50), drama (50), fiction (50), war (50), musical (50), romance (50) and terror (50). For each movie, we chose only three actors who acted in it. Thus, the table Casts has a total of 1,500 instances. Finally, the 10 different categories represent the class attribute of the dataset.
The second one is the Nursery Database, derived from a hierarchical decision model originally developed to rank applications for nursery schools. It was used for several years in the 1980s, when there was excessive enrolment in these schools in Ljubljana, Slovenia, and rejected applications frequently needed an objective explanation. This database contains three tables (Employ, Finance and Health), which together comprise a total of 12,960 instances. The class attribute is composed of five unbalanced values. The database is available at the UCI repository (http://archive.ics.uci.edu/ml/datasets/Nursery).
The last one is the NatalGIS database, which stores accesses of the users that visualize geographic information provided by a system of the same name [13]. This database is composed of eight tables which store the features of the geographic information provided by the system. All available information belongs to a coral reef area known as Parrachos de Maracaja, located in the north-east of Brazil. Finally, the tables together comprise a total of 1,000 instances with no class attribute.
For measuring the results of a clustering algorithm, several validity indices have been proposed in the field of data mining [7]. These indices are used to measure the "goodness" of a clustering result by comparing it to results obtained by other clustering algorithms, or by the same algorithm with different parameters. In this paper, we use three validity indices for measuring the clustering results: two internal indices, Davies-Bouldin (DB) [2] and Silhouette [10], and one external index, the Adjusted Rand index [6].
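Of the three, the Adjusted Rand index is straightforward to compute from the pair-counting contingency table (a self-contained sketch of the Hubert–Arabie formula, not the evaluation code used in the paper):

```python
from collections import Counter
from math import comb

def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand index (Hubert & Arabie, 1985) between two partitions:
    1.0 means identical partitions, values near 0.0 are chance level."""
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2.0
    return (sum_ij - expected) / (max_index - expected)

truth = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 1, 0, 0, 0]            # same partition, merely relabeled
print(adjusted_rand(truth, pred))    # → 1.0
```

Because the index is corrected for chance, relabeling the clusters (as above) does not change its value, which is why it suits comparisons against known class attributes.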
For experimental purposes, we propose three different scenarios. Each scenario represents a method for clustering relational data. The main purpose of scenario 1 (very common in the literature) is to establish a baseline approach for comparison. In this scenario, a dataset containing all the attributes of all tables is created. However, only numeric and categorical attributes are considered. We use the agglomerative hierarchical clustering algorithm as the base clustering method.
In scenario 2, we use the method for clustering relational data proposed in [4]. The main idea of this method is to cluster the instances of the relational tables separately and then merge them according to their relationships. Again, we used the agglomerative hierarchical clustering algorithm.
In the third scenario, we use the new method for clustering relational data called Relational Clustering based on Relationship and Attribute Information (RCRAI). In our method, we use the similarity measure of Eq. 1 to compute the values of ordinary attributes (categorical and numeric) and relationships. Both attributes and interrelationships receive weights that directly influence the distance between two instances. Again, in order to perform a fair evaluation, we use the agglomerative hierarchical clustering algorithm to cluster the datasets.
5 Experimental Results
Table 2 presents the experimental results for the Movie dataset. Unlike the previous dataset, scenario 2 obtained the best result for the DB index, while scenario 3 obtained the best results for the Silhouette and Rand indices. However, the results were not statistically significant in any analysed case.
Table 3 presents the experimental results for the Nursery dataset. Note that scenario 2 obtained the best result for the DB and Silhouette indices, while scenario 3 obtained the best result for the Rand index. Moreover, the results obtained in scenario 3 as measured by the Rand index were statistically significant. This is an important result, since it indicates that our method clustered the Nursery dataset well according to the class attribute of the dataset (Rand index).
6 Conclusion
The main contribution of this paper is a new clustering method able to handle the particularities of real-world relational databases. Our method is very simple: it computes the distances between instances based on ordinary (numeric and categorical) attributes and relationship attributes. Based on the experimental results obtained for scenarios 1, 2 and 3, we can conclude that the proposed method
References
1. Alfred, R., Kazakov, D.: Clustering approach to generalized pattern identification
based on multi-instanced objects with DARA. In: ADBIS Research Communications,
pp. 38–49 (2007)
2. Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Transactions on
Pattern Analysis and Machine Intelligence 1(2), 224–227 (1979)
3. Fisher, R.: Statistical methods for research workers. The Eugenics Review 18(2),
148–150 (1926)
4. Gao, Y., Liu, D.Y., Sun, C.M., Liu, H.: A two-stage clustering algorithm for multi-
type relational data. In: Proceedings of the 2008 Ninth ACIS International Con-
ference on Software Engineering, Artificial Intelligence, Networking, and Paral-
lel/Distributed Computing, pp. 376–380. IEEE Computer Society, Washington,
DC (2008)
5. Harrington, J.L.: Relational Database Design and Implementation: Clearly Ex-
plained. Morgan Kaufmann (2009)
6. Hubert, L., Arabie, P.: Comparing partitions. Journal of Classification 2(1),
193–218 (1985)
7. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: A review. ACM Comput.
Surv. 31(3), 264–323 (1999)
8. Long, B., Zhang, Z., Yu, P.S.: A probabilistic framework for relational clustering.
In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 470–479 (2007)
9. Neville, J., Adler, M., Jensen, D.: Clustering relational data using attribute and
link information. In: Proceedings of the Text Mining and Link Analysis Workshop
(2003)
10. Rousseeuw, P.: Silhouettes: a graphical aid to the interpretation and validation of
cluster analysis. J. Comput. Appl. Math. 20(1), 53–65 (1987)
11. Seltzer, M.I.: Beyond relational databases. ACM Queue 3(3), 50–58 (2005)
12. Xavier-Júnior, J.C., Canuto, A., Freitas, A., Gonçalves, L., Silla-Jr., C.: A hi-
erarchical approach to represent relational data applied to clustering tasks. In:
International Joint Conference on Neural Networks, pp. 3055–3062. IEEE Press
(2011)
13. Xavier-Junior, J.C., Signoretti, A., Canuto, A.M.P., Campos, A.M., Gonçalves,
L.M.G., Fialho, S.V.: Introducing Affective Agents in Recommendation Systems
Based on Relational Data Clustering. In: Hameurlain, A., Liddle, S.W., Schewe, K.-
D., Zhou, X. (eds.) DEXA 2011, Part II. LNCS, vol. 6861, pp. 303–310. Springer,
Heidelberg (2011)
14. Yin, X., Han, J., Yu, P.S.: Cross-relational clustering with user’s guidance. In: Pro-
ceedings of the Eleventh ACM SIGKDD International Conference on Knowledge
Discovery in Data Mining, KDD 2005, pp. 344–353. ACM, New York (2005)
Comparative Study on Information Theoretic
Clustering and Classical Clustering Algorithms
1 Introduction
There are many fields in which clustering techniques can be applied, such as marketing, biology, pattern recognition, image segmentation and text processing. Clustering algorithms attempt to organize unlabeled data points into clusters in such a way that samples within a cluster are "more similar" than samples in different clusters [4]. To achieve this task, several algorithms have been developed using different heuristics. It is known that spatial distribution is a problematic issue in clustering tasks, since most of the algorithms have some bias towards a specific cluster shape. For example, single-linkage hierarchical algorithms are sensitive to noise and outliers, tending to produce elongated clusters, while k-means tends to produce elliptical clusters.
The incorporation of the spatial statistics of the data gives a good measure of the spatial distribution of the objects in a dataset. One way of doing this is to use information-theoretic (IT) elements in the clustering process. In fact, Information Theory involves the quantification of information in a dataset using statistical measures. Recently, [11,10,14] achieved good results using elements of information theory to aid clustering tasks.
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 459–466, 2012.
© Springer-Verlag Berlin Heidelberg 2012
460 D. Araújo, A.D. Neto, and A. Martins
2 Clustering Analysis
Clustering analysis is an unsupervised way of grouping data using a specific similarity measure. Clustering algorithms try to organize unlabeled feature vectors into natural agglomerates in such a way that objects inside a cluster are more similar than objects in different clusters. As mentioned before, there are some issues related to traditional clustering analysis, where only the data itself is considered in the process. Information Theoretic Clustering appeared as a good alternative that uses the underlying statistics of the data for clustering analysis.
2.2 Preprocessing
Machine learning techniques, such as clustering algorithms, may not work correctly with high-dimensional data. In general, the efficiency and accuracy of these techniques degrade as the dimensionality of the data increases. Besides, features that are related to special properties of the data often lie in a subspace of lower dimension within the original data [6].
In the context of this work, we want to cluster gene expression data, which is characterized by few samples (only a few dozen), each of very high dimension (up to tens of thousands). This type of data makes the use of the proposed clustering unfeasible, both because it requires a large runtime and because it has uninformative attributes that disturb the clustering process.
In [1], a comparative study is made between dimension reduction techniques in the context of complex datasets (gene expression). Since we needed a tool to reduce the original dimension of our real datasets, we chose the best method identified in that paper: t-Distributed Stochastic Neighbor Embedding (t-SNE) [9]. The t-SNE method consists of a nonlinear mapping of the original space to a feature space and can perform feature extraction very efficiently, helping the cluster analysis by reducing the runtime of the algorithms and increasing the accuracy of the solutions created.
Dataset k n
boat 3 100
easy doughnut [7] 2 100
four gaussians [7] 4 100
half rings [7] 2 400
petals 4 100
spirals [7] 2 200
4 Experimental Results
In this section, we show the results obtained with the datasets described in Sec. 3. All results are expressed as values of the corrected Rand index (CR) with respect to the already known labels for each object of the data. In order to contextualize the analysis, the results are separated into two different sections. The CR values of the partitions created using each of the clustering methods are displayed in tables so that we can compare their performances.
Table 3 shows the results of the tested algorithms: k-means (KM), hierarchical single linkage (SL), finite mixture of Gaussians (FMG) and the IT learning algorithm (IT) for all synthetic datasets. For the latter algorithm, we also show the number of auxiliary regions related to the presented partition. These techniques were chosen in order to show the performance of algorithms with different clustering criteria.
Based on this, first we can notice that KM achieved good results in only a few datasets (four gaussians, petals). These sets have the same characteristics, i.e., all groups are well separated and the points are clustered around a common center. This behavior is predictable, because KM only works correctly
Dataset KM SL FMG IT
boat 0.38 0.00 0.52 0.71(35)
easy doughnut 0.24 1.00 1.00 1.00(07)
four gaussians 0.95 0.69 0.97 0.97(05)
half rings 0.24 0.03 0.66 1.00(10)
petals 1.00 0.38 0.89 1.00(04)
spirals 0.01 1.00 0.06 1.00(25)
when clusters are elliptical. This happens due to the criterion it minimizes (mean square error), which favors groups with little dispersion. In the other datasets, where the dispersion is greater or the separation between groups is not so obvious, KM did not deliver good results.
In practice, for datasets with clusters which cannot be linearly separated, the use of k-means is inappropriate. Despite its low computational complexity, this technique cannot separate the groups when they are, for example, inserted one inside the other (easy doughnut and spirals). This happens even when the separation between the groups is clear.
The opposite result occurs for SL, which, by its local nature, only achieves good results when the groups are completely apart and the points are very close to each other, favoring partitions with elongated clusters (easy doughnut and spirals). This type of technique is very sensitive to noise and outliers, so if points of different groups are close, the probability of these clusters becoming one is very high.
The finite mixture of Gaussians has shown good results in different contexts. For the datasets used in this work, this technique achieved good results where KM and SL failed. On the other hand, for datasets such as boat and spirals its performance was very low. In the case of spirals its result was even worse than that of SL (the algorithm with the worst overall performance), which obtained the maximum value of CR there. For such datasets it is natural that FMG cannot achieve good results, since its clustering criterion favors partitions containing groups with an elliptical shape.
The IT algorithm has achieved excellent results (CR close to one) for the
most datasets. Both for data that has low spatial complexity (four gaussians
and petals) and for those with complex spatial distribution, i.e., easy donut,
half rings and spirals. Those datasets have a data distribution that prevents
separation by minimizing the mean square error, thus disabling KM to achieve
good results.
Even for the dataset boat, which presents a mix of difficulties (internal clusters,
distinct dispersions and different cluster sizes), the IT algorithm delivered
results far superior to those of the other algorithms.
It is important to notice that each classical algorithm was able to cluster some
dataset satisfactorily; however, none achieved good results for the majority
of the datasets. The IT algorithm, in contrast, achieved a good overall performance,
reaching good results for the majority of the datasets.
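The CR values reported in the tables are corrected Rand scores. Assuming CR denotes the standard corrected (adjusted) Rand index, a minimal numpy implementation could look like this:

```python
import numpy as np

def corrected_rand(a, b):
    """Corrected (adjusted) Rand index between two labelings.

    1.0 means identical partitions; values near 0 are expected for
    chance agreement, and the index can go negative.
    """
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    ca = np.unique(a, return_inverse=True)[1]
    cb = np.unique(b, return_inverse=True)[1]
    # contingency table of the two partitions
    table = np.zeros((ca.max() + 1, cb.max() + 1), int)
    np.add.at(table, (ca, cb), 1)
    comb = lambda x: x * (x - 1) / 2.0        # pairs within a count
    sum_ij = comb(table).sum()
    sum_a = comb(table.sum(1)).sum()
    sum_b = comb(table.sum(0)).sum()
    expected = sum_a * sum_b / comb(n)
    max_index = (sum_a + sum_b) / 2.0
    return (sum_ij - expected) / (max_index - expected)
```

For example, two labelings that induce the same partition under swapped label names score 1.0.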
Comparative Study on IT Clustering and Classical Clustering Algorithms 465
For the real datasets, we chose two gene expression datasets. We followed
basically the same procedure as for the artificial data. Additionally, due to the large
size of the gene expression data, a dimensionality reduction step was added as
preprocessing: we reduced all data to two dimensions.
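The text does not specify here which reduction method was applied (alternatives are compared in [1,9]). As one hypothetical instance, a PCA projection to two dimensions via SVD:

```python
import numpy as np

# Hypothetical PCA-based reduction to two dimensions; not necessarily the
# method used by the authors.
def reduce_to_2d(X):
    Xc = X - X.mean(axis=0)                     # center the samples (rows)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                        # project on top-2 components

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))                   # stand-in for expression data
Y = reduce_to_2d(X)
print(Y.shape)
```

The projected columns are centered and ordered by explained variance, as expected from the SVD.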
Table 4. CR values for the gene expression datasets (for IT, the number of
auxiliary regions is given in parentheses)

Dataset    KM    SL    FMG   IT
chowdary   0.45  0.07  0.59  0.78 (06)
golub      0.02  0.01  0.34  0.65 (05)
From the results presented in Table 4 we can notice that the performance of
the IT algorithm is far superior to that of the classical clustering algorithms. Among
the latter, FMG has the best results.
It is worth noting that the results for Chowdary are, in general, better than
those for Golub. That difference is expected, since the first dataset contains the gene
expression of samples extracted from two different types of cancer, while the
second contains the expression of three subtypes of the same cancer.
5 Conclusions
References
1. Araujo, D., Dória Neto, A., Martins, A., Melo, J.: Comparative study on dimension
reduction techniques for cluster analysis of microarray data. In: The 2011 Inter-
national Joint Conference on Neural Networks (IJCNN), pp. 1835–1842 (August
2011)
2. de Araújo, D., Neto, A.D., Melo, J., Martins, A.: Clustering Using Elements of
Information Theory. In: Diamantaras, K., Duch, W., Iliadis, L.S. (eds.) ICANN
2010, Part III. LNCS, vol. 6354, pp. 397–406. Springer, Heidelberg (2010)
3. Chowdary, D., Lathrop, J., Skelton, J., Curtin, K., Briggs, T., Zhang, Y., Yu, J.,
Wang, Y., Mazumder, A.: Prognostic gene expression signatures can be measured
in tissues collected in RNAlater preservative. J. Mol. Diagn. 8(1), 31–39 (2006)
4. Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley (2001)
5. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P.,
Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander,
E.S.: Molecular classification of cancer: class discovery and class prediction by gene
expression monitoring. Science 286(5439), 531–537 (1999)
6. Jain, A.K., Dubes, R.C.: Algorithms for clustering data. Prentice-Hall, Inc., Upper
Saddle River (1988)
7. Kuncheva, L., Hadjitodorov, S., Todorova, L.: Experimental comparison of cluster
ensemble methods. In: 2006 9th International Conference on Information Fusion,
pp. 1–7 (July 2006)
8. Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms. Wiley-
Interscience (2004)
9. van der Maaten, L.J.P., Postma, E.O., van den Herik, H.J.: Dimensionality Reduc-
tion: A Comparative Review (2007),
http://www.cs.unimaas.nl/l.vandermaaten/dr/DR_draft.pdf
10. Martins, A.M., Dória Neto, A., Costa, J.D., Costa, J.A.F.: Clustering using neural
networks and Kullback-Leibler divergence. In: Proc. of IEEE International Joint
Conference on Neural Networks, vol. 4, pp. 2813–2817.
11. Principe, J.C.: Information theoretic learning, ch. 7. John Wiley (2000)
12. Príncipe, J.: Information Theoretic Learning: Renyi's Entropy and Kernel Perspec-
tives. Information Science and Statistics. Springer (2010)
13. Principe, J.C., Xu, D.: Information-theoretic learning using Renyi's quadratic en-
tropy. In: Proceedings of the First International Workshop on Independent Com-
ponent Analysis and Signal Separation, Aussois, pp. 407–412 (1999)
14. Rao, S., de Medeiros Martins, A., Príncipe, J.C.: Mean shift: An information the-
oretic perspective. Pattern Recogn. Lett. 30(3), 222–230 (2009)
Text Mining for Wellbeing: Selecting Stories
Using Semantic and Pragmatic Features
1 Introduction
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 467–474, 2012.
c Springer-Verlag Berlin Heidelberg 2012
468 T. Honkela, Z. Izzatdust, and K. Lagus
Fig. 1. The basic architecture of a system that conducts text mining in order to find
stories that can support users’ wellbeing
Fig. 1 also provides a wider system context for the present work. The content
analysis can be divided into two main areas, i.e., semantic and pragmatic
analysis. Linguistic semantics is a research area where computational modeling
has traditionally taken place in the framework of symbolic logic. However, adaptive
and statistical methods are increasingly popular, and there are numerous
approaches based on neural networks and statistical machine learning. Classical
examples include latent semantic analysis [5] and self-organizing semantic
maps [14]. In this work, we apply independent component analysis (ICA) in the
semantic analysis [2,8]. This approach is described in detail in the next section.
Whereas semantics predominantly focuses on prototypical meaning, pragmatics
is concerned with the communicative, contextual and subjective aspects of meaning
[7]. From a computational point of view, research on prototypical semantics
is far more common than work on pragmatics, mainly due to the
efforts invested in knowledge representation and semantic web research. However,
there is increasing evidence that the area of computational pragmatics
is gaining ground. Research on the detection of antisocial behavior from texts [13]
and on modeling the context of communication [3] can be mentioned as examples.
In the context of the present work, the analysis of sentiments (see e.g. [15]) and style
(see e.g. [12]) is of particular interest.
2 Methods
Here we describe in brief the methods that we later apply for extracting wellbeing-
related patterns and features from discussion forum stories. The components of
the analysis process, i.e., topic analysis with independent component analysis,
style analysis and sentiment analysis, are explained in the following sections.
SentiStrength [15] estimates the strength of positive and negative sentiment in short texts, even for
informal language. SentiStrength provides values both for positive and negative
sentiments, with scales from -1 (not negative) to -5 (extremely negative), and
from 1 (not positive) to 5 (extremely positive). This means that a document can
at the same time show both positive and negative sentiments which provides us
with more useful information, compared to the regular approach in which only
the polarity of a text is determined.
SentiStrength is a lexicon-based classifier that uses negating words, emoticons,
spelling correction, punctuation and other kinds of linguistic information in an
attempt to achieve high precision in detecting sentiments [15].
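As a rough sketch of the lexicon-based idea (the tiny lexicon and negation list below are invented for this example; SentiStrength's actual resources and rules are far richer), a dual-scale scorer might look like:

```python
# Illustrative dual-scale scorer in the spirit of SentiStrength: it returns
# a (positive, negative) pair instead of a single polarity label.
LEXICON = {"love": 3, "good": 2, "happy": 3, "hate": -4, "sad": -3, "bad": -2}
NEGATIONS = {"not", "never", "no"}

def senti_scores(text):
    pos, neg = 1, -1                   # neutral baseline on both scales
    tokens = text.lower().split()
    for i, tok in enumerate(tokens):
        score = LEXICON.get(tok.strip(".,!?"), 0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            score = -score             # crude negation handling
        if score > 0:
            pos = max(pos, score)
        elif score < 0:
            neg = min(neg, score)
    return pos, neg

print(senti_scores("I love this but the ending was sad"))
```

A sentence can thus score high on both scales at once, which is exactly the extra information the text describes.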
3 Experiments
The vocabulary was manually selected to only cover words that are related to
the theme of wellbeing. The full list is too long to be included here but it can be
found at http://research.ics.tkk.fi/cog/data/icann12sp/wordlist.txt.
We used the FastICA Matlab package to extract a prespecified number of 20
features. In considering the feature distributions, it is good to keep in mind that
the sign of the features is arbitrary: the components can be multiplied by −1
without affecting the model [10,8].
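This sign indeterminacy can be verified numerically: flipping the sign of one source and of the matching mixing column leaves the observed data, and hence the ICA model, unchanged.

```python
import numpy as np

# Sign indeterminacy of ICA: for mixed data X = A @ S, negating one source
# row together with the corresponding mixing column reproduces X exactly.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))            # mixing matrix
S = rng.laplace(size=(3, 100))         # super-Gaussian sources
X = A @ S

A2, S2 = A.copy(), S.copy()
A2[:, 1] *= -1                         # flip one mixing column...
S2[1, :] *= -1                         # ...and the matching source row
assert np.allclose(X, A2 @ S2)         # the data is unchanged
```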
Examples of the ICA results on the Reddit data are shown in Fig. 2 and
Fig. 3. The upper row of diagrams in Fig. 2 shows how the words “anxiety”, “stress”,
“nausea”, and “relief” are associated with the same emergent feature. Similarly,
when the lower row in the figure is considered, it is clear that the words “class”,
“book”, “exam” and “professor” share a feature. In each case, the representation
of a word is quite sparse, i.e., each word is mainly represented by
one or two distinguishable features.
Fig. 2. Upper row: Anxiety, stress, nausea and relief, lower row: Class, book, exam and
professor
Another set of emergent features obtained with ICA is presented
in Fig. 3. The analysis of the words “problem”, “worry”, “health” and “motivation”
gives rise to a rich representation: these words are clearly associated
with several emergent features.
Fig. 3. Emergent ICA features for the words “problem”, “worry”, “health” and
“motivation”
direction of negative sentiments. The results are shown in more detail when they
are presented in relation to the results gained with other methods.
The style analysis approach we chose consists of determining the ratio of
obscene language to normal language used in the stories. The method simply
checks the texts against a dictionary of swear words and then normalizes the
results by the length of the corresponding stories.
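A minimal sketch of this style feature (the word list here is a placeholder; the actual dictionary is not given in the paper):

```python
# Ratio of swear words to all words, normalized by story length.
# SWEAR_WORDS stands in for the real dictionary used by the authors.
SWEAR_WORDS = {"damn", "hell"}         # placeholder entries

def obscenity_ratio(story):
    tokens = [t.strip(".,!?").lower() for t in story.split()]
    if not tokens:
        return 0.0
    return sum(t in SWEAR_WORDS for t in tokens) / len(tokens)

print(obscenity_ratio("well damn that went to hell fast"))
```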
References
1. Agarwal, A., Bhattacharyya, P.: Sentiment analysis: A new approach for effective
use of linguistic knowledge and exploiting similarities in a set of documents to be
classified. In: Proc. of the Int. Conf. on NLP (2005)
2. Bingham, E., Kuusisto, J., Lagus, K.: ICA and SOM in text document analysis. In:
Proceedings of the 25th ACM SIGIR Conference, pp. 361–362. ACM, New York
(2002)
3. Bleys, J., Loetzsch, M., Spranger, M., Steels, L.: The grounded color naming game.
In: Proceedings of the 18th IEEE International Symposium on Robot and Human
Interactive Communication (2009)
4. Comon, P.: Independent component analysis—a new concept? Signal Processing 36,
287–314 (1994)
5. Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A.:
Indexing by latent semantic analysis. Journal of the American Society of Informa-
tion Science 41, 391–407 (1990)
6. Devitt, A., Ahmad, K.: Sentiment analysis in financial news: A cohesion-based
approach. In: Proceedings of the Association for Computational Linguistics (ACL),
pp. 984–991 (2007)
7. Givón, T.: Mind, code, and context: essays in pragmatics. Lawrence Erlbaum As-
sociates (1989)
8. Honkela, T., Hyvärinen, A., Väyrynen, J.: WordICA - Emergence of linguistic
representations for words by independent component analysis. Natural Language
Engineering 16(3), 277–308 (2010)
9. Hurst, M., Nigam, K.: Retrieving topical sentiments from online document collec-
tions. In: Document Recognition and Retrieval XI, pp. 27–34 (2004)
10. Hyvärinen, A., Karhunen, J., Oja, E.: Independent component analysis, vol. 26.
Wiley (2001)
11. Jutten, C., Hérault, J.: Blind separation of sources, part I: An adaptive algorithm
based on neuromimetic architecture. Signal Processing 24, 1–10 (1991)
12. Karlgren, J.: Textual stylistic variation: Choices, genres and individuals. In: Struc-
ture of Style, pp. 129–142. Springer (2010)
13. Munezero, M., Kakkonen, T., Montero, C.: Towards automatic detection of antiso-
cial behavior from texts. In: Proceedings of the Workshop on Sentiment Analysis
where AI meets Psychology (SAAIP 2011), pp. 20–27 (November 2011)
14. Ritter, H., Kohonen, T.: Self-organizing semantic maps. Biological Cybernet-
ics 61(4), 241–254 (1989)
15. Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., Kappas, A.: Sentiment strength
detection in short informal text. Journal of the American Society for Information
Science and Technology 61(12), 2544–2558 (2010)
Hybrid Bilinear and Trilinear Models
for Exploratory Analysis of Three-Way
Poisson Counts
1 Introduction
As a generic task, the analysis of counts in relation to two categorical variables,
also known as factors, ways or modes, is encountered in a vast number of scientific
studies and engineering applications. In order to make the problem setting and
evaluation of the present work more accessible, we concentrate on a concrete
example from text analysis, where the counts of selected words in a set of documents
are represented as a term-document matrix, with words indexed as rows and
documents as columns. Common analyses of this representation include relating
the documents to each other by the counts of word occurrences in documents,
or studying the relation of words by their co-occurrences in documents.
As such, this comprises an example of 2-way data analysis.
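The construction can be sketched in a few lines (toy documents; variable names are ours):

```python
from collections import Counter

# Building a term-document count matrix: rows index words, columns index
# documents, and each cell stores the count of that word in that document.
docs = ["the cat sat", "the dog sat on the cat", "dogs and cats"]
vocab = sorted({w for d in docs for w in d.split()})
row = {w: i for i, w in enumerate(vocab)}

matrix = [[0] * len(docs) for _ in vocab]
for j, d in enumerate(docs):
    for w, c in Counter(d.split()).items():
        matrix[row[w]][j] = c

print(matrix[row["the"]])   # counts of "the" in each document
```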
It may be of interest to additionally study how the counts of the term-
document matrix vary according to some other factor, such as the author of
the document. In fact, if the variation between documents according to the author
is included in the analysis, one may be able to attribute some of the variation
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 475–482, 2012.
476 J. Raitio, T. Raiko, and T. Honkela
in the word counts to the language use of the author, and consequently, give
more accurate inference on relations between words (or documents) in general.
The term-document data augmented with information about the authors has a
representation as a 3-way array. 3-way arrays, or multi-way data in general,
can be studied for example using methods of tensor data analysis. For a generic
introduction to the topic, see e.g. [13].
The present work is originally motivated by the finding that there can be sub-
stantial individual variation in how natural language expressions are used and
interpreted. In [6], a method called Grounded Intersubjective Concept Analysis
(GICA) has been introduced. The essence of the GICA method is to model in-
dividual variation in using natural language expressions and for this purpose, a
3-way analysis of Subject-Object-Context (SOC) tensors is needed. The analy-
sis of such tensors may reveal individual differences in style but, more impor-
tantly, indicate subjectivity in modeling the relationship between language and
the world. If this kind of subjectivity remains unrecognized, various kinds of
problems related to communication may arise.
A more specific motivation for the present work stems from the fact that in [6]
the analysis of Subject-Object-Context tensors was conducted by flattening the
3-way arrays to 2-way matrices. These matrices can then be straightforwardly
analyzed using traditional data analysis methods such as PCA, SVD or ICA.
Each direction of flattening introduces a point of view and may, as such, provide
important insights into the data when analyzed. However, flattening the
original data appears to be a useful but possibly inadequate approach. It appears
necessary to devise a methodology that makes it possible to analyze all the
relationships without first determining which modes of the array are in focus. As
discussed above, traditional term-document matrices are formed by counting the
number of instances of each term in each document and by storing this count in
the element that corresponds to the row associated with the term in question and
the column associated with the particular document. The GICA data is formed
following the same basic principle, but adding a third mode which is used to
include all the subjects in the analysis. Moreover, rather than
considering frequency counts over whole documents, the counts typically concern
a context window of a given length.
One might think that subjectivity of language would be the exception rather
than the rule, since semantics appears to be well defined through thesauri, on-
tologies, and other knowledge representations. However, as natural language is
immersed in ambiguity, there is also a great amount of subjectivity and con-
textuality involved. A more detailed account of this matter is provided in [6].
Here it may be sufficient to refer to two examples. For the basic color terms,
there seems to be a high degree of intersubjective agreement. Around the idea
of prototypical red, green or blue there is not much subjective variation even
though a particular context may shift the evaluation, like in the case of phrases
“red skin” or “red wine” [3]. However, a lot more subjective variation is to be
expected if less typical color names are considered, such as “purple”, “khaki”
or “orchid”. An even more convincing example is when abstract words are
2 Proposed Model
The count $x_{ijk}$, indexed by the levels $i$, $j$ and $k$ in the ranges
$\{1, 2, \ldots, I\}$, $\{1, 2, \ldots, J\}$ and $\{1, 2, \ldots, K\}$ of the three
modes under consideration, is modeled as Poisson distributed,
$x_{ijk} \sim \mathrm{Poisson}(\lambda_{ijk})$.
This specifies a model class that predicts the logarithm of the Poisson mean
count by a specially structured trilinear model consisting of
– bias parameters $a^{(0)}$, $b^{(0)}$, and $c^{(0)}$ for capturing the mean of each mode,
– all combinations of the bilinear factorizations with parameters $a^{(q)}_{i:}$, $b^{(q)}_{j:}$ and $c^{(q)}_{k:}$, $q = 1, 2$, for capturing interactions between modes,
– the trilinear factorization or the PARAFAC model [4] with parameters $a^{(3)}_{i:}$, $b^{(3)}_{j:}$ and $c^{(3)}_{k:}$ for capturing 3-way interactions between modes, and
– hyperparameters $h_1$, $h_2$, $h_3$ and $h_4$ for adjusting the model complexity,
where the subscript “:” is used to denote all values of the index of summation
m within a factorization.
Without loss of generality, we assume that the vectors $a^{(q)}$, $b^{(q)}$, and $c^{(q)}$,
$q = 1, 2, 3$, are zero-mean in the sense that $\sum_i a^{(q)}_{im} = 0$,
$\sum_j b^{(q)}_{jm} = 0$ and $\sum_k c^{(q)}_{km} = 0$
for all m (see Appendix for proof). These parameter vectors are also known as
loadings.
The proposed model class can be interpreted as statistical multiple regression
models, where a Poisson distributed count is regressed on three categorical (factorial)
independents. The dimension of the parameter space is the number of
parameters in a specific model, $I + J + K + h_1(I + J) + h_2(J + K) + h_3(I + K) +
h_4(I + J + K)$. In the special case of $h_1 = h_2 = h_3 = h_4 = 0$ our specification
is linear and equals that of a Generalized Linear Model [12] with logarithmic,
canonical link function for Poisson distributed data. Our model is, however,
nonlinear in parameters in its general form.
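Under one reading of the hyperparameters (h1 counting i–j components, h2 counting j–k, h3 counting i–k, h4 counting trilinear components, which matches the parameter count above; this mapping is our assumption), the structured log-mean can be sketched with numpy:

```python
import numpy as np

# Sketch of the structured predictor: the log of the Poisson mean for every
# cell (i, j, k) as a sum of biases, three bilinear (pairwise) factorizations
# and a trilinear PARAFAC term. Shapes and values here are toy data.
I, J, K = 4, 5, 6
h1, h2, h3, h4 = 2, 1, 1, 2
rng = np.random.default_rng(0)
a0, b0, c0 = rng.normal(size=I), rng.normal(size=J), rng.normal(size=K)
a1, b1 = rng.normal(size=(I, h1)), rng.normal(size=(J, h1))   # i-j interactions
b2, c2 = rng.normal(size=(J, h2)), rng.normal(size=(K, h2))   # j-k interactions
a3, c3 = rng.normal(size=(I, h3)), rng.normal(size=(K, h3))   # i-k interactions
a4 = rng.normal(size=(I, h4))                                 # trilinear term
b4, c4 = rng.normal(size=(J, h4)), rng.normal(size=(K, h4))

log_mean = (a0[:, None, None] + b0[None, :, None] + c0[None, None, :]
            + np.einsum('im,jm->ij', a1, b1)[:, :, None]
            + np.einsum('jm,km->jk', b2, c2)[None, :, :]
            + np.einsum('im,km->ik', a3, c3)[:, None, :]
            + np.einsum('im,jm,km->ijk', a4, b4, c4))
rate = np.exp(log_mean)        # Poisson mean for each count x_ijk
print(rate.shape)
```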
2.1 Motivation: Exploratory Analysis
The reason we propose a combination of bilinear and trilinear terms, instead of
only the trilinear part, is the exploratory analysis of the results: we wish each
phenomenon in the data to be modeled with as simple terms as possible. Since
the trilinear part is often the most interesting but also the most difficult one to
analyze, we hope to clarify it by separating out the more trivial phenomena.
It is easy to see that the trilinear term could emulate the other terms by using
constant loadings of 1 for the parameter vectors $a$, $b$ or $c$. However, since we
introduce the zero-mean constraint, we force the simpler terms to be used, too.
In the GICA context, the interpretation of the terms in Equation (2) is as
follows. I is the number of people (or subjects), J is the number of terms (or
objects) and K is the number of contexts. The biases describe how much text we have
from each subject, and how common each term and each context is. The first
is comparable to collaborative filtering. The second bilinear term is about how
terms are used in contexts (or documents). This part is comparable to latent
Dirichlet allocation. The third bilinear term models how common particular
contexts are for different people, again comparable to collaborative filtering.
The trilinear term can model the subjectivity of context to the use of terms.
The gradient for fitting the parameters is further derived using the chain rule.
Finding maximum likelihood estimates is subject to the zero-mean constraints
of the parameter vectors.
The proposed algorithm for model selection is as follows. First, we set the hyperparameters
$h_1 = h_2 = h_3 = h_4 = 0$ to estimate the biases $a^{(0)}_i$, $b^{(0)}_j$, $c^{(0)}_k$. Then
model complexity is increased by incrementing the hyperparameters one at a
time and thus introducing new components into the model. The new parameters
are fitted while keeping the old ones fixed.
To avoid overfitting, proper hyperparameter values are determined by cross-
validation [14], i.e., by splitting the tensor elements randomly into a number
of equal-sized partitions and then, in turn, holding out each partition from the
parameter estimation as a validation set. We stop increasing each hyperparameter
whenever the probability of the validation set, that is, its evidence for the model,
stops increasing significantly. In cross-validation we compare the distribution of
changes in the model evidences of the validation sets before and after
adding new parameters. We apply a non-parametric test (Wilcoxon signed-rank)
to compare the significance level of the increase to a critical value.
After determining the hyperparameters, thus fixing the model complexity, the
model parameters are estimated without holding out any data, and at the end,
the whole model is fine-tuned by estimating all the parameters simultaneously.
3 Simulation Experiment
Table 1. A (left): counts per hyperparameter value; B (right): identification error rates

A:
k      h1    h2    h3    h4    tot.
0      35    33    35    36    139
1      31    33    32    39    135
2      34    34    33    25    126
tot.  100   100   100   100    400

B:
k      h1    h2    h3    h4    tot.
0      0     0     0     0     0
1      0     0     0.03  0.10  0.04
2      0     0.03  0     0.12  0.03
tot.   0     0.01  0.01  0.07  0.02
the 9, the error was that one true generating component was excluded from the
identified model. Once, for h2 , one extra component was included. In total 91
out of the 100 generating models were identified correctly. Table 1.B summarizes
the results in identification accuracy.
Overall, according to this experiment, the model selection procedure is
feasible. It seems that model estimation works surprisingly well despite the lack
of guarantees for finding the global optimum. Failures in the identification may
be due to suboptimal parameter values or to the sampling of the cross-validation
data. Consequently, the improvement in the model evidence brought by the introduction
of some components has not been considered significant by our conservative
model selection procedure. It is interesting to note that off-by-one errors in the
identification do not seem to induce further errors in subsequently identified
components; hence the estimation procedure can be considered robust in
this respect.
4 Discussion
Our method estimates the trilinear predictor tensor as a sum of a finite number of
constrained rank-one tensors, i.e., as a constrained CANDECOMP [1]/PARAFAC [4]
trilinear decomposition. Once this representation has been found, it follows from
the properties of the decomposition that the trilinear components are unique up
to permutation and scaling of the parameter vectors under certain sufficient
conditions [11] that hold in most real-world data analyses.
It is known that, in general, the approximation of a tensor by the trilinear
decomposition is an ill-posed problem [15] that does not have a bounded solution
for some degenerate tensors. Also, the greedy approach we apply (but do not
depend on) for fitting the decomposition incrementally does not result in the best
fit, in the sense that the optimal parameters of a less complex model are not
generally optimal in a more complex model [9]. Both of these results are derived
for approximations based on the Frobenius norm; we are not aware of
results that are valid for our probabilistic metric. Additionally, there are results
(e.g. [10]) showing that in real-world applications the trilinear decomposition, fitted in the
References
1. Carroll, J.D., Chang, J.J.: Analysis of individual differences in multidimensional
scaling via an N-way generalization of "Eckart-Young" decomposition. Psychome-
trika 35(3), 283–319 (1970), http://dx.doi.org/10.1007/BF02310791
2. Friedlander, M.P., Hatz, K.: Computing nonnegative tensor factorizations. Com-
putational Optimization and Applications 23(4), 631–647 (2008)
3. Gärdenfors, P.: Conceptual Spaces. MIT Press (2000)
4. Harshman, R.: Foundations of the PARAFAC procedure: Model and conditions
for an ’explanatory’ multi-mode factor analysis. In: UCLA Working Papers in
phonetics (16) (1970)
5. Hitchcock, F.L.: The expression of a tensor or a polyadic as a sum of products.
Journal of Mathematics and Physics 6, 164–189 (1927)
6. Honkela, T., Raitio, J., Nieminen, I., Lagus, K., Honkela, N., Pantzar, M.: Using
GICA method to quantify epistemological subjectivity. In: Proceedings of IJCNN
2012, International Joint Conference on Neural Networks (2012)
7. Ilin, A., Raiko, T.: Practical approaches to principal component analysis in the
presence of missing values. Journal of Machine Learning Research (JMLR) 11,
1957–2000 (2010)
8. Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Re-
view 51(3), 455–500 (2009)
9. Kolda, T.G.: Orthogonal tensor decompositions. SIAM Journal on Matrix Analysis
and Applications 23(1), 243–255 (2001)
10. Kolda, T.G., Bader, B.W., Kenny, J.P.: Higher-order web link analysis using mul-
tilinear algebra. In: ICDM 2005: Proceedings of the 5th IEEE International Con-
ference on Data Mining, pp. 242–249 (November 2005)
11. Kruskal, J.B.: Three-way arrays: Rank and uniqueness of trilinear decompositions,
with application to arithmetic complexity and statistics. Linear Algebra and its
Applications 18, 95–138 (1977)
12. McCullagh, P., Nelder, J.A.: Generalized linear models, 2nd edn. Chapman & Hall,
London (1989)
13. Mørup, M.: Applications of tensor (multiway array) factorizations and decomposi-
tions in data mining (2011),
http://onlinelibrary.wiley.com/doi/10.1002/widm.1/full
14. Picard, R.R., Cook, R.D.: Cross-validation of regression models. Journal of the
American Statistical Association 79(387), 575–583 (1984),
http://www.jstor.org/stable/2288403
15. de Silva, V., Lim, L.H.: Tensor rank and the ill-posedness of the best low-rank
approximation problem. SIAM J. Matrix Analysis Applications 30(3), 1084–1127
(2008),
http://dblp.uni-trier.de/db/journals/siammax/siammax30.html#SilvaL08
16. Yilmaz, Y.K., Cemgil, A.T., Simsekli, U.: Generalised coupled tensor factorisation.
In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F.C.N., Weinberger, K.Q.
(eds.) NIPS, pp. 2151–2159 (2011),
http://dblp.uni-trier.de/db/conf/nips/nips2011.html#YilmazCS11
For removing the mean $\mu^{(a3)}_m$ of $a^{(3)}_{im}$, we can increase $h_2$ by 1 and set the new
part to

$$b^{(1)}_{j h_2} = \mu^{(a3)}_m\, b^{(3)}_{jm}, \qquad (8)$$

$$c^{(2)}_{k h_2} = \mu^{(a3)}_m\, c^{(3)}_{km}. \qquad (9)$$
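The mean-removal argument can be checked numerically: subtracting the per-component mean of the a-loadings from the trilinear term and absorbing it into an extra j–k bilinear component leaves the predictor unchanged (shapes below are invented for the check):

```python
import numpy as np

# Identity behind the appendix: sum_m a*b*c equals sum_m (a - mu)*b*c plus
# a j-k bilinear term built from the removed means mu, for every cell.
rng = np.random.default_rng(0)
I, J, K, M = 4, 5, 6, 3
a = rng.normal(size=(I, M))
b = rng.normal(size=(J, M))
c = rng.normal(size=(K, M))

tri = np.einsum('im,jm,km->ijk', a, b, c)      # original trilinear part

mu = a.mean(axis=0)                            # mean of a per component m
tri_centered = np.einsum('im,jm,km->ijk', a - mu, b, c)
absorbed = np.einsum('m,jm,km->jk', mu, b, c)  # new bilinear j-k part

assert np.allclose(tri, tri_centered + absorbed[None, :, :])
```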
Estimating Quantities:
Comparing Simple Heuristics
and Machine Learning Algorithms
1 Introduction
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 483–490, 2012.
484 J.K. Woike, U. Hoffrage, and R. Hertwig
The study of this match between the architecture of decision strategies and en-
vironmental structures is central to the study of ecological rationality [3]. In this
contribution, two heuristics that have been developed within the simple heuris-
tics framework are pitted against a selection of machine learning algorithms to
test for comparative strengths and weaknesses across a range of empirical (data)
environments with binary predictor variables (henceforth: cues) and a continuous
criterion.
1. For each of the $k$ binary cues $b_i$ with possible values 0 and 1, calculate $s_{1,i}$
and $s_{0,i}$ as the average criterion value over all cases $o$ in the learning set ($L$)
with $b_i(o) = 1$ and $b_i(o) = 0$, respectively.
2. Let $s^+_i = \max(s_{1,i}, s_{0,i})$ and $s^-_i = \min(s_{1,i}, s_{0,i})$. Recode the cue values such
that $s^+_i = s_{1,i}$.
3. Order the $k$ cues in ascending order of $s^-_i$, so that for $(c_1, \ldots, c_k)$:
$s^-_1 \leq s^-_2 \leq \ldots \leq s^-_k$.
QEst can be represented as a minimal binary tree (see Fig. 1, left side): the tree
has $k$ levels following the root node and one exit node on each level,
except for the last level, which has two exit nodes. The cues
are assigned, in ascending order of $s^-_i$, to the $k$ decision nodes beginning with the
root node. To create an estimate for a case $o$, cues are looked up in the order
determined above; once a negative value for a cue $c_j$ is encountered, an exit node
is reached and $s^-_j$ is returned as the estimate. Only if no negative cue
value at all is found, the mean of all cases in $L$ that do not have a single negative
cue value is predicted.
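A minimal sketch of the QEst procedure on a toy learning set (variable names are ours; this is not the Delphi implementation used in the paper):

```python
import numpy as np

# QEst sketch: cues are binary columns of X, y is the continuous criterion.
# Each cue is assumed to take both values in the learning set.
def fit_qest(X, y):
    X, y = np.asarray(X, float), np.asarray(y, float)
    s1 = np.array([y[X[:, i] == 1].mean() for i in range(X.shape[1])])
    s0 = np.array([y[X[:, i] == 0].mean() for i in range(X.shape[1])])
    flip = s0 > s1                       # recode so that value 1 means s+
    s_minus = np.minimum(s1, s0)
    order = np.argsort(s_minus)          # inspect cues in ascending s-
    coded = np.where(flip, 1 - X, X)
    no_negative = coded.all(axis=1)      # cases with no negative cue value
    default = y[no_negative].mean() if no_negative.any() else y.mean()
    return order, flip, s_minus, default

def qest_estimate(x, order, flip, s_minus, default):
    for i in order:
        value = 1 - x[i] if flip[i] else x[i]
        if value == 0:                   # negative cue value -> exit node
            return s_minus[i]
    return default                       # mean of the all-positive cases

X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
y = np.array([10.0, 4.0, 2.0, 1.0])
params = fit_qest(X, y)
print(qest_estimate([0, 1], *params))
```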
The only difference between QEst and the original QuickEst is that QuickEst
rounds the estimates to the next spontaneous number [4], as it has been designed
to model human inferences (and humans tend to generate ”round” estimates).
Because an environment in which the criterion distribution follows a power
law contains many cases with small criterion values (and these cases will tend
to have negative cue values), QEst reduces information search by design and will
likely be able to return estimates after inspecting only very few cues. Note that
neither of the two variants has free parameters.
The second variant that we introduce and test in this paper, the Zig-QEst
heuristic (ZQ, see Fig. 1, right side), differs from QuickEst on yet another
dimension: it sorts out the extreme cases on both sides of the distribution first.
Cues are put into sequence by choosing the cue with minimum $s^-$ and maximum
Fig. 1. The binary decision trees of QEst (left) and Zig-QEst (right)
$s^+$, alternatingly (Fig. 1, right side, shows one of the two possible structures).
An exit node associated with an $s^-$ is reached when the corresponding cue has
a negative value, and an exit node associated with an $s^+$ when the corresponding
cue has a positive value. The first exit node is placed based on the maximum
absolute deviation from the mean, $\max(|s^+_i - \bar{y}|, |\bar{y} - s^-_i|)$, across all cues. As
a benchmark, the prediction of the mean observed in $L$ was added as a third
heuristic. The heuristics were implemented in Borland Delphi 6.0.
3 Simulation Setup
3.1 Environments
$$A(s) = 1 - \frac{E(s) - E_t}{E_m - E_t}, \qquad (1)$$

where $E(s)$ represents the MSE of the algorithm's predictions:

$$E(s) = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}(o_i, s) - y(o_i)\right)^2. \qquad (2)$$
$E_t$ is the fitting MSE of the estimation tree on the full dataset,
that is, with a learning set that consists of all cases. If $U(x)$ is the subset of
cases in $L$ with cue values identical to those of case $x$, then the prediction of the
estimation tree is the mean of the criterion values in this subset.
$E_t$ is the lower bound for $E(s)$ if deterministic predictions have to be made for
the full dataset based on the cue values, as the mean minimizes the MSE within each
cue equivalence class. As the upper bound for the prediction error, in contrast, we
take the variance of the criterion values in the full dataset ($E_m = \sigma^2(y)$), because this
variance is equivalent to the minimal MSE of a model that completely disregards
cue information. The criterion $A(s)$ is a linear transformation of $E(s)$ that maps
any MSE between these two extremes to the interval $[0, 1]$, such that lower MSEs
correspond to higher values of $A(s)$. There may be subsets of the full dataset
for which $A(s) > 1$, and ill-fitted models can generate values below 0.
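Equation (1) reduces to a one-line transformation; a sketch:

```python
# Normalized accuracy from Eq. (1): E_t (estimation-tree MSE on the full
# dataset) maps to accuracy 1, E_m (criterion variance) maps to 0.
def accuracy(E_s, E_t, E_m):
    return 1.0 - (E_s - E_t) / (E_m - E_t)

print(accuracy(E_s=6.0, E_t=2.0, E_m=10.0))  # halfway between the bounds
```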
Fitting and prediction accuracy were measured for each strategy in every envi-
ronment under two conditions: Algorithms were trained and fitted to a learning
set of either 50% or 75% of randomly chosen cases from the full dataset and
predicted the criterion values in the hold-out set that consisted of the remaining
50% or 25% of the cases in the datasets. All seven algorithms were tested on the
same randomly generated subsets, and there were 1000 trials for each combina-
tion of dataset, algorithm and learning set size (for a total of 1,386,000 fitting
and 1,386,000 prediction accuracy results).
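The trial procedure just described can be sketched roughly as follows, assuming a hypothetical fit/score interface for the competing algorithms (the interface and function name are ours, not from the original implementation):

```python
import random

def run_condition(cases, algorithms, learn_frac, n_trials=1000, seed=0):
    """One condition of the horse race: draw a random learning set of the
    given fraction, fit every algorithm on the SAME split, and score its
    predictions on the remaining hold-out cases."""
    rng = random.Random(seed)
    results = {name: [] for name in algorithms}
    for _ in range(n_trials):
        shuffled = list(cases)
        rng.shuffle(shuffled)
        k = int(len(shuffled) * learn_frac)
        learn, holdout = shuffled[:k], shuffled[k:]
        for name, algo in algorithms.items():
            model = algo.fit(learn)          # hypothetical interface
            results[name].append(model.score(holdout))
    return results
```

Sharing the identical random splits across all algorithms, as here, removes split-to-split variance from the between-algorithm comparison.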
4 Simulation Results
The results, averaged across the ninety-nine datasets, are presented in Fig. 2, on
the left side for fitting, and on the right side for prediction. When fitting known
data, the winners of the competition are the machine learning strategies: The
best performing strategy is EstT, which is optimally suited to fit data with binary
cues (the average accuracy is larger than 1, as it is easier for estimation trees
to fit smaller samples). Not far behind in the race are CRT and GRN, followed
by LR. In contrast, the two simple heuristics perform much worse, about half
way between the machine learning strategies and the mean model as the lower
benchmark. As expected, fitting results for the 50% condition are slightly better
than for the 75% condition, as it is easier to fit a smaller number of cases with
the same number of parameters.
For the prediction task the results are markedly different. First, and as ex-
pected, each algorithm performs worse than in fitting. Second, and as expected,
results for the 75% condition are better, because larger learning sets yield bet-
ter parameter estimates. Third, and more interestingly: the estimation tree and
CRT are no longer competitive. In the 50% condition, the best performance is
488 J.K. Woike, U. Hoffrage, and R. Hertwig
Fig. 2. Average Accuracy-Score (A) for the fitting and prediction of strategies across
the ninety-nine data sets: the bar colors represent the learning condition (the size of
the learning sets)
reached by GRN and ZQ and in the 75% condition the winner is LR. Both the
complex GRN and the simple ZQ-heuristic perform on similar levels, while QEst
fails to make good predictions. The difference in accuracy loss between fitting
and prediction clearly signals that the complex models over-fit the data. Finally,
the difference in performance of LR between the 50% and 75% condition under-
scores the importance of large sample sizes for generalizable parameter estimates
in LR-models.
A
1.0
0.5
−0.5
−1.0
−1.5
−2.0 A(GRN)
A(ZQ)
−2.5
Environments
Fig. 3. Predictive Accuracy for GRN and ZQ for each of the ninety-nine data sets
ordered by decreasing Δ(A) = A(GRN ) − A(ZQ). The length of the vertical lines
corresponds to |Δ(A)|
and criterion in the full dataset is positively correlated with the performance of
both algorithms, it shows a positive correlation with the difference between the
algorithms (r=.292, p=.003).
5 Discussion
The results of the horse-race simulation clearly demonstrate that predictive ac-
curacy is not necessarily linked to the algorithmic complexity of the strategies. In
fact, ZQ, a simple non-compensatory heuristic, which is a plausible candidate for
modeling estimation by boundedly rational humans, compared favorably with
the machine learning algorithms when the performance across ninety-nine real-
world datasets was assessed in cross-validation. The ecological analysis further
suggests that the heuristics are less prone to over-fitting, as they suffer less from
a decrease in sample size and can cope better with a large number of variables
than the machine learning algorithms. The ZQ heuristic drastically outperformed
the QEst heuristic and should be more vigorously studied in future research.
These results are in line with previous simulation results for non-compensatory,
lexicographic heuristics in pair-comparison [21] and classification [22] tasks. This
study also supports the claim that the study of strategies (here, strategies for
estimation) cannot be separated from the study of environments in which these
strategies are applied [23].
References
1. Gigerenzer, G., Todd, P.M., ABC Research Group: Simple heuristics that make us
smart. Oxford UP, New York (1999)
2. Gigerenzer, G., Selten, R. (eds.): The adaptive toolbox. MIT Press, Cambridge
(2001)
3. Todd, P.M., Gigerenzer, G., The ABC Research Group: Ecological rationality:
Intelligence in the world. Oxford UP, New York (2012)
4. Hertwig, R., Hoffrage, U., Martignon, L.: Quick estimation: Letting the environ-
ment do some of the work. In: Gigerenzer, G., Todd, P.M., The ABC Research
Group (eds.) Simple Heuristics that Make Us Smart, pp. 209–234. Oxford UP,
New York (1999)
5. Hertwig, R., Hoffrage, U., Sparr, R.: The QuickEst heuristic: How to benefit from
an imbalanced world. In: Todd, P.M., Gigerenzer, G., The ABC Research Group
(eds.) Ecological Rationality: Intelligence in the World, pp. 379–406. Oxford UP,
New York (2012)
6. Specht, D.E.: A general regression neural network. IEEE Transactions on Neural
Networks 2(6), 568–576 (1991)
7. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and regres-
sion trees. Wadsworth, Monterey (1984)
8. Dawes, R.M.: The robust beauty of improper linear models in decision making.
American Psychologist 34(7), 571–582 (1979)
9. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository. University of
California, School of Information and Computer Science, Irvine (2007),
http://www.ics.uci.edu/~mlearn/MLRepository.html
10. Statlib on-line data base, http://lib.stat.cmu.edu/datasets
11. DASL - Data and Story Library, http://lib.stat.cmu.edu/DASL/
12. OzDASL - Australasian Data and Story Library, http://www.statsci.org/data/
13. Journal of Statistics Education Data Archive,
http://www.amstat.org/publications/jse/jse_data_archive.html
14. Swivel, http://www.swivel.com/data_sets/
15. Social Explorer, http://www.socialexplorer.com/
16. Inter-University Consortium for Political and Social Research (ICPSR),
http://dx.doi.org/10.3886/ICPSR02650
17. National Institute for Occupational Safety and Health (NIOSH) Mining Division,
http://www.cdc.gov/niosh/mining/data/
18. UCLA Statistics Data Sets, http://www.stat.ucla.edu/data/
19. Weisberg, S.: Applied linear regression. John Wiley and Sons, New York (1985)
20. Hettich, S., Bay, S.D.: The UCI KDD Archive. University of California, Department
of Information and Computer Science, Irvine (1999), http://kdd.ics.uci.edu
21. Czerlinski, J., Gigerenzer, G., Goldstein, D.G.: How good are simple heuristics?
In: Gigerenzer, G., Todd, P.M., The ABC Research Group (eds.) Simple Heuristics
that Make Us Smart, pp. 97–118. Oxford UP, New York (1999)
22. Martignon, L., Katsikopoulos, K.V., Woike, J.K.: Categorization with limited re-
sources: A family of simple heuristics. Journal of Mathematical Psychology 52(6),
352–361 (2008)
23. Todd, P.M., Gigerenzer, G.: Environments that make us smart: Ecological ratio-
nality. Current Directions in Psychological Science 16(3), 167–171 (2007)
Rademacher Complexity and Structural Risk
Minimization: An Application to Human Gene
Expression Datasets
DITEN – University of Genova, Via Opera Pia 11A, Genova, I-16145, Italy
{Luca.Oneto,Davide.Anguita,Alessandro.Ghio,Sandro.Ridella}@unige.it
1 Introduction
The process of building an optimal Support Vector Classifier (SVC) [14] consists
of two phases: (i) the first one addresses the identification of a set of parameters,
which are found by solving a Quadratic Programming problem; (ii) the second
phase aims at tuning a set of additional variables, namely the hyperparameters.
The last step is also known as the model selection phase and is usually linked
to the problem of estimating the generalization ability of the classifier since,
usually, the best hyperparameters are selected so as to minimize this quantity.
Out–of–sample techniques exploit a validation set for tuning the SVC hyperparameters, which is obtained by removing some samples from the original dataset and saving them for the model selection phase. An example of out–of–
sample technique is the Cross Validation [8], which is a common choice among
practitioners. Unfortunately, these methods prove to be unreliable when applied to the small–sample setting [6], where only a few high–dimensional samples
are available for training the classifier. In this framework, in–sample techniques
based on complexity measures, such as the Rademacher Complexity (RC) [4]
or the Maximal Discrepancy (MD) [5], have been shown to be a suitable alternative [3]. The main advantage of in–sample methods, with respect to out–of–sample
approaches, is the use of the whole set of available data for both training and
model selection purposes, therefore increasing the reliability of the final classifier.
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 491–498, 2012.
© Springer-Verlag Berlin Heidelberg 2012
492 L. Oneto et al.
$$\hat{C}(\mathcal{H}) = \mathbb{E}_{\boldsymbol{\sigma}} \sup_{h \in \mathcal{H}} \frac{2}{n} \sum_{i=1}^{n} \sigma_i \, \ell(h(x_i), y_i), \qquad (1)$$
As this bound is valid for any function belonging to the class H, it will also be
valid for any classifier chosen by the learning and model selection procedures, so
giving a powerful tool for selecting the best performing one.
$$\min_{w,b} \; \frac{\|w\|^2}{2} + C \sum_{i=1}^{n} \xi(h(x_i), y_i), \qquad (3)$$

where $\xi(h(x_i), y_i) = (1 - y_i h(x_i))_+$ is the hinge loss, and $(\cdot)_+ = \max(0, \cdot)$. This
is the well–known Tikhonov formulation of the SVM, but the class H is better
defined through the Ivanov formulation [1]:
$$\min_{w,b} \; \frac{1}{n} \sum_{i=1}^{n} \xi(h(x_i), y_i), \quad \text{subject to: } \|w\|^2 \leq \rho, \qquad (4)$$
which is equivalent to the previous one for some values of the hyperparameters
[1,14]. In problem (4), the hyperparameter is ρ and H is defined as the class of functions for which ‖w‖² ≤ ρ and b ∈ ℝ. In other words, the hyperparameter ρ controls directly the size of the class: the larger is ρ, the larger is H.
According to the Structural Risk Minimization (SRM) principle [14], the
model selection phase of the SVM consists in selecting the best class of functions H∗, and consequently the best classifier h∗(x) ∈ H∗, by solving problem
(4) or, equivalently, problem (3), exploring a sequence of classes of increasing
¹ They are also known as Rademacher variables. Note that, as there are 2ⁿ possible combinations of such variables, the expectation with respect to σ is usually computed, in practice, through a Monte-Carlo procedure, though novel computing architectures are opening new perspectives (e.g. quantum computing [4]).
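The Monte-Carlo procedure mentioned in the footnote can be sketched for a finite hypothesis set, each hypothesis represented by its vector of per-sample losses (the representation and the function name are our assumptions for illustration):

```python
import random

def rademacher_complexity(losses_per_h, n_draws=1000, seed=0):
    """Monte-Carlo estimate of the empirical Rademacher complexity
    C(H) = E_sigma sup_h (2/n) sum_i sigma_i * loss_h(i): draw random
    sign vectors sigma, take the sup over the hypotheses, and average."""
    rng = random.Random(seed)
    n = len(losses_per_h[0])
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(
            (2.0 / n) * sum(s * l for s, l in zip(sigma, row))
            for row in losses_per_h
        )
    return total / n_draws
```

In practice the sup over an infinite class H is obtained by running the learning algorithm itself on the sign-flipped data rather than by enumeration, as discussed in the text.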
the regularization advantage of the SVM over a general linear classifier is lost.
As an alternative, the soft loss $\ell_S(h(x), y) = \min[\xi(h(x), y)/2, 1]$ has been proposed [3], since it is Lipschitz continuous and bounded. The soft loss possesses some nice symmetry properties, in particular $\ell_S(h(x), y) = 1 - \ell_S(h(x), -y)$, so that the computation of $\hat{C}(\mathcal{H})$ can be performed quite easily through a minimization procedure [3]. Unfortunately, it is easy to note that
therefore the soft loss is only a loose upper-bound of the number of misclassified
samples.
Then, we propose to use the trimmed hinge loss [12] (see also Fig. 1a)

$$\ell_T(h(x), y) = \begin{cases} \ell_H(h(x), y) & \text{if } y h(x) < 0 \\ \xi(h(x), y) & \text{if } y h(x) \geq 0, \end{cases} \qquad (6)$$

where $\ell_H$ denotes the hard (0/1) loss.
The main issue, now, becomes the application of this loss to the primal SVM
learning problem, because the symmetry property does not hold anymore.
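For concreteness, the three losses discussed here can be written as functions of the margin m = y·h(x); a minimal sketch:

```python
def hinge_loss(margin):
    """xi(h(x), y) = (1 - y h(x))_+, as a function of the margin m = y h(x)."""
    return max(0.0, 1.0 - margin)

def soft_loss(margin):
    """l_S = min(xi / 2, 1): Lipschitz continuous and bounded, with the
    symmetry l_S(m) = 1 - l_S(-m) noted in the text."""
    return min(hinge_loss(margin) / 2.0, 1.0)

def trimmed_hinge_loss(margin):
    """l_T: the hard (0/1) loss where y h(x) < 0, the hinge loss otherwise,
    so it stays bounded on misclassified points but is not symmetric."""
    return 1.0 if margin < 0 else hinge_loss(margin)
```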
Let us consider a particular realization of the Rademacher variables, then, by
inserting the trimmed hinge loss in problem (4), we can write:
$$\arg\max_{w,b} \frac{2}{n} \sum_{i=1}^{n} \sigma_i \, \ell_T(h(x_i), y_i) = \arg\min_{w,b} -\frac{2}{n} \sum_{i=1}^{n} \sigma_i \, \ell_T(h(x_i), y_i) \qquad (8)$$

subject to $\|w\|^2 \leq \rho$. The previous problem can be reformulated as:
Fig. 1. Splitting of the trimmed hinge loss ℓ_T in a convex and a concave part (each panel plots the loss against y h(x))
$$\min_{w,b,\zeta,\bar{\zeta}} \; \sum_{i \in S^+} \zeta_i - \sum_{i \in S^-} \bar{\zeta}_i \quad \text{subject to: } \|w\|^2 \leq \rho \qquad (15)$$
and also subject to the constraints (11) and (13). Equivalently, the Tikhonov
formulation of the above problem is:
$$\min_{w,b,\zeta,\bar{\zeta}} \; \frac{1}{2}\|w\|^2 + C \left( \sum_{i \in S^+} \zeta_i - \sum_{i \in S^-} \bar{\zeta}_i \right), \qquad (16)$$
² As an alternative, Convex-ConCave Programming (CCCP) techniques can be exploited as well [7].
again subject to the constraints (11) and (13). The previous problem is equivalent
to (15) for some value of C [10]. Thus, Problem (15) can be solved through the
Tikhonov formulation of Problem (16), as shown in [1].
We can compute the dual formulation of Problem (16)³ and obtain:

$$\min_{\alpha} \; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j - \sum_{i \in S^+} \alpha_i \qquad (17)$$

$$\text{subject to: } \begin{cases} i \in S^+ : \; 0 \leq \alpha_i \leq C \\ i \in S^- : \; -C \leq \alpha_i \leq 0 \end{cases}, \qquad \sum_{i=1}^{n} y_i \alpha_i = 0.$$
Once a solution has been found [11], the final classifier is obviously defined as⁴ $h(x) = \sum_{i=1}^{n} \alpha_i y_i \, x_i \cdot x + b$. The model is then used to identify possible CSs,
i.e. the patterns which cause the unboundedness of the loss function: the pro-
cedure for identifying and eliminating the critical data is detailed in Algorithm
1. Analogously to [3], it is worth noting that the number of CSs can be used to
obtain a rigorous upper bound of the minimum of Problem (9), as also shown
in line 26 of Algorithm 1.
³ The proof is omitted because of space constraints.
⁴ It is also worth noting that the non-linear extension of the presented approach through the kernel trick [14] becomes straightforward by simply applying a non-linear mapping x → φ(x) and by defining the kernel function as K(x_i, x_j) = φ(x_i) · φ(x_j).
Table 1. Average number of errors on the test sets of the HGE datasets
4 Experimental Results
In order to verify whether the trimmed hinge loss allows us to improve the model selection performance of RC and MD-based bounds in the small–sample setting, we make use of several Human Gene Expression datasets [2]⁵. As in this kind
of setting a reference set of reasonable size is not available for evaluating the
performance of the entire procedure, we reproduce the methodology suggested
in [13], which consists in generating five different training/test pairs using a
random sampling approach. The model selection is performed by searching for
the optimal hyperparameter ρ in the range [10^-6, 10^2] among 30 values, equally
spaced in a logarithmic scale. As, in this paper, we are targeting two–class classification, we map multi-class datasets into two–class ones by simply grouping classes so as to obtain almost balanced problems.
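The hyperparameter grid described above is simply 30 points equally spaced on a log scale; a one-function sketch (the function name is ours):

```python
def log_spaced_grid(lo_exp=-6, hi_exp=2, n=30):
    """30 candidate hyperparameter values, equally spaced on a logarithmic
    scale in [10^-6, 10^2], as used for rho (and later for C) in the text."""
    step = (hi_exp - lo_exp) / (n - 1)
    return [10.0 ** (lo_exp + i * step) for i in range(n)]
```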
In Table 1, we also present the average number of errors, performed on the five
test set replicas. In particular, we compare the results obtained with the RC and
the MD approaches, using both the soft loss (RC_S, MD_S) and the trimmed hinge loss (RC_T, MD_T). Though in-sample methods based on the soft loss already proved to outperform out-of-sample approaches on the same datasets [2], we
report here for the sake of completeness the misclassification rate for the well-
known K-fold Cross Validation (KCV) technique [9]. In particular, for the KCV,
the conventional Tikhonov SVM formulation is exploited, where the number of
folds k is set to 10 [8] and the hyperparameter C is searched within the range
[10^-6, 10^2] among 30 values, analogously to ρ. As expected, the trimmed hinge
loss is a tighter upper–bound of the number of errors, and the classifiers chosen
by RC_T and MD_T perform consistently better than the ones selected by RC_S, MD_S and the KCV.
5 Conclusions
We introduced in this work the exploitation of the trimmed hinge loss in Support Vector classifiers, which allows us to rigorously perform in-sample model selection
⁵ We do not include here the original references for all the datasets because of space constraints. However, they can be retrieved in [2].
References
1. Anguita, D., Ghio, A., Oneto, L., Ridella, S.: In-sample Model Selection for Support
Vector Machines. In: Proc. of the Int. Joint Conference on Neural Networks (2011)
2. Anguita, D., Ghio, A., Oneto, L., Ridella, S.: Selecting the Hypothesis Space for
Improving the Generalization Ability of Support Vector Machines. In: Proc. of the
Int. Joint Conference on Neural Networks (2011)
3. Anguita, D., Ghio, A., Ridella, S.: Maximal Discrepancy for Support Vector Ma-
chines. Neurocomputing 74, 1436–1443 (2011)
4. Anguita, D., Ridella, S., Rivieccio, F., Zunino, R.: Quantum optimization for train-
ing support vector machines. Neural Networks 16(5-6), 763–770 (2003)
5. Bartlett, P.L., Boucheron, S., Lugosi, G.: Model selection and error estimation.
Machine Learning 48, 85–113 (2002)
6. Braga-Neto, U.M., Dougherty, E.R.: Is cross-validation valid for small-sample mi-
croarray classification? Bioinformatics 20(3), 374 (2004)
7. Collobert, R., Sinz, F., Weston, J., Bottou, L.: Trading convexity for scalability.
In: Proceedings of the 23rd International Conference on Machine Learning, pp.
201–208 (2006)
8. Hsu, C., Chang, C., Lin, C., et al.: A practical guide to support vector classification
(2003)
9. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation
and model selection. In: International Joint Conference on Artificial Intelligence,
vol. 14, pp. 1137–1145 (1995)
10. Pelckmans, K., Suykens, J.A.K., De Moor, B.: Morozov, Ivanov and Tikhonov regularization based LS-SVMs. Neural Information Processing 3316, 1216–1222 (2004)
11. Platt, J.: Sequential minimal optimization: A fast algorithm for training sup-
port vector machines. Advances in Kernel Methods-Support Vector Learning 208,
98–112 (1999)
12. Shawe-Taylor, J., Cristianini, N.: Kernel methods for pattern analysis. Cambridge
University Press (2004)
13. Statnikov, A., Aliferis, C.F., Tsamardinos, I., Hardin, D., Levy, S.: A comprehen-
sive evaluation of multicategory classification methods for microarray gene expres-
sion cancer diagnosis. Bioinformatics 21(5), 631 (2005)
14. Vapnik, V.N.: Statistical learning theory. Wiley Interscience (1998)
Using a Support Vector Machine and Sampling
to Classify Compounds as Potential Transdermal
Enhancers
1 Introduction
Incorporation of certain chemicals into the drug delivery vehicle may lead to enhance-
ment of drug release and a more rapid clinical response. Such chemicals have been var-
iously labeled penetration enhancers, accelerants or sorption promoters. Investigation
and development of suitable enhancers is, like other aspects of pharmaceutical devel-
opment, limited by the time and expense of in vivo studies and even a wide range of in
vitro experiments. Therefore, mathematical relationships are often sought between the
physical properties of pharmaceutical systems and their clinical performance.
In [1], Pugh et al. employed discriminant analysis to classify such enhancers into
simple categories, based on the physicochemical properties of the enhancer molecules.
One of the difficulties in this work is the imbalanced dataset. The data has two classes labeled
as either good enhancers or poor enhancers, with about 83% being poor enhancers. In
[2], several machine learning methods including K-nearest-neighbour (KNN) regres-
sion, single layer networks, Gaussian processes (GP) and the SVM classifier were ap-
plied to an enhancer dataset. The best classification result was obtained by the ordinary
SVM method. Here we investigate the effect of using two different SVM error costs,
and whether a sampling method including both under-sampling and over-sampling can
further improve the SVM classifier’s accuracy rate.
2 Problem Domain
The percutaneous absorption of exogenous chemicals, particularly for therapeutic pur-
poses, is limited by the nature of the skin barrier, specifically, the properties of the
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 499–506, 2012.
© Springer-Verlag Berlin Heidelberg 2012
500 A. Shah et al.
outermost layer of the skin, the stratum corneum. A range of chemical and physical
methods have been employed, with varying degrees of success, to enhance the rate and
extent of absorption.
Chemical penetration enhancers are often classified based on their mechanism. Math-
ematical models have been developed and applied to investigate the enhancement ef-
fects of chemical compounds. Multiple regression analysis, often in the form of quan-
titative structure-activity relationships (QSARs), have been extensively employed to
develop such relationships between the effect and a set of physical properties of phar-
maceutical systems. For example, in [3] the authors used QSARs to look into the en-
hancement effects associated with a range of penetration enhancers.
Further, Pugh et al. [1] have applied discriminant analysis to identify compounds
with potential as enhancers. They classified 73 potential enhancers of hydrocortisone
permeation based on an enhancement ratio (ER) of the amount of hydrocortisone trans-
ferred after 24 hours relative to a control. In their work, ER ≥ 10 was considered a good
enhancer, and ER < 10 a poor enhancer. They found that discriminant analysis using
the carbon chain length, number of hydrogen-bonding atoms present on a molecule and
the molecular weight resulted in a correct classification of good enhancers of 92%, but
a relatively limited success in correctly classifying poor enhancers (72%).
The application of novel Machine Learning techniques has also highlighted their
usefulness to the field of percutaneous absorption [2]. Results in [2] show that machine
learning methods can provide more accurate classification of enhancer type with fewer
false-positive results.
In the current work, we investigate whether the false positive rate can be further
decreased when using SVM with sampling methods.
The dataset employed in this study is based on [1]. Several changes have been made
from this dataset, including the correction of several errors contained in the original.
The dataset consists of seventy-one compounds in which there are twelve compounds
belonging to the good enhancer class of samples and fifty-nine compounds belonging to
the poor enhancer class of samples. All data used refer to the transfer of hydrocortisone
across hairless mouse skin over 24 hours from propylene glycol solutions of enhancers.
As mentioned in reference [1], a range of calculable molecular features were considered as predictors, but most of them proved to be unsuccessful, and the discussion was therefore limited to the five most successful predictors. These five predictors are log P (P
denotes the ratio: octanol/water), log S (S denotes solubility), Molecular Weight (MW),
carbon chain length (CC) and hydrogen bonds (HB). These features are readily calculable molecular features. In [1], it was shown that log P and log S are highly correlated, with a correlation coefficient equal to −0.91. Here, MW, CC and HB are considered as another set of independent variables. So in the following experiments, 3 features refers to MW, CC and HB, while 5 features refers to the above 3 plus log P and log S.
A comparison between the good class and poor class was undertaken by visual data
exploration. For instance, Figure 1 shows a box plot of molecule weights by class. (We
do not show the figure for all features here due to the space limit.) In summary, there are
statistical differences between the good and poor classes. For example, the interquartile
range for the good class is at a higher value than that of the poor class on M W , CC
and log P , while it is at a lower value on HB and log S. In addition, the interquartile
range for the good class is always narrower than that for the poor class on all 5 features.
Fig. 1. Box plot of Molecular Weight grouped as poor and good classes. Da is a unit of mass.
4 Performance Measures
It is obvious that for a problem domain with an imbalanced dataset, classification ac-
curacy rate is not sufficient as a standard performance measure. To evaluate classifiers
used in this work, we apply several common performance metrics, such as Recall, Pre-
cision and F-score, which are calculated in order to fairly quantify the performance of
the classification algorithm on the minority class.
Based on the confusion matrix (see Table 1) computed from the test results, several
common performance metrics can be defined as follows in Table 2.
Table 1. A confusion matrix: where TN is the number of true negative samples; FP is false
positive samples; FN is false negative samples; TP is true positive samples
                 predicted negative   predicted positive
actual negative          TN                   FP
actual positive          FN                   TP
$$\text{Recall} = \frac{TP}{TP + FN}, \qquad (1) \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad (2)$$

$$\text{F-score} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}, \qquad (3) \qquad \text{FP rate} = \frac{FP}{FP + TN}. \qquad (4)$$
In the context of identifying good enhancers a high F-score and low FP Rate are
particularly important, as a higher cost is associated with a degradation of performance
on these metrics. There is a trade-off between Precision and Recall integrated into the
metrics.
In this paper we address the problem of our imbalanced data in two ways: firstly by
using data based sampling techniques [4] and secondly by using different SVM error
costs for the two classes [5].
One way to address imbalance is simply to change the relative frequencies of the two
classes by under sampling the majority class and over sampling the minority class.
Under-sampling the majority class can be done by just randomly selecting a subset
of the class. Over-sampling the minority class is not so simple and here we use the
Synthetic Minority Over-sampling Technique (SMOTE) [4]. For each member of the
minority class its nearest neighbours in the same class are identified and new instances
are created, placed randomly between the instance and its neighbours.
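The interpolation step just described can be sketched as follows; this is a simplified illustration of the SMOTE idea, not the reference implementation of [4] (k is the neighbourhood size):

```python
import random

def smote_oversample(minority, k=3, n_new=10, seed=0):
    """SMOTE-style over-sampling sketch: pick a random minority instance,
    choose one of its k nearest same-class neighbours, and create a
    synthetic point at a random position on the segment between them."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((m for m in minority if m is not x),
                            key=lambda m: sq_dist(x, m))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment in [0, 1)
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic
```

Because each synthetic point is a convex combination of two existing minority points, the new samples stay inside the convex hull of the minority class.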
$$L_p = \frac{\|w\|^2}{2} + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{n} r_i \xi_i \qquad (5)$$

$$\text{subject to: } 0 \leq \alpha_i \leq C \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \qquad (6)$$
Here C represents the trade-off between the empirical error, ξ, and the margin. The
problem here is that both the majority and minority classes use the same value for C,
which as pointed out by Akbani et al [6] will probably leave the decision boundary too
near the minority class. Veropoulos et al [5] suggest that having a different C value for
the two classes may be useful. They suggest that the primal Lagrangian is modified to:
$$L_p = \frac{\|w\|^2}{2} + C^+ \sum_{i|y_i=+1} \xi_i + C^- \sum_{i|y_i=-1} \xi_i - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{n} r_i \xi_i \qquad (7)$$

$$\text{subject to: } 0 \leq \alpha_i \leq C^+ \text{ if } y_i = +1, \quad 0 \leq \alpha_i \leq C^- \text{ if } y_i = -1, \quad \text{and } \sum_{i=1}^{n} \alpha_i y_i = 0 \qquad (8)$$
Here the trade-off coefficient C is split into C⁺ and C⁻ for the two classes, allowing the decision boundary to be influenced by different trade-offs for each class. Thus the decision boundary can be moved away from the minority class by lowering C⁺ with respect to C⁻.
Akbani et al. [6] argue that using this technique should improve the position of the
decision boundary but will not address the fact that it may be misshapen due to the
relative lack of information about the distribution of the minority class. So they suggest
that the minority class should also be over-sampled, using SMOTE, to produce a method
they call SMOTE with Different Costs (SDC). This is one of the techniques we evaluate
here.
6 Experiments
For each different dataset, the leave-one-out technique was applied, that is, one chem-
ical is used for testing, and all others are employed for training. This was repeated for
each compound in turn. Finally, performance metrics were computed in terms of all
predictions. The SVM experiments were completed using LIBSVM, which is available
from the URL http://www.csie.ntu.edu.tw/~cjlin/libsvm.
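The leave-one-out procedure can be sketched generically (fit and predict are placeholders for any of the classifiers evaluated here):

```python
def leave_one_out(cases, fit, predict):
    """Leave-one-out evaluation: each compound in turn is held out for
    testing while all the others form the training set; the per-case
    predictions are collected for the overall performance metrics."""
    predictions = []
    for i, test_case in enumerate(cases):
        train = cases[:i] + cases[i + 1:]
        model = fit(train)
        predictions.append(predict(model, test_case))
    return predictions
```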
6.1 Experiment 1
In the first experiment, the effect of different SVM error costs is investigated. A systematic search was done, and the best results were obtained when the cost penalty assigned for misclassifying a poor enhancer was 10, whereas that for misclassifying a good enhancer was 80. Results are shown in Table 3. For comparison, the ordinary SVM classification results obtained with the radial basis kernel function are also shown in the table.
It can be seen that using SVM with cost penalties does not improve classification
performance on the dataset with all five features, but it improves F-score, recall and
precision though not FP-rate when using only three features.
6.2 Experiment 2
Table 4 shows that under-sampling has better performance than the one obtained
from the over-sampling method. Compared with Table 3, it can be seen that on average
results from sampling have better performance on F-score and precision with much
higher values on recall. For FP-rate, SVM with under-sampling provides a lower value
than the one obtained from SVM with or without cost penalties on the corresponding
dataset; while SVM with over-sampling has a higher (worse) FP-rate. Furthermore,
Table 5 shows the confusion matrix obtained from under-sampling. It can be seen that
on average the number of true positives is almost the same as the actual number of
labeled good enhancers.
Figure 2 shows a Principal Component Analysis (PCA) plot, where label information is from the SVM with under-sampling on the three-feature dataset. One false negative and three false positive instances can be seen. The three false positives are located in the area occupied by true positives, so their misclassification is not surprising.
Fig. 2. The PCA plot of the dataset with three features. The single false negative can be seen to
be a long way from the other positive samples.
6.3 Experiment 3
7 Conclusions
The main results of this work have shown that by using suitable machine learning tech-
niques we can produce an effective classifier on possible transdermal enhancers. More-
over, this has been accomplished with a very small training set of only 70 compounds.
The technical results in this paper have shown that sampling methods can be useful to
improve the SVM classifier’s performance. Using SVM with different cost errors can
also improve the ordinary SVM’s performance, but not so much as sampling does. For
this dataset, over-sampling, under-sampling and combined sampling are evaluated, with
under-sampling giving the best results followed by the combined sampling methods.
References
[1] Pugh, W., Wong, R., Falson, F., Michniak, B., Moss, G.: Discriminant analysis as a tool
to identify compounds with potential as transdermal enhancers. Journal of Pharmacy and
Pharmacology 57, 1389–1396 (2005)
[2] Moss, G., Shah, A., Adams, R., Davey, N., Wilkinson, S., Pugh, W., Sun, Y.: The applica-
tion of discriminant analysis and machine learning methods as tools to identify and classify
compounds with potential as transdermal enhancers. European Journal of Pharmaceutical
Sciences 45, 116–127 (2012)
[3] Ghafourian, T., Zandasrar, P., Hamishekar, H., Nokhodchi, A.: The effect of penetration en-
hancers on drug delivery through skin: a qsar study. J. Control. Release 99, 113–125 (2004)
[4] Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357 (2002)
[5] Veropoulos, K., Cristianini, N., Campbell, C.: Controlling the sensitivity of support vector
machines. In: Proceedings of the International Joint Conference on Artificial Intelligence
(1999)
[6] Akbani, R., Kwek, S.S., Japkowicz, N.: Applying Support Vector Machines to Imbalanced
Datasets. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) ECML 2004.
LNCS (LNAI), vol. 3201, pp. 39–50. Springer, Heidelberg (2004)
The Application of Gaussian Processes in the Predictions
of Permeability across Mammalian Membranes
1 Introduction
The problem of predicting the rate at which various chemical compounds penetrate
human skin is an important issue with the increasing use of skin as a means of achiev-
ing both local and systemic drug delivery. In [1] and [2], it is shown that advanced
machine learning techniques, especially Gaussian Processes (GP), outperform quantitative
structure-activity relationships (QSARs), which are widely used in the pharmacy
community.
One key feature of predicting percutaneous absorption accurately is that the target,
the skin permeability coefficient, may have a strongly non-linear relationship with the
compound physicochemical descriptors (features), which has already been shown to
be the case, as shown in [1] and [2] using a human skin dataset. In [3], GP is further evaluated
on four different datasets, namely experimentally derived drug permeation data across
human skin, pig skin, rodent skin, and a synthetic (Silastic®) membrane. The GPs with
Matérn and neural network covariance functions give the best performance in the work
shown in [3]. It is found that five compound features applied to human, pig and rodent
membranes cannot represent the main characteristics of the silastic dataset.
In the current work, given the number of animal experiments described in the
literature and the difficulty of obtaining human skin for experiments, GP methods
were used to investigate whether a dataset of permeability values from animal
(rodent) experiments could provide reasonable estimates of human skin permeability.
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 507–514, 2012.
© Springer-Verlag Berlin Heidelberg 2012
508 Y. Sun et al.
2 Problem Domain
Predicting percutaneous absorption accurately has proven to be a major challenge, and
one which has substantial implications for the pharmaceutical and cosmetic industries,
as well as for toxicological issues in fields such as pesticide usage and chemicals
manufacture. Predictive modeling is a frequently used tool to increase the throughput
of percutaneous absorption experiments. The use of animal models for percutaneous penetration
is often considered essential, given the possible toxicity, cost, ethics and inconvenience
of employing human skin during laboratory experiments. Human skin differs from that
of many animals in numerous ways including the thickness of the stratum corneum,
number of appendages per unit area and amount of skin lipids present. Despite this, it
is very surprising that no quantitative mathematical models had been developed for the
purpose of characterising permeation across non-human skin before [3] was published.
This is perhaps due to the development of the Potts and Guy model [4], the first major,
quantitative model for measuring percutaneous absorption, which was based on human
skin data.
In using a model system, the researcher must take into account the inherent dif-
ferences of the various species employed and the parameters affecting percutaneous
penetration in each species. The model selected must therefore resemble human skin
as closely as possible. Various models have been offered by many researchers. [5] and
[6] investigated several potential models, including rabbit, miniature swine and rat, and
concluded that rabbit skin, followed by rat skin, were the most permeable membranes, and
that flux (denoted J, the rate of permeant transfer from one side of a membrane to the
other) through pig skin most resembled the permeation across human skin.
By convention Kp denotes the permeability coefficient. Kp is a concentration cor-
rected version of flux that allows comparison of permeation for different molecules. Kp
is defined as Kp = J/ΔCm , where ΔCm denotes the concentration difference across
the membrane. Several approaches have been used to try to quantify and predict skin
absorption. One such method involves the use of quantitative structure activity (or per-
meability) relationships (QSARs, or QSPRs). Usually, lipophilicity (P) and molecular
weight (MW) appear to be the only significant features in QSAR forms, although subset
analysis has shown the significance of other parameters [7]. P is the ratio of the solubil-
ity of a molecule between two phases; octanol, to represent the lipid phase, and water
(or a buffered aqueous solution) to represent the aqueous phase. Normally, this spans
quite a range, as some molecules will prefer one phase to the other, often across as wide
a range as 10^-7 to 10^7. Hence, a log scale, log P, is used. For the same reason, log Kp is
used for skin percutaneous absorption rather than Kp. It is important to note that log Kp
is a completely different quantity from log P.
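The definitions above can be made concrete with a small worked example; the numerical values here are purely illustrative, not taken from the paper's dataset.

```python
import math

def permeability_coefficient(J, delta_Cm):
    """Kp = J / delta_Cm: flux corrected by the concentration difference."""
    return J / delta_Cm

# Illustrative values: flux in mg/(cm^2 h), concentration difference in mg/cm^3
J, delta_Cm = 2.0e-4, 10.0
Kp = permeability_coefficient(J, delta_Cm)   # = 2.0e-5 cm/h
log_Kp = math.log10(Kp)                      # log scale, as used for skin absorption

# A partition coefficient of 10^3 gives log P = 3, a different quantity from log Kp
log_P = math.log10(1.0e3)
```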
Recently, new approaches, for example artificial neural networks and fuzzy
modelling, have been applied to predict percutaneous absorption. [1] employed
Gaussian Processes to predict percutaneous absorption using a human skin dataset,
demonstrated the underlying non-linear nature of the dataset, and provided a
substantial statistical improvement over existing models.
The novel contribution of this work is to explore another interesting issue: whether
including another species' skin data in a training set can improve predictions
of the human skin permeability coefficient.
4 Modelling Methods
Two QSPR methods were applied to the human skin data in order to provide a compar-
ison between Gaussian Processes and previous approaches to this task. The first one,
denoted as Potts, was proposed by [4] and derived from the Flynn dataset [8]. It is given
by the equation log Kp (cm/s) = 0.71 log P − 0.0061 MW − 6.3. The second model,
denoted as Moss, is represented by log Kp (cm/s) = 0.74 log P − 0.0091 MW − 2.39,
which was derived from a slightly larger dataset [7].
Since there are no QSAR models for animal skin, a simple naïve model was
used for comparison. In the naïve model, the prediction for any input is always the
same value, namely the mean of log Kp in the training set.
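The two QSPR baselines and the naïve model can be written directly from the equations above. The coefficients come from the text; the example compound (log P = 2.0, MW = 150) is hypothetical.

```python
def potts_log_kp(log_p, mw):
    """Potts & Guy model [4]: log Kp (cm/s) = 0.71 log P - 0.0061 MW - 6.3."""
    return 0.71 * log_p - 0.0061 * mw - 6.3

def moss_log_kp(log_p, mw):
    """Moss model [7]: log Kp (cm/s) = 0.74 log P - 0.0091 MW - 2.39."""
    return 0.74 * log_p - 0.0091 * mw - 2.39

def naive_log_kp(train_targets):
    """Naive model: always predict the training-set mean of log Kp."""
    return sum(train_targets) / len(train_targets)

# Hypothetical compound: log P = 2.0, molecular weight 150
p = potts_log_kp(2.0, 150.0)   # 0.71*2 - 0.0061*150 - 6.3 = -5.795
```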
Fig. 1. The relationship between log Kp and the PCA space of chemical compounds: a) the first
principal component; b) the second principal component
function. The covariance function, k(xi, xj), allows a priori knowledge about the
underlying function to be specified from a training dataset. It defines the nearness or
similarity between the values of f(x) at the two points xi and xj.
To make a prediction y∗ at a new input x∗, the distribution
p(y∗|y1, . . . , yN), conditioned on the observed vector [y1, . . . , yN], needs to be computed. Since
the model is a Gaussian process, this distribution is also Gaussian and is completely
defined by its mean and variance. By applying standard linear algebra, the mean and
variance at x∗ are given by
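In the standard GP regression formulation (see Rasmussen and Williams [9]), with kernel matrix K where K_ij = k(xi, xj), test covariance vector k∗ = [k(x1, x∗), . . . , k(xN, x∗)]^T and noise variance σn², the predictive mean and variance take the following textbook form (stated here as an assumption about the paper's notation):

```latex
\bar{y}_{*} = \mathbf{k}_{*}^{\top}\,(K + \sigma_n^{2} I)^{-1}\,\mathbf{y},
\qquad
\operatorname{Var}[y_{*}] = k(\mathbf{x}_{*}, \mathbf{x}_{*})
  - \mathbf{k}_{*}^{\top}\,(K + \sigma_n^{2} I)^{-1}\,\mathbf{k}_{*}.
```

These are the standard equations for a zero-mean GP with Gaussian observation noise.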
5 Experiments
For each dataset, the leave-one-out technique was applied: one chemical is used
for testing while all the others are employed for training. This was repeated for
each compound in turn. Finally, performance metrics were computed over all of the
predictions.
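The leave-one-out protocol described above can be sketched generically; the `fit`/`predict` callables stand in for any of the models evaluated and are assumptions here, not the authors' code.

```python
def leave_one_out(X, y, fit, predict):
    """Hold out each example once, train on the rest, and collect all predictions."""
    preds = []
    for i in range(len(X)):
        X_tr = X[:i] + X[i+1:]
        y_tr = y[:i] + y[i+1:]
        model = fit(X_tr, y_tr)
        preds.append(predict(model, X[i]))
    return preds

# Toy check using the naive mean predictor
mean_fit = lambda X, y: sum(y) / len(y)
mean_predict = lambda m, x: m
preds = leave_one_out([[1], [2], [3]], [1.0, 2.0, 3.0], mean_fit, mean_predict)
```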
5.1 Experiment 1
The first experiment investigated whether a GP model trained using the rodent skin
dataset provides reasonable predictions for the human skin dataset. It is obviously much
easier to obtain animal tissue than human tissue. The rodent skin dataset was used as the
training set and the trained GP model was tested on the complete human skin dataset.
Table 1 shows results for the complete human set. The best result for each column is
indicated in bold. It can be seen that the GP models outperform the QSAR models, with the
model trained using the rodent dataset giving the best prediction result.
Table 1. Performances on the complete human skin dataset when trained with the rodent dataset.
5.2 Experiment 2
The possibility of using an animal model to predict human skin permeability was
investigated in experiment 1. In experiment 2, a quantitative comparison of human skin
permeability predictions between GP models trained on a rodent and on a human training
dataset was undertaken. To make this comparison, the compounds common to both
the human and rodent datasets had to be used. A training set of 48 common
chemical compounds for which a target value was known for both the human and rodent
data was taken. Two trained models were then produced. Previously unseen human
data were taken as a test set for both models; this comprised the rest of the human
dataset, consisting of 92 compounds.
Table 2 shows the corresponding results, with the best results in bold. It can be seen
that both the human and rodent training sets give better predictions on the human test
set than either the naïve or the QSAR models, with the human training set giving the
best performance. Interestingly, the rodent naïve model is better than the human naïve
model at predicting human skin permeability. Moreover, the GP predictions from the
rodent model are much better than those of the human naïve model. This means that
the rodent model could be more useful than often thought for predicting human skin
permeability ([6]).
5.3 Experiment 3
The final experiment investigated how adding rodent examples into a human training
set may affect predictions on a human test set. To avoid inconsistent training examples,
that is, examples with the same features but different target values, the non-common
compounds were used as training examples. In this experiment, a human model was
trained using human data, denoted as trnH, using the 92 non-common compounds and
Table 2. Performances on the human test set using models trained on the rodent and human
training sets, separately. Note that the QSAR's ION results were from the human training set.
tested using human data with the 48 common compounds. Results produced by the
human model are shown in the first two rows of Table 3.
To generate a mixed model using human and rodent data, a mixed dataset was
needed; moreover, a training set of 92 compounds was needed in order to compare
it with trnH, which includes 92 compounds. Chemical compounds not among the
common compounds were extracted from the rodent dataset, 55 chemical compounds
in total, denoted trnR. Then 50 samples from trnH and 42 from trnR were randomly
selected and combined, denoted trnHR. A GP model was trained using trnHR. Finally,
the model was tested on the same human data as was used for the human model
(the 48 compounds). This procedure was repeated 10 times, and average
results are shown in Table 3.
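The construction of the mixed training sets can be sketched as follows. Only the sizes (50, 42, 92) and the 10 repetitions come from the text; the dataset contents and helper names are placeholders.

```python
import random

def mixed_training_sets(trnH, trnR, n_h=50, n_r=42, repeats=10, seed=0):
    """Repeatedly draw 50 human + 42 rodent examples to form trnHR of size 92."""
    rng = random.Random(seed)
    for _ in range(repeats):
        yield rng.sample(trnH, n_h) + rng.sample(trnR, n_r)

trnH = [("h", i) for i in range(92)]   # 92 non-common human compounds
trnR = [("r", i) for i in range(55)]   # 55 non-common rodent compounds
sets = list(mixed_training_sets(trnH, trnR))
```

A GP model would be trained on each of the 10 resulting trnHR sets and the results averaged, as in Table 3.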
It can be seen that including rodent examples in the training set can produce
predictions on average almost as accurate as using a human training set of the same size. In
addition, the model trained on the mixed training set produces a reliably lower
NLL on this test set compared with the one trained on the human training set.
Finally, the actual permeability value for rodent skin was used as the prediction
for human skin. This is possible since the test set contains only those compounds
that are common to both rodent and human tissues. The final row of Table 3 shows the
result of this 'prediction'. Interestingly, it gives the best result.
Table 3. Performances on the human test set using models trained on the human, rodent and
mixed training set, respectively. Performance using rodent values directly is also shown (see
text).
6 Conclusions
The main results in this paper have shown that data from the rodent skin can provide
useful information in the prediction of the permeability of human skin. In particular, if
the permeability of rodent skin for a specific compound is used as an estimate for the
permeability of that compound for human skin, then quite a high quality prediction is
made. In fact, such a prediction is the best of all the models we have evaluated. It is also
interesting to see that a GP model trained with a mixture of human and rodent data did
a little better than a GP model trained with just human data on average.
The implication of this work could be important in that testing the permeability of
human skin is difficult and expensive, whereas the testing of animal skin is much easier.
It is known from literature (for example, [6] and [10]), that skin of other species is
more similar to human skin than that of the rodent. In particular, many researchers have
indicated the suitability of porcine (pig) skin, especially from weanling or stillborn
animals, as a model for percutaneous absorption ([11] and [12]). In future work, we
would therefore like to use porcine data as a predictor for human skin permeability.
References
[1] Moss, G., Sun, Y., Davey, N., Adams, R., Pugh, W., Brown, M.: The application of gaussian
processes to the prediction of percutaneous absorption. Journal of Pharmacy & Pharmacol-
ogy 61, 1147–1153 (2009)
[2] Sun, Y., Brown, M., Prapopoulou, M., Davey, N., Adams, R., Moss, G.: The application
of stochastic machine learning methods in the prediction of skin penetration. Applied Soft
Computing 11(2), 2367–2375 (2010)
[3] Sun, Y., Moss, G.P., Prapopoulou, M., Adams, R., Brown, M.B., Davey, N.: The application
of gaussian processes in the prediction of percutaneous absorption for mammalian
and synthetic membranes. In: ESANN 2010 Proceedings, Bruges, Belgium (2010)
[4] Potts, R., Guy, R.: Predicting skin permeability. Pharmaceutical Research 9, 663–669
(1992)
[5] Bartek, M., LaBudde, J., Maibach, H.: Skin permeability in vivo: comparison in rat, rabbit,
pig and man. Journal of Investigative Dermatology 58, 114–123 (1972)
[6] Wester, R., Noonan, P.: Relevance of animal models for percutaneous absorption. Interna-
tional Journal of Pharmaceutics 7, 99–110 (1980)
[7] Moss, G., Cronin, M.: Quantitative structure-permeability relationships for percutaneous
absorption: re-analysis of steroid data. International Journal of Pharmaceutics 238, 105–
109 (2002)
[8] Flynn, G.L.: Physicochemical Determinants of Skin Absorption, pp. 93–127. Elsevier, New
York (1990)
[9] Rasmussen, C., Williams, C.: Gaussian Processes for Machine Learning. The MIT Press
(2006)
[10] Bronaugh, R., Stewart, R., Congdon, E., Giles, A.: Methods for in vitro percutaneous ab-
sorption studies i: Comparison with in vivo results. Applied Pharmacology 62, 481–488
(1982)
[11] Marzulli, F., Maibach, H.: Relevance of animal models: the hexachlorophene story. In: An-
imal Models in Dermatology. Churchill-Livingstone, Edinburgh (1975)
[12] Chow, C., Chow, A., Downie, R., Buttar, H.: Percutaneous absorption of hexachlorophene
in rats, guinea pigs and pigs. Toxicology 9, 147–154 (1978)
Protein Structural Blocks Representation
and Search through Unsupervised NN
1 Introduction
There are currently 80,041 (as of 13/3/2012) experimentally determined 3D structures
of proteins deposited in the Protein Data Bank (PDB) [1] (with an increment
of about 700 new molecules per month). However, this set contains many very
similar (if not identical) structures. The study of structural building blocks,
their comparison and their classification, is instrumental to the study of evolution
and of functional annotation, and has brought about many methods for their
identification and classification in proteins of known structure.
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 515–522, 2012.
© Springer-Verlag Berlin Heidelberg 2012
516 V. Cantoni et al.
There are several methods for defining protein secondary structure, but the
Dictionary of Protein Secondary Structure (DSSP) [2] approach is the most
commonly used. DSSP defines eight types of secondary structure; nevertheless,
the majority of secondary structure prediction methods simplify these further to the
three dominant states: Helix, Sheet and Coil. Namely, the helices include the 3/10
helix, α-helix and π-helix; sheets or strands include the extended strand (in parallel
and/or anti-parallel β-sheet conformation); finally, coils can be considered
simply as connections. In the sequel, the structural analysis for protein recognition and
comparison is conducted only on the basis of the two most frequent components
[3]: the α-helices and the β-strands. DSSP and STRIDE both extract from the
PDB the 3D locations and attitudes of segments, the positions in the sequence of the SSs
(in particular also of the strands constituting sheets), and much other information that is
easily integrated into the new data structure.
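The reduction from the eight DSSP states to the three dominant ones can be written as a simple mapping. The single-letter DSSP codes used here are the conventional ones (G = 3/10 helix, H = α-helix, I = π-helix, E = extended strand, B = isolated β-bridge); the exact grouping of B with the strands is a common convention, not something stated in the text.

```python
def dssp_to_three_state(code):
    """Collapse an eight-state DSSP code into Helix / Sheet / Coil."""
    if code in ("G", "H", "I"):      # 3/10, alpha and pi helices
        return "Helix"
    if code in ("E", "B"):           # strands in beta-sheet conformation
        return "Sheet"
    return "Coil"                    # turns, bends and unassigned residues

three_state = [dssp_to_three_state(c) for c in "HHHEECTG"]
```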
(a)
(b) (c)
Fig. 1. 1(a) A picture generated by PyMOL on PDB file 1FNB rotated π/2 for format
reasons. In green the Greek key motif (residues 56-116). 1(b) Protein Gaussian Image
of protein 1FNB. Green arrows represent the Greek key motif. 1(c) Protein Gaussian
Image of Greek motif contained in protein 1FNB. Green arrows represent the Greek
key motif, while the green line shows the sequence of SS.
(a)
(b) (c)
Fig. 2. 2(a) A picture generated by PyMOL on PDB file 4GCR rotated π/2 for format
reasons. In green the Greek key motif (residues 34-62). 2(b) Protein Gaussian Image
of protein 4GCR. Green arrows represent the Greek key motif. 2(c) Protein Gaussian
Image of Greek motif contained in protein 4GCR. Green arrows represent the Greek
key motif, while the green line shows the sequence of SS.
τ(u_v, x^(1), . . . , x^(c)) = F(W u_v + Σ_{j=1}^{c} W_j x^(j) + θ)    (3)
where s = source(D), D^(1), . . . , D^(c) are the subgraphs pointed to by the outgoing
edges leaving s, nil_A is a special coordinate vector in the discrete output
space A, and

Mnode : Y × A × · · · × A → A    (with A repeated c times)    (7)

is a SOM, defined on a generic node, which takes as input the label of the
node and the "encoding" of the subgraphs D^(1), . . . , D^(c) according to the M#
map. By "unfolding" the recursive definition in (6), it turns out that M#(D) can
be computed by first applying Mnode to the leaf nodes (i.e., nodes with null
outdegree), and then proceeding with the application of Mnode bottom-up from the
frontier nodes (sink nodes) to the supersource of the graph D.
Details about the learning algorithm can be found in [15].
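The bottom-up "unfolding" described above amounts to evaluating nodes leaves-first, so that every node sees its children's encodings before being encoded itself. This is a generic sketch: `m_node` stands in for the SOM-based Mnode, and the graph representation is an assumption.

```python
def encode_graph(labels, children, m_node):
    """Apply m_node leaves-first (memoized DFS) over a rooted DAG."""
    codes = {}

    def visit(v):
        if v in codes:               # already encoded (shared subgraph)
            return codes[v]
        child_codes = [visit(c) for c in children.get(v, [])]
        codes[v] = m_node(labels[v], child_codes)
        return codes[v]

    for v in labels:
        visit(v)
    return codes

# Toy DAG: supersource 'a' points to leaves 'b' and 'c'; m_node just sums
labels = {"a": 1, "b": 2, "c": 3}
children = {"a": ["b", "c"]}
codes = encode_graph(labels, children, lambda lab, kids: lab + sum(kids))
```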
4 Experimental Results
Test Set
Learning Rate         1                   1.25                1.5
Iterations        40     60     80     40     60     80     40     60     80
Retrieval       84.99  84.08  84.86  86.46  87.69  79.79  79.29  82.31  79.82
Classification  72.65  56.95  56.91  59.39  70.65  60.73  65.91  64.30  74.82
Clustering       0.79   0.85   0.83   0.82   0.81   0.85   0.83   0.79   0.80
performance with respect to the desired clustering outcome shows less accurate
results, but with an interesting peak at 74.82%.
The second test was performed considering as patterns the single side
chains of each protein, i.e., each side chain is represented by a PGI. From Table
2 it can be noted that this "reduced" representation yields better results in terms
of classification and retrieval performance, while performing slightly worse with
respect to clustering performance. In particular, the clustering performance is
almost the same but with a higher confidence, reflected by the higher retrieval
performance. The interesting result concerns the classification performance, which
is much higher when considering only the side chain.
Table 2. Performance of a 200 × 200 SOM–SD considering the single side chains of
each protein
Test Set
Learning Rate         1                   1.25                1.5
Iterations        40     60     80     40     60     80     40     60     80
Retrieval       74.39  81.67  79.63  92.72  92.35  93.40  92.36  92.34  93.85
Classification  75.58  76.37  77.64  85.11  84.14  84.17  85.03  85.78  86.42
Clustering       0.80   0.80   0.80   0.79   0.80   0.79   0.79   0.80   0.79
5 Conclusion
A new data structure has been introduced that supports both artificial and
human analysis of protein structure. Preliminary tests have been performed
by employing the PGI in a structural learning task, obtaining very good and
promising results. We are now planning an intensive quantitative analysis of the
effectiveness of this new representation approach for practical problems such as
alignment, or even structural block retrieval at different levels of complexity:
from basic motifs composed of a few SSs, to domains, up to units. Moreover, in
future work we will try to exploit other features besides the orientation, such as the
length of the structure in terms of the number of amino acids, biochemical properties,
the amino acid sequence, etc.
References
1. Protein Data Bank, http://www.pdb.org/
2. Kabsch, W., Sander, C.: Dictionary of protein secondary structure: pattern recogni-
tion of hydrogen-bonded and geometrical features. Biopolymers 22(12), 2577–2637
(1983)
3. Eisenberg, D.: The discovery of the alpha-helix and beta-sheet, the principal struc-
tural features of proteins. Proceedings of the National Academy of Sciences USA
100, 11207–11210 (2003)
4. Cantoni, V., Ferone, A., Petrosino, A.: Protein Gaussian Image (PGI) - A Protein
Structural Representation Based on the Spatial Attitude of Secondary Structure.
New Tools and Methods for Pattern Recognition in Complex Biological Systems
(in press)
5. Shulman-Peleg, A., Nussinov, R., Wolfson, H.: Recognition of Functional Sites in
Protein Structures. J. Mol. Biol. 339, 607–633 (2004)
6. Bock, M.E., Garutti, C., Guerra, C.: Spin image profile: a geometric descriptor for
identifying and matching protein cavities. In: Proc. of CSB, San Diego (2007)
7. Frome, A., Huber, D., Kolluri, R., Bülow, T., Malik, J.: Recognizing Objects in
Range Data Using Regional Point Descriptors. In: Pajdla, T., Matas, J(G.) (eds.)
ECCV 2004. LNCS, vol. 3023, pp. 224–237. Springer, Heidelberg (2004)
8. Glaser, F., Morris, R.J., Najmanovich, R.J., Laskowski, R.A., Thornton, J.M.: A
Method for Localizing Ligand Binding Pockets in Protein Structures. PROTEINS:
Structure, Function, and Bioinformatics 62, 479–488 (2006)
9. Horn, B.K.P.: Extended Gaussian images. Proc. IEEE 72(12), 1671–1686 (1984)
10. Kang, S.B., Ikeuchi, K.: The complex EGI: a new representation for 3-D pose
determination. In: IEEE-T-PAMI, pp. 707–721 (1993)
11. Shum, H., Hebert, M., Ikeuchi, K.: On 3D shape similarity. In: Proceedings of the
IEEE-CVPR 1996, pp. 526–531 (1996)
12. Kang, S., Ikeuchi, K.: The complex EGI, a new representation for 3D pose deter-
mination. IEEE Trans. Pattern Anal. Mach. Intell. 15(7), 707–721 (1993)
13. Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C.: SCOP: a structural clas-
sification of proteins database for the investigation of sequences and structures. J.
Mol. Biol. 247, 536–540 (1995)
14. Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., Thorn-
ton, J.M.: CATH–a hierarchic classification of protein domain structures. Struc-
ture 5(8), 1093–1108 (1997)
15. Hagenbuchner, M., Sperduti, A., Tsoi, A.H.: A self-organizing map for adaptive
processing of structured data. IEEE Transactions on Neural Networks 14(3), 491–
505 (2003)
16. Sperduti, A.: Knowledge-Base Neurocomputing. In: Cloete, I., Zurada, J.M. (eds.)
pp. 117–152. MIT Press, Cambridge (2000)
17. Kohonen, T.: Self-Organization and Associative Memory, 3rd edn. Springer, New
York (1990)
Evolutionary Support Vector Machines
for Time Series Forecasting
1 Introduction
Forecasting the future using past data is an important tool to reduce uncertainty
and support both individual and organizational decision making. In particular,
multi-step ahead predictions (e.g., issued several months in advance) are useful
to aid tactical decisions, such as planning production resources or evaluating
alternative economic strategies [2]. The field of Time Series Forecasting (TSF)
deals with the prediction of a given phenomenon (e.g., umbrella sales) based
on the past patterns of the same event. TSF has become increasingly used in
distinct areas such as Agriculture, Finance, Production or Sales [14].
The research reported here has been supported by FEDER (program COMPETE
and FCT) under project FCOMP-01-0124-FEDER-022674.
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 523–530, 2012.
© Springer-Verlag Berlin Heidelberg 2012
524 P. Cortez and J.P. Donate
Several Operational Research TSF methods have been proposed, such as Holt-
Winters (in the sixties) or the ARIMA methodology [14] (in the seventies). More
recently, several Computational Intelligence (CI) methods have been applied to
TSF, such as Artificial Neural Networks (ANN) [12,6], and Support Vector Ma-
chines (SVM) [15,9,3]. CI models such as ANNs and SVMs are natural solutions
for TSF, since they are more flexible (i.e., no a priori restriction is imposed)
when compared with classical TSF models, presenting nonlinear learning capa-
bilities. When compared with ANN, SVM presents theoretical advantages, such
as the absence of local minima in the learning phase.
When applying these CI methods to TSF, variable and model selection are
critical issues. A sliding time window is often used to create a set of training
examples from the series. A small time window will provide insufficient infor-
mation, while using a large number of time lags will increase the probability of
having irrelevant inputs. Thus, variable selection is useful to discard irrelevant
time lags, leading to simpler models that are easier to interpret and that usually
give better performances [4,9,6]. On the other hand, CI models such as ANN
and SVM have hyperparameters that need to be adjusted (e.g., number of ANN
hidden nodes or kernel parameter) [8]. Complex models may overfit the data,
losing the capability to generalize, while a model that is too simple will present
limited learning capabilities.
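The sliding-window construction of training examples described above can be sketched as follows; the window size and the toy series are illustrative.

```python
def sliding_window(series, lags):
    """Turn a series into (inputs, target) pairs using the last `lags` values."""
    examples = []
    for t in range(lags, len(series)):
        examples.append((series[t - lags:t], series[t]))
    return examples

examples = sliding_window([1, 2, 3, 4, 5], lags=2)
# e.g. ([1, 2], 3), ([2, 3], 4), ([3, 4], 5)
```

Variable selection then amounts to choosing which of the lagged inputs to keep; a window that is too small loses information, while one that is too large admits irrelevant lags.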
Several hybrid systems, which combine two or more CI techniques, have also
been proposed for TSF, such as Evolutionary ANN (EANN) [4]. Most EANNs
use the standard Genetic Algorithm (GA). More recently, the Estimation of
Distribution Algorithm (EDA) was proposed [13]. This algorithm balances exploitation
and exploration to find good solutions. In [16], an EDA was used as
the search engine of an EANN, outperforming a GA-based EANN. Following
this result, in this paper we propose a novel Evolutionary SVM (ESVM) approach
based on the EDA engine, in order to automatically select the best SVM
multi-step-ahead forecasting model. Moreover, we also compare the ESVM with the
EANN proposed in [16] and the popular ARIMA methodology.
The paper is organized as follows. First, Section 2 describes the ESVM ap-
proach. Next, Section 3 presents the experimental setup and the obtained results.
Finally, the paper is concluded in Section 4.
This model requires setting three parameters: γ, the Gaussian kernel parameter,
exp(−γ||x − x′||^2), γ > 0; C, a trade-off between fitting the errors and the
flatness of the mapping; and ε, the width of the ε-insensitive tube.
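The Gaussian kernel and the ε-insensitive loss that these parameters control can be written directly; this is a sketch of the definitions, not of LIBSVM's internals.

```python
import math

def gaussian_kernel(x, x2, gamma):
    """exp(-gamma * ||x - x'||^2), gamma > 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x2))
    return math.exp(-gamma * sq_dist)

def eps_insensitive_loss(y, y_hat, eps):
    """Errors inside the eps-tube cost nothing; outside it, cost grows linearly."""
    return max(0.0, abs(y - y_hat) - eps)

k = gaussian_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)  # exp(-1)
loss = eps_insensitive_loss(1.0, 1.3, eps=0.1)          # 0.2
```

C then weights the summed ε-insensitive losses against the flatness (norm) of the regression mapping.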
In this paper, an evolving hybrid system that uses EDA and SVM is adopted.
Following the suggestion of the LIBSVM authors [1], the SVM parameters are
We selected a total of six time series, with different characteristics and from
distinct domains (Table 1). Five series were selected from the well-known Hyn-
dman’s time series data library repository [10]. These are named Passengers,
Temperature, Dow-Jones, Quebec and Abraham12. We also adopt the Mackey-
Glass series, which is a common nonlinear benchmark. It should be noted that
these six time series were also adopted by the NN3 and NN5 forecasting
competitions [5]. Except for Mackey-Glass, all datasets are from real-world domains,
and such data can be affected by external events (e.g., strikes), which makes them
interesting and more difficult to predict.
3.2 Evaluation
where et = yt − ŷt , T is the current time period and H is the forecasting horizon,
the number of multi-step-ahead forecasts. SMAPE is a popular forecasting metric
that has the advantage of being scale independent when compared with MSE,
and thus can be more easily used to compare methods across different series; it ranges
from 0% to 200%. SMAPE was also the error metric used in the NN3, NN5 and
NNGC1 forecasting competitions [5].
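The SMAPE metric described above can be implemented as follows. This is a standard formulation consistent with the stated 0-200% range; the exact variant used in the competitions may differ in small details.

```python
def smape(actual, forecast):
    """Symmetric MAPE in percent, bounded by 200% for same-signed values."""
    terms = [abs(a - f) / ((abs(a) + abs(f)) / 2.0)
             for a, f in zip(actual, forecast)]
    return 100.0 * sum(terms) / len(terms)

err = smape([100.0, 200.0], [110.0, 180.0])
```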
3.3 Results
For the comparison, we have chosen the EANN presented in [16], which is similar
to the ESVM except that it uses a multilayer perceptron trained with the RPROP
algorithm, as implemented in the SNNS tool [18]. The EANN optimizes the number
of inputs (I ∈ {1, . . . , 100}), the number of hidden nodes (from 0 to 99) and the RPROP
parameters (Δ0 ∈ {1, 0.01, 0.001, . . . , 10^-9} and Δmax ∈ {0, 1, . . . , 99}). Both the
Table 2. Forecasting errors (%SMAPE, best values in bold) and best SVM models
ESVM and EANN experiments were conducted using code written in the C language
by the authors. As the stopping criterion, we used a maximum of 100
generations for both the ESVM and the EANN. For a baseline comparison, we also chose
the ARIMA methodology, as computed by the Forecast Pro tool [7]. The
rationale is to use a popular benchmark that can easily be compared with and that
does not require expert model-selection capabilities from the user. The obtained
results are shown in Table 2 (SMAPE errors and best SVM models).
[Figure 1: left, best fitness (MSE over normalized data) vs. generations for all six series; right, ESVM forecasts for Quebec (SMAPE = 9.31).]
Fig. 1. Evolution of the ESVM best fitness value for all series (left, includes a zoom of
the bottom left area) and example of the ESVM forecasts for Quebec (right)
the best fitness values for the ESVM (left), showing fast convergence of the EDA
algorithm on all series. For demonstration purposes, the right side of Fig. 1 plots the
ESVM forecasts for Quebec, showing a good fit.
The experiments were carried out with exclusive access to a server (Intel
Xeon 2.27 GHz processor running Linux). Table 3 shows the computational time
(in minutes) required by each evolutionary method and series. The last column
shows, as a percentage, the reduction in computational effort obtained by the ESVM
when compared with the EANN, where Rt = 1 − (tESVM/tEANN) and tM is the
time required by model M. As shown by the table, the ESVM consumes much less
computational effort on all the time series tested; a reduction rate of at least 96%
can be observed in all cases.
4 Conclusions
References
1. Chang, C.-C., Lin, C.-J.: LIBSVM: A library for support vector machines. ACM
Transactions on Intelligent Systems and Technology 2, 27:1–27:27 (2011)
2. Chatfield, C.: Time-series forecasting. CRC Press (2001)
3. Cortez, P.: Sensitivity Analysis for Time Lag Selection to Forecast Seasonal Time
Series using Neural Networks and Support Vector Machines. In: Proceedings of the
International Joint Conference on Neural Networks (IJCNN 2010), pp. 3694–3701.
IEEE, Barcelona (2010)
4. Cortez, P., Rocha, M., Neves, J.: Time Series Forecasting by Evolutionary Neural
Networks. In: Artificial Neural Networks in Real-Life Applications, ch. III, pp.
47–70. Idea Group Publishing, USA (2006)
5. Crone, S.: Time series forecasting competition for neural networks and compu-
tational intelligence (2011), http://www.neural-forecasting-competition.com
(accessed on January 2011)
6. Crone, S.F., Kourentzes, N.: Feature selection for time series prediction - a com-
bined filter and wrapper approach for neural networks. Neurocomputing 73, 1923–
1936 (2010)
7. Goodrich, R.L.: The Forecast Pro methodology. International Journal of Forecast-
ing 16(4), 533–535 (2000)
8. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer, NY (2008)
9. He, W., Wang, Z., Jiang, H.: Model optimizing and feature selecting for sup-
port vector regression in time series forecasting. Neurocomputing 72(1-3), 600–611
(2008)
10. Hyndman, R.: Time Series Data Library (2011), http://robjhyndman.com/TSDL/
(accessed on January 2011)
11. Hyndman, R.J., Koehler, A.B.: Another look at measures of forecast accuracy.
International Journal of Forecasting 22(4), 679–688 (2006)
12. Lapedes, A., Farber, R.: Non-Linear Signal Processing Using Neural Networks:
Prediction and System Modelling. Technical Report LA-UR-87-2662, Los Alamos
National Laboratory, USA (1987)
13. Larrañaga, P., Lozano, J.A.: Estimation of Distribution Algorithms: A New Tool
for Evolutionary Computation, vol. 2. Springer (2002)
14. Makridakis, S., Wheelwright, S.C., Hyndman, R.J.: Forecasting methods and ap-
plications, 3rd edn. John Wiley & Sons, USA (2008)
15. Müller, K., Smola, A., Rätsch, G., Schölkopf, B., Kohlmorgen, J., Vapnik, V.:
Predicting Time Series with Support Vector Machines. In: Gerstner, W., Hasler,
M., Germond, A., Nicoud, J.-D. (eds.) ICANN 1997. LNCS, vol. 1327, pp. 999–
1004. Springer, Heidelberg (1997)
16. Peralta, J., Gutierrez, G., Sanchis, A.: Time series forecasting by evolving artificial
neural networks using genetic algorithms and estimation of distribution algorithms.
In: The 2010 International Joint Conference on Neural Networks (IJCNN), pp. 1–8
(July 2010)
17. Smola, A., Schölkopf, B.: A tutorial on support vector regression. Statistics and
Computing 14, 199–222 (2004)
18. Zell, A., Mache, N., Hübner, R., Mamier, G., Vogt, M., Schmalzl, M., Herrmann,
K.U.: SNNS (Stuttgart Neural Network Simulator). In: Neural Network Simulation
Environments, pp. 165–186 (1994)
Learning Relevant Time Points
for Time-Series Data in the Life Sciences
Abstract. In the life sciences, short time series with high dimensional
entries, such as spectrometric data or gene expression profiles taken
over time, are becoming more and more popular. Data characteristics
rule out classical time series analysis due to the few time points, and
they prevent a simple vectorial treatment due to the high dimensionality.
In this contribution, we successfully use the generative topographic
mapping through time (GTM-TT), which is based on hidden Markov models
enhanced with a topographic mapping, to model such data. We propose an
extension of GTM-TT by relevance learning which automatically adapts
the model such that the most relevant input variables and time points
are emphasized by means of an automatic relevance weighting scheme.
We demonstrate the technique in two applications from the life sciences.
1 Introduction
Due to improved sensor technology, many data sets occurring in the biomedical
domain are very high dimensional such as mass spectra or gene expression pro-
files. At the same time, more and more data display temporal characteristics,
e.g. when investigating the development of an organism over time or the success
of a therapy. In these scenarios, classical time series analysis cannot be applied
due to comparably few time points (often less than 10). In addition, a direct
vectorial treatment is prohibited by the high dimensionality of the data.
A few machine learning techniques exist to investigate high dimensional time
series: in the approach [15], topographic mapping such as the self-organizing map
(SOM) is extended by a recursive context which accounts for the temporal
dynamics. A probabilistic counterpart is offered by the Generative Topographic
Mapping Through Time (GTM-TT), which combines hidden Markov models with a
constrained mixture model induced by a low dimensional latent space. This
approach is extended to better take the relevance of the feature components into
account in [13], but relying on an unsupervised model. A supervised relevance
weighting scheme which singles out relevant features in a wrapper approach
based on hidden Markov models has been proposed in [12]. In [6], a similar
approach introduces class-wise constraints in the hidden Markov model. The
approach [5] deals with time series data and feature selection relying on support
vector machines in combination with a Kalman filter. In [12], applications to life
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 531–539, 2012.
c Springer-Verlag Berlin Heidelberg 2012
532 F.-M. Schleif et al.
length of the time series. A data point of the training set is referred to by $x_i$.
Consecutive entries $x_i(t)$ and $x_i(t+1)$ are strongly correlated. While the space
of observations over time is represented by a topographic mapping as before, the
temporal dependencies are modeled by a hidden Markov model (HMM) with
hidden states characterized by the lattice points wi .
The HMM is parametrized by initial state probabilities
$\pi = (\pi_j)_{j=1}^{K}$, where $\pi_j = p(z(1) = w_j)$, and transition probabilities
$P = (p_{ij})_{i,j=1}^{K}$, where $p_{ij} = p(z(t) = w_j \mid z(t-1) = w_i)$. The data probability is
$p(x|\Theta) = \sum_{z \in \{w_1, \dots, w_K\}^T} p(x, z|\Theta)$ with parameters $\Theta = (W, \beta, \pi, P)$, the con-
ditional probability $p(x(t)|z(t)) := p(x(t)|z(t), W, \beta)$ as before (1), and
$p(x, z|\Theta) = p(z(1)) \prod_{t=2}^{T} p(z(t)|z(t-1)) \prod_{t=1}^{T} p(x(t)|z(t))$ for any sequence
$z$ of hidden states [4].
As for HMMs, a forward-backward procedure allows determining the hidden
parameters, the responsibilities of states for a given sequence, in an efficient way
[16], based on which the parameters W and β can be determined as before.
We obtain the probability of being in state $w_k$ at time $t$, given the observation
sequence $x^n$:

$$r_k^n(t) = p(z(t) = w_k \mid x^n, \Theta) = \frac{A_{kt} B_{kt}}{p(x^n|\Theta)} \quad (2)$$

with forward variables $A_{kt} = p(x^n(1) \dots x^n(t), z(t) = w_k \mid \Theta)$ and backward
variables $B_{kt} = p(x^n(t+1) \dots x^n(t_n) \mid z(t) = w_k, \Theta)$.
For an input time series xn (1) . . . xn (T ), GTM-TT gives rise to a time series
of responsibilities rkn (1) . . . rkn (T ) of neuron k. Based on these responsibilities,
a winner can be determined for every time step t as neuron argmaxk rkn (t).
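The forward-backward recursions and the winner assignment can be sketched as follows. This is a generic stand-in using a precomputed matrix of emission likelihoods (the paper's emissions are Gaussians parametrized by W and β); all numbers are hypothetical:

```python
import numpy as np

def responsibilities(pi, P, B):
    """Forward-backward posteriors r_k(t) = p(z(t) = w_k | x, Theta).

    pi: (K,) initial state probabilities pi_j = p(z(1) = w_j)
    P:  (K, K) transition matrix, P[i, j] = p(z(t) = w_j | z(t-1) = w_i)
    B:  (T, K) emission likelihoods, B[t, k] = p(x(t) | z(t) = w_k)
    """
    T, K = B.shape
    A = np.zeros((T, K))                  # forward variables A_kt
    Bw = np.zeros((T, K))                 # backward variables B_kt
    A[0] = pi * B[0]
    for t in range(1, T):
        A[t] = (A[t - 1] @ P) * B[t]
    Bw[-1] = 1.0
    for t in range(T - 2, -1, -1):
        Bw[t] = P @ (B[t + 1] * Bw[t + 1])
    px = A[-1].sum()                      # data likelihood p(x | Theta)
    return A * Bw / px                    # r[t, k]; each row sums to 1

# Tiny 2-state example with hypothetical numbers:
pi = np.array([0.6, 0.4])
P = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 0.7]])
r = responsibilities(pi, P, B)
winners = r.argmax(axis=1)                # winner state per time step
```

For long sequences one would work in the log domain or rescale the forward and backward variables to avoid underflow.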
Based on this observation, a supervised variant of GTM-TT (SGTM-TT) can
be determined as follows: Assume that the time series x is equipped with label
information l, which is an element of a finite set of labels {1, ..., L}. Then, we
train a separate GTM-TT for every class, whereby the models are coupled by
choosing the same bandwidth β and the same underlying topological structure
in the latent space, i.e. the same basis functions Φ and prototypes w_i. The
parameters W_l are trained individually for every model representing label l. The
same holds for the initial state probabilities π_l and the transition probabilities P_l.
When processing a novel time series x we thus obtain L time series of respon-
sibilities according to the labels. We denote the responsibilities of model l for
input x at time point t by $r_{lk}(x(t))$. This gives rise to an aggregated value of
responsibilities $r_l(x) := \sum_{k=1}^{K} \sum_{t=1}^{T} r_{lk}(x(t)) / (KT)$. One can pick the label as
the value l for which this quantity is largest. However, to optimally take prior
class probabilities into account, we use an additional linear classifier with inputs
given by the vectors $(r_l(x))_{l=1}^{L}$ which is trained using a standard SVM.
$$d_\lambda(x, t) = \sum_{d=1}^{D} \lambda_d^2 \,(x_d - t_d)^2 . \quad (3)$$
Relevance learning for GTM has been introduced in [7] for i.i.d. data. For SGTM-
TT, a few modifications are necessary. We use the weighted metric (3) to define
the Gaussians (1). This gives rise to a data log-likelihood which takes into ac-
count the dimensions according to their relevance and, hence, a topographic
mapping which mirrors the relevance weighting scheme.
The question is how to set the relevance parameters λ in such a way that the clas-
sification accuracy of the resulting mapping is as high as possible. We proceed
as in [7] and train the relevance parameters based on priorly given class informa-
tion in a separate step which is interleaved with the standard adaptation of the
SGTM-TT. We rely on the cost function as introduced in generalized learning
vector quantization which refers to the hypothesis margin of the classifier [14]:
$$E(\lambda) = \sum_n \mathrm{sgd}\!\left( \frac{d_\lambda(x^n, t^+) - d_\lambda(x^n, t^-)}{d_\lambda(x^n, t^+) + d_\lambda(x^n, t^-)} \right) \quad (4)$$
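The weighted metric (3) and the margin-based cost (4) can be sketched as follows. The choice of the closest correct and closest wrong prototypes per sample is passed in as hypothetical inputs; in the actual method these come from the trained models:

```python
import numpy as np

def sgd(x):
    """Logistic squashing function used in the GLVQ-type cost (4)."""
    return 1.0 / (1.0 + np.exp(-x))

def d_lambda(x, t, lam):
    """Weighted squared distance (3): sum_d lam_d^2 * (x_d - t_d)^2."""
    return np.sum(lam ** 2 * (x - t) ** 2)

def relevance_cost(X, t_plus, t_minus, lam):
    """Cost (4): sum over samples of sgd((d+ - d-) / (d+ + d-)), where
    t_plus / t_minus hold the closest correct / wrong prototype per sample
    (hypothetical inputs for this sketch)."""
    total = 0.0
    for x, tp, tm in zip(X, t_plus, t_minus):
        dp, dm = d_lambda(x, tp, lam), d_lambda(x, tm, lam)
        total += sgd((dp - dm) / (dp + dm))
    return total

# One sample hitting its correct prototype exactly:
E = relevance_cost([np.array([1.0, 0.0])], [np.array([1.0, 0.0])],
                   [np.array([0.0, 0.0])], lam=np.array([1.0, 1.0]))
```

The cost is differentiable in λ, so the relevance parameters can be adapted by gradient descent interleaved with the SGTM-TT training, as described above.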
Relevant Time Points: Since SGTM-TT relies on HMMs, every time point
depends on its predecessor only. Thus, it is not reasonable to adapt the relevance
of time points to obtain a better representation of data in the GTM-TT models.
However, it is reasonable to judge the relevance of time points resulting from
the GTM-TT models for the final classification, in particular if time series are of
the same or a similar length. This method offers insights into which time points
are particularly discriminative for the given task at hand.
We obtain a relevance profile in the following way: Denote by
$r_l(x(t)) := \sum_{k=1}^{K} r_l^k(x(t)) / K$ the accumulated responsibility of the GTM-TT model l for
data point $x^n$ at time point t. Based on this value, a classification can be based
on the maximum responsibility $r_l(x(t))$ at time point t. For every time point
t, we simply count the number of data points which are classified correctly as
belonging to class l based on the classification for time point t only, averaged
over all data. A global relevance profile results thereof as the sum over all labels.
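The per-time-point counting described above can be sketched as follows (array names and the toy numbers are hypothetical):

```python
import numpy as np

def time_relevance(resp, labels):
    """Per-time-point relevance profile: the fraction of samples classified
    correctly when using only the responsibilities at that time point.

    resp:   (N, L, T) accumulated responsibilities r_l(x_n(t)) of the L models
    labels: (N,) true class indices
    """
    pred = resp.argmax(axis=1)                      # (N, T) class per time point
    return (pred == labels[:, None]).mean(axis=0)   # profile over the T points

# Hypothetical responsibilities for N=2 samples, L=2 classes, T=2 time points:
resp = np.array([[[0.9, 0.2], [0.1, 0.8]],
                 [[0.3, 0.6], [0.7, 0.4]]])
labels = np.array([0, 1])
profile = time_relevance(resp, labels)
print(profile)  # [1. 0.]: both samples correct at t=1, none at t=2
```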
4 Experiments
Multiple Sclerosis Data: The multiple sclerosis (MS) data set is taken from
[2] (IBIS) in the prepared form given in [6]. The data are taken from a clinical
study analyzing the response of MS patients to treatment. Blood samples
enriched with mononuclear cells from 52 relapsing-remitting MS patients were
obtained 0, 3, 6, 9, 12, 18 and 24 months after initiation of IFNβ therapy. This
resulted in 7 measurements over 2 years on average. Expression profiles were
obtained using one-step kinetic reverse-transcription PCR over 70 genes selected
by the specialists to be potentially related to IFNβ treatment. Overall, 8% of
the measurements were missing due to patients missing appointments. After
the two-year endpoint, patients were classified as either good or bad responders,
depending on strict clinical criteria. Bad responders were defined as having
suffered two or more relapses or having a confirmed increase of at least one point
on the expanded disability status scale (EDSS). Of the 52 patients, 33 were
classified as good and 19 as bad responders, see [2].
We use an SGTM-TT with 9 hidden states and 4 basis functions. A 4-fold cross-
validation with 5 repetitions is used. We compare the results with the general
HMM classifier (HMM-Lin) and the discriminative HMM classifier (HMM-Disc-
Lin) proposed in [12]. We also include the results of [2], who originally proposed
the MS study, the analysis of [1], employing a Kalman filter combined with
an SVM approach, and [6], proposing a semi-supervised analysis coupled with a
wrapper and cut-off technique to identify discriminating features.
In Table 1 we summarize the prediction results for the MS data set in com-
parison to the results given in [2]. As expected, results improve by integration of
Table 1. Prediction accuracies (test data) for different models using the MS data.
Improved prediction accuracy employing relevance learning is observed.
[Figure 1 plot: relevance profile over the 70 genes, showing mean, minimum, and
(negated) standard deviation of the relevance values; labeled genes include
MAP3K1, NfKBiB, Caspase 10, IRF8, RIP, Jak2, Flip, IL-4Ra, Stat4 and BAX.]
Fig. 1. Relevance profile as obtained using SGTM-TT with relevance learning. The
plot shows the average relevance (blue/dark), minimal relevance (green/bright) and
the standard deviation, flipped to the negative part of the relevance axis.
[Figure 2 plots: relevance over time points; MS data over time points 1-7,
insect data over time points 0-900.]
Fig. 2. Time-point relevance profiles for the MS data (top) and the insect data set
(bottom). For the insect profile one can clearly identify a peak in the first third of the
experiment, which is when insects climbed the first of two steps. The MS data indicate
that the most relevant time points are at t = 2 and t = 7, with relevances of ≈ 0.75 and
≈ 0.87; this may support a prognosis of the therapy outcome already at the second
time point.
relevance learning compared to the full feature set. Overall, SGTM-TT with
relevance learning achieves an accuracy of 93.43%, which is comparable to
the best reported model but relies on a smaller number of necessary features.
Further, the integrated relevance learning avoids multiple time-consuming runs
within a wrapper approach as for the techniques used in [12,6]. The obtained
relevance profile is depicted in Figure 1 and provides direct access to an inter-
pretation of the relevant features, or marker candidates, pruning irrelevant or
noisy dimensions. The five most significant genes found by relevance learning
cover three genes found by [2] and four genes found by [12,6].
Analyzing the relevance of time points (Fig. 2) for the MS data, we observe
a peak at time point two, indicating that a partial prognosis of the therapy
outcome is possible already at an early stage of the therapy.
[Figure 3 plot: relevance values of the joint angle features, labeled Head Y/Z,
T1 Y/Z, T2 Y/Z, and the cox X/Y/Z, fem Y and tib Y angles of the left and
right legs L1-L3 and R1-R3.]
Fig. 3. Relevance profile of the joint angle features of the insect data.
5 Conclusion
We have presented a novel approach for the analysis of short temporal sequences.
It is based on the idea of introducing supervision and relevance learning into the
Generative Topographic Mapping Through Time. Our results show that we are
Funding: This work was supported by the DFG project HA2719/4-1 to BH, by the
EU project EMICAB (FP7-ICT, No. 270182) to VD, and by the Cluster of Excellence
277 CITEC funded in the framework of the German Excellence Initiative.
References
1. Altman, R.B., Murray, T., Klein, T.E., Dunker, A.K., Hunter, L. (eds.): Biocom-
puting 2006, Proceedings of the Pacific Symposium, Maui, Hawaii, USA, January
3-7. World Scientific (2006)
2. Baranzini, S.E., Mousavi, P., Rio, J., Caillier, S.J., Stillman, A., Villoslada, P., Wy-
att, M.M., Comabella, M., Greller, L.D., Somogyi, R., Montalban, X., Oksenberg,
J.R.: Transcription-based prediction of response to IFNβ using supervised compu-
tational methods. PLoS Biol. 3(1), e2 (2004),
http://dx.doi.org/10.1371%2Fjournal.pbio.0030002
3. Bishop, C.M.: GTM through time. In: IEE Fifth International Conference on Arti-
ficial Neural Networks, pp. 111–116 (1997)
4. Bishop, C.M., Svensén, M., Williams, C.K.I.: GTM: The generative topographic
mapping. Neural Computation 10(1), 215–234 (1998)
5. Borgwardt, K.M., Vishwanathan, S.V.N., Kriegel, H.P.: Class prediction from time
series gene expression profiles using dynamical systems kernels. In: Altman, et al.
[1], pp. 547–558
6. Costa, I.G., Schönhuth, A., Hafemeister, C., Schliep, A.: Constrained mixture es-
timation for analysis and robust classification of clinical time series. Bioinformat-
ics 25(12) (2009)
7. Gisbrecht, A., Hammer, B.: Relevance learning in generative topographic mapping.
Neurocomputing 74(9), 1359–1371 (2011)
8. Hafemeister, C., Costa, I.G., Schönhuth, A., Schliep, A.: Classifying short gene
expression time-courses with Bayesian estimation of piecewise constant functions.
Bioinformatics 27(7), 946–952 (2011)
9. Hammer, B., Villmann, T.: Generalized relevance learning vector quantization.
Neural Networks 15(8-9), 1059–1068 (2002)
10. Lee, J., Verleysen, M.: Generalizations of the Lp norm for time series and its applica-
tion to self-organizing maps. In: Cottrell, M. (ed.) 5th Workshop on Self-Organizing
Maps, vol. 1, pp. 733–740 (2005)
11. Lee, J., Verleysen, M.: Nonlinear Dimensionality Reduction. Springer (2010)
12. Lin, T., Kaminski, N., Bar-Joseph, Z.: Alignment and classification of time series
gene expression in clinical studies. In: ISMB, pp. 147–155 (2008)
13. Olier, I., Vellido, A.: Advances in clustering and visualization of time series using
gtm through time. Neural Networks 21(7), 904–913 (2008)
14. Schneider, P., Biehl, M., Hammer, B.: Distance learning in discriminative vector
quantization. Neural Computation 21, 2942–2969 (2009)
15. Strickert, M., Hammer, B.: Merge SOM for temporal data. Neurocomputing 64,
39–72 (2005)
16. Welch, L.R.: Hidden Markov Models and the Baum-Welch Algorithm. IEEE Infor-
mation Theory Society Newsletter 53(4) (December 2003),
http://www.itsoc.org/publications/nltr/it_dec_03final.pdf
A Multivariate Approach to Estimate
Complexity of FMRI Time Series
1 Introduction
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 540–547, 2012.
c Springer-Verlag Berlin Heidelberg 2012
Multivariate Complexity of FMRI Time Series 541
showed that their measure of complexity was directly related to task difficulty
and mental load during task performance [3]. However, the measure proposed
by Dhamala et al. (2002) requires large sample sizes and depends on manual
inspection steps, two requirements that constrain the usefulness of this measure
for the analysis of fMRI data.
Here, we propose a new method to estimate spatio-temporal complexity of
brain states that is based on multivariate entropy. Unlike in [3], our aim is to
describe changes of global brain states over time rather than globally describe
complexity. One common assumption is that complexity is strongly related to
information content. Hence, entropy functions, which are by definition informa-
tion measures [7], are well-suited candidates to estimate complexity. An intuitive
approach to estimate complexity of brain states would be to compute the multi-
variate differential entropy with respect to the corresponding functional images.
Unfortunately, we do not know the probability density function (pdf) that is nec-
essary for this estimation. If we assume that the functional images are Gaussian
distributed multivariate samples, the differential entropy can be computed by
evaluating a closed-form expression. However, estimating the required Gaussian
distribution parameters from high-dimensional, low sample size data prohibits a
straightforward entropy computation.
In the following, we will introduce Multivariate Principal Subspace Entropy
(MPSE) and show how it is derived from the differential entropy of multivariate
Gaussian distributed data. Then, we apply MPSE to a data set simulated with
a simple model to illustrate the main characteristics of MPSE. Subsequently, we
present results that were obtained by applying MPSE to task-driven fMRI time
series. Finally, we aim to explain our empirical findings by our simple model.
2 Methods
Let $X = [x_1, \dots, x_n] \in \mathbb{R}^{d \times n}$ denote a data matrix representing an fMRI time se-
ries. Each column $x_t$ of $X$ represents an fMRI image corresponding to a discrete
time index $t \in \{1, \dots, n\}$. In order to obtain spatio-temporal complexity estimates
at individual time indices, a temporally sliding window of an odd size $w \in \mathbb{N}^+$
is employed so that $1 < w < n$. Let $X_\tau = [x_{\tau - \lfloor w/2 \rfloor}, \dots, x_{\tau + \lfloor w/2 \rfloor}] \in \mathbb{R}^{d \times w}$ denote
the data matrix that columnwise contains the $w$ windowed images corresponding to
a fixed central window position $\tau \in T_w = \{\lceil w/2 \rceil, \dots, n - \lfloor w/2 \rfloor\}$. The corresponding
sample covariance matrix $\hat{C}_{X_\tau}$ is defined as

$$\hat{C}_{X_\tau} = \frac{1}{w-1} \sum_{i = \tau - \lfloor w/2 \rfloor}^{\tau + \lfloor w/2 \rfloor} \bigl(x_i - \hat{\mu}_{X_\tau}\bigr)\bigl(x_i - \hat{\mu}_{X_\tau}\bigr)^{T} , \quad (1)$$

where the sample mean $\hat{\mu}_{X_\tau}$ is defined as

$$\hat{\mu}_{X_\tau} = \frac{1}{w} \sum_{i = \tau - \lfloor w/2 \rfloor}^{\tau + \lfloor w/2 \rfloor} x_i . \quad (2)$$
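The windowed sample mean and covariance can be sketched as follows (the function name and 0-based indexing are choices made for this sketch; the paper indexes from 1):

```python
import numpy as np

def windowed_stats(X, tau, w):
    """Sample mean (2) and covariance (1) of the w images centred at tau.

    X: (d, n) data matrix, one fMRI image per column; w odd, 1 < w < n.
    """
    h = w // 2
    Xt = X[:, tau - h: tau + h + 1]            # windowed matrix X_tau, (d, w)
    mu = Xt.mean(axis=1, keepdims=True)        # sample mean
    C = (Xt - mu) @ (Xt - mu).T / (w - 1)      # sample covariance, (d, d)
    return mu.ravel(), C

# Hypothetical small example (d = 6 voxels, n = 20 images):
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 20))
mu, C = windowed_stats(X, tau=10, w=5)
```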
542 H. Schütze et al.
Since an fMRI time series usually comprises far fewer images (samples) than
voxels (dimensions), a sample covariance matrix $\hat{C}_{X_\tau}$ is necessarily rank defi-
cient, i.e. $\mathrm{rk}(\hat{C}_{X_\tau}) \le w - 1 < d$. Hence, the eigenvalue spectrum $\Lambda(\hat{C}_{X_\tau}) =
(\lambda_1, \dots, \lambda_d)$ contains at least $d - (w - 1)$ zero eigenvalues.
The aim of the current paper is to estimate the spatio-temporal complexity
of a given data matrix X τ by employing multivariate differential entropy. It is
assumed that the samples X τ are observations drawn from a continuous random
vector x ∈ Rd that has some pdf p. The corresponding differential entropy H[p]
is defined as
H[p] = − p(x) ln p(x) dx .
x∈Rd
where $U_k(\hat{C}_{X_\tau}) = [u_1, \dots, u_k] \in \mathbb{R}^{d \times k}$ denotes the matrix that contains the unit
eigenvectors of $\hat{C}_{X_\tau}$ corresponding to its $k$ largest eigenvalues. By construc-
tion, the sample covariance matrix $\hat{C}_{\tilde{X}_\tau^k}$ is diagonal and contains the $k$ leading
eigenvalues of $\hat{C}_{X_\tau}$ as diagonal elements. Note that $\mathrm{diag}(\hat{C}_{\tilde{X}_\tau^k}) = \Lambda(\hat{C}_{\tilde{X}_\tau^k})$,
i.e. $\mathrm{rk}(\hat{C}_{\tilde{X}_\tau^k}) = k$. Let $\tilde{x}^k \in \mathbb{R}^k$ denote the random vector representing the $k$-
dimensional principal subspace projection.
Note that the differential entropy (4) is well-defined due to the rank sufficiency of
$\hat{C}_{\tilde{X}_\tau^k}$. We call this quantity Multivariate Principal Subspace Entropy (MPSE).
$\mathrm{MPSE}(X_\tau)$ can be computed as follows:

$$\mathrm{MPSE}(X_\tau) = H[q] = \frac{1}{2} \sum_{j=1}^{k} \ln \lambda_j + \frac{k}{2} \bigl( 1 + \ln(2\pi) \bigr) . \quad (5)$$
3 Data
3.1 Simulated Data
To assess the behavior of the spatio-temporal complexity estimator MPSE, we
simulated a number of d-variate time series that comprise phases with states of
different complexity. To this end, we modeled temporally alternating off-state
phases (off-state a) and more complex on-state phases (on-state b). We assume
that these two phases are represented by distinct temporally changing processes
that lie in orthogonal subspaces of different intrinsic dimensionalities da , db ∈
N+ , and also that the on-state phase process is higher dimensional than the
off-state phase process, i.e. db > da .
Let $X_a \in \mathbb{R}^{d_a \times n}$ ($X_b \in \mathbb{R}^{d_b \times n}$) be randomly generated by drawing $n \in \mathbb{N}^+$
samples from the $d_a$-variate ($d_b$-variate) Gaussian distribution $N(0_{d_a}, I_{d_a})$
($N(0_{d_b}, I_{d_b})$).
In the first simulation setting (exclusive setting), we assume that both states
a and b are temporally exclusive, i.e. either off-state a or on-state b is active. For
this setting, simulation parameters are chosen as follows: na = 15, nb = 10 (i.e.
n = 5na + 4nb = 115). In the second simulation setting (transition setting), we
additionally model a transition phase c for each alternation of states. A transition
phase is modeled as the concurrency of off-state a and on-state b, i.e. there are
short phases of length nc ∈ N+ in which both states are simultaneously active.
In other words, a transition phase does not enforce the states to be temporally
exclusive and incorporates that one process has a certain shut-down delay while
another process already becomes active. Note that the dimensionality of this
transition phase is dc = da + db. For this setting, we chose na = 13, nb = 8 and
nc = 2 (i.e. n = 5na + 4nb + 8nc = 113).
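A sketch of the exclusive setting described above (the helper name and the ambient dimension d are choices for this sketch; d only needs to satisfy d ≥ da + db):

```python
import numpy as np

def simulate_exclusive(d, da, db, na, nb, blocks=5, seed=0):
    """Exclusive setting: alternate off-state a (da active dimensions) and
    on-state b (db active dimensions) living in orthogonal subspaces of R^d."""
    rng = np.random.default_rng(seed)
    cols = []
    for i in range(blocks):
        off = np.zeros((d, na))
        off[:da] = rng.standard_normal((da, na))           # state-a subspace
        cols.append(off)
        if i < blocks - 1:                                 # 4 on-state phases
            on = np.zeros((d, nb))
            on[da:da + db] = rng.standard_normal((db, nb)) # orthogonal subspace
            cols.append(on)
    return np.hstack(cols)

X = simulate_exclusive(d=100, da=20, db=60, na=15, nb=10)
print(X.shape)  # (100, 115), matching n = 5*na + 4*nb = 115
```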
In this section we introduce fMRI data from a task-driven fMRI study that
has been published elsewhere [1]. Six female subjects participated in an fMRI
experiment that comprised ten runs. Each run comprised four emotional periods
(20s) alternating with four resting periods (18s - 24s). During an emotional
period subjects were asked to submerge themselves into an emotional situation
(joy, anger, disgust, fear, sadness) and to facially express their emotional feelings.
A single word cue (e.g. ’joy’) on a black screen signaled an emotional period.
During resting periods subjects were asked to relax. A neutral fixation cross
signaled a resting period. There were two runs for each emotion per subject, i.e.
60 fMRI time series in total (6 subjects × 2 runs × 5 emotions). Each fMRI
time series comprises 80 whole brain functional images, that were acquired with
a TR (repetition time) = 2s.
FMRI time series were preprocessed including slice acquisition time correc-
tion, concurrent spatial realignment and correction of image distortions by use
of individual static field maps, normalization into standard MNI space and spa-
tial smoothing (10 mm Gaussian kernel) using SPM5 [9]. Only voxels within a
standard anatomical gray matter brain mask [8] were considered in our analy-
sis (i.e. 46,556 voxels per image). Furthermore, from each voxelwise fMRI time
series the temporal mean was subtracted and the linear trend was removed. A
detailed description of the fMRI experiment can be found in [1].
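The voxelwise mean subtraction and linear detrending mentioned above can be sketched as a least-squares fit per voxel (function name chosen for this sketch; the toy data are hypothetical):

```python
import numpy as np

def detrend_voxelwise(X):
    """Remove the temporal mean and linear trend from each voxel time series.
    X: (d, n) with one voxel per row; returns the residuals."""
    n = X.shape[1]
    A = np.stack([np.arange(n), np.ones(n)], axis=1)   # design: slope + intercept
    coef, *_ = np.linalg.lstsq(A, X.T, rcond=None)     # per-voxel line fit
    return X - (A @ coef).T

# A pure linear trend and a genuine signal:
t = np.arange(10, dtype=float)
X = np.vstack([2.0 * t + 3.0, np.sin(t)])
Y = detrend_voxelwise(X)
```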
4 Results
Fig. 1 shows MPSE time courses of the simulated data described in section 3.1
using a window of size w = 5.
For the exclusive setting in our simulation and a rather large difference in
state dimensionality (da = 20, db = 60) the MPSE time course shows distinctly
higher values during on-state phases than during off-state phases (Fig. 1 (a)).
[Figure 1 plots: MPSE time course vs. time index for four panels:
(a) da = 20, db = 60, exclusive setting; (b) da = 55, db = 60, exclusive setting;
(c) da = 20, db = 60, transition setting; (d) da = 55, db = 60, transition setting.]
Fig. 1. Average MPSE time courses of simulated time series with standard error
(N=30) using a window size w = 5. From each single MPSE time course the tempo-
ral mean was subtracted before averaging. Light gray (white) bars illustrate simulated
on-state (off-state) phases of dimensionality db (da ). Dark gray bars indicate transition
phases of dimensionality (da + db ).
[Figure 2 plots: MPSE time course vs. time point for two panels:
(a) w = 3; (b) w = 5.]
Fig. 2. Average MPSE time courses of 60 fMRI time series with standard error across
subjects (N=6) using window size w. The temporal mean of each single time course was
subtracted before averaging. Gray (white) bars indicate emotional (resting) periods.
obviously results in a time course that reflects the alternation between emotional
and resting periods in the fMRI experiment. For both window sizes the average
MPSE time course has higher values during emotional periods and lower values
during resting periods, i.e. there are task-condition specific levels. Interestingly,
the obtained time courses show distinct peaks at task transitions. Increasing the
window size from w = 3 to w = 5 seems to temporally smooth the average MPSE
time course. For the larger window w = 5 the off-state transition peaks become
smaller, but are still clearly visible.
5 Discussion
For simulated data with a large difference in state dimensionality, we showed that
MPSE is capable of successfully detecting on-state and off-state phases. MPSE con-
siders only w samples at a time and, hence, (w − 1) non-zero eigenvalues due to
w < da < db. Although the population eigenvalues are equally sized (isotropic
Gaussian model), the estimated eigenvalue spectra give different MPSE levels for
different intrinsic dimensionalities (i.e. a high MPSE level for high dimensionality
and a low MPSE level for low dimensionality). This can be explained by the
fact that under these high-dimensional, low sample size simulation settings the esti-
mated eigenvalues are Marčenko–Pastur distributed [5]. Roughly speaking, as the
ratio α between sample size and dimensionality decreases, estimated non-zero
eigenvalues and the variance of the corresponding spectra increase. Note that the
Marčenko–Pastur law actually holds for the asymptotic case of infinite sample
sizes for some fixed α. Even though we are dealing with very small sample sizes,
MPSE increases, due to larger eigenvalue estimates, as the intrinsic dimension-
ality increases. The same argument explains the high third MPSE level during
transition phases. The transition peaks are rather high compared to the on-state
and off-state levels when the difference in state dimensionalities is small. This is not
surprising, as – by construction – the intrinsic dimensionality of a transition
phase is almost twice as high as the dimensionality of an on- or off-state.
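The eigenvalue inflation argument above can be checked numerically. This is a sketch with hypothetical parameters, not the paper's simulation: with only w samples of an isotropic Gaussian, the leading sample eigenvalues (and hence MPSE) grow with the intrinsic dimensionality d:

```python
import numpy as np

def mean_log_eig_sum(d, w, k, trials=30, seed=0):
    """Average of sum_j ln(lambda_j) over the k leading sample-covariance
    eigenvalues of w samples drawn from N(0, I_d)."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(trials):
        C = np.cov(rng.standard_normal((d, w)))        # (d, d), rank <= w - 1
        lam = np.sort(np.linalg.eigvalsh(C))[-k:]      # k leading eigenvalues
        vals.append(np.log(lam).sum())
    return float(np.mean(vals))

low = mean_log_eig_sum(d=20, w=5, k=4)
high = mean_log_eig_sum(d=60, w=5, k=4)
print(high > low)  # True: larger intrinsic dimensionality inflates eigenvalues
```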
For the real-world data, we assumed that during emotional periods brain ac-
tivity of subjects shows an increased complexity, due to the neural processing
induced by the experimental task. During resting periods, in contrast, complex-
ity was expected to decrease. Thus, we expected a correspondence between the
time course of a spatio-temporal complexity estimator and the time course of
the experimental task. MPSE estimates the time course of the experimental task
with high precision. There were significantly higher MPSE levels during emo-
tional than during resting periods. Furthermore, the MPSE time course shows
distinct peaks at the state transitions.
In sum, we have introduced MPSE as a measure to estimate spatio-temporal
complexity of fMRI time series. We have shown that MPSE is capable of detecting
different experimental conditions for a real fMRI data set. Entropy levels were
higher during emotional periods and lower during resting periods. Furthermore,
we found evidence that during transitions between emotional and resting periods
complexity increases. Employing a simple model, we could reproduce (1) MPSE
level differences and (2) MPSE task transition peaks in simulated time series
that comprise state phases of different intrinsic dimensionalities.
References
1. Anders, S., Heinzle, J., Weiskopf, N., Ethofer, T., Haynes, J.D.: Flow of affective
information between communicating brains. NeuroImage 54(1), 439–446 (2011)
2. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York
(2006)
3. Dhamala, M., Pagnoni, G., Wiesenfeld, K., Berns, G.S.: Measurements of brain ac-
tivity complexity for varying mental loads. Physical Review E - Statistical, Nonlinear
and Soft Matter Physics 65(4), 041917-1–041917-7 (2002)
4. Holmes, A.P., Poline, J.B., Friston, K.J.: Characterizing brain images with the gen-
eral linear model. In: Human Brain Function, pp. 59–84. Academic Press, USA
(1997)
5. Hoyle, D.C.: Automatic PCA Dimension Selection for High Dimensional Data and
Small Sample Sizes. Journal of Machine Learning Research 9, 2733–2759 (2008)
6. Johnson, R.A., Wichern, D.W.: Applied Multivariate Statistical Analysis. Prentice
Hall, New Jersey (1998)
7. Shannon, C.E.: A Mathematical Theory of Communication. The Bell System Tech-
nical Journal 27, 379–423, 623–656 (1948)
8. Tzourio-Mazoyer, N., Landeau, B., Papathanassiou, D., Crivello, F., Etard, O., Del-
croix, N., Mazoyer, B., Joliot, M.: Automated anatomical labeling of activations in
SPM using a macroscopic anatomical parcellation of the MNI MRI single-subject
brain. NeuroImage 15(1), 273–289 (2002)
9. Wellcome Trust Centre for Neuroimaging, Statistical Parametric Mapping, SPM5,
http://www.fil.ion.ucl.ac.uk/spm/software/spm5/
Neural Architectures for Global Solar
Irradiation and Air Temperature Prediction
1 Introduction
Accurate predictions of global solar irradiation and air temperature are neces-
sary for anticipative control in autonomous energy management systems that
use solar energy. For example, the system could then anticipate future produc-
tion and needs, and adapt its behavior accordingly. This information may be
provided by an external numerical weather prediction system, but to avoid
dependence on such a service, or its excessive cost, some system designers require
autonomous predictions. Naive models such as persistence or monthly means can
then be effective [3]. Yet, weather prediction can be seen as a time series extrap-
olation problem: machine learning algorithms can then be adapted to computing
a predictor for future values of meteorological time series.
Among other models, the Multi-Layer Perceptron (MLP) architecture, and
some of its variants (e.g. Time Delayed Neural Networks, Recurrent Wavelet
Neural Networks), have already been advocated for meteorological time series
predictions [7], and specifically for solar irradiation prediction [2], [8], [9]. Hourly
or daily cumulated irradiation predictions were considered. Some authors pro-
posed a combination of MLP and ARMA models, either as an additive model
[9], or switching their respective usage according to recent observed errors [8].
Most works assume that data inputs come from a stationary process; real input
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 548–556, 2012.
© Springer-Verlag Berlin Heidelberg 2012
where $w_{kd}$ (respectively $w_{sk}$) denotes the weight from the $d$th input node (respectively $k$th hidden node) to the $k$th hidden node (respectively $s$th output node), and $f = \{f_s\}$ is the model output. The full set of weights is generally summarized as $w = \{\{w_{kd}\}, \{w_{sk}\}\}$. Hidden and output layer nodes also have bias weights, implicitly associated with an additional input dimension clamped at 1 in formula (1). $\sigma(\cdot)$ is a non-linear function, chosen as the logistic function in this paper.
Learning an MLP then amounts to setting $w$ such that $f(x)$ correctly approximates $y$.
Values for $w$ are fitted to a set $\{x_n, y_n\}_{n=1\ldots N}$ using the back-propagation algorithm [4]. This algorithm optimizes the quadratic loss of the model output with respect to the target vectors $y_n$. The estimated model is then able to predict $y$ for an unknown $x$. Let us note that the number of weights $|w|$ in the MLP roughly equals $K(D + S)$.
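The forward pass of formula (1) and the weight count above can be sketched as follows; the sizes D = 29, K = 10 and S = 8 are illustrative placeholders, not the paper's fitted values.

```python
import numpy as np

def sigmoid(a):
    """Logistic activation, as chosen in the paper."""
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W_in, W_out):
    """One-hidden-layer MLP: D inputs -> K logistic hidden units -> S outputs.
    W_in has shape (K, D + 1) and W_out has shape (S, K + 1); the extra
    column holds the bias weights, fed by an input clamped at 1."""
    x1 = np.append(x, 1.0)            # bias input clamped at 1
    h = sigmoid(W_in @ x1)            # K hidden activations
    h1 = np.append(h, 1.0)
    return W_out @ h1                 # S model outputs f_s(x)

# |w| roughly equals K(D + S); the exact count adds the K + S bias weights.
D, K, S = 29, 10, 8
W_in = np.zeros((K, D + 1))
W_out = np.zeros((S, K + 1))
n_weights = W_in.size + W_out.size    # = K(D + S) + K + S
```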
Fig. 1. Alternatives for input and output strategy: a) the relative predictor uses the
same MLP for all possible hc values, b) the absolute predictor specializes one MLP for
each admissible hc value, c) the daily predictor uses the same MLP for all possible
hp values, and d) the tri-hourly predictor specializes one MLP for each admissible hp
value.
4 Experiments
4.1 Data Sets Description and Pre-processing
The data sets used in this experimental section were recorded in Saclay (France, 48.72°N, 2.15°E) and in Marcoule (France, 44.14°N, 4.71°E). These data sets are sampled at different time steps (10 minutes for Saclay's data, 30 minutes for Marcoule). Tri-hourly data sets are first built for each location with a moving average. Solar irradiation measurements were converted to an instantaneous scale (W m$^{-2}$) when needed. For Saclay, the available meteorological time series span from January 1, 1996 to December 31, 2004 (9 years of data, 3.3% of missing values), and for Marcoule the span is from January 1, 1999 to December 31, 2007 (9 years of data, 6.1% of missing values).
Time series data should be converted to an acceptable stationary process
before being used in a learning procedure [6], [8], [9]. For meteorological time
series, strong yearly and daily cycles have to be taken into account. To this aim,
let us define the monthly-hourly sets of a time series data set:
This results in 96 sets (12 months and 8 time slots). Then the time series can
be normalized in the following way :
$$x_t^{\mathrm{norm}} = \frac{x_t - \mathrm{mean}(x_{m,h})}{\mathrm{std}(x_{m,h})} \qquad (3)$$
$\forall t$, s.t. $m = \mathrm{month}(t) \wedge h = \mathrm{time\ slot}(t)$.
The transformation (3) aims at making the time series' mean and variance approximately constant $\forall t$, which is the weak stationarity definition. Note that irradiation measurements are 0 everywhere for the 21h, 0h and 3h time slots. The transformation (3) is then ignored in these cases.
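Transformation (3) can be sketched as below; the (month, slot) encoding of timestamps is an assumption about the data layout, and constant (all-zero) monthly-hourly sets are skipped, as the text prescribes for the night irradiation slots.

```python
import numpy as np

def monthly_hourly_normalize(x, months, slots):
    """Weak-stationarity transform (3): standardize each value by the mean
    and standard deviation of its monthly-hourly set x_{m,h}
    (12 months x 8 tri-hourly slots). `months` in 1..12 and `slots` in 0..7
    index each observation x_t; constant sets are left untouched."""
    x = np.asarray(x, dtype=float)
    months, slots = np.asarray(months), np.asarray(slots)
    x_norm = np.array(x)
    for m in range(1, 13):
        for h in range(8):
            mask = (months == m) & (slots == h)
            if not mask.any():
                continue
            mu, sd = x[mask].mean(), x[mask].std()
            if sd > 0:                      # skip all-zero night slots
                x_norm[mask] = (x[mask] - mu) / sd
    return x_norm
```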
where $y_n$ is the time series to predict, $\hat{y}_n$ the prediction, and $y_{m,h}$ is the monthly-hourly set associated with the time series data $y_n$ (see eq. (2)). Errors are thus scaled to the domain they relate to, and can be seen as a percentage of the maximal possible error on the original scale.
As noted at the end of section 4.1, observed irradiation time series are 0 every-
where on 3 slots. These slots can then be ignored when forming the input vectors;
i.e. D = 29 instead of 32. Moreover, predictions for these slots are trivial: this
means that associated models can be removed from tri-hourly architectures for
irradiation (e.g. absolute-tri-hourly irradiation predictor uses 40 models, instead
of 64). Consequently, these trivial predictions should be ignored when computing
or aggregating quality results.
Let us note that expression (4) depends on the latitude of the measurements
through the zenith angle. The mixed predictor of irradiation in table 1 is the best
estimator under the assumption of an even distribution around $f_{cs}^{max}(t)/2$. Experimental evaluations of these reference models are reported in table 2, along with
a comparison to the relative-daily predictor, the simplest among the proposed
architectures. 5 independent experiments are used to estimate the variability of
learning the neural architecture. Results are averaged according to each admis-
sible horizon value.
With the exception of the adjusted persistence, naive predictors do not get up-
dated by recently observed data. Thus their nRMSE values remain independent
of the predicting horizon. Among naive predictors, monthly means (respectively
persistence) perform best for irradiation (respectively temperature) prediction.
Temperature prediction errors are reduced by the adjusted persistence predictor for horizons less than or equal to 6 hours, e.g. by up to 46.0% for Saclay at the 3-hour horizon. The best naive irradiation prediction errors for the 3-hour horizon are largely reduced by the neural model (by up to 27.7%). This reduction decreases monotonically as the horizon increases, e.g. down to 4.7% for Marcoule at the 24-hour horizon.
These observations mostly hold for temperature predictions, with the notable exception of the 3- and 6-hour horizons. Indeed, the neural predictor's error improvement over the adjusted persistence rises from 28.9% for the 3-hour horizon, to
Table 2. nRMSE performance measures of naive and neural predictors. Results are
averaged according to predicting horizons. Bold-faced results indicate the best predictor
for each experimental setting.
32.4% for the 6-hour horizon. As for the irradiation prediction, this reduction decreases down to 10.1% as the horizon increases.
The relative-daily predictor thus largely improves naive approaches for short-
term prediction. This improvement becomes gradually less important as the
predicting horizon is increased. This can be seen as a decay of the information
provided by recently observed data.
becomes almost 0% for horizons greater than 12 hours. Also, with the exception
of the 3-hour horizon, the absolute daily and tri-hourly architectures perform
similarly well.
The average hidden layer size K is reported for each experimental setting in table 3. The average numbers of operations for learning and prediction associated with each architecture are also reported, using the fact that the back-propagation algorithm is $O(|w|^2)$, and prediction with MLP models is $O(|w|)$. Predictions by absolute architectures are less costly (from 1.5 to 8 times). The decomposition implied by the absolute architectures thus leads to more parsimonious MLPs.
The required number of operations for learning is also dramatically reduced by the absolute architectures (up to 7.4 times). The learning costs of the absolute-daily and tri-hourly architectures are of the same order of magnitude: the preference may be influenced by the site where the data was recorded (i.e. climate conditions). However, predictions by the absolute-daily architecture are much cheaper in processing time (up to 3 times), which may be a much more decisive criterion in the context of embedded systems.
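A back-of-the-envelope comparison illustrates the complexity facts above; the hidden layer sizes used here are hypothetical placeholders, not the averaged K values of the paper's Table 3.

```python
# Learning is O(|w|^2) per MLP and prediction is O(|w|), with |w| ~ K(D + S).
def n_weights(D, K, S):
    return K * (D + S)

# Relative predictor: one shared MLP (assumed K = 30, S = 8 output slots).
w_rel = n_weights(D=29, K=30, S=8)
# Absolute tri-hourly: many small specialized MLPs (assumed K = 5, S = 1);
# a single prediction only touches one of them.
w_abs = n_weights(D=29, K=5, S=1)

predict_rel = w_rel                  # cost of one forward pass, big MLP
predict_abs = w_abs                  # cost of one forward pass, small MLP
speedup = predict_rel / predict_abs  # parsimony gained by the decomposition
```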
5 Conclusion
In this paper, a general view of neural architectures for time series prediction was proposed, and applied to tri-hourly meteorological data. This view unifies previous works with the absolute strategy, which decomposes the prediction problem according to the current time slot under consideration. In addition, the tri-hourly strategy specializes each MLP to a specific predicted time slot.
After validating the performance of neural prediction against reference naive models, several architectures were compared. Maximal gains are obtained for the shortest-term predictions: when the horizon is greater than 9 hours, all architectures perform equally well. The architectures were also compared from a computational point of view: absolute architectures involve a much lower number of operations
for learning and prediction. Besides, absolute architectures open perspectives for
variable selection: the Bayesian scheme proposed in [4] could be used to derive
a well-founded variable selection procedure on each MLP of an architecture.
Consequences for performance and complexity require further investigation, but the predictor would gain in interpretability. Indeed, each current time slot or predicting horizon could then be associated with its own set of relevant inputs.
References
1. Aguiar, R., Collares-Pereira, M.: Statistical properties of hourly global radiation.
Solar Energy 48(3), 157–167 (1992)
2. Cao, J., Lin, X.: Study of hourly and daily solar irradiation forecast using diagonal
recurrent wavelet neural networks. Energy Conversion and Management 49, 1396–
1406 (2008)
3. Giebel, G., Kariniotakis, G.: Best practice in short-term forecasting – a users guide. In: European Wind Energy Conference & Exhibition (2007)
4. Nabney, I.T.: Netlab: Algorithms for Pattern Recognition. Springer (2002)
5. Perrin de Brichambaut, C.: Météorologie et énergie: l’évaluation du gisement solaire.
La Météorologie 6(5), 129 (1976)
6. Qi, M., Zhang, G.P.: Trend time series modeling and forecasting with neural net-
works. IEEE Transactions on Neural Networks 19(5), 808–816 (2008)
7. Smith, B.A., McClendon, R.W., Hoogenboom, G.: Improving air temperature pre-
diction with artificial neural networks. IJCI 3, 179–186 (2006)
8. Voyant, C., Muselli, M., Paoli, C., Nivet, M.-L.: Numerical weather prediction (NWP) and hybrid ARMA/ANN model to predict global radiation. Energy 39, 341–355 (2012)
9. Wu, J., Chan, C.K.: Prediction of hourly solar radiation using a novel hybrid model of ARMA and TDNN. Solar Energy 85, 808–817 (2011)
Sparse Linear Wind Farm Energy Forecast
1 Introduction
with $\hat{L}_S(W)$ denoting the quadratic loss of a linear model with weights $W$ over a sample $S$, and $\hat{R}(W)$ the non-differentiable but still convex $\ell_1$ or $\ell_{2,1}$ norm of $W$. This general formulation places the above problems under the scope of Proximal Optimization [3], which exploits the concept of proximal operators to arrive at a general optimization procedure. Moreover, it makes it quite easy to extend the previous methods. For instance, we will consider here a group version of Elastic-Net, simply by mixing in (1) the $\ell_{2,1}$ and $\ell_2$ norms of $W$. We shall apply this set up to the problem of predicting wind farm energy production, of considerable interest nowadays and fitting squarely in the previous set up.
The usual approach uses historical wind energy production data and fore-
casts derived from numerical weather prediction (NWP) systems, in our case,
the Agencia Española de Meteorología (AEMET; [1]). These systems provide forecasts for the nodes of a geographical grid, typically at resolutions starting at 0.16° or even finer, for either a surface level derived from a smooth orographical model or for several constant pressure levels that go from sea level to a height of about 20 km. These forecasts are usually given every three hours.
Moreover, many meteorological variables are available at each level and the num-
ber of possible features may clearly become very large. To manage this, a first
obvious approach is to fix a square of grid points centred at the wind farm and
consider for each grid point a number of surface variables. However, these are typically given 10 m above the grid point height, which may bear little relationship to the actual altitude of a wind farm. The alternative is to consider
NWP forecasts for a number of pressure levels but this will of course augment
feature dimension and, thus, make the sparse methods attractive modelling tools.
This approach will be applied to the study of wind energy forecasting at the
Sotavento wind farm, situated in the Galicia region of north-western Spain. We shall work with a 6 × 6 grid at a 0.25° resolution, 6 pressure layers and 5 meteorological variables. The total dimension is thus 1,080. We will consider a one-year-long training sample, whose size is thus 2,920, i.e., about three times the pattern dimension and well below the linear regression rule of thumb of having 10
patterns per dimension. Regularization is thus mandatory but sparse regression
also comes in very naturally. In fact, and as we shall see, sparse models beat ridge
regression using a quite small number of the features available. Moreover, sparse
models also shed light on the predictive structure of NWP variables, something
that could be exploited when considering stronger, more complex methods than
standard regression. The paper is organized as follows. In Sect. 2 we will briefly
review the theory of Proximal Optimization and its training algorithms, and
describe how the previous sparse regression problems fit in this set up. In Sect.
3 these models will be applied to wind energy prediction for the Sotavento farm
and the paper ends with a discussion and conclusions section.
and parametrized by a weight vector $W$, such that $f_W(X^{(p)}) \approx y^{(p)}$, $\forall p$. To make this more precise, we introduce a convex loss function $L_S : \mathcal{H} \to \mathbb{R}$ and look in principle for an $f_W^*$ that minimizes $L_S(f_W)$. However, we may also want to control model complexity, for which we may introduce a sparsity-controlling convex term $R(f_W)$ as well as a regularization term $\|f_W\|_{\mathcal{H}}^2$. These considerations lead to the general optimization problem

$$\min_{f_W \in \mathcal{H}} \; \frac{1}{2} L_S(f_W) + \lambda_1 R(f_W) + \frac{\lambda_2}{2} \|f_W\|_{\mathcal{H}}^2, \qquad (2)$$
where $\lambda_1$ and $\lambda_2$ are the parameters that determine the relative importance of the regularization terms against the error term. Notice that if $\lambda_2 > 0$, the objective function is strictly convex. While the first and third terms are differentiable, $R(f_W)$ will usually not be so. To deal with this, we will consider the
problem (2) under the framework of Proximal Methods, a set of techniques to
solve non–differentiable optimization problems in an iterative way. The starting
point is the fact [6] that the solution $f_W^*$ of (2) satisfies, for any $\eta > 0$, the fixed point equation

$$f_W^* = \mathrm{prox}_{\frac{\lambda_1}{\eta};R}\left(\left(1 - \frac{\lambda_2}{\eta}\right) f_W^* - \frac{1}{2\eta} \nabla L_S(f_W^*)\right), \qquad (3)$$
The solution of (4) is problem dependent and finding it is the main issue when
applying proximal optimization. If it is known, (3) justifies an iterative algorithm
based on the steps
$$f_W^{(t)} = \mathrm{prox}_{\frac{\lambda_1}{\eta_t};R}\left(\left(1 - \frac{\lambda_2}{\eta_t}\right) f_W^{(t-1)} - \frac{1}{2\eta_t} \nabla L_S(f_W^{(t-1)})\right).$$
There are several general purpose algorithms that apply this iterative scheme.
Here we will use the Fast Iterative Shrinkage–Thresholding Algorithm (FISTA;
[2]) which automatically determines the step length ηt .
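The iteration above can be sketched with a fixed-step proximal-gradient loop for the $\ell_1$ case; this is a simplified stand-in for FISTA (no momentum, no automatic step length), written under the paper's scaling of iteration (3) with $L_S(W) = \frac{1}{N}\|XW - Y\|^2$.

```python
import numpy as np

def soft_threshold(v, lam):
    """Proximal operator of lam * ||.||_1 (soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_grad_lasso(X, Y, lam1, lam2=0.0, n_iter=2000):
    """Fixed-step sketch of iteration (3) with R = ||.||_1; FISTA would add
    momentum and an automatically determined step length on top of this."""
    N, D = X.shape
    eta = 1.1 * (np.linalg.norm(X, 2) ** 2 / N + lam2)   # safe fixed step
    W = np.zeros(D)
    for _ in range(n_iter):
        grad = 2.0 / N * X.T @ (X @ W - Y)               # gradient of L_S
        W = soft_threshold((1 - lam2 / eta) * W - grad / (2 * eta),
                           lam1 / eta)
    return W
```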
Sparse regularized linear regression fits nicely in this set up. In that case, $\mathcal{H} = \mathbb{R}^D$, the basic model is $f_W(X) = X \cdot W$ and the loss function is $L_S(f_W) = \frac{1}{N}\|XW - Y\|^2$, where $X$ is the matrix collecting all the inputs $X^{(p)}$ in its rows, and $Y$ is the vector of all the desired outputs $y^{(p)}$. All the mentioned linear
sparse models can be derived for particular choices of the functional R and the
parameters λ1 and λ2 , as summarized in Table 1. The simplest case is to fix
λ1 = λ2 = 0, which leads to the Ordinary Least Squares (OLS) model. The
resulting optimization problem can be easily solved analytically (see Table 1),
but if no regularization is included, OLS models are likely to over–fit the sample
when the feature dimension D is comparable with sample size. The simplest way
Table 1. Correspondence between the regularized linear models and problem (2)
to avoid this is just to take some λ2 > 0 while keeping λ1 = 0. This leads to
Regularized Least Squares (RLS; [4]), which also has a closed-form solution.
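The RLS closed form can be sketched as follows, under the scaling of problem (2) with the quadratic loss above; the exact constant multiplying $\lambda_2$ in the normal equations depends on that convention.

```python
import numpy as np

def rls_fit(X, Y, lam2=0.0):
    """Closed-form (Regularized) Least Squares for the objective
    (1/2)L_S + (lam2/2)||W||^2 with L_S = (1/N)||XW - Y||^2:
    solve (X^T X + N*lam2*I) W = X^T Y.
    lam2 = 0 recovers plain OLS (assuming X^T X is invertible)."""
    N, D = X.shape
    return np.linalg.solve(X.T @ X + N * lam2 * np.eye(D), X.T @ Y)
```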
A first choice for the functional $R$ is to use the $\ell_1$ norm, $R_{LA}(f_W) = \frac{1}{D}\|W\|_1$.
Setting $\lambda_1 > 0$ and $\lambda_2 = 0$ and using this functional, we recover the Lasso (LA; [8]) algorithm. The $\ell_1$ norm encourages sparse models, which can be seen as an implicit feature selection, because the inputs associated with zero coefficients are simply discarded. Because of its non-differentiability, LA models will be trained using FISTA, as explained above. The proximal operator for the $\ell_1$ norm is given by soft thresholding [5] as $(\mathrm{prox}_{\lambda;\|\cdot\|_1}(x))_i = x_i\left(1 - \frac{\lambda}{|x_i|}\right)_+$. Notice
that in LA all coefficients are treated individually. In certain circumstances we
may want to have a grouping effect in the features, so as to detect relevant
groups. A way to achieve this is to enforce that all the coefficients in a group
should be active or inactive at the same time. This is what the Group Lasso (GL;
[9]) algorithm obtains using a mixed $\ell_{2,1}$ norm as regularizer, i.e., $R_{GL}(f_W) = \frac{1}{D}\sum_{v=1}^{V}\|W_v\|_2$, where $W_v$ denotes the coefficients of the $v$th group.
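The grouping effect of the $\ell_{2,1}$ regularizer comes from its proximal operator, block soft thresholding, which shrinks each group jointly; a minimal sketch (the list-of-index-sets format for groups is an assumed input convention):

```python
import numpy as np

def prox_group_l21(W, lam, groups):
    """Proximal operator of lam * sum_v ||W_v||_2 (block soft thresholding):
    each group of coefficients is shrunk jointly, so a whole group is either
    kept or zeroed at once. `groups` is a list of index lists."""
    out = np.array(W, dtype=float)
    for g in groups:
        norm = np.linalg.norm(out[g])
        shrink = max(1.0 - lam / norm, 0.0) if norm > 0 else 0.0
        out[g] = out[g] * shrink
    return out
```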
3 Numerical Experiments
In this section we will apply the previous algorithms to study the prediction of
the energy production of a wind farm. We will work with the Sotavento wind
farm [7], located at 43.34◦ N, 7.86◦ W and that makes production data publicly
available. The usual features in wind power forecasting are surface predictions of meteorological variables. We shall work first with the following: $V$, the norm of the wind speed; $V_x$, its x component; $V_y$, its y component; the temperature $T$; and the pressure $P$. They will be considered over a rectangular, 0.25° resolution, 6 × 6 point grid surrounding Sotavento. The dimension for surface prediction is thus 180 = 6 × 6 × 5, large but not excessively so. We will work with a 1-year training set and a 2-month test period. Meteorological forecasts are only available every three hours; thus, we have eight patterns per day and the training sample size is 2,920. We normalize the wind energy production target values to the [0, 1] interval as percentages of the total installed wind power in Sotavento.
In any case, many more variables are available on the 17 constant pressure levels for which AEMET gives NWP forecasts, although obviously not all of them will have an effect on the energy. These levels have a 50 hPa resolution, and over them pressure is constant and no longer a predictive variable; we substitute it by geopotential heights. A first level selection can be done using Sotavento's elevation, with an average of about 600 m. The first 11 levels are consistently located much higher; moreover, correlation plots with wind farm production (not included) show that they do not contain useful information, and we have discarded them outright. We are left with the lowest 6 layers, and the total feature dimension is then 6 × 6 × 6 × 5 = 1,080. Thus, the sample size is about 3 times the
dimension. However, it is not clear that all of these features have the same effect
(if any) on the wind energy production and they may handicap full regression
models even if they are regularized. Sparse methods can thus help us first to
find better models and, second, to better understand which grid points, pressure
levels and variables are the most useful to improve predictions.
To do so, we will use the models described in Sect. 2. For the case of group
algorithms (GL and GENet), we consider as a group the 5 meteorological vari-
ables evaluated over a grid point. As usually done in wind energy, the mod-
els are evaluated using the Mean Absolute Error over the test set, $\mathrm{MAE} = \frac{1}{N}\sum_{p=1}^{N} |X^{(p)} \cdot W - y^{(p)}|$. We will also report the standard deviation $\sigma_{AE}$ of
the absolute errors, although these are rather conservative as we do not perform any sample size correction (assuming independence for these errors would lead to dividing the given values by $\sqrt{N}$ and, hence, to much smaller values). An important
issue for most of the algorithms used is the estimation of the hyper-parameters $\lambda_1$ and $\lambda_2$ that configure each model. This is done as a search over a grid representation of the parameter space, working on a logarithmic scale from $10^{-3}$ to $10^{3}$ with steps of $10^{0.10}$. For the algorithms that involve a bi-dimensional grid (ENet and GENet), the step size is increased to $10^{0.20}$. At each point of the parameter grid, a 5-fold cross validation is used to evaluate a given model, using the MAE as fitness and discarding models above a predefined sparseness level $\rho$, fixed as the percentage of non-zero weights. Three different values of $\rho$, 30, 50 and 100 (i.e., no restrictions), are considered for all models.
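The search just described can be sketched as follows; `fit_model` is a hypothetical helper standing in for any of the regularized fits above, a linear model is assumed for the validation MAE, and the one-dimensional grid case is shown.

```python
import numpy as np

def grid_search(X, Y, fit_model, lam_grid, rho=0.5, n_folds=5):
    """For each lambda on the grid, run n_folds-fold cross validation with
    MAE as fitness, discarding models whose fraction of non-zero weights
    exceeds rho. `fit_model(X, Y, lam)` must return a weight vector."""
    folds = np.array_split(np.arange(X.shape[0]), n_folds)
    best_lam, best_mae = None, np.inf
    for lam in lam_grid:
        maes = []
        for k in range(n_folds):
            val = folds[k]
            trn = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            W = fit_model(X[trn], Y[trn], lam)
            if np.mean(W != 0) > rho:        # sparseness level exceeded
                maes = None
                break
            maes.append(np.mean(np.abs(X[val] @ W - Y[val])))
        if maes is not None and np.mean(maes) < best_mae:
            best_lam, best_mae = lam, np.mean(maes)
    return best_lam, best_mae

# logarithmic grid from 1e-3 to 1e3 with steps of 10**0.10, as in the paper
lam_grid = 10.0 ** np.arange(-3, 3.0001, 0.10)
```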
The comparison of the different algorithms is summarized in Table 2. A first reference is given by the surface models, for which we recall that the feature dimension is 180; therefore, we only consider the 100% sparsity level. Their performance is given
Table 2. Results and parameters for 6 pressure levels and minimum 30%, 50% and
100% sparseness, and surface data. Models ordered by MAE.
in Subtable 2d, and the best models are GL and LA in that order, although they essentially do not achieve any sparsity enforcement, as their active weights are 100% and 91.7% respectively. Subtables 2a, 2b and 2c give the performance of the multi-level variables. Recall that the feature dimension is now 1,080, making the use of regularized or sparse models mandatory (notice that unregularized linear regression, OLS, performs very badly due to a clear case of over-fitting).
As can be expected, the best results are obtained at the 100% sparsity level, with the best algorithm being GL, which uses about half of the features. If more sparsity is imposed, model performance is only slightly worse, while sparsity greatly increases. At the 50% level, LA is the second best model, with only 14.4% of active weights. At the strictest 30% sparsity level, ENet is the best model, with a sparsity of just 11.7%. Moreover, in all cases the results using pressure level variables are better than the ones obtained using surface variables. As mentioned before, a reason for this is that the 10 m height of surface variables may not be representative of the actual wind farm altitude. In any case, the use of sparse methods such as LA and ENet over pressure layers is justified.
We now turn our attention to the structure identified by the sparse models.
Figure 1a shows the percentage of the total active weights per variable. The non–
sparse algorithms obviously do not perform any kind of variable selection and
the same is true for the group methods, the reason being that they essentially
select all the variables at a given grid point. On the other hand, LA and ENet
favour the V and Vx variables and discard almost completely the geopotential
height. This is reasonable as it has a much smaller correlation with respect
to wind energy production. Figure 1b shows the percentage of the total active
(a) Active weights per variable. (b) Active weights per level.
Fig. 1. Active weight % per variable (left) and level (right) for ρ = 50%
weights per pressure level. Now all the sparse methods perform some kind of
level selection, favouring the highest and lowest layers. The reason for this is
clear, as all levels have high correlations with respect to wind energy while the
extreme levels are the most independent. This effect is particularly strong for
the group models GL and GENet, as they must focus on actually selecting levels
instead of variables. We also point out that sparse methods define some grid
structure as they select points which are mostly located in either the centre of
the grid (closest to the wind farm) or the grid extremes (points least correlated
with the grid centre but still correlated with the wind energy).
Summing up, it is clear that taking different pressure levels into account yields better models than considering only surface variables. Sparse models help in this and, moreover, automatically select the feature structure best suited for modelling.
4 Conclusions
pressure layers. As our results show, sparse models built over several pressure
layers outperform those built using just NWP surface values, even when a strict
degree of sparsity is required. Moreover, sparse models also identify the predic-
tive structure in the NWP features and discriminate among the levels considered,
thus improving our problem understanding.
In any case, stronger models could clearly yield better predictions, which makes it natural to exploit sparse linear regression to select the most relevant features upon which more advanced models can be built. For example, better models can be obtained using standard RLS over features selected by the Lasso.
Moreover, the sparse linear methods also have strong theoretical foundations
that could be brought to bear on feature selection. We are currently studying
these and other related issues.
References
1. Agencia Española de Meteorología (2012), http://www.aemet.es
2. Beck, A., Teboulle, M.: A fast iterative shrinkage–thresholding algorithm for linear
inverse problems. SIAM Journal on Imaging Sciences 2(1), 183–202 (2009)
3. Combettes, P.L., Pesquet, J.C.: Proximal splitting methods in signal processing.
Recherche 49, 1–25 (2009)
4. Hoerl, A.E., Kennard, R.W.: Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1), 55–67 (1970)
5. Kowalski, M., Torrésani, B.: Structured sparsity: from mixed norms to structured
shrinkage. In: Gribonval, R. (ed.) SPARS 2009 – Signal Processing with Adap-
tive Sparse Structured Representations. Inria Rennes – Bretagne Atlantique, Saint
Malo, France (2009)
6. Mosci, S., Rosasco, L., Santoro, M., Verri, A., Villa, S.: Solving structured sparsity
regularization with proximal methods. In: ECML/PKDD (2), Berlin, Heidelberg,
pp. 418–433 (2010)
7. Sotavento (2012), http://www.sotaventogalicia.com
8. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Roy. Statist.
Soc. Ser. B 58(1), 267–288 (1996)
9. Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped
variables. Journal of the Royal Statistical Society – Series B: Statistical Method-
ology 68(1), 49–67 (2006)
10. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. Journal
of the Royal Statistical Society – Series B: Statistical Methodology 67(2), 301–320
(2005)
Diffusion Maps and Local Models
for Wind Power Prediction
1 Introduction
Local models are a natural and attractive option when trying to approach processes with high-variance data, or whose underlying phenomena are known to correspond to quite different settings. However, identifying the appropriate local feature areas may be quite difficult, particularly for high dimensional data that do not lend themselves easily to such a task. Unsupervised clustering methods, such as K-means, appear as an attractive option. However, clustering is often more an art than a technology, and while many methods have been proposed, simple approaches are usually followed in practice, in particular K-means applied assuming a Euclidean distance in the feature space. Besides fixing the number K of clusters, adequate sampling is also an important issue when working with high dimensional data, as samples are then bound to be very sparse. Moreover, the features to be used may not be homogeneous, something probably better handled outside the chosen clustering procedure.
In this paper we will address the above issues in the context of wind energy
prediction. Wind power clearly presents wide, fast changing fluctuations, cer-
tainly at the individual farm level but also when the production of much larger
areas is considered. This is the case of Spain, the world’s fourth biggest producer
of wind power, where wind is currently the third source of electricity. The well
known, sigmoid–like structure of wind turbine power curves clearly shows differ-
ent regimes at low, medium and high wind speeds. Compounded with this are
wind speed frequencies, which follow a Weibull distribution, that is, a stretched exponential in which low wind speeds have large frequencies. While the above does not
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 565–572, 2012.
© Springer-Verlag Berlin Heidelberg 2012
directly apply when a wide area is considered, different regimes also appear there. Wind energy forecasting for large areas also implies high dimensional features, as the predictive variables are the outputs of numerical weather prediction (NWP) models, such as the ECMWF or GFS ones, given on large grids that cover the areas under study. Global models may find it difficult to handle these regimes, and local models are natural alternatives [1,9].
This high dimension suggests preceding clustering with some dimensionality reduction (DR) technique, preferably one that is likely to yield a Euclidean metric for the new features. Diffusion Maps (DM) [5], a novel spectral technique for DR, is particularly suited to these requirements. In fact, there is a natural diffusion metric in the original feature space that corresponds to the Euclidean metric in the embedded space. This means that clustering methods that rely on Euclidean metrics, particularly K-means, should work well on the new features. DM also makes it possible to control to some extent the effects of the underlying data distribution and, moreover, to work with heterogeneous variables. In other words, DM can be a powerful tool for finding informative clusters in high dimensional, heterogeneous data.
Of course, DM is not the only option. Straight K–means clustering can cer-
tainly be used. Moreover, NWP variables for a large area usually show high
correlation among different grid points. This may suggest that variance–based
DR methods such as Principal Component Analysis (PCA) may be a useful
alternative. We shall consider these three options here in order to, first, iden-
tify local clusters and then to construct local models to be compared against a
global one. Many paradigms can be considered for model building but here we
will concentrate on the simplest alternative, ridge regression, i.e., regularized lin-
ear least squares, certainly not the strongest possible method but a good option
to measure the usefulness of local methods against a global one.
The paper is organized as follows. In Sect. 2 we will review DM from a general point of view, as well as its use over heterogeneous data. In Sect. 3 we will consider K-means on DM, PCA and the original features, compare local ridge regression models on these clusters, discuss their effectiveness, and conclude on how to combine local and global models for better predictors. Section 4 ends this paper with a brief discussion and conclusions.
The key assumption in Diffusion Maps (DM) is that the data to be studied lie in a low-dimensional manifold whose geometry can be described through a Markov chain diffusion metric. To capture this intrinsic geometry, the first step is to build a connectivity graph using the sample points $S = \{x_1, \ldots, x_n\}$ as graph nodes and defining a symmetric weight matrix $W_{ij} = w(x_i, x_j)$. The most common way to build this matrix is to use the Gaussian kernel and define $w(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2/\sigma^2\right)$, where $\sigma$ determines the radius of the neighborhoods centered at individual sample points. We start with this matrix towards defining a Markov chain over this graph. We first choose a parameter $\alpha \in [0, 1]$
that is used to control the combined effects of manifold geometry and sample distribution, and define $w^{(\alpha)}(x_i, x_j) = \frac{w(x_i, x_j)}{q(x_i)^\alpha q(x_j)^\alpha}$, where $q(x_i) = \sum_{j=1}^{n} w(x_i, x_j)$ is the degree of the $i$th node of the $W$ matrix. We now define the new $\alpha$-degree at $x_i$ as $g^{(\alpha)}(x_i) = \sum_{j=1}^{n} w^{(\alpha)}(x_i, x_j)$ and arrive at the transition probability $p^{(\alpha)}(x_i, x_j) = \frac{w^{(\alpha)}(x_i, x_j)}{g^{(\alpha)}(x_i)}$. Notice that when $\alpha = 0$, we are essentially defining
the weight matrix typically used in spectral dimensionality reduction [3]. In this
case, the infinitesimal generator $L_0$ of the resulting Markov chain acts on a function $f$ as $L_0(f) = \frac{\Delta(fq)}{q} - \frac{\Delta(q)}{q} f$ [5], with $\Delta$ the manifold's Laplace–Beltrami operator. However, when $\alpha = 1$ the infinitesimal generator $L_1$ verifies $L_1(f) = \Delta f$ and is not influenced by the underlying density $q$ (this will not be the case for $\alpha = 0$ unless $q$ is uniform). We will consider here the case $\alpha = 1$ and write just
p_t(x_i, x_j) if a t–step Markov chain is used. We will denote by P^t the matrix
of transition probabilities in t steps (P^t_{i,j} = p_t(x_i, x_j)).
Let λ_i, ψ_i(x), i = 0, . . . , n − 1, be the eigenvalues and eigenvectors of P,
where we assume 1 = λ_0 ≥ λ_1 ≥ . . . ≥ λ_{n−1}; P^t then has eigenvalues λ_i^t
and the same eigenvectors ψ_i(x). To select for a given t the embedding
dimension d = d(t) we may fix a precision δ and choose
d = max{l : |λ_l^t| > δ |λ_1^t|}. The embedding projection is then
Ψ_t(x) = (λ_1^t ψ_1(x), . . . , λ_d^t ψ_d(x))^τ, with τ the transpose operator.
The previous steps are summarized in Algorithm 1.
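As a concrete illustration of the construction above, the following NumPy sketch builds the α = 1 diffusion embedding (function and parameter names are ours; Algorithm 1 itself is not reproduced here):

```python
import numpy as np

def diffusion_map(X, sigma, t=1, delta=0.1):
    """Sketch of the alpha = 1 diffusion-map embedding described in the text."""
    # Pairwise squared distances and Gaussian kernel weights W.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / sigma ** 2)
    # alpha = 1 normalization removes the influence of the sampling density q.
    q = W.sum(axis=1)
    W1 = W / np.outer(q, q)
    # Row-normalize to obtain the Markov transition matrix P.
    P = W1 / W1.sum(axis=1, keepdims=True)
    # P is similar to a symmetric matrix, so its eigenvalues are real.
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-np.abs(vals))
    vals, vecs = np.real(vals[order]), np.real(vecs[:, order])
    # Keep coordinates l with |lambda_l^t| > delta * |lambda_1^t| (skip trivial psi_0).
    lam_t = vals ** t
    d = max(l for l in range(1, len(vals)) if abs(lam_t[l]) > delta * abs(lam_t[1]))
    return vecs[:, 1:d + 1] * lam_t[1:d + 1]
```

The returned columns are the eigenvalue–eigenvector products λ_l^t ψ_l, so Euclidean distances in this embedding approximate the diffusion distance.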
The Euclidean distance in the embedding,
‖Ψ_t(x) − Ψ_t(z)‖² = Σ_j λ_j^{2t} (ψ_j(x) − ψ_j(z))²,
coincides with the diffusion distance
D_t²(x, z) = ‖p_t(x, ·) − p_t(z, ·)‖²_{L²(1/φ_0)},
where φ_0 is the stationary distribution of the P–Markov process. In other words,
if the diffusion distance Dt approximates the manifold metric, we get the orig-
inal data embedded in a lower dimension space for which Euclidean distance
captures the original local geometry, something very convenient if we want to
apply K–means. Once we have obtained K clusters {C_1, . . . , C_K} over the
embedded features, they can be projected back into clusters {A_1, . . . , A_K} in
the original space S, defined as A_i = {x_j | Ψ_t(x_j) ∈ C_i}.
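Because the rows of the embedding are in one-to-one correspondence with the original samples, projecting the clusters back to S amounts to reusing the label vector. A minimal Lloyd's K–means sketch (our own helper, not the paper's implementation):

```python
import numpy as np

def kmeans_labels(Z, k, iters=100, seed=0):
    """Plain Lloyd's K-means on the embedded features Z (a minimal sketch)."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(iters):
        labels = ((Z[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        new = np.array([Z[labels == i].mean(0) if np.any(labels == i) else centers[i]
                        for i in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

# Rows of Z correspond one-to-one with the original samples, so the
# back-projected clusters are simply A_i = {x_j : labels[j] == i}.
```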
A limitation of the above scheme is that it implicitly assumes the attributes
to be homogeneous; however, real–life datasets are frequently heterogeneous,
568 Á. Fernández Pascual et al.
something that often cannot be handled just by normalizing the data. In [10] a
method is proposed to adapt DM to work with heterogeneous features just by
dealing separately with groups of attributes that are deemed to be homogeneous.
More precisely, assume that we have M such groups; we then split each pattern
x_i into M new, lower dimensional ones x_i^m and build the corresponding sample
sets {S_m}_{m=1}^M. We now apply DM as described before to each S_m, obtaining
M embeddings {Ψ_m}_{m=1}^M that capture the geometry associated to each feature
subset. Now, these Ψm are given by eigenvalue–eigenvector products, with the
eigenvectors being comparable across the embeddings since they have unit norm.
We can make the eigenvalues also comparable if we re–scale them as
λ^new_{m,i} = λ_{m,i} / Σ_j λ_{m,j}.
Thus the union of the normalized features gives a set of homogeneous features
that still represent the intrinsic geometry of our original data and we can simply
apply DM again to this new dataset to get the final lower dimensional embedding.
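A hedged sketch of this combination step, assuming each group's DM spectrum (eigenvalues and unit-norm eigenvector columns) has already been computed:

```python
import numpy as np

def combine_heterogeneous(embeddings):
    """Given per-group DM spectra [(lams_m, psis_m), ...], where psis_m is an
    n x d_m matrix with unit-norm columns, rescale the eigenvalues so the
    groups are comparable and concatenate into one homogeneous feature set."""
    parts = []
    for lams, psis in embeddings:
        lams = np.asarray(lams, float)
        lams_new = lams / lams.sum()      # lambda_new_{m,i} = lambda_{m,i} / sum_j lambda_{m,j}
        parts.append(psis * lams_new)     # eigenvalue-eigenvector products per group
    return np.hstack(parts)               # apply DM once more to this for the final embedding
```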
In summary, DM makes possible low dimensional embeddings of heteroge-
neous data while transforming the original space metric into an Euclidean one.
However, they require proper choices of the parameters σ and δ. Moreover, a
main drawback (as also happens in spectral DR) is the difficulty of applying the
computed DM projection to new, unseen patterns. There are several proposals
for this such as Nyström formulae [4] or Laplacian Pyramids [10], but this is still
an area where further work is needed.
3 Experiments
In this section we will apply K–means clustering based on DM to build local
models for predicting the wind energy production in Spain and compare it with
the results of K–means applied to either the original full dimensional data or
to PCA lower dimensional features. Once clusters are defined, we will use Ridge
Regression (RR) [7] for model building. Recall that RR adds an ℓ_2 regularization
term to Ordinary Least Squares (OLS) regression, so the optimization problem
becomes min_w ‖Xw − y‖²_2 + γ‖w‖²_2. This prevents the over–fitting of plain OLS but
requires a procedure to compute the penalty term γ. While stronger models
could be considered [2,8], our primary interest here is whether DM–based local
models improve on either other local models or global ones. If so, stronger models
should also benefit from this, although we will not consider them in this work.
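The RR fit and a validation-based choice of γ (a grid such as the one used later in the experiments, [10^−2, 10^4] with logarithmic step 0.1) can be sketched as follows; the helper names are ours:

```python
import numpy as np

def ridge_fit(X, y, gamma):
    """Closed-form RR solution: w = (X^T X + gamma I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ y)

def ridge_grid_search(Xtr, ytr, Xval, yval):
    """Pick gamma on a validation set over a logarithmic grid."""
    best_gamma, best_err = None, np.inf
    for log_g in np.arange(-2, 4.01, 0.1):
        w = ridge_fit(Xtr, ytr, 10.0 ** log_g)
        err = np.abs(Xval @ w - yval).mean()   # validation MAE
        if err < best_err:
            best_gamma, best_err = 10.0 ** log_g, err
    return best_gamma, best_err
```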
We will use as inputs NWPs from the European Centre for Medium–Range
Weather Forecasts (ECMWF) [6] and consider five surface variables: wind speed
(V ), its horizontal and vertical components (Vx and Vy ), pressure (p) and temper-
ature (T ), which we normalize component–wise to zero mean and unit variance.
These variables are available over a rectangular 522–point, 0.5◦ resolution grid
that covers the Iberian peninsula. Pattern dimension is thus 2, 610 = 5 × 522.
Two years of data will be considered, the first for training purposes and the
second for testing. Since eight forecasts are given daily, the training sample
size is thus 2,920 = 365 × 8, close to the feature dimension and hence making
regularization mandatory.
Fig. 1. Wind power histograms for the clusters obtained using the 3 approaches
Fig. 2. Average wind versus actual power (top) and average wind versus predicted
power (bottom) for the clusters C1 (left), C2 (center) and C3 (right)
We will use NWPs and wind energy production data y for cluster definition.
Wind power is obviously unknown for the test dataset so we will use as a proxy
the wind power forecast of a global model. We already mentioned the difficulties
associated with the application of DM to test patterns. We sidestep them by
building the DM features and clusters, as well as the plain and PCA clusters,
over the full two year dataset. This confers some advantage to the local models
over the global one, partially compensated by the global model influencing cluster
definition. In any case, and as mentioned before, the computation of DM features
for new patterns is an area of active research.
We consider wind power production and the NWP variables as heterogeneous
and build first DM features separately on the y, V , Vx , Vy , T and p variables.
In all of them we define the graph’s weight matrix using a Gaussian Kernel with
bandwidth σ equal to the dataset diameter.
Table 1. (a) Errors per cluster (top). (b) Global errors (bottom).
We arrived at this value heuristically after visually analyzing the structure of
the resulting embeddings. We also
work with t = 1, i.e., considering the one–step diffusion distance on the
original feature space, and the final embedding dimension was obtained using a
δ = 0.1 precision parameter. Embedding dimensions for the above variables were
1, 6, 3, 5, 1 and 1 respectively, and the final dimension for the DM embedding
is 5. Therefore, we also considered a 2,610 to 5 dimension reduction for PCA.
Finally, the
choice of K is always difficult. We will consider 3 clusters that hopefully capture
high, medium and low ranges of wind power. While initial centroids are ran-
domly chosen in K–means, we found that the DM parameters used lead to very
stable cluster structures that are essentially independent of centroid initializa-
tion. Figure 1 gives the cluster histograms of the local wind power distributions
for each approach. As it can be seen, the 3 DM clusters offer a more clear–cut
structure while the other two methods seem to differentiate less between wind
energy regimes.
Once DM, PCA and original feature clusters are defined, we build a global
model and also three local RR models, one per cluster, that we denote as GM,
LMDM , LMPC and LMOr respectively. Prior to model building we select the
optimal regularization parameters for all the RR models by a grid search for γ
in the interval [10^−2, 10^4], with a logarithmic step of 0.1, and using as validation
set the last 20% patterns of the first year data clusters. As usually done in wind
energy, we measure model performance by the mean absolute error (MAE) and
the relative mean absolute error (RMAE). The MAE is defined as the mean of
the absolute differences between the predictions and the real values. The RMAE
computes the mean of the ratio of the absolute errors over actual wind power. Table 1a
contains local model errors per cluster as well as the cluster errors of the global
model. As we can see, the local models beat the global one in the first, low
wind power cluster C1 but GM beats them in C2 and particularly in the high
wind power cluster C3 . A reason for this can be seen in Fig. 2, which depicts for
the 3 LMDM clusters the relationships between average wind and power (top)
and between average wind and predicted power (bottom). Cluster C3 has the
fewest points but presents several marked outliers; these two facts clearly
penalize the local C3 models. Table 1a also gives values for the standard
deviations of MAE and RMAE, although they are rather conservative (assuming
independence for these errors would lead to dividing the values given by the
square root of the sample size and, hence, to much smaller values).
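The two error measures can be written down directly; the RMAE form below follows the verbal definition above (mean of the absolute error over the actual power):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between predictions and actual wind power."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_pred - y_true))

def rmae(y_true, y_pred):
    """Mean of the absolute errors relative to the actual wind power."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_pred - y_true) / np.abs(y_true))
```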
These facts suggest building predictors combining a local model on the C1 cluster
and the global one on the other two. Table 1b contains the MAE and RMAE
errors of the individual GM, LMDM , LMOr and LMPC , and of the combined
models CMDM;G , CMOr;G and CMPC;G . It shows that there is a clear advantage
of the combined models over the global one and that the gain is largest for the
CMDM;G model. While modest at first sight (a MAE of 3.37% against 3.48% for
GM), such gains may have a large economic impact, as wind energy represents
about 16% of Spain’s electricity demand.
4 Conclusions
Local models are obviously useful in many applied problems, wind energy
forecasting being a clear example. The main obstacle to their construction is
usually how to define the local regions on which the models will be built. A
natural option is K–means clustering, which requires choosing an adequate metric,
something always difficult and more so when we also have to deal with the high
dimensional, heterogeneous features that arise in wide area wind energy
forecasting. In this
work we have applied to this task Diffusion Maps (DM), a novel dimensionality
reduction technique that lends itself naturally to work with heterogeneous data
and that has the very important property that Euclidean metric in the projected
space is naturally related to a diffusion distance on the original features. This
distance is in turn related to a Markov process whose infinitesimal generator is
just the Laplace–Beltrami operator of the underlying manifold. We can expect
that the Euclidean metric in the reduced features captures the original space
metric and, thus, standard K–means on the embedding results in meaningful
clusters for the original features.
We have compared this approach with clusters obtained by straight Euclidean
K–means on the full features and on PCA features with the same dimension as
the DM ones, building local ridge regression models that, in turn, are compared
with a global one. The local models beat the global one over a low wind power
cluster, with DM a clear winner, but the global model performs better on the
other medium and high wind power clusters. This suggests to define a mixed
model, using the DM local model for the low wind power cluster and the global
one for the other two. This model outperforms the others.
We can conclude that DM dimensionality reduction and clustering is an ef-
fective tool for local model building, although further work is needed. In fact,
DM features are derived from a spectral analysis of the sample distance matrix.
As is also the case with spectral dimensionality reduction and clustering, this
makes it costly to assign new, unseen patterns to already defined clusters. Tools
to alleviate this are Nyström formulae or Laplacian Pyramids. We are currently
doing research on this for wind energy and other applied problems.
References
1. Alaı́z, C., Barbero, A., Fernández, A., Dorronsoro, J.: High wind and energy spe-
cific models for global production forecast. In: Proceedings of the European Wind
Energy Conference and Exhibition (EWEC 2009), Marseille, France (March 2009)
2. Barbero, A., López, J., Dorronsoro, J.: Kernel methods for wide area wind power
forecasting. In: Proceedings of the European Wind Energy Conference and Exhi-
bition (EWEC 2008), Brussels, Belgium (April 2008)
3. Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data
representation. Neural Computation 15(6), 1373–1396 (2003)
4. Bengio, Y., Delalleau, O., Roux, N.L., Paiement, J., Vincent, P., Ouimet, M.:
Learning eigenfunctions links spectral embedding and kernel PCA. Neural Compu-
tation 16(10), 2197–2219 (2004)
5. Coifman, R., Lafon, S.: Diffusion maps. Applied and Computational Harmonic
Analysis 21(1), 5–30 (2006)
6. European Centre for Medium–Range Weather Forecasts (2005),
http://www.ecmwf.int/
7. Hoerl, A.E., Kennard, R.W.: Ridge regression: Biased estimation for nonorthogonal
problems. Technometrics 12(12), 55–67 (1970)
8. Monteiro, C., Bessa, R., Miranda, V., Botterud, A., Wang, J., Conzelmann, G.:
Wind power forecasting: State–of–the–art 2009. Tech. rep., INESC Porto and Ar-
gonne National Laboratory (2009)
9. Pinson, P., Nielsen, H., Madsen, H., Nielsen, T.: Local linear regression with adap-
tive orthogonal fitting for the wind power application. Statistics and Comput-
ing 18(1), 59–71 (2009)
10. Rabin, N., Coifman, R.: Heterogeneous datasets representation and learning using
diffusion maps and laplacian pyramids. In: Proceedings of the 12th SIAM Interna-
tional Conference on Data Mining (SDM 2012), Anaheim, California, USA (April
2012)
A Hybrid Model for S&P500 Index Forecasting
1 Introduction
Several linear and nonlinear statistical models have been proposed in the
literature to solve the problem of financial phenomena forecasting [1]. However,
they need a problem specialist to validate their forecasts, which limits the
development of automatic forecasting systems [1]. Alternatively, artificial
neural networks (ANNs) [2, 3] have been applied in an attempt to overcome this
drawback.
However, due to the many complex features frequently present in these
phenomena, such as irregularities, volatility, trends, and noise, a limitation
arises in all these models for financial forecasting, known as the random walk
dilemma (RWD), which has been reported in the literature [2–6]. In this context,
forecasts generated by arbitrary models have a characteristic one-step-ahead
delay with respect to the actual values, so that there is a time phase
distortion in the reconstruction of financial phenomena [2–6]. This behavior has
led some researchers to argue that financial phenomena are unpredictable [4, 5].
In this work we present a hybrid model to overcome the RWD in the financial
forecasting problem. The proposed model is generically called the dilation-
erosion-linear perceptron (DELP) and is composed of morphological operators in
the context of lattice theory together with a linear operator. The proposed
learning process of the DELP employs a gradient-based method built on ideas from
the back-propagation (BP) algorithm, using a systematic approach to overcome
the problem of non-differentiability of morphological operations, based
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 573–581, 2012.
c Springer-Verlag Berlin Heidelberg 2012
574 R. de A. Araújo, A.L.I. Oliveira, and S.R.L. Meira
on ideas from Pessoa and Maragos [7] and Sousa [8]. Also, we have included
in the learning process of the DELP a procedure to overcome the RWD, called
the automatic phase fix procedure (APFP) [2, 3, 6], which is a correction step
geared toward eliminating the time phase distortions that occur in financial
phenomena.
Furthermore, an experimental analysis is conducted with the DELP using
the S&P500 Index. Five metrics are used to assess the forecasting performance,
and an evaluation function (EF) is further used as a global forecasting
performance indicator.
The obtained results show better performance of the proposed model when
compared to the random walk model [4, 5], the time-delay added evolutionary
forecasting (TAEF) method [3] and the morphological-rank-linear time-lag added
evolutionary forecasting (MRLTAEF) method [6]. The last two models, according
to the accurate and precise forecasts recently presented in the literature, show
a marked performance improvement over classical forecasting models.
A time series x is defined as a set of time-ordered observations
x = {xt ∈ R | t = 1, 2, . . . , N }, (1)
where t is the temporal index, which is called time and defines the granularity
of observations of a given phenomenon, and N is the number of observations.
The aim of forecasting techniques applied to a given time series is to provide
a mechanism that allows, with certain accuracy, the forecasting of the future
values of x, given by x_{t+h}, h = 1, 2, . . . , H, where h represents the
forecasting horizon of H steps ahead. These techniques try to identify regular
patterns present in the data set, creating a model capable of generating the
next temporal patterns; in this context, the most relevant factor for accurate
forecasting performance is the correct choice of the past window, or the time
lags, considered for the representation of a given time series.
In a mathematical sense, the relationship involving the historical data of a
time series defines a d-dimensional phase space, where d is the minimum
dimension capable of representing such a relationship. Therefore, a
d-dimensional phase space can be built so that it is possible to unfold its
corresponding time series. Takens [9] proved that if d is sufficiently large,
such a phase space is homeomorphic to the phase space that generated the series.
Takens' Theorem [9] is the theoretical justification for phase space
reconstruction using time lags.
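The lag-embedding construction described above can be sketched as follows (function and parameter names are ours):

```python
import numpy as np

def lag_embedding(x, d, h=1):
    """Build (pattern, target) pairs from a series x using d time lags and a
    forecasting horizon of h steps, per the phase-space reconstruction above."""
    x = np.asarray(x, float)
    n = len(x)
    # Each pattern is a window of d consecutive lags; the target lies h steps ahead.
    patterns = np.array([x[t:t + d] for t in range(n - d - h + 1)])
    targets = x[d + h - 1:]
    return patterns, targets
```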
x_t = x_{t−1} + z_t, (2)

where z_t is a noise term. The DELP output y for an input pattern x is given by

y = λα + (1 − λ)β, λ ∈ [0, 1], (3)

where

β = x · p^T = x_1 p_1 + x_2 p_2 + . . . + x_n p_n, (4)
and
α = θϕ + (1 − θ)ω, θ ∈ [0, 1], (5)
in which

ϕ = δ_a(x) = ⋁_{i=1}^{n} (x_i + a_i), (6)

and

ω = ε_b(x) = ⋀_{i=1}^{n} (x_i + b_i), (7)
where term n denotes the dimensionality of the input signal (x), terms λ, θ ∈ R
and a, b, p ∈ Rn . The vector p ∈ Rn represents the linear operator weights.
The term β represents the output of the linear operator. The term α represents
the linear combination of the morphological operators of dilation and erosion
(the mixture term is defined by θ). The terms ϕ and ω represent the output
of morphological operators of dilation and erosion, respectively. The vectors a
and b represent the structuring elements (weights) of the dilation (δa (x)) and
erosion (ε_b(x)) operators employed in the nonlinear module of the DELP.
The terms ⋁ and ⋀ represent the supremum and the infimum operations. Note
that the output y is given by a linear combination of the linear operator and
another linear combination of the morphological operators of dilation and
erosion (the mixture term is defined by λ). The main differences between "+′"
and "+" are given by the following rules:
(−∞) +′ (+∞) = (+∞) +′ (−∞) = −∞, (8)

and

(−∞) + (+∞) = (+∞) + (−∞) = +∞. (9)
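Putting Eqs. (4)–(7) together gives the forward-pass sketch below. The final combination y = λα + (1 − λ)β is consistent with the gradients ∂y/∂λ = α − β, ∂y/∂β = 1 − λ and ∂y/∂α = λ presented later; function and variable names are ours:

```python
import numpy as np

def delp_forward(x, a, b, p, lam, theta):
    """Forward pass of the DELP: linear branch mixed with a dilation-erosion branch."""
    beta = x @ p                                 # linear operator (Eq. 4)
    phi = np.max(x + a)                          # dilation: supremum of x_i + a_i (Eq. 6)
    omega = np.min(x + b)                        # erosion: infimum of x_i + b_i (Eq. 7)
    alpha = theta * phi + (1 - theta) * omega    # morphological mixture (Eq. 5)
    return lam * alpha + (1 - lam) * beta        # output mixture, weight lam
```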
The DELP weight vector is

w = (a, b, p, λ, θ), (10)

and the learning process minimizes the cost function

J(w) = Σ_{m=1}^{M} e²(m), (11)
in which M represents the input patterns amount in the learning process and
e(m) represents the instantaneous error for the m-th input pattern, and given
by
e(m) = t(m) − y(m), (12)
where t(m) and y(m) are the target and the model output, respectively.
The learning process updates the weight vector w based on the gradient steep-
est descent method. The adjustment of vector w for the m-th input training
pattern is given by the following iterative formula:

w(i + 1) = w(i) − μ∇J(w(i)), (13)

where μ > 0 and i ∈ {1, 2, . . .}. The term ∇J(w) is the gradient, which is given by

∇J(w) = ∂J/∂w = (∂J/∂a, ∂J/∂b, ∂J/∂p, ∂J/∂λ, ∂J/∂θ), (14)
in which

∂J/∂w = −2e(m) (∂y/∂a, ∂y/∂b, ∂y/∂p, ∂y/∂λ, ∂y/∂θ). (15)

Note that the existence of the gradient of J with respect to w depends on the
existence of the gradients ∂y/∂a, ∂y/∂b, ∂y/∂p, ∂y/∂λ and ∂y/∂θ. Next, we
present the formulas to calculate them.
The term ∂y/∂λ is given by

∂y/∂λ = α − β. (16)
The term ∂y/∂p is given by

∂y/∂p = (∂y/∂β)(∂β/∂p), (17)

in which

∂y/∂β = 1 − λ, (18)

and

∂β/∂p = x, (19)
where x represents the input signal (m-th input training pattern).
The term ∂y/∂θ is given by

∂y/∂θ = (∂y/∂α)(∂α/∂θ), (20)

in which

∂y/∂α = λ, (21)

and

∂α/∂θ = ϕ − ω. (22)
The terms ∂y/∂a and ∂y/∂b are estimated using the concept of the smoothed rank
indicator vector [7, 8] (because the dilation and erosion operators can be seen
as particular cases of the rank function), where we choose the smoothed unit
sample function Q_σ(x) = [q_σ(x_1), q_σ(x_2), . . . , q_σ(x_n)], with scale
factor σ. Then

∂y/∂a = (∂y/∂α)(∂α/∂ϕ)(∂ϕ/∂a) = λ (∂α/∂ϕ)(∂ϕ/∂a), (24)
in which

∂α/∂ϕ = θ, (25)

and

∂ϕ/∂a = Q_σ(ϕ·1 − (x + a)) / (Q_σ(ϕ·1 − (x + a)) · 1^T). (26)
In the same way, the term ∂y/∂b is given by

∂y/∂b = (∂y/∂α)(∂α/∂ω)(∂ω/∂b) = λ (∂α/∂ω)(∂ω/∂b), (27)

in which

∂α/∂ω = 1 − θ, (28)

and

∂ω/∂b = Q_σ(ω·1 − (x + b)) / (Q_σ(ω·1 − (x + b)) · 1^T). (29)
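The smoothed rank-indicator gradient of Eq. (26) can be sketched as follows. The concrete choice of q_σ (a Gaussian bump) is our assumption for illustration; the exact q_σ is defined in [7, 8]:

```python
import numpy as np

def q_sigma(u, sigma):
    """Smoothed unit sample function (assumed Gaussian bump; the exact q_sigma
    of the paper is defined in the cited references)."""
    return np.exp(-(u / sigma) ** 2)

def dphi_da(x, a, sigma):
    """Smoothed rank-indicator gradient of the dilation phi = max(x + a) with
    respect to a (Eq. 26): weights concentrate on the maximal entries."""
    phi = np.max(x + a)
    q = q_sigma(phi - (x + a), sigma)   # near 1 where x_i + a_i attains the max
    return q / q.sum()                  # normalized to sum to one
```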
Besides, in order to automatically adjust time phase distortions in financial
time series representation, we have included an automatic phase fix procedure
(APFP) [3, 6] in the proposed learning process of the DELP model. Figure 1
presents the APFP.
After preliminary experiments to determine the best learning rate (μ) and the
scale factor (σ), we use μ = 0.01 and σ = 1.5. It is worth mentioning that three
stop conditions are used in the learning process [10]: i) a maximum epoch number
equal to 10^4; ii) a decrease in the training error (Pt) of the cost function
equal to 10^−6; iii) an increase in the validation error or generalization loss
(Gl) of the cost function equal to 5%.
In order to establish a performance study, results with the random walk
(RW) model [4, 5], which represents the results generated by classical forecast-
ing models, and with the time-delay added evolutionary forecasting (TAEF)
method [3] and the morphological-rank-linear time-lag added evolutionary fore-
casting (MRLTAEF) method [6] are employed in our comparative analysis, where
we investigate the same time series under the same conditions. Additionally, we
have used five well-known evaluation metrics formally defined in [3, 6] to assess
the forecasting performance: mean square error (MSE), mean absolute percent-
age error (MAPE), u of theil statistic (UTS), prediction of change in direction
(POCID) and average relative variance (ARV). Also, we use an evaluation func-
tion (EF) defined in [6] to serve as a global forecasting performance indicator.
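Two of the metrics can be sketched as follows; these are the standard POCID and Theil's U definitions, which may differ in detail from the formal versions in [3, 6]:

```python
import numpy as np

def pocid(y, yhat):
    """Prediction of change in direction (%): fraction of steps where the
    predicted and actual series move in the same direction."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return 100.0 * np.mean(np.diff(y) * np.diff(yhat) > 0)

def uts(y, yhat):
    """Theil's U statistic: squared errors relative to the naive random-walk
    forecast (UTS = 1 means no better than predicting y_t = y_{t-1})."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return np.sum((yhat[1:] - y[1:]) ** 2) / np.sum((y[1:] - y[:-1]) ** 2)
```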
Table 1. Obtained results with the RW, TAEF, MRLTAEF and DELP models for the
S&P500 series (test set)

Models    MSE          MAPE         UTS          ARV          POCID    EF
RW        5.5775e-004  2.2676e-002  1.0000e-000  5.4492e-002  55.65    26.8336
TAEF      4.8706e-004  1.4576e-002  3.6437e-001  1.3946e-002  92.93    66.6935
MRLTAEF   1.3429e-004  1.2904e-002  1.0046e-001  3.8450e-003  95.96    85.8818
DELP      1.3217e-006  1.2819e-003  2.3614e-003  1.2965e-004  100.00   99.6248
According to Table 1, we can see that in all experiments the POCID metric is
greater than 50%, indicating that the DELP model performs much better than a
"coin-tossing" experiment. The obtained UTS value (2.36e-003) indicates that
the DELP model was able to overcome the random walk dilemma. Note that the
MAPE value (1.28e-003) is very small, that is, without high percentage
deviations. According to the ARV value (1.30e-004), the proposed model performs
much better than a naive forecasting model. Also, we can verify a small MSE
value (1.32e-006), which means that the forecasts are very close to the real
values. The EF metric
Fig. 2. Forecasting results of S&P500 series (last twenty points of the test set): actual
values (solid line) and predicted values (dashed line)
value (99.6) shows that the DELP has good global forecasting performance.
Besides, we can see that the proposed DELP model overcame, for all evaluation
metrics and for the evaluation function, the RW, TAEF and MRLTAEF models in
this work. Finally, we present in Figure 2 a comparison between real (solid
line) and predicted (dashed line) values generated by the DELP model for the
last twenty points of the S&P500 series test set. Note that the predicted values
are very close to the real values of the S&P500 series, and the one-step delay
in the forecast values did not occur.
5 Conclusion
In this work we presented a morphological-linear model to overcome the random
walk dilemma in the financial forecasting problem. The proposed model was
generically called the dilation-erosion-linear perceptron (DELP) and consists
of nonlinear morphological operators in the context of lattice theory and a
linear operator. We also presented a learning process for the DELP employing a
gradient-based method built on ideas from the back-propagation (BP) algorithm,
and we included in the learning process of the DELP a procedure to overcome the
RWD, a correction step geared toward eliminating the time phase distortions that
occur in financial phenomena. The performance of the proposed DELP model with
respect to the random walk (RW), the time-delay added evolutionary forecasting
(TAEF) and the morphological-rank-linear time-lag added evolutionary forecasting
(MRLTAEF) models was assessed in terms of five well-known performance measures
using the S&P500 Index series. In addition, an evaluation function served as a
global indicator of the quality of the solutions achieved by the investigated
models.
The experimental results demonstrated a consistently better performance of
the proposed DELP model, with which we succeeded in overcoming the random
walk dilemma in a particular financial forecasting problem. It is possible to
verify that our forecasts do not have the one-step delay with respect to the
real time series
values. Further studies must be developed to better formalize and explain the
properties of the DELP model and to determine its possible limitations on other
time series with components such as trends, seasonalities, impulses, steps and
other non-linearities. Further studies, in terms of risk and financial return,
must also be done in order to determine the additional economic benefits, for an
investor, of the use of the DELP model in stock market applications. Finally,
a theoretical study of the complexity of the DELP model must be done in order
to establish a complete cost-performance evaluation of the DELP.
References
1. Clements, M.P., Franses, P.H., Swanson, N.R.: Forecasting economic and financial
time-series with non-linear models. International Journal of Forecasting 20, 169–
183 (2004)
2. de A. Araújo, R.: Swarm-based hybrid intelligent forecasting method for financial
time series prediction. Learning and Nonlinear Models 5(2), 137–154 (2007)
3. Ferreira, T.A.E., Vasconcelos, G.C., Adeodato, P.J.L.: A new intelligent system
methodology for time series forecasting with artificial neural networks. Neural Pro-
cessing Letters 28, 113–129 (2008)
4. Sitte, R., Sitte, J.: Neural networks approach to the random walk dilemma of
financial time series. Applied Intelligence 16(3), 163–171 (2002)
5. Malkiel, B.G.: A Random Walk Down Wall Street, Completely Revised and Up-
dated Edition. W. W. Norton & Company (April 2003)
6. de A. Araújo, R., Ferreira, T.A.E.: An intelligent hybrid morphological-rank-linear
method for financial time series prediction. Neurocomputing 72(10-12), 2507–2524
(2009)
7. Pessoa, L.F.C., Maragos, P.: Neural networks with hybrid morphological rank linear
nodes: a unifying framework with applications to handwritten character recogni-
tion. Pattern Recognition 33, 945–960 (2000)
8. Sousa, R.P., Carvalho, J.M., Assis, F.M., Pessoa, L.F.C.: Designing translation
invariant operations via neural network training. In: Proc. of the IEEE Intl. Con-
ference on Image Processing, Vancouver, Canada (2000)
9. Takens, F.: Detecting strange attractor in turbulence. In: Dold, A., Eckmann, B.
(eds.) Dynamical Systems and Turbulence. Lecture Notes in Mathematics, vol. 898,
pp. 366–381. Springer, New York (1980)
10. Prechelt, L.: Proben1: A set of neural network benchmark problems and bench-
marking rules. Technical Report 21/94 (1994)